Data storage is a big deal. Data companies are in the news a lot lately, especially as organizations try to get the most value out of big data. For the layperson, data storage usually means a traditional database. But for big data, companies use data warehouses and data lakes.
Data lakes are often compared to data warehouses—but they shouldn’t be. Data lakes and data warehouses are very different, from the structure and processing all the way to who uses them and why. In this article, we’ll:
- Define databases, warehouses, and lakes
- Summarize the big differences
- Offer a caution about using data lakes
- Explore the future of data storage
- And more
Defining database, warehouse, and lake
Let’s start with the concepts, and we’ll use an expert analogy to draw out the differences.
What’s a database?
A database is a storage location that houses structured data. We usually think of a database on a computer—holding data, easily accessible in a number of ways. Arguably, you could consider your smartphone a database on its own, thanks to all the data it stores about you.
For all organizations, the use cases for databases include:
- Creating reports for financial and other data
- Analyzing relatively small datasets
- Automating business processes
- Auditing data entry
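To make the first use case concrete, here is a minimal sketch of a small financial report built on a database. It uses Python's built-in sqlite3 module with an in-memory database; the invoices table, its columns, and the totals_by_department helper are hypothetical names chosen for illustration, not part of any particular product.

```python
import sqlite3

# In-memory database; the "invoices" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (department TEXT NOT NULL, amount REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?)",
    [("sales", 1200.0), ("sales", 800.0), ("support", 450.0)],
)


def totals_by_department(connection):
    """Return (department, total amount) pairs, sorted by department."""
    cursor = connection.execute(
        "SELECT department, SUM(amount) FROM invoices "
        "GROUP BY department ORDER BY department"
    )
    return cursor.fetchall()


report = totals_by_department(conn)
for department, total in report:
    print(f"{department}: {total:.2f}")
```

Because the data is already structured into named columns, the report is a single query. That is the kind of small, well-defined analysis a database handles well.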
Popular databases are:
- Apache Cassandra
(Learn more about the key difference in databases: SQL vs NoSQL.)
What’s a data warehouse?
The next step up from a database is a data warehouse. Data warehouses are large storage locations for data that you accumulate from a wide range of sources. For decades, the foundation for business intelligence and data discovery/storage rested on data warehouses. Their specific, static structures dictate what data analysis you can perform.
Data warehouses are popular with mid- and large-size businesses as a way of sharing data and content across team- or department-siloed databases. Data warehouses help organizations become more efficient. Organizations that use data warehouses often do so to guide management decisions: all those “data-driven” decisions you always hear about.
Popular companies that offer data warehouses include:
What’s a data lake?
A data lake is a large storage repository that holds a huge amount of raw data in its original format until you need it. Data lakes address the biggest limitation of data warehouses: their lack of flexibility.
As we’ll see below, the use cases for data lakes are generally limited to data science research and testing, so the primary users of data lakes are data scientists and engineers. For a company that actually builds data warehouses, for instance, the data lake is a place to dump and temporarily store all the data until the data warehouse is up and running. Small and medium-sized organizations likely have little to no reason to use a data lake.
Popular data lake companies are:
- Amazon S3
Illustrating the differences
Lee Easton, president of data-as-a-service provider AeroVision.io, recommends a tool analogy for understanding the differences. In this analogy, your data are the tools you can use.
Imagine a tool shed in your backyard. You store some tools (data) in a toolbox or on (fairly) organized shelves. This specific, accessible, organized tool storage is your database. The tool shed, where all this is stored, is your data warehouse. You might have lots (and lots!) of toolboxes in the shed. Some toolboxes might be yours, but you could store toolboxes of your friends or neighbors, as long as your shed is big enough. Though you’re storing their tools, your neighbors still keep them organized in their own toolboxes.
But what if your friends aren’t using toolboxes to store all their tools? They’ve just dumped them in there, unorganized, with no indication of what some of the tools are even for. This is your data lake.
In a data lake, the data is raw and unorganized, likely unstructured. Any raw data from the data lake that hasn’t been organized into shelves (databases) or an organized system (data warehouses) is barely even a tool—in raw form, that data isn’t useful.
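One way to picture this is that structure gets applied only when someone reads the data. The sketch below assumes a toy lake represented as a dictionary of object names and untouched bytes; the object names, the order fields, and the to_rows helper are all hypothetical and stand in for whatever a real pipeline would use.

```python
import csv
import io
import json

# Hypothetical raw objects as they might sit in a lake: name -> untouched bytes.
raw_objects = {
    "2024/orders.csv": b"order_id,total\n1,19.99\n2,5.00\n",
    "2024/order-3.json": b'{"order_id": 3, "total": 42.5}',
    "2024/notes.txt": b"call supplier back on Monday",
}


def to_rows(name, payload):
    """Apply structure at read time; return [] for formats we cannot organize."""
    text = payload.decode("utf-8")
    if name.endswith(".csv"):
        return [
            {"order_id": int(row["order_id"]), "total": float(row["total"])}
            for row in csv.DictReader(io.StringIO(text))
        ]
    if name.endswith(".json"):
        record = json.loads(text)
        return [
            {"order_id": int(record["order_id"]), "total": float(record["total"])}
        ]
    return []  # free text stays in the lake, but yields no usable rows


orders = [
    row
    for name, payload in raw_objects.items()
    for row in to_rows(name, payload)
]
```

The CSV and JSON objects become usable order rows only after this step, while the free-text note contributes nothing until someone decides how to organize it.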
Comparing data storage
Now that we’ve got the concepts down, let’s look at the differences across databases, warehouses, and data lakes in six key areas.
Databases and data warehouses can only store data that has been structured. A data lake, on the other hand, does not restrict data the way a data warehouse or a database does. It stores all types of data: structured, semi-structured, or unstructured.
All three data storage locations can handle hot and cold data, but cold data is usually best suited to data lakes, where latency isn’t an issue. (More on latency below.)
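The contrast can be sketched in a few lines of Python. This is an illustration under simple assumptions, not how any specific product works: a sqlite3 table with NOT NULL columns stands in for a structured store, and a temporary directory stands in for a lake. The table, column, and file names are hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Structured store: the schema is fixed before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (sensor_id TEXT NOT NULL, celsius REAL NOT NULL)"
)
conn.execute("INSERT INTO readings VALUES (?, ?)", ("s1", 21.5))

# A record missing a required field is rejected at write time.
try:
    conn.execute("INSERT INTO readings VALUES (?, ?)", ("s2", None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Lake-style store: a directory that accepts any bytes, in any format.
lake = Path(tempfile.mkdtemp())
(lake / "readings.csv").write_text("sensor_id,celsius\ns1,21.5\n")
(lake / "event.json").write_text(json.dumps({"sensor": "s2", "note": "offline"}))
(lake / "photo.bin").write_bytes(b"\x89PNG...")  # placeholder bytes, not a real image

stored = sorted(path.name for path in lake.iterdir())
row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

The table refuses the incomplete record, while the directory keeps the CSV, the JSON document, and the binary blob side by side without asking what any of them contain.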