A Field Guide to Data Bodies of Water

From data lakes to data streams, the world of big data is awash in water metaphors.

Eric Miller / Rackspace Technology November 24, 2020

Contributor: Traey Hatch

From data lakes to data streams, the world of big data is awash in water metaphors. For the most part, the comparison makes sense. Like water, data is a resource that can be stored in static reservoirs or allowed to flow from place to place. Data can trigger action, just like a river turns a mill wheel or spins a turbine in a dam. And unexplored data can hide secrets like an ocean hides old shipwrecks in its depths.

But like any metaphor, this one can be taken a little too far. In recent years, the number of water-related data terms has become overwhelming. Many of us have heard of data lakes by now, but what about “data lakehouses” or “data ponds”? It’s hard to tell which of these terms refer to something substantive and which are mirages.

I wasn’t sure, either, so I dove into (pun intended) eight common data “bodies of water.” Here’s my take on which terms are worth keeping in your personal lexicon and which you should throw out with the bathwater.

Data Lake

This is the term that started it all. A core component of data infrastructure at most organizations, a data lake is a vast repository of raw or lightly processed data. The lake may exist purely for storage, or it may include a computational layer capable of performing analysis on the data it contains (see the “Data Lakehouse” entry below).

Either way, the lake metaphor is apt. A data lake’s nearly infinite storage capacity means it can absorb a constant flow of data without filling up or overflowing, just like a real lake fed by a river. (Okay, real lakes do overflow sometimes, but we don’t have to take the metaphor that seriously.)

VERDICT: Remember it

Data Lakehouse

When data lakes were relatively new, they were used for storage exclusively. To perform analysis, you had to copy the relevant data to a separate structure called a data warehouse, which usually ran on specialized hardware. More recently, technology has developed to the point that it’s become possible to search and aggregate data for analysis directly in the lake, using a managed service or transient arrangement instead of dedicated hardware. This “data lakehouse” runs the same analysis workloads you used to run in a warehouse, but sits right on top of your data lake, eliminating the need to copy and transfer data.

While this does reflect a shift in methodology, ultimately “data lakehouse” is a marketing term. Adding a distributed computational environment doesn’t really change what the data lake is; it just means there are new standards and software to access those datasets.

VERDICT: Forget it

Data Swamp

This is what happens when a data lake goes wrong, usually through inadequate data governance or a lack of commitment to processes for cleaning data regularly and consistently. Data in a data swamp may lack metadata, making it hard to organize and search. Or it may contain vast stores of completely irrelevant data that someone collected without having a real plan to do anything with it. A swamp can be cleaned up and turned into a liveable lake, but it takes some investment.

VERDICT: Remember it, and try to avoid it!

Data Stream

These days we use the term “streaming” so much, it’s easy to forget it’s a water metaphor, too. A data stream is a continuous flow of data with no beginning or end. While the term is most often used to describe flows of raw data, such as clickstream data from a digital property or sensor data from IoT devices, cleaned and processed data can be transmitted in a stream, too. Unlike static data sitting in your data lake, streaming data must be processed or stored sequentially, record by record as it comes in.
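To make “record by record” concrete, here is a minimal Python sketch. It assumes a made-up sensor feed (the sensor_readings function and its values are hypothetical stand-ins, not a real API) and keeps a running average without ever holding the whole stream in memory.

```python
from typing import Iterable, Iterator


def sensor_readings() -> Iterator[dict]:
    """Hypothetical stand-in for an unbounded source such as an IoT feed."""
    samples = [21.0, 21.5, 23.0, 22.5]  # illustrative values only
    for seq, temp in enumerate(samples):
        yield {"seq": seq, "temp_c": temp}


def running_average(stream: Iterable[dict]) -> Iterator[float]:
    """Consume records one at a time, keeping only a count and a sum."""
    count = 0
    total = 0.0
    for record in stream:
        count += 1
        total += record["temp_c"]
        # Emit an updated result after every record, not after the stream ends.
        yield total / count


# A real stream never ends; this short sample lets the loop finish.
averages = list(running_average(sensor_readings()))
```

The point of the sketch is the shape of the loop: each record updates a small piece of state and produces a result right away, which is the main difference from querying a static table in a lake.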

VERDICT: Remember it

Data River

This isn’t a very common term — yet — but some data experts do argue that a river is a better metaphor for modern data storage than a lake. A lake is generally static, whereas the flow of real-time data through a modern company is dynamic, triggering various actions as it flows. But given that we already have the term data stream to describe data in motion, this new term is largely unnecessary.

VERDICT: Forget it

Data Puddles and Ponds

There’s some disagreement over what you call a pool of data that’s smaller or more specialized than your general-purpose data lake. Per O’Reilly, data puddles are built with big data technology but intended for a specialized use case or one team, while a data pond is essentially a disorganized data lake, created by either pooling together several data puddles or by offloading data from a data warehouse onto a new platform. But some use the term “data pond” less pejoratively, to refer to a smaller, more manageable dataset or a pool of data that’s set off from the rest of the lake due to privacy, governance or other concerns.

After certain points, attempts to keep up the water metaphor dry up. This is one of those points. There’s no need to call an Excel spreadsheet that’s not integrated with your data lake a pond, a puddle or anything else — especially when no one agrees on the exact terminology.

VERDICT: Forget it

Delta Lake

Not to be limited to general terms for data structures, the data-as-water metaphor seeps into proper names for data tools and solutions as well. Created by Databricks and donated to the Linux Foundation, Delta Lake is an open source project created to re-engineer how a data lake works. Instead of writing data in an immutable fashion, Delta Lake allows you to update and delete single records in your lake, as well as offering some additional benefits.
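For the curious, here is a toy Python sketch of the general idea behind record-level updates on top of immutable files. It is not Delta Lake's API. Real Delta Lake keeps Parquet data files alongside a transaction log, while every name below (files, log, write_file, update_record, delete_record) is hypothetical and exists only for illustration.

```python
# Toy model only: data files are never changed in place; the log decides
# which files currently make up the table.
files = {}  # file name -> tuple of records, immutable once written
log = []    # ordered commits, each a list of ("add" | "remove", file name)


def write_file(name, records, removes=()):
    """Write a new immutable file and commit it, optionally retiring old files."""
    files[name] = tuple(records)
    log.append([("add", name)] + [("remove", old) for old in removes])


def live_records():
    """Replay the log to find the files, and so the records, in the table now."""
    active = []
    for commit in log:
        for action, name in commit:
            if action == "add":
                active.append(name)
            else:
                active.remove(name)
    return [record for name in active for record in files[name]]


def update_record(old_file, new_file, record_id, **changes):
    """Update one record by writing a changed copy of the file that holds it."""
    rewritten = [
        {**r, **changes} if r["id"] == record_id else r
        for r in files[old_file]
    ]
    write_file(new_file, rewritten, removes=[old_file])


def delete_record(old_file, new_file, record_id):
    """Delete one record by writing a copy of the file without it."""
    kept = [r for r in files[old_file] if r["id"] != record_id]
    write_file(new_file, kept, removes=[old_file])


write_file("part-0", [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Columbus"}])
update_record("part-0", "part-1", 2, city="New Albany")
delete_record("part-1", "part-2", 1)
table = live_records()
```

After these three commits the table holds a single record, while the original part-0 file is still sitting untouched in storage. In this toy model, the old files plus the log are what would make it possible to look back at earlier versions of the table.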

VERDICT: Remember it

Data Glaciers, Icebergs and Other Chunks of Frozen Water

Many tech companies put their own twist on the water metaphor by naming their products after types of ice. For example, Snowflake is a cloud data warehouse, Apache Iceberg is an open table format for large analytic datasets (similar to Delta Lake), and Amazon S3 Glacier is a storage class for long-term cold data storage (get it?). Unlike some of the other terms on this list, these names are quite clever, and the products they describe are actually useful, too.

VERDICT: Remember it

The Lifespan of a Metaphor

As far as we’ve stretched the data-as-water metaphor already, it may still have further to go. There are already a few terms I intentionally left out, such as “data droplets.” There are many water-related words that haven’t been absorbed into the tech lexicon yet, and plenty of data-related phenomena that still need naming. In a few years, we could all be talking about “data waterfalls” or some other new buzzword.

But that’s in the future. For now, the above list is a comprehensive accounting of all the data bodies of water you need to know — and a few that you don’t — to navigate the tricky waters of data-related conversations.

About the Authors

VP, Private Cloud Solutions

Eric Miller

An accomplished tech leader with 20 years of proven success in enterprise IT, Eric is a strong advocate of cloud native architectural patterns, passionate about Machine Learning, IoT, Serverless, and all things automation in the cloud. Eric has led several AWS and solutions architecture initiatives, including AWS Well Architected Framework (WAF) Assessment Partner Program, Amazon EC2 for Windows Server AWS Service Delivery Program, and a wide range of AWS rewrites for multi-billion dollar organizations. Prior to joining Rackspace, Eric was the Vice President of AWS Customer Solutions at Onica, which was acquired by Rackspace in 2019. Before working with Onica, Eric held several technology leadership positions at School Pointe, Inc., Neudesic, m2 Consultants, ARGUS International, Inc., Apex Mortgage Services LLC, and TechSkills. Eric lives in New Albany, Ohio with his wife and family. He holds a Bachelor of Science in Information Technology and Information Systems Security from the University of Phoenix.
