Beyond Data Lakes: The Total Integration Revolution

Perhaps you can remember the debate over VHS versus Betamax. It raged in the 1980s as consumers desperately advocated the type of machine they had just purchased. And the winner was: digital! First, the DVD and now Netflix! The point being: revolutionary new technologies always win out in the end and render debate over evolutionary technology moot.

Today, I think there’s a strong argument to be made that Data Lakes may be the Betamax of our time. Of course, this makes Data Warehouses the VHS. Don’t get me wrong. Betamax and VHS were both great technologies for their time. They allowed us to watch movies at home where and when we chose. But you had to rewind the tapes! You had to run to the video store in the middle of winter to rent a new movie! Ugh! If anybody thinks those times were better, then I have some stock in Blockbuster Video stores I’d like to sell you.

I will argue that Data Warehouses and Data Lakes have seen their best days as evolutionary technologies and that a revolutionary technology is looming that could replace them altogether. This is controversial territory where clear answers are not easy to find. There is a lot of media noise and vendor propaganda and, as always, a lot of money to be made selling IT, so it’s a bit of a mystery where the truth lies. To solve it, I’ll use classic detective techniques like ‘follow the money’ and ‘read between the lines’, and then look beyond Data Lakes to see what’s next. But first, some background.

A Very Brief History of Data Warehouses and Data Lakes

The story begins with Decision Support, the old trendy term that is being replaced by the new trendy term Business Intelligence (BI) and the even newer and trendier Analytics. Regardless of the term, the idea was that companies needed to make better decisions based on the data stored in their various operational systems and databases. So two guys from IBM came up with the idea of the Data Warehouse in the late 1980s. Their concept was to consolidate data from the various systems and databases into a single database, the Data Warehouse, which would serve as the source for analytics and reporting, that is, for delivering decision support.

It was a great idea and it worked! It was the first real step towards data integration because it solved the problem of isolated and heterogeneous data stores, commonly referred to as data silos, by providing a single database for reporting and analysis. Some work was involved, though. You had to understand the data, then architect a new data model for the Data Warehouse, and then Extract, Transform and Load (ETL) the data, as illustrated below:

[Figure: Data Warehouse]

I’ve used a barbed wire fence to illustrate the process of creating a Data Warehouse because it really was (and still is) quite cumbersome, costly and time-consuming, often stretching out for months or even years. Duplicating data, staging data, integration layers, replication layers, temporary databases… Argh. This may have been cool at IBM back in the 1980s, but it’s not so cool anymore because all of this costs big bucks to implement.
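To make the Extract, Transform and Load steps a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the two source systems (a CRM and a billing system), their field names, the tiny warehouse data model, and the use of an in-memory SQLite database as the stand-in warehouse. A real project involves far more plumbing, which is exactly the point of the barbed wire.

```python
import sqlite3

# Hypothetical rows from two operational systems (data silos) that use
# different field names and formats for the same customers.
CRM_ROWS = [
    {"cust_id": "C1", "full_name": "Ada Lovelace", "country": "uk"},
    {"cust_id": "C2", "full_name": "Alan Turing", "country": "uk"},
]
BILLING_ROWS = [
    {"customer": "C1", "amount_cents": 12500},
    {"customer": "C1", "amount_cents": 7500},
    {"customer": "C2", "amount_cents": 3000},
]


def extract():
    """Extract: pull the raw rows out of each source system."""
    return list(CRM_ROWS), list(BILLING_ROWS)


def transform(crm_rows, billing_rows):
    """Transform: map both sources onto the warehouse's own data model."""
    customers = [
        (r["cust_id"], r["full_name"], r["country"].upper()) for r in crm_rows
    ]
    invoices = [(r["customer"], r["amount_cents"] / 100) for r in billing_rows]
    return customers, invoices


def load(conn, customers, invoices):
    """Load: write the transformed rows into the warehouse tables."""
    conn.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, name TEXT, country TEXT)")
    conn.execute("CREATE TABLE invoice (customer_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", customers)
    conn.executemany("INSERT INTO invoice VALUES (?, ?)", invoices)
    conn.commit()


def build_warehouse():
    """Run the three steps against an in-memory stand-in warehouse."""
    conn = sqlite3.connect(":memory:")
    load(conn, *transform(*extract()))
    return conn


def revenue_by_customer(conn):
    """A decision-support style report that spans both former silos."""
    return conn.execute(
        "SELECT c.name, SUM(i.amount) FROM customer c "
        "JOIN invoice i ON i.customer_id = c.id "
        "GROUP BY c.id, c.name ORDER BY c.name"
    ).fetchall()
```

Calling `revenue_by_customer(build_warehouse())` answers a question that neither source system could answer on its own. Note that the data model had to be designed before a single row could be loaded, which is where the time and money go.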

Not to mention the biggest bucks of all, which are spent architecting the new data model required for a Data Warehouse. Data Lake vendors have pegged the cost of building a Data Warehouse at about $250,000 per terabyte. Of course, they are trying to vend you some Data Lakes, so this figure may be a bit high, but whatever the exact cost, it is undeniably significant, and it grows almost exponentially with the avalanche of big data from the Internet of Things and other sources. Rigid Data Warehouses and their brittle data models simply can’t keep up. Enter the Data Lake.

Data Lake Architecture

The concept of a Data Lake has emerged in the last several years, primarily as an antidote to the fundamental flaw of Data Warehouses: their rigidity and inability to deal with the onslaught of big data. The basic idea with a Data Lake is: forget about all the extracting and transforming, architecting, layers, etc. (for now) and just dump all your data into a great big lake and let the Data Scientists and SQL gurus figure it out, as illustrated below:

[Figure: Data Lake Architecture]
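The "dump it now, figure it out later" idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `LAKE` list, its `format` tags and the sensor readings are all hypothetical, and a real Data Lake would sit on a distributed file system rather than in a Python list. What it shows is that no structure is imposed when the data lands, only when somebody asks a question.

```python
import csv
import io
import json

# Hypothetical "lake": raw payloads stored exactly as they arrived,
# tagged with nothing more than their format.
LAKE = [
    {"format": "json", "raw": '{"sensor": "t1", "temp_c": 21.5}'},
    {"format": "csv", "raw": "sensor,temp_c\nt2,19.0\nt3,22.25\n"},
    {"format": "text", "raw": "Technician note: sensor t4 looks miscalibrated."},
]


def read_temperatures(lake):
    """Impose a (sensor, temp_c) structure at read time, per question."""
    readings = []
    for item in lake:
        if item["format"] == "json":
            record = json.loads(item["raw"])
            readings.append((record["sensor"], float(record["temp_c"])))
        elif item["format"] == "csv":
            for row in csv.DictReader(io.StringIO(item["raw"])):
                readings.append((row["sensor"], float(row["temp_c"])))
        # Unstructured text stays in the lake but is skipped by this query.
    return readings
```

Loading was free, but every new question needs its own parsing logic like `read_temperatures`, and the free-text note is simply ignored until somebody writes code that understands it.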

Yes, Data Lakes are largely Hadoop-based, which is super-trendy right now, and yes, they do solve the problem of heterogeneous data, in that Data Lakes can handle structured, semi-structured and unstructured data, which is all-important these days. But is it just me, or does this sound a lot like procrastination? Why all the hype? Let’s follow the money to find out.

Data Lake vendors boast a 20X, 50X or even 100X reduction in cost: from $250,000 per terabyte down to as little as $2,500 per terabyte, they say. Brilliant! Big data problem solved. Just store it all and figure it out later. Phew!
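Following the money is easier with the arithmetic spelled out. The short Python sketch below uses only the per-terabyte figures quoted above; the function names are my own, and the intermediate prices are simply what the 20X and 50X claims would imply if applied to the same $250,000 starting point.

```python
WAREHOUSE_COST_PER_TB = 250_000  # USD per terabyte, the vendors' figure
LAKE_COST_PER_TB = 2_500         # USD per terabyte, the vendors' best case


def reduction_factor(before, after):
    """How many times cheaper 'after' is than 'before'."""
    return before / after


def cost_after_reduction(before, factor):
    """Per-terabyte cost implied by a claimed reduction factor."""
    return before / factor


# The quoted price pair corresponds to the top of the claimed range.
best_case = reduction_factor(WAREHOUSE_COST_PER_TB, LAKE_COST_PER_TB)  # 100.0

# Prices implied by the more modest claims.
implied = {
    factor: cost_after_reduction(WAREHOUSE_COST_PER_TB, factor)
    for factor in (20, 50, 100)
}
# implied == {20: 12500.0, 50: 5000.0, 100: 2500.0}
```

So the $2,500 figure is the 100X best case, while a 20X reduction would still leave you paying $12,500 per terabyte.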

Not so fast. There’s a reason why the Internet is now rife with articles about fixing and figuring out your Data Swamp or your Data Graveyard, swamp and graveyard being two of the more printable names that I can use in this article. I’m not joking here. Witness #1: Daniel Newman writing in Forbes: 6 Steps to Clean up your Data Swamp. Witness #2: Samantha Custer et al. writing on AIDDATA: Avoiding Data Graveyards.

Data Lake Analytics? or Data Swamps

So Data Lakes are a suspect. To find more proof that something is amiss here, let’s read between the lines of PricewaterhouseCoopers’s piece entitled The Enterprise Data Lake: Better integration and deeper analytics. Sounds rosy, doesn’t it?

They write that Data Lakes have “nearly unlimited potential for operational insight and data discovery”. Cool! Potential is good. But then a caveat: “Analytics drawn from the lake become increasingly valuable as the metadata describing different views of the data accumulates.” Okay, maybe my metadata will accumulate like the mould in my pond. I can live with that.

Read the entire article: https://www.datawerks.com/data-lake/