One doesn't go far into a conversation about “modern BI” these days without someone inevitably bringing up the topic of the data lake. Some in the business intelligence world are excited about this relatively new style of data repository because of its big data capabilities, while others lambast it for its drawbacks, most commonly the lack of data governance.
As today’s BI and Analytics landscape is increasingly making use of the data lake, it’s a good idea to familiarize yourself with the concept and its advantages and disadvantages.
Like a data warehouse, a data lake is a repository for data. As far as data storage goes, though, that’s about where the similarities stop. Data within a data warehouse is processed and structured. A data lake, in addition to structured data, also holds semi-structured, unstructured, and raw data.
In addition to the general structural differences between a data lake and a data warehouse, a number of other differences separate the two, including data types supported, speed, usability, and flexibility.
In a data warehouse, data is carefully considered and structured before being pulled in. This is known as a “schema on write” approach to data storage. A data lake, however, takes in all data in its original form. That includes data that would be useful to analyze today, data that may be useful in the future, and data that may never be analyzed at all. Every data type is supported, including non-traditional types that a data warehouse cannot handle, such as text, images, social media content, and web server logs. This is possible because, as mentioned above, a data lake maintains data in its raw format and only transforms it when it is ready to be analyzed. This approach is known as “schema on read.”
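To make the difference between the two approaches concrete, here is a minimal Python sketch. It is an illustration only: the sample events, the `WAREHOUSE_COLUMNS` schema, and the function names are all hypothetical, and real warehouses and lakes are of course far more involved.

```python
import json

# Hypothetical raw events as they might arrive, one JSON document per line.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.90", "channel": "web"}',
    '{"user": "bo", "amount": "5", "note": "gift card"}',
    '{"user": "cy", "comment": "no purchase, just browsing"}',
]

# Assumed warehouse schema: every stored row must have these columns.
WAREHOUSE_COLUMNS = ("user", "amount")


def load_schema_on_write(lines):
    """Shape and validate each record before storing it; set aside what does not fit."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        if all(col in record for col in WAREHOUSE_COLUMNS):
            table.append({"user": record["user"], "amount": float(record["amount"])})
        else:
            rejected.append(line)
    return table, rejected


def load_schema_on_read(lines):
    """Store every record untouched; no structure is imposed at load time."""
    return list(lines)


def total_amount(raw_store):
    """Apply a schema only at query time, skipping records without an amount."""
    total = 0.0
    for line in raw_store:
        record = json.loads(line)
        if "amount" in record:
            total += float(record["amount"])
    return total


warehouse_table, rejected = load_schema_on_write(RAW_EVENTS)
lake_store = load_schema_on_read(RAW_EVENTS)
```

In this sketch the schema-on-write loader keeps two rows and sets one aside, while the schema-on-read store keeps all three raw records and leaves the interpretation to whoever queries it later, for example through `total_amount(lake_store)`.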
Processing, cleansing, and transforming data for a data warehouse solution takes time. Because a data lake defers this step until the data is actually analyzed, users have near-immediate access to the data they want to explore. Information Designers can quickly configure, re-configure, and otherwise experiment with data on the fly for powerful ad-hoc purposes.
This type of agility isn’t for everyone, though. Not everyone wants to get their hands dirty with data exploration, or has the skills to do so. And the very nature of raw data means that built-in data governance is essentially non-existent. Governance becomes the responsibility of the users, who should employ tactics such as a closed-loop system or sandbox analytics. Without such measures, the data lake risks becoming a mess of disconnected silos and unusable data.
A data warehouse is extremely powerful. By design, it makes it easy to link data across various dimensions. However, it can also be extremely cumbersome. Among the various types of users who utilize BI on a daily basis, only the highly technical Information Designers can get under the hood and make changes to a data warehouse.
A data lake, however, is much more agile. Information Designers can fully immerse themselves in the large and varied data sets they need, while more casual Business Users can pick and choose from the more structured data sources within the data lake. The structured data is easily ordered and processed within the data lake, resulting in an output of analyzed data that users can quickly sift through to gain insight.
By definition, a data warehouse is highly structured. While this makes it a powerful storage option, it also makes changes within the data warehouse difficult. The biggest benefit of the de-normalized data warehouse is therefore also its flaw. Any work done within a data warehouse falls to a highly skilled Data Scientist or Information Designer. Ad-hoc analytics are impossible with just a traditional data warehouse structure, as any new data first has to be folded into an appropriate cube.
That’s why the increasing demand for self-service BI makes a data lake highly attractive. Users are empowered to utilize and experiment with data outside the data warehouse, and don’t have to wait for IT to find time for their requests. That’s not to say the flexibility of the ungoverned data lake doesn’t come with a toll. Don’t forget that unstructured data can quickly lead to chaos for those who don’t know what they’re doing, and even for those who do.
At TARGIT, we’ve been proponents of self-service BI since our inception. What good is BI, we’d like to know, if it doesn’t put the power of data discovery in the hands of every decision-maker? When considering data lakes and data warehouses, it doesn’t have to be an either/or decision. Why not go bimodal and harness the power of both?
A data lake is a low-cost alternative for data storage for companies that want to utilize external data. The data lake can pull directly from hundreds, if not thousands, of external data sources and serve as a dumping ground until that data is pulled into the front-end business intelligence system. This makes the process significantly faster. Data lakes also encourage self-service data discovery. All of this, combined with the structure and security of a data warehouse, makes for unrivaled access to actionable insight.
Newer tools, such as TARGIT’s Data Discovery module, have emerged that make it possible to bridge the gap between the data warehouse and outside data sets such as the data lake. With such a tool, users can blend data outside the data warehouse with data within it, making it possible to experiment with and prototype data outside the data warehouse.
Data Discovery comes with native connections to dozens of data sources, including Hadoop-based data lakes, that can be combined with each other and with data inside the data warehouse in just a few clicks.
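For readers who want a feel for what blending warehouse data with lake data means in practice, here is a small, tool-agnostic Python sketch. It does not use or represent TARGIT’s Data Discovery module or any of its interfaces. The customer dimension, the raw click export, and the function name are all hypothetical stand-ins.

```python
import csv
import io
from collections import defaultdict

# Hypothetical governed customer dimension, as it might come from a data warehouse.
WAREHOUSE_CUSTOMERS = {
    101: {"name": "Acme", "region": "North"},
    102: {"name": "Globex", "region": "South"},
}

# Hypothetical raw click export sitting in a data lake, untouched since landing.
LAKE_CLICKS_CSV = """customer_id,page
101,/pricing
101,/contact
102,/pricing
999,/home
"""


def blend_clicks_by_region(customers, clicks_csv):
    """Join raw lake clicks to the warehouse dimension and count clicks per region."""
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(clicks_csv)):
        customer = customers.get(int(row["customer_id"]))
        # Lake data is ungoverned, so some IDs may not exist in the warehouse.
        region = customer["region"] if customer else "Unknown"
        counts[region] += 1
    return dict(counts)


clicks_by_region = blend_clicks_by_region(WAREHOUSE_CUSTOMERS, LAKE_CLICKS_CSV)
```

The "Unknown" bucket is a deliberate choice in this sketch: because lake data has not been cleansed, a blend has to decide what to do with records that have no match in the governed warehouse, rather than silently dropping them.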