What is a data lake?

April 19, 2016

One doesn’t go far into a conversation about “modern BI” these days without someone inevitably bringing up the topic of the data lake. Some in the business intelligence world are excited about this relatively new style of data repository because of its big data capabilities, while others lambast it for its drawbacks, most commonly the lack of data governance.

As today’s BI and analytics landscape is increasingly making use of the data lake, it’s a good idea to familiarize yourself with the concept and its advantages and disadvantages.


See how you can harness the power of a data lake in combination with your data warehouse in this on-demand webinar: Break Out of the Data Warehouse.

What is a data lake?

Like a data warehouse, a data lake is a repository for data. As far as data storage goes, though, that’s about where the similarities stop. Data within a data warehouse is processed and structured. A data lake, in addition to structured data, also holds semi-structured, unstructured, and raw data.

The name is apt: data sources in a variety of structures feed into a data lake in the same way that rivers, streams, and other tributaries feed into an actual lake. And just like a real lake, it can turn into a chaotic swamp, in this case because of the unstructured nature of some of its sources. When analyzing data, users can take small samples or dive in and explore as much of the data as they want.

A data lake attempts to solve the problem of data silos. Instead of dozens of strictly managed, separate data collections, it pools everything together, which promotes increased use and sharing of data. It can also cut the cost of server licensing.

How is a data lake different from a data warehouse?

Beyond the general structural differences, a number of other differentiators separate a data lake from a data warehouse, including data types supported, speed, usability, and flexibility.

1: Data types

In a data warehouse, data is carefully considered and structured before being pulled in. This is known as a “schema on write” approach to data storage. A data lake, however, takes all data in its original form. That includes data that would be useful to analyze today, data that might be useful in the future, and data that may never be analyzed at all.

Every data type is supported, including non-traditional types that a data warehouse cannot handle, such as text, images, social media content, and web server logs. This is possible because, as I mentioned above, a data lake maintains data in its raw format and only transforms it when it is ready to be analyzed. This approach is known as “schema on read.”
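To make the difference between “schema on write” and “schema on read” concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the sample events, the three-column warehouse layout, and the function names are hypothetical and do not come from any particular product.

```python
import json

# Hypothetical raw events, exactly as they arrived (illustrative data only).
RAW_EVENTS = [
    '{"user": "ann", "amount": "19.90", "channel": "web"}',
    '{"user": "bob", "amount": "5", "note": "gift"}',
    'not json at all',
]


def load_schema_on_write(lines):
    """Warehouse style: shape and validate every record BEFORE storing it."""
    table, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
            # Assumed warehouse layout: user, amount, channel.
            row = {
                "user": str(record["user"]),
                "amount": float(record["amount"]),
                "channel": str(record["channel"]),
            }
        except (ValueError, KeyError, TypeError):
            # Anything that does not fit the schema never makes it in.
            rejected.append(line)
            continue
        table.append(row)
    return table, rejected


def load_raw(lines):
    """Lake style: keep every line untouched, whatever its shape."""
    return list(lines)


def read_amounts(raw_lines):
    """Schema on read: impose structure only when a question is asked."""
    amounts = []
    for line in raw_lines:
        try:
            amounts.append(float(json.loads(line)["amount"]))
        except (ValueError, KeyError, TypeError):
            # Lines that cannot answer this question are skipped, not deleted.
            continue
    return amounts
```

With this sample data, the warehouse-style loader keeps only the one record that matches its layout, while the lake keeps all three lines and can still answer a question about amounts for the two records that carry one. The trade-off is that every reader of the lake has to handle malformed lines for themselves.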

A data warehouse only stores data. Data that needs to be analyzed is taken from the cubes on top of the data warehouse, which process it into a highly structured format. A data lake, however, works with data in its raw format: whichever form the data arrives in is the form in which it is analyzed.

2: Speed

Processing, cleansing, and transforming data for a data warehouse takes time. Because a data lake skips this step up front, users have instant access to the data they want to analyze. Information Designers can quickly configure, re-configure, and otherwise experiment with data on the fly for powerful ad-hoc purposes.

This type of agility isn’t for everyone, though. Not everyone wants to get their hands dirty with data exploration, and not everyone has the proper skills for it. And the very nature of raw data means that data governance is essentially non-existent. Data governance becomes the responsibility of the users, who should employ tactics such as a closed-loop system or sandbox analytics. Without such measures, the data lake risks becoming a mess of disconnected silos and unusable data.

3: Usability

As I mentioned in my last post, a data warehouse is extremely powerful. In principle, it is designed to make it easy to link data across various dimensions. However, it can also be extremely cumbersome. Among the various types of users who utilize BI on a daily basis, only the highly technical Information Designers can get under the hood and make changes to a data warehouse.

Read about the different types of BI users in a company and how you can increase BI adoption rates by catering to how they work best: How to Ensure the Highest User Adoptions Rates for Your BI Project.

A data lake, however, is much more agile. Information Designers can fully immerse themselves in the large and varied data sets they need, while more casual Business Users can pick and choose from the more structured data sources within the data lake. The structured data is easily ordered and processed within the data lake, resulting in an output of analyzed data that users can quickly sift through to gain insight.

4: Flexibility

By definition, a data warehouse is highly structured. While this makes it a powerful storage option, it also makes changes within the data warehouse difficult. In that sense, the biggest benefit of the de-normalized data warehouse is also its biggest flaw. Any work done within a data warehouse falls to a highly skilled Data Scientist or Information Designer. Ad-hoc analytics are impossible with just a traditional data warehouse structure, as any new data first has to be folded into an appropriate cube.

That’s why the increasing demand for self-service business intelligence makes a data lake highly attractive. Users are empowered to utilize and experiment with data outside the data warehouse, and don’t have to wait for IT to find time for their requests. That’s not to say the flexibility of the ungoverned data lake doesn’t come at a cost. Don’t forget that unstructured data can quickly lead to chaos for those who don’t know what they’re doing – and even for those who do.

Time to go swimming? 

At TARGIT, we’ve been proponents of self-service BI since inception. What good is BI, we’d like to know, if it doesn’t put the power of data discovery in the hands of every decision-maker? When considering data lakes and data warehouses, it doesn’t have to be an either/or decision. Why not go bimodal and harness the power of both?

Newer tools, such as TARGIT’s updated Data Discovery module, have emerged that make it possible to bridge the gap between the data warehouse and an outside data set like the data lake. With such a tool, users can blend data outside the data warehouse with data within it, making it possible to experiment with and prototype data outside the data warehouse. Data Discovery comes with native connections to dozens of data sources, including Hadoop’s data lake, that can be combined with each other and with data inside the data warehouse in just a few clicks.
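Conceptually, blending comes down to joining structured warehouse rows with records read from raw data on a shared key. The sketch below illustrates that idea only. It is not how TARGIT’s Data Discovery module works, and the sample data, the customer_id key, and the function names are all hypothetical.

```python
import json

# Hypothetical structured rows, as they might come out of a warehouse query.
WAREHOUSE_SALES = [
    {"customer_id": 1, "revenue": 1200.0},
    {"customer_id": 2, "revenue": 300.0},
]

# Hypothetical raw web-log lines sitting in a data lake.
LAKE_WEB_LOGS = [
    '{"customer_id": 1, "page": "/pricing"}',
    '{"customer_id": 1, "page": "/contact"}',
    '{"customer_id": 3, "page": "/home"}',
]


def count_visits(raw_lines):
    """Read the raw log lines and count page visits per customer."""
    visits = {}
    for line in raw_lines:
        record = json.loads(line)
        customer = record["customer_id"]
        visits[customer] = visits.get(customer, 0) + 1
    return visits


def blend(sales_rows, raw_lines):
    """Attach a visit count from the lake to each warehouse row."""
    visits = count_visits(raw_lines)
    return [
        {**row, "visits": visits.get(row["customer_id"], 0)}
        for row in sales_rows
    ]
```

Calling blend(WAREHOUSE_SALES, LAKE_WEB_LOGS) leaves the warehouse rows as they are and adds a visits field to each. Customers with no log lines get a count of zero, and log lines for customers unknown to the warehouse are simply ignored in this sketch.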

Don't forget to check out this on-demand webinar to see how easy it is to harness data outside the data warehouse for powerful new insight with Data Discovery: Break Out of the Data Warehouse.

A data lake is a low-cost alternative for data storage for companies that want to utilize external data. The data lake can pull directly from hundreds, if not thousands, of external data sources and serve as a dumping ground until that data is pulled into the front-end business intelligence system. This makes the process significantly faster. Data lakes also encourage self-service data discovery. All of this, combined with the structure and security of a data warehouse, makes for unrivaled access to actionable insight.