What is the role of a Structured Data Lake in DW?
The Full 360 Approach
Our approach is a little different from that of generic data lakes. We build structured data lakes. A structured data lake is just like any other in that it takes all sorts of data in any format, but we feed the lake with special programs called ‘producers’. These producers work independently, store metadata, and are optimized to chunk the data into the data lake with a basic understanding of how it will ultimately be consumed downstream. We always use dates and naming conventions, but we can arbitrarily add more metadata.
The purpose of this is to make the data lake more usable for direct consumers and downstream processes. The original developers of the source data could disappear from the planet, but anyone could eyeball the data and metadata and still have a good idea of what is in a structured data lake and how to use it.
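To make the producer idea concrete, here is a minimal sketch of what one might look like, assuming an S3-backed lake. The bucket name, the ‘orders’ feed, the chunk size, and the load_date= key convention are all illustrative, not taken from any actual Full 360 producer.

```python
# Hypothetical producer sketch: chunk a source file and write it to the lake
# under a date-based key convention, attaching extra metadata to each object.
import csv
import io
from datetime import date, datetime, timezone

import boto3  # assumes AWS credentials are already configured

BUCKET = "example-structured-lake"   # illustrative bucket name
FEED = "orders"                      # illustrative feed name
CHUNK_ROWS = 50_000

def produce(path: str) -> None:
    s3 = boto3.client("s3")
    load_date = date.today().isoformat()
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk, part = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) >= CHUNK_ROWS:
                _write_chunk(s3, header, chunk, load_date, part)
                chunk, part = [], part + 1
        if chunk:
            _write_chunk(s3, header, chunk, load_date, part)

def _write_chunk(s3, header, rows, load_date, part):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    # Naming convention: feed name / load date / part number.
    key = f"{FEED}/load_date={load_date}/part-{part:05d}.csv"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=buf.getvalue().encode(),
        # Arbitrary additional metadata travels with the object itself.
        Metadata={
            "source-feed": FEED,
            "produced-at": datetime.now(timezone.utc).isoformat(),
            "row-count": str(len(rows)),
        },
    )
```

Even years later, someone with no knowledge of the source system can read the keys and the object metadata and work out what the feed is, when it was loaded, and roughly how big it is.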
What you get
The big deal about a structured data lake is that it extends the capabilities of data warehouses and BI. I can build a DW with 6 months of history that is optimized for that window of time. Meanwhile, my data lake has an operational data store of 36 months at nearline speeds and 60 additional months offline. So my DW effectively has reach into 102 months of data (6 + 36 + 60) because of the way I’ve designed it to consume from the structured data lake. But I can also allow direct consumers to query that history using the slow, cheap data lake.
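To make that arithmetic concrete, here is a hedged sketch of how a loader might decide which monthly partitions belong in the warehouse and which stay in the lake. The tier boundaries mirror the numbers above; the function itself is illustrative, not the actual elasticBI logic.

```python
# Illustrative tiering rule: 6 months hot in the DW, 36 nearline in the lake,
# 60 more offline, for 102 months of total reach.
from datetime import date

def tier_for(partition_month: date, today: date) -> str:
    age = (today.year - partition_month.year) * 12 + (today.month - partition_month.month)
    if age < 6:
        return "warehouse"        # loaded into the DW, optimized for this window
    if age < 6 + 36:
        return "nearline-lake"    # queried directly from the lake
    if age < 6 + 36 + 60:
        return "offline-lake"     # still reachable, just slow and cheap
    return "expired"

# Example: a partition from 30 months ago lands in the nearline tier.
print(tier_for(date(2013, 1, 1), date(2015, 7, 1)))  # -> "nearline-lake"
```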
PLUS
Disaster recovery becomes a no-brainer. It is almost always faster to wipe a database and simply reload six months of history than it is to use database recovery tools against incremental backups, and it is certainly cheaper to do so. Having a data lake allows you to actually test that out. A proper data lake will always be faster for this purpose than NFS, and Amazon S3 will certainly be cheaper than a SAN of similar capacity, not to mention more reliable and lower-maintenance.
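As a hedged sketch of why the reload is so simple, assume a Redshift-style warehouse that can COPY straight from S3 prefixes and the same load_date= convention as above. The table, bucket, and IAM role are placeholders, not real resources.

```python
# Hypothetical recovery helper: regenerate COPY statements for the last six
# months of lake partitions instead of restoring from incremental backups.
from datetime import date

BUCKET = "example-structured-lake"   # placeholder
TABLE = "orders"                     # placeholder

def last_n_months(n: int, today: date) -> list[str]:
    months, year, month = [], today.year, today.month
    for _ in range(n):
        months.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return months

def recovery_statements(today: date) -> list[str]:
    # Each monthly prefix matches every load_date=YYYY-MM-DD partition in it.
    return [
        f"COPY {TABLE} FROM 's3://{BUCKET}/{TABLE}/load_date={m}' "
        "IAM_ROLE 'arn:aws:iam::000000000000:role/example-load' CSV;"
        for m in last_n_months(6, today)
    ]

if __name__ == "__main__":
    for stmt in recovery_statements(date.today()):
        print(stmt)
```

Wiping a table and replaying six such statements is a repeatable procedure, and, unlike a restore from incremental backups, it is easy to rehearse.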
PLUS
I can use my data lake to feed multiple instances of the data warehouse for hot swapping or for global deployment in different regions. I could also conceivably have my entire data lake replicated automatically. Although we’ve never had such a paranoid requirement, three years ago naysayers would yelp every time they heard tell of an AWS outage.
For more information about the elasticBI ‘Pitbull’ Framework for Data Warehousing and BI, check out this blog.
ELT vs ETL
Our structured data lakes will perform cleansing transformations in the producers. That is because for most file-based ingestion schemes we don’t have latency issues; when we’re pulling data from a generic source that produces files, end users can generally wait an hour before querying that data. For API-based ingestion schemes like message queues, or direct queries against upstream databases, we make the data instantly available to end users with minimal transformation, and we fork off a copy for the data lake. The forked producers will do the rest of the cleansing and transformation necessary.
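As a hedged illustration of that fork, the handler below pushes a lightly reshaped record toward end users while handing an untouched copy to the lake producer for the heavier cleansing. The field names and callbacks are hypothetical.

```python
# Illustrative fork: one path serves end users with minimal transformation,
# the other hands an untouched copy to the lake producer for full cleansing.
import json
from typing import Callable

def handle_message(raw: bytes,
                   to_warehouse: Callable[[dict], None],
                   to_lake_producer: Callable[[bytes], None]) -> None:
    record = json.loads(raw)

    # Fast path: light reshaping only, so end users can query immediately.
    to_warehouse({
        "id": record["id"],
        "ts": record["timestamp"],
        "payload": record.get("payload"),
    })

    # Fork: the raw copy goes to the producer, which cleanses and structures
    # it for the lake on its own schedule.
    to_lake_producer(raw)
```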
There are cases when we leave data in its raw state and send it to the lake with no transformation. Those tend to be for data science consumers, or for when the business really has no idea what the data means and is not necessarily ready to present it in a way that’s structured for analysis. This is more often the case with straight HDFS data that’s left native and ‘annexed’ to the lake.
I’ll be talking more about data lakes this month. Stay tuned.