Data Lakes: A Fast Shallow Dive
Let’s start with a fast shallow dive. Structured data lakes are one of the things we’re all about at Full 360. You may have heard of data lakes by now. If not, let me bring you up to speed.
What is it?
A data lake is a centralized, extensible, low cost storage facility that provides secured access to data of all types.
What does it do?
For us, a data lake provides three main functions. The first is a catch-all. The second is an operational data store. The third is an archive.
As a catch-all, a data lake works just like a watershed. It’s the single destination where all the trickles and streams of data from an organization end up. As an operational data store, a data lake is the staging area for loading data into higher performance databases or distributed data delivery networks. As an archive, it provides comprehensive storage for all historical data.
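One way to picture the three roles is as separate key prefixes inside a bucket. This is only a minimal sketch under assumptions: the prefix names (`landing`, `staging`, `archive`), the date-partitioned layout, and the `lake_key` helper are hypothetical illustrations, not a description of any particular Full 360 configuration.

```python
from datetime import date

# Hypothetical prefixes, one per role described above.
ROLES = {"catch_all": "landing", "ods": "staging", "archive": "archive"}

def lake_key(role: str, source: str, day: date, filename: str) -> str:
    """Build an object key of the form <prefix>/<source>/YYYY/MM/DD/<filename>."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    return f"{ROLES[role]}/{source}/{day:%Y/%m/%d}/{filename}"

# The same source file can be addressed in each role of the lake.
print(lake_key("catch_all", "crm", date(2017, 6, 1), "accounts.json"))
print(lake_key("archive", "crm", date(2017, 6, 1), "accounts.json"))
```

Keeping the role in the first path segment means a policy or lifecycle rule can be scoped to one role by prefix, which is one reason a layout like this might be chosen.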
How does it work?
We are fortunate to base our data lakes on Amazon S3, which is a secure, reliable cloud-based object store with virtually infinite capacity. The logical unit of a data lake is a bucket. Policies attach to a bucket, and buckets are not limited by size.
We can issue API or command-line calls so that it behaves like an FTP server. We can use GUIs so that it behaves like a shared drive. We can search for data objects. We either push or pull data and then work with it locally. We can get logs of every transaction with the data lake, and we can use its native encryption facility, our own, or both. We generally do not present data to end users as it is stored in the lake itself, but soon more applications like Amazon Athena and Redshift Spectrum will work directly against S3.
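The push, pull and search operations above might look roughly like this in Python. This is a sketch, assuming a boto3 S3 client is passed in (for example `boto3.client("s3")`); the wrapper names `push`, `pull` and `find` are hypothetical, while `upload_file`, `download_file` and `list_objects_v2` are the client methods being wrapped.

```python
def push(client, bucket: str, key: str, local_path: str) -> None:
    """Upload a local file into the lake under the given key."""
    client.upload_file(local_path, bucket, key)

def pull(client, bucket: str, key: str, local_path: str) -> None:
    """Download an object from the lake so it can be worked on locally."""
    client.download_file(bucket, key, local_path)

def find(client, bucket: str, prefix: str) -> list[str]:
    """List object keys under a prefix.

    A single list_objects_v2 call returns at most 1000 keys, so a real
    search over a large prefix would need to paginate.
    """
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])]
```

Passing the client in as an argument keeps credentials and region configuration outside these helpers, and lets them be exercised against a stand-in object without touching a real bucket.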
Isn’t a data lake just Hadoop?
In the cloud, object stores are the starting points for data lakes. HDFS can be part of a data lake solution, but it is not a necessary component.
How is a cloud object store different from NAS or SAN?
The short answer is that it’s cheaper and you don’t have to worry about striping drives, updating controllers or replacing old disks. Using Amazon S3, one can operate in a designated geographical region with the option to replicate data between multiple regions. You generally would not use a data lake the way you would use a standard block storage device, as the backing store for an interactive application, so most conventional apps would not run directly on a data lake.
What’s so cool about a data lake?
The existence of a data lake extends the capacity and capability of databases and other systems which use it. It provides low cost storage at the boundaries of the data space which is ‘live’ for queries. So it allows us to build applications that have ‘slow’ data and ‘fast’ data. Since it can hold every kind of data, it allows us to integrate ‘wide’ data that was once always stovepiped.
Data lakes allow us to build applications that use servers, clusters of instances or even serverless processes. They allow us to process, structure or analyze data directly in place, or indirectly with specialized hardware and software. When a data lake is structured properly, it gets rid of many limitations of the on-premises enterprise architecture, and it fits into a smart cloud-native architecture.
Not only that, a data lake can support all of the versions of data that change over time. It can support a standard canonical form, but also new formats of the same data. (JSON could be the canonical version but CSV could also exist.)
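As a small illustration of a canonical form living alongside another format of the same data, the sketch below derives a CSV rendition from canonical JSON. The object keys, the sample records and the `json_to_csv` helper are all hypothetical, and the dictionary merely stands in for a bucket.

```python
import csv
import io
import json

def json_to_csv(canonical: str) -> str:
    """Render a JSON array of flat objects as CSV text."""
    records = json.loads(canonical)
    # Use the union of all field names so no record loses a column.
    fields = sorted({name for record in records for name in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

canonical = json.dumps([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}])

# Both renditions sit side by side; the JSON stays the source of truth.
renditions = {
    "orders/v1/orders.json": canonical,
    "orders/v1/orders.csv": json_to_csv(canonical),
}
print(renditions["orders/v1/orders.csv"])
```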
Since data lake storage is inexpensive, when complex computations are needed on large sets of data, it can often be much more economical to save the results of that computation than to repeat the computation at some time in the future.
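The save-instead-of-recompute idea can be sketched as a lookup keyed by the job and its parameters. Everything here is an assumption for illustration: the `results/` prefix, the `result_key` and `compute_once` helpers, and the plain dictionary that stands in for the lake.

```python
import hashlib
import json

def result_key(job: str, params: dict) -> str:
    """Derive a stable key from the job name and its parameters."""
    payload = json.dumps(params, sort_keys=True).encode()
    return f"results/{job}/{hashlib.sha256(payload).hexdigest()}.json"

def compute_once(store: dict, job: str, params: dict, compute):
    """Return a saved result if one exists; otherwise compute and save it."""
    key = result_key(job, params)
    if key not in store:
        store[key] = json.dumps(compute(**params))
    return json.loads(store[key])

def expensive_total(n: int) -> int:
    # Stand-in for a costly computation over a large data set.
    return sum(range(n))

lake = {}  # a dict standing in for the bucket
first = compute_once(lake, "total", {"n": 1000}, expensive_total)
second = compute_once(lake, "total", {"n": 1000}, expensive_total)  # read back, not recomputed
```

Hashing the sorted parameters means the same request always maps to the same key, so a repeated request becomes a cheap read rather than a second run of the computation.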
We are all about data lakes, and we offer a fully customizable set of configurations depending on our customers’ needs and preferences.