Data Lakes: A Deeper Dive
The benefits of a data lake are many; you could say it is one of the linchpins of a data-centric cloud architecture. At a simple level, data lakes are the inevitable result of Moore's Law operating on the storage side of computing. The industrialization of storage fabric by Amazon has enabled us to expand our imagination. This is only the beginning.
The Full 360 Approach
To keep it simple, following Gall's Law, we just assume infinite, cheap, reliable, secure, slow storage. This is Amazon S3. It's also Backblaze B2. Google has an offering, and we expect to see requirements for Azure Storage soon. A lot of web developers tend to think of Hadoop and HDFS systems as data lakes. That may be the case, but having built systems incorporating both Hadoop and S3, we take a broader definition. Also, as systems integrators, we tend not to rely on one particular technology over another. Nevertheless, given that we build almost exclusively in AWS, S3 is the object store we focus on.
Given this virtually infinite resource, how does it change what is possible? When we started thinking in terms of simple migrations, we had no idea what kinds of complexity we could begin to address simply by having this capability. But we learned quickly and iterated on several ideas.
At Full 360's Big Data & Data Warehouse Practice we build data-centric applications that employ one or several database technologies. We call the latter 'multi-tier database architecture'. Our customers often have multiple end-user requirements, so we apply the requisite technologies for the different use cases while leveraging the data lake concept in a specific architectural design. For example, one of our gaming customers has a small group of data scientists who create special one-off queries against short-term data. Next on the data path are power users who employ self-service BI. Third are business users who consume more standardized reports. For the massive amounts of data flowing through, we leverage a structured data lake along with several different database technologies, which lets us give each group of users exactly what they need instead of forcing one database technology that compromises each group differently. We'll get to that example in a minute.
Properties of the Full 360 Structured Data Lake
1. Economy
2. Modularity
3. Extensibility
4. Reliability
5. Security
Economy
The Full 360 structured data lake includes both an Operational Data Store (ODS) and Archives. Each of these is designed to make optimum use of the storage facility, whether it is Amazon S3 or Azure Storage. With the ODS capability of the data lake, we can run our database engines to take advantage of fast local storage (SSD), which makes for optimized buffering of data across applications. By maximizing the use of data in the data lake, we minimize the cost of processing real-time queries from end-users. This is a balancing act driven by our experience and know-how: you have to understand the trade-offs of using one technology over another. We choose from a broad set of technologies and methods, which allows us to tailor the right performance for our customers. No two implementations are the same, but we always aim to maximize performance and minimize cost.
New technologies like AWS Athena extend the economy of the data lakes we create by querying text files directly in S3, but we also use Amazon EMR in some cases.
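As a minimal sketch of what that looks like in practice, the following assumes a hypothetical feed of text files already registered as an external table called events_raw in an Athena database called datalake; the bucket, database, and table names are placeholders for illustration, not from any real engagement.

```python
import boto3

# Hypothetical names, for illustration only.
ATHENA_DATABASE = "datalake"
RESULTS_BUCKET = "s3://example-athena-results/"

athena = boto3.client("athena", region_name="us-east-1")

# Athena charges per TB scanned, so restricting columns and the date slice
# keeps the query cheap even though it runs directly against raw text in S3.
response = athena.start_query_execution(
    QueryString="""
        SELECT event_type, count(*) AS events
        FROM events_raw
        WHERE event_date = date '2017-01-15'
        GROUP BY event_type
    """,
    QueryExecutionContext={"Database": ATHENA_DATABASE},
    ResultConfiguration={"OutputLocation": RESULTS_BUCKET},
)

print("Query submitted:", response["QueryExecutionId"])
```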
Modularity
We build our data lakes to be modular. That means we look at the grain of the data we are to process and consider the way it is provided and the nature of its feeds. Inbound streams of data from sources are processed by custom programs we call 'producers'. Our data producers fill the structured data lake in a standard, canonical format. Alternative formats have their place as well; for example, we may use JSON as well as CSV. Nothing gets into a structured data lake without going through a formally defined producer.
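Here is a minimal sketch of such a producer, assuming a hypothetical batch of source records and a made-up bucket and key layout; the point it illustrates is that every record lands in the lake in one canonical form (gzipped CSV, date-partitioned) and nothing else gets through.

```python
import boto3
import csv
import gzip
import io
from datetime import datetime, timezone

# Hypothetical bucket and canonical schema, for illustration only.
LAKE_BUCKET = "example-structured-lake"
CANONICAL_FIELDS = ["event_id", "event_time", "event_type", "payload"]

def produce(records):
    """Write a batch of source records into the lake in canonical CSV form."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=CANONICAL_FIELDS)
    writer.writeheader()
    for record in records:
        # Only the canonical fields make it into the lake.
        writer.writerow({k: record.get(k, "") for k in CANONICAL_FIELDS})
    now = datetime.now(timezone.utc)
    key = f"ods/events/dt={now:%Y-%m-%d}/batch-{now:%H%M%S}.csv.gz"
    boto3.client("s3").put_object(
        Bucket=LAKE_BUCKET,
        Key=key,
        Body=gzip.compress(out.getvalue().encode("utf-8")),
        ServerSideEncryption="AES256",  # encrypted at rest by default
    )
    return key

# Example usage with a hypothetical source record; unknown fields are dropped.
if __name__ == "__main__":
    print(produce([
        {"event_id": "e1", "event_time": "2017-01-15T00:00:00Z",
         "event_type": "login", "payload": "{}", "extra_field": "ignored"},
    ]))
```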
Traditional data warehouse systems have focused on a small number of data feeds, generally fewer than five. The promise of integrating any source of data has been choked off by the incremental cost of traditional ETL processes. Our producers are built to work in containerized fleets that scale horizontally.
Full 360 customers are seeking to integrate larger numbers of sources into complex analytical models. We meet those demands for flexibility by architecting data lakes to handle any number of sources and formats. As a consequence of this modularity, we can leverage our database design experience to create new 'wide data' analyses.
Extensibility
Sometimes we annex data from Hadoop/EMR clusters and make them part of the data lake. This happens when we determine that data consumption can be most efficiently and effectively provided from such HDFS-based technologies. We also respect our customers' investments in those technologies. With Presto, Hive and Spark we have many choices, but we still think of them as producers. Sometimes these producers are pre-existing HDFS-based operations; our task is to integrate them into discrete parts of a workflow.
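As an illustration of treating an existing cluster as just another producer, the sketch below submits a Hive script as a step to a running EMR cluster; the cluster ID and the script location in S3 are hypothetical.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster id and script location, for illustration only.
CLUSTER_ID = "j-EXAMPLE12345"
HIVE_SCRIPT = "s3://example-structured-lake/producers/kvp_transpose.hql"

# Run an existing Hive job on the annexed cluster as one more producer step.
emr.add_job_flow_steps(
    JobFlowId=CLUSTER_ID,
    Steps=[
        {
            "Name": "hive-producer-kvp-transpose",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", HIVE_SCRIPT],
            },
        }
    ],
)
```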
Our migration customers almost always have traditional ETL processes. When we build data lakes for them, we must encapsulate these processes in a lift and shift. Our ability to build producers quickly often opens up more options, including a full replacement of legacy sources, which lets us create more efficient paths. But we are also very happy to use prebuilt Talend or Informatica jobs as licensing permits.
Reliability
Reliability often means just what you think: something doesn't break. But when customers try to scale up and add more complexity than they've ever attempted before, sometimes reliability means knowing how something breaks and to what degree. All systems break. We always provide SLAs for our data lakes, and that means we consider what can go wrong with any of our pieces and parts. In our data workflows, problems most often appear as latency issues with source systems. Whether they are due to network outages or database queries on the demand side, we create the necessary notifications using CloudWatch and integrate them with our PagerDuty system so that when the system is stressed, we have early alerts. We have enough experience to know that a dog that barks at everything is not a watchdog, so we tune our alerts and then retune them over time. The end result is a system that tells us exactly what we need to know exactly when we need to know it.
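A minimal sketch of one such alert, assuming a custom metric that a producer publishes after each batch and an SNS topic already wired to PagerDuty; the names and thresholds are placeholders, and in practice the thresholds are exactly what gets retuned over time.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical SNS topic, assumed to be integrated with PagerDuty.
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:datalake-pagerduty"

# Alarm if the producer falls behind: no batches landed in the lake for
# three consecutive 5-minute periods. The metric is a custom one we assume
# the producer publishes after each successful batch.
cloudwatch.put_metric_alarm(
    AlarmName="lake-producer-latency",
    Namespace="ExampleDataLake",
    MetricName="BatchesLanded",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",  # silence is itself an alert condition
    AlarmActions=[ALERT_TOPIC_ARN],
)
```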
We build runbooks based on those alerts to provide 24/7 continuity across our support staff. So we don't simply rely on the eleven 9s of durability that S3 is designed for; we employ Amazon's Well-Architected Framework.
Security
We consider both the security of the AWS cloud itself and the security of our applications in the cloud. Part of using that framework is applying the principle of least privilege. Our own standards of encrypting data at rest and in transit always apply; many times this is the first time enterprise applications have done so, because of the cumbersome security apparatus of proprietary middleware and databases. Beyond that, we assign specific access roles to every process generating data in the workflow. Nothing gets into the lake via default ACLs; all access is granted on a need-to-know basis. Producers also generate specific logs that can be audited.
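To make the least-privilege idea concrete, here is a sketch of the kind of scoped, write-only policy a producer role might carry; the role, bucket, and prefix names are hypothetical, and in our practice policies like this are laid down by automation rather than by hand, as described below.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical role and bucket names, for illustration only.
PRODUCER_ROLE = "lake-producer-events"
LAKE_BUCKET = "example-structured-lake"

# Least privilege: this producer may only write its own prefix; it cannot
# read other producers' data or touch anything outside the lake.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::{LAKE_BUCKET}/ods/events/*",
        }
    ],
}

iam.put_role_policy(
    RoleName=PRODUCER_ROLE,
    PolicyName="lake-producer-events-write-only",
    PolicyDocument=json.dumps(policy),
)
```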
When we build hybrid solutions, which are necessary for third-party data feeds that our customers purchase, our installations are accepted as trusted peers of our customers' own networks and datacenters. We run extensive planning with our certified security experts and the customers' IPsec staff. We use Biscuit for simple secret-keeping in our scripting and Vault for more involved implementations, so we know at all times which processes are authorized to use which data.
These safeguards are built into our practice of developing infrastructure as code, and each iteration improves our ability. We use Terraform to automate such processes and ensure that our security policies are automatically built into each new implementation.
—
The Gaming Example
To give you an idea of how we extend the capabilities of a data warehouse with a structured data lake, I’m going to abstract a case study of one of our customers.
This customer already had Redshift and was processing a live queue of events in near-real time. Their problem was that they had to ingest about 20 TB of data per day for their newly released title. Now, the fact of the matter is that Redshift will scale to any size you want, but:
1. It is never easy to ingest that amount of data.
2. It is never cheap to add additional Redshift nodes.
We did some calculations. Maintaining the amount of history that end users wanted to access (30 days) would have meant multiplying their Redshift spending by roughly two, maybe four, times to be comfortable. So they called us in to try and tune their cluster.
We understood from the beginning that data is cheaper to house in S3 than in Redshift. So we went through a large matrix of options to pick out the best combination of resources.
Given:
API-based event data: 200+ columns, ~20 TB uncompressed per day (JSON)
Four classes of end-users {developers, data scientists, user analysts, third parties}
S3, Amazon Elastic MapReduce {Hive, Pig, Presto}
Redshift cluster with 2 x ds2.8xlarge nodes (each with 16 TB of disk space)
35% of disk space left free for query and vacuum operations
9.6 days of production data can be stored in the cluster. This should scale in a linear fashion as more nodes are added.
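As a back-of-envelope check, the retention figure above follows directly from the cluster numbers; only the implied compressed daily footprint is derived here, and the reduction ratio is an inference from those givens, not a measured value.

```python
# Back-of-envelope check on the retention figure above. Only the cluster
# numbers come from the case; the per-day footprint is derived from them.
nodes = 2
disk_per_node_tb = 16.0      # ds2.8xlarge local storage
usable_fraction = 0.65       # 35% reserved for query and vacuum headroom
retention_days = 9.6

usable_tb = nodes * disk_per_node_tb * usable_fraction   # 20.8 TB
stored_per_day_tb = usable_tb / retention_days           # ~2.17 TB/day

# Against ~20 TB/day of uncompressed JSON, that implies roughly a 9:1
# reduction (columnar compression and/or not loading every raw field).
implied_reduction = 20.0 / stored_per_day_tb              # ~9.2x

print(f"usable: {usable_tb:.1f} TB, per day: {stored_per_day_tb:.2f} TB, "
      f"implied reduction: {implied_reduction:.1f}x")
```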
We knew from prior experience that in Redshift we could create wide columnar views of KVP data. But there were also certain advantages to turning JSON into CSV when it came to Redshift ingestion efficiency. So we went into detail on the pros and cons of three options.
In the end we chose the second option: the data lake was built using S3, with Amazon EMR spitting out canonical KVP records. That left three streams of data to be processed.
Near Realtime Main Stream
This is the main source. On a daily basis, event data are processed into an S3 bucket by the transaction system. Using Hive, raw event data (a very wide record transposed into many KVP records) is sourced from S3; Amazon EMR reads and writes S3 directly. This stream uses a permanent cluster and lands all records in the S3 KVP History bucket. That bucket is used to process batch loads into Redshift, which retains the full detail of all events for a few days, long enough for data scientists to see everything that's going on through structured, fast Redshift queries.
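The actual transposition ran in Hive on EMR; purely as an illustration of the idea, here is the same wide-record-to-KVP transformation sketched in a few lines of Python, with hypothetical field names.

```python
import json

def to_kvp(raw_event: str):
    """Transpose one wide JSON event (200+ columns) into narrow KVP rows."""
    event = json.loads(raw_event)
    # Hypothetical identifying fields carried onto every KVP row.
    event_id = event.pop("event_id", None)
    event_time = event.pop("event_time", None)
    for key, value in event.items():
        yield {
            "event_id": event_id,
            "event_time": event_time,
            "key": key,
            "value": value,
        }

# One wide record becomes many narrow, uniformly shaped records that any
# downstream consumer (Redshift COPY, Hive, Presto) can handle the same way.
sample = '{"event_id": "e1", "event_time": "2017-01-15T00:00:00Z", "level": 3, "score": 1200}'
for row in to_kvp(sample):
    print(row)
```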
Historical Conversion Stream
The full history of event records in S3 was also transformed via EMR into the S3 KVP History bucket. This was a one-time job that used a large transient cluster. We had to make sure this job completed all the way through, because no checkpoints were possible the way we ran it. An expensive one-off, but it put the full history in place.
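A transient cluster that terminates itself once its steps finish is the natural fit for a job like this. The sketch below shows roughly what launching one looks like; the instance sizes, names, release label, and script location are all placeholders rather than the actual job definition.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# All names, sizes, and paths here are placeholders, for illustration only.
response = emr.run_job_flow(
    Name="kvp-history-backfill",
    ReleaseLabel="emr-5.4.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m4.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m4.2xlarge", "InstanceCount": 20},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: terminate when done
    },
    Steps=[{
        "Name": "backfill-kvp-history",
        "ActionOnFailure": "TERMINATE_CLUSTER",  # no checkpoints: fail the whole job
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", "s3://example-structured-lake/jobs/backfill.hql"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster:", response["JobFlowId"])
```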
Incremental Conversion Stream
One of the inherent problems of unstructured data, compared to data structured in a lake or columnar database, is that you never quite know how much of it you're going to use, so you retain it all. With this in mind, and knowing that Redshift can easily accommodate new columns, we designed another EMR path into Redshift. This path allowed our customer to ingest a forgotten metric into the structured data on demand. The CloudFormation recipe for this transient EMR cluster was also kept around so we could change a few parameters, launch it, and add a new column of historical data to the 'fast data' model served by Redshift.
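On the Redshift side, the on-demand part boils down to two statements once the incremental EMR job has staged the forgotten metric: widen the fact table, then backfill the new column. The sketch below is only illustrative; the table, column, staging names, and connection details are hypothetical, and the real pipeline was driven by the CloudFormation recipe described above.

```python
import psycopg2  # Redshift speaks the Postgres protocol

# Hypothetical connection details, tables, and column, for illustration only.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl", password="...",
)
conn.autocommit = True

with conn.cursor() as cur:
    # 1. Widen the fast-data model with the newly needed metric.
    cur.execute("ALTER TABLE fact_events ADD COLUMN session_length INTEGER;")
    # 2. Backfill it from a staging table the incremental EMR job loaded
    #    out of the S3 KVP History bucket.
    cur.execute("""
        UPDATE fact_events
        SET session_length = s.session_length
        FROM stage_session_length s
        WHERE fact_events.event_id = s.event_id;
    """)
```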
The Compressed Tail
Finally, knowing there were downstream customers as well as limits to how much Redshift data we needed to keep live, we set up a periodic export to S3. The 'Redshift facts', now running in a denormalized table, represented the full set necessary for live queries by the customer's business users and also included small datasets pushed from Tableau clients. Every once in a while we could prune the live dataset, keeping the overall Redshift cluster the same size.
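A sketch of that periodic export and prune, with hypothetical object names: UNLOAD writes the oldest slice of facts to the compressed tail in S3, and a DELETE keeps the live table, and therefore the cluster, at a constant size.

```python
def export_and_prune(cur):
    """Move facts older than 30 days to the compressed S3 tail, then prune.

    `cur` is a cursor on a Redshift connection like the one in the previous
    sketch; the table, bucket, and role names here are hypothetical.
    """
    # Export the oldest slice of live facts to the compressed tail in S3.
    cur.execute("""
        UNLOAD ('SELECT * FROM fact_events
                 WHERE event_date < DATEADD(day, -30, GETDATE())')
        TO 's3://example-structured-lake/archive/fact_events_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
        GZIP ALLOWOVERWRITE;
    """)
    # Prune the same slice so the live cluster stays the same size.
    cur.execute(
        "DELETE FROM fact_events WHERE event_date < DATEADD(day, -30, GETDATE());"
    )
    cur.execute("VACUUM fact_events;")
```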
Conclusion
Clearly there is nothing typical about this application; it is complex and demanding. Our approach has always been to offer our customers our ability to build high-performance, cloud-native solutions that go beyond lift & shift of on-premise enterprise apps. By sticking to the five principles of our structured data lake architecture, we are able to make the most of S3 and transient EMR to reduce the cost of operations going forward, while increasing the performance and usability of Redshift across different sets of users. We think this stands as a great example of what can be accomplished in the AWS cloud by leveraging the expertise of our team and the strengths of the tech and methods available in that cloud.