The Pitbull Framework: Next-Gen Data Warehousing in the Cloud
I want to begin to tell you a bunch of stories about where we are in our Big Data & Data Warehouse practice at Full360. We’ve been improving our capabilities for several years, and now is the time to get in front of the public and let you know what’s new. The Pitbull framework is something we’ve been doing since 2014, and it’s pretty formalized now that we’ve had some time working with it. In fact, it’s Pitbull 2.0, but the fundamentals haven’t changed.
Pitbull is a cloud-native way of managing the flow of data through a data warehouse. It consists primarily of three independent processes: the Producer, the Ingestor, and the Transformer. These work on each stream of data relevant to your data warehouse. First, let’s take a high-level view.
We generally build data warehouses that are application-specific. In the AWS cloud you can go this way, with a single schema for the entire warehouse supporting a single application, or with multiple schemas, one per application. The database itself is usually a columnar database running on a cluster of nodes, but we work with RDS-based relational databases as well as the more cumbersome traditional instance-based databases. One database on one server, just like in your datacenter only in the cloud, is no advantage at all if it’s not integrated at the API level. Performance-wise, in the general case of a cluster of columnar nodes, the number of applications and schemas is mostly determined by the sponsor of the project. Multiple schemas per database have a negligible impact on performance, but combining billing across divisions within one customer can sometimes cause headaches.
So think about one cluster of, say, six Redshift nodes running a single schema for a single application, along with another cluster in hot reserve in a second availability zone processing the same streams of data. For our example application, let’s say it is a territory-profitability model for a ride-sharing service.
The job of the producers is to create a standardized set of data from each source. In our example, the data will come from reservations generated by online and mobile systems, financial transactions generated by customers, real-time weather data from a subscription service, and a feed from corporate finance with accounting data. Let’s also assume some master data feeds reach us as well: driver info, customer demographics, fuel prices, and so on. Each of these distinct data sources will generate data at different times and in different formats. Full360 will build a custom producer whether the data comes from an API call, a flat file, an XML document, or an upstream database. Each producer runs continuously in its own Docker container. Each producer knows what it has already eaten and what it has pushed downstream toward the application.
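To make the "knows what it has already eaten" idea concrete, here is a minimal sketch of that checkpointing behavior. This is not the actual Pitbull producer code; the class name, the JSON checkpoint file, and the toy "standardization" step are all assumptions for illustration.

```python
import json
from pathlib import Path

class Producer:
    """Illustrative sketch of a Pitbull-style producer: it remembers which
    source batches it has already consumed and emits only new ones in a
    standardized form. Checkpoint format and field names are assumptions."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = Path(checkpoint_path)
        self.seen = set()
        if self.checkpoint_path.exists():
            self.seen = set(json.loads(self.checkpoint_path.read_text()))

    def produce(self, batches):
        """Standardize and return batches not seen before, then checkpoint."""
        emitted = []
        for batch_id, records in batches:
            if batch_id in self.seen:
                continue  # already eaten; skip it
            # 'standardization' here is just lower-casing keys, for illustration
            emitted.append((batch_id, [{k.lower(): v for k, v in r.items()}
                                       for r in records]))
            self.seen.add(batch_id)
        self.checkpoint_path.write_text(json.dumps(sorted(self.seen)))
        return emitted
```

Because the checkpoint survives restarts, a producer container can die and come back without re-pushing data it already sent downstream.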
You’ve probably heard about data lakes. We use Amazon S3 as our ‘data lake’, with the provision that we write it with journaling in mind. That gives us a permanent record of data production, as well as a structure for the data lake that matches the needs of the application. We build our S3 layout smartly from day one, encrypted and/or compressed to spec. We can have data overwritten and/or checked and reprocessed depending on requirements, or we can take the source’s word for it. We’re pretty good at writing producers whether they are batch-oriented or based on Kinesis, Kafka, or other message streams. Again, the architecture is flexible and robust. All of this is logged, and we can tune performance over time based on aggregate history. What we’re finding, for example, is that our producers will sense glitches in subscribed sources more quickly than those data providers can notify their subscribers.
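"Journaling in mind" roughly means that every produced batch lands at a predictable, date-partitioned key that is never reused. A sketch of what such a key-building helper might look like, with an assumed app/source/date/batch layout and file suffix (the real Pitbull convention may differ):

```python
from datetime import datetime, timezone

def journal_key(app, source, batch_id, produced_at=None, suffix="csv.gz"):
    """Build a journal-style S3 object key. The app/source/date/batch
    layout and compression suffix are illustrative assumptions, not the
    actual Pitbull convention."""
    ts = produced_at or datetime.now(timezone.utc)
    return (f"{app}/{source}/"
            f"{ts:%Y/%m/%d}/"
            f"{batch_id}.{suffix}")
```

Keys like these make the journal self-describing: you can replay any source for any date range just by listing prefixes, which is what makes the "permanent record of data production" useful in practice.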
Next we unleash the ingestor on the produced data. The job of the ingestor is to load the staged data into the database cluster in application-sized bites. If a complete set of data has not been produced, the ingestor will do nothing until its next cycle. Note that the producers will squawk independently if their sources are not timely or proper. Ingestors handle things like database rejects and appropriate chunking, depending on the extent of the database’s error-handling capabilities. We also have options like ‘lookback’ parameters for data that is allowed to be restated or re-produced.
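The "do nothing until its next cycle" rule is just a completeness gate: every required source must have staged fresh data before a load starts. A minimal sketch of that check, with an assumed input shape for the staged-batch metadata:

```python
def ready_to_ingest(required_sources, staged_batches):
    """Completeness gate for a hypothetical ingestor cycle: proceed only
    when every required source has staged at least one new batch. The
    input shape is an assumption for illustration."""
    staged = {b["source"] for b in staged_batches}
    missing = set(required_sources) - staged
    return (len(missing) == 0, sorted(missing))
```

Returning the list of missing sources, not just a boolean, is what lets the cycle log exactly why it skipped a load.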
In our standard high-performance design, we keep the ingested sources materialized, along with the history updated in each ingestion cycle. For the sake of transparency, these remain in place alongside the properly staged views. This removes a lot of guesswork for people who know what the data should look like, and overall it is a small price to pay even when you’re ingesting 10TB a day. If that ever became a bottleneck, we always have access to the produced data, which could easily be shared anywhere we like. That’s the flexibility built into our framework.
Once the ingestion is complete and we have a new bite of application-complete data and a properly updated historical source set, we crank this new set through the transformers. Our philosophy is ELT: we load the data, then transform it, rather than trying to transform it outside the database. Most of the time the database is the best place to transform and cleanse data, for two reasons. The first is the obvious economies of scale of database engines, which, by the way, are licensed not by the size of their servers but by the amount of data stored. Why pay to maintain a traditional ETL server that you have to scale separately? The second reason is that almost all transforms can be done in SQL, which a large body of people understand. Once again, that benefits everyone with transparency.
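An ELT transform is just SQL that runs inside the warehouse against already-loaded tables. Here is an illustrative sketch that renders one such statement for our ride-sharing example; the schema, table, and column names are all hypothetical, and a real Pitbull transform would be hand-written SQL kept under version control:

```python
def render_transform(schema, target, source, metrics):
    """Render an illustrative in-database ELT transform (e.g. for
    Redshift). All identifiers are hypothetical; this only shows the
    load-then-transform shape, not actual Pitbull SQL."""
    select_list = ",\n  ".join(metrics)
    return (
        f"INSERT INTO {schema}.{target}\n"
        f"SELECT\n  territory_id,\n  {select_list}\n"
        f"FROM {schema}.{source}\n"
        f"GROUP BY territory_id;"
    )
```

For the territory-profitability model, the metrics might be aggregates like `sum(fare) AS revenue` and `count(*) AS trips`; the point is that the heavy lifting happens on the cluster, in SQL anyone on the team can read.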
Transformed data is now ready for the downstream processes we call exhibits: views with access provisioned as needed for BI self-service, or whatever your needs are. Again, our experts can tell you from years of experience, by gut, where the processing ought to live, but we’re happy to bench-test any part of the Pitbull framework and estimate where scaling up or scaling back will be most cost-effective or performant.
Producers, ingestors and transformers are part of a continuous integration and deployment cycle. So if you have a change to your business rules, say for the transformer SQL, then you can make your modifications, commit them to a repository, test them out and be ready to run in production within minutes. How do we accomplish that magic? That’s a story for another day.
The secret sauce is how we’ve refined our producers, ingestors, and transformers over the years. However, some of our best work is included in an open-source project called SneaQL. SneaQL lets your database folks code their own transforms and recreate the functionality of PL/SQL and T-SQL stored procedures. This is the trickiest part of migrating your applications to cloud-native databases like Redshift and columnar databases like Vertica. We’ve seen it done very wrong: replicating ‘best practices’ from terrestrial datacenters without taking advantage of the elastic nature of a well-designed cloud operation.
The Pitbull framework is how to do high-performance database workloads in the cloud, and its strengths are what allowed us to build SneaQL. SneaQL is friendly and powerful for people who know SQL, and perfectly suited for DevOps folks as well. It’s never too late to get started on the next generation of data warehousing.