The New Data Warehouse Explained
The outfit I work with, Full360 is one of those who have advanced the state of data warehousing. I have been building these applications in…
The outfit I work with, Full360 is one of those who have advanced the state of data warehousing. I have been building these applications in one way or another since the late 80s. Over the past 7 years I have been building them in the AWS cloud and it may sound cliche but it’s a quantum leap forward.
The big deal difference between the data warehousing I used to do and what I do today has five major advantages.
Horizontal scaling
Columnar tech
Multi-tier database architectures
Low cost data storage
Infrastructure as code
Horizontal Scaling
Once upon a time there was Teradata, DB2 and nobody else. VLDB was the practice and if you wanted to deal with massive amounts of data you’d essentially have to buy specialized mainframe hardware to handle it. No matter how many servers you bought, none of the standard RDBMS vendors gave you the ability to use massive hardware in parallel. Somewhere around 2005 that changed and now you have something called ‘shared-nothing scale out architecture’. That basically means there are database technologies that actually work the way an advanced common sense would dictate. Don’t build a gigantic supercomputer with 500GB of RAM, just put together 10 servers with 50GB of RAM each. And guess what, performance scales linearly as you add more servers. We’ve built one with over 100 nodes.
Columnar Tech
There are three contenders. Redshift, Vertica and Snowflake. These are heavy duty full function commercial proprietary databases that use ordinary SQL. But they also compress data and use a completely unique form of indexing that allows them to essentially handle the ‘explosion’ problem of multidimensional modeling. At least one of them, Vertica, I know to be multi-user concurrent read/write with full security. This tech has turned the age old practice of normalization on its head. If you can do it in SQL you can do it faster and better with columnar databases in SQL. And yes of course you can do it all in memory if you like.
Multi-Tier Database Architectures
Our experts can deal with realtime, batch, streamed data, whatever you have. We do integration with Hadoop, Redis, Kafka, Kinesis and a host of other technologies and formats. It’s because we built our practice without bias to any vendor. Just like a motorsports racing team. If it’s the best, we’ll put the sticker on our car. We’re not a factory team. So all of our architecture has been designed from day one to mix and match proprietary and open source technologies. We prefer not to compromise.
Low Cost Data Storage
Essentially the ability to keep practically infinite amounts of data nearline makes the way we build data warehouses more economical. Think of the designed realtime query space as a window on a timeline of data. That window can be 10 years wide or 18 months or 48 hours. Depends on how much you need. We build smart data lakes. It makes a world of difference for disaster recovery as well.
Infrastructure As Code
All of this architecture is put together with modular templates. If I need to build a 5 node Redshift cluster replicated in two different time zones with an international fleet of 20 data ETL containers pulling from 8 different data sources, I have a way to specify all that in code (and much much more). Something breaks, I redeploy. This is what cloud DevOps is all about, and we have brought these methods to data warehousing.
It’s really a whole new ball game. We’re living in interesting and exciting times. We expect that a lot of enterprises will take a while to learn how all of this can benefit them. It takes a bit of a cultural shift to accept cloud operations and understand how to think about this next-generation data-centric application framework. We are the hands-on guys that are more interested in building stuff than spewing buzzwords, which is one of the reasons mass markets don’t know about us. But ask anybody at Amazon Web Services. We’re for real.
—
Data warehouse projects fail primarily because of poor project management and vendor lock in. It’s a deadly combination. Most large companies who have technical people who are smart enough to outthink crafty vendors tend to think they’re also smart enough to engineer a state-of-the art DW. Ahh but I could write a whole book on all that could go wrong. The bottom line is that there are some really smart DevOps practitioners out there that we have dealt with and they tell us we’re doing the right thing. Feels good.