
Big Data Analysis vs Data Warehousing
Somebody asks: “What are examples to projects that must be implemented using big data analytics and not data warehousing?”
One of the things I don’t know today but will know in six months’ time is what kind of data is generated by common web frameworks. My assessment is that web developers who don’t know databases will generate files, logs, and unstructured stuff with these frameworks and push them out to HDFS and other data stores, especially those who use NoSQL databases and/or JSON or XML records of website transactions. They may be inclined to run some PySpark or MapReduce job on that mass of data and never build a formal data warehouse, but that doesn’t mean they shouldn’t.
It is almost trivial for cloud-native data warehouse engineers such as those I work with to translate JSON, or any other format of data, into a structured, high-performance database, whether the load is realtime, batch, online, or any combination. So for me, and I’m probably fairly advanced, there is never a reason not to build a data warehouse. I’ve never met a data problem that I cannot put into a disciplined, structured data warehouse. What I have seen are folks who don’t want to spend the time or money to do it right.
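To make that translation concrete, here is a minimal sketch, not anybody's production pipeline. It assumes hypothetical order records (the field names `order_id`, `user`, and `total_cents` are invented for illustration) and uses an in-memory SQLite database as a stand-in for a real warehouse engine.

```python
import json
import sqlite3

# Hypothetical website transaction records, one JSON document per line,
# roughly what a web framework might log. All field names are assumptions.
RAW_EVENTS = [
    '{"order_id": 1, "user": {"id": 42, "country": "US"}, "total_cents": 1999}',
    '{"order_id": 2, "user": {"id": 7, "country": "DE"}, "total_cents": 500}',
    '{"order_id": 3, "user": {"id": 42, "country": "US"}, "total_cents": 1250}',
]


def load_events(conn, lines):
    """Flatten nested JSON records into a fixed-schema table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, "
        "user_id INTEGER, "
        "country TEXT, "
        "total_cents INTEGER)"
    )
    rows = []
    for line in lines:
        rec = json.loads(line)
        user = rec.get("user", {})
        # Missing optional fields become NULL rather than breaking the load.
        rows.append(
            (rec["order_id"], user.get("id"), user.get("country"), rec.get("total_cents"))
        )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def revenue_by_country(conn):
    """The kind of SQL question an end user can ask once the schema exists."""
    cur = conn.execute(
        "SELECT country, SUM(total_cents) FROM orders "
        "GROUP BY country ORDER BY country"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
load_events(conn, RAW_EVENTS)
print(revenue_by_country(conn))
```

The parsing itself is a few lines. The real work, and the part people skip, is deciding on the schema and keeping it disciplined as the upstream records change.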
So in short, there are folks without the skills and folks without the money. The real question is whether or not there are folks without the requirements. The answer depends specifically on what is meant by ‘big data analytics’. From my perspective, there seems to be only one reason why somebody would not have the requirements for a data warehouse: they simply need to analyze the data once and then throw it away. There are a few use cases I can imagine.
Flight data recorder: Your plane doesn’t crash. You swap it out every six months, archive the data. Nobody needs to query it. Wipe it. Put it back in the plane.
Alarm systems: You are monitoring IoT or sensor data streams looking at peak, average or aggregate values over a day or some short period. You can do all of the metrics in memory. Memory gets wiped. Nobody needs to know later.
Telemetry: You’re watching the g-forces on a racecar or the flight path of a satellite in real-time. You’re in realtime communication with the person or persons controlling this machine or process. Nobody needs to query the data later.
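The alarm-system case can be sketched in a few lines. This is an illustration under assumptions, not a real monitoring product: the readings, the `threshold` parameter, and the `RunningStats` and `monitor` names are all hypothetical. The point is that peak, average, and alarm counts are computed in one pass and nothing is persisted.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Aggregates kept in memory only; individual readings are never stored."""
    count: int = 0
    total: float = 0.0
    peak: float = float("-inf")

    def update(self, value):
        self.count += 1
        self.total += value
        if value > self.peak:
            self.peak = value

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0


def monitor(readings, threshold):
    """Consume a stream of sensor readings once, counting threshold breaches."""
    stats = RunningStats()
    alarms = 0
    for value in readings:
        stats.update(value)
        if value > threshold:
            alarms += 1  # a real system would notify someone here
    return stats, alarms


# Hypothetical sensor values; an iterator stands in for a live stream.
stats, alarms = monitor(iter([20.0, 22.0, 30.0, 24.0]), threshold=25.0)
print(stats.peak, stats.average, alarms)
```

Once the process exits, the readings are gone, and in this use case that is fine because nobody needs to query them later.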
If you are not worried about throwing away data, then you don’t need data warehousing (which for me means data lakes too). If you need to make realtime decisions and whatever happened in history doesn’t matter, then you don’t need data warehousing. If the item or object that’s creating the data is itself an effective store of the data, then you don’t need data warehousing. If you are simply required to store data and not analyze it because of security rules, then you don’t need data warehousing.
Otherwise you do.
Now if I sound overconfident or misinformed, perhaps we differ in our definitions of ‘data warehousing’. In this case I specifically mean parsing data to put into an RDBMS or columnar DB with a fixed schema in order to serve SQL queries to end users. In other words, Business Intelligence. But I think the question was meant to say specifically not to bother with defining a fixed set of fields in a relatively fixed schema. So there’s that. But I basically bet my nickel that this is about all the stuff that squirts out the backend of websites that doesn’t easily lend itself to being queried by MySQL or some such, but a developer could write some Python or use Spark or Pig…