
An ETL / ELT Philosophy

Someone asked me which way I would generally run my data workflows: ETL or ELT.
It’s mostly an economic question. If your database tech is licensed by the amount of compute power (cores), then it’s probably more economical to use an ETL tool. That way you get the most out of the compute you’re paying for, because the database does only clean loads and queries. But if your database tech is licensed by storage, then in most cases ELT is preferable, because you can pretty much eliminate ETL tools altogether.
In most of my on-premise enterprise data career, the former was always the case. Big money was spent on high-quality hardware for the Oracle or DB2 database, where space was at a premium. Considerable money was also spent on ETL licenses (mostly Informatica). The scalability of the entire system depended on how much licensing the customer could afford, and the projects I was involved in were always pushing those limits. I could only be assured of success when “all you can eat” licenses were in play.
In my more recent cloud-native big data career, those economics do not apply. Vertica and Redshift are licensed by storage. ‘ETL’ is almost always ELT, and we do as much processing in the database tech as possible.
Understand that this is increasingly true the more data you process. Almost no commercial ETL tools scale linearly; they are not as efficient as databases. So a database will always be equal to or faster than an ETL tool for the same operation.
The smart trend is to use Python and PySpark to do data cleansing and basic transforms on large datasets. These are nominally free. But all business logic should be maintained in your database and programmed in SQL.
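To make that split concrete, here is a minimal PySpark sketch of the cleansing side only; the bucket paths and column names are hypothetical, and the business logic is assumed to live downstream in SQL inside the target database.

```python
# Minimal PySpark cleansing job: read a raw CSV extract, trim and type the
# columns, drop duplicates, and write the result back out as Parquet.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("basic-cleanse").getOrCreate()

raw = spark.read.option("header", "true").csv("s3a://example-raw-bucket/orders/")

cleansed = (
    raw.select(
        F.trim(F.col("order_id")).alias("order_id"),
        F.to_date(F.col("order_date"), "yyyy-MM-dd").alias("order_date"),
        F.col("amount").cast("decimal(18,2)").alias("amount"),
    )
    .dropDuplicates(["order_id"])
    .filter(F.col("order_id").isNotNull())
)

# Only cleansing and typing happens here; the business logic runs in SQL
# inside the database after the load.
cleansed.write.mode("overwrite").parquet("s3a://example-clean-bucket/orders/")
```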
Again, these are economic decisions. I would therefore expect some fundamental test cases and testing to be done, and an architecture that allows flexibility on both ends.
My preference is to use columnar databases or RDS in fairly large clusters with high ingestion rates from S3. I can then run producers to S3 made of small JRuby / Python / PySpark jobs that run in parallel fleets of containers. The transforms and cleansing we do are almost always XML to KVP or CSV to Parquet. This is basically bot work. The end game is to move those jobs to serverless streams.
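Here is a rough sketch of that bot work in plain Python, flattening simple XML records into key/value pairs; the element names and sample payload are invented for illustration.

```python
# Sketch of a producer's transform step: flatten simple XML records into
# key/value pairs. The element and attribute names are hypothetical.
import xml.etree.ElementTree as ET

SAMPLE = """
<orders>
  <order id="1001"><customer>acme</customer><amount>49.95</amount></order>
  <order id="1002"><customer>globex</customer><amount>12.50</amount></order>
</orders>
"""

def xml_to_kvp(xml_text):
    """Yield one dict of key/value pairs per <order> element."""
    root = ET.fromstring(xml_text)
    for order in root.findall("order"):
        record = dict(order.attrib)          # attributes become keys
        for child in order:                  # child elements become keys
            record[child.tag] = (child.text or "").strip()
        yield record

if __name__ == "__main__":
    for rec in xml_to_kvp(SAMPLE):
        print(rec)
```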
So we have a hybrid data management environment that manages streams of data in an API-like framework. These are our bot-like producers that push data to S3 as human-readable, pipe-delimited UTF-8 (except that it is encrypted and therefore compressed). We have integrated this mature architecture with AWS Glue when our customers demand it, but we don’t necessarily have to. We slide the lever over to the ETL side when our customers have already invested in Talend or Informatica and don’t want that re-architected. But mostly we make our simple producers carry all of that data movement from sources to S3.
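The producer’s last hop might look something like the sketch below: write records as pipe-delimited UTF-8, gzip them, and push to S3 with server-side encryption. The bucket, key, and field list are placeholder assumptions, and boto3 credentials and bucket policy are presumed to be configured elsewhere.

```python
# Sketch of a producer's final step: pipe-delimited UTF-8, gzipped, pushed
# to S3 with server-side encryption. Bucket, key, and field order are
# hypothetical placeholders.
import gzip
import io
import boto3

FIELDS = ["id", "customer", "amount"]

def push_records(records, bucket, key):
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for rec in records:
            line = "|".join(str(rec.get(f, "")) for f in FIELDS) + "\n"
            gz.write(line.encode("utf-8"))
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=buf.getvalue(),
        ServerSideEncryption="AES256",   # or aws:kms, per policy
        ContentEncoding="gzip",
    )

# Example: push_records(xml_to_kvp(SAMPLE), "example-landing-bucket",
#                       "orders/batch-0001.psv.gz")
```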
The end result is that we spend very little time configuring our standard bot producers and no money on ETL license fees. We do the big crunching in the target database, where stored data is right-sized and therefore static over the application lifecycle. S3 is maximized, and we welcome the new sophistication of free tools that work there, while database vendors keep increasing the efficiency and capability of their S3 ingestion. Our procedural workflows and business logic operate in SneaQL, and our developers and customers stay happy.