I have been promising to talk about what’s wrong with the design of social media, but I have to fully get my head around LLMs & NLP and understand what my new parent company will and will not do in that regard. In the meantime, here’s the gist of my thinking.
Q: Which one is better, data science or data engineering?
A: I’m a data engineer.
Basically I didn’t pay attention to data science because I knew then what everybody finally admits now: Hadoop is a crap system for data analysis. I also knew that stringing together basically random combinations of open source tools is appropriate for small companies but not for large multinationals.
So while I moved from enterprise architecture to cloud native back in 2011 (and abandoned all Windows), I became fascinated with DevOps. That discipline made it possible for me to build not only BI (front end and back end) but also observability within the stack. I have the kind of discipline to always hand-roll my own ETL and data pipelines, because I worked within tight performance and budget constraints.
Looking at the trend in data delivery, I saw companies like Tableau making all the money and creating all the jobs, with people literally spending two weeks of effort monkeying a report into perfection. I went to a Tableau conference where they presented that kind of work as brilliant, yet nobody there could explain the caching mechanism of the Tableau server or how to tune it for performance. I was stunned.
It’s not like I didn’t understand and respect advanced analytics. I built a Monte Carlo model in Excel back in 2000, before MapReduce was even invented. I also had some experience with Clementine. But all of that seemed to be a very small fraction of the market in terms of product revenue and interest. Basically, people who really knew their business didn’t need data science.
So I became convinced that data science was mostly in demand for e-commerce and for businesses with no real HUMINT: startups, social media, and website-only businesses. Their aim was customer profiling, LTV calcs, and so on. When data scientists started making more money but were all about Python rather than Golang, I remained skeptical. But I admit that’s because I was a Ruby bigot. Besides, I was mastering the AWS API using Ruby and studying the HashiCorp stack.
Not long after this, maybe 2016 or thereabouts, I went to a conference in NYC aimed at the big banks. The double-PhD nerds there were flipping out over Spark, so enthusiastic about its performance over Hadoop. Duh. Meanwhile I was already familiar with Vertica and Redshift. My company was all about Big / Fast / Wide data and I was integrating backends on AWS. I had all the scalability I wanted, but I realized that we were in separate worlds. I don’t think anybody has really benchmarked these backends for performance. We did for our own customers, but I didn’t see much at all in the industry. What I did see was that for typical replacements of OLAP models that grew larger than 5TB, Vertica and Redshift were kicking butt against Netezza, Exalytics, Teradata, and basically everything else.
Since my company was very small (2 dozen folks) we did not have the time or energy (or money) to go marketing about. We just figured out what we were good at and kicked ass. So I couldn’t prove it, but I was convinced Vertica and Redshift could beat any combination of Spark, Hadoop, Cassandra, Redis, Couch, Mongo with the simple exception of backends for websites. What does that make me? A structured data king. All you unstructured data masters know stuff I didn’t really care about.
This is ironic because I did think, specifically, that the entire enterprise of comment karma is laughable. The fact that you cannot vote anything but thumbs up means there is no analytic capability beyond what is popular. Viral does not mean better. But that’s the entire business model of all social media, and consequently of mainstream media. After all these years, people are finally starting to understand that this was an incredibly shortsighted and narrow way to manage content.
Anyway. The good news was that despite the pandemic we were able to sell our company and I’m still a data engineer. But guess what. NLP has taken off and OpenAI and other companies have mastered unstructured data with relatively small teams. I’m convinced that structured data has a future and that tuning and engineering new kinds of backends will be much more interesting and profitable for those with the skills. Aside from all that, the sharpest data scientist I know says that 80% of the job is data wrangling; the science part of data science, the other 20%, is just picking the right algorithm for the use case. For example, if you know you are measuring human anatomy, you’re dealing with data in a Gaussian distribution, which means you should probably use Z-scores for finding outliers. I honestly think that means you can do DS with one bag of clubs: fifteen to rule them all, except for some small outlying use cases. I don’t know yet, but that’s what I think.
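To make that Z-score point concrete, here is a minimal sketch in Python (the language and numpy are my picks for illustration; the height numbers and the cutoff of a couple of standard deviations are made up for the example, not from any real dataset):

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean.

    Only sensible when the data is roughly Gaussian, which is the
    assumption being made about anatomical measurements here.
    """
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

# Hypothetical example: adult heights in centimeters, with one bad record.
heights = [162, 171, 158, 175, 169, 180, 166, 410]
print(zscore_outliers(heights, threshold=2.0))  # the 410 cm entry gets flagged
```

That is roughly the whole “science” step once you know the distribution; the wrangling that produces a clean array of heights is the other 80%.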
For the past year I’ve been using DALL-E and GPT for fun. GPT-4 is the game changer. I have every expectation that what I know about backend engineering will be in high demand, specifically in terms of using DevOps to build integrations between large corporate data architectures and commercial LLMs. I would bet a nickel that it will be trivial for a half dozen companies to integrate everything that is not data wrangling into proprietary-looking IDEs / front-ends. So essentially all DS notebooks go away. I build a supervised learning set of pipelines. I build an unsupervised learning set of pipelines. What’s left? Neural nets as a service. Boom. Done.
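To show what I mean by canned supervised and unsupervised pipelines, here is a minimal sketch in Python using scikit-learn (the library, models, and parameters are my picks for illustration, not a recommendation of any particular stack):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# A "supervised learning set of pipelines": scale the features, then classify.
supervised = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# An "unsupervised learning set of pipelines": scale the features, then cluster.
unsupervised = Pipeline([
    ("scale", StandardScaler()),
    ("model", KMeans(n_clusters=5, n_init=10)),
])

# Once the wrangled data shows up as X (features) and y (labels), the remaining
# "science" is basically supervised.fit(X, y) or unsupervised.fit(X).
```

Swap the final estimator and you have a different “pipeline”; everything interesting still happens upstream, in the wrangling and in the backends that feed X and y.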
I’d much rather know what’s going on to make backends secure, performant, scalable and flexible than anything else. I try to get excited about Data Science, and I have honestly dropped Ruby for Python. I have also dropped Golang for Rust. I’m using Phind and VS Code copilots to help. Oh yeah and I really care about Kafka streams, and am mildly interested in K8s. I still respect serverless computing and of course I love CockroachDB. I haven’t figured out much about Pinecone and only know enough about GraphQL to be halfway dangerous. But that’s it. I say that the LLM vendors are going to own the Front-End market within 3 years. In 2026 people will be saying that Siri and Alexa are retarded children.
I'm not sure if I'm a "Data Designer" or a "Data Engineer". I like Engineer because it sounds more professional, but let us both admit: we are not engineers. There is no certificate in software equivalent to a Professional Engineering certificate. Your bad code loses money; a bad bridge kills people. Disregard the very few who code life support systems.
I envy you, blazing forth with leading-edge technologies. Sadly, my bread and butter is trying to explain why a client's huge COBOL program does not handle UTF-8, and some still think it's because of racism.
Luckily, programming missiles (no exit code; the end of the run is the missile going boom!) taught me to program for performance from day one. If you do not design for performance and scalability from day one, you are not going to get them. That's why, for all I hate the ancient waterfall design process, at least they understood you need some answers (not assumptions) before you start.
And with all that mention of languages and data, no love for SQL? Don't like functional languages? Not trying to start a VI vs. EMACS feud here, just curious.
Success depends on how quickly the dime drops when your gut has been wrong about something as key as this. I'd have made the same call: you don't think the baby can dance like Gregory Hines and sing like Aretha on command until you watch it happen in front of your own eyes.
I'd be interested in hearing you muse on the "riding the tiger" aspect of humanity's interface with GPT-4 and its offspring. Have a field day - I'm going to work on reeds after breakfast. They're not quite there with the synthetics yet.