Q: What is a data scientist? What do they do? Why are they important?
A: I don’t think there’s a general purpose ‘data scientist’. I think, as I’ve always said, that such a person is someone who can code and knows statistics. A data scientist is basically somebody who works at a company long enough to figure out the math around the fundamental metrics of the business model. This grand idea has now trickled down to the mainstream.
Every trader on Wall Street has been a data scientist for their whole career. They look at statistics all day, every day, and they make decisions based on them. They are the best data scientists in the world because millions of dollars depend upon them being right. The best of them, the ones that come up with new statistical models, are called ‘quants’. They are the elite of the elite. The rest of us just crunch numbers for middle management.
So Wall Street was first. Their quants and analysts got popular in the early 80s when a dude named Walter Wriston said, “Citibank will use Unix.” And it was off to the races. Then shortly thereafter a dude named Michael Bloomberg said, “Tickers are dead, long live workstations.” And the whole world of information available to institutional investors changed. (I know this is boring now, but it gets real interesting.) Then a dude named Peter Lynch became the undisputed king of mutual funds, which were a bundle of stock picks that he changed every once in a while. Everybody on the planet tried to figure out how he did it. Some of them were quants and some of them were just copycats with simple-minded algorithms. So in 1987 there was a program trading crash, because a bunch of people asked, “How can I program a computer to make money in the stock market?”, and they all came up with the same stupid answer. (Like, just buy Bitcoin, but sell if it has 20% volatility.) Basically something you could code in two lines of PHP. Wouldn’t you know, that’s just what happened: instant market crash.
A data scientist ought to be somebody who knows a particular business well enough so that they don’t write idiotic code that anybody off the street would guess. Plus, a data scientist ought to be able to look at the code that has already been written and the data it produced and discover conditions under which it should be ignored. Further, a data scientist ought to be able to find out what questions people aren’t asking and make sure they get factored into whatever decisions need to be made and build systems around that.
However, data scientists all start off as scrubs and noobs. Before I go on, let me tell you why data scientist is a hot career.
It’s the hottest career right now because, by and large, the stereotype of a computer geek who doesn’t understand business has become a self-fulfilling prophecy in terms of the people that corporations have been hiring. The emergence of the career path is really all about a shift in power to a younger set of employees and managers. Like I said, the whole thing started on Wall Street. The next adopters were banks and insurance companies, and then came aerospace and engineering companies. Telephone companies were in the realm too. The easiest way to tell which companies were serious about data was to look at their top management and count how many could actually do math in their heads and were never afraid of putting their hands on a computer keyboard and doing serious work. Obviously this included software companies.
So you can imagine that hospitals, restaurants, construction companies, marketing firms, undertakers and municipal bureaucracies were dead last in adopting a data-centric attitude and building a competent IT staff. The companies in the middle did two things. One, they outsourced all of their IT grunt work to India and Costa Rica, mostly. I know a lot of Ukrainian and Pakistani programmers too. And they paid them crap. Two, they paid high wages to the same consulting companies they used to audit their finances, like Deloitte, Price Waterhouse and KPMG, who in turn paid a few software companies to build business intelligence systems, which are the granddaddies of all of the stuff that goes by the name of data science. Now there are exceptions. Online marketing metrics paid the way for a lot of software and tech experience in that area. And Hollywood finally got it. (After Netflix kicked Blockbuster’s butt, that is. Blockbuster, idiots that they were, didn’t do any serious data science, relatively speaking. If they had, they would have killed late fees and survived.) But most companies never ever did what Wall Street and the best banks and insurance companies did, which was pay the best quants and analysts top dollar for their ability to make better decisions for the business.
I cannot tell you how many managers I have talked to in my career who said it’s not worth it to track more data. A few of them, like the guys at Safeway and Edison, had a point. They had so much data collected that it would require them to buy a whole new set of mainframes just to make one copy. Their problem was that it was just too expensive (in the early 2000s). In the late 90s, the new truth was “Disk is cheap”. After people stopped using the term “Pentium”, compute power was cheap. None of that has any influence on bad corporate culture, especially the stingy ones who figured they could keep getting away with paying dirt for offshore talent.
What guys like me knew, schlepping as we were for the likes of Deloitte, was that you needed not only to be onshore, but in the face of management on a daily basis in order to build a good business intelligence system. These are way too complicated to build at a distance, and they need to evolve in order to stay relevant.
Now people get it. Now people want data scientists.
Now businesses understand that you can actually do things like price your products using supply and demand instead of just a flat markup. Before, they never bothered to count sales by SKU and cross-reference that with inventory and orders back up the supply chain. Before, they never thought to look at customer profitability; they would just discount the big customers and charge everybody else the regular price. They had no idea of the effect on profit. Why? Integrating our inventory system with our financial system! Unthinkable! It would take a computer genius to do that. And nobody hired computer geniuses full time. They just rented them out and let the computer flunkies do the maintenance.
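To make that concrete, here is a minimal sketch in Python of the kind of cross-referencing I mean. Everything in it is made up for illustration: the customers, SKUs, prices, costs and inventory counts are hypothetical, and a real shop would pull them from its order and inventory systems rather than hard-coded lists. It also assumes the order lines cover exactly one week.

```python
from collections import defaultdict

# Hypothetical order lines for one week:
# (customer, sku, units, unit_price, unit_cost), all in whole dollars.
ORDERS = [
    ("BigCo", "A100", 500, 8, 7),     # big customer, discounted price
    ("BigCo", "B200", 200, 18, 15),
    ("SmallCo", "A100", 40, 10, 7),   # everybody else, regular price
    ("SmallCo", "B200", 10, 20, 15),
]

# Hypothetical units on hand per SKU, as an inventory system might report them.
INVENTORY = {"A100": 60, "B200": 1000}


def units_sold_by_sku(orders):
    """Count units sold per SKU."""
    sold = defaultdict(int)
    for _customer, sku, units, _price, _cost in orders:
        sold[sku] += units
    return dict(sold)


def customer_profit(orders):
    """Return {customer: (revenue, gross_profit)}."""
    totals = defaultdict(lambda: [0, 0])
    for customer, _sku, units, price, cost in orders:
        totals[customer][0] += units * price
        totals[customer][1] += units * (price - cost)
    return {customer: tuple(pair) for customer, pair in totals.items()}


def weeks_of_cover(orders, inventory):
    """Cross-reference sales with inventory: weeks of stock left at this week's pace."""
    sold = units_sold_by_sku(orders)
    return {
        sku: (on_hand / sold[sku]) if sold.get(sku) else float("inf")
        for sku, on_hand in inventory.items()
    }


if __name__ == "__main__":
    for customer, (revenue, profit) in customer_profit(ORDERS).items():
        print(f"{customer}: revenue ${revenue}, margin {profit / revenue:.1%}")
    for sku, weeks in weeks_of_cover(ORDERS, INVENTORY).items():
        print(f"{sku}: {weeks:.2f} weeks of cover")
```

With these invented numbers, the discounted big customer brings in most of the revenue at roughly half the margin of the small one, and one SKU is nearly sold out while the other sits on months of stock. Nobody sees either fact until the sales data and the inventory data land in the same place.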
So bravo to all data scientists. You are here and corporate America recognizes your value, finally. You get to work full time and you get a permanent desk and everything. Nicely done. So what are you going to be doing on a day-to-day basis? For the most part, you are going to be making up for 20 years of slacking in your respective industries.
How do you know your industry is slacking? Easy. What’s faster? Bringing up your company’s intranet and getting your company’s stock price, or Google? If you can Google more facts about your company than you can get from your intranet, it ought to embarrass the hell out of you.
Therefore, your job is to make sure everybody in your company has better and faster and more comprehensive information about the facts of your operations than anybody in the outside world. That also means that your company’s stuff can’t be datamined by hackers, but that’s not your direct job. Your people in HR should be able to know the weighted cost of every employee. Your people in shipping and logistics should know the status of every delivery on every truck to every customer. Now. Your people in IT should know the overhead cost for every system and know exactly who is logged on. Now. Your finance director should know… what finance directors need to know. And as a data scientist, you ought to know what all of that data looks like when it’s good, and what it smells like when it stinks. So unless your company’s board of directors has math PhDs, your job is going to be stinky.
So what do scrubs and noobs do? They get to know all of the data in their company’s control. They know every database, every source, every field, every piece of metadata. They know when it comes, where it comes from, how it’s transformed, where it goes, how it’s encrypted, who has the keys, how often real people actually look at it, how seriously they take it. They know every system it touches, and what went wrong the last time. But there’s a poetic way to think about it.
A data scientist is like a hiker who has been to the mountaintop and touched the snowflakes as they fall, seen the slopes and watched the gentle melt as well as the avalanches. The data scientist has watched the springs and the babbling brooks, the calm streams and the rushing rivers, rapids, waterfalls and lakes. The data scientist knows the reservoirs and dams, the tunnels and pipes, and can go to any faucet in the city and taste what animal pissed in what river upstream. The data scientist cleans the sewers, recycling what is fit for consumption and dumping what is not.
Now sometime after Peter Lynch retired, and the Euro was just being introduced, I took a plane trip and sat next to a guy who, I kid you not, looked like Patton Oswalt’s uglier, fatter big brother. Except he was more obnoxious. This guy was telling me that he was taking all of his millions of dollars earned on Wall Street (oh did I mention he was brilliant?) and putting them into Euros, which at the time were trading at somewhere around 80 cents. But he also told me that it was a proven psychological fact that investors never read the fine print of the prospectus. It’s human nature, he said, to judge a book by its cover. Something an ugly person must obviously know very well. So, by his logic, now that this study was out, Wall Street types like himself were going to sell junk stocks as if they were gold, and fund managers were going to continually get away with it because they get a cut (called a load) whether the fund nets 15% or 0.5%. He said that idiots and charlatans were running the asylum on Wall Street and suckers got what they deserved, that his money was safer in Euros. I never forgot that ugly guy. But here’s the point. Index funds got better. Quants got better. The hunches and guesses of sketchy fund managers were not winning. The mathiest quants were starting to win. The machine learning algos were starting to get smarter, and the Peter Lynches of the world were starting to look like the Bernie Madoffs of the world.
Why am I telling you this? Because it took about 30 years for Wall Street to get out of the seat-of-the-pants business and into the computer-augmented data science business. And the rest of the industries will follow suit, even the undertakers. Which means once the noob data scientists scrub all the nasty data and smoke signal management from the Fortune 500, there’s not going to be any more room for amateurs. So if you do a good job, in 40 years you’ll be out of a job. In the meantime, you’re going to spend half of your job learning data, scrubbing data, making it perfectly reliable, then getting people to actually pay attention to it and work accordingly. And then you’re going to have to fix it when things change, or when you make mistakes, and then start all over again. And then some young punks 7 years younger than you are going to coin a term like ‘business telemetry’ and hype up a new programming language you never heard of and tell you you’re obsolete. Well, that will happen every 7 years, and sometimes they’ll be right.
So let me end with a story.
At my company, we’ve built a system that is basically so fast and sophisticated that it can take every airline reservation on the planet and every change to every airline reservation on the planet, and spit out everything anyone could possibly want to know about it. As it happens, we’ve only sold it to one airline, but it pays the bills. And of course we built it to spec, meaning they didn’t want it to do everything we know it could possibly do. Hey. It’s called a budget. Deal with it. Now our cutting edge cloud-centric data mastery company manages its project metrics in… wait for it… Google Sheets, around some rather cool macros we were able to throw together. Google Sheets! Not a database. And so it just so happens that we’ve hit the limit of what Google Sheets can do. But since we all bill astronomical rates to build our data Borg, none of us has time to fix a stupid Google Sheet and move it into Postgres or something. This, my friends, is how data scientists get their start. They work their way up from the stupid to the ordinary, to the useful, to the profound. It begins and ends with clean, reliable, secure, meaningful data that people pay attention to.
So mind your embedded quotes and trailing spaces. And whatever you do, don’t pretend that R and Python are the end of all knowledge. The data is dark and full of terrors.
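For the noobs wondering what minding your embedded quotes and trailing spaces looks like in practice, here is a small Python sketch. The sample export is invented for illustration. The only library calls are `csv.reader` and `io.StringIO` from the standard library, and the sketch assumes the file follows the usual CSV convention of doubling a quote inside a quoted field.

```python
import csv
import io

# Hypothetical export: the description has an embedded quote and a comma,
# and two of the fields carry trailing spaces.
RAW = 'sku,description,warehouse\nA100 ,"6"" widget, deluxe",LAX  \n'


def clean_rows(text):
    """Parse CSV text with the csv module, then strip stray whitespace from every field."""
    reader = csv.reader(io.StringIO(text))
    return [[field.strip() for field in row] for row in reader]


# What a careless split on commas does to the same data line.
naive = RAW.splitlines()[1].split(",")

rows = clean_rows(RAW)

if __name__ == "__main__":
    print("naive:  ", naive)
    print("cleaned:", rows[1])
```

The careless split breaks the description in two at its embedded comma and leaves the quote characters and trailing spaces in place, so "A100 " and "A100" would later count as two different SKUs. The parser-based version returns three clean fields.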