The Fluffy Side of Data Science

While making a career change from trading to Data Science, I frantically read machine learning textbooks, scoured Kaggle forums, and attended meet-ups on Spark, h2o.ai, and AWS. But after five months of actually being a data scientist on a digital product team at Capital One, I realize I spend about half my time using a softer set of skills. So, while my imposter syndrome is in remission, here's the article I wish I had read a year ago...


Do stuff, don't just do Data Science

I actually began writing something titled What is Data Science but it quickly got so obnoxious and buzzword-y that I Ctrl-A-Deleted and regrouped. Even capitalizing Data and Science seems overboard. Data science should be used to accomplish something, not glorified as an end in itself. It's more like carpentry or plumbing in that sense, and less like Music Theory or Cosmology.

If you put data science lipstick on a pig of a digital product...

It's still a digital product. Anything that's made great with data science should probably be halfway decent on its own. Think of Spotify, Airbnb, Amazon, Google, or Facebook: all of these would still be incredibly popular and elegant solutions were it not for machine learning or the invention of Hadoop. It can be tempting for product managers and even data scientists to think let's throw a predictive algorithm at this instead of really introspecting about whether their product creates value for the customer. Which reminds me of the phrase...

Quarter-inch holes, not quarter-inch drill bits

Basically, ditto that. I love getting into a state of flow while hammering out a Python module. Or finding the faintest justification for using TensorFlow. But these are both just tools. When you really focus on the job-to-be-done, it often becomes clear that a simpler, perhaps rules-based solution can allow for quicker iteration and learning than a complicated algorithm. So don't forget to...
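
To make that concrete, here's a minimal sketch of what a rules-based first pass might look like. Everything in it is hypothetical: the churn-alert scenario, the column names, and the thresholds are invented for illustration, not pulled from any real product.

    import pandas as pd

    def flag_at_risk(customers: pd.DataFrame) -> pd.Series:
        """Rules-based baseline: flag customers who look likely to churn.

        Each rule encodes a piece of qualitative knowledge. It ships in an
        afternoon, it's trivial to explain, and every false alarm teaches
        you something concrete about which rule to tweak next.
        """
        inactive = customers["days_since_last_login"] > 30
        complained = customers["support_tickets_last_90d"] >= 2
        downgraded = customers["plan_change"] == "downgrade"
        return inactive & (complained | downgraded)

    # Only once the rules stop improving is it worth paying for a model:
    # customers["at_risk"] = flag_at_risk(customers)

The point isn't that rules beat models; it's that a baseline like this gets you a feedback loop on day one, which is usually where the learning happens.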

Use your Human Level General Intelligence

Until Siri launches a nuclear weapon to resolve a scheduling conflict or Pandora kills a musician to whom we've given a thumbs-down, it's worth remembering that we humans are still the smart ones. Before sending your obedient CPU on an algorithmic goose chase, you'll want to bake in as much qualitative knowledge and human intuition as you can about the problem at hand.

This can mean interviewing experts or reading market research. And focus on engineering features that you suspect are valuable instead of building whichever ones are easiest to make. This use of common sense and industry expertise to frame a problem is often more of a differentiator than choosing XGBoost over random forests. And don't forget to say no...

lambda x: print("Let's not do {}".format(x))

This is counterintuitive, but the most valuable thing you might do as a data scientist at your company is recommend against developing a model. Just like a good doctor would examine the abdomen before reaching for a scalpel, a good data scientist should think of cheaper, less invasive ways to solve a problem. And when choosing among viable solutions, you'll have to communicate the...

Confusion matrix

Yes, you first learn this concept in technical machine learning books, but, in practice, it is a fluffy skill. You want to train yourself and your less technical colleagues to think less about predicting and more about ranking. After all, a great model may flag a rare event with only 1% precision but still help put the most likely cases at the front of the line. The idea of a predictive model often sounds like a magical black box that spits out 0s and 1s. This obfuscates the reality that, behind the scenes somewhere, a threshold is turning probabilities into discrete labels. Where to draw that line between Type I and Type II errors is very much a business decision rather than a mathematical one. But no matter how much you emphasize this tradeoff, you will be judged against the promises of Westworld and AI journalism. So to stay informed...
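
Here is a small, self-contained sketch of that point using scikit-learn on synthetic data (nothing from a real project): the same probability scores yield very different confusion matrices depending on where you set the threshold, while the ranking view stays useful throughout.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    # Synthetic, imbalanced data: the "rare event" is roughly 1% of rows.
    X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # ranked scores, not verdicts

    # The same scores produce very different confusion matrices depending on
    # where the business chooses to trade Type I errors against Type II errors.
    for threshold in (0.5, 0.1, 0.02):
        tn, fp, fn, tp = confusion_matrix(y_test, (proba >= threshold).astype(int)).ravel()
        print(f"threshold={threshold}: tn={tn} fp={fp} fn={fn} tp={tp}")

    # The ranking view: how many rare events land in the top 1% of scores?
    top = np.argsort(proba)[::-1][: len(proba) // 100]
    print("rare events in the top 1% of scores:", y_test[top].sum(), "of", y_test.sum())

Sliding the threshold just moves errors between the false-positive and false-negative cells; no amount of math tells you which cell is more expensive for the business.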

Google everything

Every time you overhear the name of a mathematical technique, a vendor platform, an oft-mentioned startup, or even a programming language that you're pretty sure is not in your job description, make the small investment in yourself and look it up. Or add it to a list for later. When I was looking for my first data science job, at some point I did a massive brain-dump of all the keywords I had heard at meet-ups or read in job descriptions. It was daunting until I began googling terms one at a time. The reality was that many big data tools naturally fit into a few different buckets. A helpful trick for remembering tools you haven't actually used yourself is to be skeptical and ask yourself, when would I use this instead of plain old Pandas or scikit-learn? And the more you learn, the more you can...

Wipe your own butt

Traditional management science might have it that a product manager has the idea, a data analyst gets the data, a data scientist makes the model, and then the engineers deploy it. But, in my experience, it is really difficult to work with any data that you haven't pulled yourself. And it is really hard to be creative while executing someone else's framing of a problem. If you weren't in the room for that Aha moment, or a colleague offers to send you the data "to save time", at some point you'll hit a roadblock and start flailing, and you'll question the merits of the whole project unless you've spent time independently convincing yourself of its worth. You have your job because you're relatively better at it than others, not because you're helpless at everything else.


If you have any thoughts or feedback, I'd love to know. Good luck!