Tuesday, November 12, 2013

What is a Data Scientist?

Today on Quora, someone asked, "What are some software and skills that every Data Scientist should know?" I wrote the following as a response, reflecting on my current position and the role I play.

I started adding Post-it notes with sub-titles to the name/title tag on my cube, as a sort of joke about the question, "What is a Data Scientist?"

Here's the current list:
  • [Client] Analyst
  • [Product A] Analyst
  • Financial Analyst
  • Sales Analyst
  • Contract Analyst
  • Quality Assurance Analyst
  • Call Center Analyst
  • Data Surgeon (aka, data mining with the intent to figure out what's wrong)
  • Data Diagnostician (alternative to the above, maybe with no details to examine)
  • [Product B] Analyst
  • Database Developer
  • Bug Finder (as in software bugs)
So it would appear from this list that there isn't a lot of data science going on. And that's partially true.

Each of our clients has its own relational database, so we run "meta-queries" that access them one by one in order to answer a question. That's sort of data-science-like. Eventually, though, we're going to have one master database with all clients that will cascade into the individual databases, so our "meta-queries" will be obsolete.
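To make the "meta-query" idea concrete, here is a minimal sketch in Python. It is not our actual tooling: the in-memory SQLite databases, the `orders` table, and the names `run_meta_query` and `make_demo_db` are all made up for illustration. The point is only the shape of the loop: run the same SQL against each client database in turn and tag every row with the client it came from.

```python
import sqlite3


def run_meta_query(connections, sql, params=()):
    """Run the same SQL against each client database, one by one.

    `connections` maps a client name to an open DB-API connection.
    Each result row is prefixed with the client name it came from.
    """
    results = []
    for client, conn in connections.items():
        for row in conn.execute(sql, params):
            results.append((client,) + tuple(row))
    return results


def make_demo_db(order_totals):
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    conn.executemany(
        "INSERT INTO orders (total) VALUES (?)",
        [(t,) for t in order_totals],
    )
    conn.commit()
    return conn


# Two stand-in client databases with the same schema.
connections = {
    "client_a": make_demo_db([10.0, 20.0]),
    "client_b": make_demo_db([5.0]),
}

rows = run_meta_query(connections, "SELECT COUNT(*), SUM(total) FROM orders")
for row in rows:
    print(row)
```

With a single master database, the loop would collapse into one query grouped by a client column, which is why the "meta-queries" eventually go away.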

We deal with a lot of "big data" too, but it's usually not that big of a deal. Even with relational databases, it's okay. Some queries may take a little longer (30-60 minutes), but that's rare. We have some machine learning tasks that pull in massive training data sets, so at that point you have to be more careful about "big data" problems like running out of RAM or disk space. But it can be handled, and rather simply.
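As one example of handling it "rather simply", a common way to avoid running out of RAM is to stream a large file one record at a time instead of loading it all at once. The sketch below assumes a CSV export with hypothetical `client` and `amount` columns; the function name `sum_by_key` is made up for illustration, and a small in-memory string stands in for the large file.

```python
import csv
import io
from collections import defaultdict


def sum_by_key(lines, key_field, value_field):
    """Aggregate a CSV stream row by row.

    Only the running totals are kept in memory, never the full data set.
    `lines` can be any iterable of text lines, such as an open file.
    """
    totals = defaultdict(float)
    for record in csv.DictReader(lines):
        totals[record[key_field]] += float(record[value_field])
    return dict(totals)


# Tiny stand-in for a large export; in practice this would be an open file.
demo = io.StringIO("client,amount\na,1.5\nb,2.0\na,3.5\n")
totals = sum_by_key(demo, "client", "amount")
print(totals)
```

Memory use here grows with the number of distinct keys rather than the number of rows, which is usually the difference between a job that finishes and one that gets killed.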

What I really wish I could do more of is machine learning. In the year I've been a Data Scientist, I've accumulated several ideas that would enhance products or help us make better decisions, but these other tasks take up most of my day.

In the end, I write a lot of SQL, use the Linux command line moderately, and report on data in Excel spreadsheets. I use Python occasionally to write scripts. And I'm always learning something new (new SQL techniques, Python libraries, Linux command line tools, etc.).