A new survey of data scientists found that they spend most of their time massaging rather than mining or modeling data. Still, most are happy with having the sexiest job of the 21st century. The survey of about 80 data scientists was conducted for the second year in a row by CrowdFlower, provider of a “data enrichment” platform for data scientists. Here are the highlights:
Data preparation accounts for about 80% of the work of data scientists
Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time on preparing and managing data for analysis.
76% of data scientists view data preparation as the least enjoyable part of their work
57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work and 19% say this about collecting data sets.
These findings are yet another confirmation of a very widely known and lamented fact of the data scientist’s work experience. In 2009, data scientist Mike Driscoll popularized the term “data munging,” describing the “painful process of cleaning, parsing, and proofing one’s data” as one of the three sexy skills of data geeks.
In 2013, Josh Wills (then Director of Data Science at Cloudera, now Director of Data Engineering at Slack) told Technology Review, “I’m a data janitor. That’s the sexiest job of the 21st century. It’s very flattering, but it’s also a little baffling.” And Big Data Borat tweeted that “Data Science is 99% preparation, 1% misinterpretation.”
Given that the median annual base salary in the U.S. of the hard-to-find and much-in-demand data scientists was $104,000 last year, a number of startups have focused on automating a solution to this essential but boring task. In his 2016 Big Data Landscape, Matt Turck lists a number of them in the “data transformation” box plus companies (such as CrowdFlower) that are addressing this need with crowdsourcing (both in the “infrastructure” section).
Investing in solutions to messy data will continue, and IDC has predicted that through 2020, spending on self-service visual discovery and data preparation tools will grow 2.5x faster than traditional IT-controlled tools for similar functionality. Following the same trend, Forrester predicted that in 2016, machine learning will begin to replace manual “data wrangling” (another endearing term like “data munging”) and data governance dirty work, and that vendors will market these solutions as a way to make data ingestion, preparation, and discovery quicker.
Indeed, 55% of the respondents to the CrowdFlower survey agreed with Forrester, predicting that over the next year machine learning will have (or will continue to have) a significant importance for their companies and their departments.
35% of data scientists gave their job the highest mark possible.
Only 14% of data scientists felt they were being held back by their tools.
What data scientists want most is more support and direction from their management or executive team (27%).
Finally, CrowdFlower looked at nearly 4,000 data science job postings on LinkedIn to find out what skills organizations wanted from their new hires. Last year they found that the skills most in demand were programming and coding. This year, they looked for more specific data science tools that are mentioned in job postings.
Here are the top 10 in-demand skills for data scientists:
I’m sure it is relatively easy for employers to test prospective data scientists for their proficiency in any of the above tools and data platforms. But how do they test for their efficiency in removing commas?