
Data Janitor's Point of View - Strata + Hadoop World Recap

Siemens Dreamer

You have heard of data scientists and data wrangling, but a data janitor? Let’s get that out of the way first. As you know, a data scientist supports data-driven decision making (discoveries, insights, and so on). Much of that job rests on handcrafted work that we could call data wrangling, data munging, or data janitor work. I feel the term “janitor” fits perfectly (this is not meant to demean the work). From gathering to cleaning and organizing disparate data, I have done it all in my career.


I just want to share a comment from Jeffrey Heer, cofounder of Trifacta, to show how important this janitor work is: “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.” Even if such an algorithm were possible, the data would still need to be cleaned up, converted, and unified before the algorithm could work across different data sets.


I am not going to summarize the whole conference; instead, I will share what I liked and my view of where things are going. This year’s Strata + Hadoop World focus (streaming, machine learning, data science, IoT, and scalable architecture) was on the same path as last year, and I expect it to stay there for the next few years. In my opinion, the Hadoop ecosystem is just getting started and has a lot of room to grow and mature.







Mike Olson, CSO and Chairman of Cloudera, opened the keynote address on Wednesday morning. He spoke about the machine learning renaissance and highlighted how machine learning went from rarely used to widespread because of growing data volumes and falling computation costs. The two trends that drove big data to success are also driving artificial intelligence and machine learning forward. He also discussed how Spark and its new capabilities allow people to deploy scale-out machine learning applications very quickly.

He introduced the Cloudera Data Science Workbench (self-service data science for the enterprise) and highlighted the factors that help accelerate data science from development to production. He used Holmes and Watson to illustrate what a data scientist should do: in the Sherlock Holmes stories, “Watson tells the story and Holmes collects the data, if you want to do data science, be like Holmes and don’t be like Watson”. He ended with what we should look forward to: hardware companies are making investments to support the software algorithms that will make artificial intelligence, machine learning, and advanced analytics even more feasible.

Two other machine learning use cases were interesting: one from Coursera and the other from MemSQL. Daphne Koller, Coursera cofounder, shared how machine learning was applied to scale education. She explained that implicit features (country, economic classification, OS, URL, referring website) are not enough on their own to apply machine learning, and that education users love to be understood. That willingness let Coursera ask learners more questions. Coursera then used a two-layer classifier: it first identifies which cluster a user belongs to and then suggests courses that people in that cluster would be interested in.
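To make the two-layer idea concrete, here is a minimal sketch in plain Python. It is not Coursera’s implementation: the features, cluster names, centroids, and enrollment lists are all made up for illustration, and a nearest-centroid lookup stands in for whatever classifier they actually used.

```python
from collections import Counter
from math import dist

# Hypothetical cluster centroids over two made-up features
# (interest in programming, interest in business), each in [0, 1].
CENTROIDS = {
    "engineers": (0.9, 0.2),
    "managers": (0.2, 0.9),
}

# Hypothetical enrollment history per cluster.
ENROLLMENTS = {
    "engineers": ["Algorithms", "Machine Learning", "Algorithms", "Databases"],
    "managers": ["Negotiation", "Finance", "Negotiation", "Leadership"],
}


def assign_cluster(features):
    """Layer 1: pick the cluster whose centroid is closest to the user."""
    return min(CENTROIDS, key=lambda name: dist(CENTROIDS[name], features))


def suggest_courses(features, top_n=2):
    """Layer 2: suggest the most common courses within the user's cluster."""
    cluster = assign_cluster(features)
    counts = Counter(ENROLLMENTS[cluster])
    return cluster, [course for course, _ in counts.most_common(top_n)]
```

Calling `suggest_courses((0.8, 0.1))` first places the user in the closest cluster and then ranks courses only within that cluster, which is the point of splitting the problem into two layers.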






Eric Frenkiel, MemSQL CEO, shared how they use machine learning and fast learning to combat child sex trafficking (application, data science, and real-time machine learning). He mentioned that they use machine learning for face recognition with a “point map of face – 4996 points, classification, de-duplication and matching”. To improve processing time, they used a vector dot product and cut the time from 20 minutes to 200 milliseconds. They worked with Thorn (Digital Defenders of Children) and saved 2000 children in 2016.
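For readers wondering why a dot product helps with matching, here is a small sketch in plain Python. It assumes each face is already reduced to a numeric vector and uses normalized dot products (cosine similarity) to find the closest match. The four-dimensional vectors and the names `face_a` and `face_b` are invented for illustration; this is not MemSQL’s or Thorn’s actual pipeline.

```python
from math import sqrt


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def best_match(query, gallery):
    """Return (name, score) of the gallery vector most similar to the query."""
    q = normalize(query)
    scored = ((name, dot(q, normalize(vec))) for name, vec in gallery.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical 4-dimensional "face vectors"; the talk mentioned far more points.
gallery = {
    "face_a": [1.0, 0.0, 2.0, 1.0],
    "face_b": [0.0, 3.0, 0.0, 1.0],
}
name, score = best_match([2.0, 0.1, 4.0, 2.0], gallery)
```

Each comparison is just a sum of multiplications, which is the kind of operation databases and hardware can run in bulk very quickly.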






Last but not least was “The Evolution of Massive-Scale Data Processing” by Tyler Akidau from Google. His talk was very interesting and covered the evolution of data processing tools over the last 14 years.







What did these tools bring to the table?






It was interesting to see more implemented solutions using Spark Streaming, Kafka, and real-time ingestion. Beam is just getting started, and these are a few things it brings to the table: unified data processing, a choice of SDKs (Java, Python, and other languages), a choice of runners (Java & Python, Apache Spark, Flink, Dataflow, Apex), and portable, out-of-order data processing.
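Out-of-order processing is easier to picture with a small example. The sketch below is plain Python, not the Beam API: it groups events into fixed windows by the time each event happened rather than the time it arrived, so a late event still lands in the right window. The 60-second window size and the sample events are assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed window size


def window_start(event_time):
    """Map an event timestamp (seconds) to the start of its fixed window."""
    return event_time - (event_time % WINDOW_SECONDS)


def sum_by_event_time_window(events):
    """Sum values per window using event time, regardless of arrival order."""
    totals = defaultdict(int)
    for event_time, value in events:
        totals[window_start(event_time)] += value
    return dict(totals)


# (event_time_seconds, value) pairs listed in arrival order; the third
# event arrives late but still belongs to the first window.
arrivals = [(5, 1), (65, 10), (50, 2), (70, 20)]
totals = sum_by_event_time_window(arrivals)
```

A real streaming engine also has to decide how long to wait for late events before emitting a window, which this sketch leaves out.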

Again, Beam is the new kid on the block with a lot of promise, and combining it with Spark and Kafka could help companies achieve their artificial intelligence, machine learning, advanced analytics, and Internet of Things dreams. It was a great conference covering a ton of topics. I am looking forward to next year’s conference.