
A Look Back at Strata + Hadoop World from a Data Scientist

Siemens Dreamer

This year’s Strata + Hadoop World conference was special because it marks the 10th year of the Apache Hadoop project. Doug Cutting, the creator of Hadoop, gave a great talk in the opening keynote on Wednesday morning. He described how Hadoop grew from an idea into a technology ecosystem that supports multi-billion-dollar businesses today, and he shared his vision for the next 10 years of Hadoop technology. It is very exciting to be part of this journey as we watch the Hadoop world change and grow rapidly and learn the new technologies that come out every year, such as Storm, Spark, Impala, and YARN.


For me, this year’s Strata was all about machine learning: every talk we attended had some connection to it. Among the different machine learning approaches, deep learning (neural networks) is definitely the one that shone the most. Many industries have embraced machine learning, such as healthcare, which uses it to detect early signs of cancer, and cyber security, which uses it to detect abnormal activity. One of the most interesting sessions was from Stitch Fix, a company that uses machine learning to recommend real clothes to its customers. The really interesting part is how they combine machine learning with human intelligence to perform a traditionally very challenging task: shopping for clothes for human beings. I would recommend that ladies check them out; right now they don’t support men’s clothes shopping.




Within machine learning, two topics were especially popular at this year’s Strata conference: text mining and computer vision. Many companies (LinkedIn, Facebook) are starting to investigate how to build text mining analytics platforms that help them better understand their businesses, their product releases and real feedback on their products. Viv is a startup built by the creator of Siri (Siri itself was sold to Apple a few years ago). They envision that the next big thing will be an intelligent personal assistant, where the computer has a better understanding of human voice and text. Text mining is one of the key components if this is to become a reality.


Computer vision is emerging as the next hot topic, gaining attention from face recognition (Facebook), Google’s 3D maps, self-driving vehicles and virtual reality. As computer vision technology matures, more industries are looking into how it can help improve their businesses and solve their problems. A colleague working in this field speculates that even though self-driving technology is still in its infancy, fully developed self-driving cars will be ready for release to the public within five years. Right now the challenging part is driving in local areas, where traffic lights, pedestrians, complicated roads and a constantly changing environment make self-driving cars very difficult to develop. One observation from riding as a passenger in a current self-driving car is the uneasiness when it drives on surface roads: it brakes suddenly and frequently in reaction to potentially dangerous situations around the car.


On the technology side, Spark continues to be the star. With SparkNet and Spark 2.0 coming out this year, the community using Spark should expect a lot to happen. SparkNet is an open source project built on Spark that leverages TensorFlow (Google’s recently released machine learning library) and Caffe (the Berkeley deep learning package). Real-time capability is also one of the reasons people are looking into Spark. By combining Spark Streaming and Kafka, real-time ingestion and real-time consumption become possible. Many IoT (Internet of Things) companies and projects are using this architecture to achieve real-time analytics.
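
To make the ingestion and consumption split a bit more concrete, here is a minimal sketch in plain Python. It is only a simulation under stated assumptions and does not use the Spark or Kafka APIs: `InMemoryTopic` is a hypothetical stand-in for a Kafka topic (much simplified, since a real topic is a persistent log), `micro_batches` loosely mimics how a micro-batch engine such as Spark Streaming pulls small groups of records, and the `pump-1`/`pump-2` temperature readings are made-up IoT data.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterator, List, Tuple

# Hypothetical IoT payload: (device_id, temperature reading).
Reading = Tuple[str, float]


class InMemoryTopic:
    """Simplified stand-in for a Kafka topic: a FIFO buffer of messages."""

    def __init__(self) -> None:
        self._buffer: Deque[Reading] = deque()

    def publish(self, reading: Reading) -> None:
        """Ingestion side: a device (producer) appends a reading."""
        self._buffer.append(reading)

    def poll(self, max_messages: int) -> List[Reading]:
        """Consumption side: return up to max_messages readings, oldest first."""
        batch: List[Reading] = []
        while self._buffer and len(batch) < max_messages:
            batch.append(self._buffer.popleft())
        return batch


def micro_batches(topic: InMemoryTopic, batch_size: int) -> Iterator[List[Reading]]:
    """Drain the topic in small batches until no readings are pending."""
    while True:
        batch = topic.poll(batch_size)
        if not batch:
            return
        yield batch


def running_averages(topic: InMemoryTopic, batch_size: int) -> Dict[str, float]:
    """Update per-device totals batch by batch and return the averages."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for batch in micro_batches(topic, batch_size):
        for device_id, value in batch:
            totals[device_id] += value
            counts[device_id] += 1
    return {device_id: totals[device_id] / counts[device_id] for device_id in totals}


topic = InMemoryTopic()
for reading in [
    ("pump-1", 20.0),
    ("pump-2", 30.0),
    ("pump-1", 22.0),
    ("pump-2", 34.0),
    ("pump-1", 24.0),
]:
    topic.publish(reading)

averages = running_averages(topic, batch_size=2)
print(averages)
```

In a real deployment the buffer would be a Kafka topic fed continuously by devices and the loop would be a long-running Spark Streaming job, but the shape is the same: producers only publish, and the analytics side only consumes small batches and updates its running results.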



Pengcheng Liu, PhD

Advanced Research Engineer (Data Science & Machine Learning)

Siemens PLM Software, Cloud Services, Omneo