Strata San Jose this week celebrated the 10th birthday of Hadoop. It is hard to believe that 10 years have passed, and harder still to take in how much big data technology has changed in that time. Hadoop has played a major role in bringing about the numerous open source tools, libraries and commercial products available today. Check out Doug Cutting's take on the 10 years of Hadoop.
The San Jose conference is often known for its focus on machine learning and data science, and this year was no different. We saw how machine learning and big data come together to solve challenges such as helping paraplegics control robotic limbs by thought, using sensors connected to the brain to help researchers understand epilepsy, and tracking products with RFID tags to locate merchandise in stores.
The buzz at this week's event included the innovations in Spark 2.0. In one example, a join of a 1-billion-row table to a 1,000-row table that took 76 seconds in Spark 1.6 ran in 3.7 seconds in Spark 2.0, roughly a 20x speedup.
Berkeley AMPLab continues to be a center for innovation, bringing even more new technologies into the Berkeley Data Analytics Stack (BDAS) with the introduction of:
• Succinct, a fast, memory-efficient data store for interactive SQL
• Velox, which provides scalable model management
• KeystoneML, which simplifies the construction of large-scale machine learning pipelines
Streaming data ingestion was another popular subject. Many sessions showed ways to use Kafka and Spark to get data from the source to the query engine in near real time. Using Spark Streaming coupled with Kafka, developers can build streaming pipelines with guaranteed delivery and high throughput.
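The "guaranteed delivery" idea can be illustrated without a cluster. The sketch below is plain Python, not the Kafka or Spark Streaming API: `Log`, `consume`, `run_with_one_crash` and `fail_at` are hypothetical names made up for illustration. It assumes an at-least-once scheme in which the consumer commits its offset only after a record has been processed, so a crash causes a replay rather than a loss.

```python
class Log:
    """Toy append-only log standing in for a single topic partition."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0  # offset of the next record to deliver


def consume(log, sink, fail_at=None):
    """Process records from the last committed offset onward.

    The offset is committed only after the record reaches the sink, so a
    failure between the two steps replays that record on the next run.
    `fail_at` (hypothetical) simulates a crash after writing that offset.
    """
    offset = log.committed
    while offset < len(log.records):
        sink.append(log.records[offset])
        if offset == fail_at:
            raise RuntimeError("simulated crash before commit")
        log.committed = offset + 1
        offset += 1


def run_with_one_crash(records, fail_at):
    """Run the consumer, crash once at `fail_at`, then restart it."""
    log, sink = Log(records), []
    try:
        consume(log, sink, fail_at=fail_at)
    except RuntimeError:
        pass  # the restart below resumes from the committed offset
    consume(log, sink)
    return sink
```

Calling `run_with_one_crash(["a", "b", "c"], fail_at=1)` delivers every record, but the record that was in flight at the crash shows up twice. Nothing is lost; any duplicate has to be absorbed further down the pipeline.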
Kudu, an open source project that is currently incubating, was also a hot topic at the event. It provides a mutable data store with interactive query performance. This is a very promising storage technology: it can ensure exactly-once delivery of data through the pipeline, and it provides in-place data updates, which avoids the need to rebuild data partitions when data is modified. Kudu also simplifies data management, including optimization of the underlying storage, by automatically managing file sizes, which removes much of the complexity of providing real-time ad hoc queries.
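One way to see why a mutable, keyed store helps here is a small sketch in plain Python. It does not use the Kudu client API; `KeyedStore`, `upsert` and `load` are hypothetical names, and the assumption is that rows are addressed by a primary key, so writing the same row twice leaves a single copy. Under that assumption a replayed record is harmless, and a correction touches one row instead of rebuilding a partition.

```python
class KeyedStore:
    """Toy mutable table keyed by primary key (not the Kudu API)."""

    def __init__(self):
        self.rows = {}

    def upsert(self, key, value):
        # Insert the row, or overwrite it in place if the key already exists.
        self.rows[key] = value


def load(store, records):
    """Apply (key, value) records in order; replays overwrite themselves."""
    for key, value in records:
        store.upsert(key, value)
    return store.rows


# A stream in which the record for key 2 is delivered twice, followed by
# a later correction to key 1.
stream = [(1, "pending"), (2, "shipped"), (2, "shipped"), (1, "delivered")]
table = load(KeyedStore(), stream)
```

After the load, `table` holds one row per key with the latest value, even though one record arrived twice.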
One thing I love about Strata is seeing the different use cases in which big data and machine learning are changing the world for the better, including projects like genome mapping for cancer research and sensors connected to the brain that can look for causes of epilepsy. Researchers reach out to the developer and data science communities for help in solving these complex problems through collaboration and contribution to the open source tools that drive these innovations and make them scalable.
O'Reilly also entertained us with comedian Paula Poundstone, who brought some humor to the conference, improvising and playing off the audience with witty comments that had the audience ROFLing. "Map reduce...am I the only one here that doesn't know what that means?" LOL. Thanks Paula!
It was an excellent conference, with great speakers covering a myriad of topics that included deep technical discussions, complex machine learning, cybersecurity, and lessons learned. It will be an exciting year for Hadoop, machine learning and big data. I for one am eager to see what this next year will bring to the community.