The Omneo team embarked on a journey into Big Data to solve a complex problem: bringing product quality intelligence to manufacturers in a complex supply chain. The challenge encompassed several primary aspects: multiple distributed data sources, heterogeneous data sets, the lack of a consistent or common industry format for data, application silos, no unified contextualization of data sets, large volume, and high velocity. To overcome these challenges, a new strategy was needed.
Use Case Analysis
In order to properly design a big data system, an assessment of the use cases is required. A collection of use cases was created to address product quality issues in a complex supply chain. Each use case was analyzed to identify the data required to address the use case, as well as access and query patterns so that the systems could be optimized for the intended usage.
Some use cases identified include:
Users need the ability to look through their data to find quick insights
Users need the ability to review the product data in the context of the product lifecycle
Users need the ability to define and calculate the KPIs from the data
KPIs must be configurable and flexible for the customer’s needs
A custom reporting solution for every customer is not viable
Users need to perform statistical analysis on measurement data at very large scale
Root cause analysis
Identify key contributors to product failures & correlate to the source
Users would like to automate the discovery of problems that are impacting KPIs
Predict failures based on historical trends in the data
Move to planned maintenance versus unplanned maintenance
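To make the "configurable and flexible KPIs" requirement concrete, a KPI can be expressed as data (a filter plus an aggregation) rather than hard-coded reporting logic. The sketch below is purely illustrative; the field names and formula are hypothetical, not Omneo's actual definitions.

```python
# A minimal sketch of configurable KPIs: each KPI is a data structure
# (name, filter, pass criterion) instead of custom reporting code, so each
# customer can define their own. Field names like "site" and "test_result"
# are hypothetical examples.

def evaluate_kpi(records, kpi):
    """Apply a KPI definition (record filter + pass ratio) to a record set."""
    selected = [r for r in records
                if all(r.get(k) == v for k, v in kpi["filter"].items())]
    if not selected:
        return 0.0
    passed = sum(1 for r in selected if r[kpi["field"]] == kpi["pass_value"])
    return passed / len(selected)

first_pass_yield = {
    "name": "First Pass Yield",
    "filter": {"site": "factory_a"},   # restrict the KPI to one site
    "field": "test_result",
    "pass_value": "PASS",
}

records = [
    {"site": "factory_a", "test_result": "PASS"},
    {"site": "factory_a", "test_result": "FAIL"},
    {"site": "factory_b", "test_result": "PASS"},
]
print(evaluate_kpi(records, first_pass_yield))  # 1 of 2 factory_a records pass -> 0.5
```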
Based on an analysis of the use cases, a combination of tools was required to provide these capabilities. These tools include: ETL, data warehousing, analytics, search, data mining, and machine learning.
The evaluation process began with the identification of the various technologies available to solve the problem at scale. Options included the existing RDBMS, traditional analytic database technology, open source technologies such as NoSQL systems, and Hadoop-based platforms.
A data set was created to assess each option's capabilities in the following areas:
Scalable solution for loading data to the system
Data transformation and cleansing from various formats including semi-structured and unstructured log data
SQL query engines for fast reporting and analysis
Search options for knowledge discovery and exploration
Massively parallel processing (MPP) for distributed machine learning algorithms
The Incumbent RDBMS Solution
An existing smaller solution based on a traditional RDBMS OLTP database design was reviewed to determine whether it could be enhanced to operate at scale. After thorough testing, it was determined that this solution was unable to scale to meet the needs. Data ingestion became a bottleneck, as data had to be transformed and loaded into the RDBMS's highly normalized schema. Ad-hoc analytic queries were extremely slow and unable to support an interactive user experience, especially when executing joins. Scheduled dashboard generation was taking so long that the process could not complete before the next scheduled dashboard update. To resolve these challenges, technology options were identified and the performance of each was assessed to determine the best fit.
Traditional Analytic Databases
Several traditional analytic databases were evaluated. These SQL-based options would have provided a simple migration path for existing legacy applications. However, after evaluation, these solutions were not cost-effective for an emerging software platform and application suite. A lower-cost option that could scale as the company grew, as well as support multiple use cases, was necessary. As a result, the focus shifted to open source technologies such as document stores, columnar databases, NoSQL options, and Hadoop-based platforms.
Through investigation and observation, the team determined that different workloads may require different tools, each optimized for specific needs. The Hadoop ecosystem provided an attractive option: a linearly scalable big data platform with a variety of tools to meet the needs of the use cases and workloads. There were numerous options available; however, Cloudera's open source distribution, CDH, was chosen because it offered a simple mechanism to get up and running quickly and a unified environment for managing the cluster.
Hadoop Operations & Management
Using Hadoop involved a steep learning curve as well as significant management and operations requirements. This is exacerbated when the software solution is intended to serve applications to end users. Managing the cluster, security, monitoring, upgrades, and general operations becomes challenging as the environment grows complex. As a result, having the management tooling, security, and support that an enterprise platform provides was important to speed development, flatten the learning curve, and reduce operational challenges. The Cloudera Enterprise Data Hub provided a robust collection of technologies and tools to address these challenges.
Cloudera as an Enterprise Data Hub
To truly understand how well the solutions would meet the needs, a PoC was performed. The team gathered some "commodity hardware" consisting of 10 laptop computers connected to a 1 Gb switch. These were set up in a conference room, and Cloudera was downloaded and deployed. The total cost for this PoC was minimal, as the laptops were already available and the software could be downloaded freely. This allowed the team to explore the solutions before investing significant money. Each available tool was evaluated to identify the use cases for which it was a good fit.
Data Loading Direct to HDFS
Data was loaded directly to HDFS in raw format, consisting of both semi-structured and unstructured data. Loading the data directly to HDFS eased the ingest bottleneck, but the data still needed to be transformed, cleansed, and made available to applications.
Data Transformation & Cleansing
Since most data was received from customers as batch uploads, MapReduce was used to process the data sets in batches. The MapReduce process far surpassed the performance of the traditional XML parsing and inserts into the incumbent RDBMS. The team also found that processing fewer, larger files was far more efficient than processing many small files.
The traditional approach with the incumbent RDBMS required data to be provided in a specific XML format. These data sets were parsed, validated and inserted into the RDBMS’ highly normalized structure. The process was able to ingest ~100K records per hour.
The new solution used MapReduce to transform the data into formats needed for the applications. Metrics for various tests can be seen below. These results showed that the solution could scale to larger data sets and increasing cluster size resulted in near linear scalability.
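The map/shuffle/reduce pattern behind the batch transformation can be sketched in pure Python. This is only an illustration of the pattern; the real pipeline ran as Hadoop MapReduce jobs over HDFS, and the "unit_id,station,result" log format here is a hypothetical example, not Omneo's actual schema.

```python
# A pure-Python sketch of the map -> shuffle -> reduce pattern used for
# batch transformation of semi-structured log data. Hadoop distributes the
# same three phases across the cluster.
from collections import defaultdict

def map_phase(line):
    """Parse one raw log line and emit a (key, value) pair."""
    unit_id, station, result = line.strip().split(",")
    yield (station, 1 if result == "FAIL" else 0)

def reduce_phase(key, values):
    """Aggregate all values for one key: failure count per station."""
    return (key, sum(values))

def run_job(lines):
    shuffled = defaultdict(list)          # shuffle: group emitted values by key
    for line in lines:
        for k, v in map_phase(line):
            shuffled[k].append(v)
    return dict(reduce_phase(k, vs) for k, vs in shuffled.items())

raw = ["u1,solder,PASS", "u2,solder,FAIL", "u3,inspect,FAIL"]
print(run_job(raw))  # {'solder': 1, 'inspect': 1}
```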
Low-Latency Read/Write Storage on HBase
HBase was evaluated to provide the primary storage of records ingested. This provided a low-latency read/write system and a very flexible data structure. Various rowkey designs were attempted including complex byte-encoded representation of common query filters to support fast lookups and queries. However this approach was very limited and any change would require a full refresh of all data. Instead, HBase was used as a primary system of record, a simple rowkey was employed for retrieving a record, and a secondary index was created in Cloudera Search in order to provide more flexible queries.
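The final design described above, a simple HBase rowkey for direct record retrieval plus a secondary index in Cloudera Search for flexible queries, can be sketched with plain dictionaries. The dicts merely stand in for HBase and Search, and the field names are hypothetical.

```python
# A sketch of the final storage design: HBase keyed by a simple record id
# for low-latency gets, with a separate search index mapping field values
# back to rowkeys. Dicts stand in for HBase and Cloudera Search here.

hbase = {}           # rowkey -> full record (system of record)
search_index = {}    # (field, value) -> set of rowkeys (secondary index)

def put(rowkey, record):
    """Write the record to the primary store and index every field."""
    hbase[rowkey] = record
    for field, value in record.items():
        search_index.setdefault((field, value), set()).add(rowkey)

def get(rowkey):
    """Direct lookup by simple rowkey: the low-latency HBase access path."""
    return hbase[rowkey]

def query(field, value):
    """Flexible query via the secondary index, then fetch from the store."""
    return [hbase[k] for k in sorted(search_index.get((field, value), ()))]

put("unit-001", {"product": "widget", "result": "FAIL"})
put("unit-002", {"product": "widget", "result": "PASS"})
print(query("result", "FAIL"))  # [{'product': 'widget', 'result': 'FAIL'}]
```

Because the rowkey encodes nothing but the record identity, new query patterns only require re-indexing, not a full refresh of the primary data, which was the limitation of the byte-encoded rowkey approach.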
SQL Analytics via Impala
While the search engine described above provided a mechanism to support more complex data exploration, and record retrieval based on specific search criteria, it lacked the robust analytic capabilities of a SQL engine. Impala + Parquet in combination with an efficient partitioning strategy based on query patterns was found to be an efficient and scalable solution for high-performance SQL workloads.
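The effect of the partitioning strategy can be illustrated with a small sketch of partition pruning: when Parquet files are laid out in directories named by common filter columns, the query engine only reads partitions matching the predicate. The year/month layout and row counts below are hypothetical examples.

```python
# A sketch of partition pruning, the reason a partitioning strategy matched
# to query patterns makes Impala + Parquet fast: only partitions whose
# directory names satisfy the predicate are scanned.

files = {
    "year=2014/month=12/part-0.parquet": 1_000_000,  # rows per file (illustrative)
    "year=2015/month=01/part-0.parquet": 1_200_000,
    "year=2015/month=02/part-0.parquet": 900_000,
}

def rows_scanned(files, predicate):
    """Sum rows in partitions whose path satisfies the partition predicate."""
    return sum(n for path, n in files.items() if predicate(path))

full_scan = rows_scanned(files, lambda p: True)
pruned = rows_scanned(files, lambda p: p.startswith("year=2015/"))
print(full_scan, pruned)  # pruning skips the 2014 partition entirely
```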
Machine Learning with Spark
Not all targeted use cases could be achieved using the combination of search and SQL. Root cause analysis, data mining, and predictive modeling were still challenges, for which Spark provided a solution. Spark introduced a distributed in-memory data processing system, and the community was rapidly developing machine learning libraries that could run in a distributed fashion to perform machine learning at scale.
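As an illustration of the kind of grouped aggregation Spark distributes in memory, the sketch below computes failure rate per supplier to surface root-cause candidates. In Spark this would be an RDD or DataFrame groupBy running across the cluster; the pure-Python version and its field names are hypothetical.

```python
# A pure-Python sketch of a root-cause style aggregation: failure rate
# grouped by a candidate factor. Spark would distribute this same
# computation (a groupBy + aggregate) across the cluster in memory.
from collections import defaultdict

def failure_rate_by(records, factor):
    """Return {factor value: fraction of records that failed}."""
    counts = defaultdict(lambda: [0, 0])       # factor value -> [fails, total]
    for r in records:
        bucket = counts[r[factor]]
        bucket[0] += r["result"] == "FAIL"
        bucket[1] += 1
    return {k: fails / total for k, (fails, total) in counts.items()}

records = [
    {"supplier": "A", "result": "FAIL"},
    {"supplier": "A", "result": "PASS"},
    {"supplier": "B", "result": "PASS"},
    {"supplier": "B", "result": "PASS"},
]
print(failure_rate_by(records, "supplier"))  # {'A': 0.5, 'B': 0.0}
```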
The conclusion was that by leveraging the Cloudera platform, all use cases could be achieved, the solution was linearly scalable, and the platform provided a rich enterprise toolset including security, operations, management, high availability, and disaster recovery. After the evaluation and PoC, the Omneo team engaged with Cloudera in order to leverage the full enterprise capabilities and upgraded to the Cloudera Enterprise Data Hub solution.
Workloads & Performance Metrics
Here we show the performance of two example workloads from the investigation: Search & Analytic queries.
Figure 1: Search Performance

Figure 1 shows the performance of Cloudera Search (Solr) running on Cloudera's platform. It demonstrates that the search engine performs well on large data sets. Scalability was assessed using a boundary test focused on the size of the collection and on the linearly scalable nature of the solution. Below is a summary of the test methodology:
Generate a data set of size X and load it into a cluster of size N
Node type is fixed (256 GB RAM, 2 x 6-core CPUs, 8 x 2 TB HDDs)
Perform tests and capture key metrics:
QTIME, CPU, load average, heap/RAM, disk I/O
Double the data size to 2X and retest
Double the cluster size to 2N, retest, and repeat
Each test was performed to assess the performance as the data size increases and as the cluster size increases.
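One way such a test harness could issue queries and collect the QTIME metric is sketched below. The host and collection names are hypothetical; the harness would record `responseHeader.QTime` from Solr's JSON response at each data and cluster size.

```python
# A sketch of building a Solr query for the search tests. Solr's standard
# /select handler takes q (query), fq (filter query), rows, and wt (format)
# parameters, and reports server-side query time as responseHeader.QTime.
# Host "search01" and collection "product_events" are hypothetical.
from urllib.parse import urlencode

def solr_query_url(host, collection, q, rows=10, filters=()):
    """Build a Solr select URL with query, row limit, and filter queries."""
    params = [("q", q), ("rows", rows), ("wt", "json")]
    params += [("fq", f) for f in filters]
    return f"http://{host}:8983/solr/{collection}/select?{urlencode(params)}"

url = solr_query_url("search01", "product_events", "result:FAIL",
                     rows=50, filters=["site:factory_a"])
print(url)
# In the harness: resp = json.load(urlopen(url)); qtime = resp["responseHeader"]["QTime"]
```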
Impala, as a distributed SQL engine for Hadoop, provided a way to incorporate SQL workloads and allow legacy reporting systems to leverage the new big data platform. A performance evaluation was used to identify the optimal storage format for Impala; the conclusion was that Impala over the Parquet file format provided the best compression and performance at scale. Various aggregation techniques were applied, including an HBase aggregation table, but for the granularity of access required by the use cases, pre-aggregation did not provide enough flexibility for ad-hoc analysis at a fine-grained level. An appropriate partitioning strategy was key to achieving the desired performance, and Impala + Parquet was selected as the data storage and SQL engine for these workloads. In addition to its performance, Impala was early to market with standard connectors that made it easy to integrate with legacy systems and third-party tools. Figure 2 below shows the comparison of Impala over flat files versus Impala over Parquet.
Findings and Conclusions
The Cloudera platform has many tools available, each optimized for different purposes. When combined within a platform or application architecture, the components can be used to solve a variety of problems. Identifying the best tool for your needs is key to solving the problem. By leveraging the full capabilities of the platform, you have all the tools available for your various workloads and can choose the right one for each job.