
B2B Hadoop Applications


Hadoop Applications

Hadoop has been a buzzword in the big data space for a while, and many companies are evaluating big data and Hadoop.  However, most of the Hadoop work we see falls into two categories: in-house projects that explore a company's own big data, and software frameworks for building apps that analyze data already in Hadoop.  What about building robust SaaS applications for your customers that can process and analyze their data in Hadoop?  This article discusses building customer-facing SaaS applications on Hadoop.


Data Ingestion: Pull vs. Push

Many Hadoop data connector frameworks, such as Sqoop, can pull data into Hadoop from external data sources.  Others, such as Flume, continuously append log data from operating systems and applications directly into Hadoop.  When building a SaaS offering, however, the source data usually lives in the customer's systems, behind the customer's firewall.  Allowing a cloud-based Hadoop service to pull data from those internal sources would often break the customer's security rules: external third-party cloud systems should not retrieve data from the customer's internal data sources.  In this case, the SaaS offering should provide a mechanism that lets customers push data from their internal systems.  This can be achieved through methods such as web services or IoT gateways for real-time feeds, and file transfers for batch feeds.  The SaaS-based Hadoop system can then process these feeds: real-time data via message queues and/or streaming technologies (such as Spark Streaming or Storm), and batch data via file drops and map/reduce.  The challenge is managing large data-volume transfers into the SaaS environment, which must be scalable.  In addition, guaranteed delivery is essential to prevent data loss between the data landing zone and HDFS.
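The guaranteed-delivery step between the landing zone and HDFS can be sketched as follows.  This is a minimal, simulated example: local directories stand in for the landing zone and HDFS, and the checksum-verify-then-rename scheme is one reasonable approach, not a specific product's API.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 digest of a file in streaming fashion."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def ingest(landing_file: Path, hdfs_dir: Path) -> Path:
    """Move a customer-pushed batch file from the landing zone into the
    (simulated) HDFS directory, verifying the checksum before deleting
    the source so a failed transfer never loses data."""
    hdfs_dir.mkdir(parents=True, exist_ok=True)
    source_digest = sha256_of(landing_file)
    target = hdfs_dir / landing_file.name
    tmp = target.with_suffix(target.suffix + ".part")  # write-then-rename
    shutil.copyfile(landing_file, tmp)
    if sha256_of(tmp) != source_digest:
        tmp.unlink()
        raise IOError(f"checksum mismatch for {landing_file}")
    tmp.rename(target)     # publish only a fully verified copy
    landing_file.unlink()  # remove from landing zone only after success
    return target
```

The write-then-rename pattern matters because downstream map/reduce jobs may scan the target directory; they should never see a half-written file.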


Modelling Customer Data

When building in-house Hadoop systems, the data may be fairly well understood for specific usage scenarios.  The Hadoop developer can study the source data and leverage the various components of the Hadoop ecosystem to analyze or process it, and can make structural decisions at design time because the systems target known data sets within a constrained scope.  When building Hadoop-backed data processing systems and applications for customers, however, the frameworks and applications must adapt to each customer's needs while shielding the UX from the complexities of the data structures and processing.  This challenges the Hadoop developer and encourages abstraction and metadata-driven design principles.  Most structures built on top of the pile of data in Hadoop are metadata definitions that describe the structure of the data.  By leveraging this metadata in the UI, business applications can be built that adapt to different data sets and provide a common toolset even over complex and ever-changing data structures.  Some configuration and data source definitions may be required, but these data structures can be made available for applications to consume at runtime.

Leveraging Metadata

Let’s begin by defining metadata in this context.  Metadata is data about data: it may describe data structures, data attributes, data types, and so on.  Business applications need to consume relevant data in the context of the application’s domain.  By collecting metadata about the data stored in Hadoop, applications can leverage the various Hadoop technologies to serve that data at runtime.  Metadata can be used to determine the available attributes and the structure of each pocket of data stored in Hadoop.  Not all data looks the same, but metadata gives applications a way to infer structure when accessing disparate data from multiple sources.  Take log data for a web site, for example.  Examining the logs, we may find information about the user, their IP address, the content they accessed, and the time they accessed it.  The metadata then defines a data set USER_ACCESS_LOG with the attributes username, ip_address, time, and content_accessed.  Applications can now query the metadata to learn that data about users is available.  You can then drive user-name selection and provide functionality such as reporting how many users accessed a particular resource in a given hour, for the purpose of analyzing user access patterns on that web site.  Most Hadoop systems are built to store many different types and structures of data.  By capturing and exposing metadata, applications can consume all of these structures.
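The USER_ACCESS_LOG example above can be sketched in a few lines.  This is a minimal illustration, assuming a hypothetical in-memory metadata registry; a real deployment might keep the same definitions in HCatalog or a configuration store.

```python
from typing import Dict, List

# Hypothetical metadata registry: dataset name -> ordered attribute list.
# In practice this would be populated from a metadata service, not hard-coded.
METADATA: Dict[str, List[str]] = {
    "USER_ACCESS_LOG": ["username", "ip_address", "time", "content_accessed"],
}

def parse_records(dataset: str, raw_lines: List[str], sep: str = "\t") -> List[dict]:
    """Turn raw delimited lines into records using only the registered
    metadata -- the application never hard-codes the column layout."""
    columns = METADATA[dataset]
    return [dict(zip(columns, line.split(sep))) for line in raw_lines]
```

Because the columns come from the registry rather than the code, the same application logic can serve a second customer whose data set registers different attributes.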


Leveraging Schema on Read

Hadoop leverages the principle of schema on read.  Instead of writing data into a specific format, the data is simply written as-is.  When a user needs to ask a question of the data, a structure is defined on top of it, and the data is read through that structure to answer the question.  This is known as schema on read: the schema is defined when the user accesses the data, not enforced when the data is written to the system.  In this way, Hadoop provides a fast way to dump any type of data into a data lake and define structures on that data later, at access time.  The benefit of schema on read is two-fold.  First, all data can be stored in Hadoop without being constrained to a predefined structure or set of attributes, as it would be in an RDBMS or traditional data warehouse.  Second, since the files are simply written, no time is spent forcing the data into a predefined structure; all data is stored in its native, raw format and available for access later.  When users eventually need the data, the schema can be defined at that point.  Taking the log example, the structure is defined as metadata {username, ip_address, time, content_accessed}.  With this newly defined structure, various Hadoop components can access or process the data.  For example, if we define a Hive table over the simple log file, SQL statements can be written that translate into map/reduce jobs to process or access the data.  At runtime, a data structure can be defined and the data can then be read.  Some processing will be batch oriented, while other workloads can use a real-time query engine; even in-memory data sets can be defined and consumed at runtime, through command-line tools or various programming interfaces.  Thus the solution is transformed and the problem unbounded: data is no longer constrained to predefined structures and formats, but can be re-consumed and re-processed in its raw form.
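The schema-on-read flow can be illustrated end to end.  In this minimal sketch the raw log lines and the hour-bucketing logic are invented for illustration: the lines are stored exactly as they arrived, and the {username, ip_address, time, content_accessed} schema is applied only at read time to answer the "users per resource per hour" question from the metadata section, much as a Hive query over the same raw file would.

```python
RAW_LOG = [  # data landed as-is; no structure enforced at write time
    "alice 10.0.0.1 2015-06-01T09:15 /home",
    "bob 10.0.0.2 2015-06-01T09:40 /home",
    "alice 10.0.0.1 2015-06-01T10:05 /about",
]

# The schema exists only on the read path; the files above never saw it.
SCHEMA = ["username", "ip_address", "time", "content_accessed"]

def read_with_schema(lines, schema):
    """Apply the schema at read time: raw text in, structured records out."""
    for line in lines:
        yield dict(zip(schema, line.split()))

def users_per_resource_hour(lines, schema):
    """Count distinct users per (resource, hour) bucket -- the kind of
    question a Hive table over the same raw file would answer."""
    seen = {}
    for rec in read_with_schema(lines, schema):
        key = (rec["content_accessed"], rec["time"][:13])  # truncate to hour
        seen.setdefault(key, set()).add(rec["username"])
    return {key: len(users) for key, users in seen.items()}
```

Re-processing the same raw lines with a different schema, or a different aggregation, requires no rewrite of the stored data, which is the unbounding the paragraph above describes.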



Hadoop is a powerful distributed storage and data processing system.  Not only does it scale linearly, it also provides redundant storage of data in its raw format and mechanisms to structure, consume, and process that data through a large library of tools.  This can be leveraged to build business applications, but new problems and challenges arise when using such a flexible yet complex system as a robust, solid foundation for them.  By leveraging the metadata and schema-on-read principles within Hadoop, business applications can consume this wealth of data and serve it up to users in a simple and consumable fashion.