For decades, relational databases have been the de facto storage mechanism for most data, from transactional systems to data warehouses. Though extremely powerful and mature, they impose restrictions that limit their usefulness and scalability. These limitations become evident when working with today’s large volumes of data, which arrive in a wide variety of formats. Recent years have seen a rise in so-called “NoSQL” databases, an all-encompassing term for numerous big data stores and solutions. While several key differences exist and tradeoffs must be made, there are many instances where a NoSQL solution is the better alternative.
Traditional relational databases have largely relied on scaling up hardware as additional data or performance is needed. This is achieved by increasing memory, employing faster processors, and installing faster, larger hard drives. Adding more resources to a single large machine or appliance is known as “vertical scaling.” However, this scalability is inherently limited because any machine has a maximum capacity. Once a machine’s capacity is reached, the only remaining way to scale vertically is to replace it with a new, more powerful (i.e., higher capacity) machine. The price of such high-powered machines does not increase linearly; that is, as the hardware in a single machine grows, each additional dollar buys less “bang for your buck.”
Additionally, due to the way many database systems are designed, data sizes and performance do not scale linearly with hardware either. Say, for instance, we wanted to double the amount of data our database handled but keep the same query performance. Doubling the amount of hardware in the machine may yield the desired performance, but it just as likely may not. What the query planner does with table statistics, primary and secondary indexes, and so on is often a black box. Rather than adding more hardware, the database and query design may have to be completely restructured.
As the price of commodity business-class hardware has gone down in recent years (think machines under $10K), there has been a drive to instead handle larger data loads and ensure performance by scaling out across multiple machines, known as “horizontal scalability,” rather than scaling up a single server. Many NoSQL solutions do this through “sharding”: separating data into multiple sections by some key. A single database table may be served by many machines, each machine with its own piece, or shard, of the data. In HBase, a NoSQL data store that relies on Apache Hadoop, a shard of a table is called a region. Each machine in a cluster serves multiple regions for multiple tables. The processing of reads and writes is spread across these machines. This typically yields predictable, roughly linear scalability for data growth and performance. Relational database vendors have been in a race to offer such capabilities as well, but most of their products were not built for horizontal scaling and thus may require fundamental re-architecting. Pricing is often not as competitive for closed-source solutions, and open-source solutions, like Facebook’s original clusters of MySQL, can be quite complicated.
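To make the idea of sharding concrete, here is a minimal sketch in Python. It assumes a simple hash-based scheme, which is only one of several ways to shard; HBase, for example, assigns contiguous ranges of row keys to regions rather than hashing them. The names `shard_for` and `ShardedTable` are hypothetical and exist only for illustration, and each “machine” is stood in for by an in-memory dictionary.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a row key to a shard number using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedTable:
    """One logical table whose rows are spread across several shards.

    Each shard is a plain dict here; in a real cluster each would live
    on a different machine.
    """

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        # Only the shard that owns the key does any work for this write.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        # Reads are routed to the owning shard in the same way.
        return self.shards[shard_for(key, len(self.shards))].get(key)


table = ShardedTable(num_shards=4)
for i in range(1000):
    table.put(f"user-{i}", {"id": i})

# How many rows ended up on each shard.
sizes = [len(shard) for shard in table.shards]
print(sizes)
print(table.get("user-42"))
```

Because every read and write touches only the shard that owns the key, adding shards spreads both the data and the work. Note that a plain modulo scheme like this one would move most keys if the number of shards changed, which is one reason production systems tend to use range splitting or consistent hashing instead.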
Certainly, there are tradeoffs in achieving horizontal scalability. In particular, they relate to the CAP theorem, which states that no distributed data store can simultaneously guarantee all three of Consistency, Availability, and Partition tolerance (the ability to keep operating when network failures cut nodes off from one another). While I won’t go into detail here, the core idea is that one property must be traded off to achieve the other two. Examples include HBase, which trades Availability for Consistency and Partition tolerance, and Cassandra, which trades Consistency for Availability and Partition tolerance.
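The tradeoff can be illustrated with a deliberately tiny toy model. The sketch below assumes just two replicas and a flag that simulates a network partition between them; the class `TwoNodeStore` and its `mode` values are hypothetical, and this is not how HBase or Cassandra actually implement replication. It only shows the choice a system faces when a write arrives during a partition: refuse it and stay consistent, or accept it and let the replicas diverge.

```python
class TwoNodeStore:
    """Two replicas, A and B. `partitioned` simulates a network split."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replica_a = {}
        self.replica_b = {}
        self.partitioned = False

    def write(self, key, value) -> bool:
        """Write through replica A. Returns True if the write was accepted."""
        if not self.partitioned:
            # Healthy network: both replicas receive the write.
            self.replica_a[key] = value
            self.replica_b[key] = value
            return True
        if self.mode == "CP":
            # Cannot reach B, so refuse the write: consistent but unavailable.
            return False
        # AP: accept the write locally; B is now stale until the split heals.
        self.replica_a[key] = value
        return True


cp = TwoNodeStore("CP")
cp.write("x", 1)
cp.partitioned = True
cp_accepted = cp.write("x", 2)

ap = TwoNodeStore("AP")
ap.write("x", 1)
ap.partitioned = True
ap_accepted = ap.write("x", 2)

print(cp_accepted, cp.replica_a["x"], cp.replica_b["x"])
print(ap_accepted, ap.replica_a["x"], ap.replica_b["x"])
```

In the consistency-favoring store the second write is rejected and both replicas still agree. In the availability-favoring store the write succeeds, but a reader hitting the second replica sees the old value until the replicas are reconciled.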
Other tradeoffs include the complexity of working with the databases. Relational databases are very mature and have the advantage of an industry-wide standard query language (i.e., SQL) for interfacing with them. Most support ACID (Atomic, Consistent, Isolated, Durable) transactions, a well-established model. This stands in stark contrast to NoSQL databases, whose APIs mostly differ completely from one another, present a new paradigm for developers, and may only partially support ACID transactions, if they support them at all.
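As a small illustration of what a transaction provides, the following sketch uses Python’s built-in `sqlite3` module with an in-memory database. The `accounts` table, the names in it, and the `transfer` helper are hypothetical examples. The point is atomicity: when the second statement of a transfer fails, the first one is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount) -> bool:
    """Move `amount` from src to dst as a single all-or-nothing unit."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit to dst
        # made by the first statement has been rolled back too.
        return False


ok_small = transfer(conn, "alice", "bob", 30)
ok_large = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok_small, ok_large, balances)
```

The failed transfer leaves both balances exactly as they were after the successful one. A data store without multi-statement transactions would leave it to the application to detect and undo the half-finished update.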
Anyone familiar with relational databases knows that data is stored in one or more tables, each with a predefined set of columns and with rows representing data instances. The same values may be repeated across many rows of a single table (denormalized), or each unique item may be stored once in its own table and the tables joined together using foreign keys (normalized). Data in transactional systems is often stored in a normalized form, while a denormalized form is more likely to be used in a data warehouse.
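The difference between the two forms can be sketched with Python’s built-in `sqlite3` module. The table and column names below (`customers`, `orders`, `orders_flat`) are hypothetical. The normalized layout stores each customer once and reaches it through a foreign key; the denormalized layout repeats the customer details on every order row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized: customer details live in exactly one place.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        item TEXT
    );
    INSERT INTO customers VALUES (1, 'Ada', 'London');
    INSERT INTO orders VALUES (10, 1, 'keyboard');
    INSERT INTO orders VALUES (11, 1, 'monitor');

    -- Denormalized: customer details are duplicated on every row.
    CREATE TABLE orders_flat (
        order_id INTEGER PRIMARY KEY,
        customer_name TEXT,
        customer_city TEXT,
        item TEXT
    );
    INSERT INTO orders_flat VALUES (10, 'Ada', 'London', 'keyboard');
    INSERT INTO orders_flat VALUES (11, 'Ada', 'London', 'monitor');
    """
)

# The normalized form needs a join to reassemble a full order row.
joined = conn.execute(
    "SELECT o.id, c.name, c.city, o.item"
    " FROM orders AS o JOIN customers AS c ON c.id = o.customer_id"
    " ORDER BY o.id"
).fetchall()

# The denormalized form can be read directly from a single table.
flat = conn.execute(
    "SELECT order_id, customer_name, customer_city, item"
    " FROM orders_flat ORDER BY order_id"
).fetchall()

print(joined)
print(flat)
```

Both queries return the same rows. The normalized form pays for a join at read time but lets a customer’s city be updated in one place, while the flat form reads faster at the cost of duplicated data.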
Either way, these structures tend to be rigid. Columns are fixed, and adding or removing columns can be an expensive operation. Mapping disparate data sources to this single rigid structure can prove problematic, and forcing data into it means losing the data’s original format. This is only aggravated by today’s variety of data, where Word documents, Excel spreadsheets, Facebook Likes, and Tweets all may need to be stored for a single key.
Various NoSQL solutions provide alternative methods for storing data. One popular category of NoSQL databases, which attempts to model real-world data more closely rather than transform it, is called “document-centric.” Examples include MongoDB and CouchDB. Rather than forcing data into flat tables and rows, entire hierarchical documents are stored as a single value, with nested attributes and more complex data types like lists. Rather than being forced into a static model, each entry is free to model the data as it actually is and to evolve over time. These solutions often include query languages that support many of the same functions as SQL, as well as more complex access methods like MapReduce.
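The document model can be sketched with nothing more than Python dictionaries and the standard `json` module. This is an illustration of the data shape only, not the MongoDB or CouchDB API; the field names and the `find` helper are hypothetical. Note that the two documents in the same collection do not share the same set of fields.

```python
import json

_MISSING = object()  # sentinel for "this document has no such field"

customers = [
    {
        "_id": 1,
        "name": "Ada",
        "emails": ["ada@example.com", "ada@work.example.com"],
        "address": {"city": "London", "country": "UK"},
    },
    {
        # A different shape: no address, but a nested social profile.
        "_id": 2,
        "name": "Grace",
        "emails": [],
        "social": {"twitter": "@grace"},
    },
]


def find(collection, path, value):
    """Return documents whose nested field at dotted `path` equals `value`."""
    results = []
    for doc in collection:
        node = doc
        for part in path.split("."):
            if isinstance(node, dict) and part in node:
                node = node[part]
            else:
                node = _MISSING
                break
        if node is not _MISSING and node == value:
            results.append(doc)
    return results


londoners = find(customers, "address.city", "London")
print([doc["name"] for doc in londoners])

# Each document serializes to a single self-contained value.
round_tripped = json.loads(json.dumps(customers))
print(round_tripped == customers)
```

Adding a new attribute to one customer requires no schema change and leaves the other documents untouched, which is the flexibility described above.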
Another popular category of NoSQL databases is the key-value store. Examples include HBase and Cassandra. These can provide extreme flexibility, since anything can be stored as a value, along with extremely high read and write performance. However, they often lack support for advanced querying, and querying across the entire dataset can be quite slow.
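A rough sketch of why key lookups are fast while general queries are slow: the store only understands keys, and values are opaque bytes to it. The `KeyValueStore` class below is a hypothetical in-memory stand-in, not the HBase or Cassandra API, and the sensor data is made up for illustration.

```python
import json


class KeyValueStore:
    """A minimal key-value store: values are opaque bytes to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str):
        # Direct lookup by key; no other entries are touched.
        return self._data.get(key)

    def scan(self, predicate):
        """Query by value: every entry has to be examined."""
        matches, examined = [], 0
        for key, value in self._data.items():
            examined += 1
            if predicate(value):
                matches.append(key)
        return matches, examined


store = KeyValueStore()
for i in range(10_000):
    store.put(f"sensor-{i}", json.dumps({"temp": i % 100}).encode("utf-8"))

# Fast path: fetch one value by its key.
one = store.get("sensor-42")

# Slow path: the store cannot index inside values, so the caller must
# decode and test all of them.
hot, examined = store.scan(lambda raw: json.loads(raw)["temp"] >= 99)
print(one, len(hot), examined)
```

Fetching a single reading touches one entry, whereas finding the hottest sensors requires decoding all ten thousand values. Real systems reduce this cost with careful key design or secondary indexes, but the basic asymmetry remains.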
While still terrifically useful, relational databases are no longer the end-all, be-all of performant data storage and retrieval. The points mentioned here should be part of the decision, but they cover only a few of the considerations in selecting a data store. Depending on the use case, many of the NoSQL offerings can yield significant cost and scalability advantages, but all come at a price. Unfortunately, no single solution fits every situation, but the options from which to choose have never been more numerous or more fully featured.
About the Author
Kathleen deValk, Chief Architect, Omneo Solutions, Siemens PLM Software, is a hands-on technical architect with over 15 years of experience in software design and 10 years as an enterprise platform architect. She is the Chief Architect of the Omneo big data analytics solution and has led the Omneo team to design and release this innovative technology built on Hadoop. Kathleen also helped to found the Charlotte Big Data & Analytics Society and recently helped to author Architecting HBase Applications - A Guidebook for Successful Development and Design which utilizes the Omneo solution in chapters 2, 7 & 8.