In traditional RDBMS design, indexes speed up data access based on expected query patterns. An index provides a quick path to the data, analogous to the index at the back of a book. Database architects monitor usage to identify the most common query patterns and continually tune the performance of those requests. Recently, non-relational databases such as HBase have risen in popularity thanks to their low latency and scalability. Unlike a traditional database, however, HBase provides only a single native index: the RowKey. The RowKey enables rapid reads, but it must be designed very carefully because changing it later is difficult and time-consuming. This write-up presents an alternative solution to this predicament: using Solr as a secondary index.
HBase as a System of Record
HBase provides many benefits as your primary system of record.
HBase can store large amounts of data and retrieve specific records quickly, but standard queries and simple aggregations are not straightforward. To provide more flexibility, additional strategies must be applied.
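Since the RowKey is HBase's only native index and is painful to change later, it is typically composed from the fields you most need to scan by. Below is a minimal sketch of one common pattern, a composite RowKey with a reversed timestamp, using a hypothetical "books" table; the field names and separator are illustrative assumptions, not a fixed schema.

```python
# Sketch of a composite RowKey for a hypothetical "books" table.
# Because the RowKey is HBase's only native index, it is often built
# from the fields you must scan by, most selective first.

def make_rowkey(author_id: str, published_ts: int) -> bytes:
    """Compose <author_id>|<reversed timestamp> so that a prefix scan
    on author_id returns that author's books, newest first."""
    # Reverse the timestamp so newer rows sort before older ones
    # under HBase's lexicographic byte ordering.
    reversed_ts = (2**63 - 1) - published_ts
    return f"{author_id}|{reversed_ts:019d}".encode("utf-8")

key = make_rowkey("author-42", 1_600_000_000)
```

A prefix scan on `author-42|` would then return that author's rows in reverse chronological order, without any secondary lookup.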
Solr as HBase Index
Solr is a real-time search engine that provides very fast response times on large data sets. In addition to search, Solr provides faceting, which supports simple counts and alternative navigation interfaces. SolrCloud is the distributed version of the Solr search engine and is supported for running on a Hadoop cluster.
Solr can be used to index data in HBase, providing a secondary index that enables real-time queries and simple analysis of the data in HBase. There are multiple ways to index HBase data, and some may be covered in a subsequent blog topic. The idea is to index the attributes of your HBase record in Solr and also store a pointer back to the record's HBase location. This allows a very low-latency lookup of the specific raw record returned from the search. Any data not stored in the Solr document can stay in HBase, where it can be retrieved quickly and easily. The key decisions are what data to index and what data to store in the Solr document.
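The pattern above can be sketched as a small mapping function: only the searchable attributes go into the Solr document, and a pointer field carries the HBase RowKey back out of each search hit. The column names, collection name, and the pysolr client shown in the comment are assumptions for illustration.

```python
# A minimal sketch of turning an HBase row into a Solr document.
# The "rowkey" field is the pointer back to the full record in HBase;
# column qualifiers like b"info:title" are hypothetical.

def to_solr_doc(rowkey: bytes, row: dict) -> dict:
    return {
        "id": rowkey.decode("utf-8"),      # Solr unique key
        "rowkey": rowkey.decode("utf-8"),  # pointer back to HBase
        "title": row.get(b"info:title", b"").decode("utf-8"),
        "author": row.get(b"info:author", b"").decode("utf-8"),
        # Large fields (full text, cover images, etc.) stay in HBase.
    }

row = {b"info:title": b"Effective Java", b"info:author": b"Joshua Bloch"}
doc = to_solr_doc(b"author-42|0000000001", row)

# Against a live cluster, indexing might look like (pysolr assumed):
#   import pysolr
#   solr = pysolr.Solr("http://localhost:8983/solr/books")
#   solr.add([doc])
```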
When building applications, what to index depends on the nature of the questions users will ask the system. If users want to search for books written by a specific author, then an "author" field should be indexed. What to store in the Solr document depends on what information you need to present to the user and how quickly that data can be retrieved. The key to leveraging HBase for storage is to avoid large table scans, where the user pulls a significant number of records spanning the table; these will impact performance. Small batch gets of around 10 rows (Ex: the top 10 search results), however, can be extremely fast (on the order of 10ms).
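The search-then-fetch flow can be sketched as follows: query Solr first, pull the RowKey pointers out of the top hits, and issue a single small batch get against HBase for the raw records. The client libraries named in the comment (pysolr, happybase) and the table name are assumptions; the cap on batch size is the point.

```python
# Sketch of the search-then-fetch pattern: take the top Solr hits,
# then issue one small batch get against HBase for the raw records.

def rowkeys_from_results(solr_results: list, limit: int = 10) -> list:
    """Pull the HBase pointers out of the Solr hits, capped at a
    small batch so we never trigger a large table scan."""
    return [hit["rowkey"].encode("utf-8") for hit in solr_results[:limit]]

# With a live cluster the fetch might look like (happybase assumed):
#   import happybase
#   table = happybase.Connection("hbase-host").table("books")
#   rows = table.rows(rowkeys_from_results(results))  # one batch get

results = [{"rowkey": "author-42|0000000001"},
           {"rowkey": "author-7|0000000009"}]
```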
Joys and Pains of Faceting
Returning to the book example, perhaps the user wants to know how many Java books were written by each author. I can search for "Java" and facet by "author" to see how many books each author wrote. I can then drill into the details of a given author's books by filtering on the author's name and retrieving the book details, which may be stored in and fetched from HBase. This approach, sometimes called faceted navigation, is a frequently used UI pattern for drilling into search results. It can also provide some basic count-based analytics on your search results. For many applications the counts may even be more interesting to end users, who may skip the lookup of the raw records entirely (Ex: comparing the counts of books per author over the last year).
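The "search Java, facet by author" request maps directly onto Solr's standard facet parameters, and Solr returns the counts as a flat alternating list of values and counts. A minimal sketch, assuming a hypothetical books collection (the helper names are mine, the parameter names are standard Solr):

```python
# Build the query parameters for "search Java, count books per author".
def facet_params(query: str, facet_field: str) -> dict:
    return {
        "q": query,
        "rows": 0,             # counts only; skip fetching documents
        "facet": "true",
        "facet.field": facet_field,
        "facet.mincount": 1,   # drop authors with zero matches
    }

def parse_facet_counts(flat: list) -> dict:
    """Solr returns facet counts as a flat [value, count, value, count,
    ...] list; pair the entries up into a dict."""
    return dict(zip(flat[::2], flat[1::2]))

counts = parse_facet_counts(["Joshua Bloch", 12, "James Gosling", 5])
```

Setting `rows=0` is worth noting: when the user only wants the counts, no documents need to be fetched at all, which is exactly the "skip the raw records" case above.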
There are also pains of faceting, which can get out of hand if you don't provide some guidance and controls to users. Solr with many facets, particularly ones with many unique values, tends to be memory bound because it uses an in-memory cache to provide rapid query response. Faceting by many fields at once, however, can also cause CPU contention, especially under heavy concurrency. Your Hadoop cluster is often tuned to its job types (Ex: memory-intensive processes should run on memory-heavy nodes), but faceting can change the nature of your Solr implementation, becoming both memory and CPU bound under different conditions.
Let's assume you have some complex data structures stored in HBase and want to provide 50 indexed fields. If all users facet by these 50 fields at once, you will see CPU contention on Solr. This becomes a bottleneck and can quickly degrade performance. Be cautious and selective about how you craft your UI: provide the capabilities that make sense, but be aware of the cost on the backend. There are many online shopping examples of faceted navigation that can provide ideas on how to tailor your facet selections based on prior searches and/or selections. This guided drill-in approach can help you better manage the load on your backend and simplify the user's experience.
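A guided drill-in can be sketched as a small query builder: instead of faceting on all indexed fields at once, each UI step facets on one field, and the user's prior selections become cheap filter queries (`fq`) rather than new facets. The field names and drill order below are hypothetical; the `fq`/`facet.field` parameters are standard Solr.

```python
# Sketch of a guided drill-in: facet on one field per step, and turn
# prior selections into fq filters so the facet load stays flat.
# DRILL_ORDER and the field names are illustrative assumptions.

DRILL_ORDER = ["category", "author", "year"]

def next_step_params(query: str, selections: dict) -> dict:
    params = {"q": query, "facet": "true", "facet.mincount": 1}
    # Prior selections become filter queries, not new facets.
    params["fq"] = [f'{field}:"{value}"' for field, value in selections.items()]
    # Facet only on the next unselected field in the drill order.
    remaining = [f for f in DRILL_ORDER if f not in selections]
    if remaining:
        params["facet.field"] = remaining[0]
    return params

p = next_step_params("Java", {"category": "programming"})
```

Each step therefore asks Solr for exactly one facet, no matter how many indexed fields exist, which keeps the CPU cost of faceting roughly constant as the user drills in.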