
Running Solr on S3

by Siemens Dreamer on 01-30-2017 10:00 AM

Running Solr on HDFS has become the standard when running Solr alongside a Hadoop cluster. As our team begins a transition to the cloud, and specifically AWS, we have begun exploring the new opportunities and technologies that come along with it. One of those is running Solr natively on EC2 with our indexes backed by S3. The reasoning for running Solr on S3 is similar to that for running Solr on HDFS, except that HDFS and Hadoop are no longer needed, which allows us to separate our Solr deployment from our Hadoop cluster.


After chatting with a colleague about running Solr on S3, he mentioned that Hadoop's filesystem layer supports S3. So I proceeded to pursue running Solr on S3 via the HdfsDirectoryFactory. The setup is very similar to running Solr on HDFS; the only changes needed are exchanging and adding some libraries and implementing and packaging one class. Solr 6.2.1 comes with the following libraries: hadoop-annotations-2.7.2.jar, hadoop-auth-2.7.2.jar, hadoop-common-2.7.2.jar, and hadoop-hdfs-2.7.2.jar. hadoop-aws.jar is the only missing Hadoop jar, and it is required to access S3.

Handling Hadoop Libraries

During my testing, I worked with the Apache Hadoop 2.7.2, HDP 2.5.3 (Hadoop 2.7.3.2.5.3.0-37), and CDH 5.9.0 libraries. Both the CDH 5.9.0 and HDP 2.5.3 libraries worked, although older versions of CDH and HDP did not. This appears to be because their hadoop-aws library uses aws-java-sdk-s3 (version 1.10.6) rather than aws-java-sdk (version 1.7.4), which older hadoop-aws builds depend on. In addition to the Hadoop jars, htrace-core4-4.0.1-incubating.jar and jets3t-0.9.0.jar need to be included.

AbstractFileSystem Implementation

Changing out the Hadoop libraries isn't enough to make this work. HdfsDirectoryFactory relies on the AbstractFileSystem API, which neither S3N nor S3A implements until Hadoop 2.8. So for now, the missing piece has to be supplied by packaging the following class into a jar:

 

[Image in the original post: the AbstractFileSystem implementation class (abstractfile.jpg).]
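Since that image isn't reproduced here, the following is a rough sketch reconstructed from the description below; the class name (S3N) and package are placeholders of my own, and it assumes the DelegateToFileSystem base class and the NativeS3FileSystem shipped in the hadoop-aws jar.

package org.example.solr.s3; // hypothetical package, chosen for illustration

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.DelegateToFileSystem;
import org.apache.hadoop.fs.s3native.NativeS3FileSystem;

/**
 * AbstractFileSystem wrapper for the s3n scheme, modeled on the S3A class
 * from the Hadoop-AWS 2.8 branch, but delegating to NativeS3FileSystem and
 * registering "s3n" instead of "s3a".
 */
public class S3N extends DelegateToFileSystem {

    public S3N(URI theUri, Configuration conf)
            throws IOException, URISyntaxException {
        // Delegate all AbstractFileSystem calls to the s3n FileSystem implementation.
        super(theUri, new NativeS3FileSystem(), conf, "s3n", false);
    }

    @Override
    public int getUriDefaultPort() {
        // s3n URIs have no default port.
        return -1;
    }
}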

This code was copied from the S3A class in the Hadoop-AWS 2.8 branch, except that the arguments to the super constructor use NativeS3FileSystem instead of S3AFileSystem and the scheme is "s3n" instead of "s3a".

 

Configuration

When running Solr on HDFS, the HDFS client configuration is normally picked up by Solr in some fashion. Since HDFS is not being used here, we need to generate our own configuration file. The smallest configuration I used was as follows:

 

[Image in the original post: the minimal configuration file (configuration.jpg).]
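Since that image isn't reproduced here, below is a sketch of what such a file might contain, based on the property names in the error quoted further down and on the comment at the end of this post noting that the file had to be named hdfs-site.xml. The fs.AbstractFileSystem.s3n.impl entry (and the class's package name) is an assumption on my part about how the custom class above gets registered under the s3n scheme.

<?xml version="1.0" encoding="UTF-8"?>
<!-- hdfs-site.xml, placed in the directory passed as solr.hdfs.confdir -->
<configuration>
  <!-- S3 credentials; property names taken from the error message below -->
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
  <!-- Assumed: registers the custom AbstractFileSystem wrapper for the s3n scheme -->
  <property>
    <name>fs.AbstractFileSystem.s3n.impl</name>
    <value>org.example.solr.s3.S3N</value>
  </property>
</configuration>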

It should be noted that, when running on an EC2 instance, credentials can normally be picked up automatically by the AWS SDK. This does not work when using the Hadoop AWS libraries; attempting to rely on it resulted in the following error:


“Unable to create core [TestCollection_shard1_replica1] Caused by: AWS Access Key ID and Secret
Access Key must be specified by setting the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey
properties (respectively).”

Starting Solr

Startup is similar to starting a Solr instance for SolrCloud as described in the Solr documentation (https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS), with the exception of the solr.hdfs.confdir setting, which needs to be set explicitly since HDFS is not actually being used. That directory is where the configuration file from above should be placed. Below is a sample start command:

sudo bin/solr start -c -Dsolr.directoryFactory=solr.HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.hdfs.home=s3n://cahill-solr-test/solr -Dsolr.hdfs.confdir=/opt/conf

Creating a Collection and What to Expect

Now you can create a collection. After creating the collection, you should see objects in your bucket. Since S3 is an object store and not a filesystem, you'll see objects like "solr_$folder$", which is how the Hadoop S3 client fakes a folder structure on S3. The structure of the collections will be the same as if they were on HDFS. After indexing your data, you should be able to query it as normal.
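For reference, creating a simple collection and inspecting the bucket might look something like the following; the collection name and bucket are just the examples used earlier in this post, so adjust them to your own setup.

# Create a one-shard collection using the standard Solr tooling
bin/solr create -c TestCollection -shards 1 -replicationFactor 1

# List the objects Solr wrote to the bucket; note the "$folder$" marker objects
aws s3 ls s3://cahill-solr-test/solr/ --recursive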

Notes on S3A

I was unable to get S3A to work as a storage layer for Solr due to what appears to be a known bug in HttpClient 4.3 (more information here: http://stackoverflow.com/questions/25889925/apache-poolinghttpclientconnectionmanager-throwing-illeg...). The error I was receiving can be found at the bottom of this post.

Conclusion

Running Solr with indexes backed by S3 is pretty neat. It opens up some new deployment possibilities and, more importantly, allows Solr to run against a distributed store without the overhead of HDFS or Hadoop. Hadoop 3.0 is right around the corner and includes improvements to S3A. With the implementation of SOLR-9515 (https://issues.apache.org/jira/browse/SOLR-9515), hopefully this solution can be modified to use S3A, which should provide better performance.

S3A Error

[Image in the original post: the S3A error stack trace (running solr on s3.jpg).]

 

 

 

Comments
by Enthusiast
on 03-23-2017 01:04 PM

Could you clarify how to avoid getting this error:

 

Caused by: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

by Enthusiast
on 03-27-2017 01:54 PM

Solved !!!

The trick was to rename the conf file to hdfs-site.xml.