Running Solr on HDFS has become the standard approach when running Solr alongside a Hadoop cluster. As our team begins a transition to the cloud, and specifically AWS, we have been exploring the new opportunities and technologies that come with it. One of those is running Solr natively on EC2, with our indexes backed by S3. The reasoning behind running Solr on S3 is similar to that for running Solr on HDFS, except that HDFS and Hadoop are no longer needed. This allows us to separate our Solr deployment from our Hadoop cluster.
After chatting with a colleague about running Solr on S3, he mentioned that the Hadoop filesystem layer supports running on S3. So, I proceeded to pursue running Solr on S3 via the HdfsDirectoryFactory. The setup is very similar to running Solr on HDFS; the only changes needed are swapping in some libraries and implementing and packaging a small class. Solr 6.2.1 comes with the following libraries: hadoop-annotations-2.7.2.jar, hadoop-auth-2.7.2.jar, hadoop-common-2.7.2.jar, and hadoop-hdfs-2.7.2.jar. The hadoop-aws jar is the only missing Hadoop jar, and it is required to access S3.
Handling Hadoop Libraries
During my testing, I worked with the Hadoop 2.7.2, HDP 2.5.3.0-37, and CDH 5.9.0 libraries. Both the CDH 5.9.0 and HDP 2.5.3 libraries worked, although older versions of CDH and HDP did not. This appears to be because those versions of the hadoop-aws library use aws-java-sdk-s3 (version 1.10.6) rather than the aws-java-sdk (version 1.7.4) used by other, older hadoop-aws libraries. In addition to the Hadoop jars, htrace-core4-4.0.1-incubating.jar and jets3t-0.9.0.jar need to be included.
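Staging the jars looked roughly like the following (the Solr install path and jar versions are illustrative; match them to your Solr and Hadoop distributions):

```shell
# Directory holding the Hadoop jars bundled with Solr 6.2.1
# (path is an assumption; adjust to your install).
SOLR_LIB=/opt/solr/server/solr-webapp/webapp/WEB-INF/lib

# Remove the stock Hadoop 2.7.2 jars, then drop in the distribution's
# Hadoop jars plus hadoop-aws and its extra dependencies.
rm "$SOLR_LIB"/hadoop-*-2.7.2.jar
cp hadoop-annotations-*.jar hadoop-auth-*.jar hadoop-common-*.jar \
   hadoop-hdfs-*.jar hadoop-aws-*.jar \
   htrace-core4-4.0.1-incubating.jar jets3t-0.9.0.jar "$SOLR_LIB"/
```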
Changing out the Hadoop libraries isn't enough to make this work. The implementation of HdfsDirectoryFactory uses the AbstractFileSystem interface, which S3N and S3A do not have implementations of until Hadoop 2.8. So for now, a workaround is to generate a jar containing the following code:
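A minimal sketch of that class (the package and class name here are my choice; it compiles against hadoop-common and the jets3t-backed NativeS3FileSystem):

```java
package org.apache.hadoop.fs.s3native;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.DelegateToFileSystem;

/**
 * AbstractFileSystem implementation for the s3n scheme, delegating to
 * NativeS3FileSystem. Mirrors the S3A class from the Hadoop 2.8 branch.
 */
public class S3N extends DelegateToFileSystem {

  public S3N(URI theUri, Configuration conf)
      throws IOException, URISyntaxException {
    // "s3n" is the URI scheme; false means an authority (host/port)
    // is not required in the URI.
    super(theUri, new NativeS3FileSystem(), conf, "s3n", false);
  }
}
```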
This code was copied from the S3A class created in the Hadoop 2.8 branch, except that the arguments to the super constructor use NativeS3FileSystem instead of S3AFileSystem and "s3n" instead of "s3a" for the scheme.
When running Solr on HDFS, the HDFS client configuration is normally picked up by Solr automatically. Since we are not using HDFS here, we need to provide our own configuration file. The smallest configuration I used was as follows:
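A minimal core-site.xml-style configuration might look like this (the credential values are placeholders, and the fully qualified name of the custom AbstractFileSystem class is an assumption based on how it was packaged):

```xml
<configuration>
  <!-- Credentials must be set explicitly; see the note below. -->
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
  <!-- Wire the s3n scheme to the custom AbstractFileSystem class. -->
  <property>
    <name>fs.AbstractFileSystem.s3n.impl</name>
    <value>org.apache.hadoop.fs.s3native.S3N</value>
  </property>
</configuration>
```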
It should be noted that when running on an EC2 instance, credentials can normally be picked up automatically by the AWS SDK. This does not work when using the Hadoop AWS libraries; attempting to rely on it resulted in the following error:
“Unable to create core [TestCollection_shard1_replica1] Caused by: AWS Access Key ID and Secret Access Key must be specified by setting the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties (respectively).”
Startup is similar to starting a Solr instance for SolrCloud as described in the Solr documentation (https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS), with the exception of the setting solr.hdfs.confdir, which needs to be set since HDFS is not actually used. This is the directory where the configuration file from above should be placed. Below is a sample start command: “sudo bin/solr start -c -Dsolr.directoryFactory=solr.HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.hdfs.home=s3n://cahill-solr-test/solr -Dsolr.hdfs.confdir=/opt/conf”
Creating a Collection and What to Expect
Now you can create a collection. After creating the collection, you should see objects in your bucket. Since S3 is an object store and not a filesystem, you’ll see objects like “solr_$folder$”, which is how the Hadoop S3 filesystem layer emulates a folder structure on S3. The structure of the collections will be the same as if they were on HDFS. After indexing your data, you should be able to query it as normal.
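For example, a collection can be created with the standard Solr tooling, then the bucket inspected (the collection name, counts, and bucket are illustrative, reusing the bucket from the start command):

```shell
# Create a one-shard collection; bin/solr uploads a configset and
# calls the Collections API for us.
bin/solr create -c TestCollection -shards 1 -replicationFactor 1

# List the resulting objects; expect index files plus the
# "*_$folder$" placeholder objects that emulate directories.
aws s3 ls --recursive s3://cahill-solr-test/solr/
```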
Running Solr with indexes backed by S3 is pretty neat. It opens up new deployment possibilities and, more importantly, allows Solr to run on a distributed store without the overhead of HDFS or Hadoop. Hadoop 3.0 is right around the corner and includes improvements to s3a. With the implementation of SOLR-9515 (https://issues.apache.org/jira/browse/SOLR-9515), hopefully this solution can be modified to use s3a, which should provide better performance.