This is the documentation for Cloudera Enterprise 5.12.x. Documentation for other versions is available at Cloudera Documentation.

Using MapReduce Batch Indexing to Index Sample Tweets

The following sections include examples that illustrate how to use MapReduce to index tweets. These examples require that you:

Complete the process of Preparing to Index Sample Tweets with Cloudera Search.
Install the MapReduce tools for Cloudera Search as described in Installing MapReduce Tools for use with Cloudera Search.

Batch Indexing into Online Solr Servers Using GoLive

Warning: Batch indexing into offline Solr shards is not supported in environments in which batch indexing into online Solr servers using GoLive occurs.

MapReduceIndexerTool is a MapReduce batch job driver that creates a set of Solr index shards from a set of input files and writes the indexes into HDFS in a flexible, scalable, and fault-tolerant manner. Using GoLive, MapReduceIndexerTool also supports merging the output shards into a set of live customer-facing Solr servers. The following steps demonstrate these capabilities.

Delete any existing documents in the tweets collection.
```
$ solrctl collection --deletedocs tweets
```

Run the MapReduce job with the --go-live parameter. Replace nn01.example.com and zk01.example.com with your NameNode and ZooKeeper hostnames, respectively.

Parcel-based Installation:

$ hadoop jar \
/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://nn01.example.com:8020/user/jdoe/outdir --verbose --go-live \
--zk-host zk01.example.com:2181/solr --collection tweets \
hdfs://nn01.example.com:8020/user/jdoe/indir

Package-based Installation:

$ hadoop jar \
/usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://nn01.example.com:8020/user/jdoe/outdir --verbose --go-live \
--zk-host zk01.example.com:2181/solr --collection tweets \
hdfs://nn01.example.com:8020/user/jdoe/indir

This command requires a morphline file, which must include a SOLR_LOCATOR directive. Any CLI parameters for --zkhost and --collection override the parameters of the solrLocator. The SOLR_LOCATOR directive might appear as follows:

SOLR_LOCATOR : {
  # Name of solr collection
  collection : collection_name

  # ZooKeeper ensemble
  zkHost : "zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      { generateUUID { field : id } }

      { # Remove record fields that are unknown to Solr schema.xml.
        # Recall that Solr throws an exception on any attempt to load a document that
        # contains a field that isn't specified in schema.xml.
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR} # Location from which to fetch Solr schema
        }
      }

      { logDebug { format : "output record: {}", args : ["@{}"] } }

      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]

For help on how to run the MapReduce job, run the following command:

Parcel-based Installation:

$ hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool --help

Package-based Installation:

$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool --help

For development purposes, you can use the --dry-run option to run in local mode and print documents to stdout instead of loading them into Solr. Using this option causes the morphline to run locally without submitting a job to MapReduce. Running locally provides faster turnaround during early trial and debug sessions.

To print diagnostic information, such as the content of records as they pass through the morphline commands, enable TRACE log level diagnostics by adding the following entry to your log4j.properties file:

log4j.logger.org.kitesdk.morphline=TRACE

The log4j.properties file can be passed by using the MapReduceIndexerTool --log4j command-line option.

Check the job status at:
- MRv1: http://jt01.example.com:50030/jobtracker.jsp
- YARN: http://rm01.example.com:8090/cluster
When the job is complete, run some Solr queries. For example, for a Solr server running on search01.example.com, go to http://search01.example.com:8983/solr/tweets/select?q=*%3A*&wt=json&indent=true in a browser. If indexing was successful, this page displays the first 10 query results.

Batch Indexing into Offline Solr Shards

Running the MapReduce job without GoLive causes the job to create a set of Solr index shards from a set of input files and write the indexes to HDFS. You can then explicitly point each Solr server to one of the HDFS output shard directories.

Batch indexing into offline Solr shards is mainly intended for offline use-cases by advanced users. Cases requiring read-only indexes for searching can be handled by using batch indexing without the --go-live option. By not using GoLive, you can avoid copying datasets between segments, thereby reducing resource utilization.

Delete any existing documents in the tweets collection:
```
$ solrctl collection --deletedocs tweets
$ sudo -u hdfs hdfs dfs -rm -r -skipTrash /user/jdoe/outdir
```
Replace jdoe with the username you used in Preparing to Index Sample Tweets with Cloudera Search.

Run the MapReduce job as follows, replacing nn01.example.com in the command with your NameNode hostname.

Parcel-based Installation:

$ hadoop jar \
/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://nn01.example.com:8020/user/jdoe/outdir --verbose --solr-home-dir \
$HOME/tweet_config --shards 2 hdfs://nn01.example.com:8020/user/jdoe/indir

Package-based Installation:

$ hadoop jar \
/usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://nn01.example.com:8020/user/jdoe/outdir --verbose --solr-home-dir \
$HOME/tweet_config --shards 2 hdfs://nn01.example.com:8020/user/jdoe/indir

Check the job status at:
- MRv1: http://jt01.example.com:50030/jobtracker.jsp
- YARN: http://rm01.example.com:8090/cluster
After the job is completed, check the generated index files. Individual shards are written to the results directory with names of the form part-00000, part-00001, part-00002, and so on. This example has two shards:
```
$ hdfs dfs -ls /user/jdoe/outdir/results
$ hdfs dfs -ls /user/jdoe/outdir/results/part-00000/data/index
```
Stop the Solr service:
- Cloudera Manager: Solr service > Actions > Stop
- Unmanaged: On each Solr server host, run:
```
$ sudo service solr-server stop
```

Identify the paths to each Solr core:

$ hdfs dfs -ls /solr/tweets
Found 2 items
drwxr-xr-x   - solr solr          0 2017-03-13 06:20 /solr/tweets/core_node1
drwxr-xr-x   - solr solr          0 2017-03-13 06:20 /solr/tweets/core_node2

Move the index shards into place.

Remove outdated files:

$ sudo -u solr hdfs dfs -rm -r -skipTrash \
/solr/tweets/core_node1/data/index
$ sudo -u solr hdfs dfs -rm -r -skipTrash \
/solr/tweets/core_node1/data/tlog
$ sudo -u solr hdfs dfs -rm -r -skipTrash \
/solr/tweets/core_node2/data/index
$ sudo -u solr hdfs dfs -rm -r -skipTrash \
/solr/tweets/core_node2/data/tlog

Change ownership of the results directory to solr:

$ sudo -u hdfs hdfs dfs -chown -R solr /user/jdoe/outdir/results

Move the two index shards into place:

$ sudo -u solr hdfs dfs -mv /user/jdoe/outdir/results/part-00000/data/index \
/solr/tweets/core_node1/data/
$ sudo -u solr hdfs dfs -mv /user/jdoe/outdir/results/part-00001/data/index \
/solr/tweets/core_node2/data/

Start the Solr service:
- Cloudera Manager: Solr service > Actions > Start
- Unmanaged: On each Solr server host, run:
```
$ sudo service solr-server start
```
Run some Solr queries. For example, for a Solr server running on search01.example.com, go to http://search01.example.com:8983/solr/tweets/select?q=*%3A*&wt=json&indent=true in a browser. If indexing was successful, this page displays the first 10 query results.

Page generated August 14, 2017.