Preparing to Index Sample Tweets with Cloudera Search
To prepare for indexing tweets with MapReduce or Flume, complete the following steps:
- Start a SolrCloud cluster containing at least two servers (this example uses two shards) as described in Deploying Cloudera Search.
- On a host running Solr Server, make sure that the SOLR_ZK_ENSEMBLE environment variable is set in /etc/solr/conf/solr-env.sh. For example:
$ cat /etc/solr/conf/solr-env.sh
export SOLR_ZK_ENSEMBLE=zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr
If you are using Cloudera Manager, this is automatically set on hosts with a Solr Server or Gateway role.
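To confirm that the variable is exported correctly, you can source the file and echo the value. This is an optional sanity check; the output shown assumes the example ensemble above:
$ source /etc/solr/conf/solr-env.sh
$ echo $SOLR_ZK_ENSEMBLE
zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr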
- Generate the configuration files for the collection, including the tweet-specific schema.xml:
- Parcel-based Installation:
$ solrctl instancedir --generate $HOME/tweet_config
$ cp /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml \
    $HOME/tweet_config/conf
- Package-based Installation:
$ solrctl instancedir --generate $HOME/tweet_config
$ cp /usr/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml \
    $HOME/tweet_config/conf
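Before uploading, you can optionally verify the generated instance directory; the conf/ directory should contain solrconfig.xml alongside the schema.xml you just copied in:
$ ls $HOME/tweet_config/conf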
- Upload the configuration to ZooKeeper:
$ solrctl instancedir --create tweet_config $HOME/tweet_config/
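To confirm that the upload succeeded, list the instance directories registered in ZooKeeper; the output should include tweet_config:
$ solrctl instancedir --list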
- Create a new collection:
$ solrctl collection --create tweets -s 2 -c tweet_config
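You can also confirm from the command line that the collection was created; the output should include tweets:
$ solrctl collection --list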
- Verify that the collection is live. Open the Solr admin web interface in a browser at http://search01.example.com:8983/solr/#/~cloud, replacing search01.example.com with the name of any host running the Solr Server process. Look for the tweets collection.
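If a browser is not convenient, and if the CLUSTERSTATUS action is available in your Solr version, you can query the same cluster state from the command line with the Solr Collections API (again substituting the name of a host running Solr Server). Look for the tweets collection with active shards in the JSON response:
$ curl 'http://search01.example.com:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json'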
- Prepare the configuration for use with MapReduce:
$ cp -r $HOME/tweet_config $HOME/mr_tweet_config
- Copy sample tweets to HDFS:
- Parcel-based Installation:
$ sudo -u hdfs hdfs dfs -mkdir -p /user/jdoe
$ sudo -u hdfs hdfs dfs -chown jdoe:jdoe /user/jdoe
$ hdfs dfs -mkdir -p /user/jdoe/indir
$ hdfs dfs -put /opt/cloudera/parcels/CDH/share/doc/search*/examples/test-documents/sample-statuses-*.avro \
    /user/jdoe/indir/
$ hdfs dfs -ls /user/jdoe/indir
- Package-based Installation:
$ sudo -u hdfs hdfs dfs -mkdir -p /user/jdoe
$ sudo -u hdfs hdfs dfs -chown jdoe:jdoe /user/jdoe
$ hdfs dfs -mkdir -p /user/jdoe/indir
$ hdfs dfs -put /usr/share/doc/search*/examples/test-documents/sample-statuses-*.avro \
    /user/jdoe/indir/
$ hdfs dfs -ls /user/jdoe/indir
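To see what the sample records look like before indexing them, you can dump the first few records of an Avro file to JSON. This sketch assumes the avro-tools utility shipped with CDH is on your PATH; adjust the source path for a package-based installation:
$ avro-tools tojson \
    $(ls /opt/cloudera/parcels/CDH/share/doc/search*/examples/test-documents/sample-statuses-*.avro | head -1) \
    | head -3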
- Ensure that the output directory (outdir) exists in HDFS and is empty:
$ hdfs dfs -rm -r -skipTrash /user/jdoe/outdir
$ hdfs dfs -mkdir /user/jdoe/outdir
$ hdfs dfs -ls /user/jdoe/outdir
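Note that the hdfs dfs -rm command exits with an error if /user/jdoe/outdir does not yet exist; on a first run that error is harmless. To make the step idempotent, the -f flag suppresses the error for a missing path:
$ hdfs dfs -rm -f -r -skipTrash /user/jdoe/outdir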
The sample tweets are now in HDFS and ready to be indexed. Continue to Using MapReduce Batch Indexing to Index Sample Tweets to index the sample tweets or to Near Real Time (NRT) Indexing Tweets Using Flume to index live tweets from the Twitter firehose.