Extracting Metadata from S3 Buckets
Cloudera Navigator supports the following extraction types for S3 metadata:
- Bulk
- Incremental
- Bulk and incremental combined (the default configuration)
The steps below assume that the cluster has already been configured to use S3 and that credentials for the AWS account hosting the S3 buckets have been added to the Cloudera Manager instance.
Amazon Simple Queue Service (SQS)
Amazon Simple Notification Service (SNS)
Extraction type | Advantages | Disadvantages |
---|---|---|
Bulk + Incremental | | |
Bulk only | | |
Creating IAM Policy Documents
To enable S3 data extraction for Cloudera Navigator, you must create a policy document in AWS and attach it to the AWS user associated with the Cloudera Navigator instance. Each extraction method described below includes the policy document that you must create to enable that method.
Bulk and Incremental Extraction
By default, Cloudera Navigator uses combined bulk and incremental extraction. First, the bulk process extracts all metadata. Subsequent extractions are incremental, that is, only changes are extracted and coalesced with the bulk extraction.

Policy Document - Bulk and Incremental Extraction
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1481678612000",
      "Effect": "Allow",
      "Action": [
        "sqs:CreateQueue",
        "sqs:DeleteMessage",
        "sqs:DeleteMessageBatch",
        "sqs:GetQueueAttributes",
        "sqs:GetQueueUrl",
        "sqs:ReceiveMessage",
        "sqs:SetQueueAttributes"
      ],
      "Resource": "*"
    },
    {
      "Sid": "Stmt1481678744000",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:GetBucketNotification",
        "s3:PutBucketNotification"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    }
  ]
}
Bulk Extraction Only

Because all S3 data is re-extracted each time it is run, bulk extraction is:
Setup
nav.s3.extractor.incremental.enable=false
Policy Document - Bulk Extraction
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1481676614000",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    }
  ]
}
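The bulk policy and the bulk and incremental policy differ only in the permissions that incremental extraction needs: the SQS actions and the two S3 bucket-notification actions. The following Python sketch makes that difference explicit by generating either document from the action lists shown above. The function names `build_policy` and `policy_actions` are hypothetical helpers, not part of Cloudera Navigator or AWS, and the sketch omits the `Sid` fields, which are optional in IAM policy statements.

```python
import json

# S3 read actions required by every extraction mode (from the bulk policy).
S3_READ_ACTIONS = [
    "s3:GetBucketLocation",
    "s3:ListAllMyBuckets",
    "s3:ListBucket",
    "s3:GetObject",
    "s3:GetObjectAcl",
]

# S3 actions that appear only in the bulk and incremental policy.
S3_NOTIFICATION_ACTIONS = [
    "s3:GetBucketNotification",
    "s3:PutBucketNotification",
]

# SQS actions from the bulk and incremental policy.
SQS_ACTIONS = [
    "sqs:CreateQueue",
    "sqs:DeleteMessage",
    "sqs:DeleteMessageBatch",
    "sqs:GetQueueAttributes",
    "sqs:GetQueueUrl",
    "sqs:ReceiveMessage",
    "sqs:SetQueueAttributes",
]


def build_policy(incremental: bool) -> dict:
    """Return a policy document for bulk-only or bulk and incremental extraction."""
    statements = []
    if incremental:
        statements.append(
            {"Effect": "Allow", "Action": list(SQS_ACTIONS), "Resource": "*"}
        )
    s3_actions = S3_READ_ACTIONS + (S3_NOTIFICATION_ACTIONS if incremental else [])
    statements.append(
        {"Effect": "Allow", "Action": s3_actions, "Resource": ["arn:aws:s3:::*"]}
    )
    return {"Version": "2012-10-17", "Statement": statements}


def policy_actions(policy: dict) -> set:
    """Collect every action named in a policy document."""
    return {action for stmt in policy["Statement"] for action in stmt["Action"]}


if __name__ == "__main__":
    # Print the bulk-only document for pasting into the AWS policy editor.
    print(json.dumps(build_policy(incremental=False), indent=2))
```

Generating both documents from shared lists keeps the common S3 read permissions in one place, so the two policies cannot drift apart when one is edited.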
Event Notification for an External Queue
If you already have S3 event notification configured for any S3 buckets, you must configure extraction to use that external queue. This requires you to set up the queues and configure event notification yourself. "Bring your own queue" extraction is recommended for production environments.
- SQS queue for each region in which you have buckets.
- S3 event notification for each bucket to send change events to the Navigator queue.
- Full control over your AWS environment.
- Performance level is high.
- Requires significant manual setup and configuration.
Setting Up External Queue for Event Notification

- Stop Cloudera Navigator.
- Log in to your account using the AWS Management Console and open the Amazon SQS setup page.
- Create a queue with the following settings:
- Default Visibility Timeout: 10 minutes
- Message Retention Period: 14 days
- Delivery Delay: 0 seconds
- Receive Message Wait Time: 0 seconds
- Select the queue you created, click the Permissions tab, click Add a Permission, and configure the following in the Add a Permission to... dialog box:
- Effect: Allow
- Principal: Everybody
- Actions: SendMessage
In the Conditions (optional) area, set the following values:
- Qualifier: None
- Condition: ArnLike
- Key: aws:SourceArn
- Value: arn:aws:s3::*:*
When finished, click Add Condition, and then click Add Permission.
- Set up a queue in each region in which you have buckets.
- Configure event notification for every bucket:
- Name: nav-send-metadata-on-change
- Events: ObjectCreated(All) and ObjectRemoved(All)
- Send to: SQS queue
- SQS queue: The name of your queue
- Configure SNS fanout if you have existing S3 event notification. For more information about SNS fanout, see Common SNS Scenarios.
- In Cloudera Manager, add the following to the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties:
nav.s3.extractor.incremental.enable=true
nav.s3.extractor.incremental.auto_setup.enable=false
nav.s3.extractor.incremental.queues=queue_json
The queue_json value is specified in the following JSON format, without any spaces. Use two backslash characters (\\) to escape commas (,) as shown here:
[{"region":"us-west-1"\\,"queueUrl":"https://sqs.aws_region.amazonaws.com/account_num/queue_name"}\\,{queue_2}\\, ... {queue_n}]
Important: Cloudera Navigator extracts metadata from one queue only for each region.
- Restart Cloudera Navigator.
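The queue settings and the queue_json value from the steps above are easy to get wrong by hand, so the following Python sketch shows one way to produce them. It is an illustration under stated assumptions: `queue_attributes` and `build_queue_json` are hypothetical helper names, the account number and queue name in the usage example are placeholders, and the dictionary keys in `queue_attributes` are the attribute names used by the SQS CreateQueue API, with the console values converted to seconds. The sketch also assumes that no region name or queue URL contains a comma, because every comma in the output is escaped.

```python
import json


def queue_attributes() -> dict:
    """Queue settings from the setup steps, expressed in seconds as strings."""
    return {
        "VisibilityTimeout": str(10 * 60),                # 10 minutes
        "MessageRetentionPeriod": str(14 * 24 * 60 * 60),  # 14 days
        "DelaySeconds": "0",
        "ReceiveMessageWaitTimeSeconds": "0",
    }


def build_queue_json(queues: dict) -> str:
    """Build the value for nav.s3.extractor.incremental.queues.

    `queues` maps an AWS region to a queue URL. Using the region as the
    dictionary key enforces the one-queue-per-region restriction.
    """
    entries = [{"region": region, "queueUrl": url} for region, url in queues.items()]
    # Compact separators produce JSON without any spaces.
    compact = json.dumps(entries, separators=(",", ":"))
    # Escape every comma with two backslash characters, as the property requires.
    return compact.replace(",", "\\\\,")


if __name__ == "__main__":
    # Placeholder account number and queue name, for illustration only.
    value = build_queue_json(
        {"us-west-1": "https://sqs.us-west-1.amazonaws.com/123456789012/nav-queue"}
    )
    print("nav.s3.extractor.incremental.queues=" + value)
```

Building the value from a region-keyed dictionary means a second queue for the same region replaces the first instead of producing a configuration that Cloudera Navigator would only partly read.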
Policy Document - External Queue
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1481678612000",
      "Effect": "Allow",
      "Action": [
        "sqs:DeleteMessage",
        "sqs:DeleteMessageBatch",
        "sqs:GetQueueAttributes",
        "sqs:ReceiveMessage"
      ],
      "Resource": "*"
    },
    {
      "Sid": "Stmt1481678744000",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:GetBucketNotification",
        "s3:PutBucketNotification"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    }
  ]
}
Navigator S3 Extraction Options
Option | Description | Default |
---|---|---|
nav.s3.extractor.max_threads | Number of extractors (worker processes) to run in parallel. | 3 |
nav.s3.extractor.enable | Enables (true) and disables (false) S3 extraction. Set to true if AWS Credentials are provided by Cloudera Manager to Cloudera Navigator (see Enabling Cloudera Navigator Access to Amazon S3). | true |
nav.s3.extractor.incremental.enable | Enables (true) incremental extraction. Setting to false activates bulk extraction. Changing this value requires restarting Cloudera Navigator for the value to take effect. | true |
nav.s3.extractor.incremental.batch_size | Number of messages kept in memory at any given time during the extraction process. | 1000 |
nav.s3.extractor.incremental.auto_setup.enable | Autoconfigure queues and configure S3 event notification. Set to false to use “bring your own queue”. | true |
nav.s3.extractor.incremental.queues | List of queues to use in external queue use case. | N/A |
nav.aws.api.limit | Maximum number of API calls that Cloudera Navigator can make per month to AWS for SQS. | 5,000,000,000 |
nav.sqs.max_receive_count | Number of retries for inconsistent SQS messages (inconsistent due to eventual consistency). | 10 |
nav.s3.implicit.batch_size | Number of Solr documents held in memory as implicit-folder state is being updated. | 1000 |
nav.s3.home_region | AWS region nearest to the cluster and the Cloudera Navigator instance. Select the same AWS region (or the nearest one geographically) to minimize latency for API requests. | us-west-1 |
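Three of the options in the table together determine which extraction method is in effect. The following Python sketch reads a safety-valve snippet, applies the defaults listed in the table, and reports the resulting method. The helper names `parse_properties` and `extraction_mode` and the returned labels are hypothetical; the sketch only mirrors the option descriptions above and is not how Cloudera Navigator itself evaluates its configuration.

```python
# Defaults copied from the options table.
DEFAULTS = {
    "nav.s3.extractor.enable": "true",
    "nav.s3.extractor.incremental.enable": "true",
    "nav.s3.extractor.incremental.auto_setup.enable": "true",
}


def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def extraction_mode(props: dict) -> str:
    """Return the extraction method implied by the given properties."""
    merged = {**DEFAULTS, **props}

    def flag(name: str) -> bool:
        return merged[name].lower() == "true"

    if not flag("nav.s3.extractor.enable"):
        return "disabled"
    if not flag("nav.s3.extractor.incremental.enable"):
        return "bulk only"
    if flag("nav.s3.extractor.incremental.auto_setup.enable"):
        return "bulk and incremental"
    return "external queue"


if __name__ == "__main__":
    snippet = """
    nav.s3.extractor.incremental.enable=true
    nav.s3.extractor.incremental.auto_setup.enable=false
    """
    print(extraction_mode(parse_properties(snippet)))
```

A check like this can be run against a snippet before restarting Cloudera Navigator to confirm that the properties select the intended method.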
©2016 Cloudera, Inc. All rights reserved