This is the documentation for Cloudera Enterprise 5.12.x. Documentation for other versions is available at Cloudera Documentation.

Extracting Metadata from S3 Buckets

As discussed in Cloudera Navigator and Amazon S3 Storage, Cloudera Navigator can extract metadata about objects stored in Amazon Web Services (AWS) S3 buckets. The extraction process can be configured to run in one of the following ways:
  • Bulk
  • Incremental
  • Bulk and incremental combined (the default configuration)

The steps below assume that the cluster has already been configured to use S3 and that credentials for the AWS account hosting the S3 buckets have been added to the Cloudera Manager instance.

Amazon Simple Queue Service (SQS)

Amazon Simple Notification Service (SNS)

Creating IAM Policy Documents

To enable S3 data extraction for Cloudera Navigator, you must create a policy document in AWS and attach that policy document to the AWS user associated with the Cloudera Navigator instance. Each of the extraction methods described below includes a policy document that you create to enable extraction.

Bulk and Incremental Extraction

By default, Cloudera Navigator uses combined bulk and incremental extraction. The first extraction is a bulk extraction of all metadata. Subsequent extractions are incremental: only changes are extracted, and they are coalesced with the results of the bulk extraction.

  Important: Use combined bulk and incremental extraction if you do not have existing S3 event notification configured for any S3 buckets.
Bulk and incremental extraction is recommended for production environments.
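
The relationship between the two phases can be illustrated with a small sketch. This is a conceptual model only, not Cloudera Navigator's implementation: the function names, the event dictionaries, and the sample object keys are all hypothetical, and the event types simply mirror the S3 ObjectCreated and ObjectRemoved notifications.

```python
def bulk_extract(listing):
    """Build a full snapshot (object key -> metadata) from a complete listing."""
    return {obj["key"]: obj["metadata"] for obj in listing}


def apply_incremental(snapshot, events):
    """Coalesce change events, in order, into a copy of an existing snapshot."""
    merged = dict(snapshot)
    for event in events:
        if event["type"] == "ObjectCreated":
            merged[event["key"]] = event["metadata"]
        elif event["type"] == "ObjectRemoved":
            merged.pop(event["key"], None)
    return merged


# Hypothetical sample data: one bulk pass, then two change events.
listing = [
    {"key": "logs/a.csv", "metadata": {"size": 10}},
    {"key": "logs/b.csv", "metadata": {"size": 20}},
]
snapshot = bulk_extract(listing)
events = [
    {"type": "ObjectCreated", "key": "logs/c.csv", "metadata": {"size": 30}},
    {"type": "ObjectRemoved", "key": "logs/a.csv"},
]
current = apply_incremental(snapshot, events)
print(sorted(current))
```

Only the two changed objects are touched in the incremental pass, whereas a second bulk pass would have to list every object again.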

Policy Document - Bulk and Incremental Extraction

To enable bulk and incremental extraction, create the following policy document by copying the policy text and pasting it in the policy editor, and then attaching it to the Navigator user in AWS.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1481678612000",
            "Effect": "Allow",
            "Action": [
                "sqs:CreateQueue",
                "sqs:DeleteMessage",
                "sqs:DeleteMessageBatch",
                "sqs:GetQueueAttributes",
                "sqs:GetQueueUrl",
                "sqs:ReceiveMessage",
                "sqs:SetQueueAttributes"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Stmt1481678744000",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAcl",
                "s3:GetBucketNotification",
                "s3:PutBucketNotification"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}

Bulk Extraction Only

Use bulk extraction only for proof-of-concept or demonstration environments.
  Note: Using bulk extraction alone is not recommended for production environments.

Because all S3 data is re-extracted each time it is run, bulk extraction is:

Setup

To set up bulk extraction only, add the following to the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties in Cloudera Manager, and then restart the metadata server:
nav.s3.extractor.incremental.enable=false

Policy Document - Bulk Extraction

To enable bulk extraction only, create the following policy document by copying the policy text and pasting it in the policy editor, and then attaching it to the Navigator user in AWS.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1481676614000",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}

Event Notification for an External Queue

If you have existing S3 event notification configured for any S3 buckets, you must configure extraction to use an external queue. This requires you to set up the queues and configure event notification yourself. "Bring your own queue" extraction is recommended for production environments.

AWS Resources Created by the User:
  • SQS queue for each region in which you have buckets.
  • S3 event notification for each bucket to send change events to the Navigator queue.

Advantages:
  • Full control over your AWS environment.
  • Performance level is high.

Disadvantages:
  • Requires significant manual setup and configuration.
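
The two user-created resources listed above can also be described as data, which makes them easier to review and to repeat per region. The following is a minimal sketch that only builds the request structures. It takes the queue settings, the permission, and the notification name from the setup steps in the next section. The function names and the sample queue ARN are hypothetical. The attribute and field names are those used by the SQS CreateQueue and S3 bucket notification APIs.

```python
import json

NOTIFICATION_NAME = "nav-send-metadata-on-change"  # name used in the setup steps


def queue_attributes(queue_arn):
    """Attributes for an SQS queue matching the documented queue settings."""
    access_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",  # "Everybody" in the console dialog
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnLike": {"aws:SourceArn": "arn:aws:s3::*:*"}},
        }],
    }
    return {
        "VisibilityTimeout": str(10 * 60),                  # 10 minutes, in seconds
        "MessageRetentionPeriod": str(14 * 24 * 60 * 60),   # 14 days, in seconds
        "DelaySeconds": "0",
        "ReceiveMessageWaitTimeSeconds": "0",
        "Policy": json.dumps(access_policy),
    }


def bucket_notification(queue_arn):
    """Notification configuration that sends create and remove events to the queue."""
    return {
        "QueueConfigurations": [{
            "Id": NOTIFICATION_NAME,
            "QueueArn": queue_arn,
            "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
        }]
    }


# Hypothetical ARN; substitute your own region, account number, and queue name.
arn = "arn:aws:sqs:us-west-1:123456789012:nav-queue"
attrs = queue_attributes(arn)
notification = bucket_notification(arn)
print(attrs["VisibilityTimeout"], attrs["MessageRetentionPeriod"])
```

If you use an AWS SDK such as boto3, these structures would typically be passed to the SQS `create_queue` call (as `Attributes`) and the S3 `put_bucket_notification_configuration` call (as `NotificationConfiguration`). Note that the S3 call replaces the bucket's entire notification configuration, so merge any existing entries before applying it.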

Setting Up External Queue for Event Notification

To set up event notification for an external queue, you must create a queue from which Cloudera Navigator can extract, and then configure S3 event notifications to send events to that queue.
  Important: Before adding data to newly created S3 buckets, configure AWS S3 event notification.
To create the event notification queue instead of having Cloudera Navigator set it up for you, do the following:
  1. Stop Cloudera Navigator.
  2. Log in to your account using the AWS Management Console and open the Amazon SQS setup page.
  3. Create a queue with the following settings:
    • Default Visibility Timeout: 10 minutes
    • Message Retention Period: 14 days
    • Delivery Delay: 0 seconds
    • Receive Message Wait Time: 0 seconds
  4. Select the queue you created, click the Permissions tab, click Add a Permission, and configure the following in the Add a Permission to... dialog box:
    • Effect: Allow
    • Principal: Everybody
    • Actions: SendMessage
      In the Conditions (optional) area, set the following values:
      • Qualifier: None
      • Condition: ArnLike
      • Key: aws:SourceArn
      • Value: arn:aws:s3::*:*

      When finished, click Add Condition, and then click Add Permission.

  5. Set up a queue in each region in which you have buckets.
  6. Configure event notification for every bucket:
    • Name: nav-send-metadata-on-change
    • Events: ObjectCreated(All) and ObjectRemoved(All)
    • Send to: SQS queue
    • SQS queue: The name of your queue
  7. Configure SNS fanout if you have existing S3 event notification. For more information about SNS fanout, see Common SNS Scenarios.
  8. Add the following to the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties in Cloudera Manager:
    nav.s3.extractor.incremental.enable=true
    nav.s3.extractor.incremental.auto_setup.enable=false
    nav.s3.extractor.incremental.queues=queue_json
    The queue_json is specified using the following JSON format, without any spaces. Use two backslash characters (\\) to escape commas (,) as shown here:
    [{"region":"us-west-1"\\,"queueUrl":"https://sqs.aws_region.amazonaws.com/account_num/queue_name"}\\,{queue_2}\\,
    ...
    {queue_n}]
    
    
      Important: Cloudera Navigator extracts metadata from one queue only for each region.
  9. Restart Cloudera Navigator.
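
Because the queue_json value in step 8 must contain no spaces and must escape every comma with two backslashes, it is easy to get wrong by hand. The following sketch generates the value from a list of region and queue URL pairs. The function name, account number, and queue names are hypothetical. The check for duplicate regions reflects the one-queue-per-region restriction noted above.

```python
import json


def build_queue_json(queues):
    """Render (region, queue_url) pairs as the escaped queue_json value."""
    seen = set()
    entries = []
    for region, queue_url in queues:
        if region in seen:
            raise ValueError(f"more than one queue listed for region {region}")
        seen.add(region)
        entries.append({"region": region, "queueUrl": queue_url})
    # Compact separators remove all spaces from the JSON text.
    compact = json.dumps(entries, separators=(",", ":"))
    # Escape each comma with two backslash characters.
    return compact.replace(",", "\\\\,")


# Hypothetical queues; substitute your own regions and queue URLs.
value = build_queue_json([
    ("us-west-1", "https://sqs.us-west-1.amazonaws.com/123456789012/nav-queue"),
    ("us-east-1", "https://sqs.us-east-1.amazonaws.com/123456789012/nav-queue"),
])
print("nav.s3.extractor.incremental.queues=" + value)
```

The sketch escapes every comma in the rendered text, which is safe here because regions and queue URLs do not themselves contain commas.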

Policy Document - External Queue

To enable event notification for an external queue, create the following policy document by copying the policy text and pasting it in the policy editor, and then attaching it to the Cloudera Navigator user in AWS.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1481678612000",
            "Effect": "Allow",
            "Action": [
                "sqs:DeleteMessage",
                "sqs:DeleteMessageBatch",
                "sqs:GetQueueAttributes",
                "sqs:ReceiveMessage"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Stmt1481678744000",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectAcl",
                "s3:GetBucketNotification",
                "s3:PutBucketNotification"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}
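
The three policy documents differ only in which actions they allow, so a policy can be checked against the extraction method you intend to use before you attach it. The following is a rough sketch: the mode labels and function name are hypothetical, the action lists are transcribed from the policy documents in this section, and the check looks only at allowed action names. It does not evaluate Resource, Condition, or Deny statements.

```python
from fnmatch import fnmatchcase

S3_READ = ["s3:GetBucketLocation", "s3:ListAllMyBuckets", "s3:ListBucket",
           "s3:GetObject", "s3:GetObjectAcl"]
S3_NOTIFY = ["s3:GetBucketNotification", "s3:PutBucketNotification"]
SQS_CONSUME = ["sqs:DeleteMessage", "sqs:DeleteMessageBatch",
               "sqs:GetQueueAttributes", "sqs:ReceiveMessage"]
SQS_SETUP = ["sqs:CreateQueue", "sqs:GetQueueUrl", "sqs:SetQueueAttributes"]

# Hypothetical mode labels mapped to the actions each policy document lists.
REQUIRED = {
    "bulk": S3_READ,
    "external_queue": S3_READ + S3_NOTIFY + SQS_CONSUME,
    "bulk_and_incremental": S3_READ + S3_NOTIFY + SQS_CONSUME + SQS_SETUP,
}


def missing_actions(policy, mode):
    """Return the actions required for the mode that no Allow statement lists."""
    allowed = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        allowed.extend([actions] if isinstance(actions, str) else actions)
    # Action names are compared case-insensitively; "*" wildcards are honored.
    return [
        action for action in REQUIRED[mode]
        if not any(fnmatchcase(action.lower(), pattern.lower()) for pattern in allowed)
    ]


# The bulk-only policy from this page, reduced to the fields the check uses.
bulk_policy = {"Statement": [{"Effect": "Allow", "Action": S3_READ}]}
print(missing_actions(bulk_policy, "bulk"))
print(missing_actions(bulk_policy, "external_queue"))
```

An empty list means the policy names every action the chosen method needs. A non-empty list shows what would have to be added, for example when moving from bulk extraction to an external queue.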

Navigator S3 Extraction Options

You can use the following S3 extraction options to configure extraction from S3. The options are available in Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties in Cloudera Manager.
Table 1. S3 Extraction Options in Cloudera Navigator

| Option | Description | Default |
| --- | --- | --- |
| nav.s3.extractor.max_threads | Number of extractors (worker processes) to run in parallel. | 3 |
| nav.s3.extractor.enable | Enables (true) and disables (false) S3 extraction. Set to true if AWS Credentials are provided by Cloudera Manager to Cloudera Navigator (see Enabling Cloudera Navigator Access to Amazon S3). | true |
| nav.s3.extractor.incremental.enable | Enables (true) incremental extraction. Setting to false activates bulk extraction. Changing this value requires restarting Cloudera Navigator for the value to take effect. | true |
| nav.s3.extractor.incremental.batch_size | Number of messages kept in memory at any given time during the extraction process. | 1000 |
| nav.s3.extractor.incremental.auto_setup.enable | Autoconfigure queues and configure S3 event notification. Set to false to use "bring your own queue". | true |
| nav.s3.extractor.incremental.queues | List of queues to use in external queue use case. | N/A |
| nav.aws.api.limit | Maximum number of API calls that Cloudera Navigator can make per month to AWS for SQS. | 5,000,000,000 |
| nav.sqs.max_receive_count | Number of retries for inconsistent SQS messages (inconsistent due to eventual consistency). | 10 |
| nav.s3.implicit.batch_size | Number of Solr documents held in memory as implicit-folder state is being updated. | 1000 |
| nav.s3.home_region | AWS region nearest to the cluster and the Cloudera Navigator instance. Select the same AWS region (or the nearest one geographically) to minimize latency for API requests. | us-west-1 |
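
To see which extraction method a given safety valve snippet selects, the defaults from Table 1 can be merged with the snippet's overrides. The following sketch is an illustration, not a Cloudera tool: the function names and the mode labels are hypothetical, the defaults are transcribed from Table 1, and the snippet is assumed to contain only the S3 options listed there.

```python
DEFAULTS = {
    "nav.s3.extractor.max_threads": "3",
    "nav.s3.extractor.enable": "true",
    "nav.s3.extractor.incremental.enable": "true",
    "nav.s3.extractor.incremental.batch_size": "1000",
    "nav.s3.extractor.incremental.auto_setup.enable": "true",
    "nav.s3.extractor.incremental.queues": None,  # N/A in Table 1
    "nav.aws.api.limit": "5000000000",
    "nav.sqs.max_receive_count": "10",
    "nav.s3.implicit.batch_size": "1000",
    "nav.s3.home_region": "us-west-1",
}


def effective_options(snippet):
    """Merge key=value lines from a snippet over the Table 1 defaults."""
    options = dict(DEFAULTS)
    for line in snippet.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or key not in DEFAULTS:
            raise ValueError(f"unrecognized option line: {line!r}")
        options[key] = value.strip()
    return options


def extraction_mode(options):
    """Name the extraction method implied by the three enable flags."""
    if options["nav.s3.extractor.enable"] != "true":
        return "disabled"
    if options["nav.s3.extractor.incremental.enable"] != "true":
        return "bulk only"
    if options["nav.s3.extractor.incremental.auto_setup.enable"] != "true":
        return "external queue"
    return "bulk and incremental"


print(extraction_mode(effective_options("")))
print(extraction_mode(effective_options("nav.s3.extractor.incremental.enable=false")))
```

With an empty snippet the defaults apply, which corresponds to combined bulk and incremental extraction. Setting nav.s3.extractor.incremental.enable=false alone selects bulk extraction only.
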
Page generated August 14, 2017.