This is the documentation for Cloudera Enterprise 5.12.x. Documentation for other versions is available at Cloudera Documentation.

Cloudera Navigator and Amazon S3 Storage

Amazon Simple Storage Service (S3) is a cloud-based storage solution offered by Amazon Web Services (AWS). Cloudera clusters can use Amazon S3 as persistent storage for clusters, whether deployed on-premises or in the AWS cloud. The possible use cases are endless, from BDR (backup and disaster recovery), persistent storage for transient clusters deployed to the cloud, source for ETL and target for analytical and operational databases, just to name a few examples.

As of the 2.9.0 release, Cloudera Navigator can obtain technical metadata about the objects stored on S3 and can display in the context of lineage diagrams. This section provides a quick overview of some of the nuances of S3 and how those affect what you see or how you work with metadata using Cloudera Navigator.

Continue reading:

S3 Metadata Entities in Navigator
Enabling Cloudera Navigator Access to Amazon S3
Eventual Consistency
S3 Event Notification
Setting API Limits
Limitations of Navigator for S3

S3 Metadata Entities in Navigator

Amazon S3 has a flat structure, without the hierarchy found in typical filesystems. S3 entities include buckets and objects. The bucket is the container for the object.

In Cloudera Navigator, S3 entities include the following:

S3 Bucket
Directory - Although S3 entities are limited to buckets and objects in those buckets, S3 supports the concept of a folder that can be used to organize objects. Folders in S3 are extracted as directories in Navigator.
File

Implicit Folders

Navigator creates implicit S3 folders to mimic the behavior of a file system. For example, for an object with key real_estate/sales/pending, Navigator creates a file with the path real_estate/sales/pending, and also creates two directories: real_estate and real_estate/sales.

Important: When working with S3 entities in Cloudera Navigator, keep in mind the following:

Currently, Navigator does not mark implicit folders as deleted if children under it are deleted.
Navigator does not mark objects and folders as deleted if they are deleted in S3.
You can search for implicit folders in Navigator Search by adding implicit:true in the Search bar; use implicit:false to show regular (explicit) folders only

For example, in the S3 bucket implicit-folder-test, if you create the folder structure /implicit/implicit2 and add the file explicit to the implicit2 folder, Cloudera Navigator shows the following for this object when extracted:

The directory/file combination is labeled as a Path.

You work with S3 entities in Navigator much as you would with entities for HDFS. For information on S3 entity properties, see S3 Properties.

For more information about Amazon S3, see the Amazon S3 documentation.

Enabling Cloudera Navigator Access to Amazon S3

To configure Navigator for S3, you must configure AWS credentials for Cloudera Manager and enable Cloudera Navigator to access metadata about the objects stored in Amazon S3 buckets.

You configure AWS Credentials to specify the Access Key Authentication type.

This type of authentication requires an AWS Access Key and an AWS Secret key that you obtain from Amazon. For more information about setting up keys in AWS, see Creating an IAM User in Your AWS Account in the AWS Identity and Access Management documentation.

Important:

Navigator supports only a single key for authentication; only one AWS credential can be used at a time.
A Navigator instance must extract from only one AWS account, and one AWS account can be extracted by only one Navigator instance. Multiple Navigator instances extracting from one AWS account can cause unpredictable behavior and is not supported.
The key provided to Navigator requires permissions for SQS and S3. For more information, see Extracting Metadata from S3 Buckets.

Cloudera Manager stores these values securely and does not store them in world-readable locations. The credentials are masked in the Cloudera Manager Admin console, encrypted in the configurations passed to processes managed by Cloudera Manager, and redacted from the logs.

Note: Using the Cloudera Manager Admin Console to configure the AWS credentials for use by Cloudera Navigator requires the Cloudera Manager user role of User Administrator or Full Administrator.

To enable Cloudera Navigator access to Amazon S3, you must add AWS Credentials for Amazon S3 and then enable Navigator access. If you have already added AWS credentials, skip to step 5:

Open the Cloudera Manager Admin Console.
Click Administration > AWS Credentials.
Click Add and select Access Key Authentication. This authentication mechanism requires you to obtain AWS credentials from Amazon.
1. Enter a Name for this account. The name can contain alphanumeric characters, hyphens, underscores, and spaces.
2. Enter the AWS Access Key ID.
3. Enter the AWS Secret Key.
Important: Although AWS offers two types of authentication—IAM Role-based Authentication and Access Key Authentication—you must specify Access Key Authentication for Cloudera Navigator. IAM Role-based Authentication is not supported.
Click Add. The Connect to Amazon Web Services screen displays.
Click the Enable for Cloudera Navigator link.
Restart the Cloudera Navigator Metadata server to enable access.

Extracted S3 information should now be available in Cloudera Navigator.

Eventual Consistency

Amazon S3 uses an eventual consistency model. To provide high availability, consistency is informally guaranteed: Eventually, an item returns the last updated value to all accesses of that item, assuming no new updates for a period of time.

Because S3 uses eventual consistency, it might take some time for the S3 object to appear in Navigator. In addition, you might notice discrepancies between the object in Navigator and the object in S3. In Navigator, if you do not immediately see an S3 object that you created or do not see modifications that you made, that does not mean the object does not exist or was not successfully edited. In most cases, the lag time associated with eventual consistency is causing the object to not appear in Navigator or to not match the most recent version in S3.

S3 Event Notification

In Amazon S3, you enable a bucket to send notification messages whenever certain events occur. Cloudera Navigator uses Amazon Simple Queue Service (Amazon SQS) to extract S3 information.

Amazon SQS is a distributed, highly scalable hosted queue for storing messages. Navigator pulls data from the SQS queue. For more information about Amazon SQS, see Getting Started with Amazon SQS.

By default, Navigator sets up these queues and configures S3 event notification for each bucket for you. Navigator does not overwrite existing S3 event notifications. However, if any buckets have existing S3 event notifications, you must use "bring your own queue" and use an Amazon SNS "fanout". In a "fanout" scenario, an Amazon SNS message is sent to a topic and then replicated and pushed to multiple Amazon SQS queues, HTTP endpoints, or email addresses.

For more information about configuring Navigator data extraction, including "bring your own queue", see Extracting Metadata from S3 Buckets.

Setting API Limits

You can set an API limit for Amazon S3 API and SQS API. You are billed on a monthly basis, depending on your usage; the billing cycle resets each month. By setting these API limits, you can manage the monthly cost of using the APIs.

To set API limits, add the following in the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties in Cloudera Manager:

nav.aws.api.limit=any_int

Once your API limit is reached, Navigator suspends extraction until the next 30-day interval begins. Then at that point, Navigator extracts any data that was not extracted during the time activity was suspended.

Cloudera Navigator does not indicate if your use of the API exceeds the monthly limit; monitor your monthly use of the APIs to manage your costs.

Limitations of Navigator for S3

S3 entities in Cloudera Navigator work in much the same way that HDFS entities do. However, the current release of Navigator has some limitations related to S3:

Only one instance of Navigator can be configured per S3 account, and Navigator can use only one AWS credential.
IAM role-based authentication is not supported.
Extraction limitations:
- Navigator extracts only user-defined metadata in S3. System-defined metadata types are not extracted.
- Navigator does not extract tags for S3 buckets and objects.
- Navigator extracts only the latest versions in S3; it does not extract historical versions.
- AWS supports unnamed directories, but Navigator does not extract them.
Auditing is not available for S3.
Lineage is supported for Hive, Impala, and MapReduce on S3. Other types are not supported.
MapReduce glob paths are not supported.
S3 object removal with Object Lifecycle Management is not supported.
For example, if you set a lifecycle rule to automatically remove any objects older than 10 days, that delete event is not be tracked by SQS and therefore not tracked by Navigator.

To use lifecycle rules with Navigator extraction, you must use only bulk extraction and not incremental extraction.

Page generated August 14, 2017.