Tamr Documentation

Configuring the Data Movement Service

You enable and configure the DMS to load data files into Tamr from, and export datasets from Tamr to, cloud storage destinations.

You enable and configure the DMS in order to:

  • Upload dataset files into Tamr from cloud storage locations.
  • Export dataset files from Tamr to cloud storage locations.

If DMS is enabled for your instance, users cannot download datasets to a local file system via the UI. This allows organizations to ensure all teams follow the appropriate data access policies, which are managed via their cloud storage accounts.

Important: Tamr users who need access to data files exported to cloud storage must be given access to the appropriate cloud storage locations.

The Tamr Data Movement Service (DMS) supports CSV and Parquet formats for Tamr dataset ingest and export.

Tamr supports ingesting datasets from, and exporting datasets to, cloud storage within your cloud provider.

Azure-Specific Requirements

For data ingestion in Azure deployments, you must create the storage account with hierarchical namespace enabled in the advanced options. Otherwise, data ingestion will fail.

For data export, DMS can export to any basic storage account.

Enabling DMS

To enable the DMS, you set APPS_DMS_ENABLED to true. See Configuring Tamr.

After you change this setting, you need to restart Tamr and its dependencies. See Restarting.
To stop DMS, use the command pkill -f dms. This removes DMS jobs from the jobs page. Next, run /start-dependencies.sh.

Configuration Options for DMS

You set the following variables to configure the Data Movement Service.

Name                              Description
APPS_DMS_DEFAULT_CLOUD_PROVIDER   Identifies the cloud service provider: ADLS2, GCS, or S3.
APPS_DMS_MAX_CONCURRENT_REQUESTS  Defines the maximum number of threads for parallelization.
APPS_DMS_MEMORY                   Allocates memory to the DMS driver.
APPS_DMS_HOSTNAME                 Stores the host for the DMS server.
APPS_DMS_PORT                     Stores the port for the DMS server. Default port is 9155.
APPS_DMS_SCHEME                   Identifies whether to use http or https for DMS connections.

For more information, see Configuration Variable Reference.
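As a rough illustration of how the scheme, hostname, and port variables relate to one another, the following Python sketch assembles a DMS base URL from a dictionary of settings. The helper name dms_base_url and the fallbacks to http and localhost are assumptions made for this example; only the default port of 9155 comes from the variable descriptions above.

```python
DEFAULT_DMS_PORT = 9155  # default documented for APPS_DMS_PORT


def dms_base_url(settings: dict) -> str:
    """Build a base URL from DMS-style settings (hypothetical helper)."""
    scheme = settings.get("APPS_DMS_SCHEME", "http")  # assumed fallback
    if scheme not in ("http", "https"):
        raise ValueError(f"APPS_DMS_SCHEME must be http or https, got {scheme!r}")
    host = settings.get("APPS_DMS_HOSTNAME", "localhost")  # assumed fallback
    port = int(settings.get("APPS_DMS_PORT", DEFAULT_DMS_PORT))
    return f"{scheme}://{host}:{port}"


# Example values are placeholders, not real endpoints.
example_url = dms_base_url(
    {"APPS_DMS_SCHEME": "https", "APPS_DMS_HOSTNAME": "tamr.example.com"}
)
print(example_url)
```

A check like this can catch a mistyped scheme or a non-numeric port before you restart Tamr.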

Note: After you change any of these variables, you need to restart Tamr and its dependencies. See Restarting.
To stop DMS, use the command pkill -f dms. This removes DMS jobs from the jobs page. Next, run /start-dependencies.sh.

Configuring Authentication for DMS

When you import data files into Tamr from cloud storage, authentication allowing access to that cloud source relies on provider-specific authentication settings.

S3 Authentication for DMS

The DMS uses the AWS default credential provider chain for authentication. For more information about the default credential provider chain, see Working with AWS Credentials in the AWS SDK for Java documentation.

Tamr recommends attaching IAM roles directly to the EC2 instance or ECS task. However, you can also authenticate with AWS by using environment variables or a credentials file like the one shown in the following example.

[default]
aws_access_key_id=<AWS_ACCESS_KEY_ID>
aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>

Note: These credentials must have all of the permissions required for the DMS as well as for your Tamr deployment. See S3 Authorization for DMS, next.
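Before restarting Tamr, it can help to confirm that a credentials file in the format shown above actually contains both keys. The following Python sketch parses the file contents with the standard configparser module and reports which keys are missing or empty. The helper name missing_credential_keys and the sample values are hypothetical; this check is not part of Tamr or the AWS SDK.

```python
import configparser

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credential_keys(text: str, profile: str = "default") -> list:
    """Return the required keys that are absent or empty in the given profile."""
    # Interpolation is disabled so that '%' characters in values are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    return [
        key
        for key in REQUIRED_KEYS
        if not parser.get(profile, key, fallback="").strip()
    ]


# Placeholder contents: the secret key has been left empty on purpose.
sample = "[default]\naws_access_key_id=AKIAEXAMPLE\naws_secret_access_key=\n"
print(missing_credential_keys(sample))
```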

S3 Authorization for DMS

The following template for an IAM policy includes the minimum actions (permissions) that must be allowed for the DMS. Tamr is likely to require additional permissions, so you might have multiple policies attached to EC2 instances and service accounts.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:AbortMultipartUpload",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<BUCKET_NAME>/*",
                "arn:aws:s3:::<BUCKET_NAME>"
            ]
        }
    ]
}

Tip: Replace both <BUCKET_NAME> placeholders with the name of your S3 bucket.
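If you manage several buckets, you may prefer to generate the policy rather than edit the template by hand. The following Python sketch fills in the bucket name and emits the same JSON structure as the template above. The function name dms_s3_policy and the bucket name my-tamr-bucket are placeholders chosen for this example.

```python
import json

# Minimum actions listed in the policy template for the DMS.
DMS_S3_ACTIONS = [
    "s3:PutObject",
    "s3:GetObject",
    "s3:AbortMultipartUpload",
    "s3:ListBucket",
    "s3:DeleteObject",
]


def dms_s3_policy(bucket_name: str) -> dict:
    """Return the DMS policy template with the bucket name filled in."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": list(DMS_S3_ACTIONS),
                # Object-level actions use the /* resource; s3:ListBucket uses the bucket ARN.
                "Resource": [f"{bucket_arn}/*", bucket_arn],
            }
        ],
    }


policy_json = json.dumps(dms_s3_policy("my-tamr-bucket"), indent=4)
print(policy_json)
```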

For information about attaching a policy, see Adding and removing IAM identity permissions in the AWS IAM documentation.

Note: To have data encrypted at rest, use SSE-S3 as described in the AWS documentation.

ADLS2 Authentication for DMS

You can authenticate in ADLS2 either with service principals (recommended) or a shared account access key.

To configure ADLS authentication with service principals:

  1. Ensure that the service principal has the roles or ACLs required to access the specified container. When assigning roles, choose Storage Blob Data Contributor.
  2. In your (functional user) home directory, create a tamr folder if it does not already exist. Then, create a .cloud-credentials.json file in the tamr folder. For example: /home/<functional_user>/tamr/.cloud-credentials.json.
  3. Include the following information in the JSON file:
  • accountName: String. The ADLS2 account name. For example, my_adls_storage.
  • containerName: String. Container for the required files. For example, inbound.
  • clientId: String. The ID of the service principal object or app registered with Active Directory.
  • tenantId: String. The Azure Active Directory ID.
  • clientSecret: String. The password for this service principal.
{
  "accountName": "",
  "containerName": "",
  "clientId": "",
  "tenantId": "",
  "clientSecret": ""
}
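The steps above can be sketched in Python as follows. This example validates that all five fields are present, creates the tamr folder if needed, and writes .cloud-credentials.json. The function name write_service_principal_credentials, the sample values, and the step that restricts file permissions to the owner are assumptions for illustration, not documented Tamr requirements. A temporary directory stands in for the functional user's home directory.

```python
import json
import os
import tempfile
from pathlib import Path

SERVICE_PRINCIPAL_FIELDS = (
    "accountName", "containerName", "clientId", "tenantId", "clientSecret",
)


def write_service_principal_credentials(home_dir, values: dict) -> Path:
    """Write <home_dir>/tamr/.cloud-credentials.json (hypothetical helper)."""
    missing = [f for f in SERVICE_PRINCIPAL_FIELDS if not values.get(f)]
    if missing:
        raise ValueError(f"missing credential fields: {', '.join(missing)}")
    tamr_dir = Path(home_dir) / "tamr"
    tamr_dir.mkdir(parents=True, exist_ok=True)  # create the folder if absent
    path = tamr_dir / ".cloud-credentials.json"
    payload = {f: values[f] for f in SERVICE_PRINCIPAL_FIELDS}
    path.write_text(json.dumps(payload, indent=2))
    os.chmod(path, 0o600)  # assumption: keep the secret readable by the owner only
    return path


# Demo: a temporary directory stands in for /home/<functional_user>.
with tempfile.TemporaryDirectory() as fake_home:
    written = write_service_principal_credentials(fake_home, {
        "accountName": "my_adls_storage",
        "containerName": "inbound",
        "clientId": "example-client-id",
        "tenantId": "example-tenant-id",
        "clientSecret": "example-secret",
    })
    written_relative = written.relative_to(fake_home).as_posix()
    written_payload = json.loads(written.read_text())
print(written_relative)
```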

To authenticate with a shared account access key:

  1. Copy your storage account key as follows:
    a. Navigate to your storage account in the Azure portal.
    b. Under Settings, select Access keys.
    c. Locate the Key value under key1, then select Copy.
    For more information, see the Azure Blob storage documentation.
  2. In your (functional user) home directory, create a tamr folder if it does not already exist. Then, create a .cloud-credentials.json file in the tamr folder. For example: /home/<functional_user>/tamr/.cloud-credentials.json.
  3. Include the following information in the JSON file:
  • accountName: String. The ADLS2 account name. For example, my_adls_storage.
  • accountKey: String. The ADLS2 account key.
{
  "accountName": "",
  "accountKey": ""
}
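Because both ADLS2 methods use the same file name with different fields, a quick check of which method a given file describes can prevent confusion. The following Python sketch classifies a parsed credentials dictionary as service principal or shared key based on which fields are filled in. The function name adls2_auth_mode and the choice to check for a service principal first are assumptions for this example; the source does not state how DMS behaves when both sets of fields are present.

```python
SERVICE_PRINCIPAL_FIELDS = (
    "accountName", "containerName", "clientId", "tenantId", "clientSecret",
)
SHARED_KEY_FIELDS = ("accountName", "accountKey")


def adls2_auth_mode(credentials: dict) -> str:
    """Report which ADLS2 method a credentials dict describes (hypothetical helper)."""

    def complete(fields) -> bool:
        # A field counts as present only if it has a non-empty value.
        return all(credentials.get(field) for field in fields)

    if complete(SERVICE_PRINCIPAL_FIELDS):  # assumed precedence: recommended method first
        return "service_principal"
    if complete(SHARED_KEY_FIELDS):
        return "shared_key"
    raise ValueError("credentials match neither ADLS2 authentication method")


# Placeholder values only.
print(adls2_auth_mode({"accountName": "my_adls_storage", "accountKey": "example-key"}))
```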

GCS Authentication for DMS

The DMS uses your application default credentials (ADC) for authentication. For information, see the Google Cloud documentation.

GCS Authorization for DMS

The following roles must be granted on the storage bucket to be accessed:

  • roles/storage.legacyObjectReader
  • roles/storage.legacyBucketWriter

These roles grant the following permissions:

storage.buckets.get
storage.objects.get
storage.objects.list
storage.objects.create
storage.objects.delete
storage.multipartUploads.create
storage.multipartUploads.abort
storage.multipartUploads.listParts
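When troubleshooting access errors, it can be useful to compare the permissions an identity actually holds against the list above. The following Python sketch does that comparison with plain sets. The function name missing_gcs_permissions is hypothetical, and the sketch assumes you have already obtained the granted permissions by some other means, such as an IAM permission test against the bucket.

```python
# Permissions listed above for the two required roles.
REQUIRED_GCS_PERMISSIONS = frozenset({
    "storage.buckets.get",
    "storage.objects.get",
    "storage.objects.list",
    "storage.objects.create",
    "storage.objects.delete",
    "storage.multipartUploads.create",
    "storage.multipartUploads.abort",
    "storage.multipartUploads.listParts",
})


def missing_gcs_permissions(granted) -> list:
    """Return the required permissions not found in `granted`, sorted by name."""
    return sorted(REQUIRED_GCS_PERMISSIONS - set(granted))


# Example: an identity that can read objects but cannot write them.
read_only = ["storage.buckets.get", "storage.objects.get", "storage.objects.list"]
print(missing_gcs_permissions(read_only))
```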



