
Configuring Core Connect

Authorization and authentication configuration for cloud providers and optional customizations.

The Core Connect service allows you to import and export large data files between Tamr Core and your cloud storage destinations. See Core Connect for the currently supported file types for import and export.

This topic explains authentication configuration for each supported cloud provider and optional configuration variables you can set to customize Core Connect for your deployment.

See Using the Connect API in the Tamr Core API Reference for instructions on using Core Connect.

Azure-Specific Requirements

For data import in Azure deployments, the storage account must have hierarchical namespace enabled in the advanced options. Otherwise, data import fails.

For data export, Core Connect can export to any basic storage account.

Configuring Authentication and Authorization for Core Connect

When you import data files into Tamr Core from cloud storage, access to the cloud source relies on provider-specific authentication settings. This section explains how to configure authentication and authorization for Core Connect with S3, ADLS2, and GCS.

S3 Authentication for Core Connect

Core Connect uses the default credential provider chain for authentication. For more information about the default credential provider chain, see Working with AWS Credentials in the AWS SDK for Java documentation.

While Tamr recommends attaching IAM roles directly to the EC2 instance or ECS task, you can also authenticate with AWS using environment variables or a credentials file, like the one shown in the following example.

Default environment variables and credentials:

[default]
aws_access_key_id=<AWS_ACCESS_KEY_ID>
aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>

These credentials must have all of the permissions required for Core Connect as well as for your Tamr Core deployment, as described in the S3 Authorization for Core Connect section below.
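To check which of these static credential sources is present on a host, a quick script along the following lines can help. This is a minimal sketch in Python and not part of Core Connect: the function name `find_static_aws_credentials` is hypothetical, the default file location `~/.aws/credentials` is the standard AWS one, and the check covers only environment variables and the credentials file, not attached roles or the other steps of the default provider chain.

```python
import configparser
import os
from pathlib import Path


def find_static_aws_credentials(env=None, credentials_path=None, profile="default"):
    """Report where static AWS credentials were found.

    Returns "environment", "file", or None. Hypothetical helper for
    illustration only; it does not look at attached IAM roles.
    """
    env = os.environ if env is None else env

    # Environment variables are consulted before the credentials file.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"

    if credentials_path is None:
        credentials_path = Path.home() / ".aws" / "credentials"

    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # a missing file is silently skipped
    if parser.has_section(profile):
        section = parser[profile]
        if section.get("aws_access_key_id") and section.get("aws_secret_access_key"):
            return "file"

    return None


if __name__ == "__main__":
    print("Static AWS credentials found in:", find_static_aws_credentials())
```

A result of `None` does not mean authentication will fail; it only means that neither static source is configured, which is the expected state when a role is attached instead.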

S3 Authorization for Core Connect

The following template for an IAM policy includes the minimum actions (permissions) that are required for Core Connect. Tamr Core will likely require additional permissions; you might have multiple policies attached to EC2 instances and service accounts.

Example IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:AbortMultipartUpload",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<BUCKET_NAME>/*",
                "arn:aws:s3:::<BUCKET_NAME>"
            ]
        }
    ]
}

Replace both <BUCKET_NAME> placeholders with the name of your S3 bucket.
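If you manage several buckets, filling in the template by hand is error-prone. The following is a minimal Python sketch that generates the same policy for a given bucket name. The function name `build_core_connect_policy` and the bucket name `my-example-bucket` are hypothetical; the actions and resource patterns are copied from the template above.

```python
import json


def build_core_connect_policy(bucket_name):
    """Return the minimum Core Connect IAM policy for one bucket as a dict."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:AbortMultipartUpload",
                    "s3:ListBucket",
                    "s3:DeleteObject",
                ],
                # Object-level actions need the "/*" form; s3:ListBucket
                # applies to the bucket ARN itself, so both are listed.
                "Resource": [f"{bucket_arn}/*", bucket_arn],
            }
        ],
    }


if __name__ == "__main__":
    # "my-example-bucket" is a placeholder, not a real bucket.
    print(json.dumps(build_core_connect_policy("my-example-bucket"), indent=4))
```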

For information about attaching a policy, see Adding and removing IAM identity permissions in the AWS IAM documentation.

To have data encrypted at rest, use SSE-S3 as described in the AWS SDK for Java documentation.

ADLS2 Authentication for Core Connect

You can authenticate with ADLS2 using either service principals (recommended) or a shared account access key or token.

ADLS2 Authentication with Service Principals

To configure ADLS2 authentication with service principals:

  1. Ensure that the service principal has a role or ACL that grants access to the specified container. When assigning roles, choose Storage Blob Data Contributor.
  2. Export the environment variables ADLS2_CLIENT_ID, ADLS2_CLIENT_SECRET, and ADLS2_TENANT_ID.

You can use these variable values to authenticate when calling Core Connect API endpoints for ADLS2.

Example:

ADLS2_CLIENT_ID = d08d0a03-40ce-4a7b-9367-3000000002
ADLS2_CLIENT_SECRET = qIKeyKeyKeywzmqepcxlcxtiY=
ADLS2_TENANT_ID = 8d0a03-40ce-4a7b-9367-3000000002
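Before calling the ADLS2 endpoints, it can be useful to confirm that all three variables are actually exported in the environment where Core Connect runs. The following Python sketch does only that; the function name `missing_service_principal_vars` is hypothetical, and the check does not validate the values against Azure.

```python
import os

# Variable names taken from step 2 above.
REQUIRED_VARS = ("ADLS2_CLIENT_ID", "ADLS2_CLIENT_SECRET", "ADLS2_TENANT_ID")


def missing_service_principal_vars(env=None):
    """Return the names of required ADLS2 variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_service_principal_vars()
    if missing:
        print("Missing ADLS2 service principal variables:", ", ".join(missing))
    else:
        print("All ADLS2 service principal variables are set.")
```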

ADLS2 Authentication with Shared Account Access Key or Token

To authenticate with a shared account access key or token:

  1. Copy your storage account key as follows:
    a. Navigate to your storage account in the Azure portal.
    b. Under Settings, select Access keys.
    c. Locate and copy the Key value for key1. For more information, see the Azure Blob storage documentation.
  2. Export either:
    • The environment variable ADLS2_ACCOUNT_KEY to use the shared account key.
    • The environment variable ADLS2_SAS_TOKEN to use the shared account token.

You can use these variable values to authenticate when calling Core Connect API endpoints for ADLS2.

Example for shared account access key:

ADLS2_ACCOUNT_KEY = sS23U3jKHMAz4xSzZJbBnv3NPb2ndalgyQA5uVsuV9Lrb6XD82Se6NcoMzoPWLXJ0SvJN4L9hPFKhx==

Example for shared access token:

ADLS2_SAS_TOKEN = ?sv=2020-08-04&ss=bfqt&srt=sco&sp=rwdlacupx&se=2022-11-26T16:13:40Z&st=2021-11-26T08:13:40Z&spr=https&sig=1xGOStsP3%2BNF7L0aRCVuyjUfpGeuPMlmpCXj56Xdw6c
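A shared access token carries its own validity window: in the example above, the `st` field is the start time and the `se` field is the expiry time. An expired token is a common cause of authentication failures, so the following Python sketch shows one way to read the expiry from a token. The function names are hypothetical, and the parsing assumes the full UTC timestamp format used in the example; tokens that use a shorter date format would need a different format string.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def sas_token_expiry(sas_token):
    """Return the expiry (the `se` field) of a SAS token as an aware UTC datetime."""
    params = parse_qs(sas_token.lstrip("?"))
    expiry_text = params["se"][0]  # raises KeyError if the token has no `se` field
    # Assumes the "YYYY-MM-DDTHH:MM:SSZ" form shown in the example token.
    parsed = datetime.strptime(expiry_text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def sas_token_is_expired(sas_token, now=None):
    """Return True if the token's expiry time is at or before `now` (UTC)."""
    now = datetime.now(timezone.utc) if now is None else now
    return sas_token_expiry(sas_token) <= now


if __name__ == "__main__":
    import os

    token = os.environ.get("ADLS2_SAS_TOKEN", "")
    if token:
        print("Token expires at:", sas_token_expiry(token).isoformat())
        print("Expired:", sas_token_is_expired(token))
```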

GCS Authentication for Core Connect

Core Connect uses your application default credentials (ADC) for authentication. For more information, see the Google Cloud documentation.

GCS Authorization for Core Connect

For Core Connect to access a storage bucket, verify that the following roles are granted:

  • roles/storage.legacyObjectReader
  • roles/storage.legacyBucketWriter

These roles grant the following permissions, which give Core Connect read/write access to the storage bucket:

  • storage.buckets.get
  • storage.objects.get
  • storage.objects.list
  • storage.objects.create
  • storage.objects.delete
  • storage.multipartUploads.create
  • storage.multipartUploads.abort
  • storage.multipartUploads.listParts
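If you grant access through custom roles instead, you can compare the permissions an identity actually holds against the list above. The following Python sketch performs that comparison. The function name `missing_gcs_permissions` is hypothetical, and how you obtain the set of granted permissions (for example, from an IAM permissions test against the bucket) is left as an assumption outside this example.

```python
# Permissions copied from the list above.
REQUIRED_PERMISSIONS = frozenset(
    {
        "storage.buckets.get",
        "storage.objects.get",
        "storage.objects.list",
        "storage.objects.create",
        "storage.objects.delete",
        "storage.multipartUploads.create",
        "storage.multipartUploads.abort",
        "storage.multipartUploads.listParts",
    }
)


def missing_gcs_permissions(granted):
    """Return the required permissions not present in `granted`, sorted by name."""
    return sorted(REQUIRED_PERMISSIONS - set(granted))


if __name__ == "__main__":
    # Hypothetical input: an identity that can only read objects.
    read_only = ["storage.objects.get", "storage.objects.list"]
    for permission in missing_gcs_permissions(read_only):
        print("missing:", permission)
```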

Using a Jinja Template

You can use Jinja templating in your API requests to specify environment variable names instead of literals. As shown in the following example, after you export an environment variable, you can use the syntax "tamr_json_key": "{{ MY_VAR_NAME }}" in the POST body of your Connect requests. Tamr recommends using environment variables for security, because this avoids transmitting password literals.

{
    "query": "SELECT * FROM tamr.TAMR_TEST",
    "datasetName": "oracle_test",
    "queryConfig": {
        "jdbcUrl": "jdbc:oracle:thin:@192.168.99.101:1521:orcl",
        "dbUsername": "{{ CONNECT_DB_USERNAME }}",
        "dbPassword": "{{ CONNECT_DB_PASSWORD }}"
    }
}
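When a client script builds this request body, the placeholder must be sent as a literal string so that it can be resolved on the Core Connect side; the client never needs to read the secret itself. The following Python sketch illustrates this. The helper names `jinja_placeholder` and `build_query_request` are hypothetical, and the example stops at producing the JSON body rather than assuming a particular endpoint URL.

```python
import json


def jinja_placeholder(var_name):
    """Return the literal Jinja placeholder for an environment variable name."""
    return "{{ " + var_name + " }}"


def build_query_request(query, dataset_name, jdbc_url, username_var, password_var):
    """Build a request body that refers to credentials by variable name only."""
    return {
        "query": query,
        "datasetName": dataset_name,
        "queryConfig": {
            "jdbcUrl": jdbc_url,
            "dbUsername": jinja_placeholder(username_var),
            "dbPassword": jinja_placeholder(password_var),
        },
    }


if __name__ == "__main__":
    body = build_query_request(
        query="SELECT * FROM tamr.TAMR_TEST",
        dataset_name="oracle_test",
        jdbc_url="jdbc:oracle:thin:@192.168.99.101:1521:orcl",
        username_var="CONNECT_DB_USERNAME",
        password_var="CONNECT_DB_PASSWORD",
    )
    # The serialized body contains variable names, never the secret values.
    print(json.dumps(body, indent=4))
```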

Configuration Options for Core Connect

You can set the following variables to configure Core Connect.

| Configuration Variable | Description |
| --- | --- |
| TAMR_CONNECT_ADMIN_PORT | Defines the port where Core Connect Admin will run. Default: 9151 |
| TAMR_CONNECT_ADMIN_BIND_PORT | Used only when Core Connect is running in a container. Defines the port where Core Connect Admin will run. Default: 9051 |
| TAMR_CONNECT_BIND_PORT | Used only when Core Connect is running in a container. Defines the port where Core Connect will run. Default: 9050 |
| TAMR_CONNECT_CONFIG_DIR | Used for HDFS and Hive. Identifies the destination location where Core Connect copies the files defined in TAMR_CONNECT_EXTRA_URIS and TAMR_CONNECT_CONFIG_URIS. Default: '/tmp/config' |
| TAMR_CONNECT_CONFIG_URIS | Used for HDFS and Hive. Identifies the set of XML files that configure the Java client (that is, what is located in /etc/hive/conf/ together with what is in /etc/hadoop/conf). It is a semicolon-separated list of full file paths. |
| TAMR_CONNECT_DEFAULT_CLOUD_PROVIDER | Defines the default cloud service provider for the UI. Valid values are "GCS", "S3", and "ADLS2". Default: GCS |
| TAMR_CONNECT_ENABLE_HTTPS | Defines whether Core Connect runs on HTTP (false) or HTTPS (true). Default: false |
| TAMR_CONNECT_EXTRA_CONFIG | Used for HDFS and Hive. Any extra configuration that is needed (normally empty) to override what is on the filesystem. It is a dictionary of Hadoop setting-value pairs. |
| TAMR_CONNECT_EXTRA_URIS | Used for HDFS and Hive. Identifies any extra files that configure the Java client. It is a semicolon-separated list of full file paths. |
| TAMR_CONNECT_FILE_DROP_PATH | Defines the folder to search for files uploaded locally using serverfs endpoints (local filesystem). Define this to upload files from where Core Connect is running. Default: /tmp |
| TAMR_CONNECT_HADOOP_USER_NAME | Used for HDFS. Defines the Hadoop username used when reading from HDFS. |
| TAMR_CONNECT_HOSTNAME | Defines the hostname for Core Connect. Default: localhost |
| TAMR_CONNECT_JDBC_POOL_IMPL | Defines the JDBC pool implementation. Possible values are dbcp, Basic, and Hikari. Default: dbcp |
| TAMR_CONNECT_KERBEROS_KEYTAB | Used for HDFS and Hive. The file path to a keytab file for the Kerberos principal defined in TAMR_CONNECT_KERBEROS_PRINCIPAL. |
| TAMR_CONNECT_KERBEROS_PRINCIPAL | Used for HDFS and Hive. The Kerberos principal to authenticate as. |
| TAMR_CONNECT_LOG_LEVEL | Defines the log level for Core Connect. See Logging in Single-Node Deployments or Logging in Cloud-Native Deployments for more information on Tamr Core microservice logs. Default: INFO |
| TAMR_CONNECT_MAX_CONCURRENT_REQUESTS | Defines how many Core Connect jobs can run at the same time. Running more jobs concurrently takes more memory; see TAMR_CONNECT_MEMORY. Default: 2 |
| TAMR_CONNECT_MEMORY | Defines the amount of memory allocated to the Core Connect driver. Default: 2G |
| TAMR_CONNECT_PORT | Defines the port where Core Connect will run. Default: 9050 |
| TAMR_CONNECT_PROFILING_SAMPLE_SIZE | Defines the number of records to read to generate profiling results. Default: 100000 |
| TAMR_CONNECT_TMP_DIR | Defines the temporary directory used by the Core Connect process. Default: {{ TAMR_UNIFY_HOME }}/tmp/connect |

See the Configuration Variable Reference for more details for each variable.
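Several of these variables accept only a fixed set of values or a port number, so a typo can be caught before you apply the change and restart. The following Python sketch checks a dictionary of planned settings against the constraints listed above. The function name `validate_connect_settings` is hypothetical, and the sketch assumes that enumerated values are matched case-sensitively, which the source does not state.

```python
# Allowed values copied from the variable descriptions above.
VALID_CLOUD_PROVIDERS = {"GCS", "S3", "ADLS2"}
VALID_JDBC_POOLS = {"dbcp", "Basic", "Hikari"}
PORT_VARS = (
    "TAMR_CONNECT_PORT",
    "TAMR_CONNECT_ADMIN_PORT",
    "TAMR_CONNECT_BIND_PORT",
    "TAMR_CONNECT_ADMIN_BIND_PORT",
)


def validate_connect_settings(settings):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    provider = settings.get("TAMR_CONNECT_DEFAULT_CLOUD_PROVIDER")
    if provider is not None and provider not in VALID_CLOUD_PROVIDERS:
        problems.append(f"TAMR_CONNECT_DEFAULT_CLOUD_PROVIDER: unknown value {provider!r}")

    pool = settings.get("TAMR_CONNECT_JDBC_POOL_IMPL")
    if pool is not None and pool not in VALID_JDBC_POOLS:
        problems.append(f"TAMR_CONNECT_JDBC_POOL_IMPL: unknown value {pool!r}")

    for name in PORT_VARS:
        value = settings.get(name)
        if value is None:
            continue  # unset variables fall back to their defaults
        try:
            port = int(value)
        except (TypeError, ValueError):
            problems.append(f"{name}: {value!r} is not a number")
            continue
        if not 1 <= port <= 65535:
            problems.append(f"{name}: {port} is outside the range 1-65535")

    return problems


if __name__ == "__main__":
    planned = {"TAMR_CONNECT_DEFAULT_CLOUD_PROVIDER": "S3", "TAMR_CONNECT_PORT": "9050"}
    print(validate_connect_settings(planned) or "No problems found.")
```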

To set Core Connect configuration variables, follow the instructions in Configuring Tamr Core.

You must restart Tamr Core and its dependencies after making a configuration change. See Restarting Tamr Core.