Configuring Core Connect
Authorization and authentication configuration for cloud providers and optional customizations.
The Core Connect service allows you to import and export large data files between Tamr Core and your cloud storage destinations. See Core Connect for the currently supported file types for import and export.
This topic explains authentication configuration for each supported cloud provider and optional configuration variables you can set to customize Core Connect for your deployment.
See Using the Core Connect API in the Tamr Core API Reference for instructions on using Core Connect.
Azure-Specific Requirements
For data import in Azure deployments, the storage account must have hierarchical namespace enabled in the advanced options. Otherwise, data import fails.
For data export, Core Connect can export to any basic storage account.
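For reference, a hedged Azure CLI sketch for creating a storage account with hierarchical namespace enabled (the account and resource group names are placeholders):
az storage account create \
  --name <STORAGE_ACCOUNT> \
  --resource-group <RESOURCE_GROUP> \
  --kind StorageV2 \
  --hns true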
Configuring Authentication and Authorization for Core Connect
When you import data files into Tamr Core from cloud storage, access to that cloud source relies on provider-specific authentication settings. This section explains how to configure authentication and authorization for each supported provider: S3, ADLS2, and GCS.
S3 Authentication for Core Connect
Core Connect uses the default credential provider chain for authentication. For more information about the default credential provider chain, see Working with AWS Credentials in the AWS SDK for Java documentation.
While Tamr recommends attaching IAM roles directly to the EC2 instance or ECS task, you can also authenticate with AWS using environment variables or a credentials file, like the one shown in the following example.
Example credentials file:
[default]
aws_access_key_id=<AWS_ACCESS_KEY_ID>
aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
These credentials must have all of the permissions required for Core Connect as well as for your Tamr Core deployment, as described in the S3 Authorization for Core Connect section below.
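You can also supply the same credentials through the standard AWS environment variables read by the default credential provider chain. A minimal sketch (the values are placeholders):
export AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>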
S3 Authorization for Core Connect
The following template for an IAM policy includes the minimum actions (permissions) required for Core Connect. Tamr Core will likely require additional permissions, so you might have multiple policies attached to your EC2 instances and service accounts.
Example IAM policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:AbortMultipartUpload",
"s3:ListBucket",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>/*",
"arn:aws:s3:::<BUCKET_NAME>"
]
}
]
}
Replace the <BUCKET_NAME> placeholder with the name of your S3 bucket.
For information about attaching a policy, see Adding and removing IAM identity permissions in the AWS IAM documentation.
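As a hedged sketch, assuming a policy file named core-connect-policy.json created from the template above and an instance role named TamrCoreConnectRole (both names are placeholders), you can create and attach the policy with the AWS CLI:
aws iam create-policy \
  --policy-name tamr-core-connect-s3 \
  --policy-document file://core-connect-policy.json
aws iam attach-role-policy \
  --role-name TamrCoreConnectRole \
  --policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/tamr-core-connect-s3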
To encrypt data at rest, use SSE-S3 as described in the Amazon S3 documentation.
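For example, one way to make SSE-S3 the default encryption for the bucket is with the AWS CLI (a sketch; the bucket name is a placeholder):
aws s3api put-bucket-encryption \
  --bucket <BUCKET_NAME> \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'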
ADLS2 Authentication for Core Connect
You can authenticate to ADLS2 either with service principals (recommended) or with a shared account access key or SAS token.
ADLS2 Authentication with Service Principals
To configure ADLS2 authentication with service principals:
- Ensure that the service principal has the roles or ACLs needed to access the specified container. When assigning roles, choose Storage Blob Data Contributor (a role-assignment sketch follows the example below).
- Export the environment variables ADLS2_CLIENT_ID, ADLS2_CLIENT_SECRET, and ADLS2_TENANT_ID.
You can use these variable values to authenticate when calling Core Connect API endpoints for ADLS2.
Example:
ADLS2_CLIENT_ID = d08d0a03-40ce-4a7b-9367-3000000002
ADLS2_CLIENT_SECRET = qIKeyKeyKeywzmqepcxlcxtiY=
ADLS2_TENANT_ID = 8d0a03-40ce-4a7b-9367-3000000002
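As a sketch of the role assignment in the first step above, assuming the Azure CLI and placeholder subscription, resource group, storage account, and container names, you can grant Storage Blob Data Contributor to the service principal at the container scope:
az role assignment create \
  --assignee <ADLS2_CLIENT_ID> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT>/blobServices/default/containers/<CONTAINER>"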
ADLS2 Authentication with Shared Account Access Key or Token
To authenticate with a shared account access key or token:
- Copy your storage account access key as follows:
  a. Navigate to your storage account in the Azure portal.
  b. Under Settings, select Access keys.
  c. Locate and copy the Key value for key1. For more information, see the Azure Blob storage documentation.
- Export either:
  - The environment variable ADLS2_ACCOUNT_KEY to use the shared account key.
  - The environment variable ADLS2_SAS_TOKEN to use the shared access (SAS) token.
You can use these variable values to authenticate when calling Core Connect API endpoints for ADLS2.
Example for shared account access key:
ADLS2_ACCOUNT_KEY = sS23U3jKHMAz4xSzZJbBnv3NPb2ndalgyQA5uVsuV9Lrb6XD82Se6NcoMzoPWLXJ0SvJN4L9hPFKhx==
Example for shared access token:
ADLS2_SAS_TOKEN = ?sv=2020-08-04&ss=bfqt&srt=sco&sp=rwdlacupx&se=2022-11-26T16:13:40Z&st=2021-11-26T08:13:40Z&spr=https&sig=1xGOStsP3%2BNF7L0aRCVuyjUfpGeuPMlmpCXj56Xdw6c
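As a hedged example of generating an account-level SAS token like the one above with the Azure CLI (adjust the services, resource types, permissions, and expiry to your needs; the account name and key are placeholders):
az storage account generate-sas \
  --account-name <STORAGE_ACCOUNT> \
  --account-key <ADLS2_ACCOUNT_KEY> \
  --services bfqt \
  --resource-types sco \
  --permissions rwdlacup \
  --expiry 2022-11-26T16:13:40Z \
  --https-only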
GCS Authentication for Core Connect
Core Connect uses your application default credentials (ADC) for authentication. For information, see the Google Cloud documentation.
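For example, a common way to provide application default credentials on the host where Core Connect runs is to point the standard GOOGLE_APPLICATION_CREDENTIALS environment variable at a service account key file (the path is a placeholder):
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/<SERVICE_ACCOUNT_KEY>.json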
GCS Authorization for Core Connect
For Core Connect to access a storage bucket, verify that the following roles are granted:
- roles/storage.legacyObjectReader
- roles/storage.legacyBucketWriter
These roles grant the following permissions, which give Core Connect read/write access to the storage bucket:
- storage.buckets.get
- storage.objects.get
- storage.objects.list
- storage.objects.create
- storage.objects.delete
- storage.multipartUploads.create
- storage.multipartUploads.abort
- storage.multipartUploads.listParts
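As a hedged sketch of granting these roles with gsutil (the service account email and bucket name are placeholders):
gsutil iam ch serviceAccount:<SERVICE_ACCOUNT_EMAIL>:roles/storage.legacyObjectReader gs://<BUCKET_NAME>
gsutil iam ch serviceAccount:<SERVICE_ACCOUNT_EMAIL>:roles/storage.legacyBucketWriter gs://<BUCKET_NAME>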
Using a Jinja Template
You can use Jinja templating in your API requests to specify environment variable names instead of literal values. As shown in the following example, after you export an environment variable, you can use the syntax "tamr_json_key": "{{ MY_VAR_NAME }}"
in the POST body of your Core Connect requests. Tamr recommends using environment variables for security, to avoid transmitting password literals.
{
"query":"SELECT * FROM tamr.TAMR_TEST",
"datasetName":"oracle_test",
"queryConfig":{
"jdbcUrl":"jdbc:oracle:thin:@192.168.99.101:1521:orcl",
"dbUsername":"{{ CONNECT_DB_USERNAME }}",
"dbPassword":"{{ CONNECT_DB_PASSWORD }}"
}
}
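For the template in this example to resolve, the referenced variables must be exported in the environment where Core Connect runs, for example (the values are placeholders):
export CONNECT_DB_USERNAME=<DB_USERNAME>
export CONNECT_DB_PASSWORD=<DB_PASSWORD>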
Configuration Options for Core Connect
You can set the following variables to configure Core Connect.
Configuration Variable | Description |
---|---|
TAMR_CONNECT_ADMIN_PORT | Defines the port where Core Connect Admin will run. Default: 9151. |
TAMR_CONNECT_ADMIN_BIND_PORT | Used only when Core Connect is running in a container. Defines the port where Core Connect Admin will run. Default: 9051 |
TAMR_CONNECT_BIND_PORT | Used only when Core Connect is running in a container. Defines the port where Core Connect will run. Default: 9050 |
TAMR_CONNECT_CONFIG_DIR | Used for HDFS and Hive. Identifies the destination location where Core Connect copies the files defined in TAMR_CONNECT_EXTRA_URIS and TAMR_CONNECT_CONFIG_URIS. Default: /tmp/config |
TAMR_CONNECT_CONFIG_URIS | Used for HDFS and Hive. Identifies the set of XML files that configure the Java client (that is, what is located in /etc/hive/conf/ together with what is in /etc/hadoop/conf). A semicolon-separated list of full file paths. |
TAMR_CONNECT_DEFAULT_CLOUD_PROVIDER | Defines the default cloud service provider for the UI. Valid values are "GCS", "S3", and "ADLS2". Default: GCS |
TAMR_CONNECT_ENABLE_HTTPS | Defines whether Core Connect runs on HTTP (false) or HTTPS (true). Default: false |
TAMR_CONNECT_EXTRA_CONFIG | Used for HDFS and Hive. Any extra configuration needed to override what is on the filesystem (normally empty). A dictionary of Hadoop setting-value pairs. |
TAMR_CONNECT_EXTRA_URIS | Used for HDFS and Hive. Identifies any extra files that configure the Java client. A semicolon-separated list of full file paths. |
TAMR_CONNECT_FILE_DROP_PATH | Defines the folder to search for files uploaded locally using serverfs endpoints (local filesystem). Define this to upload files from where Core Connect is running. Default: /tmp |
TAMR_CONNECT_HADOOP_USER_NAME | Used for HDFS. Defines the Hadoop username used when reading from HDFS. |
TAMR_CONNECT_HOSTNAME | Defines the hostname for Core Connect. Default: localhost |
TAMR_CONNECT_JDBC_POOL_IMPL | Defines the JDBC pool implementation. Possible values are dbcp, Basic, and Hikari. Default: dbcp |
TAMR_CONNECT_KERBEROS_KEYTAB | Used for HDFS and Hive. The file path to a keytab file for the Kerberos principal defined in TAMR_CONNECT_KERBEROS_PRINCIPAL . |
TAMR_CONNECT_KERBEROS_PRINCIPAL | Used for HDFS and Hive. The Kerberos principal as whom to authenticate. |
TAMR_CONNECT_LOG_LEVEL | Defines the log level for Core Connect. See Logging in Single-Node Deployments or Logging in Cloud-Native Deployments for more information on Tamr Core microservice logs. Default: INFO |
TAMR_CONNECT_MAX_CONCURRENT_REQUESTS | Defines how many Core Connect jobs can run at the same time. Running more jobs concurrently requires more memory; see TAMR_CONNECT_MEMORY. Default: 2 |
TAMR_CONNECT_MEMORY | Defines the amount of memory allocated to the Core Connect driver. Default: 2G |
TAMR_CONNECT_PORT | Defines the port where Core Connect will run. Default: 9050 |
TAMR_CONNECT_PROFILING_SAMPLE_SIZE | Defines the number of records to read to generate profiling results. Default: 100000 |
TAMR_CONNECT_TMP_DIR | Defines the temporary directory used by the Core Connect process. Default: {{ TAMR_UNIFY_HOME }}/tmp/connect |
See the Configuration Variable Reference for more details for each variable.
To set Core Connect configuration variables, see Configuring Tamr Core for instructions.
You must restart Tamr Core and its dependencies after making a configuration change. See Restarting Tamr Core.