
Using the DMS API

After the Data Movement Service (DMS) is configured, you use it to import and export data files between Tamr Core and your cloud storage provider.

Tamr recommends using Core Connect to import and export large data files between Tamr Core and your cloud storage provider.

Before you can use the DMS API, your system administrator must configure the DMS. See Configuring the Data Movement Service.

You can import and export data files between Tamr Core and a cloud storage provider by issuing POST requests to the REST web API for the Data Movement Service (DMS). These POST requests start an asynchronous job to transfer the files, and return a response with a job ID. You issue GET requests to obtain information about the jobs initiated by POST requests, including whether they have completed.

Important: CSV and Parquet files must have a file extension to be imported by DMS.

The /data/transfer endpoint for the DMS is available at http://localhost:9100/api/dms/data/transfer. If Tamr Core is not available, port 9155 is the proxied port.

Note: This version of DMS supports API interaction through command-line utilities, including cURL, only.

DMS POST Requests

The body of a DMS POST request includes:

  • sourceType and sourceConfig, which identify the current location of the data file or dataset.
  • sinkType and sinkConfig, which identify the destination for the data file or dataset.

In the example that follows, data files are being imported into Tamr Core. The sinkConfig options include whether the data should be appended to an existing dataset, the name of the primary key column, the name of the dataset, and whether to run a profiling job on the dataset after it is loaded.
A complete reference follows this example POST request to the Data Movement Service’s /data/transfer endpoint:

curl --silent --location --request POST 'http://localhost:9100/api/dms/data/transfer' \
--header 'Content-Type: application/json' \
--user '<tamr_user>':'<tamr_user_password>' \
--data-raw '{
   "sourceType": "S3.SOURCE.FILE",
   "sourceConfig": {
       "bucket": "string",
       "pathPrefix": "string",
       "region": "string",
       "fileType": "PARQUET"
   },
   "sinkType": "TAMR.SINK",
   "sinkConfig": {
       "appendToDataset": boolean,
       "primaryKeys": [
           "string"
       ],
       "threads": integer,
       "username": "string",
       "password": "string",
       "datasetName": "string",
       "profile": boolean
   },
   "inheritSchema": false,
   "renameAttributes": {
       "attr_000_rec2_id": "rec_id",
       "attr_002_state": "state"
    },
}'

The response with the job ID is in the following format:

{"type":"TRANSFER_DATA","id":"66dfd23b-f546-4f18-8af5-eba121bd74a3"}

Request Body Parameters

sourceType

Specifies the source connection type and has a value of TAMR.SOURCE, S3.SOURCE.FILE, ADLS2.SOURCE.FILE, or GCS.SOURCE.FILE.

sourceConfig

The sourceConfig parameter includes different key-value pairs based on the sourceType you supply, as described in the sections that follow.

sourceConfig: Tamr

To export a dataset from Tamr Core to a cloud storage destination, you set the sourceType value to TAMR.SOURCE and use the following syntax for sourceConfig.

"sourceType": "TAMR.SOURCE",
"sourceConfig": {
  "username": "string",
  "password": "string",
  "hostname": "string",
  "port": integer,
  "datasetName": "string"
}

The following list describes the sourceConfig key-value pairs when the sourceType is TAMR.SOURCE.

  • username: String. Optional. The username of the admin user in Tamr Core.
  • password: String. Optional. The password for the admin user in Tamr Core.
  • hostname: String. Optional. The Tamr Core hostname. For example, 10.1.0.1.
  • port: Integer. Optional. The port on which Tamr Core is running. For example, 9100.
  • datasetName: String. Required. The name of an existing Tamr Core dataset.

S3: sourceConfig and sinkConfig

To import data files from Amazon Web Services (AWS) S3 storage into Tamr Core, you set sourceType to S3.SOURCE.FILE and sinkType to TAMR.SINK. An example of the sourceConfig syntax for S3 follows.

"sourceType": "S3.SOURCE.FILE",
"sourceConfig": {
    "bucket": "my-s3-bucket-parquet",
    "pathPrefix": "organization/parquet_folder",
    "region": "us-east-1",
    "fileType": "PARQUET"
}

To export a dataset from Tamr Core to Amazon Web Services storage, you set sourceType to TAMR.SOURCE and sinkType to S3.SINK.FILE.

The same key-value pairs apply to both sourceConfig and sinkConfig for AWS S3.

  • bucket: String. Required. The bucket name in AWS S3. For example, my-s3-bucket-parquet.
  • pathPrefix: String. Required. The path prefix folder structure in AWS S3. For example, organization/parquet_folder.
    • sourceConfig: on import, the DMS uploads the directory structure and all files in the specified location recursively. All data files must have the same schema.
    • sinkConfig: on export, when fileType is PARQUET the DMS creates a folder named /customers.parquet in the specified folder path and exports the file into that folder. As a result, the complete path for the exported file in this example is organization/parquet_folder/customers.parquet.
  • region: String. Required. The AWS region in which the AWS S3 bucket is available. For example, us-east-1.
  • fileType: String. Required. The file type: PARQUET (case sensitive) for Parquet files or CSV for comma-separated values files.
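
For reference, the following sketch shows an export request from Tamr Core to AWS S3 using the key-value pairs above; the bucket, path, region, credentials, and dataset name are placeholder values.

curl --silent --location --request POST 'http://localhost:9100/api/dms/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
   "sourceType": "TAMR.SOURCE",
   "sourceConfig": {
       "username": "admin_user",
       "password": "a_password",
       "hostname": "10.1.0.1",
       "port": 9100,
       "datasetName": "customers"
   },
   "sinkType": "S3.SINK.FILE",
   "sinkConfig": {
       "bucket": "my-s3-bucket-parquet",
       "pathPrefix": "organization/parquet_folder",
       "region": "us-east-1",
       "fileType": "PARQUET"
   }
}'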

ADLS2: sourceConfig and sinkConfig

To import data files from Azure Data Lake Storage (ADLS) Gen2 storage into Tamr Core, you set sourceType to ADLS2.SOURCE.FILE and sinkType to TAMR.SINK. An example of the sourceConfig syntax for ADLS2 follows.

"sourceType": "ADLS2.SOURCE.FILE",
"sourceConfig": {
  "accountName": "my_adls_storage",
  "accountKey": "key",
  "containerName": "inbound",
  "pathPrefix": "data/customer/customers.parquet",
  "fileType": "PARQUET"
}

To export a dataset from Tamr Core to ADLS, you set sourceType to TAMR.SOURCE and sinkType to ADLS2.SINK.FILE.

The same key-value pairs apply to both sourceConfig and sinkConfig for ADLS2.

You must authenticate with ADLS2 if you have not set up the credentials file, or if you are using credentials other than the default credentials. You can authenticate with ADLS2 using either an account key or service principals.

  • accountName: String. Required. The ADLS2 account name. For example, my_adls_storage.
  • accountKey: String. Required when authenticating with the account key. The ADLS2 account key. For example, fTyPR8RnjfYpzM43e6J9iTtJP8pws2AV4gu9jUKVNnYer5hY2j0CEReTt. You can view and copy your storage account access keys from the Azure portal.
  • clientId: String. Required when authenticating with service principals. The ID of the service principal object or app registered with Active Directory.
  • clientSecret: String. Required when authenticating with service principals. The password for this service principal.
  • tenantId: String. Required when authenticating with service principals. The Azure Active Directory ID.
  • containerName: String. Required. The container for the required files. For example, inbound.
  • pathPrefix: String. Required. The path prefix folder structure in ADLS2. For example, data/customer/customers.parquet.
    • sourceConfig: on import, the DMS uploads the directory structure and all files in the specified location recursively. All data files must have the same schema.
    • sinkConfig: on export, when fileType is PARQUET the DMS creates a folder named /customers.parquet in the specified folder path and exports the file into that folder. As a result, the complete path for the exported file in this example is data/customer/customers.parquet/customers.parquet.
  • fileType: String. Required. The file type: PARQUET (case sensitive) for Parquet files or CSV for comma-separated values files.

GCS: sourceConfig and sinkConfig

To import data files from Google Cloud Storage (GCS) into Tamr Core, you set sourceType to GCS.SOURCE.FILE and sinkType to TAMR.SINK. An example of the sourceConfig syntax for GCS follows.

"sourceType": "GCS.SOURCE.FILE",
"sourceConfig": {
  "projectId": "tamr-gce-dev",
  "bucket": "tamr-datasets",
  "pathPrefix": "data/customer/parquet",
  "fileType": "PARQUET"
}

To export a dataset from Tamr Core to GCS, you set sourceType to TAMR.SOURCE and sinkType to GCS.SINK.FILE.

The same key-value pairs apply to both sourceConfig and sinkConfig for GCS.

  • projectId: String. Required. The GCS project ID. For example, tamr-gce-dev.
  • bucket: String. Required. The bucket name in GCS. For example, tamr-datasets.
  • pathPrefix: String. Required. The path prefix folder structure in GCS. For example, data/customer/parquet.
    • sourceConfig: on import, the DMS uploads the directory structure and all files in the specified location recursively. All data files must have the same schema.
    • sinkConfig: on export, when fileType is PARQUET the DMS creates a folder named /customers.parquet in the specified folder path and exports the file into that folder. As a result, the complete path for the exported file in this example is data/customer/parquet/customers.parquet.
  • fileType: String. Required. The file type: PARQUET (case sensitive) for Parquet files or CSV for comma-separated values files.
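
For reference, the following sketch shows an import request from GCS into Tamr Core using the key-value pairs above; the project ID, bucket, path, credentials, and dataset name are placeholder values.

curl --silent --location --request POST 'http://localhost:9100/api/dms/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
   "sourceType": "GCS.SOURCE.FILE",
   "sourceConfig": {
       "projectId": "tamr-gce-dev",
       "bucket": "tamr-datasets",
       "pathPrefix": "data/customer/parquet",
       "fileType": "PARQUET"
   },
   "sinkType": "TAMR.SINK",
   "sinkConfig": {
       "appendToDataset": false,
       "primaryKeys": [
           "pk"
       ],
       "threads": 4,
       "username": "admin_user",
       "password": "a_password",
       "hostname": "10.1.0.1",
       "port": 9100,
       "datasetName": "customers",
       "profile": false
   }
}'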

sinkType

Specifies the destination connection type and has a value of TAMR.SINK, S3.SINK.FILE, GCS.SINK.FILE, or ADLS2.SINK.FILE.

sinkConfig

The sinkConfig parameter includes different key-value pairs based on the sinkType you supply, as described in the sections that follow.

sinkConfig: Tamr

To import data files into Tamr Core from a cloud storage location, you set the sinkType to TAMR.SINK and use the following syntax for sinkConfig.

"sinkType": "TAMR.SINK",
"sinkConfig": {
   "appendToDataset": true,
   "primaryKeys": [
       "string"
   ],
   "threads": 8,
   "username": "string",
   "password": "string",
   "hostname": "string",
   "port": integer,
   "datasetName": "string",
   "profile": boolean
}

The following list describes the key-value pairs for sinkConfig when sinkType is TAMR.SINK.

  • appendToDataset: Boolean. The default is false. Specifies whether to append the data to an existing input dataset in Tamr Core. See Import Examples.
    • If the dataset exists and appendToDataset is false, Tamr Core truncates (empties) the dataset and creates the new schema for the input dataset from the loaded data.
    • If the dataset exists and appendToDataset is true, Tamr Core appends data to it and does not truncate (empty) the dataset of existing data. If a record already exists (based on the primary key), the existing record is updated. If a record does not already exist (based on the primary key), it is added as a new record.
    • If a dataset with the supplied datasetName does not exist, Tamr Core creates it.
  • primaryKeys: Array. Optional. The primary key for Tamr Core to use for its input dataset. For example, pk1. Currently, only one primary key is supported. If you specify more than one primary key in this array, Tamr Core uses the first item listed as the primary key. If primaryKeys is null or undefined, Tamr Core creates a primary key attribute and populates it with a random GUID.
  • threads: Integer. The number of threads to use on import. The maximum and default is 8. For example, 4. Specifying more than one thread runs the ingest job in parallel so that it completes faster. The number of threads must be equal to or less than the number of threads assigned to Tamr Core by the TAMR_API_FACADE_MAX_CONCURRENT_REQUESTS configuration variable. Depending on the compute resources available on the DMS and Tamr Core instances, increasing the number of threads can degrade performance; in particular, using more threads than the number of CPUs available to the machine can degrade performance.
  • username: String. Optional. The username of the admin user in Tamr Core.
  • password: String. Optional. The password for the admin user in Tamr Core.
  • hostname: String. Optional. The Tamr Core hostname. For example, 10.1.0.1.
  • port: Integer. Optional. The port on which Tamr Core is running. For example, 9100.
  • datasetName: String. Required. The name of a new or existing Tamr Core source dataset. Tamr Core creates the dataset if it does not exist. If the dataset exists, the loaded data is appended to it or replaces it, depending on the appendToDataset value.
  • profile: Boolean. The default is false. Whether to start a profiling job for the dataset after it is imported. For more information, see Profiling a Dataset.

renameAttributes

Object. Optional. This parameter specifies attribute (column) names in the source dataset to rename in the export files. Each key is the attribute name in the source, and each value is the attribute name in the sink. This parameter is useful when a Tamr Core attribute name includes a space, because column names that contain spaces can cause errors in Parquet files.

In the following example, "attr_000_rec2_id" is renamed "rec_id" and " cluster_id" (which begins with a space) is renamed "cluster_id".

"renameAttributes": {
  "attr_000_rec2_id": "rec_id",
  " cluster_id": "cluster_id"
}

inheritSchema

Boolean. The default is true.

To make recommendations for schema mapping, Tamr Core requires all attributes to have a data type of string or string array, and requires the primary key to have a data type of string. When you use DMS to import a file in Parquet format, you can use this parameter to override the schema supplied in the file with string or string[]. To override the schema specified in a source file, include inheritSchema: false in the top-level configuration.

Note: For complex Parquet files, set the inheritSchema option to false to convert all primitive types to string. See Data Movement Service for more information about complex Parquet files.

"inheritSchema": false

Example DMS POST Requests

The following examples of POST requests to the /data/transfer endpoint of the Data Movement Service illustrate how to fill in values for specific parameters.

Example Request: AWS S3 to Tamr Core

curl --silent --location --request POST 'http://10.1.0.1:9100/api/dms/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
   "sourceType": "S3.SOURCE.FILE",
   "sourceConfig": {
       "bucket": "my-s3-bucket-parquet",
       "pathPrefix": "some_folder/nested",
       "region": "us-east-1",
       "fileType": "PARQUET"
   },
   "sinkType": "TAMR.SINK",
   "sinkConfig": {
       "appendToDataset": false,
       "primaryKeys": [
           "pk" 
       ],
       "threads": 4,
       "username": "admin_user",
       "password": "a_password",
       "hostname": "10.1.0.1",
       "port": 9100,
       "datasetName": "YOUR_DATASET",
       "profile": true
   }
}'

Example Request: MS Azure ADLS Gen 2 to Tamr Core

curl --silent --location --request POST 'http://10.1.0.1:9100/api/dms/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
   "sourceType": "ADLS2.SOURCE.FILE",
   "sourceConfig": {
       "accountName": "accountName",
       "accountKey": "accountKey",
       "containerName": "inbound",
       "pathPrefix": "customers.parquet",
       "fileType": "PARQUET"
   }, 
   "sinkType": "TAMR.SINK",
   "sinkConfig": {
       "appendToDataset": true,
       "primaryKeys": [
           "tamr_id" 
       ],
       "threads": 4,
       "username": "admin_user",
       "password": "a_password",
       "hostname": "10.1.0.1",
       "port": 9100,
       "datasetName": "customers",
       "profile": false
   }
}'

Example Request: Tamr Core to MS Azure ADLS Gen 2

curl --silent --location --request POST 'http://localhost:9100/api/dms/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
     "sourceType": "TAMR.SOURCE",
         "sourceConfig": {
           "username": "admin_user",
           "password": "a_password",
           "hostname": "10.1.0.1",
           "port": 9100,
           "datasetName": "customers"
         },
        "sinkType": "ADLS2.SINK.FILE",
        "sinkConfig": {
            "accountName": "",
            "accountKey": "",
            "containerName": "export",
            "pathPrefix": "customers.parquet",
            "fileType": "PARQUET"
        }
    }'
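
Example Request: Tamr Core to GCS

The following sketch shows the same export pattern with a GCS sink; the project ID, bucket, path, credentials, and dataset name are placeholder values.

curl --silent --location --request POST 'http://localhost:9100/api/dms/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
   "sourceType": "TAMR.SOURCE",
   "sourceConfig": {
       "username": "admin_user",
       "password": "a_password",
       "hostname": "10.1.0.1",
       "port": 9100,
       "datasetName": "customers"
   },
   "sinkType": "GCS.SINK.FILE",
   "sinkConfig": {
       "projectId": "tamr-gce-dev",
       "bucket": "tamr-datasets",
       "pathPrefix": "data/customer/parquet",
       "fileType": "PARQUET"
   }
}'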

Import Examples

When you use DMS to import source datasets into Tamr Core over time, the source datasets may undergo schema changes such as more or fewer columns or a changed column order. The value of the appendToDataset key determines the effect of schema changes on the Tamr Core dataset.

  • When appendToDataset is set to false, the Tamr Core dataset schema and all records are replaced by the source dataset.
  • When appendToDataset is set to true, the Tamr Core dataset schema and all records are retained. Changed records are updated, and new records are added.
    • If the source dataset has fewer columns than the Tamr Core dataset, new records are added to the Tamr Core dataset with null values in the additional columns.
    • If the source dataset has more columns than the Tamr Core dataset, new records are added with values for the existing attributes only. Values in the extra columns are not added.
      Tip: In this case, before you import you can use the Tamr Core API to add an attribute to the input dataset for each new column in the source dataset.
    • If the columns in the source dataset are in a different order than the attributes in the Tamr Core dataset, new records are added to the Tamr Core dataset with the values in the order defined by the Tamr Core dataset.

DMS GET Requests for Tamr Core Jobs

Submitting a GET request to the /data/transfer endpoint of the Data Movement Service returns the job ID and the job type, where:

  • The job ID is a unique job identifier, such as bc2ae0e1-ce34-49c0-91e3-5765e06594e4. Use the job ID to check the job’s status and progress.
    Note: The job ID is a GUID created by DMS and uses a different format than the numeric job ID created by Tamr Core.
  • The job type is TRANSFER_DATA.
  • For successfully completed DMS jobs, the status is completed, instead of succeeded, which is reported for other Tamr Core jobs.

Note: DMS jobs are not persisted; after a restart, previous and in-progress DMS jobs are no longer listed.

To obtain a listing of all jobs, submit a GET 'http://<ip>:<port>/api/dms/xdata/jobs' request to the jobs endpoint of the Data Movement Service, as follows:

curl --user '<tamr_user>':'<tamr_user_password>' --request GET 'http://localhost:9100/api/dms/xdata/jobs'

To obtain information about a specific job, submit a GET request to the job endpoint and specify the job ID and type in the format GET 'http://<ip>:<port>/api/dms/xdata/job?id=<value>&type=TRANSFER_DATA', as in this example:

curl --user '<tamr_user>':'<tamr_user_password>' --request GET 'http://localhost:9100/api/dms/xdata/job?id=bc2ae0e1-ce34-49c0-91e3-5765e06594e4&type=TRANSFER_DATA'
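
If you script DMS transfers, you can poll the job endpoint until the job reports the completed status. The following is a minimal sketch, assuming the response is JSON that includes a status field and that the jq utility is installed:

# Poll a DMS job until it reports "completed" (sketch; assumes the response JSON
# includes a status field and that jq is installed).
JOB_ID='bc2ae0e1-ce34-49c0-91e3-5765e06594e4'   # replace with the ID returned by your POST request
while true; do
  STATUS=$(curl --silent --user '<tamr_user>':'<tamr_user_password>' \
    --request GET "http://localhost:9100/api/dms/xdata/job?id=${JOB_ID}&type=TRANSFER_DATA" \
    | jq -r '.status')
  echo "Job ${JOB_ID} status: ${STATUS}"
  [ "${STATUS}" = "completed" ] && break
  sleep 10
done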