
Using the DMS API

After DMS is configured, you can import and export data files between Tamr and your cloud storage with the Data Movement Service (DMS).

Before you can use the DMS API, your system administrator must configure the DMS. See Configuring the Data Movement Service.

You can import and export data files between Tamr and a cloud storage provider by issuing POST requests to the REST web API for the Data Movement Service (DMS). These POST requests start an asynchronous job to transfer the files, and return a response with a job ID. You issue GET requests to obtain information about the jobs initiated by POST requests, including whether they have completed.

Important: CSV and Parquet files must have a file extension to be imported by DMS.

The /data/transfer endpoint for the DMS is available at http://localhost:9155/data/transfer.

Note: This version of DMS supports API interaction only through command-line utilities such as cURL.

DMS POST Requests

The body of a DMS POST request includes:

  • sourceType and sourceConfig, which identify the current location of the data file or dataset.
  • sinkType and sinkConfig, which identify the destination for the data file or dataset.

In the example that follows, data files are being imported into Tamr. The sinkConfig options include whether the data should be appended to an existing dataset, the name of the primary key column, the name of the dataset, and whether to ask Tamr to run a profiling job on the dataset once it is loaded.
A complete reference follows this example POST request to the Data Movement Service’s /data/transfer endpoint:

curl --silent --location --request POST 'http://localhost:9155/data/transfer' \
--header 'Content-Type: application/json' \
--user '<tamr_user>':'<tamr_user_password>' \
--data-raw '{
   "sourceType": "S3.SOURCE.FILE",
   "sourceConfig": {
       "bucket": "string",
       "pathPrefix": "string",
       "region": "string",
       "fileType": "PARQUET"
   },
   "sinkType": "TAMR.SINK",
   "sinkConfig": {
       "appendToDataset": boolean,
       "primaryKeys": [
           "string"
       ],
       "threads": integer,
       "username": "string",
       "password": "string",
       "datasetName": "string",
       "profile": boolean
   },
   "inheritSchema": false,
   "renameAttributes": {
       "attr_000_rec2_id": "rec_id",
       "attr_002_state": "state"
    }
}'

The response with the job ID is in the following format:

{"type":"TRANSFER_DATA","id":"66dfd23b-f546-4f18-8af5-eba121bd74a3"}

Request Body Parameters

sourceType

Specifies the source connection type and has a value of TAMR.SOURCE, S3.SOURCE.FILE, ADLS2.SOURCE.FILE, or GCS.SOURCE.FILE.

sourceConfig

The sourceConfig parameter includes different key-value pairs based on the sourceType you supply. See the sections that follow.

sourceConfig: Tamr

To export a dataset from Tamr to a cloud storage destination, you set the sourceType value to TAMR.SOURCE and use the following syntax for sourceConfig.

"sourceType": "TAMR.SOURCE",
"sourceConfig": {
  "username": "string",
  "password": "string",
  "hostname": "string",
  "port": integer,
  "datasetName": "string"
}

The following list describes the sourceConfig key-value pairs when the sourceType is TAMR.SOURCE.

  • username: String. Optional. The username of the admin user in Tamr.
  • password: String. Optional. The password for the admin user in Tamr.
  • hostname: String. Optional. The Tamr hostname. For example, 10.1.0.1.
  • port: Integer. Optional. The port on which Tamr is running. For example, 9100.
  • datasetName: String. Required. The name of an existing Tamr dataset.
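
For example, a filled-in sourceConfig for exporting an existing Tamr dataset named customers might look like the following sketch. The username, password, hostname, and port are placeholder values; replace them with values for your environment.

"sourceType": "TAMR.SOURCE",
"sourceConfig": {
  "username": "admin_user",
  "password": "a_password",
  "hostname": "10.1.0.1",
  "port": 9100,
  "datasetName": "customers"
}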

S3: sourceConfig and sinkConfig

To import data files from Amazon Web Services (AWS) S3 storage into Tamr, you set sourceType to S3.SOURCE.FILE and sinkType to TAMR.SINK. An example of the sourceConfig syntax for S3 follows.

"sourceType": "S3.SOURCE.FILE",
"sourceConfig": {
    "bucket": "my-s3-bucket-parquet",
    "pathPrefix": "organization/parquet_folder",
    "region": "us-east-1",
    "fileType": "PARQUET"
}

To export a dataset from Tamr to Amazon Web Services storage, you set sourceType to TAMR.SOURCE and sinkType to S3.SINK.FILE.

The same key-value pairs apply to both sourceConfig and sinkConfig for AWS S3.

  • bucket: String. Required. The bucket name in AWS S3. For example, my-s3-bucket-parquet.
  • pathPrefix: String. Required. The path prefix folder structure in AWS S3. For example, organization/parquet_folder.
      • sourceConfig: on import, the DMS uploads the directory structure and all files in the specified location recursively. All data files must have the same schema.
      • sinkConfig: on export, when fileType is PARQUET the DMS creates a folder named /customers.parquet in the specified folder path and exports the file into that folder. As a result, the complete path for the exported file in this example is organization/parquet_folder/customers.parquet.
  • region: String. Required. The AWS region in which the AWS S3 bucket is available. For example, us-east-1.
  • fileType: String. Required. The file type: PARQUET (case sensitive) for Parquet files or CSV for comma-separated values files.
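
For reference, a request that exports a Tamr dataset to AWS S3 combines TAMR.SOURCE with S3.SINK.FILE, using the key-value pairs described above. The following is an illustrative sketch only; all values are placeholders that you replace with values for your environment.

curl --silent --location --request POST 'http://10.1.0.1:9155/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
   "sourceType": "TAMR.SOURCE",
   "sourceConfig": {
       "username": "admin_user",
       "password": "a_password",
       "hostname": "10.1.0.1",
       "port": 9100,
       "datasetName": "customers"
   },
   "sinkType": "S3.SINK.FILE",
   "sinkConfig": {
       "bucket": "my-s3-bucket-parquet",
       "pathPrefix": "organization/parquet_folder",
       "region": "us-east-1",
       "fileType": "PARQUET"
   }
}'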

ADLS2: sourceConfig and sinkConfig

To import data files from Azure Data Lake Storage (ADLS) Gen2 storage into Tamr, you set sourceType to ADLS2.SOURCE.FILE and sinkType to TAMR.SINK. An example of the sourceConfig syntax for ADLS2 follows.

"sourceType": "ADLS2.SOURCE.FILE",
"sourceConfig": {
  "accountName": "my_adls_storage",
  "accountKey": "key",
  "containerName": "inbound",
  "pathPrefix": "data/customer/customers.parquet",
  "fileType": "PARQUET"
}

To export a dataset from Tamr to ADLS, you set sourceType to TAMR.SOURCE and sinkType to ADLS2.SINK.FILE.

The same key-value pairs apply to both sourceConfig and sinkConfig for ADLS2.

You must authenticate with ADLS2 if you have not set up the credentials file or if you are using credentials other than the default credentials. You can authenticate with ADLS2 using either an account key or service principals; an illustrative service principal configuration follows the list below.

  • accountName: String. Required. The ADLS2 account name. For example, my_adls_storage.
  • accountKey: String. Required when authenticating with the account key. The ADLS2 account key. For example, fTyPR8RnjfYpzM43e6J9iTtJP8pws2AV4gu9jUKVNnYer5hY2j0CEReTt. You can view and copy your storage account access keys from the Azure portal.
  • clientId: String. Required when authenticating with service principals. The ID of the service principal object or app registered with Active Directory.
  • clientSecret: String. Required when authenticating with service principals. The password for this service principal.
  • tenantId: String. Required when authenticating with service principals. The Azure Active Directory ID.
  • containerName: String. Required. The container for the required files. For example, inbound.
  • pathPrefix: String. Required. The path prefix folder structure in ADLS2. For example, data/customer/customers.parquet.
      • sourceConfig: on import, the DMS uploads the directory structure and all files in the specified location recursively. All data files must have the same schema.
      • sinkConfig: on export, when fileType is PARQUET the DMS creates a folder named /customers.parquet in the specified folder path and exports the file into that folder. As a result, the complete path for the exported file in this example is data/customer/customers.parquet/customers.parquet.
  • fileType: String. Required. The file type: PARQUET (case sensitive) for Parquet files or CSV for comma-separated values files.
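
If you authenticate with service principals rather than with an account key, you supply clientId, clientSecret, and tenantId instead of accountKey. The following sourceConfig is an illustrative sketch only; all values are placeholders.

"sourceType": "ADLS2.SOURCE.FILE",
"sourceConfig": {
  "accountName": "my_adls_storage",
  "clientId": "<service_principal_client_id>",
  "clientSecret": "<service_principal_client_secret>",
  "tenantId": "<azure_tenant_id>",
  "containerName": "inbound",
  "pathPrefix": "data/customer/customers.parquet",
  "fileType": "PARQUET"
}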

GCS: sourceConfig and sinkConfig

To import data files from Google Cloud Storage (GCS) into Tamr, you set sourceType to GCS.SOURCE.FILE and sinkType to TAMR.SINK. An example of the sourceConfig syntax for GCS follows.

"sourceType": "GCS.SOURCE.FILE",
"sourceConfig": {
  "projectId": "tamr-gce-dev",
  "bucket": "tamr-datasets",
  "pathPrefix": "data/customer/parquet",
  "fileType": "PARQUET"
},

To export a dataset from Tamr to GCS, you set sourceType to TAMR.SOURCE and sinkType to GCS.SINK.FILE.

The same key-value pairs apply to both sourceConfig and sinkConfig for GCS.

  • projectId: String. Required. The Google Cloud project ID. For example, tamr-gce-dev.
  • bucket: String. Required. The bucket name in GCS. For example, tamr-datasets.
  • pathPrefix: String. Required. The path prefix folder structure in GCS. For example, data/customer/parquet.
      • sourceConfig: on import, the DMS uploads the directory structure and all files in the specified location recursively. All data files must have the same schema.
      • sinkConfig: on export, when fileType is PARQUET the DMS creates a folder named /customers.parquet in the specified folder path and exports the file into that folder. As a result, the complete path for the exported file in this example is data/customer/parquet/customers.parquet.
  • fileType: String. Required. The file type: PARQUET (case sensitive) for Parquet files or CSV for comma-separated values files.
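
A complete import request from GCS follows the same pattern as the other examples in this topic. The following sketch pairs the GCS sourceConfig above with a TAMR.SINK sinkConfig, which is described in the next section; all values are placeholders.

curl --silent --location --request POST 'http://10.1.0.1:9155/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
   "sourceType": "GCS.SOURCE.FILE",
   "sourceConfig": {
       "projectId": "tamr-gce-dev",
       "bucket": "tamr-datasets",
       "pathPrefix": "data/customer/parquet",
       "fileType": "PARQUET"
   },
   "sinkType": "TAMR.SINK",
   "sinkConfig": {
       "appendToDataset": false,
       "primaryKeys": [
           "pk"
       ],
       "threads": 8,
       "username": "admin_user",
       "password": "a_password",
       "hostname": "10.1.0.1",
       "port": 9100,
       "datasetName": "customers",
       "profile": false
   }
}'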

sinkType

Specifies the destination connection type and has a value of TAMR.SINK, S3.SINK.FILE, GCS.SINK.FILE, or ADLS2.SINK.FILE.

sinkConfig

The sinkConfig parameter includes different key-value pairs based on the sinkType you supply. See the sections that follow.

sinkConfig: Tamr

To import data files into Tamr from a cloud storage location, you set the sinkType to TAMR.SINK and use the following syntax for sinkConfig.

"sinkType": "TAMR.SINK",
"sinkConfig": {
   "appendToDataset": true,
   "primaryKeys": [
       "string"
   ],
   "threads": 8,
   "username": "string",
   "password": "string",
   "hostname": "string",
   "port": integer,
   "datasetName": "string",
   "profile": boolean
}

The following list describes the sinkConfig key-value pairs when the sinkType is TAMR.SINK.

  • appendToDataset: Boolean. The default is false. Specifies whether to append the data to an existing input dataset in Tamr.
      • If the dataset exists and appendToDataset is false, Tamr truncates (empties) the dataset and creates the new schema for the input dataset from the loaded data.
      • If the dataset exists and appendToDataset is true, Tamr appends data to it without truncating the existing data. If a record already exists (based on the primary key), the existing record is updated. If a record does not already exist, it is added as a new record.
      • If a dataset with the supplied datasetName does not exist, Tamr creates it.
  • primaryKeys: Array. Optional. The primary key Tamr should use for its input dataset. For example, pk1. Currently, only one primary key is supported. If you specify more than one primary key in this array, Tamr uses the first item listed as the primary key. If primaryKeys is null or undefined, Tamr creates a primary key attribute and populates it with a random GUID.
  • threads: Integer. The number of threads to use on import. The default is 8. For example, 16. Specifying more than one thread allows the ingest job to run in parallel and complete faster. The number of threads must be equal to or less than the number of threads assigned to Tamr by the TAMR_API_FACADE_MAX_CONCURRENT_REQUESTS configuration variable. Depending on the compute resources available to the DMS and Tamr instances, increasing the number of threads can degrade performance; in particular, using more threads than the number of CPUs available to the machine can degrade performance.
  • username: String. Optional. The username of the admin user in Tamr.
  • password: String. Optional. The password for the admin user in Tamr.
  • hostname: String. Optional. The Tamr hostname. For example, 10.1.0.1.
  • port: Integer. Optional. The port on which Tamr is running. For example, 9100.
  • datasetName: String. Required. The name of a new or existing Tamr source dataset. Tamr creates the dataset if it does not exist. If the dataset exists, the loaded data is appended to it or replaces it, depending on the appendToDataset value.
  • profile: Boolean. The default is false. Whether to start a profiling job for the dataset after it is imported. For more information, see Profiling a Dataset.

renameAttributes

Object. Optional. This parameter identifies attribute (column) names in the source dataset to rename in the sink. The key is the attribute name in the source and the value is the attribute name in the sink. This parameter is useful when a Tamr attribute name includes a space, because column names that contain spaces can cause errors in Parquet files.

In the following example, "attr_000_rec2_id" is renamed "rec_id" and " cluster_id" (which begins with a space) is renamed "cluster_id".

"renameAttributes": {
  "attr_000_rec2_id": "rec_id",
  " cluster_id": "cluster_id"
}

inheritSchema

Boolean. Defaults to true.

To make recommendations for schema mapping, Tamr requires all attributes to have a data type of either string or string array, and requires the primary key to have a data type of string. When you use DMS to import a file in Parquet format, you can use this parameter to override the schema supplied in the file with string or string[]. To override the schema specified in a source file, include "inheritSchema": false at the top level of the request body.

Note: For complex Parquet files, set the inheritSchema option to false to convert all primitive types to string. See Data Movement Service for more information about complex Parquet files.

"inheritSchema": false

Example DMS POST Requests

The following examples of POST requests to the /data/transfer endpoint of the Data Movement Service illustrate how to fill in values for specific parameters.

Example Request: AWS S3 to Tamr

curl --silent --location --request POST 'http://10.1.0.1:9155/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
   "sourceType": "S3.SOURCE.FILE",
   "sourceConfig": {
       "bucket": "my-s3-bucket-parquet",
       "pathPrefix": "some_folder/nested",
       "region": "us-east-1",
       "fileType": "PARQUET"
   },
   "sinkType": "TAMR.SINK",
   "sinkConfig": {
       "appendToDataset": false,
       "primaryKeys": [
           "pk" 
       ],
       "threads": 16,
       "username": "admin_user",
       "password": "a_password",
       "hostname": "10.1.0.1",
       "port": 9100,
       "datasetName": "YOUR_DATASET",
       "profile": true
   }
}'

Example Request: MS Azure ADLS Gen 2 to Tamr

curl --silent --location --request POST 'http://10.1.0.1:9155/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
   "sourceType": "ADLS2.SOURCE.FILE",
   "sourceConfig": {
       "accountName": "accountName",
       "accountKey": "accountKey",
       "containerName": "inbound",
       "pathPrefix": "customers.parquet",
       "fileType": "PARQUET"
   }, 
   "sinkType": "TAMR.SINK",
   "sinkConfig": {
       "appendToDataset": true,
       "primaryKeys": [
           "tamr_id" 
       ],
       "threads": 16,
       "username": "admin_user",
       "password": "a_password",
       "hostname": "10.1.0.1",
       "port": 9100,
       "datasetName": "customers",
       "profile": false
   }
}'

Example Request: Tamr to MS Azure ADLS Gen 2

curl --silent --location --request POST 'http://localhost:9155/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
     "sourceType": "TAMR.SOURCE",
         "sourceConfig": {
           "username": "admin_user",
           "password": "a_password",
           "hostname": "10.1.0.1",
           "port": 9100,
           "datasetName": "customers"
         },
        "sinkType": "ADLS2.SINK.FILE",
        "sinkConfig": {
            "accountName": "",
            "accountKey": "",
            "containerName": "export",
            "pathPrefix": "customers.parquet",
            "fileType": "PARQUET"
        }
    }'

DMS GET Requests for Tamr Jobs

Submitting a GET request to the /data/transfer endpoint of the Data Movement Service returns the job ID and the job type, where:

  • The job ID is a unique job identifier, such as bc2ae0e1-ce34-49c0-91e3-5765e06594e4. Use the job ID to check the job’s status and progress.
    Note: The job ID is a GUID created by DMS and uses a different format than the numeric job ID created by Tamr.
  • The job type is TRANSFER_DATA.
  • For successfully completed DMS jobs, the status is completed, rather than succeeded, which is reported for other Tamr jobs.

Note: DMS jobs are not persisted; after a restart, previous and in-progress DMS jobs are no longer listed.

To obtain a listing of all jobs, submit a GET 'http://<ip>:<port>/xdata/jobs' request to the jobs endpoint of the Data Movement Service, as follows:

curl --silent --location --user '<admin_user>':'<admin_password>' --request GET 'http://10.1.0.1:9155/xdata/jobs'

To obtain information about a specific job, submit a similar request to the job endpoint and specify the job ID and type in the format GET 'http://<ip>:<port>/job?id=<value>&type=TRANSFER_DATA', as in this example:

curl --silent --location --user '<admin_user>':'<admin_password>' --request GET 'http://10.1.0.1:9155/job?id=bc2ae0e1-ce34-49c0-91e3-5765e06594e4&type=TRANSFER_DATA'
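
If you script calls to the API, you can poll a job until it reports the completed status. The following is an illustrative sketch only: it assumes that the jq utility is installed and that the job response includes a status field, as described above; adjust the field name if your response differs.

# Illustrative sketch: poll a DMS job until its status is "completed".
# Assumes jq is installed; the job ID and credentials are placeholders.
JOB_ID='bc2ae0e1-ce34-49c0-91e3-5765e06594e4'
STATUS=''
until [ "${STATUS}" = "completed" ]; do
  sleep 10
  STATUS=$(curl --silent --location --user '<admin_user>':'<admin_password>' \
  --request GET "http://10.1.0.1:9155/job?id=${JOB_ID}&type=TRANSFER_DATA" | jq -r '.status')
  echo "Job ${JOB_ID} status: ${STATUS}"
done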
