Using the DMS API
After the DMS is configured, you can import and export data files between Tamr Core and your cloud storage with the Data Movement Service (DMS).
Tamr recommends using Core Connect to import and export large data files between Tamr Core and your cloud storage provider.
Before you can use the DMS API, your system administrator must configure the DMS. See Configuring the Data Movement Service.
You can import and export data files between Tamr Core and a cloud storage provider by issuing POST requests to the REST web API for the Data Movement Service (DMS). These POST requests start an asynchronous job to transfer the files, and return a response with a job ID. You issue GET requests to obtain information about the jobs initiated by POST requests, including whether they have completed.
Important: CSV and Parquet files must have a file extension to be imported by DMS.
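As a pre-flight check, a small shell helper can flag object names that lack an extension before a transfer is requested. This is a minimal sketch: the function name has_file_extension is hypothetical, and it only tests that some extension is present, because this section does not list which extensions DMS accepts.

```shell
# Hypothetical helper: succeeds when the last path segment has an extension.
has_file_extension() {
  base="${1##*/}"          # strip any folder prefix
  case "$base" in
    ?*.?*) return 0 ;;     # at least one character on each side of a dot
    *)     return 1 ;;
  esac
}

# Example: report names that may not be importable.
for name in "data/customers.parquet" "data/customers"; do
  if ! has_file_extension "$name"; then
    echo "missing extension: $name"
  fi
done
```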
The /data/transfer endpoint for the DMS is available at http://localhost:9100/api/dms/data/transfer. If Tamr Core is not available, port 9155 is the proxied port.
Note: This version of DMS supports API interaction through command-line utilities, including cURL, only.
DMS POST Requests
The body of a DMS POST request includes:
- sourceType and sourceConfig, which identify the current location of the data file or dataset.
- sinkType and sinkConfig, which identify the destination for the data file or dataset.
In the example that follows, data files are being imported into Tamr Core. The sinkConfig options include whether the data should be appended to an existing dataset, the name of the primary key column, the name of the dataset, and whether to run a profiling job on the dataset after it is loaded.
A complete reference follows this example POST request to the Data Movement Service’s /data/transfer endpoint:
curl --silent --location --request POST 'http://localhost:9100/api/dms/data/transfer' \
--header 'Content-Type: application/json' \
--user '<tamr_user>':'<tamr_user_password>' \
--data-raw '{
"sourceType": "S3.SOURCE.FILE",
"sourceConfig": {
"bucket": "string",
"pathPrefix": "string",
"region": "string",
"fileType": "PARQUET"
},
"sinkType": "TAMR.SINK",
"sinkConfig": {
"appendToDataset": boolean,
"primaryKeys": [
"string"
],
"threads": integer,
"username": "string",
"password": "string",
"datasetName": "string",
"profile": boolean
},
"inheritSchema": false,
"renameAttributes": {
"attr_000_rec2_id": "rec_id",
"attr_002_state": "state"
}
}'
The response with the job ID is in the following format:
{"type":"TRANSFER_DATA","id":"66dfd23b-f546-4f18-8af5-eba121bd74a3"}
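Because the job ID is needed for later status checks, it can help to capture it when the request is submitted. The following is a minimal sketch, not part of DMS: the function names and the TAMR_USER, TAMR_PASSWORD, and DMS_URL variables are hypothetical, and the ID is extracted with sed on the assumption that the response has the flat shape shown above.

```shell
# Submit a transfer request; the JSON body is read from standard input.
# TAMR_USER, TAMR_PASSWORD, and DMS_URL are hypothetical variable names.
submit_transfer() {
  curl --silent --location --request POST \
    "${DMS_URL:-http://localhost:9100/api/dms}/data/transfer" \
    --header 'Content-Type: application/json' \
    --user "${TAMR_USER}:${TAMR_PASSWORD}" \
    --data-binary @-
}

# Print the "id" value from a response of the form
# {"type":"TRANSFER_DATA","id":"<job id>"}; prints nothing if no id is found.
extract_job_id() {
  printf '%s\n' "$1" |
    sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage (requires a running DMS, so it is left commented out):
#   response="$(submit_transfer < request.json)"
#   job_id="$(extract_job_id "$response")"
```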
Request Body Parameters
sourceType
Specifies the source connection type and has a value of TAMR.SOURCE, S3.SOURCE.FILE, ADLS2.SOURCE.FILE, or GCS.SOURCE.FILE.
sourceConfig
The sourceConfig parameter includes different key-value pairs based on the sourceType you supply. See:
- sourceConfig: Tamr
- S3: sourceConfig and sinkConfig
- GCS: sourceConfig and sinkConfig
- ADLS2: sourceConfig and sinkConfig
sourceConfig: Tamr
To export a dataset from Tamr Core to a cloud storage destination, you set the sourceType value to TAMR.SOURCE and use the following syntax for sourceConfig.
"sourceType": "TAMR.SOURCE",
"sourceConfig": {
"username": "string",
"password": "string",
"hostname": "string",
"port": integer,
"datasetName": "string"
}
The following table describes the sourceConfig key-value pairs when the sourceType is TAMR.SOURCE.
Name | Description |
---|---|
username | String. Optional. The username of the admin user in Tamr Core. |
password | String. Optional. The password for the admin user in Tamr Core. |
hostname | String. Optional. The Tamr Core hostname. For example, 10.1.0.1. |
port | Integer. Optional. The port on which Tamr Core is running. For example, 9100. |
datasetName | String. Required. The name of an existing Tamr Core dataset. |
S3: sourceConfig and sinkConfig
To import data files from Amazon Web Services (AWS) S3 storage into Tamr Core, you set sourceType to S3.SOURCE.FILE and sinkType to TAMR.SINK. An example of the sourceConfig syntax for S3 follows.
"sourceType": "S3.SOURCE.FILE",
"sourceConfig": {
"bucket": "my-s3-bucket-parquet",
"pathPrefix": "organization/parquet_folder",
"region": "us-east-1",
"fileType": "PARQUET"
}
To export a dataset from Tamr Core to Amazon Web Services storage, you set sourceType to TAMR.SOURCE and sinkType to S3.SINK.FILE.
The same key-value pairs apply to both sourceConfig and sinkConfig for AWS S3.
Name | Description |
---|---|
bucket | String. Required. The bucket name in AWS S3. For example, my-s3-bucket-parquet. |
pathPrefix | String. Required. The path prefix folder structure in AWS S3. For example, organization/parquet_folder. In sourceConfig, on import, the DMS uploads the directory structure and all files in the specified location recursively. All data files must have the same schema. In sinkConfig, on export, when fileType is PARQUET, the DMS creates a folder named /customers.parquet in the specified folder path and exports the file into that folder. As a result, the complete path for the exported file in this example is organization/parquet_folder/customers.parquet. |
region | String. Required. The AWS region in which the AWS S3 bucket is available. For example, us-east-1. |
fileType | String. Required. The file type: PARQUET (case sensitive) for Parquet files or CSV for comma-separated values files. |
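To illustrate the export direction for S3, the sketch below assembles a request body from the keys in the table above, using only the required datasetName key in sourceConfig. The helper name s3_export_body is hypothetical, and the arguments are inserted verbatim, so this is a convenience for simple values rather than a general JSON builder.

```shell
# Hypothetical helper: print a request body that exports a Tamr Core dataset
# to S3. Arguments are inserted verbatim, so they must not contain characters
# that need JSON escaping (quotes, backslashes).
#   $1 dataset name, $2 bucket, $3 path prefix, $4 region
s3_export_body() {
  cat <<EOF
{
  "sourceType": "TAMR.SOURCE",
  "sourceConfig": {
    "datasetName": "$1"
  },
  "sinkType": "S3.SINK.FILE",
  "sinkConfig": {
    "bucket": "$2",
    "pathPrefix": "$3",
    "region": "$4",
    "fileType": "PARQUET"
  }
}
EOF
}

s3_export_body customers my-s3-bucket-parquet organization/parquet_folder us-east-1
```

The printed body can be supplied to curl in place of an inline --data-raw value.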
ADLS2: sourceConfig and sinkConfig
To import data files from Azure Data Lake Storage (ADLS) Gen2 storage into Tamr Core, you set sourceType to ADLS2.SOURCE.FILE and sinkType to TAMR.SINK. An example of the sourceConfig syntax for ADLS2 follows.
"sourceType": "ADLS2.SOURCE.FILE",
"sourceConfig": {
"accountName": "my_adls_storage",
"accountKey": "key",
"containerName": "inbound",
"pathPrefix": "data/customer/customers.parquet",
"fileType": "PARQUET"
}
To export a dataset from Tamr Core to ADLS, you set sourceType to TAMR.SOURCE and sinkType to ADLS2.SINK.FILE.
The same key-value pairs apply to both sourceConfig and sinkConfig for ADLS2.
You are required to authenticate with ADLS2 if you have not set up the credentials file or if you are using credentials other than the default credentials. You can authenticate in ADLS2 either with an account key or with service principals.
Name | Description |
---|---|
accountName | String. Required. The ADLS2 account name. For example, my_adls_storage. |
accountKey | String. Required when authenticating using the account key. The ADLS2 account key. For example, fTyPR8RnjfYpzM43e6J9iTtJP8pws2AV4gu9jUKVNnYer5hY2j0CEReTt. You can view and copy your storage account access keys from the Azure portal. |
clientId | String. Required when authenticating with service principals. The ID of the service principal object or app registered with Active Directory. |
clientSecret | String. Required when authenticating with service principals. The password for this service principal. |
tenantId | String. Required when authenticating with service principals. The Azure Active Directory ID. |
containerName | String. Required. Container for the required files. For example, inbound. |
pathPrefix | String. Required. The path prefix folder structure in ADLS2. For example, data/customer/customers.parquet. In sourceConfig, on import, the DMS uploads the directory structure and all files in the specified location recursively. All data files must have the same schema. In sinkConfig, on export, when fileType is PARQUET, the DMS creates a folder named /customers.parquet in the specified folder path and exports the file into that folder. As a result, the complete path for the exported file in this example is data/customer/customers.parquet/customers.parquet. |
fileType | String. Required. The file type: PARQUET (case sensitive) for Parquet files or CSV for comma-separated values files. |
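The examples in this section authenticate with an account key. As a sketch of the service principal alternative, the helper below prints a sourceConfig that uses the clientId, clientSecret, and tenantId keys from the table instead. The function name and the AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, and AZURE_TENANT_ID variable names are hypothetical, and omitting accountKey assumes that only one authentication method is supplied at a time.

```shell
# Hypothetical helper: print an ADLS2 sourceConfig that authenticates with a
# service principal. The AZURE_* variables are assumed names chosen for this
# sketch, not something DMS itself reads.
#   $1 account name, $2 container name, $3 path prefix
adls2_sp_source_config() {
  cat <<EOF
"sourceType": "ADLS2.SOURCE.FILE",
"sourceConfig": {
  "accountName": "$1",
  "clientId": "${AZURE_CLIENT_ID}",
  "clientSecret": "${AZURE_CLIENT_SECRET}",
  "tenantId": "${AZURE_TENANT_ID}",
  "containerName": "$2",
  "pathPrefix": "$3",
  "fileType": "PARQUET"
}
EOF
}

# Usage, after exporting the three AZURE_* variables:
#   adls2_sp_source_config my_adls_storage inbound data/customer/customers.parquet
```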
GCS: sourceConfig and sinkConfig
To import data files from Google Cloud Storage (GCS) into Tamr Core, you set sourceType to GCS.SOURCE.FILE and sinkType to TAMR.SINK. An example of the sourceConfig syntax for GCS follows.
"sourceType": "GCS.SOURCE.FILE",
"sourceConfig": {
"projectId": "tamr-gce-dev",
"bucket": "tamr-datasets",
"pathPrefix": "data/customer/parquet",
"fileType": "PARQUET"
}
To export a dataset from Tamr Core to GCS, you set sourceType to TAMR.SOURCE and sinkType to GCS.SINK.FILE.
The same key-value pairs apply to both sourceConfig and sinkConfig for GCS.
Name | Description |
---|---|
projectId | String. Required. For example, tamr-gce-dev. |
bucket | String. Required. The bucket name in GCS. For example, tamr-datasets. |
pathPrefix | String. Required. The path prefix folder structure in GCS. For example, data/customer/parquet. In sourceConfig, on import, the DMS uploads the directory structure and all files in the specified location recursively. All data files must have the same schema. In sinkConfig, on export, when fileType is PARQUET, the DMS creates a folder named /customers.parquet in the specified folder path and exports the file into that folder. As a result, the complete path for the exported file in this example is data/customer/parquet/customers.parquet. |
fileType | String. Required. The file type: PARQUET (case sensitive) for Parquet files or CSV for comma-separated values files. |
sinkType
Specifies the destination connection type and has a value of TAMR.SINK, S3.SINK.FILE, GCS.SINK.FILE, or ADLS2.SINK.FILE.
sinkConfig
The sinkConfig parameter includes different key-value pairs based on the sinkType you supply. See:
- sinkConfig: Tamr
- S3: sourceConfig and sinkConfig
- GCS: sourceConfig and sinkConfig
- ADLS2: sourceConfig and sinkConfig
sinkConfig: Tamr
To import data files into Tamr Core from a cloud storage location, you set the sinkType to TAMR.SINK and use the following syntax for sinkConfig.
"sinkType": "TAMR.SINK",
"sinkConfig": {
"appendToDataset": true,
"primaryKeys": [
"string"
],
"threads": 8,
"username": "string",
"password": "string",
"hostname": "string",
"port": integer,
"datasetName": "string",
"profile": boolean
}
The following table describes the key-value pairs for sinkConfig when sinkType is TAMR.SINK.
Name | Description |
---|---|
appendToDataset | Boolean. The default is false. Specifies whether to append the data to an existing input dataset in Tamr Core. If the dataset exists and appendToDataset is false, Tamr Core truncates (empties) the dataset and creates the new schema for the input dataset from the loaded data. If the dataset exists and appendToDataset is true, Tamr Core appends data to it and does not truncate (empty) the dataset of existing data. If a record already exists (based on the primary key), the existing record is updated. If a record does not already exist (based on the primary key), it is added as a new record. If a dataset with the supplied datasetName does not exist, Tamr Core creates it. See Import Examples. |
primaryKeys | Array. Optional. The primary key for Tamr Core to use for its input dataset. For example, pk1. Currently, only one primary key is supported. If you specify more than one primary key in this array, Tamr Core uses the first item listed as the primary key. If primaryKeys is null or undefined, Tamr Core creates a primary key attribute and populates it with a random GUID. |
threads | Integer. The number of threads to use on import. The maximum and default is 8. For example, 4. The ingest job can run in parallel and complete faster when you specify a number of threads greater than 1. The number of threads must also be equal to or less than the number of threads assigned to Tamr Core by the TAMR_API_FACADE_MAX_CONCURRENT_REQUESTS configuration variable. Depending on the compute resources available on the DMS and Tamr Core instances, increasing the number of threads can cause performance to degrade. In particular, using more threads than the number of CPUs available to the machine can lead to degraded performance. |
username | String. Optional. The username of the admin user in Tamr Core. |
password | String. Optional. The password for the admin user in Tamr Core. |
hostname | String. Optional. The Tamr Core hostname. For example, 10.1.0.1. |
port | Integer. Optional. The port on which Tamr Core is running. For example, 9100. |
datasetName | String. Required. The name of a new or existing Tamr Core source dataset. Tamr Core creates the dataset if it doesn’t exist. If the dataset exists, this is the dataset to which the loaded data will be appended based on the appendToDataset value. |
profile | Boolean. The default is false. Whether to start a profiling job for the dataset after it is imported. For more information, see Profiling a Dataset. |
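The limits described for threads (at most 8, no more than the value of TAMR_API_FACADE_MAX_CONCURRENT_REQUESTS, and preferably no more than the available CPUs) can be combined into one calculation. The helper below is a sketch with a hypothetical name; the caller supplies the CPU count and the configured limit, because how those values are looked up depends on the deployment.

```shell
# Hypothetical helper: choose a threads value for sinkConfig.
#   $1 requested threads
#   $2 CPUs available to the machine
#   $3 value of TAMR_API_FACADE_MAX_CONCURRENT_REQUESTS
clamp_threads() {
  threads="$1"
  # Apply each upper bound in turn: the DMS maximum of 8, then the two inputs.
  for limit in 8 "$2" "$3"; do
    if [ "$threads" -gt "$limit" ]; then
      threads="$limit"
    fi
  done
  # Never go below a single thread.
  if [ "$threads" -lt 1 ]; then
    threads=1
  fi
  printf '%s\n' "$threads"
}

clamp_threads 16 4 8   # prints 4
```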
renameAttributes
Object. Optional. This parameter indicates attribute (column) names in the source dataset to rename in the export files. The key is the name in the source and the value is the name in the sink attribute. This parameter is useful when a Tamr Core attribute name includes a space, since column names that have spaces can cause errors in Parquet files.
In the following example, "attr_000_rec2_id" is renamed "rec_id" and " cluster_id" is renamed "cluster_id".
"renameAttributes": {
"attr_000_rec2_id": "rec_id",
" cluster_id": "cluster_id"
}
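When many attribute names contain spaces, the entries can be generated rather than typed. The following sketch uses a hypothetical helper, rename_entry, with one possible convention (trim outer spaces, replace inner spaces with underscores); "first name" is an illustrative attribute name, and names containing quotes or backslashes are not handled.

```shell
# Hypothetical helper: print one renameAttributes entry for an attribute name,
# trimming leading and trailing spaces and replacing inner spaces with "_".
rename_entry() {
  clean="$(printf '%s' "$1" | sed 's/^ *//; s/ *$//' | tr ' ' '_')"
  printf '"%s": "%s"\n' "$1" "$clean"
}

rename_entry " cluster_id"   # prints " cluster_id": "cluster_id"
rename_entry "first name"    # prints "first name": "first_name"
```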
inheritSchema
Boolean. Defaults to true.
To make recommendations for schema mapping, Tamr Core requires all attributes to have a data type of either string or string array, and the primary key must have a data type of string. When you use DMS to import a file in Parquet format, you can use this parameter to override the schema supplied in the file with string or string[]. To override the schema specified in a source file, include inheritSchema: false in the top level configuration.
Note: For complex Parquet files, set the inheritSchema option to false to convert all primitive types to string. See Data Movement Service for more information about complex Parquet files.
"inheritSchema": false
Example DMS POST Requests
The following examples of POST requests to the /data/transfer endpoint of the Data Movement Service illustrate how to fill in values for specific parameters.
Example Request: AWS S3 to Tamr Core
curl --silent --location --request POST 'http://10.1.0.1:9100/api/dms/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
"sourceType": "S3.SOURCE.FILE",
"sourceConfig": {
"bucket": "my-s3-bucket-parquet",
"pathPrefix": "some_folder/nested",
"region": "us-east-1",
"fileType": "PARQUET"
},
"sinkType": "TAMR.SINK",
"sinkConfig": {
"appendToDataset": false,
"primaryKeys": [
"pk"
],
"threads": 4,
"username": "admin_user",
"password": "a_password",
"hostname": "10.1.0.1",
"port": 9100,
"datasetName": "YOUR_DATASET",
"profile": true
}
}'
Example Request: MS Azure ADLS Gen 2 to Tamr Core
curl --silent --location --request POST 'http://10.1.0.1:9100/api/dms/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
"sourceType": "ADLS2.SOURCE.FILE",
"sourceConfig": {
"accountName": "accountName",
"accountKey": "accountKey",
"containerName": "inbound",
"pathPrefix": "customers.parquet",
"fileType": "PARQUET"
},
"sinkType": "TAMR.SINK",
"sinkConfig": {
"appendToDataset": true,
"primaryKeys": [
"tamr_id"
],
"threads": 4,
"username": "admin_user",
"password": "a_password",
"hostname": "10.1.0.1",
"port": 9100,
"datasetName": "customers",
"profile": false
}
}'
Example Request: Tamr Core to MS Azure ADLS Gen 2
curl --silent --location --request POST 'http://localhost:9100/api/dms/data/transfer' \
--header 'Content-Type: application/json' \
--user '<admin_user>':'<admin_password>' \
--data-raw '{
"sourceType": "TAMR.SOURCE",
"sourceConfig": {
"username": "admin_user",
"password": "a_password",
"hostname": "10.1.0.1",
"port": 9100,
"datasetName": "customers"
},
"sinkType": "ADLS2.SINK.FILE",
"sinkConfig": {
"accountName": "",
"accountKey": "",
"containerName": "export",
"pathPrefix": "customers.parquet",
"fileType": "PARQUET"
}
}'
Import Examples
When you use DMS to import source datasets into Tamr Core over time, the source datasets may undergo schema changes such as more or fewer columns or a changed column order. The value of the appendToDataset key determines the effect of schema changes on the Tamr Core dataset.
- When appendToDataset is set to false, the Tamr Core dataset schema and all records are replaced by the source dataset.
- When appendToDataset is set to true, the Tamr Core dataset schema and all records are retained. Changed records are updated, and new records are added.
  - If the source dataset has fewer columns than the Tamr Core dataset, new records are added to the Tamr Core dataset with null values in the additional columns.
  - If the source dataset has more columns than the Tamr Core dataset, new records are added with values for the existing attributes only. Values in the extra columns are not added.
    Tip: In this case, before you import you can use the Tamr Core API to add an attribute to the input dataset for each new column in the source dataset.
  - If the columns in the source dataset are in a different order than the attributes in the Tamr Core dataset, new records are added to the Tamr Core dataset with the values in the order defined by the Tamr Core dataset.
DMS GET Requests for Tamr Core Jobs
Submitting a GET request to the /data/transfer endpoint of the Data Movement Service returns the job ID and the job type, where:
- The job ID is a unique job identifier, such as bc2ae0e1-ce34-49c0-91e3-5765e06594e4. Use the job ID to check the job’s status and progress.
  Note: The job ID is a GUID created by DMS and uses a different format than the numeric job ID created by Tamr Core.
- The job type is TRANSFER_DATA.
- For successfully completed DMS jobs, the status is completed, instead of succeeded, which is reported for other Tamr Core jobs.
Note: DMS jobs are not persisted; upon restart, previous and in progress DMS jobs are no longer listed.
To obtain a listing of all jobs, submit a GET 'http://<ip>:<port>/api/dms/xdata/jobs' request to the jobs endpoint of the Data Movement Service, as follows:
curl --user '<tamr_user>':'<tamr_user_password>' --request GET 'http://localhost:9100/api/dms/xdata/jobs'
To obtain information about a specific job, submit a request to the job endpoint and specify the job ID and type in the format GET 'http://<ip>:<port>/api/dms/xdata/job?id=<value>&type=TRANSFER_DATA', as in this example:
curl --user '<tamr_user>':'<tamr_user_password>' --request GET 'http://localhost:9100/api/dms/xdata/job?id=bc2ae0e1-ce34-49c0-91e3-5765e06594e4&type=TRANSFER_DATA'
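Because transfers run asynchronously, a caller usually repeats the job request until the job finishes. The loop below is a sketch under stated assumptions: the function names and the TAMR_USER and TAMR_PASSWORD variables are hypothetical, and because the response body is not shown in this section, the check is a loose text match on the documented status value "completed". Other statuses are not interpreted, so the loop simply gives up after a fixed number of attempts.

```shell
# Fetch the status of one job. TAMR_USER and TAMR_PASSWORD are hypothetical
# variable names.
fetch_job() {
  curl --silent --user "${TAMR_USER}:${TAMR_PASSWORD}" --request GET \
    "http://localhost:9100/api/dms/xdata/job?id=$1&type=TRANSFER_DATA"
}

# Poll until the response contains "completed" (assumed to appear as a quoted
# value in the response) or the attempts run out.
#   $1 job ID, $2 maximum attempts (default 30), $3 seconds between attempts (default 5)
wait_for_job() {
  job_id="$1"
  max_attempts="${2:-30}"
  delay="${3:-5}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    response="$(fetch_job "$job_id")"
    case "$response" in
      *\"completed\"*)
        printf '%s\n' "$response"
        return 0
        ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage (requires a running DMS, so it is left commented out):
#   wait_for_job bc2ae0e1-ce34-49c0-91e3-5765e06594e4 60 10
```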