Tamr Documentation

Uploading a Dataset into a Project

Prepare and upload an input dataset into a Tamr project.

Note: Team members with the admin role control access to datasets and projects in Tamr. If you cannot upload a dataset into a project, verify your permission settings with an admin.

In a Tamr schema mapping, mastering, or categorization project, you begin by specifying one or more input datasets. You can choose any previously-uploaded dataset. See Adding a Dataset to a Project.

You can also upload additional datasets into a Tamr project from one or more sources. The options for uploading a dataset are:

  • Select a delimiter-separated values file on your local filesystem.
  • Connect to an external source to upload an Avro, Parquet, or delimiter-separated values file (if configured for your installation).
  • Use the Data Movement Service (DMS) to upload data in CSV or Parquet format from cloud storage (if configured for your installation).

Before you upload a dataset with any of these options, prepare the dataset as described below.

Preparing a Dataset

Before you upload a dataset into Tamr, verify that the column names and primary key in the dataset meet the following requirements:

  • Column names cannot contain the following special characters: . (period), # (hash), \ (backslash), / (slash).
    Note that as a result, column names cannot contain URLs.
  • Column names must be unique: they cannot match (case-insensitively) the names of any attributes that already exist in the unified dataset of your project. See Tamr-Generated Data Attributes for reserved attribute names.
  • Column names cannot be empty, and cannot contain leading or trailing whitespace.
  • Primary keys must be single-attribute keys. Tamr does not support composite keys.
    Note that the upload options handle the primary key differently. When you upload a local file, for example, Tamr can create a primary key attribute and populate it with unique identifiers. When you upload from a connected external source, however, you must specify the dataset column that contains the primary key.
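If your source data is already in (or can be read into) a pandas DataFrame, a pre-upload check along the lines of the sketch below can catch most of these issues before an upload fails. The file name, column names, and primary key are placeholders, and the check cannot verify names against attributes that already exist in your unified dataset.

```python
import pandas as pd

def tamr_upload_problems(df: pd.DataFrame, primary_key: str) -> list:
    """Return a list of issues that would block uploading df into Tamr."""
    problems = []
    forbidden = set(".#\\/")  # . # \ / are not allowed in column names

    for name in map(str, df.columns):
        if not name or name != name.strip():
            problems.append(f"column {name!r} is empty or has leading/trailing whitespace")
        if forbidden & set(name):
            problems.append(f"column {name!r} contains . # \\ or /")

    # Column names must be unique when compared case-insensitively.
    lowered = [str(c).lower() for c in df.columns]
    if len(lowered) != len(set(lowered)):
        problems.append("column names are not unique (case-insensitive)")

    # Tamr needs a single-attribute primary key with non-null, unique values.
    # (Conflicts with your project's existing unified attributes must be checked in Tamr.)
    if primary_key not in df.columns:
        problems.append(f"primary key column {primary_key!r} is missing")
    elif df[primary_key].isna().any() or df[primary_key].duplicated().any():
        problems.append(f"primary key column {primary_key!r} has nulls or duplicates")

    return problems


for issue in tamr_upload_problems(pd.read_csv("suppliers.csv"), primary_key="record_id"):
    print(issue)
```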

Important: To upload CSV or Parquet files with DMS, every file must have a file extension. For complex Parquet files that contain lists, maps, or structs, see Data Movement Service to learn more about uploading these files.

Troubleshooting Tips

These reminders can help you find and resolve problems in delimiter-separated data files before you upload.

  • All commas (or other delimiter characters) that do not represent a column separator must be properly escaped.
  • If you use the pandas to_csv() function, an unnamed index column is added to the CSV file by default. To avoid this problem, pass index=False to to_csv(), as shown in the sketch after this list.
  • Columns produced by a JOIN command can include a . (period) character in the column name by default. Be sure to rename such joined columns in the dataset before uploading.
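As an example, assuming the data is exported from pandas (the column and file names here are illustrative), both of these issues can be handled at export time:

```python
import pandas as pd

# A small example frame whose column name contains a "." (as joins often produce).
df = pd.DataFrame({"orders.customer_id": [1, 2], "note": ["includes, a comma", "plain"]})

# Rename columns so they contain no "." characters.
df = df.rename(columns=lambda name: name.replace(".", "_"))

# index=False keeps pandas from adding an unnamed index column, and to_csv
# quotes any value that contains the delimiter (such as the embedded comma above).
df.to_csv("orders_clean.csv", index=False, encoding="utf-8")
```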

Upload a Local File

Tamr supports upload of local data files with the following characteristics:

  • Primary Key: Optional. Tamr can create an attribute for a primary key and populate it with unique identifiers.
  • Format: Delimiter-separated values file. The default delimiter is a comma (that is, a CSV file); the default quote, escape, and comment characters are ", ", and # respectively. You can specify different character values during the upload process.
  • Encoding: UTF-8 or UTF-8 with BOM.
  • Header: The file must contain a header row.
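As a sketch of producing such a file from pandas (the file name, columns, and delimiter choice are illustrative), the following writes a UTF-8, pipe-delimited file with a header row; during upload you would then choose | under Show advanced CSV options:

```python
import pandas as pd

df = pd.DataFrame({"record_id": [1, 2], "name": ["Acme, Inc.", "Globex"]})

# Pipe-delimited, UTF-8, with a header row. The quote character (") and the
# quote-doubling escape match Tamr's defaults, so only the delimiter needs to
# be changed in the advanced CSV options during upload.
df.to_csv("suppliers.txt", sep="|", index=False, encoding="utf-8", header=True)
```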

To upload a local dataset into Tamr:

Before you begin, verify that the dataset is prepared for upload.

  1. Open a schema mapping, mastering, or categorization project and select Datasets and then Edit datasets. A dialog box opens.
  2. Select Upload File and then Choose File.
  3. Use the file finder to choose your data file. Tamr uses the default character settings to show a preview of the file.
  4. To specify a different delimiter, quote, escape, or comment character, choose Show advanced CSV options and select the characters for your file. The preview updates with your choices.
  5. Optionally, provide a Description for this dataset.
  6. Use the Primary Key column dropdown to identify the file's primary key. If the file does not contain a primary key, select No Primary Key and provide a unique identifying name for it. Tamr creates this column in the input dataset and populates it with row numbers automatically.
    For more information about how primary keys are used in Tamr, see Managing Primary Keys (for transformations) and Modify a Dataset's Records (for API updates).
  7. By default, the Profile Dataset checkbox is selected. Tamr automatically starts a profile job for the dataset after you upload it. See Profiling a Dataset.
  8. Select Save Datasets.

Tip: An admin may need to adjust the permissions in a policy to give all team members access to an uploaded dataset. See Managing User Accounts and Access.

Upload from a Connected External Source

Note: A system administrator must configure an external system (such as a Hadoop Distributed File System (HDFS) cluster) as an external storage provider.

For on-premise Tamr deployments, Tamr supports upload of multiple datasets with identical schemas from an external source into a single Tamr input dataset.

Data files must have the following characteristics:

  • Primary Key: Required.
  • Format: Avro, Parquet, or delimiter-separated values file. The default delimiter is a comma (that is, a CSV file); the default quote and escape characters are both ". You can specify different character values during the upload process.
    Important: All files selected for import must have identical schemas (see the schema check sketched below).
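Because every file you select must share one schema, a quick local check can save a failed upload. The sketch below uses pyarrow to compare Parquet file schemas; the file paths are placeholders, and a similar comparison applies to Avro or delimited files.

```python
import pyarrow.parquet as pq

paths = ["exports/customers_part0.parquet", "exports/customers_part1.parquet"]

# Read only the schema (not the data) of each file and compare it to the first.
schemas = [pq.read_schema(p) for p in paths]
mismatched = [p for p, s in zip(paths, schemas) if not s.equals(schemas[0])]

if mismatched:
    print("Schema differs from", paths[0], ":", mismatched)
else:
    print("All", len(paths), "files share an identical schema.")
```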

To upload an external dataset into Tamr:

Before you begin, verify that the dataset is prepared for upload.

  1. Open a schema mapping, mastering, or categorization project and select Datasets and then Edit datasets. A dialog box opens.
  2. Select Connect to Source. A file finder for the previously-configured external storage cluster appears.
  3. Select one or more delimiter-separated values files, or one or more .avro files, to upload as a single input dataset.
    Note: You cannot include both .avro and delimiter-separated values files in the same dataset. Instead, use this procedure multiple times to create different input datasets for the different source file types.
  4. For a delimiter-separated values file, specify a different delimiter, quote, or escape character as needed.
  5. Enter a Name and an optional Description.
  6. In the ID Column text box, identify the column that contains the primary key of the input dataset. This field is required. Tamr does not generate primary keys for datasets uploaded from an external source.
  7. Select Add Dataset.

Tip: An admin may need to adjust the permissions in a policy to give all team members access to an uploaded dataset. See Managing User Accounts and Access.

Upload with the DMS

Tamr supports upload of CSV and Parquet files from cloud storage. With the Tamr Data Movement Service (DMS), you can:

  • Select multiple datasets with identical schemas for upload into a single Tamr input dataset.
  • Append an uploaded dataset to an existing dataset in Tamr.

Important: CSV and Parquet files must have a file extension to be uploaded with DMS.

Tamr also offers an API to upload these files. See Using the DMS API.

Tamr supports upload of data files with the following characteristics:

  • Primary Key: Optional. Tamr can create an attribute for a primary key and populate it with unique identifiers.
  • Format: Parquet or comma-separated values (CSV) files.
    Important: All files selected for import must have identical schemas.
  • Encoding: UTF-8 or UTF-8 with BOM.
  • Header: The file must contain a header row.

To upload a dataset into Tamr from cloud storage:

Before you begin, verify that the dataset is prepared for upload.

  1. Open a schema mapping, mastering, or categorization project and select Datasets and then Edit datasets. A dialog box opens.
  2. Select Connect to (cloud storage provider).
  3. Select the type of files to import: CSV or Parquet.
  4. Specify the location of the files.
  • ADLS2: Account Name, Container, Path
  • AWS S3: Region, Bucket, Path
  • GCS: Project, Bucket, Path
    To search for files, you can supply values for the first two fields and then select Apply.
    Tip: To reduce the time a search takes, provide as much of the path as possible.
    If you change the file type or the location, select Apply again to refresh the file picker.
  5. To import all files in a folder, select the folder. You can also select a single file or use Ctrl+click to select multiple files.
    Note: All files selected for upload must have identical schemas.
  6. Enter a Name for the Tamr dataset and optionally provide a Description for this dataset.
    To append the file to an existing input dataset in Tamr, provide the name of that input dataset.
  7. Use the Primary key dropdown to specify the column with the file's primary key. If the file does not contain a primary key, select No Primary Key.
    Tamr creates this column and automatically populates it with random GUIDs.
    For more information about how primary keys are used in Tamr, see Managing Primary Keys (for transformations) and Modify a Dataset's Records (for API updates).
  8. Select Show advanced options to verify or change these options:
  • Number of threads defines the number of files to load in parallel.
  • Profile Dataset indicates whether you want Tamr to start the profile job automatically after upload. See Profiling a Dataset.
  • Append Data indicates whether you want to add the uploaded file to a dataset that already exists in Tamr. The data is appended to the dataset identified in the Name field.
  9. Add or save the dataset.

Tip: An admin may need to adjust the permissions in a policy to give all team members access to an uploaded dataset. See Managing User Accounts and Access.
