
Uploading a Dataset into a Project

Prepare and upload an input dataset into a project.

Note: Team members with the admin role control access to datasets and projects. If you cannot upload a dataset into a project, verify your permission settings with an admin.

In a schema mapping, mastering, or categorization project, you begin by specifying one or more input datasets. You can choose any previously-uploaded dataset. See Adding a Dataset to a Project.

You can also upload additional datasets into a project from one or more sources. The options for uploading a dataset are:

  • Select a delimiter-separated values file on your local filesystem.
  • Use the Data Movement Service (DMS) to upload data in CSV or Parquet format from cloud storage (if configured for your installation).

Before you upload a dataset using either of these options, prepare the dataset as described in the following section.

Preparing a Dataset

Before you upload a dataset, verify that the column names and primary key in the dataset meet the following requirements:

  • Column names cannot contain the following special characters: . (period), # (hash), \ (backslash), / (slash).
    Note: As a result, column names cannot contain URLs.
  • Column names cannot contain leading or trailing spaces.
  • Column names cannot be empty.
  • Column names must be unique: they cannot match (case-insensitively) the names of any attributes that already exist in the unified dataset of your project. See system-generated attributes for reserved attribute names.
  • Primary keys must be single attribute keys. Tamr does not support composite keys.
    Note: The options for uploading datasets offer different procedures for the primary key. When you upload a local file, for example, Tamr Core can create an attribute for a primary key and populate it with unique identifiers. When you upload from a connected external source, however, you must specify the dataset column that contains the primary key.

Important: When you use DMS to upload CSV or Parquet files, all files must have a file extension. For complex Parquet files with lists, maps, or structs, see Data Movement Service to learn more about uploading these files.
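
If you prepare files with Python, a quick pre-flight check against the requirements above can catch most of these problems before upload. The following is a minimal pandas sketch; the file path, primary key column, and existing-attribute names are placeholders to adjust for your project:

  import re
  import pandas as pd

  # Placeholder path; replace with your prepared file.
  df = pd.read_csv("my_dataset.csv")

  # Special characters that are not allowed in column names.
  FORBIDDEN = re.compile(r"[.#\\/]")

  # Hypothetical example: attribute names already present in your project's
  # unified dataset (see system-generated attributes for reserved names).
  existing_attributes = {"origin_source_name", "origin_entity_id"}

  problems = []
  for col in df.columns:
      if FORBIDDEN.search(col):
          problems.append(f"{col!r}: contains . # \\ or /")
      if col != col.strip() or not col.strip():
          problems.append(f"{col!r}: empty or has leading/trailing spaces")
      if col.lower() in {a.lower() for a in existing_attributes}:
          problems.append(f"{col!r}: clashes with an existing unified attribute")

  # The primary key must be a single attribute with unique, non-empty values.
  pk = "id"  # placeholder: your primary key column
  if df[pk].isna().any() or df[pk].duplicated().any():
      problems.append(f"{pk!r}: not a valid single-attribute primary key")

  print("\n".join(problems) or "Dataset passes the basic checks")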

Troubleshooting Tips

The following reminders can help you find and resolve problems in delimiter-separated data files before you upload:

  • All commas (or other delimiter characters) that do not represent a column separator must be properly escaped.
  • If you use the Python pandas to_csv() function, the DataFrame index is written to the CSV file as an unnamed column by default. To avoid this problem, pass index=False to to_csv(), as shown in the example after this list.
  • Columns produced by a JOIN command can include a . (period) character in the column name by default. Be sure to rename such joined columns in the dataset before uploading.
  • Note: If you upload a dataset with the incorrect primary key, do not change the primary key column. Instead, in the Dataset Catalog, delete the dataset and add it back with the correct column identified as the primary key.
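
As an illustration of the to_csv() and JOIN tips above, this pandas sketch (with hypothetical column and file names) renames a joined column that contains a period and passes index=False so that no unnamed index column is written:

  import pandas as pd

  # Hypothetical joined result whose column name contains a period.
  df = pd.DataFrame({"orders.customer_id": [1, 2], "amount": [10.0, 20.0]})

  # Rename columns that contain a period before uploading.
  df = df.rename(columns={"orders.customer_id": "customer_id"})

  # index=False prevents pandas from writing an unnamed index column.
  df.to_csv("orders_prepared.csv", index=False)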

Upload a Local File

Tamr supports upload of local data files with the following characteristics:

  • Primary Key: (Optional) Tamr Core can create an attribute for a primary key and populate it with unique identifiers.
  • Format: Delimiter-separated values file. The default delimiter is a comma (that is, a CSV file); the default quote, escape, and comment characters are ", ", and #, respectively. You can specify different character values during the upload process.
  • Encoding: UTF-8 or UTF-8 with BOM.
  • Header: The file must contain a header row.
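
For reference, the sketch below writes a file that matches the default characteristics above: comma delimiter, double-quote quoting, UTF-8 encoding, and a header row. The column names and output path are placeholders:

  import csv
  import pandas as pd

  df = pd.DataFrame({"id": [1, 2], "name": ['Acme, Inc.', 'Bolt "B" Ltd.']})

  # Comma delimiter, double-quote quoting, UTF-8, header row included.
  df.to_csv(
      "suppliers.csv",
      sep=",",
      quotechar='"',
      quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
      encoding="utf-8",
      index=False,
  )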

To upload a local dataset:

Before you begin, verify that the dataset is prepared for upload.

  1. Open a schema mapping, mastering, or categorization project and select Datasets.
  2. Choose Edit datasets. A dialog box opens.
  3. Select Upload File > Choose File.
  4. Use the file finder to choose your data file. Tamr Core uses the default character settings to show a preview of the file.
  5. To specify a different delimiter, quote, escape, or comment character, choose Show advanced CSV options and select the characters for your file. The preview updates with your choices.
  6. (Optional) Provide a Description for this dataset.
  7. Use the Primary Key column dropdown to identify the file's primary key. If the file does not contain a primary key, select No Primary Key and provide a unique identifying name for it. Tamr Core creates this column in the input dataset and populates it with row numbers automatically.
    For more information about how primary keys are used in Tamr Core, see Managing Primary Keys (for transformations) and Modify a dataset's records (for API updates).
  8. Beginning in v2022.005, datasets are not profiled automatically upon upload; if you need profiling, you must run it manually. As a result, the record count in the Tamr Core UI may be incorrect until you re-profile the dataset and refresh. You can update your scripts to call the profile API after running an ingest job in Core Connect (an illustrative script follows these steps). See v2022.005.0 Upgrade Considerations.
  9. Select Save Datasets.
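
If you automate uploads, you can trigger profiling from a script once the ingest job finishes. The sketch below is illustrative only: the hostname, credentials, authentication scheme, dataset name, and profile endpoint path are all assumptions, so confirm the exact call in the API Reference for your version.

  import requests

  # All values below are assumptions for illustration; substitute your own
  # hostname, credentials, and dataset name, and use the profile endpoint
  # documented in your version's API Reference.
  TAMR_URL = "https://tamr.example.com"
  AUTH = ("my-user", "my-password")  # hypothetical; your auth scheme may differ
  dataset = "suppliers.csv"

  # Hypothetical profile endpoint; check the API Reference for the real path.
  resp = requests.post(f"{TAMR_URL}/api/versioned/v1/datasets/{dataset}:profile", auth=AUTH)
  resp.raise_for_status()
  print("Profile request submitted:", resp.status_code)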

Tip: An admin may need to adjust the permissions in a policy to give all team members access to an uploaded dataset. See Managing User Accounts and Access.

Upload with the DMS

Tamr supports upload of CSV and Parquet files from cloud storage. With the Data Movement Service (DMS), you can:

  • Select multiple datasets with identical schemas for upload into a single input dataset.
  • Append an uploaded dataset to an existing dataset.

Important: CSV and Parquet files must have a file extension to be uploaded with DMS.

Tamr Core also offers an API to upload these files. See Using the DMS API.

Tamr supports upload of data files with the following characteristics:

  • Primary Key: (Optional) Tamr Core can create an attribute for a primary key and populate it with unique identifiers.
  • Format: Parquet or comma-separated values (CSV) files.

    Important: All files selected for import must have identical schemas.

  • Encoding: UTF-8 or UTF-8 with BOM.
  • Header: The file must contain a header row.
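
Because every file selected for a single import must have an identical schema, it can help to compare the files before you point DMS at their location. A minimal sketch, assuming the files sit in a local exports/ folder, using pyarrow for Parquet schemas and the standard csv module for CSV headers:

  import csv
  from pathlib import Path

  import pyarrow.parquet as pq

  # Placeholder folder; point this at your exported files.
  parquet_files = sorted(Path("exports").glob("*.parquet"))
  if parquet_files:
      reference = pq.read_schema(parquet_files[0])
      for f in parquet_files[1:]:
          if not pq.read_schema(f).equals(reference):
              print(f"Schema mismatch: {f}")

  # For CSV files, comparing header rows is usually enough.
  headers = set()
  for f in sorted(Path("exports").glob("*.csv")):
      with open(f, newline="", encoding="utf-8") as fh:
          headers.add(tuple(next(csv.reader(fh))))
  if len(headers) > 1:
      print("CSV files do not share an identical header row")

If either check reports a mismatch, fix the source files or split them into separate uploads before continuing.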

To upload a dataset from cloud storage:

Before you begin, verify that the dataset is prepared for upload.

  1. Open a schema mapping, mastering, or categorization project and select Datasets > Edit datasets. A dialog box opens.
  2. Select Connect to (cloud storage provider).
  3. Select the type of files to import: CSV or Parquet.
  4. Specify the location of the files.
  • ADLS2: Account Name, Container, Path
  • AWS S3: Region, Bucket, Path
  • GCS: Project, Bucket, Path
    To search for files, you can supply values for the first two fields and then select Apply.
    Tip: To reduce the time a search takes, provide as much of the path as possible.
    If you change the file type or the location, select Apply again to refresh the file picker.
  5. To import all files in a folder, select the folder. You can also select a single file or use Ctrl+click to select multiple files.
    Note: All files selected for upload must have identical schemas.
  6. Enter a Name for the dataset and optionally provide a Description for this dataset.
    To append the file to an existing input dataset, provide the name of that input dataset.
  7. Use the Primary key dropdown to specify the column with the file's primary key. If the file does not contain a primary key, select No Primary Key.
    Tamr Core creates this column and automatically populates it with random GUIDs. If you prefer to supply the key yourself, see the sketch after these steps.
    For more information about how primary keys are used, see Managing Primary Keys (for transformations) and Modify a dataset's records (for API updates).
  8. Select Show advanced options to verify or change these options:
  • Number of threads defines the number of files to load in parallel.
    Note: As of version 2021.021.0, Tamr supports a maximum of 4 threads.
  • Beginning in v2022.005, datasets are not profiled automatically upon upload; if you need profiling, you must run it manually. As a result, the record count in the Tamr Core UI may be incorrect until you re-profile the dataset and refresh. You can update your scripts to call the profile API after running an ingest job in Core Connect. See v2022.005.0 Upgrade Considerations.
  • Append Data indicates whether you want to add the uploaded file to a dataset that already exists. The data is appended to the dataset identified in the Name field.
  9. Add or save the dataset.
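
If you prefer to supply your own primary key rather than selecting No Primary Key, one option is to add a GUID column before upload. A minimal sketch, assuming a pandas DataFrame and a hypothetical record_id column name:

  import uuid
  import pandas as pd

  df = pd.read_csv("suppliers.csv")  # placeholder path

  # Hypothetical key column: one random GUID per row.
  df["record_id"] = [str(uuid.uuid4()) for _ in range(len(df))]

  df.to_csv("suppliers_with_key.csv", index=False)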