Uploading a Dataset into a Project
Prepare and upload an input dataset into a project.
Note: If you cannot upload a dataset into a project, verify your permission settings with an admin.
In a schema mapping, mastering, or categorization project, you begin by specifying one or more input datasets. You can choose any previously uploaded dataset on the Select from Dataset Catalog tab. See Adding a Dataset to a Project.
You can also upload additional datasets into a project from one or more sources. You can:
- Select a delimiter-separated values file on your local filesystem.
- Upload from S3, GCP, or ADLS2 cloud storage (comma- or tab-separated values files or Avro files). Requires Core Connect configuration. See Uploading from Cloud Storage.
- Upload using JDBC connections, including relational datasets, Parquet files, and more. See Uploading with a JDBC Driver.
Tip: When uploading a dataset, an author or admin selects the policies that should include the dataset as a resource. After a curator uploads a dataset, an author or admin might need to adjust policy resources to give all team members access. See Managing User Accounts and Access.
Before you upload a dataset, prepare it so that its column names and primary key meet the requirements that follow.
Preparing a Dataset
Checking Column Names
Before you upload a dataset, verify that the column names in the dataset meet the following requirements:
- Column names cannot contain the following special characters: . (period), # (hash), \ (backslash), / (slash).
- Column names cannot contain leading or trailing spaces.
- Column names cannot be empty.
- Column names must be unique within a dataset.
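The column name requirements above can be checked before upload with a short script. The following is a minimal sketch, not part of Tamr Core: the function name find_column_name_problems and the wording of its messages are illustrative assumptions, and it checks only the four rules listed above.

```python
# Characters that the requirements above forbid in column names.
FORBIDDEN_CHARS = set(".#\\/")


def find_column_name_problems(columns):
    """Return a list of problems found in a list of column names.

    Hypothetical helper for illustration; it is not a Tamr Core API.
    """
    problems = []
    seen = set()
    for position, name in enumerate(columns, start=1):
        if name == "":
            problems.append(f"column {position}: name is empty")
            continue
        if name != name.strip(" "):
            problems.append(f"column {position} ({name!r}): leading or trailing space")
        bad = sorted(FORBIDDEN_CHARS.intersection(name))
        if bad:
            problems.append(f"column {position} ({name!r}): forbidden characters {bad}")
        if name in seen:
            problems.append(f"column {position} ({name!r}): duplicate name")
        seen.add(name)
    return problems


# Example: one name per rule violation, plus one valid name.
for problem in find_column_name_problems(["id", "a.b", " x", "", "id"]):
    print(problem)
```

An empty result means the header passes these four checks. It does not guarantee that the upload succeeds.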
Identifying Primary Keys
Before you upload a dataset, verify that the primary key in the dataset meets the following requirements:
- Primary keys must be single-attribute keys. Tamr does not support composite keys.
- You can either specify the dataset column that contains the primary key or indicate that the dataset has no primary key. If you indicate that there is no primary key, Tamr Core adds a column and supplies row numbers.
For more information about how Tamr Core uses primary keys, see Managing Primary Keys (for transformations), Modify a dataset's records (for API updates), and How to Obtain Primary Keys (for more advanced information on primary keys).
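Before you choose a primary key column, it can help to confirm that the column behaves like a key. The sketch below assumes, as is usual for primary keys, that values should be unique and non-empty; this page does not spell out those rules. The function name check_primary_key and the sample data are hypothetical.

```python
import csv
import io


def check_primary_key(rows, key_column):
    """Report empty and duplicate values in a candidate key column.

    rows is an iterable of dicts, for example from csv.DictReader.
    Hypothetical helper for illustration; it is not a Tamr Core API.
    """
    seen = set()
    duplicates = set()
    empty_rows = []
    for row_number, row in enumerate(rows, start=1):
        value = row.get(key_column)
        if value is None or value == "":
            empty_rows.append(row_number)  # data row number, header excluded
        elif value in seen:
            duplicates.add(value)
        else:
            seen.add(value)
    return {"empty_rows": empty_rows, "duplicates": sorted(duplicates)}


# Sample data: the value 2 repeats and the fourth data row has no id.
SAMPLE = "id,name\n1,Ann\n2,Bob\n2,Cy\n,Di\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = check_primary_key(rows, "id")
print(report)
```

If no single column passes this check, one option is to indicate that the dataset has no primary key and let Tamr Core supply row numbers.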
Notes:
- When you use Core Connect to upload Avro files, all files must have a file extension.
- For complex Parquet files, see JDBC Driver for Parquet in the CData documentation to learn more about working with these files.
Troubleshooting Tips
The following reminders can help you find and resolve problems in delimiter-separated data files before you upload:
- All commas (or other delimiter characters) that do not represent a column separator must be properly escaped.
- If you use the pandas to_csv() method in Python, an unnamed index column is added to the CSV file by default. To avoid this problem, set the index parameter to False when you call to_csv().
- Columns produced by a JOIN command can include a . (period) character in the column name by default. Be sure to rename such joined columns in the dataset before uploading.
- If you upload a dataset with the incorrect primary key, do not change the primary key column. Instead, an admin should delete the dataset in the Dataset Catalog. You can then upload it again with the correct column identified as the primary key.
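Two of the reminders above, escaping delimiters inside values and renaming joined columns that contain a period, can be handled when the file is written. This is a minimal sketch using Python's standard csv module. The helper names and the choice of an underscore as the replacement character are assumptions for illustration only.

```python
import csv
import io


def rename_dotted_columns(columns, replacement="_"):
    """Replace periods in column names, e.g. 'orders.id' becomes 'orders_id'."""
    return [name.replace(".", replacement) for name in columns]


def write_delimited(columns, rows, delimiter=","):
    """Write rows as delimiter-separated text, quoting values where needed."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only values that contain special characters
        lineterminator="\n",
    )
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


# The value "Acme, Inc." contains the delimiter, so the writer quotes it.
columns = rename_dotted_columns(["orders.id", "customers.name"])
text = write_delimited(columns, [["1", "Acme, Inc."]])
print(text)
```

Renaming can create duplicates (for example, a.b and a_b both become a_b), so confirm that the renamed columns are still unique.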
Uploading a Local File
Tamr supports upload of local data files with the following characteristics:
- Primary Key: Optional. Tamr Core can create an attribute for a primary key and populate it with unique identifiers.
- Format: Delimiter-separated values file. The default delimiter is a comma (that is, a CSV file), and the default quote, escape, and comment characters are ", ", and # respectively. You can specify different character values during the upload process.
- Encoding: UTF-8 or UTF-8 with BOM.
- Header: The file must contain a header row.
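The encoding and header characteristics above can be checked locally before upload. The sketch below is an illustration rather than a description of how Tamr Core reads files: the function name read_header is hypothetical, and the check treats any file whose bytes decode as UTF-8 as acceptable.

```python
import codecs
import csv
import io


def read_header(raw_bytes, delimiter=","):
    """Decode file bytes as UTF-8 (BOM optional) and return the header row.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8 and
    ValueError if there is no header row. Hypothetical helper.
    """
    # "utf-8-sig" removes a leading BOM if present and otherwise acts like UTF-8.
    text = raw_bytes.decode("utf-8-sig")
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter, quotechar='"')
    header = next(reader, None)
    if not header:
        raise ValueError("file has no header row")
    return header


# The same header is returned with or without a BOM.
with_bom = codecs.BOM_UTF8 + b"id,name\n1,Ann\n"
without_bom = b"id,name\n1,Ann\n"
print(read_header(with_bom))
print(read_header(without_bom))
```

To check a file on disk, read it in binary mode and pass the bytes to the function. The script cannot tell whether the first row is a header or data, so confirm the column names it prints.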
To upload a local dataset:
Before you begin, verify that the dataset is prepared for upload.
- Open a schema mapping, mastering, or categorization project and select Datasets.
- Choose Edit datasets. A dialog box opens.
- Select Upload File > Choose File.
- Use the file finder to choose your data file. Tamr Core uses the default character settings to show a preview of the file.
- To specify a different delimiter, quote, escape, or comment character, choose Show advanced CSV options and select the characters for your file. The preview updates with your choices.
- (Optional) Provide a Description for this dataset.
- Use the Primary Key column dropdown to identify the file's primary key. If the file does not contain a primary key, select No Primary Key and provide a unique identifying name for it. Tamr Core creates this column in the input dataset and populates it with row numbers automatically.
- By default, the job to profile the dataset starts after you upload the dataset. Clear the Profile Dataset checkbox to profile the dataset at another time.
- (Optional) Select the Truncate Existing Dataset checkbox to truncate the existing dataset, if present, when uploading this file.
- Select Save Datasets.
Tip: After you upload a dataset, an author or admin might need to adjust policies to give team members access. See Updating Dataset Access.
Uploading from Cloud Storage
If Core Connect is configured to access a cloud storage provider, Tamr Core provides an option for uploading files in comma- or tab-separated values (CSV or TSV) or Avro format from a cloud storage provider.
Tip: You can also upload Parquet files from cloud storage via JDBC. See Uploading with a JDBC Driver.
When you use the cloud storage option, you connect to your cloud storage provider and specify the location and file or files to upload. You can select multiple datasets with identical schemas for upload into a single input dataset. All files must have a file extension to be uploaded with this option.
Tamr Core also offers an API to upload these files. See Using the Core Connect API.
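When you plan to upload several files into one input dataset, you can compare them first against the two requirements above: every file has a file extension, and all files share a schema. The sketch below is a hypothetical pre-check. It treats "identical schemas" as the same column names in the same order, which is an assumption, and it does not compare data types.

```python
import os


def check_upload_batch(headers_by_file):
    """Return problems found in a batch of files intended for one dataset.

    headers_by_file maps each file name to its list of column names.
    Hypothetical helper for illustration; it is not a Tamr Core API.
    """
    problems = []
    reference_name = None
    reference_header = None
    for file_name, header in headers_by_file.items():
        if not os.path.splitext(file_name)[1]:
            problems.append(f"{file_name}: no file extension")
        if reference_header is None:
            # The first file sets the schema that the others must match.
            reference_name, reference_header = file_name, header
        elif header != reference_header:
            problems.append(f"{file_name}: schema differs from {reference_name}")
    return problems


# The second file has no extension and a different second column.
batch = {
    "customers_2023.csv": ["id", "name"],
    "customers_2024": ["id", "full_name"],
}
for problem in check_upload_batch(batch):
    print(problem)
```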
Before You Begin:
Verify that the dataset is prepared for upload. Tamr Core supports upload of data files with the following characteristics:
- Primary Key: Optional. Tamr Core can create an attribute for a primary key and populate it with unique identifiers.
- Format: Avro, CSV, or TSV files.
- Encoding: UTF-8 or UTF-8 with BOM.
- Header: The file must contain a header row.
To upload a file from cloud storage:
- Open a schema mapping, mastering, or categorization project and select Datasets > Edit datasets. A dialog box opens.
- Select Connect to (cloud provider). The name of the cloud storage provider configured for use with Tamr Core appears: S3, Google Cloud Storage, or Azure Data Lake Storage Gen2.
- Select the type of files to import: CSV (for delimiter-separated files) or AVRO.
- Specify the location of the files. The values you specify to identify the location differ by provider.
- S3: Region, Bucket, Path
- Google Cloud Storage: Project, Bucket, Path
- Azure Data Lake Storage Gen2: Account Name, Container, Path
- To search for files, supply values for the first two fields and then select Apply. Results appear on the right side of the dialog box.
- To reduce the time your search takes, provide as much of the path as possible.
- If you change the file type or the location, Apply again to refresh the files shown on the right side of the dialog box.
- If you choose the AVRO format, only file names that include "avro" appear.
- Select a file to import. To import all files in a folder, select the folder. All files selected for upload must have identical schemas.
- Enter a Name for the Tamr Core dataset. To append the file to an existing input dataset, provide the name of that input dataset.
- Use the Primary key dropdown to specify the column with the primary key. If the file does not contain a primary key, select No Primary Key. Tamr Core creates a column and populates it with an increasing sequence.
- Select Show advanced options to verify or change the following options:
- Number of threads: The number of parallel threads to run the import job.
- Profile dataset: By default, the job to profile the dataset starts after you upload the dataset. Clear the Profile Dataset checkbox to profile the dataset at another time.
- Append Data: Specifies whether you want to add the uploaded file to a dataset that already exists. Select the checkbox to append the data to the dataset identified in the Name field.
- Select Save Datasets. The import job starts and can be monitored on the Jobs page.
Tip: After you upload a dataset, an author or admin might need to adjust policies to give team members access. See Updating Dataset Access.
Uploading with a JDBC Driver
Tamr Core supports upload of data via JDBC connections. You can upload data from SQL databases (Oracle, Hive, and so on), Parquet files in cloud storage, Salesforce, Google Big Query, and more. For a list of the JDBC drivers that Tamr Core supports, see Core Connect.
You connect with JDBC by providing a JDBC URL and query, and then specify the dataset name.
Note: This procedure is intended for use by those familiar with JDBC, SQL, and making requests with the Core Connect API. See JDBC Keys.
Before You Begin:
Verify that the dataset is prepared for upload. Tamr Core supports upload of data files with the following characteristics:
- Primary Key: Optional. Tamr Core can create an attribute for a primary key and populate it with unique identifiers.
- Format: Specific to the JDBC driver.
For example, you can upload data from Salesforce with the Salesforce driver or a Parquet file from cloud storage with the Parquet driver.
To upload using JDBC:
- Open a schema mapping, mastering, or categorization project and select Datasets > Edit datasets. A dialog box opens.
- Select Connect with JDBC.
- Supply values for the following fields. The corresponding keys for JDBC calls are provided for reference.
- JDBC Url: the jdbcUrl key in the queryConfig object of JDBC requests. Required.
For examples, see Example jdbcUrl Values for Supported Drivers.
- JDBC Query: the query key. The SQL query used to retrieve data from a JDBC source. Required.
- Primary Key: the primaryKey key. Optional. If left blank, Tamr Core adds a TAMRSEQ column to the dataset and populates it with row numbers to use as the primary key.
Note: If uploading to a pre-existing table, Tamr Core uses the existing primaryKey.
- Dataset Name: the datasetName key. Must be a unique name that does not match an existing dataset in Tamr Core. Required.
- Database Username: the dbUsername key in the queryConfig object of JDBC requests. Optional.
- Database Password: the dbPassword key in the queryConfig object of JDBC requests. Optional.
- Fetch Size: Defaults to 10000.
- By default, the job to profile the dataset starts automatically after the import job finishes. Clear the Profile dataset after import checkbox to profile the dataset at another time.
- To test your connection and preview 1,000 records, select Generate Preview.
- Select Add Dataset. The import job starts and can be monitored on the Jobs page.
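The fields in the procedure above correspond to keys in a JDBC request. The sketch below shows one way the keys named on this page might be assembled into a request body. Only the key names come from this page. Placing query, datasetName, and primaryKey at the top level, the value type of primaryKey, the function name, and the example URL and credentials are all assumptions; see the Core Connect API documentation for the actual request format and endpoint.

```python
import json


def build_jdbc_request(jdbc_url, query, dataset_name,
                       primary_key=None, db_username=None, db_password=None):
    """Assemble a request body from the keys named in the procedure above.

    Hypothetical helper: the nesting outside queryConfig is an assumption.
    """
    # jdbcUrl, dbUsername, and dbPassword are described as queryConfig keys.
    query_config = {"jdbcUrl": jdbc_url}
    if db_username is not None:
        query_config["dbUsername"] = db_username
    if db_password is not None:
        query_config["dbPassword"] = db_password

    body = {
        "queryConfig": query_config,
        "query": query,               # assumed top-level placement
        "datasetName": dataset_name,  # assumed top-level placement
    }
    if primary_key is not None:
        body["primaryKey"] = primary_key  # value type is an assumption
    return body


# Example values are placeholders, not a real database.
body = build_jdbc_request(
    jdbc_url="jdbc:postgresql://db.example.com:5432/sales",
    query="SELECT id, name FROM customers",
    dataset_name="customers_from_jdbc",
    primary_key="id",
    db_username="reader",
    db_password="example-password",
)
print(json.dumps(body, indent=2))
```

Optional keys are left out when no value is given, which mirrors leaving the matching field blank in the dialog box.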
Tip: After you upload a dataset, an author or admin might need to adjust policies to give team members access. See Updating Dataset Access.