Exporting a Dataset
Export a dataset from a project to save locally or to an external datastore.
Curators can export input datasets from the Datasets page of a project. You can:
- Download input datasets in CSV format to your local file system. See Exporting a Dataset to a Local File System.
- Export to S3, GCS, or ADLS2 cloud storage in CSV or Avro format. Requires Core Connect configuration. See Exporting to Cloud Storage.
- Use a JDBC driver to export in Parquet format or to other external storage locations such as relational databases. See Exporting with a JDBC Driver.
Additional options for exporting:
- You can also use the Tamr Core API to export datasets; a scripted sketch follows this list. To get started, see this Help Center article or Using the Core Connect API.
- Admins can export any dataset file from the Dataset Catalog, including unified datasets and internal datasets. See Exporting a Dataset from the Dataset Catalog.
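If you prefer to script exports, the sketch below shows one way to pull a dataset's records over the Tamr Core API with Python. The host, dataset ID, and credentials are placeholders, and the records endpoint, BasicCreds-style authentication, and newline-delimited JSON response are assumptions to verify against your instance's API documentation.

```python
# Sketch: stream a dataset's records over the Tamr Core versioned API.
# Assumptions to verify against your instance's API docs: the
# /api/versioned/v1/datasets/{id}/records endpoint, BasicCreds-style
# authentication, and newline-delimited JSON in the response body.
import base64
import json

import requests

TAMR_HOST = "https://tamr.example.com"  # hypothetical host
DATASET_ID = "42"                       # hypothetical dataset ID
CREDS = base64.b64encode(b"username:password").decode()

resp = requests.get(
    f"{TAMR_HOST}/api/versioned/v1/datasets/{DATASET_ID}/records",
    headers={"Authorization": f"BasicCreds {CREDS}"},
    stream=True,
)
resp.raise_for_status()

# Each non-empty line is one record serialized as JSON.
for line in resp.iter_lines():
    if line:
        print(json.loads(line))
```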
Exporting a Dataset to a Local File System
You can download datasets to your local file system in CSV format.
Export File Format for Local Downloads
Files downloaded to a local file system have the following characteristics:
- Format: Comma-separated values (.csv). The delimiter is a comma (,), and both the quote and escape characters are the double quotation mark (").
- Encoding: UTF-8.
- Header: File contains a header row.
- Multivalues: Multivalues are delimited by the pipe character (|).
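Because the escape character is a second double quotation mark, these files use standard CSV quoting, which Python's csv module handles with its defaults. A minimal sketch of reading a download, assuming a file named my_dataset.csv with a hypothetical multivalued column named tags:

```python
# Sketch: read a locally downloaded export and split a multivalue column.
# The file name and the "tags" column are hypothetical examples.
import csv

with open("my_dataset.csv", newline="", encoding="utf-8") as f:
    # The csv defaults (delimiter ",", quotechar '"', doublequote=True)
    # match the export format described above.
    for row in csv.DictReader(f):
        tags = row["tags"].split("|") if row.get("tags") else []
        print(row["tags"], "->", tags)
```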
Export a Dataset Locally
To export a dataset locally:
- Open a schema mapping, mastering, or categorization project and select the Datasets page.
- Locate the dataset and choose Export CSV for Download.
- Select Confirm to start a dataset export job. You can monitor its progress on the Jobs page.
- When the dataset export job finishes, choose Export.
- Select Download CSV. The CSV file downloads to your local file system.
Exporting to Cloud Storage
You use Core Connect to export data in CSV or Avro format to cloud storage.
Tip: Users who need access to the data files you export to cloud storage must be given access to those cloud storage locations.
Export File Formats
Files exported to cloud storage destinations have the following characteristics:
- Format: Comma-separated values (CSV) or Avro.
For exports in CSV format, the delimiter is a comma (,), and both the quote and escape characters are the double quotation mark ("). If a column name includes a space, the exported column name includes the space.
- Encoding: UTF-8.
- Header: File contains a header row.
- Multivalues: Multivalues are delimited by the pipe character (|).
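If you export Avro and copy the file down from cloud storage, a library such as fastavro can read it. A minimal sketch, assuming a local copy named my_dataset.avro (requires pip install fastavro):

```python
# Sketch: read an exported Avro file after copying it locally.
# The file name is a hypothetical example.
from fastavro import reader

with open("my_dataset.avro", "rb") as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema)  # schema embedded in the file
    for record in avro_reader:        # each record is a dict keyed by attribute
        print(record)
```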
Exporting a CSV or Avro File
To export a CSV or Avro file:
- Open a schema mapping, mastering, or categorization project and select the Datasets page.
- Locate the dataset and choose Export > Export to (provider name). The Export dialog opens.
- Select the file type for your export: CSV or Avro. Tamr Core converts its internal representation of the dataset into the specified format.
- Specify a new or existing destination path for the file.
- ADLS2: Account Name, Container, Path.
- AWS S3: Region, Bucket, Path.
- GCS: Project, Bucket, Path.
- To search for destination folders, supply values for the first two fields and then select Apply. Results appear on the right side of the dialog box.
- To reduce the time your search takes, provide as much of the path as possible.
- If you change the file type or the location, select Apply again to refresh the folders shown on the right side of the dialog box.
- At the end of the path, specify a name and the extension for the exported file.
- Select Export Dataset to export the dataset file in the specified format.
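The same export can be scripted against the Core Connect API. The sketch below is illustrative only: the endpoint path and body keys are assumptions, so confirm them against the Core Connect API reference before use.

```python
# Sketch: a cloud export request via the Core Connect API.
# The endpoint path and body keys are assumptions for illustration;
# verify both in the Core Connect API reference.
import base64

import requests

TAMR_HOST = "https://tamr.example.com"  # hypothetical host
CREDS = base64.b64encode(b"username:password").decode()

body = {
    "datasetName": "my_dataset",                     # hypothetical dataset
    "format": "CSV",                                 # or "AVRO"
    "url": "s3://my-bucket/exports/my_dataset.csv",  # destination path and file name
    "awsRegion": "us-east-1",
}

resp = requests.post(
    f"{TAMR_HOST}/api/connect/urlExport/s3",  # assumed provider-specific path
    json=body,
    headers={"Authorization": f"BasicCreds {CREDS}"},
)
resp.raise_for_status()
print(resp.json())  # job details; monitor progress on the Jobs page
```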
Exporting with a JDBC Driver
Tamr Core supports export of input datasets to external storage using JDBC. With JDBC drivers, you can export to SQL databases (Oracle, Hive, and so on) as well as to Parquet files, Salesforce, Google BigQuery, and other destinations. For a list of the JDBC drivers that Tamr Core supports, see Core Connect.
Note: This procedure is intended for use by those familiar with JDBC, SQL, and making requests with the Core Connect API. See JDBC Keys.
To export with JDBC:
- Open a schema mapping, mastering, or categorization project and select the Datasets page.
- Locate the input dataset and choose Export > Export with JDBC. The Export dialog opens.
- Supply values for the following fields. The corresponding keys for JDBC calls are provided for reference.
- JDBC Url: the jdbcUrl key in the queryConfig object of JDBC requests. Required. For examples, see Example jdbcUrl Values for Supported Drivers.
- Target Table Name: the targetTableName key. Required.
- Database Username: the dbUsername key in the queryConfig object of JDBC requests. Optional.
- Database Password: the dbPassword key in the queryConfig object of JDBC requests. Optional.
- Batch Insert Size: the batchInsertSize key. Defaults to 10000.
- Select these options if applicable:
- Create table: the createTable key. If the target table name you specified is new on the target system, check this checkbox. To append exported data to a table that already exists, clear this checkbox.
- Truncate table before load: the truncateBeforeLoad key. If the target table name already exists, check this checkbox to delete all of the rows in the target table before writing exported data into it. Clear this checkbox to append the exported data.
- Select Submit Job. The export job starts and you can monitor its progress on the Jobs page.
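The field-to-key mapping above translates directly into a Core Connect API request body. A minimal sketch, assuming a jdbcExport endpoint (the path is an assumption; the key names are the ones documented above):

```python
# Sketch: a JDBC export request built from the fields described above.
# The endpoint path is an assumption; the keys (queryConfig.jdbcUrl,
# dbUsername, dbPassword, targetTableName, batchInsertSize, createTable,
# truncateBeforeLoad) come from the field list in this section.
import base64

import requests

TAMR_HOST = "https://tamr.example.com"  # hypothetical host
CREDS = base64.b64encode(b"username:password").decode()

body = {
    "datasetName": "my_dataset",  # hypothetical input dataset
    "queryConfig": {
        "jdbcUrl": "jdbc:postgresql://db.example.com:5432/warehouse",
        "dbUsername": "exporter",
        "dbPassword": "secret",
    },
    "targetTableName": "public.my_dataset_export",
    "batchInsertSize": 10000,
    "createTable": True,          # target table does not exist yet
    "truncateBeforeLoad": False,  # append rather than replace
}

resp = requests.post(
    f"{TAMR_HOST}/api/connect/jdbcExport",  # assumed path
    json=body,
    headers={"Authorization": f"BasicCreds {CREDS}"},
)
resp.raise_for_status()
print(resp.json())  # job details; monitor progress on the Jobs page
```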