Create a new dataset and define the schema
Create a dataset by specifying its name, type, and (optionally) external ID properties.
Optional keys include description, externalId, and tags. If no description is provided, the field is blank. If no externalId is provided, one is generated at creation time. External IDs must be unique and are case-insensitive. If no tags are provided, the tags array remains empty.
For example, the POST body
{
  "name": "Dataset created with pubapi",
  "keyAttributeNames": ["F1"],
  "description": "So much data in here!",
  "externalId": "Dataset created with pubapi",
  "tags": ["my-project"]
}
will create a new dataset named Dataset created with pubapi with the primary key column F1.
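The handling of optional keys described above can be sketched in Python. This is only an illustration of assembling the request body on the client side: the helper names build_dataset_body and external_id_taken are hypothetical and not part of the Tamr API, and no request is sent because the endpoint URL is not given in this section. The case-insensitive comparison is a local pre-check that mirrors the uniqueness rule; the server remains the authority.

```python
import json
from typing import Iterable, Optional


def build_dataset_body(
    name: str,
    key_attribute: str,
    description: Optional[str] = None,
    external_id: Optional[str] = None,
    tags: Optional[Iterable[str]] = None,
) -> dict:
    """Assemble a dataset-creation body, leaving out optional keys that were not given.

    Hypothetical helper: omitted keys fall back to the documented defaults
    (blank description, generated externalId, empty tags array).
    """
    body = {"name": name, "keyAttributeNames": [key_attribute]}
    if description is not None:
        body["description"] = description
    if external_id is not None:
        body["externalId"] = external_id
    if tags is not None:
        body["tags"] = list(tags)
    return body


def external_id_taken(candidate: str, existing_ids: Iterable[str]) -> bool:
    """Local pre-check mirroring the rule that external IDs are case-insensitive."""
    wanted = candidate.casefold()
    return any(wanted == other.casefold() for other in existing_ids)


body = build_dataset_body(
    "Dataset created with pubapi",
    "F1",
    description="So much data in here!",
    external_id="Dataset created with pubapi",
    tags=["my-project"],
)
print(json.dumps(body, indent=2))
```

Leaving an optional key out of the body, rather than sending null or an empty string, is a design choice made here so that the documented defaults apply; whether the API accepts explicit nulls is not stated in this section.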
Key Attributes
The Key Attribute is the field in your dataset that Tamr will use as a unique identifier for each record. Note that, at this time, compound keys are not supported. This means only one field can be passed in as the keyAttribute.
Creating a new dataset automatically creates a string type attribute for the field in keyAttributeNames.
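Because compound keys are not supported, a client may want to check keyAttributeNames before sending the request. The following is a minimal sketch of such a check; single_key_attribute is a hypothetical helper name, and the exact error the API itself returns for an invalid list is not documented in this section.

```python
from typing import Sequence


def single_key_attribute(key_attribute_names: Sequence[str]) -> str:
    """Return the one key attribute name, or raise if the list is not a single field.

    Hypothetical client-side check for the "no compound keys" rule.
    """
    if isinstance(key_attribute_names, str):
        # A bare string would otherwise be treated as a sequence of characters.
        raise TypeError("keyAttributeNames must be a list containing one field name")
    if len(key_attribute_names) != 1:
        raise ValueError(
            "compound keys are not supported: expected exactly one field, "
            f"got {len(key_attribute_names)}"
        )
    key = key_attribute_names[0]
    if not key:
        raise ValueError("the key attribute name must not be empty")
    return key


print(single_key_attribute(["F1"]))
```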
Loading a file from an external storage provider
To create a dataset backed by files in a storage provider, include the optional externalDatasetConfig field in your POST body. This must include the storage provider name and the path to a file or to a directory containing multiple files. For example:
{
  "name": "External Dataset",
  "description": "my dataset from foo",
  "keyAttributeNames": ["id"],
  "externalDatasetConfig": {
    "storageProviderName": "foo",
    "filePath": "/dataset.avro"
  }
}
If the filePath points to a directory, all of the avro files in that directory will be combined together as the dataset to be added to Tamr.
File Types in Storage Providers
Note that only avro files are supported for storage providers, not csv.
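The externalDatasetConfig rules above can be sketched as a small Python helper that builds the body and rejects paths that clearly point at a non-avro file. The function name build_external_dataset_body is hypothetical. Treating a path without a file extension as a directory is an assumption made for this sketch, not documented behavior, and it misclassifies directories whose names contain a dot.

```python
import json
from pathlib import PurePosixPath
from typing import Optional


def build_external_dataset_body(
    name: str,
    key_attribute: str,
    storage_provider_name: str,
    file_path: str,
    description: Optional[str] = None,
) -> dict:
    """Build a body for a dataset backed by a file or directory in a storage provider.

    Assumption: a path with no extension is a directory; a path with an
    extension must end in .avro, since only avro files are supported.
    """
    suffix = PurePosixPath(file_path).suffix.lower()
    if suffix and suffix != ".avro":
        raise ValueError(f"only avro files are supported, got {file_path!r}")
    body = {
        "name": name,
        "keyAttributeNames": [key_attribute],
        "externalDatasetConfig": {
            "storageProviderName": storage_provider_name,
            "filePath": file_path,
        },
    }
    if description is not None:
        body["description"] = description
    return body


external_body = build_external_dataset_body(
    "External Dataset", "id", "foo", "/dataset.avro", description="my dataset from foo"
)
print(json.dumps(external_body, indent=2))
```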
Exporting a Tamr dataset to a file in an external storage provider
You can link an upstream dataset from Tamr to a downstream file in an external storage provider. To do this, include both the optional externalDatasetConfig and upstreamDatasetIds fields in your POST body. The filePath of the externalDatasetConfig must point to a directory, not a file, and the upstreamDatasetIds must reference the full id of a Tamr dataset.
When you materialize the external dataset, the contents of the upstream dataset will be written to one or more avro files, overwriting anything that may have previously existed in that directory.
{
  "name": "External Dataset",
  "description": "my dataset from foo",
  "keyAttributeNames": ["id"],
  "externalDatasetConfig": {
    "storageProviderName": "foo",
    "filePath": "myDirectory/mySubdirectory/"
  },
  "upstreamDatasetIds": ["unify://unified-data/v1/datasets/1"]
}
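The export body can also be assembled and sanity-checked in code. The sketch below uses a hypothetical helper, build_export_body, and two heuristics that are assumptions rather than documented rules: a filePath with a file extension is treated as a file and rejected, and an id made only of digits is treated as a short id rather than a full one.

```python
import json
from pathlib import PurePosixPath
from typing import Optional, Sequence


def build_export_body(
    name: str,
    key_attribute: str,
    storage_provider_name: str,
    directory_path: str,
    upstream_dataset_ids: Sequence[str],
    description: Optional[str] = None,
) -> dict:
    """Build a body linking upstream Tamr datasets to a directory in a storage provider."""
    # Heuristic (assumption): a path ending in an extension looks like a file.
    if PurePosixPath(directory_path).suffix:
        raise ValueError(f"filePath must be a directory, got {directory_path!r}")
    if not upstream_dataset_ids:
        raise ValueError("at least one upstream dataset id is required")
    for dataset_id in upstream_dataset_ids:
        # Heuristic (assumption): a purely numeric id is not the full id.
        if dataset_id.isdigit():
            raise ValueError(f"expected the full dataset id, got {dataset_id!r}")
    body = {
        "name": name,
        "keyAttributeNames": [key_attribute],
        "externalDatasetConfig": {
            "storageProviderName": storage_provider_name,
            "filePath": directory_path,
        },
        "upstreamDatasetIds": list(upstream_dataset_ids),
    }
    if description is not None:
        body["description"] = description
    return body


export_body = build_export_body(
    "External Dataset",
    "id",
    "foo",
    "myDirectory/mySubdirectory/",
    ["unify://unified-data/v1/datasets/1"],
    description="my dataset from foo",
)
print(json.dumps(export_body, indent=2))
```

Since materializing the external dataset overwrites whatever was previously in the target directory, checking the path before sending the request is a cheap safeguard against pointing the export at the wrong location.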
Response Fields
If successful, this endpoint returns a dataset object describing the created dataset.