Project Movement
The Tamr Core project movement feature can be used to create, update, or back up project artifacts within and across instances.
Moving a Project
You use the Tamr Core project movement API to export projects and then, optionally, import them into existing or distinct new projects. The export process stores a set of project artifacts, representing different types of project data, in a compressed .zip file. The import process uses the .zip file to recreate the state and content of the exported project. The export never contains the source data records associated with your project.
You can use the project movement API to completely clone a project, or you can select only certain artifacts in a project to export or import.
During the export or import operation, Tamr Core goes into a read-only state so that the project and its data cannot be changed or corrupted while the operation runs. If you attempt any actions while an export or import is ongoing, you receive an error. When the operation is complete, the system returns to read-write, and you can then perform actions and run jobs.
Note: Server snapshots are not recommended for migrations or backups. Use the project movement API or do a Tamr Core application backup.
Project Movement Use Cases
Some examples of how you can use the project movement API follow.
- Duplicate a project from one environment onto another. For example, you can export from your development instance and then import to a new project on your test or production instance.
- Replace a project with another project. For example, you can restore a project on your production instance to a known state by exporting from a stable project on your development instance and then importing it into the project on your production instance.
- “Fork” a project for testing by exporting and then importing it to a new project on the same instance. You can then make changes to the copy without affecting the original.
- Merge changes made in a forked project back into the original project. For example, having forked a project and completed changes in the copy satisfactorily, you can export the copy and then import it into the original project.
- Update a project on one instance with a specific type of change made on another instance. For example, after duplicating a mastering project from your development instance to production, additional pair labeling and cluster verification take place in development. You can export only the artifacts for this added reviewer input and import them into the project on production.
- Duplicate or update a series of “chained” projects, where the output of one project is used as the input to another project. The projects must be migrated in the correct dependency order. Examples of chained projects include:
  - A mastering project with published clusters that are input to a golden records project. You must import the mastering project to the target instance before importing the golden records project.
  - A mastering project with published clusters that are input to a different mastering project. This may be used to master at another hierarchical level, such as company followed by corporate parent. You must import the first mastering project to the target instance before importing the second mastering project.
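For chained projects, the import order can be worked out from the dependencies between projects. The following is a minimal sketch, assuming you record by hand which projects feed which; the project names are hypothetical and nothing here calls Tamr Core.

```python
from graphlib import TopologicalSorter


def migration_order(upstream_of):
    """Return project names ordered so that each project follows its inputs.

    upstream_of maps a project name to the set of projects whose output it uses.
    """
    return list(TopologicalSorter(upstream_of).static_order())


# Hypothetical chain: company mastering -> corporate parent mastering -> golden records.
chain = {
    "company_mastering": set(),
    "parent_mastering": {"company_mastering"},
    "parent_golden_records": {"parent_mastering"},
}
order = migration_order(chain)
print(order)
```

Importing the projects in the printed order satisfies the requirement that an upstream project exists on the target instance before any project that consumes its output.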
Permissions and Roles
- To use the project movement API, you must be assigned the admin user role.
- Files can only be exported to, and imported from, the local file system. As a result, the functional user for the Tamr Core instance must have read/write access to the directories you specify.
Requirements and Best Practices
Project movement between instances that are running different versions of Tamr Core is not supported.
Important: Attempting to use project movement between instances with different versions can appear to succeed in the API responses. However, doing so is likely to result in corrupted data in the project and upgrade failure in the future.
These guidelines describe the most effective uses of the project movement API.
- Before you import a project, back up the instance.
- Periodically verify the size of the compressed .zip files produced by your exports and the available local disk space.
- The software version in use on the source instance is not included in export files. To identify the version of a file, Tamr suggests setting up version-specific directories and using an appropriate naming convention for them.
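One possible way to follow the last two guidelines is sketched below. The directory naming convention (a `v<version>` subdirectory) and the version string are assumptions chosen for illustration, not a Tamr requirement.

```python
import shutil
from pathlib import Path


def export_directory(base_dir, tamr_version):
    """Build a version-specific export directory path, for example <base>/v2021.003.0."""
    return Path(base_dir) / f"v{tamr_version}"


def free_gib(path):
    """Return the free space, in GiB, on the file system that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


# Hypothetical base directory and version string.
target = export_directory("/home/ubuntu/tamr/projectExports", "2021.003.0")
print(target)
print(f"{free_gib('.'):.1f} GiB free on the current file system")
```

The resulting path could be supplied as the export directory, so that the version of each export file can be read from its location.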
Note: An export does not include any of the source datasets for the project you are migrating. This includes input datasets on the Schema Mapping page and any additional datasets referenced by transformations, such as a JOIN or LOOKUP. If the source and target instances are different for your export, be sure to verify that all source datasets exist, with the same names and structures, on both instances. The data itself can be different.
Working with Project Artifacts
When you export, you can either include all of a project’s artifacts or specify certain artifacts to exclude.
Tip: Some artifacts are required and cannot be excluded from an export. Reference tables with the supported export options for each artifact follow.
On import, you can include all of the artifacts that are in the export file or specify artifacts to exclude. You can also indicate whether the import should succeed or fail if any of the non-excluded artifacts are missing from the export file.
Tip: Some artifacts can only be excluded when you import into an existing project, but are required when you import into a new project. Reference tables with the supported import options for each artifact follow.
In addition, when you import into an existing project you can specify whether an artifact should be imported destructively or additively.
- Destructively: drops all existing data in the target project and then populates it with the data from the source project.
- Additively: overwrites existing data in the target project that conflicts with the data from the source project and retains all non-conflicting existing data.
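The difference between the two modes can be illustrated with plain dictionaries. This is only a model of the semantics described above, assuming that a "conflict" means the same key exists on both sides; it does not reflect how Tamr Core stores artifacts or detects conflicts.

```python
def import_destructively(target, source):
    """Drop everything in the target, then populate it from the source."""
    return dict(source)


def import_additively(target, source):
    """Overwrite conflicting entries with source values and keep the rest of the target."""
    merged = dict(target)
    merged.update(source)
    return merged


# Hypothetical pair labels keyed by a record-pair identifier.
target_labels = {"pair-1": "MATCH", "pair-2": "NON_MATCH"}
source_labels = {"pair-2": "MATCH", "pair-3": "MATCH"}

print(import_destructively(target_labels, source_labels))
print(import_additively(target_labels, source_labels))
```

In this model the destructive import loses `pair-1`, while the additive import keeps it and only replaces the conflicting `pair-2`.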
A complete list of artifacts by project type, with their options for export and import, follows.
Options for Project Movement Artifacts
The reference tables that follow list project artifacts by Tamr Core project type, indicate which artifacts are required for new project import, and give the supported import options and a description for each artifact.
These tables indicate the import options as:
- E - Exclude from import into any project, new or existing
- EP - Exclude from import into existing projects only
- IA - Include additively
- ID - Include destructively
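If you keep a copy of these tables in a script, the option cells can be split into the supported codes and the default. The sketch below assumes the cell format used in this section, where the default option carries an asterisk; the function name is hypothetical.

```python
def parse_import_options(cell):
    """Split a table cell such as "E, ID, IA*" into (supported codes, default code)."""
    supported, default = [], None
    for token in cell.split(","):
        code = token.strip()
        if code.endswith("*"):
            code = code[:-1]  # the asterisk marks the default option
            default = code
        supported.append(code)
    return supported, default


print(parse_import_options("E, ID, IA*"))
print(parse_import_options("E*, ID, IA"))
```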
Schema Mapping Project Artifacts
Artifact Name | Supported/Default* Import Options | Description |
---|---|---|
INPUT_DATASETS | E, ID, IA* | The set of input datasets for the project. |
UNIFIED_ATTRIBUTES | E, ID, IA* | The set of unified attributes for the unified dataset. Also includes the mappings from input to unified attributes. |
INPUT_DATASET_DO_NOT_MAPS | E, ID, IA* | "Do not map" metadata for attributes of input datasets. |
TRANSFORMATIONS | E, ID* | The set of input and unified dataset transformations. |
SMR_MODEL | E, ID* | The schema mapping recommendation model. |
RECORD_COMMENTS | E*, ID, IA | The comments attached to records on the Clusters page (for a mastering project) or the categorizations page (for a categorization project). |
Mastering Project Artifacts
Mastering projects contain all of the schema mapping artifacts plus the following artifacts.
Artifact Name | Supported / Default* Import Options | Description |
---|---|---|
MASTERING_CONFIGURATION (required for new project import) | EP, ID* | The mastering configuration. |
USER_DEFINED_SIGNALS | E, ID, IA* | Reserved for future use. The user-defined signals for the clustering model. |
MASTERING_FUNCTIONS | E, ID, IA* | Reserved for future use. The mastering functions. |
RECORD_PAIR_COMMENTS | E*, ID, IA | The comments on pairs. |
RECORD_PAIR_VERIFIED_LABELS | E, ID, IA* | The verified labels for pairs, used in training the clustering model. |
RECORD_PAIR_UNVERIFIED_LABELS | E, ID, IA* | The unverified labels for pairs contributed by reviewers. |
RECORD_PAIR_ASSIGNMENTS | E, ID, IA* | The user assignments made for a pair review. |
CLUSTERING_MODEL | E, ID* | The model used to predict matching pairs and clusters of records. |
PUBLISHED_CLUSTERS | E, ID* | The published set of clusters with their persistent identifiers. |
CLUSTER_RECORD_VERIFICATIONS | E, ID, IA* | The verifications of records as members in specific clusters. |
CLUSTER_ASSIGNMENTS | E*, ID, IA | The user assignments made for cluster review. |
Golden Records Project Artifacts
Artifact Name | Supported / Default* Import Options | Description |
---|---|---|
GR_CONFIGURATION (required for new project import) | EP, ID* | The golden record configuration. |
GR_RULES | E, ID* | The rules used to populate each golden record attribute. |
GR_OVERRIDES | E, ID, IA* | The manual override values for the golden records. |
Categorization Project Artifacts
Categorization projects contain all of the schema mapping artifacts plus the following artifacts.
Artifact Name | Supported / Default* Import Options | Description |
---|---|---|
CATEGORIZATION_CONFIGURATION (required for new project import) | EP, ID* | The categorization configuration, including the confidence threshold for assigning records to categories beyond the first tier. |
CATEGORIZATION_FUNCTIONS | E, ID, IA* | Reserved for future use. The categorization functions. |
CATEGORIZATION_VERIFIED_LABELS | E, ID, IA* | The verified labels used to train the categorization model. |
CATEGORIZATION_TAXONOMIES | E, ID, IA* | The taxonomy nodes for categorization. |
CATEGORIZATION_MODEL | E, ID* | The model used to predict categorizations for records. |
CATEGORIZATION_FEEDBACK | E, ID, IA* | The unverified labels contributed by reviewers. |
Exporting a Project
During the export operation, Tamr Core goes into a read-only state so that the project and its data cannot be changed or corrupted during export. If you attempt any actions while the export is ongoing, you receive an error. When the export is complete, the system returns to read-write, and you can then perform actions and run jobs.
Note: A skipReadOnlyMode parameter is available and can be set to true. Tamr recommends leaving this parameter at its default setting, false.
Before You Begin
These prerequisite steps help ensure data consistency and prevent failures.
For a mastering project, ensure that the clustering takes all verifications into account prior to export.
- Run Update results only to update clusters and then Review and publish clusters. See Publishing Clusters.
- For a mastering project created prior to v2021.003 that you intend to import into a new project, run Apply feedback and update results and then Publish clusters. This one-time step is necessary after the upgrade to v2021.003 or later. If you want to export and import a project without having to run Apply feedback and update results, contact Tamr Support.
To export a project:
- Find the project ID of the project you want to export. See Editing a Project and List all Datasets.
- Run POST v1/projects/{project}:export. You supply the project ID, and in the body you specify:
  - artifactDirectory (required)
  - excludeArtifacts (optional). See Options for Project Movement Artifacts.
  Tamr Core starts an asynchronous job to export the specified project. See Export a project.
  The response includes the ID for this project movement operation, which you must supply to get the status of the export operation.
- Run GET v1/operations/{operationID} as needed to get the status of the export operation.
  Tip: You can find the operationID in the response to the POST call. The operation IDs listed by GET /v1/operations do not include project movement operations.
Tamr Core starts an asynchronous job to export the specified project, and returns a response object that includes an operation ID and the job status. Tamr Core creates a compressed .zip file named export-<project_id>-<timestamp> in the designated directory with the exported project artifacts.
Tip: The existence of the .zip file does not mean the export is successful. Use GET v1/operations/{operationId} to verify the status of the export operation.
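The export call can also be prepared from a script. The following minimal sketch uses only the Python standard library and builds the request without sending it. The base URL and header set mirror the curl examples in this document; the function name and its parameters are hypothetical.

```python
import json
import urllib.request


def build_export_request(base_url, project_id, artifact_directory,
                         exclude_artifacts=(), credentials="<credentials>"):
    """Prepare (but do not send) the POST request that starts a project export."""
    body = {"artifactDirectory": artifact_directory}
    if exclude_artifacts:
        body["excludeArtifacts"] = list(exclude_artifacts)
    return urllib.request.Request(
        url=f"{base_url}/v1/projects/{project_id}:export",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": credentials,
        },
        method="POST",
    )


request = build_export_request(
    "http://host:9100/api/versioned",
    1,
    "/home/ubuntu/tamr/projectExports",
    exclude_artifacts=["RECORD_PAIR_COMMENTS"],
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without a Tamr Core instance.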
An example follows.
Example Export
curl -X POST --header 'Accept: application/json' -H "Content-Type: application/json" --header 'Authorization: <credentials>' 'http://host:9100/api/versioned/v1/projects/1:export' -d '
{
  "artifactDirectory": "/home/ubuntu/tamr/projectExports",
  "excludeArtifacts": [
    "UNIFIED_ATTRIBUTES",
    "RECORD_PAIR_COMMENTS"
  ]
}'
// This will return a response Operation object that looks like:
{
  "id": "projectExport-my_project-2021-03-03_01-49-59-041",
  "type": "projectExport",
  "description": "projectExport with artifact: /home/ubuntu/tamr/projectExports/export-1-123123782.zip",
  "status": {
    "state": "PENDING",
    "startTime": "",
    "endTime": "",
    "message": ""
  },
  "created": {
    "username": "<user>",
    "time": "2021-03-03_01-49-59-041",
    "version": ""
  },
  "lastModified": {
    "username": "<user>",
    "time": "2021-03-03_01-49-59-041",
    "version": ""
  },
  "result": {
    "result": {
      "typeUrl": "type.googleapis.com/google.protobuf.StringValue",
      "value": "Ck4vaG9tZS91YnVudHUvdGFtci9wcm9qZWN0RXhwb3J0cy9Mb3RzJTIwb2YlMjBUcmFuc2Zvcm1hdGlvbnMtMTYxNDczNjE5OTAyNi56aXA="
    }
  },
  "relativeId": "operations/projectExport-my_project-2021-03-03_01-49-59-041"
}
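The `value` field in the sample response appears to be a base64-encoded, serialized `google.protobuf.StringValue`, as its `typeUrl` suggests. The sketch below decodes it by hand under that assumption, reading the single string field (tag `0x0A`, then a varint length, then UTF-8 bytes). It is not an official client, and the function name is hypothetical.

```python
import base64


def decode_string_value(encoded):
    """Decode a base64-encoded, serialized google.protobuf.StringValue."""
    raw = base64.b64decode(encoded)
    if not raw:
        return ""  # an empty message carries the default (empty) string
    if raw[0] != 0x0A:  # field number 1, wire type 2 (length-delimited)
        raise ValueError("unexpected protobuf tag")
    # Read the varint length that follows the tag byte.
    length, shift, pos = 0, 0, 1
    while True:
        byte = raw[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        if byte < 0x80:
            break
        shift += 7
    return raw[pos:pos + length].decode("utf-8")


# The value copied from the sample response above.
value = "Ck4vaG9tZS91YnVudHUvdGFtci9wcm9qZWN0RXhwb3J0cy9Mb3RzJTIwb2YlMjBUcmFuc2Zvcm1hdGlvbnMtMTYxNDczNjE5OTAyNi56aXA="
print(decode_string_value(value))
# prints /home/ubuntu/tamr/projectExports/Lots%20of%20Transformations-1614736199026.zip
```

For the sample value, the decoded string is a file path in which the spaces are already written as `%20`.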
Importing a Project
Tamr recommends that you back up your instance before you import a project. You can either import to a new project or import into an existing project.
Before You Begin:
These prerequisites help ensure data consistency and prevent failures:
- Verify that the input datasets for the source project exist on the target instance with identical names.
- If you are importing to a different instance, the target instance must be running the same version (up to minor version) of Tamr Core as the source.
- If you are importing to a different instance and require correct attribution of labels and comments, verify that the same set of Tamr Core user accounts exists on both the source and target instances. Importing labels and comments contributed by anyone who does not have a user account on the target instance results in corruption of these values.
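The dataset and user checks above amount to comparing two lists of names. A minimal sketch, assuming you have already collected the names from each instance by whatever means you prefer; the names shown are hypothetical.

```python
def missing_on_target(source_names, target_names):
    """Return the names present on the source but absent on the target, sorted."""
    return sorted(set(source_names) - set(target_names))


# Hypothetical user names; collect the real lists from each instance.
source_users = ["alice", "bob", "carol"]
target_users = ["alice", "carol"]
print(missing_on_target(source_users, target_users))
```

An empty result means every name from the source was found on the target. The same helper can be applied to input dataset names.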
During the import operation, Tamr Core goes into a read-only state so that the project and its data cannot be changed or corrupted during import. If you attempt any actions while the import is ongoing, you receive an error. When the import is complete, the system returns to read-write, and you can then perform actions and run jobs.
To import a project:
- You can import into a new or existing project.
  - To import into a new project, run POST v1/projects:import. You need to specify a project name and a unified dataset name that are unique on the target instance.
  - To import into an existing project, you need to find the existing project's project ID and specify it when you run POST v1/projects/{project}:import. See Editing a Project and List all Datasets.
  Tip: When importing into an existing project, specify exactly which artifacts you want to import and how. Relying on the default settings can have unintended results. See Options for Project Movement Artifacts.
- In the body, specify:
  - newProjectName (required, only available for POST v1/projects:import)
  - newUnifiedDatasetName (optional, only available for POST v1/projects:import. Supplied as {newProjectName}_unified_dataset if not defined.)
  - projectArtifact (required)
    Note: In the projectArtifact argument of the import statement, spaces must be replaced by %20 to ensure Tamr Core recognizes the path correctly. This can occur when a project name has spaces in it.
  - excludeArtifacts (optional)
  - failIfNotPresent (optional)
    Applies only when importing into an existing project. Indicates the action Tamr Core should take for the artifacts that are included by default (either additively or destructively). When set to false (default), the file is imported regardless of whether all of the artifacts that are included by default are present. When set to true, Tamr Core requires the file to include all of the artifacts that are included by default, to protect the project from being overwritten unexpectedly. See Options for Project Movement Artifacts.
  - includeDestructiveArtifacts (optional, see the Before You Begin: notes above for an example of when this is required)
  - includeAdditiveArtifacts (optional)
  Tamr Core starts an asynchronous job to import the specified artifacts from the designated projectArtifact file, and returns a response object that includes the job status and the ID for this project movement operation, which you must supply to get the status of the import operation.
- Run GET v1/operations/{operationID} as needed to get the status of the import operation. An example follows.
  Tip: You can find the operationID in the response to the POST call. The operation IDs listed by GET /v1/operations do not include project movement operations.
- Run each of the jobs in the project.
  Note: Tamr does not alert you that the imported project needs to be updated; however, running these jobs is required to complete the import.
An example POST and response follow.
Example Import into a New Project
curl -X POST --header 'Accept: application/json' -H "Content-Type: application/json" --header 'Authorization: <credentials>' 'http://host:9100/api/versioned/v1/projects:import' -d '
{
  "newProjectName": "my_new_project",
  "projectArtifact": "/home/ubuntu/tamr/projectExports/my_project-1603164090688.zip",
  "excludeArtifacts": [],
  "includeDestructiveArtifacts": [
    "CLUSTER_RECORD_VERIFICATIONS"
  ],
  "includeAdditiveArtifacts": [
    "UNIFIED_ATTRIBUTES"
  ],
  "failIfNotPresent": false
}'
// This will return a response Operation object that looks like:
{
  "id": "projectImport-my_new_project-2021-03-03_01-49-59-041",
  "type": "projectImport",
  "description": "projectImport with artifact: /home/ubuntu/tamr/projectExports/my_project-1614736199026.zip",
  "status": {
    "state": "PENDING",
    "startTime": "",
    "endTime": "",
    "message": ""
  },
  "created": {
    "username": "<user>",
    "time": "2021-03-03_01-49-59-041",
    "version": ""
  },
  "lastModified": {
    "username": "<user>",
    "time": "2021-03-03_01-49-59-041",
    "version": ""
  },
  "result": {
    "result": {
      "typeUrl": "type.googleapis.com/google.protobuf.StringValue",
      "value": "Ck4vaG9tZS91YnVudHUvdGFtci9wcm9qZWN0RXhwb3J0cy9Mb3RzJTIwb2YlMjBUcmFuc2Zvcm1hdGlvbnMtMTYxNDczNjE5OTAyNi56aXA="
    }
  },
  "relativeId": "operations/projectImport-my_new_project-2021-03-03_01-49-59-041"
}
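The body for a new-project import can be assembled in a script, which makes it easy to apply the `%20` rule for spaces in the projectArtifact path. This is a minimal sketch; the helper name is hypothetical, and the file name is made up to show a project name that contains spaces.

```python
import json


def new_project_import_body(new_project_name, project_artifact,
                            new_unified_dataset_name=None, exclude_artifacts=()):
    """Build the JSON body for POST v1/projects:import as a Python dict."""
    body = {
        "newProjectName": new_project_name,
        # Spaces in the path must be sent as %20.
        "projectArtifact": project_artifact.replace(" ", "%20"),
        "excludeArtifacts": list(exclude_artifacts),
    }
    # Leave the key out to get the documented default, {newProjectName}_unified_dataset.
    if new_unified_dataset_name is not None:
        body["newUnifiedDatasetName"] = new_unified_dataset_name
    return body


body = new_project_import_body(
    "my_new_project",
    "/home/ubuntu/tamr/projectExports/Lots of Transformations-1614736199026.zip",
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with `-d`, or sent with any HTTP client.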
Example Import into an Existing Project
curl -X POST --header 'Accept: application/json' -H "Content-Type: application/json" \
--header 'Authorization: <credentials>' 'http://host:9100/api/versioned/v1/projects/1:import' -d '
{
  "projectArtifact": "/home/ubuntu/tamr/projectExports/my_project-1603164090688.zip",
  "excludeArtifacts": [],
  "includeDestructiveArtifacts": [
    "CLUSTER_RECORD_VERIFICATIONS"
  ],
  "includeAdditiveArtifacts": [
    "UNIFIED_ATTRIBUTES"
  ],
  "failIfNotPresent": false
}'
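Because relying on defaults when importing into an existing project can have unintended results, it may help to check a planned request against the option tables before sending it. The sketch below is a client-side check only: the `SUPPORTED` mapping copies a few rows from the reference tables, and the function name and messages are hypothetical. Whether Tamr Core performs similar validation on its side is not covered here.

```python
# Supported import options for a few artifacts, copied from the reference tables.
SUPPORTED = {
    "UNIFIED_ATTRIBUTES": {"E", "ID", "IA"},
    "TRANSFORMATIONS": {"E", "ID"},
    "CLUSTER_RECORD_VERIFICATIONS": {"E", "ID", "IA"},
    "MASTERING_CONFIGURATION": {"EP", "ID"},
}


def check_import_plan(exclude=(), destructive=(), additive=()):
    """Return a list of problems found in an existing-project import plan."""
    problems = []
    seen = {}
    for code, artifacts in (("E", exclude), ("ID", destructive), ("IA", additive)):
        for name in artifacts:
            if name in seen:
                problems.append(f"{name} is listed under both {seen[name]} and {code}")
            seen[name] = code
            options = SUPPORTED.get(name)
            if options is None:
                problems.append(f"{name} is not in the SUPPORTED table")
                continue
            # EP means the artifact may be excluded when the target project exists.
            allowed = (options | {"E"}) if "EP" in options else options
            if code not in allowed:
                problems.append(f"{name} does not support option {code}")
    return problems


print(check_import_plan(destructive=["CLUSTER_RECORD_VERIFICATIONS"],
                        additive=["UNIFIED_ATTRIBUTES"]))
print(check_import_plan(additive=["TRANSFORMATIONS"]))
```

The first call matches the example request above and reports no problems. The second reports that TRANSFORMATIONS has no additive option.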
Example Get the Export or Import Operation
curl -X GET --header 'Accept: application/json' -H "Content-Type: application/json" \
--header 'Authorization: <credentials>' \
'http://host:9100/api/versioned/v1/operations/import-my_new_project-2020-12-04_17-24-07-017'
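Since export and import run asynchronously, a script typically repeats the GET call until the operation finishes. The following sketch shows one way to structure that loop. Only the PENDING state appears in this document, so the other state names are assumptions passed in by the caller, and `fetch_operation` is a hypothetical stand-in for the real HTTP call.

```python
import time


def wait_for_operation(fetch_operation, terminal_states,
                       interval_seconds=5.0, max_polls=120):
    """Poll until the operation reaches one of `terminal_states`, then return it."""
    for _ in range(max_polls):
        operation = fetch_operation()
        if operation["status"]["state"] in terminal_states:
            return operation
        time.sleep(interval_seconds)
    raise TimeoutError("operation did not finish within the polling budget")


# Stand-in for GET v1/operations/{operationID}; replace with a real HTTP call.
# RUNNING, SUCCEEDED, FAILED, and CANCELED are assumed state names.
responses = iter([
    {"status": {"state": "PENDING"}},
    {"status": {"state": "RUNNING"}},
    {"status": {"state": "SUCCEEDED"}},
])
final = wait_for_operation(
    lambda: next(responses),
    terminal_states={"SUCCEEDED", "FAILED", "CANCELED"},
    interval_seconds=0,
)
print(final["status"]["state"])
```

Passing the fetch function as a parameter keeps the loop independent of any HTTP library and lets it be exercised without a running instance.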