Datasets Generated by Tamr
This page lists datasets that you can locate under Results and Internals on the Dataset Catalog page in Tamr.
Note that the Export Support column reflects only the export label shown in the Tamr user interface. You can export every dataset listed on this page through the API, including those marked No.
Topics:
- Datasets in a Mastering Project
- Datasets in a Categorization Project
- Datasets in a Schema Mapping Project
Datasets in a Mastering Project
Dataset Name | Description | Export Support | Delete Support |
---|---|---|---|
_dedup_features | Features for all the rows and values in the source dataset. | No | Yes, can be recreated by generating pairs |
_dedup_idf | The inverse document frequency for all the fields in the source dataset. | Yes | Yes, can be recreated by generating pairs |
_dedup_non_null_count | The non-null count for each feature. | Yes | Yes, can be recreated by generating pairs |
_dedup_dnf_binning | Binning done across all features. | Yes | Yes, can be recreated by generating pairs |
_dedup_clusters_with_data | Unified dataset plus cluster IDs, cluster names, and whether the record is locked. | Yes | Yes, can be recreated by updating results |
_dedup_clusters_with_stats | Record clustering with statistics. | No | Yes, can be recreated by updating results |
_dedup_cluster_stats | Statistics about the clusters. | Yes | |
_dedup_published_clusters | Published clusters, keyed by unified dataset record ID. | Yes | Not recommended. Can be deleted and recreated but stored data will be lost. |
_dedup_published_clusters_stats | Record clustering with statistics, keyed by persistent cluster ID | No | Not recommended. Can be deleted and recreated but stored data will be lost. |
_dedup_published_clusters_with_data | Unified dataset plus cluster IDs, cluster names, and whether the record is locked, keyed by unified dataset record ID | Yes | Not recommended. Can be deleted and recreated but stored data will be lost. |
<unified_dataset_name>_dedup_published_cluster_counts | Record and cluster counts as of the latest cluster publication. | Yes | Yes, can be recreated by publishing clusters. |
_dedup_all_persistent_ids | Contains all persistent cluster IDs ever created for the Tamr deployment, keyed by unified persistent cluster ID | Yes | Not recommended. Can be deleted and recreated but stored data will be lost. |
_dedup_clusters_union | Current clusters and published clusters associated with records. | Yes | |
_dedup_cluster_stats_union | Statistics of current clusters and published clusters. | No | |
_dedup_clusters_with_stats_union | Records in the current and published clusters joined with statistics of the associated current and published clusters. | No | |
_dedup_imported_cluster_members | Imported clusters. | | |
_dedup_high_impact_questions | All record pairs which are marked as high impact questions. | Yes | |
_dedup_dnf_binning | Binning done across all features. | Yes | |
_dedup_grouped_dnf_binning | Binning data grouped by clause and bin IDs. | Yes | |
_dedup_labels | Human-labeled pairs. | Yes | |
_dedup_feedback | Pairs label feedback. | No | |
_dedup_pair_comments | Pairs with comments. | Yes | |
_dedup_signals | All signals generated while comparing records (i.e., pairs and similarities). | Yes | |
_dedup_human_signals | All signals generated using human labels, comments, and feedback. | Yes | |
_dedup_dnf_signals | All signals generated using the deduplication model. | Yes | |
_dedup_signals_predictions | All signals along with the predictions and confidence scores. | | |
_dedup_model | The deduplication model decision tree. | No | Yes, can be recreated by training on pair labels. |
_important_pairs | Pairs with labels, feedback, or comments. | Yes | |
recordPairLabel | Raw pair labels from persistence. | Yes | Yes, can be recreated by generating pairs |
record_pair_feedback | Raw pair feedback from persistence. | No | Yes, can be recreated by generating pairs. |
record_pair_id_string | Raw record pair ID strings from persistence. | Yes | |
internal_links | Raw links from persistence. | No | |
cluster_member | Raw verified cluster members from persistence. | Yes | Yes, can be recreated by generating pairs. |
cluster_feedback | Raw cluster feedback from persistence. | Yes | Yes, can be recreated by updating model. |
Datasets in a Categorization Project
Dataset Name | Description | Export Support | Delete Support |
---|---|---|---|
_classifications | All the classifications, manual and suggested, for the project. | Yes | Yes, can be recreated by updating categorizations |
_classifications_with_data | All the classifications, manual and suggested, with input record fields. | Yes | Yes, can be recreated by updating categorizations |
_classification_histogram_boundaries | The histogram boundaries for numeric attributes. | Yes | Yes, can be recreated by updating categorizations |
_classification_model | The categorization model dataset. | Yes | Yes, can be recreated by updating categorizations |
_classifications_average_confidences | The average confidences of all records' classifications. | Yes | Yes, can be recreated by updating categorizations |
category | Raw categories from persistence. | No | Yes, can be recreated by updating categorizations |
categorization | Raw categorizations from persistence. | | |
Datasets in a Schema Mapping Project
Dataset Name | Description | Export Support | Delete Support |
---|---|---|---|
mappingrecommendations_recipe<sm_recommendations_recipe_number> | The schema mapping recommendations dataset. | Yes | No |
mappingrecommendation_model_recipe<sm_recommendations_recipe_number> | The schema mapping recommendation model dataset. | Yes | No |
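Because the Export Support column reflects only the user interface label, it can help to list which datasets are marked No and therefore need the API for export. The following is a minimal sketch in Python, not part of Tamr: it parses rows copied from the tables on this page in their pipe-separated form. The names `DatasetRow`, `parse_row`, and `api_only` are hypothetical helpers introduced for illustration, and the sketch makes no calls to Tamr itself.

```python
from typing import List, NamedTuple


class DatasetRow(NamedTuple):
    """One table row from this page (hypothetical helper type)."""
    name: str
    description: str
    export_support: str
    delete_support: str


def parse_row(line: str) -> DatasetRow:
    """Split a 'name | description | export | delete |' row into fields."""
    cells = [cell.strip() for cell in line.split("|")]
    # Pad short rows so missing Export/Delete cells become empty strings.
    cells += [""] * (4 - len(cells))
    return DatasetRow(*cells[:4])


def api_only(rows: List[DatasetRow]) -> List[str]:
    """Names of datasets whose Export Support label is 'No'."""
    return [row.name for row in rows if row.export_support == "No"]


# A few rows copied from the tables above.
SAMPLE = [
    "_dedup_features | Features for all the rows and values in the source dataset. | No | Yes, can be recreated by generating pairs |",
    "_dedup_idf | The inverse document frequency for all the fields in the source dataset. | Yes | Yes, can be recreated by generating pairs |",
    "_dedup_model | The deduplication model decision tree. | No | Yes, can be recreated by training on pair labels. |",
    "categorization | Raw categorizations from persistence. |",
]

rows = [parse_row(line) for line in SAMPLE]
print(api_only(rows))
```

Rows with blank Export Support cells, such as `categorization`, are left out of the result because the page does not state their label.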