Tamr Documentation

Datasets Generated by Tamr

This page lists datasets that you can locate using the Results and Internals filter on the Dataset Catalog page in Tamr.

Note that Export Support refers to the availability of the Export option in the Tamr user interface. You can export all datasets listed on the Dataset Catalog page through the API.

Datasets in a Mastering Project

_dedup_features

Description: Features for all the rows and values in the source dataset.

Export Support: No

Delete Support: Yes, can be recreated by generating pairs.

_dedup_idf

Description: The inverse document frequency for all the fields in the source dataset.

Export Support: Yes

Delete Support: Yes, can be recreated by generating pairs.
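
As background on the term, inverse document frequency gives rare tokens more weight than common ones. The sketch below uses the textbook formula log(N / df), treating each field value as one document. This page does not state the formula Tamr applies, so the function name, tokenization, and sample values here are all assumptions for illustration.

```python
import math
from collections import Counter

def inverse_document_frequency(field_values):
    """Textbook IDF, log(N / df), per token.

    Assumption: each non-empty field value is one "document" and tokens
    are lowercase whitespace-separated words. Not necessarily what Tamr does.
    """
    docs = [set(value.lower().split()) for value in field_values if value]
    total = len(docs)
    # df = number of documents that contain the token at least once
    doc_freq = Counter(token for doc in docs for token in doc)
    return {token: math.log(total / count) for token, count in doc_freq.items()}

# Hypothetical values of a single field across four source records.
names = ["Acme Corp", "Acme Inc", "Globex Corp", "Acme Holdings"]
idf = inverse_document_frequency(names)
```

In this sample, "acme" appears in three of four values and receives the lowest weight, while "globex" appears once and receives the highest.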

_dedup_non_null_count

Description: The non-null count for each feature.

Export Support: Yes

Delete Support: Yes, can be recreated by generating pairs.

_dedup_dnf_binning

Description: Blocking done across all features.

Export Support: Yes

Delete Support: Yes, can be recreated by generating pairs.

_dedup_clusters_with_data

Description: Unified dataset plus cluster IDs, cluster names, and whether the record is locked.

Export Support: Yes

Delete Support: Yes, can be recreated by updating results.

_dedup_clusters_with_stats

Description: Record clustering with statistics.

Export Support: No

Delete Support: Yes, can be recreated by updating results.

_dedup_cluster_stats

Description: Statistics about the clusters.

Export Support: Yes

Delete Support:

_dedup_cluster_average_linkage

Description: The average pairwise match probabilities in clusters.

Export Support: Yes

Delete Support: Yes, can be recreated by running a predict-clustering job.
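
To make "average pairwise match probability" concrete, the following sketch averages the probabilities of every record pair inside one cluster. The function name, record IDs, and probabilities are hypothetical, and the sketch assumes a probability is available for every pair; how Tamr treats pairs without a score is not stated on this page.

```python
from itertools import combinations

def average_linkage(record_ids, match_probability):
    """Mean match probability over all unordered record pairs in a cluster.

    Assumption: match_probability maps a sorted (id_a, id_b) tuple to a
    probability and contains an entry for every pair in the cluster.
    """
    pairs = list(combinations(sorted(record_ids), 2))
    if not pairs:
        return None  # a single-record cluster has no pairs to average
    return sum(match_probability[pair] for pair in pairs) / len(pairs)

# Hypothetical three-record cluster with three pairwise probabilities.
probabilities = {("r1", "r2"): 0.9, ("r1", "r3"): 0.6, ("r2", "r3"): 0.75}
score = average_linkage(["r1", "r2", "r3"], probabilities)
```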

_dedup_published_clusters

Description: Published clusters, keyed by unified dataset record ID.

Export Support: Yes

Delete Support: Not recommended. Can be deleted and recreated but stored data will be lost.

_dedup_published_clusters_stats

Description: Record clustering with statistics, keyed by persistent cluster ID.

Export Support: No

Delete Support: Not recommended. Can be deleted and recreated but stored data will be lost.

_dedup_published_clusters_with_data

Description: Unified dataset plus cluster IDs, cluster names, and whether the record is locked, keyed by unified dataset record ID.

Export Support: Yes

Delete Support: Not recommended. Can be deleted and recreated but stored data will be lost.

<unified_dataset_name>_dedup_published_cluster_counts

Description: Record and cluster counts as of the latest cluster publication.

Export Support: Yes

Delete Support: Yes, can be recreated by publishing clusters.

_dedup_all_persistent_ids

Description: Contains all persistent cluster IDs ever created for the Tamr deployment, keyed by the unified persistent cluster ID.

Export Support: Yes

Delete Support: Not recommended. Can be deleted and recreated but stored data will be lost.

_dedup_clusters_union

Description: Current clusters and published clusters associated with records.

Export Support: Yes

Delete Support:

_dedup_cluster_stats_union

Description: Statistics of current clusters and published clusters.

Export Support: No

Delete Support:

_dedup_clusters_with_stats_union

Description: Records in the current and published clusters joined with statistics of the associated current and published clusters.

Export Support: No

Delete Support:

_dedup_imported_cluster_members

Description: Imported clusters.

Export Support:

Delete Support:

_dedup_high_impact_questions

Description: All record pairs which are marked as high impact questions.

Export Support: Yes

Delete Support:

_dedup_grouped_dnf_binning

Description: Blocking data grouped by clause and bin IDs.

Export Support: Yes

Delete Support:

_dedup_labels

Description: Human-labeled pairs.

Export Support: Yes

Delete Support:

_dedup_feedback

Description: Feedback on pair labels.

Export Support: No

Delete Support:

_dedup_pair_comments

Description: Pairs with comments.

Export Support: Yes

Delete Support:

_dedup_signals

Description: All signals generated while comparing records (i.e. pairs and similarities).

Export Support: Yes

Delete Support:

_dedup_human_signals

Description: All signals generated using human labels, comments, and feedback.

Export Support: Yes

Delete Support:

_dedup_dnf_signals

Description: All signals generated using the deduplication model.

Export Support: Yes

Delete Support:

_dedup_signals_predictions

Description: All signals along with the predictions and confidence scores.

Export Support:

Delete Support:

_dedup_model

Description: The deduplication model decision tree.

Export Support: No

Delete Support: Yes, can be recreated by training on pair labels.

_important_pairs

Description: Pairs with labels, feedback, or comments.

Export Support: Yes

Delete Support:

recordPairLabel

Description: Raw pair labels from persistence.

Export Support: Yes

Delete Support: Yes, can be recreated by generating pairs.

record_pair_feedback

Description: Raw pair feedback from persistence.

Export Support: No

Delete Support: Yes, can be recreated by generating pairs.

record_pair_id_string

Description: Raw record pair ID strings from persistence.

Export Support: Yes

Delete Support:

internal_links

Description: Raw links from persistence.

Export Support: No

Delete Support:

cluster_member

Description: Raw verified cluster members from persistence.

Export Support: Yes

Delete Support: Yes, can be recreated by generating pairs.

cluster_feedback

Description: Raw cluster feedback from persistence.

Export Support: Yes

Delete Support: Yes, can be recreated by updating model.

Datasets in a Categorization Project

_categories

Description: The taxonomy for the project.

Export Support: Yes

Delete Support:

_classifications

Description: All the categorizations, manual and suggested, for the project.

Export Support: Yes

Delete Support: Yes, can be recreated by updating categorizations.

_classifications_with_data

Description: All the categorizations, manual and suggested, with input record fields. The attributes in this dataset include:

  • trainingFunctionCategoryPath (beta): The category predicted by the Tamr model for a record that has a manually assigned category label.
  • manualClassificationPath: The category label manually assigned by an expert. Can be null.
  • finalCategoryPath: The categorization accepted for the record. The final categorization combines manual labels, categorization functions of type override, and Tamr model predictions (in that order).
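
The ordering described for finalCategoryPath can be read as a precedence rule: use the manual label if one exists, otherwise an override function's result, otherwise the model prediction. The sketch below illustrates that reading only; the function and parameter names are hypothetical and are not part of any Tamr API.

```python
from typing import Optional, Sequence

CategoryPath = Sequence[str]

def final_category_path(
    manual: Optional[CategoryPath],
    override: Optional[CategoryPath],
    predicted: Optional[CategoryPath],
) -> Optional[CategoryPath]:
    """Return the first available path in precedence order.

    Assumed order, taken from the attribute description:
    manual label, then override function, then model prediction.
    """
    for candidate in (manual, override, predicted):
        if candidate is not None:
            return candidate
    return None  # the record has no categorization from any source

# Hypothetical record with no manual label: the override result is used.
path = final_category_path(None, ["Hardware", "Fasteners"], ["Hardware", "Tools"])
```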

Export Support: Yes

Delete Support: Yes, can be recreated by updating categorizations.

_classification_histogram_boundaries

Description: The histogram boundaries for numeric attributes.

Export Support: Yes

Delete Support: Yes, can be recreated by updating categorizations.

_classification_model

Description: The categorization model dataset.

Export Support: Yes

Delete Support: Yes, can be recreated by updating categorizations.

_classifications_average_confidences

Description: The average confidences of all records' categorizations.

Export Support: Yes

Delete Support: Yes, can be recreated by updating categorizations.

category

Description: Raw categories from HBase.

Export Support: No

Delete Support: Yes, can be recreated by updating categorizations.

categorization

Description: Raw categorizations from HBase.

Export Support:

Delete Support:

Datasets in a Schema Mapping Project

mappingrecommendations_recipe<sm_recommendations_recipe_number>

Description: The schema mapping recommendations dataset.

Export Support: Yes

Delete Support: No

mappingrecommendation_model_recipe<sm_recommendations_recipe_number>

Description: The schema mapping recommendation model dataset.

Export Support: Yes

Delete Support: No

Datasets in a Golden Records Project

_golden_records_overrides

Description: Lists all manual overrides applied to attributes in a golden records project, including the user and timestamp of each override.

Export Support: Yes

Delete Support: No

_golden_records_draft

Description: The golden records dataset before golden records are published.

Export Support: Yes

Delete Support: No

_golden_records_rule_output

Description: The golden records dataset that results from applying the golden record rules.

Export Support: Yes

Delete Support: No

_golden_records

Description: The golden records dataset that results from applying both golden record rules and manual overrides.

Export Support: Yes

Delete Support: No
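
One way to picture how rule output and manual overrides combine is to layer the overrides on top of the rule-derived values. The sketch below assumes that a manual override replaces the rule value for the same attribute; this page does not spell out that precedence, and the function name and sample values are hypothetical.

```python
def apply_overrides(rule_output, overrides):
    """Return golden record values with manual overrides layered on top.

    Assumption: an override replaces the rule-derived value of the same
    attribute. The input dictionaries are left unchanged.
    """
    merged = dict(rule_output)  # copy so the rule output stays intact
    merged.update(overrides)
    return merged

# Hypothetical values for one cluster's golden record.
rule_output = {"name": "Acme Corp", "city": "Boston"}
overrides = {"city": "Cambridge"}
golden = apply_overrides(rule_output, overrides)
```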

_golden_records_pinned_cluster_input

Description: Tamr IDs at the record level with their associated golden record values.

Export Support: Yes

Delete Support: No

_golden_records_cluster_profile

Description: The relationship between input cluster IDs and the IDs of the records in each cluster.

Export Support: Yes

Delete Support: No

Updated 3 months ago