The confidence of a cluster is based on the average "distance" between records in the cluster.
You can find this confidence information in several Tamr-generated datasets:
Tamr versions prior to 2021.010.0
_dedup_cluster_stats and _dedup_clusters_with_stats. The "cluster stats" dataset contains one record per cluster, while the "clusters with stats" dataset contains one record per record from the unified dataset.
Tamr versions 2021.010.0 and later
_dedup_cluster_stats and _dedup_clusters_average_linkage. (The _dedup_clusters_with_stats dataset was replaced by the _dedup_clusters_average_linkage dataset.)
Cluster confidence is stored in the average linkage field. Clusters with only one record (identifiable from the record count field) have no average linkage field, because confidence does not need to be computed in that case.
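As a rough illustration of handling the missing field, the sketch below reads exported records and returns no confidence for single-record clusters. It assumes the export is newline-delimited JSON. The field names clusterId, recordCount, and averageLinkage are hypothetical placeholders, so check them against the schema of your own exported dataset.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical field names, used for illustration only.
# Replace them with the names found in your exported dataset's schema.
CLUSTER_ID_FIELD = "clusterId"
RECORD_COUNT_FIELD = "recordCount"
AVERAGE_LINKAGE_FIELD = "averageLinkage"


def read_confidences(
    lines: Iterable[str],
) -> Iterator[Tuple[str, Optional[int], Optional[float]]]:
    """Yield (cluster id, record count, average linkage) for each JSON line.

    Average linkage is None when the field is absent, which is expected
    for clusters that contain a single record.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        yield (
            record[CLUSTER_ID_FIELD],
            record.get(RECORD_COUNT_FIELD),
            record.get(AVERAGE_LINKAGE_FIELD),  # None if the field is missing
        )


# Made-up sample records: one multi-record cluster and one singleton.
sample = [
    '{"clusterId": "c1", "recordCount": 3, "averageLinkage": 0.82}',
    '{"clusterId": "c2", "recordCount": 1}',
]
results = list(read_confidences(sample))
```

Using .get() rather than direct indexing keeps the code from failing on single-record clusters; downstream code can then decide how to treat a confidence of None.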
The _dedup_cluster_stats and _dedup_clusters_average_linkage datasets do not share the same schema; if your code relies on the old datasets, it will need some changes.
Some of these datasets contain complex information that cannot be represented in a CSV file; they must be exported as JSON through the API. For more information about these datasets and how to export them, see Datasets Generated by Tamr Core in the public documentation. For more information on exporting data through the API, visit our dedicated section on using Tamr Core APIs.