How to Extract a Subset of Clusters with a Specific Average Confidence?

  1. Click the "Review and publish clusters" button if you have not already done so.

  2. Create a new Schema Mapping project and add "<Mastering_project>_unified_dataset_dedup_published_clusters_with_data" as the input dataset.

  3. Click "create unified dataset" in the right panel of the Schema Mapping page. Then create a new attribute named "Confidence" in the same panel and click "Update Unified Dataset".

  4. Go to the "Unified Dataset" page and add the following two transformations under "Transformations on Unified Dataset". Both must be added as "script" transformations:

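a. To join the cluster statistics and copy each cluster's average linkage into the "Confidence" attribute, add the following transformation: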

LEFT JOIN <Mastering_project>_unified_dataset_dedup_published_cluster_stats ON get(persistentId,0) == <Mastering_project>_unified_dataset_dedup_published_cluster_stats.persistentId;
SELECT *,
<Mastering_project>_unified_dataset_dedup_published_cluster_stats.averageLinkage as Confidence;

b. To keep only the clusters in the average-confidence range you want, add one of the following filters:

# Low average confidence
FILTER to_double(Confidence) <= 0.7

# Medium and High average confidence
FILTER to_double(Confidence) > 0.7

  5. After entering each transformation, select "Preview" to see how your data looks after the transformation is applied. Once you have verified the result, click "done" and then "save changes".

  6. Run "Update Unified Dataset". You can then export the <schema_mapping>_unified_dataset dataset from the dataset catalog page.
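
To make the logic of the two transformations in step 4 easier to follow, here is a minimal Python sketch of the same idea on toy in-memory data. It is an illustration only, not how the product implements the transformations. The field names persistentId, averageLinkage and Confidence and the 0.7 threshold come from the steps above; the function names, the recordId field and the sample values are hypothetical. It also assumes that records with no matching cluster statistics row are dropped by both filters.

```python
LOW_CONFIDENCE_MAX = 0.7  # threshold taken from the FILTER examples above


def attach_confidence(records, cluster_stats):
    """Left join: copy each cluster's averageLinkage onto its records as Confidence."""
    stats_by_id = {row["persistentId"]: row["averageLinkage"] for row in cluster_stats}
    joined = []
    for rec in records:
        cluster_id = rec["persistentId"][0]  # mirrors get(persistentId, 0)
        # Unmatched records keep Confidence = None, as in a left join.
        joined.append({**rec, "Confidence": stats_by_id.get(cluster_id)})
    return joined


def split_by_confidence(joined, threshold=LOW_CONFIDENCE_MAX):
    """Return (low, medium_high) record lists, mirroring the two alternative filters."""
    low, medium_high = [], []
    for rec in joined:
        if rec["Confidence"] is None:
            continue  # assumption: a missing value passes neither filter
        confidence = float(rec["Confidence"])  # mirrors to_double(Confidence)
        if confidence <= threshold:
            low.append(rec)
        else:
            medium_high.append(rec)
    return low, medium_high


# Hypothetical sample data: unified-dataset values are modeled as lists of strings.
records = [
    {"recordId": "r1", "persistentId": ["c1"]},
    {"recordId": "r2", "persistentId": ["c1"]},
    {"recordId": "r3", "persistentId": ["c2"]},
    {"recordId": "r4", "persistentId": ["c3"]},  # no stats row for c3
]
cluster_stats = [
    {"persistentId": "c1", "averageLinkage": "0.55"},
    {"persistentId": "c2", "averageLinkage": "0.92"},
]

joined = attach_confidence(records, cluster_stats)
low, medium_high = split_by_confidence(joined)
print("low:", [r["recordId"] for r in low])
print("medium/high:", [r["recordId"] for r in medium_high])
```

In the product you apply only one of the two filters at a time, so each run of the steps above yields one of these two subsets.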