How to Extract a Subset of Clusters with a Specific Average Confidence
-
Click the "Review and publish clusters" button if you have not already done so.
-
Create a new Schema Mapping project and add "<Mastering_project>_unified_dataset_dedup_published_clusters_with_data" as the input dataset.
-
Click "create unified dataset" on the right panel of the Schema Mapping page. Then, in the same panel, create a new attribute named "Confidence" and click "Update Unified Dataset".
-
Go to the "Unified Dataset" page and add the following two transformations under "Transformations on Unified Dataset". Both must be added as "script" transformations:
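a. Join the published cluster stats to each record and copy the cluster's average linkage into the "Confidence" attribute: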
LEFT JOIN <Mastering_project>_unified_dataset_dedup_published_cluster_stats ON get(persistentId,0) == <Mastering_project>_unified_dataset_dedup_published_cluster_stats.persistentId;
SELECT *,
<Mastering_project>_unified_dataset_dedup_published_cluster_stats.averageLinkage as Confidence;
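b. To keep only the clusters in a given average-confidence range, add one of the following filters. Use only one at a time: applying both leaves no records, because no value is both <= 0.7 and > 0.7.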
# Low average confidence
FILTER to_double(Confidence) <= 0.7
# Medium and High average confidence
FILTER to_double(Confidence) > 0.7
-
After entering each transformation, select "Preview" to see what your data looks like with the transformation applied. Once you have verified the result, click "done" and then "save changes".
-
Run "Update Unified Dataset". You can then export the <schema_mapping>_unified_dataset dataset from the dataset catalog page.
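If you prefer to export the unfiltered dataset and split it offline, or want to sanity-check an export, the same 0.7 threshold can be applied with a short script. The following is only a minimal sketch in Python (standard library only), not part of the product: it assumes the export is a CSV with a "Confidence" column, and the sample rows, the "clusterId" and "name" columns, and the helper names are made up for illustration.

```python
import csv
import io
import math

# Same cut-off as the FILTER statements in the transformation step.
LOW_CONFIDENCE_MAX = 0.7

# Hypothetical sample standing in for an exported CSV; real exports
# will have different columns and values.
SAMPLE_EXPORT = """clusterId,name,Confidence
c-1,Acme Corp,0.92
c-1,ACME Corporation,0.92
c-2,Globex,0.55
c-3,Initech,0.7
c-4,Umbrella,
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def split_by_confidence(rows, threshold=LOW_CONFIDENCE_MAX):
    """Split rows into (low, medium_high, unknown) by "Confidence".

    low:         Confidence <= threshold
    medium_high: Confidence >  threshold
    unknown:     Confidence missing, empty, non-numeric, or NaN
    """
    low, medium_high, unknown = [], [], []
    for row in rows:
        try:
            confidence = float(row.get("Confidence") or "")
        except ValueError:
            unknown.append(row)
            continue
        if math.isnan(confidence):
            unknown.append(row)
        elif confidence <= threshold:
            low.append(row)
        else:
            medium_high.append(row)
    return low, medium_high, unknown


if __name__ == "__main__":
    low, medium_high, unknown = split_by_confidence(load_rows(SAMPLE_EXPORT))
    print("low:", [r["name"] for r in low])
    print("medium/high:", [r["name"] for r in medium_high])
    print("no confidence:", [r["name"] for r in unknown])
```

To run this against a real export, replace `load_rows(SAMPLE_EXPORT)` with rows read from your file, for example via `csv.DictReader(open(path, newline=""))`. Rows without a usable "Confidence" value are kept in a separate bucket so that they are not silently counted as low or high confidence.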