Profiling a Dataset

Profiling creates a sample of the dataset for preview, computes metrics, and saves metadata for use during schema mapping.

Profiling an input dataset does the following:

Creates a sample of the dataset to preview on the Schema Mapping and Unified Dataset pages.
Computes the dataset's record count and, for each attribute, metrics that include the percentage of null values, the percentage of distinct values, and the most frequent values (a distinct value histogram).
Produces metadata that the machine learning model can use when learning how to map attributes from input datasets to the unified dataset.
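
To make the metrics listed above concrete, here is a minimal Python sketch of how they could be computed for a single attribute. It is an illustration only, not the product's implementation: the function name profile_attribute, the top_n parameter, the choice to treat empty strings as null, and the use of the total record count as the denominator for percentage distinct are all assumptions.

```python
from collections import Counter


def profile_attribute(values, top_n=3):
    """Compute simple profiling metrics for one attribute (hypothetical helper)."""
    total = len(values)
    # Assumption: None and empty or whitespace-only strings count as null.
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    null_count = total - len(non_null)
    # Count how often each distinct non-null value occurs.
    counts = Counter(non_null)
    return {
        "record_count": total,
        "pct_null": 100.0 * null_count / total if total else 0.0,
        # Assumption: distinct values are measured against all records.
        "pct_distinct": 100.0 * len(counts) / total if total else 0.0,
        "most_frequent": counts.most_common(top_n),
    }


# Example attribute with five records, two of which are null or empty.
cities = ["Boston", "Boston", "Cambridge", None, ""]
metrics = profile_attribute(cities)
```

For this sample, metrics reports 5 records, 40 percent null, and "Boston" as the most frequent value.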

You can profile datasets when you upload them or while you work in a project.

To profile an input dataset:

Open a schema mapping, mastering, or categorization project and select Datasets.
Select a dataset by using the checkbox at the left end of its row, and then choose Profile.
Select Profile n Dataset to confirm, where n is the number of datasets you selected.

To update profiling during schema mapping:

Open the Schema Mapping page of a schema mapping, mastering, or categorization project. The input attributes appear on the left.
Expand the information for an input attribute (on the left) or a unified attribute (on the right) by selecting the expand arrow to the left of the attribute name. If profiling is needed, one of the following options appears:

A Profile option appears if the attribute's dataset has never been profiled.

A Profiling out of date option appears if work in the project requires profiling to be refreshed.

Viewing Attribute Metrics

When you expand an attribute's details on the Schema Mapping page, the following information displays:

Record count.
Most frequent values with percentages.
A bar chart representing the percentages of records with different types of data values vs. those that are null, empty, or blank. Move your cursor over the colored sections of the bar to see the calculated percentages.
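
The breakdown that the bar chart represents can be approximated with a short Python sketch. The categories used here (numeric, text, null, empty) and the function name value_type_breakdown are hypothetical; the source does not list the exact data value types that the chart distinguishes.

```python
from collections import Counter


def classify(value):
    """Assign a value to a coarse category (categories are assumptions)."""
    if value is None:
        return "null"
    text = str(value).strip()
    if text == "":
        return "empty"
    try:
        float(text)  # succeeds for values such as "12" or "7.5"
        return "numeric"
    except ValueError:
        return "text"


def value_type_breakdown(values):
    """Return the percentage of records that fall in each category."""
    total = len(values)
    if total == 0:
        return {}
    counts = Counter(classify(v) for v in values)
    return {category: 100.0 * n / total for category, n in counts.items()}


# Example attribute: two numeric values, one text value, one null, one blank.
sample = ["12", "7.5", "abc", None, "  "]
breakdown = value_type_breakdown(sample)
```

Each entry in breakdown corresponds to one colored section of such a bar, and the percentages sum to 100.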

You can also access metrics on the Unified Dataset page. Move your cursor over the bar chart under an attribute name to see its metrics.
