Using Metadata in Transformations
To provide additional options for transforming data, you can add metadata to input datasets and attributes.
Curators and admins can add metadata to input datasets and attributes, which can then be used to expand the options available for transforming data.
You add metadata by specifying a property for the input dataset as a whole or for one or more of its input attributes. Each property contains a key:value pair. See Adding a Metadata Property.
Use Cases for Metadata
Example use cases for transformations that use metadata follow.
Filter by Confidentiality Level
Add a privacy
key and a value of P1
, P2
, P3
, or P4
to each attribute to filter records to meet access level requirements.
Filter to Newest Records
Add a last_modified_date
key and corresponding value to the input dataset. Use it and the transaction_date
attribute in your unified dataset to filter out records more than n months or years old relative to the date of the dataset.
Make Location-Specific Changes
In a project that harmonizes datasets from different countries, add a country_of_origin
key and a corresponding value to identify each input dataset.
Group Datasets by Type
In a project with a large number of datasets that fall into a few general types, instead of (or in addition to) using dataset tags to identify a dataset’s type, add a type
key and value to allow for type-specific transformations.
Adding a Metadata Property
Curators and admins can add metadata to an input dataset in a mastering, categorization, or schema mapping project.
To add a metadata property:
- Open a mastering, categorization, or schema mapping project.
- On the project’s Datasets page, select a dataset.
- Choose Open properties. A panel opens on the right side of the page.
- Use the drop-down list to select the object you want to add metadata to: the dataset itself, or one of the attributes in the dataset.
- Enter a name to identify the property key.
- Enter a value for the key.
- Select Add property. The Properties panel populate the first row of a table with the selected object and the specified key and value.
Repeat these steps to add more properties to the input dataset.
Example: Using Metadata in a Transformation
The metadata properties that you add for an input dataset are treated like a separate dataset, named metadata, with columns for attribute
, datasetName
, and key
.
You access these columns using syntax like USE metadata("source_dataset_1.csv", "source_dataset_2.csv")
.
In the example that follows, metadata has been added to identify the source country of the input datasets: one input dataset contains data for Canadian companies, the other for US companies. The transformation validates the formatting of postal code values in the unified dataset.
// Join the metadata to your unified dataset, filtering to the metadata key
// and attribute that you need.
// When using metadata on the dataset as a whole (not attribute specific) the
// attribute field should be null.
LEFT JOIN WITH metadata("healthcare_facilities_canada.csv", "us_companies.csv") AS metadata
ON origin_source_name == metadata.datasetName AND metadata.attribute is null AND metadata.key == 'country_of_origin';
// You can use it in transformations directly
// Setup data types
SELECT *, to_string(postal_code) AS postal_code;
SELECT *,
CASE
// CA post codes should follow the pattern letter,number,letter number,letter,number
WHEN metadata.value == 'CA' THEN matches('\\w\\d\\w \\d\\w\\d', postal_code)
//US post codes should be exactly 5 digits
WHEN metadata.value == 'US' THEN matches('\\d{5}', postal_code)
END AS postal_code_is_valid;
// or you can save the metadata value as a new column
SELECT *, metadata.value AS country;
Updated about 2 years ago