Creating a Unified Schema
You map attributes from multiple input datasets to unified attributes in the target, or "unified", dataset.
The unified schema contains the attributes that provide a single, consistent view of an entity. It harmonizes data from multiple sources and adds attributes that answer downstream questions. Before you create the unified dataset that will contain the schema, it helps to understand the options for mapping input attributes to unified attributes and the approaches you can take to creating the unified schema.
Working with Unified Attributes
Unified attributes are derived attributes that you populate by mapping one or more attributes from your input datasets into a single attribute in the schema for the unified dataset.
- Each unified attribute corresponds to the header of one column in the unified dataset.
- A schema is a collection of unified attributes.
- You can map more than one attribute from an input dataset to the same unified attribute.
- The mappings can be one-to-one or many-to-one.
- You can choose to ignore attributes in input datasets and not map them to any unified attributes.
- You can add unified attributes and populate them with the results of transformations instead of mapping input attributes to them directly.
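The mapping options above can be sketched in plain Python. This is only an illustrative model, not Tamr Core's API: the dataset names, attribute names, the `MAPPINGS` table, and the `unify` helper are all hypothetical, and collecting many-to-one values into a list is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical mapping table: (input dataset, input attribute) -> unified attribute.
# Input attributes that are not listed (such as "internal_id") are ignored.
MAPPINGS = {
    ("crm", "first_name"): "full_name",
    ("crm", "last_name"): "full_name",      # many-to-one
    ("crm", "email"): "email",              # one-to-one
    ("billing", "contact_email"): "email",  # same unified attribute, other dataset
}

def unify(dataset_name, record):
    """Return one unified record, collecting mapped values per unified attribute."""
    unified = defaultdict(list)
    for attr, value in record.items():
        target = MAPPINGS.get((dataset_name, attr))
        if target is not None:  # unmapped input attributes are skipped
            unified[target].append(value)
    return dict(unified)

row = unify("crm", {"first_name": "Ada", "last_name": "Lovelace",
                    "email": "ada@example.com", "internal_id": 7})

# A unified attribute populated by a transformation instead of a direct mapping.
row["email_domain"] = [e.split("@")[1] for e in row["email"]]
```

Here `full_name` receives two input attributes, `email` receives one, `internal_id` is dropped, and `email_domain` exists only in the unified schema.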
Approaches to Creating a Unified Schema
To create a unified schema for the unified dataset, you can:
- Design a set of unified attributes ahead of time and add them to the unified schema manually. You can then map attributes from the input datasets to these unified attributes.
- "Bootstrap" attributes from an input dataset onto the unified schema. This can be especially useful if the input dataset you select contains data that has been standardized.
- "Bootstrap" attributes from an input dataset that consists only of the header row with no data. This can be used to ensure that the unified schema is free of spelling or capitalization errors.
Bootstrapping maps the selected input attributes to the unified attributes by creating a unified attribute with the same name as the input attribute. Subsequent bootstrapping can be used to map input attributes to previously created unified attributes. If multiple input attributes that you select have the same name (for example, from different input datasets), bootstrapping creates a single unified attribute with that name and maps all of those input attributes to it.
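The bootstrapping behavior described above can be modeled in a few lines. This is a minimal sketch under stated assumptions, not how Tamr Core implements it: the `bootstrap` function, the list-based schema, and the dataset and attribute names are hypothetical.

```python
def bootstrap(unified_schema, mappings, selected):
    """Bootstrap selected (dataset, attribute) pairs onto the unified schema.

    unified_schema: list of unified attribute names (modified in place)
    mappings: dict of (dataset, attribute) -> unified attribute (modified in place)
    """
    for dataset, attr in selected:
        if attr not in unified_schema:
            unified_schema.append(attr)   # create a same-named unified attribute
        mappings[(dataset, attr)] = attr  # map to the new or existing attribute
    return unified_schema, mappings

schema, mappings = [], {}

# Two selected input attributes share the name "email": one unified attribute results.
bootstrap(schema, mappings, [("crm", "email"), ("billing", "email"), ("crm", "phone")])

# Subsequent bootstrapping maps onto the previously created "phone" attribute.
bootstrap(schema, mappings, [("support", "phone")])
```

After both calls, the schema holds only `email` and `phone`, while four input attributes are mapped to them.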
Tip: If you intend to include numerous or complex data transformations in your project, consider using a naming convention, such as an "_original" suffix, for the source unified attributes that you do not want to modify with transformations, and map input attributes to these original source attributes. You can then add other attributes to your unified schema and populate them with the results of transformations. This approach can help you maintain a clear data lineage. See the Transformations guide.
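As a rough illustration of this convention, the sketch below keeps the mapped value untouched and writes the transformed value to a separate attribute. The attribute names `name_original` and `name`, the `add_cleaned_name` helper, and the cleanup logic are hypothetical; in a real project the transformation would be written in the project's own transformation tooling.

```python
def add_cleaned_name(record):
    """Derive a cleaned "name" while leaving "name_original" unmodified."""
    result = dict(record)  # copy so the caller's record is not changed
    result["name"] = result["name_original"].strip().title()
    return result

source = {"name_original": "  ACME corp. "}
cleaned = add_cleaned_name(source)
```

Because `name_original` is never overwritten, the derived `name` can always be compared with the value that was mapped from the input dataset.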
Creating a unified schema is often an iterative process, especially as you add new input datasets over time. For example, as you work with your data, you may find that you would like to add more attributes from other input datasets to help describe a particular entity. Tamr Core helps automate most of the schema mapping process.