Golden Record Consolidation Rules
Configure rules and compose your own expression rules. Tamr Core uses these rules for the consolidation of golden records.
Rules act on attributes and consist of the following:
- Conditions
- Aggregation functions and expression aggregation functions
The List of Predefined Rules
The following rules are predefined and you don't need to create them with transformations:
- Most common value
- Max
- Min
- Sum
- Mean
- Count
- Distinct
- Count Distinct
- Expression (custom filter)
Conditions
List of Conditions for Golden Records
- Most common value
Note: This condition is useful when you want to choose the most common value for a particular attribute. - Max
- Min
- Is not empty
- Filter
- Dataset
Note: This condition is useful when you want to mark some of your input datasets as preferred.
Condition: dataset
The dataset condition allows you to apply a filter that, for each attribute, selects one or more preferred input datasets from the list of all datasets considered for golden records:
- Use the dataset condition to choose which input datasets you prefer, and exclude other input datasets from consideration. This is useful if, for a particular attribute, you prefer to rely on specific input datasets that you trust for this attribute.
- If you include more than one dataset into the
dataset
condition, each of them is considered. - If the list of input datasets changes since the last publish of golden records, Tamr Core adds the newly added datasets to the dropdown and marks them with the "New" label. For example, the list of input datasets changes if you add new input datasets to an existing mastering project or update golden records through the API.
- You can prioritize data from different input datasets to be considered for golden records. If you have more than one input dataset that you'd like to consider first, you can add them to a group with a priority number.
- Priority groups are numbered, with 1 being the highest priority.
- When Tamr Core creates golden records, it considers the datasets in the highest priority group first. If it does not find records in the input datasets that you have prioritized the highest and that fit your rules, it looks for records in the datasets with the next priority.
- When you prioritize input datasets, the rule you configure, such as "Most common value", applies to them first, if Tamr Core finds records that arrived from those input datasets. For example, this configuration: [["The Best Dataset"], ["Dataset A", "Dataset B"]] has two priority groups where the highest priority group contains only "The Best Dataset" and the lower priority group contains “Dataset A” and “Dataset B.” If any records in the cluster arrive from "The Best Dataset," then the golden record rules use data only from it. If none of the records arrived from "The Best Dataset," the rules consider the next priority group. If any of the records have a source of either "Dataset A" or "Dataset B," then Tamr Core selects those records for the golden records. If not, Tamr Core looks for the next priority group.
- You can omit specifying preferred input datasets. In this case, all records from all input datasets are considered.
- You can also drag and drop datasets between priority groups.
Condition: Is Not Empty
The Is Not Empty condition allows you to filter attributes based on whether the values are empty.
Condition: Max
The Max condition computes the maximum value of an attribute or expression in a group. You can also create a custom Max condition. This condition is similar to the Max
rule and uses the MAX aggregate transformation function.
Condition: Min
The Min condition computes the least value of an attribute or expression in a group. You can also create a custom Min condition. This condition is similar to the Min
rule and uses the MIN aggregate transformation function.
Condition: Most Common Value
The Most Common Value condition is applied by default, upon the initial creation of golden records. It allows you to apply a filter for a golden records' attribute that selects the most common value for it, from the group of attribute values it considers.
For example, if you apply this condition to the attribute city
in the following dataset with the grouping key published_id
, Tamr Core returns the value Boston
. If you add this condition to the attribute state
, and then apply the function most common value
to the attribute city
, Tamr returns the value Cambridge
.
published_id | state | city |
---|---|---|
101 | Massachusetts | Boston |
101 | Massachusetts | Cambridge |
101 | Ohio | Boston |
101 | Massachusetts | Cambridge |
101 | Ohio | Boston |
101 | Ohio | Boston |
101 | Massachusetts | Cambridge |
Condition: Logical Filter
The Logical Filter condition narrows down the values based on the predicate expressions that you use in the filter. For example, you can specify a logical filter as: length ("columnA") > 5
.
Aggregation Functions
You can configure aggregation functions and compose expression aggregation functions for rules used in golden records.
Most Common Value
This function returns the most common value, ignoring null and blank values. If the attribute contains arrays, the rule with this function unpacks the array, and finds the most common value across all values. For example, if the attribute has three arrays (1,2,3), (3,2,2), (1,2,3), the most common value is 2.
Note: Tie breaking selects the minimum value of an ascending sort. For example, if “apple” and “banana” are the most common values in the group, the rule chooses “apple”.
Max
This function returns the maximum value. For a numeric attribute, this is the largest number. For a string, this is the first result when sorting in a descending order. If the attribute contains arrays, the rule with this function unpacks the array, and finds the maximum across all values. For example, if the attribute has three arrays (1,2,4), (42), (40,41), the maximum value is 42. You can also create a custom Max filter rule.
Min
This function returns the minimum value. For a numeric attribute, this is the smallest number. For a string, this is the first result with sorting in the ascending order. If the attribute contains arrays, the rule unpacks the array, and finds the min across all values. For example, if the attribute has three arrays (1,2,4), (42), (40,41), the minimum value is 1. You can also create a custom Min filter rule.
Longest (string)
This function returns the longest value in an attribute, determined by the character length. It ignores nulls and blank strings. The rule with this function works only on string type attributes.
Shortest (string)
This function returns the shortest value in an attribute, determined by the character length. It ignores nulls and blank strings. The rule with this function works only on string type attributes.
Sum
This function adds all of the values in an attribute together. The rule with this function only works on numeric type attributes.
Mean
This function calculates the average of the values in an attribute. The rule with this function works only on numeric type attributes.
Count
This function returns the number of values in the cluster for that attribute. It ignores nulls, blank strings, and empty lists. If the attribute contains arrays, the rule unpacks the array, and counts the number of values. For example, if the attribute has three arrays (1,2,4), (42), (40,41), the count is 6.
Distinct
This function returns an array of all the distinct values in the cluster for that attribute. It ignores nulls, blank strings, and empty lists. If the attribute contains arrays, the rule unpacks the array, and finds the set. For example, if the attribute has three arrays (1,2,4), (3,2,2), (1,2,3), the rule returns (1,2,4,3).
Count Distinct
This function returns the number of unique values in the cluster for that attribute. It ignores nulls, blank strings, and empty lists. If the attribute contains arrays, the rule unpacks the array, and counts the number of distinct values. For example, if the attribute has three arrays (1,2,4), (3,2,2), (1,2,3), the rule returns the number 4.
Expression
A custom expression that you can specify. See the expression aggregation functions in the next section.
Expression Aggregation Functions
Use expression aggregation functions inside a code editor. These functions allow you to compose custom aggregations that utilize Tamr Core transformation functions.
The expression aggregation functions behave similar to how the GROUP BY
statement behaves. See Group By.
For example, to aggregate the most common value from first_name
and concatenate it with the most common value from last_name
, with a space between the two values, compose the following expression:
concat(
mode(...first_name),
' ',
mode(...last_name)
)
Updated about 2 years ago