Golden Record Consolidation Rules
Configure rules and compose your own expression rules. Tamr uses these rules for consolidation of golden records.
Rules consist of aggregation functions and conditions. You can configure aggregation functions and compose expression aggregation functions for rules used in golden records.
Aggregation Functions
Most common value
Returns the most common value, ignoring null and blank values. If the attribute contains arrays, the rule unpacks the array, and finds the most common value across all values. For example, if the attribute has three arrays (1,2,3), (3,2,2), (1,2,3), the most common value will be 2.
Tiebreaking
Tiebreaking selects the minimum value of an ascending sort. For example, if “apple” and “banana” are the most common values in the group, the rule chooses “apple”.
Max
Returns the maximum value. For a numeric attribute, this is the largest number. For a string, this is the first result when sorting descending. If the attribute contains arrays, the rule will unpack the array, and find the max across all values. For example, if the attribute has three arrays (1,2,4), (42), (40,41), the max value will be 42. You can also create a custom Max filter rule.
Min
Returns the minimum value. For a numeric attribute, this is the smallest number. For a string, this is the first result with sorting ascending. If the attribute contains arrays, the rule will unpack the array, and find the min across all values. For example, if the attribute has three arrays (1,2,4), (42), (40,41), the min value will be 1. You can also create a custom Min filter rule.
Longest (string)
Returns the longest value in an attribute, determined by character length. It ignores nulls and blank strings. This rule only works on attributes of string type.
Shortest (string)
Returns the shortest value in an attribute, determined by character length. It ignores nulls and blank strings. This rule only works on attributes of string type.
Sum
Adds all of the values in an attribute together. This rule only works on attributes of numeric type.
Mean
Calculates the average of the values in an attribute. This rule only works on attributes of numeric type.
Count
Returns how many values are in the cluster for that attribute. It ignores nulls, blank strings, and empty lists. If the attribute contains arrays, the rule will unpack the array, and count the number of values. For example, if the attribute has three arrays (1,2,4), (42), (40,41), the count will be 6.
Distinct
Returns an array of all the distinct values in the cluster for that attribute. It ignores nulls, blank strings, and empty lists. If the attribute contains arrays, the rule will unpack the array, and find the set. For example, if the attribute has three arrays (1,2,4), (3,2,2), (1,2,3), the rule will return (1,2,4,3).
Count Distinct
Returns how many unique values are in the cluster for that attribute. It ignores nulls, blank strings, and empty lists. If the attribute contains arrays, the rule will unpack the array, and count the number of distinct values. For example, if the attribute has three arrays (1,2,4), (3,2,2), (1,2,3), the rule will return the number 4.
Expression
A custom expression that you can specify. See Golden Record Consolidation Rules.
Conditions
Condition: dataset
dataset
The dataset
condition allows you to apply a filter that selects from the list one or more preferred input datasets, for each attribute. This is useful if, for example, for a particular attribute, you prefer to rely on specific input datasets that you trust for this attribute. You can use this condition to choose which input datasets you prefer, and exclude other input datasets from consideration.
If you include more than one dataset into your dataset
condition, each of them is considered.
If the list of input datasets changes since the last publishing of golden records, Tamr adds the newly added datasets to the drop-down list and marks them with the "New" label. For example, this can happen if you add new input datasets to an existing mastering project or updating golden records through the API.
You can prioritize data from different input datasets to be considered for golden records. If you have more than one input dataset that you'd like to consider first, you can add them to a group with a priority number. Priority groups are numbered, with 1 being the highest priority. When Tamr creates golden records, it considers the datasets in the highest priority group first. If it does not find records in the input datasets that you have prioritized the highest and that fit your rules, it looks for records in the datasets with the next priority.
When you prioritize input datasets, the rule you have configured, such as "Most common value", applies to them first, if Tamr finds records that arrived from those input datasets.
For example, this configuration: [["The Best Dataset"], ["Dataset A", "Dataset B"]] has two priority groups where the highest priority group contains only "The Best Dataset" and the lower priority group contains “Dataset A” and “Dataset B”. If any records in the cluster are coming from "The Best Dataset", then the golden record rules use data only from it. If none of the records arrived from "The Best Dataset", the rules consider the next priority group. If any of the records have a source of either "Dataset A" or "Dataset B", then Tamr selects those records for the golden records. If not, Tamr looks for the next priority group.
You can omit specifying preferred input datasets. In this case, all records from all input datasets are considered. You can also drag and drop datasets between priority groups.
Condition: Is not empty
.
Is not empty
.Allows you to filter attributes based on whether the values are not empty.
Condition: Max
Max
This condition is similar to the Max
rule and uses the MAX aggregate transformation function. It computes the maximum value of an attribute or expression in a group. You can also create a custom Max condition.
Condition: Min
Min
This condition is similar to the Min
rule and uses the MIN aggregate transformation function. It computes the least value of an attribute or expression in a group. You can also create a custom Min condition.
Condition: Most common value
Most common value
The most common value
condition allows you to apply a filter for an attribute in golden records that selects the most common value for this attribute, from the group of values it considers.
For example, if you apply the condition most common value
to the attribute city
in the following dataset with the grouping key published_id
, Tamr returns the value Boston
. If you add the most common value
condition to the attribute state
, and then apply the function most common value
to the attribute city
, Tamr returns the value Cambridge
.
published_id | state | city |
---|---|---|
101 | Massachusetts | Boston |
101 | Massachusetts | Cambridge |
101 | Ohio | Boston |
101 | Massachusetts | Cambridge |
101 | Ohio | Boston |
101 | Ohio | Boston |
101 | Massachusetts | Cambridge |
Condition: Logical filter
.
Logical filter
.Narrows down the values based on the predicate expressions that you use in the filter. For example, you can specify a logical filter as: length ("columnA") > 5
.
Expression Aggregation Functions
Use expression aggregation functions inside a code editor. These functions allow you to compose custom aggregations that utilize Tamr transformation functions.
The expression aggregation functions behave similar to how the GROUP BY
statement behaves. See Group By.
For example, to aggregate the most common value from first_name
and concatenate it with the most common value from last_name
, with a space between the two values, compose the following expression:
concat(
mode(get(first_name, 0)),
' ',
mode(get(last_name, 0))
)
Updated over 5 years ago