Golden Record Consolidation Rules

Rules consist of aggregation functions and conditions. You can configure aggregation functions and compose expression aggregation functions for rules used in golden records.

Aggregation Functions

Most common value

Returns the most common value, ignoring null and blank values. If the attribute contains arrays, the rule unpacks the array, and finds the most common value across all values. For example, if the attribute has three arrays (1,2,3), (3,2,2), (1,2,3), the most common value will be 2.

📘
Tiebreaking
Tiebreaking selects the minimum value of an ascending sort. For example, if “apple” and “banana” are the most common values in the group, the rule chooses “apple”.

Max

Returns the maximum value. For a numeric attribute, this is the largest number. For a string, this is the first result when sorting descending. If the attribute contains arrays, the rule will unpack the array, and find the max across all values. For example, if the attribute has three arrays (1,2,4), (42), (40,41), the max value will be 42. You can also create a custom Max filter rule.

Min

Returns the minimum value. For a numeric attribute, this is the smallest number. For a string, this is the first result with sorting ascending. If the attribute contains arrays, the rule will unpack the array, and find the min across all values. For example, if the attribute has three arrays (1,2,4), (42), (40,41), the min value will be 1. You can also create a custom Min filter rule.

Longest (string)

Returns the longest value in an attribute, determined by character length. It ignores nulls and blank strings. This rule only works on attributes of string type.

Shortest (string)

Returns the shortest value in an attribute, determined by character length. It ignores nulls and blank strings. This rule only works on attributes of string type.

Sum

Adds all of the values in an attribute together. This rule only works on attributes of numeric type.

Mean

Calculates the average of the values in an attribute. This rule only works on attributes of numeric type.

Count

Returns how many values are in the cluster for that attribute. It ignores nulls, blank strings, and empty lists. If the attribute contains arrays, the rule will unpack the array, and count the number of values. For example, if the attribute has three arrays (1,2,4), (42), (40,41), the count will be 6.

Distinct

Returns an array of all the distinct values in the cluster for that attribute. It ignores nulls, blank strings, and empty lists. If the attribute contains arrays, the rule will unpack the array, and find the set. For example, if the attribute has three arrays (1,2,4), (3,2,2), (1,2,3), the rule will return (1,2,4,3).

Count Distinct

Returns how many unique values are in the cluster for that attribute. It ignores nulls, blank strings, and empty lists. If the attribute contains arrays, the rule will unpack the array, and count the number of distinct values. For example, if the attribute has three arrays (1,2,4), (3,2,2), (1,2,3), the rule will return the number 4.

Expression

A custom expression that you can specify. See Golden Record Consolidation Rules.

Conditions

Condition: `dataset`

The dataset condition allows you to apply a filter that selects from the list one or more preferred input datasets, for each attribute. This is useful if, for example, for a particular attribute, you prefer to rely on specific input datasets that you trust for this attribute. You can use this condition to choose which input datasets you prefer, and exclude other input datasets from consideration.
If you include more than one dataset into your dataset condition, each of them is considered.

If the list of input datasets changes since the last publishing of golden records, Tamr adds the newly added datasets to the drop-down list and marks them with the "New" label. For example, this can happen if you add new input datasets to an existing mastering project or updating golden records through the API.

You can prioritize data from different input datasets to be considered for golden records. If you have more than one input dataset that you'd like to consider first, you can add them to a group with a priority number. Priority groups are numbered, with 1 being the highest priority. When Tamr creates golden records, it considers the datasets in the highest priority group first. If it does not find records in the input datasets that you have prioritized the highest and that fit your rules, it looks for records in the datasets with the next priority.

When you prioritize input datasets, the rule you have configured, such as "Most common value", applies to them first, if Tamr finds records that arrived from those input datasets.
For example, this configuration: [["The Best Dataset"], ["Dataset A", "Dataset B"]] has two priority groups where the highest priority group contains only "The Best Dataset" and the lower priority group contains “Dataset A” and “Dataset B”. If any records in the cluster are coming from "The Best Dataset", then the golden record rules use data only from it. If none of the records arrived from "The Best Dataset", the rules consider the next priority group. If any of the records have a source of either "Dataset A" or "Dataset B", then Tamr selects those records for the golden records. If not, Tamr looks for the next priority group.
You can omit specifying preferred input datasets. In this case, all records from all input datasets are considered. You can also drag and drop datasets between priority groups.

Condition: `Is not empty`.

Allows you to filter attributes based on whether the values are not empty.

Condition: `Max`

This condition is similar to the Max rule and uses the MAX aggregate transformation function. It computes the maximum value of an attribute or expression in a group. You can also create a custom Max condition.

Condition: `Min`

This condition is similar to the Min rule and uses the MIN aggregate transformation function. It computes the least value of an attribute or expression in a group. You can also create a custom Min condition.

Condition: `Most common value`

The most common value condition allows you to apply a filter for an attribute in golden records that selects the most common value for this attribute, from the group of values it considers.

For example, if you apply the condition most common value to the attribute city in the following dataset with the grouping key published_id, Tamr returns the value Boston. If you add the most common value condition to the attribute state, and then apply the function most common value to the attribute city, Tamr returns the value Cambridge.

published_id	state	city
101	Massachusetts	Boston
101	Massachusetts	Cambridge
101	Ohio	Boston
101	Massachusetts	Cambridge
101	Ohio	Boston
101	Ohio	Boston
101	Massachusetts	Cambridge

Condition: `Logical filter`.

Narrows down the values based on the predicate expressions that you use in the filter. For example, you can specify a logical filter as: length ("columnA") > 5.

Expression Aggregation Functions

Use expression aggregation functions inside a code editor. These functions allow you to compose custom aggregations that utilize Tamr transformation functions.

The expression aggregation functions behave similar to how the GROUP BY statement behaves. See Group By.

For example, to aggregate the most common value from first_name and concatenate it with the most common value from last_name, with a space between the two values, compose the following expression:

concat(
  mode(get(first_name, 0)),
  ' ',
  mode(get(last_name, 0))
)

Aggregation Functions

Most common value

📘Tiebreaking

Max

Min

Longest (string)

Shortest (string)

Sum

Mean

Count

Distinct

Count Distinct

Expression

Conditions

Condition: dataset

Condition: Is not empty.

Condition: Max

Condition: Min

Condition: Most common value

Condition: Logical filter.

Expression Aggregation Functions

📘
Tiebreaking

Condition: `dataset`

Condition: `Is not empty`.

Condition: `Max`

Condition: `Min`

Condition: `Most common value`

Condition: `Logical filter`.