Tamr Documentation

Golden Record Consolidation Rules

Configure rules and compose your own expression rules. Tamr uses these rules for consolidation of golden records.

Rules act on attributes and consist of:

  • Conditions
  • Aggregation functions and expression aggregation functions.

The List of Predefined Rules

The following rules are predefined and you don't need to create them with transformations.

  • Most common value
  • Max
  • Min
  • Sum
  • Mean
  • Count
  • Distinct
  • Count Distinct
  • Expression (custom filter)

Conditions

List of Conditions for Golden Records

  • Most common value. This condition is useful when you want to choose the most common value for a particular attribute.
  • Max
  • Min
  • Is not empty
  • Filter
  • Dataset. This condition is useful when you want to mark some of your input datasets as preferred.

Condition: dataset

The dataset condition allows you to apply a filter that, for each attribute, selects one or more preferred input datasets from the list of all datasets considered for golden records.

  • Use the dataset condition to choose which input datasets you prefer, and exclude other input datasets from consideration. This is useful if, for a particular attribute, you prefer to rely on specific input datasets that you trust for this attribute.

  • If you include more than one dataset into the dataset condition, each of them is considered.

  • If the list of input datasets changes since the last publishing of golden records, Tamr adds the newly added datasets to the drop-down list and marks them with the "New" label. For example, the list of input datasets changes if you add new input datasets to an existing mastering project or update golden records through the API.

  • You can prioritize data from different input datasets to be considered for golden records. If you have more than one input dataset that you'd like to consider first, you can add them to a group with a priority number. Priority groups are numbered, with 1 being the highest priority. When Tamr creates golden records, it considers the datasets in the highest priority group first. If it does not find records in the input datasets that you have prioritized the highest and that fit your rules, it looks for records in the datasets with the next priority.

  • When you prioritize input datasets, the rule you have configured, such as "Most common value", applies to them first, if Tamr finds records that arrived from those input datasets. For example, this configuration: [["The Best Dataset"], ["Dataset A", "Dataset B"]] has two priority groups where the highest priority group contains only "The Best Dataset" and the lower priority group contains “Dataset A” and “Dataset B”. If any records in the cluster arrive from "The Best Dataset", then the golden record rules use data only from it. If none of the records arrived from "The Best Dataset", the rules consider the next priority group. If any of the records have a source of either "Dataset A" or "Dataset B", then Tamr selects those records for the golden records. If not, Tamr looks for the next priority group.
    You can omit specifying preferred input datasets. In this case, all records from all input datasets are considered. You can also drag and drop datasets between priority groups.

Condition: Is Not Empty

The Is Not Empty condition allows you to filter attributes based on whether the values are not empty.

Condition: Max

The Max condition computes the maximum value of an attribute or expression in a group. You can also create a custom Max condition. This condition is similar to the Max rule and uses the MAX aggregate transformation function.

Condition: Min

The Min condition computes the least value of an attribute or expression in a group. You can also create a custom Min condition. This condition is similar to the Min rule and uses the MIN aggregate transformation function.

Condition: Most Common Value

The Most Common Value condition is applied by default by Tamr, upon the initial creation of golden records. It allows you to apply a filter for a golden records' attribute that selects the most common value for it, from the group of attribute values it considers.

For example, if you apply this condition to the attribute city in the following dataset with the grouping key published_id, Tamr returns the value Boston. If you add this condition to the attribute state, and then apply the function most common value to the attribute city, Tamr returns the value Cambridge.

published_id

state

city

101

Massachusetts

Boston

101

Massachusetts

Cambridge

101

Ohio

Boston

101

Massachusetts

Cambridge

101

Ohio

Boston

101

Ohio

Boston

101

Massachusetts

Cambridge

Condition: Logical Filter

The Logical Filter condition narrows down the values based on the predicate expressions that you use in the filter. For example, you can specify a logical filter as: length ("columnA") > 5.

Aggregation Functions

You can configure aggregation functions and compose expression aggregation functions for rules used in golden records.

Most Common Value

This function returns the most common value, ignoring null and blank values. If the attribute contains arrays, the rule with this function unpacks the array, and finds the most common value across all values. For example, if the attribute has three arrays (1,2,3), (3,2,2), (1,2,3), the most common value is 2.

Note: Tie breaking selects the minimum value of an ascending sort. For example, if “apple” and “banana” are the most common values in the group, the rule chooses “apple”.

Max

This function returns the maximum value. For a numeric attribute, this is the largest number. For a string, this is the first result when sorting in a descending order. If the attribute contains arrays, the rule with this function unpacks the array, and finds the maximum across all values. For example, if the attribute has three arrays (1,2,4), (42), (40,41), the maximum value is 42. You can also create a custom Max filter rule.

Min

This function returns the minimum value. For a numeric attribute, this is the smallest number. For a string, this is the first result with sorting in the ascending order. If the attribute contains arrays, the rule unpacks the array, and finds the min across all values. For example, if the attribute has three arrays (1,2,4), (42), (40,41), the minimum value is 1. You can also create a custom Min filter rule.

Longest (string)

This function returns the longest value in an attribute, determined by the character length. It ignores nulls and blank strings. The rule with this function works only on string type attributes.

Shortest (string)

This function returns the shortest value in an attribute, determined by the character length. It ignores nulls and blank strings. The rule with this function works only on string type attributes.

Sum

This function adds all of the values in an attribute together. The rule with this function only works on numeric type attributes.

Mean

This function calculates the average of the values in an attribute. The rule with this function works only on numeric type attributes.

Count

This function returns the number of values in the cluster for that attribute. It ignores nulls, blank strings, and empty lists. If the attribute contains arrays, the rule unpacks the array, and counts the number of values. For example, if the attribute has three arrays (1,2,4), (42), (40,41), the count is 6.

Distinct

This function returns an array of all the distinct values in the cluster for that attribute. It ignores nulls, blank strings, and empty lists. If the attribute contains arrays, the rule unpacks the array, and finds the set. For example, if the attribute has three arrays (1,2,4), (3,2,2), (1,2,3), the rule returns (1,2,4,3).

Count Distinct

This function returns the number of unique values in the cluster for that attribute. It ignores nulls, blank strings, and empty lists. If the attribute contains arrays, the rule unpacks the array, and counts the number of distinct values. For example, if the attribute has three arrays (1,2,4), (3,2,2), (1,2,3), the rule returns the number 4.

Expression

A custom expression that you can specify. See the expression aggregation functions in the next section.

Expression Aggregation Functions

Use expression aggregation functions inside a code editor. These functions allow you to compose custom aggregations that utilize Tamr transformation functions.

The expression aggregation functions behave similar to how the GROUP BY statement behaves. See Group By.

For example, to aggregate the most common value from first_name and concatenate it with the most common value from last_name, with a space between the two values, compose the following expression:

concat(
  mode(get(first_name, 0)),
  ' ',
  mode(get(last_name, 0))
)

Updated 3 months ago



Golden Record Consolidation Rules


Configure rules and compose your own expression rules. Tamr uses these rules for consolidation of golden records.

Suggested Edits are limited on API Reference Pages

You can only suggest edits to Markdown body content, but not to the API spec.