Grouping Obvious Duplicates
Record grouping is an optional stage in the mastering workflow that compares values for certain key attributes to find any obviously-matching records.
Including the record grouping stage in the project workflow can reduce the number of possibly-matching records that the blocking model needs to consider, which can, in turn, make the remaining stages of the mastering workflow more efficient.
Datasets that are known, or suspected, to contain duplicate records, and that include one or more attributes that can reliably be used to identify unique entities, are suitable for record grouping. Examples of reliable key attributes include:
- General: A government-issued identification number
- Businesses: D-U-N-S number
- People: First name, last name, date of birth, and post code
Tamr Core only groups multiple records together if they have identical values for all of the specified key attributes. For example, using the suggested key attributes for grouping people, records would only be grouped together if the values for first name, last name, date of birth, and post code are all identical. Records that have a unique value for one or more of the key attributes are placed into a single member, or “singleton”, group.
When you view a group that has multiple records, you see the single (identical) value for each of the grouping keys. For the other attributes, you choose an aggregation function, such as top 10 most frequently occurring, to represent the values from the member records.
You can choose whether records that have an empty value for a key attribute are grouped or left ungrouped, as individual records. See Grouping Records with Empty Keys.
Enabling Record Grouping
Typically, you enable record grouping early in the project development phase. Follow this procedure for:
- New projects: that is, you have not yet invested in pair training or cluster verification.
- Projects in the development environment.
To enable record grouping:
- In a mastering project, select the Group Records page.
- At upper right, slide the Enable Record Grouping toggle to the right. The page updates to display a preview with rows for record groups and columns for the attributes in the unified dataset.
By default:
- Records are only grouped within input datasets.
- All attributes included in machine learning are set to be grouping keys.
- Values for non-machine learning attributes are aggregated into arrays of the top 5 most-frequently occurring values.
See the Record Grouping Examples.
Next step:
Choose the attributes that are the reliable grouping keys for your data and select effective aggregation functions to other attributes. See Choosing a Grouping Key or Aggregation Function.
Additional options:
- Select input datasets to exclude from the record grouping process.
- Decide whether records with matching key values should be grouped only when they are from the same input dataset, or if they should be grouped across all included data sources.
- For each grouping key, decide whether records that have an empty value for that key should be grouped.
- Change the maximum size of the groups shown on the Group Records page.
Choosing a Grouping Key or Aggregation Function
The Group Records page provides the following choices for each of the attributes in the unified dataset: Grouping Key or Aggregation Function.
When you enable record grouping, by default:
- All of the attributes that are included in machine learning are set to be grouping keys.
- All of the attributes that are not included in machine learning are assigned a default aggregation function:
- Values for the spend attribute (if configured) are summed.
- Values for all other attributes show the top K most frequently occurring values, where K=5.
You can change these settings for each attribute to get the most accurate results for your data.
Important You must select Save Changes to save the updates that you make. If you navigate away from this page without saving, Tamr Core discards all of your changes.
To update attribute settings:
- In a mastering project that has record grouping enabled, select the Group Records page. This page shows a preview of the current record groups in a table at the bottom of the page.
- Optionally, use the Enable Preview toggle at top right to temporarily hide the preview section of the page. Each time you make an update the preview refreshes, so if you have a number of updates to make, hiding this section can streamline your work.
- Use the table’s horizontal scroll bar to locate the attribute you want to update. Attributes that are included in machine learning appear on the left, and those that are not included are on the right.
Tip: You can select Hide Estimated Metrics to show more rows in this table, and Show Estimated Metrics to see how the changes you make affect record grouping. - To identify an attribute as a key for finding obviously-matching records, select Grouping Key. Tamr Core updates the preview (if currently shown) and the estimated metrics to show the potential results of this change.
To exclude an attribute from being used as a key, select Aggregation Function. Tamr Core supplies the function Top K with a K= value of 10, and updates the preview(if currently shown) and the estimated metrics to show the potential results of this change. - To specify a different aggregation function, use the dropdown list. The options are:
- Top K: You can specify a different value for K. Tamr Core removes empty values before applying this rule.
- Sum: Select for the spend attribute only.
- Collect Set: Tamr Core removes empty values before applying this rule.
- Collect Subset K: You can specify a different value for K. Tamr Core removes empty values before applying this rule.
- Remove: Replaces all values with null. Not available for attributes that are included in machine learning or the spend attribute.
- Custom: By default, Tamr Core supplies first and the attribute name. You can supply a transformation with a different aggregation function.
Tamr Core updates the preview to show the potential results of each change you make.
- To review the records that are included in a group, select a value in the Group Size column. A dialog box opens with the data for the group followed by a table with the data for each of its records.
- If the preview is currently hidden, use Enable Preview to display this section of the page and review the potential results of your changes.
- To save the current settings for grouping keys and aggregation functions, select Save Changes.
- To initiate a job that applies the settings, select Update Record Grouping.
Excluding an Input Dataset
You can choose to exclude certain input datasets from record grouping. For example, you can exclude a dataset that is known to be free of duplicate records.
Tips:
- The setting you choose for Grouping Across Datasets can affect your decision to exclude a dataset.
- If you exclude an input dataset from record grouping, that dataset is automatically excluded from clustering and vice versa. See Defining the Blocking Model.
When you exclude a dataset, Tamr Core leaves its individual records ungrouped. Aggregation functions are not applied and the records do not appear in the group preview at the bottom of the page.
To exclude a dataset from grouping:
- In a mastering project that has record grouping enabled, select the Group Records page.
- At top left, select Choose from the Choose which datasets to exclude from record grouping option. A dialog box opens with a list of the input datasets.
- Select the checkbox next to the dataset you want to exclude.
- Select Save.
- Select Save Changes.
- To initiate a job that applies the change to the unified dataset, select Update Record Grouping.
Grouping Across Datasets
By default, Tamr Core only compares key values to group duplicates within a given dataset. You can choose to group records that have identical key values regardless of their input datasets. See the Record Grouping Examples.
To group records across input datasets:
- In a mastering project that has record grouping enabled, select the Group Records page. This page shows a preview of the current record groups at the bottom of the page.
- In the preview, locate the Source attribute on the left.
- Select the Group across sources checkbox above the Source attribute’s name. Tamr Core changes the source dataset for all grouped records to “GroupedSources” and updates the preview and estimated metrics to show the potential results of this change.
To group records only if they are in the same input dataset, clear this checkbox. - Select Save Changes.
- To initiate a job that applies the change to the unified dataset, select Update Record Grouping.
Grouping Records with Empty Keys
By default, Tamr Core does not put records that have an empty value for any of the grouping keys into groups. As a result, these records do not appear on the Group Records page. However, for each grouping key, you can choose to group records that have an empty value for that attribute. To be grouped together by this option, records must have identical values for all keys, including the empty value for the specified key. This choice can result in more singleton and multi-record groups. See the Record Grouping Examples.
Empty values include:
- Nulls
- Empty strings
- Strings of only spaces or tab characters
- Arrays with zero elements
- Arrays with any number of empty elements
To group records with empty key values:
- In a mastering project that has record grouping enabled, select the Group Records page.
- Use the preview table’s horizontal scroll bar to locate the grouping key.
- To group records that have an empty value for this key, select Group if empty. To leave records that have an empty value for this key ungrouped, clear Group if empty.
- Select Save Changes.
- To initiate a job that applies the change to the unified dataset, select Update Record Grouping.
Changing the Maximum Preview Size
By default, Tamr Core shows groups of up to 1,000 records and does not include larger groups in the preview. You can change the maximum size to meet your needs.
To change the maximum preview size:
- In a mastering project that has record grouping enabled, select the Group Records page.
- At top right, select Settings . A dialog box opens.
- From the dropdown list, select the maximum number of records a group can have to be included in the preview. The options are 100, 1,000, or 10,000.
- Select Apply. Tamr Core updates the preview to show the potential result of the change.
Record Grouping Examples
A series of examples follows to demonstrate the effects of different record grouping options on a small data sample.
Example Input Records
Source |
recordId |
first_name |
last_name |
date_of_birth |
tax_id |
---|---|---|---|---|---|
dataset1 |
a |
Joe |
Smith |
11/01/1911 |
11111 |
dataset1 |
c |
Joe |
Smith |
12/02/1922 |
11111 |
dataset2 |
b |
Joe |
Smith |
11/01/1911 |
null |
dataset2 |
h |
Joe |
Smith |
11/01/1911 |
null |
dataset2 |
d |
Betty |
Smith |
03/03/1933 |
222222 |
dataset1 |
e |
B |
Smith |
null |
222222 |
dataset2 |
f |
Joe |
Brown |
11/01/1911 |
11111 |
Example Record Groups: Default Settings
Grouping keys: first_name AND last_name AND date_of_birth AND tax_id (all machine learning attributes)
Group if empty: Cleared for all
Aggregation Functions: top 10 for recordId
Group across sources: Cleared
0 datasets excluded
Groups:
Group Size |
Source |
recordId |
first_name |
last_name |
date_of_birth |
tax_id |
---|---|---|---|---|---|---|
1 |
dataset1 |
a |
Joe |
Smith |
11/01/1911 |
11111 |
1 |
dataset1 |
c |
Joe |
Smith |
12/02/1922 |
11111 |
1 |
dataset2 |
d |
Betty |
Smith |
03/03/1933 |
222222 |
1 |
dataset2 |
f |
Joe |
Brown |
11/01/1911 |
11111 |
Records left ungrouped:
Source |
recordId |
first_name |
last_name |
date_of_birth |
tax_id |
---|---|---|---|---|---|
dataset2 |
b |
Joe |
Smith |
11/01/1911 |
null |
dataset2 |
h |
Joe |
Smith |
11/01/1911 |
null |
dataset1 |
e |
B |
Smith |
null |
222222 |
Example Record Groups: Fewer Key Attributes
Grouping keys: last_name AND tax_id
Group if empty: Cleared for both
Aggregation Functions: top 10 for all non-keys
Group across sources: Cleared
0 datasets excluded
Groups:
Group Size |
Source |
recordId |
first_name |
last_name |
date_of_birth |
tax_id |
---|---|---|---|---|---|---|
2 |
dataset1 |
[a, c] |
Joe |
Smith |
[11/01/1911, 12/02/1922] |
11111 |
1 |
dataset2 |
d |
Betty |
Smith |
03/03/1933 |
222222 |
1 |
dataset1 |
e |
B |
Smith |
null |
222222 |
1 |
dataset2 |
f |
Joe |
Brown |
11/01/1911 |
11111 |
Records left ungrouped:
Source |
recordId |
first_name |
last_name |
date_of_birth |
tax_id |
---|---|---|---|---|---|
dataset2 |
b |
Joe |
Smith |
11/01/1911 |
null |
dataset2 |
h |
Joe |
Smith |
11/01/1911 |
null |
Example Record Groups: Group Across Sources
Grouping keys: last_name AND tax_id
Group if empty: Cleared for both
Aggregation Functions: top 10 for all non-keys
Group across sources: Selected
0 datasets excluded
Groups:
Group Size |
Source |
recordId |
first_name |
last_name |
date_of_birth |
tax_id |
---|---|---|---|---|---|---|
2 |
GroupedSources |
[a, c] |
Joe |
Smith |
[11/01/1911, 12/02/1922] |
11111 |
2 |
GroupedSources |
[d, e] |
[Betty, B] |
Smith |
03/03/1933 |
222222 |
1 |
GroupedSources |
f |
Joe |
Brown |
11/01/1911 |
11111 |
Records left ungrouped:
Source |
recordId |
first_name |
last_name |
date_of_birth |
tax_id |
---|---|---|---|---|---|
dataset2 |
b |
Joe |
Smith |
11/01/1911 |
null |
dataset2 |
h |
Joe |
Smith |
11/01/1911 |
null |
Example Record Groups: Group If Empty
Grouping keys: last_name AND tax_id
Group if empty: Selected for both
Aggregation Functions: top 10 for all non-keys
Group across sources: Selected
0 datasets excluded
Groups:
Group Size |
Source |
recordId |
first_name |
last_name |
date_of_birth |
tax_id |
---|---|---|---|---|---|---|
2 |
GroupedSources |
[a, c] |
Joe |
Smith |
[11/01/1911, 12/02/1922] |
11111 |
2 |
GroupedSources |
[b, h] |
Joe |
Smith |
11/01/1911 |
null |
2 |
GroupedSources |
[d, e] |
[Betty, B] |
Smith |
03/03/1933 |
222222 |
1 |
GroupedSources |
f |
Joe |
Brown |
11/01/1911 |
11111 |
Example Record Groups: Single Key
Grouping keys: last_name
Group if empty: Selected
Aggregation Functions: top 10 for all non-keys
Group across sources: Selected
0 datasets excluded
Groups:
Group Size |
Source |
recordId |
first_name |
last_name |
date_of_birth |
tax_id |
---|---|---|---|---|---|---|
6 |
GroupedSources |
[a, b, c, d ,e, h] |
[Joe, Betty, B] |
Smith |
[11/01/1911, 12/02/1922, 03/03/1933] |
[11111, 222222] |
1 |
GroupedSources |
f |
Joe |
Brown |
11/01/1911 |
11111 |
Updated almost 2 years ago