User GuidesAPI ReferenceRelease Notes
Doc HomeHelp CenterLog In
User Guides

Grouping Obvious Duplicates

Record grouping is an optional stage in the mastering workflow that compares values for certain key attributes to find any obviously-matching records.

Including the record grouping stage in the project workflow can reduce the number of possibly-matching records that the blocking model needs to consider, which can, in turn, make the remaining stages of the mastering workflow more efficient.

Datasets that are known, or suspected, to contain duplicate records, and that include one or more attributes that can reliably be used to identify unique entities, are suitable for record grouping. Examples of reliable key attributes include:

  • General: A government-issued identification number
  • Businesses: D-U-N-S number
  • People: First name, last name, date of birth, and post code

Tamr Core only groups multiple records together if they have identical values for all of the specified key attributes. For example, using the suggested key attributes for grouping people, records would only be grouped together if the values for first name, last name, date of birth, and post code are all identical. Records that have a unique value for one or more of the key attributes are placed into a single member, or “singleton”, group.

300

Records are organized into groups of one or more members.

When you view a group that has multiple records, you see the single (identical) value for each of the grouping keys. For the other attributes, you choose an aggregation function, such as top 10 most frequently occurring, to represent the values from the member records.

1030

Examples of records grouped by DUNS number with aggregated values for another attribute.

You can choose whether records that have an empty value for a key attribute are grouped or left ungrouped, as individual records. See Grouping Records with Empty Keys.

Enabling Record Grouping

Typically, you enable record grouping early in the project development phase. Follow this procedure for:

  • New projects: that is, you have not yet invested in pair training or cluster verification.
  • Projects in the development environment.

To enable record grouping:

  1. In a mastering project, select the Group Records page.
  2. At upper right, slide the Enable Record Grouping toggle to the right. The page updates to display a preview with rows for record groups and columns for the attributes in the unified dataset.

By default:

  • Records are only grouped within input datasets.
  • All attributes included in machine learning are set to be grouping keys.
  • Values for non-machine learning attributes are aggregated into arrays of the top 5 most-frequently occurring values.

See the Record Grouping Examples.

Next step:

Choose the attributes that are the reliable grouping keys for your data and select effective aggregation functions to other attributes. See Choosing a Grouping Key or Aggregation Function.

Additional options:

Choosing a Grouping Key or Aggregation Function

The Group Records page provides the following choices for each of the attributes in the unified dataset: Grouping Key or Aggregation Function.

When you enable record grouping, by default:

  • All of the attributes that are included in machine learning are set to be grouping keys.
  • All of the attributes that are not included in machine learning are assigned a default aggregation function:
  • Values for the spend attribute (if configured) are summed.
  • Values for all other attributes show the top K most frequently occurring values, where K=5.

You can change these settings for each attribute to get the most accurate results for your data.

Important important You must select Save Changes to save the updates that you make. If you navigate away from this page without saving, Tamr Core discards all of your changes.

To update attribute settings:

  1. In a mastering project that has record grouping enabled, select the Group Records page. This page shows a preview of the current record groups in a table at the bottom of the page.
  2. Optionally, use the Enable Preview toggle at top right to temporarily hide the preview section of the page. Each time you make an update the preview refreshes, so if you have a number of updates to make, hiding this section can streamline your work.
  3. Use the table’s horizontal scroll bar to locate the attribute you want to update. Attributes that are included in machine learning appear on the left, and those that are not included are on the right.
    Tip: You can select Hide Estimated Metrics to show more rows in this table, and Show Estimated Metrics to see how the changes you make affect record grouping.
  4. To identify an attribute as a key for finding obviously-matching records, select Grouping Key. Tamr Core updates the preview (if currently shown) and the estimated metrics to show the potential results of this change.
    To exclude an attribute from being used as a key, select Aggregation Function. Tamr Core supplies the function Top K with a K= value of 10, and updates the preview(if currently shown) and the estimated metrics to show the potential results of this change.
  5. To specify a different aggregation function, use the dropdown list. The options are:
  • Top K: You can specify a different value for K. Tamr Core removes empty values before applying this rule.
  • Sum: Select for the spend attribute only.
  • Collect Set: Tamr Core removes empty values before applying this rule.
  • Collect Subset K: You can specify a different value for K. Tamr Core removes empty values before applying this rule.
  • Remove: Replaces all values with null. Not available for attributes that are included in machine learning or the spend attribute.
  • Custom: By default, Tamr Core supplies first and the attribute name. You can supply a transformation with a different aggregation function.
    Tamr Core updates the preview to show the potential results of each change you make.
  1. To review the records that are included in a group, select a value in the Group Size column. A dialog box opens with the data for the group followed by a table with the data for each of its records.
  2. If the preview is currently hidden, use Enable Preview to display this section of the page and review the potential results of your changes.
  3. To save the current settings for grouping keys and aggregation functions, select Save Changes.
  4. To initiate a job that applies the settings, select Update Record Grouping.

Excluding an Input Dataset

You can choose to exclude certain input datasets from record grouping. For example, you can exclude a dataset that is known to be free of duplicate records.

Tips:

  • The setting you choose for Grouping Across Datasets can affect your decision to exclude a dataset.
  • If you exclude an input dataset from record grouping, that dataset is automatically excluded from clustering and vice versa. See Defining the Blocking Model.

When you exclude a dataset, Tamr Core leaves its individual records ungrouped. Aggregation functions are not applied and the records do not appear in the group preview at the bottom of the page.

To exclude a dataset from grouping:

  1. In a mastering project that has record grouping enabled, select the Group Records page.
  2. At top left, select Choose from the Choose which datasets to exclude from record grouping option. A dialog box opens with a list of the input datasets.
  3. Select the checkbox next to the dataset you want to exclude.
  4. Select Save.
  5. Select Save Changes.
  6. To initiate a job that applies the change to the unified dataset, select Update Record Grouping.

Grouping Across Datasets

By default, Tamr Core only compares key values to group duplicates within a given dataset. You can choose to group records that have identical key values regardless of their input datasets. See the Record Grouping Examples.

To group records across input datasets:

  1. In a mastering project that has record grouping enabled, select the Group Records page. This page shows a preview of the current record groups at the bottom of the page.
  2. In the preview, locate the Source attribute on the left.
  3. Select the Group across sources checkbox above the Source attribute’s name. Tamr Core changes the source dataset for all grouped records to “GroupedSources” and updates the preview and estimated metrics to show the potential results of this change.
    To group records only if they are in the same input dataset, clear this checkbox.
  4. Select Save Changes.
  5. To initiate a job that applies the change to the unified dataset, select Update Record Grouping.

Grouping Records with Empty Keys

By default, Tamr Core does not put records that have an empty value for any of the grouping keys into groups. As a result, these records do not appear on the Group Records page. However, for each grouping key, you can choose to group records that have an empty value for that attribute. To be grouped together by this option, records must have identical values for all keys, including the empty value for the specified key. This choice can result in more singleton and multi-record groups. See the Record Grouping Examples.

Empty values include:

  • Nulls
  • Empty strings
  • Strings of only spaces or tab characters
  • Arrays with zero elements
  • Arrays with any number of empty elements

To group records with empty key values:

  1. In a mastering project that has record grouping enabled, select the Group Records page.
  2. Use the preview table’s horizontal scroll bar to locate the grouping key.
  3. To group records that have an empty value for this key, select Group if empty. To leave records that have an empty value for this key ungrouped, clear Group if empty.
  4. Select Save Changes.
  5. To initiate a job that applies the change to the unified dataset, select Update Record Grouping.

Changing the Maximum Preview Size

By default, Tamr Core shows groups of up to 1,000 records and does not include larger groups in the preview. You can change the maximum size to meet your needs.

To change the maximum preview size:

  1. In a mastering project that has record grouping enabled, select the Group Records page.
  2. At top right, select Settings gear icon. A dialog box opens.
  3. From the dropdown list, select the maximum number of records a group can have to be included in the preview. The options are 100, 1,000, or 10,000.
  4. Select Apply. Tamr Core updates the preview to show the potential result of the change.

Record Grouping Examples

A series of examples follows to demonstrate the effects of different record grouping options on a small data sample.

Example Input Records

Source

recordId

first_name

last_name

date_of_birth

tax_id

dataset1

a

Joe

Smith

11/01/1911

11111

dataset1

c

Joe

Smith

12/02/1922

11111

dataset2

b

Joe

Smith

11/01/1911

null

dataset2

h

Joe

Smith

11/01/1911

null

dataset2

d

Betty

Smith

03/03/1933

222222

dataset1

e

B

Smith

null

222222

dataset2

f

Joe

Brown

11/01/1911

11111

Example Record Groups: Default Settings

Grouping keys: first_name AND last_name AND date_of_birth AND tax_id (all machine learning attributes)

Group if empty: Cleared for all

Aggregation Functions: top 10 for recordId

Group across sources: Cleared

0 datasets excluded

Groups:

Group Size

Source

recordId

first_name

last_name

date_of_birth

tax_id

1

dataset1

a

Joe

Smith

11/01/1911

11111

1

dataset1

c

Joe

Smith

12/02/1922

11111

1

dataset2

d

Betty

Smith

03/03/1933

222222

1

dataset2

f

Joe

Brown

11/01/1911

11111

Records left ungrouped:

Source

recordId

first_name

last_name

date_of_birth

tax_id

dataset2

b

Joe

Smith

11/01/1911

null

dataset2

h

Joe

Smith

11/01/1911

null

dataset1

e

B

Smith

null

222222

Example Record Groups: Fewer Key Attributes

Grouping keys: last_name AND tax_id

Group if empty: Cleared for both

Aggregation Functions: top 10 for all non-keys

Group across sources: Cleared

0 datasets excluded

Groups:

Group Size

Source

recordId

first_name

last_name

date_of_birth

tax_id

2

dataset1

[a, c]

Joe

Smith

[11/01/1911, 12/02/1922]

11111

1

dataset2

d

Betty

Smith

03/03/1933

222222

1

dataset1

e

B

Smith

null

222222

1

dataset2

f

Joe

Brown

11/01/1911

11111

Records left ungrouped:

Source

recordId

first_name

last_name

date_of_birth

tax_id

dataset2

b

Joe

Smith

11/01/1911

null

dataset2

h

Joe

Smith

11/01/1911

null

Example Record Groups: Group Across Sources

Grouping keys: last_name AND tax_id

Group if empty: Cleared for both

Aggregation Functions: top 10 for all non-keys

Group across sources: Selected

0 datasets excluded

Groups:

Group Size

Source

recordId

first_name

last_name

date_of_birth

tax_id

2

GroupedSources

[a, c]

Joe

Smith

[11/01/1911, 12/02/1922]

11111

2

GroupedSources

[d, e]

[Betty, B]

Smith

03/03/1933

222222

1

GroupedSources

f

Joe

Brown

11/01/1911

11111

Records left ungrouped:

Source

recordId

first_name

last_name

date_of_birth

tax_id

dataset2

b

Joe

Smith

11/01/1911

null

dataset2

h

Joe

Smith

11/01/1911

null

Example Record Groups: Group If Empty

Grouping keys: last_name AND tax_id

Group if empty: Selected for both

Aggregation Functions: top 10 for all non-keys

Group across sources: Selected

0 datasets excluded

Groups:

Group Size

Source

recordId

first_name

last_name

date_of_birth

tax_id

2

GroupedSources

[a, c]

Joe

Smith

[11/01/1911, 12/02/1922]

11111

2

GroupedSources

[b, h]

Joe

Smith

11/01/1911

null

2

GroupedSources

[d, e]

[Betty, B]

Smith

03/03/1933

222222

1

GroupedSources

f

Joe

Brown

11/01/1911

11111

Example Record Groups: Single Key

Grouping keys: last_name

Group if empty: Selected

Aggregation Functions: top 10 for all non-keys

Group across sources: Selected

0 datasets excluded

Groups:

Group Size

Source

recordId

first_name

last_name

date_of_birth

tax_id

6

GroupedSources

[a, b, c, d ,e, h]

[Joe, Betty, B]

Smith

[11/01/1911, 12/02/1922, 03/03/1933]

[11111, 222222]

1

GroupedSources

f

Joe

Brown

11/01/1911

11111