User GuidesAPI ReferenceRelease Notes
Doc HomeHelp CenterLog In

Merge

MERGE aggregates a set of rows, and is similar to GROUP BY.

MERGE statements can be simpler to write than GROUP BY statements, since you only have to specify the aggregate keys. However, GROUP BY offers more flexibility in how you can aggregate rows.

  • MERGE assumes that all input columns other than the aggregate keys are either scalar or arrays of strings. You can convert non-string columns with the to_string() function. See Functions.
  • MERGE results in all values being collapsed into an array of strings, removing both duplicate values and null values. If arrays of strings are merged together, they will become a larger array, only containing unique values, sorted in lexicographic order.

Note: Aggregate keys are not affected by MERGE and retain their original type.

MERGE and the System-Generated String Attributes

important Important: Changing a MERGE statement can result in an unintentional change to tamr_id values in the unified dataset. To avoid data loss and maintain the stability of tamr_id in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set the tamr_id to a value you're sure is stable in the last transformation. See Managing Primary Keys.

In addition, when you perform a MERGE it changes the data type of all columns to arrays of strings. This includes the Tamr-generated columns origin_source_name and origin_entity_id. (By default, tamr_id is managed by Tamr.) All three of these columns require a single string value, not an array of strings. Be sure to manually convert origin_source_name and origin_entity_id from arrays of strings to strings. The best practice example that follows shows a recommended approach to managing these columns.

MERGE Example: Best Practice

The following example script demonstrates a recommended approach for addressing the requirements of system-generated attributes.

MERGE BY vendor_name, part_number; 
SELECT *, 
CASE 
    WHEN size(origin_source_name) == 1 
    THEN to_string(origin_source_name) ELSE 'Multiple Sources' 
END AS origin_source_name, 
hash(vendor_name, part_number) AS origin_entity_id;

To maintain information on all datasets that have been used to create a record, the following expression can be used for populating origin_source_name. This full lineage can grow large and complex when many datasets are input to transformations.

str_join('; ', sort_array(origin_source_name)) AS origin_source_name

Input Data for Example MERGE

vendor_namepart_numberpart_descriptionorigin_source_nameorigin_entity_id
['Sinks R Us']['00123']['a sink']'dataSourceA''1'
['Sinks R Us']['00123']['a sink']'dataSourceB''1'
['Sinks R Us']['00123']['a big sink']'dataSourceA''2'
['Sinks R Us', 'Parts Inc.']['00123']['a big sink']'dataSourceB''2'
['Nuts & Bolts LLC']['00123']['2cm bolt']'dataSourceA''3'
['Nuts & Bolts LLC']['999543']['2cm bolt']'dataSourceB''3'
['Nuts & Bolts LLC']['999543']['bolt, 2cm']'dataSourceB''4'

Data Transformed by the Example MERGE

vendor_namepart_numberpart_descriptionorigin_source_nameorigin_entity_id
['Sinks R Us']['00123']['a sink', 'a big sink']'Multiple Sources''5431533154057331592'
['Sinks R Us', 'Parts Inc.']['00123']['a big sink']'dataSourceB''2664298903961142759'
['Nuts & Bolts LLC']['00123']['2cm bolt']'dataSourceA''-644995629713466897'
['Nuts & Bolts LLC']['999543']['2cm bolt', 'bolt, 2cm']'dataSourceB''-246476673599077583'

Template for MERGE Best Practice

To make writing a script with a MERGE more efficient, you can copy and paste this template. Be sure to replace the placeholder names <myUAn> with your own unified attributes.

MERGE BY <myUA1>, <myUA2>, ..., <myUAn>; 
SELECT *, 
CASE 
    WHEN size(origin_source_name) == 1 
    THEN to_string(origin_source_name) ELSE 'Multiple Sources' 
END AS origin_source_name, 
hash(<myUA1>, <myUA2>, ..., <myUAn>) AS origin_entity_id;

Merge Scoping

The transformation scope determines how values are considered for aggregation. If you perform a MERGE on specific input datasets, the aggregation is applied across the rows for each input dataset. For example, if you perform a MERGE and set it to apply to dataset1 and dataset2, Tamr Core aggregates the rows for dataset1 and then separately aggregates the rows for dataset2. In contrast, if you perform a MERGE on the unified dataset, then Tamr Core aggregates all rows in the unified dataset across all input datasets, as one job. This is because transformations applied on the unified dataset occur after all of your input datasets have been unified.

Merging a Subset of Records

Often, you must merge records by a few columns except for when those columns are empty. To specify exceptions to what records are merged, use a case expression before the MERGE BY.

This example creates a column, merge_key, with a unique value for each group of records being merged. The hash() function creates a unique value for every unique pair of values in company_division and company_name. These are the records to merge. For any records that you do not want to merge, you can use the value of tamr_id which is unique for each record. See Managing Primary Keys for more information about tamr_id.

The following script merges records that have the same values for company_division and company_name but does not merge any records that have an empty value for either company_division or company_name.

SELECT *,
CASE WHEN company_division IS NOT EMPTY AND company_name IS NOT EMPTY
   THEN hash(company_division, company_name)
   ELSE tamr_id
END
AS merge_key;

// Merge records with the same values and skip records with empty values.

MERGE BY merge_key;