User GuidesAPI ReferenceRelease Notes
Doc HomeHelp CenterLog In
User Guides

Merge

MERGE aggregates a set of rows, similar to GROUP BY.

MERGE is simpler for you to write, since you only have to specify the aggregate keys. GROUP BY offers more flexibility in how you'd like to aggregate rows.

MERGE assumes that all input columns other than the aggregate keys are either scalar or arrays of strings. You can convert non-string columns with the to_string(). See Tamr-Defined Functions.

In a MERGE, all values will be collapsed into an array of strings, removing both duplicate values and null values. If arrays of strings are merged together, they will become a larger array, only containing unique values, sorted in lexicographic order.

ExampleTemplate
MERGE BY vendor_name, part_number; SELECT *, CASE WHEN size(origin_source_name) == 1 THEN to_string(origin_source_name) ELSE 'Multiple Sources' END AS origin_source_name, concat('vendor_name: ', str_join('|', vendor_name), '; part_number: ', str_join('|', part_number)) AS origin_entity_id;MERGE BY <myUA1>, <myUA2>, ..., <myUA_n>; SELECT *, CASE WHEN size(origin_source_name) == 1 THEN to_string(origin_source_name) ELSE 'Multiple Sources' END AS origin_source_name, concat('<myUA1>: ', str_join('|', <myUA1>), '; <myUA2>: ', str_join('|', <myUA2>), ....., '; <myUA_n>: ', str_join('|', <myUA_n>)) AS origin_entity_id;

❗️

Treat Tamr special columns with care

When you perform a MERGE, it will put all fields into an array of strings. This includes the Tamr-generated columns origin_source_name and origin_entity_id. By default, tamr_id is managed by Tamr. These columns require a single string value, not an array of strings. Manually convert origin_source_name and origin_entity_id from arrays of strings to strings. The previous example shows one possible way to manage these columns.

Note that aggregate keys are not affected by MERGE and retain their original type.

Merge Scoping

The transformation scope determines how values are considered for aggregation. If you perform a MERGE on specific input datasets, the aggregation is applied across the rows for each input dataset. For example, if you perform a MERGE and set it to apply to dataset1 and dataset2, the rows for dataset1 will be aggregated, and the rows for dataset2 will be aggregated separately. If you perform a MERGE on the unified dataset, then all rows in the unified dataset, across all of the input datasets, will be aggregated. This is because transformations applied on the unified dataset occur after all of your input datasets have been unified.

Merging a Subset of Records

Often, you must merge records by a few columns except for when those columns are empty. To specify exceptions to what records are merged, use a case expression before a MERGE BY, as in the following example.

In this example, we create a new column, merge_key, that has a unique value for each group of records we want to merge. The hash() function creates a unique value for every unique pair of values in company_division and company_name. These are records we wish to merge. For the records that we do not want to merge, we can use the value of tamr_id which is unique for each record. See primary key management for more information on tamr_id.

Next, we merge records that have the same values for company_division and company_name but will not merge any records that have an empty value for company_division or company_name.

SELECT *,
CASE WHEN company_division IS NOT EMPTY AND company_name IS NOT EMPTY
   THEN hash(company_division, company_name)
   ELSE tamr_id
END
AS merge_key;

// Merge records with the same values and skip records with empty values.

MERGE BY merge_key;