Merge
MERGE
aggregates a set of rows, and is similar to GROUP BY.
MERGE
statements can be simpler to write than GROUP BY
statements, since you only have to specify the aggregate keys. However, GROUP BY
offers more flexibility in how you can aggregate rows.
MERGE
assumes that all input columns other than the aggregate keys are either scalar or arrays of strings. You can convert non-string columns with theto_string()
function. See Functions.MERGE
results in all values being collapsed into an array of strings, removing both duplicate values and null values. If arrays of strings are merged together, they will become a larger array, only containing unique values, sorted in lexicographic order.
Note: Aggregate keys are not affected by MERGE
and retain their original type.
MERGE and the System-Generated String Attributes
Important: Changing a
MERGE
statement can result in an unintentional change totamr_id
values in the unified dataset. To avoid data loss and maintain the stability oftamr_id
in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set thetamr_id
to a value you're sure is stable in the last transformation. See Managing Primary Keys.
In addition, when you perform a MERGE
it changes the data type of all columns to arrays of strings. This includes the Tamr-generated columns origin_source_name
and origin_entity_id
. (By default, tamr_id
is managed by Tamr.) All three of these columns require a single string value, not an array of strings. Be sure to manually convert origin_source_name
and origin_entity_id
from arrays of strings to strings. The best practice example that follows shows a recommended approach to managing these columns.
MERGE Example: Best Practice
The following example script demonstrates a recommended approach for addressing the requirements of system-generated attributes.
MERGE BY vendor_name, part_number;
SELECT *,
CASE
WHEN size(origin_source_name) == 1
THEN to_string(origin_source_name) ELSE 'Multiple Sources'
END AS origin_source_name,
hash(vendor_name, part_number) AS origin_entity_id;
To maintain information on all datasets that have been used to create a record, the following expression can be used for populating origin_source_name
. This full lineage can grow large and complex when many datasets are input to transformations.
str_join('; ', sort_array(origin_source_name)) AS origin_source_name
Input Data for Example MERGE
vendor_name | part_number | part_description | origin_source_name | origin_entity_id |
---|---|---|---|---|
['Sinks R Us'] | ['00123'] | ['a sink'] | 'dataSourceA' | '1' |
['Sinks R Us'] | ['00123'] | ['a sink'] | 'dataSourceB' | '1' |
['Sinks R Us'] | ['00123'] | ['a big sink'] | 'dataSourceA' | '2' |
['Sinks R Us', 'Parts Inc.'] | ['00123'] | ['a big sink'] | 'dataSourceB' | '2' |
['Nuts & Bolts LLC'] | ['00123'] | ['2cm bolt'] | 'dataSourceA' | '3' |
['Nuts & Bolts LLC'] | ['999543'] | ['2cm bolt'] | 'dataSourceB' | '3' |
['Nuts & Bolts LLC'] | ['999543'] | ['bolt, 2cm'] | 'dataSourceB' | '4' |
Data Transformed by the Example MERGE
vendor_name | part_number | part_description | origin_source_name | origin_entity_id |
---|---|---|---|---|
['Sinks R Us'] | ['00123'] | ['a sink', 'a big sink'] | 'Multiple Sources' | '5431533154057331592' |
['Sinks R Us', 'Parts Inc.'] | ['00123'] | ['a big sink'] | 'dataSourceB' | '2664298903961142759' |
['Nuts & Bolts LLC'] | ['00123'] | ['2cm bolt'] | 'dataSourceA' | '-644995629713466897' |
['Nuts & Bolts LLC'] | ['999543'] | ['2cm bolt', 'bolt, 2cm'] | 'dataSourceB' | '-246476673599077583' |
Template for MERGE Best Practice
To make writing a script with a MERGE
more efficient, you can copy and paste this template. Be sure to replace the placeholder names <myUAn>
with your own unified attributes.
MERGE BY <myUA1>, <myUA2>, ..., <myUAn>;
SELECT *,
CASE
WHEN size(origin_source_name) == 1
THEN to_string(origin_source_name) ELSE 'Multiple Sources'
END AS origin_source_name,
hash(<myUA1>, <myUA2>, ..., <myUAn>) AS origin_entity_id;
Merge Scoping
The transformation scope determines how values are considered for aggregation. If you perform a MERGE
on specific input datasets, the aggregation is applied across the rows for each input dataset. For example, if you perform a MERGE
and set it to apply to dataset1
and dataset2
, Tamr Core aggregates the rows for dataset1
and then separately aggregates the rows for dataset2
. In contrast, if you perform a MERGE
on the unified dataset, then Tamr Core aggregates all rows in the unified dataset across all input datasets, as one job. This is because transformations applied on the unified dataset occur after all of your input datasets have been unified.
Merging a Subset of Records
Often, you must merge records by a few columns except for when those columns are empty. To specify exceptions to what records are merged, use a case
expression before the MERGE BY
.
This example creates a column, merge_key
, with a unique value for each group of records being merged. The hash()
function creates a unique value for every unique pair of values in company_division
and company_name
. These are the records to merge. For any records that you do not want to merge, you can use the value of tamr_id
which is unique for each record. See Managing Primary Keys for more information about tamr_id
.
The following script merges records that have the same values for company_division
and company_name
but does not merge any records that have an empty value for either company_division
or company_name
.
SELECT *,
CASE WHEN company_division IS NOT EMPTY AND company_name IS NOT EMPTY
THEN hash(company_division, company_name)
ELSE tamr_id
END
AS merge_key;
// Merge records with the same values and skip records with empty values.
MERGE BY merge_key;
Updated about 3 years ago