MERGE statements can be simpler to write than
GROUP BY statements, since you only have to specify the aggregate keys. However,
GROUP BY offers more flexibility in how you can aggregate rows.
MERGEassumes that all input columns other than the aggregate keys are either scalar or arrays of strings. You can convert non-string columns with the
to_string()function. See Functions.
MERGEresults in all values being collapsed into an array of strings, removing both duplicate values and null values. If arrays of strings are merged together, they will become a larger array, only containing unique values, sorted in lexicographic order.
Note: Aggregate keys are not affected by
MERGE and retain their original type.
Important: Changing a
MERGEstatement can result in an unintentional change to
tamr_idvalues in the unified dataset. To avoid data loss and maintain the stability of
tamr_idin mastering and categorization projects, as a best practice Tamr recommends that you explicitly set the
tamr_idto a value you're sure is stable in the last transformation. See Managing Primary Keys.
In addition, when you perform a
MERGE it changes the data type of all columns to arrays of strings. This includes the Tamr-generated columns
origin_entity_id. (By default,
tamr_id is managed by Tamr.) All three of these columns require a single string value, not an array of strings. Be sure to manually convert
origin_entity_id from arrays of strings to strings. The best practice example that follows shows a recommended approach to managing these columns.
The following example script demonstrates a recommended approach for addressing the requirements of system-generated attributes.
MERGE BY vendor_name, part_number; SELECT *, CASE WHEN size(origin_source_name) == 1 THEN to_string(origin_source_name) ELSE 'Multiple Sources' END AS origin_source_name, hash(vendor_name, part_number) AS origin_entity_id;
To maintain information on all datasets that have been used to create a record, the following expression can be used for populating
origin_source_name. This full lineage can grow large and complex when many datasets are input to transformations.
str_join('; ', sort_array(origin_source_name)) AS origin_source_name
|['Sinks R Us']||['00123']||['a sink']||'dataSourceA'||'1'|
|['Sinks R Us']||['00123']||['a sink']||'dataSourceB'||'1'|
|['Sinks R Us']||['00123']||['a big sink']||'dataSourceA'||'2'|
|['Sinks R Us', 'Parts Inc.']||['00123']||['a big sink']||'dataSourceB'||'2'|
|['Nuts & Bolts LLC']||['00123']||['2cm bolt']||'dataSourceA'||'3'|
|['Nuts & Bolts LLC']||['999543']||['2cm bolt']||'dataSourceB'||'3'|
|['Nuts & Bolts LLC']||['999543']||['bolt, 2cm']||'dataSourceB'||'4'|
|['Sinks R Us']||['00123']||['a sink', 'a big sink']||'Multiple Sources'||'5431533154057331592'|
|['Sinks R Us', 'Parts Inc.']||['00123']||['a big sink']||'dataSourceB'||'2664298903961142759'|
|['Nuts & Bolts LLC']||['00123']||['2cm bolt']||'dataSourceA'||'-644995629713466897'|
|['Nuts & Bolts LLC']||['999543']||['2cm bolt', 'bolt, 2cm']||'dataSourceB'||'-246476673599077583'|
To make writing a script with a
MERGE more efficient, you can copy and paste this template. Be sure to replace the placeholder names
<myUAn> with your own unified attributes.
MERGE BY <myUA1>, <myUA2>, ..., <myUAn>; SELECT *, CASE WHEN size(origin_source_name) == 1 THEN to_string(origin_source_name) ELSE 'Multiple Sources' END AS origin_source_name, hash(<myUA1>, <myUA2>, ..., <myUAn>) AS origin_entity_id;
The transformation scope determines how values are considered for aggregation. If you perform a
MERGE on specific input datasets, the aggregation is applied across the rows for each input dataset. For example, if you perform a
MERGE and set it to apply to
dataset2, Tamr Core aggregates the rows for
dataset1 and then separately aggregates the rows for
dataset2. In contrast, if you perform a
MERGE on the unified dataset, then Tamr Core aggregates all rows in the unified dataset across all input datasets, as one job. This is because transformations applied on the unified dataset occur after all of your input datasets have been unified.
Often, you must merge records by a few columns except for when those columns are empty. To specify exceptions to what records are merged, use a
case expression before the
This example creates a column,
merge_key, with a unique value for each group of records being merged. The
hash() function creates a unique value for every unique pair of values in
company_name. These are the records to merge. For any records that you do not want to merge, you can use the value of
tamr_id which is unique for each record. See Managing Primary Keys for more information about
The following script merges records that have the same values for
company_name but does not merge any records that have an empty value for either
SELECT *, CASE WHEN company_division IS NOT EMPTY AND company_name IS NOT EMPTY THEN hash(company_division, company_name) ELSE tamr_id END AS merge_key; // Merge records with the same values and skip records with empty values. MERGE BY merge_key;
Updated about 2 years ago