Merge
Overview
MERGE
allows aggregation of a set of rows, similar to GROUP BY. MERGE
is simpler for you to write, since you only have to specify the aggregate keys. GROUP BY
offers more flexibility in how you'd like to aggregate rows.
MERGE
assumes that all input columns other than the aggregate keys are either scalar or arrays of strings. You can convert non-string columns with the to_string()
. See Tamr-Defined Functions.
In a MERGE
, all values will be collapsed into an array of strings, removing both duplicate values and null values. If arrays of strings are merged together, they will become a larger array, only containing unique values, sorted in lexicographic order.
Example | Template |
---|---|
MERGE BY vendor_name, part_number; SELECT *, CASE WHEN size(origin_source_name) == 1 THEN to_string(origin_source_name) ELSE 'Multiple Sources' END AS origin_source_name, concat('vendor_name: ', str_join('|', vendor_name), '; part_number: ', str_join('|', part_number)) AS origin_entity_id; | MERGE BY , , ..., <myUA_n>; SELECT *, CASE WHEN size(origin_source_name) == 1 THEN to_string(origin_source_name) ELSE 'Multiple Sources' END AS origin_source_name, concat(': ', str_join('|', ), '; : ', str_join('|', ), ....., '; <myUA_n>: ', str_join('|', <myUA_n>)) AS origin_entity_id; |
Treat Tamr special columns with care
When you perform a
MERGE
, it will put all fields into an array of strings. This includes the Tamr-generated columnsorigin_source_name
andorigin_entity_id
. By default,tamr_id
is managed by Tamr. These columns require a single string value, not an array of strings. Manually convertorigin_source_name
andorigin_entity_id
from arrays of strings to strings. The previous example shows one possible way to manage these columns.Note that aggregate keys are not affected by
MERGE
and retain their original type.
Merge Scoping
The transformation scope determines how values are considered for aggregation. If you perform a MERGE
on specific input datasets, the aggregation is applied across the rows for each input dataset. For example, if you perform a MERGE
and set it to apply to dataset1 and dataset2, the rows for dataset1 will be aggregated, and the rows for dataset2 will be aggregated separately. If you perform a MERGE
on the unified dataset, then all rows in the unified dataset, across all of the input datasets, will be aggregated. This is because transformations applied on the unified dataset occur after all of your input datasets have been unified.
Merging a Subset of Records
Often, you must merge records by a few columns except for when those columns are empty. To specify exceptions to what records are merged, use a CASE
statement before a MERGE BY
, as in the following example.
In this example, we create a new column, merge_key
, that has a unique value for each group of records we want to merge. The hash()
function creates a unique value for every unique pair of values in company_division
and company_name
. These are records we wish to merge.
The Tamr function monotonically_increasing_id()
creates a unique value for each record (row). These are records we do not wish to merge. See Monotonically_increasing_id().
Next, we merge records that have the same values for company_division
and company_name
but will not merge any records that have an empty value for company_division
or company_name
.
SELECT *,
CASE WHEN company_division IS NOT EMPTY AND company_name IS NOT EMPTY
THEN hash(company_division, company_name)
ELSE monotonically_increasing_id()
END
AS merge_key;
// Merge records with the same values and skip records with empty values.
MERGE BY merge_key;
Updated over 5 years ago