GROUP BY statements in Tamr Core are similar to the
GROUP BY function in SQL. This transformation produces a table with one column per aggregating function or key field.
Note: You can also use
MERGE statements to aggregate rows.
MERGE statements can be syntactically easier to write, although they offer less flexibility than
GROUP BY. See Merge.
Important: Changing a
GROUP BYstatement can result in an unintentional change to
tamr_idvalues in the unified dataset. To avoid data loss and maintain the stability of
tamr_idin mastering and categorization projects, as a best practice Tamr recommends that you explicitly set the
tamr_idto a value you're sure is stable in the last transformation. See Managing Primary Keys. In addition, if you use the aggregating expression
collect_set, Tamr also recommends setting the
origin_source_name. See this example for setting the
This example groups data on individual companies by city, and also finds the number of companies in each city and the total revenue report by all companies in that city.
The "before" data has an empty column for record count (
original_record_count). This is a column that is not present in the source data, but that you want to populate using an aggregating function.
The example that follows combines records that have the same value in the city column to create a summary table. For each column that you want to include in the output dataset, the script must include an aggregating function.
Important: This example code includes two sections. Both sections are required. The second section maintains the
origin_source_namesystem attributes. If you do not include this code block, an error results. You can copy the block below and paste it into your transformation before the 'BY' line.
// count each row in the city
count(1) AS original_record_count,
// sum the revenue of each row in the city
sum(annual_revenue) AS annual_revenue,
// find the most common company_type in the city
mode(... company_type) AS company_type,
// maintain the internal columns origin_entity_id and origin_source_name
// collect the set of all unique values in each of these columns and then join together each value with '&' to create a single long string
str_join('&', collect_set(origin_entity_id)) AS origin_entity_id,
str_join('&', collect_set(origin_source_name)) AS origin_source_name
The result is a table with one row per unique city value in the original data. The
original_record_count column is now populated with the number of rows of original data that were combined to create this new view.
Notice that the
company_name attribute is empty. The script does not include an aggregating function for
company_name in the
GROUP BY statement and this is the result.
Updated over 1 year ago