Group By
GROUP BY
invokes an aggregating function across a set of rows.
GROUP BY
statements in Tamr Core are similar to the GROUP BY
function in SQL. This transformation produces a table with one column per aggregating function or key field.
Note: You can also use MERGE
statements to aggregate rows. MERGE
statements can be syntactically easier to write, although they offer less flexibility than GROUP BY
. See Merge.
Important: Changing a
GROUP BY
statement can result in an unintentional change totamr_id
values in the unified dataset. To avoid data loss and maintain the stability oftamr_id
in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set thetamr_id
to a value you're sure is stable in the last transformation. See Managing Primary Keys. In addition, if you use the aggregating expressioncollect_set
, Tamr also recommends setting theorigin_source_name
. See this example for setting theorigin_source_name
.
Example
This example groups data on individual companies by city, and also finds the number of companies in each city and the total revenue report by all companies in that city.
The "before" data has an empty column for record count (original_record_count
). This is a column that is not present in the source data, but that you want to populate using an aggregating function.
The example that follows combines records that have the same value in the city column to create a summary table. For each column that you want to include in the output dataset, the script must include an aggregating function.
Important: This example code includes two sections. Both sections are required. The second section maintains the
origin_entity_id
andorigin_source_name
system attributes. If you do not include this code block, an error results. You can copy the block below and paste it into your transformation before the 'BY' line.
GROUP
// count each row in the city
count(1) AS original_record_count,
// sum the revenue of each row in the city
sum(annual_revenue) AS annual_revenue,
// find the most common company_type in the city
mode(... company_type) AS company_type,
// maintain the internal columns origin_entity_id and origin_source_name
// collect the set of all unique values in each of these columns and then join together each value with '&' to create a single long string
str_join('&', collect_set(origin_entity_id)) AS origin_entity_id,
str_join('&', collect_set(origin_source_name)) AS origin_source_name
BY city;
The result is a table with one row per unique city value in the original data. The original_record_count
column is now populated with the number of rows of original data that were combined to create this new view.
Notice that the company_name
attribute is empty. The script does not include an aggregating function for company_name
in the GROUP BY
statement and this is the result.
Updated about 2 years ago