GROUP BY statements in Tamr Core are similar to the GROUP BY function in SQL. This transformation produces a table with one column per aggregating function or key field.

Note: You can also use MERGE statements to aggregate rows. MERGE statements can be syntactically easier to write, although they offer less flexibility than GROUP BY. See Merge.

Important: Changing a GROUP BY statement can result in an unintentional change to tamr_id values in the unified dataset. To avoid data loss and maintain the stability of tamr_id in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set the tamr_id to a value you're sure is stable in the last transformation. See Managing Primary Keys. In addition, if you use the aggregating expression collect_set, Tamr also recommends setting the origin_source_name. See this example for setting the origin_source_name.

Example

This example groups data on individual companies by city, and also finds the number of companies in each city and the total revenue report by all companies in that city.

The "before" data has an empty column for record count (original_record_count). This is a column that is not present in the source data, but that you want to populate using an aggregating function.

2432 — The original, pre-transformation dataset with an original_record_count attribute that is 100% null values.

The example that follows combines records that have the same value in the city column to create a summary table. For each column that you want to include in the output dataset, the script must include an aggregating function.

Important: This example code includes two sections. Both sections are required. The second section maintains the origin_entity_id and origin_source_name system attributes. If you do not include this code block, an error results. You can copy the block below and paste it into your transformation before the 'BY' line.

GROUP

// count each row in the city
count(1) AS original_record_count,
// sum the revenue of each row in the city
sum(annual_revenue) AS annual_revenue,
// find the most common company_type in the city
mode(... company_type) AS company_type,

// maintain the internal columns origin_entity_id and origin_source_name
// collect the set of all unique values in each of these columns and then join together each value with '&' to create a single long string
str_join('&', collect_set(origin_entity_id)) AS origin_entity_id,
str_join('&', collect_set(origin_source_name)) AS origin_source_name

BY city;

The result is a table with one row per unique city value in the original data. The original_record_count column is now populated with the number of rows of original data that were combined to create this new view.

2430 — The results of the example `GROUP BY` statement, with the original_record_count showing how many rows were consolidated by city and the annual_revenue giving the sum for those rows.

Notice that the company_name attribute is empty. The script does not include an aggregating function for company_name in the GROUP BY statement and this is the result.