Group By

This function is similar to the GROUP BY function in SQL. This transformation produces a table with one column per aggregating function or key field.

Example

In the following example, we will group data on individual companies by city, finding the number of companies in each city and the total revenue report by all companies in that city.

Our "before" data has an empty column for record count (original_record_count). This is a column that is not present in our source data, but that we want to populate using an aggregation function.

The following transformation example combines records that have the same value in the city column to create a summary table. We must include an aggregation function for every column that we want to be included in the output dataset.

GROUP

// count each row in the city
count(1) AS original_record_count,
// sum the revenue of each row in the city
sum(annual_revenue) AS annual_revenue,
// find the most common company_type in the city
mode(... company_type) AS company_type,

// maintain the internal columns origin_entity_id and origin_source_name
// we collect the set of all unique values in each of these columns and then join together each value with '&' to create a single long string
str_join('&', collect_set(origin_entity_id)) AS origin_entity_id,
str_join('&', collect_set(origin_source_name)) AS origin_source_name

BY city;

The result is a table with one row per unique city value in our original data. The original_record_count column is now populated with the number of rows of original data that were combined to create this new view. Note that the company_name attribute is empty. This occurs because we did not provide an aggregation function for company_name in our GROUP BY transformation.