Aggregate Functions
Aggregate functions perform a calculation on a set of values, and return a single value.
DENSE_RANK
Assigns to each row its position according to the ORDER BY expression. Unlike rank()
, dense_rank()
values increase sequentially, one-by-one, regardless of ties.
SYNTAX
dense_rank()
OUTPUT
int
EXAMPLE
WINDOW DENSE_RANK() AS eventRank BY region ORDER BY date
region | date | event | → | region | event | date | eventRank |
---|---|---|---|---|---|---|---|
'Bay Area' | '2010-10-15' | 'light fog' | 'Bay Area' | 'light fog' | '2010-10-15' | 1 | |
'Bay Area' | '2010-10-15' | null | 'Bay Area' | null | '2010-10-15' | 1 | |
'Bay Area' | '2010-10-16' | 'strong winds' | 'Bay Area' | 'strong winds' | '2010-10-16' | 2 |
LAG
Computes an expression on a row that is behind the current one in the sort order.
SYNTAX
lag(input, offset)
input
- Type: any
- Description: The expression to compute on a previous row
offset
- Type: int
- Description: The number of rows previous to lag. Must be a positive integer constant.
OUTPUT
any
EXAMPLE
WINDOW event, LAG(event, 1) AS previousEvent BY region ORDER BY date
region | date | event | → | region | previousEvent | event | date |
---|---|---|---|---|---|---|---|
'Bay Area' | '2010-10-15' | 'light fog' | 'Bay Area' | null | 'light fog' | '2010-10-15' | |
'Bay Area' | '2010-10-15' | null | 'Bay Area' | 'light fog' | null | '2010-10-15' | |
'Bay Area' | '2010-10-16' | 'strong winds' | 'Bay Area' | null | 'strong winds' | '2010-10-16' |
LEAD
Computes an expression on a row that is ahead of the current one in the sort order.
SYNTAX
lead(input, offset)
input
- Type: any
- Description: The expression to compute on an upcoming row
offset
- Type: int
- Description: The number of rows following to skip. Must be a positive integer constant.
OUTPUT
any
EXAMPLE
WINDOW event, LEAD(event, 1) AS nextEvent BY region ORDER BY date
region | date | event | → | region | nextEvent | event | date |
---|---|---|---|---|---|---|---|
'Bay Area' | '2010-10-15' | 'light fog' | 'Bay Area' | null | 'light fog' | '2010-10-15' | |
'Bay Area' | '2010-10-15' | null | 'Bay Area' | 'strong winds' | null | '2010-10-15' | |
'Bay Area' | '2010-10-16' | 'strong winds' | 'Bay Area' | null | 'strong winds' | '2010-10-16' |
RANK
Assigns to each row its position in the partition according to the ORDER BY expression.
SYNTAX
rank()
OUTPUT
int
EXAMPLE
WINDOW RANK() AS eventRank BY region ORDER BY date
region | date | event | → | region | event | date | eventRank |
---|---|---|---|---|---|---|---|
'Bay Area' | '2010-10-15' | 'light fog' | 'Bay Area' | 'light fog' | '2010-10-15' | 1 | |
'Bay Area' | '2010-10-15' | null | 'Bay Area' | null | '2010-10-15' | 1 | |
'Bay Area' | '2010-10-16' | 'strong winds' | 'Bay Area' | 'strong winds' | '2010-10-16' | 3 |
ROW_NUMBER
Assigns a unique sequential ID number to each row in a partition, starting with 1.
SYNTAX
row_number()
OUTPUT
int
EXAMPLE
WINDOW ROW_NUMBER() AS eventId BY region ORDER BY date
region | date | event | → | region | eventId | event | date |
---|---|---|---|---|---|---|---|
'Bay Area' | '2010-10-15' | 'light fog' | 'Bay Area' | 1 | 'light fog' | '2010-10-15' | |
'Bay Area' | '2010-10-15' | null | 'Bay Area' | 2 | null | '2010-10-15' | |
'Bay Area' | '2010-10-16' | 'strong winds' | 'Bay Area' | 3 | 'strong winds' | '2010-10-16' |
MIN
Computes the least value of an attribute or expression in a group.Tip: To find the function that best meets your needs, review similar function LEAST.
SYNTAX
min(input)
input
- Type: any
- Description: The attributes to compute the least value over
OUTPUT
any
EXAMPLE
GROUP min(age) AS result BY name
name | age |
---|---|
'susan' | 27 |
'susan' | 18 |
'susan' | 35 |
'bob' | 28 |
'bob' | null |
'bob' | 55 |
→
result | name |
---|---|
18 | 'susan' |
28 | 'bob' |
MIN_SIZE
Computes the minimum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.
SYNTAX
min_size(input)
input
- Type: any
- Description: The attributes to compute the minimum value over
OUTPUT
any
EXAMPLES
GROUP min_size(name) AS result BY age
age | name |
---|---|
27 | 'susan' |
27 | 'samuel' |
27 | 'sam' |
54 | 'chris' |
54 | 'bob' |
→
result | age |
---|---|
'sam' | 27 |
'bob' | 54 |
GROUP min_size(cities) AS result BY state
state | cities |
---|---|
'NY' | ['New York', 'Albany'] |
'WA' | ['Seattle', null] |
'WA' | ['Spokane', 'Tacoma'] |
'CA' | ['Los Angeles', 'Pleasanton'] |
'CA' | ['San Franscisco', 'Berkeley', 'San Diego'] |
→
result | state |
---|---|
['New York', 'Albany'] | 'NY' |
['Seattle', null] | 'WA' |
['Los Angeles', 'Pleasanton'] | 'CA' |
GROUP min_size(special_case) AS result BY some_value
some_value | special_case |
---|---|
'case-1' | ['', '', '', 'one'] |
'case-1' | ['o', 'n', 'e'] |
'case-2' | ['one', '', '', ''] |
'case-2' | ['o', 'n', 'e'] |
→
result | some_value |
---|---|
['', '', '', 'one'] | 'case-1' |
['o', 'n', 'e'] | 'case-2' |
GROUP min_size(special_case_2) AS result BY some_value
some_value | special_case_2 |
---|---|
'case-1' | ['one', '', null] |
'case-1' | ['o', 'n', 'e'] |
'case-1' | ['o', 'n', 'e', null] |
→
result | some_value |
---|---|
['one', '', null] | 'case-1' |
MAX
Computes the greatest value of an attribute or expression in a group.Tip: To find the function that best meets your needs, review similar function GREATEST.
SYNTAX
max(input)
input
- Type: any
- Description: The attributes to compute the greatest value over
OUTPUT
any
EXAMPLE
GROUP max(age) AS result BY name
name | age |
---|---|
'susan' | 27 |
'susan' | 18 |
'susan' | 35 |
'bob' | 28 |
'bob' | null |
'bob' | 55 |
→
result | name |
---|---|
35 | 'susan' |
55 | 'bob' |
MAX_SIZE
Computes the maximum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.
SYNTAX
max_size(input)
input
- Type: any
- Description: The attributes to compute the maximum value over
OUTPUT
any
EXAMPLES
GROUP max_size(name) AS result BY age
age | name |
---|---|
27 | 'susan' |
27 | 'samuel' |
27 | 'sam' |
54 | 'chris' |
54 | 'bob' |
→
result | age |
---|---|
'samuel' | 27 |
'chris' | 54 |
GROUP max_size(cities) AS result BY state
state | cities |
---|---|
'NY' | ['New York', 'Albany'] |
'WA' | ['Seattle', null] |
'WA' | ['Spokane', 'Tacoma'] |
'CA' | ['Los Angeles', 'Pleasanton'] |
'CA' | ['San Franscisco', 'Berkeley', 'San Diego'] |
→
result | state |
---|---|
['New York', 'Albany'] | 'NY' |
['Spokane', 'Tacoma'] | 'WA' |
['San Franscisco', 'Berkeley', 'San Diego'] | 'CA' |
GROUP max_size(special_case) AS result BY some_value
some_value | special_case |
---|---|
'case-1' | ['', '', '', 'one'] |
'case-1' | ['o', 'n', 'e'] |
'case-2' | ['one', '', '', ''] |
'case-2' | ['o', 'n', 'e'] |
→
result | some_value |
---|---|
['o', 'n', 'e'] | 'case-1' |
['one', '', '', ''] | 'case-2' |
GROUP max_size(special_case_2) AS result BY some_value
some_value | special_case_2 |
---|---|
'case-1' | ['one', '', null] |
'case-1' | ['o', 'n', 'e'] |
'case-1' | ['o', 'n', 'e', null] |
→
result | some_value |
---|---|
['o', 'n', 'e', null] | 'case-1' |
COUNT
Counts the number of non-null items in a group.
SYNTAX
count(input)
input
- Type: any
- Description: The attributes whose items to count
OUTPUT
long
EXAMPLE
GROUP count(event) AS num_events BY region
region | date | event |
---|---|---|
'New York' | '2010-10-15' | 'light showers' |
'New York' | '2010-10-16' | 'strong winds' |
'New York' | '2010-10-16' | 'thundershowers' |
'Bay Area' | '2010-10-15' | 'light fog' |
'Bay Area' | '2010-10-15' | null |
'Bay Area' | '2010-10-16' | 'strong winds' |
→
region | num_events |
---|---|
'Bay Area' | 2 |
'New York' | 3 |
COUNT_DISTINCT
Counts the number of distinct non-null items in a group.
SYNTAX
count_distinct(input)
input
- Type: any
- Description: The attributes whose items to count
OUTPUT
long
EXAMPLE
GROUP count_distinct(event) AS num_events BY region
region | date | event |
---|---|---|
'New York' | '2010-10-15' | 'light showers' |
'New York' | '2010-10-15' | 'light showers' |
'New York' | '2010-10-15' | 'light showers' |
'New York' | '2010-10-16' | 'strong winds' |
'New York' | '2010-10-16' | 'thundershowers' |
'Bay Area' | '2010-10-15' | 'light fog' |
'Bay Area' | '2010-10-15' | 'light fog' |
'Bay Area' | '2010-10-15' | null |
'Bay Area' | '2010-10-15' | null |
'Bay Area' | '2010-10-16' | 'strong winds' |
→
region | num_events |
---|---|
'Bay Area' | 2 |
'New York' | 3 |
SUM
Computes the sum of all non-null values of an attribute in a group.Tip: To find the function that best meets your needs, review similar function ARRAY.SUM.
SYNTAX
sum(input)
input
- Type: number
- Description: The attribute whose values to sum
OUTPUT
number
EXAMPLE
GROUP sum(age) AS result BY name
name | age |
---|---|
'susan' | 27 |
'susan' | 18 |
'susan' | 35 |
'bob' | 28 |
'bob' | null |
'bob' | 55 |
→
result | name |
---|---|
80 | 'susan' |
83 | 'bob' |
AVG
Computes the average of all non-null values of an attribute in a group.
SYNTAX
avg(input)
input
- Type: number
- Description: The attributes whose values are averaged
OUTPUT
double
EXAMPLE
GROUP avg(age) AS result BY name
name | age |
---|---|
'susan' | 27 |
'susan' | 18 |
'susan' | 35 |
'bob' | 28 |
'bob' | null |
'bob' | 55 |
'alice' | null |
→
result | name |
---|---|
26.666666666666668 | 'susan' |
41.5 | 'bob' |
null | 'alice' |
MEAN
Computes the mean of all non-null values of an attribute in a group. This is a synonym of the AVG function.
SYNTAX
mean(input)
input
- Type: number
- Description: The attributes over which the mean is calculated
OUTPUT
double
EXAMPLE
GROUP mean(age) AS result BY name
name | age |
---|---|
'susan' | 27 |
'susan' | 18 |
'susan' | 35 |
'bob' | 28 |
'bob' | null |
'bob' | 55 |
→
result | name |
---|---|
26.666666666666668 | 'susan' |
41.5 | 'bob' |
MODE
Computes the mode (most frequent value) of an attribute in a group. Ignores nulls. In the case of a tie, the least value will be returned.Tip: To find the function that best meets your needs, review these similar functions: TOP, ARRAY.MOST_FREQUENT.
SYNTAX
mode(input)
input
- Type: any
- Description: The attributes on which to compute the mode
OUTPUT
any
EXAMPLE
GROUP mode(age) AS result BY name
name | age |
---|---|
'susan' | 27 |
'susan' | 18 |
'susan' | 27 |
'susan' | null |
'susan' | null |
'susan' | null |
'bob' | 28 |
'bob' | 55 |
'bob' | 55 |
'steve' | null |
'stacy' | 38 |
'stacy' | 36 |
'stacy' | 37 |
→
result | name |
---|---|
27 | 'susan' |
55 | 'bob' |
null | 'steve' |
36 | 'stacy' |
FIRST
Returns the first value in a attribute for each partition defined in an aggregation.Tip: To find the function that best meets your needs, review similar function ARRAY.FIRST.
SYNTAX
first(data)
data
- Type: any
- Description: The attributes from which we want to return the first value.
OUTPUT
any
EXAMPLE
GROUP first(A) AS new_A BY B
A | B |
---|---|
['jim', 'pam'] | '1' |
[] | '1' |
['toby'] | '1' |
→
B | new_A |
---|---|
'1' | ['jim', 'pam'] |
LAST
Returns the last value in a attribute for each partition defined in an aggregation.Tip: To find the function that best meets your needs, review similar function ARRAY.LAST.
SYNTAX
last(data)
data
- Type: any
- Description: The attributes from which we want to return the last value.
OUTPUT
any
EXAMPLE
GROUP last(A) AS new_A BY B
A | B |
---|---|
['jim', 'pam'] | '1' |
[] | '1' |
['toby'] | '1' |
→
B | new_A |
---|---|
'1' | ['toby'] |
TOP
Computes K most frequently occurring distinct values per group. TOP breaks ties in value frequency by selecting an arbitrary value to sort first. Choose TOP if some values are expected to consistently occur more frequently than others. If there are many ties in value frequency then small changes in data can yield significant changes in which values TOP collects. Tip: To find the function that best meets your needs, review these similar functions: COLLECT_SUBSET, MODE, ARRAY.MOST_FREQUENT.
SYNTAX
top(K, input)
K
- Type: int
- Description: The number of values to return per group. Must be a positive integer.
input
- Type: any
- Description: The attribute on which to compute TOP K values.
OUTPUT
{'type': 'array', 'elementType': 'any'}
EXAMPLES
GROUP top(2, age) AS result BY name
name | age |
---|---|
'susan' | 27 |
'susan' | 18 |
'susan' | 27 |
'susan' | 18 |
'susan' | 70 |
'susan' | null |
'susan' | null |
'susan' | null |
'bob' | 28 |
'bob' | 28 |
'steve' | null |
→
result | name |
---|---|
[27, 18] | 'susan' |
[28] | 'bob' |
[] | 'steve' |
GROUP top(2, ...age) AS result BY name
name | age |
---|---|
'susan' | [27, 42] |
'susan' | [70] |
'susan' | [27, 70] |
'susan' | [18, 30, 27] |
'susan' | [70] |
'susan' | [] |
'susan' | [null] |
'susan' | null |
'bob' | [28] |
'bob' | [28] |
'steve' | null |
→
result | name |
---|---|
[27, 70] | 'susan' |
[28] | 'bob' |
[] | 'steve' |
HISTOGRAM
Computes the histogram for the n most frequently occurring distinct values per group.
SYNTAX
histogram(n, input)
n
- Type: int
- Description: The number of values to return per group. Must be a positive integer.
input
- Type: any
- Description: The attribute on which to compute the histogram.
OUTPUT
{'type': 'array', 'elementType': {'type': 'record', 'fullySpecified': True, 'fields': [{'name': 'value', 'type': 'any'}, {'name': 'count', 'type': 'long'}]}}
EXAMPLES
GROUP histogram(2, age) AS result BY name
name | age |
---|---|
'susan' | 27 |
'susan' | 18 |
'susan' | 27 |
'susan' | 18 |
'susan' | 18 |
'susan' | null |
'susan' | null |
'susan' | null |
'bob' | 28 |
'bob' | 28 |
'steve' | null |
→
result | name |
---|---|
[{'count': 3, 'value': 18}, {'count': 2, 'value': 27}] | 'susan' |
[{'count': 2, 'value': 28}] | 'bob' |
[] | 'steve' |
GROUP histogram(2, ...age) AS result BY name
name | age |
---|---|
'susan' | [27, 42] |
'susan' | [70] |
'susan' | [27, 70] |
'susan' | [18, 30, 27] |
'susan' | [70] |
'susan' | [] |
'susan' | [null] |
'susan' | null |
'bob' | [28] |
'bob' | [28] |
'steve' | null |
→
result | name |
---|---|
[{'count': 3, 'value': 27}, {'count': 3, 'value': 70}] | 'susan' |
[{'count': 2, 'value': 28}] | 'bob' |
[] | 'steve' |
GROUP histogram(3, ...vehicles) AS result BY name
name | vehicles |
---|---|
'John' | [{'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}, {'model': 'Camry', 'year': null, 'make': 'Toyota'}, {'model': 'Civic', 'year': 1995, 'make': 'Honda'}] |
'John' | [{'model': null, 'year': 1995, 'make': 'Honda'}, {'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}] |
'John' | [{'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}, {'model': 'Camry', 'year': null, 'make': 'Toyota'}] |
'Mary' | [{'model': 'Camry', 'year': null, 'make': 'Toyota'}] |
'Mary' | [] |
→
result | name |
---|---|
[{'count': 3, 'value': {'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}}, {'count': 2, 'value': {'model': 'Camry', 'year': null, 'make': 'Toyota'}}, {'count': 1, 'value': {'model': null, 'year': 1995, 'make': 'Honda'}}] | 'John' |
[{'count': 1, 'value': {'model': 'Camry', 'year': null, 'make': 'Toyota'}}] | 'Mary' |
COLLECT_LIST
Collects all non-null values of an attribute in a group. If the input dataset is ordered, then the resulting list is also ordered. WARNING: Using this function with a large number of records can result in errors. Consider using COLLECT_SET.
SYNTAX
collect_list(input)
input
- Type: any
- Description: The attributes whose values to collect
OUTPUT
{'type': 'array', 'elementType': 'any'}
EXAMPLE
GROUP collect_list(employee) AS employees BY department
department | employee |
---|---|
'accounting' | 'angela' |
'sales' | 'dwight' |
'sales' | 'jim' |
'accounting' | 'kevin' |
'admin' | null |
'accounting' | 'oscar' |
'admin' | 'pam' |
'sales' | 'phyllis' |
'sales' | 'stanley' |
'admin' | 'toby' |
→
department | employees |
---|---|
'accounting' | ['angela', 'kevin', 'oscar'] |
'sales' | ['dwight', 'jim', 'phyllis', 'stanley'] |
'admin' | ['pam', 'toby'] |
COLLECT_SET
Collects all distinct, non-null values of an attribute in a group. The resulting list is not ordered. Tip: To find the function that best meets your needs, review similar functions COLLECT_LIST and ARRAY.DISTINCT.
SYNTAX
collect_set(input)
input
- Type: any
- Description: The attributes whose values to collect
OUTPUT
{'type': 'array', 'elementType': 'any'}
EXAMPLE
GROUP collect_set(employee) AS employees BY department
department | employee |
---|---|
'accounting' | 'angela' |
'sales' | 'dwight' |
'sales' | 'dwight' |
'sales' | 'jim' |
'accounting' | 'kevin' |
'accounting' | 'kevin' |
'accounting' | 'oscar' |
'accounting' | null |
'sales' | 'phyllis' |
'sales' | 'stanley' |
'sales' | 'stanley' |
→
department | employees |
---|---|
'accounting' | ['kevin', 'oscar', 'angela'] |
'sales' | ['jim', 'phyllis', 'stanley', 'dwight'] |
COLLECT_SUBSET
Collects a subset of the distinct, non-null values of an attribute in a group. The resulting list is not ordered. The selected subset is based on the hash value of the input values. As a result, this aggregation function does deterministic tie-breaking, since the same subset is returned for the same set of values. When values occur with roughly equal frequency in data, the results of COLLECT_SUBSET will be more stable than TOP in the face of small data changes.
SYNTAX
collect_subset(K, input)
K
- Type: int
- Description: The number of values to return per group. Must be a positive integer.
input
- Type: any
- Description: The attributes for which to collect a subset of values
OUTPUT
{'type': 'array', 'elementType': 'any'}
EXAMPLES
GROUP collect_subset(2, employee) AS employees BY department
department | employee |
---|---|
'accounting' | 'angela' |
'sales' | 'dwight' |
'sales' | 'dwight' |
'sales' | 'jim' |
'accounting' | 'kevin' |
'accounting' | 'kevin' |
'accounting' | 'oscar' |
'accounting' | null |
'sales' | 'phyllis' |
'sales' | 'stanley' |
'sales' | 'stanley' |
→
department | employees |
---|---|
'accounting' | ['kevin', 'angela'] |
'sales' | ['dwight', 'stanley'] |
GROUP collect_subset(3, ...employee) AS employees BY department
department | employee |
---|---|
'accounting' | ['angela', 'oscar'] |
'sales' | ['dwight'] |
'sales' | ['dwight', null] |
'sales' | ['jim', 'walter'] |
'accounting' | ['kevin', 'john'] |
'accounting' | ['kevin', 'frank'] |
'accounting' | ['oscar'] |
'accounting' | null |
'sales' | ['phyllis', 'mark'] |
'sales' | ['stanley'] |
'sales' | ['stanley', 'adam'] |
→
department | employees |
---|---|
'accounting' | ['angela', 'john', 'frank'] |
'sales' | ['stanley', 'dwight', 'walter'] |
STDDEV_SAMP
Computes the corrected sample standard deviation of all non-null values of an attribute in a group.
SYNTAX
stddev_samp(input)
input
- Type: number
- Description: The attribute whose standard deviation is calculated
OUTPUT
double
EXAMPLE
GROUP stddev(age) AS result BY name
name | age |
---|---|
'susan' | 1.0 |
'susan' | 2.0 |
'susan' | 3.0 |
'bob' | 10.0 |
'bob' | 20.0 |
'bob' | 30.0 |
'alice' | 20.0 |
'john' | 30.0 |
'john' | 30.0 |
'john' | 30.0 |
→
result | name |
---|---|
1.0 | 'susan' |
10.0 | 'bob' |
null | 'alice' |
0.0 | 'john' |
STDDEV_POP
Computes the population standard deviation of all non-null values of an attribute in a group.
SYNTAX
stddev_pop(input)
input
- Type: number
- Description: The attribute whose standard deviation is calculated
OUTPUT
double
EXAMPLE
GROUP stddev_pop(age) AS result BY name
name | age |
---|---|
'susan' | 1.0 |
'susan' | 2.0 |
'susan' | 3.0 |
'bob' | 10.0 |
'bob' | 20.0 |
'bob' | 30.0 |
→
result | name |
---|---|
0.816496580927726 | 'susan' |
8.16496580927726 | 'bob' |
VAR_SAMP
Computes the corrected sample variance of all non-null values of an attribute in a group.
SYNTAX
var_samp(input)
input
- Type: number
- Description: The attribute whose variance is calculated
OUTPUT
double
EXAMPLE
GROUP var_samp(age) AS result BY name
name | age |
---|---|
'susan' | 1.0 |
'susan' | 2.0 |
'susan' | 3.0 |
'bob' | 10.0 |
'bob' | 20.0 |
'bob' | 30.0 |
→
result | name |
---|---|
1.0 | 'susan' |
100.0 | 'bob' |
VAR_POP
Computes the population variance of all non-null values of an attribute in a group.
SYNTAX
var_pop(input)
input
- Type: number
- Description: The attribute whose variance is calculated
OUTPUT
double
EXAMPLE
GROUP var_pop(age) AS result BY name
name | age |
---|---|
'susan' | 1.0 |
'susan' | 2.0 |
'susan' | 3.0 |
'bob' | 10.0 |
'bob' | 20.0 |
'bob' | 30.0 |
→
result | name |
---|---|
0.6666666666666666 | 'susan' |
66.66666666666667 | 'bob' |
Updated 8 months ago