Aggregate Functions
Aggregate functions.
AVG
Computes the average of all non-null values of an attribute in a group.
SYNTAX
avg()
OUTPUT
double
EXAMPLE
GROUP avg(age) AS result BY name
name | age |
---|---|
susan | 27 |
susan | 18 |
susan | 35 |
bob | 28 |
bob | null |
bob | 55 |
→
result | name |
---|---|
26.666666666666668 | susan |
41.5 | bob |
COLLECT_LIST
Collects all non-null values of an attribute in a group. If the input dataset is ordered, then the resulting list is also ordered. WARNING - using this function with a large number of records can result in errors. Please consider using collect_set()
.
SYNTAX
collect_list()
OUTPUT
{'type': 'array', 'elementType': 'any'}
EXAMPLE
GROUP collect_list(employee) AS employees BY department
department | employee |
---|---|
accounting | angela |
sales | dwight |
sales | jim |
accounting | kevin |
admin | null |
accounting | oscar |
admin | pam |
sales | phyllis |
sales | stanley |
admin | toby |
→
department | employees |
---|---|
accounting | ['angela', 'kevin', 'oscar'] |
sales | ['dwight', 'jim', 'phyllis', 'stanley'] |
admin | ['pam', 'toby'] |
COLLECT_SET
Collects all distinct, non-null values of an attribute in a group. The resulting list is not ordered.
SYNTAX
collect_set()
OUTPUT
{'type': 'array', 'elementType': 'any'}
EXAMPLE
GROUP collect_set(employee) AS employees BY department
department | employee |
---|---|
accounting | angela |
sales | dwight |
sales | dwight |
sales | jim |
accounting | kevin |
accounting | kevin |
accounting | oscar |
accounting | null |
sales | phyllis |
sales | stanley |
sales | stanley |
→
department | employees |
---|---|
accounting | ['kevin', 'oscar', 'angela'] |
sales | ['jim', 'phyllis', 'stanley', 'dwight'] |
COUNT
Counts the number of non-null items in a group.
SYNTAX
count()
OUTPUT
long
EXAMPLE
GROUP count(event) AS num_events BY region
region | event | date |
---|---|---|
New York | light showers | 10-15-2010 |
New York | strong winds | 10-16-2010 |
New York | thundershowers | 10-16-2010 |
Bay Area | light fog | 10-15-2010 |
Bay Area | null | 10-15-2010 |
Bay Area | strong winds | 10-16-2010 |
→
region | num_events |
---|---|
Bay Area | 2 |
New York | 3 |
COUNT_DISTINCT
Counts the number of distinct non-null items in a group.
SYNTAX
count_distinct()
OUTPUT
long
EXAMPLE
GROUP count_distinct(event) AS num_events BY region
region | event | date |
---|---|---|
New York | light showers | 10-15-2010 |
New York | light showers | 10-15-2010 |
New York | light showers | 10-15-2010 |
New York | strong winds | 10-16-2010 |
New York | thundershowers | 10-16-2010 |
Bay Area | light fog | 10-15-2010 |
Bay Area | light fog | 10-15-2010 |
Bay Area | null | 10-15-2010 |
Bay Area | null | 10-15-2010 |
Bay Area | strong winds | 10-16-2010 |
→
region | num_events |
---|---|
Bay Area | 2 |
New York | 3 |
DENSE_RANK
Assigns to each row its position according to the ORDER BY expression. Unlike rank()
, dense_rank()
values increase sequentially, one-by-one, regardless of ties.
SYNTAX
dense_rank()
OUTPUT
int
EXAMPLE
WINDOW DENSE_RANK() AS eventRank BY region ORDER BY date
region | event | date | → | region | event | date | eventRank |
---|---|---|---|---|---|---|---|
Bay Area | light fog | 10-15-2010 | Bay Area | light fog | 10-15-2010 | 1 | |
Bay Area | null | 10-15-2010 | Bay Area | null | 10-15-2010 | 1 | |
Bay Area | strong winds | 10-16-2010 | Bay Area | strong winds | 10-16-2010 | 2 |
FIRST
Returns the first value in a attribute for each partition defined in an aggregation.
SYNTAX
first()
OUTPUT
any
EXAMPLE
GROUP first(A) AS new_A BY B
A | B |
---|---|
['jim', 'pam'] | 1 |
[] | 1 |
['toby'] | 1 |
→
B | new_A |
---|---|
1 | ['jim', 'pam'] |
LAG
Computes an expression on a row that is behind the current one in the sort order.
SYNTAX
lag(input, offset)
input
- Type: any
- Description: The expression to compute on a previous row
offset
- Type: int
- Description: The number of rows previous to lag. Must be a positive integer constant.
OUTPUT
any
EXAMPLE
WINDOW event, LAG(event, 1) AS previousEvent BY region ORDER BY date
region | event | date | → | region | previousEvent | event | date |
---|---|---|---|---|---|---|---|
Bay Area | light fog | 10-15-2010 | Bay Area | null | light fog | 10-15-2010 | |
Bay Area | null | 10-15-2010 | Bay Area | light fog | null | 10-15-2010 | |
Bay Area | strong winds | 10-16-2010 | Bay Area | null | strong winds | 10-16-2010 |
LAST
Returns the last value in a attribute for each partition defined in an aggregation.
SYNTAX
last()
OUTPUT
any
EXAMPLE
GROUP last(A) AS new_A BY B
A | B |
---|---|
['jim', 'pam'] | 1 |
[] | 1 |
['toby'] | 1 |
→
B | new_A |
---|---|
1 | ['toby'] |
LEAD
Computes an expression on a row that is ahead of the current one in the sort order.
SYNTAX
lead(input, offset)
input
- Type: any
- Description: The expression to compute on an upcoming row
offset
- Type: int
- Description: The number of rows following to skip. Must be a positive integer constant.
OUTPUT
any
EXAMPLE
WINDOW event, LEAD(event, 1) AS nextEvent BY region ORDER BY date
region | event | date | → | region | nextEvent | event | date |
---|---|---|---|---|---|---|---|
Bay Area | light fog | 10-15-2010 | Bay Area | null | light fog | 10-15-2010 | |
Bay Area | null | 10-15-2010 | Bay Area | strong winds | null | 10-15-2010 | |
Bay Area | strong winds | 10-16-2010 | Bay Area | null | strong winds | 10-16-2010 |
MAX
Computes the greatest value of an attribute or expression in a group.
SYNTAX
max()
OUTPUT
any
EXAMPLE
GROUP max(age) AS result BY name
name | age |
---|---|
susan | 27 |
susan | 18 |
susan | 35 |
bob | 28 |
bob | null |
bob | 55 |
→
result | name |
---|---|
35 | susan |
55 | bob |
MAX_SIZE
Computes the maximum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.
SYNTAX
max_size()
OUTPUT
any
EXAMPLES
GROUP max_size(name) AS result BY age
age | name |
---|---|
27 | susan |
27 | samuel |
27 | sam |
54 | chris |
54 | bob |
→
result | age |
---|---|
samuel | 27 |
chris | 54 |
GROUP max_size(cities) AS result BY state
cities | state |
---|---|
['New York', 'Albany'] | NY |
['Seattle', null] | WA |
['Spokane', 'Tacoma'] | WA |
['Los Angeles', 'Pleasanton'] | CA |
['San Franscisco', 'Berkeley', 'San Diego'] | CA |
→
result | state |
---|---|
['New York', 'Albany'] | NY |
['Spokane', 'Tacoma'] | WA |
['San Franscisco', 'Berkeley', 'San Diego'] | CA |
GROUP max_size(special_case) AS result BY some_value
special_case | some_value |
---|---|
['', '', '', 'one'] | case-1 |
['o', 'n', 'e'] | case-1 |
['one', '', '', ''] | case-2 |
['o', 'n', 'e'] | case-2 |
→
result | some_value |
---|---|
['o', 'n', 'e'] | case-1 |
['one', '', '', ''] | case-2 |
GROUP max_size(special_case_2) AS result BY some_value
special_case_2 | some_value |
---|---|
['one', '', null] | case-1 |
['o', 'n', 'e'] | case-1 |
['o', 'n', 'e', null] | case-1 |
→
result | some_value |
---|---|
['o', 'n', 'e', null] | case-1 |
MEAN
Computes the mean of all non-null values of an attribute in a group. This is a synonym of the AVG function.
SYNTAX
mean()
OUTPUT
double
EXAMPLE
GROUP mean(age) AS result BY name
name | age |
---|---|
susan | 27 |
susan | 18 |
susan | 35 |
bob | 28 |
bob | null |
bob | 55 |
→
result | name |
---|---|
26.666666666666668 | susan |
41.5 | bob |
MIN
Computes the least value of an attribute or expression in a group.
SYNTAX
min()
OUTPUT
any
EXAMPLE
GROUP min(age) AS result BY name
name | age |
---|---|
susan | 27 |
susan | 18 |
susan | 35 |
bob | 28 |
bob | null |
bob | 55 |
→
result | name |
---|---|
18 | susan |
28 | bob |
MIN_SIZE
Computes the minimum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.
SYNTAX
min_size()
OUTPUT
any
EXAMPLES
GROUP min_size(name) AS result BY age
age | name |
---|---|
27 | susan |
27 | samuel |
27 | sam |
54 | chris |
54 | bob |
→
result | age |
---|---|
sam | 27 |
bob | 54 |
GROUP min_size(cities) AS result BY state
cities | state |
---|---|
['New York', 'Albany'] | NY |
['Seattle', null] | WA |
['Spokane', 'Tacoma'] | WA |
['Los Angeles', 'Pleasanton'] | CA |
['San Franscisco', 'Berkeley', 'San Diego'] | CA |
→
result | state |
---|---|
['New York', 'Albany'] | NY |
['Seattle', null] | WA |
['Los Angeles', 'Pleasanton'] | CA |
GROUP min_size(special_case) AS result BY some_value
special_case | some_value |
---|---|
['', '', '', 'one'] | case-1 |
['o', 'n', 'e'] | case-1 |
['one', '', '', ''] | case-2 |
['o', 'n', 'e'] | case-2 |
→
result | some_value |
---|---|
['', '', '', 'one'] | case-1 |
['o', 'n', 'e'] | case-2 |
GROUP min_size(special_case_2) AS result BY some_value
special_case_2 | some_value |
---|---|
['one', '', null] | case-1 |
['o', 'n', 'e'] | case-1 |
['o', 'n', 'e', null] | case-1 |
→
result | some_value |
---|---|
['one', '', null] | case-1 |
MODE
Computes the mode (most frequent value) of an attribute in a group. Ignores nulls. In the case of a tie, the least value will be returned.
SYNTAX
mode()
OUTPUT
any
EXAMPLE
GROUP mode(age) AS result BY name
name | age |
---|---|
susan | 27 |
susan | 18 |
susan | 27 |
susan | null |
susan | null |
susan | null |
bob | 28 |
bob | 55 |
bob | 55 |
steve | null |
stacy | 38 |
stacy | 36 |
stacy | 37 |
→
result | name |
---|---|
27 | susan |
55 | bob |
null | steve |
36 | stacy |
RANK
Assigns to each row its position in the partition according to the ORDER BY expression.
SYNTAX
rank()
OUTPUT
int
EXAMPLE
WINDOW RANK() AS eventRank BY region ORDER BY date
region | event | date | → | region | event | date | eventRank |
---|---|---|---|---|---|---|---|
Bay Area | light fog | 10-15-2010 | Bay Area | light fog | 10-15-2010 | 1 | |
Bay Area | null | 10-15-2010 | Bay Area | null | 10-15-2010 | 1 | |
Bay Area | strong winds | 10-16-2010 | Bay Area | strong winds | 10-16-2010 | 3 |
ROW_NUMBER
Assigns a unique sequential ID number to each row in a partition, starting with 1.
SYNTAX
row_number()
OUTPUT
int
EXAMPLE
WINDOW ROW_NUMBER() AS eventId BY region ORDER BY date
region | event | date | → | region | eventId | event | date |
---|---|---|---|---|---|---|---|
Bay Area | light fog | 10-15-2010 | Bay Area | 1 | light fog | 10-15-2010 | |
Bay Area | null | 10-15-2010 | Bay Area | 2 | null | 10-15-2010 | |
Bay Area | strong winds | 10-16-2010 | Bay Area | 3 | strong winds | 10-16-2010 |
STDDEV_POP
Computes the population standard deviation of all non-null values of an attribute in a group.
SYNTAX
stddev_pop()
OUTPUT
double
EXAMPLE
GROUP stddev_pop(age) AS result BY name
name | age |
---|---|
susan | 1.0 |
susan | 2.0 |
susan | 3.0 |
bob | 10.0 |
bob | 20.0 |
bob | 30.0 |
→
result | name |
---|---|
0.816496580927726 | susan |
8.16496580927726 | bob |
STDDEV_SAMP
Computes the corrected sample standard deviation of all non-null values of an attribute in a group.
SYNTAX
stddev_samp()
OUTPUT
double
EXAMPLE
GROUP stddev(age) AS result BY name
name | age |
---|---|
susan | 1.0 |
susan | 2.0 |
susan | 3.0 |
bob | 10.0 |
bob | 20.0 |
bob | 30.0 |
→
result | name |
---|---|
1.0 | susan |
10.0 | bob |
SUM
Computes the sum of all non-null values of an attribute in a group.
SYNTAX
sum()
OUTPUT
number
EXAMPLE
GROUP sum(age) AS result BY name
name | age |
---|---|
susan | 27 |
susan | 18 |
susan | 35 |
bob | 28 |
bob | null |
bob | 55 |
→
result | name |
---|---|
80 | susan |
83 | bob |
TOP
Computes n most frequently occurring distinct values per group.
SYNTAX
top(input, n)
input
- Type: any
- Description: The attribute on which to compute top n values.
n
- Type: int
- Description: The number of values to return per group. Must be a positive integer.
OUTPUT
{'type': 'array', 'elementType': 'any'}
EXAMPLE
GROUP top(age, 2) AS result BY name
name | age |
---|---|
susan | 27 |
susan | 18 |
susan | 27 |
susan | 18 |
susan | 70 |
susan | null |
susan | null |
susan | null |
bob | 28 |
bob | 28 |
steve | null |
→
result | name |
---|---|
[27, 18] | susan |
[28] | bob |
[] | steve |
VAR_POP
Computes the population variance of all non-null values of an attribute in a group.
SYNTAX
var_pop()
OUTPUT
double
EXAMPLE
GROUP var_pop(age) AS result BY name
name | age |
---|---|
susan | 1.0 |
susan | 2.0 |
susan | 3.0 |
bob | 10.0 |
bob | 20.0 |
bob | 30.0 |
→
result | name |
---|---|
0.6666666666666666 | susan |
66.66666666666667 | bob |
VAR_SAMP
Computes the corrected sample variance of all non-null values of an attribute in a group.
SYNTAX
var_samp()
OUTPUT
double
EXAMPLE
GROUP var_samp(age) AS result BY name
name | age |
---|---|
susan | 1.0 |
susan | 2.0 |
susan | 3.0 |
bob | 10.0 |
bob | 20.0 |
bob | 30.0 |
→
result | name |
---|---|
1.0 | susan |
100.0 | bob |
Updated over 5 years ago