User GuidesAPI ReferenceRelease Notes

# Aggregate Functions

Aggregate functions perform a calculation on a set of values, and return a single value.

# DENSE_RANK

Assigns to each row its position according to the ORDER BY expression. Unlike `rank()`, `dense_rank()` values increase sequentially, one-by-one, regardless of ties.

SYNTAX

`dense_rank()`

OUTPUT

`int`

EXAMPLE

``````WINDOW DENSE_RANK() AS eventRank BY region ORDER BY date
``````
regiondateeventregioneventdateeventRank
'Bay Area''2010-10-15''light fog''Bay Area''light fog''2010-10-15'1
'Bay Area''2010-10-15'null'Bay Area'null'2010-10-15'1
'Bay Area''2010-10-16''strong winds''Bay Area''strong winds''2010-10-16'2

# LAG

Computes an expression on a row that is behind the current one in the sort order.

SYNTAX

`lag(input, offset)`

• `input`
• Type: any
• Description: The expression to compute on a previous row
• `offset`
• Type: int
• Description: The number of rows previous to lag. Must be a positive integer constant.

OUTPUT

`any`

EXAMPLE

``````WINDOW event, LAG(event, 1) AS previousEvent BY region ORDER BY date
``````
regiondateeventregionpreviousEventeventdate
'Bay Area''2010-10-15''light fog''Bay Area'null'light fog''2010-10-15'
'Bay Area''2010-10-15'null'Bay Area''light fog'null'2010-10-15'
'Bay Area''2010-10-16''strong winds''Bay Area'null'strong winds''2010-10-16'

Computes an expression on a row that is ahead of the current one in the sort order.

SYNTAX

`lead(input, offset)`

• `input`
• Type: any
• Description: The expression to compute on an upcoming row
• `offset`
• Type: int
• Description: The number of rows following to skip. Must be a positive integer constant.

OUTPUT

`any`

EXAMPLE

``````WINDOW event, LEAD(event, 1) AS nextEvent BY region ORDER BY date
``````
regiondateeventregionnextEventeventdate
'Bay Area''2010-10-15''light fog''Bay Area'null'light fog''2010-10-15'
'Bay Area''2010-10-15'null'Bay Area''strong winds'null'2010-10-15'
'Bay Area''2010-10-16''strong winds''Bay Area'null'strong winds''2010-10-16'

# RANK

Assigns to each row its position in the partition according to the ORDER BY expression.

SYNTAX

`rank()`

OUTPUT

`int`

EXAMPLE

``````WINDOW RANK() AS eventRank BY region ORDER BY date
``````
regiondateeventregioneventdateeventRank
'Bay Area''2010-10-15''light fog''Bay Area''light fog''2010-10-15'1
'Bay Area''2010-10-15'null'Bay Area'null'2010-10-15'1
'Bay Area''2010-10-16''strong winds''Bay Area''strong winds''2010-10-16'3

# ROW_NUMBER

Assigns a unique sequential ID number to each row in a partition, starting with 1.

SYNTAX

`row_number()`

OUTPUT

`int`

EXAMPLE

``````WINDOW ROW_NUMBER() AS eventId BY region ORDER BY date
``````
regiondateeventregioneventIdeventdate
'Bay Area''2010-10-15''light fog''Bay Area'1'light fog''2010-10-15'
'Bay Area''2010-10-15'null'Bay Area'2null'2010-10-15'
'Bay Area''2010-10-16''strong winds''Bay Area'3'strong winds''2010-10-16'

# MIN

Computes the least value of an attribute or expression in a group.Tip: To find the function that best meets your needs, review similar function LEAST.

SYNTAX

`min(input)`

• `input`
• Type: any
• Description: The attributes to compute the least value over

OUTPUT

`any`

EXAMPLE

``````GROUP min(age) AS result BY name
``````
nameage
'susan'27
'susan'18
'susan'35
'bob'28
'bob'null
'bob'55

resultname
18'susan'
28'bob'

# MIN_SIZE

Computes the minimum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.

SYNTAX

`min_size(input)`

• `input`
• Type: any
• Description: The attributes to compute the minimum value over

OUTPUT

`any`

EXAMPLES

``````GROUP min_size(name) AS result BY age
``````
agename
27'susan'
27'samuel'
27'sam'
54'chris'
54'bob'

resultage
'sam'27
'bob'54
``````GROUP min_size(cities) AS result BY state
``````
statecities
'NY'['New York', 'Albany']
'WA'['Seattle', null]
'WA'['Spokane', 'Tacoma']
'CA'['Los Angeles', 'Pleasanton']
'CA'['San Franscisco', 'Berkeley', 'San Diego']

resultstate
['New York', 'Albany']'NY'
['Seattle', null]'WA'
['Los Angeles', 'Pleasanton']'CA'
``````GROUP min_size(special_case) AS result BY some_value
``````
some_valuespecial_case
'case-1'['', '', '', 'one']
'case-1'['o', 'n', 'e']
'case-2'['one', '', '', '']
'case-2'['o', 'n', 'e']

resultsome_value
['', '', '', 'one']'case-1'
['o', 'n', 'e']'case-2'
``````GROUP min_size(special_case_2) AS result BY some_value
``````
some_valuespecial_case_2
'case-1'['one', '', null]
'case-1'['o', 'n', 'e']
'case-1'['o', 'n', 'e', null]

resultsome_value
['one', '', null]'case-1'

# MAX

Computes the greatest value of an attribute or expression in a group.Tip: To find the function that best meets your needs, review similar function GREATEST.

SYNTAX

`max(input)`

• `input`
• Type: any
• Description: The attributes to compute the greatest value over

OUTPUT

`any`

EXAMPLE

``````GROUP max(age) AS result BY name
``````
nameage
'susan'27
'susan'18
'susan'35
'bob'28
'bob'null
'bob'55

resultname
35'susan'
55'bob'

# MAX_SIZE

Computes the maximum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.

SYNTAX

`max_size(input)`

• `input`
• Type: any
• Description: The attributes to compute the maximum value over

OUTPUT

`any`

EXAMPLES

``````GROUP max_size(name) AS result BY age
``````
agename
27'susan'
27'samuel'
27'sam'
54'chris'
54'bob'

resultage
'samuel'27
'chris'54
``````GROUP max_size(cities) AS result BY state
``````
statecities
'NY'['New York', 'Albany']
'WA'['Seattle', null]
'WA'['Spokane', 'Tacoma']
'CA'['Los Angeles', 'Pleasanton']
'CA'['San Franscisco', 'Berkeley', 'San Diego']

resultstate
['New York', 'Albany']'NY'
['Spokane', 'Tacoma']'WA'
['San Franscisco', 'Berkeley', 'San Diego']'CA'
``````GROUP max_size(special_case) AS result BY some_value
``````
some_valuespecial_case
'case-1'['', '', '', 'one']
'case-1'['o', 'n', 'e']
'case-2'['one', '', '', '']
'case-2'['o', 'n', 'e']

resultsome_value
['o', 'n', 'e']'case-1'
['one', '', '', '']'case-2'
``````GROUP max_size(special_case_2) AS result BY some_value
``````
some_valuespecial_case_2
'case-1'['one', '', null]
'case-1'['o', 'n', 'e']
'case-1'['o', 'n', 'e', null]

resultsome_value
['o', 'n', 'e', null]'case-1'

# COUNT

Counts the number of non-null items in a group.

SYNTAX

`count(input)`

• `input`
• Type: any
• Description: The attributes whose items to count

OUTPUT

`long`

EXAMPLE

``````GROUP count(event) AS num_events BY region
``````
regiondateevent
'New York''2010-10-15''light showers'
'New York''2010-10-16''strong winds'
'New York''2010-10-16''thundershowers'
'Bay Area''2010-10-15''light fog'
'Bay Area''2010-10-15'null
'Bay Area''2010-10-16''strong winds'

regionnum_events
'Bay Area'2
'New York'3

# COUNT_DISTINCT

Counts the number of distinct non-null items in a group.

SYNTAX

`count_distinct(input)`

• `input`
• Type: any
• Description: The attributes whose items to count

OUTPUT

`long`

EXAMPLE

``````GROUP count_distinct(event) AS num_events BY region
``````
regiondateevent
'New York''2010-10-15''light showers'
'New York''2010-10-15''light showers'
'New York''2010-10-15''light showers'
'New York''2010-10-16''strong winds'
'New York''2010-10-16''thundershowers'
'Bay Area''2010-10-15''light fog'
'Bay Area''2010-10-15''light fog'
'Bay Area''2010-10-15'null
'Bay Area''2010-10-15'null
'Bay Area''2010-10-16''strong winds'

regionnum_events
'Bay Area'2
'New York'3

# SUM

Computes the sum of all non-null values of an attribute in a group.Tip: To find the function that best meets your needs, review similar function ARRAY.SUM.

SYNTAX

`sum(input)`

• `input`
• Type: number
• Description: The attribute whose values to sum

OUTPUT

`number`

EXAMPLE

``````GROUP sum(age) AS result BY name
``````
nameage
'susan'27
'susan'18
'susan'35
'bob'28
'bob'null
'bob'55

resultname
80'susan'
83'bob'

# AVG

Computes the average of all non-null values of an attribute in a group.

SYNTAX

`avg(input)`

• `input`
• Type: number
• Description: The attributes whose values are averaged

OUTPUT

`double`

EXAMPLE

``````GROUP avg(age) AS result BY name
``````
nameage
'susan'27
'susan'18
'susan'35
'bob'28
'bob'null
'bob'55
'alice'null

resultname
26.666666666666668'susan'
41.5'bob'
null'alice'

# MEAN

Computes the mean of all non-null values of an attribute in a group. This is a synonym of the AVG function.

SYNTAX

`mean(input)`

• `input`
• Type: number
• Description: The attributes over which the mean is calculated

OUTPUT

`double`

EXAMPLE

``````GROUP mean(age) AS result BY name
``````
nameage
'susan'27
'susan'18
'susan'35
'bob'28
'bob'null
'bob'55

resultname
26.666666666666668'susan'
41.5'bob'

# MODE

Computes the mode (most frequent value) of an attribute in a group. Ignores nulls. In the case of a tie, the least value will be returned.Tip: To find the function that best meets your needs, review these similar functions: TOP, ARRAY.MOST_FREQUENT.

SYNTAX

`mode(input)`

• `input`
• Type: any
• Description: The attributes on which to compute the mode

OUTPUT

`any`

EXAMPLE

``````GROUP mode(age) AS result BY name
``````
nameage
'susan'27
'susan'18
'susan'27
'susan'null
'susan'null
'susan'null
'bob'28
'bob'55
'bob'55
'steve'null
'stacy'38
'stacy'36
'stacy'37

resultname
27'susan'
55'bob'
null'steve'
36'stacy'

# FIRST

Returns the first value in a attribute for each partition defined in an aggregation.Tip: To find the function that best meets your needs, review similar function ARRAY.FIRST.

SYNTAX

`first(data)`

• `data`
• Type: any
• Description: The attributes from which we want to return the first value.

OUTPUT

`any`

EXAMPLE

``````GROUP first(A) AS new_A BY B
``````
AB
['jim', 'pam']'1'
[]'1'
['toby']'1'

Bnew_A
'1'['jim', 'pam']

# LAST

Returns the last value in a attribute for each partition defined in an aggregation.Tip: To find the function that best meets your needs, review similar function ARRAY.LAST.

SYNTAX

`last(data)`

• `data`
• Type: any
• Description: The attributes from which we want to return the last value.

OUTPUT

`any`

EXAMPLE

``````GROUP last(A) AS new_A BY B
``````
AB
['jim', 'pam']'1'
[]'1'
['toby']'1'

Bnew_A
'1'['toby']

# TOP

Computes K most frequently occurring distinct values per group. TOP breaks ties in value frequency by selecting an arbitrary value to sort first. Choose TOP if some values are expected to consistently occur more frequently than others. If there are many ties in value frequency then small changes in data can yield significant changes in which values TOP collects. Tip: To find the function that best meets your needs, review these similar functions: COLLECT_SUBSET, MODE, ARRAY.MOST_FREQUENT.

SYNTAX

`top(K, input)`

• `K`
• Type: int
• Description: The number of values to return per group. Must be a positive integer.
• `input`
• Type: any
• Description: The attribute on which to compute TOP K values.

OUTPUT

`{'type': 'array', 'elementType': 'any'}`

EXAMPLES

``````GROUP top(2, age) AS result BY name
``````
nameage
'susan'27
'susan'18
'susan'27
'susan'18
'susan'70
'susan'null
'susan'null
'susan'null
'bob'28
'bob'28
'steve'null

resultname
[27, 18]'susan'
[28]'bob'
[]'steve'
``````GROUP top(2, ...age) AS result BY name
``````
nameage
'susan'[27, 42]
'susan'[70]
'susan'[27, 70]
'susan'[18, 30, 27]
'susan'[70]
'susan'[]
'susan'[null]
'susan'null
'bob'[28]
'bob'[28]
'steve'null

resultname
[27, 70]'susan'
[28]'bob'
[]'steve'

# HISTOGRAM

Computes the histogram for the n most frequently occurring distinct values per group.

SYNTAX

`histogram(n, input)`

• `n`
• Type: int
• Description: The number of values to return per group. Must be a positive integer.
• `input`
• Type: any
• Description: The attribute on which to compute the histogram.

OUTPUT

`{'type': 'array', 'elementType': {'type': 'record', 'fullySpecified': True, 'fields': [{'name': 'value', 'type': 'any'}, {'name': 'count', 'type': 'long'}]}}`

EXAMPLES

``````GROUP histogram(2, age) AS result BY name
``````
nameage
'susan'27
'susan'18
'susan'27
'susan'18
'susan'18
'susan'null
'susan'null
'susan'null
'bob'28
'bob'28
'steve'null

resultname
[{'count': 3, 'value': 18}, {'count': 2, 'value': 27}]'susan'
[{'count': 2, 'value': 28}]'bob'
[]'steve'
``````GROUP histogram(2, ...age) AS result BY name
``````
nameage
'susan'[27, 42]
'susan'[70]
'susan'[27, 70]
'susan'[18, 30, 27]
'susan'[70]
'susan'[]
'susan'[null]
'susan'null
'bob'[28]
'bob'[28]
'steve'null

resultname
[{'count': 3, 'value': 27}, {'count': 3, 'value': 70}]'susan'
[{'count': 2, 'value': 28}]'bob'
[]'steve'
``````GROUP histogram(3, ...vehicles) AS result BY name
``````
namevehicles
'John'[{'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}, {'model': 'Camry', 'year': null, 'make': 'Toyota'}, {'model': 'Civic', 'year': 1995, 'make': 'Honda'}]
'John'[{'model': null, 'year': 1995, 'make': 'Honda'}, {'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}]
'John'[{'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}, {'model': 'Camry', 'year': null, 'make': 'Toyota'}]
'Mary'[{'model': 'Camry', 'year': null, 'make': 'Toyota'}]
'Mary'[]

resultname
[{'count': 3, 'value': {'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}}, {'count': 2, 'value': {'model': 'Camry', 'year': null, 'make': 'Toyota'}}, {'count': 1, 'value': {'model': null, 'year': 1995, 'make': 'Honda'}}]'John'
[{'count': 1, 'value': {'model': 'Camry', 'year': null, 'make': 'Toyota'}}]'Mary'

# COLLECT_LIST

Collects all non-null values of an attribute in a group. If the input dataset is ordered, then the resulting list is also ordered. WARNING: Using this function with a large number of records can result in errors. Consider using COLLECT_SET.

SYNTAX

`collect_list(input)`

• `input`
• Type: any
• Description: The attributes whose values to collect

OUTPUT

`{'type': 'array', 'elementType': 'any'}`

EXAMPLE

``````GROUP collect_list(employee) AS employees BY department
``````
departmentemployee
'accounting''angela'
'sales''dwight'
'sales''jim'
'accounting''kevin'
'accounting''oscar'
'sales''phyllis'
'sales''stanley'

departmentemployees
'accounting'['angela', 'kevin', 'oscar']
'sales'['dwight', 'jim', 'phyllis', 'stanley']

# COLLECT_SET

Collects all distinct, non-null values of an attribute in a group. The resulting list is not ordered. Tip: To find the function that best meets your needs, review similar functions COLLECT_LIST and ARRAY.DISTINCT.

SYNTAX

`collect_set(input)`

• `input`
• Type: any
• Description: The attributes whose values to collect

OUTPUT

`{'type': 'array', 'elementType': 'any'}`

EXAMPLE

``````GROUP collect_set(employee) AS employees BY department
``````
departmentemployee
'accounting''angela'
'sales''dwight'
'sales''dwight'
'sales''jim'
'accounting''kevin'
'accounting''kevin'
'accounting''oscar'
'accounting'null
'sales''phyllis'
'sales''stanley'
'sales''stanley'

departmentemployees
'accounting'['kevin', 'oscar', 'angela']
'sales'['jim', 'phyllis', 'stanley', 'dwight']

# COLLECT_SUBSET

Collects a subset of the distinct, non-null values of an attribute in a group. The resulting list is not ordered. The selected subset is based on the hash value of the input values. As a result, this aggregation function does deterministic tie-breaking, since the same subset is returned for the same set of values. When values occur with roughly equal frequency in data, the results of COLLECT_SUBSET will be more stable than TOP in the face of small data changes.

SYNTAX

`collect_subset(K, input)`

• `K`
• Type: int
• Description: The number of values to return per group. Must be a positive integer.
• `input`
• Type: any
• Description: The attributes for which to collect a subset of values

OUTPUT

`{'type': 'array', 'elementType': 'any'}`

EXAMPLES

``````GROUP collect_subset(2, employee) AS employees BY department
``````
departmentemployee
'accounting''angela'
'sales''dwight'
'sales''dwight'
'sales''jim'
'accounting''kevin'
'accounting''kevin'
'accounting''oscar'
'accounting'null
'sales''phyllis'
'sales''stanley'
'sales''stanley'

departmentemployees
'accounting'['kevin', 'angela']
'sales'['dwight', 'stanley']
``````GROUP collect_subset(3, ...employee) AS employees BY department
``````
departmentemployee
'accounting'['angela', 'oscar']
'sales'['dwight']
'sales'['dwight', null]
'sales'['jim', 'walter']
'accounting'['kevin', 'john']
'accounting'['kevin', 'frank']
'accounting'['oscar']
'accounting'null
'sales'['phyllis', 'mark']
'sales'['stanley']

departmentemployees
'accounting'['angela', 'john', 'frank']
'sales'['stanley', 'dwight', 'walter']

# STDDEV_SAMP

Computes the corrected sample standard deviation of all non-null values of an attribute in a group.

SYNTAX

`stddev_samp(input)`

• `input`
• Type: number
• Description: The attribute whose standard deviation is calculated

OUTPUT

`double`

EXAMPLE

``````GROUP stddev(age) AS result BY name
``````
nameage
'susan'1.0
'susan'2.0
'susan'3.0
'bob'10.0
'bob'20.0
'bob'30.0
'alice'20.0
'john'30.0
'john'30.0
'john'30.0

resultname
1.0'susan'
10.0'bob'
null'alice'
0.0'john'

# STDDEV_POP

Computes the population standard deviation of all non-null values of an attribute in a group.

SYNTAX

`stddev_pop(input)`

• `input`
• Type: number
• Description: The attribute whose standard deviation is calculated

OUTPUT

`double`

EXAMPLE

``````GROUP stddev_pop(age) AS result BY name
``````
nameage
'susan'1.0
'susan'2.0
'susan'3.0
'bob'10.0
'bob'20.0
'bob'30.0

resultname
0.816496580927726'susan'
8.16496580927726'bob'

# VAR_SAMP

Computes the corrected sample variance of all non-null values of an attribute in a group.

SYNTAX

`var_samp(input)`

• `input`
• Type: number
• Description: The attribute whose variance is calculated

OUTPUT

`double`

EXAMPLE

``````GROUP var_samp(age) AS result BY name
``````
nameage
'susan'1.0
'susan'2.0
'susan'3.0
'bob'10.0
'bob'20.0
'bob'30.0

resultname
1.0'susan'
100.0'bob'

# VAR_POP

Computes the population variance of all non-null values of an attribute in a group.

SYNTAX

`var_pop(input)`

• `input`
• Type: number
• Description: The attribute whose variance is calculated

OUTPUT

`double`

EXAMPLE

``````GROUP var_pop(age) AS result BY name
``````
nameage
'susan'1.0
'susan'2.0
'susan'3.0
'bob'10.0
'bob'20.0
'bob'30.0

resultname
0.6666666666666666'susan'
66.66666666666667'bob'