User GuidesAPI ReferenceRelease Notes
Doc HomeHelp CenterLog In

Aggregate Functions

Aggregate functions perform a calculation on a set of values, and return a single value.

DENSE_RANK

Assigns to each row its position according to the ORDER BY expression. Unlike rank(), dense_rank() values increase sequentially, one-by-one, regardless of ties.

SYNTAX

dense_rank()

OUTPUT

int

EXAMPLE

WINDOW DENSE_RANK() AS eventRank BY region ORDER BY date
regiondateeventregioneventdateeventRank
'Bay Area''2010-10-15''light fog''Bay Area''light fog''2010-10-15'1
'Bay Area''2010-10-15'null'Bay Area'null'2010-10-15'1
'Bay Area''2010-10-16''strong winds''Bay Area''strong winds''2010-10-16'2

LAG

Computes an expression on a row that is behind the current one in the sort order.

SYNTAX

lag(input, offset)

  • input
    • Type: any
    • Description: The expression to compute on a previous row
  • offset
    • Type: int
    • Description: The number of rows previous to lag. Must be a positive integer constant.

OUTPUT

any

EXAMPLE

WINDOW event, LAG(event, 1) AS previousEvent BY region ORDER BY date
regiondateeventregionpreviousEventeventdate
'Bay Area''2010-10-15''light fog''Bay Area'null'light fog''2010-10-15'
'Bay Area''2010-10-15'null'Bay Area''light fog'null'2010-10-15'
'Bay Area''2010-10-16''strong winds''Bay Area'null'strong winds''2010-10-16'

LEAD

Computes an expression on a row that is ahead of the current one in the sort order.

SYNTAX

lead(input, offset)

  • input
    • Type: any
    • Description: The expression to compute on an upcoming row
  • offset
    • Type: int
    • Description: The number of rows following to skip. Must be a positive integer constant.

OUTPUT

any

EXAMPLE

WINDOW event, LEAD(event, 1) AS nextEvent BY region ORDER BY date
regiondateeventregionnextEventeventdate
'Bay Area''2010-10-15''light fog''Bay Area'null'light fog''2010-10-15'
'Bay Area''2010-10-15'null'Bay Area''strong winds'null'2010-10-15'
'Bay Area''2010-10-16''strong winds''Bay Area'null'strong winds''2010-10-16'

RANK

Assigns to each row its position in the partition according to the ORDER BY expression.

SYNTAX

rank()

OUTPUT

int

EXAMPLE

WINDOW RANK() AS eventRank BY region ORDER BY date
regiondateeventregioneventdateeventRank
'Bay Area''2010-10-15''light fog''Bay Area''light fog''2010-10-15'1
'Bay Area''2010-10-15'null'Bay Area'null'2010-10-15'1
'Bay Area''2010-10-16''strong winds''Bay Area''strong winds''2010-10-16'3

ROW_NUMBER

Assigns a unique sequential ID number to each row in a partition, starting with 1.

SYNTAX

row_number()

OUTPUT

int

EXAMPLE

WINDOW ROW_NUMBER() AS eventId BY region ORDER BY date
regiondateeventregioneventIdeventdate
'Bay Area''2010-10-15''light fog''Bay Area'1'light fog''2010-10-15'
'Bay Area''2010-10-15'null'Bay Area'2null'2010-10-15'
'Bay Area''2010-10-16''strong winds''Bay Area'3'strong winds''2010-10-16'

LEGACY.TOP

This function computes K most frequently occurring distinct values per group. This function is DEPRECATED because of its inability to support vararg argument. Use the current TOP function instead. Note that the parameter 'K' is the first parameter in the current TOP function.

SYNTAX

legacy.top(input, K)

  • input
    • Type: any
    • Description: The attribute on which to compute TOP K values.
  • K
    • Type: int
    • Description: The number of values to return per group. Must be a positive integer.

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLE

GROUP legacy.top(age, 2) AS result BY name
nameage
'susan'27
'susan'18
'susan'27
'susan'18
'susan'70
'susan'null
'susan'null
'susan'null
'bob'28
'bob'28
'steve'null

resultname
[27, 18]'susan'
[28]'bob'
[]'steve'

MIN

Computes the least value of an attribute or expression in a group.

SYNTAX

min(input)

  • input
    • Type: any
    • Description: The attributes to compute the least value over

OUTPUT

any

EXAMPLE

GROUP min(age) AS result BY name
nameage
'susan'27
'susan'18
'susan'35
'bob'28
'bob'null
'bob'55

resultname
18'susan'
28'bob'

MIN_SIZE

Computes the minimum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.

SYNTAX

min_size(input)

  • input
    • Type: any
    • Description: The attributes to compute the minimum value over

OUTPUT

any

EXAMPLES

GROUP min_size(name) AS result BY age
agename
27'susan'
27'samuel'
27'sam'
54'chris'
54'bob'

resultage
'sam'27
'bob'54
GROUP min_size(cities) AS result BY state
statecities
'NY'['New York', 'Albany']
'WA'['Seattle', null]
'WA'['Spokane', 'Tacoma']
'CA'['Los Angeles', 'Pleasanton']
'CA'['San Franscisco', 'Berkeley', 'San Diego']

resultstate
['New York', 'Albany']'NY'
['Seattle', null]'WA'
['Los Angeles', 'Pleasanton']'CA'
GROUP min_size(special_case) AS result BY some_value
some_valuespecial_case
'case-1'['', '', '', 'one']
'case-1'['o', 'n', 'e']
'case-2'['one', '', '', '']
'case-2'['o', 'n', 'e']

resultsome_value
['', '', '', 'one']'case-1'
['o', 'n', 'e']'case-2'
GROUP min_size(special_case_2) AS result BY some_value
some_valuespecial_case_2
'case-1'['one', '', null]
'case-1'['o', 'n', 'e']
'case-1'['o', 'n', 'e', null]

resultsome_value
['one', '', null]'case-1'

MAX

Computes the greatest value of an attribute or expression in a group.

SYNTAX

max(input)

  • input
    • Type: any
    • Description: The attributes to compute the greatest value over

OUTPUT

any

EXAMPLE

GROUP max(age) AS result BY name
nameage
'susan'27
'susan'18
'susan'35
'bob'28
'bob'null
'bob'55

resultname
35'susan'
55'bob'

MAX_SIZE

Computes the maximum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.

SYNTAX

max_size(input)

  • input
    • Type: any
    • Description: The attributes to compute the maximum value over

OUTPUT

any

EXAMPLES

GROUP max_size(name) AS result BY age
agename
27'susan'
27'samuel'
27'sam'
54'chris'
54'bob'

resultage
'samuel'27
'chris'54
GROUP max_size(cities) AS result BY state
statecities
'NY'['New York', 'Albany']
'WA'['Seattle', null]
'WA'['Spokane', 'Tacoma']
'CA'['Los Angeles', 'Pleasanton']
'CA'['San Franscisco', 'Berkeley', 'San Diego']

resultstate
['New York', 'Albany']'NY'
['Spokane', 'Tacoma']'WA'
['San Franscisco', 'Berkeley', 'San Diego']'CA'
GROUP max_size(special_case) AS result BY some_value
some_valuespecial_case
'case-1'['', '', '', 'one']
'case-1'['o', 'n', 'e']
'case-2'['one', '', '', '']
'case-2'['o', 'n', 'e']

resultsome_value
['o', 'n', 'e']'case-1'
['one', '', '', '']'case-2'
GROUP max_size(special_case_2) AS result BY some_value
some_valuespecial_case_2
'case-1'['one', '', null]
'case-1'['o', 'n', 'e']
'case-1'['o', 'n', 'e', null]

resultsome_value
['o', 'n', 'e', null]'case-1'

COUNT

Counts the number of non-null items in a group.

SYNTAX

count(input)

  • input
    • Type: any
    • Description: The attributes whose items to count

OUTPUT

long

EXAMPLE

GROUP count(event) AS num_events BY region
regiondateevent
'New York''2010-10-15''light showers'
'New York''2010-10-16''strong winds'
'New York''2010-10-16''thundershowers'
'Bay Area''2010-10-15''light fog'
'Bay Area''2010-10-15'null
'Bay Area''2010-10-16''strong winds'

regionnum_events
'Bay Area'2
'New York'3

COUNT_DISTINCT

Counts the number of distinct non-null items in a group.

SYNTAX

count_distinct(input)

  • input
    • Type: any
    • Description: The attributes whose items to count

OUTPUT

long

EXAMPLE

GROUP count_distinct(event) AS num_events BY region
regiondateevent
'New York''2010-10-15''light showers'
'New York''2010-10-15''light showers'
'New York''2010-10-15''light showers'
'New York''2010-10-16''strong winds'
'New York''2010-10-16''thundershowers'
'Bay Area''2010-10-15''light fog'
'Bay Area''2010-10-15''light fog'
'Bay Area''2010-10-15'null
'Bay Area''2010-10-15'null
'Bay Area''2010-10-16''strong winds'

regionnum_events
'Bay Area'2
'New York'3

SUM

Computes the sum of all non-null values of an attribute in a group.

SYNTAX

sum(input)

  • input
    • Type: number
    • Description: The attribute whose values to sum

OUTPUT

number

EXAMPLE

GROUP sum(age) AS result BY name
nameage
'susan'27
'susan'18
'susan'35
'bob'28
'bob'null
'bob'55

resultname
80'susan'
83'bob'

AVG

Computes the average of all non-null values of an attribute in a group.

SYNTAX

avg(input)

  • input
    • Type: number
    • Description: The attributes whose values are averaged

OUTPUT

double

EXAMPLE

GROUP avg(age) AS result BY name
nameage
'susan'27
'susan'18
'susan'35
'bob'28
'bob'null
'bob'55
'alice'null

resultname
26.666666666666668'susan'
41.5'bob'
null'alice'

MEAN

Computes the mean of all non-null values of an attribute in a group. This is a synonym of the AVG function.

SYNTAX

mean(input)

  • input
    • Type: number
    • Description: The attributes over which the mean is calculated

OUTPUT

double

EXAMPLE

GROUP mean(age) AS result BY name
nameage
'susan'27
'susan'18
'susan'35
'bob'28
'bob'null
'bob'55

resultname
26.666666666666668'susan'
41.5'bob'

MODE

Computes the mode (most frequent value) of an attribute in a group. Ignores nulls. In the case of a tie, the least value will be returned.

SYNTAX

mode(input)

  • input
    • Type: any
    • Description: The attributes on which to compute the mode

OUTPUT

any

EXAMPLE

GROUP mode(age) AS result BY name
nameage
'susan'27
'susan'18
'susan'27
'susan'null
'susan'null
'susan'null
'bob'28
'bob'55
'bob'55
'steve'null
'stacy'38
'stacy'36
'stacy'37

resultname
27'susan'
55'bob'
null'steve'
36'stacy'

FIRST

Returns the first value in a attribute for each partition defined in an aggregation.

SYNTAX

first(data)

  • data
    • Type: any
    • Description: The attributes from which we want to return the first value.

OUTPUT

any

EXAMPLE

GROUP first(A) AS new_A BY B
AB
['jim', 'pam']'1'
[]'1'
['toby']'1'

Bnew_A
'1'['jim', 'pam']

LAST

Returns the last value in a attribute for each partition defined in an aggregation.

SYNTAX

last(data)

  • data
    • Type: any
    • Description: The attributes from which we want to return the last value.

OUTPUT

any

EXAMPLE

GROUP last(A) AS new_A BY B
AB
['jim', 'pam']'1'
[]'1'
['toby']'1'

Bnew_A
'1'['toby']

TOP

Computes K most frequently occurring distinct values per group. TOP breaks ties in value frequency by selecting an arbitrary value to sort first. Choose TOP if some values are expected to consistently occur more frequently than others. If there are many ties in value frequency then small changes in data can yield significant changes in which values TOP collects. For an alternative aggregation method, see COLLECT_SUBSET.

SYNTAX

top(K, input)

  • K
    • Type: int
    • Description: The number of values to return per group. Must be a positive integer.
  • input
    • Type: any
    • Description: The attribute on which to compute TOP K values.

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLES

GROUP top(2, age) AS result BY name
nameage
'susan'27
'susan'18
'susan'27
'susan'18
'susan'70
'susan'null
'susan'null
'susan'null
'bob'28
'bob'28
'steve'null

resultname
[27, 18]'susan'
[28]'bob'
[]'steve'
GROUP top(2, ...age) AS result BY name
nameage
'susan'[27, 42]
'susan'[70]
'susan'[27, 70]
'susan'[18, 30, 27]
'susan'[70]
'susan'[]
'susan'[null]
'susan'null
'bob'[28]
'bob'[28]
'steve'null

resultname
[27, 70]'susan'
[28]'bob'
[]'steve'

HISTOGRAM

Computes the histogram for the n most frequently occurring distinct values per group.

SYNTAX

histogram(n, input)

  • n
    • Type: int
    • Description: The number of values to return per group. Must be a positive integer.
  • input
    • Type: any
    • Description: The attribute on which to compute the histogram.

OUTPUT

{'type': 'array', 'elementType': {'type': 'record', 'fullySpecified': True, 'fields': [{'name': 'value', 'type': 'any'}, {'name': 'count', 'type': 'long'}]}}

EXAMPLES

GROUP histogram(2, age) AS result BY name
nameage
'susan'27
'susan'18
'susan'27
'susan'18
'susan'18
'susan'null
'susan'null
'susan'null
'bob'28
'bob'28
'steve'null

resultname
[{'count': 3, 'value': 18}, {'count': 2, 'value': 27}]'susan'
[{'count': 2, 'value': 28}]'bob'
[]'steve'
GROUP histogram(2, ...age) AS result BY name
nameage
'susan'[27, 42]
'susan'[70]
'susan'[27, 70]
'susan'[18, 30, 27]
'susan'[70]
'susan'[]
'susan'[null]
'susan'null
'bob'[28]
'bob'[28]
'steve'null

resultname
[{'count': 3, 'value': 27}, {'count': 3, 'value': 70}]'susan'
[{'count': 2, 'value': 28}]'bob'
[]'steve'
GROUP histogram(3, ...vehicles) AS result BY name
namevehicles
'John'[{'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}, {'model': 'Camry', 'year': null, 'make': 'Toyota'}, {'model': 'Civic', 'year': 1995, 'make': 'Honda'}]
'John'[{'model': null, 'year': 1995, 'make': 'Honda'}, {'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}]
'John'[{'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}, {'model': 'Camry', 'year': null, 'make': 'Toyota'}]
'Mary'[{'model': 'Camry', 'year': null, 'make': 'Toyota'}]
'Mary'[]

resultname
[{'count': 3, 'value': {'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}}, {'count': 2, 'value': {'model': 'Camry', 'year': null, 'make': 'Toyota'}}, {'count': 1, 'value': {'model': null, 'year': 1995, 'make': 'Honda'}}]'John'
[{'count': 1, 'value': {'model': 'Camry', 'year': null, 'make': 'Toyota'}}]'Mary'

COLLECT_LIST

Collects all non-null values of an attribute in a group. If the input dataset is ordered, then the resulting list is also ordered. WARNING: Using this function with a large number of records can result in errors. Consider using COLLECT_SET().See COLLECT_SET.

SYNTAX

collect_list(input)

  • input
    • Type: any
    • Description: The attributes whose values to collect

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLE

GROUP collect_list(employee) AS employees BY department
departmentemployee
'accounting''angela'
'sales''dwight'
'sales''jim'
'accounting''kevin'
'admin'null
'accounting''oscar'
'admin''pam'
'sales''phyllis'
'sales''stanley'
'admin''toby'

departmentemployees
'accounting'['angela', 'kevin', 'oscar']
'sales'['dwight', 'jim', 'phyllis', 'stanley']
'admin'['pam', 'toby']

COLLECT_SET

Collects all distinct, non-null values of an attribute in a group. The resulting list is not ordered. Similar function: COLLECT_LIST.

SYNTAX

collect_set(input)

  • input
    • Type: any
    • Description: The attributes whose values to collect

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLE

GROUP collect_set(employee) AS employees BY department
departmentemployee
'accounting''angela'
'sales''dwight'
'sales''dwight'
'sales''jim'
'accounting''kevin'
'accounting''kevin'
'accounting''oscar'
'accounting'null
'sales''phyllis'
'sales''stanley'
'sales''stanley'

departmentemployees
'accounting'['kevin', 'oscar', 'angela']
'sales'['jim', 'phyllis', 'stanley', 'dwight']

COLLECT_SUBSET

Collects a subset of the distinct, non-null values of an attribute in a group. The resulting list is not ordered. The selected subset is based on the hash value of the input values. As a result, this aggregation function does deterministic tie-breaking, since the same subset is returned for the same set of values. When values occur with roughly equal frequency in data, the results of COLLECT_SUBSET will be more stable than TOP in the face of small data changes.

SYNTAX

collect_subset(K, input)

  • K
    • Type: int
    • Description: The number of values to return per group. Must be a positive integer.
  • input
    • Type: any
    • Description: The attributes for which to collect a subset of values

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLES

GROUP collect_subset(2, employee) AS employees BY department
departmentemployee
'accounting''angela'
'sales''dwight'
'sales''dwight'
'sales''jim'
'accounting''kevin'
'accounting''kevin'
'accounting''oscar'
'accounting'null
'sales''phyllis'
'sales''stanley'
'sales''stanley'

departmentemployees
'accounting'['kevin', 'angela']
'sales'['dwight', 'stanley']
GROUP collect_subset(3, ...employee) AS employees BY department
departmentemployee
'accounting'['angela', 'oscar']
'sales'['dwight']
'sales'['dwight', null]
'sales'['jim', 'walter']
'accounting'['kevin', 'john']
'accounting'['kevin', 'frank']
'accounting'['oscar']
'accounting'null
'sales'['phyllis', 'mark']
'sales'['stanley']
'sales'['stanley', 'adam']

departmentemployees
'accounting'['angela', 'john', 'frank']
'sales'['stanley', 'dwight', 'walter']

STDDEV_SAMP

Computes the corrected sample standard deviation of all non-null values of an attribute in a group.

SYNTAX

stddev_samp(input)

  • input
    • Type: number
    • Description: The attribute whose standard deviation is calculated

OUTPUT

double

EXAMPLE

GROUP stddev(age) AS result BY name
nameage
'susan'1.0
'susan'2.0
'susan'3.0
'bob'10.0
'bob'20.0
'bob'30.0
'alice'20.0
'john'30.0
'john'30.0
'john'30.0

resultname
1.0'susan'
10.0'bob'
null'alice'
0.0'john'

STDDEV_POP

Computes the population standard deviation of all non-null values of an attribute in a group.

SYNTAX

stddev_pop(input)

  • input
    • Type: number
    • Description: The attribute whose standard deviation is calculated

OUTPUT

double

EXAMPLE

GROUP stddev_pop(age) AS result BY name
nameage
'susan'1.0
'susan'2.0
'susan'3.0
'bob'10.0
'bob'20.0
'bob'30.0

resultname
0.816496580927726'susan'
8.16496580927726'bob'

VAR_SAMP

Computes the corrected sample variance of all non-null values of an attribute in a group.

SYNTAX

var_samp(input)

  • input
    • Type: number
    • Description: The attribute whose variance is calculated

OUTPUT

double

EXAMPLE

GROUP var_samp(age) AS result BY name
nameage
'susan'1.0
'susan'2.0
'susan'3.0
'bob'10.0
'bob'20.0
'bob'30.0

resultname
1.0'susan'
100.0'bob'

VAR_POP

Computes the population variance of all non-null values of an attribute in a group.

SYNTAX

var_pop(input)

  • input
    • Type: number
    • Description: The attribute whose variance is calculated

OUTPUT

double

EXAMPLE

GROUP var_pop(age) AS result BY name
nameage
'susan'1.0
'susan'2.0
'susan'3.0
'bob'10.0
'bob'20.0
'bob'30.0

resultname
0.6666666666666666'susan'
66.66666666666667'bob'