User GuidesAPI ReferenceRelease Notes
Doc HomeHelp CenterLog In
User Guides

Aggregate Functions

Aggregate functions.

AVG

Computes the average of all non-null values of an attribute in a group.

SYNTAX

avg()

OUTPUT

double

EXAMPLE

GROUP avg(age) AS result BY name
nameage
susan27
susan18
susan35
bob28
bobnull
bob55

resultname
26.666666666666668susan
41.5bob

COLLECT_LIST

Collects all non-null values of an attribute in a group. If the input dataset is ordered, then the resulting list is also ordered. WARNING - using this function with a large number of records can result in errors. Please consider using collect_set().

SYNTAX

collect_list()

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLE

GROUP collect_list(employee) AS employees BY department
departmentemployee
accountingangela
salesdwight
salesjim
accountingkevin
adminnull
accountingoscar
adminpam
salesphyllis
salesstanley
admintoby

departmentemployees
accounting['angela', 'kevin', 'oscar']
sales['dwight', 'jim', 'phyllis', 'stanley']
admin['pam', 'toby']

COLLECT_SET

Collects all distinct, non-null values of an attribute in a group. The resulting list is not ordered.

SYNTAX

collect_set()

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLE

GROUP collect_set(employee) AS employees BY department
departmentemployee
accountingangela
salesdwight
salesdwight
salesjim
accountingkevin
accountingkevin
accountingoscar
accountingnull
salesphyllis
salesstanley
salesstanley

departmentemployees
accounting['kevin', 'oscar', 'angela']
sales['jim', 'phyllis', 'stanley', 'dwight']

COUNT

Counts the number of non-null items in a group.

SYNTAX

count()

OUTPUT

long

EXAMPLE

GROUP count(event) AS num_events BY region
regioneventdate
New Yorklight showers10-15-2010
New Yorkstrong winds10-16-2010
New Yorkthundershowers10-16-2010
Bay Arealight fog10-15-2010
Bay Areanull10-15-2010
Bay Areastrong winds10-16-2010

regionnum_events
Bay Area2
New York3

COUNT_DISTINCT

Counts the number of distinct non-null items in a group.

SYNTAX

count_distinct()

OUTPUT

long

EXAMPLE

GROUP count_distinct(event) AS num_events BY region
regioneventdate
New Yorklight showers10-15-2010
New Yorklight showers10-15-2010
New Yorklight showers10-15-2010
New Yorkstrong winds10-16-2010
New Yorkthundershowers10-16-2010
Bay Arealight fog10-15-2010
Bay Arealight fog10-15-2010
Bay Areanull10-15-2010
Bay Areanull10-15-2010
Bay Areastrong winds10-16-2010

regionnum_events
Bay Area2
New York3

DENSE_RANK

Assigns to each row its position according to the ORDER BY expression. Unlike rank(), dense_rank() values increase sequentially, one-by-one, regardless of ties.

SYNTAX

dense_rank()

OUTPUT

int

EXAMPLE

WINDOW DENSE_RANK() AS eventRank BY region ORDER BY date
regioneventdateregioneventdateeventRank
Bay Arealight fog10-15-2010Bay Arealight fog10-15-20101
Bay Areanull10-15-2010Bay Areanull10-15-20101
Bay Areastrong winds10-16-2010Bay Areastrong winds10-16-20102

FIRST

Returns the first value in a attribute for each partition defined in an aggregation.

SYNTAX

first()

OUTPUT

any

EXAMPLE

GROUP first(A) AS new_A BY B
AB
['jim', 'pam']1
[]1
['toby']1

Bnew_A
1['jim', 'pam']

LAG

Computes an expression on a row that is behind the current one in the sort order.

SYNTAX

lag(input, offset)

  • input
    • Type: any
    • Description: The expression to compute on a previous row
  • offset
    • Type: int
    • Description: The number of rows previous to lag. Must be a positive integer constant.

OUTPUT

any

EXAMPLE

WINDOW event, LAG(event, 1) AS previousEvent BY region ORDER BY date
regioneventdateregionpreviousEventeventdate
Bay Arealight fog10-15-2010Bay Areanulllight fog10-15-2010
Bay Areanull10-15-2010Bay Arealight fognull10-15-2010
Bay Areastrong winds10-16-2010Bay Areanullstrong winds10-16-2010

LAST

Returns the last value in a attribute for each partition defined in an aggregation.

SYNTAX

last()

OUTPUT

any

EXAMPLE

GROUP last(A) AS new_A BY B
AB
['jim', 'pam']1
[]1
['toby']1

Bnew_A
1['toby']

LEAD

Computes an expression on a row that is ahead of the current one in the sort order.

SYNTAX

lead(input, offset)

  • input
    • Type: any
    • Description: The expression to compute on an upcoming row
  • offset
    • Type: int
    • Description: The number of rows following to skip. Must be a positive integer constant.

OUTPUT

any

EXAMPLE

WINDOW event, LEAD(event, 1) AS nextEvent BY region ORDER BY date
regioneventdateregionnextEventeventdate
Bay Arealight fog10-15-2010Bay Areanulllight fog10-15-2010
Bay Areanull10-15-2010Bay Areastrong windsnull10-15-2010
Bay Areastrong winds10-16-2010Bay Areanullstrong winds10-16-2010

MAX

Computes the greatest value of an attribute or expression in a group.

SYNTAX

max()

OUTPUT

any

EXAMPLE

GROUP max(age) AS result BY name
nameage
susan27
susan18
susan35
bob28
bobnull
bob55

resultname
35susan
55bob

MAX_SIZE

Computes the maximum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.

SYNTAX

max_size()

OUTPUT

any

EXAMPLES

GROUP max_size(name) AS result BY age
agename
27susan
27samuel
27sam
54chris
54bob

resultage
samuel27
chris54
GROUP max_size(cities) AS result BY state
citiesstate
['New York', 'Albany']NY
['Seattle', null]WA
['Spokane', 'Tacoma']WA
['Los Angeles', 'Pleasanton']CA
['San Franscisco', 'Berkeley', 'San Diego']CA

resultstate
['New York', 'Albany']NY
['Spokane', 'Tacoma']WA
['San Franscisco', 'Berkeley', 'San Diego']CA
GROUP max_size(special_case) AS result BY some_value
special_casesome_value
['', '', '', 'one']case-1
['o', 'n', 'e']case-1
['one', '', '', '']case-2
['o', 'n', 'e']case-2

resultsome_value
['o', 'n', 'e']case-1
['one', '', '', '']case-2
GROUP max_size(special_case_2) AS result BY some_value
special_case_2some_value
['one', '', null]case-1
['o', 'n', 'e']case-1
['o', 'n', 'e', null]case-1

resultsome_value
['o', 'n', 'e', null]case-1

MEAN

Computes the mean of all non-null values of an attribute in a group. This is a synonym of the AVG function.

SYNTAX

mean()

OUTPUT

double

EXAMPLE

GROUP mean(age) AS result BY name
nameage
susan27
susan18
susan35
bob28
bobnull
bob55

resultname
26.666666666666668susan
41.5bob

MIN

Computes the least value of an attribute or expression in a group.

SYNTAX

min()

OUTPUT

any

EXAMPLE

GROUP min(age) AS result BY name
nameage
susan27
susan18
susan35
bob28
bobnull
bob55

resultname
18susan
28bob

MIN_SIZE

Computes the minimum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.

SYNTAX

min_size()

OUTPUT

any

EXAMPLES

GROUP min_size(name) AS result BY age
agename
27susan
27samuel
27sam
54chris
54bob

resultage
sam27
bob54
GROUP min_size(cities) AS result BY state
citiesstate
['New York', 'Albany']NY
['Seattle', null]WA
['Spokane', 'Tacoma']WA
['Los Angeles', 'Pleasanton']CA
['San Franscisco', 'Berkeley', 'San Diego']CA

resultstate
['New York', 'Albany']NY
['Seattle', null]WA
['Los Angeles', 'Pleasanton']CA
GROUP min_size(special_case) AS result BY some_value
special_casesome_value
['', '', '', 'one']case-1
['o', 'n', 'e']case-1
['one', '', '', '']case-2
['o', 'n', 'e']case-2

resultsome_value
['', '', '', 'one']case-1
['o', 'n', 'e']case-2
GROUP min_size(special_case_2) AS result BY some_value
special_case_2some_value
['one', '', null]case-1
['o', 'n', 'e']case-1
['o', 'n', 'e', null]case-1

resultsome_value
['one', '', null]case-1

MODE

Computes the mode (most frequent value) of an attribute in a group. Ignores nulls. In the case of a tie, the least value will be returned.

SYNTAX

mode()

OUTPUT

any

EXAMPLE

GROUP mode(age) AS result BY name
nameage
susan27
susan18
susan27
susannull
susannull
susannull
bob28
bob55
bob55
stevenull
stacy38
stacy36
stacy37

resultname
27susan
55bob
nullsteve
36stacy

RANK

Assigns to each row its position in the partition according to the ORDER BY expression.

SYNTAX

rank()

OUTPUT

int

EXAMPLE

WINDOW RANK() AS eventRank BY region ORDER BY date
regioneventdateregioneventdateeventRank
Bay Arealight fog10-15-2010Bay Arealight fog10-15-20101
Bay Areanull10-15-2010Bay Areanull10-15-20101
Bay Areastrong winds10-16-2010Bay Areastrong winds10-16-20103

ROW_NUMBER

Assigns a unique sequential ID number to each row in a partition, starting with 1.

SYNTAX

row_number()

OUTPUT

int

EXAMPLE

WINDOW ROW_NUMBER() AS eventId BY region ORDER BY date
regioneventdateregioneventIdeventdate
Bay Arealight fog10-15-2010Bay Area1light fog10-15-2010
Bay Areanull10-15-2010Bay Area2null10-15-2010
Bay Areastrong winds10-16-2010Bay Area3strong winds10-16-2010

STDDEV_POP

Computes the population standard deviation of all non-null values of an attribute in a group.

SYNTAX

stddev_pop()

OUTPUT

double

EXAMPLE

GROUP stddev_pop(age) AS result BY name
nameage
susan1.0
susan2.0
susan3.0
bob10.0
bob20.0
bob30.0

resultname
0.816496580927726susan
8.16496580927726bob

STDDEV_SAMP

Computes the corrected sample standard deviation of all non-null values of an attribute in a group.

SYNTAX

stddev_samp()

OUTPUT

double

EXAMPLE

GROUP stddev(age) AS result BY name
nameage
susan1.0
susan2.0
susan3.0
bob10.0
bob20.0
bob30.0

resultname
1.0susan
10.0bob

SUM

Computes the sum of all non-null values of an attribute in a group.

SYNTAX

sum()

OUTPUT

number

EXAMPLE

GROUP sum(age) AS result BY name
nameage
susan27
susan18
susan35
bob28
bobnull
bob55

resultname
80susan
83bob

TOP

Computes n most frequently occurring distinct values per group.

SYNTAX

top(input, n)

  • input
    • Type: any
    • Description: The attribute on which to compute top n values.
  • n
    • Type: int
    • Description: The number of values to return per group. Must be a positive integer.

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLE

GROUP top(age, 2) AS result BY name
nameage
susan27
susan18
susan27
susan18
susan70
susannull
susannull
susannull
bob28
bob28
stevenull

resultname
[27, 18]susan
[28]bob
[]steve

VAR_POP

Computes the population variance of all non-null values of an attribute in a group.

SYNTAX

var_pop()

OUTPUT

double

EXAMPLE

GROUP var_pop(age) AS result BY name
nameage
susan1.0
susan2.0
susan3.0
bob10.0
bob20.0
bob30.0

resultname
0.6666666666666666susan
66.66666666666667bob

VAR_SAMP

Computes the corrected sample variance of all non-null values of an attribute in a group.

SYNTAX

var_samp()

OUTPUT

double

EXAMPLE

GROUP var_samp(age) AS result BY name
nameage
susan1.0
susan2.0
susan3.0
bob10.0
bob20.0
bob30.0

resultname
1.0susan
100.0bob