AVG

Computes the average of all non-null values of an attribute in a group.

SYNTAX

avg()

OUTPUT

double

EXAMPLE

GROUP avg(age) AS result BY name

name	age
susan	27
susan	18
susan	35
bob	28
bob	null
bob	55

→

result	name
26.666666666666668	susan
41.5	bob

COLLECT_LIST

Collects all non-null values of an attribute in a group. If the input dataset is ordered, then the resulting list is also ordered. WARNING - using this function with a large number of records can result in errors. Please consider using collect_set().

SYNTAX

collect_list()

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLE

GROUP collect_list(employee) AS employees BY department

department	employee
accounting	angela
sales	dwight
sales	jim
accounting	kevin
admin	null
accounting	oscar
admin	pam
sales	phyllis
sales	stanley
admin	toby

→

department	employees
accounting	['angela', 'kevin', 'oscar']
sales	['dwight', 'jim', 'phyllis', 'stanley']
admin	['pam', 'toby']

COLLECT_SET

Collects all distinct, non-null values of an attribute in a group. The resulting list is not ordered.

SYNTAX

collect_set()

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLE

GROUP collect_set(employee) AS employees BY department

department	employee
accounting	angela
sales	dwight
sales	dwight
sales	jim
accounting	kevin
accounting	kevin
accounting	oscar
accounting	null
sales	phyllis
sales	stanley
sales	stanley

→

department	employees
accounting	['kevin', 'oscar', 'angela']
sales	['jim', 'phyllis', 'stanley', 'dwight']

COUNT

Counts the number of non-null items in a group.

SYNTAX

count()

OUTPUT

long

EXAMPLE

GROUP count(event) AS num_events BY region

region	event	date
New York	light showers	10-15-2010
New York	strong winds	10-16-2010
New York	thundershowers	10-16-2010
Bay Area	light fog	10-15-2010
Bay Area	null	10-15-2010
Bay Area	strong winds	10-16-2010

→

region	num_events
Bay Area	2
New York	3

COUNT_DISTINCT

Counts the number of distinct non-null items in a group.

SYNTAX

count_distinct()

OUTPUT

long

EXAMPLE

GROUP count_distinct(event) AS num_events BY region

region	event	date
New York	light showers	10-15-2010
New York	light showers	10-15-2010
New York	light showers	10-15-2010
New York	strong winds	10-16-2010
New York	thundershowers	10-16-2010
Bay Area	light fog	10-15-2010
Bay Area	light fog	10-15-2010
Bay Area	null	10-15-2010
Bay Area	null	10-15-2010
Bay Area	strong winds	10-16-2010

→

region	num_events
Bay Area	2
New York	3

DENSE_RANK

Assigns to each row its position according to the ORDER BY expression. Unlike rank(), dense_rank() values increase sequentially, one-by-one, regardless of ties.

SYNTAX

dense_rank()

OUTPUT

int

EXAMPLE

WINDOW DENSE_RANK() AS eventRank BY region ORDER BY date

region	event	date	region	event	date	eventRank
Bay Area	light fog	10-15-2010	Bay Area	light fog	10-15-2010	1
Bay Area	null	10-15-2010	Bay Area	null	10-15-2010	1
Bay Area	strong winds	10-16-2010	Bay Area	strong winds	10-16-2010	2

FIRST

Returns the first value in a attribute for each partition defined in an aggregation.

SYNTAX

first()

OUTPUT

any

EXAMPLE

GROUP first(A) AS new_A BY B

A	B
['jim', 'pam']	1
[]	1
['toby']	1

→

B	new_A
1	['jim', 'pam']

LAG

Computes an expression on a row that is behind the current one in the sort order.

SYNTAX

lag(input, offset)

input
- Type: any
- Description: The expression to compute on a previous row
offset
- Type: int
- Description: The number of rows previous to lag. Must be a positive integer constant.

OUTPUT

any

EXAMPLE

WINDOW event, LAG(event, 1) AS previousEvent BY region ORDER BY date

region	event	date	region	previousEvent	event	date
Bay Area	light fog	10-15-2010	Bay Area	null	light fog	10-15-2010
Bay Area	null	10-15-2010	Bay Area	light fog	null	10-15-2010
Bay Area	strong winds	10-16-2010	Bay Area	null	strong winds	10-16-2010

LAST

Returns the last value in a attribute for each partition defined in an aggregation.

SYNTAX

last()

OUTPUT

any

EXAMPLE

GROUP last(A) AS new_A BY B

A	B
['jim', 'pam']	1
[]	1
['toby']	1

→

B	new_A
1	['toby']

LEAD

Computes an expression on a row that is ahead of the current one in the sort order.

SYNTAX

lead(input, offset)

input
- Type: any
- Description: The expression to compute on an upcoming row
offset
- Type: int
- Description: The number of rows following to skip. Must be a positive integer constant.

OUTPUT

any

EXAMPLE

WINDOW event, LEAD(event, 1) AS nextEvent BY region ORDER BY date

region	event	date	region	nextEvent	event	date
Bay Area	light fog	10-15-2010	Bay Area	null	light fog	10-15-2010
Bay Area	null	10-15-2010	Bay Area	strong winds	null	10-15-2010
Bay Area	strong winds	10-16-2010	Bay Area	null	strong winds	10-16-2010

MAX

Computes the greatest value of an attribute or expression in a group.

SYNTAX

max()

OUTPUT

any

EXAMPLE

GROUP max(age) AS result BY name

name	age
susan	27
susan	18
susan	35
bob	28
bob	null
bob	55

→

result	name
35	susan
55	bob

MAX_SIZE

Computes the maximum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.

SYNTAX

max_size()

OUTPUT

any

EXAMPLES

GROUP max_size(name) AS result BY age

age	name
27	susan
27	samuel
27	sam
54	chris
54	bob

→

result	age
samuel	27
chris	54

GROUP max_size(cities) AS result BY state

cities	state
['New York', 'Albany']	NY
['Seattle', null]	WA
['Spokane', 'Tacoma']	WA
['Los Angeles', 'Pleasanton']	CA
['San Franscisco', 'Berkeley', 'San Diego']	CA

→

result	state
['New York', 'Albany']	NY
['Spokane', 'Tacoma']	WA
['San Franscisco', 'Berkeley', 'San Diego']	CA

GROUP max_size(special_case) AS result BY some_value

special_case	some_value
['', '', '', 'one']	case-1
['o', 'n', 'e']	case-1
['one', '', '', '']	case-2
['o', 'n', 'e']	case-2

→

result	some_value
['o', 'n', 'e']	case-1
['one', '', '', '']	case-2

GROUP max_size(special_case_2) AS result BY some_value

special_case_2	some_value
['one', '', null]	case-1
['o', 'n', 'e']	case-1
['o', 'n', 'e', null]	case-1

→

result	some_value
['o', 'n', 'e', null]	case-1

MEAN

Computes the mean of all non-null values of an attribute in a group. This is a synonym of the AVG function.

SYNTAX

mean()

OUTPUT

double

EXAMPLE

GROUP mean(age) AS result BY name

name	age
susan	27
susan	18
susan	35
bob	28
bob	null
bob	55

→

result	name
26.666666666666668	susan
41.5	bob

MIN

Computes the least value of an attribute or expression in a group.

SYNTAX

min()

OUTPUT

any

EXAMPLE

GROUP min(age) AS result BY name

name	age
susan	27
susan	18
susan	35
bob	28
bob	null
bob	55

→

result	name
18	susan
28	bob

MIN_SIZE

Computes the minimum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.

SYNTAX

min_size()

OUTPUT

any

EXAMPLES

GROUP min_size(name) AS result BY age

age	name
27	susan
27	samuel
27	sam
54	chris
54	bob

→

result	age
sam	27
bob	54

GROUP min_size(cities) AS result BY state

cities	state
['New York', 'Albany']	NY
['Seattle', null]	WA
['Spokane', 'Tacoma']	WA
['Los Angeles', 'Pleasanton']	CA
['San Franscisco', 'Berkeley', 'San Diego']	CA

→

result	state
['New York', 'Albany']	NY
['Seattle', null]	WA
['Los Angeles', 'Pleasanton']	CA

GROUP min_size(special_case) AS result BY some_value

special_case	some_value
['', '', '', 'one']	case-1
['o', 'n', 'e']	case-1
['one', '', '', '']	case-2
['o', 'n', 'e']	case-2

→

result	some_value
['', '', '', 'one']	case-1
['o', 'n', 'e']	case-2

GROUP min_size(special_case_2) AS result BY some_value

special_case_2	some_value
['one', '', null]	case-1
['o', 'n', 'e']	case-1
['o', 'n', 'e', null]	case-1

→

result	some_value
['one', '', null]	case-1

MODE

Computes the mode (most frequent value) of an attribute in a group. Ignores nulls. In the case of a tie, the least value will be returned.

SYNTAX

mode()

OUTPUT

any

EXAMPLE

GROUP mode(age) AS result BY name

name	age
susan	27
susan	18
susan	27
susan	null
susan	null
susan	null
bob	28
bob	55
bob	55
steve	null
stacy	38
stacy	36
stacy	37

→

result	name
27	susan
55	bob
null	steve
36	stacy

RANK

Assigns to each row its position in the partition according to the ORDER BY expression.

SYNTAX

rank()

OUTPUT

int

EXAMPLE

WINDOW RANK() AS eventRank BY region ORDER BY date

region	event	date	region	event	date	eventRank
Bay Area	light fog	10-15-2010	Bay Area	light fog	10-15-2010	1
Bay Area	null	10-15-2010	Bay Area	null	10-15-2010	1
Bay Area	strong winds	10-16-2010	Bay Area	strong winds	10-16-2010	3

ROW_NUMBER

Assigns a unique sequential ID number to each row in a partition, starting with 1.

SYNTAX

row_number()

OUTPUT

int

EXAMPLE

WINDOW ROW_NUMBER() AS eventId BY region ORDER BY date

region	event	date	region	eventId	event	date
Bay Area	light fog	10-15-2010	Bay Area	1	light fog	10-15-2010
Bay Area	null	10-15-2010	Bay Area	2	null	10-15-2010
Bay Area	strong winds	10-16-2010	Bay Area	3	strong winds	10-16-2010

STDDEV_POP

Computes the population standard deviation of all non-null values of an attribute in a group.

SYNTAX

stddev_pop()

OUTPUT

double

EXAMPLE

GROUP stddev_pop(age) AS result BY name

name	age
susan	1.0
susan	2.0
susan	3.0
bob	10.0
bob	20.0
bob	30.0

→

result	name
0.816496580927726	susan
8.16496580927726	bob

STDDEV_SAMP

Computes the corrected sample standard deviation of all non-null values of an attribute in a group.

SYNTAX

stddev_samp()

OUTPUT

double

EXAMPLE

GROUP stddev(age) AS result BY name

name	age
susan	1.0
susan	2.0
susan	3.0
bob	10.0
bob	20.0
bob	30.0

→

result	name
1.0	susan
10.0	bob

SUM

Computes the sum of all non-null values of an attribute in a group.

SYNTAX

sum()

OUTPUT

number

EXAMPLE

GROUP sum(age) AS result BY name

name	age
susan	27
susan	18
susan	35
bob	28
bob	null
bob	55

→

result	name
80	susan
83	bob

TOP

Computes n most frequently occurring distinct values per group.

SYNTAX

top(input, n)

input
- Type: any
- Description: The attribute on which to compute top n values.
n
- Type: int
- Description: The number of values to return per group. Must be a positive integer.

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLE

GROUP top(age, 2) AS result BY name

name	age
susan	27
susan	18
susan	27
susan	18
susan	70
susan	null
susan	null
susan	null
bob	28
bob	28
steve	null

→

result	name
[27, 18]	susan
[28]	bob
[]	steve

VAR_POP

Computes the population variance of all non-null values of an attribute in a group.

SYNTAX

var_pop()

OUTPUT

double

EXAMPLE

GROUP var_pop(age) AS result BY name

name	age
susan	1.0
susan	2.0
susan	3.0
bob	10.0
bob	20.0
bob	30.0

→

result	name
0.6666666666666666	susan
66.66666666666667	bob

VAR_SAMP

Computes the corrected sample variance of all non-null values of an attribute in a group.

SYNTAX

var_samp()

OUTPUT

double

EXAMPLE

GROUP var_samp(age) AS result BY name

name	age
susan	1.0
susan	2.0
susan	3.0
bob	10.0
bob	20.0
bob	30.0

→

result	name
1.0	susan
100.0	bob