DENSE_RANK

Assigns to each row its position according to the ORDER BY expression. Unlike rank(), dense_rank() values increase sequentially, one-by-one, regardless of ties.

SYNTAX

dense_rank()

OUTPUT

int

EXAMPLE

WINDOW DENSE_RANK() AS eventRank BY region ORDER BY date

region	date	event	region	event	date	eventRank
'Bay Area'	'2010-10-15'	'light fog'	'Bay Area'	'light fog'	'2010-10-15'	1
'Bay Area'	'2010-10-15'	null	'Bay Area'	null	'2010-10-15'	1
'Bay Area'	'2010-10-16'	'strong winds'	'Bay Area'	'strong winds'	'2010-10-16'	2

LAG

Computes an expression on a row that is behind the current one in the sort order.

SYNTAX

lag(input, offset)

input
- Type: any
- Description: The expression to compute on a previous row
offset
- Type: int
- Description: The number of rows previous to lag. Must be a positive integer constant.

OUTPUT

any

EXAMPLE

WINDOW event, LAG(event, 1) AS previousEvent BY region ORDER BY date

region	date	event	region	previousEvent	event	date
'Bay Area'	'2010-10-15'	'light fog'	'Bay Area'	null	'light fog'	'2010-10-15'
'Bay Area'	'2010-10-15'	null	'Bay Area'	'light fog'	null	'2010-10-15'
'Bay Area'	'2010-10-16'	'strong winds'	'Bay Area'	null	'strong winds'	'2010-10-16'

LEAD

Computes an expression on a row that is ahead of the current one in the sort order.

SYNTAX

lead(input, offset)

input
- Type: any
- Description: The expression to compute on an upcoming row
offset
- Type: int
- Description: The number of rows following to skip. Must be a positive integer constant.

OUTPUT

any

EXAMPLE

WINDOW event, LEAD(event, 1) AS nextEvent BY region ORDER BY date

region	date	event	region	nextEvent	event	date
'Bay Area'	'2010-10-15'	'light fog'	'Bay Area'	null	'light fog'	'2010-10-15'
'Bay Area'	'2010-10-15'	null	'Bay Area'	'strong winds'	null	'2010-10-15'
'Bay Area'	'2010-10-16'	'strong winds'	'Bay Area'	null	'strong winds'	'2010-10-16'

RANK

Assigns to each row its position in the partition according to the ORDER BY expression.

SYNTAX

rank()

OUTPUT

int

EXAMPLE

WINDOW RANK() AS eventRank BY region ORDER BY date

region	date	event	region	event	date	eventRank
'Bay Area'	'2010-10-15'	'light fog'	'Bay Area'	'light fog'	'2010-10-15'	1
'Bay Area'	'2010-10-15'	null	'Bay Area'	null	'2010-10-15'	1
'Bay Area'	'2010-10-16'	'strong winds'	'Bay Area'	'strong winds'	'2010-10-16'	3

ROW_NUMBER

Assigns a unique sequential ID number to each row in a partition, starting with 1.

SYNTAX

row_number()

OUTPUT

int

EXAMPLE

WINDOW ROW_NUMBER() AS eventId BY region ORDER BY date

region	date	event	region	eventId	event	date
'Bay Area'	'2010-10-15'	'light fog'	'Bay Area'	1	'light fog'	'2010-10-15'
'Bay Area'	'2010-10-15'	null	'Bay Area'	2	null	'2010-10-15'
'Bay Area'	'2010-10-16'	'strong winds'	'Bay Area'	3	'strong winds'	'2010-10-16'

LEGACY.TOP

This function computes K most frequently occurring distinct values per group. This function is DEPRECATED because of its inability to support vararg argument. Use the current TOP function instead. Note that the parameter 'K' is the first parameter in the current TOP function.

SYNTAX

legacy.top(input, K)

input
- Type: any
- Description: The attribute on which to compute TOP K values.
K
- Type: int
- Description: The number of values to return per group. Must be a positive integer.

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLE

GROUP legacy.top(age, 2) AS result BY name

name	age
'susan'	27
'susan'	18
'susan'	27
'susan'	18
'susan'	70
'susan'	null
'susan'	null
'susan'	null
'bob'	28
'bob'	28
'steve'	null

→

result	name
[27, 18]	'susan'
[28]	'bob'
[]	'steve'

MIN

Computes the least value of an attribute or expression in a group.

SYNTAX

min(input)

input
- Type: any
- Description: The attributes to compute the least value over

OUTPUT

any

EXAMPLE

GROUP min(age) AS result BY name

name	age
'susan'	27
'susan'	18
'susan'	35
'bob'	28
'bob'	null
'bob'	55

→

result	name
18	'susan'
28	'bob'

MIN_SIZE

Computes the minimum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.

SYNTAX

min_size(input)

input
- Type: any
- Description: The attributes to compute the minimum value over

OUTPUT

any

EXAMPLES

GROUP min_size(name) AS result BY age

age	name
27	'susan'
27	'samuel'
27	'sam'
54	'chris'
54	'bob'

→

result	age
'sam'	27
'bob'	54

GROUP min_size(cities) AS result BY state

state	cities
'NY'	['New York', 'Albany']
'WA'	['Seattle', null]
'WA'	['Spokane', 'Tacoma']
'CA'	['Los Angeles', 'Pleasanton']
'CA'	['San Franscisco', 'Berkeley', 'San Diego']

→

result	state
['New York', 'Albany']	'NY'
['Seattle', null]	'WA'
['Los Angeles', 'Pleasanton']	'CA'

GROUP min_size(special_case) AS result BY some_value

some_value	special_case
'case-1'	['', '', '', 'one']
'case-1'	['o', 'n', 'e']
'case-2'	['one', '', '', '']
'case-2'	['o', 'n', 'e']

→

result	some_value
['', '', '', 'one']	'case-1'
['o', 'n', 'e']	'case-2'

GROUP min_size(special_case_2) AS result BY some_value

some_value	special_case_2
'case-1'	['one', '', null]
'case-1'	['o', 'n', 'e']
'case-1'	['o', 'n', 'e', null]

→

result	some_value
['one', '', null]	'case-1'

MAX

Computes the greatest value of an attribute or expression in a group.

SYNTAX

max(input)

input
- Type: any
- Description: The attributes to compute the greatest value over

OUTPUT

any

EXAMPLE

GROUP max(age) AS result BY name

name	age
'susan'	27
'susan'	18
'susan'	35
'bob'	28
'bob'	null
'bob'	55

→

result	name
35	'susan'
55	'bob'

MAX_SIZE

Computes the maximum value of an attribute or expression based on the bit size of the elements contained in a group. For arrays, the bit size of its length is added to the sum of the bit size of each of its elements.

SYNTAX

max_size(input)

input
- Type: any
- Description: The attributes to compute the maximum value over

OUTPUT

any

EXAMPLES

GROUP max_size(name) AS result BY age

age	name
27	'susan'
27	'samuel'
27	'sam'
54	'chris'
54	'bob'

→

result	age
'samuel'	27
'chris'	54

GROUP max_size(cities) AS result BY state

state	cities
'NY'	['New York', 'Albany']
'WA'	['Seattle', null]
'WA'	['Spokane', 'Tacoma']
'CA'	['Los Angeles', 'Pleasanton']
'CA'	['San Franscisco', 'Berkeley', 'San Diego']

→

result	state
['New York', 'Albany']	'NY'
['Spokane', 'Tacoma']	'WA'
['San Franscisco', 'Berkeley', 'San Diego']	'CA'

GROUP max_size(special_case) AS result BY some_value

some_value	special_case
'case-1'	['', '', '', 'one']
'case-1'	['o', 'n', 'e']
'case-2'	['one', '', '', '']
'case-2'	['o', 'n', 'e']

→

result	some_value
['o', 'n', 'e']	'case-1'
['one', '', '', '']	'case-2'

GROUP max_size(special_case_2) AS result BY some_value

some_value	special_case_2
'case-1'	['one', '', null]
'case-1'	['o', 'n', 'e']
'case-1'	['o', 'n', 'e', null]

→

result	some_value
['o', 'n', 'e', null]	'case-1'

COUNT

Counts the number of non-null items in a group.

SYNTAX

count(input)

input
- Type: any
- Description: The attributes whose items to count

OUTPUT

long

EXAMPLE

GROUP count(event) AS num_events BY region

region	date	event
'New York'	'2010-10-15'	'light showers'
'New York'	'2010-10-16'	'strong winds'
'New York'	'2010-10-16'	'thundershowers'
'Bay Area'	'2010-10-15'	'light fog'
'Bay Area'	'2010-10-15'	null
'Bay Area'	'2010-10-16'	'strong winds'

→

region	num_events
'Bay Area'	2
'New York'	3

COUNT_DISTINCT

Counts the number of distinct non-null items in a group.

SYNTAX

count_distinct(input)

input
- Type: any
- Description: The attributes whose items to count

OUTPUT

long

EXAMPLE

GROUP count_distinct(event) AS num_events BY region

region	date	event
'New York'	'2010-10-15'	'light showers'
'New York'	'2010-10-15'	'light showers'
'New York'	'2010-10-15'	'light showers'
'New York'	'2010-10-16'	'strong winds'
'New York'	'2010-10-16'	'thundershowers'
'Bay Area'	'2010-10-15'	'light fog'
'Bay Area'	'2010-10-15'	'light fog'
'Bay Area'	'2010-10-15'	null
'Bay Area'	'2010-10-15'	null
'Bay Area'	'2010-10-16'	'strong winds'

→

region	num_events
'Bay Area'	2
'New York'	3

SUM

Computes the sum of all non-null values of an attribute in a group.

SYNTAX

sum(input)

input
- Type: number
- Description: The attribute whose values to sum

OUTPUT

number

EXAMPLE

GROUP sum(age) AS result BY name

name	age
'susan'	27
'susan'	18
'susan'	35
'bob'	28
'bob'	null
'bob'	55

→

result	name
80	'susan'
83	'bob'

AVG

Computes the average of all non-null values of an attribute in a group.

SYNTAX

avg(input)

input
- Type: number
- Description: The attributes whose values are averaged

OUTPUT

double

EXAMPLE

GROUP avg(age) AS result BY name

name	age
'susan'	27
'susan'	18
'susan'	35
'bob'	28
'bob'	null
'bob'	55
'alice'	null

→

result	name
26.666666666666668	'susan'
41.5	'bob'
null	'alice'

MEAN

Computes the mean of all non-null values of an attribute in a group. This is a synonym of the AVG function.

SYNTAX

mean(input)

input
- Type: number
- Description: The attributes over which the mean is calculated

OUTPUT

double

EXAMPLE

GROUP mean(age) AS result BY name

name	age
'susan'	27
'susan'	18
'susan'	35
'bob'	28
'bob'	null
'bob'	55

→

result	name
26.666666666666668	'susan'
41.5	'bob'

MODE

Computes the mode (most frequent value) of an attribute in a group. Ignores nulls. In the case of a tie, the least value will be returned.

SYNTAX

mode(input)

input
- Type: any
- Description: The attributes on which to compute the mode

OUTPUT

any

EXAMPLE

GROUP mode(age) AS result BY name

name	age
'susan'	27
'susan'	18
'susan'	27
'susan'	null
'susan'	null
'susan'	null
'bob'	28
'bob'	55
'bob'	55
'steve'	null
'stacy'	38
'stacy'	36
'stacy'	37

→

result	name
27	'susan'
55	'bob'
null	'steve'
36	'stacy'

FIRST

Returns the first value in a attribute for each partition defined in an aggregation.

SYNTAX

first(data)

data
- Type: any
- Description: The attributes from which we want to return the first value.

OUTPUT

any

EXAMPLE

GROUP first(A) AS new_A BY B

A	B
['jim', 'pam']	'1'
[]	'1'
['toby']	'1'

→

B	new_A
'1'	['jim', 'pam']

LAST

Returns the last value in a attribute for each partition defined in an aggregation.

SYNTAX

last(data)

data
- Type: any
- Description: The attributes from which we want to return the last value.

OUTPUT

any

EXAMPLE

GROUP last(A) AS new_A BY B

A	B
['jim', 'pam']	'1'
[]	'1'
['toby']	'1'

→

B	new_A
'1'	['toby']

TOP

Computes K most frequently occurring distinct values per group. TOP breaks ties in value frequency by selecting an arbitrary value to sort first. Choose TOP if some values are expected to consistently occur more frequently than others. If there are many ties in value frequency then small changes in data can yield significant changes in which values TOP collects. For an alternative aggregation method, see COLLECT_SUBSET.

SYNTAX

top(K, input)

K
- Type: int
- Description: The number of values to return per group. Must be a positive integer.
input
- Type: any
- Description: The attribute on which to compute TOP K values.

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLES

GROUP top(2, age) AS result BY name

name	age
'susan'	27
'susan'	18
'susan'	27
'susan'	18
'susan'	70
'susan'	null
'susan'	null
'susan'	null
'bob'	28
'bob'	28
'steve'	null

→

result	name
[27, 18]	'susan'
[28]	'bob'
[]	'steve'

GROUP top(2, ...age) AS result BY name

name	age
'susan'	[27, 42]
'susan'	[70]
'susan'	[27, 70]
'susan'	[18, 30, 27]
'susan'	[70]
'susan'	[]
'susan'	[null]
'susan'	null
'bob'	[28]
'bob'	[28]
'steve'	null

→

result	name
[27, 70]	'susan'
[28]	'bob'
[]	'steve'

HISTOGRAM

Computes the histogram for the n most frequently occurring distinct values per group.

SYNTAX

histogram(n, input)

n
- Type: int
- Description: The number of values to return per group. Must be a positive integer.
input
- Type: any
- Description: The attribute on which to compute the histogram.

OUTPUT

{'type': 'array', 'elementType': {'type': 'record', 'fullySpecified': True, 'fields': [{'name': 'value', 'type': 'any'}, {'name': 'count', 'type': 'long'}]}}

EXAMPLES

GROUP histogram(2, age) AS result BY name

name	age
'susan'	27
'susan'	18
'susan'	27
'susan'	18
'susan'	18
'susan'	null
'susan'	null
'susan'	null
'bob'	28
'bob'	28
'steve'	null

→

result	name
[{'count': 3, 'value': 18}, {'count': 2, 'value': 27}]	'susan'
[{'count': 2, 'value': 28}]	'bob'
[]	'steve'

GROUP histogram(2, ...age) AS result BY name

name	age
'susan'	[27, 42]
'susan'	[70]
'susan'	[27, 70]
'susan'	[18, 30, 27]
'susan'	[70]
'susan'	[]
'susan'	[null]
'susan'	null
'bob'	[28]
'bob'	[28]
'steve'	null

→

result	name
[{'count': 3, 'value': 27}, {'count': 3, 'value': 70}]	'susan'
[{'count': 2, 'value': 28}]	'bob'
[]	'steve'

GROUP histogram(3, ...vehicles) AS result BY name

name	vehicles
'John'	[{'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}, {'model': 'Camry', 'year': null, 'make': 'Toyota'}, {'model': 'Civic', 'year': 1995, 'make': 'Honda'}]
'John'	[{'model': null, 'year': 1995, 'make': 'Honda'}, {'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}]
'John'	[{'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}, {'model': 'Camry', 'year': null, 'make': 'Toyota'}]
'Mary'	[{'model': 'Camry', 'year': null, 'make': 'Toyota'}]
'Mary'	[]

→

result	name
[{'count': 3, 'value': {'model': 'Corolla', 'year': 2002, 'make': 'Toyota'}}, {'count': 2, 'value': {'model': 'Camry', 'year': null, 'make': 'Toyota'}}, {'count': 1, 'value': {'model': null, 'year': 1995, 'make': 'Honda'}}]	'John'
[{'count': 1, 'value': {'model': 'Camry', 'year': null, 'make': 'Toyota'}}]	'Mary'

COLLECT_LIST

Collects all non-null values of an attribute in a group. If the input dataset is ordered, then the resulting list is also ordered. WARNING: Using this function with a large number of records can result in errors. Consider using COLLECT_SET().See COLLECT_SET.

SYNTAX

collect_list(input)

input
- Type: any
- Description: The attributes whose values to collect

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLE

GROUP collect_list(employee) AS employees BY department

department	employee
'accounting'	'angela'
'sales'	'dwight'
'sales'	'jim'
'accounting'	'kevin'
'admin'	null
'accounting'	'oscar'
'admin'	'pam'
'sales'	'phyllis'
'sales'	'stanley'
'admin'	'toby'

→

department	employees
'accounting'	['angela', 'kevin', 'oscar']
'sales'	['dwight', 'jim', 'phyllis', 'stanley']
'admin'	['pam', 'toby']

COLLECT_SET

Collects all distinct, non-null values of an attribute in a group. The resulting list is not ordered. Similar function: COLLECT_LIST.

SYNTAX

collect_set(input)

input
- Type: any
- Description: The attributes whose values to collect

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLE

GROUP collect_set(employee) AS employees BY department

department	employee
'accounting'	'angela'
'sales'	'dwight'
'sales'	'dwight'
'sales'	'jim'
'accounting'	'kevin'
'accounting'	'kevin'
'accounting'	'oscar'
'accounting'	null
'sales'	'phyllis'
'sales'	'stanley'
'sales'	'stanley'

→

department	employees
'accounting'	['kevin', 'oscar', 'angela']
'sales'	['jim', 'phyllis', 'stanley', 'dwight']

COLLECT_SUBSET

Collects a subset of the distinct, non-null values of an attribute in a group. The resulting list is not ordered. The selected subset is based on the hash value of the input values. As a result, this aggregation function does deterministic tie-breaking, since the same subset is returned for the same set of values. When values occur with roughly equal frequency in data, the results of COLLECT_SUBSET will be more stable than TOP in the face of small data changes.

SYNTAX

collect_subset(K, input)

K
- Type: int
- Description: The number of values to return per group. Must be a positive integer.
input
- Type: any
- Description: The attributes for which to collect a subset of values

OUTPUT

{'type': 'array', 'elementType': 'any'}

EXAMPLES

GROUP collect_subset(2, employee) AS employees BY department

department	employee
'accounting'	'angela'
'sales'	'dwight'
'sales'	'dwight'
'sales'	'jim'
'accounting'	'kevin'
'accounting'	'kevin'
'accounting'	'oscar'
'accounting'	null
'sales'	'phyllis'
'sales'	'stanley'
'sales'	'stanley'

→

department	employees
'accounting'	['kevin', 'angela']
'sales'	['dwight', 'stanley']

GROUP collect_subset(3, ...employee) AS employees BY department

department	employee
'accounting'	['angela', 'oscar']
'sales'	['dwight']
'sales'	['dwight', null]
'sales'	['jim', 'walter']
'accounting'	['kevin', 'john']
'accounting'	['kevin', 'frank']
'accounting'	['oscar']
'accounting'	null
'sales'	['phyllis', 'mark']
'sales'	['stanley']
'sales'	['stanley', 'adam']

→

department	employees
'accounting'	['angela', 'john', 'frank']
'sales'	['stanley', 'dwight', 'walter']

STDDEV_SAMP

Computes the corrected sample standard deviation of all non-null values of an attribute in a group.

SYNTAX

stddev_samp(input)

input
- Type: number
- Description: The attribute whose standard deviation is calculated

OUTPUT

double

EXAMPLE

GROUP stddev(age) AS result BY name

name	age
'susan'	1.0
'susan'	2.0
'susan'	3.0
'bob'	10.0
'bob'	20.0
'bob'	30.0
'alice'	20.0
'john'	30.0
'john'	30.0
'john'	30.0

→

result	name
1.0	'susan'
10.0	'bob'
null	'alice'
0.0	'john'

STDDEV_POP

Computes the population standard deviation of all non-null values of an attribute in a group.

SYNTAX

stddev_pop(input)

input
- Type: number
- Description: The attribute whose standard deviation is calculated

OUTPUT

double

EXAMPLE

GROUP stddev_pop(age) AS result BY name

name	age
'susan'	1.0
'susan'	2.0
'susan'	3.0
'bob'	10.0
'bob'	20.0
'bob'	30.0

→

result	name
0.816496580927726	'susan'
8.16496580927726	'bob'

VAR_SAMP

Computes the corrected sample variance of all non-null values of an attribute in a group.

SYNTAX

var_samp(input)

input
- Type: number
- Description: The attribute whose variance is calculated

OUTPUT

double

EXAMPLE

GROUP var_samp(age) AS result BY name

name	age
'susan'	1.0
'susan'	2.0
'susan'	3.0
'bob'	10.0
'bob'	20.0
'bob'	30.0

→

result	name
1.0	'susan'
100.0	'bob'

VAR_POP

Computes the population variance of all non-null values of an attribute in a group.

SYNTAX

var_pop(input)

input
- Type: number
- Description: The attribute whose variance is calculated

OUTPUT

double

EXAMPLE

GROUP var_pop(age) AS result BY name

name	age
'susan'	1.0
'susan'	2.0
'susan'	3.0
'bob'	10.0
'bob'	20.0
'bob'	30.0

→

result	name
0.6666666666666666	'susan'
66.66666666666667	'bob'