Window

Overview

Using the WINDOW statement in a transformation allows you to compute values for each row based on values from a group of rows. For example, you could rank purchases by size in each product category.

Original table:

Table ordered by "price_rank":

WINDOW statements offer a place to use window functions and aggregating functions. Each window or aggregating function operates on a neighborhood of related rows, known as a window frame. Common examples of window functions are moving averages, cumulative sums, row numbering and ranking.

Supported Functions

A variety of aggregating functions are supported for use in WINDOW statements. The full list is here.

Reference

A WINDOW statement is defined as follows:

WINDOW select_list BY partition_key [ORDER BY sort_expression [sort_expression …]]

Where select_list is a list of comma separated named expressions that may contain window functions or aggregating functions. A named expression requires you to specify an attribute (for example, rank() as sales_rank). For each row in the table, these expressions are computed over the current window frame.

partition_key is a list of variables that are used to divide the data into window frames, similar to the key in a GROUP BY statement. partition_key cannot be empty.

An ORDER BY clause is optional. An ORDER BY clause creates rolling window frames. The rolling window frame starts at the first row, and adds one row at a time to the computation until all rows have been evaluated. For example, it computes the cumulative sum for the top row, then the top two rows, then the top three rows, and so on, until the frame is complete. If you don't include the ORDER BY clause in the window statement, all rows in the window frame are included at once.

📘
When to use ORDER BY
Most window functions, including rank() and row_number(), require adding an ORDER BY statement. Aggregating functions do not require an ORDER BY, but you may optionally include it.

📘
Export
Exporting the unified dataset after applying a WINDOW statement does not maintain the row order. If you filled in an attribute using rank() or row_number(), then you can sort on these columns after exporting them to revive the row order used there.

Examples

Suppose we have an input dataset of sales transactions data from an ecommerce site. It might look like the following:

rank()

Now suppose that we’d like rank the purchases by size in each product category (total sales per category) so we can find the big purchases. We can use the following transformation:

WINDOW rank() AS price_rank BY product_category ORDER BY price DESC NULLS LAST;

This statement will partition the purchases by category, sort within each category by price, then compute the rank of each purchase. Here’s the result:

📘
How duplicates are handled in rank() vs row_number()
If two purchases have the same price, they will receive the same rank.
If, instead of rank(), we used the row_number() function, then ties between purchases in a product category will be arbitrarily ordered Each time the WINDOW statement is run, the ordering between the tied purchases may change.

sum()

The ranking is useful, but suppose we want a more quantitative measure. Say we’d like to compare purchases based on what fraction of overall daily revenue they represent within their product category. We can use the sum() function to find the total within each category, and use it to compute the percentage each sale represents.

WINDOW Price, 100.0 * Price / sum(Price) AS category_sales_percent BY product_category;

For each row, this will fill the category_sales_percent column with that (row’s price) divided by (the total sales within the row’s category), times (100) to convert to a percent. Here’s the result:

Note that the value in "category_sales_percent" will be the same as if we’d used it in a GROUP BY transformation. Unlike with GROUP BY, however, here the output dataset has the same number of rows as the input, instead of only one row per group.

Now suppose we’re interested in daily trends. What time of day are customers spending the most? Maybe we want to see the distribution of sales by time within each product category.

We can modify the last transformation by adding an ORDER BY clause to look for trends during the day:

WINDOW Price, sum(Price) AS cumulative_sales BY product_category ORDER BY timestamp;

This will populate, for each purchase, the "cumulative_sales column" with the total sales up to the current row’s time of day. In the table, you can see the "cumulative_sales" grow each row:

Notice that with the ORDER BY clause the window frame is rolling. So rather than considering the whole partition, each row’s cumulative_sales sums only the rows preceding the current row’s time of day.

lag and lead

Instead of calculating a cumulative value, you can also specify your window function to look at a row before or after the current row. This is useful when you want to calculate the change between two rows, for example.

You can write in the select_list expression a lag(), which references at a row before, or a lead(), which references a row after. The format is lag(attribute, integer) or lead(attribute, integer). The integer is how many rows before or after the expression will reference. The integer must be positive.

📘
Include an ORDER BY
You must include an ORDER BY if you are using lag() or lead(). It is important that you think about the order you want, since this will change which row is referenced by these functions.

For example, WINDOW Price, lead(Price,1) - Price AS Price_diff, BY product_category ORDER BY timestamp will produce a table with rows ordered by the timestamp for each product_category, and the column Price_diff will have the difference between the two prices: the price of the current row minus the price of the row after. The last row will be null since there is no row after it.

Custom Window Frames

Overview

For each row, the frame of a WINDOW statement defines the set of records that are the inputs to the window expressions for that row. The frame defines a local context for each row - essentially how many rows before and/or after the current row are included in the computation.

To give some examples, for a ranking window expression like row_number(), the frame is the current row and all the rows preceding it- --then row_number() returns the count of rows in the frame. For a 10-day moving average like avg(price), the window frame is the current row plus rows from the previous ten days.

A WINDOW statement can either use the default window frame, or else define a custom window frame explicitly.

Syntax

The syntax for a WINDOW statement with a custom window frame is as follows:

WINDOW select_list BY partition_key [ORDER BY sort_expression [sort_expression …]] frame_clause;

The frame_clause can be one of the following:

ROWS frame_start
RANGE frame_start
ROWS BETWEEN frame_start AND frame_end
RANGE BETWEEN frame_start AND frame_end

where frame_start and frame_end can be:

UNBOUNDED PRECEDING
value PRECEDING
CURRENT ROW
value FOLLOWING
UNBOUNDED FOLLOWING

If frame_end is omitted it defaults to CURRENT ROW. There are a few rules how to use frame_start and frame_end:

frame_start cannot be UNBOUNDED FOLLOWING
frame_end cannot be UNBOUNDED PRECEDING
The frame_end choice cannot appear earlier in the above list than the frame_start choice — for example RANGE BETWEEN CURRENT ROW AND value PRECEDING is not allowed
value must be a non-negative integer.

ROWS vs RANGE

The window frame defines an ordered set of rows before and/or after the current row. UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING denote all rows in the current partition before and after the current row, respectively. The meaning of CURRENT ROW depends on whether ROWS or RANGE is specified. In ROWS mode, CURRENT ROW means the current row, only. In RANGE mode, it means all rows that have the same value in the ORDER BY expression as the CURRENT ROW.

The meaning of value PRECEDING and value FOLLOWING also depends on whether ROWS or RANGE is specified. In ROWS mode, value indicates the number of rows before or after the current row, respectively, to include. In RANGE mode, value indicates the range to include based on values in the ORDER BY expression. For example, for the WINDOW statement:

WINDOW sum(Price) AS new_Price BY product_category ORDER BY timestamp RANGE 1 PRECEDING;

Then new_Price for each row will sum up the current row’s price with that of all rows on the same timestamp or from one timestamp before in that product category.

📘
RANGE frames

Require an ORDER BY clause

ORDER BY clause must have exactly one sort expression of numeric data type
Equal values in the ORDER BY expression are all treated the same. That means in RANGE, CURRENT ROW means the current row plus all rows with an identical value in the ORDER BY expression

📘
Window Functions
Window functions, including offset window functions lag() and lead(), define their own implicit window frames and are therefore incompatible with custom window frames. These functions can only be used with the default window frame.

Default Window Frames

The default window frame depends on the expressions in the expression list, and on whether there’s an ORDER BY clause or not.

If there is no ORDER BY clause, then the default window frame is:

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

That is, it’s the whole partition. That means that every row in a partition will compute the same values for each window expression. This is equivalent to a GROUP BY statement that does not merge together the rows in each partition.

ORDER BY without Offset Window Functions
When there is an ORDER BY and the window expressions do not include any window functions then the default window frame is:

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

That is, all the rows in the partition before the current one, plus any rows with the same value as the current row in the sort expression.

ORDER BY with Offset Window Functions
Offset window functions define their own window frames implicitly to match the function semantics. For example, lag(Price, 10) has the implicit window frame:

ROWS BETWEEN 10 PRECEDING AND 10 PRECEDING

That is, the offset functions set the frame to match the row they select exactly. Note that it’s okay to have multiple offset window functions defined in the same expression or in the same WINDOW statement. The backend sorts out how to combine all of their window frames appropriately. (Internally, one frame is applied to each offset window function.)

Window

Overview

Supported Functions

Reference

📘
When to use ORDER BY

📘
Export

Examples

rank()

📘
How duplicates are handled in rank() vs row_number()

sum()

lag and lead

📘
Include an ORDER BY

Custom Window Frames

Overview

Syntax

ROWS vs RANGE

📘
RANGE frames

📘
Window Functions

Default Window Frames

Overview

Supported Functions

Reference

📘When to use ORDER BY

📘Export

Examples

rank()

📘How duplicates are handled in rank() vs row_number()

sum()

lag and lead

📘Include an ORDER BY

Custom Window Frames

Overview

Syntax

ROWS vs RANGE

📘RANGE frames

📘Window Functions

Default Window Frames

📘
When to use ORDER BY

📘
Export

📘
How duplicates are handled in rank() vs row_number()

📘
Include an ORDER BY

📘
RANGE frames

📘
Window Functions