You can use a
WINDOW statement in a transformation to compute values for each row based on values from a group of rows. For example, you could use
WINDOW to rank purchases by size in each product category. (See the first example below.)
WINDOW statements give you a way to include windowing and aggregating functions in a script. Each windowing or aggregating function operates on a neighborhood of related rows, known as a window frame. Common examples of windowing functions are moving averages, cumulative sums, row numbering and ranking. (Examples follow.)
A variety of aggregating functions are supported for use in
WINDOW statements. See Aggregate Functions.
The syntax for a
WINDOW statement is defined as follows:
WINDOW select_list BY partition_key [ORDER BY sort_expression [sort_expression …]]
select_listis a list of comma-separated named expressions that may contain windowing functions or aggregating functions. A named expression requires you to specify an attribute (for example,
rank() AS sales_rank). For each row in the table, these expressions are computed over the current window frame.
partition_keyis a list of variables that are used to divide the data into window frames, similar to the key in a
partition_keycannot be empty.
ORDER BYcan be used to create rolling window frames. The rolling window frame starts at the first row, and adds one row at a time to the computation until all rows have been evaluated. For example, it computes the cumulative sum for the top row, then the top two rows, then the top three rows, and so on, until the frame is complete. If you don't include the
ORDER BYclause in the
WINDOWstatement, all rows in the window frame are included at once.
Tip: Most windowing functions, including
row_number(), require adding an
ORDER BY statement. Aggregating functions do not require an
ORDER BY, but you may optionally include it.
Exporting the unified dataset after applying a
WINDOW statement does not maintain row order. You can supply values for an attribute using the
row_number() functions and then sort on these columns after exporting to reapply a row order.
Suppose you have an input dataset of sales transactions data from an e-commerce site. It might look like the following:
Now suppose that you’d like to rank the purchases by size in each product category (total sales per category) so that you can find the big purchases. You can use the following transformation:
WINDOW rank() AS price_rank BY product_category ORDER BY price DESC NULLS LAST;
This statement will partition the purchases by category, sort within each category by price, then compute the rank of each purchase. Here’s the result:
When you use the
rank() function, if two purchases have the same price they receive the same rank.
If you use the
row_number() function instead of
rank() then any tie values between purchases in a product category are arbitrarily ordered. Each time the
WINDOW statement is run, the ordering between the tied purchases can change.
While ranking can be useful, suppose you want a more quantitative measure. Say you'd like to compare purchases based on what fraction of overall daily revenue they represent within their product category. You can use the
sum() function to find the total within each category, and use it to compute the percentage each sale represents.
WINDOW Price, 100.0 * Price / sum(Price) AS category_sales_percent BY product_category;
For each row, this statement will fill the category_sales_percent column with that (row’s price) divided by (the total sales within the row’s category), times (100) to convert to a percent. Here’s the result:
The value in category_sales_percent is the same as if you'd used it in a
GROUP BY transformation. Unlike a
GROUP BY, however, here the output dataset has the same number of rows as the input, instead of only one row per group.
Now suppose you’re interested in daily trends. What time of day are customers spending the most? Maybe you want to see the distribution of sales by time within each product category.
You can modify the last transformation by adding an
ORDER BY clause to look for trends during the day:
WINDOW Price, sum(Price) AS cumulative_sales BY product_category ORDER BY timestamp;
For each purchase this script populates the cumulative_sales column with the total sales up to the current row’s time of day. In the table, you can see the cumulative_sales grow in each row:
Notice that with the
ORDER BY clause the window frame is rolling. So rather than considering the whole partition, each row’s cumulative_sales sums only the rows preceding the current row’s time of day.
Instead of calculating a cumulative value, you can also specify your
WINDOW statement to look at a row before or after the current row. This is useful when you want to calculate the change between two rows, for example.
You can write a
lag() in the
select_list expression to reference a row before, or a
lead() to reference a row after. The format is
lag(attribute, integer) or
lead(attribute, integer). The integer indicates how many rows before, or after, the expression references. The integer must be positive.
Note: You must include an
ORDER BY statement if you are using
lead(). Be sure to consider the order you want, since this changes which row these functions reference.
WINDOW Price, lead(Price,1) - Price AS Price_diff, BY product_category ORDER BY timestamp produces a table with rows ordered by the timestamp for each product_category, and the column Price_diff contains the difference between the two prices: the price of the current row minus the price of the row after. The Price_diff for the last row is null since there is no row after it.
For each row, the frame of a
WINDOW statement defines the set of records that are the inputs to the window expressions for that row. The frame defines a local context for each row: essentially, how many rows before or after, or before and after, the current row are included in the computation.
To give some examples, for a ranking window expression like
row_number(), the frame is the current row and all the rows preceding it:
row_number() returns the count of rows in the frame. For a 10-day moving average like
avg(price), the window frame is the current row plus rows from the previous ten days.
WINDOW statement can either use the default window frame, or else define a custom window frame explicitly.
The syntax for a
WINDOW statement with a custom window frame is as follows:
WINDOW select_list BY partition_key [ORDER BY sort_expression [sort_expression …]] frame_clause;
The frame_clause can be one of the following:
- ROWS frame_start
- RANGE frame_start
- ROWS BETWEEN frame_start AND frame_end
- RANGE BETWEEN frame_start AND frame_end
where frame_start and frame_end can be:
- UNBOUNDED PRECEDING
- value PRECEDING
- CURRENT ROW
- value FOLLOWING
- UNBOUNDED FOLLOWING
If you omit frame_end it defaults to CURRENT ROW.
There are a few rules for using frame_start and frame_end:
- frame_start cannot be UNBOUNDED FOLLOWING
- frame_end cannot be UNBOUNDED PRECEDING
- The frame_end choice cannot appear earlier in the above list than the frame_start choice. For example, RANGE BETWEEN CURRENT ROW AND value PRECEDING is not allowed
- value must be a non-negative integer
The window frame defines an ordered set of rows before and after, or before and after, the current row. UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING denote all rows in the current partition before and after the current row, respectively. The meaning of CURRENT ROW depends on whether ROWS or RANGE is specified. In ROWS mode, CURRENT ROW means the current row, only. In RANGE mode, it means all rows that have the same value in the ORDER BY expression as the CURRENT ROW.
The meaning of value PRECEDING and value FOLLOWING also depends on whether ROWS or RANGE is specified. In ROWS mode, value indicates the number of rows before or after the current row, respectively, to include. In RANGE mode, value indicates the range to include based on values in the ORDER BY expression. For example, for the
WINDOW sum(Price) AS new_Price BY product_category ORDER BY timestamp RANGE 1 PRECEDING;
Then new_Price for each row sums up the current row’s price with that of all rows on the same timestamp or from one timestamp before in that product category.
RANGE frames require an
ORDER BY clause, and the
ORDER BY clause must have exactly one sort expression of numeric data type. Equal values in the ORDER BY expression are all treated the same. That means in RANGE, CURRENT ROW means the current row plus all rows with an identical value in the ORDER BY expression.
Note: Windowing functions, including the offset window functions
lead(), define their own implicit window frames and are therefore incompatible with custom window frames. These functions can only be used with the default window frame.
The default window frame depends on the expressions in the expression list, and on whether there’s an
ORDER BY clause or not.
If there is no
ORDER BY clause, then the default window frame is:
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
That is, it’s the whole partition. The result is that every row in a partition computes the same value for each window expression. This is equivalent to a
GROUP BY statement that does not merge together the rows in each partition.
When there is an
ORDER BY statement and the window expressions do not include any window functions then the default window frame is:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
That is, all the rows in the partition before the current one, plus any rows with the same value as the current row in the sort expression.
Offset window functions define their own window frames implicitly to match the function semantics. For example,
lag(Price, 10) has the implicit window frame:
ROWS BETWEEN 10 PRECEDING AND 10 PRECEDING
That is, the offset functions set the frame to match the row they select exactly. You can define multiple offset window functions in the same expression or in the same
WINDOW statement. Tamr Core sorts out how to combine all of the window frames appropriately. (That is, one frame is applied to each offset window function.)
Updated over 1 year ago