Tamr Documentation

map, filter, and reduce

Transformations support the higher order functions map, filter, and reduce. They allow you to apply a Tamr defined function across array fields.

  • map applies the function to each element of an array field.
  • filter keeps elements of array fields that match a specified condition.
  • reduce combines an array into a single value.

Conceptually, these functions are similar to SELECT, FILTER, and GROUP BY statements with a few major differences.
The following examples show how map, filter, and reduce are similar to SELECT, FILTER, and GROUP BY statements:

  • SELECT func(col) maps a function to each cell in a column.
  • FILTER am_i_true(col) keeps only the rows that match the given condition.
  • GROUP agg() BY col combines rows based groups of values in a column.

How Do map, filter, and reduce Differ from SELECT, FILTER, and GROUP BY

The differences are as follows:

  1. map, filter, and reduce can be used as other functions, not as standalone statements. For example:
SELECT *, get(map(to_int, NUM_COL), 1) * 5 AS second_element_mult_by_5
  1. With map, filter, and reduce, you can take a function like to_int that acts on non-array values and apply it to each value of an array type.
    For comparison:
  • If you want to use SELECT, FILTER, or GROUP BY to apply a function to an array column, the function you use needs to be compatible with array inputs. For example, to_int takes a String and returns an Integer.
  • If you want to use SELECT, FILTER, or GROUP BY to apply a function directly to a column with data type String[], it will fail unless you first convert the column to String.

Map

map applies any function to each value in an array field.

map(Function: Any -> Any function, Any[] a) takes two parameters:

  • function, A single-argument function
  • a, an array column or expression.

map returns an array column, where the array value type matches the return type of function.

For example, to remove leading and trailing whitespace from each value in a string array column using the trim function, I could write:

SELECT *, map(trim, A) as A;

🚧

Function type compatibility

Functions have restrictions on what type they can accept as arguments. You may need to convert the type of the array input depending on the function you wish to use.

In the example above, if A had the data type Integer[], instead of String[], an error would occur. This is because trim only accepts String inputs.

Filter

filter keeps elements of array fields that match a specified condition.

The condition, in this case, is a function that returns true or false based on the input value. filter will keep all values for which the function returns true.

filter(Function: Any -> Boolean predicate, Any[] a) takes the following arguments:

  • predicate, a function that takes a single argument and returns a boolean. If predicate returns true, the corresponding value will be kept. Otherwise the value will be removed.
  • a, an array column or expression

filter returns an array of the same type as the input column a.

For example, to remove all null or empty values from a string array column, I could write:

SELECT *, filter(is_not_empty, A) as A;

🚧

Function type compatibility

As with map, the argument data type to the predicate function must match the data type of the values in the input array, a.

Reduce

reduce combines the values of an array column using a function as a reducer.

The reducer function is used to combine each value of the input array with a running result, and is called on input array values from left-to-right.

At the start of the operation, we haven't yet computed a running result, so we combine the first item of the collection with the initial value. Otherwise, we combine each subsequent value with the result of invoking the reducer function on the previous element.

reduce(Function: Any, Any -> Any reducer, Any initialValue, Any[] a)

  • reducer, a function that takes two arguments of the same data type and returns a value of that data type. (e.g., add, or concat).
  • initialValue, the initial value of the reduction
  • a, the array-valued column to reduce over

🚧

Function type compatibility

The data types of both arguments to the reducer function must match the function's return data type, the data type of the initialValue argument and the element data type of the array column!

For example, to sum the values of an integer array column, I could write:

SELECT *, reduce(add, 0, A) as A;

Let's consider a single value of column A - array(1, 2, 3). When the following expression is evaluated, reduce(add, 0, array(1,2,3)), add is applied in sequence to each element of the array and the result of the previous add, or otherwise initialValue.

You can think of this example of reduce acting like the following pseudocode for-loop:

# set the result to initialValue, 0
result = 0
for i in [1, 2, 3]:
    result = add(result, i)
# result now equals 6

Put a bit differently, reduce(add, 0, array(1,2,3)) evaluates to add(3, add(2, add(1, 0))) which evaluates to 6.

Combining Map, Filter, and Reduce

While each of these functions is useful on their own, they can be used together to great effect.

For example, if I want to convert a string array column to integer values, remove any values that failed conversion (i.e. are null), and then sum them, I could write the following:

SELECT *, map(to_int, A) as mapped;
SELECT *, filter(is_not_null, mapped) as filtered;
SELECT *, reduce(sum, 0, filtered) as sum;

Note that this could also be written in a single SELECT:

SELECT *, reduce(sum, 0, filter(is_not_null, map(to_int, A))) as sum;

Applying Map, Filter, and Reduce to Arrays

Lambdas give you the power to apply an arbitrary expression to each value in an array.
A lambda expression is just like an ordinary expression except that it’s allowed to reference one or several “free variables.” The free variables are like other identifiers (e.g., SUBJECT_ID), but they are not references to the underlying data---instead their values are supplied by the context of how the lambda is used. For example, consider the following map expression:

map(lambda x: x + 1, LISTS_OF_NUMBERS)

Here the parameter x is the lambda’s free variable. What map() does is take individual values from LISTS_OF_NUMBERS (an attribute with arrays) and supplies them one-by-one as x values to the lambda. The result is every number in every list in the attribute is incremented by one.

The general syntax is lambda [id_list]: expression,

  • id_list is a comma-separated list of variable names
  • id_list can be empty, in which case the lambda expression always returns the same constant value
  • Each id in the list can be any valid alphanumeric name that isn’t already defined outside of the lambda
  • expression is an arbitrary expression that is allowed to refer to the variables in id_list

Note that there are variations depending on if you are using this in a map(), filter(), or reduce(). These differences are detailed later.

Here are some general examples of valid lamdas:

  • lambda: 42
  • lambda x: x
  • lambda foo: foo + 1
  • lambda x, y: x + y
  • lambda my_num: case when my_num > 5 then ‘big’ else ‘small’ end
  • lambda x, y: x + length(y)

👍

Naming the lambda variable

A good practice to follow is to name lambda parameter names a unique variable name, such as x or y. Do not, for example, name them something that already exists, such as an attribute name.

Each time you write a lambda, you may reuse the same parameter name. The parameter name is relevant only to the lambda it is written with, and will not conflict with other lambda expressions.

Let's learn how to use lambdas in map(), filter(), and reduce().

map()

You can pass a lambda expression to map() to have the expression applied to each of the values in an array.

Examples:

  • map(lambda x: x, my_array_col) Returns the same exact array
  • map(lambda x: to_int(x), my_int_array_col) This is the same as map(to_int, my_int_array_col)
  • map(lambda x: case when is_valid_email(x) then x else ‘INVALID’ end, email_col) Replace anything that’s not a valid email with “INVALID”

filter()

filter() applies a boolean (true or false) function to each item in the array. It keeps the items that result in “true” and drops the values that result in “false.” You can use a lambda in a filter() to apply complex conditions to an attribute with arrays. The lambda should take one parameter.

Examples:

  • `filter(lambda x: x > 0, ints)` Will keep only the positive numbers in the array
    
  • `filter(lambda x: matches(‘[a-zA-Z]+’, x), strings)` Will keep only the non-empty alphabetic values
    
  • `filter(lambda x: substr2(x, 0, 1) == ‘A’, strings)` Will keep the values in an array that start with 'A'
    

reduce()

reduce() aggregates the values in an array and produces a single result. reduce() accepts a lambda of two parameters where the first is the “running total” and the second represents the items in the array. reduce() works by calling the lambda function for each item in the array with the “running total” set each time to the value returned by the last call, or to an initial value for the first call.

Examples

  • `reduce(lambda x, y: x + y, 0, ints)` Sums all values in the array
    
  • `reduce(lambda x, y: x || y, ‘prefix-’, strings)` Concatenate all values, prefixed by ‘prefix-’
    
  • `reduce(lambda x, y: x + length(y), 0, strings)` Sums total length of all the strings in the array
    

📘

Keep track of data types

Note that the initial value (the second argument passed to reduce) and the first lambda parameter (the “running total”) need to have the same data type. In addition, the second lambda parameter needs to have the same type as the elements in the array input.

Updated 8 months ago



map, filter, and reduce


Suggested Edits are limited on API Reference Pages

You can only suggest edits to Markdown body content, but not to the API spec.