Map, Filter, and Reduce
You use these higher-order functions to apply a function across an array field.
Transformations support the higher-order functions map
, filter
, and reduce
. They allow you to apply a Tamr Core-defined function across array fields.
map
applies the function to each element of an array field.filter
keeps elements of array fields that match a specified condition.reduce
combines an array into a single value.
Conceptually, these functions are similar to SELECT
, FILTER
, and GROUP BY
statements with a few major differences.
The following examples show how map
, filter
, and reduce
are similar to SELECT
, FILTER
, and GROUP BY
statements:
SELECT func(col)
maps a function to each cell in a column.FILTER am_i_true(col)
keeps only the rows that match the given condition.GROUP agg() BY col
combines rows based groups of values in a column.
How Do map
, filter
, and reduce
Differ from SELECT
, FILTER
, and GROUP BY
map
, filter
, and reduce
Differ from SELECT
, FILTER
, and GROUP BY
The differences are as follows:
map
,filter
, andreduce
can be used as other functions, not as standalone statements. For example:
SELECT *, get(map(to_int, NUM_COL), 1) * 5 AS second_element_mult_by_5
- With
map
,filter
, andreduce
, you can take a function liketo_int
that acts on non-array values and apply it to each value of an array type.
For comparison:
- If you want to use
SELECT
,FILTER
, orGROUP BY
to apply a function to an array column, the function you use needs to be compatible with array inputs. For example,to_int
takes aString
and returns anInteger
. - If you want to use
SELECT
,FILTER
, orGROUP BY
to apply a function directly to a column with data typeString[]
, it will fail unless you first convert the column toString
.
Map
map
applies any function to each value in an array field.
map(Function: Any -> Any function, Any[] a)
takes two parameters:
function
, A single-argument functiona
, an array column or expression.
map
returns an array column, where the array value type matches the return type of function
.
For example, to remove leading and trailing whitespace from each value in a string array column using the trim
function, I could write:
SELECT *, map(trim, A) as A;
Function type compatibility
Functions have restrictions on what type they can accept as arguments. You may need to convert the type of the array input depending on the function you wish to use.
In the example above, if A
had the data type Integer[]
, instead of String[]
, an error would occur. This is because trim
only accepts String
inputs.
Filter
filter
keeps elements of array fields that match a specified condition.
The condition, in this case, is a function that returns true or false based on the input value. filter
will keep all values for which the function returns true.
filter(Function: Any -> Boolean predicate, Any[] a)
takes the following arguments:
predicate
, a function that takes a single argument and returns a boolean. Ifpredicate
returnstrue
, the corresponding value will be kept. Otherwise the value will be removed.a
, an array column or expression
filter
returns an array of the same type as the input column a
.
For example, to remove all null or empty values from a string array column, I could write:
SELECT *, filter(is_not_empty, A) as A;
Function type compatibility
As with map
, the argument data type to the predicate
function must match the data type of the values in the input array, a
.
Reduce
reduce
combines the values of an array column using a function as a reducer.
The reducer function is used to combine each value of the input array with a running result, and is called on input array values from left-to-right.
At the start of the operation, a running result hasn't yet been computed, so you combine the first item of the collection with the initial value. Otherwise, you combine each subsequent value with the result of invoking the reducer function on the previous element.
reduce(Function: Any, Any -> Any reducer, Any initialValue, Any[] a)
reducer
, a function that takes two arguments of the same data type and returns a value of that data type. (e.g.,add
, orconcat
).initialValue
, the initial value of the reductiona
, the array-valued column to reduce over
Function type compatibility
The data types of both arguments to the reducer function must match the function's return data type, the data type of the initialValue
argument and the element data type of the array column!
For example, to sum the values of an integer array column, I could write:
SELECT *, reduce(add, 0, A) as A;
Let's consider a single value of column A
- array(1, 2, 3)
. When the following expression is evaluated, reduce(add, 0, array(1,2,3))
, add
is applied in sequence to each element of the array and the result of the previous add
, or otherwise initialValue
.
You can think of this example of reduce acting like the following pseudocode for-loop:
# set the result to initialValue, 0
result = 0
for i in [1, 2, 3]:
result = add(result, i)
# result now equals 6
Put a bit differently, reduce(add, 0, array(1,2,3))
evaluates to add(3, add(2, add(1, 0)))
which evaluates to 6
.
Combining Map, Filter, and Reduce
While each of these functions is useful on their own, they can be used together to great effect.
For example, if I want to convert a string array column to integer values, remove any values that failed conversion (i.e. are null), and then sum them, I could write the following:
SELECT *, map(to_int, A) as mapped;
SELECT *, filter(is_not_null, mapped) as filtered;
SELECT *, reduce(sum, 0, filtered) as sum;
Note that this could also be written in a single SELECT
:
SELECT *, reduce(sum, 0, filter(is_not_null, map(to_int, A))) as sum;
Applying Map, Filter, and Reduce to Arrays
Lambdas give you the power to apply an arbitrary expression to each value in an array.
A lambda expression is just like an ordinary expression except that it’s allowed to reference one or several “free variables.” The free variables are like other identifiers (e.g., SUBJECT_ID
), but they are not references to the underlying data---instead their values are supplied by the context of how the lambda is used. For example, consider the following map expression:
map(lambda x: x + 1, LISTS_OF_NUMBERS)
Here the parameter x
is the lambda’s free variable. What map()
does is take individual values from LISTS_OF_NUMBERS
(an attribute with arrays) and supplies them one-by-one as x values to the lambda. The result is every number in every list in the attribute is incremented by one.
The general syntax is lambda [id_list]: expression
,
- id_list is a comma-separated list of variable names
- id_list can be empty, in which case the lambda expression always returns the same constant value
- Each id in the list can be any valid alphanumeric name that isn’t already defined outside of the lambda
- expression is an arbitrary expression that is allowed to refer to the variables in id_list
Note that there are variations depending on if you are using this in a map()
, filter()
, or reduce()
. These differences are detailed later.
Here are some general examples of valid lamdas:
lambda: 42
lambda x: x
lambda foo: foo + 1
lambda x, y: x + y
lambda my_num: case when my_num > 5 then ‘big’ else ‘small’ end
lambda x, y: x + length(y)
Naming the lambda variable
A good practice to follow is to name lambda parameter names a unique variable name, such as x
or y
. Do not, for example, name them something that already exists, such as an attribute name.
Each time you write a lambda, you may reuse the same parameter name. The parameter name is relevant only to the lambda it is written with, and will not conflict with other lambda expressions.
Using lambdas in Map, Filter, and Reduce
Let's learn how to use lambdas in map()
, filter()
, and reduce()
.
map()
map()
You can pass a lambda expression to map()
to have the expression applied to each of the values in an array.
Examples:
map(lambda x: x, my_array_col)
Returns the same exact arraymap(lambda x: to_int(x), my_int_array_col)
This is the same asmap(to_int, my_int_array_col)
map(lambda x: case when is_valid_email(x) then x else ‘INVALID’ end, email_col)
Replace anything that’s not a valid email with “INVALID”
filter()
filter()
filter()
applies a boolean (true or false) function to each item in the array. It keeps the items that result in “true” and drops the values that result in “false.” You can use a lambda in a filter()
to apply complex conditions to an attribute with arrays. The lambda should take one parameter.
Examples:
filter(lambda x: x > 0, ints)
Will keep only the positive numbers in the arrayfilter(lambda x: matches(‘[a-zA-Z]+’, x), strings)
Will keep only the non-empty alphabetic valuesfilter(lambda x: substr2(x, 0, 1) == ‘A’, strings)
Will keep the values in an array that start with 'A'
reduce()
reduce()
reduce()
aggregates the values in an array and produces a single result. reduce()
accepts a lambda of two parameters where the first is the “running total” and the second represents the items in the array. reduce()
works by calling the lambda function for each item in the array with the “running total” set each time to the value returned by the last call, or to an initial value for the first call.
Examples:
reduce(lambda x, y: x + y, 0, ints)
Sums all values in the arrayreduce(lambda x, y: x || y, ‘prefix-’, strings)
Concatenate all values, prefixed by ‘prefix-’reduce(lambda x, y: x + length(y), 0, strings)
Sums total length of all the strings in the array
Note: Keep track of data types
The initial value (the second argument passed to reduce) and the first lambda parameter (the “running total”) need to have the same data type. In addition, the second lambda parameter needs to have the same type as the elements in the array input.
Updated about 2 years ago