Map, Filter, and Reduce
You use these higher-order functions to apply a function across an array field.
Transformations support the higher-order functions
reduce. They allow you to apply a Tamr Core-defined function across array fields.
mapapplies the function to each element of an array field.
filterkeeps elements of array fields that match a specified condition.
reducecombines an array into a single value.
Conceptually, these functions are similar to
GROUP BY statements with a few major differences.
The following examples show how
reduce are similar to
GROUP BY statements:
SELECT func(col)maps a function to each cell in a column.
FILTER am_i_true(col)keeps only the rows that match the given condition.
GROUP agg() BY colcombines rows based groups of values in a column.
reduce Differ from
The differences are as follows:
reducecan be used as other functions, not as standalone statements. For example:
SELECT *, get(map(to_int, NUM_COL), 1) * 5 AS second_element_mult_by_5
reduce, you can take a function like
to_intthat acts on non-array values and apply it to each value of an array type.
- If you want to use
GROUP BYto apply a function to an array column, the function you use needs to be compatible with array inputs. For example,
Stringand returns an
- If you want to use
GROUP BYto apply a function directly to a column with data type
String, it will fail unless you first convert the column to
map applies any function to each value in an array field.
map(Function: Any -> Any function, Any a) takes two parameters:
function, A single-argument function
a, an array column or expression.
map returns an array column, where the array value type matches the return type of
For example, to remove leading and trailing whitespace from each value in a string array column using the
trim function, I could write:
SELECT *, map(trim, A) as A;
Function type compatibility
Functions have restrictions on what type they can accept as arguments. You may need to convert the type of the array input depending on the function you wish to use.
In the example above, if
A had the data type
Integer, instead of
String, an error would occur. This is because
trim only accepts
filter keeps elements of array fields that match a specified condition.
The condition, in this case, is a function that returns true or false based on the input value.
filter will keep all values for which the function returns true.
filter(Function: Any -> Boolean predicate, Any a) takes the following arguments:
predicate, a function that takes a single argument and returns a boolean. If
true, the corresponding value will be kept. Otherwise the value will be removed.
a, an array column or expression
filter returns an array of the same type as the input column
For example, to remove all null or empty values from a string array column, I could write:
SELECT *, filter(is_not_empty, A) as A;
Function type compatibility
map, the argument data type to the
predicate function must match the data type of the values in the input array,
reduce combines the values of an array column using a function as a reducer.
The reducer function is used to combine each value of the input array with a running result, and is called on input array values from left-to-right.
At the start of the operation, a running result hasn't yet been computed, so you combine the first item of the collection with the initial value. Otherwise, you combine each subsequent value with the result of invoking the reducer function on the previous element.
reduce(Function: Any, Any -> Any reducer, Any initialValue, Any a)
reducer, a function that takes two arguments of the same data type and returns a value of that data type. (e.g.,
initialValue, the initial value of the reduction
a, the array-valued column to reduce over
Function type compatibility
The data types of both arguments to the reducer function must match the function's return data type, the data type of the
initialValue argument and the element data type of the array column!
For example, to sum the values of an integer array column, I could write:
SELECT *, reduce(add, 0, A) as A;
Let's consider a single value of column
array(1, 2, 3). When the following expression is evaluated,
reduce(add, 0, array(1,2,3)),
add is applied in sequence to each element of the array and the result of the previous
add, or otherwise
You can think of this example of reduce acting like the following pseudocode for-loop:
# set the result to initialValue, 0 result = 0 for i in [1, 2, 3]: result = add(result, i) # result now equals 6
Put a bit differently,
reduce(add, 0, array(1,2,3)) evaluates to
add(3, add(2, add(1, 0))) which evaluates to
Combining Map, Filter, and Reduce
While each of these functions is useful on their own, they can be used together to great effect.
For example, if I want to convert a string array column to integer values, remove any values that failed conversion (i.e. are null), and then sum them, I could write the following:
SELECT *, map(to_int, A) as mapped; SELECT *, filter(is_not_null, mapped) as filtered; SELECT *, reduce(sum, 0, filtered) as sum;
Note that this could also be written in a single
SELECT *, reduce(sum, 0, filter(is_not_null, map(to_int, A))) as sum;
Applying Map, Filter, and Reduce to Arrays
Lambdas give you the power to apply an arbitrary expression to each value in an array.
A lambda expression is just like an ordinary expression except that it’s allowed to reference one or several “free variables.” The free variables are like other identifiers (e.g.,
SUBJECT_ID), but they are not references to the underlying data---instead their values are supplied by the context of how the lambda is used. For example, consider the following map expression:
map(lambda x: x + 1, LISTS_OF_NUMBERS)
Here the parameter
x is the lambda’s free variable. What
map() does is take individual values from
LISTS_OF_NUMBERS (an attribute with arrays) and supplies them one-by-one as x values to the lambda. The result is every number in every list in the attribute is incremented by one.
The general syntax is
lambda [id_list]: expression,
- id_list is a comma-separated list of variable names
- id_list can be empty, in which case the lambda expression always returns the same constant value
- Each id in the list can be any valid alphanumeric name that isn’t already defined outside of the lambda
- expression is an arbitrary expression that is allowed to refer to the variables in id_list
Note that there are variations depending on if you are using this in a
reduce(). These differences are detailed later.
Here are some general examples of valid lamdas:
lambda x: x
lambda foo: foo + 1
lambda x, y: x + y
lambda my_num: case when my_num > 5 then ‘big’ else ‘small’ end
lambda x, y: x + length(y)
Naming the lambda variable
A good practice to follow is to name lambda parameter names a unique variable name, such as
y. Do not, for example, name them something that already exists, such as an attribute name.
Each time you write a lambda, you may reuse the same parameter name. The parameter name is relevant only to the lambda it is written with, and will not conflict with other lambda expressions.
Using lambdas in Map, Filter, and Reduce
Let's learn how to use lambdas in
You can pass a lambda expression to
map() to have the expression applied to each of the values in an array.
map(lambda x: x, my_array_col)Returns the same exact array
map(lambda x: to_int(x), my_int_array_col)This is the same as
map(lambda x: case when is_valid_email(x) then x else ‘INVALID’ end, email_col)Replace anything that’s not a valid email with “INVALID”
filter() applies a boolean (true or false) function to each item in the array. It keeps the items that result in “true” and drops the values that result in “false.” You can use a lambda in a
filter() to apply complex conditions to an attribute with arrays. The lambda should take one parameter.
filter(lambda x: x > 0, ints)Will keep only the positive numbers in the array
filter(lambda x: matches(‘[a-zA-Z]+’, x), strings)Will keep only the non-empty alphabetic values
filter(lambda x: substr2(x, 0, 1) == ‘A’, strings)Will keep the values in an array that start with 'A'
reduce() aggregates the values in an array and produces a single result.
reduce() accepts a lambda of two parameters where the first is the “running total” and the second represents the items in the array.
reduce() works by calling the lambda function for each item in the array with the “running total” set each time to the value returned by the last call, or to an initial value for the first call.
reduce(lambda x, y: x + y, 0, ints)Sums all values in the array
reduce(lambda x, y: x || y, ‘prefix-’, strings)Concatenate all values, prefixed by ‘prefix-’
reduce(lambda x, y: x + length(y), 0, strings)Sums total length of all the strings in the array
Note: Keep track of data types
The initial value (the second argument passed to reduce) and the first lambda parameter (the “running total”) need to have the same data type. In addition, the second lambda parameter needs to have the same type as the elements in the array input.
Updated 7 months ago