Transformations support the following data types.
See also Geospatial Data Types.
The complex data type array[ ] supports each primitive data type, such as array[integer]. It also supports nested arrays, for example: array[array[string]].
Tip: A syntax error appears when the preview service attempts to transform data into an array of arrays. To preview data of this type, after the transformation that results in a nested array add another transformation to convert the data to a simpler type.
To view the data type of a unified attribute in a transformation script, position your cursor over the attribute name in the script. A tip shows the data type.
The default data type of unified attributes is array[string], even if they contain only one value.
- Tamr Core offers a number of array functions to process columns with this type. In addition, functions that can be applied to single-valued attributes can often be applied to an array by using
- Some functions accept values of any data type, denoted as type any.
- Other functions only accept values of certain data types. For example, you can use
upper()only on data of type string. You can convert between data types using data type conversion functions such as
to_string(), which casts values of any type to type string.
- Any data that fails to convert returns a
Tip: You can use the
coalescefunction to supply a default value when nulls are encountered. See coalesce.
As a best practice, Tamr recommends a forward-looking approach that keeps data types consistent by treating all unified attributes as arrays. Using this approach can make your transformations more resilient to software and data changes over time. An example follows. However, transformations for single-valued attributes are often easier to write.
Tip: Multi-value arrays can be particularly useful in mastering projects as they allow the computation of similarity across columns of your source data. For example, you can
GROUP BY a small set of columns that indicate a clear cluster (such as company_division, product_number, and product_description). You can then use the aggregating function
mode() on most other columns to keep the most common value, and the aggregating function
top() to keep multiple common values for very informative columns. See Group By and Functions.
If you choose to write scripts for single-valued data, as a good practice Tamr recommends that you convert all of the unified attributes that are subject to transformation once, at the beginning of your transformations, instead of repeatedly within each transformation. The same strategy applies if an attribute needs to be converted to a different array type. You can make these conversions using a series of formulas, one for each attribute you wish to treat as a different type, or with a single script such as the example that follows.
SELECT *, to_string(person_name) AS person_name, to_string(person_city) AS person_city, to_string(person_state) AS person_state, to_int(person_birth_year) AS person_birth_year, array.of(to_double, person_hourly_wages) AS person_hourly_wages;
Note: Any value that fails data type conversion becomes null. This also applies to arrays of strings with multiple values that are converted using
This section includes examples of best, good, and not recommended practices for data type conversions. Each example shows the effect on the following pre-transformation values in a unified dataset.
Treat everything as an array to allow future multiple mappings:
SELECT *, map(upper, person_name) AS person_name; SELECT *, filter(is_not_empty, person_name) AS person_name;
Treat singly mapped attributes as single values, convert once:
SELECT *, to_string(person_name) AS person_name; SELECT *, upper(person_name) AS person_name; SELECT *, CASE WHEN person_name IS NOT EMPTY THEN person_name END AS person_name;
Treat singly mapped attributes as single values, convert in every statement:
SELECT *, upper(to_string(person_name)) AS person_name; SELECT *, CASE WHEN to_string(person_name) IS NOT EMPTY THEN person_name END AS person_name;
The attributes generated for every unified dataset by Tamr Core are
tamr_id. These attributes have a data type of string by default, and must be present and have the data type string at the end of transformations.
tamr_id must be unique for each row in the unified dataset. When transforming the unified dataset of a categorization or mastering project, additional care must be taken to ensure that the
tamr_id is stable to avoid the loss of downstream artifacts such as expert feedback or golden records.
All data types are reset between source transformation and unified transformation. When you perform a transformation on a source or unified attribute, be aware that all unified dataset attributes are converted back to array, and spend attributes in the unified dataset are converted to double(float).
Updated 4 months ago