Tamr Documentation

Data Types and Transformations

The default data type of unified attributes is array[string]; Tamr transformations support additional data types.

Tamr transformations support the following data types.

Data type

Displayed As

Written As

integer

1

1

long

1L

1L

double

1.0

1.0

string

example

'example'

Boolean

true

true

array[ ]

a, b

array('a', 'b')

See also Geospatial Data Types.

The complex data type array[ ] supports each primitive data type, such as array[integer]. It also supports nesting of data types, for example: array[array[string]].

Tip: To view the data type of a unified attribute within a transformation script in Tamr, position your mouse cursor over the attribute name in the script. A tip shows the data type.

Tamr shows you that the data type of the company_name attribute is array[string]Tamr shows you that the data type of the company_name attribute is array[string]

Tamr shows you that the data type of the company_name attribute is array[string]

Data Types and Tamr Functions

The default data type of unified attributes is array[string], even if they contain only one value.

  • Tamr offers a number of array functions to process columns with this type. In addition, functions that can be applied to single-valued attributes can often be applied to an array by using map.
  • Some functions accept values of any data type, denoted as type any.
  • Other functions only accept values of certain data types. For example, you can use upper() only on data of type string. You can convert between data types using data type conversion functions such as to_string(), which casts values of any type to type string.
  • Any data that fails to convert returns a null.
    Tip: You can use the coalesce function to use a default value when nulls are encountered. See coalesce.

Best Practices for Data Type Conversion

As a best practice, Tamr recommends a forward-looking approach that keeps data types consistent by treating all unified attributes as arrays. Using this approach can make your transformations more resilient to software and data changes over time. An example follows. However, transformations for single-valued attributes are often easier to write.

Tip: Multi-values can be particularly useful in mastering projects as they allow the computation of similarity across columns of your source data. For example, you can GROUP BY a small set of columns that indicate a clear cluster (such as company_division, product_number, and product_description) and then use the aggregating function mode() on most other columns to keep the most common value and the aggregating function top() to keep multiple common values for very informative columns. See Group By and Functions.

If you choose to write scripts for single-valued data, as a good practice Tamr recommends that you convert all of the unified attributes that are subject to transformation once, at the beginning of your transformations, instead of repeatedly within each transformation. The same strategy applies if an attribute needs to be converted to a different array type. You can make these conversions using a series of formulas, one for each attribute you wish to treat as a different type, or with a single script such as the example that follows.

SELECT *,
to_string(person_name) AS person_name,
to_string(person_city) AS person_city,
to_string(person_state) AS person_state,
to_int(person_birth_year) AS person_birth_year,
array.of(to_double, person_hourly_wages) AS person_hourly_wages;

Note: Any value that fails data type conversion becomes null. This also applies to arrays of strings with multiple values that are converted using to_string().

Example Data Type Conversions

This section includes examples of best, good, and not recommended practices for data type conversions. Each example shows the effect on the following pre-transformation values in a unified dataset.

tamr_id

person_name

1

['Becky']

2

['Richard', 'Rich']

3

['Alex']

4

['']

Best Practice Example

Treat everything as an array to allow future multiple mappings:

SELECT *, map(upper, person_name) AS person_name;
SELECT *, filter(is_not_empty, person_name) AS person_name;

Results in:

tamr_id

person_name

1

['BECKY']

2

['RICHARD', 'RICH']

3

['ALEX']

4

[]

Good Practice Example

Treat singly mapped attributes as single values, convert once:

SELECT *, to_string(person_name) AS person_name;
SELECT *, upper(person_name) AS person_name;
SELECT *, CASE WHEN person_name IS NOT EMPTY THEN person_name END AS person_name;

Results in:

tamr_id

person_name

1

['BECKY']

2

null

3

['ALEX']

4

null

Not Recommended Example

Treat singly mapped attributes as single values, convert in every statement:

SELECT *, upper(to_string(person_name)) AS person_name;
SELECT *, CASE WHEN to_string(person_name) IS NOT EMPTY THEN person_name END AS person_name;

Results in:

tamr_id

person_name

1

['BECKY']

2

null

3

['ALEX']

4

null

Data Types and Tamr-generated Attributes

The attributes generated for every unified dataset by Tamr are origin_entity_id, origin_source_name, and tamr_id. These attributes have a data type of string by default, and must be present and have the data type string at the end of transformations.

In addition, tamr_id must be unique for each row in the unified dataset. When transforming the unified dataset of a categorization or mastering project, additional care must be taken to ensure that the tamr_id is stable to avoid the loss of downstream artifacts such as human feedback or golden records.

Updated 4 months ago



Data Types and Transformations


The default data type of unified attributes is array[string]; Tamr transformations support additional data types.

Suggested Edits are limited on API Reference Pages

You can only suggest edits to Markdown body content, but not to the API spec.