User GuidesAPI ReferenceRelease Notes
Doc HomeHelp CenterLog In

Data Types and Transformations

The default data type of unified attributes is array[string]. Transformations support additional data types.

Transformations support the following data types.

Data typeDisplayed AsWritten As
integer11
long1L1L
double1.01.0
stringexample'example'
Booleantruetrue
array[ ]a, barray('a', 'b')

See also Geospatial Data Types.

The complex data type array[ ] supports each primitive data type, such as array[integer]. It also supports nested arrays, for example: array[array[string]].

Tip: A syntax error appears when the preview service attempts to transform data into an array of arrays. To preview data of this type, after the transformation that results in a nested array add another transformation to convert the data to a simpler type.

To view the data type of a unified attribute in a transformation script, position your cursor over the attribute name in the script. A tip shows the data type.

916

Showing that the data type of the company_name attribute is array[string].

Data Types and Functions

The default data type of unified attributes is array[string], even if they contain only one value.

  • Tamr Core offers a number of array functions to process columns with this type. In addition, functions that can be applied to single-valued attributes can often be applied to an array by using map.
  • Some functions accept values of any data type, denoted as type any.
  • Other functions only accept values of certain data types. For example, you can use upper() only on data of type string. You can convert between data types using data type conversion functions such as to_string(), which casts values of any type to type string.
  • Any data that fails to convert returns a null.
    Tip: You can use the coalesce function to supply a default value when nulls are encountered. See coalesce.

Best Practices for Data Type Conversion

As a best practice, Tamr recommends a forward-looking approach that keeps data types consistent by treating all unified attributes as arrays. Using this approach can make your transformations more resilient to software and data changes over time. An example follows. However, transformations for single-valued attributes are often easier to write.

Tip: Multi-value arrays can be particularly useful in mastering projects as they allow the computation of similarity across columns of your source data. For example, you can GROUP BY a small set of columns that indicate a clear cluster (such as company_division, product_number, and product_description). You can then use the aggregating function mode() on most other columns to keep the most common value, and the aggregating function top() to keep multiple common values for very informative columns. See Group By and Functions.

If you choose to write scripts for single-valued data, as a good practice Tamr recommends that you convert all of the unified attributes that are subject to transformation once, at the beginning of your transformations, instead of repeatedly within each transformation. The same strategy applies if an attribute needs to be converted to a different array type. You can make these conversions using a series of formulas, one for each attribute you wish to treat as a different type, or with a single script such as the example that follows.

SELECT *,
to_string(person_name) AS person_name,
to_string(person_city) AS person_city,
to_string(person_state) AS person_state,
to_int(person_birth_year) AS person_birth_year,
array.of(to_double, person_hourly_wages) AS person_hourly_wages;

Note: Any value that fails data type conversion becomes null. This also applies to arrays of strings with multiple values that are converted using to_string().

Example Data Type Conversions

This section includes examples of best, good, and not recommended practices for data type conversions. Each example shows the effect on the following pre-transformation values in a unified dataset.

tamr_idperson_name
1['Becky']
2['Richard', 'Rich']
3['Alex']
4['']

Best Practice Example

Treat everything as an array to allow future multiple mappings:

SELECT *, map(upper, person_name) AS person_name;
SELECT *, filter(is_not_empty, person_name) AS person_name;

Results in:

tamr_idperson_name
1['BECKY']
2['RICHARD', 'RICH']
3['ALEX']
4[]

Good Practice Example

Treat singly mapped attributes as single values, convert once:

SELECT *, to_string(person_name) AS person_name;
SELECT *, upper(person_name) AS person_name;
SELECT *, CASE WHEN person_name IS NOT EMPTY THEN person_name END AS person_name;

Results in:

tamr_idperson_name
1['BECKY']
2null
3['ALEX']
4null

Not Recommended Example

Treat singly mapped attributes as single values, convert in every statement:

SELECT *, upper(to_string(person_name)) AS person_name;
SELECT *, CASE WHEN to_string(person_name) IS NOT EMPTY THEN person_name END AS person_name;

Results in:

tamr_idperson_name
1['BECKY']
2null
3['ALEX']
4null

Data Types and System-Generated Attributes

The attributes generated for every unified dataset by Tamr Core are origin_entity_id, origin_source_name, and tamr_id. These attributes have a data type of string by default, and must be present and have the data type string at the end of transformations.

In addition, tamr_id must be unique for each row in the unified dataset. When transforming the unified dataset of a categorization or mastering project, additional care must be taken to ensure that the tamr_id is stable to avoid the loss of downstream artifacts such as expert feedback or golden records.

All data types are reset between source transformation and unified transformation. When you perform a transformation on a source or unified attribute, be aware that all unified dataset attributes are converted back to array, and spend attributes in the unified dataset are converted to double(float).