Data Types and Transformations
The default data type of unified attributes is array[string]. Transformations support additional data types.
Transformations support the following data types.
Data type | Displayed As | Written As |
---|---|---|
integer | 1 | 1 |
long | 1L | 1L |
double | 1.0 | 1.0 |
string | example | 'example' |
Boolean | true | true |
array[ ] | a, b | array('a', 'b') |
See also Geospatial Data Types.
The complex data type array[ ] supports each primitive data type, such as array[integer]. It also supports nested arrays, for example: array[array[string]].
Tip: A syntax error appears when the preview service attempts to transform data into an array of arrays. To preview data of this type, after the transformation that results in a nested array add another transformation to convert the data to a simpler type.
To view the data type of a unified attribute in a transformation script, position your cursor over the attribute name in the script. A tip shows the data type.
Data Types and Functions
The default data type of unified attributes is array[string], even if they contain only one value.
- Tamr Core offers a number of array functions to process columns with this type. In addition, functions that can be applied to single-valued attributes can often be applied to an array by using
map
. - Some functions accept values of any data type, denoted as type any.
- Other functions only accept values of certain data types. For example, you can use
upper()
only on data of type string. You can convert between data types using data type conversion functions such asto_string()
, which casts values of any type to type string. - Any data that fails to convert returns a
null
.
Tip: You can use thecoalesce
function to supply a default value when nulls are encountered. See coalesce.
Best Practices for Data Type Conversion
As a best practice, Tamr recommends a forward-looking approach that keeps data types consistent by treating all unified attributes as arrays. Using this approach can make your transformations more resilient to software and data changes over time. An example follows. However, transformations for single-valued attributes are often easier to write.
Tip: Multi-value arrays can be particularly useful in mastering projects as they allow the computation of similarity across columns of your source data. For example, you can GROUP BY
a small set of columns that indicate a clear cluster (such as company_division, product_number, and product_description). You can then use the aggregating function mode()
on most other columns to keep the most common value, and the aggregating function top()
to keep multiple common values for very informative columns. See Group By and Functions.
If you choose to write scripts for single-valued data, as a good practice Tamr recommends that you convert all of the unified attributes that are subject to transformation once, at the beginning of your transformations, instead of repeatedly within each transformation. The same strategy applies if an attribute needs to be converted to a different array type. You can make these conversions using a series of formulas, one for each attribute you wish to treat as a different type, or with a single script such as the example that follows.
SELECT *,
to_string(person_name) AS person_name,
to_string(person_city) AS person_city,
to_string(person_state) AS person_state,
to_int(person_birth_year) AS person_birth_year,
array.of(to_double, person_hourly_wages) AS person_hourly_wages;
Note: Any value that fails data type conversion becomes null. This also applies to arrays of strings with multiple values that are converted using to_string()
.
Example Data Type Conversions
This section includes examples of best, good, and not recommended practices for data type conversions. Each example shows the effect on the following pre-transformation values in a unified dataset.
tamr_id | person_name |
---|---|
1 | ['Becky'] |
2 | ['Richard', 'Rich'] |
3 | ['Alex'] |
4 | [''] |
Best Practice Example
Treat everything as an array to allow future multiple mappings:
SELECT *, map(upper, person_name) AS person_name;
SELECT *, filter(is_not_empty, person_name) AS person_name;
Results in:
tamr_id | person_name |
---|---|
1 | ['BECKY'] |
2 | ['RICHARD', 'RICH'] |
3 | ['ALEX'] |
4 | [] |
Good Practice Example
Treat singly mapped attributes as single values, convert once:
SELECT *, to_string(person_name) AS person_name;
SELECT *, upper(person_name) AS person_name;
SELECT *, CASE WHEN person_name IS NOT EMPTY THEN person_name END AS person_name;
Results in:
tamr_id | person_name |
---|---|
1 | ['BECKY'] |
2 | null |
3 | ['ALEX'] |
4 | null |
Not Recommended Example
Treat singly mapped attributes as single values, convert in every statement:
SELECT *, upper(to_string(person_name)) AS person_name;
SELECT *, CASE WHEN to_string(person_name) IS NOT EMPTY THEN person_name END AS person_name;
Results in:
tamr_id | person_name |
---|---|
1 | ['BECKY'] |
2 | null |
3 | ['ALEX'] |
4 | null |
Data Types and System-Generated Attributes
The attributes generated for every unified dataset by Tamr Core are origin_entity_id
, origin_source_name
, and tamr_id
. These attributes have a data type of string by default, and must be present and have the data type string at the end of transformations.
In addition, tamr_id
must be unique for each row in the unified dataset. When transforming the unified dataset of a categorization or mastering project, additional care must be taken to ensure that the tamr_id
is stable to avoid the loss of downstream artifacts such as expert feedback or golden records.
All data types are reset between source transformation and unified transformation. When you perform a transformation on a source or unified attribute, be aware that all unified dataset attributes are converted back to array, and spend attributes in the unified dataset are converted to double(float).
Updated over 2 years ago