Lookup

LOOKUP allows you to use values from a lookup dataset, joined by a key, to transform values in an attribute in your dataset. For example, you could use LOOKUP to change all state name values, such as California, into their two-letter abbreviations, such as CA.

Note: To standardize values in an attribute you can also use a case expression. Use case for dataset attributes where you must standardize a few values only and the standardizations are not expected to change over time.

LOOKUP references a set of values from an auxiliary dataset, known as the <lookup_dataset>, which you must upload to Tamr Core. To create a lookup table without uploading it as a separate dataset, you can use ROWS.

Note: LOOKUP is optimized for use with small lookup datasets. You may experience decreased performance if the size of the <lookup_dataset> is similar to your unified dataset. If you use a very small lookup dataset, then running a LOOKUP statement can be faster than running a JOIN statement.

In most cases, LOOKUP does not change the number of rows in your dataset. The referenced <lookup_dataset> is grouped on the join keys before the target value is added. If a key appears multiple times in the <lookup_dataset>, LOOKUP combines the values of repeated keys into an array, and does not produce multiple rows as a JOIN statement would.

For example, if the <lookup_dataset> contained both "USA"="United States" and "USA"="United States of America", the value produced by a LOOKUP for "USA" would be the array ["United States", "United States of America"].

However, if a row in your dataset matches with multiple different values of a join key in the <lookup_dataset>, then LOOKUP increases the number of rows in your dataset.

For example, if the <lookup_dataset> contains both "US"="United States" and "USA"="United States of America", and the LOOKUP condition is CountryCode.contains(<lookup_dataset>.CountryCode), then the record in your dataset with the country code "USA" matches with both records of the <lookup_dataset>. This results in two records in your dataset instead of one. Since "US" and "USA" are different, those records in the <lookup_dataset> are not grouped together by LOOKUP.

The transformation that uses a LOOKUP statement has the following syntax:

LOOKUP (<expr> AS <attribute>)*
FROM <lookup_dataset> AS <alias>
ON <join_condition>;

In this transformation:

The <expr> must reference attributes from the provided <lookup_dataset>.
The <attribute> must reference attributes from your dataset.
The <join_condition> must be an expression that references an attribute in the current dataset and an attribute in the <lookup_dataset>.

Join Conditions in LOOKUP

LOOKUP statements typically use logical comparators to find values that do, and do not, meet certain criteria. Logical comparators include both equality and non-equality expressions.

The LOOKUP statement takes both equality and non-equality expressions:

If the <join_condition> is an equality expression, such as TableA.Genre=TableB.Genre, then a referenced attribute from the current dataset must match a referenced attribute from the <lookup_dataset>.
If the <join_condition> is not an equality expression, such as TableA.Age>=TableB.Age, or TableA.Name.contains(TableB.Name), then an attribute from the current dataset may match more than one attribute from the <lookup_dataset>. This may result in LOOKUP adding rows to your dataset, and as a result duplicating any unique primary keys you may have.

Note: Tamr recommends that you use LOOKUP with equality join conditions. However, non-equality conditions are also supported and have caveats that are described in this section.

If you use a non-equality join condition in a LOOKUP statement, then when LOOKUP performs a join on attributes from two tables, the attributes are not required to have matching attribute values for them to be considered a match. In this case, the attribute values in your table do not directly compare, using equality = , with the corresponding attribute values in the lookup table. For example, the condition uses >=. This could result in a record from your original dataset matching multiple different records from the <lookup_dataset>. Since these records have different attribute values, the GROUP by operation that runs as part of LOOKUP cannot combine them into a single record. This creates records in the output from LOOKUP where the record from your original dataset appears multiple times alongside different records from the <lookup_dataset>. In addition, the primary key of the record from the original dataset is duplicated.

For example, consider TableA in which the Name attribute's values for two records are TIMOTHY for record1 and ALICE for record2. In TableB (which will function as the <lookup_dataset>, the Name attribute's values are TIM for record1 and TIMOTHY for record2. If the JOIN condition is TableA.Name.contains(TableB.Name), then LOOKUP returns a table with two new records, one record that matches record1 from TableA and record1 from TableB, and one record that matches record1 from TableA and record2 from TableB. These two new records will have the same primary key because they take the primary key from the same record from TableA. When Tamr Core writes this table to HBase while running Spark that uses partitions, HBase may write the record that arrives first, and then overwrite it with the values from the second record, because they have the same primary keys.

Primary Key Management for LOOKUP with Non-Equality Join Conditions

Starting with v2020.016 and greater, Tamr Core automatically assigns primary keys to all LOOKUP statements with non-equality join conditions that you add in this version or in subsequent versions. This means that the system changes primary keys (tamr_id) for such LOOKUP statements.

To avoid disruptions to existing transformations with LOOKUP statements written in versions before v2020.016, during the upgrade to this version, the system automatically runs an upgrade script that disables automatic assignment of primary keys for existing LOOKUP statements with non-equality join conditions. The upgrade script prevents breaking any current projects that contain transformations with LOOKUP statements with non-equality join conditions and that depend on primary keys staying the same as in the software version from which you are upgrading.

The upgrade script adds the text hint(pkmanagement.manual) in front of these statements. Once the script completes, it issues a report listing all the projects and their transformations that were changed. It also lists any projects and transformations that could not be updated with the text hint(pkmanagement.manual) due to parsing or linting errors. For more information about using HINT, see Labels, Hints, and Scope.

A potential data loss may occur in some LOOKUP statements with non-equality join conditions. These LOOKUP statements may create records with the same primary keys. When the results are written to HBase in Spark, these records overwrite each other.

In some cases, you may wish to avoid data loss at all costs, and don’t mind if you lose labels because of changed primary keys. In such cases, you can choose to re-enable automatic assignment of primary keys to records resulting from any LOOKUP statements. To do this, remove the text hint(pkmanagement.manual) that was added by the upgrade script to those LOOKUP statements.

Note: If primary keys change, then all downstream labels, such as pair labels, or labels in a categorization project begin referring to the records with the previous set of primary keys. They no longer link to any existing records that have new primary keys. In such cases, you lose all previously-assigned labels. For more information, see Primary Key Management and Upgrading Tamr Core.

For information about other maintenance utilities that are available for unify-admin.sh, see Utilities for Validation and System-Wide Processes.

LOOKUP Example 1

For example, consider a list of people and the two-letter code of the State they live in:

746 — A table of records with values for attributes First Name, Last Name, and State, where values for State are two-letter state codes.

You want to use the LOOKUP statement to replace each two-letter code in State with the full name of the state, so that the final table looks like this:

748 — A list of records with values for attributes First Name, Last Name and State, where values for State are spelled out names of states.

To apply LOOKUP to the State attribute, you use a lookup dataset, state_lookup_table.csv, that contains an association between each state’s two-letter code and its full name:

496 — A table with two attributes, stateCode and stateName.

Using LOOKUP, you can replace the two-letter values in the State attribute in your dataset with the full names of the state from the lookup dataset. To do this, you find where State in your dataset matches stateCode in state_lookup_table.csv. When they match, the stateName value is selected as the new value for State. The LOOKUP statement for this example is as follows:

LOOKUP get(stateName, 0) AS State
FROM state_lookup_table.csv
ON State = state_lookup_table.csv.stateCode;

LOOKUP Example 2: Aliasing

You can assign an alias to the lookup dataset so that you don't have to write the dataset's full name each time you reference its attribute.

Tip: This can be particularly useful for datasets with long names.

In the following example, the second line adds an alias lkp to the table named state_lookup_table.csv. This allows you to write lkp.stateCode, instead of state_lookup_table.csv.stateCode, each time you reference the attribute stateCode from the lookup dataset.

LOOKUP get(stateName, 0) AS State
FROM state_lookup_table.csv AS lkp
ON State = lkp.stateCode;

LOOKUP Example 3

In this example, you would like the resulting table to have a State attribute with a value where the two-letter stateCode value is followed by the full name of the state:

792 — A list of records with values for attributes First Name, Last Name, and State, where values for State are abbreviations and state names.

The script looks like the following:

LOOKUP (get(stateCode, 0) || ': ' || get(stateName, 0)) AS State
FROM state_lookup_table.csv AS lkp
ON State = lkp.stateCode;

Note: The following error occurs because the grouping condition, GROUP BY, cannot find an attribute in the lookup dataset that it can associate with an attribute in the current dataset.

1040 — A "Provide at least one key to GROUP by" error message.

LOOKUP Example 4: Extract keywords from a description

Lookup can be used to extract keywords contained in a description attribute. In this example, the transformation extracts country keywords contained in organization_description_array.

LOOKUP get(country, 0) AS country_extracted
FROM country_lookup_table.csv AS lkp
ON array.contains(organization_description_array, get(lkp.country, 0));