Lookup
LOOKUP
statements standardize values in a column.
LOOKUP
allows you to use values from a lookup dataset, joined by a key, to transform values in an attribute in your dataset. For example, you could use LOOKUP
to change all state name values, such as California, into their two-letter abbreviations, such as CA.
Note: To standardize values in an attribute you can also use a case expression. Use case
for dataset attributes where you must standardize a few values only and the standardizations are not expected to change over time.
LOOKUP
references a set of values from an auxiliary dataset, known as the <lookup_dataset>
, which you must upload to Tamr Core. To create a lookup table without uploading it as a separate dataset, you can use ROWS.
Note: LOOKUP
is optimized for use with small lookup datasets. You may experience decreased performance if the size of the <lookup_dataset>
is similar to your unified dataset. If you use a very small lookup dataset, then running a LOOKUP
statement can be faster than running a JOIN
statement.
In most cases, LOOKUP
does not change the number of rows in your dataset. The referenced <lookup_dataset>
is grouped on the join keys before the target value is added. If a key appears multiple times in the <lookup_dataset>
, LOOKUP
combines the values of repeated keys into an array, and does not produce multiple rows as a JOIN
statement would.
For example, if the <lookup_dataset>
contained both "USA"="United States" and "USA"="United States of America", the value produced by a LOOKUP
for "USA" would be the array ["United States", "United States of America"].
However, if a row in your dataset matches with multiple different values of a join key in the <lookup_dataset>
, then LOOKUP
increases the number of rows in your dataset.
For example, if the <lookup_dataset>
contains both "US"="United States" and "USA"="United States of America", and the LOOKUP
condition is CountryCode.contains(<lookup_dataset>.CountryCode)
, then the record in your dataset with the country code "USA" matches with both records of the <lookup_dataset>
. This results in two records in your dataset instead of one. Since "US" and "USA" are different, those records in the <lookup_dataset>
are not grouped together by LOOKUP
.
The transformation that uses a LOOKUP
statement has the following syntax:
LOOKUP (<expr> AS <attribute>)*
FROM <lookup_dataset> AS <alias>
ON <join_condition>;
In this transformation:
- The
<expr>
must reference attributes from the provided<lookup_dataset>
. - The
<attribute>
must reference attributes from your dataset. - The
<join_condition>
must be an expression that references an attribute in the current dataset and an attribute in the<lookup_dataset>
.
Join Conditions in LOOKUP
LOOKUP
statements typically use logical comparators to find values that do, and do not, meet certain criteria. Logical comparators include both equality and non-equality expressions.
The LOOKUP
statement takes both equality and non-equality expressions:
- If the
<join_condition>
is an equality expression, such asTableA.Genre=TableB.Genre
, then a referenced attribute from the current dataset must match a referenced attribute from the<lookup_dataset>
. - If the
<join_condition>
is not an equality expression, such asTableA.Age>=TableB.Age
, orTableA.Name.contains(TableB.Name)
, then an attribute from the current dataset may match more than one attribute from the<lookup_dataset>
. This may result inLOOKUP
adding rows to your dataset, and as a result duplicating any unique primary keys you may have.
Note: Tamr recommends that you use LOOKUP
with equality join conditions. However, non-equality conditions are also supported and have caveats that are described in this section.
If you use a non-equality join condition in a LOOKUP
statement, then when LOOKUP
performs a join on attributes from two tables, the attributes are not required to have matching attribute values for them to be considered a match. In this case, the attribute values in your table do not directly compare, using equality =
, with the corresponding attribute values in the lookup table. For example, the condition uses >=
. This could result in a record from your original dataset matching multiple different records from the <lookup_dataset>
. Since these records have different attribute values, the GROUP by
operation that runs as part of LOOKUP
cannot combine them into a single record. This creates records in the output from LOOKUP
where the record from your original dataset appears multiple times alongside different records from the <lookup_dataset>
. In addition, the primary key of the record from the original dataset is duplicated.
For example, consider TableA in which the Name
attribute's values for two records are TIMOTHY
for record1 and ALICE
for record2. In TableB (which will function as the <lookup_dataset>
, the Name
attribute's values are TIM
for record1 and TIMOTHY
for record2. If the JOIN
condition is TableA.Name.contains(TableB.Name)
, then LOOKUP
returns a table with two new records, one record that matches record1 from TableA and record1 from TableB, and one record that matches record1 from TableA and record2 from TableB. These two new records will have the same primary key because they take the primary key from the same record from TableA. When Tamr Core writes this table to HBase while running Spark that uses partitions, HBase may write the record that arrives first, and then overwrite it with the values from the second record, because they have the same primary keys.
Primary Key Management for LOOKUP with Non-Equality Join Conditions
Starting with v2020.016 and greater, Tamr Core automatically assigns primary keys to all LOOKUP
statements with non-equality join conditions that you add in this version or in subsequent versions. This means that the system changes primary keys (tamr_id
) for such LOOKUP
statements.
To avoid disruptions to existing transformations with LOOKUP
statements written in versions before v2020.016, during the upgrade to this version, the system automatically runs an upgrade script that disables automatic assignment of primary keys for existing LOOKUP
statements with non-equality join conditions. The upgrade script prevents breaking any current projects that contain transformations with LOOKUP
statements with non-equality join conditions and that depend on primary keys staying the same as in the software version from which you are upgrading.
The upgrade script adds the text hint(pkmanagement.manual)
in front of these statements. Once the script completes, it issues a report listing all the projects and their transformations that were changed. It also lists any projects and transformations that could not be updated with the text hint(pkmanagement.manual)
due to parsing or linting errors. For more information about using HINT
, see Labels, Hints, and Scope.
A potential data loss may occur in some LOOKUP
statements with non-equality join conditions. These LOOKUP
statements may create records with the same primary keys. When the results are written to HBase in Spark, these records overwrite each other.
In some cases, you may wish to avoid data loss at all costs, and don’t mind if you lose labels because of changed primary keys. In such cases, you can choose to re-enable automatic assignment of primary keys to records resulting from any LOOKUP
statements. To do this, remove the text hint(pkmanagement.manual)
that was added by the upgrade script to those LOOKUP
statements.
Note: If primary keys change, then all downstream labels, such as pair labels, or labels in a categorization project begin referring to the records with the previous set of primary keys. They no longer link to any existing records that have new primary keys. In such cases, you lose all previously-assigned labels. For more information, see Primary Key Management and Upgrading Tamr Core.
For information about other maintenance utilities that are available for unify-admin.sh
, see Utilities for Validation and System-Wide Processes.
LOOKUP Example 1
For example, consider a list of people and the two-letter code of the State they live in:
You want to use the LOOKUP
statement to replace each two-letter code in State with the full name of the state, so that the final table looks like this:
To apply LOOKUP
to the State
attribute, you use a lookup dataset, state_lookup_table.csv
, that contains an association between each state’s two-letter code and its full name:
Using LOOKUP
, you can replace the two-letter values in the State
attribute in your dataset with the full names of the state from the lookup dataset. To do this, you find where State
in your dataset matches stateCode
in state_lookup_table.csv
. When they match, the stateName
value is selected as the new value for State
. The LOOKUP
statement for this example is as follows:
LOOKUP get(stateName, 0) AS State
FROM state_lookup_table.csv
ON State = state_lookup_table.csv.stateCode;
LOOKUP Example 2: Aliasing
You can assign an alias to the lookup dataset so that you don't have to write the dataset's full name each time you reference its attribute.
Tip: This can be particularly useful for datasets with long names.
In the following example, the second line adds an alias lkp
to the table named state_lookup_table.csv
. This allows you to write lkp.stateCode
, instead of state_lookup_table.csv.stateCode
, each time you reference the attribute stateCode
from the lookup dataset.
LOOKUP get(stateName, 0) AS State
FROM state_lookup_table.csv AS lkp
ON State = lkp.stateCode;
LOOKUP Example 3
In this example, you would like the resulting table to have a State attribute with a value where the two-letter stateCode
value is followed by the full name of the state:
The script looks like the following:
LOOKUP (get(stateCode, 0) || ': ' || get(stateName, 0)) AS State
FROM state_lookup_table.csv AS lkp
ON State = lkp.stateCode;
Note: The following error occurs because the grouping condition, GROUP BY
, cannot find an attribute in the lookup dataset that it can associate with an attribute in the current dataset.
LOOKUP Example 4: Extract keywords from a description
Lookup can be used to extract keywords contained in a description attribute. In this example, the transformation extracts country
keywords contained in organization_description_array
.
LOOKUP get(country, 0) AS country_extracted
FROM country_lookup_table.csv AS lkp
ON array.contains(organization_description_array, get(lkp.country, 0));
Updated about 3 years ago