LOOKUP allows you to use values from a lookup dataset, joined by a key, to transform values in an attribute in your dataset. For example, you could use
LOOKUP to change all state name values, such as California, into their two-letter abbreviations, such as CA.
Note: To standardize values in an attribute you can also use a case expression. Use
case for dataset attributes where you must standardize a few values only and the standardizations are not expected to change over time.
LOOKUP references a set of values from an auxiliary dataset, known as the
<lookup_dataset>, which you must upload to Tamr Core. To create a lookup table without uploading it as a separate dataset, you can use ROWS.
LOOKUP is optimized for use with small lookup datasets. You may experience decreased performance if the size of the
<lookup_dataset> is similar to your unified dataset. If you use a very small lookup dataset, then running a
LOOKUP statement can be faster than running a
In most cases,
LOOKUP does not change the number of rows in your dataset. The referenced
<lookup_dataset> is grouped on the join keys before the target value is added. If a key appears multiple times in the
LOOKUP combines the values of repeated keys into an array, and does not produce multiple rows as a
JOIN statement would.
For example, if the
<lookup_dataset> contained both "USA"="United States" and "USA"="United States of America", the value produced by a
LOOKUP for "USA" would be the array ["United States", "United States of America"].
However, if a row in your dataset matches with multiple different values of a join key in the
LOOKUP increases the number of rows in your dataset.
For example, if the
<lookup_dataset> contains both "US"="United States" and "USA"="United States of America", and the
LOOKUP condition is
CountryCode.contains(<lookup_dataset>.CountryCode), then the record in your dataset with the country code "USA" matches with both records of the
<lookup_dataset>. This results in two records in your dataset instead of one. Since "US" and "USA" are different, those records in the
<lookup_dataset> are not grouped together by
The transformation that uses a
LOOKUP statement has the following syntax:
LOOKUP (<expr> AS <attribute>)* FROM <lookup_dataset> AS <alias> ON <join_condition>;
In this transformation:
<expr>must reference attributes from the provided
<attribute>must reference attributes from your dataset.
<join_condition>must be an expression that references an attribute in the current dataset and an attribute in the
LOOKUP statements typically use logical comparators to find values that do, and do not, meet certain criteria. Logical comparators include both equality and non-equality expressions.
LOOKUP statement takes both equality and non-equality expressions:
- If the
<join_condition>is an equality expression, such as
TableA.Genre=TableB.Genre, then a referenced attribute from the current dataset must match a referenced attribute from the
- If the
<join_condition>is not an equality expression, such as
TableA.Name.contains(TableB.Name), then an attribute from the current dataset may match more than one attribute from the
<lookup_dataset>. This may result in
LOOKUPadding rows to your dataset, and as a result duplicating any unique primary keys you may have.
Note: Tamr recommends that you use
LOOKUP with equality join conditions. However, non-equality conditions are also supported and have caveats that are described in this section.
If you use a non-equality join condition in a
LOOKUP statement, then when
LOOKUP performs a join on attributes from two tables, the attributes are not required to have matching attribute values for them to be considered a match. In this case, the attribute values in your table do not directly compare, using equality
= , with the corresponding attribute values in the lookup table. For example, the condition uses
>=. This could result in a record from your original dataset matching multiple different records from the
<lookup_dataset>. Since these records have different attribute values, the
GROUP by operation that runs as part of
LOOKUP cannot combine them into a single record. This creates records in the output from
LOOKUP where the record from your original dataset appears multiple times alongside different records from the
<lookup_dataset>. In addition, the primary key of the record from the original dataset is duplicated.
For example, consider TableA in which the
Name attribute's values for two records are
TIMOTHY for record1 and
ALICE for record2. In TableB (which will function as the
Name attribute's values are
TIM for record1 and
TIMOTHY for record2. If the
JOIN condition is
LOOKUP returns a table with two new records, one record that matches record1 from TableA and record1 from TableB, and one record that matches record1 from TableA and record2 from TableB. These two new records will have the same primary key because they take the primary key from the same record from TableA. When Tamr Core writes this table to HBase while running Spark that uses partitions, HBase may write the record that arrives first, and then overwrite it with the values from the second record, because they have the same primary keys.
Starting with v2020.016 and greater, Tamr Core automatically assigns primary keys to all
LOOKUP statements with non-equality join conditions that you add in this version or in subsequent versions. This means that the system changes primary keys (
tamr_id) for such
To avoid disruptions to existing transformations with
LOOKUP statements written in versions before v2020.016, during the upgrade to this version, the system automatically runs an upgrade script that disables automatic assignment of primary keys for existing
LOOKUP statements with non-equality join conditions. The upgrade script prevents breaking any current projects that contain transformations with
LOOKUP statements with non-equality join conditions and that depend on primary keys staying the same as in the software version from which you are upgrading.
The upgrade script adds the text
hint(pkmanagement.manual) in front of these statements. Once the script completes, it issues a report listing all the projects and their transformations that were changed. It also lists any projects and transformations that could not be updated with the text
hint(pkmanagement.manual) due to parsing or linting errors. For more information about using
HINT, see Labels, Hints, and Scope.
A potential data loss may occur in some
LOOKUP statements with non-equality join conditions. These
LOOKUP statements may create records with the same primary keys. When the results are written to HBase in Spark, these records overwrite each other.
In some cases, you may wish to avoid data loss at all costs, and don’t mind if you lose labels because of changed primary keys. In such cases, you can choose to re-enable automatic assignment of primary keys to records resulting from any
LOOKUP statements. To do this, remove the text
hint(pkmanagement.manual) that was added by the upgrade script to those
Note: If primary keys change, then all downstream labels, such as pair labels, or labels in a categorization project begin referring to the records with the previous set of primary keys. They no longer link to any existing records that have new primary keys. In such cases, you lose all previously-assigned labels. For more information, see Primary Key Management and Upgrading Tamr Core.
For information about other maintenance utilities that are available for
unify-admin.sh, see Utilities for Validation and System-Wide Processes.
For example, consider a list of people and the two-letter code of the State they live in:
You want to use the
LOOKUP statement to replace each two-letter code in State with the full name of the state, so that the final table looks like this:
LOOKUP to the
State attribute, you use a lookup dataset,
state_lookup_table.csv, that contains an association between each state’s two-letter code and its full name:
LOOKUP, you can replace the two-letter values in the
State attribute in your dataset with the full names of the state from the lookup dataset. To do this, you find where
State in your dataset matches
state_lookup_table.csv. When they match, the
stateName value is selected as the new value for
LOOKUP statement for this example is as follows:
LOOKUP get(stateName, 0) AS State FROM state_lookup_table.csv ON State = state_lookup_table.csv.stateCode;
You can assign an alias to the lookup dataset so that you don't have to write the dataset's full name each time you reference its attribute.
Tip: This can be particularly useful for datasets with long names.
In the following example, the second line adds an alias
lkp to the table named
state_lookup_table.csv. This allows you to write
lkp.stateCode, instead of
state_lookup_table.csv.stateCode, each time you reference the attribute
stateCode from the lookup dataset.
LOOKUP get(stateName, 0) AS State FROM state_lookup_table.csv AS lkp ON State = lkp.stateCode;
In this example, you would like the resulting table to have a State attribute with a value where the two-letter
stateCode value is followed by the full name of the state:
The script looks like the following:
LOOKUP (get(stateCode, 0) || ': ' || get(stateName, 0)) AS State FROM state_lookup_table.csv AS lkp ON State = lkp.stateCode;
Note: The following error occurs because the grouping condition,
GROUP BY, cannot find an attribute in the lookup dataset that it can associate with an attribute in the current dataset.
Lookup can be used to extract keywords contained in a description attribute. In this example, the transformation extracts
country keywords contained in
LOOKUP get(country, 0) AS country_extracted FROM country_lookup_table.csv AS lkp ON array.contains(organization_description_array, get(lkp.country, 0));
Updated about 2 years ago