User GuidesAPI ReferenceRelease Notes
Doc HomeHelp CenterLog In
User Guides

Overview

The LOOKUP statement references a set of values from an auxiliary dataset, known as the <lookup_dataset> in order to transform values in an attribute in your own dataset.
The LOOKUP statement allows you to use rows from the lookup dataset, joined by a key. Lookups are commonly used to standardize values in a column. For example, you could use LOOKUP to change all states values into their two-letter abbreviation.

Unlike a JOIN, LOOKUP does not change the number of rows in your dataset. The referenced <lookup_dataset> is grouped on the join keys before the target value is added. So if a key appears multiple times in the "lookup_dataset", LOOKUP combines the values of repeated keys into a multi-value, and does not produce multiple rows as a JOIN statement would. For example, if the <lookup_dataset> contained both "USA"="United States" and "USA"="United States of America", the value produced by a LOOKUP for "USA" would be the multi-value ["United States", "United States of America"].

LOOKUP requires that you upload the <lookup_dataset>. To create the lookup table without uploading it as a separate dataset, you can use ROWS.

📘

LOOKUP assumes a small dataset

LOOKUP is optimized for use with small lookup datasets. You may experience decreased performance if the size of the <lookup_dataset> is similar to your unified dataset.

The general form of this transformation is:

LOOKUP (<expr> AS <attribute>)*
FROM <lookup_dataset> AS <alias>
ON <join_condition>;
  • All attributes referenced in the <expr> need to reference attributes from the provided <lookup_dataset>.
  • All attributes referenced in the <attribute> need to reference attributes from the current dataset.
  • The <join_condition> must be an equality expression where one side references an attribute from the current dataset and the other side references an attribute from the <lookup_dataset>.

Example 1

For example, consider a list of people and the two-letter code of the State they live in:

746

We would like to use the LOOKUP statement to replace each two-letter code in State with the full name of the state, so that the final table looks like this:

748

To apply LOOKUP to the State attribute, we can use a lookup dataset called state_lookup_table.csv that contains an association between each state’s two-digit letter and the full name of the state:

496

The two-letter values in the State attribute in our own dataset can be replaced with the full names of the state from the lookup dataset. To do this, we want to find where the State in our dataset matches the stateCode attribute in state_lookup_table.csv. Wherever they match, the stateName will be selected as the new value for State.

The LOOKUP statement for this example looks like:

LOOKUP get(stateName, 0) AS State
FROM state_lookup_table.csv
ON State = state_lookup_table.csv.stateCode;

Aliasing

Additionally, you can assign an alias to the lookup dataset so that its full name does not have to be written every time an attribute is referenced. The second line in the following example adds an alias lookup to the table named state_lookup_table.csv. This allows you to write lookup.stateCode, rather than state_lookup_table.csv.stateCode and is a useful tip for datasets with long names.

LOOKUP get(stateName, 0) AS State
FROM state_lookup_table.csv AS lookup
ON State = lookup.stateCode;

Example 2

In this example, we would like our resulting table to have a State column with a value where the two-letter stateCode is followed by the full name of the state:

792

Our transformation will look like:

LOOKUP (get(stateCode, 0) || ': ' || get(stateName, 0)) AS State
FROM state_lookup_table.csv AS lookup
ON State = lookup.stateCode;

🚧

Note

The following error is caused because there is no attribute present from the lookup dataset in the expression to associate the lookup dataset with our current dataset. This expression is a grouping condition, hence the error mentioning GROUP by.

1040

📘

Using CASE

You can also use CASE statements to standardize values in a column. They are recommended when there are few values to standardize and the standardizations are not expected to change.