Tamr Documentation

Join

JOIN allows you to combine rows from two datasets.

This statement is similar to the JOIN function in SQL. In general, joins have a left dataset and a right dataset. In Tamr transformations, the left dataset is always the result of the previous statement. Therefore, the left dataset is implicit and is not explicitly listed in Tamr JOIN statements, unlike in standard SQL. The right dataset is specified by name. For more information, see SQL Joins.

Important: Changing a JOIN statement can result in an unintentional change to tamr_id values in the unified dataset. To avoid data loss and maintain the stability of tamr_id in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set the tamr_id to a value you're sure is stable in the last transformation. See Managing Primary Keys.

Example

Here's a simple example of a JOIN, for two datasets named parents.csv and children.csv.

In this example, during schema mapping two source attributes from parents.csv were mapped to two unified attributes. Additional, unmapped unified attributes were also created to store the following information about the children: age, name, and parent.

This results in an initial preview with three empty attributes.

You can use JOIN in a script to populate each of these empty columns. Note that the unified dataset is the left dataset, and the right dataset is specified by name (in this example, "children.csv"). It is given the alias AS "Children" in order to make selecting attributes easier. After the JOIN, a SELECT statement displays the columns with names to match those of the unified attributes.

The final result is:

📘

Previewing Joins

Preview executes transformations on small samples of each dataset. When executing a JOIN transformation between two sampled datasets, the probability of joining any pair of records is low. As a result, the preview may contain very few, or even no, joined records.

To see the true join, apply the transformation to the non-sampled datasets by clicking Update Unified Dataset and subsequently view all records.

Additional Information

The supported JOIN types are:

  • INNER
  • FULL OUTER
  • LEFT OUTER
  • RIGHT OUTER
  • CROSS

The syntax for JOIN is to place the type before the word JOIN (for example, FULL OUTER JOIN with Children. If you do not specify a type, the default is INNER JOIN.

Note: CROSS JOIN is not recommended, as it is likely to cause performance problems.

In addition:

  • JOIN prefixes every added column with the name of the dataset it came from. This can be changed by writing JOIN WITH <dataset> AS <prefix> ON .
  • JOIN statements can be chained together.
  • It is usually a good idea to follow up a JOIN with a SELECT that keeps all the relevant columns out of the joins.
  • All JOIN keywords are case insensitive.
  • JOIN statements often use logical comparators to find values that do, and do not, meet certain criteria.

Optimizing JOIN Statements

When working on a project with multiple JOIN statements, optimization speeds up processing and helps you avoid memory issues. This section describes different ways to structure JOIN statements so that they make the most effective use of the underlying technology that Tamr uses to process transformations, Spark.

Prepare the Main Table Once

The following example, which joins multiple tables, redefines the main table several times.

PRE_JOIN1: SELECT *, ...;
LOOKUP_TABLE1: USE ...;
USE PRE_JOIN1; 
LEFT OUTER JOIN WITH LOOKUP_TABLE1 ON ...;
SELECT *, ...;

PRE_JOIN2: SELECT *, ...;
LOOKUP_TABLE2: USE ...;
USE PRE_JOIN2; 
LEFT OUTER JOIN WITH LOOKUP_TABLE2 ON ...;
SELECT *, ...;

PRE_JOIN3: SELECT *, ...;
LOOKUP_TABLE3: USE ...;
USE PRE_JOIN3; 
LEFT OUTER JOIN WITH LOOKUP_TABLE3 ON ...;
SELECT *, ...;

While the outcome of this transformation may be satisfactory, it can be rewritten to use less processing time. Spark is more efficient when processing statements of the same type in sequence. To optimize execution, you can structure the JOIN so that s single SELECT statement is followed by all of the lookups and then all of the joins, as in the following example.

// All your Pre-join work here...
PRE_JOIN: SELECT *, ...;

// All your lookup table processing here....
LOOKUP_TABLE1: USE ...;
LOOKUP_TABLE2: USE ...;
LOOKUP_TABLE3: USE ...;

// All your joins here....
USE PRE_JOIN;
LEFT OUTER JOIN LOOKUP_TABLE1 ON ...;
LEFT OUTER JOIN LOOKUP_TABLE2 ON ...;
LEFT OUTER JOIN LOOKUP_TABLE3 ON ...;

// Post-join select
SELECT *, ...;

Don’t Use SELECT * to Prepare Lookup Tables

When you do a JOIN, all of the columns in your lookup table get added to the output table. This is not obvious in Tamr because the preview feature does not include any columns except those for unified attributes. These columns for the lookup table do exist, and occupy memory, during transformation processing. If your lookup table has a lot of columns this can make for wide, memory-intensive datasets.

So instead of writing a statement like this:

USE external_table.csv;
LOOKUP_TABLE: SELECT *, process(LookupKey) as LookupKey, process(LookupValue) as LookupValue;

A better practice is to write a statement like this (no SELECT *) :

USE external_table.csv;
LOOKUP_TABLE: SELECT process(LookupKey) as LookupKey, process(LookupValue) as LookupValue;

Use GROUP BY or LOOKUP

A typical goal for a JOIN statement is to look up a single value for each record. This generally does not change the number of rows in your dataset, but it can if the lookup table has duplicate join keys. Duplicating rows also makes processing your join less efficient. To avoid this problem, you can either:

  • Use GROUP BY on your join table
  • Use the LOOKUP statement

To ensure that there are no duplicate keys, consider using a GROUP BY instead of (or in addition to) a SELECT when you prepare your lookup tables, as shown in the following example. Using LEFT OUTER JOIN ensures that no rows are deleted from your main dataset.

USE external_table.csv;
LOOKUP_TABLE: {
  SELECT process(LookupKey) as LookupKey, process(LookupValue) as LookupValue;
  GROUP first(LookupValue) as LookupValue BY LookupKey;
};

This structure ensures that your LookupKey column only includes unique values. If you have multiple LookupValues for the same LookupKey, you will need to write some more complicated logic than first(LookupValue) to resolve the correct lookup value. In that case, though, you may actually intend to get back multiple join results per LookupKey and increase the row count in your dataset.

Alternatively, you can use a LOOKUP statement, which implicitly does a MERGE BY-like aggregation on your join dataset. It also uses a LEFT OUTER JOIN so no rows are dropped from your main dataset.

Note: This structure is only recommended for small join datasets.

LOOKUP to_string(LookupValue) AS my_value
FROM external_table.csv
ON my_key = external_table.csv.LookupKey;

Note: In some cases, LOOKUP has created nested types of array<array> when the type in the original lookup table is array, so be sure to convert it to the desired type.

Use a Broadcast Join

In a standard join, Spark uses multiple nodes to process both of the tables that are being joined, shuffling the tables between nodes. This can result in a lot of network traffic. A broadcast join copies, or broadcasts, the contents of one of the tables to every node, with the result that neither of the tables needs to be shuffled among the nodes. This strategy can work well for lookup joins because lookup tables are usually small and can be broadcast easily, while your main table is usually large.

Note: If your lookup table isn't actually small, using a broadcast join can cause you to run out of memory.

After minimizing the number of rows and columns in your lookup table, add a broadcast join hint by prepending the HINT(join.broadcast) statement to your JOIN. For multiple JOINs in a row, you can apply the HINT to the entire scope (using { }) as in this example:

USE PRE_JOIN;
HINT(join.broadcast) {
  LEFT OUTER JOIN WITH LOOKUP_TABLE1 ON ...;
  LEFT OUTER JOIN WITH LOOKUP_TABLE2 ON ...;
};

Updated about a year ago



Join


JOIN allows you to combine rows from two datasets.

Suggested Edits are limited on API Reference Pages

You can only suggest edits to Markdown body content, but not to the API spec.