User GuidesAPI ReferenceRelease Notes
Doc HomeHelp CenterLog In

Join

JOIN combines rows from two datasets.

This statement is similar to the JOIN function in SQL. In general, joins have a left dataset and a right dataset. For the Tamr Core statement, the left dataset is always the result of the previous statement. Therefore, the left dataset is implicit, and therefore is not explicitly listed in a JOIN statement, unlike in standard SQL. The right dataset is specified by name. See SQL Joins.

important Important: Changing a JOIN statement can result in an unintentional change to tamr_id values in the unified dataset. To avoid data loss and maintain the stability of tamr_id in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set the tamr_id to a value you're sure is stable in the last transformation. See Managing Primary Keys.

Example

This example combines the rows from two datasets named parents.csv and children.csv.

During schema mapping, two source attributes from parents.csv are mapped to two unified attributes. In addition, unified attributes were created to store the following information about the children: age, name, and parent. These attributes are not mapped.

1408

The Name and Phone Number input attributes mapped to the Name and Phone_Number attributes, respectively.

This schema results in an initial preview with three empty attributes.

1866

The Children_Parent, Children_Name, and Children_Age attributes are 100% empty, while the Dataset, Name, and Phone_Number attributes are 100% populated.

You can use JOIN in a script to populate each of these empty columns. Remember that the unified dataset is the implicit left dataset, and the right dataset is specified by name (in this example, "children.csv"). It is given the alias AS "Children" in order to make selecting attributes easier. After the JOIN, a SELECT statement displays the columns with names to map to the corresponding unified attributes.

784

The complete JOIN statement.

The final result is:

1786

The JOIN statement populates the fields with values from the children.csv file: Charlie and Alice are each parents on one child, while Bob is the parent of two children.

Previewing Joins

Preview executes transformations on small samples of each dataset. When executing a JOIN transformation between two sampled datasets, the probability of joining any pair of records is low. As a result, the preview may contain very few, or even no, joined records.

To see the true effect of a JOIN, you must apply the transformation to the complete dataset by selecting Update Unified Dataset and then view all records.

Additional Information

The supported JOIN types are:

  • INNER
  • FULL OUTER
  • LEFT OUTER
  • RIGHT OUTER
  • CROSS

The syntax for JOIN is to place the type before the word JOIN (for example, FULL OUTER JOIN with Children. If you do not specify a type, the default is INNER JOIN.

Note: Tamr recommends that you avoid using CROSS JOIN as it is likely to cause performance problems.

In addition:

  • JOIN prefixes every added column with the name of the dataset it came from. You can change this by writing JOIN WITH <dataset> AS <prefix> ON .
  • JOIN statements can be chained together.
  • It is usually a good idea to follow up a JOIN with a SELECT that keeps all the relevant columns out of the joins.
  • All JOIN keywords are case insensitive.
  • JOIN statements often use logical comparators to find values that do, and do not, meet certain criteria.

Optimizing JOIN Statements

When working on a project with multiple JOIN statements, optimization speeds up processing and helps you avoid memory issues. This section describes different ways to structure JOIN statements so that they make the most effective use of the underlying technology that Tamr Core uses to process transformations, Spark.

Prepare the Main Table Once

The following example, which joins multiple tables, redefines the main table several times.

PRE_JOIN1: SELECT *, ...;
LOOKUP_TABLE1: USE ...;
USE PRE_JOIN1; 
LEFT OUTER JOIN WITH LOOKUP_TABLE1 ON ...;
SELECT *, ...;

PRE_JOIN2: SELECT *, ...;
LOOKUP_TABLE2: USE ...;
USE PRE_JOIN2; 
LEFT OUTER JOIN WITH LOOKUP_TABLE2 ON ...;
SELECT *, ...;

PRE_JOIN3: SELECT *, ...;
LOOKUP_TABLE3: USE ...;
USE PRE_JOIN3; 
LEFT OUTER JOIN WITH LOOKUP_TABLE3 ON ...;
SELECT *, ...;

While the outcome of this transformation may be satisfactory, it can be rewritten to use less processing time. Spark is more efficient when processing statements of the same type in sequence. To optimize execution, you can structure the JOIN so that s single SELECT statement is followed by all of the lookups and then all of the joins, as in the following example.

// All your Pre-join work here...
PRE_JOIN: SELECT *, ...;

// All your lookup table processing here....
LOOKUP_TABLE1: USE ...;
LOOKUP_TABLE2: USE ...;
LOOKUP_TABLE3: USE ...;

// All your joins here....
USE PRE_JOIN;
LEFT OUTER JOIN LOOKUP_TABLE1 ON ...;
LEFT OUTER JOIN LOOKUP_TABLE2 ON ...;
LEFT OUTER JOIN LOOKUP_TABLE3 ON ...;

// Post-join select
SELECT *, ...;

Don’t Use SELECT * to Prepare Lookup Tables

When you do a JOIN, all of the columns in your lookup table get added to the output table. This is not obvious in Tamr Core because the preview feature does not include any columns except those for unified attributes. These columns for the lookup table do exist, and occupy memory, during transformation processing. If your lookup table has a lot of columns this can make for wide, memory-intensive datasets.

So instead of writing a statement like this:

USE external_table.csv;
LOOKUP_TABLE: SELECT *, process(LookupKey) as LookupKey, process(LookupValue) as LookupValue;

A better practice is to write a statement like this (no SELECT *) :

USE external_table.csv;
LOOKUP_TABLE: SELECT process(LookupKey) as LookupKey, process(LookupValue) as LookupValue;

Use GROUP BY or LOOKUP

A typical goal for a JOIN statement is to look up a single value for each record. This generally does not change the number of rows in your dataset, but it can if the lookup table has duplicate join keys. Duplicating rows also makes processing your join less efficient. To avoid this problem, you can either:

  • Use GROUP BY on your join table
  • Use the LOOKUP statement

To ensure that there are no duplicate keys, consider using a GROUP BY instead of (or in addition to) a SELECT when you prepare your lookup tables, as shown in the following example. Using LEFT OUTER JOIN ensures that no rows are deleted from your main dataset.

USE external_table.csv;
LOOKUP_TABLE: {
  SELECT process(LookupKey) as LookupKey, process(LookupValue) as LookupValue;
  GROUP first(LookupValue) as LookupValue BY LookupKey;
};

This structure ensures that your LookupKey column only includes unique values. If you have multiple LookupValues for the same LookupKey, you will need to write some more complicated logic than first(LookupValue) to resolve the correct lookup value. In that case, though, you may actually intend to get back multiple join results per LookupKey and increase the row count in your dataset.

Alternatively, you can use a LOOKUP statement, which implicitly does a MERGE BY-like aggregation on your join dataset. It also uses a LEFT OUTER JOIN so no rows are dropped from your main dataset.

Note: This structure is only recommended for small join datasets.

LOOKUP to_string(LookupValue) AS my_value
FROM external_table.csv
ON my_key = external_table.csv.LookupKey;

Note: In some cases, LOOKUP has created nested types of array[array[string]] when the type in the original lookup table is array[string], so be sure to convert it to the desired type.

Use a Broadcast Join

In a standard join, Spark uses multiple nodes to process both of the tables that are being joined, shuffling the tables between nodes. This can result in a lot of network traffic. A broadcast join copies, or broadcasts, the contents of one of the tables to every node, with the result that neither of the tables needs to be shuffled among the nodes. This strategy can work well for lookup joins because lookup tables are usually small and can be broadcast easily, while your main table is usually large.

Note: If your lookup table isn't actually small, using a broadcast join can cause you to run out of memory.

After minimizing the number of rows and columns in your lookup table, add a broadcast join hint by prepending the HINT(join.broadcast) statement to your JOIN. For multiple JOINs in a row, you can apply the HINT to the entire scope (using { }) as in this example:

USE PRE_JOIN;
HINT(join.broadcast) {
  LEFT OUTER JOIN WITH LOOKUP_TABLE1 ON ...;
  LEFT OUTER JOIN WITH LOOKUP_TABLE2 ON ...;
};