Join
JOIN
combines rows from two datasets.
This statement is similar to the JOIN
function in SQL. In general, joins have a left dataset and a right dataset. For the Tamr Core statement, the left dataset is always the result of the previous statement. Therefore, the left dataset is implicit, and therefore is not explicitly listed in a JOIN
statement, unlike in standard SQL. The right dataset is specified by name. See SQL Joins.
Important: Changing a
JOIN
statement can result in an unintentional change totamr_id
values in the unified dataset. To avoid data loss and maintain the stability oftamr_id
in mastering and categorization projects, as a best practice Tamr recommends that you explicitly set thetamr_id
to a value you're sure is stable in the last transformation. See Managing Primary Keys.
Example
This example combines the rows from two datasets named parents.csv
and children.csv
.
During schema mapping, two source attributes from parents.csv
are mapped to two unified attributes. In addition, unified attributes were created to store the following information about the children: age, name, and parent. These attributes are not mapped.
This schema results in an initial preview with three empty attributes.
You can use JOIN
in a script to populate each of these empty columns. Remember that the unified dataset is the implicit left dataset, and the right dataset is specified by name (in this example, "children.csv"
). It is given the alias AS "Children"
in order to make selecting attributes easier. After the JOIN
, a SELECT
statement displays the columns with names to map to the corresponding unified attributes.
The final result is:
Previewing Joins
Preview executes transformations on small samples of each dataset. When executing a JOIN
transformation between two sampled datasets, the probability of joining any pair of records is low. As a result, the preview may contain very few, or even no, joined records.
To see the true effect of a JOIN, you must apply the transformation to the complete dataset by selecting Update Unified Dataset and then view all records.
Additional Information
The supported JOIN
types are:
INNER
FULL OUTER
LEFT OUTER
RIGHT OUTER
CROSS
The syntax for JOIN
is to place the type before the word JOIN
(for example, FULL OUTER JOIN with Children
. If you do not specify a type, the default is INNER JOIN
.
Note: Tamr recommends that you avoid using CROSS JOIN
as it is likely to cause performance problems.
In addition:
JOIN
prefixes every added column with the name of the dataset it came from. You can change this by writingJOIN WITH <dataset> AS <prefix> ON
.JOIN
statements can be chained together.- It is usually a good idea to follow up a
JOIN
with aSELECT
that keeps all the relevant columns out of the joins. - All
JOIN
keywords are case insensitive. JOIN
statements often use logical comparators to find values that do, and do not, meet certain criteria.
Optimizing JOIN Statements
When working on a project with multiple JOIN
statements, optimization speeds up processing and helps you avoid memory issues. This section describes different ways to structure JOIN
statements so that they make the most effective use of the underlying technology that Tamr Core uses to process transformations, Spark.
Prepare the Main Table Once
The following example, which joins multiple tables, redefines the main table several times.
PRE_JOIN1: SELECT *, ...;
LOOKUP_TABLE1: USE ...;
USE PRE_JOIN1;
LEFT OUTER JOIN WITH LOOKUP_TABLE1 ON ...;
SELECT *, ...;
PRE_JOIN2: SELECT *, ...;
LOOKUP_TABLE2: USE ...;
USE PRE_JOIN2;
LEFT OUTER JOIN WITH LOOKUP_TABLE2 ON ...;
SELECT *, ...;
PRE_JOIN3: SELECT *, ...;
LOOKUP_TABLE3: USE ...;
USE PRE_JOIN3;
LEFT OUTER JOIN WITH LOOKUP_TABLE3 ON ...;
SELECT *, ...;
While the outcome of this transformation may be satisfactory, it can be rewritten to use less processing time. Spark is more efficient when processing statements of the same type in sequence. To optimize execution, you can structure the JOIN
so that s single SELECT
statement is followed by all of the lookups and then all of the joins, as in the following example.
// All your Pre-join work here...
PRE_JOIN: SELECT *, ...;
// All your lookup table processing here....
LOOKUP_TABLE1: USE ...;
LOOKUP_TABLE2: USE ...;
LOOKUP_TABLE3: USE ...;
// All your joins here....
USE PRE_JOIN;
LEFT OUTER JOIN LOOKUP_TABLE1 ON ...;
LEFT OUTER JOIN LOOKUP_TABLE2 ON ...;
LEFT OUTER JOIN LOOKUP_TABLE3 ON ...;
// Post-join select
SELECT *, ...;
Don’t Use SELECT * to Prepare Lookup Tables
When you do a JOIN
, all of the columns in your lookup table get added to the output table. This is not obvious in Tamr Core because the preview feature does not include any columns except those for unified attributes. These columns for the lookup table do exist, and occupy memory, during transformation processing. If your lookup table has a lot of columns this can make for wide, memory-intensive datasets.
So instead of writing a statement like this:
USE external_table.csv;
LOOKUP_TABLE: SELECT *, process(LookupKey) as LookupKey, process(LookupValue) as LookupValue;
A better practice is to write a statement like this (no SELECT *
) :
USE external_table.csv;
LOOKUP_TABLE: SELECT process(LookupKey) as LookupKey, process(LookupValue) as LookupValue;
Use GROUP BY or LOOKUP
A typical goal for a JOIN
statement is to look up a single value for each record. This generally does not change the number of rows in your dataset, but it can if the lookup table has duplicate join keys. Duplicating rows also makes processing your join less efficient. To avoid this problem, you can either:
- Use
GROUP BY
on your join table - Use the
LOOKUP
statement
To ensure that there are no duplicate keys, consider using a GROUP BY
instead of (or in addition to) a SELECT
when you prepare your lookup tables, as shown in the following example. Using LEFT OUTER JOIN
ensures that no rows are deleted from your main dataset.
USE external_table.csv;
LOOKUP_TABLE: {
SELECT process(LookupKey) as LookupKey, process(LookupValue) as LookupValue;
GROUP first(LookupValue) as LookupValue BY LookupKey;
};
This structure ensures that your LookupKey
column only includes unique values. If you have multiple LookupValue
s for the same LookupKey
, you will need to write some more complicated logic than first(LookupValue)
to resolve the correct lookup value. In that case, though, you may actually intend to get back multiple join results per LookupKey
and increase the row count in your dataset.
Alternatively, you can use a LOOKUP
statement, which implicitly does a MERGE BY
-like aggregation on your join dataset. It also uses a LEFT OUTER JOIN
so no rows are dropped from your main dataset.
Note: This structure is only recommended for small join datasets.
LOOKUP to_string(LookupValue) AS my_value
FROM external_table.csv
ON my_key = external_table.csv.LookupKey;
Note: In some cases, LOOKUP
has created nested types of array[array[string]] when the type in the original lookup table is array[string], so be sure to convert it to the desired type.
Use a Broadcast Join
In a standard join, Spark uses multiple nodes to process both of the tables that are being joined, shuffling the tables between nodes. This can result in a lot of network traffic. A broadcast join copies, or broadcasts, the contents of one of the tables to every node, with the result that neither of the tables needs to be shuffled among the nodes. This strategy can work well for lookup joins because lookup tables are usually small and can be broadcast easily, while your main table is usually large.
Note: If your lookup table isn't actually small, using a broadcast join can cause you to run out of memory.
After minimizing the number of rows and columns in your lookup table, add a broadcast join hint by prepending the HINT(join.broadcast)
statement to your JOIN. For multiple JOINs in a row, you can apply the HINT
to the entire scope (using { }
) as in this example:
USE PRE_JOIN;
HINT(join.broadcast) {
LEFT OUTER JOIN WITH LOOKUP_TABLE1 ON ...;
LEFT OUTER JOIN WITH LOOKUP_TABLE2 ON ...;
};
Updated about 2 years ago