Statement Modifiers
You can apply labels and hints to modify the statements in your transformation scripts. Use scope syntax { }
to apply labels and hints to multiple statements at a time.
Labeled Views
To provide a name and save the state of a dataset output of any transformation statement, you can apply a label. In SQL, this is also called a "view", which is a virtual table based on the result-set of an SQL statement. You can later reference this view using the label anywhere a dataset could be referenced, such as in JOIN
, UNION
, or USE
statements.
Label examples
my_label: SELECT *;
A_different_label: FILTER company_name IS NOT NULL;
Here is how you can reference a labeled dataset:
USE my_label;
See USE for an additional example.
Hint
HINT
is an advanced feature that allows you to modify how Tamr Core runs a transformation. You add the HINT
statement directly before the transformation that you want to modify.
You supply a predefined value for each HINT
statement to modify transformation execution for a specific purpose. A description of each HINT
value follows.
checkpoint.reliable
This value for the HINT
statement defines a reliable store for Spark to complete a CHECKPOINT
.
See Checkpoint.
checkpoint.local
Important: Use this
HINT
value with caution and consult with Tamr Support at [email protected].
This value for the HINT
statement specifies a local (unreliable) store for Spark to complete a CHECKPOINT
. See Checkpoint.
Note: Checkpointing to a local store can result in better performance, depending on the setup of the underlying Spark cluster.
join.broadcast
Important: Use this
HINT
value with caution and consult with Tamr Support at [email protected].
This value for the HINT
statement changes the way Spark completes a join operation. A broadcast join is more efficient than the default join when the joined dataset is much smaller than the unified dataset.
HINT(join.broadcast) JOIN WITH "my_small_data.csv" AS sm on id == sm.id;
pkmanagement.manual
This value for the HINT
statement disables the automatic management (assignment) of the primary key (tamr_id
) for records in unified datasets. If you use this HINT
value, there is no guarantee that tamr_id
is a unique string for every record of the unified dataset. You will need to ensure the uniqueness of your primary key manually. See Primary Key Management.
HINT(pkmanagement.manual) INNER JOIN WITH "example_dataset.csv" AS ex ON my_id == ex.id;
pkmanagement.auto
This value for the HINT
statement enables the automatic management of the primary key of unified datasets (tamr_id
). See Primary Key Management.
HINT(pkmanagement.auto) MERGE BY city;
pkmanagement.default
This value for the HINT
statement sets the automatic management of the primary key to the default value. Since the current default is having primary key management enabled, this is the same as HINT(pkmanagement.auto)
.
Since this is the default behavior for managing primary keys, this HINT
statement can be omitted. Primary key management has the same default behavior.
HINT(pkmanagement.default) MERGE BY city;
Scope
Scoping allows you to apply a HINT
or a label to a group of transformations rather than to a single statement. A scope begins with a {
and ends with a }
.
For example:
// Use scope to label the output of three statements:
my_label: {
SELECT *, str_join(', ', array(address_line1, city, state, zip)) AS full_address;
FILTER full_address IS NOT EMPTY;
MERGE BY full_address;
};
// Use scope to apply a `HINT` to two joins:
HINT(join.broadcast) {
JOIN WITH "state_lkp.csv" AS st ON state == st.name;
JOIN WITH "price_lkp.csv" AS pr ON part_name == pr.product;
};
Updated about 2 years ago