Labeled Views

To provide a name and save the state of a dataset output of any transformation statement, you can apply a label. In SQL, this is also called a "view", which is a virtual table based on the result-set of an SQL statement. You can later reference this view using the label anywhere a dataset could be referenced, such as in JOIN, UNION, or USE statements.

Label examples

my_label: SELECT *;

A_different_label: FILTER company_name IS NOT NULL;

Here is how you can reference a labeled dataset:

USE my_label;

See USE for an additional example.

Hint

HINT is an advanced feature that allows you to modify how Tamr Core runs a transformation. You add the HINT statement directly before the transformation that you want to modify.

You supply a predefined value for each HINT statement to modify transformation execution for a specific purpose. A description of each HINT value follows.

checkpoint.reliable

This value for the HINT statement defines a reliable store for Spark to complete a CHECKPOINT.
See Checkpoint.

checkpoint.local

Important: Use this HINT value with caution and consult with Tamr Support at help@tamr.com.

This value for the HINT statement specifies a local (unreliable) store for Spark to complete a CHECKPOINT. See Checkpoint.

Note: Checkpointing to a local store can result in better performance, depending on the setup of the underlying Spark cluster.

join.broadcast

Important: Use this HINT value with caution and consult with Tamr Support at help@tamr.com.

This value for the HINT statement changes the way Spark completes a join operation. A broadcast join is more efficient than the default join when the joined dataset is much smaller than the unified dataset.

HINT(join.broadcast) JOIN WITH "my_small_data.csv" AS sm on id == sm.id;

pkmanagement.manual

This value for the HINT statement disables the automatic management (assignment) of the primary key (tamr_id) for records in unified datasets. If you use this HINT value, there is no guarantee that tamr_id is a unique string for every record of the unified dataset. You will need to ensure the uniqueness of your primary key manually. See Primary Key Management.

HINT(pkmanagement.manual) INNER JOIN WITH "example_dataset.csv" AS ex ON my_id == ex.id;

pkmanagement.auto

This value for the HINT statement enables the automatic management of the primary key of unified datasets (tamr_id). See Primary Key Management.

HINT(pkmanagement.auto) MERGE BY city;

pkmanagement.default

This value for the HINT statement sets the automatic management of the primary key to the default value. Since the current default is having primary key management enabled, this is the same as HINT(pkmanagement.auto).

Since this is the default behavior for managing primary keys, this HINT statement can be omitted. Primary key management has the same default behavior.

HINT(pkmanagement.default) MERGE BY city;

Scope

Scoping allows you to apply a HINT or a label to a group of transformations rather than to a single statement. A scope begins with a { and ends with a }.

For example:

// Use scope to label the output of three statements:

my_label: {
  SELECT *, str_join(', ', array(address_line1, city, state, zip)) AS full_address;
  FILTER full_address IS NOT EMPTY;
  MERGE BY full_address;
};

// Use scope to apply a `HINT` to two joins:

HINT(join.broadcast) {
  JOIN WITH "state_lkp.csv" AS st ON state == st.name;
  JOIN WITH "price_lkp.csv" AS pr ON part_name == pr.product;
};