Tamr Documentation

v2020.001 Notes

Tamr release notes.

What's New

  • Do more with transformations. The new top() transformation function now supports a variable number of arguments and allows you to include a single array as one of the input arguments. It also supports array spreading (Spread). The previous version of top() is renamed to legacy.top() and is deprecated. For more information, see the Transformations section of this document.
  • Use smaller instances for cloud deployments of Tamr with externally-hosted Spark, HBase, and Elasticsearch. Due to improvements to memory allocation, Tamr cloud deployments can run on smaller instances when you rely on remote (external) services for the components Tamr depends on (Spark, HBase, Elasticsearch).
  • Use the documentation for versioned public APIs both inside and outside of the installed Tamr software. The versioned API documentation is now published alongside the other user documentation for Tamr. For more information, see Tamr APIs v.1.0.

Mastering

Improved the performance of the Projects page so that it loads faster. In addition, fixed an issue where the Projects page would break after an upgrade that introduced backwards-incompatible changes.

Golden Records

Tamr now sorts golden records numerically if the output type specified for the golden record's rule is numeric.

Transformations

The top() transformation function has new features, and the previous version of top() is renamed to legacy.top(). The following statements describe the changes:

  • The new top() function allows you to specify a variable number of arguments as input, for example: top(5, A, B, C, D, E). This is indicated by the presence of vararg in the function's input description; a vararg input means that the function can take a varying number of arguments. For comparison, the deprecated legacy.top() accepts only two arguments.
  • The new top() function allows you to include the contents of a single array, as in this example: top(5, ...myArray). Because the new top() supports vararg as an input type, you can use a feature of the Tamr transformations language known as "array spreading", or Spread. With array spreading, you can write top(5, ...myArray) instead of, for example, top(5, col1, col2, col3) to obtain the top five values from those three columns, where the columns have array-type values. Array spreading is supported for all Tamr transformation functions that allow vararg in their inputs, including the new top(). See the example after this list.
  • If your transformation scripts use the previous version of top() in this release, Tamr issues a linting error. To fix it, manually replace instances of top() with legacy.top() in your transformation scripts, or use the function-replacer option of transform-tools.jar. Run: java -jar ./tamr/libs/transform-tools.jar function-replacer --new=legacy.top --old=top --tamr-url=http://localhost:9100 -p

  • For reference, the documentation for the new top() and legacy.top() is as follows. You can also view this documentation in the Transformations module in the Tamr user interface: navigate to the Unified Dataset page, locate the Transformations side panel on the right, and select the view function docs link in the upper-right corner.

    • TOP computes the n most frequently occurring distinct values per group. This is an aggregation function that can be used in GROUP BY and WINDOW statements. Its syntax is top(n, input0, input1...), where n is a positive integer specifying the number of values to return per group. The remaining arguments are of type vararg: they can be of any data type and describe the attribute(s) on which to compute the top n values. The output of top() is an array.
    • LEGACY.TOP computes the n most frequently occurring distinct values per group. This function is DEPRECATED because it does not support input arguments of type vararg; use the current top() function instead. Note that the n parameter is the first parameter of the current top() function but the last parameter of legacy.top(). This is an aggregation function that can be used in GROUP BY and WINDOW statements. Its syntax is legacy.top(input, n), where input can be of any type and represents the attribute on which to compute the top n values, and n is a positive integer specifying the number of values to return per group. The output of legacy.top() is an array.
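
For illustration, the following is a minimal sketch of the new and deprecated syntax in a transformation script. The supplier_id, name, and phones attributes are hypothetical, and phones is assumed to be an array-type attribute:

    // New top(): n comes first, followed by a variable number of input attributes.
    SELECT supplier_id, top(3, name) AS frequent_names GROUP BY supplier_id;

    // Array spreading: ...phones expands the hypothetical array-type attribute into vararg inputs.
    SELECT supplier_id, top(5, ...phones) AS frequent_phones GROUP BY supplier_id;

    // Deprecated equivalent of the first statement: legacy.top() takes the input first and n last.
    SELECT supplier_id, legacy.top(name, 3) AS frequent_names GROUP BY supplier_id;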

Configuration, Deployment, and Lifecycle Changes

  • Tamr service health and validation checks are now part of the upgrade process and run by default before system upgrades. We recommend that you do not disable these checks when you upgrade Tamr. For more information, see Validation.
  • Configure JSON logging from Spark by using the TAMR_LOG_JSON_ENABLED parameter, which controls log output from Spark drivers and executors. JSON-formatted Spark logs include the job ID, job name, workspace configuration override name (or "default"), whether the log comes from the driver or an executor, the executor ID, and the Spark application's name and ID.
  • Configure the batch size for HBase by using the TAMR_HBASE_BATCH_SIZE parameter. The default batch size for HBase is now 1000. The previously hard-coded value of 10,000 caused timeout issues with ZooKeeper that could shut down the HBase server. Making the batch size configurable also makes it easier to adjust in production systems with different environments, such as GCP Bigtable. A configuration sketch follows this list.
  • Use smaller instances for Tamr cloud deployments when you rely on remote (external) services for the components Tamr depends on (Spark, HBase, Elasticsearch). As part of this improvement, Tamr improved the memory allocation calculations it makes when these dependent components run externally. In that case, the administrative utility in Tamr does not reserve memory for them and instead distributes that memory to the Tamr services running on the Tamr VM. This reduces the minimum memory requirements to run Tamr on a VM. This change applies to Tamr deployments on all cloud environments, regardless of the cloud infrastructure provider.
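
As a minimal sketch, you can set these parameters with the Tamr administrative utility. The installation path shown here is an assumption; adjust it for your environment:

    # Assumed path to the Tamr admin utility; adjust for your installation.
    # Enable JSON-formatted log output from Spark drivers and executors.
    ./tamr/utils/unify-admin.sh config:set TAMR_LOG_JSON_ENABLED=true

    # Set the HBase batch size explicitly (1000 is the new default).
    ./tamr/utils/unify-admin.sh config:set TAMR_HBASE_BATCH_SIZE=1000

    # Restart Tamr and its dependencies afterward so the new configuration takes effect.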

Fixed Support Issues

  • Fixed an issue so that records in golden records projects are sorted numerically when the output type specified for the golden record's rule is numeric. Previously, attributes with numeric types, such as "count" or "cluster size", in golden records projects were sorted alphanumerically instead of numerically. For example, the descending sort order would be "9", "2000", "10" instead of "2000", "10", "9".
  • Fixed an issue where exporting, profiling, or streaming a dataset failed with a deserialization error in Avro. The error occurred due to a schema mismatch that resulted from changing the record schema and then pushing dataset updates that aborted the schema change request.
  • Fixed an issue where IDF for LLM was not recomputed when new source data was added in the non-incremental processing mode (when TAMR_DEDUP_DISABLE_INCREMENTAL is set to true). IDF is now recomputed in this case.
  • Fixed an issue where an HBase node would shut down due to an issue with the HBase batch size. Note: this issue was previously fixed in custom patches for earlier versions; the fix is included in this release and is now available to all customers going forward.

Fixed Issues

  • Fixed a clustering issue with two API endpoints: POST /v1/projects/{project}/recordPairsWithPredictions/model:refresh, which trains the clustering model, and POST /v1/projects/{project}/recordPairsWithPredictions:refresh, which generates predictions. If you used these endpoints separately and the mastering (dedup) model had changed due to incremental dataset updates, some cluster predictions could be out of date. This issue could have affected you only if you ran these ML jobs programmatically through the versioned API endpoints; it did not affect running them from the user interface of a mastering (clustering) project.
  • Fixed an issue so that documentation for the public versioned APIs can be published externally, outside of the in-product API documentation.

Known Issues

The following known issues exist in this release.

  • The Status field (text and icon) on the Jobs page is not centered and the icon is truncated.
  • Mapped/Unmapped attribute filters do not work in any downstream project of a project with chained datasets after an upgrade to v.2019.023.1 or greater. If you encounter this issue, contact Tamr Support for information about a workaround (running an internal-only API request that calculates attribute mappings in this case).
  • Column resizing on the Users page does not behave as expected.
  • The schema mapping project does not show out-of-date status for projects.
  • The Unified Dataset page throws an error in the user interface when you are logged in as a reviewer.
  • Jobs submitted for chained projects may not appear immediately on the Jobs page after you choose Submit, and Submit is not disabled in this case. Tamr pre-processes dataset versions before submitting the jobs to Spark and does not currently account for this time on the Jobs page.
  • The job for updating results does not show its associated project on the Jobs page.
  • The upgrade process updates all record pair feedback to use unified record IDs instead of origin record IDs. This process runs automatically when upgrading, but depends on the Elasticsearch index for the unified dataset being up to date before you start the upgrade. If the index is not up to date when you upgrade to version v.2019.024 or greater, the migration has no effect: the pre-upgrade pair feedback is neither migrated nor deleted. As a workaround, index the unified dataset in Elasticsearch before you upgrade. After you upgrade, run the /api/dedup/pairs/feedback/migrate endpoint manually (see the sketch after this list), and then run the job that updates pairs for your project.
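
As a sketch of the post-upgrade workaround step, you could call the migration endpoint with curl. The HTTP method, credentials, and host below are assumptions; confirm the exact invocation with Tamr Support before running it against a production instance:

    # Assumed invocation: the POST method, basic auth, and the default host/port are guesses.
    curl -X POST -u <username>:<password> http://localhost:9100/api/dedup/pairs/feedback/migrate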

Geospatial support:

  • A pair showing a single point throws a JavaScript error when you click the link to open the details panel.
  • Cannot update golden records when a geospatial data type is present.
  • A warning displays in the browser console for any page or details panel that contains geospatial record types in mastering projects.
  • An error occurs when sorting by a geospatial type attribute.
  • Most frequent values do not display for a geospatial type field on a profiled attribute on the Schema Mapping page.
  • The schema mapping suggestion fails if a geospatial record type attribute is present.
  • Export fails on a dataset with geospatial data.

Upgrade

See the Upgrading page for instructions.
