Tamr Documentation

v2019.023.1 Notes

Tamr release notes.

What's New

  • Preview rules for golden records. To preview rules, select one or more entities with rules, edit rules, and then choose Preview Rule to see how rules for golden records will behave. Once you are satisfied with the rules, update golden records and publish them.
  • Use a DROP statement in transformations to omit unwanted attributes. See Drop.
  • Use MATH.TAN, MATH.ATAN, MATH.ATAN2, MATH.DEGREES, and MATH.RADIUS functions in the library of mathematical functions.
  • Use a new transformation statement, Sample.
  • Use a new directional Hausdorff similarity function for pair matching of geospatial records.
  • Back up and restore a Tamr deployment in which datasets are stored in Google Cloud Platform (GCP) BigTable. You can back up and restore to a local disk, HDFS, and GCS. See Backup Configuration.
  • Retain labels after an upgrade by running a script that disables automatic assignment of primary keys.
  • Start Tamr faster by running start-tamr.sh with improved performance.
  • Upgrade directly from v.0.40.1 to the current version of Tamr (without having to install intervening versions).
  • Be aware of these configuration changes:
    • Increased the services startup timeout to three minutes from one minute in previous releases.
    • Switched from the Oracle Java SDK to OpenJDK 8 in the Tamr installation package.
    • Added Tamr configuration properties to support configuring password-protected transport authentication context in SAML v2. For more information, see SAML Authentication.
    • Renamed the tools.jar to transform-tools.jar. It is packaged under tamr/libs in the Tamr installation and allows you to convert between legacy.hash() and hash().

Golden Records

In the golden records project, you can preview rules. To preview a rule, select the entities one by one or in bulk as bookmarks, edit a rule, and then choose Preview Rule. Once you are satisfied with the rule, select Update and then choose Publish Golden Records.

Transformations

  • Added a new transformation statement, Sample.
  • Added a new DROP statement in transformations. Use it to remove a column from the active dataset. Using DROP is more convenient than using SELECT. For example, to increase performance, use DROP to remove unused columns after running JOIN statements. If you drop an attribute in the unified dataset, it is then populated with nulls. See Drop.
  • Added MATH.TAN, MATH.ATAN, MATH.ATAN2, MATH.DEGREES, and MATH.RADIUS functions to the library of mathematical functions.
  • Added the USE HINT statement that applies a hint to the current transformation in the editor and to all of the subsequent transformations in that project. This statement might be useful to you if:
    • You have Tamr projects created before Tamr v.2019.014.1 that you are upgrading to this version (Tamr v.2019.021.0), and
    • You explicitly do not want Tamr to automatically manage your primary keys.

For example, to disable automatic primary key management by Tamr in a particular project, add: USE HINT(pkmanagement.manual); in the first transformation.

Note: The USE HINT statement is only useful if you have created projects before Tamr v.2019.014.1. In that version, automatic management of primary keys was introduced. If you have started using Tamr after v.2019.014.1, Tamr automatically manages primary keys. For information about primary keys, see Primary Key Management.

Mastering

  • Added sorting to record IDs in derived datasets (these datasets in mastering projects show statistics for clusters of records).
  • Added a new directional Hausdorff similarity function. The existing functions, Hausdorff distance and absolute Hausdorff, target matching geo-type objects. The new directional Hausdorff function supports matching contained objects. For example, you can use it to see whether a small section of a road matches the entire road, or whether a portion of a building matches the entire building. Directional Hausdorff is useful for pair generation and as an additional signal for pair matching. See Working with Geospatial Data.
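
The difference between a directional and a symmetric Hausdorff measure can be illustrated with a minimal Python sketch. It uses the textbook definition of the directed Hausdorff distance on lists of vertex points. It is not Tamr's implementation, and the names directed_hausdorff and symmetric_hausdorff, as well as the sample coordinates, are hypothetical. How Tamr converts such a distance into a similarity score is not covered here.

```python
import math


def directed_hausdorff(a_points, b_points):
    """Textbook directed Hausdorff distance from a_points to b_points.

    For every point in a_points, find the distance to its nearest point in
    b_points, then return the largest of those nearest-neighbor distances.
    """
    if not a_points or not b_points:
        raise ValueError("both point sets must be non-empty")
    return max(min(math.dist(a, b) for b in b_points) for a in a_points)


def symmetric_hausdorff(a_points, b_points):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(
        directed_hausdorff(a_points, b_points),
        directed_hausdorff(b_points, a_points),
    )


# Hypothetical data: a road sampled at five vertices and a short section of it.
road = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
section = [(1.0, 0.0), (2.0, 0.0)]

section_to_road = directed_hausdorff(section, road)  # section lies on the road
road_to_section = directed_hausdorff(road, section)  # road extends past the section
both_ways = symmetric_hausdorff(section, road)
```

In this sketch the section-to-road distance is zero because every vertex of the section also lies on the road, while the symmetric distance is dominated by the part of the road that the section does not cover. This asymmetry is why a directional measure suits containment-style matching.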

Configuration Changes

  • Increased the services startup timeout from one to three minutes.
  • Switched from the Oracle Java SDK to OpenJDK 8 in the Tamr installation package. As a result, the default value of TAMR_JAVA_HOME changed from <tamr-home-directory>/jdk1.8.0_111/ to <tamr-home-directory>/openjdk-8u222/.
  • Changed the lowest version from which you can upgrade to the current version of Tamr from v.0.37.0 to v.0.40.1. Previously, the lowest version from which you could upgrade directly to the current version without installing intervening versions was Tamr v.0.37.0. With this change, the lowest version that allows direct upgrades to the current version of Tamr (v.2019.023.1 and higher) is v.0.40.1. If you are upgrading from version 0.40.0 or earlier, upgrade to each individual version up to v0.40.1 and then upgrade directly to the current version.
  • Enforced the use of HBase for dataset storage when you upgrade Tamr to this version. If you have migrated to Tamr v.2019.019.0, then the upgrade script that runs during an upgrade takes care of moving your datasets to HBase-backed storage. For information, see HBASE.
  • Added the ability to back up and restore Tamr from Google Cloud Platform (GCP) BigTable. Previously, you could back up and restore only Tamr deployments backed by HBase. Beginning with this release, if your Tamr datasets use GCP BigTable instead of HBase, you can back up and restore them. The GCP BigTable backup in Tamr creates SequenceFiles. Note that the HBase backup creates HBase snapshot files. You can back up to a local filesystem, HDFS, or GCS. See Backup Configuration.
  • Fixed the administration utility, admin-unify.sh, so that it no longer requires quotes when unsetting a custom parameter to an empty value in tamr_config.yml. Previously, it required quotes for an empty string value, such as TAMR_JOB_SPARK_BIGQUERY_JAR: "". With the fix, to unset a custom parameter's value, you can omit quotes and leave the value empty (quotes are still allowed).
  • Reviewers can no longer cancel jobs. Previously, access controls for jobs allowed reviewers to cancel jobs.
  • Added a list of case-insensitive reserved words to unified attribute names in Schema Mapping. Tamr issues an error and prevents you from using the following reserved words for unified attribute names in Schema Mapping: origin_source_name, tamr_id, origin_entity_id, clusterId, originSourceId, originEntityId, sourceId, entityId, suggestedClusterId, verificationType, and verifiedClusterId.
  • Added a script that can apply the USE HINT (pkmanagement.manual) statement to every project that uses transformations. Starting with v2019.014.0, Tamr automatically assigns unique primary keys (tamr_id) if you have not assigned a tamr_id manually to your records. If the tamr_ids change, then labels could change in downstream projects. Until this release, you could manually add the statement USE HINT (pkmanagement.manual) to each of your transformation scripts to disable automatic assignment of primary keys. In this release, we added an option in the <unify-zip>/tamr/libs/transform-tools.jar script to automate the process of temporarily disabling primary key assignments after an upgrade. This option adds a HINT to the project's transformations. For information about primary keys, see Primary Key Management.
  • Added a warning strongly discouraging you from starting Tamr as a root user.
  • Improved performance of start-tamr.sh.
  • Added Tamr configuration properties that support configuring password-protected transport authentication context in SAML v2. For more information, see SAML Authentication.
  • Changed the defaults for the configuration parameters TAMR_FS_CONFIG_URIS and TAMR_FS_CONFIG_DIR to use the configuration found in TAMR_HADOOP_HOME.
  • These changes have the following implications:
    • Changes to YARN/Spark configuration, such as TAMR_YARN_NODE_MANAGER_PORT, no longer require setting TAMR_FS_CONFIG_URIS.
    • If Tamr is deployed in an on-premise Hadoop installation that has some other Hadoop configuration present on the filesystem, such as core-site.xml and/or yarn-site.xml, Tamr's own Spark configuration is not ignored. Tamr now uses the configuration specified in TAMR_HADOOP_HOME by default.
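
The direct-upgrade rule in the list above (the lowest version that supports a direct upgrade is v.0.40.1) can be sketched as a small version comparison. This is an illustrative Python sketch only. The helper names parse_version and can_upgrade_directly are hypothetical and are not part of any Tamr tooling.

```python
# Lowest installed version from which a direct upgrade is supported,
# taken from the release note above (v.0.40.1).
MIN_DIRECT_UPGRADE = (0, 40, 1)


def parse_version(text):
    """Turn strings such as 'v.0.40.1', 'v2019.019.0', or '0.40.0' into int tuples."""
    cleaned = text.strip().lower().lstrip("v").lstrip(".")
    return tuple(int(part) for part in cleaned.split("."))


def can_upgrade_directly(installed):
    """Return True if the installed version is at or above the direct-upgrade floor."""
    return parse_version(installed) >= MIN_DIRECT_UPGRADE


# Hypothetical installed versions to check.
for installed in ("v.0.37.0", "0.40.0", "v.0.40.1", "v2019.019.0"):
    path = "direct" if can_upgrade_directly(installed) else "stepwise to v.0.40.1 first"
    print(f"{installed}: {path}")
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "0.9.0" as newer than "0.40.1".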

Fixed Support Issues

  • Fixed an issue with ghost golden records. With this fix, if a cluster no longer exists in the input dataset, it does not contribute to a golden record.
  • Fixed an issue where you could start Tamr as a root user without any warning. With this fix, Tamr issues a warning discouraging you from starting Tamr as a root user.

Other Fixed Issues

  • Fixed a security issue that affected log files produced by restoring from a backup from a previous version.
  • Fixed a performance regression in Spark when running join operations by reducing the number of unnecessary row counts.
  • Fixed an issue where getRecordsByIds failed if a dataset had a top-level struct type field with a field named timestamp.
  • Fixed an issue where Tamr created duplicate columns during bootstrapping in Schema Mapping for these attributes in the clusters datasets: suggestedClusterId, verificationType, verifiedClusterId. This occurred if you were using the datasets from the clusters as your input datasets to a new mastering project and attempted to bootstrap the attributes. The issue occurred because the names collided with existing names. The fix for this issue introduces a list of reserved words in Tamr. You cannot use these reserved words for unified attributes in Schema Mapping. For a list of reserved words for unified attributes, see "Configuration Changes" on this page.
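
If you want to check candidate unified attribute names before mapping them, the case-insensitive reserved-word rule can be sketched as follows. This is a minimal Python sketch that uses the reserved words listed under "Configuration Changes". The function names is_reserved and find_reserved are hypothetical, and the sketch does not reproduce how Tamr itself performs the check or reports the error.

```python
# Reserved words for unified attribute names, as listed in these release notes.
RESERVED_WORDS = (
    "origin_source_name", "tamr_id", "origin_entity_id", "clusterId",
    "originSourceId", "originEntityId", "sourceId", "entityId",
    "suggestedClusterId", "verificationType", "verifiedClusterId",
)

# Lower-cased lookup set so that comparisons ignore case.
_RESERVED_LOWER = {word.lower() for word in RESERVED_WORDS}


def is_reserved(name):
    """Return True if name matches a reserved word, ignoring case."""
    return name.lower() in _RESERVED_LOWER


def find_reserved(names):
    """Return the candidate names that collide with a reserved word."""
    return [name for name in names if is_reserved(name)]


# Hypothetical candidate attribute names.
candidates = ["name", "SourceId", "verifiedclusterid", "cluster_name"]
conflicts = find_reserved(candidates)
```

Lower-casing both sides keeps the check consistent with the case-insensitive behavior described in the notes, so variants such as "SourceId" and "sourceid" are both flagged.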

Known Issues

The following known issues exist in this release.

  • Column resizing on the Users page does not behave as expected.
  • Job submission for chained projects may not appear immediately on the Jobs page after choosing Submit. Submit is not disabled in this case. Pre-processing of dataset versions takes place before Tamr submits the jobs to Spark and Tamr is not currently accounting for this time on the Jobs page.
  • The job for Updating results is not showing the project it is associated with on the Jobs page.
  • Geospatial support:
    • Cannot update golden records with a geospatial data type present.
    • A warning displays in the browser console for any page or details panel that contains geospatial record types in mastering projects.
    • An error occurs when attempting to sort by a geospatial type attribute.
    • Most frequent values are not showing up for a geospatial type field on a profiled attribute on the Schema Mapping page.
    • The schema mapping suggestion fails if a geospatial record type attribute is present.
    • Export is failing on a dataset with geospatial data.

Upgrade

See the upgrading page for instructions.
