Previous Tamr Documentation

v2019.019 Notes

Tamr release notes.

What's New

  • Added two transformation functions that compute the maximum and minimum value of an attribute or expression based on the bit size of the elements contained in a group. See "Transformations" in this document.
  • ElasticSearch is upgraded to version 6.8.2.
  • HBase is required. Beginning with this release, HBase is required when you deploy Tamr, because deployments with datasets backed by a local filesystem are no longer supported. During the transition period, you could turn off HBase and continue using a local filesystem for dataset storage. Beginning with this release and in future releases, we strongly recommend that you enable HBase when you upgrade. For upgrade instructions, see the HBase section in this document.
  • SSL support with Postgres. You can connect to an external Postgres instance via SSL.
  • Tamr Python client v.0.9.0 is released.
  • Access control improvements.
  • Improvements in preferred sources in golden records.

Major Improvements

ElasticSearch 6.8.2 Upgrade

This release upgrades the ElasticSearch instance that Tamr uses internally to version 6.8.2. The process is similar to the v2019.002 Tamr upgrade, which also upgraded ElasticSearch. Keep the following in mind when you upgrade:

  • The upgrade takes longer due to reindexing.
  • If you are upgrading from a much earlier version, we recommend that you create a backup, upgrade first to v0.52.0 and make sure your deployment is working properly, then create another backup and upgrade to v2019.019.0.
  • If you have stale projects (projects that you no longer need or that are in a broken state), we recommend that you bring these projects up-to-date or delete them before running an upgrade to Tamr v2019.019.0. The upgrade scripts may fail when trying to start reindexing jobs for projects in these states.
  • This note applies to you only if you used published clusters before upgrading to version 2019.019. The upgrade of ElasticSearch materializes datasets (this is also known as reindexing). In version 2019.017, the schema for internal derived datasets in published clusters changed. These internal derived datasets are not automatically indexed as part of the ElasticSearch upgrade. Therefore, if you haven't republished clusters after an upgrade to Tamr version 2019.017, upgrading to version 2019.019 may lead to errors with downstream jobs that rely on the changed attributes in the published cluster datasets. To avoid these errors, rerun the publish clusters API or use Publish after you upgrade.

HBase is Required

Beginning with this release v.2019.019, HBase is required when you deploy Tamr. The following statements describe this requirement:

  • Tamr initially adopted HBase as the default storage mechanism for datasets in version 2018.044 in 2018. Until the current release, version 2019.019, you could continue deploying Tamr with datasets stored in a local filesystem, or run Tamr with locally stored datasets in tandem with datasets backed by HBase.
  • To achieve further performance improvements and allow scaling out Tamr deployments, you must migrate datasets to HBase. This has been required as of version 2019.011 in order to avoid known bugs and instance failures. In this release, we repeat this requirement in preparation for enforcing it during upgrades.
  • The summary is:
    • Enable HBase.
    • After you upgrade, at any point of using Tamr v.2019.019, migrate datasets and prepare for the enforcement of HBase in future releases.

To enable HBase:

  1. Upgrade to Tamr version 2019.019. In some cases, such as when you are upgrading from versions prior to v.2019.08 to versions 2019.08 - 2019.020, the upgrade script may fail with an "HBase is not enabled" error. In this case, follow step 2 in this procedure to enable HBase, and rerun the upgrade.
  2. Set the TAMR_HBASE_ENABLED configuration variable to true using the unify-admin.sh admin tool. This also requires restarting Tamr and its dependencies. For information about using unify-admin.sh, see Configuring Tamr.
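As a rough illustration of step 2, the following Python sketch assembles an admin tool command line without running it. The `config:set` subcommand appears elsewhere in these notes (with `--file`); the `KEY=value` argument form and the tool path shown here are assumptions, so check Configuring Tamr for the exact syntax before using it.

```python
def config_set_command(admin_tool, settings):
    """Build an argument list for `<admin tool> config:set KEY=value ...`.

    The KEY=value form is an assumption for illustration; only
    `config:set --file <path>` is shown in these release notes.
    """
    args = [admin_tool, "config:set"]
    for key, value in settings.items():
        # Render Python booleans as the lowercase strings used in config files.
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.append(f"{key}={value}")
    return args


# Hypothetical path to the admin tool; adjust for your installation.
command = config_set_command("./unify-admin.sh", {"TAMR_HBASE_ENABLED": True})
print(" ".join(command))
```

The list form can be passed to `subprocess.run` if you choose to execute it; remember that Tamr and its dependencies still need a restart afterwards.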

To migrate existing datasets:

  1. Verify that the TAMR_HBASE_ENABLED configuration variable is set to true in the unify-admin.sh admin tool. See Configuring Tamr.
  2. Convert existing datasets to HBase using the /datasets/moveDatasetsToStorageDriver API. For example, use this URL: http://<your-tamr-host>:9100/api/dataset/datasets/moveDatasetsToStorageDriver?destinationDriver=hbase or this command: curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' 'http://<your-tamr-host>:9100/api/dataset/datasets/moveDatasetsToStorageDriver?destinationDriver=hbase'. The call starts a job that registers all datasets with HBase. While the job is running, it appears on the Jobs page. There is a known issue where the job's details obscure the X sign for closing the job window; you may need to zoom out of that screen to close it. To see which datasets are stored in HBase, use /datasets/<name>/status.
  3. Set the TAMR_HBASE_AS_PRIMARY_STORAGE configuration variable to true using the unify-admin.sh admin tool. This ensures that going forward, all datasets are stored in HBase by default. See Configuring Tamr.
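The migration call in step 2 can also be prepared from Python. This is a minimal sketch using only the standard library; it mirrors the curl command above and builds the request without sending it. The host name is a placeholder, and any authentication your deployment requires is not shown.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_move_request(host, port=9100, destination_driver="hbase"):
    """Build the POST request that starts the dataset migration job."""
    query = urlencode({"destinationDriver": destination_driver})
    url = (
        f"http://{host}:{port}/api/dataset/datasets/"
        f"moveDatasetsToStorageDriver?{query}"
    )
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return Request(url, method="POST", headers=headers)


# "tamr.example.com" is a placeholder for <your-tamr-host>.
request = build_move_request("tamr.example.com")
print(request.get_method(), request.full_url)
```

To actually start the job, pass the request to `urllib.request.urlopen`, then follow its progress on the Jobs page.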

Golden Records

  • Fixed an issue where in golden records, filters with Min and Max value conditions did not ignore empty values. Note: other rules and filters behave correctly and ignore empty values.
  • Fixed an issue with "ghost" datasets that weren't showing in all filters. "Ghost" input datasets are those that are no longer present in your project but were present previously. The fix ensures that all of the dataset filters in golden records list such ghost datasets. Previously, only some filters behaved correctly. You can also now delete such input datasets from the golden records rules.
  • You can apply the golden record rules for longest and shortest value to attributes with types other than String. Previously, the rules for longest and shortest value accepted only String array input columns.
  • Improved the behavior of publishing golden records. When you publish golden records in this release, you also update the rules at the same time. This is reflected in the user interface: Publish is now renamed to Update and Publish. Previously, Publish would take your saved rules.
  • Fixed an issue in golden records where input datasets that were once added, then removed and added again weren't marked as New.
  • Fixed an issue with rule overrides not behaving correctly. Previously, adding or deleting an override changed the number of overrides for all other attributes to zero.

Access Control

Continued improvements and bug fixes to ensure the proper behavior of policies and access controls for users in Tamr. In particular, made the following changes:

  • You can run operations on user groups, such as deleting them, in the user interface. Previously, you could run some of the operations only in the APIs.
  • Renamed Permissions navigation menu to Policies for consistency.
  • Fixed an issue where the Projects page would not load unless the user had the "read all datasets" permissions.

Transformations

  • Added two new aggregation functions, max_size() and min_size(), that compute the maximum and minimum value of an attribute or expression based on the bit size of the elements contained in a group. The functions take any type of attribute (including arrays) and produce the value that has the maximum or minimum "bitsize value" among all of the input values considered. In these functions, the "bitsize value" is a measure that reflects how much information, in bits, is represented by a data value. For Null values, the "bitsize value" is always zero bits; for Boolean values, it is always 1 bit; for Strings, it is the number of characters in the String; for Integer and Floating point values, it is 32 bits; and for Long and Double values, it is 64 bits. For example, 1.00 has a larger "bitsize value" than 1.0. For attributes of type Array, the shortest and the longest values are calculated as the sum of the element bit sizes and the precise bit size of the array's length. Note that empty values are filtered out. For examples, see max_size and min_size.
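To make the "bitsize value" rules above concrete, here is an illustrative Python model. It is not Tamr's implementation. It assumes that a Python int stands in for a 32-bit Integer, that a Python float stands in for a 64-bit Double, and that "the precise bit size of the array's length" means the number of bits needed to write the length. The lowercase helper names are hypothetical.

```python
def bitsize(value):
    """Approximate the "bitsize value" described in the release note."""
    if value is None:
        return 0
    if isinstance(value, bool):  # must precede int: bool is an int subclass
        return 1
    if isinstance(value, str):
        return len(value)  # number of characters
    if isinstance(value, int):
        return 32  # assumption: treated as a 32-bit Integer
    if isinstance(value, float):
        return 64  # assumption: treated as a 64-bit Double
    if isinstance(value, (list, tuple)):
        # Sum of element sizes plus the bits needed for the array length (assumed).
        return sum(bitsize(v) for v in value) + len(value).bit_length()
    raise TypeError(f"unsupported type: {type(value).__name__}")


def _is_empty(value):
    return value is None or (isinstance(value, (str, list, tuple)) and len(value) == 0)


def max_size(values):
    """Return the non-empty value with the largest bitsize, or None."""
    candidates = [v for v in values if not _is_empty(v)]
    return max(candidates, key=bitsize) if candidates else None


def min_size(values):
    """Return the non-empty value with the smallest bitsize, or None."""
    candidates = [v for v in values if not _is_empty(v)]
    return min(candidates, key=bitsize) if candidates else None


group = ["1.0", "1.00", None]
print(max_size(group), min_size(group))
```

In this model, the string "1.00" wins over "1.0" because it has more characters, which matches the example in the note.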

APIs

  • The PUT operation in the /v1/projects/id API allows changing the name of the project. The project name in the versioned API corresponds to the “display name”.
  • The POST and PUT operations in the /v1/datasets/id API allow adding tags at dataset creation time and updating them.
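The following Python sketch shows how such requests might be assembled; it builds them without sending anything. The base URL, the resource IDs, and the "name" and "tags" field names are assumptions for illustration. A real PUT may require the full resource body, so consult the API reference before relying on these payloads.

```python
import json
from urllib.request import Request

# Assumed base URL for the versioned API; replace with your deployment's value.
BASE_URL = "http://tamr.example.com:9100/api/versioned"


def json_request(method, path, payload):
    """Build a JSON request for the given HTTP method and API path."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(BASE_URL + path, data=body, method=method, headers=headers)


# Hypothetical payloads: rename project 1 and update tags on dataset 1.
rename = json_request("PUT", "/v1/projects/1", {"name": "Customer Mastering"})
retag = json_request("PUT", "/v1/datasets/1", {"tags": ["crm", "2019"]})
print(rename.get_method(), rename.full_url)
print(retag.get_method(), retag.full_url)
```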

Configuration

  • Added a property TAMR_YARN_NODE_MANAGER_HOST to configure a custom name for the Yarn NodeManager hostname, yarn.nodemanager.hostname. See YARN Cluster Manager Jobs.
  • You can configure a custom YARN NodeManager UI port (instead of the default port) by setting TAMR_YARN_NODE_MANAGER_PORT using the Tamr administrative tool, unify-admin.sh.
  • Added a --dry-run parameter for setting values to Tamr. See Admin Tool Command Reference.
  • You can connect to an external Postgres instance via SSL. The database instance URL now allows specifying the following string to the JDBC driver: jdbc:postgresql://<host>:<port>/<database>?ssl=true. See Postgres.
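As a small illustration of the SSL-enabled JDBC URL format above, this Python helper fills in the placeholders. The function name and the example host, port, and database values are hypothetical.

```python
def postgres_jdbc_url(host, port, database, ssl=True):
    """Format a Postgres JDBC URL, appending ?ssl=true when SSL is requested."""
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    return (url + "?ssl=true") if ssl else url


# Example values only; substitute your own external Postgres instance.
print(postgres_jdbc_url("db.example.com", 5432, "tamr"))
```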

Other Improvements

Made the following improvements:

  • ElasticSearch improvements. Upgraded ElasticSearch from 6.4.2 to 6.8.2. Because of reindexing, the upgrade to this version takes longer. The ElasticSearch upgrade improves the performance of mastering projects and establishes a more stable ElasticSearch snapshot/restore process with HDFS. ElasticSearch version 6.8.2 is also compatible with Google Cloud Platform requirements, and is required for enabling ongoing cloud provider integrations.
  • Performance improvements. Dataset and Dataset Catalog pages load faster.
    Improved performance of Tamr mastering workflows by tuning Spark joins configuration.
  • User interface improvements. Renamed Upload CSV to Upload File to account for the fact that Tamr allows adding datasets other than CSV. Fixed the behavior of the About screen.
  • Fixed an issue with upgrades where reindexing of the cluster locks failed if an empty project existed. This was a known issue in v.2019.016.
  • Fixed an issue where previewing a dataset did not work in the project's Datasets tab, after uploading and adding a dataset to a project. This was a known issue in Tamr version 2019.017.
  • Added logging about database migrations to log files and stdout.
  • Fixed an issue with the Tamr administrative tool that did not allow specifying relative paths when setting configuration from a file, such as: unify-admin.sh config:set --file <path-to-file>.
  • The datasets page now loads faster.
  • Fixed an issue where longitude and latitude were not showing for polygons in the user interface.
  • Fixed an issue where Tamr was previously making calls to async-io.org on startup. In previous releases, the messaging framework that Tamr uses internally made these calls.
  • User interface improvements:
    • The Mastering user interface now indicates when there are no labels. Previously, the confusion matrix was empty in this case.
    • The list of comments automatically scrolls to the top of the list to show the newest comments first when you add them in Categorization, Pairs, and Clusters pages.
    • The user interface reflects that the "Loading datasets" process is in progress when you load datasets on the Dataset Catalog page.
    • The message indicating that the dataset is out-of-date in the navigation bar is now located to the side of menu controls and not in between them.

Support Tickets

Fixed the following support tickets:

  • Fixed an issue where a job for updating the unified dataset was stalled.
  • Fixed an issue where pair assignments disappeared and human-assigned labels became "verified by API" after regenerating pairs. The issue was found in version 2019.012 and fixed in version 2019.019.
  • Fixed an issue where authorization did not work for users with mixed-case usernames.
  • Fixed an issue where Tamr would not restart due to an inconsistent YARN jobs state. This issue affects Tamr versions 2019.014 - 2019.016 and is fixed in v.2019.017. Contact Tamr Support to obtain a patch for affected versions, or upgrade to version 2019.017 or greater.
  • Fixed an issue where connecting to an external Postgres instance via a JDBC driver did not allow configuring secure access via SSL. The fix enables specifying SSL in the database instance URL for the JDBC driver as follows: jdbc:postgresql://<host>:<port>/<database>?ssl=true.
  • Fixed an issue where jobs could not start, or running jobs caused a "Waiting for resources" error due to issues with port contention for the YARN node manager. The fix enables you to specify your own port numbers in YARN. For more information, see YARN Cluster Manager Jobs.

Known Issues

The following known issues exist in this release.

  • Internet Explorer v.11 support. In Schema Mapping, the drop-down menu for updating an out-of-date project does not display correctly. Other user interface issues are observable in IE v.11.
  • Jobs management. The job for Updating results is not showing the project it is associated with on the Jobs page.
  • Geospatial support.
    • Cannot update golden records with a geospatial data type present.
    • A warning displays in the browser console for any page or details panel that contains the geospatial record types in mastering projects.
    • Error when attempting sorting by a geospatial type attribute.
    • Most frequent values are not showing up for a geospatial type field on a profiled attribute on the Schema Mapping page.
    • The schema mapping suggestion fails if a geospatial record type attribute is present.
    • Export is failing on a dataset with geospatial data.

Upgrade

See the upgrading page for instructions.

Updated about a year ago

