
Tamr Release Notes

Tamr Release Notes are now available from this page, starting with v2019.019.

These release notes describe new features, improvements, and corrected issues in each Tamr release.

Tamr Releases

Important Information for All Releases

Upgrading Tamr to a New Release

Follow the upgrade instructions in the Tamr documentation for the version to which you are upgrading. If the version to which you are upgrading has patch releases, Tamr strongly recommends installing the latest patch to ensure you have the latest fixes and security enhancements.

Depending on the version from which you are upgrading, you may need to install required checkpoint releases as part of your upgrade path.

Required Checkpoint Releases

When you upgrade Tamr, you must upgrade to each of the intervening checkpoint versions before upgrading to a later version.

The following Tamr releases are checkpoint releases:

  • v2020.016.0
  • v2020.003
  • v2021.002.0

The upgrade utility prevents you from upgrading past a checkpoint version. For example, the following upgrade paths are allowed: v2020.017 -> v2021.002, v2021.001 -> v2021.002. These upgrade paths are prevented: v2020.016 -> v2021.003, v2019.019 -> v2021.002.
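
The checkpoint rule above can be sketched as a small check. This is an illustrative sketch only, not the upgrade utility's actual logic: the function names are hypothetical, and it assumes that comparing the year and release number is enough (patch numbers are ignored).

```python
# Checkpoint releases listed above, as (year, release) pairs.
CHECKPOINTS = [(2020, 3), (2020, 16), (2021, 2)]


def parse_version(version):
    """Turn 'v2021.002' or 'v2021.002.0' into (2021, 2); the patch number is ignored."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def skipped_checkpoints(source, target):
    """Return the checkpoints that lie strictly between source and target."""
    src, dst = parse_version(source), parse_version(target)
    return [c for c in CHECKPOINTS if src < c < dst]


def is_allowed(source, target):
    """An upgrade path is allowed when it does not jump past any checkpoint."""
    return not skipped_checkpoints(source, target)


for src, dst in [("v2020.017", "v2021.002"), ("v2020.016", "v2021.003")]:
    print(src, "->", dst, "allowed" if is_allowed(src, dst) else "prevented")
```

Under these assumptions, the four example paths above come out as allowed, allowed, prevented, and prevented, in that order.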

Installing Patch Releases

For every software release version that has a patch, Tamr strongly recommends that you upgrade to the most recent patch. To install the most recent patch, follow the upgrade instructions and supply the version number of the patch as the version for your upgrade.

If you want to upgrade directly to a patched version, specify the --skipCheckpointReleaseValidation flag when installing Tamr. For information, see Upgrading.

Known Issues

To view current known issues, please consult the Tamr knowledge base.


Tamr 2021 Releases

v2021.019.0 Release Notes

New Features and Improvements

The following new features are included in this release.

  • Browser tab naming for easier browser tab navigation.
  • Add project name as page header in Golden Records > Rules page.
  • Resolve warning messages when running start-dependencies.sh or unify-admin.sh.
  • Security enhancements for user lists.
  • Improved consistency for user API access by role.

Fixed Support Issues

This release corrects the following errors.

  • High-impact pairs don't show up in UI after running train/predict with new feedback. Affects versions: v2021.015.0. Fix versions: v2021.019.0.
  • Add project name as page header in Golden Records > Rules page. Affects versions: . Fix versions: v2021.019.0.
  • Memory allocated to Tamr dependencies and micro-services is greater than available memory in the machine. Affects versions: v2020.016.4. Fix versions: v2021.019.0.
  • Categorization dashboard now available for reviewer role. Affects versions: v2021.006.0. Fix versions: v2021.019.0.
  • Improve base memory and validator calculations. Affects versions: v2021.010.0. Fix versions: v2021.019.0.



v2021.018.0 Release Notes

New Features and Improvements

The following new features are included in this release.

  • DMS now creates source data as array types for datasets that are newly created from CSV files. Existing datasets created by DMS continue to have string types, and data continues to be appended to them as strings.
  • A new TAMR_JOB_SPARK_NAME_TEMPLATE Zookeeper variable is now appended to job names in Spark, to allow for customization of Spark job names.

Fixed Support Issues

This release corrects the following errors.

  • Fixed versioned GET taxonomy API issues for projects with taxonomies uploaded via UI before v2021.016. Affects versions: v2021.002.2, v2021.006.0. Fix versions: v2021.018.0.
  • Fixed Tamr UI performance issue. Affects versions: v2020.012.0. Fix versions: v2021.018.0, v2021.002.3, v2020.016.5.
  • Fixed Bulk Match API intermittently failing on AWS with "AmazonS3Exception Slow Down" error. Affects versions: v2020.020.1. Fix versions: v2021.018.0, v2020.020.3.
  • Fixed issue with reliable pair estimate at scale on AWS due to "AmazonS3Exception Slow Down" error. Affects versions: v2020.020.2. Fix versions: v2021.018.0, v2020.020.3.



v2021.017.0 Release Notes

New Features and Improvements

The following new features are included in this release.

  • Optimize pair generation jobs when using unweighted tokenizations.

Validation improvements:

  • Validate that the vm.max_map_count (which specifies the maximum number of memory map areas) is at least the required value of 262144.
  • Run ulimit validator as part of preupgrade validation.
  • For the ulimit validator, check the open file ulimit against a maximum of 1000000, instead of 66000.
  • Run certain validators, including ulimits, every time Tamr is started.
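
To check the vm.max_map_count requirement by hand before an upgrade, a sketch like the following may help. It is not Tamr's validator: the function names are hypothetical, and it assumes a Linux host that exposes the setting at /proc/sys/vm/max_map_count.

```python
from pathlib import Path

# Required value stated in the validation notes above.
REQUIRED_MAX_MAP_COUNT = 262144


def max_map_count_ok(current, required=REQUIRED_MAX_MAP_COUNT):
    """Return True when the current setting meets the required minimum."""
    return current >= required


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the current vm.max_map_count value (Linux only)."""
    return int(Path(path).read_text().strip())


# Example usage on a Linux host:
#   value = read_max_map_count()
#   print(value, "OK" if max_map_count_ok(value) else "too low")
```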

Fixed Support Issues

This release corrects the following errors.

  • Improve behavior of Tamr when dataset migration takes a long time during upgrade. Affects versions: v2020.014.0. Fix versions: v2021.017.0.
  • Ingesting a file via DMS fails when values include unescaped commas. Affects versions: v2021.006.1. Fix versions: v2021.017.0.
  • Parquet files greater than 2GB created by DMS cannot be read. Affects versions: v2021.006.0. Fix versions: v2021.017.0.
  • Input transformation JSON lacks input dataset information. Affects versions: v2020.024.1. Fix versions: v2021.017.0.
  • Use validation scripts and add them to start-dependencies and start-unify by default, not just upgrade. Affects versions: . Fix versions: v2021.017.0.
  • Project Movement failing to materialize transformations. Affects versions: v2021.010.2. Fix versions: v2021.017.0.
  • UX improvements to reduce confusion when uploading dataset via DMS. Affects versions: v2021.010.2. Fix versions: v2021.017.0.
  • Selecting multiple input datasets in Project transformations does not update backend. Affects versions: . Fix versions: v2021.017.0.



v2021.016.0 Release Notes

What's New

Persistent Cluster ID Improvements

Tamr now assigns persistent cluster IDs to records in mastering projects the first time you run the "Apply feedback and update results" or "Update results only" job. Cluster IDs are re-assigned and updated to reflect your feedback each time you run the Review and publish clusters job. Previously, persistent IDs were not assigned to clusters on creation.

New Verifier Guide to Support Verifier Role

A Verifier Guide is available to support users in the Verifier role. Verifiers are subject matter experts who use Tamr to provide input to mastering and categorization workflows and manage review assignments. Verifiers can complete all tasks that Reviewers complete. Additionally, similar to Curators, Verifiers act as coordinators in mastering and categorization projects by assigning tasks and verifying pairs and clusters or categorizations. To support this new role, topics in the Curator Guide that apply to both Verifiers and Curators have been moved to the Verifier Guide.

New Topics Related to Deploying Tamr on AWS

Two new topics are available in the System Administrator Guide related to deploying Tamr on AWS:

Fixed Support Issues

This release corrects the following errors.

  • Ingesting via DMS UI does not expose correct attributes in CSV. Affects versions: v2021.010.2. Fix versions: v2021.016.0.
  • Using the DMS UI browser for ADLS crashes DMS. Affects versions: v2021.010.2. Fix versions: v2021.016.0.
  • TAMR_JOB_SPARK_CONFIG_OVERRIDES not getting deserialized, preventing Tamr from starting. Affects versions: v2021.010.0. Fix versions: v2021.016.0.
  • Failed DMS jobs show state as running indefinitely, preventing further DMS jobs. Affects versions: v2021.006.0. Fix versions: v2021.016.0.
  • Spark overrides should not need to be fully specified. Affects versions: v2021.003.0. Fix versions: v2021.016.0.
  • Versioned get taxonomy endpoint does not work. Affects versions: v2021.002.0, v2021.006.0. Fix versions: v2021.016.0.



v2021.015.0 Release Notes

These release notes list what's new in this release, corrected issues, and known issues.

New Features and Improvements

The following new features are included in this release.

  • New API endpoint for LLM to get last update
    (GET /api/v1/projects/{projectName}/lastLlmUupdate).
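
A request to this endpoint might look like the sketch below. The endpoint path is copied verbatim from the note above. Everything else is an assumption: the base URL, the helper names, and the idea that the caller supplies a ready-made Authorization header value.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Endpoint path exactly as given in the release note.
ENDPOINT = "/api/v1/projects/{projectName}/lastLlmUupdate"


def last_update_url(base_url, project_name):
    """Build the request URL, percent-encoding the project name."""
    path = ENDPOINT.replace("{projectName}", quote(project_name, safe=""))
    return base_url.rstrip("/") + path


def fetch_last_update(base_url, project_name, auth_header):
    """Send the GET request and return the raw response body as text."""
    request = Request(
        last_update_url(base_url, project_name),
        headers={"Authorization": auth_header},
        method="GET",
    )
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Hypothetical host and project name, for illustration only.
print(last_update_url("http://tamr.example.com:9100/", "my project"))
```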

Fixed Support Issues

This release corrects the following errors.

  • Project Movement failIfNotPresent flag too sensitive to use. Affects versions: v2021.010.2. Fix versions: v2021.015.0.
  • Cannot filter to attributes mapped to unified attributes. Affects versions: v2021.010.0, v2021.011.0. Fix versions: v2021.015.0.



v2021.014.0 Release Notes

What's New

Verifier Role

This release includes a new Verifier role. This role is best suited for subject matter experts who will use Tamr only to provide input to your workflow and assign review tasks to other users. Verifiers cannot perform any actions that affect the underlying model.

This role allows subject matter experts to perform the following actions in mastering projects:

  • View unified data
  • Create and manage pair and cluster review assignments
  • Label pairs and clusters
  • Verify pairs and clusters

In categorization projects, Verifiers can assign and verify categories.

If users require permission to configure and run the underlying model, grant them Curator role access instead.

For more information about Verifier role permissions, see the Permissions Matrix by User Role.

New Features and Improvements

  • The system is set to read-only state when using the Project Movement API to import or export projects. This change ensures that the project and its data are protected during the operation. When the operation is complete, the system returns to read-write, and you can then perform actions and run jobs.
  • Additional configuration options are available for the Databricks client library features.

Fixed Support Issues

This release corrects the following errors.

  • The backup process did not clear the tmp directory, causing the database connection to fail. Affects versions: v2021.010.0. Fix versions: v2021.014.0.
  • Due to an issue with the cluster metrics dataset, the “Estimate cluster metrics” option was unavailable and predict clusters jobs were cancelled. Affects versions: v2021.010.0. Fix versions: v2021.014.0, v2021.010.2.
  • When running unify-admin.sh validate, erroneous warnings were generated for missing optional files. Affects versions: v2021.010.0. Fix versions: v2021.014.0.
  • Upgrade failed due to inability to connect to dataset service. Affects versions: v2021.010.0. Fix versions: v2021.014.0, v2021.010.2.
  • When using relative Hausdorff distance, you cannot get to the geo shape overlay on the pairs page. Affects versions: v2019.017.0, v2020.024.1. Fix versions: v2021.014.0.
  • Datasets appear to have a smaller number of records than expected in deployments that use HBase. Affects versions: v2021.010.0. Fix versions: v2021.014.0, v2021.010.2.
  • Using geometry fields with PreGroupBy caused the generate pairs job to fail. Affects versions: v2021.009.0. Fix versions: v2021.014.0.
  • Project Import fails due to changes to Unified Attributes causing invalid transformations. Affects versions: v2021.010.1. Fix versions: v2021.014.0.
  • Match service should throw an error if the project does not exist. Affects versions: v2021.006.0. Fix versions: v2021.014.0.



v2021.012.1 Patch Release Notes

The patch version addresses the following issues:

  • For single-node deployments, the unify-data directory was not included in the backup which could potentially cause dataset exports to fail.
  • Restored mastering workflows failed due to the Tamr restore process failing to pull in a needed piece of information from the backup file.
  • Restore failed if the Low-Latency Match (LLM) service automatically polled for updates during restore. The LLM service no longer automatically polls when the system is in read-only mode.

v2021.012.0 Release Notes

New Features and Improvements

The following new features are included in this release.

  • Enable changing the order of columns for clusters in Golden Records (GR) projects.

Fixed Support Issues

This release corrects the following errors.

  • Cannot change order of columns for clusters in GR project. Affects versions: v2020.017.0. Fix versions: v2021.012.0.
  • Investigation to why dozens of hbase_configNNNN... folders in /tmp folder. Affects versions: v2021.002.0. Fix versions: v2021.012.0.
  • Adding a return character at the end of the license key doesn't break Installation but breaks Upgrade. Affects versions: . Fix versions: v2021.012.0.
  • Enable change order of columns for clusters in GR project. Affects versions: . Fix versions: v2021.012.0.



v2021.011.0 Release Notes

New Features and Improvements

The following new feature is included in this release.

In Tamr mastering projects, the pairs page now automatically shows pairs with Tamr suggestions that have a medium (M) confidence level, as well as suggestions with high (H) and low (L) confidence levels. This change is the result of a new, lower default value for the TAMR_PAIR_CONFIDENCE_THRESHOLD_MEDIUM configuration variable. For more information, see Configuring Tamr.

Fixed Support Issues

This release corrects the following errors.

  • Add ability to group on nulls/empties for specific aggregation fields in pregroupby. Affects versions: v2020.020.0. Fix versions: v2021.011.0.
  • Medium-confidence pairs should be possible. Affects versions: . Fix versions: v2021.011.0.



v2021.010.2 Patch Release Notes

The patch version addresses the following issues:

  • For single-node deployments, the unify-data directory was not included in the backup which could potentially cause dataset exports to fail.
  • Restored mastering workflows failed due to the Tamr restore process failing to pull in a needed piece of information from the backup file.
  • Restore failed if the Low-Latency Match (LLM) service automatically polled for updates during restore. The LLM service no longer automatically polls when the system is in read-only mode.
  • Upgrade failed due to inability to connect to the dataset service.
  • Ingestion of CSV files from ADLS Gen2 failed when using the Data Movement Service (DMS).
  • Unable to select a primary key when adding a dataset from Azure cloud storage via DMS in the Tamr UI, because the Primary Key dropdown menu was not populated.
  • Datasets appear to have a smaller number of records than expected in deployments that use HBase.
  • An issue with the cluster metrics dataset which caused the “Estimate cluster metrics” option to be disabled and predict clusters jobs to be cancelled.

v2021.010.1 Patch Release Notes

This patch corrects a timeout error that occurred when deleting datasets with interdependencies on derived datasets, including published clusters. Due to the inability to delete datasets, users were not able to delete related Mastering projects after publishing clusters.

Additionally, this patch corrects a related issue in which the DELETE API returned the following timeout error: "com.tamr.platform.tasq.TaskException: Timed out waiting for tasks to finish".

v2021.010.0 Release Notes

New Features and Improvements

The following new features and improvements are included in this release.

  • For Clusters, when a metrics estimation job completes, the "Estimate" option automatically updates to the "View cluster metrics" option.
  • For Data Movement Service, improved append to dataset functionality when uploading a dataset.
  • For Data Movement Service, support for uploading Parquet files with complex schema.

Fixed Support Issues

This release corrects the following errors.

  • Expose highImpactThreshold as a configurable variable in CategorizationInfo recipe. Affects versions: v2021.002.1, v2021.002.2. Fix versions: v2021.010.0.
  • Tag filter not working in adding datasets to project page. Affects versions: v2021.006.0. Fix versions: v2021.010.0.
  • UI-uploaded filenames are not encoded as UTF-8 Strings. Affects versions: v2021.005.0. Fix versions: v2021.010.0.
  • Unable to see 'View cluster metrics' button after upgrading to v2020.020.0. Affects versions: v2020.020.0. Fix versions: v2021.010.0.
  • From support: Expose highImpactThreshold as a configurable variable in CategorizationInfo recipe. Affects versions: . Fix versions: v2021.010.0, v2021.002.2.
  • Classic-22 View Metrics appears without user clicking to another page. Affects versions: . Fix versions: v2021.010.0.
  • SUP-5175 Tag filter not working in adding datasets to project page. Affects versions: v2021.006.0. Fix versions: v2021.010.0.
  • SUP-5181 (Update copy) API docs - LLM. Affects versions: v2021.006.0. Fix versions: v2021.010.0.
  • SUP-5129 UI-uploaded filenames are not encoded as UTF-8 Strings. Affects versions: v2021.005.0. Fix versions: v2021.010.0.



v2021.009.1 Patch Release Notes

The patch version addresses the following issues:

  • For single-node deployments, the unify-data directory was not included in the backup which could potentially cause dataset exports to fail.
  • Restored mastering workflows failed due to the Tamr restore process failing to pull in a needed piece of information from the backup file.
  • Restore failed if the Low-Latency Match (LLM) service automatically polled for updates during restore. The LLM service no longer automatically polls when the system is in read-only mode.

v2021.009.0 Release Notes

New Features and Improvements

The following new features are included in this release.

  • Make HBASE Peak/OffPeak windows configurable.
  • SUP-4847 Implement conditionality for 'View cluster metrics'.

Fixed Support Issues

This release corrects the following errors.

  • AWS EMR Ephemeral Spark cluster instance groups not being named correctly. Affects versions: v2021.008.0. Fix versions: v2021.009.0.
  • Make HBASE Peak/OffPeak windows configurable. Affects versions: . Fix versions: v2021.009.0.
  • SUP-4879 LLM not working during backup. Affects versions: . Fix versions: v2021.009.0.
  • SUP-4847 Implement conditionality for 'View cluster metrics'. Affects versions: . Fix versions: v2021.009.0.



v2021.008.0 Release Notes

New Features and Improvements

The following new features are included in this release.

  • The dropdown list for attributes on the blocking model page of a mastering project now provides a tooltip on mouseover with the full attribute name. Previously, the list was too narrow to show long attribute names.
  • New Tamr configuration variable for setting AMI in RunJobFlowRequest.

Fixed Support Issues

This release corrects the following errors.

  • Default value for TAMR_BIGQUERY_ENABLED gives errors in dataset.log when not using BigQuery. Affects versions: v2021.006.0.
  • Show full attribute name on blocking model page. Affects versions: v2019.023.1.
  • Unified Attribute side of schema mapping does not show correct number of source attributes. Affects versions: v0.39.0, v2021.001.0.



v2021.007.0 Release Notes

What's New

This release includes:

  • We now support overriding the following Databricks-specific parameters using TAMR_JOB_SPARK_CONFIG_OVERRIDES:
    • minWorkers - Maps to TAMR_JOB_DATABRICKS_MIN_WORKERS
    • maxWorkers - Maps to TAMR_JOB_DATABRICKS_MAX_WORKERS
    • databricksNodeType - Maps to TAMR_JOB_DATABRICKS_NODE_TYPE

These are members of the sparkDeploymentConfig map.

An example of overriding only these values can be found below (with required property name included):

TAMR_JOB_SPARK_CONFIG_OVERRIDES: "[{
  name: databricksOverrides,
  sparkDeploymentConfig: {
    minWorkers: 5,
    maxWorkers: 6,
    databricksNodeType: Standard_DS4_v2
  }
}]"
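
An override value of this shape can also be generated programmatically, which avoids quoting and bracket mistakes. The sketch below assumes the variable accepts a JSON array string wrapped in single quotes, the style used in the v2021.003.0 example on this page. The helper name build_overrides is hypothetical.

```python
import json


def build_overrides(name, min_workers, max_workers, node_type):
    """Return a JSON array string holding one Databricks override entry."""
    override = {
        "name": name,  # required property name
        "sparkDeploymentConfig": {
            "minWorkers": min_workers,
            "maxWorkers": max_workers,
            "databricksNodeType": node_type,
        },
    }
    return json.dumps([override])


value = build_overrides("databricksOverrides", 5, 6, "Standard_DS4_v2")
# Print the line in the single-quoted style of the v2021.003.0 example.
print(f"TAMR_JOB_SPARK_CONFIG_OVERRIDES: '{value}'")
```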

New Features and Improvements

The following new features are included in this release.

  • Support spark overrides for Databricks cluster specifications.

Fixed Support Issues

This release corrects the following errors.

  • Unable to apply feedback and update classification results, receiving error java.lang.OutOfMemoryError: Java heap space. Affects versions: v2021.004.0. Fix versions: v2021.007.0.



v2021.006.3 Patch Release Notes

This patch corrects an issue for single-node deployments in which the unify-data directory was not included in the backup, which could potentially cause dataset exports to fail.

v2021.006.2 Patch Release Notes

This patch addresses two issues:

  • The restore service was failing to pull in a needed piece of information from the backup file, causing restored mastering workflows to fail.
  • Automatic polling by the Low-Latency Match (LLM) service during restore caused restore to fail. The LLM service no longer automatically polls when the system is in read-only mode.

v2021.006.1 Patch Release Notes

This patch corrects an issue that affected AWS backup and restore.

v2021.006.0 Release Notes

What's New

This release includes:

  • As long as TAMR_HBASE_REMOTE_DOWNLOAD_ENABLED is set to false, the filesystem connection info will not be in the job spec.
  • Security improvements

Fixed Support Issues

This release corrects the following errors.

  • Get all datasets API failed when searching for a deleted dataset. Affects versions: v2020.024.1. Fix versions: v2021.006.0.
  • Tamr job status never updates for Terminated Databricks cluster on Azure. Affects versions: v2021.002.1. Fix versions: v2021.006.0.
  • Pages in schema mapping load slowly. Affects versions: v2020.012.0, v2020.016.3. Fix versions: v2021.006.0.
  • Job status doesn't update from Azure Databricks, possibly due to ADLS Gen 1. Affects versions: v2020.026.0. Fix versions: v2021.006.0.
  • Project steps dialogue in UI does not reflect the updates that have been done via API. Affects versions: v2020.015.0. Fix versions: v2021.006.0.



v2021.005.0 Release Notes

What's New

This release includes the following new features.

Project Movement
The new Tamr project movement feature can be used to create, update, or back up project artifacts within and across instances. Use the Tamr project movement API to export projects and then, optionally, import them into existing or distinct new projects.

To learn more about project movement, see:

Data Movement Service (DMS)
You can now import and export data files between Tamr and your cloud storage with the new Data Movement Service (DMS).

To learn more about DMS, see the following:

Important Notes for DMS

  • The current version of DMS supports API interaction through command-line utilities, including cURL, only.
  • DMS does not support Parquet files that include arrays with nulls.
  • Appending uploaded data to an existing dataset:
    • When appending uploaded data with multiple threads (and multiple files) to an existing dataset, the original data is overwritten by the uploaded data and no longer appears in the dataset. This issue is fixed in release v2021.010.0.
    • When appending an uploaded dataset to an existing dataset, if the new dataset does not include all of the columns in the original dataset, the schema of the existing dataset is changed to have only the columns included in the new dataset. This issue is fixed in release v2021.010.0; the schema no longer changes and omitted columns have null values in their respective cells.
  • DMS jobs:
    • DMS jobs are not persisted; upon restart, previous and in progress DMS jobs are no longer listed.
    • For DMS jobs, the job ID is a GUID created by DMS and uses a different format than the numeric job IDs created by Tamr.
    • For failed DMS jobs, tmp/.tmp files created during the upload process are not deleted as expected, and can consume a large amount of disk space. For successful DMS jobs, the tmp/.tmp files are deleted.
    • DMS job status does not appear immediately and the progress bar is not in sync with the job status.
    • For successfully completed DMS jobs, the status is completed, instead of succeeded, which is reported for other Tamr jobs.
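
The corrected append behavior described in the notes above (fixed in v2021.010.0) can be pictured with a small sketch: the existing schema is kept, and columns missing from the uploaded rows are filled with nulls. This is an illustration with plain Python dictionaries, not DMS code, and the function name append_rows is hypothetical.

```python
def append_rows(schema, existing_rows, new_rows):
    """Append new_rows to existing_rows without changing the schema.

    Columns that a new row omits are filled with None (null).
    """
    aligned = [{column: row.get(column) for column in schema} for row in new_rows]
    return existing_rows + aligned


schema = ["id", "name", "city"]
existing = [{"id": 1, "name": "Acme", "city": "Boston"}]
uploaded = [{"id": 2, "name": "Globex"}]  # no "city" column in the upload

result = append_rows(schema, existing, uploaded)
print(result)
```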

New Features and Improvements

The following new features and improvements are included in this release.

  • Create versioned APIs for project movement.
  • Show the number of blocks per record when estimating the stats of the blocking model.
  • Add a configuration parameter for Elasticsearch to avoid "too_long_frame_exception" with the reason "An HTTP line is larger than 4096 bytes".

Fixed Support Issues

This release corrects the following errors.

  • ADLS Gen 1 credentials exposed in multiple logs for HBaseSiteConnectionHandler. Affects versions: v2020.026.0. Fix versions: v2021.005.0, v2021.002.1.
  • Give more information to user about pair estimate complexity. Affects versions: v2020.020.2. Fix versions: v2021.005.0.
  • Error loading similar entities when clicking on clusters. Affects versions: v2020.023.0. Fix versions: v2021.005.0.
  • Need configuration parameter for Elasticsearch to avoid "too_long_frame_exception" with the reason "An HTTP line is larger than 4096 bytes". Affects versions: v2020.012.0, v2021.001.0. Fix versions: v2021.005.0.
  • Show the number of blocks per record when estimating the stats of the blocking model. Affects versions: . Fix versions: v2021.005.0.
  • Error loading similar entities when clicking on clusters. Affects versions: . Fix versions: v2021.005.0.



v2021.004.0 Release Notes

Fixed Support Issues

This release corrects the following errors.

  • Tamr often fails to provide error messages for job failures on AWS scale out. Affects versions: v2020.020.2. Fix versions: v2021.004.0.
  • Publish clusters job initially fails without an error message, and succeeds after resubmission. Affects versions: v2020.020.1. Fix versions: v2021.004.0.
  • Clusters records job initially fails with "TreeNodeException", and succeeds after resubmission. Affects versions: v2020.020.2. Fix versions: v2021.004.0.
  • Publish clusters job initially fails with "NullPointerException", and succeeds after resubmission. Affects versions: v2020.020.2. Fix versions: v2021.004.0.
  • "Generate SM suggestions" button not clickable after model import. Affects versions: v2020.012.0. Fix versions: v2021.004.0.



v2021.003.0 Release Notes

What's New

This release includes:

Spark config overrides are changing for ephemeral EMR Spark.

Only two fields within the sparkDeploymentConfig map can now be overridden: clusterNamePrefix and runJobFlowRequest. Their values correspond to what you would set for the following Tamr configuration variables:

  • clusterNamePrefix > TAMR_DATASET_EMR_CLUSTER_NAME_PREFIX
  • runJobFlowRequest > TAMR_DATASET_EMR_RUN_JOB_FLOW_REQUEST

TAMR_JOB_SPARK_CONFIG_OVERRIDES: '[{"name": "adjustedInstanceCount", "sparkDeploymentConfig": {"clusterNamePrefix":"", "runJobFlowRequest": "..."} }]'

Fixed Support Issues

This release corrects the following errors.

  • CSV export download does not work on AWS scale-out. Affects versions: v2020.013.0. Fix versions: v2021.003.0.



v2021.002.3 Patch Release Notes

This patch release improves Tamr UI performance.

v2021.002.2 Patch Release Notes

This patch release provides a fix for large scale categorization projects.

v2021.002.1 Patch Release Notes

This patch release provides a security improvement.

v2021.002.0 Release Notes

Tamr v2021.002.0 is a checkpoint release. For information about how checkpoint releases affect Tamr upgrades, see Upgrading Tamr.

What's New

This release includes:

  • For categorization projects, you can now upload and re-use a taxonomy file in multiple projects without requiring a unique name in each project. Tamr now generates a new dataset for the taxonomy in each project, (unified_dataset_name)_categories, which you can view and export from the Dataset Catalog page.
  • The workflow for categorization projects has also changed. Now, you must create the unified dataset for the project before you upload the taxonomy file.

Fixed Support Issues

This release corrects the following errors.

  • Enrichment returns empty results. Affects versions: v2021.001.0. Fix versions: v2021.002.0.
  • WriteLockException on _unified_dataset_dedup_suggested_clusters_log table after concurrent record verification actions. Affects versions: v2020.020.2, v2020.024.1. Fix versions: v2021.002.0.
  • case statement not defaulting to null when ELSE is not specified. Affects versions: v2020.020.0. Fix versions: v2021.002.0.



v2021.001.0 Release Notes

This release contains minor updates that improve the experience of using Tamr.



Tamr 2020 Releases

v2020.026.0 Release Notes

What's New

For cloud-native deployments of Tamr on Azure, this release adds support for ADLS Gen2. It also supports using service principals instead of storage account keys with ADLS Gen2. The following Tamr configuration variables are now available, with the link provided for the Microsoft Azure documentation that describes how to obtain the value for each one.

See the Configuration Variable Reference.

This release removes the TAMR_ADLS_GEN2_KEY configuration variable.

This release also adds:



v2020.025.0 Release Notes

This release contains minor updates that improve the experience of using Tamr.

v2020.024.2 Patch Release Notes

This patch release corrects an issue with restore remaining in “running” state longer than expected after upgrade.



v2020.024.1 Patch Release Notes

This patch release corrects the following issues.

  • Failure to create the snapshot diff component no longer causes the entire planning to fail.
  • When using Configure table on the Clusters page in a mastering project, changes to the visibility and positioning of the Cluster, Dataset, origin_entity_id, and tamr_id columns are now saved and applied as expected.

For cloud-native deployments of Tamr on Azure, this patch release also allows for the use of service principals instead of storage account keys with ADLS Gen2. The following Tamr configuration variables are now available, with links to the Microsoft Azure documentation that describes how to get the value for each one.

See the Configuration Variable Reference.

This patch also removes the TAMR_ADLS_GEN2_KEY configuration variable.

v2020.024.0 Release Notes

Fixed Support Issues

This release corrects the following errors.

  • Tamr instance unusably slow to load pages. Affects versions: v2020.016.2. Fix versions: v2020.024.0, v2020.020.2.
  • Bulk match API returns no records, but matches are found and written to file system (AWS). Affects versions: v2020.013.0, v2020.020.0. Fix versions: v2020.024.0, v2020.020.1.
  • get.projects() broken on v2020.021.0 with enrichment projects. Affects versions: v2020.021.0. Fix versions: v2020.024.0.
  • Generating pairs processing time increased from 5 min to 5 hrs. Affects versions: v2020.021.0. Fix versions: v2020.023.0, v2020.024.0, v2020.020.1.



v2020.023.1 Patch Release Notes

This patch corrects the issue:

  • UI Issue: Adding new blocking model clauses with no tokenizer option is throwing a token weighting error. The API is unaffected.

v2020.023.0 Release Notes

New Features and Improvements

The following new features are included in this release.

  • Add support for localCheckpoint. When you use a CHECKPOINT statement, you now have the option to include a HINT modifier to specify checkpoint.reliable or checkpoint.local as the Spark store behavior. See Checkpoint and Statement Modifiers.
  • Option to not use IDF weighting when computing similarity scores. See Tokenizers and Similarity Functions.

Fixed Support Issues

This release corrects the following error.

  • Spark instance type override is not picked up and used when submitting jobs. Affects versions: v2020.018.0. Fix versions: v2020.023.0.



v2020.022.0 Release Notes

New Features and Improvements

The following new features are included in this release.

  • Min distance between shapes as fully supported geospatial metric.
  • Relative area overlap similarity function and binning.
  • Show User Defined Signal output as columns in dedup record pairs table. See user-defined signals.

Fixed Support Issues

This release corrects the following errors.

  • gis.centroid transformation function should not double-count start/end point of a polygon boundary in calculation. Affects versions: v2020.004.1. Fix versions: v2020.022.0.
  • Convex overlap for polygons as a non-DNF comparator function. Affects versions: v2020.015.0. Fix versions: v2020.022.0.
  • Min distance between shapes as fully supported geospatial metric. Affects versions: v2020.015.0. Fix versions: v2020.022.0.

Back to top


v2020.021.0 Release Notes

New Features and Improvements

The following new features are included in this release.

Fixed Support Issues

This release corrects the following errors.

  • Update the UI to accommodate the ability to infer pairs from cluster feedback. Affects versions: v2020.005.0. Fix versions: v2020.021.0.

Back to top


v2020.020.4 Patch Release Notes

This patch corrects the following issues.

  • Datasets appear to have fewer records than expected in deployments that use HBase.

v2020.020.3 Patch Release Notes

This patch corrects the following issues.

  • Bulk match API intermittently failing on AWS with "AmazonS3Exception Slow Down"
  • Cannot get reliable pair estimate at scale on AWS due to SlowDown Exception

v2020.020.2 Patch Release Notes

This patch corrects the following issues.

  • TAMR_JOB_SPARK_CONFIG_OVERRIDES not getting picked up correctly
  • Tamr instance unusably slow to load pages
  • RecordMatchService using paths instead of full URIs
  • WINDOW transformation GC explodes at scale

v2020.020.1 Patch Release Notes

This patch corrects the following issues.

  • Upgrade from 2020.004.1 to 2020.016.3 succeeded, but Materialize Unified Dataset jobs fail.
  • Spark task times highly skewed when reading HBase on EMR.

v2020.020.0 Release Notes

What's New

Tamr now supports upload of files in the Parquet file format from an external HDFS cluster, S3, or GCS. For more information, see Uploading a Dataset into Tamr.

New Features and Improvements

The following new features are included in this release.

  • Create a separate endpoint to get records for dedup service.
  • match.log is too chatty.
  • BigQuery: get datasets and tables to be sorted.
  • Allow pulling HBase configuration files from the Tamr data dir.
  • UI improvements when connecting external sources.

Fixed Support Issues

This release corrects the following errors.

  • Configure Table button missing on Unified Dataset Preview. Affects versions: v2020.018.0. Fix versions: v2020.020.0.
  • Add dataset UI does not scroll down when using advanced CSV options. Affects versions: v2020.018.0. Fix versions: v2020.020.0.
  • LLM no longer works on GCP Native with BigTable. Affects versions: v2020.016.3. Fix versions: v2020.020.0.
  • Transformation does not show up after deleting a unified attribute. Affects versions: v2020.008.0, v2020.012.0. Fix versions: v2020.020.0.
  • Upload New Dataset Modal Broken. Affects versions: v2020.017.0, v2020.016.1. Fix versions: v2020.020.0.
  • Support for Parquet external files in UI. Affects versions: v2020.017.0. Fix versions: v2020.020.0.

Back to top


v2020.019.0 Release Notes

These release notes list what's new in this release, corrected issues, and known issues.

What's New

In this release, you can:

  • Review precision and recall metrics for clusters on the Clusters page of your mastering projects. In addition to the computed percentages, a trend graph with confidence intervals shows changes in the accuracy of your clusters over time. Unlike the in-sample metrics computed for record pairs, cluster metrics are computed using a test set of records. See Precision and Recall Metrics for Clusters.
  • Open cluster metrics (after they have been computed for your project) by clicking View Cluster Accuracy in the dialog box for the confusion matrix and in-sample metrics on the Pairs page. See Viewing In-Sample Pair Metrics.
  • Review documentation for the transformation-tools.jar system administration utility.

New Features and Improvements

The following new features are included in this release.

  • Optimize incremental clustering.
  • Optimize incremental pair generation.
  • Support AES256 server-side encryption for S3 external storage providers.
  • Support AES256 Server Side Encryption for S3.
  • Improve connection checking in tasq.
  • Maintain lineage information for pair labels and allow filtering by lineage type.
  • Pairwise Accuracy UI Changes.
  • Cluster Accuracy Modal.
  • CloudSQL Backup.
  • Improvements to function docs example table formatting.

Fixed Support Issues

This release corrects the following errors.

  • External storage provider linked dataset not getting ingested. Fix versions: v2020.019.0.
  • Upload 400 error says to check logs, but error isn't in the logs. Fix versions: v2020.019.0.
  • Documentation now available for transformation-tools.jar. Fix versions: v2020.019.0.
  • Transform docs with long examples hard to interpret. Fix versions: v2020.019.0.

Back to top


v2020.18.0 Release Notes

These release notes list what's new in this release, corrected issues, and known issues.

What's New

In this release these changes were made:

  • Deployment on Microsoft Azure. See Deploying Tamr on Azure.
  • Test clusters produce cluster precision and recall. See Filtering Clusters.
  • A predefined list of datetime formats is now supplied for the datetime_to_iso and date_and_time_to_iso functions. These formats are applied after any formats specified in your transformations. See Working with Dates.

Fixed Support Issues

The following support issues were fixed in this release.

  • Golden Records: Error using top: The 1st argument to function top must be a constant literal expression. Affects versions: v2020.011.0. Fix versions: v2020.018.0.
  • Bulk unmap button in Schema Mapping does not work. Affects versions: v2020.012.0, v2020.014.0, v2020.016.0. Fix versions: v2020.017.0, v2020.018.0, v2020.016.1.
  • Golden Record Page keeps freezing. Affects versions: v2020.017.0. Fix versions: v2020.018.0.
  • UI doesn't show average confidence. Affects versions: v2019.023.1, v2020.016.1, v2020.016.2. Fix versions: v2020.018.0, v2020.016.3.
  • "Open details" button is not clickable on taxonomy page. Affects versions: v2020.015.0. Fix versions: v2020.018.0.
  • Train predict failure. Affects versions: v2020.016.0. Fix versions: v2020.018.0, v2020.016.2.
  • Jobs failing possibly due to incorrect path setting with spark 2.4. Affects versions: v2020.016.0. Fix versions: v2020.018.0, v2020.016.2.
  • Some links that open the Tamr UI in a new tab cause the original tab to become unresponsive on closing the new tab.
  • Mapped/unmapped filters on unified attributes is broken.
  • Average confidences for categorizations not showing for systems upgraded from v37.1.

Back to top


v2020.17.0 Release Notes

What's New

In this release these changes were made:

  • You can view the cluster from the golden records page.
  • The process of restoring from backup is improved and retains the PostgreSQL configuration from the pre-upgrade release.

Fixed Support Issues

The following support issues were fixed in this release.

  • Fixed an issue where unmapping an attribute in the schema mapping project does not work. Affects versions: v2020.012.0, v2020.014.0, v2020.016.0. Fix versions: v2020.017.0, v2020.018.0, v2020.016.1.
  • Provided ability to view the clusters from the golden records page. Affects versions: v2019.021.0. Fix versions: v2020.017.0.
  • Fixed an issue where Generate pairs and Open exclusions were not displayed correctly in the user interface. Affects versions: v2020.015.0. Fix versions: v2020.017.0.
  • Fixed an issue where Accept cluster suggestion resulted in the "Cannot read property 'name' of undefined" error.
  • Fixed an issue where the Elasticsearch configuration parameter TAMR_ES_MAX_RESULT_WINDOW was not applied to new projects and required manual workarounds. Affects versions: v2019.009.0. Fix versions: v2020.017.0.
  • Fixed an HTTP 500 error issued by Elasticsearch if no clusters were generated. Affects versions: v2020.016.0. Fix versions: v2020.017.0, v2020.016.1.
  • Fixed an issue where an upgrade to Tamr v2020.16 resulted in errors looking for non-existent clusters. Affects versions: v2020.016.0. Fix versions: v2020.017.0, v2020.016.1.
  • Fixed an issue where the bulk match service didn't work on an AWS scaled out deployment. Affects versions: v2020.007.1. Fix versions: v2020.017.0.

Back to top


v2020.016.6 Patch Release Notes

This patch introduces a new configuration variable, TAMR_MAX_EDGES_PER_PARTITION, which allows you to tune clustering performance.

v2020.016.5 Patch Release Notes

This patch improves Tamr UI performance.

v2020.016.4 Patch Release Notes

This patch corrects an issue in which an upgrade from 2020.004.1 to 2020.016.3 succeeded, but Materialize Unified Dataset jobs failed.

v2020.16.3 Patch Release Notes

This patch contains the following fixed bugs and support issues.

  • Cannot run jobs - GCP Native - NoSuchMethodError.
  • Profiling jobs stuck in 'waiting for results' despite no other jobs running or failed.
  • UI doesn't show average confidence.

v2020.16.2 Patch Release Notes

This patch contains the following fixed bugs and support issues.

  • Train predict failure.
  • Jobs failing possibly due to incorrect path setting with spark 2.4.

v2020.16.1 Patch Release Notes

This patch contains the following fixed bugs and support issues.

  • Fixed an issue where, if no clusters had been generated yet for a mastering project, the project issued an error in the user interface.
  • Fixed an issue where the unmapped/mapped filters for unified attributes were not working.
  • Fixed an issue where bulk un-mapping of attributes in Schema Mapping was not working.

v2020.016.0 Release Notes

What's New

In this release, the following notable changes took place:

  • Tamr v2020.16 is a checkpoint release. If you are upgrading from an earlier version, you must first upgrade to Tamr v2020.16 before upgrading to a later version. The upgrade utility prevents you from upgrading past Tamr v2020.16 without first upgrading directly to Tamr v2020.16. For example, these upgrade paths are prevented: v2020.03 -> v2020.07, v2020.15 -> v2020.17. The following upgrade paths are allowed: v2020.03 -> v2020.15, v2020.03 -> v2020.16, v2020.15 -> v2020.16.
  • After you upgrade to Tamr v2020.016, if you write new LOOKUP statements with non-equality join conditions, or remove the hint which the upgrade process added to existing LOOKUP statements of this type, Tamr assigns primary keys to resulting records automatically. For LOOKUP statements with non-equality join conditions that existed before you upgrade, the upgrade process disables automatic assignment of primary keys. This change during the upgrade process (disabling automatic assignment of primary keys) does not affect LOOKUP statements with equality joins. Additionally, you can choose to disable automatic assignment of primary keys altogether. For more information, see Lookup, Primary Key Management, and Upgrading Tamr.
  • Passwords for Tamr non-administrative users must contain a minimum of 8 and a maximum of 64 characters. Passwords for newly created admin users also have this requirement. When you create such users or make changes to existing users in Tamr, these password requirements are enforced and Tamr issues an error if they are not met. For more information, see Creating a User.
  • The user interface for editing rules for golden records now allows you to save rules that are invalid. This is also known as a "forced save". This helps you save rules midway, while you continue working on them. It also helps in cases where you might need to force a deletion of one rule that refers to a non-existing attribute in an upstream dataset, so that you can then delete another rule that refers to another non-existing attribute. Previously, you could only fix such issues with internal APIs for golden records. 
  • Made changes to the user interface of the Clusters page. The records summary header now shows up directly above the table that lists records. Additionally, you can use a new Accept suggestion option directly from the record details side panel to add a record to a specific cluster that Tamr suggests. Tamr offers the ID of this new suggested cluster. Previously, if Tamr suggested to move a record to a new cluster, using Move to new did not allow moving it in one step. See Curating Clusters.
  • Added documentation for Tokenizers and Similarity Functions. The documentation now uses the industry term Blocking Model instead of the Binning Model, which was the previously used term.
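
The checkpoint rule described in the first item above can be illustrated with a short Python sketch. This is not the upgrade utility's code: the function names, the version parser, and the rule "no checkpoint may lie strictly between the current and target versions" are assumptions made for illustration, and only the v2020.16 checkpoint is modeled.

```python
import re

# Assumption: only the v2020.16 checkpoint named in these notes is modeled here.
CHECKPOINTS = [(2020, 16)]


def parse_version(text):
    """Turn 'v2020.016.0' or 'v2020.16' into a comparable (year, release) tuple.

    The patch number, if present, is ignored in this sketch.
    """
    match = re.fullmatch(r"v?(\d{4})\.(\d+)(?:\.\d+)?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    return int(match.group(1)), int(match.group(2))


def skipped_checkpoints(current, target, checkpoints=CHECKPOINTS):
    """Return the checkpoints that lie strictly between current and target."""
    start, end = parse_version(current), parse_version(target)
    return [cp for cp in checkpoints if start < cp < end]


def is_upgrade_allowed(current, target, checkpoints=CHECKPOINTS):
    """Hypothetical check: an upgrade is allowed if it skips no checkpoint."""
    return not skipped_checkpoints(current, target, checkpoints)


print(is_upgrade_allowed("v2020.15", "v2020.17"))  # False: skips v2020.16
print(is_upgrade_allowed("v2020.15", "v2020.16"))  # True: lands on the checkpoint
```

Under this assumed rule, upgrading onto the checkpoint itself is always permitted, and any later upgrade starts from the checkpoint.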

Fixed Issues

  • Added a script that disables primary key management for existing LOOKUP statements with non-equality join conditions. The script runs automatically during upgrades and issues a report of the changes it made.
  • Enabled automatic primary key management for new LOOKUP statements with non-equality join conditions. For more information, see LOOKUP.
  • Fixed an issue where the unify-admin.sh utility allowed input of settings with incorrect syntax and this broke Zookeeper. Affects versions: v2020.012.0. Fix versions: v2020.016.0.
  • Fixed an issue where you could not delete a rule in golden records on an attribute that no longer existed if there were other attributes that no longer existed. Affects versions: v2019.023.2. Fix versions: v2020.016.0.

Back to top


v2020.15.0 Release Notes

What's New

In this release, the following notable changes were made:

  • Upgraded the version of Spark used by Tamr to Spark 2.4.5. Starting with this release, Tamr uses Spark 2.4.5 instead of Spark 2.2.0 that it used in previous releases. The upgrade to Spark 2.4.5 takes place automatically as you upgrade to this release. For more information, see Upgrading Tamr.
  • Added a new aggregation function, histogram, to the list of Tamr transformation functions. It computes the histogram for the top-n most frequent values per group, sorting values in descending order of frequency. Use the histogram function together with WINDOW and GROUP statements. It supports vararg (array flattening) and complex types.
  • Warn users when project datasets are about to be removed in the dataset catalog.
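
To make the behavior of the histogram function described above more concrete, here is a minimal Python sketch of a per-group, top-n frequency count. It only illustrates the idea; it is not Tamr's transformation syntax or implementation, and the function name, row layout, and output shape are assumptions.

```python
from collections import Counter, defaultdict


def top_n_histogram(rows, group_key, value_key, n):
    """For each group, count values and keep the n most frequent, most frequent first."""
    counters = defaultdict(Counter)
    for row in rows:
        counters[row[group_key]][row[value_key]] += 1
    return {group: counter.most_common(n) for group, counter in counters.items()}


# Hypothetical sample data.
rows = [
    {"supplier": "acme", "city": "Boston"},
    {"supplier": "acme", "city": "Boston"},
    {"supplier": "acme", "city": "Austin"},
    {"supplier": "acme", "city": "Denver"},
    {"supplier": "acme", "city": "Austin"},
    {"supplier": "acme", "city": "Boston"},
    {"supplier": "globex", "city": "Miami"},
]

print(top_n_histogram(rows, "supplier", "city", 2))
# {'acme': [('Boston', 3), ('Austin', 2)], 'globex': [('Miami', 1)]}
```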

Fixed Issues

  • Fixed an issue where de-selecting a dataset from the dataset catalog did not alert the users that the dataset was about to be removed. Affects versions: v2019.023.1. Fix versions: v2020.015.0.
  • Fixed an issue where page information on the Datasets window was not extending properly. Affects versions: v2020.009.0. Fix versions: v2020.015.0.
  • Fixed an issue with rules in golden records where deleting a newly created, unsaved golden records rule deleted a random other rule.
  • Fixed an issue where HBase and ZooKeeper communication failed, putting the HBase server on the "failed servers list", rendering the Tamr instance broken.
  • Fixed an issue where changing an input dataset schema broke previewing of golden records.
  • Fixed an issue where deleting an attribute in an upstream dataset broke the job for updating golden records.

Back to top


v2020.14.0 Release Notes

What's New

In this release, the following notable changes were made:

  • Released version 0.12.0 of the Python tamr-client. For more information, see tamr-client 0.12.0 and Tamr Client documentation.
  • API changes. Added a new parameter, expectedVersion, to the POST /datasets/{name}/update endpoint in the dataset service, to allow consistent dataset updates.
  • Usability and design improvements.
  • Observe in the tooltip that profiling value counts are estimates, when examining results of profiling a dataset.
  • Use the Rules tab, when working with the golden records project as a curator.
  • Performance and configuration improvements. Take advantage of improved HBase performance when processing Tamr jobs.
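
The expectedVersion parameter mentioned above suggests an optimistic concurrency check: an update is applied only if the dataset is still at the version the caller last saw. The sketch below models that general idea with a hypothetical in-memory store. The class, the integer version counter, and the error type are assumptions; these notes do not describe the endpoint's request or response format.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class DatasetStore:
    """Hypothetical in-memory stand-in for a dataset service."""

    def __init__(self):
        self._records = {}
        self._versions = {}

    def version(self, name):
        # Assumption: versions are integers starting at 0 for a new dataset.
        return self._versions.get(name, 0)

    def update(self, name, records, expected_version=None):
        """Apply an update only if the dataset is at the expected version."""
        current = self.version(name)
        if expected_version is not None and expected_version != current:
            raise VersionConflict(
                f"{name}: expected version {expected_version}, found {current}"
            )
        self._records.setdefault(name, []).extend(records)
        self._versions[name] = current + 1
        return self._versions[name]


store = DatasetStore()
seen = store.version("customers")                         # 0
store.update("customers", [{"id": 1}], expected_version=seen)
try:
    # A second writer that still holds the old version is rejected.
    store.update("customers", [{"id": 2}], expected_version=seen)
except VersionConflict as error:
    print("rejected:", error)
```

A rejected caller would typically re-read the dataset, reconcile its changes, and retry with the new version.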

Fixed Issues

  • Fixed an issue where the "read only" permissions on /tmp prevented Tamr dependencies from starting. Affects versions: v2020.008.1. Fix versions: v2020.014.0.
  • Fixed an issue where Tamr instance was broken after deleting an input dataset. Affects versions: v2020.008.1. Fix versions: v2020.014.0.
  • Fixed an issue where deleting the unified dataset caused Tamr to throw a null pointer exception. Affects versions: v2020.008.0. Fix versions: v2020.014.0.
  • Fixed an issue with profile value counts to indicate that they are estimates. Affects versions: v2020.009.0. Fix versions: v2020.014.0.
  • Fixed an issue in the golden records project where it did not show the Rules tab to curators. Affects versions: v2020.006.0. Fix versions: v2020.014.0.
  • Fixed an issue where Postgres Prometheus configuration used HOST_IP instead of TAMR_POSTGRES_HOSTNAME.
  • Added TAMR_PERSISTENCE_EXPORTER_USER and TAMR_PERSISTENCE_EXPORT_PASS to the configuration definitions.

Back to top


v2020.13.0 Release Notes

What's New

In this release, you can:

  • Rely on faster running Spark jobs due to HBase configuration improvements.
  • Avoid dataset errors when updating or publishing golden records due to improved dataset validation checks.
  • Collect logs for a specified time period using a new flag on the collect-logs.sh script.

Improvements and Changes

  • Upgraded versions of Grafana to 6.3.4 and Kibana to 5.6.16. Affects versions: v2020.004.0. Fix versions: v2020.013.0, v2020.004.2.
  • HBase. Stopped blocking new jobs while HBase rollback is in progress.
  • HBase. Adjusted the buffer to store enough records for sorting streaming updates to HBase.
  • Allowed LLM and Bulk Matching on projects with Mastering functions and user-defined signals. Affects versions: v2020.002.0. Fix versions: v2020.013.0.

Fixed Issues

The following issues were fixed in this release.

  • Updated collect-logs.sh to accept an age field to enable collecting logs for only a certain number of days. Affects versions: All. Fix versions: v2020.013.0.
  • Fixed the log pruning scripts to set dependencies correctly.
  • Fixed an issue where the user policy management dialog deselected datasets as you paginate.
  • Fixed an issue where you could not edit project and dataset user policies without deselecting other datasets in the policy. Affects versions: All. Fix versions: v2020.013.0.
  • Fixed an issue in working with geospatial data, where displaying multi-point data in a Leaflet map caused the user interface to blank out. Affects versions: v2020.004.1. Fix versions: v2020.013.0, v2020.004.2.
  • Fixed an issue where you could not import pair labels when the pre-grouping feature was enabled. Affects versions: v2020.009.0. Fix versions: v2020.013.0.
  • Fixed an issue where removing a source dataset did not remove it from pair exclusions in the internal configuration. Affects versions: v2020.004.1. Fix versions: v2020.013.0.
  • Made estimate pairs sampling configurable in internal interfaces. Affects versions: All. Fix versions: v2020.013.0.
  • Reduced indexing unneeded internal datasets when Elasticsearch is disabled. Affects versions: v2020.004.1. Fix versions: v2020.013.0.

Documentation Changes

Beginning with Tamr v2020.013.0, documentation versions available at docs.tamr.com are listed as ranges of versions.

  • Documentation version ranges map to the development releases contained within the range.
  • For example, Tamr documentation version 2020.13.0-2020.16.0 (this current range) maps to four consecutive development releases.
  • At the time of this writing, only the first of these development releases is available, Tamr v2020.013.0. The other releases in this range will become available in the future.
  • The documentation for versions in the range is updated in place and republished.
  • The release notes for each development release continue to be published.
  • For information about deltas between individual development releases, see the release notes for each development release. Also see the Changelog for a running list of release notes.
  • The documentation version scheme differs slightly from the development version scheme in that it does not use leading zeros in its numbers. For example, Tamr development version 2020.013.0 is represented as the documentation version 2020.13.0 (there is a missing zero in front of 13). This is by design and the two version notations map to each other.
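
The mapping between the two version notations described in the last item can be sketched in a few lines of Python. The function name is hypothetical; the rule it applies (drop leading zeros from each dot-separated part) is taken from the example above.

```python
def to_documentation_version(development_version):
    """Map a development version such as '2020.013.0' to '2020.13.0'."""
    parts = development_version.lstrip("v").split(".")
    # int() drops leading zeros from each numeric part.
    return ".".join(str(int(part)) for part in parts)


print(to_documentation_version("2020.013.0"))  # 2020.13.0
```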

Back to top


v2020.012.0 Release Notes

What's New

In this release, you can:

  • On the Details section of the Pairs and Clusters pages, expand the details to see long string values for record attributes. Choose Show more or Show less to see the attribute details. You can also see the number of members in an array, for attributes of type array.
  • In the Golden Records project, review reported errors and fix them before proceeding with the project. This is useful if you load golden records datasets programmatically using the APIs. In this case, Tamr validates your golden records dataset against the input records and clusters.
  • Use Postgres v12. This version of Postgres is required beginning with Tamr v2020.012.0 (this version). You can upgrade to Postgres v12 even before you upgrade to Tamr v2020.012.0. To upgrade Postgres, stop Tamr, upgrade Postgres using tools specific to your operating system, run the upgrade to Tamr v2020.012.0, and restart Tamr. If you cannot use Postgres v12 for any reason stemming from your environment, contact Tamr Support to obtain advice on the best course of action. For more information, see Installing Postgres and Upgrading Postgres.
  • Run pair generation jobs faster due to fine-tuned Spark memory allocation. Note: Beginning with this release, Grafana and Kibana monitoring services are disabled by default. You can enable them explicitly, if needed.

Improvements and Changes

The following improvements and changes were made in this release.

Upgrades and Configuration

  • Fixed Spark Executor Calculations.
  • Reduced OS headroom default.
  • When starting Tamr, no longer issue an error if no auxiliary services have been deployed.
  • Disabled Kibana by default (TAMR_ELK_ENABLED=false) and avoided re-computing Spark memory allocations that are already accounted for.
  • Disabled Grafana by default. Fixed an issue where turning off Grafana with TAMR_GRAFANA_ENABLED=false turned off the Graphite exporter but didn't turn off the Spark exports.

User Interface and Usability Improvements

  • The user interface reports errors if saved golden record rules are invalid.
  • You can use Show more and Show less in the Details sidebar on the Pairs and Clusters pages to examine long string attribute values.
  • Attribute values of type array are formatted to show the number of values they contain. Clicking the number shows the details of the array.
  • The Compare Details dialog (side-by-side view) on the Pairs page allows toggling between displaying the long string value of a record's field or a truncated value with ellipsis.

Mastering

  • You can use pair-wise classification user-defined signals in LLM.

Fixed Issues

The following issues were fixed in this release.

  • Fixed an issue where, on refresh, the Schema Mapping page showed "The unified dataset [****] has been deleted." Affects versions: v2020.011.0. Fix versions: v2020.012.0.
  • Fixed an issue where pair generation failed with the error cannot resolve 'testRecord' given input columns: [verificationType, recordId, username, timestamp, verifiedClusterId]. Affects versions: v2020.011.0. Fix versions: v2020.012.0.
  • Fixed an issue where the comment persisted in the entry box in pair labeling even after the comment was submitted. Affects versions: v2020.008.0. Fix versions: v2020.012.0.
  • Fixed an issue where changing the name of the Spend field from project settings did not change this name in the filter dropdown menu. Affects versions: v0.51.0, v2020.009.0. Fix versions: v2020.012.0.
  • Fixed an issue where long text fields for records were displayed in a tooltip, and not in a dialog (preferred). Affects versions: v2019.023.1. Fix versions: v2020.012.0.
  • Fixed an issue where the dataset's preview could not be updated for a unified dataset that contained transformations. Affects versions: v2020.009.0. Fix versions: v2020.012.0.
  • Fixed an issue where the transformations preview failed with a null pointer exception. Affects versions: v2020.004.0, v2020.005.0, v2020.008.0. Fix versions: v2020.012.0.
  • Fixed an issue where an HTTP 500 error was issued on transformations preview, and previewSpark.log mentioned expired credentials. Affects versions: v2020.005.0. Fix versions:
  • Upgraded Postgres to a higher version (v12.3) since Postgres v9.4 is End of Life. Affects versions: v2019.023.1. Fix versions: v2020.012.0.
  • Fixed an issue where the POST /clusters/{dataset}/import internal endpoint did not delete the delta pipeline of the dataset. Affects versions: v2019.003.0. Fix versions: v2020.012.0.
  • Fixed an issue where responding with y to an upgrade prompt resulted in a canceled upgrade.
  • Fixed an issue where the clusters page broke if the publish date was outside of the publish time range.
  • Fixed an issue where the record pairs filter did not show as active when attribute similarity filters were active.
  • Fixed an issue where the Clear search and record filters link on the cluster records table did not clear search.
  • Fixed an issue where the Pairs page broke with Cannot read property 'get' of undefined if input data for a record in a pair was missing.

Back to top


v2020.011.0 Release Notes

What's New

In this release, you can:

  • Use versioned APIs to rename categories.
  • Add and remove datasets in projects, and change tags, as curators. Previously, only admins could run these actions. In this release, curators can also run them.
  • Take advantage of the following performance improvements. Dataset profiling jobs run faster, and single-node Tamr deployments avoid out of memory issues due to fine-tuned Spark and YARN configurations.
  • Use the Low Latency Match (LLM) service with mastering projects that rely on user-defined signals in mastering. Previously, this aspect was not supported for LLM.
  • Specify a unified attribute's type as geospatial in the user interface.

Improvements

The following new features and improvements were completed in this release.

  • Mastering. Curators can add and remove datasets in projects and change tags. The LLM service supports projects with user-defined signals.
  • Upgrades and Configuration. Reduced TAMR_MAX_ROWS_PER_PARTITION from 500,000 to 100,000 in Spark configuration to address out of memory issues.
  • Performance. Optimized profiling jobs to run in one pass over the data.
  • Upgrades. Notify users what to expect on success and failure of upgrade maintenance scripts.
  • Versioned APIs. Use versioned APIs to rename categories in the categorization projects.

Fixed Issues

The following issues were fixed in this release.

  • Fixed an issue where the collect-logs.sh script referenced the wrong location for the Zookeeper log. Affects versions: v2020.008.0. Fix versions: v2020.011.0, v2020.008.1.
  • Fixed the tooltip in Confusion Matrix to say: "How often you agree with Tamr". Affects versions: v2020.008.0. Fix versions: v2020.011.0.
  • Enabled Pairs with similarity score by default when adding an attribute similarity to filter pairs in the user interface. Affects versions: v2020.008.0. Fix versions: v2020.011.0.
  • Fixed an issue where the upgrade process wrote the wrong version to upgrade status tracking.

Back to top


v2020.010.1 Patch Release Notes

This patch is applicable only if you have upgraded to v2020.010.0 from previous versions. The patch fixes an issue where, when upgrading from v2020.010.0 to later versions, the upgrade status version was set incorrectly for v2020.010.0 and a validation check failed on future upgrades. If you upgraded to v2020.010.0 and you are planning to upgrade to a newer version, run the upgrade to the patch v2020.010.1 with the --skipUpgradeStatusValidation flag. The patch does not affect new installations of Tamr v2020.010.0 because they do not have the upgrade status version set for them.

v2020.010.0 Release Notes

What's New

In this release, you can use an improved upgrade process that catches additional upgrade issues and creates an upgrade report. 

Improvements

The following new features and improvements were completed in this release.

Configuration Changes

Added a configuration variable TAMR_LD_PRELOAD to allow appending additional libraries to LD_PRELOAD.

Upgrade Improvements

The upgrade validation process detects:

  • Database tables that have issues which will cause upgrades to fail.
  • Dependent datasets that have missing upstream datasets.

After the upgrade validation process completes, it creates a report file that allows you to follow up on identified issues that might affect the upgrade.

Fixed Issues

The following issues were fixed in this release.

  • Fixed an invalid path error getting HBase configuration in Spark. 
  • Fixed a configuration issue with Zookeeper that affected scaled out deployments. 
  • Fixed an issue where the user interface for golden records was inaccessible after deleting a column in an upstream mastering project. 

Back to top


v2020.009.0 Release Notes

What's New

In this release, you can:

  • Before upgrading to a new Tamr version, use a validation check for the required minimum open file limit (ulimit -n).
  • Use versioned APIs to import and export a categorization model, and to add and remove transformations.

Changes and Improvements

The following changes and improvements took place in this release.

Versioned API Work

  • Added support for importing and exporting the categorization model in a ZIP file.
  • Added support for adding/removing transformations to the versioned API.

Upgrade Improvements

  • Added a validation script for checking open file ulimit (ulimit -n) against required minimum of 66000.
  • Added a configuration variable for configuring an HTTP request idle timeout for all services.
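
The open file limit check described above can be approximated outside the upgrade utility. The following Python sketch is not Tamr's validation script; it assumes a Unix-like system (the resource module is not available on Windows) and compares the soft limit, which is what ulimit -n reports by default, against the 66000 minimum stated in these notes.

```python
import resource

REQUIRED_OPEN_FILES = 66000  # minimum stated in these release notes


def open_file_limit_ok(soft_limit, required=REQUIRED_OPEN_FILES):
    """Return True if the soft open-file limit meets the required minimum."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # an unlimited setting always satisfies the minimum
    return soft_limit >= required


# Read the limits of the current process (soft limit, hard limit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}, ok: {open_file_limit_ok(soft)}")
```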

General UX/Visual Design

Made improvements for showing similar clusters.

Fixed Issues

The following support issues and bugs were fixed in this release.

  • Fixed an issue where the job to train a categorization project had a lower-case ‘m’ in materialize, whereas everything else was capitalized. Affects versions: v2019.023.2. Fix versions: v2020.009.0.
  • Fixed an issue where, when using geospatial features, users were unable to zoom into "Details" window closer than 30m/100ft. Fix versions: v2020.009.0.
  • Fixed an issue where users could not update the project definition for golden records projects via versioned API. Affects versions: v2020.002.0. Fix versions: v2020.009.0.
  • Fixed an issue where PostgreSQL contained multiple, conflicting labels on pairs. Affects versions: v2019.006.1. Fix versions: v2020.009.0.
  • Changed default elastic batch size in ManualClusteringService.reindexClusterMembers from 10,000 to 1,000 to match other defaults.
  • Fixed an issue where unpinning and pinning source datasets caused errors in related jobs.
  • Removed false error messages for user-deleted export files and allowed export cleanup to run smoothly.
  • Fixed an issue in the upgrade script that failed for a mastering project if no mappings existed but there was a unified dataset.

Back to top


v2020.008.1 Patch Release Notes

This patch fixes an issue where the collect-logs.sh script did not copy the Zookeeper log from its new location.

v2020.008.0 Release Notes

What's New

In this release, you can:

  • Use versioned API endpoints for importing, exporting, and removing categorizations. This improvement is a step towards functional completeness of versioned APIs for Tamr.
  • Reliably stop Tamr-dependent components, such as HBase, YARN, and Zookeeper, by using an improved stop-dependencies.sh script.
  • Rely on an updated set of upgrade validation checks in the administration utility. For example, the administration utility now tracks upgrades and checks configured directories.
  • Avoid having to log in to Tamr again after restarting it. Beginning with this release, your session credentials persist after restarting Tamr.
  • Use a new transformation function, array.non_empties(), to remove empty values from an array. See array.non-empties.

Usability and Design Improvements

In this release, you can:

  • Confirm your actions by observing the new tooltip, Copied!, which appears after you click the icon to copy the Cluster ID on the Clusters page.
  • Use improved attribute similarity filters on the Pairs page. You can select both the similarity range and the null similarity filter, read tooltips, and use check boxes. Note: In this version, the similarity filters are turned off by default and you need to turn them on explicitly. There is an open issue for fixing this in a future release.

Upgrade Improvements

In this release, the upgrades validation process in the Tamr administration utility was improved. It:

  • Tracks upgrades and notifies you if you try to run an upgrade while another upgrade is partially completed.
  • Walks down the path from each Tamr-configured subdirectory, checks for permissions, the presence of symbolic links, and available storage space. It reports if it cannot access a directory, and warns you if the storage space is not sufficient in any of the directories (typically, the space should not be less than 1GB).
  • Reliably stops processes for all Tamr-dependent components using the stop-dependencies.sh script. The script attempts to gracefully shut down YARN, HBase, and Zookeeper. If that does not succeed, it waits for the amount of time set in TAMR_HARDKILL_TIMEOUT_SECONDS and then terminates these processes in the correct order. The new parameter, TAMR_HARDKILL_TIMEOUT_SECONDS, was added to the administration utility and is set to 10 seconds by default.
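
The graceful-then-forced shutdown behavior described for stop-dependencies.sh follows a common pattern. The sketch below illustrates that pattern in general terms and is not the script's actual implementation: it asks a process to exit, waits up to a timeout, and force-kills it only if it is still running. The function names are hypothetical; the 10-second default mirrors the documented default of TAMR_HARDKILL_TIMEOUT_SECONDS.

```python
import subprocess
import sys

# Mirrors the documented default of TAMR_HARDKILL_TIMEOUT_SECONDS.
HARDKILL_TIMEOUT_SECONDS = 10


def stop_process(proc: subprocess.Popen, timeout: float = HARDKILL_TIMEOUT_SECONDS) -> str:
    """Request a graceful exit, then force-kill if the process outlives the timeout."""
    proc.terminate()  # SIGTERM on POSIX: the process may clean up and exit
    try:
        proc.wait(timeout=timeout)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: cannot be ignored
        proc.wait()
        return "killed"


def stop_all(procs_in_order, timeout: float = HARDKILL_TIMEOUT_SECONDS):
    """Stop processes one at a time, in the order given by the caller."""
    return [stop_process(p, timeout) for p in procs_in_order]


if __name__ == "__main__":
    # Stand-in for a long-running dependency.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_process(child, timeout=5))
```

Stopping processes sequentially in a caller-supplied order keeps the shutdown sequence explicit, which matters when one component depends on another.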

Fixed Issues

The following support issues and bugs were fixed in this release.

Fixed Issues in Upgrades

  • Fixed an issue where an upgrade from v2019.011.0 to v2020.004.0 failed. Affects versions: v2020.004.0. Fix versions: v2020.008.0.
  • Fixed an issue where an upgrade from v2019.023.1 to v2020.004.0 failed. Affects versions: v2019.023.1, v2020.004.0. Fix versions: v2020.008.0, v2020.004.1.
  • Fixed an issue where an upgrade to v2020.002.0 failed with failure to stop Spark. Affects versions: v2019.009.1. Fix versions: v2020.008.0.
  • Fixed an issue where an upgrade failed because stop-dependencies.sh did not reliably stop all dependencies as required. Affects versions: v2020.004.0. Fix versions: v2020.008.0.
  • Fixed an issue where an upgrade to v2020.004.1 failed if two categorization projects had been deleted.
  • Fixed an issue where updating schema mappings failed if there was a project with a unified dataset but no mappings present.

Other Fixed Issues

  • Fixed an issue where the cluster stats showed an incorrect "last published" date.
  • Fixed an issue where unpinning and pinning source datasets caused errors in related jobs.
  • Fixed an issue where reindexing of transaction comments failed when there were comments associated with deleted projects in the Elasticsearch index.
  • Fixed an issue where exporting categorization labels from versioned APIs didn't take into account new configuration.



v2020.007.1 Patch Release Notes

This patch allows adding tags to Amazon EMR ephemeral Spark clusters and DynamoDB tables on AWS. This issue affected Tamr scaled-out deployments.

v2020.007.0 Release Notes

What's New

In this release, you can:

  • Take advantage of usability improvements when working with golden records.
  • View full record values in the Record Details sidebar. Previously, long record values were cut off.
  • Check the submission of Spark jobs using the validation framework in the administration utility.
  • Run the script for disabling primary keys directly from the administration utility.

Golden Records Improvements

When working with golden records, you can:

  • Use the display name of a mastering project when selecting this project for creating a new Golden Records project.
  • Use the ENTER key to save rule overrides in golden records.

Usability and Design Improvements

In this release, you can:

  • Use the in-product transformations documentation to learn how str_join handles null values.
  • Use a toggle to show full record values in the Record Details sidebar.

Upgrade Improvements

In this release, you can:

  • Run the script for disabling primary keys from the Tamr administrative utility.
  • Check the Spark jobs submission as part of validation checks in the Tamr administrative utility.

Fixed Issues

The following support issues were fixed in this release.

  • Fixed an HBase issue where the job was displayed as running when no jobs were running. Affects versions: v2019.016.0. Fixed in versions: v2020.007.0.
  • Fixed an issue where pair generation on the Pairs page broke if Jaccard tokenizer threshold was set to less than 0.4. The fix checks the value you enter for the Jaccard tokenizer and ensures that Tamr issues an informative error message in this case. Affects versions: v2019.023.2. Fixed in versions: v2020.007.0.
  • Fixed an issue that prevented a customer from using the ENTER key to save rule overrides in golden records. Affects versions: v2019.023.1. Fixed in versions: v2020.007.0.
  • Fixed an issue where the Pairs page was not loading with "Error parsing metadata at key 'DEDUP'". Affects versions: v2019.023.1. Fixed in versions: v2020.007.0.
  • Fixed an issue that prevented a customer from using the display name of the mastering project when selecting this project for creating a new Golden Records project. The fix allows you to use the mastering project's display name. Affects versions: v2020.001.0. Fixed in versions: v2020.007.0.
  • Fixed an issue where a customer could not delete the taxonomy file name when uploading a taxonomy to a categorization project, because empty values were not allowed in the user interface. The fix allows you to delete the name of the file and replace it with a new name. Affects versions: v2020.002.0. Fixed in versions: v2020.007.0.
  • Fixed an issue where the Golden Records project could not be created if attribute names collided with existing attribute names. Affects versions: v2019.020.0. Fixed in versions: v2020.007.0.
  • Allowed creating golden records attributes and rules when creating a new Golden Records project. Affects versions: v2019.011.0. Fixed in versions: v2020.007.0.
  • Fixed formatting errors in versioned API names and added missing error code descriptions.
  • Fixed an issue where the Dataset Catalog page broke if you clicked a row and then pressed the space key.



v2020.006.0 Release Notes

New Features and Improvements

The following new features and improvements were completed in this release.

  • Stopped retrying the request if the table limit exceeds 1,000 tables in GCP BigTable.
  • Reduced unnecessary authentication service dependencies.
  • Made the Grafana path configurable.
  • Improved accessibility of user interfaces for cluster verification.

Fixed Issues

The following support issues were fixed in this release.

  • Fixed an issue where the matching service on port 9170 was not healthy and the Low Latency Match (LLM) service could not be used to query projects. Affects versions: v2020.002.0. Fix versions: v2020.006.0.
  • Fixed an issue in the user interface where a user could not delete many datasets, including the unified dataset, of a dedup project that has been published. Affects versions: v2019.023.1. Fix versions: v2020.006.0.
  • Fixed an issue where the Delete Dataset dialog cut off critical information and required you to scroll. Previously, in the Delete Dataset dialog, long dataset names were cut off. This made it impossible to determine which derived dataset was preventing a user from deleting a particular dataset. Affects versions: v2019.026.0. Fix versions: v2020.006.0.
  • Fixed an issue where reindexing feedback documents failed when there were feedback documents associated with existing unified datasets from deleted projects in persistence.
  • Fixed an issue where the Open second cluster browser button was not disabled when the map was open.
  • Fixed an issue where Preview returned a 500 error for an empty metadata reference.
  • Fixed an issue where an upgrade failed with a failure during backup, indicating that /home/ubuntu/tmp did not exist.



v2020.005.0 Release Notes

What's New

In this release, you can:

  • Publish the last updated version of golden records.
  • Rely on additional post-upgrade validation checks that display in a format that is easier to read.
  • Use IAM roles for Amazon S3 backup configuration.
  • Configure the Tamr timeout during restarts. The default timeout is now 5 minutes, up from 3 minutes in previous releases.
  • Azure (Beta). Run Tamr with ADLSv1 as the primary filesystem with Databricks Spark.

Fixed Issues

The following issues were fixed in this release.

  • Prevented users from entering reserved words as attribute names, which avoids a vague error report in Categorization. Affects versions: v2019.023.0. Fix versions: v2020.005.0.
  • Fixed an issue where Tamr reserved some column names that users could still use, which caused job failures. Affects versions: v2019.009.1. Fix versions: v2020.005.0.
  • Fixed an issue where Categorization failed due to using a reserved attribute name "finalClassificationPath".
  • Fixed an issue where restoring from a backup failed and did not restart Tamr. Affects versions: v2019.024.0, v2019.025.0. Fix versions: v2020.005.0.
  • Added configurable parameters for the number of retries for startup (50) and interval between retries (6 sec) to achieve the default timeout interval of 5 minutes upon restart.
  • Increased default Zookeeper timeouts to 120,000 ms.
  • Enabled support for IAM roles for Amazon S3 backup configuration. Affects versions: v0.47.0. Fix versions: v2020.005.0.
  • Fixed an issue where toggling Excluded bucket selection in Golden Records Preferred Sources editing dialog did nothing.
  • Fixed an issue where the Unified Dataset page threw an authorization error for users logged in as reviewers.
  • Cleaned up the service restart after a system state change.
  • Removed __MACOSX directory from Tamr zip.
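
The startup retry defaults listed above (50 retries, 6 seconds apart) multiply out to the 5-minute restart timeout: 50 × 6 s = 300 s. The sketch below shows that arithmetic together with a generic retry loop. It is an illustration under stated assumptions, not Tamr's startup code; the function names and the `is_healthy` callback are hypothetical.

```python
import time
from typing import Callable

STARTUP_RETRIES = 50        # default number of retries stated in the note
RETRY_INTERVAL_SECONDS = 6  # default interval between retries stated in the note


def total_timeout_seconds(retries: int = STARTUP_RETRIES,
                          interval: float = RETRY_INTERVAL_SECONDS) -> float:
    """Upper bound on the time spent waiting between health checks."""
    return retries * interval


def wait_until_healthy(is_healthy: Callable[[], bool],
                       retries: int = STARTUP_RETRIES,
                       interval: float = RETRY_INTERVAL_SECONDS,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll a health check up to `retries` times, sleeping `interval` seconds after each failure."""
    for _ in range(retries):
        if is_healthy():
            return True
        sleep(interval)
    return False


if __name__ == "__main__":
    print(total_timeout_seconds() / 60, "minutes")
```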

Fixed Issues in Transformations

  • Fixed incorrect color highlighting of "metadata" datasets in the Transformation editor.
  • Fixed an issue where custom sample datasets in Transformations bypassed certain Project permissions.
  • Fixed type-checking for Struct and Array types, which was inaccurate in some cases.
  • Fixed an issue where datasets could not be referenced with USE within a labeled scope.



v2020.004.2 Patch Release Notes

This patch contains the following fixed bugs and support issues.

  • Fixed an issue with geospatial rendering in the user interface where displaying multi-point data in the Leaflet-based map caused the user interface to blank out.
  • Upgraded Grafana and Kibana to mitigate CVE cybersecurity vulnerabilities. For a list of versions to which Grafana and Kibana were upgraded, see Supported Monitoring Tools.

v2020.004.1 Patch Release Notes

The patch contains the following fixed bugs and support issues.

Issue with Grafana

  • In Grafana, the output of template rendering is no longer logged, because it may contain sensitive information.

Issue with Low Latency Match (LLM) Service

  • Fixed an issue where port 9170 for the Low Latency Match (LLM) service was not healthy and the LLM service could not be used to query projects.

Issues in Elasticsearch Indexing

  • Fixed an issue where reindexing transaction comments failed when there were comments associated with deleted projects in persistence.
  • Fixed an issue where reindexing feedback documents failed when there were feedback documents associated with existing unified datasets from deleted projects in persistence.

Upgrade Issues

  • Fixed a support issue where an upgrade failed from v2019.023.1 to v2020.004.0. Affects version 2019.023.1, fixed in 2020.004.1.
  • Fixed an issue where an upgrade to version 2020.004.1 failed if two categorization projects had been deleted. Affects version 2020.004 when upgrading to the 2020.004.1 patch. Fixed in 2020.004.1.
  • Fixed an issue where an upgrade script failed for a mastering project if no mappings existed but there was a unified dataset. Affects version 2020.001 when upgrading to the 2020.004.1 patch. Fixed in 2020.004.1.

v2020.004.0 Release Notes

What's New

This release allows you to use Tamr for data mastering on geospatial records. You can:

  • Load geospatial records into Tamr in the GeoJSON format. In the beta release, to load records, contact Tamr Support.
  • Work with pairs, find matches and duplicates, and run transformations on geospatial records. You can then put records in clusters based on information extracted from geospatial data, and create a categorization project to align records with an existing taxonomy that you might have in place.
  • Configure Tamr to use OpenStreetMap and Thunderforest server tiles. You can then view geospatial record pairs, clusters, and shapes, such as polygons, on the Leaflet-based map.
  • Use a Leaflet-based map on Pairs and Clusters pages in Tamr. If you have configured Tamr to use multiple tile servers, you can switch between them and use different maps for pair matching and clustering.
  • Zoom and pan on the map to refetch geospatial data as the map adjusts interactively.
  • On the Schema Mapping page, configure pair similarity metrics, such as Hausdorff, Relative Hausdorff, and Directional Hausdorff Distances. View a pair of two records on the map, along with their similarity metrics and location.
  • On the Clusters page, view a cluster of records on the map at the same time and configure Tamr to display adjacent records.
  • Run geospatial-boundary searches on clusters of geospatial records.
  • Run transformations on records with geospatial data, such as calculating the area or the perimeter or converting record types to supported geospatial types. For information, see GIS Functions.
  • Use geospatial records in a unified dataset.
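
For readers unfamiliar with GeoJSON, the format mentioned above for loading geospatial records, the following sketch builds a small FeatureCollection that follows the GeoJSON specification (RFC 7946). It shows the general shape of the format only. The helper name, the coordinates, and the `name` property are hypothetical, and this note does not state how Tamr expects the records to be organized.

```python
import json


def make_polygon_feature(ring, properties):
    """Build a GeoJSON Feature with a single-ring Polygon geometry."""
    closed = [list(p) for p in ring]
    # RFC 7946: the first and last positions of a linear ring must be identical.
    if closed[0] != closed[-1]:
        closed.append(list(closed[0]))
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [closed]},
        "properties": properties,
    }


# Positions are [longitude, latitude]; values here are made up for illustration.
feature = make_polygon_feature(
    [(-71.06, 42.35), (-71.05, 42.35), (-71.05, 42.36), (-71.06, 42.36)],
    {"name": "Example building"},  # hypothetical attribute
)
collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection, indent=2))
```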

In addition, in this release, you can:

  • View dataset metadata in the user interface and run transformations on metadata in Tamr.
  • Use Tamr versioned APIs for these actions on golden records: run profiling on golden records, publish golden records, and update golden records.
  • Have greater control over permissions to datasets, projects, and transformations.

Fixed Issues

The following issues were fixed in this release.

  • Fixed an issue where schema mappings disappeared. Affects versions: v2019.019.0, v2019.023.1. Fix versions: v2020.004.0.
  • Allowed using dataset tags or metadata information in transformations. Fix versions: v2020.004.0.
  • Fixed an issue where transformations linting broke on USE "datasetName" statements if the dataset was missing or unauthorized, or if datasetName was an empty string.



v2020.003.0 Release Notes

What's New

This release allows you to:

  • View dataset metadata in the user interface.
  • View geospatial data in golden records.
  • Run profiling on golden records using the versioned Tamr API.
  • Have greater control over permissions to datasets, projects, and transformations.

New Features and Improvements

The following new features and improvements were completed in this release.

Role-Based Access Controls Expansion

  • Added authentication for events in the user interface.
  • Added access controls to Taxonomy and Classification projects.

Dataset Metadata

You can now modify dataset metadata in the Tamr user interface.

You can set metadata values for an input dataset or any of its attributes in a mastering, categorization, or schema mapping project. When you select a dataset on the project’s Datasets page, the Open properties option is now available. After you select the object you want to add metadata to, you specify an identifying key for the property and a value for the key.

For example, you can add a key of “Privacy” with a value of 1, 2, or 3, to several attributes in a dataset where 1=Public Information, 2=Private Information, and 3=Top Secret Information.

For more information, see Using Metadata in Transformations.

Geospatial Support

  • Added new transformation functions to convert data into geospatial types point and polygon.
  • Added ability to refresh a map.
  • Relative Hausdorff Similarity uses true shape diameter instead of bounding-box approximation.
  • Made warning messaging on geospatial map consistent.
  • Fixed the count of geospatially-mapped records upon zooming in.
  • Fixed an issue where Tamr did not fetch records for adjacent clusters when you have not toggled on visibility for adjacent clusters.
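
The Hausdorff-based metrics mentioned in these notes build on a standard definition. As a rough illustration, the sketch below computes the textbook directed and symmetric Hausdorff distances between two finite point sets. It is not Tamr's implementation: Tamr operates on geospatial shapes, and the relative variant, which these notes say is normalized by shape diameter, is not reproduced here. The function names are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def directed_hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


if __name__ == "__main__":
    road = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    segment = [(0.0, 0.0), (1.0, 0.0)]  # a section contained in the road
    print(directed_hausdorff(segment, road))  # every segment point lies on the road
    print(hausdorff(segment, road))           # the far end of the road is not covered
```

The directed form is zero when the first set is contained in the second, which is why a directional variant suits matching a part against a whole.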

Golden Records

  • Added versioned API endpoints for golden records. You can use POST /v1/projects/{project}/goldenRecordsProfile:refresh to run profiling on golden records using the versioned Tamr API.
  • Golden records preview returns an array for a Struct type.
  • Golden records can output and display geospatial data.
  • Golden records create an intermediate "Rule Output" internal derived dataset caching the output of running golden record rules.
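
A call to the profiling endpoint named above might be assembled as follows. This is a sketch under assumptions: only the path `/v1/projects/{project}/goldenRecordsProfile:refresh` and the POST method come from this note. The base URL (including any API prefix your deployment uses), the project identifier, and the authorization header value are placeholders that you must supply, and the helper name is hypothetical. The request is built but not sent.

```python
import urllib.request


def build_profile_refresh_request(base_url: str, project: str,
                                  auth_header: str) -> urllib.request.Request:
    """Build (without sending) the POST request that refreshes the golden records profile."""
    url = f"{base_url.rstrip('/')}/v1/projects/{project}/goldenRecordsProfile:refresh"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": auth_header},  # credentials scheme is an assumption
    )


if __name__ == "__main__":
    # All three arguments are placeholders, not real values.
    req = build_profile_refresh_request("http://tamr.example.com", "1", "<credentials>")
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req)
```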

Configuration, Deployment, and Lifecycle Changes

  • HBase improvements. Enabled configuring HBASE_ZK_SESSION_TIMEOUT for ZooKeeper from HBase.
  • Configured Elasticsearch socket timeout is now respected by Spark components.

Fixed Support Issues

The following support issues were fixed in this release.

  • Fixed the mktemp: too few X's in template 'tempZipDir' warning. Affects versions: v2019.026.0, Fix versions: v2020.003.0.
  • Set java.io.tmpdir to the value of TAMR_TMP_DIR. Affects versions: v2020.002.0, Fix versions: v2020.003.0.
  • Classification Dashboard incorrectly displayed Verified, agrees with Tamr. Affects versions: v2019.023.1, Fix versions: v2020.003.0.
  • Curator dashboard always says Tamr agrees with verified records 100 percent of the time. Affects versions: v0.35.0, v0.36.0, v0.37.0, Fix versions: v2020.003.0.
  • Disk space during a backup. Affects versions: v2019.020.0, Fix versions: v2020.003.0.
  • Disk space gets locked up after repeated cycles of running Tamr and Tamr backup. Affects versions: v2019.019.0, Fix versions: v2020.003.0.
  • Frequent errors "Your connection to the application has been unexpectedly terminated". Affects versions: v2019.023.1, Fix versions: v2020.003.0.
  • The Update Pairs job yields a Request to Elasticsearch Scroll API failed error. Affects versions: v2019.023.1, Fix versions: v2020.003.0, v2020.002.0.
  • Pairs cannot render, forcing page reload. Affects versions: v2019.011.0, v2019.015.0, v2019.023.0, v2019.025.0, v2019.023.1, v2019.026.0, v2020.001.0, Fix versions: v2020.003.0.

Fixed Issues

Fixed the following issues.

  • DNF estimation and DNF learner assumed that originEntityId is unique.
  • A job was being canceled with a cannot compute delta error on "cluster similarities" dataset.
  • Administration utility upgrade command: backup failure occurred due to missing class HBaseExportSnapshotWriter.



v2020.002.0 Release Notes

What's New

  • Added new transformation functions. For more information, see Transformations in this document.
  • Released version 0.10 of the Tamr Python client. This version supports Python 3.8, standard Python logging, and uses markdown for the Tamr Python Client user documentation.
  • Added error messages and stack traces from API errors to the errors that you see in the Tamr user interface.
  • Further improved the validation process during upgrades by adding a healthcheck script for Elasticsearch. For more information, see Configuration, Deployment, and Lifecycle Changes in this document.

Record Clustering

  • The cluster card shows an Accept suggestion button in the sidebar that displays suggested clusters.
  • Beginning with this release, LLM returns the top-k similar clusters instead of only the top-1 that it returned in previous releases. You can configure the k parameter through TAMR_LLM_TOPK.
  • All aspects of the mastering workflow now support pregrouping. Pregrouping integrates the GROUP BY transformation into the internal dedup service and allows Tamr to group almost exact matches before generating candidate pairs. You enable pregrouping by changing the dedup recipe metadata. When pregrouping is enabled in Tamr, the resulting record clusters are based on the original record IDs, which makes it easier for Tamr curators to verify, lock, and review the resulting clusters. When grouping fields change, Tamr auto-maps the verified pair labels to the new group IDs. To enable pregrouping, contact your Tamr support representative.
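
To illustrate the difference between top-1 and top-k results, the sketch below selects the k highest-scoring clusters from a set of similarity scores. It shows the general idea only and does not reflect how LLM computes or returns its results; the function name, cluster IDs, and scores are made up.

```python
import heapq
from typing import Dict, List, Tuple


def top_k_clusters(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k (cluster ID, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    scores = {"cluster-a": 0.2, "cluster-b": 0.9, "cluster-c": 0.5}  # hypothetical values
    print(top_k_clusters(scores, 1))  # behavior comparable to earlier releases
    print(top_k_clusters(scores, 2))  # k similar clusters
```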

Transformations

  • Added the new array.most_frequent function to the list of Tamr transformation functions. It returns the N most frequently-occurring values in an array, skipping null values. This function may be useful in categorization projects. For example, it allows you to create a single meaningful text description from a list of string array descriptions arriving from multiple records. This is useful when you need to unify many columns containing a short text description of the product into a single column with the most representative text description.
  • Added collect_set and collect_subset aggregation functions.
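
The behavior described for array.most_frequent can be mimicked in a few lines of Python. This is a sketch of the described semantics only (return the N most frequent values, skipping nulls), not the transformation function itself. The argument order and the way Tamr breaks ties are not stated in these notes; here, ties keep first-seen order.

```python
from collections import Counter
from typing import List, Optional


def most_frequent(values: List[Optional[str]], n: int) -> List[str]:
    """Return the n most frequently occurring non-null values, most frequent first."""
    counts = Counter(v for v in values if v is not None)
    return [value for value, _count in counts.most_common(n)]


if __name__ == "__main__":
    # Hypothetical product descriptions collected from several records.
    descriptions = ["red widget", None, "widget", "red widget", "widget", "red widget"]
    print(most_frequent(descriptions, 1))
```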

Configuration, Deployment, and Lifecycle Changes

  • Released version 0.10 of the Tamr Python Client. For details, see the Python Client release notes and changelog.
  • Removed TAMR_ES_CLUSTER "elasticsearch_procurify". This parameter was not used in Tamr. Upon upgrading to this version of Tamr, the upgrade logs indicate the following: INFO: Property definition [TAMR_ES_CLUSTER] is not declared in new definitions. It will be removed from the configuration store. This removal does not affect Tamr operations in any way in this release or in any other releases.
  • Added TAMR_ES_MAX_GEOSPATIAL_FEATURES_DEFAULT: "1000". This is the limit for the number of records to fetch when rendering them on a geospatial map. This value must not exceed the value of TAMR_ES_MAX_RESULT_WINDOW.
  • Changed TAMR_ES_SOCKET_TIMEOUT: "300000" to TAMR_ES_SOCKET_TIMEOUT: "900000" as part of performance-related bug fixes.
  • Added TAMR_JOB_SPARK_LOG4J_PROPS: "/home/ubuntu/tamr/conf/log4j.properties.j2"  as part of logging improvements.
  • Added a validation script for Elasticsearch to the list of Tamr Validation checks. The script checks:
    • The connectivity to the Elasticsearch cluster used to power the user interface in Tamr, and that this cluster is running.
    • The existence of the Elasticsearch data directory, /home/<user-name>/src/javasrc/procurify/deployment/build/deployment/elasticsearch-with-plugins-6.8.2/data, and that this directory has a sufficient amount of free space and is readable and writable.
    • That the version of Elasticsearch in use is compatible with the version supported by Tamr.
  • Added error messages and stack traces from API errors to the errors that you see in the Tamr user interface. 
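
The data directory checks described above (existence, free space, read and write access) can be approximated with the Python standard library. The sketch below is illustrative and is not the validation script shipped with Tamr; the function name and the free-space threshold are assumptions, because these notes do not state the amount of space the script requires.

```python
import os
import shutil
from typing import List


def check_data_dir(path: str, min_free_bytes: int) -> List[str]:
    """Return a list of problems found with a data directory (empty if none)."""
    if not os.path.isdir(path):
        return [f"{path}: does not exist or is not a directory"]
    problems = []
    if not os.access(path, os.R_OK):
        problems.append(f"{path}: not readable")
    if not os.access(path, os.W_OK):
        problems.append(f"{path}: not writable")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"{path}: only {free} bytes free, need {min_free_bytes}")
    return problems


if __name__ == "__main__":
    # Hypothetical threshold of 1 GiB; substitute the directory you want to check.
    for problem in check_data_dir(".", 1024 ** 3):
        print(problem)
```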

Fixed Support Issues

  • Fixed an issue where an Elasticsearch "clear" operation was receiving a socket timeout or generating an intermediate response that was too long. This was caused by index deletions on projects with very large numbers of pairs. The fix changes the default for the Elasticsearch socket timeout: DEFAULT_SOCKET_TIMEOUT_MS is now set to 900000 milliseconds (15 minutes). The fix also reduces the default batch size on Elasticsearch "clear" operations from 10,000 to 1,000 to match the batch size used by other operations. The fix was included in patch release v2019.023.2 and in v2020.002.0.
  • Fixed an issue where previewing bookmarked golden records failed on datasets from chained projects.
  • Added the array.most_frequent function to transformations. It returns the N most frequently-occurring values in an array, skipping null values.
  • Fixed an issue where, during backups, multiple numbered files were created in /opt/tamr/tmp/backup/ but not deleted, which took up considerable amounts of disk space. The issue was due to the HBase export snapshot process that kept temporary files around until Tamr was restarted. The fix was to make the export snapshot run as a separate process.

Fixed Issues

  • Fixed an issue where the number of generated pairs was incorrect for geospatial records that use Directional Hausdorff for pair matching.
  • Fixed an issue in golden records where you could not preview bookmarked geospatial-type records.
  • Fixed an issue in the Tamr user interface where the Status buttons on the Jobs page were truncated and off-center. 
  • Fixed an issue where upgrade validation failed when Tamr was using Google Cloud SQL Postgres instances.



Tamr 2019 Releases

v2019.026.0 Release Notes

What's New

  • Added a validation framework to Tamr upgrades. You can run healthcheck validation at any time and it also runs during version upgrades. The validation healthcheck scripts check for Tamr license, memory usage, operating system, and HBase configuration, and publish detailed information if they find issues. See Validation.
  • Added more granular and expressive controls for record locking and verification in clustering compared with previous releases. Records in the cluster verification table reflect more expressive cluster verification states compared with previous releases where you could only Lock and Unlock records. See Curating Clusters.

Clustering

Added user interface options for cluster verification actions in the Clusters page that were previously available only in the API. In particular, these changes were made:

  • Use four subtypes of Verify actions. These actions replace the Lock and Unlock actions available in previous releases. The four types of verification actions are: Verify and enable suggestions, Verify and disable suggestions, Verify and auto-accept suggestions, and Remove verification. The Lock action in previous releases is equivalent to Verify and disable suggestions.
  • Records in the table reflect verification states and allow you to take action.
  • Use more expressive verification filters to filter clusters and view verification states in the record sidebar.
  • The cluster table shows verification aggregations, such as the number of records in the cluster that may need to be moved to another cluster, or other actions stemming from the new verification options. For more information about cluster verification options, see Curating Clusters.

Configuration, Deployment, and Lifecycle Changes

Added validation healthcheck scripts in the Tamr administrative utility. To use validation scripts, run the administrative utility with the new validate option, such as: tamr/utils/unify-admin.sh validate. For more information, see Validation.

Fixed Support Issues

Fixed an issue where the Edit rules link for golden records was broken in Internet Explorer.



v2019.025.0 Release Notes

What's New

Improved upgrades, including their resilience, logging and diagnostics.

Mastering

Improved user experience for loading data from external data sources, such as HDFS. This feature was introduced in the previous release.

Access Control Improvements

Continued improvements for rules-based access controls to schema mapping and to remaining API actions in the mastering workflow.

Configuration, Deployment and Lifecycle Changes

  • Added better logging for upgrades. Logs are now more informative about errors that occur during upgrades. Changed logging in the Tamr administrative utility: console output is more readable and contains only the highlights, while the log file is now more detailed.
  • Made upgrades more resilient so that they don't stop in the middle, leaving Tamr in an inconsistent state. Improved diagnostics to help you find out which parts of Tamr aren't working. Added built-in diagnostics in the administrative utility to allow you to identify and check for problems. Project upgrades no longer hold up the upgrade process; their errors are reported in the user interface inside the project for you to fix after the upgrade.
  • Made the following changes to the configuration properties in Tamr: added a new property, TAMR_UNIFY_BACKUP_THREADS, to configure the backup threads (replaces TAMR_UNIFY_BACKUP_NUM_FILE_COPY_THREADS), and added a new default value for TAMR_HBASE_CONFIG_URIS to allow Tamr to pick up configuration from hbase-site.xml.
    TAMR_HBASE_DATA_DIR needs to be set only in these cases: TAMR_CONNECTION_INFO_TYPE=hbase and TAMR_REMOTE_HBASE_ENABLED=true, or in a script installation if you prefer to place the HBase data directory in another location.
  • Made robustness changes to backups. The main backup/restore process monitors backups for individual operations, checks for final status, reacts to user cancellation and exits.

Fixed Support Issues

  • Fixed an issue where a mastering project was out of date.
  • Fixed an issue where YARN would not start after an upgrade from v2019.022 to v2019.023. The proxy changes prevented the administrative utility from checking that YARN was up.



v2019.023.2 Patch Release Notes

This patch fixes an issue with Elasticsearch. It changes the default batch size for Elasticsearch components from 10,000 to 1,000 to work around a ContentTooLong exception received with some types of data.

v2019.023.1 Patch Release Notes

This patch fixes an issue that prevented you from logging into Tamr using Internet Explorer.

v2019.023.0 Release Notes

What's New

  • Preview rules for golden records. To preview rules, select one or more entities with rules, edit rules, and then choose Preview Rule to see how rules for golden records will behave. Once you are satisfied with the rules, update golden records and publish them.
  • Use a DROP statement in transformations to omit unwanted attributes. See Drop.
  • Use MATH.TAN, MATH.ATAN, MATH.ATAN2, MATH.DEGREES, and MATH.RADIUS functions in the library of mathematical functions.
  • Use a new transformation statement, Sample.
  • Use a new directional Hausdorff similarity function for pair matching of geospatial records.
  • Back up and restore a Tamr deployment in which datasets are stored in Google Cloud Platform (GCP) BigTable. You can back up and restore to a local disk, HDFS, and GCS. See Backup Configuration.
  • Retain labels after an upgrade by running a script that disables automatic assignment of primary keys.
  • Start Tamr faster by running start-tamr.sh with improved performance.
  • Upgrade directly from v0.40.1 to the current version of Tamr (without having to install intervening versions).
  • Be aware of these configuration changes:
    • Increased the services startup timeout to three minutes from one minute in previous releases.
    • Switched to OpenJDK 8 instead of the Oracle Java SDK in the Tamr installation package.
    • Added Tamr configuration properties to support configuring password-protected transport authentication context in SAML v2. For more information, see SAML Authentication.
    • Renamed tools.jar to transform-tools.jar. It is packaged under tamr/libs in the Tamr installation and allows you to convert between legacy.hash() and hash().

Golden Records

In the golden records project, you can preview rules. To preview a rule, select the entities one by one or in bulk as bookmarks, edit a rule, and then choose Preview Rule. Once you are satisfied with the rule, select Update and then choose Publish Golden Records.

Transformations

  • Added a new transformation statement, Sample.
  • Added a new DROP statement in transformations. Use it to remove a column from the active dataset. DROP is more convenient to use than SELECT. For example, to increase performance, use DROP to remove unused columns after running JOIN statements. If you drop an attribute in the unified dataset, it is then populated with nulls. See Drop.
  • Added MATH.TAN, MATH.ATAN, MATH.ATAN2, MATH.DEGREES, and MATH.RADIUS functions to the library of mathematical functions.
  • Added the USE HINT statement that applies a hint to the current transformation in the editor and to all of the subsequent transformations in that project. This statement might be useful to you if:
    • You have Tamr projects created before Tamr v.2019.014.1 that you are upgrading to this version (Tamr v.2019.021.0), and
    • You explicitly do not want Tamr to automatically manage your primary keys.

For example, to disable automatic primary key management by Tamr in a particular project, add: USE HINT(pkmanagement.manual); in the first transformation.

Note: The USE HINT statement is only useful if you have created projects before Tamr v.2019.014.1. In that version, automatic management of primary keys was introduced. If you have started using Tamr after v.2019.014.1, Tamr automatically manages primary keys. For information about primary keys, see Primary Key Management.

Mastering

  • Added sorting to record IDs in derived datasets (these datasets in mastering projects show statistics for clusters of records).
  • Added a new directional Hausdorff similarity function. The existing functions, Hausdorff distance and absolute Hausdorff, target matching geo-type objects. The new directional Hausdorff function supports matching contained objects. For example, you can use it to see whether a small section of a road matches the entire road, or whether a portion of a building matches an entire building. Directional Hausdorff is useful for pair generation and as an additional signal for pair matching. See Working with Geospatial Data.
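
To illustrate why a directional measure suits contained objects, the following is a minimal Python sketch of the textbook directional Hausdorff distance between two point sets. It is not Tamr's implementation: the function names are hypothetical, geometries are reduced to lists of planar vertices, and distances are plain Euclidean. How Tamr converts a distance into a similarity score is not covered here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def directional_hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def symmetric_hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Classic Hausdorff distance: the larger of the two directions."""
    return max(directional_hausdorff(a, b), directional_hausdorff(b, a))


# A short road section that lies entirely on a longer road.
section = [(1.0, 0.0), (2.0, 0.0)]
road = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

contained = directional_hausdorff(section, road)  # section -> road
reverse = directional_hausdorff(road, section)    # road -> section
```

In this example the section-to-road direction yields 0.0 because every vertex of the section lies on the road, while the reverse direction and the symmetric distance yield 1.0 because the road extends beyond the section.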

Configuration Changes

  • Increased the services startup timeout from one to three minutes.
  • Switched to the OpenJDK 8 instead of the Oracle Java SDK in the Tamr installation package. As a result, the default value of TAMR_JAVA_HOME is changed from <tamr-home-directory>/jdk1.8.0_111/ to <tamr-home-directory>/openjdk-8u222/.
  • Changed the lowest version from which you can upgrade to the current version of Tamr from v.0.37.0 to v.0.40.1. Previously, the lowest version from which you could upgrade directly to the current version without installing intervening versions was Tamr v.0.37.0. With this change, the lowest version that allows direct upgrades to the current version of Tamr (v.2019.023.1 and higher) is v.0.40.1. If you are upgrading from version 0.40.0 or earlier, upgrade to each individual version up to v0.40.1 and then upgrade directly to the current version.
  • Enforced the use of HBase for dataset storage upon Tamr upgrade to this version. If you have migrated to Tamr v.2019.019.0, then the upgrade script that runs during an upgrade takes care of moving your datasets to HBase-backed storage. For information, see HBASE.
  • Added the ability to back up and restore Tamr from Google Cloud Platform (GCP) BigTable. Previously, you could back up and restore only Tamr deployments backed by HBase. Beginning with this release, if your Tamr datasets use GCP BigTable instead of HBase, you can back up and restore them. The GCP BigTable backup in Tamr creates SequenceFiles. Note that the HBase backup creates HBase snapshot files. You can back up to a local filesystem, HDFS, or GCS. See Backup Configuration.
  • Fixed the administration utility, admin-unify.sh, so that it no longer requires quotes when unsetting a custom parameter to an empty value in tamr_config.yml. Previously, it required specifying quotes for an empty string value, such as TAMR_JOB_SPARK_BIGQUERY_JAR: "". With the fix, to unset a custom parameter's value, you can omit quotes and leave the value empty (quotes are still allowed).
  • Reviewers can no longer cancel jobs. Previously, access controls for jobs allowed reviewers to cancel jobs.
  • Added a list of case-insensitive reserved words to unified attribute names in Schema Mapping. Tamr issues an error and prevents you from using the following reserved words for unified attribute names in Schema Mapping: origin_source_name, tamr_id, origin_entity_id, clusterId, originSourceId, originEntityId, sourceId, entityId, suggestedClusterId, verificationType, and verifiedClusterId.
  • Added a script that can apply the USE HINT (pkmanagement.manual) statements to every project using transformations. Starting with v2019.014.0, Tamr automatically assigns unique primary keys (tamr_id) if you have not assigned a tamr_id manually to your records. If the tamr_ids change, then labels could change in downstream projects. Until this release, you could manually add the statement USE HINT (pkmanagement.manual) to each of your transformation scripts to disable automatic assignment of primary keys. In this release, we added an option in the <unify-zip>/tamr/libs/transform-tools.jar script to automate the process of temporarily disabling primary key assignments after an upgrade. This option adds a HINT to each project's transformations. For information about primary keys, see Primary Key Management.
  • Added a warning strongly discouraging you from starting Tamr as a root user.
  • Improved performance of start-tamr.sh.
  • Added Tamr configuration properties that support configuring password-protected transport authentication context in SAML v2. For more information, see SAML Authentication.
  • Changed the defaults for the configuration parameters TAMR_FS_CONFIG_URIS and TAMR_FS_CONFIG_DIR to use the configuration found in TAMR_HADOOP_HOME.
  • These changes have the following implications:
    • Changes to YARN/Spark configuration, such as TAMR_YARN_NODE_MANAGER_PORT, no longer require setting TAMR_FS_CONFIG_URIS.
    • If Tamr is deployed in an on-premises Hadoop installation that has some other Hadoop configuration present on the filesystem, such as core-site.xml and/or yarn-site.xml, Tamr's own Spark configuration is no longer ignored. Tamr now uses the configuration specified in TAMR_HADOOP_HOME by default.

Fixed Support Issues

  • Fixed an issue with ghost golden records. With this fix, if a cluster no longer exists in the input dataset, it does not contribute to a golden record.
  • Fixed an issue where you could start Tamr as a root user without any warning. With this fix, Tamr issues a warning that discourages you from starting Tamr as a root user.

Other Fixed Issues

  • Fixed a security issue that affected log files produced by restoring from a backup from a previous version.
  • Fixed a performance regression in Spark when running join operations by reducing the number of unnecessary row counts.
  • Fixed an issue where getRecordsByIds failed if a dataset had a top-level struct type field with a field named timestamp.
  • Fixed an issue where Tamr created duplicate columns during bootstrapping in Schema Mapping, for these attributes in the clusters datasets: suggestedClusterId, verificationType, verifiedClusterId. This occurred if you were using the datasets from the clusters as your input datasets to a new mastering project and attempted to bootstrap the attributes. The issue occurred because the names collided with existing names. The fix for this issue introduces a list of reserved words in Tamr. You cannot use these reserved words for unified attributes in Schema Mapping. For a list of reserved words for unified attributes, see "Configuration Changes" on this page.


v2019.022.0 Release Notes

What's New

  • Changed the lowest version from which you can upgrade directly to the current version of Tamr (without having to install intervening versions) from v.0.37.0 to v.0.40.1.
  • Increased the services startup timeout to three minutes from one minute in previous releases.
  • Switched to the OpenJDK 8 instead of the Oracle Java SDK in the Tamr installation package.
  • Added a new transformation statement, Sample.
  • Added Tamr configuration properties to support configuring password-protected transport authentication context in SAML v2. For more information, see SAML Authentication.
  • Renamed the tools.jar to transform-tools.jar. It is packaged under tamr/libs in the Tamr installation and lets you convert between legacy.hash() and hash().

Transformations

  • Added a new Sample statement to generate a sample of records with a uniform probability distribution.
  • The transform-tools.jar script is available under tamr/libs. Note that in the previous release, this script was named tools.jar and was not packaged with the installation. Use this script to convert instances of legacy.hash() in your projects to the hash() function. You can also use this script to temporarily replace all hash() functions in your transformation scripts with legacy.hash() until you are ready to start using hash(). For example, run: java -jar ./tamr/libs/transform-tools.jar function-replacer -u=admin --old=hash --new=legacy.hash --tamr_url="http://<IP HERE>:9100". For more information, see hash() and legacy.hash().

Configuration Changes

  • Increased the services startup timeout from one to three minutes.
  • Switched to the OpenJDK 8 instead of the Oracle Java SDK in the Tamr installation package. As a result, the default value of TAMR_JAVA_HOME is changed from <tamr-home-directory>/jdk1.8.0_111/ to <tamr-home-directory>/openjdk-8u222/.
  • Changed the lowest version from which you can upgrade to the current version of Tamr from v.0.37.0 to v.0.40.1. Previously, the lowest version from which you could upgrade directly to the current version without installing intervening versions was Tamr v.0.37.0. With this change, the lowest version that allows direct upgrades to the current version of Tamr (v.2019.022.0 and higher) is v.0.40.1. If you are upgrading from version 0.40.0 or earlier, upgrade to each individual version up to v0.40.1 and then upgrade directly to the current version.
  • Added Tamr configuration properties that support configuring password-protected transport authentication context in SAML v2. For more information, see SAML Authentication.
  • Changed the defaults for the configuration parameters TAMR_FS_CONFIG_URIS and TAMR_FS_CONFIG_DIR to use the configuration found in TAMR_HADOOP_HOME.
  • These changes have the following implications:
    • Changes to YARN/Spark configuration, such as TAMR_YARN_NODE_MANAGER_PORT, no longer require setting TAMR_FS_CONFIG_URIS.
    • If Tamr is deployed in an on-premises Hadoop installation that has some other Hadoop configuration present on the filesystem, such as core-site.xml and/or yarn-site.xml, Tamr's own Spark configuration is no longer ignored. Tamr now uses the configuration specified in TAMR_HADOOP_HOME by default.

Fixed Issues

  • Upgrades.
    • Fixed an issue where upgrade scripts timed out and prevented Tamr services from starting up after an upgrade. This fix removes support for upgrading directly to v2019.022.0 or greater from v0.40.0 or earlier and increases the services startup timeout to three minutes.
    • Fixed an issue where users were not showing on the Users page after an upgrade (affected Tamr v.2019.021.0 and Tamr v.2019.022.0).
    • Fixed an issue where high-impact pairs disappeared after an upgrade.
  • Fixed an issue in golden records tables where the columns for entity rule, sources, and cluster Id weren't rendering properly in the user interface.
  • Fixed an issue with expression rules and custom conditions in golden records where filters might not have been applied properly. This issue affected expression rules and custom conditions that contained expressions that did not return null when evaluated on null values. This caused the custom conditions and rules to pick the wrong values.
  • Fixed an issue where job submission was timing out because it was taking Tamr too long to calculate versions before submitting a job to Spark.
  • Fixed an issue where cluster actions failed when there were too many cluster members in the IN clause due to the database query limit.
  • Fixed a regression issue introduced in v.2019.021.0 where a job for generating pairs in a mastering project failed if you set a similarity function in the unified dataset to Hausdorff Distance or any similarity function other than Cosine or Jaccard.
  • Fixed a regression issue where filters, drop-down menus, and checkboxes did not work on the Pairs page on IE 11.
  • Fixed an issue where a job for updating golden records could not be started because the hosting Hadoop server used the YARN/Spark configuration present on the server and specified outside of Tamr. It ignored Spark settings that Tamr requires. With the fix, Tamr uses the Hadoop configuration specified in TAMR_HADOOP_HOME by default. Previously, to ensure that Hadoop configuration from Tamr is used, you had to specify this explicitly in TAMR_FS_CONFIG_URIS.


v2019.021.0 Release Notes

What's New

  • Transformations. Added MATH.TAN, MATH.ATAN, MATH.ATAN2, MATH.DEGREES, and MATH.RADIUS functions to the library of mathematical functions.
  • Added an ability to back up and restore a Tamr deployment in which datasets are stored in Google Cloud Platform (GCP) BigTable. You can back up and restore to a local disk, HDFS, and GCS. See Backup Configuration.

Mastering

  • Added sorting to record IDs in derived datasets (these datasets in mastering projects show statistics for clusters of records).

Transformations

  • Added MATH.TAN, MATH.ATAN, MATH.ATAN2, MATH.DEGREES, and MATH.RADIUS functions to the library of mathematical functions.
  • Added the USE HINT statement that applies a hint to the current transformation in the editor and to all of the subsequent transformations in that project. This statement might be useful to you if:
    • You have Tamr projects created before Tamr v.2019.014.1 that you are upgrading to this version (Tamr v.2019.021.0), and
    • You explicitly do not want Tamr to automatically manage your primary keys.

For example, to disable automatic primary key management by Tamr in a particular project, add: USE HINT(pkmanagement.manual); in the first transformation.

Note: The USE HINT statement is only useful if you have created projects before Tamr v.2019.014.1. In that version, automatic management of primary keys was introduced. If you have started using Tamr after v.2019.014.1, Tamr automatically manages primary keys and you don't ever need to turn this feature off, for any projects. For information about primary keys, see Primary Key Management.

Configuration

  • Enforced the use of HBase for dataset storage upon Tamr upgrade to this version. If you have migrated to Tamr v.2019.019.0, then the upgrade script that runs during an upgrade takes care of moving your datasets to HBase-backed storage. For information, see HBASE.
  • Added the ability to back up and restore Tamr from Google Cloud Platform (GCP) BigTable. Previously, you could back up and restore only Tamr deployments backed by HBase. Beginning with this release, if your Tamr datasets use GCP BigTable instead of HBase, you can back up and restore them. The GCP BigTable backup in Tamr creates SequenceFiles. Note that the HBase backup creates HBase snapshot files. You can back up to a local filesystem, HDFS, or GCS. See Backup Configuration.
  • Fixed the administration utility, admin-unify.sh, so that it no longer requires quotes when unsetting a custom parameter to an empty value in tamr_config.yml. Previously, it required specifying quotes for an empty string value, such as TAMR_JOB_SPARK_BIGQUERY_JAR: "". With the fix, to unset a custom parameter's value, you can omit quotes and leave the value empty (quotes are still allowed).


v2019.020.0 Release Notes

What's New

General

  • Implement fixed ratio Spark cores and memory allocation

| Total memory available for Spark | Driver memory | Executor JVM instances | Total memory available for executors | Executor cores |
|---|---|---|---|---|
| 19g or less | 3g, 1 core | 1 | Total available | Max available but no more than 1 core per 2g executor memory |
| 35g or less | 3g, 1 core | 2 | Total available/2 | As above |
| 51g or less | 3g, 1 core | 3 | Total available/3 | |
| Less than 67g | 3g, 1 core | 4 | Total available/4 | As above |
| 67g and greater | 3g, 1 core | 4 | 16g | As above |
| Unknown (remote Spark cluster) | 3g, 1 core | 4 | 16g | 8 |
| Less than 9g | 1g, 1 core | 1 | Total available | Max available but no more than 1 core per 1g executor memory |

  • Do not run local instance of Spark if remote Spark is used
    -- There is a new boolean configuration property called TAMR_REMOTE_SPARK_ENABLED. This property is false by default, and if set to true, start-dependencies.sh will not start a local Spark process.
    -- Important: This variable is not automatically set on upgrade, so if you are using a remote Spark cluster (Yarn, Dataproc, etc.), manually set this property to true.
  • Ability to cancel running Spark jobs
    -- You can now cancel submitted and running Spark jobs in addition to pending jobs.
    -- Cancellation is a best-effort, asynchronous action. The job might have succeeded or failed before the cancellation takes effect, which causes the cancellation to fail.
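
The fixed-ratio allocation table earlier in this section can be read as a simple lookup on the total memory available for Spark. The Python sketch below is one possible reading, not Tamr's code. It assumes that "Total available" means the memory left after the driver's share, and that the executors column gives memory per executor. Both assumptions are consistent with the row boundaries (19 − 3 = 16, 35 − 3 = 2 × 16, 51 − 3 = 3 × 16, 67 − 3 = 4 × 16) but are not stated in these notes. The names are hypothetical, and executor cores are left out.

```python
from typing import NamedTuple, Optional


class Allocation(NamedTuple):
    driver_memory_gb: float
    executor_instances: int
    executor_memory_gb: float  # per executor (assumption, see lead-in)


def spark_allocation(total_gb: Optional[float]) -> Allocation:
    """Map total Spark memory in GB to a row of the allocation table.

    Pass None for the "Unknown (remote Spark cluster)" row.
    """
    if total_gb is None or total_gb >= 67:
        # Largest tier and remote clusters: 4 executors capped at 16g each.
        return Allocation(3, 4, 16.0)
    if total_gb < 9:
        # Smallest tier: 1g driver and a single executor with the rest.
        return Allocation(1, 1, total_gb - 1)
    if total_gb <= 19:
        instances = 1
    elif total_gb <= 35:
        instances = 2
    elif total_gb <= 51:
        instances = 3
    else:
        instances = 4
    # Assumption: remaining memory is split evenly across executors.
    return Allocation(3, instances, (total_gb - 3) / instances)


small = spark_allocation(19)
medium = spark_allocation(35)
remote = spark_allocation(None)
```

Under these assumptions, each boundary value (19g, 35g, 51g) produces executors of exactly 16g, which matches the fixed 16g used for the largest tier.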

General Improvements and Major Bug Fixes

General

  • When opening the “Add new CSV” dialog in the “Add a new dataset” dialog on the “Datasets” page in Transformations, the file picker opens automatically.
  • Set the initial number of HBase regions to 1
  • Clarify description of unified dataset attributes on Schema Mapping page
  • Jobs running progress bar resized
  • Updated Unify log collection script

Mastering

  • Running an update results job no longer causes a write lock exception on the mastering dataset when another mastering job is already running.
  • The publish clusters job now lists the project it was run on in the Project column on the Jobs page.

Transformations

  • Transformations can be previewed on custom dataset samples.
    -- To configure this please follow the steps in this page: Setting a custom preview sample.
    -- Note: configuring this functionality is API-only (though the effects can be seen in the UI)


v2019.019.0 Release Notes

What's New

  • Added two transformation functions that compute the maximum and minimum value of an attribute or expression based on the bit size of the elements contained in a group. See "Transformations" in this document. 
  • ElasticSearch is upgraded to version 6.8.2.
  • HBase is required. Beginning with this release, HBase is required when you deploy Tamr, as we have stopped supporting deployments with datasets backed by a local filesystem. The transition period allowed you to turn off HBase and continue using a local filesystem for your dataset storage with Tamr. Beginning with this release and in the future releases, we strongly recommend that you enable HBase when you upgrade. For upgrading instructions, see the HBase section in this document.
  • SSL support with Postgres. You can connect to an external Postgres instance via SSL.
  • Tamr Python client v.0.9.0 is released.
  • Access control improvements.
  • Improvements in preferred sources in golden records.

Major Improvements

ElasticSearch 6.8.2 Upgrade

This release includes an upgrade to the internally-used Elasticsearch v.6.8.2 and is similar to the v.2019.002 Tamr upgrade when ElasticSearch was upgraded. A summary of the Tamr upgrade procedure is as follows:

  • The upgrade takes longer due to reindexing.
  • If you are upgrading from a much earlier version, we recommend that you create a backup, upgrade first to v0.52.0 and make sure your deployment is working properly, then create another backup and upgrade to v2019.019.0.
  • If you have stale projects (projects that you no longer need or that are in a broken state), we recommend that you bring these projects up-to-date or delete them before running an upgrade to Tamr v2019.019.0. The upgrade scripts may fail when trying to start reindexing jobs for projects in these states.
  • This note applies to you only if you used published clusters before upgrading to version 2019.019. The upgrade of ElasticSearch materializes datasets (this is also known as reindexing). In version 2019.017 the schema for internal derived datasets in published clusters changed. These internal derived datasets are not automatically indexed as part of the ElasticSearch upgrade. Therefore, if you haven't republished clusters after an upgrade to Tamr version 2019.017, upgrading to version 2019.019 may lead to errors with downstream jobs that rely on the changed attributes in the published cluster datasets. To avoid these errors, rerun the publish clusters API or use Publish after you upgrade.

HBase is Required

Beginning with this release v.2019.019, HBase is required when you deploy Tamr. The following statements describe this requirement:

  • Tamr initially adopted HBase as the default storage mechanism for datasets in version 2018.044 in 2018. Until the current release version 2019.019, you could continue deploying Tamr with datasets stored in their local filesystem, or running Tamr with datasets stored locally in tandem with datasets backed by HBase.
  • To achieve further performance improvements and allow scaling out Tamr deployments, you must migrate datasets to HBase. This is required for Tamr as of version v2019.011 and beyond in order to avoid known bugs and instance failures. In this release, we repeat this requirement, in preparation for enforcing it during upgrades.
  • The summary is:
    • Enable HBase.
    • After you upgrade, at any point of using Tamr v.2019.019, migrate datasets and prepare for the enforcement of HBase in future releases.

To enable HBase:

  1. Upgrade to Tamr version 2019.019. In some cases, such as when you are upgrading from versions prior to v.2019.08 to versions 2019.08 - 2019.020, the upgrade script may fail with an "HBase is not enabled" error. In this case, follow step 2 in this procedure to enable HBase, and rerun the upgrade.
  2. Set the TAMR_HBASE_ENABLED configuration variable in the admin-unify.sh admin tool to true. This also requires restarting Tamr and its dependencies. For information about using unify-admin.sh, see Configuring Tamr.

To migrate existing datasets:

  1. Verify that the TAMR_HBASE_ENABLED configuration variable in the admin-unify.sh admin tool is set to true. See Configuring Tamr.
  2. Convert existing datasets to HBase using the /datasets/moveDatasetsToStorageDriver API. For example, use this URL: http://<your-tamr-host>:9100/api/dataset/datasets/moveDatasetsToStorageDriver?destinationDriver=hbase or this command: curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' 'http://<your-tamr-host>:9100/api/dataset/datasets/moveDatasetsToStorageDriver?destinationDriver=hbase'. It starts a job that registers all datasets with HBase. When the job is running, it shows under the Jobs page. There is a known issue where the job's details obscure the X sign for closing the job window. You may need to zoom out of that screen to be able to close it. To see which datasets are stored in HBase, use /datasets/<name>/status.
  3. Set the TAMR_HBASE_AS_PRIMARY_STORAGE configuration variable in the admin-unify.sh admin tool to true. This ensures that going forward, all datasets are stored in HBase by default. See Configuring Tamr.
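
If you script the migration rather than running curl by hand, the request from step 2 can be assembled as in the following Python sketch. It uses only the standard library and mirrors the curl command above (same endpoint, query parameter, and headers). The function name and the example host are hypothetical, and authentication is omitted because the command above does not show it; your deployment may require credentials.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_move_to_hbase_request(host: str, port: int = 9100) -> Request:
    """Build (but do not send) the POST request that starts the migration job."""
    query = urlencode({"destinationDriver": "hbase"})
    url = (
        f"http://{host}:{port}"
        f"/api/dataset/datasets/moveDatasetsToStorageDriver?{query}"
    )
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    return Request(url, method="POST", headers=headers)


# Hypothetical host name; replace with your Tamr host.
request = build_move_to_hbase_request("tamr.example.com")
```

To submit the job, pass the request to `urllib.request.urlopen(request)`. Building the request separately makes it easy to inspect the URL and headers before anything is sent.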

Golden Records

  • Fixed an issue where in golden records, filters with Min and Max value conditions did not ignore empty values. Note: other rules and filters behave correctly and ignore empty values.
  • Fixed an issue with "ghost" datasets that weren't showing in all filters. "Ghost" input datasets are those that are no longer present in your project but were present previously. The fix ensures that all of the dataset filters in golden records list such ghost datasets. Previously, only some filters behaved correctly. You can also now delete such input datasets from the golden records rules.
  • You can apply the golden record rules for longest and shortest value to attributes with types other than String. Previously, the rules for longest and shortest value accepted only String array input columns. 
  • Improved the behavior of publishing golden records. When you publish golden records in this release, you also update the rules at the same time. This is reflected in the user interface: Publish is now renamed to Update and Publish. Previously, Publish would take your saved rules.
  • Fixed an issue in golden records where input datasets that were once added, then removed and added again weren't marked as New.
  • Fixed an issue with rule overrides not behaving correctly. Previously, adding or deleting an override changed the number of overrides for all other attributes to zero.

Access Control

Continued improvements and bug fixes to ensure the proper behavior of policies and access controls for users in Tamr. In particular, made the following changes:

  • You can run operations on user groups, such as deleting them, in the user interface. Previously, you could run some of the operations only in the APIs.
  • Renamed Permissions navigation menu to Policies for consistency.
  • Fixed an issue where the Projects page would not load unless the user had the "read all datasets" permissions.

Transformations

  • Added two new aggregation functions, max_size() and min_size(), that compute the maximum and minimum value of an attribute or expression based on the bit size of the elements contained in a group. The functions take any type of attribute (including arrays) and produce the value that has the maximum or minimum "bitsize value" among all of the input attributes considered. In these functions, the "bitsize value" is a measure of how much information, in bits, a data value represents. For Null values the "bitsize value" is always zero bits; for Boolean values it is always 1 bit; for Strings, it is the number of characters in the String; for Integer and Floating point values it is represented with 32-bit precision; and for Long and Double values it is represented with 64-bit precision. For example, 1.00 has a larger "bitsize value" than 1.0. For attributes of type Array, the shortest and the longest values are calculated as the sum of the element bit sizes and the precise bit size of the array's length. Note that empty values are filtered out. For examples, see max_size and min_size.

APIs

  • The PUT operation in the /v1/projects/id API allows changing the name of the project. The project name in the versioned API corresponds to the “display name”.
  • The POST and PUT operations in the /v1/datasets/id API allow adding tags at dataset creation time and updating them.

Configuration

  • Added a property TAMR_YARN_NODE_MANAGER_HOST to configure a custom name for the Yarn NodeManager hostname, yarn.nodemanager.hostname. See YARN Cluster Manager Jobs.
  • You can configure a custom YARN nodemanager UI port (instead of the default port) by setting TAMR_YARN_NODE_MANAGER_PORT using the Tamr administrative tool, unify-admin.sh.
  • Added a --dry-run parameter for setting values to Tamr. See Admin Tool Command Reference.
  • You can connect to an external Postgres instance via SSL. The database instance URL now allows specifying the following string to the JDBC driver: jdbc:postgresql://<host>:<port>/<database>?ssl=true. See Postgres.

Other Improvements

Made the following improvements:

  • ElasticSearch improvements. Upgraded ElasticSearch from 6.4.2 to 6.8.2. As a result, the upgrade to this version takes longer because of reindexing. The ElasticSearch upgrade enabled improved performance of mastering projects, and established a more stable ElasticSearch snapshot/restore process with HDFS. The upgraded ElasticSearch version 6.8.2 is also compatible with Google Cloud Platform requirements, and is required for enabling ongoing cloud provider integrations.
  • Performance improvements. Dataset and Dataset Catalog pages load faster.
    Improved performance of Tamr mastering workflows by tuning Spark joins configuration.
  • User interface improvements. Renamed Upload CSV to Upload File to account for the fact that Tamr allows adding datasets other than CSV. Fixed the behavior of the About screen.
  • Fixed an issue with upgrades where reindexing of the cluster locks failed if an empty project existed. This was a known issue in v.2019.016.
  • Fixed an issue where previewing a dataset did not work in the project's Datasets tab, after uploading and adding a dataset to a project. This was a known issue in Tamr version 2019.017.
  • Added logging about database migrations to log files and stdout.
  • Fixed an issue with the Tamr administrative tool that did not allow specifying relative paths when setting configuration from a file, such as: admin-unify.sh config:set --file <path-to-file>.
  • The datasets page now loads faster.
  • Fixed an issue where longitude and latitude were not showing for polygons in the user interface. 
  • Fixed an issue where Tamr was previously making calls to async-io.org on startup. In previous releases, the messaging framework that Tamr uses internally made these calls.
  • User interface improvements:
    • The Mastering user interface now indicates when there are no labels. Previously, the confusion matrix was empty in this case.
    • The list of comments automatically scrolls to the top of the list to show the newest comments first when you add them in Categorization, Pairs, and Clusters pages.
    • The user interface reflects that the "Loading datasets" process is in progress when you load datasets on the Dataset Catalog page.
    • The message indicating that the dataset is out-of-date in the navigation bar is now located to the side of menu controls and not in between them.

Support Tickets

Fixed the following support tickets:

  • Fixed an issue where a job for updating the unified dataset was stalled. 
  • Fixed an issue where pair assignments disappear and human-assigned labels become "verified by API" after regenerating pairs. The issue was found in version 2019.012 and fixed in version 2019.019.
  • Fixed an issue where authorization did not work for users with mixed-case usernames.
  • Fixed an issue where Tamr would not restart due to an inconsistent YARN jobs state. This issue affects Tamr versions 2019.014 - 2019.016 and is fixed in v.2019.017. Contact Tamr Support to obtain a patch for affected versions, or upgrade to version 2019.017 or greater.
  • Fixed an issue where connecting to an external Postgres instance via a JDBC driver did not allow configuring secure access via SSL. The fix enables specifying SSL in the database instance URL for the JDBC driver as follows: jdbc:postgresql://<host>:<port>/<database>?ssl=true.
  • Fixed an issue where jobs could not start, or running jobs caused a "Waiting for resources" error due to issues with port contention for the YARN node manager. The fix enables you to specify your own port numbers in YARN. For more information, see YARN Cluster Manager Jobs.
