2020 Tamr Core Release Notes
These release notes describe new features, improvements, and corrected issues in each Tamr Core 2020 release.
See Tamr Core Release Notes for important information for all releases, including upgrade instructions and checkpoint releases.
Tamr Core 2020 Releases
- v2020.026.0
- v2020.025.0
- v2020.024.3 patch
- v2020.024.2 patch
- v2020.024.1 patch
- v2020.024.0
- v2020.023.1 patch
- v2020.023.0
- v2020.022.0
- v2020.021.0
- v2020.020.4 patch
- v2020.020.3 patch
- v2020.020.2 patch
- v2020.020.1 patch
- v2020.020.0
- v2020.019.0
- v2020.018.0
- v2020.017.0
- v2020.016.7 patch
- v2020.016.6 patch
- v2020.016.5 patch
- v2020.016.4 patch
- v2020.016.3 patch
- v2020.016.2 patch
- v2020.016.1 patch
- v2020.016.0
- v2020.015.0
- v2020.014.0
- Earlier releases in 2020
v2020.026.0 Release Notes
What's New
For cloud-native deployments of Tamr Core on Azure, this release adds support for ADLS Gen2. It also supports using service principals instead of storage account keys with ADLS Gen2. The following Tamr Core configuration variables are now available, with the link provided for the Microsoft Azure documentation that describes how to obtain the value for each one.
- TAMR_ADLS_GEN2_CLIENT_ID: see Get tenant and app ID values for signing in.
- TAMR_ADLS_GEN2_CLIENT_SECRET: see Option 2: Create a new application secret.
- TAMR_ADLS_GEN2_TENANT_ID: see Get tenant and app ID values for signing in.
See the Configuration Variable Reference.
This release removes the TAMR_ADLS_GEN2_KEY configuration variable.
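As a quick sanity check when moving from a storage account key to a service principal, a script along the following lines can confirm that the three new variables are set and that the removed key is gone. This is a minimal sketch, not a Tamr Core utility: it assumes the configuration is available as a plain name-to-value mapping (environment variables are used here only for illustration), and the helper name check_adls_gen2_config is hypothetical.

```python
import os

# Variable names are taken from these release notes.
REQUIRED = (
    "TAMR_ADLS_GEN2_CLIENT_ID",
    "TAMR_ADLS_GEN2_CLIENT_SECRET",
    "TAMR_ADLS_GEN2_TENANT_ID",
)
REMOVED = ("TAMR_ADLS_GEN2_KEY",)


def check_adls_gen2_config(config):
    """Return a list of problems found in `config`; an empty list means OK.

    `config` is any mapping of variable name to value (hypothetical input;
    how your deployment stores Tamr Core configuration may differ).
    """
    problems = []
    for name in REQUIRED:
        if not config.get(name):
            problems.append(f"missing: {name}")
    for name in REMOVED:
        if name in config:
            problems.append(f"no longer supported: {name}")
    return problems


if __name__ == "__main__":
    # Illustration only: read the values from the process environment.
    for problem in check_adls_gen2_config(dict(os.environ)):
        print(problem)
```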
This release also adds:
- Functions for Log Transformations. See Mathematical Functions.
- A security improvement.
v2020.025.0 Release Notes
This release contains minor updates that improve the experience of using Tamr Core.
v2020.024.3 Patch Release Notes
This patch addresses the following Apache Log4j vulnerabilities by updating Tamr Core to use Apache Log4j version 2.17.0:
- Apache Log4j CVE-2021-45105
- Apache Log4j CVE-2021-45046
- Apache Log4j CVE-2021-44228
For full details regarding these vulnerabilities and Tamr Core, refer to Tamr's Updates on Apache Log4j Vulnerabilities article.
This patch fully remediates these three vulnerabilities in Tamr Core and Elasticsearch. Install this patch regardless of whether you have taken any of the remediation steps in the article referenced above.
v2020.024.2 Patch Release Notes
This patch release corrects an issue with restore remaining in “running” state longer than expected after upgrade.
v2020.024.1 Patch Release Notes
This patch release corrects the following issues.
- Failure to create snapshot diff component should not cause the entire planning to fail
- When using Configure table on the Clusters page in a mastering project, changes to the visibility and positioning of the Cluster, Dataset, origin_entity_id, and tamr_id columns are now saved and applied as expected.
For cloud-native deployments of Tamr Core on Azure, this patch release also allows for the use of service principals instead of storage account keys with ADLS Gen2. The following configuration variables are now available, with links to the Microsoft Azure documentation that describes how to get the value for each one.
- TAMR_ADLS_GEN2_CLIENT_ID: see Get tenant and app ID values for signing in.
- TAMR_ADLS_GEN2_CLIENT_SECRET: see Option 2: Create a new application secret.
- TAMR_ADLS_GEN2_TENANT_ID: see Get tenant and app ID values for signing in.
See the Configuration Variable Reference.
This patch also removes the TAMR_ADLS_GEN2_KEY configuration variable.
v2020.024.0 Release Notes
Fixed Support Issues
This release corrects the following errors.
- Tamr instance unusably slow to load pages. Affects versions: v2020.016.2. Fix versions: v2020.024.0, v2020.020.2.
- Bulk match API returns no records, but matches are found and written to file system (AWS). Affects versions: v2020.013.0, v2020.020.0. Fix versions: v2020.024.0, v2020.020.1.
- get.projects() broken on v2020.021.0 with enrichment projects. Affects versions: v2020.021.0. Fix versions: v2020.024.0.
- Generating pairs processing time from 5 min to 5 hrs. Affects versions: v2020.021.0. Fix versions: v2020.023.0, v2020.024.0, v2020.020.1.
v2020.023.1 Patch Release Notes
This patch corrects the issue:
- UI Issue: Adding new blocking model clauses with no tokenizer option is throwing a token weighting error. The API is unaffected.
v2020.023.0 Release Notes
New Features and Improvements
The following new features are included in this release.
- Add support for localCheckpoint. When you use a CHECKPOINT statement, you now have the option to include a HINT modifier to specify checkpoint.reliable or checkpoint.local as the Spark store behavior. See Checkpoint and Statement Modifiers.
- Option to not use IDF weighting when computing similarity scores. See Tokenizers and Similarity Functions.
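To make the IDF option concrete, the sketch below contrasts an unweighted token-overlap score with an IDF-weighted one. It is a generic illustration of inverse document frequency weighting applied to Jaccard similarity, not Tamr Core's implementation; the function names are hypothetical, and the similarity functions Tamr Core actually provides are described in Tokenizers and Similarity Functions.

```python
import math


def idf_weights(documents):
    """Map each token to log(N / document_frequency) over a list of token sets."""
    n = len(documents)
    document_frequency = {}
    for tokens in documents:
        for token in set(tokens):
            document_frequency[token] = document_frequency.get(token, 0) + 1
    return {t: math.log(n / c) for t, c in document_frequency.items()}


def jaccard(a, b, weights=None):
    """Jaccard similarity of two token collections.

    With weights=None every token counts as 1 (no IDF weighting).
    With a weights mapping, each token contributes its weight instead,
    so rare tokens matter more than common ones.
    """
    a, b = set(a), set(b)
    union = a | b
    if weights is None:
        weight = lambda token: 1.0
    else:
        weight = lambda token: weights.get(token, 0.0)
    denominator = sum(weight(t) for t in union)
    if denominator == 0:
        return 0.0
    return sum(weight(t) for t in a & b) / denominator


# Hypothetical corpus: "inc" is common, so sharing it says little.
corpus = [{"acme", "inc"}, {"acme", "corp"}, {"globex", "inc"}, {"initech", "inc"}]
weights = idf_weights(corpus)
unweighted = jaccard({"acme", "inc"}, {"globex", "inc"})
weighted = jaccard({"acme", "inc"}, {"globex", "inc"}, weights)
```

In this example the two records share only the common token "inc", so the IDF-weighted score is lower than the unweighted one. Turning IDF weighting off treats every shared token equally.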
Fixed Support Issues
This release corrects the following error.
- Spark instance type override is not picked up and used when submitting jobs. Affects versions: v2020.018.0. Fix versions: v2020.023.0.
v2020.022.0 Release Notes
New Features and Improvements
The following new features are included in this release.
- Min distance between shapes as fully supported geospatial metric.
- Relative area overlap similarity function and blocking.
- Show User Defined Signal output as columns in dedup record pairs table. See user-defined signals.
Fixed Support Issues
This release corrects the following errors.
- gis.centroid transformation function should not double-count start/end point of a polygon boundary in calculation. Affects versions: v2020.004.1. Fix versions: v2020.022.0.
- Convex overlap for polygons as a non-DNF comparator function. Affects versions: v2020.015.0. Fix versions: v2020.022.0.
- Min distance between shapes as fully supported geospatial metric. Affects versions: v2020.015.0. Fix versions: v2020.022.0.
v2020.021.0 Release Notes
New Features and Improvements
The following new features are included in this release.
- Add a script to turn derived datasets associated with a deleted project into source datasets. For more information, see Utilities for Validation and System-Wide Processes.
Fixed Support Issues
This release corrects the following errors.
- Update the UI to accommodate the ability to infer pairs from cluster feedback. Affects versions: v2020.005.0. Fix versions: v2020.021.0.
v2020.020.4 Patch Release Notes
This patch corrects the following issues.
- Datasets appear to have a smaller number of records than expected in deployments that use HBase.
v2020.020.3 Patch Release Notes
This patch corrects the following issues.
- Bulk match API intermittently failing on AWS with "AmazonS3Exception Slow Down"
- Cannot get reliable pair estimate at scale on AWS due to SlowDown Exception
v2020.020.2 Patch Release Notes
This patch corrects the following issues.
- TAMR_JOB_SPARK_CONFIG_OVERRIDES not getting picked up correctly
- Tamr instance unusably slow to load pages
- RecordMatchService using paths instead of full URIs
- WINDOW transformation GC explodes at scale
v2020.020.1 Patch Release Notes
This patch corrects the following issues.
- Corrects the issue: Upgrade from 2020.004.1 to 2020.016.3 Succeeded but Materialize Unified Dataset Jobs fail.
- Corrects the issue: Spark task times highly skewed when reading HBase on EMR
v2020.020.0 Release Notes
What's New
Tamr Core now supports upload of files in the Parquet file format from an external HDFS cluster, S3, or GCS. For more information, see Uploading a Dataset into Tamr.
New Features and Improvements
The following new features are included in this release.
- Create a separate endpoint to get records for dedup service.
- match.log is too chatty.
- BigQuery: get datasets and tables to be sorted.
- Allow pulling hbase configuration files from the Tamr data dir.
- UI improvements when connecting external sources.
Fixed Support Issues
This release corrects the following errors.
- Configure Table button missing on Unified Dataset Preview. Affects versions: v2020.018.0. Fix versions: v2020.020.0.
- Add dataset UI does not scroll down when using advanced CSV options. Affects versions: v2020.018.0. Fix versions: v2020.020.0.
- LLM no longer works on GCP Native with BigTable. Affects versions: v2020.016.3. Fix versions: v2020.020.0.
- Transformation does not show up after deleting a unified attribute. Affects versions: v2020.008.0, v2020.012.0. Fix versions: v2020.020.0.
- Upload New Dataset Modal Broken. Affects versions: v2020.017.0, v2020.016.1. Fix versions: v2020.020.0.
- Support for parquet external files in UI. Affects versions: v2020.017.0. Fix versions: v2020.020.0.
v2020.019.0 Release Notes
These release notes list what's new in this release, corrected issues, and known issues.
What's New
In this release, you can:
- Review precision and recall metrics for clusters on the Clusters page of your mastering projects. In addition to the computed percentages, a trend graph with confidence intervals shows changes in the accuracy of your clusters over time. Unlike the in-sample metrics computed for record pairs, cluster metrics are computed using a test set of records. See Precision and Recall Metrics for Clusters.
- Open cluster metrics (after they have been computed for your project) by clicking View Cluster Accuracy in the dialog box for the confusion matrix and in-sample metrics on the Pairs page. See Viewing In-Sample Pair Metrics.
- Review documentation for the transformation-tools.jar system administration utility.
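For readers unfamiliar with the terms, the sketch below shows one common way to compute precision and recall for a clustering against a verified test set: compare the record pairs that each clustering places together. This is a general pairwise definition offered as an assumption for illustration; the exact method Tamr Core uses is described in Precision and Recall Metrics for Clusters, and the function names here are hypothetical.

```python
from itertools import combinations


def cluster_pairs(assignment):
    """Return the set of unordered record pairs that share a cluster.

    `assignment` maps record ID to cluster ID.
    """
    members_by_cluster = {}
    for record_id, cluster_id in assignment.items():
        members_by_cluster.setdefault(cluster_id, []).append(record_id)
    pairs = set()
    for members in members_by_cluster.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pairwise_precision_recall(predicted, verified):
    """Precision and recall of `predicted` clusters against `verified` ones.

    precision = correct predicted pairs / all predicted pairs
    recall    = correct predicted pairs / all verified pairs
    By convention here, an empty denominator yields 1.0.
    """
    predicted_pairs = cluster_pairs(predicted)
    verified_pairs = cluster_pairs(verified)
    true_positives = len(predicted_pairs & verified_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 1.0
    recall = true_positives / len(verified_pairs) if verified_pairs else 1.0
    return precision, recall


# Hypothetical test set of four records.
predicted = {"r1": "A", "r2": "A", "r3": "A", "r4": "B"}
verified = {"r1": "X", "r2": "X", "r3": "Y", "r4": "Y"}
precision, recall = pairwise_precision_recall(predicted, verified)
```

Here the predicted clustering over-merges r3 into the first cluster, which lowers precision, and misses the r3/r4 match, which lowers recall.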
New Features and Improvements
The following new features are included in this release.
- Optimize incremental clustering.
- Optimize incremental pair generation.
- Support AES256 server-side encryption for s3 external storage providers.
- Support AES256 Server Side Encryption for S3.
- Improve connection checking in tasq.
- Maintain lineage information for pair labels and allow filtering by lineage type.
- Pairwise Accuracy UI Changes.
- Cluster Accuracy Modal.
- CloudSQL Backup.
- Improvements to function docs example table formatting.
Fixed Support Issues
This release corrects the following errors.
- External storage provider linked dataset not getting ingested. Fix versions: v2020.019.0.
- Upload 400 error says to check logs, but error isn't in the logs. Fix versions: v2020.019.0.
- Documentation now available for transformation-tools.jar. Fix versions: v2020.019.0.
- Transform docs with long examples hard to interpret. Fix versions: v2020.019.0.
v2020.018.0 Release Notes
These release notes list what's new in this release, corrected issues, and known issues.
What's New
In this release these changes were made:
- Deployment on Microsoft Azure. See Deploying Tamr on Azure.
- Test clusters produce cluster precision and recall. See Filtering Clusters.
- A predefined list of datetime formats is now supplied for the datetime_to_iso and date_and_time_to_iso functions. These formats are applied after any formats specified in your transformations. See Working with Dates.
Fixed Support Issues
The following support issues were fixed in this release.
- Golden Records: Error using top: The 1st argument to function top must be a constant literal expression. Affects versions: v2020.011.0. Fix versions: v2020.018.0.
- Bulk unmap button in Schema Mapping does not work. Affects versions: v2020.012.0, v2020.014.0, v2020.016.0. Fix versions: v2020.017.0, v2020.018.0, v2020.016.1.
- Golden Record Page keeps freezing. Affects versions: v2020.017.0. Fix versions: v2020.018.0.
- UI doesn't show average confidence. Affects versions: v2019.023.1, v2020.016.1, v2020.016.2. Fix versions: v2020.018.0, v2020.016.3.
- "Open details" button is not clickable on taxonomy page. Affects versions: v2020.015.0. Fix versions: v2020.018.0.
- Train predict failure. Affects versions: v2020.016.0. Fix versions: v2020.018.0, v2020.016.2.
- Jobs failing possibly due to incorrect path setting with spark 2.4. Affects versions: v2020.016.0. Fix versions: v2020.018.0, v2020.016.2.
- Some links that open the Tamr UI in a new tab cause the original tab to become unresponsive on closing the new tab.
- Mapped/unmapped filters on unified attributes are broken.
- Average confidences for categorizations not showing for systems upgraded from v37.1.
v2020.017.0 Release Notes
What's New
In this release these changes were made:
- You can view the cluster from the golden records page.
- The process of restoring from backup is improved and retains the PostgreSQL configuration from the pre-upgrade release.
Fixed Support Issues
The following support issues were fixed in this release.
- Fixed an issue where unmapping an attribute in the schema mapping project does not work. Affects versions: v2020.012.0, v2020.014.0, v2020.016.0. Fix versions: v2020.017.0, v2020.018.0, v2020.016.1.
- Provided ability to view the clusters from the golden records page. Affects versions: v2019.021.0. Fix versions: v2020.017.0.
- Fixed an issue where Generate pairs and Open exclusions were not displayed correctly in the user interface. Affects versions: v2020.015.0. Fix versions: v2020.017.0.
- Fixed an issue where Accept cluster suggestion resulted in the "Cannot read property 'name' of undefined" error.
- Fixed an issue where the Elasticsearch configuration parameter TAMR_ES_MAX_RESULT_WINDOW was not applied to new projects and required manual workarounds. Affects versions: v2019.009.0. Fix versions: v2020.017.0.
- Fixed an HTTP 500 error issued by Elasticsearch if no clusters were generated. Affects versions: v2020.016.0. Fix versions: v2020.017.0, v2020.016.1.
- Fixed an issue where an upgrade to Tamr v2020.16 resulted in errors looking for non-existent clusters. Affects versions: v2020.016.0. Fix versions: v2020.017.0, v2020.016.1.
- Fixed an issue where the bulk match service didn't work on an AWS scaled out deployment. Affects versions: v2020.007.1. Fix versions: v2020.017.0.
v2020.016.7 Patch Release Notes
This patch addresses the following Apache Log4j vulnerabilities by updating Tamr Core to use Apache Log4j version 2.17.0:
- Apache Log4j CVE-2021-45105
- Apache Log4j CVE-2021-45046
- Apache Log4j CVE-2021-44228
For full details regarding these vulnerabilities and Tamr Core, refer to Tamr's Updates on Apache Log4j Vulnerabilities article.
This patch fully remediates these three vulnerabilities in Tamr Core and Elasticsearch. Install this patch regardless of whether you have taken any of the remediation steps in the article referenced above.
v2020.016.6 Patch Release Notes
This patch introduces a new configuration variable, TAMR_MAX_EDGES_PER_PARTITION, which allows you to tune clustering performance.
v2020.016.5 Patch Release Notes
This patch improves Tamr Core UI performance.
v2020.016.4 Patch Release Notes
This patch corrects an issue in which an upgrade from 2020.004.1 to 2020.016.3 succeeded, but Materialize Unified Dataset jobs failed.
v2020.016.3 Patch Release Notes
This patch contains the following fixed bugs and support issues.
- Cannot run jobs - GCP Native - NoSuchMethodError.
- Profiling jobs stuck in 'waiting for results' despite no other jobs running or failed.
- UI doesn't show average confidence.
v2020.016.2 Patch Release Notes
This patch contains the following fixed bugs and support issues.
- Train predict failure.
- Jobs failing possibly due to incorrect path setting with spark 2.4.
v2020.016.1 Patch Release Notes
This patch contains the following fixed bugs and support issues.
- Fixed an issue where, if no clusters had been generated yet for a mastering project, the project issued an error in the user interface.
- Fixed an issue where the unmapped/mapped filters for unified attributes were not working.
- Fixed an issue where bulk un-mapping of attributes in Schema Mapping was not working.
v2020.016.0 Release Notes
What's New
In this release, the following notable changes took place:
- Tamr Core v2020.16 is a checkpoint release. If you are upgrading from an earlier version to a version later than v2020.16, you must first upgrade to v2020.16. The upgrade utility prevents you from upgrading past v2020.16 without first upgrading directly to v2020.16. For example, these upgrade paths are prevented: v2020.03 -> v2020.07, v2020.15 -> v2020.17. The following upgrade paths are allowed: v2020.03 -> v2020.15, v2020.03 -> v2020.16, v2020.15 -> v2020.16.
- After you upgrade to v2020.016, if you write new LOOKUP statements with non-equality join conditions, or remove the hint which the upgrade process added to existing LOOKUP statements of this type, Tamr Core assigns primary keys to resulting records automatically. For LOOKUP statements with non-equality join conditions that existed before you upgrade, the upgrade process disables automatic assignment of primary keys. This change during the upgrade process (disabling automatic assignment of primary keys) does not affect LOOKUP statements with equality joins. Additionally, you can choose to disable automatic assignment of primary keys altogether. For more information, see Lookup, Primary Key Management, and Upgrading Tamr.
- Passwords for non-administrative users must contain a minimum of 8 and a maximum of 64 characters. Passwords for newly created admin users also have this requirement. When you create such users or make changes to existing users in Tamr Core, these password requirements are enforced and Tamr Core issues an error if they are not met. For more information, see Creating a User.
- The user interface for editing rules for golden records now allows you to save rules that are invalid. This is also known as a "forced save". This helps you save rules midway, while you continue working on them. It also helps in cases where you might need to force a deletion of one rule that refers to a non-existing attribute in an upstream dataset, so that you can then delete another rule that refers to another non-existing attribute. Previously, you could only fix such issues with internal APIs for golden records.
- Made changes to the user interface of the Clusters page. The records summary header now shows up directly above the table that lists records. Additionally, you can use a new Accept suggestion option directly from the record details side panel to add a record to a specific cluster that Tamr Core suggests. Tamr Core offers the ID of this new suggested cluster. Previously, if Tamr Core suggested to move a record to a new cluster, using Move to new did not allow moving it in one step. See Curating Clusters.
- Added documentation for Tokenizers and Similarity Functions. The documentation now uses the industry term Blocking Model instead of the Blocking Model, which was the previously used term.
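If you create users from a script, the password length rule described above can be checked before the request is sent. The following is a minimal sketch of that rule only; the function name is hypothetical, and Tamr Core itself remains the authority on whether a password is accepted.

```python
# Limits stated in these release notes.
MIN_PASSWORD_LENGTH = 8
MAX_PASSWORD_LENGTH = 64


def password_length_ok(password):
    """Return True if the password has between 8 and 64 characters, inclusive."""
    return MIN_PASSWORD_LENGTH <= len(password) <= MAX_PASSWORD_LENGTH
```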
Fixed Issues
- Added a script that disables primary key management for existing LOOKUP statements with non-equality join conditions. The script runs automatically during upgrades and issues a report of the changes it made.
- Enabled automatic primary key management for new LOOKUP statements with non-equality join conditions. For more information, see LOOKUP.
- Fixed an issue where the unify-admin.sh utility allowed input of settings with incorrect syntax and this broke Zookeeper. Affects versions: v2020.012.0. Fix versions: v2020.016.0.
- Fixed an issue where you could not delete a rule in golden records on an attribute that no longer existed if there were other attributes that no longer existed. Affects versions: v2019.023.2. Fix versions: v2020.016.0.
v2020.015.0 Release Notes
What's New
In this release, the following notable changes were made:
- Upgraded the version of Spark used by Tamr Core to Spark 2.4.5. Starting with this release, Tamr Core uses Spark 2.4.5 instead of Spark 2.2.0 that it used in previous releases. The upgrade to Spark 2.4.5 takes place automatically as you upgrade to this release. For more information, see Upgrading Tamr.
- Added a new aggregation function, histogram, to the list of Tamr Core transformation functions. It computes the histogram for the top-n most frequent values per group, sorting values in descending order of frequency. Use the histogram function together with WINDOW and GROUP statements. It supports vararg (array flattening) and complex types.
- Warn users when project datasets are about to be removed in the dataset catalog.
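The behavior described for the histogram function (top-n most frequent values per group, in descending order of frequency) can be pictured with the plain Python sketch below. It only illustrates the idea; it is not Tamr Core's transformation syntax, and the function name, input shape, and tie-breaking are assumptions.

```python
from collections import Counter, defaultdict


def top_n_histogram(rows, n):
    """Count values per group and keep the n most frequent in each group.

    `rows` is an iterable of (group_key, value) tuples (hypothetical input).
    Returns {group_key: [(value, count), ...]} sorted by descending count.
    """
    counts = defaultdict(Counter)
    for group_key, value in rows:
        counts[group_key][value] += 1
    return {group: counter.most_common(n) for group, counter in counts.items()}


rows = [("US", "a"), ("US", "b"), ("US", "a"), ("DE", "c")]
result = top_n_histogram(rows, 1)
```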
Fixed Issues
- Fixed an issue where de-selecting a dataset from the dataset catalog did not alert the users that the dataset was about to be removed. Affects versions: v2019.023.1. Fix versions: v2020.015.0.
- Fixed an issue where page information on the Datasets window was not extending properly. Affects versions: v2020.009.0. Fix versions: v2020.015.0.
- Fixed an issue with rules in golden records where deleting a newly created, unsaved golden records rule deleted a random other rule.
- Fixed an issue where HBase and ZooKeeper communication failed, putting the HBase server on the "failed servers list", rendering the Tamr instance broken.
- Fixed an issue where changing an input dataset schema broke previewing of golden records.
- Fixed an issue where deleting an attribute in an upstream dataset broke the job for updating golden records.
v2020.014.0 Release Notes
What's New
In this release, the following notable changes were made:
- Released version 0.12.0 of the Python tamr-client. For more information, see tamr-client 0.12.0 and Tamr Client documentation.
- API changes. Added a new parameter, expectedVersion, to the POST /datasets/{name}/update endpoint in the dataset service, to allow consistent dataset updates.
- Usability and design improvements.
- Observe in the tooltip that profiling value counts are estimates, when examining results of profiling a dataset.
- Use the Rules tab, when working with the golden records project as a curator.
- Performance and configuration improvements. Take advantage of improved HBase performance when processing Tamr jobs.
Fixed Issues
- Fixed an issue where the "read only" permissions on /tmp prevented Tamr dependencies from starting. Affects versions: v2020.008.1. Fix versions: v2020.014.0.
- Fixed an issue where Tamr instance was broken after deleting an input dataset. Affects versions: v2020.008.1. Fix versions: v2020.014.0.
- Fixed an issue where deleting the unified dataset caused Tamr to throw a null pointer exception. Affects versions: v2020.008.0. Fix versions: v2020.014.0.
- Fixed an issue with profile value counts to indicate that they are estimates. Affects versions: v2020.009.0. Fix versions: v2020.014.0.
- Fixed an issue in the golden records project where it did not show the Rules page to curators. Affects versions: v2020.006.0. Fix versions: v2020.014.0.
- Fixed an issue where Postgres Prometheus configuration used HOST_IP instead of TAMR_POSTGRES_HOSTNAME.
- Added TAMR_PERSISTENCE_EXPORTER_USER and TAMR_PERSISTENCE_EXPORT_PASS to the configuration definitions.
Earlier Releases in 2020
- v2020.013.0
- v2020.012.1 patch
- v2020.012.0
- v2020.011.0
- v2020.010.1 patch
- v2020.010.0
- v2020.009.0
- v2020.008.1 patch
- v2020.008.0
- v2020.007.1 patch
- v2020.007.0
- v2020.006.0
- v2020.005.0
- v2020.004.4 patch
- v2020.004.3 patch
- v2020.004.2 patch
- v2020.004.1 patch
- v2020.004.0
- v2020.003.0
- v2020.002.0
v2020.013.0 Release Notes
What's New
In this release, you can:
- Rely on faster running Spark jobs due to HBase configuration improvements.
- Avoid dataset errors when updating or publishing golden records due to improved dataset validation checks.
- Collect logs for a specified time period using a new flag on the collect-logs.sh script.
Improvements and Changes
- Upgraded versions of Grafana to 6.3.4 and Kibana to 5.6.16. Affects versions: v2020.004.0. Fix versions: v2020.013.0, v2020.004.2.
- HBase. Stopped blocking new jobs while HBase rollback is in progress.
- HBase. Adjusted the buffer to store enough records for sorting streaming updates to HBase.
- Allowed LLM and Bulk Matching on projects with Mastering functions and user-defined signals. Affects versions: v2020.002.0. Fix versions: v2020.013.0.
Fixed Issues
The following issues were fixed in this release.
- Updated collect-logs.sh to accept an age field to enable collecting logs for only a certain number of days. Affects versions: All. Fix versions: v2020.013.0.
- Fixed the log pruning scripts to set dependencies correctly.
- Fixed an issue where the user policy management dialog deselected datasets as you paginate.
- Fixed an issue where you could not edit project and dataset user policies without deselecting other datasets in the policy. Affects versions: All. Fix versions: v2020.013.0.
- Fixed an issue in working with geospatial data, where displaying multi-point data in a Leaflet map caused the user interface to blank out. Affects versions: v2020.004.1. Fix versions: v2020.013.0, v2020.004.2.
- Fixed an issue where you could not import pair labels when pre-grouping feature was enabled. Affects versions: v2020.009.0. Fix versions: v2020.013.0.
- Fixed an issue where removing a source dataset did not remove it from pair exclusions in the internal configuration. Affects versions: v2020.004.1. Fix versions: v2020.013.0.
- Made estimate pairs sampling configurable in internal interfaces. Affects versions: All. Fix versions: v2020.013.0.
- Reduced indexing unneeded internal datasets when Elasticsearch is disabled. Affects versions: v2020.004.1. Fix versions: v2020.013.0.
Documentation Changes
Beginning with Tamr Core v2020.013.0, documentation versions available at docs.tamr.com are listed as ranges of versions.
- Documentation version ranges map to the development releases contained within the range.
- For example, Tamr Core documentation version 2020.13.0-2020.16.0 maps to four consecutive development releases.
- The documentation for versions in the range is updated in place and republished.
- The release notes for each development release continue to be published.
- For information about deltas between individual development releases, see the release notes for each development release.
- The documentation version scheme differs slightly from the development version scheme in that it does not use leading zeros in its numbers. For example, Tamr development version 2020.013.0 is represented as the documentation version 2020.13.0 (the leading zero in front of 13 is dropped). This is by design, and the two version notations map to each other.
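The mapping between the two notations amounts to dropping leading zeros from each dotted component. A minimal sketch, assuming version strings of the form shown in these notes (optionally prefixed with "v"); the function name is hypothetical:

```python
def to_doc_version(dev_version):
    """Convert a development version to documentation notation.

    Strips an optional leading "v" and removes leading zeros from each
    dot-separated component, for example "2020.013.0" to "2020.13.0".
    """
    components = dev_version.lstrip("v").split(".")
    return ".".join(str(int(component)) for component in components)
```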
v2020.012.1 Patch Release Notes
This patch addresses the following Apache Log4j vulnerabilities by updating Tamr Core to use Apache Log4j version 2.17.0:
- Apache Log4j CVE-2021-45105
- Apache Log4j CVE-2021-45046
- Apache Log4j CVE-2021-44228
For full details regarding these vulnerabilities and Tamr Core, refer to Tamr's Updates on Apache Log4j Vulnerabilities article.
This patch fully remediates these three vulnerabilities in Tamr Core and Elasticsearch. Install this patch regardless of whether you have taken any of the remediation steps in the article referenced above.
v2020.012.0 Release Notes
What's New
In this release, you can:
- On the Details section of the Pairs and Clusters pages, expand the details to see long string values for record attributes. Choose Show more or Show less to see the attribute details. You can also see the number of members in an array, for attributes of type array.
- In the Golden Records project, review reported errors and fix them before proceeding with the project. This is useful if you load golden records datasets programmatically using the APIs. In this case, Tamr Core validates your golden records dataset against the input records and clusters.
- Use Postgres v12. This version of Postgres is required beginning with Tamr Core v2020.012.0 (this version). You can upgrade to Postgres v12 even before you upgrade to Tamr v2020.012.0. To upgrade Postgres, stop Tamr Core, upgrade Postgres using tools specific to your operating system, run the upgrade to Tamr Core v2020.012.0, and restart Tamr Core. If you cannot use Postgres v12 for any reason stemming from your environment, contact Support to obtain advice on the best course of action. For more information, see Installing Postgres and Upgrading Postgres.
- Run pair generation jobs faster due to fine-tuned Spark memory allocation. Note: Beginning with this release, Grafana and Kibana monitoring services are disabled by default. You can enable them explicitly, if needed.
Improvements and Changes
The following improvements and changes were made in this release.
Upgrades and Configuration
- Fixed Spark Executor Calculations.
- Reduced OS headroom default.
- When starting Tamr, no longer issue an error if no auxiliary services have been deployed.
- Disabled Kibana by default (TAMR_ELK_ENABLED=false) and avoided re-computing Spark memory allocations that are already accounted for.
- Disabled Grafana by default. Fixed an issue where turning off Grafana with TAMR_GRAFANA_ENABLED=false turned off the Graphite exporter but didn't turn off the Spark exports.
User Interface and Usability Improvements
- The user interface reports errors if saved golden record rules are invalid.
- You can use Show more and Show less in the Details sidebar on the Pairs and Clusters pages to examine long string attribute values.
- Attribute values of type array are formatted to show the number of values in the array. Clicking the number shows the details of the array.
- The Compare Details dialog (side-by-side view) on the Pairs page allows toggling between displaying the long string value of a record's field or a truncated value with ellipsis.
Mastering
- You can use pair-wise classification user-defined signals in LLM.
Fixed Issues
The following issues were fixed in this release.
- Fixed an issue where, on refresh, the Schema Mapping page shows "The unified dataset [****] has been deleted." Affects versions: v2020.011.0. Fix versions: v2020.012.0.
- Fixed an issue where pair generation failed with the error "cannot resolve 'testRecord' given input columns: [verificationType, recordId, username, timestamp, verifiedClusterId]". Affects versions: v2020.011.0. Fix versions: v2020.012.0.
- Fixed an issue where the comment persisted in the entry box in pair labeling even after the comment was submitted. Affects versions: v2020.008.0. Fix versions: v2020.012.0.
- Fixed an issue where changing the name of the Spend field from project settings did not change this name in the filter dropdown menu. Affects versions: v0.51.0, v2020.009.0. Fix versions: v2020.012.0.
- Fixed an issue where long text fields for records were displayed in a tooltip, and not in a dialog (preferred). Affects versions: v2019.023.1. Fix versions: v2020.012.0.
- Fixed an issue where the dataset's preview could not be updated for a unified dataset that contained transformations. Affects versions: v2020.009.0. Fix versions: v2020.012.0.
- Fixed an issue where the transformations preview failed with a null pointer exception. Affects versions: v2020.004.0, v2020.005.0, v2020.008.0. Fix versions: v2020.012.0.
- Fixed an issue where an HTTP 500 error was issued on transformations preview, and previewSpark.log mentioned expired credentials. Affects versions: v2020.005.0. Fix versions:
- Upgraded Postgres to a higher version (v12.3) since Postgres v9.4 is End of Life. Affects versions: v2019.023.1. Fix versions: v2020.012.0.
- Fixed an issue where the POST /clusters/{dataset}/import internal endpoint did not delete the delta pipeline of the dataset. Affects versions: v2019.003.0. Fix versions: v2020.012.0.
- Fixed an issue where responding with y to an upgrade prompt resulted in a canceled upgrade.
- Fixed an issue where the clusters page broke if the publish date was outside of the publish time range.
- Fixed an issue where the record pairs filter did not show as active when attribute similarity filters were active.
- Fixed an issue where the Clear search and record filters link on the cluster records table did not clear search.
- Fixed an issue where the Pairs page broke with "Cannot read property 'get' of undefined" if input data for a record in a pair was missing.
v2020.011.0 Release Notes
What's New
In this release, you can:
- Use versioned APIs to rename categories.
- Add and remove datasets in projects, and change tags, as curators. Previously, only admins could run these actions.
- Take advantage of the following performance improvements. Dataset profiling jobs run faster, and single-node Tamr deployments avoid out of memory issues due to fine-tuned Spark and YARN configurations.
- Use the Low Latency Match (LLM) service with mastering projects that rely on user-defined signals in mastering. Previously, this aspect was not supported for LLM.
- Specify a unified attribute's type as geospatial in the user interface.
Improvements
The following new features and improvements were completed in this release.
- Mastering. Curators can add and remove datasets in projects and change tags. The LLM service supports projects with user-defined signals.
- Upgrades and Configuration. Reduced TAMR_MAX_ROWS_PER_PARTITION from 500,000 to 100,000 in Spark configuration to address out-of-memory issues.
- Performance. Optimized profiling jobs to run in one pass over the data.
- Upgrades. Notify users what to expect on success and failure of upgrade maintenance scripts.
- Versioned APIs. Use versioned APIs to rename categories in the categorization projects.
Fixed Issues
The following issues were fixed in this release.
- Fixed an issue where the collect-logs.sh script referenced the wrong location for the Zookeeper log. Affects versions: v2020.008.0. Fix versions: v2020.011.0, v2020.008.1.
- Fixed the tooltip in Confusion Matrix to say: "How often you agree with Tamr". Affects versions: v2020.008.0. Fix versions: v2020.011.0.
- Enabled Pairs with similarity score by default when adding an attribute similarity to filter pairs in the user interface. Affects versions: v2020.008.0. Fix versions: v2020.011.0.
- Fixed an issue where the upgrade process wrote the wrong version to upgrade status tracking.
v2020.010.1 Patch Release Notes
This patch is applicable only if you have upgraded to v2020.010.0 from previous versions. The patch fixes an issue where, when upgrading from v2020.010.0 to later versions, the upgrade status version was set incorrectly for v2020.010.0 and a validation check failed on future upgrades. If you upgraded to v2020.010.0 and you are planning to upgrade to a newer version, run the upgrade to the patch v2020.010.1 with the --skipUpgradeStatusValidation flag. The patch does not affect new installations of Tamr v2020.010.0 because they do not have the upgrade status version set for them.
v2020.010.0 Release Notes
What's New
In this release, you can use an improved upgrade process that catches additional upgrade issues and creates an upgrade report.
Improvements
The following new features and improvements were completed in this release.
Configuration Changes
Added a configuration variable TAMR_LD_PRELOAD to allow appending additional libraries to LD_PRELOAD.
Upgrade Improvements
The upgrade validation process detects:
- Database tables that have issues which will cause upgrades to fail.
- Dependent datasets that have missing upstream datasets.
After the upgrade validation process completes, it creates a report file that allows you to follow up on identified issues that might affect the upgrade.
Fixed Issues
The following issues were fixed in this release.
- Fixed an invalid path error getting HBase configuration in Spark.
- Fixed a configuration issue with Zookeeper that affected scaled out deployments.
- Fixed an issue where the user interface for golden records was inaccessible after deleting a column in an upstream mastering project.
v2020.009.0 Release Notes
What's New
In this release, you can:
- Before upgrading to a new Tamr Core version, use a validation check for the required minimum open file limit (ulimit -n).
- Use versioned APIs to import and export a categorization model, and to add and remove transformations.
Changes and Improvements
The following changes and improvements took place in this release.
Versioned API Work
- Added support for importing and exporting the categorization model in a ZIP file.
- Added support for adding/removing transformations to the versioned API.
Upgrade Improvements
- Added a validation script for checking open file ulimit (ulimit -n) against required minimum of 66000.
- Added a configuration variable for configuring an HTTP request idle timeout for all services.
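The open file limit check described above can be approximated outside of Tamr Core. The following is a minimal sketch, not the validation script itself: it reads the current limit with Python's standard resource module (POSIX only) and compares it against the 66000 minimum stated in these notes. The function name open_file_limit_ok is an illustrative assumption.

```python
import resource

REQUIRED_MIN_OPEN_FILES = 66000  # minimum stated in these release notes


def open_file_limit_ok(soft_limit, required=REQUIRED_MIN_OPEN_FILES):
    """Return True if the soft open-file limit meets the required minimum."""
    # RLIM_INFINITY means "unlimited", which always satisfies the minimum.
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required


if __name__ == "__main__":
    # Equivalent to what `ulimit -n` reports for the current process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    status = "OK" if open_file_limit_ok(soft) else "TOO LOW"
    print(f"open files: soft={soft} hard={hard} "
          f"required>={REQUIRED_MIN_OPEN_FILES}: {status}")
```

Run it as the same operating system user that runs Tamr Core, because limits can differ per user and per shell.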
General UX/Visual Design
Made improvements for showing similar clusters.
Fixed Issues
The following support issues and bugs were fixed in this release.
- Fixed an issue where the job to train a categorization project had a lower-case ‘m’ in materialize, whereas everything else was capitalized. Affects versions: v2019.023.2. Fix versions: v2020.009.0.
- Fixed an issue where, when using geospatial features, users were unable to zoom into "Details" window closer than 30m/100ft. Fix versions: v2020.009.0.
- Fixed an issue where users could not update the project definition for golden records projects via versioned API. Affects versions: v2020.002.0. Fix versions: v2020.009.0.
- Fixed an issue where PostgreSQL contained multiple, conflicting labels on pairs. Affects versions: v2019.006.1. Fix versions: v2020.009.0.
- Changed default elastic batch size in ManualClusteringService.reindexClusterMembers from 10,000 to 1,000 to match other defaults.
- Fixed an issue where unpinning and pinning source datasets caused errors in related jobs.
- Removed false error messages for user-deleted export files and allowed export cleanup to run smoothly.
- Fixed an issue in the upgrade script that failed for a mastering project if no mappings existed but there was a unified dataset.
v2020.008.1 Patch Release Notes
This patch fixes an issue where the collect-logs.sh script did not copy the Zookeeper log from its new location.
v2020.008.0 Release Notes
What's New
In this release, you can:
- Use versioned API endpoints for importing, exporting, and removing categorizations. This improvement is a step towards functional completeness of versioned APIs for Tamr.
- Reliably stop Tamr dependent components, such as HBase, YARN, and Zookeeper, by using an improved stop-dependencies.sh script.
- Rely on an updated set of upgrade validation checks in the administration utility. For example, the administration utility now tracks upgrades and checks configured directories.
- Avoid having to log in again into Tamr after restarting it. Beginning with this release, your session credentials persist after restarting Tamr.
- Use a new transformation function, array.non_empties(), to remove empty values from an array. See array.non-empties.
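To illustrate the idea behind array.non_empties(), here is a small Python sketch. It is not the Tamr Core implementation, and it assumes that "empty" means null values and empty strings; the function reference is the authority on which values actually count as empty.

```python
def non_empties(values):
    """Drop None and empty-string entries, keeping order (assumed semantics)."""
    return [v for v in values if v is not None and v != ""]


if __name__ == "__main__":
    print(non_empties(["red", "", None, "blue"]))
```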
Usability and Design Improvements
In this release, you can:
- Confirm your actions by observing the new tooltip, Copied!, which appears after you click the icon to copy the Cluster ID on the Clusters page.
- Use improved attribute similarity filters on the Pairs page. You can select both similarity range and null similarity filter, read tooltips, and use check boxes. Note: In this version, the filters for similarity are turned off by default and you need to explicitly turn them on. There is an open issue for fixing this in a future release.
Upgrade Improvements
In this release, the upgrades validation process in the Tamr Core administration utility was improved. It:
- Tracks upgrades and notifies you if you try to run an upgrade while another upgrade is partially completed.
- Walks down the path from each Tamr-configured subdirectory, checks for permissions, the presence of symbolic links, and available storage space. It reports if it cannot access a directory, and warns you if the storage space is not sufficient in any of the directories (typically, the space should not be less than 1GB).
- Reliably stops processes for all Tamr-dependent components using the stop-dependencies.sh script. The script attempts to gracefully shut down Yarn, HBase, and Zookeeper. If that does not succeed, it waits for the amount of time in TAMR_HARDKILL_TIMEOUT_SECONDS, and terminates these processes in the correct order. The new parameter, TAMR_HARDKILL_TIMEOUT_SECONDS, was added to the administration utility. It is set to 10 seconds by default.
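The "graceful shutdown first, hard kill after a timeout" pattern described for stop-dependencies.sh can be sketched as follows. This is an illustration of the general pattern using Python's subprocess module, not the script's actual logic; stop_process and stop_in_order are hypothetical names, and only the 10-second default comes from these notes.

```python
import subprocess
import sys

HARDKILL_TIMEOUT_SECONDS = 10  # default stated in these release notes


def stop_process(proc, timeout=HARDKILL_TIMEOUT_SECONDS):
    """Ask a process to exit, then force-kill it if it outlives the timeout."""
    proc.terminate()  # graceful request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=timeout)
        return "stopped gracefully"
    except subprocess.TimeoutExpired:
        proc.kill()  # hard kill (SIGKILL on POSIX)
        proc.wait()
        return "hard-killed"


def stop_in_order(procs, timeout=HARDKILL_TIMEOUT_SECONDS):
    """Stop (name, process) pairs one at a time, in the order given."""
    return [(name, stop_process(proc, timeout)) for name, proc in procs]


if __name__ == "__main__":
    # Stand-in child processes; a real deployment would target the services.
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    procs = [(name, subprocess.Popen(sleeper)) for name in ("first", "second")]
    print(stop_in_order(procs, timeout=5))
```

Stopping the processes sequentially is what makes the ordering deterministic; the order itself is supplied by the caller.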
Fixed Issues
The following support issues and bugs were fixed in this release.
Fixed Issues in Upgrades
- Fixed an issue where an upgrade from v2019.011.0 to v2020.004.0 failed. Affects versions: v2020.004.0. Fix versions: v2020.008.0.
- Fixed an issue where an upgrade from v2019.023.1 to v2020.004.0 failed. Affects versions: v2019.023.1, v2020.004.0. Fix versions: v2020.008.0, v2020.004.1.
- Fixed an issue where an upgrade to v2020.002.0 failed with failure to stop Spark. Affects versions: v2019.009.1. Fix versions: v2020.008.0.
- Fixed an issue where an upgrade failed because stop-dependencies.sh did not reliably stop all dependencies as required. Affects versions: v2020.004.0. Fix versions: v2020.008.0.
- Fixed an issue where an upgrade to v2020.004.1 failed if two categorization projects had been deleted.
- Fixed an issue where updating schema mappings failed if there was a project with a unified dataset but no mappings present.
Other Fixed Issues
- Fixed an issue where the cluster stats showed an incorrect "last published" date.
- Fixed an issue where unpinning and pinning source datasets caused errors in related jobs.
- Fixed an issue where reindexing of transaction comments failed when there were comments associated with deleted projects in the Elasticsearch index.
- Fixed an issue where exporting categorization labels from versioned APIs didn't take into account new configuration.
v2020.007.1 Patch Release Notes
This patch allows adding tags to Amazon EMR Ephemeral Spark clusters and DynamoDB tables on AWS. This issue affected Tamr Core cloud-native deployments.
v2020.007.0 Release Notes
What's New
In this release, you can:
- Take advantage of usability improvements when working with golden records.
- View full record values in the Record Details sidebar. Previously, long record values were cut off.
- Check the submission of Spark jobs using the validation framework in the administration utility.
- Run the script for disabling primary keys directly from the administration utility.
Golden Records Improvements
When working with golden records, you can:
- Use the display name of a mastering project when selecting this project for creating a new Golden Records project.
- Use the ENTER key to save rule overrides in golden records.
Usability and Design Improvements
In this release, you can:
- Use the in-product transformations documentation to learn how str_join handles null values.
- Use a toggle to show full record values in the Record Details sidebar.
Upgrade Improvements
In this release, you can:
- Run the script for disabling primary keys from the Tamr administrative utility.
- Check the Spark jobs submission as part of validation checks in the Tamr administrative utility.
Fixed Issues
The following support issues were fixed in this release.
- Fixed an HBase issue where the job was displayed as running when no jobs were running. Affects versions: v2019.016.0. Fixed in versions: v2020.007.0.
- Fixed an issue where pair generation on the Pairs page broke if Jaccard tokenizer threshold was set to less than 0.4. The fix checks the value you enter for the Jaccard tokenizer and ensures that Tamr issues an informative error message in this case. Affects versions: v2019.023.2. Fixed in versions: v2020.007.0.
- Fixed an issue that prevented a customer from using the ENTER key to save rule overrides in golden records. Affects versions: v2019.023.1. Fixed in versions: v2020.007.0.
- Fixed an issue where the Pairs page was not loading with "Error parsing metadata at key 'DEDUP'". Affects versions: v2019.023.1. Fixed in versions: v2020.007.0.
- Fixed an issue that prevented a customer from using the display name of the mastering project when selecting this project for creating a new Golden Records project. The fix allows you to use the mastering project's display name. Affects versions: v2020.001.0. Fixed in versions: v2020.007.0.
- Fixed an issue where a customer could not delete the taxonomy file name when uploading a taxonomy to a categorization project, because empty values were not allowed in the user interface. The fix allows you to delete the name of the file and replace it with a new name. Affects versions: v2020.002.0. Fixed in versions: v2020.007.0.
- Fixed an issue where the Golden Records project could not be created if attribute names collided with existing attribute names. Affects versions: v2019.020.0. Fixed in versions: v2020.007.0.
- Allowed creating golden records attributes and rules when creating a new Golden Records project. Affects versions: v2019.011.0. Fixed in versions: v2020.007.0.
- Fixed formatting errors in versioned API names and added missing error code descriptions.
- Fixed an issue where the Dataset Catalog page broke if you clicked a row and then pressed the space key.
v2020.006.0 Release Notes
New Features and Improvements
The following new features and improvements were completed in this release.
- Stopped retrying the request if the table limit exceeds 1,000 tables in GCP BigTable.
- Reduced unnecessary authentication service dependencies.
- Made the Grafana path configurable.
- Improved accessibility of user interfaces for cluster verification.
Fixed Issues
The following support issues were fixed in this release.
- Fixed an issue where the matching service on port 9170 was not healthy and LLM could not be used to query projects. Affects versions: v2020.002.0. Fix versions: v2020.006.0.
- Fixed an issue in the user interface where a user could not delete many datasets, including the unified dataset, of a dedup project that has been published. Affects versions: v2019.023.1. Fix versions: v2020.006.0.
- Fixed an issue where the Delete Dataset dialog cut off critical information and required you to scroll. Previously, long dataset names were cut off in this dialog, which made it impossible to determine which derived dataset was preventing a user from deleting a particular dataset. Affects versions: v2019.026.0. Fix versions: v2020.006.0.
- Fixed an issue where reindexing feedback documents failed when there were feedback documents associated with existing unified datasets from deleted projects in persistence.
- Fixed an issue where the Open second cluster browser button was not disabled when the map was open.
- Fixed an issue where Preview returned a 500 error for an empty metadata reference.
- Fixed an issue where an upgrade failed with a failure during backup, indicating that /home/ubuntu/tmp did not exist.
v2020.005.0 Release Notes
What's New
In this release, you can:
- Publish the last updated version of golden records.
- Rely on additional post-upgrade validation checks that display in a format that is easier to read.
- Use IAM roles for Amazon S3 backup configuration.
- Configure Tamr timeout during restarts. The default timeout is 5 minutes instead of 3 minutes in the previous releases.
- Azure (Beta). Run Tamr with ADLSv1 as the primary filesystem with Databricks Spark.
Fixed Issues
The following issues were fixed in this release.
- Prevented users from entering reserved words as attribute names, which avoids a vague error report in Categorization. Affects versions: v2019.023.0. Fix versions: v2020.005.0.
- Fixed an issue where Tamr reserved some column names that users could still use, which caused job failures. Affects versions: v2019.009.1. Fix versions: v2020.005.0.
- Fixed an issue where Categorization failed due to using a reserved attribute name "finalClassificationPath".
- Fixed an issue where restoring from a backup failed and did not restart Tamr. Affects versions: v2019.024.0, v2019.025.0. Fix versions: v2020.005.0.
- Added configurable parameters for the number of retries for startup (50) and interval between retries (6 sec) to achieve the default timeout interval of 5 minutes upon restart.
- Increased default Zookeeper timeouts to 120,000 ms.
- Enabled support for IAM roles for Amazon S3 backup configuration. Affects versions: v0.47.0. Fix versions: v2020.005.0.
- Fixed an issue where toggling Excluded bucket selection in Golden Records Preferred Sources editing dialog did nothing.
- Fixed an issue where the Unified Dataset page threw an authorization error for users logged in as reviewers.
- Cleaned up the service restart after a system state change.
- Removed __MACOSX directory from Tamr zip.
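The startup retry parameters listed above (50 retries, 6 seconds apart) multiply out to the 300-second, or 5-minute, default timeout. The sketch below shows how such a retry loop yields that bound. It is an illustration only: wait_until_ready and the is_ready callback are hypothetical names, not Tamr configuration variables or APIs.

```python
import time

STARTUP_RETRIES = 50        # number of retries stated in these release notes
RETRY_INTERVAL_SECONDS = 6  # interval between retries stated in these notes


def wait_until_ready(is_ready, retries=STARTUP_RETRIES,
                     interval=RETRY_INTERVAL_SECONDS, sleep=time.sleep):
    """Poll is_ready() up to `retries` times, sleeping `interval` between tries.

    Returns the attempt number that succeeded, or raises TimeoutError.
    """
    for attempt in range(1, retries + 1):
        if is_ready():
            return attempt
        sleep(interval)
    raise TimeoutError(f"not ready after {retries * interval} seconds")


if __name__ == "__main__":
    # 50 retries x 6 seconds = 300 seconds = 5 minutes.
    print(STARTUP_RETRIES * RETRY_INTERVAL_SECONDS)
```

Passing the sleep function as a parameter keeps the loop easy to exercise without real waiting.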
Fixed Issues in Transformations
- Fixed incorrect color highlighting of "metadata" datasets in the Transformation editor.
- Fixed an issue where custom sample datasets in Transformations bypassed certain Project permissions.
- Fixed type-checking for Struct and Array types, which was inaccurate in some cases.
- Fixed an issue where datasets could not be referenced with USE within a labeled scope.
v2020.004.4 Patch Release Notes
This patch updates Tamr Core to use Apache Log4j version 2.17.1.
For full details regarding Log4j vulnerabilities and Tamr Core, refer to Tamr's Updates on Apache Log4j Vulnerabilities article.
v2020.004.3 Patch Release Notes
This patch addresses the following Apache Log4j vulnerabilities by updating Tamr Core to use Apache Log4j version 2.17.0:
- Apache Log4j CVE-2021-45105
- Apache Log4j CVE-2021-45046
- Apache Log4j CVE-2021-44228
For full details regarding these vulnerabilities and Tamr Core, refer to Tamr's Updates on Apache Log4j Vulnerabilities article.
This patch fully remediates these three vulnerabilities in Tamr Core and Elasticsearch. Install this patch regardless of whether you have taken any of the remediation steps in the article referenced above.
v2020.004.2 Patch Release Notes
This patch contains the following fixed bugs and support issues.
- Fixed an issue with geospatial rendering in the user interface where displaying multi-point data in the Leaflet-based map caused the user interface to blank out.
- Upgraded Grafana and Kibana to mitigate CVE cybersecurity vulnerabilities. For a list of versions to which Grafana and Kibana were upgraded, see Supported Monitoring Tools.
v2020.004.1 Patch Release Notes
The patch contains the following fixed bugs and support issues.
Issue with Grafana
- In Grafana, the output of template rendering is no longer logged, because it might contain sensitive information.
Issue with Low Latency Match (LLM) Service
- Fixed an issue where port 9170 for the Low Latency Match (LLM) service was not healthy and the LLM service could not be used to query projects.
Issues in Elasticsearch Indexing
- Fixed an issue where reindexing transaction comments failed when there were comments associated with deleted projects in persistence.
- Fixed an issue where reindexing feedback failed when there were feedbacks associated with existing unified datasets from deleted projects in persistence.
Upgrade Issues
- Fixed a support issue where an upgrade failed from v2019.023.1 to v2020.004.0. Affects version 2019.023.1, fixed in 2020.004.1.
- Fixed an issue where an upgrade to version 2020.004.1 failed if two categorization projects had been deleted. Affects version 2020.004 when upgrading to the 2020.004.1 patch. Fixed in 2020.004.1.
- Fixed an issue where an upgrade script failed for a mastering project if no mappings existed but there was a unified dataset. Affects version 2020.001 when upgrading to the 2020.004.1 patch. Fixed in 2020.004.1.
v2020.004.0 Release Notes
What's New
This release allows you to use Tamr Core for data mastering on geospatial records. You can:
- Load geospatial records in the GeoJSON format. In the beta release, to load records, contact Support.
- Work with pairs, find matches and duplicates, and run transformations on geospatial records. You can then put records in clusters based on information extracted from geospatial data, and create a categorization project to align records with an existing taxonomy that you might have in place.
- Configure Tamr Core to use OpenStreetMap and Thunderforest server tiles. You can then view geospatial record pairs, clusters, and shapes, such as polygons, on the Leaflet-based map.
- Use a Leaflet-based map on Pairs and Clusters pages. If you have configured Tamr Core to use multiple tile servers, you can switch between them and use different maps for pair matching and clustering.
- Zoom and pan on the map to refetch geospatial data as the map adjusts interactively.
- On the Schema Mapping page, configure pair similarity metrics, such as Hausdorff, Relative Hausdorff, and Directional Hausdorff Distances. View a pair of two records on the map, along with their similarity metrics and location.
- On the Clusters page, view a cluster of records on the map at the same time and configure Tamr Core to display adjacent records.
- Run geospatial-boundary searches on clusters of geospatial records.
- Run transformations on records with geospatial data, such as calculating the area or the perimeter or converting record types to supported geospatial types. For information, see GIS Functions.
- Use geospatial records in a unified dataset.
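For readers unfamiliar with the Hausdorff metrics named in the list above, the following sketch shows the textbook definitions of the directional (one-sided) and symmetric Hausdorff distances on small point sets. It illustrates the mathematics only. How Tamr Core computes these metrics on geospatial shapes, and how it defines Relative Hausdorff, is not shown here.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


if __name__ == "__main__":
    a = [(0, 0), (1, 0)]
    b = [(0, 0), (3, 0)]
    print(directed_hausdorff(a, b))  # 1.0: (1, 0) is 1 away from (0, 0)
    print(directed_hausdorff(b, a))  # 2.0: (3, 0) is 2 away from (1, 0)
    print(hausdorff(a, b))           # 2.0
```

The directional distance is not symmetric, which is why the two directed results in the example differ.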
In addition, in this release, you can:
- View dataset metadata in the user interface and run transformations on metadata.
- Use versioned APIs for these actions on golden records: run profiling on golden records, publish golden records, and update golden records.
- Have greater control over permissions to datasets, projects, and transformations.
Fixed Issues
The following issues were fixed in this release.
- Fixed an issue where schema mappings disappeared. Affects versions: v2019.019.0, v2019.023.1. Fix versions: v2020.004.0.
- Allow using dataset tags or metadata information in transformations. Fix versions: v2020.004.0.
- Fixed an issue where transformations linting broke on USE "datasetName" statements if the dataset was missing or unauthorized, or datasetName was an empty string.
v2020.003.0 Release Notes
What's New
This release allows you to:
- View dataset metadata in the user interface.
- View geospatial data in golden records.
- Run profiling on golden records using the versioned Tamr Core API.
- Have greater control over permissions to datasets, projects, and transformations.
New Features and Improvements
The following new features and improvements were completed in this release.
Role-Based Access Controls Expansion
- Added authentication for events in the user interface.
- Added access controls to Taxonomy and Classification projects.
Dataset Metadata
You can now modify dataset metadata in the Tamr Core user interface.
You can set metadata values for an input dataset or any of its attributes in a mastering, categorization, or schema mapping project. When you select a dataset on the project’s Datasets page, the Open properties option is now available. After you select the object you want to add metadata to, you specify an identifying key for the property and a value for the key.
For example, you can add a key of “Privacy” with a value of 1, 2, or 3, to several attributes in a dataset where 1=Public Information, 2=Private Information, and 3=Top Secret Information.
For more information, see Using Metadata in Transformations.
Geospatial Support
- Added new transformation functions to convert data into geospatial types point and polygon.
- Added ability to refresh a map.
- Relative Hausdorff Similarity uses true shape diameter instead of bounding-box approximation.
- Made warning messaging on geospatial map consistent.
- Fixed the count of geospatially-mapped records upon zooming in.
- Fixed an issue where Tamr did not fetch records for adjacent clusters when you have not toggled on visibility for adjacent clusters.
Golden Records
- Added versioned API endpoints for golden records. You can use POST /v1/projects/{project}/goldenRecordsProfile:refresh to run profiling on golden records using the versioned Tamr Core API.
- Golden records preview returns an array for a Struct type.
- Golden records can output and display geospatial data.
- Golden records create an intermediate "Rule Output" internal derived dataset caching the output of running golden record rules.
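As a sketch of how the profiling endpoint above might be called, the following Python builds the POST request with the standard library without sending it. The base URL (host, port, and path prefix) is a hypothetical placeholder, and authentication is omitted; check your deployment's API documentation for both before use.

```python
import urllib.request
from urllib.parse import quote

# Hypothetical placeholder: host, port, and prefix depend on your deployment.
BASE_URL = "http://localhost:9100/api/versioned"


def build_refresh_request(project_id, base_url=BASE_URL):
    """Build (but do not send) the POST that refreshes the golden records profile."""
    project = quote(str(project_id), safe="")
    path = f"/v1/projects/{project}/goldenRecordsProfile:refresh"
    return urllib.request.Request(base_url + path, method="POST")


if __name__ == "__main__":
    request = build_refresh_request(7)
    print(request.get_method(), request.full_url)
```

To send the request, add the credentials your deployment requires as headers and pass the object to urllib.request.urlopen.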
Configuration, Deployment, and Lifecycle Changes
- HBase improvements. Enabled configuring HBASE_ZK_SESSION_TIMEOUT for ZooKeeper from HBase.
- Configured Elasticsearch socket timeout is now respected by Spark components.
Fixed Support Issues
The following support issues were fixed in this release.
- Fixed the mktemp: too few X's in template 'tempZipDir' warning. Affects versions: v2019.026.0, Fix versions: v2020.003.0.
- Set java.io.tmpdir to the value of TAMR_TMP_DIR. Affects versions: v2020.002.0, Fix versions: v2020.003.0.
- Classification Dashboard incorrectly displayed Verified, agrees with Tamr. Affects versions: v2019.023.1, Fix versions: v2020.003.0.
- Curator dashboard always says Tamr agrees with verified records 100 percent of the time. Affects versions: v0.35.0, v0.36.0, v0.37.0, Fix versions: v2020.003.0.
- Disk space during a backup. Affects versions: v2019.020.0, Fix versions: v2020.003.0.
- Disk space gets locked up after repeated cycles of running Tamr and Tamr backup. Affects versions: v2019.019.0, Fix versions: v2020.003.0.
- Frequent errors "Your connection to the application has been unexpectedly terminated". Affects versions: v2019.023.1, Fix versions: v2020.003.0.
- The Update Pairs job yields a Request to Elasticsearch Scroll API failed error. Affects versions: v2019.023.1, Fix versions: v2020.003.0, v2020.002.0.
- Pairs cannot render, forcing page reload. Affects versions: v2019.011.0, v2019.015.0, v2019.023.0, v2019.025.0, v2019.023.1, v2019.026.0, v2020.001.0, Fix versions: v2020.003.0.
Fixed Issues
Fixed the following issues.
- DNF estimation and DNF learner assumed that originEntityId is unique.
- A job was being canceled with a cannot compute delta error on "cluster similarities" dataset.
- Administration utility upgrade command: backup failure occurred due to missing class HBaseExportSnapshotWriter.
v2020.002.0 Release Notes
What's New
- Added new transformation functions. For more information, see Transformations below.
- Released version 0.10 of the Tamr Python client. This version supports Python 3.8, standard Python logging, and uses markdown for the Tamr Python Client user documentation.
- Added error messages and stack traces from API errors to the errors that you see in the Tamr Core user interface.
- Further improved the validation process during upgrades by adding a healthcheck script for Elasticsearch. For more information, see Configuration, Deployment, and Lifecycle Changes below.
Record Clustering
- The cluster card shows an Accept suggestion button in the sidebar that shows suggested clusters.
- Beginning with this release, LLM returns the top-k similar clusters instead of only the top-1 that it returned in the previous releases. You can configure the k parameter through TAMR_LLM_TOPK.
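The difference between top-1 and top-k results can be shown with a short sketch. This is a generic illustration of top-k selection over similarity scores, not the LLM service's implementation; the cluster IDs and scores are made-up sample data.

```python
import heapq


def top_k_clusters(scores, k):
    """Return the k (cluster_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Made-up similarity scores for one query record.
    scores = {"cluster-a": 0.91, "cluster-b": 0.40, "cluster-c": 0.77}
    print(top_k_clusters(scores, 1))  # previous behavior: best match only
    print(top_k_clusters(scores, 2))  # k = 2: the two best matches
```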
All aspects of the mastering workflow now support pregrouping. Pregrouping integrates the GROUP BY transformation into the internal dedup service and allows Tamr Core to group almost exact matches before generating candidate pairs. You enable pregrouping by changing the dedup recipe metadata. When pregrouping is enabled in Tamr Core, the resulting record clusters are based on the original record IDs, and this makes it easier for curators to verify, lock, and review the resulting clusters. When grouping fields change, Tamr Core auto-maps the verified pair labels to the new group IDs. To enable pregrouping, contact your support representative.
Note: The pregrouping feature is in limited release. For more information about using this feature, contact Tamr Support.
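The effect of grouping before pair generation can be sketched in a few lines. This is a conceptual illustration, not the dedup service's algorithm: it groups on exactly equal field values (the feature groups almost exact matches), and the records, field names, and function names are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def pregroup(records, fields):
    """Group original record IDs whose values in `fields` are identical."""
    groups = defaultdict(list)
    for record_id, record in records.items():
        key = tuple(record.get(f) for f in fields)
        groups[key].append(record_id)
    return list(groups.values())


def candidate_pairs(groups):
    """Pair up groups (by index) instead of pairing individual records."""
    return list(combinations(range(len(groups)), 2))


if __name__ == "__main__":
    records = {
        "r1": {"name": "Acme", "city": "Boston"},
        "r2": {"name": "Acme", "city": "Boston"},
        "r3": {"name": "Acme Inc", "city": "Boston"},
    }
    groups = pregroup(records, ["name", "city"])
    print(groups)                   # [['r1', 'r2'], ['r3']]
    print(candidate_pairs(groups))  # [(0, 1)]
```

Three ungrouped records would produce three candidate pairs; grouping the two identical records first leaves one pair to evaluate, while each group still carries its original record IDs.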
Transformations
- Added the new array.most_frequent function to the list of Tamr Core transformation functions. It returns the N most frequently-occurring values in an array, skipping null values. This function may be useful in categorization projects. For example, it allows you to create a single meaningful text description from a list of string array descriptions arriving from multiple records. This is useful when you need to unify many columns containing a short text description of the product into a single column with the most representative text description.
- Added collect_set and collect_subset aggregation functions.
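The behavior described for array.most_frequent can be approximated in Python as follows. This is a sketch of the described semantics (N most frequent values, nulls skipped), not the Tamr Core implementation. In particular, how the real function breaks ties is not stated in these notes; this sketch keeps the value seen first.

```python
from collections import Counter


def most_frequent(values, n):
    """Return the n most common non-null values, most frequent first."""
    counts = Counter(v for v in values if v is not None)
    return [value for value, _ in counts.most_common(n)]


if __name__ == "__main__":
    descriptions = ["bolt", "nut", None, "bolt", "washer", "nut", "bolt"]
    print(most_frequent(descriptions, 1))  # ['bolt']
    print(most_frequent(descriptions, 2))  # ['bolt', 'nut']
```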
Configuration, Deployment, and Lifecycle Changes
- Released version 0.10 of the Tamr Python Client. For details, see the Python Client release notes and changelog.
- Removed TAMR_ES_CLUSTER "elasticsearch_procurify". This parameter was not used. Upon upgrading to this version of Tamr Core, the upgrade logs indicate the following: INFO: Property definition [TAMR_ES_CLUSTER] is not declared in new definitions. It will be removed from the configuration store. This removal does not affect Tamr operations in any way in this release or in any other releases.
- Added TAMR_ES_MAX_GEOSPATIAL_FEATURES_DEFAULT: "1000". This is the limit for the number of records to fetch when rendering them on a geospatial map. This value must not exceed the value of TAMR_ES_MAX_RESULT_WINDOW.
- Changed TAMR_ES_SOCKET_TIMEOUT: "300000" to TAMR_ES_SOCKET_TIMEOUT: "900000" as part of performance-related bug fixes.
- Added TAMR_JOB_SPARK_LOG4J_PROPS: "/home/ubuntu/tamr/conf/log4j.properties.j2" as part of logging improvements.
- Added a validation script for Elasticsearch to the list of Tamr Core Validation checks. The script checks for:
  - The connectivity to the Elasticsearch cluster used to power the user interface, and ensures that this cluster is running.
  - The existence of the Elasticsearch data directory, /home/<user-name>/src/javasrc/procurify/deployment/build/deployment/elasticsearch-with-plugins-6.8.2/data, and that this directory has a sufficient amount of free space, and is readable and writeable.
  - The version of Elasticsearch that is used is compatible with the supported version used in Tamr Core.
- Added error messages and stack traces from API errors to the errors that you see in the Tamr Core user interface.
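The data directory portion of the Elasticsearch validation described above (existence, free space, readable and writable) can be approximated with the Python standard library. This is a sketch, not the validation script: check_data_dir is a hypothetical name, and because these notes do not state how much free space counts as sufficient, the threshold is left to the caller.

```python
import os
import shutil


def check_data_dir(path, min_free_bytes):
    """Return a list of problems found with a data directory (empty if none)."""
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]
    problems = []
    if not os.access(path, os.R_OK):
        problems.append(f"{path} is not readable")
    if not os.access(path, os.W_OK):
        problems.append(f"{path} is not writable")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"{path} has only {free} bytes free")
    return problems


if __name__ == "__main__":
    # The threshold here is an arbitrary example value, not a Tamr default.
    print(check_data_dir(os.getcwd(), min_free_bytes=1024 ** 3))
```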
Fixed Support Issues
- Fixed an issue where an Elasticsearch "clear" operation was receiving a socket timeout or generating an intermediate response that was too long. This was caused by index deletions on projects with very large numbers of pairs. The fix changes the default for the Elasticsearch search timeout: DEFAULT_SOCKET_TIMEOUT_MS is set to 900000 milliseconds (15 minutes) by default. The fix also reduces the default batch size on Elasticsearch "clear" operations from 10,000 to 1,000 to match the batch size used by other operations. The fix was included in the patch release v2019.023.2 and in v2020.002.0.
- Fixed an issue where previewing bookmarked golden records failed on datasets from chained projects.
- Added the array.most_frequent function to transformations. It returns the N most frequently-occurring values in an array, skipping null values.
- Fixed an issue where, during backups, multiple numbered files were created in /opt/tamr/tmp/backup/ but not deleted, which took up considerable amounts of disk space. The issue was due to the HBase export snapshot process that kept temporary files around until Tamr Core was restarted. The fix was to make the export snapshot run as a separate process.
Fixed Issues
- Fixed an issue where the number of generated pairs was incorrect for geospatial records that use Directional Hausdorff for pair matching.
- Fixed an issue in golden records where you could not preview bookmarked geospatial-type records.
- Fixed an issue in the Tamr user interface where the Status buttons on the Jobs page were truncated and off-center.
- Fixed an issue where upgrade validation failed when Tamr was using Google Cloud SQL Postgres instances.