2019 Tamr Core Release Notes
These release notes describe new features, improvements, and corrected issues in each Tamr Core 2019 release.
See Tamr Core Release Notes for important information for all releases, including upgrade instructions and checkpoint releases.
Other Tamr Core releases:
Tamr Core 2019 Releases
- v2019.026.0
- v2019.025.0
- v2019.023.2 patch
- v2019.023.1 patch
- v2019.023.0
- v2019.022.0
- v2019.021.0
- v2019.020.0
- v2019.019.2 patch
- v2019.019.0
v2019.026.0 Release Notes
What's New
- Added a validation framework to Tamr Core upgrades. You can run healthcheck validation at any time and it also runs during version upgrades. The validation healthcheck scripts check for Tamr Core license, memory usage, operating system, and HBase configuration, and publish detailed information if they find issues. See Validation.
- Added more granular and expressive controls for record locking and verification in clustering. Records in the cluster verification table now reflect more expressive verification states than in previous releases, where you could only Lock and Unlock records. See Curating Clusters.
Clustering
Added user interface options for cluster verification actions in the Clusters page that were previously available only in the API. In particular, these changes were made:
- Use four subtypes of Verify actions. These actions replace the Lock and Unlock actions available in previous releases. The four verification actions are: Verify and enable suggestions, Verify and disable suggestions, Verify and auto-accept suggestions, and Remove verification. The Lock action in previous releases is equivalent to Verify and disable suggestions.
- Records in the table reflect verification states and allow you to take action.
- Use more expressive verification filters to filter clusters and view verification states in the record sidebar.
- The cluster table shows verification aggregations, such as the number of records in the cluster that may need to be moved to another cluster, or other actions stemming from the new verification options. For more information about cluster verification options, see Curating Clusters.
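The relationship between the legacy Lock action and the four new verification subtypes can be sketched as follows. The names here are illustrative only, not Tamr's actual API identifiers:

```python
from enum import Enum

class VerificationAction(Enum):
    # Illustrative names for the four verification subtypes described above;
    # these are not Tamr's actual API identifiers.
    VERIFY_ENABLE_SUGGESTIONS = "verify, enable suggestions"
    VERIFY_DISABLE_SUGGESTIONS = "verify, disable suggestions"
    VERIFY_AUTO_ACCEPT = "verify, auto-accept suggestions"
    REMOVE_VERIFICATION = "remove verification"

# Per the notes above, the legacy Lock action is equivalent to
# "Verify and disable suggestions".
LEGACY_EQUIVALENT = {"Lock": VerificationAction.VERIFY_DISABLE_SUGGESTIONS}
```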
Configuration, Deployment, and Lifecycle Changes
Added validation healthcheck scripts to the Tamr Core administrative utility. To use validation scripts, run the administrative utility with the new `validate` option, such as: `tamr/utils/unify-admin.sh validate`. For more information, see Validation.
Fixed Support Issues
Fixed an issue where the Edit rules link for golden records was broken in Internet Explorer.
v2019.025.0 Release Notes
What's New
Improved upgrades, including their resilience, logging and diagnostics.
Mastering
Improved user experience for loading data from external data sources, such as HDFS. This feature was introduced in the previous release.
Access Control Improvements
Continued improvements for rules-based access controls to schema mapping and to remaining API actions in the mastering workflow.
Configuration, Deployment and Lifecycle Changes
- Added better logging for upgrades. Logs are now more informative about errors that occur during upgrades. Changed logging in the Tamr Core administrative utility: console output is more readable and contains only the highlights, while the log file is now more detailed.
- Made upgrades more resilient so that they don't stop partway through and leave Tamr Core in an inconsistent state. Improved diagnostics to help you find out which parts of Tamr Core aren't working, and added built-in diagnostics to the administrative utility to allow you to identify and check for problems. Project upgrades no longer hold up the upgrade process; errors are reported in the user interface inside the project for you to fix after an upgrade.
- Made the following changes to the configuration properties in Tamr Core:
  - Added a new property, `TAMR_UNIFY_BACKUP_THREADS`, to configure the backup threads (replaces `TAMR_UNIFY_BACKUP_NUM_FILE_COPY_THREADS`).
  - Added a new default value to `TAMR_HBASE_CONFIG_URIS` to allow Tamr to pick up configuration from `hbase-site.xml`.
  - `TAMR_HBASE_DATA_DIR` only needs to be set if `TAMR_CONNECTION_INFO_TYPE=hbase` and `TAMR_REMOTE_HBASE_ENABLED=true`, or in a script installation if you prefer to place the HBase data directory in another location.
- Made robustness changes to backups. The main backup/restore process monitors backups for individual operations, checks for final status, reacts to user cancellation, and exits.
Fixed Support Issues
- Fixed an issue where a mastering project was out of date.
- Fixed an issue where Yarn would not start after an upgrade from version 2019.022 to v.2019.023. Proxy changes prevented the administrative utility from checking that Yarn was up.
v2019.023.2 Patch Release Notes
This patch fixes an issue with Elasticsearch. Changed the default batch size for Elasticsearch components from 10K to 1K, to work around a ContentTooLong exception received with some types of data.
v2019.023.1 Patch Release Notes
This patch fixes an issue that prevented you from logging into Tamr Core using Internet Explorer.
v2019.023.0 Release Notes
What's New
- Preview rules for golden records. To preview rules, select one or more entities with rules, edit the rules, and then choose Preview Rule to see how rules for golden records will behave. Once you are satisfied with the rules, update golden records and publish them.
- Use a `DROP` statement in transformations to omit unwanted attributes. See Drop.
- Use `MATH.TAN`, `MATH.ATAN`, `MATH.ATAN2`, `MATH.DEGREES`, and `MATH.RADIUS` functions in the library of mathematical functions.
- Use a new transformation statement, Sample.
- Use a new directional Hausdorff similarity function for pair matching of geospatial records.
- Back up and restore a Tamr Core deployment in which datasets are stored in Google Cloud Platform (GCP) BigTable. You can back up and restore to a local disk, HDFS, and GCS. See Backup Configuration.
- Retain labels after an upgrade by running a script that disables automatic assignment of primary keys.
- Start Tamr Core faster: `start-tamr.sh` now has improved performance.
- Upgrade directly from v.0.40.1 to the current version of Tamr Core (without having to install intervening versions).
- Be aware of these configuration changes:
  - Increased the services startup timeout to three minutes from one minute in previous releases.
  - Switched to OpenJDK 8 from the Oracle Java SDK in the Tamr Core installation package.
  - Added Tamr Core configuration properties to support configuring a password-protected transport authentication context in SAML v2. For more information, see SAML Authentication.
  - Renamed `tools.jar` to `transform-tools.jar`. It is packaged under `tamr/libs` in the Tamr Core installation and allows you to convert between `legacy.hash()` and `hash()`.
Golden Records
In the golden records project, you can preview rules. To preview a rule, select the entities one by one or in bulk as bookmarks, edit a rule, and then choose Preview Rule. Once you are satisfied with the rule, select Update and then choose Publish Golden Records.
Transformations
- Added a new transformation statement, Sample.
- Added a new `DROP` statement in transformations. Use it to remove a column from the active dataset. `DROP` is more convenient than `SELECT` for this purpose. For example, to increase performance, use `DROP` to remove unused columns after running `JOIN` statements. If you drop an attribute in the unified dataset, it is then populated with nulls. See Drop.
- Added `MATH.TAN`, `MATH.ATAN`, `MATH.ATAN2`, `MATH.DEGREES`, and `MATH.RADIUS` functions to the library of mathematical functions.
- Added the `USE HINT` statement, which applies a hint to the current transformation in the editor and to all subsequent transformations in that project. This statement might be useful to you if:
  - You have Tamr Core projects created before v.2019.014.1 that you are upgrading to this version, and
  - You explicitly do not want Tamr Core to automatically manage your primary keys.
  For example, to disable automatic primary key management by Tamr Core in a particular project, add `USE HINT(pkmanagement.manual);` in the first transformation.
  Note: The `USE HINT` statement is only useful if you created projects before v.2019.014.1, when automatic management of primary keys was introduced. If you started using Tamr Core after v.2019.014.1, Tamr Core automatically manages primary keys. For information about primary keys, see Primary Key Management.
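The effect of `DROP` on a dataset can be sketched in Python. This is illustrative only; Tamr transformations use their own syntax:

```python
def drop(records, attribute):
    """Sketch of what DROP does to the active dataset: remove one attribute
    from every record. Illustrative Python, not Tamr transformation syntax."""
    return [{k: v for k, v in r.items() if k != attribute} for r in records]

# A hypothetical active dataset after a JOIN, with a no-longer-needed key.
active = [
    {"id": 1, "name": "Acme", "join_key": "x"},
    {"id": 2, "name": "Apex", "join_key": "y"},
]
slimmed = drop(active, "join_key")
```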
Mastering
- Added sorting to record IDs in derived datasets (these datasets in mastering projects show statistics for clusters of records).
- Added a new directional Hausdorff similarity function. The existing functions, Hausdorff distance and absolute Hausdorff, target matching geo-type objects. The new directional Hausdorff function supports matching contained objects: for example, you can use it to see whether a small section of a road matches against the entire road, or whether a portion of a building matches an entire building. Directional Hausdorff is useful for pair generation and as an additional signal for pair matching. See Working with Geospatial Data.
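For intuition, the directed (one-way) Hausdorff distance between point sets can be sketched as follows. This is a generic geometric illustration, not Tamr's implementation:

```python
from math import dist  # Python 3.8+

def directed_hausdorff(a, b):
    # Directed (one-way) Hausdorff distance from point set a to point set b:
    # the farthest that any point of a is from its nearest point of b.
    # A small value in one direction suggests a is (nearly) contained in b,
    # which is why the directional variant suits part-vs-whole matching.
    return max(min(dist(p, q) for q in b) for p in a)

road = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
section = [(1.0, 0.0), (2.0, 0.0)]
```

Here the section-to-road distance is zero (the section lies on the road), while the road-to-section distance is not, which is what makes the directed variant a useful containment signal.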
Configuration Changes
- Increased the services startup timeout from one to three minutes.
- Switched to OpenJDK 8 from the Oracle Java SDK in the Tamr Core installation package. As a result, the default value of `TAMR_JAVA_HOME` changed from `<tamr-home-directory>/jdk1.8.0_111/` to `<tamr-home-directory>/openjdk-8u222/`.
- Changed the lowest version from which you can upgrade directly to the current version (v.2019.023.1 and higher) from v.0.37.0 to v.0.40.1. If you are upgrading from version 0.40.0 or earlier, upgrade to each individual version up to v0.40.1 and then upgrade directly to the current version.
- Enforced the use of HBase for dataset storage upon upgrade to this version. If you have migrated to v.2019.019.0, the upgrade script that runs during an upgrade takes care of moving your datasets to HBase-backed storage. For information, see HBASE.
- Added the ability to back up and restore Tamr Core from Google Cloud Platform (GCP) BigTable. Previously, you could back up and restore Tamr deployments backed by HBase. Beginning with this release, if your Tamr Core datasets use GCP BigTable instead of HBase, you can back up and restore them. The GCP BigTable backup in Tamr Core creates SequenceFiles; note that the HBase backup creates HBase snapshot files. You can back up to a local filesystem, HDFS, or GCS. See Backup Configuration.
- Fixed the administration utility, `admin-unify.sh`, so that it does not require quotes when unsetting a custom parameter to an empty value in `tamr_config.yml`. Previously, it required specifying quotes for an empty string value, such as `TAMR_JOB_SPARK_BIGQUERY_JAR: ""`. With the fix, to unset a custom parameter's value, you can omit the quotes and leave the value empty (quotes are still allowed).
- Reviewers can no longer cancel jobs. Previously, access controls for jobs allowed reviewers to cancel jobs.
- Added a list of case-insensitive reserved words for unified attribute names in Schema Mapping. Tamr Core issues an error and prevents you from using the following reserved words for unified attribute names in Schema Mapping: `origin_source_name`, `tamr_id`, `origin_entity_id`, `clusterId`, `originSourceId`, `originEntityId`, `sourceId`, `entityId`, `suggestedClusterId`, `verificationType`, and `verifiedClusterId`.
- Added a script that can apply the `USE HINT (pkmanagement.manual)` statement to every project that uses transformations. Starting with v2019.014.0, Tamr automatically assigns unique primary keys (`tamr_id`) if you have not assigned a `tamr_id` manually to your records. If the `tamr_id` values change, labels could change in downstream projects. Until this release, you could manually add the statement `USE HINT (pkmanagement.manual)` to each of your transformation scripts to disable automatic assignment of primary keys. In this release, we added an option in the `<unify-zip>/tamr/libs/transform-tools.jar` script to automate the process of temporarily disabling primary key assignments after an upgrade. This option adds a `HINT` to a project's transformations. For information about primary keys, see Primary Key Management.
- Added a warning strongly discouraging you from starting Tamr Core as a root user.
- Improved performance of `start-tamr.sh`.
- Added configuration properties that support configuring a password-protected transport authentication context in SAML v2. For more information, see SAML Authentication.
- Changed the defaults for the configuration parameters `TAMR_FS_CONFIG_URIS` and `TAMR_FS_CONFIG_DIR` to use the configuration found in `TAMR_HADOOP_HOME`. These changes have the following implications:
  - Changes to YARN/Spark configuration, such as `TAMR_YARN_NODE_MANAGER_PORT`, no longer require setting `TAMR_FS_CONFIG_URIS`.
  - If Tamr Core is deployed in an on-premise Hadoop installation that has some other Hadoop configuration present on the filesystem, such as `core-site.xml` and/or `yarn-site.xml`, Tamr Core's own Spark configuration is not ignored. Tamr Core now uses the configuration specified in `TAMR_HADOOP_HOME` by default.
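The case-insensitive reserved-word check for unified attribute names described in these configuration changes can be sketched as follows (illustrative only; the set mirrors the list in these notes, not Tamr's actual code):

```python
# Reserved words for unified attribute names in Schema Mapping, stored
# lowercased so the membership test is case-insensitive.
RESERVED = {
    "origin_source_name", "tamr_id", "origin_entity_id", "clusterid",
    "originsourceid", "originentityid", "sourceid", "entityid",
    "suggestedclusterid", "verificationtype", "verifiedclusterid",
}

def is_reserved(attribute_name: str) -> bool:
    """Return True if the name collides with a reserved word, any casing."""
    return attribute_name.lower() in RESERVED
```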
Fixed Support Issues
- Fixed an issue with ghost golden records. With this fix, if a cluster no longer exists in the input dataset, it does not contribute to a golden record.
- Fixed an issue where you could start Tamr Core as a root user. With this fix, Tamr Core issues a warning discouraging you from starting it as a root user.
Other Fixed Issues
- Fixed a security issue that affected log files produced by restoring from a backup from a previous version.
- Fixed a performance regression in Spark when running join operations by reducing the number of unnecessary row counts.
- Fixed an issue where `getRecordsByIds` failed if a dataset had a top-level struct type field with a field named `timestamp`.
- Fixed an issue where Tamr Core created duplicate columns during bootstrapping in Schema Mapping for these attributes in the clusters datasets: `suggestedClusterId`, `verificationType`, and `verifiedClusterId`. This occurred if you were using the datasets from the clusters as your input datasets to a new mastering project and attempted to bootstrap the attributes. The issue occurred because the names collided with existing names. The fix for this issue introduces a list of reserved words in Tamr Core. You cannot use these reserved words for unified attributes in Schema Mapping. For a list of reserved words for unified attributes, see "Configuration Changes" on this page.
v2019.022.0 Release Notes
What's New
- Changed the lowest version from which you can upgrade directly to the current version (without having to install intervening versions) from v.0.37.0 to v.0.40.1.
- Increased the services startup timeout to three minutes from one minute in previous releases.
- Switched to the OpenJDK 8 instead of the Oracle Java SDK in the installation package.
- Added a new transformation statement, Sample.
- Added configuration properties to support configuring password-protected transport authentication context in SAML v2. For more information, see SAML Authentication.
- Renamed `tools.jar` to `transform-tools.jar`. It is packaged under `tamr/libs` in the installation and lets you convert between `legacy.hash()` and `hash()`.
Transformations
- Added a new Sample statement to generate a sample of records with a uniform probability distribution.
- The `transform-tools.jar` script is available under `tamr/libs`. Note that in the previous release, this script was named `tools.jar` and was not packaged with the installation. Use this script to convert instances of `legacy.hash()` in your projects to the `hash()` function. You can also use this script to temporarily replace all `hash()` functions in your transformation scripts with `legacy.hash()` until you are ready to start using `hash()`. For example, run: `java -jar ./tamr/libs/transform-tools.jar function-replacer -u=admin --old=hash --new=legacy.hash --tamr_url="http://<IP HERE>:9100"`. For more information, see hash() and legacy.hash().
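As a rough illustration of what a `function-replacer` rewrite does, the substitution can be modeled as a regular expression that replaces bare `hash(` calls while leaving existing `legacy.hash(` calls untouched. This is a conceptual sketch, not the tool's actual implementation:

```python
import re

def replace_hash_with_legacy(script: str) -> str:
    # Rewrite bare hash( calls to legacy.hash(; the negative lookbehind
    # skips calls that are already qualified as legacy.hash(.
    return re.sub(r"(?<!legacy\.)\bhash\(", "legacy.hash(", script)

script = "SELECT hash(pk) AS tamr_id, legacy.hash(name) AS h;"
rewritten = replace_hash_with_legacy(script)
```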
Configuration Changes
- Increased the services startup timeout from one to three minutes.
- Switched to OpenJDK 8 from the Oracle Java SDK in the installation package. As a result, the default value of `TAMR_JAVA_HOME` changed from `<tamr-home-directory>/jdk1.8.0_111/` to `<tamr-home-directory>/openjdk-8u222/`.
- Changed the lowest version from which you can upgrade directly to the current version (v.2019.022.0 and higher) from v.0.37.0 to v.0.40.1. If you are upgrading from version 0.40.0 or earlier, upgrade to each individual version up to v0.40.1 and then upgrade directly to the current version.
- Added configuration properties that support configuring password-protected transport authentication context in SAML v2. For more information, see SAML Authentication.
- Changed the defaults for the configuration parameters `TAMR_FS_CONFIG_URIS` and `TAMR_FS_CONFIG_DIR` to use the configuration found in `TAMR_HADOOP_HOME`. These changes have the following implications:
  - Changes to YARN/Spark configuration, such as `TAMR_YARN_NODE_MANAGER_PORT`, no longer require setting `TAMR_FS_CONFIG_URIS`.
  - If Tamr Core is deployed in an on-premise Hadoop installation that has some other Hadoop configuration present on the filesystem, such as `core-site.xml` and/or `yarn-site.xml`, the Tamr Core Spark configuration is not ignored. Tamr Core now uses the configuration specified in `TAMR_HADOOP_HOME` by default.
Fixed Issues
- Upgrades.
- Fixed an issue where upgrade scripts timed out and prevented Tamr Core services from starting up after an upgrade. This fix removes support for upgrading directly to v2019.022.0 or greater from v0.40.0 or earlier and increases the services startup timeout to three minutes.
- Fixed an issue where users were not showing on the Users page after an upgrade (affected v.2019.021.0 and v.2019.022.0).
- Fixed an issue where high-impact pairs disappeared after an upgrade.
- Fixed an issue in golden records tables where the columns for entity rule, sources, and cluster Id weren't rendering properly in the user interface.
- Fixed an issue with expression rules and custom conditions in golden records where filters might not have been applied properly. This issue affected expression rules and custom conditions that contained expressions that did not return null when evaluated on null values. This caused the custom conditions and rules to pick the wrong values.
- Fixed an issue where job submission timed out because calculating versions before submitting a job to Spark took too long.
- Fixed an issue where cluster actions failed when there were too many cluster members in the IN clause due to the database query limit.
- Fixed a regression issue introduced in v.2019.021.0 where a job for generating pairs in a mastering project failed if you set a similarity function in the unified dataset to Hausdorff Distance or any similarity signal other than Cosine or Jaccard.
- Fixed a regression issue where filters, drop-down menus, and checkboxes did not work on the Pairs page on IE 11.
- Fixed an issue where a job for updating golden records could not be started because the hosting Hadoop server used the YARN/Spark configuration present on the server and specified outside of Tamr Core, ignoring Spark settings that Tamr Core requires. With the fix, Tamr Core uses the Hadoop configuration specified in `TAMR_HADOOP_HOME` by default. Previously, to ensure that the Hadoop configuration from Tamr Core was used, you had to specify this explicitly in `TAMR_FS_CONFIG_URIS`.
v2019.021.0 Release Notes
What's New
- Transformations. Added `MATH.TAN`, `MATH.ATAN`, `MATH.ATAN2`, `MATH.DEGREES`, and `MATH.RADIUS` functions to the library of mathematical functions.
- Added the ability to back up and restore a Tamr Core deployment in which datasets are stored in Google Cloud Platform (GCP) BigTable. You can back up and restore to a local disk, HDFS, and GCS. See Backup Configuration.
Mastering
- Added sorting to record IDs in derived datasets (these datasets in mastering projects show statistics for clusters of records).
Transformations
- Added `MATH.TAN`, `MATH.ATAN`, `MATH.ATAN2`, `MATH.DEGREES`, and `MATH.RADIUS` functions to the library of mathematical functions.
- Added the `USE HINT` statement, which applies a hint to the current transformation in the editor and to all subsequent transformations in that project. This statement might be useful to you if:
  - You have Tamr Core projects created before v.2019.014.1 that you are upgrading to this version (v.2019.021.0), and
  - You explicitly do not want Tamr Core to automatically manage your primary keys.
  For example, to disable automatic primary key management by Tamr Core in a particular project, add `USE HINT(pkmanagement.manual);` in the first transformation.
  Note: The `USE HINT` statement is only useful if you created projects before v.2019.014.1, when automatic management of primary keys was introduced. If you started using Tamr Core after v.2019.014.1, Tamr Core automatically manages primary keys and you never need to turn this feature off for any project. For information about primary keys, see Primary Key Management.
Configuration
- Enforced the use of HBase for dataset storage upon upgrade to this version. If you have migrated to v.2019.019.0, the upgrade script that runs during an upgrade takes care of moving your datasets to HBase-backed storage. For information, see HBASE.
- Added the ability to back up and restore Tamr Core from Google Cloud Platform (GCP) BigTable. Previously, you could back up and restore Tamr deployments backed by HBase. Beginning with this release, if your Tamr Core datasets use GCP BigTable instead of HBase, you can back up and restore them. The GCP BigTable backup in Tamr Core creates SequenceFiles; note that the HBase backup creates HBase snapshot files. You can back up to a local filesystem, HDFS, or GCS. See Backup Configuration.
- Fixed the administration utility, `admin-unify.sh`, so that it does not require quotes when unsetting a custom parameter to an empty value in `tamr_config.yml`. Previously, it required specifying quotes for an empty string value, such as `TAMR_JOB_SPARK_BIGQUERY_JAR: ""`. With the fix, to unset a custom parameter's value, you can omit the quotes and leave the value empty (quotes are still allowed).
v2019.020.0 Release Notes
What's New
General
- Implemented fixed-ratio Spark cores and memory allocation:

| Total memory available for Spark | Driver memory | Executor JVM instances | Total memory available for executors | Executor cores |
|---|---|---|---|---|
| Less than 9g | 1g, 1 core | 1 | Total available | Max available, but no more than 1 core per 1g executor memory |
| 19g or less | 3g, 1 core | 1 | Total available | Max available, but no more than 1 core per 2g executor memory |
| 35g or less | 3g, 1 core | 2 | Total available/2 | As above |
| 51g or less | 3g, 1 core | 3 | Total available/3 | As above |
| Less than 67g | 3g, 1 core | 4 | Total available/4 | As above |
| 67g and greater | 3g, 1 core | 4 | 16g | As above |
| Unknown (remote Spark cluster) | 3g, 1 core | 4 | 16g | 8 |
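The executor-instance column of the table above can be expressed as a small lookup function. This is an illustrative sketch of the thresholds in the table, not Tamr's code:

```python
def executor_instances(total_spark_memory_gb):
    """Number of executor JVM instances from the fixed-ratio table above.
    Pass None for an unknown total (remote Spark cluster)."""
    if total_spark_memory_gb is None:   # unknown / remote Spark cluster
        return 4
    if total_spark_memory_gb <= 19:     # also covers the "less than 9g" row
        return 1
    if total_spark_memory_gb <= 35:
        return 2
    if total_spark_memory_gb <= 51:
        return 3
    return 4                            # less than 67g, and 67g and greater
```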
- Do not run a local instance of Spark if remote Spark is used
-- There is a new boolean configuration property called TAMR_REMOTE_SPARK_ENABLED. This property is false by default; if set to true, start-dependencies.sh will not start a local Spark process.
-- Important: This variable is not automatically set on upgrade, so if you are using a remote Spark cluster (Yarn, Dataproc, etc.), manually set this property to true.
- Ability to cancel running Spark jobs
-- You can now cancel submitted and running Spark jobs in addition to pending jobs.
-- Cancellation is a best-effort and asynchronous action. The job might succeed or fail before the cancellation takes effect, which causes the cancellation to fail.
General Improvements and Major Bug Fixes
General
- When opening the “Add new CSV” dialog in the “Add a new dataset” dialog on the “Datasets” page in Transformations, the file picker opens automatically.
- Set the initial number of HBase regions to 1
- Clarify description of unified dataset attributes on Schema Mapping page
- Jobs running progress bar resized
- Updated Unify log collection script
Mastering
- Running an update results job no longer causes a write lock exception on the mastering dataset if another mastering job is running.
- The Publish clusters job now lists the project it was run on in the project column on the Jobs page.
Transformations
- Transformations can be previewed on custom dataset samples.
-- To configure this, follow the steps on this page: Setting a custom preview sample.
-- Note: Configuring this functionality is API-only (though the effects can be seen in the UI).
v2019.019.2 Patch Release Notes
This patch addresses the following Apache Log4j vulnerabilities by updating Tamr Core to use Apache Log4j version 2.17.0:
- Apache Log4j CVE-2021-45105
- Apache Log4j CVE-2021-45046
- Apache Log4j CVE-2021-44228
For full details regarding these vulnerabilities and Tamr Core, refer to Tamr's Updates on Apache Log4j Vulnerabilities article.
This patch fully remediates these three vulnerabilities in Tamr Core and Elasticsearch. Install this patch regardless of whether you have taken any of the remediation steps in the article referenced above.
Note: Patch v2019.019.1 was prepared but never released. v2019.019.2 is the latest version.
v2019.019.0 Release Notes
What's New
- Added two transformation functions that compute the maximum and minimum value of an attribute or expression based on the bit size of the elements contained in a group. See "Transformations" in this document.
- ElasticSearch is upgraded to version 6.8.2.
- HBase is required. Beginning with this release, HBase is required when you deploy Tamr Core, as we have stopped supporting deployments with datasets backed by a local filesystem. The transition period allowed you to turn off HBase and continue using a local filesystem for your dataset storage with Tamr Core. Beginning with this release and in the future releases, we strongly recommend that you enable HBase when you upgrade. For upgrading instructions, see the HBase section in this document.
- SSL support with Postgres. You can connect to an external Postgres instance via SSL.
- Tamr Python client v.0.9.0 is released.
- Access control improvements.
- Improvements in preferred sources in golden records.
Major Improvements
ElasticSearch 6.8.2 Upgrade
This release includes an upgrade of the internally used Elasticsearch to v.6.8.2, similar to the v.2019.002 Tamr upgrade in which ElasticSearch was also upgraded. A summary of the upgrade procedure follows:
- The upgrade takes longer due to reindexing.
- If you are upgrading from a much earlier version, we recommend that you create a backup, upgrade first to v0.52.0 and make sure your deployment is working properly, then create another backup and upgrade to v2019.019.0.
- If you have stale projects (those you no longer need or that are in a broken state), we recommend that you bring these projects up to date or delete them before running an upgrade to v2019.019.0. The upgrade scripts may fail when trying to start reindexing jobs for projects in these states.
- This note applies to you only if you used published clusters before upgrading to version 2019.019. The upgrade of ElasticSearch materializes datasets (this is also known as reindexing). In version 2019.017, the schema for internal derived datasets in published clusters changed. These internal derived datasets are not automatically indexed as part of the ElasticSearch upgrade. Therefore, if you haven't republished clusters after an upgrade to version 2019.017, upgrading to version 2019.019 may lead to errors with downstream jobs that rely on the changed attributes in the published cluster datasets. To avoid these errors, rerun the publish clusters API or use Publish after you upgrade.
HBase is Required
Beginning with this release v.2019.019, HBase is required when you deploy Tamr Core. The following statements describe this requirement:
- Tamr Core initially adopted HBase as the default storage mechanism for datasets in version 2018.044 in 2018. Until the current release, version 2019.019, you could continue deploying Tamr Core with datasets stored in the local filesystem, or running Tamr Core with datasets stored locally in tandem with datasets backed by HBase.
- To achieve further performance improvements and allow scaling out Tamr Core deployments, you must migrate datasets to HBase. This has been required as of v2019.011 and beyond in order to avoid known bugs and instance failures. In this release, we repeat this requirement, in preparation for enforcing it during upgrades.
- The summary is:
- Enable HBase.
- After you upgrade, at any point of using v.2019.019, migrate datasets and prepare for the enforcement of HBase in future releases.
To enable HBase:
1. Upgrade to version 2019.019. In some cases, such as when you are upgrading from versions prior to v.2019.08 to versions 2019.08 - 2019.020, the upgrade script may fail with an `"HBase is not enabled"` error. In this case, follow step 2 in this procedure to enable HBase, and rerun the upgrade.
2. Set the `TAMR_HBASE_ENABLED` configuration variable in the `admin-unify.sh` admin tool to `true`. This also requires restarting Tamr and its dependencies. For information about using `unify-admin.sh`, see Configuring Tamr.
To migrate existing datasets:
1. Verify that the `TAMR_HBASE_ENABLED` configuration variable in the `admin-unify.sh` admin tool is set to `true`. See Configuring Tamr.
2. Convert existing datasets to HBase using the `/datasets/moveDatasetsToStorageDriver` API. For example, use this URL: `http://<your-tamr-host>:9100/api/dataset/datasets/moveDatasetsToStorageDriver?destinationDriver=hbase` or this command: `curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' 'http://<your-tamr-host>:9100/api/dataset/datasets/moveDatasetsToStorageDriver?destinationDriver=hbase'`. It starts a job that registers all datasets with HBase. While the job is running, it shows on the Jobs page. There is a known issue where the job's details obscure the X sign for closing the job window; you may need to zoom out of that screen to be able to close it. To see which datasets are stored in HBase, use `/datasets/<name>/status`.
3. Set the `TAMR_HBASE_AS_PRIMARY_STORAGE` configuration variable in the `admin-unify.sh` admin tool to `true`. This ensures that going forward, all datasets are stored in HBase by default. See Configuring Tamr.
Golden Records
- Fixed an issue where in golden records, filters with Min and Max value conditions did not ignore empty values. Note: other rules and filters behave correctly and ignore empty values.
- Fixed an issue with "ghost" datasets that weren't showing in all filters. "Ghost" input datasets are those that are no longer present in your project but were present previously. The fix ensures that all of the dataset filters in golden records list such ghost datasets. Previously, only some filters behaved correctly. You can also now delete such input datasets from the golden records rules.
- You can apply the golden record rules for longest and shortest value to attributes with types other than String. Previously, the rules for longest and shortest value accepted only String array input columns.
- Improved the behavior of publishing golden records. When you publish golden records in this release, you also update the rules at the same time. This is reflected in the user interface: Publish is now renamed to Update and Publish. Previously, Publish used only your most recently saved rules, without updating them first.
- Fixed an issue in golden records where input datasets that were once added, then removed and added again weren't marked as New.
- Fixed an issue with rule overrides not behaving correctly. Previously, adding or deleting an override changed the number of overrides for all other attributes to zero.
Access Control
Continued improvements and bug fixes to ensure the proper behavior of policies and access controls for users in Tamr Core. In particular, made the following changes:
- You can run operations on user groups, such as deleting them, in the user interface. Previously, you could run some of the operations only in the APIs.
- Renamed Permissions navigation menu to Policies for consistency.
- Fixed an issue where the Projects page would not load unless the user had the "read all datasets" permission.
Transformations
Added two new aggregation functions, max_size() and min_size(), that compute the maximum and minimum value of an attribute or expression based on the bit size of the elements in a group. The functions take an attribute of any type (including arrays) and produce the value with the largest or smallest "bitsize value" among the input values considered. The "bitsize value" is a measure of how much information, in bits, a data value represents:
- For Null values, the bitsize value is always zero bits.
- For Boolean values, it is always 1 bit.
- For Strings, it is the number of characters in the string.
- Integer and Float values are represented with 32-bit precision.
- Long and Double values are represented with 64-bit precision.
- For attributes of type Array, the bitsize value is the sum of the element bit sizes plus the bit size of the array's length.
For example, the string "1.00" has a larger bitsize value than the string "1.0". Note that empty values are filtered out. For examples, see max_size and min_size.
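As a rough illustration of these rules (a sketch, not Tamr's implementation), the bitsize comparison could look like the following in Python. Note that Python does not distinguish a 32-bit Float from a 64-bit Double, so floats are treated as 64-bit here, and the bit size of an array's length is assumed to be 32:

```python
def bitsize(value):
    """Approximate 'bitsize value' of a single element, per the rules above."""
    if value is None:
        return 0                     # Null: always zero bits
    if isinstance(value, bool):      # check bool before int: bool is an int subtype
        return 1                     # Boolean: always 1 bit
    if isinstance(value, str):
        return len(value)            # String: number of characters
    if isinstance(value, int):
        return 32                    # Integer: 32-bit precision
    if isinstance(value, float):
        return 64                    # Python floats are doubles: 64-bit precision
    if isinstance(value, list):
        # Array: sum of element bit sizes, plus the array length's bit size
        # (assumed 32 bits here for illustration).
        return sum(bitsize(v) for v in value) + 32
    raise TypeError(f"unsupported type: {type(value)}")

def max_size(group):
    """Value in the group with the largest bitsize; empty values are ignored."""
    candidates = [v for v in group if v is not None]
    return max(candidates, key=bitsize) if candidates else None

def min_size(group):
    """Value in the group with the smallest bitsize; empty values are ignored."""
    candidates = [v for v in group if v is not None]
    return min(candidates, key=bitsize) if candidates else None
```

For example, max_size(["1.0", "1.00", None]) returns "1.00", because the four-character string represents more bits than the three-character one, and the Null value is filtered out.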
APIs
- The PUT operation in the /v1/projects/id API allows changing the name of the project. The project name in the versioned API corresponds to the "display name".
- The POST and PUT operations in the /v1/datasets/id API allow adding tags at dataset creation time and updating them.
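As a sketch, the request bodies for these calls might look like the following. The field names here ("name", "tags") are assumptions based on this release note, not a verified schema; consult the versioned API documentation for the actual shapes.

```python
import json

# Hypothetical request bodies only -- field names are illustrative.

# PUT /v1/projects/<id>: change the project's display name.
rename_project_body = {"name": "Customer Mastering"}

# POST /v1/datasets: create a dataset with tags attached at creation time.
create_dataset_body = {
    "name": "customers.csv",
    "tags": ["crm", "source"],
}

# PUT /v1/datasets/<id>: update the tags on an existing dataset.
update_dataset_body = {"tags": ["crm", "source", "deduplicated"]}

# Serialize as you would for an HTTP request body.
payload = json.dumps(update_dataset_body)
print(payload)
```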
Configuration
- Added a property, TAMR_YARN_NODE_MANAGER_HOST, to configure a custom hostname for the YARN NodeManager, yarn.nodemanager.hostname. See YARN Cluster Manager Jobs.
- You can configure a custom YARN node manager UI port (instead of the default port) by setting TAMR_YARN_NODE_MANAGER_PORT using the Tamr Core administrative tool, unify-admin.sh.
- Added a --dry-run parameter for the commands that set configuration values. See Admin Tool Command Reference.
- You can connect to an external Postgres instance via SSL. The database instance URL now allows specifying the following string to the JDBC driver: jdbc:postgresql://<host>:<port>/<database>?ssl=true. See Postgres.
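For example, the --dry-run parameter and the SSL-enabled database URL could be combined as follows. This is illustrative only: the variable name TAMR_PERSISTENCE_DB_URL is an assumption, so check Configuring Tamr for the exact configuration variable your deployment uses.

```shell
# Preview a configuration change without applying it (illustrative;
# TAMR_PERSISTENCE_DB_URL is an assumed variable name).
./unify-admin.sh config:set --dry-run \
  "TAMR_PERSISTENCE_DB_URL=jdbc:postgresql://<host>:<port>/<database>?ssl=true"

# Apply the change for real once the dry run looks correct.
./unify-admin.sh config:set \
  "TAMR_PERSISTENCE_DB_URL=jdbc:postgresql://<host>:<port>/<database>?ssl=true"
```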
Other Improvements
Made the following improvements:
- ElasticSearch improvements. Upgraded ElasticSearch from 6.4.2 to 6.8.2, which has performance implications during the upgrade to this version. The ElasticSearch upgrade improves the performance of mastering projects and establishes a more stable ElasticSearch snapshot/restore process with HDFS. ElasticSearch 6.8.2 is also compatible with Google Cloud Platform requirements and is required for enabling ongoing cloud provider integrations.
- Performance improvements. Dataset and Dataset Catalog pages load faster. Improved performance of Tamr Core mastering workflows by tuning the Spark joins configuration.
- User interface improvements. Renamed Upload CSV to Upload File to account for the fact that Tamr Core allows adding datasets other than CSV. Fixed the behavior of the About screen.
- Fixed an issue with upgrades where reindexing of the cluster locks failed if an empty project existed. This was a known issue in v.2019.016.
- Fixed an issue where previewing a dataset did not work in the project's Datasets tab, after uploading and adding a dataset to a project. This was a known issue in version 2019.017.
- Added logging about database migrations to log files and stdout.
- Fixed an issue with the Tamr Core administrative tool that did not allow specifying relative paths when setting configuration from a file, for example: unify-admin.sh config:set --file <path-to-file>.
- Fixed an issue where longitude and latitude were not showing for polygons in the user interface.
- Fixed an issue where Tamr Core was making calls to async-io.org on startup. These calls came from the messaging framework that Tamr Core uses internally.
- User interface improvements:
- The Mastering user interface now indicates when there are no labels. Previously, the confusion matrix was empty in this case.
- The list of comments automatically scrolls to the top of the list to show the newest comments first when you add them in Categorization, Pairs, and Clusters pages.
- The user interface reflects that the "Loading datasets" process is in progress when you load datasets on the Dataset Catalog page.
- The message indicating that the dataset is out-of-date in the navigation bar is now located to the side of menu controls and not in between them.
Support Tickets
Fixed the following support tickets:
- Fixed an issue where a job for updating the unified dataset was stalled.
- Fixed an issue where pair assignments disappeared and expert-assigned labels became "verified by API" after regenerating pairs. The issue was found in version 2019.012 and fixed in version 2019.019.
- Fixed an issue where authorization did not work for users with mixed-case usernames.
- Fixed an issue where Tamr Core would not restart due to an inconsistent YARN jobs state. This issue affects versions 2019.014 - 2019.016 and is fixed in v.2019.017. Contact Support to obtain a patch for affected versions, or upgrade to version 2019.017 or greater.
- Fixed an issue where connecting to an external Postgres instance via a JDBC driver did not allow configuring secure access via SSL. The fix enables specifying SSL in the database instance URL for the JDBC driver as follows: jdbc:postgresql://<host>:<port>/<database>?ssl=true.
- Fixed an issue where jobs could not start, or running jobs caused a "Waiting for resources" error, due to port contention for the YARN node manager. The fix enables you to specify your own port numbers in YARN. For more information, see YARN Cluster Manager Jobs.