Production Deployment Cycle for Mastering at Scale

Overall guidelines

This article outlines Tamr’s recommendations for the Production Deployment cycle in a mature state of production, where downstream consumers depend on Tamr’s output being present and stable.

The overall guidelines amount to:

  • Treat the Production (Prod) environment with the utmost care
  • Have separate Development (Dev) and Test environments
  • Fully vet changes in lower environments (Dev and Test) before promoting to Prod
  • Changes with the potential to break Prod (or degrade data quality) should have a clear promotion plan, a sufficient time window allocated, and a clear rollback plan
    • Always take a backup of Prod prior to changes
    • Have a plan to restore from backup if needed

These guidelines are appropriate for a Mastering application with these considerations:

  • A scale-out cloud-based Tamr deployment is used
  • The full data scale in Prod is around 100M records or more
  • Dev has fewer computing resources than Prod and access to only a subset of the data
  • Test has equal computing resources to Prod and access to the full production data. If Test has fewer resources than Prod and a subset of data, this increases the risk of changes failing in Prod
  • The Tamr UI and its backing Elasticsearch are disabled in Prod and Test, to reduce resource requirements and reduce the chance of inadvertent changes

Environments

Tamr recommends having separate Development (Dev), Test, and Production (Prod) environments as part of each deployment. This is the minimum recommendation. Some deployments may benefit from having additional stages to further isolate types of changes and issues. The cost of maintaining multiple environments should be factored in when scoping any Tamr project.

Environment summary

These environments typically differ in three key areas:

  • Source data
  • Resources
  • User access

This table shows Tamr’s recommendations for each environment.

Environment | Data | Resources | User Access
Production (Prod) | Full production data | Complete CPU, RAM, disk and network bandwidth available. | Server: Functional IDs only. UI: Disabled. User management: LDAP/SAML. Admin: 1 or few. Curator, Reviewer: Few, API only.
Test / QA / Pre-Prod | Full production data | Same architecture, CPU, RAM, disk and network bandwidth as Prod. | Server: Functional IDs only. UI: Disabled. User management: LDAP/SAML. Admin: 1 or few. Curator, Reviewer: Few, API only.
Development (Dev) | Subset of production data | Same architecture as Prod but reduced CPU, RAM, disk and network bandwidth. | Server: User and functional IDs. UI: Enabled. User management: LDAP/SAML or Local. Admin: Many. Curator: Many. Reviewer: Many.

Development environment

The Dev environment forms the beginning of the promotion cycle and is used to first create and iterate on a change to bring it to its final form. The final form can be the result of many distinct changes executed in a set order, or a single, simple configuration change.

The Dev environment has the same cloud-based architecture as Prod but reduced computing resources. It has access to a limited subset of the production data. These differences between Dev and Prod can complicate the promotion cycle, because a change that works in Dev may behave differently at full scale. It is therefore especially important that the Test environment mirrors Prod as closely as possible, so that changes promoted from Dev are tested validly before they are promoted to Prod.

The number of users allowed access to the Dev environment should still be limited, but the level of access given to these users should be the broadest of all the environments. User accounts can be managed externally with LDAP or SAML for consistency with Test and Prod, or local account management can be enabled and available to all developer users. A full range of actions may be permitted in Dev, including creating new projects, schema mapping changes, adding or editing transformations, and retraining machine learning models. The Tamr UI and its backing Elasticsearch are enabled in Dev to support development work.

Test environment

The Test environment forms part of the promotion cycle and is used to validate changes before they are promoted to the Prod environment. Test must have the same architecture as Prod and ideally has the same resources and data access as Prod. If Test does not have access to the full production data, this increases the risk that changes will pass Test but fail in Prod.
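
As a minimal sketch of how the parity between Test and Prod could be checked, the following compares two environment descriptions and reports every setting that differs. The dictionary keys (`architecture`, `cpu_cores`, `ram_gb`, `disk_gb`, `full_production_data`) are hypothetical names chosen for illustration; they are not Tamr configuration properties.

```python
# Hypothetical setting names used only for this illustration.
PARITY_KEYS = ("architecture", "cpu_cores", "ram_gb", "disk_gb", "full_production_data")


def parity_gaps(prod, test, keys=PARITY_KEYS):
    """Return {setting: (prod_value, test_value)} for every setting that differs."""
    return {
        key: (prod.get(key), test.get(key))
        for key in keys
        if prod.get(key) != test.get(key)
    }


prod_env = {"architecture": "scale-out", "cpu_cores": 64, "ram_gb": 512,
            "disk_gb": 4000, "full_production_data": True}
test_env = dict(prod_env, ram_gb=256)

# Any non-empty result is a difference that raises the risk of a change
# passing Test but failing in Prod.
gaps = parity_gaps(prod_env, test_env)
```

An empty result means the two descriptions match on every listed setting; each entry otherwise is worth recording in the release plan as a known risk.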

Critically, the Test environment does not send results downstream to the ultimate consumer, for example to business stakeholders. The ultimate consumer only consumes data from Prod. The Test environment may send data to downstream test consumers to verify export connectivity and data quality, such as test versions of data lakes or dashboards.

The only actions performed in the Test environment are for validation of changes before promotion to Prod. If the changes do not validate successfully, fix them in Dev and promote the fixed changes back up to Test. If the changes validate successfully, promote them to the Prod environment.

The Tamr UI and its backing Elasticsearch are disabled in Test to reduce resource requirements. User access is restricted to a small set of people, with one or few end users having the Admin role to run validation and apply critical fixes. A small number of Curator or Reviewer users may be granted access to view data, but interaction would be only via the Tamr APIs. User access should be managed externally with LDAP or SAML, especially if dealing with sensitive data.

Production environment

The Prod environment is the only environment that produces results sent downstream to the ultimate consumer, and it must be insulated from breaking changes. Only nominal, business-as-usual (BAU) user actions are performed directly in Prod. These actions form part of a user workflow that is supported by Tamr’s user roles and permission model and allows only non-breaking changes. The supported workflows and actions are typically presented in a standard operating procedure (SOP) document. All other actions or changes should go through a promotion cycle.

Prod processes the complete production dataset, with the maximum server resources available. The only changes that should occur directly in Prod are regular data updates. The Tamr UI and its backing Elasticsearch are disabled in Prod to reduce resource requirements and reduce the chance of inadvertent changes.

User access is as restrictive as possible, with one or few end users having the Admin role to apply critical fixes. A small number of Curator or Reviewer users may be granted access to view data, but interaction would be only via the Tamr APIs. User access should be managed externally with LDAP or SAML, especially if dealing with sensitive data.

Using subsets of data in Dev and Test

One particular challenge when Mastering at scale is that Dev (and potentially Test) often won't have access to the full production data. Changes that succeed in Dev can fail in Prod due to data differences. Careful consideration of what data to use in the Dev (and Test) environment can help mitigate these issues.

Here are some guidelines on the data to include in Dev:

  • The Dev environment should contain as large an amount of data as possible given its computational resources. Concretely, the Dev environment should have at least 100,000 records. Having more data is preferred if possible.
  • Every data source in a Prod environment should have a representative subset in a Dev environment. For example:
    • If a data source is known to have differences based on geographic region, such as “state”, then some data from every state should be present in Dev, instead of choosing just one state.
    • Duplicate records should exist in the subset to provide training examples for Tamr. Purely random subsets may have too few duplicates for training.
    • If some fields that are important for the Machine Learning (ML) model are known to be empty in a subset of records, the data in Dev should include records in which those fields are filled as well as records in which they are empty.
  • New data sources should always be introduced first to Dev before promotion to Prod.
  • When a large fraction of an existing dataset is being updated or added (10% or more), that change should be first performed in Dev, or at least in Test.
  • Data sources that have been problematic in the past should have a larger subset (or the full dataset) in Dev.
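
The guidelines above can be sketched as a stratified sample that keeps some records from every group (for example, every state) and then pulls in known duplicates so the subset still contains training examples. This is an illustrative sketch, not a Tamr feature: the record layout (a list of dictionaries) and the field names passed as `strata_key`, `cluster_key`, and `id_key` are assumptions.

```python
import random
from collections import defaultdict


def stratified_subset(records, strata_key, fraction, min_per_stratum=1, seed=0):
    """Sample about `fraction` of the records from every stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[record.get(strata_key)].append(record)

    subset = []
    for group in by_stratum.values():
        # Keep at least `min_per_stratum` records so no stratum disappears.
        wanted = max(min_per_stratum, round(len(group) * fraction))
        subset.extend(rng.sample(group, min(len(group), wanted)))
    return subset


def add_known_duplicates(subset, records, cluster_key, id_key):
    """Add the remaining members of every cluster already present in the subset.

    Assumes `cluster_key` holds a previously known duplicate-group identifier.
    """
    chosen_ids = {r[id_key] for r in subset}
    chosen_clusters = {r[cluster_key] for r in subset if r.get(cluster_key) is not None}
    extras = [
        r for r in records
        if r.get(cluster_key) in chosen_clusters and r[id_key] not in chosen_ids
    ]
    return subset + extras
```

A purely random sample would tend to drop small strata and split duplicate groups; sampling per stratum and then completing the duplicate groups addresses both points at the cost of a slightly larger subset.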

Changes within Tamr

Projects within Tamr define how data is mapped, transformed, and mastered. User actions inside Tamr can therefore change or break Tamr’s results.

The only changes that should occur directly in Prod are regular data updates. These fall into one of two categories. One category is small updates to existing datasets (modifying or appending up to 5-10% of the records in a dataset) on some cadence. In the other category, sometimes called batch processing, source datasets are truncated and a completely distinct set of records is uploaded on some cadence. For example, this could mean running a monthly pipeline only on the data generated in the preceding month.

Data quality issues with new records, like a large fraction of null values in important fields, can degrade Tamr’s performance on those records.
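
One way to catch this kind of issue before loading new records is to measure the fraction of empty values in each important field. The sketch below assumes records are available as a list of dictionaries; the 20% threshold and the field names in the usage example are illustrative assumptions, not Tamr recommendations.

```python
def null_fractions(records, fields):
    """Return {field: fraction of records where the field is missing, None, or ''}."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field) in (None, "")) / total
        for field in fields
    }


def fields_over_threshold(records, fields, max_null_fraction=0.2):
    """List the fields whose empty fraction exceeds the (assumed) threshold."""
    fractions = null_fractions(records, fields)
    return sorted(f for f, frac in fractions.items() if frac > max_null_fraction)


new_records = [
    {"name": "Acme Corp", "city": "Boston"},
    {"name": "", "city": "Boston"},
    {"name": None, "city": "Austin"},
    {"name": "Globex", "city": "Austin"},
]
flagged = fields_over_threshold(new_records, ["name", "city"])
```

A batch that flags an important field can be held back and reviewed in a lower environment instead of being loaded directly into Prod.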

Any other changes should be first performed in Dev and promoted to Test and Prod. This includes, but is not limited to:

  • large dataset updates (more than 10% of an existing dataset).
    • If this can’t be done in Dev, at least update this dataset in Test before Prod.
  • adding a new source dataset
  • changing unified dataset schema or mappings
  • changing transformations
  • changing the binning model
  • retraining machine learning models
  • changing golden record rules
  • modifying orchestration code/pipeline
  • upgrading the Tamr software version
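
The rule described in this section can be sketched as a small routing function that returns the environment in which a change should first be made. The change labels and the function itself are hypothetical. The 10% boundary follows the guidance above, with an update of exactly 10% treated conservatively as large.

```python
LARGE_UPDATE_FRACTION = 0.10  # boundary between regular and large data updates


def starting_environment(change_kind, changed_records=0, existing_records=0,
                         routine_batch_reload=False):
    """Return "Prod" only for regular data updates; everything else starts in Dev.

    `change_kind` is a hypothetical label such as "data_update",
    "new_source", "transformations", or "retrain_model".
    """
    if change_kind != "data_update":
        return "Dev"
    if routine_batch_reload:
        # Scheduled truncate-and-reload (batch processing) is a regular update.
        return "Prod"
    if existing_records == 0:
        # Nothing to compare against: treat it like a new source dataset.
        return "Dev"
    fraction = changed_records / existing_records
    return "Dev" if fraction >= LARGE_UPDATE_FRACTION else "Prod"
```

For example, `starting_environment("data_update", 5_000, 100_000)` returns `"Prod"`, while `starting_environment("retrain_model")` returns `"Dev"`. If a large update cannot be run in Dev, the guidance above still applies: run it in Test before Prod.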

Training models in Tamr

Model training in Tamr depends on the user feedback provided by Subject Matter Experts (SMEs). Training must be done in the Dev environment, and Tamr recommends doing it in the Tamr UI.

The most important factor in training models is having a representative subset of the data in Dev. See the section "Using subsets of data in Dev and Test" above. Tamr cannot be trained on masked, redacted, or anonymized data in Dev and then applied to unaltered data in Prod.

Model training actions include:

  • Reviewing record pairs: End users with the Reviewer role can label record pairs as MATCH or NON-MATCH.
  • Verifying/curating record pairs: Record pair labels must be carefully verified by a user with the Admin or Curator role before they are used for model training.
  • Training the model: After pair labels have been verified, users with the Admin or Curator role can run a job within Tamr to train the model.

The trained model is then promoted from Dev to the Test and Prod environments. A simple way to do this is via model export and import APIs.

Tamr’s promotion cycle

Detailed promotion cycle and testing procedures should be authored for each Tamr deployment based on the customer’s requirements for changes to production systems, and the expected set of Tamr artifacts that will be promoted. These procedures should be adhered to for even seemingly small changes. Any changes made to Prod outside of the promotion cycle carry a risk of causing the pipeline to fail.

Tamr recommends the following procedure:

  1. Collect the list of changed artifacts
  2. Create a Product Deployment Release Plan. A template is given in the appendix.
    1. Have a rollback plan if changes are unsuccessful. Tamr recommends taking a Backup of Prod as a rollback option.
  3. Apply the set of changes in Test, following the Plan.
    1. If Test has the same data and resources as Prod, the backup of Prod can be restored first to Test to start from an identical state
  4. Validate the changes according to a Product Deployment Release Plan.
    1. If changes fail in Test and require debugging to make them work:
      1. Restore Test to its pre-failure state (such as restoring from the Prod backup)
      2. Fix the issues in Dev
      3. Retry the changes in Test to make sure instructions are complete before touching Prod
  5. Apply the set of changes in Prod, following the Plan.
  6. Run Prod following the changes.
  7. If the changes are unsuccessful, execute the Rollback Plan. A template is given in the appendix.
    1. If a backup was taken, this means restoring from the backup.

Changes between environments (steps 3 and 5) can either be promoted manually or programmatically using the Tamr APIs. Tamr professional services can provide solutions for programmatic promotion.
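
For programmatic promotion, the control flow of the procedure above might look like the following sketch. It deliberately contains no Tamr API calls: `take_prod_backup`, `apply_changes`, `validate`, and `restore` are hypothetical callables that each deployment would supply, and the returned status strings are likewise illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    name: str
    apply_changes: Callable[[], None]   # applies the release plan's changes
    validate: Callable[[], bool]        # True if the environment runs correctly
    restore: Callable[[str], None]      # restores the environment from a backup id


def promote(test: Environment, prod: Environment,
            take_prod_backup: Callable[[], str]) -> str:
    """Back up Prod, prove the changes in Test, then apply them to Prod."""
    backup_id = take_prod_backup()      # rollback option for Prod

    test.restore(backup_id)             # start Test from the same state as Prod
    test.apply_changes()
    if not test.validate():
        test.restore(backup_id)         # return Test to its pre-failure state
        return "failed_in_test"         # fix in Dev, then retry in Test

    prod.apply_changes()
    if not prod.validate():
        prod.restore(backup_id)         # execute the rollback plan
        return "rolled_back"
    return "promoted"
```

Prod is never touched unless Test validates first, and the backup taken at the start serves both as the starting state for Test and as the rollback option for Prod. Restoring the Prod backup to Test assumes Test has the same data and resources as Prod.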

Backup/restore functionality

Always take a backup of Prod prior to making or promoting changes. This provides a rollback option if changes are unsuccessful in Prod.

Tamr support can provide guidance on the backup/restore procedure for a scale-out Tamr deployment.

Appendix

Template: Product Deployment Release Plan

Step | Action | Expected Result | Execution Date | Actual Result
1 | Backup PROD Tamr | | |
1.1 | Stop PROD Tamr | Tamr successfully stopped | |
1.2 | Execute Backup steps | Backup successful | |
1.3 | Restart PROD Tamr | Tamr successfully started | |
2 | Restore PROD backup to TEST | | |
2.1 | Detailed steps as needed | | |
3 | Execute changes in TEST | | |
3.1 | Modify attribute mappings | Mappings are modified | |
3.2 | Add new transformation | New transformation is saved | |
3.3 | Additional steps as needed | | |
4 | Run TEST with changes | | |
4.1 | Detailed steps as needed | | |
5 | Execute changes in PROD | | |
5.1 | Detailed steps as needed | | |
6 | Run PROD with changes | | |
6.1 | Detailed steps as needed | | |

Template: Product Deployment Rollback Plan

Step | Action | Expected Result | Execution Date | Actual Result
1 | Restore backup to PROD | | |
1.1 | Stop PROD Tamr | Tamr successfully stopped | |
1.2 | Execute restore steps | Restore successful | |
1.3 | Restart PROD Tamr | Tamr successfully started | |