
Deploying Tamr Core on Azure

Install the Tamr Core software package on Microsoft's cloud platform.

This topic provides an overview of Tamr Core's cloud-native offering on Azure, basic network requirements and security features, and deployment prerequisites and steps.

Azure Overview

Tamr leverages Azure’s pre-built technologies to provide cloud-native horizontal scalability to the Tamr Core platform. These technologies include:

  • Azure HDInsight cluster: Optimized for HBase and used for data-scale storage.
  • Azure Databricks: Used for distributed processing with Spark.
  • Azure Data Lake Storage (ADLS): An implementation of HDFS in Azure, used for the remote filesystem.

Sizing Guidelines

For single-node deployment sizing guidelines, see Azure Sizing and Limits.

The following table provides cloud-native environment size configurations for Azure. Size refers to the number of records:

  • Small: 10 million - 100 million records
  • Medium: 100 million - 500 million records
  • Large: 500 million - 1 billion records

Component          Small                       Medium                      Large
Tamr Core          1 x Standard_D8s_v3, 300GB  1 x Standard_D8s_v3, 300GB  1 x Standard_D8s_v3, 300GB
HDInsight Master   2 x Standard_D12_v2         2 x Standard_D12_v2         2 x Standard_D12_v2
HDInsight Worker   10 x Standard_D12_v2        20 x Standard_D12_v2        40 x Standard_D12_v2
Databricks Master  1 x Standard_DS12_v2        1 x Standard_DS12_v2        1 x Standard_DS12_v2
Databricks Worker  23 x Standard_DS12_v2       47 x Standard_DS12_v2       94 x Standard_DS12_v2
Elasticsearch      3 x Standard_E4s_v3, 500GB  N/A                         N/A
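
As a rough illustration of how a record count maps to a size tier and its worker counts, the following Python sketch encodes the tiers and the worker rows from the table above. The names `size_tier`, `WORKERS`, and `TIER_LIMITS` are hypothetical, and the choice to place a boundary value (for example, exactly 100 million records) in the smaller tier is an assumption made for this example, since the documented ranges overlap at their endpoints.

```python
# Worker counts copied from the sizing table above.
WORKERS = {
    "Small": {"hdinsight_workers": 10, "databricks_workers": 23},
    "Medium": {"hdinsight_workers": 20, "databricks_workers": 47},
    "Large": {"hdinsight_workers": 40, "databricks_workers": 94},
}

# Upper bound of each tier, in records. Assumption: a boundary value
# belongs to the smaller tier.
TIER_LIMITS = [
    ("Small", 100_000_000),
    ("Medium", 500_000_000),
    ("Large", 1_000_000_000),
]
MIN_RECORDS = 10_000_000


def size_tier(records: int) -> str:
    """Return the cloud-native size tier for a record count."""
    if records < MIN_RECORDS:
        raise ValueError("below the smallest documented cloud-native tier")
    for name, upper in TIER_LIMITS:
        if records <= upper:
            return name
    raise ValueError("above the largest documented cloud-native tier")


tier = size_tier(250_000_000)
print(tier, WORKERS[tier])
```

Record counts outside the documented 10 million to 1 billion range raise an error rather than guessing a configuration.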

Deployment Structure

Resource Groups

Resource groups allow you to organize resources - such as virtual machines or virtual networks - into groups. Every resource must be created within a resource group. However, resources in different groups can still interact with each other.

Tamr suggests you create resource groups to help you organize the resources used by your deployment.

Active Directory

Tamr Core integrates with LDAP. See LDAP Authentication and Authorization to configure access to Tamr Core with Active Directory.


Networking

The deployment of Tamr Core on Azure uses virtual networks, subnets, and network security groups.

Virtual Networks

Virtual networks define ranges of network addresses that resources use to communicate.


Subnets

A subnet is a block of the virtual network’s IP space.

To integrate the Tamr Core deployment with your existing networking infrastructure, you must provide subnets with the correct service delegations and endpoints for the services running on that subnet. The Tamr-provided Terraform module for each component of the deployment documents that component's subnet requirements.

  • Subnet delegations allow a specific service to create and modify basic network configuration rules for a subnet to help that service communicate properly over the subnet.
  • Service endpoints allow you to route traffic directly from your virtual network to the specific Microsoft Azure services you have specified, such as storage accounts and databases.

Network Security Groups

Network security groups control the ingress and egress traffic to resources on the network.

Tamr Core uses network security groups to allow Tamr Core services to communicate only over the ports that are explicitly allowed by the services that require access.

Overview of the Tamr Terraform Modules

Tamr maintains a set of Terraform modules that create the compute, storage, and networking resources in Azure for the components of a Tamr deployment on Azure.

The Tamr Terraform modules consist of a root module and nested submodules. As a user of the module, you can either use the entirety of the module at its root or the nested modules directly if you need more flexibility.

The root modules do not create virtual networks or subnets, even though some components require them. Each module that requires a subnet includes a nested module that defines the service endpoints and specifies the subnet delegations required by that module.

Why Azure Databricks for Tamr Core?

Tamr Core uses Azure Databricks for its ephemeral Spark clusters. Azure Databricks:

  • Provides APIs that allow the Tamr Core deployment to request clusters on-demand for each Spark job that Tamr Core needs to run.
  • Allows you to observe ongoing Spark jobs in the cluster.
  • Provides a public domain URL accessible with an authentication token.

How is the Deployment Orchestrated?

In a cloud-native Tamr Core deployment on Azure, many services must work together. Tamr orchestrates cloud-native deployment on Azure using both Terraform and manual installation.

Deploy Hardware with Terraform

Terraform by HashiCorp is a cloud provider-agnostic tool for infrastructure deployments. Terraform uses a provider plugin for each cloud platform. To deploy cloud-native VMs in Azure for Tamr Core, Tamr uses the Azure Provider and the preconfigured Terraform modules. The modules allow you to:

  • Preview the expected deployment.
  • Deploy the Azure cloud-native hardware required to run the dependent services and components.
  • Change the deployment’s configuration. For example, you can scale it up or down.
  • Run the modules in a sequenced, orchestrated way.
  • Create your own versions of the versioned modules, store them in Git, and maintain their version history.
  • Configure networking resources, such as virtual networks, IP address ranges, subnets, and Network Security Groups (NSGs).

Install Software with Ansible

Tamr uses Ansible to install Elasticsearch on the deployed Azure VMs. Ansible is an open source tool for provisioning systems, maintaining their configuration, and deploying software packages. For more information, see the Ansible documentation.


Security

Tamr Core cloud-native deployments on Azure provide the following security features:

  • Azure services that encrypt data at rest.
  • A list of allowed IP addresses and network security groups (NSGs) to allow or deny network traffic to your virtual machine instances.
  • Integration with LDAP and SAML for user access management.
  • Encrypted, secure configuration for ZooKeeper.
  • External access to the Tamr Core deployment secured through Azure Firewall configuration and HTTPS, provided by the NGINX reverse proxy server.

For non-production environments, configuring a firewall, NGINX, and HTTPS is strongly recommended but not required.

Important: If you do not configure a firewall, NGINX, and HTTPS in a non-production deployment, all users on the network will have access to the data. Use a unique password for this deployment.

Note: None of the resources deployed for Tamr Core need to be configured for public access for normal operation. Tamr recommends that you do not make these resources publicly accessible.

Azure Deployment Procedures

Prerequisites for Deploying Azure

  • Tamr Core Software Package and License. Contact Tamr Support to obtain the ZIP file for the Tamr Core software package and your Tamr license key.
  • The Azure Command-Line Interface. If you do not already have the Azure CLI installed, follow the Microsoft Azure documentation to install it.
  • An Azure Subscription. When you sign in to the Azure CLI for the first time, configure your account and provide your subscription as follows:
    az account set --subscription <my_subscription>
  • Contributor Role. The Terraform modules require the deployment user or service principal to have a Contributor role.
  • The Terraform Software Package. Install Terraform v0.12.25 or greater on the machine on which you intend to run Terraform templates for the Tamr Core deployment. Tamr Core uses the Azure Provider plugin package v2.11 or greater. The Azure Provider plugin package is frequently upgraded. For information on the Azure plugin releases, see Releases for terraform-provider-azurerm. To ensure that Terraform uses the correct version of the provider, the provider version is included in the configuration in the file. When you run terraform init, Terraform downloads the specified version of that provider’s plugin.
  • Tamr Terraform Modules. Access the Tamr Terraform modules from the GitHub repositories. For more information, see the Tamr Terraform Modules Reference.
  • Ansible Software Package. Install Ansible version 2.5.1 or greater on the machine that has SSH access to the virtual machines on which you intend to install Elasticsearch. This could be the same VM on which you install Tamr Core, a laptop, or another VM.
  • NGINX reverse proxy server. Configure secure external access to Tamr Core over HTTPS through an NGINX reverse proxy. For more information, see Requirements (for NGINX version support), Installing NGINX, and Configuring HTTPS.
  • Azure Firewall. Deploy and configure the Azure Firewall using the Azure portal. See Deploy and Configure the Azure Firewall Using the Azure Portal in the Azure documentation for instructions. Firewall configuration requirements:
    • Allow only internal access to Tamr Core default port 9100 (via TCP).
    • Open port 443 for HTTPS, with a restrictive IP range that you specify using IPv4 addresses in CIDR notation, such as
      Note: If you plan to forward HTTP traffic to HTTPS, also open port 80.

Terraform Prerequisites

  • Tenant ID and Subscription ID. These values can be found with the Azure command line:
    az account list
  • Resource Group for Tamr Core and its Required VMs, Networks, and Subnets. For organizational purposes, Tamr recommends creating a dedicated resource group for Tamr Core. This resource group can include all of the related resources for the deployment that you want to manage as a group. You can create this resource group with Terraform. See azurerm_resource_group in the Terraform documentation. You can also create a resource group in the Azure Portal or with the Azure CLI. See Create Resource Groups in the Microsoft Azure documentation.
  • Virtual Networks and Subnets. Create a dedicated virtual network with at least three subnets where you intend to deploy Tamr Core:
    • Create two subnets, public and private, for establishing a connection to Azure Databricks.
    • Create an additional subnet for establishing connections to other components in the deployment.
      Note: These are minimum requirements, as the virtual network setup and size depend on the size of the deployment address space. For example, Azure Databricks requires a CIDR block with a prefix between /16 and /24 for the virtual network, and a prefix of up to /26 for the subnets. For the detailed list of virtual network and subnet requirements, see Deploy Azure Databricks in your Azure virtual network (VNet injection) in the Microsoft Azure Databricks documentation.
    • Add Microsoft.AzureActiveDirectory and Microsoft.Storage to the list of the allowed service endpoints on this subnet. This allows the ADLS service endpoint to securely connect to a Tamr Core VM on the subnet. This is required if you are loading source data from ADLS.
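
The CIDR constraints in the note above can be checked before you run Terraform. The following Python sketch uses the standard `ipaddress` module; the function name `check_databricks_network` and the example address ranges are hypothetical, and the checks cover only the prefix limits quoted in this topic, not the full list of requirements in the Azure Databricks documentation.

```python
import ipaddress


def check_databricks_network(vnet_cidr, subnet_cidrs):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    vnet = ipaddress.ip_network(vnet_cidr)
    # The virtual network prefix must be between /16 and /24.
    if not 16 <= vnet.prefixlen <= 24:
        problems.append(f"{vnet}: virtual network prefix must be /16 to /24")
    for cidr in subnet_cidrs:
        subnet = ipaddress.ip_network(cidr)
        # A subnet prefix may be at most /26 (longer prefixes are too small).
        if subnet.prefixlen > 26:
            problems.append(f"{subnet}: subnet prefix must be /26 or shorter")
        # Each subnet must lie inside the virtual network's address space.
        if not subnet.subnet_of(vnet):
            problems.append(f"{subnet}: not inside virtual network {vnet}")
    return problems


# Hypothetical address plan: public, private, and one additional subnet.
print(check_databricks_network(
    "10.0.0.0/16", ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]))
```

An empty list indicates that the address plan satisfies the prefix limits; each string in a non-empty list names the offending range.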

Installation Process

Step 1: Configure the Terraform Modules

Important: Before you begin this task, verify that you have configured the required virtual networks and subnets as specified above. Tamr also recommends creating a resource group in which to provision the Tamr Core resources instead of using an existing resource group.

The source files for each module in the repository include:

  • The file that describes how to use the module.
  • Example uses of the modules in the /examples folder.

For each module, fill in the required parameters and any other optional parameters you may require for your deployment.

For more detail, see the example module uses in each module’s /examples folder.

Note: Tamr's Terraform modules follow their own release cycles. To use a newer version, update the version in the ref query parameter in the source variable as shown below:

module "example_source_reference" {
  source = "git::<name of repo>?ref=0.1.0"
}

Step 2: Apply the Terraform Modules

Tip: There are many possible workflows for applying Terraform modules. If you are proficient in Terraform, feel free to adapt this workflow to suit your situation.

To apply the Terraform modules:

  1. On the machine from which you intend to control the deployments, sign in to Azure:
    az login
  2. Initialize the modules in the directory with your Terraform code:
    terraform init
  3. Review the resources that are being provisioned, changed, or deleted:
    terraform plan
  4. Apply the plan:
    terraform apply

Note: Some deprecation warnings may appear while the Terraform commands run. You can ignore these messages.
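
If you prefer to drive the workflow above from a script, the sequence can be sketched in Python as follows. The names `WORKFLOW` and `run_workflow` are hypothetical and are not part of Tamr's tooling; the sketch assumes that the `az` and `terraform` executables are on your PATH and that `workdir` is the directory containing your Terraform code. Because `terraform apply` asks for confirmation, run the script from an interactive terminal.

```python
import subprocess

# The commands from the numbered steps above, in order.
WORKFLOW = [
    ["az", "login"],
    ["terraform", "init"],
    ["terraform", "plan"],
    ["terraform", "apply"],
]


def run_workflow(workdir, runner=subprocess.run):
    """Run each command in order inside workdir; stop at the first failure."""
    for command in WORKFLOW:
        # With subprocess.run, check=True raises CalledProcessError
        # when a command exits with a non-zero status.
        runner(command, cwd=workdir, check=True)
```

The `runner` parameter exists only so that the sequence can be exercised without invoking the real tools, for example by passing a function that records the commands it receives.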

Step 3: Create Active Directory Service Account & Obtain ADLS Access Information

Create an Azure AD application and service principal that can access resources as described in the Microsoft documentation.

Add the following permissions to the application:

  • Azure Storage user_impersonation
  • Azure Data Lake user_impersonation

Assign the following roles:

  • Storage Account Contributor
  • Storage Blob Data Contributor

Once created, retrieve the following information:

  1. Record the Application (client) ID and tenant ID from the portal as described in the Microsoft documentation.
  2. On the left-hand side, select Certificates & secrets.
  3. Add a new client secret key by selecting New client secret as described in the Microsoft documentation.
  4. Record the value of your new client secret.

Step 4: Create a Databricks Access Token and Mount ADLS to Databricks Filesystem

To access Databricks REST APIs, use the Azure console to create a personal Databricks access token. See Authentication using Azure Databricks personal access tokens.

While in the Databricks portal, create a new Python notebook, then follow the instructions to mount the Azure Data Lake Storage Gen2 filesystem to the Databricks filesystem.

Note: The mount-name must match the ADLS storage container name.

For more detailed information about mounting, see the dbutils API definitions.
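
As an illustration of what the notebook cell might contain, the following Python sketch builds the arguments for `dbutils.fs.mount` using the service principal values recorded in Step 3. The helper name `adls_mount_args` and its parameters are hypothetical. The configuration keys follow the OAuth service principal pattern in the Databricks documentation for ADLS Gen2 and should be verified against the current version of that documentation. The mount point reuses the container name, in line with the note above.

```python
def adls_mount_args(container, storage_account, client_id, client_secret, tenant_id):
    """Build keyword arguments for dbutils.fs.mount (OAuth, service principal)."""
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    return {
        "source": f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
        # The mount name must match the ADLS storage container name.
        "mount_point": f"/mnt/{container}",
        "extra_configs": configs,
    }


# In a Databricks notebook, where dbutils is predefined, the call would be:
# dbutils.fs.mount(**adls_mount_args("<container>", "<storage-account>",
#                                    "<client-id>", "<client-secret>", "<tenant-id>"))
```

In practice, read the client secret from a Databricks secret scope rather than pasting it into the notebook.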

Step 5: Configure HBase on the HDInsight Cluster

To configure HBase on the HDInsight cluster:

  1. Sign in to the Ambari UI at https://<hdinsight_cluster_name>
    • The username is the value of the gateway_username variable, supplied to Tamr's HDInsight module to create the HDInsight cluster.
    • The password is the value of the gateway_password variable, supplied to Tamr's HDInsight module to create the HDInsight cluster.
  2. Navigate to the HBase section on the left-hand side of the Ambari UI, select the Configs tab at the top of the page, then set the following properties in the Settings tab further down on the page:
    • In the Server section, set Memstore Flush Size to 268435456 bytes (256 MB).
    • In the Server section, set HBase Region Block Multiplier to 8.
    • In the Client section, set Maximum Client Retries to 3.
    • In the Disk section, set Maximum Region File Size to 1073741824 bytes (1 GB).
    • In the Timeouts section, set Zookeeper Session Timeout to 60000 milliseconds (1 minute).
    • In the Timeouts section, set HBase RPC Timeout to 600000 milliseconds (10 minutes).
    The Ambari UI may warn that the settings for Maximum Client Retries and HBase RPC Timeout are not recommended; this is expected and can be ignored.
  3. Set the following properties in the Advanced tab:
    • In the Custom hbase-site section, select Add Property and paste the following lines as custom properties:
    • In the Advanced hbase-site section, change hstore blocking storefiles to 200.
  4. Select Save.
  5. Restart all affected HBase resources by selecting Restart at the top of the page after you save your changes.

Important: These values are subject to change and you may need to work with your Tamr representative to tune them further. In addition, see Optimize Apache HBase with Apache Ambari in Azure HDInsight in the Microsoft Azure documentation.
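
The byte and millisecond values in this step use binary size units (1 MB = 1024 × 1024 bytes) and can be derived rather than typed by hand. The following Python sketch shows the arithmetic; the dictionary name `HBASE_SETTINGS` is hypothetical and its keys are simply the labels shown in the Ambari UI.

```python
MIB = 1024 * 1024   # bytes in 1 MB (binary)
GIB = 1024 * MIB    # bytes in 1 GB (binary)

HBASE_SETTINGS = {
    "Memstore Flush Size": 256 * MIB,          # bytes
    "Maximum Region File Size": 1 * GIB,       # bytes
    "Zookeeper Session Timeout": 60 * 1000,    # milliseconds (1 minute)
    "HBase RPC Timeout": 10 * 60 * 1000,       # milliseconds (10 minutes)
}

for name, value in HBASE_SETTINGS.items():
    print(f"{name}: {value}")
```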

Step 6: Install, Deploy, and Configure Elasticsearch

Tamr Core uses Ansible to automate the installation and configuration of Elasticsearch on each node in your cluster, which requires the Elasticsearch role. You can download it with Ansible’s “package manager” called Ansible Galaxy. For information about Ansible visit the Ansible documentation. For information about Ansible Galaxy, see the Ansible Galaxy documentation.

To install, deploy, and configure Elasticsearch:

  1. Install the Elasticsearch role from Ansible Galaxy:
ansible-galaxy install elastic.elasticsearch,7.1.1
  2. Create an Ansible playbook file named es_playbook.yml and paste these example contents into it:
- name: tamr_es
  hosts: es-nodes
   - role: elastic.elasticsearch
    es_api_host: "{{ ansible_default_ipv4.address }}"
    es_major_version: "6.x"
    es_version: "6.8.2"
    es_heap_size: <number>g
    es_config: "<any_es_node_ip>:9300"
     http.port: 9200
      transport.tcp.port: 9300 "{{ ansible_default_ipv4.address }}" true
      node.master: true
      script.allowed_types: inline
      indices.fielddata.cache.size: 20%
      indices.query.bool.max_clause_count: 4096
      indices.memory.index_buffer_size: "25%"
  3. Set es_heap_size: <number>g, where <number> is equal to half of the RAM size on the VM where you are deploying Elasticsearch. For example, if your VM size is set to Standard_D8_v3, you have 32GB of RAM, so set es_heap_size: 16g.
  4. Replace <any_es_node_ip> with the address of any of the Elasticsearch VMs you intend to use.

    Important: Do not change es_version: "6.8.2" or script.allowed_types: inline; otherwise, Tamr Core will not work.

  5. Create an esnodes.txt inventory file with a list of newline-separated IP addresses for all of the VMs in your Elasticsearch cluster.
  6. Run the Ansible playbook from a machine that has SSH access to the Elasticsearch VMs:
ansible-playbook -u <es-admin-tamr> -i esnodes.txt --key-file /path/to/id_rsa es_playbook.yml


  • <es-admin-tamr> is a placeholder that you must replace with the name of the Linux administrator user for Elasticsearch on each of the target VMs. The name must be all lowercase with no underscores. This is the same username that you supplied to the Terraform module when creating the VMs.
  • /path/to/id_rsa is the path to the SSH private key that allows access.
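
The heap size rule and the inventory file format from this step can be sketched in Python as follows. The function names `es_heap_size` and `inventory_text` are hypothetical, and the IP addresses are placeholders for illustration only; substitute the addresses of your own Elasticsearch VMs.

```python
import ipaddress


def es_heap_size(vm_ram_gb: int) -> str:
    """Return the es_heap_size value: half of the VM's RAM, in gigabytes."""
    return f"{vm_ram_gb // 2}g"


def inventory_text(node_ips):
    """Return esnodes.txt contents: one IP address per line."""
    # ip_address raises ValueError if an entry is not a valid IP address.
    return "\n".join(str(ipaddress.ip_address(ip)) for ip in node_ips) + "\n"


# A VM with 32GB of RAM, as in the Standard_D8_v3 example above.
print(es_heap_size(32))

# Placeholder addresses for a three-node cluster.
with open("esnodes.txt", "w") as inventory:
    inventory.write(inventory_text(["10.0.2.4", "10.0.2.5", "10.0.2.6"]))
```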

Step 7: Install the Tamr Core Software Package

To install Tamr Core, SSH into the Tamr Core VM and follow the procedure for installing PostgreSQL and unzipping the Tamr Core Software Package on the VM. You install PostgreSQL and Tamr Core on the same VM.

Do not start Tamr Core or its dependencies at this point; you must complete all of the steps for Azure deployment to configure PostgreSQL and Tamr Core to work with the scale-out deployment.

Step 8: Configure Postgres

To configure Postgres:

  1. Verify that you have set the following configuration variables in /etc/postgresql/12/main/postgresql.conf:
listen_addresses = '*'  # instead of something like 'localhost'
port = 5432
  2. Add entries to /etc/postgresql/12/main/pg_hba.conf for the Tamr VM address and the Databricks subnet range to restrict client access:
# Tamr VM (replace in the form 'a.b.c.d/32')
host  all  all  <REPLACE_ME>  md5

# Databricks Private Subnet (replace in the form 'w.x.y.z/a')
host  all  all  <REPLACE_ME>  md5
  3. If PostgreSQL is currently running and any settings have been changed, restart PostgreSQL. See the procedure for installing PostgreSQL for OS-specific instructions.
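
To avoid typing mistakes in the address fields, the pg_hba.conf entries shown above can be generated and validated with a short script. The following Python sketch is only an illustration: the function name `pg_hba_entries` and the example addresses are hypothetical, and it assumes IPv4 addressing.

```python
import ipaddress


def pg_hba_entries(tamr_vm_ip, databricks_private_subnet):
    """Return pg_hba.conf lines for the Tamr VM and the Databricks subnet."""
    # A single host is written as a /32 network (IPv4 assumed).
    tamr = ipaddress.ip_network(f"{ipaddress.IPv4Address(tamr_vm_ip)}/32")
    # ip_network raises ValueError if the CIDR has host bits set.
    databricks = ipaddress.ip_network(databricks_private_subnet)
    return [
        "# Tamr VM",
        f"host  all  all  {tamr}  md5",
        "",
        "# Databricks Private Subnet",
        f"host  all  all  {databricks}  md5",
    ]


# Placeholder addresses for illustration.
print("\n".join(pg_hba_entries("10.0.2.4", "10.0.1.0/24")))
```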

Step 9: Start ZooKeeper

Run <tamr-home-dir>/tamr/ to start ZooKeeper.

Step 10: Share the HBase Configuration from Azure HDInsight with ZooKeeper

Hadoop clients are configured via a set of xml files, xls, bash, and other tools. In common deployments of Hadoop clusters, storage (HBase) and compute (Spark) services are deployed in the same cluster. In this scenario, the HBase client configuration is already present on the cluster nodes running Spark.

Because the HBase in HDInsight is hosted on separate VMs from the VMs that host Databricks Spark, you must make the HBase client configuration files available to the Spark workers before you can start the Spark cluster. You can do this by hosting the configuration files in the Tamr instance of ZooKeeper. ZooKeeper provides Spark workers the means to download the files that configure the HBase client.

To share HBase configuration with ZooKeeper:

  1. Sign in to the Ambari UI.
  2. Download HBASE_client-configs.tar.gz:
    • Navigate to the HBase section on the left-hand side of the Ambari UI.
    • At top right, select Service Actions and then choose Download Client Configs.
  3. On the Tamr VM, create a <tamr-home-dir>/custom-conf/ directory and copy HBASE_client-configs.tar.gz into it:
    scp HBASE_client-configs.tar.gz <user>@<tamr-vm-ip>:<tamr-home-dir>/custom-conf/
  4. Extract the HBASE_client-configs.tar.gz file on the Tamr VM using tar:
    tar -xvzf HBASE_client-configs.tar.gz
  5. Store the extracted configuration files, including hbase-policy.xml and hbase-site.xml, in ZooKeeper:
<tamr-home-dir>/tamr/utils/ zk:put --file-path <tamr-home-dir>/custom-conf/hbase-site.xml --zk-path zk://localhost:21281/tamr/unify001/hbase-conf/
<tamr-home-dir>/tamr/utils/ zk:put --file-path <tamr-home-dir>/custom-conf/hbase-policy.xml --zk-path zk://localhost:21281/tamr/unify001/hbase-conf/
<tamr-home-dir>/tamr/utils/ zk:put --file-path <tamr-home-dir>/custom-conf/ --zk-path zk://localhost:21281/tamr/unify001/hbase-conf/

Step 11: Configure Tamr

To create and upload a YAML configuration file for Tamr Core:

  1. Create a YAML file at <tamr-home-dir>/custom-conf/config.yaml based on the example file included below. Replace instances of <REPLACE_ME> with the appropriate values for your deployment.
  2. Upload the resulting configuration file config.yaml to Tamr Core:
    <tamr-home-dir>/tamr/utils/ config:set --file <tamr-home-dir>/custom-conf/config.yaml

Step 12: Start Tamr Core and Its Dependencies

For detailed instructions, follow the procedure for installing Tamr Core.

In general, run these commands:



Verifying the Deployment

To verify that HBase and Spark are functioning properly:

  1. SSH to the VM on which Tamr Core is installed.
  2. Navigate to http://<tamr-vm-ip>:9100 and sign in.
  3. Upload a small CSV dataset and profile it.

To verify that Elasticsearch is functioning properly:

  1. Create a schema mapping project.
  2. Add the dataset you just uploaded to the project.
  3. Bootstrap several attributes.
  4. Run Update Unified Dataset.
  5. Verify that you now see records on the Unified Dataset page.

For more detailed instructions, see Tamr Core Installation Verification Steps.

Sample Tamr Core config.yaml File

Tip: Be sure to replace instances of <REPLACE_ME> with the appropriate values for your deployment.
Tip: To avoid naming conflicts, provide a unique value for values marked <UNIQUE_VALUE> for each instance of Tamr you deploy in your Azure environment.

# -- Tamr --

# replace with a valid license key







# -- Postgres --

# replace with IP of the VM that Postgres is running on
TAMR_PERSISTENCE_DB_URL: "jdbc:postgresql://<REPLACE_ME>:5432/doit?sslmode=require"

# replace with the username used to initialize the Postgres database

# replace with the password used to initialize the Postgres database

# -- ElasticSearch --

# replace with the IP of any of the Elasticsearch nodes

# -- Spark --








TAMR_JOB_SPARK_PROPS: "{'spark.dynamicAllocation.enabled':'false',

# -- Databricks Spark --

# replace with the access token obtained in Step 4





# replace with the name assigned to Databricks

# replace with the appropriate Spark version
TAMR_JOB_DATABRICKS_SPARK_VERSION: "<REPLACE_ME>" # (e.g. 6.4.x-esr-scala2.11)

# -- HDInsight HBase --

# replace with the IP of the Tamr VM
TAMR_HBASE_CONFIG_URIS: "zk://<REPLACE_ME>:21281/tamr/unify001/hbase-conf/hbase-policy.xml;zk://<REPLACE_ME>:21281/tamr/unify001/hbase-conf/hbase-site.xml" 

# replace with the IP of the Tamr VM
TAMR_HBASE_EXTRA_URIS: "zk://<REPLACE_ME>:21281/tamr/unify001/hbase-conf/"

TAMR_HBASE_EXTRA_CONFIG: "{'hbase.client.pause':5000,'hbase.client.retries.number':350}"









# -- ADLS Filesystem --

# replace with the storage container name

# replace with the storage container name

# ADLS identifying information

# ID of the Azure Active Directory of the ADLS from step 3

# Service Account
# replace with the application ID of the service account from step 3
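
The ZooKeeper URIs in the HDInsight HBase section of the sample follow a fixed pattern, so they can be assembled from the Tamr VM address. The following Python sketch reproduces the format of the TAMR_HBASE_CONFIG_URIS value shown above; the function name `hbase_config_uris` and the example IP address are hypothetical.

```python
ZK_PORT = 21281
ZK_BASE_PATH = "/tamr/unify001/hbase-conf/"


def hbase_config_uris(tamr_vm_ip, files=("hbase-policy.xml", "hbase-site.xml")):
    """Return a semicolon-separated list of zk:// URIs for the given files."""
    return ";".join(
        f"zk://{tamr_vm_ip}:{ZK_PORT}{ZK_BASE_PATH}{name}" for name in files
    )


# Placeholder address for illustration.
print(hbase_config_uris("10.0.2.4"))
```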

Tamr Terraform Modules Reference

Tamr offers Terraform modules to automate the deployment of scale-out environments in Azure.

In each module, the template contains a README file that describes:

  • The purpose of the module.
  • The list of input values that you can specify to the module, when you run it, for each of the components that the module will be deploying.
  • The list of results, or outputs, that the module produces, once you use it to deploy.

Tamr uses the following Terraform modules:



Module            Description
Tamr VM           Deploys virtual machines for use by Tamr Core, PostgreSQL, and/or Elasticsearch.
ADLS Gen 2        Deploys the ADLS filesystem for use by Tamr and Azure Databricks.
HDInsight HBase   Deploys an HDInsight cluster.
Databricks Spark  Deploys an Azure Databricks workspace.

Important: The modules are subject to change. As a result, this table may not always be up to date. For accurate information on the recent versions of the Tamr Terraform modules, visit the README file included in each of the modules.

Configure the DMS

To move data files from cloud storage into Tamr Core, and exported datasets from Tamr Core to cloud storage, you use the Data Movement Service. See Configuring the Data Movement Service.


Monitoring

You can use Azure Monitor as your cloud monitoring tool to monitor and analyze applications. The tool offers observability into infrastructure and network performance in real time.


Logging

You can use Tamr Core logs, DMS logs, and Azure's logging services for Spark and HBase. See Logging in Cloud Platform Deployments.
