
Deploying Tamr on AWS

Install the Tamr software package on AWS.

AWS Overview

In a cloud-native deployment of Tamr, some or all major dependencies are externalized. Tamr's deployment on AWS makes use of billable services from the AWS stack. A complete Tamr deployment requires the following mandatory services to be configured:

  • Amazon EMR (Elastic MapReduce) - Hosts an HBase cluster for internal data storage and Spark clusters for on-demand computation.
  • Amazon S3 (Simple Storage Service) - Stores persistent data from EMR.
  • Amazon RDS (Relational Database Service) - Provisions a PostgreSQL database instance for application metadata storage.
  • Amazon Elasticsearch Service - Launches an Elasticsearch cluster to power the Tamr UI.
  • Amazon EC2 - Provisions a VM to host the Tamr application.
  • Amazon CloudWatch - Collects metrics from launched AWS resources.

AWS cloud resources typically take about 30 minutes to provision. In addition, plan to allocate one day for configuring and executing the steps described in this topic.

This topic contains an overview of Tamr’s cloud-native offering on AWS, its basic network requirements and security features, and the prerequisites and steps for deploying Tamr.

Requirements for Users Deploying Tamr on AWS

Users deploying Tamr on AWS must be skilled in Linux and Terraform. They must also be familiar with the following AWS services: EC2, EMR, RDS, Elasticsearch Service, S3, and networking (VPC, subnets, security groups).

Costs

Tamr runtime costs equal the cost of the deployed EC2 instance, plus EBS cost. Optionally, backups may be stored in S3, which incurs an additional cost on a per GB basis.

Additional optional costs may be incurred from the usage of RDS per DB Instance-hour consumed, and from CloudTrail per data event.

Tamr costs are per license, with additional cost for optional services and support. For more details on costs, see the Tamr AWS Marketplace product listing.

Sizing Guidelines

For single-node deployment sizing guidelines, see AWS Sizing and Limits.

The following table provides cloud-native (multi-node) environment size configurations for AWS. Size refers to the number of records:

  • Small: 10 million - 100 million records
  • Medium: 100 million - 500 million records
  • Large: 500 million - 1 billion records

Component        | Small                  | Medium                 | Large
Tamr Core        | 1 x r5d.2xlarge, 300GB | 1 x r5d.2xlarge, 300GB | 1 x r5d.2xlarge, 300GB
EMR HBase Master | 1 x r5.xlarge          | 1 x r5.xlarge          | 3 x r5.xlarge
EMR HBase Worker | 20 x r5.xlarge         | 40 x r5.xlarge         | 80 x r5.xlarge
EMR Spark Master | 1 x r5.xlarge          | 1 x r5.xlarge          | 1 x r5.xlarge
EMR Spark Worker | 4 x r5.8xlarge         | 8 x r5.8xlarge         | 16 x r5.8xlarge
Elasticsearch    | 3 x r5d.xlarge, 500GB  | N/A                    | N/A
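
As a rough aid for choosing a column in the sizing table, the following shell sketch maps a record count to a size label. The function name size_tier is hypothetical and is not part of any Tamr tooling. Because the published ranges share their endpoints (100 million and 500 million), the sketch assumes that a boundary value belongs to the smaller tier.

```shell
# Hypothetical helper: map a record count to the size labels used in the
# sizing table. Boundary values are assigned to the smaller tier, which is
# an assumption of this sketch rather than a documented rule.
size_tier() {
  records=$1
  if [ "$records" -lt 10000000 ]; then
    echo "below Small"
  elif [ "$records" -le 100000000 ]; then
    echo "Small"
  elif [ "$records" -le 500000000 ]; then
    echo "Medium"
  elif [ "$records" -le 1000000000 ]; then
    echo "Large"
  else
    echo "above Large"
  fi
}
```

For example, size_tier 250000000 prints Medium. Counts outside the published ranges are reported as such instead of being forced into a tier.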

Limits on AWS Services

Limits on services are dictated by AWS rather than Tamr. See the AWS documentation for limits for EC2, EBS, and S3.

Security Features

AWS services expose controls for enforcing secure practices. Tamr leverages the following practices:

  • Enforcement of encryption at rest on EMR clusters with server-side S3 encryption (SSE-S3) on the underlying S3 buckets used for storing data.
  • Enforcement of encryption at rest and in transit on Elasticsearch cluster nodes.
  • Encryption of Tamr VM’s root volume device.
  • Controlled inbound network traffic to Tamr infrastructure, which you can adapt to your organization’s needs by configuring security group rules to allow specific VPN or organizational IP addresses.
  • Network firewall to control access (internal access via a VPC network or a secure public access over HTTPS) and specify ports for each type of connection that must be kept open. You use these ports to access the Tamr user interface and run commands to check the health of the Tamr instance. Tamr uses the NGINX reverse proxy server to allow clients to access Tamr securely over HTTPS.
    For non-production environments, configuring a firewall, NGINX, and HTTPS is strongly recommended but not required.
    Important: If you do not configure a firewall, NGINX, and HTTPS in a non-production deployment, all users on the network will have access to the data. Use a unique password for this deployment.

See Tamr AWS Network Reference Architecture for more information.

Tamr recommends following the principles of least privilege when deploying on AWS. See AWS documentation for further information.

Tamr encrypts the credentials for RDS in transit and at rest. Alternatively, you can choose to use the AWS Secrets Manager to maintain the RDS credentials in a central, secured location, and propagate the credential information to Tamr. See the AWS Secrets Manager documentation for instructions on creating and managing secrets.

Data Flow in Tamr Cloud-Native Deployment on AWS

The following diagram shows data flow through the Tamr cloud-native deployment on AWS.

[Diagram: Tamr deployed on AWS with component resources described below]

This diagram shows the following components.

Users

As shown in the top left, users log into Tamr using LDAP or SAML authentication.

Source Data

The input dataset(s) can be uploaded from the local filesystem or from a connected external source. See Uploading a Dataset for more information about uploading source data to Tamr.

AWS Cloud Resources for Tamr

  1. Tamr VM - The Tamr application is deployed on a single EC2 instance. A number of internal microservices and external service dependencies, such as ZooKeeper, PostgreSQL client, Grafana, Kibana, and Prometheus, run on the instance as well.
  2. Firewall - AWS Network Firewall is used to control access (internal access via a VPC network or a secure public access over HTTPS) and specify ports for each type of connection that must be kept open. See Network Firewall Prerequisites.
    Note: None of Tamr's deployed resources need to be configured for public access for normal operation. Tamr recommends that these resources not be made publicly accessible.
  3. Data Processing
    • Spark on YARN - To run Spark jobs, Tamr uses Amazon EMR, which is a managed cluster platform that simplifies running big data frameworks.
    • Internal Database HBase - Tamr launches an HBase cluster using Amazon EMR. HBase works by sharing its filesystem and serving as a direct input and output to the EMR framework and execution engine. Customer data may be stored here, including sensitive data.
  4. Storage
    Amazon S3 - Amazon S3 is used as a shared filesystem between Tamr and Spark to store data like HBase StoreFiles (HFiles) and table metadata, .jar files, Spark logs, etc.
  5. Data Services
    • Metadata - Tamr uses Amazon RDS to host a PostgreSQL database instance that stores Tamr application metadata.
    • Search - Tamr uses Amazon Elasticsearch Service to host a cluster that is used as a search engine and powers the Tamr UI. Customer data may be stored here, including sensitive data.
  6. Monitoring
    Metrics - Amazon CloudWatch can be enabled on launched AWS resources to monitor both real-time and historical metrics and gain perspective on how services in the Tamr stack are performing. See also Logging in Cloud Platform Deployments.

How is the Deployment Orchestrated?

To orchestrate Tamr’s cloud-native deployment on AWS, Tamr uses the following approaches.

Terraform to deploy hardware

Terraform by HashiCorp is an infrastructure-as-code tool for managing infrastructure. HashiCorp maintains an AWS provider that allows management of AWS resources using Terraform.

Tamr maintains a set of Terraform modules for deploying the AWS infrastructure needed by Tamr, including a module that creates a configuration file for the Tamr software.

Manual installation of Tamr software

To start the Tamr application, copy the Tamr software and generated Tamr configuration file to the Tamr EC2 instance and run startup scripts bundled in the Tamr software package. The directions for installing Tamr are detailed in Installation Process.

Tamr Terraform Modules Reference

Tamr maintains Terraform modules to provision and manage the infrastructure for a cloud-native environment in AWS.

Each module provides suggested patterns of use in the /examples directory in addition to a minimal in-line example specified in each module’s README file. The README file also provides an overview of what the module provisions, as well as the module’s input and output values.

The following Terraform modules in GitHub are used to deploy an AWS cloud-native Tamr environment.

Module                | Description
AWS Networking        | Deploys network reference architecture, following security best practices.
Tamr VM               | Provisions an Amazon EC2 instance for the Tamr VM.
EMR (HBase, Spark)    | Deploys either (1) a static HBase and/or Spark cluster on Amazon EMR or (2) supporting infrastructure for an ephemeral Spark cluster.
Elasticsearch         | Deploys an Elasticsearch cluster on Amazon Elasticsearch Service.
RDS Postgres Database | Deploys a DB instance running PostgreSQL on Amazon RDS.
S3                    | Deploys S3 buckets and bucket access policies.
Tamr Configuration    | Populates Tamr configuration variables with values needed to set up Tamr software on an EC2 instance.

Deployment Prerequisites

Terraform Prerequisites

Tamr maintains a set of Terraform modules that declare the AWS resources needed for a Tamr deployment. To apply Tamr’s AWS modules, verify that the following prerequisites are met.

Install the Terraform binary

Tamr’s AWS modules currently support Terraform v0.13 and the AWS Provider plugin package v3.36.0 or greater. The AWS Provider is frequently upgraded. For information on new releases, see the Releases page for terraform-provider-aws.

Configure Terraform’s AWS provider with proper credentials

Before using the Tamr Terraform modules, you must ensure that the IAM user or role with which Terraform commands are executed has the appropriate permissions. You also must configure the AWS provider with the necessary credentials. See Terraform IAM Principal Permissions for AWS for detailed policies that grant the necessary permissions.

Note: Learn more about AWS provider permissions in the Terraform documentation.

Networking Prerequisites

Tamr’s AWS deployment modules assume that resources will be deployed into an existing VPC setup. See Tamr AWS Network Reference Architecture for network architecture requirements and details.

Network Firewall and HTTPS Prerequisites

  • Secure external access to Tamr via HTTPS by installing NGINX and configuring a reverse proxy from the NGINX application server. See Installing NGINX and Configuring HTTPS.
  • Configure the AWS Network Firewall. See Getting Started with AWS Network Firewall in the AWS documentation for instructions. Firewall configuration requirements:
    • Allow only internal access to Tamr default port 9100 (via TCP).
    • Open port 443 for HTTPS, with a restrictive IP range that you specify using IPv4 addresses in CIDR notation, such as 1.2.3.4/32.
      Note: If you plan to forward HTTP traffic to HTTPS, also open port 80.
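
Before adding an IP range to a firewall or security group rule, it can help to confirm that the value is well-formed IPv4 CIDR notation. The following is a minimal sketch in POSIX shell. The function name is_ipv4_cidr is hypothetical, and the check is purely syntactic: it does not verify that the range is appropriately restrictive for your organization.

```shell
# Hypothetical helper: succeed (exit status 0) only if the argument looks
# like IPv4 CIDR notation, for example 1.2.3.4/32.
is_ipv4_cidr() {
  cidr=$1
  # Require a slash separating the address from the prefix length.
  case "$cidr" in
    */*) ;;
    *) return 1 ;;
  esac
  addr=${cidr%/*}
  prefix=${cidr##*/}
  # The prefix length must be a one- or two-digit number no larger than 32.
  case "$prefix" in
    [0-9]|[0-9][0-9]) ;;
    *) return 1 ;;
  esac
  [ "$prefix" -le 32 ] || return 1
  # Walk the address one dot-separated octet at a time.
  rest=$addr.
  count=0
  while [ -n "$rest" ]; do
    octet=${rest%%.*}
    rest=${rest#*.}
    case "$octet" in
      [0-9]|[0-9][0-9]|[0-9][0-9][0-9]) ;;
      *) return 1 ;;
    esac
    [ "$octet" -le 255 ] || return 1
    count=$((count + 1))
  done
  # A valid IPv4 address has exactly four octets.
  [ "$count" -eq 4 ]
}
```

A call such as is_ipv4_cidr 1.2.3.4/32 succeeds, while 1.2.3.4/33 or a bare address without a prefix length fails.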

Other Prerequisites

  • Obtain a license key and Tamr software package by contacting Tamr Support at [email protected]. You will need to provide the license key when accessing the Tamr instance via a browser.
  • (Optional) Prepare a small CSV dataset to profile. You can use this dataset to run a small profiling job after Tamr is installed to check the health and readiness of your deployment’s EMR clusters.

Installation Process

Step 1: Configure the Terraform modules

Each module’s source files in the repository include:

  • A README file that describes how to use the module.
    Note: There is a README file for each nested submodule as well.
  • An example with a suggested pattern of use, located in each module repository’s /examples folder.

Invoke each module, filling in the required parameters and any other optional parameters you require for your deployment. In some situations, it may be appropriate to invoke the nested submodules if the root module is too prescriptive for your desired deployment setup.

Tamr recommends that you add values for the Tamr configuration Terraform module last, as many input variables depend on values that are output from the other Tamr AWS Terraform modules.

For reference, the Tamr configuration Terraform module has an example of a full AWS cloud-native deployment with example invocations of all Tamr AWS Terraform modules. Contact your Tamr account representative for access to the Tamr configuration Terraform module repository.

Note: Tamr’s Terraform modules follow their own release cycles. To use a newer version, update the version in the ref query parameter of the source variable in the module invocation:

module "example_source_reference" {
  source = "git@github.com:Datatamer/<repository name>.git?ref=0.1.0"
}

Step 2: Apply the Terraform modules

Note: Please check the Terraform Prerequisites before running the following steps.

  1. Initialize provider and other dependencies.
    terraform init
  2. Check what changes (creations, updates, deletions) will be introduced.
    terraform plan
  3. Apply the changes from the plan. This step can take 20-45 minutes to execute.
    terraform apply
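
The three commands above can also be chained non-interactively, for example in a CI job. The following is a minimal sketch, not Tamr's official procedure. The function name apply_tamr_modules, the TERRAFORM_BIN variable, and the plan file name tfplan are all hypothetical. It saves the plan to a file and applies that saved plan, so the apply step performs the changes that the plan step computed.

```shell
# TERRAFORM_BIN is a hypothetical override for the binary to run;
# by default the sketch uses the terraform found on PATH.
TERRAFORM_BIN=${TERRAFORM_BIN:-terraform}

# Hypothetical helper: run init, plan, and apply in the given directory,
# stopping at the first command that fails.
apply_tamr_modules() {
  workdir=${1:-.}
  (
    cd "$workdir" || exit 1
    # Initialize providers and other dependencies.
    "$TERRAFORM_BIN" init -input=false &&
    # Compute the changes and save them to a plan file.
    "$TERRAFORM_BIN" plan -input=false -out=tfplan &&
    # Apply the saved plan.
    "$TERRAFORM_BIN" apply -input=false tfplan
  )
}
```

You would call it as apply_tamr_modules path/to/root-module. Note that this sketch does not pause between plan and apply, so run the commands separately when you want to review the plan first.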

Step 3: Install the Tamr Software Package

Step 3.0: Prepare EC2 Instance for Tamr Software Installation

To ensure that Tamr is installed and runs properly:

  1. Set ulimit resource limits. For more detailed directions for setting these values, see Setting ulimit Limits.
  2. Install the appropriate PostgreSQL 12 client tools (pg_dump and pg_restore) for your EC2 instance’s operating system.
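
A quick preflight check can confirm that the PostgreSQL client tools, whose binaries are named pg_dump and pg_restore, are on the PATH before you continue. The following is a minimal sketch. The function name missing_tools is hypothetical and not part of the Tamr software package.

```shell
# Hypothetical helper: print the name of each listed command that cannot
# be found on PATH. Prints nothing when every command is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Running missing_tools pg_dump pg_restore on the EC2 instance should print nothing once the clients are installed. To confirm the major version, pg_dump --version reports the installed release. The current open-file limit for the shell can be inspected with ulimit -n. See Setting ulimit Limits for the values Tamr requires.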

Step 3.1: Unzip the Software Package

SSH into the Tamr EC2 instance and unzip the Tamr software package (unify.zip) into the home directory designated for your Tamr application.

Step 3.2: Start ZooKeeper

To start ZooKeeper, run:

<home-directory>/tamr/start-zk.sh

Step 3.3: Mount the Tamr configuration file

Take the output of the terraform-aws-tamr-config module (or manually populate a Tamr configuration file) and create a YAML configuration file on the Tamr EC2 instance.

To set the configuration, run:

<home-directory>/tamr/utils/unify-admin.sh config:set --file /path/to/<Tamr-configuration-filename>.yml

Step 3.4: Start Tamr and its dependencies

To start Tamr’s dependencies and the Tamr application, run:

<home-directory>/tamr/start_dependencies.sh
<home-directory>/tamr/start-unify.sh
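
The commands in Steps 3.2 through 3.4 can be chained so that a failure in one step stops the sequence. The following is a minimal sketch. The function name start_tamr and its two arguments are hypothetical, and the script names are copied as-is from the steps above. It assumes each script returns a non-zero exit status on failure, which this topic does not confirm.

```shell
# Hypothetical helper: run the startup steps in order and stop at the
# first one that fails.
#   $1 - the tamr directory, that is <home-directory>/tamr
#   $2 - path to the YAML configuration file
start_tamr() {
  tamr_home=$1
  config_file=$2
  # Step 3.2: start ZooKeeper.
  "$tamr_home/start-zk.sh" &&
  # Step 3.3: load the configuration file.
  "$tamr_home/utils/unify-admin.sh" config:set --file "$config_file" &&
  # Step 3.4: start the dependencies, then the Tamr application.
  "$tamr_home/start_dependencies.sh" &&
  "$tamr_home/start-unify.sh"
}
```

You would call it as start_tamr <home-directory>/tamr /path/to/<Tamr-configuration-filename>.yml.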

Verify Deployment

To verify that your deployment’s EMR clusters are functioning properly:

  1. Navigate to http://<tamr-ec2-private-ip>:9100 in a browser and log in using your credentials obtained from Tamr Support.
  2. Upload a small CSV dataset, and profile it.
    Tip: If the Tamr UI is not accessible at the mentioned address, check the tamr/logs/ directory on the EC2 instance.

To verify that your deployment’s Elasticsearch cluster is functioning properly:

  1. Create a schema mapping project and add a small CSV dataset to the project.
  2. Bootstrap some attributes, and then run Update Unified Dataset.
    Verify that you now see records on the Unified Dataset page.

For more detailed instructions, see Tamr Installation Verification Steps.

Configure the DMS

Use the Data Movement Service (DMS) to move data files from cloud storage into Tamr and to move exported datasets from Tamr to cloud storage. See Configuring the Data Movement Service.

Additional Resources

See the Support Help Center Knowledge Base for information on maintaining availability in AWS cloud native deployments.

