About This Documentation

The documentation in this section describes the steps you need to follow to install the Tamr software package on a single provisioned Amazon EC2 instance. The size of your Tamr deployment on a single node depends on the size of the instance and other parameters. For more information, see Single-Node Deployments. For information on deploying Tamr in a multi-node AWS environment, contact your Tamr account representative.

Note: This documentation provides the minimum required instructions for deploying Tamr on a single node in AWS. In addition to this documentation, we strongly recommend that you follow the AWS security recommendations in the Shared Responsibility Model in the AWS documentation. Also see the AWS Security blog post How to get specific security information about AWS services.

What is Tamr?

Tamr software uses a patented machine-learning-based approach to deliver up-to-date data while keeping you in the loop. Rather than using deterministic rules programmed by a developer to combine a handful of data sources for consumption, Tamr relies on probabilistic models created through machine learning and enriched by human expertise to quickly unify data at unprecedented scale. The results are breakthrough insights that lead to new growth opportunities, cost savings, and operational improvements -- delivered in a fraction of the time and cost of traditional approaches. Tamr offers specific solutions across use cases and industries, including agile data mastering, agile customer mastering, procurement analytics, GDPR compliance, and M&A integration.

Prerequisites

Any deployment of Tamr, whether in the cloud or on-premises, must meet these Requirements. You must also acquire a Tamr license. In addition:

To install Tamr in your AWS cloud, you must be familiar with Amazon EC2. To launch Tamr, you will need EC2 create permissions on an existing AWS account. To install Tamr, see Installation.
To manage Tamr, you must have Postgres experience, basic Linux experience, and permissions to execute bash scripts and commands on your Amazon EC2 instance.
It is useful to have create permissions for Amazon RDS for your storage layer, EBS for persistent volumes, and Amazon S3 for backups of your Tamr instance. For information about security in AWS, see Shared Responsibility Model in the AWS documentation.

Creating an IAM role

This section is adapted from AWS documentation.

Create a user
Create a group
Add user to group
Create a policy
Attach policy to a group

Step 1: Create a user

You can use the AWS Management Console to create IAM users.
To create one or more IAM users (console):

Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
In the navigation pane, choose Users and then choose Add user.
Type the user name for the new user. This is the sign-in name for AWS. If you want to add more than one user at the same time, choose Add another user for each additional user and type their user names. You can add up to 10 users at one time.
Note
User names can be a combination of up to 64 letters, digits, and these characters: plus (+), equal (=), comma (,), period (.), at sign (@), and hyphen (-). Names must be unique within an account. They are not distinguished by case. For example, you cannot create two users named TESTUSER and testuser. For more information about limitations on IAM entities, see Limitations on IAM Entities and Objects.
Select the type of access this set of users will have. You can select programmatic access, access to the AWS Management Console, or both.

Select Programmatic access if the users require access to the API, AWS CLI, or Tools for Windows PowerShell. This creates an access key for each new user. You can view or download the access keys when you get to the Final page.
Select AWS Management Console access if the users require access to the AWS Management Console. This creates a password for each new user.
For Console password, choose one of the following:
- Autogenerated password. Each user gets a randomly generated password that meets the account password policy in effect (if any). You can view or download the passwords when you get to the Final page.
- Custom password. Each user is assigned the password that you type in the box.

Choose Next: Review to see all of the choices you made up to this point. When you are ready to proceed, choose Create user.
To view the users' access keys (access key IDs and secret access keys), choose Show next to each password and access key that you want to see. To save the access keys, choose Download .csv and then save the file to a safe location.

Important: This is your only opportunity to view or download the secret access keys, and you must provide this information to your users before they can use the AWS API. Save the user's new access key ID and secret access key in a safe and secure place. You will not have access to the secret keys again after this step.

Provide each user with their credentials. On the final page you can choose Send email next to each user. Your local mail client opens with a draft that you can customize and send. The email template includes the following details for each user:

User name
URL to the account sign-in page. Use the following example, substituting the correct account ID number or account alias: https://AWS-account-ID or alias.signin.aws.amazon.com/console

For more information, see How IAM Users Sign In to AWS.

Important
The user's password is not included in the generated email. You must provide it to the user in a way that complies with your organization's security guidelines.

Step 2: Create a group
To set up a group, you need to create the group. Then give the group permissions based on the type of work that you expect the users in the group to do. Finally, add users to the group.
For information about the permissions that you need in order to create a group, see Permissions Required to Access IAM Resources.
To create an IAM group and attach policies (console):

Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
In the navigation pane, click Groups and then click Create New Group.
In the Group Name box, type the name of the group and then click Next Step.
Important
Group names must be unique within an account. They are not distinguished by case. For example, you cannot create groups named both ADMINS and admins.
Click Create Group.

Step 3: Add user to group
You can use the AWS Management Console to add a user to a group.

Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
In the navigation pane, choose Groups and then choose the name of the group.
Choose the Users tab and then choose Add Users to Group. Select the check box next to the users you want to add.
Choose Add Users.

Step 4: Create a policy
Create a customer managed policy that allows a user to sign in to the AWS Management Console with read-write access to Amazon S3. This will allow an administrator to fully manage S3 resources associated with the specific bucket that Tamr will store data in. This bucket should be specified in place of <your-bucket-name-here> in order to give users the most restrictive permissions possible. These permissions will allow for the creation and deployment of Tamr on EC2, but are more restrictive than the administrative permissions described in the Security section. Those roles are more permissive in order to allow the administrator to access the deployment for modification after it has been created.

To create the policy for a user:

Sign in to the IAM console at https://console.aws.amazon.com/iam/ with your user that has administrator permissions.
In the navigation pane, choose Policies.
In the content pane, choose Create policy.
Choose the JSON tab and copy the text from the following JSON policy document:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Stmt1324645872606",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:AbortMultipartUpload",
                "s3:RestoreObject",
                "s3:Get*",
                "s3:*"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<your-bucket-name-here>/*"
            ],
            "Condition": {
                "BoolIfExists": {
                    "aws:MultiFactorAuthPresent": true
                }
            }
        }
    ]
}
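As a rough illustration of substituting the bucket name, the following Python sketch fills in the placeholder and prints the resulting policy document so that it can be pasted into the JSON tab. The function name build_s3_policy and the bucket name example-tamr-backups are hypothetical; the statements themselves are copied from the policy above.

```python
import json


def build_s3_policy(bucket_name: str) -> dict:
    """Return the policy document above with the bucket placeholder filled in."""
    if not bucket_name or "<" in bucket_name or ">" in bucket_name:
        raise ValueError("provide a real bucket name, not the placeholder")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read and list access, as in the first statement above.
                "Effect": "Allow",
                "Action": ["s3:Get*", "s3:List*"],
                "Resource": "*",
            },
            {
                # Object-level access limited to the named bucket,
                # and only when MFA is present (if the key exists).
                "Sid": "Stmt1324645872606",
                "Action": [
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:AbortMultipartUpload",
                    "s3:RestoreObject",
                    "s3:Get*",
                    "s3:*",
                ],
                "Effect": "Allow",
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
                "Condition": {
                    "BoolIfExists": {"aws:MultiFactorAuthPresent": True}
                },
            },
        ],
    }


if __name__ == "__main__":
    # "example-tamr-backups" is a hypothetical bucket name.
    print(json.dumps(build_s3_policy("example-tamr-backups"), indent=4))
```

Generating the document from a function, rather than editing the JSON by hand, makes it harder to leave the placeholder in by accident.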

Step 5: Attach a policy to a group
You can use the AWS Management Console to add permissions to an identity (user, group, or role). To do this, attach managed policies that control permissions, or specify a policy that serves as a permissions boundary.

To use a managed policy as a permissions policy for an identity

Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
In the navigation pane, choose Policies.
In the list of policies, select the check box next to the name of the policy to attach. You can use the Filter menu and the search box to filter the list of policies.
Choose Policy actions, and then choose Attach.
Select one or more identities to attach the policy to. You can use the Filter menu and the search box to filter the list of principal entities. After selecting the identities, choose Attach policy.

Creating a Security Group

This section is adapted from AWS documentation.
We recommend configuring the security group to accept inbound traffic only from your IP address or internal network hosts, and to permit outbound traffic to all IP addresses (0.0.0.0/0).

To create a new security group using the console

Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
In the navigation pane, choose Security Groups.
Choose Create Security Group.
Specify a name and description for the security group.
For VPC, choose the ID of your VPC.
Add inbound rules. On the Inbound tab, choose Add Rule.

For Type, select All traffic.
Leave the default for Protocol and Port Range.
For Source, choose the following:
- Custom: in the provided field, you must specify your or your company's IP address in CIDR notation, a CIDR block, or another security group. Typically, permissible source traffic would be restricted to originating from internal network hosts only.
For Description, you can optionally specify a description for the rule.

Click on the Outbound tab, choose Add Rule, and do the following:

For Type, select All traffic.
For Destination, choose the following:
- Anywhere: automatically adds the 0.0.0.0/0 IPv4 CIDR block. This option enables outbound traffic to all IP addresses.
For Description, you can optionally specify a description for the rule.

Choose Create.
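The inbound and outbound rules described above can also be written down as data before you enter them in the console. The following Python sketch validates an IPv4 source CIDR with the standard ipaddress module and builds rule dictionaries. The function names are hypothetical, the address 203.0.113.0/24 is a documentation-only example range, and the dictionary layout is assumed to mirror the IpPermissions structure of the EC2 security group API; verify it against the AWS reference before relying on it.

```python
import ipaddress


def inbound_rule(source_cidr: str, description: str = "") -> dict:
    """Build an all-traffic inbound rule restricted to one IPv4 CIDR block."""
    # strict=False normalizes an address such as 203.0.113.7/24 to its network.
    network = ipaddress.IPv4Network(source_cidr, strict=False)
    if network.prefixlen == 0:
        raise ValueError("inbound source must not be open to all addresses")
    ip_range = {"CidrIp": str(network)}
    if description:
        ip_range["Description"] = description
    # "-1" is the protocol value that stands for all traffic.
    return {"IpProtocol": "-1", "IpRanges": [ip_range]}


def outbound_rule_anywhere() -> dict:
    """Build the all-traffic outbound rule to 0.0.0.0/0."""
    return {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}


if __name__ == "__main__":
    print(inbound_rule("203.0.113.0/24", "internal network hosts"))
    print(outbound_rule_anywhere())
```

Rejecting 0.0.0.0/0 as an inbound source enforces the recommendation to accept traffic only from known hosts.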

Creating an EC2 Instance

This section is adapted from AWS documentation.

Before you can launch and connect to an Amazon EC2 instance, you need a key pair. If you already have a key pair, you can use it for this purpose. Otherwise, create one in the Amazon EC2 console by following the steps in Setting Up with Amazon EC2 in the Amazon EC2 User Guide for Linux Instances, and then launch your EC2 instance.

To launch the EC2 instance

Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
Choose Launch Instance.
In Step 1: Choose an Amazon Machine Image (AMI), find an Amazon Linux AMI at the top of the list and choose Select.
In Step 2: Choose an Instance Type, select an instance type, such as r5.4xlarge, and then choose Next: Configure Instance Details. For more information on choosing a size, see Sizing.
In Step 3: Configure Instance Details, choose Network, and then choose the entry for your default VPC. It should look something like vpc-xxxxxxx (172.31.0.0/16) (default).

a. Choose Subnet, and then choose a subnet in any Availability Zone.

b. Choose Next: Add Storage. Choose Add New Volume, select EBS from the dropdown menu, and enter 2000 or 3000 in the Size (GiB) field, depending on whether you have chosen a large or extra-large deployment. See Sizing for more details.

Choose Next: Tag Instance. Add optional tags.
Name your instance and choose Next: Configure Security Group.
In Step 6: Configure Security Group, review the contents of this page, ensure that Assign a security group is set to Select an existing security group, and choose the group you created using the instructions above.
Choose Review and Launch.
Choose Launch.
Select the check box for the key pair that you created, and then choose Launch Instances.
Choose View Instances and verify that your instance has been created.

Architecture Diagrams

You can launch Tamr on a single EC2 instance. Optionally, you can use an external persistent disk for data storage and a hosted PostgreSQL database on RDS.

You can create Tamr backups using the backup API. We recommend storing your backups in a different Availability Zone than your instance, in case of Availability Zone failures.

Figure: A deployment of Tamr that utilizes AWS RDS for PostgreSQL and stores backups in a different Availability Zone, but the same Region, as the instance.

Figure: The most basic Tamr deployment uses a single instance, with Postgres installed directly in a shared subnet.

Security

To install Tamr in AWS, you must have EC2 create permissions and permission to SSH to the EC2 instance once it has been created. If you do not have EC2 permissions, the following JSON defines the IAM role with the minimum permissions needed to create an EC2 instance using the AWS CLI, Console, or API, as well as to start and stop that instance. Additionally, it allows the user to describe all EC2 instances and to pass IAM roles on to created instances. Note that the last permission is limited to S3 access, so that the newly created EC2 instance does not have unlimited access to all AWS resources.

IAM roles are defined by AWS and allow you to control who is authenticated (signed in) and authorized (has permissions) to use resources.

The following IAM role is the minimum needed to deploy Tamr on AWS. It allows the user to create, start, and stop instances in EC2. Additionally, it will allow users to associate and edit IAM permissions with the newly created instance. This user will also have the ability to access S3 resources for things like backup.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:RunInstances",
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:AssociateIamInstanceProfile",
                "ec2:ReplaceIamInstanceProfileAssociation"
            ],
            "Resource": "arn:aws:ec2:*:*:instance/*"
        },
        {
            "Effect": "Allow",
            "Action": "ec2:DescribeInstances",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::123456789012:role/S3Access"
        }
    ]
}
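Before pasting a policy document into the IAM console, it can help to confirm that it parses as JSON and that each statement carries the elements an identity-based policy needs. The following Python sketch is one way to do that with the standard library only. The function name check_policy is hypothetical, and the checks are a small subset of what AWS itself validates.

```python
import json


def check_policy(text: str) -> list:
    """Return a list of problems found in an IAM policy document (empty if none)."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg} (line {err.lineno})"]
    if not isinstance(doc, dict):
        return ["policy document must be a JSON object"]

    statements = doc.get("Statement", [])
    if isinstance(statements, dict):  # a single statement is also allowed
        statements = [statements]
    problems = []
    if not statements:
        problems.append("no Statement element")
    for index, statement in enumerate(statements):
        if statement.get("Effect") not in ("Allow", "Deny"):
            problems.append(f"statement {index}: Effect must be Allow or Deny")
        if "Action" not in statement and "NotAction" not in statement:
            problems.append(f"statement {index}: missing Action")
        if "Resource" not in statement and "NotResource" not in statement:
            problems.append(f"statement {index}: missing Resource")
    return problems


if __name__ == "__main__":
    sample = '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Action": "ec2:DescribeInstances", "Resource": "*"}]}'
    print(check_policy(sample) or "no problems found")
```

A check like this catches syntax slips such as a trailing comma, which the JSON parser rejects.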

Egress and Port Whitelisting

After creation of your EC2 instance, you will need to:

Open port 9100 for inbound web access from internal network hosts.
Allow egress traffic from 0.0.0.0/localhost. Services must be able to communicate with each other via HTTP. This is typically arranged by having them use the loopback network interface with no proxy.

Tamr recommends that you use root EBS volume encryption to protect your data. See Step 5 of Creating an EC2 Instance for more information.

Users may deploy tags as desired to track spend or for other purposes.

Access Key Rotation

Tamr recommends that you rotate IAM access keys. The following steps are taken from AWS Documentation.

In order to rotate access keys, you must have permissions from the following IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ManageAccessKeysForUser",
            "Effect": "Allow",
            "Action": [
                "iam:DeleteAccessKey",
                "iam:GetAccessKeyLastUsed",
                "iam:UpdateAccessKey",
                "iam:GetUser",
                "iam:CreateAccessKey",
                "iam:ListAccessKeys"
            ],
            "Resource": "arn:aws:iam::*:user/${aws:username}"
        },
        {
            "Sid": "ListUsersInConsole",
            "Effect": "Allow",
            "Action": "iam:ListUsers",
            "Resource": "*"
        }
    ]
}

To rotate access keys without interrupting your applications using the console:

While the first access key is still active, create a second access key.

a. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.

b. In the navigation pane, choose Users.

c. Choose the name of the intended user, and then choose the Security credentials tab.

d. Choose Create access key and then choose Download .csv file to save the access key ID and secret access key to a .csv file on your computer. Store the file in a secure location. You will not have access to the secret access key again after this dialog box closes. After you have downloaded the .csv file, choose Close. The new access key is active by default. At this point, the user has two active access keys.

Update all applications and tools to use the new access key.
Determine whether the first access key is still in use by reviewing the Last used column for the oldest access key. One approach is to wait several days and then check the old access key for any use before proceeding.
Even if the Last used column value indicates that the old key has never been used, we recommend that you do not immediately delete the first access key. Instead, choose Make inactive to deactivate the first access key.
Use only the new access key to confirm that your applications are working. Any applications and tools that still use the original access key will stop working at this point because they no longer have access to AWS resources. If you find such an application or tool, you can choose Make active to reenable the first access key. Then return to Step 3 and update this application to use the new key.
After you wait some period of time to ensure that all applications and tools have been updated, you can delete the first access key:

a. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.

b. In the navigation pane, choose Users.

c. Choose the name of the intended user, and then choose the Security credentials tab.

d. Locate the access key to delete and choose its X button at the far right of the row. Then choose Delete to confirm.
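The rotation procedure above depends on knowing which of a user's two active keys is the older one. The following Python sketch picks that key from a list of key records. The function name key_to_retire and the key IDs are hypothetical, and the record fields (AccessKeyId, Status, CreateDate) are assumed to match what the IAM ListAccessKeys operation returns; treat this as an illustration of the selection logic, not as a replacement for the console steps.

```python
from datetime import datetime, timezone


def key_to_retire(keys: list):
    """Return the ID of the older active key, or None if fewer than two are active."""
    active = [key for key in keys if key["Status"] == "Active"]
    if len(active) < 2:
        # Do not retire anything until a second active key exists.
        return None
    oldest = min(active, key=lambda key: key["CreateDate"])
    return oldest["AccessKeyId"]


if __name__ == "__main__":
    # Hypothetical key records for illustration only.
    keys = [
        {"AccessKeyId": "AKIAOLDEXAMPLE", "Status": "Active",
         "CreateDate": datetime(2019, 1, 15, tzinfo=timezone.utc)},
        {"AccessKeyId": "AKIANEWEXAMPLE", "Status": "Active",
         "CreateDate": datetime(2019, 7, 1, tzinfo=timezone.utc)},
    ]
    print(key_to_retire(keys))
```

Returning None when only one key is active guards against deactivating the only working credential.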

Audit Mechanisms

Tamr recommends using CloudTrail or another audit mechanism to keep track of access logs. The following instructions are taken from AWS Documentation.

See CloudTrail pricing for more information on costs associated with Data Events.

As a best practice, we recommend storing all CloudTrail logs in an S3 bucket owned by a different account created solely for audit and monitoring.

To set up CloudTrail using the console:
When you create a trail, you enable ongoing delivery of events as log files to an Amazon S3 bucket that you specify. Creating a trail has many benefits, including:

A record of events that extends past 90 days.
The option to automatically monitor and alarm on specified events by sending log events to Amazon CloudWatch Logs.
The option to query logs and analyze AWS service activity with Amazon Athena.

If you use AWS Organizations, you can create a trail that will log events for all AWS accounts in the organization. A trail with the same name will be created in each member account, and events from each trail will be delivered to the Amazon S3 bucket that you specify.

Note
Only the master account for an organization can create a trail for the organization. Creating a trail for an organization automatically enables integration between CloudTrail and Organizations. For more information, see Creating a Trail for an Organization.

You can configure the following settings when you create or update a trail with the CloudTrail console:

Specify if you want the trail to apply to all Regions or a single Region.
Specify an Amazon S3 bucket to receive log files.
For management and data events, specify if you want to log read-only, write-only, or all events.

Sign in to the AWS Management Console and open the CloudTrail console at https://console.aws.amazon.com/cloudtrail/.
Choose the Region where you want the trail to be created.
Choose Get Started Now.

Tip
If you do not see Get Started Now, choose Trails, and then choose Create trail.

On the Create Trail page, for Trail name, type a name for your trail.
For Apply trail to all regions, choose Yes to receive log files from all Regions. This is the default and recommended setting. If you choose No, the trail logs files only from the Region in which you create the trail.
For Management events, for Read/Write events, choose if you want your trail to log All, Read-only, Write-only, or None, and then choose Save. By default, trails log all management events.
For Data events, you can specify logging data events for Amazon S3 buckets, for AWS Lambda functions, or both. By default, trails don't log data events. Additional charges apply for logging data events.

You can select the option to log all S3 buckets and Lambda functions, or you can specify individual buckets or functions.

For Amazon S3 buckets:

Choose the S3 tab.
To specify a bucket, choose Add S3 bucket. Type the S3 bucket name and prefix (optional) for which you want to log data events. For each bucket, specify whether you want to log Read events, such as GetObject, Write events, such as PutObject, or both.
To log data events for all S3 buckets in your AWS account, select Select all S3 buckets in your account. Then choose whether you want to log Read events, such as GetObject, Write events, such as PutObject, or both. This setting takes precedence over individual settings you configure for individual buckets. For example, if you specify logging Read events for all S3 buckets, and then choose to add a specific bucket for data event logging, Read is already selected for the bucket you added. You cannot clear the selection. You can only configure the option for Write.

Note
Selecting the Select all S3 buckets in your account option enables data event logging for all buckets currently in your AWS account and any buckets you create after you finish creating the trail. It also enables logging of data event activity performed by any user or role in your AWS account, even if that activity is performed on a bucket that belongs to another AWS account.
If the trail applies only to one Region, selecting the Select all S3 buckets in your account option enables data event logging for all buckets in the same Region as your trail and any buckets you create later in that Region. It will not log data events for Amazon S3 buckets in other Regions in your AWS account.

For Storage location, for Create a new S3 bucket, choose Yes to create a bucket. When you create a bucket, CloudTrail creates and applies the required bucket policies.

Note
If you chose No, choose an existing S3 bucket. The bucket policy must grant CloudTrail permission to write to it. For information about manually editing the bucket policy, see Amazon S3 Bucket Policy for CloudTrail.

For S3 bucket, type a name for the bucket you want to designate for log file storage. The name must be globally unique.
Choose Create.
The new trail appears on the Trails page. The Trails page shows the trails in your account from all Regions. In about 15 minutes, CloudTrail publishes log files that show the AWS API calls made in your account. You can see the log files in the S3 bucket that you specified.

Securing Tamr with SSL

Nginx provides an easy way to set up an HTTPS proxy with minimal changes to the usual deployment procedure.

Install Nginx.

On Ubuntu:

sudo apt-get update
sudo apt-get install nginx

On Red Hat:

sudo yum install epel-release
sudo yum update
sudo yum install nginx

If the installation fails with an error about ports, make sure that port 80 is open on the server. If an apache2 service is already running, stop it using

sudo service apache2 stop

Go to /etc/nginx/. Clean up the sites-enabled folder and add the following block to a file called unify.conf:

server {

    # SSL configuration
    # The "ssl" parameter on the listen directives enables TLS for this server.
    listen 9102 ssl default_server;
    listen [::]:9102 ssl default_server;

    root /var/www/html;

    # Add index.php to the list if you are using PHP
    index index.html index.htm index.nginx-debian.html;

    server_name _;

    ssl_certificate /absolute/path/to/cert.crt;
    ssl_certificate_key /absolute/path/to/cert.key;

    ssl_session_cache  builtin:1000  shared:SSL:10m;
    # TLS 1.0 and 1.1 are deprecated, so only TLS 1.2 is enabled here.
    ssl_protocols  TLSv1.2;
    ssl_ciphers HIGH:!aNULL:!eNULL:!EXPORT:!CAMELLIA:!DES:!MD5:!PSK:!RC4;
    ssl_prefer_server_ciphers on;

    location / {

      # Pass the original host, client address, and scheme to Tamr.
      proxy_set_header        Host $host;
      proxy_set_header        X-Real-IP $remote_addr;
      proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header        X-Forwarded-Proto $scheme;

      # Fix the "It appears that your reverse proxy set up is broken" error.
      proxy_pass          http://localhost:9100;
      proxy_read_timeout  90;
      proxy_redirect      http://localhost:9100 https://localhost:9102;
    }
}

The above block proxies requests for https://localhost:9102 to the default http://localhost:9100. If you want to use port 9100 for the HTTPS connection, swap 9100 and 9102 in the above file. Additionally, set the TAMR_UNIFY_BIND_PORT and TAMR_UNIFY_PORT environment variables to 9102.

Important: By default, NGINX limits the size of uploaded files to 1 MB. To remove this limit, set client_max_body_size to 0.

Costs

Tamr runtime costs equal the cost of the deployed EC2 instance, plus EBS cost. Optionally, backups may be stored in S3, which incurs an additional cost on a per GB basis.

Additional optional costs may be incurred from the usage of RDS per DB Instance-hour consumed, and from CloudTrail per data event.

Tamr costs are per license, with additional cost for optional services and support.

Sizing

Limits on services are dictated by AWS rather than Tamr. Limits for EC2, EBS, and S3 are described by AWS here.

For up to 20 million records, Tamr recommends a large (c5.9xlarge EC2 instance) deployment with an attached 2 TB SSD EBS volume.

For larger numbers of records, Tamr recommends an extra-large (c5.18xlarge EC2 instance) deployment with an attached 3 TB SSD EBS volume.

If optionally using PostgreSQL on RDS, Tamr recommends a db.r5.2xlarge instance for up to 20 million records, or db.r5.4xlarge for larger numbers of records. For more information on hosted relational databases, see here.
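The sizing recommendations above reduce to a single threshold at 20 million records. The following Python sketch encodes them as a lookup. The function name recommend_sizing and the dictionary keys are hypothetical; the instance types, volume sizes, and threshold are taken from this section.

```python
RECORD_THRESHOLD = 20_000_000  # "up to 20 million records" is a large deployment


def recommend_sizing(record_count: int, use_rds: bool = False) -> dict:
    """Map a record count to the deployment sizes recommended in this section."""
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    large = record_count <= RECORD_THRESHOLD
    sizing = {
        "deployment": "large" if large else "extra-large",
        "ec2_instance_type": "c5.9xlarge" if large else "c5.18xlarge",
        "ebs_volume_tb": 2 if large else 3,
    }
    if use_rds:
        # RDS is optional; include its instance class only when requested.
        sizing["rds_instance_class"] = "db.r5.2xlarge" if large else "db.r5.4xlarge"
    return sizing


if __name__ == "__main__":
    print(recommend_sizing(5_000_000))
    print(recommend_sizing(50_000_000, use_rds=True))
```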

Deployment Assets

Tamr is deployed on a single EC2 instance, with a suggested encrypted EBS volume for secure data storage. Scaling is addressed with individual EC2 sizing increases. Disaster recovery can be initiated by taking regular backups of Tamr and keeping the backup in S3 in a different Availability Zone than your EC2 instance. General deployment guidelines can be found here.

Deploying Tamr

Once the machine and attached storage have been provisioned, Tamr can be deployed easily. The following steps are taken from the Installation guide.

Tamr requires PostgreSQL to be installed on the EC2 instance for your deployment, or to be hosted using Amazon RDS. Choose and deploy one before installing Tamr.

Install Postgres and Create the Database

If you choose to deploy PostgreSQL on the EC2 instance where Tamr is running, follow these steps:

Install and start Postgres.

For RHEL, please refer to Installing Postgres 9.4 on RHEL 7.
For Ubuntu Server, please refer to Installing Postgres 9.4 on Ubuntu 14.

Download the setup-tamr-database.sql script.

Create the database by executing the provided script setup-tamr-database.sql.

sudo su - postgres
psql -f setup-tamr-database.sql
exit

Create a PostgreSQL DB Instance with Multi-Availability Zone Configuration

If you choose to deploy PostgreSQL using RDS, follow these steps. For the hosted option, Tamr recommends a Multi-AZ (multiple Availability Zone) deployment for RDS. If you do not choose a Multi-AZ configuration, AWS will still automatically back up your DB instance. These instructions are adapted from AWS documentation.

Important: You must complete the tasks in the Setting Up for Amazon RDS section before you can create or connect to a DB instance.

Sign in to the AWS Management Console and open the Amazon RDS console at https://console.aws.amazon.com/rds/.
In the top right corner of the AWS Management Console, choose the AWS Region in which you want to create the DB instance.
In the navigation pane, choose Databases. If the navigation pane is closed, choose the menu icon at the top left to open it.
Choose Create database to open the Select engine page.
On the Select engine page, choose the PostgreSQL icon, and then choose Next.
Next, the Use case page asks if you are planning to use the DB instance you are creating for production. If you are, choose Production. If you choose this option, the failover option Multi-AZ and the Provisioned IOPS storage options are preselected in the following step. Choose Next when you are finished.
On the Specify DB Details page, specify your DB instance information. Choose Next when you are finished.

For This Parameter	Do This
License Model	PostgreSQL has only one license model. Choose postgresql-license to use the general license agreement for PostgreSQL.
DB Engine Version	Choose version `9.4.18`.
DB Instance Class	Choose db.r5.2xlarge or db.r5.4xlarge depending on your number of records. See Sizing for more information.
Multi-AZ Deployment	Choose Yes to have a standby replica of your DB instance created in another Availability Zone for failover support.
Storage Type	Choose the storage type General Purpose (SSD).
Allocated Storage	Enter 64 or 128 to allocate that many GiB of storage for your database. In some cases, allocating a higher amount of storage for your DB instance than the size of your database can improve I/O performance.
DB Instance Identifier	Enter a name for the DB instance that is unique for your account in the AWS Region you chose.
Master Username	Enter a name using alphanumeric characters to use as the master user name to log on to your DB instance. The recommended name is tamr.
Master Password and Confirm Password	Enter a password that contains from 8 to 128 printable ASCII characters (excluding /, ", and @) for your master password, then type the password again in the Confirm Password box.

On the Configure Advanced Settings page, provide additional information that RDS needs to launch the PostgreSQL DB instance. The table shows settings for an example DB instance. Specify your DB instance information, then choose Create database.

For This Parameter	Do This
VPC	Choose the VPC group used by the EC2 instance where Tamr is deployed.
Subnet Group	Choose the subnet group for the VPC used by the EC2 instance where Tamr is deployed.
Publicly Accessible	Choose No, so the DB instance is only accessible from inside the VPC.
Availability Zone	Use the default value of No Preference unless you want to specify an Availability Zone.
VPC Security Group	Choose the security group for the VPC used by the EC2 instance where Tamr is deployed.
Database Name	Enter the name `doit`.
Database Port	Use the default port 5432.
DB Parameter Group	Use the default value.
Option Group	Use the default value.
Copy Tags To Snapshots	Choose this option to have any DB instance tags copied to a DB snapshot when you create a snapshot.
Enable Encryption	Choose Yes to enable encryption at rest for this DB instance.
Backup Retention Period	Set the number of days you want automatic backups of your database to be retained. A minimum recommended value is 7.
Backup Window	Unless you want your database backed up at a specific time, use the default of No Preference.
Enable Enhanced Monitoring	Choose Yes to enable real-time OS monitoring. Amazon RDS provides metrics in real time for the operating system (OS) that your DB instance runs on. You are only charged for Enhanced Monitoring that exceeds the free tier provided by Amazon CloudWatch Logs.
Monitoring Role	Choose Default to use the default IAM role.
Granularity	Choose 60 to monitor the instance every minute.
Auto Minor Version Upgrade	Do not enable auto upgrading.
Maintenance Window	Choose the 30-minute window in which pending modifications to your DB instance are applied. If the time period doesn't matter, choose No Preference.

On the final page, choose Create database.
On the Amazon RDS console, the new DB instance appears in the list of DB instances. The DB instance has a status of creating until it is created and ready for use. When the status changes to available, you can connect to the DB instance. Depending on the DB instance class and the storage allocated, it could take several minutes for the new instance to become available.
Once you have installed Tamr, you must update some environment variables in order to connect your deployment to PostgreSQL properly. This step will be listed under Install Tamr.

Install Tamr

Checklist before proceeding

Current user is the Tamr functional user, e.g. tamr.
Tamr software bundle unify.zip.
Postgres is listening.
Identified an installation directory (not on the root mount).
All specified requirements are met.

Unpack the Tamr software bundle unify.zip in the installation directory.

unzip unify.zip

If you are using RDS for PostgreSQL: Skip this step if you installed Postgres directly onto your EC2 instance. If you are using RDS, you must update your configuration values for Postgres. Before continuing, we recommend that you encrypt your DB password using AES128 so that it is not stored in plaintext. The DB instance endpoint in your DB URL should look similar to <name>.<id>.<region>.rds.amazonaws.com. Set your values as follows:

./tamr/utils/unify-admin.sh config:set TAMR_PERSISTENCE_DB_PASS=<your encrypted password>
./tamr/utils/unify-admin.sh config:set TAMR_PERSISTENCE_DB_URL=jdbc:postgresql://<your DB instance endpoint>:<TAMR_PERSISTENCE_DB_PORT>/<TAMR_PERSISTENCE_DB_NAME>
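
As a minimal sketch of how the DB URL above is assembled, the following shell function joins an endpoint, port, and database name into the JDBC form. The function name build_db_url and the sample endpoint are hypothetical; the port 5432 and the database name doit are the values from the RDS settings table.

```shell
# Hypothetical helper: assemble the JDBC URL expected by TAMR_PERSISTENCE_DB_URL.
# Arguments: <endpoint> <port> <database name>
build_db_url() {
  printf 'jdbc:postgresql://%s:%s/%s\n' "$1" "$2" "$3"
}

# Placeholder endpoint; substitute the endpoint shown in the RDS console.
build_db_url "mydb.abc123.us-east-1.rds.amazonaws.com" 5432 doit
```

The printed value can then be passed to config:set as TAMR_PERSISTENCE_DB_URL.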

Initialize and start all Tamr application dependencies

./tamr/start-dependencies.sh

Start Tamr

./tamr/start-unify.sh

Health Check

Service health can be checked programmatically using Tamr APIs. The health endpoint returns health checks for the service as a whole as well as for Zookeeper, which Tamr uses for configuration management. This section also outlines how to complete health and capacity checks for AWS resources.

Check Storage Capacity of an EC2 instance

You can view descriptive information about your EBS volumes. For example, you can view information about all volumes in a specific Region or view detailed information about a single volume, including its size, volume type, whether the volume is encrypted, which master key was used to encrypt the volume, and the specific instance to which the volume is attached.
You can get additional information about your EBS volumes, such as how much disk space is available, from the operating system on the instance.
Viewing Descriptive Information

To view information about an EBS volume using the console

Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
In the navigation pane, choose Volumes.
To view more information about a volume, select it. In the details pane, you can inspect the information provided about the volume.

To view what EBS (or other) volumes are attached to an Amazon EC2 instance

Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
In the navigation pane, choose Instances.
To view more information about an instance, select it.
In the details pane, you can inspect the information provided about root and block devices.

To view information about an EBS volume using the command line
You can use one of the following commands to view volume attributes. For more information, see Accessing Amazon EC2.

describe-volumes (AWS CLI)
Get-EC2Volume (AWS Tools for Windows PowerShell)
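
As an illustration of the AWS CLI option, the sketch below wraps describe-volumes in a small function that prints the attributes mentioned above for one volume. The function name show_volume is hypothetical, and the function assumes the AWS CLI is installed and configured with credentials for your account and Region.

```shell
# Sketch: summarize one EBS volume with the AWS CLI.
# Argument: <volume-id>, as listed on the Volumes page of the EC2 console.
show_volume() {
  aws ec2 describe-volumes \
    --volume-ids "$1" \
    --query 'Volumes[].{Id:VolumeId,SizeGiB:Size,Type:VolumeType,Encrypted:Encrypted,KmsKey:KmsKeyId,AttachedTo:Attachments[0].InstanceId}' \
    --output table
}
```

Call it as show_volume <volume-id>. Omitting --volume-ids from the underlying command lists all volumes in the configured Region.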

Viewing Free Disk Space
You can get additional information about your EBS volumes, such as how much disk space is available, from the Linux operating system on the instance. For example, use the following command:

[ec2-user ~]$ df -hT /dev/xvda1

This returns:

Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/xvda1     xfs       8.0G  1.2G  6.9G  15% /
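
The Use% column can also be checked from a script. The following is a minimal sketch, not part of the Tamr tooling: the function names use_percent and check_disk are hypothetical, and the threshold is whatever value you choose. It uses df -P so that each filesystem is reported on a single line.

```shell
# Hypothetical helper: read `df -P <path>` style output on stdin and print
# the Use% value of the first data row as a bare number.
use_percent() {
  awk 'NR == 2 { sub(/%/, "", $(NF-1)); print $(NF-1) }'
}

# Warn when the filesystem holding the given path is fuller than a threshold.
# Arguments: <path> <threshold percent>
check_disk() {
  used=$(df -P "$1" | use_percent)
  if [ "$used" -ge "$2" ]; then
    echo "WARNING: $1 is ${used}% full"
    return 1
  fi
  echo "OK: $1 is ${used}% full"
}
```

For example, check_disk / 80 reports on the root filesystem and returns a non-zero exit code when usage is at or above 80 percent. The parsing assumes the mount point contains no spaces.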

Check EC2 Instance Health

With instance status monitoring, you can quickly determine whether Amazon EC2 has detected any problems that might prevent your instances from running applications. Amazon EC2 performs automated checks on every running EC2 instance to identify hardware and software issues. You can view the results of these status checks to identify specific and detectable problems.

You can view status checks using the AWS Management Console.
To view status checks (console)

Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
In the navigation pane, choose Instances.
On the Instances page, the Status Checks column lists the operational status of each instance.
To view the status of a specific instance, select the instance, and then choose the Status Checks tab.
If you have an instance with a failed status check and the instance has been unreachable for over 20 minutes, choose AWS Support to submit a request for assistance. To troubleshoot system or instance status check failures yourself, see Troubleshooting Instances with Failed Status Checks.
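
The same status checks are also available from the AWS CLI. The sketch below is an illustration only: the function name show_status_checks is hypothetical, and it assumes the AWS CLI is installed and configured for the Region that hosts the instance.

```shell
# Sketch: print the system and instance status checks for one EC2 instance.
# Argument: <instance-id>
# --include-all-instances also reports instances that are not running.
show_status_checks() {
  aws ec2 describe-instance-status \
    --instance-ids "$1" \
    --include-all-instances \
    --query 'InstanceStatuses[].{Id:InstanceId,State:InstanceState.Name,System:SystemStatus.Status,Instance:InstanceStatus.Status}' \
    --output table
}
```

Call it as show_status_checks <instance-id> and compare the System and Instance columns with the Status Checks tab in the console.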

Monitor Resources in Multiple Regions

Your EC2 instance and S3 backups should be spread across multiple Availability Zones, and you can monitor AWS resources in multiple Regions using a single CloudWatch dashboard. For example, you can create a dashboard that shows the status of your Availability Zones and Regions.

To monitor resources in multiple Regions in one dashboard

Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
In the navigation pane, choose Metrics.
In the navigation bar, select a Region.
Select the metrics to add to your dashboard.
For Actions, choose Add to dashboard.
For Add to, type a name for the new dashboard and choose Add to dashboard.
Alternatively, to add to an existing dashboard, choose Existing dashboard, select a dashboard, and then choose Add to dashboard.
To add metrics from another Region, select the next Region and repeat these steps.
Choose Save dashboard.

Troubleshoot Instance Recovery Failures

If you have set up automatic recovery of your instance, the following issues can cause it to fail:

Temporary, insufficient capacity of replacement hardware.
The instance has attached instance store storage, which is an unsupported configuration for automatic instance recovery.
There is an ongoing Service Health Dashboard event that prevented the recovery process from successfully executing. Refer to http://status.aws.amazon.com/ for the latest service availability information.
The instance has reached the maximum daily allowance of three recovery attempts.

The automatic recovery process attempts to recover your instance for up to three separate failures per day. If the instance system status check failure persists, we recommend that you manually stop and start the instance.

Check Availability Zone Health

For the latest service availability information, go to http://status.aws.amazon.com/.

An Availability Zone fault can cause a variety of issues, and commonly shows up in one of the following ways:

You cannot create an EC2 instance or other resource in a specific Availability Zone. In this case, select another Availability Zone and try again.
You cannot connect to your instance, even after following the steps outlined in Troubleshooting Connecting to Your Instance. In this case, proceed to Backup and Recovery in Case of Availability Zone Failure.
You receive persistent API errors from your instances. In this case, proceed to Backup and Recovery in Case of Availability Zone Failure.

Check Tamr Health

Check Tamr health by:

navigating to http://<hostname>:9100/docs#!/service/getHealth and clicking Try it out or
using the curl command

curl -X GET --header 'Accept: application/json' 'http://<hostname>:9100/api/service/health'
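
For scripted monitoring, the curl call above can be wrapped so that its exit code reflects the result. This is a minimal sketch: the function names health_url and tamr_is_up are hypothetical, and it only checks that the endpoint answers with an HTTP success status, not the contents of the response body.

```shell
# Hypothetical helper: build the health URL for a host and port.
health_url() {
  printf 'http://%s:%s/api/service/health\n' "$1" "$2"
}

# Return 0 when the endpoint answers with an HTTP success status.
# curl -f turns HTTP error responses into a non-zero exit code.
tamr_is_up() {
  curl -fsS --max-time 10 --header 'Accept: application/json' \
    "$(health_url "$1" "$2")" >/dev/null
}
```

For example, tamr_is_up <hostname> 9100 || echo "health check failed" prints a message when the request fails or times out.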

Additionally, our knowledge base is fully populated with troubleshooting articles and is accessible to our customers.

Health Check Failure

If a health check fails, restart Tamr to recover from the failure.

Check Tamr License

Check Tamr license status by:

navigating to http://<hostname>:9020/docs#!/api/service/health and clicking Try it out or
using the curl command

curl -X GET --header 'Accept: application/json' 'http://<hostname>:9020/api/service/health'

and checking that the values of license and healthy in the response body are true.

If they are not, contact [email protected] to request a new license.

Backup and Recovery

Tamr can be backed up easily. It is recommended that backups are stored in a different Availability Zone than your EC2 instance, or more than one zone, in order to protect against Availability Zone failure. The following steps are taken from the Backup and Restore guides.

If you are using the optional AWS RDS service for PostgreSQL without Multi-Availability Zone support, see the AWS documentation for restoring this part of the stack. Tamr, however, recommends that you implement a Multi-Availability Zone configuration, so this should not be an issue for your instance.

Configuring an S3 Backup Location

To configure an S3 backup location:

Set each of the configuration variables below using the admin tool; see Configuring Tamr Using the Admin Tool.

Configuration Variable	Example Value
TAMR_UNIFY_BACKUP_URI	`s3a://<bucket-name>/<path-to-backup>`
TAMR_UNIFY_BACKUP_AWS_ACCESS_KEY_ID	`<aws-access-key-id>`
TAMR_UNIFY_BACKUP_AWS_SECRET_ACCESS_KEY	`<aws-secret-access-key>`

Backup

Checklist before proceeding

Backup is configured (Configuring Backup).
Backup location has sufficient free space.

Generating a Backup

Generate a backup of Tamr and wait for its completion by polling for backup status.

Generate Backup: POST /v1/backups
Generate a backup of Tamr and capture the relativeId of the backup from the response.
Wait For Backup: GET /v1/backups/{backupId}
Using the captured relativeId from Step 1, poll the status of the backup until status.state=SUCCEEDED is received.
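
The polling in the second step can be sketched as a small shell loop. This is an illustration under stated assumptions, not Tamr's own tooling: the function name poll_until_succeeded is hypothetical, the command that fetches the current status.state value is supplied by the caller, and FAILED as a terminal state is an assumption (the steps above only name SUCCEEDED).

```shell
# Poll a state-producing command until it prints SUCCEEDED.
# Usage: poll_until_succeeded <max_tries> <sleep_seconds> <command> [args...]
# <command> is a caller-supplied placeholder that must print the current
# status.state value, for example by calling GET /v1/backups/{backupId}
# and extracting the field from the response.
poll_until_succeeded() {
  max_tries=$1
  sleep_seconds=$2
  shift 2
  try=1
  while [ "$try" -le "$max_tries" ]; do
    state=$("$@")
    case "$state" in
      SUCCEEDED) return 0 ;;
      # Assumption: FAILED is treated as terminal; it is not named in the steps above.
      FAILED) echo "stopped in state $state" >&2; return 1 ;;
    esac
    sleep "$sleep_seconds"
    try=$((try + 1))
  done
  echo "gave up after $max_tries tries" >&2
  return 2
}
```

Bounding the number of tries keeps a stalled backup from blocking a script indefinitely. The same loop applies to the restore status endpoint, GET /v1/instance/restore.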

Restore

Checklist before proceeding

Existing Tamr installation of identical major and minor version (Installation). The patch version does not need to match exactly.
Tamr functional user has read and write permission on the backup URI.

Restoring a Backup

Data Deletion: Restoring a backup deletes all data in the instance of Tamr being restored.

Automatic Restart: Restoring a backup automatically restarts Tamr.

Restore a backup of Tamr and wait for its completion by polling for restore status.

Restore Backup: POST /v1/instance/restore
Restore a backup of Tamr by reading a backup specified in the POST body. Tamr is unavailable during restore.
Wait For Restore: GET /v1/instance/restore
Poll the status of the currently running or completed restore until status.state=SUCCEEDED is received.

Backup and Recovery in Case of Availability Zone Failure

In the case of Availability Zone failure, you should restore Tamr from a backup stored in S3 in a different Availability Zone. Follow these steps to restore Tamr:

Spin up a new EC2 instance to deploy Tamr in another Availability Zone. Follow the instructions outlined in the Creating an EC2 Instance section.
Deploy Tamr. Make sure the version you deploy matches the version that you have previously backed up. If you were using local PostgreSQL, your data will be stored in the backup in S3. Otherwise, restore your RDS instance using the backup that AWS automatically takes.
Set your local access credentials TAMR_UNIFY_BACKUP_AWS_ACCESS_KEY_ID and TAMR_UNIFY_BACKUP_AWS_SECRET_ACCESS_KEY before initiating the restore.

./tamr/utils/unify-admin.sh config:set TAMR_UNIFY_BACKUP_AWS_ACCESS_KEY_ID=<your backup access key id>
./tamr/utils/unify-admin.sh config:set TAMR_UNIFY_BACKUP_AWS_SECRET_ACCESS_KEY=<your backup secret access key>

Restore Backup: POST /v1/instance/restore
Restore a backup of Tamr by reading a backup specified in the POST body. Tamr is unavailable during restore. The POST body will be the path to your S3 backup.
Wait For Restore: GET /v1/instance/restore
Poll the status of the currently running or completed restore until status.state=SUCCEEDED is received.

Backup and Recovery in Case of Region Failure

If you have stored a backup on a local filesystem or in another Region, you can use the steps below to recover from a Region failure.

In the case of Region failure, you should restore Tamr from a backup stored in S3 in a different Region. Follow these steps to restore Tamr:

Spin up a new EC2 instance to deploy Tamr in another Region. Follow the instructions outlined in the Creating an EC2 Instance section.
Deploy Tamr. Make sure the version you deploy matches the version that you have previously backed up. If you were using local PostgreSQL, your data will be stored in the backup in S3. Otherwise, restore your RDS instance using the backup that AWS automatically takes.
Set your local access credentials TAMR_UNIFY_BACKUP_AWS_ACCESS_KEY_ID and TAMR_UNIFY_BACKUP_AWS_SECRET_ACCESS_KEY before initiating the restore.

./tamr/utils/unify-admin.sh config:set TAMR_UNIFY_BACKUP_AWS_ACCESS_KEY_ID=<your backup access key id>
./tamr/utils/unify-admin.sh config:set TAMR_UNIFY_BACKUP_AWS_SECRET_ACCESS_KEY=<your backup secret access key>

Restore Backup: POST /v1/instance/restore
Restore a backup of Tamr by reading a backup specified in the POST body. Tamr is unavailable during restore. The POST body will be the path to your S3 backup.
Wait For Restore: GET /v1/instance/restore
Poll the status of the currently running or completed restore until status.state=SUCCEEDED is received.

Common Issues

After Installing Tamr

Dataset Service Fails To Start (Script Install)
When running the start script start.sh you may find that dataset service won't start with the following error:

  Starting dataset on port 9150 with pid ####
  Service still not responding after ### tries, giving up

This is usually caused by Spark or Elasticsearch starting incorrectly or not being started at all.

To resolve this issue, first check to see if Elasticsearch or Spark are running:

ps aux | grep elastic

ps aux | grep spark

If they are, kill the pids found in the above commands using the kill command. Then, restart Tamr and its dependencies.

Users have trouble logging in after installing Tamr
If you are having trouble logging in, try restarting Tamr.

You may need to kill Spark and Elastic manually by finding the pids using ps -ef | grep <elastic|spark>.

After Upgrading Tamr

Users are experiencing UI problems after upgrade
If users are experiencing issues loading Tamr UI or logging in after upgrade, ensure that they have cleared their web browser cache before logging in.

Routine Maintenance

We recommend that Tamr is upgraded regularly to ensure security and software features are up to date. Upgrade instructions and release notes are documented publicly in conjunction with each release.

In addition, SSL certificates can be maintained by the customer, allowing Tamr Web Application to serve data over HTTPS. Deploying SSL certificates is a recommended practice. See Securing Tamr With SSL for instructions.

Upgrading Tamr

Current Tamr version is at least 0.37.0.
Current user is the Tamr functional user tamr.
The Tamr software bundle unify.zip of the target version.

Upgrade options

--backup [single-node] [optional] Set the system to backup before upgrading.
--healthcheckTimeout <healthcheckTimeout> [optional] Set how long to wait for the healthchecks to time out.
--help [optional] Print out the help message
--installDir <installDir> [single-node] The current installation on disk.
--nobackup [single-node] [optional] Set the system not to backup before upgrading.
--rerun [single-node] [optional] Re-run the upgrade against the current version of the product. Useful for when an error occurs during upgrade and the user wants to re-attempt the upgrade.
--upgradeDir <upgradeDir> [single-node] [optional] The directory where the upgrade version of Tamr exists, if the upgrade zip file has been extracted.
--zipFile <zipFile> [single-node] [optional] The path to the target upgrade zip file.
--zookeeper <full-zk-conf-node-url> [single-node] The ZooKeeper URL of the Tamr configuration node, e.g. zk://localhost:21281/tamr/unify001/conf

Upgrade Procedure

Back up Tamr by following the backup procedure of the source version, e.g. if upgrading from Tamr version 0.41.0, see Version 0.41.0 Backup.
If you are using any Auxiliary Services, disable them before proceeding with the Tamr upgrade. (Disabling an Auxiliary Service)
If upgrading from version 0.37.0 or later, run the admin utility unify-admin.sh with the arguments --upgrade, --zipFile, --installDir, and --zookeeper as follows:

cd /tamr/utils
./unify-admin.sh --upgrade --zipFile <full-path-to-target-version-unify-zip> --installDir <full-path-to-tamr-unify-home>  --zookeeper <full-zk-conf-node-url>

Otherwise, run the admin utility unify-admin.sh with the command upgrade and, as its argument, the full path to the Tamr software bundle unify.zip of the target version:

cd /tamr/utils
./unify-admin.sh upgrade <full-path-to-target-version-unify-zip>

Emergency Maintenance

Tamr can be easily restored from a backup stored in S3 or elsewhere in the event of an emergency or other failures.

Support

For technical support, contact [email protected] or your account representative.

Support Costs, Tiers, and Service Level Agreements

Tamr Support is included with the Tamr annual license fees or as part of annual Maintenance and Support fees.

There is one support tier. This includes:

2-hour response time for severity 1 (outage) issues
4-hour response time for severity 2 (degraded response times) issues
12-hour response time for severity 3 issues
