Deploying Tamr on Google Cloud Platform

Deploy a production-grade Tamr solution to a single instance in Google Cloud Platform (GCP) in a few clicks.

About This Guide

This document describes deployment steps for launching and accessing Tamr on GCP Marketplace, as well as basic network and security requirements, status checks, costs, backups, and support.

Who Should Use This Guide?

This guide is intended for administrators with PostgreSQL experience and basic Linux experience. You must have permissions to run bash scripts and commands on the GCP Tamr instance.

About Tamr

Tamr uses a patented machine-learning-based approach to deliver up-to-date data. The software relies on probabilistic models created through machine learning and enriched by human expertise. This lets you unify data quickly and at unprecedented scale. You no longer need to rely on deterministic rules, programmed by a developer, that combine a handful of data sources for consumption.

The results are breakthrough insights that lead to new growth opportunities, cost savings, and operational improvements -- delivered in a fraction of the time and cost of traditional approaches. Tamr offers specific solutions across use cases and industries, including agile data mastering, agile customer mastering, procurement analytics, GDPR compliance, and M&A Integration.

RESTful APIs make it easy to tie Tamr into an existing data infrastructure. For information, see the API reference.

What Is Included in the Tamr Marketplace Image on GCP?

The Tamr software offered on Google Cloud Platform (GCP) Marketplace is a pre-configured and fully integrated package. It allows you to run human-assisted machine learning jobs on vast amounts of data.
The Tamr GCP Marketplace Image has the following characteristics:

  • It is a single virtual machine (VM) image for the Linux Platform (Ubuntu 18.04).
  • The image includes:
    • A version of the Tamr software, optimized for use with Google Cloud Platform Compute Engine.
    • Software dependencies that Tamr requires, including HBase, Spark on Yarn, Elasticsearch, a Postgres instance, and Zookeeper. For a diagram that shows these components, see Deployments in the Tamr System Administrator Guide. The list of licenses for Open Source software used in the Tamr package is included in the home/ubuntu/tamr/licenses/unify-licenses/licenses directory. After you deploy the instance, you can access this directory by connecting to the instance via SSH.
  • The image uses the BYOL (Bring Your Own License) model in Google Marketplace, which requires you to have a valid license to use it. You are responsible for purchasing and managing your own license from Tamr. For information, see Accessing the Tamr Instance in this guide.

Before You Begin

Before you deploy Tamr, use this checklist to ensure that:

  • You have a Google Cloud account. To create an account, see Get Started with Google Cloud Platform.
  • Your Google Compute Engine (GCE) resource quotas are sufficient to create instances with the characteristics that Tamr requires. For more information, see Sizing and Limits in this guide.
  • You have installed the Google Cloud SDK. You can use it to manage your instance after it is deployed. See Google Cloud SDK Quickstarts in the GCP documentation.
  • You have generated SSH keys. You will need the keys to connect to your instance with SSH. See Managing SSH Keys in the GCP documentation.
  • You are familiar with GCP access management and shared projects. Shared projects allow multiple users in your team to access any virtual machine instance created within the project. Project users can then establish an SSH connection to the GCP instance. To keep your VM instance and SSH key private, create a GCP project in a VPC and then create and launch your VM instances in this project. See Granting Access to Projects in the GCP documentation.
  • You have set firewall rules for your instance. Firewall rules let you specify the type of access (internal access via a VPC network, or secure public access over HTTPS) and the ports that must be kept open for each type of connection. You will use these ports to access the Tamr user interface and run commands to check the health of the Tamr instance. For information, see Setting Up Firewall Rules in this guide.
  • For more information, use the links to the GCP documentation about security included in the Security section in this guide.

Setting Up Firewall Rules

Configure a firewall rule for your project in GCP before you launch an instance, so that you do not need to add it later.

This task describes configuring a firewall rule that allows internal access by your users to Tamr default port 9100 (via TCP). This is the only firewall requirement with Tamr. Internal access means that users in your team are connected to your own VPC and that the project you create in GCP for the Tamr instance is configured to use the designated network from your VPC.

About Accessing Tamr via HTTPS

If you are not using a VPC and need secure external access to Tamr via HTTPS, use an SSL certificate served through an NGINX reverse proxy. For more information, see Configuring HTTPS using the Nginx Proxy in the Tamr System Administrator Guide. The default maximum size for files uploaded through NGINX is 1MB. To remove this limit, set client_max_body_size to 0.

Additionally, you must create a separate firewall rule to open port 443 for HTTPS with a restrictive IP range that you specify using IPv4 addresses in CIDR notation, such as 1.2.3.4/32. For information about firewall rules, see Firewall Rules Overview in the Google Virtual Private Cloud documentation.

To configure a firewall rule for the Tamr instance:

  1. Log in to https://console.cloud.google.com, and select the correct Project from the drop-down list. This should be the project that will contain the Tamr instance. This project’s VPC should allow access to those users in your team who will have access to Tamr.
  2. Select Products and Services, then scroll down to the Networking section, and choose VPC Network > Firewall Rules.
  3. Select Create Firewall Rule and enter the following information:
    • Name: Tamr recommends the following naming format: “default-allow-9100”
    • Direction of traffic: Ingress
    • Action on match: Allow
    • Targets: All instances in the network
    • Source filter: IP ranges. Tamr recommends only allowing ingress traffic from a private VPC. If you wish to allow ingress traffic over the public Internet, specify a restrictive CIDR range only for port 443 and configure Tamr with an SSL certificate via an NGINX server reverse proxy. Tamr does not recommend allowing access via the public Internet on ports 80 or 9100. For information, see Configuring HTTPS using the Nginx Proxy in the Tamr System Administrator Guide.
    • Source IP ranges: Specify the ranges for your VPC.
    • Protocols and ports: Select Specified protocols and ports, then enter “tcp:9100” to open port 9100 over TCP.
  4. Choose Create. Your new firewall rule should appear on the Firewall Rules page. This rule will affect all instances in the project.
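
The console steps above can also be scripted with the Google Cloud SDK. The following is a minimal sketch, not an official Tamr procedure. The network name my-vpc and the source range 10.0.0.0/8 are placeholder assumptions that you must replace with the values for your own VPC. The function only prints the command so that you can review it before running it.

```shell
#!/usr/bin/env bash
# Sketch: build the gcloud command for a firewall rule that allows
# internal ingress to the Tamr default port 9100 over TCP.
# NETWORK and SOURCE_RANGES are placeholders (assumptions), not Tamr defaults.
NETWORK="${NETWORK:-my-vpc}"
SOURCE_RANGES="${SOURCE_RANGES:-10.0.0.0/8}"

build_firewall_cmd() {
  # $1 = rule name, $2 = protocol:port
  printf 'gcloud compute firewall-rules create %s --network=%s --direction=INGRESS --action=ALLOW --rules=%s --source-ranges=%s\n' \
    "$1" "$NETWORK" "$2" "$SOURCE_RANGES"
}

# Print the command for review; copy and paste it into a terminal to run it.
build_firewall_cmd default-allow-9100 tcp:9100
```

If you also need the HTTPS rule described earlier, the same function can be called with a different rule name and tcp:443, together with a restrictive SOURCE_RANGES value such as a single /32 address.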

Deploying a Tamr Instance from the GCP Console

To deploy a Tamr instance:

  1. Log in to GCP at https://console.cloud.google.com/ using your existing Google account and select the correct project.
  2. Choose Products and Services and select Marketplace, or open the console at https://cloud.google.com/marketplace/, and choose Explore Marketplace.
  3. Search for the Tamr cloud image and select it.
    On the Tamr solution page, choose Launch on Compute Engine.
  4. Configure the Tamr GCP deployment:
    • Select a zone.
    • Select a machine type. Optionally change the number of cores and amount of memory. See Sizing and Limits.
    • Specify the boot disk type and size.
    • Optionally change the network name and subnetwork names. Be sure that whichever network you specify has port 9100 (TCP) exposed via a firewall rule. See Setting Up Firewall Rules.
  5. Read and accept the GCP Marketplace Terms of Service.
  6. Choose Deploy when you are done. Tamr will begin deploying. Note that this can take several minutes. A summary page displays when Tamr is successfully deployed. This page includes the instance ID.
  7. Select the Instance link to retrieve the external IP address for accessing Tamr. This is the host address of your Tamr instance. You will later use it in http://<hostname>:9100.
  8. Obtain a license key, username, and password by contacting Tamr Support at [email protected]. You will require the license key and these credentials when accessing the Tamr instance via a browser.
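
If you have installed the Google Cloud SDK, you can also look up the external IP address from the command line instead of the console. The sketch below assumes an instance named tamr-1-vm in zone us-east1-b; both names are hypothetical and must be replaced with the values from your deployment summary page. The lookup runs only when RUN=1 is set.

```shell
#!/usr/bin/env bash
# Sketch: look up the external IP of the deployed VM and derive the Tamr URL.
# INSTANCE and ZONE are placeholders (assumptions), not values set by Tamr.
INSTANCE="${INSTANCE:-tamr-1-vm}"
ZONE="${ZONE:-us-east1-b}"

tamr_url() {
  # $1 = host name or IP address; 9100 is the Tamr default port.
  printf 'http://%s:9100\n' "$1"
}

external_ip() {
  # Prints the NAT (external) IP of the first network interface, if any.
  gcloud compute instances describe "$1" --zone "$2" \
    --format='get(networkInterfaces[0].accessConfigs[0].natIP)'
}

if [ "${RUN:-0}" = 1 ]; then
  ip="$(external_ip "$INSTANCE" "$ZONE")"
  if [ -n "$ip" ]; then
    tamr_url "$ip"
  else
    echo "no external IP found for $INSTANCE" >&2
  fi
fi
```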

Accessing the Tamr Instance

Note: The following procedure assumes that users in your team are already connected via your own VPC network.

  • To access the Tamr instance for the first time after it has been deployed, you need to have a license key and an initial set of credentials (username and password).
  • To access the Tamr instance on a regular basis (after you have provided the license key) inside a VPC, use: http://<hostname>:9100.

To access the Tamr instance for the first time and provide the license key:

  1. In your browser, log in to the Google Compute Engine Console at https://console.cloud.google.com/ and select the correct project. Locate your Tamr instance.
  2. On the VM Instances page, SSH to the new VM Instance. From the SSH drop-down menu, select Open in Browser Window.
  3. Using the command line, provide the license key to Tamr:
    ${TAMR_UNIFY_HOME}/tamr/utils/unify-admin.sh config:set TAMR_LICENSE_KEY="<license-key-value>"
    See Setting the License Key.
    Note: To connect via SSH, you can also use the gcloud compute ssh command or your own terminal with SSH.
  4. Restart Tamr:
    cd ${TAMR_UNIFY_HOME}/tamr
    ./stop-unify.sh
    ./stop-dependencies.sh
    ./start-dependencies.sh
    ./start-unify.sh
    See Restarting Tamr.
  5. In your browser, go to http://<hostname>:9100 and enter the credentials you received from Tamr. Change the password immediately.
    You can now use Tamr. For information, see the Tamr User Documentation.
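
The license and restart steps above can be collected into one script that you run on the Tamr VM. This is a sketch under stated assumptions rather than a Tamr-provided tool: the run helper and the DRY_RUN switch are illustrative additions, and /path/to/tamr-home is a placeholder used only when TAMR_UNIFY_HOME is not set. With DRY_RUN=1 (the default here) the script only prints each command.

```shell
#!/usr/bin/env bash
# Sketch: set the license key, then restart Tamr in the documented order.
# DRY_RUN=1 prints the commands; set DRY_RUN=0 on the Tamr VM to execute them.
DRY_RUN="${DRY_RUN:-1}"
# /path/to/tamr-home is a placeholder (assumption) for an unset TAMR_UNIFY_HOME.
TAMR_DIR="${TAMR_UNIFY_HOME:-/path/to/tamr-home}/tamr"

run() {
  # Print the command in dry-run mode, otherwise execute it.
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

set_license_and_restart() {
  # $1 = license key value received from Tamr Support
  run "$TAMR_DIR/utils/unify-admin.sh" config:set "TAMR_LICENSE_KEY=$1" || return 1
  run cd "$TAMR_DIR" || return 1
  # Stop Tamr, then its dependencies; start the dependencies, then Tamr.
  for script in stop-unify.sh stop-dependencies.sh start-dependencies.sh start-unify.sh; do
    run "./$script" || return 1
  done
}

set_license_and_restart "<license-key-value>"
```

The loop stops at the first script that fails, so a failed stop is not followed by a start.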

Checking Tamr Health Status

Use the Tamr health check API to check Tamr health status. The health API endpoint returns health checks for the service and for Zookeeper, which Tamr uses for configuration management.

To check Tamr health status:

  1. Open the health API endpoint at: http://<hostname>:9100/docs#!/service/getHealth and choose Try it out, or use the curl command:
    curl -X GET --header 'Accept: application/json' 'http://<hostname>:9100/api/service/health'
    If a health check fails, restart the Tamr instance to recover from the failure. To help troubleshoot the instance, use the Tamr knowledge base, which is available to all customers. For information, contact Tamr Support at [email protected].
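
For routine monitoring, the curl command above can be wrapped in a small script. The sketch below looks only at the HTTP status code and does not parse the response body, because the body format is not described in this guide. Treating HTTP 200 as a successful response is an assumption, so inspect the body as well. TAMR_HOST is a placeholder, and the request is sent only when RUN=1 is set.

```shell
#!/usr/bin/env bash
# Sketch: call the Tamr health endpoint and report on the HTTP status code.
# TAMR_HOST is a placeholder (assumption); replace it with your host address.
TAMR_HOST="${TAMR_HOST:-localhost}"

health_status_code() {
  # Prints the HTTP status code; curl prints 000 if the connection fails.
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    --header 'Accept: application/json' "http://$1:9100/api/service/health"
}

interpret_status() {
  # $1 = HTTP status code returned by health_status_code
  case "$1" in
    200) echo "endpoint responded with HTTP 200"; return 0 ;;
    000) echo "no response: Tamr may be stopped or unreachable"; return 1 ;;
    *)   echo "unexpected HTTP status $1"; return 1 ;;
  esac
}

if [ "${RUN:-0}" = 1 ]; then
  interpret_status "$(health_status_code "$TAMR_HOST")"
fi
```

The non-zero return value for anything other than 200 makes the script usable from cron or another scheduler that alerts on failure.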

Checking the Status of the Tamr License

To check the status of your Tamr license:

  1. Open the health API endpoint at http://<hostname>:9020/docs#!/api/service/health and choose Try it out, or use the curl command:
    curl -X GET --header 'Accept: application/json' 'http://<hostname>:9020/api/service/health'
  2. Check that the response body returns true for both license and health.
    If true is not returned, contact Tamr Support at [email protected] to request a new license.

Starting and Stopping Tamr

Start and stop Tamr using the scripts: ./stop-unify.sh, ./stop-dependencies.sh, ./start-dependencies.sh, and ./start-unify.sh located in the ${TAMR_UNIFY_HOME}/tamr directory. For information, see Restarting Tamr.

Security

To ensure secure access, users in your team must have access to your team’s VPC that is used for the GCP project containing your instance.
Tamr recommends that you use GCP storage volume encryption to protect your data. See Data Encryption Options in the GCP documentation.

Also see Securely connecting to VM instances and the Google Cloud Security documentation.

Costs

The cost of running the Tamr instance is a combination of:

  • Tamr cost. Tamr cost is per license, with additional cost for optional services and support. To obtain a license key, contact Tamr Support at [email protected].
  • Google infrastructure costs for the virtual machine on which you are running Tamr. See Google VM Instance Pricing.
  • GCP storage costs. Optionally, you can choose to store Tamr backups in Google storage. See Disks and images pricing.

Sizing and Limits

Check your Google Compute Engine (GCE) resource quota limits. For more information, see https://cloud.google.com/compute/quotas.
Tamr has the following minimum requirements for a single-node deployment:

  • 3 CPU cores and 64GB RAM.
  • For up to 20 million records, Tamr recommends an n1-highmem-8 instance deployment.
  • For larger numbers of records, Tamr recommends n1-highmem-16 or n1-highmem-32 instance deployments.

For more information, see GCP Machine Types in the GCP documentation. For information about suggested sizings in Tamr, see Single-Node Deployments in the Tamr System Administrator Guide.
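
The record-count guidance above can be expressed as a small helper, shown here as a sketch. The function name recommend_machine_type and the REGION and ZONE values are illustrative assumptions. The gcloud calls, which print your regional quotas and the CPU and memory of a machine type, run only when RUN=1 is set.

```shell
#!/usr/bin/env bash
# Sketch: map an expected record count to the instance types named in this guide.
REGION="${REGION:-us-east1}"   # placeholder (assumption)
ZONE="${ZONE:-us-east1-b}"     # placeholder (assumption)

recommend_machine_type() {
  # $1 = expected number of records (integer)
  if [ "$1" -le 20000000 ]; then
    echo "n1-highmem-8"
  else
    echo "n1-highmem-16 or n1-highmem-32"
  fi
}

if [ "${RUN:-0}" = 1 ]; then
  # The region description includes the quota limits and current usage.
  gcloud compute regions describe "$REGION"
  # Show the vCPU count and memory (in MB) of the smallest recommended type.
  gcloud compute machine-types describe n1-highmem-8 --zone "$ZONE" \
    --format='value(guestCpus,memoryMb)'
fi
```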

Scaling Up

To scale up your Tamr deployment on GCP, increase the size of your Google Compute Engine instance. If you need additional storage, attach an external storage drive in GCP. To scale out your deployment, contact your Tamr account representative.
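
Changing the machine type of a Compute Engine instance requires the instance to be stopped first. The following sketch shows one possible sequence with the Google Cloud SDK. The instance name, zone, and target machine type are hypothetical examples, and the run helper with its DRY_RUN switch is an illustrative addition. Stop Tamr with the stop scripts before stopping the VM, and start it again afterwards.

```shell
#!/usr/bin/env bash
# Sketch: resize the VM that hosts Tamr by changing its machine type.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

resize_instance() {
  # $1 = instance name, $2 = zone, $3 = new machine type
  run gcloud compute instances stop "$1" --zone "$2" || return 1
  run gcloud compute instances set-machine-type "$1" --zone "$2" --machine-type "$3" || return 1
  run gcloud compute instances start "$1" --zone "$2"
}

# Example values only (assumptions): replace with your instance, zone, and size.
resize_instance tamr-1-vm us-east1-b n1-highmem-16
```

If the instance uses an ephemeral external IP address, the address can change after the instance is stopped and started, so check the host address again before reconnecting.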

Backups and Disaster Recovery

Take regular backups of Tamr and keep the backups in Google storage in a different zone than your GCP instance. To create backups, use the Tamr backup API. For information, see Backup.

Upgrades

Tamr releases new software versions frequently. While Tamr strives to maintain the most recent version available on Google Cloud Platform Marketplace, your instance version may not be the latest and is not automatically upgraded. To upgrade to the most recent version, or to create a custom deployment in Google Cloud Platform, contact Tamr Support at [email protected].

Support

For technical support, contact Tamr Support at [email protected] or contact your Tamr account representative.

Support Costs, Tiers, and Service Level Agreements

Tamr Support is included with the Tamr annual license fees or as part of annual Maintenance and Support fees.
There is one support tier. This includes:

  • 2-hour response time for severity 1 (outage) issues.
  • 4-hour response time for severity 2 (degraded response times) issues.
  • 12-hour response time for severity 3 issues.