
HDFS

Configure Tamr to use HDFS as its filesystem.

Single-node Tamr deployments use the local filesystem by default. Use the following procedure to configure a single-node Tamr deployment to write to an external Hadoop Distributed File System (HDFS) cluster instead of the local filesystem.

Note: When you use the Connect to Source option to add a data file that is stored in an HDFS cluster, the file must already contain a primary key column. This differs from the Upload File option, which lets you either select the primary key column or specify No Primary Key so that Tamr creates one on ingest. See Uploading a Dataset Into Tamr.

Checklist before you proceed:

  • HDFS configuration files.
    • Optional. The core Hadoop configuration files, such as core-site.xml and hdfs-site.xml.
    • Optional. Any additional files referenced by the core Hadoop files, such as .xsl and .sh files.
  • A readable/writable space in HDFS.
  • The Kerberos keytab file and principal.
    • Required only if HDFS uses Kerberos for authentication.
    • Principal user must have read/write access.

Configuring Tamr to use HDFS as its filesystem

To configure Tamr to use HDFS as its filesystem:

  1. Set each of the configuration variables listed below using the administrative utility. See Creating or Updating a Configuration Variable.
  2. Restart Tamr and its dependencies. See Restarting.

TAMR_UNIFY_DATA_DIR

A readable/writable path in HDFS where Tamr will read/write data.

Configuration Variable    Example Value
TAMR_UNIFY_DATA_DIR       hdfs://nameservice/tamr/unify-data

TAMR_FS_URI

Optional.

Primary filesystem URI. Set to the root of the filesystem. Some examples: file:///, gs://tamr-bucket/, s3://tamr-bucket/, hdfs://tamr-nameservice/.

You can set this variable to the value of fs.defaultFS from the configuration files. If fs.defaultFS is not defined, pick an appropriate nameservice from the configuration files and set fs.defaultFS with TAMR_FS_EXTRA_CONFIG.

Configuration Variable    Example Value
TAMR_FS_URI               hdfs://nameservice
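
To find a candidate value for TAMR_FS_URI, you can look up fs.defaultFS in core-site.xml. The following is a minimal sketch, assuming the standard Hadoop layout of property elements with name and value children. The function name read_default_fs and the sample XML are illustrative and are not part of Tamr.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def read_default_fs(core_site_xml: str) -> Optional[str]:
    """Return the fs.defaultFS value from core-site.xml text, or None if it is not defined."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == "fs.defaultFS":
            value = prop.findtext("value")
            return value.strip() if value else None
    return None


# Illustrative core-site.xml content (hypothetical, not taken from a real cluster).
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameservice</value>
  </property>
</configuration>"""

print(read_default_fs(SAMPLE))  # prints hdfs://nameservice
```

If the function returns None, fs.defaultFS is not defined in that file, which is the case where TAMR_FS_EXTRA_CONFIG applies.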

TAMR_FS_CONFIG_URIS

Optional.

A semicolon-separated list of the URIs of the core Hadoop configuration files, such as core-site.xml and hdfs-site.xml.

Supported URI schemes are file, http, and zk (ZooKeeper).

Configuration Variable    Example Value
TAMR_FS_CONFIG_URIS       file:///path/to/core-site.xml;file:///path/to/hdfs-site.xml

TAMR_FS_EXTRA_URIS

A semicolon-separated list of the URIs of the non-XML configuration files.

Supported URI schemes are file, http, and zk (ZooKeeper).

Configuration Variable    Example Value
TAMR_FS_EXTRA_URIS        zk://localhost:21281/hdfs/config/hadoop-env.sh;zk://localhost:21281/hdfs/config/configuration.xsl
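
Before setting TAMR_FS_CONFIG_URIS or TAMR_FS_EXTRA_URIS, it can help to check that every entry in the list uses a supported scheme. The following is a minimal sketch that splits a value on semicolons and reports entries whose scheme is not file, http, or zk. The helper names are hypothetical, and Tamr may apply additional validation that is not modeled here.

```python
from typing import List
from urllib.parse import urlparse

# Schemes listed as supported for both URI-list variables.
SUPPORTED_SCHEMES = {"file", "http", "zk"}


def split_uri_list(value: str) -> List[str]:
    """Split a semicolon-separated value into URIs, dropping empty entries."""
    return [part.strip() for part in value.split(";") if part.strip()]


def unsupported_uris(value: str) -> List[str]:
    """Return the URIs whose scheme is not in SUPPORTED_SCHEMES."""
    return [
        uri for uri in split_uri_list(value)
        if urlparse(uri).scheme not in SUPPORTED_SCHEMES
    ]


config_uris = "file:///path/to/core-site.xml;file:///path/to/hdfs-site.xml"
print(unsupported_uris(config_uris))  # prints []
```

An empty result means every URI in the list uses one of the supported schemes.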

TAMR_FS_CONFIG_DIR

A directory to store the configuration files specified by TAMR_FS_CONFIG_URIS and TAMR_FS_EXTRA_URIS. If the configuration files already exist on the filesystem, you can set this to the path that already contains the files to avoid caching them elsewhere.

You typically do not need to set this variable because, by default, Tamr uses the Hadoop home directory for its configuration.

Configuration Variable    Example Value
TAMR_FS_CONFIG_DIR        /etc/hadoop/conf/

TAMR_FS_EXTRA_CONFIG

A dictionary of key:value pairs. If fs.defaultFS is not defined in the configuration files, you can set a nameservice here. You typically do not need to set this variable.

Configuration Variable    Example Value
TAMR_FS_EXTRA_CONFIG      {'fs.defaultFS': 'hdfs://nameservice'}

TAMR_FS_KERBEROS_ENABLED

Set to true if HDFS uses Kerberos for authentication.

Configuration Variable      Example Value
TAMR_FS_KERBEROS_ENABLED    true or false

TAMR_KERBEROS_KEYTAB

Path to a Kerberos keytab file. Required when the HDFS configuration uses Kerberos for authentication and TAMR_FS_KERBEROS_ENABLED is set to true.

Configuration Variable    Example Value
TAMR_KERBEROS_KEYTAB      /path/to/user.keytab

TAMR_KERBEROS_KRB5

Path to a Kerberos krb5.conf file. Required when the HDFS configuration uses Kerberos for authentication and TAMR_FS_KERBEROS_ENABLED is set to true.

Configuration Variable    Example Value
TAMR_KERBEROS_KRB5        /path/to/krb5.conf

TAMR_KERBEROS_PRINCIPAL

The principal to use in the keytab file. Use the klist command to inspect the keytab file to confirm the principal.

klist -k <path-to-keytab>

You must set this variable if HDFS is authenticated with Kerberos and TAMR_FS_KERBEROS_ENABLED is set to true.
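
If you want to script the keytab check, the following minimal sketch runs klist -k and collects the principals it lists. It assumes the MIT Kerberos output layout, in which each entry row starts with a numeric KVNO followed by the principal. Other Kerberos distributions may print a different layout. The function names and the sample output are hypothetical.

```python
import subprocess
from typing import List


def parse_klist_principals(output: str) -> List[str]:
    """Extract unique principals from `klist -k` output (assumed MIT Kerberos layout)."""
    principals: List[str] = []
    for line in output.splitlines():
        fields = line.split()
        # Entry rows look like "   2 user@REALM"; header and separator rows do not start with a number.
        if len(fields) >= 2 and fields[0].isdigit() and fields[-1] not in principals:
            principals.append(fields[-1])
    return principals


def keytab_principals(keytab_path: str) -> List[str]:
    """Run `klist -k <keytab_path>` and return the principals found in the keytab."""
    result = subprocess.run(
        ["klist", "-k", keytab_path],
        capture_output=True, text=True, check=True,
    )
    return parse_klist_principals(result.stdout)


# Illustrative output only; the principal and realm are made up.
SAMPLE_OUTPUT = """Keytab name: FILE:/path/to/user.keytab
KVNO Principal
---- --------------------------------------------------
   2 tamr/host.example.com@EXAMPLE.COM
   2 tamr/host.example.com@EXAMPLE.COM
"""

print(parse_klist_principals(SAMPLE_OUTPUT))  # prints ['tamr/host.example.com@EXAMPLE.COM']
```

A keytab often holds several entries for the same principal, one per encryption type, which is why the sketch removes duplicates.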

Configuration Variable     Example Value
TAMR_KERBEROS_PRINCIPAL    primary/[email protected]
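
The Kerberos variables depend on each other: when TAMR_FS_KERBEROS_ENABLED is true, the keytab, krb5.conf, and principal variables are all required. The following is a minimal sketch of a pre-flight check over a plain dictionary of planned settings. The function name and the sample values are hypothetical, and the check does not query Tamr itself.

```python
from typing import Dict, List

# Variables required when Kerberos authentication is enabled for HDFS.
KERBEROS_VARS = (
    "TAMR_KERBEROS_KEYTAB",
    "TAMR_KERBEROS_KRB5",
    "TAMR_KERBEROS_PRINCIPAL",
)


def missing_kerberos_vars(config: Dict[str, str]) -> List[str]:
    """Return the required Kerberos variables that are unset or empty.

    Returns an empty list when TAMR_FS_KERBEROS_ENABLED is not "true".
    """
    enabled = config.get("TAMR_FS_KERBEROS_ENABLED", "false").strip().lower() == "true"
    if not enabled:
        return []
    return [name for name in KERBEROS_VARS if not config.get(name, "").strip()]


# Planned settings with the principal left out (values are illustrative).
planned = {
    "TAMR_FS_KERBEROS_ENABLED": "true",
    "TAMR_KERBEROS_KEYTAB": "/path/to/user.keytab",
    "TAMR_KERBEROS_KRB5": "/path/to/krb5.conf",
}

print(missing_kerberos_vars(planned))  # prints ['TAMR_KERBEROS_PRINCIPAL']
```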

