
Configuring HDFS

Configure Tamr Core to use HDFS as its filesystem.

Single-node Tamr deployments use the local filesystem by default. You can configure Tamr Core to use an external Hadoop Distributed File System (HDFS) cluster instead of writing to the local filesystem.

Important: When you use the Connect to Source option to add a data file stored in an HDFS cluster to Tamr Core, the file must already include a primary key column. This is different from adding a file with the Upload File option. The Upload File option allows you to select the column with the primary key or to specify No Primary Key, which tells Tamr Core to create a primary key on import. See Uploading a Dataset into a Project.

Configuring Tamr Core to Use HDFS as Its Filesystem

Before you begin:

  • Obtain HDFS configuration files.
    • (Optional) Obtain the core Hadoop configuration files, such as core-site.xml, hdfs-site.xml.
    • (Optional) Obtain any additional files referenced by the core Hadoop files, such as .xsl, .sh, and so on.
  • Verify that you have a readable/writable location in HDFS; a quick access check is sketched after this list.
  • If HDFS uses Kerberos for authentication, obtain the Kerberos keytab file and principal.
    • The principal user must have read/write access.
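
For example, you can confirm access from a host with the Hadoop client tools installed. The following is a minimal sketch only: the keytab, principal, and hdfs://nameservice/tamr/unify-data path are placeholders for your own values.

# Authenticate first if HDFS uses Kerberos (keytab and principal are placeholders).
kinit -kt /path/to/user.keytab primary/instance@REALM

# Confirm the target location exists and is readable.
hdfs dfs -ls hdfs://nameservice/tamr/unify-data

# Confirm write access by creating and then removing an empty test file.
hdfs dfs -touchz hdfs://nameservice/tamr/unify-data/.tamr-write-test
hdfs dfs -rm hdfs://nameservice/tamr/unify-data/.tamr-write-test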

To configure Tamr Core to use HDFS as its filesystem:

  1. Set each of the configuration variables listed below using the administrative utility; a sketch of one way to do this appears after these steps. See Creating or Updating a Configuration Variable.
  2. Restart Tamr Core and its dependencies. See Restarting Tamr Core.
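
The following is a minimal sketch of step 1 only. It assumes the administrative utility is the unify-admin.sh script with a config:set subcommand and that Tamr Core is installed under <tamr-home-directory>; confirm the exact command and paths in Creating or Updating a Configuration Variable.

# Assumed utility path and subcommand; adjust to match your deployment.
<tamr-home-directory>/tamr/utils/unify-admin.sh config:set TAMR_FS_URI="hdfs://nameservice"
<tamr-home-directory>/tamr/utils/unify-admin.sh config:set TAMR_UNIFY_DATA_DIR="hdfs://nameservice/tamr/unify-data"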

The configuration variables are listed below, each with an example value and a description.

TAMR_UNIFY_DATA_DIR
Example: hdfs://nameservice/tamr/unify-data

A readable/writable path in HDFS where Tamr Core will read/write data.

TAMR_FS_URI
Example: hdfs://nameservice

Primary filesystem URI. Set this to the root of the filesystem. Examples: file:///, gs://tamr-bucket/, s3://tamr-bucket/, hdfs://tamr-nameservice/.

You can set this variable to the value of fs.defaultFS from the configuration files.

If fs.defaultFS is not defined, pick an appropriate nameservice from the configuration files and set fs.defaultFS with TAMR_FS_EXTRA_CONFIG.
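
If you are not sure of the current value, one way to look it up, assuming the Hadoop client tools and configuration files are available on the host, is:

# Print the configured default filesystem from the active Hadoop configuration.
hdfs getconf -confKey fs.defaultFS

# Or read it directly from core-site.xml (path is an example).
grep -A 1 'fs.defaultFS' /etc/hadoop/conf/core-site.xml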

TAMR_FS_CONFIG_URIS
Example: file:///path/to/core-site.xml;file:///path/to/hdfs-site.xml

A semicolon-separated list of the URIs of the core Hadoop configuration files, such as core-site.xml and hdfs-site.xml.

Supported URI schemes are file, http, and zk (ZooKeeper).

TAMR_FS_EXTRA_URIS
Example: zk://localhost:21281/hdfs/config/hadoop-env.sh;zk://localhost:21281/hdfs/config/configuration.xsl

A semicolon-separated list of the URIs of the non-XML configuration files.

Supported URI schemes are file, http, and zk (ZooKeeper).

TAMR_FS_CONFIG_DIR
Example: /etc/hadoop/conf/

A directory to store the configuration files specified by TAMR_FS_CONFIG_URIS and TAMR_FS_EXTRA_URIS. If the configuration files already exist on the filesystem, you can set this to the path that already contains the files to avoid caching them elsewhere.

You typically do not need to set this variable because, by default, Tamr Core uses the Hadoop home directory for its configuration.

TAMR_FS_EXTRA_CONFIG
Example: {'fs.defaultFS': hdfs://nameservice}

A dictionary of key:value pairs. If fs.defaultFS is not defined in the configuration files, you can set a nameservice here. You typically do not need to set this variable.

TAMR_FS_KERBEROS_ENABLED
Example: true or false

Set to true when HDFS uses Kerberos for authentication.

TAMR_KERBEROS_KEYTAB
Example: /path/to/user.keytab

Required when the HDFS configuration uses Kerberos for authentication and TAMR_FS_KERBEROS_ENABLED is set to true.

Path to a Kerberos keytab file.

TAMR_KERBEROS_KRB5
Example: /path/to/krb5.conf

Required when the HDFS configuration uses Kerberos for authentication and TAMR_FS_KERBEROS_ENABLED is set to true.

Path to a Kerberos krb5.conf file.

TAMR_KERBEROS_PRINCIPAL
Example: primary/instance@REALM

Required when the HDFS configuration uses Kerberos for authentication and TAMR_FS_KERBEROS_ENABLED is set to true.

The principal to use in the keytab file. Use the klist command to inspect the keytab file to confirm the principal.

klist -k <path-to-keytab>
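
For example, using the placeholder values above, you can confirm that the keytab contains the expected principal and that it can obtain a ticket:

# List the principals stored in the keytab (paths and principal are placeholders).
klist -k /path/to/user.keytab

# Obtain a ticket as that principal to confirm the keytab and krb5.conf work together.
kinit -kt /path/to/user.keytab primary/instance@REALM
klist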