Configuring HDFS
Configure Tamr Core to use HDFS as its filesystem.
Single-node Tamr deployments use the local filesystem by default. You can configure Tamr Core to use an external Hadoop Distributed File System (HDFS) cluster instead of writing to the local filesystem.
Important: When you use the Connect to Source option to add a data file stored in an HDFS cluster to Tamr Core, the file must already include a primary key column. This is different from adding a file with the Upload File option. The Upload File option allows you to select the column with the primary key or to specify No Primary Key, which tells Tamr Core to create a primary key on import. See Uploading a Dataset into a Project.
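Because a file added with Connect to Source must already contain a primary key column, it can be useful to check a candidate column before loading the file. The following is a minimal sketch, not part of Tamr Core: the helper name `is_primary_key` and the sample data are hypothetical, and the check assumes that a usable primary key means every row has a non-empty, unique value in that column.

```python
import csv
import io


def is_primary_key(rows, column):
    """Return True if every row has a non-empty, unique value in `column`."""
    seen = set()
    for row in rows:
        value = (row.get(column) or "").strip()
        if not value or value in seen:
            return False
        seen.add(value)
    return True


# Hypothetical sample data standing in for a CSV file stored in HDFS.
sample = io.StringIO("id,name\n1,Acme\n2,Globex\n3,Acme\n")
rows = list(csv.DictReader(sample))

print(is_primary_key(rows, "id"))    # every id is present and distinct
print(is_primary_key(rows, "name"))  # "Acme" appears twice
```

A column that fails this check would need to be fixed in the source file, since Connect to Source does not offer the No Primary Key option described above.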
Configuring Tamr Core to Use HDFS as Its Filesystem
Before you begin:
- Obtain HDFS configuration files.
  - (Optional) Obtain the core Hadoop configuration files, such as core-site.xml and hdfs-site.xml.
  - (Optional) Obtain any additional files referenced by the core Hadoop files, such as .xsl, .sh, and so on.
- Verify that you have a readable/writable space in HDFS.
- If HDFS uses Kerberos for authentication, obtain the Kerberos keytab file and principal.
- The principal user must have read/write access.
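Once you have the core Hadoop configuration files, you may want to confirm whether fs.defaultFS is defined in them, because that value can be used for TAMR_FS_URI. The sketch below uses only the Python standard library and relies on the standard Hadoop configuration layout of `<property>` elements with `<name>` and `<value>` children. The helper name `read_hadoop_property` and the inline XML are illustrative assumptions, not taken from a real cluster.

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(xml_text, name):
    """Return the value of the named property in Hadoop-style XML, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            value = prop.findtext("value")
            return value.strip() if value else value
    return None


# Hypothetical core-site.xml content; in practice, read the file you obtained.
CORE_SITE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameservice</value>
  </property>
</configuration>
"""

default_fs = read_hadoop_property(CORE_SITE, "fs.defaultFS")
print(default_fs)
```

If the helper returns None, fs.defaultFS is not defined in that file, which is the case where a nameservice has to be chosen and supplied separately.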
To configure Tamr Core to use HDFS as its filesystem:
- Set each of the configuration variables listed below using the administrative utility. See Creating or Updating a Configuration Variable.
- Restart Tamr Core and its dependencies. See Restarting Tamr Core.
Configuration Variable | Example and Description |
---|---|
TAMR_UNIFY_DATA_DIR | hdfs://nameservice/tamr/unify-data A readable/writable path in HDFS where Tamr Core will read/write data. |
TAMR_FS_URI | hdfs://nameservice Primary filesystem URI. Set to the root of the filesystem. Examples: file:///, gs://tamr-bucket/, s3://tamr-bucket/, hdfs://tamr-nameservice/. You can set this variable to the value of fs.defaultFS from the configuration files. If fs.defaultFS is not defined, pick an appropriate nameservice from the configuration files and set fs.defaultFS with TAMR_FS_EXTRA_CONFIG. |
TAMR_FS_CONFIG_URIS | file:///path/to/core-site.xml;file:///path/to/hdfs-site.xml A semicolon-separated list of the URIs of the core Hadoop configuration files, such as core-site.xml and hdfs-site.xml. Supported URI schemes are file, http, and zk (ZooKeeper). |
TAMR_FS_EXTRA_URIS | zk://localhost:21281/hdfs/config/hadoop-env.sh;zk://localhost:21281/hdfs/config/configuration.xsl A semicolon-separated list of the URIs of the non-XML configuration files. Supported URI schemes are file, http, and zk (ZooKeeper). |
TAMR_FS_CONFIG_DIR | /etc/hadoop/conf/ A directory in which to store the configuration files specified by TAMR_FS_CONFIG_URIS and TAMR_FS_EXTRA_URIS. If the configuration files already exist on the filesystem, you can set this variable to the path that contains them to avoid caching them elsewhere. You typically do not need to set this variable, because by default Tamr Core uses the Hadoop home directory for its configuration. |
TAMR_FS_EXTRA_CONFIG | {'fs.defaultFS': hdfs://nameservice} Dictionary of key:value pairs. If fs.defaultFS is not defined in the configuration files, you can set a nameservice here. You typically do not need to set this variable. |
TAMR_FS_KERBEROS_ENABLED | true or false Enables Kerberos for authentication. |
TAMR_KERBEROS_KEYTAB | /path/to/user.keytab Required when the HDFS configuration uses Kerberos for authentication and TAMR_FS_KERBEROS_ENABLED is set to true. Path to a Kerberos keytab file. |
TAMR_KERBEROS_KRB5 | /path/to/krb5.conf Required when the HDFS configuration uses Kerberos for authentication and TAMR_FS_KERBEROS_ENABLED is set to true. Path to a Kerberos krb5.conf file. |
TAMR_KERBEROS_PRINCIPAL | primary/instance@REALM Required when the HDFS configuration uses Kerberos for authentication and TAMR_FS_KERBEROS_ENABLED is set to true. The principal to use in the keytab file. Use the klist command to inspect the keytab file and confirm the principal: klist -k <path-to-keytab> |
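
Before applying the values with the administrative utility, it may help to sanity-check them against the rules in the table. The sketch below is an illustrative pre-check only and does not call any Tamr Core API: the function name `check_settings` and the dictionary of values are assumptions. It verifies that each URI in TAMR_FS_CONFIG_URIS and TAMR_FS_EXTRA_URIS uses a supported scheme (file, http, or zk) and that the three Kerberos variables are present when TAMR_FS_KERBEROS_ENABLED is true.

```python
from urllib.parse import urlparse

SUPPORTED_SCHEMES = {"file", "http", "zk"}
KERBEROS_VARS = (
    "TAMR_KERBEROS_KEYTAB",
    "TAMR_KERBEROS_KRB5",
    "TAMR_KERBEROS_PRINCIPAL",
)


def check_settings(settings):
    """Return a list of problems found in a dict of variable name -> value."""
    problems = []
    # Each URI list is semicolon-separated; skip empty entries.
    for var in ("TAMR_FS_CONFIG_URIS", "TAMR_FS_EXTRA_URIS"):
        for uri in filter(None, settings.get(var, "").split(";")):
            if urlparse(uri).scheme not in SUPPORTED_SCHEMES:
                problems.append(f"{var}: unsupported scheme in {uri!r}")
    # The Kerberos variables are only required when Kerberos is enabled.
    if settings.get("TAMR_FS_KERBEROS_ENABLED", "false").lower() == "true":
        for var in KERBEROS_VARS:
            if not settings.get(var):
                problems.append(f"{var} is required when Kerberos is enabled")
    return problems


# Example values taken from the table above; the paths are placeholders.
example = {
    "TAMR_FS_URI": "hdfs://nameservice",
    "TAMR_FS_CONFIG_URIS": "file:///path/to/core-site.xml;file:///path/to/hdfs-site.xml",
    "TAMR_FS_KERBEROS_ENABLED": "true",
    "TAMR_KERBEROS_KEYTAB": "/path/to/user.keytab",
    "TAMR_KERBEROS_KRB5": "/path/to/krb5.conf",
    "TAMR_KERBEROS_PRINCIPAL": "primary/instance@REALM",
}

for problem in check_settings(example):
    print(problem)
```

An empty result means only that the values are internally consistent with the table. It does not confirm that the files exist or that the principal has read/write access in HDFS.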