YARN Cluster Manager Jobs

Optionally, you can configure the YARN cluster manager to run jobs in parallel and allocate system resources for running multiple jobs concurrently.

YARN Configuration

Tamr uses the cluster manager from YARN for running Spark jobs, instead of the standalone cluster manager from Spark. The YARN cluster manager starts ResourceManager and NodeManager servers.

To see the list of all Spark jobs that have been submitted to the cluster manager, access the YARN Resource Manager at its Web UI port.
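
For example, with the default web UI port of 8088 (listed below), the Resource Manager UI is available at a URL of the form shown here, where the hostname placeholder stands for the value of TAMR_YARN_RESOURCE_MANAGER_HOST:

http://<tamr-hostname>:8088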

YARN-Related Logs

Server logs are stored in the TAMR_LOG_DIR, which defaults to tamr/logs.

The logs for Spark jobs are stored in the TAMR_LOG_DIR/userlogs directory.
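
On a standard YARN installation, the NodeManager groups these job logs by application and container, so the output of a given Spark job typically appears under a path of the following form (the IDs are placeholders, not literal values):

TAMR_LOG_DIR/userlogs/application_<id>/container_<id>/stdout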

YARN Ports

By default, Tamr uses the following ports for YARN:

TAMR_YARN_RESOURCE_MANAGER_SCHEDULER_PORT: "8030"
TAMR_YARN_RESOURCE_MANAGER_RESOURCE_TRACKER_PORT: "8031"
TAMR_YARN_RESOURCE_MANAGER_PORT: "8032"
TAMR_YARN_RESOURCE_MANAGER_ADMIN_PORT: "8033"
TAMR_YARN_NODE_MANAGER_PORT: "8042"
TAMR_YARN_RESOURCE_MANAGER_WEBUI_PORT: "8088"
TAMR_YARN_RESOURCE_MANAGER_WEBUI_HTTPS_PORT: "8090"

Note: Additional ports are listed in the next section.

You can specify different YARN ports. For information, see Configuring Tamr.
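
For example, a minimal configuration YAML file that moves the Resource Manager web UI off its default port might contain only the following line; the port number 9088 is an illustrative choice, not a recommendation. You apply the file with the admin tool, as described in Configuring Tamr.

TAMR_YARN_RESOURCE_MANAGER_WEBUI_PORT: "9088"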

YARN Configuration Properties

The YARN cluster manager uses these configuration properties in Tamr Core. You can optionally specify your own values for these properties on a cloud-native deployment. For single-node deployments, Tamr recommends that you use the defaults listed below.

This list also includes some of the YARN ports. The defaults for these ports are listed in the previous section.

  • TAMR_YARN_RESOURCE_MANAGER_HOST The hostname of the Spark YARN ResourceManager. The default is the host's IP address (HOST_IP), which you can determine by running hostname -I.
  • TAMR_YARN_NODE_MANAGER_HOST The hostname of the Spark YARN NodeManager. The default is also HOST_IP, which you can determine by running hostname -I.
  • TAMR_YARN_NODE_MANAGER_PORT The port of the Spark YARN NodeManager. The default port is 8042.
  • TAMR_YARN_TEMP_DIR The directory for storing temporary files produced by YARN.
    Specify this location if you need to control access to it. The default value is TAMR_HADOOP_HOME/temp.
  • TAMR_JOB_SPARK_YARN_QUEUE The name of the YARN queue to which Spark jobs are submitted. By default this value is empty, and jobs are submitted without a named queue.
  • TAMR_YARN_SCHEDULER_CAPACITY_MAXIMUM_AM_RESOURCE_PERCENT The maximum percentage of cluster resources that can be used to run application masters (AMs) in the YARN cluster, which controls how many applications can run concurrently. Possible values are between 0.0 and 1.0, inclusive. The default is 1.0, meaning that AMs can use up to 100% of the cluster's memory. Use the default for single-node Tamr deployments running on the YARN cluster; for an example of a lower setting, see the sketch after this list.
  • TAMR_JOB_SPARK_LOCAL_YARN_JARS A list of paths to JARs on the local filesystem that the YARN cluster manager uses. Separate multiple paths with semicolons; glob patterns are allowed. The default is TAMR_SPARK_HOME/jars/*. NOTE: Do not change the TAMR_JOB_SPARK_LOCAL_YARN_JARS property on a standard single-node deployment.
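
As a minimal sketch, a configuration YAML file for a multi-node deployment that reserves 40% of cluster resources for AMs and routes Spark jobs to a named queue might set the following. All values are illustrative assumptions (including the queue name tamr_jobs and the temp directory path), not recommendations; single-node deployments should keep the AM percentage at its default of 1.0.

TAMR_JOB_SPARK_YARN_QUEUE: "tamr_jobs"
TAMR_YARN_SCHEDULER_CAPACITY_MAXIMUM_AM_RESOURCE_PERCENT: "0.4"
TAMR_YARN_TEMP_DIR: "/data/tamr/yarn-temp"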

Configuring Tamr to Use Concurrent Jobs

If you share a Tamr instance with others in your group, your administrator can configure Tamr to run jobs in parallel. To do this, the administrator edits the named Spark configurations specified in the TAMR_JOB_SPARK_CONFIG_OVERRIDES configuration property.

To configure Tamr to use concurrent jobs with YARN

  1. Create a custom YAML file containing the configuration values you wish to set (a sample file follows these steps), and apply this file using the admin tool. For information, see Creating or Updating a Configuration Variable.
  2. Restart Tamr and its dependencies. See Restarting.
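
As an illustration of step 1, a custom YAML file that increases the capacity available for concurrent Spark jobs might look like the following. The values are assumptions for a hypothetical machine, not recommendations, and the exact format of the TAMR_JOB_SPARK_CONFIG_OVERRIDES property itself is described in Configuring Tamr.

TAMR_SPARK_MEMORY: "42G"
TAMR_JOB_SPARK_EXECUTOR_INSTANCES: "3"

After applying the file and restarting, you can confirm that jobs run concurrently by submitting several jobs and checking the Resource Manager web UI.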

Adjusting the Spark Memory

You can adjust the Spark memory resources in TAMR_SPARK_MEMORY based on the following formula. By default, this property accounts for the necessary overhead for running Spark jobs in the YARN cluster manager.

TAMR_SPARK_MEMORY >= 1.1 * x * TAMR_JOB_SPARK_EXECUTOR_INSTANCES + 1.1 * y

Where:

  • x represents TAMR_JOB_SPARK_EXECUTOR_MEM, the memory per executor instance, in GB.
  • y represents TAMR_JOB_SPARK_DRIVER_MEM, the driver memory, in GB.

Every computation is rounded up to a whole number of GBs. This formula also applies to the Spark resource properties that you can specify in the TAMR_JOB_SPARK_CONFIG_OVERRIDES configuration parameter for Tamr.
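
For example, consider a hypothetical sizing with three executor instances, 10 GB of executor memory, and 5 GB of driver memory. The executors need 1.1 * 10 * 3 = 33 GB, and the driver needs 1.1 * 5 = 5.5 GB, rounded up to 6 GB, so TAMR_SPARK_MEMORY must be at least 39 GB. In configuration YAML terms (the numbers are illustrative, not recommendations):

TAMR_JOB_SPARK_EXECUTOR_INSTANCES: "3"
TAMR_JOB_SPARK_EXECUTOR_MEM: "10G"
TAMR_JOB_SPARK_DRIVER_MEM: "5G"
TAMR_SPARK_MEMORY: "39G"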

