YARN Cluster Manager Jobs
Optionally, you can configure the YARN cluster manager to run jobs in parallel and provision the system with enough resources to run multiple jobs concurrently.
Important: Tamr Core uses the cluster manager from YARN for running Spark jobs, instead of the standalone cluster manager from Spark. The YARN cluster manager starts ResourceManager and NodeManager servers.
YARN Configuration
To see the list of all Spark jobs that have been submitted to the cluster manager, access the YARN Resource Manager at its Web UI port.
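You can also query the ResourceManager from the command line through the standard Hadoop YARN REST API. The sketch below assumes the default web UI port (8088) and a ResourceManager reachable on localhost; substitute your TAMR_YARN_RESOURCE_MANAGER_HOST value as needed.

```bash
# List running and finished applications known to the YARN ResourceManager.
# Assumes the default web UI port 8088 and a ResourceManager on localhost;
# replace localhost with your TAMR_YARN_RESOURCE_MANAGER_HOST value.
curl -s "http://localhost:8088/ws/v1/cluster/apps?states=RUNNING,FINISHED"
```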
YARN-related logs
The server logs are stored in TAMR_LOG_DIR.
The logs for Spark jobs are stored in TAMR_JOB_SPARK_EVENT_LOGS_DIR.
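As a quick check, you can inspect both locations from a shell. The environment variables below are stand-ins for the corresponding Tamr configuration values; substitute the actual paths from your deployment.

```bash
# Show the most recent server logs and Spark event logs.
# TAMR_LOG_DIR and TAMR_JOB_SPARK_EVENT_LOGS_DIR are placeholders here;
# replace them with the paths configured for your deployment.
ls -lt "${TAMR_LOG_DIR}" | head
ls -lt "${TAMR_JOB_SPARK_EVENT_LOGS_DIR}" | head
```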
YARN ports
By default, Tamr uses the following ports for YARN:
- TAMR_YARN_RESOURCE_MANAGER_RESOURCE_TRACKER_PORT: "8031"
- TAMR_YARN_RESOURCE_MANAGER_WEBUI_PORT: "8088"
- TAMR_YARN_RESOURCE_MANAGER_PORT: "8032"
- TAMR_YARN_RESOURCE_MANAGER_WEBUI_HTTPS_PORT: "8090"
- TAMR_YARN_RESOURCE_MANAGER_SCHEDULER_PORT: "8030"
- TAMR_YARN_RESOURCE_MANAGER_ADMIN_PORT: "8033"
- TAMR_YARN_NODE_MANAGER_PORT: "8042"
You can specify different YARN ports. For information, see Configuring Tamr.
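For example, the sketch below overrides the ResourceManager web UI port. The admin utility path and config:set subcommand are assumptions based on a standard installation; see Configuring Tamr for the exact procedure for your deployment.

```bash
# Sketch: override the ResourceManager web UI port, then restart Tamr for
# the change to take effect. The utility path and config:set subcommand are
# assumptions based on a standard installation; verify before use.
"${TAMR_HOME}/tamr/utils/unify-admin.sh" config:set TAMR_YARN_RESOURCE_MANAGER_WEBUI_PORT=8089
```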
YARN Configuration Properties
The YARN cluster manager uses these configuration properties in Tamr. You can optionally specify your own values for these properties on a multi-node deployment. For single-node deployments, we recommend using the defaults listed in this table.
| Configuration Variable | Description |
|---|---|
| TAMR_YARN_RESOURCE_MANAGER_HOST | The hostname of the Spark YARN ResourceManager. The default is the machine's HOST_IP, which you can determine by running `hostname -I`. Note: The ports and their defaults are listed in the previous section on this page. |
| TAMR_YARN_NODE_MANAGER_HOST | The hostname of the Spark YARN NodeManager. The default is the machine's HOST_IP, which you can determine by running `hostname -I`. |
| TAMR_YARN_NODE_MANAGER_PORT | The port of the Spark YARN NodeManager. The default port is 8042. |
| TAMR_YARN_TEMP_DIR | The directory for storing temporary files produced by YARN. Specify this location if you need to control access to it. The default value is `TAMR_HADOOP_HOME/temp`. |
| TAMR_JOB_SPARK_YARN_QUEUE | The name of the YARN queue for submitting Spark jobs. Not set by default; Spark jobs are submitted with an empty queue name. |
| TAMR_JOB_SPARK_LOCAL_YARN_JARS | A list of paths to JARs on the local filesystem for the YARN cluster manager to use. Separate multiple paths with semicolons; glob patterns are allowed. The default is `TAMR_SPARK_HOME/jars/*`. Note: Do not change the TAMR_JOB_SPARK_LOCAL_YARN_JARS property on a standard single-node deployment. |
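As one example, the sketch below routes Spark jobs to a named YARN queue. The queue name is illustrative, and the admin utility invocation is the same assumption as in the ports example above.

```bash
# Sketch: submit Tamr's Spark jobs to a dedicated YARN queue.
# The queue name "tamr_jobs" is illustrative; the admin utility path is an
# assumption based on a standard installation.
"${TAMR_HOME}/tamr/utils/unify-admin.sh" config:set TAMR_JOB_SPARK_YARN_QUEUE=tamr_jobs
```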
Adjusting the Spark Memory
You can adjust the Spark memory resources in TAMR_SPARK_MEMORY based on the following formula. By default, this property accounts for the necessary overhead for running Spark jobs in the YARN cluster manager.
1.1 * x * TAMR_JOB_SPARK_EXECUTOR_INSTANCES + 1.1 * y <= TAMR_SPARK_MEMORY
Where:
- x represents TAMR_JOB_SPARK_EXECUTOR_MEM per instance, in GB.
- y represents TAMR_JOB_SPARK_DRIVER_MEM, in GB.
Every computation is rounded up to a whole number of GBs. This formula also applies to the Spark resource properties that you can specify in the TAMR_JOB_SPARK_CONFIG_OVERRIDES configuration parameter for Tamr.
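As a worked example under hypothetical sizing (three 6 GB executors and a 4 GB driver), the check below applies the formula, rounding each padded term up to a whole GB:

```bash
# Hypothetical sizing: x = 6 GB per executor, n = 3 instances, y = 4 GB driver.
# Each padded term is rounded up to a whole number of GBs.
awk 'BEGIN {
  x = 6; n = 3; y = 4
  ex = 1.1 * x * n; ex = int(ex) + (ex > int(ex))   # ceil(19.8) = 20 GB
  dr = 1.1 * y;     dr = int(dr) + (dr > int(dr))   # ceil(4.4)  = 5 GB
  printf "TAMR_SPARK_MEMORY must be at least %dG\n", ex + dr   # 25G
}'
```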