How to Set TAMR_JOB_SPARK_CONFIG_OVERRIDES

Contents:

  1. Overview of this Tamr configuration variable
  2. Complete list of settable properties
  3. Selected examples

Overview of TAMR_JOB_SPARK_CONFIG_OVERRIDES

The TAMR_JOB_SPARK_CONFIG_OVERRIDES configuration variable allows users to configure Spark jobs to run with different Spark cluster configurations. For example, a full pipeline run might use the maximum resources available on a Spark cluster, while throwing the same resources at a small test job would be overkill and needlessly expensive. Any properties NOT set in the configuration override map will use the default value Tamr is configured to use. This means the user only needs to set the properties they care about and does not have to re-define every single property in the list below.
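The fallback behavior is effectively a map merge: the override map is applied on top of the Tamr defaults, and any key it omits keeps its default value. A minimal illustrative sketch in Python (property names are taken from the list below; this is not Tamr's actual implementation):

```python
# Illustrative only: Tamr resolves overrides internally; this sketch just
# shows the fallback semantics of TAMR_JOB_SPARK_CONFIG_OVERRIDES.
defaults = {
    "driverMemory": "2G",
    "executorMemory": "6G",
    "executorCores": "8",
    "executorInstances": "80",
}

# Only the properties we care about for a small test run.
overrides = {"executorInstances": "4", "executorCores": "2"}

# Keys not present in the override map fall back to the defaults.
effective = {**defaults, **overrides}

print(effective["executorInstances"])  # overridden -> 4
print(effective["driverMemory"])       # not overridden -> default 2G
```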

The Dataset tab of the API provides the GET /jobs/sparkConfigNames endpoint, which lists the names of all defined sets of overrides.

Spark configuration can be set per project in the Project configuration dialog in the UI, or through several endpoints related to starting Spark jobs, such as POST /jobs/{id}.

List of configuration override keys

The following is an exhaustive list of all settable properties and the corresponding Tamr configuration property that each one overrides:

  • name (must be unique for each defined configuration map)
  • sparkHome - TAMR_SPARK_HOME
  • hadoopHome - TAMR_HADOOP_HOME
  • cluster - TAMR_JOB_SPARK_CLUSTER
  • driverMemory - TAMR_JOB_SPARK_DRIVER_MEM
  • executorMemory - TAMR_JOB_SPARK_EXECUTOR_MEM
  • executorCores - TAMR_JOB_SPARK_EXECUTOR_CORES
  • executorInstances - TAMR_JOB_SPARK_EXECUTOR_INSTANCES
  • eventLogsDir - TAMR_JOB_SPARK_EVENT_LOGS_DIR
  • sparkSubmitTimeoutSeconds - TAMR_JOB_SPARK_SUBMIT_TIMEOUT_SECONDS
  • applicationJar - TAMR_JOB_SPARK_JAR
  • auxiliaryJars - TAMR_JOB_SPARK_AUX_JAR
  • sparkProps - TAMR_JOB_SPARK_PROPS
  • sparkEnv - TAMR_JOB_SPARK_ENV
  • fsConfig - corresponds to a map of filesystem configuration variables
  • log4jProps - TAMR_JOB_SPARK_LOG4J_PROPS
  • logJson - TAMR_LOG_JSON_ENABLED
  • dataprocClusterConfig - TAMR_JOB_DATAPROC_CLUSTER_CONFIG
  • runJobFlowRequest - TAMR_DATASET_EMR_RUN_JOB_FLOW_REQUEST (only with ephemeral EMR)
  • micrometer - TAMR_MICROMETER_CONFIG
  • useCopyEventLogger - TAMR_DATASET_USE_COPY_EVENT_LOGGER
  • sparkConfigOverrides - corresponds to a map of configuration specific to ephemeral EMR
  • sparkDeploymentConfig - corresponds to a map of deployment-specific configuration (for ephemeral EMR, use sparkConfigOverrides)
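Since name must be unique for each defined configuration map, it can help to sanity-check an override list before applying it. A small hypothetical Python check (the helper and the sample data are ours, not part of Tamr):

```python
def check_unique_names(override_maps):
    """Raise if any override map is missing a 'name' or reuses one."""
    seen = set()
    for m in override_maps:
        name = m.get("name")
        if name is None:
            raise ValueError("every override map needs a 'name'")
        if name in seen:
            raise ValueError(f"duplicate override name: {name!r}")
        seen.add(name)

# Hypothetical override maps, mirroring the examples later in this article.
configs = [
    {"name": "lessWorkers", "executorInstances": "4"},
    {"name": "fullPipeline", "executorInstances": "80"},
]
check_unique_names(configs)  # passes: both names are unique
```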

SparkDeploymentConfig

This property encompasses deployment-specific configuration. Each deployment and its unique configuration are listed below, along with the Tamr configuration property each key corresponds to.

Dataproc:

Override Key           Equivalent Tamr Property
dataprocProjectId      TAMR_JOB_DATAPROC_PROJECT_ID
dataprocRegion         TAMR_JOB_DATAPROC_REGION
dataprocClusterName    TAMR_JOB_DATAPROC_CLUSTER_NAME

Spalk:

Override Key        Equivalent Tamr Property
spalkClusterName    TAMR_JOB_SPALK_CLUSTER_NAME
spalkEnableTls      TAMR_JOB_SPALK_TLS_ENABLE

Yarn:

Override Key     Equivalent Tamr Property
localYarnJars    TAMR_JOB_SPARK_LOCAL_YARN_JARS
yarnQueue        TAMR_JOB_SPARK_YARN_QUEUE

EMR (static):

Override Key    Equivalent Tamr Property
clusterId       TAMR_JOB_EMR_CLUSTER_ID

Databricks:

Override Key               Equivalent Tamr Property
databricksHost             TAMR_JOB_DATABRICKS_HOST
databricksToken            TAMR_JOB_DATABRICKS_TOKEN
databricksWorkingspace     TAMR_JOB_DATABRICKS_WORKINGSPACE
minWorkers                 TAMR_JOB_DATABRICKS_MIN_WORKERS
maxWorkers                 TAMR_JOB_DATABRICKS_MAX_WORKERS
databricksSparkVersion     TAMR_JOB_DATABRICKS_SPARK_VERSION
databricksNodeType         TAMR_JOB_DATABRICKS_NODE_TYPE
enableDbfsFilesizeCheck    TAMR_JOB_DATABRICKS_ENABLE_DBFS_FILESIZE_CHECK

SparkConfigOverrides

This property encompasses configuration specific to ephemeral EMR.

EMR (ephemeral):

Override Key         Equivalent Tamr Property
clusterNamePrefix    TAMR_DATASET_EMR_CLUSTER_NAME_PREFIX
runJobFlowRequest    TAMR_DATASET_EMR_RUN_JOB_FLOW_REQUEST

A serialized RunJobFlowRequest object can also be added as part of the sparkDeploymentConfig with any of the properties below. Keys that do not have a matching Tamr config will override the default serialized RunJobFlowRequest properties.

Override Key                      Equivalent Tamr Property or AWS Documentation
clusterNamePrefix                 TAMR_DATASET_EMR_CLUSTER_NAME_PREFIX
instances                         AWS Documentation
instanceConfig                    AWS Documentation
ec2KeyName                        TAMR_DATASET_EMR_KEY_NAME
ec2SubnetId                       TAMR_DATASET_EMR_SUBNET_ID
ec2SubnetIds                      TAMR_DATASET_EMR_SUBNET_ID (multiple)
emrManagedMasterSecurityGroup     TAMR_DATASET_EMR_MASTER_SECURITY_GROUP
emrManagedSlaveSecurityGroup      TAMR_DATASET_EMR_WORKER_SECURITY_GROUP
serviceAccessSecurityGroup        TAMR_DATASET_EMR_ACCESS_SECURITY_GROUP
additionalMasterSecurityGroups    TAMR_DATASET_EMR_MASTER_SECURITY_GROUP_ADDITIONAL
additionalSlaveSecurityGroups     TAMR_DATASET_EMR_WORKER_SECURITY_GROUP_ADDITIONAL
instanceGroups                    AWS Documentation
instanceType                      TAMR_DATASET_EMR_INSTANCE_TYPE, TAMR_DATASET_EMR_MASTER_INSTANCE_TYPE
instanceRole                      AWS Documentation
instanceCount                     TAMR_DATASET_EMR_INSTANCE_COUNT
visibleToAllUsers                 AWS Documentation
jobFlowRole                       TAMR_DATASET_EMR_INSTANCE_PROFILE
serviceRole                       TAMR_DATASET_EMR_SERVICE_ROLE
ebsRootVolumeSize                 TAMR_DATASET_EMR_ROOT_VOLUME_SIZE
clusterTags                       TAMR_DATASET_EMR_CLUSTER_TAGS
configurations                    AWS Documentation
classification                    AWS Documentation
properties                        TAMR_EMR_PROPERTIES
bootstrapActions                  AWS Documentation
ebsConfiguration                  AWS Documentation
applications                      AWS Documentation
customAmiId                       TAMR_DATASET_EMR_CUSTOM_AMI_ID

The following Tamr config properties do not have corresponding override keys:

  • TAMR_DATASET_EMR_LOG_URI
  • TAMR_DATASET_EMR_RELEASE

The most up-to-date list of settable properties for both sparkDeploymentConfig and sparkConfigOverrides can be found in <https://github.com/Datatamer/javasrc/blob/develop/common/dropwizard/src/main/java/com/tamr/dw/config/Spark.java>. Everything marked with @JsonProperty is a settable configuration property.

Examples

Databricks - Override Databricks-specific properties to define an alternative cluster size

This example uses a set of overrides to define a second cluster size. Tamr is configured to use 80 very large Spark workers by default; we add a set of overrides that makes a much smaller cluster of only 4 smaller worker nodes available for test runs.

Default Tamr configuration set with admin utility

The first block of configuration defines the Databricks workspace Tamr will use. These properties will not be overwritten by TAMR_JOB_SPARK_CONFIG_OVERRIDES.

  • TAMR_REMOTE_SPARK_ENABLED: "true"
  • TAMR_JOB_SPARK_CLUSTER: "databricks"
  • TAMR_JOB_DATABRICKS_HOST: "eastus2.azuredatabricks.net"
  • TAMR_JOB_DATABRICKS_TOKEN: "TOKEN_HERE"
  • TAMR_JOB_DATABRICKS_SPARK_VERSION: "6.4.x-scala2.11"
  • TAMR_JOB_DATABRICKS_WORKINGSPACE: "/FileStore/jars"
  • TAMR_JOB_SPARK_EVENT_LOGS_DIR: "/dbfs/FileStore/jars"

This block of configuration defines the resource scope for the default Spark jobs. Some of these properties will be overwritten.

  • TAMR_JOB_DATABRICKS_NODE_TYPE: "Standard_DS8_v2"
  • TAMR_JOB_DATABRICKS_MIN_WORKERS: "80"
  • TAMR_JOB_DATABRICKS_MAX_WORKERS: "81"
  • TAMR_JOB_SPARK_EXECUTOR_INSTANCES: "80"
  • TAMR_SPARK_MEMORY: "8G"
  • TAMR_JOB_SPARK_DRIVER_MEM: "2G"
  • TAMR_JOB_SPARK_EXECUTOR_CORES: "8"
  • TAMR_JOB_SPARK_EXECUTOR_MEM: "6G"
  • TAMR_JOB_SPARK_PROPS: "{'spark.dynamicAllocation.enabled':'false','spark.yarn.driver.memoryOverhead':'512'}"

We can now override the default configuration by defining a value for TAMR_JOB_SPARK_CONFIG_OVERRIDES:

TAMR_JOB_SPARK_CONFIG_OVERRIDES: "[
    {
        'name': 'lessWorkers',
        'executorInstances': '4',
        'executorCores': '2',
        'sparkDeploymentConfig': {
            'minWorkers': '4',
            'maxWorkers': '5',
            'databricksNodeType': 'Standard_DS3_v2'
        }
    }
]"
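A stray quote or bracket in this single long string is easy to miss. Because the single-quoted form happens to be a valid Python literal, ast.literal_eval can parse it as a quick local structure check (this is only a sanity check on our side, not how Tamr parses the variable):

```python
import ast

# The value assigned to TAMR_JOB_SPARK_CONFIG_OVERRIDES, as one string.
value = ("[{'name':'lessWorkers','executorInstances':'4','executorCores':'2',"
         "'sparkDeploymentConfig':{'minWorkers':'4','maxWorkers':'5',"
         "'databricksNodeType':'Standard_DS3_v2'}}]")

# Raises SyntaxError/ValueError if quotes or brackets are unbalanced.
configs = ast.literal_eval(value)

assert configs[0]["name"] == "lessWorkers"
assert configs[0]["sparkDeploymentConfig"]["minWorkers"] == "4"
```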

EMR - Use runJobFlowRequest to define new EMR clusters

Below is an example of TAMR_JOB_SPARK_CONFIG_OVERRIDES being used to override EMR runJobFlowRequest values. This example defines a new 2-worker core instance group.

TAMR_JOB_SPARK_CONFIG_OVERRIDES: "[
    {
        'name': 'two-workers',
        'executorInstances': '15',
        'executorCores': '4',
        'executorMemory': '20G',
        'driverCores': '4',
        'driverMemory': '20G',
        'sparkConfigOverrides': {
            'instances': {
                'instanceGroups': [{
                    'name': 'core-instance-group',
                    'instanceRole': 'CORE',
                    'instanceType': 'm4.xlarge',
                    'instanceCount': '2'
                }]
            }
        }
    }
]"
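The same local check works for the nested EMR structure. A short sketch that parses the override above and pulls out the core instance group (again using Python literals purely as a local sanity check, not Tamr's parser):

```python
import ast

# The override value from the example above, condensed to the nested parts.
value = ("[{'name':'two-workers',"
         "'sparkConfigOverrides':{'instances':{'instanceGroups':"
         "[{'name':'core-instance-group','instanceRole':'CORE',"
         "'instanceType':'m4.xlarge','instanceCount':'2'}]}}}]")

config = ast.literal_eval(value)[0]

# Drill into the nested map: sparkConfigOverrides -> instances -> instanceGroups.
groups = config["sparkConfigOverrides"]["instances"]["instanceGroups"]
core = groups[0]
assert core["instanceRole"] == "CORE"
assert core["instanceCount"] == "2"
```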
