How to Set TAMR_JOB_SPARK_CONFIG_OVERRIDES

Content:

  1. Overview of this Tamr configuration variable
  2. Complete list of settable properties
  3. Selected examples

Overview of TAMR_JOB_SPARK_CONFIG_OVERRIDES

The TAMR_JOB_SPARK_CONFIG_OVERRIDES configuration variable allows users to run Spark jobs with different Spark cluster configurations. For example, a full pipeline run might use the maximum resources available on a Spark cluster, while throwing the same resources at a small test job would be overkill and needlessly expensive. Any property NOT set in a configuration override map uses the default value Tamr is configured with. This means users only need to set the properties they care about and do not have to re-define every property in the list below.
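The merge behavior can be illustrated with a short sketch (property names and values here are illustrative examples, not Tamr's actual implementation):

```python
# Illustrative sketch of how an override map combines with default config.
# All property names and values below are hypothetical examples.
defaults = {
    "executorInstances": "80",
    "executorCores": "8",
    "executorMemory": "6G",
}

override = {
    "name": "lessWorkers",
    "executorInstances": "4",  # only this property is overridden
}

# Properties absent from the override map keep their default values.
effective = {**defaults, **{k: v for k, v in override.items() if k != "name"}}
print(effective)
# {'executorInstances': '4', 'executorCores': '8', 'executorMemory': '6G'}
```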

The GET /jobs/sparkConfigNames endpoint, found under the Dataset tab of the API, lists the names of all defined sets of overrides.

Spark configuration can be set per project in the Project configuration dialog in the UI, or through several endpoints related to starting Spark jobs, such as POST /jobs/{id}.
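As a quick sketch of listing the defined override names, something like the following could work (the host, path prefix, and auth token are placeholder assumptions; adjust for your deployment and API version):

```python
import urllib.request

# Hypothetical Tamr host and auth token; replace with your own values.
TAMR_HOST = "https://tamr.example.com"
TOKEN = "YOUR_AUTH_TOKEN"

# Build a request against GET /jobs/sparkConfigNames (the /api/dataset
# prefix is an assumption and may differ by Tamr version).
req = urllib.request.Request(
    f"{TAMR_HOST}/api/dataset/jobs/sparkConfigNames",
    headers={"Authorization": f"BasicCreds {TOKEN}"},
)
print(req.full_url)
# names = json.load(urllib.request.urlopen(req))  # would return the list of names
```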

List of Configuration override keys

The following is an exhaustive list of all settable properties and the corresponding Tamr configuration property that each one overrides:

  • name (must be unique for each defined configuration map)
  • sparkHome - TAMR_SPARK_HOME
  • hadoopHome - TAMR_HADOOP_HOME
  • cluster - TAMR_JOB_SPARK_CLUSTER
  • driverMemory - TAMR_JOB_SPARK_DRIVER_MEM
  • executorMemory - TAMR_JOB_SPARK_EXECUTOR_MEM
  • executorCores - TAMR_JOB_SPARK_EXECUTOR_CORES
  • executorInstances - TAMR_JOB_SPARK_EXECUTOR_INSTANCES
  • eventLogsDir - TAMR_JOB_SPARK_EVENT_LOGS_DIR
  • sparkSubmitTimeoutSeconds - TAMR_JOB_SPARK_SUBMIT_TIMEOUT_SECONDS
  • applicationJar - TAMR_JOB_SPARK_JAR
  • auxiliaryJars - TAMR_JOB_SPARK_AUX_JAR
  • sparkProps - TAMR_JOB_SPARK_PROPS
  • sparkEnv - TAMR_JOB_SPARK_ENV
  • fsConfig - corresponds to a map of filesystem configuration variables
  • log4jProps - TAMR_JOB_SPARK_LOG4J_PROPS
  • logJson - TAMR_LOG_JSON_ENABLED
  • dataprocClusterConfig - TAMR_JOB_DATAPROC_CLUSTER_CONFIG
  • runJobFlowRequest - TAMR_DATASET_EMR_RUN_JOB_FLOW_REQUEST (only with ephemeral EMR)
  • micrometer - TAMR_MICROMETER_CONFIG
  • useCopyEventLogger - TAMR_DATASET_USE_COPY_EVENT_LOGGER
  • sparkConfigOverrides - corresponds to a map of configuration specific to ephemeral EMR
  • sparkDeploymentConfig - corresponds to a map of deployment-specific configuration (for ephemeral EMR, use sparkConfigOverrides)

SparkDeploymentConfig

This property encompasses deployment-specific configuration. Each deployment and its unique configuration is listed below, along with the Tamr configuration property each key corresponds to.

Dataproc:

  • dataprocProjectId - TAMR_JOB_DATAPROC_PROJECT_ID
  • dataprocRegion - TAMR_JOB_DATAPROC_REGION
  • dataprocClusterName - TAMR_JOB_DATAPROC_CLUSTER_NAME

Spalk:

  • spalkClusterName - TAMR_JOB_SPALK_CLUSTER_NAME
  • spalkEnableTls - TAMR_JOB_SPALK_TLS_ENABLE

Yarn:

  • localYarnJars - TAMR_JOB_SPARK_LOCAL_YARN_JARS
  • yarnQueue - TAMR_JOB_SPARK_YARN_QUEUE

EMR (static):

  • clusterId - TAMR_JOB_EMR_CLUSTER_ID

Databricks:

  • databricksHost - TAMR_JOB_DATABRICKS_HOST
  • databricksToken - TAMR_JOB_DATABRICKS_TOKEN
  • databricksWorkingspace - TAMR_JOB_DATABRICKS_WORKINGSPACE
  • minWorkers - TAMR_JOB_DATABRICKS_MIN_WORKERS
  • maxWorkers - TAMR_JOB_DATABRICKS_MAX_WORKERS
  • databricksSparkVersion - TAMR_JOB_DATABRICKS_SPARK_VERSION
  • databricksNodeType - TAMR_JOB_DATABRICKS_NODE_TYPE
  • enableDbfsFilesizeCheck - TAMR_JOB_DATABRICKS_ENABLE_DBFS_FILESIZE_CHECK

SparkConfigOverrides

This property encompasses configuration specific to ephemeral EMR.

EMR (ephemeral):

  • clusterNamePrefix - TAMR_DATASET_EMR_CLUSTER_NAME_PREFIX
  • runJobFlowRequest - TAMR_DATASET_EMR_RUN_JOB_FLOW_REQUEST

A serialized RunJobFlowRequest object can also be added as part of the sparkDeploymentConfig with any of the below properties. Keys that do not have a matching Tamr config will override the default serialized RunJobFlowRequest properties.

  • clusterNamePrefix - TAMR_DATASET_EMR_CLUSTER_NAME_PREFIX
  • instances - see AWS documentation
  • instanceConfig - see AWS documentation
  • ec2KeyName - TAMR_DATASET_EMR_KEY_NAME
  • ec2SubnetId - TAMR_DATASET_EMR_SUBNET_ID
  • ec2SubnetIds - TAMR_DATASET_EMR_SUBNET_ID (multiple)
  • emrManagedMasterSecurityGroup - TAMR_DATASET_EMR_MASTER_SECURITY_GROUP
  • emrManagedSlaveSecurityGroup - TAMR_DATASET_EMR_WORKER_SECURITY_GROUP
  • serviceAccessSecurityGroup - TAMR_DATASET_EMR_ACCESS_SECURITY_GROUP
  • additionalMasterSecurityGroups - TAMR_DATASET_EMR_MASTER_SECURITY_GROUP_ADDITIONAL
  • additionalSlaveSecurityGroups - TAMR_DATASET_EMR_WORKER_SECURITY_GROUP_ADDITIONAL
  • instanceGroups - see AWS documentation
  • instanceType - TAMR_DATASET_EMR_INSTANCE_TYPE or TAMR_DATASET_EMR_MASTER_INSTANCE_TYPE
  • instanceRole - see AWS documentation
  • instanceCount - TAMR_DATASET_EMR_INSTANCE_COUNT
  • visibleToAllUsers - see AWS documentation
  • jobFlowRole - TAMR_DATASET_EMR_INSTANCE_PROFILE
  • serviceRole - TAMR_DATASET_EMR_SERVICE_ROLE
  • ebsRootVolumeSize - TAMR_DATASET_EMR_ROOT_VOLUME_SIZE
  • clusterTags - TAMR_DATASET_EMR_CLUSTER_TAGS
  • configurations - see AWS documentation
  • classification - see AWS documentation
  • properties - TAMR_EMR_PROPERTIES
  • bootstrapActions - see AWS documentation
  • ebsConfiguration - see AWS documentation
  • applications - see AWS documentation
  • customAmiId - TAMR_DATASET_EMR_CUSTOM_AMI_ID
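For instance, a fragment mixing a Tamr-mapped key (ec2SubnetId) with a raw RunJobFlowRequest key (visibleToAllUsers) might look like the following; the subnet ID and prefix are placeholder values:

```json
{
  "sparkConfigOverrides": {
    "clusterNamePrefix": "test-run-",
    "ec2SubnetId": "subnet-0123456789abcdef0",
    "visibleToAllUsers": true
  }
}
```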

The following Tamr configuration variables do not have corresponding override keys:

  • TAMR_DATASET_EMR_LOG_URI
  • TAMR_DATASET_EMR_RELEASE

Examples

Databricks - Override Databricks-specific properties to set a variable cluster size

This example uses a set of overrides to define a second cluster size. Tamr is configured to use 80 very large Spark workers by default; we add a set of overrides that makes a much smaller cluster of only 4 smaller worker nodes available for test runs.

Default Tamr configuration set with admin utility

The first block of configuration defines the Databricks workspace Tamr will use. These properties will not be overwritten by TAMR_JOB_SPARK_CONFIG_OVERRIDES.

  • TAMR_REMOTE_SPARK_ENABLED: "true"
  • TAMR_JOB_SPARK_CLUSTER: "databricks"
  • TAMR_JOB_DATABRICKS_HOST: "eastus2.azuredatabricks.net"
  • TAMR_JOB_DATABRICKS_TOKEN: "TOKEN_HERE"
  • TAMR_JOB_DATABRICKS_SPARK_VERSION: "6.4.x-scala2.11"
  • TAMR_JOB_DATABRICKS_WORKINGSPACE: "/FileStore/jars"
  • TAMR_JOB_SPARK_EVENT_LOGS_DIR: "/dbfs/FileStore/jars"

This block of configuration defines the resource scope for the default Spark jobs. Some of these properties will be overwritten.

  • TAMR_JOB_DATABRICKS_NODE_TYPE: "Standard_DS8_v2"
  • TAMR_JOB_DATABRICKS_MIN_WORKERS: "80"
  • TAMR_JOB_DATABRICKS_MAX_WORKERS: "81"
  • TAMR_JOB_SPARK_EXECUTOR_INSTANCES: "80"
  • TAMR_SPARK_MEMORY: "8G"
  • TAMR_JOB_SPARK_DRIVER_MEM: "2G"
  • TAMR_JOB_SPARK_EXECUTOR_CORES: "8"
  • TAMR_JOB_SPARK_EXECUTOR_MEM: "6G"
  • TAMR_JOB_SPARK_PROPS: "{'spark.dynamicAllocation.enabled':'false','spark.yarn.driver.memoryOverhead':'512'}"

We can now override the default configuration by defining a value for TAMR_JOB_SPARK_CONFIG_OVERRIDES:

TAMR_JOB_SPARK_CONFIG_OVERRIDES: "[
    {
        'name': 'lessWorkers',
        'executorInstances': '4',
        'executorCores': '2',
        'sparkDeploymentConfig': {
            'minWorkers': '4',
            'maxWorkers': '5',
            'databricksNodeType': 'Standard_DS3_v2'
        }
    }]"
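Because the value above uses single quotes rather than strict JSON, one quick way to sanity-check it locally is Python's ast.literal_eval (an illustrative check, not something Tamr itself runs):

```python
import ast

# The single-quoted override value from the example above.
raw = """[
    {
        'name': 'lessWorkers',
        'executorInstances': '4',
        'executorCores': '2',
        'sparkDeploymentConfig': {
            'minWorkers': '4',
            'maxWorkers': '5',
            'databricksNodeType': 'Standard_DS3_v2'
        }
    }]"""

# literal_eval parses the single-quoted literal into a Python list of dicts.
overrides = ast.literal_eval(raw)
print(overrides[0]["name"])                                  # lessWorkers
print(overrides[0]["sparkDeploymentConfig"]["minWorkers"])   # 4
```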

EMR - Use runJobFlowRequest to define new EMR clusters

Below is an example of TAMR_JOB_SPARK_CONFIG_OVERRIDES being used to override EMR runJobFlowRequest values. This example defines a new 2-worker core instance group.

TAMR_JOB_SPARK_CONFIG_OVERRIDES: "[
    {
        'name': 'two-workers',
        'executorInstances': '15',
        'executorCores': '4',
        'executorMemory': '20G',
        'driverCores': '4',
        'driverMemory': '20G',
        'sparkConfigOverrides': {
            'instances': {
                'instanceGroups': [{
                    'name': 'core-instance-group',
                    'instanceRole': 'CORE',
                    'instanceType': 'm4.xlarge',
                    'instanceCount': '2'
                }]
            }
        }
    }]"