How to Set TAMR_JOB_SPARK_CONFIG_OVERRIDES

Content:

  1. Overview of this Tamr configuration variable
  2. Complete list of settable properties
  3. Selected examples

Overview of TAMR_JOB_SPARK_CONFIG_OVERRIDES

The TAMR_JOB_SPARK_CONFIG_OVERRIDES configuration variable allows users to run Spark jobs with different Spark cluster configurations. For example, a full pipeline run might use the maximum resources available on a Spark cluster, while throwing the same resources at a small test job would be overkill and needlessly expensive. Any property NOT set in a configuration override map uses the default value Tamr is configured with. This means users only need to set the properties they care about and do not have to re-define every property in the list below.
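The merge behavior can be illustrated with a short sketch (property names and values here are illustrative examples, not Tamr's actual implementation):

```python
# Illustrative sketch of how an override map combines with default config.
# All property names and values below are hypothetical examples.
defaults = {
    "executorInstances": "80",
    "executorCores": "8",
    "executorMemory": "6G",
}

override = {
    "name": "lessWorkers",
    "executorInstances": "4",  # only this property is overridden
}

# Properties absent from the override map keep their default values.
effective = {**defaults, **{k: v for k, v in override.items() if k != "name"}}
print(effective)
# {'executorInstances': '4', 'executorCores': '8', 'executorMemory': '6G'}
```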

The GET /jobs/sparkConfigNames endpoint, found under the Dataset tab of the API, lists the names of all defined sets of overrides.

Spark configuration can be set per project in the Project configuration dialog in the UI, or through several endpoints related to starting Spark jobs, such as POST /jobs/{id}.
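As a quick sketch of listing the defined override names, something like the following could work (the host, path prefix, and auth token are placeholder assumptions; adjust for your deployment and API version):

```python
import urllib.request

# Hypothetical Tamr host and auth token; replace with your own values.
TAMR_HOST = "https://tamr.example.com"
TOKEN = "YOUR_AUTH_TOKEN"

# Build a request against GET /jobs/sparkConfigNames (the /api/dataset
# prefix is an assumption and may differ by Tamr version).
req = urllib.request.Request(
    f"{TAMR_HOST}/api/dataset/jobs/sparkConfigNames",
    headers={"Authorization": f"BasicCreds {TOKEN}"},
)
print(req.full_url)
# names = json.load(urllib.request.urlopen(req))  # would return the list of names
```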

List of Configuration override keys

The following is an exhaustive list of all settable properties and the corresponding Tamr configuration property that each one overrides:

  • name (must be unique for each defined configuration map)
  • sparkHome - TAMR_SPARK_HOME
  • hadoopHome - TAMR_HADOOP_HOME
  • cluster - TAMR_JOB_SPARK_CLUSTER
  • driverMemory - TAMR_JOB_SPARK_DRIVER_MEM
  • executorMemory - TAMR_JOB_SPARK_EXECUTOR_MEM
  • executorCores - TAMR_JOB_SPARK_EXECUTOR_CORES
  • executorInstances - TAMR_JOB_SPARK_EXECUTOR_INSTANCES
  • eventLogsDir - TAMR_JOB_SPARK_EVENT_LOGS_DIR
  • sparkSubmitTimeoutSeconds - TAMR_JOB_SPARK_SUBMIT_TIMEOUT_SECONDS
  • applicationJar - TAMR_JOB_SPARK_JAR
  • auxiliaryJars - TAMR_JOB_SPARK_AUX_JAR
  • sparkProps - TAMR_JOB_SPARK_PROPS
  • sparkEnv - TAMR_JOB_SPARK_ENV
  • fsConfig - corresponds to a map of filesystem configuration variables
  • log4jProps - TAMR_JOB_SPARK_LOG4J_PROPS
  • logJson - TAMR_LOG_JSON_ENABLED
  • dataprocClusterConfig - TAMR_JOB_DATAPROC_CLUSTER_CONFIG
  • runJobFlowRequest - TAMR_DATASET_EMR_RUN_JOB_FLOW_REQUEST (only with ephemeral EMR)
  • micrometer - TAMR_MICROMETER_CONFIG
  • useCopyEventLogger - TAMR_DATASET_USE_COPY_EVENT_LOGGER
  • sparkConfigOverrides - corresponds to a map of configuration specific to ephemeral EMR
  • sparkDeploymentConfig - corresponds to a map of deployment-specific configuration (for ephemeral EMR, use sparkConfigOverrides)

SparkDeploymentConfig

This property encompasses deployment-specific configuration. Each deployment and its unique configuration is listed below, along with the Tamr configuration property each key corresponds to.

Dataproc:

  • dataprocProjectId - TAMR_JOB_DATAPROC_PROJECT_ID
  • dataprocRegion - TAMR_JOB_DATAPROC_REGION
  • dataprocClusterName - TAMR_JOB_DATAPROC_CLUSTER_NAME

Spalk:

  • spalkClusterName - TAMR_JOB_SPALK_CLUSTER_NAME
  • spalkEnableTls - TAMR_JOB_SPALK_TLS_ENABLE

Yarn:

  • localYarnJars - TAMR_JOB_SPARK_LOCAL_YARN_JARS
  • yarnQueue - TAMR_JOB_SPARK_YARN_QUEUE

EMR (static):

  • clusterId - TAMR_JOB_EMR_CLUSTER_ID

Databricks:

  • databricksHost - TAMR_JOB_DATABRICKS_HOST
  • databricksToken - TAMR_JOB_DATABRICKS_TOKEN
  • databricksWorkingspace - TAMR_JOB_DATABRICKS_WORKINGSPACE
  • minWorkers - TAMR_JOB_DATABRICKS_MIN_WORKERS
  • maxWorkers - TAMR_JOB_DATABRICKS_MAX_WORKERS
  • databricksSparkVersion - TAMR_JOB_DATABRICKS_SPARK_VERSION
  • databricksNodeType - TAMR_JOB_DATABRICKS_NODE_TYPE
  • enableDbfsFilesizeCheck - TAMR_JOB_DATABRICKS_ENABLE_DBFS_FILESIZE_CHECK

SparkConfigOverrides

This property encompasses configuration specific to ephemeral EMR.

EMR (ephemeral):

  • clusterNamePrefix - TAMR_DATASET_EMR_CLUSTER_NAME_PREFIX
  • runJobFlowRequest - TAMR_DATASET_EMR_RUN_JOB_FLOW_REQUEST

A serialized RunJobFlowRequest object can also be added as part of the sparkDeploymentConfig with any of the below properties. Keys that do not have a matching Tamr config will override the default serialized RunJobFlowRequest properties.

  • clusterNamePrefix - TAMR_DATASET_EMR_CLUSTER_NAME_PREFIX
  • instances - see AWS documentation
  • instanceConfig - see AWS documentation
  • ec2KeyName - TAMR_DATASET_EMR_KEY_NAME
  • ec2SubnetId - TAMR_DATASET_EMR_SUBNET_ID
  • ec2SubnetIds - TAMR_DATASET_EMR_SUBNET_ID (multiple)
  • emrManagedMasterSecurityGroup - TAMR_DATASET_EMR_MASTER_SECURITY_GROUP
  • emrManagedSlaveSecurityGroup - TAMR_DATASET_EMR_WORKER_SECURITY_GROUP
  • serviceAccessSecurityGroup - TAMR_DATASET_EMR_ACCESS_SECURITY_GROUP
  • additionalMasterSecurityGroups - TAMR_DATASET_EMR_MASTER_SECURITY_GROUP_ADDITIONAL
  • additionalSlaveSecurityGroups - TAMR_DATASET_EMR_WORKER_SECURITY_GROUP_ADDITIONAL
  • instanceGroups - see AWS documentation
  • instanceType - TAMR_DATASET_EMR_INSTANCE_TYPE or TAMR_DATASET_EMR_MASTER_INSTANCE_TYPE
  • instanceRole - see AWS documentation
  • instanceCount - TAMR_DATASET_EMR_INSTANCE_COUNT
  • visibleToAllUsers - see AWS documentation
  • jobFlowRole - TAMR_DATASET_EMR_INSTANCE_PROFILE
  • serviceRole - TAMR_DATASET_EMR_SERVICE_ROLE
  • ebsRootVolumeSize - TAMR_DATASET_EMR_ROOT_VOLUME_SIZE
  • clusterTags - TAMR_DATASET_EMR_CLUSTER_TAGS
  • configurations - see AWS documentation
  • classification - see AWS documentation
  • properties - TAMR_EMR_PROPERTIES
  • bootstrapActions - see AWS documentation
  • ebsConfiguration - see AWS documentation
  • applications - see AWS documentation
  • customAmiId - TAMR_DATASET_EMR_CUSTOM_AMI_ID
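For instance, a fragment mixing a Tamr-mapped key (ec2SubnetId) with a raw RunJobFlowRequest key (visibleToAllUsers) might look like the following; the subnet ID and prefix are placeholder values:

```json
{
  "sparkConfigOverrides": {
    "clusterNamePrefix": "test-run-",
    "ec2SubnetId": "subnet-0123456789abcdef0",
    "visibleToAllUsers": true
  }
}
```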

The following Tamr configuration variables do not have corresponding override keys:

  • TAMR_DATASET_EMR_LOG_URI
  • TAMR_DATASET_EMR_RELEASE

Examples

Databricks - Override Databricks-specific properties to set a variable cluster size

This example uses a set of overrides to define a second cluster size. Tamr is configured to use 80 very large Spark workers by default; we add a set of overrides that makes a much smaller cluster of only 4 smaller worker nodes available for test runs.

Default Tamr configuration set with admin utility

The first block of configuration defines the Databricks workspace Tamr will use. These properties will not be overwritten by TAMR_JOB_SPARK_CONFIG_OVERRIDES.

  • TAMR_REMOTE_SPARK_ENABLED: "true"
  • TAMR_JOB_SPARK_CLUSTER: "databricks"
  • TAMR_JOB_DATABRICKS_HOST: "eastus2.azuredatabricks.net"
  • TAMR_JOB_DATABRICKS_TOKEN: "TOKEN_HERE"
  • TAMR_JOB_DATABRICKS_SPARK_VERSION: "6.4.x-scala2.11"
  • TAMR_JOB_DATABRICKS_WORKINGSPACE: "/FileStore/jars"
  • TAMR_JOB_SPARK_EVENT_LOGS_DIR: "/dbfs/FileStore/jars"

This block of configuration defines the resource scope for the default Spark jobs. Some of these properties will be overwritten.

  • TAMR_JOB_DATABRICKS_NODE_TYPE: "Standard_DS8_v2"
  • TAMR_JOB_DATABRICKS_MIN_WORKERS: "80"
  • TAMR_JOB_DATABRICKS_MAX_WORKERS: "81"
  • TAMR_JOB_SPARK_EXECUTOR_INSTANCES: "80"
  • TAMR_SPARK_MEMORY: "8G"
  • TAMR_JOB_SPARK_DRIVER_MEM: "2G"
  • TAMR_JOB_SPARK_EXECUTOR_CORES: "8"
  • TAMR_JOB_SPARK_EXECUTOR_MEM: "6G"
  • TAMR_JOB_SPARK_PROPS: "{'spark.dynamicAllocation.enabled':'false','spark.yarn.driver.memoryOverhead':'512'}"

We can now override the default configuration by defining a value for TAMR_JOB_SPARK_CONFIG_OVERRIDES:

TAMR_JOB_SPARK_CONFIG_OVERRIDES: "[
    {
        'name': 'lessWorkers',
        'executorInstances': '4',
        'executorCores': '2',
        'sparkDeploymentConfig': {
            'minWorkers': '4',
            'maxWorkers': '5',
            'databricksNodeType': 'Standard_DS3_v2'
        }
    }]"
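Because the value above uses single quotes rather than strict JSON, one quick way to sanity-check it locally is Python's ast.literal_eval (an illustrative check, not something Tamr itself runs):

```python
import ast

# The single-quoted override value from the example above.
raw = """[
    {
        'name': 'lessWorkers',
        'executorInstances': '4',
        'executorCores': '2',
        'sparkDeploymentConfig': {
            'minWorkers': '4',
            'maxWorkers': '5',
            'databricksNodeType': 'Standard_DS3_v2'
        }
    }]"""

# literal_eval parses the single-quoted literal into a Python list of dicts.
overrides = ast.literal_eval(raw)
print(overrides[0]["name"])                                  # lessWorkers
print(overrides[0]["sparkDeploymentConfig"]["minWorkers"])   # 4
```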

EMR - Use runJobFlowRequest to define new EMR clusters

Below is an example of TAMR_JOB_SPARK_CONFIG_OVERRIDES being used to override EMR runJobFlowRequest values. This example defines a new 2-worker core instance group.

TAMR_JOB_SPARK_CONFIG_OVERRIDES: "[
    {
        'name': 'two-workers',
        'executorInstances': '15',
        'executorCores': '4',
        'executorMemory': '20G',
        'driverCores': '4',
        'driverMemory': '20G',
        'sparkConfigOverrides': {
            'instances': {
                'instanceGroups': [{
                    'name': 'core-instance-group',
                    'instanceRole': 'CORE',
                    'instanceType': 'm4.xlarge',
                    'instanceCount': '2'
                }]
            }
        }
    }]"