How to Set TAMR_JOB_SPARK_CONFIG_OVERRIDES
Content:
- Overview of this Tamr configuration variable
- Complete list of settable properties
- Selected examples
Overview of TAMR_JOB_SPARK_CONFIG_OVERRIDES
The TAMR_JOB_SPARK_CONFIG_OVERRIDES configuration variable lets users run Spark jobs with different Spark cluster configurations. For example, a full pipeline run might use the maximum resources available on a Spark cluster, while giving the same resources to a small test job would be overkill and needlessly expensive. Any property NOT set in a configuration override map uses the default value that Tamr is configured with. Users therefore only need to set the properties they care about and do not have to redefine every property in the list below.
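The fallback behavior described above can be pictured as layering a set of overrides on top of the defaults. The following is a minimal sketch of that idea only, not Tamr's implementation: the function name effective_config and the DEFAULTS dictionary are hypothetical, and nested maps such as sparkDeploymentConfig may be merged differently in practice.

```python
# Hypothetical default values, keyed by override key rather than by Tamr property.
DEFAULTS = {
    "executorInstances": "80",
    "executorCores": "8",
    "executorMemory": "6G",
    "driverMemory": "2G",
}


def effective_config(defaults, override):
    """Return the defaults with the override's values layered on top.

    The 'name' key only labels the set of overrides, so it is left out.
    """
    merged = dict(defaults)
    merged.update({key: value for key, value in override.items() if key != "name"})
    return merged


small = {"name": "lessWorkers", "executorInstances": "4", "executorCores": "2"}
result = effective_config(DEFAULTS, small)
print(result)
```

Only executorInstances and executorCores change here; executorMemory and driverMemory keep their default values because the override does not mention them.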
The Dataset tab of the API provides the GET /jobs/sparkConfigNames endpoint, which lists the names of all defined sets of overrides.
A Spark configuration can be set per project in the Project configuration dialog in the UI, or through several endpoints related to starting Spark jobs, such as POST /jobs/{id}.
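As a rough illustration, a request to the sparkConfigNames endpoint could be prepared as shown below. This is a sketch under stated assumptions: the base URL, the API path prefix, and the Authorization header value are placeholders that depend on your deployment, and only the /jobs/sparkConfigNames path comes from this page. The example builds the request without sending it.

```python
import urllib.request


def build_config_names_request(base_url, auth_header):
    """Build (but do not send) a GET request for the defined override names.

    base_url and auth_header are deployment-specific placeholders.
    """
    url = base_url.rstrip("/") + "/jobs/sparkConfigNames"
    headers = {"Authorization": auth_header, "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


# Hypothetical host and prefix; replace with the values for your instance.
req = build_config_names_request("https://tamr.example.com/api/", "<auth header value>")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it; the format of the response is not described on this page, so inspect it before relying on a particular shape.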
List of Configuration override keys
The following is an exhaustive list of all settable properties and the corresponding Tamr configuration property that each one overrides:
- name (must be unique for each defined configuration map)
- sparkHome - TAMR_SPARK_HOME
- hadoopHome - TAMR_HADOOP_HOME
- cluster - TAMR_JOB_SPARK_CLUSTER
- driverMemory - TAMR_JOB_SPARK_DRIVER_MEM
- executorMemory - TAMR_JOB_SPARK_EXECUTOR_MEM
- executorCores - TAMR_JOB_SPARK_EXECUTOR_CORES
- executorInstances - TAMR_JOB_SPARK_EXECUTOR_INSTANCES
- eventLogsDir - TAMR_JOB_SPARK_EVENT_LOGS_DIR
- sparkSubmitTimeoutSeconds - TAMR_JOB_SPARK_SUBMIT_TIMEOUT_SECONDS
- applicationJar - TAMR_JOB_SPARK_JAR
- auxiliaryJars - TAMR_JOB_SPARK_AUX_JAR
- sparkProps - TAMR_JOB_SPARK_PROPS
- sparkEnv - TAMR_JOB_SPARK_ENV
- fsConfig - corresponds to map of the following Filesystem configuration variables:
- log4jProps - TAMR_JOB_SPARK_LOG4J_PROPS
- logJson - TAMR_LOG_JSON_ENABLED
- dataprocClusterConfig - TAMR_JOB_DATAPROC_CLUSTER_CONFIG
- runJobFlowRequest - TAMR_DATASET_EMR_RUN_JOB_FLOW_REQUEST (only with ephemeral EMR)
- micrometer - TAMR_MICROMETER_CONFIG
- useCopyEventLogger - TAMR_DATASET_USE_COPY_EVENT_LOGGER
- sparkConfigOverrides - corresponds to a map of configuration specific to ephemeral EMR
- sparkDeploymentConfig - corresponds to a map of deployment-specific configuration (for ephemeral EMR, use sparkConfigOverrides)
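Because each set of overrides must have a unique name, and a misspelled key is easy to miss, it can help to check a list of overrides before applying it. The sketch below is one possible pre-check, not a Tamr tool: check_overrides is a hypothetical helper, KNOWN_KEYS is copied from the list above, and treating a missing name as a problem is an assumption.

```python
# Top-level override keys copied from the list above.
KNOWN_KEYS = {
    "name", "sparkHome", "hadoopHome", "cluster", "driverMemory",
    "executorMemory", "executorCores", "executorInstances", "eventLogsDir",
    "sparkSubmitTimeoutSeconds", "applicationJar", "auxiliaryJars",
    "sparkProps", "sparkEnv", "fsConfig", "log4jProps", "logJson",
    "dataprocClusterConfig", "runJobFlowRequest", "micrometer",
    "useCopyEventLogger", "sparkConfigOverrides", "sparkDeploymentConfig",
}


def check_overrides(overrides):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    seen = set()
    for index, entry in enumerate(overrides):
        name = entry.get("name")
        if not name:
            # Assumption: every set of overrides needs a name to be selectable.
            problems.append(f"entry {index}: missing 'name'")
        elif name in seen:
            problems.append(f"entry {index}: duplicate name '{name}'")
        else:
            seen.add(name)
        for key in sorted(set(entry) - KNOWN_KEYS):
            problems.append(f"entry {index}: key '{key}' is not in the documented list")
    return problems


sample = [
    {"name": "small", "executorInstances": "4"},
    {"name": "small", "executorCoers": "2"},  # duplicate name and a misspelled key
]
problems = check_overrides(sample)
for problem in problems:
    print(problem)
```

The check only looks at top-level keys; the nested keys inside sparkDeploymentConfig and sparkConfigOverrides would need their own lists.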
SparkDeploymentConfig
This property encompasses deployment-specific configuration. Each deployment type and its unique configuration keys are listed below, along with the Tamr configuration property each key corresponds to.
Dataproc:
Override Key | Equivalent Tamr Property |
---|---|
dataprocProjectId | TAMR_JOB_DATAPROC_PROJECT_ID |
dataprocRegion | TAMR_JOB_DATAPROC_REGION |
dataprocClusterName | TAMR_JOB_DATAPROC_CLUSTER_NAME |
Spalk:
Override Key | Equivalent Tamr Property |
---|---|
spalkClusterName | TAMR_JOB_SPALK_CLUSTER_NAME |
spalkEnableTls | TAMR_JOB_SPALK_TLS_ENABLE |
Yarn:
Override Key | Equivalent Tamr Property |
---|---|
localYarnJars | TAMR_JOB_SPARK_LOCAL_YARN_JARS |
yarnQueue | TAMR_JOB_SPARK_YARN_QUEUE |
EMR (static):
Override Key | Equivalent Tamr Property |
---|---|
clusterId | TAMR_JOB_EMR_CLUSTER_ID |
Databricks:
Override Key | Equivalent Tamr Property |
---|---|
databricksHost | TAMR_JOB_DATABRICKS_HOST |
databricksToken | TAMR_JOB_DATABRICKS_TOKEN |
databricksWorkingspace | TAMR_JOB_DATABRICKS_WORKINGSPACE |
minWorkers | TAMR_JOB_DATABRICKS_MIN_WORKERS |
maxWorkers | TAMR_JOB_DATABRICKS_MAX_WORKERS |
databricksSparkVersion | TAMR_JOB_DATABRICKS_SPARK_VERSION |
databricksNodeType | TAMR_JOB_DATABRICKS_NODE_TYPE |
enableDbfsFilesizeCheck | TAMR_JOB_DATABRICKS_ENABLE_DBFS_FILESIZE_CHECK |
SparkConfigOverrides
This property encompasses configuration specific to ephemeral EMR.
EMR (ephemeral):
Override Key | Equivalent Tamr Property |
---|---|
clusterNamePrefix | TAMR_DATASET_EMR_CLUSTER_NAME_PREFIX |
runJobFlowRequest | TAMR_DATASET_EMR_RUN_JOB_FLOW_REQUEST |
A serialized RunJobFlowRequest object can also be added as part of the sparkDeploymentConfig with any of the properties below. Keys that do not have a matching Tamr configuration property override the default serialized RunJobFlowRequest properties.
Override Key | Equivalent Tamr Property or AWS Documentation |
---|---|
clusterNamePrefix | TAMR_DATASET_EMR_CLUSTER_NAME_PREFIX |
instances | AWS Documentation |
instanceConfig | AWS Documentation |
ec2KeyName | TAMR_DATASET_EMR_KEY_NAME |
ec2SubnetId | TAMR_DATASET_EMR_SUBNET_ID |
ec2SubnetIds | TAMR_DATASET_EMR_SUBNET_ID (multiple) |
emrManagedMasterSecurityGroup | TAMR_DATASET_EMR_MASTER_SECURITY_GROUP |
emrManagedSlaveSecurityGroup | TAMR_DATASET_EMR_WORKER_SECURITY_GROUP |
serviceAccessSecurityGroup | TAMR_DATASET_EMR_ACCESS_SECURITY_GROUP |
additionalMasterSecurityGroups | TAMR_DATASET_EMR_MASTER_SECURITY_GROUP_ADDITIONAL |
additionalSlaveSecurityGroups | TAMR_DATASET_EMR_WORKER_SECURITY_GROUP_ADDITIONAL |
instanceGroups | AWS Documentation |
instanceType | TAMR_DATASET_EMR_INSTANCE_TYPE, TAMR_DATASET_EMR_MASTER_INSTANCE_TYPE |
instanceRole | AWS Documentation |
instanceCount | TAMR_DATASET_EMR_INSTANCE_COUNT |
visibleToAllUsers | AWS Documentation |
jobFlowRole | TAMR_DATASET_EMR_INSTANCE_PROFILE |
serviceRole | TAMR_DATASET_EMR_SERVICE_ROLE |
ebsRootVolumeSize | TAMR_DATASET_EMR_ROOT_VOLUME_SIZE |
clusterTags | TAMR_DATASET_EMR_CLUSTER_TAGS |
configurations | AWS Documentation |
classification | AWS Documentation |
properties | TAMR_EMR_PROPERTIES |
bootstrapActions | AWS Documentation |
ebsConfiguration | AWS Documentation |
applications | AWS Documentation |
customAmiId | TAMR_DATASET_EMR_CUSTOM_AMI_ID |
The following Tamr configuration properties do not have corresponding override keys:
- TAMR_DATASET_EMR_LOG_URI
- TAMR_DATASET_EMR_RELEASE
Examples
Databricks - Override Databricks-specific properties to set the variable cluster size
This example uses a set of overrides to define a second cluster size. Tamr is configured to use 80 very large Spark workers by default. We add a set of overrides to make a very small cluster of only 4 much smaller worker nodes available for test runs.
Default Tamr configuration set with admin utility
The first block of configuration defines the Databricks workspace that Tamr will use. These properties will not be overwritten by TAMR_JOB_SPARK_CONFIG_OVERRIDES.
- TAMR_REMOTE_SPARK_ENABLED: "true"
- TAMR_JOB_SPARK_CLUSTER: "databricks"
- TAMR_JOB_DATABRICKS_HOST: "eastus2.azuredatabricks.net"
- TAMR_JOB_DATABRICKS_TOKEN: "TOKEN_HERE"
- TAMR_JOB_DATABRICKS_SPARK_VERSION: "6.4.x-scala2.11"
- TAMR_JOB_DATABRICKS_WORKINGSPACE: "/FileStore/jars"
- TAMR_JOB_SPARK_EVENT_LOGS_DIR: "/dbfs/FileStore/jars"
This block of configuration defines the resource scope for the default Spark jobs. Some of these properties will be overwritten.
- TAMR_JOB_DATABRICKS_NODE_TYPE: "Standard_DS8_v2"
- TAMR_JOB_DATABRICKS_MIN_WORKERS: "80"
- TAMR_JOB_DATABRICKS_MAX_WORKERS: "81"
- TAMR_JOB_SPARK_EXECUTOR_INSTANCES: "80"
- TAMR_SPARK_MEMORY: "8G"
- TAMR_JOB_SPARK_DRIVER_MEM: "2G"
- TAMR_JOB_SPARK_EXECUTOR_CORES: "8"
- TAMR_JOB_SPARK_EXECUTOR_MEM: "6G"
- TAMR_JOB_SPARK_PROPS: "{'spark.dynamicAllocation.enabled':'false','spark.yarn.driver.memoryOverhead':'512'}"
We can now override the default configuration by defining a value for TAMR_JOB_SPARK_CONFIG_OVERRIDES:
TAMR_JOB_SPARK_CONFIG_OVERRIDES: "[
  {
    'name':'lessWorkers',
    'executorInstances':'4',
    'executorCores':'2',
    'sparkDeploymentConfig':{
      'minWorkers':'4',
      'maxWorkers':'5',
      'databricksNodeType':'Standard_DS3_v2'
    }
  }]"
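Before applying a value like the one above, it can be useful to confirm that it is well formed. The sketch below parses the Databricks example with Python's ast.literal_eval, which accepts the single-quoted list-of-maps form used here. This is only a local sanity check and an assumption about the syntax: how Tamr itself parses the value is not stated on this page, and values such as unquoted true or false would not be accepted by this check.

```python
import ast

# The override value from the Databricks example, without the outer double quotes.
RAW_VALUE = """[
{
'name':'lessWorkers',
'executorInstances':'4',
'executorCores':'2',
'sparkDeploymentConfig':{
'minWorkers':'4',
'maxWorkers':'5',
'databricksNodeType':'Standard_DS3_v2'
}
}]"""

# literal_eval raises an exception if brackets or quotes are unbalanced.
overrides = ast.literal_eval(RAW_VALUE)

# Index the sets of overrides by name so one can be looked up directly.
by_name = {entry["name"]: entry for entry in overrides}
less_workers = by_name["lessWorkers"]
print(less_workers["sparkDeploymentConfig"]["databricksNodeType"])
```

A missing brace or quote in the value surfaces here as an exception rather than as a failed job submission later.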
EMR - Use runJobFlowRequest to define new EMR clusters
Below is an example of TAMR_JOB_SPARK_CONFIG_OVERRIDES being used to override EMR runJobFlowRequest values. This example defines a new 2-worker core instance group.
TAMR_JOB_SPARK_CONFIG_OVERRIDES: "[{
  'name': 'two-workers',
  'executorInstances': '15',
  'executorCores': '4',
  'executorMemory': '20G',
  'driverCores': '4',
  'driverMemory': '20G',
  'sparkConfigOverrides': {
    'instances': {
      'instanceGroups': [{
        'name': 'core-instance-group',
        'instanceRole': 'CORE',
        'instanceType': 'm4.xlarge',
        'instanceCount': '2'
      }]
    }
  }
}]"