Gathering Debugging Information for Databricks Spark in Azure Cloud-Native Tamr Deployments

Tamr uses Databricks to run Spark jobs. When a Tamr job fails or misbehaves, some debugging information is available on the jobs page of the Tamr UI and some in Tamr’s dataset.log. Direct access to Databricks is often useful for gathering additional information.

Databricks CLI

The Databricks CLI is a Python-based tool that allows you to interact with Databricks and its file system. Configure it with the same hostname and token that the Tamr application uses.
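As a minimal sketch of that configuration: the Databricks CLI reads its host and token from a profile in ~/.databrickscfg (or from the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables). The Python helper below renders such a profile as text. The function name is hypothetical and the host and token shown in the usage note are placeholders; substitute the values from your Tamr configuration.

```python
import configparser
import io


def render_databricks_profile(host: str, token: str, profile: str = "DEFAULT") -> str:
    """Return the text of a ~/.databrickscfg profile for the given host and token.

    Hypothetical helper for illustration; it only builds the text and does
    not write to disk or contact Databricks.
    """
    # Interpolation is disabled so that a '%' in a value is kept literally.
    config = configparser.ConfigParser(interpolation=None)
    config[profile] = {"host": host, "token": token}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

For example, print(render_databricks_profile("https://<workspace-host>", "<token>")) produces a [DEFAULT] section with host and token entries that can be saved as ~/.databrickscfg.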

Databricks FileSystem (DBFS)

Tamr uses DBFS to send jar files and job specs to Databricks, and reads back job status information. Databricks also writes logs to DBFS.

You can use the Databricks CLI to interact with DBFS:

databricks fs ls

databricks fs ls dbfs:/FileStore/
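When the listing needs to be collected from a script, the same command can be wrapped in Python. This is a rough sketch that assumes the databricks executable is on the PATH and already configured; list_dbfs and its run parameter are hypothetical names introduced here, not part of the CLI.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def _run_cli(argv: Sequence[str]) -> str:
    """Run a CLI command and return its standard output as text."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


def list_dbfs(path: str = "dbfs:/",
              run: Optional[Callable[[Sequence[str]], str]] = None) -> List[str]:
    """Return the entries printed by `databricks fs ls <path>`, one per item.

    `run` can be replaced with a stub so the function can be exercised
    without a Databricks workspace.
    """
    if not path.startswith("dbfs:/"):
        raise ValueError("DBFS paths are expected to start with 'dbfs:/'")
    command = ["databricks", "fs", "ls", path]
    output = (run or _run_cli)(command)
    # Drop blank lines and surrounding whitespace from the CLI output.
    return [line.strip() for line in output.splitlines() if line.strip()]
```

Calling list_dbfs("dbfs:/FileStore/") would return the same entries that the command above prints, as a Python list.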

Spark History Server / Spark Event Logs

The Spark UI for current and past jobs (Spark History Server) is available through the Databricks UI. Log in to the UI, access the Cluster that is running or did run the Tamr job of interest, and select “Spark UI”.

The Spark event logs are written to DBFS in the directory specified by the Tamr configuration variable TAMR_JOB_SPARK_EVENT_LOGS_DIR. They have names like app-20210121141448-0000. Tamr support may ask for these logs in order to reconstruct a Spark History Server for a job that requires debugging.

Example of copying the logs locally:

databricks fs cp -r dbfs:/<spark_event_logs_dir>/<app_id> <local_dir>
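To copy several event logs at once, the copy commands can be generated from a directory listing. The sketch below is an illustration under stated assumptions: the name pattern is inferred from the single example name app-20210121141448-0000, the -r flag is included in case an entry is a directory, and event_log_copy_commands is a hypothetical helper. It only builds the argument lists; pass each one to subprocess.run to execute it.

```python
import os
import re
from typing import Iterable, List

# Assumed pattern, based on the example name app-20210121141448-0000.
EVENT_LOG_NAME = re.compile(r"^app-\d{14}-\d{4}$")


def event_log_copy_commands(names: Iterable[str],
                            event_logs_dir: str,
                            local_dir: str) -> List[List[str]]:
    """Build one `databricks fs cp` command per Spark event log name.

    `names` is a listing of the event log directory, `event_logs_dir` is the
    DBFS directory (the value of TAMR_JOB_SPARK_EVENT_LOGS_DIR), and
    `local_dir` is the local destination directory.
    """
    base = event_logs_dir.rstrip("/")
    commands = []
    for raw_name in sorted(names):
        name = raw_name.rstrip("/")
        if not EVENT_LOG_NAME.match(name):
            continue  # skip anything that does not look like an event log
        source = f"{base}/{name}"
        target = os.path.join(local_dir, name)
        commands.append(["databricks", "fs", "cp", "-r", source, target])
    return commands
```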


Cluster Event Log

The cluster event log displays important cluster lifecycle events that are triggered manually by user actions or automatically by Azure Databricks. It is useful for checking whether a cluster is taking an abnormally long time to start up, or whether a job was terminated unexpectedly. When an event includes a failure reason, that reason can be used for debugging or reported to Microsoft support.

The event log can be accessed through the Databricks UI by navigating to the cluster of interest and selecting “Event Log”.

You can also use the Databricks CLI to get cluster information, including the event log:

databricks clusters list

databricks clusters events --cluster-id <my_cluster>

Typical events include:

  • CREATING: Indicates that the cluster is being created.
  • RUNNING: Indicates the cluster has finished being created. Includes the number of nodes in the cluster and a failure reason if some nodes could not be acquired.
  • RESIZING: Indicates a change in the target size of the cluster (upsize or downsize). This can happen if only a subset of the requested nodes are successfully provisioned the first time.
  • TERMINATING: Indicates that the cluster is being terminated.
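The events returned by the CLI can be summarized with a short script, for example to measure how long a cluster took to start. This sketch assumes the command prints JSON with an "events" array whose items carry "timestamp" (milliseconds since the epoch), "type", and an optional "details" object with a "reason" field, as in the Clusters API events response. The function names are hypothetical, and the field names should be checked against your actual output.

```python
import json
from datetime import datetime, timezone
from typing import List, Optional


def load_events(cli_output: str) -> List[dict]:
    """Parse the JSON printed by the events command, oldest event first."""
    events = json.loads(cli_output).get("events", [])
    return sorted(events, key=lambda event: event["timestamp"])


def startup_seconds(events: List[dict]) -> Optional[float]:
    """Seconds from the first CREATING event to the first later RUNNING event.

    Expects events sorted oldest first; returns None if either is missing.
    """
    created_at = None
    for event in events:
        if event["type"] == "CREATING" and created_at is None:
            created_at = event["timestamp"]
        elif event["type"] == "RUNNING" and created_at is not None:
            return (event["timestamp"] - created_at) / 1000.0
    return None


def format_event(event: dict) -> str:
    """Render one event as 'UTC time, type, optional reason' on a single line."""
    when = datetime.fromtimestamp(event["timestamp"] / 1000.0, tz=timezone.utc)
    reason = event.get("details", {}).get("reason")
    suffix = f" reason={json.dumps(reason, sort_keys=True)}" if reason else ""
    return f"{when:%Y-%m-%d %H:%M:%S}Z {event['type']}{suffix}"
```

A typical use is to save the output of the events command to a file, pass its contents to load_events, and then print format_event for each event along with startup_seconds for the whole list.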

A full list of event types can be found in the Azure Databricks Clusters API documentation.