User Guide - Databricks on AWS

Below is a guide to obtaining a set of predictions from the Sync Autotuner for Databricks on AWS. It assumes you have registered as a user and been granted free trial access.

Collect Spark Event Logs

Obtaining the Spark History Server event logs requires that cluster log delivery be enabled. If it is not already turned on, you will need to update your job/cluster settings to enable it.

There are two ways to do so, as described in the Databricks documentation. See the Databricks guidance for AWS, Azure, and GCP.

  1. Through the console, by setting the Cluster Log Path under Advanced Options to either a DBFS location or a destination in the platform's cloud storage (AWS S3, Google Cloud Storage, or Azure Blob Storage).

  2. By adding a cluster_log_conf entry to the new-cluster section of the cluster configuration JSON used to create clusters. For DBFS, it would look something like:

"cluster_log_conf": {
    "dbfs": {
        "destination": "dbfs:/cluster-logs"
    }
}

For S3, it may look like:

"cluster_log_conf": {
    "s3": {
        "destination": "s3://cluster-logs"
    }
}

For Azure Blob Storage, it may look like:

"cluster_log_conf": {
    "wasb": {
        "destination": "wasbs://cluster-logs"
    }
}

For Google Cloud Storage, it may look like:

"cluster_log_conf": {
    "gs": {
        "destination": "gs://cluster-logs"
    }
}
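As a fuller sketch, the snippet below shows where cluster_log_conf sits inside a complete new-cluster spec. The cluster name, Spark version, node type, and worker count are illustrative placeholders only, not recommendations.

```json
{
    "cluster_name": "autotuner-example-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "cluster_log_conf": {
        "dbfs": {
            "destination": "dbfs:/cluster-logs"
        }
    }
}
```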

For example, the relevant event log file(s) can be found at a path like the one below. The parts in {} vary from job to job, and there may be more than one event log file associated with a single job run.

dbfs:/cluster-logs/{cluster_job_identifier}/eventlog/{another_cluster_job_identifier}/{numeric_identifier}/
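If you copy the cluster-logs directory down to your workstation (for example with the Databricks CLI), a short Python sketch like the one below can list every event log file in the local copy, rollovers included. It assumes the local directory mirrors the layout above; find_event_logs is a hypothetical helper, not part of any Databricks tooling.

```python
from pathlib import Path

def find_event_logs(local_logs_root):
    """Walk a local copy of the cluster-logs directory and return
    every file that sits under an 'eventlog' subdirectory
    (rollover files included), sorted by path."""
    root = Path(local_logs_root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and "eventlog" in p.parts
    )
```

For example, `find_event_logs("cluster-logs")` would return the event log files under every `eventlog/` subdirectory while skipping driver and executor logs stored alongside them.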

Keep in mind that when storing to cloud storage, the destination must be in the same region as the cluster, and the Databricks IAM role must be granted the appropriate permissions on the destination.

Receiving a Set of Predictions

  1. Select the Databricks on AWS tile from the "Start" tab. This will bring you to the screen where you can provide input data to the Autotuner.

  2. Select your AWS Compute Type. If you aren't sure, you can review the Databricks pricing page.

  3. Upload your event log file. Event logs under 2 GB are supported, either compressed (.zip, .gz, or .tar.gz) or uncompressed (.json, .log, or no extension). If you have multiple rollover log files, please compress them into a single .zip file for upload (the .zip may contain multiple .gz files).

  4. Once the compute type is added and the log file is uploaded, the Autotuner will begin creating cost and runtime predictions across various Spark configurations and AWS infrastructure. If you like, you can close your browser at this point; you will receive an email when the predictions are complete. If you receive an error at this stage, please email [email protected] for assistance.

  5. When processing is complete, you will be forwarded to your Autotuner prediction results. To quickly select a new configuration for your job, pick one of the Performance, Balanced, or Economy options. To pick a custom configuration instead, note that each dot on the graph is a possible configuration with an associated cost and runtime, which you can compare against your current configuration, represented by the black dot on the graph.

  6. Once you have selected your preferred configuration, scroll down to see the configuration details. You can copy them to your clipboard and import them into Terraform, or your tool of choice, to update your Spark and cluster configuration and re-run your job.

  7. You can always access your previous predictions on the history page.

  8. We would appreciate any and all feedback, in particular new log files after you re-run your job with one of our predictions. Please feel free to email us at [email protected].
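If your job produced multiple rollover event logs, the upload step above asks that they be compressed into a single .zip. A minimal sketch using Python's standard zipfile module (bundle_event_logs and the file names shown are hypothetical):

```python
import zipfile
from pathlib import Path

def bundle_event_logs(log_files, out_path="eventlogs.zip"):
    """Compress a set of rollover event log files into one .zip
    suitable for a single upload to the Autotuner."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in log_files:
            # Store each file by name only, flattening any directory prefix.
            zf.write(f, arcname=Path(f).name)
    return out_path
```

For example, `bundle_event_logs(["eventlog", "eventlog-rollover-1.gz"])` would produce a single `eventlogs.zip` containing both files, including already-compressed .gz rollovers.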