
Spark Optimization Playbook

$69

Hands-on guide and toolkit for optimizing PySpark workloads on Databricks. Diagnostic scripts, 25 optimization patterns, cluster sizing calculator, and benchmarking framework.

📁 13 files · 🏷 v1.0.0

Python · Markdown · JSON · AWS · Azure · GCP · Databricks · PySpark · Spark · Redis

📁 File Structure (13 files)

```
spark-optimization-playbook/
├── README.md
├── guides/
│   ├── aqe_tuning_guide.md
│   ├── cluster_sizing_guide.md
│   ├── memory_management.md
│   ├── optimization_patterns.md
│   ├── photon_optimization.md
│   └── spark_ui_guide.md
├── notebooks/
│   ├── benchmarking_framework.py
│   ├── cost_per_query_estimator.py
│   └── spark_diagnostic.py
└── tools/
    ├── cluster_calculator.py
    └── partition_decision_tree.py
```

📖 Documentation Preview (README excerpt)

Spark Optimization Playbook

By [Datanest Digital](https://datanest.dev) | Version 1.0.0 | $69

A comprehensive, battle-tested collection of Spark performance optimization patterns, diagnostic notebooks, and sizing tools for Databricks engineers. Stop guessing why your jobs are slow — diagnose, measure, and fix with systematic approaches.

---

What's Included

Diagnostic & Benchmarking Notebooks

| File | Description |
|------|-------------|
| `notebooks/spark_diagnostic.py` | Auto-detects common Spark performance issues: data skew, disk spill, small files, excessive shuffle, suboptimal joins, and more. Run it against any job to get an instant health report. |
| `notebooks/benchmarking_framework.py` | Measure optimization impact with structured before/after comparisons. Captures runtime, shuffle bytes, spill metrics, and stage-level breakdowns. |
| `notebooks/cost_per_query_estimator.py` | Estimates DBU cost per query based on your cluster configuration, runtime, and Databricks pricing tier. |
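At its core, a DBU cost estimate like the one the estimator notebook produces is simple rate arithmetic: runtime × cluster DBU rate × price per DBU. A minimal sketch of that calculation (the function name and the example rates are illustrative assumptions, not taken from the toolkit):

```python
def estimate_query_cost(runtime_minutes: float,
                        dbu_per_hour: float,
                        usd_per_dbu: float) -> float:
    """Estimate the dollar cost of one query run.

    dbu_per_hour: total DBU consumption rate of the cluster
                  (driver plus all workers for the chosen instance types).
    usd_per_dbu:  the rate for your Databricks pricing tier.
    """
    hours = runtime_minutes / 60.0
    return hours * dbu_per_hour * usd_per_dbu

# Example: a 12-minute job on a cluster consuming 20 DBU/hr at $0.55/DBU
cost = estimate_query_cost(12, 20, 0.55)
print(f"${cost:.2f}")  # → $2.20
```

The real notebook accounts for cluster configuration and pricing tier; this sketch only shows the shape of the arithmetic.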

Optimization Guides

| File | Description |
|------|-------------|
| `guides/optimization_patterns.md` | 25 optimization patterns with before/after PySpark code, explanations, and expected impact. |
| `guides/cluster_sizing_guide.md` | Maps data volumes to cluster configurations with a decision tree for picking instance types, worker counts, and autoscaling ranges. |
| `guides/aqe_tuning_guide.md` | Deep dive into Adaptive Query Execution tuning — coalescing, skew handling, join strategy switching. |
| `guides/photon_optimization.md` | Photon runtime optimization guide — what benefits from Photon, what doesn't, and how to structure code to maximize the C++ engine. |
| `guides/memory_management.md` | Broadcast joins, cache strategies, spill prevention, and memory fraction tuning. |
| `guides/spark_ui_guide.md` | How to read the Spark UI — stages, tasks, shuffle, GC, skew indicators, and what to look for in the SQL tab. |
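For orientation, the AQE guide revolves around a handful of stock Spark settings. The keys below are standard Spark 3.x configuration names; the values shown are illustrative placeholders, not the guide's actual recommendations:

```python
# Illustrative AQE settings (standard Spark 3.x keys; values are placeholders,
# not this guide's recommendations — tune per your workload).
aqe_conf = {
    "spark.sql.adaptive.enabled": "true",                       # master switch for AQE
    "spark.sql.adaptive.coalescePartitions.enabled": "true",    # merge tiny shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",              # split skewed join partitions
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "128m",  # post-shuffle target size
}

def apply_conf(spark, conf: dict) -> None:
    """Apply a dict of settings to an existing SparkSession."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

In a notebook you would call `apply_conf(spark, aqe_conf)` against the session attached to your cluster, then re-run the workload and compare.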

CLI Tools

| File | Description |
|------|-------------|
| `tools/partition_decision_tree.py` | Recommends partitioning strategy (column selection, partition count, file size targets) based on data characteristics you provide. |
| `tools/cluster_calculator.py` | Calculates optimal cluster configuration given your data volume, job type, concurrency needs, and budget constraints. |
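A common rule of thumb behind partition sizing tools is to work backward from a target output file size. A minimal sketch of that heuristic (the function name and the 128 MB default are assumptions for illustration, not the tool's actual logic):

```python
import math

def recommend_partition_count(total_size_gb: float,
                              target_file_mb: int = 128,
                              min_partitions: int = 1) -> int:
    """Partition count that yields roughly target_file_mb per output file."""
    total_mb = total_size_gb * 1024
    return max(min_partitions, math.ceil(total_mb / target_file_mb))

# 500 GB at a 128 MB target file size
print(recommend_partition_count(500))  # → 4000
```

The shipped tool also weighs column selection and data characteristics; this sketch covers only the count-from-size arithmetic.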

---

Getting Started

Prerequisites

  • Databricks workspace (any cloud: AWS, Azure, GCP)
  • Databricks Runtime 12.2 LTS or later recommended
  • Python 3.9+

Using the Notebooks

1. Import the `notebooks/` folder into your Databricks workspace.

2. Start with `spark_diagnostic.py` — attach it to a cluster running your workload and execute all cells.

3. Review the diagnostic output, then consult the matching guide in `guides/` for remediation steps.

4. Use `benchmarking_framework.py` to measure the before/after impact of any change.

5. Use `cost_per_query_estimator.py` to quantify the dollar impact of optimizations.
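To give a flavor of the checks the diagnostic notebook runs, here is a hedged sketch of one classic skew test: compare the slowest task in a stage to the median. The function name and the 5× threshold are illustrative assumptions, not the notebook's actual code:

```python
from statistics import median

def looks_skewed(task_durations_ms: list, ratio_threshold: float = 5.0) -> bool:
    """Flag a stage as skewed if its slowest task takes far longer than the median."""
    if len(task_durations_ms) < 2:
        return False
    med = median(task_durations_ms)
    return med > 0 and max(task_durations_ms) / med >= ratio_threshold

# One straggler task dominating an otherwise uniform stage:
print(looks_skewed([100, 110, 95, 105, 900]))  # → True
print(looks_skewed([100, 110, 95, 105, 120]))  # → False
```

In practice the notebook reads these durations from Spark's own stage metrics rather than taking a hand-built list.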

Using the CLI Tools


```bash
# Partition strategy recommendation
python tools/partition_decision_tree.py \
  --total-size-gb 500 \
```

*... continues with setup instructions, usage examples, and more.*

📄 Code Sample (`.py` preview)

`notebooks/benchmarking_framework.py`

```python
# Databricks notebook source
# MAGIC %md
# MAGIC # Benchmarking Framework
# MAGIC **Datanest Digital — Spark Optimization Playbook**
# MAGIC
# MAGIC Measure the impact of Spark optimizations with structured before/after comparisons.
# MAGIC Captures runtime, shuffle bytes, spill, stage count, and task-level metrics.
# MAGIC
# MAGIC **Workflow:**
# MAGIC 1. Register a "before" run using `benchmark_query()`
# MAGIC 2. Apply your optimization
# MAGIC 3. Register an "after" run
# MAGIC 4. Call `compare_runs()` to see the improvement

# COMMAND ----------

import json
import time
from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import Callable, Optional

from pyspark.sql import SparkSession, DataFrame

# COMMAND ----------

spark: SparkSession = SparkSession.builder.getOrCreate()

# COMMAND ----------

# MAGIC %md
# MAGIC ## Metrics Collection

# COMMAND ----------

@dataclass
class StageMetrics:
    """Metrics captured for a single Spark stage."""
    stage_id: int
    num_tasks: int
    executor_run_time_ms: int
    shuffle_read_bytes: int
    shuffle_write_bytes: int
    disk_bytes_spilled: int
    memory_bytes_spilled: int
    jvm_gc_time_ms: int
    input_bytes: int

# ... 303 more lines ...
```
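The compare step in the workflow above ultimately reduces to computing deltas between two metric snapshots. A hedged sketch of that comparison (the `RunResult` shape and the percentage logic are illustrative, not the framework's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    """Illustrative stand-in for one benchmarked run's captured metrics."""
    runtime_s: float
    shuffle_bytes: int
    spill_bytes: int

def pct_improvement(before: float, after: float) -> float:
    """Percent reduction from before to after; positive means 'after' is better."""
    return 0.0 if before == 0 else (before - after) / before * 100

before = RunResult(runtime_s=420.0, shuffle_bytes=80_000_000_000, spill_bytes=12_000_000_000)
after = RunResult(runtime_s=180.0, shuffle_bytes=25_000_000_000, spill_bytes=0)

print(f"runtime: {pct_improvement(before.runtime_s, after.runtime_s):.1f}% faster")
# → runtime: 57.1% faster
```

The shipped framework captures these numbers automatically from Spark's stage metrics rather than taking hand-entered values.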