# Spark Optimization Playbook
Hands-on guide and toolkit for optimizing PySpark workloads on Databricks. Includes diagnostic scripts, 25 optimization patterns, a cluster-sizing calculator, and a benchmarking framework.
By [Datanest Digital](https://datanest.dev) | Version 1.0.0 | $69
A comprehensive, battle-tested collection of Spark performance optimization patterns, diagnostic notebooks, and sizing tools for Databricks engineers. Stop guessing why your jobs are slow — diagnose, measure, and fix with systematic approaches.
---
## What's Included
### Diagnostic & Benchmarking Notebooks
| File | Description |
|------|-------------|
| `notebooks/spark_diagnostic.py` | Auto-detects common Spark performance issues: data skew, disk spill, small files, excessive shuffle, suboptimal joins, and more. Run it against any job to get an instant health report. |
| `notebooks/benchmarking_framework.py` | Measures optimization impact with structured before/after comparisons. Captures runtime, shuffle bytes, spill metrics, and stage-level breakdowns. |
| `notebooks/cost_per_query_estimator.py` | Estimates DBU cost per query based on your cluster configuration, runtime, and Databricks pricing tier. |
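The cost model behind a notebook like `cost_per_query_estimator.py` can be sketched in a few lines of plain Python. The function name, DBU rate, and price below are illustrative assumptions, not the notebook's actual API:

```python
def estimate_query_cost_usd(runtime_s, num_workers, dbu_per_node_hour, dbu_price_usd):
    """Rough DBU cost for one query on a fixed-size cluster.

    Assumes a homogeneous cluster where the driver consumes DBUs at the
    same rate as a worker; real pricing varies by instance type and tier.
    """
    nodes = num_workers + 1            # include the driver node
    hours = runtime_s / 3600.0
    return nodes * dbu_per_node_hour * hours * dbu_price_usd

# Example: a 1-hour query on 7 workers + driver at 2.0 DBU/node-hour, $0.55/DBU
cost = estimate_query_cost_usd(3600, 7, 2.0, 0.55)  # 8 * 2.0 * 1.0 * 0.55 = 8.8
```

Even a rough model like this is enough to rank optimizations by dollar impact rather than by runtime alone.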
### Optimization Guides
| File | Description |
|------|-------------|
| `guides/optimization_patterns.md` | 25 optimization patterns with before/after PySpark code, explanations, and expected impact. |
| `guides/cluster_sizing_guide.md` | Maps data volumes to cluster configurations with a decision tree for picking instance types, worker counts, and autoscaling ranges. |
| `guides/aqe_tuning_guide.md` | Deep dive into Adaptive Query Execution tuning — coalescing, skew handling, join strategy switching. |
| `guides/photon_optimization.md` | Photon runtime optimization guide — what benefits from Photon, what doesn't, and how to structure code to maximize the C++ engine. |
| `guides/memory_management.md` | Broadcast joins, cache strategies, spill prevention, and memory fraction tuning. |
| `guides/spark_ui_guide.md` | How to read the Spark UI — stages, tasks, shuffle, GC, skew indicators, and what to look for in the SQL tab. |
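As a taste of what the AQE guide covers, these are the core Spark SQL conf keys it tunes. The keys are real Spark 3.x settings; the values are illustrative starting points, not recommendations taken from the guide:

```python
# Illustrative AQE configuration (values are starting points, not prescriptions)
aqe_conf = {
    "spark.sql.adaptive.enabled": "true",                        # master switch for AQE
    "spark.sql.adaptive.coalescePartitions.enabled": "true",     # merge small post-shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",               # split skewed join partitions
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "128m",   # target post-shuffle partition size
}

# Applied to a live SparkSession with:
#     for key, value in aqe_conf.items():
#         spark.conf.set(key, value)
```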
### CLI Tools
| File | Description |
|------|-------------|
| `tools/partition_decision_tree.py` | Recommends partitioning strategy (column selection, partition count, file size targets) based on data characteristics you provide. |
| `tools/cluster_calculator.py` | Calculates optimal cluster configuration given your data volume, job type, concurrency needs, and budget constraints. |
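A partition-count recommendation of the kind `partition_decision_tree.py` produces can be approximated with a simple heuristic. This is a sketch under the common 128 MB target-file-size assumption; the actual tool weighs more inputs, such as column cardinality and query patterns:

```python
import math

def recommend_partition_count(total_size_gb, target_file_mb=128):
    """Partition count that lands output files near the target size.

    128 MB is a widely used target for Parquet/Delta files; too many
    small files hurts scan performance, too few limits parallelism.
    """
    total_mb = total_size_gb * 1024
    return max(1, math.ceil(total_mb / target_file_mb))

# Example: 500 GB at a 128 MB target -> 4000 partitions
count = recommend_partition_count(500)
```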
---
## Getting Started
### Prerequisites
- Databricks workspace (any cloud: AWS, Azure, GCP)
- Databricks Runtime 12.2 LTS or later recommended
- Python 3.9+
### Using the Notebooks
1. Import the `notebooks/` folder into your Databricks workspace.
2. Start with `spark_diagnostic.py` — attach it to a cluster running your workload and execute all cells.
3. Review the diagnostic output, then consult the matching guide in `guides/` for remediation steps.
4. Use `benchmarking_framework.py` to measure before/after impact of any change.
5. Use `cost_per_query_estimator.py` to quantify the dollar impact of optimizations.
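The before/after measurement in step 4 boils down to a timing harness like the following. This is a plain-Python sketch, not the framework's actual API; the real notebook also captures shuffle bytes and spill from Spark's metrics:

```python
import time

def benchmark(label, fn, runs=3):
    """Run fn several times and report best and mean wall-clock time.

    Taking the best of several runs reduces noise from cluster warm-up
    and caching; always benchmark the same data on the same cluster size.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {"label": label, "best_s": min(timings), "mean_s": sum(timings) / runs}

# Usage: wrap the action that triggers the job, e.g. a .count() or a write
# before = benchmark("baseline", lambda: df.count())
# after  = benchmark("optimized", lambda: df_optimized.count())
```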
### Using the CLI Tools
```shell
# Partition strategy recommendation
python tools/partition_decision_tree.py \
    --total-size-gb 500 \
```
*... continues with setup instructions, usage examples, and more.*