# PySpark Benchmarks

This directory contains microbenchmarks for PySpark using [ASV (Airspeed Velocity)](https://asv.readthedocs.io/).

## Prerequisites

Install ASV:

```bash
pip install asv
```

Running benchmarks in isolated environments (that is, without `--python=same`) requires an environment manager.
The default configuration uses `virtualenv`; ASV also supports others such as `conda`, `mamba`, and `uv`. See the [ASV documentation](https://asv.readthedocs.io/) for the full list.

## Running Benchmarks

All commands below can be run from the Spark root directory using `./python/asv`,
which is a wrapper that forwards arguments to `asv` in the benchmarks directory.

### Quick run (current environment)

Run benchmarks using your current Python environment (fastest for development):

```bash
./python/asv run --python=same --quick
```

Use `-b` to run only the benchmarks whose names match a regular expression, for example a single benchmark class:

```bash
./python/asv run --python=same --quick -b 'bench_arrow.LongArrowToPandasBenchmark'
```
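
The name filtering can be pictured with a short sketch. It assumes search-style regular-expression matching against names of the form `module.Class.method`; the method names below are hypothetical, and this is an illustration rather than asv's actual implementation.

```python
import re

# Hypothetical benchmark names in "module.Class.method" form.
names = [
    "bench_arrow.LongArrowToPandasBenchmark.time_convert",
    "bench_arrow.LongArrowToPandasBenchmark.peakmem_convert",
    "bench_arrow.OtherBenchmark.time_convert",
]


def select(pattern, names):
    """Return the names that contain a match for the regular expression."""
    rx = re.compile(pattern)
    return [n for n in names if rx.search(n)]


# Selects both methods of the class.
selected = select("bench_arrow.LongArrowToPandasBenchmark", names)

# A narrower pattern selects only the timing method of that class.
time_only = select(r"LongArrowToPandasBenchmark\.time_", names)
```

Because the pattern is a regular expression, an unescaped `.` matches any character, which is usually harmless when selecting by class name.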

### Full run against a commit

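Run benchmarks for a given commit in an isolated virtualenv (asv builds PySpark from source). The `^!` suffix is Git revision syntax that selects only the named commit, not its ancestors: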

```bash
./python/asv run master^!          # Run on latest master commit
./python/asv run v3.5.0^!          # Run on a specific tag
./python/asv run abc123^!          # Run on a specific commit
```

### Compare two commits

Compare the current branch (`HEAD`) against `upstream/master`. The `-f 1.1` factor reports only benchmarks whose results differ by more than 10%:

```bash
./python/asv continuous -f 1.1 upstream/master HEAD
```
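
As a rough illustration of what a factor of 1.1 means, the sketch below flags a result when the ratio of the two timings leaves the band between `1 / factor` and `factor`. The function name `is_flagged` is hypothetical, and asv may apply additional statistical checks before reporting a change, so treat this as the arithmetic only.

```python
def is_flagged(before, after, factor=1.1):
    """Return True if the timing changed by more than `factor` in either direction."""
    ratio = after / before
    return ratio > factor or ratio < 1 / factor


# A 5% slowdown stays inside the band; a 20% change in either direction does not.
small_change = is_flagged(1.00, 1.05)
slowdown = is_flagged(1.00, 1.20)
speedup = is_flagged(1.00, 0.80)
```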

### Other useful commands

```bash
./python/asv check          # Import all benchmarks and check them for validity
```

## Writing Benchmarks

Benchmarks are Python classes whose methods use one of these prefixes:
- `time_*` - Measure execution time
- `peakmem_*` - Measure peak memory usage of the process
- `mem_*` - Measure the memory size of the object the method returns

Example:

```python
class MyBenchmark:
    # Each benchmark runs once per combination of these parameter values.
    params = [[1000, 10000], ["option1", "option2"]]
    param_names = ["n_rows", "option"]

    def setup(self, n_rows, option):
        # Called before each benchmark method; its run time is not measured.
        # create_test_data() and process() are placeholders for your own code.
        self.data = create_test_data(n_rows, option)

    def time_my_operation(self, n_rows, option):
        # Benchmark timing
        process(self.data)

    def peakmem_my_operation(self, n_rows, option):
        # Benchmark peak memory
        process(self.data)
```
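
The example above relies on helpers you would supply yourself. For a version that runs as-is, here is a minimal sketch using only the standard library; the class name `SortBenchmark` and its parameters are made up for illustration, and it also shows a `mem_*` method.

```python
import random


class SortBenchmark:
    # One benchmark is generated per combination of parameter values.
    params = [[1000, 10000], ["list", "tuple"]]
    param_names = ["n_rows", "container"]

    def setup(self, n_rows, container):
        # Fixed seed so every run sorts the same data.
        rng = random.Random(0)
        values = [rng.random() for _ in range(n_rows)]
        self.data = tuple(values) if container == "tuple" else values

    def time_sorted(self, n_rows, container):
        # Timed: only the body of this method.
        sorted(self.data)

    def peakmem_sorted(self, n_rows, container):
        # Peak memory of the process while sorting.
        sorted(self.data)

    def mem_sorted(self, n_rows, container):
        # The size of the returned object is what gets recorded.
        return sorted(self.data)
```

Keeping data generation in `setup` keeps it out of the timing measurement, so `time_sorted` reflects the sort alone.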

See [ASV documentation](https://asv.readthedocs.io/en/stable/writing_benchmarks.html) for more details.