Takuya Ueshin 2887e0bc20 [SPARK-51212][PYTHON] Add a separated PySpark package for Spark Connect by default
### What changes were proposed in this pull request?

Adds a separate PySpark package that uses Spark Connect by default.

- Rename `pyspark-connect` to `pyspark-client`
- Add a new `pyspark-connect` package that depends on `pyspark` but uses Spark Connect by default

The new `pyspark-connect` package installs a marker package, `pyspark_connect`.
If `pyspark` finds the `pyspark_connect` package, the default API mode is "connect"; otherwise it is "classic".

```py
import os


def spark_connect_mode() -> str:
    """
    Return the env var SPARK_CONNECT_MODE if set; otherwise "1" if
    `pyspark_connect` is available, else "0".
    """
    connect_by_default = os.environ.get("SPARK_CONNECT_MODE")
    if connect_by_default is not None:
        return connect_by_default
    try:
        # The marker package installed by `pyspark-connect`.
        import pyspark_connect

        return "1"
    except ImportError:
        return "0"
```

Note that even with only `pyspark` installed, Spark Connect can be made the default by setting the env var `SPARK_CONNECT_MODE` to `"1"`, which is useful for testing.
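As a quick check of the precedence described above (env var first, then the marker package), here is a self-contained sketch of the same detection logic; `spark_connect_mode` is reproduced from the snippet above and run in an environment without the `pyspark_connect` marker package:

```python
import os


def spark_connect_mode() -> str:
    """Env var SPARK_CONNECT_MODE wins; else "1" iff `pyspark_connect` imports."""
    connect_by_default = os.environ.get("SPARK_CONNECT_MODE")
    if connect_by_default is not None:
        return connect_by_default
    try:
        import pyspark_connect  # noqa: F401  (marker package, no real code)

        return "1"
    except ImportError:
        return "0"


# The env var takes precedence over everything else.
os.environ["SPARK_CONNECT_MODE"] = "1"
print(spark_connect_mode())  # 1

# Without the env var, the result depends on whether the marker package
# is installed ("0" in a plain `pyspark`-only environment).
del os.environ["SPARK_CONNECT_MODE"]
print(spark_connect_mode())
```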

### Why are the changes needed?

As discussed on the dev list, this adds an additional package that uses Spark Connect by default.

### Does this PR introduce _any_ user-facing change?

Yes, there will be three packages for PySpark.

- `pyspark`: contains the full package needed to run PySpark; Spark Classic by default
- `pyspark-connect`: depends on `pyspark`; Spark Connect by default
- `pyspark-client`: contains only the Python files needed to work as a Spark Connect client
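For illustration, the marker-package approach can be sketched as a minimal `setup.py`; this is a hypothetical example (the package layout and pinned version are assumptions, not the PR's actual packaging scripts):

```python
# Hypothetical sketch of a marker-package setup.py; Spark's real packaging
# scripts are more involved.
from setuptools import setup

setup(
    name="pyspark-connect",
    version="4.1.0.dev0",          # assumed; would track the pyspark version
    packages=["pyspark_connect"],  # the empty marker package pyspark looks for
    install_requires=["pyspark==4.1.0.dev0"],
)
```

Because `pyspark_connect` ships no real code, installing or uninstalling `pyspark-connect` simply flips the default API mode without touching the `pyspark` installation itself.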

### How was this patch tested?

Manually.

```
./dev/make-distribution.sh --pip
```

In a clean Python env:

```
$ pip install /path/to/pyspark-4.1.0.dev0.tar.gz
$ pyspark
...
>>> spark
<pyspark.sql.session.SparkSession object at 0x10810c2b0>
>>> quit()

$ pyspark --remote local
...
>>> spark
<pyspark.sql.connect.session.SparkSession object at 0x11033afb0>
>>> quit()

$ pip install /path/to/pyspark-connect-4.1.0.dev0.tar.gz
$ pyspark
...
>>> spark
<pyspark.sql.connect.session.SparkSession object at 0x101a1eb90>
>>> quit()

$ pyspark --master local
...
>>> spark
<pyspark.sql.connect.session.SparkSession object at 0x1022b2b90>
>>> quit()
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49946 from ueshin/issues/SPARK-51212/packaging.

Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f798e7ab01)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2025-02-17 10:45:58 +08:00


*.pyc
docs/_build/
pyspark.egg-info
pyspark_client.egg-info
pyspark_connect.egg-info
build/
dist/
./setup.py
./setup.cfg