### What changes were proposed in this pull request?
Adds a separate PySpark package that uses Spark Connect by default.
- Rename `pyspark-connect` to `pyspark-client`
- Add a new `pyspark-connect` package that depends on `pyspark` but uses Spark Connect by default
The new `pyspark-connect` will install a marker package `pyspark_connect`.
If `pyspark` finds the `pyspark_connect` package, the default API mode will be "connect"; otherwise "classic".
```py
import os


def spark_connect_mode() -> str:
    """
    Return the env var SPARK_CONNECT_MODE if set; otherwise "1" if
    `pyspark_connect` is available, or "0" if not.
    """
    connect_by_default = os.environ.get("SPARK_CONNECT_MODE")
    if connect_by_default is not None:
        return connect_by_default
    try:
        import pyspark_connect  # noqa: F401

        return "1"
    except ImportError:
        return "0"
```
Note that even with only `pyspark` installed, Spark Connect can be made the default by setting the env var `SPARK_CONNECT_MODE` to `"1"`, which is useful for testing.
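For illustration, a self-contained sketch of how the env var interacts with marker-package detection. This variant uses `importlib.util.find_spec` instead of a bare import so it can be run as-is; it is not the PR's exact code:

```python
import importlib.util
import os


def spark_connect_mode() -> str:
    # The env var takes precedence; otherwise fall back to checking
    # whether the `pyspark_connect` marker package is installed.
    override = os.environ.get("SPARK_CONNECT_MODE")
    if override is not None:
        return override
    return "1" if importlib.util.find_spec("pyspark_connect") is not None else "0"


# A user can force classic mode even if `pyspark-connect` is installed:
os.environ["SPARK_CONNECT_MODE"] = "0"
print(spark_connect_mode())  # prints "0": the env var wins

# Without the env var, the result depends only on whether the
# `pyspark_connect` marker package is importable.
del os.environ["SPARK_CONNECT_MODE"]
print(spark_connect_mode())
```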
### Why are the changes needed?
As discussed on the dev list, we will provide an additional package that uses Spark Connect by default.
### Does this PR introduce _any_ user-facing change?
Yes, there will be three packages for PySpark.
- `pyspark`: contains the whole package to run PySpark, Spark Classic by default
- `pyspark-connect`: depends on `pyspark`, Spark Connect by default
- `pyspark-client`: only contains Python files to work as a Spark Connect client
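For illustration, a marker distribution like `pyspark-connect` could be published with a minimal `setup.py` along these lines. The package names follow the PR description, but the version pin and metadata here are assumptions, not the PR's actual build files:

```python
# Hypothetical setup.py for the `pyspark-connect` marker distribution.
# It ships only the (empty) `pyspark_connect` marker package and pulls
# in the full `pyspark` distribution as a dependency.
from setuptools import setup

setup(
    name="pyspark-connect",
    version="4.1.0.dev0",          # assumption: matches the pyspark version
    packages=["pyspark_connect"],  # the marker package pyspark looks for
    install_requires=["pyspark==4.1.0.dev0"],
)
```

Installing this package then flips the default API mode to "connect" via the detection logic shown above, without duplicating any of PySpark's code.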
### How was this patch tested?
Manually.
```
./dev/make-distribution.sh --pip
```
In a clean Python env:
```
$ pip install /path/to/pyspark-4.1.0.dev0.tar.gz
$ pyspark
...
>>> spark
<pyspark.sql.session.SparkSession object at 0x10810c2b0>
>>> quit()
$ pyspark --remote local
...
>>> spark
<pyspark.sql.connect.session.SparkSession object at 0x11033afb0>
>>> quit()
$ pip install /path/to/pyspark-connect-4.1.0.dev0.tar.gz
$ pyspark
...
>>> spark
<pyspark.sql.connect.session.SparkSession object at 0x101a1eb90>
>>> quit()
$ pyspark --master local
...
>>> spark
<pyspark.sql.connect.session.SparkSession object at 0x1022b2b90>
>>> quit()
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #49946 from ueshin/issues/SPARK-51212/packaging.
Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f798e7ab01)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>