[SPARK-51575][PYTHON] Combine Python Data Source pushdown & plan read workers
Follow up of https://github.com/apache/spark/pull/49961

### What changes were proposed in this pull request?

As pointed out by https://github.com/apache/spark/pull/49961#issuecomment-2705841733, at the time of filter pushdown we already have enough information to also plan read partitions. So this PR changes the filter pushdown worker to also get partitions, reducing the number of exchanges between Python and Scala.

Changes:
- Extract the part of `plan_data_source_read.py` that is responsible for sending the partitions and the read function to the JVM.
- Use the extracted logic to also send the partitions and the read function when doing filter pushdown in `data_source_pushdown_filters.py`.
- Update the Scala code accordingly.

### Why are the changes needed?

To improve Python Data Source performance when the filter pushdown configuration is enabled.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests in `test_python_datasource.py`

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50340 from wengh/pyds-combine-pushdown-plan.

Authored-by: Haoyu Weng <wenghy02@gmail.com>
Signed-off-by: Allison Wang <allison.wang@databricks.com>
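The idea behind the change can be sketched as a single worker step that both applies filter pushdown and plans partitions in one exchange. This is a minimal illustration, not Spark's actual worker code: `SimpleDataSourceReader`, `EqualTo`, and `push_filters_and_plan` are hypothetical names invented here, though the `pushFilters` and `partitions` method shapes mirror the PySpark `DataSourceReader` API.

```python
from dataclasses import dataclass

@dataclass
class EqualTo:
    """Toy filter: column == value."""
    column: str
    value: object

class SimpleDataSourceReader:
    """Toy reader with 4 numbered partitions, prunable via a pushed filter."""

    def __init__(self):
        self.pushed = []

    def pushFilters(self, filters):
        # Accept EqualTo on the "part" column; return the rest as unsupported,
        # so Spark still evaluates them after the scan.
        unsupported = []
        for f in filters:
            if isinstance(f, EqualTo) and f.column == "part":
                self.pushed.append(f)
            else:
                unsupported.append(f)
        return unsupported

    def partitions(self):
        # Partition planning can use the already-pushed filters directly,
        # which is why both steps fit in one worker invocation.
        parts = list(range(4))
        for f in self.pushed:
            parts = [p for p in parts if p == f.value]
        return parts

def push_filters_and_plan(reader, filters):
    # Hypothetical combined step: one round trip returns both the
    # unsupported filters and the planned partitions, instead of a
    # separate pushdown exchange followed by a planning exchange.
    unsupported = reader.pushFilters(filters)
    return unsupported, reader.partitions()

unsupported, parts = push_filters_and_plan(
    SimpleDataSourceReader(), [EqualTo("part", 2), EqualTo("other", 9)])
print(unsupported, parts)  # the "other" filter stays unsupported; only partition 2 remains
```

In the real patch the combined result (remaining filters, serialized partitions, and the pickled read function) is what the pushdown worker sends back to the JVM in one message.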
Haoyu Weng committed
46bd9ccecefd9cc9156623f4c08eb2ebe919e318
Parent: b829aea
Committed by Allison Wang <allison.wang@databricks.com>
on 3/27/2025, 12:38:24 AM