# Apache Spark - A unified analytics engine for large-scale data processing

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

[tool.ruff]
exclude = [
    "*/target/*",
    "**/*.ipynb",
    "docs/.local_ruby_bundle/",
    "*python/pyspark/cloudpickle/*.py",
    "*python/pyspark/ml/deepspeed/tests/*.py",
    "*python/docs/build/*",
    "*python/docs/source/conf.py",
    "*python/.eggs/*",
    "dist/*",
    ".git/*",
    "*python/pyspark/sql/pandas/functions.pyi",
    "*python/pyspark/sql/column.pyi",
    "*python/pyspark/worker.pyi",
    "*python/pyspark/java_gateway.pyi",
    "*python/pyspark/sql/connect/proto/*",
    "*python/pyspark/sql/streaming/proto/*",
    "*venv*/*",
]
line-length = 100

[tool.ruff.lint]
extend-select = [
    "G010", # logging-warn
    # ambiguous unicode characters
    "RUF001", # string
    "RUF002", # docstring
    "RUF003", # comment
    "RUF100", # unused-noqa
]
ignore = [
    "E402", # module-level import not at top of file; needed for optional import checks, etc.
    # TODO
    "E721", # use isinstance() for type comparisons; too many violations for now
    "E741", # ambiguous variable names such as l, I, or O
]

[tool.ruff.lint.per-file-ignores]
# E501 is ignored as shared.py is auto-generated.
"python/pyspark/ml/param/shared.py" = ["E501"]
# E501 is ignored to keep the JSON string format in error_classes.py.
"python/pyspark/errors/error_classes.py" = ["E501"]
# Examples contain some unused variables.
"examples/src/main/python/sql/datasource.py" = ["F841"]

[tool.black]
# When changing the version, we also have to update
# the GitHub workflow version and dev/reformat-python
required-version = "23.12.1"
line-length = 100
target-version = ['py39']
include = '\.pyi?$'
extend-exclude = 'cloudpickle|error_classes.py'