# Apache Spark - A unified analytics engine for large-scale data processing

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

[tool.ruff]
exclude = [
    "*/target/*",
    "**/*.ipynb",
    "docs/.local_ruby_bundle/",
    "*python/pyspark/cloudpickle/*.py",
    "*python/pyspark/ml/deepspeed/tests/*.py",
    "*python/docs/build/*",
    "*python/docs/source/conf.py",
    "*python/.eggs/*",
    "dist/*",
    ".git/*",
    "*python/pyspark/sql/pandas/functions.pyi",
    "*python/pyspark/sql/column.pyi",
    "*python/pyspark/worker.pyi",
    "*python/pyspark/java_gateway.pyi",
    "*python/pyspark/sql/connect/proto/*",
    "*python/pyspark/sql/streaming/proto/*",
    "*venv*/*",
]
line-length = 100

[tool.ruff.lint]
extend-select = [
    "G010", # logging-warn
    # ambiguous unicode characters
    "RUF001", # string
    "RUF002", # docstring
    "RUF003", # comment
    "RUF100", # unused-noqa
]
ignore = [
    "E402", # module-level import not at top of file; needed for optional import checks, etc.
    # TODO
    "E721", # use isinstance() for type comparisons; too many violations for now
    "E741", # ambiguous variable names such as l, I, or O
]

[tool.ruff.lint.per-file-ignores]
# E501 is ignored as shared.py is auto-generated.
"python/pyspark/ml/param/shared.py" = ["E501"]
# E501 is ignored to keep the JSON string format in error_classes.py.
"python/pyspark/errors/error_classes.py" = ["E501"]
# Examples contain some unused variables.
"examples/src/main/python/sql/datasource.py" = ["F841"]

[tool.black]
# When changing the version, we also have to update
# the GitHub workflow version and dev/reformat-python
required-version = "23.12.1"
line-length = 100
target-version = ['py39']
include = '\.pyi?$'
extend-exclude = 'cloudpickle|error_classes.py'