[SPARK-36989][TESTS][PYTHON] Add type hints data tests
### What changes were proposed in this pull request?
This PR:
- Adds basic data test runner to `dev/lint-python`, using [`typeddjango/pytest-mypy-plugins`](https://github.com/typeddjango/pytest-mypy-plugins)
- Migrates data test cases from `pyspark-stubs`
In case of failure, a message similar to the following one
```
starting mypy annotations test...
annotations passed mypy checks.
starting mypy data test...
annotations failed data checks:
============================= test session starts ==============================
platform linux -- Python 3.9.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /path/to/spark/python, configfile: pyproject.toml
plugins: mypy-plugins-1.9.2
collected 37 items
python/pyspark/ml/tests/typing/test_classification.yml .. [ 5%]
python/pyspark/ml/tests/typing/test_evaluation.yml . [ 8%]
python/pyspark/ml/tests/typing/test_feature.yml . [ 10%]
python/pyspark/ml/tests/typing/test_param.yml . [ 13%]
python/pyspark/ml/tests/typing/test_readable.yml . [ 16%]
python/pyspark/ml/tests/typing/test_regression.yml .. [ 21%]
python/pyspark/sql/tests/typing/test_column.yml F [ 24%]
python/pyspark/sql/tests/typing/test_dataframe.yml ....... [ 43%]
python/pyspark/sql/tests/typing/test_functions.yml . [ 45%]
python/pyspark/sql/tests/typing/test_pandas_compatibility.yml .. [ 51%]
python/pyspark/sql/tests/typing/test_readwriter.yml .. [ 56%]
python/pyspark/sql/tests/typing/test_session.yml ..... [ 70%]
python/pyspark/sql/tests/typing/test_udf.yml ....... [ 89%]
python/pyspark/tests/typing/test_context.yml . [ 91%]
python/pyspark/tests/typing/test_core.yml . [ 94%]
python/pyspark/tests/typing/test_rdd.yml . [ 97%]
python/pyspark/tests/typing/test_resultiterable.yml . [100%]
=================================== FAILURES ===================================
______________________________ colDateTimeCompare ______________________________
/path/to/spark/python/pyspark/sql/tests/typing/test_column.yml:39:
E pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E Actual:
E main:20: note: Revealed type is "pyspark.sql.column.Column" (diff)
E Expected:
E main:20: note: Revealed type is "datetime.date*" (diff)
E Alignment of first line difference:
E E: ...ote: Revealed type is "datetime.date*"
E A: ...ote: Revealed type is "pyspark.sql.column.Column"
E ^
=========================== short test summary info ============================
FAILED python/pyspark/sql/tests/typing/test_column.yml::colDateTimeCompare -
======================== 1 failed, 36 passed in 56.13s =========================
```
will be displayed.
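For context, each migrated data test is a pytest-mypy-plugins YAML case that pairs a code snippet with the mypy output it must produce. A minimal sketch of the format (the case name and snippet are illustrative, not taken from this PR):

```yaml
# Hypothetical case in the pytest-mypy-plugins format; the `out` block
# must match mypy's output for the `main` snippet exactly.
- case: columnComparisonRevealsColumn
  main: |
    import datetime
    from pyspark.sql.functions import col

    reveal_type(col("created_at") > datetime.date(2021, 1, 1))
  out: |
    main:4: note: Revealed type is "pyspark.sql.column.Column"
```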
### Why are the changes needed?
Currently, type annotations are tested primarily for integrity and, to a lesser extent, against the actual API. Testing against examples is a work in progress (SPARK-36997). Data tests allow us to improve coverage and to test negative cases (code that should fail type checker validation).
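A negative case, in the same YAML format, passes only when mypy rejects the snippet with the expected diagnostic; a sketch (illustrative case name):

```yaml
# Hypothetical negative case: the test fails if mypy *accepts* the
# snippet or reports a different message.
- case: incompatibleAssignment
  main: |
    x: int = "not an int"
  out: |
    main:1: error: Incompatible types in assignment (expression has type "str", variable has type "int")
```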
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Running the linter tests with the additions proposed in this PR.
Closes #34296 from zero323/SPARK-36989.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
2021-10-26 10:32:34 +02:00
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

[tool.ruff]
exclude = [
    "*/target/*",
    "**/*.ipynb",
    "docs/.local_ruby_bundle/",
    "*python/pyspark/cloudpickle/*.py",
    "*python/pyspark/ml/deepspeed/tests/*.py",
    "*python/docs/build/*",
    "*python/docs/source/conf.py",
    "*python/.eggs/*",
    "dist/*",
    ".git/*",
    "*python/pyspark/sql/pandas/functions.pyi",
    "*python/pyspark/sql/column.pyi",
    "*python/pyspark/worker.pyi",
    "*python/pyspark/java_gateway.pyi",
    "*python/pyspark/sql/connect/proto/*",
    "*python/pyspark/sql/streaming/proto/*",
    "*venv*/*",
]
line-length = 100

[tool.ruff.lint]
extend-select = [
    "G010", # logging-warn
    # ambiguous unicode character
    "RUF001", # string
    "RUF002", # docstring
    "RUF003", # comment
    # ambiguous unicode character end
    "RUF100", # unused-noqa
]
ignore = [
    "E402", # Module top level import is disabled for optional import check, etc.
    # TODO
    "E721", # Use isinstance for type comparison, too many for now.
    "E741", # Ambiguous variables like l, I or O.
]

[tool.ruff.lint.per-file-ignores]
# E501 is ignored as shared.py is auto-generated.
"python/pyspark/ml/param/shared.py" = ["E501"]
# E501 is ignored as we should keep the json string format in error_classes.py.
"python/pyspark/errors/error_classes.py" = ["E501"]
# Examples contain some unused variables.
"examples/src/main/python/sql/datasource.py" = ["F841"]

[tool.black]
# When changing the version, we have to update
# GitHub workflow version and dev/reformat-python
required-version = "26.3.1"
line-length = 100
target-version = ['py39']
include = '\.pyi?$'
[SPARK-41586][PYTHON] Introduce `pyspark.errors` and error classes for PySpark
### What changes were proposed in this pull request?
This PR proposes to introduce `pyspark.errors` and error classes, unifying and improving the errors generated by PySpark under a single path.
To summarize, this PR includes the changes below:
- Add `python/pyspark/errors/error_classes.py` to support error class for PySpark.
- Add `ErrorClassesReader` to manage `error_classes.py`.
- Add `PySparkException` to handle the errors generated by PySpark.
- Add `check_error` for error class testing.
This is an initial PR introducing an error framework for PySpark, to facilitate error management and provide better, more consistent error messages to users.
While active work is being done on the [SQL side to improve error messages](https://issues.apache.org/jira/browse/SPARK-37935), so far there has been no corresponding work for PySpark.
So, I'd expect this PR to also initiate the error message improvement effort on the PySpark side.
Eventually, error messages will be shown as below, for example:
- PySpark, `PySparkException` (thrown by Python driver):
```python
>>> from pyspark.sql.functions import lit
>>> lit([df.id, df.id])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../spark/python/pyspark/sql/utils.py", line 334, in wrapped
return f(*args, **kwargs)
File ".../spark/python/pyspark/sql/functions.py", line 176, in lit
raise PySparkException(
pyspark.errors.exceptions.PySparkException: [COLUMN_IN_LIST] lit does not allow a column in a list.
```
- PySpark, `AnalysisException` (thrown on the JVM side and captured on the PySpark side):
```
>>> df.unpivot("id", [], "var", "val").collect()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../spark/python/pyspark/sql/dataframe.py", line 3296, in unpivot
jdf = self._jdf.unpivotWithSeq(jids, jvals, variableColumnName, valueColumnName)
File ".../spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File ".../spark/python/pyspark/sql/utils.py", line 209, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: [UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one value column needs to be specified for UNPIVOT, all columns specified as ids;
'Unpivot ArraySeq(id#2L), ArraySeq(), var, [val]
+- LogicalRDD [id#2L, int#3L, double#4, str#5], false
```
- Spark, `AnalysisException`:
```scala
scala> df.select($"id").unpivot(Array($"id"), Array.empty,variableColumnName = "var", valueColumnName = "val")
org.apache.spark.sql.AnalysisException: [UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one value column needs to be specified for UNPIVOT, all columns specified as ids;
'Unpivot ArraySeq(id#0L), ArraySeq(), var, [val]
+- Project [id#0L]
+- Range (0, 10, step=1, splits=Some(16))
```
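For illustration, the error-class idea (a unique class name plus a message template resolved from a registry) can be sketched in plain Python. The registry contents and the exception signature below are assumptions for the sketch, not the actual `pyspark.errors` API:

```python
# Sketch of the error-class pattern; ERROR_CLASSES and the exception
# signature are illustrative, not the actual pyspark.errors API.
ERROR_CLASSES = {
    "COLUMN_IN_LIST": {
        "message": ["{func_name} does not allow a Column in a list."],
    },
}


class PySparkExceptionSketch(Exception):
    def __init__(self, error_class: str, message_parameters: dict) -> None:
        # Resolve the message template by class name, then fill in parameters.
        template = " ".join(ERROR_CLASSES[error_class]["message"])
        self.error_class = error_class
        self.message_parameters = message_parameters
        super().__init__(f"[{error_class}] {template.format(**message_parameters)}")


print(PySparkExceptionSketch("COLUMN_IN_LIST", {"func_name": "lit"}))
# [COLUMN_IN_LIST] lit does not allow a Column in a list.
```

Because every message is prefixed with its class name, errors stay searchable even after the wording (or its translation) changes.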
**Next steps** for this PR include:
- Migrate more errors into `PySparkException` across all modules (e.g., Spark Connect, pandas API on Spark...).
- Migrate more error tests into error class tests by using `check_error`.
- Define more error classes onto `error_classes.py`.
- Add documentation.
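The `check_error` idea can be sketched as a test helper that asserts on the structured error class and parameters rather than on raw message text; the attribute names here are assumptions for the sketch, not the actual helper's signature:

```python
# Sketch of a check_error-style test helper; the error_class and
# message_parameters attributes are illustrative assumptions.
class SketchError(Exception):
    def __init__(self, error_class: str, message_parameters: dict) -> None:
        self.error_class = error_class
        self.message_parameters = message_parameters
        super().__init__(f"[{error_class}]")


def check_error_sketch(exception, error_class, message_parameters):
    # Compare structured fields instead of full message strings, so tests
    # stay stable when the message wording changes.
    assert exception.error_class == error_class
    assert exception.message_parameters == message_parameters


try:
    raise SketchError("COLUMN_IN_LIST", {"func_name": "lit"})
except SketchError as e:
    check_error_sketch(e, "COLUMN_IN_LIST", {"func_name": "lit"})
```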
### Why are the changes needed?
Centralizing error messages and introducing identifiable error classes provides the following benefits:
- Errors are searchable via the unique class names and properly classified.
- Reduce the cost of future maintenance for PySpark errors.
- Provide consistent & actionable error messages to users.
- Facilitates translating error messages into different languages.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Adding UTs & running the existing static analysis tools (`dev/lint-python`)
Closes #39387 from itholic/SPARK-41586.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2023-01-16 10:22:43 +09:00
extend-exclude = 'cloudpickle|error_classes.py'