### Rationale for this change
Move our PyArrow build backend from setuptools and a custom setup.py to scikit-build-core, a build backend for CMake-based projects.
### What changes are included in this PR?
Move from setuptools to scikit-build-core and remove PyArrow setup.py. Update some of the build requirements and minor fixes.
A custom build backend has also been created to wrap scikit-build-core and fix problems with license files in monorepos.
pyproject.toml metadata validation expects the license files to exist before the build backend is exercised, which is why we create symlinks. Our thin build backend then converts those symlinks into hard links so that the license and notice files carry the actual contents and are included in the sdist.
Flags that are no longer used (they were only part of setup.py) have been removed, and the way the equivalent flags have to be used now has been documented and validated.
### Are these changes tested?
Yes, all Python CI tests, wheel builds and sdist builds are successful.
### Are there any user-facing changes?
Yes, users building PyArrow will now require the new build dependencies, and depending on the flags used they might need to follow the newly documented way of passing those flags.
* GitHub Issue: #36411
Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
### Rationale for this change
As discussed on the issue, we don't seem to have run the asv benchmarks on Python for the last several years, and they are probably broken.
### What changes are included in this PR?
Remove asv benchmarking related files and docs.
### Are these changes tested?
No; CI validation and a preview-docs run are used to validate the docs.
### Are there any user-facing changes?
No
* GitHub Issue: #46008
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
### Rationale for this change
When building locally, I get many errors along the lines of
```
Please ensure the files specified are contained by the root
of the Python package (normally marked by `pyproject.toml`).
By 2026-Mar-20, you need to update your project and remove deprecated calls
or your builds will no longer be supported.
See https://packaging.python.org/en/latest/specifications/glob-patterns/ for details.
```
<img width="958" height="755" alt="terminal demo" src="https://github.com/user-attachments/assets/67f0e261-c4d2-403c-b004-688dfaaccda6" />
### What changes are included in this PR?
- Make the licence [SPDX compliant](https://spdx.org/licenses)
- Remove the licence classifier
- Move the licence files from `setup.cfg` to `pyproject.toml`
- Fix the [disallowed glob patterns](https://packaging.python.org/en/latest/specifications/glob-patterns) via symlinks
- Bumped the minimum version of `setuptools` due to macOS CI failures (don't know why this happened, caching maybe?)
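For reference, the SPDX-style metadata looks roughly like this in `pyproject.toml` (a sketch; the exact field values and file list in the repository may differ):

```toml
[project]
name = "pyarrow"
# SPDX license expression replaces the old free-form string
# and the Trove license classifier.
license = "Apache-2.0"
# Plain file names instead of the disallowed glob patterns;
# the symlinks make these resolve inside the package root.
license-files = ["LICENSE.txt", "NOTICE.txt"]
```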
I appreciate the symlink change might prove controversial. See the discussion in https://github.com/apache/arrow/issues/45867, which this fixes.
### Are these changes tested?
When I rebuild locally, I get no errors any more.
### Are there any user-facing changes?
Yes. The minimum required version of `setuptools` is now `77`. However, this is available for Python >= 3.9, so it shouldn't really affect anyone.
* GitHub Issue: #45867
Authored-by: Patrick J. Roddy <patrickjamesroddy@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
### Rationale for this change
Throw a Python exception if a RecordBatch API can't be used because the memory is backed by a non-CPU device.
### What changes are included in this PR?
* Assert the device is CPU for APIs that only support CPU
### Are these changes tested?
Yes, via pytest.
### Are there any user-facing changes?
The user experiences Python exceptions instead of segfaults for unsupported APIs.
* GitHub Issue: #43727
Authored-by: Dane Pitkin <dpitkin@apache.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
### Rationale for this change
We can't maintain Plasma.
### What changes are included in this PR?
These changes remove all Plasma-related code.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* Closes: #33243
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
To follow ARROW-4648 https://github.com/apache/arrow/pull/5069:
> hyphens in executable file names
But this causes backward incompatibility.
Closes #5131 from kou/cpp-plasma-store-server and squashes the following commits:
2a9786ccc <Sutou Kouhei> Use hyphen for plasma-store-server executable
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Author: Antoine Pitrou <antoine@python.org>
Author: Krisztián Szűcs <szucs.krisztian@gmail.com>
Closes #4675 from pitrou/ARROW-2461-manylinux2010 and squashes the following commits:
2f218d50d <Krisztián Szűcs> remove multibuild's clean_code function call
1186b52cb <Krisztián Szűcs> use FETCH_HEAD in the CI templates
17a4c7fc7 <Antoine Pitrou> Nits
cb5f28530 <Antoine Pitrou> Attempt to factor out package scripts
bcd0761f4 <Antoine Pitrou> ARROW-2461: Build manylinux2010 wheels
This is a partial C++ implementation of the Flight RPC system initially proposed by Jacques in ARROW-249. As in Java, I had to dig into gRPC and Protocol Buffers internals to ensure that
* On write, memory is only copied once into the outgoing gRPC buffer
* On read, no memory is copied
The way that I tricked gRPC into circumventing the built-in protobuf serde paths might look a bit hacky, but after digging around in the library a bunch I've convinced myself that it's the best and perhaps only way to accomplish this. Luckily, the message that's being serialized/deserialized is pretty opaque to the rest of the gRPC system, and it's controlled by the `SerializationTraits<T>` class. So you can take a gRPC stream reader and make it create any kind of type you want, even if the input data is a protocol buffer.
Some things won't be addressed in this patch, as the scope would be too large:
* gRPC build toolchain issues (this is rather complex, I will create follow-up issues)
* Security / encryption, and authentication issues. I have only implemented an insecure server
* Integration with Travis CI
* Python bindings
The API is preliminary, and I expect it to be the subject of iteration to make it general and fast over the next several months.
Author: Wes McKinney <wesm+git@apache.org>
Closes #2547 from wesm/flight-cpp-prototype and squashes the following commits:
64bcdea43 <Wes McKinney> Initial Arrow Flight C++ implementation
Author: Krisztián Szűcs <szucs.krisztian@gmail.com>
Closes #2535 from kszucs/ARROW-2646 and squashes the following commits:
1f36aa6d9 <Krisztián Szűcs> add plasma_store_server to gitignore
2f6a31061 <Krisztián Szűcs> flake8
070e97520 <Krisztián Szűcs> test case for ChunkedArray and Column
8b44f7a02 <Krisztián Szűcs> support date_as_object in ArrowDeserializer
6ee8d2d85 <Krisztián Szűcs> ConvertDates if date_as_object PandasOption is set
Remove the reliance on pkg_resources to find out the version of an installed PyArrow.
This makes `import pyarrow` around 200 ms faster.
Author: Antoine Pitrou <antoine@python.org>
Closes #2415 from pitrou/ARROW-3029-use-generated-version-file and squashes the following commits:
0504c7d7 <Antoine Pitrou> ARROW-3029: Generate version file when building
After spending a non-trivial time wrestling with Cython and our build system, we're now able to generate and upload Python and Cython coverage results as part of a Travis-CI run (in addition to C++ coverage).
Author: Antoine Pitrou <antoine@python.org>
Closes #2050 from pitrou/ARROW-2574-cython-coverage and squashes the following commits:
4553185 <Antoine Pitrou> Remove leftover
b1212a4 <Antoine Pitrou> Silence "unknown warning option" error on clang
e1a5b4a <Antoine Pitrou> Disable ORC when building benchmarks
06b0665 <Antoine Pitrou> Try to fix Sphinx doc building
9b41d24 <Antoine Pitrou> Add nogil tracing
4014951 <Antoine Pitrou> ARROW-2574: Add Cython and Python code coverage
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes #1219 from wesm/ARROW-1087 and squashes the following commits:
5444b82e [Wes McKinney] Use more stable include
6cc855de [Wes McKinney] Add stub rst file about C extensions with pyarrow
cc @yackoa
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes #916 from wesm/ARROW-1213 and squashes the following commits:
f8a0aff1 [Wes McKinney] Add HDFS section to API docs
c54302df [Wes McKinney] Add deprecation warning for HdfsClient
4d3e7222 [Wes McKinney] Auto-wrap s3fs filesystem when using ParquetDataset
0be33bb8 [Wes McKinney] Implement os.walk emulation layer for s3fs
719f806d [Wes McKinney] Progress toward supporting s3fs in Parquet reader
bbd664ed [Wes McKinney] Refactor HdfsClient into pyarrow/hdfs.py. Add connect factory method. Rename to HadoopFilesystem. Add walk implementation for HDFS, base Parquet directory walker on that
4984a9d4 [Wes McKinney] Refactoring slightly
4c0bcf4a [Wes McKinney] Start on Dask filesystem wrapper, S3-Parquet dataset test case
Author: Phillip Cloud <cpcloud@gmail.com>
Closes #588 from cpcloud/ARROW-483 and squashes the following commits:
f671ba4 [Phillip Cloud] ARROW-483: [C++/Python] Provide access to "custom_metadata" Field attribute in IPC setting
I haven't tried this out on MSVC yet.
Also includes .gitignore fix for ARROW-821
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes #549 from wesm/ARROW-832 and squashes the following commits:
2f246a0 [Wes McKinney] Remove unused CMake variable
7a62cf4 [Wes McKinney] Small fix when ARROW_BUILD_BENCHMARKS=off
8eaa318 [Wes McKinney] Add dependency on gtest for benchmarks
5f692db [Wes McKinney] Update to gtest 1.8.0, remove now unneeded test_main.cc
Closes #108
Some users will not have a full Java Hadoop distribution and may wish to use the libhdfs3 package from Pivotal (https://github.com/Pivotal-Data-Attic/pivotalrd-libhdfs3), part of Apache HAWQ (incubating).
In C++, you can switch by setting:
```c++
HdfsConnectionConfig conf;
conf.driver = HdfsDriver::LIBHDFS3;
```
In Python, you can run:
```python
con = arrow.io.HdfsClient.connect(..., driver='libhdfs3')
```
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes #244 from wesm/ARROW-243 and squashes the following commits:
7ae197a [Wes McKinney] Refactor HdfsClient code to support both libhdfs and libhdfs3 at runtime. Add driver option to Python interface
This also includes format reconciliation as discussed in ARROW-384.
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes #211 from wesm/ARROW-363 and squashes the following commits:
6982c3c [Wes McKinney] Permit end of buffer IPC reads if length is 0
4d46c8b [Wes McKinney] Fix logical error with offsets array in JsonFileWriter. Add broken string test case to simple.json
36ab5d6 [Wes McKinney] Increment MetadataVersion in flatbuffer
844257e [Wes McKinney] cpplint
a2711f2 [Wes McKinney] Address other format incompatibilities, write vectorLayout to Arrow metadata
13608ef [Wes McKinney] Relax 64 byte padding. Do not write RecordBatch embedded in Message for now
6a66fc8 [Wes McKinney] Write record batch size prefix in Java
72ea42c [Wes McKinney] Note that padding is 64-bytes at start of file (for now)
c2ffde4 [Wes McKinney] More notes about the file format
aef4382 [Wes McKinney] cpplint
85128f7 [Wes McKinney] Refactor IPC/File record batch read/write structure to reflect discussion in ARROW-384
dbd6ed6 [Wes McKinney] Do not embed metadata length in WriteDataHeader
c529d63 [Wes McKinney] Fix JSON integration test example to make it further
d806aa6 [Wes McKinney] Exclude JSON files from Apache RAT checks
a7e2d4b [Wes McKinney] Draft testing harness
Author: Uwe L. Korn <uwelk@xhochy.com>
Closes #179 from xhochy/ARROW-342 and squashes the following commits:
15d0ce3 [Uwe L. Korn] ARROW-342: Set Python version on release
Patches [ARROW-103](https://issues.apache.org/jira/browse/ARROW-103), though perhaps it would make sense to leave that issue open to cover any future .gitignore-related pull requests.
Author: Dan Robinson <danrobinson010@gmail.com>
Closes #62 from danrobinson/ARROW-103 and squashes the following commits:
7c1c7d8 [Dan Robinson] ARROW-103: Added '*-build' to cpp/.gitignore
633bacf [Dan Robinson] ARROW-103: Added '.cache' to python/.gitignore
59f58ba [Dan Robinson] ARROW-103: Add '*.dylib to python/.gitignore'
52572ab [Dan Robinson] ARROW-103: Add 'dev-build/' to cpp/.gitignore
Run them using `asv run --python=same` or `asv dev`.
Author: Uwe L. Korn <uwelk@xhochy.com>
Closes #44 from xhochy/arrow-79 and squashes the following commits:
d3c6401 [Uwe L. Korn] Move benchmarks to toplevel folder
2737f18 [Uwe L. Korn] ARROW-79: [Python] Add benchmarks
This patch provides no actual functionality; it only builds an empty Cython extension that links to libarrow.so. I will hook this into Travis CI at some later time.
I have adapted a limited amount of BSD (2- or 3-clause) or Apache 2.0 3rd-party code (particularly the cmake/Cython integration) to bootstrap this Python package / build setup in accordance with http://www.apache.org/legal/resolved.html. I have noted the relevant copyright holders and licenses in `python/LICENSE.txt`. In particular, I expect to continue to refactor and reuse occasional utility code from pandas (https://github.com/pydata/pandas) as practical.
Since a significant amount of "glue code" will need to be written to marshal between Arrow data and pure Python / NumPy / pandas objects, to get started I've adopted the approach used by libdynd/dynd-python -- a C++ "glue library" that is then called from Cython to provide a Python user interface. This will allow us to build shims as necessary to abstract away complications that leak through (for example: enabling C++ code with no knowledge of Python to invoke Python functions). Let's see how this goes: there are other options, like Boost::Python, but Cython + shim code is a more lightweight and flexible solution for the moment.
Author: Wes McKinney <wesm@apache.org>
Closes #17 from wesm/ARROW-7 and squashes the following commits:
be059a2 [Wes McKinney] Nest arrow::py namespace
3ad3143 [Wes McKinney] Add preliminary Python development toolchain