25 Commits

Author SHA1 Message Date
Raúl Cumplido
f1b21f1945 GH-36411: [Python] Use scikit-build-core as build backend for PyArrow and get rid of setup.py (#49259)
### Rationale for this change

Move the PyArrow build backend from setuptools and a custom setup.py to scikit-build-core, which is a build backend for CMake-based projects.

### What changes are included in this PR?

Move from setuptools to scikit-build-core and remove the PyArrow setup.py. Update some of the build requirements and apply minor fixes.
A custom build backend has also been created to wrap scikit-build-core and fix problems with license files in monorepos.
pyproject.toml metadata validation expects license files to exist before the build backend runs, which is why we create symlinks. Our thin build backend turns those symlinks into hard links so that the license and notice files carry the actual contents and are included in the sdist.
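
As a rough illustration of the symlink-to-hard-link step described above, the sketch below replaces a symlink with a hard link to its target. The helper name `materialize_symlink` is hypothetical and is not taken from the PyArrow build backend, which may be implemented differently.

```python
import os
import tempfile


def materialize_symlink(path):
    """Replace the symlink at `path` with a hard link to its target.

    Hypothetical helper for illustration only; not the actual PyArrow code.
    Returns True if a symlink was replaced, False if `path` is not a symlink.
    """
    if not os.path.islink(path):
        return False
    target = os.path.realpath(path)  # resolve relative links such as ../LICENSE.txt
    tmp = path + ".tmp"
    os.link(target, tmp)             # hard link: a regular file with the same contents
    os.replace(tmp, path)            # swap it in place of the symlink itself
    return True
```

After the call, `path` is a regular directory entry, so tools that copy files into an sdist see the file contents rather than a link.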

Remove flags that are no longer used (they were only part of setup.py). Document and validate how the same flags have to be used now.

### Are these changes tested?

Yes, all Python CI tests, wheels and sdist are successful.

### Are there any user-facing changes?

Yes. Users building PyArrow now need the new build dependencies, and depending on the flags they use they may need to switch to the newly documented way of passing those flags.

* GitHub Issue: #36411

Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
2026-03-09 09:47:21 +01:00
Raúl Cumplido
644ec570f5 GH-46008: [Python][Benchmarking] Remove unused asv benchmarking files (#49047)
### Rationale for this change

As discussed in the issue, we don't seem to have run the asv benchmarks for Python in the last few years. They are probably broken.

### What changes are included in this PR?

Remove asv benchmarking related files and docs.

### Are these changes tested?

No. Validate CI and run preview-docs to validate the docs.

### Are there any user-facing changes?

No
* GitHub Issue: #46008

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
2026-02-02 12:54:43 +01:00
Patrick J. Roddy
866502e6d6 GH-45867: [Python] Fix SetuptoolsDeprecationWarning (#47141)
### Rationale for this change
When building locally, I get many errors along the lines of

```
Please ensure the files specified are contained by the root
of the Python package (normally marked by `pyproject.toml`).

By 2026-Mar-20, you need to update your project and remove deprecated calls
or your builds will no longer be supported.

See https://packaging.python.org/en/latest/specifications/glob-patterns/ for details.
```

<img width="958" height="755" alt="terminal demo" src="https://github.com/user-attachments/assets/67f0e261-c4d2-403c-b004-688dfaaccda6" />

### What changes are included in this PR?
- Make the licence [SPDX compliant](https://spdx.org/licenses)
- Remove the licence classifier
- Move the licence files from `setup.cfg` to `pyproject.toml`
- Fix the [disallowed glob patterns](https://packaging.python.org/en/latest/specifications/glob-patterns) via symlinks
- Bump the minimum version of `setuptools` due to macOS CI failures (don't know why this happened, caching maybe?)
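
To illustrate why patterns such as `../LICENSE.txt` trigger the warning, here is a minimal sketch that flags license-file patterns pointing outside the project root. It checks only two rules (no `..` components, no absolute paths), the function name is hypothetical, and it is not the validation code that setuptools actually runs.

```python
from pathlib import PurePosixPath


def disallowed_license_patterns(patterns):
    """Return the patterns that are absolute or contain a '..' component.

    Illustrative check only; the real specification has further rules.
    """
    bad = []
    for pattern in patterns:
        parts = PurePosixPath(pattern).parts
        if pattern.startswith("/") or ".." in parts:
            bad.append(pattern)
    return bad
```

For example, `disallowed_license_patterns(["../LICENSE.txt", "LICENSE.txt"])` flags only the first entry, which is the kind of path a symlink inside the package directory avoids.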

I appreciate the symlink change might prove controversial. See discussions in https://github.com/apache/arrow/issues/45867, fixes https://github.com/apache/arrow/issues/45867.

### Are these changes tested?
When I rebuild locally, I get no errors any more.

### Are there any user-facing changes?
Yes. The minimum required version of `setuptools` is now `77`. However, this is available on Python `>=3.9`, so it won't really affect anyone.

* GitHub Issue: #45867

Authored-by: Patrick J. Roddy <patrickjamesroddy@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
2025-12-18 13:11:45 +01:00
Dane Pitkin
2fac18587e GH-43727: [Python] RecordBatch fails gracefully on non-cpu devices (#43729)
### Rationale for this change

Throw a Python exception if a RecordBatch API cannot be used because the memory is backed by a non-CPU device.

### What changes are included in this PR?

* Assert the device is CPU for APIs that only support CPU
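
The guard pattern can be sketched in plain Python as follows. `FakeBatch`, its `is_cpu` flag, and the error message are stand-ins chosen for illustration; they are assumptions and do not reproduce the actual pyarrow implementation.

```python
class FakeBatch:
    """Hypothetical stand-in for a record batch; not the pyarrow class."""

    def __init__(self, is_cpu):
        self.is_cpu = is_cpu

    def _assert_cpu(self):
        # Fail with a regular Python exception instead of touching memory
        # that is not addressable from the CPU.
        if not self.is_cpu:
            raise NotImplementedError("Implemented only for data on CPU device")

    def to_pylist(self):
        self._assert_cpu()    # guard placed at the top of a CPU-only API
        return [1, 2, 3]      # placeholder for reading CPU-resident data
```

Calling `FakeBatch(is_cpu=False).to_pylist()` raises `NotImplementedError`, which the caller can catch, whereas dereferencing device memory directly could crash the process.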

### Are these changes tested?

Pytests

### Are there any user-facing changes?

The user experiences Python exceptions instead of segfaults for unsupported APIs.
* GitHub Issue: #43727

Authored-by: Dane Pitkin <dpitkin@apache.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
2024-09-04 09:46:09 +02:00
Sutou Kouhei
2c9340dd17 GH-33243: [Plasma] Remove (#34718)
### Rationale for this change

We can't maintain Plasma.

### What changes are included in this PR?

These changes remove all Plasma-related code.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* Closes: #33243

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
2023-03-29 06:34:23 +09:00
Weston Pace
83511b1c27 MINOR: [Python] add generated lib.h file to pyarrow gitignore (#15121)
Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
2022-12-30 07:33:42 +09:00
Weston Pace
f01770dee9 MINOR: [Python] Adding .hypothesis directory to gitignore (#13664)
Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
2022-07-21 13:34:37 -10:00
Antoine Pitrou
e60e6b00ee ARROW-15165: [Python] Expose function to resolve S3 bucket region
Closes #11998 from pitrou/ARROW-15165-py-s3-resolve-region

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
2021-12-23 15:36:46 +01:00
Sutou Kouhei
94ba10bbf3 ARROW-6294: [C++] Use hyphen for plasma-store-server executable
To follow ARROW-4648 https://github.com/apache/arrow/pull/5069 :

> hyphens in executable file names

But this causes backward incompatibility.

Closes #5131 from kou/cpp-plasma-store-server and squashes the following commits:

2a9786ccc <Sutou Kouhei>  Use hyphen for plasma-store-server executable

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
2019-08-21 11:49:45 +02:00
Antoine Pitrou
0568544531 ARROW-2461: [Python] Build manylinux2010 wheels
Author: Antoine Pitrou <antoine@python.org>
Author: Krisztián Szűcs <szucs.krisztian@gmail.com>

Closes #4675 from pitrou/ARROW-2461-manylinux2010 and squashes the following commits:

2f218d50d <Krisztián Szűcs> remove multibuild's clean_code function call
1186b52cb <Krisztián Szűcs> use FETCH_HEAD in the CI templates
17a4c7fc7 <Antoine Pitrou> Nits
cb5f28530 <Antoine Pitrou> Attempt to factor out package scripts
bcd0761f4 <Antoine Pitrou> ARROW-2461:  Build manylinux2010 wheels
2019-06-25 16:45:03 +02:00
François Saint-Jacques
c3511db97e ARROW-4827: [C++] Implement benchmark comparison
This script/library allows comparing revisions/builds.

Author: François Saint-Jacques <fsaintjacques@gmail.com>

Closes #4141 from fsaintjacques/ARROW-4827-benchmark-comparison and squashes the following commits:

a047ae4ed <François Saint-Jacques> Satisfy flake8
e95baf317 <François Saint-Jacques> Add comments and move stuff
ee39a1feb <François Saint-Jacques> Move cpp_runner_from_rev_or_path in CppRunner
2a953f180 <François Saint-Jacques> Missing files
d8e3c1c85 <François Saint-Jacques> Review
514e8e428 <François Saint-Jacques> Introduce RegressionSetArgs
280c93be4 <François Saint-Jacques> Update gitignore
dc031bde7 <François Saint-Jacques> Support conda toolchain
28254676c <François Saint-Jacques> Add --cmake-extras to benchmark-diff command
e6762899c <François Saint-Jacques> Typo
048ba0ede <François Saint-Jacques> Add verbose_third_party
71b10e98a <François Saint-Jacques> Disable python in benchmarks
c3719214c <François Saint-Jacques> Fix flake8 warnings
8845e3e78 <François Saint-Jacques> Remove empty __init__.py
1949f749c <François Saint-Jacques> Supports HEAD revisions
96f999748 <François Saint-Jacques> Add gitignore entry
d9692bc8f <François Saint-Jacques> Fix splitlines
90578af61 <François Saint-Jacques> Add --cmake-extras to build command
7696202ba <François Saint-Jacques> Add doc for bin attribute.
a281ae8e6 <François Saint-Jacques> Various language fixes
1b028390c <François Saint-Jacques> Rename --cxx_flags to --cxx-flags
bc111b2d3 <François Saint-Jacques> Removes copied stuff
d6733b6f4 <François Saint-Jacques> Formatting
21b2e14fc <François Saint-Jacques> Add doc and fix bugs
2a81744cf <François Saint-Jacques> Ooops.
c85661cf3 <François Saint-Jacques> Add documentation
703cf987a <François Saint-Jacques> commit
2c0d512f8 <François Saint-Jacques> Checkpoint
a38f49cd9 <François Saint-Jacques> checkpoint
a5ad76d11 <François Saint-Jacques> Fix syntax
712d2ed3c <François Saint-Jacques> initial commit
2019-04-25 17:54:09 +02:00
Wes McKinney
db0ef22dd6 ARROW-3146: [C++] Prototype Flight RPC client and server implementations
This is a partial C++ implementation of the Flight RPC system initially proposed by Jacques in ARROW-249. As in Java, I had to dig into gRPC and Protocol Buffers internals to ensure that

* On write, memory is only copied once into the outgoing gRPC buffer
* On read, no memory is copied

The way that I tricked gRPC into circumventing the built-in protobuf serde paths might look a bit hacky, but after digging around in the library a bunch I've convinced myself that it's the best and perhaps only way to accomplish this. Luckily, the message that's being serialized/deserialized is pretty opaque to the rest of the gRPC system, and it's controlled by the `SerializationTraits<T>` class. So you can take a gRPC stream reader and make it create any kind of type you want, even if the input data is a protocol buffer.

Some things that won't be addressed in this patch, as the scope is too large:

* gRPC build toolchain issues (this is rather complex, I will create follow-up issues)
* Security / encryption, and authentication issues. I have only implemented an insecure server
* Integration with Travis CI
* Python bindings

The API is preliminary, and I expect it to be the subject of iteration to make it general and fast over the next several months.

Author: Wes McKinney <wesm+git@apache.org>

Closes #2547 from wesm/flight-cpp-prototype and squashes the following commits:

64bcdea43 <Wes McKinney> Initial Arrow Flight C++ implementation
2018-09-20 16:56:50 -04:00
Krisztián Szűcs
67c05c203d ARROW-2646: [C++/Python] Pandas roundtrip for date objects
Author: Krisztián Szűcs <szucs.krisztian@gmail.com>

Closes #2535 from kszucs/ARROW-2646 and squashes the following commits:

1f36aa6d9 <Krisztián Szűcs> add plasma_store_server to gitignore
2f6a31061 <Krisztián Szűcs> flake8
070e97520 <Krisztián Szűcs> test case for ChunkedArray and Column
8b44f7a02 <Krisztián Szűcs> support date_as_object in ArrowDeserializer
6ee8d2d85 <Krisztián Szűcs> ConvertDates if date_as_object PandasOption is set
2018-09-13 17:15:20 -04:00
Antoine Pitrou
0c38a21be6 ARROW-3029: [Python] Generate version file when building
Remove the reliance on pkg_resources to find out the version of an installed PyArrow.
Here, this makes `import pyarrow` around 200 ms faster.
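
The general idea can be sketched like this: the build step writes a small module containing the version string, and the package reads that module at import time instead of querying installed-package metadata. The file name `_generated_version.py` and both helper names are assumptions made for this example.

```python
import importlib.util
import os
import tempfile


def write_version_file(path, version):
    """Build-time step: write a tiny module that holds the version string."""
    with open(path, "w") as f:
        f.write(f"version = {version!r}\n")


def load_version(path):
    """Import-time step: load the generated module and return its version."""
    spec = importlib.util.spec_from_file_location("_generated_version", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.version
```

Reading one small generated file avoids scanning installed distributions, which is where the import-time saving comes from.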

Author: Antoine Pitrou <antoine@python.org>

Closes #2415 from pitrou/ARROW-3029-use-generated-version-file and squashes the following commits:

0504c7d7 <Antoine Pitrou> ARROW-3029:  Generate version file when building
2018-08-09 13:49:28 -04:00
Antoine Pitrou
7764bc8890 ARROW-2574: [Python] Add Cython and Python code coverage
After spending a non-trivial time wrestling with Cython and our build system, we're now able to generate and upload Python and Cython coverage results as part of a Travis-CI run (in addition to C++ coverage).

Author: Antoine Pitrou <antoine@python.org>

Closes #2050 from pitrou/ARROW-2574-cython-coverage and squashes the following commits:

4553185 <Antoine Pitrou> Remove leftover
b1212a4 <Antoine Pitrou> Silence "unknown warning option" error on clang
e1a5b4a <Antoine Pitrou> Disable ORC when building benchmarks
06b0665 <Antoine Pitrou> Try to fix Sphinx doc building
9b41d24 <Antoine Pitrou> Add nogil tracing
4014951 <Antoine Pitrou> ARROW-2574:  Add Cython and Python code coverage
2018-05-17 16:21:47 +02:00
Wes McKinney
53dd0c8571 ARROW-1087: [Python] Add pyarrow.get_include function. Bundle includes in all builds
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #1219 from wesm/ARROW-1087 and squashes the following commits:

5444b82e [Wes McKinney] Use more stable include
6cc855de [Wes McKinney] Add stub rst file about C extensions with pyarrow
2017-10-22 21:37:49 -04:00
Wes McKinney
af2aeafca6 ARROW-1213: [Python] Support s3fs filesystem for Amazon S3 in ParquetDataset
cc @yackoa

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #916 from wesm/ARROW-1213 and squashes the following commits:

f8a0aff1 [Wes McKinney] Add HDFS section to API docs
c54302df [Wes McKinney] Add deprecation warning for HdfsClient
4d3e7222 [Wes McKinney] Auto-wrap s3fs filesystem when using ParquetDataset
0be33bb8 [Wes McKinney] Implement os.walk emulation layer for s3fs
719f806d [Wes McKinney] Progress toward supporting s3fs in Parquet reader
bbd664ed [Wes McKinney] Refactor HdfsClient into pyarrow/hdfs.py. Add connect factory method. Rename to HadoopFilesystem. Add walk implementation for HDFS, base Parquet directory walker on that
4984a9d4 [Wes McKinney] Refactoring slightly
4c0bcf4a [Wes McKinney] Start on Dask filesystem wrapper, S3-Parquet dataset test case
2017-07-31 18:46:54 -04:00
Phillip Cloud
7d433dc27b ARROW-483: [C++/Python] Provide access to "custom_metadata" Field attribute in IPC setting
Author: Phillip Cloud <cpcloud@gmail.com>

Closes #588 from cpcloud/ARROW-483 and squashes the following commits:

f671ba4 [Phillip Cloud] ARROW-483: [C++/Python] Provide access to "custom_metadata" Field attribute in IPC setting
2017-04-25 17:36:31 -04:00
Wes McKinney
dad1a8ee38 ARROW-832: [C++] Update to gtest 1.8.0, remove now unneeded test_main.cc
I haven't tried this out on MSVC yet.

Also includes .gitignore fix for ARROW-821

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #549 from wesm/ARROW-832 and squashes the following commits:

2f246a0 [Wes McKinney] Remove unused CMake variable
7a62cf4 [Wes McKinney] Small fix when ARROW_BUILD_BENCHMARKS=off
8eaa318 [Wes McKinney] Add dependency on gtest for benchmarks
5f692db [Wes McKinney] Update to gtest 1.8.0, remove now unneeded test_main.cc
2017-04-16 09:29:15 -04:00
Wes McKinney
cfde4607df ARROW-243: [C++] Add option to switch between libhdfs and libhdfs3 when creating HdfsClient
Closes #108

Some users will not have a full Java Hadoop distribution and may wish to use the libhdfs3 package from Pivotal (https://github.com/Pivotal-Data-Attic/pivotalrd-libhdfs3), part of Apache HAWQ (incubating).

In C++, you can switch by setting:

```c++
HdfsConnectionConfig conf;
conf.driver = HdfsDriver::LIBHDFS3;
```

In Python, you can run:

```python
con = arrow.io.HdfsClient.connect(..., driver='libhdfs3')
```

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #244 from wesm/ARROW-243 and squashes the following commits:

7ae197a [Wes McKinney] Refactor HdfsClient code to support both libhdfs and libhdfs3 at runtime. Add driver option to Python interface
2016-12-19 18:26:17 -05:00
Wes McKinney
e3c167bd10 ARROW-363: [Java/C++] integration testing harness, initial integration tests
This also includes format reconciliation as discussed in ARROW-384.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #211 from wesm/ARROW-363 and squashes the following commits:

6982c3c [Wes McKinney] Permit end of buffer IPC reads if length is 0
4d46c8b [Wes McKinney] Fix logical error with offsets array in JsonFileWriter. Add broken string test case to simple.json
36ab5d6 [Wes McKinney] Increment MetadataVersion in flatbuffer
844257e [Wes McKinney] cpplint
a2711f2 [Wes McKinney] Address other format incompatibilities, write vectorLayout to Arrow metadata
13608ef [Wes McKinney] Relax 64 byte padding. Do not write RecordBatch embedded in Message for now
6a66fc8 [Wes McKinney] Write record batch size prefix in Java
72ea42c [Wes McKinney] Note that padding is 64-bytes at start of file (for now)
c2ffde4 [Wes McKinney] More notes about the file format
aef4382 [Wes McKinney] cpplint
85128f7 [Wes McKinney] Refactor IPC/File record batch read/write structure to reflect discussion in ARROW-384
dbd6ed6 [Wes McKinney] Do not embed metadata length in WriteDataHeader
c529d63 [Wes McKinney] Fix JSON integration test example to make it further
d806aa6 [Wes McKinney] Exclude JSON files from Apache RAT checks
a7e2d4b [Wes McKinney] Draft testing harness
2016-11-28 21:29:19 -05:00
Uwe L. Korn
2f84493371 ARROW-342: Set Python version on release
Author: Uwe L. Korn <uwelk@xhochy.com>

Closes #179 from xhochy/ARROW-342 and squashes the following commits:

15d0ce3 [Uwe L. Korn] ARROW-342: Set Python version on release
2016-10-21 16:27:00 -04:00
Dan Robinson
5843e6872f ARROW-103: Add files to gitignore
Patches [ARROW-103](https://issues.apache.org/jira/browse/ARROW-103), though perhaps it would make sense to leave that issue open to cover any future .gitignore-related pull requests.

Author: Dan Robinson <danrobinson010@gmail.com>

Closes #62 from danrobinson/ARROW-103 and squashes the following commits:

7c1c7d8 [Dan Robinson] ARROW-103: Added '*-build' to cpp/.gitignore
633bacf [Dan Robinson] ARROW-103: Added '.cache' to python/.gitignore
59f58ba [Dan Robinson] ARROW-103: Add '*.dylib' to python/.gitignore
52572ab [Dan Robinson] ARROW-103: Add 'dev-build/' to cpp/.gitignore
2016-04-17 15:25:39 +02:00
Uwe L. Korn
80ec2c17fc ARROW-79: [Python] Add benchmarks
Run them using `asv run --python=same` or `asv dev`.

Author: Uwe L. Korn <uwelk@xhochy.com>

Closes #44 from xhochy/arrow-79 and squashes the following commits:

d3c6401 [Uwe L. Korn] Move benchmarks to toplevel folder
2737f18 [Uwe L. Korn] ARROW-79: [Python] Add benchmarks
2016-03-28 09:39:55 -07:00
Wes McKinney
572cdf22e3 ARROW-7: Add barebones Python library build toolchain
This patch provides no actual functionality; it only builds an empty Cython extension that links to libarrow.so. I will hook this into Travis CI at some later time.

I have adapted a limited amount of BSD (2- or 3-clause) or Apache 2.0 3rd-party code (particularly the cmake/Cython integration) to bootstrap this Python package / build setup in accordance with http://www.apache.org/legal/resolved.html. I have noted the relevant copyright holders and licenses in `python/LICENSE.txt`. In particular, I expect to continue to refactor and reuse occasional utility code from pandas (https://github.com/pydata/pandas) as practical.

Since a significant amount of "glue code" will need to be written to marshal between Arrow data and pure Python / NumPy / pandas objects, to get started I've adopted the approach used by libdynd/dynd-python -- a C++ "glue library" that is then called from Cython to provide a Python user interface. This will allow us to build shims as necessary to abstract away complications that leak through (for example: enabling C++ code with no knowledge of Python to invoke Python functions). Let's see how this goes: there are other options, like Boost::Python, but Cython + shim code is a more lightweight and flexible solution for the moment.

Author: Wes McKinney <wesm@apache.org>

Closes #17 from wesm/ARROW-7 and squashes the following commits:

be059a2 [Wes McKinney] Nest arrow::py namespace
3ad3143 [Wes McKinney] Add preliminary Python development toolchain
2016-03-07 14:42:32 -08:00