# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Arrow file and stream reader/writer classes, and other messaging tools

import os

import pyarrow as pa

from pyarrow.lib import (IpcReadOptions, IpcWriteOptions, ReadStats, WriteStats,  # noqa
                         Message, MessageReader,
                         RecordBatchReader, _ReadPandasMixin,
                         MetadataVersion, Alignment,
                         read_message, read_record_batch, read_schema,
                         read_tensor, write_tensor,
                         get_record_batch_size, get_tensor_size)
import pyarrow.lib as lib


class RecordBatchStreamReader(lib._RecordBatchStreamReader):
    """
    Reader for the Arrow streaming binary format.

    Parameters
    ----------
    source : bytes/buffer-like, pyarrow.NativeFile, or file-like Python object
        Either an in-memory buffer, or a readable file object.
        To read from a memory-mapped file, pass a MemoryMappedFile as the
        source.
    options : pyarrow.ipc.IpcReadOptions
        Options for IPC deserialization.
        If None, default values will be used.
    memory_pool : MemoryPool, default None
        If None, default memory pool is used.
    """

    def __init__(self, source, *, options=None, memory_pool=None):
        options = _ensure_default_ipc_read_options(options)
        self._open(source, options=options, memory_pool=memory_pool)


_ipc_writer_class_doc = """\
Parameters
----------
sink : str, pyarrow.NativeFile, or file-like Python object
    Either a file path, or a writable file object.
schema : pyarrow.Schema
    The Arrow schema for data to be written to the file.
options : pyarrow.ipc.IpcWriteOptions
    Options for IPC serialization.
    If None, default values will be used: the legacy format will not
    be used unless overridden by setting the environment variable
    ARROW_PRE_0_15_IPC_FORMAT=1, and the V5 metadata version will be
    used unless overridden by setting the environment variable
    ARROW_PRE_1_0_METADATA_VERSION=1."""

_ipc_file_writer_class_doc = (
    _ipc_writer_class_doc
    + "\n"
    + """\
metadata : dict | pyarrow.KeyValueMetadata, optional
    Key/value pairs (both must be bytes-like) that will be stored
    in the file footer and are retrievable via
    pyarrow.ipc.open_file(...).metadata."""
)


class RecordBatchStreamWriter(lib._RecordBatchStreamWriter):
    __doc__ = f"""Writer for the Arrow streaming binary format

{_ipc_writer_class_doc}"""

    def __init__(self, sink, schema, *, options=None):
        options = _get_legacy_format_default(options)
        self._open(sink, schema, options=options)


class RecordBatchFileReader(lib._RecordBatchFileReader):
    """
    Class for reading Arrow record batch data from the Arrow binary file format

    Parameters
    ----------
    source : bytes/buffer-like, pyarrow.NativeFile, or file-like Python object
        Either an in-memory buffer, or a readable file object.
        To read from a memory-mapped file, pass a MemoryMappedFile as the
        source.
    footer_offset : int, default None
        If the file is embedded in some larger file, this is the byte offset
        to the very end of the file data.
    options : pyarrow.ipc.IpcReadOptions
        Options for IPC deserialization.
        If None, default values will be used.
    memory_pool : MemoryPool, default None
        If None, default memory pool is used.
    """

    def __init__(self, source, footer_offset=None, *, options=None,
                 memory_pool=None):
        options = _ensure_default_ipc_read_options(options)
        self._open(source, footer_offset=footer_offset,
                   options=options, memory_pool=memory_pool)


class RecordBatchFileWriter(lib._RecordBatchFileWriter):
    __doc__ = f"""Writer to create the Arrow binary file format

{_ipc_file_writer_class_doc}"""

    def __init__(self, sink, schema, *, options=None, metadata=None):
        options = _get_legacy_format_default(options)
        self._open(sink, schema, options=options, metadata=metadata)


def _get_legacy_format_default(options):
    if options:
        if not isinstance(options, IpcWriteOptions):
            raise TypeError(f"expected IpcWriteOptions, got {type(options)}")
        return options

    metadata_version = MetadataVersion.V5
    use_legacy_format = \
        bool(int(os.environ.get('ARROW_PRE_0_15_IPC_FORMAT', '0')))
    if bool(int(os.environ.get('ARROW_PRE_1_0_METADATA_VERSION', '0'))):
        metadata_version = MetadataVersion.V4
    return IpcWriteOptions(use_legacy_format=use_legacy_format,
                           metadata_version=metadata_version)


def _ensure_default_ipc_read_options(options):
    if options and not isinstance(options, IpcReadOptions):
        raise TypeError(f"expected IpcReadOptions, got {type(options)}")
    return options or IpcReadOptions()


def new_stream(sink, schema, *, options=None):
    return RecordBatchStreamWriter(sink, schema,
                                   options=options)


new_stream.__doc__ = f"""\
Create an Arrow columnar IPC stream writer instance

{_ipc_writer_class_doc}

Returns
-------
writer : RecordBatchStreamWriter
    A writer for the given sink
"""


def open_stream(source, *, options=None, memory_pool=None):
    """
    Create reader for Arrow streaming format.

    Parameters
    ----------
    source : bytes/buffer-like, pyarrow.NativeFile, or file-like Python object
        Either an in-memory buffer, or a readable file object.
    options : pyarrow.ipc.IpcReadOptions
        Options for IPC deserialization.
        If None, default values will be used.
    memory_pool : MemoryPool, default None
        If None, default memory pool is used.

    Returns
    -------
    reader : RecordBatchStreamReader
        A reader for the given source
    """
    return RecordBatchStreamReader(source, options=options,
                                   memory_pool=memory_pool)


def new_file(sink, schema, *, options=None, metadata=None):
    return RecordBatchFileWriter(sink, schema, options=options,
                                 metadata=metadata)


new_file.__doc__ = f"""\
Create an Arrow columnar IPC file writer instance

{_ipc_file_writer_class_doc}

Returns
-------
writer : RecordBatchFileWriter
    A writer for the given sink
"""


def open_file(source, footer_offset=None, *, options=None, memory_pool=None):
    """
    Create reader for Arrow file format.

    Parameters
    ----------
    source : bytes/buffer-like, pyarrow.NativeFile, or file-like Python object
        Either an in-memory buffer, or a readable file object.
    footer_offset : int, default None
        If the file is embedded in some larger file, this is the byte offset
        to the very end of the file data.
    options : pyarrow.ipc.IpcReadOptions
        Options for IPC deserialization.
        If None, default values will be used.
    memory_pool : MemoryPool, default None
        If None, default memory pool is used.

    Returns
    -------
    reader : RecordBatchFileReader
        A reader for the given source
    """
    return RecordBatchFileReader(
        source, footer_offset=footer_offset,
        options=options, memory_pool=memory_pool)


def serialize_pandas(df, *, nthreads=None, preserve_index=None):
    """
    Serialize a pandas DataFrame into a buffer protocol compatible object.

    Parameters
    ----------
    df : pandas.DataFrame
    nthreads : int, default None
        Number of threads to use for conversion to Arrow, default all CPUs.
    preserve_index : bool, default None
        The default of None will store the index as a column, except for
        RangeIndex which is stored as metadata only. If True, always
        preserve the pandas index data as a column. If False, no index
        information is saved and the result will have a default RangeIndex.

    Returns
    -------
    buf : buffer
        An object compatible with the buffer protocol.
    """
    batch = pa.RecordBatch.from_pandas(df, nthreads=nthreads,
                                       preserve_index=preserve_index)
    sink = pa.BufferOutputStream()
    with pa.RecordBatchStreamWriter(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return sink.getvalue()


def deserialize_pandas(buf, *, use_threads=True):
    """Deserialize a buffer protocol compatible object into a pandas DataFrame.

    Parameters
    ----------
    buf : buffer
        An object compatible with the buffer protocol.
    use_threads : bool, default True
        Whether to parallelize the conversion using multiple threads.

    Returns
    -------
    df : pandas.DataFrame
        The buffer deserialized as pandas DataFrame
    """
    buffer_reader = pa.BufferReader(buf)
    with pa.RecordBatchStreamReader(buffer_reader) as reader:
        table = reader.read_all()
    return table.to_pandas(use_threads=use_threads)