
Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

ARROW-1214: [Python/C++] Add C++ functionality to more easily handle encapsulated IPC messages, Python bindings

This patch does a bunch of things:

* Decouples the RecordBatchStreamReader from the actual message iteration (which is handled by a new `arrow::ipc::MessageReader` interface)
* Enables `arrow::ipc::Message` to hold all of the memory for a complete unit of data: metadata plus body
* Renames some IPC methods for better consistency (GetNextRecordBatch -> ReadNextRecordBatch)
* Adds a function to serialize a complete encapsulated message to an `arrow::io::OutputStream*`
* Adds Python bindings for all of the above, introducing `pyarrow.Message` and `pyarrow.MessageReader`. Adds `read_message` and `Message.serialize` functions for efficient memory round trips
* Adds `pyarrow.read_record_batch` for reading a single record batch given a message and a known schema

Later we will want to add `pyarrow.read_schema`, but it seemed like a bit of work to make it work for dictionaries. This implements the C++ analogue to ARROW-1047, which was for Java. Not sure why I didn't create a JIRA about this. cc @icexelloss

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #839 from wesm/ARROW-1214 and squashes the following commits:

07f1820a [Wes McKinney] Refactor to introduce MessageReader abstract type, use unique_ptr for messages instead of shared_ptr. First cut at Message, MessageReader Python API. Add read_message, C++/Python machinery for message roundtrips to Buffer, comparison. Add function to read RecordBatch from encapsulated message given schema.
2017-07-15 16:51:51 -04:00
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# ---------------------------------------------------------------------
ARROW-5510: [C++][Python][R][GLib] Implement Feather "V2" using Arrow IPC file format

This is based on top of ARROW-7979, so I will need to rebase once that is merged. Excluding the changes from ARROW-7979, this patch is a substantial code reduction in Feather-related code. I removed a lot of cruft from the V1 implementation and made things a lot simpler without altering the user-facing functionality. To summarize:

* V2 is exactly the Arrow IPC file format, with the option for the experimental "trivial" body buffer compression implemented in ARROW-7979. `read_feather` functions distinguish the files based on the magic bytes at the beginning of the file ("FEA1" versus "ARROW1")
* An `ipc::feather::WriteProperties` struct has been introduced to allow setting the file version, as well as the chunksize (since large tables are broken up into smaller chunks when writing), compression type, and compression level (compressor-specific)
* LZ4 and ZSTD are the only codecs intended to be supported (also in line with mailing list discussion about IPC compression). The default is LZ4 unless -DARROW_WITH_LZ4=OFF, in which case it's uncompressed
* Unit tests in Python now test both versions
* R tests are only running the V2 version. I'll need some help adding options to set the version as well as the compression type and compression level

Since 0.17.0 is likely to be released without formalizing IPC compression, I will plan to support an "ARROW:experimental_compression" metadata member in 0.17.0 Feather files.

Other notes:

* Column decompression is currently serial. I'll work on making this parallel ASAP as it will impact benchmarks significantly.
* Compression (both chunk-level and column-level) is serial. Write performance would be much improved, especially at higher compression levels, by compressing in parallel at least at the column level
* Write performance could be improved by compressing chunks and writing them to disk concurrently. It's done serially at the moment, so I will open a follow-up JIRA about this

Closes #6694 from wesm/feather-v2

Lead-authored-by: Wes McKinney <wesm+git@apache.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
2020-03-29 19:05:36 -05:00
# Implement Feather file format
# cython: profile=False
# distutils: language = c++
# cython: language_level=3
from cython.operator cimport dereference as deref
from pyarrow.includes.common cimport *
from pyarrow.includes.libarrow cimport *
from pyarrow.includes.libarrow_feather cimport *
from pyarrow.lib cimport (check_status, Table, _Weakrefable,
                          get_writer, get_reader, pyarrow_wrap_table)
from pyarrow.lib import tobytes
class FeatherError(Exception):
    pass
def write_feather(Table table, object dest, compression=None,
                  compression_level=None, chunksize=None, version=2):
    cdef shared_ptr[COutputStream] sink
    get_writer(dest, &sink)
    cdef CFeatherProperties properties
    if version == 2:
        properties.version = kFeatherV2Version
    else:
        properties.version = kFeatherV1Version
    if compression == 'zstd':
        properties.compression = CCompressionType_ZSTD
    elif compression == 'lz4':
        properties.compression = CCompressionType_LZ4_FRAME
    else:
        properties.compression = CCompressionType_UNCOMPRESSED
    if chunksize is not None:
        properties.chunksize = chunksize
    if compression_level is not None:
        properties.compression_level = compression_level
with nogil:
check_status(WriteFeather(deref(table.table), sink.get(),
properties))
cdef class FeatherReader(_Weakrefable):
cdef:
shared_ptr[CFeatherReader] reader
def __cinit__(self, source, c_bool use_memory_map, c_bool use_threads):
cdef:
shared_ptr[CRandomAccessFile] reader
CIpcReadOptions options = CIpcReadOptions.Defaults()
options.use_threads = use_threads
get_reader(source, use_memory_map, &reader)
with nogil:
self.reader = GetResultValue(CFeatherReader.Open(reader, options))

@property
def version(self):
return self.reader.get().version()
def read(self):
cdef shared_ptr[CTable] sp_table
with nogil:
check_status(self.reader.get()
.Read(&sp_table))
return pyarrow_wrap_table(sp_table)
def read_indices(self, indices):
cdef:
shared_ptr[CTable] sp_table
vector[int] c_indices
for index in indices:
c_indices.push_back(index)
with nogil:
check_status(self.reader.get()
.Read(c_indices, &sp_table))
return pyarrow_wrap_table(sp_table)
def read_names(self, names):
cdef:
shared_ptr[CTable] sp_table
vector[c_string] c_names
for name in names:
c_names.push_back(tobytes(name))
with nogil:
check_status(self.reader.get()
.Read(c_names, &sp_table))
return pyarrow_wrap_table(sp_table)