# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# ---------------------------------------------------------------------
# Implement Feather file format

# cython: profile=False
# distutils: language = c++
# cython: language_level=3

from cython.operator cimport dereference as deref

from pyarrow.includes.common cimport *
from pyarrow.includes.libarrow cimport *
from pyarrow.includes.libarrow_feather cimport *
from pyarrow.lib cimport (check_status, Table, _Weakrefable,
                          get_writer, get_reader, pyarrow_wrap_table)
from pyarrow.lib import tobytes


class FeatherError(Exception):
    pass


def write_feather(Table table, object dest, compression=None,
                  compression_level=None, chunksize=None, version=2):
    cdef shared_ptr[COutputStream] sink
    get_writer(dest, &sink)

    # V2 is the Arrow IPC file format; V1 is the legacy Feather format
    cdef CFeatherProperties properties
    if version == 2:
        properties.version = kFeatherV2Version
    else:
        properties.version = kFeatherV1Version

    # LZ4 (frame format) and ZSTD are the only supported codecs;
    # anything else is written uncompressed
    if compression == 'zstd':
        properties.compression = CCompressionType_ZSTD
    elif compression == 'lz4':
        properties.compression = CCompressionType_LZ4_FRAME
    else:
        properties.compression = CCompressionType_UNCOMPRESSED

    if chunksize is not None:
        properties.chunksize = chunksize

    if compression_level is not None:
        properties.compression_level = compression_level

    with nogil:
        check_status(WriteFeather(deref(table.table), sink.get(),
                                  properties))
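A minimal, stdlib-only sketch (not part of this module; the helper name is illustrative) of how the two Feather versions written above are told apart on read: V1 files begin with the magic bytes `b"FEA1"`, while V2 files are Arrow IPC files beginning with `b"ARROW1"`, and readers dispatch on those leading bytes.

```python
def guess_feather_version(header: bytes):
    """Return 2 for an Arrow IPC (Feather V2) file, 1 for a legacy V1
    file, or None if the leading bytes match neither format."""
    if header.startswith(b"ARROW1"):
        return 2
    if header.startswith(b"FEA1"):
        return 1
    return None
```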


cdef class FeatherReader(_Weakrefable):
    cdef:
        shared_ptr[CFeatherReader] reader

    def __cinit__(self, source, c_bool use_memory_map, c_bool use_threads):
        cdef:
            shared_ptr[CRandomAccessFile] reader
            CIpcReadOptions options = CIpcReadOptions.Defaults()
        options.use_threads = use_threads

        get_reader(source, use_memory_map, &reader)
        with nogil:
            self.reader = GetResultValue(CFeatherReader.Open(reader, options))

    @property
    def version(self):
        return self.reader.get().version()

    def read(self):
        cdef shared_ptr[CTable] sp_table
        with nogil:
            check_status(self.reader.get()
                         .Read(&sp_table))
        return pyarrow_wrap_table(sp_table)

    def read_indices(self, indices):
        cdef:
            shared_ptr[CTable] sp_table
            vector[int] c_indices

        for index in indices:
            c_indices.push_back(index)
        with nogil:
            check_status(self.reader.get()
                         .Read(c_indices, &sp_table))
        return pyarrow_wrap_table(sp_table)

    def read_names(self, names):
        cdef:
            shared_ptr[CTable] sp_table
            vector[c_string] c_names

        for name in names:
            c_names.push_back(tobytes(name))
        with nogil:
            check_status(self.reader.get()
                         .Read(c_names, &sp_table))
        return pyarrow_wrap_table(sp_table)
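A stdlib-only sketch (illustrative only, not pyarrow API) of the column-selection semantics that `read_indices` and `read_names` expose, using a plain dict of columns as a stand-in for a table:

```python
def select_by_indices(columns, indices):
    """Keep only the columns at the given positions, preserving the
    order in which the table stores them."""
    names = list(columns)
    return {names[i]: columns[names[i]] for i in indices}


def select_by_names(columns, names):
    """Keep only the named columns."""
    return {name: columns[name] for name in names}
```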