# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# distutils: language = c++
# cython: language_level = 3

from libcpp cimport bool as c_bool
from libc.string cimport const_char
from libcpp.vector cimport vector as std_vector
from pyarrow.includes.common cimport *
from pyarrow.includes.libarrow cimport (CArray, CSchema, CStatus,
                                        CResult, CTable, CMemoryPool,
                                        CKeyValueMetadata,
                                        CRecordBatch,
                                        CTable, CCompressionType,
                                        CRandomAccessFile, COutputStream,
                                        TimeUnit)

cdef extern from "arrow/adapters/orc/options.h" \
        namespace "arrow::adapters::orc" nogil:
    cdef enum CompressionStrategy \
            " arrow::adapters::orc::CompressionStrategy":
        _CompressionStrategy_SPEED \
            " arrow::adapters::orc::CompressionStrategy::kSpeed"
        _CompressionStrategy_COMPRESSION \
            " arrow::adapters::orc::CompressionStrategy::kCompression"

    cdef enum WriterId" arrow::adapters::orc::WriterId":
        _WriterId_ORC_JAVA_WRITER" arrow::adapters::orc::WriterId::kOrcJava"
        _WriterId_ORC_CPP_WRITER" arrow::adapters::orc::WriterId::kOrcCpp"
        _WriterId_PRESTO_WRITER" arrow::adapters::orc::WriterId::kPresto"
        _WriterId_SCRITCHLEY_GO \
            " arrow::adapters::orc::WriterId::kScritchleyGo"
        _WriterId_TRINO_WRITER" arrow::adapters::orc::WriterId::kTrino"
        _WriterId_UNKNOWN_WRITER" arrow::adapters::orc::WriterId::kUnknown"

    cdef enum WriterVersion" arrow::adapters::orc::WriterVersion":
        _WriterVersion_ORIGINAL \
            " arrow::adapters::orc::WriterVersion::kOriginal"
        _WriterVersion_HIVE_8732 \
            " arrow::adapters::orc::WriterVersion::kHive8732"
        _WriterVersion_HIVE_4243 \
            " arrow::adapters::orc::WriterVersion::kHive4243"
        _WriterVersion_HIVE_12055 \
            " arrow::adapters::orc::WriterVersion::kHive12055"
        _WriterVersion_HIVE_13083 \
            " arrow::adapters::orc::WriterVersion::kHive13083"
        _WriterVersion_ORC_101" arrow::adapters::orc::WriterVersion::kOrc101"
        _WriterVersion_ORC_135" arrow::adapters::orc::WriterVersion::kOrc135"
        _WriterVersion_ORC_517" arrow::adapters::orc::WriterVersion::kOrc517"
        _WriterVersion_ORC_203" arrow::adapters::orc::WriterVersion::kOrc203"
        _WriterVersion_ORC_14" arrow::adapters::orc::WriterVersion::kOrc14"
        _WriterVersion_MAX" arrow::adapters::orc::WriterVersion::kMax"

    cdef cppclass FileVersion" arrow::adapters::orc::FileVersion":
        FileVersion(uint32_t major_version, uint32_t minor_version)
        uint32_t major_version()
        uint32_t minor_version()
        c_string ToString()

    cdef struct WriteOptions" arrow::adapters::orc::WriteOptions":
        int64_t batch_size
        FileVersion file_version
        int64_t stripe_size
        CCompressionType compression
        int64_t compression_block_size
        CompressionStrategy compression_strategy
        int64_t row_index_stride
        double padding_tolerance
        double dictionary_key_size_threshold
        std_vector[int64_t] bloom_filter_columns
        double bloom_filter_fpp

cdef extern from "arrow/adapters/orc/adapter.h" \
        namespace "arrow::adapters::orc" nogil:

    cdef cppclass ORCFileReader:
        @staticmethod
        CResult[unique_ptr[ORCFileReader]] Open(
            const shared_ptr[CRandomAccessFile]& file,
            CMemoryPool* pool)

        CResult[shared_ptr[const CKeyValueMetadata]] ReadMetadata()

        CResult[shared_ptr[CSchema]] ReadSchema()

        CResult[shared_ptr[CRecordBatch]] ReadStripe(int64_t stripe)
        CResult[shared_ptr[CRecordBatch]] ReadStripe(
            int64_t stripe, std_vector[c_string])

        CResult[shared_ptr[CTable]] Read()
        CResult[shared_ptr[CTable]] Read(std_vector[c_string])

        int64_t NumberOfStripes()
        int64_t NumberOfRows()
        FileVersion GetFileVersion()
        c_string GetSoftwareVersion()
        CResult[CCompressionType] GetCompression()
        int64_t GetCompressionSize()
        int64_t GetRowIndexStride()
        WriterId GetWriterId()
        int32_t GetWriterIdValue()
        WriterVersion GetWriterVersion()
        int64_t GetNumberOfStripeStatistics()
        int64_t GetContentLength()
        int64_t GetStripeStatisticsLength()
        int64_t GetFileFooterLength()
        int64_t GetFilePostscriptLength()
        int64_t GetFileLength()
        c_string GetSerializedFileTail()

    cdef cppclass ORCFileWriter:
        @staticmethod
        CResult[unique_ptr[ORCFileWriter]] Open(
            COutputStream* output_stream, const WriteOptions& writer_options)

        CStatus Write(const CTable& table)

        CStatus Close()