.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow

.. _orc:

Reading and Writing the Apache ORC Format
=========================================

The `Apache ORC <http://orc.apache.org/>`_ project provides a
standardized open-source columnar storage format for use in data analysis
systems. It was originally created for use in `Apache Hadoop
<http://hadoop.apache.org/>`_, with systems like `Apache Drill
<http://drill.apache.org>`_, `Apache Hive <http://hive.apache.org>`_, `Apache
Impala <http://impala.apache.org>`_, and `Apache Spark
<http://spark.apache.org>`_ adopting it as a shared standard for
high-performance data IO.

Apache Arrow is an ideal in-memory representation layer for data that is
being read or written with ORC files.

Obtaining pyarrow with ORC Support
----------------------------------

If you installed ``pyarrow`` with pip or conda, it should be built with ORC
support bundled:

.. code-block:: python

    >>> from pyarrow import orc

If you are building ``pyarrow`` from source, you must use
``-DARROW_ORC=ON`` when compiling the C++ libraries and enable the ORC
extensions when building ``pyarrow``. See the :ref:`Python Development
<python-development>` page for more details.

Reading and Writing Single Files
--------------------------------

The functions :func:`~.orc.read_table` and :func:`~.orc.write_table`
read and write the :ref:`pyarrow.Table <data.table>` object, respectively.

Let's look at a simple table:

.. code-block:: python

    >>> import numpy as np
    >>> import pyarrow as pa

    >>> table = pa.table(
    ...     {
    ...         'one': [-1, np.nan, 2.5],
    ...         'two': ['foo', 'bar', 'baz'],
    ...         'three': [True, False, True]
    ...     }
    ... )

We write this to ORC format with ``write_table``:

.. code-block:: python

    >>> from pyarrow import orc
    >>> orc.write_table(table, 'example.orc')

This creates a single ORC file. In practice, an ORC dataset may consist
of many files in many directories. We can read a single file back with
``read_table``:

.. code-block:: python

    >>> table2 = orc.read_table('example.orc')

You can pass a subset of columns to read, which can be much faster than reading
the whole file (due to the columnar layout):

.. code-block:: python

    >>> orc.read_table('example.orc', columns=['one', 'three'])
    pyarrow.Table
    one: double
    three: bool
    ----
    one: [[-1,nan,2.5]]
    three: [[true,false,true]]

We need not use a string to specify the origin of the file. It can be any of:

* A file path as a string
* A Python file object
* A pathlib.Path object
* A :ref:`NativeFile <io.native_file>` from PyArrow

In general, a Python file object will have the worst read performance, while a
string file path or an instance of :class:`~.NativeFile` (especially memory
maps) will perform the best.

We can also read partitioned datasets with multiple ORC files through the
:mod:`pyarrow.dataset <dataset>` interface.

.. seealso::
    :ref:`Documentation for datasets <dataset>`.

ORC file writing options
~~~~~~~~~~~~~~~~~~~~~~~~

:func:`~pyarrow.orc.write_table()` has a number of options to
control various settings when writing an ORC file.

* ``file_version``, the ORC format version to use. ``'0.11'`` ensures
  compatibility with older readers, while ``'0.12'`` is the newer one.
* ``stripe_size``, to control the approximate size of data within a column
  stripe. This currently defaults to 64MB.

See the :func:`~pyarrow.orc.write_table()` docstring for more details.

Finer-grained Reading and Writing
---------------------------------

``read_table`` uses the :class:`~.ORCFile` class, which has other features:

.. code-block:: python

    >>> orc_file = orc.ORCFile('example.orc')
    >>> orc_file.metadata
    <BLANKLINE>
    -- metadata --
    >>> orc_file.schema
    one: double
    two: string
    three: bool
    >>> orc_file.nrows
    3

See the :class:`~pyarrow.orc.ORCFile` docstring for more details.

As described in the `Apache ORC format specification
<https://orc.apache.org/specification/>`_, an ORC file consists of
multiple stripes. ``read_table`` will read all of the stripes and
concatenate them into a single table. You can read individual stripes with
``read_stripe``:

.. code-block:: python

    >>> orc_file.nstripes
    1
    >>> orc_file.read_stripe(0)
    pyarrow.RecordBatch
    one: double
    two: string
    three: bool
    ----
    one: [-1,nan,2.5]
    two: ["foo","bar","baz"]
    three: [true,false,true]

We can write an ORC file using ``ORCWriter``:

.. code-block:: python

    >>> with orc.ORCWriter('example2.orc') as writer:
    ...     writer.write(table)

Compression
-----------

The data within a column stripe can be compressed after the encoding
passes (dictionary, RLE encoding). In PyArrow we don't use compression
by default, but Snappy, ZSTD, Zlib, and LZ4 are also supported:

.. code-block:: python

    >>> orc.write_table(table, 'example.orc', compression='uncompressed')
    >>> orc.write_table(table, 'example.orc', compression='zlib')
    >>> orc.write_table(table, 'example.orc', compression='zstd')
    >>> orc.write_table(table, 'example.orc', compression='snappy')

Snappy generally results in better performance, while Zlib may yield smaller
files.

Reading from cloud storage
--------------------------

In addition to local files, pyarrow supports other filesystems, such as cloud
filesystems, through the ``filesystem`` keyword:

.. code-block:: python

    >>> from pyarrow import fs

    >>> s3 = fs.S3FileSystem(region="us-east-2")  # doctest: +SKIP
    >>> table = orc.read_table("bucket/object/key/prefix", filesystem=s3)  # doctest: +SKIP

.. seealso::
    :ref:`Documentation for filesystems <filesystem>`.