.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. _getstarted:

Getting Started
===============

Arrow manages data in arrays (:class:`pyarrow.Array`), which can be
grouped in tables (:class:`pyarrow.Table`) to represent columns of
tabular data.

Arrow also supports various formats for getting tabular data in and
out of disk and networks. The most commonly used formats are
Parquet (:ref:`parquet`) and the IPC format (:ref:`ipc`).

Creating Arrays and Tables
--------------------------

Arrays in Arrow are collections of data of uniform type. That allows
Arrow to use the best-performing implementation to store the data and
perform computations on it. So each array is meant to have both data
and a type

.. code-block:: python

    >>> import pyarrow as pa
    >>> days = pa.array([1, 12, 17, 23, 28], type=pa.int8())

Multiple arrays can be combined in tables to form the columns
of tabular data when attached to a column name

.. code-block:: python

    >>> months = pa.array([1, 3, 5, 7, 1], type=pa.int8())
    >>> years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int16())
    >>> birthdays_table = pa.table([days, months, years],
    ...                            names=["days", "months", "years"])
    >>> birthdays_table
    pyarrow.Table
    days: int8
    months: int8
    years: int16
    ----
    days: [[1,12,17,23,28]]
    months: [[1,3,5,7,1]]
    years: [[1990,2000,1995,2000,1995]]

See :ref:`data` for more details.

Saving and Loading Tables
-------------------------

Once you have tabular data, Arrow provides out-of-the-box
features to save and restore that data in common formats
like Parquet:

.. code-block:: python

    >>> import pyarrow.parquet as pq
    >>> pq.write_table(birthdays_table, 'birthdays.parquet')

Once you have your data on disk, loading it back is a single function call,
and Arrow is heavily optimized for memory and speed, so loading
data will be as quick as possible

.. code-block:: python

    >>> reloaded_birthdays = pq.read_table('birthdays.parquet')
    >>> reloaded_birthdays
    pyarrow.Table
    days: int8
    months: int8
    years: int16
    ----
    days: [[1,12,17,23,28]]
    months: [[1,3,5,7,1]]
    years: [[1990,2000,1995,2000,1995]]

Saving and loading back data in Arrow is usually done through the
:ref:`Parquet <parquet>`, :ref:`IPC <ipc>` (:ref:`feather`),
:ref:`CSV <py-csv>` or :ref:`Line-Delimited JSON <json>` formats.

Performing Computations
-----------------------

Arrow ships with a set of compute functions that can be applied
to its arrays and tables, so through the compute functions
it's possible to apply transformations to the data

.. code-block:: python

    >>> import pyarrow.compute as pc
    >>> pc.value_counts(birthdays_table["years"])
    <pyarrow.lib.StructArray object at ...>
    -- is_valid: all not null
    -- child 0 type: int16
      [
        1990,
        2000,
        1995
      ]
    -- child 1 type: int64
      [
        1,
        2,
        2
      ]

See :ref:`compute` for a list of available compute functions and
how to use them.

Working with large data
-----------------------

Arrow also provides the :class:`pyarrow.dataset` API to work with
large data, which handles partitioning your data into
smaller chunks for you

.. code-block:: python

    >>> import pyarrow.dataset as ds
    >>> ds.write_dataset(birthdays_table, "savedir", format="parquet",
    ...                  partitioning=ds.partitioning(
    ...                      pa.schema([birthdays_table.schema.field("years")])
    ...                  ))

Loading back the partitioned dataset will detect the chunks

.. code-block:: python

    >>> birthdays_dataset = ds.dataset("savedir", format="parquet", partitioning=["years"])
    >>> birthdays_dataset.files
    ['savedir/1990/part-0.parquet', 'savedir/1995/part-0.parquet', 'savedir/2000/part-0.parquet']

and will lazily load chunks of data only when iterating over them

.. code-block:: python

    >>> current_year = 2025
    >>> for table_chunk in birthdays_dataset.to_batches():
    ...     print("AGES", pc.subtract(current_year, table_chunk["years"]))
    AGES [
      35
    ]
    AGES [
      30,
      30
    ]
    AGES [
      25,
      25
    ]

For further details on how to work with big datasets, how to filter them,
how to project them, etc., refer to the :ref:`dataset` documentation.

Continuing from here
--------------------

For digging further into Arrow, you might want to read the
:doc:`PyArrow Documentation <./index>` itself or the
`Arrow Python Cookbook <https://arrow.apache.org/cookbook/py/>`_.