.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow.csv

.. _py-csv:

Reading and Writing CSV files
=============================

Arrow supports reading and writing columnar data from/to CSV files.
The features currently offered are the following:

* multi-threaded or single-threaded reading
* automatic decompression of input files (based on the filename extension,
  such as ``my_data.csv.gz``)
* fetching column names from the first row in the CSV file
* column-wise type inference and conversion to one of ``null``, ``int64``,
  ``float64``, ``date32``, ``time32[s]``, ``timestamp[s]``, ``timestamp[ns]``,
  ``duration`` (from numeric strings), ``string`` or ``binary`` data
* opportunistic dictionary encoding of ``string`` and ``binary`` columns
  (disabled by default)
* detecting various spellings of null values such as ``NaN`` or ``#N/A``
* writing CSV files with options to configure the exact output format

Usage
-----

CSV reading and writing functionality is available through the
:mod:`pyarrow.csv` module. In many cases, you will simply call the
:func:`read_csv` function with the file path you want to read from:

.. code-block:: python

   >>> from pyarrow import csv
   >>> import pyarrow as pa
   >>> import pandas as pd
   >>> fn = 'tips.csv.gz'  # doctest: +SKIP
   >>> table = csv.read_csv(fn)  # doctest: +SKIP
   >>> table  # doctest: +SKIP
   pyarrow.Table
   total_bill: double
   tip: double
   sex: string
   smoker: string
   day: string
   time: string
   size: int64
   >>> len(table)  # doctest: +SKIP
   244
   >>> df = table.to_pandas()  # doctest: +SKIP
   >>> df.head()  # doctest: +SKIP
      total_bill   tip     sex smoker  day    time  size
   0       16.99  1.01  Female     No  Sun  Dinner     2
   1       10.34  1.66    Male     No  Sun  Dinner     3
   2       21.01  3.50    Male     No  Sun  Dinner     3
   3       23.68  3.31    Male     No  Sun  Dinner     2
   4       24.59  3.61  Female     No  Sun  Dinner     4

To write CSV files, just call :func:`write_csv` with a
:class:`pyarrow.RecordBatch` or :class:`pyarrow.Table` and a path or
file-like object:

.. code-block:: python

   >>> table = pa.table({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})
   >>> csv.write_csv(table, "tips.csv")
   >>> with pa.CompressedOutputStream("tips.csv.gz", "gzip") as out:
   ...     csv.write_csv(table, out)

.. note:: The writer does not yet support all Arrow types.

Customized parsing
------------------

To alter the default parsing settings in case of reading CSV files with an
unusual structure, you should create a :class:`ParseOptions` instance
and pass it to :func:`read_csv`:

.. code-block:: python

   >>> def skip_handler(row):
   ...     return 'skip'
   >>> table = csv.read_csv('tips.csv.gz', parse_options=csv.ParseOptions(
   ...     delimiter=";",
   ...     invalid_row_handler=skip_handler
   ... ))
   >>> table
   pyarrow.Table
   col1,"col2": string
   ----
   col1,"col2": [["1,"a"","2,"b"","3,"c""]]

Available parsing options are:

.. autosummary::

   ~ParseOptions.delimiter
   ~ParseOptions.quote_char
   ~ParseOptions.double_quote
   ~ParseOptions.escape_char
   ~ParseOptions.newlines_in_values
   ~ParseOptions.ignore_empty_lines
   ~ParseOptions.invalid_row_handler

.. seealso::

   For more examples see :class:`ParseOptions`.

Customized conversion
---------------------

To alter how CSV data is converted to Arrow types and data, you should create
a :class:`ConvertOptions` instance and pass it to :func:`read_csv`:

.. code-block:: python

   >>> table = csv.read_csv('tips.csv.gz', convert_options=csv.ConvertOptions(
   ...     column_types={
   ...         'total_bill': pa.decimal128(precision=10, scale=2),
   ...         'tip': pa.decimal128(precision=10, scale=2),
   ...     }
   ... ))
   >>> table
   pyarrow.Table
   col1: int64
   col2: string
   ----
   col1: [[1,2,3]]
   col2: [["a","b","c"]]

.. note::

   To assign a column as ``duration``, the CSV values must be numeric strings
   that match the expected unit (e.g. ``60000`` for 60 seconds when
   using ``duration[ms]``).

Available convert options are:

.. autosummary::

   ~ConvertOptions.check_utf8
   ~ConvertOptions.column_types
   ~ConvertOptions.null_values
   ~ConvertOptions.true_values
   ~ConvertOptions.false_values
   ~ConvertOptions.decimal_point
   ~ConvertOptions.timestamp_parsers
   ~ConvertOptions.strings_can_be_null
   ~ConvertOptions.quoted_strings_can_be_null
   ~ConvertOptions.auto_dict_encode
   ~ConvertOptions.auto_dict_max_cardinality
   ~ConvertOptions.include_columns
   ~ConvertOptions.include_missing_columns

.. seealso::

   For more examples see :class:`ConvertOptions`.

Incremental reading
-------------------

For memory-constrained environments, it is also possible to read a CSV file
one batch at a time, using :func:`open_csv`.

There are a few caveats:

1. For now, the incremental reader is always single-threaded (regardless of
   :attr:`ReadOptions.use_threads`)

2. Type inference is done on the first block and types are frozen afterwards;
   to make sure the right data types are inferred, either set
   :attr:`ReadOptions.block_size` to a large enough value, or use
   :attr:`ConvertOptions.column_types` to set the desired data types explicitly.

Character encoding
------------------

By default, CSV files are expected to be encoded in UTF8. Non-UTF8 data
is accepted for ``binary`` columns. The encoding can be changed using
the :class:`ReadOptions` class:

.. code-block:: python

   >>> table = csv.read_csv('tips.csv.gz', read_options=csv.ReadOptions(
   ...     column_names=["n_legs", "entry"],
   ...     skip_rows=1
   ... ))
   >>> table
   pyarrow.Table
   n_legs: int64
   entry: string
   ----
   n_legs: [[1,2,3]]
   entry: [["a","b","c"]]

Available read options are:

.. autosummary::

   ~ReadOptions.use_threads
   ~ReadOptions.block_size
   ~ReadOptions.skip_rows
   ~ReadOptions.skip_rows_after_names
   ~ReadOptions.column_names
   ~ReadOptions.autogenerate_column_names
   ~ReadOptions.encoding

.. seealso::

   For more examples see :class:`ReadOptions`.

Customized writing
------------------

To alter the default write settings in case of writing CSV files with
different conventions, you can create a :class:`WriteOptions` instance and
pass it to :func:`write_csv`:

.. code-block:: python

   >>> # Omit the header row (include_header=True is the default)
   >>> options = csv.WriteOptions(include_header=False)
   >>> csv.write_csv(table, "data.csv", options)

Incremental writing
-------------------

To write CSV files one batch at a time, create a :class:`CSVWriter`. This
requires the output (a path or file-like object), the schema of the data to
be written, and optionally write options as described above:

.. code-block:: python

   >>> with csv.CSVWriter("data.csv", table.schema) as writer:
   ...     writer.write_table(table)

Performance
-----------

Due to the structure of CSV files, one cannot expect the same levels of
performance as when reading dedicated binary formats like
:ref:`Parquet <Parquet>`. Nevertheless, Arrow strives to reduce the
overhead of reading CSV files. A reasonable expectation is at least
100 MB/s per core on a performant desktop or laptop computer (measured
in source CSV bytes, not target Arrow data bytes).

Performance options can be controlled through the :class:`ReadOptions` class.
Multi-threaded reading is the default for highest performance, distributing
the workload efficiently over all available cores.

.. note::

   The number of concurrent threads is automatically inferred by Arrow.
   You can inspect and change it using the :func:`~pyarrow.cpu_count()`
   and :func:`~pyarrow.set_cpu_count()` functions, respectively.