.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow.csv

.. _py-csv:

Reading and Writing CSV files
=============================

Arrow supports reading and writing columnar data from/to CSV files.
The features currently offered are the following:

* multi-threaded or single-threaded reading
* automatic decompression of input files (based on the filename extension,
  such as ``my_data.csv.gz``)
* fetching column names from the first row in the CSV file
* column-wise type inference and conversion to one of ``null``, ``int64``,
  ``float64``, ``date32``, ``time32[s]``, ``timestamp[s]``, ``timestamp[ns]``,
  ``duration`` (from numeric strings), ``string`` or ``binary`` data
* opportunistic dictionary encoding of ``string`` and ``binary`` columns
  (disabled by default)
* detecting various spellings of null values such as ``NaN`` or ``#N/A``
* writing CSV files with options to configure the exact output format

Usage
-----

CSV reading and writing functionality is available through the
:mod:`pyarrow.csv` module. In many cases, you will simply call the
:func:`read_csv` function with the file path you want to read from:

.. code-block:: python

   >>> from pyarrow import csv
   >>> import pyarrow as pa
   >>> import pandas as pd
   >>> fn = 'tips.csv.gz'  # doctest: +SKIP
   >>> table = csv.read_csv(fn)  # doctest: +SKIP
   >>> table  # doctest: +SKIP
   pyarrow.Table
   total_bill: double
   tip: double
   sex: string
   smoker: string
   day: string
   time: string
   size: int64
   >>> len(table)  # doctest: +SKIP
   244
   >>> df = table.to_pandas()  # doctest: +SKIP
   >>> df.head()  # doctest: +SKIP
      total_bill   tip     sex smoker  day    time  size
   0       16.99  1.01  Female     No  Sun  Dinner     2
   1       10.34  1.66    Male     No  Sun  Dinner     3
   2       21.01  3.50    Male     No  Sun  Dinner     3
   3       23.68  3.31    Male     No  Sun  Dinner     2
   4       24.59  3.61  Female     No  Sun  Dinner     4

To write CSV files, just call :func:`write_csv` with a
:class:`pyarrow.RecordBatch` or :class:`pyarrow.Table` and a path or
file-like object:

.. code-block:: python

   >>> table = pa.table({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})
   >>> csv.write_csv(table, "tips.csv")
   >>> with pa.CompressedOutputStream("tips.csv.gz", "gzip") as out:
   ...     csv.write_csv(table, out)

.. note:: The writer does not yet support all Arrow types.

Customized parsing
------------------

To alter the default parsing settings in case of reading CSV files with an
unusual structure, you should create a :class:`ParseOptions` instance
and pass it to :func:`read_csv`:

.. code-block:: python

   >>> def skip_handler(row):
   ...     return 'skip'
   >>> table = csv.read_csv('tips.csv.gz', parse_options=csv.ParseOptions(
   ...     delimiter=";",
   ...     invalid_row_handler=skip_handler
   ... ))
   >>> table
   pyarrow.Table
   col1,"col2": string
   ----
   col1,"col2": [["1,"a"","2,"b"","3,"c""]]

Available parsing options are:

.. autosummary::

   ~ParseOptions.delimiter
   ~ParseOptions.quote_char
   ~ParseOptions.double_quote
   ~ParseOptions.escape_char
   ~ParseOptions.newlines_in_values
   ~ParseOptions.ignore_empty_lines
   ~ParseOptions.invalid_row_handler

.. seealso::

   For more examples see :class:`ParseOptions`.

Customized conversion
---------------------

To alter how CSV data is converted to Arrow types and data, you should create
a :class:`ConvertOptions` instance and pass it to :func:`read_csv`:

.. code-block:: python

   >>> table = csv.read_csv('tips.csv.gz', convert_options=csv.ConvertOptions(
   ...     column_types={
   ...         'total_bill': pa.decimal128(precision=10, scale=2),
   ...         'tip': pa.decimal128(precision=10, scale=2),
   ...     }
   ... ))
   >>> table
   pyarrow.Table
   col1: int64
   col2: string
   ----
   col1: [[1,2,3]]
   col2: [["a","b","c"]]

.. note::

   To assign a column as ``duration``, the CSV values must be numeric strings
   that match the expected unit (e.g. ``60000`` for 60 seconds when
   using ``duration[ms]``).

Available convert options are:

.. autosummary::

   ~ConvertOptions.check_utf8
   ~ConvertOptions.column_types
   ~ConvertOptions.null_values
   ~ConvertOptions.true_values
   ~ConvertOptions.false_values
   ~ConvertOptions.decimal_point
   ~ConvertOptions.timestamp_parsers
   ~ConvertOptions.strings_can_be_null
   ~ConvertOptions.quoted_strings_can_be_null
   ~ConvertOptions.auto_dict_encode
   ~ConvertOptions.auto_dict_max_cardinality
   ~ConvertOptions.include_columns
   ~ConvertOptions.include_missing_columns

.. seealso::

   For more examples see :class:`ConvertOptions`.

Incremental reading
-------------------

For memory-constrained environments, it is also possible to read a CSV file
one batch at a time, using :func:`open_csv`.

There are a few caveats:

1. For now, the incremental reader is always single-threaded (regardless of
   :attr:`ReadOptions.use_threads`)

2. Type inference is done on the first block and types are frozen afterwards;
   to make sure the right data types are inferred, either set
   :attr:`ReadOptions.block_size` to a large enough value, or use
   :attr:`ConvertOptions.column_types` to set the desired data types explicitly.

Character encoding
------------------

By default, CSV files are expected to be encoded in UTF8. Non-UTF8 data
is accepted for ``binary`` columns. The encoding can be changed using
the :class:`ReadOptions` class:

.. code-block:: python

   >>> table = csv.read_csv('tips.csv.gz', read_options=csv.ReadOptions(
   ...     column_names=["n_legs", "entry"],
   ...     skip_rows=1
   ... ))
   >>> table
   pyarrow.Table
   n_legs: int64
   entry: string
   ----
   n_legs: [[1,2,3]]
   entry: [["a","b","c"]]

Available read options are:

.. autosummary::

   ~ReadOptions.use_threads
   ~ReadOptions.block_size
   ~ReadOptions.skip_rows
   ~ReadOptions.skip_rows_after_names
   ~ReadOptions.column_names
   ~ReadOptions.autogenerate_column_names
   ~ReadOptions.encoding

.. seealso::

   For more examples see :class:`ReadOptions`.

Customized writing
------------------

To alter the default write settings in case of writing CSV files with
different conventions, you can create a :class:`WriteOptions` instance and
pass it to :func:`write_csv`:

.. code-block:: python

   >>> # Omit the header row (include_header=True is the default)
   >>> options = csv.WriteOptions(include_header=False)
   >>> csv.write_csv(table, "data.csv", options)

Incremental writing
-------------------

To write CSV files one batch at a time, create a :class:`CSVWriter`. This
requires the output (a path or file-like object), the schema of the data to
be written, and optionally write options as described above:

.. code-block:: python

   >>> with csv.CSVWriter("data.csv", table.schema) as writer:
   ...     writer.write_table(table)

Performance
-----------

Due to the structure of CSV files, one cannot expect the same levels of
performance as when reading dedicated binary formats like
:ref:`Parquet <Parquet>`. Nevertheless, Arrow strives to reduce the
overhead of reading CSV files. A reasonable expectation is at least
100 MB/s per core on a performant desktop or laptop computer (measured
in source CSV bytes, not target Arrow data bytes).

Performance options can be controlled through the :class:`ReadOptions` class.
Multi-threaded reading is the default for highest performance, distributing
the workload efficiently over all available cores.

.. note::

   The number of concurrent threads is automatically inferred by Arrow.
   You can inspect and change it using the :func:`~pyarrow.cpu_count()`
   and :func:`~pyarrow.set_cpu_count()` functions, respectively.