.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. currentmodule:: pyarrow.json
.. _json:

Reading JSON files
==================

Arrow supports reading columnar data from line-delimited JSON files.
In this context, a JSON file consists of multiple JSON objects, one per line,
representing individual data rows. For example, this file represents
two rows of data with four columns "a", "b", "c", "d":

.. code-block:: json

   {"a": 1, "b": 2.0, "c": "foo", "d": false}
   {"a": 4, "b": -5.5, "c": null, "d": true}

The features currently offered are the following:

* multi-threaded or single-threaded reading
* automatic decompression of input files (based on the filename extension,
  such as ``my_data.json.gz``)
* sophisticated type inference (see below)

.. note::
   Currently only the line-delimited JSON format is supported.

Usage
-----

JSON reading functionality is available through the :mod:`pyarrow.json` module.
In many cases, you will simply call the :func:`read_json` function
with the file path you want to read from:

.. code-block:: python

   >>> from pyarrow import json
   >>> fn = 'my_data.json'  # doctest: +SKIP
   >>> table = json.read_json(fn)  # doctest: +SKIP
   >>> table  # doctest: +SKIP
   pyarrow.Table
   a: int64
   b: double
   c: string
   d: bool
   >>> table.to_pandas()  # doctest: +SKIP
      a    b     c      d
   0  1  2.0   foo  False
   1  4 -5.5  None   True

Automatic Type Inference
------------------------

Arrow :ref:`data types <data.types>` are inferred from the JSON types and
values of each column:

* JSON null values convert to the ``null`` type, but can fall back to any
  other type.
* JSON booleans convert to ``bool_``.
* JSON numbers convert to ``int64``, falling back to ``float64`` if a
  non-integer is encountered.
* JSON strings of the kind "YYYY-MM-DD" and "YYYY-MM-DD hh:mm:ss" convert
  to ``timestamp[s]``, falling back to ``utf8`` if a conversion error occurs.
* JSON arrays convert to a ``list`` type, and inference proceeds recursively
  on the JSON arrays' values.
* Nested JSON objects convert to a ``struct`` type, and inference proceeds
  recursively on the JSON objects' values.

Thus, reading this JSON file:

.. code-block:: json

   {"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
   {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}

returns the following data:

.. code-block:: python

   >>> table = json.read_json("my_data.json")  # doctest: +SKIP
   >>> table  # doctest: +SKIP
   pyarrow.Table
   a: list<item: int64>
     child 0, item: int64
   b: struct<c: bool, d: timestamp[s]>
     child 0, c: bool
     child 1, d: timestamp[s]
   >>> table.to_pandas()  # doctest: +SKIP
              a                                       b
   0     [1, 2]   {'c': True, 'd': 1991-02-03 00:00:00}
   1  [3, 4, 5]  {'c': False, 'd': 2019-04-01 00:00:00}

Customized parsing
------------------

To change the default parsing settings, for example when reading JSON files
with an unusual structure, create a :class:`ParseOptions` instance
and pass it to :func:`read_json`. For example, you can pass an explicit
:ref:`schema <data.schema>` in order to bypass automatic type inference.

Similarly, you can choose performance settings by passing a
:class:`ReadOptions` instance to :func:`read_json`.

Incremental reading
-------------------

For memory-constrained environments, it is also possible to read a JSON file
one batch at a time, using :func:`open_json`.

In this case, type inference is done on the first block and types are frozen afterwards.
To make sure the right data types are inferred, either set
:attr:`ReadOptions.block_size` to a large enough value, or use
:attr:`ParseOptions.explicit_schema` to set the desired data types explicitly.
