# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

from cpython.pycapsule cimport PyCapsule_CheckExact, PyCapsule_GetPointer, PyCapsule_New
from collections.abc import Sequence
import os
import warnings
from cython import sizeof


cdef extern from "<variant>" namespace "std":
    c_bool holds_alternative[T](...)
    T get[T](...)


cdef _sequence_to_array(object sequence, object mask, object size,
                        DataType type, CMemoryPool* pool, c_bool from_pandas):
    cdef:
        int64_t c_size
        PyConversionOptions options
        shared_ptr[CChunkedArray] chunked

    if type is not None:
        options.type = type.sp_type

    if size is not None:
        options.size = size

    options.from_pandas = from_pandas
    options.ignore_timezone = os.environ.get('PYARROW_IGNORE_TIMEZONE', False)

    with nogil:
        chunked = GetResultValue(
            ConvertPySequence(sequence, mask, options, pool)
        )

    if chunked.get().num_chunks() == 1:
        return pyarrow_wrap_array(chunked.get().chunk(0))
    else:
        return pyarrow_wrap_chunked_array(chunked)


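# Illustrative usage (not part of this module): _sequence_to_array backs the
# public pyarrow.array() entry point for plain Python sequences, which may
# return a ChunkedArray when a single chunk would overflow. A hedged sketch,
# assuming a built pyarrow:
#
#   >>> import pyarrow as pa
#   >>> arr = pa.array([1, 2, None])       # None becomes a null
#   >>> arr.null_count
#   1
#   >>> pa.array([1, 2, 3, 4], type='i1')  # NumPy-like type aliases accepted
#   <pyarrow.lib.Int8Array object at ...>
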
cdef inline _is_array_like(obj):
    if np is None:
        return False
    if isinstance(obj, np.ndarray):
        return True
    return pandas_api._have_pandas_internal() and pandas_api.is_array_like(obj)


def _ndarray_to_arrow_type(object values, DataType type):
    return pyarrow_wrap_data_type(_ndarray_to_type(values, type))


cdef shared_ptr[CDataType] _ndarray_to_type(object values,
                                            DataType type) except *:
    cdef shared_ptr[CDataType] c_type

    dtype = values.dtype

    if type is None and dtype != object:
        c_type = GetResultValue(NumPyDtypeToArrow(dtype))

    if type is not None:
        c_type = type.sp_type

    return c_type


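# Type resolution sketch (illustrative): _ndarray_to_type infers an Arrow type
# from the NumPy dtype unless an explicit ``type=`` is given, which takes
# precedence. Assuming pyarrow and numpy are available:
#
#   >>> import numpy as np, pyarrow as pa
#   >>> pa.array(np.array([1.0, 2.0])).type            # inferred from dtype
#   DataType(double)
#   >>> pa.array(np.array([1.0, 2.0]), type=pa.float32()).type
#   DataType(float)
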
cdef _ndarray_to_array(object values, object mask, DataType type,
                       c_bool from_pandas, c_bool safe, CMemoryPool* pool):
    cdef:
        shared_ptr[CChunkedArray] chunked_out
        shared_ptr[CDataType] c_type = _ndarray_to_type(values, type)
        CCastOptions cast_options = CCastOptions(safe)

    with nogil:
|
ARROW-2814: [Python] Unify conversion paths for sequences of Python objects
Key points
* All object sequences, including NumPy arrays of objects are being converted in builtin_convert.cc
* pyarrow.array can now yield chunked output from normal Python input. Before, we could overflow a BinaryBuilder with no recourse
* Eliminated virtual calls from the inner hot path
* Eliminated some code duplication in builtin_convert.cc
* Special-cased mask handling, so masks (`mask=...` in `pyarrow.array`) also work with plain Python sequence now instead of only NumPy arrays
* Centralized null checking to a single code path, with a compile-time switch between pandas-style and non-pandas null-checking
Some issues I ran into:
* We have tests that make the somewhat heavy-handed promotion of small NumPy scalars to int64 or uint64. I have added more rigid "type unification" for dtypes, so that now a sequence of int8 scalars will yield int8 result
* We were implicitly casting integers to double without checking whether the integers are representable as doubles. I think implicit casting is OK (e.g. `pa.array([1.5, 1, None])`) but we should validate that we can't discarding information
There are some other problems that need fixing still / inconsistencies from the two code paths or follow-up issues. I have created a number of follow up JIRAs and added a number of new unit tests
Author: Wes McKinney <wesm+git@apache.org>
Closes #2366 from wesm/ARROW-2814 and squashes the following commits:
9d15551c <Wes McKinney> Address further code review comments
a7a8c3ce <Wes McKinney> Check in new source files
d7760cef <Wes McKinney> Address @pitrou code review comments
3f56c300 <Wes McKinney> Exclude python/iterators.h from C++/CLI lint checks
d1687720 <Wes McKinney> Fix some more things
df136064 <Wes McKinney> Miscellaneous micro-optimizations
07ff8094 <Wes McKinney> Bump versions in asv.conf.json
9efb097e <Wes McKinney> Add more unit tests, sand rough edges. Add boundschecking for integer coercion with float32
e0c9b9ce <Wes McKinney> Delete casting cruft
a13bcaf1 <Wes McKinney> Fix rest of unit tests
2b3815f3 <Wes McKinney> Loose and string utf8 type conversions
a04bcdc2 <Wes McKinney> Fix more unit tests, disallow non-boolean mask
688b8298 <Wes McKinney> Implement NumPy dtype unifier helper class. Some more cleanup
d9d0822e <Wes McKinney> Add NumPy concrete type checking logic
d3d97eaf <Wes McKinney> Fix NumPy float scalar casting issue
f3b3e2f9 <Wes McKinney> Code fully compiles again
e8e5964c <Wes McKinney> First pass cleaning up ListConverter
4424c62c <Wes McKinney> Remove comments
c5ca7a42 <Wes McKinney> More refactoring, cleaning up old code. Add lambda version of VIsitTypeInline
1c714d35 <Wes McKinney> Delete some ConvertLists code
b4fdea0c <Wes McKinney> Refactoring, add VisitSequenceMasked
d75adaf2 <Wes McKinney> More refactoring
72de8a3d <Wes McKinney> Templatize more, less code duplication
72e6574e <Wes McKinney> Do not make virtual AppendSingle/AppendMultiple calls for non-nested SeqConverter
1c338204 <Wes McKinney> Move over NumPyConverter code, small refactorings. Now very broken
58db0964 <Wes McKinney> Fix buglets and mixing dicts/scalars raises TypeError for now
c5428d5b <Wes McKinney> Consolidate to a single ConvertPySequence entry point
79cd77e9 <Wes McKinney> Add short circuit option, some small refactoring
2018-08-09 13:31:10 -04:00
|
|
|
check_status(NdarrayToArrow(pool, values, mask, from_pandas,
|
2018-09-04 08:36:29 +02:00
|
|
|
c_type, cast_options, &chunked_out))
|
ARROW-838: [Python] Expand pyarrow.array to handle NumPy arrays not originating in pandas
This unifies the ingest path for 1D data into `pyarrow.array`. I added the argument `from_pandas` to turn null sentinel checking on or off:
```
In [8]: arr = np.random.randn(10000000)
In [9]: arr[::3] = np.nan
In [10]: arr2 = pa.array(arr)
In [11]: arr2.null_count
Out[11]: 0
In [12]: %timeit arr2 = pa.array(arr)
The slowest run took 5.43 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 68.4 µs per loop
In [13]: arr2 = pa.array(arr, from_pandas=True)
In [14]: arr2.null_count
Out[14]: 3333334
In [15]: %timeit arr2 = pa.array(arr, from_pandas=True)
1 loop, best of 3: 228 ms per loop
```
When the data is contiguous, it is always zero-copy, but then `from_pandas=True` and no null mask is passed, then a null bitmap is constructed and populated.
This also permits sequence reads into integers smaller than int64:
```
In [17]: pa.array([1, 2, 3, 4], type='i1')
Out[17]:
<pyarrow.lib.Int8Array object at 0x7ffa1c1c65e8>
[
1,
2,
3,
4
]
```
Oh, I also added NumPy-like string type aliases:
```
In [18]: pa.int32() == 'i4'
Out[18]: True
```
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes #1146 from wesm/expand-py-array-method and squashes the following commits:
1570e525 [Wes McKinney] Code review comments
d3bbb3c3 [Wes McKinney] Handle type aliases in cast, too
797f0151 [Wes McKinney] Allow null checking to be skipped with from_pandas=False in pyarrow.array
f2802fc7 [Wes McKinney] Cleaner codepath for numpy->arrow conversions
587c575a [Wes McKinney] Add direct types sequence converters for more data types
cf40b767 [Wes McKinney] Add type aliases, some unit tests
7b530e4b [Wes McKinney] Consolidate both sequence and ndarray/Series/Index conversion in pyarrow.Array
2017-09-29 23:02:58 -05:00
|
|
|
|
|
|
|
|
if chunked_out.get().num_chunks() > 1:
|
|
|
|
|
return pyarrow_wrap_chunked_array(chunked_out)
|
|
|
|
|
else:
|
|
|
|
|
return pyarrow_wrap_array(chunked_out.get().chunk(0))
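The unwrap rule above (hand back a plain Array when conversion produced a single chunk, keep a ChunkedArray only on overflow into multiple chunks) can be sketched in plain Python. `FakeChunkedArray` and `unwrap` are hypothetical stand-ins for illustration, not pyarrow types:

```python
class FakeChunkedArray:
    # Hypothetical stand-in for pyarrow's ChunkedArray, holding plain lists.
    def __init__(self, chunks):
        self._chunks = list(chunks)

    def num_chunks(self):
        return len(self._chunks)

    def chunk(self, i):
        return self._chunks[i]


def unwrap(chunked):
    # Keep the chunked wrapper only when conversion actually overflowed
    # into multiple chunks; otherwise hand back the single chunk.
    if chunked.num_chunks() > 1:
        return chunked
    return chunked.chunk(0)


assert unwrap(FakeChunkedArray([[1, 2, 3]])) == [1, 2, 3]
assert unwrap(FakeChunkedArray([[1], [2]])).num_chunks() == 2
```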
cdef _codes_to_indices(object codes, object mask, DataType type,
                       MemoryPool memory_pool):
    """
    Convert the codes of a pandas Categorical to indices for a pyarrow
    DictionaryArray, taking into account missing values + mask
    """
    if mask is None:
        mask = codes == -1
    else:
        mask = mask | (codes == -1)
    return array(codes, mask=mask, type=type, memory_pool=memory_pool)
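In a pandas Categorical, a code of -1 is the sentinel for a missing value; the helper above ORs that sentinel into any user-supplied mask. The same combination can be sketched with plain lists (no numpy/pyarrow; `combine_categorical_mask` is an illustrative name, not an existing function):

```python
def combine_categorical_mask(codes, mask=None):
    # Codes of -1 are pandas' sentinel for missing categorical values.
    sentinel = [c == -1 for c in codes]
    if mask is None:
        return sentinel
    # Element-wise OR, mirroring `mask | (codes == -1)` on ndarrays.
    return [m or s for m, s in zip(mask, sentinel)]


assert combine_categorical_mask([0, -1, 2]) == [False, True, False]
assert combine_categorical_mask([0, -1, 2], [True, False, False]) == \
    [True, True, False]
```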
def _handle_arrow_array_protocol(obj, type, mask, size):
    if mask is not None or size is not None:
        raise ValueError(
            "Cannot specify a mask or a size when passing an object that is "
            "converted with the __arrow_array__ protocol.")
    res = obj.__arrow_array__(type=type)
    if not isinstance(res, (Array, ChunkedArray)):
        raise TypeError("The object's __arrow_array__ method does not "
                        "return a pyarrow Array or ChunkedArray.")
    if isinstance(res, ChunkedArray) and res.num_chunks == 1:
        res = res.chunk(0)
    if type is not None and res.type != type:
        res = res.cast(type)
    return res
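The dispatch above can be exercised with a toy implementation of the `__arrow_array__` protocol. `StubArray`, `WrappedColumn`, and `handle_protocol` below are hypothetical stand-ins that mirror the checks in `_handle_arrow_array_protocol`, not pyarrow types:

```python
class StubArray:
    # Hypothetical stand-in for pyarrow.Array, carrying only a type tag.
    def __init__(self, type=None):
        self.type = type


class WrappedColumn:
    # A user object opting into conversion via the __arrow_array__ protocol.
    def __init__(self, values):
        self.values = values

    def __arrow_array__(self, type=None):
        return StubArray(type=type)


def handle_protocol(obj, type=None, mask=None, size=None):
    # Mirrors _handle_arrow_array_protocol: mask/size are rejected because
    # the object converts itself, so those options cannot be honored.
    if mask is not None or size is not None:
        raise ValueError("Cannot specify a mask or a size when the object "
                         "is converted with __arrow_array__")
    res = obj.__arrow_array__(type=type)
    if not isinstance(res, StubArray):
        raise TypeError("__arrow_array__ must return an Array")
    return res


arr = handle_protocol(WrappedColumn([1, 2, 3]), type="int64")
assert arr.type == "int64"
```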
def array(object obj, type=None, mask=None, size=None, from_pandas=None,
          bint safe=True, MemoryPool memory_pool=None):
    """
    Create pyarrow.Array instance from a Python object.

    Parameters
    ----------
    obj : sequence, iterable, ndarray, pandas.Series, Arrow-compatible array
        If both type and size are specified, the input may be a single-use
        iterable. If not strongly-typed, Arrow type will be inferred for
        resulting array.
        Any Arrow-compatible array that implements the Arrow PyCapsule Protocol
        (has an ``__arrow_c_array__`` or ``__arrow_c_device_array__`` method)
        can be passed as well.
    type : pyarrow.DataType
        Explicit type to attempt to coerce to, otherwise will be inferred from
        the data.
    mask : array[bool], optional
        Indicate which values are null (True) or not null (False).
    size : int64, optional
        Size of the elements. If the input is larger than size bail at this
        length. For iterators, if size is larger than the input iterator this
        will be treated as a "max size", but will involve an initial allocation
        of size followed by a resize to the actual size (so if you know the
        exact size specifying it correctly will give you better performance).
    from_pandas : bool, default None
        Use pandas's semantics for inferring nulls from values in
        ndarray-like data. If passed, the mask takes precedence, but
        if a value is unmasked (not-null), but still null according to
        pandas semantics, then it is null. Defaults to False if not
        passed explicitly by user, or True if a pandas object is
        passed in.
    safe : bool, default True
        Check for overflows or other unsafe conversions.
    memory_pool : pyarrow.MemoryPool, optional
        If not passed, will allocate memory from the currently-set default
        memory pool.

    Returns
    -------
    array : pyarrow.Array or pyarrow.ChunkedArray
        A ChunkedArray instead of an Array is returned if:

        - the object data overflowed binary storage.
        - the object's ``__arrow_array__`` protocol method returned a chunked
          array.

    Notes
    -----
    Timezone will be preserved in the returned array for timezone-aware data,
    else no timezone will be returned for naive timestamps.
    Internally, UTC values are stored for timezone-aware data with the
    timezone set in the data type.

    Pandas's DateOffsets and dateutil.relativedelta.relativedelta are by
    default converted as MonthDayNanoIntervalArray. relativedelta leapdays
    are ignored as are all absolute fields on both objects. datetime.timedelta
    can also be converted to MonthDayNanoIntervalArray but this requires
    passing MonthDayNanoIntervalType explicitly.

    Converting to dictionary array will promote to a wider integer type for
    indices if the number of distinct values cannot be represented, even if
    the index type was explicitly set. This means that if there are more than
    127 values the returned dictionary array's index type will be at least
    pa.int16() even if pa.int8() was passed to the function. Note that an
    explicit index type will not be demoted even if it is wider than required.

    The returned array supports Python's standard operators
    for element-wise operations, i.e. arithmetic (`+`, `-`, `/`, `%`, `**`),
    bitwise (`&`, `|`, `^`, `>>`, `<<`) and others.
    They can be used directly instead of calling underlying
    `pyarrow.compute` functions explicitly.

    Examples
    --------
    >>> import pandas as pd
    >>> import pyarrow as pa
    >>> pa.array(pd.Series([1, 2]))
    <pyarrow.lib.Int64Array object at ...>
    [
      1,
      2
    ]

    >>> pa.array(["a", "b", "a"], type=pa.dictionary(pa.int8(), pa.string()))
    <pyarrow.lib.DictionaryArray object at ...>
    ...
    -- dictionary:
    [
      "a",
      "b"
    ]
    -- indices:
    [
      0,
      1,
      0
    ]

    >>> import numpy as np
    >>> pa.array(pd.Series([1, 2]), mask=np.array([0, 1], dtype=bool))
    <pyarrow.lib.Int64Array object at ...>
[
|
|
|
|
|
1,
|
2018-07-21 19:24:59 +02:00
|
|
|
null
|
ARROW-838: [Python] Expand pyarrow.array to handle NumPy arrays not originating in pandas
This unifies the ingest path for 1D data into `pyarrow.array`. I added the argument `from_pandas` to turn null sentinel checking on or off:
```
In [8]: arr = np.random.randn(10000000)
In [9]: arr[::3] = np.nan
In [10]: arr2 = pa.array(arr)
In [11]: arr2.null_count
Out[11]: 0
In [12]: %timeit arr2 = pa.array(arr)
The slowest run took 5.43 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 68.4 µs per loop
In [13]: arr2 = pa.array(arr, from_pandas=True)
In [14]: arr2.null_count
Out[14]: 3333334
In [15]: %timeit arr2 = pa.array(arr, from_pandas=True)
1 loop, best of 3: 228 ms per loop
```
When the data is contiguous, it is always zero-copy, but then `from_pandas=True` and no null mask is passed, then a null bitmap is constructed and populated.
This also permits sequence reads into integers smaller than int64:
```
In [17]: pa.array([1, 2, 3, 4], type='i1')
Out[17]:
<pyarrow.lib.Int8Array object at 0x7ffa1c1c65e8>
[
1,
2,
3,
4
]
```
Oh, I also added NumPy-like string type aliases:
```
In [18]: pa.int32() == 'i4'
Out[18]: True
```
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes #1146 from wesm/expand-py-array-method and squashes the following commits:
1570e525 [Wes McKinney] Code review comments
d3bbb3c3 [Wes McKinney] Handle type aliases in cast, too
797f0151 [Wes McKinney] Allow null checking to be skipped with from_pandas=False in pyarrow.array
f2802fc7 [Wes McKinney] Cleaner codepath for numpy->arrow conversions
587c575a [Wes McKinney] Add direct types sequence converters for more data types
cf40b767 [Wes McKinney] Add type aliases, some unit tests
7b530e4b [Wes McKinney] Consolidate both sequence and ndarray/Series/Index conversion in pyarrow.Array
2017-09-29 23:02:58 -05:00
|
|
|
]
|
2020-09-25 20:49:16 -04:00
|
|
|
|
|
|
|
|
>>> arr = pa.array(range(1024), type=pa.dictionary(pa.int8(), pa.int64()))
|
|
|
|
|
>>> arr.type.index_type
|
|
|
|
|
DataType(int16)
|
2026-03-27 09:07:21 +01:00
|
|
|
|
|
|
|
|
>>> arr1 = pa.array([1, 2, 3], type=pa.int8())
|
|
|
|
|
>>> arr2 = pa.array([4, 5, 6], type=pa.int8())
|
|
|
|
|
>>> arr1 + arr2
|
|
|
|
|
<pyarrow.lib.Int8Array object at ...>
|
|
|
|
|
[
|
|
|
|
|
5,
|
|
|
|
|
7,
|
|
|
|
|
9
|
|
|
|
|
]
|
|
|
|
|
|
|
|
|
|
>>> val = pa.scalar(42)
|
|
|
|
|
>>> val - arr1
|
|
|
|
|
<pyarrow.lib.Int64Array object at ...>
|
|
|
|
|
[
|
|
|
|
|
41,
|
|
|
|
|
40,
|
|
|
|
|
39
|
|
|
|
|
]
|
2017-04-17 17:47:51 -04:00
|
|
|
"""

    cdef:
        CMemoryPool* pool = maybe_unbox_memory_pool(memory_pool)
        bint is_pandas_object = False
        bint c_from_pandas

    type = ensure_type(type, allow_none=True)

    extension_type = None
    if type is not None and type.id == _Type_EXTENSION:
        extension_type = type
        type = type.storage_type

    if from_pandas is None:
        c_from_pandas = False
    else:
        c_from_pandas = from_pandas

    if isinstance(obj, Array):
        if type is not None and not obj.type.equals(type):
            obj = obj.cast(type, safe=safe, memory_pool=memory_pool)
        return obj

    if hasattr(obj, '__arrow_array__'):
        return _handle_arrow_array_protocol(obj, type, mask, size)
    elif hasattr(obj, '__arrow_c_device_array__'):
        if type is not None:
            requested_type = type.__arrow_c_schema__()
        else:
            requested_type = None
        schema_capsule, array_capsule = obj.__arrow_c_device_array__(requested_type)
        out_array = Array._import_from_c_device_capsule(schema_capsule, array_capsule)
        if type is not None and out_array.type != type:
            # PyCapsule interface type coercion is best effort, so we need to
            # check the type of the returned array and cast if necessary
            out_array = out_array.cast(type, safe=safe, memory_pool=memory_pool)
        return out_array
    elif hasattr(obj, '__arrow_c_array__'):
        if type is not None:
            requested_type = type.__arrow_c_schema__()
        else:
            requested_type = None
        schema_capsule, array_capsule = obj.__arrow_c_array__(requested_type)
        out_array = Array._import_from_c_capsule(schema_capsule, array_capsule)
        if type is not None and out_array.type != type:
            # PyCapsule interface type coercion is best effort, so we need to
            # check the type of the returned array and cast if necessary
            out_array = out_array.cast(type, safe=safe, memory_pool=memory_pool)
        return out_array
    elif _is_array_like(obj):
        if mask is not None:
            if _is_array_like(mask):
                mask = get_values(mask, &is_pandas_object)
            else:
                raise TypeError("Mask must be a numpy array "
                                "when converting numpy arrays")

        values = get_values(obj, &is_pandas_object)
        if is_pandas_object and from_pandas is None:
            c_from_pandas = True

        if isinstance(values, np.ma.MaskedArray):
            if mask is not None:
                raise ValueError("Cannot pass a numpy masked array and "
                                 "specify a mask at the same time")
            else:
                # don't use shrunken masks
                mask = None if values.mask is np.ma.nomask else values.mask
                values = values.data

        if mask is not None:
            if mask.dtype != np.bool_:
                raise TypeError("Mask must be boolean dtype")
            if mask.ndim != 1:
                raise ValueError("Mask must be 1D array")
            if len(values) != len(mask):
                raise ValueError(
                    "Mask is a different length from sequence being converted")

        if hasattr(values, '__arrow_array__'):
            return _handle_arrow_array_protocol(values, type, mask, size)
        elif (pandas_api.is_categorical(values) and
              type is not None and type.id != Type_DICTIONARY):
            result = _ndarray_to_array(
                np.asarray(values), mask, type, c_from_pandas, safe, pool
            )
        elif pandas_api.is_categorical(values):
            if type is not None:
                index_type = type.index_type
                value_type = type.value_type
                if values.ordered != type.ordered:
                    raise ValueError(
                        "The 'ordered' flag of the passed categorical values "
                        "does not match the 'ordered' of the specified type.")
            else:
                index_type = None
                value_type = None

            indices = _codes_to_indices(
                values.codes, mask, index_type, memory_pool)
            try:
                dictionary = array(
                    values.categories.values, type=value_type,
                    memory_pool=memory_pool)
            except TypeError:
                # TODO when removing the deprecation warning, this whole
                # try/except can be removed (to bubble the TypeError of
                # the first array(..) call)
                if value_type is not None:
                    warnings.warn(
                        "The dtype of the 'categories' of the passed "
                        f"categorical values ({values.categories.dtype}) "
                        f"does not match the specified type ({value_type}). "
                        "For now ignoring the specified "
                        "type, but in the future this mismatch will raise a "
                        "TypeError",
                        FutureWarning, stacklevel=2)
                    dictionary = array(
                        values.categories.values, memory_pool=memory_pool)
                else:
                    raise

            return DictionaryArray.from_arrays(
                indices, dictionary, ordered=values.ordered, safe=safe)
        else:
            if pandas_api.have_pandas:
                values, type = pandas_api.compat.get_datetimetz_type(
                    values, obj.dtype, type)
            if type and type.id == _Type_RUN_END_ENCODED:
                arr = _ndarray_to_array(
                    values, mask, type.value_type, c_from_pandas, safe, pool)
                result = _pc().run_end_encode(arr, run_end_type=type.run_end_type,
                                              memory_pool=memory_pool)
            else:
                result = _ndarray_to_array(values, mask, type, c_from_pandas, safe,
                                           pool)
    else:
        if type and type.id == _Type_RUN_END_ENCODED:
            arr = _sequence_to_array(
                obj, mask, size, type.value_type, pool, c_from_pandas)
            result = _pc().run_end_encode(arr, run_end_type=type.run_end_type,
                                          memory_pool=memory_pool)
        # ConvertPySequence does strict conversion if type is explicitly passed
        else:
            result = _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)

    if extension_type is not None:
        result = ExtensionArray.from_storage(extension_type, result)
    return result
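The dispatch chain above — Arrow arrays pass through, then `__arrow_array__`, then the C data interface dunders, then generic sequence conversion — can be sketched in plain Python. `to_arrow` and the two toy classes below are hypothetical illustrations, not pyarrow APIs (the real `__arrow_c_array__` also takes a requested schema):

```python
# Hypothetical sketch of array()'s protocol-dispatch order: objects that
# implement __arrow_array__ win over the C-data-interface dunder, which in
# turn wins over plain sequence conversion.

def to_arrow(obj):
    if hasattr(obj, '__arrow_array__'):
        return obj.__arrow_array__()
    elif hasattr(obj, '__arrow_c_array__'):
        return ('c-interface', obj.__arrow_c_array__())
    else:
        return ('sequence', list(obj))

class HasProtocol:
    def __arrow_array__(self):
        return ('protocol', [1, 2, 3])

class HasCInterface:
    def __arrow_c_array__(self):
        return [4, 5]

print(to_arrow(HasProtocol()))    # ('protocol', [1, 2, 3])
print(to_arrow(HasCInterface()))  # ('c-interface', [4, 5])
print(to_arrow(range(2)))         # ('sequence', [0, 1])
```

Because the checks are ordered, an object implementing both protocols is always handled via `__arrow_array__`, mirroring the `if`/`elif` chain in the real function.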


def asarray(values, type=None):
    """
    Convert to pyarrow.Array, inferring type if not provided.

    Parameters
    ----------
    values : array-like
        This can be a sequence, numpy.ndarray, pyarrow.Array or
        pyarrow.ChunkedArray. If a ChunkedArray is passed, the output will be
        a ChunkedArray, otherwise the output will be an Array.
    type : string or DataType
        Explicitly construct the array with this type. Attempt to cast if
        indicated type is different.

    Returns
    -------
    arr : Array or ChunkedArray
    """
    if isinstance(values, (Array, ChunkedArray)):
        if type is not None and not values.type.equals(type):
            values = values.cast(type)
        return values
    else:
        return array(values, type=type)
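The pass-through-or-cast pattern `asarray` follows — return existing arrays untouched when the type already matches, cast when it differs, convert from scratch otherwise — can be shown with a minimal pure-Python sketch; `Arr` and `as_arr` are hypothetical stand-ins, not pyarrow classes:

```python
# Hypothetical sketch of the convert-only-when-needed pattern behind asarray().

class Arr:
    def __init__(self, values, type):
        self.values = list(values)
        self.type = type

    def cast(self, type):
        # a real cast converts values; here we only retag the type
        return Arr(self.values, type)

def as_arr(values, type=None):
    if isinstance(values, Arr):
        if type is not None and values.type != type:
            return values.cast(type)
        return values  # zero-copy pass-through when the type already matches
    return Arr(values, type or 'inferred')

a = Arr([1, 2], 'int64')
print(as_arr(a) is a)             # True -- same object returned
print(as_arr(a, 'int32').type)    # int32 -- cast only when types differ
print(as_arr([1, 2]).type)        # inferred -- plain sequences are converted
```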


def nulls(size, type=None, MemoryPool memory_pool=None):
    """
    Create a strongly-typed Array instance with all elements null.

    Parameters
    ----------
    size : int
        Array length.
    type : pyarrow.DataType, default None
        Explicit type for the array. By default use NullType.
    memory_pool : MemoryPool, default None
        Arrow MemoryPool to use for allocations. Uses the default memory
        pool if not passed.

    Returns
    -------
    arr : Array

    Examples
    --------
    >>> import pyarrow as pa
    >>> pa.nulls(10)
    <pyarrow.lib.NullArray object at ...>
    10 nulls

    >>> pa.nulls(3, pa.uint32())
    <pyarrow.lib.UInt32Array object at ...>
    [
      null,
      null,
      null
    ]
    """
    cdef:
        CMemoryPool* pool = maybe_unbox_memory_pool(memory_pool)
        int64_t length = size
        shared_ptr[CDataType] ty
        shared_ptr[CArray] arr

    type = ensure_type(type, allow_none=True)
    if type is None:
        type = null()

    ty = pyarrow_unwrap_data_type(type)
    with nogil:
        arr = GetResultValue(MakeArrayOfNull(ty, length, pool))

    return pyarrow_wrap_array(arr)
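In the Arrow format, the null slots that `nulls()` allocates are tracked by a validity bitmap: one bit per slot, least-significant bit first, with 0 meaning null. A minimal sketch of that packing, assuming nothing about pyarrow's internal buffers:

```python
# Sketch of Arrow-style validity-bitmap packing (not pyarrow's implementation).

def pack_validity(valid_flags):
    """Pack booleans into an LSB-ordered bitmap, as the Arrow format does."""
    nbytes = (len(valid_flags) + 7) // 8   # bitmap is padded to whole bytes
    buf = bytearray(nbytes)
    for i, valid in enumerate(valid_flags):
        if valid:
            buf[i // 8] |= 1 << (i % 8)    # bit i of byte i // 8, LSB first
    return bytes(buf)

# An all-null array of length 10 needs a 2-byte bitmap of zeros:
print(pack_validity([False] * 10))                   # b'\x00\x00'
# Slots 0 and 3 valid -> bits 0 and 3 of the first byte set (0b1001 = 9):
print(pack_validity([True, False, False, True])[0])  # 9
```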


def repeat(value, size, MemoryPool memory_pool=None):
    """
    Create an Array instance whose slots are the given scalar.

    Parameters
    ----------
    value : Scalar-like object
        Either a pyarrow.Scalar or any python object coercible to a Scalar.
    size : int
        Number of times to repeat the scalar in the output Array.
    memory_pool : MemoryPool, default None
        Arrow MemoryPool to use for allocations. Uses the default memory
        pool if not passed.

    Returns
    -------
    arr : Array

    Examples
    --------
    >>> import pyarrow as pa
    >>> pa.repeat(10, 3)
    <pyarrow.lib.Int64Array object at ...>
    [
      10,
      10,
      10
    ]

    >>> pa.repeat([1, 2], 2)
    <pyarrow.lib.ListArray object at ...>
    [
      [
        1,
        2
      ],
      [
        1,
        2
      ]
    ]

    >>> pa.repeat("string", 3)
    <pyarrow.lib.StringArray object at ...>
    [
      "string",
      "string",
      "string"
    ]

    >>> pa.repeat(pa.scalar({'a': 1, 'b': [1, 2]}), 2)
    <pyarrow.lib.StructArray object at ...>
    -- is_valid: all not null
    -- child 0 type: int64
      [
        1,
        1
      ]
    -- child 1 type: list<item: int64>
      [
        [
          1,
          2
        ],
        [
          1,
          2
        ]
      ]
    """
|
|
|
|
|
cdef:
|
|
|
|
|
CMemoryPool* pool = maybe_unbox_memory_pool(memory_pool)
|
|
|
|
|
int64_t length = size
|
|
|
|
|
shared_ptr[CArray] c_array
|
|
|
|
|
shared_ptr[CScalar] c_scalar
|
|
|
|
|
|
|
|
|
|
if not isinstance(value, Scalar):
|
|
|
|
|
value = scalar(value, memory_pool=memory_pool)
|
|
|
|
|
|
|
|
|
|
c_scalar = (<Scalar> value).unwrap()
|
|
|
|
|
with nogil:
|
|
|
|
|
c_array = GetResultValue(
|
|
|
|
|
MakeArrayFromScalar(deref(c_scalar), length, pool)
|
|
|
|
|
)
|
|
|
|
|
|
|
|
|
|
return pyarrow_wrap_array(c_array)
|
|
|
|
|
|
|
|
|
|
|
2019-06-24 19:18:28 -05:00
|
|
|
def infer_type(values, mask=None, from_pandas=False):
    """
    Attempt to infer Arrow data type that can hold the passed Python
    sequence type in an Array object

    Parameters
    ----------
    values : array-like
        Sequence to infer type from.
    mask : ndarray (bool type), optional
        Optional exclusion mask where True marks null, False non-null.
    from_pandas : bool, default False
        Use pandas's NA/null sentinel values for type inference.

    Returns
    -------
    type : DataType
    """
    cdef:
        shared_ptr[CDataType] out
        c_bool use_pandas_sentinels = from_pandas

    if mask is not None and not isinstance(mask, np.ndarray):
        mask = np.array(mask, dtype=bool)

    out = GetResultValue(InferArrowType(values, mask, use_pandas_sentinels))
    return pyarrow_wrap_data_type(out)

def arange(int64_t start, int64_t stop, int64_t step=1, *, memory_pool=None):
    """
    Create an array of evenly spaced values within a given interval.

    This function is similar to Python's `range` function.
    The resulting array will contain values starting from `start` up to but not
    including `stop`, with a step size of `step`.

    Parameters
    ----------
    start : int
        The starting value for the sequence. The returned array will include this value.
    stop : int
        The stopping value for the sequence. The returned array will not include this value.
    step : int, default 1
        The spacing between values.
    memory_pool : MemoryPool, optional
        A memory pool to use for memory allocations.

    Raises
    ------
    ArrowInvalid
        If `step` is zero.

    Returns
    -------
    arange : Array
    """
    cdef:
        CMemoryPool* pool = maybe_unbox_memory_pool(memory_pool)
        shared_ptr[CArray] c_array

    with nogil:
        c_array = GetResultValue(Arange(start, stop, step, pool))
    return pyarrow_wrap_array(c_array)

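The half-open interval semantics documented above match Python's built-in `range`. A stdlib-only sketch for illustration (`arange_list` is a hypothetical helper, not pyarrow's implementation):

```python
# Hypothetical pure-Python analogue of arange() semantics: values from
# `start` up to but not including `stop`, spaced by `step`.
def arange_list(start, stop, step=1):
    if step == 0:
        # pyarrow raises ArrowInvalid here; a plain ValueError in this sketch
        raise ValueError("step must be nonzero")
    return list(range(start, stop, step))
```

For example, `arange_list(0, 5, 2)` yields `[0, 2, 4]`.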
def _normalize_slice(object arrow_obj, slice key):
    """
    Slices with step not equal to 1 (or None) will produce a copy
    rather than a zero-copy view
    """
    cdef:
        int64_t start, stop, step
        Py_ssize_t n = len(arrow_obj)

    start, stop, step = key.indices(n)

    if step != 1:
        return arrow_obj.take(arange(start, stop, step))
    else:
        length = max(stop - start, 0)
        return arrow_obj.slice(start, length)

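The dispatch above leans on Python's own `slice.indices` to clamp bounds and resolve negative indices. A minimal pure-Python sketch of the same decision (hypothetical name, not the Cython helper itself):

```python
# Mirrors _normalize_slice's decision: non-unit steps need a gather
# (a copy via take), unit steps can be served by a zero-copy
# slice(offset, length).
def normalize_slice(n, key):
    start, stop, step = key.indices(n)  # clamps and resolves negatives
    if step != 1:
        return ("take", list(range(start, stop, step)))
    return ("slice", start, max(stop - start, 0))
```

Note the `max(stop - start, 0)` clamp: an empty slice like `obj[4:2]` must map to length 0, not a negative length.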
cdef Py_ssize_t _normalize_index(Py_ssize_t index,
                                 Py_ssize_t length) except -1:
    if index < 0:
        index += length
        if index < 0:
            raise IndexError("index out of bounds")
    elif index >= length:
        raise IndexError("index out of bounds")
    return index

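The helper above follows the usual Python convention for negative indices. The same logic in plain Python, for illustration (hypothetical name):

```python
# Pure-Python mirror of _normalize_index: resolve a possibly-negative
# index against `length` and bounds-check it.
def normalize_index(index, length):
    if index < 0:
        index += length
        if index < 0:
            raise IndexError("index out of bounds")
    elif index >= length:
        raise IndexError("index out of bounds")
    return index
```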
cdef wrap_datum(const CDatum& datum):
    if datum.kind() == DatumType_ARRAY:
        return pyarrow_wrap_array(MakeArray(datum.array()))
    elif datum.kind() == DatumType_CHUNKED_ARRAY:
        return pyarrow_wrap_chunked_array(datum.chunked_array())
    elif datum.kind() == DatumType_RECORD_BATCH:
        return pyarrow_wrap_batch(datum.record_batch())
    elif datum.kind() == DatumType_TABLE:
        return pyarrow_wrap_table(datum.table())
    elif datum.kind() == DatumType_SCALAR:
        return pyarrow_wrap_scalar(datum.scalar())
    else:
        raise ValueError("Unable to wrap Datum in a Python object")

cdef _append_array_buffers(const CArrayData* ad, list res):
    """
    Recursively append Buffer wrappers from *ad* and its children.
    """
    cdef size_t i, n
    assert ad != NULL
    n = ad.buffers.size()
    for i in range(n):
        buf = ad.buffers[i]
        res.append(pyarrow_wrap_buffer(buf)
                   if buf.get() != NULL else None)
    n = ad.child_data.size()
    for i in range(n):
        _append_array_buffers(ad.child_data[i].get(), res)

cdef _reduce_array_data(const CArrayData* ad):
    """
    Recursively dissect ArrayData to (picklable) tuples.
    """
    cdef size_t i, n
    assert ad != NULL

    n = ad.buffers.size()
    buffers = []
    for i in range(n):
        buf = ad.buffers[i]
        buffers.append(pyarrow_wrap_buffer(buf)
                       if buf.get() != NULL else None)

    children = []
    n = ad.child_data.size()
    for i in range(n):
        children.append(_reduce_array_data(ad.child_data[i].get()))

    if ad.dictionary.get() != NULL:
        dictionary = _reduce_array_data(ad.dictionary.get())
    else:
        dictionary = None

    return pyarrow_wrap_data_type(ad.type), ad.length, ad.null_count, \
        ad.offset, buffers, children, dictionary

cdef shared_ptr[CArrayData] _reconstruct_array_data(data):
    """
    Reconstruct CArrayData objects from the tuple structure generated
    by _reduce_array_data.
    """
    cdef:
        int64_t length, null_count, offset, i
        DataType dtype
        Buffer buf
        vector[shared_ptr[CBuffer]] c_buffers
        vector[shared_ptr[CArrayData]] c_children
        shared_ptr[CArrayData] c_dictionary

    dtype, length, null_count, offset, buffers, children, dictionary = data

    for i in range(len(buffers)):
        buf = buffers[i]
        if buf is None:
            c_buffers.push_back(shared_ptr[CBuffer]())
        else:
            c_buffers.push_back(buf.buffer)

    for i in range(len(children)):
        c_children.push_back(_reconstruct_array_data(children[i]))

    if dictionary is not None:
        c_dictionary = _reconstruct_array_data(dictionary)

    return CArrayData.MakeWithChildrenAndDictionary(
        dtype.sp_type,
        length,
        c_buffers,
        c_children,
        c_dictionary,
        null_count,
        offset)

def _restore_array(data):
    """
    Reconstruct an Array from pickled ArrayData.
    """
    cdef shared_ptr[CArrayData] ad = _reconstruct_array_data(data)
    return pyarrow_wrap_array(MakeArray(ad))

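The `_reduce_array_data` / `_reconstruct_array_data` pair above is a recursive tuple round trip: each node flattens itself and recurses into its children, and reconstruction walks the same shape back. The shape of that round trip in plain Python (hypothetical node layout, not the real ArrayData fields):

```python
# Each node flattens to (payload, [reduced children]) and rebuilds
# recursively, like _reduce_array_data / _reconstruct_array_data.
def reduce_node(payload, children):
    return (payload, [reduce_node(*c) for c in children])

def reconstruct_node(data):
    payload, children = data
    return (payload, [reconstruct_node(c) for c in children])
```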
cdef class ArrayStatistics(_Weakrefable):
    """
    The class for statistics of an array.
    """

    def __init__(self):
        raise TypeError(f"Do not call {self.__class__.__name__}'s constructor "
                        "directly")

    cdef void init(self, const shared_ptr[CArrayStatistics]& sp_statistics):
        self.sp_statistics = sp_statistics

    def __repr__(self):
        return (f"arrow.ArrayStatistics<null_count={self.null_count}, "
                f"distinct_count={self.distinct_count}, min={self.min}, "
                f"is_min_exact={self.is_min_exact}, max={self.max}, "
                f"is_max_exact={self.is_max_exact}>")

    @property
    def null_count(self):
        """
        The number of nulls.
        """
        null_count = self.sp_statistics.get().null_count
        # We'll be able to simplify this after
        # https://github.com/cython/cython/issues/6692 is solved.
        if not null_count.has_value():
            return None
        value = null_count.value()
        if holds_alternative[int64_t](value):
            return get[int64_t](value)
        else:
            return get[double](value)

    @property
    def is_null_count_exact(self):
        """
        Whether the number of null values is a valid exact value or not.
        """
        null_count = self.sp_statistics.get().null_count
        if not null_count.has_value():
            return False
        value = null_count.value()
        return holds_alternative[int64_t](value)

    @property
    def distinct_count(self):
        """
        The number of distinct values.
        """
        distinct_count = self.sp_statistics.get().distinct_count
        if not distinct_count.has_value():
            return None
        value = distinct_count.value()
        if holds_alternative[int64_t](value):
            return get[int64_t](value)
        else:
            return get[double](value)

    @property
    def is_distinct_count_exact(self):
        """
        Whether the number of distinct values is a valid exact value or not.
        """
        distinct_count = self.sp_statistics.get().distinct_count
        if not distinct_count.has_value():
            return False
        value = distinct_count.value()
        return holds_alternative[int64_t](value)

    @property
    def min(self):
        """
        The minimum value.
        """
        return self._get_value(self.sp_statistics.get().min)

    @property
    def is_min_exact(self):
        """
        Whether the minimum value is an exact value or not.
        """
        return self.sp_statistics.get().is_min_exact

    @property
    def max(self):
        """
        The maximum value.
        """
        return self._get_value(self.sp_statistics.get().max)

    @property
    def is_max_exact(self):
        """
        Whether the maximum value is an exact value or not.
        """
        return self.sp_statistics.get().is_max_exact

    cdef _get_value(self, const optional[CArrayStatisticsValueType]& optional_value):
        """
        Get a raw value from
        std::optional<arrow::ArrayStatistics::ValueType> data.

        arrow::ArrayStatistics::ValueType is
        std::variant<bool, int64_t, uint64_t, double, std::string>.
        """
        if not optional_value.has_value():
            return None
        value = optional_value.value()
        if holds_alternative[c_bool](value):
            return get[c_bool](value)
        elif holds_alternative[int64_t](value):
            return get[int64_t](value)
        elif holds_alternative[uint64_t](value):
            return get[uint64_t](value)
        elif holds_alternative[double](value):
            return get[double](value)
        else:
            return get[c_string](value)

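`_get_value` above probes the `std::variant` alternatives in a fixed order. A pure-Python sketch of the same dispatch, for illustration (hypothetical helper, using `isinstance` in place of `holds_alternative`):

```python
# bool is tested before int, just as holds_alternative[c_bool] runs
# before the integer checks: in Python, bool is a subclass of int, so
# reversing the order would misclassify True/False as plain integers.
def unwrap_statistic(value):
    if value is None:
        return None            # the empty std::optional case
    if isinstance(value, bool):
        return value
    if isinstance(value, int):  # covers the int64/uint64 alternatives
        return value
    if isinstance(value, float):  # the double alternative
        return value
    return bytes(value)        # the std::string alternative
```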
cdef class _PandasConvertible(_Weakrefable):
|
ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays
This adds a `deduplicate_objects` option to all of the `to_pandas` methods. It works with string types, date types (when `date_as_object=True`), and time types.
I also made it so that `ScalarMemoTable` can be used with `string_view`, for more efficient memoization in this case.
I made the default for `deduplicate_objects` is True. When the ratio of unique strings to the length of the array is low, not only does this use drastically less memory, it is also faster. I will write some benchmarks to show where the "crossover point" is when the overhead of hashing makes things slower.
Let's consider a simple case where we have 10,000,000 strings of length 10, but only 1000 unique values:
```
In [50]: import pandas.util.testing as tm
In [51]: unique_values = [tm.rands(10) for i in range(1000)]
In [52]: values = unique_values * 10000
In [53]: arr = pa.array(values)
In [54]: timeit arr.to_pandas()
236 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [55]: timeit arr.to_pandas(deduplicate_objects=False)
730 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Almost 3 times faster in this case. The different in memory use is even more drastic
```
In [44]: unique_values = [tm.rands(10) for i in range(1000)]
In [45]: values = unique_values * 10000
In [46]: arr = pa.array(values)
In [49]: %memit result11 = arr.to_pandas()
peak memory: 1505.89 MiB, increment: 76.27 MiB
In [50]: %memit result12 = arr.to_pandas(deduplicate_objects=False)
peak memory: 2202.29 MiB, increment: 696.11 MiB
```
As you can see, this is a huge problem. If our bug reports about Parquet memory use problems are any indication, users have been suffering from this issue for a long time.
When the strings are mostly unique, then things are slower as expected, the peak memory use is higher because of the hash table
```
In [17]: unique_values = [tm.rands(10) for i in range(500000)]
In [18]: values = unique_values * 2
In [19]: arr = pa.array(values)
In [20]: timeit result = arr.to_pandas()
177 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: timeit result = arr.to_pandas(deduplicate_objects=False)
70.1 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: %memit result8 = arr.to_pandas()
peak memory: 644.39 MiB, increment: 92.23 MiB
In [43]: %memit result9 = arr.to_pandas(deduplicate_objects=False)
peak memory: 610.85 MiB, increment: 58.41 MiB
```
In real world work, many duplicated strings is the most common use case. Given the massive memory use and moderate performance improvements, it makes sense to have this enabled by default.
Author: Wes McKinney <wesm+git@apache.org>
Closes #3257 from wesm/ARROW-3928 and squashes the following commits:
d9a88700 <Wes McKinney> Prettier output
a00b51c7 <Wes McKinney> Add benchmarks for object deduplication
ca88b963 <Wes McKinney> Add Python unit tests, deduplicate for date and time types also when converting to Python objects
7a7873b8 <Wes McKinney> First working iteration of string deduplication when calling to_pandas
2018-12-27 12:17:50 -06:00
|
|
|
|
2019-09-18 11:26:09 -05:00
|
|
|
def to_pandas(
|
|
|
|
|
self,
|
|
|
|
|
memory_pool=None,
|
|
|
|
|
categories=None,
|
|
|
|
|
bint strings_to_categorical=False,
|
|
|
|
|
bint zero_copy_only=False,
|
|
|
|
|
bint integer_object_nulls=False,
|
|
|
|
|
bint date_as_object=True,
|
2020-06-18 22:29:07 -05:00
|
|
|
bint timestamp_as_object=False,
|
2019-09-18 11:26:09 -05:00
|
|
|
bint use_threads=True,
|
|
|
|
|
bint deduplicate_objects=True,
|
2020-01-14 18:25:01 -06:00
|
|
|
bint ignore_metadata=False,
|
2020-02-17 16:21:33 +01:00
|
|
|
bint safe=True,
|
2020-01-14 18:25:01 -06:00
|
|
|
bint split_blocks=False,
|
2020-01-23 09:42:42 -08:00
|
|
|
bint self_destruct=False,
|
2023-04-21 11:18:32 -04:00
|
|
|
str maps_as_pydicts=None,
|
GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0 (#35656)
Do not coerce temporal types to nanosecond when pandas >= 2.0 is imported, since pandas now supports s/ms/us time units.
This PR adds support for the following Arrow -> Pandas conversions, which previously all defaulted to `datetime64[ns]` or `datetime64[ns, <TZ>]`:
```
date32 -> datetime64[ms]
date64 -> datetime64[ms]
datetime64[s] -> datetime64[s]
datetime64[ms] -> datetime64[ms]
datetime64[us] -> datetime64[us]
datetime64[s, <TZ>] -> datetime64[s, <TZ>]
datetime64[ms, <TZ>] -> datetime64[ms, <TZ>]
datetime64[us, <TZ>] -> datetime64[us, <TZ>]
```
### Rationale for this change
Pandas 2.0 introduces proper support for temporal types.
### Are these changes tested?
Yes. Pytests added and updated.
### Are there any user-facing changes?
Yes, arrow-to-pandas default conversion behavior will change when users have pandas >= 2.0, but a legacy option is exposed to provide backwards compatibility.
* Closes: #33321
Lead-authored-by: Dane Pitkin <dane@voltrondata.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
2023-07-07 10:45:58 -04:00
|
|
|
types_mapper=None,
|
|
|
|
|
bint coerce_temporal_nanoseconds=False
|
2019-09-18 11:26:09 -05:00
|
|
|
):
|
ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays
This adds a `deduplicate_objects` option to all of the `to_pandas` methods. It works with string types, date types (when `date_as_object=True`), and time types.
I also made it so that `ScalarMemoTable` can be used with `string_view`, for more efficient memoization in this case.
I made the default for `deduplicate_objects` is True. When the ratio of unique strings to the length of the array is low, not only does this use drastically less memory, it is also faster. I will write some benchmarks to show where the "crossover point" is when the overhead of hashing makes things slower.
Let's consider a simple case where we have 10,000,000 strings of length 10, but only 1000 unique values:
```
In [50]: import pandas.util.testing as tm
In [51]: unique_values = [tm.rands(10) for i in range(1000)]
In [52]: values = unique_values * 10000
In [53]: arr = pa.array(values)
In [54]: timeit arr.to_pandas()
236 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [55]: timeit arr.to_pandas(deduplicate_objects=False)
730 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Almost 3 times faster in this case. The different in memory use is even more drastic
```
In [44]: unique_values = [tm.rands(10) for i in range(1000)]
In [45]: values = unique_values * 10000
In [46]: arr = pa.array(values)
In [49]: %memit result11 = arr.to_pandas()
peak memory: 1505.89 MiB, increment: 76.27 MiB
In [50]: %memit result12 = arr.to_pandas(deduplicate_objects=False)
peak memory: 2202.29 MiB, increment: 696.11 MiB
```
As you can see, this is a huge problem. If our bug reports about Parquet memory use problems are any indication, users have been suffering from this issue for a long time.
When the strings are mostly unique, then things are slower as expected, the peak memory use is higher because of the hash table
```
In [17]: unique_values = [tm.rands(10) for i in range(500000)]
In [18]: values = unique_values * 2
In [19]: arr = pa.array(values)
In [20]: timeit result = arr.to_pandas()
177 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: timeit result = arr.to_pandas(deduplicate_objects=False)
70.1 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: %memit result8 = arr.to_pandas()
peak memory: 644.39 MiB, increment: 92.23 MiB
In [43]: %memit result9 = arr.to_pandas(deduplicate_objects=False)
peak memory: 610.85 MiB, increment: 58.41 MiB
```
In real world work, many duplicated strings is the most common use case. Given the massive memory use and moderate performance improvements, it makes sense to have this enabled by default.
Author: Wes McKinney <wesm+git@apache.org>
Closes #3257 from wesm/ARROW-3928 and squashes the following commits:
d9a88700 <Wes McKinney> Prettier output
a00b51c7 <Wes McKinney> Add benchmarks for object deduplication
ca88b963 <Wes McKinney> Add Python unit tests, deduplicate for date and time types also when converting to Python objects
7a7873b8 <Wes McKinney> First working iteration of string deduplication when calling to_pandas
2018-12-27 12:17:50 -06:00
|
|
|
"""
|
|
|
|
|
Convert to a pandas-compatible NumPy array or DataFrame, as appropriate
|
|
|
|
|
|
|
|
|
|
Parameters
|
|
|
|
|
----------
|
2019-09-18 11:26:09 -05:00
|
|
|
memory_pool : MemoryPool, default None
|
|
|
|
|
Arrow MemoryPool to use for allocations. Uses the default memory
|
2023-02-22 01:36:52 -05:00
|
|
|
pool if not passed.
|
2022-03-28 18:55:45 +02:00
|
|
|
categories : list, default empty
|
ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays
This adds a `deduplicate_objects` option to all of the `to_pandas` methods. It works with string types, date types (when `date_as_object=True`), and time types.
I also made it so that `ScalarMemoTable` can be used with `string_view`, for more efficient memoization in this case.
I made the default for `deduplicate_objects` is True. When the ratio of unique strings to the length of the array is low, not only does this use drastically less memory, it is also faster. I will write some benchmarks to show where the "crossover point" is when the overhead of hashing makes things slower.
Let's consider a simple case where we have 10,000,000 strings of length 10, but only 1000 unique values:
```
In [50]: import pandas.util.testing as tm
In [51]: unique_values = [tm.rands(10) for i in range(1000)]
In [52]: values = unique_values * 10000
In [53]: arr = pa.array(values)
In [54]: timeit arr.to_pandas()
236 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [55]: timeit arr.to_pandas(deduplicate_objects=False)
730 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Almost 3 times faster in this case. The different in memory use is even more drastic
```
In [44]: unique_values = [tm.rands(10) for i in range(1000)]
In [45]: values = unique_values * 10000
In [46]: arr = pa.array(values)
In [49]: %memit result11 = arr.to_pandas()
peak memory: 1505.89 MiB, increment: 76.27 MiB
In [50]: %memit result12 = arr.to_pandas(deduplicate_objects=False)
peak memory: 2202.29 MiB, increment: 696.11 MiB
```
As you can see, this is a huge problem. If our bug reports about Parquet memory use problems are any indication, users have been suffering from this issue for a long time.
When the strings are mostly unique, then things are slower as expected, the peak memory use is higher because of the hash table
```
In [17]: unique_values = [tm.rands(10) for i in range(500000)]
In [18]: values = unique_values * 2
In [19]: arr = pa.array(values)
In [20]: timeit result = arr.to_pandas()
177 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: timeit result = arr.to_pandas(deduplicate_objects=False)
70.1 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: %memit result8 = arr.to_pandas()
peak memory: 644.39 MiB, increment: 92.23 MiB
In [43]: %memit result9 = arr.to_pandas(deduplicate_objects=False)
peak memory: 610.85 MiB, increment: 58.41 MiB
```
In real world work, many duplicated strings is the most common use case. Given the massive memory use and moderate performance improvements, it makes sense to have this enabled by default.
Author: Wes McKinney <wesm+git@apache.org>
Closes #3257 from wesm/ARROW-3928 and squashes the following commits:
d9a88700 <Wes McKinney> Prettier output
a00b51c7 <Wes McKinney> Add benchmarks for object deduplication
ca88b963 <Wes McKinney> Add Python unit tests, deduplicate for date and time types also when converting to Python objects
7a7873b8 <Wes McKinney> First working iteration of string deduplication when calling to_pandas
2018-12-27 12:17:50 -06:00
|
|
|
List of fields that should be returned as pandas.Categorical. Only
|
2020-03-25 14:44:38 +01:00
|
|
|
applies to table-like data structures.
|
2022-10-19 23:41:24 -08:00
|
|
|
strings_to_categorical : bool, default False
|
|
|
|
|
Encode string (UTF8) and binary types to pandas.Categorical.
|
2020-03-25 14:44:38 +01:00
|
|
|
zero_copy_only : bool, default False
|
ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays
This adds a `deduplicate_objects` option to all of the `to_pandas` methods. It works with string types, date types (when `date_as_object=True`), and time types.
I also made it so that `ScalarMemoTable` can be used with `string_view`, for more efficient memoization in this case.
I made the default for `deduplicate_objects` is True. When the ratio of unique strings to the length of the array is low, not only does this use drastically less memory, it is also faster. I will write some benchmarks to show where the "crossover point" is when the overhead of hashing makes things slower.
Let's consider a simple case where we have 10,000,000 strings of length 10, but only 1000 unique values:
```
In [50]: import pandas.util.testing as tm
In [51]: unique_values = [tm.rands(10) for i in range(1000)]
In [52]: values = unique_values * 10000
In [53]: arr = pa.array(values)
In [54]: timeit arr.to_pandas()
236 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [55]: timeit arr.to_pandas(deduplicate_objects=False)
730 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Almost 3 times faster in this case. The different in memory use is even more drastic
```
In [44]: unique_values = [tm.rands(10) for i in range(1000)]
In [45]: values = unique_values * 10000
In [46]: arr = pa.array(values)
In [49]: %memit result11 = arr.to_pandas()
peak memory: 1505.89 MiB, increment: 76.27 MiB
In [50]: %memit result12 = arr.to_pandas(deduplicate_objects=False)
peak memory: 2202.29 MiB, increment: 696.11 MiB
```
As you can see, this is a huge problem. If our bug reports about Parquet memory use problems are any indication, users have been suffering from this issue for a long time.
When the strings are mostly unique, then things are slower as expected, the peak memory use is higher because of the hash table
```
In [17]: unique_values = [tm.rands(10) for i in range(500000)]
In [18]: values = unique_values * 2
In [19]: arr = pa.array(values)
In [20]: timeit result = arr.to_pandas()
177 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: timeit result = arr.to_pandas(deduplicate_objects=False)
70.1 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: %memit result8 = arr.to_pandas()
peak memory: 644.39 MiB, increment: 92.23 MiB
In [43]: %memit result9 = arr.to_pandas(deduplicate_objects=False)
peak memory: 610.85 MiB, increment: 58.41 MiB
```
In real world work, many duplicated strings is the most common use case. Given the massive memory use and moderate performance improvements, it makes sense to have this enabled by default.
Author: Wes McKinney <wesm+git@apache.org>
Closes #3257 from wesm/ARROW-3928 and squashes the following commits:
d9a88700 <Wes McKinney> Prettier output
a00b51c7 <Wes McKinney> Add benchmarks for object deduplication
ca88b963 <Wes McKinney> Add Python unit tests, deduplicate for date and time types also when converting to Python objects
7a7873b8 <Wes McKinney> First working iteration of string deduplication when calling to_pandas
2018-12-27 12:17:50 -06:00
|
|
|
Raise an ArrowException if this function call would require copying
|
2020-03-25 14:44:38 +01:00
|
|
|
the underlying data.
|
|
|
|
|
integer_object_nulls : bool, default False
|
ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays
This adds a `deduplicate_objects` option to all of the `to_pandas` methods. It works with string types, date types (when `date_as_object=True`), and time types.
I also made it so that `ScalarMemoTable` can be used with `string_view`, for more efficient memoization in this case.
I made the default for `deduplicate_objects` is True. When the ratio of unique strings to the length of the array is low, not only does this use drastically less memory, it is also faster. I will write some benchmarks to show where the "crossover point" is when the overhead of hashing makes things slower.
Let's consider a simple case where we have 10,000,000 strings of length 10, but only 1000 unique values:
```
In [50]: import pandas.util.testing as tm
In [51]: unique_values = [tm.rands(10) for i in range(1000)]
In [52]: values = unique_values * 10000
In [53]: arr = pa.array(values)
In [54]: timeit arr.to_pandas()
236 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [55]: timeit arr.to_pandas(deduplicate_objects=False)
730 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Almost 3 times faster in this case. The different in memory use is even more drastic
```
In [44]: unique_values = [tm.rands(10) for i in range(1000)]
In [45]: values = unique_values * 10000
In [46]: arr = pa.array(values)
In [49]: %memit result11 = arr.to_pandas()
peak memory: 1505.89 MiB, increment: 76.27 MiB
In [50]: %memit result12 = arr.to_pandas(deduplicate_objects=False)
peak memory: 2202.29 MiB, increment: 696.11 MiB
```
As you can see, this is a huge problem. If our bug reports about Parquet memory use problems are any indication, users have been suffering from this issue for a long time.
When the strings are mostly unique, then things are slower as expected, the peak memory use is higher because of the hash table
```
In [17]: unique_values = [tm.rands(10) for i in range(500000)]
In [18]: values = unique_values * 2
In [19]: arr = pa.array(values)
In [20]: timeit result = arr.to_pandas()
177 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: timeit result = arr.to_pandas(deduplicate_objects=False)
70.1 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: %memit result8 = arr.to_pandas()
peak memory: 644.39 MiB, increment: 92.23 MiB
In [43]: %memit result9 = arr.to_pandas(deduplicate_objects=False)
peak memory: 610.85 MiB, increment: 58.41 MiB
```
In real world work, many duplicated strings is the most common use case. Given the massive memory use and moderate performance improvements, it makes sense to have this enabled by default.
Author: Wes McKinney <wesm+git@apache.org>
Closes #3257 from wesm/ARROW-3928 and squashes the following commits:
d9a88700 <Wes McKinney> Prettier output
a00b51c7 <Wes McKinney> Add benchmarks for object deduplication
ca88b963 <Wes McKinney> Add Python unit tests, deduplicate for date and time types also when converting to Python objects
7a7873b8 <Wes McKinney> First working iteration of string deduplication when calling to_pandas
2018-12-27 12:17:50 -06:00
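The deduplication described in the commit above can be sketched in pure Python: a memo table hands back the first Python object seen for each distinct value, which is the idea behind Arrow's `ScalarMemoTable` (the `deduplicated` helper below is illustrative, not the pyarrow API).

```python
# Pure-Python sketch of the object-deduplication idea: repeated values
# come back as the *same* Python object, so only unique strings stay alive.
def deduplicated(values):
    memo = {}
    # setdefault stores the first object seen for each distinct value and
    # returns that stored object for every later equal value.
    return [memo.setdefault(v, v) for v in values]

# Nine distinct string objects carrying only three unique values
# (runtime concatenation results are not interned by CPython).
values = ["payload-" + str(i % 3) for i in range(9)]

result = deduplicated(values)
assert len({id(x) for x in values}) == 9   # nine separate objects going in
assert len({id(x) for x in result}) == 3   # three shared objects coming out
```

The real implementation does this hashing in C++ while materializing the object array, so the win is both memory and speed when the unique ratio is low.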
    Cast integers with nulls to objects.
date_as_object : bool, default True
GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0 (#35656)
Do not coerce temporal types to nanosecond when pandas >= 2.0 is imported, since pandas now supports s/ms/us time units.
This PR adds support for the following Arrow -> Pandas conversions, which previously all defaulted to `datetime64[ns]` or `datetime64[ns, <TZ>]`:
```
date32 -> datetime64[ms]
date64 -> datetime64[ms]
datetime64[s] -> datetime64[s]
datetime64[ms] -> datetime64[ms]
datetime64[us] -> datetime64[us]
datetime64[s, <TZ>] -> datetime64[s, <TZ>]
datetime64[ms, <TZ>] -> datetime64[ms, <TZ>]
datetime64[us, <TZ>] -> datetime64[us, <TZ>]
```
### Rationale for this change
Pandas 2.0 introduces proper support for temporal types.
### Are these changes tested?
Yes. Pytests added and updated.
### Are there any user-facing changes?
Yes, arrow-to-pandas default conversion behavior will change when users have pandas >= 2.0, but a legacy option is exposed to provide backwards compatibility.
* Closes: #33321
Lead-authored-by: Dane Pitkin <dane@voltrondata.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
2023-07-07 10:45:58 -04:00
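The unit-preserving conversion the commit above describes can be sketched as a small pure-Python mapping (illustrative only; `target_pandas_dtype` is not a pyarrow function, and timezone-aware and duration types are omitted for brevity):

```python
# Sketch of the pandas >= 2.0 behavior: Arrow temporal types keep their
# time unit instead of being coerced to nanoseconds.
def target_pandas_dtype(arrow_type, pandas_major=2):
    """Map an Arrow temporal type name to the pandas dtype it becomes."""
    if pandas_major < 2:
        return "datetime64[ns]"  # legacy: everything coerced to nanoseconds
    if arrow_type in ("date32", "date64"):
        return "datetime64[ms]"
    if arrow_type.startswith("timestamp["):
        unit = arrow_type[len("timestamp["):-1]
        return f"datetime64[{unit}]"  # s/ms/us/ns unit preserved
    raise ValueError(arrow_type)

assert target_pandas_dtype("date32") == "datetime64[ms]"
assert target_pandas_dtype("timestamp[us]") == "datetime64[us]"
assert target_pandas_dtype("timestamp[us]", pandas_major=1) == "datetime64[ns]"
```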
    Cast dates to objects. If False, convert to datetime64 dtype with
    the equivalent time unit (if supported). Note: in pandas version
    < 2.0, only datetime64[ns] conversion is supported.
timestamp_as_object : bool, default False
    Cast non-nanosecond timestamps (np.datetime64) to objects. This is
    useful in pandas version 1.x if you have timestamps that don't fit
    in the normal date range of nanosecond timestamps (1678 CE-2262 CE).
    Non-nanosecond timestamps are supported in pandas version 2.0.
    If False, all timestamps are converted to datetime64 dtype.
use_threads : bool, default True
    Whether to parallelize the conversion using multiple threads.
deduplicate_objects : bool, default True
    Do not create multiple copies of Python objects when created, to
    save on memory use. Conversion will be slower.
ignore_metadata : bool, default False
    If True, do not use the 'pandas' metadata to reconstruct the
    DataFrame index, if present.
safe : bool, default True
    For certain data types, a cast is needed in order to store the
    data in a pandas DataFrame or Series (e.g. timestamps are always
    stored as nanoseconds in pandas). This option controls whether it
    is a safe cast or not.
split_blocks : bool, default False
    If True, generate one internal "block" for each column when
    creating a pandas.DataFrame from a RecordBatch or Table. While this
    can temporarily reduce memory, note that various pandas operations
    can trigger "consolidation" which may balloon memory use.
self_destruct : bool, default False
    EXPERIMENTAL: If True, attempt to deallocate the originating Arrow
    memory while converting the Arrow object to pandas. If you use the
    object after calling to_pandas with this option it will crash your
    program.

    Note that you may not always see memory usage improvements. For
    example, if multiple columns share an underlying allocation,
    memory can't be freed until all columns are converted.
maps_as_pydicts : str, optional, default `None`
    Valid values are `None`, 'lossy', or 'strict'.
    The default behavior (`None`) is to convert Arrow Map arrays to
    Python association lists (list-of-tuples) in the same order as the
    Arrow Map, as in [(key1, value1), (key2, value2), ...].

    If 'lossy' or 'strict', convert Arrow Map arrays to native Python dicts.
    This can change the ordering of (key, value) pairs, and will
    deduplicate multiple keys, resulting in a possible loss of data.

    If 'lossy', this key deduplication results in a warning printed
    when detected. If 'strict', this instead results in an exception
    being raised when detected.
types_mapper : function, default None
    A function mapping a pyarrow DataType to a pandas ExtensionDtype.
    This can be used to override the default pandas type for conversion
    of built-in pyarrow types or in absence of pandas_metadata in the
    Table schema. The function receives a pyarrow DataType and is
    expected to return a pandas ExtensionDtype or ``None`` if the
    default conversion should be used for that type. If you have
    a dictionary mapping, you can pass ``dict.get`` as function.
coerce_temporal_nanoseconds : bool, default False
    Only applicable to pandas version >= 2.0.
    A legacy option to coerce date32, date64, duration, and timestamp
    time units to nanoseconds when converting to pandas. This is the
    default behavior in pandas version 1.x. Set this option to True if
    you'd like to use this coercion when using pandas version >= 2.0
    for backwards compatibility (not recommended otherwise).

Returns
-------
pandas.Series or pandas.DataFrame depending on type of object

Examples
--------
>>> import pyarrow as pa
>>> import pandas as pd

Convert a Table to pandas DataFrame:

>>> table = pa.table([
...     pa.array([2, 4, 5, 100]),
...     pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
... ], names=['n_legs', 'animals'])
>>> table.to_pandas()
   n_legs        animals
0       2       Flamingo
1       4          Horse
2       5  Brittle stars
3     100      Centipede
>>> isinstance(table.to_pandas(), pd.DataFrame)
True

Convert a RecordBatch to pandas DataFrame:

>>> import pyarrow as pa
>>> n_legs = pa.array([2, 4, 5, 100])
>>> animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
>>> batch = pa.record_batch([n_legs, animals],
...                         names=["n_legs", "animals"])
>>> batch
pyarrow.RecordBatch
n_legs: int64
animals: string
----
n_legs: [2,4,5,100]
animals: ["Flamingo","Horse","Brittle stars","Centipede"]
>>> batch.to_pandas()
   n_legs        animals
0       2       Flamingo
1       4          Horse
2       5  Brittle stars
3     100      Centipede
>>> isinstance(batch.to_pandas(), pd.DataFrame)
True

Convert a ChunkedArray to pandas Series:

>>> import pyarrow as pa
>>> n_legs = pa.chunked_array([[2, 2, 4], [4, 5, 100]])
>>> n_legs.to_pandas()
0      2
1      2
2      4
3      4
4      5
5    100
dtype: int64
>>> isinstance(n_legs.to_pandas(), pd.Series)
True
"""
options = dict(
    pool=memory_pool,
    strings_to_categorical=strings_to_categorical,
    zero_copy_only=zero_copy_only,
    integer_object_nulls=integer_object_nulls,
    date_as_object=date_as_object,
    timestamp_as_object=timestamp_as_object,
    use_threads=use_threads,
    deduplicate_objects=deduplicate_objects,
    safe=safe,
    split_blocks=split_blocks,
    self_destruct=self_destruct,
    maps_as_pydicts=maps_as_pydicts,
    coerce_temporal_nanoseconds=coerce_temporal_nanoseconds
)
|
ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays
This adds a `deduplicate_objects` option to all of the `to_pandas` methods. It works with string types, date types (when `date_as_object=True`), and time types.
I also made it so that `ScalarMemoTable` can be used with `string_view`, for more efficient memoization in this case.
I made the default for `deduplicate_objects` True. When the ratio of unique strings to the length of the array is low, not only does this use drastically less memory, it is also faster. I will write some benchmarks to show where the "crossover point" is, beyond which the overhead of hashing makes things slower.
Let's consider a simple case where we have 10,000,000 strings of length 10, but only 1000 unique values:
```
In [50]: import pandas.util.testing as tm
In [51]: unique_values = [tm.rands(10) for i in range(1000)]
In [52]: values = unique_values * 10000
In [53]: arr = pa.array(values)
In [54]: timeit arr.to_pandas()
236 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [55]: timeit arr.to_pandas(deduplicate_objects=False)
730 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Almost 3 times faster in this case. The difference in memory use is even more drastic:
```
In [44]: unique_values = [tm.rands(10) for i in range(1000)]
In [45]: values = unique_values * 10000
In [46]: arr = pa.array(values)
In [49]: %memit result11 = arr.to_pandas()
peak memory: 1505.89 MiB, increment: 76.27 MiB
In [50]: %memit result12 = arr.to_pandas(deduplicate_objects=False)
peak memory: 2202.29 MiB, increment: 696.11 MiB
```
As you can see, this is a huge problem. If our bug reports about Parquet memory use are any indication, users have been suffering from this issue for a long time.
When the strings are mostly unique, things are slower as expected, and the peak memory use is higher because of the hash table:
```
In [17]: unique_values = [tm.rands(10) for i in range(500000)]
In [18]: values = unique_values * 2
In [19]: arr = pa.array(values)
In [20]: timeit result = arr.to_pandas()
177 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: timeit result = arr.to_pandas(deduplicate_objects=False)
70.1 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: %memit result8 = arr.to_pandas()
peak memory: 644.39 MiB, increment: 92.23 MiB
In [43]: %memit result9 = arr.to_pandas(deduplicate_objects=False)
peak memory: 610.85 MiB, increment: 58.41 MiB
```
In real-world work, many duplicated strings is the most common case. Given the massive memory savings and moderate performance improvements, it makes sense to have this enabled by default.
Author: Wes McKinney <wesm+git@apache.org>
Closes #3257 from wesm/ARROW-3928 and squashes the following commits:
d9a88700 <Wes McKinney> Prettier output
a00b51c7 <Wes McKinney> Add benchmarks for object deduplication
ca88b963 <Wes McKinney> Add Python unit tests, deduplicate for date and time types also when converting to Python objects
7a7873b8 <Wes McKinney> First working iteration of string deduplication when calling to_pandas

        return self._to_pandas(options, categories=categories,
                               ignore_metadata=ignore_metadata,
                               types_mapper=types_mapper)
cdef PandasOptions _convert_pandas_options(dict options):
    cdef PandasOptions result
    result.pool = maybe_unbox_memory_pool(options['pool'])
    result.strings_to_categorical = options['strings_to_categorical']
    result.zero_copy_only = options['zero_copy_only']
    result.integer_object_nulls = options['integer_object_nulls']
    result.date_as_object = options['date_as_object']
    result.timestamp_as_object = options['timestamp_as_object']
    result.use_threads = options['use_threads']
    result.deduplicate_objects = options['deduplicate_objects']
    result.safe_cast = options['safe']
    result.split_blocks = options['split_blocks']
    result.self_destruct = options['self_destruct']
    result.coerce_temporal_nanoseconds = options['coerce_temporal_nanoseconds']
    result.ignore_timezone = os.environ.get('PYARROW_IGNORE_TIMEZONE', False)

    maps_as_pydicts = options['maps_as_pydicts']
    if maps_as_pydicts is None:
        result.maps_as_pydicts = MapConversionType.DEFAULT
    elif maps_as_pydicts == "lossy":
        result.maps_as_pydicts = MapConversionType.LOSSY
    elif maps_as_pydicts == "strict":
        result.maps_as_pydicts = MapConversionType.STRICT_
    else:
        raise ValueError(
            "Invalid value for 'maps_as_pydicts': "
            + "valid values are 'lossy', 'strict' or `None` (default). "
            + f"Received '{maps_as_pydicts}'."
        )
    return result
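The option handling above reduces to a three-way mapping plus validation. A pure-Python sketch of the same logic (the integer constants are illustrative stand-ins for the Cython `MapConversionType` enum, not the real values):

```python
# Illustrative stand-ins for the C++ MapConversionType enum members.
DEFAULT, LOSSY, STRICT = range(3)

def convert_maps_as_pydicts(value):
    """Mirror the validation performed in _convert_pandas_options."""
    if value is None:
        return DEFAULT
    elif value == "lossy":
        return LOSSY
    elif value == "strict":
        return STRICT
    raise ValueError(
        "Invalid value for 'maps_as_pydicts': "
        "valid values are 'lossy', 'strict' or `None` (default). "
        f"Received '{value}'."
    )

assert convert_maps_as_pydicts(None) == DEFAULT
assert convert_maps_as_pydicts("strict") == STRICT
```

Validating eagerly here means a typo in the option name's value fails fast in Python, before any C++ conversion work starts.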
cdef class Array(_PandasConvertible):
    """
    The base class for all Arrow arrays.
    """

    def __init__(self):
        raise TypeError(f"Do not call {self.__class__.__name__}'s constructor "
                        "directly, use one of the `pyarrow.Array.from_*` "
                        "functions instead.")

    cdef void init(self, const shared_ptr[CArray]& sp_array) except *:
        self.sp_array = sp_array
        self.ap = sp_array.get()
        self.type = pyarrow_wrap_data_type(self.sp_array.get().type())
ARROW-1199: [C++] Implement mutable POD struct for Array data
This patch provides a new internal data structure that is a self-contained representation of the memory and metadata inside an Arrow array.
This class is designed for easy internal data manipulation, analytical data processing, and data transport to and from IPC messages. For example, we could cast from int64 to float64 like so:
```c++
Int64Array arr = GetMyData();
std::shared_ptr<internal::ArrayData> new_data = arr->data()->ShallowCopy();
new_data->type = arrow::float64();
Float64Array double_arr(new_data);
```
This object is also useful in an analytics setting where memory may be reused. For example, if we had a group of operations all returning doubles, say:
```
Log(Sqrt(Expr(arr)))
```
Then the low-level implementations of each of these functions could have signatures like `void Log(const ArrayData& values, ArrayData* out);`
As another example, a function may consume one or more memory buffers in an input array and replace them with newly-allocated data, changing the output data type as well.
I did quite a bit of refactoring and code simplification that was enabled by this patch. I note that IPC loading of very wide record batches is about 15% slower, but for smaller record batches it is about the same in microbenchmarks. This code path could possibly be made faster with some performance analysis work.
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes #824 from wesm/array-data-internals and squashes the following commits:
f1acbae1 [Wes McKinney] MSVC fixes
dcdf2b29 [Wes McKinney] Fix glib per C++ API changes
d0a8ee2b [Wes McKinney] Fix logic error in UnsafeSetNotNull
d17f886c [Wes McKinney] Construct dictionary indices in ctor
bba42530 [Wes McKinney] Set correct type when creating BinaryArray
ba3b2992 [Wes McKinney] Various fixes, Python fixes, add Array operator<< to std::ostream for debugging
0b8af24a [Wes McKinney] Write field metadata directly into output object
05058638 [Wes McKinney] Fix up cmake
75bc6b4f [Wes McKinney] Delete cruft from array/loader.h and consolidate in arrow/ipc
24df1b97 [Wes McKinney] Review comments, add some doxygen comments
6e2e5720 [Wes McKinney] Preallocate vector of shared_ptr
05b806b2 [Wes McKinney] Tests passing again
5bdd6a99 [Wes McKinney] bug fixes
7894496e [Wes McKinney] Some fixes
bf91a75a [Wes McKinney] Refactor to use shared_ptr, not yet working
130f0c1a [Wes McKinney] Use std::move instead of std::forward
a9b4031b [Wes McKinney] Add move constructors to reduce unnecessary copying
475a3db6 [Wes McKinney] Bug fixes, test suite passing again
16918279 [Wes McKinney] Array internals refactoring to use POD struct for all buffers, auxiliary metadata
    def _debug_print(self):
        with nogil:
            check_status(DebugPrint(deref(self.ap), 0))
    def diff(self, Array other):
        """
        Compare contents of this array against another one.

        Return a string containing the result of diffing this array
        (on the left side) against the other array (on the right side).

        Parameters
        ----------
        other : Array
            The other array to compare this array with.

        Returns
        -------
        diff : str
            A human-readable printout of the differences.

        Examples
        --------
        >>> import pyarrow as pa
        >>> left = pa.array(["one", "two", "three"])
        >>> right = pa.array(["two", None, "two-and-a-half", "three"])
        >>> print(left.diff(right))  # doctest: +SKIP

        @@ -0, +0 @@
        -"one"
        @@ -2, +1 @@
        +null
        +"two-and-a-half"
        """
        self._assert_cpu()
        cdef c_string result
        with nogil:
            result = self.ap.Diff(deref(other.ap))
        return frombytes(result, safe=True)
ARROW-1156: [C++/Python] Expand casting API, add UnaryKernel callable. Use Cast in appropriate places when converting from pandas
cc @cloud-fan
With this patch we now try to cast to the indicated type on ingest of objects from pandas:
```
In [3]: arr = np.array([None] * 5)
In [4]: pa.Array.from_pandas(arr)
Out[4]:
<pyarrow.lib.NullArray object at 0x7f6cf1485d18>
[
NA,
NA,
NA,
NA,
NA
]
In [5]: pa.Array.from_pandas(arr, type=pa.int32())
Out[5]:
<pyarrow.lib.Int32Array object at 0x7f6cf1485d68>
[
NA,
NA,
NA,
NA,
NA
]
```
I also added zero-copy casts from integers of the right size to each of the date and time types.
Includes refactoring for ARROW-1481.
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes #1063 from wesm/ARROW-1156 and squashes the following commits:
166d1a50 [Wes McKinney] iwyu
34f5c9d1 [Wes McKinney] Harden default cast options, fix unsafe Python case
1d07b756 [Wes McKinney] Add some basic casting unit tests in Python
c1b45709 [Wes McKinney] Expose arrow::compute::Cast in Python as Array.cast. Still need to write tests
a9a04c9c [Wes McKinney] UnaryKernel::Call returns Status for now for simplicity. Support pre-allocated memory
8903709b [Wes McKinney] Implement casts from null to numbers. Try to cast for types where we do not have an inference rule when converting from arrays of Python objects
a22dd20a [Wes McKinney] Add test to assert zero copy for compatible integer to date/time
a14b83f7 [Wes McKinney] Create callable CastKernel object. Add zero-copy cast rules for date/time types

    def cast(self, object target_type=None, safe=None, options=None, memory_pool=None):
"""
|
2020-05-28 16:07:19 -05:00
|
|
|
Cast array values to another data type
|
2018-07-28 19:59:51 -04:00
|
|
|
|
2022-03-28 18:55:45 +02:00
|
|
|
See :func:`pyarrow.compute.cast` for usage.
|
|
|
|
|
|
|
|
|
|
Parameters
|
|
|
|
|
----------
|
2022-06-28 16:44:12 -04:00
|
|
|
target_type : DataType, default None
|
2022-03-28 18:55:45 +02:00
|
|
|
Type to cast array to.
|
|
|
|
|
safe : boolean, default True
|
|
|
|
|
Whether to check for conversion errors such as overflow.
|
2022-06-28 16:44:12 -04:00
|
|
|
options : CastOptions, default None
|
|
|
|
|
Additional checks pass by CastOptions
|
2023-03-02 14:29:31 +01:00
|
|
|
memory_pool : MemoryPool, optional
|
|
|
|
|
memory pool to use for allocations during function execution.
|
2022-03-28 18:55:45 +02:00
|
|
|
|
|
|
|
|
Returns
|
|
|
|
|
-------
|
|
|
|
|
cast : Array
|
ARROW-1156: [C++/Python] Expand casting API, add UnaryKernel callable. Use Cast in appropriate places when converting from pandas
cc @cloud-fan
With this patch we now try to cast to indicated type on ingest of objects from pandas:
```
In [3]: arr = np.array([None] * 5)
In [4]: pa.Array.from_pandas(arr)
Out[4]:
<pyarrow.lib.NullArray object at 0x7f6cf1485d18>
[
NA,
NA,
NA,
NA,
NA
]
In [5]: pa.Array.from_pandas(arr, type=pa.int32())
Out[5]:
<pyarrow.lib.Int32Array object at 0x7f6cf1485d68>
[
NA,
NA,
NA,
NA,
NA
]
```
I also added zero-copy casts from integers of the right size to each of the date and time types.
Includes refactoring for ARROW-1481.
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes #1063 from wesm/ARROW-1156 and squashes the following commits:
166d1a50 [Wes McKinney] iwyu
34f5c9d1 [Wes McKinney] Harden default cast options, fix unsafe Python case
1d07b756 [Wes McKinney] Add some basic casting unit tests in Python
c1b45709 [Wes McKinney] Expose arrow::compute::Cast in Python as Array.cast. Still need to write tests
a9a04c9c [Wes McKinney] UnaryKernel::Call returns Status for now for simplicity. Support pre-allocated memory
8903709b [Wes McKinney] Implement casts from null to numbers. Try to cast for types where we do not have an inference rule when converting from arrays of Python objects
a22dd20a [Wes McKinney] Add test to assert zero copy for compatible integer to date/time
a14b83f7 [Wes McKinney] Create callable CastKernel object. Add zero-copy cast rules for date/time types
2017-09-08 10:09:38 -04:00
|
|
|
"""
|
2024-06-26 05:03:02 -04:00
|
|
|
self._assert_cpu()
|
2023-03-02 14:29:31 +01:00
|
|
|
return _pc().cast(self, target_type, safe=safe,
|
|
|
|
|
options=options, memory_pool=memory_pool)
|
    def view(self, object target_type):
        """
        Return zero-copy "view" of array as another data type.

        The data types must have compatible columnar buffer layouts.

        Parameters
        ----------
        target_type : DataType
            Type to construct view as.

        Returns
        -------
        view : Array
        """
        self._assert_cpu()
        cdef DataType type = ensure_type(target_type)
        cdef shared_ptr[CArray] result
        with nogil:
            result = GetResultValue(self.ap.View(type.sp_type))
        return pyarrow_wrap_array(result)
    def sum(self, **kwargs):
        """
        Sum the values in a numerical array.

        See :func:`pyarrow.compute.sum` for full usage.

        Parameters
        ----------
        **kwargs : dict, optional
            Options to pass to :func:`pyarrow.compute.sum`.

        Returns
        -------
        sum : Scalar
            A scalar containing the sum value.
        """
        self._assert_cpu()
        options = _pc().ScalarAggregateOptions(**kwargs)
        return _pc().call_function('sum', [self], options)
def unique(self):
|
|
|
|
|
"""
|
2020-03-25 14:44:38 +01:00
|
|
|
Compute distinct elements in array.
|
2022-03-28 18:55:45 +02:00
|
|
|
|
|
|
|
|
Returns
|
|
|
|
|
-------
|
|
|
|
|
unique : Array
|
|
|
|
|
An array of the same data type, with deduplicated elements.
|
2017-11-17 18:29:49 -05:00
|
|
|
"""
|
2024-06-26 05:03:02 -04:00
|
|
|
self._assert_cpu()
|
ARROW-8792: [C++][Python][R][GLib] New Array compute kernels implementation and execution framework
This patch is a major reworking of our development strategy for implementing array-valued functions and applying them in a query processing setting.
The design was partly inspired by my previous work designing Ibis (https://github.com/ibis-project/ibis -- the "expr" subsystem and the way that operators validate input types and resolve output types). Using only function names and input types, you can determine the output types of each function and resolve the "execute" function that performs a unit of work processing a batch of data. This will allow us to build deferred column expressions and then (eventually) do parallel execution.
There are a ton of details, but one nice thing is that there is now a single API entry point for invoking any function by its name:
```c++
Result<Datum> CallFunction(ExecContext* ctx, const std::string& func_name,
const std::vector<Datum>& args,
const FunctionOptions* options = NULLPTR);
```
What occurs when you do this:
* A `Function` instance is looked up in the global `FunctionRegistry`
* Given the descriptors of `args` (their types and shapes -- array or scalar), the Function searches for `Kernel` that is able to process those types and shapes. A kernel might be able to do `array[T0], array[T1]` or only `scalar[T0], scalar[T1]`, for example. This permits kernel specialization to treat different type and shape combinations
        return _pc().call_function('unique', [self])
    def dictionary_encode(self, null_encoding='mask'):
        """
        Compute dictionary-encoded representation of array.

        See :func:`pyarrow.compute.dictionary_encode` for full usage.

        Parameters
        ----------
        null_encoding : str, default "mask"
            How to handle null entries.

        Returns
        -------
        encoded : DictionaryArray
            A dictionary-encoded version of this array.
        """
        self._assert_cpu()
        options = _pc().DictionaryEncodeOptions(null_encoding)
        return _pc().call_function('dictionary_encode', [self], options)

    def value_counts(self):
        """
        Compute counts of unique elements in array.

        Returns
        -------
        StructArray
            An array of <input type "Values", int64 "Counts"> structs
        """
        self._assert_cpu()
        return _pc().call_function('value_counts', [self])

    @staticmethod
    def from_pandas(obj, mask=None, type=None, bint safe=True,
                    MemoryPool memory_pool=None):
        """
        Convert pandas.Series to an Arrow Array.

        This method uses Pandas semantics about what values indicate
        nulls. See pyarrow.array for more general conversion from arrays or
        sequences to Arrow arrays.

        Parameters
        ----------
        obj : ndarray, pandas.Series, array-like
        mask : array (boolean), optional
            Indicate which values are null (True) or not null (False).
        type : pyarrow.DataType
            Explicit type to attempt to coerce to, otherwise will be inferred
            from the data.
        safe : bool, default True
            Check for overflows or other unsafe conversions.
        memory_pool : pyarrow.MemoryPool, optional
            If not passed, will allocate memory from the currently-set default
            memory pool.

        Notes
        -----
        Localized timestamps will currently be returned as UTC (pandas's native
        representation). Timezone-naive data will be implicitly interpreted as
        UTC.

        Returns
        -------
        array : pyarrow.Array or pyarrow.ChunkedArray
            ChunkedArray is returned if object data overflows binary buffer.
        """
        return array(obj, mask=mask, type=type, safe=safe, from_pandas=True,
                     memory_pool=memory_pool)

    def __reduce__(self):
        self._assert_cpu()
        return _restore_array, \
            (_reduce_array_data(self.sp_array.get().data().get()),)

    @staticmethod
    def from_buffers(DataType type, length, buffers, null_count=-1, offset=0,
                     children=None):
        """
        Construct an Array from a sequence of buffers.

        The concrete type returned depends on the datatype.

        Parameters
        ----------
        type : DataType
            The value type of the array.
        length : int
            The number of values in the array.
        buffers : List[Buffer | None]
            The buffers backing this array.
        null_count : int, default -1
            The number of null entries in the array. Negative value means that
            the null count is not known.
        offset : int, default 0
            The array's logical offset (in values, not in bytes) from the
            start of each buffer.
        children : List[Array], default None
            Nested type children with length matching type.num_fields.

        Returns
        -------
        array : Array
        """
        cdef:
            Buffer buf
            Array child
            vector[shared_ptr[CBuffer]] c_buffers
            vector[shared_ptr[CArrayData]] c_child_data
            shared_ptr[CArrayData] array_data

        children = children or []

        if type.num_fields != len(children):
            raise ValueError("Type's expected number of children "
                             f"({type.num_fields}) did not match the passed number "
                             f"({len(children)})")

        if type.has_variadic_buffers:
            if type.num_buffers > len(buffers):
                raise ValueError("Type's expected number of buffers is at least "
                                 f"{type.num_buffers}, but the passed number is "
                                 f"{len(buffers)}.")
        elif type.num_buffers != len(buffers):
            raise ValueError("Type's expected number of buffers "
                             f"({type.num_buffers}) did not match the passed number "
                             f"({len(buffers)}).")

        for buf in buffers:
            # None will produce a null buffer pointer
            c_buffers.push_back(pyarrow_unwrap_buffer(buf))

        for child in children:
            c_child_data.push_back(child.ap.data())

        array_data = CArrayData.MakeWithChildren(type.sp_type, length,
                                                 c_buffers, c_child_data,
                                                 null_count, offset)
        cdef Array result = pyarrow_wrap_array(MakeArray(array_data))
        result.validate()
        return result

    @property
    def null_count(self):
        self._assert_cpu()
        return self.sp_array.get().null_count()

    @property
    def nbytes(self):
        """
        Total number of bytes consumed by the elements of the array.

        In other words, the sum of bytes from all buffer
        ranges referenced.

        Unlike `get_total_buffer_size` this method will account for array
        offsets.

        If buffers are shared between arrays then the shared
        portion will be counted multiple times.

        The dictionary of dictionary arrays will always be counted in their
        entirety even if the array only references a portion of the dictionary.
        """
        self._assert_cpu()
        cdef CResult[int64_t] c_size_res
        with nogil:
            c_size_res = ReferencedBufferSize(deref(self.ap))
        size = GetResultValue(c_size_res)
        return size

    def get_total_buffer_size(self):
        """
        The sum of bytes in each buffer referenced by the array.

        An array may only reference a portion of a buffer.
        This method will overestimate in this case and return the
        byte size of the entire buffer.

        If a buffer is referenced multiple times then it will
        only be counted once.
        """
        self._assert_cpu()
        cdef int64_t total_buffer_size
        total_buffer_size = TotalBufferSize(deref(self.ap))
        return total_buffer_size

    def __sizeof__(self):
        self._assert_cpu()
        return super(Array, self).__sizeof__() + self.nbytes

    def __iter__(self):
        self._assert_cpu()
        for i in range(len(self)):
            yield self.getitem(i)

    def __repr__(self):
        type_format = object.__repr__(self)
        return f'{type_format}\n{self}'

    def to_string(self, *, int indent=2, int top_level_indent=0, int window=10,
                  int container_window=2, c_bool skip_new_lines=False,
                  int element_size_limit=100):
        """
        Render a "pretty-printed" string representation of the Array.

        Note: for data on a non-CPU device, the full array is copied to CPU
        memory.

        Parameters
        ----------
        indent : int, default 2
            How much to indent the internal items in the string to
            the right, by default ``2``.
        top_level_indent : int, default 0
            How much to indent right the entire content of the array,
            by default ``0``.
        window : int
9,
...
40,
41,
42,
43,
44,
45,
46,
47,
48,
49
] -- child 1 type: list<item: string>
[
[
"roots",
"trunk"
],
[
"trunk",
"crown",
"roots"
],
[
"crown",
"seeds"
],
[
"trunk"
],
[],
[
"crown"
],
[
"seeds",
"crown"
],
[
"seeds",
"roots",
"trunk"
],
[
"roots"
],
[
"crown"
],
...
[
"trunk",
"seeds",
"crown"
],
[
"roots",
"crown",
"trunk"
],
[
"roots"
],
[
"crown",
"trunk",
"roots"
],
[
"crown"
],
[
"crown"
],
[
"trunk"
],
[
"seeds",
"crown",
"roots"
],
[],
[
"trunk",
"roots"
]
], -- is_valid: all not null -- child 0 type: int64
[
0,
1,
2,
3,
4,
5,
6,
7,
8,
9,
...
40,
41,
42,
43,
44,
45,
46,
47,
48,
49
] -- child 1 type: list<item: string>
[
[
"roots",
"trunk"
],
[
"trunk",
"crown",
"roots"
],
[
"crown",
"seeds"
],
[
"trunk"
],
[],
[
"crown"
],
[
"seeds",
"crown"
],
[
"seeds",
"roots",
"trunk"
],
[
"roots"
],
[
"crown"
],
...
[
"trunk",
"seeds",
"crown"
],
[
"roots",
"crown",
"trunk"
],
[
"roots"
],
[
"crown",
"trunk",
"roots"
],
[
"crown"
],
[
"crown"
],
[
"trunk"
],
[
"seeds",
"crown",
"roots"
],
[],
[
"trunk",
"roots"
]
], -- is_valid: all not null -- child 0 type: int64
[
0,
1,
2,
3,
4,
5,
6,
7,
8,
9,
...
40,
41,
42,
43,
44,
45,
46,
47,
48,
49
] -- child 1 type: list<item: string>
[
[
"roots",
"trunk"
],
[
"trunk",
"crown",
"roots"
],
[
"crown",
"seeds"
],
[
"trunk"
],
[],
[
"crown"
],
[
"seeds",
"crown"
],
[
"seeds",
"roots",
"trunk"
],
[
"roots"
],
[
"crown"
],
...
[
"trunk",
"seeds",
"crown"
],
[
"roots",
"crown",
"trunk"
],
[
"roots"
],
[
"crown",
"trunk",
"roots"
],
[
"crown"
],
[
"crown"
],
[
"trunk"
],
[
"seeds",
"crown",
"roots"
],
[],
[
"trunk",
"roots"
]
]]
map: [[ keys:["crown"]values:[4], keys:["seeds"]values:[7], keys:["trunk"]values:[7], keys:["roots","trunk","crown"]values:[4,8,0], keys:["crown","trunk","roots"]values:[3,6,8], keys:["crown","trunk","seeds"]values:[9,3,2], keys:["crown","seeds","roots"]values:[1,3,8], keys:["trunk","seeds"]values:[3,1], keys:[]values:[], keys:["roots","seeds","trunk"]values:[0,8,2],..., keys:[]values:[], keys:["trunk","crown","roots"]values:[7,2,8], keys:["seeds","trunk"]values:[9,5], keys:["trunk"]values:[7], keys:["roots"]values:[1], keys:["crown"]values:[5], keys:["crown","seeds","roots"]values:[2,7,2], keys:[]values:[], keys:[]values:[], keys:["roots","crown","trunk"]values:[2,1,5]],[ keys:["crown"]values:[4], keys:["seeds"]values:[7], keys:["trunk"]values:[7], keys:["roots","trunk","crown"]values:[4,8,0], keys:["crown","trunk","roots"]values:[3,6,8], keys:["crown","trunk","seeds"]values:[9,3,2], keys:["crown","seeds","roots"]values:[1,3,8], keys:["trunk","seeds"]values:[3,1], keys:[]values:[], keys:["roots","seeds","trunk"]values:[0,8,2],..., keys:[]values:[], keys:["trunk","crown","roots"]values:[7,2,8], keys:["seeds","trunk"]values:[9,5], keys:["trunk"]values:[7], keys:["roots"]values:[1], keys:["crown"]values:[5], keys:["crown","seeds","roots"]values:[2,7,2], keys:[]values:[], keys:[]values:[], keys:["roots","crown","trunk"]values:[2,1,5]],[ keys:["crown"]values:[4], keys:["seeds"]values:[7], keys:["trunk"]values:[7], keys:["roots","trunk","crown"]values:[4,8,0], keys:["crown","trunk","roots"]values:[3,6,8], keys:["crown","trunk","seeds"]values:[9,3,2], keys:["crown","seeds","roots"]values:[1,3,8], keys:["trunk","seeds"]values:[3,1], keys:[]values:[], keys:["roots","seeds","trunk"]values:[0,8,2],..., keys:[]values:[], keys:["trunk","crown","roots"]values:[7,2,8], keys:["seeds","trunk"]values:[9,5], keys:["trunk"]values:[7], keys:["roots"]values:[1], keys:["crown"]values:[5], keys:["crown","seeds","roots"]values:[2,7,2], keys:[]values:[], keys:[]values:[], 
keys:["roots","crown","trunk"]values:[2,1,5]],[ keys:["crown"]values:[4], keys:["seeds"]values:[7], keys:["trunk"]values:[7], keys:["roots","trunk","crown"]values:[4,8,0], keys:["crown","trunk","roots"]values:[3,6,8], keys:["crown","trunk","seeds"]values:[9,3,2], keys:["crown","seeds","roots"]values:[1,3,8], keys:["trunk","seeds"]values:[3,1], keys:[]values:[], keys:["roots","seeds","trunk"]values:[0,8,2],..., keys:[]values:[], keys:["trunk","crown","roots"]values:[7,2,8], keys:["seeds","trunk"]values:[9,5], keys:["trunk"]values:[7], keys:["roots"]values:[1], keys:["crown"]values:[5], keys:["crown","seeds","roots"]values:[2,7,2], keys:[]values:[], keys:[]values:[], keys:["roots","crown","trunk"]values:[2,1,5]],[ keys:["crown"]values:[4], keys:["seeds"]values:[7], keys:["trunk"]values:[7], keys:["roots","trunk","crown"]values:[4,8,0], keys:["crown","trunk","roots"]values:[3,6,8], keys:["crown","trunk","seeds"]values:[9,3,2], keys:["crown","seeds","roots"]values:[1,3,8], keys:["trunk","seeds"]values:[3,1], keys:[]values:[], keys:["roots","seeds","trunk"]values:[0,8,2],..., keys:[]values:[], keys:["trunk","crown","roots"]values:[7,2,8], keys:["seeds","trunk"]values:[9,5], keys:["trunk"]values:[7], keys:["roots"]values:[1], keys:["crown"]values:[5], keys:["crown","seeds","roots"]values:[2,7,2], keys:[]values:[], keys:[]values:[], keys:["roots","crown","trunk"]values:[2,1,5]],[ keys:["crown"]values:[4], keys:["seeds"]values:[7], keys:["trunk"]values:[7], keys:["roots","trunk","crown"]values:[4,8,0], keys:["crown","trunk","roots"]values:[3,6,8], keys:["crown","trunk","seeds"]values:[9,3,2], keys:["crown","seeds","roots"]values:[1,3,8], keys:["trunk","seeds"]values:[3,1], keys:[]values:[], keys:["roots","seeds","trunk"]values:[0,8,2],..., keys:[]values:[], keys:["trunk","crown","roots"]values:[7,2,8], keys:["seeds","trunk"]values:[9,5], keys:["trunk"]values:[7], keys:["roots"]values:[1], keys:["crown"]values:[5], keys:["crown","seeds","roots"]values:[2,7,2], 
keys:[]values:[], keys:[]values:[], keys:["roots","crown","trunk"]values:[2,1,5]],[ keys:["crown"]values:[4], keys:["seeds"]values:[7], keys:["trunk"]values:[7], keys:["roots","trunk","crown"]values:[4,8,0], keys:["crown","trunk","roots"]values:[3,6,8], keys:["crown","trunk","seeds"]values:[9,3,2], keys:["crown","seeds","roots"]values:[1,3,8], keys:["trunk","seeds"]values:[3,1], keys:[]values:[], keys:["roots","seeds","trunk"]values:[0,8,2],..., keys:[]values:[], keys:["trunk","crown","roots"]values:[7,2,8], keys:["seeds","trunk"]values:[9,5], keys:["trunk"]values:[7], keys:["roots"]values:[1], keys:["crown"]values:[5], keys:["crown","seeds","roots"]values:[2,7,2], keys:[]values:[], keys:[]values:[], keys:["roots","crown","trunk"]values:[2,1,5]],[ keys:["crown"]values:[4], keys:["seeds"]values:[7], keys:["trunk"]values:[7], keys:["roots","trunk","crown"]values:[4,8,0], keys:["crown","trunk","roots"]values:[3,6,8], keys:["crown","trunk","seeds"]values:[9,3,2], keys:["crown","seeds","roots"]values:[1,3,8], keys:["trunk","seeds"]values:[3,1], keys:[]values:[], keys:["roots","seeds","trunk"]values:[0,8,2],..., keys:[]values:[], keys:["trunk","crown","roots"]values:[7,2,8], keys:["seeds","trunk"]values:[9,5], keys:["trunk"]values:[7], keys:["roots"]values:[1], keys:["crown"]values:[5], keys:["crown","seeds","roots"]values:[2,7,2], keys:[]values:[], keys:[]values:[], keys:["roots","crown","trunk"]values:[2,1,5]],[ keys:["crown"]values:[4], keys:["seeds"]values:[7], keys:["trunk"]values:[7], keys:["roots","trunk","crown"]values:[4,8,0], keys:["crown","trunk","roots"]values:[3,6,8], keys:["crown","trunk","seeds"]values:[9,3,2], keys:["crown","seeds","roots"]values:[1,3,8], keys:["trunk","seeds"]values:[3,1], keys:[]values:[], keys:["roots","seeds","trunk"]values:[0,8,2],..., keys:[]values:[], keys:["trunk","crown","roots"]values:[7,2,8], keys:["seeds","trunk"]values:[9,5], keys:["trunk"]values:[7], keys:["roots"]values:[1], keys:["crown"]values:[5], 
keys:["crown","seeds","roots"]values:[2,7,2], keys:[]values:[], keys:[]values:[], keys:["roots","crown","trunk"]values:[2,1,5]],[ keys:["crown"]values:[4], keys:["seeds"]values:[7], keys:["trunk"]values:[7], keys:["roots","trunk","crown"]values:[4,8,0], keys:["crown","trunk","roots"]values:[3,6,8], keys:["crown","trunk","seeds"]values:[9,3,2], keys:["crown","seeds","roots"]values:[1,3,8], keys:["trunk","seeds"]values:[3,1], keys:[]values:[], keys:["roots","seeds","trunk"]values:[0,8,2],..., keys:[]values:[], keys:["trunk","crown","roots"]values:[7,2,8], keys:["seeds","trunk"]values:[9,5], keys:["trunk"]values:[7], keys:["roots"]values:[1], keys:["crown"]values:[5], keys:["crown","seeds","roots"]values:[2,7,2], keys:[]values:[], keys:[]values:[], keys:["roots","crown","trunk"]values:[2,1,5]]]
```
</details>
            How many primitive items to preview at the beginning and end
            of the array when the array is bigger than the window.
            The other items will be elided.
        container_window : int
            How many container items (such as a list in a list array)
            to preview at the beginning and end of the array when the
            array is bigger than the window.
        skip_new_lines : bool
            If the array should be rendered as a single line of text
            or if each element should be on its own line.
        element_size_limit : int, default 100
            Maximum number of characters of a single element before it is
            truncated.
        """
        cdef:
            c_string result
            PrettyPrintOptions options

        with nogil:
            options = PrettyPrintOptions(top_level_indent, window)
            options.skip_new_lines = skip_new_lines
            options.indent_size = indent
            options.element_size_limit = element_size_limit
            check_status(
                PrettyPrint(
                    deref(self.ap),
                    options,
                    &result
                )
            )

        return frombytes(result, safe=True)
    def __str__(self):
        return self.to_string()

    def __eq__(self, other):
        try:
            return self.equals(other)
        except TypeError:
            # This also handles comparing with None,
            # as Array.equals(None) raises a TypeError.
            return NotImplemented

    def equals(Array self, Array other not None):
        """
        Check whether the contents of two arrays are equal.

        Parameters
        ----------
        other : pyarrow.Array
            The array to compare against.

        Returns
        -------
        bool
        """
        self._assert_cpu()
        other._assert_cpu()
        return self.ap.Equals(deref(other.ap))
    def __len__(self):
        return self.length()

    cdef int64_t length(self):
        if self.sp_array.get():
            return self.sp_array.get().length()
        else:
            return 0

    def is_null(self, *, nan_is_null=False):
        """
        Return BooleanArray indicating the null values.

        Parameters
        ----------
        nan_is_null : bool (optional, default False)
            Whether floating-point NaN values should also be considered null.

        Returns
        -------
        array : boolean Array
        """
        self._assert_cpu()
        options = _pc().NullOptions(nan_is_null=nan_is_null)
        return _pc().call_function('is_null', [self], options)
    def is_nan(self):
        """
        Return BooleanArray indicating the NaN values.

        Returns
        -------
        array : boolean Array
        """
        self._assert_cpu()
        return _pc().call_function('is_nan', [self])

    def is_valid(self):
        """
        Return BooleanArray indicating the non-null values.
        """
        self._assert_cpu()
        return _pc().is_valid(self)

    def fill_null(self, fill_value):
        """
        See :func:`pyarrow.compute.fill_null` for usage.

        Parameters
        ----------
        fill_value : any
            The replacement value for null entries.

        Returns
        -------
        result : Array
            A new array with nulls replaced by the given value.
        """
        self._assert_cpu()
        return _pc().fill_null(self, fill_value)
    def __getitem__(self, key):
        """
        Slice or return value at given index.

        Parameters
        ----------
        key : integer or slice
            Slices with step not equal to 1 (or None) will produce a copy
            rather than a zero-copy view.

        Returns
        -------
        value : Scalar (index) or Array (slice)
        """
        self._assert_cpu()
        if isinstance(key, slice):
            return _normalize_slice(self, key)

        return self.getitem(_normalize_index(key, self.length()))

    cdef getitem(self, int64_t i):
        self._assert_cpu()
        return Scalar.wrap(GetResultValue(self.ap.GetScalar(i)))

    def slice(self, offset=0, length=None):
        """
        Compute zero-copy slice of this array.

        Parameters
        ----------
        offset : int, default 0
            Offset from start of array to slice.
        length : int, default None
            Length of slice (default is until end of Array starting from
            offset).

        Returns
        -------
        sliced : Array
            An array with the same datatype, containing the sliced values.
        """
        cdef shared_ptr[CArray] result

        if offset < 0:
            raise IndexError('Offset must be non-negative')

        offset = min(len(self), offset)
        if length is None:
            result = self.ap.Slice(offset)
        else:
            if length < 0:
                raise ValueError('Length must be non-negative')
            result = self.ap.Slice(offset, length)

        return pyarrow_wrap_array(result)
    def take(self, object indices):
        """
        Select values from an array.

        See :func:`pyarrow.compute.take` for full usage.

        Parameters
        ----------
        indices : Array or array-like
            The indices in the array whose values will be returned.

        Returns
        -------
        taken : Array
            An array with the same datatype, containing the taken values.
        """
        self._assert_cpu()
        return _pc().take(self, indices)

    def drop_null(self):
        """
        Remove missing values from an array.
        """
        self._assert_cpu()
        return _pc().drop_null(self)

    def filter(self, object mask, *, null_selection_behavior='drop'):
        """
        Select values from an array.

        See :func:`pyarrow.compute.filter` for full usage.

        Parameters
        ----------
        mask : Array or array-like
            The boolean mask to filter the array with.
        null_selection_behavior : str, default "drop"
            How nulls in the mask should be handled.

        Returns
        -------
        filtered : Array
            An array of the same type, with only the elements selected by
            the boolean mask.
        """
        self._assert_cpu()
        return _pc().filter(self, mask,
                            null_selection_behavior=null_selection_behavior)
    def index(self, value, start=None, end=None, *, memory_pool=None):
        """
        Find the first index of a value.

        See :func:`pyarrow.compute.index` for full usage.

        Parameters
        ----------
        value : Scalar or object
            The value to look for in the array.
        start : int, optional
            The start index where to look for `value`.
        end : int, optional
            The end index where to look for `value`.
        memory_pool : MemoryPool, optional
            A memory pool for potential memory allocations.

        Returns
        -------
        index : Int64Scalar
            The index of the value in the array (-1 if not found).
        """
        self._assert_cpu()
        return _pc().index(self, value, start, end, memory_pool=memory_pool)

    def sort(self, order="ascending", **kwargs):
        """
        Sort the Array.

        Parameters
        ----------
        order : str, default "ascending"
            Which order to sort values in.
            Accepted values are "ascending", "descending".
        **kwargs : dict, optional
            Additional sorting options.
            As allowed by :class:`SortOptions`.

        Returns
        -------
        result : Array
        """
        self._assert_cpu()
        indices = _pc().sort_indices(
            self,
            options=_pc().SortOptions(sort_keys=[("", order)], **kwargs)
        )
        return self.take(indices)
    def _to_pandas(self, options, types_mapper=None, **kwargs):
        self._assert_cpu()
        return _array_like_to_pandas(self, options, types_mapper=types_mapper)

    def __array__(self, dtype=None, copy=None):
        self._assert_cpu()

        if copy is False:
            try:
                values = self.to_numpy(zero_copy_only=True)
            except ArrowInvalid:
                raise ValueError(
                    "Unable to avoid a copy while creating a numpy array as requested.\n"
                    "If using `np.array(obj, copy=False)` replace it with "
                    "`np.asarray(obj)` to allow a copy when needed"
                )
            # values is already a numpy array at this point, but calling np.array(..)
            # again to handle the `dtype` keyword with a no-copy guarantee
            return np.array(values, dtype=dtype, copy=False)

        values = self.to_numpy(zero_copy_only=False)
        if copy is True and is_numeric(self.type.id) and self.null_count == 0:
            # to_numpy did not yet make a copy (is_numeric = integer/floats, no decimal)
            return np.array(values, dtype=dtype, copy=True)

        if dtype is None:
            return values
        return np.asarray(values, dtype=dtype)

    def to_numpy(self, zero_copy_only=True, writable=False):
        """
        Return a NumPy view or copy of this array.

        By default, tries to return a view of this array. This is only
        supported for primitive arrays with the same memory layout as NumPy
        (i.e. integers, floating point, ..) and without any nulls.

        For extension arrays, this method simply delegates to the
        underlying storage array.

        Parameters
        ----------
        zero_copy_only : bool, default True
            If True, an exception will be raised if the conversion to a numpy
            array would require copying the underlying data (e.g. in presence
            of nulls, or for non-primitive types).
        writable : bool, default False
            For numpy arrays created with zero copy (view on the Arrow data),
            the resulting array is not writable (Arrow data is immutable).
            By setting this to True, a copy of the array is made to ensure
            it is writable.

        Returns
        -------
        array : numpy.ndarray
        """
        self._assert_cpu()

        if np is None:
            raise ImportError(
                "Cannot return a numpy.ndarray if NumPy is not present")
        cdef:
            PyObject* out
            PandasOptions c_options
            object values

        if zero_copy_only and writable:
            raise ValueError(
                "Cannot return a writable array if asking for zero-copy")

        # If there are nulls and the array is a DictionaryArray,
        # decoding the dictionary will make sure nulls are correctly handled.
        # Decoding a dictionary does imply a copy,
        # so it can't be done if the user requested zero-copy.
        c_options.decode_dictionaries = True
        c_options.zero_copy_only = zero_copy_only
        c_options.to_numpy = True

        with nogil:
            check_status(ConvertArrayToPandas(c_options, self.sp_array,
                                              self, &out))

        # wrap_array_output uses pandas to convert to Categorical; here we
        # always convert to a numpy array without a pandas dependency
        array = PyObject_to_object(out)

        if writable and not array.flags.writeable:
            # if the conversion already needed a copy, the result is writable
            array = array.copy()

        return array
    def to_pylist(self, *, maps_as_pydicts=None):
        """
        Convert to a list of native Python objects.

        Parameters
        ----------
        maps_as_pydicts : str, optional, default `None`
            Valid values are `None`, 'lossy', or 'strict'.
            The default behavior (`None`) is to convert Arrow Map arrays to
            Python association lists (list-of-tuples) in the same order as the
            Arrow Map, as in [(key1, value1), (key2, value2), ...].

            If 'lossy' or 'strict', convert Arrow Map arrays to native Python dicts.

            If 'lossy', whenever duplicate keys are detected, a warning will be printed.
            The last seen value of a duplicate key will be in the Python dictionary.
            If 'strict', duplicate keys instead result in an exception being raised.

        Returns
        -------
        lst : list
        """
        self._assert_cpu()
        return [x.as_py(maps_as_pydicts=maps_as_pydicts) for x in self]

    def tolist(self):
        """
        Alias of to_pylist for compatibility with NumPy.
        """
        return self.to_pylist()

    def validate(self, *, full=False):
        """
        Perform validation checks.  An exception is raised if validation fails.

        By default only cheap validation checks are run.  Pass `full=True`
        for thorough validation checks (potentially O(n)).

        Parameters
        ----------
        full : bool, default False
            If True, run expensive checks, otherwise cheap checks only.

        Raises
        ------
        ArrowInvalid
        """
        if full:
            self._assert_cpu()
            with nogil:
                check_status(self.ap.ValidateFull())
        else:
            with nogil:
                check_status(self.ap.Validate())

    @property
    def offset(self):
        """
        A relative position into another array's data.

        The purpose is to enable zero-copy slicing. This value defaults to zero
        but must be applied on all operations with the physical storage
        buffers.
        """
        return self.sp_array.get().offset()

    def buffers(self):
        """
        Return a list of Buffer objects pointing to this array's physical
        storage.

        To correctly interpret these buffers, you need to also apply the offset
        multiplied with the size of the stored data type.
        """
        res = []
        _append_array_buffers(self.sp_array.get().data().get(), res)
        return res

    def copy_to(self, destination):
        """
        Construct a copy of the array with all buffers on destination
        device.

        This method recursively copies the array's buffers and those of its
        children onto the destination MemoryManager device and returns the
        new Array.

        Parameters
        ----------
        destination : pyarrow.MemoryManager or pyarrow.Device
            The destination device to copy the array to.

        Returns
        -------
        Array
        """
        cdef:
            shared_ptr[CArray] c_array
            shared_ptr[CMemoryManager] c_memory_manager

        if isinstance(destination, Device):
            c_memory_manager = (<Device>destination).unwrap().get().default_memory_manager()
        elif isinstance(destination, MemoryManager):
            c_memory_manager = (<MemoryManager>destination).unwrap()
        else:
            raise TypeError(
                "Argument 'destination' has incorrect type (expected a "
                f"pyarrow Device or MemoryManager, got {type(destination)})"
            )

        with nogil:
            c_array = GetResultValue(self.ap.CopyTo(c_memory_manager))
        return pyarrow_wrap_array(c_array)

    def _export_to_c(self, out_ptr, out_schema_ptr=0):
        """
        Export to a C ArrowArray struct, given its pointer.

        If a C ArrowSchema struct pointer is also given, the array type
        is exported to it at the same time.

        Parameters
        ----------
        out_ptr: int
            The raw pointer to a C ArrowArray struct.
        out_schema_ptr: int (optional)
            The raw pointer to a C ArrowSchema struct.

        Be careful: if you don't pass the ArrowArray struct to a consumer,
        array memory will leak.  This is a low-level function intended for
        expert users.
        """
        cdef:
            void* c_ptr = _as_c_pointer(out_ptr)
            void* c_schema_ptr = _as_c_pointer(out_schema_ptr,
                                               allow_null=True)
        with nogil:
            check_status(ExportArray(deref(self.sp_array),
                                     <ArrowArray*> c_ptr,
                                     <ArrowSchema*> c_schema_ptr))

    @staticmethod
    def _import_from_c(in_ptr, type):
        """
        Import Array from a C ArrowArray struct, given its pointer
        and the imported array type.

        Parameters
        ----------
        in_ptr: int
            The raw pointer to a C ArrowArray struct.
        type: DataType or int
            Either a DataType object, or the raw pointer to a C ArrowSchema
            struct.

        This is a low-level function intended for expert users.
        """
        cdef:
            void* c_ptr = _as_c_pointer(in_ptr)
            void* c_type_ptr
            shared_ptr[CArray] c_array

        c_type = pyarrow_unwrap_data_type(type)
        if c_type == nullptr:
            # Not a DataType object, perhaps a raw ArrowSchema pointer
            c_type_ptr = _as_c_pointer(type)
            with nogil:
                c_array = GetResultValue(ImportArray(
                    <ArrowArray*> c_ptr, <ArrowSchema*> c_type_ptr))
        else:
            with nogil:
                c_array = GetResultValue(ImportArray(<ArrowArray*> c_ptr,
                                                     c_type))
        return pyarrow_wrap_array(c_array)

    def __arrow_c_array__(self, requested_schema=None):
        """
        Get a pair of PyCapsules containing a C ArrowArray representation of the object.

        Parameters
        ----------
        requested_schema : PyCapsule | None
            A PyCapsule containing a C ArrowSchema representation of a requested
            schema. PyArrow will attempt to cast the array to this data type.
            If None, the array will be returned as-is, with a type matching the
            one returned by :meth:`__arrow_c_schema__()`.

        Returns
        -------
        Tuple[PyCapsule, PyCapsule]
            A pair of PyCapsules containing a C ArrowSchema and ArrowArray,
            respectively.
        """
        self._assert_cpu()

        cdef:
            ArrowArray* c_array
            ArrowSchema* c_schema
            shared_ptr[CArray] inner_array

        if requested_schema is not None:
            target_type = DataType._import_from_c_capsule(requested_schema)

            if target_type != self.type:
                try:
                    casted_array = _pc().cast(self, target_type, safe=True)
                    inner_array = pyarrow_unwrap_array(casted_array)
                except ArrowInvalid as e:
                    raise ValueError(
                        f"Could not cast {self.type} to requested type {target_type}: {e}"
                    )
            else:
                inner_array = self.sp_array
        else:
            inner_array = self.sp_array

        schema_capsule = alloc_c_schema(&c_schema)
        array_capsule = alloc_c_array(&c_array)

        with nogil:
            check_status(ExportArray(deref(inner_array), c_array, c_schema))

        return schema_capsule, array_capsule

    @staticmethod
    def _import_from_c_capsule(schema_capsule, array_capsule):
        cdef:
            ArrowSchema* c_schema
            ArrowArray* c_array
            shared_ptr[CArray] array

        c_schema = <ArrowSchema*> PyCapsule_GetPointer(schema_capsule, 'arrow_schema')
        c_array = <ArrowArray*> PyCapsule_GetPointer(array_capsule, 'arrow_array')

        with nogil:
            array = GetResultValue(ImportArray(c_array, c_schema))

        return pyarrow_wrap_array(array)

    def _export_to_c_device(self, out_ptr, out_schema_ptr=0):
        """
        Export to a C ArrowDeviceArray struct, given its pointer.

        If a C ArrowSchema struct pointer is also given, the array type
        is exported to it at the same time.

        Parameters
        ----------
        out_ptr: int
            The raw pointer to a C ArrowDeviceArray struct.
        out_schema_ptr: int (optional)
            The raw pointer to a C ArrowSchema struct.

        Be careful: if you don't pass the ArrowDeviceArray struct to a consumer,
        array memory will leak.  This is a low-level function intended for
        expert users.
        """
        cdef:
            void* c_ptr = _as_c_pointer(out_ptr)
            void* c_schema_ptr = _as_c_pointer(out_schema_ptr,
                                               allow_null=True)
        with nogil:
            check_status(ExportDeviceArray(
                deref(self.sp_array), <shared_ptr[CSyncEvent]>NULL,
                <ArrowDeviceArray*> c_ptr, <ArrowSchema*> c_schema_ptr))

    @staticmethod
    def _import_from_c_device(in_ptr, type):
        """
        Import Array from a C ArrowDeviceArray struct, given its pointer
        and the imported array type.

        Parameters
        ----------
        in_ptr: int
            The raw pointer to a C ArrowDeviceArray struct.
        type: DataType or int
            Either a DataType object, or the raw pointer to a C ArrowSchema
            struct.

        This is a low-level function intended for expert users.
        """
        cdef:
            ArrowDeviceArray* c_device_array = <ArrowDeviceArray*>_as_c_pointer(in_ptr)
            void* c_type_ptr
            shared_ptr[CArray] c_array

        if c_device_array.device_type == ARROW_DEVICE_CUDA:
            _ensure_cuda_loaded()

        c_type = pyarrow_unwrap_data_type(type)
        if c_type == nullptr:
            # Not a DataType object, perhaps a raw ArrowSchema pointer
            c_type_ptr = _as_c_pointer(type)
            with nogil:
                c_array = GetResultValue(
                    ImportDeviceArray(c_device_array, <ArrowSchema*> c_type_ptr)
                )
        else:
            with nogil:
                c_array = GetResultValue(
                    ImportDeviceArray(c_device_array, c_type)
                )
        return pyarrow_wrap_array(c_array)

    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
        """
        Get a pair of PyCapsules containing a C ArrowDeviceArray representation
        of the object.

        Parameters
        ----------
        requested_schema : PyCapsule | None
            A PyCapsule containing a C ArrowSchema representation of a requested
            schema. PyArrow will attempt to cast the array to this data type.
            If None, the array will be returned as-is, with a type matching the
            one returned by :meth:`__arrow_c_schema__()`.
        kwargs
            Currently no additional keyword arguments are supported, but
            this method will accept any keyword with a value of ``None``
            for compatibility with future keywords.

        Returns
        -------
        Tuple[PyCapsule, PyCapsule]
            A pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray,
            respectively.
        """
        cdef:
            ArrowDeviceArray* c_array
            ArrowSchema* c_schema
            shared_ptr[CArray] inner_array

        non_default_kwargs = [
            name for name, value in kwargs.items() if value is not None
        ]
        if non_default_kwargs:
            raise NotImplementedError(
                f"Received unsupported keyword argument(s): {non_default_kwargs}"
            )

        if requested_schema is not None:
            target_type = DataType._import_from_c_capsule(requested_schema)

            if target_type != self.type:
                if not self.is_cpu:
                    raise NotImplementedError(
                        "Casting to a requested schema is only supported for CPU data"
                    )
                try:
                    casted_array = _pc().cast(self, target_type, safe=True)
                    inner_array = pyarrow_unwrap_array(casted_array)
                except ArrowInvalid as e:
                    raise ValueError(
                        f"Could not cast {self.type} to requested type {target_type}: {e}"
                    )
            else:
                inner_array = self.sp_array
        else:
            inner_array = self.sp_array

        schema_capsule = alloc_c_schema(&c_schema)
        array_capsule = alloc_c_device_array(&c_array)

        with nogil:
            check_status(ExportDeviceArray(
                deref(inner_array), <shared_ptr[CSyncEvent]>NULL,
                c_array, c_schema))

        return schema_capsule, array_capsule

    @staticmethod
    def _import_from_c_device_capsule(schema_capsule, array_capsule):
        cdef:
            ArrowSchema* c_schema
            ArrowDeviceArray* c_array
            shared_ptr[CArray] array

        c_schema = <ArrowSchema*> PyCapsule_GetPointer(schema_capsule, 'arrow_schema')
        c_array = <ArrowDeviceArray*> PyCapsule_GetPointer(
            array_capsule, 'arrow_device_array'
        )

        with nogil:
            array = GetResultValue(ImportDeviceArray(c_array, c_schema))

        return pyarrow_wrap_array(array)

    def __dlpack__(self, stream=None):
        """
        Export a primitive array as a DLPack capsule.

        Parameters
        ----------
        stream : int, optional
            A Python integer representing a pointer to a stream. Currently not supported.
            Stream is provided by the consumer to the producer to instruct the producer
            to ensure that operations can safely be performed on the array.

        Returns
        -------
        capsule : PyCapsule
            A DLPack capsule for the array, pointing to a DLManagedTensor.
        """
        if stream is None:
            dlm_tensor = GetResultValue(ExportArrayToDLPack(self.sp_array))

            return PyCapsule_New(dlm_tensor, 'dltensor', dlpack_pycapsule_deleter)
        else:
            raise NotImplementedError(
                "Only stream=None is supported."
            )

    def __dlpack_device__(self):
        """
        Return the DLPack device tuple this array resides on.

        Returns
        -------
        tuple : Tuple[int, int]
            Tuple with index specifying the type of the device (where
            CPU = 1, see cpp/src/arrow/c/dlpack_abi.h) and index of the
            device which is 0 by default for CPU.
        """
        device = GetResultValue(ExportDevice(self.sp_array))
        return device.device_type, device.device_id

    @property
    def device_type(self):
        """
        The device type where the array resides.

        Returns
        -------
        DeviceAllocationType
        """
        return _wrap_device_allocation_type(self.sp_array.get().device_type())

    @property
    def is_cpu(self):
        """
        Whether the array is CPU-accessible.
        """
        return self.device_type == DeviceAllocationType.CPU

    cdef void _assert_cpu(self) except *:
        if self.sp_array.get().device_type() != CDeviceAllocationType_kCPU:
            raise NotImplementedError("Implemented only for data on CPU device")

    @property
    def statistics(self):
        """
        Statistics of the array.
        """
        cdef ArrayStatistics stat
        sp_stat = self.sp_array.get().statistics()
        if sp_stat.get() == nullptr:
            return None
        else:
            stat = ArrayStatistics.__new__(ArrayStatistics)
            stat.init(sp_stat)
            return stat

    def __abs__(self):
        self._assert_cpu()
        return _pc().call_function('abs_checked', [self])

    def __add__(self, object other):
        self._assert_cpu()
        return _pc().call_function('add_checked', [self, other])

    def __truediv__(self, object other):
        self._assert_cpu()
        return _pc().call_function('divide_checked', [self, other])

    def __mul__(self, object other):
        self._assert_cpu()
        return _pc().call_function('multiply_checked', [self, other])

    def __neg__(self):
        self._assert_cpu()
        return _pc().call_function('negate_checked', [self])

    def __pow__(self, object other):
        self._assert_cpu()
        return _pc().call_function('power_checked', [self, other])

    def __sub__(self, object other):
        self._assert_cpu()
        return _pc().call_function('subtract_checked', [self, other])

    def __and__(self, object other):
        self._assert_cpu()
        return _pc().call_function('bit_wise_and', [self, other])

    def __or__(self, object other):
        self._assert_cpu()
        return _pc().call_function('bit_wise_or', [self, other])

    def __xor__(self, object other):
        self._assert_cpu()
        return _pc().call_function('bit_wise_xor', [self, other])

    def __lshift__(self, object other):
        self._assert_cpu()
        return _pc().call_function('shift_left_checked', [self, other])

    def __rshift__(self, object other):
        self._assert_cpu()
        return _pc().call_function('shift_right_checked', [self, other])


cdef _array_like_to_pandas(obj, options, types_mapper):
    cdef:
        PyObject* out
        PandasOptions c_options = _convert_pandas_options(options)

    original_type = obj.type
    name = obj._name
    dtype = None

    if types_mapper:
        dtype = types_mapper(original_type)
    elif original_type.id == _Type_EXTENSION:
        try:
            dtype = original_type.to_pandas_dtype()
        except NotImplementedError:
            pass
    elif pandas_api.uses_string_dtype() and not options["strings_to_categorical"] and (
        original_type.id == _Type_STRING or
        original_type.id == _Type_LARGE_STRING or
        original_type.id == _Type_STRING_VIEW
    ):
        # for pandas 3.0+, use pandas' new default string dtype
        dtype = pandas_api.pd.StringDtype(na_value=np.nan)

    # Only call __from_arrow__ for Arrow extension types or when explicitly
    # overridden via types_mapper
    if hasattr(dtype, '__from_arrow__'):
        arr = dtype.__from_arrow__(obj)
        return pandas_api.series(arr, name=name, copy=False)

    if pandas_api.is_v1():
        # ARROW-3789: Coerce date/timestamp types to datetime64[ns]
        c_options.coerce_temporal_nanoseconds = True

    if isinstance(obj, Array):
        with nogil:
            check_status(ConvertArrayToPandas(c_options,
                                              (<Array> obj).sp_array,
                                              obj, &out))
    elif isinstance(obj, ChunkedArray):
        with nogil:
            check_status(libarrow_python.ConvertChunkedArrayToPandas(
                c_options,
                (<ChunkedArray> obj).sp_chunked_array,
                obj, &out))

    arr = wrap_array_output(out)

    if (isinstance(original_type, TimestampType) and
            options["timestamp_as_object"]):
        # ARROW-5359 - specify object dtype to avoid pandas coercing
        # back to ns resolution
        dtype = "object"
    elif types_mapper:
        dtype = types_mapper(original_type)
    else:
        dtype = None

    result = pandas_api.series(arr, dtype=dtype, name=name, copy=False)

    if (isinstance(original_type, TimestampType) and
            original_type.tz is not None and
            # can be object dtype for non-ns and timestamp_as_object=True
            result.dtype.kind == "M"):
        from pyarrow.pandas_compat import make_tz_aware
        result = make_tz_aware(result, original_type.tz)

    return result


cdef wrap_array_output(PyObject* output):
    cdef object obj = PyObject_to_object(output)

    if isinstance(obj, dict):
        return _pandas_api.categorical_type.from_codes(
            obj['indices'], categories=obj['dictionary'], ordered=obj['ordered']
        )
    else:
        return obj
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cdef class NullArray(Array):
|
2019-01-10 21:05:31 +01:00
|
|
|
"""
|
|
|
|
|
Concrete class for Arrow arrays of null data type.
|
|
|
|
|
"""
|
2017-04-13 12:51:47 +02:00
|
|
|
|
|
|
|
|
|
|
|
|
|
cdef class BooleanArray(Array):
|
2019-01-10 21:05:31 +01:00
|
|
|
"""
|
|
|
|
|
Concrete class for Arrow arrays of boolean data type.
|
|
|
|
|
"""
|
2020-06-17 19:27:16 -05:00
|
|
|
@property
|
|
|
|
|
def false_count(self):
|
|
|
|
|
return (<CBooleanArray*> self.ap).false_count()
|
|
|
|
|
|
|
|
|
|
@property
|
|
|
|
|
def true_count(self):
|
|
|
|
|
return (<CBooleanArray*> self.ap).true_count()
|
2017-04-13 12:51:47 +02:00
|
|
|
|
|
|
|
|
|
|
|
|
|
cdef class NumericArray(Array):
|
2019-01-10 21:05:31 +01:00
|
|
|
"""
|
|
|
|
|
A base class for Arrow numeric arrays.
|
|
|
|
|
"""
|
2017-04-13 12:51:47 +02:00
|
|
|
|
|
|
|
|
|
|
|
|
|
cdef class IntegerArray(NumericArray):
|
2019-01-10 21:05:31 +01:00
|
|
|
"""
|
|
|
|
|
A base class for Arrow integer arrays.
|
|
|
|
|
"""
|
2017-04-13 12:51:47 +02:00
|
|
|
|
|
|
|
|
|
|
|
|
|
cdef class FloatingPointArray(NumericArray):
|
2019-01-10 21:05:31 +01:00
|
|
|
"""
|
|
|
|
|
A base class for Arrow floating-point arrays.
|
|
|
|
|
"""
|
2017-04-13 12:51:47 +02:00
|
|
|
|
|
|
|
|
|
|
|
|
|
cdef class Int8Array(IntegerArray):
    """
    Concrete class for Arrow arrays of int8 data type.
    """


cdef class UInt8Array(IntegerArray):
    """
    Concrete class for Arrow arrays of uint8 data type.
    """


cdef class Int16Array(IntegerArray):
    """
    Concrete class for Arrow arrays of int16 data type.
    """


cdef class UInt16Array(IntegerArray):
    """
    Concrete class for Arrow arrays of uint16 data type.
    """


cdef class Int32Array(IntegerArray):
    """
    Concrete class for Arrow arrays of int32 data type.
    """


cdef class UInt32Array(IntegerArray):
    """
    Concrete class for Arrow arrays of uint32 data type.
    """


cdef class Int64Array(IntegerArray):
    """
    Concrete class for Arrow arrays of int64 data type.
    """


cdef class UInt64Array(IntegerArray):
    """
    Concrete class for Arrow arrays of uint64 data type.
    """


cdef class Date32Array(NumericArray):
    """
    Concrete class for Arrow arrays of date32 data type.
    """


cdef class Date64Array(NumericArray):
    """
    Concrete class for Arrow arrays of date64 data type.
    """


cdef class TimestampArray(NumericArray):
    """
    Concrete class for Arrow arrays of timestamp data type.
    """


cdef class Time32Array(NumericArray):
    """
    Concrete class for Arrow arrays of time32 data type.
    """


cdef class Time64Array(NumericArray):
    """
    Concrete class for Arrow arrays of time64 data type.
    """


cdef class DurationArray(NumericArray):
    """
    Concrete class for Arrow arrays of duration data type.
    """
cdef class MonthDayNanoIntervalArray(Array):
    """
    Concrete class for Arrow arrays of interval[MonthDayNano] type.
    """

    def to_pylist(self, *, maps_as_pydicts=None):
        """
        Convert to a list of native Python objects.

        pyarrow.MonthDayNano is used as the native representation.

        Parameters
        ----------
        maps_as_pydicts : str, optional, default `None`
            Valid values are `None`, 'lossy', or 'strict'.
            This parameter is ignored for non-nested Scalars.

        Returns
        -------
        lst : list
        """
        cdef:
            CResult[PyObject*] maybe_py_list
            PyObject* py_list
            CMonthDayNanoIntervalArray* array
        array = <CMonthDayNanoIntervalArray*>self.sp_array.get()
        maybe_py_list = MonthDayNanoIntervalArrayToPyList(deref(array))
        py_list = GetResultValue(maybe_py_list)
        return PyObject_to_object(py_list)
cdef class HalfFloatArray(FloatingPointArray):
    """
    Concrete class for Arrow arrays of float16 data type.
    """


cdef class FloatArray(FloatingPointArray):
    """
    Concrete class for Arrow arrays of float32 data type.
    """


cdef class DoubleArray(FloatingPointArray):
    """
    Concrete class for Arrow arrays of float64 data type.
    """


cdef class FixedSizeBinaryArray(Array):
    """
    Concrete class for Arrow arrays of a fixed-size binary data type.
    """
cdef class Decimal32Array(FixedSizeBinaryArray):
    """
    Concrete class for Arrow arrays of decimal32 data type.
    """


cdef class Decimal64Array(FixedSizeBinaryArray):
    """
    Concrete class for Arrow arrays of decimal64 data type.
    """


cdef class Decimal128Array(FixedSizeBinaryArray):
    """
    Concrete class for Arrow arrays of decimal128 data type.
    """


cdef class Decimal256Array(FixedSizeBinaryArray):
    """
    Concrete class for Arrow arrays of decimal256 data type.
    """
cdef class BaseListArray(Array):

    def flatten(self, recursive=False):
        """
        Unnest this [Large]ListArray/[Large]ListViewArray/FixedSizeListArray
        according to 'recursive'.

        Note that this method is different from ``self.values`` in that
        it takes care of the slicing offset as well as null elements backed
        by non-empty sub-lists.

        Parameters
        ----------
        recursive : bool, default False, optional
            When True, flatten this logical list-array recursively until an
            array of non-list values is formed.

            When False, flatten only the top level.

        Returns
        -------
        result : Array

        Examples
        --------
        Basic logical list-array's flatten

        >>> import pyarrow as pa
        >>> values = [1, 2, 3, 4]
        >>> offsets = [2, 1, 0]
        >>> sizes = [2, 2, 2]
        >>> array = pa.ListViewArray.from_arrays(offsets, sizes, values)
        >>> array
        <pyarrow.lib.ListViewArray object at ...>
        [
          [
            3,
            4
          ],
          [
            2,
            3
          ],
          [
            1,
            2
          ]
        ]
        >>> array.flatten()
        <pyarrow.lib.Int64Array object at ...>
        [
          3,
          4,
          2,
          3,
          1,
          2
        ]

        When recursive=True, nested list arrays are flattened recursively
        until an array of non-list values is formed.

        >>> array = pa.array([
        ...     None,
        ...     [
        ...         [1, None, 2],
        ...         None,
        ...         [3, 4]
        ...     ],
        ...     [],
        ...     [
        ...         [],
        ...         [5, 6],
        ...         None
        ...     ],
        ...     [
        ...         [7, 8]
        ...     ]
        ... ], type=pa.list_(pa.list_(pa.int64())))
        >>> array.flatten(True)
        <pyarrow.lib.Int64Array object at ...>
        [
          1,
          null,
          2,
          3,
          4,
          5,
          6,
          7,
          8
        ]
        """
        options = _pc().ListFlattenOptions(recursive)
        return _pc().list_flatten(self, options=options)

    def value_parent_indices(self):
        """
        Return array of same length as list child values array where each
        output value is the index of the parent list array slot containing each
        child value.

        Examples
        --------
        >>> import pyarrow as pa
        >>> arr = pa.array([[1, 2, 3], [], None, [4]],
        ...                type=pa.list_(pa.int32()))
        >>> arr.value_parent_indices()
        <pyarrow.lib.Int64Array object at ...>
        [
          0,
          0,
          0,
          3
        ]
        """
        return _pc().list_parent_indices(self)

    def value_lengths(self):
        """
        Return integers array with values equal to the respective length of
        each list element. Null list values are null in the output.

        Examples
        --------
        >>> import pyarrow as pa
        >>> arr = pa.array([[1, 2, 3], [], None, [4]],
        ...                type=pa.list_(pa.int32()))
        >>> arr.value_lengths()
        <pyarrow.lib.Int32Array object at ...>
        [
          3,
          0,
          null,
          1
        ]
        """
        return _pc().list_value_length(self)
cdef class ListArray(BaseListArray):
    """
    Concrete class for Arrow arrays of a list data type.
    """

    @staticmethod
    def from_arrays(offsets, values, DataType type=None, MemoryPool pool=None, mask=None):
        """
        Construct ListArray from arrays of int32 offsets and values.

        Parameters
        ----------
        offsets : Array (int32 type)
        values : Array (any type)
        type : DataType, optional
            If not specified, a default ListType with the values' type is
            used.
        pool : MemoryPool, optional
        mask : Array (boolean type), optional
            Indicate which values are null (True) or not null (False).

        Returns
        -------
        list_array : ListArray

        Examples
        --------
        >>> import pyarrow as pa
        >>> values = pa.array([1, 2, 3, 4])
        >>> offsets = pa.array([0, 2, 4])
        >>> pa.ListArray.from_arrays(offsets, values)
        <pyarrow.lib.ListArray object at ...>
        [
          [
            1,
            2
          ],
          [
            3,
            4
          ]
        ]
        >>> # nulls in the offsets array become null lists
        >>> offsets = pa.array([0, None, 2, 4])
        >>> pa.ListArray.from_arrays(offsets, values)
        <pyarrow.lib.ListArray object at ...>
        [
          [
            1,
            2
          ],
          null,
          [
            3,
            4
          ]
        ]
        """
        cdef:
            Array _offsets, _values
            shared_ptr[CArray] out
            shared_ptr[CBuffer] c_mask
        cdef CMemoryPool* cpool = maybe_unbox_memory_pool(pool)

        _offsets = asarray(offsets, type='int32')
        _values = asarray(values)

        c_mask = c_mask_inverted_from_obj(mask, pool)

        if type is not None:
            with nogil:
                out = GetResultValue(
                    CListArray.FromArraysAndType(
                        type.sp_type, _offsets.ap[0], _values.ap[0], cpool, c_mask))
        else:
            with nogil:
                out = GetResultValue(
                    CListArray.FromArrays(
                        _offsets.ap[0], _values.ap[0], cpool, c_mask))
        cdef Array result = pyarrow_wrap_array(out)
        result.validate()
        return result

    @property
    def values(self):
        """
        Return the underlying array of values which backs the ListArray
        ignoring the array's offset.

        If any of the list elements are null, but are backed by a
        non-empty sub-list, those elements will be included in the
        output.

        Compare with :meth:`flatten`, which returns only the non-null
        values taking into consideration the array's offset.

        Returns
        -------
        values : Array

        See Also
        --------
        ListArray.flatten : ...

        Examples
        --------
        The values include null elements from sub-lists:

        >>> import pyarrow as pa
        >>> array = pa.array([[1, 2], None, [3, 4, None, 6]])
        >>> array.values
        <pyarrow.lib.Int64Array object at ...>
        [
          1,
          2,
          3,
          4,
          null,
          6
        ]

        If an array is sliced, the slice still uses the same
        underlying data as the original array, just with an
        offset. Since values ignores the offset, the values are the
        same:

        >>> sliced = array.slice(1, 2)
        >>> sliced
        <pyarrow.lib.ListArray object at ...>
        [
          null,
          [
            3,
            4,
            null,
            6
          ]
        ]
        >>> sliced.values
        <pyarrow.lib.Int64Array object at ...>
        [
          1,
          2,
          3,
          4,
          null,
          6
        ]
        """
        cdef CListArray* arr = <CListArray*> self.ap
        return pyarrow_wrap_array(arr.values())

    @property
    def offsets(self):
        """
        Return the list offsets as an int32 array.

        The returned array will not have a validity bitmap, so you cannot
        expect to pass it to `ListArray.from_arrays` and get back the same
        list array if the original one has nulls.

        Returns
        -------
        offsets : Int32Array

        Examples
        --------
        >>> import pyarrow as pa
        >>> array = pa.array([[1, 2], None, [3, 4, 5]])
        >>> array.offsets
        <pyarrow.lib.Int32Array object at ...>
        [
          0,
          2,
          2,
          5
        ]
        """
        return pyarrow_wrap_array((<CListArray*> self.ap).offsets())
cdef class LargeListArray(BaseListArray):
    """
    Concrete class for Arrow arrays of a large list data type.

    Identical to ListArray, but 64-bit offsets.
    """

    @staticmethod
    def from_arrays(offsets, values, DataType type=None, MemoryPool pool=None, mask=None):
        """
        Construct LargeListArray from arrays of int64 offsets and values.

        Parameters
        ----------
        offsets : Array (int64 type)
        values : Array (any type)
        type : DataType, optional
            If not specified, a default ListType with the values' type is
            used.
        pool : MemoryPool, optional
        mask : Array (boolean type), optional
            Indicate which values are null (True) or not null (False).

        Returns
        -------
        list_array : LargeListArray
        """
        cdef:
            Array _offsets, _values
            shared_ptr[CArray] out
            shared_ptr[CBuffer] c_mask

        cdef CMemoryPool* cpool = maybe_unbox_memory_pool(pool)

        _offsets = asarray(offsets, type='int64')
        _values = asarray(values)

        c_mask = c_mask_inverted_from_obj(mask, pool)

        if type is not None:
            with nogil:
                out = GetResultValue(
                    CLargeListArray.FromArraysAndType(
                        type.sp_type, _offsets.ap[0], _values.ap[0], cpool, c_mask))
        else:
            with nogil:
                out = GetResultValue(
                    CLargeListArray.FromArrays(
                        _offsets.ap[0], _values.ap[0], cpool, c_mask))
        cdef Array result = pyarrow_wrap_array(out)
        result.validate()
        return result

    @property
    def values(self):
        """
        Return the underlying array of values which backs the LargeListArray
        ignoring the array's offset.

        If any of the list elements are null, but are backed by a
        non-empty sub-list, those elements will be included in the
        output.

        Compare with :meth:`flatten`, which returns only the non-null
        values taking into consideration the array's offset.

        Returns
        -------
        values : Array

        See Also
        --------
        LargeListArray.flatten : ...

        Examples
        --------
        The values include null elements from the sub-lists:

        >>> import pyarrow as pa
        >>> array = pa.array(
        ...     [[1, 2], None, [3, 4, None, 6]],
        ...     type=pa.large_list(pa.int32()),
        ... )
        >>> array.values
        <pyarrow.lib.Int32Array object at ...>
        [
          1,
          2,
          3,
          4,
          null,
          6
        ]

        If an array is sliced, the slice still uses the same
        underlying data as the original array, just with an
        offset. Since values ignores the offset, the values are the
        same:

        >>> sliced = array.slice(1, 2)
        >>> sliced
        <pyarrow.lib.LargeListArray object at ...>
        [
          null,
          [
            3,
            4,
            null,
            6
          ]
        ]
        >>> sliced.values
        <pyarrow.lib.Int32Array object at ...>
        [
          1,
          2,
          3,
          4,
          null,
          6
        ]
        """
        cdef CLargeListArray* arr = <CLargeListArray*> self.ap
        return pyarrow_wrap_array(arr.values())

    @property
    def offsets(self):
        """
        Return the list offsets as an int64 array.

        The returned array will not have a validity bitmap, so you cannot
        expect to pass it to `LargeListArray.from_arrays` and get back the
        same list array if the original one has nulls.

        Returns
        -------
        offsets : Int64Array
        """
        return pyarrow_wrap_array((<CLargeListArray*> self.ap).offsets())
cdef class ListViewArray(BaseListArray):
    """
    Concrete class for Arrow arrays of a list view data type.
    """

    @staticmethod
    def from_arrays(offsets, sizes, values, DataType type=None, MemoryPool pool=None, mask=None):
        """
        Construct ListViewArray from arrays of int32 offsets, sizes, and values.

        Parameters
        ----------
        offsets : Array (int32 type)
        sizes : Array (int32 type)
        values : Array (any type)
        type : DataType, optional
            If not specified, a default ListType with the values' type is
            used.
        pool : MemoryPool, optional
        mask : Array (boolean type), optional
            Indicate which values are null (True) or not null (False).

        Returns
        -------
        list_view_array : ListViewArray

        Examples
        --------
        >>> import pyarrow as pa
        >>> values = pa.array([1, 2, 3, 4])
        >>> offsets = pa.array([0, 1, 2])
        >>> sizes = pa.array([2, 2, 2])
        >>> pa.ListViewArray.from_arrays(offsets, sizes, values)
        <pyarrow.lib.ListViewArray object at ...>
        [
          [
            1,
            2
          ],
          [
            2,
            3
          ],
          [
            3,
            4
          ]
        ]
        >>> # use a null mask to represent null values
        >>> mask = pa.array([False, True, False])
        >>> pa.ListViewArray.from_arrays(offsets, sizes, values, mask=mask)
        <pyarrow.lib.ListViewArray object at ...>
        [
          [
            1,
            2
          ],
          null,
          [
            3,
            4
          ]
        ]
        >>> # null values can be defined in either offsets or sizes arrays
        >>> # WARNING: this will result in a copy of the offsets or sizes arrays
        >>> offsets = pa.array([0, None, 2])
        >>> pa.ListViewArray.from_arrays(offsets, sizes, values)
        <pyarrow.lib.ListViewArray object at ...>
        [
          [
            1,
            2
          ],
          null,
          [
            3,
            4
          ]
        ]
        """
        cdef:
            Array _offsets, _sizes, _values
            shared_ptr[CArray] out
            shared_ptr[CBuffer] c_mask
            CMemoryPool* cpool = maybe_unbox_memory_pool(pool)

        _offsets = asarray(offsets, type='int32')
        _sizes = asarray(sizes, type='int32')
        _values = asarray(values)

        c_mask = c_mask_inverted_from_obj(mask, pool)

        if type is not None:
            with nogil:
                out = GetResultValue(
                    CListViewArray.FromArraysAndType(
                        type.sp_type, _offsets.ap[0], _sizes.ap[0], _values.ap[0], cpool, c_mask))
        else:
            with nogil:
                out = GetResultValue(
                    CListViewArray.FromArrays(
                        _offsets.ap[0], _sizes.ap[0], _values.ap[0], cpool, c_mask))
        cdef Array result = pyarrow_wrap_array(out)
        result.validate()
        return result

    @property
    def values(self):
        """
        Return the underlying array of values which backs the ListViewArray
        ignoring the array's offset and sizes.

        The values array may be out of order and/or contain additional values
        that are not found in the logical representation of the array. The only
        guarantee is that each non-null value in the ListView Array is contiguous.

        Compare with :meth:`flatten`, which returns only the non-null
        values taking into consideration the array's order and offset.

        Returns
        -------
        values : Array

        Examples
        --------
        The values include null elements from sub-lists:

        >>> import pyarrow as pa
        >>> values = [1, 2, None, 3, 4]
        >>> offsets = [0, 0, 1]
        >>> sizes = [2, 0, 4]
        >>> array = pa.ListViewArray.from_arrays(offsets, sizes, values)
        >>> array
        <pyarrow.lib.ListViewArray object at ...>
        [
          [
            1,
            2
          ],
          [],
          [
            2,
            null,
            3,
            4
          ]
        ]
        >>> array.values
        <pyarrow.lib.Int64Array object at ...>
        [
          1,
          2,
          null,
          3,
          4
        ]
        """
        cdef CListViewArray* arr = <CListViewArray*> self.ap
        return pyarrow_wrap_array(arr.values())

    @property
    def offsets(self):
        """
        Return the list offsets as an int32 array.

        The returned array will not have a validity bitmap, so you cannot
        expect to pass it to `ListViewArray.from_arrays` and get back the same
        list array if the original one has nulls.

        Returns
        -------
        offsets : Int32Array

        Examples
        --------
        >>> import pyarrow as pa
        >>> values = [1, 2, None, 3, 4]
        >>> offsets = [0, 0, 1]
        >>> sizes = [2, 0, 4]
        >>> array = pa.ListViewArray.from_arrays(offsets, sizes, values)
        >>> array.offsets
        <pyarrow.lib.Int32Array object at ...>
        [
          0,
          0,
          1
        ]
        """
        return pyarrow_wrap_array((<CListViewArray*> self.ap).offsets())

    @property
    def sizes(self):
        """
        Return the list sizes as an int32 array.

        The returned array will not have a validity bitmap, so you cannot
        expect to pass it to `ListViewArray.from_arrays` and get back the same
        list array if the original one has nulls.

        Returns
        -------
        sizes : Int32Array

        Examples
        --------
        >>> import pyarrow as pa
        >>> values = [1, 2, None, 3, 4]
        >>> offsets = [0, 0, 1]
        >>> sizes = [2, 0, 4]
        >>> array = pa.ListViewArray.from_arrays(offsets, sizes, values)
        >>> array.sizes
        <pyarrow.lib.Int32Array object at ...>
        [
          2,
          0,
          4
        ]
        """
        return pyarrow_wrap_array((<CListViewArray*> self.ap).sizes())
2024-05-01 06:20:04 +08:00
|
|
|
cdef class LargeListViewArray(BaseListArray):
|
2024-02-08 09:44:19 -05:00
|
|
|
"""
|
|
|
|
|
Concrete class for Arrow arrays of a large list view data type.
|
|
|
|
|
|
|
|
|
|
    Identical to ListViewArray, but with 64-bit offsets.
    """

    @staticmethod
    def from_arrays(offsets, sizes, values, DataType type=None, MemoryPool pool=None, mask=None):
        """
        Construct LargeListViewArray from arrays of int64 offsets and values.

        Parameters
        ----------
        offsets : Array (int64 type)
        sizes : Array (int64 type)
        values : Array (any type)
        type : DataType, optional
            If not specified, a default ListType with the values' type is
            used.
        pool : MemoryPool, optional
        mask : Array (boolean type), optional
            Indicate which values are null (True) or not null (False).

        Returns
        -------
        list_view_array : LargeListViewArray

        Examples
        --------
        >>> import pyarrow as pa
        >>> values = pa.array([1, 2, 3, 4])
        >>> offsets = pa.array([0, 1, 2])
        >>> sizes = pa.array([2, 2, 2])
        >>> pa.LargeListViewArray.from_arrays(offsets, sizes, values)
        <pyarrow.lib.LargeListViewArray object at ...>
        [
          [
            1,
            2
          ],
          [
            2,
            3
          ],
          [
            3,
            4
          ]
        ]
        >>> # use a null mask to represent null values
        >>> mask = pa.array([False, True, False])
        >>> pa.LargeListViewArray.from_arrays(offsets, sizes, values, mask=mask)
        <pyarrow.lib.LargeListViewArray object at ...>
        [
          [
            1,
            2
          ],
          null,
          [
            3,
            4
          ]
        ]
        >>> # null values can be defined in either offsets or sizes arrays
        >>> # WARNING: this will result in a copy of the offsets or sizes arrays
        >>> offsets = pa.array([0, None, 2])
        >>> pa.LargeListViewArray.from_arrays(offsets, sizes, values)
        <pyarrow.lib.LargeListViewArray object at ...>
        [
          [
            1,
            2
          ],
          null,
          [
            3,
            4
          ]
        ]
        """
        cdef:
            Array _offsets, _sizes, _values
            shared_ptr[CArray] out
            shared_ptr[CBuffer] c_mask
            CMemoryPool* cpool = maybe_unbox_memory_pool(pool)

        _offsets = asarray(offsets, type='int64')
        _sizes = asarray(sizes, type='int64')
        _values = asarray(values)

        c_mask = c_mask_inverted_from_obj(mask, pool)

        if type is not None:
            with nogil:
                out = GetResultValue(
                    CLargeListViewArray.FromArraysAndType(
                        type.sp_type, _offsets.ap[0], _sizes.ap[0], _values.ap[0], cpool, c_mask))
        else:
            with nogil:
                out = GetResultValue(
                    CLargeListViewArray.FromArrays(
                        _offsets.ap[0], _sizes.ap[0], _values.ap[0], cpool, c_mask))
        cdef Array result = pyarrow_wrap_array(out)
        result.validate()
        return result

    @property
    def values(self):
        """
        Return the underlying array of values which backs the
        LargeListViewArray ignoring the array's offset.

        The values array may be out of order and/or contain additional values
        that are not found in the logical representation of the array. The only
        guarantee is that each non-null value in the ListView array is contiguous.

        Compare with :meth:`flatten`, which returns only the non-null
        values taking into consideration the array's order and offset.

        Returns
        -------
        values : Array

        See Also
        --------
        LargeListViewArray.flatten : ...

        Examples
        --------

        The values include null elements from sub-lists:

        >>> import pyarrow as pa
        >>> values = [1, 2, None, 3, 4]
        >>> offsets = [0, 0, 1]
        >>> sizes = [2, 0, 4]
        >>> array = pa.LargeListViewArray.from_arrays(offsets, sizes, values)
        >>> array
        <pyarrow.lib.LargeListViewArray object at ...>
        [
          [
            1,
            2
          ],
          [],
          [
            2,
            null,
            3,
            4
          ]
        ]
        >>> array.values
        <pyarrow.lib.Int64Array object at ...>
        [
          1,
          2,
          null,
          3,
          4
        ]
        """
        cdef CLargeListViewArray* arr = <CLargeListViewArray*> self.ap
        return pyarrow_wrap_array(arr.values())

    @property
    def offsets(self):
        """
        Return the list view offsets as an int64 array.

        The returned array will not have a validity bitmap, so you cannot
        expect to pass it to `LargeListViewArray.from_arrays` and get back the
        same list array if the original one has nulls.

        Returns
        -------
        offsets : Int64Array

        Examples
        --------
        >>> import pyarrow as pa
        >>> values = [1, 2, None, 3, 4]
        >>> offsets = [0, 0, 1]
        >>> sizes = [2, 0, 4]
        >>> array = pa.LargeListViewArray.from_arrays(offsets, sizes, values)
        >>> array.offsets
        <pyarrow.lib.Int64Array object at ...>
        [
          0,
          0,
          1
        ]
        """
        return pyarrow_wrap_array((<CLargeListViewArray*> self.ap).offsets())

    @property
    def sizes(self):
        """
        Return the list view sizes as an int64 array.

        The returned array will not have a validity bitmap, so you cannot
        expect to pass it to `LargeListViewArray.from_arrays` and get back the
        same list array if the original one has nulls.

        Returns
        -------
        sizes : Int64Array

        Examples
        --------
        >>> import pyarrow as pa
        >>> values = [1, 2, None, 3, 4]
        >>> offsets = [0, 0, 1]
        >>> sizes = [2, 0, 4]
        >>> array = pa.LargeListViewArray.from_arrays(offsets, sizes, values)
        >>> array.sizes
        <pyarrow.lib.Int64Array object at ...>
        [
          2,
          0,
          4
        ]
        """
        return pyarrow_wrap_array((<CLargeListViewArray*> self.ap).sizes())


cdef class MapArray(ListArray):
    """
    Concrete class for Arrow arrays of a map data type.
    """

    @staticmethod
    def from_arrays(offsets, keys, items, DataType type=None, MemoryPool pool=None, mask=None):
        """
        Construct MapArray from arrays of int32 offsets and key, item arrays.

        Parameters
        ----------
        offsets : array-like or sequence (int32 type)
        keys : array-like or sequence (any type)
        items : array-like or sequence (any type)
        type : DataType, optional
            If not specified, a default MapType with the keys' and items' type is used.
        pool : MemoryPool
        mask : Array (boolean type), optional
            Indicate which values are null (True) or not null (False).

        Returns
        -------
        map_array : MapArray

        Examples
        --------
        First, let's understand the structure of our dataset when viewed in a
        rectangular data model. A total of 5 respondents answered the question
        "How much did you like the movie x?". The value -1 in the integer array
        means that the value is missing. The boolean array represents the null
        bitmask corresponding to the missing values in the integer array.

        >>> import numpy as np
        >>> import pyarrow as pa
        >>> movies_rectangular = np.ma.masked_array([
        ...     [10, -1, -1],
        ...     [8, 4, 5],
        ...     [-1, 10, 3],
        ...     [-1, -1, -1],
        ...     [-1, -1, -1]
        ...     ],
        ...     [
        ...     [False, True, True],
        ...     [False, False, False],
        ...     [True, False, False],
        ...     [True, True, True],
        ...     [True, True, True],
        ...     ])

        To represent the same data with the MapArray and from_arrays, the data is
        formed like this:

        >>> offsets = [
        ...     0,  # -- row 1 start
        ...     1,  # -- row 2 start
        ...     4,  # -- row 3 start
        ...     6,  # -- row 4 start
        ...     6,  # -- row 5 start
        ...     6,  # -- row 5 end
        ... ]
        >>> movies = [
        ...     "Dark Knight",  # ---------------------------------- row 1
        ...     "Dark Knight", "Meet the Parents", "Superman",  # -- row 2
        ...     "Meet the Parents", "Superman",  # ----------------- row 3
        ... ]
        >>> likings = [
        ...     10,  # -------- row 1
        ...     8, 4, 5,  # --- row 2
        ...     10, 3  # ------ row 3
        ... ]
        >>> pa.MapArray.from_arrays(offsets, movies, likings).to_pandas()
        0                                  [(Dark Knight, 10)]
        1    [(Dark Knight, 8), (Meet the Parents, 4), (Sup...
        2              [(Meet the Parents, 10), (Superman, 3)]
        3                                                   []
        4                                                   []
        dtype: object

        If the data in the empty rows needs to be marked as missing, it's possible
        to do so by modifying the offsets argument, so that we specify `None` as
        the starting positions of the rows we want marked as missing. The end row
        offset still has to refer to the existing value from keys (and values):

        >>> offsets = [
        ...     0,  # ----- row 1 start
        ...     1,  # ----- row 2 start
        ...     4,  # ----- row 3 start
        ...     None,  # -- row 4 start
        ...     None,  # -- row 5 start
        ...     6,  # ----- row 5 end
        ... ]
        >>> pa.MapArray.from_arrays(offsets, movies, likings).to_pandas()
        0                                  [(Dark Knight, 10)]
        1    [(Dark Knight, 8), (Meet the Parents, 4), (Sup...
        2              [(Meet the Parents, 10), (Superman, 3)]
        3                                                 None
        4                                                 None
        dtype: object
        """
        cdef:
            Array _offsets, _keys, _items
            shared_ptr[CArray] out
            shared_ptr[CBuffer] c_mask
        cdef CMemoryPool* cpool = maybe_unbox_memory_pool(pool)

        _offsets = asarray(offsets, type='int32')
        _keys = asarray(keys)
        _items = asarray(items)

        c_mask = c_mask_inverted_from_obj(mask, pool)

        if type is not None:
            with nogil:
                out = GetResultValue(
                    CMapArray.FromArraysAndType(
                        type.sp_type, _offsets.sp_array,
                        _keys.sp_array, _items.sp_array, cpool, c_mask))
        else:
            with nogil:
                out = GetResultValue(
                    CMapArray.FromArrays(_offsets.sp_array,
                                         _keys.sp_array,
                                         _items.sp_array, cpool, c_mask))
        cdef Array result = pyarrow_wrap_array(out)
        result.validate()
        return result

    @property
    def keys(self):
        """Flattened array of keys across all maps in array"""
        return pyarrow_wrap_array((<CMapArray*> self.ap).keys())

    @property
    def items(self):
        """Flattened array of items across all maps in array"""
        return pyarrow_wrap_array((<CMapArray*> self.ap).items())


cdef class FixedSizeListArray(BaseListArray):
    """
    Concrete class for Arrow arrays of a fixed size list data type.
    """

    @staticmethod
    def from_arrays(values, list_size=None, DataType type=None, mask=None):
        """
        Construct FixedSizeListArray from array of values and a list length.

        Parameters
        ----------
        values : Array (any type)
        list_size : int
            The fixed length of the lists.
        type : DataType, optional
            If not specified, a default ListType with the values' type and
            `list_size` length is used.
        mask : Array (boolean type), optional
            Indicate which values are null (True) or not null (False).

        Returns
        -------
        FixedSizeListArray

        Examples
        --------
        Create from a values array and a list size:

        >>> import pyarrow as pa
        >>> values = pa.array([1, 2, 3, 4])
        >>> arr = pa.FixedSizeListArray.from_arrays(values, 2)
        >>> arr
        <pyarrow.lib.FixedSizeListArray object at ...>
        [
          [
            1,
            2
          ],
          [
            3,
            4
          ]
        ]

        Or create from a values array, list size and matching type:

        >>> typ = pa.list_(pa.field("values", pa.int64()), 2)
        >>> arr = pa.FixedSizeListArray.from_arrays(values, type=typ)
        >>> arr
        <pyarrow.lib.FixedSizeListArray object at ...>
        [
          [
            1,
            2
          ],
          [
            3,
            4
          ]
        ]
        """
        cdef:
            Array _values
            int32_t _list_size
            CResult[shared_ptr[CArray]] c_result

        _values = asarray(values)

        c_mask = c_mask_inverted_from_obj(mask, None)

        if type is not None:
            if list_size is not None:
                raise ValueError("Cannot specify both list_size and type")
            with nogil:
                c_result = CFixedSizeListArray.FromArraysAndType(
                    _values.sp_array, type.sp_type, c_mask)
        else:
            if list_size is None:
                raise ValueError("Should specify one of list_size and type")
            _list_size = <int32_t>list_size
            with nogil:
                c_result = CFixedSizeListArray.FromArrays(
                    _values.sp_array, _list_size, c_mask)
        cdef Array result = pyarrow_wrap_array(GetResultValue(c_result))
        result.validate()
        return result

    @property
    def values(self):
        """
        Return the underlying array of values which backs the
        FixedSizeListArray ignoring the array's offset.

        Note even null elements are included.

        Compare with :meth:`flatten`, which returns only the non-null
        sub-list values.

        Returns
        -------
        values : Array

        See Also
        --------
        FixedSizeListArray.flatten : ...

        Examples
        --------
        >>> import pyarrow as pa
        >>> array = pa.array(
        ...     [[1, 2], None, [3, None]],
        ...     type=pa.list_(pa.int32(), 2)
        ... )
        >>> array.values
        <pyarrow.lib.Int32Array object at ...>
        [
          1,
          2,
          null,
          null,
          3,
          null
        ]
        """
        cdef CFixedSizeListArray* arr = <CFixedSizeListArray*> self.ap
        return pyarrow_wrap_array(arr.values())


cdef class UnionArray(Array):
    """
    Concrete class for Arrow arrays of a Union data type.
    """

    def child(self, int pos):
        """
        DEPRECATED, use field() instead.

        Parameters
        ----------
        pos : int
            The physical index of the union child field (not its type code).

        Returns
        -------
        field : Array
            The given child field.
        """
        import warnings
        warnings.warn("child is deprecated, use field", FutureWarning)
        return self.field(pos)

    def field(self, int pos):
        """
        Return the given child field as an individual array.

        For sparse unions, the returned array has its offset, length,
        and null count adjusted.

        For dense unions, the returned array is unchanged.

        Parameters
        ----------
        pos : int
            The physical index of the union child field (not its type code).

        Returns
        -------
        field : Array
            The given child field.
        """
        cdef shared_ptr[CArray] result
        result = (<CUnionArray*> self.ap).field(pos)
        if result != NULL:
            return pyarrow_wrap_array(result)
        raise KeyError(f"UnionArray does not have child {pos}")

    @property
    def type_codes(self):
        """Get the type codes array."""
        buf = pyarrow_wrap_buffer((<CUnionArray*> self.ap).type_codes())
        return Array.from_buffers(int8(), len(self), [None, buf])

    @property
    def offsets(self):
        """
        Get the value offsets array (dense arrays only).

        Does not account for any slice offset.
        """
        if self.type.mode != "dense":
            raise ArrowTypeError("Can only get value offsets for dense arrays")
        cdef CDenseUnionArray* dense = <CDenseUnionArray*> self.ap
        buf = pyarrow_wrap_buffer(dense.value_offsets())
        return Array.from_buffers(int32(), len(self), [None, buf])

    @staticmethod
    def from_dense(Array types, Array value_offsets, list children,
                   list field_names=None, list type_codes=None):
        """
        Construct dense UnionArray from arrays of int8 types, int32 offsets and
        children arrays.

        Parameters
        ----------
        types : Array (int8 type)
        value_offsets : Array (int32 type)
        children : list
        field_names : list
        type_codes : list

        Returns
        -------
        union_array : UnionArray
        """
        cdef:
            shared_ptr[CArray] out
            vector[shared_ptr[CArray]] c
            Array child
            vector[c_string] c_field_names
            vector[int8_t] c_type_codes

        for child in children:
            c.push_back(child.sp_array)
        if field_names is not None:
            for x in field_names:
                c_field_names.push_back(tobytes(x))
        if type_codes is not None:
            for x in type_codes:
                c_type_codes.push_back(x)

        with nogil:
            out = GetResultValue(CDenseUnionArray.Make(
                deref(types.ap), deref(value_offsets.ap), c, c_field_names,
                c_type_codes))

        cdef Array result = pyarrow_wrap_array(out)
        result.validate()
        return result

    @staticmethod
    def from_sparse(Array types, list children, list field_names=None,
                    list type_codes=None):
        """
        Construct sparse UnionArray from arrays of int8 types and children
        arrays.

        Parameters
        ----------
        types : Array (int8 type)
        children : list
        field_names : list
        type_codes : list

        Returns
        -------
        union_array : UnionArray
        """
        cdef:
            shared_ptr[CArray] out
            vector[shared_ptr[CArray]] c
            Array child
            vector[c_string] c_field_names
            vector[int8_t] c_type_codes

        for child in children:
            c.push_back(child.sp_array)
        if field_names is not None:
            for x in field_names:
                c_field_names.push_back(tobytes(x))
        if type_codes is not None:
            for x in type_codes:
                c_type_codes.push_back(x)

        with nogil:
            out = GetResultValue(CSparseUnionArray.Make(
                deref(types.ap), c, c_field_names, c_type_codes))

        cdef Array result = pyarrow_wrap_array(out)
        result.validate()
        return result


cdef class StringArray(Array):
    """
    Concrete class for Arrow arrays of string (or utf8) data type.
    """

    @staticmethod
    def from_buffers(int length, Buffer value_offsets, Buffer data,
                     Buffer null_bitmap=None, int null_count=-1,
                     int offset=0):
        """
        Construct a StringArray from value_offsets and data buffers.
        If there are nulls in the data, also a null_bitmap and the matching
        null_count must be passed.

        Parameters
        ----------
        length : int
        value_offsets : Buffer
        data : Buffer
        null_bitmap : Buffer, optional
        null_count : int, default -1
        offset : int, default 0

        Returns
        -------
        string_array : StringArray
        """
        return Array.from_buffers(utf8(), length,
                                  [null_bitmap, value_offsets, data],
                                  null_count, offset)


cdef class LargeStringArray(Array):
    """
    Concrete class for Arrow arrays of large string (or utf8) data type.
    """

    @staticmethod
    def from_buffers(int length, Buffer value_offsets, Buffer data,
                     Buffer null_bitmap=None, int null_count=-1,
                     int offset=0):
        """
        Construct a LargeStringArray from value_offsets and data buffers.
        If there are nulls in the data, also a null_bitmap and the matching
        null_count must be passed.

        Parameters
        ----------
        length : int
        value_offsets : Buffer
        data : Buffer
        null_bitmap : Buffer, optional
        null_count : int, default -1
        offset : int, default 0

        Returns
        -------
        string_array : LargeStringArray
        """
        return Array.from_buffers(large_utf8(), length,
                                  [null_bitmap, value_offsets, data],
                                  null_count, offset)


cdef class StringViewArray(Array):
    """
    Concrete class for Arrow arrays of string (or utf8) view data type.
    """


cdef class BinaryArray(Array):
    """
    Concrete class for Arrow arrays of variable-sized binary data type.
    """

    @property
    def total_values_length(self):
        """
        The number of bytes from beginning to end of the data buffer addressed
        by the offsets of this BinaryArray.
        """
        return (<CBinaryArray*> self.ap).total_values_length()


cdef class LargeBinaryArray(Array):
    """
    Concrete class for Arrow arrays of large variable-sized binary data type.
    """

    @property
    def total_values_length(self):
        """
        The number of bytes from beginning to end of the data buffer addressed
        by the offsets of this LargeBinaryArray.
        """
        return (<CLargeBinaryArray*> self.ap).total_values_length()


cdef class BinaryViewArray(Array):
    """
    Concrete class for Arrow arrays of variable-sized binary view data type.
    """


cdef class DictionaryArray(Array):
    """
    Concrete class for dictionary-encoded Arrow arrays.
    """

    def dictionary_encode(self):
        return self

    def dictionary_decode(self):
        """
        Decodes the DictionaryArray to an Array.
        """
        return self.dictionary.take(self.indices)

    @property
    def dictionary(self):
        cdef CDictionaryArray* darr = <CDictionaryArray*>(self.ap)

        if self._dictionary is None:
            self._dictionary = pyarrow_wrap_array(darr.dictionary())

        return self._dictionary

    @property
    def indices(self):
        cdef CDictionaryArray* darr = <CDictionaryArray*>(self.ap)

        if self._indices is None:
            self._indices = pyarrow_wrap_array(darr.indices())

        return self._indices

    @staticmethod
    def from_buffers(DataType type, int64_t length, buffers, Array dictionary,
                     int64_t null_count=-1, int64_t offset=0):
        """
        Construct a DictionaryArray from buffers.

        Parameters
        ----------
        type : pyarrow.DataType
        length : int
            The number of values in the array.
        buffers : List[Buffer | None]
            The buffers backing the indices array.
        dictionary : pyarrow.Array, ndarray or pandas.Series
            The array of values referenced by the indices.
        null_count : int, default -1
            The number of null entries in the indices array. Negative value means that
            the null count is not known.
        offset : int, default 0
            The array's logical offset (in values, not in bytes) from the
            start of each buffer.

        Returns
        -------
        dict_array : DictionaryArray
        """
        cdef:
            vector[shared_ptr[CBuffer]] c_buffers
            shared_ptr[CDataType] c_type
            shared_ptr[CArrayData] c_data
            shared_ptr[CArray] c_result

        for buf in buffers:
            c_buffers.push_back(pyarrow_unwrap_buffer(buf))

        c_type = pyarrow_unwrap_data_type(type)

        with nogil:
            c_data = CArrayData.Make(
                c_type, length, c_buffers, null_count, offset)
            c_data.get().dictionary = dictionary.sp_array.get().data()
            c_result.reset(new CDictionaryArray(c_data))

        cdef Array result = pyarrow_wrap_array(c_result)
        result.validate()
        return result

    @staticmethod
    def from_arrays(indices, dictionary, mask=None, bint ordered=False,
                    bint from_pandas=False, bint safe=True,
                    MemoryPool memory_pool=None):
        """
        Construct a DictionaryArray from indices and values.

        Parameters
        ----------
        indices : pyarrow.Array, numpy.ndarray or pandas.Series, int type
            Non-negative integers referencing the dictionary values by zero
            based index.
        dictionary : pyarrow.Array, ndarray or pandas.Series
            The array of values referenced by the indices.
        mask : ndarray or pandas.Series, bool type
            True values indicate that indices are actually null.
        ordered : bool, default False
            Set to True if the category values are ordered.
        from_pandas : bool, default False
            If True, the indices should be treated as though they originated in
            a pandas.Categorical (null encoded as -1).
        safe : bool, default True
            If True, check that the dictionary indices are in range.
        memory_pool : MemoryPool, default None
            For memory allocations, if required, otherwise uses default pool.

        Returns
        -------
        dict_array : DictionaryArray
        """
        cdef:
            Array _indices, _dictionary
            shared_ptr[CDataType] c_type
            shared_ptr[CArray] c_result

        if isinstance(indices, Array):
            if mask is not None:
                raise NotImplementedError(
                    "mask not implemented with Arrow array inputs yet")
            _indices = indices
        else:
            if from_pandas:
                _indices = _codes_to_indices(indices, mask, None, memory_pool)
            else:
                _indices = array(indices, mask=mask, memory_pool=memory_pool)

        if isinstance(dictionary, Array):
            _dictionary = dictionary
        else:
            _dictionary = array(dictionary, memory_pool=memory_pool)

        if not isinstance(_indices, IntegerArray):
            raise ValueError('Indices must be integer type')

        cdef c_bool c_ordered = ordered

        c_type.reset(new CDictionaryType(_indices.type.sp_type,
* As a result of this, an integration test that featured dictionary reuse has been changed to not reuse dictionaries. Technically this is a regression, but I didn't want to block the patch over it
* R is added to allow_failures in Travis CI for now
Author: Wes McKinney <wesm+git@apache.org>
Author: Kouhei Sutou <kou@clear-code.com>
Author: Antoine Pitrou <antoine@python.org>
Closes #4316 from wesm/ARROW-3144 and squashes the following commits:
9f1ccfbf4 <Kouhei Sutou> Follow DictionaryArray changes
89e274da5 <Wes McKinney> Do not reuse dictionaries in integration tests for now until more follow on work around this can be done
f62819f5b <Wes McKinney> Support many fields referencing the same dictionary, fix integration tests
37e82b4da <Antoine Pitrou> Fix CUDA and Duration issues
037075083 <Wes McKinney> Add R to allow_failures for now
bd04774e2 <Wes McKinney> Code review comments
b1cc52e62 <Wes McKinney> Fix rest of Python unit tests, fix some incorrect code comments
f1178b2a6 <Wes McKinney> Fix all but 3 Python unit tests
ab7fc1741 <Wes McKinney> Fix up Cython compilation, haven't fixed unit tests yet though
6ce51ef79 <Wes McKinney> Get everything compiling again
e23c578fd <Wes McKinney> Fix Parquet tests
c73b2162f <Wes McKinney> arrow-tests all passing again, huzzah!
04d40e8e6 <Wes McKinney> Flat dictionary IPC test passing now
481f316dc <Wes McKinney> Get JSON integration tests passing again
77a43dc9f <Wes McKinney> Fix pretty_print-test
f4ada6685 <Wes McKinney> array-tests compilers again
8276dce6c <Wes McKinney> libarrow compiles again
8ea0e260a <Wes McKinney> Refactor IPC read path for new paradigm
a1afe879a <Wes McKinney> More refactoring to have correct logic in IPC paths, not yet done
aed04304e <Wes McKinney> More refactoring, regularize some type names
6bd72f946 <Wes McKinney> Start porting changes
24f99f16b <Wes McKinney> Initial boilerplate
2019-05-17 11:40:55 -05:00
|
|
|
_dictionary.sp_array.get().type(),
|
|
|
|
|
c_ordered))
|
2018-03-11 23:41:38 -04:00
|
|
|
|
|
|
|
|
if safe:
|
|
|
|
|
with nogil:
|
2020-04-07 11:36:31 +02:00
|
|
|
c_result = GetResultValue(
|
2018-03-11 23:41:38 -04:00
|
|
|
CDictionaryArray.FromArrays(c_type, _indices.sp_array,
|
2020-04-07 11:36:31 +02:00
|
|
|
_dictionary.sp_array))
|
2018-03-11 23:41:38 -04:00
|
|
|
else:
|
ARROW-3144: [C++/Python] Move "dictionary" member from DictionaryType to ArrayData to allow for variable dictionaries
This patch moves the dictionary member out of DictionaryType to a new
member on the internal ArrayData structure. As a result, serializing
and deserializing schemas requires only a single IPC message, and
schemas have no knowledge of what the dictionary values are.
The objective of this change is to correct a long-standing Arrow C++
design problem with dictionary-encoded arrays where the dictionary
values must be known at schema construction time. This has plagued us
all over the codebase:
* In reading Parquet files, reading directly to DictionaryArray is not
simple because each row group may have a different dictionary
* In IPC streams, delta dictionaries (not yet implemented) would
invalidate the pre-existing schema, causing subsequent RecordBatch
objects to be incompatible
* In Arrow Flight, schema negotiation requires the dictionaries to be
sent, having possibly unbounded size.
* Not possible to have different dictionaries in a ChunkedArray
* In CSV files, converting columns to dictionary in parallel would
require an expensive type unification
The summary of what can be learned from this is: do not put data in
type objects, only metadata. Dictionaries are data, not metadata.
There are a number of unavoidable API changes (straightforward for
library users to fix) but otherwise no functional difference in the
library.
As you can see the change is quite complex as significant parts of IPC
read/write, JSON integration testing, and Flight needed to be reworked
to alter the control flow around schema resolution and handling the
first record batch.
Key APIs changed
* `DictionaryType` constructor requires a `DataType` for the
dictionary value type instead of the dictionary itself. The
`dictionary` factory method is correspondingly changed. The
`dictionary` accessor method on `DictionaryType` is replaced with
`value_type`.
* `DictionaryArray` constructor and `DictionaryArray::FromArrays` must
be passed the dictionary values as an additional argument.
* `DictionaryMemo` is exposed in the public API as it is now required
for granular interactions with IPC messages with such functions as
`ipc::ReadSchema` and `ipc::ReadRecordBatch`
* A `DictionaryMemo*` argument is added to several low-level public
functions in `ipc/writer.h` and `ipc/reader.h`
Some other incidental changes:
* Because DictionaryType objects could be reused previous in Schemas, such dictionaries would be "deduplicated" in IPC messages in passing. This is no longer possible by the same trick, so dictionary reuse will have to be handled in a different way (I opened ARROW-5340 to investigate)
* As a result of this, an integration test that featured dictionary reuse has been changed to not reuse dictionaries. Technically this is a regression, but I didn't want to block the patch over it
* R is added to allow_failures in Travis CI for now
Author: Wes McKinney <wesm+git@apache.org>
Author: Kouhei Sutou <kou@clear-code.com>
Author: Antoine Pitrou <antoine@python.org>
Closes #4316 from wesm/ARROW-3144 and squashes the following commits:
9f1ccfbf4 <Kouhei Sutou> Follow DictionaryArray changes
89e274da5 <Wes McKinney> Do not reuse dictionaries in integration tests for now until more follow on work around this can be done
f62819f5b <Wes McKinney> Support many fields referencing the same dictionary, fix integration tests
37e82b4da <Antoine Pitrou> Fix CUDA and Duration issues
037075083 <Wes McKinney> Add R to allow_failures for now
bd04774e2 <Wes McKinney> Code review comments
b1cc52e62 <Wes McKinney> Fix rest of Python unit tests, fix some incorrect code comments
f1178b2a6 <Wes McKinney> Fix all but 3 Python unit tests
ab7fc1741 <Wes McKinney> Fix up Cython compilation, haven't fixed unit tests yet though
6ce51ef79 <Wes McKinney> Get everything compiling again
e23c578fd <Wes McKinney> Fix Parquet tests
c73b2162f <Wes McKinney> arrow-tests all passing again, huzzah!
04d40e8e6 <Wes McKinney> Flat dictionary IPC test passing now
481f316dc <Wes McKinney> Get JSON integration tests passing again
77a43dc9f <Wes McKinney> Fix pretty_print-test
f4ada6685 <Wes McKinney> array-tests compilers again
8276dce6c <Wes McKinney> libarrow compiles again
8ea0e260a <Wes McKinney> Refactor IPC read path for new paradigm
a1afe879a <Wes McKinney> More refactoring to have correct logic in IPC paths, not yet done
aed04304e <Wes McKinney> More refactoring, regularize some type names
6bd72f946 <Wes McKinney> Start porting changes
24f99f16b <Wes McKinney> Initial boilerplate
2019-05-17 11:40:55 -05:00
|
|
|
c_result.reset(new CDictionaryArray(c_type, _indices.sp_array,
|
|
|
|
|
_dictionary.sp_array))
|
2017-04-13 12:51:47 +02:00
|
|
|
|
2019-09-24 22:02:07 -05:00
|
|
|
cdef Array result = pyarrow_wrap_array(c_result)
|
|
|
|
|
result.validate()
|
|
|
|
|
return result


cdef class StructArray(Array):
    """
    Concrete class for Arrow arrays of a struct data type.
    """

    def field(self, index):
        """
        Retrieves the child array belonging to field.

        Parameters
        ----------
        index : Union[int, str]
            Index / position or name of the field.

        Returns
        -------
        result : Array
        """
        cdef:
            CStructArray* arr = <CStructArray*> self.ap
            shared_ptr[CArray] child

        if isinstance(index, (bytes, str)):
            child = arr.GetFieldByName(tobytes(index))
            if child == nullptr:
                raise KeyError(index)
        elif isinstance(index, int):
            child = arr.field(
                <int>_normalize_index(index, self.ap.num_fields()))
        else:
            raise TypeError('Expected integer or string index')

        return pyarrow_wrap_array(child)

    def _flattened_field(self, index, MemoryPool memory_pool=None):
        """
        Retrieves the child array belonging to field,
        accounting for the parent array null bitmap.

        Parameters
        ----------
        index : Union[int, str]
            Index / position or name of the field.
        memory_pool : MemoryPool, default None
            For memory allocations, if required, otherwise use default pool.

        Returns
        -------
        result : Array
        """
        cdef:
            CStructArray* arr = <CStructArray*> self.ap
            shared_ptr[CArray] child
            CMemoryPool* pool = maybe_unbox_memory_pool(memory_pool)

        if isinstance(index, (bytes, str)):
            int_index = self.type.get_field_index(index)
            if int_index < 0:
                raise KeyError(index)
        elif isinstance(index, int):
            int_index = _normalize_index(index, self.ap.num_fields())
        else:
            raise TypeError('Expected integer or string index')

        child = GetResultValue(arr.GetFlattenedField(int_index, pool))
        return pyarrow_wrap_array(child)

    def flatten(self, MemoryPool memory_pool=None):
        """
        Return one individual array for each field in the struct.

        Parameters
        ----------
        memory_pool : MemoryPool, default None
            For memory allocations, if required, otherwise use default pool.

        Returns
        -------
        result : List[Array]
        """
        cdef:
            vector[shared_ptr[CArray]] arrays
            CMemoryPool* pool = maybe_unbox_memory_pool(memory_pool)
            CStructArray* sarr = <CStructArray*> self.ap

        with nogil:
            arrays = GetResultValue(sarr.Flatten(pool))

        return [pyarrow_wrap_array(arr) for arr in arrays]

    @staticmethod
    def from_arrays(arrays, names=None, fields=None, mask=None,
                    memory_pool=None, type=None):
        """
        Construct StructArray from collection of arrays representing
        each field in the struct.

        Either field names, field instances or a struct type must be passed.

        Parameters
        ----------
        arrays : sequence of Array
        names : List[str] (optional)
            Field names for each struct child.
        fields : List[Field] (optional)
            Field instances for each struct child.
        mask : pyarrow.Array[bool] (optional)
            Indicate which values are null (True) or not null (False).
        memory_pool : MemoryPool (optional)
            For memory allocations, if required, otherwise uses default pool.
        type : pyarrow.StructType (optional)
            Struct type for name and type of each child.

        Returns
        -------
        result : StructArray
        """
        cdef:
            shared_ptr[CArray] c_array
            shared_ptr[CBuffer] c_mask
            vector[shared_ptr[CArray]] c_arrays
            vector[c_string] c_names
            vector[shared_ptr[CField]] c_fields
            CResult[shared_ptr[CArray]] c_result
            ssize_t num_arrays
            ssize_t length
            ssize_t i
            Field py_field
            DataType struct_type

        if fields is not None and type is not None:
            raise ValueError('Must pass either fields or type, not both')

        if type is not None:
            fields = []
            for field in type:
                fields.append(field)

        if names is None and fields is None:
            raise ValueError('Must pass either names or fields')
        if names is not None and fields is not None:
            raise ValueError('Must pass either names or fields, not both')

        c_mask = c_mask_inverted_from_obj(mask, memory_pool)

        arrays = [asarray(x) for x in arrays]
        for arr in arrays:
            c_array = pyarrow_unwrap_array(arr)
            if c_array == nullptr:
                raise TypeError(f"Expected Array, got {arr.__class__}")
            c_arrays.push_back(c_array)
        if names is not None:
            for name in names:
                c_names.push_back(tobytes(name))
        else:
            for item in fields:
                if isinstance(item, tuple):
                    py_field = field(*item)
                else:
                    py_field = item
                c_fields.push_back(py_field.sp_field)

        if (c_arrays.size() == 0 and c_names.size() == 0 and
                c_fields.size() == 0):
            # The C++ side doesn't allow this
            if mask is None:
                return array([], struct([]))
            else:
                return array([{}] * len(mask), struct([]), mask=mask)

        if names is not None:
            # XXX Cannot pass "nullptr" for a shared_ptr<T> argument:
            # https://github.com/cython/cython/issues/3020
            c_result = CStructArray.MakeFromFieldNames(
                c_arrays, c_names, c_mask, -1, 0)
        else:
            c_result = CStructArray.MakeFromFields(
                c_arrays, c_fields, c_mask, -1, 0)
        cdef Array result = pyarrow_wrap_array(GetResultValue(c_result))
        result.validate()
        return result

    def sort(self, order="ascending", by=None, **kwargs):
        """
        Sort the StructArray.

        Parameters
        ----------
        order : str, default "ascending"
            Which order to sort values in.
            Accepted values are "ascending", "descending".
        by : str or None, default None
            Whether to sort the array by one of its fields
            or by the whole array.
        **kwargs : dict, optional
            Additional sorting options.
            As allowed by :class:`SortOptions`

        Returns
        -------
        result : StructArray
        """
        if by is not None:
            tosort, sort_keys = self._flattened_field(by), [("", order)]
        else:
            tosort, sort_keys = self, [(field.name, order) for field in self.type]
        indices = _pc().sort_indices(
            tosort, options=_pc().SortOptions(sort_keys=sort_keys, **kwargs)
        )
        return self.take(indices)


cdef class RunEndEncodedArray(Array):
    """
    Concrete class for Arrow run-end encoded arrays.
    """

    @staticmethod
    def _from_arrays(type, allow_none_for_type, logical_length, run_ends, values, logical_offset):
        cdef:
            int64_t _logical_length
            Array _run_ends
            Array _values
            int64_t _logical_offset
            shared_ptr[CDataType] c_type
            shared_ptr[CRunEndEncodedArray] ree_array

        _logical_length = <int64_t>logical_length
        _logical_offset = <int64_t>logical_offset

        type = ensure_type(type, allow_none=allow_none_for_type)
        if type is not None:
            _run_ends = asarray(run_ends, type=type.run_end_type)
            _values = asarray(values, type=type.value_type)
            c_type = pyarrow_unwrap_data_type(type)
            with nogil:
                ree_array = GetResultValue(CRunEndEncodedArray.Make(
                    c_type, _logical_length, _run_ends.sp_array, _values.sp_array, _logical_offset))
        else:
            _run_ends = asarray(run_ends)
            _values = asarray(values)
            with nogil:
                ree_array = GetResultValue(CRunEndEncodedArray.MakeFromArrays(
                    _logical_length, _run_ends.sp_array, _values.sp_array, _logical_offset))
        cdef Array result = pyarrow_wrap_array(<shared_ptr[CArray]>ree_array)
        result.validate(full=True)
        return result

    @staticmethod
    def from_arrays(run_ends, values, type=None):
        """
        Construct RunEndEncodedArray from run_ends and values arrays.

        Parameters
        ----------
        run_ends : Array (int16, int32, or int64 type)
            The run_ends array.
        values : Array (any type)
            The values array.
        type : pyarrow.DataType, optional
            The run_end_encoded(run_end_type, value_type) array type.

        Returns
        -------
        RunEndEncodedArray
        """
        logical_length = scalar(run_ends[-1]).as_py() if len(run_ends) > 0 else 0
        return RunEndEncodedArray._from_arrays(type, True, logical_length,
                                               run_ends, values, 0)

    @staticmethod
    def from_buffers(DataType type, length, buffers, null_count=-1, offset=0,
                     children=None):
        """
        Construct a RunEndEncodedArray from all the parameters that make up an
        Array.

        RunEndEncodedArrays do not have buffers, only children arrays, but this
        implementation is needed to satisfy the Array interface.

        Parameters
        ----------
        type : DataType
            The run_end_encoded(run_end_type, value_type) type.
        length : int
            The logical length of the run-end encoded array. Expected to match
            the last value of the run_ends array (children[0]) minus the offset.
        buffers : List[Buffer]
            Empty List or [None].
        null_count : int, default -1
            The number of null entries in the array. Run-end encoded arrays
            are specified to not have valid bits and null_count always equals 0.
        offset : int, default 0
            The array's logical offset (in values, not in bytes) from the
            start of each buffer.
        children : List[Array]
            Nested type children containing the run_ends and values arrays.

        Returns
        -------
        RunEndEncodedArray
        """
        children = children or []

        if type.num_fields != len(children):
            raise ValueError("RunEndEncodedType's expected number of children "
                             f"({type.num_fields}) did not match the passed number "
                             f"({len(children)})")

        # buffers are validated as if we needed to pass them to C++, but
        # _make_from_arrays will take care of filling in the expected
        # buffers array containing a single NULL buffer on the C++ side
        if len(buffers) == 0:
            buffers = [None]
        if buffers[0] is not None:
            raise ValueError("RunEndEncodedType expects None as validity "
                             "bitmap, buffers[0] is not None")
        if type.num_buffers != len(buffers):
            raise ValueError("RunEndEncodedType's expected number of buffers "
                             f"({type.num_buffers}) did not match the passed number "
                             f"({len(buffers)}).")

        # null_count is also validated as if we needed it
        if null_count != -1 and null_count != 0:
            raise ValueError("RunEndEncodedType's expected null_count (0) "
                             f"did not match passed number ({null_count})")

        return RunEndEncodedArray._from_arrays(type, False, length, children[0],
                                               children[1], offset)

    @property
    def run_ends(self):
        """
        An array holding the logical indexes of each run-end.

        The physical offset to the array is applied.
        """
        cdef CRunEndEncodedArray* ree_array = <CRunEndEncodedArray*>(self.ap)
        return pyarrow_wrap_array(ree_array.run_ends())

    @property
    def values(self):
        """
        An array holding the values of each run.

        The physical offset to the array is applied.
        """
        cdef CRunEndEncodedArray* ree_array = <CRunEndEncodedArray*>(self.ap)
        return pyarrow_wrap_array(ree_array.values())

    def find_physical_offset(self):
        """
        Find the physical offset of this REE array.

        This is the offset of the run that contains the value of the first
        logical element of this array considering its offset.

        This function uses binary-search, so it has a O(log N) cost.
        """
        cdef CRunEndEncodedArray* ree_array = <CRunEndEncodedArray*>(self.ap)
        return ree_array.FindPhysicalOffset()

    def find_physical_length(self):
        """
        Find the physical length of this REE array.

        The physical length of an REE is the number of physical values (and
        run-ends) necessary to represent the logical range of values from offset
        to length.

        This function uses binary-search, so it has a O(log N) cost.
        """
        cdef CRunEndEncodedArray* ree_array = <CRunEndEncodedArray*>(self.ap)
        return ree_array.FindPhysicalLength()

cdef class ExtensionArray(Array):
    """
    Concrete class for Arrow extension arrays.
    """

    @property
    def storage(self):
        cdef:
            CExtensionArray* ext_array = <CExtensionArray*>(self.ap)

        return pyarrow_wrap_array(ext_array.storage())

    @staticmethod
    def from_storage(BaseExtensionType typ, Array storage):
        """
        Construct ExtensionArray from type and storage array.

        Parameters
        ----------
        typ : DataType
            The extension type for the result array.
        storage : Array
            The underlying storage for the result array.

        Returns
        -------
        ext_array : ExtensionArray
        """
        cdef:
            shared_ptr[CExtensionArray] ext_array

        if storage.type != typ.storage_type:
            raise TypeError(f"Incompatible storage type {storage.type} "
                            f"for extension type {typ}")

        ext_array = make_shared[CExtensionArray](typ.sp_type, storage.sp_array)
        cdef Array result = pyarrow_wrap_array(<shared_ptr[CArray]> ext_array)
        result.validate()
        return result


class JsonArray(ExtensionArray):
    """
    Concrete class for Arrow arrays of JSON data type.

    This does not guarantee that the JSON data actually
    is valid JSON.

    Examples
    --------
    Define the extension type for JSON array

    >>> import pyarrow as pa
    >>> json_type = pa.json_(pa.large_utf8())

    Create an extension array

    >>> arr = [None, '{ "id":30, "values":["a", "b"] }']
    >>> storage = pa.array(arr, pa.large_utf8())
    >>> pa.ExtensionArray.from_storage(json_type, storage)
    <pyarrow.lib.JsonArray object at ...>
    [
      null,
      "{ "id":30, "values":["a", "b"] }"
    ]
    """


class UuidArray(ExtensionArray):
    """
    Concrete class for Arrow arrays of UUID data type.
    """


cdef class FixedShapeTensorArray(ExtensionArray):
    """
    Concrete class for fixed shape tensor extension arrays.

    Examples
    --------
    Define the extension type for tensor array

    >>> import pyarrow as pa
    >>> tensor_type = pa.fixed_shape_tensor(pa.int32(), [2, 2])

    Create an extension array

    >>> arr = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    >>> storage = pa.array(arr, pa.list_(pa.int32(), 4))
    >>> pa.ExtensionArray.from_storage(tensor_type, storage)
    <pyarrow.lib.FixedShapeTensorArray object at ...>
    [
      [
        1,
        2,
        3,
        4
      ],
      [
        10,
        20,
        30,
        40
      ],
      [
        100,
        200,
        300,
        400
      ]
    ]
    """

    def to_numpy_ndarray(self):
        """
        Convert fixed shape tensor extension array to a multi-dimensional numpy.ndarray.

        The resulting ndarray will have (ndim + 1) dimensions.
        The size of the first dimension will be the length of the fixed shape tensor array
        and the rest of the dimensions will match the permuted shape of the fixed
        shape tensor.

        The conversion is zero-copy.

        Returns
        -------
        numpy.ndarray
            Ndarray representing tensors in the fixed shape tensor array concatenated
            along the first dimension.
        """
        return self.to_tensor().to_numpy()

    def to_tensor(self):
        """
        Convert fixed shape tensor extension array to a pyarrow.Tensor.

        The resulting Tensor will have (ndim + 1) dimensions.
        The size of the first dimension will be the length of the fixed shape tensor array
        and the rest of the dimensions will match the permuted shape of the fixed
        shape tensor.

        The conversion is zero-copy.

        Returns
        -------
        pyarrow.Tensor
            Tensor representing tensors in the fixed shape tensor array concatenated
            along the first dimension.
        """
        cdef:
            CFixedShapeTensorArray* ext_array = <CFixedShapeTensorArray*>(self.ap)
            CResult[shared_ptr[CTensor]] ctensor
        with nogil:
            ctensor = ext_array.ToTensor()
        return pyarrow_wrap_tensor(GetResultValue(ctensor))
2023-04-11 16:37:03 +02:00
|
|
|
|
|
|
|
|
@staticmethod
|
2025-04-18 03:13:51 +09:00
|
|
|
def from_numpy_ndarray(obj, dim_names=None):
|
2023-04-11 16:37:03 +02:00
|
|
|
"""
|
|
|
|
|
Convert numpy tensors (ndarrays) to a fixed shape tensor extension array.
|
|
|
|
|
The first dimension of ndarray will become the length of the fixed
|
|
|
|
|
shape tensor array.
|
2024-02-08 12:25:38 +01:00
|
|
|
If input array data is not contiguous a copy will be made.
|
2023-04-11 16:37:03 +02:00
|
|
|
|
|
|
|
|
Parameters
|
|
|
|
|
----------
|
|
|
|
|
obj : numpy.ndarray
|
2025-04-18 03:13:51 +09:00
|
|
|
dim_names : tuple or list of strings, default None
|
|
|
|
|
Explicit names to tensor dimensions.
|
2023-04-11 16:37:03 +02:00
|
|
|
|
|
|
|
|
Examples
|
|
|
|
|
--------
|
|
|
|
|
>>> import pyarrow as pa
|
|
|
|
|
>>> import numpy as np
|
|
|
|
|
>>> arr = np.array(
|
|
|
|
|
... [[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]],
|
|
|
|
|
... dtype=np.float32)
|
|
|
|
|
>>> pa.FixedShapeTensorArray.from_numpy_ndarray(arr)
|
|
|
|
|
<pyarrow.lib.FixedShapeTensorArray object at ...>
|
|
|
|
|
[
|
|
|
|
|
[
|
|
|
|
|
1,
|
|
|
|
|
2,
|
|
|
|
|
3,
|
|
|
|
|
4,
|
|
|
|
|
5,
|
|
|
|
|
6
|
|
|
|
|
],
|
|
|
|
|
[
|
|
|
|
|
1,
|
|
|
|
|
2,
|
|
|
|
|
3,
|
|
|
|
|
4,
|
|
|
|
|
5,
|
|
|
|
|
6
|
|
|
|
|
]
|
|
|
|
|
]
|
|
|
|
|
"""
|
2024-02-08 12:25:38 +01:00
|
|
|
|
|
|
|
|
if len(obj.shape) < 2:
|
|
|
|
|
raise ValueError(
|
|
|
|
|
"Cannot convert 1D array or scalar to fixed shape tensor array")
|
|
|
|
|
if np.prod(obj.shape) == 0:
|
|
|
|
|
raise ValueError("Expected a non-empty ndarray")
|
2025-04-18 03:13:51 +09:00
|
|
|
if dim_names is not None:
|
|
|
|
|
if not isinstance(dim_names, Sequence):
|
|
|
|
|
raise TypeError("dim_names must be a tuple or list")
|
|
|
|
|
if len(dim_names) != len(obj.shape[1:]):
|
|
|
|
|
raise ValueError(
|
|
|
|
|
(f"The length of dim_names ({len(dim_names)}) does not match"
|
|
|
|
|
f"the number of tensor dimensions ({len(obj.shape[1:])})."
|
|
|
|
|
)
|
|
|
|
|
)
|
|
|
|
|
if not all(isinstance(name, str) for name in dim_names):
|
|
|
|
|
raise TypeError("Each element of dim_names must be a string")
|
2024-02-08 12:25:38 +01:00
|
|
|
|
|
|
|
|
permutation = (-np.array(obj.strides)).argsort(kind='stable')
|
|
|
|
|
if permutation[0] != 0:
|
|
|
|
|
raise ValueError('First stride needs to be largest to ensure that '
|
|
|
|
|
'individual tensor data is contiguous in memory.')
|
2023-04-11 16:37:03 +02:00
|
|
|
|
|
|
|
|
arrow_type = from_numpy_dtype(obj.dtype)
|
2024-02-08 12:25:38 +01:00
|
|
|
shape = np.take(obj.shape, permutation)
|
|
|
|
|
values = np.ravel(obj, order="K")
|
2023-04-11 16:37:03 +02:00
|
|
|
|
|
|
|
|
return ExtensionArray.from_storage(
|
2025-04-18 03:13:51 +09:00
|
|
|
fixed_shape_tensor(arrow_type, shape[1:],
|
|
|
|
|
dim_names=dim_names,
|
|
|
|
|
permutation=permutation[1:] - 1),
|
2024-02-08 12:25:38 +01:00
|
|
|
FixedSizeListArray.from_arrays(values, shape[1:].prod())
|
2023-04-11 16:37:03 +02:00
|
|
|
)


cdef class OpaqueArray(ExtensionArray):
    """
    Concrete class for opaque extension arrays.

    Examples
    --------
    Define the extension type for an opaque array

    >>> import pyarrow as pa
    >>> opaque_type = pa.opaque(
    ...     pa.binary(),
    ...     type_name="geometry",
    ...     vendor_name="postgis",
    ... )

    Create an extension array

    >>> arr = [None, b"data"]
    >>> storage = pa.array(arr, pa.binary())
    >>> pa.ExtensionArray.from_storage(opaque_type, storage)
    <pyarrow.lib.OpaqueArray object at ...>
    [
      null,
      64617461
    ]
    """


cdef class Bool8Array(ExtensionArray):
    """
    Concrete class for bool8 extension arrays.

    Examples
    --------
    Define the extension type for a bool8 array

    >>> import pyarrow as pa
    >>> bool8_type = pa.bool8()

    Create an extension array

    >>> arr = [-1, 0, 1, 2, None]
    >>> storage = pa.array(arr, pa.int8())
    >>> pa.ExtensionArray.from_storage(bool8_type, storage)
    <pyarrow.lib.Bool8Array object at ...>
    [
      -1,
      0,
      1,
      2,
      null
    ]
    """

    def to_numpy(self, zero_copy_only=True, writable=False):
        """
        Return a NumPy bool view or copy of this array.

        By default, tries to return a view of this array. This is only
        supported for arrays without any nulls.

        Parameters
        ----------
        zero_copy_only : bool, default True
            If True, an exception will be raised if the conversion to a numpy
            array would require copying the underlying data (e.g. in presence
            of nulls).
        writable : bool, default False
            For numpy arrays created with zero copy (view on the Arrow data),
            the resulting array is not writable (Arrow data is immutable).
            By setting this to True, a copy of the array is made to ensure
            it is writable.

        Returns
        -------
        array : numpy.ndarray
        """
        if not writable:
            try:
                return self.storage.to_numpy().view(np.bool_)
            except ArrowInvalid as e:
                if zero_copy_only:
                    raise e

        return _pc().not_equal(self.storage, 0).to_numpy(
            zero_copy_only=zero_copy_only, writable=writable)

    @staticmethod
    def from_storage(Int8Array storage):
        """
        Construct Bool8Array from Int8Array storage.

        Parameters
        ----------
        storage : Int8Array
            The underlying storage for the result array.

        Returns
        -------
        bool8_array : Bool8Array
        """
        return ExtensionArray.from_storage(bool8(), storage)

    @staticmethod
    def from_numpy(obj):
        """
        Convert a numpy array to a bool8 extension array without making a copy.

        The input array must be 1-dimensional, with either ``bool_`` or
        ``int8`` dtype.

        Parameters
        ----------
        obj : numpy.ndarray

        Returns
        -------
        bool8_array : Bool8Array

        Examples
        --------
        >>> import pyarrow as pa
        >>> import numpy as np
        >>> arr = np.array([True, False, True], dtype=np.bool_)
        >>> pa.Bool8Array.from_numpy(arr)
        <pyarrow.lib.Bool8Array object at ...>
        [
          1,
          0,
          1
        ]
        """
        if obj.ndim != 1:
            raise ValueError(f"Cannot convert {obj.ndim}-D array to bool8 array")

        if obj.dtype not in [np.bool_, np.int8]:
            raise TypeError(f"Array dtype {obj.dtype} incompatible with bool8 storage")

        storage_arr = array(obj.view(np.int8), type=int8())
        return Bool8Array.from_storage(storage_arr)


cdef dict _array_classes = {
    _Type_NA: NullArray,
    _Type_BOOL: BooleanArray,
    _Type_UINT8: UInt8Array,
    _Type_UINT16: UInt16Array,
    _Type_UINT32: UInt32Array,
    _Type_UINT64: UInt64Array,
    _Type_INT8: Int8Array,
    _Type_INT16: Int16Array,
    _Type_INT32: Int32Array,
    _Type_INT64: Int64Array,
    _Type_DATE32: Date32Array,
    _Type_DATE64: Date64Array,
    _Type_TIMESTAMP: TimestampArray,
    _Type_TIME32: Time32Array,
    _Type_TIME64: Time64Array,
    _Type_DURATION: DurationArray,
    _Type_INTERVAL_MONTH_DAY_NANO: MonthDayNanoIntervalArray,
    _Type_HALF_FLOAT: HalfFloatArray,
    _Type_FLOAT: FloatArray,
    _Type_DOUBLE: DoubleArray,
    _Type_LIST: ListArray,
    _Type_LARGE_LIST: LargeListArray,
    _Type_LIST_VIEW: ListViewArray,
    _Type_LARGE_LIST_VIEW: LargeListViewArray,
    _Type_MAP: MapArray,
    _Type_FIXED_SIZE_LIST: FixedSizeListArray,
    _Type_SPARSE_UNION: UnionArray,
    _Type_DENSE_UNION: UnionArray,
    _Type_BINARY: BinaryArray,
    _Type_STRING: StringArray,
    _Type_LARGE_BINARY: LargeBinaryArray,
    _Type_LARGE_STRING: LargeStringArray,
    _Type_BINARY_VIEW: BinaryViewArray,
    _Type_STRING_VIEW: StringViewArray,
    _Type_DICTIONARY: DictionaryArray,
    _Type_FIXED_SIZE_BINARY: FixedSizeBinaryArray,
    _Type_DECIMAL32: Decimal32Array,
    _Type_DECIMAL64: Decimal64Array,
    _Type_DECIMAL128: Decimal128Array,
    _Type_DECIMAL256: Decimal256Array,
    _Type_STRUCT: StructArray,
    _Type_RUN_END_ENCODED: RunEndEncodedArray,
    _Type_EXTENSION: ExtensionArray,
}


cdef inline shared_ptr[CBuffer] c_mask_inverted_from_obj(object mask, MemoryPool pool) except *:
    """
    Convert mask array obj to c_mask, inverting it so that 1 signifies a
    valid slot and 0 a null slot.
    """
    cdef shared_ptr[CBuffer] c_mask
    if mask is None:
        c_mask = shared_ptr[CBuffer]()
    elif isinstance(mask, Array):
        if mask.type.id != Type_BOOL:
            raise TypeError('Mask must be a pyarrow.Array of type boolean')
        if mask.null_count != 0:
            raise ValueError('Mask must not contain nulls')
        inverted_mask = _pc().invert(mask, memory_pool=pool)
        c_mask = pyarrow_unwrap_buffer(inverted_mask.buffers()[1])
    else:
        raise TypeError('Mask must be a pyarrow.Array of type boolean')
    return c_mask


cdef object get_array_class_from_type(
        const shared_ptr[CDataType]& sp_data_type):
    cdef CDataType* data_type = sp_data_type.get()
    if data_type == NULL:
        raise ValueError('Array data type was NULL')

    if data_type.id() == _Type_EXTENSION:
        py_ext_data_type = pyarrow_wrap_data_type(sp_data_type)
        return py_ext_data_type.__arrow_ext_class__()
    else:
        return _array_classes[data_type.id()]


cdef object get_values(object obj, bint* is_series):
    if pandas_api.is_series(obj) or pandas_api.is_index(obj):
        result = pandas_api.get_values(obj)
        is_series[0] = True
    elif isinstance(obj, np.ndarray):
        result = obj
        is_series[0] = False
    else:
        result = pandas_api.series(obj, copy=False).values
        is_series[0] = False

    return result


def concat_arrays(arrays, MemoryPool memory_pool=None):
    """
    Concatenate the given arrays.

    The contents of the input arrays are copied into the returned array.

    Raises
    ------
    ArrowInvalid
        If not all of the arrays have the same type.

    Parameters
    ----------
    arrays : iterable of pyarrow.Array
        Arrays to concatenate, must be identically typed.
    memory_pool : MemoryPool, default None
        For memory allocations. If None, the default pool is used.

    Examples
    --------
    >>> import pyarrow as pa
    >>> arr1 = pa.array([2, 4, 5, 100])
    >>> arr2 = pa.array([2, 4])
    >>> pa.concat_arrays([arr1, arr2])
    <pyarrow.lib.Int64Array object at ...>
    [
      2,
      4,
      5,
      100,
      2,
      4
    ]

    """
    cdef:
        vector[shared_ptr[CArray]] c_arrays
        shared_ptr[CArray] c_concatenated
        CMemoryPool* pool = maybe_unbox_memory_pool(memory_pool)

    for array in arrays:
        if not isinstance(array, Array):
            raise TypeError("Iterable should contain Array objects, "
                            f"got {type(array)} instead")
        c_arrays.push_back(pyarrow_unwrap_array(array))

    with nogil:
        c_concatenated = GetResultValue(Concatenate(c_arrays, pool))

    return pyarrow_wrap_array(c_concatenated)


def _empty_array(DataType type):
    """
    Create an empty array of the given type.
    """
    if type.id == Type_DICTIONARY:
        arr = DictionaryArray.from_arrays(
            _empty_array(type.index_type), _empty_array(type.value_type),
            ordered=type.ordered)
    else:
        arr = array([], type=type)
    return arr