Blame: python/pyarrow/_compute.pxd - apache/arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

ARROW-8918: [C++][Python] Implement cast metafunction to allow use of "cast" with CallFunction, use in Python This provides the `CAST(data AS target_type)` SQL idiom. The target_type is provided via CastOptions (FWIW I believe this is the most correct approach for handling the target_type). As a result we no longer need to maintain separate binding boilerplate in Python for Array vs. ChunkedArray Closes #7258 from wesm/ARROW-8918 Authored-by: Wes McKinney <wesm+git@apache.org> Signed-off-by: Wes McKinney <wesm+git@apache.org> 2020-05-28 16:07:19 -05:00			`# Licensed to the Apache Software Foundation (ASF) under one`
			`# or more contributor license agreements. See the NOTICE file`
			`# distributed with this work for additional information`
			`# regarding copyright ownership. The ASF licenses this file`
			`# to you under the Apache License, Version 2.0 (the`
			`# "License"); you may not use this file except in compliance`
			`# with the License. You may obtain a copy of the License at`
			`#`
			`# http://www.apache.org/licenses/LICENSE-2.0`
			`#`
			`# Unless required by applicable law or agreed to in writing,`
			`# software distributed under the License is distributed on an`
			`# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY`
			`# KIND, either express or implied. See the License for the`
			`# specific language governing permissions and limitations`
			`# under the License.`

			`# cython: language_level = 3`

			`from pyarrow.lib cimport *`
			`from pyarrow.includes.common cimport *`
			`from pyarrow.includes.libarrow cimport *`

GH-35515: [C++][Python] Add non decomposable aggregation UDF (#35514) ### Rationale for this change Non decomposable aggregation is aggregation that cannot be split into consume/merge/finalize. This is often when the logic rewritten with external python libraries (numpy, pandas, statmodels, etc) and those either cannot be decomposed or not worthy the effect (these are often one-off function instead of reusable one). This PR implements the support for non decomposable aggregation UDFs. The major issue with non decomposable UDF is that the UDF needs to see all data at once, unlike scalar UDF where UDF only needs to see a batch at a time. This makes non decomposable not so useful as it is same as collect all the data to a pd.DataFrame and apply the UDF on it. However, one very application of non decomposable UDF is with segmented aggregation. To refresh, segmented aggregation works on ordered data and passed one logic chunk at a time (e.g., all data with the same date). With segmented aggregation and non decomposable aggregation UDF, the user can apply any custom aggregation logic over large stream of ordered data, with the memory overhead of a single segment. ### What changes are included in this PR? This PR is currently WIP and not ready for review. So far I have implemented the minimal amount of code to make a basic test working but needs clean up, error handling etc. * [x] First round of self review * [x] Second round of self review * [x] Implement and test unary * [x] Implement and test varargs * [x] Implement and test Acero support with segmented aggregation ### Are these changes tested? Added new test calling with compute and acero. The compute tests calls the aggregation on the full array. The acero test callings the aggregation with segmented aggregation. ### Are there any user-facing changes? 
* Closes: #35515 Lead-authored-by: Li Jin <ice.xelloss@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Li Jin <ice.xelloss@gmail.com> 2023-06-08 14:12:49 -04:00			`cdef class UdfContext(_Weakrefable):`
ARROW-15639 [C++][Python] UDF Scalar Function Implementation PR for Scalar UDF integration This is the first phase of UDF integration to Arrow. This version only includes ScalarFunctions. In future of PRs, Vector UDF (using Arrow VectorFunction), UDTF (user-defined table function) and Aggregation UDFs will be integrated. This PR includes the following; - [x] UDF Python Scalar Function registration and usage - [x] UDF Python Scalar Function Examples - [x] UDF Python Scalar Function test cases - [x] UDF C++ Example extended from Compute Function Example - [x] Added aggregation example (optional to this PR: if required can remove and push in a different PR) Closes #12590 from vibhatha/arrow-15639 Lead-authored-by: Vibhatha Abeykoon <vibhatha@gmail.com> Co-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@users.noreply.github.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org> 2022-05-03 09:46:11 +02:00			`    cdef:`
GH-35515: [C++][Python] Add non decomposable aggregation UDF (#35514) 2023-06-08 14:12:49 -04:00			`        CUdfContext c_context`
ARROW-15639 [C++][Python] UDF Scalar Function Implementation 2022-05-03 09:46:11 +02:00
GH-35515: [C++][Python] Add non decomposable aggregation UDF (#35514) 2023-06-08 14:12:49 -04:00			`    cdef void init(self, const CUdfContext& c_context)`
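The GH-35515 commit message above contrasts decomposable aggregation (consume/merge/finalize) with non-decomposable aggregation applied one ordered segment at a time. The following is a minimal pure-Python sketch of that distinction, not Arrow's implementation: the helper names `consume`, `merge`, `finalize` and `segmented_aggregate` are hypothetical, and the sample rows are made up for illustration.

```python
from itertools import groupby
from statistics import median


# Decomposable aggregation: a mean can be built batch by batch.
def consume(batch):
    """Summarize one batch as a (sum, count) partial state."""
    return (sum(batch), len(batch))


def merge(left, right):
    """Combine two partial states without revisiting the data."""
    return (left[0] + right[0], left[1] + right[1])


def finalize(state):
    """Turn the merged state into the final mean."""
    total, count = state
    return total / count


# Non-decomposable aggregation: a median needs every value of a group at
# once, so it is applied per segment of already-ordered rows instead.
def segmented_aggregate(rows, key, value, agg):
    """Apply `agg` to each run of consecutive rows that share `key`.

    Assumes `rows` is already ordered by `key` (hypothetical helper).
    Only one segment is materialized at a time.
    """
    results = []
    for segment_key, segment in groupby(rows, key=key):
        results.append((segment_key, agg([value(row) for row in segment])))
    return results


batches = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
state = consume(batches[0])
for batch in batches[1:]:
    state = merge(state, consume(batch))
mean_value = finalize(state)

rows = [
    ("2023-06-07", 1.0),
    ("2023-06-07", 9.0),
    ("2023-06-07", 2.0),
    ("2023-06-08", 4.0),
    ("2023-06-08", 6.0),
]
per_day_median = segmented_aggregate(
    rows, key=lambda r: r[0], value=lambda r: r[1], agg=median
)
```

The mean never holds more than one batch plus a two-number state, while the median holds one full segment; that is the memory trade-off the commit message describes for segmented aggregation.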
ARROW-8918: [C++][Python] Implement cast metafunction to allow use of "cast" with CallFunction, use in Python 2020-05-28 16:07:19 -05:00
GH-14975: [Python] Dataset.sort_by (#14976) * Closes: #14975 - [x] Proof of concept using an ExecPlan - [x] Add test to filter and then sort to confirm lazy filtering works with sorting. Authored-by: Alessandro Molina <amol@turbogears.org> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> 2022-12-23 09:40:40 +01:00
ARROW-9469: [Python] Make more objects weakrefable By default, Cython extension classes (defined with "cdef class") don't have a weakref slot, so add one to all of them. This adds just one memory word to each object, which IMHO is acceptable. Closes #7758 from pitrou/ARROW-9469-py-weakrefable-objects Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com> 2020-07-29 12:24:36 +02:00			`cdef class FunctionOptions(_Weakrefable):`
ARROW-13025: [C++][Python] Add FunctionOptions::Equals/ToString/Serialize This is a draft of adding more utility methods to FunctionOptions. It's not fully implemented (it needs rebasing + serialization isn't implemented for most options, plus there are various TODOs scattered). But before I proceed further, I wanted to get some feedback. Some concerns I have: - I don't like adding protected methods to a struct, and it's inconsistent with how equality is implemented for other structs (via a visitor or otherwise centralized in a single location). However ARROW-8891 will require that we be able to define kernels - and presumably their options - in a separate shared library, so I don't think we can do much better than this. - But for (de)serialization, we'll still need some way to dynamically register the mapping between a type_name and the actual struct, so maybe this is a moot point. - I've exposed the fact that serialization uses StructScalars to support Expression - but maybe this is too much to commit to in the API? Closes #10511 from lidavidm/arrow-13025 Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com> 2021-06-30 14:23:23 -04:00			`    cdef:`
ARROW-12060: [Python] Enable calling compute functions on Expressions Closes #11918 from jorisvandenbossche/ARROW-12060-expressions-compute Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> 2022-01-18 21:39:53 +01:00			`        shared_ptr[CFunctionOptions] wrapped`
ARROW-8918: [C++][Python] Implement cast metafunction to allow use of "cast" with CallFunction, use in Python 2020-05-28 16:07:19 -05:00
			`    cdef const CFunctionOptions* get_options(self) except NULL`
ARROW-12060: [Python] Enable calling compute functions on Expressions 2022-01-18 21:39:53 +01:00			`    cdef void init(self, const shared_ptr[CFunctionOptions]& sp)`

			`    cdef inline shared_ptr[CFunctionOptions] unwrap(self)`
ARROW-15077: [Python] Move Expression class from _dataset to _compute cython module Closes #11938 from jorisvandenbossche/ARROW-15077 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com> 2022-01-14 14:17:01 +01:00

GH-14975: [Python] Dataset.sort_by (#14976) 2022-12-23 09:40:40 +01:00			`cdef class _SortOptions(FunctionOptions):`
			`    pass`


ARROW-15077: [Python] Move Expression class from _dataset to _compute cython module 2022-01-14 14:17:01 +01:00			`cdef CExpression _bind(Expression filter, Schema schema) except *`


			`cdef class Expression(_Weakrefable):`

			`    cdef:`
			`        CExpression expr`

			`    cdef void init(self, const CExpression& sp)`

			`    @staticmethod`
			`    cdef wrap(const CExpression& sp)`

			`    cdef inline CExpression unwrap(self)`

			`    @staticmethod`
			`    cdef Expression _expr_or_scalar(object expr)`
ARROW-14292: [C++][Python] Join foundation for Tables This implements the `Table.join` method and the underlying infrastructure. It provides the `tables_join` function that wraps the `execplan` machinery and tests to verify that joins work as expected. Closes #12452 from amol-/ARROW-14292 Lead-authored-by: Alessandro Molina <amol@turbogears.org> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> 2022-03-29 17:59:04 +02:00

			`cdef CExpression _true`
GH-33976: [Python] Initial bindings for acero Declaration and ExecNodeOptions classes (#34102) First step for GH-33976, adding basic bindings for the different ExecNodeOptions classes and the Declaration class to combine those in a query. Some notes on what is and what is not included in this PR: * For source nodes, didn't expose the generic `SourceNodeOptions` et al, only the concrete `TableSourceNodeOptions` (should probably also add `RecordBatchReaderSourceNodeOptions`) * Didn't yet expose any sink nodes. The table sink is implicitly used by `Declaration.to_table()`, and given that there is currently no explicit API to manually convert to ExecPlan and execute it, explicit table sink node bindings didn't seem necessary. * Also didn't yet expose the order_by sink node, because this requires a custom sink when collecting as a Table, and it's not directly clear how this is possible with the Declaration interface. This requires https://github.com/apache/arrow/issues/34248 to be fixed first. * Leaving dataset-based scan and write nodes for a follow-up PR * Basic class for `Declaration` with a `to_table` method to execute the plan and consume it into a Table, and a `to_reader()` to get a RecordBatchReader (could also further add a `to_batches()` method) -- * Issue: #33976 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> 2023-03-03 13:46:53 +01:00
			`cdef CFieldRef _ensure_field_ref(value) except *`
GH-34248: [Python] Expose the order_by node (#34654) Adds Python bindings for the OrderByNode added in https://github.com/apache/arrow/pull/34249 * Closes: #34248 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> 2023-03-22 14:56:56 +01:00
GH-45380: [Python] Expose RankQuantileOptions to Python (#45392) ### Rationale for this change `RankQuantileOptions` are currently not exposed on Pyarrow and CI job breaks when `-W error` is used. ### What changes are included in this PR? Expose `RankQuantileOptions` and test options and kernel from pyarrow. It also includes some minor refactor for the unwrap sort keys logic to move it into a common function. ### Are these changes tested? Yes ### Are there any user-facing changes? The options for the new kernel are exposed on pyarrow. * GitHub Issue: #45380 Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org> 2025-02-05 16:03:18 +01:00			`cdef vector[CSortKey] unwrap_sort_keys(sort_keys, allow_str=*) except *`

GH-34248: [Python] Expose the order_by node (#34654) 2023-03-22 14:56:56 +01:00			`cdef CSortOrder unwrap_sort_order(order) except *`

			`cdef CNullPlacement unwrap_null_placement(null_placement) except *`
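The three `unwrap_*` declarations above only give signatures; their bodies live elsewhere. As a rough illustration of what a shared sort-key normalizer might look like, here is a pure-Python sketch. It is not the pyarrow implementation: the function names are hypothetical, and the handling of a bare string (read here as an order applied to an empty field name when `allow_str` is true) is an assumption. The accepted spellings `"ascending"`/`"descending"` and `"at_start"`/`"at_end"` are the ones pyarrow's public sort options use.

```python
VALID_ORDERS = ("ascending", "descending")
VALID_NULL_PLACEMENTS = ("at_start", "at_end")


def check_sort_order(order):
    """Return `order` unchanged if it is a recognized sort order."""
    if order not in VALID_ORDERS:
        raise ValueError(f"{order!r} is not a valid order")
    return order


def check_null_placement(null_placement):
    """Return `null_placement` unchanged if it is a recognized placement."""
    if null_placement not in VALID_NULL_PLACEMENTS:
        raise ValueError(f"{null_placement!r} is not a valid null placement")
    return null_placement


def normalize_sort_keys(sort_keys, allow_str=True):
    """Normalize sort keys to a list of (field_name, order) pairs.

    Assumption (hypothetical): when `allow_str` is true, a bare string is
    treated as an order for the input itself, recorded with an empty name.
    Otherwise `sort_keys` is an iterable of (name, order) pairs.
    """
    if allow_str and isinstance(sort_keys, str):
        return [("", check_sort_order(sort_keys))]
    return [(name, check_sort_order(order)) for name, order in sort_keys]


single = normalize_sort_keys("descending")
multi = normalize_sort_keys([("a", "ascending"), ("b", "descending")])
placement = check_null_placement("at_end")
```

Keeping the validation in one helper means every options class that accepts sort keys rejects a misspelled order in the same way, which is the kind of consolidation the GH-45380 commit message mentions.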