SIGN IN SIGN UP
apache / arrow UNCLAIMED

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

0 0 1 C++
#!/usr/bin/env bash
ARROW-7101: [CI] Refactor docker-compose setup and use it with GitHub Actions ## Projecting ideas from ursabot ### Parametric docker images The images are better parameterized now, meaning that we can build more variant of the same service. Couple of examples: ```console UBUNTU=16.04 docker-compose build ubuntu-cpp ARCH=arm64v8 UBUNTU=18.04 docker-compose build ubuntu-cpp PYTHON=3.6 docker-compose build conda-python ARCH=arm32v7 PYTHON=3.6 PANDAS=0.25 docker-compose build conda-python-pandas ``` Each variant has it's own docker image following a string naming schema: `{org}/{arch}-{platform}-{platform-version}[[-{variant}-{variant-version}]..]:latest` ### Use *_build.sh and *_test.sh for each job The docker images provide the environment, and each language backend usually should implement two scripts, a `build.sh` and a `test.sh`. This way dependent build like the docker python, r or c glib are able to reuse the build script of the ancestor without running its tests. With small enough scripts, if the environment is properly set up even the non-docker builds should be reproducible locally. GitHub Actions support bash scripts across all three platforms, so we can reuse the same `*_build.sh` and `*_test.sh` scripts to execute the builds either in docker, on the CI or locally. ## Using GitHub Actions for running the builds Regardless of the CI we're going to choose, the isolation constraint of different platforms requires some sort of virtualisation. Currently linux (and windows, but I have not tried it yet) has lightweight containerisation, so we should keep the linux builds isolated in docker containers. The rest of the platforms (windows and macOS) should be executed on the CI system. GitHub Actions support all three major platforms, linux, windows and macOS. I've added cross platform builds for a couple of languages, like Rust, and Go, the rest are work in progress. ### Workflow A workflow should define all builds of a language, mostly because the path filters can be defined on workflow level. For example the python builds should be triggered if either a cpp/** or a python/** file changes which can be covered in the same workflow file. ## Feature parity with the current builds Reaching feature parity with all of the builds below is not a goal for this PR, the difficult ones should at least have a tracking JIRA ticket. ### Travis-CI - [x] **Lint, Release tests**: - `Lint / C++, Python, R, Rust, Docker, RAT` - `Dev / Source Release` - [x] **C++ unit tests w/ conda-forge toolchain, coverage**: without coverage - `C++ / AMD64 Conda C++` - [x] **Python 3.6 unit tests, conda-forge toolchain, coverage**: without coverage - `Python / AMD64 Conda Python 3.6` - [x] **[OS X] C++ w/ Xcode 9.3**: - `C++ / AMD64 MacOS 10.14 C++`: with Xcode 10.3 - [x] **[OS X] Python w/ Xcode 9.3**: - `Python / AMD64 MacOS 10.14 Python 3`: with Xcode 10.3 - [x] **Java OpenJDK8 and OpenJDK11**: - `Java / AMD64 Debian Java JDK 8 Maven 3.5.2` - `Java / AMD64 Debian Java JDK 11 Maven 3.6.2` - [x] **Protocol / Flight Integration Tests**: - `Dev / Protocol Test` - [x] **NodeJS**: without running lint and coverage - `NodeJS / AMD64 Debian NodeJS 11` - [x] **C++ & GLib & Ruby w/ gcc 5.4**: - `C++ / AMD64 Debian 10 C++`: with GCC 8.3 - `C++ / AMD64 Ubuntu 16.04 C++`: with GCC 5.4 - `C++ / AMD64 Ubuntu 18.04 C++`: with GCC 7.4 - `C GLib / AMD64 Ubuntu 18.04 C GLib` - `Ruby / AMD64 Ubuntu 18.04 Ruby` - [x] **[OS X] C++ & GLib & Ruby w/ XCode 10.2 & Homebrew** - `C++ / AMD64 MacOS 10.14 C++`: with Xcode 10.3 - `C GLib / AMD64 MacOS 10.14 C Glib`: with Xcode 10.3 - `Ruby / AMD64 MacOS 10.14 Ruby`: with Xcode 10.3 - [x] **Go**: without coverage - `Go / AMD64 Debian Go 1.12` - [x] **R (with and without libarrow)**: - `R / AMD64 Conda R 3.6`: with libarrow - `R / AMD64 Ubuntu 18.04 R 3.6` with libarrow ### Appveyor - ~JOB=Build, GENERATOR=Ninja, CONFIGURATION=Release, APPVEYOR_BUILD_WORKER_IMAGE=Visual Studio 2017~ - ~JOB=Toolchain, GENERATOR=Ninja, CONFIGURATION=Release, ARROW_S3=ON, ARROW_BUILD_FLIGHT=ON, ARROW_BUILD_GANDIVA=ON~ - ~JOB=Build_Debug, GENERATOR=Ninja, CONFIGURATION=Debug~ - ~JOB=MinGW32, MINGW_ARCH=i686, MINGW_PACKAGE_PREFIX=mingw-w64-i686, MINGW_PREFIX=c:\msys64\mingw32, MSYSTEM=MINGW32, USE_CLCACHE=false~ - ~JOB=MinGW64, MINGW_ARCH=x86_64, MINGW_PACKAGE_PREFIX=mingw-w64-x86_64, MINGW_PREFIX=c:\msys64\mingw64, MSYSTEM=MINGW64, USE_CLCACHE=false~ - [x] **JOB=Rust, TARGET=x86_64-pc-windows-msvc, USE_CLCACHE=false**: - `Rust / AMD64 Windows 2019 Rust nightly-2019-09-25` - [x] **JOB=C#, APPVEYOR_BUILD_WORKER_IMAGE=Visual Studio 2017, USE_CLCACHE=false** - `C# / AMD64 Windows 2019 C# 2.2.103` - [x] **JOB=Go, MINGW_PACKAGE_PREFIX=mingw-w64-x86_64 ...**: - `Go / AMD64 Windows 2019 Go 1.12` - ~JOB=R with libarrow, USE_CLCACHE=false, TEST_R_WITH_ARROW=TRUE, RWINLIB_LOCAL=%APPVEYOR_BUILD_FOLDER%\libarrow.zip~ ### Github Actions - [x] **Windows MSVC C++ / Build (Visual Studio 16 2019)**: - `C++ / AMD64 Windows 2019 C++`: without tests - [x] **Windows MSVC C++ / Build (Visual Studio 15 2017)**: - `C++ / AMD64 Windows 2016 C++`: without tests - [x] **Linux docker-compose / Test (C++ w/ clang-7 & system packages)**: all have llvm for gandiva but the compiler is set to gcc - `C++ / AMD64 Debian 10 C++`: with GCC 8.3 - `C++ / AMD64 Ubuntu 16.04 C++`: with GCC 5.4 - `C++ / AMD64 Ubuntu 18.04 C++`: with GCC 7.4 - [x] **Linux docker-compose / Test (Rust)**: without rustfmt - `Rust / AMD64 Debian Rust nightly-2019-09-25` - [x] **Linux docker-compose / Test (Lint, Release tests)**: - `Lint / C++, Python, R, Rust, Docker, RAT` - `Dev / Source Release` ### Nightly Crossbow tests The packaging builds are out of the scope if this PR, but the nightly **dockerized test** task are in. Nightly tests: - [x] docker-r - [x] docker-r-conda - [x] docker-r-sanitizer - [x] docker-rust - [x] docker-cpp - [x] docker-cpp-cmake32 - [x] docker-cpp-release - [x] docker-cpp-static-only - [x] docker-c_glib - [x] docker-go - [x] docker-python-2.7 - [x] docker-python-3.6 - [x] docker-python-3.7 - [x] docker-python-2.7-nopandas - [x] docker-python-3.6-nopandas - [x] docker-java - [x] docker-js - [x] docker-docs - [x] docker-lint - [x] docker-iwyu: included in the lint - [x] docker-clang-format: included in the lint - [x] docker-pandas-master - [x] docker-dask-integration - [x] docker-hdfs-integration - [x] docker-spark-integration - [x] docker-turbodbc-integration # TODOs left: - [x] Fix the Apidoc generation for c_glib - [x] Fix the JNI test for Gandiva and ORC - [x] Test that crossbow tests are passing - ~Optionally restore the travis configuration to incrementally decommission old builds~ ## Follow-up JIRAs: - [Archery] Consider porting the docker tool of ursabot to archery - [Archery] Consider to use archery with or instead of the pre-commit hooks - [Archery] Create a wrapper script in archery for docker compose in order to run the containers with the host's user and group - [C++] GCC 5.4.0 has a compile errors, reproduce with UBUNTU=16.04 docker-compose run ubuntu-cpp - [C++][CI] Test the ported fuzzit integration image - [C++][CI] Turn off unnecessary features in the integration tests (spark/turbodbc/dask/hdfs) - [C++][CI] Revisit ASAN UBSAN settings in every C++ based image - [CI] Consider re-adding the removed debian testing image is removed - [Go][CI] Pre-install the go dependencies in the dockerfile using go get - [JS][CI] Pre-install the JS dependencies in the dockerfile - [Rust][CI] Pre-install the rust dependencies in the dockerfile - [Java][CI] Pre-install the java dependencies in the dockerfile - [Ruby][CI] Pre-install the ruby dependencies in the dockerfile and remove it from the test script - [C#][CI] Pre-install the C# dependencies in the dockerfile - [R][CI] Fix the r-sanitizer build https://issues.apache.org/jira/browse/ARROW-6957 - [GLIB][MacOS] Fail to execute lua examples (fails to load 'lgi.corelgilua51' despite that lgi is installed) - [C++][CMake] Automatically set ARROW_GANDIVA_PC_CXX_FLAGS for conda and OSX sdk (see cpp_build.sh) - [C++][CI] Hiveserver2 instegration test fails to connect to impala container - [CI][Spark] Support specific Spark version in the integration tet including latest - [JS][CI] Move nodejs linting from js_build.sh to archery - [Python][CI] create a docker image for python ASV benchmarks and fix the script - [CI] Find a short but related prefix for the env vars used for the docker-compose file to prevent collisions - [C#] the docker container fails to run because of the ubuntu host versions, see https://github.com/dotnet/core/issues/3509 - [C++][Windows] Enable more features on the windows GHA build - [Doc] document docker-compose usage in the developer sphinx guide - [CI][C++] Add .ccache to the docker-compose mounts - [Archery][CI] Refactor the ci/scripts to a sourceable bash functions or to archery directly - [C++][CI] Use scripts/util_coredump.sh to show automatic backtraces - [C++] Fix the hanging C++ tests in Windows 2019 - [CI] Ask INFRA to set up the DOCKERHUB_* secrets for GitHub actions - [C++][CI] Running Gandiva tests fails on Fedora: Reproduce with: `docker-compose run -e ARROW_GANDIVA=ON fedora-cpp` ``` Running gandiva-internals-test, redirecting output into /build/cpp/build/test-logs/gandiva-internals-test.txt (attempt 1/1) 1364 : CommandLine Error: Option 'x86-experimental-vector-widening-legalization' registered more than once! 1365 LLVM ERROR: inconsistency in registered CommandLine options 1366 /build/cpp/src/gandiva ``` - [JS][CI] NodeJS build fails on Github Actions Windows node ``` > NODE_NO_WARNINGS=1 gulp build # 'NODE_NO_WARNINGS' is not recognized as an internal or external command, # operable program or batch file. # npm ERR! code ELIFECYCLE # npm ERR! errno 1 # npm ERR! apache-arrow@1.0.0-SNAPSHOT build: `NODE_NO_WARNINGS=1 gulp build` # npm ERR! Exit status 1 # npm ERR! # npm ERR! Failed at the apache-arrow@1.0.0-SNAPSHOT build script. # npm ERR! This is probably not a problem with npm. There is likely additional logging output above. ``` Closes #5589 from kszucs/docker-refactor and squashes the following commits: 5105d12e6 <Krisztián Szűcs> Rename pull-request folder to dev_cron e9e9a7eec <Krisztián Szűcs> Use underscores for naming the workflow files a92c99d03 <Krisztián Szűcs> Disable hanging C++ tests on windows f158c89b5 <Krisztián Szűcs> Attempt to push from apache/arrow master; Don't push from crossbow tasks 0e1d470a1 <Krisztián Szűcs> Turn off ORC on macOS C++ test due to link error 258db5cff <Krisztián Szűcs> Only push docker images from apache/arrow repository acdfcf086 <Krisztián Szűcs> Remove ORC from the brewfile 5102b85b1 <Krisztián Szűcs> Fix nodeJS workflow 032d6a388 <Krisztián Szűcs> Turn off 2 python builds 7f15b97a8 <Krisztián Szűcs> Filter branches 48b8d128a <Krisztián Szűcs> Fix workflows 36ad9d297 <Krisztián Szűcs> Disable builds 0f603af0c <Krisztián Szűcs> master only and cron workflows 28cc2d78d <Krisztián Szűcs> Rename Java JNI workflow bcd8af7b7 <Krisztián Szűcs> Port the remaining travis utility scripts ed5688154 <Krisztián Szűcs> Usage comments; recommend installing pandas from the docs because of its removal from conda_env_python 3c8c023ce <Krisztián Szűcs> Use Arch in volumes; some comments; remove conda version 'latest' from the images 771b023a8 <Krisztián Szűcs> Cleanup files; separate JNI builds 97ff8a122 <Krisztián Szűcs> Push docker images only from master dc00b4297 <Krisztián Szűcs> Enable path filters e0e2e1f46 <Krisztián Szűcs> Fix pandas master build 3814e0828 <Krisztián Szűcs> Fix manylinux volumes c18edda70 <Krisztián Szűcs> Add CentOS version to the manylinux image names c8b9dd6b1 <Krisztián Szűcs> Missing --pyargs argument for the python test command 33e646981 <Krisztián Szűcs> Turn off gandiva and flight for the HDFS test b9c547889 <Krisztián Szűcs> Refactor docker-compose file and use it with github actions. Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
2019-11-12 11:07:48 +01:00
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
set -ex
GH-46699: [CI][Dev] fix shellcheck errors in the ci/scripts/cpp_test.sh (#46700) ### Rationale for this change `ci/scripts/cpp_test.sh` violates two shellcheck rules. * SC2071: `< is for string comparisons. Use -lt instead.` * SC2086: `Double quote to prevent globbing and word splitting.` ``` ./ci/scripts/cpp_test.sh In ./ci/scripts/cpp_test.sh line 22: if [[ $# < 2 ]]; then ^-- SC2071 (error): < is for string comparisons. Use -lt instead. In ./ci/scripts/cpp_test.sh line 87: pushd ${build_dir} ^----------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: pushd "${build_dir}" In ./ci/scripts/cpp_test.sh line 103: --parallel ${n_jobs} \ ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: --parallel "${n_jobs}" \ In ./ci/scripts/cpp_test.sh line 105: --timeout ${ARROW_CTEST_TIMEOUT:-300} \ ^-------------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: --timeout "${ARROW_CTEST_TIMEOUT:-300}" \ In ./ci/scripts/cpp_test.sh line 111: examples=$(find ${binary_output_dir} -executable -name "*example") ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: examples=$(find "${binary_output_dir}" -executable -name "*example") In ./ci/scripts/cpp_test.sh line 129: ${binary_output_dir}/arrow-ipc-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-stream/crash-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-stream/crash-* In ./ci/scripts/cpp_test.sh line 130: ${binary_output_dir}/arrow-ipc-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-stream/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-stream/*-testcase-* In ./ci/scripts/cpp_test.sh line 131: ${binary_output_dir}/arrow-ipc-file-fuzz ${ARROW_TEST_DATA}/arrow-ipc-file/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-file-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-file/*-testcase-* In ./ci/scripts/cpp_test.sh line 132: ${binary_output_dir}/arrow-ipc-tensor-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-tensor-stream/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-tensor-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-tensor-stream/*-testcase-* In ./ci/scripts/cpp_test.sh line 134: ${binary_output_dir}/parquet-arrow-fuzz ${ARROW_TEST_DATA}/parquet/fuzzing/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/parquet-arrow-fuzz "${ARROW_TEST_DATA}"/parquet/fuzzing/*-testcase-* For more information: https://www.shellcheck.net/wiki/SC2071 -- < is for string comparisons. Use ... https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ... ``` ### What changes are included in this PR? * Use `-lt` instead of `<` * Quote variables. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #46699 Authored-by: Hiroyuki Sato <hiroysato@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
2025-06-05 09:26:14 +09:00
if [[ $# -lt 2 ]]; then
echo "Usage: $0 <Arrow dir> <build dir> [ctest args ...]"
exit 1
fi
arrow_dir=${1}; shift
build_dir=${1}/cpp; shift
source_dir=${arrow_dir}/cpp
binary_output_dir=${build_dir}/${ARROW_BUILD_TYPE:-debug}
ARROW-7101: [CI] Refactor docker-compose setup and use it with GitHub Actions ## Projecting ideas from ursabot ### Parametric docker images The images are better parameterized now, meaning that we can build more variant of the same service. Couple of examples: ```console UBUNTU=16.04 docker-compose build ubuntu-cpp ARCH=arm64v8 UBUNTU=18.04 docker-compose build ubuntu-cpp PYTHON=3.6 docker-compose build conda-python ARCH=arm32v7 PYTHON=3.6 PANDAS=0.25 docker-compose build conda-python-pandas ``` Each variant has it's own docker image following a string naming schema: `{org}/{arch}-{platform}-{platform-version}[[-{variant}-{variant-version}]..]:latest` ### Use *_build.sh and *_test.sh for each job The docker images provide the environment, and each language backend usually should implement two scripts, a `build.sh` and a `test.sh`. This way dependent build like the docker python, r or c glib are able to reuse the build script of the ancestor without running its tests. With small enough scripts, if the environment is properly set up even the non-docker builds should be reproducible locally. GitHub Actions support bash scripts across all three platforms, so we can reuse the same `*_build.sh` and `*_test.sh` scripts to execute the builds either in docker, on the CI or locally. ## Using GitHub Actions for running the builds Regardless of the CI we're going to choose, the isolation constraint of different platforms requires some sort of virtualisation. Currently linux (and windows, but I have not tried it yet) has lightweight containerisation, so we should keep the linux builds isolated in docker containers. The rest of the platforms (windows and macOS) should be executed on the CI system. GitHub Actions support all three major platforms, linux, windows and macOS. I've added cross platform builds for a couple of languages, like Rust, and Go, the rest are work in progress. ### Workflow A workflow should define all builds of a language, mostly because the path filters can be defined on workflow level. For example the python builds should be triggered if either a cpp/** or a python/** file changes which can be covered in the same workflow file. ## Feature parity with the current builds Reaching feature parity with all of the builds below is not a goal for this PR, the difficult ones should at least have a tracking JIRA ticket. ### Travis-CI - [x] **Lint, Release tests**: - `Lint / C++, Python, R, Rust, Docker, RAT` - `Dev / Source Release` - [x] **C++ unit tests w/ conda-forge toolchain, coverage**: without coverage - `C++ / AMD64 Conda C++` - [x] **Python 3.6 unit tests, conda-forge toolchain, coverage**: without coverage - `Python / AMD64 Conda Python 3.6` - [x] **[OS X] C++ w/ Xcode 9.3**: - `C++ / AMD64 MacOS 10.14 C++`: with Xcode 10.3 - [x] **[OS X] Python w/ Xcode 9.3**: - `Python / AMD64 MacOS 10.14 Python 3`: with Xcode 10.3 - [x] **Java OpenJDK8 and OpenJDK11**: - `Java / AMD64 Debian Java JDK 8 Maven 3.5.2` - `Java / AMD64 Debian Java JDK 11 Maven 3.6.2` - [x] **Protocol / Flight Integration Tests**: - `Dev / Protocol Test` - [x] **NodeJS**: without running lint and coverage - `NodeJS / AMD64 Debian NodeJS 11` - [x] **C++ & GLib & Ruby w/ gcc 5.4**: - `C++ / AMD64 Debian 10 C++`: with GCC 8.3 - `C++ / AMD64 Ubuntu 16.04 C++`: with GCC 5.4 - `C++ / AMD64 Ubuntu 18.04 C++`: with GCC 7.4 - `C GLib / AMD64 Ubuntu 18.04 C GLib` - `Ruby / AMD64 Ubuntu 18.04 Ruby` - [x] **[OS X] C++ & GLib & Ruby w/ XCode 10.2 & Homebrew** - `C++ / AMD64 MacOS 10.14 C++`: with Xcode 10.3 - `C GLib / AMD64 MacOS 10.14 C Glib`: with Xcode 10.3 - `Ruby / AMD64 MacOS 10.14 Ruby`: with Xcode 10.3 - [x] **Go**: without coverage - `Go / AMD64 Debian Go 1.12` - [x] **R (with and without libarrow)**: - `R / AMD64 Conda R 3.6`: with libarrow - `R / AMD64 Ubuntu 18.04 R 3.6` with libarrow ### Appveyor - ~JOB=Build, GENERATOR=Ninja, CONFIGURATION=Release, APPVEYOR_BUILD_WORKER_IMAGE=Visual Studio 2017~ - ~JOB=Toolchain, GENERATOR=Ninja, CONFIGURATION=Release, ARROW_S3=ON, ARROW_BUILD_FLIGHT=ON, ARROW_BUILD_GANDIVA=ON~ - ~JOB=Build_Debug, GENERATOR=Ninja, CONFIGURATION=Debug~ - ~JOB=MinGW32, MINGW_ARCH=i686, MINGW_PACKAGE_PREFIX=mingw-w64-i686, MINGW_PREFIX=c:\msys64\mingw32, MSYSTEM=MINGW32, USE_CLCACHE=false~ - ~JOB=MinGW64, MINGW_ARCH=x86_64, MINGW_PACKAGE_PREFIX=mingw-w64-x86_64, MINGW_PREFIX=c:\msys64\mingw64, MSYSTEM=MINGW64, USE_CLCACHE=false~ - [x] **JOB=Rust, TARGET=x86_64-pc-windows-msvc, USE_CLCACHE=false**: - `Rust / AMD64 Windows 2019 Rust nightly-2019-09-25` - [x] **JOB=C#, APPVEYOR_BUILD_WORKER_IMAGE=Visual Studio 2017, USE_CLCACHE=false** - `C# / AMD64 Windows 2019 C# 2.2.103` - [x] **JOB=Go, MINGW_PACKAGE_PREFIX=mingw-w64-x86_64 ...**: - `Go / AMD64 Windows 2019 Go 1.12` - ~JOB=R with libarrow, USE_CLCACHE=false, TEST_R_WITH_ARROW=TRUE, RWINLIB_LOCAL=%APPVEYOR_BUILD_FOLDER%\libarrow.zip~ ### Github Actions - [x] **Windows MSVC C++ / Build (Visual Studio 16 2019)**: - `C++ / AMD64 Windows 2019 C++`: without tests - [x] **Windows MSVC C++ / Build (Visual Studio 15 2017)**: - `C++ / AMD64 Windows 2016 C++`: without tests - [x] **Linux docker-compose / Test (C++ w/ clang-7 & system packages)**: all have llvm for gandiva but the compiler is set to gcc - `C++ / AMD64 Debian 10 C++`: with GCC 8.3 - `C++ / AMD64 Ubuntu 16.04 C++`: with GCC 5.4 - `C++ / AMD64 Ubuntu 18.04 C++`: with GCC 7.4 - [x] **Linux docker-compose / Test (Rust)**: without rustfmt - `Rust / AMD64 Debian Rust nightly-2019-09-25` - [x] **Linux docker-compose / Test (Lint, Release tests)**: - `Lint / C++, Python, R, Rust, Docker, RAT` - `Dev / Source Release` ### Nightly Crossbow tests The packaging builds are out of the scope if this PR, but the nightly **dockerized test** task are in. Nightly tests: - [x] docker-r - [x] docker-r-conda - [x] docker-r-sanitizer - [x] docker-rust - [x] docker-cpp - [x] docker-cpp-cmake32 - [x] docker-cpp-release - [x] docker-cpp-static-only - [x] docker-c_glib - [x] docker-go - [x] docker-python-2.7 - [x] docker-python-3.6 - [x] docker-python-3.7 - [x] docker-python-2.7-nopandas - [x] docker-python-3.6-nopandas - [x] docker-java - [x] docker-js - [x] docker-docs - [x] docker-lint - [x] docker-iwyu: included in the lint - [x] docker-clang-format: included in the lint - [x] docker-pandas-master - [x] docker-dask-integration - [x] docker-hdfs-integration - [x] docker-spark-integration - [x] docker-turbodbc-integration # TODOs left: - [x] Fix the Apidoc generation for c_glib - [x] Fix the JNI test for Gandiva and ORC - [x] Test that crossbow tests are passing - ~Optionally restore the travis configuration to incrementally decommission old builds~ ## Follow-up JIRAs: - [Archery] Consider porting the docker tool of ursabot to archery - [Archery] Consider to use archery with or instead of the pre-commit hooks - [Archery] Create a wrapper script in archery for docker compose in order to run the containers with the host's user and group - [C++] GCC 5.4.0 has a compile errors, reproduce with UBUNTU=16.04 docker-compose run ubuntu-cpp - [C++][CI] Test the ported fuzzit integration image - [C++][CI] Turn off unnecessary features in the integration tests (spark/turbodbc/dask/hdfs) - [C++][CI] Revisit ASAN UBSAN settings in every C++ based image - [CI] Consider re-adding the removed debian testing image is removed - [Go][CI] Pre-install the go dependencies in the dockerfile using go get - [JS][CI] Pre-install the JS dependencies in the dockerfile - [Rust][CI] Pre-install the rust dependencies in the dockerfile - [Java][CI] Pre-install the java dependencies in the dockerfile - [Ruby][CI] Pre-install the ruby dependencies in the dockerfile and remove it from the test script - [C#][CI] Pre-install the C# dependencies in the dockerfile - [R][CI] Fix the r-sanitizer build https://issues.apache.org/jira/browse/ARROW-6957 - [GLIB][MacOS] Fail to execute lua examples (fails to load 'lgi.corelgilua51' despite that lgi is installed) - [C++][CMake] Automatically set ARROW_GANDIVA_PC_CXX_FLAGS for conda and OSX sdk (see cpp_build.sh) - [C++][CI] Hiveserver2 instegration test fails to connect to impala container - [CI][Spark] Support specific Spark version in the integration tet including latest - [JS][CI] Move nodejs linting from js_build.sh to archery - [Python][CI] create a docker image for python ASV benchmarks and fix the script - [CI] Find a short but related prefix for the env vars used for the docker-compose file to prevent collisions - [C#] the docker container fails to run because of the ubuntu host versions, see https://github.com/dotnet/core/issues/3509 - [C++][Windows] Enable more features on the windows GHA build - [Doc] document docker-compose usage in the developer sphinx guide - [CI][C++] Add .ccache to the docker-compose mounts - [Archery][CI] Refactor the ci/scripts to a sourceable bash functions or to archery directly - [C++][CI] Use scripts/util_coredump.sh to show automatic backtraces - [C++] Fix the hanging C++ tests in Windows 2019 - [CI] Ask INFRA to set up the DOCKERHUB_* secrets for GitHub actions - [C++][CI] Running Gandiva tests fails on Fedora: Reproduce with: `docker-compose run -e ARROW_GANDIVA=ON fedora-cpp` ``` Running gandiva-internals-test, redirecting output into /build/cpp/build/test-logs/gandiva-internals-test.txt (attempt 1/1) 1364 : CommandLine Error: Option 'x86-experimental-vector-widening-legalization' registered more than once! 1365 LLVM ERROR: inconsistency in registered CommandLine options 1366 /build/cpp/src/gandiva ``` - [JS][CI] NodeJS build fails on Github Actions Windows node ``` > NODE_NO_WARNINGS=1 gulp build # 'NODE_NO_WARNINGS' is not recognized as an internal or external command, # operable program or batch file. # npm ERR! code ELIFECYCLE # npm ERR! errno 1 # npm ERR! apache-arrow@1.0.0-SNAPSHOT build: `NODE_NO_WARNINGS=1 gulp build` # npm ERR! Exit status 1 # npm ERR! # npm ERR! Failed at the apache-arrow@1.0.0-SNAPSHOT build script. # npm ERR! This is probably not a problem with npm. There is likely additional logging output above. ``` Closes #5589 from kszucs/docker-refactor and squashes the following commits: 5105d12e6 <Krisztián Szűcs> Rename pull-request folder to dev_cron e9e9a7eec <Krisztián Szűcs> Use underscores for naming the workflow files a92c99d03 <Krisztián Szűcs> Disable hanging C++ tests on windows f158c89b5 <Krisztián Szűcs> Attempt to push from apache/arrow master; Don't push from crossbow tasks 0e1d470a1 <Krisztián Szűcs> Turn off ORC on macOS C++ test due to link error 258db5cff <Krisztián Szűcs> Only push docker images from apache/arrow repository acdfcf086 <Krisztián Szűcs> Remove ORC from the brewfile 5102b85b1 <Krisztián Szűcs> Fix nodeJS workflow 032d6a388 <Krisztián Szűcs> Turn off 2 python builds 7f15b97a8 <Krisztián Szűcs> Filter branches 48b8d128a <Krisztián Szűcs> Fix workflows 36ad9d297 <Krisztián Szűcs> Disable builds 0f603af0c <Krisztián Szűcs> master only and cron workflows 28cc2d78d <Krisztián Szűcs> Rename Java JNI workflow bcd8af7b7 <Krisztián Szűcs> Port the remaining travis utility scripts ed5688154 <Krisztián Szűcs> Usage comments; recommend installing pandas from the docs because of its removal from conda_env_python 3c8c023ce <Krisztián Szűcs> Use Arch in volumes; some comments; remove conda version 'latest' from the images 771b023a8 <Krisztián Szűcs> Cleanup files; separate JNI builds 97ff8a122 <Krisztián Szűcs> Push docker images only from master dc00b4297 <Krisztián Szűcs> Enable path filters e0e2e1f46 <Krisztián Szűcs> Fix pandas master build 3814e0828 <Krisztián Szűcs> Fix manylinux volumes c18edda70 <Krisztián Szűcs> Add CentOS version to the manylinux image names c8b9dd6b1 <Krisztián Szűcs> Missing --pyargs argument for the python test command 33e646981 <Krisztián Szűcs> Turn off gandiva and flight for the HDFS test b9c547889 <Krisztián Szűcs> Refactor docker-compose file and use it with github actions. Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
2019-11-12 11:07:48 +01:00
export ARROW_TEST_DATA=${arrow_dir}/testing/data
export PARQUET_TEST_DATA=${source_dir}/submodules/parquet-testing/data
export LD_LIBRARY_PATH=${ARROW_HOME}/${CMAKE_INSTALL_LIBDIR:-lib}:${LD_LIBRARY_PATH}
# By default, aws-sdk tries to contact a non-existing local ip host
# to retrieve metadata. Disable this so that S3FileSystem tests run faster.
export AWS_EC2_METADATA_DISABLED=TRUE
# Enable memory debug checks if the env is not set already
if [ -z "${ARROW_DEBUG_MEMORY_POOL}" ]; then
export ARROW_DEBUG_MEMORY_POOL=trap
fi
exclude_tests=()
ctest_options=()
if ! type azurite >/dev/null 2>&1; then
exclude_tests+=("arrow-azurefs-test")
fi
if ! type storage-testbench >/dev/null 2>&1; then
exclude_tests+=("arrow-gcsfs-test")
fi
if ! type minio >/dev/null 2>&1; then
exclude_tests+=("arrow-s3fs-test")
fi
case "$(uname)" in
Linux)
n_jobs=$(nproc)
;;
Darwin)
n_jobs=$(sysctl -n hw.ncpu)
# TODO: https://github.com/apache/arrow/issues/40410
exclude_tests+=("arrow-s3fs-test")
;;
MINGW*)
n_jobs=${NUMBER_OF_PROCESSORS:-1}
# TODO: Enable these crashed tests.
# https://issues.apache.org/jira/browse/ARROW-9072
exclude_tests+=("gandiva-binary-test")
exclude_tests+=("gandiva-boolean-expr-test")
exclude_tests+=("gandiva-date-time-test")
exclude_tests+=("gandiva-decimal-single-test")
exclude_tests+=("gandiva-decimal-test")
exclude_tests+=("gandiva-filter-project-test")
exclude_tests+=("gandiva-filter-test")
exclude_tests+=("gandiva-hash-test")
exclude_tests+=("gandiva-if-expr-test")
exclude_tests+=("gandiva-in-expr-test")
exclude_tests+=("gandiva-internals-test")
exclude_tests+=("gandiva-literal-test")
exclude_tests+=("gandiva-null-validity-test")
exclude_tests+=("gandiva-precompiled-test")
exclude_tests+=("gandiva-projector-test")
exclude_tests+=("gandiva-utf8-test")
;;
*)
n_jobs=${NPROC:-1}
;;
esac
if [ "${#exclude_tests[@]}" -gt 0 ]; then
IFS="|"
ctest_options+=(--exclude-regex "${exclude_tests[*]}")
unset IFS
fi
ARROW-7119: [C++][CI] Show automatic backtraces Automatically generate coredumps and backtraces for the failing C++ tests on both Linux (including docker) and macOS. Closes #6202 from kszucs/coredump and squashes the following commits: f8200da97 <Krisztián Szűcs> remove sudo from the comment 1e75d9349 <Krisztián Szűcs> resolve review issues ef5ac74e0 <Krisztián Szűcs> revert segfault enforcing change deaca223a <Krisztián Szűcs> remove debug commands 467e0001d <Krisztián Szűcs> set ulimit within the same shell d2af4e07c <Krisztián Szűcs> debug macos coredumps 7eae170cd <Krisztián Szűcs> debug macos 01961de60 <Krisztián Szűcs> macos debug d0e4f623d <Krisztián Szűcs> remove debug commands 382a5f02a <Krisztián Szűcs> install gdb in more images 95c5384f9 <Krisztián Szűcs> limit the pattern for the first 15 chars e75238cf6 <Krisztián Szűcs> debug 76d072b67 <Krisztián Szűcs> set ulimit b8db15f16 <Krisztián Szűcs> no sudo 04fef61ae <Krisztián Szűcs> use sysctl 6c8bc4ffe <Krisztián Szűcs> debug e8ec7ca53 <Krisztián Szűcs> remove util_coredump 655db2291 <Krisztián Szűcs> workarounds cf03c0c3d <Krisztián Szűcs> ARROW_TEST_ULIMIT_CORE: unlimited 8a092cb1b <Krisztián Szűcs> tune run-tests.sh 7f2ed2dbe <Krisztián Szűcs> delimiter e830bc7bd <Krisztián Szűcs> debug 187c32e23 <Krisztián Szűcs> permissions ada66c06a <Krisztián Szűcs> force segfault 379b6ea5b <Krisztián Szűcs> enable for python and ruby macos builds 3f4af23c1 <Krisztián Szűcs> executable flag b51b1fe49 <Krisztián Szűcs> on failure 28628acbb <Krisztián Szűcs> continue on error to show the coredump eb1cdb57f <Krisztián Szűcs> coredump on macos Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
2020-02-12 08:53:23 +09:00
if [ "${ARROW_EMSCRIPTEN:-OFF}" = "ON" ]; then
n_jobs=1 # avoid spurious fails on emscripten due to loading too many big executables
fi
GH-46699: [CI][Dev] fix shellcheck errors in the ci/scripts/cpp_test.sh (#46700) ### Rationale for this change `ci/scripts/cpp_test.sh` violates two shellcheck rules. * SC2071: `< is for string comparisons. Use -lt instead.` * SC2086: `Double quote to prevent globbing and word splitting.` ``` ./ci/scripts/cpp_test.sh In ./ci/scripts/cpp_test.sh line 22: if [[ $# < 2 ]]; then ^-- SC2071 (error): < is for string comparisons. Use -lt instead. In ./ci/scripts/cpp_test.sh line 87: pushd ${build_dir} ^----------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: pushd "${build_dir}" In ./ci/scripts/cpp_test.sh line 103: --parallel ${n_jobs} \ ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: --parallel "${n_jobs}" \ In ./ci/scripts/cpp_test.sh line 105: --timeout ${ARROW_CTEST_TIMEOUT:-300} \ ^-------------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: --timeout "${ARROW_CTEST_TIMEOUT:-300}" \ In ./ci/scripts/cpp_test.sh line 111: examples=$(find ${binary_output_dir} -executable -name "*example") ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: examples=$(find "${binary_output_dir}" -executable -name "*example") In ./ci/scripts/cpp_test.sh line 129: ${binary_output_dir}/arrow-ipc-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-stream/crash-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-stream/crash-* In ./ci/scripts/cpp_test.sh line 130: ${binary_output_dir}/arrow-ipc-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-stream/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-stream/*-testcase-* In ./ci/scripts/cpp_test.sh line 131: ${binary_output_dir}/arrow-ipc-file-fuzz ${ARROW_TEST_DATA}/arrow-ipc-file/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-file-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-file/*-testcase-* In ./ci/scripts/cpp_test.sh line 132: ${binary_output_dir}/arrow-ipc-tensor-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-tensor-stream/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-tensor-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-tensor-stream/*-testcase-* In ./ci/scripts/cpp_test.sh line 134: ${binary_output_dir}/parquet-arrow-fuzz ${ARROW_TEST_DATA}/parquet/fuzzing/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/parquet-arrow-fuzz "${ARROW_TEST_DATA}"/parquet/fuzzing/*-testcase-* For more information: https://www.shellcheck.net/wiki/SC2071 -- < is for string comparisons. Use ... https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ... ``` ### What changes are included in this PR? * Use `-lt` instead of `<` * Quote variables. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #46699 Authored-by: Hiroyuki Sato <hiroysato@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
2025-06-05 09:26:14 +09:00
pushd "${build_dir}"
ARROW-7119: [C++][CI] Show automatic backtraces Automatically generate coredumps and backtraces for the failing C++ tests on both Linux (including docker) and macOS. Closes #6202 from kszucs/coredump and squashes the following commits: f8200da97 <Krisztián Szűcs> remove sudo from the comment 1e75d9349 <Krisztián Szűcs> resolve review issues ef5ac74e0 <Krisztián Szűcs> revert segfault enforcing change deaca223a <Krisztián Szűcs> remove debug commands 467e0001d <Krisztián Szűcs> set ulimit within the same shell d2af4e07c <Krisztián Szűcs> debug macos coredumps 7eae170cd <Krisztián Szűcs> debug macos 01961de60 <Krisztián Szűcs> macos debug d0e4f623d <Krisztián Szűcs> remove debug commands 382a5f02a <Krisztián Szűcs> install gdb in more images 95c5384f9 <Krisztián Szűcs> limit the pattern for the first 15 chars e75238cf6 <Krisztián Szűcs> debug 76d072b67 <Krisztián Szűcs> set ulimit b8db15f16 <Krisztián Szűcs> no sudo 04fef61ae <Krisztián Szűcs> use sysctl 6c8bc4ffe <Krisztián Szűcs> debug e8ec7ca53 <Krisztián Szűcs> remove util_coredump 655db2291 <Krisztián Szűcs> workarounds cf03c0c3d <Krisztián Szűcs> ARROW_TEST_ULIMIT_CORE: unlimited 8a092cb1b <Krisztián Szűcs> tune run-tests.sh 7f2ed2dbe <Krisztián Szűcs> delimiter e830bc7bd <Krisztián Szűcs> debug 187c32e23 <Krisztián Szűcs> permissions ada66c06a <Krisztián Szűcs> force segfault 379b6ea5b <Krisztián Szűcs> enable for python and ruby macos builds 3f4af23c1 <Krisztián Szűcs> executable flag b51b1fe49 <Krisztián Szűcs> on failure 28628acbb <Krisztián Szűcs> continue on error to show the coredump eb1cdb57f <Krisztián Szűcs> coredump on macos Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
2020-02-12 08:53:23 +09:00
if [ -z "${PYTHON}" ] && ! which python > /dev/null 2>&1; then
export PYTHON="${PYTHON:-python3}"
fi
if [ "${ARROW_USE_MESON:-OFF}" = "ON" ]; then
ARROW_BUILD_EXAMPLES=OFF # TODO: Remove this
meson test \
--max-lines=0 \
--no-rebuild \
--print-errorlogs \
--suite arrow \
--timeout-multiplier=10 \
"$@"
else
ctest \
--label-regex unittest \
--output-on-failure \
GH-46699: [CI][Dev] fix shellcheck errors in the ci/scripts/cpp_test.sh (#46700) ### Rationale for this change `ci/scripts/cpp_test.sh` violates two shellcheck rules. * SC2071: `< is for string comparisons. Use -lt instead.` * SC2086: `Double quote to prevent globbing and word splitting.` ``` ./ci/scripts/cpp_test.sh In ./ci/scripts/cpp_test.sh line 22: if [[ $# < 2 ]]; then ^-- SC2071 (error): < is for string comparisons. Use -lt instead. In ./ci/scripts/cpp_test.sh line 87: pushd ${build_dir} ^----------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: pushd "${build_dir}" In ./ci/scripts/cpp_test.sh line 103: --parallel ${n_jobs} \ ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: --parallel "${n_jobs}" \ In ./ci/scripts/cpp_test.sh line 105: --timeout ${ARROW_CTEST_TIMEOUT:-300} \ ^-------------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: --timeout "${ARROW_CTEST_TIMEOUT:-300}" \ In ./ci/scripts/cpp_test.sh line 111: examples=$(find ${binary_output_dir} -executable -name "*example") ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: examples=$(find "${binary_output_dir}" -executable -name "*example") In ./ci/scripts/cpp_test.sh line 129: ${binary_output_dir}/arrow-ipc-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-stream/crash-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-stream/crash-* In ./ci/scripts/cpp_test.sh line 130: ${binary_output_dir}/arrow-ipc-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-stream/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-stream/*-testcase-* In ./ci/scripts/cpp_test.sh line 131: ${binary_output_dir}/arrow-ipc-file-fuzz ${ARROW_TEST_DATA}/arrow-ipc-file/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-file-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-file/*-testcase-* In ./ci/scripts/cpp_test.sh line 132: ${binary_output_dir}/arrow-ipc-tensor-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-tensor-stream/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-tensor-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-tensor-stream/*-testcase-* In ./ci/scripts/cpp_test.sh line 134: ${binary_output_dir}/parquet-arrow-fuzz ${ARROW_TEST_DATA}/parquet/fuzzing/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/parquet-arrow-fuzz "${ARROW_TEST_DATA}"/parquet/fuzzing/*-testcase-* For more information: https://www.shellcheck.net/wiki/SC2071 -- < is for string comparisons. Use ... https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ... ``` ### What changes are included in this PR? * Use `-lt` instead of `<` * Quote variables. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #46699 Authored-by: Hiroyuki Sato <hiroysato@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
2025-06-05 09:26:14 +09:00
--parallel "${n_jobs}" \
--repeat until-pass:3 \
GH-46699: [CI][Dev] fix shellcheck errors in the ci/scripts/cpp_test.sh (#46700) ### Rationale for this change `ci/scripts/cpp_test.sh` violates two shellcheck rules. * SC2071: `< is for string comparisons. Use -lt instead.` * SC2086: `Double quote to prevent globbing and word splitting.` ``` ./ci/scripts/cpp_test.sh In ./ci/scripts/cpp_test.sh line 22: if [[ $# < 2 ]]; then ^-- SC2071 (error): < is for string comparisons. Use -lt instead. In ./ci/scripts/cpp_test.sh line 87: pushd ${build_dir} ^----------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: pushd "${build_dir}" In ./ci/scripts/cpp_test.sh line 103: --parallel ${n_jobs} \ ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: --parallel "${n_jobs}" \ In ./ci/scripts/cpp_test.sh line 105: --timeout ${ARROW_CTEST_TIMEOUT:-300} \ ^-------------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: --timeout "${ARROW_CTEST_TIMEOUT:-300}" \ In ./ci/scripts/cpp_test.sh line 111: examples=$(find ${binary_output_dir} -executable -name "*example") ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: examples=$(find "${binary_output_dir}" -executable -name "*example") In ./ci/scripts/cpp_test.sh line 129: ${binary_output_dir}/arrow-ipc-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-stream/crash-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-stream/crash-* In ./ci/scripts/cpp_test.sh line 130: ${binary_output_dir}/arrow-ipc-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-stream/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-stream/*-testcase-* In ./ci/scripts/cpp_test.sh line 131: ${binary_output_dir}/arrow-ipc-file-fuzz ${ARROW_TEST_DATA}/arrow-ipc-file/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-file-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-file/*-testcase-* In ./ci/scripts/cpp_test.sh line 132: ${binary_output_dir}/arrow-ipc-tensor-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-tensor-stream/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-tensor-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-tensor-stream/*-testcase-* In ./ci/scripts/cpp_test.sh line 134: ${binary_output_dir}/parquet-arrow-fuzz ${ARROW_TEST_DATA}/parquet/fuzzing/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/parquet-arrow-fuzz "${ARROW_TEST_DATA}"/parquet/fuzzing/*-testcase-* For more information: https://www.shellcheck.net/wiki/SC2071 -- < is for string comparisons. Use ... https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ... ``` ### What changes are included in this PR? * Use `-lt` instead of `<` * Quote variables. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #46699 Authored-by: Hiroyuki Sato <hiroysato@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
2025-06-05 09:26:14 +09:00
--timeout "${ARROW_CTEST_TIMEOUT:-300}" \
"${ctest_options[@]}" \
"$@"
fi
# This is for testing find_package(Arrow).
#
# Note that this is not a perfect solution. We should improve this
# later.
#
# * This is ad-hoc
# * This doesn't test other CMake packages such as ArrowDataset
if [ "${ARROW_USE_MESON:-OFF}" = "OFF" ] && \
[ "${ARROW_EMSCRIPTEN:-OFF}" = "OFF" ] && \
[ "${ARROW_USE_ASAN:-OFF}" = "OFF" ] && \
[ "${ARROW_USE_TSAN:-OFF}" = "OFF" ] && \
[ "${ARROW_CSV:-ON}" = "ON" ]; then
CMAKE_PREFIX_PATH="${CMAKE_INSTALL_PREFIX:-${ARROW_HOME}}"
case "$(uname)" in
MINGW*)
# <prefix>/lib/cmake/ isn't searched on Windows.
#
# See also:
# https://cmake.org/cmake/help/latest/command/find_package.html#config-mode-search-procedure
CMAKE_PREFIX_PATH+="/lib/cmake/"
;;
esac
if [ -n "${VCPKG_ROOT}" ] && [ -n "${VCPKG_DEFAULT_TRIPLET}" ]; then
# Search vcpkg before <prefix>/lib/cmake.
CMAKE_PREFIX_PATH="${VCPKG_ROOT}/installed/${VCPKG_DEFAULT_TRIPLET};${CMAKE_PREFIX_PATH}"
fi
cmake \
-S "${source_dir}/examples/minimal_build" \
-B "${build_dir}/examples/minimal_build" \
-DCMAKE_PREFIX_PATH="${CMAKE_PREFIX_PATH}"
cmake --build "${build_dir}/examples/minimal_build"
pushd "${source_dir}/examples/minimal_build"
# PATH= is for Windows.
PATH="${CMAKE_INSTALL_PREFIX:-${ARROW_HOME}}/bin:${PATH}" \
"${build_dir}/examples/minimal_build/arrow-example"
popd
fi
if [ "${ARROW_BUILD_EXAMPLES}" == "ON" ]; then
GH-46699: [CI][Dev] fix shellcheck errors in the ci/scripts/cpp_test.sh (#46700) ### Rationale for this change `ci/scripts/cpp_test.sh` violates two shellcheck rules. * SC2071: `< is for string comparisons. Use -lt instead.` * SC2086: `Double quote to prevent globbing and word splitting.` ``` ./ci/scripts/cpp_test.sh In ./ci/scripts/cpp_test.sh line 22: if [[ $# < 2 ]]; then ^-- SC2071 (error): < is for string comparisons. Use -lt instead. In ./ci/scripts/cpp_test.sh line 87: pushd ${build_dir} ^----------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: pushd "${build_dir}" In ./ci/scripts/cpp_test.sh line 103: --parallel ${n_jobs} \ ^-------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: --parallel "${n_jobs}" \ In ./ci/scripts/cpp_test.sh line 105: --timeout ${ARROW_CTEST_TIMEOUT:-300} \ ^-------------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: --timeout "${ARROW_CTEST_TIMEOUT:-300}" \ In ./ci/scripts/cpp_test.sh line 111: examples=$(find ${binary_output_dir} -executable -name "*example") ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: examples=$(find "${binary_output_dir}" -executable -name "*example") In ./ci/scripts/cpp_test.sh line 129: ${binary_output_dir}/arrow-ipc-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-stream/crash-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-stream/crash-* In ./ci/scripts/cpp_test.sh line 130: ${binary_output_dir}/arrow-ipc-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-stream/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-stream/*-testcase-* In ./ci/scripts/cpp_test.sh line 131: ${binary_output_dir}/arrow-ipc-file-fuzz ${ARROW_TEST_DATA}/arrow-ipc-file/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-file-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-file/*-testcase-* In ./ci/scripts/cpp_test.sh line 132: ${binary_output_dir}/arrow-ipc-tensor-stream-fuzz ${ARROW_TEST_DATA}/arrow-ipc-tensor-stream/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/arrow-ipc-tensor-stream-fuzz "${ARROW_TEST_DATA}"/arrow-ipc-tensor-stream/*-testcase-* In ./ci/scripts/cpp_test.sh line 134: ${binary_output_dir}/parquet-arrow-fuzz ${ARROW_TEST_DATA}/parquet/fuzzing/*-testcase-* ^------------------^ SC2086 (info): Double quote to prevent globbing and word splitting. ^----------------^ SC2086 (info): Double quote to prevent globbing and word splitting. Did you mean: "${binary_output_dir}"/parquet-arrow-fuzz "${ARROW_TEST_DATA}"/parquet/fuzzing/*-testcase-* For more information: https://www.shellcheck.net/wiki/SC2071 -- < is for string comparisons. Use ... https://www.shellcheck.net/wiki/SC2086 -- Double quote to prevent globbing ... ``` ### What changes are included in this PR? * Use `-lt` instead of `<` * Quote variables. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #46699 Authored-by: Hiroyuki Sato <hiroysato@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
2025-06-05 09:26:14 +09:00
examples=$(find "${binary_output_dir}" -executable -name "*example")
if [ "${examples}" == "" ]; then
echo "=================="
echo "No examples found!"
echo "=================="
exit 1
fi
for ex in ${examples}
do
echo "=================="
echo "Executing ${ex}"
echo "=================="
${ex}
done
fi
if [ "${ARROW_FUZZING}" == "ON" ]; then
# Fuzzing regression tests
GH-49329: [C++][Parquet][CI] Add fuzz target for encoder/decoder roundtrip (#49374) ### Rationale for this change Add a fuzz target that does a 3-way encoding roundtrip: 1. decodes its payload using a source encoding 2. re-encodes the values using a roundtrip encoding 3. re-decodes using the same roundtrip encoding, and checks that the values read are identical The motivation is to ensure that encoding and decoding can roundtrip for any valid data, while the full-file Parquet fuzzer only tests _reading_ the data. Furthermore, we postulate that a smaller workload that runs on a smaller search space like this might lead to a faster exploration of the search space (but we don't know for sure). The reason for the two encodings (source and roundtrip) is that not all encodings imply the same input structure (grammar, etc.) and using another source encoding (especially PLAIN, which has very little structure) might allow the fuzzer to explore the logical search space (i.e. values) in different ways. #### Parametrization Since we want to fuzz all supported (type, encoding) combinations and OSS-Fuzz does not support fuzzer parametrization, we have to use a trick: we prefix the fuzz payload with a fixed-size header that encodes those parameters. (while this may seem like we are reinventing the Parquet file format, it is a massively simpler format to decode, and the payload otherwise consists of a single range of encoded values: no Thrift metadata, no row groups, no columns, no multiple pages, no def/rep levels, no statistics, etc.) ### What changes are included in this PR? 1. Add a new fuzz target 2. Add a new fuzzing utility to generate the corresponding seed corpus 3. Add a test step on ASAN/ UBSAN CI job to run the fuzz targets on their respective seed corpuses 4. Add a Parquet API function to get the supported encodings for a given type 5. Add a Parquet helper API to dispatch a Parquet type to a generic callable (usually a lambda with an auto parameter) 6. Add a macro to pack a structure (and replace the previously existing macro from Flatbuffers) ### Are these changes tested? Yes, partly by CI and partly manually. ### Are there any user-facing changes? No. * GitHub Issue: #49329 Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
2026-03-03 15:56:20 +01:00
# This will display any errors generated during fuzzing. These errors are
# usually not bugs (most fuzz files are invalid and hence generate errors
# when trying to read them), which is why they are hidden by default when
# fuzzing.
export ARROW_FUZZING_VERBOSITY=1
GH-47012: [C++][Parquet] Reserve values correctly when reading BYTE_ARRAY and FLBA (#47013) ### Rationale for this change When reading a Parquet leaf column as Arrow, we [presize the Arrow builder](https://github.com/apache/arrow/blob/a0cc2d8ed35dce7ee6c3e7cbcc4867216a9ef16f/cpp/src/parquet/arrow/reader.cc#L487-L488) so as to avoid spurious reallocations during incremental Parquet decoding calls. However, the Reserve method on RecordReader will [only properly reserve values](https://github.com/apache/arrow/blob/a0cc2d8ed35dce7ee6c3e7cbcc4867216a9ef16f/cpp/src/parquet/column_reader.cc#L1693-L1696) for non-FLBA non-BYTE_ARRAY physical types. The result is that, on some of our micro-benchmarks, we spend a significant amount of time reallocating data on the ArrayBuilder. ### What changes are included in this PR? Properly reserve space on Array builders when reading Parquet data as Arrow. Note that, when reading into Binary or LargeBinary, this doesn't avoid reallocations for the actual data. However, for FixedSizeBinary and BinaryView, this is sufficient to avoid any reallocations. Benchmark numbers on my local machine (Ubuntu 24.04): ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Non-regressions: (250) ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- benchmark baseline contender change % counters BM_ReadColumnPlain<false,Float16LogicalType>/null_probability:-1 3.295 GiB/sec 7.834 GiB/sec 137.771 {'family_index': 10, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnPlain<false,Float16LogicalType>/null_probability:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 118} BM_ReadColumnByteStreamSplit<false,Float16LogicalType>/null_probability:-1 3.453 GiB/sec 8.148 GiB/sec 135.957 {'family_index': 12, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnByteStreamSplit<false,Float16LogicalType>/null_probability:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 119} BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:100 1.360 GiB/sec 1.780 GiB/sec 30.870 {'family_index': 13, 'per_family_instance_index': 4, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 49} BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:100 1.360 GiB/sec 1.780 GiB/sec 30.861 {'family_index': 11, 'per_family_instance_index': 4, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 49} BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:0 1.292 GiB/sec 1.662 GiB/sec 28.666 {'family_index': 13, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 47} BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:0 1.304 GiB/sec 1.665 GiB/sec 27.691 {'family_index': 11, 'per_family_instance_index': 0, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 46} BM_ReadBinaryViewColumn/null_probability:99/unique_values:32 959.085 MiB/sec 1.185 GiB/sec 26.568 {'family_index': 15, 'per_family_instance_index': 4, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:99/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9} BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:99 1.012 GiB/sec 1.210 GiB/sec 19.557 {'family_index': 13, 'per_family_instance_index': 3, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 36} BM_ReadBinaryViewColumnDeltaByteArray/null_probability:99/unique_values:-1 1.011 GiB/sec 1.187 GiB/sec 17.407 {'family_index': 17, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9} BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:99 1.024 GiB/sec 1.201 GiB/sec 17.206 {'family_index': 11, 'per_family_instance_index': 3, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 36} BM_ReadBinaryViewColumn/null_probability:99/unique_values:-1 1.023 GiB/sec 1.197 GiB/sec 17.016 {'family_index': 15, 'per_family_instance_index': 7, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9} BM_ReadBinaryColumn/null_probability:99/unique_values:32 541.347 MiB/sec 632.640 MiB/sec 16.864 {'family_index': 14, 'per_family_instance_index': 4, 'run_name': 'BM_ReadBinaryColumn/null_probability:99/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9} BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:1 954.762 MiB/sec 1.084 GiB/sec 16.272 {'family_index': 11, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 33} BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:1 970.997 MiB/sec 1.100 GiB/sec 15.969 {'family_index': 13, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 34} BM_ReadBinaryColumn/null_probability:99/unique_values:-1 592.605 MiB/sec 666.605 MiB/sec 12.487 {'family_index': 14, 'per_family_instance_index': 7, 'run_name': 'BM_ReadBinaryColumn/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10} BM_ReadBinaryColumnDeltaByteArray/null_probability:99/unique_values:-1 587.604 MiB/sec 659.154 MiB/sec 12.177 {'family_index': 16, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:99/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10} BM_ReadBinaryViewColumn/null_probability:50/unique_values:-1 867.001 MiB/sec 962.427 MiB/sec 11.006 {'family_index': 15, 'per_family_instance_index': 6, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4} BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:50 473.040 MiB/sec 522.948 MiB/sec 10.551 {'family_index': 11, 'per_family_instance_index': 2, 'run_name': 'BM_ReadColumnPlain<true,Float16LogicalType>/null_probability:50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17} BM_ReadBinaryViewColumn/null_probability:0/unique_values:-1 1.633 GiB/sec 1.800 GiB/sec 10.197 {'family_index': 15, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5} BM_ReadStructOfListColumn/50 466.944 MiB/sec 513.407 MiB/sec 9.951 {'family_index': 20, 'per_family_instance_index': 2, 'run_name': 'BM_ReadStructOfListColumn/50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 27} BM_ReadBinaryViewColumnDeltaByteArray/null_probability:50/unique_values:-1 894.649 MiB/sec 976.595 MiB/sec 9.160 {'family_index': 17, 'per_family_instance_index': 2, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4} BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:50 479.717 MiB/sec 523.293 MiB/sec 9.084 {'family_index': 13, 'per_family_instance_index': 2, 'run_name': 'BM_ReadColumnByteStreamSplit<true,Float16LogicalType>/null_probability:50', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17} BM_ReadBinaryColumn/null_probability:50/unique_values:-1 613.860 MiB/sec 667.963 MiB/sec 8.814 {'family_index': 14, 'per_family_instance_index': 6, 'run_name': 'BM_ReadBinaryColumn/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3} BM_ReadBinaryViewColumnDeltaByteArray/null_probability:1/unique_values:-1 1.479 GiB/sec 1.608 GiB/sec 8.761 {'family_index': 17, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4} BM_ReadBinaryViewColumnDeltaByteArray/null_probability:0/unique_values:-1 1.628 GiB/sec 1.762 GiB/sec 8.235 {'family_index': 17, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryViewColumnDeltaByteArray/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5} BM_ReadStructOfListColumn/0 760.221 MiB/sec 822.339 MiB/sec 8.171 {'family_index': 20, 'per_family_instance_index': 0, 'run_name': 'BM_ReadStructOfListColumn/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 47} BM_ReadBinaryViewColumn/null_probability:1/unique_values:32 843.826 MiB/sec 912.397 MiB/sec 8.126 {'family_index': 15, 'per_family_instance_index': 2, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:1/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3} BM_ReadBinaryViewColumn/null_probability:50/unique_values:32 699.538 MiB/sec 755.468 MiB/sec 7.995 {'family_index': 15, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:50/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3} BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024 3.724 GiB/sec 4.007 GiB/sec 7.597 {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 176027} BM_ReadBinaryViewColumn/null_probability:1/unique_values:-1 1.474 GiB/sec 1.586 GiB/sec 7.591 {'family_index': 15, 'per_family_instance_index': 5, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4} BM_ReadBinaryColumn/null_probability:0/unique_values:-1 1.114 GiB/sec 1.192 GiB/sec 7.005 {'family_index': 14, 'per_family_instance_index': 1, 'run_name': 'BM_ReadBinaryColumn/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3} BM_ReadBinaryColumn/null_probability:1/unique_values:-1 1.022 GiB/sec 1.091 GiB/sec 6.715 {'family_index': 14, 'per_family_instance_index': 5, 'run_name': 'BM_ReadBinaryColumn/null_probability:1/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4} BM_ReadBinaryColumnDeltaByteArray/null_probability:0/unique_values:-1 1.101 GiB/sec 1.174 GiB/sec 6.557 {'family_index': 16, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:0/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4} BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:5000 18.019 MiB/sec 19.100 MiB/sec 5.997 {'family_index': 33, 'per_family_instance_index': 14, 'run_name': 'BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:5000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 6295} BM_ReadBinaryViewColumn/null_probability:0/unique_values:32 893.151 MiB/sec 945.900 MiB/sec 5.906 {'family_index': 15, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryViewColumn/null_probability:0/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3} BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000 20.243 MiB/sec 21.404 MiB/sec 5.733 {'family_index': 33, 'per_family_instance_index': 10, 'run_name': 'BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7257} BM_ReadBinaryColumnDeltaByteArray/null_probability:50/unique_values:-1 620.583 MiB/sec 655.859 MiB/sec 5.684 {'family_index': 16, 'per_family_instance_index': 2, 'run_name': 'BM_ReadBinaryColumnDeltaByteArray/null_probability:50/unique_values:-1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3} BM_ReadBinaryColumn/null_probability:0/unique_values:32 751.375 MiB/sec 793.728 MiB/sec 5.637 {'family_index': 14, 'per_family_instance_index': 0, 'run_name': 'BM_ReadBinaryColumn/null_probability:0/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3} BM_ReadBinaryColumn/null_probability:50/unique_values:32 537.693 MiB/sec 567.159 MiB/sec 5.480 {'family_index': 14, 'per_family_instance_index': 3, 'run_name': 'BM_ReadBinaryColumn/null_probability:50/unique_values:32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3} BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:100 44.112 MiB/sec 46.474 MiB/sec 5.355 {'family_index': 33, 'per_family_instance_index': 6, 'run_name': 'BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15273} BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000 20.750 MiB/sec 21.843 MiB/sec 5.265 {'family_index': 30, 'per_family_instance_index': 10, 'run_name': 'BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7387} BM_ReadColumn<false,Int32Type>/-1/10 7.621 GiB/sec 8.019 GiB/sec 5.223 {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_ReadColumn<false,Int32Type>/-1/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 137} [ ... snip non-significant changes ... ] --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Regressions: (4) --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- benchmark baseline contender change % counters BM_ReadListColumn/99 1.452 GiB/sec 1.379 GiB/sec -5.006 {'family_index': 21, 'per_family_instance_index': 3, 'run_name': 'BM_ReadListColumn/99', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 129} BM_ArrowBinaryViewDict/DecodeArrowNonNull_Dense/1024 270.542 MiB/sec 256.345 MiB/sec -5.248 {'family_index': 27, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryViewDict/DecodeArrowNonNull_Dense/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 32060} BM_ArrowBinaryPlain/DecodeArrow_Dict/65536 172.371 MiB/sec 162.455 MiB/sec -5.753 {'family_index': 18, 'per_family_instance_index': 3, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrow_Dict/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 319} BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/1024 189.008 MiB/sec 176.900 MiB/sec -6.406 {'family_index': 19, 'per_family_instance_index': 0, 'run_name': 'BM_ArrowBinaryPlain/DecodeArrowNonNull_Dict/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 22292} ``` ### Are these changes tested? By existing tests. ### Are there any user-facing changes? No. * GitHub Issue: #47012 Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
2025-07-09 09:35:19 +02:00
# Some fuzz regression files may trigger huge memory allocations,
# let the allocator return null instead of aborting.
export ASAN_OPTIONS="$ASAN_OPTIONS allocator_may_return_null=1"
GH-49329: [C++][Parquet][CI] Add fuzz target for encoder/decoder roundtrip (#49374) ### Rationale for this change Add a fuzz target that does a 3-way encoding roundtrip: 1. decodes its payload using a source encoding 2. re-encodes the values using a roundtrip encoding 3. re-decodes using the same roundtrip encoding, and checks that the values read are identical The motivation is to ensure that encoding and decoding can roundtrip for any valid data, while the full-file Parquet fuzzer only tests _reading_ the data. Furthermore, we postulate that a smaller workload that runs on a smaller search space like this might lead to a faster exploration of the search space (but we don't know for sure). The reason for the two encodings (source and roundtrip) is that not all encodings imply the same input structure (grammar, etc.) and using another source encoding (especially PLAIN, which has very little structure) might allow the fuzzer to explore the logical search space (i.e. values) in different ways. #### Parametrization Since we want to fuzz all supported (type, encoding) combinations and OSS-Fuzz does not support fuzzer parametrization, we have to use a trick: we prefix the fuzz payload with a fixed-size header that encodes those parameters. (while this may seem like we are reinventing the Parquet file format, it is a massively simpler format to decode, and the payload otherwise consists of a single range of encoded values: no Thrift metadata, no row groups, no columns, no multiple pages, no def/rep levels, no statistics, etc.) ### What changes are included in this PR? 1. Add a new fuzz target 2. Add a new fuzzing utility to generate the corresponding seed corpus 3. Add a test step on ASAN/ UBSAN CI job to run the fuzz targets on their respective seed corpuses 4. Add a Parquet API function to get the supported encodings for a given type 5. Add a Parquet helper API to dispatch a Parquet type to a generic callable (usually a lambda with an auto parameter) 6. Add a macro to pack a structure (and replace the previously existing macro from Flatbuffers) ### Are these changes tested? Yes, partly by CI and partly manually. ### Are there any user-facing changes? No. * GitHub Issue: #49329 Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
2026-03-03 15:56:20 +01:00
# 1. Generate seed corpuses
# For IPC fuzz targets, these will include the golden IPC integration files.
GH-49329: [C++][Parquet][CI] Add fuzz target for encoder/decoder roundtrip (#49374) ### Rationale for this change Add a fuzz target that does a 3-way encoding roundtrip: 1. decodes its payload using a source encoding 2. re-encodes the values using a roundtrip encoding 3. re-decodes using the same roundtrip encoding, and checks that the values read are identical The motivation is to ensure that encoding and decoding can roundtrip for any valid data, while the full-file Parquet fuzzer only tests _reading_ the data. Furthermore, we postulate that a smaller workload that runs on a smaller search space like this might lead to a faster exploration of the search space (but we don't know for sure). The reason for the two encodings (source and roundtrip) is that not all encodings imply the same input structure (grammar, etc.) and using another source encoding (especially PLAIN, which has very little structure) might allow the fuzzer to explore the logical search space (i.e. values) in different ways. #### Parametrization Since we want to fuzz all supported (type, encoding) combinations and OSS-Fuzz does not support fuzzer parametrization, we have to use a trick: we prefix the fuzz payload with a fixed-size header that encodes those parameters. (while this may seem like we are reinventing the Parquet file format, it is a massively simpler format to decode, and the payload otherwise consists of a single range of encoded values: no Thrift metadata, no row groups, no columns, no multiple pages, no def/rep levels, no statistics, etc.) ### What changes are included in this PR? 1. Add a new fuzz target 2. Add a new fuzzing utility to generate the corresponding seed corpus 3. Add a test step on ASAN/ UBSAN CI job to run the fuzz targets on their respective seed corpuses 4. Add a Parquet API function to get the supported encodings for a given type 5. Add a Parquet helper API to dispatch a Parquet type to a generic callable (usually a lambda with an auto parameter) 6. Add a macro to pack a structure (and replace the previously existing macro from Flatbuffers) ### Are these changes tested? Yes, partly by CI and partly manually. ### Are there any user-facing changes? No. * GitHub Issue: #49329 Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
2026-03-03 15:56:20 +01:00
"${source_dir}/build-support/fuzzing/generate_corpuses.sh" "${binary_output_dir}"
# 2. Run fuzz targets on seed corpus entries
function run_fuzz_target_on_seed_corpus() {
fuzz_target_basename=$1
corpus_dir=${binary_output_dir}/${fuzz_target_basename}_seed_corpus
mkdir -p "${corpus_dir}"
pushd "${corpus_dir}"
unzip -q "${binary_output_dir}"/"${fuzz_target_basename}"_seed_corpus.zip -d .
"${binary_output_dir}"/"${fuzz_target_basename}" -rss_limit_mb=4000 ./*
popd
rm -rf "${corpus_dir}"
GH-49329: [C++][Parquet][CI] Add fuzz target for encoder/decoder roundtrip (#49374) ### Rationale for this change Add a fuzz target that does a 3-way encoding roundtrip: 1. decodes its payload using a source encoding 2. re-encodes the values using a roundtrip encoding 3. re-decodes using the same roundtrip encoding, and checks that the values read are identical The motivation is to ensure that encoding and decoding can roundtrip for any valid data, while the full-file Parquet fuzzer only tests _reading_ the data. Furthermore, we postulate that a smaller workload that runs on a smaller search space like this might lead to a faster exploration of the search space (but we don't know for sure). The reason for the two encodings (source and roundtrip) is that not all encodings imply the same input structure (grammar, etc.) and using another source encoding (especially PLAIN, which has very little structure) might allow the fuzzer to explore the logical search space (i.e. values) in different ways. #### Parametrization Since we want to fuzz all supported (type, encoding) combinations and OSS-Fuzz does not support fuzzer parametrization, we have to use a trick: we prefix the fuzz payload with a fixed-size header that encodes those parameters. (while this may seem like we are reinventing the Parquet file format, it is a massively simpler format to decode, and the payload otherwise consists of a single range of encoded values: no Thrift metadata, no row groups, no columns, no multiple pages, no def/rep levels, no statistics, etc.) ### What changes are included in this PR? 1. Add a new fuzz target 2. Add a new fuzzing utility to generate the corresponding seed corpus 3. Add a test step on ASAN/ UBSAN CI job to run the fuzz targets on their respective seed corpuses 4. Add a Parquet API function to get the supported encodings for a given type 5. Add a Parquet helper API to dispatch a Parquet type to a generic callable (usually a lambda with an auto parameter) 6. Add a macro to pack a structure (and replace the previously existing macro from Flatbuffers) ### Are these changes tested? Yes, partly by CI and partly manually. ### Are there any user-facing changes? No. * GitHub Issue: #49329 Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
2026-03-03 15:56:20 +01:00
}
run_fuzz_target_on_seed_corpus arrow-csv-fuzz
run_fuzz_target_on_seed_corpus arrow-ipc-file-fuzz
run_fuzz_target_on_seed_corpus arrow-ipc-stream-fuzz
run_fuzz_target_on_seed_corpus arrow-ipc-tensor-stream-fuzz
if [ "${ARROW_PARQUET}" == "ON" ]; then
run_fuzz_target_on_seed_corpus parquet-arrow-fuzz
run_fuzz_target_on_seed_corpus parquet-encoding-fuzz
fi
# 3. Run fuzz targets on regression files from arrow-testing
fuzz_target_options="-rss_limit_mb=2560" # same as on OSS-Fuzz
pushd "${ARROW_TEST_DATA}"
"${binary_output_dir}/arrow-ipc-stream-fuzz" ${fuzz_target_options} arrow-ipc-stream/crash-*
"${binary_output_dir}/arrow-ipc-stream-fuzz" ${fuzz_target_options} arrow-ipc-stream/*-testcase-*
"${binary_output_dir}/arrow-ipc-file-fuzz" ${fuzz_target_options} arrow-ipc-file/*-testcase-*
"${binary_output_dir}/arrow-ipc-tensor-stream-fuzz" ${fuzz_target_options} arrow-ipc-tensor-stream/*-testcase-*
if [ "${ARROW_PARQUET}" == "ON" ]; then
"${binary_output_dir}/parquet-arrow-fuzz" ${fuzz_target_options} parquet/fuzzing/*-testcase-*
"${binary_output_dir}/parquet-encoding-fuzz" ${fuzz_target_options} parquet/encoding-fuzzing/*-testcase-*
fi
"${binary_output_dir}/arrow-csv-fuzz" ${fuzz_target_options} csv/fuzzing/*-testcase-*
popd
fi
popd