apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics


GH-47973: [C++][Parquet] Fix invalid Parquet files written when dictionary encoded pages are large (#47998)

### Rationale for this change

Prevents silently writing invalid data when dictionary encoding is used and the estimated max buffer size, measured in bits, exceeds the max int32 value.

Also fixes an overflow that produced a "Negative buffer resize" error when the buffer size in bytes exceeds the max int32 value; a more helpful exception is now thrown instead.

### What changes are included in this PR?

* Fix an overflow when computing the bit position in `BitWriter::PutValue`. This overflow caused the method to return without writing data, and its return value is only checked in debug builds.
* Change buffer size calculations to use int64 and check for overflow before casting to int32.
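Both bullets follow the same pattern: do size and offset arithmetic in 64-bit integers, then range-check before narrowing to int32. A minimal sketch of that pattern (the helper names here are hypothetical, not the actual Arrow code):

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Compute a bit position in 64-bit arithmetic. Doing `byte_offset * 8`
// in int32 would wrap once the buffer exceeds INT32_MAX / 8 bytes
// (roughly 256 MiB), which is the class of bug the PR describes.
int64_t BitPosition(int64_t byte_offset, int bit_offset) {
  return byte_offset * 8 + bit_offset;
}

// Range-check an int64 size before narrowing to int32, throwing a clear
// exception instead of producing a negative value (the "Negative buffer
// resize" symptom mentioned in the rationale).
int32_t CheckedInt32Cast(int64_t size) {
  if (size < 0 || size > std::numeric_limits<int32_t>::max()) {
    throw std::overflow_error("buffer size exceeds int32 range");
  }
  return static_cast<int32_t>(size);
}
```

The key point is that the widening happens before the multiply, and the narrowing cast is guarded rather than silent.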

### Are these changes tested?

Yes, I've added unit tests for both issues. These require enabling `ARROW_LARGE_MEMORY_TESTS` as they allocate a lot of memory.

### Are there any user-facing changes?

**This PR contains a "Critical Fix".**

This fixes a bug where invalid Parquet files can be silently written when the buffer size for dictionary indices is large.

* GitHub Issue: #47973

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Adam Reeve committed
055c2f4e91c63593aacab38250ac9da899cabb31
Parent: 2e46c05
Committed by GitHub <noreply@github.com> on 10/31/2025, 8:08:36 AM