Skip to content

Get MIRI running against parquet crate #614

Open
@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The MIRI tool helps crates ensure they have no undefined behavior https://github.com/rust-lang/miri

To improve the stability / security of the parquet crate, it would be cool to run MIRI against that too.

Describe the solution you'd like

Ideally, we would run same kind of MIRI checks on the parquet crate as we do now on arrow: https://github.com/apache/arrow-rs/blob/master/.github/workflows/miri.yaml#L60

This means a command something like

cargo +nightly miri  test  -p parquet

Sadly, when I run this locally the very first test fails with an alignment issue

(arrow_dev) alamb@MacBook-Pro:~/Software/arrow-rs$ cargo +nightly miri  test  -p parquet
WARNING: Ignoring `RUSTC_WRAPPER` environment variable, Miri does not support wrapping.
   Compiling arrow v6.0.0-SNAPSHOT (/Users/alamb/Software/arrow-rs/arrow)
   Compiling parquet v6.0.0-SNAPSHOT (/Users/alamb/Software/arrow-rs/parquet)
    Finished test [unoptimized + debuginfo] target(s) in 4.53s
     Running unittests (target/miri/x86_64-apple-darwin/debug/deps/parquet-7056b5943fd0017c)

running 456 tests
test arrow::array_reader::tests::test_complex_array_reader_def_and_rep_levels ... error: Undefined Behavior: accessing memory with alignment 1, but alignment 4 is required
    --> parquet/src/util/bit_packing.rs:150:12
     |
150  |     *out = (*in_buf) % (1u32 << 2);
     |            ^^^^^^^^^ accessing memory with alignment 1, but alignment 4 is required
     |
     = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
     = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
             
     = note: inside `util::bit_packing::unpack2_32` at parquet/src/util/bit_packing.rs:150:12
note: inside `util::bit_packing::unpack32` at parquet/src/util/bit_packing.rs:37:14
    --> parquet/src/util/bit_packing.rs:37:14
     |
37   |         2 => unpack2_32(in_ptr, out_ptr),
     |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `util::bit_util::BitReader::get_batch::<i16>` at parquet/src/util/bit_util.rs:568:30
    --> parquet/src/util/bit_util.rs:568:30
     |
568  |                     in_ptr = unpack32(in_ptr, out_ptr, num_bits);
     |                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `encodings::rle::RleDecoder::get_batch::<i16>` at parquet/src/encodings/rle.rs:415:30
    --> parquet/src/encodings/rle.rs:415:30
     |
415  |                   num_values = bit_reader.get_batch::<T>(
     |  ______________________________^
416  | |                     &mut buffer[values_read..values_read + num_values],
417  | |                     self.bit_width as usize,
418  | |                 );
     | |_________________^
note: inside `encodings::levels::LevelDecoder::get` at parquet/src/encodings/levels.rs:262:35
    --> parquet/src/encodings/levels.rs:262:35
     |
262  |                 let values_read = decoder.get_batch::<i16>(&mut buffer[0..len])?;
     |                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `column::reader::ColumnReaderImpl::<data_type::ByteArrayType>::read_def_levels` at parquet/src/column/reader.rs:459:9
    --> parquet/src/column/reader.rs:459:9
     |
459  |         level_decoder.get(buffer)
     |         ^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `column::reader::ColumnReaderImpl::<data_type::ByteArrayType>::read_batch` at parquet/src/column/reader.rs:210:38
    --> parquet/src/column/reader.rs:210:38
     |
210  |                       num_def_levels = self.read_def_levels(
     |  ______________________________________^
211  | |                         &mut levels[levels_read..levels_read + iter_batch_size],
212  | |                     )?;
     | |_____________________^
note: inside `<arrow::array_reader::ComplexObjectArrayReader<data_type::ByteArrayType, arrow::converter::ArrayRefConverter<std::vec::Vec<std::option::Option<data_type::ByteArray>>, arrow::array::GenericStringArray<i32>, arrow::converter::Utf8ArrayConverter>> as arrow::array_reader::ArrayReader>::next_batch` at parquet/src/arrow/array_reader.rs:490:17
    --> parquet/src/arrow/array_reader.rs:490:17
     |
490  | /                 self.column_reader.as_mut().unwrap().read_batch(
491  | |                     num_to_read,
492  | |                     cur_def_levels_buf,
493  | |                     cur_rep_levels_buf,
494  | |                     cur_data_buf,
495  | |                 )?;
     | |_________________^
note: inside `arrow::array_reader::tests::test_complex_array_reader_def_and_rep_levels` at parquet/src/arrow/array_reader.rs:2229:21
    --> parquet/src/arrow/array_reader.rs:2229:21
     |
2229 |         let array = array_reader.next_batch(values_per_page / 2).unwrap();
     |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside closure at parquet/src/arrow/array_reader.rs:2154:5
    --> parquet/src/arrow/array_reader.rs:2154:5
     |
2153 |       #[test]
     |       ------- in this procedural macro expansion
2154 | /     fn test_complex_array_reader_def_and_rep_levels() {
2155 | |         // Construct column schema
2156 | |         let message_type = "
2157 | |         message test_schema {
...    |
2276 | |         );
2277 | |     }
     | |_____^
     = note: inside `<[closure@parquet/src/arrow/array_reader.rs:2154:5: 2277:6] as std::ops::FnOnce<()>>::call_once - shim` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:227:5
     = note: inside `<fn() as std::ops::FnOnce<()>>::call_once - shim(fn())` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:227:5
     = note: inside `test::__rust_begin_short_backtrace::<fn()>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/test/src/lib.rs:578:5
     = note: inside closure at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/test/src/lib.rs:569:30
     = note: inside `<[closure@test::run_test::{closure#2}] as std::ops::FnOnce<()>>::call_once - shim(vtable)` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:227:5
     = note: inside `<std::boxed::Box<dyn std::ops::FnOnce() + std::marker::Send> as std::ops::FnOnce<()>>::call_once` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/alloc/src/boxed.rs:1572:9
     = note: inside `<std::panic::AssertUnwindSafe<std::boxed::Box<dyn std::ops::FnOnce() + std::marker::Send>> as std::ops::FnOnce<()>>::call_once` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/panic.rs:347:9
     = note: inside `std::panicking::r#try::do_call::<std::panic::AssertUnwindSafe<std::boxed::Box<dyn std::ops::FnOnce() + std::marker::Send>>, ()>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/panicking.rs:401:40
     = note: inside `std::panicking::r#try::<(), std::panic::AssertUnwindSafe<std::boxed::Box<dyn std::ops::FnOnce() + std::marker::Send>>>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/panicking.rs:365:19
     = note: inside `std::panic::catch_unwind::<std::panic::AssertUnwindSafe<std::boxed::Box<dyn std::ops::FnOnce() + std::marker::Send>>, ()>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/panic.rs:434:14
     = note: inside `test::run_test_in_process` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/test/src/lib.rs:601:18
     = note: inside closure at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/test/src/lib.rs:493:39
     = note: inside `test::run_test::run_test_inner` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/test/src/lib.rs:531:13
     = note: inside `test::run_test` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/test/src/lib.rs:565:28
     = note: inside `test::run_tests::<[closure@test::run_tests_console::{closure#2}]>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/test/src/lib.rs:306:17
     = note: inside `test::run_tests_console` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/test/src/console.rs:290:5
     = note: inside `test::test_main` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/test/src/lib.rs:123:15
     = note: inside `test::test_main_static` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/test/src/lib.rs:142:5
     = note: inside `main`
     = note: inside `<fn() as std::ops::FnOnce<()>>::call_once - shim(fn())` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:227:5
     = note: inside `std::sys_common::backtrace::__rust_begin_short_backtrace::<fn(), ()>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:125:18
     = note: inside closure at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/rt.rs:63:18
     = note: inside `std::ops::function::impls::<impl std::ops::FnOnce<()> for &dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe>::call_once` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:259:13
     = note: inside `std::panicking::r#try::do_call::<&dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe, i32>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/panicking.rs:401:40
     = note: inside `std::panicking::r#try::<i32, &dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/panicking.rs:365:19
     = note: inside `std::panic::catch_unwind::<&dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe, i32>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/panic.rs:434:14
     = note: inside closure at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/rt.rs:45:48
     = note: inside `std::panicking::r#try::do_call::<[closure@std::rt::lang_start_internal::{closure#2}], isize>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/panicking.rs:401:40
     = note: inside `std::panicking::r#try::<isize, [closure@std::rt::lang_start_internal::{closure#2}]>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/panicking.rs:365:19
     = note: inside `std::panic::catch_unwind::<[closure@std::rt::lang_start_internal::{closure#2}], isize>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/panic.rs:434:14
     = note: inside `std::rt::lang_start_internal` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/rt.rs:45:20
     = note: inside `std::rt::lang_start::<()>` at /Users/alamb/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/std/src/rt.rs:62:5
     = note: this error originates in the attribute macro `test` (in Nightly builds, run with -Z macro-backtrace for more info)

Describe alternatives you've considered
The parquet2 crate in https://github.com/jorgecarleitao/parquet2 from @jorgecarleitao apparently already has parquet running under MIR this already running, so depending on if that gets integrated into the main arrow-rs repo, that might change how we go about getting a clean run

Additional context
Thanks to help from @roee88 , @jhorstmann and others who I probably missed, we now have MIRI checks validated for the arrow crate on each PR: https://github.com/apache/arrow-rs/runs/3154551440

Which runs a command such as:

  cargo miri test -p arrow -- --skip csv --skip ipc --skip json

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementAny new improvement worthy of a entry in the changelog

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions