Skip to content

[C++][Parquet] num_rows method in FileSerializer returns wrong number. #47664

@HuaHuaY

Description

@HuaHuaY

Describe the bug, including details regarding any error messages, version, and platform.

nums_rows() method is inherited from ParquetFileWriter. It seems that the method is designed to count the number of rows of RowGroups which are closed.

/// Number of rows in the yet started RowGroups.
///
/// Changes on the addition of a new RowGroup.
int64_t num_rows() const;

But in the implementation of class FileSerializer : public ParquetFileWriter::Contents, the num_rows_ variable is only modified in Close method. It means that num_rows() method always returns 0 before Close is called.
And after Close of ParquetFileWriter is called we can no longer call nums_rows() method, because contents_ is reset in Close and we will get a null pointer exception if we call nums_rows().

So, in the current implementation of FileSerializer, num_rows() will only return 0 or throw exception.

int ParquetFileWriter::num_columns() const { return contents_->num_columns(); }

void ParquetFileWriter::Close() {
  if (contents_) {
    contents_->Close();
    file_metadata_ = contents_->metadata();
    contents_.reset();
  }
}


class FileSerializer : public ParquetFileWriter::Contents {
  int64_t num_rows() const override { return num_rows_; }

  void Close() override {
    if (is_open_) {
      if (row_group_writer_) {
        num_rows_ += row_group_writer_->num_rows();    // <- num_rows is only modified here and can never be read.
        ...
      }
    ...
    }
  }

  RowGroupWriter* AppendRowGroup(bool buffered_row_group) {
    if (row_group_writer_) {
      // <- A possible missing `num_rows_ += row_group_writer_->num_rows()` here.
      row_group_writer_->Close();
    }
    ...
  }
};

Should we mark the virtual method in ParquetFileWriter deprecated and remove the method in next release? I guess the method which is used to count the row number during writing row groups is useless and our default implementation always returns 0 now. Otherwise we can fix the issue by following the comments above.

Component(s)

C++

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions