Description
Describe the bug, including details regarding any error messages, version, and platform.
The num_rows() method is inherited from ParquetFileWriter. It seems that the method is designed to count the rows of RowGroups that have been closed.
/// Number of rows in the yet started RowGroups.
///
/// Changes on the addition of a new RowGroup.
int64_t num_rows() const;
But in the implementation of class FileSerializer : public ParquetFileWriter::Contents, the num_rows_ variable is only modified in the Close method. This means that num_rows() always returns 0 before Close is called.
And after Close of ParquetFileWriter is called, we can no longer call num_rows(), because contents_ is reset in Close and any later call to num_rows() would dereference a null pointer.
So, in the current implementation of FileSerializer, num_rows() can only return 0 or fail on a null pointer dereference.
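To make the behavior concrete, here is a minimal standalone sketch that mimics the bookkeeping described above. It does not use the real Parquet classes: MockRowGroupWriter, MockSerializer and RowsSeenBeforeClose are hypothetical names introduced only for illustration, and the row counts are made up.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-in for RowGroupWriter: it only tracks a row count.
struct MockRowGroupWriter {
  int64_t rows = 0;
  int64_t num_rows() const { return rows; }
};

// Hypothetical stand-in for FileSerializer, mirroring the bookkeeping
// described in this issue (not the real Arrow code).
class MockSerializer {
 public:
  int64_t num_rows() const { return num_rows_; }

  MockRowGroupWriter* AppendRowGroup() {
    // The rows of the previous row group are NOT added to num_rows_ here.
    row_group_writer_ = std::make_unique<MockRowGroupWriter>();
    return row_group_writer_.get();
  }

  void Close() {
    if (row_group_writer_) {
      // The only place where num_rows_ is updated.
      num_rows_ += row_group_writer_->num_rows();
      row_group_writer_.reset();
    }
  }

 private:
  int64_t num_rows_ = 0;
  std::unique_ptr<MockRowGroupWriter> row_group_writer_;
};

// Writes two row groups (10 and 20 rows) and reads num_rows() before Close().
int64_t RowsSeenBeforeClose() {
  MockSerializer s;
  s.AppendRowGroup()->rows = 10;
  s.AppendRowGroup()->rows = 20;
  return s.num_rows();
}
```

In this sketch the counter is still 0 after two row groups have been appended, which matches the behavior reported for FileSerializer.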
int ParquetFileWriter::num_columns() const { return contents_->num_columns(); }

void ParquetFileWriter::Close() {
  if (contents_) {
    contents_->Close();
    file_metadata_ = contents_->metadata();
    contents_.reset();
  }
}

class FileSerializer : public ParquetFileWriter::Contents {
  int64_t num_rows() const override { return num_rows_; }

  void Close() override {
    if (is_open_) {
      if (row_group_writer_) {
        num_rows_ += row_group_writer_->num_rows();  // <- num_rows_ is only modified here and can never be read.
        ...
      }
      ...
    }
  }

  RowGroupWriter* AppendRowGroup(bool buffered_row_group) {
    if (row_group_writer_) {
      // <- A possible missing `num_rows_ += row_group_writer_->num_rows()` here.
      row_group_writer_->Close();
    }
    ...
  }
};
Should we mark the virtual method in ParquetFileWriter as deprecated and remove it in the next release? I guess a method that counts rows while row groups are being written is of little use, and our default implementation currently always returns 0. Otherwise, we can fix the issue by following the comments above.
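If the second option is preferred, the idea could look roughly like the standalone sketch below. This is only an assumption about the shape of a fix, not the actual Arrow code: SketchRowGroupWriter, SketchSerializer, FinishCurrentRowGroup and RowsAfterThirdAppend are hypothetical names, and the row counts are invented. The point is that the running total is updated whenever a row group is finished, both in AppendRowGroup and in Close.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-in for RowGroupWriter: it only tracks a row count.
struct SketchRowGroupWriter {
  int64_t rows = 0;
  int64_t num_rows() const { return rows; }
};

// Hypothetical serializer that accumulates rows each time a row group ends.
class SketchSerializer {
 public:
  int64_t num_rows() const { return num_rows_; }

  SketchRowGroupWriter* AppendRowGroup() {
    FinishCurrentRowGroup();  // count the previous row group before replacing it
    row_group_writer_ = std::make_unique<SketchRowGroupWriter>();
    return row_group_writer_.get();
  }

  void Close() { FinishCurrentRowGroup(); }

 private:
  // Adds the rows of the current row group (if any) to the running total.
  void FinishCurrentRowGroup() {
    if (row_group_writer_) {
      num_rows_ += row_group_writer_->num_rows();
      row_group_writer_.reset();
    }
  }

  int64_t num_rows_ = 0;
  std::unique_ptr<SketchRowGroupWriter> row_group_writer_;
};

// Appends three row groups (10, 20 and 30 rows) and reads num_rows() before
// Close(); only the two finished row groups are counted at that point.
int64_t RowsAfterThirdAppend() {
  SketchSerializer s;
  s.AppendRowGroup()->rows = 10;
  s.AppendRowGroup()->rows = 20;
  s.AppendRowGroup()->rows = 30;
  return s.num_rows();
}
```

With this bookkeeping, num_rows() reflects every row group that has already been finished, while the row group still being written is only counted once it is closed.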
Component(s)
C++