Writer Design #34
Thanks @ZENOTME for raising this discussion. For the other points, LGTM.
I think making the writing process async would be of great value, because uploading the data to an object store is not CPU bound. For parquet there is actually a great AsyncArrowWriter.
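To make the idea concrete, here is a minimal sketch of my own (not from the original comment) of streaming record batches through the parquet crate's `AsyncArrowWriter`. The exact constructor signature varies across parquet crate versions (older releases take an extra buffer-size argument), and the output path is a placeholder.

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::AsyncArrowWriter;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1_i64, 2, 3])) as ArrayRef],
    )?;

    // tokio::fs::File implements AsyncWrite, so it can back the async writer.
    let file = tokio::fs::File::create("/tmp/example.parquet").await?;
    let mut writer = AsyncArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch).await?;

    // close() flushes the buffered row groups and returns the parquet
    // FileMetaData, which records per-column-chunk offsets.
    let _metadata = writer.close().await?;
    Ok(())
}
```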
Good suggestion. I investigated the metadata returned by the parquet writer and found that it can fulfill most of what DataFile needs.
The metadata only records the offset of each column chunk, though. I'm not sure whether we can use the first column chunk's offset as the row group offset.
I'm not an expert in parquet, but maybe we can calculate the row group offsets ourselves by accumulating the row group sizes.
According to the parquet format, a parquet file consists of a 4-byte magic number, the row groups, the footer metadata, the footer length, and a trailing magic number. So I think accumulating each row group's size can't work. 🤔
So the missing piece is the size of the metadata, right? It's a bit of a weird approach, but we could write the metadata ourselves.
It's a bit weird, but I think it's more efficient than a tracking wrapper. 🤣
I think the next thing is to decide on the FileIo interface, and then we can work on our SpecificFormatWriter (or call it FileWriter or FileAppender). Here is some info that may be useful for the FileIo interface design. As a suggestion, the interface may need functions that look like the following:
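The interface proposed in the original comment is not preserved here, so the following is only a hypothetical reconstruction of its general shape; the names `FileIo`, `FileWrite`, `new_output`, and `delete` are placeholders, and the `async_trait` and `bytes` crates are assumed.

```rust
use async_trait::async_trait;
use bytes::Bytes;

/// An async, append-only output stream backed by the underlying storage.
#[async_trait]
pub trait FileWrite: Send {
    async fn write(&mut self, data: Bytes) -> std::io::Result<()>;
    async fn close(&mut self) -> std::io::Result<()>;
}

/// Entry point for manipulating files in the underlying storage.
#[async_trait]
pub trait FileIo: Send + Sync {
    type Writer: FileWrite;

    /// Create a new file at `path` and return a writer for it.
    async fn new_output(&self, path: &str) -> std::io::Result<Self::Writer>;
    /// Delete the file at `path`.
    async fn delete(&self, path: &str) -> std::io::Result<()>;
}
```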
cc @Xuanwo
Most of the writing tasks are handled by @liurenjie1024 and @ZENOTME, so I may not have sufficient experience on this particular topic. However, I will do my best to provide input from the OpenDAL perspective to assist with the design.

When we talk about the IO writer, there are three kinds of APIs we can use:
- PutObject: upload the whole object in a single request.
- MultipartUpload: upload the object in parts and complete it at the end, which suits streaming large files.
- AppendObject: append data to an existing object (only supported by some services).
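For illustration only (my addition, not part of the original comment), here is roughly how the one-shot and streaming paths look through opendal; the in-memory backend and paths are placeholders, and the writer API details have shifted slightly between opendal releases.

```rust
use opendal::{services, Operator};

async fn upload_examples() -> opendal::Result<()> {
    // In-memory backend just for illustration; swap in an object-store service.
    let op = Operator::new(services::Memory::default())?.finish();

    // One-shot write: maps to a single PutObject-style request.
    op.write("data/one_shot.parquet", vec![1u8, 2, 3]).await?;

    // Streaming write: maps to a MultipartUpload-style flow, so we can keep
    // appending parts and finish the object on close().
    let mut w = op.writer("data/streamed.parquet").await?;
    w.write(vec![1u8, 2, 3]).await?;
    w.write(vec![4u8, 5, 6]).await?;
    w.close().await?;
    Ok(())
}
```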
I suggest that:
Thanks for your suggestion!
Agreed. We can use an internal buffer that the avro writer writes into first, and then write it out through the async writer via https://docs.rs/tokio/1.29.1/tokio/io/trait.AsyncWriteExt.html.
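A minimal sketch of that idea (my own illustration, assuming the `apache-avro` crate and a throwaway schema): encode into an in-memory buffer with the synchronous avro writer, then flush the buffer through tokio's `AsyncWriteExt`.

```rust
use apache_avro::{types::Record, Schema, Writer};
use tokio::io::AsyncWriteExt;

async fn write_avro() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Schema::parse_str(
        r#"{"type": "record", "name": "row", "fields": [{"name": "id", "type": "long"}]}"#,
    )?;

    // The avro writer is synchronous, so let it encode into an in-memory buffer first.
    let mut writer = Writer::new(&schema, Vec::new());
    let mut record = Record::new(writer.schema()).expect("schema is a record");
    record.put("id", 1i64);
    writer.append(record)?;
    let encoded: Vec<u8> = writer.into_inner()?;

    // Then push the buffered bytes out through tokio's AsyncWriteExt.
    let mut file = tokio::fs::File::create("/tmp/example.avro").await?;
    file.write_all(&encoded).await?;
    file.shutdown().await?;
    Ok(())
}
```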
More detailed SpecificFormatWriter interface design

I think there is enough info now to propose a SpecificFormatWriter interface design (maybe we can call it FileWriter). If the following interface looks good, I will open a PR to make it work first, after we have DataFile.
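The interface from the original comment is not preserved here, so this is only a hypothetical reconstruction of the shape it might take; `FileWriter`, `DataFileMetadata`, and the use of `anyhow` and `async_trait` are my own placeholders.

```rust
use arrow_array::RecordBatch;
use async_trait::async_trait;

/// Whatever the writer must report back to build a DataFile entry
/// (file path, record count, file size, column metrics, ...).
pub struct DataFileMetadata {
    pub file_path: String,
    pub record_count: u64,
    pub file_size_in_bytes: u64,
}

#[async_trait]
pub trait FileWriter: Send {
    /// Append one record batch to the file being written.
    async fn write(&mut self, batch: &RecordBatch) -> anyhow::Result<()>;
    /// Finish the file and return the metadata needed to build a DataFile.
    async fn close(&mut self) -> anyhow::Result<DataFileMetadata>;
}
```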
Thanks @Xuanwo for the write-up. For me it would be okay to just start with that. And I strongly agree with your point on using the same interface for the parquet and the avro writer, even when it means we need to implement an async writer with an internal buffer for avro. Looking at the parquet async writer implementation, it doesn't look too bad. @ZENOTME the interface looks great to me. Thanks for the effort.
Thanks everyone for this nice discussion. For parquet file metadata, I've submitted a pr to fix it, so we will have the missing fields available there.

About the metrics, I think it's worth referring to iceberg java's design, which has several components:

FileIO: it provides several methods for manipulating files in the underlying storage, such as creating a new file for writing or deleting a file. I think we can provide a similar data structure, which can be implemented as a wrapper of an underlying library (opendal, etc.).

FileAppender: a file appender focuses on one file, and on the format of that file. We can have different file appenders for different formats, such as parquet, orc, and avro.

TaskWriter: a task writer is used by a task in a distributed computing framework, such as spark or flink. A task writer takes care of assigning input data to different partitions and calling the corresponding writers.

I think the above design is quite elegant, and I would suggest similar components here.
I found that the parquet writer will cast types automatically, which means that code like the following can work: the writer's schema is timestamp with time zone, but we can write into it using an int64 or float array. The cast logic casts the physical representation directly rather than doing a logical cast. In this example, the physical representation of timestamp is i64, so it just casts i64 and f32 into i64. I'm not sure whether this behaviour will cause potential bugs in the future, so I want to discuss it.
Thanks @ZENOTME for this interesting finding; I didn't even notice the behavior before. The schema of the parquet writer should be determined by the table schema, i.e. the schema of iceberg. Also, the input record batch's schema should match the table's schema. As for the question of whether we should do a runtime check during writing: I think it should be configurable, since it has a huge performance impact on the writer. It would be useful during the debugging and development phases, but should be turned off in production.
@ZENOTME are you intending to implement writer support through to completion yourself?
I have implemented writer support in icelake, so I'm glad to migrate it into iceberg-rust. I'm happy to take most of the work, but it doesn't need to be completed by me alone. I welcome any good suggestions and contributions.
@ZENOTME got it, thanks. We may be able to help here if you create some discrete sub-issues for the conversion.
We have finished the initial writer framework! We can close this issue now; the next step is to implement more writers. I will create the issues and track them separately.
This issue proposes the writer design. The design is based on what we do in icelake and is inspired by java iceberg; feel free to make any suggestions.
Class Design
SpecificFormatWriter
At the bottom level, we have the specific format writers, each responsible for writing record batches into a file of a specific format, such as parquet, avro, or orc.

1. Discussion: which formats do we plan to support in v0.2? I guess only parquet?
DataFileWriter
A higher-level writer is the data file writer. The data file writer uses the SpecificWriter and splits the record batches into multiple files according to configuration such as `file_size_limit`. A rough sketch is shown after the discussion item below.

2. Discussion: how do we treat the SpecificWriter type: dispatch via an enum or use a generic parameter?
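A minimal sketch of the rolling behaviour (my own illustration, not the icelake code), kept synchronous for brevity; `FormatWriter` is a placeholder standing in for the specific format writer, and `anyhow` is assumed for error handling. Using a generic parameter here is one answer to discussion point 2; an enum dispatcher would work just as well.

```rust
use arrow_array::RecordBatch;

pub trait FormatWriter {
    fn write(&mut self, batch: &RecordBatch) -> anyhow::Result<()>;
    fn written_bytes(&self) -> u64;
    /// Finish the current file and return its path.
    fn close(&mut self) -> anyhow::Result<String>;
}

pub struct DataFileWriter<W: FormatWriter, F: FnMut() -> W> {
    new_writer: F,        // opens a fresh file of the target format
    current: Option<W>,
    file_size_limit: u64, // e.g. 512 MiB per data file
    pub finished_files: Vec<String>,
}

impl<W: FormatWriter, F: FnMut() -> W> DataFileWriter<W, F> {
    pub fn write(&mut self, batch: &RecordBatch) -> anyhow::Result<()> {
        let writer = self.current.get_or_insert_with(&mut self.new_writer);
        writer.write(batch)?;
        // Roll over to a new file once the size limit is reached.
        if writer.written_bytes() >= self.file_size_limit {
            let mut full = self.current.take().expect("writer exists");
            self.finished_files.push(full.close()?);
        }
        Ok(())
    }
}
```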
PartitionWriter and UnpartitionWriter
The top level is the PartitionWriter and the UnpartitionWriter. The UnpartitionWriter is just similar to the DataFileWriter. The PartitionWriter needs to split the record batch into different groups according to the partition, and these record batches will be written using the DataFileWriter responsible for each partition. It looks roughly like the sketch below.
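A minimal sketch of the fan-out idea (my own illustration): the partition-key computation is a stub, and the `BatchWrite` trait and `anyhow` usage are placeholders; each partition value owns its own data file writer.

```rust
use std::collections::HashMap;

use arrow_array::RecordBatch;

/// Anything that accepts record batches for a single partition,
/// e.g. the DataFileWriter sketched above.
pub trait BatchWrite {
    fn write(&mut self, batch: &RecordBatch) -> anyhow::Result<()>;
}

pub struct PartitionWriter<W: BatchWrite, F: Fn(&str) -> W> {
    new_partition_writer: F,     // creates a data file writer for a new partition
    writers: HashMap<String, W>, // one writer per partition value
}

impl<W: BatchWrite, F: Fn(&str) -> W> PartitionWriter<W, F> {
    pub fn write(&mut self, batch: &RecordBatch) -> anyhow::Result<()> {
        let new_partition_writer = &self.new_partition_writer;
        for (partition, sub_batch) in split_by_partition(batch) {
            // Reuse the writer for this partition value, or create one lazily.
            let writer = self
                .writers
                .entry(partition.clone())
                .or_insert_with(|| new_partition_writer(&partition));
            writer.write(&sub_batch)?;
        }
        Ok(())
    }
}

// Stub: the real logic derives each row's partition value from the table's
// partition spec and splits the batch accordingly.
fn split_by_partition(batch: &RecordBatch) -> Vec<(String, RecordBatch)> {
    vec![("unpartitioned".to_string(), batch.clone())]
}
```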