Description
This issue proposes the writer design to support:
- arrow: Writing unpartitioned data into iceberg from arrow record batches
- arrow: Writing partitioned data into iceberg from arrow record batches

The design is based on what we do in icelake and is inspired by Java Iceberg. Feel free to make any suggestions.
Class Design
SpecificFormatWriter
At the bottom level, we have a set of format-specific writers, each responsible for writing record batches into a file of a specific format, such as:
struct ParquetWriter {
    ...
}

struct AvroWriter {
    ...
}

struct OrcWriter {
    ...
}

/// Implement this trait for each of the writers above.
trait SpecificWriter {
    fn write(&mut self, batch: &RecordBatch) -> Result<()>;
}
1. Discussion: which formats do we plan to support in v0.2? I guess only Parquet?
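If we do start with Parquet only, a minimal sketch of ParquetWriter could delegate to the ArrowWriter from the parquet crate. This is only an illustration of the trait above, not a settled API; the file_writer field name and the use of parquet's Result are assumptions:

use std::fs::File;

use arrow_array::RecordBatch;
use parquet::arrow::arrow_writer::ArrowWriter;
use parquet::errors::Result;

// The trait from above, repeated so the sketch is self-contained.
trait SpecificWriter {
    fn write(&mut self, batch: &RecordBatch) -> Result<()>;
}

/// Writes record batches into a single Parquet file.
struct ParquetWriter {
    file_writer: ArrowWriter<File>,
}

impl SpecificWriter for ParquetWriter {
    fn write(&mut self, batch: &RecordBatch) -> Result<()> {
        // ArrowWriter buffers rows and flushes complete row groups itself.
        self.file_writer.write(batch)
    }
}

impl ParquetWriter {
    /// Finishes the file footer; required before the file is readable.
    fn close(self) -> Result<parquet::format::FileMetaData> {
        self.file_writer.close()
    }
}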
DataFileWriter
A higher-level writer is the data file writer. The data file writer uses the SpecificWriter and splits the record batches into multiple files according to config such as file_size_limit. It looks like:
struct DataFileWriter {
    // Shown as a trait object here; whether to use an enum or a generic
    // parameter instead is discussion 2 below.
    current_specific_writer: Box<dyn SpecificWriter>,
}
2. Discussion: how do we represent the SpecificWriter type here, dispatch through an enum or use a generic parameter?
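For comparison, the generic-parameter variant together with the size-based file splitting might look like the sketch below. The bytes_written tracking, the new_writer factory, and the error alias are all illustrative assumptions; in particular, in-memory batch size is only a rough proxy for on-disk file size:

use arrow_array::RecordBatch;

// Placeholder error type; the real design would define its own.
type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

trait SpecificWriter {
    fn write(&mut self, batch: &RecordBatch) -> Result<()>;
}

/// Splits the output across files, rolling over once the limit is hit.
struct DataFileWriter<W: SpecificWriter> {
    current_specific_writer: W,
    /// Approximate bytes written into the current file.
    bytes_written: u64,
    /// Target maximum size of a single data file.
    file_size_limit: u64,
    /// Opens a SpecificWriter for the next data file.
    new_writer: Box<dyn Fn() -> Result<W>>,
}

impl<W: SpecificWriter> DataFileWriter<W> {
    fn write(&mut self, batch: &RecordBatch) -> Result<()> {
        // Roll to a new file before the current one grows past the limit.
        // (Closing/finalizing the previous file is omitted for brevity.)
        if self.bytes_written >= self.file_size_limit {
            self.current_specific_writer = (self.new_writer)()?;
            self.bytes_written = 0;
        }
        self.current_specific_writer.write(batch)?;
        self.bytes_written += batch.get_array_memory_size() as u64;
        Ok(())
    }
}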
PartitionWriter and UnpartitionWriter
The top level is the PartitionWriter and the UnpartitionWriter. The UnpartitionWriter is just similar to the DataFileWriter. The PartitionWriter needs to split each record batch into groups according to the partition, and these groups will be written by the DataFileWriter responsible for each partition. It looks like:
struct PartitionWriter {
    writers: HashMap<Partition, DataFileWriter>,
}
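To make the fan-out concrete, here is a rough sketch of how PartitionWriter::write could group rows by partition and re-assemble per-partition batches with arrow's take kernel. The partition_value and writer_for helpers are hypothetical names for logic the partition spec would provide, not proposed API:

use std::collections::HashMap;

use arrow::array::UInt64Array;
use arrow::compute::take;
use arrow::record_batch::RecordBatch;

type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

impl PartitionWriter {
    fn write(&mut self, batch: &RecordBatch) -> Result<()> {
        // 1. Compute each row's partition and group the row indices.
        let mut groups: HashMap<Partition, Vec<u64>> = HashMap::new();
        for row in 0..batch.num_rows() {
            let partition = self.partition_value(batch, row)?; // hypothetical helper
            groups.entry(partition).or_default().push(row as u64);
        }

        // 2. Gather each group's rows into its own batch and hand it to the
        //    DataFileWriter responsible for that partition.
        for (partition, rows) in groups {
            let indices = UInt64Array::from(rows);
            let columns = batch
                .columns()
                .iter()
                .map(|col| take(col.as_ref(), &indices, None))
                .collect::<std::result::Result<Vec<_>, _>>()?;
            let sub_batch = RecordBatch::try_new(batch.schema(), columns)?;
            self.writer_for(&partition)?.write(&sub_batch)?; // hypothetical helper
        }
        Ok(())
    }
}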