
Writer Design #34

Closed

@ZENOTME

Description

This issue proposes a writer design to solve:

arrow: Writing unpartitioned data into iceberg from arrow record batches
arrow: Writing partitioned data into iceberg from arrow record batches

The design is based on what we do in icelake and is inspired by Java Iceberg. Feel free to offer any suggestions:

Class Design

SpecificFormatWriter

At the bottom level, we have a set of format-specific writers, each responsible for writing record batches into a file of a specific format, such as:

struct ParquetWriter {
    ...
}

struct AvroWriter {
    ...
}

struct OrcWriter {
    ...
}

/// Implement this trait for the writers above.
trait SpecificWriter {
    fn write(&mut self, batch: &RecordBatch) -> Result<()>;
}

1. Discussion: which formats do we plan to support in v0.2? I guess only Parquet?
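
To make the trait concrete, here is a minimal sketch of how ParquetWriter could implement SpecificWriter on top of the parquet crate's ArrowWriter (the inner field, the File sink, and the use of parquet's Result alias are illustrative assumptions, not part of the design):

use std::fs::File;

use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

struct ParquetWriter {
    // The parquet crate's ArrowWriter handles the actual encoding.
    inner: ArrowWriter<File>,
}

impl SpecificWriter for ParquetWriter {
    fn write(&mut self, batch: &RecordBatch) -> Result<()> {
        self.inner.write(batch)
    }
}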

DataFileWriter

A higher level of writer is the data file writer. It uses the SpecificWriter and splits the record batches into multiple files according to config such as file_size_limit. It looks like:

struct DataFileWriter {
    // See discussion 2 below for how this field should be typed.
    current_specific_writer: SpecificWriter,
}
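
A sketch of the rolling behavior, assuming a file_size_limit config field and hypothetical helpers current_written_size, close_current_file, and open_new_file that are not part of the design above:

impl DataFileWriter {
    fn write(&mut self, batch: &RecordBatch) -> Result<()> {
        // Roll over once the current file reaches the size limit.
        if self.current_written_size() >= self.file_size_limit {
            // Hypothetical helpers: flush the current file, record its
            // DataFile metadata, then open a fresh file.
            self.close_current_file()?;
            self.open_new_file()?;
        }
        self.current_specific_writer.write(batch)
    }
}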

2. Discussion: how should we treat the SpecificWriter type? Dispatch via an enum, or use a generic parameter?
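
For reference, the two options could look roughly like this (a sketch, not a recommendation):

// Option A: enum dispatch over a closed set of formats.
enum SpecificWriterEnum {
    Parquet(ParquetWriter),
    Avro(AvroWriter),
    Orc(OrcWriter),
}

impl SpecificWriterEnum {
    fn write(&mut self, batch: &RecordBatch) -> Result<()> {
        match self {
            Self::Parquet(w) => w.write(batch),
            Self::Avro(w) => w.write(batch),
            Self::Orc(w) => w.write(batch),
        }
    }
}

// Option B: generic parameter, monomorphized per format.
struct DataFileWriter<W: SpecificWriter> {
    current_specific_writer: W,
}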

PartitionWriter and UnpartitionWriter

The top level is the PartitionWriter and UnpartitionWriter. The UnpartitionWriter is just similar to the DataFileWriter. The PartitionWriter needs to split the record batch into groups according to partition; each group is then written by the DataFileWriter responsible for that partition. It looks like:

struct PartitionWriter {
    writers: HashMap<Partition, DataFileWriter>,
}
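
The write path then fans out per partition. In this sketch, split_by_partition is a hypothetical helper that applies the partition spec's transforms and groups rows by partition value; Partition is assumed to implement Eq and Hash, and DataFileWriter::new is likewise assumed:

impl PartitionWriter {
    fn write(&mut self, batch: &RecordBatch) -> Result<()> {
        for (partition, sub_batch) in split_by_partition(batch)? {
            // Lazily create one DataFileWriter per partition seen so far.
            self.writers
                .entry(partition)
                .or_insert_with(DataFileWriter::new)
                .write(&sub_batch)?;
        }
        Ok(())
    }
}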
