refactor(writer): Refactor writers for the future partitioning writers #1657
/// Creates a new `RollingFileWriterBuilder` with the specified inner builder and target size.
impl<B, L, F> RollingWriter<B, L, F>
where
    B: IcebergWriterBuilder,
One thing to note is what IcebergWriterBuilder looks like:
#[async_trait::async_trait]
pub trait IcebergWriterBuilder<I = DefaultInput, O = DefaultOutput>:
    Send + Clone + 'static
{
    /// The associated writer type.
    type R: IcebergWriter<I, O>;
    /// Build the iceberg writer.
    async fn build(self) -> Result<Self::R>;
}
A writer such as the position delete writer takes a different input type, as in the following (see #704):
#[async_trait::async_trait]
impl<B: FileWriterBuilder> IcebergWriterBuilder<Vec<PositionDeleteInput>>
    for PositionDeleteWriterBuilder<B>
{
    type R = PositionDeleteWriter<B>;
    async fn build(self) -> Result<Self::R> {
        Ok(PositionDeleteWriter {
            inner_writer: Some(self.inner.build().await?),
            partition_value: self.partition_value.unwrap_or(Struct::empty()),
        })
    }
}
That is why the rolling writer was a FileWriter in the first place. After we adopt this design, how can we support something like
RollingWriter<PositionDeleteWriter>?
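To make the concern concrete, here is a simplified, synchronous sketch of how the default type parameters behave. It is not the crate's code: `Writer`, `WriterBuilder`, `DataBuilder`, `PosDeleteBuilder`, `PositionDelete`, and `write_with_default` are hypothetical stand-ins, and the input/output types are placeholders. The point is that a bare bound `B: WriterBuilder` means `B: WriterBuilder<DefaultInput, DefaultOutput>`, so a builder with a different input type does not satisfy it.

```rust
// Simplified stand-ins; every name in this block is hypothetical.
type DefaultInput = Vec<i64>; // placeholder for a record batch
type DefaultOutput = Vec<String>; // placeholder for Vec<DataFile>

struct PositionDelete {
    path: String,
    pos: u64,
}

trait Writer<I = DefaultInput, O = DefaultOutput> {
    fn write(&mut self, input: I);
    fn close(self) -> O;
}

trait WriterBuilder<I = DefaultInput, O = DefaultOutput>: Clone {
    type R: Writer<I, O>;
    fn build(self) -> Self::R;
}

#[derive(Clone)]
struct DataBuilder;
struct DataWriter {
    rows: usize,
}

// Uses both defaults: Writer<DefaultInput, DefaultOutput>.
impl Writer for DataWriter {
    fn write(&mut self, input: DefaultInput) {
        self.rows += input.len();
    }
    fn close(self) -> DefaultOutput {
        vec![format!("data-file rows={}", self.rows)]
    }
}

impl WriterBuilder for DataBuilder {
    type R = DataWriter;
    fn build(self) -> DataWriter {
        DataWriter { rows: 0 }
    }
}

#[derive(Clone)]
struct PosDeleteBuilder;
struct PosDeleteWriter {
    deletes: usize,
}

// Overrides the input type only; the output keeps its default.
impl Writer<Vec<PositionDelete>> for PosDeleteWriter {
    fn write(&mut self, input: Vec<PositionDelete>) {
        self.deletes += input.len();
    }
    fn close(self) -> DefaultOutput {
        vec![format!("delete-file rows={}", self.deletes)]
    }
}

impl WriterBuilder<Vec<PositionDelete>> for PosDeleteBuilder {
    type R = PosDeleteWriter;
    fn build(self) -> PosDeleteWriter {
        PosDeleteWriter { deletes: 0 }
    }
}

// `B: WriterBuilder` is shorthand for `B: WriterBuilder<DefaultInput, DefaultOutput>`.
// Passing `PosDeleteBuilder` here would be rejected by the compiler.
fn write_with_default<B: WriterBuilder>(builder: B, input: DefaultInput) -> DefaultOutput {
    let mut writer = builder.build();
    writer.write(input);
    writer.close()
}
```

A wrapper declared with the bare bound therefore only accepts default-input builders, which is the limitation raised above.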
I think your concern is valid. We may need to expose I and O in the RollingWriter as well; would that solve this problem?
pub struct RollingWriter<B, L, F, I, O>
where
    B: IcebergWriterBuilder<I, O>,
    L: LocationGenerator,
    F: FileNameGenerator,
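A rough, synchronous sketch of what exposing `I` and `O` could look like. The traits here are simplified stand-ins rather than the crate's definitions, the `L` and `F` generators are left out for brevity, and rolling by number of writes (`max_writes_per_file`) is an assumption made only to keep the example short.

```rust
use std::marker::PhantomData;

// Simplified stand-ins for IcebergWriter / IcebergWriterBuilder (hypothetical).
trait Writer<I, O> {
    fn write(&mut self, input: I);
    fn close(self) -> O;
}

trait WriterBuilder<I, O>: Clone {
    type R: Writer<I, O>;
    fn build(self) -> Self::R;
}

// Rolling wrapper that is generic over the inner writer's input and output.
struct RollingWriter<B, I, O>
where
    B: WriterBuilder<I, O>,
{
    builder: B,
    max_writes_per_file: usize,
    writes_in_current: usize,
    current: Option<<B as WriterBuilder<I, O>>::R>,
    outputs: Vec<O>,
    // Records I and O as part of the type without storing any values.
    _marker: PhantomData<fn(I) -> O>,
}

impl<B, I, O> RollingWriter<B, I, O>
where
    B: WriterBuilder<I, O>,
{
    fn new(builder: B, max_writes_per_file: usize) -> Self {
        Self {
            builder,
            max_writes_per_file,
            writes_in_current: 0,
            current: None,
            outputs: Vec::new(),
            _marker: PhantomData,
        }
    }

    fn write(&mut self, input: I) {
        // Roll over once the current writer has taken enough writes.
        if self.writes_in_current >= self.max_writes_per_file {
            self.roll();
        }
        if self.current.is_none() {
            self.current = Some(self.builder.clone().build());
        }
        if let Some(writer) = self.current.as_mut() {
            writer.write(input);
        }
        self.writes_in_current += 1;
    }

    // Close the current inner writer, if any, and keep its output.
    fn roll(&mut self) {
        if let Some(writer) = self.current.take() {
            self.outputs.push(writer.close());
        }
        self.writes_in_current = 0;
    }

    fn close(mut self) -> Vec<O> {
        self.roll();
        self.outputs
    }
}

// Example inner writer with a non-default input type (hypothetical).
#[derive(Clone)]
struct CountingBuilder;
struct CountingWriter {
    rows: usize,
}

impl Writer<Vec<(String, u64)>, usize> for CountingWriter {
    fn write(&mut self, input: Vec<(String, u64)>) {
        self.rows += input.len();
    }
    fn close(self) -> usize {
        self.rows
    }
}

impl WriterBuilder<Vec<(String, u64)>, usize> for CountingBuilder {
    type R = CountingWriter;
    fn build(self) -> CountingWriter {
        CountingWriter { rows: 0 }
    }
}
```

The cost of this approach is two extra type parameters on every use site of the rolling writer.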
Meanwhile, I've been wondering how useful the IcebergWriter abstraction is... If we split RollingWriter into RollingPositionalDeletesWriter and RollingXXXWriter and have them use concrete types, this would be a lot easier.
Meanwhile I've been wondering how useful is the abstraction of IcebergWriter
For example, a user may want to write a custom writer that tracks some metrics, like the following:
RollingWriter<TrackPositionalDeletesWriter>
I think custom writers can implement either FileWriter (lightweight, file-level customization) or PartitioningWriter (heavier, customization across multiple partitions).
In your example, the custom writer can implement FileWriter and be used like this:
RollingPositionalDeletesWriter<TrackWriter>
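As a minimal sketch of that idea (simplified and synchronous, with rolling itself omitted): the `FileWriter` trait below is a hypothetical stand-in for the crate's trait, and `MemoryFileWriter`, `PositionDelete`, and the `write_calls` metric are made up for illustration.

```rust
// Hypothetical, simplified file-level writer trait.
trait FileWriter {
    fn write(&mut self, bytes: &[u8]);
    /// Finish the file and return the number of bytes written.
    fn close(self) -> usize;
}

// In-memory writer standing in for a real file writer.
struct MemoryFileWriter {
    buf: Vec<u8>,
}

impl FileWriter for MemoryFileWriter {
    fn write(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }
    fn close(self) -> usize {
        self.buf.len()
    }
}

// Custom writer: wraps any FileWriter and counts write calls.
struct TrackWriter<W: FileWriter> {
    inner: W,
    write_calls: usize,
}

impl<W: FileWriter> FileWriter for TrackWriter<W> {
    fn write(&mut self, bytes: &[u8]) {
        self.write_calls += 1;
        self.inner.write(bytes);
    }
    fn close(self) -> usize {
        self.inner.close()
    }
}

struct PositionDelete {
    path: String,
    pos: u64,
}

// Concrete writer: the input type is fixed, only the file writer is generic.
struct RollingPositionalDeletesWriter<W: FileWriter> {
    file: W,
}

impl<W: FileWriter> RollingPositionalDeletesWriter<W> {
    fn write(&mut self, deletes: &[PositionDelete]) {
        for d in deletes {
            // One "path,pos" line per delete; the encoding is illustrative only.
            let line = format!("{},{}\n", d.path, d.pos);
            self.file.write(line.as_bytes());
        }
    }

    // Hand the file writer back so the caller can read its metrics.
    fn into_inner(self) -> W {
        self.file
    }
}
```

Here the customization lives entirely at the file level, so the outer writer needs no extra input or output type parameters.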
ad66fa5 to ac264fc
    /// Close all writers and return the data files.
    fn close(&mut self) -> Result<Vec<DataFile>>;
}

/// The builder for iceberg writer.
#[async_trait::async_trait]
pub trait IcebergWriterBuilder<I = DefaultInput, O = DefaultOutput>:
I believe we will also need to change the DefaultOutput for IcebergWriter from Vec<DataFile> to Vec<DataFileBuilder>, since IcebergWriter is no longer the outermost writer.
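A small sketch of why returning builders helps, under the assumption that an outer layer still has fields to fill in (here, a partition value) before a data file is complete. `DataFile`, `DataFileBuilder`, and `finalize` below are simplified, hand-written stand-ins, not the crate's types.

```rust
// Simplified stand-ins; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct DataFile {
    path: String,
    record_count: u64,
    partition: String,
}

#[derive(Debug, Clone, Default)]
struct DataFileBuilder {
    path: Option<String>,
    record_count: Option<u64>,
    partition: Option<String>,
}

impl DataFileBuilder {
    fn path(mut self, path: &str) -> Self {
        self.path = Some(path.to_string());
        self
    }

    fn record_count(mut self, n: u64) -> Self {
        self.record_count = Some(n);
        self
    }

    fn partition(mut self, partition: &str) -> Self {
        self.partition = Some(partition.to_string());
        self
    }

    // Fails if any required field is still missing.
    fn build(self) -> Result<DataFile, String> {
        Ok(DataFile {
            path: self.path.ok_or("path is required")?,
            record_count: self.record_count.ok_or("record_count is required")?,
            partition: self.partition.ok_or("partition is required")?,
        })
    }
}

// The outer writer completes the builders returned by the inner writer.
fn finalize(builders: Vec<DataFileBuilder>, partition: &str) -> Result<Vec<DataFile>, String> {
    builders
        .into_iter()
        .map(|b| b.partition(partition).build())
        .collect()
}
```

With this shape, the inner writer only reports what it knows (path, row count), and whichever writer is outermost decides when the data file is final.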
Which issue does this PR close?
What changes are included in this PR?
Refactored the writer layers; from a bird’s-eye view, the structure now looks like this:
Modified Writer Interfaces:
- FileWriterBuilder and IcebergWriterBuilder interfaces to accept an OutputFile parameter in their build methods
- PartitioningWriter trait in writer/mod.rs with methods for partitioning-aware writers (not implemented yet, we can use a separate PR to add this trait if needed)
Transformed RollingFileWriter to RollingWriter:
- RollingFileWriter to RollingWriter
- FileWriter to being a standalone writer that uses IcebergWriterBuilder
- IcebergWriterBuilder, LocationGenerator, and FileNameGenerator
Updated ParquetWriter:
- LocationGenerator and FileNameGenerator
- out_file to output_file for consistency
Updated DataFileWriter and EqualityDeleteWriter:
- OutputFile parameter to their inner writers
Updated DataFusion Integration:
- IcebergWriteExec to use the new RollingWriter directly instead of using builder patterns
TaskWriter -> PartitioningWriter -> RollingWriter -> ..., but TaskWriter and PartitioningWriter are not included in this draft so far
Are these changes tested?
Not yet, but updating the existing tests accordingly should be enough.
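For reference, the interface change listed under "Modified Writer Interfaces" (builders receive an OutputFile in build, while RollingWriter holds the LocationGenerator and FileNameGenerator) could be sketched roughly as follows. This is a simplified, synchronous illustration under assumptions: all trait signatures are stand-ins, and `TableLocation`, `Counter`, `RowCountBuilder`, and the row-based `target_size` are hypothetical.

```rust
// Simplified stand-ins; signatures are assumptions, not the crate's API.
#[derive(Debug, Clone, PartialEq)]
struct OutputFile {
    location: String,
}

trait LocationGenerator {
    fn generate_location(&self, file_name: &str) -> String;
}

trait FileNameGenerator {
    fn generate_file_name(&mut self) -> String;
}

trait FileWriter {
    fn write(&mut self, rows: usize);
    fn current_size(&self) -> usize;
    /// Finish the file and return its location.
    fn close(self) -> String;
}

trait WriterBuilder: Clone {
    type R: FileWriter;
    // The output file is chosen by the caller, not by the builder.
    fn build(self, output_file: OutputFile) -> Self::R;
}

struct RollingWriter<B: WriterBuilder, L: LocationGenerator, F: FileNameGenerator> {
    builder: B,
    location_gen: L,
    file_name_gen: F,
    target_size: usize,
    current: Option<B::R>,
    closed: Vec<String>,
}

impl<B: WriterBuilder, L: LocationGenerator, F: FileNameGenerator> RollingWriter<B, L, F> {
    fn new(builder: B, location_gen: L, file_name_gen: F, target_size: usize) -> Self {
        Self { builder, location_gen, file_name_gen, target_size, current: None, closed: Vec::new() }
    }

    fn write(&mut self, rows: usize) {
        let full = self
            .current
            .as_ref()
            .map_or(false, |w| w.current_size() >= self.target_size);
        if full {
            self.roll();
        }
        if self.current.is_none() {
            // The rolling writer decides where each new file goes.
            let name = self.file_name_gen.generate_file_name();
            let location = self.location_gen.generate_location(&name);
            self.current = Some(self.builder.clone().build(OutputFile { location }));
        }
        if let Some(w) = self.current.as_mut() {
            w.write(rows);
        }
    }

    fn roll(&mut self) {
        if let Some(w) = self.current.take() {
            self.closed.push(w.close());
        }
    }

    fn close(mut self) -> Vec<String> {
        self.roll();
        self.closed
    }
}

struct TableLocation {
    prefix: String,
}

impl LocationGenerator for TableLocation {
    fn generate_location(&self, file_name: &str) -> String {
        format!("{}/{}", self.prefix, file_name)
    }
}

struct Counter {
    next: u32,
}

impl FileNameGenerator for Counter {
    fn generate_file_name(&mut self) -> String {
        let name = format!("part-{}.parquet", self.next);
        self.next += 1;
        name
    }
}

#[derive(Clone)]
struct RowCountBuilder;
struct RowCountWriter {
    out: OutputFile,
    rows: usize,
}

impl FileWriter for RowCountWriter {
    fn write(&mut self, rows: usize) {
        self.rows += rows;
    }
    fn current_size(&self) -> usize {
        self.rows
    }
    fn close(self) -> String {
        self.out.location
    }
}

impl WriterBuilder for RowCountBuilder {
    type R = RowCountWriter;
    fn build(self, output_file: OutputFile) -> RowCountWriter {
        RowCountWriter { out: output_file, rows: 0 }
    }
}
```

In this arrangement the inner writers no longer need to know about location or file-name generation at all.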