[Discussion] Efficient Row Selection for Multi-Engine Support #14816

<h2>Background</h2>We have a use case where data is stored in multiple engines/formats, with Parquet as the primary format containing all the data. Text queries are handled by an inverted index format, while numeric queries and aggregations are processed via Parquet files. Although the file formats differ, the data is sorted and stored in the same order across them.<br><br>We are using DataFusion to query the Parquet files and are wondering whether the result of a query can be represented as a bit set of document positions (example below). Bit sets from the different engines can be intersected to identify the documents that meet the criteria. The resulting bit set can then be used to fetch the relevant documents from Parquet.<br><br>Example:<br><br>Assume we have the following data stored in a Parquet file:<br>

colA | colB
-- | --
200 | Autumn leaves
200 | Salty breeze
100 | Misty mountains
100 | Misty mountains
200 | Velvet curtains

For example, assume we have a query like <code>SELECT colB WHERE colA = 100</code>.<br><br>The matching documents can be represented as the bit set 00110 (row positions are numbered from the left). We want to use the matching-document information collected from any underlying engine to fetch the relevant documents from the Parquet file using DataFusion.<br><h2>What we explored</h2>One way to fetch specific rows in DataFusion is to create an access plan and pass it to ParquetExec. However, because the complete plan is needed up front, we can't parallelize the work and start collecting data from Parquet until the plan has been built, which reduces overall query performance. It is also memory-inefficient, since we have to iterate over the complete stream of matches and convert it to the AccessPlan.<br><h2>Possible Solution</h2>If there were a way to:<br><ol><li style="list-style-type:decimal">Pass the iterator directly to DataFusion, or</li><li style="list-style-type:decimal">Process the matching rows in batches,</li></ol>then it would enable on-demand conversion from the matching-rows iterator to a RowSelection in DataFusion, improving efficiency by reducing memory overhead.<br><h2>Questions</h2><ol><li style="list-style-type:decimal">Are there existing mechanisms in DataFusion to handle external iterators or row sources?</li><li style="list-style-type:decimal">What are the best practices for integrating DataFusion with external data sources in a streaming or batched manner?</li><li style="list-style-type:decimal">Are there any plans or ongoing work in the DataFusion project that might address this use case?</li><li style="list-style-type:decimal">Are there any alternative approaches or design patterns that might help us achieve efficient row selection in our multi-engine implementation?</li></ol>
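
To make the batched option under "Possible Solution" more concrete, here is a minimal, standard-library-only sketch of what we have in mind. It intersects two per-row match streams lazily and run-length encodes the result into skip/select runs, one batch of rows at a time. Everything named here is an assumption for illustration: `Run` is a hypothetical local stand-in that mirrors the shape of the `parquet` crate's `RowSelector` (a row count plus a skip flag), `intersect` and `next_batch` are made-up helpers, and the text-engine matches are invented sample data. It does not call DataFusion; whether DataFusion can consume such batches incrementally is exactly the open question.

```rust
/// Hypothetical stand-in mirroring the shape of parquet's `RowSelector`:
/// a run of consecutive rows that is either skipped or selected.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Lazily intersect two per-row match streams (one per engine).
/// Both streams are assumed to cover the same rows in the same order.
fn intersect(
    a: impl Iterator<Item = bool>,
    b: impl Iterator<Item = bool>,
) -> impl Iterator<Item = bool> {
    a.zip(b).map(|(x, y)| x && y)
}

/// Pull at most `batch_rows` rows from `bits` and run-length encode them.
/// Returns `None` once the stream is exhausted, so only one batch of runs
/// is held in memory at a time.
fn next_batch(bits: &mut impl Iterator<Item = bool>, batch_rows: usize) -> Option<Vec<Run>> {
    let mut runs: Vec<Run> = Vec::new();
    for matched in bits.by_ref().take(batch_rows) {
        let skip = !matched;
        // Extend the previous run if it has the same skip/select state.
        if let Some(last) = runs.last_mut() {
            if last.skip == skip {
                last.row_count += 1;
                continue;
            }
        }
        runs.push(Run { row_count: 1, skip });
    }
    if runs.is_empty() {
        None
    } else {
        Some(runs)
    }
}

fn main() {
    // Matches for `colA = 100` on the example table: bit set 00110.
    let numeric = vec![false, false, true, true, false];
    // Invented matches from the text engine for the same five rows: 01110.
    let text = vec![false, true, true, true, false];

    let mut matches = intersect(numeric.into_iter(), text.into_iter());
    // Batch size 2 is arbitrary and only keeps the example small.
    while let Some(runs) = next_batch(&mut matches, 2) {
        println!("{:?}", runs);
    }
}
```

With the sample data above, the intersection is 00110, and a batch size of 2 yields three batches: skip 2 rows, select 2 rows, skip 1 row. In a real integration, each batch of runs would presumably be mapped to `RowSelector` values and attached to the relevant row group, but that step depends on the answers to the questions above.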
