The code modifies the `row_counts` array before returning the table, but if multiple tasks are running concurrently, the next task that starts executing the `_task_to_table` function will return `None` due to the check at lines 941 to 954 of `pyiceberg/io/pyarrow.py`.
So what I suppose happens is that the task that returned `None` is processed before the real task with the table content; indeed, the `completed_futures` list now contains only a task with `None`, which causes the code to return an empty table:
This is the content of the `completed_futures` & `tables` variables in the `project_table` fn:

```
>>>>> completed_futures SortedKeyList([<Future at 0x7fa7e4b97fd0 state=finished returned NoneType>], key=<function project_table.<locals>.<lambda> at 0x7fa90316c4a0>)
>>>>> tables []
```
And by modifying the loop with:
```python
# for consistent ordering, we need to maintain future order
futures_index = {f: i for i, f in enumerate(futures)}
completed_futures: SortedList[Future[pa.Table]] = SortedList(iterable=[], key=lambda f: futures_index[f])
for future in concurrent.futures.as_completed(futures):
    completed_futures.add(future)

    # stop early if limit is satisfied
    if limit is not None and sum(row_counts) >= limit:
        print('>>>>> ', limit, sum(row_counts), future.result())
        break
```
I got:

```
>>>>> 10000 10000 None
```
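The interleaving described above can be forced deterministically with a minimal stdlib sketch (illustrative only, not the pyiceberg code; the task names and events are hypothetical): task A publishes its count to the shared `row_counts` before its future completes, so task B observes the limit as already satisfied and returns `None`, and the collection loop exits holding only the `None` future.

```python
# Minimal sketch of the race (NOT the pyiceberg code): shared row_counts is
# updated by task A *before* its future completes, so the early-exit loop
# breaks after collecting only task B's None result.
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

limit = 100
row_counts = []                    # shared state mutated by the tasks (the bug)
a_counted = threading.Event()      # "task A has updated row_counts"
release_a = threading.Event()      # lets task A finish only after the loop exits

def task_a():
    row_counts.append(100)         # count published early...
    a_counted.set()
    release_a.wait(timeout=5)      # ...while this future is still running
    return ["real", "rows"]

def task_b():
    a_counted.wait(timeout=5)
    if sum(row_counts) >= limit:   # limit already "satisfied" by task A
        return None                # so this task bails out with no data
    return ["other", "rows"]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(task_a), pool.submit(task_b)]
    completed = []
    for future in as_completed(futures):
        completed.append(future)
        if sum(row_counts) >= limit:   # early exit: only task B has completed
            break
    release_a.set()                    # task A's data is never collected

tables = [f.result() for f in completed if f.result() is not None]
print(tables)   # [] -- the real rows were dropped
```

Here the events only pin down the ordering that can occur spontaneously under load: the count becomes visible to other tasks before the data that produced it is collectible.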
So to summarize the above.
There's a bug in `table.scan` related to `Future` execution when a limit is set. The bug involves the order in which the `Future`s return and the shared state `row_counts`.
When the executor runs multiple `Future`s, each `Future` checks the shared state `row_counts` before proceeding.
The bug occurs when one `Future` updates the shared state `row_counts` (L1021). Before this specific `Future` returns and completes (L1023), another `Future` checks the shared state `row_counts` (L953) and returns first (L954).
This leads to the correct `row_counts` but incorrect `completed_futures`, since the `Future` that returned is not the one that modified `row_counts` (L1112).
One potential solution is to have `_task_to_table` return its row count rather than modify a shared list.
The row counts can then be aggregated as the `Future`s complete.
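A hypothetical sketch of that fix (simplified names, not the pyiceberg API): each task returns its own count alongside its table instead of mutating a shared `row_counts` list, and aggregation happens single-threaded in the collection loop, so no future can observe another future's count before that future has actually returned.

```python
# Sketch of the proposed fix (hypothetical names, NOT the pyiceberg API):
# the count travels with each task's result; aggregation happens only in
# the single-threaded collection loop.
from concurrent.futures import ThreadPoolExecutor, as_completed

def scan_task(rows):
    # ...read files, apply filters, etc. (elided)...
    return rows, len(rows)          # return the count with the result

def collect(all_rows, limit):
    tables, total = [], 0
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(scan_task, rows) for rows in all_rows]
        for future in as_completed(futures):
            table, count = future.result()
            tables.append(table)
            total += count          # aggregation happens only here
            if limit is not None and total >= limit:
                break               # every counted row is in `tables`
    return tables

tables = collect([[1, 2, 3], [4, 5], [6]], limit=4)
```

With this shape the early-exit check can never see a count whose table has not already been appended, which removes the race without any locking.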
Apache Iceberg version
0.6.0 (latest release)
Please describe the bug 🐞
I'm facing a race condition when doing `table.scan` in my code. For some strange reason, the code exits before getting the final table. This is my code:

Which returns:
I think the problem happens here:
iceberg-python/pyiceberg/io/pyarrow.py
Lines 1021 to 1023 in 6989b92
The code modifies the `row_counts` array before returning the table, but if multiple tasks are running concurrently, the next task that starts executing the `_task_to_table` function will return `None` due to:

iceberg-python/pyiceberg/io/pyarrow.py

Lines 941 to 954 in 6989b92
I think it happens because the original task with the data is still processing the content of the table here:
iceberg-python/pyiceberg/io/pyarrow.py
Line 1023 in 6989b92
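The pattern those lines refer to can be boiled down to a simplified, hypothetical sketch (NOT the actual pyiceberg source): the shared `row_counts` is updated before the task's result is returned, so a concurrently running task sees the limit as satisfied while the data is still in flight.

```python
# Simplified, hypothetical sketch of the pattern (NOT the pyiceberg source):
# the shared row_counts is mutated *before* the result is returned.
def _task_to_table(rows, row_counts, limit):
    # early-exit check against shared state (the "Lines 941 to 954" check)
    if limit is not None and sum(row_counts) >= limit:
        return None
    row_counts.append(len(rows))   # shared mutation (the "Line 1021" update)
    return list(rows)              # table built after the count is visible

row_counts = []
t1 = _task_to_table([1, 2, 3], row_counts, limit=3)
t2 = _task_to_table([4, 5], row_counts, limit=3)   # sees limit satisfied
print(t1, t2)   # [1, 2, 3] None
```

Run sequentially this is harmless; run concurrently, the window between the `append` and the `return` is exactly where the race lives.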
So now, I suppose, what happens is that the task that returned `None` is processed before the real task with the table content; indeed, the `completed_futures` list now contains only a task with `None`, which causes the code to return an empty table: https://github.com/apache/iceberg-python/blob/6989b92c2d449beb9fe4817c64f619ea5bfc81dc/pyiceberg/io/pyarrow.py#L1111C1-L1116C18

The contents of the `completed_futures` & `tables` variables in the `project_table` fn, the modified loop, and the output I got are shown above.
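One detail worth spelling out, since it explains why `completed_futures` can hold only the `None`-returning task: `concurrent.futures.as_completed` yields futures in completion order, not submission order. A minimal stdlib demo (illustrative task names):

```python
# as_completed yields in *completion* order, not submission order -- so a
# fast task that bailed out with None is collected before a slow task that
# is still producing the real data.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow():
    time.sleep(0.2)
    return "slow"

def fast():
    return "fast"

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(slow), pool.submit(fast)]   # slow submitted first
    order = [f.result() for f in as_completed(futures)]

print(order)   # ['fast', 'slow'] -- completion order wins
```

This is why the early-exit loop needs either the `futures_index` re-sorting shown above or results that carry their own counts.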