Fix UUID support #2007

Fokko · 2025-05-16T08:59:55Z

Rationale for this change

The UUID support is a gift that keeps on giving. The current support of PyIceberg is incomplete, and problematic. Mostly because:

It is an extension-type in Arrow, which means it is not fully supported: Writing UUID using PyArrow does not set the UUID logical type on Parquet arrow#46469 Support for grouping in UUID columns arrow#46468
It doesn't have native support in Spark, where it is converted into a string. This limits the current tests, which are mostly Spark-based.

I think we have to wait for some fixes in Arrow upstream until we can fully support this. In PyIceberg, we're converting the fixed[16] to a UUID, but Spark does seem to error because the logical type annotation in Parquet is missing:

E                   py4j.protocol.Py4JJavaError: An error occurred while calling o72.collectToPython.
E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (localhost executor driver): java.lang.UnsupportedOperationException: Unsupported type: UTF8String
E                   	at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessor.getUTF8String(ArrowVectorAccessor.java:81)
E                   	at org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:143)
E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
E                   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
E                   	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
E                   	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
E                   	at org.apache.spark.scheduler.Task.run(Task.scala:141)
E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
E                   	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
E                   	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
E                   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
E                   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
E                   	at java.base/java.lang.Thread.run(Thread.java:829)
E                   
E                   Driver stacktrace:
E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
E                   	at scala.Option.foreach(Option.scala:407)
E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:989)
E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2393)
E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2414)
E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2433)
E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)
E                   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1049)
E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
E                   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
E                   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1048)
E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:448)
E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:4149)
E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4323)
E                   	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4321)
E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4321)
E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:4146)
E                   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
E                   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
E                   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
E                   	at py4j.Gateway.invoke(Gateway.java:282)
E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
E                   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
E                   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
E                   	at java.base/java.lang.Thread.run(Thread.java:829)
E                   Caused by: java.lang.UnsupportedOperationException: Unsupported type: UTF8String
E                   	at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessor.getUTF8String(ArrowVectorAccessor.java:81)
E                   	at org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:143)
E                   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
E                   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
E                   	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
E                   	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
E                   	at org.apache.spark.scheduler.Task.run(Task.scala:141)
E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
E                   	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
E                   	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
E                   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
E                   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
E                   	... 1 more

Are these changes tested?

Are there any user-facing changes?

Closes #1986
Closes #2002

simw · 2025-05-26T16:04:16Z

Following issue #1986 , I was about to make a smaller PR without the knowledge of the extra spark-related complications.

In case it's useful, the only extra thing I had that you haven't (yet) added is a small unit test in tests/io/test_pyarrow_visitor.py at roughly line 235:

def test_pyarrow_uuid_to_iceberg() -> None:
    pyarrow_type = pa.uuid()
    converted_iceberg_type = visit_pyarrow(pyarrow_type, _ConvertToIceberg())
    assert converted_iceberg_type == UUIDType()
    assert visit(converted_iceberg_type, _ConvertToArrowSchema()) == pa.uuid()

Fokko · 2025-06-14T21:37:13Z

Going down the rabbit hole, I'm able to reproduce this on the Java main branch:

While fixing some issues on the PyIceberg ends to fully support UUIDs: apache/iceberg-python#2007 I noticed this issue, and was suprised since UUID used to work with Spark, but it turns out that the dictionary encoded UUID was not implemented yet. For PyIceberg we only generate little data, so therefore this wasn't caught previously.

dingo4dev and others added 5 commits May 14, 2025 16:29

fix: correct UUIDType partition representation for BucketTransform

db445e1

align return type with Union

cea1f96

Merge branch 'apache:main' into uuid-partition-representation

be2d03f

run code linting and apply formatting

bedf777

Fix UUID support

2f0a5cb

Fokko marked this pull request as draft May 16, 2025 09:00

simw mentioned this pull request May 26, 2025

Error creating table from pyarrow schema with pa.uuid() #1986

Open

3 tasks

Fokko added 3 commits June 13, 2025 16:06

Merge branch 'main' of github.com:apache/iceberg-python into fd-uuid

33043f9

Make linter happy

5fa89e9

Merge branch 'main' of github.com:apache/iceberg-python into fd-uuid

80637b9

Fokko mentioned this pull request Jun 16, 2025

Spark: Cannot read or write UUID columns apache/iceberg#4581

Open

Fokko mentioned this pull request Jun 16, 2025

Spark: Support Parquet dictionary encoded UUIDs apache/iceberg#13324

Open

Merge branch 'main' of github.com:apache/iceberg-python into fd-uuid

da4c784

Fokko marked this pull request as ready for review June 16, 2025 21:52

Make CI happy

698ce85

kevinjqliu self-requested a review June 17, 2025 03:50

matthias-Q mentioned this pull request Jul 5, 2025

feat: add schema conversion from avro timestamp-millis #2173

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix UUID support #2007

Fix UUID support #2007

Uh oh!

Fokko commented May 16, 2025 •

edited

Loading

Uh oh!

simw commented May 26, 2025

Uh oh!

Fokko commented Jun 14, 2025

Uh oh!

Uh oh!

Fix UUID support #2007

Are you sure you want to change the base?

Fix UUID support #2007

Uh oh!

Conversation

Fokko commented May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

Are these changes tested?

Are there any user-facing changes?

Uh oh!

simw commented May 26, 2025

Uh oh!

Fokko commented Jun 14, 2025

Uh oh!

Uh oh!

Fokko commented May 16, 2025 •

edited

Loading