[Python] (PySpark) Support for subclasses in type_verifier #50726

ybhaw · 2025-04-26T18:07:19Z

What changes were proposed in this pull request?

The current implementation of _make_type_verifier does not support classes that extend the acceptable types. Here is a small test case that fails with the current implementation:

Sample test case that fails currently

import unittest
from pyspark.sql.types import StructType, _make_type_verifier

class ExtendedStructType(StructType): ...

class SampleTest(unittest.TestCase):
    def test_extended_struct_type(self):
        schema = ExtendedStructType([])
        _make_type_verifier(schema)([])

Failure logs

Failure
Traceback (most recent call last):
  File ".../spark/python/pyspark/sql/tests/test_types.py", line 3016, in test_extended_struct_type
    _make_type_verifier(schema)([])
  File ".../spark/python/pyspark/sql/types.py", line 2947, in verify
    verify_value(obj)
  File ".../spark/python/pyspark/sql/types.py", line 2878, in verify_struct
    assert_acceptable_types(obj)
  File ".../spark/python/pyspark/sql/types.py", line 2707, in assert_acceptable_types
    assert _type in _acceptable_types, new_msg(
AssertionError: unknown datatype: StructType([]) for object []

This happens because the current implementation uses type(data_type), which does not return StructType for classes that extend StructType. (ref)
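The difference can be reproduced without Spark. The following is a minimal sketch using stand-in classes: `BaseType`, `ExtendedType`, and the `ACCEPTABLE_TYPES` mapping are hypothetical names, not PySpark's. It only illustrates why an exact-class lookup rejects a subclass and is not the actual verifier code.

```python
# Stand-in classes for illustration only; these are not PySpark types.
class BaseType:
    pass


class ExtendedType(BaseType):
    pass


# Hypothetical lookup table keyed by exact class, similar in spirit to an
# "acceptable types" mapping.
ACCEPTABLE_TYPES = {BaseType: (list, tuple, dict)}

schema = ExtendedType()

# type() returns the most-derived class, so the exact-class lookup misses.
exact_match = type(schema) in ACCEPTABLE_TYPES

# isinstance() walks the inheritance chain, so the subclass is accepted.
subclass_match = any(isinstance(schema, t) for t in ACCEPTABLE_TYPES)

print(exact_match, subclass_match)  # False True
```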

Why are the changes needed?

Proposal: change the implementation to use isinstance() instead of type().

I believe inheritance should be allowed for DataTypes, as it enables users to add behavior, validation, or semantic meaning to them.

Example: my use case that is failing currently

I was trying to achieve this behavior:

class Schema(StructType):
    """Some implementation allowing class attributes as fields of StructType"""

class Person(Schema):
    name = StructField("name", StringType())

person = Person()  # Equivalent to StructType([StructField("name", StringType())])

# this was failing
df = spark.createDataFrame({...}, schema=Person())

If you fix a bug, you can clarify why it is a bug.

The current implementation only checks the behavior of a data type, yet by using type() it restricts inheritance. It can achieve the same with isinstance(). If inheritance is not desirable, then the types should perhaps be annotated with @final. In either case, I would consider the current behavior a bug.
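For reference, here is a minimal sketch of the @final alternative, using `typing.final` from the standard library on a stand-in class (`SealedType` and `Extended` are hypothetical names, not PySpark classes). Note that `typing.final` is a hint for static type checkers and is not enforced at runtime, so it would document the intent rather than prevent subclassing.

```python
from typing import final


@final
class SealedType:
    """Stand-in for a data type that should not be subclassed."""


# A static type checker such as mypy reports an error on the next class,
# but the interpreter itself still creates it without complaint.
class Extended(SealedType):  # type: ignore[misc]
    pass


runtime_allowed = issubclass(Extended, SealedType)
print(runtime_allowed)
```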

Does this PR introduce any user-facing change?

No. This PR does not change any existing user-facing behavior, but it allows users to extend DataTypes if they need to.

How was this patch tested?

There are already a few unit tests for _make_type_verifier (here) that test against the directly supported data types. I created a copy of those tests that checks against extended data types instead of the direct types.

Was this patch authored or co-authored using generative AI tooling?

No.

HyukjinKwon · 2025-04-28T23:11:25Z

Can we file a JIRA? See also https://spark.apache.org/contributing.html

ybhaw added 2 commits April 26, 2025 22:28

fix: support for subclasses in type_verifier

98cbb3d

fix: Added note to credit original tests

28241bf

github-actions bot added SQL PYTHON labels Apr 26, 2025

ybhaw added 3 commits April 26, 2025 23:42

feat: improved variable names

139abd6

feat: improved variable names

c283918

Fixed linting

0cd6e7a

ybhaw changed the title ~~fix: (PySpark) Support for subclasses in type_verifier~~ fix: (PySpark) Support for subclasses in type_verifier [WIP] Apr 26, 2025

ybhaw changed the title ~~fix: (PySpark) Support for subclasses in type_verifier [WIP]~~ [WIP] fix: (PySpark) Support for subclasses in type_verifier Apr 26, 2025

Fixed annotations

0a363f5

ybhaw changed the title ~~[WIP] fix: (PySpark) Support for subclasses in type_verifier~~ [Python] (PySpark) Support for subclasses in type_verifier Apr 28, 2025
