Skip to content

[Python] (PySpark) Support for subclasses in type_verifier #50726

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

ybhaw
Copy link

@ybhaw ybhaw commented Apr 26, 2025

What changes were proposed in this pull request?

Current implementation of _type_verifier does not support classes extending the acceptable types. Here is a small test case for same that fails in current implementation:

Sample test case that fails currently
import unittest
from pyspark.sql.types import StructType, _make_type_verifier

class ExtendedStructType(StructType): ...

class SampleTest(unittest.TestCase):
    def test_extended_struct_type(self):
        schema = ExtendedStructType([])
        _make_type_verifier(schema)([])
Failure logs
Failure
Traceback (most recent call last):
  File ".../spark/python/pyspark/sql/tests/test_types.py", line 3016, in test_extended_struct_type
    _make_type_verifier(schema)([])
  File ".../spark/python/pyspark/sql/types.py", line 2947, in verify
    verify_value(obj)
  File ".../spark/python/pyspark/sql/types.py", line 2878, in verify_struct
    assert_acceptable_types(obj)
  File ".../spark/python/pyspark/sql/types.py", line 2707, in assert_acceptable_types
    assert _type in _acceptable_types, new_msg(
AssertionError: unknown datatype: StructType([]) for object []

This is happening due to current implementation using type(data_type) which does not return StructType for classes extending StructType. (ref)

Why are the changes needed?

proposal: Changing implementation to use isinstance() instead of type()

I believe inheritance should be allowed for DataTypes as it enables users to add behavior, validations or schematic meanings to them.

Example: my use case that is failing currently

I was trying to achieve this behavior:

class Schema(StructType):
   """Some implementation allowing class attributes as fields of StructType"""

class Person(Schema):
   name = StructField("name", StringType())

person = Person()  # Equivalent to StructType([StructField("name", StringType())])

# this was failing
df = spark.createDataFrame({...}, schema=Person())

If you fix a bug, you can clarify why it is a bug.

The current implementation only checks for behavior of a data type. By using type it restricts inheritance. It can achieve same by using isinstance too. IF inheritance is not desirable, then maybe types should be annotated with @final. But in either cases, I would consider it to be a bug.

Does this PR introduce any user-facing change?

No. This PR does not change any existing user facing behavior, but allows them to extend DataTypes if they need to.

How was this patch tested?

There are already few unit tests for _make_type_verifier (here) that test against the direct supported data types. Created a copy of those tests and instead of using the direct types, checked against extended datatypes.

Was this patch authored or co-authored using generative AI tooling?

No.

@ybhaw ybhaw changed the title fix: (PySpark) Support for subclasses in type_verifier fix: (PySpark) Support for subclasses in type_verifier [WIP] Apr 26, 2025
@ybhaw ybhaw changed the title fix: (PySpark) Support for subclasses in type_verifier [WIP] [WIP] fix: (PySpark) Support for subclasses in type_verifier Apr 26, 2025
@ybhaw ybhaw changed the title [WIP] fix: (PySpark) Support for subclasses in type_verifier [Python] (PySpark) Support for subclasses in type_verifier Apr 28, 2025
@HyukjinKwon
Copy link
Member

Can we file a JIRA? See also https://spark.apache.org/contributing.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants