Skip to content

[SPARK-47933][PYTHON] Parent Column class for Spark Connect and Spark Classic #46155

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

Same as #46129 but for Column class.

Why are the changes needed?

Same as #46129

Does this PR introduce any user-facing change?

Same as #46129

How was this patch tested?

Manually tested, and CI should verify them.

Was this patch authored or co-authored using generative AI tooling?

No.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of pyspark.sql.classic?

@HyukjinKwon
Copy link
Member Author

Let me run this naming stuff separately (maybe sending an email to the dev mailing list) because Spark Classic has been used in other places lately in any event.

@dongjoon-hyun
Copy link
Member

Sure, please proceed it. We can discuss the naming separately, @HyukjinKwon .

@@ -15,7 +15,7 @@
# limitations under the License.
#
from pyspark.ml import functions as PyMLFunctions
from pyspark.sql.connect.column import Column
from pyspark.sql.column import Column
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should keep it as-is to be consistent with the classic functions.py if it changes to refer classic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is actually related to the type hints. We're using pyspark.sql.Column for type hints, and when we actually need to create the instance, we should use pyspark.sql.classic.column import Column or pyspark.sql.connect.column import Column .. it is also a bit messy. I should probably clean those up separately.

@@ -22,7 +22,8 @@
from pyspark.ml.common import _java2py, _py2java
from pyspark.ml.linalg import Matrix, Vector
from pyspark.ml.wrapper import JavaWrapper, _jvm
from pyspark.sql.column import Column, _to_seq
from pyspark.sql.column import Column
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this consistent with functions.py?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or import _to_seq later as the other files do?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is only used in Spark Classic so it should probably be fine. It is a bit messy and complicated .. I will try to refactor them separately.

@HyukjinKwon
Copy link
Member Author

Merged to master.

For naming, I will send an email soon.
For others, please leave the comments. I will followup.

HyukjinKwon added a commit that referenced this pull request Apr 24, 2024
…t `_with_origin`

### What changes were proposed in this pull request?

This PR is a followup of #46155 that adds check of `__name__` at `_with_origin`.

### Why are the changes needed?

It is possible for a callable instance without __name__ attribute or/and __module__ attribute to be wrapped. For example, functools.partial.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`./bin/pyspark`

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46198 from HyukjinKwon/SPARK-47933-followup2.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Apr 25, 2024
…n `pyspark-connect`

### What changes were proposed in this pull request?

This PR is a followup of #46155 that removes the reference of `_to_seq` that `pyspark-connect` package does not have.

### Why are the changes needed?

To recover the CI https://github.com/apache/spark/actions/runs/8821919392/job/24218893631

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released out yet.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46229 from HyukjinKwon/SPARK-47933-followuptmp.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
HyukjinKwon added a commit that referenced this pull request Apr 28, 2024
…` reference in `pyspark.ml.stat`

### What changes were proposed in this pull request?

This PR is a followup of #46155 that removes the reference of `_to_seq` that `pyspark-connect` package does not have.

### Why are the changes needed?

To recover the CI https://github.com/apache/spark/actions/runs/8861971303

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released out yet.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46262 from HyukjinKwon/SPARK-47933-followup4.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Apr 30, 2024
… classes

### What changes were proposed in this pull request?

This PR proposes to fix `pyspark-connect` build that is broken by #46155. It is similar with the followups such as #46229 and #46262 but make sure the tests pass within this PR.

### Why are the changes needed?

To recover the CI https://github.com/apache/spark/actions/runs/8884003967

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

CI in my own fork: https://github.com/HyukjinKwon/spark/actions/runs/8889698252/job/24408487227

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46294 from HyukjinKwon/SPARK-48052.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
… Classic

### What changes were proposed in this pull request?

Same as apache#46129 but for `Column` class.

### Why are the changes needed?

Same as apache#46129

### Does this PR introduce _any_ user-facing change?

Same as apache#46129

### How was this patch tested?

Manually tested, and CI should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46155 from HyukjinKwon/SPARK-47933.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…t `_with_origin`

### What changes were proposed in this pull request?

This PR is a followup of apache#46155 that adds check of `__name__` at `_with_origin`.

### Why are the changes needed?

It is possible for a callable instance without __name__ attribute or/and __module__ attribute to be wrapped. For example, functools.partial.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`./bin/pyspark`

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46198 from HyukjinKwon/SPARK-47933-followup2.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…n `pyspark-connect`

### What changes were proposed in this pull request?

This PR is a followup of apache#46155 that removes the reference of `_to_seq` that `pyspark-connect` package does not have.

### Why are the changes needed?

To recover the CI https://github.com/apache/spark/actions/runs/8821919392/job/24218893631

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released out yet.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46229 from HyukjinKwon/SPARK-47933-followuptmp.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…` reference in `pyspark.ml.stat`

### What changes were proposed in this pull request?

This PR is a followup of apache#46155 that removes the reference of `_to_seq` that `pyspark-connect` package does not have.

### Why are the changes needed?

To recover the CI https://github.com/apache/spark/actions/runs/8861971303

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released out yet.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46262 from HyukjinKwon/SPARK-47933-followup4.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
… classes

### What changes were proposed in this pull request?

This PR proposes to fix `pyspark-connect` build that is broken by apache#46155. It is similar with the followups such as apache#46229 and apache#46262 but make sure the tests pass within this PR.

### Why are the changes needed?

To recover the CI https://github.com/apache/spark/actions/runs/8884003967

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

CI in my own fork: https://github.com/HyukjinKwon/spark/actions/runs/8889698252/job/24408487227

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46294 from HyukjinKwon/SPARK-48052.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants