[SPARK-39821][PYTHON][PS] Fix error during using DatetimeIndex #37232
Conversation
Pandas disallows conversion between datetime and timedelta, as well as conversion of any datetimelike to float. This raises an error in PySpark when we simply display a DatetimeIndex, so we need to avoid calling astype with a unit-less datetime64. BTW, pandas API on Spark announces that it won't support the DatetimeTZDtype type, so let's skip only the datetime64 type in the base __repr__ func in Index.
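For illustration, a minimal sketch of the failing pattern under pandas 2.x (plain pandas, no Spark involved):

import pandas as pd

ser = pd.Series(pd.to_datetime(["1970-01-01", "1970-01-02", "1970-01-03"]))

ser.astype("datetime64[ns]")  # OK: the unit is explicit
ser.astype("datetime64")      # pandas 1.3.x/1.4.x: OK; pandas >= 2.0: TypeError,
                              # "Casting to unit-less dtype 'datetime64' is not supported."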
Looks like there are other test cases that need to be fixed. This is what I see when testing on master without any change.
cc @zhengruifeng @xinrong-meng @itholic FYI
Can one of the admins verify this patch?
@bzhaoopenstack Thanks for reporting this! BTW, I cannot reproduce the case https://issues.apache.org/jira/browse/SPARK-39821 with the master branch. So should this fix target 3.2.x?
I also tested with Spark 3.3.0 and Python 3.9.12, and it's fine. Could you help figure out whether this repr issue exists only in Spark 3.2.x or Python 3.8.x?
Thanks Zheng. I tested with the Spark master and pandas master branches; that's all I know so far. I think that's why we are confused about this test result.
Is it dependent on the pandas version being used? See also https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
Hi, I tested with pandas 1.3.x and 1.4.x. It's true that everything is OK there and no error is raised. But with the pandas master branch, the error is still raised, and in my environment it goes down a different code path: pandas 1.3.x and 1.4.x are the good ones, while pandas main (master) raises. I will debug further to locate the root cause.
This is the associated commit from pandas upstream. Analyzing the history.
I have opened an issue with the pandas community. Let's wait for their response.
From the pandas community, it seems the new behavior is intended. Pandas only supports a specific set of casts when using astype with a DatetimeArray. We should adapt to this on the PySpark side for future pandas versions if we plan to upgrade pandas in PySpark.
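A quick sketch of which casts pandas 2.x still accepts on datetime data and which it now rejects (an illustration of the behavior described above, not an exhaustive list):

import pandas as pd

idx = pd.DatetimeIndex(["1970-01-01", "1970-01-02"])

idx.astype("datetime64[ns]")  # OK: explicit unit, same kind
idx.astype("int64")           # OK: epoch nanoseconds
idx.astype("datetime64")      # TypeError: unit-less dtype is rejected
idx.astype("float64")         # TypeError: datetimelike-to-float is disallowed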
qq: now that pandas 1.5.1 is released, is there any update related to this PR?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
With the release of pandas 2.0, I think this PR should be re-opened, right? I can recreate the issue originally described with Python 3.9.16 (main, May 3 2023, 09:54:39)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> pyspark.__version__
'3.4.0'
>>> import pandas
>>> pandas.__version__
'2.0.1'
>>> import pyspark.pandas as ps
>>> ps.DatetimeIndex(["1970-01-01", "1970-01-02", "1970-01-03"])
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/18 21:07:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/05/18 21:07:31 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/.local/lib/python3.9/site-packages/pyspark/pandas/indexes/base.py", line 2705, in __repr__
pindex = self._psdf._get_or_create_repr_pandas_cache(max_display_count).index
File "/home/ubuntu/.local/lib/python3.9/site-packages/pyspark/pandas/frame.py", line 13347, in _get_or_create_repr_pandas_cache
self, "_repr_pandas_cache", {n: self.head(n + 1)._to_internal_pandas()}
File "/home/ubuntu/.local/lib/python3.9/site-packages/pyspark/pandas/frame.py", line 13342, in _to_internal_pandas
return self._internal.to_pandas_frame
File "/home/ubuntu/.local/lib/python3.9/site-packages/pyspark/pandas/utils.py", line 588, in wrapped_lazy_property
setattr(self, attr_name, fn(self))
File "/home/ubuntu/.local/lib/python3.9/site-packages/pyspark/pandas/internal.py", line 1056, in to_pandas_frame
pdf = sdf.toPandas()
File "/home/ubuntu/.local/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py", line 251, in toPandas
if (t is not None and not all([is_timedelta64_dtype(t),is_datetime64_dtype(t)])) or should_check_timedelta:
File "/home/ubuntu/.local/lib/python3.9/site-packages/pandas/core/generic.py", line 6324, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 451, in astype
return self.apply(
File "/home/ubuntu/.local/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 352, in apply
applied = getattr(b, f)(**kwargs)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 511, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 242, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 184, in astype_array
values = values.astype(dtype, copy=copy)
File "/home/ubuntu/.local/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py", line 694, in astype
raise TypeError(
TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.
My pandas == 2.2.2 and pyspark == 3.4.3 also raise TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.
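For reference, a minimal sketch of the kind of guard such a fix could apply in PySpark's toPandas conversion path. The names safe_astype and _is_unitless_datetimelike are hypothetical, not existing PySpark API:

import numpy as np
import pandas as pd

def _is_unitless_datetimelike(dtype) -> bool:
    # Hypothetical helper: True for the unit-less 'datetime64'/'timedelta64'
    # dtypes that pandas >= 2.0 rejects as astype targets.
    try:
        dtype = np.dtype(dtype)
    except TypeError:
        return False  # extension dtypes (e.g. DatetimeTZDtype) are not affected
    return dtype.kind in "mM" and str(dtype) in ("datetime64", "timedelta64")

def safe_astype(series: pd.Series, dtype) -> pd.Series:
    # Skip the cast instead of letting pandas raise on a unit-less target dtype;
    # assumption: the values are already datetime64[ns], so nothing is lost.
    if _is_unitless_datetimelike(dtype):
        return series
    return series.astype(dtype, copy=False)

With a guard like this in the conversion path, the repr above would keep the already-correct datetime64[ns] values instead of attempting the rejected cast.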
What changes were proposed in this pull request?
Skip the datetime64 type when running astype for the pandas conversion in the repr function.
Why are the changes needed?
Improve the experience of Spark Python developers.
Does this PR introduce any user-facing change?
No
How was this patch tested?