ARROW-13806: [C++][Python] Add support for new MonthDayNano Interval Type #11302
cpp/src/arrow/python/CMakeLists.txt
Outdated
@@ -28,6 +28,7 @@ add_dependencies(arrow_python-all arrow_python arrow_python-tests)

set(ARROW_PYTHON_SRCS
    arrow_to_pandas.cc
    arrow_to_python.cc
Looks like you forgot to add this file?
added.
cpp/src/arrow/python/datetime.cc
Outdated
  return (PyObject*)&MonthDayNanoTupleType;
}

PyTypeObject* BorrowMonthDayNanoTupleType() { return &MonthDayNanoTupleType; }
This doesn't seem used?
yep, left over from a prior design. removed.
cpp/src/arrow/python/datetime.cc
Outdated
@@ -450,6 +475,19 @@ Result<std::string> TzinfoToString(PyObject* tzinfo) {
  return PyTZInfo_utcoffset_hhmm(tzinfo);
}

Result<PyObject*> MonthDayNanoIntervalToNamedTuple(
Since returning NULL signals an error, you don't need to wrap the return value in a Result<>.
done.
There are a few other places that return Result<PyObject*>. Is there any guidance on when to use that and when to use a plain PyObject*?
python/pyarrow/array.pxi
Outdated
is installed the objects will be
pd.tseries.offsets.DateOffset objects. Otherwise they are
pyarrow.MonthDayNanoTuple objects.
Same remark as for the corresponding scalar class. Also, it seems the docstring is a bit garbled?
Should be fixed now (once I push the latest set of commits).
    return Status::Invalid("Overflow on: ", (attr - 1)->name);
  }
  if (PyObject_HasAttrString(obj, attr->name)) {
    OwnedRef field_value(PyObject_GetAttrString(obj, attr->name));
If you want to avoid a double hash lookup, you can simply call PyObject_GetAttrString and catch the AttributeError on failure.
Good point. Done.
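The suggestion above concerns the Python C API, but the same idea can be sketched in plain Python: a hasattr check followed by getattr looks the attribute up twice, while a single getattr guarded by AttributeError looks it up once. The helper names below are illustrative only and do not appear in the PR.

```python
from datetime import timedelta


def read_field_two_lookups(obj, name):
    """Check for the attribute first, then fetch it (two lookups)."""
    if hasattr(obj, name):
        return True, getattr(obj, name)
    return False, None


def read_field_one_lookup(obj, name):
    """Fetch the attribute directly and treat AttributeError as 'absent'."""
    try:
        return True, getattr(obj, name)
    except AttributeError:
        return False, None


delta = timedelta(days=3, seconds=5)
print(read_field_one_lookup(delta, "days"))    # (True, 3)
print(read_field_one_lookup(delta, "months"))  # (False, None)
```

Both helpers return the same results; the second simply avoids the extra lookup when the attribute is present.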
  OwnedRef field_value(PyObject_GetAttrString(obj, attr->name));
  RETURN_IF_PYERROR();
  *found_attrs = true;
  if (field_value.obj() == Py_None) {
What does None mean in this context?
I think I mean null here.
@pitrou thanks for the feedback, I'll address the ones I haven't commented on. I think the most substantive one is whether return types should be different with or without Pandas. Happy to go with whatever you and @jorisvandenbossche think is best.
const bool has_nulls = arr.null_count() > 0;
for (int64_t i = 0; i < arr.length(); ++i) {
  if (has_nulls && arr.IsNull(i)) {
    Py_INCREF(Py_None);
    *out_values = Py_None;
  } else {
    RETURN_NOT_OK(write_func(arr.GetView(i), out_values));
  }
  ++out_values;
}
Is this has_nulls enabling some kind of compiler optimization? From a naive read it doesn't look like it is providing any value.
This was moved from the existing code base. I don't know, but I would prefer to handle the TODO above (use a Visitor) in general.
python/pyarrow/scalar.pxi
Outdated
def as_py(self):
    """
    Return this value as a Pandas DateOffset instance if Pandas is present
    otherwise as a named tuple containing months days and nanoseconds.
I'm going to agree with @pitrou. With pyarrow we even have cases where users aren't even really aware they are using pyarrow. I could see a circumstance where they install their application on some new machine or environment and suddenly start getting errors. Maybe someday we can add a use_pandas_types option to as_py, to_pylist, and to_pydict. That way we can error if the expected library is not available instead of silently changing the type.
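A minimal sketch of how the use_pandas_types idea floated above could behave, assuming an explicit opt-in flag. The function interval_as_py, its pandas_module parameter, and the MonthDayNano field names are hypothetical stand-ins for illustration; none of them is pyarrow API.

```python
import importlib.util
from collections import namedtuple

# Stand-in for the named tuple discussed in this PR; field names are assumed.
MonthDayNano = namedtuple("MonthDayNano", ["months", "days", "nanoseconds"])


def interval_as_py(months, days, nanoseconds, use_pandas_types=False,
                   pandas_module="pandas"):
    """Hypothetical conversion: explicit opt-in instead of a silent type switch."""
    if not use_pandas_types:
        # Default: always the same type, whether or not pandas is installed.
        return MonthDayNano(months, days, nanoseconds)
    if importlib.util.find_spec(pandas_module) is None:
        # The caller asked for pandas types, so fail loudly.
        raise ImportError(f"use_pandas_types=True requires {pandas_module!r}")
    pandas = importlib.import_module(pandas_module)
    return pandas.DateOffset(months=months, days=days, nanoseconds=nanoseconds)


print(interval_as_py(1, 2, 3))  # MonthDayNano(months=1, days=2, nanoseconds=3)
```

With this shape, moving an application to a machine without pandas either produces the same tuple type as before or raises a clear ImportError, rather than changing the result type silently.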
python/pyarrow/types.pxi
Outdated
def month_day_nano_interval():
    """
    Create instance of a interval representing the time between two calendar
    instances represented as a triple of months, days and nanoseconds.
    """
    return primitive_type(_Type_INTERVAL_MONTH_DAY_NANO)
Nit: I would avoid "the time between two calendar instances" because a duration/Joda-interval is traditionally defined as "the time between two instants". I haven't seen intervals / periods defined in this way before (possibly because calendars could change).
Also, instead of a triple of months (which might make a python user think 3-tuple) could we say "represented by values for months, days, and nanoseconds"? Or, if we want to be real precise, "represented by a signed 32 bit integer of months, a signed 32 bit integer of days, and a signed 64 bit integer of nanoseconds".
copied language from type.h
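To make the more precise wording suggested above concrete, the sketch below packs the three fields with Python's struct module as two signed 32-bit integers followed by a signed 64-bit integer. The field widths come from the review comment; the little-endian byte order and the helper names are assumptions for illustration, not taken from the PR.

```python
import struct

# Two signed 32-bit ints (months, days) then a signed 64-bit int (nanoseconds).
# "<" selects little-endian with no padding; the byte order is an assumption.
MONTH_DAY_NANO = struct.Struct("<iiq")


def pack_interval(months, days, nanoseconds):
    """Pack the fields; struct.error is raised if a value is out of range."""
    return MONTH_DAY_NANO.pack(months, days, nanoseconds)


def unpack_interval(raw):
    """Recover the (months, days, nanoseconds) triple from packed bytes."""
    return MONTH_DAY_NANO.unpack(raw)


raw = pack_interval(14, -3, 1_500_000_000)
print(len(raw))              # 16
print(unpack_interval(raw))  # (14, -3, 1500000000)
```

The fixed widths are why a months value such as 2**31 cannot be represented: packing it raises struct.error instead of wrapping around.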
@pitrou @westonpace thanks for the reviews. I think I've addressed all of the comments.
- Refactored ObjectWriter helpers from arrow_to_pandas, so they can be used for plain python types as well (generalized the lowest level so it can work on both PyObject** and an adapter for PyList).
- Add DateOffset to static pandas imports
- Tried to start laying out code in a way to use C++ for Array.to_pylist (feel free to comment).
Support importing from timeinterval, relativedelta and DateOffset types (this is actually mostly duck types, the one complication is that relativedelta has a property weeks that is automatically calculated, so some type checking is necessary).
Open questions:
- Should we be more strict on duck typing imports? I chose generalism over performance here (rechecking non-present attributes, etc)?
- Is the new arrow_to_python.h desirable (I think this can be easily extended for other types)?
- My python is rusty and Python C-API even more so, please don't assume I know exactly what I'm doing :)
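The duck-typed import described above, and the weeks complication in particular, can be sketched in plain Python. Everything here is hypothetical: FakeRelativeDelta only mimics an object whose weeks property is derived from days, and to_month_day_nano with its weeks_is_derived flag is an illustrative stand-in, not the conversion code in this PR.

```python
from datetime import timedelta

NANOS_PER_UNIT = {
    "hours": 3_600_000_000_000,
    "minutes": 60_000_000_000,
    "seconds": 1_000_000_000,
    "microseconds": 1_000,
    "nanoseconds": 1,
}


class FakeRelativeDelta:
    """Stand-in for a relativedelta-like object whose weeks is derived from days."""

    def __init__(self, months=0, days=0, hours=0):
        self.months, self.days, self.hours = months, days, hours

    @property
    def weeks(self):
        return int(self.days / 7)


def to_month_day_nano(obj, weeks_is_derived=False):
    """Sum whichever duck-typed fields are present into (months, days, nanos)."""
    months = getattr(obj, "years", 0) * 12 + getattr(obj, "months", 0)
    days = getattr(obj, "days", 0)
    if not weeks_is_derived:
        # Only count weeks when it is an independent field, else days double-count.
        days += getattr(obj, "weeks", 0) * 7
    nanos = sum(getattr(obj, name, 0) * scale
                for name, scale in NANOS_PER_UNIT.items())
    return months, days, nanos


print(to_month_day_nano(timedelta(days=1, seconds=2, microseconds=3)))
# (0, 1, 2000003000)
rd = FakeRelativeDelta(months=2, days=15, hours=1)
print(to_month_day_nano(rd, weeks_is_derived=True))  # (2, 15, 3600000000000)
print(to_month_day_nano(rd))                         # (2, 29, 3600000000000)
```

The last call shows the problem: treating the derived weeks value as an independent field counts 14 of the 15 days twice, which is why pure duck typing is not enough and the caller has to know which kind of object it is handling.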
Co-authored-by: Weston Pace <[email protected]>
7842d8b to 9845763
I find it rather unwelcome that this adds an abstraction ("ArrowToPython") that takes care of a single type. I'm not sure if that refactor is potentially useful, but we should have a single facility for conversion to Python objects, not a bunch of unrelated ones.
/// representation.
///
/// For instance timestamp would be translated to a integer representing an
// offset from the unix epoch.
Would it? I thought a timestamp would be converted to a datetime.
Scalars seem to have two notions: a property, value, which corresponds to the primitive, and "as_py", which does the conversion. Previously I had a ToLogical() method, which was meant to be the analogue of as_py, and I expect this method would be added back as part of ARROW-12976. I'm open to naming recommendations. ToLogical was removed because for this interval type the two end up being the same.
 public:
  /// \brief Converts the given Array to a PyList object. Returns NULL if there
  /// is an error converting the Array. The list elements are the same ones
  /// generated via ToLogical()
What is ToLogical?
}  // namespace

Result<PyObject*> ArrowToPython::ToPyList(const Array& array) {
  RETURN_NOT_OK(CheckInterval(*array.type()));
So these methods only work for month_day_nano_interval? That seems like a weird API choice.
The intent was to generalize them for other types (see TODO JIRAs in the header for supporting more types).
I see. I'd rather see a comprehensive refactor in another PR than a stub like this, which may end up abandoned.
I mentioned this elsewhere, but I was trying to prep the layout for ARROW-12976. If this is undesirable, I can make these simple functions for now.
Looks good! Added a few inline comments. And some additional non-inline comments:
- Can you also add it to test_types.py (eg adding to get_all_types() will automatically run some tests)
- For consistency with other types, we might also want to add a is_interval function in types.py
- Add it to test_type_for_alias in test_schema.py
- Add the type factory to docs in /python/api/datatypes.rst and arrays.rst
I can also push myself for those (trivial) changes if you want.
One other question (but doesn't need to be handled here):
- Should we allow creating an interval array from plain tuples instead of only from the MonthDayNano named tuple?
cpp/src/arrow/python/helpers.cc
Outdated
@@ -321,6 +323,14 @@ void InitPandasStaticData() {
    pandas_NA = ref.obj();
  }

  // Import DateOffset type
  OwnedRef offsets;
  if (internal::ImportModule("pandas.tseries.offsets", &offsets).ok()) {
It seems that DateOffset is available in the top-level pandas namespace, in which case I would import it from there.
done in next commit.
// Functions for converting between pandas's NumPy-based data representation
// and Arrow data structures
I think this comment is a left-over from copying another file?
yeah. removed on next commit.
@pitrou, is changing the signature and docs to be specific to MonthDayNanoInterval in this module OK for this PR, or do you have another suggestion on organization? Move it to datetime.h?
On Wednesday, October 6, 2021, Antoine Pitrou wrote:
In cpp/src/arrow/python/arrow_to_python.cc:
> + }
+
+ PyListAssigner& operator+=(int64_t offset) {
+ current_index_ += offset;
+ return *this;
+ }
+
+ private:
+ PyObject* list_;
+ int64_t current_index_ = 0;
+};
+
+} // namespace
+
+Result<PyObject*> ArrowToPython::ToPyList(const Array& array) {
+ RETURN_NOT_OK(CheckInterval(*array.type()));
I see. I'd rather see a comprehensive refactor in another PR than a stub
like this, which may end up abandoned.
@emkornfield It can live somewhere with the other Python helpers IMHO.
Co-authored-by: Joris Van den Bossche <[email protected]>
Consolidated in datetime.h/datetime.cc.
@jorisvandenbossche thanks for the feedback. Please see responses inline
It looks like this has been refactored. Please let me know if I did it correctly.
I think I've addressed all the feedback here, let me know if I've missed something.
Possibly, I think we can add this if users request it later. I don't think there is a need to do it here because tuples already get inferred as lists.
LGTM, a few more nits
  // Hack for python versions < 3.7 where members of PyStruct members
  // where non-const (C++ doesn't like assigning string literals to these types)
  return const_cast<char*>(st);
Do we still support python versions < 3.7? I thought we stopped shipping binary wheels for these versions but maybe we still support building from source.
We do. I asked that we didn't drop it last release, and I would also ask that we don't drop it this release (i.e. keep it through its full Python support cycle). There are a number of consumers of pyarrow that try to keep support for Python versions until those versions are dropped, which is the end of this year. This was caught because we run it in CI.
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
…Type
Closes apache#11302 from emkornfield/interval_python
Lead-authored-by: Micah Kornfield <[email protected]>
Co-authored-by: emkornfield <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>