ARROW-13806: [C++][Python] Add support for new MonthDayNano Interval Type #11302

emkornfield · 2021-10-04T06:45:57Z

- Refactored ObjectWriter helpers from arrow_to_pandas so they can be used for plain Python types as well (generalized the lowest level so it can work on both PyObject** and an adapter for PyList).
- Add DateOffset to static pandas imports.
- Tried to start laying out code in a way to use C++ for Array.to_pylist (feel free to comment).

Support importing from timedelta, relativedelta and DateOffset types (this is mostly duck typing; the one complication is that relativedelta has a property weeks that is automatically calculated, so some type checking is necessary).
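
A minimal sketch of the duck-typed import described above, written in plain Python rather than the C++ used in the PR. The names `to_month_day_nano`, `_FIELDS`, `RelativeDeltaLike` and the `derived_weeks_types` parameter are hypothetical, and the list of probed attributes is an assumption; the sketch only shows why a computed `weeks` property forces a type check.

```python
from collections import namedtuple
from datetime import timedelta

MonthDayNano = namedtuple("MonthDayNano", ["months", "days", "nanoseconds"])

# (attribute name, target field, multiplier). Assumed attribute list.
_FIELDS = (
    ("years", "months", 12),
    ("months", "months", 1),
    ("weeks", "days", 7),
    ("days", "days", 1),
    ("hours", "nanoseconds", 3_600_000_000_000),
    ("minutes", "nanoseconds", 60_000_000_000),
    ("seconds", "nanoseconds", 1_000_000_000),
    ("microseconds", "nanoseconds", 1_000),
    ("nanoseconds", "nanoseconds", 1),
)


def to_month_day_nano(obj, derived_weeks_types=()):
    """Duck-typed conversion: read whichever known attributes exist on obj."""
    # Types whose `weeks` is computed from `days` must not count it twice.
    skip_weeks = isinstance(obj, derived_weeks_types)
    totals = {"months": 0, "days": 0, "nanoseconds": 0}
    found = False
    for attr, target, multiplier in _FIELDS:
        if attr == "weeks" and skip_weeks:
            continue
        value = getattr(obj, attr, None)
        if value is None:
            continue
        found = True
        totals[target] += int(value) * multiplier
    if not found:
        raise TypeError(f"no interval-like attributes on {type(obj).__name__}")
    return MonthDayNano(**totals)


class RelativeDeltaLike:
    """Hypothetical stand-in for a type with an automatically derived `weeks`."""

    def __init__(self, months=0, days=0):
        self.months = months
        self.days = days

    @property
    def weeks(self):
        return int(self.days / 7)
```

With `derived_weeks_types=(RelativeDeltaLike,)` a value of 1 month and 15 days converts to 15 days; without the type check the derived `weeks` would be added on top of `days`.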

Open questions:

- Should we be more strict on duck typing imports? I chose generality over performance here (rechecking non-present attributes, etc.).
- Is the new arrow_to_python.h desirable (I think this can be easily extended for other types)?
- My Python is rusty and Python C-API even more so, please don't assume I know exactly what I'm doing :)

github-actions · 2021-10-04T06:46:15Z

https://issues.apache.org/jira/browse/ARROW-13806

github-actions · 2021-10-04T06:46:16Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

emkornfield · 2021-10-04T06:46:43Z

CC @tswast @jorisvandenbossche

pitrou · 2021-10-04T16:02:24Z

cpp/src/arrow/python/CMakeLists.txt

@@ -28,6 +28,7 @@ add_dependencies(arrow_python-all arrow_python arrow_python-tests)

 set(ARROW_PYTHON_SRCS
    arrow_to_pandas.cc
+    arrow_to_python.cc


Looks like you forgot to add this file?

pitrou · 2021-10-04T16:02:59Z

cpp/src/arrow/python/datetime.cc

+  return (PyObject*)&MonthDayNanoTupleType;
+}
+
+PyTypeObject* BorrowMonthDayNanoTupleType() { return &MonthDayNanoTupleType; }


This doesn't seem used?

yep, left over from a prior design. removed.

pitrou · 2021-10-04T16:03:26Z

cpp/src/arrow/python/datetime.cc

@@ -450,6 +475,19 @@ Result<std::string> TzinfoToString(PyObject* tzinfo) {
  return PyTZInfo_utcoffset_hhmm(tzinfo);
 }

+Result<PyObject*> MonthDayNanoIntervalToNamedTuple(


Since returning NULL already signals an error, you don't need to wrap the return value in a Result<>.

There are a few other places that return Result<PyObject*>. Is there any guidance on when to use that and when to use a plain PyObject*?

pitrou · 2021-10-04T16:21:03Z

python/pyarrow/array.pxi

+
+        is installed the objects will be
+        pd.tseries.offsets.DateOffset objects.  Otherwise they are
+        pyarrow.MonthDayNanoTuple objects.


Same remark as for the corresponding scalar class. Also, it seems the docstring is a bit garbled?

Should be fixed now (once I push the latest set of commits).

pitrou · 2021-10-04T16:28:39Z

cpp/src/arrow/python/python_to_arrow.cc

+        return Status::Invalid("Overflow on: ", (attr - 1)->name);
+      }
+      if (PyObject_HasAttrString(obj, attr->name)) {
+        OwnedRef field_value(PyObject_GetAttrString(obj, attr->name));


If you want to avoid a double hash lookup, you can simply call PyObject_GetAttrString and catch the AttributeError on failure.

Good point. Done.

pitrou · 2021-10-04T16:29:55Z

cpp/src/arrow/python/python_to_arrow.cc

+        OwnedRef field_value(PyObject_GetAttrString(obj, attr->name));
+        RETURN_IF_PYERROR();
+        *found_attrs = true;
+        if (field_value.obj() == Py_None) {


What does None mean in this context?

I think I mean null here.

emkornfield · 2021-10-04T16:43:25Z

@pitrou thanks for the feedback, I'll address the ones I haven't commented on. I think the most substantive one is whether return types should be different with or without Pandas. Happy to go with whatever you and @jorisvandenbossche think is best.

westonpace · 2021-10-05T00:59:42Z

cpp/src/arrow/python/arrow_to_python.h

+  const bool has_nulls = arr.null_count() > 0;
+  for (int64_t i = 0; i < arr.length(); ++i) {
+    if (has_nulls && arr.IsNull(i)) {
+      Py_INCREF(Py_None);
+      *out_values = Py_None;
+    } else {
+      RETURN_NOT_OK(write_func(arr.GetView(i), out_values));
+    }
+    ++out_values;
+  }


Is this has_nulls enabling some kind of compiler optimization? From a naive read it doesn't look like it is providing any value.

This was moved from the existing code base. I don't know, but I would prefer to handle the TODO above (use a Visitor) in general.
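
For readers unfamiliar with the loop under discussion, here is a rough Python sketch of the same pattern. The function names and the `null_count` parameter are hypothetical; the only firm assumption is that Arrow validity bitmaps are least-significant-bit first. The `has_nulls` flag lets the loop skip the per-element bitmap check when the array has no nulls, which is the possible benefit being asked about.

```python
def is_valid(bitmap: bytes, i: int) -> bool:
    # Arrow validity bitmaps are LSB-first: bit i lives in byte i // 8.
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


def write_values(values, validity=None, null_count=0):
    """Copy values into a list, writing None wherever the slot is null."""
    has_nulls = null_count > 0
    out = []
    for i, value in enumerate(values):
        # The bitmap is only consulted when the array reports any nulls.
        if has_nulls and not is_valid(validity, i):
            out.append(None)
        else:
            out.append(value)
    return out
```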

westonpace · 2021-10-05T01:18:34Z

python/pyarrow/scalar.pxi

+    def as_py(self):
+        """
+        Return this value as a Pandas DateOffset instance if Pandas is present
+        otherwise as a named tuple containing months days and nanoseconds.


I'm going to agree with @pitrou . With pyarrow we even have cases where users aren't even really aware they are using pyarrow. I could see a circumstance where they install their application on some new machine or environment and suddenly start getting errors. Maybe someday we can add a use_pandas_types option to as_py, to_pylist, and to_pydict. That way we can error if the expected library is not available instead of silently changing the type.
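
A sketch of what such an option might look like. Everything here is hypothetical: `interval_as_py`, the `use_pandas_types` flag and the injectable `pandas_installed` check are illustrative names only, not an existing pyarrow API. The default path never depends on pandas, and the opt-in path fails loudly instead of silently changing the return type.

```python
import importlib.util
from collections import namedtuple

MonthDayNano = namedtuple("MonthDayNano", ["months", "days", "nanoseconds"])


def _pandas_installed():
    # find_spec returns None when the top-level module cannot be located.
    return importlib.util.find_spec("pandas") is not None


def interval_as_py(months, days, nanoseconds, use_pandas_types=False,
                   pandas_installed=_pandas_installed):
    """Hypothetical conversion with an explicit opt-in for pandas types."""
    if not use_pandas_types:
        # Same type on every machine, regardless of what is installed.
        return MonthDayNano(months, days, nanoseconds)
    if not pandas_installed():
        raise ImportError("use_pandas_types=True requires pandas to be installed")
    import pandas as pd

    return pd.DateOffset(months=months, days=days, nanoseconds=nanoseconds)
```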

westonpace · 2021-10-05T01:47:42Z

python/pyarrow/types.pxi

+def month_day_nano_interval():
+    """
+    Create instance of a interval representing the time between two calendar
+    instances represented as a triple of months, days and nanoseconds.
+    """
+    return primitive_type(_Type_INTERVAL_MONTH_DAY_NANO)


Nit: I would avoid "the time between two calendar instances" because a duration/Joda-interval is traditionally defined as "the time between two instants". I haven't seen intervals / periods defined in this way before (possibly because calendars could change).

Also, instead of a triple of months (which might make a python user think 3-tuple) could we say "represented by values for months, days, and nanoseconds"? Or, if we want to be real precise, "represented by a signed 32 bit integer of months, a signed 32 bit integer of days, and a signed 64 bit integer of nanoseconds".

copied language from type.h

emkornfield · 2021-10-05T05:55:29Z

@pitrou @westonpace thanks for the reviews. I think I've addressed all of the comments.

pitrou

I find it rather unwelcome that this adds an abstraction ("ArrowToPython") that takes care of a single type. I'm not sure if that refactor is potentially useful, but we should have a single facility for conversion to Python objects, not a bunch of unrelated ones.

pitrou · 2021-10-05T11:19:00Z

cpp/src/arrow/python/arrow_to_python.h

+  /// representation.
+  ///
+  /// For instance timestamp would be translated to a integer representing an
+  // offset from the unix epoch.


Would it? I thought a timestamp would be converted to a datetime.

Scalars seem to have two notions: a value property, which corresponds to the primitive, and "as_py", which does the conversion. Previously I had a ToLogical() method, which was meant to be the analogue of as_py, and I expect this method would be added back as part of ARROW-12976. I'm open to naming recommendations. ToLogical was removed because for this interval type the two end up being the same.

pitrou · 2021-10-05T11:19:14Z

cpp/src/arrow/python/arrow_to_python.h

+ public:
+  /// \brief Converts the given Array to a PyList object. Returns NULL if there
+  /// is an error converting the Array. The list elements are the same ones
+  /// generated via ToLogical()


What is ToLogical?

pitrou · 2021-10-05T11:23:42Z

cpp/src/arrow/python/arrow_to_python.cc

+}  // namespace
+
+Result<PyObject*> ArrowToPython::ToPyList(const Array& array) {
+  RETURN_NOT_OK(CheckInterval(*array.type()));


So these methods only work for month_day_nano_interval? That seems like a weird API choice.

The intent was to generalize them for other types (see TODO JIRAs in the header for supporting more types).

I see. I'd rather see a comprehensive refactor in another PR than a stub like this, which may end up abandoned.

emkornfield · 2021-10-05T23:31:32Z

> I find it rather unwelcome that this adds an abstraction ("ArrowToPython") that takes care of a single type. I'm not sure if that refactor is potentially useful, but we should have a single facility for conversion to Python objects, not a bunch of unrelated ones.

I mentioned this elsewhere, but I was trying to prep the layout for ARROW-12976. If this is undesirable, I can make these simple functions for now.

jorisvandenbossche

Looks good! Added a few inline comments. And some additional non-inline comments:

- Can you also add it to test_types.py (e.g. adding to get_all_types() will automatically run some tests)
- For consistency with other types, we might also want to add an is_interval function in types.py
- Add it to test_type_for_alias in test_schema.py
- Add the type factory to docs in /python/api/datatypes.rst and arrays.rst

I can also push myself for those (trivial) changes if you want.

One other question (but doesn't need to be handled here):

Should we allow creating an interval array from plain tuples instead of only from the MonthDayNano named tuple?

jorisvandenbossche · 2021-10-06T08:24:25Z

cpp/src/arrow/python/helpers.cc

@@ -321,6 +323,14 @@ void InitPandasStaticData() {
    pandas_NA = ref.obj();
  }

+  // Import DateOffset type
+  OwnedRef offsets;
+  if (internal::ImportModule("pandas.tseries.offsets", &offsets).ok()) {


It seems that DateOffset is available in the top-level pandas namespace, in which case I would import it from there.

done in next commit.

jorisvandenbossche · 2021-10-06T08:31:26Z

cpp/src/arrow/python/arrow_to_python_internal.h

+// Functions for converting between pandas's NumPy-based data representation
+// and Arrow data structures


I think this comment is a left-over from copying another file?

yeah. removed on next commit.

emkornfield · 2021-10-06T14:04:57Z

@pitrou is changing the signature and docs to be specific to MonthDayNanoInterval in this module OK for this PR or do you have another suggestion on organization? Move it to datetime.h?

pitrou · 2021-10-06T14:31:04Z

@emkornfield It can live somewhere with the other Python helpers IMHO.

emkornfield · 2021-10-06T22:42:10Z

> @emkornfield It can live somewhere with the other Python helpers IMHO.

Consolidated in datetime.h/datetime.cc.

emkornfield · 2021-10-06T22:45:19Z

@jorisvandenbossche thanks for the feedback. Please see responses inline

> Looks good! Added a few inline comments. And some additional non-inline comments:
>
> Can you also add it to test_types.py (eg adding to get_all_types() will automatically run some tests)

It looks like this has been refactored. Please let me know if I did it correctly.

> I can also push myself for those (trivial) changes if you want.

I think I've addressed all the feedback here, let me know if I've missed something.

> One other question (but doesn't need to be handled here):
>
> Should we allow creating an interval array from plain tuples instead of only from the MonthDayNano named tuple?

Possibly, I think we can add this if users request it later. I don't think there is a need to do it here because tuples already get inferred as lists.
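
To make the distinction concrete, here is a toy sketch of type inference in which only the named tuple maps to the interval type. The `infer_kind` function and its return strings are hypothetical and do not reflect pyarrow's actual inference code; the point is that a plain 3-tuple is indistinguishable from any other sequence.

```python
from collections import namedtuple

MonthDayNano = namedtuple("MonthDayNano", ["months", "days", "nanoseconds"])


def infer_kind(value):
    """Toy inference: the named tuple is checked before generic sequences."""
    if isinstance(value, MonthDayNano):
        return "month_day_nano_interval"
    if isinstance(value, (list, tuple)):
        # A plain (1, 2, 3) carries no hint that it is an interval.
        return "list"
    return "other"
```

The order of the checks matters: a named tuple is also a `tuple`, so testing for the generic sequence first would classify it as a list.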

westonpace

LGTM, a few more nits

westonpace · 2021-10-06T23:24:22Z

cpp/src/arrow/python/datetime.cc

+  // Hack for python versions < 3.7 where members of PyStruct members
+  // where non-const (C++ doesn't like assigning string literals to these types)
+  return const_cast<char*>(st);


Do we still support python versions < 3.7? I thought we stopped shipping binary wheels for these versions but maybe we still support building from source.

We do. I asked that we not drop it last release, and I would also ask that we don't drop it this release (i.e. keep it through its full Python support cycle). There are a number of consumers of pyarrow that try to keep support for Python versions until they are dropped upstream, which is the end of this year. This was caught because we run it in CI.

github-actions bot added Component: C++ Component: Python labels Oct 4, 2021

pitrou reviewed Oct 4, 2021

pitrou requested a review from jorisvandenbossche October 4, 2021 16:33

westonpace reviewed Oct 5, 2021

emkornfield marked this pull request as draft October 5, 2021 04:08

emkornfield marked this pull request as ready for review October 5, 2021 05:55

emkornfield requested review from westonpace and pitrou October 5, 2021 05:55

Micah Kornfield and others added 8 commits October 4, 2021 23:54

add missing files

f3f0273

Update python/pyarrow/scalar.pxi

47aa6f7

Co-authored-by: Weston Pace <[email protected]>

wip

faae971

address feedback

299d97b

last format/lint/anonymous namespace

096797b

add all the rest of apis to anonymous i python_to_arrow

67527dc

some ci fixes

9845763

emkornfield force-pushed the interval_python branch from 7842d8b to 9845763 Compare October 5, 2021 07:08

Micah Kornfield added 2 commits October 5, 2021 00:12

try to fix py 3.6

51fa76b

Add common casts

f5b2750

pitrou requested changes Oct 5, 2021

address more comments

a9ca3ed

emkornfield requested a review from pitrou October 5, 2021 23:31

try empty initializer

88a97db

rename ToPrimitive as ToPyObject. Update docs and remove unused variables

f1a6d15

jorisvandenbossche changed the title ~~ARROW-13806: [C++][Python] Add support for new Interval Type~~ ARROW-13806: [C++][Python] Add support for new MonthDayNano Interval Type Oct 6, 2021

jorisvandenbossche reviewed Oct 6, 2021

Micah Kornfield and others added 6 commits October 6, 2021 13:36

remove arrow_to_python for now

19d4072

Apply suggestions from Joris's code review

3cac885

Co-authored-by: Joris Van den Bossche <[email protected]>

address small comments

b4c4501

simplify

c640152

remove bad include

b5aadb1

address more feedback

c2aa56e

fix pandas test

0e47a98

westonpace reviewed Oct 6, 2021

emkornfield and others added 2 commits October 6, 2021 21:19

Apply suggestions from code review from Weston

3b2db24

Co-authored-by: Weston Pace <[email protected]>

Update python/pyarrow/types.pxi

5a5c73b

Co-authored-by: Weston Pace <[email protected]>

emkornfield requested a review from jorisvandenbossche October 7, 2021 05:01

Micah Kornfield and others added 2 commits October 7, 2021 00:18

fix lint

9467051

Nits

93108a9

pitrou approved these changes Oct 7, 2021

pitrou closed this in 415439c Oct 7, 2021

asfimport mentioned this pull request Oct 7, 2021

[Python] Add conversion to/from Pandas/Python for Month, Day Nano Interval Type #29432

Closed

jorisvandenbossche mentioned this pull request Mar 3, 2023

to_numpy().tolist() is significantlly faster than .tolist() #34354

Closed
