Skip to content

ARROW-13806: [C++][Python] Add support for new MonthDayNano Interval Type #11302

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 24 commits into from

Conversation

emkornfield
Copy link
Contributor

  • Refactored ObjectWriter helpers from arrow_to_pandas, so they can be
    used for plain python types as well (generalized the lowest level so
    it can work on both PyObject** and an adapter for PyList.

  • Add DateOffset to static pandas imports

  • Tried to start laying out code in a way to use C++ for Array.to_pylist
    (feel free to comment).

Support importing from timeinterval, relativedelta and DateOffset types
(this is actually mostly duck types, the one complication is that
relativedelta has a property weeks that is automatically calculated, so
some type checking is necessary).

Open questions:

  • Should we be more strict on duck typing imports? I chose generalism
    over performance here (rechecking non-present attributes, etc)?
  • Is the new arrow_to_python.h desirable (I think this can be easily
    extended for other types)?
  • My python is rusty and Python C-API even more so, please don't assume
    I know exactly what I'm doing :)

@github-actions
Copy link

github-actions bot commented Oct 4, 2021

@github-actions
Copy link

github-actions bot commented Oct 4, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@emkornfield
Copy link
Contributor Author

CC @tswast @jorisvandenbossche

@@ -28,6 +28,7 @@ add_dependencies(arrow_python-all arrow_python arrow_python-tests)

set(ARROW_PYTHON_SRCS
arrow_to_pandas.cc
arrow_to_python.cc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you forgot to add this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

return (PyObject*)&MonthDayNanoTupleType;
}

PyTypeObject* BorrowMonthDayNanoTupleType() { return &MonthDayNanoTupleType; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, left over from a prior design. removed.

@@ -450,6 +475,19 @@ Result<std::string> TzinfoToString(PyObject* tzinfo) {
return PyTZInfo_utcoffset_hhmm(tzinfo);
}

Result<PyObject*> MonthDayNanoIntervalToNamedTuple(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since NULL is return an error, you don't need to wrap the return value in a Result<>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few other places that return Result<PyObject*> is there any guidance on when to use that and when to use pure PyObject*?


is installed the objects will be
pd.tseries.offsets.DateOffset objects. Otherwise they are
pyarrow.MonthDayNanoTuple objects.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark as for the corresponding scalar class. Also, it seems the docstring is a bit garbled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed now (once I push latest set of commits.

return Status::Invalid("Overflow on: ", (attr - 1)->name);
}
if (PyObject_HasAttrString(obj, attr->name)) {
OwnedRef field_value(PyObject_GetAttrString(obj, attr->name));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to avoid a double hash lookup, you can simply call PyObject_GetAttrString and catch the AttributeError on failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Done.

OwnedRef field_value(PyObject_GetAttrString(obj, attr->name));
RETURN_IF_PYERROR();
*found_attrs = true;
if (field_value.obj() == Py_None) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does None mean in this context?

Copy link
Contributor Author

@emkornfield emkornfield Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think I mean null here.

@emkornfield
Copy link
Contributor Author

@pitrou thanks for the feedback, I'll address the ones I haven't commented on. I think the most substantive one is whether return types should be different with or without Pandas. Happy to go with whatever you and @jorisvandenbossche think is best.

Comment on lines 90 to 99
const bool has_nulls = arr.null_count() > 0;
for (int64_t i = 0; i < arr.length(); ++i) {
if (has_nulls && arr.IsNull(i)) {
Py_INCREF(Py_None);
*out_values = Py_None;
} else {
RETURN_NOT_OK(write_func(arr.GetView(i), out_values));
}
++out_values;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this has_nulls enabling some kind of compiler optimization? From a naive read it doesn't look like it is providing any value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was moved from existing code base, I don't know but would prefer to handle the TODO above use Visitor in general.

def as_py(self):
"""
Return this value as a Pandas DateOffset instance if Pandas is present
otherwise as a named tuple containing months days and nanoseconds.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to agree with @pitrou . With pyarrow we even have cases where users aren't even really aware they are using pyarrow. I could see a circumstance where they install their application on some new machine or environment and suddenly start getting errors. Maybe someday we can add a use_pandas_types option to as_py, to_pylist, and to_pydict. That way we can error if the expected library is not available instead of silently changing the type.

Comment on lines 2157 to 2184
def month_day_nano_interval():
"""
Create instance of a interval representing the time between two calendar
instances represented as a triple of months, days and nanoseconds.
"""
return primitive_type(_Type_INTERVAL_MONTH_DAY_NANO)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I would avoid "the time between two calendar instances" because a duration/Joda-interval is traditionally defined as "the time between two instants". I haven't seen intervals / periods defined in this way before (possibly because calendars could change).

Also, instead of a triple of months (which might make a python user think 3-tuple) could we say "represented by values for months, days, and nanoseconds"? Or, if we want to be real precise, "represented by a signed 32 bit integer of months, a signed 32 bit integer of days, and a signed 64 bit integer of nanoseconds".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copied language from type.h

@emkornfield emkornfield marked this pull request as draft October 5, 2021 04:08
@emkornfield emkornfield marked this pull request as ready for review October 5, 2021 05:55
@emkornfield
Copy link
Contributor Author

@pitrou @westonpace thanks for the reviews I think I've addressed all of the comments.

Micah Kornfield and others added 8 commits October 4, 2021 23:54
- Refactored ObjectWriter helpers from arrow_to_pandas, so they can be
  used for plain python types as well (generalized the lowest level so
  it can work on both PyObject** and an adapter for PyList.

- Add DateOffset to static pandas imports

- Tried to start laying out code in a way to use C++ for Array.to_pylist
  (feel free to comment).

Support importing from timeinterval, relativedelta and DateOffset types
(this is actually mostly duck types, the one complication is that
 relativedelta has a property weeks that is automatically calculated, so
 some type checking is necessary).

Open questions:
 - Should we be more strict on duck typing imports? I chose generalism
   over performance here (rechecking non-present attributes, etc)?
 - Is the new arrow_to_python.h desirable (I think this can be easily
   extended for other types)?
 - My python is rusty and Python C-API even more so, please don't assume
   I know exactly what I'm doing :)
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it rather unwelcome that this adds an abstraction ("ArrowToPython") that takes care of a single type. I'm not sure if that refactor is potentially useful, but we should have a single facility for conversion to Python objects, not a bunch of unrelated ones.

/// representation.
///
/// For instance timestamp would be translated to a integer representing an
// offset from the unix epoch.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it? I thought a timestamp would be converted to a datetime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scalars seem to have two notions. a property value which corresponds to the primitive and "as_py" which does the conversion. Previously I had ToLogical() method which was meant to be the analogue of as_py, and I expect this method would be added back as part of ARROW-12976. I'm open to naming recommendations. ToLogical was removed because for this interval type the two end up being the same.

public:
/// \brief Converts the given Array to a PyList object. Returns NULL if there
/// is an error converting the Array. The list elements are the same ones
/// generated via ToLogical()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is ToLogical?

} // namespace

Result<PyObject*> ArrowToPython::ToPyList(const Array& array) {
RETURN_NOT_OK(CheckInterval(*array.type()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So these methods only work for month_day_nano_interval? That seems like a weird API choice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent was to generalize them for other types (see TODO JIRAs in the header for supporting more types).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I'd rather see a comprehensive refactor in another PR than a stub like this, which may end up abandoned.

@emkornfield
Copy link
Contributor Author

I find it rather unwelcome that this adds an abstraction ("ArrowToPython") that takes care of a single type. I'm not sure if that refactor is potentially useful, but we should have a single facility for conversion to Python objects, not a bunch of unrelated ones.

Mentioned this elsewhere but I was trying to prep layout for ARROW-12976 is this is undesirable, I can make these simple functions for now.

@emkornfield emkornfield requested a review from pitrou October 5, 2021 23:31
@jorisvandenbossche jorisvandenbossche changed the title ARROW-13806: [C++][Python] Add support for new Interval Type ARROW-13806: [C++][Python] Add support for new MonthDayNano Interval Type Oct 6, 2021
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Added a few inline comments. And some additional non-inline comments:

  • Can you also add it to test_types.py (eg adding to get_all_types() will automatically run some tests)
  • For consistency with other types, we might also want to add a is_interval function in types.py
  • Add it to test_type_for_alias in test_schema.py
  • Add the type factory to docs in /python/api/datatypes.rst and arrays.rst

I can also push myself for those (trivial) changes if you want.

One other question (but doesn't need to be handled here):

  • Should we allow creating an interval array from plain tuples instead of only from the MonthDayNano named tuple?

@@ -321,6 +323,14 @@ void InitPandasStaticData() {
pandas_NA = ref.obj();
}

// Import DateOffset type
OwnedRef offsets;
if (internal::ImportModule("pandas.tseries.offsets", &offsets).ok()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that DateOffset if available in the top-level pandas namespace, in which case I would import it from there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in next commit.

Comment on lines 18 to 19
// Functions for converting between pandas's NumPy-based data representation
// and Arrow data structures
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is a left-over from copying another file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. removed on next commit.

@emkornfield
Copy link
Contributor Author

emkornfield commented Oct 6, 2021 via email

@pitrou
Copy link
Member

pitrou commented Oct 6, 2021

@emkornfield It can live somewhere with the other Python helpers IMHO.

@emkornfield
Copy link
Contributor Author

@emkornfield It can live somewhere with the other Python helpers IMHO.

Consolidated in datetime.h/datetime.cc.

@emkornfield
Copy link
Contributor Author

@jorisvandenbossche thanks for the feedback. Please see responses inline

Looks good! Added a few inline comments. And some additional non-inline comments:

  • Can you also add it to test_types.py (eg adding to get_all_types() will automatically run some tests)

It looks like this has been refactored. Please let me know if I did it correctly.

I can also push myself for those (trivial) changes if you want.

I think I've addressed all the feedback here, let me know if I've missed something.

One other question (but doesn't need to be handled here):

  • Should we allow creating an interval array from plain tuples instead of only from the MonthDayNano named tuple?

Possibly, I think we can add this if users request it later. I don't think there is a need to do it here because tuples already get inferred as lists.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a few more nits

Comment on lines +80 to +82
// Hack for python versions < 3.7 where members of PyStruct members
// where non-const (C++ doesn't like assigning string literals to these types)
return const_cast<char*>(st);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still support python versions < 3.7? I thought we stopped shipping binary wheels for these versions but maybe we still support building from source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do, I asked that we ddidn't drop it last release. I would also ask that we don't drop it this release (i.e. keep it through its full python support cycle. There are a number of consumers of pyarrow that try to keep support for python versions until they dropped it which is end of this year). This was caught because we run it in CI.

Micah Kornfield and others added 2 commits October 7, 2021 00:18
@pitrou pitrou closed this in 415439c Oct 7, 2021
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
…Type

- Refactored ObjectWriter helpers from arrow_to_pandas, so they can be
  used for plain python types as well (generalized the lowest level so
  it can work on both PyObject** and an adapter for PyList.

- Add DateOffset to static pandas imports

- Tried to start laying out code in a way to use C++ for Array.to_pylist
  (feel free to comment).

Support importing from timeinterval, relativedelta and DateOffset types
(this is actually mostly duck types, the one complication is that
 relativedelta has a property weeks that is automatically calculated, so
 some type checking is necessary).

Open questions:
 - Should we be more strict on duck typing imports? I chose generalism
   over performance here (rechecking non-present attributes, etc)?
 - Is the new arrow_to_python.h desirable (I think this can be easily
   extended for other types)?
 - My python is rusty and Python C-API even more so, please don't assume
   I know exactly what I'm doing :)

Closes apache#11302 from emkornfield/interval_python

Lead-authored-by: Micah Kornfield <[email protected]>
Co-authored-by: emkornfield <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants