-
-
Notifications
You must be signed in to change notification settings - Fork 18.9k
ENH: Add skipna parameter to infer_dtype #17066
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add skipna parameter to infer_dtype #17066
Conversation
pandas/_libs/src/inference.pyx
Outdated
""" | ||
Effeciently infer the type of a passed val, or list-like | ||
array of values. Return a string describing the type. | ||
|
||
Parameters | ||
---------- | ||
value : scalar, list, ndarray, or pandas type | ||
skipna : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add default value (i.e. "default False") (and remove it from the description).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should we document that this will change to True
(don't need to necessarily deprecate now)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpcloud : Thoughts about my comment regarding the eventual change of default value?
pandas/_libs/src/inference.pyx
Outdated
for i in range(n): | ||
val = objbuf[i] | ||
if not util._checknull(val) and not util.is_bool_object(val): | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this logic is refactor-able. It seems very duplicative across all of the functions. Not a big deal, but just an observation ATM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could write this as a single for loop, but I don't want to introduce any performance regressions for skipna=False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow you here. This loop would only be entered if skipna
is True
. What I meant is that you could potentially abstract this for-loop into a separate function, which accepts ndarray
, n
, and a dtype
-checker function (e.g. is_bool_object
, is_float_object
).
Would be more compact but could impact performance, which is why I'm not pushing this so hard but just floating the idea in case there is a good way to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring the entire set of functions is kind of a can of worms, because the fundamental problem here is that we must assume that arrays contain potentially mixed types or objects, which are really the same thing.
It might be possible to make a Validator
class with subclasses for each type that we want to infer, that has a method validate
that encapsulates the current loop logic.
Generally agree this needs some TLC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring the entire set of functions is kind of a can of worms, because the fundamental problem here is that we must assume that arrays contain potentially mixed types or objects, which are really the same thing.
Yeah, that's fair. I was imagining something like this (very rough):
cpdef check_non_na(ndarray[object] values, <callable> checker):
n = len(values)
for i in range(n):
val = values[i]
if not util._checknull(val) and not checker(val):
return False
return True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've started working on a refactor, I'll put up a PR if it's viable :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Makes sense to keep separate from this 😄
Failures related to |
actually I am not big on this. This is just a bug fix, we should be ignoring NaN when we have a string,unciode,bytes objects. otherwise. |
I don't really follow what you're saying here. Can you clarify? |
Why is this a bug-fix? The behavior is consistent with what I would expect because there is |
it should mirror what we already do for other types. We automatically skip na's; they don't have an impact on the inference type. In fact that's the entire point of this routine.
|
Right. That's where we want to be, but this routine has existed for a very long time with this behavior and I don't think it'd be a great idea to break it right this second. I think @gfyoung had the right idea with 1 or more deprecation cycles before changing the default behavior to skip NA values by default. |
this is broken. I don't really have a problem with adding a |
Wait for my refactor, which reconciles them nicely. |
I figured there would be a way. 😄 |
I'll put up another PR with the refactor, but since it's quite a large refactor and may introduce a perf regression let's leave this one open as well in case the refactor isn't viable from a perf perspective. |
@cpcloud : You could have made a separate commit here and just show |
Indeed. I'll do that. Thanks for the tip! |
@cpcloud : Rebase onto |
cool, will do once the benchmarks are finished! |
asv results (top offenders):
|
It's possible that first one is failing because I haven't yet run the entire test suite to make sure I haven't broken anything outside of the inference code. |
Yep, some failing tests :) Fixing them up right now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good. any significant perf diffs?
pandas/_libs/src/inference.pyx
Outdated
|
||
cdef bint is_valid(self, object value) except -1: | ||
return self.is_value_typed(value) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too bad you can't make these in-line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I might be able to, didn't try. I would also expect the C compiler to be pretty smart here. inline
shouldn't have much affect nowadays, since the compiler will ignore it if it's unsafe to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
pandas/_libs/src/inference.pyx
Outdated
|
||
cdef bint is_value_typed(self, object value) except -1: | ||
return util.is_bool_object(value) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can prob make some things inlne here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try inline
ing the leaf classes and see if there's any perf benefit.
pandas/_libs/src/inference.pyx
Outdated
""" | ||
Effeciently infer the type of a passed val, or list-like | ||
array of values. Return a string describing the type. | ||
|
||
Parameters | ||
---------- | ||
value : scalar, list, ndarray, or pandas type | ||
skipna : bool, default False | ||
Ignore NaN values when inferring the type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versionadded tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor thing: I would prefer that you still mention that we will change the default to True
in the future, even if we don't deprecate now. What do you think?
Here are the inference benchmarks with latest commit:
|
Codecov Report
@@ Coverage Diff @@
## master #17066 +/- ##
=========================================
Coverage ? 91.01%
=========================================
Files ? 161
Lines ? 49363
Branches ? 0
=========================================
Hits ? 44927
Misses ? 4436
Partials ? 0
Continue to review full report at Codecov.
|
@jreback @gfyoung No big regressions :) Here's everything that was worse than 1x
|
@cpcloud : Nice! Great to see that the refactoring worked out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. must some minor comments.
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -24,6 +24,8 @@ New features | |||
<https://www.python.org/dev/peps/pep-0519/>`_ on most readers and writers (:issue:`13823`) | |||
- Added ``__fspath__`` method to :class:`~pandas.HDFStore`, :class:`~pandas.ExcelFile`, | |||
and :class:`~pandas.ExcelWriter` to work properly with the file system path protocol (:issue:`13823`) | |||
- Added ``skipna`` parameter :func:`~pandas.api.types.infer_dtype` to support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parameter to :func:....., to support
@@ -272,6 +277,9 @@ def infer_dtype(object value): | |||
>>> infer_dtype(['foo', 'bar']) | |||
'string' | |||
|
|||
>>> infer_dtype(['a', np.nan, 'b'], skipna=True) | |||
'string' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add same example with skipna=False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/_libs/src/inference.pyx
Outdated
|
||
cdef bint is_valid(self, object value) except -1: | ||
return self.is_value_typed(value) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
|
||
if n == 0: | ||
return False | ||
cdef class StringValidator(Validator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if there is value to making String, Unicode, Bytes inherit from an ABC StringLike (for consistency with the way you did datetime (Temporal) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe just String/Unicode (as Bytes is a different animal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so because there's no code sharing between those three classes here. The child classes of TemporalValidator
all share a large amount of validation code and really only differ in the way they check their value types and their nulls. With string-like values they don't share any code that isn't already factored out in Validator
.
pandas/_libs/src/inference.pyx
Outdated
cdef inline bint is_valid(self, object value) except -1: | ||
return self.is_value_typed(value) or self.is_valid_null(value) | ||
|
||
cdef bint is_value_typed(self, object value) except -1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_value_typed NI is not needed (as defined in base class)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch. Fixing.
pandas/_libs/src/inference.pyx
Outdated
cdef inline bint is_array_typed(self) except -1: | ||
return ( | ||
PY2 and issubclass(self.dtype.type, np.string_) | ||
) or not PY2 and issubclass(self.dtype.type, np.unicode_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly enough I think isinstance(self.dtype.type, np.str_)
does exactly this. I'll give it a shot and see if anything breaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works!
@jreback this is ready to go |
Actually one sec. |
I presume you meant to push that change first? |
@gfyoung Yep! Pushed. |
Given that both @jreback and I have approved, and you're ready to merge this, I imagine three approvals from ">= core-committers" makes this merge-able on green 😄 |
@gfyoung agree :) Merging on green. |
Thanks @cpcloud ! |
closes #17059
git diff upstream/master -u -- "*.py" | flake8 --diff