ENH: Add skipna parameter to infer_dtype #17066

cpcloud · 2017-07-24T16:41:31Z

tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

gfyoung · 2017-07-24T16:43:31Z

pandas/_libs/src/inference.pyx

    """
    Effeciently infer the type of a passed val, or list-like
    array of values. Return a string describing the type.

    Parameters
    ----------
    value : scalar, list, ndarray, or pandas type
+    skipna : bool


Add default value (i.e. "default False") (and remove it from the description).

Also, should we document that this will change to True (don't need to necessarily deprecate now)?

@cpcloud : Thoughts about my comment regarding the eventual change of default value?

gfyoung · 2017-07-24T16:46:20Z

pandas/_libs/src/inference.pyx

+            for i in range(n):
+                val = objbuf[i]
+                if not util._checknull(val) and not util.is_bool_object(val):
+                    return False


I feel like this logic is refactor-able. It seems very duplicative across all of the functions. Not a big deal, but just an observation ATM.

We could write this as a single for loop, but I don't want to introduce any performance regressions for skipna=False

I'm not sure I follow you here. This loop would only be entered if skipna is True. What I meant is that you could potentially abstract this for-loop into a separate function, which accepts ndarray, n, and a dtype-checker function (e.g. is_bool_object, is_float_object).

Would be more compact but could impact performance, which is why I'm not pushing this so hard but just floating the idea in case there is a good way to do this.

Refactoring the entire set of functions is kind of a can of worms, because the fundamental problem here is that we must assume that arrays contain potentially mixed types or objects, which are really the same thing.

It might be possible to make a Validator class with subclasses for each type that we want to infer, that has a method validate that encapsulates the current loop logic.
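A minimal pure-Python sketch of that Validator idea is below. The method names borrow from hunks quoted later in this thread, but the bodies, the skipna flag on the constructor, and the simplified `is_null` helper (None or float NaN) are assumptions for illustration, not the merged Cython.

```python
import math
from typing import Any, Sequence


def is_null(value: Any) -> bool:
    # Simplified null check (assumption): None or a float NaN.
    return value is None or (isinstance(value, float) and math.isnan(value))


class Validator:
    """Base class: one shared loop, per-type checks supplied by subclasses."""

    def __init__(self, values: Sequence[Any], skipna: bool = False) -> None:
        self.values = values
        self.skipna = skipna

    def is_value_typed(self, value: Any) -> bool:
        raise NotImplementedError

    def is_valid(self, value: Any) -> bool:
        # With skipna=True a null never disqualifies the array.
        return self.is_value_typed(value) or (self.skipna and is_null(value))

    def validate(self) -> bool:
        # An empty array is not considered to be of any particular type.
        if len(self.values) == 0:
            return False
        return all(self.is_valid(v) for v in self.values)


class BoolValidator(Validator):
    def is_value_typed(self, value: Any) -> bool:
        return isinstance(value, bool)


class StringValidator(Validator):
    def is_value_typed(self, value: Any) -> bool:
        return isinstance(value, str)


print(StringValidator(["a", float("nan"), "b"], skipna=True).validate())   # True
print(StringValidator(["a", float("nan"), "b"], skipna=False).validate())  # False
```

Each new type then only needs to override `is_value_typed`; the loop and the skipna handling live in one place.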

Generally agree this needs some TLC.

Refactoring the entire set of functions is kind of a can of worms, because the fundamental problem here is that we must assume that arrays contain potentially mixed types or objects, which are really the same thing.

Yeah, that's fair. I was imagining something like this (very rough):

    cpdef check_non_na(ndarray[object] values, <callable> checker):
        n = len(values)
        for i in range(n):
            val = values[i]
            if not util._checknull(val) and not checker(val):
                return False
        return True
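For anyone who wants to try the idea outside Cython, the rough helper above can be approximated in plain Python. This is only a sketch: `is_null` is a hypothetical stand-in for util._checknull (it treats None and float NaN as null), and `is_bool` is a hypothetical stand-in for util.is_bool_object.

```python
import math
from typing import Any, Callable, Sequence


def is_null(value: Any) -> bool:
    """Hypothetical stand-in for util._checknull: None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_bool(value: Any) -> bool:
    """Hypothetical stand-in for util.is_bool_object."""
    return isinstance(value, bool)


def check_non_na(values: Sequence[Any], checker: Callable[[Any], bool]) -> bool:
    """Return True if every non-null element satisfies ``checker``."""
    for val in values:
        if not is_null(val) and not checker(val):
            return False
    return True


print(check_non_na([True, float("nan"), False], is_bool))  # True
print(check_non_na([True, "foo", None], is_bool))          # False
```

Passing the checker as a Python callable is what makes this compact, and it is also the likely source of the performance concern raised above, since each element costs an extra function call.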

I've started working on a refactor, I'll put up a PR if it's viable :)

Sounds good. Makes sense to keep separate from this 😄

gfyoung · 2017-07-24T17:17:27Z

Failures related to pyarrow or feather are unrelated to this PR.

jreback · 2017-07-24T17:57:06Z

actually I am not big on this. This is just a bug fix, we should be ignoring NaN when we have string, unicode, bytes objects. otherwise.

cpcloud · 2017-07-24T18:29:16Z

@jreback

actually I am not big on this. This is just a bug fix, we should be ignoring NaN when we have string, unicode, bytes objects. otherwise.

I don't really follow what you're saying here. Can you clarify?

gfyoung · 2017-07-24T18:33:23Z

actually I am not big on this. This is just a bug fix, we should be ignoring NaN when we have string, unicode, bytes objects. otherwise.

Why is this a bug-fix? The behavior is consistent with what I would expect because there is NaN and other elements in the array-like. I don't see why we should be changing this behavior that drastically.

jreback · 2017-07-24T18:44:24Z

it should mirror what we already do for other types. We automatically skip na's; they don't have an impact on the inference type. In fact that's the entire point of this routine.

In [4]: infer_dtype([pd.Timestamp('20130101'), np.nan])
Out[4]: 'datetime'

In [5]: infer_dtype(['foo', np.nan])
Out[5]: 'mixed'

In [6]: infer_dtype([pd.Timestamp('20130101')])
Out[6]: 'datetime'

In [7]: infer_dtype([pd.Timestamp('20130101'), pd.NaT])
Out[7]: 'datetime'

In [8]: infer_dtype(['foo'])
Out[8]: 'string'

In [9]: infer_dtype(['foo', np.nan])
Out[9]: 'mixed'

cpcloud · 2017-07-24T18:46:14Z

Right. That's where we want to be, but this routine has existed for a very long time with this behavior and I don't think it'd be a great idea to break it right this second. I think @gfyoung had the right idea with 1 or more deprecation cycles before changing the default behavior to skip NA values by default.

jreback · 2017-07-24T18:50:38Z

Right. That's where we want to be, but this routine has existed for a very long time with this behavior and I don't think it'd be a great idea to break it right this second. I think @gfyoung had the right idea with 1 or more deprecation cycles before changing the default behavior to skip NA values by default.

this is broken. I don't really have a problem with adding a skipna, though it adds a lot more logic. The problem is these are now different for different dtypes. I don't think it's easy to reconcile this.

cpcloud · 2017-07-24T18:51:19Z

I don't think it's easy to reconcile this.

Wait for my refactor, which reconciles them nicely.

gfyoung · 2017-07-24T18:52:19Z

Wait for my refactor, which reconciles them nicely.

I figured there would be a way. 😄

cpcloud · 2017-07-24T20:17:57Z

I'll put up another PR with the refactor, but since it's quite a large refactor and may introduce a perf regression let's leave this one open as well in case the refactor isn't viable from a perf perspective.

gfyoung · 2017-07-24T20:35:47Z

@cpcloud : You could have made a separate commit here and just shown asv comparisons between them.

cpcloud · 2017-07-24T20:49:01Z

Indeed. I'll do that. Thanks for the tip!

gfyoung · 2017-07-24T21:06:59Z

@cpcloud : Rebase onto master so that pyarrow tests don't fail anymore

cpcloud · 2017-07-24T21:24:20Z

cool, will do once the benchmarks are finished!

cpcloud · 2017-07-24T22:54:58Z

asv results (top offenders):

    before     after       ratio
  [c55dbf06] [04003c8d]
!  117.73ms     failed       n/a  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('FY5253Quarter_2', 2)
+  204.01ms   336.96ms      1.65  inference.to_numeric_downcast.time_downcast('string-nint', 'unsigned')
+  197.91ms   314.76ms      1.59  inference.to_numeric_downcast.time_downcast('string-int', 'integer')
+  189.44ms   298.65ms      1.58  inference.to_numeric_downcast.time_downcast('string-int', None)
+  105.50ms   160.92ms      1.53  packers.packers_read_sas.time_read_sas7bdat
+   33.91ms    50.75ms      1.50  groupby.GroupBySuite.time_diff('float', 100)
+  213.73ms   305.88ms      1.43  inference.to_numeric_downcast.time_downcast('string-nint', 'signed')
+  196.81ms   276.27ms      1.40  inference.to_numeric_downcast.time_downcast('string-nint', None)

cpcloud · 2017-07-24T22:56:57Z

It's possible that first one is failing because I haven't yet run the entire test suite to make sure I haven't broken anything outside of the inference code.

cpcloud · 2017-07-24T23:02:49Z

Yep, some failing tests :) Fixing them up right now

jreback

looks pretty good. any significant perf diffs?

jreback · 2017-07-24T23:51:23Z

pandas/_libs/src/inference.pyx

+
+    cdef bint is_valid(self, object value) except -1:
+        return self.is_value_typed(value)
+


too bad you can't make these in-line.

I think I might be able to, didn't try. I would also expect the C compiler to be pretty smart here. inline shouldn't have much effect nowadays, since the compiler will ignore it if it's unsafe to do.

jreback · 2017-07-24T23:51:48Z

pandas/_libs/src/inference.pyx

+
+    cdef bint is_value_typed(self, object value) except -1:
+        return util.is_bool_object(value)
+


you can prob make some things inline here

I'll try inlining the leaf classes and see if there's any perf benefit.

jreback · 2017-07-24T23:54:05Z

pandas/_libs/src/inference.pyx

    """
    Effeciently infer the type of a passed val, or list-like
    array of values. Return a string describing the type.

    Parameters
    ----------
    value : scalar, list, ndarray, or pandas type
+    skipna : bool, default False
+        Ignore NaN values when inferring the type.


versionadded tag

Minor thing: I would prefer that you still mention that we will change the default to True in the future, even if we don't deprecate now. What do you think?

cpcloud · 2017-07-25T00:02:17Z

Here are the inference benchmarks with latest commit:

    before     after       ratio
  [1d0e6a15] [390aae13]
+  115.00μs   164.03μs      1.43  inference.to_numeric.time_from_str_ignore
+  500.56μs   706.92μs      1.41  inference.DtypeInfer.time_uint32
+  149.24ms   195.31ms      1.31  inference.to_numeric_downcast.time_downcast('string-float', 'float')
+  222.45ms   279.17ms      1.25  inference.to_numeric_downcast.time_downcast('string-int', None)
+  157.01ms   196.73ms      1.25  inference.to_numeric_downcast.time_downcast('string-float', 'integer')
+  196.16ms   244.83ms      1.25  inference.to_numeric_downcast.time_downcast('string-int', 'signed')
+  166.39ms   202.27ms      1.22  inference.to_numeric_downcast.time_downcast('string-float', 'signed')
+   15.05ms    18.08ms      1.20  inference.DtypeInfer.time_datetime64
+  193.79ms   232.27ms      1.20  inference.to_numeric_downcast.time_downcast('string-nint', None)
+  168.74ms   201.07ms      1.19  inference.to_numeric_downcast.time_downcast('string-float', 'unsigned')
+   46.88ms    55.18ms      1.18  inference.to_numeric_downcast.time_downcast('int-list', 'integer')
+    2.69ms     3.16ms      1.17  inference.to_numeric_downcast.time_downcast('datetime64', 'float')
+   27.13ms    31.86ms      1.17  inference.to_numeric_downcast.time_downcast('datetime64', 'signed')
+  169.50ms   194.19ms      1.15  inference.to_numeric_downcast.time_downcast('string-float', None)
+   49.51ms    56.34ms      1.14  inference.to_numeric_downcast.time_downcast('int-list', 'unsigned')
+    1.34ms     1.51ms      1.13  inference.to_numeric_downcast.time_downcast('datetime64', None)
+   36.23ms    40.06ms      1.11  inference.to_numeric_downcast.time_downcast('int-list', 'float')
-   39.59μs    35.49μs      0.90  inference.to_numeric.time_from_float
-  770.33ms   688.99ms      0.89  inference.MaybeConvertNumeric.time_convert
-   19.51ms    17.18ms      0.88  inference.to_numeric.time_from_str_coerce
-    6.96ms     6.05ms      0.87  inference.DtypeInfer.time_int64
-  467.61ms   401.44ms      0.86  inference.to_numeric.time_from_numeric_str

codecov · 2017-07-25T00:37:26Z

Codecov Report

❗ No coverage uploaded for pull request base (master@9e6bb42).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #17066   +/-   ##
=========================================
  Coverage          ?   91.01%           
=========================================
  Files             ?      161           
  Lines             ?    49363           
  Branches          ?        0           
=========================================
  Hits              ?    44927           
  Misses            ?     4436           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`88.78% <ø> (?)`
#single	`40.25% <ø> (?)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9e6bb42...a6d122b.

cpcloud · 2017-07-25T02:08:52Z

@jreback @gfyoung No big regressions :)

Here's everything that was worse than 1x upstream/master:

    before     after       ratio
  [9e6bb42f] [a6d122b9]
+  794.88ms      1.13s      1.42  packers.Excel.time_write_excel_openpyxl
+    3.80ms     5.03ms      1.33  strings.StringMethods.time_get
+  211.56ms   276.62ms      1.31  inference.to_numeric_downcast.time_downcast('string-nint', 'float')
+   16.39ms    21.34ms      1.30  gil.nogil_read_csv.time_read_csv_object
+   23.70ms    30.14ms      1.27  binary_ops.Ops.time_frame_mult(True, 'default')
+    3.17ms     3.82ms      1.21  strings.StringMethods.time_upper
+    4.30ms     5.14ms      1.20  strings.StringMethods.time_rstrip
+    6.27ms     7.50ms      1.20  frame_methods.Reindex.time_reindex_upcast
+  164.44ms   193.30ms      1.18  inference.to_numeric_downcast.time_downcast('string-float', 'unsigned')
+    5.56ms     6.48ms      1.17  categoricals.Categoricals.time_union
+    4.09ms     4.71ms      1.15  strings.StringMethods.time_lstrip
+    6.23ms     7.17ms      1.15  algorithms.Algorithms.time_add_overflow_pos_arr
+   28.19ms    32.39ms      1.15  stat_ops.stats_rank_pct_average_old.time_stats_rank_pct_average_old
+  336.85ns   382.73ns      1.14  attrs_caching.DataFrameAttributes.time_get_index
+    1.01ms     1.14ms      1.13  join_merge.join_non_unique_equal.time_join_non_unique_equal
+    4.23ms     4.75ms      1.12  strings.StringMethods.time_strip
+   14.70ms    16.52ms      1.12  binary_ops.Ops2.time_frame_int_mod
+     6.83s      7.67s      1.12  gil.nogil_datetime_fields.time_datetime_field_day
+   17.80ms    19.91ms      1.12  parser_vb.read_csv1.time_sep
+    2.70ms     3.01ms      1.12  groupby.GroupBySuite.time_size('float', 10000)
+    3.09ms     3.44ms      1.11  strings.StringMethods.time_contains_many_noregex
+    2.78ms     3.07ms      1.10  period.Algorithms.time_drop_duplicates_pseries
+  253.54ms   279.28ms      1.10  inference.to_numeric_downcast.time_downcast('string-nint', 'unsigned')
+   83.70ms    92.18ms      1.10  eval.Eval.time_mult('numexpr', 'all')
+   21.67ms    23.86ms      1.10  frame_methods.frame_mask_bools.time_frame_mask_bools
+   15.72ms    17.30ms      1.10  categoricals.Categoricals.time_constructor_regular

gfyoung · 2017-07-25T03:16:14Z

@cpcloud : Nice! Great to see that the refactoring worked out.

jreback

lgtm. just some minor comments.

jreback · 2017-07-25T09:53:55Z

doc/source/whatsnew/v0.21.0.txt

@@ -24,6 +24,8 @@ New features
  <https://www.python.org/dev/peps/pep-0519/>`_ on most readers and writers (:issue:`13823`)
 - Added ``__fspath__`` method to :class:`~pandas.HDFStore`, :class:`~pandas.ExcelFile`,
  and :class:`~pandas.ExcelWriter` to work properly with the file system path protocol (:issue:`13823`)
+- Added ``skipna`` parameter :func:`~pandas.api.types.infer_dtype` to support


parameter to :func:....., to support

jreback · 2017-07-25T09:55:33Z

pandas/_libs/src/inference.pyx

@@ -272,6 +277,9 @@ def infer_dtype(object value):
    >>> infer_dtype(['foo', 'bar'])
    'string'

+    >>> infer_dtype(['a', np.nan, 'b'], skipna=True)
+    'string'


can you add same example with skipna=False

jreback · 2017-07-25T09:56:52Z

pandas/_libs/src/inference.pyx

+
+    cdef bint is_valid(self, object value) except -1:
+        return self.is_value_typed(value)
+


jreback · 2017-07-25T09:59:25Z

pandas/_libs/src/inference.pyx


-        if n == 0:
-            return False
+cdef class StringValidator(Validator):


not sure if there is value to making String, Unicode, Bytes inherit from an ABC StringLike (for consistency with the way you did datetime (Temporal) )

or maybe just String/Unicode (as Bytes is a different animal)

I don't think so because there's no code sharing between those three classes here. The child classes of TemporalValidator all share a large amount of validation code and really only differ in the way they check their value types and their nulls. With string-like values they don't share any code that isn't already factored out in Validator.
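To make that distinction concrete, here is a rough pure-Python sketch of how temporal validators might share one validation method while differing only in their value-type and null checks. Everything in it is an assumption for illustration: standard-library datetime types stand in for the pandas/NumPy ones, the "at least one typed element" rule is a guess at what the shared code does, and the subclass names are hypothetical.

```python
import datetime as dt
import math
from typing import Any, Sequence


class TemporalValidator:
    """Shared logic: every element is typed or a valid null, and at
    least one element is actually typed (assumed rule for this sketch)."""

    def __init__(self, values: Sequence[Any]) -> None:
        self.values = values

    def is_value_typed(self, value: Any) -> bool:
        raise NotImplementedError

    def is_valid_null(self, value: Any) -> bool:
        raise NotImplementedError

    def is_valid(self, value: Any) -> bool:
        return self.is_value_typed(value) or self.is_valid_null(value)

    def validate(self) -> bool:
        all_valid = all(self.is_valid(v) for v in self.values)
        any_typed = any(self.is_value_typed(v) for v in self.values)
        return all_valid and any_typed


class DatetimeValidator(TemporalValidator):
    def is_value_typed(self, value: Any) -> bool:
        return isinstance(value, dt.datetime)

    def is_valid_null(self, value: Any) -> bool:
        # Assumed for the sketch: None or a float NaN counts as a null.
        return value is None or (isinstance(value, float) and math.isnan(value))


class TimedeltaValidator(TemporalValidator):
    def is_value_typed(self, value: Any) -> bool:
        return isinstance(value, dt.timedelta)

    def is_valid_null(self, value: Any) -> bool:
        # Assumed for the sketch: only None counts as a timedelta null here.
        return value is None


stamps = [dt.datetime(2013, 1, 1), float("nan")]
print(DatetimeValidator(stamps).validate())   # True
print(TimedeltaValidator(stamps).validate())  # False
```

The string-like validators would have nothing comparable to put in a shared intermediate class, which is the point made above.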

jreback · 2017-07-25T10:01:14Z

pandas/_libs/src/inference.pyx

+    cdef inline bint is_valid(self, object value) except -1:
+        return self.is_value_typed(value) or self.is_valid_null(value)
+
+    cdef bint is_value_typed(self, object value) except -1:


is_value_typed NI is not needed (as defined in base class)

Ah, good catch. Fixing.

cpcloud · 2017-07-25T13:54:58Z

pandas/_libs/src/inference.pyx

+    cdef inline bint is_array_typed(self) except -1:
+        return (
+            PY2 and issubclass(self.dtype.type, np.string_)
+        ) or not PY2 and issubclass(self.dtype.type, np.unicode_)


Interestingly enough I think issubclass(self.dtype.type, np.str_) does exactly this. I'll give it a shot and see if anything breaks.

cpcloud · 2017-07-25T16:02:22Z

@jreback this is ready to go

cpcloud · 2017-07-25T16:02:59Z

Actually one sec.

gfyoung · 2017-07-25T16:11:13Z

Actually one sec.

I presume you meant to push that change first?

cpcloud · 2017-07-25T16:14:38Z

@gfyoung Yep! Pushed.

gfyoung · 2017-07-25T16:20:13Z

Given that both @jreback and I have approved, and you're ready to merge this, I imagine three approvals from ">= core-committers" makes this merge-able on green 😄

cpcloud · 2017-07-25T16:21:28Z

@gfyoung agree :) Merging on green.

gfyoung · 2017-07-25T17:27:57Z

Thanks @cpcloud !

cpcloud · 2017-07-25T18:23:18Z

@gfyoung @jreback Thanks for the reviews!

cpcloud added the Enhancement and Missing-data labels Jul 24, 2017

cpcloud added this to the Next Major Release milestone Jul 24, 2017

cpcloud requested a review from jreback July 24, 2017 16:41

gfyoung reviewed Jul 24, 2017

gfyoung approved these changes Jul 24, 2017

cpcloud mentioned this pull request Jul 24, 2017

Cleaner infer #17069

Closed

jreback reviewed Jul 24, 2017

cpcloud mentioned this pull request Jul 25, 2017

ENH: Enable unary math operations for pandas, sqlite ibis-project/ibis#1071

Closed

cpcloud modified the milestones: 0.21.0, Next Major Release Jul 25, 2017

jreback approved these changes Jul 25, 2017

cpcloud added 6 commits July 25, 2017 09:49

ENH: Add skipna parameter to infer_dtype

62008ec

ENH: Refactor type inference Cython code

9422651

PERF: No negative index or out of bounds check

68b277b

DOC: versionadded tag

b4552c3

PERF: Inline leaf classes' methods

ddfca51

PERF: Don't try to inline the larger methods

234110a

cpcloud commented Jul 25, 2017

cpcloud added 2 commits July 25, 2017 09:57

DOC: typo

ca2e65b

REF: Address comments

9a42836

CLN: Simplify is_array_type for strings

1883927

gfyoung merged commit 13b57cd into pandas-dev:master Jul 25, 2017

cpcloud deleted the add-skipna-for-infer-dtype branch July 25, 2017 18:23

This was referenced Oct 30, 2018

API: fix corner cases of lib.infer_dtype #23421

Closed

API: fix corner case of lib.infer_dtype #23422

Merged

h-vetinari mentioned this pull request Dec 2, 2018

DEPR: deprecate default of skipna=False in infer_dtype #24050

Merged
