Member

@cpcloud cpcloud commented Jul 24, 2017

closes #17059

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@cpcloud cpcloud added Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jul 24, 2017
@cpcloud cpcloud added this to the Next Major Release milestone Jul 24, 2017
@cpcloud cpcloud requested a review from jreback July 24, 2017 16:41
"""
Efficiently infer the type of a passed val, or list-like
array of values. Return a string describing the type.

Parameters
----------
value : scalar, list, ndarray, or pandas type
skipna : bool
Member

@gfyoung gfyoung Jul 24, 2017

Add default value (i.e. "default False") (and remove it from the description).

Member

Also, should we document that this will change to True (don't need to necessarily deprecate now)?

Member Author

Done

Member

@cpcloud : Thoughts about my comment regarding the eventual change of default value?

for i in range(n):
    val = objbuf[i]
    if not util._checknull(val) and not util.is_bool_object(val):
        return False
Member

I feel like this logic is refactor-able. It seems very duplicative across all of the functions. Not a big deal, but just an observation ATM.

Member Author

We could write this as a single for loop, but I don't want to introduce any performance regressions for skipna=False

Member

@gfyoung gfyoung Jul 24, 2017

I'm not sure I follow you here. This loop would only be entered if skipna is True. What I meant is that you could potentially abstract this for-loop into a separate function, which accepts ndarray, n, and a dtype-checker function (e.g. is_bool_object, is_float_object).

Would be more compact but could impact performance, which is why I'm not pushing this so hard but just floating the idea in case there is a good way to do this.

Member Author

Refactoring the entire set of functions is kind of a can of worms, because the fundamental problem here is that we must assume that arrays contain potentially mixed types or objects, which are really the same thing.

It might be possible to make a Validator class with subclasses for each type that we want to infer, that has a method validate that encapsulates the current loop logic.

Generally agree this needs some TLC.
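
A rough pure-Python sketch of the Validator idea, not the Cython that would actually go into the library. The class names, the `validate` method, the `skipna` attribute, and the `is_null` helper are placeholders chosen for illustration, and `is_null` only recognizes `None` and float NaN.

```python
import math


def is_null(value):
    """Treat None and float NaN as missing (a simplification)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


class Validator:
    """Holds the loop once; subclasses only say what counts as 'typed'."""

    def __init__(self, values, skipna=False):
        self.values = list(values)
        self.skipna = skipna

    def is_value_typed(self, value):
        raise NotImplementedError

    def is_valid(self, value):
        # With skipna=True a missing value never disqualifies the array.
        if self.skipna and is_null(value):
            return True
        return self.is_value_typed(value)

    def validate(self):
        # An empty array is not considered to be of any type.
        if not self.values:
            return False
        return all(self.is_valid(v) for v in self.values)


class BoolValidator(Validator):
    def is_value_typed(self, value):
        return isinstance(value, bool)


class StringValidator(Validator):
    def is_value_typed(self, value):
        return isinstance(value, str)
```

With this shape, `StringValidator(['a', float('nan'), 'b'], skipna=True).validate()` passes while the same call without `skipna` fails, and each new type only needs its own `is_value_typed`.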

Member

@gfyoung gfyoung Jul 24, 2017

> Refactoring the entire set of functions is kind of a can of worms, because the fundamental problem here is that we must assume that arrays contain potentially mixed types or objects, which are really the same thing.

Yeah, that's fair. I was imagining something like this (very rough):

cpdef bint check_non_na(ndarray[object] values, object checker):
    cdef Py_ssize_t i, n = len(values)
    for i in range(n):
        val = values[i]
        if not util._checknull(val) and not checker(val):
            return False
    return True

Member Author

I've started working on a refactor, I'll put up a PR if it's viable :)

Member

Sounds good. Makes sense to keep separate from this 😄

Member

gfyoung commented Jul 24, 2017

Failures related to pyarrow or feather are unrelated to this PR.

Contributor

jreback commented Jul 24, 2017

actually I am not big on this. This is just a bug fix, we should be ignoring NaN when we have string, unicode, bytes objects. otherwise.

Member Author

cpcloud commented Jul 24, 2017

@jreback

> actually I am not big on this. This is just a bug fix, we should be ignoring NaN when we have string, unicode, bytes objects. otherwise.

I don't really follow what you're saying here. Can you clarify?

Member

gfyoung commented Jul 24, 2017

> actually I am not big on this. This is just a bug fix, we should be ignoring NaN when we have string, unicode, bytes objects. otherwise.

Why is this a bug-fix? The behavior is consistent with what I would expect because there is NaN and other elements in the array-like. I don't see why we should be changing this behavior that drastically.

Contributor

jreback commented Jul 24, 2017

it should mirror what we already do for other types. We automatically skip na's; they don't have an impact on the inference type. In fact that's the entire point of this routine.

In [4]: infer_dtype([pd.Timestamp('20130101'), np.nan])
Out[4]: 'datetime'

In [5]: infer_dtype(['foo', np.nan])
Out[5]: 'mixed'

In [6]: infer_dtype([pd.Timestamp('20130101')])
Out[6]: 'datetime'

In [7]: infer_dtype([pd.Timestamp('20130101'), pd.NaT])
Out[7]: 'datetime'

In [8]: infer_dtype(['foo'])
Out[8]: 'string'

In [9]: infer_dtype(['foo', np.nan])
Out[9]: 'mixed'
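
The inconsistency in the session above can be mimicked with a toy function. This is only a sketch under stated assumptions: `toy_infer` and `is_missing` are made-up names, the function is not pandas' `infer_dtype`, and it only knows about datetimes and strings.

```python
import math
from datetime import datetime


def is_missing(value):
    """Treat None and float NaN as missing (a simplification)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def toy_infer(values, skipna=False):
    """Toy stand-in for the inference under discussion; not pandas code."""
    non_null = [v for v in values if not is_missing(v)]
    # Datetimes mirror the session above: missing values are always ignored.
    if non_null and all(isinstance(v, datetime) for v in non_null):
        return "datetime"
    # Strings only ignore missing values when explicitly asked to.
    candidates = non_null if skipna else list(values)
    if candidates and all(isinstance(v, str) for v in candidates):
        return "string"
    return "mixed"


print(toy_infer([datetime(2013, 1, 1), float("nan")]))  # datetime
print(toy_infer(["foo", float("nan")]))                 # mixed
print(toy_infer(["foo", float("nan")], skipna=True))    # string
```

The third call is the behavior a `skipna` option makes available for strings; the first two reproduce the mismatch between dtypes.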

Member Author

cpcloud commented Jul 24, 2017

Right. That's where we want to be, but this routine has existed for a very long time with this behavior and I don't think it'd be a great idea to break it right this second. I think @gfyoung had the right idea with 1 or more deprecation cycles before changing the default behavior to skip NA values by default.

Contributor

jreback commented Jul 24, 2017

> Right. That's where we want to be, but this routine has existed for a very long time with this behavior and I don't think it'd be a great idea to break it right this second. I think @gfyoung had the right idea with 1 or more deprecation cycles before changing the default behavior to skip NA values by default.

this is broken. I don't really have a problem with adding a skipna, though it adds a lot more logic. The problem is these are now different for different dtypes. I don't think it's easy to reconcile this.

Member Author

cpcloud commented Jul 24, 2017

> I don't think it's easy to reconcile this.

Wait for my refactor, which reconciles them nicely.

Member

gfyoung commented Jul 24, 2017

> Wait for my refactor, which reconciles them nicely.

I figured there would be a way. 😄

Member Author

cpcloud commented Jul 24, 2017

I'll put up another PR with the refactor, but since it's quite a large refactor and may introduce a perf regression let's leave this one open as well in case the refactor isn't viable from a perf perspective.

@cpcloud cpcloud mentioned this pull request Jul 24, 2017
Member

gfyoung commented Jul 24, 2017

@cpcloud : You could have made a separate commit here and just shown asv comparisons between them.

Member Author

cpcloud commented Jul 24, 2017

Indeed. I'll do that. Thanks for the tip!

Member

gfyoung commented Jul 24, 2017

@cpcloud : Rebase onto master so that pyarrow tests don't fail anymore

Member Author

cpcloud commented Jul 24, 2017

cool, will do once the benchmarks are finished!

Member Author

cpcloud commented Jul 24, 2017

asv results (top offenders):

    before     after       ratio
  [c55dbf06] [04003c8d]
!  117.73ms     failed       n/a  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('FY5253Quarter_2', 2)
+  204.01ms   336.96ms      1.65  inference.to_numeric_downcast.time_downcast('string-nint', 'unsigned')
+  197.91ms   314.76ms      1.59  inference.to_numeric_downcast.time_downcast('string-int', 'integer')
+  189.44ms   298.65ms      1.58  inference.to_numeric_downcast.time_downcast('string-int', None)
+  105.50ms   160.92ms      1.53  packers.packers_read_sas.time_read_sas7bdat
+   33.91ms    50.75ms      1.50  groupby.GroupBySuite.time_diff('float', 100)
+  213.73ms   305.88ms      1.43  inference.to_numeric_downcast.time_downcast('string-nint', 'signed')
+  196.81ms   276.27ms      1.40  inference.to_numeric_downcast.time_downcast('string-nint', None)

Member Author

cpcloud commented Jul 24, 2017

It's possible that first one is failing because I haven't yet run the entire test suite to make sure I haven't broken anything outside of the inference code.

Member Author

cpcloud commented Jul 24, 2017

Yep, some failing tests :) Fixing them up right now

Contributor

@jreback jreback left a comment

looks pretty good. any significant perf diffs?


cdef bint is_valid(self, object value) except -1:
    return self.is_value_typed(value)

Contributor

too bad you can't make these in-line.

Member Author

@cpcloud cpcloud Jul 24, 2017

I think I might be able to, didn't try. I would also expect the C compiler to be pretty smart here. inline shouldn't have much effect nowadays, since the compiler will ignore it if it's unsafe to do so.

Contributor

yep


cdef bint is_value_typed(self, object value) except -1:
    return util.is_bool_object(value)

Contributor

you can prob make some things inline here

Member Author

I'll try inlining the leaf classes and see if there's any perf benefit.

"""
Efficiently infer the type of a passed val, or list-like
array of values. Return a string describing the type.

Parameters
----------
value : scalar, list, ndarray, or pandas type
skipna : bool, default False
    Ignore NaN values when inferring the type.
Contributor

versionadded tag

Member Author

done

Member

Minor thing: I would prefer that you still mention that we will change the default to True in the future, even if we don't deprecate now. What do you think?

Member Author

cpcloud commented Jul 25, 2017

Here are the inference benchmarks with latest commit:

    before     after       ratio
  [1d0e6a15] [390aae13]
+  115.00μs   164.03μs      1.43  inference.to_numeric.time_from_str_ignore
+  500.56μs   706.92μs      1.41  inference.DtypeInfer.time_uint32
+  149.24ms   195.31ms      1.31  inference.to_numeric_downcast.time_downcast('string-float', 'float')
+  222.45ms   279.17ms      1.25  inference.to_numeric_downcast.time_downcast('string-int', None)
+  157.01ms   196.73ms      1.25  inference.to_numeric_downcast.time_downcast('string-float', 'integer')
+  196.16ms   244.83ms      1.25  inference.to_numeric_downcast.time_downcast('string-int', 'signed')
+  166.39ms   202.27ms      1.22  inference.to_numeric_downcast.time_downcast('string-float', 'signed')
+   15.05ms    18.08ms      1.20  inference.DtypeInfer.time_datetime64
+  193.79ms   232.27ms      1.20  inference.to_numeric_downcast.time_downcast('string-nint', None)
+  168.74ms   201.07ms      1.19  inference.to_numeric_downcast.time_downcast('string-float', 'unsigned')
+   46.88ms    55.18ms      1.18  inference.to_numeric_downcast.time_downcast('int-list', 'integer')
+    2.69ms     3.16ms      1.17  inference.to_numeric_downcast.time_downcast('datetime64', 'float')
+   27.13ms    31.86ms      1.17  inference.to_numeric_downcast.time_downcast('datetime64', 'signed')
+  169.50ms   194.19ms      1.15  inference.to_numeric_downcast.time_downcast('string-float', None)
+   49.51ms    56.34ms      1.14  inference.to_numeric_downcast.time_downcast('int-list', 'unsigned')
+    1.34ms     1.51ms      1.13  inference.to_numeric_downcast.time_downcast('datetime64', None)
+   36.23ms    40.06ms      1.11  inference.to_numeric_downcast.time_downcast('int-list', 'float')
-   39.59μs    35.49μs      0.90  inference.to_numeric.time_from_float
-  770.33ms   688.99ms      0.89  inference.MaybeConvertNumeric.time_convert
-   19.51ms    17.18ms      0.88  inference.to_numeric.time_from_str_coerce
-    6.96ms     6.05ms      0.87  inference.DtypeInfer.time_int64
-  467.61ms   401.44ms      0.86  inference.to_numeric.time_from_numeric_str


codecov bot commented Jul 25, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@9e6bb42). Click here to learn what that means.
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #17066   +/-   ##
=========================================
  Coverage          ?   91.01%           
=========================================
  Files             ?      161           
  Lines             ?    49363           
  Branches          ?        0           
=========================================
  Hits              ?    44927           
  Misses            ?     4436           
  Partials          ?        0
Flag Coverage Δ
#multiple 88.78% <ø> (?)
#single 40.25% <ø> (?)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e6bb42...a6d122b. Read the comment docs.

Member Author

cpcloud commented Jul 25, 2017

@jreback @gfyoung No big regressions :)

Here's everything that was worse than 1x upstream/master:

    before     after       ratio
  [9e6bb42f] [a6d122b9]
+  794.88ms      1.13s      1.42  packers.Excel.time_write_excel_openpyxl
+    3.80ms     5.03ms      1.33  strings.StringMethods.time_get
+  211.56ms   276.62ms      1.31  inference.to_numeric_downcast.time_downcast('string-nint', 'float')
+   16.39ms    21.34ms      1.30  gil.nogil_read_csv.time_read_csv_object
+   23.70ms    30.14ms      1.27  binary_ops.Ops.time_frame_mult(True, 'default')
+    3.17ms     3.82ms      1.21  strings.StringMethods.time_upper
+    4.30ms     5.14ms      1.20  strings.StringMethods.time_rstrip
+    6.27ms     7.50ms      1.20  frame_methods.Reindex.time_reindex_upcast
+  164.44ms   193.30ms      1.18  inference.to_numeric_downcast.time_downcast('string-float', 'unsigned')
+    5.56ms     6.48ms      1.17  categoricals.Categoricals.time_union
+    4.09ms     4.71ms      1.15  strings.StringMethods.time_lstrip
+    6.23ms     7.17ms      1.15  algorithms.Algorithms.time_add_overflow_pos_arr
+   28.19ms    32.39ms      1.15  stat_ops.stats_rank_pct_average_old.time_stats_rank_pct_average_old
+  336.85ns   382.73ns      1.14  attrs_caching.DataFrameAttributes.time_get_index
+    1.01ms     1.14ms      1.13  join_merge.join_non_unique_equal.time_join_non_unique_equal
+    4.23ms     4.75ms      1.12  strings.StringMethods.time_strip
+   14.70ms    16.52ms      1.12  binary_ops.Ops2.time_frame_int_mod
+     6.83s      7.67s      1.12  gil.nogil_datetime_fields.time_datetime_field_day
+   17.80ms    19.91ms      1.12  parser_vb.read_csv1.time_sep
+    2.70ms     3.01ms      1.12  groupby.GroupBySuite.time_size('float', 10000)
+    3.09ms     3.44ms      1.11  strings.StringMethods.time_contains_many_noregex
+    2.78ms     3.07ms      1.10  period.Algorithms.time_drop_duplicates_pseries
+  253.54ms   279.28ms      1.10  inference.to_numeric_downcast.time_downcast('string-nint', 'unsigned')
+   83.70ms    92.18ms      1.10  eval.Eval.time_mult('numexpr', 'all')
+   21.67ms    23.86ms      1.10  frame_methods.frame_mask_bools.time_frame_mask_bools
+   15.72ms    17.30ms      1.10  categoricals.Categoricals.time_constructor_regular

@cpcloud cpcloud modified the milestones: 0.21.0, Next Major Release Jul 25, 2017
Member

gfyoung commented Jul 25, 2017

@cpcloud : Nice! Great to see that the refactoring worked out.

Contributor

@jreback jreback left a comment

lgtm. just some minor comments.

@@ -24,6 +24,8 @@ New features
<https://www.python.org/dev/peps/pep-0519/>`_ on most readers and writers (:issue:`13823`)
- Added ``__fspath__`` method to :class:`~pandas.HDFStore`, :class:`~pandas.ExcelFile`,
and :class:`~pandas.ExcelWriter` to work properly with the file system path protocol (:issue:`13823`)
- Added ``skipna`` parameter :func:`~pandas.api.types.infer_dtype` to support
Contributor

parameter to :func:....., to support

@@ -272,6 +277,9 @@ def infer_dtype(object value):
>>> infer_dtype(['foo', 'bar'])
'string'

>>> infer_dtype(['a', np.nan, 'b'], skipna=True)
'string'
Contributor

can you add same example with skipna=False

Member Author

done


cdef bint is_valid(self, object value) except -1:
    return self.is_value_typed(value)

Contributor

yep


if n == 0:
    return False
cdef class StringValidator(Validator):
Contributor

not sure if there is value to making String, Unicode, Bytes inherit from an ABC StringLike (for consistency with the way you did datetime (Temporal) )

Contributor

or maybe just String/Unicode (as Bytes is a different animal)

Member Author

I don't think so because there's no code sharing between those three classes here. The child classes of TemporalValidator all share a large amount of validation code and really only differ in the way they check their value types and their nulls. With string-like values they don't share any code that isn't already factored out in Validator.
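
To illustrate the kind of sharing described for the temporal classes, here is a pure-Python sketch. It is an assumption-laden stand-in, not the Cython in this PR: the sentinels `DATETIME_NAT` and `TIMEDELTA_NAT`, the helper `is_generic_null`, and the rule that at least one real value must be present are all invented for the example.

```python
import math
from datetime import datetime, timedelta

# Hypothetical type-specific missing-value markers (stand-ins for NaT variants).
DATETIME_NAT = object()
TIMEDELTA_NAT = object()


def is_generic_null(value):
    """None and float NaN are accepted as nulls by every temporal validator."""
    return value is None or (isinstance(value, float) and math.isnan(value))


class TemporalValidator:
    """Shared validation; subclasses supply the value check and the null check."""

    def __init__(self, values):
        self.values = list(values)

    def is_value_typed(self, value):
        raise NotImplementedError

    def is_valid_null(self, value):
        raise NotImplementedError

    def is_valid(self, value):
        return self.is_value_typed(value) or self.is_valid_null(value)

    def validate(self):
        # Assumption: an array made up only of nulls does not count as temporal.
        if not any(self.is_value_typed(v) for v in self.values):
            return False
        return all(self.is_valid(v) for v in self.values)


class DatetimeValidator(TemporalValidator):
    def is_value_typed(self, value):
        return isinstance(value, datetime)

    def is_valid_null(self, value):
        return value is DATETIME_NAT or is_generic_null(value)


class TimedeltaValidator(TemporalValidator):
    def is_value_typed(self, value):
        return isinstance(value, timedelta)

    def is_valid_null(self, value):
        return value is TIMEDELTA_NAT or is_generic_null(value)
```

The two leaf classes differ only in their two small checks, whereas string-like validators would have nothing comparable to put in a common parent.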

cdef inline bint is_valid(self, object value) except -1:
    return self.is_value_typed(value) or self.is_valid_null(value)

cdef bint is_value_typed(self, object value) except -1:
Contributor

is_value_typed NI is not needed (as defined in base class)

Member Author

Ah, good catch. Fixing.

cdef inline bint is_array_typed(self) except -1:
    return (
        PY2 and issubclass(self.dtype.type, np.string_)
    ) or not PY2 and issubclass(self.dtype.type, np.unicode_)
Member Author

Interestingly enough I think issubclass(self.dtype.type, np.str_) does exactly this. I'll give it a shot and see if anything breaks.

Member Author

Works!

Member Author

cpcloud commented Jul 25, 2017

@jreback this is ready to go

Member Author

cpcloud commented Jul 25, 2017

Actually one sec.

Member

gfyoung commented Jul 25, 2017

> Actually one sec.

I presume you meant to push that change first?

Member Author

cpcloud commented Jul 25, 2017

@gfyoung Yep! Pushed.

Member

gfyoung commented Jul 25, 2017

Given that both @jreback and I have approved, and you're ready to merge this, I imagine three approvals from ">= core-committers" makes this merge-able on green 😄

Member Author

cpcloud commented Jul 25, 2017

@gfyoung agree :) Merging on green.

@gfyoung gfyoung merged commit 13b57cd into pandas-dev:master Jul 25, 2017
Member

gfyoung commented Jul 25, 2017

Thanks @cpcloud !

@cpcloud cpcloud deleted the add-skipna-for-infer-dtype branch July 25, 2017 18:23
Member Author

cpcloud commented Jul 25, 2017

@gfyoung @jreback Thanks for the reviews!

Successfully merging this pull request may close these issues.

pandas.api.types.infer_dtype should be nan aware