Skip to content

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Nov 1, 2016

Intrigued by the profiling results of the below example (Multi-index loc indexing, based on the example in #14549), where searchsorted seemed to take the majority of the computation time.
And it seems that searchsorted casts both inputs (in this case labels and loc) to a common dtype, and the labels of the MultiIndex were in this case int16, while loc (output from Index.get_loc) is a python int.

By casting loc to the dtype of labels, the specific example gets a ca 20 x speed improvement

df = pd.DataFrame({'a': np.random.randn(500*5000)}, index=pd.MultiIndex.from_product([date_range("2014-01-01", periods=500), range(5000)]))
dt = pd.Timestamp('2015-01-01')
%timeit df.loc[dt]

On master:

In [3]: %timeit df.loc[dt]
The slowest run took 5.70 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 9.39 ms per loop

with this PR:

In [3]: %timeit df.loc[dt]
The slowest run took 122.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 422 µs per loop

Just putting it here as a showcase.
Actual PR probably needs some more work (other places where this can be done, can loc ever be out of bound for that dtype?, benchmarks, ..)

@@ -1907,6 +1907,7 @@ def convert_indexer(start, stop, step, indexer=indexer, labels=labels):
return np.array(labels == loc, dtype=bool)
else:
# sorted, so can return slice object -> view
loc = labels.dtype.type(loc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe should do this like

loc, orig_loc = lables.dtype.type(loc), loc
if  loc != orig_loc:
    loc = orig_loc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am not sure how much checking would be needed here.

My understanding is that it probably is not needed in this case, as loc is coming from loc = level_index.get_loc(key), and labels and level_index are from the same MultiIndex. So I would assume that get_loc can only return an existing label, and so should fit in the dtype of labels?

(but probably also not that a perf issue to do the check)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds right; if the test suite passes, prob ok!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently not :-)
But it's only a partial indexing test that fails. So loc could also be a slice, and it is expected in partial indexing that this raises an error in searchsorted, but now it raises already a slightly different error in dtype.type(loc). So that can easily be solved with:

                try:
                    loc = labels.dtype.type(loc)
                except TypeError:
                    # this occurs when loc is a slice (partial string indexing)
                    # but the TypeError raised by searchsorted in this case
                    # is catched in Index._has_valid_type()
                    pass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, or just tests if its a scalar to begin with

@codecov-io
Copy link

codecov-io commented Nov 1, 2016

Current coverage is 85.26% (diff: 100%)

Merging #14551 into master will increase coverage by <.01%

@@             master     #14551   diff @@
==========================================
  Files           140        140          
  Lines         50672      50676     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43207      43211     +4   
  Misses         7465       7465          
  Partials          0          0          

Powered by Codecov. Last update 60a335e...6447e4c

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Nov 1, 2016
@jreback jreback added this to the 0.19.1 milestone Nov 1, 2016
@jreback
Copy link
Contributor

jreback commented Nov 1, 2016

lgtm. just release note. I think this can only help :>

@jorisvandenbossche
Copy link
Member Author

OK, added whatsnew notice. Will merge, but will open a new issue for follow-up on this (other cases, benchmarks that capture this (at the moment there are none)).

@jorisvandenbossche jorisvandenbossche merged commit 1d95179 into pandas-dev:master Nov 2, 2016
jorisvandenbossche added a commit that referenced this pull request Nov 3, 2016
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Nov 18, 2016
Version 0.19.1

* tag 'v0.19.1': (43 commits)
  RLS: v0.19.1
  DOC: update whatsnew/release notes for 0.19.1 (pandas-dev#14573)
  [Backport pandas-dev#14545] BUG/API: Index.append with mixed object/Categorical indices (pandas-dev#14545)
  DOC: rst fixes
  [Backport pandas-dev#14567] DEPR: add deprecation warning for com.array_equivalent (pandas-dev#14567)
  [Backport pandas-dev#14551] PERF: casting loc to labels dtype before searchsorted (pandas-dev#14551)
  [Backport pandas-dev#14536] BUG: DataFrame.quantile with NaNs (GH14357) (pandas-dev#14536)
  [Backport pandas-dev#14520] BUG: don't close user-provided file handles in C parser (GH14418) (pandas-dev#14520)
  [Backport pandas-dev#14392] BUG: Dataframe constructor when given dict with None value (pandas-dev#14392)
  [Backport pandas-dev#14514] BUG: Don't parse inline quotes in skipped lines (pandas-dev#14514)
  [Bacport pandas-dev#14543] BUG: tseries ceil doc fix (pandas-dev#14543)
  [Backport pandas-dev#14541] DOC: Simplify the gbq integration testing procedure for contributors (pandas-dev#14541)
  [Backport pandas-dev#14527] BUG/ERR: raise correct error when sql driver is not installed (pandas-dev#14527)
  [Backport pandas-dev#14501] BUG: fix DatetimeIndex._maybe_cast_slice_bound for empty index (GH14354) (pandas-dev#14501)
  [Backport pandas-dev#14442] DOC: Expand on reference docs for read_json() (pandas-dev#14442)
  BLD: fix 3.4 build for cython to 0.24.1
  [Backport pandas-dev#14492] BUG: Accept unicode quotechars again in pd.read_csv
  [Backport pandas-dev#14496] BLD: Support Cython 0.25
  [Backport pandas-dev#14498] COMPAT/TST: fix test for range testing of negative integers to neg powers
  [Backport pandas-dev#14476] PERF: performance regression in Series.asof (pandas-dev#14476)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants