Skip to content

Conversation

nlittlepoole
Copy link

@nlittlepoole nlittlepoole commented Jan 25, 2017

Overview

So as a heavy user of arrow for data pipelines, I've found that arrow.get, while faster than the standard strptime still isn't as fast as it could be. Others have discussed this more in depth. While its probably not feasible to be as fast as udatetime, I think its worth exploring.

So this PR is an attempt to increase the speed of date parsing. Compiling the parser module with Cython plus the LRU Cache that @ownaginatious added provides for some nice performance increases. I went from averaging between 100µs and 200µs per date parsed to averaging between 55µs and 75µs per date parsed (when using the cache). For long running data pipelines this is large increase.

What I changed

I added 2 functions and some imports for Cython to setup tools. These are just for automatically finding pyx files and compiling them. Additionally I added Cython to the requirements.txt

I fixed up the line widths in parser.py. I changed l to length for improved readability. Lastly switched from using the python version of datetime to using the cpython version bundled with Cython. They are identical except the Cython version calls directly to the C code written for CPython.

Tests

All the tests pass on 2.X.X without any changes to them. However Chai doesn't play very nicely with Cython3. The LRU cache of DateTimeParser._generate_pattern_re changes the type from method to some subclass of functools, so Chai doesn't think it knows how to handle it. I might open a PR there to fix that issue but in the meantime I changed the tests to mock to make them pass.

Todos & Further Exploration

  • A cleaner Makefile
  • investigate a faster LRU or maybe even LFU implementation
  • investigate static typing variables for speed increases.
  • benchmark switching cpython.datetime with udatetime
  • figure out how to get arrow.factory.get to use cache

@codecov-io
Copy link

codecov-io commented Jan 25, 2017

Current coverage is 100% (diff: 100%)

Merging #414 into master will not change coverage

@@           master   #414   diff @@
====================================
  Files          14     13     -1   
  Lines        3081   2902   -179   
  Methods         0      0          
  Messages        0      0          
  Branches      226    182    -44   
====================================
- Hits         3081   2902   -179   
  Misses          0      0          
  Partials        0      0          

Powered by Codecov. Last update fd723b4...be48542

@systemcatch
Copy link
Collaborator

It's a real shame that this was ignored for so long. If you want to try fixing all the conflicts or submit a new PR that'd be great but I understand if it's too late for that.

The only part I'm not keen on is the line width fixes.

@jadchaar
Copy link
Member

jadchaar commented Jul 8, 2019

@nlittlepoole this seems like a great addition to Arrow. Like Chris said above, if you can resolve the conflicts and get things up-to-date, this would be a great change to merge in.

@jadchaar
Copy link
Member

Hey, since Cython and arrow have changed quite a bit over the years since this PR was submitted, I think it is best if we close it. Please feel free to update the Cython implementation and re-submit the PR :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants