read_csv : using day first 23x to 35x slower than setting the format explicitly #25848
Comments
I think this is also a duplicate issue. dateutil does the dayfirst / yearfirst parsing, and it's Python-based, so it's slow. You are welcome to submit a patch.
cc @anmyachev I believe this will be much faster now (from your recent patch).
OK, I only need time.
For comparing
No benchmarks with |
Could you add those you ran above?
Yes, I'll do it.
@garfieldthecat and others interested - this was improved in our PR #25922 (the figures @anmyachev is showing above are after it was accepted to master).
Code Sample: create a small CSV with dates, then import it and time the import
Problem description
I am trying to import a CSV with dates where the days are always < 13, i.e. there is no way to infer the proper format, so it must be specified explicitly (how can you tell whether 1-2 is Jan 2nd or Feb 1st?).
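To make the ambiguity concrete, here is a minimal sketch using only the Python standard library. The sample string is made up for illustration and is not taken from the CSV in this issue.

```python
from datetime import datetime

# Hypothetical sample value: both fields are below 13, so either order is valid.
sample = "01-02-2019"

# Read as day-month-year: 1 February 2019.
as_day_first = datetime.strptime(sample, "%d-%m-%Y")

# Read as month-day-year: 2 January 2019.
as_month_first = datetime.strptime(sample, "%m-%d-%Y")

print(as_day_first.date())    # 2019-02-01
print(as_month_first.date())  # 2019-01-02
```

Both calls succeed, which is why nothing in the data itself can tell a parser which reading is intended.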
What I noticed is that setting the date format with dayfirst is 25 to 35x slower (I tried it on two PCs) than setting the format explicitly with format='%d-%m-%Y'.
How can this be? It's insane. I appreciate there may be a little bit of additional overhead, as dayfirst must guess the position of the year, but 25x to 35x slower leaves me speechless. With dayfirst it took 16 seconds to import a CSV with 2 date columns and only 100k rows.
There must be something very wrong with how read_csv implements dayfirst - I'm hoping it shouldn't be too complicated to fix it?
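For anyone wanting to check this on their own machine, the comparison can be sketched roughly as below. This is an illustration under assumptions, not the code sample from this issue: the column names `a` and `b`, the row count, the generated dates, and the helper names `load_dayfirst`, `load_explicit` and `timed` are all made up, and the explicit-format path applies `pd.to_datetime` after reading rather than inside `read_csv`. Timings will depend heavily on the pandas version.

```python
import io
import time

import pandas as pd

# Hypothetical data: two identical date columns, every day and month in 1..12,
# so each string is ambiguous between day-first and month-first.
n_rows = 10_000
rows = [f"{(i % 12) + 1:02d}-{(i // 12 % 12) + 1:02d}-2018" for i in range(n_rows)]
csv_text = "a,b\n" + "\n".join(f"{r},{r}" for r in rows)


def load_dayfirst(text):
    # Let read_csv parse the dates, only hinting that the day comes first.
    return pd.read_csv(io.StringIO(text), parse_dates=["a", "b"], dayfirst=True)


def load_explicit(text):
    # Read the columns as plain strings, then convert with an explicit format.
    df = pd.read_csv(io.StringIO(text))
    for col in ("a", "b"):
        df[col] = pd.to_datetime(df[col], format="%d-%m-%Y")
    return df


def timed(fn, text):
    # Return the function's result together with its wall-clock duration.
    start = time.perf_counter()
    result = fn(text)
    return result, time.perf_counter() - start


df_dayfirst, t_dayfirst = timed(load_dayfirst, csv_text)
df_explicit, t_explicit = timed(load_explicit, csv_text)

print(f"dayfirst=True:   {t_dayfirst:.3f} s")
print(f"explicit format: {t_explicit:.3f} s")
```

Both paths should produce the same dates; only the elapsed time is expected to differ, and by how much is something to measure rather than assume.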
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None