Commit 7d852a9

PDEP-4: consistent parsing of datetimes (#48621)
* pdep-4: initial draft
* note about making guess_datetime_format public, out of scope work, dayfirst/yearfirst

Co-authored-by: MarcoGorelli <>
1 parent cda0f6b commit 7d852a9

1 file changed

+97
-0
lines changed

@@ -0,0 +1,97 @@
# PDEP-4: Consistent datetime parsing

- Created: 18 September 2022
- Status: Accepted
- Discussion: [#48621](https://github.com/pandas-dev/pandas/pull/48621)
- Author: [Marco Gorelli](https://github.com/MarcoGorelli)
- Revision: 1

## Abstract

The suggestion is that:
- ``to_datetime`` becomes strict and uses the same datetime format to parse all elements in its input.
  The format will either be taken from `format` (if the user provides it) or inferred from the
  first non-NaN element;
- ``infer_datetime_format`` be deprecated (as a strict version of it will become the default);
- an easy workaround for non-strict parsing be clearly documented.

## Motivation and Scope

Pandas date parsing is very flexible, but arguably too much so - see
https://github.com/pandas-dev/pandas/issues/12585 and linked issues for how
much confusion this causes. Pandas can switch format partway through its input, and though this
is documented, it regularly breaks users' expectations.

Simple example:
```ipython
In [1]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'])
Out[1]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)
```
The user almost certainly intended the data to be read as "12th of January, 13th of January".
However, it's read as "1st of December, 13th of January". No warning or error is raised.
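To make the mechanism concrete, here is a minimal sketch in plain Python (standard library only, not pandas's actual code path) of per-element parsing that falls back to another format whenever the first one fails. The candidate format list and the helper name `parse_flexibly` are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical candidate formats, tried in this order for every element.
CANDIDATE_FORMATS = ["%m-%d-%Y %H:%M:%S", "%d-%m-%Y %H:%M:%S"]


def parse_flexibly(value):
    """Parse one string with the first candidate format that fits it."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format does not fit; try the next one
    raise ValueError(f"no candidate format matches {value!r}")


data = ["12-01-2000 00:00:00", "13-01-2000 00:00:00"]
parsed = [parse_flexibly(s) for s in data]
print(parsed[0].date())  # 2000-12-01: month-first fits, so it wins
print(parsed[1].date())  # 2000-01-13: month 13 is invalid, so day-first is used
```

Because each element is parsed independently, nothing ties the second element to the format chosen for the first, which is how the silent switch arises.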

Currently, the only way to ensure consistent parsing is by explicitly passing
``format=``. The argument ``infer_datetime_format``
isn't strict, can be passed together with ``format``, and can still break users' expectations:

```ipython
In [2]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'], infer_datetime_format=True)
Out[2]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)
```

## Detailed Description

Concretely, the suggestion is:
- if no ``format`` is specified, ``pandas`` will guess the format from the first non-NaN row
  and parse the rest of the input according to that format. Errors will be handled
  according to the ``errors`` argument - there will be no silent switching of format;
- ``infer_datetime_format`` will be deprecated;
- ``dayfirst`` and ``yearfirst`` will continue working as they currently do;
- if the format cannot be guessed from the first non-NaN row, a ``UserWarning`` will be raised,
  encouraging users to explicitly pass in a format.
  Note that this should only happen for invalid inputs such as ``'a'``
  (which would later raise a ``ParserError`` anyway), or inputs such as ``'00:12:13'``,
  which would currently get converted to ``'2022-09-18 00:12:13'``.
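The rules above can be sketched in plain Python as follows. This is an illustrative approximation using only the standard library, not the pandas implementation: the candidate format list and the names `guess_format` and `parse_strictly` are hypothetical, `None` stands in for NaN/NaT, and the sketch raises where the proposal would emit a ``UserWarning``.

```python
from datetime import datetime

# Hypothetical candidate formats; a real guesser covers far more layouts.
CANDIDATE_FORMATS = ["%m-%d-%Y %H:%M:%S", "%d-%m-%Y %H:%M:%S"]


def guess_format(value):
    """Return the first candidate format that parses `value`, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None


def parse_strictly(values, fmt=None, errors="raise"):
    """Parse all of `values` with a single format; None stands in for NaN/NaT."""
    if fmt is None:
        first = next((v for v in values if v is not None), None)
        if first is None:
            return [None] * len(values)  # nothing to guess from
        fmt = guess_format(first)
        if fmt is None:
            # Simplification: the proposal emits a UserWarning here instead.
            raise ValueError("could not guess a format; pass one explicitly")
    parsed = []
    for value in values:
        if value is None:
            parsed.append(None)
            continue
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            if errors != "coerce":
                raise  # errors="raise": surface the mismatch
            parsed.append(None)  # errors="coerce": mismatch becomes missing
    return parsed


data = ["12-01-2000 00:00:00", "13-01-2000 00:00:00"]
coerced = parse_strictly(data, errors="coerce")
explicit = parse_strictly(data, fmt="%d-%m-%Y %H:%M:%S")
```

With the guessed (month-first) format, the second element no longer parses, so it becomes missing under `errors="coerce"` and raises under `errors="raise"`. Passing the day-first format explicitly parses both elements as January dates.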

If a user has dates in a mixed format, they can still use flexible parsing and accept
the risks that poses, e.g.:
```ipython
In [3]: pd.Series(['12-01-2000 00:00:00', '13-01-2000 00:00:00']).apply(pd.to_datetime)
Out[3]:
0   2000-12-01
1   2000-01-13
dtype: datetime64[ns]
```

## Usage and Impact

My expectation is that the impact would be a net positive:
- potentially severe bugs in people's code will be caught early;
- users who actually want mixed formats can still parse them, but now they'd be forced to be
  very explicit about it;
- the codebase would be noticeably simplified.

As far as I can tell, there is no chance of _introducing_ bugs.

## Implementation

The whatsnew notes read

> In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation.

I'd suggest making this change as part of the above, because:
- it would only help prevent bugs, not introduce any;
- given the severity of bugs that can result from the current behaviour, waiting another 2 years until pandas 3.0.0
  would potentially cause a lot of damage.

Note that this wouldn't mean getting rid of ``dateutil.parser``, as that would still be used within ``guess_datetime_format``. With this proposal, however, subsequent rows would be parsed with the guessed format rather than by repeatedly calling ``dateutil.parser`` and risking a silent switch of format.

Finally, the function ``guess_datetime_format`` (currently imported via ``from pandas._libs.tslibs.parsing import guess_datetime_format``) would be made public, under ``pandas.tools``.

## Out of scope

We could make ``guess_datetime_format`` smarter by using a random sample of elements to infer the format.
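As a rough illustration of what sample-based inference might look like (it is not part of this proposal, and not an existing pandas API), the sketch below picks the first candidate format that parses every element of a random sample. The candidate list, the function names, and the `sample_size` and `seed` parameters are all hypothetical.

```python
import random
from datetime import datetime

# Hypothetical candidate formats, checked in this order.
CANDIDATE_FORMATS = ["%m-%d-%Y %H:%M:%S", "%d-%m-%Y %H:%M:%S"]


def fits(value, fmt):
    """Return True if `value` parses with `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def guess_format_from_sample(values, sample_size=10, seed=0):
    """Return the first candidate format that fits every sampled element, or None."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None  # nothing to infer from
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    sample = rng.sample(non_null, min(sample_size, len(non_null)))
    for fmt in CANDIDATE_FORMATS:
        if all(fits(v, fmt) for v in sample):
            return fmt
    return None


data = ["12-01-2000 00:00:00", "13-01-2000 00:00:00"]
guessed = guess_format_from_sample(data)
print(guessed)  # %d-%m-%Y %H:%M:%S
```

Here both elements end up in the sample, so the month-first format is rejected by the second element and the day-first format is chosen. A sample that happens to miss the disambiguating rows would still guess wrongly, which is one reason to treat this as a possible refinement and not a guarantee.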

### PDEP History

- 18 September 2022: Initial draft
