| 1 | +# PDEP-4: Consistent datetime parsing |
| 2 | + |
| 3 | +- Created: 18 September 2022 |
| 4 | +- Status: Accepted |
| 5 | +- Discussion: [#48621](https://github.com/pandas-dev/pandas/pull/48621) |
| 6 | +- Author: [Marco Gorelli](https://github.com/MarcoGorelli) |
| 7 | +- Revision: 1 |
| 8 | + |
| 9 | +## Abstract |
| 10 | + |
| 11 | +The suggestion is that: |
| 12 | +- ``to_datetime`` becomes strict and uses the same datetime format to parse all elements in its input. |
| 13 | + The format will either be inferred from the first non-NaN element (if `format` is not provided by the user), or from |
| 14 | + `format`; |
| 15 | +- ``infer_datetime_format`` be deprecated (as a strict version of it will become the default); |
| 16 | +- an easy workaround for non-strict parsing be clearly documented. |
| 17 | + |
| 18 | +## Motivation and Scope |
| 19 | + |
| 20 | +Pandas date parsing is very flexible, but arguably too much so - see |
| 21 | +https://github.com/pandas-dev/pandas/issues/12585 and linked issues for how |
| 22 | +much confusion this causes. Pandas can swap format midway, and though this |
| 23 | +is documented, it regularly breaks users' expectations. |
| 24 | + |
| 25 | +Simple example: |
| 26 | +```ipython |
| 27 | +In [1]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00']) |
| 28 | +Out[1]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None) |
| 29 | +``` |
| 30 | +The user was almost certainly intending the data to be read as "12th of January, 13th of January". |
| 31 | +However, it's read as "1st of December, 13th of January". No warning or error is raised. |
| 32 | + |
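The ambiguity in the example above can be reproduced without pandas. The following is a minimal sketch using only the standard library; the two format strings and the helper `formats_that_match` are illustrative assumptions, not part of pandas or of this proposal.

```python
from datetime import datetime

# Two candidate layouts for the same kind of string (illustrative only).
MONTH_FIRST = "%m-%d-%Y %H:%M:%S"
DAY_FIRST = "%d-%m-%Y %H:%M:%S"


def formats_that_match(value, candidates=(MONTH_FIRST, DAY_FIRST)):
    """Return every candidate format that parses `value` without error."""
    matches = []
    for fmt in candidates:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        matches.append(fmt)
    return matches


# The first string is valid under both layouts, the second only under day-first.
ambiguous = formats_that_match("12-01-2000 00:00:00")
unambiguous = formats_that_match("13-01-2000 00:00:00")
```

A parser that picks a layout per element will therefore read the first string month-first and the second day-first, which is the silent switch described above.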
| 33 | +Currently, the only way to ensure consistent parsing is by explicitly passing |
| 34 | +``format=``. The argument ``infer_datetime_format`` |
| 35 | +isn't strict, can be passed together with ``format``, and can still break users' expectations: |
| 36 | + |
| 37 | +```ipython |
| 38 | +In [2]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'], infer_datetime_format=True) |
| 39 | +Out[2]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None) |
| 40 | +``` |
| 41 | + |
| 42 | +## Detailed Description |
| 43 | + |
| 44 | +Concretely, the suggestion is: |
| 45 | +- if no ``format`` is specified, ``pandas`` will guess the format from the first non-NaN row |
| 46 | + and parse the rest of the input according to that format. Errors will be handled |
| 47 | + according to the ``errors`` argument - there will be no silent switching of format; |
| 48 | +- ``infer_datetime_format`` will be deprecated; |
| 49 | +- ``dayfirst`` and ``yearfirst`` will continue working as they currently do; |
| 50 | +- if the format cannot be guessed from the first non-NaN row, a ``UserWarning`` will be emitted, |
| 51 | + encouraging users to explicitly pass in a format. |
| 52 | + Note that this should only happen for invalid inputs such as `'a'` |
| 53 | + (which would later raise a ``ParserError`` anyway), or inputs such as ``'00:12:13'``, |
| 54 | + which would currently get converted to ``'2022-09-18 00:12:13'``. |
| 55 | + |
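The behaviour proposed above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the pandas implementation: `CANDIDATE_FORMATS`, `guess_format` and `parse_strict` are hypothetical names, the fixed candidate list stands in for the real format guesser, `None` stands in for NaN, and a failed guess raises here instead of emitting a ``UserWarning``.

```python
from datetime import datetime

# Hypothetical candidate list standing in for pandas' real format guesser.
CANDIDATE_FORMATS = (
    "%m-%d-%Y %H:%M:%S",
    "%d-%m-%Y %H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
)


def guess_format(value):
    """Return the first candidate format that parses `value`, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_strict(values, errors="raise"):
    """Guess the format from the first non-missing element, then apply it to all."""
    first = next((v for v in values if v is not None), None)
    fmt = guess_format(first) if first is not None else None
    if fmt is None:
        # Simplification: the proposal emits a UserWarning here instead.
        raise ValueError("could not guess a format; pass one explicitly")
    parsed = []
    for v in values:
        if v is None:
            parsed.append(None)
            continue
        try:
            parsed.append(datetime.strptime(v, fmt))
        except ValueError:
            if errors == "raise":
                raise
            parsed.append(None)  # errors="coerce": invalid entries become missing
    return parsed


data = [None, "12-01-2000 00:00:00", "13-01-2000 00:00:00"]
coerced = parse_strict(data, errors="coerce")
```

With `errors="coerce"` the second date becomes missing rather than being reinterpreted day-first; with the default `errors="raise"` the same input raises a `ValueError`. Either way, the format never switches partway through the input.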
| 56 | +If a user has dates in a mixed format, they can still use flexible parsing and accept |
| 57 | +the risks that poses, e.g.: |
| 58 | +```ipython |
| 59 | +In [3]: pd.Series(['12-01-2000 00:00:00', '13-01-2000 00:00:00']).apply(pd.to_datetime) |
| 60 | +Out[3]: |
| 61 | +0 2000-12-01 |
| 62 | +1 2000-01-13 |
| 63 | +dtype: datetime64[ns] |
| 64 | +``` |
| 65 | + |
| 66 | +## Usage and Impact |
| 67 | + |
| 68 | +My expectation is that the impact would be a net-positive: |
| 69 | +- potentially severe bugs in people's code will be caught early; |
| 70 | +- users who actually want mixed formats can still parse them, but now they'd be forced to be |
| 71 | + very explicit about it; |
| 72 | +- the codebase would be noticeably simplified. |
| 73 | + |
| 74 | +As far as I can tell, there is no chance of _introducing_ bugs. |
| 75 | + |
| 76 | +## Implementation |
| 77 | + |
| 78 | +The whatsnew notes read |
| 79 | + |
| 80 | +> In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation. |
| 81 | + |
| 82 | +I'd suggest making this change as part of the above, because: |
| 83 | +- it would only help prevent bugs, not introduce any; |
| 84 | +- given the severity of bugs that can result from the current behaviour, waiting another 2 years until pandas 3.0.0 |
| 85 | + would potentially cause a lot of damage. |
| 86 | + |
| 87 | +Note that this wouldn't mean getting rid of ``dateutil.parser``, as that would still be used within ``guess_datetime_format``. With this proposal, however, subsequent rows would be parsed with the guessed format, rather than repeatedly calling ``dateutil.parser`` and risking a silent switch of format. |
| 88 | + |
| 89 | +Finally, the function ``guess_datetime_format`` (currently imported via ``from pandas._libs.tslibs.parsing import guess_datetime_format``) would be made public, under ``pandas.tools``. |
| 90 | + |
| 91 | +## Out of scope |
| 92 | + |
| 93 | +We could make ``guess_datetime_format`` smarter by using a random sample of elements to infer the format. |
| 94 | + |
| 95 | +### PDEP History |
| 96 | + |
| 97 | +- 18 September 2022: Initial draft |