bpo-15873: Implement [date][time].fromisoformat #4699

Merged on Dec 21, 2017 (21 commits)

Member

@pganssle pganssle commented Dec 4, 2017

Per discussion on the python-dev mailing list, this is a C implementation of fromisoformat as alternate constructors for datetime, date and time, resolving this issue.

If it is deemed desirable, it can be extended later to cover all ISO-8601 datetime strings, but I believe the consensus so far is to have a minimum implementation that only covers the outputs of datetime.isoformat().

One thing I'd like to call attention to is that my profiling seems to indicate that in the common case (i.e. not calling this from a subclass), using the C API directly is much faster than going through PyObject_CallFunction.

Here is a profiling script:

from datetime import datetime
from datetime import timezone, timedelta
from timeit import default_timer as timer

N = 100000
tzi = timezone(timedelta(hours=4))
tzi = None  # comment this line out to benchmark the timezone-aware case
comps = (2014, 3, 25, 4, 17, 30, 204300, tzi)
dt = datetime(*comps)

s = timer()
for i in range(N):
    dt_c = datetime(2014, 3, 25, 4, 17, 30, 241300, tzinfo=tzi)
e = timer()

tt = 1000000000 * (e - s) / N
dtstr = dt.isoformat()

s = timer()
for i in range(N):
    dt_fi = datetime.fromisoformat(dtstr)
e = timer()
tt2 = 1000000000 * (e - s) / N

print('datetime constructor: {:0.1f}ns'.format(tt))
print('fromisoformat:       {:0.1f}ns'.format(tt2))

Result on my laptop:

(tzi is None):
datetime constructor: 2294.1ns
fromisoformat:       944.5ns

(tzi = timezone(timedelta(hours=4)))
datetime constructor: 2229.1ns
fromisoformat:       1400.5ns

If there's no particularly pressing reason why all the other alternate constructors universally go through the main constructor call, I could write a small function or macro that would take the fast path if available and use it for all the alternate constructors.

CC @abalkin @mariocj89

https://bugs.python.org/issue15873

@pganssle pganssle changed the title Implement [date][time].fromisoformat (GH-15873) [bpo-15873] Implement [date][time].fromisoformat Dec 4, 2017
@pganssle
Member Author

pganssle commented Dec 4, 2017

One other thing I'll note - the pure python implementation here is at least partially optimized for speed. There is a fairly trivial implementation that looks, more or less, like this:

from datetime import datetime 

_base_strptimes = {
    10: '%Y-%m-%d',
    13: '%Y-%m-%dT%H',
    16: '%Y-%m-%dT%H:%M',
    19: '%Y-%m-%dT%H:%M:%S',
    23: '%Y-%m-%dT%H:%M:%S.%f',
    26: '%Y-%m-%dT%H:%M:%S.%f'
}

def from_strptime(dtstr):
    if not isinstance(dtstr, str):
        raise TypeError('isoformat takes str')

    if len(dtstr) >= 19 and dtstr[-6] in '+-':
        fmt = _base_strptimes[len(dtstr) - 6] + '%z'
    else:
        fmt = _base_strptimes[len(dtstr)]

    return datetime.strptime(dtstr, fmt)

This will work, but strptime is actually fairly slow, and in my benchmarks the pure python implementation here is twice as fast as the strptime-based method.
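
For comparison, here is a minimal sketch of the slicing style of parser that avoids strptime. It is not the implementation from this PR: `parse_by_slicing` is a hypothetical helper that handles only two fixed layouts (no fractional seconds, no UTC offset), just to show the general shape.

```python
from datetime import datetime


def parse_by_slicing(dtstr):
    """Hypothetical helper: parse 'YYYY-MM-DD' or 'YYYY-MM-DD?HH:MM:SS'."""
    if len(dtstr) not in (10, 19):
        raise ValueError('Unsupported length: %r' % dtstr)
    if dtstr[4] != '-' or dtstr[7] != '-':
        raise ValueError('Invalid date separator: %r' % dtstr)

    # Note: int() tolerates whitespace and signs, so a real parser would
    # validate each digit explicitly instead of relying on int().
    year, month, day = int(dtstr[0:4]), int(dtstr[5:7]), int(dtstr[8:10])
    if len(dtstr) == 10:
        return datetime(year, month, day)

    # Position 10 holds the date/time separator, which is not checked here.
    if dtstr[13] != ':' or dtstr[16] != ':':
        raise ValueError('Invalid time separator: %r' % dtstr)
    hour, minute, second = int(dtstr[11:13]), int(dtstr[14:16]), int(dtstr[17:19])
    return datetime(year, month, day, hour, minute, second)
```

The fixed offsets are what make this cheap: there is no format string to interpret at call time, only slices and integer conversions.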

@pganssle pganssle changed the title [bpo-15873] Implement [date][time].fromisoformat bpo-15873: Implement [date][time].fromisoformat Dec 4, 2017
/* ---------------------------------------------------------------------------
* String parsing utilities and helper functions
*/
static inline unsigned int to_int(char ptr)
Member

I don't think "static inline" is interpreted differently from "static" by any relevant compilers. AFAIK, we don't use "static inline" elsewhere in CPython code. Please just use "static".

Member Author

Not a problem. I translated these from macros in my original code so it seemed like inline was appropriate at the time, but I don't have strong opinions in this regard.

/* ---------------------------------------------------------------------------
* String parsing utilities and helper functions
*/
static inline unsigned int to_int(char ptr)
Member

The name "ptr" is usually used for pointers. Please change "ptr" to "ch" here. On the other hand, I don't think to_int(ch) is clearer than ch - '0'. Call it digit_to_int or just use the expanded expression.

Member Author

Yeah, this is my mistake. As I mentioned in another comment, I translated this from a macro where it was actually manipulating the pointer directly. When I translated it into a function the variable name didn't get fixed.

{
for (size_t i = 0; i < num_digits; ++i) {
int tmp = to_int(*(ptr++));
if (!is_digit(tmp)) { return NULL; }
Member

I would just use ANSI C isdigit on the character.

Member Author

I can actually just replace this with tmp <= 9, I think isdigit might be overkill. This is_digit relies on the specific way that to_int works - and I only used this particular combination because in early Cython-based tests my profiling indicated that atoi was a slow step.
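
For readers less familiar with the trick, here is a rough Python emulation of why a single `tmp <= 9` test is enough once the subtraction is done in unsigned arithmetic. The helper names and the 32-bit mask are assumptions made for illustration; the C code under review works on `char` pointers instead.

```python
def to_uint(ch):
    """Emulate C's (unsigned int)(ch - '0') with 32-bit wraparound."""
    return (ord(ch) - ord('0')) & 0xFFFFFFFF


def parse_digits(text, pos, num_digits):
    """Read exactly num_digits decimal digits starting at pos.

    Returns (value, new_pos), or None if a non-digit is found or the
    string is too short.
    """
    if pos + num_digits > len(text):
        return None
    value = 0
    for ch in text[pos:pos + num_digits]:
        tmp = to_uint(ch)
        # Characters below '0' wrap around to huge values, so one
        # comparison rejects everything outside '0'..'9'.
        if tmp > 9:
            return None
        value = value * 10 + tmp
    return value, pos + num_digits
```

For example, `'/'` sits just below `'0'`, so `to_uint('/')` wraps to `0xFFFFFFFF` and fails the `tmp > 9` check just like a letter would.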

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Return a :class:`datetime` corresponding to a *date_string* in one of the
ISO 8601 formats emitted by :meth:`datetime.isoformat`. Specifically, this function
supports strings in the format(s) ``YYYY-MM-DD[*[HH[:MM:[SS[.mmm[mmm]]]]][+HH:MM]]``,
where ``*`` can match any single character.
Member

Should the colon after :MM be optional?

@@ -1486,6 +1515,23 @@ In boolean contexts, a :class:`.time` object is always considered to be true.
error-prone and has been removed in Python 3.5. See :issue:`13936` for full
details.


Other constructors:
Member

Singular: constructor


Return a :class:`time` corresponding to a *time_string* in one of the ISO 8601
formats emitted by :meth:`time.isoformat`. Specifically, this function supports
strings in the format(s) ``HH[:MM:[SS[.mmm[mmm[+HH:MM]]]]]]``.
Member

I think you need to parse the time zone without :MM:SS.mmmmmm, e.g. 09:12:32+11:00. See your datetime.datetime format.


dt_rt = DateSubclass.fromisoformat(dt.isoformat())

self.assertIsInstance(dt_rt, DateSubclass)
Member

If it is worth testing this, maybe it is worth documenting it?

Member Author

That can be done. I considered doing it but decided against it because while this behavior is both true and tested for all the constructors in datetime, it's not documented for any of them as far as I've seen. I can document it in just this one, or I can document it in all of them, I'm not picky.

for tzi in tzinfos]

for dt in dts:
for sep in separators:
Member

6 × 5 × 8 × 5 is 1200 tests. Perhaps just one test for each case (6 + 5 + 8 + 5) instead?

Member Author

This is a good point. I translated these tests from hypothesis where the space is sampled sparsely. I think I can refactor out at least the separator dimension and probably reduce the dimensionality of some of the remaining tests. I'll look at it tomorrow.

Member Author

I refactored out the separator tests and reduced some of the space that is sampled. I think most of the variation in the date and time portions separately is covered by the respective date and time parser functions. I think we're down to ~160 tests.
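
The arithmetic behind the reviewer's suggestion can be sketched as follows. The dimension names are placeholders (the review only quotes the sizes 6, 5, 8 and 5), so treat this as an illustration of full-grid versus one-per-case counting rather than the actual test parameters.

```python
from itertools import product

# Placeholder dimensions with the sizes quoted in the review comment.
dimensions = {
    'dim_a': range(6),
    'dim_b': range(5),
    'dim_c': range(8),
    'dim_d': range(5),
}

# Nested loops over every dimension test the full Cartesian product.
full_grid = sum(1 for _ in product(*dimensions.values()))

# Varying one dimension at a time needs only the sum of the sizes.
one_per_case = sum(len(values) for values in dimensions.values())

print(full_grid, one_per_case)
```

Here `full_grid` is 1200 and `one_per_case` is 24, which is the gap the reviewer was pointing at.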

Lib/datetime.py Outdated
if next_char != ':':
raise ValueError('Invalid time separator')
if pos >= len_str:
break
Member Author

In retrospect, I think maybe it's logically impossible to hit this branch, since we only get here if tstr[pos:pos+1] == ':'. I'll remove this in the next iteration.

Lib/datetime.py Outdated

time_comps[-1] = timezone(td)

if pos < len_str:
Member Author

I guess this check is redundant with the (len_str - pos) != 6 check above.

@pganssle pganssle force-pushed the from_isoformat branch 2 times, most recently from 39a65c3 to df3c032 Compare December 5, 2017 01:53
@pganssle
Member Author

pganssle commented Dec 5, 2017

@abalkin @vadmium I think I've fixed all the concerns raised so far - some good catches in there. There are three documentation-related concerns:

  1. Per @vadmium, although there are tests that ensure that this and other alternate constructors have heritable behavior (e.g. MyDateTimeClass.fromisoformat(dt.isoformat()) should return a MyDateTimeClass), this is not explicitly documented. I think we could add a note at the class level like, "Alternate constructors called on datetime subclasses will return an instance of the subclass" - though to some extent that feels like it's already part of the implicit contract (considering it's a classmethod and not a staticmethod).

  2. I do not explicitly mention the fact that +00:00 and -00:00 will attach timezone.utc instead of timezone(timedelta(0)). I think this is almost certainly the correct behavior, but it may be somewhat surprising and I could easily see dropping this special-casing of the zero-offset zone. Presumably I should at least document this?

  3. Per some discussions on the issue, this is explicitly the "minimum viable feature set". Currently it's true that datetime.fromisoformat will parse a string if and only if that string can be generated by datetime.isoformat. Do we want to put something in the documentation equivalent to "don't rely on this to validate whether or not something came from datetime.isoformat, because we're only agreeing to guarantee that it will parse anything that datetime.isoformat can generate, and we reserve the right to accept other formats"?
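
The first two points can be checked with a short script. This is a sketch that assumes a Python version where `fromisoformat` is available (3.7 or later); `MyDateTime` is a hypothetical subclass used only for the illustration.

```python
from datetime import datetime, timezone


class MyDateTime(datetime):
    """Hypothetical subclass, used only for this illustration."""


dtstr = '2017-12-05T01:53:00+00:00'

# Point 1: the alternate constructor returns an instance of the subclass.
dt_sub = MyDateTime.fromisoformat(dtstr)
print(type(dt_sub).__name__)

# Point 2: a zero offset is attached as the timezone.utc singleton.
dt_rt = datetime.fromisoformat(dtstr)
print(dt_rt.tzinfo is timezone.utc)
```

Point 3 is a documentation promise rather than a behavior, so there is nothing to demonstrate for it in code.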

def fromisoformat(cls, time_string):
"""Construct a time from the output of isoformat()."""
if not isinstance(time_string, str):
raise TypeError('fromisoformat: argument must be str')
Member Author

I'm not really sure why this is not getting hit by the test suite. test_fromisoformat_fails_typeerror is designed explicitly to hit this condition. Anyone have an idea why it's getting missed?


for bad_type in bad_types:
with self.assertRaises(TypeError):
self.theclass(bad_type)
Member

self.theclass.fromisoformat(bad_type)

Member Author

A-ha! Thanks!
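
A self-contained version of the corrected test might look like the sketch below. The class name and the `bad_types` values are assumptions made for the example; only the `self.theclass.fromisoformat(bad_type)` call comes from the review.

```python
import unittest
from datetime import date


class FromIsoformatTypeErrorTest(unittest.TestCase):
    theclass = date  # stand-in for the class under test

    def test_fromisoformat_fails_typeerror(self):
        # Non-str arguments should be rejected with TypeError.
        bad_types = [b'2009-03-01', None, 20090301]
        for bad_type in bad_types:
            with self.subTest(bad_type=bad_type):
                with self.assertRaises(TypeError):
                    # Call the alternate constructor, not the class itself.
                    self.theclass.fromisoformat(bad_type)
```

Calling `self.theclass(bad_type)` also raises TypeError, which is why the original test passed without ever reaching the `isinstance` check in `fromisoformat`.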

* String parsing utilities and helper functions
*/

static const char* parse_digits(const char* ptr, int* var,
Member

Sorry I missed this on the first review. PEP 7 calls for "function name in column 1". See PEP 7.

static const char* parse_digits(const char* ptr, int* var,
size_t num_digits)
{
for (size_t i = 0; i < num_digits; ++i) {
Member

C99-style for loops are not specifically mentioned in the "C dialect" section of PEP 7. I personally like this style, so if this compiles without warnings, we may leave it.

}

// Macro that short-circuits to timezone parsing
#define PARSE_ISOFORMAT_ADVANCE_TIME_SEP(SEP) { \
Member

Please align the \ characters.

return -5;
}

parse_timezone:;
Member

No need for a ; after the label.

Member Author

@pganssle pganssle Dec 5, 2017

Unfortunately yes, there is. It's a syntax error without the ;, because the following line is a variable declaration, not a statement, and the label is intended to be a labeled statement (Ref StackOverflow, citing section 6.8.1).

That said, there's no particular reason why those variables can't be declared at the top of the function (before the first call to PARSE_ISOFORMAT_ADVANCE_TIME_SEP). Would you prefer that?

*/
const char *p = dtstr;
p = parse_digits(p, year, 4);
if (NULL == p) { return -1; }
Member

Please use PEP 7 code layout.

@mariocj89
Contributor

To fully support isoformat output you also need to parse seconds on the timezone:

datetime.datetime.now(datetime.timezone(datetime.timedelta(seconds=30))).isoformat()
'2017-12-06T12:28:38.567161+00:00:30'

@pganssle
Member Author

pganssle commented Dec 6, 2017

@mariocj89 Thanks for the catch. The PR is updated with tests and implementation. Didn't realize that second precision was allowed in isoformat() in Python 3.7.
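
A quick round-trip check for an offset with a seconds component, assuming Python 3.7 or later (where both the seconds in `isoformat()` offsets and `fromisoformat` exist):

```python
from datetime import datetime, timedelta, timezone

tz = timezone(timedelta(seconds=30))
dt = datetime(2017, 12, 6, 12, 28, 38, 567161, tzinfo=tz)

dtstr = dt.isoformat()
print(dtstr)  # '2017-12-06T12:28:38.567161+00:00:30'

# Round trip: the parsed value keeps the 30-second offset.
dt_rt = datetime.fromisoformat(dtstr)
print(dt_rt.utcoffset())
```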

}

PyObject *result;
if (PyDate_CheckExact(cls)) {
Member Author

This check is actually wrong, so this branch is never reached. The correct code is:

if ( (PyTypeObject*)cls == &PyDateTime_DateType ) {

@abalkin
Member

abalkin commented Dec 7, 2017

Didn't realize that second precision was allowed in isoformat() in Python 3.7.

Moreover, the sub-second precision is now allowed as well:

>>> from datetime import *
>>> import random
>>> datetime.now(timezone(timedelta(hours=24*random.random()))).isoformat()
'2017-12-08T11:34:30.238049+15:47:34.297540'

The good news is that you should be able to reuse the time fields parser logic to parse the timezone.

cc: @pganssle

@pganssle
Member Author

pganssle commented Dec 9, 2017

@abalkin Unless I'm doing something wrong, it seems that the pure python implementation of isoformat() doesn't actually support subsecond precision.

import sys 
sys.modules['_datetime'] = None     # cause ImportError
from datetime import *

print(time(12, 30, 15, tzinfo=timezone(timedelta(microseconds=123456))).isoformat())
# 12:30:15+00:00:00

print(datetime(2017, 1, 1, 12, 30, 15,
      tzinfo=timezone(timedelta(microseconds=123456))).isoformat())
# Traceback (most recent call last):
#  File "<stdin>", line 2, in <module>
#  File ".../cpython/Lib/datetime.py", line 1832, in isoformat
#   assert not ss.microseconds
# AssertionError

I suppose I'll fix that as part of this PR.

@pganssle
Member Author

pganssle commented Dec 9, 2017

Latest changes fix both isoformat() and fromisoformat() in pure python and C implementations for subsecond offsets.


Return a :class:`datetime` corresponding to a *date_string* in one of the
ISO 8601 formats emitted by :meth:`datetime.isoformat`. Specifically, this function
supports strings in the format(s) ``YYYY-MM-DD[*[HH[:MM[:SS[.mmm[mmm]]]]][+HH:MM]]``,
Member Author

Documentation needs updating to reflect second and microsecond precision.

@abalkin
Member

abalkin commented Dec 13, 2017

it seems that the pure python implementation of isoformat() doesn't actually support subsecond precision.

@pganssle - It looks like you are right. Please mention bpo-5288 in the commit message when you fix this. Please also find out why this was not caught by the test suite. We should be running all tests with and without C acceleration.

@pganssle
Member Author

@abalkin It's already fixed and tests added. I didn't see any tests actually testing subsecond support for timezones in .isoformat, which is why it wasn't caught by the tests.

I just pushed a rewritten history that mentions bpo-5288 in the commit fixing the pure python version.

@abalkin
Member

abalkin commented Dec 13, 2017

@pganssle - great job! This PR looks good to me now and I will merge it in a few days to give others a chance to review.


Return a :class:`time` corresponding to a *time_string* in one of the ISO 8601
formats emitted by :meth:`time.isoformat`. Specifically, this function supports
strings in the format(s) ``HH[:MM[:SS[.mmm[mmm]]]]][+HH:MM[:SS[.ffffff]]]``.
Member

Too many square brackets.

.. classmethod:: datetime.fromisoformat(date_string)

Return a :class:`datetime` corresponding to a *date_string* in one of the
ISO 8601 formats emitted by :meth:`datetime.isoformat`. Specifically, this function
Member

I would drop “ISO 8601” and just say “one of the formats emitted by isoformat”. As far as I understand, ISO 8601 doesn’t have a seconds field in time zones, but it seems you want to support this.

If you intend to support dates without the time part, maybe write “emitted by datetime.isoformat and date.isoformat”. This may also need a test case; I didn’t notice anything relevant.

Member Author

@vadmium The test cases from TestDate are actually inherited by TestDateTime, so it is indeed supported. I think it's fair to support them.

Lib/datetime.py Outdated
try:
assert len(date_string) == 10
return cls(*_parse_isoformat_date(date_string))
except:
Member

Write except Exception, to avoid catching KeyboardInterrupt or similar. Or even better, be explicit and list the exceptions you are expecting (AssertionError, ValueError, IndexError?).

Member Author

@vadmium I think catching Exception is probably the right thing to do, I didn't think about KeyboardInterrupt. I'm mainly trying to get the C implementation and the Python implementation to always raise the same exceptions, and the C implementation only ever raises ValueError (I will fuzz it some time this week to verify this), hence the catch-and-re-raise.
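
The catch-and-re-raise pattern being discussed could be sketched like this. `date_fromisoformat` is a hypothetical stand-in, not the code in Lib/datetime.py; it only shows how `except Exception` funnels every parsing failure into a single ValueError while leaving KeyboardInterrupt alone.

```python
from datetime import date


def date_fromisoformat(date_string):
    """Hypothetical stand-in for a pure-Python date.fromisoformat."""
    if not isinstance(date_string, str):
        raise TypeError('fromisoformat: argument must be str')

    try:
        if len(date_string) != 10:
            raise ValueError('unexpected length')
        if date_string[4] != '-' or date_string[7] != '-':
            raise ValueError('unexpected separator')
        year = int(date_string[0:4])
        month = int(date_string[5:7])
        day = int(date_string[8:10])
        return date(year, month, day)
    except Exception:
        # Any failure while parsing or constructing the date is reported
        # as one ValueError; KeyboardInterrupt is not an Exception
        # subclass, so it still propagates.
        raise ValueError('Invalid isoformat string: %r' % date_string)
```

The TypeError check sits outside the `try` block on purpose, so that a wrong argument type is not converted into a ValueError.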

Lib/datetime.py Outdated
@@ -1193,19 +1320,7 @@ def __hash__(self):
def _tzstr(self, sep=":"):
Member

Good opportunity to remove the unsupported sep parameter

Lib/datetime.py Outdated
@@ -1193,19 +1320,7 @@ def __hash__(self):
def _tzstr(self, sep=":"):
"""Return formatted timezone offset (+xx:xx) or None."""
Member

or the empty string

Return a :class:`datetime` corresponding to a *date_string* in one of the
ISO 8601 formats emitted by :meth:`datetime.isoformat`. Specifically, this function
supports strings in the format(s) ``YYYY-MM-DD[*HH[:MM[:SS[.mmm[mmm]]]]][+HH:MM[:SS[.ffffff]]]``,
where ``*`` can match any single character.
Member

I think it would be less ambiguous putting the time zone inside the optional time part. Test case:

datetime(2017, 12, 18, 11, 0).isoformat(sep="+", timespec="minutes") -> "2017-12-18+11:00"
datetime.fromisoformat("2017-12-18+11:00") -> datetime(2017, 12, 18, 11, 0)

Member Author

@pganssle pganssle Dec 18, 2017

Hmmm.. As much as I dislike the idea of timezone being part of the time component, this is fine I suppose (and in fact that is how it is currently implemented).

That said, another design I considered is one where we take sep as a keyword argument to isoparse, which would relieve this ambiguity for all separators other than - and + if we wanted to eventually allow parsing of strings of the format YYYY-MM-DD+HH:MM. That's one decision we'd have to make in this version because changing it would not be backwards compatible.


Return a :class:`date` corresponding to a *date_string* in one of the ISO 8601
formats emitted by :meth:`date.isoformat`. Specifically, this function supports
strings in the format(s) ``YYYY-MM-DD``.
Member

There is only one format emitted by date.isoformat and supported by your date.fromisoformat, as far as I know.


Other constructor:

.. classmethod:: time.fromisoformat(date_string)
Member

time_string?

@pganssle
Member Author

pganssle commented Dec 18, 2017

@vadmium Thanks for all the comments. I've added a test for the ambiguous cases you identified and updated the documentation (plus dropped the unused sep parameter).

@pganssle
Member Author

@abalkin @vadmium Any thoughts on the question of whether to add a sep parameter to fromisoformat (from this now-hidden comment)?

The only reason to do so would be if in a future release we want to support ISO-8601 style strings that are just dates with offsets attached (e.g. 2017-01-01+12:00). I'm not sure, but I think ISO-8601 makes time zone offsets a property of time, so such dates-with-offsets are probably not actually valid ISO-8601 strings. Because 2017-01-01+12:00 could have been generated from datetime.isoformat(sep='+'), in its current form it would be impossible to get fromisoformat to interpret that +12:00 as anything except the time component. If we make the default behavior "strict", then in the future we would be able to support these "out of the box", specifying the separator as + or - would cause these style strings to be interpreted as times, and otherwise they would be offsets.

Having given this a bit of thought, I think it's probably a good idea to leave out sep, and if we want to support this use case in future releases, specifying a separator other than + or - would allow these things to be parsed as intended. Having the default value ignore the separator will make it easier to handle what is probably the most frequent use cases, which will be sep=' ' and sep='T' out of the box without any sort of branching to detect which one you're in, etc.

If we assume that some fraction of people will want to be strict and some people will want to be lenient, no matter which one we choose, the other behavior can easily be emulated. If we choose lenient by default, strict users can do the equivalent with:

from datetime import datetime

def strict_fromisoformat(dtstr, sep='T'):
    # Reject strings whose date/time separator is not the expected one
    if dtstr[10:11] != sep:
        raise ValueError('Invalid isoformat string: %s' % dtstr)
    return datetime.fromisoformat(dtstr)

If we choose strict by default, the "lenient by default" option is:

def lenient_fromisoformat(dtstr):
    # Assuming sep requires a single character be passed, for dates it can be anything
    return datetime.fromisoformat(dtstr, sep=dtstr[10:11] or 'T')

Anyway, this is a kind of long comment just to say, "I think we should keep the status quo", but I thought I'd at least outline my thinking.

@abalkin
Member

abalkin commented Dec 19, 2017

Any thoughts on the question of whether to add a sep parameter to fromisoformat

-1 on adding a parameter, but I would not mind having fromisoformat() not checking the separator at all.

@pganssle
Member Author

@abalkin The current behavior is to allow any single character for the separator at all, if that's what you mean.

@abalkin
Member

abalkin commented Dec 19, 2017

The current behavior is to allow any single character for the separator at all, if that's what you mean.

Yes. That's what I want.
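
As a small illustration of the behavior agreed on here, the sketch below round-trips one datetime through a few separators, including the ambiguous `+` case raised earlier in the review. It assumes Python 3.7 or later, and the choice of separators is arbitrary.

```python
from datetime import datetime

dt = datetime(2017, 12, 18, 11, 0)

# Any single character is accepted between the date and time parts.
round_trips = {}
for sep in ('T', ' ', '_', '+'):
    dtstr = dt.isoformat(sep=sep, timespec='minutes')
    round_trips[dtstr] = datetime.fromisoformat(dtstr) == dt

# The '+' case is read as a time of 11:00, not as a UTC offset.
ambiguous = datetime.fromisoformat('2017-12-18+11:00')
print(ambiguous, ambiguous.tzinfo)
```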

6 participants