Skip to content

bpo-38250: [Enum] single-bit flags are canonical #24215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
Jan 25, 2021

Conversation

ethanfurman
Copy link
Member

@ethanfurman ethanfurman commented Jan 14, 2021

Flag members are now divided by one-bit verses multi-bit, with multi-bit
being treated as aliases. Iterating over a flag only returns the
contained single-bit flags.

Iterating, repr(), and str() show members in definition order.

When constructing combined-member flags, any extra integer values are either discarded (CONFORM), turned into ints (EJECT) or treated as errors (STRICT). Flag classes can specify which of those three behaviors is desired:

>>> class Test(Flag, boundary=CONFORM):
...     ONE = 1
...     TWO = 2
...
>>> Test(5)
<Test.ONE: 1>

Besides the three above behaviors, there is also KEEP, which should not be used unless necessary -- for example, _convert_ specifies KEEP as there are flag sets in the stdlib that are incomplete and/or inconsistent (e.g. ssl.Options). KEEP will, as the name suggests, keep all bits; however, iterating over a flag with extra bits will only return the canonical flags contained, not the extra bits.

https://bugs.python.org/issue38250

Flag members are now divided by one-bit verses multi-bit, with multi-bit
being treated as aliases.  Iterating over a flag only returns the
contained single-bit flags.

repr() and str() now only show the Flags, not extra integer values; any
extra integer values are either discarded (CONFORM), turned into
``int``s (EJECT) or treated as errors (STRICT).  Flag classes can
specify which of those three behaviors is desired:

    >>> class Test(Flag, boundary=CONFORM):
    ...     ONE = 1
    ...     TWO = 2
    ...
    >>> Test(5)
    <Test.ONE: 1>
some flag sets, such as ``ssl.Options`` are incomplete/inconsistent;
using KEEP allows those flags to exist, and have useful repr()s, etc.

also, add ``_inverted_`` attribute to Flag members to significantly
speed up that operation.
repr() has been modified to support as closely as possible its previous
output; the big difference is that inverted flags cannot be output as
before because the inversion operation now always returns the comparable
positive result; i.e.

   re.A|re.I|re.M|re.S is ~(re.L|re.U|re.S|re.T|re.DEBUG)

in both of the above terms, the ``value`` is 282.

re's tests have been updated to reflect the modifications to repr().
@ethanfurman ethanfurman force-pushed the enum-single_bit_flags branch from 2cde48b to 1fd7471 Compare January 14, 2021 04:24
Copy link
Contributor

@belm0 belm0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the documented semantic changes look good 👍

CONFORM functionality looks useful, it would replace this hack I've got:

SafeFlag
class SafeFlag(enum.Flag):
    """Replacement for enum.Flag which ignores unknown bits.

      * for Enum, unknown value should be an error
      * but for Flag, usually unknown bits can be safely ignored--
        this is a simple way to achieve backwards compatibility
      * use SafeFlag when flag values come from external input
        (another service, firmware, etc.)
    """

    @classmethod
    def _missing_(cls, value):
        # constrain to known bits
        # NOTE: enum subclass cannot define members, so using getattr
        all_value = getattr(cls, '_all_value', None)
        if all_value is None:
            all_value = sum(v.value for v in cls)
            setattr(cls, '_all_value', all_value)
        return super()._missing_(value & all_value)

Addition of the boundary options increases complexity however, and the implementation is growing rather than shrinking.

I'd like to help get the implementation into shape towards these ideals:

  • all operations should be O(1) or O(number of set bits). Use bit operations rather than iteration wherever possible.
  • iterators should generate values on the fly rather than pre-assemble in memory (implies no sorting or reversing)
  • consider eliminating caching like _value2member_map_, assuming previous efficiency points can be achieved
  • eliminate _decompose 🙏

I reviewed some of the enum.py changes, but there is a lot to cover. Just raising the larger points to start.

@ethanfurman
Copy link
Member Author

@belm0 Comments left, changes made. Feel free to resolve any conversations you feel are complete.

@ethanfurman ethanfurman added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 15, 2021
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @ethanfurman for commit 668c9a9 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 15, 2021
@belm0
Copy link
Contributor

belm0 commented Jan 16, 2021

@belm0 Comments left, changes made. Feel free to resolve any conversations you feel are complete.

For whatever reason, "resolve conversation" buttons aren't appearing. I haven't seen this problem in other repos. I suspect it's a permissions issue.

@belm0
Copy link
Contributor

belm0 commented Jan 18, 2021

Here's why I can't resolve conversations:

You can resolve a conversation in a pull request if you opened the pull request or if you have write access to the repository where the pull request was opened.

https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/commenting-on-a-pull-request#resolving-conversations

Iteration is now in member definition order.  If member definition order
matches increasing value order, then a more efficient method of flag
decomposition is used; otherwise, sort() is called on the results of
that method to get definition order.
Copy link
Contributor

@belm0 belm0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a complicated change to a complicated implementation (and sprinkled with noise)... I need to try another review pass later

new composite members aren't always added to _value2member_map_ -- this
ensures the operation succeeds
@@ -153,15 +199,29 @@ def __set_name__(self, enum_class, member_name):
enum_member._name_ = member_name
enum_member.__objclass__ = enum_class
enum_member.__init__(*args)
enum_member._sort_order_ = len(enum_class._member_names_)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't look right-- _sort_order_ is an int?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because sorting ints is easy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code through me off because it's not clear how len(enum_class._member_names_) provides the sort order, so it may warrant a comment.

@belm0
Copy link
Contributor

belm0 commented Jan 23, 2021

(Thank you for removing the formatting changes!)

Unfortunately, I can't really say I'm supporting this PR and design-- I'll try to summarize my differences, some of which are lost in resolved comments.

  • implementation and API complexity for the "boundary" option - My opinion is that Flag and IntFlag should be effectively "CONFORM"-- since it is simple, efficient, and safe-- and that user code should bear the complexity if they need other behavior (e.g. checking for unknown bits to implement something like "STRICT"). It can be argued that would be a breaking change-- but the PR at hand is itself likely to break existing code due to behavior changes and repr changes.
  • str() and repr() performance is important - these should be fast (no temporary list or sorting) without caching. Rather than 2nd class methods, they are essential for serialization to text, logging, etc. There is opportunity to do so by having the representation in LSB to MSB (ideal) or MSB to LSB (sacrifices CPU or requires a temporary list, but still no sorting) order. Performance shouldn't depend on definition order.
  • instance repr() should be in some order by bit value, not definition order
  • instance iteration should be performant - shouldn't depend on definition order

@ethanfurman
Copy link
Member Author

re. iteration ordering

I see it as less complication for both: the "correct" iterator is chosen at class definition, then
never changes; user's always get definition order. How is that more complication?

The code has to have two iteration implementations and choose the correct one. There should be
unit tests covering both cases, and somehow confirming which iteration implementation is actually
used-- but I don't think that's currently covered.

The Perm enumeration is defined in msb order, and tested.

Moreover, this is a complication for users. Flag definitions in existing code which happened to be
defined in "not the ideal order" will have slower iteration. Users will be expected to somehow know
about this, decide if they need the CPU savings or not, and be careful to use the correct ordering.
(And document it on every Flag definition so that some future code maintainer that doesn't know
about it doesn't violate the careful ordering.)

While efficiency is definitely a virtue, CPython doesn't make speed guarantees.

All for what? I don't understand why it's good that instance iteration is in definition order.

For me at least, visually comparing the single-bit flags present in a multi-bit flag is far easier if both are in the same order, and comparing the Flag against its definition order is easier of Flag iteration is in definition order. I don't think I'm alone in this.

Rest assured that your efforts have helped make Flag magnitudes faster than it was.

  • implementation and API complexity for the "boundary" option - My opinion is that Flag and
    IntFlag should be effectively "CONFORM"-- since it is simple, efficient, and safe-- and that user
    code should bear the complexity if they need other behavior (e.g. checking for unknown bits to
    implement something like "STRICT"). It can be argued that would be a breaking change-- but the
    PR at hand is itself likely to break existing code due to behavior changes and repr changes.

Previous behavior was already effectively STRICT and EJECT -- adding boundary just makes it easier for users to actually select the behaviors they need -- such as needing to use IntEnum, but wanting the safety of STRICT (so errors do not pass silently). Further, moving to single-bit flags being canonical made transforming existing integer flags in the stdlib difficult if not impossible, which is why I ended up adding KEEP.

  • str() and repr() performance is important - these should be fast (no temporary list or sorting)
    without caching. Rather than 2nd class methods, they are essential for serialization to text,
    logging, etc. There is opportunity to do so by having the representation in LSB to MSB (ideal) or
    MSB to LSB (sacrifices CPU or requires a temporary list, but still no sorting) order. Performance
    shouldn't depend on definition order.

str() and repr() use the name of the flag, and the name is calculated once at flag creation. (The old code used to iterate every time for multi-bit flags, and that was definitely one of the behaviors to get fixed in this PR.)

  • instance repr() should be in some order by bit value, not definition order

Noted, but see above about visual comparisons.

  • instance iteration should be performant - shouldn't depend on definition order

It is much more performant now than it was before, even with the (slightly) slower of the two methods. I expect most Flag enumerations will just get the speed increase (defined with auto(), or converted with _convert_(), and for those that absolutely need every last cpu cycle it is easy enough to use a decorator to reset the _member_iter_ to _member_iter_by_value_. Either you or I are sure to add an SO answer for that (If you, please make sure to stress that these are implementation details and are subject to change without notice).

I looked at removing _value2member_map_, but that was a serious hit on performance. After this PR lands I can look at swapping that for a WeakDict instead (more complexity, but a space savings).


While we disagree about some of the design choices, I truly appreciate your help in getting this version of Flag to be simpler and faster. If you have ideas in the future about making Flag even faster (without changing semantics) I would love to hear them.

@belm0
Copy link
Contributor

belm0 commented Jan 25, 2021

Regarding str() output, I find the following arbitrary. I'd rather have str output always use single-bit flags, never multi-bit. Besides avoiding strange user-facing rules like "output a multi-bit alias if the value happens to be covered by a single such alias", it would make the implementation simpler.

>>> class Bar(Flag):
...     A = auto()
...     B = auto()
...     C = auto()
...     AB = A|B
...     BC = B|C
...     AC = A|C
...     CA = A|C
...
>>> ~Bar.B
<Bar.AC: 5>  # why is AC canonical over A|C or CA?
>>> ~Bar(0)
<Bar.A|B|C: 7> # why is A|B|C canonical over A|BC or AB|C or AC|B?

Regarding repr() output, the best practice is for it to be eval-able back to the original object. Now:

>>> repr(Bar.A)
'<Bar.A: 1>'
>>> repr(~Bar(0))
'<Bar.A|B|C: 7>'

option 1:

>>> repr(Bar.A)
'Bar(1)'
>>> repr(~Bar(0))
'Bar(7)'

option 2 (my suggestion):

>>> repr(Bar.A)
'Bar.A'
>>> repr(~Bar(0))
'(Bar.A|Bar.B|Bar.C)'

This applies equally to Enum. repr(MyEnum.BAZ) should be 'MyEnum.BAZ'.

@ethanfurman
Copy link
Member Author

On 1/25/21 5:57 AM, John Belmonte wrote:

Regarding |str()| output, I find the following arbitrary. I'd rather have str output
always use single-bit flags, never multi-bit. Besides avoiding strange rules like
"output a multi-bit alias if the value happens to be covered by a single such alias",
it would make the implementation simpler.

Making the implementation simpler has never been a primary goal. My primary goal is to
make users' code simpler, which means the implementations for the libraries they use are
usually less simple.

Sometimes a certain combination of flags has a distinct meaning, and so a distinct name;
allowing the use of that distinct name in repr()s makes it easier to reason about what
is going on in the code. If the user bothers to give a distinct name to a certain flag
combination, then Flag will echo it back when appropriate.

As you noted in the bpo issue, a line must be drawn somewhere, so Flag no longer echos
back every possible alias present in a value -- either you get one alias that matches
every flag set, or you get the canonical, aka single-bit, flags.

>>> class Bar(Flag):
...     A = auto()
...     B = auto()
...     C = auto()
...     AB = A|B
...     BC = B|C
...     AC = A|C
...     CA = A|C
...
>>> ~Bar.B <Bar.AC: 5>
# why is AC canonical over CA?

AC is not canonical, but it was listed first and follows standard aliasing rules
(first one listed, wins).

>>> ~Bar(0) <Bar.A|B|C: 7>
# why is A|B|C canonical over A|BC or AB|C or AC|B?

Because multibit flags (aka aliases) are not listed in iteration, and repr() and
str() are produced by iteration.

Regarding |repr()| output, the best practice is for it to be eval-able back to
the original object.

While that is desirable behavior, it is not a hard and fast rule, and Enum has never
followed it. There are other areas of the stdlib whose repr()s are also not round-
trippable.

Now:

>>> repr(~Bar(0))
<Bar.A|B|C: 7>

option 1:

>>> repr(~Bar(0))
Bar(7)

This loses the flag names.

option 2:

>>> repr(~Bar(0))
(Bar.A|Bar.B|Bar.C)

And this loses the value. Neither appeals to me.

On the bright side: the stdlib Enum/Flag objects will probably get changed to option 2 above, which means there will be a readily accessible __repr__ that users are welcome to use in their own enumerations.

@ethanfurman ethanfurman merged commit 7aaeb2a into python:master Jan 25, 2021
@belm0
Copy link
Contributor

belm0 commented Jan 26, 2021

My primary goal is to make users' code simpler, which means the implementations for the libraries they use are usually less simple.

Sometimes that needs to be so. But other times, for the user's sake it's better to have a simple thing that they can understand fully and adapt to suit their needs, rather than something with more complex API and behavior which is trying to anticipate every need (and ultimately cannot succeed).

Take KEEP and EJECT:

Previous behavior was already effectively STRICT and EJECT -- adding boundary just makes it easier for users to actually select the behaviors they need -- such as needing to use IntEnum, but wanting the safety of STRICT (so errors do not pass silently). Further, moving to single-bit flags being canonical made transforming existing integer flags in the stdlib difficult if not impossible, which is why I ended up adding KEEP.

I was trying to argue that perhaps IntFlag should be simpler, and for cases where one would desire KEEP, it would be better to change the flag usage to be at the boundaries of the API only. It's suspect to add something to an API like KEEP with a disclaimer that it should almost never be used.

Sometimes a certain combination of flags has a distinct meaning, and so a distinct name;
allowing the use of that distinct name in repr()s makes it easier to reason about what
is going on in the code. If the user bothers to give a distinct name to a certain flag
combination, then Flag will echo it back when appropriate.

Another way I'll try to argue it: multi-bit aliases can be thought of as implementation details. They can be added and removed, their names can be changed, all without affecting a flag value's representation. So they don't really belong in a value's str or repr output. The user can rest assured that as long as he doesn't change the single-bit definitions, any logging or serialized textual representation of a flag value will remain correct and applicable in the future.

closing

I understand we accomplished some movement: invert of IntFlag is no longer surprising by default, some implementation inefficiencies are addressed, Flag gets iteration and length of set bits (fixed various bugs with the initial implementation), Flag supports a safe mode (CONFORM) without hacking into it.

My disappointment is that through careful API and behavior decisions it would have been possible to do all this while at the same time contracting both the API and implementation. Rather, we ended up with a more complex API and larger implementation. That means it's harder for users to digest Enum/Flag docs, it will be harder for future devs to maintain the code, and it will be harder to make future enhancements to the library without breaking the implementation or compatibility.

Future user A - "Why are IntFlag values behaving so bizarrely, I've never seen it!". (Doesn't know that the flag is defined with boundary=KEEP.)

Something as simple as a definition of compose-able bits should not have 4 modes...

Screen Shot 2021-01-26 at 1 27 15 PM

@ethanfurman ethanfurman deleted the enum-single_bit_flags branch January 26, 2021 07:34
@ethanfurman
Copy link
Member Author

ethanfurman commented Jan 26, 2021 via email

adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 13, 2021
Flag members are now divided by one-bit verses multi-bit, with multi-bit being treated as aliases. Iterating over a flag only returns the contained single-bit flags.

Iterating, repr(), and str() show members in definition order.

When constructing combined-member flags, any extra integer values are either discarded (CONFORM), turned into ints (EJECT) or treated as errors (STRICT). Flag classes can specify which of those three behaviors is desired:

>>> class Test(Flag, boundary=CONFORM):
...     ONE = 1
...     TWO = 2
...
>>> Test(5)
<Test.ONE: 1>

Besides the three above behaviors, there is also KEEP, which should not be used unless necessary -- for example, _convert_ specifies KEEP as there are flag sets in the stdlib that are incomplete and/or inconsistent (e.g. ssl.Options). KEEP will, as the name suggests, keep all bits; however, iterating over a flag with extra bits will only return the canonical flags contained, not the extra bits.

Iteration is now in member definition order.  If member definition order
matches increasing value order, then a more efficient method of flag
decomposition is used; otherwise, sort() is called on the results of
that method to get definition order.


``re`` module:

repr() has been modified to support as closely as possible its previous
output; the big difference is that inverted flags cannot be output as
before because the inversion operation now always returns the comparable
positive result; i.e.

   re.A|re.I|re.M|re.S is ~(re.L|re.U|re.S|re.T|re.DEBUG)

in both of the above terms, the ``value`` is 282.

re's tests have been updated to reflect the modifications to repr().
tacaswell added a commit to tacaswell/ophyd that referenced this pull request May 4, 2021
Python 3.10 makes a number of changes to Enum and its sub-classes.  Under
the new default behavior of IntFlag the fact that we have `0b1b1` as a
flag, but not `0b100` is considered a class definition time error (preventing
import of ophyd).

This opts back into allowing our (unorthodox) definition to import again and
hides the new features behind a version gate.

python/cpython#24215
tacaswell added a commit to tacaswell/ophyd that referenced this pull request Apr 5, 2022
Python 3.10 makes a number of changes to Enum and its sub-classes.  Under
the new default behavior of IntFlag the fact that we have `0b1b1` as a
flag, but not `0b100` is considered a class definition time error (preventing
import of ophyd).

This opts back into allowing our (unorthodox) definition to import again and
hides the new features behind a version gate.

python/cpython#24215
@AdamWill AdamWill mentioned this pull request Jun 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants