Skip to content

Sort roles dict by keys when serializing #2161

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
KOLANICH opened this issue Nov 1, 2022 · 17 comments
Closed

Sort roles dict by keys when serializing #2161

KOLANICH opened this issue Nov 1, 2022 · 17 comments

Comments

@KOLANICH
Copy link
Contributor

KOLANICH commented Nov 1, 2022

Description of issue or feature request:

To reduce variation and achieve better diffs we probably should sort keys in roles dict before serializing.

Current behavior:
The order is random.

Expected behavior:
The order should be alwwys the same.

@jku
Copy link
Member

jku commented Nov 1, 2022

The order is random.

It is not. Python 3.7 and newer guarantee that dict ordering is preserved, just like OrderedDict. python-tuf supports python >= 3.7

@jku
Copy link
Member

jku commented Nov 1, 2022

This looks like an unnecessary change to me but reproducible output is definitely a priority to this project: so if there are any issues with that, let's handle them.

@KOLANICH
Copy link
Contributor Author

KOLANICH commented Nov 1, 2022

Python 3.7 and newer guarantee that dict ordering is preserved, just like OrderedDict.

I know, but may be different because of different reasons, like different insertion order in different launches of tools (I'm currently doing an own tool for making a repo and splitted the pipeline into different CLI commands, and noticed that keys are permutted on each call of my tool). We want line-based diffs between versions of the metadata be minimal (and no difference at all, when there is no meaningful changes), it is easier to review them for a human this way using tools like meld when debugging apps using TUF and it should consume less space in git.

@jku
Copy link
Member

jku commented Nov 2, 2022

If your issue is that you are inserting items in non-reproducible order, then using OrderedDict won't help at all: it just preserves the insert order, just like Dict does.

If this PR helps your case then it's actually the sort() calls you've also added that are doing the work: those would help the case you describe. Sorted order seems like a fine choice for some containers (like sigs) but I don't know if we want to change that at this point...

I'm making this point because most of the commit seems to be OrderedDict changes but they seem irrelevant to the proposed fix: this may not obvious to casual observer

@jku
Copy link
Member

jku commented Nov 2, 2022

@lukpueh @joshuagl I'm sure we discussed potentially sorting dicts on serialize at some point. Any recollections on that?

@lukpueh
Copy link
Member

lukpueh commented Nov 2, 2022

It came up in some of @MVrachev's work:

There was also an issue with not preserving the order in a list, for which we use a dict internally:

And:

Can't remember the details right now.

@KOLANICH
Copy link
Contributor Author

KOLANICH commented Nov 2, 2022

If your issue is that you are inserting items in non-reproducible order, then using OrderedDict won't help at all: it just preserves the insert order, just like Dict does.

I know. The order is enforced by sorting, and I think it's TUF lib responsibility to enforce ordering: almost everyone will need it, sorting will cost almost nothing (for small enough n, but it is likely to be small enough). OrderedDict (Python language keeps it despite just dict is currently ordered in modern versions and even has introduced into typing module instead of just deprecating it, moving its functionality to just dict and leaving an alias OrderedDict = dict. I guess it may be the case that in future just dict can become unordered again) is just to enforce that sortedness is future-proof.

@jku
Copy link
Member

jku commented Nov 2, 2022

  • On using OrderedDict: I don't think it has enough advantages to outweigh the cost in uglier code. The idea that python will make dict unordered in future is pure guesswork
  • On sorting: As discussed the output is stable if your order of operations is stable: so no, everyone does not need this, only those who have tests with non-reproducible operation order. I'm hoping most actual TUF tools would not even show users raw metadata diffs, but even if they do -- how would they have a random operation order? I'm not saying I'm totally opposed to the idea but it does have costs involved as well:
    • a tricky looking change in low level code: the PR is harder to read than current code at least at the moment. Maybe this can be improved but I'm saying the review and fixes do have a cost...
    • sudden output change for all projects producing metadata already -- this only happens once per metadata but since some metadata changes rarely, it will keep surprising people. That said the effect might be quite limited: most dicts have static keys and have been completely sorted all along (this is why no tests failed: the test metadata didn't change at all)

TL;DR: I think the sorting is worth considering.

@MVrachev
Copy link
Collaborator

MVrachev commented Nov 2, 2022

I am not sure I understand the need of using sort here.
We are sure that the order is preserved which is the important thing.
Can you phrase it clearer why you think that almost anyone will need it?

@KOLANICH
Copy link
Contributor Author

KOLANICH commented Nov 2, 2022

Can you phrase it clearer why you think that almost anyone will need it?

I think that the metadata files produced from a set of source files (i.e. keys and user files) should be as similar as possible to other (i.e. from different invocations in randomized conditions, or produced by different tools) metadata, because it (in the order of reducing importance)

  1. simplifies looking into them for a human being
  2. eliminates non-meaningful (when files have the same meaning, but are serialized into different set of bytes) variations, this way eliminating unneeded storage objects in content-addressable storages
  3. minimizes delta size for delta-compression

While 1. and 2. are only important for developers, 3. would impact everyone.

@jku
Copy link
Member

jku commented Nov 4, 2022

Cheers, the code changes look a lot easier to understand now without OrderedDict (it's such a painful API).
I have code comments but let's not get into those before others have voiced an opinion on the idea itself: should we change the serialization dict output order from insert-order to stable or alphabetical order*?

*) current PR doesn't actually implement 100% alphabetical but what looks like a stable, almost alphabetical order

@MVrachev
Copy link
Collaborator

MVrachev commented Nov 5, 2022

I don't have a strong opinion on this.
I can see the benefits, especially the first point.
I think it's up to the implementation - of it looks complicated then it's not worth it as an addition, if there is a way to make it simple and understandable for the maintainers then it could be worth it.

@lukpueh
Copy link
Member

lukpueh commented Nov 17, 2022

Let me try to summarize the discussion. IIUC the goal here is reproducible metadata, which can be achieved by:

  1. not changing inputs in an unexpected way (e.g. preserving input order), and/or
  2. canonicalizing inputs (e.g. ordering)

(1) is already true for the internal representation as discussed above, and (2) is already true to some extent for the default json serializer (sorted keys, uniform indentation and whitespace).

IMO:

  • it's okay to expect users, who desire reproducible outputs, to provide reproducible inputs
  • the key sorting in our json deserializer already does what OP wants
  • doing additional canonicalization in the default json serializer seems like unexpected behavior
  • users, who want a canonical wireline format, can use a custom serializer

TL;DR: I'm for closing this issue.

@jku
Copy link
Member

jku commented Nov 21, 2022

So after lukas' comment I actually looked at the JSON serialization code and we do sort dicts by key! So now I'm confused. If I understand correctly the serialization result (the json) is already 100% sorted wherever it should be...

@KOLANICH is this correct: JSON is already like you want but you are asking for the intermediate format (the python dicts) to also be sorted? Could you show actual code that is broken by the current functionality -- is this about Metadata eq comparisons or something?

@KOLANICH
Copy link
Contributor Author

I just used an own serializer and parser because I have an own library of serializers and parsers from/into various serialization formats.

@KOLANICH
Copy link
Contributor Author

but you are asking for the intermediate format (the python dicts) to also be sorted?

So, yes.

It can probably be dealt with in a serializer, but storing them sorted should have a benefit that re-sorting should be faster.

@jku
Copy link
Member

jku commented Nov 22, 2022

Ok thanks for the details.

I think in that case I agree with Lukas: When this was designed sorting was intentionally left to the serializer. Change of behaviour now is not worth the small risk when the benefit is also so small: serializers are supposed to choose their serialization format, this includes potentially sorting dicts.

I'll close this based on votes from me and lukas

@jku jku closed this as completed Nov 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants