
Metadata API: Provide a way to validate object when serializing to dictionary #1696

Closed
MVrachev opened this issue Nov 26, 2021 · 17 comments · Fixed by #1775
Labels: backlog (Issues to address with priority for current development goals)

@MVrachev
Collaborator

Description of issue or feature request:

This issue was discussed during a couple of conversations with @jku.

We can say we are almost done with Metadata API validation during the initialization of Signed objects as summarized here: #1140 (comment).

What we didn't focus on is validation when serializing Signed objects to dictionaries through Signed.to_dict().
This could be useful for users of the Metadata API who have changed the signed portion.

I imagine some of the requirements for such validation are:

  1. It's optional. As pointed out by @jku in our conversation, a client using the Metadata API could provide its own validation of the metadata objects. Additionally, if this kind of validation were mandatory, it could slow down a big python-tuf client such as Warehouse working on thousands of targets.
  2. It doesn't duplicate validation code. It would be best if we find a way to reuse the exact same code as the mandatory validation during initialization, without duplicating it.

One thing to consider is whether we want to add somewhat more complex validation, in addition to the validation done during initialization.
For example, for Root, it probably could be useful to validate that every keyid defined in the keys dictionary is used by at least one role in roles.
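Such a check could be sketched like this; the function name and the dictionary shape are illustrative, not part of the Metadata API:

```python
# Hypothetical sketch of the stricter Root check discussed above:
# flag any keyid defined in "keys" that no role in "roles" references.
def unused_keyids(root_dict: dict) -> set:
    defined = set(root_dict["keys"])
    used = set()
    for role in root_dict["roles"].values():
        used.update(role["keyids"])
    return defined - used


root = {
    "keys": {"keyid-a": {}, "keyid-b": {}},
    "roles": {
        "root": {"keyids": ["keyid-a"]},
        "targets": {"keyids": ["keyid-a"]},
    },
}
print(unused_keyids(root))  # "keyid-b" is defined but never used
```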

Current behavior:
No validation is done when calling to_dict() from any of the Signed child classes.

Expected behavior:
Provide a way to validate an object before calling to_dict() or when calling to_dict() with an explicit option set.

PS: I didn't include the signatures part of the Metadata objects in this discussion, as their validation is a little more complex: you need access to the delegator.

@MVrachev
Collaborator Author

I decided to create one prototype of a possible solution: https://github.com/MVrachev/tuf/tree/serialization-validation.

@jku
Member

jku commented Nov 30, 2021

I agree with the basic premise: we'll want to re-evaluate whether our validation is appropriate for the serialization as well (while we've been concentrating on the client case, the focus has been on validating deserialized data).

some quick comments:

  • the idea of validating when creating a dict is not necessarily the correct one: to_dict() is a helper that json serialization happens to use. There's no guarantee that doing so makes sense for some possible other serialization format. We might want to consider validating during Metadata.to_bytes() if possible
  • a potential way to 100% prevent incorrect serialization (instead of just checking e.g. specific data types) is during Metadata.to_bytes() to serialize Metadata to bytes, then deserialize back to Metadata, and raise if the two metadata instances are not equal.
    • This might be too heavy (cpu/memory use) to be a default choice but could be an option
    • This would mean we can keep focusing on the deserialization path when validating
    • Metadata does not currently support __eq__() so this would need a bit of work
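The round-trip idea in these bullets could be sketched roughly as follows. The Meta class below is a toy stand-in, not python-tuf code, and the validate hook in to_bytes() is a hypothetical sketch of the design, not an existing API:

```python
import json


# Toy stand-in for a Signed/Metadata class; the validate hook in
# to_bytes() sketches the "serialize, deserialize, compare" idea.
class Meta:
    def __init__(self, version: int, spec_version: str):
        self.version = version
        self.spec_version = spec_version

    def to_dict(self) -> dict:
        return {"version": self.version, "spec_version": self.spec_version}

    @classmethod
    def from_dict(cls, data: dict) -> "Meta":
        return cls(data["version"], data["spec_version"])

    def __eq__(self, other) -> bool:
        return (
            isinstance(other, Meta)
            and self.version == other.version
            and self.spec_version == other.spec_version
        )

    def to_bytes(self, validate: bool = False) -> bytes:
        raw = json.dumps(self.to_dict()).encode("utf-8")
        if validate:
            # Round trip: deserialize the output and compare instances;
            # raise before returning output that would not load back.
            if Meta.from_dict(json.loads(raw)) != self:
                raise ValueError("round-trip validation failed")
        return raw


md = Meta(1, "1.0.19")
assert md.to_bytes(validate=True) == md.to_bytes()
```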

@jku
Member

jku commented Dec 1, 2021

I agree with the basic premise: we'll want to re-evaluate whether our validation is appropriate for the serialization as well

I'll rephrase this to be more specific: ideally we want to guarantee not just that our internal data structures are correct at serialization time, but that our serialization output is valid and that we can successfully deserialize that output.

@MVrachev
Collaborator Author

MVrachev commented Dec 6, 2021

  • a potential way to 100% prevent incorrect serialization (instead of just checking e.g. specific data types) is during Metadata.to_bytes() to serialize Metadata to bytes, then deserialize back to Metadata, and raise if the two metadata instances are not equal.

    • This might be too heavy (cpu/memory use) to be a default choice but could be an option
    • This would mean we can keep focusing on the deserialization path when validating
    • Metadata does not currently support __eq__() so this would need a bit of work

I see many advantages to that approach:

  • it will require a small amount of code compared to my solution in https://github.com/MVrachev/tuf/tree/serialization-validation, as it only requires additions to Metadata.to_bytes() instead of to each class separately
  • it will use the same level of validation as initialization. That's useful because, if we want to implement more sophisticated checks (like validating that every keyid defined in a root instance's keys dictionary is used by at least one role in roles), we can implement them in one place and call them from both paths. If we implement them inside the class, the two types of validation, during initialization and during serialization, could give different results
  • it's easy to understand

So, I like the idea and I can volunteer for the next sprint to implement it.

The only remaining question is whether we need to think about an API, or some way to do deeper validation during serialization, such as this one:

For example, for Root, it probably could be useful to validate that every keyid defined in the keys dictionary is used by at least one role in roles.

or maybe we don't, actually, since during an update TrustedMetadataSet will raise an exception if something is wrong?

@jku
Member

jku commented Dec 7, 2021

I can volunteer for the next sprint to implement it.

This is unlikely to get implemented in a single sprint, but a design doc or prototype or some infrastructure work might be a good idea (I have no idea how much work there is in adding the __eq__() implementations, as an example).

For example, for Root, it probably could be useful to validate that every keyid defined in the keys dictionary is used by at least one role in roles.

Loading (deserializing) validation should fail only if there is a compliance or safety issue... and I think that's a good rule for serialization as well: deciding that implementations aren't allowed to add keys without using them doesn't clearly improve safety, but it could be annoying to repository implementers who want to do it for whatever reason.

Checks like that would definitely make sense in repository implementations... and possibly even in a repository library.

@sechkova sechkova added the backlog Issues to address with priority for current development goals label Dec 8, 2021
@MVrachev
Collaborator Author

MVrachev commented Dec 8, 2021

We should test how much slower it will get when using this kind of validation.
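One way to get a rough feel for that cost, sketched with the stdlib timeit module and a toy payload (real measurements would of course use actual python-tuf metadata with thousands of targets):

```python
import json
import timeit

# Toy payload standing in for a large targets dict; purely illustrative.
payload = {"targets": {f"file{i}.txt": {"length": i} for i in range(10000)}}

# Baseline: serialize only.
plain = timeit.timeit(lambda: json.dumps(payload), number=50)
# With round-trip validation: serialize, deserialize, compare.
roundtrip = timeit.timeit(
    lambda: json.loads(json.dumps(payload)) == payload, number=50
)
print(f"serialize only:     {plain:.3f}s")
print(f"serialize+validate: {roundtrip:.3f}s")
```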

@sechkova sechkova added this to the Sprint 14 milestone Dec 8, 2021
@MVrachev
Collaborator Author

I created a prototype in my branch here https://github.com/MVrachev/tuf/tree/validation-during-serialization.
Can you have a look @jku?

@jku
Member

jku commented Jan 7, 2022

As a prototype that's pretty much what I was imagining: definitely good enough to make some sort of a decision

  • not sure that validate can rely on from_dict/to_dict -- how do you know the serializer will even call to_dict? Could be that this extra validation should be a Serializer.serialize() feature? Not sure about this as serializer would then have to know which Deserializer to use... I suppose that's not out of the question: JSONSerializer could well know to use JSONDeserializer for validation purposes.
  • the __eq__() implementations are 119 LOC -- not as bad as I feared but obviously makes metadata.py bigger without being useful to client code
  • Metadata.__eq__() seems complex, I don't understand why

Overall: looks like a reasonable direction to me but let's discuss with others too.

@MVrachev
Collaborator Author

MVrachev commented Jan 7, 2022

As a prototype that's pretty much what I was imagining: definitely good enough to make some sort of a decision

  • not sure that validate can rely on from_dict/to_dict -- how do you know the serializer will even call to_dict? Could be that this extra validation should be a Serializer.serialize() feature? Not sure about this as serializer would then have to know which Deserializer to use... I suppose that's not out of the question: JSONSerializer could well know to use JSONDeserializer for validation purposes.

Even if we decide to leave it as a separate API call, I think it could be useful.
It's hard for me to be sure about that when we don't have actual interaction with users of the Metadata API.

  • the __eq__() implementations are 119 LOC -- not as bad as I feared but obviously makes metadata.py bigger without being useful to client code
  • Metadata.__eq__() seems complex, I don't understand why

It's because Metadata.signatures is an OrderedDict whose values are securesystemslib Signature objects, which don't implement __eq__. I added a comment to clarify that.

Overall: looks like a reasonable direction to me but let's discuss with others too.
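The extra complexity described above, a dictionary whose values define no __eq__, can be illustrated with toy stand-ins; nothing below is securesystemslib code:

```python
# Toy stand-in: mimics the securesystemslib Signature class only in
# that it defines no __eq__, so instance comparison falls back to
# identity.
class Signature:
    def __init__(self, keyid: str, sig: str):
        self.keyid = keyid
        self.sig = sig


class Metadata:
    def __init__(self, signatures: dict):
        # In python-tuf this is an OrderedDict of keyid -> Signature.
        self.signatures = signatures

    def __eq__(self, other) -> bool:
        if not isinstance(other, Metadata):
            return False
        if self.signatures.keys() != other.signatures.keys():
            return False
        # Compare signatures field by field, since Signature itself
        # offers no __eq__ and plain dict equality would use identity.
        return all(
            self.signatures[k].keyid == other.signatures[k].keyid
            and self.signatures[k].sig == other.signatures[k].sig
            for k in self.signatures
        )


a = Metadata({"k1": Signature("k1", "abcd")})
b = Metadata({"k1": Signature("k1", "abcd")})
assert a == b  # field-wise comparison succeeds
assert a.signatures != b.signatures  # plain dict equality fails (identity)
```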

@jku
Member

jku commented Jan 7, 2022

not sure that validate can rely on from_dict/to_dict

Thinking a bit further: the prototype ends up calling to_dict() twice when Metadata.to_bytes() is called. The second call is unnecessary if serialize() handles the validation, so that seems like the correct direction for the real implementation.

@MVrachev
Collaborator Author

MVrachev commented Jan 7, 2022

Added a validation constructor argument in JSONSerializer.

@jku jku modified the milestones: Sprint 14, Sprint 15 Jan 12, 2022
@lukpueh
Member

lukpueh commented Jan 12, 2022

Thanks for working on this, @MVrachev and @jku! I just caught up on your discussion and read through the prototype and I like it!

I also think that validate should be implemented as part of JSONSerializer.serialize, because that's really what it validates, i.e. "is the json serialize/deserialize loop idempotent?". Then we also don't have to call to_dict twice, and it does not feel odd that we use to_dict/from_dict.

@MVrachev
Collaborator Author

Thanks for working on this, @MVrachev and @jku! I just caught up on your discussion and read through the prototype and I like it!

I also think that validate should be implemented as part of JSONSerializer.serialize, because that's really what it validates, i.e. "is the json serialize/deserialize loop idempotent?". Then we also don't have to call to_dict twice, and it does not feel odd that we use to_dict/from_dict.

@lukpueh do you think it should be an argument, or is it okay to do it during initialization?
Also, do you think it makes sense to add it to the MetadataSerializer as well?

My question to both of you, @jku and @lukpueh, is: how should we move forward?
Should I propose a pr? Do you think I can/should experiment with alternative solutions?
Do we want to add this API now, before 1.0.0?

@lukpueh
Member

lukpueh commented Jan 14, 2022

@lukpueh do you think it should be an argument, or is it okay to do it during initialization?

Initialization of what? Regardless, given the computational cost of the validation, I'd make it optional.

Also, do you think it makes sense to add it to the MetadataSerializer as well?

No. I think this is something specific to JSONSerializer.

Should I propose a pr?

Yes.

Do you think I can/should experiment with alternative solutions?

Do you have something on your mind? I think this sounds good for the time being.

Do we want to add this API now before 1.0.0?

It would make 1.0.0 stronger. But I also see no problem in adding this in a subsequent minor release. Let's prioritize things that need to go into 1.0.0.

@MVrachev
Collaborator Author

MVrachev commented Jan 14, 2022

@lukpueh do you think it should be an argument, or is it okay to do it during initialization?

Initialization of what? Regardless, given the computational cost of the validation, I'd make it optional.

There are two options for how to trigger the validation through the JSONSerializer:

  1. pass a validate argument when creating a JSONSerializer object.
  2. pass a validate argument when calling JSONSerializer.serialize()

I think the first option makes more sense: once a JSONSerializer is created, I see no use case where the user would want to change its behavior regarding validation.
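Option 1 could look roughly like this. This is a sketch, not the prototype's actual code: the toy Obj class and the exact signatures are illustrative assumptions:

```python
import json


class Obj:
    """Toy object with the to_dict/from_dict/__eq__ trio the
    round-trip check relies on; stands in for Metadata."""

    def __init__(self, v: int):
        self.v = v

    def to_dict(self) -> dict:
        return {"v": self.v}

    @classmethod
    def from_dict(cls, data: dict) -> "Obj":
        return cls(data["v"])

    def __eq__(self, other) -> bool:
        return isinstance(other, Obj) and self.v == other.v


class JSONSerializer:
    """Sketch of option 1: validate is configured at construction,
    like the existing compact flag, and applied inside serialize()."""

    def __init__(self, compact: bool = False, validate: bool = False):
        self.compact = compact
        self.validate = validate

    def serialize(self, obj) -> bytes:
        indent = None if self.compact else 1
        separators = (",", ":") if self.compact else (",", ": ")
        raw = json.dumps(
            obj.to_dict(), indent=indent, separators=separators
        ).encode("utf-8")
        if self.validate:
            # Round trip: deserialize the freshly produced bytes and
            # compare with the original object before returning.
            if type(obj).from_dict(json.loads(raw)) != obj:
                raise ValueError("round-trip validation failed")
        return raw


assert JSONSerializer(compact=True, validate=True).serialize(Obj(1)) == b'{"v":1}'
```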

Do you think I can/should experiment with alternative solutions?

Do you have something on your mind? I think this sounds good for the time being.

No, I don't have anything else in mind.

Do we want to add this API now before 1.0.0?

It would make 1.0.0 stronger. But I also see no problem in adding this in a subsequent minor release. Let's prioritize things that need to go into 1.0.0.

I think it makes sense to include it in 1.0.0, as we have already discussed it broadly, which will make the review easier.
Will prepare a pr soon.

@lukpueh
Member

lukpueh commented Jan 14, 2022

There are two options for how to trigger the validation through the JSONSerializer:

  1. pass a validate argument when creating a JSONSerializer object.
  2. pass a validate argument when calling JSONSerializer.serialize()

I think the first option makes more sense: once a JSONSerializer is created, I see no use case where the user would want to change its behavior regarding validation.

Oh right, I remember. The first option is also consistent with how we configure compact. We only have to ask ourselves how we want to expose this feature in Metadata.to_bytes:

```python
if serializer is None:
    # Use local scope import to avoid circular import errors
    # pylint: disable=import-outside-toplevel
    from tuf.api.serialization.json import JSONSerializer

    serializer = JSONSerializer(compact=True)
return serializer.serialize(self)
```

Given the computational cost of validation, I lean towards a default of validate=False and not changing to_bytes for the time being, so that in order to use the feature a user would have to call e.g.

metadata.to_bytes(JSONSerializer(compact=True, validate=True))

We can discuss this more on the PR.

@jku
Member

jku commented Jan 14, 2022

Agreed with the discussion above:

  • discussed approach looks sensible and we're reasonably sure it'll work
  • API change will not be a breaking one
  • would be good to have in 1.0 but I feel based on the points above that this does not need to be a release blocker anymore

MVrachev added a commit to MVrachev/securesystemslib that referenced this issue Jan 18, 2022
Add Signature.__eq__() method to support comparison between signature
objects.
The need for this method came up when solving issue #1696 in TUF;
see theupdateframework/python-tuf#1696 (comment).
Issue #1696 in TUF is about adding a validation API after metadata
object initialization, and because signatures in TUF are stored in
dictionaries whose values are Signature objects, the need for
Signature.__eq__() became apparent.

Signed-off-by: Martin Vrachev <[email protected]>
@jku jku modified the milestones: Sprint 15, sprint 16 Jan 26, 2022
@jku jku removed this from the sprint 16 milestone Feb 9, 2022
@jku jku added this to the sprint 17 milestone Feb 9, 2022
MVrachev added a commit to MVrachev/securesystemslib that referenced this issue Feb 9, 2022 (same Signature.__eq__() commit message as above)
MVrachev added a commit to MVrachev/securesystemslib that referenced this issue Feb 16, 2022 (same Signature.__eq__() commit message as above)
@ivanayov ivanayov modified the milestones: sprint 17, sprint 18 Feb 23, 2022
@jku jku closed this as completed in #1775 Feb 28, 2022
5 participants