
Metadata API: Provide a way to validate object when serializing to dictionary #1696

Closed
MVrachev opened this issue Nov 26, 2021 · 17 comments · Fixed by #1775
Labels: backlog (Issues to address with priority for current development goals)

@MVrachev
Collaborator

Description of issue or feature request:

This issue was discussed during a couple of conversations with @jku.

We can say we are almost done with Metadata API validation during the initialization of Signed objects as summarized here: #1140 (comment).

What we didn't focus on is validation when serializing Signed objects to dictionaries through Signed.to_dict().
This could be useful for users of the Metadata API who have changed the signed portion.

I imagine some of the requirements for such validation are:

  1. It's optional. As pointed out by @jku in our conversation, a client using the Metadata API could provide its own validation of the metadata objects. Additionally, if this kind of validation were mandatory, it could slow down a big python-tuf client such as Warehouse working on thousands of targets.
  2. It doesn't duplicate validation code. It would be best if we find a way to reuse the exact same code as the mandatory validation during initialization, without duplicating it.

One thing to consider is whether we want to add somewhat more complex validation, in addition to the validation done during initialization.
For example, for Root, it probably could be useful to validate that every keyid defined in the keys dictionary is used by at least one role in roles.
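Such a check could be sketched like this; the function name and the dictionary shape are illustrative, not part of the Metadata API:

```python
# Hypothetical sketch of the stricter Root check discussed above:
# flag any keyid defined in "keys" that no role in "roles" references.
def unused_keyids(root_dict: dict) -> set:
    defined = set(root_dict["keys"])
    used = set()
    for role in root_dict["roles"].values():
        used.update(role["keyids"])
    return defined - used


root = {
    "keys": {"keyid-a": {}, "keyid-b": {}},
    "roles": {
        "root": {"keyids": ["keyid-a"]},
        "targets": {"keyids": ["keyid-a"]},
    },
}
print(unused_keyids(root))  # "keyid-b" is defined but never used
```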

Current behavior:
No validation is done when calling to_dict() from any of the Signed child classes.

Expected behavior:
Provide a way to validate an object before calling to_dict() or when calling to_dict() with an explicit option set.

PS: I didn't include the signatures part of the Metadata objects in this discussion, as their validation is a little more complex: you need access to the delegator.

@MVrachev
Collaborator Author

I decided to create one prototype of a possible solution: https://github.com/MVrachev/tuf/tree/serialization-validation.

@jku
Member

jku commented Nov 30, 2021

I agree with the basic premise: we'll want to re-evaluate whether our validation is appropriate for the serialization as well (while we've been concentrating on the client case, the focus has been on validating deserialized data).

some quick comments:

  • the idea of validating when creating a dict is not necessarily the correct one: to_dict() is a helper that json serialization happens to use. There's no guarantee that doing so makes sense for some possible other serialization format. We might want to consider validating during Metadata.to_bytes() if possible
  • a potential way to 100% prevent incorrect serialization (instead of just checking e.g. specific data types) is during Metadata.to_bytes() to serialize Metadata to bytes, then deserialize back to Metadata, and raise if the two metadata instances are not equal.
    • This might be too heavy (cpu/memory use) to be a default choice but could be an option
    • This would mean we can keep focusing on the deserialization path when validating
    • Metadata does not currently support __eq__() so this would need a bit of work
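The round-trip idea in these bullets could be sketched roughly as follows. The Meta class below is a toy stand-in, not python-tuf code, and the validate hook in to_bytes() is a hypothetical sketch of the design, not an existing API:

```python
import json


# Toy stand-in for a Signed/Metadata class; the validate hook in
# to_bytes() sketches the "serialize, deserialize, compare" idea.
class Meta:
    def __init__(self, version: int, spec_version: str):
        self.version = version
        self.spec_version = spec_version

    def to_dict(self) -> dict:
        return {"version": self.version, "spec_version": self.spec_version}

    @classmethod
    def from_dict(cls, data: dict) -> "Meta":
        return cls(data["version"], data["spec_version"])

    def __eq__(self, other) -> bool:
        return (
            isinstance(other, Meta)
            and self.version == other.version
            and self.spec_version == other.spec_version
        )

    def to_bytes(self, validate: bool = False) -> bytes:
        raw = json.dumps(self.to_dict()).encode("utf-8")
        if validate:
            # Round trip: deserialize the output and compare instances;
            # raise before returning output that would not load back.
            if Meta.from_dict(json.loads(raw)) != self:
                raise ValueError("round-trip validation failed")
        return raw


md = Meta(1, "1.0.19")
assert md.to_bytes(validate=True) == md.to_bytes()
```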

@jku
Member

jku commented Dec 1, 2021

I agree with the basic premise: we'll want to re-evaluate whether our validation is appropriate for the serialization as well

I'll rephrase this to be more specific: ideally we want to guarantee not just that our internal data structures are correct at serialization time, but that our serialization output is valid and that we can successfully deserialize that output.

@MVrachev
Collaborator Author

MVrachev commented Dec 6, 2021

  • a potential way to 100% prevent incorrect serialization (instead of just checking e.g. specific data types) is during Metadata.to_bytes() to serialize Metadata to bytes, then deserialize back to Metadata, and raise if the two metadata instances are not equal.

    • This might be too heavy (cpu/memory use) to be a default choice but could be an option
    • This would mean we can keep focusing on the deserialization path when validating
    • Metadata does not currently support __eq__() so this would need a bit of work

I see many advantages to that approach:

  • it will require a small amount of code compared to my solution in https://github.com/MVrachev/tuf/tree/serialization-validation, as it only requires additions to Metadata.to_bytes() instead of to each class separately
  • it will use the same level of validation as initialization. That's useful because, if we want to implement more sophisticated checks (like validating that every keyid defined in a root instance's keys dictionary is used by at least one role in roles), we can implement them in one place and call them from both paths. If we implement them inside the class, the two types of validation, during initialization and during serialization, could give different results
  • it's easy to understand

So, I like the idea and I can volunteer for the next sprint to implement it.

The only remaining question is whether we need to think about an API, or some way to do deeper validation during serialization, such as this one:

For example, for Root, it probably could be useful to validate that every keyid defined in the keys dictionary is used by at least one role in roles.

or maybe we don't, actually, since during an update TrustedMetadataSet will raise an exception if something is wrong?

@jku
Member

jku commented Dec 7, 2021

I can volunteer for the next sprint to implement it.

This is unlikely to get implemented in a single sprint, but a design doc or prototype or some infrastructure work might be a good idea (I have no idea how much work there is in adding the __eq__() implementations, as an example).

For example, for Root, it probably could be useful to validate that every keyid defined in the keys dictionary is used by at least one role in roles.

Loading (deserializing) validation should fail only if there is a compliance or safety issue... and I think that's a good rule for serialization as well: deciding that implementations aren't allowed to add keys without using them doesn't clearly improve safety, but it could be annoying to repository implementers who want to do it for whatever reason.

Checks like that would definitely make sense in repository implementations... and possibly even in a repository library.

@sechkova sechkova added the backlog Issues to address with priority for current development goals label Dec 8, 2021
@MVrachev
Collaborator Author

MVrachev commented Dec 8, 2021

We should test how much slower it will get when using this kind of validation.
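One way to get a rough feel for that cost, sketched with the stdlib timeit module and a toy payload (real measurements would of course use actual python-tuf metadata with thousands of targets):

```python
import json
import timeit

# Toy payload standing in for a large targets dict; purely illustrative.
payload = {"targets": {f"file{i}.txt": {"length": i} for i in range(10000)}}

# Baseline: serialize only.
plain = timeit.timeit(lambda: json.dumps(payload), number=50)
# With round-trip validation: serialize, deserialize, compare.
roundtrip = timeit.timeit(
    lambda: json.loads(json.dumps(payload)) == payload, number=50
)
print(f"serialize only:     {plain:.3f}s")
print(f"serialize+validate: {roundtrip:.3f}s")
```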

@sechkova sechkova added this to the Sprint 14 milestone Dec 8, 2021
@MVrachev
Collaborator Author

I created a prototype in my branch here https://github.com/MVrachev/tuf/tree/validation-during-serialization.
Can you have a look @jku?

@jku
Member

jku commented Jan 7, 2022

As a prototype that's pretty much what I was imagining: definitely good enough to make some sort of a decision

  • not sure that validate can rely on from_dict/to_dict -- how do you know the serializer will even call to_dict? Could be that this extra validation should be a Serializer.serialize() feature? Not sure about this as serializer would then have to know which Deserializer to use... I suppose that's not out of the question: JSONSerializer could well know to use JSONDeserializer for validation purposes.
  • the __eq__() implementations are 119 LOC -- not as bad as I feared but obviously makes metadata.py bigger without being useful to client code
  • Metadata.__eq__() seems complex, I don't understand why

Overall: looks like a reasonable direction to me but let's discuss with others too.

@MVrachev
Collaborator Author

MVrachev commented Jan 7, 2022

As a prototype that's pretty much what I was imagining: definitely good enough to make some sort of a decision

  • not sure that validate can rely on from_dict/to_dict -- how do you know the serializer will even call to_dict? Could be that this extra validation should be a Serializer.serialize() feature? Not sure about this as serializer would then have to know which Deserializer to use... I suppose that's not out of the question: JSONSerializer could well know to use JSONDeserializer for validation purposes.

Even if we decide to leave it as a separate API call, I think it could be useful.
It's hard for me to be sure about that when we don't have actual interaction with users of the Metadata API.

  • the __eq__() implementations are 119 LOC -- not as bad as I feared but obviously makes metadata.py bigger without being useful to client code
  • Metadata.__eq__() seems complex, I don't understand why

It's because Metadata.signatures is an OrderedDict whose values are securesystemslib Signature objects, which don't implement __eq__. I added a comment to clarify that.

Overall: looks like a reasonable direction to me but let's discuss with others too.
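The extra complexity described above, a dictionary whose values define no __eq__, can be illustrated with toy stand-ins; nothing below is securesystemslib code:

```python
# Toy stand-in: mimics the securesystemslib Signature class only in
# that it defines no __eq__, so instance comparison falls back to
# identity.
class Signature:
    def __init__(self, keyid: str, sig: str):
        self.keyid = keyid
        self.sig = sig


class Metadata:
    def __init__(self, signatures: dict):
        # In python-tuf this is an OrderedDict of keyid -> Signature.
        self.signatures = signatures

    def __eq__(self, other) -> bool:
        if not isinstance(other, Metadata):
            return False
        if self.signatures.keys() != other.signatures.keys():
            return False
        # Compare signatures field by field, since Signature itself
        # offers no __eq__ and plain dict equality would use identity.
        return all(
            self.signatures[k].keyid == other.signatures[k].keyid
            and self.signatures[k].sig == other.signatures[k].sig
            for k in self.signatures
        )


a = Metadata({"k1": Signature("k1", "abcd")})
b = Metadata({"k1": Signature("k1", "abcd")})
assert a == b  # field-wise comparison succeeds
assert a.signatures != b.signatures  # plain dict equality fails (identity)
```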

@jku
Member

jku commented Jan 7, 2022

not sure that validate can rely on from_dict/to_dict

Thinking a bit further: the prototype ends up calling to_dict() twice when Metadata.to_bytes() is called. The second call is unnecessary if serialize() handles the validation, so that seems like the correct direction for the real implementation.

@MVrachev
Collaborator Author

MVrachev commented Jan 7, 2022

Added a validation constructor argument in JSONSerializer.

@jku jku modified the milestones: Sprint 14, Sprint 15 Jan 12, 2022
@lukpueh
Member

lukpueh commented Jan 12, 2022

Thanks for working on this, @MVrachev and @jku! I just caught up on your discussion and read through the prototype and I like it!

I also think that validate should be implemented as part of JSONSerializer.serialize, because that's really what it validates, i.e. "is the json serialize/deserialize loop idempotent?". Then we also don't have to call to_dict twice, and it does not feel odd that we use to_dict/from_dict.

@MVrachev
Collaborator Author

Thanks for working on this, @MVrachev and @jku! I just caught up on your discussion and read through the prototype and I like it!

I also think that validate should be implemented as part of JSONSerializer.serialize, because that's really what it validates, i.e. "is the json serialize/deserialize loop idempotent?". Then we also don't have to call to_dict twice, and it does not feel odd that we use to_dict/from_dict.

@lukpueh do you think it should be an argument, or is it okay to do it during initialization?
Also, do you think it makes sense to add it to the MetadataSerializer as well?

My question to both of you, @jku and @lukpueh, is: how should we move forward?
Should I propose a pr? Do you think I can/should experiment with alternative solutions?
Do we want to add this API now, before 1.0.0?

@lukpueh
Member

lukpueh commented Jan 14, 2022

@lukpueh do you think it should be an argument, or is it okay to do it during initialization?

Initialization of what? Regardless, given the computational cost of the validation, I'd make it optional.

Also, do you think it makes sense to add it to the MetadataSerializer as well?

No. I think this is something specific to JSONSerializer.

Should I propose a pr?

Yes.

Do you think I can/should experiment with alternative solutions?

Do you have something on your mind? I think this sounds good for the time being.

Do we want to add this API now before 1.0.0?

It would make 1.0.0 stronger. But I also see no problem in adding this in a subsequent minor release. Let's prioritize things that need to go into 1.0.0.

@MVrachev
Collaborator Author

MVrachev commented Jan 14, 2022

@lukpueh do you think it should be an argument, or is it okay to do it during initialization?

Initialization of what? Regardless, given the computational cost of the validation, I'd make it optional.

There are two options for how to trigger the validation through the JSONSerializer:

  1. pass a validate argument when creating a JSONSerializer object.
  2. pass a validate argument when calling JSONSerializer.serialize()

I think the first option makes more sense: once a JSONSerializer is created, I see no use case where the user would want to change its behavior regarding validation.
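Option 1 could look roughly like this. This is a sketch, not the prototype's actual code: the toy Obj class and the exact signatures are illustrative assumptions:

```python
import json


class Obj:
    """Toy object with the to_dict/from_dict/__eq__ trio the
    round-trip check relies on; stands in for Metadata."""

    def __init__(self, v: int):
        self.v = v

    def to_dict(self) -> dict:
        return {"v": self.v}

    @classmethod
    def from_dict(cls, data: dict) -> "Obj":
        return cls(data["v"])

    def __eq__(self, other) -> bool:
        return isinstance(other, Obj) and self.v == other.v


class JSONSerializer:
    """Sketch of option 1: validate is configured at construction,
    like the existing compact flag, and applied inside serialize()."""

    def __init__(self, compact: bool = False, validate: bool = False):
        self.compact = compact
        self.validate = validate

    def serialize(self, obj) -> bytes:
        indent = None if self.compact else 1
        separators = (",", ":") if self.compact else (",", ": ")
        raw = json.dumps(
            obj.to_dict(), indent=indent, separators=separators
        ).encode("utf-8")
        if self.validate:
            # Round trip: deserialize the freshly produced bytes and
            # compare with the original object before returning.
            if type(obj).from_dict(json.loads(raw)) != obj:
                raise ValueError("round-trip validation failed")
        return raw


assert JSONSerializer(compact=True, validate=True).serialize(Obj(1)) == b'{"v":1}'
```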

Do you think I can/should experiment with alternative solutions?

Do you have something on your mind? I think this sounds good for the time being.

No, I don't have anything else in mind.

Do we want to add this API now before 1.0.0?

It would make 1.0.0 stronger. But I also see no problem in adding this in a subsequent minor release. Let's prioritize things that need to go into 1.0.0.

I think it makes sense to include it in 1.0.0, as we have already discussed it broadly, which will make the review easier.
Will prepare a pr soon.

@lukpueh
Member

lukpueh commented Jan 14, 2022

There are two options for how to trigger the validation through the JSONSerializer:

  1. pass a validate argument when creating a JSONSerializer object.
  2. pass a validate argument when calling JSONSerializer.serialize()

I think the first option makes more sense: once a JSONSerializer is created, I see no use case where the user would want to change its behavior regarding validation.

Oh right, I remember. The first option is also consistent with how we configure compact. We only have to ask ourselves how we want to expose this feature in Metadata.to_bytes:

```python
if serializer is None:
    # Use local scope import to avoid circular import errors
    # pylint: disable=import-outside-toplevel
    from tuf.api.serialization.json import JSONSerializer

    serializer = JSONSerializer(compact=True)
return serializer.serialize(self)
```

Given the computational cost of validation, I lean towards a default of validate=False and not changing to_bytes for the time being, so that in order to use the feature a user would have to call e.g.

metadata.to_bytes(JSONSerializer(compact=True, validate=True))

We can discuss this more on the PR.

@jku
Member

jku commented Jan 14, 2022

Agreed with the discussion above:

  • discussed approach looks sensible and we're reasonably sure it'll work
  • API change will not be a breaking one
  • would be good to have in 1.0 but I feel based on the points above that this does not need to be a release blocker anymore

MVrachev added a commit to MVrachev/securesystemslib that referenced this issue Jan 18, 2022
Add Signature.__eq__() method to support comparison between signature
objects.
The need for this method came up when solving issue #1696 in TUF;
see theupdateframework/python-tuf#1696 (comment).
Issue #1696 in TUF is about adding a validation API after metadata
object initialization, and because signatures in TUF are stored in
dictionaries whose values are Signature objects, the need for
Signature.__eq__() became apparent.

Signed-off-by: Martin Vrachev <[email protected]>
@jku jku modified the milestones: Sprint 15, sprint 16 Jan 26, 2022
@jku jku removed this from the sprint 16 milestone Feb 9, 2022
@jku jku added this to the sprint 17 milestone Feb 9, 2022
MVrachev added a commit to MVrachev/securesystemslib that referenced this issue Feb 9, 2022 (same Signature.__eq__() commit message as above)
MVrachev added a commit to MVrachev/securesystemslib that referenced this issue Feb 16, 2022 (same Signature.__eq__() commit message as above)
@ivanayov ivanayov modified the milestones: sprint 17, sprint 18 Feb 23, 2022
@jku jku closed this as completed in #1775 Feb 28, 2022
5 participants