updater: abstract out the network IO #1213


Closed
jku opened this issue Nov 13, 2020 · 34 comments
Labels
client (Related to the client (updater) implementation), enhancement

Comments

@jku
Member

jku commented Nov 13, 2020

This might be relevant to the Updater redesign (#1135) and, if accepted, would deprecate #1142 and the PR #1171.

We (me, Joshua, Martin, Teodora) have been talking about abstracting some of the client functionality out of the Updater itself. The biggest issue from my perspective is network IO. Teodora already made a PR to let the application download targets, but it seems there are still issues with TUF handling metadata downloads.

Why is this needed?

  • In the real world, applications already use a network stack and will keep using it after integrating TUF: we should not force another one on them
  • Even if the network stacks of the application and TUF are the same, the fact that they use different sessions and configurations is not great
  • Complex applications have legitimate needs to configure a lot of things we don't want to provide an API for: user agent, proxies, basic authentication, custom request headers. This applies to both metadata and targets
  • Complex applications have legitimate needs to control the download process (e.g. progress information, canceling)
  • Complex applications have (legitimate?) needs to poke at low-level details like timeouts

Potential solutions

We identified two main solutions to this:

  1. Make a new event-based, non-blocking client API. This would be the most flexible option, but also more complex for TUF maintainers to maintain and for application developers to customize
  2. Keep the current API but add a new Fetcher interface that applications can optionally implement. This is likely fairly easy and non-invasive to implement, but remains a blocking API

I'm proposing option 2, but for reference please see the draft of option 1 as well.

Proposal

Add a Fetcher interface that applications can implement. Provide a default implementation of Fetcher. Add a new method to Updater that Fetcher can use to provide the data it fetches.

The Updater processes (refresh(), get_one_valid_targetinfo(), and download_target()) will then work like this:

  • Whenever a remote file (metadata or target) is needed:
    • set up a temporary file to write results to
    • call Fetcher.fetch()
      • the fetcher calls Updater.provide_fetched_data() zero or more times to provide chunks of data; the Updater writes these chunks into the file
    • when the fetcher returns without exceptions, the download is finished and written to the file

This is like the go-tuf RemoteStore abstraction, with two differences:

  1. Python does not have a reasonable stream abstraction like io.ReadCloser (one that would actually be implemented by all of the network stacks), so we cannot return something like that: instead our implementation blocks and adds a provide_fetched_data() callback into Updater.
  2. Metadata and target fetching are not separated: this way the Fetcher does not need any understanding of TUF or server structure; it's just a dumb downloader.

import abc

# Only new/changed methods mentioned for Updater
class Updater(object):
    # init now accepts an optional fetcher argument
    def __init__(self, repository_name, repository_mirrors, fetcher: Fetcher = None):
        ...

    # Accepts content of the url that is currently being fetched.
    # Can only be called from the Fetcher.fetch() that this Updater called.
    def provide_fetched_data(self, data: bytes):
        ...

# New interface for applications to implement
class Fetcher(metaclass=abc.ABCMeta):
    # Fetches the contents of an HTTP/HTTPS url from a remote server. Calls
    # self.updater.provide_fetched_data() to forward sequential chunks of
    # bytes to the updater. Returns when the download is complete and all
    # bytes have been fed to the updater.
    @abc.abstractmethod
    def fetch(self, url: str, length: int):
        pass

    # Called by updater init
    def set_updater(self, updater: Updater):
        self.updater = updater
I think this is fairly straightforward to implement even without a client redesign (and will be backwards-compatible). download.py is split into two parts: one part contains the tempfile handling bits and _check_downloaded_length(), and is used by the updater itself; the rest of download.py forms the default Fetcher implementation.
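
For illustration, a minimal sketch of what that default Fetcher could look like on top of requests -- the class name, chunk size, and session handling here are assumptions, not final API:

import requests

class RequestsFetcher(Fetcher):
    # Sketch of a default Fetcher: streams the response and forwards each
    # chunk to the Updater via the proposed callback.
    def __init__(self, chunk_size: int = 8192):
        self._session = requests.Session()
        self._chunk_size = chunk_size

    def fetch(self, url: str, length: int):
        with self._session.get(url, stream=True) as response:
            response.raise_for_status()
            for chunk in response.iter_content(chunk_size=self._chunk_size):
                # self.updater was set by Updater init via set_updater()
                self.updater.provide_fetched_data(chunk)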

@trishankatdatadog
Member

💯

This actually happened with the Datadog Agent: some of our customers needed to use TLS proxies, and we were using custom networking code back then that simply couldn't handle it. Hence the switch to requests, but as you say, it's not great that we use our own configuration and sessions there...

@jku
Member Author

jku commented Nov 13, 2020

With regard to parallel downloads: making sure that a fancy custom Fetcher could download multiple URLs in parallel seems relatively simple -- the design doesn't have to implement that, but it should be taken into account so that adding it later is possible without breaking the API.

@lukpueh
Member

lukpueh commented Nov 20, 2020

Thanks for sketching out two approaches in such great detail, @jku!

Regarding (2), may I ask why Updater.fetcher.fetch() calls back into Updater.fetcher.updater.provide_fetched_data()? And where is fetch() called in the first place?

Couldn't we just make fetch return the data directly to the caller (i.e. some updater method)? I (naively) picture this as similar to how we use the securesystemslib.storage.StorageBackendInterface methods.

Either way, and as @joshuagl has mentioned in #1142 (comment), if we're blocked until the I/O provider gives us everything they fetched, then we cannot protect against endless data or slow retrieval attacks (see #932).

Maybe we could tell our users that, if they give us a blocking fetch, they have to protect against those themselves. Or they give us a non-blocking fetch that allows us to incrementally receive data and perform checks on size and rate constraints.

@jku
Member Author

jku commented Nov 20, 2020

Regarding (2), may I ask why Updater.fetcher.fetch() calls back into Updater.fetcher.updater.provide_fetched_data()? And where is fetch() called in the first place?

fetch() is called in the same two places in the updater that currently call tuf.download.*safe_download().

The issues that provide_fetched_data() tries to solve are:

  • we don't want to return an array of bytes from fetch() because, in the case of targets, that could be very large
  • as far as I can tell there's no nice, commonly supported "stream" abstraction in Python that fetch() could immediately return, like the go-tuf version does (in e.g. requests the only thing that comes close is requests.Response.raw, and it's a file object but... a little rough to use. In aiohttp the comparable object is aiohttp.StreamReader, which looks fine... but are these two equivalent 🤷?)
  • even if fetch() could return a stream/file object, one of the goals of the design is that the application should be able to do things like progress notifications: so the application has to be able to process the stream "as it flows" anyway

So as a compromise the design requires fetch() to deliver the bytes in chunks by calling provide_fetched_data() multiple times before returning from fetch() -- I think this leads to the simplest Fetcher and Updater implementations. This way:

  • large files are never fully in memory
  • we don't make assumptions about the implementation of Fetcher: any HTTP stack should be able to return bytes in chunks (and the default implementation is already 99% done in tuf.download)
  • fetch() will indeed block, but TUF can still do the endless data and slow retrieval protections (if we figure out how to do them) because Updater.provide_fetched_data() runs regularly -- with the caveat that the individual chunk download and the chunk size are in Fetcher control
  • also, we can still make sure that the data is only processed once (by updating the hash digest in provide_fetched_data() before writing data to the tempfile, as sketched below), so this should still allow us to be as efficient as we want to be
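
A minimal sketch of that idea, with illustrative attribute names (the temp file, digest, and counters are hypothetical and would be set up by the updater before calling fetch()):

import hashlib

class Updater(object):
    # Sketch only: _temp_file, _digest, _bytes_received and _expected_length
    # are hypothetical attributes initialized before Fetcher.fetch() is called.
    def provide_fetched_data(self, data: bytes):
        self._bytes_received += len(data)
        if self._bytes_received > self._expected_length:
            raise ValueError("endless data: received more bytes than expected")
        self._digest.update(data)      # hash incrementally, touch each byte once
        self._temp_file.write(data)    # then persist the chunk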

Does that explain my thinking?

@lukpueh
Member

lukpueh commented Nov 23, 2020

Thanks a ton for the clarification, @jku! This is a pretty smart solution. I am a bit concerned about the coupling in this design, but that may be ill-founded. Here is what I'm concerned about:

  1. It may not be obvious to implementers of fetch that they need to call a seemingly unrelated provide_fetched_data.
  2. provide_fetched_data has a side effect, i.e. writing to a file, which depends on state managed two function calls earlier in the stack: when our updater code calls fetch, we need to create a temporary file and let provide_fetched_data know that it shall henceforth write to that file, until fetch returns to us and we close and persist.

What if, instead of writing to a function, we asked fetch to write to a passed tmp file pointer? This might be a slightly more obvious usage pattern for the fetch interface implementer, and we get built-in enter/exit handling...

with tempfile.SpooledTemporaryFile() as fp_for_fetched_data:  # use a spooled tmp file to indeed unburden memory
    self.fetcher.fetch(url, length, fp_for_fetched_data)  # fetch must write to the passed file
    # Persist file once fetch has returned...

And to protect against endless data and slow retrieval attacks, we can pass a custom tmp file object with an augmented write method that adds length and rate checks. The file already knows how much has been written so far (endless data), and it could easily be extended to know when we started writing (slow retrieval).
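
A minimal sketch of such a file object, assuming illustrative names and thresholds:

import io
import time

class CheckingFile(io.RawIOBase):
    # Hypothetical wrapper: enforces a maximum size (endless data) and a
    # minimum average download rate (slow retrieval) on every write().
    def __init__(self, fileobj, max_size: int, min_bytes_per_sec: float):
        self._fileobj = fileobj
        self._max_size = max_size
        self._min_rate = min_bytes_per_sec
        self._written = 0
        self._start = time.monotonic()

    def writable(self):
        return True

    def write(self, data: bytes) -> int:
        self._written += len(data)
        if self._written > self._max_size:
            raise IOError("endless data: exceeded expected length")
        elapsed = time.monotonic() - self._start
        if elapsed > 1 and self._written / elapsed < self._min_rate:
            raise IOError("slow retrieval: average rate below threshold")
        return self._fileobj.write(data)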

What do you think?

@jku
Member Author

jku commented Nov 23, 2020

What if, instead of writing to a function, we asked fetch to write to a passed tmp file pointer?

I like it; it has advantages (and it does not require the Fetcher to figure out what a "file object" actually means).

My only dislike is that it still leaves a "file object" in the API, and Python file objects seem to defy definition... I know this is probably not a Pythonic way to think, but I just do not understand what the actual interface is that we promise to implement there. A non-seekable, non-readable io.RawIOBase, maybe? Or just "any object that has a write(bytes) method"?

Anyway, I can get over that ambiguity: passing a file object from updater to fetcher looks like a reasonable solution.

@lukpueh
Member

lukpueh commented Nov 23, 2020

You are right, I didn't think about "hiding/protecting" the file object from the fetch implementer. But I guess it's in their best interest not to do anything other than write(bytes), akin to what they'd do with your provide_fetched_data.

@trishankatdatadog
Member

What guarantees that the Fetcher would only call provide_fetched_data() on the Updater, and nothing else?

@lukpueh
Member

lukpueh commented Nov 23, 2020

Good question, @trishankatdatadog. In my proposal the fetcher doesn't need a reference to the updater object.

@jku
Member Author

jku commented Nov 23, 2020

What guarantees that the Fetcher would only call provide_fetched_data() on the Updater, and nothing else?

Good question: it would make sense to make all other function calls on the same object fail while a fetch() call is in progress, just in case. I'm pretty sure this applies to both the original and Lukas' versions -- in both cases calling Updater.refresh() from Fetcher.fetch() would lead to surprising results.

In my proposal the fetcher doesn't need a reference to the updater object

The application is going to have that already, so this is not really a defense.

Protecting against accidentally calling the Updater API while in Fetcher.fetch() sounds reasonable and not difficult (a sketch is below), but I don't want to go too far down that path: the application will always be able to break TUF if it wants to. The best we can do is make accidental misuse more difficult.
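
Something like this hypothetical guard would do (the _fetch_in_progress flag is an illustrative name, set around the fetch() call):

import functools

def fail_during_fetch(method):
    # Reject Updater API calls while a fetch() started by this Updater
    # has not yet returned.
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if self._fetch_in_progress:
            raise RuntimeError(f"{method.__name__}() called during fetch()")
        return method(self, *args, **kwargs)
    return wrapper

class Updater(object):
    _fetch_in_progress = False

    @fail_during_fetch
    def refresh(self):
        ...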

@trishankatdatadog
Member

trishankatdatadog commented Nov 23, 2020

Is there a way to apply the Hollywood principle here: "don't call us, we'll call you"?

@lukpueh
Member

lukpueh commented Nov 23, 2020

In my proposal the fetcher doesn't need a reference to the updater object

The application is going to have that already, so this is not really a defense.

What do you mean by "Application"? I don't think we'll have access to an updater object inside fetch unless it is stored in some global context. But yes, I agree, ...

the application will always be able to break TUF if it wants to. The best we can do is make accidental misuse more difficult.

@jku
Member Author

jku commented Nov 23, 2020

What do you mean by "Application"? I don't think we'll have access to an updater object inside fetch unless it is stored in some global context.

Oh, just that if a client application implements a custom Fetcher, it creates both the Updater and the Fetcher... so the latter may have a reference to the Updater -- we can't know.

@jku jku added the client label Nov 27, 2020
@joshuagl
Member

Some great discussion here that has resulted in a refined design, thanks all! If I'm following correctly, we've agreed on the following to abstract out the network I/O:

import abc
from typing import BinaryIO, Optional

# Only new/changed methods mentioned for Updater
class Updater(object):
    # init now accepts an optional fetcher argument
    def __init__(self, repository_name, repository_mirrors, fetcher: Optional[Fetcher] = None):
        ...

# New interface for applications to implement
class Fetcher(metaclass=abc.ABCMeta):
    # Fetches the contents of an HTTP/HTTPS url from a remote server and writes
    # them to the provided file object. Returns when the download is complete
    # and all bytes have been written to the file object.
    @abc.abstractmethod
    def fetch(self, url: str, length: int, spooled_file: BinaryIO):
        pass

@lukpueh
Member

lukpueh commented Nov 30, 2020

That's what I had in mind. Plus (optionally) a custom BinaryIO implementation that handles write rate and max size to fend off slow retrieval and endless data attacks.

@trishankatdatadog
Member

Some great discussion here that has resulted in a refined design, thanks all! If I'm following correctly, we've agreed on the following to abstract out the network I/O:

Yes, this LGTM. The Fetcher does not have access to the Updater. Another possible design is to use a mixin, but IMHO it is just as good as well-isolated composition; it's a matter of preference.

@trishankatdatadog
Member

trishankatdatadog commented Nov 30, 2020

Here is a mixin alternative design...
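
For illustration only (this is a hypothetical sketch, not necessarily the linked design): in a mixin variant the Updater inherits its download capability, and applications customize by subclassing rather than by composing a separate Fetcher object.

import abc
import requests

class FetcherMixin(metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def _fetch(self, url: str, length: int):
        """Yield chunks of bytes for url."""

class RequestsFetcherMixin(FetcherMixin):
    # Example concrete mixin built on requests.
    def _fetch(self, url: str, length: int):
        with requests.get(url, stream=True) as response:
            response.raise_for_status()
            yield from response.iter_content(chunk_size=8192)

class Updater(RequestsFetcherMixin):
    ...  # refresh() etc. call self._fetch() internally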

@jku
Member Author

jku commented Nov 30, 2020

If I'm following correctly, we've agreed on the following to abstract out the network I/O

LGTM.

Plus (optionally) a custom BinaryIO implementation that handles write rate and max size to fend off slow retrieval and endless data attacks.

👍 (although, if we're talking about the current implementations of those protections, I'd remove the slow retrieval protection even if the "file object" is implemented -- non-working features are worse than non-existent features)

@trishankatdatadog
Member

👍 (although, if we're talking about the current implementations of those protections, I'd remove the slow retrieval protection even if the "file object" is implemented -- non-working features are worse than non-existent features)

@jku have you seen httpx? Everything seems to be timed out there; it's really nice for preventing slow retrieval attacks. @florimondmanca can correct me if I'm wrong...

@jku
Member Author

jku commented Nov 30, 2020

I've never used httpx, but it looked good when I was catching up on the more modern HTTP stacks. I'm not convinced the issues with slow retrieval protection can be fixed with a better HTTP stack, though. I don't want my opinion on slow retrieval to create friction in this issue: if keeping the current slow retrieval code is preferred by others, let's do that by all means.

...That said, I'll document my opinion on slow retrieval here just for reference:

Defining limits (e.g. timeouts) that A) don't trigger false positives and B) meaningfully help against a slow retrieval attack seems extremely difficult, if not impossible... In any case the limits are going to be application- and/or context-specific -- I find it hard to believe that there are useful protections that apply to all users of TUF. Without a specific documented use case, implementing these protections seems pointless: we will not even be able to test that the protections work.

Btw, just so it's clear: I do believe the default Fetcher should set sensible timeouts (if the HTTP stack default is not sensible for the purpose) and possibly even provide ways to configure them: but we should do that in order to create high-quality software, not because it will realistically protect against an attack.
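
For example, a requests-based default Fetcher could expose timeouts like this (parameter names and defaults are illustrative assumptions; note that the read timeout bounds the wait for each chunk, not the whole download):

import requests

class RequestsFetcher(object):
    def __init__(self, connect_timeout: float = 5.0, read_timeout: float = 30.0):
        self._session = requests.Session()
        self._timeout = (connect_timeout, read_timeout)

    def fetch(self, url: str, length: int, spooled_file):
        # The (connect, read) timeout tuple is applied per request.
        with self._session.get(url, stream=True, timeout=self._timeout) as r:
            r.raise_for_status()
            for chunk in r.iter_content(chunk_size=8192):
                spooled_file.write(chunk)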

@trishankatdatadog
Member

@jku

Btw, just so it's clear: I do believe the default Fetcher should set sensible timeouts (if the HTTP stack default is not sensible for the purpose) and possibly even provide ways to configure them: but we should do that in order to create high-quality software, not because it will realistically protect against an attack.

Agreed: the settings/config should be configurable so that the protection is meaningful in different cases for different users.

@florimondmanca

florimondmanca commented Nov 30, 2020

👋

@trishankatdatadog: Yes, HTTPX has timeouts enabled by default for TCP connect/read/write, as well as connection pool acquiry. They're all 5 seconds by default, and configurable. So e.g. if the remote server takes > 5s to send a chunk after HTTPX started to .recv(), we hard-shut the connection and raise an httpx.ReadTimeout exception.

I don't know if this corresponds to the "slow retrieval" attack scenario here. E.g. it's still possible for a remote server to send 1-byte chunks every 5s and be just fine as far as HTTPX is concerned.

We don't have "write rate" or "max size" knobs built in either. We do, however, provide a customization mechanism. HTTPX is actually separated into two projects: HTTPX itself, which does high-level client smarts, and HTTPCore, which does low-level HTTP networking. The interface between the two is the "Transport API". HTTPX provides default transports, but it's possible to switch one out for something else, such as a wrapper transport. Our docs on this are still nascent, but there are many features that can be implemented at this level. In particular, anything that wants to control the flow of bytes (upload or download) would fit there very nicely. Example:

from typing import Iterator

import httpcore

class TooBig(Exception):
    pass

class MaxSizeTransport(httpcore.SyncHTTPTransport):
    def __init__(self, parent: httpcore.SyncHTTPTransport, max_size: int) -> None:
        self._parent = parent
        self._max_size = max_size

    def _wrap(self, stream: Iterator[bytes]) -> Iterator[bytes]:
        length = 0
        for chunk in stream:
            length += len(chunk)
            if length > self._max_size:
                raise TooBig()
            yield chunk

    def request(self, *args, **kwargs):
        status_code, headers, stream, ext = self._parent.request(*args, **kwargs)
        return status_code, headers, self._wrap(stream), ext


import httpx

transport = httpx.HTTPTransport()  # Default transport.
transport = MaxSizeTransport(transport, max_size=...)  # Add "max size" layer.

with httpx.Client(transport=transport) as client:
    ...

There may be some bits to work out, but if this sounds like something that could fit your use case I'd be happy to work through any specifics.

(Getting down to the transport API level may also be totally overkill. HTTPX provides a nice and simple .stream() method that returns a streaming response — you can iterate over response content and do chunk-wise operations at that level too. But if you do need that amount of low-level-ness, it's there.)


On a separate note, I read through this thread and wondered — was it considered to have .fetch() return a bytes iterator (most typically a generator)…?

import abc
from typing import Iterator

class Fetcher(metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def fetch(self, url: str, length: int) -> Iterator[bytes]:
        ...

The main benefit here is lower coupling: it removes any references to the Updater at the Fetcher level. This fits the "we'll call you" mindset Trishank mentioned earlier.

Here's an example implementation using HTTPX:

from typing import Iterator

import httpx

class HTTPXFetcher:
    def __init__(self):
        # Optionally pass a custom `transport` to implement smart features,
        # like write rate or max size controls…
        self._client = httpx.Client()

    def fetch(self, url: str, length: int) -> Iterator[bytes]:
        with self._client.stream("GET", url) as response:
            for chunk in response.iter_bytes():
                # Perhaps do some checks based on `length`.
                yield chunk

The Updater can then "drive the flow" of the request, e.g. by iterating over the iterator…

with tempfile.SpooledTemporaryFile() as fp:
    # Drive the fetcher's flow of bytes...
    for chunk in self.fetcher.fetch(url, length):
        fp.write(chunk)
    # Persist file…

Not sure if this fits the use case, as I obviously have very limited context here :-) For example, this won't work if you need file operations like .seek() -- but I thought I'd mention it. HTTPCore makes extensive use of iterators as a way to represent a "drivable" stream of bytes (both for request bodies and response bodies). In an HTTP context we really just need one-way iteration rather than full file-like controls.

@trishankatdatadog
Member

trishankatdatadog commented Nov 30, 2020

Thanks very much, @florimondmanca; very helpful to us going forward, I'm sure 🙂 I especially like the generator idea, but I will leave it to the fine gents here to decide, given that they are the ones actually driving the refactoring effort.

@lukpueh
Member

lukpueh commented Dec 1, 2020

Thanks for the detailed comment and examples, @florimondmanca! Returning a generator from fetch instead of passing a file object does seem like a cleaner separation of Updater and Fetcher, and I don't think we need any file operations other than incrementally writing a stream of bytes. 👍

@jku
Member Author

jku commented Dec 1, 2020

was it considered to have .fetch() return a bytes iterator ?

This is an interesting idea, and intuitively it feels better than including the vague 'BinaryIO' in the API -- a chunk of bytes is exactly what we want to return. I considered it before the original proposal, but my relative lack of Python experience means I'm not really sure what demands it makes on the Fetcher implementation: I really want that to be as simple as possible to implement, however your bytes actually arrive. But I guess there is no reason to think yielding a chunk would ever be much more complex than writing chunks to whatever a 'BinaryIO' happens to be...

Thanks Florimond, the input is very much appreciated.

@trishankatdatadog
Member

trishankatdatadog commented Dec 1, 2020

Another benefit of Florimond's idea I just realized: it allows us, not the downloader, to check for endless data and slow retrieval attacks, which is crucial when calling an untrusted downloader.

@jku
Member Author

jku commented Dec 1, 2020

Another benefit of Florimond's idea I just realized: it allows us, not the downloader, to check for endless data and slow retrieval attacks, which is crucial when calling an untrusted downloader.

This is true for all of the proposals presented here (with the caveat that downloading each individual chunk is in the control of the downloader component of course).

@trishankatdatadog
Member

trishankatdatadog commented Dec 1, 2020

This is true for all of the proposals presented here (with the caveat that downloading each individual chunk is in the control of the downloader component of course).

I don't think so, IIUC. If we let a downloader we don't control write to the file, then we don't measure how fast or how many bytes were written (unless we do things like providing a custom file-like object, which is maybe what you proposed but I missed).

@jku
Member Author

jku commented Dec 1, 2020

unless we do things like providing a custom file-like object, which is maybe what you proposed but I missed

My proposal was a simple function that takes one chunk at a time (and can do all the checks it wants). Lukas' proposal was indeed to let the downloader write into a file-like object (later typing.BinaryIO), which could have a custom write() that does the checks we want.

@trishankatdatadog
Member

My proposal was a simple function that takes one chunk at a time (and can do all the checks it wants). Lukas' proposal was indeed to let the downloader write into a file-like object (later typing.BinaryIO), which could have a custom write() that does the checks we want.

Hmm, I see. Still not the same, I think, because the downloader could override what the file-like object does, which is why I think the Hollywood principle is safer. Anyway, I'm glad we're all agreed on a solution now!

@lukpueh
Member

lukpueh commented Dec 2, 2020

If malicious downloader code wants to trash our disk or DoS us, I doubt that we can stop it simply by asking for a generator.

@trishankatdatadog
Member

If malicious downloader code wants to trash our disk or DoS us, I doubt that we can stop it simply by asking for a generator.

Sure, that occurred to me too, but the generator reduces the attack surface.

@sechkova
Contributor

Based on the great discussion and ideas exchanged here, and with the disclaimer that this is an internal collaboration with @jku who had the main idea, I opened #1250, which shows how the current implementation can benefit from a Fetcher class.

I will be happy to receive your comments in the new year 🙂 🎆 🍾

@jku
Member Author

jku commented Mar 3, 2021

Implemented in #1250!
