updater: abstract out the network IO #1213
💯 This actually happened with the Datadog Agent: some of our customers needed to use TLS proxies, and we were using custom networking code back then that simply couldn't handle it. Hence the switch to requests, but as you say, it's not great that we use our own configuration and sessions there...
With regard to parallel downloads: making sure that a fancy custom Fetcher could download multiple URLs in parallel seems relatively simple -- the design doesn't have to implement that, but it should be taken into account so adding it later would be possible without breaking the API.
Thanks for sketching out two approaches in such great detail, @jku! Regarding (2), may I ask why the provide_fetched_data() callback is needed? Couldn't we just make fetch() return the fetched data? Either way, and as @joshuagl has mentioned in #1142 (comment), if we're blocked until the I/O provider gives us everything they fetched, then we cannot protect against endless data or slow retrieval attacks (see #932). Maybe we could tell our users that, if they give us a blocking fetcher, those protections become their responsibility.
fetch() is called in the same two places in updater that now call the existing download functions. The issues that a plain return value would create are memory use (the whole file would have to be buffered) and losing the ability to run checks while data is still arriving. So as a compromise the design requires the fetcher to hand the data over in chunks via provide_fetched_data(). Does that explain my thinking?
Thanks a ton for the clarification, @jku! This is a pretty smart solution. I am a bit concerned about the coupling in this design, but that may be ill-founded. Here is what I'm concerned about: the fetcher needs a reference to the updater object in order to call provide_fetched_data(), which couples the two components.
What if, instead of writing to a function, we asked the fetcher to write into a file object that the updater passes in?

```python
with tempfile.SpooledTemporaryFile() as fp_for_fetched_data:  # use spooled tmp file to indeed unburden memory
    self.fetcher.fetch(url, length, fp_for_fetched_data)  # fetch must write to passed file
    # Persist file once fetch has returned...
```

And to protect against endless data and slow retrieval attacks, we can pass a custom tmp file object with an augmented write() method. What do you think?
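A rough sketch of what such a custom file object might look like, purely for illustration (the class name, limits, and exception choices here are assumptions, not anything agreed in this thread):

```python
import tempfile
import time


class CheckedTemporaryFile:
    """Wraps a SpooledTemporaryFile and enforces limits inside write().

    Illustrative sketch only: the limit values and exception type are assumptions.
    """

    def __init__(self, max_length: int, timeout_seconds: float = 30.0):
        self._file = tempfile.SpooledTemporaryFile()
        self._max_length = max_length
        self._deadline = time.monotonic() + timeout_seconds
        self._written = 0

    def write(self, data: bytes) -> int:
        # Slow retrieval check: the whole download must finish before the deadline.
        if time.monotonic() > self._deadline:
            raise IOError("slow retrieval: download took too long")
        # Endless data check: never accept more than the expected length.
        self._written += len(data)
        if self._written > self._max_length:
            raise IOError("endless data: more bytes than the expected length")
        return self._file.write(data)

    # The updater (not the fetcher) would use the underlying file afterwards,
    # e.g. to verify hashes and persist the data.
```

The fetcher would only ever call write(), while the limits stay under the updater's control.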
I like it, it has advantages (and does not require Fetcher to figure out what a "file object" actually means). The only dislike is that it still leaves a "file object" in the API, and Python file objects seem to defy definition... I know this is probably not a pythonic way to think, but I just do not understand what the actual interface is that we promise to implement there. A non-seekable, non-readable io.RawIOBase maybe? Or just "any object that has a write() method"? Anyway, I can get over that ambiguity: passing a file object from updater to fetcher looks like a reasonable solution.
You are right, I didn't think about "hiding/protecting" the file object from the fetcher.
What guarantees the Fetcher would only call provide_fetched_data() and nothing else on the Updater?
Good question, @trishankatdatadog. In my proposal the fetcher doesn't need a reference to the updater object.
Good question: it would make sense to make all other function calls on the same object fail while a fetch() call is in progress, just in case. I'm pretty sure this applies to both the original and Lukas' versions -- in both cases calling Updater.refresh() from Fetcher.fetch() would lead to surprising results.
The application is going to have that reference already, so this is not really a defense. Protecting against accidentally calling the Updater API while in Fetcher.fetch() sounds reasonable and not difficult, but I don't want to go too far down that path: the application will always be able to break TUF if it wants to. The best we can do is make accidental misuse more difficult.
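A minimal sketch of the kind of accidental-misuse guard described above; the attribute and method names are made up for illustration and are not part of any proposal here:

```python
# Hypothetical sketch of "make other Updater calls fail while fetch() is in progress".
class GuardedUpdater:
    def __init__(self, fetcher):
        self._fetcher = fetcher
        self._fetch_in_progress = False

    def _assert_not_fetching(self):
        if self._fetch_in_progress:
            raise RuntimeError("Updater API called from within Fetcher.fetch()")

    def refresh(self):
        self._assert_not_fetching()
        # ... normal refresh logic would go here ...

    def _download(self, url, length, destination_file):
        self._assert_not_fetching()
        self._fetch_in_progress = True
        try:
            # Any accidental call back into the Updater from fetch() now fails fast.
            self._fetcher.fetch(url, length, destination_file)
        finally:
            self._fetch_in_progress = False
```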
Is there a way to apply the Hollywood principle here: don't call us, we'll call you?
What do you mean by "Application"? I don't think that we'll have access to an updater object inside the fetcher.
Oh, just that if a client application implements a custom Fetcher, it creates both Updater and Fetcher... so the latter may have a reference to Updater -- we can't know.
Some great discussion here that has resulted in a refined design, thanks all! If I'm following correctly, we've agreed on the following to abstract out the network I/O:

```python
import abc
from typing import BinaryIO, Optional


# Only new/changed methods mentioned for Updater
class Updater(object):

    # init now accepts an optional fetcher argument
    def __init__(self, repository_name, repository_mirrors, fetcher: Optional["Fetcher"] = None):
        ...


# New interface for applications to implement
class Fetcher(metaclass=abc.ABCMeta):

    # Fetches the contents of an HTTP/HTTPS url from a remote server and writes
    # them to the provided file object. Returns when the download is complete
    # and all bytes have been written to the file object.
    @abc.abstractmethod
    def fetch(self, url: str, length: int, spooled_file: BinaryIO):
        pass
```
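For illustration, a minimal sketch of what a default Fetcher built on requests could look like against the interface above; the class name, chunk size, timeout and error handling are assumptions rather than agreed design:

```python
from typing import BinaryIO

import requests


class RequestsFetcher(Fetcher):
    """Hypothetical default Fetcher: streams the URL into the provided file object."""

    def __init__(self, chunk_size: int = 8192, timeout: float = 10.0):
        self._session = requests.Session()
        self._chunk_size = chunk_size
        self._timeout = timeout  # connect/read timeout passed to requests

    def fetch(self, url: str, length: int, spooled_file: BinaryIO):
        with self._session.get(url, stream=True, timeout=self._timeout) as response:
            response.raise_for_status()
            downloaded = 0
            for chunk in response.iter_content(chunk_size=self._chunk_size):
                downloaded += len(chunk)
                if downloaded > length:
                    raise IOError("received more data than the expected length")
                spooled_file.write(chunk)
```

Whether a check like the length guard lives in the fetcher or in the file object the updater passes in is exactly the protection question discussed above.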
That's what I had in mind. Plus (optionally) a custom file object with an augmented write() method for the endless-data and slow-retrieval protections.
Yes, this LGTM. Fetcher does not have access to Updater. Another possible design is to use a mixin, but it is just as good as well-isolated composition IMHO, just a matter of preference.
Here is a mixin alternative design...
LGTM.
👍 (although if we're talking about the current implementations of those protections, I'd remove the slow retrieval protection even if the "file object" is implemented -- non-working features are worse than non-existing features)
@jku have you seen httpx? Everything seems to have timeouts there, it's really nice for preventing slow retrieval attacks. @florimondmanca can correct me if wrong...
I've never used httpx but it looked good when I was catching up on the more modern HTTP stacks. I'm not convinced the issues with slow retrieval protection can be fixed with a better HTTP stack though. I don't want my opinion on slow retrieval to create friction in this issue: if keeping the current slow retrieval code is preferred by others, let's do that by all means. ...That said, I'll document my opinion on slow retrieval here just for reference: defining limits (e.g. timeouts) that A) don't trigger false positives and B) also meaningfully help against a slow retrieval attack seems extremely difficult if not impossible... In any case the limits are going to be application and/or context specific -- I find it hard to believe that there are useful protections that apply to all users of TUF. Without a specific documented use case, implementing these protections seems pointless: we will not even be able to test that the protections work. Btw, just so it's clear: I do believe that the default Fetcher should set sensible timeouts (if the HTTP stack default is not sensible for the purpose) and possibly even provide ways to configure them: but we should do that in order to create high quality software, not because it will realistically protect against an attack.
Agreed: the settings/config should be configurable so that protection is meaningful in different cases to different users.
👋 @trishankatdatadog: Yes, HTTPX has timeouts enabled by default for TCP connect/read/write, as well as connection pool acquiry. They're all 5 seconds by default, and configurable. So e.g. if the remote server takes > 5s to send a chunk after HTTPX started to read, a read timeout is raised. I don't know if this corresponds to the "slow retrieval" attack scenario here. E.g. it's still possible for a remote server to send 1-byte chunks every 5s and be just fine as far as HTTPX is concerned. We don't have "write rate" or "max size" knobs built-in either.

We do however provide a customization mechanism. HTTPX is actually separated in two projects: HTTPX itself, which does high-level client smarts, and HTTPCore, which does low-level HTTP networking. The interface between the two is the "Transport API". HTTPX provides default transports, but it's possible to switch it out for something else, such as a wrapper transport. Our docs on this are still nascent, but there are many features that can be implemented at this level. In particular anything that wants to control the flow of bytes (upload or download) would fit there very nicely. Example:

```python
from typing import Iterator

import httpcore


class TooBig(Exception):
    pass


class MaxSizeTransport(httpcore.SyncHTTPTransport):
    def __init__(self, parent: httpcore.SyncHTTPTransport, max_size: int) -> None:
        self._parent = parent
        self._max_size = max_size

    def _wrap(self, stream: Iterator[bytes]) -> Iterator[bytes]:
        length = 0
        for chunk in stream:
            length += len(chunk)
            if length > self._max_size:
                raise TooBig()
            yield chunk

    def request(self, *args, **kwargs):
        status_code, headers, stream, ext = self._parent.request(*args, **kwargs)
        return status_code, headers, self._wrap(stream), ext
```

```python
import httpx

transport = httpx.HTTPTransport()  # Default transport.
transport = MaxSizeTransport(transport, max_size=...)  # Add "max size" layer.

with httpx.Client(transport=transport) as client:
    ...
```

There may be some bits to work out, but if this sounds like something that could fit your use case I'd be happy to work through any specifics. (Getting down to the transport API level may also be totally overkill. HTTPX provides a nice and simple streaming API as well.)

On a separate note, I read through this thread and wondered — was it considered to have fetch() return an iterator of bytes?

```python
class Fetcher(metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def fetch(self, url: str, length: int) -> Iterator[bytes]:
        ...
```

The main benefit here is lower coupling — it removes any reference to the file object from the Fetcher interface. Here's an example implementation using HTTPX:

```python
import httpx


class HTTPXFetcher:
    def __init__(self):
        # v Optionally pass a custom `transport` to implement smart features,
        # like write rate or max size controls…
        self._client = httpx.Client()

    def fetch(self, url: str, length: int) -> Iterator[bytes]:
        with self._client.stream("GET", url) as response:
            for chunk in response.iter_bytes():
                # Perhaps do some checks based on `length`.
                yield chunk
```

The updater would then drive the flow of bytes itself:

```python
with tempfile.SpooledTemporaryFile() as fp:
    # Drive the fetcher's flow of bytes...
    for chunk in self.fetcher.fetch(url, length):
        fp.write(chunk)
    # Persist file…
```

Not sure if this would fit the use case as I obviously have very limited context here :-) For example this won't work if you need to do other file operations (e.g. seek()) while the download is in progress.
Thanks very much, @florimondmanca, very helpful to us going forward, I'm sure 🙂 I especially like the generator idea, but will leave it to the fine gents here to decide, given that they are the ones actually driving the refactoring effort.
Thanks for the detailed comment and examples, @florimondmanca! Returning a generator from fetch() sounds good to me.
This is an interesting idea and intuitively feels better than including the vague 'BinaryIO' in the API -- a chunk of bytes is exactly what we want to return. I considered it before the original proposal but my relative lack of Python experience means I'm not really sure what demands that makes on the Fetcher implementation: I really want that to be as simple as possible to implement however your bytes actually arrive. But I guess there is no reason to think yielding a chunk would ever be much more complex than writing chunks to whatever a 'BinaryIO' happens to be... Thanks Florimond, the input is very much appreciated.
Another benefit of Florimond's idea I just realized: it allows us, not the downloader, to check for endless data and slow retrieval attacks, which is crucial when calling an untrusted downloader.
This is true for all of the proposals presented here (with the caveat that downloading each individual chunk is in the control of the downloader component, of course).
I don't think so, IIUC. If we let a downloader we don't control write to the file, then we don't measure how fast/how many bytes were written (unless we do things like providing a custom file-like object, which is maybe what you proposed but I missed).
My proposal was a simple function that takes one chunk at a time (and can do all the checks it wants). Lukas' proposal was indeed to let the downloader write into a file-like object (later amended with the idea of a custom file object that does those checks in its write() method).
Hmm, I see. Still not the same, I think, because the downloader could override what the file-like object does, which is why I think the Hollywood principle is safer. Anyway, I'm glad we're all agreed on a solution now!
If malicious downloader code wants to crash our disk or DOS us, I doubt that we can stop it by simply asking for a generator.
Sure, that occurred to me, too, but the generator reduces the attack surface.
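For reference, a sketch of what the caller-side checks could look like when driving a generator-based fetcher; the limit values, names and exception types are illustrative assumptions, not agreed API:

```python
import tempfile
import time


def download_with_checks(fetcher, url, expected_length, timeout_seconds=30.0):
    """Consume fetcher.fetch() chunk by chunk, enforcing limits on the caller's side.

    Illustrative only: the timeout value and exceptions are assumptions.
    """
    start = time.monotonic()
    downloaded = 0
    with tempfile.SpooledTemporaryFile() as temp_file:
        for chunk in fetcher.fetch(url, expected_length):
            downloaded += len(chunk)
            if downloaded > expected_length:
                raise IOError("endless data: server sent more than the expected length")
            if time.monotonic() - start > timeout_seconds:
                raise IOError("slow retrieval: download did not finish in time")
            temp_file.write(chunk)
        # At this point the complete, length-checked file could be verified
        # (hashes, signatures) and persisted by the updater.
```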
Implemented in #1250!
This might be relevant to the Updater redesign (#1135) and, if accepted, would deprecate #1142 and the PR #1171.
We (me, Joshua, Martin, Teodora) have been talking about abstracting some of the client functionality out of the Updater itself. The biggest issue from my perspective is network IO. Teodora already made a PR to let the application download targets but it seems like there are still issues with TUF handling metadata downloads.
Why is this needed?
Potential solutions
We identified two main solutions to this. I'm proposing option 2, but for reference please see the draft of option 1 as well.
Proposal
Add a Fetcher interface that applications can implement. Provide a default implementation of Fetcher. Add a new method to Updater that Fetcher can use to provide the data it fetches.
The Updater processes (refresh(), get_one_valid_targetinfo() and download_target()) will now look like this: whenever a remote file is needed, the Updater calls Fetcher.fetch(); during that call the Fetcher calls Updater.provide_fetched_data() zero or more times to provide chunks of data, and the Updater writes these chunks into the file.
This is like the go-tuf RemoteStore abstraction with two differences:
1. Python does not have reasonable stream abstractions like io.ReadCloser (that would actually be implemented by any of the network stacks), so we cannot return something like that: instead our implementation blocks and adds a provide_fetched_data() callback into Updater.
2. Metadata and target fetching is not separated: this way the Fetcher does not need any understanding of TUF or server structure, it's just a dumb downloader.
I think this is fairly straightforward to implement even without a client redesign (and it will be backwards-compatible). download.py is split into two parts: one part contains the tempfile handling bits and _check_downloaded_length() and is used by the updater itself; the rest of download.py forms the default Fetcher implementation.
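A rough sketch of the shapes this proposal implies; the signatures here are illustrative assumptions (in particular, how the Fetcher gets hold of the Updater reference was left open in this issue):

```python
import abc
import tempfile


class Fetcher(metaclass=abc.ABCMeta):
    """Dumb downloader: no TUF or repository knowledge, just bytes."""

    @abc.abstractmethod
    def fetch(self, url: str, required_length: int, updater: "Updater") -> None:
        """Download url, calling updater.provide_fetched_data() for each chunk.

        Blocks until the whole file has been provided or an error occurs.
        """


class Updater:
    """Only the new callback and its use are sketched here."""

    def __init__(self) -> None:
        self._destination = None  # temporary file for the download in progress

    def provide_fetched_data(self, data: bytes) -> None:
        # Write the chunk into the temporary file; length and speed checks
        # could be enforced here, on the Updater side.
        self._destination.write(data)

    def _download(self, fetcher: Fetcher, url: str, required_length: int) -> None:
        with tempfile.SpooledTemporaryFile() as temp_file:
            self._destination = temp_file
            try:
                fetcher.fetch(url, required_length, self)
            finally:
                self._destination = None
            # Verify and persist the downloaded file here.
```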