
Delays while making requests over a slow network #797


Closed
bradfol opened this issue May 10, 2016 · 28 comments

bradfol commented May 10, 2016

Operating System and Hardware: Raspberry Pi 2 B running Jessie
NodeJS Version: 4.2.6
Serialport version: 3.1.1
EDIT: Also reproduced on ASUS 64-bit desktop running Debian 8.4.0
NodeJS version: 4.4.4
Serialport version: 3.1.1

Problem description

Simultaneous network requests alongside serial port writes can cause large delays in the serial port communication. When the Raspberry Pi is on a high-latency connection, the serial port communication can become significantly delayed, often with multiple seconds between writes or data events. All of the code is supposed to run asynchronously, so it is unclear why there would be a delay.

Steps and/or code to reproduce the issue

Here is a small test case to reproduce the issue. Running this on a Raspberry Pi connected to a high latency network will cause significant delays. (There is an Arduino on the other side of the serial port connection that sends back responses.)

We were experiencing this issue as part of a larger system, but have isolated it down to node-serialport usage while network requests are being made. Simply unplugging the network connection (Ethernet cable or WiFi adapter; we have tested both) causes the serial port communication to speed up significantly (dozens of reads/writes per second).
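A rough, hypothetical sketch of this kind of test case (not the original code; the port path, baud rate, and URL are placeholders, and the serialport constructor style may differ slightly from 3.1.1):

// Hypothetical repro sketch: write to the serial port continuously while
// HTTP requests are in flight. Assumes an echo device on /dev/ttyACM0.
var SerialPort = require('serialport');
var http = require('http');

var port = new SerialPort('/dev/ttyACM0', { baudRate: 9600 });

port.on('open', function () {
  port.write('ping\n');
  makeRequest();
});

// Each data event from the echo device triggers another write; the
// timestamps make multi-second gaps easy to spot on a slow network.
port.on('data', function (data) {
  console.log(Date.now(), 'serial data:', data.length, 'bytes');
  port.write('ping\n');
});

// Fetch a URL, then immediately start the next request when this one
// finishes (or fails), so network activity is continuous.
function makeRequest() {
  http.get('http://example.com/', function (res) {
    res.resume();
    res.on('end', makeRequest);
  }).on('error', makeRequest);
}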

Further reference

I have run across a couple of other examples of people experiencing similar issues:

Any suggestions on how to further isolate or debug this issue would be greatly appreciated.

@reconbot (Member)

This is a weird one. Does it reproduce on other systems or just the pi?

bradfol (Author) commented May 11, 2016

@reconbot Agreed, it is very unusual.

I took some time today to get Debian 8.4.0 set up on a 64-bit ASUS desktop, and was able to reproduce the issue on that machine. (We have also reproduced across multiple Raspberry Pis.)

So it appears that it is not isolated to the Pi hardware.

@reconbot (Member)

I had a great suggestion that this is due to the event queue being backed up. The network events might be queued behind serialport events and vice versa. Our write currently calls the write callback a large number of times per write. I've released beta [email protected], which fixes this issue. Give it a try!

@reconbot (Member)

Thanks to @stevekinney for this idea!

bradfol (Author) commented May 18, 2016

@reconbot Thanks for your and @stevekinney's help! Unfortunately, the issue still reproduces on both the Raspberry Pi and the desktop machine with [email protected].

I would be happy to perform any debugging tasks you can suggest.

@reconbot (Member)

@bradfol I'll see if I can get more attention on this issue; we're not doing anything that should be affected by network traffic.

soldair commented May 24, 2016

I'm very curious about this. =)
@bradfol It's probably not related, but just wondering: what device do you have plugged into the serial port?

bradfol (Author) commented May 24, 2016

@reconbot Thanks for the help. This is an important issue for us, so please let me know anything that can be done to provide additional debug information.

@soldair It is an Atmel microcontroller (very similar to an Arduino) running an echo program. It is connected only over USB (it has no other connectivity).

@reconbot (Member)

@bradfol As a workaround, can you do network operations and then serialport operations, i.e. not at the same time?

soldair commented May 26, 2016

I should have time to test later today. I'm sure this affects serial bridging on Pinoccios as well, and I'd like to see them still work =)

rvagg (Contributor) commented May 31, 2016

Maybe try something like UV_THREADPOOL_SIZE=16 (put this before you run the application, e.g. UV_THREADPOOL_SIZE=16 node foo.js, the number can go up to 128). Just a hunch but it's possible that serialport's use of the worker threads is somehow blocking something else that needs doing. Because there are only 4 and they are shared across the various functions that are blocking within Node (and any other addon that wants to use them), it's possible to block them all up. We've seen evidence of this with LevelDB and I've been inclined to suggest we move that to its own threadpool so it doesn't disrupt other uses of the standard worker threads.

Of course ... networking is (mostly) done non-blocking in the main thread in Node so it's not clear that this is even related, it's mainly file IO that's done with the worker threads but there are some other minor things.

Also @bradfol, can you see how long the port.write() takes to come back (add a callback and count ms from call to callback)? That would probably be an interesting datapoint.
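Something like this minimal sketch would do (assuming an open port at a placeholder path; serialport API details vary by version):

var SerialPort = require('serialport');
var port = new SerialPort('/dev/ttyACM0', { baudRate: 9600 });

port.on('open', function () {
  // Time from calling write() to its callback firing.
  var start = Date.now();
  port.write('ping\n', function (err) {
    if (err) return console.error('write error:', err);
    console.log('write callback fired after', Date.now() - start, 'ms');
  });
});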

reconbot (Member) commented Jun 1, 2016

@rvagg Is using our own threadpool something that, as a library, we can do ourselves or provide options for?

bradfol (Author) commented Jun 2, 2016

@rvagg Thank you for the suggestions. I'm traveling for the next two weeks, but I will definitely test this out as soon as I get back to the office.

rvagg (Contributor) commented Jun 2, 2016

@reconbot Yes, libuv provides you with all the primitives you need to manage a custom threadpool. I've never done it before, but it should be relatively straightforward to duplicate the way that Node core does it. It'll add maintenance, testing, and debugging burden of course, so you'd want to be sure you need it first!

bradfol (Author) commented Jun 17, 2016

@rvagg @reconbot I have tested out the above suggestion to specify a different UV_THREADPOOL_SIZE. This does have an impact on the behavior of the test case, but unfortunately it does not fix the delays.

What I have observed is that the thread pool size directly corresponds to the number of data events that come back before delays start occurring. That is, if the thread pool size is 16, exactly 16 data event callbacks will be invoked before delays start occurring and everything slows down. (If I don’t specify a thread pool size, the number is 4, which agrees with your above comment that there are only 4 by default.)

Other than this initial number of processed events, changing the thread pool size does not have an impact on the delays that are experienced.

rvagg (Contributor) commented Jun 20, 2016

Well, that's very interesting and probably helps narrow down where the problem is. It might be safe to guess that something here is locking up a worker thread per event.

How can this be reproduced minimally by someone else? I don't do much hardware hacking so I'm not sure what would need to be connected if anything to run this. And what does "high latency connection" mean? Are we talking seconds or milliseconds? And what is the correlation between latency and the problem, do blockages resolve at the same rate as the latency?

reconbot (Member) commented Jun 21, 2016

@rvagg An Arduino Uno with our testing echo sketch would be enough for the hardware. (Happy to mail you one if you need it.) As for the network... I don't know.

@bradfol Your test case grabs a file over HTTP while writing to the port; you wait for the response, start another HTTP session, and repeat. I assume that even at 9600 baud we're writing faster than the HTTP file takes to download. After 1k writes (roughly a second), you'll have 1000 HTTP requests in flight. That is, unless you have a delay on the Arduino side or something?

bradfol (Author) commented Jul 6, 2016

@rvagg @reconbot
The best specification I can give for "high-latency connection" is that ping times on the internal network are slow and variable. That is, ping times will vary: sometimes under 10 ms, sometimes hundreds of ms. See the screenshot below for an example.

We first experienced the problem intermittently as we deployed the Raspberry Pis on different networks we did not control. Now we use a low-quality router that reliably creates these network conditions for us, so we can isolate the problem.

Notably, there are two important control cases where everything runs fine: a fast network, and no network connection at all. On a fast network (reliably under 5 ms pings), many requests are created and do not interrupt the serial port communication; the responses come back whenever the actual network requests finish. The serial port communication is far faster than the network, but the network requests do not block it, as expected for async operations. When there is no network, the requests fail nearly instantaneously, and again there is no slowdown of the serial port traffic.
[Screenshot: ping times on the internal network, captured 2016-07-06]

reconbot (Member) commented Jul 7, 2016

@bradfol how many outstanding network requests do you think are in flight during these periods of high latency?

bradfol (Author) commented Jul 12, 2016

@reconbot It seems to be limited to the thread pool size.

@reconbot (Member)

That is the limit. In some applications you might only have 1 in flight at a time. What is your app doing?

bradfol (Author) commented Jul 14, 2016

@reconbot Are you saying that Node is intended to be limited to the thread pool size worth of network requests?

bradfol (Author) commented Jul 20, 2016

@reconbot @rvagg I wanted to provide some more context around this issue, to hopefully better demonstrate the symptoms using a video.

Here's the control case video, where things are working normally: https://youtu.be/fB0J4ePF04g

And here's a video showing reproduction of the issue: https://youtu.be/EJG-YSo3Oac

Hopefully this is helpful to show the extent of the delays.

napalm272 commented Aug 25, 2016

I have seen the same problem. It's not node-serialport: you can replace the serialport calls with fs.write and the same delay happens. The root cause is that network requests (actually the DNS part), serialport, and fs all use the same thread pool (default size = 4).

On a fast network, requests finish quickly, so the pool drains and lets serialport run. With no network, requests error out immediately, and the pool also drains. On a slow or unreliable network, requests (which use dns.lookup) can take a while to return anything, so the pool is saturated and serialport calls are delayed until there is room in the pool.

I have changed my application to only do one network request at a time. It's very cumbersome, but it works.
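Here is a hypothetical illustration of the thread pool interaction, with no serial hardware required; the hostname is a placeholder:

var dns = require('dns');
var fs = require('fs');

var start = Date.now();

// Four lookups fill the default UV_THREADPOOL_SIZE of 4; on a network
// where DNS answers take seconds, they occupy every worker thread.
for (var i = 0; i < 4; i++) {
  dns.lookup('example.com', function () {
    console.log('dns.lookup finished after', Date.now() - start, 'ms');
  });
}

// This "async" fs call also needs a worker thread, so its callback is
// delayed until one of the lookups completes.
fs.stat(__filename, function () {
  console.log('fs.stat callback after', Date.now() - start, 'ms');
});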

@ottobonn

Hi all, I work with @bradfol and I've spent the past few days working on a way to reproduce this issue without our router. Thanks to @napalm272's tip, I was able to use filesystem calls instead of serial port calls to reproduce the issue without a serial device.

I've created a Vagrant setup that spins up a Debian virtual machine with a simulated high-latency network so we don't depend on a particular router to reproduce the issue. If you'd like to see the slowdown in action, try taking the VM for a spin.

Notably, latency on HTTP (port 80) is not sufficient to reproduce the issue. Though I haven't yet isolated which ports need to be delayed to cause the issue, @napalm272's DNS suggestion is particularly intriguing. I'd like to hear more about why DNS is blocking while normal HTTP delays are not.

The fact that the issue appears outside of node-serialport's purview seems to absolve this issue of the "bug" label, at least within this repo. However, we'd appreciate any further insight into how libuv is supposed to handle simultaneous network and file-descriptor I/O; I find it particularly odd that the network requests seem to block the other I/O while they're in flight. I would expect network- and file-descriptor-based I/O to be treated the same, with the threads from the pool iterating over each outstanding I/O request (both network and file descriptor) and invoking callbacks when ready.

I'm going to continue looking through the libuv source code tomorrow, to see if I can uncover the difference between the thread behavior while waiting for network I/O and "other" I/O.

@ottobonn

To clarify my previous comment, the Vagrant setup should be a totally self-contained way to reproduce this issue, with no additional hardware.

After additional testing, I can confirm both that our "slow" router is somehow interfering with DNS responses, and that simulating latency on only the DNS port (53) reproduces the issue. I have updated the Vagrant setup to add latency only to port 53.

On our router, I have tested the speed of DNS resolution using dig:

# "Good" routers:
> time dig www.example.com +nssearch
dig www.example.com +nssearch 
0.06s user 0.00s system 61% cpu 0.103 total

# "Slow" router:
> time dig www.example.com +nssearch

; <<>> DiG 9.10.3-P4-Ubuntu <<>> www.example.com +nssearch
;; global options: +cmd
;; connection timed out; no servers could be reached
dig www.example.com +nssearch 
0.06s user 0.01s system 0% cpu 15.077 total

As seen above, the timeout takes about 15 seconds, and the simulated latency on port 53 of the Vagrant machine is 10 seconds, to give a good feel for the slowdown.

I should note that I'm just using a Linux networking tool called tc to simulate these delays. If you're running Linux, you can reproduce the issue without running the whole virtual machine, but your web-browsing experience will show similar delays. See the file simulate-latency.sh for the tc commands needed to create these delays.

In addition, I have observed that making HTTP requests by IP address rather than by URL in our Node scripts eliminates the slowdown, by bypassing the DNS resolution step. I have added a useIP boolean to the fs-test file so you can toggle between IP addresses and URLs to see the difference there.

As an additional data point, note that we have not been able to reproduce this issue on Mac OS X. When we connect a Mac to our "slow" router, we see the same DNS timeouts; however, the Node scripts manage to make thousands of network requests before the first network responses start coming in. As I mentioned previously, this is the behavior we would expect from a totally async system, and it makes the Linux implementation of libuv's async I/O polling look suspect.

As @bradfol mentioned to me earlier today, libuv advertises asynchronous DNS resolution, so the behavior on the Mac strikes me as the correct one.

napalm272 commented Aug 26, 2016

@ottobonn Take a look at https://nodejs.org/docs/latest-v0.12.x/api/dns.html; towards the bottom there's a section describing dns.lookup. (This behavior is the same in the newest version of Node.)

Though the call will be asynchronous from JavaScript's perspective, it is implemented as a synchronous call to getaddrinfo(3) that runs on libuv's threadpool. Because libuv's threadpool has a fixed size, it means that if for whatever reason the call to getaddrinfo(3) takes a long time, other operations that could run on libuv's threadpool (such as filesystem operations) will experience degraded performance.

Some other discussion here: nodejs/node-v0.x-archive#2868

When you give the DNS port a 10-second latency, each HTTP request is going to take 10 seconds before it returns, so you will use up the pool very quickly.

When you use an IP, it bypasses dns.lookup, so you don't have the issue.
On Macs, maybe getaddrinfo() is implemented differently, so you don't see the issue there. Keep in mind that dns.lookup relies on the underlying OS's implementation for resolving hostnames.

I see that your test uses the request library; I'm pretty sure it uses http and dns.lookup under the hood. The Node docs state that dns.resolve is actually async, unlike dns.lookup, so I suggest you use dns.resolve to resolve the hostname to an IP first, then make the request by IP address. I'm very curious to see whether that solves your problem. If it does, you can wrap request in a module that does the resolve first, then .get or .post with the IP.
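A rough sketch of that workaround with a placeholder hostname (dns.resolve4 goes through c-ares rather than the threadpool-backed getaddrinfo, and the Host header preserves virtual-host routing when connecting by IP):

var dns = require('dns');
var http = require('http');

// Resolve once up front (async, does not occupy a libuv worker thread)...
dns.resolve4('example.com', function (err, addresses) {
  if (err) return console.error('resolve failed:', err);

  // ...then connect by IP so the request skips dns.lookup entirely.
  http.get({
    host: addresses[0],
    path: '/',
    headers: { Host: 'example.com' }
  }, function (res) {
    console.log('status:', res.statusCode);
    res.resume();
  });
});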

reconbot (Member) commented Oct 5, 2016

I'm going to close this as it's more of a nodejs issue than a serialport issue. We might be able to improve the situation but we don't need this issue for that.

@reconbot reconbot closed this as completed Oct 5, 2016
@lock lock bot locked as resolved and limited conversation to collaborators May 6, 2018