Memory leak on STOMP broker connections that fail before heartbeats start [SPR-14266] #18838
Comments
Rossen Stoyanchev commented I'll run the test, but I'm wondering what's causing the failure in the first place. Is RabbitMQ rejecting the roughly 900 failed connections, perhaps because it's hitting some limit that would show up in the RabbitMQ logs, for example? The test was running OK with 2000 connecting last time I checked. Maybe the answer to the above will answer this too, but what's the actual scenario in production? Are two thousand clients connecting simultaneously, with some failing for the same reason as in the test?
dave burns commented Yeah, it's strange: nothing in the RabbitMQ logs around the error, and using the single-threaded approach I get all 2000 connections connecting successfully. In production it's actually quite hard to see. We have 15 broker instances running in a Cloud Foundry environment. Profiling memory over a week or more, we see it trend up until the garbage collector uses too many CPU cycles and the instance restarts. The brokers also serve up an SPA, and the user experience is the app getting really slow before the instance(s) restart. Our workaround is to manually restart the services once a week, before the degradation happens. Each application instance serves about 2000 clients max; this ramps up fairly quickly around 8-9am as users log in, but CPU on each instance only ever reaches about 60% peak. Also, I still see both the Reactor2TcpClient and StompBrokerRelayMessageHandler leaking when running a harness against our actual app. I haven't worked out why this is yet, but our app has security, an SPA, controllers, etc.; I'm hoping a fix for the stripped-down broker/harness will fix both object leaks in the app too.
Rossen Stoyanchev commented In your modified project, the Spring Boot version is still
Rossen Stoyanchev commented Can you also confirm all the versions you're running with in production, for Spring Framework and Reactor in particular?
dave burns commented Sorry, I just cloned the original sample and manually added the multi-threaded test, forgot to update the versions on it, but I have the updated versions running locally. I get the following error on the server side for each failed connection/thread, still no error in RabbitMQ logs:
So, after a bit of googling on that error I found this, basically saying to set async to true on the filter (I thought it would be on by default in Boot). If I then add this to the config:

@Bean
public ServletRegistrationBean dispatcherRegistration(DispatcherServlet dispatcherServlet) {
    // Register the DispatcherServlet explicitly so async support can be enabled
    ServletRegistrationBean registration = new ServletRegistrationBean(dispatcherServlet);
    registration.setAsyncSupported(true);
    return registration;
}

I get the following error instead for each failed connection (still no error in the RabbitMQ logs):
I'm currently running on 2 virtual machines, one for the client, the other for the server + RabbitMQ. I can set this up on a faster host and re-run; not sure why you get no error whatsoever.
dave burns commented Sorry, meant to reiterate: in both scenarios above I still get the Reactor2TcpClient leak in VisualVM, so the leak does seem to occur for failed connections.
Rossen Stoyanchev commented What I'm not sure about is whether you're getting these errors through the sample or from your actual application. Looking at the output, and regarding the asyncSupported flag, I can see a connection to "/xhr_streaming", so the client is falling back from a WebSocket connection. The test does configure XHR fallbacks but uses WebSocket. As for the connection refused, that seems clearly like RabbitMQ rejecting it.
Rossen Stoyanchev commented I suspect that somehow between your two virtual machines the WebSocket handshake fails and the client falls back to HTTP streaming. You can confirm that by running a small number of connections and turning up the logging. What I think is happening is that with XHR streaming there are more sockets open on the server side, and that may be causing RabbitMQ to reach a limit on open file descriptors, especially if you start RabbitMQ from the command line (as opposed to as a service). The log would show something like this:
To be sure, I am not saying this is the problem in your actual application. For now, before going any further, can you confirm what you see in the RabbitMQ logs for the test? From the error messages above it sure looks like RabbitMQ is rejecting the connections, so the question is why.
dave burns commented Sorry, but nothing in the RabbitMQ logs after running the test, just the standard:
The two-virtual-machine setup was causing the fallback to HTTP streaming. I executed the test again with everything running on the same VM but still get the:
error on the server. On the VM, if I set max connections/threads to 500, I get about 400 actual connections in Rabbit and see the leak in VisualVM. I then set up the test on a fairly good spec Windows host (i7, 16GB); again I get about 1100 out of 2000 connections and the same output as above. If I execute the single-threaded test, all 2000 connections go through. I can't work out why you don't get the same result. It might be OS related (I doubt it though, as it happens in CF). Could the single-threaded test be the one executing on your setup? I have a colleague with a Mac setup; I'll post back results once he runs the test. Maybe you could try on Windows if you have one handy.
Rossen Stoyanchev commented
I am certain I'm running the multi-threaded test because the output is different:
Why Windows, by the way? I doubt that's what's in production. I'm on Linux myself and don't have Windows.
I will try to constrain the number of connections artificially and see if I can spot a leak with failed connections. Perhaps you can try to insert a break every 500 connections during ramp-up to see if that makes a difference?
Rossen Stoyanchev commented A key concern of mine is that the test failure is an environment-related issue and isn't helping us replicate the production problem. For one, I can't even get it to fail in my environment (a laptop). Furthermore, sending 2000 users at the same time is not quite the same as users logging in over the course of an hour. In terms of your production setup, do you see evidence of the same issue, i.e. "java.net.ConnectException: Connection refused"? You wrote that "the application instance serves about 2000 clients max". How is this controlled, i.e. what happens if there are more than that?
Can you provide details on what you see?
Rossen Stoyanchev commented Okay, some progress to report. I see one problem that needs fixing. I start RabbitMQ with a lower limit on sockets:
I then run the test. After 828 connections RabbitMQ hits the limit:
What happens from there varies. Some TCP connections fail with a ConnectionTimeoutException, which results in a callback + cleanup. Others, however, manage to connect and send CONNECT but never get a CONNECTED frame back. Since the CONNECTED frame is required to complete heartbeat interval negotiation, we don't set up a heartbeat until it's received, which means that if the connection remains open, it doesn't get cleaned up. I'll work on a fix for that, but it would be great to get your feedback on the questions above still.
dave burns commented Well spotted! And a little bit of progress here too. My colleague ran the multi-threaded client on a Mac; he gets about 1000-1500 connections per run (500+ fail to connect) but no memory leak! Again, he has 8,000 available file and socket descriptors showing in RabbitMQ, so no limits are hit. So your setup is all connections / no leak. The production setup above was a test I executed with the 4.3.0 SNAPSHOT after your first fix; again, it's hard to see the contents of the memory leak in our CF env as I can't get VisualVM onto the JVM. I plan to re-run that test using the 4.2.6.RELEASE version from last Friday. In the prod/CF env I just kept running the clients until the instance gave an out-of-memory error; I couldn't actually see the Reactor2TcpClient in any dump file (didn't have a dump file), so I tried to reproduce locally to get access to the better tooling. Afraid I don't have too much detail on the prod/CF test at the moment, which is why I plan to re-run it and will post back. I was assuming it was the same cause as what I'm seeing on Windows, but this may not be the case. As you say, actual production usage will not be as intensive as the test we are running, and it's possible your original 4.2.6 fix along with the fix above could be all that is required.
Rossen Stoyanchev commented Okay, I have a fix in master (4.3.0 snapshots) for a leak resulting from connect failures. Can you give that a try? I can also propagate the fix to the 4.2.x branch if you need a snapshot there, but would prefer to confirm before backporting if feasible. At least in my environment with a forced limit on connections, it cleans up well now. Note, however, that with this fix the state is cleared after 60 seconds. Connection timeouts are after 30 seconds. I've set the max wait for the CONNECTED frame to 60 seconds. This matches a similar setting we have on the WebSocket side in the SubProtocolWebSocketHandler. This should address the test harness, pending your confirmation. However, I'm still wondering about your production case. Are there connection failures there that could explain this? Also, I want to check that clients do support heartbeats (stomp.js does) and are not connecting with heartbeat settings set to 0 (i.e. off).
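For illustration only, a minimal sketch of the cleanup idea described above (not the actual framework change): schedule a one-shot task when CONNECT is sent to the broker and cancel it once CONNECTED arrives; if the frame never arrives within the limit, the task closes the connection and releases its state so it cannot leak. The class and method names below are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class ConnectedFrameWatchdog {

    // Mirrors the 60-second limit for receiving CONNECTED described above.
    private static final long MAX_WAIT_FOR_CONNECTED_SECONDS = 60;

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Call when the CONNECT frame is sent to the broker; closeAndCleanUp should
    // close the connection and release the per-connection state.
    ScheduledFuture<?> startTimer(Runnable closeAndCleanUp) {
        return scheduler.schedule(closeAndCleanUp, MAX_WAIT_FOR_CONNECTED_SECONDS, TimeUnit.SECONDS);
    }

    // Call when the CONNECTED frame is received; cancels the pending cleanup.
    void connectedReceived(ScheduledFuture<?> timer) {
        timer.cancel(false);
    }
}
```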
Rossen Stoyanchev commented Actually, I should say: as soon as the CI build completes.
dave burns commented FANTASTIC! That has fixed the leak on Windows machines. Still need to validate in our CF environment, but I'm having some red-tape issues getting our Nexus repo updated with the latest snapshot. Will post back as soon as I've been able to re-test the production scenario, but it's looking great so far. Thanks so much.
Rossen Stoyanchev commented Great, I've pushed the change to 4.2.x, so there is a 4.2.7 snapshot building if that makes it easier to try.
Rossen Stoyanchev commented daveburnsLIT, any further updates with the snapshot?
dave burns commented Sorry for the delay. Again, many thanks for the help on this and, as usual, the response time was phenomenal.
Rossen Stoyanchev commented Alright, let us know if you find anything further, and thanks for the reports!
dave burns opened SPR-14266 and commented
The original Boot issue (or a similar issue) still occurs for a broker/server receiving simultaneous connections. I've modified the sample project and added a multi-threaded JUnit method as a client (a rough sketch of such a harness is included below the description). This launches 2000 concurrent threads; about 1100 successfully connect to the server (on my setup). I don't know if it's the load or the failed connections (rather than clean disconnects), but the org.springframework.messaging.tcp.reactor.Reactor2TcpClient can be seen to be leaking by following the same instructions as in the original Boot issue.
The weird thing is that the original leak showed both the Reactor2TcpClient and StompBrokerRelayMessageHandler leaking; this one only shows the Reactor2TcpClient.
Noticed this in our production application, i.e. system accepting concurrent connections. The multi-threaded harness mimics this behavior.
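For reference, a minimal sketch of a multi-threaded connect harness along the lines described above, assuming Spring's WebSocketStompClient over a plain WebSocket transport. The endpoint URL, pool size, and class name are placeholders, and the actual JUnit test added to the sample project may differ (e.g. it may use SockJS transports).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.springframework.messaging.simp.stomp.StompSessionHandlerAdapter;
import org.springframework.web.socket.client.standard.StandardWebSocketClient;
import org.springframework.web.socket.messaging.WebSocketStompClient;

public class ConcurrentStompConnectHarness {

    public static void main(String[] args) throws InterruptedException {
        int clients = 2000;                            // concurrent connection attempts
        CountDownLatch attempted = new CountDownLatch(clients);
        ExecutorService pool = Executors.newFixedThreadPool(200);

        // One client instance can be shared; each connect() opens a new STOMP session.
        WebSocketStompClient stompClient = new WebSocketStompClient(new StandardWebSocketClient());

        for (int i = 0; i < clients; i++) {
            pool.submit(() -> {
                try {
                    // Placeholder URL; under load some of these connects are expected to fail.
                    stompClient.connect("ws://localhost:8080/stomp", new StompSessionHandlerAdapter() {});
                }
                finally {
                    attempted.countDown();
                }
            });
        }

        attempted.await(5, TimeUnit.MINUTES);
        pool.shutdown();
    }
}
```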
Affects: 4.2.6
Reference URL: #18805
Issue Links:
Referenced from: commits 4a81aab, b502210
Backported to: 4.2.7