Skip to content

Memory leak on STOMP broker connections that fail before heartbeats start [SPR-14266] #18838

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
spring-projects-issues opened this issue May 11, 2016 · 20 comments
Assignees
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) status: backported An issue that has been backported to maintenance branches type: bug A general bug
Milestone

Comments

@spring-projects-issues
Copy link
Collaborator

spring-projects-issues commented May 11, 2016

dave burns opened SPR-14266 and commented

The original boot issue (or similar issue) still occurs for a broker/server receiving simultaneous connections. I've modified the sample project and added a multi-threaded Junit method as a client. This launches 2000 concurrent threads, about 1100 successfully connect to the server (on my setup). I don't know if its the load or the failed connections (rather than clean disconnects) but the org.springframework.messaging.tcp.reactor.Reactor2TcpClient can be seen to be leaking following by the same instructions as in the original boot issue.

Weird thing is the original leak showed both the Reactor2TcpClient and StompBrokerRelayMessagHandler leaking, this just shows the reactor 2 client.

Noticed this in our production application, i.e. system accepting concurrent connections. The multi-threaded harness mimics this behavior.


Affects: 4.2.6

Reference URL: #18805

Issue Links:

Referenced from: commits 4a81aab, b502210

Backported to: 4.2.7

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

I'll run the test but I'm wondering what's causing the failure in the first place? Is RabbitMQ rejecting the failed 900 or so perhaps hitting some limits which would show in the RabbitMQ logs for example. The tests were running OK with 2000 connecting last time I checked.

Maybe the answer to the above will answer this too but what's the actual scenario in production, are two thousand connecting simultaneously with some failing for the same reason as in the test?

@spring-projects-issues
Copy link
Collaborator Author

dave burns commented

Yeah, it's strange, nothing in the RabbitMQ logs around the error and using the single threaded approach I get all 2000 connections successfully connecting.

In production its actually quite hard to see. We have 15 brokers/instance running in a cloud foundry environment. Profiling memory over around 1 week+ we see it trend up until the Garbage Collector uses up too many CPU cycles and the instance restarts. The brokers also serve up an SPA and the user experience is the app getting really slow before the instance(s) restart. Our workaround is to manually restart the services once a week, before the degradation happens. The application instance serves about 2000 clients max, this ramps up fairly quickly around 8-9am as users log in but CPU on each instance only ever reaches about 60% peak.

Also, I still see both the Reactor2TcpClient and StompBrokerRelayMessagHandler leaking when running a harness against our actual app, haven't worked out why this is yet but our app has security,SPA, controllers etc, hoping a fixed for the stripped down broker/harness should fix both object leaks in the app too.

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

In your modified project, the Spring Boot version is still 1.3.0.RC1. I changed that to 1.4.0.M2 and also added <spring.version>4.3.0.BUILD-SNAPSHOT</spring.version>. After those changes I can run TestConnectionsMultiThreaded successfully. The RabbitMQ console shows 2000 connected. So I'm not able to see the connection failures in my environment yet. Please advise.

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

Can you also confirm all the versions you're running with in production for Spring Framework and Reactor in particular?

@spring-projects-issues
Copy link
Collaborator Author

dave burns commented

Sorry, I just cloned the original sample and manually added the multi-threaded test, forgot to update the versions on it, but I have the updated versions running locally.

I get the following error on the server side for each failed connection/thread, still no error in RabbitMQ logs:

2016-05-11 12:27:09.160 ERROR 2072 --- [nio-8080-exec-7] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.springframework.web.socket.sockjs.SockJsException: Uncaught failure in SockJS request, uri=http://vddp03e-01288a1.lm.lmig.com:8080/websocket/797/676741f349874b7d9bdd91fdfbda63ce/xhr_streaming; nested exception is org.springframework.web.socket.sockjs.SockJsException: Uncaught failure for request http://vddp03e-01288a1.lm.lmig.com:8080/websocket/797/676741f349874b7d9bdd91fdfbda63ce/xhr_streaming; nested exception is java.lang.IllegalStateException: WebSocketSession not yet initialized] with root cause

java.lang.IllegalStateException: WebSocketSession not yet initialized
	at org.springframework.util.Assert.state(Assert.java:392) ~[spring-core-4.3.0.BUILD-SNAPSHOT.jar:4.3.0.BUILD-SNAPSHOT]

So, after a bit of googling on that error I found this, basically saying set async true on the filter (thought it would be by default in boot), so if I then add this to the config :

@Bean
	public ServletRegistrationBean dispatcherRegistration(DispatcherServlet dispatcherServlet) {
	    ServletRegistrationBean registration = new ServletRegistrationBean(
	            dispatcherServlet);
	    registration.setAsyncSupported(true);
	    return registration;
	}

I get the following error instead for each failed connection, (still no error in RabbitMQ logs) :

2016-05-11 12:18:39.695 ERROR 8940 --- [eactor-tcp-io-1] o.s.m.s.s.StompBrokerRelayMessageHandler : TCP connection failure in session abade2963e3b458aa8c58b60a62ccaca: Failed to connect: Connection refused: no further information: /127.0.0.1:61613

java.net.ConnectException: Connection refused: no further information: /127.0.0.1:61613
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_60]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[na:1.8.0_60]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) [netty-all-4.0.29.Final.jar:4.0.29.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) [netty-all-4.0.29.Final.jar:4.0.29.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) [netty-all-4.0.29.Final.jar:4.0.29.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) [netty-all-4.0.29.Final.jar:4.0.29.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) [netty-all-4.0.29.Final.jar:4.0.29.Final]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]

I'm currently running on 2 virtual machines, one for client, other for server+rabbitMQ. I can set this up on a fast host and re-run, not sure why you get no error what so ever.

@spring-projects-issues
Copy link
Collaborator Author

dave burns commented

Sorry, meant to reiterate, in both scenarios above I still get the Reactor2TcpClient leak in visualVM, so the leak does seem to occur for failed connections.

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

What I'm not sure is are you getting these errors through the sample or are these logs from your actual application?

Looking at the output and regarding the asyncSupported flag, I can see a connection to "/xhr_streaming" so the client is falling back from a WebSocket connection. The test does configure XHR fallbacks but uses WebSocket.

As for the connection refused that seems clearly like RabbitMQ is rejecting it.

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

I suspect that somehow between your two virtual machines the WebSocket handshake fails and the client falls back on HTTP streaming. You can confirm that by running a small number of connections and turning put the logs on org.springframework.web. To try the same I commented out the WebSocket transport on the SockJsClient and ran with HTTP streaming instead. Once again it went up to 2000 without any issues.

What I think is happening is that with XHR streaming there are more sockets open on the server side and that may be causing RabbitMQ to reach a limit on open file descriptors especially if you start RabbitMQ from the command line (as opposed to as a service). The log would show something like this:

=WARNING REPORT==== 11-May-2016::16:02:05 ===
file descriptor limit alarm set.

********************************************************************
*** New connections will not be accepted until this alarm clears ***
********************************************************************

To be sure I am not saying this is the problem in your actual application. For now before going any further can you confirm what you see in the RabbitMQ logs for the test? From the error messages above it sure looks like RabbitMQ is rejecting the connections so the question is why.

@spring-projects-issues
Copy link
Collaborator Author

dave burns commented

Sorry but nothing in the RabbitMQ logs after running the test, just the standard :

=INFO REPORT==== 12-May-2016::05:38:38 ===
accepting STOMP connection <0.8748.32> (127.0.0.1:65529 -> 127.0.0.1:61613)

The 2 virtuial machine setup was causing fallback to HTTP streaming. I executed the test again with everything running on the same VM but still get the :

java.net.ConnectException: Connection refused: no further information: /127.0.0.1:61613 

error on the server.

On the VM if I set max connections/threads to 500 I get about 400 actual connections in rabbit and see the leak in visualVM.

I then setup the test on a fairly good spec windows host (i7, 16GB), again I get about 1100 out of 2000 connections and the same output as above. If I execute the single threaded test all 2000 connections go through. I can't work out why you don't get the same result??? Might be OS related (doubt it though as it happens in CF). Could the single threaded test be executing on your setup?

I have a colleague who has a Mac setup, I'll post back results once he executes the test. Maybe you could try on a windows OS if you have one handy.

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

Could the single threaded test be executing on your setup?

I am certain I'm running the multi-thread test because the output is different:

2016-05-12 06:29:30 [main] INFO  org.sample.websocket.WebsocketTest - Starting multi thread test for 2000 threads
2016-05-12 06:29:30 [main] INFO  org.sample.websocket.WebsocketTest - About to start thread  1
2016-05-12 06:29:30 [main] INFO  org.sample.websocket.WebsocketTest - About to start thread  2
2016-05-12 06:29:30 [main] INFO  org.sample.websocket.WebsocketTest - About to start thread  3
...
2016-05-12 06:30:00 [main] INFO  org.sample.websocket.WebsocketTest - About to start thread  1998
2016-05-12 06:30:00 [main] INFO  org.sample.websocket.WebsocketTest - About to start thread  1999
2016-05-12 06:30:00 [main] INFO  org.sample.websocket.WebsocketTest - Sleep in thread 30 secs

I then setup the test on a fairly good spec windows host

Why windows by the way, I doubt this is what's in production? I'm on Linux myself, don't have Windows.

On the VM if I set max connections/threads to 500 I get about 400 actual connections in rabbit and see the leak in visualVM.

I will try to constrain the number of connections artificially and see if I can spot a leak with failed connections. Perhaps you can try and insert a break every 500 connections during ramp-up to see if that makes a difference?

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

A key concern of mine is that the test is an environment-related issue and isn't helping us replicate the production issue. For once I can't even get it to fail in my environment (a laptop). Furthermore sending 2000 users at the same time is not quite the same as users logging in over the course of an hour.

In terms of your production setup, do you see evidence of the same issue, i.e. the "java.net.ConnectionException: Connection refused"? You wrote that "the application instance serves about 2000 clients max". How is this controlled, i.e. what happens if there are more than that?

Also, I still see both the Reactor2TcpClient and StompBrokerRelayMessagHandler leaking when running a harness against our actual app

Can you provide details on what you see?

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

Okay some progress to report.

I see one problem that needs fixing. I start RabbitMQ with lower limit on sockets:

=INFO REPORT==== 12-May-2016::09:10:03 ===
Limiting to approx 924 file handles (829 sockets)

I then run the test. After 828 connections RabbitMQ hits the limit:

=WARNING REPORT==== 12-May-2016::08:58:51 ===
file descriptor limit alarm set.

********************************************************************
*** New connections will not be accepted until this alarm clears ***
********************************************************************

What happens from there varies. Some TCP connections fail with a ConnectionTimeoutException which results in a callback + cleanup. Others however manage to connect and send CONNECT but never get a CONNECTED frame back. Since the CONNECTED frame is required to complete heartbeat interval negotiation, we don't set up a heartbeat until it's received, which means if the connection remains open, it doesn't get cleaned up.

I'll work on a fix for that but would be great to get your feedback on the questions above still.

@spring-projects-issues
Copy link
Collaborator Author

dave burns commented

Well spotted!!! and a little bit of progress here too.

My colleague ran the multi-threaded client on a mac, he gets about 1000-1500 connections per run (500+ fail; to connect) BUT no memory leak! Again he has 8,000 available file and socket descriptors showing in rabbitMQ, no limits are hit.

So your setup is all connections/no leak,
Mac Os : 1000-1500 connections/no leak
Windows : 1000-1500 connections with leak

The production setup above was a test I executed with the 4.3.0 SNAPSHOT after your first fix, again hard to see the contents of memory leak in our CF env as I cant get visualVM onto the JVM. I plan to re-run the test using the test using the 4.2.6.RELEASE version from last friday. In the prod/CF env I just kept running the clients until the instance gave an out of memory, couldn't actually see the Reactor2TcpClient in any dump file (didn't have a dump file) just tried to reproduce locally to give access to the better tooling.

Afraid I don't have too much detail on the prod/CF test ATM which is why I plan to re-run it and will post back. I was assuming it was the same cause as what I'm seeing on windows, but this may not be the case.

As you say actual production usage will not be as intensive as the test we are running and it's possible your original 4.2.6 fix along with the fix above could be all that is required.

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

Okay I have a fix in master (4.3.0 snapshots) for a leak resulting from connect failures. Can you give that a try? I can also propagate the fix to the 4.2.x branch if you need a snapshot there but would prefer to confirm before backporting if feasible. At least in my environment with a forced limit on connections, it cleans up well now.

Note however that with this fix, the state is cleared after 60 seconds. Connection timeouts are after 30 seconds. I've set the max for the CONNECTED frame to 60 seconds. This matches a similar setting we have on the WebSocket side in the SubProtocolWebSocketHandler.

This should address the test harness, pending your confirmation. However I'm still wondering for your production case. Are there connection failures there, that could explain this? Also I want to check that clients do support heartbeats (stomp.js does) are not connecting with heartbeat settings set to 0 (i.e. off).

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

Actually i should say as soon as the CI build completes.

@spring-projects-issues
Copy link
Collaborator Author

dave burns commented

FANTASTIC!!!

That has fixed the leak on windows machines. Still need to validate in our CF environment but just having some red tape issues getting our nexus repo updated with the latest snapshot. Will post back as soon as I've been able to re-test the production scenario but looking great so far, thanks so much.

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

Great, I've pushed the change to 4.2.x so there is a 4.2.7 snapshot building if that makes it easier to try.

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

daveburnsLIT, any further updates with the snapshot?

@spring-projects-issues
Copy link
Collaborator Author

dave burns commented

Sorry @rstoyanchev, the fix works perfectly on a local windows and OSX but we're still having a leak in our CF env. In the CF env I can see the JVM clean up (logging JVM stats and also in new relic) so the bug is definitely fixed, but the container (or maybe OS) is leaking something (still getting an out of memory after a few test runs). Should be good to close this out, no longer a Spring issue, I'll just need to drill in and find the other cause.

Again, many thanks for the help on this and (as usual), response time phenomenal.

@spring-projects-issues
Copy link
Collaborator Author

Rossen Stoyanchev commented

Alright, let us know if you find anything further, and thanks for the reports!

@spring-projects-issues spring-projects-issues added type: bug A general bug status: backported An issue that has been backported to maintenance branches in: web Issues in web modules (web, webmvc, webflux, websocket) labels Jan 11, 2019
@spring-projects-issues spring-projects-issues added this to the 4.3 GA milestone Jan 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) status: backported An issue that has been backported to maintenance branches type: bug A general bug
Projects
None yet
Development

No branches or pull requests

2 participants