Skip to content

Windows Open SSH Server cannot support more than 50 concurrent ssh sessions (posix_spawn failing) #1096

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
KaizerT opened this issue Mar 6, 2018 · 24 comments
Labels
Issue-Enhancement Feature request

Comments

@KaizerT
Copy link

KaizerT commented Mar 6, 2018

Please answer the following

"OpenSSH for Windows" version
v1.0.0.0

Server OperatingSystem
Windows Server 2012 R2

Client OperatingSystem
Windows 7 Enterprise

What is failing
any connection after 50 open connections

Expected output
Support >100 connections

Actual output
stops at 50, exactly

I actually have an open question at server fault for this https://serverfault.com/questions/900137/windows-open-ssh-server-cannot-support-more-than-50-concurrent-ssh-sessions

Anyways, here are the details. Any help will be greatly appreciated

I'm trying to set up a Windows OpenSSH Server (v1.0.0.0-Beta) and I've hit a problem.

I have a program that uses plink to create ssh connections. This can range from 1-200 connections and I'm pretty sure that should be safe. However, when I've made 50, EXACTLY 50 ssh connections, the ssh server starts rejecting the connections and looking at the logs, it shows that the posix_spawn failed and it can no longer instantiate child processes to handle the connections

I've tried this link(https://stackoverflow.com/questions/17472389/how-to-increase-the-maximum-number-of-child-processes-that-can-be-spawned-by-a-w) and it didn't do anything for it. I'd really appreciate your help. Thank you!

Here is the error log when the 51st connection is made

1656 2018-03-05 16:08:05.610 debug3: fd 5 is not O_NONBLOCK
1656 2018-03-05 16:08:05.610 debug3: spawning "C:\OpenSSH\sshd.exe" "-R"
1656 2018-03-05 16:08:05.610 error: posix_spawn failed
1656 2018-03-05 16:08:05.610 debug3: send_rexec_state: entering fd = 8 config len 357
1656 2018-03-05 16:08:05.610 debug3: ssh_msg_send: type 0
1656 2018-03-05 16:08:05.610 debug3: send_rexec_state: done
1656 2018-03-05 16:08:05.610 debug3: ReadFileEx() ERROR:109, io:00000010C2BF98B0

Here is my config

Subsystem sftp sftp-server.exe
LogLevel Debug3
AllowAgentForwarding yes
AllowTcpForwarding yes
GatewayPorts yes
TCPKeepAlive yes
MaxStartups 1000
PasswordAuthentication yes
PermitTTY yes
PidFile C:/ProgramData/ssh/sshd.pid
MaxSessions 1000

@NoMoreFood
Copy link

I'm just now getting familiar with this code but I suspect this could be due to the following hard-coded limitation in signal_sigchld.c where MAX_CHILDREN is 50. If true, it seems like this value could safely be larger.

int
register_child(HANDLE child, DWORD pid)
{
	DWORD first_zombie_index;

	debug4("Register child %p pid %d, %d zombies of %d", child, pid,
		children.num_zombies, children.num_children);
	if (children.num_children == MAX_CHILDREN) {
		errno = ENOMEM;
		return -1;
	}

@manojampalam
Copy link
Contributor

manojampalam commented Mar 7, 2018

@NoMoreFood you're right.

Ultimately, all these child process handles will be fed in a WaitForMultipleObjects call (as part of wait_for_any_event()) and that Win32 API call has a limitation of 64 handles. I had to pick a number less than this to accommodate other events (that of async accept and connect calls). Going beyond these limits will result in multi threaded POSIX compat layer that will make it more complex and less resilient.

Related to #191

Would love to hear if you have any suggestions to improve this.

@KaizerT
Copy link
Author

KaizerT commented Mar 7, 2018 via email

@NoMoreFood
Copy link

@KaizerT Not without some additional programming. There are ways to work around the limitations that @manojampalam is describing, but it's not an "easy" change. I'm new to this project and have already submitted two pull requests for other issues. Depending on how those are received, maybe I'll take on this one next.

@KaizerT
Copy link
Author

KaizerT commented Mar 7, 2018

@NoMoreFood
I undertand that it's not that simple of a fix. Just saying that it might have been nice to have been resolved earlier. Thank you both @manojampalam for investigating.

@KaizerT
Copy link
Author

KaizerT commented Mar 7, 2018

I've set up a linux box for this instead to resolve this issue

@NoMoreFood
Copy link

@manojampalam I couldn't really think of a great way to come through that limitation other than a pretty significant overhaul of how process handling is being done. However, I was able to create a [almost] drop-in replacement for WaitForMultipleObjects() that handles an arbitrary number of objects. Tested on my computer to 5,000 threads with no issues. About 175 lines long. Interested?

@manojampalam
Copy link
Contributor

@NoMoreFood, absolutely, please share.

@NoMoreFood
Copy link

NoMoreFood commented Mar 8, 2018

@manojampalam See attached. I can wrap this into a pull request of course. Just need to replace the normal defines in the calling code that are used to identify the array offset with the *_ENHANCED versions. Let me know what you think.

-- Attachment Removed See Commit Referenced Below --

@KaizerT
Copy link
Author

KaizerT commented Mar 8, 2018

@NoMoreFood @manojampalam

It'd be great if you guys can come up with a release with @NoMoreFood 's fix in the near future. I'm doing a lot of finessing to make this work with the current set up I have.

@manojampalam
Copy link
Contributor

manojampalam commented Mar 8, 2018

@NoMoreFood , your implementation looks promising. I believe it would be a good next step to create a stand alone project and unit test it thoroughly (you may borrow OpenSSH unit test framework, so we can integrate it into OpenSSH at a later point seamlessly). We specifically want the synchronization logic to be tested thoroughly. This logic is very critical since any ensuing E2E deadlocks will be pretty much impossible to debug.

The specific requirements for wait_for_multipleobjects_enhanced, from wait_for_any_event point of view:

  • Ability to handle unbounded number of handles (or at least a sufficiently large number)
  • Know the event index that woke up the wait call (SIGCHLD relies on this)
  • Support APC completions (async IO reads and writes and other signals rely on this)
  • Support timeout

Thanks for your support and contributions.

@NoMoreFood
Copy link

@manojampalam Sounds good. I know there are some areas that could be problematic, but hopefully aren't relevant in our case and I will do my best to verify that. For example, you could have a race condition where multiple signals are signally simultaneously and, in that case, only the first signal will be returned to the caller. I think this is really only a problem for objects that automatically reset after being signaled. I'll do additional testing in a standalone project and let you know how it goes.

@NoMoreFood
Copy link

@manojampalam Actually now that I look at the documentation better, that simultaneous issue may actually exist with the WaitForMultipleObjects() anyhow... but regardless, I'll look into it. How does one use your "OpenSSH unit test framework"?

@manojampalam
Copy link
Contributor

Right, we only deal with manual reset events, so you should be fine there.

For unittest, take a look at any of the test projects. Its implemented in test_helper.c Perhaps it will be easy if you do this work in your OpenSSH-portable fork. Make a copy of any of those test projects for your purpose and you should be good to go.

@NoMoreFood
Copy link

@manojampalam. I updated the code to deal with APC completions, made it so it deals with thread cleanups more gracefully, optimized synchronization logic to be more clear, adjusted style to match that that of the rest of the project, and added logic so that it will just defer to the native function if less than 64 objects for efficiency. All my tests have been integrated thus far and been very successful. I was going to add a file to 'unittest-win32compat' for unit testing the code; does that seem reasonable?

Also, do you have any existing tests for massively spawning copies to test scalability? I didn't see anything in there.

@NoMoreFood
Copy link

See commit NoMoreFood/openssh-portable@817e977.

@NoMoreFood
Copy link

@manojampalam Updated based on your comments. NoMoreFood/openssh-portable@ba70118.

@manojampalam
Copy link
Contributor

Thanks. I'll take a look. Please post a PR, so we can keep track of further conversation.

@NoMoreFood
Copy link

@manojampalam Not foreseeing the three potential changes I've submitted thus far (group detection, symbolic link support, and this change), I believe my pull request will reflect all three changes since I made the first two without branches. I can cancel and resubmit my other pulls from three different branches if you think that makes the most sense. Any advice would be appreciated.

@manojampalam
Copy link
Contributor

Yes, it will be better for tracking if we deal with each feature/fix in a separate branch/PR/commit.

@NoMoreFood
Copy link

@KaizerT Just wanted to let you know my pull request is finally in hopefully we seen this in a upcoming binary release.

@ghost
Copy link

ghost commented Mar 18, 2018

@NoMoreFood, can you comment the overarching goal of your code? E.g. in other projects, limitations in WaitForMultipleObjects are handled by a red black tree which objects are sorted in to. Your solution may not be the same, but it would be helpful if any datastructures used were labelled.

@NoMoreFood
Copy link

@R030t1 I'm sure there are a variety of ways to address those limitations. I used a variant of one idea what was suggested on MSDN WaitForMultipleObjects() page. I believe it's adequate for what this project needed. My main "goals" were to submit something that was reasonably low overhead, reliable, addressed the issue noted in this thread, and was relatively non-intrusive to the existing code such that the project maintainers would be comfortable accepting it since I'm relatively new to this project. The "optimized" wait-any version I ended up submitting and that was accepted is here: signal_wait.c. I think it is pretty well commented overall and simple enough that it doesn't need much too head-scratching to figure out what's going on. If you think that's something that's very unclear or could drastically improve maintainability, I'm certainly open to suggestions.

@manojampalam manojampalam added this to the vNext milestone Mar 20, 2018
manojampalam pushed a commit to PowerShell/openssh-portable that referenced this issue Mar 20, 2018
PowerShell/Win32-OpenSSH#1096
PowerShell/Win32-OpenSSH#191

- Updated wait_for_multiple_objects_enhanced() to handle a no-event request while alterable.
- Simplified wait_for_any_event() to by taking advantage of no-event alterable request in wait_for_multiple_objects_enhanced().
- Updated wait_for_any_event() to use MAX_CHILDREN limit instead of MAXIMUM_WAIT_OBJECTS limit.
- Removed unnecessary ZeroMemory() call.
- Created distinct definition MAXIMUM_WAIT_OBJECTS_ENHANCED
 and modified functions to use it.
- Upped w32_select() event limit.
- Modified wait_for_multiple_objects_enhanced() to allow for 0 millisecond wait.
@NoMoreFood
Copy link

@KaizerT Unless you want to wait for the official release, you can probably close this since the code changes are in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue-Enhancement Feature request
Projects
None yet
Development

No branches or pull requests

4 participants