Skip to content

Secsh channel 10 open FAILED: open failed: Connect failed #3482

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
brorfred opened this issue Mar 16, 2020 · 20 comments · Fixed by #3675
Closed

Secsh channel 10 open FAILED: open failed: Connect failed #3482

brorfred opened this issue Mar 16, 2020 · 20 comments · Fixed by #3675
Labels
enhancement Enhances DVC p2-medium Medium priority, should be done, but less important

Comments

@brorfred
Copy link

Please provide information about your setup
DVC version(i.e. dvc --version), Platform and method of installation (pip, homebrew, pkg Mac, exe (Windows), DEB(Linux), RPM(Linux))

version 0.88.0
installed via pip on Linux and Mac.
Syncing to a Ubuntu host via ssh.

Everything seems to be working but I get a cryptic warning message every time I do anything. For example:

dvc push
0% Querying cache in ssh://[email protected]/media/sda2/dvc/first_day| |0/42 Secsh channel 10 open FAILED: open failed: Connect failed
Secsh channel 10 open FAILED: open failed: Connect failed
Secsh channel 10 open FAILED: open failed: Connect failed
2% /media/sda2/dvc/first_day/4e/4b31f0c5784a2e185d88a3120cac19| |1/42 [00:02<0Secsh channel 10 open FAILED: open failed: Connect failed
Everything is up to date.

This is probably an edge case due to my setup but I'm not sure how to quiet the message or resolve the issue.

@ghost ghost added the triage Needs to be triaged label Mar 16, 2020
@skshetry
Copy link
Collaborator

@brorfred, can you paste the dvc push -v verbose output? Also, can you try with a single job and check?

dvc push -j 1

@brorfred
Copy link
Author

brorfred commented Mar 16, 2020

Thanks for looking into this!

Running the following command:

dvc push -v -j 1 > verbose_push_one_job.log

Generates the following output:

  0% Querying cache in ssh://[email protected]/media/sda2/dvc/first_day|                                                                                                                                   |0/51 [00:00<?,     ?file/s]
Secsh channel 10 open FAILED: open failed: Connect failed

And the log file:
verbose_push_one_job.log

@skshetry
Copy link
Collaborator

skshetry commented Mar 16, 2020

@brorfred, My google-fu returned this answer from Unix StackExchange: https://unix.stackexchange.com/a/459051

It basically says that this happens if you get over resource limit, eg: no. of open file descriptors.
Please give that link a try, and also increase the ulimit.

And, please respond if it solved the issue as it will help other users as well.

@brorfred
Copy link
Author

I'm starting to realize that the error probably is some annoying kitchen sink response that can be pretty much anything. Most of my resource limits are set to infinite:

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256902
max locked memory       (kbytes, -l) 16384
max memory size         (kbytes, -m) unlimited
open files                      (-n) 20000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 256902
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

@brorfred
Copy link
Author

brorfred commented Mar 16, 2020

The strange thing is that scp works fine:

touch test
scp test [email protected]:/media/sda2/dvc/first_day/
test                                100%    0     0.0KB/s   00:00 

@skshetry
Copy link
Collaborator

@brorfred, looks like the issue is on server side. Closing it for now, feel free to re-open if you find that the issue is on dvc side.

@brorfred
Copy link
Author

brorfred commented Mar 16, 2020

It seems like the problem is in paramico and particularly in paramiko.ssh_exception.ChannelException. The class SSHConnection in dvc.remote.ssh.connection assumes that the exception fails quietly in open_max_sftp_channels but paramiko logs a warning standard error. Not sure if this is something specific with our setup but I get the same behavior on my Mac laptop...

from dvc.remote.ssh import connection

ssh = connection.SSHConnection(host="zotac.bror.us", username="bror", port=22)
ssh.open_max_sftp_channels()                                          
Secsh channel 10 open FAILED: open failed: Connect failed
Out[122]: 
[<paramiko.sftp_client.SFTPClient at 0x7fe1321fac50>,
 <paramiko.sftp_client.SFTPClient at 0x7fe1321be150>,
 <paramiko.sftp_client.SFTPClient at 0x7fe132239990>,
 <paramiko.sftp_client.SFTPClient at 0x7fe132202950>,
 <paramiko.sftp_client.SFTPClient at 0x7fe1321fa050>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13219ed50>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13219df90>,
 <paramiko.sftp_client.SFTPClient at 0x7fe137584850>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13218f990>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13220a4d0>]

#Redo the command 

ssh.open_max_sftp_channels()                                          
Out[123]: 
[<paramiko.sftp_client.SFTPClient at 0x7fe1321fac50>,
 <paramiko.sftp_client.SFTPClient at 0x7fe1321be150>,
 <paramiko.sftp_client.SFTPClient at 0x7fe132239990>,
 <paramiko.sftp_client.SFTPClient at 0x7fe132202950>,
 <paramiko.sftp_client.SFTPClient at 0x7fe1321fa050>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13219ed50>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13219df90>,
 <paramiko.sftp_client.SFTPClient at 0x7fe137584850>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13218f990>,
 <paramiko.sftp_client.SFTPClient at 0x7fe13220a4d0>]

@skshetry skshetry reopened this Mar 16, 2020
@brorfred
Copy link
Author

brorfred commented Mar 16, 2020

Yes, the text is sent to a logger object in the paramiko instance in transport.py line 2700. Would it be possible for dvc to replace that logger with your own?

@efiop
Copy link
Contributor

efiop commented Mar 16, 2020

Pretty strange that that log message even comes up like that. Dvc disables all other loggers in CLI https://github.com/iterative/dvc/blob/0.88.0/dvc/main.py#L37 , so that error shouldn't really show up at all. Maybe paramiko sets the logger back up during the runtime, causing its errors to show up again. 🤔

@skshetry
Copy link
Collaborator

We load paramiko lazily, so, I guess, the logger is not active at the point when we try to disable it. Probably, we need to bring back the logger dict configs that were before for paramiko.

@skshetry
Copy link
Collaborator

I found a few discussions regarding this specific log message when it was implemented:

#2131 (comment)
#2131 (comment)

Of course, we need to silent the log, but also need to see why the message is there.

@brorfred
Copy link
Author

The message in itself is benign I think. You poll paramiko for max number of channels by open them one by one until max number is reached and paramiko raises an exception. I still don't understand why I'm the only one that get's the warning message though. I've tried two different setups (my office linux machine and my Mac laptop) and both generates it. Could it be because I've installed dvc in a conda environment? I tried to create a minimal environment with only pip paramiko and dvc installed and got the same warnings...

@efiop
Copy link
Contributor

efiop commented Mar 18, 2020

@brorfred Not sure, maybe it is a server-related thingy. Still, the fact that paramiko messages sip through the cracks is not great, we need to disable it when we import it or maybe adjust logging configs somehow.

@efiop efiop added enhancement Enhances DVC p2-medium Medium priority, should be done, but less important labels Mar 18, 2020
@ghost ghost removed the triage Needs to be triaged label Mar 18, 2020
@hg-zt
Copy link

hg-zt commented Apr 21, 2020

Our MaxSession was set to a large number, and I found hundreds of sessions are opened for dvc push -j 1 and failed.

After reverting to v0.41.3, which is before #2131 was merged, we could now use dvc push

@efiop
Copy link
Contributor

efiop commented Apr 21, 2020

@hg-zt dvc push -j 1 affects the number of workers, but we still try to open as much sftp channels as we can. Did it fail without uploading anything at all? Usually, you should be able to dvc push again to upload the rest.

@gurunathparasaram
Copy link

gurunathparasaram commented Apr 23, 2020

When doing dvc pull or dvc push, I get an error, same as above :
0% Querying cache in ssh://user@dvc_machine:path/to//dvc_remote| |0/1 [00:00<?, [837/1980] Secsh channel 30 open FAILED: open failed: Connect failed

  • There is enough space on the machine and I am able to ssh and scp files to the dvc remote.
  • Please let me know if any further details are needed.
  • Couldn't find much info when using -v flag.
  • I tried setting the ulimit open files to a higher number, but in vain.
  • Setting MaxSessions to higher limits didn't work.
  • I also tried the same with another machine as dvc remote. But again I get the same error.

Details:
DVC version: 0.93.0
Python version: 3.7.1
Platform: Linux-5.3.0-46-generic-x86_64-with-debian-buster-sid
Binary: True
Package: deb
Supported remotes: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache: reflink - not supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/sda1')
Repo: dvc, git
Filesystem type (workspace): ('ext4', '/dev/sda1')

@efiop
Copy link
Contributor

efiop commented Apr 23, 2020

@gurunath-p Thanks for the report! A workaround is to run dvc push a few more times, it will upload what is missing.

But since we are getting more reports it makes me think whether some of our recent patches have made dvc go over the reasonable limit. CC @pmrowla

@pmrowla
Copy link
Contributor

pmrowla commented Apr 24, 2020

I'll look into this, although out of the recent patches, I don't think any of the relevant changes should have affected ssh remotes, since RemoteSSH still has its own overridden cache_exists implementation.

@hg-zt
Copy link

hg-zt commented Apr 24, 2020

@efiop

Retried for a few more times and the rest was uploaded. I also found when running dvc push without any actual changes (everything up-to-date), I still get these errors.

@pmrowla
Copy link
Contributor

pmrowla commented Apr 24, 2020

I think the issue is just that we are not suppressing the paramiko logger. We hit the channel limit when determining the max number of channels and trigger the message, even though it's expected and not an error for DVC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhances DVC p2-medium Medium priority, should be done, but less important
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants