Description
When trying to contact nodes via a topology configuration, I run into problems, especially with a deeply nested topology, while a flat topology seems to work.
Example topology file 'topo':
[routes]
n061901: n061902,n062001
n061902: n062002,n143501
n062001: n183601,n192201
n062002: n193201,n072602
n143501: n072701,n072702
n183601: n072801,n072802
n192201: n072901,n072902
N=n061901,n061902,n062001,n061902,n062002,n143501,n062001,n183601,n192201,n062002,n193201,n072602,n143501,n072701,n072702,n183601,n072801,n072802,n192201,n072901,n072902
clush -w $N --topology=topo hostname
^Cn061902: n061902
n061901: n061901
n062001: n062001
n062002: n062002
n143501: n143501
n192201: n192201
n183601: n183601
Keyboard interrupt.
(the interrupt is needed because the command hangs forever)
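As a side note, the node list N above repeats several names. That looks harmless (each node answers only once in the runs shown here), but a shorter list is easier to read. Below is a minimal sketch that derives a duplicate-free list from the topology text. It assumes the `[routes]` section only contains plain comma-separated node names (no bracketed ranges such as `n[1-4]`); `nodes_from_topology` is a hypothetical helper, not part of ClusterShell.

```python
import configparser

# Topology text copied from the nested example in this report.
TOPO = """\
[routes]
n061901: n061902,n062001
n061902: n062002,n143501
n062001: n183601,n192201
n062002: n193201,n072602
n143501: n072701,n072702
n183601: n072801,n072802
n192201: n072901,n072902
"""


def nodes_from_topology(text):
    """Return every node named in [routes], first occurrence order, no duplicates.

    Assumption: values are plain comma-separated names, not bracketed ranges.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    names = []
    for parent, children in parser["routes"].items():
        names.append(parent.strip())
        names.extend(child.strip() for child in children.split(","))
    # dict.fromkeys keeps insertion order while dropping repeats.
    return list(dict.fromkeys(names))


# Comma-separated list suitable for passing to `clush -w`.
print(",".join(nodes_from_topology(TOPO)))
```

The printed string can be assigned to N in place of the hand-written list.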
The tree in the debug looks like this:
n061901
|- n061902
|  |- n062002
|  |  `- n[072602,193201]
|  `- n143501
|     `- n[072701-072702]
`- n062001
   |- n183601
   |  `- n[072801-072802]
   `- n192201
      `- n[072901-072902]
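To see which nodes stay silent, it helps to compute how deep each node sits in the tree. The sketch below is only an illustration in plain Python, not ClusterShell code: it walks the routes from the nested topology breadth-first and compares the result with the nodes that printed a hostname before the hang. It assumes n061901 is the root of the tree; `depths` is a hypothetical helper.

```python
from collections import deque

# Routes copied from the nested topology file in this report.
ROUTES = {
    "n061901": ["n061902", "n062001"],
    "n061902": ["n062002", "n143501"],
    "n062001": ["n183601", "n192201"],
    "n062002": ["n193201", "n072602"],
    "n143501": ["n072701", "n072702"],
    "n183601": ["n072801", "n072802"],
    "n192201": ["n072901", "n072902"],
}

# Nodes that printed a hostname before the hang (from the output above).
ANSWERED = {
    "n061901", "n061902", "n062001", "n062002",
    "n143501", "n192201", "n183601",
}


def depths(routes, root):
    """Breadth-first walk returning {node: number of hops from root}."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in routes.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


# Assumption: n061901 is the root of the tree.
node_depth = depths(ROUTES, "n061901")
silent = sorted(n for n in node_depth if n not in ANSWERED)
for name in silent:
    print(name, node_depth[name])
```

Every silent node turns out to be three hops from n061901, while everything up to two hops away answered.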
Maybe I am doing something wrong. Does anyone have an idea what is going wrong?
(I have checked that every node listed above can reach every other node over ssh.)
With a flatter topology I see no problems:
[routes]
n061901: n061902,n062001,n062002
n061902: n143501,n183601,n192201,n193201
n062001: n072801,n072802,n072901,n072902
n062002: n072602,n072701,n072702
clush -w $N --topology=topo hostname
n061902: n061902
n061901: n061901
n062002: n062002
n062001: n062001
n193201: n193201
n192201: n192201
n072602: n072602
n072701: n072701
n072702: n072702
n183601: n183601
n143501: n143501
n072902: n072902
n072801: n072801
n072901: n072901
n072802: n072802
Thank you
Bernd
Activity
martinetd commented on Feb 18, 2022
I can reproduce this.
Just changing node names to have something I understand
With this I have no problem reaching two levels deep (b[1-4]), but I can't seem to reach any of the d nodes three levels deep. We can see at debug level that the a1 gateway closes too early; presumably it thinks it's done after the b-level ack when it shouldn't...
martinetd commented on Feb 19, 2022
Running with CLUSTERSHELL_GW_LOG_LEVEL=debug, here are the logs of the first-level gateway (a1) and of one of the deeper gateways (b2).
So from the second log we can see the command actually ran successfully; its result just couldn't come back up because of the failure.
I've fixed that error in https://review.gerrithub.io/c/cea-hpc/clustershell/+/533465 and running now works normally. There might be some missing fallbacks if a lower-level gateway is unreachable, however; that would require some testing...
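For that kind of testing, a run that hangs is easier to handle when the command is wrapped in a timeout instead of waiting for Ctrl-C. The following is a rough sketch using only the Python standard library. `run_with_timeout` and `clush_command` are hypothetical helpers, the timeout is an arbitrary choice, and the clush arguments are the ones used earlier in this thread.

```python
import os
import subprocess


def clush_command(nodes, topology="topo", remote_cmd="hostname"):
    """Build the clush invocation used in this thread as an argument list."""
    return ["clush", "-w", nodes, "--topology=" + topology, remote_cmd]


def run_with_timeout(cmd, timeout, extra_env=None):
    """Run cmd and return (finished, stdout).

    finished is False when the command had to be killed after `timeout`
    seconds, which is how a hang shows up.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    try:
        proc = subprocess.run(
            cmd, env=env, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired as exc:
        out = exc.stdout or ""
        # Partial output may be reported as bytes even in text mode.
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return False, out
    return True, proc.stdout
```

A possible call would be `run_with_timeout(clush_command(N), 60, {"CLUSTERSHELL_GW_LOG_LEVEL": "debug"})`, where N is the node list; a `False` first element means the run did not finish within the timeout.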
BerndKrischok commented on Feb 20, 2022
Hi Dominique,
many thanks for this fix. It works - great.
Bernd
tree mode: fix error with intermediate gateways (cea-hpc#471)