Description
While testing #2192, we saw Cosmo BRM13250012 (cubby 15 in London) disappear midway through the RoT update:
matt@castle ~ $ pilot sp exec -e "update rot 1 /staff/matt//hubris-2192/oxide-rot-1-cosmo-b.zip" BRM13250012
Aug 13 15:08:39.613 INFO creating SP handle on interface london_sw0tp0, component: faux-mgs
Aug 13 15:08:39.616 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe04:402%3]:11111, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
Aug 13 15:08:39.639 INFO generated update ID, id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, component: faux-mgs
Aug 13 15:08:39.664 INFO starting update, total_size: 214984, id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, component: rot, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
Aug 13 15:08:39.670 INFO update in progress, total_size: 214984, bytes_received: 0, component: faux-mgs
Aug 13 15:08:39.671 INFO update preparation complete, update_id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
Aug 13 15:08:40.677 INFO update in progress, total_size: 214984, bytes_received: 25428, component: faux-mgs
Aug 13 15:08:41.703 INFO update in progress, total_size: 214984, bytes_received: 50856, component: faux-mgs
Aug 13 15:08:42.709 INFO update in progress, total_size: 214984, bytes_received: 76284, component: faux-mgs
Aug 13 15:08:53.430 ERRO update failed, error: RPC call failed (gave up after 5 attempts), id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
At this point, the SP had already been updated with an image from that branch. Note that we never actually selected the new RoT image; the system dropped dead midway through flashing it.
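To put numbers on "midway", here is a rough Python sketch that pulls the last progress report and the failure timestamp out of the log above. It is only an illustration: the `summarize` helper and its regex are hypothetical, written against the line format shown in this issue, and are not part of `pilot` or faux-mgs.

```python
import re
from datetime import datetime

# Excerpt of the log lines quoted in this issue.
LOG = """\
Aug 13 15:08:40.677 INFO update in progress, total_size: 214984, bytes_received: 25428, component: faux-mgs
Aug 13 15:08:41.703 INFO update in progress, total_size: 214984, bytes_received: 50856, component: faux-mgs
Aug 13 15:08:42.709 INFO update in progress, total_size: 214984, bytes_received: 76284, component: faux-mgs
Aug 13 15:08:53.430 ERRO update failed, error: RPC call failed (gave up after 5 attempts), id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
"""

# Assumed line format: "<Mon> <day> <HH:MM:SS.mmm> <LEVEL> <message>"
PROGRESS = re.compile(r"total_size: (\d+), bytes_received: (\d+)")


def time_of_day(line):
    # The log carries no year, so only the time-of-day field is parsed.
    return datetime.strptime(line.split()[2], "%H:%M:%S.%f")


def summarize(log):
    """Return how far the transfer got and how long it stalled before failing."""
    last_progress = None
    failed_at = None
    for line in log.splitlines():
        match = PROGRESS.search(line)
        if match:
            total, received = int(match.group(1)), int(match.group(2))
            last_progress = (time_of_day(line), total, received)
        elif " ERRO " in line:
            failed_at = time_of_day(line)
    when, total, received = last_progress
    return {
        "bytes_received": received,
        "total_size": total,
        "fraction": received / total,
        "stall_seconds": (failed_at - when).total_seconds(),
    }


result = summarize(LOG)
print(f"{result['bytes_received']}/{result['total_size']} bytes "
      f"({result['fraction']:.1%}), "
      f"{result['stall_seconds']:.3f} s between last progress and failure")
```

For this log, the last report is 76284 of 214984 bytes (roughly 35%), and about 10.7 seconds pass between that report and the RPC failure.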
Since then, we have not managed to get it to show up on the network, either from castle (pilot -rlondon sp ls) or the switch zone. The host is also not visible, and drive lights are off.
Manually removing + reracking the sled doesn't recover the system. We see the chassis LED turn on about 8-10 seconds after it's reracked, but it doesn't come up on the network.
@leftwo is going to extract the sled into a benchtop debugging setup, so we can probe the SP and see what's going on.