Cosmo BRM13250012 is seemingly dead #2198

While testing https://github.com/oxidecomputer/hubris/pull/2192, we saw Cosmo BRM13250012 (cubby 15 in London) disappear midway through the RoT update:
```
matt@castle ~ $ pilot sp exec -e "update rot 1 /staff/matt//hubris-2192/oxide-rot-1-cosmo-b.zip" BRM13250012
Aug 13 15:08:39.613 INFO creating SP handle on interface london_sw0tp0, component: faux-mgs
Aug 13 15:08:39.616 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe04:402%3]:11111, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
Aug 13 15:08:39.639 INFO generated update ID, id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, component: faux-mgs
Aug 13 15:08:39.664 INFO starting update, total_size: 214984, id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, component: rot, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
Aug 13 15:08:39.670 INFO update in progress, total_size: 214984, bytes_received: 0, component: faux-mgs
Aug 13 15:08:39.671 INFO update preparation complete, update_id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
Aug 13 15:08:40.677 INFO update in progress, total_size: 214984, bytes_received: 25428, component: faux-mgs
Aug 13 15:08:41.703 INFO update in progress, total_size: 214984, bytes_received: 50856, component: faux-mgs
Aug 13 15:08:42.709 INFO update in progress, total_size: 214984, bytes_received: 76284, component: faux-mgs
Aug 13 15:08:53.430 ERRO update failed, error: RPC call failed (gave up after 5 attempts), id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
```
At this point, the SP had already been updated with the image from that branch. Note that we didn't actually select the new RoT image; the system dropped dead midway through flashing it.
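To put numbers on "midway", here is a rough Python sketch that reads the relevant lines of the log above and reports how far the transfer got and how long the SP was silent before `faux-mgs` gave up. The `summarize` helper and the `LOG` constant are hypothetical names made up for this illustration; they are not part of `pilot` or `faux-mgs`, and the regexes assume the log format is exactly as pasted.

```python
import re
from datetime import datetime

# Lines copied from the log above (progress reports plus the final failure).
LOG = """\
Aug 13 15:08:40.677 INFO update in progress, total_size: 214984, bytes_received: 25428, component: faux-mgs
Aug 13 15:08:41.703 INFO update in progress, total_size: 214984, bytes_received: 50856, component: faux-mgs
Aug 13 15:08:42.709 INFO update in progress, total_size: 214984, bytes_received: 76284, component: faux-mgs
Aug 13 15:08:53.430 ERRO update failed, error: RPC call failed (gave up after 5 attempts), id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
"""

# Assumed line shapes, based only on the pasted output.
PROGRESS = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d+) INFO update in progress, "
    r"total_size: (\d+), bytes_received: (\d+)"
)
FAILED = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d+) ERRO update failed")


def parse_time(text):
    """Parse the time-of-day portion of a log timestamp."""
    return datetime.strptime(text, "%H:%M:%S.%f")


def summarize(log):
    """Return the last reported progress and the silence before the failure."""
    last_progress = None
    failed_at = None
    for line in log.splitlines():
        match = PROGRESS.search(line)
        if match:
            # Keep overwriting so we end up with the final progress report.
            last_progress = (
                parse_time(match.group(1)),
                int(match.group(2)),
                int(match.group(3)),
            )
            continue
        match = FAILED.search(line)
        if match:
            failed_at = parse_time(match.group(1))
    if last_progress is None or failed_at is None:
        raise ValueError("log has no progress line or no failure line")
    when, total_size, bytes_received = last_progress
    return {
        "total_size": total_size,
        "bytes_received": bytes_received,
        "fraction": bytes_received / total_size,
        "silence_s": (failed_at - when).total_seconds(),
    }


if __name__ == "__main__":
    info = summarize(LOG)
    print(
        f"{info['bytes_received']} of {info['total_size']} bytes "
        f"({info['fraction']:.1%}), then {info['silence_s']:.3f} s of silence"
    )
```

For this log, the last progress report is 76284 of 214984 bytes (about 35%), followed by roughly 10.7 seconds with no response before the RPC retries were exhausted.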

Since then, we have not managed to get it to show up on the network, either from `castle` (`pilot -rlondon sp ls`) or the switch zone.  The host is also not visible, and drive lights are off.

Manually removing + reracking the sled doesn't recover the system.  We see the chassis LED turn on about 8-10 seconds after it's reracked, but it doesn't come up on the network.

@leftwo is going to extract the sled into a benchtop debugging setup, so we can probe the SP and see what's going on.
