Cosmo BRM13250012 is seemingly dead #2198

@mkeeter

While testing #2192, we saw Cosmo BRM13250012 (cubby 15 in London) disappear midway through the RoT update:

matt@castle ~ $ pilot sp exec -e "update rot 1 /staff/matt//hubris-2192/oxide-rot-1-cosmo-b.zip" BRM13250012
Aug 13 15:08:39.613 INFO creating SP handle on interface london_sw0tp0, component: faux-mgs
Aug 13 15:08:39.616 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe04:402%3]:11111, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
Aug 13 15:08:39.639 INFO generated update ID, id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, component: faux-mgs
Aug 13 15:08:39.664 INFO starting update, total_size: 214984, id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, component: rot, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
Aug 13 15:08:39.670 INFO update in progress, total_size: 214984, bytes_received: 0, component: faux-mgs
Aug 13 15:08:39.671 INFO update preparation complete, update_id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
Aug 13 15:08:40.677 INFO update in progress, total_size: 214984, bytes_received: 25428, component: faux-mgs
Aug 13 15:08:41.703 INFO update in progress, total_size: 214984, bytes_received: 50856, component: faux-mgs
Aug 13 15:08:42.709 INFO update in progress, total_size: 214984, bytes_received: 76284, component: faux-mgs
Aug 13 15:08:53.430 ERRO update failed, error: RPC call failed (gave up after 5 attempts), id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs

At this point, the SP had been updated with the image from that branch. Note that we didn't actually select the new RoT image; the system dropped dead midway through flashing it.
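
To put a rough number on "midway", here is a small sketch that reads the progress lines from the log above and reports how much of the RoT image had been acknowledged, and how long faux-mgs kept retrying before it gave up. The helper name `summarize` and the parsing approach are illustrative only, not part of pilot or faux-mgs; it assumes the log format shown above and that all lines fall on the same day.

```python
import re
from datetime import datetime

# Relevant lines copied verbatim from the log in this issue.
LOG = """\
Aug 13 15:08:40.677 INFO update in progress, total_size: 214984, bytes_received: 25428, component: faux-mgs
Aug 13 15:08:41.703 INFO update in progress, total_size: 214984, bytes_received: 50856, component: faux-mgs
Aug 13 15:08:42.709 INFO update in progress, total_size: 214984, bytes_received: 76284, component: faux-mgs
Aug 13 15:08:53.430 ERRO update failed, error: RPC call failed (gave up after 5 attempts), id: 1ba66d0f-d11d-4b17-b102-8cea44dbe63d, interface: london_sw0tp0, socket: control-plane-agent, component: faux-mgs
"""

PROGRESS = re.compile(r"total_size: (\d+), bytes_received: (\d+)")


def time_of(line):
    # Third whitespace-separated field is the time of day, e.g. "15:08:42.709".
    return datetime.strptime(line.split()[2], "%H:%M:%S.%f")


def summarize(log):
    """Return (fraction of image received, seconds from last progress to failure)."""
    fraction = None
    last_progress = None
    failed_at = None
    for line in log.splitlines():
        match = PROGRESS.search(line)
        if match:
            total, received = int(match.group(1)), int(match.group(2))
            fraction = received / total
            last_progress = time_of(line)
        elif " ERRO " in line:
            failed_at = time_of(line)
    gap = (failed_at - last_progress).total_seconds()
    return fraction, gap


if __name__ == "__main__":
    fraction, gap = summarize(LOG)
    print(f"received {fraction:.1%} of the image; failure reported {gap:.3f} s later")
```

On this log, the last progress report is at 76284 of 214984 bytes (about 35%), and the failure is logged roughly 10.7 seconds after it.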

Since then, we have not managed to get it to show up on the network, either from castle (pilot -rlondon sp ls) or the switch zone. The host is also not visible, and drive lights are off.

Manually removing and reracking the sled doesn't recover the system. We see the chassis LED turn on about 8-10 seconds after the sled is reracked, but the sled doesn't come up on the network.

@leftwo is going to extract the sled into a benchtop debugging setup, so we can probe the SP and see what's going on.
