Description
While working on a roachtest (#103228), I saw a RESTORE
fail because a couple of nodes went OOM. The backup was taken using the following command:
```sql
BACKUP INTO 'gs://cockroachdb-backup-testing/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit' AS OF SYSTEM TIME '1684255935229011892.0000000000' WITH detached, encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7'
```
A few things worth noting about this backup (which may or may not be relevant):
- it was taken while the cluster was in a mixed-version state.
- an incremental backup was taken shortly (30s) after the full backup finished (for full logs, see [1]).
- both jobs were paused and resumed a couple of times while they ran.
- the backup does not include revision_history.
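The layers in the backup chain (and the window each one covers) can be inspected with SHOW BACKUP. A minimal sketch, assuming the `backup_type`, `start_time`, and `end_time` columns present in 22.2's SHOW BACKUP output:

```sql
-- Sketch: list the full and incremental layers in the chain and their
-- coverage windows (column names assumed from 22.2's SHOW BACKUP output).
SELECT DISTINCT backup_type, start_time, end_time
FROM [SHOW BACKUP LATEST IN 'gs://cockroach-tmp/backup_issue_22_2_oom/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit'
      WITH encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7']
ORDER BY end_time;
```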
At a certain point in the test, we attempted to restore this backup on a 4-node cluster running v22.2.9. This failed because two of the nodes went OOM a few minutes after the RESTORE statement was issued.
This backup does not contain a lot of data. The biggest table has ~3GiB of data in it:
```shell
$ ./cockroach sql --insecure -e "SELECT database_name, parent_schema_name, object_name, size_bytes FROM [SHOW BACKUP LATEST IN 'gs://cockroach-tmp/backup_issue_22_2_oom/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit' WITH check_files, encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7'] ORDER BY size_bytes DESC LIMIT 5"
                             database_name                             | parent_schema_name | object_name | size_bytes
-----------------------------------------------------------------------+--------------------+-------------+-------------
  tpcc                                                                 | public             | stock       | 3217512461
  tpcc                                                                 | public             | order_line  | 1858665525
  tpcc                                                                 | public             | customer    | 1848283823
  bank                                                                 | public             | bank        | 1310901452
  restore_1_22_2_9_to_current_database_bank_before_upgrade_in_22_2_9_1 | public             | bank        | 1310899634
```
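For context, the five objects listed above sum to well under the 14GiB of memory on each node; a quick shell sketch of the arithmetic:

```shell
# Sum the five size_bytes values shown above and convert bytes to GiB,
# just to substantiate that the backup is small relative to node memory.
printf '%s\n' 3217512461 1858665525 1848283823 1310901452 1310899634 \
  | awk '{sum += $1} END {printf "%.1f GiB\n", sum / (1024^3)}'
# -> 8.9 GiB
```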
More importantly, very similar backups in other tests can be successfully restored in 22.2, so I think something went wrong with this particular backup.
Reproduction
The issue can be reproduced very easily by attempting to restore this backup on a 22.2 cluster (I have since moved the backup to a bucket with a longer TTL [2]). It happens even on a completely empty cluster with no workloads running.
The commands below will create a node with 14GiB of memory, just like the nodes in the failed test.
```shell
$ roachprod create -n 1 $CLUSTER
$ roachprod stage $CLUSTER release v22.2.9
$ roachprod start $CLUSTER
$ roachprod ssh $CLUSTER
...
ubuntu@CLUSTER $ time ./cockroach sql --insecure -e "RESTORE FROM LATEST IN 'gs://cockroach-tmp/backup_issue_22_2_oom/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit' WITH encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7';"
ERROR: connection lost.
ERROR: -e: unexpected EOF
Failed running "sql"

real    2m46.609s
user    0m0.623s
sys     0m0.280s
```
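To confirm that the connection drop really is the kernel OOM killer (rather than a crash) and to watch the memory growth, something like the following can be run on the node during the RESTORE. This is a diagnostic sketch, not part of the original repro; `dmesg` message formats vary by kernel and may require sudo:

```shell
# Convert an RSS value in KiB (as reported by ps) to GiB.
kib_to_gib() {
  awk -v kib="$1" 'BEGIN {printf "%.2f GiB\n", kib / 1048576}'
}

# 1. After the connection drops, look for oom-killer activity in the
#    kernel log (may need sudo; message format varies by kernel).
dmesg | grep -iE 'out of memory|oom-?kill' || true

# 2. While the RESTORE runs, sample the cockroach process's resident
#    set size every 5 seconds to catch the growth before the kill.
while pgrep -x cockroach > /dev/null; do
  kib_to_gib "$(ps -o rss= -C cockroach | head -n1)"
  sleep 5
done
```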
Finally, note that this does not happen on master or on v23.1.1.
[1] roachtest artifacts
[2] 9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV
Jira issue: CRDB-28023