
backupccl: OOM while restoring backup in 22.2 #103481

Closed
@renatolabs

Description

While working on a roachtest (#103228), I saw a RESTORE fail because a couple of nodes went OOM. The backup was taken using the following command:

BACKUP INTO 'gs://cockroachdb-backup-testing/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit' AS OF SYSTEM TIME '1684255935229011892.0000000000' WITH detached, encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7'

A few details worth noting about this backup (which may or may not be relevant):

  • it was taken while the cluster was in mixed-version state.
  • an incremental backup was taken shortly (30s) after the full backup finished (for full logs, see [1]).
  • both jobs were paused and resumed a couple of times while they ran.
  • the backup does not include revision_history.

At a certain point in the test, we attempt to restore this backup on a 4-node cluster running v22.2.9. The restore failed because two of the nodes went OOM a few minutes after the RESTORE statement was issued:

[Screenshot, 2023-05-16: memory graph at the time of the failure]
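When diagnosing this kind of failure, one common way to confirm that the Linux OOM killer (rather than a plain crash) took the process down is to grep the kernel log on the affected node. This is a generic Linux sketch, not taken from the test artifacts; the sample log line below is illustrative:

```shell
# On an affected node you would inspect the kernel ring buffer, e.g.:
#   dmesg -T | grep -i 'out of memory'
# Here we grep an illustrative sample line just to show what a match looks like:
sample='Out of memory: Killed process 12345 (cockroach) total-vm:14680064kB'
printf '%s\n' "$sample" | grep -c 'Out of memory'
# prints "1" (one matching OOM-killer line)
```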

This backup does not contain a lot of data. The biggest table has ~3GiB of data in it:

$ ./cockroach sql --insecure -e "SELECT database_name, parent_schema_name, object_name, size_bytes FROM [SHOW BACKUP LATEST IN 'gs://cockroach-tmp/backup_issue_22_2_oom/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit' WITH check_files, encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7'] ORDER BY size_bytes DESC LIMIT 5"
  database_name                                                         parent_schema_name  object_name  size_bytes
  tpcc                                                                  public              stock        3217512461
  tpcc                                                                  public              order_line   1858665525
  tpcc                                                                  public              customer     1848283823
  bank                                                                  public              bank         1310901452
  restore_1_22_2_9_to_current_database_bank_before_upgrade_in_22_2_9_1  public              bank         1310899634
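As a sanity check on the sizes above, the raw size_bytes values can be converted to GiB; the awk one-liner below is just illustrative arithmetic (1 GiB = 2^30 bytes), not part of the original report:

```shell
# Convert the largest table's size_bytes (stock, from the SHOW BACKUP output) to GiB.
# 1 GiB = 2^30 = 1073741824 bytes.
awk 'BEGIN { printf "%.1f GiB\n", 3217512461 / 1073741824 }'
# prints "3.0 GiB"
```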

More importantly, very similar backups in other tests restore successfully on 22.2, so I suspect something went wrong with this particular backup.

Reproduction

The issue can be easily reproduced by attempting to restore this backup on a 22.2 cluster (I have since moved the backup to a bucket with a longer TTL [2]). This happens even on a completely empty cluster, with no workloads running.

The commands below will create a node with 14GiB of memory, just like the nodes in the failed test.

$ roachprod create -n 1 $CLUSTER
$ roachprod stage $CLUSTER release v22.2.9
$ roachprod start $CLUSTER
$ roachprod ssh $CLUSTER
...
ubuntu@CLUSTER $ time ./cockroach sql --insecure -e "RESTORE FROM LATEST IN 'gs://cockroach-tmp/backup_issue_22_2_oom/9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV?AUTH=implicit' WITH encryption_passphrase = 'kvxN1Tmlwg0OesQw86rg8xjhsQdKBdHFZ7';"
ERROR: connection lost.

ERROR: -e: unexpected EOF
Failed running "sql"

real    2m46.609s
user    0m0.623s
sys     0m0.280s

Finally, note that this does not happen on master or 23.1.1.

[1] roachtest artifacts
[2] 9_22.2.9-to-current_cluster_all-planned-and-executed-on-random-node_X4iV

Jira issue: CRDB-28023

Metadata

Labels

A-disaster-recovery, C-bug (code not up to spec/doc; solution expected to change code/behavior), T-disaster-recovery
