Random delay before readyToUse=true #225

@bswartz

My CSI plugin always returns readyToUse=true, because it simply blocks in CreateSnapshot() until the snapshot is created (typically 1 second or less). Usually the volumesnapshot object in k8s reflects readyToUse=true immediately, but occasionally, seemingly at random, it shows readyToUse=false and only gets corrected after about a minute.

Here is a log that illustrates this happening:
external-snapshotter.log

Notice that at 05:10:00, the CreateSnapshot() RPC returns success with readyToUse=true. However, there's an error on line 96 of the log:

snapshot_controller.go:325] error updating volume snapshot content status for snapshot snapcontent-d7f6b159-fd33-4f57-9084-21c9a12a691b: snapshot controller failed to update snapcontent-d7f6b159-fd33-4f57-9084-21c9a12a691b on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-d7f6b159-fd33-4f57-9084-21c9a12a691b": the object has been modified; please apply your changes to the latest version and try again.

48 seconds later, the controller retries, and successfully updates the object.

I have two issues with this behavior. (1) Why was readyToUse ever set to false if the CreateSnapshot() RPC returned readyToUse=true on the first try? (2) The long wait before retrying seems unnecessary, because the failure is just an API race with something else modifying the same snapshotcontent object. The controller could retry the update right after the error, or requeue the operation for very soon after, instead of waiting. 48 seconds is a long time to wait in an automated sequence of steps that is waiting for the snapshot to become usable.
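To make suggestion (2) concrete, here is a minimal sketch of retrying a status update immediately on a conflict. It is not the external-snapshotter's actual code: `content`, `fakeAPI`, `errConflict`, and `updateStatusWithRetry` are hypothetical names invented for illustration, and the in-memory `fakeAPI` only imitates the API server's resourceVersion check.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConflict is a hypothetical stand-in for the API server's
// "the object has been modified" (HTTP 409) error.
var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// content is a hypothetical, stripped-down snapshot content object.
type content struct {
	resourceVersion int
	readyToUse      bool
}

// fakeAPI is a hypothetical in-memory API server holding a single object.
type fakeAPI struct {
	stored  content
	updates int // number of update attempts received
}

// get returns the latest stored version of the object.
func (a *fakeAPI) get() content { return a.stored }

// update rejects writes based on a stale resourceVersion,
// imitating optimistic concurrency on the real API server.
func (a *fakeAPI) update(c content) error {
	a.updates++
	if c.resourceVersion != a.stored.resourceVersion {
		return errConflict
	}
	c.resourceVersion++
	a.stored = c
	return nil
}

// updateStatusWithRetry applies readyToUse to the object. On a conflict it
// waits for a short, doubling backoff, re-reads the latest object, and tries
// again, instead of leaving the fix to a much later resync.
func updateStatusWithRetry(api *fakeAPI, cached content, ready bool, maxAttempts int, backoff time.Duration) error {
	obj := cached
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		obj.readyToUse = ready
		err := api.update(obj)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err // only conflicts are retried here
		}
		time.Sleep(backoff)
		backoff *= 2
		obj = api.get() // pick up the version written by the other client
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, errConflict)
}
```

With a cached copy at resourceVersion 1 and a stored object at resourceVersion 2, a call such as `updateStatusWithRetry(api, cached, true, 3, 10*time.Millisecond)` fails once with the conflict, re-reads, and succeeds on the second attempt. In a real controller, client-go's `retry.RetryOnConflict` helper (package `k8s.io/client-go/util/retry`) provides this pattern; whether it fits the snapshotter's work queue design is a separate question.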

Labels

lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
