
x/telemetry: audit upload process for 1.23 #65970

Closed
@findleyr


We were discussing the potential raciness of telemetry uploading in the context of a thundering herd of Go commands. I looked into this in more detail. Here's a paraphrased summary of the upload process:

  1. Collect all counter file names.
  2. Group by expiry. Ignore expiries that are not in the past (meaning at least a day in the past).
  3. For each group of counter files with the same expiry, compute both "local" and "uploadable" reports, using a random X value that is specific to this operation (i.e. not a hash of report content). This X determines both upload sampling and uploaded report names. Skip counter files that can't be read (e.g. because they've been deleted by one of the following steps...).
  4. Before writing <date>.json (the upload report) and local.<date>.json, check if they already exist. If either of them already exists, assume something else got there first, delete all the counter files, and exit.
  5. Then try to write the report files.
  6. Finally, delete all the counter files.
  7. POST the uploadable report (<date>.json) to telemetry.go.dev/upload/<date>/<X>.json

I'm not concerned about races involving user intervention, such as deleting the telemetry directory in the middle of an upload. However, there does appear to be a race between steps (4) and (7) that could theoretically lead to duplicate uploads: two processes overwrite each other's reports with different X values, and each thinks it should upload. This seems very unlikely given how much has to happen in the unsynchronized period, and we can probably avoid the race by replacing (4) and (5) with an exclusive write, and not deleting the counter files if that write fails.

Nevertheless, I think we should consider this more.

Bugs discovered:

  • a file with no counters is considered a failed file
  • a single report failure causes all other reports not to be uploaded

Metadata

Labels

telemetry (x/telemetry issues)
