x/telemetry: audit upload process for 1.23 #65970
Comments
I would also be concerned about races between steps (3) and (6), particularly on Windows (where a file normally can't be deleted while another process has an open handle to it).
@bcmills indeed, if deleting the counter files fails, the error is logged but no other action is taken. However, given that step 6 happens after the report is written, I don't think we have to be worried about this leading to duplicate uploads--that would only be a problem if some other process deleted the local reports but left the counter files in place. Otherwise, the counter files will "eventually" be deleted because of the logic in step (4): if a report already exists, we still try to delete the counter files. Are you concerned about having stray counter files around? The race that you describe could lead to a situation where you have e.g.
That |
Yeah, if the error in deleting the file doesn't cause any other ill effects, that's probably an ok failure mode.
Repurposing this issue to track generally auditing the upload process. Assigning to myself as I'm working on it.
Change https://go.dev/cl/584400 mentions this issue: |
Change https://go.dev/cl/584305 mentions this issue: |
A single failure of createReport should not prevent the upload of other reports. Fix this by logging and proceeding when there is a failure.

To test this failure, and to generally make it easier to exercise upload bugs, make the following testing improvements:
- Add regtest.RunProgAsOf, which is like RunProg but sets CounterTime (newly exposed) to a time in the past.
- Add regtest.NewIncProgram for the common use case of a program that just increments counters and exits.
- Export CreateTestUploadConfig and CreateTestUploadServer.
- Have CreateTestUploadServer implement its own cleanup.
- Add a testWriter to echo upload logs to t.Log.

Together, these helpers make it relatively easy to write an ad-hoc upload test using only the public counter and upload APIs.

For golang/go#65970

Change-Id: I9f54ad22a1f69cc6162ebe5628ad3287b89bbee1
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/584400
Auto-Submit: Robert Findley <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Hyang-Ah Hana Kim <[email protected]>
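The "log and proceed" behavior described in this change can be illustrated with a minimal sketch. The createReport function below is a hypothetical stand-in that fails for one hard-coded date, and createReports is an invented helper name; neither reflects the actual x/telemetry signatures.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// createReport is a hypothetical stand-in for the real report builder.
// It fails for one date to simulate a corrupt counter file.
func createReport(date string) (string, error) {
	if date == "2024-05-02" {
		return "", errors.New("corrupt counter file")
	}
	return date + ".json", nil
}

// createReports builds one report per date. A failure for one date is
// logged and skipped, so it cannot block the reports for other dates.
func createReports(dates []string) []string {
	var ready []string
	for _, d := range dates {
		name, err := createReport(d)
		if err != nil {
			log.Printf("skipping report for %s: %v", d, err)
			continue
		}
		ready = append(ready, name)
	}
	return ready
}

func main() {
	dates := []string{"2024-05-01", "2024-05-02", "2024-05-03"}
	fmt.Println(createReports(dates))
}
```

With the simulated failure on the middle date, the first and third reports are still produced and only the failing date is dropped.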
For golang/go#65970

Change-Id: I2e84e0c9ffaa2f270b8d42164ea95181ec088e8c
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/584305
Reviewed-by: Hyang-Ah Hana Kim <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Change https://go.dev/cl/584402 mentions this issue: |
Change https://go.dev/cl/584795 mentions this issue: |
Remove fallback logic in counter.Parse to handle a missing "TimeBegin" metadata field. All counters should have this field populated (it is written by file.init).

To ensure we catch this form of corruption, return an error from counterDateSpan when the TimeBegin or TimeEnd fields are missing (this was a pre-existing TODO).

The logic dates to the original instance of the counter package, so it is likely just obsolete.

For golang/go#65970

Change-Id: I6a75a42f3092c3471fe95cda1958b57b954e9140
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/584402
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Hyang-Ah Hana Kim <[email protected]>
Auto-Submit: Robert Findley <[email protected]>
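As a rough illustration of returning an error instead of falling back when a date field is missing, here is a sketch. The dateSpan function, its map-based metadata argument, and the RFC 3339 timestamp format are all assumptions made for this example; the real counterDateSpan may differ.

```go
package main

import (
	"fmt"
	"time"
)

// dateSpan is a hypothetical analogue of counterDateSpan. It returns an
// error when TimeBegin or TimeEnd is missing or malformed, rather than
// substituting a fallback value.
func dateSpan(meta map[string]string) (begin, end time.Time, err error) {
	parse := func(key string) (time.Time, error) {
		v, ok := meta[key]
		if !ok {
			return time.Time{}, fmt.Errorf("missing %s metadata field", key)
		}
		// Assumption: timestamps are stored in RFC 3339 form.
		t, err := time.Parse(time.RFC3339, v)
		if err != nil {
			return time.Time{}, fmt.Errorf("parsing %s: %w", key, err)
		}
		return t, nil
	}
	if begin, err = parse("TimeBegin"); err != nil {
		return time.Time{}, time.Time{}, err
	}
	if end, err = parse("TimeEnd"); err != nil {
		return time.Time{}, time.Time{}, err
	}
	return begin, end, nil
}

func main() {
	// A file with only TimeEnd is treated as corrupt.
	_, _, err := dateSpan(map[string]string{"TimeEnd": "2024-05-08T00:00:00Z"})
	fmt.Println("error:", err)
}
```

Treating the missing field as an error means a corrupt file is reported to the caller instead of silently producing a report with a guessed date range.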
Move TestUploadBasic and TestUploadFailure to the new upload_test package, and rewrite them to use only the public upload API. This also cleans up the file layout a bit, as TestDates is now the only test in dates_test.go.

Additionally, check the state of the telemetry dir before and after upload, and generalize TestUploadFailure to a retry test that checks various status code retry behavior.

For golang/go#65970

Change-Id: Iab9d33a5eb852b43e7876bbf8b12d91b40c85909
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/584795
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Hyang-Ah Hana Kim <[email protected]>
Change https://go.dev/cl/586141 mentions this issue: |
Change https://go.dev/cl/587197 mentions this issue: |
Add two tests, one for telemetry.Start and another for upload.Run, which execute the upload concurrently.

The test for telemetry.Start succeeds, due to the concurrency safety from the exclusive acquisition of the upload.token file.

The test for upload.Run results in incorrect upload counts and occasional invalid report json (due to write shearing). Despite the upload.token guard, upload.Run should be more concurrency safe, since there is still a race condition when the upload.token is released. A subsequent CL will add more safeguards.

For golang/go#65970

Change-Id: Ic7e57b1ee794a58340901289930250bf5114fdf6
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/586141
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Hyang-Ah Hana Kim <[email protected]>
Fix broken writes by writing upload-related files exclusively. Prevent duplicate uploads using a lock file.

For golang/go#65970

Change-Id: I548134f597b2dbf5232de54027adb3daa4bad53d
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/587197
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Hyang-Ah Hana Kim <[email protected]>
Change https://go.dev/cl/598036 mentions this issue: |
We were discussing the potential raciness of telemetry uploading in the context of a thundering herd of Go commands. I looked into this in more detail. Here's a paraphrased summary of the upload process:
X value that is specific to this operation (i.e. not a hash of report content). This X determines both upload sampling and uploaded report names. Skip counter files that can't be read (e.g. because they've been deleted by one of the following steps...).
<date>.json (the upload report) and local.<date>.json, check if they already exist. If either of them already exists, assume something else got there first, delete all the counter files, and exit.
<date>.json) to telemetry.go.dev/upload/<date>/<X>.json
I'm not concerned about races involving user intervention, such as e.g. deleting the telemetry directory in the middle of an upload. However, it does look like there's a race between steps (4) and (7) that could theoretically lead to duplicate uploads: two processes overwrite each other's report with different X values, and each thinks they should upload. But it looks very unlikely given how much has to happen in the unsynchronized period, and we can probably avoid the race by replacing (4) and (5) with an exclusive write, and not deleting files if that write fails.

Nevertheless, I think we should consider this more.
Bugs discovered: