cmd/compile: seemingly stuck writing telemetry #68311
Stacktrace after sending SIGQUIT:
|
I don't know how to reproduce this. I've used this version of Go for quite a bit, and it suddenly started behaving this way, so I suspect it depends on some system state. It currently reproduces every time with this Go version, and doesn't reproduce with Go 1.22. I've avoided perturbing the system so it stays reproducible for me.
|
That CL pulled in 7 changes from x/telemetry: https://go.googlesource.com/telemetry/+log/51e8b5d718eb..7d78a974cc44. Skipping the ones that were test-only or only modified services (and aren't vendored into the main repo), that leaves 3, I think:
It should be possible to temporarily swap out the vendored x/telemetry version arbitrarily by using |
Aha. I tried adding the replace manually in go.mod, but either I forgot a step or it didn't take effect. It bisects to https://go.dev/cl/585198.
|
Thanks very much @dominikh. I also can't reproduce this, so it seems very important that you are able to preserve your system state. That https://go.dev/cl/585198 is implicated is probably a red herring: that CL changed the telemetry file path, by detecting when the version of toolchain programs is really the Go version. My first guess is that there's something corrupted about your counter file, and reverting to an older CL changes the path to not point to that corrupted file. One quick test for this would be: does this reproduce today? (If it was some sort of corrupted file, it wouldn't reproduce the next day).
cp -R ~/.config/go/telemetry ~/.config/go/telemetry.2024-07-07

Unfortunately, since I am only seeing this a few days later, we have a ~3/7 chance that a weekly telemetry aggregation operation will have wiped away your old counter file :(. In that case, we'd have very little to go on, but could make the code in question more defensive. CC @golang/telemetry |
It does not currently reproduce; I tried at the same commit as from my initial report.
I should have an abundance of ZFS snapshots covering that directory for the day I reported the issue.
|
Is it possible for us to have a copy of the counter file?

/home/dominikh/.config/go/telemetry/local/compile@devel-devel-linux-amd64-2024-07-04.v1.count

From the strace output, it looks like Open->mmap->munmap was repeated, probably for each counter definition in the program (*). If that was caused by a corrupted counter file, I hope a copy of the corrupted counter file may give us some hints.

(*): can we memoize the error that occurred during |
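Roughly the shape that memoization could take, as a sketch only (remapState and tryRemap are made-up names, not the actual x/telemetry API): record the first mapping failure once so that later counter increments don't repeat the whole Open->mmap->munmap cycle.

```go
package counterdemo

import "sync"

// remapState memoizes the outcome of the first remap attempt so a
// failing Open->mmap->munmap sequence is not repeated on every
// counter increment.
type remapState struct {
	once sync.Once
	err  error
}

// tryRemap runs remap at most once; later calls return the cached error.
func (s *remapState) tryRemap(remap func() error) error {
	s.once.Do(func() { s.err = remap() })
	return s.err
}
```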
Can you also share |
Thanks very much. Some initial observations:
Now looking into the raw contents of |
Just in case anything unbroke the files since I filed the issue, I've pulled the telemetry from a snapshot a couple minutes before I filed the issue: |
Ok, after much staring I can sort of reproduce this. Sadly, not with the data you provided: the counter file seems clean.

The recovery code in your stack is not reached by any test, and cannot be correct. It will almost certainly result in an infinite loop:

In short: at this point we have a pointer to a location in the file that exceeds the bounds of the file. No amount of re-mapping is going to fix this. Furthermore, we only write these types of pointers after successfully extending the file and acquiring an allocation using a compare-and-swap on the allocation limit. I didn't spot a bug in the happy path: counter heads should only point to valid locations in the counter file. Therefore, the bug seems related to either a race or external mutation of the file. Given that the timing of this corruption is the day after the weekly file expiry, it seems likely that file rotation is involved.

Nevertheless, a straightforward way to force the corruption is to write enough data that the file must be extended, and then truncate it back to its previous length. By doing so, we can reproduce the conditions of the corruption and experience the same symptoms: a process hanging while telemetry pointlessly unmaps and remaps the file.

So, there are several action items to close this bug:
Fortunately, since the fix is to just return an error from a branch that is (almost) never reached, it should be very safe for 1.23. |
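A minimal sketch of that defensive direction, with hypothetical names (counterBytes, mapping) rather than the real x/telemetry internals: when a counter record points past the end of the mapped file, fail instead of remapping.

```go
package counterdemo

import "fmt"

// counterBytes returns the bytes for one counter record, or an error
// if the record claims to live beyond the mapped file. Since remapping
// cannot make the file longer, looping on this condition never succeeds.
func counterBytes(mapping []byte, off, size uint32) ([]byte, error) {
	end := uint64(off) + uint64(size)
	if end > uint64(len(mapping)) {
		// Previously this path retried the unmap/remap cycle forever;
		// returning an error lets the caller disable telemetry for the
		// current process instead of hanging.
		return nil, fmt.Errorf("corrupt counter file: record at %#x+%d exceeds %d-byte mapping", off, size, len(mapping))
	}
	return mapping[off:end], nil
}
```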
Would this reproduce across many runs? The way I originally discovered the bug was |
Yes, once the file is corrupted in this way I'd expect it to reproduce nearly 100% of the time (technically, it depends on which counters are incremented, but presumably the problematic counter is incremented on every invocation). |
Change https://go.dev/cl/597336 mentions this issue: |
Change https://go.dev/cl/597337 mentions this issue: |
I'm not sure this is really related: it seems the telemetry code ( |
Updates:
|
Change https://go.dev/cl/597278 mentions this issue: |
While investigating golang/go#68311, we became concerned that assumptions of cross process atomic operations may not hold on all platforms. While this logic should be exercised indirectly via telemetry integration tests, add an explicit test to the mmap package for this core behavior.

Updates golang/go#68311

Change-Id: Ibf5049e8a2fe03335e394af17f611207ca0f2016
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/597336
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Robert Findley <rfindley@google.com>
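As a rough, Linux-only illustration of the property that test checks (not the actual test, which lives in the mmap package referenced above): an atomic compare-and-swap performed through one mapping of a file must be observed through a second, independent mapping of the same file, which is the same assumption the counter file relies on across processes.

```go
//go:build linux

package main

import (
	"fmt"
	"os"
	"sync/atomic"
	"syscall"
	"unsafe"
)

func main() {
	f, err := os.CreateTemp("", "mmap-atomic-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if err := f.Truncate(4096); err != nil {
		panic(err)
	}

	// Map the same file twice, standing in for two cooperating processes.
	m1, err := syscall.Mmap(int(f.Fd()), 0, 4096, syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	m2, err := syscall.Mmap(int(f.Fd()), 0, 4096, syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}

	p1 := (*uint64)(unsafe.Pointer(&m1[0]))
	p2 := (*uint64)(unsafe.Pointer(&m2[0]))

	// A CAS through the first mapping must be visible through the second.
	if !atomic.CompareAndSwapUint64(p1, 0, 42) {
		panic("unexpected initial value")
	}
	fmt.Println("value seen through the second mapping:", atomic.LoadUint64(p2)) // prints 42
}
```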
@cherrymui I think the write in openMapped -- at the end of the first page -- is problematic but probably not the problem here, since it is far away and only done when the file is first created. Please let me know if that is inconsistent with your intuition. Can you more generally elaborate on the types of mmap problems the old linker encountered? |
I retract my comment about Truncate. There is no race using write, because I reserved the final bytes of every page to be unused specifically so we could write there to ensure the file was >= that length. In contrast, using Truncate would introduce a race that does not exist today. |
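A side-by-side sketch of the two approaches (illustrative only, not the actual counter-file code): writing a zero byte into the reserved tail of the page is idempotent under races, whereas Truncate by one process could shrink a file that another process has already extended further.

```go
package counterdemo

import "os"

// extendByWrite guarantees the file is at least newLen bytes long by
// writing into the reserved, unused bytes at the end of the page. If
// several processes race, each write only ever grows the file, so the
// outcome is the same regardless of ordering.
func extendByWrite(f *os.File, newLen int64) error {
	_, err := f.WriteAt([]byte{0}, newLen-1)
	return err
}

// extendByTruncate would also set the length, but if another process
// has already grown the file past newLen, this call cuts it back and
// invalidates that process's view, which is the race that does not
// exist today.
func extendByTruncate(f *os.File, newLen int64) error {
	return f.Truncate(newLen)
}
```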
I have spent two hours puzzling through this bug and came to the same conclusions as @findleyr and @hyangah about the right changes to make to prevent future infinite loops. BUT, like them, I do not see how the infinite loop happens without corrupt files, and the file in the telemetry.tar.gz is certainly not corrupt. @dominikh, I wonder if perhaps since any process that opened the counter file would hang, nothing ever closed the counter file, and without an explicit close(2) call, perhaps the in-memory changes were in kernel buffers but not flushed to disk yet at the time of the snapshot. Can you please check for any slightly later snapshots that have different contents in
That is, it's a completely empty counter file, just a 0xc0-byte header. If you find any later copies of the file that have non-zero bytes between 0xc0 and 0x4000, that would be very interesting. Thanks. |
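For anyone who wants to run the same check against their own snapshots, a small standalone program (the 0xc0 and 0x4000 offsets come from the description above) that reports any non-zero bytes in that range:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: countscan <counter-file>")
		os.Exit(2)
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	// Scan between the 0xc0-byte header and offset 0x4000.
	end := 0x4000
	if end > len(data) {
		end = len(data)
	}
	found := false
	for i := 0xc0; i < end; i++ {
		if data[i] != 0 {
			fmt.Printf("non-zero byte 0x%02x at offset %#x\n", data[i], i)
			found = true
		}
	}
	if !found {
		fmt.Println("no counter records: only the 0xc0-byte header is populated")
	}
}
```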
The file is identical to the one I uploaded in every snapshot, from 20240704_171111 to 20240710_220029. It no longer exists in newer snapshots. |
Very very odd. Thanks very much for checking @dominikh. |
On Windows, unmapping the previous mappedFile in mmap.Mmap caused panics when counter pointers were read concurrent to the remapping. Upon investigation, it appears that the unmapping was only necessary for tests, to ensure that previous mappings were cleaned up (and therefore that test files can be deleted). A call to runtime.SetFinalizer(..., mappedFile.close) appeared to serve a similar purpose, yet for an unknown reason finalizers were never run.

Deeper investigation revealed that there was simply one bug in file cleanup (coincidentally already noted in a TODO): after storing the newly mapped file in file.newCounter1 and invalidating counters, we can close the previous mappedFile.

Therefore:
- fix the cleanup in file.newCounter1
- remove the unmap in mmap.Mmap on windows
- remove the now unnecessary 'existing' parameter in mmap APIs
- remove the SetFinalizer call
- add a test for multiple concurrent mappings of a file
- add an end-to-end test for concurrent file extension
- change ReadCounter to read by memory mapping the file, in an attempt to avoid msync issues

For golang/go#68311
Fixes golang/go#68358

Change-Id: I27b6f4f4939e93f7c76f920d553848bf014be236
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/597278
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
I packaged up a test of what I think must have been happening in the reported problem, namely multiple processes trying to create a new telemetry file at the same time. I don't understand why that would cause a problem. I've run it on Linux under stress for over 90 minutes with no problems, but I am not using ZFS, and perhaps ZFS and mmap are interacting badly. @dominikh If you are willing to help investigate,
will just run forever printing status updates every 5 seconds. If something gets stuck you should see a timeout message. |
I'm happy to keep this running for longer, but given how reliably I was triggering it that one day, this doesn't seem very promising. On the other hand, the problem hasn't reproduced naturally for me, either. I also had to reboot the machine at one point, so a lot of the state from that day is gone by now. As for ZFS and mmap, that path isn't as special as it might seem: mmapping a file causes ZFS to use the Linux page cache instead of the ARC, and normal reads check it for dirty pages. |
@dominikh The reliable hang was because the mmap'ed pages had been corrupted into a bad state, but apparently not a bad state reflected in the file itself. (It must be corrupt because the infinite loop is only possible if a specific word in the file is >16384 (= size of mapping we are seeing in the strace and the size of the file in the snapshot), but that word is <16384 in your snapshot. So somehow that loop is not reading the word from the file.) Once the mmap-available copy was corrupted, all future accesses would hang the same way. I was wondering how to trigger the mmap corruption itself. Since the snapshot file was an empty counters file (with no counter data written yet), I reasoned that perhaps the bad state was somehow triggered by multiple processes trying to create the file at the same time and interacting badly with the combination of concurrent creation, mmap, and ZFS. So the test repeatedly runs 10 concurrent creates and then has each make a counter. If the same mmap corruption happened, the test would hang. I guess not. Thanks for trying it. |
We've closed this issue because we've avoided the infinite loop at least, but I'm still curious whether we can provoke it. We may just never understand it. The code looks solid at least. If this was any other program, the answer would be to add a telemetry counter so we could at least find out whether this presumably rare occurrence is happening in the wild. Unfortunately, since this is the code that is trying to reach the telemetry counters, we can't increment a counter when it fails. 😵💫 |
Indeed, there are plenty of special cases we could imagine introducing, including just writing an empty "counter-bug-yyyy-mm-dd" file. I'm not sure we're there yet. |
Change https://go.dev/cl/598115 mentions this issue: |
Update x/telemetry to pick up recent bug fixes (notably golang/go#68311).

Change-Id: Ic95d54fc43872ab5c3548cf47134fbaea27e1487
Reviewed-on: https://go-review.googlesource.com/c/tools/+/598115
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Change https://go.dev/cl/598056 mentions this issue: |
Update x/telemetry to pick up recent bug fixes (notably golang/go#68311).

Change-Id: Ic95d54fc43872ab5c3548cf47134fbaea27e1487
Reviewed-on: https://go-review.googlesource.com/c/tools/+/598115
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
(cherry picked from commit 6f4e2a8)
Reviewed-on: https://go-review.googlesource.com/c/tools/+/598056
Auto-Submit: Robert Findley <rfindley@google.com>
The issue seems to happen to me when testing #63285 (just run

Not sure whether it's the same problem here, but the hang is there.
For now, I have created a backup of |
@cuonglm, it doesn't look like the exact same issue, as I don't see any munmap calls in your strace. In fact, the telemetry system calls at first glance look somewhat reasonable (reading the mode, the upload.token, mmapping the counter file). Questions:
|
Yes, still reproducible for me at this moment.
It's still reproducible after removing
Yes.
Here it is:

Output after sending SIGQUIT:
{23:09}~/p:main ✓ ➭ go build main.go
SIGQUIT: quit
PC=0x47af01 m=0 sigcode=0 |
FYI, it seems to only happen with the reproducer in #63285:
A simple hello world and other programs compile and run just fine.
Oh, so I sent SIGQUIT to

Output after sending SIGQUIT:
{23:26}~/p:main ✓ ➭ go build main.go
# whatever/vector
SIGQUIT: quit
PC=0x486601 m=0 sigcode=0 |
@cuonglm it looks very much like you are reproducing #63285; in other words, you are seeing a hang in go/types. Now to figure out how telemetry is related at all. What happens before the bisection point? I understand it doesn't hang, but presumably the go/types bug still exists. What type of error do you get? |
I got the same error reported in #63285 with 805f6b3:
With 483a913, |
So I tried removing all telemetry code in cmd/compile,
|
(sorry, should have assumed as much) Ok, my guess is that the timer to rotate the counter file breaks deadlock detection. |
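That guess is easy to illustrate in isolation: the runtime only reports "all goroutines are asleep - deadlock!" when nothing could ever wake a goroutine again, and a pending timer counts as something that could. A minimal standalone demonstration (unrelated to the actual telemetry code):

```go
package main

import "time"

func main() {
	// Background goroutine blocked on a ticker: its pending timer means
	// the runtime can no longer prove the program is deadlocked.
	go func() {
		t := time.NewTicker(time.Hour)
		defer t.Stop()
		for range t.C {
		}
	}()

	// Without the goroutine above, this would exit with
	// "fatal error: all goroutines are asleep - deadlock!". With it, the
	// process simply hangs, consistent with the guess that a pending
	// counter-rotation timer turns a reportable deadlock into a silent hang.
	select {}
}
```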
Thanks for reporting. Filed as #68497. |
Changes in this release

This release updates the golang.org/x/telemetry dependency to pick up fixes for the following bugs:

golang/go#68946: unnecessary downloading of the golang.org/x/telemetry/config module
golang/go#68358: a potential crash on Windows when the telemetry counter file is extended
golang/go#68311: a potential hanging process if the telemetry file is truncated

Additionally, this release changes the gopls/telemetryprompt/accepted counter to be recorded each time the prompt state is checked (golang/go#68770). None of these issues were particularly urgent, but we also wanted to make a gopls release to exercise our recently added release automation (golang/go#57643).
Go version
go version devel go1.23-82c371a307 Fri Jun 28 19:59:35 2024 +0000 linux/amd64
Output of go env in your module/workspace:

What did you do?
Run
go tool compile -help
What did you see happen?
The process never terminates and shows high CPU usage, most of which is spent in syscalls.
strace -ff shows a rapid and endless stream of mmap/munmap calls.

What did you expect to see?
Not that.