System freeze on i7-12700H (06-9a-03, microcode 0x41e and 0x429) #67
Comments
More details posted to the Linux Mint forums, in case the issue could be caused by something else. Help with decoding the BERT hexdump is still welcome, if anyone knows how to interpret it - it's the only clue. I've collected a few more, in case a larger sample helps. |
@Technologicat Is there more BERT data than you are showing above? The data appears to be missing some data records, so it doesn't have any useful information. If this is all of the data, then there may be a BIOS bug or some other reason that resulted in not capturing all of the BERT data. There is not enough info here to know what may have happened. |
@whpenner Thanks for taking a look! In that particular instance of the error, that was all of the data. However, in more recent instances of the error, for some reason, it has produced longer records - long enough for Below is a representative collection (7 records, 3300 bytes each). Here's a EDIT: Here's a zip file with the raw data, numbered as Instance 1
Instance 2
Instance 3
Instance 4
Instance 5
Instance 6
Instance 7
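For anyone who wants to turn the hex dumps from the kernel log back into raw bytes before comparing records, here is a minimal sketch. It assumes each dump line carries an 8-digit hex offset, a colon, and up to four 32-bit words printed as 8 hex digits each, with the words stored little-endian; that line shape is an assumption and is not confirmed anywhere in this thread. The function name `hexdump_to_bytes` is hypothetical.

```python
import re

# Assumed line shape (not confirmed by this thread): "<8 hex offset>: <word> <word> ..."
# with up to four 32-bit words per line, optionally followed by an ASCII column.
LINE_RE = re.compile(r"\b([0-9a-f]{8}):((?: [0-9a-f]{8}){1,4})")


def hexdump_to_bytes(log_text: str) -> bytes:
    """Rebuild the raw record from hex-dump lines found in a kernel log."""
    out = bytearray()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a hex-dump line
        offset = int(match.group(1), 16)
        if offset != len(out):
            # A gap or repeat means the dump in the log is incomplete.
            raise ValueError(f"expected offset {len(out):#x}, got {offset:#x}")
        for word in match.group(2).split():
            # Assumption: words are 32-bit values printed on a little-endian host.
            out += int(word, 16).to_bytes(4, "little")
    return bytes(out)
```

Checking `len(hexdump_to_bytes(text))` against the 3300 bytes mentioned above is a quick way to see whether a given record made it into the log in full.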
@Technologicat: I'm looking at the data/structures to make sure the data is here. It may take me a bit since I'll need to reformat it and make sure the data is all here. |
@whpenner Ok. Thank you! I uploaded the raw BERT records to Dropbox (link in the edit of the previous post) - please tell me if you need anything else. |
@Technologicat Just to let you know, I'm looking at the data, but so far it is not showing an obvious cause. I'm having to dig to look for a hint as to what may have happened. I'll let you know if I can find something. |
@whpenner Thanks for the update. Please also let me know if I can help, for example by testing in some particular way. |
Just in case it might help, here are four more raw BERT records. One from last week, three from today; by a quick glance, all of them are of the same kind as the previous ones ( Today's freezes happened while scrolling down a long PDF (300+ pages) in Emacs (labeled BERT9 and BERT10), and while updating a Back In Time backup (essentially |
@Technologicat Hi. These crashlog dumps are pretty much the same as the previous ones. I don't know if this would get us more data, but could you install and enable the mcelog tool (https://mcelog.org/)? This tool would try to capture any machine checks that happen. I can't tell if there is another error that is missed because of a second error after the system hangs. |
@Technologicat Oh, can you send us a bit more of the dmesg log. We're also looking at why we don't see more of the log in dmesg. Thanks |
@whpenner Yes, ( Here is the Sorry for no complete log before the reboot - I accidentally took the data from the wrong boot when saving the full log, noticed it only now, and the original has already been auto-deleted from I no longer have the other logs, but I can produce new ones by triggering the freeze again. I'll try to get a complete set or two over the weekend, including a full log before the freeze. (However, note that there have never been any messages in |
Now just to perform some tests... I'll post again when I have the |
Here is one set of logs, before and after a freeze. For both boots (before and after the freeze), |
I suspected that might be the case. I can't find anything in the logs that tells me what happened, other than the crashlog output telling me that various types of resets have occurred. I suspect this is where the system hangs and you reset it, or the system resets by itself. There isn't any data left to say why it hung or rebooted. Since I'm not seeing other systems like this, I would suspect the system/HW, or how the FW/BIOS responds after the hang/reboot. Your best bet may be to contact Clevo. One other item, and maybe you are already aware of this: I see many iwlwifi errors in the log. The references to microcode in that log are about the firmware for the wifi device, not processor microcode. That device seems to be hanging quite often. It's possible that whatever that device is doing could cause your system to hang/reset. That could explain some of your symptoms as well (network stops responding to pings, etc). Note the:
Ok. Yes, I'm aware of the If it's possible for a wifi driver to hard-freeze the entire PC - so fast that the kernel has no chance to log anything, and so hard that the whole kernel just dies, no kernel panic, and not even Alt+SysRq+REISUB responds - then I think blacklisting the driver is the next thing to try. My hunch that the CPU microcode might have been responsible came from an internet search for the The From my internet search, I got the impression that the most common causes for an This is a new laptop, and the vendor I bought it from tests each individual machine they sell for two weeks before shipping, so a HW failure seems unlikely (although of course not entirely impossible). I don't know if handling during international shipping could bump a laptop enough for the CPU or one of the DIMMs to become unseated just enough to start causing intermittent failures in one of the electrical contacts. Also seems unlikely, but not entirely impossible. As for a BIOS bug, yes, could be a good idea for me to contact CLEVO. I don't know if they support Linux, but it never hurts to ask. In any case, thank you for your efforts! |
Just as a last update to this issue, it's not the So most probably the I'll go ask CLEVO next for their input - thanks for your support! |
As a final follow-up to this: the freezing turned out to be caused by a faulty RAM module that wasn't caught by memtest. The RMA process took some time, but the faulty memory was replaced under warranty, and now everything is working fine.
@Technologicat Thanks for following up and letting us know. I'm glad your system is now working. |
@Technologicat I have the exact same error with my new MB with N100 CPU, I'm suspecting RAM issue as well, sometimes an error is reported by |
Hi,
I have a CLEVO PD70PNN1 laptop with an Alder Lake i7-12700H CPU (`06-9a-03`). I'm running Linux Mint 21.

The system randomly freezes under certain workloads. When this occurs, nothing responds (not even `Alt+SysRq+REISUB`), audio starts looping, the last picture remains on the screen, the system instantly vanishes from the local network (no response to `ping` or `ssh`), and most often, the system automatically reboots itself after ~10 seconds. Sometimes it doesn't reboot, and I have to hold down the power button to forcibly turn the laptop off.

After the next bootup, the kernel log reports the error `8f87f311-c998-4d9e-a0c4-6065518c4f6d` from the previous boot. There is nothing else relevant to the freeze in any of the system logs.

Googling the error number led me here. I read through #44 and #58, but the conditions triggering the freeze as well as the CPU model are different.
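To check whether a saved kernel log (for example, the output of `dmesg` or `journalctl -k` written to a file) contains this error, a small sketch like the following can help. The function name `find_guid_lines` is hypothetical, and the match is a plain case-insensitive substring search; it makes no assumption about the wording of the surrounding message.

```python
# GUID reported in the kernel log after each freeze (taken from this post).
BERT_SECTION_GUID = "8f87f311-c998-4d9e-a0c4-6065518c4f6d"


def find_guid_lines(log_text: str, guid: str = BERT_SECTION_GUID) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for log lines that mention the GUID."""
    needle = guid.lower()
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if needle in line.lower()
    ]
```

An empty result for a boot that followed a freeze would suggest that no record was reported for that boot.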
The full error message, complete with hex dump, is included at the end of this post.
Any thoughts, anyone? If the error looks relevant to the Intel microcode, what other info should I provide? And in case I'm barking up the wrong tree, any pointers where to go next?
Some details:
The machine is new and has been tested (in Windows) by the retailer I bought it from, so a hardware failure is unlikely. Linux Mint 21 is the only OS on the machine.
In the BIOS, I have set the GPU mode to only use the NVIDIA dGPU (RTX 3070 Ti Mobile). This solved some issues with external displays. The Intel iGPU is off, and `lsmod` confirms the `i915` driver is not loaded.

The (Samsung) SSD firmware is up to date, and setting `nvme_core.default_ps_max_latency_us=0` does not affect the freezing.

Workloads that reliably reproduce the system freeze:
In the Automatic1111 workload, when the freeze occurs, GPU compute usage itself (according to `nvtop`) remains at 0%. The system freezes while Automatic1111 is computing a checksum for the model checkpoint file, on the CPU.

Over the two months the system has been in use, there have also been freezes on rare occasions with other workloads, for example:
But curiously, I can go a full work week writing a book in Emacs, recompiling the PDF every few minutes (`org-mode` + `pdflatex`), without experiencing any system freezes.

I have gone through the usual troubleshooting checklist: temperatures, RAM health, NVIDIA driver version, SSD firmware version, NVMe power saving, CPU C-states, ACPI, `pcie_aspm=off`, other miscellaneous kernel boot options, and so on. Nothing so far has affected the freezing behavior. I can post the details if needed.

The system behaves the same on its built-in microcode revision `0x41e` as with the updated microcode from the 20230214 release (`06-9a-03`, revision `0x429`, date 2023-01-11). The new microcode was installed via `apt`, package `intel-microcode`, version `3.20230214.0ubuntu0.22.04.1`. Specifically, this is from `jammy-updates`, which is one of the default repositories of Linux Mint 21.

The BIOS is the original version that came preinstalled; CLEVO's driver download page for PD70PNN1 does not have any BIOS updates.
Full error message from the kernel log