Linux not* booting after BIOS flash on a dual-boot machine with Windows 11 [I can't use Arch btw]

goobert123

Preface:
I own an Intel Core i9-13900K, installed on my home PC. There was recent news that indicated mysterious hardware failure affecting the Raptor Lake CPUs, specifically the unlocked Core i9s from the 13th and 14th generations. Although the sources that first reported this issue were using the CPUs on game servers, which likely meant they were being utilized heavily, the 100% hardware failure rate they reported was shocking to say the least. Hence, I wanted to protect the CPU on my home machine from slow degradation and possibly extend its life. Since I'm still under warranty, I cannot mess with the clock settings to achieve this. Only official microcode updates and motherboard BIOS updates would not void my warranty (AFAIK). About a month ago, Intel apparently released a microcode update (version 125) for my processor. But I found out about this whole issue only two days back.

Prologue:
About a week ago, before I discovered the Raptor Lake CPU problem or the available microcode update from Intel, I had installed Arch Linux on my desktop PC (which has the i9). Prior to this, I had Arch Linux installed on a laptop that I used to practice programming and web development on the side. Over time, I bloated the Arch installation on the laptop with a ton of packages that I didn't document properly, and I wanted a clean slate with solid documentation. So, I disabled secure boot on my desktop, installed Arch on a separate NVMe drive, and selectively began to port over the configs from the other machine that I needed the most. I am not a developer or sysadmin by trade. It's just a 'professional hobby' at this point.

Chapter 1 [When I First Found Out]:
I watched a video on YouTube that discussed the CPU issue that could potentially affect my system, and it spooked me a little. I ran a Windows backup and prepared to update my CPU microcode. On my current Arch installation, I had ignored the microcode setup and put it off for another time. This was probably a grave oversight on my part¹.

Just to be sure, I went to the support page for my motherboard to check for any BIOS updates, and lo and behold, there was a new update. I assumed it came with the microcode (it did), along with some optimizations for it. So, I ditched the plan of only updating my microcode and chose the riskier endeavor of flashing my motherboard. I followed the instructions, double-checked everything, said a few prayers, and the flash was successful. The motherboard didn't brick.

At this point, I didn't even know that a BIOS update reverts all settings to their defaults. I'm not sure this is relevant, though, since the only changes I had made were enabling XMP when I built the PC and modifying the boot order of my drives when I installed Arch. Everything else was on auto in the previous version of my BIOS firmware.

SPECS
Motherboard: GIGABYTE Z790 AERO G (1.0)
BIOS Version (previous): F8
BIOS Version (latest): F12e
VGA 1: GIGABYTE RTX 4080 Gaming OC
VGA 2: Intel UHD Graphics 770
CPU: Intel Core i9-13900K
Memory: CORSAIR VENGEANCE 64GB (2x32GB) DDR5 DRAM 6000MT/s CL40 Memory Kit
Power Supply: CoolerMaster MWE Gold V2 / 1250 W
OS1: Windows on 1 TB PCIe Gen5 NVMe
OS2: Arch Linux on 1 TB PCIe Gen4 NVMe (linux kernel & linux-zen kernel)
Bootloader: GRUB
Filesystem (Arch): ext4 (encrypted LVM)

Chapter 2 [The Symptoms]:
The first thing I did after the BIOS flash was select Arch Linux - Linux kernel on GRUB bootloader. As it begins loading, the monitor displays:
Code:
Loading Linux linux ...
Loading initial ramdisk ...

I hear the relay switching sound on my PSU (I assume it comes from the PSU) as my screen goes blank and my PC reboots abruptly. I was back on GRUB. I think maybe it's just a bit of starting trouble and try to load Arch again. This time, after:
Code:
Loading Linux linux ...
Loading initial ramdisk ...

It makes it to the LVM decryption password prompt. Hooray, I think to myself, and begin typing the password to decrypt the LVM that contains my root and /home volumes. As I am typing my password, all of a sudden, I hear the relay switch again, my screen goes blank, the fans whirr up, and the PC reboots. I am back on GRUB. Slight panic sets in.

I had to make sure this didn't affect Windows, so I selected the Windows 11 menu entry on GRUB, and it began loading. I had my fingers crossed, but apparently there was nothing to worry about. It loaded successfully, and I was able to log in. Phew, I can load Windows at least. Then I remembered I had installed the linux-zen kernel as a backup and thought maybe that would boot without causing any trouble. So, I rebooted my PC and selected the linux-zen kernel:
Code:
Loading Linux linux-zen ...
Loading initial ramdisk ...

(relay switching sound) Reboot. Back to GRUB. I realise that I am stuck in a boot loop whenever I select a Linux kernel on GRUB. At some seemingly random point during the transition from the boot loader to the OS, a reboot is triggered. But by what, though?

I boot back into Windows and go online to look for solutions. On some Reddit and forum posts with similar issues, people were asked to chroot into their Linux installation and rebuild GRUB². I still had the flash drive that contained the live Arch installation medium from when I installed it a week ago. So I power off my PC, plug in the installation media, and power back on. I choose the flash drive in the boot priority and, when it loads, select "Arch Linux Install Medium". It begins loading as usual, and about 15 seconds into it, I hear the relay switching sound, and the PC reboots again. I am now in full panic mode. Not because I couldn't load into my week-old Arch system, but because I may never be able to install Arch again. Just to be sure, I try one more time, but it ends with the same result. Then I remember I had an Ubuntu installation medium lying around somewhere that I use to partition drives. I find it, plug it in, reboot, and select the flash drive that contains the Ubuntu installation media. I choose the "Try Ubuntu without installing" option, it begins loading, and I see the GNOME desktop environment show up on the monitor. For a brief moment, I felt happy, till I heard the relay switching sound again. Sigh.

All this while, my monitors were connected to the GPU's ports (DP on Primary monitor and HDMI on secondary).

Quick recap on boot status:
Environment | Boot | Behaviour
Windows 11 | Y | Nothing suspicious
Arch Linux linux kernel | N | Reboot at a random time (~2s-8s) after selection on GRUB
Arch Linux linux-zen kernel | N | Reboot at a random time (~2s-8s) after selection on GRUB
Arch Linux installation media (recent) | N | Reboot at a random time (~8s-15s) after selection on GRUB
Ubuntu installation media (a bit old) | Y* (but really N) | Tried once; GNOME desktop environment rendered successfully before the crash reboot

I booted Windows again and went back online to look for answers. There were some posts that suggested it might be a GPU issue when this sort of thing happens, and they said the GPU driver must be rebuilt.

I doubted that it was really a GPU issue³, since the Ubuntu GNOME desktop successfully loaded before the abrupt reboot. But I was ready to try anything at this point, so I decided to focus my efforts on rebuilding the GPU drivers. At the very least, I could rule it out.

Chapter 3 [Seeking Stability]:
To rebuild the GPU driver for the linux and linux-zen kernels (I use the nvidia-dkms driver so that one build process covers different kernels), I must either boot a stable chroot environment or somehow access a tty on my existing Arch installation. So I began investigating how to achieve this.

In some random corner of Reddit, I find a post about an issue that is very different from the one I'm facing, and someone in the comments talks about power management issues. They suggest passing the kernel parameter acpi=off. I look it up, and it doesn't seem like it would do any damage (to the extent of my knowledge), so I edit the kernel parameters of the Linux kernel entry on GRUB and press Ctrl+X to boot with the modified kernel parameters:
Code:
Booting a command list
Loading Linux linux ...
Loading initial ramdisk ...

The LUKS volume decryption prompt appears and I am not convinced yet. I have reached this prompt before and it had still failed earlier, so I brace to hear the relay switch again. But nothing. I enter the password, the screen goes blank, then a cursor loads, and my display manager loads!

I was super happy to see this.

You must also know that I don't use a regular display manager like LightDM but a TUI-based display manager called Ly. I type in my credentials, but I am met with a blank screen with a blinking cursor. This is good. I can still access the tty, so I switch into one, type in my credentials, and log in. As all this is happening, I notice the CPU fans ramping up at a slow pace, and as I begin typing the command to delete my GPU driver, the fans get louder. So I pause for a bit to observe it. It kept ramping up, and because I didn't want my CPU to spontaneously combust (since that's kinda what started this whole saga in the first place), I chose to power off the machine and entered the command poweroff. The PC began to power off, but it didn't complete the shutdown. The screen froze after printing some warnings/errors (which I didn't take note of at the time). I thought it might have been a symptom of passing the acpi=off kernel parameter, which I knew nothing about at the time. I just pressed the reboot button on my PC's chassis, and it rebooted.

Now I know (do I really?) it may be a power management issue, so I log back into Windows to prod further. On a lot of forum/Reddit posts, people say it is a fundamental incompatibility between the hardware (motherboard) and the Linux kernel since the ACPI tables probably mismatch or don't parse correctly (?). They suggest contacting the motherboard manufacturer. So I did. I sent Gigabyte a support ticket with my findings so far in a very concise format.

Chapter 4 [ACPI]:
While I waited for a reply from Gigabyte, I thought I might as well try to narrow down the issue as best I could. So, I did some preliminary research and found two wiki pages on debugging Linux ACPI issues. First, I came across an Ubuntu wiki page dedicated to debugging ACPI issues.

The wiki provides instructions to try passing a series of kernel parameters in order to narrow down the issue. I tried all of them.

After going through the instructions on the Ubuntu wiki page, I also found an Arch wiki page on ACPI modules. Section 4.3, Boot-Looping, had some test parameters not found in the Ubuntu wiki.

Ubuntu wiki instruction results:
Kernel Parameter | Boot | Behaviour
acpi=off | Y | Display server (Xorg) doesn't load, but tty is available
acpi=ht | N | Reboots
pci=noacpi | N | Screen freezes after "Loading Linux linux ... Loading initial ramdisk ..."; stays frozen until I manually reboot
acpi=noirq | N | Reboots
pnpacpi=off | N | Reboots
noapic | Y | Loads display manager, loads display server, loads window manager!!
nolapic | N | Screen freezes after "Loading Linux linux ... Loading initial ramdisk ...", then reboots

Arch wiki instruction results:
Kernel Parameter | Boot | Behaviour
acpi_osi="Windows 2015" | N | Reboots
processor.max_cstate=0 | N | Reboots
intel_idle.max_cstate=2 | Y | Loads display manager, loads display server, loads window manager, everything works!!
idle=nomwait | N | Reboots

In the tables above, you'll notice that in two cases, everything loaded. I shed a tear when I saw my default, ugly i3 status bar.

So, the two parameters that let me access my window manager (i3) were noapic and intel_idle.max_cstate=2. Between the two, and to the extent of my knowledge at the time, I decided to use intel_idle.max_cstate=2 for any further investigations, as it seemed less restrictive than noapic to me.
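For reference, here is a minimal sketch of how a working parameter could be made persistent instead of being typed at the GRUB prompt on every boot. It assumes the stock Arch layout (/etc/default/grub as the config source, /boot/grub/grub.cfg as the generated file); the helper name add_kernel_param is made up for illustration and is not part of GRUB.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM  (hypothetical helper, not a GRUB tool)
# Appends PARAM to the GRUB_CMDLINE_LINUX_DEFAULT="..." line in FILE,
# unless that line already contains it. Uses GNU sed's -i option.
# Assumes PARAM contains no '|', '&' or '"' characters.
add_kernel_param() {
    file=$1
    param=$2
    # Skip the edit if the parameter is already on the command line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${param}\"|" "$file"
}

# Usage on the installed system, as root (paths assume the stock Arch layout):
#   add_kernel_param /etc/default/grub intel_idle.max_cstate=2
#   grub-mkconfig -o /boot/grub/grub.cfg
```

Editing the entry with Ctrl+X, as done in the tests above, stays the safer way to try a parameter once, since it leaves the config on disk untouched.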

Chapter 5 [C-States]:
It is now the next day. I check the Gigabyte support ticket I created, and there’s a reply.

Paraphrased - "Ubuntu support is not mentioned in the specifications, so we cannot offer assistance. Have a nice day."

Fair. Screw you, but fair. I can’t really place all the blame on Gigabyte... yet. Maybe all of this could have been easily avoided if I'd done conscientious research about flashing the BIOS on a dual-boot machine with Linux and Windows. Whatever... looks like Gigabyte won’t be helping me, and that's fair (no sarcasm).

I start looking into the kernel parameters that worked and I'm still trying to make the connection between noapic and intel_idle.max_cstate=2. Why are these the only parameters that let me boot into Arch? I don’t understand the relation (if any) between the two, and if you haven’t noticed yet, I’m a complete novice. I just follow existing instructions and I'm not very proficient at conducting independent investigations since I don’t understand kernel or hardware behavior. Maybe someday I might...

When I looked up C-States on Google, I came across a discussion on an overclocking subreddit. This post was related to overclocking, but by reading it, I learned that C-States could be disabled in BIOS and may only result in extra power consumption. Some users even mentioned that idle power consumption, even with C-States disabled, was still super low and didn’t make a big difference. Besides, disabling it in BIOS would mean I wouldn’t have to touch my GRUB config. Maybe setting intel_idle.max_cstate to a value greater than 2 as mentioned in the 4.3 Boot-looping section on the Arch wiki page might work, but I just disabled it in the BIOS instead. Arch boots up without adding any extra parameters, and I was able to log in to my environment. In fact, this post is created in a browser within my Arch installation. No odd behavior of CPU/GPU noted so far. I didn’t end up rebuilding GRUB or the Nvidia driver(s).

Disabling C-States in BIOS seems like the best option at the moment, but I just can’t shake this itch. I never had to disable them in the previous version of my BIOS, and everything worked just fine. I don’t know if I should compromise by permanently disabling C-States. Will disabling C-States, even with updated microcode and BIOS firmware, really help my CPU live longer like I initially wanted it to? Trying to protect my CPU was the whole reason I got into this mess, and now I'm disabling C-States without knowing if the microcode update from Intel even works to halt further damage. Is there a better option? I don’t have the answers; I was hoping someone here, who got this far down, might.
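To see what the kernel actually ends up with after a BIOS or parameter change, the cpuidle sysfs tree can be read directly. The sketch below lists the idle states the kernel registered for cpu0; the function name list_cstates is made up, and the optional argument exists only so the function can be pointed at a different directory.

```shell
#!/bin/sh
# list_cstates [CPU_SYSFS_ROOT]  (hypothetical helper)
# Prints one line per idle state registered for cpu0:
#   <stateN> <name> disabled=<0|1>
# Prints nothing if no cpuidle states are exposed.
list_cstates() {
    root=${1:-/sys/devices/system/cpu}
    for state in "$root"/cpu0/cpuidle/state*; do
        [ -d "$state" ] || continue   # glob did not match anything
        printf '%s %s disabled=%s\n' \
            "${state##*/}" "$(cat "$state/name")" "$(cat "$state/disable")"
    done
}

# Related read-only checks on a real system:
#   cat /sys/devices/system/cpu/cpuidle/current_driver   # e.g. intel_idle or acpi_idle
#   cat /sys/module/intel_idle/parameters/max_cstate
#   cat /proc/cmdline
```

Comparing the output with C-States enabled plus intel_idle.max_cstate=2 against the output with C-States disabled in BIOS would show which deeper states are present only in the failing configuration.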

Epilogue [The End?]:
I could use some help in dividing the blame here. I choose to put myself at the top of the list since I faced absolutely no issues with the previous version of my BIOS firmware. I violated the cardinal rule of BIOS flashing: if it ain't broke, don't fix it. I am ready to take all the blame. If you think I should, I respect your choice.

However, as it stands (in descending order of rank):
  1. Me: For prematurely attempting to fix something
  2. Gigabyte: Bad ACPI tables? I don't have the knowledge to verify this myself
  3. Intel: Faulty CPUs, allegedly* trying to run out the warranty period on Raptor Lake i9s to avoid a recall
  4. Linux Kernel ACPI Subsystem: Blasphemy, I know; this is just for the lols
  5. Windows 11: It's a dual-boot system, so you never know xD Might even deserve the spot above Intel tbh

Feel free to rip me a new one.

If you notice something or have suggestions to further narrow down the issue, please let me know with any cautionary details wherever applicable. I can gather logs from the crash reboots if needed, I just don't know what to look for. I vaguely remember reading somewhere that you can pass kernel parameters to live installation media. If that's true, I could use that to chroot into my Arch installation and collect logs.
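On gathering logs: a rough sketch of one way to pull the potentially relevant kernel messages out of a previous boot. The filter function name and its keyword list are arbitrary choices for illustration. The device and volume names in the comments come from the GRUB entry further down; "cryptlvm" is just a placeholder mapper name. This only helps if the journal is persistent (/var/log/journal exists), and an abrupt reset may well cut the log off before anything useful is written.

```shell
#!/bin/sh
# filter_boot_log  (hypothetical helper)
# Reads a kernel log on stdin and keeps only lines that mention ACPI,
# machine checks, microcode, or the idle drivers. A loose keyword filter,
# not a diagnosis.
filter_boot_log() {
    grep -iE 'acpi|mce|machine check|microcode|intel_idle|cpuidle'
}

# On the installed system (booted with a working parameter):
#   journalctl --list-boots
#   journalctl -k -b -1 | filter_boot_log      # kernel messages, previous boot
#
# From live media, after unlocking and mounting the installed system:
#   cryptsetup open /dev/nvme0n1p3 cryptlvm
#   vgchange -ay volgroup0
#   mount /dev/mapper/volgroup0-lv_root /mnt
#   journalctl -D /mnt/var/log/journal --list-boots
#   arch-chroot /mnt                           # only if a chroot is needed
```

Kernel parameters can usually be edited from the live medium's boot menu as well (commonly by pressing e on the highlighted entry), which is how a working parameter such as intel_idle.max_cstate=2 could be passed to the installer environment.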

Appendix:
⁽¹⁾ Ignored the microcode setup in Arch. Since the microcode in my case is updated by BIOS firmware, does Arch really need its own microcode setup?
⁽²⁾ There might be a chance that it's a GRUB config issue, but then why would the Arch installation media or a live Ubuntu environment also abruptly reboot?
⁽³⁾ Can it be a GPU issue? I haven't rebuilt the drivers and I am able to boot into and use Arch after disabling C-States or passing one of the working kernel parameters.
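Regarding footnote 1, one way to check which microcode revision is actually active, whether it came from the BIOS or from the OS, is to read it from /proc/cpuinfo. This is a small sketch; the function name microcode_rev is made up, and the optional argument only exists so it can be run against a saved copy of cpuinfo.

```shell
#!/bin/sh
# microcode_rev [CPUINFO_FILE]  (hypothetical helper)
# Prints the microcode revision reported for the first logical CPU,
# e.g. a hex value such as 0x125 (illustrative value only).
microcode_rev() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# On a running system:
#   microcode_rev
#   journalctl -k | grep -i microcode    # kernel messages about microcode loading
```

If the revision printed here matches what the new BIOS is supposed to ship, the firmware has already applied it before the kernel starts.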

The GRUB menu entry to load Arch Linux (without any added kernel parameters):
Code:
setparams 'Arch Linux'
    load_video
    set gfxpayload=keep
    insmod gzio
    insmod part_gpt
    insmod ext2
    search --no-floppy --fs-uuid --set=root %rootUUID%
    echo    'Loading Linux linux ...'
    linux   /vmlinuz-linux root=/dev/mapper/volgroup0-lv_root rw loglevel=3 cryptdevice=/dev/nvme0n1p3:volgroup0 nvidia-drm.modeset=1 quiet
    echo    'Loading initial ramdisk ...'
    initrd  /initramfs-linux.img

Links:
  1. The issue with intel CPUs that set this in motion - here
  2. The Gigabyte press release containing some information on the BIOS firmware update - here (I tried Enabling CEP, it didn't work)
  3. The Arch wiki page on microcode - here
  4. The Ubuntu wiki page on debugging ACPI issues - here
  5. The Arch wiki page on ACPI modules - here
  6. The BIOS setup manual for my motherboard - here
Here are the names of all the files that were dumped when I ran acpidump -b:
Code:
apic.dat  dbgp.dat  facs.dat  hpet.dat  mcfg.dat  ssdt10.dat  ssdt13.dat  ssdt16.dat  ssdt19.dat  ssdt2.dat  ssdt5.dat  ssdt8.dat  wpbt.dat
bgrt.dat  dsdt.dat  fidt.dat  hwin.dat  nhlt.dat  ssdt11.dat  ssdt14.dat  ssdt17.dat  ssdt1.dat   ssdt3.dat  ssdt6.dat  ssdt9.dat  wsmt.dat
dbg2.dat  facp.dat  fpdt.dat  lpit.dat  phat.dat  ssdt12.dat  ssdt15.dat  ssdt18.dat  ssdt20.dat  ssdt4.dat  ssdt7.dat  tpm2.dat
If you are interested in any specific .dat file, let me know and I can provide the disassembled output.

Since the noapic parameter also seemed to work, here is the raw table data from iasl -d on apic.dat:
Code:
    0000: 41 50 49 43 DC 01 00 00 05 A0 41 4C 41 53 4B 41  // APIC......ALASKA
    0010: 41 20 4D 20 49 20 00 00 09 20 07 01 41 4D 49 20  // A M I ... ..AMI
    0020: 13 00 00 01 00 00 E0 FE 01 00 00 00 00 08 00 00  // ................
    0030: 01 00 00 00 00 08 01 01 01 00 00 00 00 08 02 08  // ................
    0040: 01 00 00 00 00 08 03 09 01 00 00 00 00 08 04 10  // ................
    0050: 01 00 00 00 00 08 05 11 01 00 00 00 00 08 06 18  // ................
    0060: 01 00 00 00 00 08 07 19 01 00 00 00 00 08 08 20  // ...............
    0070: 01 00 00 00 00 08 09 21 01 00 00 00 00 08 0A 28  // .......!.......(
    0080: 01 00 00 00 00 08 0B 29 01 00 00 00 00 08 0C 30  // .......).......0
    0090: 01 00 00 00 00 08 0D 31 01 00 00 00 00 08 0E 38  // .......1.......8
    00A0: 01 00 00 00 00 08 0F 39 01 00 00 00 00 08 10 40  // .......9.......@
    00B0: 01 00 00 00 00 08 11 42 01 00 00 00 00 08 12 44  // .......B.......D
    00C0: 01 00 00 00 00 08 13 46 01 00 00 00 00 08 14 48  // .......F.......H
    00D0: 01 00 00 00 00 08 15 4A 01 00 00 00 00 08 16 4C  // .......J.......L
    00E0: 01 00 00 00 00 08 17 4E 01 00 00 00 00 08 18 50  // .......N.......P
    00F0: 01 00 00 00 00 08 19 52 01 00 00 00 00 08 1A 54  // .......R.......T
    0100: 01 00 00 00 00 08 1B 56 01 00 00 00 00 08 1C 58  // .......V.......X
    0110: 01 00 00 00 00 08 1D 5A 01 00 00 00 00 08 1E 5C  // .......Z.......\
    0120: 01 00 00 00 00 08 1F 5E 01 00 00 00 01 0C 02 00  // .......^........
    0130: 00 00 C0 FE 00 00 00 00 02 0A 00 00 02 00 00 00  // ................
    0140: 00 00 02 0A 00 09 09 00 00 00 0D 00 04 06 01 05  // ................
    0150: 00 01 04 06 02 05 00 01 04 06 03 05 00 01 04 06  // ................
    0160: 04 05 00 01 04 06 05 05 00 01 04 06 06 05 00 01  // ................
    0170: 04 06 07 05 00 01 04 06 08 05 00 01 04 06 09 05  // ................
    0180: 00 01 04 06 0A 05 00 01 04 06 0B 05 00 01 04 06  // ................
    0190: 0C 05 00 01 04 06 0D 05 00 01 04 06 0E 05 00 01  // ................
    01A0: 04 06 0F 05 00 01 04 06 10 05 00 01 04 06 11 05  // ................
    01B0: 00 01 04 06 12 05 00 01 04 06 13 05 00 01 04 06  // ................
    01C0: 14 05 00 01 04 06 15 05 00 01 04 06 16 05 00 01  // ................
    01D0: 04 06 17 05 00 01 04 06 00 05 00 01              // ............
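The raw dump is hard to read by eye, so here is a rough sketch that walks it and counts the interrupt controller structures by type. It relies on the standard MADT layout: a 36-byte ACPI header, a 4-byte local APIC address, 4 bytes of flags, then a sequence of entries that each start with a type byte and a length byte. The function name madt_entry_counts is made up, and it assumes the dump is in the offset/bytes/comment format shown above.

```shell
#!/bin/sh
# madt_entry_counts  (hypothetical helper)
# Reads an iasl-style raw hex dump of the MADT ("APIC" table) on stdin and
# prints "<type> <count>" per entry type, sorted by type.
# Common types: 00 = Processor Local APIC, 01 = I/O APIC,
#               02 = Interrupt Source Override, 04 = Local APIC NMI.
madt_entry_counts() {
    # Drop the "// ascii" column and the "0000:" offset column,
    # then emit one hex byte per line.
    sed -e 's|//.*$||' -e 's|^ *[0-9A-Fa-f]*: *||' |
        tr -s ' \t' '\n\n' |
        grep -E '^[0-9A-Fa-f]{2}$' |
        awk '
            function hexval(s,    digits, hi, lo) {
                digits = "0123456789ABCDEF"
                s = toupper(s)
                hi = index(digits, substr(s, 1, 1)) - 1
                lo = index(digits, substr(s, 2, 1)) - 1
                return hi * 16 + lo
            }
            { byte[NR - 1] = $1 }          # byte[offset] = hex string
            END {
                i = 44                     # 36-byte header + 4 + 4
                while (i + 1 < NR) {
                    len = hexval(byte[i + 1])
                    if (len < 2) break     # malformed entry, stop walking
                    count[toupper(byte[i])]++
                    i += len
                }
                for (t in count) print t, count[t]
            }
        ' | sort
}

# Usage: save the dump lines above to a file, then
#   madt_entry_counts < apic_raw.txt
```

Fed the dump above, this should report 32 type-00 entries (one per hardware thread of the 13900K), one type-01, two type-02, and 24 type-04 entries. The fully decoded fields are available in the .dsl file that iasl -d writes next to apic.dat.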

Things left to try:
  • Rebuild GRUB
  • Rebuild nvidia drivers
  • Every single troubleshooting method on the Arch wiki page on ACPI modules
  • Steps on the DSDT page on Arch wiki
  • Rule out any devices causing the issue(wireless, sound, etc)
  • ???
  • Become an expert on ACPI subsystem, learn the standard specification, find the root cause of my issue, build a custom kernel
 


G'day goobert123, Welcome to Linux.org

Go into bios, and look to see if Secure Boot is enabled or disabled.

It needs to be Disabled.

If you happen to come across it, do the same with FastBoot
 
Hey Condobloke, thanks for the reply.
Secure boot status was one of the first things I checked after flashing the BIOS. It remained off after the flash to new firmware.
Secure boot set to off is the default state on my motherboard. I had to turn it on when I built the PC.

However, fast boot was on. And during preliminary investigation, I had tried to boot with it turned off and my PC would still reboot as Arch was loading.
 
You may need to purge and restart the BIOS. On a laptop this is usually done by disconnecting the power supply and any powered peripherals, switching on the machine, then pressing and holding down the power button for 60 seconds; release, wait 20 seconds, and then power on.
 
If you are talking about clearing the CMOS to perform a reset, I want to reserve this as a last resort. I may even roll back to the previous firmware before I try a hard reset.
Since Windows 11 loads without any issues and Arch loads with C-States disabled, I just want to rule out faulty BIOS firmware before I start disassembling my rig.
 
I just want to rule out faulty BIOS firmware before I start disassembling my rig.
If by disassembling you are referring to getting to the CMOS battery and removing it, then on modern machines you do not have to take the battery out; just save any work and follow my instructions. This has worked for quite a few members, although it's not a guaranteed fix for your problem.
 
Preface:
I own an Intel Core i9-13900K, installed on my home PC. There was recent news that indicated mysterious hardware failure affecting the Raptor Lake CPUs, specifically the unlocked Core i9s from the 13th and 14th generations. Although the sources that first reported this issue were using the CPUs on game servers, which likely meant they were being utilized heavily, the 100% hardware failure rate they reported was shocking to say the least. Hence, I wanted to protect the CPU on my home machine from slow degradation and possibly extend its life. Since I'm still under warranty, I cannot mess with the clock settings to achieve this. Only official microcode updates and motherboard BIOS updates would not void my warranty (AFAIK). About a month ago, Intel apparently released a microcode update (version 125) for my processor. But I found out about this whole issue only two days back.

Prologue:
About a week ago, before I discovered the Raptor Lake CPU problem or the available microcode update from Intel, I had installed Arch Linux on my desktop PC (which has the i9). Prior to this, I had Arch Linux installed on a laptop that I used to practice programming and web development on the side. Over time, I bloated the Arch installation on the laptop with a ton of packages that I didn't document properly, and I wanted a clean slate with solid documentation. So, I disabled secure boot on my desktop, installed Arch on a separate NVMe drive, and selectively began to port over the configs from the other machine that I needed the most. I am not a developer or sysadmin by trade. It's just a 'professional hobby' at this point.

Chapter 1 [When I First Found Out]:
I watched a video on YouTube that discussed the CPU issue that could potentially affect my system, and it spooked me a little. I ran a Windows backup and prepared to update my CPU microcode. On my current Arch installation, I had ignored the microcode setup and put it off for another time. This was probably a grave oversight on my part¹.

Just to be sure, I went to the support page for my motherboard to check for any BIOS updates, and lo and behold, there was a new update. I assumed it came with the microcode (it did), along with some optimizations for it. So, I ditched the plan of only updating my microcode and chose the riskier endeavor of flashing my motherboard. I followed the instructions, double-checked everything, said a few prayers, and the flash was successful. The motherboard didn't brick.

At this point, I didn't even know that the BIOS would revert to default settings when it is updated. But I'm not sure if this is relevant since the only setting I changed when I built the PC were enabling the XMP and I had modified the boot order of my hard drives when I installed Arch. Everything else was on auto on the previous version of my BIOS firmware.

SPECS
Motherboard
GIGABYTE Z790 AERO G (1.0) - link
BIOS Version(previous)​
F8 - link
BIOS Version(latest)​
F12e - link
VGA 1
GIGABYTE RTX 4080 Gaming OC - link
VGA 2
Intel UHD Graphics 770 - link
CPU
Intel Core i9-13900K - link
Memory
CORSAIR VENGEANCE 64GB (2x32GB) DDR5 DRAM 6000MT/s CL40 Memory Kit- link
Power Supply
CoolerMaster MWE Gold V2 / 1250 W - link
OS1
Windows on 1 TB PCIe Gen5 NVMe - link
OS2
Arch Linux on 1 TB PCIe Gen4 NVMe - link (linux kernel & linux-zen kernel)​
Bootloader
GRUB
Filesystem(Arch)
ext4 (encrypted LVM)

Chapter 2 [The Symptoms]:
The first thing I did after the BIOS flash was select Arch Linux - Linux kernel on GRUB bootloader. As it begins loading, the monitor displays:
Code:
Loading Linux linux ...
Loading initial ramdisk ...

I hear the relay switching sound on my PSU (I assume it comes from the PSU) as my screen goes blank and my PC reboots abruptly. I was back on GRUB. I think maybe it's just a bit of starting trouble and try to load Arch again. This time, after:
Code:
Loading Linux linux ...
Loading initial ramdisk ...

It makes it to the LVM decryption password prompt. Hooray, I thought to myself and begin typing the password to decrypt the LVM that contains my /root and /home. As I am typing my password, all of a sudden, I hear the relay switch again, my screen goes blank, the fans whirr up, and the PC reboots. I was back on GRUB. Slight panic set in.

I had to make sure this didn't affect Windows and so I select the Windows 11 menu entry on GRUB, and it begins loading. I had my fingers crossed, but apparently, there was nothing to worry about. It loads successfully, and I am able to log in. phew I can load Windows at least. Then I remember I had installed the linux-zen kernel as a backup and thought maybe that would boot without causing any trouble. So, I rebooted my PC and selected the linux-zen kernel:
Code:
Loading Linux linux-zen ...
Loading initial ramdisk ...

relay switching sound Reboot. Back to GRUB. I realise that I am stuck in a boot loop whenever I select a Linux kernel on GRUB. At some seemingly random point during the transition from the boot loader to OS, a reboot signal is triggered. But what though?

I boot back into Windows and go online to look for solutions. On some Reddit and forum posts with similar issues, people were asked to chroot into their Linux installation and rebuild GRUB². I still had the flash drive that contained the live Arch installation medium from when I installed it a week ago. So I power off my PC, plug in the installation media, and power back on. I choose the flash drive in the boot priority and when it loaded, select "Arch Linux Install Medium". It began loading as usual, and about ~15 seconds into it, I hear the relay switching sound, and the PC reboots again. I am now in full panic mode. Not because I couldn't load into my week-old Arch system, but because I may never be able to install Arch again. Just to be sure, I try one more time, but it ends with the same result. Then I remember I had an Ubuntu installation media laying around somewhere that I use to partition drives. I find it, plug it in, reboot, and select to boot the flash drive that contained Ubuntu installation media. I choose the "Try Ubuntu without installing" option, it begins loading, and I see the GNOME desktop environment show up on the monitor. For a brief moment, I felt happy till I heard the relay switching sound again. sigh.

All this while, my monitors were connected to the GPU's ports (DP on Primary monitor and HDMI on secondary).

Quick recap on boot status:
EnvironmentBootBehaviour
Windows 11YNothing suspicious
Arch Linux linux kernelNReboot at a random time~(2s-8s) after selection on GRUB
Arch Linux linux-zen kernelNReboot at a random time~(2s-8s) after selection on GRUB
Arch Linux installation media (recent)NReboot at a random time~(8s-15s) after selection on GRUB
Ubuntu installation media (a bit old)Y* but really NTried once with successful render of GNOME desktop environment before crash reboot

I booted Windows again and went back online to look for answers. There were some posts that suggested it might be a GPU issue when this sort of thing happens, and they said the GPU driver must be rebuilt.

I had my suspicions if it was really a GPU issue³ since the Ubuntu GNOME desktop successfully loaded before the abrupt reboot. But I was ready to try anything at this point, so I decided to focus my efforts to rebuild the GPU drivers. At the very least, I could rule it out.

Chapter 3 [Seeking Stability]:
To rebuild the GPU driver for Linux and Linux-zen (I use the nvidia-dkms driver for an integrated build process for different kernels), I must either boot a stable chroot environment or somehow access the tty on my existing Arch installation. So I began my investigation on how to achieve this.

On some random corner of Reddit, I find a post about an issue that is very different from the one I'm facing, and someone in the comments talked about power management issues. They suggested passing a kernel parameter acpi=off. I look it up, and it didn't seem like it would do any damage (to the extent of my knowledge), so I edit the kernel parameters on the Linux kernel on GRUB and pressed Control+X to boot with the modified kernel parameters:
Code:
Booting a command list
Loading Linux linux ...
Loading initial ramdisk ...

The LUKS volume decryption prompt appears and I am not convinced yet. I have reached this prompt before and it had still failed earlier, so I brace to hear the relay switch again. But nothing. I enter the password, the screen goes blank, then a cursor loads, and my display manager loads!

I was super happy to see this.

You must also know that I don't use a regular display manager like LightDM but a TUI-based display manager called Ly. I type in my credentials, but I am met with a blank screen with a blinking cursor. This is good. I can still access the tty, so I switch into one, type in my credentials, and log in. As all this is happening, I notice the CPU fans ramping up at a slow pace, and as I begin typing the command to delete my GPU driver, the fans get louder. So I pause for a bit to observe it. It kept ramping up, and because I didn't want my CPU to spontaneously combust (since that's kinda what started this whole saga in the first place), I chose to power off the machine and entered the command poweroff. The PC begins to power off, but it doesn't completely. The screen froze after printing some warnings/errors (which I didn't take a not of at the time). I thought it might have been a symptom of passing the acpi=off kernel parameter, which I knew nothing about at the time. I just pressed the reboot button on my PC's chassis, and it rebooted.

Now I know (do I really?) it may be a power management issue, so I log back into Windows to prod further. On a lot of forum/Reddit posts, people say it is a fundamental incompatibility between the hardware (motherboard) and the Linux kernel since the ACPI tables probably mismatch or don't parse correctly (?). They suggest contacting the motherboard manufacturer. So I did. I sent Gigabyte a support ticket with my findings so far in a very concise format.

Chapter 4 [ACPI]:
While I waited for a reply from Gigabyte, I thought I might as well try to narrow down the issue as best I could. So, I did some preliminary research and found two wiki pages on debugging Linux ACPI issues. First, I came across an Ubuntu wiki page dedicated to debugging ACPI issues.

The wiki provides instructions to try passing a series of kernel parameters in order to narrow down the issue. I tried all of them.

After going through the instructions on the Ubuntu wiki page, I also found an Arch wiki page on ACPI modules. Section 4.3, Boot-Looping, had some test parameters not found in the Ubuntu wiki.

Ubuntu wiki instruction results:
Kernel Parameter | Boot | Behaviour
acpi=off | Y | Display server doesn't load (Xorg) but tty is available
acpi=ht | N | Reboots
pci=noacpi | N | Screen freeze after Loading Linux linux ... Loading initial ramdisk ..., frozen until I manually reboot
acpi=noirq | N | Reboots
pnpacpi=off | N | Reboots
noapic | Y | Loads display manager, loads display server, loads window manager!!
nolapic | N | Screen freeze after Loading Linux linux ... Loading initial ramdisk ... but reboots

Arch wiki instruction results:
Kernel Parameter | Boot | Behaviour
acpi_osi="Windows 2015" | N | Reboots
processor.max_cstate=0 | N | Reboots
intel_idle.max_cstate=2 | Y | Loads display manager, loads display server, loads window manager, everything works!!
idle=nomwait | N | Reboots

In the tables above, you'll notice that in two cases, everything loaded. I shed a tear when I saw my default, ugly i3 status bar.

So, the two parameters that let me access my window manager (i3) were noapic and intel_idle.max_cstate=2. Between the two, and to the extent of my knowledge at the time, I decided to use intel_idle.max_cstate=2 for any further investigations, as it seemed less restrictive than noapic to me.
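For reference, a parameter like this can be made persistent instead of being typed at the GRUB prompt on every boot. Below is a minimal sketch in Python, assuming the usual GRUB layout where kernel parameters live in the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub; the function name add_kernel_param is made up for illustration, and the sketch only transforms text, so nothing is written to disk by it.

```python
import re

# Matches a line such as: GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 quiet"
_CMDLINE_RE = re.compile(
    r'^(GRUB_CMDLINE_LINUX_DEFAULT=)(["\'])(.*?)\2[ \t]*$',
    re.MULTILINE,
)


def add_kernel_param(grub_defaults, param):
    """Return the text of /etc/default/grub with `param` appended to
    GRUB_CMDLINE_LINUX_DEFAULT (hypothetical helper, text in / text out)."""

    def _append(match):
        prefix, quote, value = match.group(1), match.group(2), match.group(3)
        params = value.split()
        if param not in params:  # keep the edit idempotent
            params.append(param)
        return f"{prefix}{quote}{' '.join(params)}{quote}"

    new_text, count = _CMDLINE_RE.subn(_append, grub_defaults, count=1)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT line not found")
    return new_text
```

The result would still have to be written back as root and followed by regenerating the config (on Arch, typically grub-mkconfig -o /boot/grub/grub.cfg). Values that contain spaces, such as acpi_osi="Windows 2015", need extra quoting care that this sketch does not handle.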

Chapter 5 [C-States]:
It is now the next day. I check the Gigabyte support ticket I created, and there’s a reply.

Paraphrased - "Ubuntu support is not mentioned in the specifications, so we cannot offer assistance. Have a nice day."

Fair. Screw you, but fair. I can’t really place all the blame on Gigabyte... yet. Maybe all of this could have been easily avoided if I'd done careful research about flashing the BIOS on a dual-boot machine with Linux and Windows. Whatever... looks like Gigabyte won’t be helping me, and that's fair (no sarcasm).

I start looking into the kernel parameters that worked and I'm still trying to make the connection between noapic and intel_idle.max_cstate=2. Why are these the only parameters that let me boot into Arch? I don’t understand the relation (if any) between the two, and if you haven’t noticed yet, I’m a complete novice. I just follow existing instructions and I'm not very proficient at conducting independent investigations since I don’t understand kernel or hardware behavior. Maybe someday I might...

When I looked up C-States on Google, I came across a discussion on an overclocking subreddit. The post was about overclocking, but by reading it, I learned that C-States can be disabled in the BIOS and that doing so may only result in extra power consumption. Some users even mentioned that idle power consumption with C-States disabled was still super low and didn’t make a big difference. Besides, disabling them in the BIOS would mean I wouldn’t have to touch my GRUB config. Maybe setting intel_idle.max_cstate to a value greater than 2, as mentioned in the 4.3 Boot-Looping section on the Arch wiki page, might work, but I just disabled C-States in the BIOS instead. Arch booted without any extra parameters, and I was able to log in to my environment. In fact, this post was written in a browser within my Arch installation. I haven't noticed any odd CPU/GPU behavior so far. I didn’t end up rebuilding GRUB or the Nvidia driver(s).
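One way to check from inside Linux what the firmware setting changed is to look at the idle states the kernel's cpuidle framework exposes in sysfs. Here is a minimal sketch, assuming the standard layout under /sys/devices/system/cpu/cpuN/cpuidle/stateM/ (each state has a name file and a disable file); the function name read_cpuidle_states is made up for illustration.

```python
from pathlib import Path


def read_cpuidle_states(cpu_dir):
    """Return [(state_dir, name, disabled)] for one CPU directory,
    e.g. /sys/devices/system/cpu/cpu0. Empty list if cpuidle is absent."""
    base = Path(cpu_dir) / "cpuidle"
    if not base.is_dir():
        return []
    states = []
    # Sort numerically so state10 comes after state9, not after state1.
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        disable_file = state / "disable"
        disabled = disable_file.exists() and disable_file.read_text().strip() != "0"
        states.append((state.name, name, disabled))
    return states
```

Calling read_cpuidle_states("/sys/devices/system/cpu/cpu0") once with C-States enabled in the BIOS (booted via intel_idle.max_cstate=2) and once with them disabled would show which states the kernel actually sees in each case. Which states remain after disabling them in firmware depends on the board, so I would treat the output as a data point rather than a verdict.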

Disabling C-States in BIOS seems like the best option at the moment, but I just can’t shake this itch. I never had to disable them in the previous version of my BIOS, and everything worked just fine. I don’t know if I should compromise by permanently disabling C-States. Will disabling C-States, even with updated microcode and BIOS firmware, really help my CPU live longer like I initially wanted it to? Trying to protect my CPU was the whole reason I got into this mess, and now I'm disabling C-States without knowing if the microcode update from Intel even works to halt further damage. Is there a better option? I don’t have the answers; I was hoping someone here, who got this far down, might.

Epilogue [The End?]:
I could use some help in dividing the blame here. I choose to put myself at the top of the list since I faced absolutely no issues with the previous version of my BIOS firmware. I violated the cardinal rule of BIOS flashing: if it ain't broke, don't fix it. I am ready to take all the blame. If you think I should, I respect your choice.

However, as it stands (in descending order of rank):
  1. Me: for prematurely attempting to fix something
  2. Gigabyte: bad ACPI tables? I don't have the knowledge to verify this myself
  3. Intel: faulty CPUs, allegedly* trying to run out the warranty period on Raptor Lake i9s to avoid a recall
  4. Linux Kernel ACPI Subsystem: blasphemy, I know; this is just for the lols
  5. Windows 11: it's a dual-boot system so you never know xD Might even deserve the spot above Intel tbh

Feel free to rip me a new one.

If you notice something or have suggestions to further narrow down the issue, please let me know with any cautionary details wherever applicable. I can gather logs from the crash reboots if needed, I just don't know what to look for. I vaguely remember reading somewhere that you can pass kernel parameters to live installation media. If that's true, I could use that to chroot into my Arch installation and collect logs.
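As a starting point for what to look for, here is a minimal sketch that narrows a saved kernel log down to lines touching the subsystems discussed in this post. It assumes the log was saved as plain text, for example with journalctl -k -b -1 --no-pager for the previous boot (which only works if the journal is stored persistently, and an abrupt reset may lose the last lines). The keyword list is my own guess, not an established checklist.

```python
import re

# Guessed keywords for this particular problem; adjust as needed.
KEYWORDS = ("acpi", "apic", "intel_idle", "cpuidle", "mwait", "microcode")
_KEYWORD_RE = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def interesting_lines(log_text):
    """Return the lines of a kernel log that mention any of KEYWORDS."""
    return [line for line in log_text.splitlines() if _KEYWORD_RE.search(line)]
```

Comparing the filtered output of a boot that works (for example with intel_idle.max_cstate=2) against one that reboots might at least show how far the failing boot gets.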

Appendix:
⁽¹⁾ Ignored the microcode setup in Arch. Since the microcode in my case is updated by BIOS firmware, does Arch really need its own microcode setup?
⁽²⁾ There might be a chance that it's a GRUB config issue but why wouldn't Arch installation media or a live Ubuntu environment also abruptly reboot?
⁽³⁾ Can it be a GPU issue? I haven't rebuilt the drivers and I am able to boot into and use Arch after disabling C-States or passing one of the working kernel parameters.
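On footnote 1, the revision the running kernel actually sees can at least be checked, whichever component loaded it. This is a minimal sketch, assuming an x86 /proc/cpuinfo where each logical CPU has a "microcode" line with a hexadecimal value; the function name microcode_revisions is made up for illustration.

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions (as ints) found in the
    text of /proc/cpuinfo. More than one element would mean the
    logical CPUs do not all report the same revision."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(int(value.strip(), 16))  # accepts "0x125" style values
    return revisions
```

Usage would be along the lines of microcode_revisions(open("/proc/cpuinfo").read()). If the "version 125" mentioned in the preface corresponds to revision 0x125 (an assumption on my part), the expected result is a set containing only 0x125.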

The GRUB menu entry to load Arch Linux (without any added kernel parameters):
Code:
setparams 'Arch Linux'
    load_video
    set gfxpayload=keep
    insmod gzio
    insmod part_gpt
    insmod ext2
    search --no-floppy --fs-uuid --set=root %rootUUID%
    echo    'Loading Linux linux ...'
    linux   /vmlinuz-linux-zen root=/dev/mapper/volgroup0-lv_root rw loglevel=3 cryptdevice=/dev/nvme0n1p3:volgroup0 nvidia-drm.modeset=1 quiet
    echo    'Loading initial ramdisk ...'
    initrd  /initramfs-linux.img

Links:
  1. The issue with Intel CPUs that set this in motion - here
  2. The Gigabyte press release containing some information on the BIOS firmware update - here (I tried enabling CEP, it didn't work)
  3. The Arch wiki page on microcode - here
  4. The Ubuntu wiki page on debugging ACPI issues - here
  5. The Arch wiki page on ACPI modules - here
  6. The BIOS setup manual for my motherboard - here

Here are the names of all the files that were dumped when I ran acpidump -b:
Code:
apic.dat  dbgp.dat  facs.dat  hpet.dat  mcfg.dat  ssdt10.dat  ssdt13.dat  ssdt16.dat  ssdt19.dat  ssdt2.dat  ssdt5.dat  ssdt8.dat  wpbt.dat
bgrt.dat  dsdt.dat  fidt.dat  hwin.dat  nhlt.dat  ssdt11.dat  ssdt14.dat  ssdt17.dat  ssdt1.dat   ssdt3.dat  ssdt6.dat  ssdt9.dat  wsmt.dat
dbg2.dat  facp.dat  fpdt.dat  lpit.dat  phat.dat  ssdt12.dat  ssdt15.dat  ssdt18.dat  ssdt20.dat  ssdt4.dat  ssdt7.dat  tpm2.dat
If you are interested in any specific .dat file, let me know and I can provide the disassembled output.

Since the noapic parameter also seemed to work, here is the raw table data from iasl -d on apic.dat:
Code:
    0000: 41 50 49 43 DC 01 00 00 05 A0 41 4C 41 53 4B 41  // APIC......ALASKA
    0010: 41 20 4D 20 49 20 00 00 09 20 07 01 41 4D 49 20  // A M I ... ..AMI
    0020: 13 00 00 01 00 00 E0 FE 01 00 00 00 00 08 00 00  // ................
    0030: 01 00 00 00 00 08 01 01 01 00 00 00 00 08 02 08  // ................
    0040: 01 00 00 00 00 08 03 09 01 00 00 00 00 08 04 10  // ................
    0050: 01 00 00 00 00 08 05 11 01 00 00 00 00 08 06 18  // ................
    0060: 01 00 00 00 00 08 07 19 01 00 00 00 00 08 08 20  // ...............
    0070: 01 00 00 00 00 08 09 21 01 00 00 00 00 08 0A 28  // .......!.......(
    0080: 01 00 00 00 00 08 0B 29 01 00 00 00 00 08 0C 30  // .......).......0
    0090: 01 00 00 00 00 08 0D 31 01 00 00 00 00 08 0E 38  // .......1.......8
    00A0: 01 00 00 00 00 08 0F 39 01 00 00 00 00 08 10 40  // .......9.......@
    00B0: 01 00 00 00 00 08 11 42 01 00 00 00 00 08 12 44  // .......B.......D
    00C0: 01 00 00 00 00 08 13 46 01 00 00 00 00 08 14 48  // .......F.......H
    00D0: 01 00 00 00 00 08 15 4A 01 00 00 00 00 08 16 4C  // .......J.......L
    00E0: 01 00 00 00 00 08 17 4E 01 00 00 00 00 08 18 50  // .......N.......P
    00F0: 01 00 00 00 00 08 19 52 01 00 00 00 00 08 1A 54  // .......R.......T
    0100: 01 00 00 00 00 08 1B 56 01 00 00 00 00 08 1C 58  // .......V.......X
    0110: 01 00 00 00 00 08 1D 5A 01 00 00 00 00 08 1E 5C  // .......Z.......\
    0120: 01 00 00 00 00 08 1F 5E 01 00 00 00 01 0C 02 00  // .......^........
    0130: 00 00 C0 FE 00 00 00 00 02 0A 00 00 02 00 00 00  // ................
    0140: 00 00 02 0A 00 09 09 00 00 00 0D 00 04 06 01 05  // ................
    0150: 00 01 04 06 02 05 00 01 04 06 03 05 00 01 04 06  // ................
    0160: 04 05 00 01 04 06 05 05 00 01 04 06 06 05 00 01  // ................
    0170: 04 06 07 05 00 01 04 06 08 05 00 01 04 06 09 05  // ................
    0180: 00 01 04 06 0A 05 00 01 04 06 0B 05 00 01 04 06  // ................
    0190: 0C 05 00 01 04 06 0D 05 00 01 04 06 0E 05 00 01  // ................
    01A0: 04 06 0F 05 00 01 04 06 10 05 00 01 04 06 11 05  // ................
    01B0: 00 01 04 06 12 05 00 01 04 06 13 05 00 01 04 06  // ................
    01C0: 14 05 00 01 04 06 15 05 00 01 04 06 16 05 00 01  // ................
    01D0: 04 06 17 05 00 01 04 06 00 05 00 01              // ............
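For anyone who would rather read this table as structured entries than as hex, here is a minimal sketch of a decoder for the binary apic.dat. It assumes the MADT layout from the ACPI specification: a 36-byte table header, a 4-byte local APIC address, 4 bytes of flags, then a sequence of entries that each start with a type byte and a length byte. Only the four entry types that seem to appear in the dump above are decoded; the function name parse_madt is made up for illustration.

```python
import struct

MADT_HEADER_SIZE = 44  # 36-byte ACPI header + local APIC address + flags

ENTRY_NAMES = {
    0: "Processor Local APIC",
    1: "I/O APIC",
    2: "Interrupt Source Override",
    4: "Local APIC NMI",
}


def parse_madt(data):
    """Decode raw MADT bytes into (local_apic_address, flags, entries)."""
    if len(data) < MADT_HEADER_SIZE or data[:4] != b"APIC":
        raise ValueError("not an APIC (MADT) table")
    (table_length,) = struct.unpack_from("<I", data, 4)
    local_apic_address, flags = struct.unpack_from("<II", data, 36)

    entries = []
    offset = MADT_HEADER_SIZE
    end = min(table_length, len(data))
    while offset + 2 <= end:
        entry_type, entry_length = data[offset], data[offset + 1]
        if entry_length < 2 or offset + entry_length > end:
            break  # malformed or truncated entry: stop rather than guess
        body = data[offset + 2:offset + entry_length]
        entry = {"offset": offset, "type": entry_type,
                 "name": ENTRY_NAMES.get(entry_type, "not decoded here")}
        if entry_type == 0 and entry_length == 8:
            uid, apic_id, lapic_flags = struct.unpack("<BBI", body)
            entry.update(uid=uid, apic_id=apic_id, enabled=bool(lapic_flags & 1))
        elif entry_type == 1 and entry_length == 12:
            ioapic_id, _reserved, address, gsi_base = struct.unpack("<BBII", body)
            entry.update(ioapic_id=ioapic_id, address=address, gsi_base=gsi_base)
        elif entry_type == 2 and entry_length == 10:
            bus, source_irq, gsi, mps_flags = struct.unpack("<BBIH", body)
            entry.update(bus=bus, source_irq=source_irq, gsi=gsi, flags=mps_flags)
        elif entry_type == 4 and entry_length == 6:
            uid, mps_flags, lint = struct.unpack("<BHB", body)
            entry.update(uid=uid, flags=mps_flags, lint=lint)
        entries.append(entry)
        offset += entry_length
    return local_apic_address, flags, entries
```

It would be run on the binary file from acpidump -b, e.g. parse_madt(open("apic.dat", "rb").read()), not on the hex text above. Running it on dumps taken under the old and new BIOS versions (if an old dump is still obtainable) and diffing the results is one way to see whether the firmware update changed this table at all.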

Things left to try:
  • Rebuild GRUB
  • Rebuild nvidia drivers
  • Every single troubleshooting method on the Arch wiki page on ACPI modules
  • Steps on the DSDT page on Arch wiki
  • Rule out any devices causing the issue (wireless, sound, etc.)
  • ???
  • Become an expert on ACPI subsystem, learn the standard specification, find the root cause of my issue, build a custom kernel

On the ACPI issues, there's a long history of problems in Linux. Basically, the Linux kernel implements the standardised industry specification, and the gory details can be found here:

https://www.kernel.org/doc/ols/2005/ols2005v1-pages-59-76.pdf

The major problem is that motherboard/BIOS vendors do not implement the specification as intended, but rather just use what they need from it to get their hardware up. Hence there are cases where the kernel notices what's missing, which often shows up as errors during boot. Most of these errors are harmless for the functioning of the machine, but sometimes the issues run deeper and disturb functionality. Since, in your description, the kernel parameter acpi=off enabled a proper boot, the ACPI implementation may be implicated in the problems.

In the kernel docs the devs comment that if there are issues with acpi:
Complain to your platform/BIOS vendor if you find a bug which is so severe that a workaround is not accepted in the Linux kernel.
See here: https://www.kernel.org/doc/Documentation/admin-guide/acpi/initrd_table_override.rst
Of course, such complaint may be a forlorn hope, as you found with your experience with Gigabyte.

Nevertheless, in that kernel doc there is also a rundown on how one might "Disassemble, modify and recompile" ACPI tables. Not for the faint-hearted, and not something I've needed to do.

Perhaps some stress testing of the cpu may reveal something useful. See here:

I have no experience of the "relay switching sound" on a PSU, so can't say anything about that.
 

