Troubleshooting random system shutdowns

kelltech (Member, joined Aug 15, 2021)
Hello!

I've asked in a few different communities and while most are helpful, the results have been really limited. I should say first that I've been using Linux for about 6 years, but to troubleshoot deeply or to understand logs is beyond my ability.

My laptop has been randomly shutting down during gameplay using Lutris. At first I suspected temperature, so I searched for and applied everything I could to test that, and I don't think my laptop is overheating, unless it's happening so fast I can't catch it. My CPU runs in the 70C range while gaming but I've seen it reach 80C. GPU runs cooler, usually 65C-70C.
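
One way to catch a spike that is too fast to see on screen is to log the sensor readings to disk about once a second and read the last entries after the next shutdown. A minimal sketch, assuming the kernel exposes temperatures in millidegrees Celsius under /sys/class/thermal (zone names and count vary by machine); the function names and the log path are made up for illustration:

```shell
#!/bin/sh
# Convert a sysfs reading in millidegrees Celsius to degrees.
millideg_to_c() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

# Append one timestamped line per thermal zone to the given log file,
# then sync so the last readings are more likely to survive a power cut.
# The sysfs layout is an assumption; zones differ between machines.
log_temps_once() {
    for zone in /sys/class/thermal/thermal_zone*/temp; do
        [ -r "$zone" ] || continue
        printf '%s %s %s\n' "$(date +%T)" "$zone" "$(millideg_to_c "$(cat "$zone")")"
    done >> "$1"
    sync
}

# Usage (not run here): sample once per second while gaming
#   while true; do log_temps_once "$HOME/temps.log"; sleep 1; done
```

If the last lines in the log before a shutdown still show normal temperatures, overheating becomes a less likely explanation, though a sensor that is not exposed here could still be involved.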

When this began I suspected Fedora, because I had just switched from 2 years on Manjaro where it didn't happen, but moving from Fedora to Pop_OS did not resolve the random shutdowns. Now I'm suspecting hardware failure somewhere, or a faulty sensor, but of course I'm guessing, because perhaps it's Gnome, or an extension, or some rogue app.

That's where I hope to find help: is there a way to understand what could be causing this? I do realize that it could be thousands of different things, but I'd like to learn where to start looking to find out.

Along the way, I did discover a helpful article that showed this command:
sudo grep -iv ': starting\|kernel: .*: Power Button\|watching system buttons\|Stopped Cleaning Up\|Started Crash recovery kernel' \
  /var/log/messages /var/log/syslog /var/log/apcupsd* \
  | grep -iw 'recover[a-z]*\|power[a-z]*\|shut[a-z ]*down\|rsyslogd\|ups'

Which shows this:
/var/log/syslog:Aug 13 17:49:14 pop-os system76-power[800]: [INFO] DBUS Received GetSwitchable() method
/var/log/syslog:Aug 13 17:49:14 pop-os system76-power[800]: [INFO] DBUS Received GetGraphics() method
/var/log/syslog:Aug 13 17:49:14 pop-os system76-power[800]: [INFO] DBUS Received GetProfile() method
/var/log/syslog:Aug 13 17:49:14 pop-os gnome-shell[2195]: gnome-shell-extension-system76-power: power profile was set: 'Balanced'
/var/log/syslog:Aug 13 17:49:14 pop-os systemd[1523]: Started GNOME power management service.
/var/log/syslog:Aug 13 17:49:14 pop-os systemd[1523]: Reached target GNOME power management target.
/var/log/syslog:Aug 13 17:49:17 pop-os gsd-power[1388]: gsd-power: Fatal IO error 11 (Resource temporarily unavailable) on X server :0.

Having no real understanding, I wonder if I'm on the right track and seeing clues, and are there ways to reveal what 'error 11' is and what could trigger it? I'd like to ask for help digging deeper, but also to learn along the way about troubleshooting issues. Especially if it reveals some hardware failure, motherboard, sensors, whatever the case may be.

Thanks for reading! :)

My inxi is attached as text document and my laptop is a Clevo N850HK1 / Sager NP6852.
 

Attachments: inxi_15Aug2021.txt (2.5 KB)


So, what exactly is the distro you're on right now? Pop OS? Fedora? Manjaro? Just a guess, but this line might be the culprit: gnome-shell-extension-system76-power: power profile was set: 'Balanced'. Change the power profile to "performance" and see whether the issue still persists.
EDIT: Ok, I just noticed it seems the distro is Pop_OS, is it?
 
It is Pop_OS at the moment, yes. I'll change that setting and restart, but the issue did occur several times on Fedora which is how I landed on Pop.
 
What is the output of
Code:
journalctl
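
Since the full journal is long, one way to narrow it down is to look only at the tail of the previous boot: after a clean shutdown the journal normally ends with shutdown messages, while after a sudden power cut it simply stops mid-stream. A rough sketch, assuming a journal that persists across reboots; the helper name `ended_cleanly` and the 'Journal stopped' marker are assumptions to check against a known clean shutdown on your own machine:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the log text on stdin ends like a clean
# shutdown. The marker string is an assumption; verify it on your system.
# Optional argument: how many trailing lines to inspect (default 20).
ended_cleanly() {
    tail -n "${1:-20}" | grep 'Journal stopped' > /dev/null
}

# Usage on a real system (not run here):
#   journalctl --list-boots          # list recorded boots and their offsets
#   journalctl -b -1 -n 50           # last 50 lines of the previous boot
#   journalctl -b -1 --no-pager | ended_cleanly && echo clean || echo abrupt
```

An "abrupt" result for the boot that ended in the shutdown would fit a power or hardware problem better than a software-initiated shutdown, but it is a hint, not proof.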

Code:
$ journalctl
-- Journal begins at Sun 2021-06-20 12:16:40 CEST, ends at Sun 2021-08-15 15:19:27 CEST. --
Jun 20 12:16:40 pop-os kernel: microcode: microcode updated early to revision 0xea, date = 2021-01-05
Jun 20 12:16:40 pop-os kernel: Linux version 5.11.0-7614-generic (buildd@lgw01-amd64-047) (gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0, GNU ld (GNU Binutils for Ubuntu) 2.35.1) #15~1622578982~20.1>
Jun 20 12:16:40 pop-os kernel: Command line: initrd=\EFI\Pop_OS-838d39a4-eada-4007-b94c-0947720e1d48\initrd.img root=UUID=838d39a4-eada-4007-b94c-0947720e1d48 ro quiet loglevel=0 systemd.sh>
Jun 20 12:16:40 pop-os kernel: KERNEL supported cpus:
Jun 20 12:16:40 pop-os kernel:   Intel GenuineIntel
Jun 20 12:16:40 pop-os kernel:   AMD AuthenticAMD
Jun 20 12:16:40 pop-os kernel:   Hygon HygonGenuine
Jun 20 12:16:40 pop-os kernel:   Centaur CentaurHauls
Jun 20 12:16:40 pop-os kernel:   zhaoxin   Shanghai 
Jun 20 12:16:40 pop-os kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jun 20 12:16:40 pop-os kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jun 20 12:16:40 pop-os kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jun 20 12:16:40 pop-os kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
Jun 20 12:16:40 pop-os kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
Jun 20 12:16:40 pop-os kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Jun 20 12:16:40 pop-os kernel: x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
Jun 20 12:16:40 pop-os kernel: x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
Jun 20 12:16:40 pop-os kernel: x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
Jun 20 12:16:40 pop-os kernel: BIOS-provided physical RAM map:
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x0000000000000000-0x0000000000057fff] usable
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x0000000000058000-0x0000000000058fff] reserved
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x0000000000059000-0x000000000009dfff] usable
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x000000000009e000-0x00000000000fffff] reserved
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000065e4afff] usable
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x0000000065e4b000-0x0000000065e4bfff] ACPI NVS
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x0000000065e4c000-0x0000000065e4cfff] reserved
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x0000000065e4d000-0x000000006a073fff] usable
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x000000006a074000-0x000000006ab55fff] reserved
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x000000006ab56000-0x000000006abf2fff] usable
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x000000006abf3000-0x000000006afc9fff] ACPI NVS
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x000000006afca000-0x000000006b3fdfff] reserved
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x000000006b3fe000-0x000000006b3fefff] usable
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x000000006b3ff000-0x000000006fffffff] reserved
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
Jun 20 12:16:40 pop-os kernel: BIOS-e820: [mem 0x0000000100000000-0x000000048effffff] usable
Jun 20 12:16:40 pop-os kernel: NX (Execute Disable) protection: active
Jun 20 12:16:40 pop-os kernel: e820: update [mem 0x5ecff018-0x5ed27057] usable ==> usable
Jun 20 12:16:40 pop-os kernel: e820: update [mem 0x5ecff018-0x5ed27057] usable ==> usable
Jun 20 12:16:40 pop-os kernel: e820: update [mem 0x5ecee018-0x5ecfe057] usable ==> usable
Jun 20 12:16:40 pop-os kernel: e820: update [mem 0x5ecee018-0x5ecfe057] usable ==> usable
Jun 20 12:16:40 pop-os kernel: e820: update [mem 0x5ece0018-0x5eced857] usable ==> usable
Jun 20 12:16:40 pop-os kernel: e820: update [mem 0x5ece0018-0x5eced857] usable ==> usable
Jun 20 12:16:40 pop-os kernel: extended physical RAM map:
 
I think balanced is the default profile, so try changing that to performance.
 
True, balanced is default, and after a restart it reverts back to balanced lol! It's on performance now, I'll fire up some games in a while and see if it shuts down again.
 
This may be an inane question but troubleshooting often begins with eliminating the obvious: are you sure the machine is actually "shutting down"? Or is the screen just 'blanking'?

Does this happen on either battery supply or AC power? Both?

If the issue occurs among various distros I would be inclined to suspect hardware.
 
It's good to eliminate the obvious, otherwise I tend to get lost in chasing what-ifs. :)

It is a full shutdown, as if power was suddenly cut. Possible it could be a drop into hibernate or something, but all lights (power button, keyboard, screen) are suddenly off and no mouse or keyboard interaction is available, only power back on.

It's on AC power, and I'm also suspecting hardware, but how to know what is failing is the needle in this haystack. :oops:
 
I am not seeing anything unusual in your journalctl output - does it shut down when you are not playing games or only when you play?
I am suspecting a power supply issue or a cooling fan issue
 
So far it's only happened during play, and has happened with more than one game so that was easy to rule out. :)

If power supply, do you mean the unit that plugs into the wall or something internal?
 
The one that plugs into the wall. When your machine shuts down, feel the supply: if it feels warmer than when you are not gaming, that is a good indicator it may be getting weak and is not putting out the required power at high use such as gaming, which is causing the shutdown. If you have a voltmeter you can check the output, but that does not tell you whether the current drops off under load. I have seen laptops where the supply's output voltage was fine but the current it could deliver degraded under load.

You can put a variable load on it, and put the meter on amps in series with the load. But you really need a second meter that is measuring the voltage at the same time, so you can confirm that it can sustain 19V at the rated current.
But to do this, you need a resistive load that is of the correct resistance to draw 3.95A at 19V, and withstand the power dissipation.

R = V/I

19/3.95 = 4.81 ohms

P = VI

19 x 3.95 = 75.05 W

But that is usually way more than anybody wants to do
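
For anyone who does want to try it, the arithmetic for the test load can be sketched as below. The 19 V and 3.95 A figures are the ones quoted above; check the label on your own power brick. The function name `load_for` is made up for illustration:

```shell
#!/bin/sh
# Print the load resistance (ohms) and dissipated power (watts) for a
# supply rated at V volts and I amps. Function name is illustrative only.
load_for() {
    awk -v v="$1" -v i="$2" 'BEGIN { printf "%.2f ohms, %.2f W\n", v / i, v * i }'
}

load_for 19 3.95   # prints: 4.81 ohms, 75.05 W
```

So the load has to be roughly 4.8 ohms and able to dissipate about 75 W continuously, which is far more than an ordinary small resistor can handle.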
 
I might be able to do that with my father-in-law's help; he's an electrician and has all that stuff. I'll see him next month, but until then I'll go with the power supply heat indicator. :p
 
True, balanced is default, and after a restart it reverts back to balanced lol! It's on performance now, I'll fire up some games in a while and see if it shuts down again.
Did it work? Or does it just keep shutting down?
 
Hasn't shut down yet, but it's quite random. Sometimes it happens 2 or 3 times in an hour, sometimes not for an entire day. It's just been a weird one to track down. :confused:
 
You didn't exactly define what you specifically mean by "shutting down". A hard stop / power off?

Is it possible that the software you are running is causing a kernel fault? That can cause a system to just stop functioning.

Definitely run updates on everything.
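
To check for a kernel fault, the kernel messages from the boot that ended in the power-off can be filtered for the usual fault signatures. A sketch only: the function name is made up, and the pattern list is an assumption covering common kernel messages (oops, BUG, panic, machine check), not an exhaustive one:

```shell
#!/bin/sh
# Hypothetical filter: print lines from stdin that look like kernel faults.
# The patterns are assumptions based on common kernel messages.
kernel_fault_lines() {
    grep -E 'Oops|BUG:|Kernel panic|Hardware Error|Machine check'
}

# Usage on a real system (not run here):
#   journalctl -k -b -1 --no-pager | kernel_fault_lines
```

An empty result does not rule out a kernel fault, since a panic may never make it to disk before the machine stops.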
 

Definitely a sudden, hard power off preceded by some video lag/freeze, but temperature well within normal.

#9:
It is a full shutdown, as if power was suddenly cut. Possible it could be a drop into hibernate or something, but all lights (power button, keyboard, screen) are suddenly off and no mouse or keyboard interaction is available, only power back on.

It's on AC power, and I'm also suspecting hardware, but how to know what is failing is the needle in this haystack.

As far as I can tell, every possible update I can find is done.
 
You know, I just remembered having a similar issue a couple of years ago, in my case it was the CPU fan cooler that was malfunctioning. Maybe check that, who knows ...
 
The one that plugs into the wall - when your machine shuts down feel the supply does it feel warmer then when not gaming it is a good indicator it maybe getting weak and is not putting out the required power at high use...
This served as a reminder today. I was at work and had my laptop plugged in there with virtually no heat coming from the power supply at all. I thought perhaps the power strip at home is a culprit. Testing this theory, I have it plugged into the wall instead of the strip now and power supply is cool. I haven't had a chance to fire up a game and see what happens, though hopefully tomorrow I can do that.
 

Electronic loads are esoteric equipment for most people (even Linux guys). I have an electronics bench with a 300 W Kunkin KL283 electronic load, though that doesn't help him.

If it's possible to open up the power supply's power brick, you can look for electrolytic caps that are bulging. Even better if you have an ESR meter.

Another option might be a wall socket electricity usage monitor and watch for erratic behavior.
