Help Request: intermittent amdgpu_device_ip_init failed when booting Pop_OS 22.04

ads103

I have a system with an AMD Ryzen 5 7600 CPU and a Sapphire Pulse AMD RX 7600 GPU. The system runs Pop_OS 22.04 with kernel 6.9.3-76060903-generic. The motherboard is a Gigabyte A620I AX with BIOS version F31b.

Intermittently, during bootup, Pop_OS fails to initialize the GPU. Visually, the system either seems to hang at the motherboard manufacturer's branded screen with the BIOS and boot menu hotkeys, or the screen remains black with no HDMI video output. Pop_OS continues to load, though, and I can SSH into it.

This issue is similar to, but not exactly the same as, the one found here a few years ago. In post #5 of that thread, the OP seemed to be missing some firmware, but that firmware is present on my system:
Code:
root@pop-os:~# ls -l /lib/firmware/amdgpu/navy_flounder_sos.bin
-rw-r--r-- 1 root root 218608 Jun 11 03:41 /lib/firmware/amdgpu/navy_flounder_sos.bin

I'll emphasize that the issue is intermittent. Four-fifths of bootup attempts are ordinary and successful: the system boots, I see the login screen, I log in, and I get 65 fps in Cyberpunk 2077 with the help of Proton on Steam. At idle, the system draws about 70 watts. But when the system fails to initialize the GPU, its idle draw with a black screen (or the motherboard manufacturer's POST screen) is two hundred watts!

This is dmesg after a successful bootup:
Code:
root@pop-os:~# dmesg | grep amdgpu
[    6.526511] [drm] amdgpu kernel modesetting enabled.
[    6.526619] amdgpu: Virtual CRAT table created for CPU
[    6.526630] amdgpu: Topology: Add CPU node
[    6.526732] amdgpu 0000:03:00.0: enabling device (0000 -> 0003)
[    6.562906] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    6.562908] amdgpu: ATOM BIOS: 113-4481LHS-UC1
[    6.563640] amdgpu 0000:03:00.0: amdgpu: CP RS64 enable
[    6.564501] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_early_init [amdgpu]] JPEG decode is enabled in VM mode
[    6.564942] amdgpu 0000:03:00.0: vgaarb: deactivate vga console
[    6.564944] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    6.564991] amdgpu 0000:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[    6.564993] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
[    6.565075] [drm] amdgpu: 8176M of VRAM memory ready
[    6.565076] [drm] amdgpu: 7799M of GTT memory ready.
[    6.565599] amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
[    6.622572] amdgpu 0000:03:00.0: amdgpu: reserve 0x1300000 from 0x81fc000000 for PSP TMR
[    6.716471] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    6.723981] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[    6.723983] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[    6.724019] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x00000035, smu fw if version = 0x00000040, smu fw program = 0, smu fw version = 0x00525b00 (82.91.0)
[    6.724021] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[    6.796725] amdgpu 0000:03:00.0: amdgpu: SMU is initialized successfully!
[    6.853594] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[    6.888020] amdgpu: HMM registered 8176MB device memory
[    6.888873] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    6.888884] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[    6.888909] amdgpu: Virtual CRAT table created for GPU
[    6.889006] amdgpu: Topology: Add dGPU node [0x7480:0x1002]
[    6.889008] kfd kfd: amdgpu: added device 1002:7480
[    6.889021] amdgpu 0000:03:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 8, active_cu_number 32
[    6.889025] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    6.889027] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    6.889028] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    6.889029] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[    6.889030] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[    6.889031] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[    6.889033] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[    6.889034] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[    6.889035] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[    6.889036] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[    6.889037] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[    6.889038] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[    6.889040] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[    6.889041] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
[    6.892603] amdgpu 0000:03:00.0: amdgpu: Using BACO for runtime pm
[    6.893087] [drm] Initialized amdgpu 3.57.0 20150101 for 0000:03:00.0 on minor 0
[    6.899907] fbcon: amdgpudrmfb (fb0) is primary device
[    6.899911] amdgpu 0000:03:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[    8.213974] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])

This is diagnostic info gathered after a failed bootup:

Code:
root@pop-os:~# lshw -c video
  *-display UNCLAIMED       
       description: VGA compatible controller
       product: Advanced Micro Devices, Inc. [AMD/ATI]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:03:00.0
       version: cf
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi vga_controller bus_master cap_list
       configuration: latency=0
       resources: iomemory:fa0-f9f iomemory:fc0-fbf memory:fa00000000-fbffffffff memory:fc00000000-fc0fffffff ioport:f000(size=256) memory:f6b00000-f6bfffff memory:f6c00000-f6c1ffff

Code:
root@pop-os:~# lsmod | grep amd
amd_atl                53248  1
edac_mce_amd           28672  0
kvm_amd               208896  0
kvm                  1417216  1 kvm_amd
ccp                   155648  1 kvm_amd
amdgpu              17563648  0
amdxcp                 12288  1 amdgpu
drm_exec               12288  1 amdgpu
gpu_sched              61440  1 amdgpu
drm_buddy              20480  1 amdgpu
i2c_algo_bit           16384  1 amdgpu
drm_suballoc_helper    16384  1 amdgpu
drm_ttm_helper         12288  1 amdgpu
ttm                   110592  2 amdgpu,drm_ttm_helper
drm_display_helper    266240  1 amdgpu
video                  73728  1 amdgpu

Code:
root@pop-os:~# dmesg | grep -i amdgpu
[    6.405158] [drm] amdgpu kernel modesetting enabled.
[    6.405272] amdgpu: Virtual CRAT table created for CPU
[    6.405287] amdgpu: Topology: Add CPU node
[    6.405381] amdgpu 0000:03:00.0: enabling device (0006 -> 0007)
[    6.409253] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from VFCT
[    6.409256] amdgpu: ATOM BIOS: 113-4481LHS-UC1
[    6.409997] amdgpu 0000:03:00.0: amdgpu: CP RS64 enable
[    6.410843] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_early_init [amdgpu]] JPEG decode is enabled in VM mode
[    6.436090] amdgpu 0000:03:00.0: vgaarb: deactivate vga console
[    6.436094] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    6.436137] amdgpu 0000:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[    6.436139] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
[    6.436222] [drm] amdgpu: 8176M of VRAM memory ready
[    6.436223] [drm] amdgpu: 7799M of GTT memory ready.
[    6.436740] amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
[    6.493479] amdgpu 0000:03:00.0: amdgpu: reserve 0x1300000 from 0x81fc000000 for PSP TMR
[    6.587687] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    6.595203] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[    6.595205] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[    6.595233] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x00000035, smu fw if version = 0x00000040, smu fw program = 0, smu fw version = 0x00525b00 (82.91.0)
[    6.595235] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[    6.684669] amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:6 param:0x00000000 message:EnableAllSmuFeatures?
[    6.684672] amdgpu 0000:03:00.0: amdgpu: Failed to enable requested dpm features!
[    6.684673] amdgpu 0000:03:00.0: amdgpu: Failed to setup smc hw!
[    6.684674] [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <smu> failed -121
[    6.684812] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[    6.684813] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[    6.684815] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[    6.684888] WARNING: CPU: 1 PID: 188 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:622 amdgpu_irq_put+0x9f/0xb0 [amdgpu]
[    6.685043] Modules linked in: hid_logitech_dj(+) hid_generic usbhid hid amdgpu(+) amdxcp drm_exec gpu_sched drm_buddy crct10dif_pclmul i2c_algo_bit crc32_pclmul drm_suballoc_helper polyval_clmulni drm_ttm_helper polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 ttm nvme drm_display_helper ahci i2c_piix4 xhci_pci nvme_core r8169 libahci cec xhci_pci_renesas realtek nvme_auth rc_core video wmi aesni_intel crypto_simd cryptd
[    6.685071] RIP: 0010:amdgpu_irq_put+0x9f/0xb0 [amdgpu]
[    6.685228]  ? amdgpu_irq_put+0x9f/0xb0 [amdgpu]
[    6.685372]  ? amdgpu_irq_put+0x9f/0xb0 [amdgpu]
[    6.685499]  amdgpu_fence_driver_hw_fini+0x11f/0x170 [amdgpu]
[    6.685632]  amdgpu_device_fini_hw+0xb3/0x250 [amdgpu]
[    6.685761]  amdgpu_driver_unload_kms+0x4b/0x70 [amdgpu]
[    6.685888]  amdgpu_driver_load_kms+0xf9/0x1c0 [amdgpu]
[    6.686014]  amdgpu_pci_probe+0x1bb/0x5d0 [amdgpu]
[    6.686171]  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
[    6.686298]  amdgpu_init+0x69/0xff0 [amdgpu]
[    6.686539] WARNING: CPU: 1 PID: 188 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:622 amdgpu_irq_put+0x9f/0xb0 [amdgpu]
[    6.686679] Modules linked in: hid_logitech_dj(+) hid_generic usbhid hid amdgpu(+) amdxcp drm_exec gpu_sched drm_buddy crct10dif_pclmul i2c_algo_bit crc32_pclmul drm_suballoc_helper polyval_clmulni drm_ttm_helper polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 ttm nvme drm_display_helper ahci i2c_piix4 xhci_pci nvme_core r8169 libahci cec xhci_pci_renesas realtek nvme_auth rc_core video wmi aesni_intel crypto_simd cryptd
[    6.686704] RIP: 0010:amdgpu_irq_put+0x9f/0xb0 [amdgpu]
[    6.686849]  ? amdgpu_irq_put+0x9f/0xb0 [amdgpu]
[    6.686989]  ? amdgpu_irq_put+0x9f/0xb0 [amdgpu]
[    6.687113]  amdgpu_fence_driver_hw_fini+0x11f/0x170 [amdgpu]
[    6.687247]  amdgpu_device_fini_hw+0xb3/0x250 [amdgpu]
[    6.687373]  amdgpu_driver_unload_kms+0x4b/0x70 [amdgpu]
[    6.687494]  amdgpu_driver_load_kms+0xf9/0x1c0 [amdgpu]
[    6.687614]  amdgpu_pci_probe+0x1bb/0x5d0 [amdgpu]
[    6.687763]  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
[    6.687885]  amdgpu_init+0x69/0xff0 [amdgpu]

I'm quite unsure where to go from here, so I'd be grateful for any help y'all can offer
 


[ 6.796725] amdgpu 0000:03:00.0: amdgpu: SMU is initialized successfully!

[ 6.684672] amdgpu 0000:03:00.0: amdgpu: Failed to enable requested dpm features!
[ 6.684673] amdgpu 0000:03:00.0: amdgpu: Failed to setup smc hw!
[ 6.684674] [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <smu> failed -121
[ 6.684812] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[ 6.684813] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init


The first quote above is where success is reported.
The second quote is where the GPU failed.
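
To tell the two cases apart quickly over SSH, something like the following could work. It is only a sketch: the function name classify_amdgpu_boot is made up for illustration, and the two marker strings are simply copied from the dmesg output in this thread, so other kernels or cards may word them differently.
Code:
```shell
#!/bin/sh
# classify_amdgpu_boot: read kernel log text on stdin and print
# "failed", "ok" or "unknown".
# Assumption: the marker strings below match the log lines quoted in
# this thread; they are not guaranteed to be stable across kernels.
classify_amdgpu_boot() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'amdgpu_device_ip_init failed'; then
        echo failed
    elif printf '%s\n' "$log" | grep -q 'SMU is initialized successfully'; then
        echo ok
    else
        echo unknown
    fi
}

# Typical use (needs permission to read the kernel log):
#   dmesg | classify_amdgpu_boot
```
Logging the result on every boot would give a count of good and bad boots, which helps when judging whether a change actually made a difference.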

In the past I've had this problem with a similar AMD GPU bought from a large online seller whose name is easy to imagine ... a forest-like name. Unfortunately, in the end, after trying quite a few things, I abandoned the card and bought one locally, known to be new, and all was well. Notwithstanding that experience, here are some suggestions, which are best tried one at a time in the first instance:

Since the first failure is in enabling the DPM (dynamic power management) features, one could try turning DPM off with the kernel parameter:
amdgpu.dpm=0

Another kernel parameter for turning off DPM features, which I have not tried but have read about, is:
amdgpu.ppfeaturemask=0xffffb

Since it may be a power management issue, one could also try turning off PCIe Active State Power Management (ASPM) with the kernel parameter:
amdgpu.aspm=0
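
For trying these one at a time, here is a rough sketch of how an option could be added and then verified after a reboot. The helper name has_kernel_param is hypothetical, and the kernelstub lines in the comments assume a stock Pop_OS install that boots through systemd-boot; on a GRUB-based system the option would go into the GRUB configuration instead.
Code:
```shell
#!/bin/sh
# has_kernel_param PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole word in the kernel command line.
# CMDLINE defaults to the running kernel's /proc/cmdline.
has_kernel_param() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Assumption: Pop_OS manages boot options with kernelstub.
#   sudo kernelstub -a "amdgpu.dpm=0"    # add the option
#   sudo kernelstub -d "amdgpu.dpm=0"    # remove it again
# After rebooting, confirm the option really reached the kernel:
#   has_kernel_param amdgpu.dpm=0 && echo "amdgpu.dpm=0 is active"
```
Checking /proc/cmdline after each change avoids drawing conclusions from a boot where the option was never applied.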

In any case, update the whole system, including the mesa packages.

Problems with firmware are usually addressed by installing the latest versions. There is the package:
firmware-amd-graphics
which is in many Debian-based distros, or PopOS may have its own version. It is wise to install the latest.

The latest firmware can also be obtained here:
If the firmware on the machine isn't the latest, one can extract the AMD firmware from the download and replace the /lib/firmware/amdgpu files.
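
Before replacing anything, it may be worth checking which files actually differ. The following is a minimal sketch: list_changed_firmware is a made-up helper name, and it assumes both directories hold plain uncompressed .bin files (some distros ship compressed firmware, in which case a byte comparison like this would not be meaningful).
Code:
```shell
#!/bin/sh
# list_changed_firmware NEW_DIR INSTALLED_DIR
# Prints the name of each .bin file in NEW_DIR that is missing from
# INSTALLED_DIR or whose contents differ. Nothing is copied or modified.
list_changed_firmware() {
    new_dir=$1
    installed_dir=$2
    for f in "$new_dir"/*.bin; do
        [ -e "$f" ] || continue      # NEW_DIR has no .bin files at all
        name=${f##*/}
        if ! cmp -s "$f" "$installed_dir/$name" 2>/dev/null; then
            echo "$name"
        fi
    done
}

# Example (adjust the first path to wherever the download was extracted):
#   list_changed_firmware /path/to/extracted/amdgpu /lib/firmware/amdgpu
# Assumption: on Debian/Ubuntu-style systems the initramfs carries a copy
# of the firmware, so after replacing files it usually needs a rebuild:
#   sudo update-initramfs -u
```
Keeping a backup copy of the original /lib/firmware/amdgpu directory makes it easy to roll back if the new files make things worse.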

After doing most of the above, and more that slips my memory, the video card was little more reliable, hence the replacement with a new one. YMMV.
 
Yes, that large online seller does have a forest-like name, doesn't it? :) Mine came from the same place! I'm not sure if abandoning the card will be an option for me. I suppose I could RMA it and hope for one with a newer firmware, but I'm always hoping for a software fix.

I'll try installing the latest version of the firmware before I try disabling power management functions, but I'd like to ask for assistance with that. PopOS doesn't seem to have a firmware-amd-graphics package:
Code:
root@pop-os:~/linux-firmware/linux-firmware-20240709# apt search firmware-amd
Sorting... Done
Full Text Search... Done
root@pop-os:~/linux-firmware/linux-firmware-20240709#
I'm also not getting any package hints for the existing bins on my system:
Code:
root@pop-os:/usr/lib/firmware/amdgpu# dpkg -S /usr/lib/firmware/amdgpu/picasso_rlc.bin
dpkg-query: no path found matching pattern /usr/lib/firmware/amdgpu/picasso_rlc.bin
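
One possible explanation for the empty result, offered as a guess rather than a diagnosis: on a merged-/usr system, /lib is a symlink to /usr/lib, and dpkg may have recorded the file under /lib/firmware/... rather than /usr/lib/firmware/.... The sketch below tries both spellings; alt_usrmerge_path and find_owner are hypothetical helper names, and the files may equally well have been installed outside of dpkg.
Code:
```shell
#!/bin/sh
# alt_usrmerge_path PATH
# Prints the other spelling of PATH on a merged-/usr system:
#   /usr/lib/...  <->  /lib/...
# Any other path is printed unchanged.
alt_usrmerge_path() {
    case $1 in
        /usr/lib/*) printf '%s\n' "${1#/usr}" ;;
        /lib/*)     printf '%s\n' "/usr$1" ;;
        *)          printf '%s\n' "$1" ;;
    esac
}

# find_owner PATH
# Asks dpkg which package owns PATH, trying both spellings in turn.
find_owner() {
    dpkg -S "$1" 2>/dev/null || dpkg -S "$(alt_usrmerge_path "$1")"
}

# Example:
#   find_owner /usr/lib/firmware/amdgpu/picasso_rlc.bin
```
If neither spelling is known to dpkg, the firmware was probably not installed from a package on this system.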
So I've downloaded the latest firmware .tar.gz from kernel.org - thank you for linking to them. Leaning on the old post I linked in my OP for hints leads me to the amdgpu directory, which contains a large number of binaries. Am I correct in understanding that the RX 7600 is a Navi 33 card? Because I only see navi 1x entries. Do you suppose any of these binaries are the right one for my card?
Code:
root@pop-os:~/linux-firmware/linux-firmware-20240709# find | grep -i navi
./amdgpu/navi14_vcn.bin
./amdgpu/navi14_mec.bin
./amdgpu/navi12_mec2.bin
./amdgpu/navi10_asd.bin
./amdgpu/navi12_mec.bin
./amdgpu/navi10_pfp.bin
./amdgpu/navi10_mec.bin
./amdgpu/navi12_rlc.bin
./amdgpu/navi10_me.bin
./amdgpu/navi14_ce.bin
./amdgpu/navi10_smc.bin
./amdgpu/navi14_pfp.bin
./amdgpu/navi14_pfp_wks.bin
./amdgpu/navi10_gpu_info.bin
./amdgpu/navi14_ce_wks.bin
./amdgpu/navi12_sos.bin
./amdgpu/navi12_gpu_info.bin
./amdgpu/navi14_me_wks.bin
./amdgpu/navi12_asd.bin
./amdgpu/navi14_mec2_wks.bin
./amdgpu/navi10_sdma.bin
./amdgpu/navi12_sdma.bin
./amdgpu/navi14_mec_wks.bin
./amdgpu/navi12_ce.bin
./amdgpu/navi14_gpu_info.bin
./amdgpu/navi12_pfp.bin
./amdgpu/navi10_sdma1.bin
./amdgpu/navi12_vcn.bin
./amdgpu/navi10_mec2.bin
./amdgpu/navi12_sdma1.bin
./amdgpu/navi14_sos.bin
./amdgpu/navi10_ce.bin
./amdgpu/navi10_ta.bin
./amdgpu/navi14_sdma1.bin
./amdgpu/navi14_me.bin
./amdgpu/navi14_rlc.bin
./amdgpu/navi12_me.bin
./amdgpu/navi10_rlc.bin
./amdgpu/navi14_ta.bin
./amdgpu/navi12_smc.bin
./amdgpu/navi14_smc.bin
./amdgpu/navi12_ta.bin
./amdgpu/navi10_vcn.bin
./amdgpu/navi14_asd.bin
./amdgpu/navi12_dmcu.bin
./amdgpu/navi10_sos.bin
./amdgpu/navi14_mec2.bin
./amdgpu/navi14_sdma.bin
 
I know squat about PopOS!.....but from what I understand, it only really runs correctly on System76's own line of hyper-expensive laptops (despite being available for download and install on its own).

(shrug...)


Mike. :(
 
I know squat about PopOS!.....but from what I understand, it only really runs correctly on System76's own line of hyper-expensive laptops
That's about all you need to know. People think that because it's based on Ubuntu, it will have all the same drivers etc., but it doesn't.
I no longer list Pop_OS as a user-friendly OS. If you can install it and everything works fine OOTB, then great for you, BUT if it only loads with errors, you can waste days trying to fix it and still fail.
 
