this post was submitted on 16 Apr 2024
My GPU is gone (sopuli.xyz)
submitted 6 months ago* (last edited 6 months ago) by [email protected] to c/[email protected]
 

I have an Optimus laptop, and after the update to KDE6 optimus-manager stopped working. I needed a second display, and all my display outputs are on the NVIDIA GPU, so I needed to switch. I tried many different X11 configs, then envycontrol, then more X11 configs, but I couldn't get it working right: it would drive either the internal display or the external one, never both. After a few hours I gave up and tried optimus-manager again. This time I checked the error log and saw it was failing to load the nvidia module. I tried loading it manually but got a "No such device" error, which is where the title of the post comes in. My GPU has disappeared from Linux: it won't show up in lspci, lshw, nvidia-smi, or anything else it should. The only references to it I can find in dmesg are:

[    0.216410] pci 0000:01:00.0: [10de:1ba1] type 00 class 0x030000
[    0.216419] pci 0000:01:00.0: reg 0x10: [mem 0xde000000-0xdeffffff]
[    0.216427] pci 0000:01:00.0: reg 0x14: [mem 0xc0000000-0xcfffffff 64bit pref]
[    0.216435] pci 0000:01:00.0: reg 0x1c: [mem 0xd0000000-0xd1ffffff 64bit pref]
[    0.216440] pci 0000:01:00.0: reg 0x24: [io  0xe000-0xe07f]
[    0.216445] pci 0000:01:00.0: reg 0x30: [mem 0xdf000000-0xdf07ffff pref]
[    0.216460] pci 0000:01:00.0: Enabling HDA controller
[    0.257300] pci 0000:01:00.0: vgaarb: bridge control possible
[    0.257300] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.270521] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0

and then nothing; it doesn't even seem to try to load the nvidia module. I tried booting into Windows and it shows up fine there, so the GPU didn't randomly die.
As far as I can tell, I've rolled back everything recorded in my histfile, back to the point where it stopped working. The only other thing I can think of is that I upgraded my kernel from 6.6.10 to 6.7.9; could that have caused it? I also tried adding pcie_port_pm=off to the kernel parameters, as suggested on the ArchWiki, but still nothing. I'm at a loss here. Anyone have any ideas?

EDIT: I'm using the nvidia-dkms package
EDIT2: One kernel downgrade later and it's still not appearing, so that's not it.
EDIT3: fixed, see comments
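
For anyone checking the same symptom, here is a minimal sketch that asks sysfs directly which PCI devices report a given vendor ID, without going through lspci. The vendor ID 0x10de is taken from the dmesg lines above. The function name list_pci_by_vendor and its optional second argument (an alternate devices directory, only there so the function can be tried against a test tree) are my own choices, not part of any tool.

```shell
#!/bin/sh
# List PCI devices whose vendor ID matches, reading sysfs directly.
# $1: vendor ID as it appears in sysfs (e.g. 0x10de for NVIDIA)
# $2: devices directory (hypothetical override for testing; defaults to the real path)
list_pci_by_vendor() {
    vendor="$1"
    root="${2:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip anything that is not a readable device entry
        [ -r "$dev/vendor" ] || continue
        if [ "$(cat "$dev/vendor")" = "$vendor" ]; then
            # Print the bus address and the class code (0x03xxxx is a display controller)
            printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/class" 2>/dev/null)"
        fi
    done
}

list_pci_by_vendor 0x10de
```

If this prints nothing even though dmesg shows the device being enumerated at boot, that would suggest something took the device off the bus after enumeration rather than the hardware never being detected.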

[–] [email protected] 5 points 6 months ago* (last edited 6 months ago) (2 children)

I think this happened to me once, and it turned out to be something really dumb, but I can't remember what.

@[email protected] just for the sake of trying everything, you could rebuild the DKMS modules and the initramfs, then reboot:

dkms autoinstall -F -a kernel-6.8.5-arch1 # change the kernel version to match what you have now (read it from uname -a)
mkinitcpio -P
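
A hedged variant of the same idea, which takes the kernel release from uname -r instead of typing it by hand. It assumes dkms selects the target kernel with -k, which is how I read its options; check man dkms on your system before relying on it. The function name rebuild_plan is made up for this sketch, and it only prints the commands so you can review them and run them as root yourself.

```shell
#!/bin/sh
# Print the rebuild commands for a kernel release (default: the running kernel).
# Assumption: `dkms autoinstall -k <release>` targets that kernel; verify with `man dkms`.
rebuild_plan() {
    kver="${1:-$(uname -r)}"
    # Rebuild and install all registered DKMS modules for that kernel
    echo "dkms autoinstall -k $kver"
    # Regenerate the initramfs images for all presets
    echo "mkinitcpio -P"
}

rebuild_plan
```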

Edit: An exhaustive list of what I would try:

  • check whether the drivers and modprobe blacklists make sense (this one is broad and requires digging into the Arch Wiki, but the Optimus laptop I had required blacklisting some drivers from early loading, afaik)
  • fiddle with rescans and power states in the sysfs PCI bus folders for the GPU
  • check that mkinitcpio.conf makes sense; also look for a .pacnew (/etc/mkinitcpio.conf.pacnew) and see whether its changes might affect the system
  • downgrade kernel (already tried)
  • downgrade DKMS packages
  • update BIOS and firmware from Windows
  • cold boot the laptop (shut down, remove AC and battery, leave it unpowered for a few seconds)
  • on Windows, look into ROG Armoury/MSI Center for any toggles that could affect the GPUs (iGPU/dGPU), such as power states, optimizations, etc.
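
A rough sketch of how a few of the checks above could be scripted. The function names find_pacnew and list_blacklists are made up for illustration, and the directories are the usual locations on Arch as far as I know, so adjust them if your setup differs. The PCI rescan needs root, so it is left as a comment rather than run.

```shell
#!/bin/sh
# Find unmerged pacman config files (*.pacnew) under a directory (default: /etc).
find_pacnew() {
    find "${1:-/etc}" -name '*.pacnew' 2>/dev/null
}

# Show active (uncommented) module blacklist lines in the given modprobe config directories.
list_blacklists() {
    grep -rH '^[[:space:]]*blacklist[[:space:]]' "$@" 2>/dev/null
}

# "|| true" keeps the script going when nothing matches or a directory is unreadable
find_pacnew /etc || true
list_blacklists /etc/modprobe.d /usr/lib/modprobe.d || true

# Asking the kernel to re-enumerate the PCI bus needs root, so it is not run here:
# echo 1 > /sys/bus/pci/rescan
```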
[–] [email protected] 5 points 6 months ago

Looks like you were right about the udev rules earlier. I ran a pacman command to find all untracked files in /usr and found that /usr/lib/udev/rules.d/50-remove-nvidia.rules was there. Contents:

# Automatically generated by EnvyControl

# Remove NVIDIA USB xHCI Host Controller devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c0330", ATTR{power/control}="auto", ATTR{remove}="1"

# Remove NVIDIA USB Type-C UCSI devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c8000", ATTR{power/control}="auto", ATTR{remove}="1"

# Remove NVIDIA Audio devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x040300", ATTR{power/control}="auto", ATTR{remove}="1"

# Remove NVIDIA VGA/3D controller devices
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x03[0-9]*", ATTR{power/control}="auto", ATTR{remove}="1"

Looks like EnvyControl left some extra files behind after uninstalling.
Personally, I think it's pretty weird that it put runtime files in /usr/lib; if they were in /etc I would have found them quickly.
The GPU is back on the bus now and I can run optimus-manager to get my extra screen. Thank you for the help troubleshooting this issue.
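
In case it helps someone else, here is a minimal sketch of the two checks that would have caught this: looking for udev rules that write 1 to a device's remove attribute, and listing rule files that no package owns. This is not the exact pacman command I ran. It relies on pacman -Qqo returning a non-zero status when no package owns a file, and the function names find_remove_rules and unowned_files are made up for the example.

```shell
#!/bin/sh
# Print udev rule files that write "1" to a device's sysfs "remove" attribute.
find_remove_rules() {
    grep -rlF 'ATTR{remove}="1"' "$@" 2>/dev/null
}

# Print files under the given directories that no pacman package owns.
unowned_files() {
    command -v pacman >/dev/null 2>&1 || { echo "pacman not found" >&2; return 1; }
    find "$@" -type f 2>/dev/null | while IFS= read -r f; do
        # pacman -Qqo fails when no installed package owns the file
        pacman -Qqo "$f" >/dev/null 2>&1 || printf '%s\n' "$f"
    done
}

find_remove_rules /etc/udev/rules.d /usr/lib/udev/rules.d || true
unowned_files /usr/lib/udev/rules.d || true
```

A file that shows up in both lists, like 50-remove-nvidia.rules did here, is a good candidate for the culprit.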

[–] [email protected] 1 points 6 months ago (2 children)

I don't seem to have a -F on my dkms? When I ran it without that flag, it didn't rebuild all the DKMS modules for some reason, just bbswitch and evdi.

[–] [email protected] 2 points 6 months ago* (last edited 6 months ago)

Ah, the -F might actually be wrong then. I was playing with custom kernels recently and my dkms is a mess, so I wouldn't worry about that option.

[–] [email protected] 1 points 6 months ago

dkms status doesn't even list half of my DKMS modules for some reason