Also, can we get the output of `sudo env CBM_DEBUG=1 clr-boot-manager update`?
Major difficulties recovering system after week 27 and 28 updates
Have you tried this? https://help.getsol.us/docs/user/troubleshooting/boot-rescue
After `chroot /target`, try updating to the latest with `sudo eopkg up`, which should have a fix for nvidia-*.
Not sure if you need to run `sudo usysconf run -f`, but just do it to be sure, then exit (the chroot), umount, and restart. This is how I recovered my PC.
infinitymdm silke
Here is the output of `lsblk`; nvme0n1p1 would be my EFI partition:
sda 8:0 0 5.5T 0 disk
├─sda1 8:1 0 5.4T 0 part /mnt/quaternary
└─sda2 8:2 0 94.1G 0 part
sdb 8:16 0 465.8G 0 disk
├─sdb1 8:17 0 499M 0 part
├─sdb2 8:18 0 100M 0 part
├─sdb3 8:19 0 16M 0 part
└─sdb4 8:20 0 195.4G 0 part
sdc 8:32 0 931.5G 0 disk
└─sdc1 8:33 0 931.5G 0 part /mnt/secondary
sdd 8:48 0 3.6T 0 disk
└─sdd3 8:51 0 3.6T 0 part /mnt/tertiary
sde 8:64 1 14.5G 0 disk
└─sde1 8:65 1 14.5G 0 part
zram0 252:0 0 8G 0 disk [SWAP]
nvme0n1 259:0 0 931.5G 0 disk
├─nvme0n1p1 259:1 0 512M 0 part
├─nvme0n1p2 259:2 0 70.8G 0 part /
├─nvme0n1p3 259:3 0 161.4G 0 part /home
├─nvme0n1p4 259:4 0 29.8G 0 part [SWAP]
└─nvme0n1p5 259:5 0 668.9G 0 part /mnt/fast storage
and here is the output of `sudo env CBM_DEBUG=1 clr-boot-manager update`:
[DEBUG] cbm (../src/cli/cli.c:L142): No such file: //etc/kernel/update_efi_vars
[INFO] cbm (../src/bootman/bootman.c:L787): Current running kernel: 6.9.8-294.current
[INFO] cbm (../src/bootman/sysconfig.c:L179): Discovered UEFI ESP: /dev/disk/by-partuuid/e9fc2609-be10-4546-ab1b-f7beebb9167e
[INFO] cbm (../src/bootman/sysconfig.c:L256): Fully resolved boot device: /dev/nvme0n1p1
[DEBUG] cbm (../src/bootman/bootman.c:L141): shim-systemd caps: 0x26, wanted: 0x26
[DEBUG] cbm (../src/bootman/bootman.c:L156): UEFI boot now selected (shim-systemd)
[INFO] cbm (../src/bootman/bootman.c:L807): path ///etc/kernel/initrd.d does not exist
[INFO] cbm (../src/bootman/bootman.c:L807): path ///usr/lib/initrd.d does not exist
[INFO] cbm (../src/bootman/bootman.c:L503): Checking for mounted boot dir
[INFO] cbm (../src/bootman/bootman.c:L555): Mounting boot device /dev/nvme0n1p1 at /boot
[SUCCESS] cbm (../src/bootman/bootman.c:L568): /dev/nvme0n1p1 successfully mounted at /boot
[DEBUG] cbm (../src/bootman/update.c:L164): Now beginning update_native
[DEBUG] cbm (../src/bootman/update.c:L173): update_native: 1 available kernels
[DEBUG] cbm (../src/bootman/update.c:L191): update_native: Running kernel is (current) ///usr/lib/kernel/com.solus-project.current.6.9.8-294
[SUCCESS] cbm (../src/bootman/update.c:L205): update_native: Bootloader updated
[DEBUG] cbm (../src/bootman/kernel.c:L617): installing extra initrd: /usr/lib64/kernel/initrd-com.solus-project.current.6.9.8-294.nvidia
[DEBUG] cbm (../src/bootloaders/systemd-class.c:L219): adding extra initrd to bootloader: initrd-com.solus-project.current.6.9.8-294.nvidia
[SUCCESS] cbm (../src/bootman/update.c:L220): update_native: Repaired running kernel ///usr/lib/kernel/com.solus-project.current.6.9.8-294
[DEBUG] cbm (../src/bootman/update.c:L230): update_native: Checking kernels for type current
[INFO] cbm (../src/bootman/update.c:L243): update_native: Default kernel for type current is ///usr/lib/kernel/com.solus-project.current.6.9.8-294
[DEBUG] cbm (../src/bootman/kernel.c:L617): installing extra initrd: /usr/lib64/kernel/initrd-com.solus-project.current.6.9.8-294.nvidia
[DEBUG] cbm (../src/bootloaders/systemd-class.c:L219): adding extra initrd to bootloader: initrd-com.solus-project.current.6.9.8-294.nvidia
[SUCCESS] cbm (../src/bootman/update.c:L255): update_native: Installed tip for current: ///usr/lib/kernel/com.solus-project.current.6.9.8-294
[DEBUG] cbm (../src/bootman/kernel.c:L617): installing extra initrd: /usr/lib64/kernel/initrd-com.solus-project.current.6.9.8-294.nvidia
[DEBUG] cbm (../src/bootloaders/systemd-class.c:L219): adding extra initrd to bootloader: initrd-com.solus-project.current.6.9.8-294.nvidia
[SUCCESS] cbm (../src/bootman/update.c:L269): update_native: Installed last_good kernel (current) (///usr/lib/kernel/com.solus-project.current.6.9.8-294)
[DEBUG] cbm (../src/bootman/update.c:L280): update_native: Analyzing for type current: ///usr/lib/kernel/com.solus-project.current.6.9.8-294
[DEBUG] cbm (../src/bootman/update.c:L285): update_native: Skipping running kernel
[INFO] cbm (../src/bootman/bootman.c:L503): Checking for mounted boot dir
[INFO] cbm (../src/bootman/bootman.c:L510): boot_dir is already mounted: /boot
[SUCCESS] cbm (../src/bootman/update.c:L338): update_native: Default kernel for current is ///usr/lib/kernel/com.solus-project.current.6.9.8-294
[DEBUG] cbm (../src/bootman/update.c:L353): No kernel removals found
[INFO] cbm (../src/bootman/bootman.c:L469): Attempting umount of /boot
[SUCCESS] cbm (../src/bootman/bootman.c:L473): Unmounted boot directory
Which appears to work as intended?
minh I was about to chroot and try this boot rescue when my computer unexpectedly booted as normal. I am not sure if it would still be worthwhile to chroot in; shouldn't I be able to do everything from inside my system now? If it's still worthwhile, let me know and I will give it a shot!
Also, weirdly, I haven't been having any issues with my nvidia card since the initial error with lightdm. I have tested a few different games as well, and the system is definitely running on the GPU rather than integrated graphics, as performance is as expected.
We pushed a hotfix for the nvidia driver, if you installed that it should be working (assuming it's the same issue). We're still trying to figure out the root cause of the issue.
Which appears to work as intended?
Yep, strange that it sometimes complains about /dev/nvme0n1p1 not existing.
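One quick way to check whether the device node is actually there at a given moment is to look at it directly, along with the by-partuuid symlink that clr-boot-manager resolved in the debug log above (the PARTUUID below is copied from that log):

```shell
# Does the ESP's device node exist right now?
ls -l /dev/nvme0n1p1
# Does the PARTUUID symlink cbm resolved still point at it?
ls -l /dev/disk/by-partuuid/e9fc2609-be10-4546-ab1b-f7beebb9167e
# Cross-check what blkid reports for the partition:
sudo blkid /dev/nvme0n1p1
```

Running this right after a failed `clr-boot-manager update` might show whether the node genuinely disappears or whether only the mount attempt fails.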
ReillyBrogan If I try to update my system now it appears everything is up to date. The issue doesn't appear to be with the nvidia driver anymore, but clr-boot-manager is still giving me errors.
For now it seems I am still able to restart my computer and use it as normal. I believe it is able to find the EFI partition when actually booting, since my system starts, so I am not sure why it cannot properly detect it once the system is up and running.
silke It really is bizarre. I was also having a very similar issue with my laptop today when I went to start it, except it was unable to detect my /home partition and so only reached a terminal. From the terminal, if I rebooted once or twice it would catch the partition and start as normal. But again, if I turned it off I risked it "losing" the partition again.
Incredibly odd that it is happening across both my main solus devices. The laptop is a t480s without any integrated graphics for what it is worth so nvidia should most definitely not be playing a role on my laptop.
On Friday I believe I will be able to hop into the Matrix at some point, as I will have more time available. I just need to sign up for it still.
Matt_Nico I am wondering if it may be something to do with these drives being NVMe devices. Both the drive on my laptop and the one on my desktop with my Solus install are fast NVMe drives. This is purely conjecture, but I wonder if the speed of these drives may be causing the issue. Could things be moving faster than clr-boot-manager or eopkg can keep up with? That would explain why the issue is intermittent: the system may be able to grab the information in time on some boot sequences but not on others.
Matt_Nico I've run Solus on a variety of PCIe x4 gen 3 and gen 4 drives. The speed of the drive is probably not your problem.
Matt_Nico I'm also using NVMe devices and have no issues. My guess is that there's something weird going on with your systems specifically. You can check the kernel logs (`journalctl -k`) for any suspicious information.
You could try updating the firmware using fwupd (`sudo eopkg install fwupd`). Make sure you have a good backup beforehand, though (I haven't seen it brick a system yet, but someone has to be the first).
- Ensure /boot is mounted. This is normally done automatically, but it can't hurt to double-check, seeing as the partitions seem a bit flaky: `sudo clr-boot-manager mount-boot`
- Check for updates: `fwupdmgr refresh`, then `fwupdmgr get-updates`
- Install them: `fwupdmgr update`
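Putting the steps above into one copy-pasteable run (assuming fwupd installed cleanly):

```shell
# Firmware update sequence, consolidated from the steps above.
sudo eopkg install fwupd          # install the firmware update daemon
sudo clr-boot-manager mount-boot  # make sure the ESP is mounted at /boot
fwupdmgr refresh                  # fetch the latest metadata
fwupdmgr get-updates              # list devices with pending updates
fwupdmgr update                   # apply them (a reboot may be required)
```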
silke here is the output of `journalctl -k`. To me it didn't look like there was anything glaringly wrong, but I am definitely out of my depth here.
Jul 18 14:09:59 solus kernel: Command line: initrd=\EFI\com.solus-project\initrd-com.solus-project.current.6.9.8-294 initrd=>
Jul 18 14:09:59 solus kernel: BIOS-provided physical RAM map:
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000000000000-0x0000000000057fff] usable
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000000058000-0x0000000000058fff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000000059000-0x000000000009dfff] usable
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000000009e000-0x00000000000fffff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000000100000-0x000000003fffffff] usable
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000040000000-0x00000000403fffff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000040400000-0x0000000069bd7fff] usable
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000069bd8000-0x0000000069bd8fff] ACPI NVS
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000069bd9000-0x0000000069bd9fff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000069bda000-0x000000007b1befff] usable
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007b1bf000-0x000000007b68efff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007b68f000-0x000000007b6fefff] ACPI data
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007b6ff000-0x000000007bb2efff] ACPI NVS
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007bb2f000-0x000000007cffefff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007cfff000-0x000000007cffffff] usable
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007d000000-0x000000007fffffff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
When I am home later I would be happy to try to update the firmware. I have my important files saved on a separate drive but have not done a proper backup. Would something like this guide be acceptable to create a backup?
IIRC this installation was done years ago with a Solus 4.2 or 4.3 ISO; would that point to a firmware issue?
Matt_Nico silke I didn't end up going through with the `fwupdmgr update` command yet, as I am not sure it will actually do anything given the output of the `fwupdmgr get-updates` command.
matt@matt-solus-desktop ~ $ fwupdmgr get-updates
WARNING: This package has not been validated, it may not work properly.
Devices with no available firmware updates:
• SSD 850 EVO 500GB
• SSD 860 EVO 1TB
• WD BLACK SN750 SE 1TB
• WDC WD40EZRZ-75GXCB0
• WDC WD60EZAZ-00SF3B0
No updatable devices
It appears as though all my drives are up to date firmware-wise, so would it be worth it to go through and run the `fwupdmgr update` command?
Still running into these issues on both my laptop and my desktop.
Matt_Nico If there are no updates available, running `fwupdmgr update` will just tell you the same thing that `get-updates` did. No need to run it.
Just to be clear, you're still getting this error you mentioned in your earlier post, correct?
Matt_Nico
[✗] Updating clr-boot-manager failed
A copy of the command output follows:
[FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device
Are there any other errors you're seeing, or any other behavior that you don't think is normal?
Could we get the output of `sudo journalctl -k | grep nvme`? That should filter the kernel log for messages containing the string "nvme". There may be better search strings to try; this is just where I would start.
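For a broader sweep than just "nvme", something like this might also surface mount or I/O trouble (the pattern here is only my starting guess; tweak it as needed):

```shell
# Case-insensitive sweep of the kernel log for storage-related trouble.
sudo journalctl -k | grep -iE 'nvme|mount|fail|error|timeout'
```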
infinitymdm On my laptop there is additional weird behaviour: it is failing to mount the /home partition on startup around 50% of the time. If I just reboot after the error it will usually boot fine, but sometimes it takes multiple attempts. Here is an image of the error:
And yes, to be clear, the other error is still persisting; this is as it appears on my laptop:
[✗] Updating clr-boot-manager failed
A copy of the command output follows:
[FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device
[✗] Updating clr-boot-manager failed
A copy of the command output follows:
[FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device
[✓] Running depmod on kernel 6.9.8-294.current success
Here is the output of `sudo journalctl -k | grep nvme` on my laptop (I will also post my desktop's output in a few minutes):
Jul 19 15:42:35 solus kernel: nvme nvme0: 8/0/0 default/read/poll queues
Jul 19 15:42:35 solus kernel: nvme0n1: p1 p2 p3 p4
Jul 19 15:42:37 solus kernel: EXT4-fs (nvme0n1p3): mounted filesystem 9aa71c0a-9d55-4f79-b370-a3f554f8eb80 r/w with ordered data mode. Quota mode: none.
Jul 19 15:42:35 solus kernel: nvme nvme0: pci function 0000:3e:00.0
Jul 19 15:42:35 solus kernel: nvme nvme0: 8/0/0 default/read/poll queues
Jul 19 15:42:35 solus kernel: nvme0n1: p1 p2 p3 p4
Jul 19 15:42:37 solus kernel: EXT4-fs (nvme0n1p3): mounted filesystem 9aa71c0a-9d55-4f79-b370-a3f554f8eb80 r/w with ordered data mode. Quota mode: none.
Jul 19 15:42:38 matt-solus-t480s kernel: EXT4-fs (nvme0n1p3): re-mounted 9aa71c0a-9d55-4f79-b370-a3f554f8eb80 r/w. Quota mode: none.
Jul 19 15:42:38 matt-solus-t480s kernel: Adding 11718652k swap on /dev/nvme0n1p2. Priority:-2 extents:1 across:11718652k SS
Jul 19 15:42:40 matt-solus-t480s kernel: EXT4-fs (nvme0n1p4): mounted filesystem fa0ff32f-4b57-468f-981f-e79f6fed9aa7 r/w with ordered data mode. Quota mode: none.
What I find most bizarre is that the error is almost exactly replicated on both of my systems.
infinitymdm One additional weird thing occurring on my desktop is that I will sometimes be unable to log in from the lock screen after the computer has gone into standby mode. There will be no option to type in the field where I must enter the password. This behaviour stops if I log out of the account and then log back into the system. It happens when the system is left in standby for longer than 6 hours, so I had just switched to powering off my system when this behaviour occasionally flares up (usually it will happen a few days in a row and then I will switch to powering the system down). I don't think this would be related.
Here is the error as it has been appearing on my desktop system:
[✓] Updating dynamic library cache success
[✗] Updating clr-boot-manager failed
A copy of the command output follows:
[FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device
[✗] Updating clr-boot-manager failed
A copy of the command output follows:
[FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device
[✗] Updating clr-boot-manager failed
A copy of the command output follows:
[FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device
[✓] Running depmod on kernel 6.9.8-294.current success
[✓] Updating hwdb success
[✓] Updating system users success
[✓] Updating systemd tmpfiles success
[✓] Reloading systemd configuration success
[✓] Re-starting vendor-enabled .socket units success
[✓] Compiling and Reloading AppArmor profiles success
[✓] Updating manpages database success
[✓] Reloading udev rules success
[✓] Applying udev rules success
and the output of `sudo journalctl -k | grep nvme` on my desktop system:
Jul 19 16:01:31 solus kernel: nvme nvme0: allocated 64 MiB host memory buffer.
Jul 19 16:01:31 solus kernel: nvme nvme0: 6/0/0 default/read/poll queues
Jul 19 16:01:31 solus kernel: nvme0n1: p1 p2 p3 p4 p5
Jul 19 16:01:33 solus kernel: EXT4-fs (nvme0n1p2): mounted filesystem 2456cde0-a7e1-4af1-99ca-c30eb65f868a r/w with ordered data mode. Quota mode: none.
Jul 19 16:01:31 solus kernel: nvme nvme0: pci function 0000:03:00.0
Jul 19 16:01:31 solus kernel: nvme nvme0: allocated 64 MiB host memory buffer.
Jul 19 16:01:31 solus kernel: nvme nvme0: 6/0/0 default/read/poll queues
Jul 19 16:01:31 solus kernel: nvme0n1: p1 p2 p3 p4 p5
Jul 19 16:01:33 solus kernel: EXT4-fs (nvme0n1p2): mounted filesystem 2456cde0-a7e1-4af1-99ca-c30eb65f868a r/w with ordered data mode. Quota mode: none.
Jul 19 16:01:34 matt-solus-desktop kernel: EXT4-fs (nvme0n1p2): re-mounted 2456cde0-a7e1-4af1-99ca-c30eb65f868a r/w. Quota mode: none.
Jul 19 16:01:34 matt-solus-desktop kernel: Adding 31250428k swap on /dev/nvme0n1p4. Priority:-2 extents:1 across:31250428k SS
Jul 19 16:01:34 matt-solus-desktop kernel: EXT4-fs (nvme0n1p5): mounted filesystem 81b8fa97-0b82-4375-a08a-6d7ec6daa7af r/w with ordered data mode. Quota mode: none.
Jul 19 16:01:35 matt-solus-desktop kernel: EXT4-fs (nvme0n1p3): mounted filesystem 168e8227-9af4-4f3a-b2be-3bdf0875dece r/w with ordered data mode. Quota mode: none.
Jul 19 16:01:36 matt-solus-desktop kernel: block nvme0n1: No UUID available providing old NGUID
Matt_Nico Now I am running into the same issue with the /home partition on my desktop computer. I snapped a picture of it as well:
It seems to be identical to the issue which is present on the laptop. So now I can say that both systems are exhibiting the exact same set of errors.
I have not yet applied the W39 updates as I do not know what will happen when I do. Should I just go for it?
Been watching; not sure why two different computers would have the same issue.
Do you have fast boot turned off on each?
It's interesting.
Matt_Nico One additional weird thing occurring on my desktop is that I will sometimes be unable to log in from the lock screen after the computer has gone into standby mode. There will be no option to type in the field where I must enter the password. This behaviour stops if I log out of the account and then log back into the system. It happens when the system is left in standby for longer than 6 hours, so I had just switched to powering off my system when this behaviour occasionally flares up (usually it will happen a few days in a row and then I will switch to powering the system down). I don't think this would be related.
I actually also have this issue, but I usually resolve it by hitting "Switch User" and then logging back in. It kicks you back to the standby screen, but then password entry works. So I don't think this is related to your primary issue.
The /home failing to mount, alongside the boot manager issue, would lead me to think there's a problem with the storage device. But the fact that it's replicated across both systems should rule that out. The odds of two drives failing in exactly the same way at exactly the same time in two separate systems are basically zero.
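Still, to rule hardware out definitively, a quick NVMe health check costs nothing. This assumes nvme-cli is available (I believe it can be installed with `sudo eopkg install nvme-cli`); look at critical_warning, media_errors, and num_err_log_entries in the output.

```shell
# SMART-style health summary for the NVMe controller
sudo nvme smart-log /dev/nvme0n1
# Recent controller error-log entries, if any
sudo nvme error-log /dev/nvme0n1 | head -n 20
```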
Your issue seems to be beyond my knowledge to troubleshoot. I don't see any issues in your kernel logs either. This sure is a puzzle. If I think of anything I'll chime in, but I'm not sure how to help you at this point; you need someone more experienced than me.