• Support • Question
  • Major difficulties recovering system after week 27 and 28 updates

silke It really is bizarre. I had a very similar issue with my laptop today when I went to start it, except it was unable to detect my /home partition and so only reached the terminal. From the terminal, rebooting once or twice would let it catch the partition and start as normal, but if I turned it off I would risk it "losing" the partition again.

Incredibly odd that it is happening across both of my main Solus devices. The laptop is a T480s without any dedicated graphics, for what it is worth, so NVIDIA should most definitely not be playing a role on my laptop.

*On Friday I believe I will be able to hop into Matrix at some point, as I will have more time available. I just need to sign up for it still.

    Matt_Nico I am wondering if it may be something to do with these drives being NVMe devices. Both the drive in my laptop and the one in my desktop with my Solus install are fast NVMe drives. This is purely conjecture, but I wonder if the speed of these drives may be causing the issue. Could things be moving faster than clr-boot-manager or eopkg can keep up with? That would explain why the issue is intermittent, as the system may be able to grab the information in time on some boot sequences but not on others.

      Matt_Nico I've run Solus on a variety of PCIe x4 gen 3 and gen 4 drives. The speed of the drive is probably not your problem.

      Matt_Nico I'm also using NVMe devices and have no issues. My guess is that there's something weird going on. You can check the kernel logs (journalctl -k) for any suspicious information.

      You could try updating the firmware using fwupd (eopkg install fwupd). Make sure you have a good backup beforehand though (I haven't seen it brick a system yet, but someone has to be the first).

      1. Ensure /boot is mounted:
           clr-boot-manager mount-boot
        This is normally done automatically, but it can't hurt to double-check, seeing as the partitions seem a bit flaky.
      2. Check for updates:
           fwupdmgr refresh
           fwupdmgr get-updates
      3. Install them:
           fwupdmgr update
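
      If you want to sanity-check the boot partition while you're at it, something along these lines should do (assuming the ESP really is the first partition on the NVMe drive, e.g. /dev/nvme0n1p1; adjust for your layout):
           # Where (and whether) /boot is currently mounted
           findmnt /boot
           # Partitions and filesystems the kernel currently sees on the drive
           lsblk -f /dev/nvme0n1
           # Whether the ESP device node exists at all right now
           ls -l /dev/nvme0n1p1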

        silke Here is the output of journalctl -k. To me it didn't look like there was anything glaringly wrong, but I am definitely out of my depth here.

        Jul 18 14:09:59 solus kernel: Command line: initrd=\EFI\com.solus-project\initrd-com.solus-project.current.6.9.8-294 initrd=>
        Jul 18 14:09:59 solus kernel: BIOS-provided physical RAM map:
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000000000000-0x0000000000057fff] usable
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000000058000-0x0000000000058fff] reserved
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000000059000-0x000000000009dfff] usable
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000000009e000-0x00000000000fffff] reserved
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000000100000-0x000000003fffffff] usable
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000040000000-0x00000000403fffff] reserved
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000040400000-0x0000000069bd7fff] usable
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000069bd8000-0x0000000069bd8fff] ACPI NVS
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000069bd9000-0x0000000069bd9fff] reserved
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000069bda000-0x000000007b1befff] usable
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007b1bf000-0x000000007b68efff] reserved
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007b68f000-0x000000007b6fefff] ACPI data
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007b6ff000-0x000000007bb2efff] ACPI NVS
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007bb2f000-0x000000007cffefff] reserved
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007cfff000-0x000000007cffffff] usable
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007d000000-0x000000007fffffff] reserved
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved
        Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
        (output truncated)

        When I am home later I would be happy to try to update the firmware. I have my important files saved on a separate drive but have not done a proper backup. Would something like this guide be acceptable to create a backup?
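
        (For reference, most full-system backup guides boil down to an rsync copy along these lines; /mnt/backup is just a placeholder for wherever the backup drive is mounted:)
             # Copy the root filesystem, preserving permissions, ACLs, xattrs and hard links,
             # while skipping pseudo-filesystems and the backup mount itself
             sudo rsync -aAXH --info=progress2 \
                 --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*","/lost+found"} \
                 / /mnt/backup/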

        IIRC this installation was done years ago from a Solus 4.2 or 4.3 ISO; would that point to a firmware issue?

          Matt_Nico silke I didn't end up going through with the fwupdmgr update command yet, as I am not sure it will actually do anything given the output of the fwupdmgr get-updates command.

          matt@matt-solus-desktop ~ $ fwupdmgr get-updates
          WARNING: This package has not been validated, it may not work properly.
          Devices with no available firmware updates: 
           • SSD 850 EVO 500GB
           • SSD 860 EVO 1TB
           • WD BLACK SN750 SE 1TB
           • WDC WD40EZRZ-75GXCB0
           • WDC WD60EZAZ-00SF3B0
          No updatable devices

          It appears as though all my drives are up to date firmware-wise, so would it still be worth running the fwupdmgr update command?

          Still running into these issues on both my laptop and my desktop.

            Matt_Nico If there are no updates available, running fwupdmgr update will just tell you the same thing that get-updates did. No need to run it.

            Just to be clear, you're still getting this error you mentioned in your earlier post, correct?

            Matt_Nico
            [✗] Updating clr-boot-manager failed

            A copy of the command output follows:

            [FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device

            Are there any other errors you're seeing, or any other behavior that you don't think is normal?

            Could we get the output of sudo journalctl -k | grep nvme? That should filter the kernel log for messages containing the string "nvme". There may be better search strings to try; this is just where I would start.
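
            For example (these are just guesses at useful patterns, nothing Solus-specific):
                 # Cast a wider, case-insensitive net than just "nvme"
                 sudo journalctl -k | grep -iE 'nvme|mount|ext4|fail'
                 # Or check the previous boot's kernel messages rather than the current boot's
                 sudo journalctl -k -b -1 | grep -i nvme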

              infinitymdm On my laptop there is additional weird behaviour occurring: namely, it is failing to mount the /home partition on startup around 50% of the time. If I reboot after the error it will usually boot just fine, but sometimes it takes multiple attempts. Here is an image of the error:

              And yes, to be clear, the other error is still persisting; this is how it appears on my laptop:

               [✗] Updating clr-boot-manager                                           failed
              
              A copy of the command output follows:
              
              [FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device
              
              
               [✗] Updating clr-boot-manager                                           failed
              
              A copy of the command output follows:
              
              [FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device
              
              
               [✓] Running depmod on kernel 6.9.8-294.current                         success

              Here is the output of sudo journalctl -k | grep nvme on my laptop (I will also respond with my desktop's output in a few minutes):

              Jul 19 15:42:35 solus kernel: nvme nvme0: 8/0/0 default/read/poll queues
              Jul 19 15:42:35 solus kernel:  nvme0n1: p1 p2 p3 p4
              Jul 19 15:42:37 solus kernel: EXT4-fs (nvme0n1p3): mounted filesystem 9aa71c0a-9d55-4f79-b370-a3f554f8eb80 r/w with ordered data mode. Quota mode: none.
              Jul 19 15:42:35 solus kernel: nvme nvme0: pci function 0000:3e:00.0
              Jul 19 15:42:35 solus kernel: nvme nvme0: 8/0/0 default/read/poll queues
              Jul 19 15:42:35 solus kernel:  nvme0n1: p1 p2 p3 p4
              Jul 19 15:42:37 solus kernel: EXT4-fs (nvme0n1p3): mounted filesystem 9aa71c0a-9d55-4f79-b370-a3f554f8eb80 r/w with ordered data mode. Quota mode: none.
              Jul 19 15:42:38 matt-solus-t480s kernel: EXT4-fs (nvme0n1p3): re-mounted 9aa71c0a-9d55-4f79-b370-a3f554f8eb80 r/w. Quota mode: none.
              Jul 19 15:42:38 matt-solus-t480s kernel: Adding 11718652k swap on /dev/nvme0n1p2.  Priority:-2 extents:1 across:11718652k SS
              Jul 19 15:42:40 matt-solus-t480s kernel: EXT4-fs (nvme0n1p4): mounted filesystem fa0ff32f-4b57-468f-981f-e79f6fed9aa7 r/w with ordered data mode. Quota mode: none.

              What I find most bizarre is that the error is almost exactly replicated on both of my systems.

              infinitymdm One additional weird thing occurring on my desktop is that I will sometimes be unable to log in from the lock screen after the computer has gone into standby mode: there is no way to type in the field where I must enter the password. This behaviour stops if I log out of the account and then log back in. It happens when the system is left in standby for longer than 6 hours, so I had switched to powering off my system when this behaviour occasionally flares up (usually it will happen a few days in a row, and then I will switch to powering the system down). I don't think this would be related.

              Here is the error as it has been appearing on my desktop system:

               [✓] Updating dynamic library cache                                     success
               [✗] Updating clr-boot-manager                                           failed
              
              A copy of the command output follows:
              
              [FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device
              
              
               [✗] Updating clr-boot-manager                                           failed
              
              A copy of the command output follows:
              
              [FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device
              
              
               [✗] Updating clr-boot-manager                                           failed
              
              A copy of the command output follows:
              
              [FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device
              
              
               [✓] Running depmod on kernel 6.9.8-294.current                         success
               [✓] Updating hwdb                                                      success
               [✓] Updating system users                                              success
               [✓] Updating systemd tmpfiles                                          success
               [✓] Reloading systemd configuration                                    success
               [✓] Re-starting vendor-enabled .socket units                           success
               [✓] Compiling and Reloading AppArmor profiles                          success
               [✓] Updating manpages database                                         success
               [✓] Reloading udev rules                                               success
               [✓] Applying udev rules                                                success

              And the output of sudo journalctl -k | grep nvme on my desktop system:

              Jul 19 16:01:31 solus kernel: nvme nvme0: allocated 64 MiB host memory buffer.
              Jul 19 16:01:31 solus kernel: nvme nvme0: 6/0/0 default/read/poll queues
              Jul 19 16:01:31 solus kernel:  nvme0n1: p1 p2 p3 p4 p5
              Jul 19 16:01:33 solus kernel: EXT4-fs (nvme0n1p2): mounted filesystem 2456cde0-a7e1-4af1-99ca-c30eb65f868a r/w with ordered data mode. Quota mode: none.
              Jul 19 16:01:31 solus kernel: nvme nvme0: pci function 0000:03:00.0
              Jul 19 16:01:31 solus kernel: nvme nvme0: allocated 64 MiB host memory buffer.
              Jul 19 16:01:31 solus kernel: nvme nvme0: 6/0/0 default/read/poll queues
              Jul 19 16:01:31 solus kernel:  nvme0n1: p1 p2 p3 p4 p5
              Jul 19 16:01:33 solus kernel: EXT4-fs (nvme0n1p2): mounted filesystem 2456cde0-a7e1-4af1-99ca-c30eb65f868a r/w with ordered data mode. Quota mode: none.
              Jul 19 16:01:34 matt-solus-desktop kernel: EXT4-fs (nvme0n1p2): re-mounted 2456cde0-a7e1-4af1-99ca-c30eb65f868a r/w. Quota mode: none.
              Jul 19 16:01:34 matt-solus-desktop kernel: Adding 31250428k swap on /dev/nvme0n1p4.  Priority:-2 extents:1 across:31250428k SS
              Jul 19 16:01:34 matt-solus-desktop kernel: EXT4-fs (nvme0n1p5): mounted filesystem 81b8fa97-0b82-4375-a08a-6d7ec6daa7af r/w with ordered data mode. Quota mode: none.
              Jul 19 16:01:35 matt-solus-desktop kernel: EXT4-fs (nvme0n1p3): mounted filesystem 168e8227-9af4-4f3a-b2be-3bdf0875dece r/w with ordered data mode. Quota mode: none.
              Jul 19 16:01:36 matt-solus-desktop kernel: block nvme0n1: No UUID available providing old NGUID

                Matt_Nico Now I am running into the same issue with the /home partition on my desktop computer. I snapped a pic of it as well:

                It seems identical to the issue present on the laptop, so I can now say that both systems are exhibiting the exact same set of errors.

                I have not yet applied the W39 updates as I do not know what will happen when I do. Should I just go for it?

                  Matt_Nico Should I just go for it?

                  I can't speak for your system, of course, but I installed the Week 39 updates on two laptops and four VMs without any problems. Are you backed up?

                  Been watching; not sure why two different computers would have the same issue. Do you have fast boot turned off on each?

                  It's interesting.

                    Matt_Nico One additional weird thing occurring on my desktop is that I will sometimes be unable to log in from the lock screen after the computer has gone into standby mode: there is no way to type in the field where I must enter the password. This behaviour stops if I log out of the account and then log back in. It happens when the system is left in standby for longer than 6 hours, so I had switched to powering off my system when this behaviour occasionally flares up (usually it will happen a few days in a row, and then I will switch to powering the system down). I don't think this would be related.

                    I actually also have this issue, but I usually resolve it by hitting "Switch User" and then logging back in. It kicks you back to the standby screen, but then password entry works. So I don't think this is related to your primary issue.

                    The /home failing to mount, alongside the boot manager issue, would lead me to think there's a problem with the storage device. But the fact that it's replicated across both systems should rule that out. The odds of two drives failing in exactly the same way at exactly the same time in two separate systems are basically zero.

                    Your issue seems to be beyond my knowledge to troubleshoot. I don't see any issues in your kernel logs either. This sure is a puzzle. If I think of anything I'll pipe up, but I'm not sure how to help you at this point. You need someone more experienced than I am.

                      infinitymdm I will try hitting "Switch User" next time that issue crops up. But yeah, I don't think it's related to the main issue.

                      I was thinking of drive failures too, but as you said, the chances of both systems having the exact same issue at the same time are very low. Thank you very much for looking into things for me, even if the issue is still unsolved. Perhaps when I have more time, some fresh installs will be in order.

                      Axios I haven't checked, but IIRC both systems would have fast boot disabled. This is also a new issue on two older installs (one especially so), so I would guess that the fast boot state isn't a factor, as it has not been changed on either system since installation. My desktop would have been installed with either Solus 4.2 or 4.3 sometime in 2021, and my laptop was installed with 4.4 in September 2023.

                        Matt_Nico Do both devices have similar SSDs?

                        I have not yet applied the W39 updates as I do not know what will happen when I do. Should I just go for it?

                        You can always select the previous kernel in the boot menu, so it can't hurt. Use sudo clr-boot-manager set-timeout 5 to always show the boot menu, or hold space on boot to show it once.

                          silke My laptop has a 256 GB Intel SSD and my desktop has a 1 TB Samsung SSD. I can't remember the exact Samsung model off the top of my head, but I think it was a 970 EVO.

                          I went through with the updates on both my systems, but nothing has changed issue-wise. I still get the clr-boot-manager error on both of my systems, and the issue with mounting /home has also persisted.

                          I do wonder if an update in the future might solve things. If things break again/entirely I will probably go for a reinstall.

                          Matt_Nico I had the exact same error this weekend when I tried to boot up my desktop PC to apply the latest updates. Initially, especially when I saw your post, I thought there was an issue with systemd.
                          But as it turns out, my drive with the home partition actually died 🙁 Thank god (no, thank restic 😉), I have backups... A new drive should be here in the next few days... 🙏

                            eryvile I am extremely doubtful that my issue is a drive failure. The chances of it happening on two identical SSDs at the exact same time (within 2 hours of each other) are already infinitesimally small. For it to happen on two completely different SSDs, made on two separate dates (two years apart) and by two separate manufacturers, is a near statistical impossibility.

                            If it is drive failures, that sucks and I am out a couple hundred bucks, but drive failures just make no sense to me probability-wise. The error on both devices appeared immediately following the W38 updates; there were no issues on either drive prior to that set of updates. If they had failed at the same time, but one with the update and one without, I would be more inclined to believe it was a drive failure. To me the simplest answer is that something in that set of updates changed the way my system identifies my drives (Occam's razor and all that). I don't think the update itself could have caused a drive failure, but that has crossed my mind.
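
                            If something in the updates really did change how the system identifies the drives, one way to test that hypothesis would be to compare what the error expects against what the kernel actually sees (nvme0n1p1 below is just the device from the error message):
                                 # Does fstab reference the partitions by UUID, PARTUUID, or raw device node?
                                 grep -v '^#' /etc/fstab
                                 # What names and UUIDs does the kernel currently assign?
                                 lsblk -o NAME,UUID,PARTUUID,FSTYPE,MOUNTPOINT
                                 ls -l /dev/disk/by-uuid/
                            (I believe /boot may not appear in fstab at all on Solus, since clr-boot-manager locates and mounts the ESP itself; in that case the lsblk output is the interesting part.)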

                              Matt_Nico I have since checked my desktop drive for any issues using smartmontools, and it came back without any errors that I could see. So now I am fairly sure that at least my desktop drive is not failing.
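
                              For reference, a typical smartmontools health check on an NVMe drive looks something like this (/dev/nvme0 is a placeholder for the drive being tested):
                                   # Quick overall pass/fail health assessment
                                   sudo smartctl -H /dev/nvme0
                                   # Full device report, including the NVMe health and error logs
                                   sudo smartctl -a /dev/nvme0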

                                Matt_Nico Just to be sure about this, I also checked my portable-drive install of Solus. The exact same clr-boot-manager error has become present on this device too. Again, this error only appeared after I applied updates to this system (jumping from W26 to W29, IIRC). There was no error on this system prior to my applying that set of updates.

                                So this exact error is now being replicated across three of my systems: two on NVMe drives and one on a USB drive (I know this isn't best practice, but it lets me use the computers at my work). I can fairly certainly rule out drive failure. I don't know what is causing this issue, but it definitely seems software-related.