Major difficulties recovering system after week 27 and 28 updates

Matt_Nico · Jul 15, 2024

Hello all, following the past two updates I have been having a lot of troubles with my desktop system (as of now my laptop is still functional).

When the week 27 update came out it immediately broke some of my steam games so I rolled the update back and only applied the security updates. I planned on waiting until the week 28 updates to come out and see if those fixed my issues.
Well the week 28 updates came out and I applied them late last night while installing a new program (innoextract). This morning I rebooted my system (it had been a few days and fileroller was hanging while trying to extract a zip) but upon the restart I ran into the lightdm not starting.

Error preparing initrd: Device Error
Failed to start lightdm.service - light display manager

So I did the normal thing and went into the command line to run eopkg up to ensure everything had been applied. Oddly the command installed 100+ packages. Because of this I rolled back the system to before the innoextract was installed (there were two deprecated packages but I don't remember what). I regretfully did not run eopkg check before rebooting the system. On my reboot I ran into a new error I haven't seen before:

start_image() returned device error

After this I tried booting into the old kernel (6.8) and running the update there. I made the same error of not running eopkg check (really regretting this). The exact same error got thrown and now I cannot enter the terminal in either of the kernel versions.

I have a live version of solus on a usb stick, but I am not familiar as to how I would use it to recover my system. I would really like to recover my system if possible as I do not want to have to reinstall all my programs and customise my settings. I'm out of my depth here so I will definitely need help. Thank you guys in advance!!

I forgot to add something so I put it in italics

Staudey · Jul 15, 2024

Matt_Nico After this I tried booting into the old kernel (6.8) and running the update there.

Are you saying it still installed updates there (a third time!) even though it had already updated 100 packages just before when you booted using the other kernel version?

Something seems to be terribly wrong with eopkg on your system if it keeps on updating stuff that should already be up-to-date.

Matt_Nico · Jul 16, 2024

Staudey it only did it twice not a third time. I forgot to mention a rollback which I had done while trying to recover things. Thank you for pointing this out I have updated the original post!

I rolled back things after running the first update from the terminal while trying to recover. I also am not sure whether an update was actually applied when I installed innoextract as I did this through gui rather than the terminal. I have had my system break before when applying updates through the software center.

With both my kernels borked am I out of luck?

Matt_Nico · Jul 17, 2024

Update:

After installing the week 38 updates on my laptop I ran into a different issue. First I was unable to boot my laptop because it failed to identify the home partition. I just rebooted it from the terminal and the next time it started mostly fine. Except when I ran the sudo eopkg check | grep Broken ... command to check for any broken files. It reorganized the order of some things ("system.base" files), but then output that clr-boot-manager was unable to mount /boot as it did not exist.

From there I rebooted my laptop to see what would happen. Surprisingly it booted with no issues and after running the command to repair broken files again no errors with clr-boot-manager were returned.

With this set of odd occurrences I decided to attempt to boot my desktop "just to see". The first attempt returned the same error start_image() returned device error. So then I tried with the other kernel (6.8 rather than 6.9) and my system booted with no issues. Now remember that last time I tried this the exact same error occured on both kernels, I have made absolutely no changes to anything, I only left my computer alone for a day and a bit.

After this fortuitous turn of events I immediately ran an update on the desktop. Following the update I ran the sudo eopkg check | grep Broken ... command to ensure that any broken files were repaired. It did the same thing as on my laptop and rearranged all the "system.base" files. I was incredibly confused by this.

I then rebooted my desktop and tried launching both kernels, both kernels now appear to work as normal.

I did one final check by running just sudo eopkg check and this returned an even odder result. All of the "system.base" packages which had been rearranged were now listed as broken. So then I ran the sudo eopkg check | grep Broken ... command one more time to repair these broken packages. It didn't find that any of the packages were broken, but once again rearranged the "system.base" files.

I do not know what caused any of the behaviour I have seen from my two systems. I also have no clue if my systems are now stable or if they could fall back into recovery mode at any moment. This is extremely concerning to me as it seems that all of this instability was linked to the week 38 updates. I would not like to mark this as a solution as nothing has been done and I believe that the issue persists.

I do not know if you are the right person to contact about this @ReillyBrogan but I would really appreciate a staff member seeing this as I saw other people having similar issues in the update thread. I have absolutely no clue what to do from here, or if I should even try to do anything.

Matt_Nico · Jul 17, 2024

Matt_Nico I have since rebooted my desktop system again, and just in case ran the sudo eopkg check | grep Broken ... command one last time to make sure everything was in order. It came back with a clr-boot-manager error yet again. This time the error is as follows:

[✗] Updating clr-boot-manager failed

A copy of the command output follows:

[FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device

I tried to run sudo clr-boot-manager update to see if that was able to do anything, and it appeared to fix thing when I ran the command to repair broken packages again. I will now reboot the system and see if the fix sticks

Matt_Nico · Jul 17, 2024

Matt_Nico After rebooting the system yet again and running the sudo eopkg check | grep Broken ... command, I was met with this error yet again:

[✗] Updating clr-boot-manager failed

A copy of the command output follows:

[FATAL] cbm (../src/bootman/bootman.c:L562): FATAL: Cannot mount boot device /dev/nvme0n1p1 on /boot: No such device

I have no clue why this error has returned after it was seemingly fixed. Running sudo clr-boot-manager update once again got the error to go away. I've done this reboot sequence a few times now and the error always returns upon the reboot but will go away once you update the clr-boot-manager. I have also tried running sudo usysconf run -f prior to rebooting the system but the error will still return.

Absolutely no clue where to go from here, guess I just wont turn of either my desktop or laptop for the time being.

infinitymdm · Jul 17, 2024

All those packages that keep getting reported as broken by eopkg are going to keep getting listed as broken. See https://discuss.getsol.us/d/10707-broken-packages-reinstalled-still-broken for details. You don't need to worry about those.

The clr-boot-manager error is a bit more interesting. Can we see the output of lsblk?

silke · Jul 17, 2024

Also, can we get the output of: sudo env CBM_DEBUG=1 clr-boot-manager update?

Mminh · Jul 17, 2024

Have you tried this? https://help.getsol.us/docs/user/troubleshooting/boot-rescue

After chroot /target, try to update to latest sudo eopkg up, which should have a fix for nvidia-*
Not sure if you need to run this sudo usysconf run -f but just do it for sure, then exist (chroot mode), umount and restart. This is how I recovered my PC

Matt_Nico · Jul 18, 2024

infinitymdm silke
Here is the output of lsblk nvme0n1p1 would be my efi partition

sda           8:0    0   5.5T  0 disk 
├─sda1        8:1    0   5.4T  0 part /mnt/quaternary
└─sda2        8:2    0  94.1G  0 part 
sdb           8:16   0 465.8G  0 disk 
├─sdb1        8:17   0   499M  0 part 
├─sdb2        8:18   0   100M  0 part 
├─sdb3        8:19   0    16M  0 part 
└─sdb4        8:20   0 195.4G  0 part 
sdc           8:32   0 931.5G  0 disk 
└─sdc1        8:33   0 931.5G  0 part /mnt/secondary
sdd           8:48   0   3.6T  0 disk 
└─sdd3        8:51   0   3.6T  0 part /mnt/tertiary
sde           8:64   1  14.5G  0 disk 
└─sde1        8:65   1  14.5G  0 part 
zram0       252:0    0     8G  0 disk [SWAP]
nvme0n1     259:0    0 931.5G  0 disk 
├─nvme0n1p1 259:1    0   512M  0 part 
├─nvme0n1p2 259:2    0  70.8G  0 part /
├─nvme0n1p3 259:3    0 161.4G  0 part /home
├─nvme0n1p4 259:4    0  29.8G  0 part [SWAP]
└─nvme0n1p5 259:5    0 668.9G  0 part /mnt/fast storage

and here is the output of sudo env CBM_DEBUG=1 clr-boot-manager update

[DEBUG] cbm (../src/cli/cli.c:L142): No such file: //etc/kernel/update_efi_vars
[INFO] cbm (../src/bootman/bootman.c:L787): Current running kernel: 6.9.8-294.current
[INFO] cbm (../src/bootman/sysconfig.c:L179): Discovered UEFI ESP: /dev/disk/by-partuuid/e9fc2609-be10-4546-ab1b-f7beebb9167e
[INFO] cbm (../src/bootman/sysconfig.c:L256): Fully resolved boot device: /dev/nvme0n1p1
[DEBUG] cbm (../src/bootman/bootman.c:L141): shim-systemd caps: 0x26, wanted: 0x26
[DEBUG] cbm (../src/bootman/bootman.c:L156): UEFI boot now selected (shim-systemd)
[INFO] cbm (../src/bootman/bootman.c:L807): path ///etc/kernel/initrd.d does not exist
[INFO] cbm (../src/bootman/bootman.c:L807): path ///usr/lib/initrd.d does not exist
[INFO] cbm (../src/bootman/bootman.c:L503): Checking for mounted boot dir
[INFO] cbm (../src/bootman/bootman.c:L555): Mounting boot device /dev/nvme0n1p1 at /boot
[SUCCESS] cbm (../src/bootman/bootman.c:L568): /dev/nvme0n1p1 successfully mounted at /boot
[DEBUG] cbm (../src/bootman/update.c:L164): Now beginning update_native
[DEBUG] cbm (../src/bootman/update.c:L173): update_native: 1 available kernels
[DEBUG] cbm (../src/bootman/update.c:L191): update_native: Running kernel is (current) ///usr/lib/kernel/com.solus-project.current.6.9.8-294
[SUCCESS] cbm (../src/bootman/update.c:L205): update_native: Bootloader updated
[DEBUG] cbm (../src/bootman/kernel.c:L617): installing extra initrd: /usr/lib64/kernel/initrd-com.solus-project.current.6.9.8-294.nvidia
[DEBUG] cbm (../src/bootloaders/systemd-class.c:L219): adding extra initrd to bootloader: initrd-com.solus-project.current.6.9.8-294.nvidia
[SUCCESS] cbm (../src/bootman/update.c:L220): update_native: Repaired running kernel ///usr/lib/kernel/com.solus-project.current.6.9.8-294
[DEBUG] cbm (../src/bootman/update.c:L230): update_native: Checking kernels for type current
[INFO] cbm (../src/bootman/update.c:L243): update_native: Default kernel for type current is ///usr/lib/kernel/com.solus-project.current.6.9.8-294
[DEBUG] cbm (../src/bootman/kernel.c:L617): installing extra initrd: /usr/lib64/kernel/initrd-com.solus-project.current.6.9.8-294.nvidia
[DEBUG] cbm (../src/bootloaders/systemd-class.c:L219): adding extra initrd to bootloader: initrd-com.solus-project.current.6.9.8-294.nvidia
[SUCCESS] cbm (../src/bootman/update.c:L255): update_native: Installed tip for current: ///usr/lib/kernel/com.solus-project.current.6.9.8-294
[DEBUG] cbm (../src/bootman/kernel.c:L617): installing extra initrd: /usr/lib64/kernel/initrd-com.solus-project.current.6.9.8-294.nvidia
[DEBUG] cbm (../src/bootloaders/systemd-class.c:L219): adding extra initrd to bootloader: initrd-com.solus-project.current.6.9.8-294.nvidia
[SUCCESS] cbm (../src/bootman/update.c:L269): update_native: Installed last_good kernel (current) (///usr/lib/kernel/com.solus-project.current.6.9.8-294)
[DEBUG] cbm (../src/bootman/update.c:L280): update_native: Analyzing for type current: ///usr/lib/kernel/com.solus-project.current.6.9.8-294
[DEBUG] cbm (../src/bootman/update.c:L285): update_native: Skipping running kernel
[INFO] cbm (../src/bootman/bootman.c:L503): Checking for mounted boot dir
[INFO] cbm (../src/bootman/bootman.c:L510): boot_dir is already mounted: /boot
[SUCCESS] cbm (../src/bootman/update.c:L338): update_native: Default kernel for current is ///usr/lib/kernel/com.solus-project.current.6.9.8-294
[DEBUG] cbm (../src/bootman/update.c:L353): No kernel removals found
[INFO] cbm (../src/bootman/bootman.c:L469): Attempting umount of /boot
[SUCCESS] cbm (../src/bootman/bootman.c:L473): Unmounted boot directory

Which appears to work as intended?

Matt_Nico · Jul 18, 2024

minh I was about to chroot and try this boot rescue when my computer booted as normal unexpectedly. I am not sure if it would still be worthwhile to chroot in, shouldn't I be able to do everything from inside my system now? If it's still worthwhile lmk and I will give it a shot!

Also weirdly I haven't been having any issues with my nvidia card since the initial error with lightdm. I have tested out a few different games as well and the system is definitely running on the gpu rather than internal graphics as performance is as expected.

ReillyBrogan · Jul 18, 2024

We pushed a hotfix for the nvidia driver, if you installed that it should be working (assuming it's the same issue). We're still trying to figure out the root cause of the issue.

silke · Jul 18, 2024

Which appears to work as intended?

Yep, strange that it sometimes complains about /dev/nvme0n1p1 not existing.

Matt_Nico · Jul 18, 2024

ReillyBrogan If I try to update my system now it appears everything is up to date. The issue doesn't appear to be with the nvidia driver anymore, but clr-boot-manager is still giving me errors.

For now it seem I am still able to restart my computer and use it as normal. I believe it is able to find the efi partition when actually booting since my system starts, so I am not sure why it cannot properly detect it once the system is up and running.

Matt_Nico · Jul 18, 2024

silke It really is bizarre. I was also having a very similar issue with my laptop today when I went to start it, except it was unable to detect my /home partition and so was only reaching the terminal. From the terminal if I rebooted it one or two times it would catch the partition and start as normal. But again if I turned it off I would risk it "losing" the partition again.

Incredibly odd that it is happening across both my main solus devices. The laptop is a t480s without any integrated graphics for what it is worth so nvidia should most definitely not be playing a role on my laptop.

*on Friday I believe I will be able to hop into the matrix at some point as I have more time available. I just need to sign up for it still.

Matt_Nico · Jul 18, 2024

Matt_Nico I am wondering if it may be something to do with these drives being nvme devices. Both the drive on my laptop and desktop with my solus install are fast nvme drives, this is purely conjecture but I wonder if the speed of these drives may be causing the issue? Could things be moving faster than clr-boot-manager or eopkg can keep up? That would explain why the issue is intermittent, as the system may be able to grab the information in time on some boot sequences but not on others.

infinitymdm · Jul 18, 2024

Matt_Nico I've run Solus on a variety of PCIe x4 gen 3 and gen 4 drives. The speed of the drive is probably not your problem.

silke · Jul 18, 2024

Matt_Nico I'm also using NVMe devices and have no issues. My guess is that there's something weird going on. You can check the kernel logs (journalctl -k) for any suspicious information.

You could try updating the firmware using fwupd (eopkg install fwupd). Make sure you have a good backup beforehand though (I haven't seen it brick a system yet, but someone has to be the first).

Ensure /boot is mounted:

   clr-boot-manager mount-boot

This is normally done automatically, but it can't harm to double check seeing as the partitions seem a bit flaky.

Check for updates:

   fwupdmgr refresh
   fwupgmgr get-updates

Install them:

   fwupdmgr update

Matt_Nico · Jul 18, 2024

silke here is the output of journalctl -k to me it didn't look like there was anything glaringly wrong but I am definitely out of my depth here.

Jul 18 14:09:59 solus kernel: Command line: initrd=\EFI\com.solus-project\initrd-com.solus-project.current.6.9.8-294 initrd=>
Jul 18 14:09:59 solus kernel: BIOS-provided physical RAM map:
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000000000000-0x0000000000057fff] usable
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000000058000-0x0000000000058fff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000000059000-0x000000000009dfff] usable
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000000009e000-0x00000000000fffff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000000100000-0x000000003fffffff] usable
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000040000000-0x00000000403fffff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000040400000-0x0000000069bd7fff] usable
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000069bd8000-0x0000000069bd8fff] ACPI NVS
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000069bd9000-0x0000000069bd9fff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x0000000069bda000-0x000000007b1befff] usable
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007b1bf000-0x000000007b68efff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007b68f000-0x000000007b6fefff] ACPI data
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007b6ff000-0x000000007bb2efff] ACPI NVS
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007bb2f000-0x000000007cffefff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007cfff000-0x000000007cffffff] usable
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x000000007d000000-0x000000007fffffff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved
Jul 18 14:09:59 solus kernel: BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
lines 1-24

When I am home later I would be happy to try to update the firmware. I have my important files saved on a separate drive but have not done a proper backup. Would something like this guide be acceptable to create a backup?

iirc this installation was done years ago with a Solus 4.2 or 4.3 iso, would that point to a firmware issue?

Matt_Nico · Jul 19, 2024

Matt_Nico silke I didn't end up going through with the fwupdmgr update commands as of yet as I am not sure if it will actually do anything given the output of fwupdmgr get-updates command.

matt@matt-solus-desktop ~ $ fwupdmgr get-updates
WARNING: This package has not been validated, it may not work properly.
Devices with no available firmware updates: 
 • SSD 850 EVO 500GB
 • SSD 860 EVO 1TB
 • WD BLACK SN750 SE 1TB
 • WDC WD40EZRZ-75GXCB0
 • WDC WD60EZAZ-00SF3B0
No updatable devices

It appears as though all my drives are up to date firmware wise so would it be worth it to go through and do the fwupdmgr update command?

Still running into these issues on both my laptop and my desktop.