NVMe root disks and recent kernels

When 5.14.0-362.18.1.el9_3.x86_64 came out I discovered that my machine wouldn’t boot: it hung just after discovering the disks. I asked about it at the time and gathered it was something to do with using an NVMe “disk” as the root disk. I’ve just tried to update to 5.14.0-362.24.1.el9_3.x86_64 and the same problem occurs.

The device in question is a Micron/Crucial Technology [C0A9] CT500P5PSSD8

Is there a fix or workaround? Alternatively should I move /boot and / to spinning rust?

What’s the error you are getting when you try to boot?

No error, just a hang. The brief conversation I had back on 1 Dec was:

Martin Rushton[12:02]

Any known problems with kernel 5.14.0-362.18.1.el9_3.x86_64? I did an upgrade last night from 5.14.0-362.13.1.el9_3.x86_64 and the bootstrap now hangs just after enumerating the disks but before starting them. No disks => no log file unfortunately. The system boots OK with the previous kernel, so it can’t be any of the other updates. I have a running system, so not critical, if there is no simple fix I’ll just remove the most recent kernel. Thx.

Andrew Lukoshko[15:29]

@Martin Rushton do you have NVMe disks? Which ones?

Martin Rushton[16:01]

Yes. CT500P5PSSD8 from Micron/Crucial Technology, version P7CR403. Three partitions: /boot/efi, /boot and almalinux-root → /. Swap is to spinning rust, as are all other filesystems.

After which I removed the “18” kernel and the entry in /boot/loader/entries. After the latest update it hung in the same place, so I’ve disabled the “24” kernel to get a usable system.
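As an alternative to removing a kernel or editing /boot/loader/entries by hand, the default boot entry can be pinned to the kernel that works. A minimal sketch, assuming grubby is installed and the commands are run as root; the helper names vmlinuz_path and pin_kernel are made up for illustration and are not part of any tool:

```shell
# Sketch only: pin the known-good kernel as the default boot entry.
# Assumes grubby is available and that this runs as root.

# Build the kernel image path that grubby expects from a version string.
vmlinuz_path() {
    printf '/boot/vmlinuz-%s\n' "$1"
}

# Make the given kernel version the default, then print the resulting default.
pin_kernel() {
    grubby --set-default="$(vmlinuz_path "$1")" && grubby --default-kernel
}

# Usage (as root), with the version that boots on this machine:
#   pin_kernel 5.14.0-362.13.1.el9_3.x86_64
```

A later kernel update may move the default again, so it is worth re-checking with grubby --default-kernel after each update.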

A successful boot includes:

Apr 6 11:01:28 arun kernel: ata37: SATA max UDMA/133 abar m8192@0x73380000 port 0x73380f00 irq 134
Apr 6 11:01:28 arun kernel: ata38: SATA max UDMA/133 abar m8192@0x73380000 port 0x73380f80 irq 134
Apr 6 11:01:28 arun kernel: ata39: SATA max UDMA/133 abar m8192@0x73380000 port 0x73381000 irq 134
Apr 6 11:01:28 arun kernel: ata40: SATA max UDMA/133 abar m8192@0x73380000 port 0x73381080 irq 134
Apr 6 11:01:28 arun kernel: nvme nvme0: 16/0/0 default/read/poll queues
Apr 6 11:01:28 arun kernel: nvme0n1: p1 p2 p3
Apr 6 11:01:28 arun kernel: usb 1-7: new high-speed USB device number 5 using xhci_hcd
Apr 6 11:01:28 arun kernel: i915 0000:00:02.0: enabling device (0006 -> 0007)

but when it fails the two lines nvme nvme0: 16/0/0 default/read/poll queues and nvme0n1: p1 p2 p3 are missing and it hangs after finding the USB device.
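The check described above (are the two nvme lines present or not?) can be scripted for any kernel log that did get saved. A rough sketch, assuming the controller is nvme0 as in the logs above; the function name nvme_init_seen is made up for illustration. It cannot help with a boot that hangs before the root disk is mounted, since nothing is written to the journal in that case.

```shell
# Return success only if the kernel log on stdin contains both the NVMe queue
# line and the partition scan line that appear in a successful boot.
nvme_init_seen() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'nvme nvme0: .*default/read/poll queues' &&
        printf '%s\n' "$log" | grep -q 'nvme0n1: p1'
}

# Usage, against the kernel messages of the current boot:
#   journalctl -k -b 0 | nvme_init_seen && echo "nvme0 initialised"
```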

If it would help I could re-enable the “24” kernel and take a photo at the point it hangs, but I know photos are discouraged!

I have the feeling that the NVMe controller or the NVMe drive are not quite OK. Per se, the fact that you’re using NVMe has no bearing. I am using 9.3 on two systems (laptop and mini-PC), each one having two NVMe devices.

Laptop:

[ludditus@AcerPasCher ~]$ lsblk -i
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
nvme1n1     259:0    0 238.5G  0 disk 
|-nvme1n1p1 259:1    0   512M  0 part /boot/efi
|-nvme1n1p2 259:2    0    16G  0 part [SWAP]
`-nvme1n1p3 259:3    0   222G  0 part /
nvme0n1     259:4    0 931.5G  0 disk 
`-nvme0n1p1 259:5    0 931.5G  0 part /home

Actually, the mini-PC (which has a larger second NVMe of 2TB, but otherwise is identically partitioned, with the 2nd NVMe also used as /home, because I like to keep my life simple) is using the official 5.14 kernel, whereas on the laptop I’m using kernel-lt from ELRepo. Still, I can boot the 5.14 kernel just fine, but I won’t have WiFi and BT with it.

If you can afford to disable Secure Boot (kernel-lt and kernel-ml from ELRepo are not signed), I suggest you try kernel-lt, which is 6.1. If you need an installable Live ISO, you can use Ventoy and put my ISO on it, but I have to warn you that this is a KDE-only ISO:

Maybe a newer kernel will give more meaningful messages. I still suspect a hardware issue.

Oh, and I noticed you had several booting issues, one of which was reported here: Cannot boot 9.3 install USB
I’m afraid that all the talk about NVMe is bollocks. NVMe devices are perfectly supported.

OTOH, I’m not sure about your LVM. I don’t even know how to use LVMs, because, as I said, I want to keep my life simple.

The booting issue you linked to was the same as this issue; I’d added a brief summary of my issue with NVMe to a similar problem a user was having with a USB. I’m only assuming that NVMe is the problem because Andrew Lukoshko (who is a real Alma expert) suggested it. It boots quite happily with 5.14.0-362.13.1.el9_3.x86_64; the problem arrived with 5.14.0-362.18.1.el9_3.x86_64.

The LVM configuration is fine, it’s basically what the installer did, but I’ve removed the swap partition and use spinning rust for that. I’ve used LVMs for maybe 10 years on small machines like this or for odd tasks on the local disks of a medium-sized cluster. Once you get the hang of LVM it does make life simpler, particularly if you find yourself running out of space in a partition.

I’m happy with 362.13 pro tem, I’ll maybe have another go to see if I can get 362.24 working, failing that I might try booting off a live USB and seeing if I can get some error messages that way.

The Alma live images are from last year, so probably with a kernel older than 362.18.
When 9.4 is released, there will be images newer than any 362, so they can tell something (new).

Ah, thanks for that. It saves me wasting time.

CentOS Stream 9 (install) images are dated 2024-04-01, so they should have a “future” kernel.
If that installer fails to see your SSD, then it is possible to report a CS bug to Red Hat.


Edit:

I’d say that unnecessary bitmap images are discouraged.
If one can copy text, then copy text, but if one can’t …

Alas, boot tends to spit many pages of text and the virtual consoles in el9 no longer have buffer to scroll back in. It has always been hard to spot the initial, real, error from the flow and now it is even more so.

Sorry for the delay in getting back, I’ve been having a bit of a problem with health issues. I reset the entry in /boot/loader/entries so that kernel 5.14.0-362.24.1.el9_3.x86_64 would be used and the following appeared:


at this point the system hung for over half an hour. With “24” disabled, “13” kernel boots happily and the system runs with the otherwise latest version. For comparison, this morning’s boot shows:

Apr 25 09:19:24 arun kernel: ata32: SATA max UDMA/133 abar m8192@0x73380000 port 0x73380c80 irq 134
Apr 25 09:19:24 arun kernel: ata33: SATA max UDMA/133 abar m8192@0x73380000 port 0x73380d00 irq 134
Apr 25 09:19:24 arun kernel: ata34: SATA max UDMA/133 abar m8192@0x73380000 port 0x73380d80 irq 134
Apr 25 09:19:24 arun kernel: ata35: SATA max UDMA/133 abar m8192@0x73380000 port 0x73380e00 irq 134
Apr 25 09:19:24 arun kernel: ata36: SATA max UDMA/133 abar m8192@0x73380000 port 0x73380e80 irq 134
Apr 25 09:19:24 arun kernel: ata37: SATA max UDMA/133 abar m8192@0x73380000 port 0x73380f00 irq 134
Apr 25 09:19:24 arun kernel: ata38: SATA max UDMA/133 abar m8192@0x73380000 port 0x73380f80 irq 134
Apr 25 09:19:24 arun kernel: ata39: SATA max UDMA/133 abar m8192@0x73380000 port 0x73381000 irq 134
Apr 25 09:19:24 arun kernel: ata40: SATA max UDMA/133 abar m8192@0x73380000 port 0x73381080 irq 134
Apr 25 09:19:24 arun kernel: nvme nvme0: 16/0/0 default/read/poll queues
Apr 25 09:19:24 arun kernel: nvme0n1: p1 p2 p3
Apr 25 09:19:24 arun kernel: usb 1-5.1: New USB device found, idVendor=0bc2, idProduct=ab52, bcdDevice= 1.00
Apr 25 09:19:24 arun kernel: usb 1-5.1: New USB device strings: Mfr=2, Product=3, SerialNumber=1
Apr 25 09:19:24 arun kernel: usb 1-5.1: Product: One Touch HDD
Apr 25 09:19:24 arun kernel: usb 1-5.1: Manufacturer: Seagate
Apr 25 09:19:24 arun kernel: usb 1-5.1: SerialNumber: 00000000NABLVP7C
Apr 25 09:19:24 arun kernel: usb 1-7: new high-speed USB device number 5 using xhci_hcd
Apr 25 09:19:24 arun kernel: i915 0000:00:02.0: enabling device (0006 -> 0007)
Apr 25 09:19:24 arun kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
Apr 25 09:19:24 arun kernel: Console: switching to colour dummy device 80x25
Apr 25 09:19:24 arun kernel: i915 0000:00:02.0: [drm] Using Transparent Hugepages
Apr 25 09:19:24 arun kernel: i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/adls_dmc_ver2_01.bin (v2.1)
Apr 25 09:19:24 arun kernel: i915 0000:00:02.0: [drm] GT0: GuC firmware i915/tgl_guc_70.bin version 70.5.1
Apr 25 09:19:24 arun kernel: i915 0000:00:02.0: [drm] GT0: HuC firmware i915/tgl_huc.bin version 7.9.3

Notice the three highlighted lines which differ from the failed startup.

tl;dr Kernels “18” and “24” fail to initialise nvme0 and hence fail to boot. Kernel “13” boots normally and runs with no reported errors.

The only kernel change that is nvme related between “13.1” and “18” is this:

If you could install and test “13.2”, it would point to that commit as the cause.

Hi Carl,

Do you know if a compiled “13.2” version is available anywhere?
Thx,
Martin

PS: To all: Just at the moment I’m having difficulty sitting at the computer, please excuse any tardy replies. Thanks. M.

I believe all packages generated from the kernel dist-git project for the version 13.2 are here in this build:

I’ll tag @alukoshko to see the actual possibility of this being a cause.

@MartinR
The story behind 0001-nvme-pci-add-BOGUS_NID-for-Intel-0a54-device.patch:

So indeed please check 5.14.0-362.13.2.el9_3.x86_64 kernel version, it’s still available in Testing repo, so:

dnf install almalinux-release-testing
dnf install kernel-5.14.0-362.13.2.el9_3.x86_64

And try to boot.
If 5.14.0-362.13.1.el9_3 works but 5.14.0-362.13.2.el9_3 doesn’t then issue is this patch.
It’s actually upstream patch from nvme-pci: add BOGUS_NID for Intel 0a54 device · torvalds/linux@5c3f406 · GitHub
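Once the test boot is done, the test kernel and the testing repo can be backed out again. A minimal sketch, assuming the package names from the install commands above; the helpers run and remove_test_kernel and the DRY_RUN variable are made up for illustration. With DRY_RUN=1 the dnf commands are only printed, so they can be checked first.

```shell
# Run a command, or just print it when DRY_RUN=1 (handy for a first check).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Remove every kernel sub-package of the given version, then the testing repo.
# dnf refuses to remove the running kernel, so boot the known-good one first.
remove_test_kernel() {
    ver=$1
    run dnf remove "kernel*-${ver}" &&
        run dnf remove almalinux-release-testing
}

# Usage (as root):
#   DRY_RUN=1 remove_test_kernel 5.14.0-362.13.2.el9_3.x86_64   # preview only
#   remove_test_kernel 5.14.0-362.13.2.el9_3.x86_64
```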

That’s the issue. I did as you requested and tried to boot the 13.2 version. At first it looked OK, the highlighted lines (above) were present. However, after half an hour I concluded that the problem was still there. I’ve removed the 13.2 kernel and the testing repo.

Have you any idea if this is specific to my particular NVMe? I notice that the MoBo offers 2 M.2 slots, one is labelled (From CPU), the other (From B760 chipset). Alternatively is the best thing to move /boot onto spinning rust?

Many thanks.
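To help answer whether this is specific to one drive, the controller's PCI IDs, model and firmware can be read from sysfs on the working kernel and compared with the device named in the patch title (Intel 0a54). A rough sketch, assuming the controller is nvme0; the function name nvme_ident and its optional second argument (an alternative sysfs root, mainly for testing) are made up for illustration:

```shell
# Print PCI vendor/device IDs plus model and firmware for one NVMe controller.
# $1: controller name (default nvme0); $2: sysfs root (default /sys/class/nvme).
nvme_ident() {
    ctrl=${1:-nvme0}
    base=${2:-/sys/class/nvme}/$ctrl
    # The PCI IDs live under the controller's "device" link.
    printf 'vendor=%s device=%s\n' \
        "$(cat "$base/device/vendor")" "$(cat "$base/device/device")"
    # Model and firmware strings are space-padded in sysfs, so trim them.
    printf 'model=%s firmware=%s\n' \
        "$(sed 's/ *$//' "$base/model")" "$(sed 's/ *$//' "$base/firmware_rev")"
}

# Usage:
#   nvme_ident nvme0
```

Posting that output alongside the failing kernel version gives others with the same drive or firmware something concrete to compare against.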