AMD ROCm GPU compute driver/kernel for RHEL 9.1 not working on AlmaLinux 9.1

I installed the AMD ROCm GPU compute stack in order to run compute applications on AMD GPUs. I followed the installation instructions for RHEL 9.1 (using the install-script approach) and the installation went well. After rebooting, however, I end up with a white screen and the message “Unfortunately there’s a problem which cannot be fixed by the system. Please contact your systems administrator”, and the boot process stops there.

RHEL 9.1 Installation instructions can be found here: AMD Documentation - Portal

After booting an older kernel (AlmaLinux 9.0), uninstalling ROCm, and rebooting into AlmaLinux 9.1, version 9.1 works normally again.

I can only assume that the ROCm kernel module package “amdgpu-dkms-1:5.18.13.50401-1520974.el9.noarch” causes the problem.

Does anyone have experience with the installation of the ROCm GPU compute stack on AlmaLinux 9.1? Is there a fix?

seems like a massively complicated document which just boils down to:

cat << EOF > /etc/yum.repos.d/amdgpu.repo
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/latest/rhel/9.1/main/x86_64
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF

cat << EOF > /etc/yum.repos.d/rocm.repo
[rocm]
name=rocm
baseurl=https://repo.radeon.com/rocm/rhel9/rpm
enabled=1
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF

dnf install kernel-headers kernel-devel dkms amdgpu-dkms rocm-hip-sdk rocm-opencl-sdk 

i’d never use an installer binary; they’re usually pretty poorly written. stick to yum repos.

The installation worked and I can boot the system with ROCm installed, but the screen resolution is now fixed at 1024 x 768 and no GPU is active or used, neither the integrated nor the dedicated one. Firefox reports Compositor: WebRender (Software) and WebGL 2 Driver Renderer: Mesa/X.org – llvmpipe (LLVM 14.0.6, 256 bits).

]$ xrandr --listproviders

Providers: number : 0

]$ xrandr

xrandr: Failed to get size of gamma for output default
Screen 0: minimum 1024 x 768, current 1024 x 768, maximum 1024 x 768
default connected primary 1024x768+0+0 0mm x 0mm
1024x768 76.00*

I tried to fix the resolution following these instructions: 5 easy Steps to change the display resolution in Linux – GetLabsDone

but ran into this problem “xrandr: Failed to get size of gamma for output default”

Also:

]$ rocminfo

ROCk module is NOT loaded, possibly no GPU devices

]$ clinfo

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3513.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback

Platform Name: AMD Accelerated Parallel Processing
Number of devices:

Another hint led me to check whether an AMD GPU driver is loaded. It is not.

Without ROCm, on the 9.0 kernel, amdgpu is active:

]$ lsmod | grep -Ei 'amd|ati|radeon'

edac_mce_amd 45056 0
kvm_amd 147456 0
kvm 1060864 1 kvm_amd
amd_pmc 28672 0
amdgpu 7393280 21
drm_ttm_helper 16384 1 amdgpu
ttm 86016 2 amdgpu,drm_ttm_helper
iommu_v2 24576 1 amdgpu
gpu_sched 49152 1 amdgpu
i2c_algo_bit 16384 1 amdgpu
drm_kms_helper 311296 1 amdgpu
drm 634880 15 gpu_sched,drm_kms_helper,amdgpu,drm_ttm_helper,ttm
ccp 114688 1 kvm_amd
amd_sfh 24576 0

With ROCm, on the 9.1 kernel, amdgpu is not active:

]$ lsmod | grep -Ei 'amd|ati|radeon'
snd_sof_amd_renoir 16384 0
snd_sof_amd_acp 40960 1 snd_sof_amd_renoir
snd_sof_pci 24576 1 snd_sof_amd_renoir
snd_sof 196608 3 snd_sof_amd_acp,snd_sof_pci,snd_sof_amd_renoir
edac_mce_amd 45056 0
kvm_amd 155648 0
snd_pcm 151552 11 snd_sof_amd_acp,snd_hda_codec_hdmi,snd_pci_acp6x,snd_hda_intel,snd_hda_codec,snd_sof,snd_compress,snd_soc_core,snd_sof_utils,snd_hda_core,snd_acp3x_pdm_dma
kvm 1105920 1 kvm_amd
snd_acp_config 16384 2 snd_rn_pci_acp3x,snd_sof_amd_renoir
snd_soc_acpi 16384 2 snd_acp_config,snd_sof_amd_renoir
amd_pmc 28672 0
ccp 118784 1 kvm_amd
amd_sfh 32768 0

Do you have experience with fixing this “driver” problem?

what does “rpm -qa | grep kernel” show? i wonder if you’ve only installed the headers for the 9.0 kernel with the uname -r instructions.
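
a rough sketch of that check, assuming the usual RHEL-style naming where the kernel-devel package carries the same version string as "uname -r". the helper name devel_pkg_for is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: build the kernel-devel package name that should
# match a given kernel release (assumes RHEL-style package naming).
devel_pkg_for() {
    printf 'kernel-devel-%s\n' "$1"
}

running="$(uname -r)"
pkg="$(devel_pkg_for "$running")"

if command -v rpm >/dev/null 2>&1; then
    # rpm -q exits non-zero when the package is not installed.
    if rpm -q "$pkg" >/dev/null 2>&1; then
        echo "OK: $pkg is installed"
    else
        echo "Missing: $pkg (try: sudo dnf install $pkg)"
    fi
else
    echo "rpm not found; this check only applies to RPM-based systems"
fi
```

if it reports the package as missing, dkms has nothing to build the module against for the kernel you are actually running.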

don’t you have to run “dkms autoinstall” to rebuild the kernel modules for the currently running kernel?

i’ve not used dkms for years, but the arch wiki has some useful info:

https://wiki.archlinux.org/title/Dynamic_Kernel_Module_Support
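
something along these lines might show whether dkms has actually built amdgpu for the running kernel. it is only a sketch: the function name amdgpu_built_for is made up, and it assumes "dkms status" prints one line per module that starts with the module name and mentions the kernel release and the word "installed" (the exact layout differs between dkms versions):

```shell
#!/bin/sh
# Hypothetical helper: reads `dkms status` output on stdin and succeeds
# only if a line for amdgpu mentions kernel release $1 and "installed".
# The output layout is an assumption and varies between dkms versions.
amdgpu_built_for() {
    grep '^amdgpu' | grep -F -- "$1" | grep -q 'installed'
}

kver="$(uname -r)"

if command -v dkms >/dev/null 2>&1; then
    if dkms status | amdgpu_built_for "$kver"; then
        echo "amdgpu DKMS module is installed for $kver"
    else
        echo "amdgpu DKMS module is not installed for $kver"
        echo "try: sudo dkms autoinstall -k $kver"
    fi
else
    echo "dkms not found"
fi
```

the script only prints the suggested rebuild command rather than running it, since the rebuild needs root.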

[SOLVED]

I double-checked the installation against the ROCm installation instructions (header file versions etc.) and everything was fine, so I decided to reinstall amdgpu-dkms (sudo yum reinstall amdgpu-dkms). That fixed the problem: after a reboot everything works fine and the GPUs are detected.
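
For anyone who wants to confirm the same result after the reboot, here is a minimal sketch of a check on the lsmod output. The function name amdgpu_loaded is my own invention. It matches only the first column, so lines such as drm or ttm that merely list amdgpu as a user do not count:

```shell
#!/bin/sh
# Hypothetical helper: reads lsmod output on stdin and succeeds only if
# a module named exactly "amdgpu" appears in the first column.
amdgpu_loaded() {
    awk '$1 == "amdgpu" { found = 1 } END { exit !found }'
}

if lsmod 2>/dev/null | amdgpu_loaded; then
    echo "amdgpu kernel module is loaded"
    echo "next: run rocminfo and clinfo to confirm the GPUs are listed"
else
    echo "amdgpu kernel module is NOT loaded"
fi
```

If the module is loaded but rocminfo still reports no devices, the problem is probably elsewhere in the stack rather than in the DKMS build.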