Older RAID card throwing tons of read/write errors in Alma 8.6

Hey all -

We have an older Sun Microsystems SunFire x4250 server with one of Sun’s STK RAID INT controllers, a rebadged Adaptec ASR-5805 (Cougar) RAID HBA that uses the aacraid driver, and I’m getting all kinds of errors with it in Alma Linux 8.6. I have hardware-tested both the server’s memory (it passed a weekend of memtest86+ testing, 24 passes in total) and all 16 of our Sun-branded HGST 900 GB 10k RPM SAS drives (using Seagate’s SeaTools Bootable, all 16 passed S.M.A.R.T. checks as well as short and long drive self-tests), and everything passes with flying colors across multiple runs of every test.

I nonetheless am getting all kinds of errors on my Alma Linux 8.6 install with the dd-aacraid-1.2.1-7.el8_6.elrepo.iso driver disk obtained from ELRepo’s /linux/dud/el8/x86_64 directory. I don’t know if that’s the wrong driver or what, but I get basically one session of usable Linux, and once I reboot, I get all kinds of errors - see the screenshot I posted on Imgur.
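(For anyone unfamiliar with driver-disk installs: the ISO gets handed to the installer via the inst.dd= boot option, roughly like the line below. The DUD volume label is just an example - use whatever your USB stick is actually labeled.)

inst.dd=hd:LABEL=DUD:/dd-aacraid-1.2.1-7.el8_6.elrepo.iso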

Some of the errors I’m getting (right at the login screen) in text format:

[  899.895139] print_req_error: 75 callbacks suppressed
[  899.895144] blk_update_request: I/O error, dev sda, sector 136364232 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
[  899.898353] EXT4-fs warning: 5 callbacks suppressed
[  899.898538] EXT4-fs warning (device dm-0): htree_dirblock_to_tree:984: inode #3409691: lblock 0: comm systemd-tmpfile: error -5 reading directory block
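(If anyone wants the full context, the rest of these messages can be pulled out of the kernel ring buffer after boot - sda is the affected device here:)

# dmesg | grep -E 'blk_update_request|EXT4-fs|aacraid'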

Obviously (from the messages above), we’re using EXT4 as our filesystem rather than the default XFS - we picked it for better performance. I’m going to try a standard XFS install and see if maybe the aacraid driver plays nicer with that.

/dev/sda is a simple two-drive RAID-1 array. I have now tested four different drives in that array in varying combinations, and I get these errors with all of them. I even swapped our Sun STK RAID INT (Cougar) RAID HBA for another one of the same make/model and got the same errors. Neither the Adaptec RAID BIOS nor SeaTools reports any problems with the arrays, so I’m increasingly disinclined to believe that any of our drives are bad. I’m starting to think this is some kind of configuration problem, or some kind of driver problem with the dd-aacraid-1.2.1-7.el8_6.elrepo.iso aacraid driver that I’m using. :stuck_out_tongue:

Any ideas?

Testing with the native XFS instead of EXT4 appears to have yielded some stability, but the lack of a native arcconf client is hard to ignore for a production server. I’d want to know when/if a drive fails, and without an OS-level utility, I don’t think that’s possible. :confused:
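(One possible stopgap, assuming smartmontools’ aacraid passthrough works with this particular controller: smartctl can query drives behind an aacraid HBA directly. The 0,0,0 below is a host,bus,ID tuple and just an example - it would need to match your setup.)

# dnf install smartmontools
# smartctl -a -d aacraid,0,0,0 /dev/sda

smartd could then watch the same devices via /etc/smartd.conf for at least basic failure alerting.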

The driver version in the elrepo kmod package is 1.2.1[50877]-custom. There is just one update to that version in the upstream (kernel.org) kernel: version 1.2.1[50983]-custom. I have no idea whether the newer version has a fix for the issue you are having, but here’s what has changed:

https://lore.kernel.org/all/1571120524-6037-8-git-send-email-balsundar.p@microsemi.com/
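To confirm which aacraid build is actually loaded before and after any change, you can check the module metadata:

# modinfo aacraid | grep version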

You can try the latest upstream kernel by installing ELRepo’s kernel-ml.

# dnf install elrepo-release
# dnf --enablerepo=elrepo-kernel install kernel-ml

Boot into the newly installed kernel-ml and see how things go.
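If the new kernel doesn’t become the default boot entry on its own, grubby can set it - the vmlinuz path below is a placeholder for whatever version actually gets installed:

# grubby --info=ALL | grep ^kernel
# grubby --set-default /boot/vmlinuz-<version>.elrepo.x86_64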

Sadly, even moving to XFS didn’t fix it. It seemed to for a while, but then the same error messages kept popping up. I suppose I could try that driver, but I think we’re paddling upstream against a river that’s headed for a waterfall. These servers are 14 years old; we can’t hope to keep them running forever. I think we’re going to either get a newer controller (pls no) or get new servers (:smiley: :smiley: :smiley:) with more modern hardware all around.

Thanks for pointing me in that direction, though! I might still try to get these controllers working in my homelab. :stuck_out_tongue: