I have an HP DL380 Gen 9 with a Linux software RAID (md) RAID5 array built
from six INTEL SSDPE2MX020T4 devices. That RAID device backs a volume group
containing a couple of logical volumes with XFS filesystems, which are used
for VM storage. Twice in the past two months the array has become mostly
unresponsive, with hung-task reports like these in the kernel log:

May 08 03:33:21 host kernel: INFO: task worker:1798511 blocked for more than 120 seconds.
May 08 03:33:21 host kernel: Not tainted 4.18.0-348.23.1.el8_5.x86_64 #1
May 08 03:33:21 host kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 08 03:33:21 host kernel: task:worker state:D stack: 0 pid:1798511 ppid: 1 flags:0x000043a0
May 08 03:33:21 host kernel: Call Trace:
May 08 03:33:21 host kernel: __schedule+0x2bd/0x760
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: schedule+0x37/0xa0
May 08 03:33:21 host kernel: md_bitmap_startwrite+0x16f/0x1e0
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: add_stripe_bio+0x4a3/0x7c0 [raid456]
May 08 03:33:21 host kernel: raid5_make_request+0x1bf/0xb60 [raid456]
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: ? blk_queue_split+0xd4/0x660
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: md_handle_request+0x119/0x190
May 08 03:33:21 host kernel: md_make_request+0x84/0x160
May 08 03:33:21 host kernel: generic_make_request+0x25b/0x350
May 08 03:33:21 host kernel: submit_bio+0x3c/0x160
May 08 03:33:21 host kernel: iomap_submit_ioend.isra.38+0x4a/0x70
May 08 03:33:21 host kernel: iomap_writepage_map+0x422/0x670
May 08 03:33:21 host kernel: write_cache_pages+0x197/0x420
May 08 03:33:21 host kernel: ? iomap_invalidatepage+0xe0/0xe0
May 08 03:33:21 host kernel: iomap_writepages+0x1c/0x40
May 08 03:33:21 host kernel: xfs_vm_writepages+0x64/0x90 [xfs]
May 08 03:33:21 host kernel: do_writepages+0x41/0xd0
May 08 03:33:21 host kernel: __filemap_fdatawrite_range+0xcb/0x100
May 08 03:33:21 host kernel: file_write_and_wait_range+0x4c/0xa0
May 08 03:33:21 host kernel: xfs_file_fsync+0x69/0x200 [xfs]
May 08 03:33:21 host kernel: do_fsync+0x38/0x70
May 08 03:33:21 host kernel: __x64_sys_fdatasync+0x13/0x20
May 08 03:33:21 host kernel: do_syscall_64+0x5b/0x1a0
May 08 03:33:21 host kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
May 08 03:33:21 host kernel: RIP: 0033:0x7f969efb858f
May 08 03:33:21 host kernel: Code: Unable to access opcode bytes at RIP 0x7f969efb8565.
May 08 03:33:21 host kernel: RSP: 002b:00007f94b3ffe6b0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
May 08 03:33:21 host kernel: RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007f969efb858f
May 08 03:33:21 host kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000e
May 08 03:33:21 host kernel: RBP: 0000563f940b5b20 R08: 0000000000000000 R09: 0000000032f01b0c
May 08 03:33:21 host kernel: R10: 0000000e171e5000 R11: 0000000000000293 R12: 0000563f92a73bb4
May 08 03:33:21 host kernel: R13: 0000563f940b5b88 R14: 0000563f94097eb0 R15: 00007f94b3ffe800
May 08 03:33:21 host kernel: INFO: task worker:1799573 blocked for more than 120 seconds.
May 08 03:33:21 host kernel: Not tainted 4.18.0-348.23.1.el8_5.x86_64 #1
May 08 03:33:21 host kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 08 03:33:21 host kernel: task:worker state:D stack: 0 pid:1799573 ppid: 1 flags:0x000043a0
May 08 03:33:21 host kernel: Call Trace:
May 08 03:33:21 host kernel: __schedule+0x2bd/0x760
May 08 03:33:21 host kernel: schedule+0x37/0xa0
May 08 03:33:21 host kernel: io_schedule+0x12/0x40
May 08 03:33:21 host kernel: wait_on_page_bit+0x137/0x230
May 08 03:33:21 host kernel: ? file_fdatawait_range+0x20/0x20
May 08 03:33:21 host kernel: __filemap_fdatawait_range+0x88/0xe0
May 08 03:33:21 host kernel: file_write_and_wait_range+0x76/0xa0
May 08 03:33:21 host kernel: xfs_file_fsync+0x69/0x200 [xfs]
May 08 03:33:21 host kernel: do_fsync+0x38/0x70
May 08 03:33:21 host kernel: __x64_sys_fdatasync+0x13/0x20
May 08 03:33:21 host kernel: do_syscall_64+0x5b/0x1a0
May 08 03:33:21 host kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
May 08 03:33:21 host kernel: RIP: 0033:0x7f20c514c58f
May 08 03:33:21 host kernel: Code: Unable to access opcode bytes at RIP 0x7f20c514c565.
May 08 03:33:21 host kernel: RSP: 002b:00007f1ef4ff86b0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
May 08 03:33:21 host kernel: RAX: ffffffffffffffda RBX: 000000000000001b RCX: 00007f20c514c58f
May 08 03:33:21 host kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001b
May 08 03:33:21 host kernel: RBP: 00005594bed1f120 R08: 0000000000000000 R09: 00000000ffffffff
May 08 03:33:21 host kernel: R10: 00007f1ef4ff86a0 R11: 0000000000000293 R12: 00005594bd72ebb4
May 08 03:33:21 host kernel: R13: 00005594bed1f188 R14: 00005594bed31c30 R15: 00007f1ef4ff8800
May 08 03:33:21 host kernel: INFO: task worker:871154 blocked for more than 120 seconds.
May 08 03:33:21 host kernel: Not tainted 4.18.0-348.23.1.el8_5.x86_64 #1
May 08 03:33:21 host kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 08 03:33:21 host kernel: task:worker state:D stack: 0 pid:871154 ppid: 1 flags:0x000043a0
May 08 03:33:21 host kernel: Call Trace:
May 08 03:33:21 host kernel: __schedule+0x2bd/0x760
May 08 03:33:21 host kernel: schedule+0x37/0xa0
May 08 03:33:21 host kernel: io_schedule+0x12/0x40
May 08 03:33:21 host kernel: wait_on_page_bit+0x137/0x230
May 08 03:33:21 host kernel: ? file_fdatawait_range+0x20/0x20
May 08 03:33:21 host kernel: __filemap_fdatawait_range+0x88/0xe0
May 08 03:33:21 host kernel: file_write_and_wait_range+0x76/0xa0
May 08 03:33:21 host kernel: xfs_file_fsync+0x69/0x200 [xfs]
May 08 03:33:21 host kernel: do_fsync+0x38/0x70
May 08 03:33:21 host kernel: __x64_sys_fdatasync+0x13/0x20
May 08 03:33:21 host kernel: do_syscall_64+0x5b/0x1a0
May 08 03:33:21 host kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
May 08 03:33:21 host kernel: RIP: 0033:0x7f13d27fd58f
May 08 03:33:21 host kernel: Code: Unable to access opcode bytes at RIP 0x7f13d27fd565.
May 08 03:33:21 host kernel: RSP: 002b:00007f0f697f96b0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
May 08 03:33:21 host kernel: RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007f13d27fd58f
May 08 03:33:21 host kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000e
May 08 03:33:21 host kernel: RBP: 00005594f48b9010 R08: 0000000000000000 R09: 00000000ffffffff
May 08 03:33:21 host kernel: R10: 00007f0f697f96a0 R11: 0000000000000293 R12: 00005594f2222bb4
May 08 03:33:21 host kernel: R13: 00005594f48b9078 R14: 00005594f4e8ee50 R15: 00007f0f697f9800
May 08 03:33:21 host kernel: INFO: task kworker/u97:2:1790841 blocked for more than 120 seconds.
May 08 03:33:21 host kernel: Not tainted 4.18.0-348.23.1.el8_5.x86_64 #1
May 08 03:33:21 host kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 08 03:33:21 host kernel: task:kworker/u97:2 state:D stack: 0 pid:1790841 ppid: 2 flags:0x80004080
May 08 03:33:21 host kernel: Workqueue: writeback wb_workfn (flush-253:3)
May 08 03:33:21 host kernel: Call Trace:
May 08 03:33:21 host kernel: __schedule+0x2bd/0x760
May 08 03:33:21 host kernel: ? blk_flush_plug_list+0xc2/0x100
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: schedule+0x37/0xa0
May 08 03:33:21 host kernel: md_bitmap_startwrite+0x16f/0x1e0
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: add_stripe_bio+0x4a3/0x7c0 [raid456]
May 08 03:33:21 host kernel: raid5_make_request+0x1bf/0xb60 [raid456]
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: ? blk_queue_split+0xd4/0x660
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: md_handle_request+0x119/0x190
May 08 03:33:21 host kernel: md_make_request+0x84/0x160
May 08 03:33:21 host kernel: generic_make_request+0x25b/0x350
May 08 03:33:21 host kernel: submit_bio+0x3c/0x160
May 08 03:33:21 host kernel: iomap_submit_ioend.isra.38+0x4a/0x70
May 08 03:33:21 host kernel: iomap_writepage_map+0x422/0x670
May 08 03:33:21 host kernel: write_cache_pages+0x197/0x420
May 08 03:33:21 host kernel: ? iomap_invalidatepage+0xe0/0xe0
May 08 03:33:21 host kernel: iomap_writepages+0x1c/0x40
May 08 03:33:21 host kernel: xfs_vm_writepages+0x64/0x90 [xfs]
May 08 03:33:21 host kernel: do_writepages+0x41/0xd0
May 08 03:33:21 host kernel: __writeback_single_inode+0x39/0x2f0
May 08 03:33:21 host kernel: writeback_sb_inodes+0x1e6/0x450
May 08 03:33:21 host kernel: __writeback_inodes_wb+0x5f/0xc0
May 08 03:33:21 host kernel: wb_writeback+0x25b/0x2f0
May 08 03:33:21 host kernel: wb_workfn+0x344/0x4c0
May 08 03:33:21 host kernel: ? __switch_to_asm+0x35/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x41/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x35/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x41/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x35/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x41/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x35/0x70
May 08 03:33:21 host kernel: ? __switch_to_asm+0x41/0x70
May 08 03:33:21 host kernel: process_one_work+0x1a7/0x360
May 08 03:33:21 host kernel: worker_thread+0x30/0x390
May 08 03:33:21 host kernel: ? create_worker+0x1a0/0x1a0
May 08 03:33:21 host kernel: kthread+0x116/0x130
May 08 03:33:21 host kernel: ? kthread_flush_work_fn+0x10/0x10
May 08 03:33:21 host kernel: ret_from_fork+0x35/0x40

I have another, nearly identical system that has run without trouble, though
it does not see as much I/O load as this one. Is there anything else I can
check to determine whether this is a hardware issue or a problem in the Linux
RAID (md) code?
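
In case it helps anyone skimming the traces, here is a rough Python sketch that groups the hung tasks by the function they are actually sleeping in. It is my own throwaway helper, not part of any kernel or mdadm tooling: the name `blocked_in`, the regular expressions, and the choice of which frames count as "scheduler only" are all assumptions made for illustration. It skips frames marked with `?`, since the kernel flags those as unreliable.

```python
import re
from collections import Counter

# Frames that only show the task going to sleep; skipped when looking for
# the function that is actually waiting. (Assumed list, for illustration.)
SCHED_FRAMES = {"__schedule", "schedule", "io_schedule"}

# "INFO: task <comm>:<pid> blocked for more ..." (the line may be wrapped).
TASK_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more")
# "kernel: [? ]<symbol>+0x<off>/0x<size>"
FRAME_RE = re.compile(
    r"kernel:\s+(\??)\s*([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def blocked_in(log_text):
    """Map (task, pid) to the first reliable non-scheduler frame."""
    result = {}
    current = None
    for line in log_text.splitlines():
        task = TASK_RE.search(line)
        if task:
            current = (task.group(1), int(task.group(2)))
            continue
        if current is None or current in result:
            continue
        frame = FRAME_RE.search(line)
        # Frames prefixed with "?" are unreliable stack leftovers.
        if frame and not frame.group(1) and frame.group(2) not in SCHED_FRAMES:
            result[current] = frame.group(2)
    return result


# Shortened excerpt of the log above, just to exercise the helper.
sample = """\
May 08 03:33:21 host kernel: INFO: task worker:1798511 blocked for more than
120 seconds.
May 08 03:33:21 host kernel: Call Trace:
May 08 03:33:21 host kernel: __schedule+0x2bd/0x760
May 08 03:33:21 host kernel: ? finish_wait+0x80/0x80
May 08 03:33:21 host kernel: schedule+0x37/0xa0
May 08 03:33:21 host kernel: md_bitmap_startwrite+0x16f/0x1e0
May 08 03:33:21 host kernel: INFO: task worker:1799573 blocked for more than
120 seconds.
May 08 03:33:21 host kernel: Call Trace:
May 08 03:33:21 host kernel: __schedule+0x2bd/0x760
May 08 03:33:21 host kernel: schedule+0x37/0xa0
May 08 03:33:21 host kernel: io_schedule+0x12/0x40
May 08 03:33:21 host kernel: wait_on_page_bit+0x137/0x230
"""

summary = blocked_in(sample)
print(Counter(summary.values()))
```

Applied to the full log above, the same grouping gives two tasks (worker:1798511 and kworker/u97:2) sleeping in md_bitmap_startwrite and two (worker:1799573 and worker:871154) sleeping in wait_on_page_bit under fdatasync.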