I’m writing to seek your expert advice on a critical issue affecting my ThinkSystem SR630 V2 server running AlmaLinux 8.9 (kernel 4.18.0-513.9.1.el8_9.x86_64). It hosts two virtual machines running the same OS and kernel version.
Problem:
Processes in unrelated subsystems (MySQL, Java, the QXL/VRAM graphics path) are either terminating abnormally and lingering as zombies, or getting stuck in uninterruptible sleep (D state). The lack of a consistent pattern makes the root cause hard to pinpoint and suggests systemic instability rather than individual application bugs.
Troubleshooting Steps Taken:
- Verified the UEFI firmware is at the latest level; will check with Lenovo for RAID adapter firmware updates.
- BMC (XClarity Controller) and XClarity Provisioning Manager updates seem unlikely contributors but are also being checked.
- Planning kernel and package upgrades on all servers as updates become available.
Questions:
- Known Kernel Bugs: Are there known bugs in kernel 4.18.0-513.9.1.el8_9 or compatibility issues with SR630 V2 kernel modules?
- Zombie Process Analysis: What tools/methods can I use to determine the cause of zombie processes beyond stack traces (processes have already terminated)?
- D State Process Analysis: How can I overcome permission issues preventing strace and gdb usage for a D state process? What alternative debugging methods are available?
- Memory Dump: Is there a way to extract a memory dump of the D state process?
- Hardware Testing: Has anyone encountered similar issues due to memory or disk problems? Can you recommend specific hardware tests?
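In case it helps frame the questions, this is the quick scan I currently use to catch D-state tasks and pull whatever kernel-side evidence is readable (a sketch; it assumes Linux /proc, and reading /proc/&lt;pid&gt;/stack needs root while wchan does not):

```shell
#!/bin/sh
# Scan for tasks in uninterruptible sleep (D state) and show the kernel
# function each one is blocked in (wchan), plus the full kernel stack
# where permissions allow.
ps -eo pid,stat,wchan:32,comm --no-headers | awk '$2 ~ /^D/' |
while read -r pid stat wchan comm; do
    echo "== PID $pid ($comm) blocked in $wchan"
    cat "/proc/$pid/stack" 2>/dev/null || echo "   (stack unreadable: need root)"
done
```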
Additional Information:
- I’m currently gathering statistics through nmon.
- No new errors appear in the system log (/var/log/messages).
I’ve included a summary of incident details below, but can provide further information upon request.
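For the zombie side, here is how I map each defunct process back to the parent that should reap it (a sketch; a zombie itself cannot be killed, only reaped by its parent, or by PID 1 after the parent dies):

```shell
#!/bin/sh
# For every zombie, print its PID, its parent's PID, and the parent's
# command line. SIGCHLD must go to the *parent*; if the parent is stuck,
# restarting it (or letting it exit so PID 1 adopts and reaps the child)
# is the only way to clear the zombie.
ps -eo pid,ppid,stat --no-headers | awk '$3 ~ /^Z/ {print $1, $2}' |
while read -r zpid ppid; do
    pcmd=$(tr '\0' ' ' < "/proc/$ppid/cmdline" 2>/dev/null)
    echo "zombie $zpid  <- parent $ppid: ${pcmd:-unknown}"
done
```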
Incident 1
- Issue:
- Web apps became unresponsive, likely due to a kernel worker thread (kworker/6:24+ev) stuck in the D state.
- Details:
- The affected thread is a kernel worker (kworker) that executes queued background work items, such as deferred I/O, timer, and housekeeping callbacks.
- Attempts to debug the thread using gdb or strace were unsuccessful.
- Analysis of the thread’s stack trace revealed it was waiting for a QXL graphics device fence.
- System logs confirmed a failure to allocate a buffer object (BO) in video RAM (VRAM) by the QXL graphics driver.
- Key Point:
- While the incident seems related to the QXL graphics driver, previous incidents didn’t involve it, suggesting a deeper underlying issue.
# cat /proc/719494/stack
[<0>] qxl_fence_wait+0x109/0x160 [qxl]
[<0>] dma_fence_wait_timeout+0x4e/0x120
[<0>] dma_resv_wait_timeout+0x67/0xc0
[<0>] ttm_bo_wait_ctx+0x39/0x50 [ttm]
[<0>] qxl_bo_move+0x32/0xa0 [qxl]
[<0>] ttm_bo_handle_move_mem+0xb9/0x140 [ttm]
[<0>] ttm_mem_evict_first+0x2bb/0x500 [ttm]
[<0>] ttm_bo_mem_space+0x1d1/0x220 [ttm]
[<0>] ttm_bo_validate+0xa0/0x120 [ttm]
[<0>] ttm_bo_init_reserved+0x147/0x170 [ttm]
[<0>] qxl_bo_create+0x158/0x200 [qxl]
[<0>] qxl_alloc_bo_reserved+0x3e/0xb0 [qxl]
[<0>] qxl_image_alloc_objects+0x69/0x120 [qxl]
[<0>] qxl_draw_dirty_fb+0x178/0x450 [qxl]
[<0>] qxl_framebuffer_surface_dirty+0xf8/0x1d0 [qxl]
[<0>] drm_fbdev_fb_dirty+0x125/0x350 [drm_kms_helper]
[<0>] drm_fb_helper_damage_work+0x86/0x160 [drm_kms_helper]
[<0>] process_one_work+0x1d3/0x390
[<0>] worker_thread+0x1f9/0x390
[<0>] kthread+0x134/0x150
[<0>] ret_from_fork+0x1f/0x40
# dmesg | grep qxl
[831639.144553] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831654.501399] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
[831654.502061] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831669.861572] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
[831669.862240] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831685.221761] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
[831685.222435] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831700.581990] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
[831700.582679] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831715.942106] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
[831715.942783] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831731.302242] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
[831731.302909] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831746.662426] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
# perf record -g -p 719494
# perf report
Samples: 55 of event 'cpu-clock:pppH', Event count (approx.): 13750000
Children Self Command Shared Object Symbol
+ 100.00% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] ret_from_fork
+ 100.00% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] kthread
+ 100.00% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] worker_thread
+ 100.00% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] process_one_work
+ 100.00% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] drm_fb_helper_damage_work
+ 100.00% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] drm_fbdev_fb_dirty
+ 52.73% 52.73% kworker/6:24+ev [kernel.kallsyms] [k] memcpy_erms
+ 47.27% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] qxl_framebuffer_surface_dirty
+ 47.27% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] qxl_draw_dirty_fb
+ 47.27% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] qxl_alloc_bo_reserved
+ 45.45% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] qxl_image_alloc_objects
+ 25.45% 1.82% kworker/6:24+ev [kernel.kallsyms] [k] vprintk_emit
+ 23.64% 21.82% kworker/6:24+ev [kernel.kallsyms] [k] console_unlock
+ 20.00% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] qxl_bo_create
+ 20.00% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] ttm_bo_init_reserved
+ 16.36% 1.82% kworker/6:24+ev [kernel.kallsyms] [k] __drm_err
+ 14.55% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] printk
+ 12.73% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] ttm_bo_validate
+ 12.73% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] ttm_bo_mem_space
+ 12.73% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] ttm_mem_evict_first
+ 12.73% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] ttm_bo_handle_move_mem
+ 12.73% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] qxl_bo_move
+ 12.73% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] ttm_bo_wait_ctx
+ 12.73% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] dma_resv_wait_timeout
+ 12.73% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] dma_fence_wait_timeout
+ 12.73% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] qxl_fence_wait
+ 10.91% 0.00% kworker/6:24+ev [kernel.kallsyms] [k] qxl_bo_create.cold.9
Incident 2
- Issue:
- Web apps became unresponsive, initially attributed to table locks in MySQL.
- Restarting MySQL resulted in a zombie process.
- The MySQL backup also hung, leaving a zero-byte dump file, which suggests a resource-access problem.
- Technical Details (for reference):
- MySQL process (PID 1282) was in a defunct zombie state.
- Another service process (PID 604275) was also a zombie.
- Attempts to clear the zombie processes by sending SIGCHLD were unsuccessful.
- Key Points: Further investigation casts doubt on table locks as the sole cause, pointing to broader resource access problems.
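As a sanity check that this behavior is not MySQL-specific, a transient zombie can be reproduced and observed in isolation (a sketch; the sleep durations are arbitrary):

```shell
#!/bin/sh
# Create a short-lived zombie and confirm signaling it does nothing.
# `sh` backgrounds a short sleep, then execs a longer sleep; the exec'd
# process never wait()s, so its exited child lingers as <defunct> until
# the parent exits and PID 1 reaps it.
sh -c 'sleep 0.2 & exec sleep 3' &
parent=$!
sleep 1                                   # give the child time to exit
state=$(ps -o stat= --ppid "$parent" | tr -d ' ')
echo "child state: $state"                # Z = zombie
kill -CHLD "$parent" 2>/dev/null          # no effect: sleep ignores SIGCHLD
wait "$parent"
```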
Incident 3
- Issue:
- Web apps became unresponsive and a Java process was stuck in the D state with no stack trace available, pointing to a potential kernel-level problem or process corruption.
- Key Points:
- Memory Access Problem: The message “Code: Unable to access opcode bytes at RIP 0x7fdd7e72a432” means the kernel could not read the user-space instruction bytes at that address while dumping the trace, which is consistent with the task’s address space already being torn down during exit.
- Kernel Function Involvement: The call trace shows the task blocked in do_exit()/do_group_exit() after get_signal(), i.e., it hung while the kernel was tearing the process down in response to a fatal signal; ORIG_RAX 0xca identifies the interrupted system call as futex(2).
dmesg logs
[Thu Jan 18 08:49:42 2024] RIP: 0033:0x7fdd7e72a45c
[Thu Jan 18 08:49:42 2024] Code: Unable to access opcode bytes at RIP 0x7fdd7e72a432.
[Thu Jan 18 08:49:42 2024] RSP: 002b:00007fdd663113e0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[Thu Jan 18 08:49:42 2024] RAX: fffffffffffffe00 RBX: 00007fdd30000b60 RCX: 00007fdd7e72a45c
[Thu Jan 18 08:49:42 2024] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fdd30000b8c
[Thu Jan 18 08:49:42 2024] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[Thu Jan 18 08:49:42 2024] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fdd78ded828
[Thu Jan 18 08:49:42 2024] R13: 0000000000000003 R14: 0000000000000000 R15: 00007fdd30000b8c
[Thu Jan 18 08:49:42 2024] task:java state:D stack:0 pid:2527 ppid:1 flags:0x80004084
[Thu Jan 18 08:49:42 2024] Call Trace:
[Thu Jan 18 08:49:42 2024] __schedule+0x2d1/0x870
[Thu Jan 18 08:49:42 2024] schedule+0x55/0xf0
[Thu Jan 18 08:49:42 2024] do_exit+0x250/0xb10
[Thu Jan 18 08:49:42 2024] do_group_exit+0x3a/0xa0
[Thu Jan 18 08:49:42 2024] get_signal+0x141/0x880
[Thu Jan 18 08:49:42 2024] ? do_futex+0x143/0x4e0
[Thu Jan 18 08:49:42 2024] do_signal+0x36/0x690
[Thu Jan 18 08:49:42 2024] ? __x64_sys_futex+0x14e/0x200
[Thu Jan 18 08:49:42 2024] ? syscall_trace_enter+0x1ff/0x2d0
[Thu Jan 18 08:49:42 2024] ? __x64_sys_futex+0x14e/0x200
[Thu Jan 18 08:49:42 2024] exit_to_usermode_loop+0x89/0x100
[Thu Jan 18 08:49:42 2024] do_syscall_64+0x19c/0x1b0
[Thu Jan 18 08:49:42 2024] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Thu Jan 18 08:49:42 2024] RIP: 0033:0x7fdd7e72a45c
[Thu Jan 18 08:49:42 2024] Code: Unable to access opcode bytes at RIP 0x7fdd7e72a432.
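For reference, ORIG_RAX in the dump is the number of the system call the task was executing; decoding it is mechanical (a sketch; the header path assumes kernel-headers is installed):

```shell
#!/bin/sh
# ORIG_RAX: 0xca is the syscall number the task was blocked in.
nr=$(printf '%d' 0xca)
echo "syscall number: $nr"
# Resolve the name from the x86_64 syscall header if present;
# 202 is __NR_futex, matching the futex frames in the call trace.
grep -w "$nr" /usr/include/asm/unistd_64.h 2>/dev/null \
    || echo "__NR_futex = $nr (x86_64 syscall table)"
```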
Incident 4
- Issue:
- Web apps became unresponsive, and two processes were found in a defunct (zombie) state, meaning they had exited but were never reaped by their parent processes.
- Technical Details (for reference):
- The processes, both identified as “[service_process]” (PIDs 261224 and 950658), were stuck in the defunct state because their parent processes had not collected their exit status.
- This suggests potential issues with process management or communication between parent and child processes.
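Finally, to catch the next occurrence with kernel-side evidence rather than after the fact, I plan to enable hung-task reporting and on-demand blocked-task dumps (a sketch; all commands need root, and the 120 s threshold is arbitrary):

```shell
# Warn in dmesg whenever a task sits in D state longer than 120 s
sysctl -w kernel.hung_task_timeout_secs=120
# Allow magic-sysrq triggers, then dump backtraces of all blocked (D) tasks
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger     # output lands in dmesg / /var/log/messages
```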