Sporadic zombie and D state processes

I’m writing to seek your expert advice regarding a critical issue affecting my ThinkSystem SR630 V2 server running AlmaLinux 8.9 (kernel 4.18.0-513.9.1.el8_9.x86_64). It hosts two virtual machines running the same OS and kernel version.

Problem:

Processes in unrelated subsystems (MySQL, Java, the QXL graphics driver) are terminating abnormally (left as zombies) or becoming stuck in uninterruptible sleep (D state). The lack of a consistent pattern makes pinpointing the root cause challenging and suggests systemic instability rather than individual application issues.

Troubleshooting Steps Taken:

  1. Verified latest UEFI firmware and will check for RAID adapter updates with Lenovo.
  2. BMC (XClarity Controller) and XClarity Provisioning Manager updates seem unlikely contributors but are also being checked.
  3. Considering kernel and package upgrades on all servers as they become available.

Questions:

  • Known Kernel Bugs: Are there known bugs in kernel 4.18.0-513.9.1.el8_9 or compatibility issues with SR630 V2 kernel modules?
  • Zombie Process Analysis: What tools/methods can I use to determine the cause of zombie processes, given that stack traces are unavailable once a process has terminated?
  • D State Process Analysis: How can I overcome permission issues preventing strace and gdb usage for a D state process? What alternative debugging methods are available?
  • Memory Dump: Is there a way to extract a memory dump of the D state process?
  • Hardware Testing: Has anyone encountered similar issues due to memory or disk problems? Can you recommend specific hardware tests?

Additional Information:

  • I’m currently gathering statistics through nmon.
  • No new errors appear in the system log (/var/log/messages).

I’ve included a summary of incident details below, but can provide further information upon request.
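
Since the incidents are sporadic, it may also help to sample process states continuously alongside the nmon data, so there is a timestamp for when D or Z tasks first appear. A minimal sketch of such a scanner (assuming a Python interpreter is available on the hosts; the function name is illustrative):

```python
import os

def scan_states(states=frozenset({"D", "Z"})):
    """Walk /proc and report tasks in the given states.

    Returns a list of (pid, comm, state) tuples; run it periodically
    (e.g. from cron, next to nmon) to record when stuck or defunct
    processes first show up.
    """
    hits = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # comm may contain spaces; the state field follows the last ')'
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        state = stat[stat.rindex(")") + 2]
        if state in states:
            hits.append((int(entry), comm, state))
    return hits

print(scan_states())
```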

Incident 1

  • Issue:
    • Web apps became unresponsive, likely due to a kernel worker thread (kworker/6:24+ev) stuck in the D state.
  • Details:
    • The affected kworker thread services queued kernel work items (deferred I/O, timers, and similar background tasks).
    • Attempts to debug the thread using gdb or strace were unsuccessful.
    • Analysis of the thread’s stack trace revealed it was waiting for a QXL graphics device fence.
    • System logs confirmed a failure to allocate a buffer object (BO) in video RAM (VRAM) by the QXL graphics driver.
  • Key Point:
    • While the incident seems related to the QXL graphics driver, previous incidents didn’t involve it, suggesting a deeper underlying issue.
# cat /proc/719494/stack 
[<0>] qxl_fence_wait+0x109/0x160 [qxl]
[<0>] dma_fence_wait_timeout+0x4e/0x120
[<0>] dma_resv_wait_timeout+0x67/0xc0
[<0>] ttm_bo_wait_ctx+0x39/0x50 [ttm]
[<0>] qxl_bo_move+0x32/0xa0 [qxl]
[<0>] ttm_bo_handle_move_mem+0xb9/0x140 [ttm]
[<0>] ttm_mem_evict_first+0x2bb/0x500 [ttm]
[<0>] ttm_bo_mem_space+0x1d1/0x220 [ttm]
[<0>] ttm_bo_validate+0xa0/0x120 [ttm]
[<0>] ttm_bo_init_reserved+0x147/0x170 [ttm]
[<0>] qxl_bo_create+0x158/0x200 [qxl]
[<0>] qxl_alloc_bo_reserved+0x3e/0xb0 [qxl]
[<0>] qxl_image_alloc_objects+0x69/0x120 [qxl]
[<0>] qxl_draw_dirty_fb+0x178/0x450 [qxl]
[<0>] qxl_framebuffer_surface_dirty+0xf8/0x1d0 [qxl]
[<0>] drm_fbdev_fb_dirty+0x125/0x350 [drm_kms_helper]
[<0>] drm_fb_helper_damage_work+0x86/0x160 [drm_kms_helper]
[<0>] process_one_work+0x1d3/0x390
[<0>] worker_thread+0x1f9/0x390
[<0>] kthread+0x134/0x150
[<0>] ret_from_fork+0x1f/0x40

dmesg | grep qxl

[831639.144553] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831654.501399] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
[831654.502061] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831669.861572] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
[831669.862240] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831685.221761] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
[831685.222435] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831700.581990] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
[831700.582679] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831715.942106] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
[831715.942783] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831731.302242] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
[831731.302909] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[831746.662426] qxl 0000:00:01.0: object_init failed for (4096, 0x00000001)
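
Incidentally, the allocation failures above recur on an almost perfectly fixed cadence, which looks like a blocked retry loop (a fence wait timing out and being retried) rather than independent events. The spacing can be checked directly from the timestamps:

```python
# Timestamps of the repeated "failed to allocate VRAM BO" lines above.
ts = [831639.144553, 831654.502061, 831669.862240, 831685.222435,
      831700.582679, 831715.942783, 831731.302909]
gaps = [round(b - a, 3) for a, b in zip(ts, ts[1:])]
print(gaps)  # every gap is ~15.36 s
```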

perf record -g -p 719494
perf report 

Samples: 55  of event 'cpu-clock:pppH', Event count (approx.): 13750000
  Children      Self  Command          Shared Object      Symbol
+  100.00%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] ret_from_fork
+  100.00%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] kthread
+  100.00%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] worker_thread
+  100.00%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] process_one_work
+  100.00%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] drm_fb_helper_damage_work
+  100.00%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] drm_fbdev_fb_dirty
+   52.73%    52.73%  kworker/6:24+ev  [kernel.kallsyms]  [k] memcpy_erms
+   47.27%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] qxl_framebuffer_surface_dirty
+   47.27%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] qxl_draw_dirty_fb
+   47.27%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] qxl_alloc_bo_reserved
+   45.45%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] qxl_image_alloc_objects
+   25.45%     1.82%  kworker/6:24+ev  [kernel.kallsyms]  [k] vprintk_emit
+   23.64%    21.82%  kworker/6:24+ev  [kernel.kallsyms]  [k] console_unlock
+   20.00%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] qxl_bo_create
+   20.00%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] ttm_bo_init_reserved
+   16.36%     1.82%  kworker/6:24+ev  [kernel.kallsyms]  [k] __drm_err
+   14.55%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] printk
+   12.73%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] ttm_bo_validate
+   12.73%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] ttm_bo_mem_space
+   12.73%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] ttm_mem_evict_first
+   12.73%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] ttm_bo_handle_move_mem
+   12.73%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] qxl_bo_move
+   12.73%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] ttm_bo_wait_ctx
+   12.73%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] dma_resv_wait_timeout
+   12.73%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] dma_fence_wait_timeout
+   12.73%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] qxl_fence_wait
+   10.91%     0.00%  kworker/6:24+ev  [kernel.kallsyms]  [k] qxl_bo_create.cold.9

Incident 2

  • Issue:
    • Web apps became unresponsive, initially attributed to table locks in MySQL.
    • Restarting MySQL resulted in a zombie process.
    • A MySQL backup also hung, producing a zero-byte file, suggesting resource access issues.
  • Technical Details (for reference):
    • The MySQL process (PID 1282) was in a defunct (zombie) state.
    • Another service process (PID 604275) was also a zombie.
    • Attempts to clear the zombies by sending SIGCHLD to their parent processes were unsuccessful.
  • Key Points: Further investigation casts doubt on table locks as the sole cause, pointing to broader resource access problems.
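
For what it’s worth, the failed SIGCHLD attempts are expected behaviour: a zombie ignores all signals, and only a wait() by its parent (or the parent dying, so that init/systemd adopts and reaps it) removes the entry. A small self-contained demonstration:

```python
import os, signal, time

# A child that exits immediately stays a zombie until the parent calls
# wait(); signals sent to the zombie itself (even SIGKILL) have no
# effect, matching what was seen with PIDs 1282 and 604275.
pid = os.fork()
if pid == 0:
    os._exit(0)                 # child terminates at once

time.sleep(0.2)                 # give the child time to exit
with open(f"/proc/{pid}/stat") as f:
    state_before = f.read().rsplit(")", 1)[1].split()[0]

os.kill(pid, signal.SIGKILL)    # signalling a zombie is a no-op
with open(f"/proc/{pid}/stat") as f:
    state_after = f.read().rsplit(")", 1)[1].split()[0]

os.waitpid(pid, 0)              # only the parent's wait() reaps it
print(state_before, state_after)
```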

Incident 3

  • Issue:
    • Web apps became unresponsive, and a Java process was stuck in the D state with no user-space stack trace available, signaling a potential kernel-level issue or process corruption.
  • Key Points:
    • Memory Access Problem: The error message “Unable to access opcode bytes at RIP 0x7fdd7e72a432” means the kernel could not read the instruction bytes at that user-space address while printing the trace, suggesting memory-related trouble (or an address space already being torn down).
    • Kernel Function Involvement: The call trace shows the task inside do_group_exit/do_exit while handling a signal, hinting at problems during signal delivery or process termination.

dmesg logs

[Thu Jan 18 08:49:42 2024] RIP: 0033:0x7fdd7e72a45c
[Thu Jan 18 08:49:42 2024] Code: Unable to access opcode bytes at RIP 0x7fdd7e72a432.
[Thu Jan 18 08:49:42 2024] RSP: 002b:00007fdd663113e0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[Thu Jan 18 08:49:42 2024] RAX: fffffffffffffe00 RBX: 00007fdd30000b60 RCX: 00007fdd7e72a45c
[Thu Jan 18 08:49:42 2024] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fdd30000b8c
[Thu Jan 18 08:49:42 2024] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[Thu Jan 18 08:49:42 2024] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fdd78ded828
[Thu Jan 18 08:49:42 2024] R13: 0000000000000003 R14: 0000000000000000 R15: 00007fdd30000b8c
[Thu Jan 18 08:49:42 2024] task:java            state:D stack:0     pid:2527  ppid:1      flags:0x80004084
[Thu Jan 18 08:49:42 2024] Call Trace:
[Thu Jan 18 08:49:42 2024]  __schedule+0x2d1/0x870
[Thu Jan 18 08:49:42 2024]  schedule+0x55/0xf0
[Thu Jan 18 08:49:42 2024]  do_exit+0x250/0xb10
[Thu Jan 18 08:49:42 2024]  do_group_exit+0x3a/0xa0
[Thu Jan 18 08:49:42 2024]  get_signal+0x141/0x880
[Thu Jan 18 08:49:42 2024]  ? do_futex+0x143/0x4e0
[Thu Jan 18 08:49:42 2024]  do_signal+0x36/0x690
[Thu Jan 18 08:49:42 2024]  ? __x64_sys_futex+0x14e/0x200
[Thu Jan 18 08:49:42 2024]  ? syscall_trace_enter+0x1ff/0x2d0
[Thu Jan 18 08:49:42 2024]  ? __x64_sys_futex+0x14e/0x200
[Thu Jan 18 08:49:42 2024]  exit_to_usermode_loop+0x89/0x100
[Thu Jan 18 08:49:42 2024]  do_syscall_64+0x19c/0x1b0
[Thu Jan 18 08:49:42 2024]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Thu Jan 18 08:49:42 2024] RIP: 0033:0x7fdd7e72a45c
[Thu Jan 18 08:49:42 2024] Code: Unable to access opcode bytes at RIP 0x7fdd7e72a432.
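
On the memory-dump question: gcore/gdb hang because a ptrace attach never completes on a D-state task, but reading /proc/<pid>/mem only performs a ptrace access-mode check and does not need to stop the target, so with root (or same-user) access the address space can often still be dumped. A sketch that dumps the first readable mapping, demonstrated here on its own PID (the function name and output path are illustrative):

```python
import os

def dump_region(pid, out_path, max_bytes=4096):
    """Dump the first readable mapping of a process to a file.

    Reading /proc/<pid>/mem does not require the target to be
    ptrace-stopped, so it can work on a D-state task where gdb/gcore
    attach would hang (assumption: the caller passes the ptrace
    access check, e.g. it is root or dumping its own process).
    """
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            addr, perms = line.split()[:2]
            if not perms.startswith("r"):
                continue  # skip unreadable guard/PROT_NONE regions
            start, end = (int(x, 16) for x in addr.split("-"))
            with open(f"/proc/{pid}/mem", "rb", buffering=0) as mem:
                mem.seek(start)
                data = mem.read(min(end - start, max_bytes))
            with open(out_path, "wb") as out:
                out.write(data)
            return len(data)
    return 0

# Demonstrate on our own process; for a stuck task, substitute its PID.
n = dump_region(os.getpid(), "/tmp/mem_sample.bin")
print(f"dumped {n} bytes")
```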

Incident 4

  • Issue:
    • Web apps became unresponsive, and two processes were discovered in a defunct (zombie) state, indicating abnormal termination without cleanup by their parent processes.
  • Technical Details (for reference):
    • The processes, both identified as “[service_process]” (PIDs 261224 and 950658), were stuck in the defunct state because their parent processes had not collected their exit status.
    • This suggests potential issues with process management or communication between parent and child processes.
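
If the parent of [service_process] is under your control, the usual fix is for the parent to reap its children, e.g. a SIGCHLD handler that loops on waitpid(..., WNOHANG) so no child ever lingers as defunct. A minimal sketch of the pattern:

```python
import os, signal, time

# With this handler installed, exited children are reaped as soon as
# SIGCHLD is delivered, so they never accumulate as zombies.
def reap(signum, frame):
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return              # no children left at all
        if pid == 0:
            return              # children exist but none have exited

signal.signal(signal.SIGCHLD, reap)

child = os.fork()
if child == 0:
    os._exit(0)                 # child exits immediately

time.sleep(0.3)                 # handler fires; child is reaped for us
reaped = not os.path.exists(f"/proc/{child}")
print("reaped:", reaped)
```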