OOM-Killer taking server offline

OS: AlmaLinux 8.7 (Stone Smilodon)
Running on: Plesk Obsidian

Hi, we have had an issue for a few weeks now where one of our servers experiences sudden spikes of CPU activity, which push I/O wait to its maximum and then knock the server offline.

The spikes don’t seem to be linked to any hosted website activity. Plesk have also checked at the OS level and can’t find anything, but they noticed that the OOM killer is being invoked when CPU, RAM and swap usage all max out.

OOM seems to be killing the kworker process instead of whatever is causing the spikes. Any suggestions on how to determine the root cause, please? All the grep command is showing is what I’ve attached, and we’re still not sure.
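One way to read the kernel log with this question in mind is to separate the task that triggered the OOM killer from the task that was actually killed. The sketch below assumes the usual kernel wording ("<task> invoked oom-killer" for the trigger and "Killed process <pid> (<name>)" for the victim); the function name `summarize_oom_events` is hypothetical, and the exact message text can vary between kernel versions.

```python
import re

# Assumed message shapes; check them against your own kernel log.
TRIGGER = re.compile(r"(\S+) invoked oom-killer")
VICTIM = re.compile(r"Killed process (\d+) \(([^)]*)\)")


def summarize_oom_events(lines):
    """Pair each 'invoked oom-killer' line with the next 'Killed process' line."""
    events = []
    trigger = None
    for line in lines:
        match = TRIGGER.search(line)
        if match:
            # The task whose allocation request started the OOM killer.
            trigger = match.group(1)
            continue
        match = VICTIM.search(line)
        if match:
            # The task that was chosen and killed.
            events.append({
                "trigger": trigger,
                "victim_pid": int(match.group(1)),
                "victim": match.group(2),
            })
            trigger = None
    return events
```

Feeding it the lines of a saved kernel log (for example `summarize_oom_events(open(path))`, where `path` is wherever your log lives) gives one entry per kill, which makes it easier to see whether the same victim is hit each time.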

We use cgroups to limit CPU and RAM on sites, so we’re fairly sure it’s not an individual site causing the issue.

The spikes also happen days apart at random times both day and night.

Thanks!

When I was running a cluster of diskless machines, this was a well-trodden path. If you are out of both memory and swap space, then eventually the oom-killer is brought into play to try to keep the system going. It’s a while since I’ve dug deep on this and my reference book is ageing a bit¹, but in essence the oom-killer prefers a victim with:

  • Large VM
  • Low CPU usage
  • Non-root process

Once the process has been selected, all threads in that process are sent SIGTERM or SIGKILL. We had a particular problem because programs written in languages that use a large amount of static storage (such as FORTRAN) become prime candidates. I know that the OOM killer has been made cgroup-aware, so it is possible that it is hitting the worst possible cgroup by sheer bad luck.
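To see which processes the kernel currently rates as the most likely victims, you can read the per-process `oom_score` files under `/proc`. The sketch below is only an illustration: the function name `rank_oom_candidates` is hypothetical, and on current kernels the score is driven mainly by memory footprint (adjusted by `oom_score_adj`), so it will not match the 2004-era criteria listed above exactly.

```python
import os


def rank_oom_candidates(proc_root="/proc", top=10):
    """Return up to `top` (oom_score, pid, name) tuples, highest score first."""
    candidates = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return candidates  # no procfs available here
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited mid-read, or the file was unreadable
        candidates.append((score, int(entry), name))
    # Highest score first; ties broken by pid for a stable order.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return candidates[:top]
```

Calling `rank_oom_candidates()` periodically and logging the result shows who would be picked if memory ran out at that moment.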

¹Gorman, Mel. Understanding the Linux Virtual Memory Manager. Bruce Perens’ Open Source Series. Upper Saddle River, NJ: Prentice Hall, 2004.

Hi Martin, thanks for the reply. Do you have any suggestions for establishing the cause of the memory and CPU usage? Nothing has shown up in the logs we have checked so far, and there is no gradual build-up from normal activity: it’s a sudden spike, then the server goes offline.

Thanks

Don’t fret too much about CPU usage; of itself it wouldn’t trigger the OOM killer. As regards memory, you’ll need to examine your workload leading up to the failure. Have you instrumented the server, with PCP for instance? Is there a user job that runs for an extended period and then suddenly demands more memory? A job like that would build up CPU time (and so avoid the OOM killer, because of the "Low CPU usage" criterion above), then exhaust memory so that someone else gets killed! Our pet bugbear was a lengthy compile (burn CPU) followed by lots of memory at runtime.
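If a full monitoring stack such as PCP is not in place yet, a very rough stand-in is to snapshot per-process resident memory from `/proc/<pid>/status` at intervals and log whichever processes grew the most. This is a minimal sketch, not a replacement for proper instrumentation; the function names `snapshot_rss` and `largest_growth` are hypothetical.

```python
import os


def snapshot_rss(proc_root="/proc"):
    """Map pid -> (name, resident set size in kB) for readable processes."""
    snap = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return snap  # no procfs available here
    for entry in entries:
        if not entry.isdigit():
            continue
        fields = {}
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
        except OSError:
            continue  # process exited while we were reading
        if "VmRSS" in fields:  # kernel threads have no VmRSS line
            rss_kb = int(fields["VmRSS"].split()[0])
            snap[int(entry)] = (fields.get("Name", "?"), rss_kb)
    return snap


def largest_growth(before, after, top=5):
    """Return (delta_kB, pid, name), biggest increase first.

    Processes that are new in `after` are counted as growing from zero.
    """
    deltas = []
    for pid, (name, rss) in after.items():
        old_rss = before.get(pid, (name, 0))[1]
        deltas.append((rss - old_rss, pid, name))
    deltas.sort(key=lambda d: (-d[0], d[1]))
    return deltas[:top]
```

Taking a snapshot every few seconds from a small loop and appending the output of `largest_growth(previous, current)` to a file on disk means the last entries written before the server goes offline should point at whatever was growing.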