[patch] mm, oom: make a last minute check to prevent unnecessary memcg oom kills
Michal Hocko
2020-03-17 07:59:39 UTC
Killing a user process as a result of hitting memcg limits is a serious
decision that is unfortunately needed only when no forward progress in
reclaiming memory can be made.
Deciding on the appropriate oom victim can take enough time that another
process which is already exiting uncharges memory to the same memcg
hierarchy in the meantime, making the kill unnecessary.
An example is to prevent *multiple* unnecessary oom kills on a system with
two cores where the oom kill occurs when there is an abundance of free
memory available:
Memory cgroup out of memory: Killed process 628 (repro) total-vm:41944kB, anon-rss:40888kB, file-rss:496kB, shmem-rss:0kB, UID:0 pgtables:116kB oom_score_adj:0
<immediately after>
repro invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
CPU: 1 PID: 629 Comm: repro Not tainted 5.6.0-rc5+ #130
dump_stack+0x78/0xb6
dump_header+0x55/0x240
oom_kill_process+0xc5/0x170
out_of_memory+0x305/0x4a0
try_charge+0x77b/0xac0
mem_cgroup_try_charge+0x10a/0x220
mem_cgroup_try_charge_delay+0x1e/0x40
handle_mm_fault+0xdf2/0x15f0
do_user_addr_fault+0x21f/0x420
async_page_fault+0x2f/0x40
memory: usage 61336kB, limit 102400kB, failcnt 74
Notice that the second memcg oom report shows usage more than 40MB below
the 100MB limit, yet a process is still unnecessarily killed: the decision
to oom kill had already been made by calling out_of_memory() before the
initial victim had a chance to uncharge its memory.
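
For concreteness, here is a minimal sketch of the kind of last-minute check
being discussed, assuming a helper with mem_cgroup_margin() semantics is
reachable from the oom killer; this is illustrative only, not the actual
patch, and the helper name is made up:

	/*
	 * Illustrative sketch, not the actual patch: before committing to a
	 * memcg oom kill, re-check whether the charging memcg now has enough
	 * margin to satisfy the allocation, e.g. because a previous victim
	 * has already exited and uncharged its memory.  Assumes a helper with
	 * mem_cgroup_margin() semantics is visible from mm/oom_kill.c.
	 */
	static bool memcg_kill_no_longer_needed(struct oom_control *oc)
	{
		if (!is_memcg_oom(oc))
			return false;
		return mem_cgroup_margin(oc->memcg) >= (1UL << oc->order);
	}

If a check along these lines is made right before the victim is actually
killed, the second, unnecessary kill in the log above can be skipped.
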
> Could you be more specific about the workload, please?
> Robert, could you elaborate on the user-visible effects of this issue that
> caused it to initially get reported?
Yes please, real-life use cases are important when adding hacks like this
one, and we should have clear data showing that the check actually helps
(in how many instances, etc.).
Friendly ping.
--
Michal Hocko
SUSE Labs
Robert Kolchmeyer
2020-03-17 18:25:52 UTC
> Robert, could you elaborate on the user-visible effects of this issue that
> caused it to initially get reported?
Ami (now cc'ed) knows more, but here is my understanding. The use case
involves a Docker container running multiple processes. The container
has a memory limit set. The container contains two long-lived,
important processes p1 and p2, and some arbitrary, dynamic number of
usually ephemeral processes p3,...,pn. These processes are structured
in a hierarchy that looks like p1->p2->[p3,...,pn]; p1 is the parent of
p2, and p2 is the parent of all of the ephemeral processes p3,...,pn.

Since p1 and p2 are long-lived and important, the user does not want
p1 and p2 to be oom-killed. However, p3,...,pn are expected to use a
lot of memory, and it's ok for those processes to be oom-killed.

If the user sets oom_score_adj on p1 and p2 to make them very unlikely
to be oom-killed, p3,...,pn will inherit the oom_score_adj value,
which is bad. Additionally, setting oom_score_adj on p3,...,pn is
tricky, since processes in the Docker container (specifically p1 and
p2) don't have permissions to set oom_score_adj on p3,...,pn. The
ephemeral nature of p3,...,pn also makes setting oom_score_adj on them
tricky after they launch.
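
To make the inheritance problem concrete, here is a small userspace sketch
(not from the thread; it only uses the standard /proc/<pid>/oom_score_adj
interface) showing that a child forked after the parent lowers its
oom_score_adj inherits the same value, which is why protecting p1/p2 this
way also "protects" p3,...,pn:

	/*
	 * Illustration: oom_score_adj is inherited across fork(), so an
	 * adjustment made for a long-lived parent (p1/p2) is also seen by
	 * ephemeral children (p3,...,pn) unless each child resets it.
	 * Note that lowering oom_score_adj requires CAP_SYS_RESOURCE.
	 */
	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	static void show_adj(const char *who)
	{
		char buf[16] = "";
		FILE *f = fopen("/proc/self/oom_score_adj", "r");

		if (f) {
			if (fgets(buf, sizeof(buf), f))
				printf("%s pid %d: oom_score_adj = %s",
				       who, getpid(), buf);
			fclose(f);
		}
	}

	int main(void)
	{
		FILE *f = fopen("/proc/self/oom_score_adj", "w");

		if (f) {
			fputs("-998\n", f);	/* "protect" this process, like p1/p2 */
			fclose(f);
		}
		show_adj("parent");

		if (fork() == 0) {		/* stands in for an ephemeral p3..pn */
			show_adj("child ");	/* prints -998 too: inherited */
			_exit(0);
		}
		wait(NULL);
		return 0;
	}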

So, the user hopes that when one of p3,...,pn triggers an oom
condition in the Docker container, the oom killer will almost always
kill processes from p3,...,pn (and not kill p1 or p2, which are both
important and unlikely to trigger an oom condition). The issue of more
processes being killed than strictly necessary means that p1 or p2 is
killed much more frequently when one of p3,...,pn triggers an oom
condition, and losing p1 or p2 is very disruptive for the user (my
understanding is that p1 or p2 going down with high frequency results in
significant unhealthiness in the user's service).

The change proposed here has not been run in a production system, and
so I don't think anyone has data that conclusively demonstrates that
this change will solve the user's problem. But, from observations made
in their production system, the user is confident that addressing this
aggressive oom killing will solve their problem, and we have data that
shows this change does considerably reduce the frequency of aggressive
oom killing (from 61/100 oom killing events down to 0/100 with this
change).

Hope this gives a bit more context.

Thanks,
-Robert
Ami Fischman
2020-03-17 19:00:45 UTC
On Tue, Mar 17, 2020 at 11:26 AM Robert Kolchmeyer wrote:
>> Robert, could you elaborate on the user-visible effects of this issue that
>> caused it to initially get reported?
> Ami (now cc'ed) knows more, but here is my understanding.
Robert's description of the mechanics we observed is accurate.

We discovered this regression in the oom-killer's behavior when
attempting to upgrade our system. The fraction of the system that
went unhealthy due to this issue was approximately equal to the
_sum_ of all other causes of unhealth, which are many and varied
but each of which contributes only a small amount of unhealth.
This issue forced a rollback to the previous kernel, where we
~never see this behavior, returning our unhealth levels to the
previous background levels.

Cheers,
-a
