Proposal to update container_memory_usage_bytes to Cache+RSS #3286
This commit updates container_memory_usage to use the calculated value of RSS+Cache instead of the value from cgroup. See issue google#3286 for reasoning Signed-off-by: Michael Honaker <[email protected]>
follow with interest
/track
m
Thoughts?
The best thing to do is to stop attempting to compute metrics in cAdvisor. Simply pass the values from the kernel to metrics so that the downstream users can decide what to do. This is exactly how we do it in the
@SuperQ I completely agree, and I think that is the best long-term approach. I forgot to update this issue with the thread but I emailed the lkml and found out that
Yes, I would love to see more granular data from the kernel on these fields. Plus it would be nice if the kernel would actually differentiate gauges from monotonic counters. 😁
Same problem here.
usage_in_bytes is far greater than RSS + CACHE, which leads to unreasonable Kubernetes metrics.
follow with interest |
Hello, I would like to propose updating `container_memory_usage_bytes` from reading `memory.usage_in_bytes`/`memory.current` to calculating it manually from Cache+RSS found in the `memory.stat` file. The reason is that there can be a discrepancy with `memory.usage_in_bytes` that's exacerbated on multi-core systems.

Background

Metrics
My original investigation started when I was trying to debug the memory usage of an unrelated application running in a Kubernetes pod and was watching the `container_memory_working_set_bytes` metric, as suggested by the Kubernetes docs. I wanted to find the source of this value, which led me to cAdvisor's GetStats, which told me it was `container_memory_usage_bytes - inactive_file`. My question then became: where does `container_memory_usage_bytes` come from? The cAdvisor code calls out to runc's GetStats, which answered my question: the information is gathered from files in the `/sys/fs/cgroup/memory` directory, specifically `memory.stat` and either `memory.usage_in_bytes` or `memory.current` depending on cgroup version.
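To make that concrete, here is a minimal sketch (not cAdvisor's actual code; the cgroup v1 mount point, the `total_*` stat keys, and the working-set derivation are assumptions based on the files discussed above) that reads the same cgroup files and prints the values this issue compares:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readUint reads a single integer value from a cgroup file such as
// memory.usage_in_bytes.
func readUint(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

// readStat parses memory.stat into a map of counter name -> value.
func readStat(path string) (map[string]uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	stats := make(map[string]uint64)
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}
	return stats, s.Err()
}

func main() {
	// Assumed cgroup v1 layout; pass a container's cgroup path as the argument.
	cg := "/sys/fs/cgroup/memory"
	if len(os.Args) > 1 {
		cg = filepath.Join(cg, os.Args[1])
	}

	usage, err := readUint(filepath.Join(cg, "memory.usage_in_bytes"))
	if err != nil {
		panic(err)
	}
	stats, err := readStat(filepath.Join(cg, "memory.stat"))
	if err != nil {
		panic(err)
	}

	// container_memory_working_set_bytes is usage minus inactive file pages
	// (guarding against underflow); the proposal would report cache+rss instead.
	workingSet := usage
	if inactive := stats["total_inactive_file"]; inactive < workingSet {
		workingSet -= inactive
	} else {
		workingSet = 0
	}

	fmt.Println("usage_in_bytes:", usage)
	fmt.Println("working_set:   ", workingSet)
	fmt.Println("cache + rss:   ", stats["total_cache"]+stats["total_rss"])
}
```

Running this against a live container's cgroup directory should print the same three numbers that the rest of this issue compares.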
Linux Kernel

Throughout this research I ran into a few other cAdvisor memory issues, like #3197 and #3081, which discuss these values and what we should be subtracting. This made me assume that `usage_in_bytes` could be calculated from the `stat` file, so I was curious about the calculation. That led me to section 5.5 of the kernel docs, which says the following: `memory.usage_in_bytes` isn't a calculation, it's a fuzz value! I was surprised to see that the kernel didn't have an exact value for memory usage, especially when `RSS+CACHE` is available to it in `memory.stat`, so I kept on digging. Looking back at the commit that introduced that documentation, a111c966, we see the following:

That tells us that `usage_in_bytes` is not precise and can be an unreliable metric for memory measurement. However, `res_counter->usage` should still be pretty close to actual usage, right? From the kernel email discussion regarding the above change there is at least the guarantee that `rss+cache <= usage_in_bytes`. However, the difference between the two will grow with the size of each per-CPU bulk pre-allocated memory. In other words, this difference can grow with the number of CPUs!
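To put a rough number on that: assuming, purely for illustration, a per-CPU charge batch of 32 pages of 4 KiB each (the actual batch size depends on the kernel version), a 64-CPU machine could see `usage_in_bytes` run ahead of `rss+cache` by up to:

```
32 pages/CPU x 4096 bytes/page x 64 CPUs = 8388608 bytes ≈ 8 MiB
```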
At this point you might be wondering what a 12-year-old commit and email thread talking about a removed res_counter API have to do with the current Linux kernel? Well, lockless page counters are the replacement for resource counters, which tried their best to keep the syntax of `usage_in_bytes` and `stat` unchanged. In the most recent kernel file `mm/memcontrol.c` we see two functions, mem_cgroup_usage and memcg_stat_show, which correspond to `usage_in_bytes` and `stat` respectively.

In the first function, `mem_cgroup_usage`, we can see that for non-root cgroups the return value is the current page counter value, accessed using `page_counter_read(&memcg->memory)`. This `page_counter_read` is effectively the same call as `res_counter->usage` but without the lock requirement. On the flip side, `memcg_stat_show` pulls its stats from the memory controller's `vmstats` struct, which is synced either every 2 seconds or when a large enough stat change occurs, as per the comments in memcontrol.c.

Throughout the above discussion I've been focusing on cgroup v1's `usage_in_bytes`. Well, it turns out that cgroup v2's `.current` uses memory_current_read, which pulls from the same `page_counter_read` as v1, so it is similarly affected.
Example

To illustrate this on a reproducible application, I've copied the `memory.stat` and `memory.usage_in_bytes` from an nginx pod as described in the kube docs and copied below.

As you can see, `total_cache + total_rss` equals 2097152, which is almost half of the reported `usage_in_bytes` of 4562944!! Albeit half in this case is a measly 2 MB, but that difference can add up.
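Working through those nginx numbers (straightforward arithmetic on the values quoted above):

```
4562944 - 2097152 = 2465792 bytes (≈ 2.35 MiB not accounted for by cache+rss)
2097152 / 4562944 ≈ 0.46, i.e. cache+rss is roughly 46% of usage_in_bytes
```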
In the original cache- and memory-intensive application I was debugging, I noticed the following in my memory files.

In this example `CACHE+RSS` equals 285203251, which is 602.68 MB less than the reported `usage_in_bytes`! However, 602 MB is only a 20% overestimation, which is better than nginx's 50%, but the impact is more visible.

Proposal
I'd like to update `container_memory_usage_bytes` to be a calculation of `CACHE+RSS` instead of the current `usage_in_bytes`. This is what's suggested in the kernel docs above, and it is also what the kernel does for `usage_in_bytes` on the root cgroup, i.e. on the host/node. I also think this would be beneficial for users, since `container_memory_usage_bytes` would then be easily calculable and understandable from the other statistics.

Most of my understanding of this issue comes from a 12-year-old email discussion, so I'm still familiarizing myself with the current kernel. Any corrections or historical context for the current implementation would be greatly appreciated. I'd also love to know if this has been discussed before. I'd be happy to work on implementing this change if the proposal is accepted.
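For concreteness, a rough sketch of the shape the change could take (illustrative only, not the actual cAdvisor code; the function name, the `cgroup2` flag, and the stat keys are my assumptions):

```go
package metrics

// memoryUsage derives the "usage" value from already-parsed memory.stat
// counters instead of reading memory.usage_in_bytes / memory.current.
// Illustrative only; names and keys are hypothetical, not cAdvisor's API.
func memoryUsage(stat map[string]uint64, cgroup2 bool) uint64 {
	if cgroup2 {
		// cgroup v2 exposes anon and file rather than rss and cache.
		return stat["anon"] + stat["file"]
	}
	return stat["total_rss"] + stat["total_cache"]
}
```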