Description
I did some testing with big, memory-intensive workloads. I used pmbench from https://bitbucket.org/jisooy/pmbench/src/master/ and experimented with working set sizes around 1 TB, measuring how long it takes to execute the h.libcontainerHandler.GetStats() function. My container runtime is Docker. Besides the workloads and cAdvisor, nothing else was running on the node during the tests. Command used to gather stats:
sudo ./cadvisor --disable_metrics=advtcp,process,sched,hugetlb,cpu_topology,resctrl,tcp,udp,percpu,accelerator,disk,diskIO,network --housekeeping_interval=5s --referenced_reset_interval=5 2>&1
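The execution times below are wall-clock durations of the GetStats() call. A minimal sketch of the kind of instrumentation that can produce such timings (the helper name and placement are mine, not cAdvisor code; inside cAdvisor it would wrap the h.libcontainerHandler.GetStats() call):

```go
package main

import (
	"log"
	"time"
)

// timed reports the wall-clock duration of fn. Wrapping the
// h.libcontainerHandler.GetStats() call with something like this yields
// per-call execution times such as those in the table below.
func timed(name string, fn func() error) error {
	start := time.Now()
	err := fn()
	log.Printf("%s took %v (err: %v)", name, time.Since(start), err)
	return err
}

func main() {
	// Stand-in for the real GetStats() call inside cAdvisor.
	_ = timed("GetStats", func() error {
		time.Sleep(50 * time.Millisecond)
		return nil
	})
}
```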
Here are the results:
Number of containers | Working set size per container [GB] | Total 'hot' memory size [GB] | GetStats() execution time |
---|---|---|---|
1 | 1024 | 1024 | 4.8 s (peak 9 s) |
10 | 102.4 | 1024 | 650 ms (peak 1.2 s) |
100 | 10.24 | 1024 | 500 ms (peak 1.1 s) |
The last measurement may be a little off, as the machine's CPUs got overloaded by the number of containers. For comparison, with the same load as in the first test but without collecting container_referenced_bytes, GetStats() execution averages around 1 ms and CPU usage drops to a couple of percent.
It is important to note that with container_referenced_bytes enabled, cAdvisor consumed a whole CPU core in the single-container test with the biggest working set; with a hundred containers it used around 70% of one core (though, as mentioned, this result may not be accurate); and with the metric disabled, usage dropped to a couple of percent.
Two issues are related to this:
- Data refresh is slower, which is more something to keep in mind than a real issue, as there is no real reason to gather stats at a frequency much higher than the scraping frequency (which is usually set to a couple of seconds).
- The real issue is that in this situation cAdvisor consumes a lot of CPU. In the first test it was able to fill one core completely, which would be a problem on a production system. On the other hand, deploying the cAdvisor container with a CPU limit, as previous experiments show, always results in a huge scraping slowdown. So limiting CPU usage should come from code optimization rather than from resource limits (see the sketch after this list for where the cost likely comes from).
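As I understand it, the referenced-bytes collection walks /proc/\<pid\>/smaps for the processes in the container and periodically resets the flags via /proc/\<pid\>/clear_refs, so the work grows roughly with the amount of mapped ("hot") memory. A minimal, self-contained sketch of that mechanism (this is an illustration of where the cost comes from, not cAdvisor's actual code):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// referencedBytes is a simplified sketch of what collecting
// container_referenced_bytes for one process boils down to: the kernel has to
// report a "Referenced:" counter for every mapping in /proc/<pid>/smaps, so
// the time to read the file scales with the amount of mapped memory.
func referencedBytes(pid int) (uint64, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/smaps", pid))
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var totalKB uint64
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "Referenced:") {
			continue
		}
		fields := strings.Fields(line) // e.g. "Referenced:  1024 kB"
		if len(fields) < 2 {
			continue
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		totalKB += kb
	}
	return totalKB * 1024, scanner.Err()
}

// clearReferenced resets the referenced flags (what --referenced_reset_interval
// triggers); writing "1" to clear_refs makes the kernel walk the page tables,
// which is another pass proportional to the working set size.
func clearReferenced(pid int) error {
	return os.WriteFile(fmt.Sprintf("/proc/%d/clear_refs", pid), []byte("1\n"), 0o644)
}

func main() {
	pid := os.Getpid()
	b, err := referencedBytes(pid)
	fmt.Println(b, err, clearReferenced(pid))
}
```

If the cost really is dominated by these per-page kernel walks, with a ~1 TB working set that is on the order of hundreds of millions of pages touched per housekeeping cycle, which would be consistent with a whole core being saturated.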
Therefore I think this is at least worth addressing in the documentation: big working set sizes slow down the referenced-bytes readings, and since this part of the code runs sequentially, it slows down the whole statistics refresh process.