-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Add referenced memory metric #2495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
31a447c
to
92b3dfc
Compare
I'm working on fix for data race indicated in tests:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it is still WIP, but I have a couple high-level questions.
info/v1/container.go
Outdated
@@ -899,6 +899,9 @@ type ContainerStats struct { | |||
|
|||
// Statistics originating from perf events | |||
PerfStats []PerfStat `json:"perf_stats,omitempty"` | |||
|
|||
// Working set size | |||
Wss uint64 `json:"wss,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we nest this under MemoryStats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All memory stats come from memory cgroup so I think that wss does not fit to them.
7a7d5bb
to
f25f2fd
Compare
@dashpole It's now ready for review. |
cmd/cadvisor.go
Outdated
@@ -87,6 +87,7 @@ var ( | |||
container.ProcessSchedulerMetrics: struct{}{}, | |||
container.ProcessMetrics: struct{}{}, | |||
container.HugetlbUsageMetrics: struct{}{}, | |||
container.ReferencedMetric: struct{}{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT about ReferencedMemoryMetrics
instead? Just Referenced seems a little generic...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and similarly referenced_memory
for the flag string
container/libcontainer/handler.go
Outdated
} else { | ||
stats.Referenced, err = referencedBytesStat(pids, h.cycles, *referencedResetInterval) | ||
if err != nil { | ||
klog.V(4).Infof("Unable to get working set size: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/working set size/referenced memory
container/libcontainer/handler.go
Outdated
return referencedKBytes * 1024, nil | ||
} | ||
|
||
func getReferencedBytes(pids []int) (uint64, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getReferencedKBytes?
container/libcontainer/handler.go
Outdated
smapsFilePath := fmt.Sprintf(smapsFilePathPattern, pid) | ||
smapsContent, err := ioutil.ReadFile(smapsFilePath) | ||
if err != nil { | ||
klog.V(3).Infof("Cannot read %s file, err: %s", smapsFilePath, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For logging that is expected in some scenarios, lets lower to V(5)
return referencedKBytes, nil | ||
} | ||
|
||
func clearReferencedBytes(pids []int, cycles uint64, resetInterval uint64) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I understand, if clearReferencedBytes errors, we will wait another resetInterval
cycles before attempting to reset again, right? If so, is that better than always trying to clear after an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, if clearReferencedBytes return error, we will wait another resetInterval
but error occurs only if there is a serious issue in system (problem with writing into existing file or problem with closing previously opened file). I don't think that is a good idea to force clearing in case of errors.
User can easily observe if referenced bytes were cleared or not, seeing value of referenced bytes.
Since your last review I've introduced an option to switch off clearing of referenced bytes (setting referenced_reset_interval
to 0). It's documented here. During experiments it's more user friendly to set 0 than very long reset interval if it's desired to observe referenced bytes in longer period.
82dc9bd
to
cc869de
Compare
see: https://github.com/brendangregg/wss#wsspl-referenced-page-flag Signed-off-by: Katarzyna Kujawa <[email protected]>
cc869de
to
7ab5e27
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
google#2495 introduced parsing smaps memory metrics, but only exposes referenced memory. As proposed in google#2634, also other metrics in here are of interest, specifically LazyFree; but generally also the others seem of interest. This commit replaces the referenced memory metrics by a generic smaps series of metrics. Runtime costs in cadvisor are not really impacted, as all the smaps parsing already happens anyway. The cardinality per container extends from 1 to 19 metrics, though. I think this is acceptable: also referenced memory scraping was disabled by default, so is scraping smaps. Resolves google#2634. Signed-off-by: Jens Erat <[email protected]>
see: https://github.com/brendangregg/wss#wsspl-referenced-page-flag
Signed-off-by: Katarzyna Kujawa [email protected]
The working set size indicates how much memory a container needs to keep working. It's intrusive metric because collection of metric can influence kernel page reclaim policy and add latency (this is mentioned in documentation). Wss metric is disabled by default.