Patching memory leak using tracer #153
base: staging
Conversation
A quick test of object lifecycles using tracer has confirmed that no object is being leaked, although the parent/child relationship is still somewhat broken. I'm working on fixing that right now.
Brian, Roger, and I had a discussion about the different approaches I can take to observing the state of the native rust layer of the library. A lot of solutions and ideas were explored.

Callbacks

The first approach was keeping the interface and the API consistent by introducing callbacks in the native layer which invoke javascript functions to perform and trigger tracing. This would have worked if not for a caveat: javascript can hook into rust functions by passing in parameters and getting return values back, but this is a one-way street. The rust bindings are used by the javascript layer, but rust cannot pull in javascript functions and execute them. This is because javascript uses an event loop to track what to run and when, while rust has a completely different control and logic flow, so the rust and javascript execution contexts are fundamentally disconnected. Technically, it's possible to run javascript code from rust, as the NAPI exposes a way to do so, but the executed callbacks are meant to be fire-and-forget, not something we can wait on to get return values from. This isn't an issue for stopping a span, but the start span callback returns a span identifier. Thus, the rust code would look like this.

```rust
pub fn javascript_callback(callback: napi::JsFunction) {
    // ...
    // Receiver for the callback's eventual return value (simplified)
    let rx = napi::JsReceiver;
    // Queue the javascript callback onto the node event loop
    napi::call_javascript_function(&callback, napi::Threadsafe::Blocking);
    // Block until the callback sends a result back; the callback has only been
    // queued and cannot run while we block here, so this deadlocks
    let result = rx.await.unwrap();
}
```

This code is simplified, but it conveys the idea. The rust code is called as a single task inside the event loop. There, the NAPI queues the javascript function and immediately waits for a response, but the javascript function hasn't executed yet, it has only been queued, so the program ends up deadlocking. This is a known limitation of working with javascript callbacks in the native layer, so this approach cannot be continued.

Unix sockets

The next idea I had was to use Unix sockets to communicate between the two layers. Even disregarding the work that would go into structuring the API (which would basically be a repeat of the work done in

First-party native integration

Another approach is to use first-party native integration. This would result in explicit bindings being available for native code, so we won't have to worry about weird interactions or limitations between the two layers. This, however, introduces its own slew of problems.

Logic/IO separation

The first approach was proposed by Roger. He mentioned that we can separate the code so that the logical part resides in the native rust layer while the IO is handled by javascript events. This allows for a clear separation of concerns and treats javascript as a wrapper for the logic defined by the rust code. This approach would require rust to interact with javascript, potentially in the same way as the callbacks approach, so a lot of design considerations need to be made.

Separate implementations

Brian's proposal involved implementing two separate layers for tracing. Basically, we keep javascript tracing as-is and implement a dedicated native tracer. Both would operate on their own separate loops and generate data separately. As we will be using

Both these approaches run into the same higher-level issue —

Fresh perspective

Roger and I had a discussion after sleeping on it. Here, Roger clarified the requirements and the problem statement. The purpose of this work is to find and patch the memory leak. Tracer is just a tool we were going to use, but if the leak can be found without it, then that would also mark the issue as completed. Thus, instead of working around or against the data and control flow defined by NAPI, we should work with it. The bindings of all objects need to be explicitly exposed to the javascript layer, so tracing can be applied directly in the javascript layer itself without worrying about interfacing. For rust layer tracing, we can use dedicated rust profilers to gain observability. We will not get data in the same format as what tracer provides and what the visualiser expects, but it would help us solve the bug. This also means we don't need to worry about symbols present in quiche which we cannot expose to javascript for manual tracing.

There's another problem that needs to be considered in the design phase. Lower-level languages like rust use a manual memory model, where memory is explicitly allocated and freed. Higher-level languages like javascript use a garbage collector to automatically free memory once it is no longer reachable by the code. As we will be exposing manually-managed heap objects to the garbage-collected javascript layer, there is a chance that the garbage collector has no idea of the actual object lifecycle. In native code, the heap buffer can simply be overwritten after the object it contained has been destroyed, but this is not propagated to the javascript layer, which can then read garbage data from an invalid memory location. Thus, we need to account for this scenario in our implementation.
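As a rough illustration of the kind of guard this implies, here is a minimal typescript sketch; the names (NativeStream, nativeRead, nativeDestroy) are hypothetical stand-ins, not the actual bindings.

```ts
// Hypothetical names for illustration only; these are not the real bindings.
type NativeStream = { __brand: 'NativeStream' };
declare function nativeRead(handle: NativeStream, buf: Uint8Array): number;
declare function nativeDestroy(handle: NativeStream): void;

class StreamWrapper {
  protected destroyed = false;

  constructor(protected handle: NativeStream) {}

  public read(buf: Uint8Array): number {
    if (this.destroyed) {
      // The rust side may have already freed or reused this memory, so reading
      // through a dead handle could return garbage data.
      throw new ReferenceError('stream already destroyed');
    }
    return nativeRead(this.handle, buf);
  }

  public destroy(): void {
    if (this.destroyed) return;
    this.destroyed = true;
    // Explicitly release the native allocation; the garbage collector only
    // ever sees the JS wrapper, never the rust-owned memory behind it.
    nativeDestroy(this.handle);
  }
}
```

The point is that releasing the native allocation has to be an explicit, observable event on the javascript side, since the garbage collector cannot see the lifecycle of rust-owned memory.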
This approach is what I will be implementing now.
Tracer isn't viable for debugging this as of now. With tracer, I was able to confirm that the bug is not present in the javascript layer, but nothing else. So, in the end, I fell back on using typical memory leak detectors like
We can generally ignore. I'm working on figuring out if the issue is in
The leak is clearly happening through QUIC's
To get to this output, this is what I did. First, I had to tell cargo to include debug symbols and prevent it from optimising the code, so that I get a full traceback of the symbols. I added this profile configuration to my Cargo.toml:

```toml
[profile.release]
debug = true
strip = false
opt-level = 0
lto = false
codegen-units = 1
incremental = false
```

Then, I cleaned and rebuilt the native library. I used a test script which creates a lot of streams, sends some data through each one, and closes it immediately.
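The shape of that script is roughly the following; this is a sketch with hypothetical connection and stream names, not the actual test code.

```ts
// Hypothetical API shape for illustration only; the real bindings differ.
declare const connection: {
  newStream(): {
    readable: ReadableStream<Uint8Array>;
    writable: WritableStream<Uint8Array>;
  };
};

async function stressStreams(iterations = 1000): Promise<void> {
  const data = new Uint8Array(1024).fill(0xff);
  for (let i = 0; i < iterations; i++) {
    const stream = connection.newStream();
    const writer = stream.writable.getWriter();
    // Send some data through the stream
    await writer.write(data);
    // Then tear the stream down immediately rather than using it gracefully
    await writer.close();
    await stream.readable.cancel();
  }
}
```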
These thousand iterations accumulate roughly 150 KiB of leaked memory, on the order of 150 bytes per stream. This information is acquired by the following command, which runs the program while capturing the state of the heap, and then opens the result in a visualiser.
We can look at the caller/callee tree in this GUI visualiser, but it doesn't carry enough information on its own. To see the details, I had to print the heaptrack data.
Then, I had to relate the hash shown in the memory leak (here being

Also note that the number of leaked allocations is not consistent across runs. For example, one run could have 700 leaked allocations, another could have 500. However, the leaked memory is roughly the same across runs, so the total amount of memory leaked is effectively deterministic even though the allocation count is not.
Good information so far. But for the code you're using for testing, we want to simulate graceful usage of the stream. So that means we don't want to cancel the readable. The steps should be
Okay, I've changed the test to do this instead, and it behaves exactly the same, with the same amount of memory being leaked.
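For reference, the graceful pattern looks roughly like this; again a sketch using the same hypothetical connection and stream names as the earlier one, not the actual test. Instead of cancelling the readable, the writable is closed and the readable is drained to completion.

```ts
// Assumes the same hypothetical `connection` declaration as the earlier sketch.
async function gracefulStreams(iterations = 1000): Promise<void> {
  const data = new Uint8Array(1024).fill(0xff);
  for (let i = 0; i < iterations; i++) {
    const stream = connection.newStream();
    const writer = stream.writable.getWriter();
    await writer.write(data);
    // Signal a clean end of the writable side
    await writer.close();
    // Drain the readable side to completion instead of cancelling it
    const reader = stream.readable.getReader();
    while (true) {
      const { done } = await reader.read();
      if (done) break;
    }
  }
}
```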
The leak explored here isn't accurate. Due to a mistake in the test itself, the streams were not being awaited properly, resulting in this leak. This is no longer the case. When checking the running memory usage using tools like

My guess is that most of the leaks shown here are occurring due to overhead from using quiche and aren't directly related to the issues we are encountering. It is a bit hard to see, but the tops of the spikes are slowly growing, indicating that there is a leak according to this Consumed Memory graph over time (~40 seconds, 5000 streams). To actually get to the bottom of this, I will need to use the chrome heap analysis tool with node's debug mode enabled so I can inspect heap snapshots manually. I can also look into incorporating runtime debug symbols so I can get a relative frame of reference to a known symbol.
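As a possible shortcut for the snapshot step (a sketch, not necessarily how it will end up being done), Node can also write heap snapshots programmatically via the built-in v8 module, and the resulting .heapsnapshot files can be loaded into the Chrome DevTools Memory tab for comparison.

```ts
import v8 from 'node:v8';

// Take a snapshot before and after the stream churn so the two can be diffed
// in Chrome DevTools to see which objects are accumulating.
const before = v8.writeHeapSnapshot();
// ... run the stream test here ...
const after = v8.writeHeapSnapshot();
console.log(`Snapshots written to ${before} and ${after}`);
```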
Description
This PR will integrate the experimental tracer from js-logger to trace the memory leak so that it can finally be patched. This should make the Polykey seednodes stable and robust. It will also act as a live-fire exercise for the tracer.
Issues Fixed
Tasks
Final checklist