
Patching memory leak using tracer #153

Open
wants to merge 7 commits into base: staging

Conversation

aryanjassal
Member

@aryanjassal aryanjassal commented May 21, 2025

Description

This PR will integrate the experimental tracer from js-logger to track down the memory leak and finally patch it. This should make the polykey seednodes stable and robust. It will also act as a live-fire exercise for the tracer.

Issues Fixed

Tasks

  • 1. Integrate tracer into js code
  • 2. Integrate tracer into native code
  • 3. Replicate memory leak with test/script
  • 4. Use tracer to visualise where the leak is coming from

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@aryanjassal aryanjassal self-assigned this May 21, 2025

linear bot commented May 21, 2025

ENG-626

@aryanjassal
Member Author

A quick test of object lifecycles with the tracer has confirmed that no object is being leaked, although the parent/child relationship is still somewhat broken. I'm working on fixing that right now.

Member Author

Brian, Roger, and I had a discussion about the different approaches I can take to observing the state of the native rust layer of the library. A lot of solutions and ideas were explored.

Callbacks

The first approach was keeping the interface and the API consistent by introducing callbacks in the native layer which invoke javascript functions to perform and trigger tracing. This would have worked if not for a caveat: javascript can hook into rust functions by passing in parameters and getting return values back, but this is a one-way street. The rust bindings are used by the javascript layer, but rust cannot pull in javascript functions and execute them. This is because javascript uses an event loop to track what to run and when, while rust has a completely different control and logic flow, so the rust and javascript execution contexts are fundamentally disconnected.

Technically, it's possible to run javascript code from rust, as NAPI exposes a way to do so, but the executed callbacks are meant to be fire-and-forget, not something we can wait on for return values. This isn't an issue for stopping a span, but the start-span callback returns a span identifier. Thus, the rust code would look something like this.

pub fn javascript_callback(callback: &ThreadsafeFunction<()>) {
  // ...
  let (tx, rx) = std::sync::mpsc::channel::<u64>(); // channel for the span id
  // This only queues the JS callback on the event loop; it has not run yet,
  // because this very rust task is currently occupying the loop.
  callback.call(Ok(()), ThreadsafeFunctionCallMode::Blocking);
  // Deadlock: we block waiting for a value that the queued JS callback
  // (which would send through `tx`) can never produce until we yield.
  let result = rx.recv().unwrap();
}

This code is simplified, but it conveys the idea. The rust code runs as a single task inside the event loop. There, NAPI queues the javascript function and immediately waits for a response, but the javascript function hasn't executed yet, it has only been queued, so the program ends up deadlocking. This is a known limitation of working with javascript callbacks in the native layer, so this approach cannot be pursued.
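The same failure mode can be expressed in pure javascript terms (a contrived sketch, not code from this PR): if a task blocks the event loop while waiting on work that is only queued on that same loop, the queued work can never run.

// Spin without yielding; nothing queued on the event loop can run meanwhile.
function blockingWait(check: () => boolean): void {
  while (!check()) {
    // busy-wait: the event loop never gets control back
  }
}

let ready = false;
setTimeout(() => { ready = true; }, 0); // queued on the loop, never executes
blockingWait(() => ready); // hangs forever, just like the rust sketch above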

Unix sockets

The next idea I had was to use Unix sockets to communicate between the two layers. Even disregarding the work that would go into structuring the API (which would basically repeat the work done in js-rpc), the idea is still a bad one: the two layers are already linked in the same process, so reaching for an IPC mechanism to communicate between them is needless indirection.

First-party native integration

Another approach is to use first-party native integration. This would make explicit bindings available for the native code, so we wouldn't have to worry about weird interactions or limitations between the layers. This, however, introduces its own slew of problems.

Logic/IO separation

The first approach was proposed by Roger. He suggested that we separate the code so that the logical part resides in the native rust layer while the IO is handled by javascript events. This allows for a clear separation of concerns and treats javascript as a wrapper for the logic defined by the rust code. This approach would still require rust to interact with javascript, potentially in the same way as the callbacks approach, so a lot of design considerations need to be made.
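As a very rough illustration of that split (all names here are hypothetical, not an agreed design), the javascript layer would own the UDP socket and only shuttle bytes, while a native logic object decides what to do with them:

import dgram from 'node:dgram';

// Hypothetical shape of a native logic object exposed through NAPI bindings.
interface NativeLogic {
  onDatagram(data: Uint8Array): void; // feed inbound packets into rust
  poll(): Uint8Array | null; // ask rust what should be sent next
}

// The javascript layer only handles IO: it moves bytes between the socket
// and the logic object, and never implements protocol behaviour itself.
function bindIo(socket: dgram.Socket, logic: NativeLogic): void {
  socket.on('message', (msg, rinfo) => {
    logic.onDatagram(msg);
    let out: Uint8Array | null;
    while ((out = logic.poll()) != null) {
      socket.send(out, rinfo.port, rinfo.address);
    }
  });
}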

Separate implementations

Brian's proposal involved implementing two separate layers for tracing. Basically, we keep javascript tracing as-is and implement a dedicated native tracer. Both would operate on their own separate loops and generate data separately. As we will be using IdSortable, we will be able to reconcile and merge the outputs of both into a single log file.
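As a rough sketch of that reconciliation step (the event shape here is assumed, and real IdSortable ids from @matrixai/js-id are binary rather than strings), merging the two logs is just an ordered merge on the sortable id:

// Hypothetical trace event: `id` is assumed to be lexicographically sortable
// (as IdSortable-style ids are), so ordering by id recovers creation order.
type TraceEvent = { id: string; layer: 'js' | 'native'; message: string };

// Merge two already-sorted logs (one from the javascript tracer, one from the
// native tracer) into a single chronologically ordered log.
function mergeTraces(js: TraceEvent[], native: TraceEvent[]): TraceEvent[] {
  const merged: TraceEvent[] = [];
  let i = 0;
  let j = 0;
  while (i < js.length || j < native.length) {
    if (j >= native.length || (i < js.length && js[i].id <= native[j].id)) {
      merged.push(js[i++]);
    } else {
      merged.push(native[j++]);
    }
  }
  return merged;
}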


Both of these approaches run into the same higher-level issue: IdSortable. It is provided by @matrixai/js-id, which is a javascript-only library, so the rust code would still need to pull data from javascript, and we would run into the same issue as before.

Fresh perspective

Roger and I had a discussion after sleeping on it. Roger clarified the requirements and the problem statement: the purpose of this work is to find and patch the memory leak. The tracer is just a tool we were going to use, and if the job can be done without it, that would still mark the issue as completed. Thus, instead of working around or against the data and control flow defined by NAPI, we should work with it. The bindings of all objects need to be explicitly exposed to the javascript layer, so tracing can be applied directly in the javascript layer itself without worrying about interfacing.
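For illustration, a minimal sketch of what tracing purely at the javascript layer could look like; the Tracer/Span interfaces and the connection shape here are placeholders, not the actual js-logger or js-quic APIs.

// Placeholder tracer interface; the real js-logger tracer API may differ.
interface Span { stop(): void; }
interface Tracer { startSpan(name: string): Span; }

// Wrap an exposed native binding so the span starts and stops entirely in
// javascript, with no callbacks crossing back into rust.
async function tracedStreamSend(
  tracer: Tracer,
  connection: { streamSend(id: number, data: Uint8Array): Promise<number> },
  streamId: number,
  data: Uint8Array,
): Promise<number> {
  const span = tracer.startSpan('connection.streamSend');
  try {
    return await connection.streamSend(streamId, data);
  } finally {
    span.stop();
  }
}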

To trace the rust layer, we can use dedicated rust profilers to gain observability. We will not get data in the same format as what the tracer provides and the visualiser expects, but it would still help us solve the bug. It also means we don't need to worry about symbols inside quiche which we cannot expose to javascript for manual tracing.

There's another problem that needs to be considered in the design phase. Lower-level languages like rust use a manual memory model, where memory is explicitly allocated and freed. Higher-level languages like javascript use a garbage collector to automatically free memory once it is no longer reachable by the code. As we will be exposing manually-managed heap objects to the garbage-collected javascript layer, there is a chance that the garbage collector has no idea of the actual object lifecycle. In native code, a heap buffer can simply be overwritten after the object it contained has been destroyed, but this is not propagated to the javascript layer, which can then read garbage data from an invalid memory location. Thus, we need to take care of this scenario in our implementation.
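A minimal sketch of handling that, assuming a hypothetical NativeStream binding with an explicit free(): the wrapper tracks whether the native side has been released, so javascript never reads from memory the native layer has already freed.

// Hypothetical native binding; only its shape matters for this sketch.
type NativeStream = { read(buf: Uint8Array): number; free(): void };

class StreamHandle {
  protected freed = false;
  constructor(protected native: NativeStream) {}

  public read(buf: Uint8Array): number {
    // Guard every native access so a use-after-free becomes a javascript
    // error instead of a read from invalid memory.
    if (this.freed) throw new Error('Stream has already been freed');
    return this.native.read(buf);
  }

  public destroy(): void {
    // Release the native object explicitly rather than relying on GC timing.
    if (!this.freed) {
      this.native.free();
      this.freed = true;
    }
  }
}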

This approach is what I will be implementing now.

Member Author

The tracer isn't viable for debugging this as of now. With the tracer, I was able to confirm that the bug is not present in the javascript layer, but nothing more. So, in the end, I fell back on typical memory leak detectors like valgrind or heaptrack. I ended up using heaptrack to trace the code, and have confirmed that there definitely is a memory leak in the rust code.

heaptrack stats:
	allocations:          	6815926
	leaked allocations:   	618
	temporary allocations:	1492508
Heaptrack finished!

[image: heaptrack results]

We can generally ignore the v8:: scoped leaks as they are unlikely to be the cause of the memory leaks here. What is of particular importance is the unresolved function and alloc::alloc::alloc, indicating that the leaked allocations originate in the rust layer.

I'm working on figuring out whether the issue is in js-quic or quiche, but given that alloc::alloc::alloc is leaking, either something is leaking inside quiche or js-quic is mismanaging some state.

Member Author

547.82K peak memory consumed over 19017 calls from
alloc::alloc::alloc::hd6fb3bd214935af9
  at /build/rustc-1.77.2-src/library/alloc/src/alloc.rs:98
  in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
alloc::alloc::Global::alloc_impl::hf22c66c92301f94a
  at /build/rustc-1.77.2-src/library/alloc/src/alloc.rs:181
386.06K consumed over 2 calls from:
    _$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::allocate::h72f444ab5665945f
      at /build/rustc-1.77.2-src/library/alloc/src/alloc.rs:241
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::raw::alloc::inner::do_alloc::h05c1c9964173c76c
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/alloc.rs:15
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::raw::RawTableInner::new_uninitialized::h04185919e4520fe6
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:1752
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::raw::RawTableInner::fallible_with_capacity::hc60b62db5a759de4
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:1790
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::raw::RawTableInner::prepare_resize::h856fb8d96c0104dc
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:2869
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::raw::RawTableInner::resize_inner::h5734c6ebce825348
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:3065
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::raw::RawTableInner::reserve_rehash_inner::h02c7332d0072754b
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:2955
    hashbrown::raw::RawTable$LT$T$C$A$GT$::reserve_rehash::he043065fb3ccd23d
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:1233
    hashbrown::raw::RawTable$LT$T$C$A$GT$::reserve::ha5daa5c8aa9f2525
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:1181
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::map::HashMap$LT$K$C$V$C$S$C$A$GT$::reserve::h75404fbadb979caa
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/map.rs:1106
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::rustc_entry::_$LT$impl$u20$hashbrown..map..HashMap$LT$K$C$V$C$S$C$A$GT$$GT$::rustc_entry::h03f7fef4df72c2a6
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/rustc_entry.rs:46
    std::collections::hash::map::HashMap$LT$K$C$V$C$S$GT$::entry::hbdc371e735fcb947
      at /build/rustc-1.77.2-src/library/std/src/collections/hash/map.rs:856
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quiche::stream::StreamMap::get_or_create::h74d54b7613720d99
      at src/stream.rs:222
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quiche::Connection::get_or_create_stream::h9c9dc6a8b8b2d6eb
      at src/lib.rs:6585
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quiche::Connection::stream_send::h3942e29571d828c2
      at src/lib.rs:4562
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quic::connection::Connection::stream_send::he68a9132a1c70ae8
      at src/native/napi/connection.rs:400
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quic::connection::__napi_impl_helper__Connection__1::__napi__stream_send::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h395f50a9ae562376
      at src/native/napi/connection.rs:186
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    napi::tokio_runtime::within_runtime_if_available::hee2811d803013358
      at /home/aryanj/.cargo/registry/src/index.crates.io-6f17d22bba15001f/napi-2.10.1/src/tokio_runtime.rs:69
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quic::connection::__napi_impl_helper__Connection__1::__napi__stream_send::_$u7b$$u7b$closure$u7d$$u7d$::h629475faa68d95e1
      at src/native/napi/connection.rs:186
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    core::result::Result$LT$T$C$E$GT$::and_then::h54bcef81d4270c5d
      at /build/rustc-1.77.2-src/library/core/src/result.rs:1316
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quic::connection::__napi_impl_helper__Connection__1::__napi__stream_send::h6d1ebca11e523cd6
      at src/native/napi/connection.rs:186
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node

The leak is clearly happening through stream_send, which propagates into quiche. Since I have tested sending thousands of messages through a QUIC stream, the message sending itself is clearly not the cause. My current read, with this information, is that quiche is not closing the streams properly.

Member Author

To get to this output, here is what I did.

First, I had to tell cargo to include debug symbols and prevent it from optimising the code, so I could get a full backtrace with symbols. I added this to my Cargo.toml to do that.

[profile.release]
debug = true
strip = false
opt-level = 0
lto = false
codegen-units = 1
incremental = false

Then, I cleaned and rebuilt the native library. I used a test script which creates a lot of streams, sends some data through each, and closes them immediately.

for (let i = 0; i < 1000; i++) {
  const stream = client.connection.newStream();
  const writer = stream.writable.getWriter();
  const reader = stream.readable.getReader();
  await writer.write(message);
  await reader.read();
  await writer.close();
  await reader.cancel();
}

These thousand iterations accumulate roughly 150 KiB of leaked memory. This was measured with the following command, which runs the program while capturing the state of the heap and then opens the result in a visualiser.

heaptrack node ./scripts/memtest.mjs

[image: heaptrack GUI visualiser]

We can look at the caller/callee tree in this GUI visualiser, but it doesn't have enough information. To get the full call stacks, I had to print the heaptrack data.

heaptrack_print heaptrack.node.<pid>.zst > heaptrack.txt
nvim heaptrack.txt

Then, I had to relate the hash shown in the memory leak (here hd6fb3bd214935af9) to the hash in the print dump. There, I was able to see a backtrace through the call stack, which is what was included in the previous comment.

Also note that the number of leaked allocations is not consistent across runs. For example, one run could have 700 leaked allocations and another could have 500. However, the leaked memory is roughly the same across runs, so the total amount of memory leaked is effectively deterministic.

@tegefaulkes
Contributor

Good information so far. But for the code you're using for testing, we want to simulate graceful usage of the stream. So that means we don't want to cancel the readable.

The steps should be

  1. new stream
  2. write message
  3. close writable
  4. read readable until completion
  5. repeat.

Member Author

Okay, I've changed the test to do this instead, and it behaves exactly the same, with the same amount of memory being leaked.

  for (let i = 0; i < 1000; i++) {
    const stream = client.connection.newStream();
    const writer = stream.writable.getWriter();
    await writer.write(message);
    await writer.close();
    const reader = stream.readable.getReader();
    let finished = false;
    while (!finished) {
      finished = (await reader.read()).done;
    }
  }

@aryanjassal
Member Author

aryanjassal commented Jun 4, 2025

I've discovered another quick finding. There are two main allocation points responsible for the resource leaks. One of the alloc leaks is unaffected by the amount or size of data transmitted, while the other does grow with the size/amount of data transmitted.

[image: heaptrack of test with 22000 transmitted bytes]

[image: heaptrack of test with 0 transmitted bytes]

The memory leak for the alloc call on h39d... is consistently 49.7 KiB, but the other call, hd6f..., grows with larger messages. This means that the leak is most likely occurring because the streams are not being closed or cleaned up properly. I believe Brian also suspected this outcome from the beginning.

@aryanjassal
Member Author

There is no way to explicitly close a quiche stream. It will automatically close if either:

  • Both sides sent FIN
  • Either side sent RESET_STREAM

From my exploration, it seems the code is handling that correctly, but obviously something is going wrong.

[image]

Also, once I reduced the number of streams but kept the data constant (around 44 bytes), I got this map. This confirms that one of the leaks is directly tied to the stream count and the other is directly tied to the data amount, reinforcing the idea that the leak comes from improper closing of quiche streams.

@aryanjassal
Member Author

aryanjassal commented Jun 4, 2025

The leak explored here isn't an accurate picture. Due to a mistake in the test itself, the streams were not being awaited properly, which produced this leak. That is no longer the case. When checking the running memory usage with tools like btop (set to a short update interval like 100ms), I could see the memory grow, yet the heaptrack results consistently showed the same 'leaks'.

[image: heaptrack results]

My guess is that most of the leaks shown here are occurring due to overhead from using quiche and aren't directly related to the issue we are encountering.

[image: Consumed Memory graph]

It is a bit hard to see, but the tops of the spikes are slowly growing, indicating that there is a leak according to this Consumed Memory graph over time (~40 seconds, 5000 streams).

To actually get to the bottom of this, I will need to use the Chrome heap analysis tools with node's inspector enabled so I can take and inspect heap snapshots manually. I can also look into incorporating runtime debug symbols so I have a relative frame of reference to a known symbol.
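One possible starting point for that, using node's built-in v8 module instead of attaching Chrome manually (the memtest loop from earlier is assumed); note that a V8 heap snapshot only covers the javascript-managed heap, so native rust allocations will only show up indirectly.

import v8 from 'node:v8';

// Snapshot the heap before and after the stream churn; the two files can be
// diffed in Chrome DevTools (Memory tab -> Load profile).
v8.writeHeapSnapshot('./before.heapsnapshot');
// ... run the stream create/write/close loop from the test script here ...
(globalThis as any).gc?.(); // only available when node is run with --expose-gc
v8.writeHeapSnapshot('./after.heapsnapshot');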

Successfully merging this pull request may close these issues.

Integrate tracer to detect memory leaks