
Conversation

jenseng
Contributor

@jenseng jenseng commented Aug 15, 2025

If you kick off multiple npx processes at the same time for the same non-local package(s), they can potentially install atop each other in the same npx cache directory. This can cause one or both processes to fail silently, or with various errors (e.g. TAR_ENTRY_ERROR, ENOTEMPTY, EJSONPARSE, MODULE_NOT_FOUND), depending on when/where something gets clobbered. See this issue for more context and previous discussion.

This pull request introduces a lock around reading and reifying the tree in the npx cache directory, so that concurrent npx executions for the same non-local package(s) can succeed. The lock mechanism is based on mkdir's atomicity and is loosely inspired by proper-lockfile, though inlined to give us more control and enable some improvements.
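For readers who haven't seen the pattern before, here's a minimal sketch of a mkdir-based lock, assuming Node's fs/promises API; the actual with-lock.js in this PR adds retries, staleness detection, and lock maintenance on top of this core idea.

```js
// Minimal sketch of a mkdir-based lock (illustrative only; the real
// with-lock.js adds retries, staleness detection, and lock maintenance).
const { mkdir, rm } = require('node:fs/promises')

async function withLock (lockDir, fn) {
  // mkdir without `recursive` is atomic: exactly one process succeeds,
  // everyone else gets EEXIST and knows the lock is already held.
  await mkdir(lockDir)
  try {
    return await fn()
  } finally {
    // Best-effort cleanup so the next process can acquire the lock.
    await rm(lockDir, { recursive: true, force: true })
  }
}
```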

References

Fixes #8224

@jenseng jenseng requested a review from a team as a code owner August 15, 2025 21:31
Contributor

@wesleytodd wesleytodd left a comment


LGTM

@wraithgar wraithgar self-assigned this Aug 20, 2025
@wraithgar
Member

Thank you @wesleytodd for helping review this!

@wesleytodd
Contributor

Haha, @jenseng is on my team. I swear I don't just go around reviewing random PRs. 😆

@wraithgar
Member

This probably deserves a little deeper thought. If this is the right solution, it may mean we can rethink how cacache itself does concurrent operations. That module is usually where concurrency problems w/ reification are sorted out. Typically it's done by reifying to a unique tmp dir and then relying on the fact that an awaited move operation is atomic. Arborist also provides some concurrency/caching mechanisms for packuments, which are passed through to pacote (which fetches them).
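For context, here's a rough sketch of the "reify into a unique tmp dir, then atomically move into place" pattern described above; this is not cacache's actual implementation, just an illustration of why the awaited move makes concurrent writers safe.

```js
// Rough sketch of the tmp-dir-plus-atomic-move pattern (not cacache's
// actual code). Writers never touch the final path until the content is
// complete, so readers only ever observe a finished result.
const { mkdtemp, writeFile, rename, rm } = require('node:fs/promises')
const path = require('node:path')

async function writeAtomically (dest, data) {
  // Stage in a tmp dir on the same filesystem as the destination, since
  // rename(2) is only atomic within a single filesystem.
  const tmpDir = await mkdtemp(path.join(path.dirname(dest), '.tmp-'))
  try {
    const tmpFile = path.join(tmpDir, 'content')
    await writeFile(tmpFile, data)
    await rename(tmpFile, dest)
  } finally {
    await rm(tmpDir, { recursive: true, force: true })
  }
}
```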

This npx problem is in the exact same problem space, but because it happens at a layer above reification itself (i.e. with concurrent whole-tree reification operations), these other methods don't really work.

So, if this PR is the right solution, is it also perhaps a more fitting solution for cacache? Can this be fixed in a way that works for a single reification of a tree AND concurrent reifications? Is it worth the effort to do so? They are solving two different problems: one is concurrency inside a single runtime and the other is concurrency across separate runtimes.

Finally, I think we're gonna have to either inline the concurrency logic from that package, or find a better one. The version of signal-exit that package pulls in is an old one, and we want to try to stay current on production dependencies. cacache already exports its tmp operations, so I suspect there can be at least some overlap between how we operate in these two use cases, and cacache can still own the "reification concurrency" space.

TLDR: this is a great start and I need some more focus time to see where the overlap here is w/ the other concurrency concerns in npm (package fetching and caching).

@jenseng
Contributor Author

jenseng commented Aug 20, 2025

Thanks for the detailed response @wraithgar! There's definitely a good bit of overlap with cacache's needs, so I'll be curious to see where your research leads. Let me know if I can assist at all, or if you think of other things you'd like to see in this PR in the meantime.

Finally, I think we're gonna have to either inline the concurrency logic from that package, or find a better one

A third option here would be to get proper-lockfile to update its signal-exit dependency. On a technical level that should be trivial, so I'm happy to open a PR there and see if it gets traction.

@wraithgar
Member

A third option here would be to get proper-lockfile to update its signal-exit dependency. On a technical level that should be trivial, so I'm happy to open a PR there and see if it gets traction.

Oh well yeah, of course. I made the classic mistake of thinking "no updates in 5 years" was because the project was not active, when "the project is complete" is also a very valid reason.

In the interest of getting things working, I'll probably do some research, but not so much that this gets overly delayed. My initial thought here is that this is separate enough for now that it's fine to land as-is, but I wanna be sure we're not missing something, without bogging the whole thing down by rewriting 3 separate packages.

@jenseng
Contributor Author

jenseng commented Sep 4, 2025

Oh well yeah, of course. I made the classic mistake of thinking "no updates in 5 years" was because the project was not active, when "the project is complete" is also a very valid reason.

Looking at it more closely, it might be a bit of both 😆😭 ... the project seems pretty feature complete, but also there hasn't been any movement on open issues/PRs in years, so I'm not altogether surprised there's been no response to my PR to bump signal-exit.

TBH I don't think it would be too bad to inline a mkdir-based locking method. Let me know if you like that idea, and if so I'll update the PR.

@jenseng
Contributor Author

jenseng commented Sep 10, 2025

Looking at it more closely, it might be a bit of both 😆😭 ... the project seems pretty feature complete, but also there hasn't been any movement on open issues/PRs in years, so I'm not altogether surprised there's been no response to my PR to bump signal-exit.

TBH I don't think it would be too bad to inline a mkdir-based locking method. Let me know if you like that idea, and if so I'll update the PR.

@wraithgar in light of proper-lockfile being inactive, I went ahead and rolled an inline implementation, let me know what you think! 🙏

@wraithgar
Member

This is using retry (the same version that promise-retry is using under the hood), and we're already using promise-retry for our other implementations of this in things like pacote, @npmcli/git, and make-fetch-happen. So using retry as-is isn't a big deal. This is why adding these packages to libnpmexec did not result in any changes to node_modules itself, just the lockfile.

I wonder if using promise-retry would clean any of this up? It does a bit of what you're doing here already. Something to consider. I'm ok w/ your decision either way; I just want to present it as a possibility for less complexity in withLock itself.
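To illustrate the suggestion (the tryAcquire helper and the option values here are hypothetical, not the PR's actual code), promise-retry can own the backoff loop while the acquisition step simply throws on contention:

```js
// Hypothetical sketch of letting promise-retry own the retry/backoff loop.
const { mkdir } = require('node:fs/promises')
const promiseRetry = require('promise-retry')

// Throws EEXIST if another process already holds the lock.
const tryAcquire = (lockDir) => mkdir(lockDir)

const acquireLock = (lockDir) =>
  promiseRetry((retry) => tryAcquire(lockDir).catch(retry), {
    retries: 10,     // illustrative values, not the PR's actual settings
    factor: 2,       // exponential backoff
    minTimeout: 100, // ms before the first retry
  })
```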

I also think going forward w/ this approach is the right first step. Concurrency in cacache is a wholly separate concern, and this isn't a lot of new complexity.

Finally, I want to think through the location of the .lock file. This PR takes the install directory, appends .lock, and makes that the lock filename. So if I were to run npx semver, that would end up with an npx cache looking like:

$ ls ~/.npm/_npx/
a9bef924e4cb6cdb/
a9bef924e4cb6cdb.lock

This is going to confuse things that have assumptions about the layout of the npx cache. For instance npm cache npx ls will now show:

$ npm cache npx ls
a9bef924e4cb6cdb: semver
a9bef924e4cb6cdb.lock: (empty/invalid)

The good news is that the actual contents of the npx cache install are pretty well defined. The only things in there should be a package.json, a package-lock.json, and node_modules/. It is a custom package made by npm, not the package being installed, so we will not collide with the contents of the installed packages by adding new content. I would suggest a concurrency.lock file that goes into the installDir itself. Both of these withLock calls happen after await mkdir(installDir, { recursive: true }), so we should be ok to optimistically create that file. If it fails, that is not something this new with-lock.js needs to worry about.

$ ls -al ~/.npm/_npx/a9bef924e4cb6cdb/
total 16
drwxr-xr-x   5 wraithgar  staff  160 Jan 22  2025 ./
drwxr-xr-x  19 wraithgar  staff  608 Sep 11 12:29 ../
drwxr-xr-x   5 wraithgar  staff  160 Jan 23  2025 node_modules/
-rw-r--r--   1 wraithgar  staff  559 Jan 23  2025 package-lock.json
-rw-r--r--   1 wraithgar  staff  107 Jan 23  2025 package.json
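In other words, a sketch of the suggested layout (the getLockPath helper name is hypothetical):

```js
// Sketch of the suggested lock location (hypothetical helper name).
// Before: `${installDir}.lock` sits next to the hash directory and
// confuses tools like `npm cache npx ls`.
// After: the lock lives inside the install dir, which npm itself owns.
const path = require('node:path')

const getLockPath = (installDir) => path.join(installDir, 'concurrency.lock')

// e.g. ~/.npm/_npx/a9bef924e4cb6cdb/concurrency.lock
```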

@jenseng
Contributor Author

jenseng commented Sep 11, 2025

Thanks for the feedback @wraithgar! I went ahead and switched to promise-retry, and now the lockfile gets created inside the installDir.

I did notice that on a previous run one of the jobs encountered several unexpected ECOMPROMISED errors in tests, so I'll take a closer look to see what can be improved there.

@jenseng
Contributor Author

jenseng commented Sep 12, 2025

Ok so mtimeMs being a Date object was load bearing?

Thought it was just due to an errant closing paren, though now it fails on Windows; I'll take a closer look 🤔

@jenseng
Contributor Author

jenseng commented Sep 12, 2025

I do want to think through the stale/Lock compromised code path for a moment. I either don't understand how that's working, or it's a potential problem if two concurrent npm processes are running and one takes longer than 5 seconds to reify.

Yeah, definitely open to suggestions on how to make this easier to follow. A couple of key points about what's happening here (a rough sketch follows the list):

  • a lock should only become stale if the holding process dies abnormally; under happy-path usage, it will touch the lock periodically, well within the 5-second cutoff. So a lock could be held for much longer than 5 seconds, and that normally wouldn't cause any issues for concurrent processes waiting to acquire it
  • lock compromise should only happen a couple ways:
    • something outside of withLock goes and touches/recreates/messes with the lock dir
    • the tiny race condition referenced in the comment, i.e.
      • there is a stale lock
      • two processes detect this and try to take it over at the same time
      • process one deletes the stale lock and recreates it
      • process two deletes the newly recreated lock and recreates it
      • now process one's lock is compromised
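To make that concrete, here's a rough sketch of the staleness check and lock maintenance described in the list above; it's deliberately simplified (the interval values and helper shapes are illustrative), and it omits the compromise detection that the real with-lock.js performs.

```js
// Rough sketch of staleness detection, stale-lock takeover, and lock
// maintenance (simplified; the real with-lock.js detects compromise too).
const { mkdir, rm, stat, utimes } = require('node:fs/promises')

const STALE_MS = 5000 // a lock untouched for 5 seconds is presumed abandoned
const TOUCH_MS = 1000 // the holder refreshes well inside that window

const isStale = async (lockDir) => {
  const { mtimeMs } = await stat(lockDir)
  return Date.now() - mtimeMs > STALE_MS
}

async function acquire (lockDir) {
  try {
    await mkdir(lockDir)
  } catch (err) {
    if (err.code !== 'EEXIST') throw err
    if (!(await isStale(lockDir))) throw err // still held; caller retries
    // Stale takeover: delete and recreate. If two processes get here at
    // the same time, one of them ends up holding a compromised lock;
    // that's the race described in the last bullet above.
    await rm(lockDir, { recursive: true, force: true })
    await mkdir(lockDir)
  }
  // Touch the lock periodically so waiting processes don't treat a
  // long-running (but healthy) reify as a stale lock.
  const interval = setInterval(() => {
    const now = new Date()
    utimes(lockDir, now, now).catch(() => {})
  }, TOUCH_MS)
  // Return a release function for the caller to await when done.
  return async () => {
    clearInterval(interval)
    await rm(lockDir, { recursive: true, force: true })
  }
}
```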

@jenseng
Contributor Author

jenseng commented Sep 12, 2025

Thought it was just due to an errant closing paren, though now it fails on Windows; I'll take a closer look 🤔

Oh heh, utimes expects a Date or epoch seconds; mtimeMs is epoch milliseconds.
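Concretely (a minimal illustration of the mismatch; the touchLock shape here is hypothetical, not the PR's exact code):

```js
// Minimal illustration of the seconds-vs-milliseconds mismatch.
const { stat, utimes } = require('node:fs/promises')

async function touchLock (lockDir) {
  const { mtimeMs } = await stat(lockDir) // epoch milliseconds, e.g. 1758044683859
  // Wrong: utimes interprets bare numbers as epoch *seconds*, so this
  // would push the mtime tens of thousands of years into the future:
  //   await utimes(lockDir, mtimeMs, mtimeMs)
  // Right: pass a Date (or divide the millisecond value by 1000):
  await utimes(lockDir, new Date(), new Date())
}
```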

@wraithgar
Member

wraithgar commented Sep 12, 2025

Oh heh, utimes expects a Date or epoch seconds; mtimeMs is epoch milliseconds.

Ok well that's totally on me for not checking. Sorry.

A couple key points about what's happening here

Got it, great. I think once things are green and we've removed the extra rmdir we are good to go here.

@wraithgar
Member

wraithgar commented Sep 12, 2025

CI is green, redundant rmdir is refactored away. This is probably good to go.

The failing test that passed w/ a re-run is a little concerning. It's an extreme test as far as practical applications go, though; do we feel confident shipping this, or do we need to track down a potential race condition there?

ETA: I will wait till Monday to merge this regardless, so there is plenty of time to reflect

@jenseng
Contributor Author

jenseng commented Sep 12, 2025

The failing test that passed w/ a re-run is a little concerning. It's an extreme test as far as practical applications go, though; do we feel confident shipping this, or do we need to track down a potential race condition there?

I've repro'd this locally; it's looking like:

  • there's a little more we need to handle wrt lock compromise during concurrent stale-lock takeover, i.e. the initial fs.stat can fail in maintainLock if the lock has just been deleted by another process. Essentially the same race condition described here, the only difference being that process two hasn't recreated the lock yet (see the sketch after this list)
  • when a test fails (e.g. due to the above), if we haven't awaited all of its withLock calls, then subsequent tests can fail, since they may still be running async code that calls mocked functions
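A rough sketch of tolerating that fs.stat failure during lock maintenance (the checkLock shape and mtime comparison here are hypothetical, not the PR's exact code):

```js
// Hypothetical sketch: treat a vanished lock as compromised instead of
// letting the bare ENOENT from fs.stat crash the maintenance logic.
const { stat } = require('node:fs/promises')

async function checkLock (lockDir, expectedMtimeMs) {
  try {
    const { mtimeMs } = await stat(lockDir)
    // If the mtime no longer matches what we last wrote, someone else
    // recreated the lock out from under us: it's compromised.
    return mtimeMs === expectedMtimeMs
  } catch (err) {
    // ENOENT just means another process deleted the (stale-looking) lock
    // a moment ago; report it as compromised rather than throwing.
    if (err.code === 'ENOENT') return false
    throw err
  }
}
```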

@jenseng
Contributor Author

jenseng commented Sep 12, 2025

Ok, I think that should do the trick... I've re-run the tests dozens of times without any failures (via a bash for loop locally, and via this modified GHA workflow run for Windows) 🤞

@wraithgar
Member

THANK YOU. I knew there was ... something with Windows and ENOENT, and a quick search wasn't coming up with anything. I was gonna go look at some of our other fs code to find it, but you got to it. That EBUSY error was what I was remembering.

@wraithgar
Member

I totally planned on landing this yesterday, but ... things are a little busy on other fronts in the Node ecosystem.

@wraithgar wraithgar merged commit 5db81c3 into npm:latest Sep 16, 2025
33 checks passed
@wraithgar
Member

Thank you so much for seeing this PR through to completion. It wasn't a trivial fix and it did take a few rounds of feedback for things that needed to be addressed. The "last mile" of a PR is always the hardest, and it was definitely worth the effort.

@github-actions github-actions bot mentioned this pull request Sep 16, 2025
@jenseng
Contributor Author

jenseng commented Sep 16, 2025

Thanks for all your help on this @wraithgar!

@jenseng
Contributor Author

jenseng commented Sep 16, 2025

Looks like there were some new failures around this on latest; I'm taking a look now 😢

@wraithgar
Member

They were on different platforms and Node versions, so if it is related it's a race condition. The most fun kind of condition.

@jenseng
Contributor Author

jenseng commented Sep 16, 2025

Looks like it's two different bugs:

  • on macOS, sometimes fs.stat gives us something like 1758044683858.999 when we expect 1758044683859. Yay floating points!
  • on Windows, sometimes we get an EPERM when two processes are trying to delete/recreate the same stale lock

I'll open up a new PR to make this more robust
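For the curious, a sketch of the second-level-granularity idea the follow-up leans on to sidestep the fractional-millisecond issue (the helper names are illustrative; the actual fix is in #8577):

```js
// Comparing mtimes at whole-second granularity makes the compromise check
// robust to fractional mtimeMs readings (e.g. 1758044683858.999 vs
// 1758044683859) regardless of the underlying filesystem.
// Helper names are illustrative, not the follow-up PR's actual code.
const toSeconds = (mtimeMs) => Math.floor(mtimeMs / 1000)
const sameMtime = (a, b) => toSeconds(a) === toSeconds(b)

console.log(sameMtime(1758044683858.999, 1758044683859)) // true
```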

@jenseng
Contributor Author

jenseng commented Sep 17, 2025

#8577 should do the trick 🤞

@jenseng jenseng deleted the concurrent-npx branch September 17, 2025 16:24
owlstronaut pushed a commit that referenced this pull request Sep 23, 2025
Various improvements to withLock and its tests

- fix Windows race conditions/errors
- use second-level granularity when detecting lock compromise; this
resolves a sporadic floating point issue under APFS, and makes this
generally more robust no matter the underlying file system
- improve touchLock logic so that it doesn't compromise the lock for the
next holder or keep running the interval when cleanup fails

## Testing notes
Fixes were verified via a [modified GHA
workflow](https://github.com/jenseng/cli/actions/runs/17803264354) that
ran all the tests 100 times 😅

## References
  Related to #8512
Successfully merging this pull request may close these issues.

[BUG] concurrent npx executions are not process-safe