
Conversation

jenseng
Contributor

@jenseng jenseng commented Aug 15, 2025

If you kick off multiple npx processes at the same time for the same non-local package(s), they can potentially install atop each other in the same npx cache directory. This can cause one or both processes to fail silently, or with various errors (e.g. TAR_ENTRY_ERROR, ENOTEMPTY, EJSONPARSE, MODULE_NOT_FOUND), depending on when/where something gets clobbered. See this issue for more context and previous discussion.

This pull request introduces a lock around reading and reifying the tree in the npx cache directory, so that concurrent npx executions for the same non-local package(s) can succeed. The lock mechanism is based on mkdir's atomicity and is loosely inspired by proper-lockfile, though inlined to give us more control and enable some improvements.
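For readers who haven't seen the pattern before, here's a minimal sketch of a mkdir-based lock, assuming Node's fs/promises API; the actual with-lock.js in this PR adds retries, staleness detection, and lock maintenance on top of this core idea.

```js
// Minimal sketch of a mkdir-based lock (illustrative only; the real
// with-lock.js adds retries, staleness detection, and lock maintenance).
const { mkdir, rm } = require('node:fs/promises')

async function withLock (lockDir, fn) {
  // mkdir without `recursive` is atomic: exactly one process succeeds,
  // everyone else gets EEXIST and knows the lock is already held.
  await mkdir(lockDir)
  try {
    return await fn()
  } finally {
    // Best-effort cleanup so the next process can acquire the lock.
    await rm(lockDir, { recursive: true, force: true })
  }
}
```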

References

Fixes #8224

@jenseng jenseng requested a review from a team as a code owner August 15, 2025 21:31
Contributor

@wesleytodd wesleytodd left a comment


LGTM

@wraithgar wraithgar self-assigned this Aug 20, 2025
@wraithgar
Member

Thank you @wesleytodd for helping review this!

@wesleytodd
Contributor

Haha, @jenseng is on my team. I swear I don't just go around reviewing random PRs. 😆

@wraithgar
Member

This probably deserves a little deeper thought. If this is the right solution, it may mean we can rethink how cacache itself does concurrent operations. That module is usually where concurrency problems w/ reification are sorted out. Typically it's done by reifying to a unique tmp dir and then relying on the fact that an awaited move operation is atomic. Arborist also provides some concurrency/caching mechanisms for packuments, which are passed through to pacote (which fetches them).
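For context, here's a rough sketch of the "reify into a unique tmp dir, then atomically move into place" pattern described above; this is not cacache's actual implementation, just an illustration of why the awaited move makes concurrent writers safe.

```js
// Rough sketch of the tmp-dir-plus-atomic-move pattern (not cacache's
// actual code). Writers never touch the final path until the content is
// complete, so readers only ever observe a finished result.
const { mkdtemp, writeFile, rename, rm } = require('node:fs/promises')
const path = require('node:path')

async function writeAtomically (dest, data) {
  // Stage in a tmp dir on the same filesystem as the destination, since
  // rename(2) is only atomic within a single filesystem.
  const tmpDir = await mkdtemp(path.join(path.dirname(dest), '.tmp-'))
  try {
    const tmpFile = path.join(tmpDir, 'content')
    await writeFile(tmpFile, data)
    await rename(tmpFile, dest)
  } finally {
    await rm(tmpDir, { recursive: true, force: true })
  }
}
```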

This npx problem is in the exact same problem space, but because it happens at a layer above reification itself (i.e. with concurrent whole-tree reification operations), these other methods don't really work.

So, if this PR is the right solution, is it also perhaps a more fitting solution for cacache? Can this be fixed in a way that works for a single reification of a tree AND concurrent reifications? Is it worth the effort to do so? They are solving two different problems: one is concurrency inside a single runtime and the other is concurrency across separate runtimes.

Finally, I think we're gonna have to either inline the concurrency logic from that package, or find a better one. The version of signal-exit that package pulls in is an old one, and we want to try to stay current on production dependencies. cacache already exports its tmp operations, so I suspect there can be at least some overlap between how we operate in these two use cases, and cacache can still own the "reification concurrency" space.

TLDR: this is a great start and I need some more focus time to see where the overlap here is w/ the other concurrency concerns in npm (package fetching and caching).

@jenseng
Contributor Author

jenseng commented Aug 20, 2025

Thanks for the detailed response @wraithgar! There's definitely a good bit of overlap with cacache's needs, so I'll be curious to see where your research leads. Let me know if I can assist at all, or if you think of other things you'd like to see in this PR in the meantime.

Finally, I think we're gonna have to either inline the concurrency logic from that package, or find a better one

A third option here would be to get proper-lockfile to update its signal-exit dependency. On a technical level that should be trivial, so I'm happy to open a PR there and see if it gets traction.

@wraithgar
Member

A third option here would be to get proper-lockfile to update its signal-exit dependency. On a technical level that should be trivial, so I'm happy to open a PR there and see if it gets traction.

Oh well yeah, of course. I made the classic mistake of thinking "no updates in 5 years" was because the project was not active, when "the project is complete" is also a very valid reason.

In the interest of getting things working, I'll probably do some research, but not so much that this gets overly delayed. My initial thought here is that this is separate enough for now that it's fine to land as-is, but I wanna be sure we're not missing something, without bogging the whole thing down by rewriting 3 separate packages.

@jenseng
Contributor Author

jenseng commented Sep 4, 2025

Oh well yeah, of course. I made the classic mistake of thinking "no updates in 5 years" was because the project was not active, when "the project is complete" is also a very valid reason.

Looking at it more closely, it might be a bit of both 😆😭 ... the project seems pretty feature complete, but also there hasn't been any movement on open issues/PRs in years, so I'm not altogether surprised there's been no response to my PR to bump signal-exit.

TBH I don't think it would be too bad to inline a mkdir-based locking method. Let me know if you like that idea, and if so I'll update the PR.

@jenseng
Contributor Author

jenseng commented Sep 10, 2025

Looking at it more closely, it might be a bit of both 😆😭 ... the project seems pretty feature complete, but also there hasn't been any movement on open issues/PRs in years, so I'm not altogether surprised there's been no response to my PR to bump signal-exit.

TBH I don't think it would be too bad to inline a mkdir-based locking method. Let me know if you like that idea, and if so I'll update the PR.

@wraithgar in light of proper-lockfile being inactive, I went ahead and rolled an inline implementation, let me know what you think! 🙏

@wraithgar
Member

This is using retry (the same version that promise-retry is using under the hood), and we're already using promise-retry for our other implementations of this in things like pacote, @npmcli/git, and make-fetch-happen. So using retry as-is isn't a big deal. This is why adding these packages to libnpmexec did not result in any changes to node_modules itself, just the lockfile.

I wonder if using promise-retry would clean any of this up? It does a bit of what you're doing here already. Something to consider. I'm ok w/ your decision either way; I just want to present it as a possibility for less complexity in withLock itself.
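To illustrate the suggestion (the tryAcquire helper and the option values here are hypothetical, not the PR's actual code), promise-retry can own the backoff loop while the acquisition step simply throws on contention:

```js
// Hypothetical sketch of letting promise-retry own the retry/backoff loop.
const { mkdir } = require('node:fs/promises')
const promiseRetry = require('promise-retry')

// Throws EEXIST if another process already holds the lock.
const tryAcquire = (lockDir) => mkdir(lockDir)

const acquireLock = (lockDir) =>
  promiseRetry((retry) => tryAcquire(lockDir).catch(retry), {
    retries: 10,     // illustrative values, not the PR's actual settings
    factor: 2,       // exponential backoff
    minTimeout: 100, // ms before the first retry
  })
```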

I also think going forward w/ this approach is the right first step. Concurrency in cacache is a wholly separate concern, and this isn't a lot of new complexity.

Finally, I want to think through the location of the .lock file. This PR takes the install directory, appends .lock, and makes that the lock filename. So if I were to run npx semver, that would end up with an npx cache looking like:

$ ls ~/.npm/_npx/
a9bef924e4cb6cdb/
a9bef924e4cb6cdb.lock

This is going to confuse things that have assumptions about the layout of the npx cache. For instance npm cache npx ls will now show:

$ npm cache npx ls
a9bef924e4cb6cdb: semver
a9bef924e4cb6cdb.lock: (empty/invalid)

The good news is that the actual contents of the npx cache install are pretty well defined. The only things in there should be a package.json, a package-lock.json, and node_modules/. It is a custom package made by npm, not the package being installed, so we will not collide with the contents of the installed packages by adding new content. I would suggest a concurrency.lock file that goes into the installDir itself. Both of these withLock calls happen after await mkdir(installDir, { recursive: true }), so we should be ok to optimistically create that file. If it fails, that is not something this new with-lock.js needs to worry about.

$ ls -al ~/.npm/_npx/a9bef924e4cb6cdb/
total 16
drwxr-xr-x   5 wraithgar  staff  160 Jan 22  2025 ./
drwxr-xr-x  19 wraithgar  staff  608 Sep 11 12:29 ../
drwxr-xr-x   5 wraithgar  staff  160 Jan 23  2025 node_modules/
-rw-r--r--   1 wraithgar  staff  559 Jan 23  2025 package-lock.json
-rw-r--r--   1 wraithgar  staff  107 Jan 23  2025 package.json
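In other words, a sketch of the suggested layout (the getLockPath helper name is hypothetical):

```js
// Sketch of the suggested lock location (hypothetical helper name).
// Before: `${installDir}.lock` sits next to the hash directory and
// confuses tools like `npm cache npx ls`.
// After: the lock lives inside the install dir, which npm itself owns.
const path = require('node:path')

const getLockPath = (installDir) => path.join(installDir, 'concurrency.lock')

// e.g. ~/.npm/_npx/a9bef924e4cb6cdb/concurrency.lock
```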

@jenseng
Contributor Author

jenseng commented Sep 11, 2025

Thanks for the feedback @wraithgar! I went ahead and switched to promise-retry, and now the lockfile gets created inside the installDir.

I did notice that on a previous run one of the jobs encountered several unexpected ECOMPROMISED errors in tests, so I'll take a closer look to see what can be improved there.

@jenseng
Contributor Author

jenseng commented Sep 12, 2025

Ok so mtimeMs being a Date object was load bearing?

Thought it was just due to an errant closing paren, though now it fails on Windows; I'll take a closer look 🤔

@jenseng
Contributor Author

jenseng commented Sep 12, 2025

I do want to think through the stale/Lock compromised code path for a moment. I either don't understand how that's working, or it's a potential problem if two concurrent npm processes are running and one takes longer than 5 seconds to reify.

Yeah, definitely open to suggestions on how to make this easier to follow. A couple of key points about what's happening here (a rough sketch follows the list):

  • a lock should only become stale if the holding process dies abnormally; under happy-path usage, it will touch the lock periodically, well within the 5-second cutoff. So a lock could be held for much longer than 5 seconds, and that normally wouldn't cause any issues for concurrent processes waiting to acquire it
  • lock compromise should only happen a couple ways:
    • something outside of withLock goes and touches/recreates/messes with the lock dir
    • the tiny race condition referenced in the comment, i.e.
      • there is a stale lock
      • two processes detect this and try to take it over at the same time
      • process one deletes the stale lock and recreates it
      • process two deletes the newly recreated lock and recreates it
      • now process one's lock is compromised
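To make that concrete, here's a rough sketch of the staleness check and lock maintenance described in the list above; it's deliberately simplified (the interval values and helper shapes are illustrative), and it omits the compromise detection that the real with-lock.js performs.

```js
// Rough sketch of staleness detection, stale-lock takeover, and lock
// maintenance (simplified; the real with-lock.js detects compromise too).
const { mkdir, rm, stat, utimes } = require('node:fs/promises')

const STALE_MS = 5000 // a lock untouched for 5 seconds is presumed abandoned
const TOUCH_MS = 1000 // the holder refreshes well inside that window

const isStale = async (lockDir) => {
  const { mtimeMs } = await stat(lockDir)
  return Date.now() - mtimeMs > STALE_MS
}

async function acquire (lockDir) {
  try {
    await mkdir(lockDir)
  } catch (err) {
    if (err.code !== 'EEXIST') throw err
    if (!(await isStale(lockDir))) throw err // still held; caller retries
    // Stale takeover: delete and recreate. If two processes get here at
    // the same time, one of them ends up holding a compromised lock;
    // that's the race described in the last bullet above.
    await rm(lockDir, { recursive: true, force: true })
    await mkdir(lockDir)
  }
  // Touch the lock periodically so waiting processes don't treat a
  // long-running (but healthy) reify as a stale lock.
  const interval = setInterval(() => {
    const now = new Date()
    utimes(lockDir, now, now).catch(() => {})
  }, TOUCH_MS)
  // Return a release function for the caller to await when done.
  return async () => {
    clearInterval(interval)
    await rm(lockDir, { recursive: true, force: true })
  }
}
```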

@jenseng
Contributor Author

jenseng commented Sep 12, 2025

Thought it was just due to an errant closing paren, though now it fails on Windows; I'll take a closer look 🤔

Oh heh, utimes expects a Date or epoch seconds; mtimeMs is epoch milliseconds.
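Concretely (a minimal illustration of the mismatch; the touchLock shape here is hypothetical, not the PR's exact code):

```js
// Minimal illustration of the seconds-vs-milliseconds mismatch.
const { stat, utimes } = require('node:fs/promises')

async function touchLock (lockDir) {
  const { mtimeMs } = await stat(lockDir) // epoch milliseconds, e.g. 1758044683859
  // Wrong: utimes interprets bare numbers as epoch *seconds*, so this
  // would push the mtime tens of thousands of years into the future:
  //   await utimes(lockDir, mtimeMs, mtimeMs)
  // Right: pass a Date (or divide the millisecond value by 1000):
  await utimes(lockDir, new Date(), new Date())
}
```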

@wraithgar
Member

wraithgar commented Sep 12, 2025

Oh heh, utimes expects a Date or epoch seconds; mtimeMs is epoch milliseconds.

Ok well that's totally on me for not checking. Sorry.

A couple key points about what's happening here

Got it, great. I think once things are green and we've removed the extra rmdir we are good to go here.

@wraithgar
Member

wraithgar commented Sep 12, 2025

CI is green, redundant rmdir is refactored away. This is probably good to go.

The failing test that passed w/ a re-run is a little concerning. It's an extreme test as far as practical applications go, though; do we feel confident shipping this, or do we need to track down a potential race condition there?

ETA: I will wait till Monday to merge this regardless, so there is plenty of time to reflect

@jenseng
Contributor Author

jenseng commented Sep 12, 2025

The failing test that passed w/ a re-run is a little concerning. It's an extreme test as far as practical applications go, though; do we feel confident shipping this, or do we need to track down a potential race condition there?

I've repro'd this locally; it's looking like:

  • there's a little more we need to handle wrt lock compromise during concurrent stale-lock takeover, i.e. the initial fs.stat can fail in maintainLock if the lock has just been deleted by another process. Essentially the same race condition described here, the only difference being that process two hasn't recreated the lock yet (see the sketch after this list)
  • when a test fails (e.g. due to the above), if we haven't awaited all of its withLock calls, then subsequent tests can fail, since they may still be running async code that calls mocked functions
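A rough sketch of tolerating that fs.stat failure during lock maintenance (the checkLock shape and mtime comparison here are hypothetical, not the PR's exact code):

```js
// Hypothetical sketch: treat a vanished lock as compromised instead of
// letting the bare ENOENT from fs.stat crash the maintenance logic.
const { stat } = require('node:fs/promises')

async function checkLock (lockDir, expectedMtimeMs) {
  try {
    const { mtimeMs } = await stat(lockDir)
    // If the mtime no longer matches what we last wrote, someone else
    // recreated the lock out from under us: it's compromised.
    return mtimeMs === expectedMtimeMs
  } catch (err) {
    // ENOENT just means another process deleted the (stale-looking) lock
    // a moment ago; report it as compromised rather than throwing.
    if (err.code === 'ENOENT') return false
    throw err
  }
}
```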

@jenseng
Contributor Author

jenseng commented Sep 12, 2025

Ok, I think that should do the trick... I've re-run the tests dozens of times without any failures (via a bash for loop locally, and via this modified GHA workflow run for Windows) 🤞

@wraithgar
Member

THANK YOU. I knew there was ... something with Windows and ENOENT, and a quick search wasn't coming up with anything. I was gonna go look at some of our other fs code to find it, but you got to it. That EBUSY error was what I was remembering.

@wraithgar
Member

I totally planned on landing this yesterday, but ... things are a little busy on other fronts in the Node ecosystem.

@wraithgar wraithgar merged commit 5db81c3 into npm:latest Sep 16, 2025
33 checks passed
@wraithgar
Member

Thank you so much for seeing this PR through to completion. It wasn't a trivial fix and it did take a few rounds of feedback for things that needed to be addressed. The "last mile" of a PR is always the hardest, and it was definitely worth the effort.

@github-actions github-actions bot mentioned this pull request Sep 16, 2025
@jenseng
Contributor Author

jenseng commented Sep 16, 2025

Thanks for all your help on this @wraithgar!

@jenseng
Contributor Author

jenseng commented Sep 16, 2025

Looks like there were some new failures around this on latest; I'm taking a look now 😢

@wraithgar
Member

They were on different platforms and Node versions, so if it is related it's a race condition. The most fun kind of condition.

@jenseng
Contributor Author

jenseng commented Sep 16, 2025

Looks like it's two different bugs:

  • on macOS, sometimes fs.stat gives us something like 1758044683858.999 when we expect 1758044683859. Yay floating points!
  • on Windows, sometimes we get an EPERM when two processes are trying to delete/recreate the same stale lock

I'll open up a new PR to make this more robust
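For the curious, a sketch of the second-level-granularity idea the follow-up leans on to sidestep the fractional-millisecond issue (the helper names are illustrative; the actual fix is in #8577):

```js
// Comparing mtimes at whole-second granularity makes the compromise check
// robust to fractional mtimeMs readings (e.g. 1758044683858.999 vs
// 1758044683859) regardless of the underlying filesystem.
// Helper names are illustrative, not the follow-up PR's actual code.
const toSeconds = (mtimeMs) => Math.floor(mtimeMs / 1000)
const sameMtime = (a, b) => toSeconds(a) === toSeconds(b)

console.log(sameMtime(1758044683858.999, 1758044683859)) // true
```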

@jenseng
Contributor Author

jenseng commented Sep 17, 2025

#8577 should do the trick 🤞

@jenseng jenseng deleted the concurrent-npx branch September 17, 2025 16:24
owlstronaut pushed a commit that referenced this pull request Sep 23, 2025
Various improvements to withLock and its tests

- fix Windows race conditions/errors
- use second-level granularity when detecting lock compromise; this
resolves a sporadic floating point issue under APFS, and makes this
generally more robust no matter the underlying file system
- improve touchLock logic so that it doesn't compromise the lock for the
next holder or keep running the interval when cleanup fails

## Testing notes
Fixes were verified via a [modified GHA
workflow](https://github.com/jenseng/cli/actions/runs/17803264354) that
ran all the tests 100 times 😅

## References
  Related to #8512
Successfully merging this pull request may close these issues.

[BUG] concurrent npx executions are not process-safe