fix: allow concurrent non-local npx calls #8512
Conversation
LGTM
Thank you @wesleytodd for helping review this!
Haha, @jenseng is on my team. I swear I don't just go around reviewing random PRs. 😆
This probably deserves a little deeper thought. If this is the right solution, it may mean we can rethink how cacache handles this. This npx problem is the exact same problem space, but because it is happening at a layer above reification itself (i.e. w/ concurrent whole-reification operations), those methods don't really work. So, if this PR is the right solution, is it also perhaps a more fitting solution for cacache? Can this be fixed in a way that works for a single reification of a tree AND concurrent reifications? Is it worth the effort to do so? They are solving two different problems: one is concurrency inside a single runtime and the other is concurrency across separate runtimes.

Finally, I think we're gonna have to either inline the concurrency logic from that package, or find a better one. The "other version" of signal-exit is an old one and we want to try to stay current on production dependencies.

TLDR: this is a great start and I need some more focus time to see where the overlap here is w/ the other concurrency concerns in npm (package fetching and caching).
Thanks for the detailed response @wraithgar! There's definitely a good bit of overlap with cacache's needs, so I'll be curious to see where your research leads. Let me know if I can assist at all, or if you think of other things you'd like to see in this PR in the meantime.
A third option here would be to get proper-lockfile to update its signal-exit dependency. On a technical level that should be trivial, so I'm happy to open a PR there and see if it gets traction.
Oh well yeah, of course. I made the classic mistake of thinking "no updates in 5 years" was because the project was not active, when "the project is complete" is also a very valid reason. In the interest of getting things working I probably will do some research, but not so much that this gets overly delayed. My initial thought here is that this is separate enough for now that it's fine to land as-is, but I wanna be sure there's not something we're missing, while also not bogging the whole thing down rewriting 3 separate packages.
Looking at it more closely, it might be a bit of both 😆😭 ... the project seems pretty feature complete, but also there hasn't been any movement on open issues/PRs in years, so I'm not altogether surprised there's been no response to my PR to bump signal-exit. TBH I don't think it would be too bad to inline a lock implementation ourselves.
@wraithgar in light of proper-lockfile being inactive, I went ahead and rolled an inline implementation, let me know what you think! 🙏
This is using its own retry logic. I wonder if using `promise-retry` would be a better fit here. I also think going forward w/ this approach is the right first step. Concurrency in the rest of npm (package fetching and caching) is a bigger problem we can tackle separately.

Finally, I want to think through the location of the lockfile:

```
$ ls ~/.npm/_npx/
a9bef924e4cb6cdb/
a9bef924e4cb6cdb.lock
```

This is going to confuse things that have assumptions about the layout of the npx cache. For instance:

```
$ npm cache npx ls
a9bef924e4cb6cdb: semver
a9bef924e4cb6cdb.lock: (empty/invalid)
```

The good news is that the actual contents of the npx cache install are pretty well defined. The only thing in there should be a `package.json`, a `package-lock.json`, and a `node_modules` directory:

```
$ ls -al ~/.npm/_npx/a9bef924e4cb6cdb/
total 16
drwxr-xr-x   5 wraithgar  staff  160 Jan 22  2025 ./
drwxr-xr-x  19 wraithgar  staff  608 Sep 11 12:29 ../
drwxr-xr-x   5 wraithgar  staff  160 Jan 23  2025 node_modules/
-rw-r--r--   1 wraithgar  staff  559 Jan 23  2025 package-lock.json
-rw-r--r--   1 wraithgar  staff  107 Jan 23  2025 package.json
```
Thanks for the feedback @wraithgar! I went ahead and switched to `promise-retry`, and now the lockfile gets created inside the `installDir`.

I did notice that on a previous run one of the jobs encountered several unexpected `ECOMPROMISED` errors in tests, so I'll take a closer look to see what can be improved there.
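For illustration, here's a minimal sketch of what that kind of acquisition could look like: an mkdir-based lock inside the install directory, retried via `promise-retry`. The helper name `acquireLock`, the `.lock` name, and the retry options are assumptions for the sake of the example, not the PR's actual code.

```js
// Minimal sketch (not the actual implementation): acquire an mkdir-based
// lock inside installDir, retrying with promise-retry while another
// process holds it.
const { mkdir } = require('node:fs/promises')
const { join } = require('node:path')
const promiseRetry = require('promise-retry')

const acquireLock = (installDir) => {
  const lockDir = join(installDir, '.lock')
  return promiseRetry((retry) => {
    // mkdir is atomic: exactly one concurrent caller succeeds,
    // everyone else gets EEXIST and retries with backoff
    return mkdir(lockDir).then(() => lockDir, (err) => {
      if (err.code === 'EEXIST') {
        return retry(err)
      }
      throw err
    })
  }, { retries: 10, minTimeout: 100, factor: 2 })
}
```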
Co-authored-by: Gar <[email protected]>
Thought it was just due to an errant closing paren, though now it fails on Windows. I'll take a closer look 🤔
Yeah, definitely open to suggestions on how to make this easier to follow. A couple key points about what's happening here:
Oh heh, `utimes` expects a Date or epoch seconds; `mtimeMs` is epoch milliseconds.
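To make the unit mismatch concrete, here's a tiny sketch (the `touchLock` helper is hypothetical, not the PR's code): `fs.utimes` takes epoch seconds (or a `Date`), while `stat().mtimeMs` reports milliseconds, so the two have to be reconciled before comparing.

```js
// Illustration of the units mismatch: utimes takes epoch *seconds* (or a
// Date), while stat().mtimeMs reports epoch *milliseconds*.
const { stat, utimes } = require('node:fs/promises')

const touchLock = async (lockDir) => {
  const nowSeconds = Date.now() / 1000          // seconds for utimes
  await utimes(lockDir, nowSeconds, nowSeconds)
  const { mtimeMs } = await stat(lockDir)       // milliseconds from stat
  return mtimeMs / 1000                         // normalize before comparing
}
```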
Ok well that's totally on me for not checking. Sorry.
Got it, great. I think once things are green we should be good to go.
CI is green, the redundant `rmdir` is refactored away. This is probably good to go. The failing test that passed w/ a re-run is a little concerning. It's an extreme test as far as practical applications go, though. Do we feel confident shipping this, or do we need to track down a potential race condition there?

ETA: I will wait till Monday to merge this regardless, so there is plenty of time to reflect.
I've repro'd locally, it's looking like:
Ok, I think that should do the trick... I've re-run the tests dozens of times without any failures (via a bash loop).
THANK YOU. I knew there was ... something with windows and
I totally planned on landing this yesterday but ... things are a little busy on other fronts in the node ecosystem.
Thank you so much for seeing this PR through to completion. It wasn't a trivial fix and it did take a few rounds of feedback for things that needed to be addressed. The "last mile" of a PR is always the hardest, and it was definitely worth the effort.
Thanks for all your help on this @wraithgar!
Looks like there were some new failures around this on latest, I'm taking a look now 😢
They were on different platforms and node versions, so if it is related it's a race condition. The most fun kind of condition.
Looks like it's two different bugs:
I'll open up a new PR to make this more robust |
#8577 should do the trick 🤞
Various improvements to withLock and its tests:

- fix windows race conditions/errors
- use second-level granularity when detecting lock compromise; this resolves a sporadic floating point issue under APFS, and makes this generally more robust no matter the underlying file system
- improve touchLock logic so that it doesn't compromise the lock for the next holder or keep running the interval when cleanup fails

## Testing notes

Fixes were verified via a [modified GHA workflow](https://github.com/jenseng/cli/actions/runs/17803264354) that ran all the tests 100 times 😅

## References

Related to #8512
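As a rough sketch of the second-level compromise check described above (the `isCompromised` helper and its arguments are illustrative assumptions, not the code that landed):

```js
// Sketch: compare lock mtimes at whole-second granularity so sub-second
// (floating point) drift on filesystems like APFS doesn't falsely flag
// the lock as compromised.
const { stat } = require('node:fs/promises')

const isCompromised = async (lockDir, expectedMtimeMs) => {
  const { mtimeMs } = await stat(lockDir)
  return Math.floor(mtimeMs / 1000) !== Math.floor(expectedMtimeMs / 1000)
}
```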
If you kick off multiple npx processes at the same time for the same non-local package(s), they can potentially install atop each other in the same npx cache directory. This can cause one or both processes to fail silently, or with various errors (e.g. `TAR_ENTRY_ERROR`, `ENOTEMPTY`, `EJSONPARSE`, `MODULE_NOT_FOUND`), depending on when/where something gets clobbered. See this issue for more context and previous discussion.

This pull request introduces a lock around reading and reifying the tree in the npx cache directory, so that concurrent npx executions for the same non-local package(s) can succeed. The lock mechanism is based on `mkdir`'s atomicity and loosely inspired by proper-lockfile, though inlined in order to give us more control and enable some improvements.

## References
Fixes #8224
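To summarize the shape described above, here's a stripped-down sketch of a `withLock` wrapper built on `mkdir`'s atomicity; the names are assumptions, and the retry, touch, and compromise handling of the actual change are intentionally omitted.

```js
// Stripped-down sketch: acquisition relies on mkdir's atomicity, and the
// lock is always released in a finally block. Retry and stale-lock
// handling from the real implementation are left out.
const { mkdir, rmdir } = require('node:fs/promises')

const withLock = async (lockDir, fn) => {
  await mkdir(lockDir) // throws EEXIST if another process holds the lock
  try {
    return await fn()
  } finally {
    await rmdir(lockDir).catch(() => {}) // best-effort cleanup
  }
}

// hypothetical usage, assuming an installDir and some install step:
// await withLock(join(installDir, '.lock'), () => installPackages(installDir))
```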