-
Notifications
You must be signed in to change notification settings - Fork 19
Define most of Pathname in Ruby (redo) #57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* This is just before methods started to be moved from Ruby code to the C extension. * BTW, in the ruby/pathname repository there was no pathname.rb before that commit. (cherry picked from commit 16e97a5)
* This means it's only additions in lib/pathname.rb and zero removals. (cherry picked from commit 3736eab)
(cherry picked from commit 955186c)
* The <=> implementation in the extension is much faster, so is kept. * The other methods are actually faster in Ruby than in C, because rb_funcall() and rb_ivar_get() in C code have no inline cache, but method calls and `@path` have inline caches in Ruby code. https://railsatscale.com/2023-08-29-ruby-outperforms-c/ is an explanation of that (though it was known well before that). (cherry picked from commit c8c2210)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly fine to me besides a few nitpicks, and I'm very much in favor of migrating things to pure Ruby when it makes sense.
Not sure how this works with Pathname having been made a core class though.
37ebd64
to
834cc54
Compare
@hsbt Let's discuss your concerns and suggestions here.
Can you make a concrete suggestion by what you mean by small PRs for this change? I could make a PR with fewer commits, but every commit until Handle Windows NTFS edge case in Pathname#sub_ext is strictly necessary, otherwise the CI doesn't pass. If you are asking a smaller diff in general I think that is not feasible, e.g. making a PR per method would take months of work and still be the exact same end result. The approach here as detailed in the first commit message, Restore lib/pathname.rb from ext/pathname/lib/pathname.rb at ed9270a is to use the Ruby code of pathname.rb from before the translation to C, that is the code from @akr and other contributors to pathname.rb. There is no meaningful way to break that in smaller changes. That code has already been reviewed, it was exactly the code in Pathname before the translation to C started. Please take the time to read the commit messages, they should make it very clear what I did and what needs deeper review (e.g. imported code from the gem as-is doesn't). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've read through the C implementation and the Ruby implementation. I only found one slight difference, but I think it's such a rare edge case I don't know if we need to support it.
# If +path+ contains a NUL character (<tt>\0</tt>), an ArgumentError is raised. | ||
# | ||
def initialize(path) | ||
path = path.to_path if path.respond_to? :to_path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is slightly different behavior than the original.
require "pathname"
path = "/"
def path.to_path
"/tmp"
end
pn = Pathname.new path
p pn
On master, the output is #<Pathname:/>
, but on this branch, the output is #<Pathname:/tmp>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks to me like an optimization in the C extension to avoid the rb_check_funcall()
because that's slow.
But in Ruby respond_to?
is properly cached and so there is no need for this manual optimization.
Semantics-wise I think it's more consistent the way the original Ruby code does it.
It's of course trivial to change if we need those semantics for some reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Semantics-wise I think it's more consistent the way the original Ruby code does it.
Yes, I agree this behavior makes more sense. I just wanted to point out the difference in case it matters (I don't think it should matter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 96000f6 to be fully compatible with what the C extension initialize
did.
(cherry picked from commit a15c1f5)
(cherry picked from commit fe027ae)
* Avoids a MatchData allocation. (cherry picked from commit 643585a)
(cherry picked from commit 177a86d)
(cherry picked from commit f8e0cae)
(cherry picked from commit aa4d4c6)
(cherry picked from commit c96b559)
…assed too * Core methods regularly gain new keyword arguments so this is more future-proof.
Co-authored-by: Jean Boussier <[email protected]>
* The eval to set $~ is inneficient, so only do it when necessary (when running without the C extension).
249c24d
to
20f3653
Compare
I added the last 4 commits to add TruffleRuby and JRuby in CI, given the diff for it is pretty small, and that ensures all Ruby methods are tested, notably the ones also defined in the C extension. The C extension is then only used on CRuby. If reviewers prefer that to be a separate PR I can do that, it's easy to move these 4 commits. |
I also updated the description to add a table summarizing the performance gains, and filed https://bugs.ruby-lang.org/issues/21532 to make a ticket specifically about this. |
Huge +1 from the JRuby side. Pathname moving to C was unnecessary then and a hassle for all of us now (including CRuby due to JIT+C interaction. Ship it! I'll direct any additional comments to the ruby-lang issue. |
Ok, I went to the ruby-lang issue but the last couple of comments directed me back here. Some specific points:
Seems clear that it would not be gem-upgradeable anymore unless something has changed about how we define "core". If "core" means "loaded always at startup with or without gems enabled", then by definition it can't be upgrade by RubyGems. If "core" means "available without an explicit require" then we're moving toward a future where core features might trigger requires, potentially through RubyGems; that seems very problematic to me and very un-"core". In my mind, nothing "core" should break if you disable gems. In fact, I believe nothing "core" should break if there's no stdlib available (Ruby should be able to run "hello world" without loading any stdlib files).
JRuby and TruffleRuby have large parts of core written in Ruby. That code is simply loaded at startup, either by using
Why was it moved to C? Minor performance improvements? That's moot now that all implementations have JIT and none of those JIT implementations can optimize across C calls. The C move is now a millstone around our necks, both preventing the gem from being usable on non-CRuby and making calls to the library potentially slower than if they were pure Ruby. ... It sounds like we're almost all in agreement that this should move back to Ruby. If there's anything I can do to assist please let me know. I'd love to see this PR ultimately close #17 and let us align our Pathname functionality with CRuby once more. Big kudos to @eregon for forging ahead and righting this wrong. |
Approved by @akr in https://bugs.ruby-lang.org/issues/21532#note-5, merging 🎉 |
@headius Thanks for the support, should have pinged you earlier :)
I'm not sure from which issue your points come from, is it https://bugs.ruby-lang.org/issues/17473 maybe?
I believe we need to keep it, we need to keep the gem for older Ruby versions anyway.
Some parts of Pathname were already Ruby, see e.g. https://github.com/ruby/ruby/blob/master/pathname_builtin.rb and |
ruby-lang tracker issue: https://bugs.ruby-lang.org/issues/21532
Same as #53, but that was reverted in 593f030.
Should be reviewed commit-by-commit, that makes it much clearer which parts of the code are new, and which are from the original pathname.rb before translation to C began.
I cherry-picked the commits to make it easier to review.
Description from the original PR, reordered to have the most important first:
Once upon a time, Pathname was pure-Ruby: https://github.com/ruby/ruby/blob/95bc02237635d3fe42532bfe53038257575cee75/lib/pathname.rb
This PR goes back to that, and reuses that original Ruby code, but keeps the C extension implementation of
<=>
andsub
as those are significantly faster.The other Pathname methods are actually faster in Ruby than in C, because all these methods just do
rb_funcall()
andrb_ivar_get()
and those in C code have no inline cache, but the corresponding method calls and@path
have inline caches in Ruby code.https://railsatscale.com/2023-08-29-ruby-outperforms-c/ is an explanation of that.
I have discussed this with @akr several times (notably in https://bugs.ruby-lang.org/issues/17473) and the last time he said it was OK to do this change.
The main goals are:
I worked hard to make the diff really clean, it only adds lines in
lib/pathname.rb
and only removes lines inext/pathname/pathname.c
. That way it should be easy to review it.I restored the Ruby implementation of the methods from ed9270a, the commit just before methods started being migrated to the C extension.
I then fixed things to make the test suite pass and implemented the few missing methods based on their C definition.
The individual commits and their messages make it clear what exactly happened, so I would recommend to review commit-by-commit.
From my discussions with @akr, IIRC, the original motivation to rewrite pathname.rb to C, besides the optimization for
<=>
, was apparently to use*at
functions likeopenat
(seeman openat
,Rationale for openat() and other directory file descriptor APIs
) but these are not portable, it did not happen, and is only useful in very rare edge cases.The Ruby
Dir
class could potentially support some of that, but it seems it has never been important enough for someone to implement it.The API of Pathname would anyway also need to change to take advantage of a working directory different than the process CWD, e.g. Pathname methods would need to take an extra "Pathname to use as working directory" argument.
(because if one just uses
Pathname("relative/path").open(...)
there is no point to use*at()
functions).It's significantly faster with this PR:
Pathname.new(".")
Pathname#directory?
Pathname#to_s