@eregon eregon commented Jul 16, 2025

ruby-lang tracker issue: https://bugs.ruby-lang.org/issues/21532

Same as #53, but that was reverted in 593f030.
Should be reviewed commit-by-commit; that makes it much clearer which parts of the code are new, and which are from the original pathname.rb before translation to C began.

I cherry-picked the commits to make it easier to review.

Description from the original PR, reordered to have the most important first:

Once upon a time, Pathname was pure-Ruby: https://github.com/ruby/ruby/blob/95bc02237635d3fe42532bfe53038257575cee75/lib/pathname.rb

This PR goes back to that, and reuses that original Ruby code, but keeps the C extension implementation of <=> and sub as those are significantly faster.
The other Pathname methods are actually faster in Ruby than in C, because all these methods just call rb_funcall() and rb_ivar_get(), which have no inline cache in C code, whereas the corresponding method calls and @path reads have inline caches in Ruby code.
https://railsatscale.com/2023-08-29-ruby-outperforms-c/ is an explanation of that.
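
As a rough illustration of the point above (not code from this PR): a pure-Ruby wrapper in the style of the old pathname.rb is just a method call plus an `@path` read, and the Ruby VM can cache both at the call site. `MiniPath` is a hypothetical name used only for this sketch.

```ruby
# Hypothetical stand-in for Pathname, for illustration only.
class MiniPath
  def initialize(path)
    @path = path.dup
  end

  # Plain @path read: the VM can cache the ivar lookup at this site.
  def to_s
    @path.dup
  end

  # Plain method call: the call site gets an inline method cache.
  # The C version described above goes through rb_ivar_get() and
  # rb_funcall() instead, which have no inline cache.
  def directory?
    FileTest.directory?(@path)
  end
end

pwd = MiniPath.new(Dir.pwd)
puts pwd.directory?        # true, since the current directory exists
puts pwd.to_s == Dir.pwd   # true
```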

I have discussed this with @akr several times (notably in https://bugs.ruby-lang.org/issues/17473) and the last time he said it was OK to do this change.
The main goals are:

  • Simplify the implementation, e.g. the Ruby version is 3 times smaller in terms of lines and is much easier to read and maintain.
  • Share more of the Pathname implementation between Ruby implementations. With that, other Ruby implementations can then easily be added in CI. Currently the pathname gem does not work on JRuby (no C ext support) or on TruffleRuby (some Ruby C API functions that this gem uses are not supported), so this will be a huge help towards supporting both.

I worked hard to make the diff really clean, it only adds lines in lib/pathname.rb and only removes lines in ext/pathname/pathname.c. That way it should be easy to review it.
I restored the Ruby implementation of the methods from ed9270a, the commit just before methods started being migrated to the C extension.
I then fixed things to make the test suite pass and implemented the few missing methods based on their C definition.
The individual commits and their messages make it clear what exactly happened, so I would recommend reviewing commit-by-commit.


From my discussions with @akr, IIRC, the original motivation to rewrite pathname.rb in C, besides the optimization for <=>, was apparently to use *at functions like openat (see man openat, Rationale for openat() and other directory file descriptor APIs). However, these are not portable, this never happened, and it would only be useful in very rare edge cases.
The Ruby Dir class could potentially support some of that, but it seems it has never been important enough for someone to implement it.
The API of Pathname would anyway also need to change to take advantage of a working directory different than the process CWD, e.g. Pathname methods would need to take an extra "Pathname to use as working directory" argument.
(because if one just uses Pathname("relative/path").open(...), there is no point in using *at() functions).
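
To make the API point concrete, here is a hedged sketch of what an explicit base-directory argument could look like. `read_relative` and its `base:` keyword are hypothetical. The sketch resolves the path by name via `Pathname#expand_path`, so it does not give the descriptor-based semantics of `openat()`.

```ruby
require "pathname"
require "tmpdir"

# Hypothetical helper: resolve `relative` against an explicit base directory
# instead of the process CWD. This is name-based resolution, not openat().
def read_relative(relative, base:)
  Pathname(relative).expand_path(base).read
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, "note.txt"), "hello")
  puts read_relative("note.txt", base: dir)   # prints "hello"
end
```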


It's significantly faster with this PR:

| Speedup (this branch / master) | ruby 3.4.2 | ruby 3.4.2 + YJIT |
|---|---|---|
| Pathname.new(".") | 1.02x | 1.19x |
| Pathname#directory? | 1.03x | 1.06x |
| Pathname#to_s | 1.85x | 2.38x |

Structure:
benchmark name
  command line
  ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [x86_64-linux]
    this branch
    master
  command line
  ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
    this branch
    master
  command line
  truffleruby 24.2.1, like ruby 3.3.7, Oracle GraalVM JVM [x86_64-linux]
    this branch
    master

Pathname.new(".")
$ ruby -Ilib -rpathname -rbenchmark/ips -e 'Benchmark.ips { it.report { Pathname.new(".") } }'
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [x86_64-linux]
1.718M (± 1.0%) i/s  (582.24 ns/i) -      8.629M in   5.024793s
1.680M (± 1.4%) i/s  (595.12 ns/i) -      8.457M in   5.033713s
$ ruby --yjit -Ilib -rpathname -rbenchmark/ips -e 'Benchmark.ips { it.report { Pathname.new(".") } }'
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
2.093M (± 0.9%) i/s  (477.76 ns/i) -     10.622M in   5.075014s
1.762M (± 0.6%) i/s  (567.54 ns/i) -      8.858M in   5.027444s
$ ruby -Ilib -rpathname -rbenchmark/ips -e 'Benchmark.ips { it.report { Pathname.new(".") } }'
truffleruby 24.2.1, like ruby 3.3.7, Oracle GraalVM JVM [x86_64-linux]
 32.078Q (±15.6%) i/s    (0.00 ns/i) -     39.570Q (optimizes away)
720.391k (±17.3%) i/s    (1.39 μs/i) -      3.522M in   5.050059s

Pathname#directory?
$ ruby -Ilib -rpathname -rbenchmark/ips -e 'P = Pathname.pwd; Benchmark.ips { it.report { P.directory? } }' 
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [x86_64-linux]
382.448k (± 0.3%) i/s    (2.61 μs/i) -      1.915M in   5.006863s
371.236k (± 0.4%) i/s    (2.69 μs/i) -      1.874M in   5.046993s
$ ruby --yjit -Ilib -rpathname -rbenchmark/ips -e 'P = Pathname.pwd; Benchmark.ips { it.report { P.directory? } }'
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
388.278k (± 0.2%) i/s    (2.58 μs/i) -      1.945M in   5.009322s
366.325k (± 0.2%) i/s    (2.73 μs/i) -      1.843M in   5.030526s
$ ruby -Ilib -rpathname -rbenchmark/ips -e 'P = Pathname.pwd; Benchmark.ips { it.report { P.directory? } }' 
truffleruby 24.2.1, like ruby 3.3.7, Oracle GraalVM JVM [x86_64-linux]
448.926k (± 1.1%) i/s    (2.23 μs/i) -      2.244M in   4.998573s
314.099k (± 2.9%) i/s    (3.18 μs/i) -      1.574M in   5.015517s

Pathname#to_s
$ ruby -Ilib -rpathname -rbenchmark/ips -e 'P = Pathname.pwd; Benchmark.ips { it.report { P.to_s } }'       
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [x86_64-linux]
6.821M (± 0.7%) i/s  (146.60 ns/i) -     34.758M in   5.095632s
3.683M (± 1.2%) i/s  (271.50 ns/i) -     18.572M in   5.043102s
$ ruby --yjit -Ilib -rpathname -rbenchmark/ips -e 'P = Pathname.pwd; Benchmark.ips { it.report { P.to_s } }'
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
9.480M (± 0.4%) i/s  (105.49 ns/i) -     48.196M in   5.084075s
3.977M (± 1.4%) i/s  (251.46 ns/i) -     20.029M in   5.037328s
$ ruby -Ilib -rpathname -rbenchmark/ips -e 'P = Pathname.pwd; Benchmark.ips { it.report { P.to_s } }'       
truffleruby 24.2.1, like ruby 3.3.7, Oracle GraalVM JVM [x86_64-linux]
31.854Q (±15.5%) i/s    (0.00 ns/i) -     39.901Q (optimizes away)
 1.184M (±13.7%) i/s  (844.61 ns/i) -      5.805M in   5.006740s
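
For reference, the speedup figures in the summary table can be recomputed from the i/s values above as branch i/s divided by master i/s. The snippet below is only a convenience sketch. The constant and method names are made up for it, and the TruffleRuby runs are left out because two of the branch results report "optimizes away".

```ruby
# i/s figures copied from the benchmark output above: [this branch, master].
RESULTS = {
  'Pathname.new(".")' => {
    "ruby 3.4.2"        => [1.718e6, 1.680e6],
    "ruby 3.4.2 + YJIT" => [2.093e6, 1.762e6],
  },
  "Pathname#directory?" => {
    "ruby 3.4.2"        => [382.448e3, 371.236e3],
    "ruby 3.4.2 + YJIT" => [388.278e3, 366.325e3],
  },
  "Pathname#to_s" => {
    "ruby 3.4.2"        => [6.821e6, 3.683e6],
    "ruby 3.4.2 + YJIT" => [9.480e6, 3.977e6],
  },
}.freeze

# Speedup is the ratio of iterations per second, rounded to two decimals.
def speedup(branch_ips, master_ips)
  (branch_ips / master_ips).round(2)
end

RESULTS.each do |name, configs|
  ratios = configs.map do |config, (branch, master)|
    "#{config}: #{speedup(branch, master)}x"
  end
  puts "#{name}  #{ratios.join(', ')}"
end
```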

eregon added 4 commits July 16, 2025 09:35
* This is just before methods started to be moved from Ruby code to the C extension.
* BTW, in the ruby/pathname repository there was no pathname.rb before that commit.

(cherry picked from commit 16e97a5)
* This means it's only additions in lib/pathname.rb and zero removals.

(cherry picked from commit 3736eab)
(cherry picked from commit 955186c)
* The <=> implementation in the extension is much faster, so is kept.
* The other methods are actually faster in Ruby than in C,
  because rb_funcall() and rb_ivar_get() in C code have no inline cache,
  but method calls and `@path` have inline caches in Ruby code.
  https://railsatscale.com/2023-08-29-ruby-outperforms-c/ is an explanation
  of that (though it was known well before that).

(cherry picked from commit c8c2210)
@eregon eregon requested review from akr, hsbt, nobu and byroot July 16, 2025 07:43
@byroot byroot left a comment

Looks mostly fine to me besides a few nitpicks, and I'm very much in favor of migrating things to pure Ruby when it makes sense.

Not sure how this works with Pathname having been made a core class though.

@eregon eregon force-pushed the pure-ruby-pathname2 branch from 37ebd64 to 834cc54 Compare July 16, 2025 20:14

eregon commented Jul 16, 2025

@byroot Thank you for the review, I think I addressed all of it.

@hsbt and/or @nobu Could you review this PR as well?


eregon commented Jul 18, 2025

@hsbt Let's discuss your concerns and suggestions here.
You said in https://bugs.ruby-lang.org/issues/17473#note-27

> Please separate the small PRs. I want to reduce the side effect like ruby/ruby#13906.

Can you make a concrete suggestion of what you mean by small PRs for this change?

I could make a PR with fewer commits, but every commit until "Handle Windows NTFS edge case in Pathname#sub_ext" is strictly necessary, otherwise the CI doesn't pass.
That leaves only "Optimize Pathname#initialize to avoid extra send" and "Optimize Pathname#initialize to avoid extra ivar accesses", which are trivial, and then commits to address @byroot's review.

If you are asking for a smaller diff in general, I think that is not feasible, e.g. making a PR per method would take months of work and still give the exact same end result. The approach here, as detailed in the first commit message, "Restore lib/pathname.rb from ext/pathname/lib/pathname.rb at ed9270a", is to use the Ruby code of pathname.rb from before the translation to C, that is, the code from @akr and other contributors to pathname.rb. There is no meaningful way to break that into smaller changes. That code has already been reviewed: it was exactly the code in Pathname before the translation to C started.

Please take the time to read the commit messages, they should make it very clear what I did and what needs deeper review (e.g. imported code from the gem as-is doesn't).
Just browsing through the commit messages should also make it clear I took great care to have a very clear git history of the changes here with not a single extra line of diff.

@tenderlove tenderlove left a comment

I've read through the C implementation and the Ruby implementation. I only found one slight difference, but I think it's such a rare edge case I don't know if we need to support it.

  # If +path+ contains a NUL character (<tt>\0</tt>), an ArgumentError is raised.
  #
  def initialize(path)
    path = path.to_path if path.respond_to? :to_path
Member

I think this is slightly different behavior than the original.

require "pathname"

path = "/"

def path.to_path
  "/tmp"
end

pn = Pathname.new path
p pn

On master, the output is #<Pathname:/>, but on this branch, the output is #<Pathname:/tmp>

@eregon eregon Aug 5, 2025

That looks to me like an optimization in the C extension to avoid the rb_check_funcall() because that's slow.
But in Ruby respond_to? is properly cached and so there is no need for this manual optimization.
Semantics-wise I think it's more consistent the way the original Ruby code does it.
It's of course trivial to change if we need those semantics for some reason.

Member

> Semantics-wise I think it's more consistent the way the original Ruby code does it.

Yes, I agree this behavior makes more sense. I just wanted to point out the difference in case it matters (I don't think it should matter).

Member Author

Done in 96000f6 to be fully compatible with what the C extension initialize did.
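
For readers following the thread, here is a minimal sketch of the two behaviors being compared. Both classes are hypothetical and neither is the actual Pathname source. The String-first check is an assumption based on the discussion above and may not match what 96000f6 does.

```ruby
# Hypothetical: always consults to_path, like the snippet under review.
class AlwaysToPath
  attr_reader :path

  def initialize(path)
    path = path.to_path if path.respond_to?(:to_path)
    @path = path.dup
  end
end

# Hypothetical: skips to_path when given a String, matching the output
# reported for master above.
class StringFirst
  attr_reader :path

  def initialize(path)
    unless String === path
      path = path.to_path if path.respond_to?(:to_path)
    end
    @path = path.dup
  end
end

path = String.new("/")
def path.to_path
  "/tmp"
end

puts AlwaysToPath.new(path).path   # prints "/tmp"
puts StringFirst.new(path).path    # prints "/"
```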

eregon and others added 14 commits August 5, 2025 21:59
* Avoids a MatchData allocation.

(cherry picked from commit 643585a)
…assed too

* Core methods regularly gain new keyword arguments so this is more future-proof.
Co-authored-by: Jean Boussier <[email protected]>
* The eval to set $~ is inefficient, so only do it when necessary (when running without the C extension).
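
Two of the commit notes above can be illustrated with a small sketch, which is not the code from these commits. It shows that `String#match?` leaves `$~` unset, so no MatchData is allocated, and that forwarding `**kwargs` lets a wrapper pick up keyword arguments the underlying core method gains later. `MiniFile` and the helper method names are hypothetical.

```ruby
require "tempfile"

# $~ is local to each method frame, so it starts out nil in both helpers.
def last_match_after_operator(str)
  str =~ %r{/}
  $~            # a MatchData object was allocated
end

def last_match_after_match_p(str)
  str.match?(%r{/})
  $~            # still nil: match? does not create MatchData
end

p last_match_after_operator("a/b").class   # MatchData
p last_match_after_match_p("a/b")          # nil

# Hypothetical wrapper that forwards positional and keyword arguments as-is.
class MiniFile
  def initialize(path)
    @path = path
  end

  def read(*args, **kwargs)
    File.read(@path, *args, **kwargs)
  end
end

Tempfile.create("mini") do |f|
  f.write("hello")
  f.flush
  puts MiniFile.new(f.path).read(3)                           # prints "hel"
  puts MiniFile.new(f.path).read(encoding: "UTF-8").encoding  # prints "UTF-8"
end
```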
@eregon eregon force-pushed the pure-ruby-pathname2 branch from 249c24d to 20f3653 Compare August 5, 2025 20:16

eregon commented Aug 5, 2025

I added the last 4 commits to add TruffleRuby and JRuby in CI, given the diff for it is pretty small, and that ensures all Ruby methods are tested, notably the ones also defined in the C extension. The C extension is then only used on CRuby.

If reviewers prefer that to be a separate PR I can do that, it's easy to move these 4 commits.
I just think it makes sense together.
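
A rough sketch of the layout described above, under stated assumptions: every method has a Ruby definition, and on CRuby an extension may replace the hot ones. `SortablePath` and `sortable_path_ext` are hypothetical names, and the `<=>` body is a simplified guess at a pure-Ruby comparison, not a copy of the gem's code.

```ruby
# Hypothetical stand-in; `sortable_path_ext` is not a real extension.
class SortablePath
  def initialize(path)
    @path = path.dup
  end

  def to_s
    @path.dup
  end

  # Pure-Ruby definition, always present so every engine has a working <=>.
  # Mapping "/" to "\0" makes the separator sort before any other character.
  def <=>(other)
    return nil unless SortablePath === other
    @path.tr("/", "\0") <=> other.to_s.tr("/", "\0")
  end
end

# On CRuby only, an extension could redefine the hot methods in C.
if RUBY_ENGINE == "ruby"
  begin
    require "sortable_path_ext" # hypothetical; absent here, so Ruby's <=> stays
  rescue LoadError
    # No extension available: keep the pure-Ruby methods.
  end
end

puts SortablePath.new("a/b") <=> SortablePath.new("a.b")   # prints -1
puts "a/b" <=> "a.b"                                       # prints 1
```

Because the Ruby definitions are always present, running the test suite on JRuby or TruffleRuby exercises them even for methods that CRuby takes from the extension.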


eregon commented Aug 5, 2025

I also updated the description to add a table summarizing the performance gains, and filed https://bugs.ruby-lang.org/issues/21532 to make a ticket specifically about this.


headius commented Aug 21, 2025

Huge +1 from the JRuby side. Pathname moving to C was unnecessary then and a hassle for all of us now (including CRuby due to JIT+C interaction). Ship it!

I'll direct any additional comments to the ruby-lang issue.


headius commented Aug 21, 2025

Ok, I went to the ruby-lang issue but the last couple of comments directed me back here.

Some specific points:

  • Pathname becoming core, what happens to the gem?

Seems clear that it would not be gem-upgradeable anymore unless something has changed about how we define "core". If "core" means "loaded always at startup with or without gems enabled", then by definition it can't be upgraded by RubyGems. If "core" means "available without an explicit require" then we're moving toward a future where core features might trigger requires, potentially through RubyGems; that seems very problematic to me and very un-"core".

In my mind, nothing "core" should break if you disable gems. In fact, I believe nothing "core" should break if there's no stdlib available (Ruby should be able to run "hello world" without loading any stdlib files).

  • Pathname becoming core, how can it be pure-Ruby?

JRuby and TruffleRuby have large parts of core written in Ruby. That code is simply loaded at startup, either by using load in JRuby (to avoid adding LOADED_FEATURES, accessing the stdlib, and exposing those internal files) or by TruffleRuby directly executing them during startup (details left to @eregon). The pathname.rb would move into the "kernel" of each implementation and be loaded at boot just like other Ruby sources currently loaded by CRuby (usually filled with "primitive" C calls that do the actual work).

  • Why move Pathname to Ruby?

Why was it moved to C? Minor performance improvements? That's moot now that all implementations have JIT and none of those JIT implementations can optimize across C calls. The C move is now a millstone around our necks, both preventing the gem from being usable on non-CRuby and making calls to the library potentially slower than if they were pure Ruby.

...

It sounds like we're almost all in agreement that this should move back to Ruby. If there's anything I can do to assist please let me know. I'd love to see this PR ultimately close #17 and let us align our Pathname functionality with CRuby once more.

Big kudos to @eregon for forging ahead and righting this wrong.


eregon commented Aug 21, 2025

Approved by @akr in https://bugs.ruby-lang.org/issues/21532#note-5, merging 🎉

@eregon eregon merged commit 658648c into ruby:master Aug 21, 2025
20 checks passed

eregon commented Aug 21, 2025

@headius Thanks for the support, should have pinged you earlier :)

> Ok, I went to the ruby-lang issue but the last couple of comments directed me back here.

I'm not sure which issue your points come from, is it https://bugs.ruby-lang.org/issues/17473 maybe?

> Pathname becoming core, what happens to the gem?

I believe we need to keep it, we need to keep the gem for older Ruby versions anyway.
Also, when loaded, this gem replaces the ::Pathname constant if one exists, which seems good if, e.g., a bug fix ships in the gem without a CRuby release.

> Pathname becoming core, how can it be pure-Ruby?

Some parts of Pathname were already Ruby, see e.g. https://github.com/ruby/ruby/blob/master/pathname_builtin.rb and lib/pathname.rb before this PR. So not a problem in any case.

@eregon eregon mentioned this pull request Aug 21, 2025
hsbt added a commit to hsbt/ruby that referenced this pull request Aug 22, 2025
hsbt added a commit to ruby/ruby that referenced this pull request Aug 25, 2025