Updated JMH, Scalaz, Clojure, and ECollections benchmark versions #1658

Merged
merged 2 commits into vavr-io:master on Nov 3, 2016

Conversation

l0rinc
Contributor

@l0rinc l0rinc commented Nov 3, 2016

Aligned Java 8 compatibility to Scala 2.12 benchmarks

@codecov-io

Current coverage is 96.43% (diff: 100%)

Merging #1658 into master will not change coverage

@@             master      #1658   diff @@
==========================================
  Files            89         89          
  Lines         11123      11123          
  Methods           0          0          
  Messages          0          0          
  Branches       1893       1893          
==========================================
  Hits          10726      10726          
  Misses          243        243          
  Partials        154        154          

Powered by Codecov. Last update 0bbc789...0864032

@l0rinc
Contributor Author

l0rinc commented Nov 3, 2016

Hmm, strangely it seems that the new Scala 2.12.0 Vector is significantly slower in a few cases (e.g. tail, slice, update, append, prepend). @djspiewak, what could be the reason for that?

Displaying speed ratios (Javaslang Vector ops/s / Scala Vector ops/s; a greater value means Javaslang is faster):

2.11.8:

Operation     10    100   1026 
Create     7.13×  7.92×  7.70×
Head       0.79×  0.59×  0.49×
Tail       2.64×  3.49×  4.55×
Get        1.09×  1.03×  0.71×
Update     1.21×  1.07×  0.76×
Map        1.21×  1.18×  1.43×
Filter     1.47×  1.55×  1.38×
Prepend    0.09×  0.08×  0.08×
Append     0.15×  0.10×  0.08×
AppendAll  2.79×  1.52×  1.67×
GroupBy    1.25×  1.17×  0.92×
Slice      2.45×  3.13×  4.87×
Iterate    2.03×  0.88×  1.43×

2.12.0:

Operation     10    100   1026
Create     6.68×  5.64×  5.77×
Head       0.79×  0.64×  0.48×
Tail       5.68×  7.67×  8.71×
Get        1.03×  1.05×  0.78×
Update     2.64×  2.72×  1.78×
Map        1.37×  1.08×  1.31×
Filter     1.22×  1.45×  1.16×
Prepend    0.18×  0.18×  0.17×
Append     0.31×  0.20×  0.20×
AppendAll  3.57×  1.87×  1.71×
GroupBy    1.52×  1.31×  1.13×
Slice      4.96×  8.52× 10.71×
Iterate    1.84×  1.42×  1.30×

@l0rinc
Contributor Author

l0rinc commented Nov 3, 2016

@viktorklang, tried it separately with the final 2.12.0 and 2.11.8:

Scala slice for 1026 elements:

@Benchmark
public void scala_persistent(Blackhole bh) {
    scala.collection.immutable.Vector<Integer> values = scalaPersistent;
    while (!values.isEmpty()) {
        values = values.slice(1, values.size());      // drop the first element
        values = values.slice(0, values.size() - 1);  // drop the last element
        bh.consume(values);
    }
}

results in

8,267.31 ops/s - 2.11.8
4,132.78 ops/s - 2.12.0

while Javaslang's speed stays:

44,629.03 ops/s
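For readers without a Scala toolchain, the access pattern of the slice benchmark above can be sketched in plain Java. This is a minimal illustration only, assuming nothing beyond the JDK: `List.subList` stands in for `Vector.slice`, and the class/method names are made up; it deliberately ignores the persistence aspect that the real benchmark exercises.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the benchmark loop: each pass drops the first element and
// then the last, until the sequence is empty.
public class SliceLoopSketch {

    static int passes(List<Integer> values) {
        int count = 0;
        while (!values.isEmpty()) {
            values = values.subList(1, values.size());      // drop head
            values = values.subList(0, values.size() - 1);  // drop last
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<Integer> xs = new ArrayList<>();
        for (int i = 0; i < 1026; i++) xs.add(i);
        System.out.println(passes(xs)); // 513: two elements removed per pass
    }
}
```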

@viktorklang

Ping @SethTisue

@djspiewak

Could this have anything to do with the new default methods encoding of traits?
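(Background for the question above: Scala 2.12 emits trait method bodies as Java 8 interface default methods, whereas 2.11 compiled them to a static method in a `Foo$class` impl class plus a small per-class forwarder. A rough Java analogy of the two shapes, with all names purely illustrative and not the actual scalac output:)

```java
// Rough Java analogy of the two trait encodings; illustrative names only.

// 2.12-style: the trait body becomes an interface default method,
// so call sites use invokeinterface.
interface VectorPointerLike {
    default int depth() { return 1; }
}

class NewStyleVector implements VectorPointerLike {}

// 2.11-style: the body lives in a static "impl" method and every
// implementing class gets a small generated forwarder.
final class VectorPointerImpl {
    static int depth(OldStyleVector self) { return 1; }
}

class OldStyleVector {
    int depth() { return VectorPointerImpl.depth(this); } // generated forwarder
}

public class TraitEncodingSketch {
    public static void main(String[] args) {
        // Both encodings compute the same result; only dispatch differs.
        System.out.println(new NewStyleVector().depth() == new OldStyleVector().depth()); // true
    }
}
```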

@SethTisue

@Ichoran is this expected?

@Ichoran

Ichoran commented Nov 3, 2016

This is not what I would have expected, @SethTisue. I am not sure what is going on, but basically everything that has to create stuff is considerably slower. The JVM has a really tough job to do with keeping track of what's valid with all the pointers into different depths of the vector; maybe that optimization is affected by the trait encoding?
I could look into it starting this weekend if it's a high priority. Otherwise, if someone wants to look at the generated assembly, that might give a clue. I'd probably go for append as the easiest to understand. They're all pretty hairy.

@danieldietrich
Contributor

Thank you Lorinc, interesting benchmarks! I hope we can also use the insights of the Scala language architects in the design of Javaslang.

@danieldietrich danieldietrich merged commit d6d2d19 into vavr-io:master Nov 3, 2016
@danieldietrich danieldietrich added this to the 2.1.0 milestone Nov 3, 2016
@l0rinc
Contributor Author

l0rinc commented Nov 3, 2016

@viktorklang, @SethTisue, @Ichoran, @djspiewak, @danieldietrich, the slowdown seems to have happened in 2.12.0-M2 (4,528.84 ops/s), as in 2.12.0-M1 it was still 9,470.62 ops/s.

M2 changelog: http://www.scala-lang.org/news/2.12.0-M2

  • Beginning with 2.12.0-M2, the Scala 2.12 series targets Java 8.

@Ichoran

Ichoran commented Nov 3, 2016

Oh, that was the GenBCode / inliner milestone! I wonder if just looking at the bytecode would show what the difference is?

@l0rinc
Contributor Author

l0rinc commented Nov 3, 2016

Could this be the problem: scala/scala@bb4b79c#diff-59f3462485b74027de4fd5e9febcc81bR136 ?

@Ichoran

Ichoran commented Nov 3, 2016

Could be. I definitely benchmarked that, but I benchmarked it before the GenBCode change IIRC, and it's possible that I forgot to benchmark both with and without optimization. That's a huge penalty for two equalities that ought to be false!

@viktorklang

But it is 2 extra branches and method calls? Perhaps it put the method size above the default for inlining? (31 bytes IIRC)


@Ichoran

Ichoran commented Nov 4, 2016

@viktorklang - It's marked @inline so I wasn't anticipating that the JVM would be responsible for inlining it! I'm sure it's over the default (35 according to http://www.oracle.com/technetwork/articles/java/vmoptions-jsp-140102.html), but that shouldn't have mattered unless the Scala compiler isn't inlining it.

You might be right that the method calls to fetch the builder instances are at fault.

@l0rinc
Contributor Author

l0rinc commented Nov 6, 2016

@viktorklang, @SethTisue, @Ichoran, @djspiewak, @danieldietrich, I've recompiled Scala 2.12 with the Vector from 2.11, and the above methods are either slower or the same.
I'm afraid the slowdown has a more serious cause, which I'm not yet qualified to investigate :/

@l0rinc l0rinc deleted the benchmark-versions branch November 6, 2016 12:57
@viktorklang

@paplorinc I think this is worth investigating. @Ichoran / @SethTisue, is there an issue for this in the scala/scala tracker?

@Ichoran

Ichoran commented Nov 7, 2016

I don't know if there's an issue, but I'm looking into it. So far I have not found any difference in bytecode or assembly that can explain the whole difference. (Also, in my hands the difference is more like 50%.) I did find that either switching VectorPointer to an abstract class (from a trait) or manually inlining all the code in it (so there is no default method) partially rescued the performance of :+. But that was still only partial.
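The rescue described above can be pictured with a minimal Java sketch (hypothetical names; the real VectorPointer is far hairier): hoisting the shared body from an interface default method into an abstract class replaces interface dispatch with plain class-hierarchy virtual dispatch, while behavior stays the same.

```java
// Illustrative names only; not the actual scala.collection code.

// Trait-with-default-method shape (as in 2.12).
interface PointerTrait {
    default int stableDepth() { return 1; }
}

// Abstract-class shape: the same body as an ordinary inherited method,
// dispatched via invokevirtual on a single class hierarchy.
abstract class PointerBase {
    int stableDepth() { return 1; }
}

class TraitBackedVector implements PointerTrait {}
class ClassBackedVector extends PointerBase {}

public class AbstractClassSketch {
    public static void main(String[] args) {
        // Semantics are unchanged; only the dispatch mechanics differ.
        System.out.println(new TraitBackedVector().stableDepth()
                + new ClassBackedVector().stableDepth()); // 2
    }
}
```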

@Ichoran

Ichoran commented Nov 8, 2016

This is a tough one! The bytecode is nearly identical (aside from the obvious differences); there is a big difference in profiling gotoPosWritable1, but the bytecode of that, despite being in different places, is essentially identical. The assembly is, for some reason, a bit less efficient in 2.12, but it seems hard to believe that this is causing the entire slowdown.

I'm going to try some speculative fixes, but I'm not terribly hopeful that any of them will work given that I still don't understand precisely what is causing the slowness.

@l0rinc
Contributor Author

l0rinc commented Nov 8, 2016

@Ichoran, @SethTisue, @viktorklang, @odersky, @djspiewak, @danieldietrich, @zsolt-donca

To sum it up, comparing the JMH benchmarks for 2.11.8 with 2.12.0, there seems to be a measurable slowdown in some Vector operations for all GCs.

For example, ops/s for 1000 slice operations (2.12.0 vs. 2.11.8):

GC type             2.12 / 2.11  ops/s ratios
UseParallelOldGC    78%          (8228 / 10563)
UseConcMarkSweepGC  63%          (7167 / 11320)
UseG1GC             49%          (4695 /  9482)

(See l0rinc/ScalaVectorBenchmark@918f75b)

Note: there still appears to be room for optimization, seeing that the Javaslang persistent Vector's slice is ~5-10× faster.

@djspiewak

Some random thoughts that may have no bearing on anything (and may have already occurred to everyone):

  • The new bytecode emitted may have pushed some functions just over HotSpot's inline limit, preventing certain optimizations which used to apply
  • The assembly produced by HotSpot in response to the new bytecode may be less amenable to whatever dark magic the CPU is doing. I freely admit to being highly superstitious on this point…
  • What do the heap profiles look like? Presumably identical?
  • Have you tried enabling GenBCode on 2.11 to see if the same performance degradation is observed?
  • Exactly how much method defaulting is left in the code paths after converting the trait to an abstract class? I'm not entirely sure how well the JVM's PIC interacts with method defaulting. I would assume reasonably well, but I don't know the details. Insert more superstitious mumbling here…

@Ichoran

Ichoran commented Nov 8, 2016

@djspiewak - I mostly have been looking at assembly; the only glaring thing I've found is a failure to convert from a method call to a field access on some instances of fetching display0. Still, it's hard to imagine that the slowdown could be so big just because of that. It doesn't seem to be an inlining thing in general.

About half the performance difference disappears when one converts the VectorPointer trait to an abstract class (plus various other changes that need to happen for that to work--and I'm not certain that the trait/abstract class change is the key one). Also, about half the performance difference disappears if the VectorPointer has no default methods and the implementations are moved to the children. I assume it's the "same" half, though obviously I can't easily check.

@l0rinc
Contributor Author

l0rinc commented Nov 8, 2016

@Ichoran

  • how can the selection of GC affect the results so much in the new, but less in the old case?
  • is this a general slowdown? Why wasn't it detected by other benchmarks?
  • can I help in any way?

@Ichoran

Ichoran commented Nov 8, 2016

@paplorinc - Good question regarding the GC. I don't have a good answer at this time. It does suggest that I should be looking for an extra allocation in the bytecode, doesn't it?

This seems relatively Vector-specific, and there isn't a broad benchmark suite. Maybe this is an indication that we should create one.

What you've been doing has already been very helpful! If you have time to bisect the commit history and figure out where the slowdown first happened, that would be awesome. Looking for generalization (e.g. is HashMap affected?) might help too. Or you may have other ideas that are better than either.

@SethTisue

SethTisue commented Nov 8, 2016

may I suggest the discussion move to scala/scala-dev#260?

@Ichoran

Ichoran commented Nov 9, 2016

@paplorinc - Could you try the PR at scala/scala#5516 to see if it fixes the issue in your benchmarks? It does in my hands.
