TruffleRuby Native: Fast Even for Short Scripts
Introduction
Nowadays, it seems every major Ruby implementation has a Just-In-Time (JIT) compiler. Recently, YARV-MJIT has been merged to MRI (CRuby) trunk. JRuby relies on the Java Virtual Machine JIT compilers, and TruffleRuby uses Graal.
One big challenge for JIT compilers is to be beneficial on short-running scripts. In general, JIT compilers are better for long-running applications like web servers.
John Hawthorn recently wrote a blog post about using YARV-MJIT for a small Ruby script. In this post, I want to expand on that and analyze the performance of 43 short-running programs (from 0.04s to 20s). Quick startup and fast warmup are therefore important to achieve good results.
JRuby and TruffleRuby on JVM do not perform well on short-running programs as their startup alone gives them a big disadvantage. On the other hand, TruffleRuby on SubstrateVM has much better startup and warmup.
TruffleRuby Native
TruffleRuby can run on 2 different virtual machines:
- The HotSpot Java Virtual Machine, which has multiple just-in-time compilers and achieves the best peak performance. But the startup is not great (~2s), mostly due to classloading.
- The SubstrateVM, which provides very fast startup and fast warmup.
In both cases, the Graal dynamic/just-in-time compiler is used to compile Ruby code down to machine code and obtain great peak performance.
Since we look at short scripts, we pick TruffleRuby on SubstrateVM, also called TruffleRuby Native in this post.
TruffleRuby is part of GraalVM, which can be downloaded from OTN.
To run TruffleRuby Native, just add a --native flag to bin/ruby:
That’s pretty fast. I have been working on TruffleRuby startup for a while and it’s starting to look nice. Not as good as MRI yet, but we’re getting there (that’s for another blog post).
The Benchmarks
Startup is interesting, but it would be much more interesting to try on real Ruby scripts. So I took my solutions to Advent of Code, for all 25 days and both puzzles of each day. This amounts to 43 benchmarks, as for some of the days a single script solves both puzzles.
I wrote these solutions. So, of course, they might be biased and might not be representative of other short-running Ruby scripts. But I wrote them in good faith, optimizing for a concise and elegant style, tweaking the code for performance only when it would run for too long. For Advent of Code, I enjoy writing code straight from the problem description rather than reasoning about the maths behind the puzzle. Note that I also made a couple tweaks to TruffleRuby after solving the puzzles (see below for details), which are now part of GraalVM 0.31.
In this blog post, I use the latest MRI/CRuby trunk as of writing (r62451) as the baseline. I also try the new bundled YARV-MJIT and compare against the latest RTL-MJIT from Vladimir Makarov and TruffleRuby Native from GraalVM 0.31.
Enumerable#sum and Kernel#yield_self are not defined in all implementations as some of them target a different Ruby version than 2.5.
So I used a compat.rb file defining these methods in Ruby only when the method does not exist (sum is needed for TruffleRuby and yield_self for RTL-MJIT).
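As an illustration, a compatibility file of this kind could look like the sketch below. This is not the actual compat.rb used for the benchmarks; the method bodies are my own assumption of a straightforward pure-Ruby fallback, and they are only installed when the method is missing.

```ruby
# Hypothetical compat.rb sketch: define the methods only if they are missing.
module Enumerable
  unless method_defined?(:sum)
    # Adds up all elements (or the block results), starting from init.
    def sum(init = 0)
      acc = init
      each { |e| acc += block_given? ? yield(e) : e }
      acc
    end
  end
end

module Kernel
  unless method_defined?(:yield_self)
    # Passes the receiver to the block and returns the block's result.
    def yield_self
      return to_enum(:yield_self) unless block_given?
      yield self
    end
  end
end
```

On an implementation that already provides both methods, loading this file changes nothing, so every implementation can load the same file.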
I verified all implementations produce the same output.
I ran each benchmark 10 times and took the average.
The maximal deviation from the average across the 10 runs is: for MRI trunk 8%, YARV-MJIT 8%, TruffleRuby Native 13% and RTL-MJIT 78% (due to the unstable startup time between 54ms and >100ms; the maximal deviation is 10% for programs running over 1s).
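The measurement procedure can be sketched as follows. This is not the harness actually used for the numbers in this post; the helper names (time_runs, average, max_deviation) are hypothetical, and the sketch simply measures wall-clock time around a child process.

```ruby
require "benchmark"

# Runs `command` (an array such as ["ruby", "1a.rb"]) `runs` times and
# returns the wall-clock time of each run in seconds.
def time_runs(command, runs: 10)
  Array.new(runs) do
    Benchmark.realtime { system(*command, out: File::NULL) }
  end
end

# Arithmetic mean of the measured times.
def average(times)
  times.inject(:+) / times.size
end

# Largest relative distance of a single run from the average
# (0.08 means one run was 8% away from the average).
def max_deviation(times)
  avg = average(times)
  times.map { |t| (t - avg).abs / avg }.max
end

# Example usage (not executed here):
#   times = time_runs(["ruby", "1a.rb"])
#   puts average(times), max_deviation(times)
```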
For completeness, this is run on a laptop with Fedora 26, an Intel Core i7-7700HQ CPU @ 2.80GHz and an SSD.
MRI was compiled with the system GCC 7.2.1 20170915 (Red Hat 7.2.1-2), which is also used by YARV-MJIT and RTL-MJIT.
The results are in seconds. The implementations are compared with time differences (Δ) instead of speedup/slowdown factors to reflect how much time a user gains or loses (10x faster if it’s already <100ms does not make a difference to the user in such a use case).
Cells highlighted in green show gains compared to the baseline. Cells in red highlight losses of more than 1 second.
Bench | MRI trunk | YARV-MJIT | Δ | RTL-MJIT | Δ | TruffleRuby Native | Δ |
---|---|---|---|---|---|---|---|
1a.rb | 0.041 | 0.176 | +0.135 | 0.094 | +0.053 | 0.164 | +0.123 |
1b.rb | 0.040 | 0.179 | +0.139 | 0.094 | +0.054 | 0.103 | +0.063 |
2a.rb | 0.040 | 0.230 | +0.191 | 0.172 | +0.133 | 0.090 | +0.050 |
2b.rb | 0.040 | 0.198 | +0.158 | 0.086 | +0.047 | 0.115 | +0.075 |
3a.rb | 0.040 | 0.227 | +0.187 | 0.185 | +0.145 | 0.078 | +0.038 |
3b.rb | 0.040 | 0.239 | +0.199 | 0.197 | +0.156 | 0.098 | +0.057 |
4a.rb | 0.041 | 0.226 | +0.185 | 0.132 | +0.091 | 0.133 | +0.092 |
4b.rb | 0.045 | 0.195 | +0.149 | 0.102 | +0.057 | 0.401 | +0.355 |
5a.rb | 0.080 | 0.227 | +0.147 | 0.187 | +0.107 | 0.274 | +0.194 |
5b.rb | 3.312 | 3.583 | +0.271 | 3.151 | -0.160 | 0.534 | -2.778 |
6.rb | 0.087 | 0.222 | +0.135 | 0.133 | +0.046 | 0.283 | +0.197 |
7a.rb | 0.043 | 0.215 | +0.171 | 0.111 | +0.068 | 0.162 | +0.119 |
7b.rb | 0.046 | 0.239 | +0.193 | 0.143 | +0.097 | 0.260 | +0.214 |
8a.rb | 0.042 | 0.294 | +0.252 | 0.249 | +0.208 | 0.135 | +0.093 |
8b.rb | 0.042 | 0.316 | +0.274 | 0.288 | +0.246 | 0.144 | +0.101 |
9.rb | 0.042 | 0.232 | +0.189 | 0.140 | +0.098 | 0.135 | +0.093 |
10a.rb | 0.040 | 0.225 | +0.186 | 0.173 | +0.133 | 0.093 | +0.053 |
10b.rb | 0.046 | 0.283 | +0.237 | 0.123 | +0.077 | 0.181 | +0.134 |
11.rb | 13.818 | 15.706 | +1.889 | 14.246 | +0.429 | 0.805 | -13.013 |
12a.rb | 0.043 | 0.252 | +0.209 | 0.122 | +0.079 | 0.152 | +0.108 |
12b.rb | 0.044 | 0.181 | +0.137 | 0.083 | +0.039 | 0.179 | +0.135 |
13a.rb | 0.043 | 0.201 | +0.158 | 0.102 | +0.059 | 0.201 | +0.158 |
13b.rb | 1.830 | 1.979 | +0.149 | 1.503 | -0.327 | 0.456 | -1.374 |
14a.rb | 0.211 | 0.308 | +0.097 | 0.272 | +0.061 | 0.684 | +0.472 |
14b.rb | 0.244 | 0.307 | +0.063 | 0.342 | +0.098 | 0.984 | +0.740 |
15a.rb | 15.565 | 14.988 | -0.576 | 14.069 | -1.495 | 2.166 | -13.399 |
15b.rb | 8.802 | 8.404 | -0.398 | 7.974 | -0.828 | 1.278 | -7.524 |
16a.rb | 0.051 | 0.311 | +0.260 | 0.195 | +0.144 | 0.361 | +0.310 |
16b.rb | 19.581 | 22.147 | +2.566 | 24.874 | +5.293 | 8.994 | -10.587 |
17a.rb | 0.040 | 0.240 | +0.200 | 0.133 | +0.093 | 0.103 | +0.063 |
17b.rb | 3.027 | 2.411 | -0.616 | 1.577 | -1.451 | 0.588 | -2.439 |
18a.rb | 0.042 | 0.165 | +0.123 | 0.078 | +0.036 | 0.109 | +0.067 |
18b.rb | 0.076 | 0.184 | +0.108 | 0.088 | +0.012 | 0.810 | +0.734 |
19.rb | 0.068 | 0.187 | +0.119 | 0.129 | +0.061 | 0.530 | +0.463 |
20a.rb | 3.362 | 3.908 | +0.545 | 3.071 | -0.291 | 2.170 | -1.192 |
20b.rb | 2.100 | 2.394 | +0.294 | 2.096 | -0.004 | 5.101 | +3.001 |
21.rb | 3.408 | 3.697 | +0.289 | 3.886 | +0.478 | 3.731 | +0.323 |
22a.rb | 0.063 | 0.261 | +0.198 | 0.167 | +0.104 | 0.277 | +0.214 |
22b.rb | 16.493 | 18.182 | +1.689 | 16.893 | +0.400 | 2.734 | -13.759 |
23a.rb | 0.047 | 0.205 | +0.158 | 0.080 | +0.033 | 0.277 | +0.230 |
23b.rb | 4.297 | 3.906 | -0.391 | 2.058 | -2.239 | 0.582 | -3.715 |
24.rb | 5.726 | 5.899 | +0.173 | 6.474 | +0.748 | 1.901 | -3.825 |
25.rb | 1.980 | 2.094 | +0.114 | 1.929 | -0.051 | 1.563 | -0.417 |
Total | 105.068 | 116.023 | +10.954 | 108.203 | +3.135 | 40.118 | -64.950 |
There seem to be essentially 2 categories in these benchmarks. The first category is scripts which take less than 1 second, on which none of the implementations with a JIT runs faster than MRI trunk. But the JIT implementations also don’t take more than 1 second on them, so it’s likely not a big difference to the user.
For scripts which run for more than 1 second, TruffleRuby Native saves a significant amount of time (except 20b.rb and 21.rb). YARV-MJIT and RTL-MJIT achieve some gains on that second category as well, although they are much more modest.
Overall, the last line (Total) shows that YARV-MJIT and RTL-MJIT are not improving the total time needed to run those scripts.
It is a big challenge for JIT compilers to be beneficial on short-running scripts.
However, since TruffleRuby Native gains so much on the second category, it manages to execute all scripts in less than half the time MRI trunk takes!
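To make the difference between a time delta and a speedup factor concrete, here is a small sketch using three rows copied from the table (MRI trunk and TruffleRuby Native columns). The helper names delta and speedup are mine and only serve the illustration.

```ruby
# Times in seconds copied from the table: [MRI trunk, TruffleRuby Native].
TIMES = {
  "1a.rb" => [0.041, 0.164],
  "11.rb" => [13.818, 0.805],
  "Total" => [105.068, 40.118],
}

# Time lost (positive) or gained (negative) compared to the baseline.
def delta(baseline, time)
  (time - baseline).round(3)
end

# How many times faster than the baseline (below 1.0 means slower).
def speedup(baseline, time)
  baseline / time
end

TIMES.each do |bench, (mri, native)|
  puts format("%-6s delta %+8.3fs  speedup %5.2fx",
              bench, delta(mri, native), speedup(mri, native))
end
```

On 1a.rb the factor looks bad (TruffleRuby Native takes 4 times as long) while the user loses only about a tenth of a second; on 11.rb the delta is about 13 seconds saved.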
Analysis
All 3 contenders have slower startup than MRI trunk here.
YARV-MJIT currently has the known problem of compiling a header on every startup and waiting for it.
It seems RTL-MJIT has the same issue.
TruffleRuby Native currently has to load its core library written in Ruby on startup, which makes it a bit slower than MRI.
The other issue is warmup.
The MJIT approach of emitting C code and shelling out to an external compiler is far from optimal in terms of warmup (how long it takes until the often-executed parts of the program are compiled).
For instance, running ruby --jit --jit-verbose=1 15a.rb shows that compiling a method with YARV-MJIT takes at minimum 28ms and the median for all 82 methods compiled is 105ms.
With TruffleRuby Native, the TruffleRuby interpreter and Graal are compiled ahead-of-time by SubstrateVM to machine code.
That machine code is saved in an executable (called the image).
When starting the executable, we have an already warmed-up TruffleRuby interpreter and calling the JIT compiler is just a method call away.
Since Graal is ahead-of-time compiled it starts compiling faster than on JVM and requires no classloading.
So for instance, graalvm-0.31/bin/ruby --native --native.XX:+TraceTruffleCompilation 15a.rb shows that Graal takes 11ms to compile the block at line 20, compared to 70ms for YARV-MJIT.
On 20b.rb, the slowdown for TruffleRuby seems to be caused by Struct#== using Struct#values, which is not specialized compared to other Struct methods (it’s a bug, Struct#to_a is specialized).
Finally, it’s time to consider performance as a whole. We see that for slightly longer scripts, TruffleRuby can save up to 13.8 seconds (22b.rb). The maximum gain for YARV-MJIT is about 0.6 seconds (17b.rb) and for RTL-MJIT about 2.2 seconds (23b.rb).
For YARV-MJIT, it is still the early days and it does not have many optimizations. RTL-MJIT has more optimizations, but does not support Ruby inlining currently. TruffleRuby supports Ruby inlining and also inlining to and from the core library (for both the part written in Ruby and the part written in Java). It even supports inlining Ruby method calls from C extensions.
Integer#times and On-Stack-Replacement
Some programs (particularly short-running ones) are hard to optimize for a JIT compiler. For instance, let’s take John Hawthorn’s solution to Day 15:
Let’s simplify a little bit by removing the Enumerator to understand better what is going on (the same reasoning applies as count ends up calling times with a block, just with more indirections):
Here, we would ideally compile the calculate method and inline everything called from there.
But that method is only called once.
So by the time the JIT compiler thinks it’s good to compile that method, we will be inside that method, never call it again, and keep executing in the non-compiled code.
The next best thing is compiling Integer#times, inlining its block and everything from there.
In TruffleRuby, Integer#times is defined in Ruby:
But we meet the same problem. By the time we figure out that this times method has a loop with many iterations and calls a block (yield) many times, we will already be in the while loop. When we get out of it, the program finishes, so we will never use a compiled version of Integer#times.
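For readers who want to picture it, a Ruby-level times built on a while loop could look like the sketch below. It uses a hypothetical method name, times_in_ruby, so that it does not override the built-in, and it is only an illustration of the shape discussed here, not TruffleRuby's actual source.

```ruby
class Integer
  # Hypothetical stand-in for a Ruby-level Integer#times.
  def times_in_ruby
    # Without a block, return an Enumerator like the built-in does.
    return to_enum(:times_in_ruby) { self } unless block_given?

    i = 0
    # All the iterations happen in this single while loop, so the method is
    # entered once and stays in the loop until the work is done.
    while i < self
      yield i
      i += 1
    end
    self
  end
end

total = 0
1_000.times_in_ruby { |i| total += i }
puts total
```

The method is called a single time from the script, yet the while loop inside it runs for every iteration, which is exactly the situation described above.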
That’s where On-Stack-Replacement (OSR) comes in.
On-Stack-Replacement makes it possible to compile a loop and jump to the compiled loop from the interpreter while that loop is still running.
TruffleRuby can perform On-Stack-Replacement in while loops thanks to the support for OSR by Truffle and Graal.
Once Truffle detects that the interpreter has iterated in a loop many times (the default threshold is 100 000 loop iterations), it triggers an OSR compilation of that loop. Once that compilation finishes, the interpreter jumps into the compiled loop at the next iteration, executing the rest of the loop much faster.
In this case, this works because Integer#times is written in Ruby and uses a while loop which has OSR support.
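The control flow of OSR can be pictured with a toy model. The sketch below is not how Truffle implements it: the LoopNode class is hypothetical, the "compiled" loop is just another Ruby method, and compilation is treated as instantaneous, whereas the text above describes the jump happening once the compilation has finished. It only shows the key idea that the live loop state (here the counter i) is handed over in the middle of the loop.

```ruby
# Toy model of On-Stack-Replacement for a counting loop.
class LoopNode
  attr_reader :interpreted_iterations, :compiled_iterations

  def initialize(threshold: 100_000)
    @threshold = threshold
    @interpreted_iterations = 0
    @compiled_iterations = 0
  end

  # Runs the block for i in 0...limit and returns limit.
  def execute(limit, &body)
    i = 0
    while i < limit
      # The iteration counter reached the threshold: hand the live state (i)
      # over to the "compiled" loop, which finishes the remaining iterations.
      if @interpreted_iterations >= @threshold
        return compiled_loop(i, limit, &body)
      end
      body.call(i)
      @interpreted_iterations += 1
      i += 1
    end
    limit
  end

  private

  # Stands in for the machine code produced by the OSR compilation.
  def compiled_loop(i, limit)
    while i < limit
      yield i
      @compiled_iterations += 1
      i += 1
    end
    limit
  end
end

node = LoopNode.new(threshold: 100_000)
node.execute(1_000_000) { |i| i }
puts "interpreted: #{node.interpreted_iterations}, compiled: #{node.compiled_iterations}"
```

Even though execute is entered only once, most of the iterations end up in the fast path, which is what a method-level JIT without OSR cannot achieve for a method called a single time.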
In the previous GraalVM version (0.30), Integer#times was written in Java and did not have OSR support (it would be possible but more complex).
This caused the block given to times to be compiled but not the loop itself, which makes a big difference as calling a block from the interpreter is much slower than an inlined block call.
When I was playing with my own solution for Day 15, I tried redefining Integer#times in Ruby and that alone sped up the execution from 7 seconds to 2 seconds, illustrating the gains of On-Stack-Replacement.
Interesting how defining more in Ruby can actually help performance.
Conclusion
Improving the performance of short-running programs with just-in-time compilers is challenging.
If the program executes for less than a second, none of the implementations with a JIT compiler managed to gain anything compared to MRI trunk. But, they also all took less than a second, so it probably doesn’t matter much for scripts run only once or a few times.
For programs running for longer than a second, TruffleRuby Native shows it is possible to gain a significant amount of time with a Just-In-Time compiler.
This requires fast startup (well below 1 second) and fast warmup (otherwise the program finishes before the compiled code is used).
Of course, the JIT compiler benefits from being more advanced, for example by supporting inlining and having a better understanding of Ruby’s constructs.
In the case of higher-level loops like Integer#times called only once and with many iterations, On-Stack-Replacement is important to achieve good performance.
YARV-MJIT and RTL-MJIT are exciting but still very young JIT compilers for MRI. Improving warmup while shelling out to GCC (or Clang) is certainly challenging. Making GCC (or Clang) understand Ruby constructs better is also gonna be interesting. Let’s see what the future brings.
If you want to try TruffleRuby Native, you can download GraalVM from OTN. See Getting Started for details. We are working on making it easier to install TruffleRuby (e.g., with rvm/rbenv-install/ruby-install), but that has not landed yet.
If you liked this post, consider following @eregontp on Twitter for more Ruby, performance and concurrency blog posts.