How TruffleRuby's Startup Time Became Faster Than MRI's

Introduction

I want to talk about VM startup in Ruby. That is, the time it takes for a Ruby implementation to print “Hello World” with:

$ ruby -e 'puts "Hello World"'

This is a lower bound for running any Ruby script or application, and lower startup time typically results in an improved developer experience.

MRI has been the gold standard for startup time, unbeaten so far by other Ruby implementations. Can we set a new record? Without further ado, here are the results for VM startup on the latest Ruby implementations:

Implementation                   Real Time (s)
TruffleRuby Native 1.0.0-rc16    0.025
MRI 2.6.2                        0.048
Rubinius 3.107                   0.150
JRuby 9.2.7.0                    1.357
TruffleRuby JVM 1.0.0-rc16       1.787

TruffleRuby Native (the default) is leading here, with a startup time of only 25ms, followed by 48ms for MRI, 150ms for Rubinius and 1357ms for JRuby. OTOH, TruffleRuby JVM is much more representative of a typical (and slow) JVM startup, in the same order of magnitude as JRuby. To clarify, TruffleRuby Native is what one gets when installing TruffleRuby via rvm, rbenv or chruby.

Each measurement is the average of 10 consecutive runs with:

$ precise_time 10 `which ruby` -e 'puts "Hello World"'

precise_time is a small C program I wrote to measure startup time precisely using clock_gettime(CLOCK_MONOTONIC).
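precise_time itself is not shown in this post. As a rough stand-in, the same kind of measurement can be sketched in Ruby with Process.clock_gettime(Process::CLOCK_MONOTONIC). The helper name average_wall_time is made up for this illustration, and the overhead of the Ruby harness makes it less precise than a small C program would be.

```ruby
require "rbconfig"

# Hypothetical helper (not the author's precise_time): runs `command`
# `runs` times and returns the average wall-clock time in seconds.
def average_wall_time(runs, *command)
  total = 0.0
  runs.times do
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    # Discard the child's output so terminal printing is not measured.
    system(*command, out: File::NULL) or raise "command failed: #{command.inspect}"
    total += Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  end
  total / runs
end

# RbConfig.ruby is the path of the Ruby running this script.
seconds = average_wall_time(10, RbConfig.ruby, "-e", 'puts "Hello World"')
printf("%.3f s on average\n", seconds)
```

Using the monotonic clock avoids skew from wall-clock adjustments during the runs.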

Note that I do not consider any tricks to improve startup time here, such as keeping a prepared JVM process around (nailgun, drip, etc). I just use my Ruby manager (chruby), switch to each ruby and run Hello World. FWIW, I think these kinds of startup tricks are usually cumbersome to set up, and are therefore rarely used and not representative of what most users experience.

Ahead-Of-Time Compilation with Native Image

So how did we get from 1787ms to 25ms? The first and biggest step goes from 1787ms to 165ms (10x faster) by using a custom VM called SubstrateVM.

Unlike HotSpot, for example, SubstrateVM loads all Java classes ahead-of-time rather than at runtime. This is done once during a step called Native Image Generation, where SubstrateVM looks at the classpath, includes every reachable class and compiles those classes and methods to native code, all of that ahead-of-time. This native code is stored in a native executable, i.e., a “Native Image”.

For instance, TruffleRuby is written in Java and Ruby, and all Java classes are compiled ahead-of-time into a native executable:

$ ls -lh `which truffleruby`
-rwxr-xr-x. 1 eregon eregon 134M Apr 20 00:51 .../truffleruby-1.0.0-rc16/bin/truffleruby

At runtime, the application will then start with the main() method, but no classloading or JIT compilation of Java classes is needed; every Java method is already compiled to native code and ready to be run.

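That step alone reduces startup time to around 165ms. I think we can attribute these gains to what is described above: no classloading and no JIT compilation of Java classes are needed at runtime, since every Java method is already compiled to native code.

This is not new, and Kevin Menard already discussed it in a blog post two years ago, but it is important context for discussing further startup optimizations.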

Storing Extra Data in the Native Image

SubstrateVM not only compiles Java classes ahead-of-time, it also executes the static initializers (corresponding to statements in the class/module body in Ruby).

This gives the opportunity to run additional initializations ahead-of-time and store the resulting state in static (class) variables.

One prime example here for TruffleRuby is storing the Ruby files of the core library. The core library in TruffleRuby is defined mostly in Ruby (similar to Rubinius) and as a result there are about 18 000 significant lines of Ruby code to load, on every startup, before executing any user Ruby code.

Reading the 89 core library files from disk takes time, and beyond that, parsing them to Abstract Syntax Trees (ASTs) is also time-consuming. It turns out we can both read the files and parse them to ASTs ahead-of-time, in a static initializer, while SubstrateVM is compiling our Java code!

This gives another boost for startup, producing Hello World in around 96ms. That’s still significantly slower than MRI, so we need to go deeper.
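The "parse once, keep the result" idea can be sketched in plain Ruby. This is only an analogy, not TruffleRuby's code: the module body plays the role of a Java static initializer, Ripper (from the standard library) stands in for TruffleRuby's parser, and the file names and contents in SOURCES are invented.

```ruby
require "ripper"

module ToyCoreLibrary
  # Stand-ins for core library files; names and contents are invented.
  SOURCES = {
    "kernel.rb" => "def greet(name)\n  'Hello ' + name\nend\n",
    "post.rb"   => "GREETING = greet('World')\n",
  }.freeze

  # Evaluated once, when the module body runs. In the analogy, this is the
  # static initializer that runs while the image is being built.
  PARSED = SOURCES.map { |file, source| [file, Ripper.sexp(source)] }.to_h.freeze

  # At "startup" nothing is read or parsed again; the tree is just looked up.
  def self.ast_for(file)
    PARSED.fetch(file)
  end
end

p ToyCoreLibrary.ast_for("kernel.rb").first
```

In a regular Ruby process the module body still runs on every startup, of course. The point of the Native Image is that the equivalent of PARSED is computed during image generation and is already in memory when the executable starts.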

Pre-Initialization and Freezing the Entire Heap with Native Image

Since we can read files and parse Ruby code ahead-of-time, could we also evaluate Ruby code ahead-of-time? Could we load the entire core library Ruby code during Native Image Generation?

Yes we can, with SubstrateVM, and that’s how we got to 25ms startup! We call this step pre-initialization and this is actually available for any GraalVM-based language.

However, this requires some care, and some parts of the core library initialization need to be delayed to, or patched at, runtime because they depend on the runtime environment. For instance, setting $$ (the PID global variable) must be done at runtime, otherwise it would reflect the PID of the Native Image Generator process, which is no longer running.

To optimize startup time to 25ms, I introduced pre-initialization in TruffleRuby as well as startup metrics, and as a side effect those metrics now document the steps performed during startup:

$ AOT_BIN=`which truffleruby` jt metrics time --native --experimental-options --metrics-time-parsing-file -e 'puts "Hello World"'
..........
0.026 total
 0.004 vm
 0.022 main
  0.013 patch-context
   0.000 options
   0.001 create-native-platform
   0.003 rehash
   0.009 run-delayed-initialization
    0.000 kernel_operations.rb
    0.000 encoding.rb
    0.001 env.rb
    0.000 posix.rb
    0.000 main.rb
    0.004 post.rb
    0.003 post-boot.rb
  0.006 run
   0.004 script

With that, we can see around 13ms is spent in patch-context, the step to incorporate the runtime environment into the pre-initialized core library.
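To read such a breakdown programmatically, the indented output can be parsed into entries with a nesting depth. The following is a small sketch, assuming one leading space per nesting level as in the output above; Metric, parse_metrics and share_of_total are names invented for this example, and only some of the lines are copied.

```ruby
# A few lines copied from the metrics output above.
METRICS_LINES = [
  "0.026 total",
  " 0.004 vm",
  " 0.022 main",
  "  0.013 patch-context",
  "   0.009 run-delayed-initialization",
  "  0.006 run",
  "   0.004 script",
].freeze

Metric = Struct.new(:depth, :name, :seconds)

# Turns each line into a Metric; the depth is the number of leading spaces.
def parse_metrics(lines)
  lines.map do |line|
    indent, seconds, name = line.match(/\A( *)(\d+\.\d+) (\S+)\z/).captures
    Metric.new(indent.size, name, seconds.to_f)
  end
end

# Percentage of the top-level total spent in the step called `name`.
def share_of_total(metrics, name)
  total = metrics.find { |m| m.depth == 0 }.seconds
  part = metrics.find { |m| m.name == name }.seconds
  (100 * part / total).round
end

metrics = parse_metrics(METRICS_LINES)
puts "patch-context: #{share_of_total(metrics, 'patch-context')}% of total"
```

With these numbers, patch-context accounts for about half of the total startup time.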

A simple example of initialization that must be delayed to runtime is setting the process ID:

Truffle::Boot.delay do
  $$ = Process.pid
end

Similarly, setting ARGV must be done from runtime command-line arguments:

Truffle::Boot.delay do
  ARGV = Truffle::Boot.original_argv
end
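How Truffle::Boot.delay is implemented is not shown in this post. Conceptually, it can be thought of as a registry of blocks that are recorded during pre-initialization and executed at real startup. A toy version of that idea, with ToyBoot and ToyCore as invented names, might look like this:

```ruby
module ToyBoot
  @delayed = []
  @preinitializing = true

  class << self
    # During pre-initialization, remember the block instead of running it.
    def delay(&block)
      if @preinitializing
        @delayed << block
      else
        block.call
      end
    end

    # Called once at real startup: run everything that was postponed.
    def run_delayed_initialization
      @preinitializing = false
      @delayed.each(&:call)
      @delayed.clear
    end
  end
end

# Invented holder for state that depends on the runtime environment.
module ToyCore
  class << self
    attr_accessor :pid, :argv
  end
end

# "Image build time": the blocks are only registered, not executed.
ToyBoot.delay { ToyCore.pid = Process.pid }
ToyBoot.delay { ToyCore.argv = ARGV.dup }
```

Calling ToyBoot.run_delayed_initialization at startup then fills in ToyCore.pid and ToyCore.argv from the process that is actually running, rather than from the process that built the image.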

Here is a list of other initializations that must be done at runtime:

This last item, setting a new random seed for hashing, is actually significantly more complicated than the rest. It’s easy to generate a new random seed and store it, and even to use it for new #hash calls. But what about existing objects, which might cache their #hash like Symbols? What about Hash objects created during pre-initialization, which determine the position in the buckets array by using the #hash of the keys, which itself used the old random seed? All of these need to be “patched”, to recompute their #hash based on the new random seed. To do so, we track every Hash created during pre-initialization and rehash internal hash tables such as the Symbol table.
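The effect of a changed seed on existing Hash objects can be reproduced in plain Ruby. This is a sketch of the problem, not of TruffleRuby's internals: SeededKey is an invented class whose #hash depends on a class-level seed, and Hash#rehash plays the role of the patching step.

```ruby
# Invented key class: its #hash depends on a seed that can change later.
class SeededKey
  class << self
    attr_accessor :seed
  end
  self.seed = 1

  attr_reader :name

  def initialize(name)
    @name = name
  end

  def hash
    [self.class.seed, name].hash
  end

  def eql?(other)
    other.is_a?(SeededKey) && name == other.name
  end
end

NAMES = ("a".."z").to_a
# Built with seed 1, like a Hash created during pre-initialization.
TABLE = NAMES.map { |n| [SeededKey.new(n), n.upcase] }.to_h

# Number of names that can still be found in the table.
def hits(table)
  NAMES.count { |n| table[SeededKey.new(n)] == n.upcase }
end

before = hits(TABLE)
SeededKey.seed = 2      # "runtime" picks a new seed
stale = hits(TABLE)     # entries still sit where the old seed put them
TABLE.rehash            # recompute positions using the current #hash values
patched = hits(TABLE)
puts "hits: before=#{before} stale=#{stale} patched=#{patched}"
```

After the seed changes, lookups are expected to miss until the table is rehashed, which is why every Hash created during pre-initialization has to be tracked.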

There are currently some restrictions on what SubstrateVM can pre-initialize:

The ability to run static initializers ahead-of-time actually implies that SubstrateVM is able to store an entire Java heap into a native executable, such that when the executable starts, all the Java objects are already available with no extra work, with no deserialization or copy. SubstrateVM is therefore able to run arbitrary code during Native Image Generation and “freeze” the resulting Java heap into an executable. From that perspective, it is somewhat similar to CRIU, but more specialized and therefore faster for startup.

Autoloading RubyGems

We use another trick to speed up startup in TruffleRuby: we load RubyGems lazily. To do so, we use an autoload for the Gem constant and hook into require so RubyGems is loaded the first time require raises a LoadError (see this file for details).

require "rubygems" in RubyGems 3.0.3 loads 21 RubyGems files (4010 SLOC), as well as the rbconfig (160 SLOC), uri (12 files, ~2000 SLOC), stringio (559 SLOC) and monitor (127 SLOC) standard libraries. That’s a fair amount of code which takes a while to load, and in some cases it’s simply not used at all (e.g., command-line tools using only the standard library).
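The lazy-loading pattern can be illustrated with a toy library instead of RubyGems. This is a simplified sketch and not the actual TruffleRuby hook: ToyHeavyLibrary, its activate method and lazy_require are invented for the example, and the real mechanism hooks into require itself rather than using a separate method.

```ruby
require "tmpdir"
require "fileutils"

# Write a stand-in for an expensive library to a temporary file.
TOY_DIR = Dir.mktmpdir
at_exit { FileUtils.remove_entry(TOY_DIR) }
TOY_PATH = File.join(TOY_DIR, "toy_heavy_library.rb")
File.write(TOY_PATH, <<~'RUBY')
  module ToyHeavyLibrary
    def self.activate(feature)
      "would search installed gems for #{feature}"
    end
  end
RUBY

# Register the constant without loading the file yet.
Object.autoload(:ToyHeavyLibrary, TOY_PATH)

# Try a plain require first; only when it fails, touch the constant,
# which triggers the autoload and loads the expensive library.
def lazy_require(feature)
  require feature
rescue LoadError
  ToyHeavyLibrary.activate(feature)
end
```

As long as every require succeeds, Object.autoload?(:ToyHeavyLibrary) keeps returning the path, meaning the file has not been loaded. The first lazy_require of a missing feature loads it.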

Future Work and Application Startup

In this blog post, I showed how we optimize VM startup in TruffleRuby. A related topic is application startup, that is, how long it takes before your application starts doing any useful work. For example, how long a Rails app takes before it can accept its first request. This of course depends a lot on the specific application.

Application startup is not really fast yet on TruffleRuby, and it is something we want to improve on. For instance, gem and bundle commands are significantly slower than on MRI currently.

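Application startup seems to be composed of mostly two aspects: loading the application's Ruby files (reading and parsing them), and running that code in the interpreter.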

For the first aspect, we could consider parsing the Ruby files that might be loaded by the application and saving the resulting ASTs in the Native Image. This would skip both reading them from disk and parsing them, much like we already do for the core library. However, this would require recreating a Native Image every time we want to include additional files, so it is not very practical for user-installed gems. We could of course also serialize ASTs (similarly to how Bootsnap caches serialized MRI bytecode), but then deserialization is typically slower than having all AST node objects already in the Native Image.
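The serialization alternative can be sketched as a small on-disk cache. This is purely illustrative and is neither how Bootsnap nor TruffleRuby works: ToyParseCache is an invented class, Ripper's S-expressions stand in for real ASTs, and Marshal stands in for whatever serialization format would actually be used.

```ruby
require "ripper"
require "tmpdir"
require "digest/sha2"

# Invented on-disk cache: the parsed form is keyed by a digest of the source.
class ToyParseCache
  def initialize(dir)
    @dir = dir
  end

  def fetch(source)
    path = File.join(@dir, Digest::SHA256.hexdigest(source) + ".bin")
    if File.exist?(path)
      # Cache hit: no parsing, but the tree must still be deserialized.
      Marshal.load(File.binread(path))
    else
      # Cache miss: parse, then store the serialized tree for next time.
      Ripper.sexp(source).tap { |tree| File.binwrite(path, Marshal.dump(tree)) }
    end
  end
end

TOY_CACHE = ToyParseCache.new(Dir.mktmpdir)
TOY_SOURCE = "puts 'Hello World'\n"

TOY_CACHE.fetch(TOY_SOURCE)          # parses and writes the cache file
tree = TOY_CACHE.fetch(TOY_SOURCE)   # reads and deserializes instead of parsing
p tree.first
```

Even on a cache hit, every node has to be rebuilt in memory by Marshal.load, which is the deserialization cost mentioned above. Objects stored in the Native Image need no such step.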

The second aspect is directly related to interpreter speed. TruffleRuby currently gathers a lot of profiling information while running in the interpreter, to later feed it to the Just-In-Time compiler so that it can compile only the relevant part of each operation used by the Ruby code. However, that makes running code in the interpreter slower than on MRI, which currently gathers almost no profiling information. I do think we can improve interpreter speed at both the TruffleRuby and Truffle levels though.

Acknowledgments

I would like to thank Tomáš Zezula, who worked on pre-initialization support in the Truffle Language Implementation Framework, as well as all the contributors to SubstrateVM who helped make fast startup for Java a reality. I would also like to thank Kevin Menard for proofreading this blog post.