Compiling Ruby Code and Analyzing YJIT

There have been many attempts to compile Ruby code to machine code. Most recently, YJIT, a new JIT compiler inside CRuby, has joined them. In this post I explain relevant background for compiling Ruby code and analyze YJIT in more detail. I think this is important because many people might not know how to categorize and analyze the various approaches to JIT compilation of dynamic languages like Ruby.

Interpreter

To start with a bit of context, let’s take a high-level look at how the CRuby interpreter works. First, it takes the Ruby source code, parses it to an Abstract Syntax Tree (AST) and then “compiles” it to Ruby bytecode. Ruby bytecode is just another representation of the Ruby source code, but it is faster to interpret than the AST.

The Ruby bytecode is composed of instructions like reading a particular local variable, writing an instance variable, calling a method, etc. Each of these instructions in CRuby has a corresponding C function, which you can see here.
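
As a concrete illustration, CRuby lets you inspect this bytecode directly. Here is a minimal sketch using the standard RubyVM::InstructionSequence API (the exact instruction names in the output vary slightly between CRuby versions):

```ruby
# Compile a small snippet and print its bytecode.
iseq = RubyVM::InstructionSequence.compile("a = 1; @total = a + 1")
puts iseq.disasm
# The disassembly shows instructions such as setlocal/getlocal (write/read a
# local variable), putobject (push a constant), opt_plus (the specialized
# instruction for +) and setinstancevariable (write an instance variable);
# each of these has a corresponding C function in the interpreter.
```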

AOT

The first distinction is whether the compiler for Ruby operates Ahead-of-Time (AOT) or Just-in-Time (JIT).

Ruby is a dynamic language and Ruby source code generally has no type annotations. Sorbet is a way to add types, but the vast majority of Ruby code out there is untyped.

An AOT compiler could do some type inference over Ruby code, but this becomes essentially impossible as soon as you consider monkey-patching or eval’ing code which might define new methods. The Sorbet Compiler does some of that inference, but it only works if everything is typed and there is a guarantee that no monkey-patching and no metaprogramming is used to define methods dynamically, which is not the case for the majority of Ruby code out there.
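
As a small, hypothetical illustration of why static inference breaks down: nothing prevents code loaded later from redefining a core method or defining brand new methods at runtime.

```ruby
def double(x)
  x + x            # with Integer arguments this is integer addition...
end

double(21)         # => 42

# ...but any code loaded later can monkey-patch Integer#+:
class Integer
  def +(other)
    "not a number anymore"
  end
end

double(21)         # => "not a number anymore"

# Methods can also appear at runtime via eval or define_method, so an AOT
# compiler cannot even know the full set of methods ahead of time:
eval("def created_at_runtime; 42; end")
```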

Without types, AOT compilation is near useless for improving performance in Ruby, because the only thing it achieves is concatenating the code for the instructions together, without any knowledge of what the operands are. The generated code is therefore completely generic, i.e., no better than the interpreter. Basically, the only thing AOT compilation without types would gain for Ruby is to remove the interpreter dispatch cost. That cost is not very big: CPUs have been optimized to run loops over switch statements for ages. Of course, this does not mean it’s not fun or interesting to try to AOT compile Ruby, but it will not improve performance significantly compared to an interpreter.
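
To make the dispatch cost concrete, here is a minimal, hypothetical sketch of an interpreter loop (an invented toy instruction set, not CRuby’s): the case dispatch executed for every instruction is the overhead an untyped AOT compiler would remove, while the work inside each branch stays exactly the same.

```ruby
def interpret(bytecode)
  stack = []
  bytecode.each do |insn, arg|
    case insn                        # this per-instruction dispatch is the interpreter overhead
    when :push  then stack.push(arg)
    when :add   then stack.push(stack.pop + stack.pop)
    when :print then puts stack.pop
    end
  end
end

interpret([[:push, 2], [:push, 3], [:add], [:print]])   # prints 5
```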

JIT

So to compile Ruby code meaningfully, one needs to know the runtime types (or classes) of the operands/inputs. That’s why JIT compilers are much more popular than AOT compilers for dynamically-typed languages like Ruby, where the source code generally carries no type information. A JIT relies on profiling information which tells it the types of various values, which branches are taken or not, and more. Based on that it can then generate specialized code which is more efficient than generic code handling all cases. This profiling information is typically provided by the interpreter or by running the code in some special mode to record it.
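
Here is a rough, hypothetical Ruby sketch (not actual VM code; integer_plus_not_redefined? is an invented stand-in for CRuby’s redefinition flags) contrasting the fully generic handling of a + b with the guarded, specialized code a JIT can emit once profiling shows both operands have always been Integers:

```ruby
# Generic path: what an interpreter or a type-blind AOT compiler must do.
def generic_add(a, b)
  a.send(:+, b)   # full method lookup; works for Integer, Float, String, user classes...
end

# Hypothetical stand-in for the VM's "Integer#+ has not been redefined" flag.
def integer_plus_not_redefined?
  true
end

# Specialized path: what a JIT can emit for a call site where profiling
# only ever observed Integer operands.
def specialized_add(a, b)
  if a.instance_of?(Integer) && b.instance_of?(Integer) && integer_plus_not_redefined?
    a + b                    # fast path: plain integer addition, no method lookup
  else
    generic_add(a, b)        # guard failed: fall back (or deoptimize and recompile)
  end
end

specialized_add(2, 3)        # => 5, via the fast path
specialized_add("a", "b")    # => "ab", via the generic fallback
```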

JIT Tiers

It is frequent for Just-in-Time compilers and Virtual Machines (VMs) to have multiple tiers. If there is an interpreter, it is often referred to as Tier 0. For instance, the HotSpot JVM has 3 tiers, and some JavaScript VMs even have 4 tiers.

The general rule is: the lower the tier, the fewer optimizations; the higher the tier, the more optimizations.

In a typical 3-tier VM we have:

- Tier 0: an interpreter, which starts executing immediately and gathers profiling information.
- Tier 1: a baseline JIT, which compiles very quickly and applies few optimizations, mostly removing interpreter dispatch overhead.
- Tier 2: an optimizing JIT, which compiles hot code more slowly but applies many more optimizations (inlining, specialization, etc.).

As a note, TruffleRuby has a design fairly close to this, with the detail that the GraalVM Compiler is used for both Tier 1 and Tier 2, but in different configurations.

The reason to have multiple tiers is that JIT compiling code takes time, and the more optimizations the longer it takes. Yet it is important to reach good performance quickly and not have very long warmup. That is the main reason for Tier 1 JIT compilers: to improve warmup.

YJIT

With this background we are ready to analyze YJIT.

YJIT is quite impressive in that, in about a year, it has become a working JIT compiler with full compatibility with the CRuby interpreter. There are of course some trade-offs, like limited speedups and maintainability concerns.

YJIT is a Tier 1 JIT compiler. This is how Vladimir Makarov describes YJIT, and it fits well with the background above.

YJIT currently does not do any inlining, which is typical of Tier 1 JITs. It is very fast at compiling code, which is one of the main goals of Tier 1 JITs. It provides some speedups, but they are not comparable to the speedups of highly optimizing Ruby JITs like TruffleRuby.

MJIT is clearly more of a Tier 2 JIT (it takes longer to compile and does some inlining), except that its performance is not that good yet due to various limitations.

You probably already know Node.js and the fact that it runs on Google’s V8 JavaScript engine. There are actually two “generations” of V8:

- the older generation, with the FullCodeGen baseline compiler and the Crankshaft optimizing compiler;
- the newer generation, with the Ignition interpreter and the TurboFan optimizing compiler.

YJIT reminds me a lot of V8’s FullCodeGen baseline compiler. Both of these are Tier 1 compilers, and it wouldn’t be too far from reality to summarize them as “template compilers”, i.e., they emit some given machine code for a given instruction (much like the AOT compilers I mentioned above), but with the added twist of doing a little bit of runtime profiling for a few instructions, which helps performance quite a bit. This means there is no optimization across multiple bytecodes or for a whole Ruby method, so the speedups are fairly limited.
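
To illustrate the “template compiler” idea, here is a hypothetical Ruby sketch (invented names, not YJIT’s or V8’s actual code): each bytecode instruction maps to a fixed template, and compilation is just emitting the templates in bytecode order, with no optimization across instructions.

```ruby
# One fixed code template per instruction (described as strings here).
TEMPLATES = {
  getlocal:            "load the local's slot from the frame into a register",
  putobject:           "load a constant into a register",
  opt_plus:            "guard both operands are Integers, add them, else side-exit",
  setinstancevariable: "guard the object shape, store into the ivar slot, else side-exit",
}

# "Compilation" is simply concatenating templates in bytecode order.
def template_compile(bytecode)
  bytecode.map { |insn| TEMPLATES.fetch(insn) }
end

template_compile(%i[getlocal putobject opt_plus setinstancevariable])
```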

On the other hand, they compile very quickly, which is why there was not even an interpreter in the FullCodeGen days: each JavaScript function would get JITed before its first execution (this turned out to be not so great for memory footprint, hence the addition of the Ignition interpreter).

Logic Duplication and Maintainability

The general design of YJIT also seems fairly close to the FullCodeGen/Crankshaft generation of V8. For instance, it’s a very direct approach in that there is logic to emit architecture-specific (x86_64 for YJIT) machine code for a given bytecode instruction.

While this offers a lot of flexibility, it is also a significant maintainability issue, because complex logic is duplicated in various forms in different places. For example, the logic to read an instance variable is duplicated in yjit_codegen.c and in vm_insnhelper.c.

My understanding is that FullCodeGen/Crankshaft were replaced in large part because they became unmaintainable. It is really hard to keep multiple duplicates of the same logic in sync and ensure they behave exactly the same in all corner cases (and if they don’t, it leads to hard-to-track-and-understand bugs). And this of course only becomes worse with the many architectures and tiers that V8 supports.

The newer V8 generation avoids this issue in large part by using a kind of macro assembler (think of it as generic assembly code, which can then be used to emit assembly for a given architecture), and using that macro assembler for both the interpreter and the JIT.
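
Here is a hypothetical Ruby sketch (invented names, not V8’s real API) of that idea: the per-instruction logic is written once against an abstract backend; one backend executes the operations directly (the interpreter), another records architecture-specific code (the JIT).

```ruby
class ExecutingBackend           # "interpreter" backend: runs operations directly
  def guard_integer(v)
    raise TypeError unless v.is_a?(Integer)
  end

  def add(a, b)
    a + b
  end
end

class EmittingBackend            # "JIT" backend: records pseudo machine code
  attr_reader :code

  def initialize
    @code = []
  end

  def guard_integer(reg)
    @code << "test #{reg} for Integer tag; jne side_exit"
  end

  def add(a, b)
    @code << "add #{a}, #{b}"
    a
  end
end

# The logic for the opt_plus instruction is written exactly once:
def opt_plus(backend, a, b)
  backend.guard_integer(a)
  backend.guard_integer(b)
  backend.add(a, b)
end

opt_plus(ExecutingBackend.new, 2, 3)    # => 5
jit = EmittingBackend.new
opt_plus(jit, :rax, :rcx)
jit.code                                # => the recorded pseudo machine code
```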

MJIT mostly avoids this issue by reusing C code from the interpreter and using the C compiler for JIT compilation (of course this offers less flexibility for optimizations). TruffleRuby avoids this issue by partially evaluating the interpreter, which means the same logic/code is used both for the interpreter and the JIT.

This maintainability issue seems difficult to resolve in YJIT, because YJIT can’t use the instruction C functions for generating specialized code for instructions (it can call C functions, but then it doesn’t optimize anything over the interpreter), and the interpreter cannot use YJIT’s x86_64-specific assembly (it’s not portable and not particularly readable).

Limitations due to CRuby design

Making a JIT which provides good speedups in CRuby is hard, notably because CRuby has been optimized a lot for interpretation and rather little for just-in-time compilation.

As Maxime mentions in this post under Next Steps, there are various things which could be improved in CRuby, but those will likely take a long time. For instance, there has been ongoing work for several years to use Variable Width Allocation in CRuby. There is a lot of low-level C code in CRuby, very rarely documented, so it is a huge project to work on any of these without breaking everything else.

Almost all of the core library in CRuby is written in C, which means YJIT or MJIT see most of the core library as a black box and cannot optimize it. There are some exceptions for a few core library methods which can be manually specialized or intrinsified (at the cost of duplication, of course), but this is not feasible to extend to a significant part of the core library. Even if YJIT/MJIT could reuse the C code of the core library from CRuby, it is very unlikely they would be able to optimize it meaningfully. The C functions which implement the core library must handle all cases, but to make Ruby code faster one needs to specialize and optimize only the relevant part of the logic; otherwise the JIT will not do a better job than a C compiler. Moving some core library methods to Ruby code would help here, but that would make the CRuby interpreter slower.
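
As a tiny, hypothetical illustration of that last point: if a core method like Integer#times were written in Ruby (sketched below as my_times), a JIT could see through the loop and the block and optimize them together, whereas the real C implementation is opaque to it; the flip side is that the plain interpreter would run the Ruby version slower than the C one.

```ruby
# Hypothetical Ruby-level equivalent of a core method (simplified).
class Integer
  def my_times
    i = 0
    while i < self
      yield i
      i += 1
    end
    self
  end
end

3.my_times { |i| puts i }   # a JIT can inline the block and specialize the loop
```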

This is where alternative Ruby implementations shine: they are not bound by any of these limitations and can use better data representations, allocators, GCs, etc.

Conclusion

I hope this post helped you better understand the landscape of Ruby compilers and the various trade-offs involved. YJIT is a recent addition with fast compilation, full compatibility and some speedups, but also various limitations and trade-offs. I believe TruffleRuby will always have more opportunities for better performance, as it is not bound by the limitations in CRuby and already has one of the best compilers in the world (the GraalVM Compiler), which understands Ruby code, C code and more.