Matching Regexps 200 Times Faster
You might have seen @byroot’s excellent blog post series on optimizing the json
gem.
From the first blog post it’s clear most of the time for generating JSON is spent in generate_json_string()
and specifically in convert_UTF8_to_JSON()
, i.e., in converting Ruby Strings to JSON Strings.
Because this is such a hot part of JSON generation, it has received a lot of attention.
In part 3 @byroot shows a lookup table approach to implement convert_UTF8_to_JSON()
, which is now used in the json
gem.
There has been a couple attempts since then to optimize it further using SIMD instructions. This is however quite difficult and messy to do in C as it’s similar to writing inline assembly (N times, with N the number of SIMD variants), and it’s not portable so it needs both compile-time and run-time detection to pick the correct SIMD variant.
In this blog post we compare 3 contenders for the job of converting Ruby Strings to JSON Strings.
- The C extension code used in ruby/json
- The C extension code + AVX2 SIMD part of this SIMD PR
- A pure-Ruby version
The pure-Ruby version is so simple it fits in a few lines (and a big Hash
literal):
ESCAPE_MAP = {
"\x0" => '\u0000', "\x1" => '\u0001', "\x2" => '\u0002', "\x3" => '\u0003',
"\x4" => '\u0004', "\x5" => '\u0005', "\x6" => '\u0006', "\x7" => '\u0007',
"\b" => '\b', "\t" => '\t', "\n" => '\n', "\xb" => '\u000b',
"\f" => '\f', "\r" => '\r', "\xe" => '\u000e', "\xf" => '\u000f',
"\x10" => '\u0010', "\x11" => '\u0011', "\x12" => '\u0012', "\x13" => '\u0013',
"\x14" => '\u0014', "\x15" => '\u0015', "\x16" => '\u0016', "\x17" => '\u0017',
"\x18" => '\u0018', "\x19" => '\u0019', "\x1a" => '\u001a', "\x1b" => '\u001b',
"\x1c" => '\u001c', "\x1d" => '\u001d', "\x1e" => '\u001e', "\x1f" => '\u001f',
'"' => '\"', '\\' => '\\\\',
}
def utf8_to_json(string)
'"' + if /["\\\x0-\x1f]/.match?(string)
string.gsub(/["\\\x0-\x1f]/, ESCAPE_MAP)
else
string
end + '"'
end
The ESCAPE_MAP
is a bit verbose but it’s a simple way to exhaustively store how to escape each character which needs to be escaped.
The match?
is not strictly necessary but it avoids the allocation of a new String for the common case of no character to escape.
We use ruby/json
’s string encoding benchmarks with a small diff to run only those:
diff --git a/benchmark/encoder.rb b/benchmark/encoder.rb
index f0a05db..efbac2b 100644
--- a/benchmark/encoder.rb
+++ b/benchmark/encoder.rb
@@ -33,6 +33,8 @@ def implementations(ruby_obj)
end
def benchmark_encoding(benchmark_name, ruby_obj, check_expected: true, except: [])
+ return unless ["mixed utf8", "mostly utf8"].include?(benchmark_name)
json_output = JSON.dump(ruby_obj)
puts "== Encoding #{benchmark_name} (#{json_output.bytesize} bytes)"
@@ -40,6 +42,7 @@ def benchmark_encoding(benchmark_name, ruby_obj, check_expected: true, except: [
except.each { |i| impls.delete(i) }
Benchmark.ips do |x|
+ x.warmup = 5
expected = ::JSON.dump(ruby_obj) if check_expected
impls.values.each do |name, block|
begin
We select the benchmarks we want with an early return and we increase warmup to increase the stability of the benchmark results.
We use ONLY=json
because the SIMD PR doesn’t have the json_coder
variant (anyway both of these give very similar results for the selected benchmarks).
JSON String Escaping Benchmark
The benchmarks are straightforward:
# mixed utf8
JSON.generate([("a" * 5000) + "€" + ("a" * 5000)] * 500)
# mostly utf8
JSON.generate([("€" * 3333)] * 500)
They both dump to JSON an Array of 500 strings and end up generating a JSON result of 5MB.
Here are the results on my machine (AMD Ryzen 7 3700X 8-Core Processor
), with frequency scaling disabled and the performance
CPU governor:
C extension with CRuby 3.4.2 with YJIT (on ruby/json master):
$ ONLY=json ruby --yjit -Ilib:ext benchmark/encoder.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
mixed utf8 397.953 (± 4.5%) i/s (2.51 ms/i)
mostly utf8 402.388 (± 0.5%) i/s (2.49 ms/i)
We will use this as the baseline, specifically the mixed utf8
benchmark to keep things simple.
C extension + SIMD with CRuby 3.4.2 with YJIT (on the SIMD PR):
$ ONLY=json ruby --yjit -Ilib:ext benchmark/encoder.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
mixed utf8 1.498k (± 4.0%) i/s (667.68 μs/i)
mostly utf8 1.474k (± 1.6%) i/s (678.55 μs/i)
3.76 times faster, nice!
Pure-Ruby generator with TruffleRuby 24.1.1 JVM (on ruby/json master):
$ ONLY=json ruby -Ilib:ext benchmark/encoder.rb
truffleruby 24.1.1, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
mixed utf8 8.104k (± 0.6%) i/s (123.40 μs/i)
mostly utf8 8.070k (± 1.1%) i/s (123.91 μs/i)
20 times faster than the baseline and 5.4 times faster than SIMD!
And, of course much simpler, this is just a few lines of Ruby code and a Regexp.
If you guessed earlier that the C extension code + SIMD would be fastest, you may be surprised to learn the pure-Ruby version is considerably faster (on TruffleRuby)! That’s the benefit of an advanced JIT compiler which understands Ruby semantics like TruffleRuby’s JIT compiler (called Graal): you get to write nice, idiomatic Ruby and the JIT compiler does the hard work of optimizing it for your target platform.
For comparison, here’s the pure-Ruby generator with CRuby 3.4.2 with YJIT (on ruby/json master):
$ ONLY=json ruby --yjit -Ilib:ext benchmark/encoder.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
mixed utf8 39.551 (± 0.0%) i/s (25.28 ms/i)
mostly utf8 28.119 (± 0.0%) i/s (35.56 ms/i)
Unfortunately, it’s 10 times slower than the baseline, demonstrating the Regexp approach is much slower on CRuby. But, we ran with YJIT enabled, so why is it so much slower?
As a disclaimer, TruffleRuby (which uses the pure-Ruby generator) is currently slower than CRuby using the C extension
on JSON.generate
macro benchmarks (except canada.json
where it is faster).
Other aspects of JSON generation need to be better optimized on TruffleRuby.
Regexp#match? Benchmark
Let’s simplify the benchmark to just Regexp#match?
to understand better:
require 'benchmark/ips'
mixed_utf8 = ("a" * 5000) + "€" + ("a" * 5000)
Benchmark.ips do |x|
x.warmup = 5
x.report '/["\\\x0-\x1f]/.match?(mixed_utf8)' do
raise if /["\\\x0-\x1f]/.match?(mixed_utf8)
end
end
$ ruby --yjit bench_regexp_match.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
20.700k (± 2.4%) i/s (48.31 μs/i)
$ ruby bench_regexp_match.rb
truffleruby 24.1.1, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
4.699M (± 0.3%) i/s (212.83 ns/i)
TruffleRuby matches 10003 bytes in 213 nanoseconds, i.e., 47 bytes per nanosecond. That’s 227 times faster than CRuby at Regexp matching on this benchmark! Moreover, 47 bytes per nanosecond is greater than the number of CPU cycles we have per nanosecond (this processor has a max frequency of 4.4 GHz). That seems impossible, so what’s going on here?
How can Regexp matching be so fast on TruffleRuby?
There are multiple reasons:
- Regexps on TruffleRuby use TRegex, a state-of-the-art Regular Expression Engine:
- TRegex compiles regexps to finite-state automata, meaning it never goes back in the input string and will read each character at most once. Therefore, it always runs in linear time. CRuby uses backtracking which may go over the input string characters many times, which is much slower.
- Backtracking is prone to ReDoS when backtracking too much while finite-state automata is immune to ReDoS. CRuby 3.2+ has a protection for most regexps against ReDoS by using some form of caching. While that caching avoids many ReDoS, it also slows down the regexp execution by needing to check or fill this cache every time it might backtrack.
- TRegex JIT-compiles regexps to native code. CRuby compiles the regexp to regexp bytecode and then uses an interpreter to run that bytecode. That means it has dispatch overhead, does not optimize across bytecodes, etc. YJIT does not know anything about that regexp interpreter, it sees it as a blackbox because it is all written in C.
- TruffleRuby uses splitting so that each call site for
Regexp#match?
and other Regexp-related methods has its own copy of the logic. With that, TruffleRuby is able to inline the JIT-compiled code for that specific Regexp in the Ruby method callingmatch?
. - TRegex automatically uses SIMD, and specifically the best SIMD variant available on the running processor. It is able to do so because of the JIT compiler, which queries the CPU for which SIMD variants are available. TRegex does not need to dispatch between different SIMD variants like C would since the JIT compiler knows the best SIMD variant available. So we get to use SIMD instructions, but we don’t have to complicate a Ruby gem to benefit.
Specifically for this Regexp TRegex internally uses a ByteIndexOfCodePointSetNode, which automatically creates a lookup table for the character range in that regexp (["\\\x0-\x1f]
) and also automatically uses SIMD.
If you would like to know more about TRegex, the creator of TRegex and I did a talk about it at RubyKaigi 2021.
Kevin Menard also gave a talk about ReDoS and solutions at RailsConf 2022.
The JSON String Escaping example above is not an isolated case where TruffleRuby coincidentally is faster. Having a fast regular expression engine allows for more idiomatic solutions and allows Ruby code running on TruffleRuby to be faster than C code in many cases. Here are two more real world examples where TruffleRuby uses Regexps and it’s faster than C code!
Time.new(String)
Time.new
since Ruby 3.2 accepts a String argument.
CRuby implements it by parsing strings with char
pointers in about 80 lines of C.
TruffleRuby implements it in a straightforward way with a Regexp.
require 'benchmark/ips'
NOW_STRING = Time.now.to_s
Benchmark.ips do |x|
x.report("Time.new(String)") do
Time.new(NOW_STRING)
end
end
$ ruby --yjit bench_time_new_string.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
Time.new(String) 1.595M (± 0.5%) i/s (626.84 ns/i)
$ ruby bench_time_new_string.rb
truffleruby 24.1.1, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
Time.new(String) 3.923M (± 5.4%) i/s (254.90 ns/i)
TruffleRuby is 2.5 times faster here, while having a much simpler and cleaner implementation.
Let’s benchmark just the Regexp, although we use match
here since we need the capture groups:
require 'benchmark/ips'
NOW_STRING = Time.now.to_s
TIME_REGEXP = /\A (?<year>\d{4,5})
(?:
- (?<month>\d{2})
- (?<mday> \d{2})
[ T] (?<hour> \d{2})
: (?<min> \d{2})
: (?<sec> \d{2})
(?:\. (?<usec> \d+) )?
(?:\s* (?<offset>\S+))?
)?\z/x
Benchmark.ips do |x|
x.report("TIME_REGEXP.match(NOW_STRING)") do
raise unless TIME_REGEXP.match(NOW_STRING)
end
end
$ ruby --yjit bench_time_new_string_regexp.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
TIME_REGEXP.match(NOW_STRING) 1.234M (± 0.8%) i/s (810.47 ns/i)
$ ruby bench_time_new_string_regexp.rb
truffleruby 24.1.1, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
TIME_REGEXP.match(NOW_STRING) 33.448M (± 9.9%) i/s (29.90 ns/i)
TruffleRuby is 27 times faster for this Regexp.
So the Time.new(String)
benchmark is actually spending a lot of time in creating and initializing the Time
object.
StringScanner#scan_integer
StringScanner#scan_integer
is a method recently added to StringScanner.
CRuby implements it in 114 lines of C. TruffleRuby implements it in 18 lines of Ruby using Regexps. We use TruffleRuby master here since the method was introduced after the last TruffleRuby release.
require 'benchmark/ips'
require 'strscan'
SCANNER = StringScanner.new("123456789")
Benchmark.ips do |x|
x.report("StringScanner#scan_integer") do
SCANNER.reset
raise unless 123456789 == SCANNER.scan_integer
end
end
$ ruby --yjit bench_scan_integer.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
StringScanner#scan_integer 10.530M (± 0.6%) i/s (94.97 ns/i)
$ ruby bench_scan_integer.rb
truffleruby 25.0.0-dev-48a0a626, like ruby 3.3.5, Oracle GraalVM JVM [x86_64-linux]
StringScanner#scan_integer 30.230M (± 3.0%) i/s (33.08 ns/i)
TruffleRuby is about 3 times faster.
BTW, TruffleRuby implements StringScanner
entirely in Ruby code.
Ruby code is not only shorter and more expressive, it’s also more correct.
To that point, while implementing new StringScanner
methods we found 6 issues
with the C extension implementation of StringScanner
: 2 segfaults and 4 inconsistent/incorrect behaviors.
Let’s also benchmark just the Regexp for completeness:
require 'benchmark/ips'
INPUT = "123456789"
Benchmark.ips do |x|
x.report("/\A[+-]?\d+/.match(INPUT)") do
raise unless /\A[+-]?\d+/.match(INPUT)
end
end
$ ruby --yjit bench_scan_integer_regexp.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
/A[+-]?d+/.match(INPUT) 1.972M (± 0.8%) i/s (507.00 ns/i)
$ ruby bench_scan_integer_regexp.rb
truffleruby 25.0.0-dev-48a0a626, like ruby 3.3.5, Oracle GraalVM JVM [x86_64-linux]
/A[+-]?d+/.match(INPUT) 61.086M (± 6.9%) i/s (16.37 ns/i)
TruffleRuby is 31 times faster for this Regexp.
Conclusion
Sometimes, such as in the cases shown in this blog post, the idiomatic pure-Ruby solution is also the fastest and even faster than C code or inline assembly. This is not new on TruffleRuby, pure-Ruby code is often faster than everything else, including C extensions on CRuby. Ruby code is more expressive and higher-level which makes some optimizations possible that otherwise wouldn’t be if written in a lower-level language.
If you want to run idiomatic Ruby code fast, give TruffleRuby a try.