[HN Gopher] YJIT is the most memory-efficient Ruby JIT
___________________________________________________________________
YJIT is the most memory-efficient Ruby JIT
Author : panic
Score : 111 points
Date : 2023-11-14 16:46 UTC (6 hours ago)
(HTM) web link (railsatscale.com)
(TXT) w3m dump (railsatscale.com)
| JohnBooty wrote:
| Wow, Shopify continues to make some heroic improvements here.
| Kudos, kudos, kudos. Thanks, Shopify folks.
|
| One thing I didn't see discussed in the article was YJIT's memory
| usage relative to CRuby, the baseline non-JIT version of Ruby. It
| is certainly possible I missed it; that's been known to happen!
|
| Anyway, the news there is very good. We can see detailed
| information here:
|
| https://speed.yjit.org/memory_timeline.html#railsbench
|
| Currently Railsbench consumes a peak of ~95MB with CRuby, and a
| peak of ~110MB with YJIT. So, YJIT delivers 70% more performance
| while consuming 16% more RAM here. That is a tradeoff I think
| most people would gladly accept in most scenarios. =)
|
| Real-world speedups will be less, since a "real" web application
| spends much of its time waiting for the database and other
| external resources. As the article notes, Shopify's real-world
| observed storefront perf gain is 27.2%.
|
| YJIT is a success and its future is even brighter.
| compumike wrote:
| Also, a practical tip on YJIT memory usage: there is a
| "--yjit-exec-mem-size" option, see
| https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md#co...
| for more details. (This command-line argument is mentioned in the
| paper https://dl.acm.org/doi/10.1145/3617651.3622982 but not in
| this blog post about the paper.)
|
| At Heii On-Call https://heiioncall.com/ we use:
| ENV RUBY_YJIT_ENABLE=1
| ENV RUBYOPT=--yjit-exec-mem-size=16
|
| in our Dockerfile for our Rails processes.
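| Not an official recommendation, but if you enable YJIT through
| environment variables like this, it's worth verifying at boot that
| the flag actually took effect, since a typo in RUBYOPT fails
| silently. A minimal check, runnable on any Ruby 3.1+ build:

```ruby
# Report whether YJIT is actually running in this process.
# RubyVM::YJIT is only defined on Rubies built with YJIT support.
def yjit_status
  if defined?(RubyVM::YJIT) && RubyVM::YJIT.enabled?
    "YJIT enabled"
  else
    "YJIT not enabled"
  end
end

puts yjit_status
```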
| booleanbetrayal wrote:
| Any recollection on how you arrived at the --yjit-exec-mem-size
| value? We've been running YJIT in production for some time, but
| haven't looked into tuning this at all.
| JohnBooty wrote:
| Not parent poster and do not have production YJIT experience.
| =)
|
| My guess is that you would monitor
| `RubyVM::YJIT.runtime_stats[:code_region_size]` and/or
| `RubyVM::YJIT.runtime_stats[:code_gc_count]` so that you can
| get a feel for a reasonable value for your application, as
| well as know whether or not the "code GC" is running
| frequently.
|
| https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md#pe...
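| As a sketch (not production code), a sampler along those lines,
| guarding for Rubies where YJIT is absent or disabled, since the
| stats aren't always available:

```ruby
# Snapshot the two YJIT stats discussed above, or an empty hash
# when YJIT isn't compiled in / enabled in this process.
def yjit_code_stats
  return {} unless defined?(RubyVM::YJIT) && RubyVM::YJIT.enabled?
  stats = RubyVM::YJIT.runtime_stats || {}
  {
    code_region_size: stats[:code_region_size],
    code_gc_count: stats[:code_gc_count]
  }
end

# Log this periodically (or expose it as a metric) to pick a
# sensible --yjit-exec-mem-size for your app.
p yjit_code_stats
```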
| compumike wrote:
| That's exactly right. Our code_region_size levels off a bit
| under 8 MB, so we set the limit to 16. In practice we see
| code_gc_count stays at 0.
| JohnBooty wrote:
| Wow, that's interesting and it seems a little crazy? From the
| docs:
|
|     When JIT code size
|     (RubyVM::YJIT.runtime_stats[:code_region_size]) reaches this
|     value, YJIT triggers "code GC" that frees all JIT code and
|     starts recompiling everything. Compiling code takes some
|     time, so scheduling code GC too frequently slows down your
|     application. Increasing --yjit-exec-mem-size may speed up
|     your application if
|     RubyVM::YJIT.runtime_stats[:code_gc_count] is not 0 or 1.
|
| https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md#co...
|
| It just dumps _all_ the JIT-compiled code? I'd expect to see
| some kind of heuristic or algorithm there... LFU or something.
|
| The internals of a JIT are essentially black magic to me, and I
| know the people working on YJIT are super talented, so I am
| sure there is a good reason why they just dump everything
| instead of the least-frequently used stuff. Maybe the overhead
| of trying frecency outweighs the gains, maybe they just haven't
| implemented it yet, or maybe it's just a rarely-reached
| condition.
|
| (I hope a YJIT team member sees this, I'm super curious now)
| xerxes901 wrote:
| I don't work on YJIT but I _think_ I know the (or maybe an)
| answer to this. The code for a JIT'd Ruby method isn't
| contiguous in one location in memory. When a Ruby method is
| first compiled, a straight-line path through the method is
| emitted, and branches are emitted as stub code. When a stub
| is hit, the incremental compilation of that branch then
| happens. I believe this is called "lazy basic block
| versioning".
|
| When the stub is hit, the code that gets generated is
| somewhere _else_ in executable memory, not contiguous with
| the original bit of the method. Because these "lazy basic
| blocks" are actually quite small, the bookkeeping involved in
| "where is the code for this Ruby method" would be an
| appreciable fraction of the code size itself. Plus you then
| have to do more bookkeeping to make sure the method you want
| to GC isn't referred to by generated code in another method.
|
| Since low memory usage is an important YJIT goal, I guess
| this tradeoff isn't worth it.
|
| Maybe someone who knows this better will come along and
| correct me :)
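| To make the lazy-stub idea concrete, here's a toy illustration (my
| own sketch, nothing like YJIT's actual implementation): each branch
| target starts as an uncompiled stub, and the first call "compiles"
| it and caches the result, so only branches that actually execute
| ever get compiled.

```ruby
# Toy model of lazy branch compilation: a block "compiles" itself
# on first call and caches the compiled form for later calls.
class LazyBlock
  def initialize(name, &compile)
    @name = name
    @compile = compile   # deferred "compilation" step
    @compiled = nil
  end

  def call(*args)
    @compiled ||= @compile.call  # first hit: compile and cache
    @compiled.call(*args)
  end

  def compiled?
    !@compiled.nil?
  end
end

then_branch = LazyBlock.new("then") { ->(x) { x * 2 } }
else_branch = LazyBlock.new("else") { ->(x) { x + 1 } }

# Only the branch actually taken ends up compiled.
x = 5
result = x > 3 ? then_branch.call(x) : else_branch.call(x)
```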
| byroot wrote:
| As @xerxes901 said, there are some major challenges in
| freeing just one method's code: it's not necessarily
| contiguous, and it's also of very variable size, so freeing
| it piecemeal would generate lots of fragmentation. The
| allocator would need to be much more complex to compensate.
|
| But the team's reasoning is that compilation isn't that
| slow, and while the code is freed, the statistics that drive
| the compilation are kept, so most of the work is already
| done.
|
| Also, the assumption behind code GC is that applications may
| experience a "phase change": e.g. the hottest code paths at
| time t1 may not be so hot at time t2. If this is true, then
| it can be advantageous to recompile the hottest paths once
| in a while.
|
| But that assumption is a major subject of debate between
| myself and the YJIT team, which is why I requested a
| `--yjit-disable-code-gc` flag for experimentation, and in
| 3.3 code GC will actually be disabled by default.
| Freaky wrote:
| > We were very generous in terms of warm-up time. Each benchmark
| was run for 1000 iterations, and the first half of all the
| iterations were discarded as warm-up time, giving each JIT a more
| than fair chance to reach peak performance.
|
| 1,000 iterations isn't remotely generous for JRuby, unfortunately
| - the JVM's Tier-3 compilation only kicks in by default around
| 2,000, and full Tier-4 is only considered beyond 15,000. I've observed
| this to have quite a substantial effect, for instance bringing
| manticore (JRuby wrapper for Apache's Java HttpClient) down from
| merely "okay" performance after 10,000 requests to pretty much
| matching the curb C extension under MRI after 20,000.
|
| You can tweak it to be more aggressive, but I guess this puts
| more pressure on the compiler threads and their memory use, while
| reducing the run-time profiling data they use to optimize most
| effectively. It perhaps also risks more churn from
| deoptimization. I kind of felt like I'd be better off trying to
| formalise the warmup process.
|
| It's rather a shame that all this warmup work is one-shot. It
| would be far less obnoxious if it could be preserved across runs
| - I believe some alternative Java runtimes support something like
| that, though given JRuby's got its own JIT targetting Java
| bytecode I dare say it would require work there as well.
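| For reference, the warm-up scheme the article describes (run N
| iterations, discard the first half, measure the rest) is easy to
| replicate in plain Ruby. Iteration counts here are illustrative,
| not the paper's exact harness:

```ruby
# Run a block `iterations` times, discard the first half as warm-up,
# and return seconds elapsed over the measured second half.
def bench(iterations: 1000, &blk)
  warmup = iterations / 2
  warmup.times { blk.call }              # discarded: JIT warm-up
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  (iterations - warmup).times { blk.call }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

elapsed = bench(iterations: 100) { (1..1000).reduce(:+) }
```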
| maxime_cb wrote:
| It is enough iterations for these VMs to warm up on the
| benchmarks we've looked at, but the warm-up time is still on
| the order of minutes on some benchmarks, which is impractical
| for many applications.
___________________________________________________________________
(page generated 2023-11-14 23:00 UTC)