[HN Gopher] YJIT is the most memory-efficient Ruby JIT
       ___________________________________________________________________
        
       YJIT is the most memory-efficient Ruby JIT
        
       Author : panic
       Score  : 111 points
       Date   : 2023-11-14 16:46 UTC (6 hours ago)
        
 (HTM) web link (railsatscale.com)
 (TXT) w3m dump (railsatscale.com)
        
       | JohnBooty wrote:
       | Wow, Shopify continues to make some heroic improvements here.
       | Kudos, kudos, kudos. Thanks, Shopify folks.
       | 
       | One thing I didn't see discussed in the article was YJIT's memory
       | usage relative to CRuby, the baseline non-JIT version of Ruby. It
       | is certainly possible I missed it; that's been known to happen!
       | 
       | Anyway, the news there is very good. We can see detailed
       | information here:
       | 
       | https://speed.yjit.org/memory_timeline.html#railsbench
       | 
       | Currently Railsbench consumes a peak of ~95MB with CRuby, and a
       | peak of ~110MB with YJIT. So, YJIT delivers 70% more performance
       | while consuming 16% more RAM here. That is a tradeoff I think
       | most people would gladly accept in most scenarios. =)
       | 
       | Real-world speedups will be less, since a "real" web application
       | spends much of its time waiting for the database and other
       | external resources. As the article notes, Shopify's real-world
       | observed storefront perf gain is 27.2%.
       | 
       | YJIT is a success and its future is even brighter.
        
       | compumike wrote:
        | Also, a practical tip on YJIT memory usage: there is a
        | "--yjit-exec-mem-size" option; see
        | https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md#co...
        | for details. (This command-line argument is mentioned in the
        | paper https://dl.acm.org/doi/10.1145/3617651.3622982 but not in
        | this blog post about the paper.)
       | 
        | At Heii On-Call https://heiioncall.com/ we use:
        | 
        |     ENV RUBY_YJIT_ENABLE=1
        |     ENV RUBYOPT=--yjit-exec-mem-size=16
        | 
        | in our Dockerfile for our Rails processes.
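        | 
        | (A quick way to sanity-check that YJIT actually came on, using
        | the standard RubyVM::YJIT.enabled? introspection; assumes Ruby
        | 3.2+:)
        | 
        |     $ RUBY_YJIT_ENABLE=1 ruby -e 'puts RubyVM::YJIT.enabled?'
        |     true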
        
         | booleanbetrayal wrote:
         | Any recollection on how you arrived at the --yjit-exec-mem-size
         | value? We've been running YJIT in production for some time, but
         | haven't looked into tuning this at all.
        
           | JohnBooty wrote:
           | Not parent poster and do not have production YJIT experience.
           | =)
           | 
           | My guess is that you would monitor
           | `RubyVM::YJIT.runtime_stats[:code_region_size]` and/or
           | `RubyVM::YJIT.runtime_stats[:code_gc_count]` so that you can
           | get a feel for a reasonable value for your application, as
           | well as know whether or not the "code GC" is running
           | frequently.
           | 
            | https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md#pe...
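            | 
            | Something like this rough sketch could surface those numbers
            | in a running app (the periodic-logging loop is my own idea,
            | not from the docs; assumes Ruby 3.2+ with YJIT enabled):
            | 
            |     # Log YJIT code-size stats once a minute so you can pick
            |     # a sensible --yjit-exec-mem-size for your app.
            |     Thread.new do
            |       loop do
            |         stats = RubyVM::YJIT.runtime_stats
            |         mb = stats[:code_region_size] / (1024.0 * 1024.0)
            |         puts "yjit code_region_size=#{mb.round(1)}MB " \
            |              "code_gc_count=#{stats[:code_gc_count]}"
            |         sleep 60
            |       end
            |     end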
        
             | compumike wrote:
             | That's exactly right. Our code_region_size levels off a bit
             | under 8 MB, so we set the limit to 16. In practice we see
             | code_gc_count stays at 0.
        
         | JohnBooty wrote:
          | Wow, that's interesting and it seems a little crazy? From the
          | docs:
          | 
          |     When JIT code size
          |     (RubyVM::YJIT.runtime_stats[:code_region_size]) reaches
          |     this value, YJIT triggers "code GC" that frees all JIT
          |     code and starts recompiling everything. Compiling code
          |     takes some time, so scheduling code GC too frequently
          |     slows down your application. Increasing
          |     --yjit-exec-mem-size may speed up your application if
          |     RubyVM::YJIT.runtime_stats[:code_gc_count] is not 0 or 1.
         | 
         | https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md#co...
         | 
          | It just dumps _all_ the JIT-compiled code? I'd expect to see
          | some kind of heuristic or algorithm there... LFU or something.
         | 
         | The internals of a JIT are essentially black magic to me, and I
         | know the people working on YJIT are super talented, so I am
         | sure there is a good reason why they just dump everything
          | instead of the least-frequently used stuff. Maybe the overhead
          | of tracking frecency outweighs the gains, maybe they just
          | haven't implemented it yet, or maybe it's just a rarely-reached
          | condition.
         | 
         | (I hope a YJIT team member sees this, I'm super curious now)
        
           | xerxes901 wrote:
            | I don't work on YJIT but I _think_ I know the (or maybe an)
            | answer to this. The code for a JIT'd Ruby method isn't
            | contiguous in one location in memory. When a Ruby method is
            | first compiled, a straight-line path through the method is
            | emitted, and branches are emitted as stub code. When a stub
            | is hit, that branch is then compiled incrementally. I
            | believe this is called "lazy basic block versioning".
           | 
           | When the stub is hit the code that gets generated is
           | somewhere _else_ in executable memory, not contiguous with
           | the original bit of the method. Because these "lazy basic
           | blocks" are actually quite small, the bookkeeping involved in
           | "where is the code for this ruby method" would actually be an
           | appreciable fraction of the code size itself. Plus you then
           | have to do more bookkeeping to make sure the method you want
           | to GC isn't referred to by the generated code in another
           | method.
           | 
           | Since low memory usage is an important YJIT goal, I guess
           | this tradeoff isn't worth it.
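            | 
            | To make the stub idea concrete, here's a purely illustrative
            | Ruby method (the method and its receiver are made up, not
            | from YJIT's docs):
            | 
            |     # With lazy basic block versioning, YJIT initially
            |     # compiles only the path a call actually takes; the
            |     # other arm of the branch stays a cheap stub and is
            |     # compiled (elsewhere in executable memory) only when
            |     # some later call hits it.
            |     def shipping_cost(order)
            |       if order.express?
            |         order.weight * 2.5   # compiled on the first express call
            |       else
            |         order.weight * 1.0   # stub until a non-express call
            |       end
            |     end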
           | 
           | Maybe someone who knows this better will come along and
           | correct me :)
        
           | byroot wrote:
            | As @xerxes901 said, there are major challenges in freeing
            | just one method's code: it's not necessarily contiguous, and
            | its size varies a lot, so freeing it piecemeal would cause
            | lots of fragmentation. The allocator would need to be much
            | more complex to compensate.
            | 
            | But the team's reasoning is that compilation isn't that
            | slow, and while the code is freed, the statistics that drive
            | compilation are kept, so most of the work is already done.
            | 
            | Also, the assumption behind code GC is that applications may
            | experience a "phase change", e.g. the hottest code paths at
            | time t1 may not be so hot at time t2. If that's true, then
            | it can be advantageous to recompile the hottest paths once
            | in a while.
            | 
            | But that assumption is a major subject of debate between me
            | and the YJIT team, which is why I requested a `--yjit-
            | disable-code-gc` flag for experimentation, and in 3.3 code
            | GC will actually be disabled by default.
        
       | Freaky wrote:
       | > We were very generous in terms of warm-up time. Each benchmark
       | was run for 1000 iterations, and the first half of all the
       | iterations were discarded as warm-up time, giving each JIT a more
       | than fair chance to reach peak performance.
       | 
        | 1,000 iterations isn't remotely generous for JRuby,
        | unfortunately: the JVM's tier 3 compilation only kicks in by
        | default around 2,000 invocations, and full tier 4 is only
        | considered beyond 15,000. I've observed this to have quite a
        | substantial effect, for instance taking manticore (the JRuby
        | wrapper for Apache's Java HttpClient) from merely "okay"
        | performance after 10,000 requests to pretty much matching the
        | curb C extension under MRI after 20,000.
       | 
       | You can tweak it to be more aggressive, but I guess this puts
       | more pressure on the compiler threads and their memory use, while
       | reducing the run-time profiling data they use to optimize most
       | effectively. It perhaps also risks more churn from
       | deoptimization. I kind of felt like I'd be better off trying to
       | formalise the warmup process.
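        | 
        | For instance, lowering the standard HotSpot tiered-compilation
        | thresholds via JRuby's -J passthrough (the values here are
        | illustrative, not a recommendation):
        | 
        |     # defaults are roughly 2,000 (tier 3) and 15,000 (tier 4)
        |     jruby -J-XX:Tier3CompileThreshold=500 \
        |           -J-XX:Tier4CompileThreshold=2000 app.rb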
       | 
       | It's rather a shame that all this warmup work is one-shot. It
        | would be far less obnoxious if it could be preserved across
        | runs; I believe some alternative Java runtimes support something
        | like that, though given that JRuby has its own JIT targeting
        | Java bytecode, I dare say it would require work there as well.
        
         | maxime_cb wrote:
         | It is enough iterations for these VMs to warm up on the
         | benchmarks we've looked at, but the warm-up time is still on
         | the order of minutes on some benchmarks, which is impractical
         | for many applications.
        
       ___________________________________________________________________
       (page generated 2023-11-14 23:00 UTC)