[HN Gopher] Understanding every byte in a WASM module
       ___________________________________________________________________
        
       Understanding every byte in a WASM module
        
       Author : hasheddan
       Score  : 143 points
       Date   : 2023-12-23 13:41 UTC (9 hours ago)
        
 (HTM) web link (danielmangum.com)
 (TXT) w3m dump (danielmangum.com)
        
       | rockwotj wrote:
       | This is pretty great! I highly recommend building a parser for
       | Wasm, it shows off the nice design that it can be compiled in a
       | streaming fashion and other nice properties.
       | 
       | I built a small parser for a wasm runtime I was going to hack at,
       | although that lost steam and now I just contribute to wasmtime,
       | but it was a great way to learn. The source if you're interested
       | in C++20 and coroutines: https://github.com/rockwotj/wasmcc
        
         | amelius wrote:
         | > the nice design
         | 
         | But integers are read 7 bits at a time. Wouldn't it be much
         | easier to use 8 bits?
        
           | teraflop wrote:
           | If you use all 8 bits of a byte for data, you need a separate
           | way to store the length, or you need to make everything
           | fixed-width which wastes space.
           | 
           | "Varints" or variable-length integers, with 7 bits of data
           | and 1 continuation bit per byte, are used in a fairly wide
           | variety of formats (e.g. Protobuf, SQLite and Lucene).
           | They're compact when most of your integers are small, while
           | still being quite simple and fast to encode/decode.
           | 
           | Faster alternatives tend to be more complicated, and more
           | compact alternatives tend to be slower.
           | 
           | See https://en.wikipedia.org/wiki/Variable-length_quantity
        
             | joquarky wrote:
             | I wonder if something resembling this could have been
             | better than how IPv6 attempts to solve the address space
             | problem.
        
           | svat wrote:
           | This is basically the design of UTF-8 or varint encoding. The
           | integer's representation is read one byte (8 bits) at a time,
           | which can then be interpreted by looking at the first bit
           | which says whether the integer is "done" or more bytes need
           | to be read.
           | 
           | (If we were using all 8 bits for the representation and an
           | additional leading bit for whether it continues, we'd end up
           | having to read 9 bits or something, which is awkward and not
           | aligned with what the machine actually supports reading.)
        
             | flohofwoe wrote:
             | UTF-8 is a bit more complex than that.
             | 
             | The first byte in a sequence basically tells you how many
             | bytes are following by looking at the top-most bits.
             | 0xxxxxxx => no followup bytes, just the 7-bit ASCII code
             | 110xxxxx => 1 byte follows         1110xxxx => 2 bytes
             | follow         11110xxx => 3 bytes follow
             | 
             | The followup bytes then always have the pattern 10xxxxxx.
             | 
             | E.g. the only bytes that have bit 7 cleared are the single-
             | byte 7-bit ASCII character codes.
        
               | Mindless2112 wrote:
               | A UTF-8-like encoding would probably be better than VLQ
               | encoding because the length is known up-front, which
               | allows for the decoder to have fewer branches.
               | 
               | Something like [1] or alternatively [2]. See also [3].
               | 
               | [1] https://en.wikipedia.org/wiki/UTF-8#FSS-UTF [2]
               | https://sqlite.org/src4/doc/trunk/www/varint.wiki [3]
               | https://news.ycombinator.com/item?id=11263378
        
           | rockwotj wrote:
           | It is a balance between simplicity, speed and size. They
           | choose size, honesty leb128 isn't *too* complex (you can see
           | an implementation in my repo I pointed too). But yes they
           | don't take all the simplicity tradeoffs.
        
             | willtemperley wrote:
             | Protobuf seems to use a very similar encoding.
        
           | xjay wrote:
           | In this case, a 64-bit little-endian unsigned integer Ought
           | to be enough for anyone. [1]
           | 
           | Using variable-length "micro-compression" schemes like these
           | merges concerns [2], add processing overhead, and lose out on
           | the simplicity and efficiency of working on the data directly
           | in memory, maybe even reusing the same memory/cache line
           | after checking identifiers, etc.
           | 
           | The 7-bit VLQ scheme [3] makes sense for MIDI sequences
           | (1980s), as they were stored on MIDI hardware with limited
           | memory, and had basic processing needs.
           | 
           | [1] https://quoteinvestigator.com/2011/09/08/640k-enough/
           | 
           | [2] https://en.wikipedia.org/wiki/Separation_of_concerns
           | 
           | [3] https://en.wikipedia.org/wiki/Variable-length_quantity
        
             | monocasa wrote:
             | Eh, I don't think the .wasm file is intended to be an in
             | memory format. For an on wire/storage format, variable
             | length integers like protobuf uses is a perfectly valid
             | choice.
        
               | dataangel wrote:
               | It's actually the protobuf design mistake that most
               | adversely affects decoding performance. The original
               | creator who went on to make Cap'N Proto makes the
               | excellent point that it's better to leave the extra bytes
               | and then just slap a compressor that only works on runs
               | of 0 bytes on top. Faster for the same savings. There are
               | also newer schemes from Daniel Lemire and others more
               | amenable to SIMD decoding.
        
               | monocasa wrote:
               | Sure, but that decoding speed penalty doesn't stop
               | protobuf (and similar encodings using leb128 like thrift)
               | from being some of the most used binary encodings on the
               | planet, even in relatively low latency spaces.
               | 
               | Additionally, Cap'N Proto RLE encodes the bytestream,
               | which is pretty similar to the overhead of LEB128
               | integers.
               | 
               | And on top of all of that, .wasm isn't intended to be an
               | IPC format, so the fact that it's choices are typical of
               | that space (despite not being leading edge) reflect very
               | well for the space it's intended to be in.
        
           | kevingadd wrote:
           | It would be. The use of varints in wasm complicates encoding
           | and decoding and introduces ambiguity into the format for
           | cases like v128 constants (which despite technically being
           | 128-bit integers, are not embedded using leb128.) It also
           | makes it harder to write a single-pass compiler.
           | 
           | It also isn't much of a wire size improvement, since the
           | overall format design assumes you will compress it using
           | brotli (or gzip, I guess) - there are lots of opportunities
           | to make the format smaller or arrange it in a way that makes
           | it more compressible, and the decision was made not to do
           | those things since brotli is _really good_ at compressing
           | data.
           | 
           | I think the people who took control of the binary format just
           | really liked varints for some reason, so we ended up stuck
           | with them. It's at least not particularly harmful, it's just
           | a small waste of everyone's time. Thankfully, my
           | understanding is that in practice most modules will only get
           | decoded once or twice before the compiled code is cached by
           | your browser.
        
       | danielvaughn wrote:
       | A while back I ran through some exercises that had me writing WAT
       | directly. It's a great way to understand what WASM is doing, and
       | is surprisingly easy to pick up (though of course you wouldn't
       | want to write an actual program with it). It's just system calls
       | with s-expressions.
        
         | eliben wrote:
         | WAT is a fun format! I'm curating some readable samples here --
         | https://github.com/eliben/wasm-wat-samples/
        
           | danielvaughn wrote:
           | Bookmarking that for later, thanks for sharing!
        
           | xscott wrote:
           | I already had that bookmarked from using it several months
           | ago to try and figure things out. It's been very helpful.
           | Thank you for creating it!
        
           | Jasper_ wrote:
           | I have a hand-written BF interpreter in WAT, along with hand-
           | written asm.js for comparison, over here
           | https://github.com/magcius/bfasm
        
         | lioeters wrote:
         | For some time I've been fascinated by the codebase of a small
         | WAT compiler written in JavaScript.
         | 
         | https://github.com/stagas/wat-compiler/blob/main/story.txt
         | 
         | And I mean "small" as a real complement to how readable and
         | compact the entire compiler is. It's been a great way to
         | appreciate the design of the WASM text format and WASM overall.
         | It's not a Lisp but has a similar feel to it.
         | 
         | I've been meaning to get more fluent at writing WAT directly,
         | not for any practical purpose but just for pleasure of it. I
         | could see myself gradually building up some abstractions to
         | help me deveolp larger programs, perhaps a slightly higher-
         | level language.
        
           | danielvaughn wrote:
           | Very cool. You might be interested in AssemblyScript, if you
           | haven't heard of them before. They did something very
           | similar: https://www.assemblyscript.org/
        
             | lioeters wrote:
             | Ah yes, I'd love to get more familiar with AssemblyScript
             | too. One thing I like about the afore-mentioned WAT
             | compiler though, is that it's simple to run in the browser,
             | which the AssemblyScript compiler doesn't support, if I
             | understand correctly. That characteristic of the compiler
             | would be useful if I want to build a higher-level language
             | on top of it, for example with a web playground (code
             | editor and runtime).
        
       | Hackbraten wrote:
       | I really appreciate this article. I've noticed that modern
       | streaming web players have started using WASM as part of their
       | obfuscation toolbox. Understanding WASM modules is important for
       | reverse engineering.
        
         | flohofwoe wrote:
         | The browser devtools can give you a disassembled view of the
         | WebAssembly byte code (basically the wasm2wat output) and also
         | allow to set breakpoints and debug-step through the
         | disassembly.
         | 
         | Since WASM is a "structured-programming ISA" with code blocks
         | that can be intended, it's actually quite convenient.
        
           | Hackbraten wrote:
           | Thanks for the pointer!
           | 
           | The WASM payload itself appears to be obfuscated. The player
           | seems to deobfuscate it at runtime, dynamically load it as a
           | WASM module, execute it and then destroy it again. I'm having
           | a hard time pinpointing the exact moment where the WASM is
           | being loaded (because the JS that performs all those things
           | is horribly obfuscated, too.)
           | 
           | The disassembled view in the dev tools might come in handy to
           | tackle this. My goal is to obtain their secret HMAC key so I
           | can change "Linux" into whatever the server prefers to hear,
           | and re-sign the HMAC for that manipulated HTTP request before
           | the player sends it.
           | 
           | (All that effort so I finally get to watch my Bundesliga
           | streams again, for which I'm paying them 30 EUR a month.)
        
             | hoten wrote:
             | Ctrl+f for the WASM compile function (or maybe replace it
             | with a wrapper function that has a debugger statement),
             | breakpoint there, step over, then look in the Sources
             | panel. You'll find the WAT for your secret wasm then.
             | 
             | It may even still be in Sources panel post-deletion, in
             | which case no need for any of that.
        
       | evmar wrote:
       | If you're looking into a .wasm file I wrote an interactive viewer
       | called weave[1].
       | 
       | Here is a demo with some Rust output:
       | https://evmar.github.io/weave/?wasm/rust.wasm be sure to click on
       | 'code' to dive into individual functions.
       | 
       | [1] https://github.com/evmar/weave
        
         | captn3m0 wrote:
         | There's also https://wasdk.github.io/wasmcodeexplorer/ which
         | highlights the bytes.
        
       | pton_xd wrote:
       | This seems sort of like understanding machine code vs assembly;
       | it's much easier to learn WAT and translate to/from WASM as
       | necessary using the wabt tools [0].
       | 
       | Either way its super cool how simple WebAssembly is, you can
       | really get your hands dirty and understand exactly every detail
       | of how your program runs!
       | 
       | [0] https://github.com/WebAssembly/wabt
        
       | marianoguerra wrote:
       | If you are interested in this type of approach you may like
       | https://wasmgroundup.com/
       | 
       | > This book takes a hands-on, bottoms-up approach: you'll go from
       | hand crafting bytecodes to writing a real compiler for a simple
       | programming language.
       | 
       | disclaimer: I'm the co-author
        
       ___________________________________________________________________
       (page generated 2023-12-23 23:00 UTC)