[HN Gopher] Understanding every byte in a WASM module
___________________________________________________________________
Understanding every byte in a WASM module
Author : hasheddan
Score : 143 points
Date : 2023-12-23 13:41 UTC (9 hours ago)
(HTM) web link (danielmangum.com)
(TXT) w3m dump (danielmangum.com)
| rockwotj wrote:
| This is pretty great! I highly recommend building a parser for
| Wasm, it shows off the nice design that it can be compiled in a
| streaming fashion and other nice properties.
|
| I built a small parser for a wasm runtime I was going to hack at,
| although that lost steam and now I just contribute to wasmtime,
| but it was a great way to learn. The source if you're interested
| in C++20 and coroutines: https://github.com/rockwotj/wasmcc
| amelius wrote:
| > the nice design
|
| But integers are read 7 bits at a time. Wouldn't it be much
| easier to use 8 bits?
| teraflop wrote:
| If you use all 8 bits of a byte for data, you need a separate
| way to store the length, or you need to make everything
| fixed-width which wastes space.
|
| "Varints" or variable-length integers, with 7 bits of data
| and 1 continuation bit per byte, are used in a fairly wide
| variety of formats (e.g. Protobuf, SQLite and Lucene).
| They're compact when most of your integers are small, while
| still being quite simple and fast to encode/decode.
|
| Faster alternatives tend to be more complicated, and more
| compact alternatives tend to be slower.
|
| See https://en.wikipedia.org/wiki/Variable-length_quantity
| joquarky wrote:
| I wonder if something resembling this could have been
| better than how IPv6 attempts to solve the address space
| problem.
| svat wrote:
| This is basically the design of UTF-8 or varint encoding. The
| integer's representation is read one byte (8 bits) at a time,
| which can then be interpreted by looking at the first bit
| which says whether the integer is "done" or more bytes need
| to be read.
|
| (If we were using all 8 bits for the representation and an
| additional leading bit for whether it continues, we'd end up
| having to read 9 bits or something, which is awkward and not
| aligned with what the machine actually supports reading.)
| flohofwoe wrote:
| UTF-8 is a bit more complex than that.
|
| The first byte in a sequence basically tells you how many
| bytes are following by looking at the top-most bits.
| 0xxxxxxx => no followup bytes, just the 7-bit ASCII code
| 110xxxxx => 1 byte follows 1110xxxx => 2 bytes
| follow 11110xxx => 3 bytes follow
|
| The followup bytes then always have the pattern 10xxxxxx.
|
| E.g. the only bytes that have bit 7 cleared are the single-
| byte 7-bit ASCII character codes.
| Mindless2112 wrote:
| A UTF-8-like encoding would probably be better than VLQ
| encoding because the length is known up-front, which
| allows for the decoder to have fewer branches.
|
| Something like [1] or alternatively [2]. See also [3].
|
| [1] https://en.wikipedia.org/wiki/UTF-8#FSS-UTF [2]
| https://sqlite.org/src4/doc/trunk/www/varint.wiki [3]
| https://news.ycombinator.com/item?id=11263378
| rockwotj wrote:
| It is a balance between simplicity, speed and size. They
| choose size, honesty leb128 isn't *too* complex (you can see
| an implementation in my repo I pointed too). But yes they
| don't take all the simplicity tradeoffs.
| willtemperley wrote:
| Protobuf seems to use a very similar encoding.
| xjay wrote:
| In this case, a 64-bit little-endian unsigned integer Ought
| to be enough for anyone. [1]
|
| Using variable-length "micro-compression" schemes like these
| merges concerns [2], add processing overhead, and lose out on
| the simplicity and efficiency of working on the data directly
| in memory, maybe even reusing the same memory/cache line
| after checking identifiers, etc.
|
| The 7-bit VLQ scheme [3] makes sense for MIDI sequences
| (1980s), as they were stored on MIDI hardware with limited
| memory, and had basic processing needs.
|
| [1] https://quoteinvestigator.com/2011/09/08/640k-enough/
|
| [2] https://en.wikipedia.org/wiki/Separation_of_concerns
|
| [3] https://en.wikipedia.org/wiki/Variable-length_quantity
| monocasa wrote:
| Eh, I don't think the .wasm file is intended to be an in
| memory format. For an on wire/storage format, variable
| length integers like protobuf uses is a perfectly valid
| choice.
| dataangel wrote:
| It's actually the protobuf design mistake that most
| adversely affects decoding performance. The original
| creator who went on to make Cap'N Proto makes the
| excellent point that it's better to leave the extra bytes
| and then just slap a compressor that only works on runs
| of 0 bytes on top. Faster for the same savings. There are
| also newer schemes from Daniel Lemire and others more
| amenable to SIMD decoding.
| monocasa wrote:
| Sure, but that decoding speed penalty doesn't stop
| protobuf (and similar encodings using leb128 like thrift)
| from being some of the most used binary encodings on the
| planet, even in relatively low latency spaces.
|
| Additionally, Cap'N Proto RLE encodes the bytestream,
| which is pretty similar to the overhead of LEB128
| integers.
|
| And on top of all of that, .wasm isn't intended to be an
| IPC format, so the fact that it's choices are typical of
| that space (despite not being leading edge) reflect very
| well for the space it's intended to be in.
| kevingadd wrote:
| It would be. The use of varints in wasm complicates encoding
| and decoding and introduces ambiguity into the format for
| cases like v128 constants (which despite technically being
| 128-bit integers, are not embedded using leb128.) It also
| makes it harder to write a single-pass compiler.
|
| It also isn't much of a wire size improvement, since the
| overall format design assumes you will compress it using
| brotli (or gzip, I guess) - there are lots of opportunities
| to make the format smaller or arrange it in a way that makes
| it more compressible, and the decision was made not to do
| those things since brotli is _really good_ at compressing
| data.
|
| I think the people who took control of the binary format just
| really liked varints for some reason, so we ended up stuck
| with them. It's at least not particularly harmful, it's just
| a small waste of everyone's time. Thankfully, my
| understanding is that in practice most modules will only get
| decoded once or twice before the compiled code is cached by
| your browser.
| danielvaughn wrote:
| A while back I ran through some exercises that had me writing WAT
| directly. It's a great way to understand what WASM is doing, and
| is surprisingly easy to pick up (though of course you wouldn't
| want to write an actual program with it). It's just system calls
| with s-expressions.
| eliben wrote:
| WAT is a fun format! I'm curating some readable samples here --
| https://github.com/eliben/wasm-wat-samples/
| danielvaughn wrote:
| Bookmarking that for later, thanks for sharing!
| xscott wrote:
| I already had that bookmarked from using it several months
| ago to try and figure things out. It's been very helpful.
| Thank you for creating it!
| Jasper_ wrote:
| I have a hand-written BF interpreter in WAT, along with hand-
| written asm.js for comparison, over here
| https://github.com/magcius/bfasm
| lioeters wrote:
| For some time I've been fascinated by the codebase of a small
| WAT compiler written in JavaScript.
|
| https://github.com/stagas/wat-compiler/blob/main/story.txt
|
| And I mean "small" as a real complement to how readable and
| compact the entire compiler is. It's been a great way to
| appreciate the design of the WASM text format and WASM overall.
| It's not a Lisp but has a similar feel to it.
|
| I've been meaning to get more fluent at writing WAT directly,
| not for any practical purpose but just for pleasure of it. I
| could see myself gradually building up some abstractions to
| help me deveolp larger programs, perhaps a slightly higher-
| level language.
| danielvaughn wrote:
| Very cool. You might be interested in AssemblyScript, if you
| haven't heard of them before. They did something very
| similar: https://www.assemblyscript.org/
| lioeters wrote:
| Ah yes, I'd love to get more familiar with AssemblyScript
| too. One thing I like about the afore-mentioned WAT
| compiler though, is that it's simple to run in the browser,
| which the AssemblyScript compiler doesn't support, if I
| understand correctly. That characteristic of the compiler
| would be useful if I want to build a higher-level language
| on top of it, for example with a web playground (code
| editor and runtime).
| Hackbraten wrote:
| I really appreciate this article. I've noticed that modern
| streaming web players have started using WASM as part of their
| obfuscation toolbox. Understanding WASM modules is important for
| reverse engineering.
| flohofwoe wrote:
| The browser devtools can give you a disassembled view of the
| WebAssembly byte code (basically the wasm2wat output) and also
| allow to set breakpoints and debug-step through the
| disassembly.
|
| Since WASM is a "structured-programming ISA" with code blocks
| that can be intended, it's actually quite convenient.
| Hackbraten wrote:
| Thanks for the pointer!
|
| The WASM payload itself appears to be obfuscated. The player
| seems to deobfuscate it at runtime, dynamically load it as a
| WASM module, execute it and then destroy it again. I'm having
| a hard time pinpointing the exact moment where the WASM is
| being loaded (because the JS that performs all those things
| is horribly obfuscated, too.)
|
| The disassembled view in the dev tools might come in handy to
| tackle this. My goal is to obtain their secret HMAC key so I
| can change "Linux" into whatever the server prefers to hear,
| and re-sign the HMAC for that manipulated HTTP request before
| the player sends it.
|
| (All that effort so I finally get to watch my Bundesliga
| streams again, for which I'm paying them 30 EUR a month.)
| hoten wrote:
| Ctrl+f for the WASM compile function (or maybe replace it
| with a wrapper function that has a debugger statement),
| breakpoint there, step over, then look in the Sources
| panel. You'll find the WAT for your secret wasm then.
|
| It may even still be in Sources panel post-deletion, in
| which case no need for any of that.
| evmar wrote:
| If you're looking into a .wasm file I wrote an interactive viewer
| called weave[1].
|
| Here is a demo with some Rust output:
| https://evmar.github.io/weave/?wasm/rust.wasm be sure to click on
| 'code' to dive into individual functions.
|
| [1] https://github.com/evmar/weave
| captn3m0 wrote:
| There's also https://wasdk.github.io/wasmcodeexplorer/ which
| highlights the bytes.
| pton_xd wrote:
| This seems sort of like understanding machine code vs assembly;
| it's much easier to learn WAT and translate to/from WASM as
| necessary using the wabt tools [0].
|
| Either way its super cool how simple WebAssembly is, you can
| really get your hands dirty and understand exactly every detail
| of how your program runs!
|
| [0] https://github.com/WebAssembly/wabt
| marianoguerra wrote:
| If you are interested in this type of approach you may like
| https://wasmgroundup.com/
|
| > This book takes a hands-on, bottoms-up approach: you'll go from
| hand crafting bytecodes to writing a real compiler for a simple
| programming language.
|
| disclaimer: I'm the co-author
___________________________________________________________________
(page generated 2023-12-23 23:00 UTC)