[HN Gopher] Wuffs: Wrangling Untrusted File Formats Safely
       ___________________________________________________________________
        
       Wuffs: Wrangling Untrusted File Formats Safely
        
       Author : nequo
       Score  : 261 points
       Date   : 2024-05-16 13:48 UTC (2 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | dang wrote:
       | Related:
       | 
       |  _Wuffs the Language_ -
       | https://news.ycombinator.com/item?id=26731305 - April 2021 (75
       | comments)
       | 
       |  _Wuffs' PNG image decoder_ -
       | https://news.ycombinator.com/item?id=26714831 - April 2021 (138
       | comments)
        
       | newman314 wrote:
       | Does anyone know of a tool that can do this for PDFs instead?
        
         | warkdarrior wrote:
         | As soon as someone writes a Javascript interpreter in Wuffs..
        
           | jchw wrote:
           | Do you ever need a JS interpreter to _parse_ a PDF? That's
           | horrifying.
           | 
           | I understand PDF has a bunch of limbs, but I always assumed
           | the JS stuff was at least separate from the parsing. (I am
           | familiar with the PDF format at a lower level but I never
           | touched any of the weird features.)
        
             | timschmidt wrote:
             | I wrote an SVG that's all javascript, no elements. All the
             | graphics are generated dynamically at runtime by the
             | javascript. It's SVG standards compliant, but only opens
             | correctly in browsers, not in inkscape or other desktop
             | publishing apps.
             | 
             | I work a lot in OpenSCAD, and had a need to design some
             | custom graph paper. So I found the subset of SVG which was
             | similar to OpenSCAD. :)
        
               | jszymborski wrote:
               | Frankly, I wouldn't begrudge a website for not correctly
               | parsing an svg I composed entirely with javascript.
               | 
               | It's annoying you can't just "flatten" or "bake" such an
               | svg like yours into one composed entirely of elements
               | (unless one exists?)
        
               | csande17 wrote:
               | Often you can open the SVG in a browser and then use the
               | developer tools to copy out the resulting nodes as "flat"
               | SVG source code.
               | 
               | Chrome even includes a --dump-dom flag you can use to do
               | this on the command line, although I haven't tested it
               | with an SVG.
        
               | jszymborski wrote:
               | Clever!
        
         | Joel_Mckay wrote:
         | There are PDF readers that do not support the scripting format
         | extensions.
         | 
         | Note this does not prevent unscrupulous companies abusing
         | dominant market positions to voluntarily embed machine and
         | serial hash watermarks.
         | 
         | To be clear: formats like pdf, ps, webp, svg, and tiff are so
         | badly implemented in some ecosystems... they can't _ever_ be
         | assumed safe input formats. Thus, at some point people need to
         | spin up an actual VM to transcode a "web" version, and scrub
         | each stage of the rendering pipeline like a virus or header
         | injection is already present.
         | 
         | "I never play where nice things are, and don't break things"
         | (Eliza Mowry Bliven, The Humanitarian Review, Volume 3, March,
         | 1905)
         | 
         | Cheers =3
        
           | tialaramex wrote:
           | I worked with TIFF pretty extensively, it's a mess but I
           | don't see why a WUFFS TIFF codec can't be fine. What makes
           | you say you need "an actual VM to transcode" a TIFF ?
        
             | Joel_Mckay wrote:
             | The complexity of the tiff and tga specifications makes it
             | nearly impossible to span all the edge-cases with unit-
             | tests. A VM can be in a known-state snapshot, process
             | pre/post signature logged/compared with a scripted
             | debugger, and binary input/output stripped of non-compliant
             | metadata/blobs at each stage of the pipeline if the process
             | behaves as expected.
             | 
             | I've yet to find a better method than Honeypots to
             | sustainably mitigate the complex leaky dependency mess on
             | traditional architectures. It has been my experience that
             | "all software is terrible, but some of it is useful".
             | 
             | It may just be my bias, but I see code smell getting worse
             | in recent decades...
             | 
             | Have a nice day, =3
             | 
             | https://www.youtube.com/watch?v=aCbfMkh940Q
        
               | tialaramex wrote:
               | So, there's actually no particular reason and if somebody
               | cares to write one then yup, TIFF codec in WUFFS would in
               | fact be safer and faster than your uh, approach.
        
               | Joel_Mckay wrote:
               | One does not rely on the persistent competence of the
               | coders, and will tell you when something has gone wrong.
               | 
               | And walking a binary object store to ban problem users is
               | not always necessary... depending what you are doing.
               | 
               | Most other approaches make the same predictable
               | assumptions:
               | 
               | https://en.wikipedia.org/wiki/List_of_cognitive_biases
               | 
               | Despite popular belief, shitty design does not usually
               | get better in another language. Rather, people just feel
               | more confident it isn't shit anymore.
               | 
               | I have yet to see evidence to the contrary. =3
        
               | tialaramex wrote:
               | Wait, you believe that somehow one of these approaches
               | doesn't rely on competence from programmers? How do you
               | figure?
               | 
               | Have you been imagining that sandboxes are some sort of
               | fairy dust we just stumbled onto one day, supernatural in
               | nature and not, in fact, just software written by people
               | you're hoping are competent and haven't left any holes?
        
               | Joel_Mckay wrote:
               | The point was... one is testing parser/OS integrity via a
               | debugging interface over an expectation of an unchanging
               | emulated environment state... there is nothing
               | particularly special about the approach. Even Qubes OS
               | and RancherVM are not perfect in this regard, friend.
               | 
               | Or put another way, the available attack surface of a
               | bare-minimum fixed environment is much easier to auto-
               | audit, than a pile of daily permuted binaries and self-
               | delusion approach. i.e. if it fails to behave in an
               | expected way, or is modified in any way... the host audit
               | process doesn't have to care why or how it is broken to
               | maintain a service queue as the guest is culled.
               | 
               | Perhaps I am wrong about exchanging 15% of raw
               | performance for reliability, but things can get
               | complicated with licenses and multiple OS specific
               | platforms.
               | 
               | You seem to be getting emotional about this subject,
               | presenting secondary and tertiary straw-man arguments. So
               | I'm going to go eat some Cheese Goldfish crackers... and
               | just agree that your beliefs are interesting.
               | 
               | Have a fantastic weekend... =3
        
               | tialaramex wrote:
               | There's nothing special about it, but it doesn't work
               | especially well. This is the strategy that's blown up on
               | Apple twice in recent years and will keep burning them.
               | 
               | If you're Matt Godbolt the benefits of sandboxing
               | outweigh the cost because Matt is interested in general
               | purpose software. But WUFFS isn't for that, as its name
               | says it's interested in doing one particular task well.
               | 
               | In this deliberately limited domain, WUFFS gets to
               | sidestep Rice's theorem altogether and just prove the
               | software meets the semantic requirements [technically you
               | do the proving, WUFFS just checks your work].
               | 
               | I hope you enjoyed your goldfish crackers but I urge you
               | to use the right tool for the job.
        
               | Joel_Mckay wrote:
               | "the right tool for the job" is sometimes admitting the
               | breadth of underlying dependencies and ambiguous format
               | specifications are unfeasible to fix with your team's time
               | budget.
               | 
               | The design in question currently only processes around
               | 1.8M large image files a day, and does not require
               | additional work/re-implementations to support the dozens
               | of questionable user file-formats. i.e. the plain old
               | ImageMagick lib does most of the heavy lifting at the
               | end.
               | 
               | Would I trust such a solution for something like a native
               | client side web-browser etc... absolutely not... but for
               | the core-bound instance overhead, the resource cost was
               | acceptable for almost a decade of uptime on those system
               | instances.
               | 
               | Use-cases are funny like that, as there is no perfect
               | solution... but rather a tradeoff of what features get
               | the system functional and reliable. Part of that is
               | admitting integration of 3rd party dependencies is a
               | long-term liability, and domain specific languages almost
               | always fade into obscurity.
               | 
               | Cheers, =3
        
           | immibis wrote:
           | WUFFS is provably safe - that's the whole schtick. If a WUFFS
           | kernel exists, you can assume it is safe. If it's not proven
           | safe, it doesn't compile. The reason everyone doesn't program
           | in WUFFS is that you have to write a proof that your kernel
           | is safe, which takes a very very very long time.
        
             | indolering wrote:
             | What's the formal verification story for WUFFS?
        
               | Joel_Mckay wrote:
               | If you point out some of the above has run-state in some
               | situations... it is provably nondeterministic... and thus
               | the assertion of correctness is utter nonsense.
               | 
               | Hardly a panacea for fundamentally bad designs that go
               | back decades.
               | 
               | Ever seen a web-server written in postscript? It's worth a
               | look just for the laughs.
               | 
               | Good luck out there =)
        
               | tialaramex wrote:
               | For WUFFS the language, or for WUFFS the library, or for
               | the WUFFS tooling today?
               | 
               | The clever idea is to have you the programmer in effect
               | write a proof that your code has the desired semantic
               | properties as part of the programming activity and so
               | then the WUFFS transpiler is merely _checking_ that the
               | proof is correct.
               | 
               | This leverages your understanding of what you were trying
               | to do.
        
               | immibis wrote:
               | Apparently Wuffs only proves safety. Verifying the code
               | does what it's supposed to do is done with unit tests.
        
             | lupire wrote:
             | WUFFS is provably safe, or WUFFS programs are provably
             | safe, using WUFFS as an axiom?
        
         | ThePowerOfFuet wrote:
         | https://dangerzone.rocks/
        
         | trustno2 wrote:
         | pdfs are really really hard. the only viewer that parses them
         | semi-correctly is ... Acrobat Reader.
         | 
         | try to ever read any code for PDFs and see all the horrors.
         | 
         | Google gave up and just bought the code from foxit.
        
           | kjksf wrote:
           | Google was never trying to write PDF reader from scratch so
           | they never "gave up".
           | 
           | They just bought foxit code to save years of development when
           | they wanted to ship PDF reader in Chrome.
           | 
           | Your comment about "the only viewer that is semi-correct" is
           | also wildly off the mark.
           | 
           | Parsing correctly written PDF files is hard but multiple
           | engines can do it correctly.
           | 
           | Parsing real life PDFs is much harder than correctly
           | implementing PDF spec because lots of PDFs are just broken.
           | Their generators create invalid PDF files and then PDF readers
           | have to spend heroic efforts to somehow make sense of this
           | brokenness. Adobe does it better than most because... well it
           | would be embarrassing if they didn't. They invented the
           | format, they make money from their tools, they were doing it
           | the longest, they have the largest archive of broken PDFs for
           | testing etc. It's hard to expect that e.g. an open-source
           | project with one or two developers can match that.
           | 
           | I work on SumatraPDF so I know.
        
       | yjftsjthsd-h wrote:
       | This is one of my favorite attempts at better programming
       | language safety, because it compiles down to C that can then be
       | shipped like normal C, so you don't get the ecosystem friction
       | like with ex. Rust.
        
         | vlovich123 wrote:
         | It's an interesting idea for sure but it isn't a general
         | purpose language, so the problem domains it can solve are very
         | very different vs what Rust is trying to do.
        
           | tialaramex wrote:
           | Nigel has said that emitting "unsafe" Rust is a reasonable
           | thing for a hypothetical WUFFS 1.0 to be able to do as an
           | alternative to C. As with good "unsafe" Rust written by
           | humans WUFFS would know exactly why what it's doing is fine,
           | it's just that the Rust compiler can't necessarily see that,
           | hence the need to label it "unsafe".
           | 
           | Today C makes most sense given the WUFFS language is still in
           | flux.
           | 
           | [Edited to fix a serious typo]
        
             | nequo wrote:
             | What would be the primary benefit of emitting Rust rather
             | than C? Both would be considered safe (assuming Wuffs
             | generates correct code), and Rust could access the C code
             | via FFI. Is there something I'm missing?
        
               | vlovich123 wrote:
               | Nominally it can safely elide bounds checks via unsafe
               | that it has proved are actually safe within the
               | constraints of Wuffs, which is what it does for C (+ the
               | language is built for easier translation to vectorized
               | code than something like llvm is able to do for general
               | purpose languages).
               | 
               | So basically higher performance.
               | 
               | FFI nominally has a runtime and compile time cost -
               | whether that matters for you in particular will depend on
               | your needs, but being able to publish a very simple crate
               | without a build.rs to manage can have an attraction.
        
               | tialaramex wrote:
               | I expect that the Rust emitted by a hypothetical future
               | WUFFS transpiler would be _much_ easier to just drop into
               | an existing Rust project than some C via a C FFI.
               | 
               | It's common for C libraries that do get wrapped today
               | (e.g. openssl) to have a two phase wrapping, a -sys crate
               | which turns the C into Rust C FFI and then another crate
               | to turn the Rust C FFI into something actually palatable
               | to ordinary people.
        
               | jcranmer wrote:
               | The C abstract machine is slightly funkier than unsafe
               | Rust (things like C lacking a way to do signed integer
               | overflow without UB or needing to adhere to strict
               | aliasing in C), so I would expect that lowering to unsafe
               | Rust would be slightly more likely to be correct.
        
               | pcwalton wrote:
               | One benefit would be that Rust users could use Wuffs code
               | without having to install a C compiler. Pure-Rust
               | solutions are much more convenient in the Cargo ecosystem
               | than wrangling -sys crates.
        
               | edflsafoiewq wrote:
               | You could also use c2rust.
        
               | IshKebab wrote:
               | Probably not too much from a "final product" point of
               | view, but using a pure Rust library is a whole lot easier
               | than C from a faff point of view. Especially for cross-
               | compilation.
        
             | vlovich123 wrote:
             | I'm responding to this:
             | 
             | > that can then be shipped like normal C, so you don't get
             | the ecosystem friction like with ex. Rust.
             | 
             | Emitting Rust doesn't help with this.
        
               | tomjakubowski wrote:
               | it helps in the other direction: less friction to use
               | from rust
        
               | fiddlerwoaroof wrote:
               | But more friction to use from just about every other
               | language.
        
         | andrepd wrote:
         | What's the difference vs compiling down to machine code and
         | linking it with your program?
        
           | pornel wrote:
           | You reuse optimizer and machine code generator of the C
           | compiler, and you're not tied to a single backend like LLVM.
        
         | pcwalton wrote:
         | C has a lot of problems as a compilation target as well, from
         | surprising UB (e.g. signed integer overflow) to debugging
         | problems (e.g. #line is woefully inadequate compared to the
         | ability to emit DWARF DIEs) to the inconvenience of setting up
         | a toolchain for end users. To its credit, Wuffs is one of the
         | better projects that compiles to C, because it targets a very
         | restricted domain. But, in general, don't write programming
         | languages that compile to C.
        
           | yjftsjthsd-h wrote:
           | Of course C sucks, but since everything under the sun uses
           | it, there's unique value in being able to make it safer
           | without putting a whole new compiler in the process for
           | users. Remember that time the cryptography library in Python
           | decided to add rust? We could have avoided all that pain with
           | wuffs.
        
           | bobajeff wrote:
           | For many of us making a compile to c language is many times
           | more feasible than using something like llvm. I'm not saying
           | it's great mind you but it's probably the best thing
           | available without a runtime.
           | 
           | For debugging i believe you can generate your own source maps
           | and use gdb as a backend to talk with your custom debugger.
        
       | tedunangst wrote:
       | Related, in the sense of solving the same problem in a different
       | manner: https://rlbox.dev/
        
       | Ono-Sendai wrote:
       | Wuffs is great. I use it in Substrata (https://substrata.info/)
       | for loading PNGs. It is both faster and safer than LibPNG. It's
       | something around 2x faster than LibPNG in my tests (depending on
       | the PNG file), see timings here:
       | https://github.com/google/wuffs/issues/13#issuecomment-17325...
       | 
       | So generally Wuffs is great and you should use it to decode your
       | PNGs. There are some downsides: not all of the obscure bit depths
       | and formats that PNG supports are loaded as-is, some are
       | converted to more standard formats.
       | 
       | Also the Wuffs documentation is a bit hard to understand. It's a
       | little bit of a mission getting PNG decoding working. You can see
       | my code for that here though:
       | https://github.com/glaretechnologies/glare-core/blob/2c7174c...
        
         | repsilat wrote:
         | The "mango" lib [1] claims to be even faster for PNGs. Actively
         | maintained but doesn't have as much buzz, I think the devs
         | haven't advertised it as much on places like this.
         | 
         | Also, it has the funniest testimonials.
         | 
         | 1: https://github.com/t0rakka/mango
        
           | yjftsjthsd-h wrote:
           | Speed isn't the only thing that matters; is mango as safe as
           | wuffs in the face of untrusted input?
        
         | edflsafoiewq wrote:
         | Where does the extra speed come from?
        
         | nicoburns wrote:
         | My understanding is that libpng is unoptimised and 5-10x faster
         | is possible.
        
           | pornel wrote:
           | libpng is reasonably fast, and has SIMD optimizations. Make
           | sure to compile it with a modern CPU target.
           | 
           | The biggest bottleneck in PNG decoding is zlib, which is not
           | part of libpng. There are faster inflate implementations, but
           | nowhere near 5x.
           | 
           | The second slowest thing is unfiltering, but it takes only
           | 10-20% of the decoding time, so even lightspeed
           | implementation would make little difference.
           | 
           | There is possibility of a 10x difference when _encoding_ ,
           | but that's not due to libpng being slow, but because it's
           | possible to apply worse compression and there are dedicated
           | crappy-but-veryfast encoders.
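
The unfiltering estimate above can be checked with a short calculation. This is only a sketch: it applies Amdahl's law (not named in the comment) to the 10-20% figure quoted there, and the function name `max_overall_speedup` is made up for illustration.

```python
def max_overall_speedup(fraction: float, stage_speedup: float = float("inf")) -> float:
    """Overall speedup when one stage, taking `fraction` of the total
    time, is made `stage_speedup` times faster (Amdahl's law)."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    if stage_speedup <= 0:
        raise ValueError("stage_speedup must be positive")
    # Time left over: the untouched part plus the accelerated stage.
    remaining = (1.0 - fraction) + fraction / stage_speedup
    return 1.0 / remaining


# An infinitely fast ("lightspeed") unfilter at the quoted time shares.
for share in (0.10, 0.20):
    print(f"unfilter = {share:.0%} of decode time -> "
          f"at most {max_overall_speedup(share):.2f}x overall")
```

Under those assumptions the ceiling is roughly 1.11x to 1.25x for the whole decode, which is why the inflate step matters more than the unfilter step.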
        
       | YoshiRulz wrote:
       | Superior in every sense to that Magicka garbage they released a
       | couple months ago. I'm excited to see its via-Rust codegen.
        
       | jay-barronville wrote:
       | Wuffs is cool, but you can get similar results writing normal C
       | library code, compiling it into a .wasm binary via Clang, and
       | then running the .wasm binary through the `wasm2c` tool of the
       | WebAssembly Binary Toolkit [0]. I personally prefer this method,
       | although Wuffs will usually produce faster code.
       | 
       | [0]:
       | https://github.com/WebAssembly/wabt/tree/44837a7236e85c048de...
        
         | krick wrote:
         | It is not obvious to me why this should guarantee safety.
        
           | jay-barronville wrote:
           | `wasm2c` fully implements the WebAssembly sandbox execution
           | environment [0][1] and has the passing tests to prove it. To
           | be a bit more specific, the .wasm binary you generate
           | initially already has the WebAssembly semantics baked in
           | (obviously) and `wasm2c` creates a portable C translation of
           | the WebAssembly while also ensuring that the execution
           | environment is sandboxed (e.g., the code traps when
           | attempting out-of-bounds memory accesses).
           | 
           | [0]: https://webassembly.org
           | 
           | [1]: https://github.com/WebAssembly/wabt/issues/2289#issuecomment...
        
         | eviks wrote:
         | How much faster (say, for something like an image codec)
        
           | jay-barronville wrote:
           | This might not be what you want to hear (and I might get
           | downvoted for it), but it's what I consider the best answer:
           | Implement something minimal but useful (and realistic) using
           | both methods and benchmark them yourself.
           | 
           | Even if I told you some of the numbers I've seen in my
           | experiments and usage, it wouldn't be wise to trust them or
           | let them taint your opinion.
        
       | refibrillator wrote:
       | Can Wuffs provide stronger safety guarantees than techniques like
       | WasmBoxC?
       | 
       | My understanding is that compiling unsafe C to WASM and back
       | would also guarantee safety with respect to buffer overflows,
       | integer arithmetic overflows and null pointer dereferences.
       | 
       | It's nice not annotating code to explicitly prove invariants to
       | the compiler like you would in say Wuffs or Rust, but I suppose
       | that's what limits performance.
        
         | klabb3 wrote:
         | Doesn't wasm have a memory model as well? So unless you sandbox
         | certain parts of it you can still in theory have access across
         | different C functions, within the same wasm module?
         | 
         | What seems nice about wuffs is that it has no side effects and
         | a clear project scope. Deserialization is so riddled with
         | severe issues that it does kind of warrant its own DSL. OTOH,
         | some legacy formats will probably never be ported.
        
         | CJefferson wrote:
         | Technically, while WASM promises you put data in and get data
         | out, you can still have memory corruption (as it has a flat
         | memory), so I could make a (for example) gif with some color
         | palette, then later overflow and rewrite the palette.
         | 
         | Not fatal, but perhaps annoying.
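
A toy model may make the flat-memory point concrete. This is not wasm and not any real runtime's API; `LinearMemory`, `buggy_fill`, and the buffer layout are all hypothetical, chosen only to show that a sandbox-wide bounds check does not protect one guest object from another.

```python
class LinearMemory:
    """Toy flat memory: the only bounds check is for the whole region."""

    def __init__(self, size: int) -> None:
        self._bytes = bytearray(size)

    def store(self, addr: int, value: int) -> None:
        # Checked against the edge of the sandbox, not the edge of
        # whichever object the guest code believes it is writing to.
        if not 0 <= addr < len(self._bytes):
            raise MemoryError(f"trap: out-of-bounds store at {addr}")
        self._bytes[addr] = value & 0xFF

    def load(self, addr: int) -> int:
        if not 0 <= addr < len(self._bytes):
            raise MemoryError(f"trap: out-of-bounds load at {addr}")
        return self._bytes[addr]


# Hypothetical guest layout: a 4-byte scratch buffer at address 0,
# immediately followed by a 3-byte palette entry at address 4.
SCRATCH_BASE = 0
PALETTE_BASE = 4


def buggy_fill(mem: LinearMemory, count: int) -> None:
    """Guest code that trusts `count` and never checks the scratch size."""
    for i in range(count):
        mem.store(SCRATCH_BASE + i, 0xAA)


mem = LinearMemory(8)
for i, channel in enumerate((0x10, 0x20, 0x30)):
    mem.store(PALETTE_BASE + i, channel)

buggy_fill(mem, 6)  # overruns the scratch buffer by 2 bytes, no trap
palette = [mem.load(PALETTE_BASE + i) for i in range(3)]
print([hex(c) for c in palette])  # first two palette bytes are now 0xaa
```

The host stays intact in this model (a write to address 8 or beyond would trap), but the guest's own palette is silently corrupted.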
        
         | azakai wrote:
         | Yes, Wuffs can do better than WasmBoxC because it does more
         | than sandboxing of the code. It also checks things like integer
         | overflows which can lead to exploits that are technically not
         | memory safety issues, but still potentially dangerous.
         | 
         | But the tradeoff is that you need to rewrite your code for
         | Wuffs, while WasmBoxC can sandbox anything that compiles to
         | wasm and prevent it from corrupting the outside, including
         | existing code in C, C++, Zig, unsafe Rust, etc. etc.
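
As an illustration of the integer-overflow class mentioned above, here is a sketch that emulates 32-bit unsigned arithmetic in Python. The function names and header values are hypothetical. Wuffs is described as rejecting unproven arithmetic at compile time; the runtime check below is only a stand-in for that idea, not how Wuffs works.

```python
U32_MAX = 0xFFFFFFFF


def wrapping_mul_u32(a: int, b: int) -> int:
    """Multiply like unsigned 32-bit C arithmetic: keep the low 32 bits."""
    return (a * b) & U32_MAX


def checked_mul_u32(a: int, b: int) -> int:
    """Multiply, but refuse a result that does not fit in 32 bits."""
    product = a * b
    if product > U32_MAX:
        raise OverflowError(f"{a} * {b} does not fit in 32 bits")
    return product


def pixel_buffer_size(width: int, height: int, bytes_per_pixel: int, mul) -> int:
    """Size of a decoded image buffer, using the given multiply."""
    return mul(mul(width, height), bytes_per_pixel)


# Attacker-chosen dimensions from an image header (hypothetical values).
width, height = 65536, 65536

wrapped = pixel_buffer_size(width, height, 4, wrapping_mul_u32)
print("wrapping size:", wrapped)  # 2**32 wraps to 0, so the buffer is empty

try:
    pixel_buffer_size(width, height, 4, checked_mul_u32)
except OverflowError as exc:
    print("checked:", exc)
```

With the wrapping multiply, a decoder would allocate a far smaller buffer than the header implies and then write the image into it. Inside a sandbox that write stays contained, but it is still a logic error that the sandbox does not detect.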
        
       | who-shot-jr wrote:
       | Could you use this to make sure users uploading files to your
       | website are correct (i.e. only JPEGs and valid image data)? But in
       | a fast and safe way, or is this overkill?
        
         | pornel wrote:
         | Yes, you could. But be careful to make sure that there's no
         | more data left after the decoder finishes, because it's
         | possible to append a ZIP file (or aCropalypse) at the end of
         | any other valid image file data, and decoders usually stop at
         | the end of the image and don't parse past its end, so won't
         | complain about extra data.
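
A minimal sketch of the trailing-data check described above, for PNG only. It does not use Wuffs. It walks the PNG chunk structure with the standard library and reports how many bytes follow the IEND chunk. The names `png_trailing_bytes`, `_chunk`, and `TINY_PNG` are made up for this example, and a real upload check would also decode the image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_trailing_bytes(data: bytes) -> int:
    """Return the number of bytes after the IEND chunk.

    Raises ValueError if the chunk structure is incomplete. Only chunk
    headers are walked; pixels are not decoded and CRCs are not verified.
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("missing PNG signature")
    pos = len(PNG_SIGNATURE)
    while True:
        # Chunk layout: 4-byte big-endian length, 4-byte type,
        # payload, 4-byte CRC.
        if pos + 8 > len(data):
            raise ValueError("truncated chunk header")
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4
        if end > len(data):
            raise ValueError("truncated chunk body")
        pos = end
        if chunk_type == b"IEND":
            return len(data) - pos


def _chunk(chunk_type: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk (helper for the demo image below)."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return (struct.pack(">I", len(payload)) + chunk_type + payload
            + struct.pack(">I", crc))


# A 1x1 8-bit grayscale PNG, built by hand for the demo.
TINY_PNG = (
    PNG_SIGNATURE
    + _chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + _chunk(b"IDAT", zlib.compress(b"\x00\x00"))
    + _chunk(b"IEND", b"")
)

print(png_trailing_bytes(TINY_PNG))                     # clean file
print(png_trailing_bytes(TINY_PNG + b"PK\x03\x04zip"))  # appended payload
```

An upload handler could reject any file for which this returns a nonzero count. Passing this check says nothing about whether other decoders will handle the file safely.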
        
         | dividuum wrote:
         | Not sure that's possible. I'm pretty sure it is not safe to
         | assume "parses in wuffs" -> "is safe in any other decoder".
         | I'm using wuffs to check user upload (see my recent response in
         | another thread) but I still generate out linear RGBA and work
         | with that. I still consider the original JPEG data hostile.
        
       | Alifatisk wrote:
       | while true { ... } endwhile
       | 
       | Please, the end brackets should be enough.
        
       ___________________________________________________________________
       (page generated 2024-05-18 23:01 UTC)