[HN Gopher] Wuffs: Wrangling Untrusted File Formats Safely
___________________________________________________________________
Wuffs: Wrangling Untrusted File Formats Safely
Author : nequo
Score : 261 points
Date : 2024-05-16 13:48 UTC (2 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| dang wrote:
| Related:
|
| _Wuffs the Language_ -
| https://news.ycombinator.com/item?id=26731305 - April 2021 (75
| comments)
|
| _Wuffs' PNG image decoder_ -
| https://news.ycombinator.com/item?id=26714831 - April 2021 (138
| comments)
| newman314 wrote:
| Does anyone know of a tool that can do this for PDFs instead?
| warkdarrior wrote:
| As soon as someone writes a Javascript interpreter in Wuffs..
| jchw wrote:
| Do you ever need a JS interpreter to _parse_ a PDF? That's
| horrifying.
|
| I understand PDF has a bunch of limbs, but I always assumed
| the JS stuff was at least separate from the parsing. (I am
| familiar with the PDF format at a lower level but I never
| touched any of the weird features.)
| timschmidt wrote:
| I wrote an SVG that's all javascript, no elements. All the
| graphics are generated dynamically at runtime by the
| javascript. It's SVG standards compliant, but only opens
| correctly in browsers, not in inkscape or other desktop
| publishing apps.
|
| I work a lot in OpenSCAD, and had a need to design some
| custom graph paper. So I found the subset of SVG which was
| similar to OpenSCAD. :)
| jszymborski wrote:
| Frankly, I wouldn't begrudge a website for not correctly
| parsing an svg I composed entirely with javascript.
|
| It's annoying you can't just "flatten" or "bake" such an
| svg like yours into one composed entirely of elements
| (unless one exists?)
| csande17 wrote:
| Often you can open the SVG in a browser and then use the
| developer tools to copy out the resulting nodes as "flat"
| SVG source code.
|
| Chrome even includes a --dump-dom flag you can use to do
| this on the command line, although I haven't tested it
| with an SVG.
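A minimal sketch of the command-line route mentioned above, assuming a Chrome or Chromium binary is installed and that `--dump-dom` is combined with `--headless`. The helper names, the binary name, and the file name are hypothetical, and (as the comment itself says) whether this works well for a script-generated SVG is untested.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_dump_dom_command(chrome_binary: str, svg_path: str) -> list[str]:
    """Build the argv for a headless Chrome run that prints the final DOM."""
    # Chrome expects a URL, so turn the local path into a file:// URI.
    url = Path(svg_path).resolve().as_uri()
    return [chrome_binary, "--headless", "--dump-dom", url]


def flatten_svg(chrome_binary: str, svg_path: str, timeout_s: float = 30.0) -> str:
    """Run Chrome and return whatever serialized DOM it prints to stdout."""
    result = subprocess.run(
        build_dump_dom_command(chrome_binary, svg_path),
        capture_output=True,
        text=True,
        check=True,        # raise if Chrome exits with an error
        timeout=timeout_s,
    )
    return result.stdout
```

Something like `flatten_svg("chromium", "graph-paper.svg")` would then return the markup after the script has run, which could be saved as a script-free SVG if the output turns out to be usable.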
| jszymborski wrote:
| Clever!
| Joel_Mckay wrote:
| There are PDF readers that do not support the scripting format
| extensions.
|
| Note this does not prevent unscrupulous companies abusing
| dominant market positions to voluntarily embed machine and
| serial hash watermarks.
|
| To be clear: formats like pdf, ps, webp, svg, and tiff are so
| badly implemented in some ecosystems... they can't _ever_ be
| assumed safe input formats. Thus, at some point people need to
| spin up an actual VM to transcode a "web" version, and scrub
| each stage of the rendering pipeline like a virus or header
| injection is already present.
|
| "I never play where nice things are, and don't break things"
| (Eliza Mowry Blven, The Humanitarian Review, Volume 3, March,
| 1905)
|
| Cheers =3
| tialaramex wrote:
| I worked with TIFF pretty extensively, it's a mess but I
| don't see why a WUFFS TIFF codec can't be fine. What makes
| you say you need "an actual VM to transcode" a TIFF ?
| Joel_Mckay wrote:
| The complexity of the tiff and tga specifications makes it
| nearly impossible to span all the edge-cases with unit-
| tests. A VM can be in a known-state snapshot, process
| pre/post signature logged/compared with a scripted
| debugger, and binary input/output stripped of non-compliant
| metadata/blobs at each stage of the pipeline if the process
| behaves as expected.
|
| I've yet to find a better method than Honeypots to
| sustainably mitigate the complex leaky dependency mess on
| traditional architectures. It has been my experience that
| "all software is terrible, but some of it is useful".
|
| It may just be my bias, but I see code smell getting worse
| in recent decades...
|
| Have a nice day, =3
|
| https://www.youtube.com/watch?v=aCbfMkh940Q
| tialaramex wrote:
| So, there's actually no particular reason and if somebody
| cares to write one then yup, TIFF codec in WUFFS would in
| fact be safer and faster than your uh, approach.
| Joel_Mckay wrote:
| One does not rely on the persistent competence of the
| coders, and will tell you when something has gone wrong.
|
| And walking a binary object store to ban problem users is
| not always necessary... depending what you are doing.
|
| Most other approaches make the same predictable
| assumptions:
|
| https://en.wikipedia.org/wiki/List_of_cognitive_biases
|
| Despite popular belief, shitty design does not usually
| get better in another language. Rather, people just feel
| more confident it isn't shit anymore.
|
| I have yet to see evidence to the contrary. =3
| tialaramex wrote:
| Wait, you believe that somehow one of these approaches
| doesn't rely on competence from programmers? How do you
| figure?
|
| Have you been imagining that sandboxes are some sort of
| fairy dust we just stumbled onto one day, supernatural in
| nature and not, in fact, just software written by people
| you're hoping are competent and haven't left any holes?
| Joel_Mckay wrote:
| The point was... one is testing parser/OS integrity via a
| debugging interface over an expectation of an unchanging
| emulated environment state... there is nothing
| particularly special about the approach. Even Qubes OS
| and RancherVM are not perfect in this regard, friend.
|
| Or put another way, the available attack surface of a
| bare-minimum fixed environment is much easier to auto-
| audit, than a pile of daily permuted binaries and self-
| delusion approach. i.e. if it fails to behave in an
| expected way, or is modified in any way... the host audit
| process doesn't have to care why or how it is broken to
| maintain a service queue as the guest is culled.
|
| Perhaps I am wrong about exchanging 15% of raw
| performance for reliability, but things can get
| complicated with licenses and multiple OS specific
| platforms.
|
| You seem to be getting emotional about this subject,
| presenting secondary and tertiary straw-man arguments. So
| I'm going to go eat some Cheese Goldfish crackers... and
| just agree that your beliefs are interesting.
|
| Have a fantastic weekend... =3
| tialaramex wrote:
| There's nothing special about it, but it doesn't work
| especially well. This is the strategy that's blown up on
| Apple twice in recent years and will keep burning them.
|
| If you're Matt Godbolt the benefits of sandboxing
| outweigh the cost because Matt is interested in general
| purpose software. But WUFFS isn't for that, as its name
| says it's interested in doing one particular task well.
|
| In this deliberately limited domain, WUFFS gets to
| sidestep Rice's theorem altogether and just prove the
| software meets the semantic requirements [technically you
| do the proving, WUFFS just checks your work].
|
| I hope you enjoyed your goldfish crackers but I urge you
| to use the right tool for the job.
| Joel_Mckay wrote:
| "the right tool for the job" is sometimes admitting the
| breadth of underlying dependencies and ambiguous format
| specifications are unfeasible to fix with your team's time
| budget.
|
| The design in question currently only processes around
| 1.8M large image files a day, and does not require
| additional work/re-implementations to support the dozens
| of questionable user file-formats. i.e. the plain old
| ImageMagick lib does most of the heavy lifting at the
| end.
|
| Would I trust such a solution for something like a native
| client side web-browser etc... absolutely not... but for
| the core-bound instance overhead, the resource cost was
| acceptable for almost a decade of uptime on those system
| instances.
|
| Use-cases are funny like that, as there is no perfect
| solution... but rather a tradeoff of what features get
| the system functional and reliable. Part of that is
| admitting integration of 3rd party dependencies is a
| long-term liability, and domain specific languages almost
| always fade into obscurity.
|
| Cheers, =3
| immibis wrote:
| WUFFS is provably safe - that's the whole schtick. If a WUFFS
| kernel exists, you can assume it is safe. If it's not proven
| safe, it doesn't compile. The reason everyone doesn't program
| in WUFFS is that you have to write a proof that your kernel
| is safe, which takes a very very very long time.
| indolering wrote:
| What's the formal verification story for WUFFS?
| Joel_Mckay wrote:
| If you point out some of the above has run-state in some
| situations... it is provably nondeterministic... and thus
| the assertion of correctness is utter nonsense.
|
| Hardly a panacea for fundamentally bad designs that go
| back decades.
|
| Ever seen a web-server written in postscript? It's worth a
| look just for the laughs.
|
| Good luck out there =)
| tialaramex wrote:
| For WUFFS the language, or for WUFFS the library, or for
| the WUFFS tooling today?
|
| The clever idea is to have you the programmer in effect
| write a proof that your code has the desired semantic
| properties as part of the programming activity and so
| then the WUFFS transpiler is merely _checking_ that the
| proof is correct.
|
| This leverages your understanding of what you were trying
| to do.
| immibis wrote:
| Apparently Wuffs only proves safety. Verifying the code
| does what it's supposed to do is done with unit tests.
| lupire wrote:
| WUFFS is provably safe, or WUFFS programs are provably
| safe, using WUFFS as an axiom?
| ThePowerOfFuet wrote:
| https://dangerzone.rocks/
| trustno2 wrote:
| pdfs are really really hard. the only viewer that parses them
| semi-correctly is ... Acrobat Reader.
|
| try to ever read any code for PDFs and see all the horrors.
|
| Google gave up and just bought the code from foxit.
| kjksf wrote:
| Google was never trying to write PDF reader from scratch so
| they never "gave up".
|
| They just bought foxit code to save years of development when
| they wanted to ship PDF reader in Chrome.
|
| Your comment about "the only viewer that is semi-correct" is
| also wildly off the mark.
|
| Parsing correctly written PDF files is hard but multiple
| engines can do it correctly.
|
| Parsing real life PDFs is much harder than correctly
| implementing the PDF spec because lots of PDFs are just broken.
| The generators create invalid PDF files and then PDF readers
| have to spend heroic efforts to somehow make sense of this
| brokenness. Adobe does it better than most because... well it
| would be embarrassing if they didn't. They invented the
| format, they make money from their tools, they were doing it
| the longest, they have the largest archive of broken PDFs for
| testing etc. It's hard to expect that e.g. an open-source
| project with one or two developers can match that.
|
| I work on SumatraPDF so I know.
| yjftsjthsd-h wrote:
| This is one of my favorite attempts at better programming
| language safety, because it compiles down to C that can then be
| shipped like normal C, so you don't get the ecosystem friction
| like with ex. Rust.
| vlovich123 wrote:
| It's an interesting idea for sure but it isn't a general
| purpose language, so the problem domains it can solve are very
| very different vs what Rust is trying to do.
| tialaramex wrote:
| Nigel has said that emitting "unsafe" Rust is a reasonable
| thing for a hypothetical WUFFS 1.0 to be able to do as an
| alternative to C. As with good "unsafe" Rust written by
| humans WUFFS would know exactly why what it's doing is fine,
| it's just that the Rust compiler can't necessarily see that,
| hence the need to label it "unsafe".
|
| Today C makes most sense given the WUFFS language is still in
| flux.
|
| [Edited to fix a serious typo]
| nequo wrote:
| What would be the primary benefit of emitting Rust rather
| than C? Both would be considered safe (assuming Wuffs
| generates correct code), and Rust could access the C code
| via FFI. Is there something I'm missing?
| vlovich123 wrote:
| Nominally it can safely elide bounds checks via unsafe
| that it has proved are actually safe within the
| constraints of Wuffs, which is what it does for C (+ the
| language is built for easier translation to
| vectorized code than something like llvm is able to do for
| general purpose languages).
|
| So basically higher performance.
|
| FFI nominally has a runtime and compile time cost -
| whether that matters for you in particular will depend on
| your needs, but being able to publish a very simple crate
| without a build.rs to manage can have an attraction.
| tialaramex wrote:
| I expect that the Rust emitted by a hypothetical future
| WUFFS transpiler would be _much_ easier to just drop into
| an existing Rust project than some C via a C FFI.
|
| It's common for C libraries that do get wrapped today
| (e.g. openssl) to have a two phase wrapping, a -sys crate
| which turns the C into Rust C FFI and then another crate
| to turn the Rust C FFI into something actually palatable
| to ordinary people.
| jcranmer wrote:
| The C abstract machine is slightly funkier than unsafe
| Rust (things like C lacking a way to do signed integer
| overflow without UB or needing to adhere to strict
| aliasing in C), so I would expect that lowering to unsafe
| Rust would be slightly more likely to be correct.
| pcwalton wrote:
| One benefit would be that Rust users could use Wuffs code
| without having to install a C compiler. Pure-Rust
| solutions are much more convenient in the Cargo ecosystem
| than wrangling -sys crates.
| edflsafoiewq wrote:
| You could also use c2rust.
| IshKebab wrote:
| Probably not too much from a "final product" point of
| view, but using a pure Rust library is a whole lot easier
| than C from a faff point of view. Especially for cross-
| compilation.
| vlovich123 wrote:
| I'm responding to this:
|
| > that can then be shipped like normal C, so you don't get
| the ecosystem friction like with ex. Rust.
|
| Emitting Rust doesn't help with this.
| tomjakubowski wrote:
| it helps in the other direction: less friction to use
| from rust
| fiddlerwoaroof wrote:
| But more friction to use from just about every other
| language.
| andrepd wrote:
| What's the difference vs compiling down to machine code and
| linking it with your program?
| pornel wrote:
| You reuse optimizer and machine code generator of the C
| compiler, and you're not tied to a single backend like LLVM.
| pcwalton wrote:
| C has a lot of problems as a compilation target as well, from
| surprising UB (e.g. signed integer overflow) to debugging
| problems (e.g. #line is woefully inadequate compared to the
| ability to emit DWARF DIEs) to the inconvenience of setting up
| a toolchain for end users. To its credit, Wuffs is one of the
| better projects that compiles to C, because it targets a very
| restricted domain. But, in general, don't write programming
| languages that compile to C.
| yjftsjthsd-h wrote:
| Of course C sucks, but since everything under the sun uses
| it, there's unique value in being able to make it safer
| without putting a whole new compiler in the process for
| users. Remember that time the cryptography library in Python
| decided to add rust? We could have avoided all that pain with
| wuffs.
| bobajeff wrote:
| For many of us making a compile to c language is many times
| more feasible than using something like llvm. I'm not saying
| it's great mind you but it's probably the best thing
| available without a runtime.
|
| For debugging i believe you can generate your own source maps
| and use gdb as a backend to talk with your custom debugger.
| tedunangst wrote:
| Related, in the sense of solving the same problem in a different
| manner: https://rlbox.dev/
| Ono-Sendai wrote:
| Wuffs is great. I use it in Substrata (https://substrata.info/)
| for loading PNGs. It is both faster and safer than LibPNG. It's
| something around 2x faster than LibPNG in my tests (depending on
| the PNG file), see timings here:
| https://github.com/google/wuffs/issues/13#issuecomment-17325...
|
| So generally Wuffs is great and you should use it to decode your
| PNGs. There are some downsides: not all of the obscure bit depths
| and formats that PNG supports are loaded as-is, some are
| converted to more standard formats.
|
| Also the Wuffs documentation is a bit hard to understand. It's a
| little bit of a mission getting PNG decoding working. You can see
| my code for that here though:
| https://github.com/glaretechnologies/glare-core/blob/2c7174c...
| repsilat wrote:
| The "mango" lib [1] claims to be even faster for PNGs. Actively
| maintained but doesn't have as much buzz, I think the devs
| haven't advertised it as much on places like this.
|
| Also, it has the funniest testimonials.
|
| 1: https://github.com/t0rakka/mango
| yjftsjthsd-h wrote:
| Speed isn't the only thing that matters; is mango as safe as
| wuffs in the face of untrusted input?
| edflsafoiewq wrote:
| Where does the extra speed come from?
| nicoburns wrote:
| My understanding is that libpng is unoptimised and 5-10x faster
| is possible.
| pornel wrote:
| libpng is reasonably fast, and has SIMD optimizations. Make
| sure to compile it with a modern CPU target.
|
| The biggest bottleneck in PNG decoding is zlib, which is not
| part of libpng. There are faster inflate implementations, but
| nowhere near 5x.
|
| The second slowest thing is unfiltering, but it takes only
| 10-20% of the decoding time, so even lightspeed
| implementation would make little difference.
|
| There is a possibility of a 10x difference when _encoding_,
| but that's not due to libpng being slow, but because it's
| possible to apply worse compression and there are dedicated
| crappy-but-very-fast encoders.
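The unfiltering estimate above is an Amdahl's-law argument, and the arithmetic can be checked with a short sketch. The 10-20% figure is taken from the comment; the function name is made up for illustration, and nothing here measures a real decoder.

```python
def overall_speedup(fraction: float, stage_speedup: float) -> float:
    """Amdahl's law: whole-pipeline speedup when one stage gets faster.

    fraction      -- share of total time spent in the stage (0..1)
    stage_speedup -- how many times faster that stage becomes
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if stage_speedup <= 0:
        raise ValueError("stage_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)


# Upper bound if unfiltering became infinitely fast ("lightspeed").
for fraction in (0.10, 0.20):
    limit = overall_speedup(fraction, float("inf"))
    print(f"unfiltering at {fraction:.0%} of decode time: at most {limit:.2f}x overall")
```

Even in the best case the whole decode gets only about 1.11x to 1.25x faster, which is why a 5-10x gain would have to come mostly from the inflate step rather than from unfiltering.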
| YoshiRulz wrote:
| Superior in every sense to that Magicka garbage they released a
| couple months ago. I'm excited to see its via-Rust codegen.
| jay-barronville wrote:
| Wuffs is cool, but you can get similar results writing normal C
| library code, compiling it into a .wasm binary via Clang, and
| then running the .wasm binary through the `wasm2c` tool of the
| WebAssembly Binary Toolkit [0]. I personally prefer this method,
| although Wuffs will usually produce faster code.
|
| [0]:
| https://github.com/WebAssembly/wabt/tree/44837a7236e85c048de...
| krick wrote:
| It is not obvious to me why this should guarantee safety.
| jay-barronville wrote:
| `wasm2c` fully implements the WebAssembly sandbox execution
| environment [0][1] and has the passing tests to prove it. To
| be a bit more specific, the .wasm binary you generate
| initially already has the WebAssembly semantics baked in
| (obviously) and `wasm2c` creates a portable C translation of
| the WebAssembly while also ensuring that the execution
| environment is sandboxed (e.g., the code traps when
| attempting out-of-bounds memory accesses).
|
| [0]: https://webassembly.org
|
| [1]: https://github.com/WebAssembly/wabt/issues/2289#issuecomment...
| eviks wrote:
| How much faster (say, for something like an image codec)?
| jay-barronville wrote:
| This might not be what you want to hear (and I might get
| downvoted for it), but it's what I consider the best answer:
| Implement something minimal but useful (and realistic) using
| both methods and benchmark them yourself.
|
| Even if I told you some of the numbers I've seen in my
| experiments and usage, it wouldn't be wise to trust them or
| let them taint your opinion.
| refibrillator wrote:
| Can Wuffs provide stronger safety guarantees than techniques like
| WasmBoxC?
|
| My understanding is that compiling unsafe C to WASM and back
| would also guarantee safety with respect to buffer overflows,
| integer arithmetic overflows and null pointer dereferences.
|
| It's nice not annotating code to explicitly prove invariants to
| the compiler like you would in say Wuffs or Rust, but I suppose
| that's what limits performance.
| klabb3 wrote:
| Doesn't wasm have a memory model as well? So unless you sandbox
| certain parts of it you can still in theory have access across
| different C functions, within the same wasm module?
|
| What seems nice about wuffs is that it has no side effects and
| a clear project scope. Deserialization is so riddled with
| severe issues that it does kind of warrant its own DSL. OTOH,
| some legacy formats will probably never be ported.
| CJefferson wrote:
| Technically, while WASM promises you put data in and get data
| out, you can still have memory corruption (as it has a flat
| memory), so I could make a (for example) gif with some color
| palette, then later overflow and rewrite the palette.
|
| Not fatal, but perhaps annoying.
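The palette scenario above can be illustrated with a toy model. This is not real WebAssembly: the `LinearMemory` class, the `Trap` exception, and the memory layout are all made up for the sketch. It assumes only what the comment states, namely that the sandbox checks accesses against one flat memory and knows nothing about the objects placed inside it.

```python
class Trap(Exception):
    """Stands in for a sandbox trap on an out-of-bounds access."""


class LinearMemory:
    """Toy flat memory: accesses are checked against the whole array only."""

    def __init__(self, size: int) -> None:
        self._bytes = bytearray(size)

    def store(self, address: int, data: bytes) -> None:
        # The only check is against the end of the entire memory.
        if address < 0 or address + len(data) > len(self._bytes):
            raise Trap("out-of-bounds store")
        self._bytes[address:address + len(data)] = data

    def load(self, address: int, length: int) -> bytes:
        if address < 0 or length < 0 or address + length > len(self._bytes):
            raise Trap("out-of-bounds load")
        return bytes(self._bytes[address:address + length])


# Hypothetical layout: an 8-byte scratch buffer followed directly by a
# 6-byte palette holding two RGB entries.
SCRATCH_ADDR, SCRATCH_SIZE = 0, 8
PALETTE_ADDR, PALETTE_SIZE = 8, 6

memory = LinearMemory(16)
memory.store(PALETTE_ADDR, bytes([255, 0, 0, 0, 255, 0]))  # red, green

# A buggy decoder writes 11 bytes into the 8-byte scratch buffer. The
# sandbox allows it, and the first palette entry is silently overwritten.
memory.store(SCRATCH_ADDR, b"A" * 11)
palette_after = memory.load(PALETTE_ADDR, PALETTE_SIZE)

# Writing past the end of the whole memory is what actually traps.
try:
    memory.store(12, b"B" * 8)
    escaped = True
except Trap:
    escaped = False
```

In this model the host stays protected (the second store traps), but the data inside the sandbox is already wrong after the first one.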
| azakai wrote:
| Yes, Wuffs can do better than WasmBoxC because it does more
| than sandboxing of the code. It also checks things like integer
| overflows which can lead to exploits that are technically not
| memory safety issues, but still potentially dangerous.
|
| But the tradeoff is that you need to rewrite your code for
| Wuffs, while WasmBoxC can sandbox anything that compiles to
| wasm and prevent it from corrupting the outside, including
| existing code in C, C++, Zig, unsafe Rust, etc. etc.
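To make the integer-overflow point concrete, here is a small sketch of how a wrapped 32-bit size calculation produces an undersized buffer without any out-of-bounds access at that moment. It is plain Python with a run-time check standing in for the kind of overflow checking the comment attributes to Wuffs. The function names and image dimensions are invented for the example.

```python
U32_MAX = 0xFFFF_FFFF


def wrapping_mul_u32(a: int, b: int) -> int:
    """Multiply the way unchecked 32-bit unsigned arithmetic would: modulo 2**32."""
    return (a * b) & U32_MAX


def checked_mul_u32(a: int, b: int) -> int:
    """Multiply, but refuse to return a value that does not fit in 32 bits."""
    product = a * b
    if product > U32_MAX:
        raise OverflowError(f"{a} * {b} does not fit in a u32")
    return product


# Hypothetical header values from an untrusted image file.
width, height, bytes_per_pixel = 65536, 16385, 4

# Unchecked: the true size is 4,295,229,440 bytes, but the wrapped result
# is tiny, so a decoder would allocate far too little memory.
wrapped_size = wrapping_mul_u32(wrapping_mul_u32(width, height), bytes_per_pixel)

# Checked: the same calculation is rejected before anything is allocated.
try:
    checked_mul_u32(checked_mul_u32(width, height), bytes_per_pixel)
    rejected = False
except OverflowError:
    rejected = True
```

Here `wrapped_size` comes out as 262144 bytes, which a later pixel write would overrun. Inside a wasm sandbox that overrun might go undetected as long as it stays within the linear memory.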
| who-shot-jr wrote:
| Could you use this to make sure users uploading files to your
| website are correct (i.e only jpegs and valid image data)? But in
| a fast and safe way, or is this overkill?
| pornel wrote:
| Yes, you could. But be careful to make sure that there's no
| more data left after the decoder finishes, because it's
| possible to append a ZIP file (or aCropalypse) at the end of
| any other valid image file data, and decoders usually stop at
| the end of the image and don't parse past its end, so won't
| complain about extra data.
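The trailing-data check described above can be sketched for PNG, where the end of the image is an explicit IEND chunk. This is only a structural walk over the chunk list using the Python standard library, not a decoder and not Wuffs. The function name is made up, and a real upload filter would run it alongside an actual decode. JPEG would need a different walk over its markers.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_trailing_bytes(data: bytes) -> int:
    """Return how many bytes follow the IEND chunk (0 means nothing appended).

    Raises ValueError if the chunk structure itself is malformed.
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("missing PNG signature")
    offset = len(PNG_SIGNATURE)
    while True:
        if offset + 8 > len(data):
            raise ValueError("truncated chunk header")
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, chunk_type = struct.unpack_from(">I4s", data, offset)
        chunk_end = offset + 8 + length + 4
        if chunk_end > len(data):
            raise ValueError("truncated chunk")
        (stored_crc,) = struct.unpack_from(">I", data, offset + 8 + length)
        # The CRC covers the type field and the data, not the length field.
        if zlib.crc32(data[offset + 4:offset + 8 + length]) != stored_crc:
            raise ValueError("chunk CRC mismatch")
        offset = chunk_end
        if chunk_type == b"IEND":
            return len(data) - offset


def _chunk(chunk_type: bytes, payload: bytes) -> bytes:
    """Assemble one chunk for the demo below."""
    crc = zlib.crc32(chunk_type + payload)
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)


# Structurally valid skeleton (no pixel data, so not a decodable image).
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
clean = PNG_SIGNATURE + _chunk(b"IHDR", ihdr) + _chunk(b"IEND", b"")
polluted = clean + b"PK\x03\x04junk"  # bytes appended after the image end
```

An upload handler could reject any file where `png_trailing_bytes(...)` is non-zero, in addition to whatever the decoder itself reports.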
| dividuum wrote:
| Not sure that's possible. I'm pretty sure it is not safe to
| assume "parses in wuffs" -> "is safe in any other decoder".
| I'm using wuffs to check user upload (see my recent response in
| another thread) but I still generate out linear RGBA and work
| with that. I still consider the original JPEG data hostile.
| Alifatisk wrote:
| while true { ... } endwhile
|
| Please, the end brackets should be enough.
___________________________________________________________________
(page generated 2024-05-18 23:01 UTC)