[HN Gopher] New Ghostscript PDF interpreter
___________________________________________________________________
New Ghostscript PDF interpreter
Author : diskmuncher
Score : 148 points
Date : 2022-07-31 15:40 UTC (7 hours ago)
(HTM) web link (www.ghostscript.com)
(TXT) w3m dump (www.ghostscript.com)
| mepian wrote:
| "But Ghostscript's PDF interpreter was, as noted, written in
| PostScript, and PostScript is not a great language for handling
| error conditions and recovering."
|
| Isn't C, their chosen replacement of PostScript, also
| particularly bad at this?
| daptaq wrote:
| I'd say a language is bad at error handling if it doesn't let
| you check if a procedure failed or not. What C does is that it
| compiles even if you ignore this, which is a different issue.
| Java, Rust, etc. wouldn't compile if you totally ignored it,
| but that doesn't mean you have to do proper error handling,
| beyond satisfying the compiler/type system.
| ptx wrote:
| Are there any languages that are bad at error handling then,
| according to that definition? That don't let you return
| values, set global flags, mutate arguments or in any other
| way communicate back from a procedure?
| colonwqbang wrote:
| I also had a slight chuckle at this. However, I'm sure C is
| still a great step up from Postscript.
|
| It is however quite entertaining to read the predictable
| comments from Rust/Java/C++ fans who are upset that they didn't
| choose their favourite language.
| forgotpwd16 wrote:
| Surprised the decision wasn't made sooner.
| vivegi wrote:
| In the past when we had to use Ghostscript for PDF processing, we
| always separated it out into its own process and added a whole
| lot of error management externally.
|
| Even if the application was fine, you would always encounter
| PS/PDF files in the wild that kept stress-testing the
| application's memory safety.
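The "own process plus external error management" pattern described above can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual setup: `run_isolated` and `ConversionFailed` are hypothetical names, and the timeout only bounds wall-clock time. It does not cap memory or CPU, and it is not a hard security boundary on its own.

```python
import subprocess
import sys


class ConversionFailed(Exception):
    """The child process hung, crashed, or reported an error."""


def run_isolated(argv, timeout_s=60):
    """Run argv as a separate process and return its stdout as bytes.

    A hang, crash, or non-zero exit in the child is turned into
    ConversionFailed instead of taking the calling application down.
    """
    try:
        done = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,  # never let the child wait on our stdin
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,         # the child is killed when this expires
        )
    except subprocess.TimeoutExpired as exc:
        raise ConversionFailed(f"timed out after {timeout_s}s") from exc
    except OSError as exc:             # e.g. the executable does not exist
        raise ConversionFailed(f"could not start {argv[0]!r}: {exc}") from exc
    if done.returncode != 0:
        # On POSIX a negative return code means the child died from a signal.
        tail = done.stderr.decode("utf-8", "replace")[-500:]
        raise ConversionFailed(f"exit status {done.returncode}: {tail}")
    return done.stdout


if __name__ == "__main__":
    # Stand-in child so the sketch runs without Ghostscript installed.
    print(run_isolated([sys.executable, "-c", "print('child ran')"]))
```

In real use the argument list would start with the Ghostscript executable instead of the Python stand-in. A segfault on a hostile file then surfaces as an ordinary exception in the caller.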
| diskmuncher wrote:
| How interpreting PDF in Postscript became untenable
| vintagedave wrote:
| Given the mention of security issues in their custom PostScript
| extensions, and that PDF files are often malformed, I wonder why
| they chose C as the language for the new interpreter. I don't
| want to write a typical HN comment ( _cough_ use Rust for
| everything :)) but surely there is _some_ better language for
| entirely new development of a secure and fast parser in 2022.
|
| The post has no explanation of this choice. Does anyone know?
| salmo wrote:
| My guess is that since the rest of the project (not in PS
| itself) is in C, it's in C. And it may be borrowing from the PS
| interpreter codebase. I dunno.
|
| Requiring another skillset, toolchain, etc. is onerous and has
| to be weighed in those decisions. Rust is cool for sure, but
| difficult to adopt in brownfield projects because of humans
| more than tech.
|
| Also, it wasn't written in 2022, just made the default now.
| GS is a venerable codebase, and jumping on a "new" language
| bandwagon may have seemed dangerous at the time it was started.
|
| All conjecture. I'm not an expert or involved.
| mkl95 wrote:
| One reason may be that they want to build a high level wrapper
| of that C API, something that is well documented in some
| languages (e.g. Python)
| lvh wrote:
| We (Latacora) previously advised clients to encapsulate
| GhostScript processing in something with a hard security
| boundary (like a Lambda) and I am not expecting the new
| implementation to change that.
| h2odragon wrote:
| I suspect they need portability more than most projects.
| winter_blue wrote:
| Are you kidding? Many other languages are as portable, if not
| more portable.[a] Your point would be valid in 1972, not in
| 2022. I can't believe you're regurgitating the same
| "portability" from 50 years ago, today (unless you meant it
| as a joke and forgot to include a /s).
|
| [a] Languages targeting LLVM or supported by GCC are portable
| to every target machine code / ISA / architecture supported
| by those toolchains. JVM, JS, etc are portable to all the
| platforms they support. You don't need to do any extra work
| (of recompiling) if you use a bytecode VM / platform (for
| example, like JVM).
| mistrial9 wrote:
| does an LLVM requirement fit the social and license goals
| of this eco-system fundamental project?
| zbentley wrote:
| Well, there's portability and then there's portability.
| Getting LLVM to emit artifacts on a given target is easy.
| Getting assurance that big, complex interfaces that
| integrate with the underlying OS in extremely specific ways
| (i.e. your programming language's IO or concurrency system)
| behave correctly on that target, and have appropriate
| testing, community support, and documentation is another
| thing entirely.
|
| Like, I get it. The claim that "rust isn't portable" is
| often used as a thought terminating cliche, and is often
| wrong or irrelevant in context. But the claim "X uses LLVM,
| LLVM can target environment Y, therefore X is fully
| compatible with Y" is just as reductive and misleading.
| jeffbee wrote:
| WUFFS seems like a great option for this.
| midislack wrote:
| No, not more Rust activism. Please, anything but more of this.
| Have some shame.
| amluto wrote:
| Beyond a lack of memory safety, C has another issue that makes
| me dislike it for this kind of application: C has a very
| minimal set of built in data structures. Combined with a lack
| of generics, this means that using, say, a dictionary means
| that quite a bit of the implementation gets hard coded into
| every site that uses the dictionary. This is almost invariably
| done with lots of pointers (since C has no better-constrained
| reference type), and the result can be bug-prone and difficult
| to refactor.
|
| For all of C++'s faults, at least it's possible to use a map
| (or unordered_set or whatever) and mostly avoid encoding the
| fact that it's anything other than an associative container of
| some sort at the call sites. This is especially true in C++11
| or newer with auto.
| SAI_Peregrinus wrote:
| [WUFFS](https://github.com/google/wuffs) is made for stuff
| like this, and it has a library available as transpiled C
| code.
| tgflynn wrote:
| > this means that using, say, a dictionary means that quite a
| bit of the implementation gets hard coded into every site
| that uses the dictionary
|
| I don't understand this part of your comment. There's nothing
| preventing you from designing a nice well-encapsulated
| map/dictionary data structure in C and I'm sure there are
| many many libraries that do just that.
|
| I do agree though that having such basic data structures in
| the standard library, as modern C++ does, is usually
| preferable.
| simias wrote:
| Lack of generics will do that, unless you consider that
| blindly casting `void *` all over the place counts as
| "well-encapsulated". Even with macro-soup designing a good
| agnostic dictionary implementation for C is rather
| challenging. Linked lists are _okay_ if you use something
| like the kernel's list.h, but even then it's macro-heavy
| and has its pitfalls.
|
| In my work as an embedded developer I still use C a lot and
| it's probably the programming language I know best and have
| the most experience with but it would never cross my mind
| to write a PDF interpreter in it unless I had a tremendous
| reason to do so. There are so many better choices these
| days.
| tgflynn wrote:
| Type safety and encapsulation are distinct issues. The
| Linux kernel uses many well-encapsulated interfaces but
| it's written in C and the typing reflects that
| limitation.
|
| Personally I haven't used straight C in years and would
| never choose it over C++ unless platform constraints
| required it, but a vast amount of very complex software
| has been and continues to be written in C, including all
| the widely used OS kernels, so I don't find it very
| surprising that a new feature in a very old piece of
| software would be written in it.
| chrisseaton wrote:
| > There's nothing preventing you from designing a nice
| well-encapsulated map/dictionary data structure in C
|
| When you write a set function for your map data structure,
| what type do you make the key parameter?
| rixed wrote:
| size_t key_size, void *key
| nextaccountic wrote:
| And then eschew type safety
| chrisseaton wrote:
| > nice well-encapsulated
|
| ...
|
| > void *
| tgflynn wrote:
| Type safety and encapsulation aren't the same thing.
| Encapsulation is about hiding implementation details from
| the user of an API, which is what the comment I
| originally replied to was claiming you couldn't do in C.
| chrisseaton wrote:
| The void * is (should have been!) an implementation
| detail, and you're leaking it in the interface - that's
| not encapsulation.
|
| For example if I want to store a __int128 on a 64-bit
| machine I'll have to deal with stuff like memory
| allocation and lifetime myself, when the data structure
| should do that.
| mistrial9 wrote:
| this is a pointer-based language so there are lots of
| ways to solve that, but you know that already.. this is a
| setup question.. of course it's not useful to re-invent
| critical, secure functions over and over, yet what if I
| am not writing critical, secure functions anyway?
|
| I would choose a key type that is natural to the
| environment and problem.. unsigned integers are useful.
| Which unsigned integer size? there are only a couple of
| practical answers to that.. unless there is some massive
| dataset, use a 32bit unsigned integer, like so much of
| the software does right now.
| thesz wrote:
| Code from yalsat (stochastic SAT solver) [1] made me
| learn something two years ago. I can declare an array of
| some elements and make access to elements statically
| typed. Same with maps, sets and others.
|
| [1] https://github.com/msoos/yalsat/blob/main/yals.c#L49
| Piezoid wrote:
| Code reuse is achievable by (mis)using the preprocessor
| system. It is possible to build a somewhat usable API, even
| for intrusive data structures. (eg. the linux kernel and
| klib[1])
|
| I do agree that generics are required for modern programming,
| but for some, the cost of complexity of modern languages
| (compared to C) and the importance of compatibility seem to
| outweigh the benefits.
|
| [1]: http://attractivechaos.github.io/klib
| MobiusHorizons wrote:
| It looks like it needs to be interoperable with the rest of their
| codebase which was already written in C
|
| > The new PDF interpreter is written entirely in C, but
| interfaces to the same underlying graphics library as the
| existing PostScript interpreter. So operations in PDF should
| render exactly the same as they always have (this is affected
| slightly by differing numerical accuracy), all the same devices
| that are currently supported by the Ghostscript family, and any
| new ones in the future should work seamlessly.
| Sytten wrote:
| That is not an argument at least for rust since it's super
| easy to consume and offer a C interface. I think it's more of
| a shift in mentality that needs to occur.
| MobiusHorizons wrote:
| while it doesn't prevent rust from being used, it is still
| a hurdle which must be overcome. Building and maintaining a
| multi-language build system has significant costs,
| especially with a project with as much history and wide use
| as ghostscript.
| dfox wrote:
| It is so easy and well documented that first page of google
| results for "rust autotools" does not contain anything
| about how to integrate rust code into existing autotools
| project.
|
| Another issue is general subtle brokenness of rust tooling
| on anything that is not linux on amd64.
| asdff wrote:
| I don't even actively code with rust but just from the fact
| that it's been packaged as a dependency has been enough of a
| headache for me. The latest issue is with some homebrew package
| that has rust as a dependency. It turns out on macos mojave
| rust needs to be built from source since there is no bottle. I
| let it build for a full day and it still didn't finish
| building, so I gave up. Then I installed rust independently
| with rustup and successfully linked that install to brew, which
| nearly worked, but failed with the cryptic "rustup could not
| choose a version of cargo to run..." error that I can't make
| any sense of, because the solution it gave for that error, to
| download the latest stable release and set it as your toolchain
| with 'rustup default stable', didn't do anything because that
| was already done. The real salt in the wound is modern google
| search bringing up nothing relevant.
| neilv wrote:
| Years back, I raised how evolved Ghostscript had been over a very
| long time, together with the huge complexity of the PDF specs, as
| a potential source of vulnerabilities.
|
| (But maybe it wasn't as much on people's radars, with all the
| lower-hanging fruit of other technology choices and practices going
| on, outside of PDF.)
|
| New code for a large spec is also interesting for potential
| vulns, but maybe easier to get confidence about.
|
| One neat direction they could go is to be considered more
| trustworthy than the Adobe products. For example, if one is
| thinking of a PDF engine as (among other purposes) supporting the
| use case of a PDF viewer that's an agent of the interests of that
| individual human user, then I suspect you're going to end up with
| different attention and decisions affecting security (compared to
| implementations from businesses focused on other goals).
|
| (I say agent of the individual user, but that can also be aligned
| with enterprise security, as an alternative to risk management
| approaches that, e.g., ultimately will decide they're relying on
| gorillas not to make it through the winter.)
| asdff wrote:
| Is there any work in this space on some oddball "contamination
| protocol" type of security? Like you would assume everything is
| contaminated and you do things that eliminate the potential for
| cross contamination entirely, like they do in lab settings with
| aseptic technique. In this case, it could mean printing out the
| contaminated pdf on a system you don't care about being
| contaminated, then scanning it with an airgapped scanner to
| recover a 'sterile' pdf. It seems convoluted but I'm sure for
| some applications that could be a good solution that requires
| no improvement to pdf protocol.
| neilv wrote:
| I've heard of measures like that, including for the _other_
| direction (i.e., redacting documents without leaking
| information in the effectively opaque PDF format).
|
| IMHO, having well-engineered tools handle data, and being
| conservative about the trust/privileges given externally-
| sourced data is at least complementary to the current "zero
| trust" thinking among networks and nodes.
|
| (Example: Does your spreadsheet really need arbitrary code
| execution, in an imperfect sandbox, for all your nontechnical
| users? Should what people might think is a self-contained
| standalone text document file really phone home, to disclose
| your activity and location, or have the potential to be
| remotely memory-holed/disabled, along with attendant added
| security risks from that added complexity and the additional
| requirements it puts on host systems/tools to try to enforce
| that questionable design?)
| woodruffw wrote:
| DARPA is funding fundamental research in this space,
| specifically through programs like SafeDocs[1].
|
| [1]: https://www.darpa.mil/program/safe-documents
| aidos wrote:
| Does anyone know much about the Artifex team? How big it is etc?
|
| They seem to be the kings of working with PDFs. I've not really
| looked at the Ghostscript code (and I'm surprised to hear their
| interpreter was still in postscript), but I've looked through the
| mupdf code and what I saw was really nice.
|
| In any case, I appreciate the work they've done in providing
| fantastic tools to the world for decades now.
| petilon wrote:
| I don't know the current team, but I have met its founder: L.
| Peter Deutsch [1].
|
| James Gosling, inventor of Java, once described him as the
| "greatest programmer in the world". They both used to work at
| Sun Microsystems.
|
| [1] https://en.wikipedia.org/wiki/L._Peter_Deutsch
| skemper911 wrote:
| Three of the greatest programmers I've experienced worked
| there, Peter, Tor, Raph. Hats off.
| madmoose wrote:
| Strangely this appears to be a new implementation not based on
| MuPDF, so Artifex now has two implementations of a PDF
| interpreter.
|
| I wonder what made them decide to reimplement it instead of
| reusing their existing code.
| toddm wrote:
| Ghostscript (well, gv) got me through the 1990s and beyond as
| part of my TeX -> dvips -> gv workflow.
|
| Kudos and thank you to those who maintain it and the associated
| packages!
| lordfosco wrote:
| Most important part of the announcement - you can still revert
| back to the former interpreter by setting the `-dNEWPDF=false`
| flag.
|
| While progress is always nice to see - I am also pleased that we
| don't necessarily need to update all the scripts that depend on
| ghostscript at once but can keep them running in their current
| state.
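For scripts that cannot be updated all at once, one option is to try the default interpreter first and retry with the old one only when the first attempt fails. The sketch below assumes a Ghostscript release that still accepts the `-dNEWPDF=false` switch quoted above. `gs_pdfwrite_argv` and `convert_with_fallback` are hypothetical helper names, and the fallback only triggers on a non-zero exit status, so it would not catch output that is silently wrong.

```python
import subprocess


def gs_pdfwrite_argv(src, dst, old_interpreter=False, gs="gs"):
    """Build a Ghostscript argument list that rewrites src into dst."""
    argv = [gs, "-dSAFER", "-sDEVICE=pdfwrite", "-o", dst]
    if old_interpreter:
        # Switch quoted in the thread; no space between -d and the name.
        argv.append("-dNEWPDF=false")
    argv.append(src)  # input file goes after all switches
    return argv


def convert_with_fallback(src, dst, run=subprocess.run):
    """Try the default interpreter, then retry with the old one.

    Returns "new" or "old" depending on which attempt exited with 0.
    The run parameter exists only so a fake runner can be injected.
    """
    for label, old in (("new", False), ("old", True)):
        result = run(gs_pdfwrite_argv(src, dst, old_interpreter=old))
        if result.returncode == 0:
            return label
    raise RuntimeError(f"both interpreters failed on {src!r}")
```

Logging which files needed the "old" path gives a ready-made list of candidates for bug reports.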
| ris wrote:
| It's particularly fun for them to introduce this in a point
| release. If this didn't warrant a major version bump I'm
| frankly not sure what would.
| mkl wrote:
| > As time has gone on, and we have encountered more and more PDF
| files with ever more unexpected deviations from the specification
|
| Does anyone know of a collection of malformed PDF files? It would
| be useful for testing PDF processing programs.
| mdaniel wrote:
| I wasn't able to readily find any collections, and searching
| for anything plus the keyword "pdf" returns links to articles
| _written in_ pdf
|
| That said, this GitHub topic may have some pointers:
| https://github.com/topics/malware-samples
| svat wrote:
| There are some here, as test files in the qpdf library:
| https://github.com/qpdf/qpdf/tree/main/qpdf/qtest/qpdf
|
| (But still, note: A couple of months ago I wrote a low-level
| PDF parser--just parse the PDF file's bytes into PDF objects,
| nothing more--and fed it all the PDF files that happened to be
| present on my laptop, and ran into some files that (some) PDF
| viewers open, but even qpdf doesn't. I say "even" because qpdf
| is really good IMO.)
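A cheap way to triage a pile of local PDF files, in the spirit of the experiment above, is to check a few surface markers before handing the file to a real parser. This is only a sketch: `quick_pdf_problems` is a hypothetical helper, the 1024-byte windows are an assumption modelled on the leniency commonly attributed to PDF viewers, and an empty result does not mean the file is valid.

```python
import re

HEADER = re.compile(rb"%PDF-\d\.\d")
STARTXREF = re.compile(rb"startxref\s+(\d+)")


def quick_pdf_problems(data):
    """Return a list of cheap structural complaints about PDF bytes.

    An empty list only means the surface markers are where a strict
    reader would expect them; the objects themselves are not parsed.
    """
    problems = []
    header = HEADER.search(data[:1024])
    if header is None:
        problems.append("no %PDF-x.y header in the first 1024 bytes")
    elif header.start() != 0:
        problems.append(f"{header.start()} junk bytes before the header")

    tail = data[-1024:]
    if b"%%EOF" not in tail:
        problems.append("no %%EOF marker near the end")

    offsets = STARTXREF.findall(tail)
    if not offsets:
        problems.append("no startxref entry near the end")
    elif int(offsets[-1]) >= len(data):
        # The last startxref wins when a file has incremental updates.
        problems.append("startxref offset points past the end of the file")
    return problems
```

Running it over something like `pathlib.Path.home().rglob("*.pdf")` and keeping the files with a non-empty result yields a small personal corpus of deviant files.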
| vfclists wrote:
| Using C sounds like it will bring a whole new list of exploits
| with it.
|
| Not good!!
| vodou wrote:
| C is not inherently unsafe. Sure, it hasn't "memory safety" as
| a feature. But there are loads of applications considered safe
| written in C. An experienced C programmer (with the help of
| tooling) can write safe C code. It is not impossible.
| c7DJTLrn wrote:
| That would explain all the vulnerabilities in systemd and
| Linux. They just aren't experienced enough. Linus needs to
| get in touch with an expert.
| tinus_hn wrote:
| I'm looking forward to your efforts in rewriting it in Rust
| tptacek wrote:
| So is everyone else! Can't happen soon enough.
| vfclists wrote:
| I guess "experienced C programmers" must be in short supply
| although they have been writing C for years.
| jcranmer wrote:
| SQLite is the most stringently developed C code I'm aware of
| --the test suite maintains 100% branch coverage, routinely
| run through all of the sanitizers, and it is regularly
| fuzzed.
|
| It _still_ accumulates CVEs:
| https://www.sqlite.org/cves.html.
| vodou wrote:
| Are you aware of a way to develop fault free code? Please
| share this knowledge then, please.
| jcranmer wrote:
| It's easy to develop fault-free code: just redefine all
| those faults as (undocumented) features!
|
| That's not a helpful answer, but it's basically the same
| thing you're doing--redefining memory safety
| vulnerabilities that would be precluded entirely by
| writing in memory-safe languages as programmer faults.
| tptacek wrote:
| He's aware of a way to develop memory-corruption-fault
| free code, obviously.
| WesolyKubeczek wrote:
| Of course, let's better use a PostScript interpreter also
| written in C, so your exploits leveraging both at least look
| like art.
| midislack wrote:
| Stop this.
| kisamoto wrote:
| Not sure why this is being posted now as this is from March...
|
| But anyway - I understand why they have changed their interpreter;
| however, the lack of a major version bump threw me off. I use ps2pdf
| to optimize pdfs (long story short - makes their size smaller)
| and was alarmed when my pdfs suddenly ended up without the jpeg
| backgrounds. Instead, purely black (although this did result in a
| very small file size so who knows... :) )
|
| Thankfully you can add `-dNEWPDF=false` to your command to use
| the old parser. I'm yet to submit a bug report but it would be
| nice if it was backwards compatible...
___________________________________________________________________
(page generated 2022-07-31 23:00 UTC)