[HN Gopher] Show HN: IPA, a GUI for exploring inner details of PDFs
       ___________________________________________________________________
        
       Show HN: IPA, a GUI for exploring inner details of PDFs
        
       Author : nicolodev
       Score  : 202 points
       Date   : 2024-08-28 10:22 UTC (12 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | svat wrote:
       | This is cool!
       | 
       | Here are some other similar(?) tools, for seeing the inner
       | contents of a PDF file (the raw objects etc), but I haven't
       | compared them to this tool here:
       | 
       | - https://pdf.hyzyla.dev/
       | 
       | - https://github.com/itext/i7j-rups (java -jar ~/Downloads/itext-
       | rups-7.2.5.jar)
       | 
       | - https://github.com/desgeeko/pdfsyntax (python3 -m pdfsyntax
       | inspect foo.pdf > output.html)
       | 
       | - https://github.com/trailofbits/polyfile (polyfile --html
       | output.html foo.pdf)
       | 
       | - https://www.reportmill.com/snaptea/PDFViewer/ =
       | https://www.reportmill.com/snaptea/PDFViewer/pviewer.html (drag
       | PDF onto it)
       | 
       | - https://sourceforge.net/projects/pdfinspector/ (an "example" of
       | https://superficial.sourceforge.net/)
       | 
       | - https://www.o2sol.com/pdfxplorer/overview.htm
       | 
       | More?
        
         | mananaysiempre wrote:
         | The venerable PDFedit[1] more or less forces you to confront
         | the internal structure of the PDF file as well.
         | 
         | [1] http://pdfedit.cz/en/index.html
        
         | nicolodev wrote:
         | Thanks for the list. The idea behind my tool was to build
         | something that suits an analyst who wants to take a quick look
         | at a PDF. I'm also trying to figure out some fast heuristics
         | to mark/highlight peculiar things in the file itself.
         | 
         | As for the tools you mentioned, I haven't checked out all of
         | them, but some of them are interesting (and more mature in
         | terms of testing and compatibility). However, some (at least
         | the ones I tried) are very basic, and they don't offer "Save
         | object as..." or let you uncompress it. I like the feature of
         | displaying the PDF for preview :)
        
         | whizzter wrote:
         | Sweet, currently working on PDF signature stuff so I'm sure
         | I'll find some stuff handy :)
        
         | desgeeko wrote:
         | I am the author of PDFSyntax, thanks for mentioning it!
         | 
         | The HTML output is like a pretty print where you can view
         | objects and follow links to other objects.
         | 
         | Since then I have added a new command (disasm) that is
         | CLI-oriented and displays a greppable summary of the
         | structure. Here is an
         | explanation:
         | https://github.com/desgeeko/pdfsyntax/blob/main/docs/disasse...
        
         | aidos wrote:
         | Mutool is the one I suggest to people. The easiest way to
         | understand a PDF is to decompress it and then just read the
         | contents.
         | 
         |     mutool clean -d in.pdf out.pdf
         | 
         | At that point you'll realise that a PDF is mostly just a list
         | of objects and that those objects can reference each other.
         | After that you'll journey through the spec understanding what
         | each type of object does and what the fields in it control. The
         | graphics stream itself is just a stack-based coordinate
         | drawing system that's easy to follow too.
         | 
         | By way of example, here's an object that represents a Page.
         | You can see the dimensions in the MediaBox. The contents
         | themselves are contained at object "9 0 obj" ("9 0 R" is the
         | pointer to it):
         | 
         |     2 0 obj
         |     <<
         |       /Type /Page
         |       /MediaBox [ 0 0 612 792 ]
         |       /Contents 9 0 R
         |     >>
         |     endobj
         | 
         | Meanwhile "9 0 obj" has the drawing instructions. They seem a
         | little weird at first glance but you see the values ".23999999
         | 0 0 -.23999999 0 792" each get pushed on the stack and then
         | "cm" pops them to interpret them as the transformation matrix.
         | 
         |     9 0 obj
         |     <<
         |       /Length 18266
         |     >>
         |     stream
         |     .23999999 0 0 -.23999999 0 792 cm
         |     q
         |     0 0 2551 3301 re
         |     ...
         | 
         | The depth and detail of all of the different possible things
         | that can be represented in a PDF is insane. But understanding
         | the structure above is all you need to begin your journey!
         | 
         | EDIT The rest of your journey is contained in this epic
         | document:
         | https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
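
The operand-stack behaviour described in the comment above can be sketched in a few lines of Python. This is a toy interpreter under stated assumptions, not how any real PDF library is structured: the names `run_content_stream` and `apply_ctm` are hypothetical, only numbers and the `cm`, `q` and `Q` operators are understood, and every other operator (such as `re`) simply discards its operands.

```python
IDENTITY = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]


def multiply(m, n):
    """Multiply two PDF matrices stored as [a, b, c, d, e, f] (row-vector form)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


def run_content_stream(text):
    """Return the current transformation matrix after running `text`."""
    ctm = list(IDENTITY)
    saved = []      # graphics state stack used by q / Q
    operands = []   # numbers waiting for the next operator
    for token in text.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass  # not a number, so treat the token as an operator
        if token == "cm" and len(operands) >= 6:
            # cm consumes six numbers and pre-multiplies them onto the CTM
            ctm = multiply(operands[-6:], ctm)
        elif token == "q":
            saved.append(list(ctm))
        elif token == "Q" and saved:
            ctm = saved.pop()
        operands.clear()  # every operator consumes its operands
    return ctm


def apply_ctm(ctm, x, y):
    """Map a point from user space through the matrix."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


ctm = run_content_stream(".23999999 0 0 -.23999999 0 792 cm q 0 0 2551 3301 re")
print(apply_ctm(ctm, 0, 0))
print(apply_ctm(ctm, 2551, 3301))
```

With the stream from the comment, the far corner of the 2551 x 3301 rectangle lands at roughly (612, 0) and the origin at (0, 792), which lines up with the 612 x 792 MediaBox: the matrix scales the drawing down and flips the Y axis.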
        
           | nicolodev wrote:
           | > mutool clean -d in.pdf out.pdf
           | 
           | My tool can do exactly the same (viewing the internal
           | structure, exporting objects, and seeing the uncompressed raw
           | content of a stream) with a graphical interface and without
           | all these flags (which is one of the reasons I started to
           | design this project with egui), but thanks for posting yours
           | too.
        
         | richardw wrote:
         | Recommend just letting people have their one day in the sun.
         | We've become less a site of builders than a red team for
         | testing your launch.
        
           | nicolodev wrote:
           | Yeah, I agree. Everyone is suggesting tools which are really
           | good, but I designed mine to get rid of the flags and the
           | CLI interface. Good for tech people who can remember flags;
           | I'm not one of them :(
        
       | AlanYx wrote:
       | Does anyone have any recommendations for a good tool that allows
       | both programmatic inspection and modification of PDF primitives?
       | For example, let's say someone wants to iterate through every
       | embedded image in a PDF and apply some form of signal processing
       | to the images in-place, then re-save the PDF?
        
         | nicolodev wrote:
         | I'd suggest coding something on top of one of the popular
         | libraries for PDF manipulation. I've used pdf-rs for the tool.
        
         | mananaysiempre wrote:
         | I've used pikepdf[1] for text processing before. To use it for
         | the task you outline, you'll probably need to thoroughly
         | investigate how bitmaps can be represented in PDFs. (Or maybe
         | not, if you only need to deal with a known finite set of PDFs
         | or PDF producers.)
         | 
         | [1] https://pikepdf.readthedocs.io/en/latest/
        
         | desgeeko wrote:
         | My tool (PDFSyntax[1], mentioned in this thread) is a Python
         | library that is able to both inspect and transform PDF files.
         | 
         | Depending on your transformation use case, you may write an
         | incremental update with only a few bytes at the end of the
         | original file instead of rewriting it entirely. To my knowledge
         | this feature of the PDF specification is often overlooked and
         | not a lot of libraries implement it.
         | 
         | It is a work in progress and I have not developed functions for
         | images yet, though.
         | 
         | [1] https://github.com/desgeeko/pdfsyntax
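
Because an incremental update only appends to the original bytes, each saved revision normally ends with its own `%%EOF` marker. The sketch below uses that to list where each revision ends. It is a rough heuristic rather than a parser, and it is not how PDFSyntax works internally: the function name `revision_ends` is hypothetical, and a `%%EOF` byte sequence inside a stream would produce a false positive.

```python
def revision_ends(data: bytes) -> list:
    """Return the byte offset just past each %%EOF marker in `data`."""
    marker = b"%%EOF"
    ends = []
    pos = data.find(marker)
    while pos != -1:
        ends.append(pos + len(marker))
        pos = data.find(marker, pos + len(marker))
    return ends


# Synthetic stand-ins for a PDF and an appended update (not valid PDFs).
original = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\ntrailer\n<< >>\n%%EOF\n"
update = b"1 0 obj\n<< /Changed true >>\nendobj\ntrailer\n<< >>\n%%EOF\n"
updated = original + update

ends = revision_ends(updated)
print(len(ends), "revision(s) found")
# Slicing at the first marker recovers the file as it was before the update.
first_revision = updated[:ends[0]]
```

Slicing the file at an earlier marker yields the bytes of that earlier revision, which is why appended updates leave the original content recoverable.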
        
         | verdverm wrote:
         | I've been using several Python libraries for working with PDFs.
         | At least one of them allows you to walk the AST. (will look up
         | in a bit and edit this comment)
        
       | nbenitezl wrote:
       | For exploring the internals of a PDF you also have RUPS[1], which
       | is open source and easily installed on Linux through flathub[2].
       | 
       | [1] https://itextpdf.com/products/rups
       | 
       | [2] https://flathub.org/apps/com.itextpdf.RUPS
        
         | nicolodev wrote:
         | Thanks, it seems like a great product too :) Is there a
         | particular feature that made you recommend it?
        
       | AdmVonSchneider wrote:
       | Back at zynamics, we used to sell PDF Dissector:
       | 
       | https://web.archive.org/web/20110902114238/http://www.zynami...
       | 
       | We never got around to open sourcing it, so I'm happy to see that
       | there is work being done in this space.
       | 
       | Congrats to seekbytes for releasing this!
        
         | nicolodev wrote:
         | :D Well, I'm sure that half of the reverse engineering
         | community needs to thank you and Zynamics for your important
         | contributions to static analysis tools. I'll take the occasion
         | to thank you for being an inspiration with such awesome tools
         | as BinNavi, BinDiff, and ultimately PDF Dissector. When I read
         | that it had been discontinued, I had the idea and started to
         | think about something focused on analysis, applying some
         | approaches we've already seen in binary analysis tools.
        
       | jeffreportmill1 wrote:
       | Great work! I'm sorry to be another jerk posting a link to
       | something similar, but here is my solution, running in the
       | browser (just drag and drop your PDF in):
       | 
       | https://reportmill.com/snaptea/PDFViewer/
        
         | nicolodev wrote:
         | Nice! My tool should be runnable in the browser thanks to wasm
         | compatibility with Rust + egui :) Btw I've just tried it, and
         | it's a little bit buggy in Safari with a 504kb PDF (lots of
         | objects though). Apart from that, is there a way to export the
         | raw stream? Is there a reason you print all the raw streams
         | as text?
        
       | geekodour wrote:
       | What's a good tool to check that a PDF has not been tampered
       | with, e.g. to check before loading a PDF from a public bucket
       | into your backend application?
        
         | criddell wrote:
         | If you sign the file, you should be able to verify that the
         | signature still matches the file.
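
One way to read the suggestion above is a detached signature over the whole file. The sketch below uses an HMAC from Python's standard library as a stand-in; this is an assumption, not an embedded PDF digital signature. The helper names `sign_bytes` and `verify_bytes` and the hard-coded key are hypothetical, and a real deployment would need proper key storage.

```python
import hashlib
import hmac


def sign_bytes(key: bytes, data: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the full file contents."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify_bytes(key: bytes, data: bytes, tag: str) -> bool:
    """Check the tag in constant time; False means the bytes changed."""
    return hmac.compare_digest(sign_bytes(key, data), tag)


# Hypothetical usage: tag the file at upload time, check it before loading.
key = b"example-key-kept-outside-the-bucket"
pdf_bytes = b"%PDF-1.7\n... file contents ...\n%%EOF\n"
tag = sign_bytes(key, pdf_bytes)

print(verify_bytes(key, pdf_bytes, tag))
print(verify_bytes(key, pdf_bytes + b"appended", tag))
```

This only shows that the bytes are unchanged since they were tagged; it says nothing about whether the PDF was safe to begin with.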
        
         | remram wrote:
         | How could a PDF be tampered with in your own bucket?
        
           | verdverm wrote:
           | Sounds like they may be accepting user PDFs, saving them to
           | a bucket, and then doing processing after.
        
       | jerknextdoor wrote:
       | I was curious to try this out as it might actually solve a minor
       | problem of mine right now, but it crashed as soon as I tried to
       | open a PDF.
       | 
       | Installed from git using cargo 1.80.1 on Ubuntu 22.04 on an AMD
       | Framework laptop if that's of any help.
        
         | nicolodev wrote:
         | Argh, that's too bad. Feel free to open an issue. What's
         | happening in the console? It's panicking, isn't it? Feel free
         | to contact me via email if you prefer.
        
       | ZoomZoomZoom wrote:
       | I recently wanted to edit out a huge background image repeating
       | on almost every page of a PDF and found out there's no obvious
       | way to do it.
       | 
       | Would appreciate any tool suggestions!
        
         | dr_kiszonka wrote:
         | You could try one of Adobe's PDF APIs or script their software
         | locally.
        
         | darknavi wrote:
         | If you're OK doing it manually (not scripted), Inkscape can do
         | this.
        
         | darby_nine wrote:
         | I've had good experience with pypdf, if you're willing to do a
         | little coding.
        
       | giancarlostoro wrote:
       | This looks nice, and I didn't know about eGUI which looks like it
       | runs on the web. Very interesting.
       | 
       | https://www.egui.rs/
        
         | nicolodev wrote:
         | Thanks! The immediate mode paradigm might be a little bit
         | scary if you're used to Qt, but it looks easy to manage and
         | it's really interactive.
        
       ___________________________________________________________________
       (page generated 2024-08-28 23:00 UTC)