[HN Gopher] Show HN: IPA, a GUI for exploring inner details of PDFs
___________________________________________________________________
Show HN: IPA, a GUI for exploring inner details of PDFs
Author : nicolodev
Score : 202 points
Date : 2024-08-28 10:22 UTC (12 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| svat wrote:
| This is cool!
|
| Here are some other similar(?) tools, for seeing the inner
| contents of a PDF file (the raw objects etc), but I haven't
| compared them to this tool here:
|
| - https://pdf.hyzyla.dev/
|
| - https://github.com/itext/i7j-rups (java -jar
| ~/Downloads/itext-rups-7.2.5.jar)
|
| - https://github.com/desgeeko/pdfsyntax (python3 -m pdfsyntax
| inspect foo.pdf > output.html)
|
| - https://github.com/trailofbits/polyfile (polyfile --html
| output.html foo.pdf)
|
| - https://www.reportmill.com/snaptea/PDFViewer/ =
| https://www.reportmill.com/snaptea/PDFViewer/pviewer.html (drag
| PDF onto it)
|
| - https://sourceforge.net/projects/pdfinspector/ (an "example" of
| https://superficial.sourceforge.net/)
|
| - https://www.o2sol.com/pdfxplorer/overview.htm
|
| More?
| mananaysiempre wrote:
| The venerable PDFedit[1] more or less forces you to confront
| the internal structure of the PDF file as well.
|
| [1] http://pdfedit.cz/en/index.html
| nicolodev wrote:
| Thanks for the list. The idea behind my tool was to build
| something that suits an analyst who wants to take a quick look
| at a PDF. I'm also trying to figure out some fast heuristics
| to mark/highlight peculiar things in the file itself.
|
| As for the tools you mentioned, I haven't checked out
| all of them, but some of them are interesting (and more mature
| in terms of testing and compatibility). However, some (at least
| the ones I tried) are very basic and don't let you
| "Save object as..." or uncompress it. I like the feature of
| displaying the PDF for preview :)
| whizzter wrote:
| Sweet, currently working on PDF signature stuff so I'm sure
| I'll find some stuff handy :)
| desgeeko wrote:
| I am the author of PDFSyntax, thanks for mentioning it!
|
| The HTML output is like a pretty-printed view where you can read
| objects and follow links to other objects.
|
| Since then I have added a new command (disasm) that is CLI-oriented
| and displays a greppable summary of the structure. Here is an
| explanation:
| https://github.com/desgeeko/pdfsyntax/blob/main/docs/disasse...
| aidos wrote:
| Mutool is the one I suggest to people. The easiest way to
| understand a PDF is to decompress it and then just read the
| contents. mutool clean -d in.pdf out.pdf
|
| At that point you'll realise that a PDF is mostly just a list
| of objects and that those objects can reference each other.
| After that you'll journey through the spec understanding what
| each type of object does and what the fields in it control. The
| graphics stream itself is just a stack-based coordinate
| drawing system that's easy to follow too.
|
| By way of an example. Here's an object that represents a Page.
| You can see the dimensions in the MediaBox. The contents
| themselves are contained at object "9 0 obj" ("9 0 R" is the
| pointer to it):
|
|     2 0 obj
|     <<
|       /Type /Page
|       /MediaBox [ 0 0 612 792 ]
|       /Contents 9 0 R
|     >>
|     endobj
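|
| To make the "objects that reference each other" idea concrete,
| here is a minimal Python sketch that scans decompressed PDF text
| for "N G obj" definitions and the "N G R" references inside them.
| The function name object_references and the regex approach are
| assumptions made for illustration. It is a heuristic that ignores
| binary streams and object streams, not a real PDF parser.

```python
import re

# The page object from the comment, as it might appear in a decompressed PDF.
SAMPLE = b"""2 0 obj
<<
  /Type /Page
  /MediaBox [ 0 0 612 792 ]
  /Contents 9 0 R
>>
endobj
"""

# "N G obj ... endobj" defines an indirect object; "N G R" refers to one.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def object_references(data: bytes) -> dict:
    """Map each (number, generation) object to the objects it points at."""
    graph = {}
    for num, gen, body in OBJ_RE.findall(data):
        refs = [(int(n), int(g)) for n, g in REF_RE.findall(body)]
        graph[(int(num), int(gen))] = refs
    return graph


print(object_references(SAMPLE))  # {(2, 0): [(9, 0)]}
```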
|
| Meanwhile "9 0 obj" has the drawing instructions. They seem a
| little weird at first glance but you see the values ".23999999
| 0 0 -.23999999 0 792" each get pushed on the stack and then
| "cm" pops them to interpret them as the transformation matrix.
|
|     9 0 obj
|     << /Length 18266 >>
|     stream
|     .23999999 0 0 -.23999999 0 792 cm
|     q
|     0 0 2551 3301 re
|     ...
|
| The depth and detail of all of the different possible things
| that can be represented in a PDF is insane. But understanding
| the structure above is all you need to begin your journey!
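|
| As a rough sketch of that operand stack, the toy Python interpreter
| below handles only numbers plus the "cm" and "re" operators and
| skips everything else (including "q"). The helper names multiply,
| apply and run are hypothetical, and a real content stream needs a
| proper tokenizer and graphics state.

```python
def multiply(m, n):
    """Return m x n for PDF matrices stored as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply(m, x, y):
    """Transform the point (x, y) by the matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def run(content):
    """Interpret numbers, cm and re; return the transformed rectangle corners."""
    stack = []
    ctm = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)  # identity matrix
    rects = []
    for token in content.split():
        try:
            stack.append(float(token))  # operands are pushed on the stack
            continue
        except ValueError:
            pass  # not a number, so treat the token as an operator
        if token == "cm":
            # The six operands are prepended to the current matrix.
            ctm = multiply(tuple(stack[-6:]), ctm)
        elif token == "re":
            x, y, w, h = stack[-4:]
            rects.append((apply(ctm, x, y), apply(ctm, x + w, y + h)))
        stack.clear()  # every operator consumes its operands
    return rects


STREAM = ".23999999 0 0 -.23999999 0 792 cm q 0 0 2551 3301 re"
for corner_a, corner_b in run(STREAM):
    print(corner_a, corner_b)
```

| The rectangle lands at roughly (0, 792) to (612.24, -0.24), i.e.
| it covers the 612 x 792 MediaBox with the y axis flipped. That
| would be consistent with a 300 dpi image on a US Letter page
| (72 / 300 = 0.24), though that reading is a guess.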
|
| EDIT The rest of your journey is contained in this epic
| document: https://opensource.adobe.com/dc-acrobat-sdk-
| docs/pdfstandard...
| nicolodev wrote:
| > mutool clean -d in.pdf out.pdf
|
| My tool can do exactly the same (view the internal structure,
| export objects, and show the uncompressed raw content of
| streams) with a graphical interface and without all these
| flags (which is one of the reasons I started to design this
| project with egui), but thanks for posting yours too.
| richardw wrote:
| Recommend just letting people have their one day in the sun.
| We've become less a site of builders than a red team for
| testing your launch.
| nicolodev wrote:
| Yeah, I agree. Everyone is suggesting tools that are really
| good, but I designed mine to get rid of the flags and the
| CLI. That's fine for tech people who can keep
| remembering flags; I can't :(
| AlanYx wrote:
| Does anyone have any recommendations for a good tool that allows
| both programmatic inspection and modification of PDF primitives?
| For example, let's say someone wants to iterate through every
| embedded image in a PDF and apply some form of signal processing
| to the images in-place, then re-save the PDF?
| nicolodev wrote:
| I'd suggest coding something on top of one of the popular
| libraries for PDF manipulation. I used pdf-rs for this tool.
| mananaysiempre wrote:
| I've used pikepdf[1] for text processing before. To use it for
| the task you outline, you'll probably need to thoroughly
| investigate how bitmaps can be represented in PDFs. (Or maybe
| not, if you only need to deal with a known finite set of PDFs
| or PDF producers.)
|
| [1] https://pikepdf.readthedocs.io/en/latest/
| desgeeko wrote:
| My tool (PDFSyntax[1], mentioned in this thread) is a Python
| library that is able to both inspect and transform PDF files.
|
| Depending on your transformation use case, you may write an
| incremental update with only a few bytes at the end of the
| original file instead of rewriting it entirely. To my knowledge
| this feature of the PDF specification is often overlooked and
| not many libraries implement it.
|
| It is a work in progress and I have not developed functions for
| images yet, though.
|
| [1] https://github.com/desgeeko/pdfsyntax
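|
| For readers unfamiliar with incremental updates, here is a minimal
| Python sketch of the idea using only the standard library. It does
| not use PDFSyntax. The functions minimal_pdf and append_update are
| hypothetical, and the sketch assumes a classic cross-reference
| table (not an xref stream), generation 0, and a file that has not
| been linearized or encrypted. It has not been validated against
| the specification in detail.

```python
import re


def minimal_pdf() -> bytes:
    """Build a one-page PDF with a classic cross-reference table."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [ 3 0 R ] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [ 0 0 612 792 ] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the free object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is 20 bytes long
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def append_update(pdf: bytes, num: int, body: bytes) -> bytes:
    """Append a new version of object `num`; existing bytes stay untouched."""
    prev_xref = int(re.findall(rb"startxref\s+(\d+)", pdf)[-1])
    trailer = pdf[pdf.rfind(b"trailer"):]
    size = int(re.search(rb"/Size\s+(\d+)", trailer).group(1))
    root = re.search(rb"/Root\s+(\d+\s+\d+\s+R)", trailer).group(1)

    out = bytearray(pdf)
    if not out.endswith(b"\n"):
        out += b"\n"
    obj_pos = len(out)
    out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    # A single subsection that lists only the replaced object.
    out += b"xref\n%d 1\n%010d 00000 n \n" % (num, obj_pos)
    # /Prev chains this section to the previous cross-reference table.
    out += b"trailer\n<< /Size %d /Root %s /Prev %d >>\n" % (size, root, prev_xref)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


original = minimal_pdf()
new_page = b"<< /Type /Page /Parent 2 0 R /MediaBox [ 0 0 595 842 ] >>"
updated = append_update(original, 3, new_page)
print(len(original), len(updated))
```

| The original bytes are kept verbatim as a prefix, and a reader
| that follows the last startxref sees the new version of object 3
| first.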
| verdverm wrote:
| I've been using several Python libraries for working with PDFs.
| At least one of them allows you to walk the AST. (will look up
| in a bit and edit this comment)
| nbenitezl wrote:
| For exploring the internals of a PDF you also have RUPS[1], which is
| open source and easily installed on Linux through flathub[2].
|
| [1] https://itextpdf.com/products/rups
|
| [2] https://flathub.org/apps/com.itextpdf.RUPS
| nicolodev wrote:
| Thanks, it seems like a great product too :) Is there a
| particular feature that made you share it?
| AdmVonSchneider wrote:
| Back at zynamics, we used to sell PDF Dissector:
|
| https://web.archive.org/web/20110902114238/http://www.zynami...
|
| We never got around to open sourcing it, so I'm happy to see that
| there is work being done in this space.
|
| Congrats to seekbytes for releasing this!
| nicolodev wrote:
| :D Well, I'm sure that half of the reverse engineering community
| needs to thank you and Zynamics for your important contributions
| to static analysis tooling. I'll take the occasion to thank
| you for being an inspiration with such awesome tools as
| BinNavi, BinDiff, and ultimately PDF Dissector. When I
| read that it had been discontinued, I got the idea and
| started to think about something focused on analysis that
| applies some approaches we've already seen in binary
| analysis tools.
| jeffreportmill1 wrote:
| Great work! I'm sorry to be another jerk posting a link to
| something similar, but here is my solution, running in the
| browser (just drag and drop your PDF in):
|
| https://reportmill.com/snaptea/PDFViewer/
| nicolodev wrote:
| Nice! My tool should be runnable in the browser thanks to wasm
| compatibility with Rust + egui :) Btw I've just tried it, and
| it's a little bit buggy in Safari with a 504kb PDF (lots of
| objects though). Apart from that, is there a way to export the
| raw stream? Is there any reason why you print all the raw
| streams as text?
| geekodour wrote:
| What's a good tool to check that a PDF has not been tampered with,
| e.g. as a check before loading a PDF from a public bucket into your
| backend application?
| criddell wrote:
| If you sign the file, you should be able to verify that the
| signature still matches the file.
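|
| One simple way to do this for files you stored yourself is a
| detached HMAC tag kept alongside the object, sketched below in
| Python. This is an assumption about the setup, not an embedded PDF
| digital signature, and it only detects changes made after the tag
| was computed. The key, the sample bytes and the function names
| sign and verify are hypothetical.

```python
import hashlib
import hmac


def sign(data: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for the file contents."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, tag: str) -> bool:
    """Check the tag in constant time before trusting the file."""
    return hmac.compare_digest(sign(data, key), tag)


key = b"hypothetical-secret-kept-out-of-the-bucket"
pdf_bytes = b"%PDF-1.4 ... placeholder contents ..."

tag = sign(pdf_bytes, key)  # store this next to the uploaded object
print(verify(pdf_bytes, key, tag))         # unchanged file
print(verify(pdf_bytes + b"x", key, tag))  # modified file
```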
| remram wrote:
| How could a PDF be tampered with in your own bucket?
| verdverm wrote:
| Sounds like they may be accepting user PDFs, saving them to
| a bucket, and then doing processing after.
| jerknextdoor wrote:
| I was curious to try this out as it might actually solve a minor
| problem of mine right now, but it crashed as soon as I tried to
| open a PDF.
|
| Installed from git using cargo 1.80.1 on Ubuntu 22.04 on an AMD
| Framework laptop if that's of any help.
| nicolodev wrote:
| argh, that's too bad, feel free to open an issue, what's
| happening in the console? It's panicking, isn't it? Feel free
| to contact me via email if you prefer
| ZoomZoomZoom wrote:
| I recently wanted to edit out a huge background image repeating
| on almost every page of a PDF and found out there's no obvious
| way to do it.
|
| Would appreciate any tool suggestions!
| dr_kiszonka wrote:
| You could try one of Adobe's PDF APIs or script their software
| locally.
| darknavi wrote:
| If you're OK doing it manually (not scripted), Inkscape can do
| this.
| darby_nine wrote:
| I've had good experience with pypdf, if you're willing to do a
| little coding.
| giancarlostoro wrote:
| This looks nice, and I didn't know about eGUI which looks like it
| runs on the web. Very interesting.
|
| https://www.egui.rs/
| nicolodev wrote:
| Thanks! The immediate mode paradigm might be a little bit scary if
| you're used to Qt, but it looks easy to manage and it's really
| interactive
___________________________________________________________________
(page generated 2024-08-28 23:00 UTC)