[HN Gopher] Show HN: HTML visualization of a PDF file's internal...
___________________________________________________________________
Show HN: HTML visualization of a PDF file's internal structure
Hi, I've just finished a rebuild of this function and added a lot
of new features: info, page index, minimap, inverted index,... I
think it may be useful for inspection, debugging or just as a
learning resource showcasing the PDF file format. This is a pet
project and I would be happy to receive some feedback! Regards
Author : desgeeko
Score : 293 points
Date : 2025-02-10 13:52 UTC (9 hours ago)
| Muromec wrote:
| That's pretty cool! I would have used it a lot at my previous job
| if it existed back then. In my ideal world it should work
| somewhat like https://lapo.it/asn1js/ -- you drop a file and it
| does all the stuff locally.
| SSLy wrote:
| Damn, this is also convenient for forensics and finding
| watermarks.
| pr353n747-0n83 wrote:
| That does sound interesting. Forgive my ignorance, but how
| could this be used to detect watermarks? Could the same method
| be used to detect signatures?
| edoceo wrote:
| This tool is pulling out all the metadata in the document.
| Lots of goodies in there not typically displayed.
| xeon06 wrote:
| Wow, I've been doing some PDF parsing at work and this is going
| to come in SO handy.
| vendiddy wrote:
| Was mentioned in this thread, but I can also endorse qpdf as
| being a great library.
|
| It gives you a JSON representation of the PDF data structure.
| What's nice is that it doesn't hide the underlying format but it
| takes care of a lot of the low level edge cases for you.
| est wrote:
| I remember there was a similar project on GitHub that lets you
| visualize any type of binary data from a given schema. There was a
| TCP/IP example IIRC.
| ddulaney wrote:
| https://kaitai.io/ maybe?
|
| It looks perfectly nice for its role, but I didn't use it for
| my last project because I need serialization as well.
| MontagFTB wrote:
| HexFiend also has a template syntax for binary data
| visualization. It's based on Tcl.
|
| https://github.com/HexFiend/HexFiend/blob/master/templates/T...
| mdaniel wrote:
| Be careful, "any" is a strong word in this context.
| Interestingly enough, I actually use PDF as the "hello world"
| for kicking the tires on any such file format descriptor I find
| because PDF is such a crazypants specification. Thus, if the
| descriptor language is able to accurately capture the layout of
| a PDF, it's obviously well thought out.
|
| I haven't had a lot of luck thus far, except ones which allow
| escaping out of declarative mode over into "and then run this
| code"
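To make the declarative idea concrete, here is a minimal sketch that assumes a hypothetical field table (`SCHEMA`) interpreted with Python's `struct` module. It is not the syntax of Kaitai or HexFiend; it only shows why fixed-layout headers suit a pure schema, while a format like PDF, where offsets stored in the file must be followed, tends to need an escape into code.

```python
import struct

# Hypothetical declarative schema: (field name, struct format), big-endian.
SCHEMA = [
    ("magic", "4s"),
    ("version", "H"),
    ("length", "I"),
]

def parse(data, schema):
    """Return (name, offset, size, value) rows for the fields in the schema."""
    rows = []
    offset = 0
    for name, fmt in schema:
        size = struct.calcsize(">" + fmt)
        (value,) = struct.unpack_from(">" + fmt, data, offset)
        rows.append((name, offset, size, value))
        offset += size
    return rows

# Made-up sample header, built in memory so no file is needed.
sample = b"DEMO" + struct.pack(">HI", 2, 1024)
fields = parse(sample, SCHEMA)
for name, offset, size, value in fields:
    print(f"{offset:04x} +{size}  {name} = {value!r}")
```

Each row carries its byte offset and size, which is the information a per-byte visualizer needs.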
| nonrandomstring wrote:
| Well done. This is a very useful security previewing tool. PDFs
| are a menace.
| swsieber wrote:
| I've used the iText RUPS (free) for a while for debugging PDFs
| (as I have the "privilege" to work on code that extracts data
| from PDFs...). It looks like your introspection stuff might be a
| bit stronger, which would be great. I'll take it for a whirl.
| tyilo wrote:
| Looks nice.
|
| Would be better if all of the PDF's bytes were shown. Seems like
| `endobj` and `xref` are not shown.
| desgeeko wrote:
| Thanks for noticing! You're right, I will fix that very soon.
| tyilo wrote:
| When opening the following hello world PDF, the trailer isn't
| shown correctly and both `startxref` and `%%EOF` are missing:
| https://ghostbin.site/bb7jb
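For readers unfamiliar with the keywords mentioned here, the following is a rough sketch of where `xref`, `trailer`, `startxref` and `%%EOF` sit in a classic (non-stream) cross-reference PDF. The `HELLO_PDF` bytes are a hand-written, pageless stand-in, not the file linked above, and the scan assumes `\n` line endings.

```python
import re

# Hand-written minimal file: a catalog and an empty page tree.
HELLO_PDF = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    b"xref\n0 3\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"0000000058 00000 n \n"
    b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    b"startxref\n110\n"
    b"%%EOF\n"
)

# Keywords are matched only at the start of a line, so the "xref"
# inside "startxref" is not reported twice.
TOKEN = re.compile(rb"^(xref|trailer|startxref|%%EOF)", re.MULTILINE)

def tail_markers(data):
    """Return the structural keywords with offsets, plus the declared xref offset."""
    markers = [(m.group(1).decode(), m.start()) for m in TOKEN.finditer(data)]
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF", data)
    declared = int(match.group(1)) if match else None
    return markers, declared

markers, declared = tail_markers(HELLO_PDF)
for name, offset in markers:
    print(f"{offset:5d}  {name}")
print("startxref points at xref:", HELLO_PDF[declared:declared + 4] == b"xref")
```

A reader starts from the end of the file: `startxref` gives the byte offset of the `xref` table, and the `trailer` dictionary names the root object.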
| escapecharacter wrote:
| I've been shopping for something that does a per-byte description
| of the content of visual media formats (jpeg, png, avi, mp4,
| etc). Anyone know of one?
| freeone3000 wrote:
| This sounds like the format specification? What are you looking
| for that is not a document?
| escapecharacter wrote:
| I want to drop a specific image in, and have a reader that
| debugs this. Sometimes images don't follow specs exactly, or
| stretch them in fun ways, and sometimes this leads to
| inconsistent behaviour across platforms. Sometimes passing an
| image through a platform strips or reformats this data.
|
| The current context for me is I'm exploring various non-
| steganography approaches to embed metadata in photos. In the
| past, I've built custom formats to embed streaming data side-
| by-side: https://github.com/dustinfreeman/kriffer
| moritonal wrote:
| Really impressed this site is still running. They'll have what
| you want https://formats.kaitai.io/
| nayuki wrote:
| I did PNG: https://www.nayuki.io/page/png-file-chunk-inspector
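As an illustration of what a per-chunk inspector does, here is a minimal sketch of walking PNG chunks (length, type, data, CRC-32). It is not the code behind the page linked above; the 1x1 test image is generated in memory and the helper names are made up for this example.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(kind, payload):
    """Serialize one chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)

def walk_chunks(data):
    """Yield (offset, type, length, crc_ok) for each chunk after the signature."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG signature")
    offset = 8
    while offset < len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        kind = data[offset + 4:offset + 8]
        payload = data[offset + 8:offset + 8 + length]
        (stored_crc,) = struct.unpack_from(">I", data, offset + 8 + length)
        crc_ok = stored_crc == (zlib.crc32(kind + payload) & 0xFFFFFFFF)
        yield offset, kind.decode("ascii"), length, crc_ok
        offset += 12 + length  # 4 length + 4 type + 4 CRC, plus the data

# A 1x1 8-bit grayscale image: width, height, depth, color type, then
# compression, filter and interlace methods (all zero).
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x00")  # filter byte 0 followed by one black pixel
png = (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
       + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))

chunks = list(walk_chunks(png))
for offset, kind, length, crc_ok in chunks:
    print(f"{offset:4d}  {kind}  len={length}  crc={'ok' if crc_ok else 'BAD'}")
```

A chunk whose stored CRC disagrees with the recomputed one is a quick hint that some platform rewrote or damaged the data.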
| tekkk wrote:
| This would be really nice as a browser library. Could just drag and
| drop a file and see its insides. But impressive nonetheless.
| kohbo wrote:
| Do you mean a browser extension? Not trying to be rude; just
| making sure I understand.
| kevmo314 wrote:
| Is the UI tooling that does the visualization a library? I really
| like the UI format, would love to use this for breaking down and
| debugging video byte streams too.
|
| EDIT: Oh it's actually reasonably simple, great use of CSS!
| https://github.com/desgeeko/pdfsyntax/blob/main/docs/simple_...
| desgeeko wrote:
| Yes, I value simplicity and the interactivity offered by basic
| HTML and CSS is sufficient for my use case :)
| LegionMammal978 wrote:
| If you're interested in manipulating PDFs, I've found QPDF [0] to
| be a useful tool. Its "QDF mode" lays out the objects in a form
| where you can directly edit them, and it can automatically fix up
| the xref table afterwards. It can also convert to and from a JSON
| format that you can manipulate with your own scripts.
|
| [0] https://github.com/qpdf/qpdf,
| https://qpdf.readthedocs.io/en/stable/
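A small sketch of driving the two qpdf features mentioned here from Python. The file names are placeholders, and the helper functions are made up for this example; the commands only run if a `qpdf` executable is found on PATH.

```python
import shutil
import subprocess

def qdf_command(src, dst):
    """qpdf invocation that rewrites src in QDF mode for hand editing."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]

def json_command(src):
    """qpdf invocation that prints a JSON description of src on stdout."""
    return ["qpdf", "--json", src]

def run_if_available(command):
    """Run the command only when its executable is on PATH; otherwise return None."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, capture_output=True, check=False)

# Placeholder file names; substitute real paths before running.
print(" ".join(qdf_command("input.pdf", "editable.pdf")))
print(" ".join(json_command("input.pdf")))
```

After editing the QDF output by hand, qpdf's companion program `fix-qdf` can regenerate the xref table, for example `fix-qdf editable.pdf > fixed.pdf`.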
| zackmorris wrote:
| Just so we have them, the top links I got for PDF to JSON:
|
| https://qpdf.readthedocs.io/en/stable/json.html
|
| https://www.jsonify.org
|
| https://github.com/maximoguerrero/PDF-GPT4-JSON
|
| PDF is such a curious format. It's not human-readable, it's not
| well-structured, it's not small. If it weren't for momentum and
| the political horse trading that Apple, Adobe and Microsoft
| were doing when the web went mainstream and freaked them out
| around 1995, I'm not sure that we'd be using it today.
| Postscript is better in countless ways, but since it's Turing-
| complete, it's not really ideal for storing static data, and to
| my knowledge was never extended to handle binary data well,
| like for embedded JPEGs. I remember trying to print a 10 MB ps
| file in the 1990s and it took maybe 20 minutes because the
| grayscale image was basically represented as a bunch of run-
| length encoded scan lines.
|
| I would argue that frontend web development has reached a
| similar fate. It seems odd to use a programming language
| (imperative, no less) to design media that we used to describe
| declaratively. If I had enjoyed success in my programming
| career, I would work on a declarative representation of
| HTML/CSS/Javascript that can represent the intersection of all
| existing markup across all mainstream browsers. Sort of like a
| mix between Markdown and CSS flexbox like Xcode's auto layout,
| but universal. It frankly would probably look like HTML, but
| with sane defaults/builtins/inheritance, as well as a way to
| define and extend components from the beginning, similarly to
| how people try to use data attributes. For contrast, React and
| Vue come at this from the opposite direction. I'm talking about
| something more like htmx.
|
| Then we could work with that format and transpile to HTML or
| even React Native and dump 90-99% of the boilerplate and build
| tooling that we use currently.
| codetrotter wrote:
| Many moons ago I was tasked with extracting data from a bunch of
| PDFs. I made a tool to visualise how characters were laid out on
| the page and bounding boxes of all the elements.
|
| The project was in the end a complete failure and several people
| were upset at me for not delivering what I was supposed to.
|
| In present day, with the capabilities that are now available with
| LLMs to extract data from PDFs I 100% would go the route of
| utilising AI to extract the data they wanted. Back then that did
| not yet exist.
| GaggiX wrote:
| It reminds me of: https://xkcd.com/1425/
|
| In the same way now with today's AI models the task is easily
| achievable.
| jimjimjim wrote:
| The LLMs might help with sequencing the characters you extract
| from the page but actually getting the contents is still
| difficult. A number of times I've come across a page where the
| letters of the text are glyphs in a custom font with no mapping
| to ASCII or anything similar. Even more common, especially with
| output from CAD, are letters made by drawing lines in the shape
| of letters, so there is nothing identifiable to extract and you
| are left with OCRing the page to double-check the results.
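To illustrate the custom-font problem, here is a toy sketch. It assumes a hypothetical code-to-character table (`TO_UNICODE`), standing in for whatever mapping a font does or does not provide; real extraction is considerably more involved.

```python
# Hypothetical mapping for a subset font: character code -> Unicode text.
TO_UNICODE = {1: "P", 2: "D", 3: "F"}

def decode_codes(codes, to_unicode):
    """Map character codes to text; unmapped codes become U+FFFD."""
    text = []
    missing = []
    for code in codes:
        if code in to_unicode:
            text.append(to_unicode[code])
        else:
            text.append("\ufffd")  # placeholder: only OCR can recover this glyph
            missing.append(code)
    return "".join(text), missing

# Code 4 is drawn on the page but has no entry in the table.
text, missing = decode_codes([1, 2, 3, 4], TO_UNICODE)
print(repr(text), "unmapped codes:", missing)
```

When the table is empty or absent, every code lands in `missing`, which is the situation where falling back to OCR is the only option.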
| macklinkachorn wrote:
| In my previous role, I experienced similar things: the
| rule-based parsing approach is really tricky to get right and
| often failed on edge cases.
|
| We (at https://runtrellis.com/) have been building a PDF
| processing pipeline from the ground up with LLMs and VLMs and
| have seen close to 100% accuracy even for tricky PDFs. The key
| is to use a rule-based engine and references to cross-check the
| data.
| bob1029 wrote:
| Parsing data out of arbitrary PDFs is a cursed mission. PDF can
| contain images, so you might as well target JPEG directly.
|
| OCR can take you pretty far depending on expectations, but it's
| never quite far enough in my experience.
| acabajoe wrote:
| Kudos for making this self-hosted. So very much appreciated!
| flsw wrote:
| related: https://news.ycombinator.com/item?id=41377960
| adelpozo wrote:
| It does not have any dependency on a PDF parsing library,
| correct? That's a cool way to learn the file format and be able to
| work around weird PDF files. But what was the motivation to not
| use a library to do the PDF parsing work? Is it the case that
| there is none available? Nice work!
| desgeeko wrote:
| Correct, PDFSyntax implements everything at the lowest level.
| You can ignore the HTML visualization and use it as an API to
| access PDF objects. Why? Because I started a very small tool as
| a week-end project and I got hooked reading the PDF
| Specification so it is becoming a general purpose PDF library
| for Python. I am not familiar with other libraries but I have
| the impression that mine implements things that are often
| overlooked in others, like incremental updates.
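For context on incremental updates: a PDF writer may append changed objects, a new xref section and a new trailer (with `/Prev` pointing at the previous xref) instead of rewriting the file, so one file can hold several revisions. The sketch below is not PDFSyntax's API; the function names are made up and the revision bodies are placeholders rather than valid objects.

```python
import re

REVISION_END = re.compile(rb"startxref\s+(\d+)\s+%%EOF")

def revisions(data):
    """Return (end_offset, xref_offset) for each revision, oldest first."""
    return [(m.end(), int(m.group(1))) for m in REVISION_END.finditer(data)]

def revision_at(data, index):
    """Bytes of the file as it existed after the given revision (0-based)."""
    end, _ = revisions(data)[index]
    return data[:end]

def append_revision(data, body, prev=None):
    """Append a placeholder revision whose startxref points at its own xref."""
    data += body
    xref_offset = len(data)
    trailer = b"<< /Size 3 >>"
    if prev is not None:
        trailer = b"<< /Size 3 /Prev " + str(prev).encode() + b" >>"
    data += b"xref\n...\ntrailer\n" + trailer + b"\nstartxref\n"
    data += str(xref_offset).encode() + b"\n%%EOF\n"
    return data, xref_offset

# Build a two-revision stand-in file in memory.
data, first = append_revision(b"", b"%PDF-1.4\n...objects...\n")
data, second = append_revision(data, b"...changed objects...\n", prev=first)

revs = revisions(data)
print(len(revs), "revisions; xref offsets:", [xref for _, xref in revs])
print(len(revision_at(data, 0)), "bytes in the original revision")
```

Truncating at an earlier `%%EOF` recovers the document as it was before the update, which is why tools that ignore incremental updates can miss history.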
| nabaraz wrote:
| On a similar note, why hasn't PDF been replaced? There are XPS,
| DjVu and XHTML (EPUB) but they all seem to be targeting different
| use cases (a packaged HTML file).
|
| What I want is a simple document format that allows embedding
| other files and metadata without Adobe's bloat. I should be
| able to hyperlink within pages, change font-size etc without text
| overflowing and being able to print in a consistent manner.
| wetpaws wrote:
| Cause it works and works good enough. Also, immutability is a
| feature, not a bug
| jimjimjim wrote:
| Different use cases.
|
| "without text overflowing" brings with it a lot of detail. In
| pdf every letter/character/glyph of text can have an exact x,y
| position on the page (or off the page sometimes). This allows
| for precise positioning of content regardless of what else is
| going on. It is up to the application that writes the pdf to
| position things correctly and implement letter or word
| wrapping.
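A rough sketch of what "the writer does the wrapping" means in practice: the code below emits content-stream text operators with an absolute position for every word. It assumes Courier (every glyph 600/1000 em wide) so widths can be computed without font metrics; the `layout` helper and the `/F1` font name are made up for this example.

```python
def escape(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

def layout(text, x, y, max_width, font_size=12, leading=14):
    """Greedy word wrap; returns content-stream lines with absolute positions."""
    advance = 0.6 * font_size  # Courier assumption: fixed glyph width
    space = advance
    ops = ["BT", f"/F1 {font_size} Tf"]
    cursor_x, cursor_y = x, y
    for word in text.split():
        width = advance * len(word)
        if cursor_x > x and cursor_x + width > x + max_width:
            # The line is full: move down one line, back to the left margin.
            cursor_x, cursor_y = x, cursor_y - leading
        # Tm sets the text matrix, here a plain translation to (x, y).
        ops.append(f"1 0 0 1 {cursor_x:.1f} {cursor_y:.1f} Tm ({escape(word)}) Tj")
        cursor_x += width + space
    ops.append("ET")
    return ops

stream = layout("PDF does not reflow text (the writer does)",
                x=72, y=720, max_width=150)
print("\n".join(stream))
```

Once these operators are written, a viewer has no notion of paragraphs or line breaks left to reflow, which is why changing the font size afterwards cannot work the way it does in HTML.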
|
| XPS was the closest to reimplementing PDF but Microsoft didn't
| get enough buy-in from other parties so it quietly died.
| stronglikedan wrote:
| One reason is that none of those other formats are suitable for
| commercial printing as-is.
___________________________________________________________________
(page generated 2025-02-10 23:00 UTC)