[HN Gopher] Show HN: HTML visualization of a PDF file's internal...
       ___________________________________________________________________
        
       Show HN: HTML visualization of a PDF file's internal structure
        
       Hi, I've just finished a rebuild of this function and added a lot
       of new features: info, page index, minimap, inverted index,... I
       think it may be useful for inspection, debugging or just as a
       learning resource showcasing the PDF file format. This is a pet
       project and I would be happy to receive some feedback! Regards
        
       Author : desgeeko
       Score  : 293 points
       Date   : 2025-02-10 13:52 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Muromec wrote:
       | That's pretty cool! I would have used it a lot at my previous job
       | if it existed back then. In my ideal world it should work
       | somewhat like https://lapo.it/asn1js/ -- you drop a file and it
       | does all the stuff locally.
        
       | SSLy wrote:
       | Damn, this is also convenient for forensics and finding
       | watermarks.
        
         | pr353n747-0n83 wrote:
         | That does sound interesting. Forgive my ignorance, but how
         | could this be used to detect watermarks? Could the same method
         | be used to detect signatures?
        
           | edoceo wrote:
           | This tool is pulling out all the metadata in the document.
           | Lots of goodies in there not typically displayed.
        
       | xeon06 wrote:
       | Wow, I've been doing some PDF parsing at work and this is going
       | to come in SO handy.
        
         | vendiddy wrote:
         | Was mentioned in this thread, but I can also endorse qpdf as
         | being a great library.
         | 
         | It gives you a JSON representation of the PDF data structure.
         | What's nice is that it doesn't hide the underlying format, but
         | it takes care of a lot of the low-level edge cases for you.
        
       | est wrote:
       | I remember there was a similar project on GitHub that lets you
       | visualize any type of binary data from a given schema. There was
       | a TCP/IP example, IIRC.
        
         | ddulaney wrote:
         | https://kaitai.io/ maybe?
         | 
         | It looks perfectly nice for its role, but I didn't use it for
         | my last project because I need serialization as well.
        
         | MontagFTB wrote:
         | HexFiend also has a template syntax for binary data
         | visualization. It's based on Tcl.
         | 
         | https://github.com/HexFiend/HexFiend/blob/master/templates/T...
        
         | mdaniel wrote:
         | Be careful, "any" is a strong word in this context.
         | Interestingly enough, I actually use PDF as the "hello world"
         | for kicking the tires on any such file format descriptor I find
         | because PDF is such a crazypants specification. Thus, if the
         | descriptor language is able to accurately capture the layout of
         | a PDF, it's obviously well thought out.
         | 
         | I haven't had a lot of luck thus far, except ones which allow
         | escaping out of declarative mode over into "and then run this
         | code"
        
       | nonrandomstring wrote:
       | Well done. This is a very useful security previewing tool. PDFs
       | are a menace.
        
       | swsieber wrote:
       | I've used the iText RUPS (free) for a while for debugging PDFs
       | (as I have the "privilege" to work on code that extracts data
       | from PDFs...). It looks like your introspection stuff might be a
       | bit stronger, which would be great. I'll take it for a whirl.
        
       | tyilo wrote:
       | Looks nice.
       | 
       | Would be better if all of the PDF's bytes were shown. Seems like
       | `endobj` and `xref` are not shown.
        
         | desgeeko wrote:
         | Thanks for noticing! You're right, I will fix that very soon.
        
           | tyilo wrote:
           | When opening the following hello world PDF, the trailer isn't
           | shown correctly and both `startxref` and `%%EOF` are missing:
           | https://ghostbin.site/bb7jb
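
For readers unfamiliar with the keywords mentioned above, here is a minimal
Python sketch of where `endobj`, `xref`, `trailer`, `startxref` and `%%EOF`
sit in a classic (non-stream) PDF layout. It is not the pasted hello world
file and not PDFSyntax code; the function names and the three-object document
are assumptions made up for illustration, and the output has not been checked
against any particular viewer.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny one-page PDF with a matching cross-reference table."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /Resources << >> /MediaBox [0 0 200 200] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)  # startxref must point at the "xref" keyword
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 heads the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def structural_keywords(data: bytes) -> list:
    """Return (offset, keyword) pairs for keywords that sit alone on a line."""
    pattern = re.compile(rb"^(endobj|xref|trailer|startxref|%%EOF)$", re.MULTILINE)
    return [(m.start(), m.group(1).decode("ascii")) for m in pattern.finditer(data)]


if __name__ == "__main__":
    for offset, keyword in structural_keywords(build_minimal_pdf()):
        print(f"{offset:6d}  {keyword}")
```

A visualization that shows every byte would need to account for each of these
markers, since none of them belongs to an object body.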
        
       | escapecharacter wrote:
       | I've been shopping for something that does a per-byte description
       | of the content of visual media formats (jpeg, png, avi, mp4,
       | etc). Anyone know of one?
        
         | freeone3000 wrote:
         | This sounds like the format specification? What are you looking
         | for that is not a document?
        
           | escapecharacter wrote:
           | I want to drop a specific image in, and have a reader that
           | debugs this. Sometimes images don't follow specs exactly, or
           | stretch them in fun ways, and sometimes this leads to
           | inconsistent behaviour across platforms. Sometimes passing an
           | image through a platform strips or reformats this data.
           | 
           | The current context for me is I'm exploring various non-
           | steganography approaches to embed metadata in photos. In the
           | past, I've built custom formats to embed streaming data side-
           | by-side: https://github.com/dustinfreeman/kriffer
        
         | moritonal wrote:
         | Really impressed this site is still running. They'll have what
         | you want https://formats.kaitai.io/
        
         | nayuki wrote:
         | I did PNG: https://www.nayuki.io/page/png-file-chunk-inspector
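
As a rough illustration of what a per-chunk description of a PNG involves,
here is a minimal Python sketch. It is not the code behind the inspector
linked above; `make_chunk`, `iter_chunks` and the 1x1 sample image are
hypothetical names and data chosen for this example. It relies only on the
basic PNG layout: an 8-byte signature followed by chunks of length, type,
payload and CRC-32.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(kind: bytes, payload: bytes) -> bytes:
    """Serialize one chunk: 4-byte length, 4-byte type, payload, CRC-32."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def iter_chunks(data: bytes):
    """Yield (offset, type, payload length, crc_ok) for each chunk in order."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("missing PNG signature")
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        if pos + 12 > len(data):
            raise ValueError("truncated chunk header")
        (length,) = struct.unpack_from(">I", data, pos)
        if pos + 12 + length > len(data):
            raise ValueError("truncated chunk payload")
        kind = data[pos + 4:pos + 8]
        payload = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack_from(">I", data, pos + 8 + length)
        crc_ok = stored_crc == (zlib.crc32(kind + payload) & 0xFFFFFFFF)
        yield pos, kind.decode("ascii"), length, crc_ok
        pos += 12 + length


def sample_png() -> bytes:
    """Build a 1x1 8-bit grayscale image to exercise the walker."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


if __name__ == "__main__":
    for offset, kind, length, crc_ok in iter_chunks(sample_png()):
        print(offset, kind, length, "ok" if crc_ok else "BAD CRC")
```

Ancillary chunks that a platform strips or rewrites would show up as a
difference in this listing before and after the upload.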
        
       | tekkk wrote:
       | This would be really nice as a browser library. Could just drag
       | and drop a file and see its insides. But impressive nonetheless.
        
         | kohbo wrote:
         | Do you mean a browser extension? Not trying to be rude; Just
         | making sure I understand.
        
       | kevmo314 wrote:
       | Is the UI tooling that does the visualization a library? I really
       | like the UI format, would love to use this for breaking down and
       | debugging video byte streams too.
       | 
       | EDIT: Oh it's actually reasonably simple, great use of CSS!
       | https://github.com/desgeeko/pdfsyntax/blob/main/docs/simple_...
        
         | desgeeko wrote:
         | Yes, I value simplicity and the interactivity offered by basic
         | HTML and CSS is sufficient for my use case :)
        
       | LegionMammal978 wrote:
       | If you're interested in manipulating PDFs, I've found QPDF [0] to
       | be a useful tool. Its "QDF mode" lays out the objects in a form
       | where you can directly edit them, and it can automatically fix up
       | the xref table afterwards. It can also convert to and from a JSON
       | format that you can manipulate with your own scripts.
       | 
       | [0] https://github.com/qpdf/qpdf,
       | https://qpdf.readthedocs.io/en/stable/
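
A minimal Python sketch of driving the two qpdf features described above from
a script. It assumes the `qpdf` executable is installed and on the PATH; the
helper names are hypothetical. Only the `--qdf`, `--object-streams=disable`
and `--json` options are used, and the shape of the JSON depends on the qpdf
version, so the sketch returns it without interpreting any keys.

```python
import json
import subprocess


def qdf_command(src: str, dst: str) -> list:
    """Command line that rewrites src into editable QDF form at dst."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]


def json_command(src: str) -> list:
    """Command line that prints qpdf's JSON description of src to stdout."""
    return ["qpdf", "--json", src]


def run_qpdf(command: list) -> bytes:
    """Run qpdf and return stdout; exit status 3 means warnings only."""
    result = subprocess.run(command, capture_output=True)
    if result.returncode not in (0, 3):
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return result.stdout


def load_pdf_json(src: str) -> dict:
    """Parse the JSON output of qpdf for further scripting."""
    return json.loads(run_qpdf(json_command(src)))


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        run_qpdf(qdf_command(sys.argv[1], sys.argv[2]))
    elif len(sys.argv) == 2:
        print(sorted(load_pdf_json(sys.argv[1]).keys()))
```

After hand-editing a QDF file, qpdf's companion `fix-qdf` tool is the usual
way to regenerate the xref table.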
        
         | zackmorris wrote:
         | Just so we have them, the top links I got for PDF to JSON:
         | 
         | https://qpdf.readthedocs.io/en/stable/json.html
         | 
         | https://www.jsonify.org
         | 
         | https://github.com/maximoguerrero/PDF-GPT4-JSON
         | 
         | PDF is such a curious format. It's not human-readable, it's not
         | well-structured, it's not small. If it weren't for momentum and
         | the political horse trading that Apple, Adobe and Microsoft
         | were doing when the web went mainstream and freaked them out
         | around 1995, I'm not sure that we'd be using it today.
         | Postscript is better in countless ways, but since it's Turing-
         | complete, it's not really ideal for storing static data, and to
         | my knowledge was never extended to handle binary data well,
         | like for embedded JPEGs. I remember trying to print a 10 MB ps
         | file in the 1990s and it took maybe 20 minutes because the
         | grayscale image was basically represented as a bunch of run-
         | length encoded scan lines.
         | 
         | I would argue that frontend web development has reached a
         | similar fate. It seems odd to use a programming language
         | (imperative, no less) to design media that we used to describe
         | declaratively. If I had enjoyed success in my programming
         | career, I would work on a declarative representation of
         | HTML/CSS/Javascript that can represent the intersection of all
         | existing markup across all mainstream browsers. Sort of like a
         | mix between Markdown and CSS flexbox like Xcode's auto layout,
         | but universal. It frankly would probably look like HTML, but
         | with sane defaults/builtins/inheritance, as well as a way to
         | define and extend components from the beginning, similarly to
         | how people try to use data attributes. For contrast, React and
         | Vue come at this from the opposite direction. I'm talking about
         | something more like htmx.
         | 
         | Then we could work with that format and transpile to HTML or
         | even React Native and dump 90-99% of the boilerplate and build
         | tooling that we use currently.
        
       | codetrotter wrote:
       | Many moons ago I was tasked with extracting data from a bunch of
       | PDFs. I made a tool to visualise how characters were laid out on
       | the page and bounding boxes of all the elements.
       | 
       | The project was in the end a complete failure and several people
       | were upset at me for not delivering what I was supposed to.
       | 
       | In present day, with the capabilities that are now available with
       | LLMs to extract data from PDFs I 100% would go the route of
       | utilising AI to extract the data they wanted. Back then that did
       | not yet exist.
        
         | GaggiX wrote:
         | It reminds me of: https://xkcd.com/1425/
         | 
         | In the same way now with today's AI models the task is easily
         | achievable.
        
         | jimjimjim wrote:
         | The LLMs might help with sequencing the characters you extract
         | from the page but actually getting the contents is still
         | difficult. A number of times I've come across a page where the
         | letters of the text are glyphs in a custom font with no mapping
         | to ASCII or anything similar. Even more common, especially with
         | output from CAD, are letters that are made by drawing lines in
         | the shape of letters, so there is nothing identifiable to
         | extract and you are left with OCRing the page to double-check
         | the results.
        
         | macklinkachorn wrote:
         | In my previous role, I experienced similar things: the
         | rule-based parsing approach is really tricky to get right and
         | often failed on edge cases.
         | 
         | We (at https://runtrellis.com/) have been building a PDF
         | processing pipeline from the ground up with LLMs and VLMs and
         | have seen close to 100% accuracy even for tricky PDFs. The key
         | is to use a rule-based engine and references to cross-check the
         | data.
        
         | bob1029 wrote:
         | Parsing data out of arbitrary PDFs is a cursed mission. PDF can
         | contain images, so you might as well target JPEG directly.
         | 
         | OCR can take you pretty far depending on expectations, but it's
         | never quite far enough in my experience.
        
       | acabajoe wrote:
       | Kudos for making this self-hosted. So very much appreciated!
        
       | flsw wrote:
       | related: https://news.ycombinator.com/item?id=41377960
        
       | adelpozo wrote:
       | It does not have any dependency on a PDF parsing library,
       | correct? That's a cool way to learn the file format and be able
       | to work around weird PDF files. But what was the motivation to
       | not use a library to do the PDF parsing work? Is it the case
       | that there is none available? Nice work!
        
         | desgeeko wrote:
         | Correct, PDFSyntax implements everything at the lowest level.
         | You can ignore the HTML visualization and use it as an API to
         | access PDF objects. Why? Because I started a very small tool as
         | a week-end project and I got hooked reading the PDF
         | Specification so it is becoming a general purpose PDF library
         | for Python. I am not familiar with other libraries but I have
         | the impression that mine implements things that are often
         | overlooked in others, like incremental updates.
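
To make "incremental updates" concrete: a PDF writer can append new objects,
a new xref section and a new trailer to the end of a file instead of rewriting
it, and each new trailer points back at the previous xref section through
`/Prev`. The Python sketch below walks that chain. It is not PDFSyntax's
implementation; the function names are hypothetical, and it assumes classic
xref tables (no xref streams) and trailer dictionaries without nested
dictionaries. The fixture builder omits the entry for object 0, so its output
is a test fixture rather than a fully conforming file.

```python
import re


def append_revision(data: bytes, obj_num: int, body: bytes) -> bytes:
    """Append one object plus its own xref section, trailer and %%EOF."""
    previous = re.findall(rb"startxref\s+(\d+)", data)
    out = bytearray(data)
    obj_pos = len(out)
    out += b"%d 0 obj\n%s\nendobj\n" % (obj_num, body)
    xref_pos = len(out)
    out += b"xref\n%d 1\n%010d 00000 n \n" % (obj_num, obj_pos)
    out += b"trailer\n<< /Size %d /Root 1 0 R" % (obj_num + 1)
    if previous:
        out += b" /Prev %d" % int(previous[-1])  # link to the older xref
    out += b" >>\nstartxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def revision_xref_offsets(data: bytes) -> list:
    """Follow startxref and /Prev links from the newest revision to the oldest."""
    start = data.rfind(b"startxref")
    if start < 0:
        raise ValueError("no startxref keyword found")
    offset = int(data[start + len(b"startxref"):].split()[0])
    chain = []
    while offset is not None and offset not in chain:  # guard against loops
        chain.append(offset)
        trailer_start = data.find(b"trailer", offset)
        if trailer_start < 0:
            break
        trailer_end = data.find(b">>", trailer_start)
        match = re.search(rb"/Prev\s+(\d+)", data[trailer_start:trailer_end])
        offset = int(match.group(1)) if match else None
    return chain


if __name__ == "__main__":
    base = append_revision(b"%PDF-1.4\n", 1, b"<< /Type /Catalog >>")
    updated = append_revision(base, 2, b"<< /Note (added later) >>")
    print(revision_xref_offsets(updated))
```

Because the original bytes are left untouched, earlier revisions remain
recoverable from an updated file, which is part of why the feature matters
for inspection and forensics.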
        
       | nabaraz wrote:
       | On a similar note, why hasn't PDF been replaced? There are XPS,
       | DjVu and XHTML (EPUB) but they all seem to be targeting a
       | different use case (a packaged HTML file).
       | 
       | What I want is a simple document format that allows embedding
       | other files and metadata without Adobe's bloat. I should be able
       | to hyperlink within pages, change font size, etc. without text
       | overflowing, and still be able to print in a consistent manner.
        
         | wetpaws wrote:
         | Because it works, and works well enough. Also, immutability is
         | a feature, not a bug.
        
         | jimjimjim wrote:
         | Different use cases.
         | 
         | "without text overflowing" brings with it a lot of detail. In
         | pdf every letter/character/glyph of text can have an exact x,y
         | position on the page (or off the page sometimes). This allows
         | for precise positioning of content regardless of what else is
         | going on. It is up to the application that writes the pdf to
         | position things correctly and implement letter or word
         | wrapping.
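
A minimal Python sketch of the point above: the writing application decides
where lines break and emits an explicit position for each one. The function
names, the `/F1` resource name and the greedy wrapping strategy are
assumptions for illustration. Widths are estimated for Courier, whose glyphs
all advance 600/1000 of the font size; a real writer would read per-glyph
widths from the font.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def wrap_words(text: str, max_chars: int) -> list:
    """Greedy word wrap; a single over-long word still gets its own line."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def layout_text(text, x, y, width, size=12.0, leading=14.0) -> str:
    """Return content-stream operators placing each wrapped line explicitly."""
    char_width = 0.6 * size  # Courier advance width, assumed fixed
    max_chars = max(1, int(width // char_width))
    ops = ["BT", "/F1 %g Tf" % size]  # /F1 is an assumed font resource name
    for i, line in enumerate(wrap_words(text, max_chars)):
        # Tm sets an absolute text position; the page origin is bottom-left.
        ops.append("1 0 0 1 %g %g Tm" % (x, y - i * leading))
        ops.append("(%s) Tj" % escape_pdf_string(line))
    ops.append("ET")
    return "\n".join(ops)


if __name__ == "__main__":
    print(layout_text("Hello PDF world", x=72, y=700, width=60))
```

Nothing in the resulting stream records that the three lines were once a
single paragraph, which is why reflowing text in a viewer is hard.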
         | 
         | XPS was the closest to reimplementing PDF, but Microsoft didn't
         | get enough buy-in from other parties, so it quietly died.
        
         | stronglikedan wrote:
         | One reason is that none of those other formats are suitable for
         | commercial printing as-is.
        
       ___________________________________________________________________
       (page generated 2025-02-10 23:00 UTC)