[HN Gopher] Pdf.tocgen
       ___________________________________________________________________
        
       Pdf.tocgen
        
       Author : nbernard
       Score  : 161 points
       Date   : 2024-04-28 08:51 UTC (1 days ago)
        
 (HTM) web link (krasjet.com)
 (TXT) w3m dump (krasjet.com)
        
       | mrtx01 wrote:
       | What a beautiful website!
        
         | GrumpyNl wrote:
         | And built with very little CSS and basic HTML.
        
           | oneeyedpigeon wrote:
           | > basic HTML
           | 
           | Apart from the code blocks. Syntax-highlighting in `<code>`
           | elements, when, browser manufacturers?
        
         | porker wrote:
         | It took a bit of digging from the Pdf.tocgen page, but
         | https://krasjet.com/colophon/ tells us how it's created.
        
           | lelandfe wrote:
           | Uncommon to see someone so caring about the specifics of
           | their chosen font. Love it.
        
       | mbana wrote:
       | I love the typography on the site. What fonts are you using? I'm
       | on a mobile browser so I can't really see.
        
         | porker wrote:
         | According to https://krasjet.com/colophon/:
         | 
         | > The typeface you are reading right now is Garibaldi by
         | Henrique Beier, with some custom tweaks, as you might have
         | noticed. I hope you enjoy it as much as I do. If you want some
         | free alternatives, check out Alegreya ht and Vollkorn, though I
         | still prefer the look and details of Garibaldi (just look at
         | all the punctuation marks!).
        
         | StayTrue wrote:
         | Garibaldi, $300 for up to 10k page views per month.
        
         | karma_pharmer wrote:
         | I was going to post the same thing. This has to be the most
         | beautifully typeset webpage I've seen in quite a while. Not
         | just the font but the layout too.
         | 
         | It's almost like this page is part of the web from some
         | parallel universe, which has been disenshittified to the same
         | extent that our own web has been... well, you know.
        
       | papichulo2023 wrote:
       | Looks like a very good tool to integrate with Knowledge Graphs or
       | just RAG (llm).
        
       | perihelions wrote:
       | - _" That is, you shouldn't expect it to work with scanned PDFs"_
       | 
       | It's surprisingly easy to extend this type of workflow to scanned
       | pdfs (as opposed to software-generated, text-containing ones).
       | tesseract(1) makes short work of ToC pages with --psm set to 6
       | (an OCR setting that tends to collapse convoluted text layouts
       | into a regular, software-parseable output).
       | 
       | It should also be straightforward, but I don't know of an out-of-
       | the-box solution, to automate that example of extracting "text
       | that looks like a header"-based on page layout/relative
       | positioning, or font weight. (I'm working on an adjacent problem,
       | an automatic re-layout of raster documents to squeeze out
       | whitespace and make them slightly nicer on small e-ink devices.
       | Text islands are trivial to identify. I don't know how to
       | quantify font weight, or things like that. I'm "wasting" a lot of
       | time diving into lots of mathematics rabbit holes, but I don't
       | know in advance which ones will be productive or not).
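
A minimal sketch of the tesseract call described in the comment above,
assuming tesseract 4 or later is installed and on PATH (older 3.x releases
spell the flag -psm). The function names and the file name toc_page.png are
hypothetical; only the command-line arguments come from tesseract itself.

```python
import shutil
import subprocess
from typing import List


def tesseract_command(image_path: str, psm: int = 6) -> List[str]:
    """Build a tesseract CLI call that prints recognized text to stdout."""
    # Passing "stdout" as the output base prints the text instead of writing a file.
    # --psm 6 tells tesseract to assume a single uniform block of text.
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def ocr_page(image_path: str, psm: int = 6) -> str:
    """Run tesseract on one page image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        tesseract_command(image_path, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling ocr_page("toc_page.png") on a rendered ToC page would return its text
as one string, ready for line-by-line parsing.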
        
         | felipefar wrote:
         | tesseract is fine for basic use cases, but it fails when the
         | image is tilted (and thus the text isn't laid out
         | horizontally), which can happen several times with scanned
         | books. Compared to how well the Google OCR engine works,
         | tesseract should be much better than it is.
         | 
         | I wonder how difficult it is to develop a better OCR engine
         | than tesseract.
        
           | perihelions wrote:
           | Am I overlooking something, or is automating page rotation no
           | more work than just a 2d FFT?
        
             | notyoutube wrote:
             | Mind ELI5ing this? It seems neat.
        
               | perihelions wrote:
               | The Fourier transform maps plane waves to points. Blocks
               | of regularly-spaced text have a periodic character, with
               | the period length of their line spacing; their Fourier
               | transform (I think??) would, in 2d frequency space, have
               | amplitude peaks on vectors that have the same angle as
               | the rotation of the lines.
        
               | notyoutube wrote:
               | thanks!
        
             | aidenn0 wrote:
             | I think it's more typical to low-pass (i.e. blur) the image
             | and then use a line-detection algorithm like the Hough
             | transform. Properly deskewed text should have prominent
             | horizontal white lines.
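
The "prominent horizontal white lines" property can be turned into a skew
estimate directly. The sketch below swaps in a brute-force projection-profile
search rather than the Hough transform named above: it tries candidate angles
and keeps the one whose row histogram is sharpest. The ink-pixel
representation, the function names, and the synthetic page are all
assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def rotate(points: Iterable[Point], degrees: float) -> List[Point]:
    """Rotate (x, y) points about the origin by the given angle."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def profile_score(points: Iterable[Point]) -> int:
    """Sum of squared per-row ink counts; largest when ink packs into few rows."""
    rows = Counter(round(y) for _, y in points)
    return sum(n * n for n in rows.values())


def estimate_skew(points: List[Point], max_deg: float = 5.0, step: float = 0.5) -> float:
    """Return the candidate skew angle whose correction gives the sharpest profile."""
    steps = int(max_deg / step)
    candidates = [i * step for i in range(-steps, steps + 1)]
    # Undo each candidate skew by rotating the ink the opposite way.
    return max(candidates, key=lambda a: profile_score(rotate(points, -a)))


# Synthetic page: five "text lines", each a 200 x 3 pixel bar, 10 rows apart.
page = [(x, y) for top in range(0, 50, 10) for y in range(top, top + 3) for x in range(200)]
skewed = rotate(page, 2.0)  # simulate a scan tilted by 2 degrees
print(estimate_skew(skewed))
```

A real scan would first need binarizing into ink coordinates, and a finer
step or a coarse-to-fine search would be needed for sub-degree accuracy.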
        
               | perihelions wrote:
               | - _" Hough transform"_
               | 
               | Oh, that one has much nicer properties--thank you!
        
           | HeatrayEnjoyer wrote:
           | Tesseract is last gen. Multimodal is SOTA, and can handle
           | even heavily distorted or destroyed text.
        
           | aidenn0 wrote:
           | You are supposed to deskew (and de-warp if the image isn't
           | flat) images before running through tesseract. There are
           | other tools for doing that.
        
         | bsharper wrote:
         | I've found EasyOCR to work much better at pulling text out of
         | irregular or unknown images. Requires more resources than
         | tesseract but gets much better results in my projects.
        
           | aidenn0 wrote:
           | It seems to be not significantly better than tesseract for
           | non-mixed images though, and it takes about 5 orders of
           | magnitude longer to process a page on my machine; I can
           | literally read a book 100 times faster than EasyOCR can
           | process a book on my Ryzen 7 2700.
        
             | aidenn0 wrote:
             | To give an idea, I started EasyOCR on a single page of a
             | book at ~200dpi before I posted the above comment. It is
             | still running over 3 hours later.
        
         | chazeon wrote:
         | I have thought about using tesseract to OCR the TOC and
         | generate something like this. But there are just so many
         | edge cases that make the whole process fail. For example, how
         | do you handle it if the title breaks into two lines? What if
         | the page number is not recognized correctly? For example, 10
         | can be read as 1o. What if there are dots? Maybe you can use
         | GPT to clean the extracted text.
         | 
         | In the end, I found ChatGPT-4's multimodal capability can
         | recognize text + page number pairs well if I feed screenshots
         | of TOC into it, and I have settled on that.
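
The three edge cases listed above (wrapped titles, misread digits, dot
leaders) can be partly handled with plain heuristics. This is a minimal
sketch, not a robust parser: the lookalike table, the regular expression, and
the function name are assumptions, and a wrapped title whose first line
happens to end in a number would still be misparsed.

```python
import re
from typing import List, Tuple

# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
DIGIT_LOOKALIKES = str.maketrans({"o": "0", "O": "0", "l": "1", "I": "1", "S": "5"})

# Title, then a run of spaces and/or leader dots, then a page token at end of line.
ENTRY = re.compile(r"^(?P<title>.+?)[\s.]+(?P<page>[0-9oOlIS]+)$")


def parse_toc(lines: List[str]) -> List[Tuple[str, int]]:
    """Turn OCR'd ToC lines into (title, page) pairs."""
    entries = []
    pending = ""  # first part of a title that wrapped onto the next line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        # Require at least one real digit so words like "IS" are not taken as pages.
        if match and any(ch.isdigit() for ch in match["page"]):
            title = f"{pending} {match['title']}".strip()
            page = int(match["page"].translate(DIGIT_LOOKALIKES))
            entries.append((title, page))
            pending = ""
        else:
            # No page token found: assume the title continues on the next line.
            pending = f"{pending} {line}".strip()
    return entries


sample = [
    "1 Introduction ........ 1o",
    "2 A Very Long Chapter Title That",
    "  Wraps Onto a Second Line .... 25",
    "Index....3l2",
]
for title, page in parse_toc(sample):
    print(page, title)
```

Anything the heuristics get wrong still has to be corrected by hand or by a
second pass, which is the gap the multimodal approach is meant to close.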
        
       | bionade24 wrote:
       | Does someone know a tool that is sed- or awk-like for PDFs?
        
         | perihelions wrote:
         | pdftk is a CLI tool that can extract and edit PDF metadata such
         | as tables of contents*, if that's what you mean?
         | 
         | *(Table of contents? Tables of content?)
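
For the table-of-contents case, pdftk accepts bookmarks in the same plain-text
record format that its dump_data operation prints. A minimal sketch that
renders such records; the function name and the sample entries are
hypothetical, and non-ASCII titles may need pdftk's update_info_utf8
operation instead of update_info.

```python
from typing import Iterable, List, Tuple


def pdftk_bookmarks(entries: Iterable[Tuple[int, str, int]]) -> str:
    """Render (level, title, page) triples as pdftk bookmark records."""
    records: List[str] = []
    for level, title, page in entries:
        records += [
            "BookmarkBegin",
            f"BookmarkTitle: {title}",
            f"BookmarkLevel: {level}",
            f"BookmarkPageNumber: {page}",
        ]
    return "\n".join(records) + "\n"


toc = [(1, "Introduction", 15), (2, "Motivation", 17), (1, "Methods", 37)]
print(pdftk_bookmarks(toc), end="")
```

Saving the output as bookmarks.txt and running
`pdftk in.pdf update_info bookmarks.txt output out.pdf` should then write a
copy of the PDF carrying that outline.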
        
         | manaskarekar wrote:
         | Perhaps you can use lesspipe with sed/awk?
         | 
         | https://github.com/wofr06/lesspipe
        
         | maxerickson wrote:
         | Qpdf has tools that go in that direction (but not a flat text
         | format that allows arbitrary edits).
         | 
         | https://qpdf.readthedocs.io/en/stable/qdf.html#qdf
        
       | janpmz wrote:
       | Recently I found the getToc function in PyMuPDF was too slow. I
       | told them about it in their Discord, and a day later they had
       | fixed it. Now it only takes a couple of milliseconds. I'm using
       | it for my project pdftomp3. Pdf.tocgen looks useful too, but I'm
       | not sure if I can use it because of the license?
        
         | zerop wrote:
         | What is pdftomp3? Interested to know.
        
           | janpmz wrote:
           | You can upload a PDF and convert the chapters into MP3s
           | (either original text or simplified text). But for PDFs
           | without a table of contents, you can only convert single
           | pages.
        
         | karma_pharmer wrote:
         | Of course you can use it.
         | 
         | What you can't do is deny others the same freedoms the license
         | grants to you.
        
           | cge wrote:
           | There does appear to be some licensing awkwardness here. The
           | license is nominally GPLv3, but it says it is based on AGPLv3
           | projects. It also appears to misidentify (it may have been
           | correct at the time) PyMuPDF as GPLv3 when that appears to
           | actually be AGPLv3. My assumption is that using this would
           | require complying with AGPLv3?
           | 
           | There's the additional oddity that a portion of the
           | repository (the recipes directory) is licensed under CC-BY-
           | NC-SA, and so the repository is not fully open source. This
           | is particularly confusing, however, as the functional content
           | of the recipes directory appears to be mostly records of
           | direct observations of parameter choices in external
           | documents and tools, and so doesn't seem like it would be
           | copyrightable at all, at least in the US.
        
       | zerop wrote:
       | Can I use this tool to get a TOC for arXiv papers?
        
       | jbecke wrote:
       | We (macro.com) have something similar but without the recipe part
       | in our pdf/word processor. It works pretty well on numbered
       | headings but not so well on non-numbered. We're thinking of
       | porting over to LLMs at some point.
        
       | maCDzP wrote:
       | That is a beautiful website. I got lost in it and it created a
       | sense of wonder. Nice.
        
       | pseingatl wrote:
       | Since when do you need the hyperref package to generate a table
       | of contents under LaTeX (as the author claims)?
       | 
       | \tableofcontents does the job.
        
       | chazeon wrote:
       | I have been thinking about this, but for a while now, I have
       | settled on using ChatGPT's GPT-4V multimodal capability to
       | generate a text file containing the titles and pages based on
       | screenshots of the TOC. After that, I used a pikepdf-based Python
       | script to bake the TOC into the PDF I had.
       | 
       | The upside, compared to Krasjet's approach, is that this works
       | not only for text-based PDFs but also for scanned PDFs, even old
       | scanned journal papers.
       | 
       | The downside is that, before baking the TOCs, you need to make
       | adjustments to the PDF as sometimes the empty pages are not
       | included. You also need to calculate the offset for the prologs,
       | cover, etc. I have a script for this kind of adjustment, but
       | there always is manual intervention involved.
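
The offset step described above can be sketched as follows, assuming one
known anchor (the PDF page on which printed page 1 appears) and no pages
missing after that point. All names here are hypothetical, and this is not
the commenter's actual script.

```python
from typing import List, Tuple


def page_offset(printed_page: int, pdf_page: int) -> int:
    """Offset to add to a printed page number to get the 1-based PDF page."""
    return pdf_page - printed_page


def apply_offset(
    entries: List[Tuple[str, int]], offset: int, page_count: int
) -> List[Tuple[str, int]]:
    """Shift printed page numbers to PDF pages, rejecting out-of-range results."""
    shifted = []
    for title, printed in entries:
        pdf_page = printed + offset
        if not 1 <= pdf_page <= page_count:
            raise ValueError(f"{title!r} maps to page {pdf_page}, outside 1..{page_count}")
        shifted.append((title, pdf_page))
    return shifted


# Printed page 1 sits on PDF page 15 (cover, preface, and contents come first).
offset = page_offset(printed_page=1, pdf_page=15)
print(apply_offset([("Introduction", 1), ("Methods", 23)], offset, page_count=300))
```

If blank pages were dropped during scanning, a single offset no longer holds
and the entries would need one offset per affected range, which is where the
manual intervention comes in.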
        
       ___________________________________________________________________
       (page generated 2024-04-29 23:01 UTC)