[HN Gopher] Pdf.tocgen
___________________________________________________________________
Pdf.tocgen
Author : nbernard
Score : 161 points
Date : 2024-04-28 08:51 UTC (1 days ago)
(HTM) web link (krasjet.com)
(TXT) w3m dump (krasjet.com)
| mrtx01 wrote:
| What a beautiful website!
| GrumpyNl wrote:
| And built with very little CSS and basic HTML.
| oneeyedpigeon wrote:
| > basic HTML
|
| Apart from the code blocks. Syntax-highlighting in `<code>`
| elements, when, browser manufacturers?
| porker wrote:
| It took a bit of digging from the Pdf.tocgen page, but
| https://krasjet.com/colophon/ tells us how it's created.
| lelandfe wrote:
| Uncommon to see someone so caring about the specifics of
| their chosen font. Love it.
| mbana wrote:
| I love the typography on the site. What fonts are you using? I'm
| on a mobile browser so I can't really see.
| porker wrote:
| According to https://krasjet.com/colophon/:
|
| > The typeface you are reading right now is Garibaldi by
| Henrique Beier, with some custom tweaks, as you might have
| noticed. I hope you enjoy it as much as I do. If you want some
| free alternatives, check out Alegreya ht and Vollkorn, though I
| still prefer the look and details of Garibaldi (just look at
| all the punctuation marks!).
| StayTrue wrote:
| Garibaldi, $300 for up to 10k page views per month.
| karma_pharmer wrote:
| I was going to post the same thing. This has to be the most
| beautifully typeset webpage I've seen in quite a while. Not
| just the font but the layout too.
|
| It's almost like this page is part of the web from some
| parallel universe, which has been disenshittified to the same
| extent that our own web has been... well, you know.
| papichulo2023 wrote:
| Looks like a very good tool to integrate with Knowledge Graphs or
| just RAG (llm).
| perihelions wrote:
| - _"That is, you shouldn't expect it to work with scanned PDFs"_
|
| It's surprisingly easy to extend this type of workflow to scanned
| pdfs (as opposed to software-generated, text-containing ones).
| tesseract(1) makes short work of ToC pages with --psm set to 6
| (an OCR setting that tends to collapse convoluted text layouts
| into a regular, software-parseable output).
|
| It should also be straightforward, but I don't know of an out-of-
| the-box solution, to automate that example of extracting "text
| that looks like a header"-based on page layout/relative
| positioning, or font weight. (I'm working on an adjacent problem,
| an automatic re-layout of raster documents to squeeze out
| whitespace and make them slightly nicer on small e-ink devices.
| Text islands are trivial to identify. I don't know how to
| quantify font weight, or things like that. I'm "wasting" a lot of
| time diving into lots of mathematics rabbit holes, but I don't
| know in advance which ones will be productive or not).
| felipefar wrote:
| tesseract is fine for basic use cases, but it fails when the
| image is tilted (and thus the text isn't laid out
| horizontally), which can happen several times with scanned
| books. Compared to how well the Google OCR engine works,
| tesseract should be much better than it is.
|
| I wonder how difficult it is to develop a better OCR engine
| than tesseract.
| perihelions wrote:
| Am I overlooking something, or is automating page rotation no
| more work than just a 2d FFT?
| notyoutube wrote:
| Mind ELI5ing this? It seems neat.
| perihelions wrote:
| The Fourier transform maps plane waves to points. Blocks
| of regularly-spaced text have a periodic character, with
| the period length of their line spacing; their Fourier
| transform (I think??) would, in 2d frequency space, have
| amplitude peaks on vectors that have the same angle as
| the rotation of the lines.
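The idea above can be tried on a toy image. This is a minimal sketch, not a deskewing tool: it assumes the "text" is a pure sinusoidal stripe pattern whose wave vector lands exactly on a DFT bin, and the helper names (`striped_image`, `skew_angle_deg`) are made up for illustration. The strongest non-DC peak of the 2D spectrum is perpendicular to the stripes, so its tilt away from the vertical frequency axis equals the tilt of the stripes. The sign of the angle depends on whether y is taken to point up or down.

```python
import cmath
import math


def dft(seq):
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(seq)
    tw = [cmath.exp(-2j * math.pi * k / n) for k in range(n)]
    return [sum(seq[t] * tw[(k * t) % n] for t in range(n)) for k in range(n)]


def dft2(img):
    """2D DFT via rows then columns; result is indexed [u][v] (x-freq, y-freq)."""
    rows = [dft(r) for r in img]               # rows[y][u]
    return [dft(list(col)) for col in zip(*rows)]


def striped_image(n, ku, kv):
    """Hypothetical test image: stripes with integer wave vector (ku, kv)."""
    return [[0.5 + 0.5 * math.cos(2 * math.pi * (ku * x + kv * y) / n)
             for x in range(n)] for y in range(n)]


def skew_angle_deg(img):
    """Angle of the stripes, read off the strongest non-DC spectral peak."""
    n = len(img)
    spec = dft2(img)
    best_mag, bu, bv = 0.0, 0, 0
    for u in range(n):
        for v in range(n):
            if (u, v) != (0, 0) and abs(spec[u][v]) > best_mag:
                best_mag, bu, bv = abs(spec[u][v]), u, v
    # Map bin indices to signed frequencies.
    fu = bu - n if bu > n // 2 else bu
    fv = bv - n if bv > n // 2 else bv
    # The spectrum of a real image is symmetric; keep the half with fv >= 0.
    if fv < 0 or (fv == 0 and fu > 0):
        fu, fv = -fu, -fv
    # The peak is perpendicular to the stripes, so rotate back by 90 degrees.
    return math.degrees(math.atan2(-fu, fv))


img = striped_image(32, -1, 6)   # stripes tilted by atan(1/6), about 9.46 degrees
angle = skew_angle_deg(img)
```

Real scans are not single sinusoids, so in practice the peak would be smeared and the angle resolution limited by the image size.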
| notyoutube wrote:
| thanks!
| aidenn0 wrote:
| I think it's more typical to low-pass (i.e. blur) the image
| and then use a line-detection algorithm like the Hough
| transform. Properly deskewed text should have prominent
| horizontal white lines.
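A rough sketch of the line-detection route, under simplifying assumptions: the input is a list of idealized ink coordinates rather than a blurred bitmap, only near-horizontal angles are tried, and the votes come from ink rather than from the white gaps. For each candidate angle, every point votes for a Hough-style (theta, rho) bin; the sum of squared votes is an assumed scoring choice that rewards angles where the votes collapse into a few sharp rows. The function names are hypothetical.

```python
import math
from collections import Counter


def hough_skew_deg(points, candidates=range(-15, 16)):
    """Return the candidate angle (degrees) whose rho histogram is sharpest."""
    best_angle, best_score = 0, -1
    for deg in candidates:
        theta = math.radians(deg)
        s, c = math.sin(theta), math.cos(theta)
        # rho is the signed distance of the line through (x, y) at this angle.
        votes = Counter(round(y * c - x * s) for x, y in points)
        score = sum(v * v for v in votes.values())
        if score > best_score:
            best_angle, best_score = deg, score
    return best_angle


def synthetic_lines(angle_deg, baselines=(20, 40, 60, 80), width=200):
    """Made-up test data: four straight 'text lines' tilted by angle_deg."""
    slope = math.tan(math.radians(angle_deg))
    return [(x, y0 + x * slope) for y0 in baselines for x in range(width)]


estimate = hough_skew_deg(synthetic_lines(5))
```

Once the angle is known, the page would be rotated by its negative before OCR.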
| perihelions wrote:
| - _"Hough transform"_
|
| Oh, that one has much nicer properties--thank you!
| HeatrayEnjoyer wrote:
| Tesseract is last gen. Multimodal is SOTA, and can handle
| even heavily distorted or destroyed text.
| aidenn0 wrote:
| You are supposed to deskew (and de-warp if the image isn't
| flat) images before running through tesseract. There are
| other tools for doing that.
| bsharper wrote:
| I've found EasyOCR to work much better at pulling text out of
| irregular or unknown images. Requires more resources than
| tesseract but gets much better results in my projects.
| aidenn0 wrote:
| It seems to be not significantly better than tesseract for
| non-mixed images though, and it takes about 5 orders of
| magnitude longer to process a page on my machine; I can
| literally read a book 100 times faster than EasyOCR can
| process a book on my Ryzen 7 2700.
| aidenn0 wrote:
| To give an idea, I started EasyOCR on a single page of a
| book at ~200dpi before I posted the above comment. It is
| still running over 3 hours later.
| chazeon wrote:
| I have thought about using tesseract, using it to OCR the TOC
| and generate something like this. But there are just so many
| edge cases that make the whole process fail. For example, how
| do you handle it if the title breaks into two lines? What if
| the page number is not recognized correctly? For example, 10
| can be read as 1o. What if there are dots? Maybe you can use GPT to
| clean the extracted text.
|
| In the end, I found ChatGPT-4's multimodal capability can
| recognize text + page number pairs well if I feed screenshots
| of TOC into it, and I have settled on that.
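Some of the edge cases listed above can be handled with plain text processing. The sketch below is one possible approach, not what any commenter actually used: it strips dot leaders, maps a small assumed table of OCR digit confusions (such as `o` for `0`) in the page token, and joins a line without a page number onto the next line. It will misread titles that end in a short token like "II", so the output still needs a manual check.

```python
import re

# Assumed, deliberately small table of OCR digit confusions.
OCR_DIGITS = str.maketrans({"o": "0", "O": "0", "l": "1", "I": "1"})

# Title, then a dot leader and/or spaces, then a short page token at line end.
TOC_LINE = re.compile(r"^(?P<title>.*?)[\s.]*[\s.](?P<page>[0-9oOlI]{1,4})$")


def parse_toc(lines):
    """Turn OCR'd ToC lines into (title, page) pairs."""
    entries = []
    pending = ""  # text of a title that wrapped onto the next line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = TOC_LINE.match(line)
        if m and m.group("title").strip():
            title = (pending + " " + m.group("title")).strip()
            page = int(m.group("page").translate(OCR_DIGITS))
            entries.append((title, page))
            pending = ""
        else:
            # No page number found: assume the title continues on the next line.
            pending = (pending + " " + line).strip()
    return entries


toc = parse_toc([
    "1 Introduction . . . . . 1o",
    "1.2 Methods.....12",
    "3 A long title that",
    "wraps onto a second line .... 42",
])
```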
| bionade24 wrote:
| Does someone know a tool that is sed- or awk-like for PDFs?
| perihelions wrote:
| pdftk is a CLI tool that can extract and edit PDF metadata such
| as tables of contents*, if that's what you mean?
|
| *(Table of contents? Tables of content?)
| manaskarekar wrote:
| Perhaps you can use lesspipe with sed/awk?
|
| https://github.com/wofr06/lesspipe
| maxerickson wrote:
| Qpdf has tools that go in that direction (but not a flat text
| format that allows arbitrary edits).
|
| https://qpdf.readthedocs.io/en/stable/qdf.html#qdf
| janpmz wrote:
| Recently I found the getToc function in PyMuPdf was too slow. I
| told them about it in their discord, and a day later they had
| fixed it. Now it only takes a couple of milliseconds. I'm using
| it for my project pdftomp3. Pdf.tocgen looks useful too, but I'm
| not sure if I can use it because of the license?
| zerop wrote:
| Interested to know what is pdftomp3?
| janpmz wrote:
| You can upload a PDF and convert the chapters into MP3s
| (either original text or simplified text). But for PDFs
| without a table of contents, you can only convert single
| pages.
| karma_pharmer wrote:
| Of course you can use it.
|
| What you can't do is deny others the same freedoms the license
| grants to you.
| cge wrote:
| There does appear to be some licensing awkwardness here. The
| license is nominally GPLv3, but it says it is based on AGPLv3
| projects. It also appears to misidentify (it may have been
| correct at the time) PyMuPDF as GPLv3 when that appears to
| actually be AGPLv3. My assumption is that using this would
| require complying with AGPLv3?
|
| There's the additional oddity that a portion of the
| repository (the recipes directory) is licensed under CC-BY-
| NC-SA, and so the repository is not fully open source. This
| is particularly confusing, however, as the functional content
| of the recipes directory appears to be mostly records of
| direct observations of parameter choices in external
| documents and tools, and so doesn't seem like it would be
| copyrightable at all, at least in the US.
| zerop wrote:
| Can I use this tool to get a ToC for arXiv papers?
| jbecke wrote:
| We (macro.com) have something similar but without the recipe part
| in our pdf/word processor. It works pretty well on numbered
| headings but not so well on non-numbered. We're thinking of
| porting over to LLMs at some point.
| maCDzP wrote:
| That is a beautiful website. I got lost in it and it created a
| sense of wonder. Nice.
| pseingatl wrote:
| Since when do you need the hyperref package to generate a table
| of contents under LaTeX (as the author claims)?
|
| \tableofcontents does the job.
| chazeon wrote:
| I have been thinking about this, but for a while now, I have
| settled on using the multimodal capability of GPT-4V in ChatGPT to
| generate a text file containing the titles and pages based on
| screenshots of the TOC. After that, I used a pikepdf-based Python
| script to bake the TOC into the PDF I had.
|
| The upside, compared to Krasjet's approach, is that this works
| not only for text-based PDFs but also for scanned PDFs, even old
| scanned journal papers.
|
| The downside is that, before baking the TOCs, you need to make
| adjustments to the PDF as sometimes the empty pages are not
| included. You also need to calculate the offset for the prologs,
| cover, etc. I have a script for this kind of adjustment, but
| there always is manual intervention involved.
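The offset adjustment described above is simple arithmetic once one anchor is known. This is a hypothetical sketch, not the commenter's script: it assumes you have checked by hand which PDF page carries one printed page number, that all entries come at or after that anchor, and that you know the printed numbers of any blank pages the scan left out.

```python
def to_pdf_pages(entries, anchor_printed, anchor_pdf, missing=()):
    """Map printed page numbers to 1-based PDF page numbers.

    entries:        (title, printed_page) pairs from the ToC
    anchor_printed: a printed page number you located by hand
    anchor_pdf:     the PDF page on which that printed page appears
    missing:        printed numbers of pages absent from the PDF
    """
    offset = anchor_pdf - anchor_printed  # accounts for cover, prologue, etc.
    result = []
    for title, printed in entries:
        # Each omitted page between the anchor and this entry pulls it forward.
        skipped = sum(1 for p in missing if anchor_printed < p < printed)
        result.append((title, printed + offset - skipped))
    return result


pages = to_pdf_pages(
    [("Introduction", 1), ("Methods", 10)],
    anchor_printed=1,
    anchor_pdf=15,
    missing=(4,),
)
```

The resulting page numbers are what an outline-writing library would need; spot-checking a few entries against the PDF remains advisable.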
___________________________________________________________________
(page generated 2024-04-29 23:01 UTC)