[HN Gopher] Pdfgrep - a commandline utility to search text in PDF files
___________________________________________________________________
Pdfgrep - a commandline utility to search text in PDF files
Author : kretaceous
Score : 187 points
Date : 2022-09-25 14:38 UTC (8 hours ago)
(HTM) web link (pdfgrep.org)
(TXT) w3m dump (pdfgrep.org)
| neilv wrote:
| I've long used `pdfgrep` in a very kludgey way, when stockpiling
| rare variants of particular used ThinkPad models (for looking up
| _actual_ specs based on an IBM "type" number shown in a photo on
| an eBay listing, since the seller's listing of specs is often
| incorrect).
|
| Example shell function:
|
|     t500grep() {
|         pdfgrep "$1" /home/user/doc/lenovo-psref-withdrawn-thinkpad-2005-to-2013-2013-12-447.pdf
|     }
|
| Example run:
|
|     $ t500grep 2082-3GU
|     2082-3GU T9400 2.53 2GB 15.4" WSXGA+ Cam GMA, HD 3650 160G 7200 DVD+-RW Intel 5100 2G Turbo 9 Bus 32 Aug 08
|
| The Lenovo services to look up this info come and go, and are
| also slow, but a saved copy of the data lives forever.
|
| (A non/less-kludgey way would be to get the information from all
| my IBM/Lenovo PSREFs into a lovingly-engineered
| database/knowledge schema, a simple CSV file, or a `grep`-able
| ad hoc text file.)
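|
| A minimal sketch of that last approach, assuming Poppler's
| pdftotext is available (paths here are illustrative):
|
|     # dump every PSREF PDF once into a single grep-able text file
|     for f in ~/doc/psref/*.pdf; do
|         pdftotext -layout "$f" - >> ~/doc/psref-all.txt
|     done
|
|     # then all future lookups are plain grep
|     grep 2082-3GU ~/doc/psref-all.txt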
| mistrial9 wrote:
| +1 -- after trying several tool sets over the years, _pdfgrep_
| is what's currently used daily around here
| donio wrote:
| For Emacs users there is also
| https://github.com/jeremy-compostella/pdfgrep which lets you
| browse the results and open the original docs, highlighting
| the selected match.
|
| It's on MELPA.
| nanna wrote:
| Tried this the other day but couldn't figure out how to use it.
| Is it invoked from eshell, dired, a pdftool buffer, or what?
| radicalbyte wrote:
| Cool. About 15 years ago I built something similar for PDF,
| Office (OpenXML), and plain text, as part of a search engine.
| Commercial/closed source of course, but it was super handy.
| majkinetor wrote:
| For Windows, there is dngrep: http://dngrep.github.io
| gabythenerd wrote:
| Love dnGrep. One of my friends used to combine all his PDFs
| into one and then use Adobe Reader to search, before I showed
| this to him. It's very powerful and also simple to use, even
| for non-technical users.
| moonshotideas wrote:
| Out of curiosity, how did you solve the issue of extracting text
| from the PDF error-free? Or did you use another package?
| kfarnung wrote:
| Looking at the list of dependencies, it seems like they use
| poppler-cpp to render the PDFs.
|
| https://gitlab.com/pdfgrep/pdfgrep#dependencies
| dmoo wrote:
| Poppler's pdftotext -layout is great.
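| E.g. (the trailing "-" sends the extracted text to stdout):
|
|     pdftotext -layout some.pdf - | grep -n 'pattern'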
| Frost1x wrote:
| Curious as well. About a year ago I was implementing what I
| naively thought would not be very difficult: verifying that a
| specific string existed (case-sensitive or -insensitive)
| within a PDF's text. I had many cases where the text was
| clearly rendered in the document, yet many libraries couldn't
| identify it. My understanding is that there's a lot of
| variance in how a rendered PDF may represent something one
| might assume is a simple string but really isn't, once you go
| down the rabbit hole (which wasn't too surprising, because I
| don't like to make simplicity assumptions). I couldn't find
| anything at the time that seemed to be error-free.
|
| Short of applying document rendering with OCR and text
| recognition, I ended up living with some error rate. I think
| pdfgrep was one of the tools I tested. Some other people just
| used libraries/tools as-is with no QA, but in my sample of
| several hundred verified documents, pdfgrep (and others)
| missed some.
| nip wrote:
| Tangential:
|
| Some time ago I built an automation [1] that identifies whether
| the given PDFs contain the specified keywords, outputting the
| result as a CSV file.
|
| It's similar to pdfgrep - probably much slower, but potentially
| more convenient for people who prefer GUIs.
|
| [1] https://github.com/bendersej/pdf-keywords-extractor
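|
| A rough shell equivalent of the same idea, as a sketch (uses
| pdfgrep's -c match count; the keyword and paths are
| illustrative):
|
|     {
|         echo "file,matches"
|         for f in /my/pdfs/*.pdf; do
|             echo "$f,$(pdfgrep -ci 'keyword' "$f")"
|         done
|     } > keywords.csv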
| mistermann wrote:
| A bit of a tangent, but does anyone know of a good utility that
| can index a large number of PDF files so one can do fast keyword
| searches across all of them simultaneously (free or paid)? It
| seems like this sort of utility used to be very common 15 years
| ago, but local search has kind of died on the vine.
| [deleted]
| kranner wrote:
| DEVONsphere Express, Recoll, also the latest major version of
| Calibre.
| crtxcr wrote:
| I am working on looqs, which can do that (and also renders the
| page immediately): https://github.com/quitesimpleorg/looqs
| donio wrote:
| Recoll is a nice one, uses Xapian for the index.
|
| https://www.lesbonscomptes.com/recoll/
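|
| Once the index is built, it can also be queried from a
| terminal (a sketch, assuming the recollq command-line tool
| that ships with Recoll):
|
|     recollq 'mime:application/pdf "some phrase"'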
| summm wrote:
| Recoll?
| tombrossman wrote:
| Yes, +1 for Recoll. It can also OCR those PDFs that are just
| an image of a page of text, and not 'live' text. Read the
| install notes and install the helper applications.
|
| When searching I'll first try the application or system's
| native search utility, but most of the time I end up opening
| Recoll to actually find the thing or snippet of text I want,
| and it has never failed me.
|
| https://www.lesbonscomptes.com/recoll/pages/features.html#do...
| llanowarelves wrote:
| dtSearch
| sumnole wrote:
| While we're asking for tool tips: does anyone know of a tool
| that will cache/index web pages as the user browses, so that it
| can be searched/viewed offline later?
| pletnes wrote:
| macOS's Spotlight can do this, AFAIK.
| pulvinar wrote:
| Yes, and Spotlight is also usable from the command line as
| mdfind, which has an -onlyin switch to restrict the search to
| a directory.
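| For example (a sketch; the kind: filter is Spotlight query
| syntax, and the directory and terms are illustrative):
|
|     mdfind -onlyin ~/Documents 'kind:pdf invoice'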
| marttt wrote:
| I've been using Ali G. Rudi's pdftxt with my own shell wrappers.
| From the homepage: "uses mupdf to extract text from pdf files;
| prefixes each line with its page number for searching."
|
| Usually I 1) pdftxt a file and 2) based on the results, jump to a
| desired page in Rudi's framebuffer PDF reader, fbpdf. For this,
| the page number prefix in pdftxt is a particularly nice default.
| No temptations with too many command line options either.
|
| https://litcave.rudi.ir/pdftxt-0.7.tar.gz
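|
| A minimal wrapper sketch of that workflow (assuming pdftxt
| takes the PDF path and writes the page-prefixed text to
| stdout; the function name is illustrative):
|
|     pdfsearch() {
|         # the page-number prefix tells you where to jump in fbpdf
|         pdftxt "$1" | grep -i "$2"
|     }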
| thriftwy wrote:
| The catdoc utility does the same for .doc MS Word files. Maybe
| for PDFs also.
| sauercrowd wrote:
| pdfgrep is great. Worked like a charm to diff updates to a
| contract.
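|
| One way to do that kind of diff, as a sketch (using Poppler's
| pdftotext rather than pdfgrep itself):
|
|     diff <(pdftotext old.pdf -) <(pdftotext new.pdf -)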
| Findecanor wrote:
| Just what I needed to search my collection of comp-sci articles.
| Regular grep fails on most PDFs.
|
| I installed the Ubuntu package. Thanks!
| ankrgyl wrote:
| DocQuery (https://github.com/impira/docquery), a project I work
| on, allows you to do something similar, but search over semantic
| information in the PDF files (using a large language model that
| is pre-trained to query business documents).
|
| For example:
|
|     $ docquery scan "What is the due date?" /my/invoices/
|     /my/invoices/Order1.pdf  What is the due date?: 4/27/2022
|     /my/invoices/Order2.pdf  What is the due date?: 9/26/2022
|     ...
|
| It's obviously a lot slower than "grepping", but very powerful.
| pugio wrote:
| Wow, this is exactly what I've been looking for, thank you! I
| just wish it were possible with these transformer models to
| extract a structured set of what the model "knows" (e.g. for
| easy search indexing). These natural-language question systems
| are a little too fuzzy sometimes.
| a1369209993 wrote:
| > to extract a structured set of what the model "knows"
|
| To be fair, that's impossible in the general case, since the
| model can know things (i.e. be able to answer queries) without
| knowing that it knows them (i.e. being able to produce a list
| of answerable queries by any means significantly more
| efficient than trying every query and seeing which ones
| work).
|
| As a reductio ad absurdum example, consider a 'model'
| consisting of a deniably encrypted key-value store, where
| it's outright cryptographically guaranteed that you can't
| efficiently enumerate the queries. Neural networks aren't
| _quite_ that bad, but (in the general-over-NNs case) they at
| least superficially _appear_ to be pretty close. (They're
| definitely not _reliably_ secure though; don't depend on
| that.)
| ankrgyl wrote:
| Can you tell me a bit more about your use case? A few things
| that come to mind:
|
| - There are some ML/transformer-based methods for extracting a
| known schema (e.g. NER) or an unknown schema (e.g. relation
| extraction).
|
| - We're going to add a feature to DocQuery called "templates"
| soon for some popular document types (e.g. invoices), plus a
| document classifier which will automatically apply the
| template based on the doc type.
|
| - Our commercial product (http://impira.com/) supports all of
| this and is a hosted solution (many of our customers use us
| to automate accounts payable, process insurance documents,
| etc.)
| mbb70 wrote:
| Since you mention insurance documents, could you speak to how
| well this would extract data from a policy document like
| https://ahca.myflorida.com/medicaid/Prescribed_Drug/drug_cri... ?
|
| The unstoppable administrative engine that is the American
| Healthcare system produces hundreds of thousands of
| continuously updated documents like this with no
| standardized format/structure.
|
| Manually extracting/normalizing this data into a queryable
| format is an industry all its own.
| ankrgyl wrote:
| It's very easy to try! Just plug that URL here:
| https://huggingface.co/spaces/impira/docquery.
|
| I tried a few questions:
|
|     What is the development date? -> June 20, 2017
|     What is the medicine? -> SPINRAZA(r) (nusinersen)
|     How many doses -> 5 doses
|     Did the patient meet the review criteria? -> Patient met initial review criteria.
|     Is the patient treated with Evrysdi? -> not
| pugio wrote:
| Your commercial product looks very cool, but my use case is
| in creating an offline-first local document storage system
| (data never reaches a cloud). I'd like to enable users to
| search through all documents for relevant pieces of
| information.
|
| The templates sound very cool - are they essentially just
| using a preset list of (natural language) queries tied to a
| particular document class? It seems like you're using a
| version of Donut for your document classification?
| ankrgyl wrote:
| > but my use case is in creating an offline-first local
| document storage system (data never reaches a cloud).
|
| Makes sense -- this is why we OSS'd DocQuery :)
|
| > The templates sound very cool - are they essentially
| just using a preset list of (natural language) queries
| tied to a particular document class? It seems like you're
| using a version of donut for your document
| classification?
|
| Yes that's the plan. We've done extensive testing with
| other approaches (e.g. NER) and realized that the
| benefits of using use-case specific queries
| (customizability, accuracy, flexibility for many use
| cases) outweigh the tradeoffs (NER only needs one
| execution for all fields).
|
| Currently, we support pre-trained Donut models for both
| querying and classification. You can play with it by
| adding the --classify flag to `docquery scan`. We're
| releasing some new stuff soon that should be faster and
| more accurate.
| pugio wrote:
| Sweet! I'll keep an eye on the repo. Thank you for open-
| sourcing DocQuery. I agree with your reasoning: my current
| attempts to find an NER model that covers all my use cases
| have come up short.
| ultrasounder wrote:
| This is so epic. I was just ruminating about this particular
| use case. Who are your typical customers - supply chain or
| purchasing? Also, I notice that you do text extraction from
| invoices. Are you using something similar to Chargrid or its
| derivative BERTgrid? Wishing you and your team more success!
| ankrgyl wrote:
| Thank you ultrasounder! Supply chain, construction,
| purchasing, insurance, financial services, and healthcare are
| our biggest verticals. Although we have customers doing just
| about anything you can imagine with documents!
|
| For invoices, we have a pre-trained model (demo here:
| https://huggingface.co/spaces/impira/invoices) that is pretty
| good at most fields, but within our product, it will
| automatically learn about your formats as you upload
| documents and confirm/correct predictions. The pre-trained
| model is based on LayoutLM and the additional learning we do
| uses a generative model (GMMs) that can learn from as little
| as one example.
|
| LMK if you have any other questions.
| greggsy wrote:
| I've been looking for an alternative to Acrobat's 'Advanced
| Search' capability, which lets you define a search directory
| (like NP++). Searching a directory is possible with pdfgrep
| and other tools, but the killer feature for me is that Acrobat
| displays results in context and lets you quickly sift through
| dozens or hundreds of results with ease.
|
| It's literally the only reason I have Acrobat installed.
| asicsp wrote:
| See also https://github.com/phiresky/ripgrep-all (`ripgrep`, but
| also search in PDFs, E-Books, Office documents, zip, tar.gz,
| etc.)
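|
| A minimal usage sketch (rga picks an extraction adapter per
| file type; the pattern and path are illustrative):
|
|     rga 'pattern' ~/documents/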
| xenodium wrote:
| Was already a big fan of ripgrep and pdfgrep. rga (ripgrep-all)
| was such a hidden gem for me.
| MisterSandman wrote:
| Used this for online quizzes for one of my courses. It was pretty
| good, but using the terminal for stuff like this still sucks
| since you can't just click into the PDF.
|
| I wish PDFs had a Ctrl-Shift-F search like VS Code's that
| could search for text across multiple PDFs in a directory.
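|
| (pdfgrep itself can at least cover the multi-file part; a
| minimal sketch, with -r to recurse and -n to print page
| numbers:
|
|     pdfgrep -rn 'search term' .
|
| though that still doesn't give you the click-into-the-PDF
| part.)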
| darkteflon wrote:
| I use Houdahspot on MacOS for this. You can either invoke it
| globally with a custom hotkey (I use hyper 's', for "search"),
| or from within a specific directory via an icon it adds to the
| top right corner of Finder.
| greggsy wrote:
| Advanced Search in Acrobat allows you to define a directory.
| It's the only reason I ever have it installed.
___________________________________________________________________
(page generated 2022-09-25 23:00 UTC)