[HN Gopher] Search PDFs with Transformers and Python Notebook
       ___________________________________________________________________
        
       Search PDFs with Transformers and Python Notebook
        
       Author : alexcg1
       Score  : 109 points
       Date   : 2022-07-25 14:11 UTC (8 hours ago)
        
 (HTM) web link (colab.research.google.com)
 (TXT) w3m dump (colab.research.google.com)
        
       | [deleted]
        
       | CShorten wrote:
       | Congratulations Alex, super cool!
        
         | alexcg1 wrote:
         | Thanks man!
        
           | alexcg1 wrote:
           | Nice to meet another person in the super-obvious-username
           | club
        
       | divan wrote:
       | Can anyone recommend how to build the following solution?
       | 
       | - Full-text search on modern era PDFs (i.e. no need for OCR)
       | 
       | - Exact word search would suffice (fuzzy/contextual search
       | actually is less desirable)
       | 
       | - Cross-platform frontend part that highlights and jumps to the
       | found text within the document. Frontend should be embeddable
       | (i.e. not a SaaS or just standalone UI)
       | 
       | - As lightweight as possible (i.e. no Java, Python or Ruby)
       | 
       | - Long-term oriented stack (i.e. minimum dependencies, ideally
       | promise of compatibility)
       | 
       | I'm looking at Meilisearch or Bleve for indexing/backend, and
       | Syncfusion Flutter PDF viewer for frontend, but it still needs a
       | lot of gluing code and I would love to explore more options.
       | 
       | Google Pinpoint is pretty cool, and I use it a lot, but there is
       | only a hosted Google version, plus it's too smart (still can't get
       | it to do exact word search).
        
         | [deleted]
        
         | snowstormsun wrote:
         | pdfgrep with some formatting to add links that open the correct
         | page?
        
           | alexcg1 wrote:
           | Getting the URI of the original PDF would be straightforward
           | enough - I could whack that into the code tomorrow with a few
           | lines.
           | 
           | Opening up the correct page? I don't know of any standardized
           | PDF reader that supports that kind of thing. And the format
           | has such a history that even if it were supported
           | (technically by Adobe - don't even get me started on what PDF
           | readers support what formats), there's no guarantee the file
           | itself would even have that cooked in.
        
         | capableweb wrote:
         | > - As lightweight as possible (i.e. no Java, Python or Ruby)
         | 
         | I don't have suggestions for you, but I do have a question
         | regarding this point. Why wouldn't Java be considered
         | lightweight? Java literally runs on your SIM card, which is a
         | very bare-bones environment to run something on, I'd probably
         | consider something like that pretty lightweight.
        
           | divan wrote:
           | Ha, I'm from that generation of developers who have the
           | mental model of what is actually happening on the hardware
           | level when you run the program. Doesn't necessarily mean I
           | overoptimize or think about struct fields offsets or cache
           | branching, but I do have this in my mental model and just
           | can't unlearn it.
           | 
           | When I think about how much stuff needs to be moved in
           | cpu/memory/io bus just to launch simple "Hello, World" in
           | Java - I just cannot accept it. I do realize that for large
           | programs that overhead is small, but still the JVM concept is
           | something I want to avoid as much as possible. Plus the sheer
           | scale of Java SDK and amount of legacy and complexity behind
           | it exceeds my threshold of "avoiding complexity" by orders of
           | magnitude. And the nail in the coffin of "no java" stance is,
           | of course, experience with desktop Java applications.
           | Consistently the worst UX experience and performance I've seen
           | in 25 years among desktop apps.
        
             | alexcg1 wrote:
             | Don't remind me of desktop Java. What was that toolkit,
             | Swing(?), that was used in all the apps back in the day?
             | PDFs have a special place in Hell, but Java desktop UXen
             | deserve a whole special circle
        
               | divan wrote:
               | PDF history is pretty amazing, actually. The fact that
               | PDF survived over so many decades is something worth
               | reflecting upon :)
        
         | simonw wrote:
         | If you hadn't ruled out Python I'd be suggesting using
         | Datasette + SQLite FTS - I've been building a whole bunch of
         | different search engines on that (including ones for searching
         | within OCRd PDF files) and the cost to host is trivial, since
         | you just need to run a Python process somewhere with a binary
         | SQLite database file. I usually use Vercel, Cloud Run or Fly
         | for that.
         | 
         | One example of a search engine I've built like this is the one
         | on the Datasette website: https://datasette.io/-/beta?q=fts - I
         | wrote about how that works here:
         | https://simonwillison.net/2020/Dec/19/dogsheep-beta/
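
A minimal sketch of the SQLite FTS idea mentioned in this comment, not Datasette's or simonw's actual setup. It assumes the page text has already been extracted from the PDFs and that the Python build ships SQLite with the FTS5 extension (common, but not guaranteed). The table name `pages` and the helpers `build_index` and `search` are hypothetical names chosen for illustration.

```python
import sqlite3


def build_index(pages):
    """Build an in-memory FTS5 index from (doc, page, text) tuples.

    Assumption: text extraction already happened elsewhere.
    """
    conn = sqlite3.connect(":memory:")
    # doc and page are stored but not tokenized; only text is searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE pages USING fts5(doc UNINDEXED, page UNINDEXED, text)"
    )
    conn.executemany(
        "INSERT INTO pages (doc, page, text) VALUES (?, ?, ?)", pages
    )
    return conn


def search(conn, word):
    """Return (doc, page) pairs whose text contains the exact word."""
    # Quote the term so FTS5 treats it as a literal string, not query syntax.
    query = '"' + word.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT doc, page FROM pages WHERE pages MATCH ? ORDER BY doc, page",
        (query,),
    )
    return [(doc, int(page)) for doc, page in rows]


if __name__ == "__main__":
    index = build_index([
        ("a.pdf", 1, "The colour of the sky"),
        ("a.pdf", 2, "Color theory basics"),
        ("b.pdf", 1, "colourful things"),
    ])
    print(search(index, "colour"))
```

The default FTS5 tokenizer does no stemming, so "colour" matches neither "color" nor "colourful", which is the exact-word behaviour asked for at the top of this subthread. Matching is still case-insensitive.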
        
           | divan wrote:
           | Interesting, thanks! I'll take a look (datasette is amazing).
        
         | alexcg1 wrote:
         | - Modern PDFs - if you wanna extract text and images, then the
         | PDFSegmenter used in my example will work. If tables too, might
         | need some additional jiggery-pokery, but definitely doable. I
         | know other ppl using the same framework (Jina) who've
         | accomplished it.
         | 
         | - Exact word search - pretty simple. I've focused on more
         | advanced stuff because color vs colour is same same but
         | different. Also just because it's pretty easy since I'm just
         | using pre-defined building blocks, not manually integrating
         | stuff
         | 
         | - Cross platform frontend - I've seen a lyrics search frontend
         | [0] and I've built stuff in Streamlit before. Jina offers
         | RESTful/gRPC/WebSockets gateways so it can't be too tough
         | 
         | - Lightweight? I mean how lightweight do you want it? C? Bash?
         | Assembly? I've found Python good for text parsing
         | 
         | - Long-term: The notebook I wrote has a few dependencies (each
         | of which has its own), but compared to others they're
         | relatively lightweight.
         | 
         | - Gluing code: I've been using pre-existing building blocks,
         | and writing new Executors (i.e. building blocks) is relatively
         | straightforward, and then scaling them up with shards,
         | replicas, etc is just a parameter away.
         | 
         | I'm more into the search side than the PDF stuff. The PDF side
         | I've had experience with through bitter suffering and torment.
         | Not a fun format to work with (unless you're into sado-
         | masochism)
         | 
         | [0] https://github.com/jina-ai/examples/tree/master/multires-
         | lyr...
        
           | divan wrote:
           | Thanks for the elaborate answer.
           | 
           | Most of my use cases have to deal with 10-100 small PDF
           | documents, some - 1000-2000, but I don't want the solution to
           | choke on 10GB of huge PDFs (I was just uploading those to
           | Google Pinpoint). So Go or Rust for backend should be good
           | fit.
           | 
           | By cross-platform frontend I meant web/ios/android/desktop.
           | It's probably only Flutter, but I'm looking for other plugins
           | than Syncfusion's one to try. I know that sounds like
           | overkill for many people (a website with search would
           | suffice), but I already have cross-platform apps that would
           | benefit from this functionality, and web is a fallback there,
           | not the main option.
        
       | shubham_saboo wrote:
       | Wow, this is a really cool way to build full-fledged search, and
       | in a notebook too!
       | 
       | Does it work end-to-end with PDF as a data structure, or do we
       | have to use OCR and parse the text first to be able to search it?
       | Really curious.
        
         | alexcg1 wrote:
         | The version in the notebook is just for simple text-based PDFs.
         | I wrote some posts on our company blog[1] about the sheer
         | agonies of dealing with PDF as a data format, so wanted to
         | stick with as simple as possible for now.
         | 
         | That said, I'm planning future notebooks where you can perform
         | text-to-image or image-to-image search, integrate OCR, scale it
         | up, serve it, deploy it, etc.
         | 
         | [1] https://medium.com/jina-ai
        
           | shubham_saboo wrote:
           | Awesome, will be on the lookout for that!
        
             | alexcg1 wrote:
             | We've got quite a few other notebooks for other kinds of
             | search on the blog. Would love to hear your thoughts!
        
         | spaetzleesser wrote:
         | "PDF as a data structure"
         | 
         | Don't. PDF is a terrible format for storing machine readable
         | data. You lose a ton of information while you create the PDF
         | which you then painstakingly have to get back later (if that's
         | even possible)
        
           | alexcg1 wrote:
           | I may have misworded it (if I wrote those words - PDF rots
           | the brain and my memory likewise).
           | 
           | Agreed on the rest. PDFs don't store machine-readable data.
           | Often just pixelated scanned hot garbage dumpster fire text.
           | 
           | I hate PDFs but have to work with the satan-forsaken things.
           | Hence the notebook. It's my little way of trying to give my
           | little PDF-bespoked-hellscape a tiny little glow-up.
        
         | rahimnathwani wrote:
         | Under the hood, it uses
         | https://github.com/pdfminer/pdfminer.six which expects the text
         | to be stored as text.
        
           | alexcg1 wrote:
           | You mean the PDFSegmenter Executor in the notebook?
        
             | rahimnathwani wrote:
             | Yes
        
               | alexcg1 wrote:
               | PDFSegmenter also extracts images, which can then be
               | OCR'ed in the next step of the pipeline
        
         | alexcg1 wrote:
         | Incidentally Jina Hub [0] has a few OCR Executors [1][2] you
         | could integrate into my notebook (though you'd have to do some
         | rewiring to take images into account since it's a text-based
         | notebook)
         | 
         | [0] https://hub.jina.ai/
         | 
         | [1] https://hub.jina.ai/executor/w4p7905v
         | 
         | [2] https://hub.jina.ai/executor/78yp7etm
        
       | fzliu wrote:
       | I just tried this on all the papers I downloaded over the past
       | couple months - cool stuff.
       | 
       | How well would this work in a production setting, e.g. when
       | searching over millions of PDFs on arxiv (soon to be tens of
       | millions)? Follow-up: have you tried using a vector database such
       | as Milvus as the key piece of underlying infrastructure to avoid
       | having to implement deletes, failover, scaling, etc?
       | https://zilliz.com/learn/what-is-vector-database
        
         | alexcg1 wrote:
         | In terms of matching embeddings and performing similarity
         | search on text/images - folks are already using the framework
         | (Jina) for that and getting decent results.
         | 
         | In terms of processing the PDFs and extracting that data. idk.
         | That depends on a lot of factors - e.g. do you need to OCR the
         | PDFs or can you just extract text directly? Either way, should be
         | possible to write a module and then easily scale it up (Jina
         | supports shards/replicas). Anyway, lemme know. I'm in talks
         | with folks about this kind of shitshow...uh...use case now.
         | 
         | Jina supports multiple vector database backends, like Weaviate,
         | Qdrant and others. For others (like Milvus), suggest you ask on
         | the Slack [0] - responses tend to be fast.
         | 
         | [0] https://slack.jina.ai
        
       | gapovaj742 wrote:
       | okay but what if my PDF is non parseable? Not sure if Python's
       | any good for that
        
         | alexcg1 wrote:
         | In that case I'd use:
         | 
         | 1. PDFSegmenter (in the notebook) - extract the images of the
         | text (yup, it does images too)
         | 
         | 2. An OCR Executor [0][1] from Jina Hub [2] to extract the text
         | from the images
         | 
         | 3. Actually splice the text chunks together to be what you'd
         | expect - that's the tricky part. Even text splitting over pages
         | can be tricky to reassemble properly. PDFs are a pain in the
         | butt frankly.
         | 
         | [0] https://hub.jina.ai/executor/78yp7etm
         | 
         | [1] https://hub.jina.ai/executor/w4p7905v
         | 
         | [2] https://hub.jina.ai
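
A rough sketch of the splicing step described in this comment, assuming the OCR stage has already produced one plain-text string per page. It is not part of the notebook or of any Jina Executor; `splice_pages` is a hypothetical helper. It joins pages in order and glues together a word that was hyphenated across a page break.

```python
import re


def splice_pages(pages):
    """Join per-page text into one string, repairing page-break hyphenation.

    Assumption: a trailing "-" at the end of a page means a word was split.
    That is a heuristic; a genuinely hyphenated compound will be joined too.
    """
    joined = ""
    for page in pages:
        chunk = page.strip()
        if not chunk:
            continue  # skip pages where nothing was recognised
        if joined.endswith("-"):
            # Drop the hyphen and glue the two halves of the word together.
            joined = joined[:-1] + chunk
        elif joined:
            joined += " " + chunk
        else:
            joined = chunk
    # Collapse runs of whitespace left over from line breaks within a page.
    return re.sub(r"\s+", " ", joined)


if __name__ == "__main__":
    print(splice_pages(["The search en-", "gine works.", "  Next page. "]))
```

Real documents also need headers, footers and footnotes stripped before joining, which this sketch does not attempt.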
        
         | nicodjimenez wrote:
         | Mathpix PDF search is fully visually powered and does not use
         | underlying PDF metadata, even working on handwriting. It's a
         | great choice for researchers (especially in STEM) who want to
         | build a searchable archive of PDFs.
        
         | simonw wrote:
         | Amazon Textract does a phenomenal job of extracting text from
         | dodgy scanned PDFs - I've been running it against scanned
         | typewritten text and even handwritten journal text from the
         | 1880s with great results.
         | 
         | I built a tool for running OCR against every PDF in an S3
         | bucket (which costs about $1.50/thousand pages) here:
         | https://simonwillison.net/2022/Jun/30/s3-ocr/
        
       | alexcg1 wrote:
       | Wow, this post really took off! If anyone wants to read some of
       | my blog posts on building PDF search engines (and the pain,
       | torment and anguish that it causes) read:
       | 
       | - https://medium.com/jina-ai/building-an-ai-powered-pdf-search...
       | 
       | - https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...
       | 
       | - https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...
        
         | [deleted]
        
         | Malp wrote:
         | Great stuff, I went down the rabbit hole of building something
         | similar for synthesizing flash cards + Q/A pairs from textbook
         | PDFs about a year ago, and I would also emphasize that PDF
         | search is a janky nightmare to get within the ballpark of
         | usability :')
        
           | alexcg1 wrote:
           | I feel your pain my brother(?) [0] in suffering. That's why I
           | started simple in the notebook. Even trying to go a little
           | more complex just leads to exponential rabbit holes and
           | footguns.
           | 
           | [0] based on typical HN demographics, no assumptions here
        
       | PaulHoule wrote:
       | Does it really work better than a simple tfidf?
       | 
       | I worked on a neural search engine just when deep networks were
       | taking off and we knew that it worked because we had test data
       | that said certain documents were relevant for certain queries so
       | we could compute precision and recall curves. My experience was
       | that if the AUC metric is substantially improved customers really
       | notice the difference.
       | 
       | Very few search vendors do this kind of testing because it is
       | expensive and because enterprise customers seem to care more that
       | there are connectors to 800+ external systems than if the search
       | results are any good.
       | 
       | The main trouble I see with pdf search is that text extracted
       | from pdf files is full of junk punctuation including spaces so if
       | you are trying a bag of words based search the words are
       | corrupted. Seems to me you could build a neural model that works
       | around the brokenness of PDF but that isn't 'download a model
       | from spacy and pray' but would be a big job that starts with
       | getting 10 GB+ of PDF text.
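
A small sketch of the kind of evaluation this comment describes: given relevance judgments for a query, compute precision and recall at each cutoff of a ranked result list. The function names and the document IDs in the usage example are made up for illustration; any ranker (tf-idf or neural) could feed it.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall of the top-k results against a set of relevant IDs."""
    top = ranked[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def precision_recall_curve(ranked, relevant):
    """One (precision, recall) point per cutoff k = 1..len(ranked)."""
    return [
        precision_recall_at_k(ranked, relevant, k)
        for k in range(1, len(ranked) + 1)
    ]


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2"]   # hypothetical search output
    relevant = {"d1", "d2"}             # hypothetical judgments
    for k, (p, r) in enumerate(precision_recall_curve(ranked, relevant), 1):
        print(f"k={k} precision={p:.2f} recall={r:.2f}")
```

Running the same judged queries through two rankers and comparing the curves is one way to answer the "better than tf-idf?" question with data.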
        
         | alexcg1 wrote:
         | I'll agree that there's quite a bit of junk punctuation in the
         | extracted sentences (and sentence fragments), quite often from
         | short footnotes in the Wiki articles. Getting "good" PDFs with
         | open usage rights was a bit tricky, especially in a super
         | simple PDF format. I ended up PDF-printing from Chrome.
         | 
         | Needless to say, working with PDFs makes me want to pull my
         | hair out.
         | 
         | I also ended up writing the SpacySentencizer Executor instead
         | of using a "vanilla" sentencizer. That led to consistent
         | sentence splitting (so "J.R.R. Tolkien turned to pg. 3" would
         | be one sentence, not 5)
         | 
         | For testing, Jina allows you to swap out encoders with just a
         | couple of lines of code, so trying different methods out should
         | work just fine.
        
           | PaulHoule wrote:
           | I dunno, you can download a million or so PDFs from arxiv.org
           | and even more from archive.org. They aren't hard to find.
           | 
           | There is something to be said for roundtripping PDFs from a
           | source you control (you can accurately model the corruption
           | produced by a particular system) but you will certainly see new
           | and different phenomena if you try more.
           | 
           | I'd agree that spacy's sentence segmentation is better than
           | many of the alternatives.
        
             | alexcg1 wrote:
             | If new and different phenomena means new kinds of
             | corruption and downright weird behavior I'll end up having
             | no hair left!
             | 
             | Even printing the same page to PDF with Chrome and Firefox
             | delivers quite different results. Firefox was often
             | combining "f" and "i" into the fi ligature [0], which totally
             | changed the meaning of "finished" for example.
             | 
             | Downloading a lot of random PDFs from arxiv would be great
             | for making something battle-hardened and robust (and I'd
             | love to get the chance to do it sometime) but I didn't have
             | the time (or the remaining hair) to do it this time round.
             | 
             | [0] https://www.compart.com/en/unicode/U+FB01
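
One possible way to handle the ligature problem described here, sketched with Python's standard library rather than anything from the notebook: Unicode NFKC normalization folds U+FB01 back into the two letters "f" and "i". The helper name `fold_ligatures` is hypothetical.

```python
import unicodedata


def fold_ligatures(text):
    """Apply NFKC normalization, which decomposes compatibility ligatures.

    Caveat: NFKC also rewrites other compatibility characters (for example
    superscript digits), so it may be too aggressive for some corpora.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    extracted = "\ufb01nished"  # starts with the single U+FB01 ligature
    print(len(extracted), len(fold_ligatures(extracted)))
    print(fold_ligatures(extracted) == "finished")
```

Applying this to extracted text before tokenizing or indexing means a query for "finished" can match regardless of which browser printed the PDF.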
        
             | alexcg1 wrote:
             | And +1 to spaCy. I typically use it over Transformers
             | because it's SO much faster. I just used Transformers in
             | this example for a change. My Stack Overflow search
             | notebook [0] uses spaCy.
             | 
             | [0] https://colab.research.google.com/github/jina-
             | ai/workshops/b...
        
       | nicodjimenez wrote:
       | Mathpix Snip also supports PDF search, including for handwritten
       | content, and including math symbols in equations.
       | 
       | Disclaimer: I'm the founder.
        
         | ok_computer wrote:
         | Mathpix Snip for PDF to LaTeX is excellent. Thank you for the
         | free tier. It is helpful for transcribing PDF math homework sets
         | to use in the solution document without bugging the instructor
         | for their source.
        
         | alexcg1 wrote:
         | Oh, nifty! This is more a demo of a PDF search engine that you
         | could (in parts 1 thru x of the series) deploy to an intranet
         | (for internal knowledge search) or internet (for general
         | search), rather than a collaborative tool.
         | 
         | For handwritten/math symbols, I'm sure it wouldn't be too hard
         | to integrate something. The Jina Flow [0] concept makes
         | integrating new Executors [1] pretty easy.
         | 
         | I LOVE the testimonials on the site btw!
         | 
         | [0] https://docs.jina.ai/fundamentals/flow/
         | 
         | [1] https://docs.jina.ai/fundamentals/executor/
        
       | Stampo00 wrote:
       | Pardon me while I go add Optimus Prime to my corporate
       | letterhead.
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-07-25 23:01 UTC)