[HN Gopher] Show HN: I made a website to semantically search ArX...
       ___________________________________________________________________
        
       Show HN: I made a website to semantically search ArXiv papers
        
       As a grad student (and an ADHDer), I had trouble doing literature
       review systematically. To combat this, I made a website that finds
       similar papers using the meaning of the thing I am looking for.  I
       used MixedBread's [^1] embedding model to generate vectors from the
       abstracts. I store and search similar vectors using Milvus [^2] and
       finally use Gradio [^3] to serve the frontend. I update the vector
       database weekly by pulling the metadata dataset from Kaggle [^4].
        To speed up the search process on my free Oracle instance, I
       binarise the embeddings and use Hamming distance as a metric.  I
       would love your feedback on the site :) Happy Holidays!  [1]:
       https://www.mixedbread.ai/docs/embeddings/mxbai-embed-large-...
       [2]: https://milvus.io/ [3]: https://www.gradio.app/ [4]:
       https://www.kaggle.com/datasets/Cornell-University/arxiv
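         
        For anyone curious, the pipeline is roughly the following (a
        sketch only: the model name comes from [^1]; the field names,
        the 1024-d dimension and the connection details are assumptions,
        not my exact code):
         
            import numpy as np
            from pymilvus import DataType, MilvusClient
            from sentence_transformers import SentenceTransformer
         
            model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
            client = MilvusClient(uri="http://localhost:19530")
         
            # sign-binarise fp32 embeddings: 1024 floats -> 128 bytes
            def embed_binary(texts):
                vecs = model.encode(texts, normalize_embeddings=True)
                return [np.packbits(v > 0).tobytes() for v in vecs]
         
            # one-off: collection keyed by arXiv id, Hamming metric
            schema = client.create_schema(auto_id=False)
            schema.add_field("arxiv_id", DataType.VARCHAR,
                             is_primary=True, max_length=32)
            schema.add_field("embedding", DataType.BINARY_VECTOR, dim=1024)
            index = client.prepare_index_params()
            index.add_index(field_name="embedding",
                            index_type="BIN_FLAT", metric_type="HAMMING")
            client.create_collection("abstracts", schema=schema,
                                     index_params=index)
         
            # weekly update: embed new abstracts and upsert them
            ids, abstracts = ["0000.00000"], ["placeholder abstract"]
            client.upsert("abstracts", data=[
                {"arxiv_id": i, "embedding": e}
                for i, e in zip(ids, embed_binary(abstracts))])
         
            # query time: embed the query the same way, search by Hamming
            hits = client.search("abstracts", data=embed_binary(["query"]),
                                 limit=10, output_fields=["arxiv_id"])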
        
       Author : Quizzical4230
       Score  : 202 points
       Date   : 2024-12-25 05:44 UTC (17 hours ago)
        
 (HTM) web link (papermatch.mitanshu.tech)
 (TXT) w3m dump (papermatch.mitanshu.tech)
        
       | shigeru94 wrote:
        | Is this similar to https://www.semanticscholar.org (from the
        | Allen Institute for AI)?
        
         | triilman wrote:
          | I think it's more like this website: https://arxivxplorer.com/
        
         | Quizzical4230 wrote:
          | It is more like what triilman commented, but with all
          | components open-source. I plan to add filters with keyword
          | support soon enough! (actually waiting on Milvus)
        
       | lgas wrote:
       | This might've saved you some time:
       | https://huggingface.co/NeuML/txtai-arxiv
        
         | cluckindan wrote:
         | The dataset there is almost a year old.
        
           | dmezzetti wrote:
            | It was just updated last week. The dataset page on HF only
            | has the scripts; the raw data resides over on Kaggle.
        
         | Quizzical4230 wrote:
         | Actually, yeah XD
        
       | shishy wrote:
       | I enjoy seeing projects like this!
       | 
        | If you expand beyond arXiv, keep in mind that coverage matters
        | for lit reviews; unfortunately, the big publishers (Elsevier
        | and Springer) are forcing other indices like OpenAlex, etc. to
        | remove abstracts, so they're harder to get.
       | 
       | Have you checked out other tools like undermind.ai, scite.ai, and
       | elicit.org?
       | 
        | You might consider what else a dedicated product workflow for
        | lit reviews includes besides search.
       | 
       | (used to work at scite.ai)
        
         | Quizzical4230 wrote:
         | Thank you for the appreciation and great feedback!
         | 
          | | If you expand beyond arXiv, keep in mind that coverage
          | matters for lit reviews,
          | 
          | I do have PaperMatchBio [^1] for bioRxiv and PaperMatchMed
          | [^2] for medRxiv; however, I agree that having multiple sites
          | for different domains isn't ideal. I am yet to create a
          | synchronization pipeline for these two, so the results may be
          | a little stale.
         | 
         | | unfortunately the big publishers (Elsevier and Springer) are
         | forcing other indices like OpenAlex, etc. to remove abstracts
         | so they're harder to get.
         | 
         | This sounds like a real issue in expanding the coverage.
         | 
         | | Have you checked out other tools like undermind.ai, scite.ai,
         | and elicit.org?
         | 
          | I did, but maybe not thoroughly enough. I will check these and
          | add complementary features.
         | 
         | | You might consider what else a dedicated product workflow for
         | lit reviews includes besides search
         | 
         | Do you mean a reference management system like Mendeley/Zotero?
         | 
         | [1]: https://papermatchbio.mitanshu.tech/ [2]:
         | https://papermatchmed.mitanshu.tech/
        
           | eric-burel wrote:
            | Unusual use case, but I write literature reviews for the
            | French R&D tax credit system, and we specifically need to:
            | focus on the most recent papers, stay on topic for the very
            | specific problem a company has, potentially include grey
            | literature (tech blog articles from renowned companies), and
            | be as exhaustive as possible when it comes to freely
            | accessible papers (we are more OK with missing paid papers
            | unless they are really popular). A "dedicated product
            | workflow" could be about taking business use cases like that
            | into account. This is a real business problem; the Google
            | Scholar lock-up is annoying and I would pay for something
            | better than what exists.
        
             | Quizzical4230 wrote:
              | This is quite unique. I believe a custom solution might
              | serve you better than Google Scholar.
        
               | eric-burel wrote:
               | This can be seen as technology watch, as opposed to a
               | thesis literature review for instance. Google Scholar
                | gives the best results but sadly doesn't really want you
                | to build products on top of it: no API, no scraping.
               | Breaking this monopoly would be a huge step forward,
               | especially when coupled with semantic search.
        
       | mrjay42 wrote:
       | I think you have an encoding problem <3
       | 
        | If you search for "UPC high performance computing evaluation",
        | you'll see a paper with garbled characters in the authors'
        | names (second result for that search).
        
         | Quizzical4230 wrote:
         | Most definitely. Thank you for pointing this out!
        
       | antman wrote:
        | Nice work. Any other technical details: why did you use those
        | embeddings, did you binarize them, did you use any special
        | prompts?
        
         | Quizzical4230 wrote:
         | At the beginning of the project, MixedBread's embedding model
         | was small and leading the MTEB leaderboard [^1], hence I went
         | with it.
         | 
          | Yes, I did binarize them for a faster search experience.
          | However, I think the search quality degrades significantly
          | after the first 10 results, which are the same as the fp32
          | search but in a shuffled order. I am planning to add a
          | reranking strategy to push better results upwards.
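          | 
          | Roughly, the binarise-and-Hamming part works like this (plain
          | numpy sketch to show the idea, not the actual Milvus code):
          | 
          |     import numpy as np
          | 
          |     def binarise(vecs):
          |         # sign-threshold fp32 embeddings, pack bits into uint8
          |         return np.packbits(vecs > 0, axis=-1)
          | 
          |     def hamming_top_k(query_bits, db_bits, k=10):
          |         # XOR then popcount = number of differing bits
          |         diff = np.bitwise_xor(db_bits, query_bits)
          |         dist = np.unpackbits(diff, axis=-1).sum(axis=-1)
          |         top = np.argsort(dist)[:k]
          |         return top, dist[top]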
         | 
         | At the moment, this is plain search with no special prompts.
         | 
         | [1]: https://huggingface.co/spaces/mteb/leaderboard
        
       | bubaumba wrote:
        | This is cool, but how about local semantic search through tens
        | of thousands of articles and books? Surely I'm not the first;
        | there should be some tools already.
        
         | Quizzical4230 wrote:
          | I was definitely thinking about something like this for
          | PaperMatch itself, where anyone can pull a Docker image and
          | search through the articles locally! Do you think this idea is
          | worth pursuing?
        
           | bubaumba wrote:
            | Absolutely worth doing. Here is an interesting related video
            | on local RAG:
           | 
           | https://www.youtube.com/watch?v=bq1Plo2RhYI
           | 
            | I'm not an expert, but I'll do it for learning, then open
            | source it if it works. As far as I understand, this approach
            | requires a vector database and an LLM, which doesn't have to
            | be big. Technically it can be implemented as a local web
            | server. It should be easy to use: just type and get a list
            | sorted by relevance.
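            | 
            | Roughly the shape I have in mind (untested sketch; the model
            | name is just a small default, and it skips the LLM part):
            | 
            |     from pathlib import Path
            |     import numpy as np
            |     import gradio as gr
            |     from sentence_transformers import SentenceTransformer
            | 
            |     model = SentenceTransformer("all-MiniLM-L6-v2")
            |     paths = sorted(Path("docs").rglob("*.txt"))
            |     texts = [p.read_text(errors="ignore") for p in paths]
            |     vecs = model.encode(texts, normalize_embeddings=True)
            | 
            |     def search(query):
            |         q = model.encode([query], normalize_embeddings=True)[0]
            |         scores = vecs @ q  # cosine, vectors are normalised
            |         order = np.argsort(-scores)[:10]
            |         return "\n".join(f"{scores[i]:.3f}  {paths[i]}"
            |                          for i in order)
            | 
            |     # local web server: type a query, get a ranked list
            |     gr.Interface(fn=search, inputs="text",
            |                  outputs="text").launch()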
        
             | Quizzical4230 wrote:
             | Perfect!
             | 
              | Although, at the moment I am only using retrieval without
              | any LLM involved. I might try integrating one if it
              | significantly improves UX without compromising speed.
        
       | dmezzetti wrote:
       | Excellent project.
       | 
       | As mentioned in another comment, I've put together an embeddings
       | database using the arxiv dataset
       | (https://huggingface.co/NeuML/txtai-arxiv) recently.
       | 
        | For those interested in the literature search space, here are a
        | couple of other projects I've worked on that may be of interest.
       | 
       | annotateai (https://github.com/neuml/annotateai) - Annotates
       | papers with LLMs. Supports searching the arxiv database mentioned
       | above.
       | 
       | paperai (https://github.com/neuml/paperai) - Semantic search and
       | workflows for medical/scientific papers. Built on txtai
       | (https://github.com/neuml/txtai)
       | 
       | paperetl (https://github.com/neuml/paperetl) - ETL processes for
       | medical and scientific papers. Supports full PDF docs.
        
         | shishy wrote:
          | paperetl is cool, saving that for later, nice! I did something
          | similar in-house with Grobid in the past (great project by
          | Patrice).
        
           | dmezzetti wrote:
           | Grobid is great. paperetl is the workhorse of the projects
           | mentioned above. Good ole programming and multiprocessing to
           | churn through data.
        
         | Quizzical4230 wrote:
         | Thank you for your kind words.
         | 
         | These look like great projects, I will surely check them out :D
        
       | namanyayg wrote:
       | What are other good areas where semantic search can be useful?
       | I've been toying with the idea for a while to play around and
       | make such a webapp.
       | 
       | Some of the current ideas I had:
       | 
       | 1. Online ads search for marketers: embed and index video + image
       | ads, allow natural language search to find marketing inspiration.
        | 2. Multi e-commerce platform search for shopping: find products
        | across Sephora, Zara, H&M, etc.
       | 
       | I don't know if either are good enough business problems worth
       | solving tho.
        
         | bubaumba wrote:
          | 3. Quick lookup into internal documents. Almost any company
          | needs it. Navigating a file-system-like hierarchy is slow and
          | limited; that was the old way.
         | 
         | 4. Quick lookup into the code to find relevant parts even when
         | the wording in comments is different.
        
           | imadethis wrote:
            | For 4, it would be neat to first pass each block of code
            | (function or class or whatever) through an LLM to extract
            | meaning, and then embed some combination of the LLM-parsed
            | meaning, the docstring and comments, and the function name.
            | Then do semantic search against that.
           | 
           | That way you'd cover what the human thinks the block is for
           | vs what an LLM "thinks" it's for. Should cover some amount of
           | drift in names and comments that any codebase sees.
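            | 
            | A sketch of that idea (summarise() is a stand-in for
            | whatever LLM call you'd use; the rest is plain embedding
            | search):
            | 
            |     import ast
            |     from sentence_transformers import SentenceTransformer
            | 
            |     model = SentenceTransformer("all-MiniLM-L6-v2")
            | 
            |     def summarise(code):
            |         return ""  # placeholder: swap in a real LLM call
            | 
            |     def index_functions(path):
            |         src = open(path, encoding="utf-8").read()
            |         names, texts = [], []
            |         for node in ast.walk(ast.parse(src)):
            |             if isinstance(node, (ast.FunctionDef,
            |                                  ast.ClassDef)):
            |                 code = ast.get_source_segment(src, node)
            |                 doc = ast.get_docstring(node) or ""
            |                 # human signal (name, docstring) + LLM view
            |                 names.append(node.name)
            |                 texts.append(node.name + "\n" + doc + "\n"
            |                              + summarise(code))
            |         return names, model.encode(
            |             texts, normalize_embeddings=True)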
        
         | jondwillis wrote:
         | Please stop making ad tech better. Someone else might, but you
         | don't have to.
        
       | ukuina wrote:
       | Related: emergentmind.com
        
         | Quizzical4230 wrote:
          | Thank you for the link. Do you know of a reliable small model
          | to add on top of vanilla search for a similar experience?
        
       | tokai wrote:
       | Nice but I have to point out that a systematic review cannot be
       | done with semantic search and should never be done in a preprint
       | collection.
        
         | Quizzical4230 wrote:
         | Agreed.
        
         | dmezzetti wrote:
         | Why?
        
           | Quizzical4230 wrote:
            | Not sure about the semantic search part, but preprints are
            | not peer reviewed and hence not vetted. However, at the
            | current pace of papers on arXiv (5k+/week), relying on peer
            | review alone might halt progress.
        
             | dmezzetti wrote:
              | "Why not semantic search" was the bigger question.
        
       | omarhaneef wrote:
        | For every application of semantic search, I'd love to see what
        | the benefit is over text search. Is there a benchmark to see if
        | it improves the search? Subjectively, did you find it surfaced
        | new papers? Is this more useful in certain domains?
        
         | Quizzical4230 wrote:
         | All benefits depend on the ability of the embedding model.
         | Semantic embeddings understand nuances, so they can match
         | abstracts that align conceptually even if no exact keywords
          | overlap. For example, queries for "neural networks" and "deep
          | learning" can and should fetch similar papers.
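          | 
          | A quick way to see the effect (any sentence-transformers
          | model will do; the exact numbers vary by model):
          | 
          |     from sentence_transformers import SentenceTransformer, util
          | 
          |     model = SentenceTransformer("all-MiniLM-L6-v2")
          |     a, b, c = model.encode(["neural networks", "deep learning",
          |                             "organic chemistry"])
          |     print(util.cos_sim(a, b))  # high despite no shared keyword
          |     print(util.cos_sim(a, c))  # noticeably lower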
         | 
          | Subjectively, yes. I sent this around to my peers and they
          | said it helped them find new authors/papers in the field while
          | preparing their manuscripts.
         | 
         | | Is this more useful in certain domains?
         | 
         | I don't think I have the capacity to comment on this.
        
         | feznyng wrote:
          | One of the factors is how users phrase their queries. On some
          | level, people are used to full-text search, but semantic
          | search shines when they ask literal questions with terminology
          | that may not match the answer.
        
           | woodson wrote:
           | Query keyword expansion works quite well for that without
           | semantic search (although it can reduce precision).
        
       | gaborme wrote:
        | Nice. Why not use a full-text search engine like self-hosted
        | Typesense?
        
         | Quizzical4230 wrote:
          | Full-text search would be redundant, as arXiv.org already
          | supports it. For semantic search, Typesense has a limited
          | collection of embedding models. [^1]
         | 
         | [1]: https://huggingface.co/typesense/models/tree/main
        
       | mskar wrote:
       | This is awesome! If you're interested, you could add a search
       | tool client for your backend in paper-qa
       | (https://github.com/Future-House/paper-qa). Then paper-qa users
       | would be able to use your semantic search as part of its
       | workflow.
        
         | Quizzical4230 wrote:
         | paper-qa looks pretty cool. I will do so!
        
       | andai wrote:
       | Did you notice a difference in performance after binarization? Do
       | you have a way to measure performance?
        
         | Quizzical4230 wrote:
         | Absolutely!
         | 
         | Here is a graph showing the difference. [^1]
         | 
          | Known IDs are arXiv IDs that are already in the vector
          | database; unknown IDs need their metadata to be fetched via
          | the API. Free text is embedded via the model's API.
         | 
         | FLAT and IVF_FLAT are different indexes used for the search.
         | [^2]
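          | 
          | For binary vectors those correspond roughly to the following
          | Milvus index configs (the nlist value is just an example):
          | 
          |     flat = {"index_type": "BIN_FLAT",
          |             "metric_type": "HAMMING"}
          |     ivf_flat = {"index_type": "BIN_IVF_FLAT",
          |                 "metric_type": "HAMMING",
          |                 "params": {"nlist": 1024}}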
         | 
         | [1]:
         | https://raw.githubusercontent.com/mitanshu7/dumpyard/refs/he...
         | 
         | [2]: https://zilliz.com/learn/how-to-pick-a-vector-index-in-
         | milvu...
        
           | binarymax wrote:
           | That looks great for speed, but what about recall?
        
             | Quizzical4230 wrote:
              | That takes a major hit. For binary embeddings, the top 10
              | results are the same as fp32, albeit shuffled. However,
              | after the 10th result, I think quality degrades quite a
              | bit. I was planning to add a reranking strategy for binary
              | embeddings. What do you think?
        
               | intalentive wrote:
                | Recommend reranking. You basically get full-resolution
                | performance for a negligible latency hit. (Unless you
                | need to make two network calls...)
                | 
                | MixedBread supports matryoshka embeddings too, so that's
                | another option to explore on the latency-recall curve.
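                | 
                | E.g., keep the fp32 vectors around and re-score only the
                | Hamming candidates (sketch; fp32_store is whatever id ->
                | full-precision vector lookup is available):
                | 
                |     import numpy as np
                | 
                |     def rerank(query_fp32, candidate_ids, fp32_store):
                |         cand = np.stack([fp32_store[i]
                |                          for i in candidate_ids])
                |         scores = cand @ query_fp32  # cosine if normalised
                |         order = np.argsort(-scores)
                |         return [candidate_ids[i] for i in order]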
        
       | madbutcode wrote:
        | This looks great! I have used the bioRxiv version of PaperMatch
        | and it gives pretty good results!
        
       | Maro wrote:
       | Very cool!
       | 
       | Add a "similar papers" link to each paper, that will make this
       | the obvious way to discover topics by clicking along the similar
       | papers.
        
       | swyx wrote:
        | 1. why MixedBread's model?
       | 
        | 2. how much efficiency gain did you see binarising the
        | embeddings/using Hamming distance?
       | 
        | 3. why Milvus over other vector stores?
       | 
       | 4. did you automate the weekly metadata pull? just a simple cron
       | job? anything else you need orchestrated?
       | 
        | User thoughts on searching for "transformers on byte level not
        | token level": results were good, but it didn't turn up
        | https://arxiv.org/abs/2412.09871 <- which is more recent and
        | which more people might want.
       | 
        | Also, you might want more result density, so perhaps a UI
        | option to collapse the abstracts and display more results at
        | first glance.
        
       | maCDzP wrote:
        | I want to crawl Sci-Hub and plug it into this and see what
        | happens.
        
       | fasa99 wrote:
        | For what it's worth, back in the day (a few years ago, before
        | the LLM boom) I found that on a similarly sized vector database
        | (gensim / doc2vec), it's possible to just brute-force a vector
        | search, e.g. with SSE- or AVX-type instructions. You can code it
        | in C and expose a Python API. Your data appears to be a few
        | gigs, so that's feasible for real-time CPU brute force, <200 ms.
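        | 
        | For reference, numpy alone gets close to that without dropping
        | down to C (sketch; the file name and dimension are made up):
        | 
        |     import numpy as np
        | 
        |     # memory-mapped fp32 matrix, shape (n_papers, 1024)
        |     vecs = np.load("abstracts.npy", mmap_mode="r")
        | 
        |     def brute_force(query, k=10):
        |         scores = vecs @ query          # BLAS does the SIMD work
        |         top = np.argpartition(-scores, k)[:k]
        |         return top[np.argsort(-scores[top])]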
        
       ___________________________________________________________________
       (page generated 2024-12-25 23:00 UTC)