[HN Gopher] Show HN: I made a website to semantically search ArX...
___________________________________________________________________
Show HN: I made a website to semantically search ArXiv papers
As a grad student (and an ADHDer), I had trouble doing literature
review systematically. To combat this, I made a website that finds
similar papers based on the meaning of what I am looking for. I
used MixedBread's [^1] embedding model to generate vectors from the
abstracts. I store and search similar vectors using Milvus [^2] and
finally use Gradio [^3] to serve the frontend. I update the vector
database weekly by pulling the metadata dataset from Kaggle [^4].
To speed up the search process on my free Oracle instance, I
binarise the embeddings and use Hamming distance as a metric. I
would love your feedback on the site :) Happy Holidays! [1]:
https://www.mixedbread.ai/docs/embeddings/mxbai-embed-large-...
[2]: https://milvus.io/ [3]: https://www.gradio.app/ [4]:
https://www.kaggle.com/datasets/Cornell-University/arxiv
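The binarise-and-Hamming idea above can be sketched roughly like this (a minimal NumPy illustration with made-up dimensions and random data, not the site's actual code):

```python
import numpy as np

def binarise(embeddings: np.ndarray) -> np.ndarray:
    """Quantise float embeddings to packed bits (sign threshold at 0)."""
    return np.packbits(embeddings > 0, axis=-1)

def hamming_search(query_bits: np.ndarray, db_bits: np.ndarray, k: int = 10):
    """Return indices of the k nearest packed-bit vectors by Hamming distance."""
    # XOR highlights differing bits; popcount via unpackbits + sum.
    dists = np.unpackbits(query_bits ^ db_bits, axis=-1).sum(axis=-1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
db = binarise(rng.standard_normal((1000, 1024)))  # 1000 fake abstracts
q = binarise(rng.standard_normal((1, 1024)))      # one fake query
top = hamming_search(q, db)
```

A 1024-dim fp32 vector shrinks from 4 KB to 128 bytes this way, which is why the whole index fits in memory on a small free-tier instance.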
Author : Quizzical4230
Score : 202 points
Date : 2024-12-25 05:44 UTC (17 hours ago)
(HTM) web link (papermatch.mitanshu.tech)
(TXT) w3m dump (papermatch.mitanshu.tech)
| shigeru94 wrote:
| Is this similar to https://www.semanticscholar.org (from Allen
| Institute for AI) ?
| triilman wrote:
| I think more like this website https://arxivxplorer.com/
| Quizzical4230 wrote:
| It is more like what triilman commented, but with all
| components open-source. I plan to add filters with keyword
| support soon! (actually waiting for Milvus)
| lgas wrote:
| This might've saved you some time:
| https://huggingface.co/NeuML/txtai-arxiv
| cluckindan wrote:
| The dataset there is almost a year old.
| dmezzetti wrote:
| It was just updated last week. The dataset page on HF only
| has the scripts, the raw data resides over on Kaggle.
| Quizzical4230 wrote:
| Actually, yeah XD
| shishy wrote:
| I enjoy seeing projects like this!
|
| If you expand beyond arxiv, keep in mind since coverage matters
| for lit reviews, unfortunately the big publishers (Elsevier and
| Springer) are forcing other indices like OpenAlex, etc. to remove
| abstracts so they're harder to get.
|
| Have you checked out other tools like undermind.ai, scite.ai, and
| elicit.org?
|
| You might consider what else a dedicated product workflow for lit
| reviews includes besides search
|
| (used to work at scite.ai)
| Quizzical4230 wrote:
| Thank you for the appreciation and great feedback!
|
| | If you expand beyond arxiv, keep in mind since coverage
| matters for lit reviews,
|
| I do have PaperMatchBio [^1] for bioRxiv and PaperMatchMed [^2]
| for medRxiv; however, I agree that having a separate site per
| domain isn't ideal. I have yet to create a synchronization
| pipeline for these two, so the results may be a little stale.
|
| | unfortunately the big publishers (Elsevier and Springer) are
| forcing other indices like OpenAlex, etc. to remove abstracts
| so they're harder to get.
|
| This sounds like a real issue in expanding the coverage.
|
| | Have you checked out other tools like undermind.ai, scite.ai,
| and elicit.org?
|
| I did, but maybe not thoroughly enough. I will check these out
| and add complementary features.
|
| | You might consider what else a dedicated product workflow for
| lit reviews includes besides search
|
| Do you mean a reference management system like Mendeley/Zotero?
|
| [1]: https://papermatchbio.mitanshu.tech/ [2]:
| https://papermatchmed.mitanshu.tech/
| eric-burel wrote:
| Unusual use case but I write literature reviews for French
| R&D tax cut system, and we specifically need to: focus on the
| most recent papers, stay on topic for a very specific problem
| a company has, potentially include grey literature (tech blog
| articles from renowned corporations), and be as exhaustive as
| possible when it comes to freely accessible papers (we are
| more OK with missing paid papers unless they are really
| popular). A "dedicated product workflow" could be about
| taking business use cases like that into account. This is a
| real business problem; the Google Scholar lock-up is annoying
| and I would pay for something better than what exists.
| Quizzical4230 wrote:
| This is quite unique. I believe a custom solution might
| help you better than Google Scholar.
| eric-burel wrote:
| This can be seen as technology watch, as opposed to a thesis
| literature review, for instance. Google Scholar gives the
| best results but sadly doesn't really want you to build
| products on top of it: no API, no scraping. Breaking this
| monopoly would be a huge step forward, especially when
| coupled with semantic search.
| mrjay42 wrote:
| I think you have an encoding problem <3
|
| If you search for "UPC high performance computing evaluation",
| you'll see a paper with garbled characters in the authors'
| names (the second result for that search).
| Quizzical4230 wrote:
| Most definitely. Thank you for pointing this out!
| antman wrote:
| Nice work. Any other technical comments: why did you use
| those embeddings, did you binarize them, did you use any
| special prompts?
| Quizzical4230 wrote:
| At the beginning of the project, MixedBread's embedding model
| was small and leading the MTEB leaderboard [^1], hence I went
| with it.
|
| Yes, I did binarize them for a faster search experience.
| However, I think the search quality degrades significantly
| after the first 10 results, which are the same as in fp32
| search but in a shuffled order. I am planning to add a
| reranking strategy to push better results upwards.
|
| At the moment, this is plain search with no special prompts.
|
| [1]: https://huggingface.co/spaces/mteb/leaderboard
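The reranking strategy mentioned in this comment could look something like the following hypothetical sketch: retrieve a wide candidate set with the binary index, then re-score just those candidates against the stored fp32 vectors with cosine similarity (names and shapes are illustrative):

```python
import numpy as np

def rerank(query_fp32: np.ndarray, candidate_ids: list,
           db_fp32: np.ndarray, k: int = 10) -> list:
    """Re-score binary-search candidates with full-precision cosine similarity."""
    cands = db_fp32[candidate_ids]
    # Cosine similarity of each candidate against the float query.
    sims = cands @ query_fp32 / (
        np.linalg.norm(cands, axis=1) * np.linalg.norm(query_fp32) + 1e-9
    )
    order = np.argsort(-sims)[:k]  # highest similarity first
    return [candidate_ids[i] for i in order]
```

Since only a few dozen candidates are re-scored, the extra latency is negligible compared with scanning the whole fp32 index.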
| bubaumba wrote:
| This is cool, but how about local semantic search through tens of
| thousands articles and books. Sure I'm not the first, there
| should be some tools already.
| Quizzical4230 wrote:
| I was definitely thinking about something like this for
| PaperMatch itself, where anyone could pull a Docker image and
| search through the articles locally! Do you think this idea
| is worth pursuing?
| bubaumba wrote:
| Absolutely worth doing. Here is an interesting related video
| on local RAG:
|
| https://www.youtube.com/watch?v=bq1Plo2RhYI
|
| I'm not an expert, but I'll build it as a learning exercise,
| then open-source it if it works. As far as I understand, this
| approach requires a vector database and an LLM, which doesn't
| have to be big. Technically it can be implemented as a local
| web server. It should be easy to use: just type a query and
| get a list sorted by relevance.
| Quizzical4230 wrote:
| Perfect!
|
| Although, at the moment I am only using retrieval without any
| LLM involved. I might try integrating one if it significantly
| improves UX without compromising speed.
| dmezzetti wrote:
| Excellent project.
|
| As mentioned in another comment, I've put together an embeddings
| database using the arxiv dataset
| (https://huggingface.co/NeuML/txtai-arxiv) recently.
|
| For those interested in the literature search space, here are
| a couple of other projects I've worked on that may be of
| interest.
|
| annotateai (https://github.com/neuml/annotateai) - Annotates
| papers with LLMs. Supports searching the arxiv database mentioned
| above.
|
| paperai (https://github.com/neuml/paperai) - Semantic search and
| workflows for medical/scientific papers. Built on txtai
| (https://github.com/neuml/txtai)
|
| paperetl (https://github.com/neuml/paperetl) - ETL processes for
| medical and scientific papers. Supports full PDF docs.
| shishy wrote:
| paperetl is cool, saving that for later, nice! I did
| something similar in-house with Grobid in the past (great
| project by Patrice).
| dmezzetti wrote:
| Grobid is great. paperetl is the workhorse of the projects
| mentioned above. Good ole programming and multiprocessing to
| churn through data.
| Quizzical4230 wrote:
| Thank you for your kind words.
|
| These look like great projects, I will surely check them out :D
| namanyayg wrote:
| What are other good areas where semantic search can be useful?
| I've been toying with the idea for a while to play around and
| make such a webapp.
|
| Some of the current ideas I had:
|
| 1. Online ads search for marketers: embed and index video + image
| ads, allow natural language search to find marketing inspiration.
| 2. Multi e-commerce platform search for shopping: find
| products across Sephora, Zara, H&M, etc.
|
| I don't know if either are good enough business problems worth
| solving tho.
| bubaumba wrote:
| 3. Quick lookup into internal documents. Almost any company
| needs it. Navigating a file-system-like hierarchy is slow and
| limited; that was the old way.
|
| 4. Quick lookup into the code to find relevant parts even when
| the wording in comments is different.
| imadethis wrote:
| For 4, it would be neat to first pass each block of code
| (function, class, or whatever) through an LLM to extract its
| meaning, and then embed some combination of the LLM-parsed
| meaning, the docstring and comments, and the function name.
| Then do semantic search against that.
|
| That way you'd cover what the human thinks the block is for
| vs what an LLM "thinks" it's for. Should cover some amount of
| drift in names and comments that any codebase sees.
| jondwillis wrote:
| Please stop making ad tech better. Someone else might, but you
| don't have to.
| ukuina wrote:
| Related: emergentmind.com
| Quizzical4230 wrote:
| Thank you for the link. Do you know of any reliable small
| model to add on top of vanilla search for a similar
| experience?
| tokai wrote:
| Nice but I have to point out that a systematic review cannot be
| done with semantic search and should never be done in a preprint
| collection.
| Quizzical4230 wrote:
| Agreed.
| dmezzetti wrote:
| Why?
| Quizzical4230 wrote:
| Not sure about the semantic search part, but preprints are
| not peer reviewed and hence not vetted. However, at the
| current pace of submissions on arXiv (5k+/week), peer review
| alone might not be able to keep up.
| dmezzetti wrote:
| Why not semantic search was the bigger question.
| omarhaneef wrote:
| For every application of semantic search, I'd love to see
| what the benefit is over text search. Is there a benchmark to
| see if it improves the search? Subjectively, did you find it
| surfaced new papers? Is this more useful in certain domains?
| Quizzical4230 wrote:
| All benefits depend on the ability of the embedding model.
| Semantic embeddings capture nuance, so they can match
| abstracts that align conceptually even if no exact keywords
| overlap. For example, queries for "neural networks" and "deep
| learning" can and should fetch similar papers.
|
| Subjectively, yes. I sent this around my peers and they said it
| helped them find new authors/papers in the field while
| preparing their manuscripts.
|
| | Is this more useful in certain domains?
|
| I don't think I have the capacity to comment on this.
| feznyng wrote:
| One of the factors is how users phrase their queries. On some
| level, people are used to full-text search, but semantic
| search shines when they ask literal questions with
| terminology that may not match the answer.
| woodson wrote:
| Query keyword expansion works quite well for that without
| semantic search (although it can reduce precision).
| gaborme wrote:
| Nice. Why not use a full-text search like self-hosted Typesense?
| Quizzical4230 wrote:
| Full-text search would be redundant, as arXiv.org already
| supports it. For semantic search, Typesense has a limited
| collection of embedding models. [^1]
|
| [1]: https://huggingface.co/typesense/models/tree/main
| mskar wrote:
| This is awesome! If you're interested, you could add a search
| tool client for your backend in paper-qa
| (https://github.com/Future-House/paper-qa). Then paper-qa users
| would be able to use your semantic search as part of its
| workflow.
| Quizzical4230 wrote:
| paper-qa looks pretty cool. I will do so!
| andai wrote:
| Did you notice a difference in performance after binarization? Do
| you have a way to measure performance?
| Quizzical4230 wrote:
| Absolutely!
|
| Here is a graph showing the difference. [^1]
|
| "Known ID" means an arXiv ID already in the vector database;
| "Unknown IDs" need their metadata fetched via the API. Text
| queries are embedded via the model's API.
|
| FLAT and IVF_FLAT are different indexes used for the search.
| [^2]
|
| [1]:
| https://raw.githubusercontent.com/mitanshu7/dumpyard/refs/he...
|
| [2]: https://zilliz.com/learn/how-to-pick-a-vector-index-in-
| milvu...
| binarymax wrote:
| That looks great for speed, but what about recall?
| Quizzical4230 wrote:
| Recall takes a major hit. For binary embeddings, the top 10
| results are the same as with fp32, albeit shuffled. However,
| after the 10th result, I think the quality degrades quite a
| bit. I was planning to add a reranking strategy for binary
| embeddings. What do you think?
| intalentive wrote:
| Recommend reranking. You basically get full resolution
| performance for a negligible latency hit. (Unless you
| need to make two network calls...)
|
| MixedBread supports matryoshka embeddings too so that's
| another option to explore on the latency-recall curve.
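The matryoshka option intalientive mentions amounts to slicing a prefix of the embedding and renormalising it; a minimal sketch (the dimension choice is illustrative, and this only works for models trained with matryoshka-style objectives):

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and renormalise to unit length."""
    sub = emb[..., :dim]
    return sub / (np.linalg.norm(sub, axis=-1, keepdims=True) + 1e-9)
```

Shorter vectors mean proportionally less memory and faster distance computations, at the cost of some recall.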
| madbutcode wrote:
| This looks great! I have used the biorXiv version of papermatch
| and it gives pretty good results!
| Maro wrote:
| Very cool!
|
| Add a "similar papers" link to each paper, that will make this
| the obvious way to discover topics by clicking along the similar
| papers.
| swyx wrote:
| 1. why mixbread's model?
|
| 2. how much efficiency gain did you see binarising
| embeddings/using hamming distance?
|
| 3. why milvus over other vector stores?
|
| 4. did you automate the weekly metadata pull? just a simple cron
| job? anything else you need orchestrated?
|
| user thoughts on searching for "transformers on byte level
| not token level": it was good, but didn't turn up
| https://arxiv.org/abs/2412.09871 <- which is more recent and
| something more people might want
|
| also you might want more result density: perhaps a UI option
| to collapse the abstracts and display more results at first
| glance.
| maCDzP wrote:
| I want to crawl and plug Sci-Hub into this and see what
| happens.
| fasa99 wrote:
| For what it's worth, back in the day (a few years ago, before
| the LLM boom) I found on a similar-sized vector database
| (gensim / doc2vec) that it's possible to just brute-force a
| vector search, e.g. with SSE or AVX-type instructions. You
| can code it in C and expose a Python API. Your data appears
| to be a few gigs, so that's feasible for realtime CPU brute
| force, <200 ms.
___________________________________________________________________
(page generated 2024-12-25 23:00 UTC)