[HN Gopher] Oracle of Zotero: LLM QA of Your Research Library
___________________________________________________________________
Oracle of Zotero: LLM QA of Your Research Library
Author : SubiculumCode
Score : 31 points
Date : 2023-11-26 18:13 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| dmezzetti wrote:
| Nice project!
|
| I've spent quite a lot of time in the medical/scientific
| literature space. With regard to LLMs, and RAG specifically,
| how the data is chunked is quite important. With that in
| mind, I have a couple of projects that might be beneficial
| additions.
|
| paperetl (https://github.com/neuml/paperetl) - supports
| parsing arXiv and PubMed data, and integrates with GROBID to
| parse metadata and text from arbitrary papers.
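|
| For anyone curious what GROBID produces before wiring in
| paperetl, here's a minimal sketch of calling a local GROBID
| server directly (the endpoint and port assume GROBID's
| default Docker setup; the PDF path is a placeholder):
|
|   import requests
|
|   # Send a PDF to a locally running GROBID instance
|   # (e.g. docker run -p 8070:8070 lfoppiano/grobid)
|   with open("paper.pdf", "rb") as f:
|       response = requests.post(
|           "http://localhost:8070/api/processFulltextDocument",
|           files={"input": f},
|       )
|
|   # GROBID returns TEI XML: sections, references, metadata
|   print(response.text[:500])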
|
| paperai (https://github.com/neuml/paperai) - builds embeddings
| databases of medical/scientific papers. Supports LLM prompting,
| semantic workflows and vector search. Built with txtai
| (https://github.com/neuml/txtai).
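|
| As a rough illustration of the txtai layer underneath (a
| minimal sketch, not paperai's actual schema; the documents
| here are placeholders):
|
|   from txtai import Embeddings
|
|   # Vector index over paper text; content=True stores the
|   # text alongside the vectors so search results are readable
|   embeddings = Embeddings(
|       path="NeuML/pubmedbert-base-embeddings", content=True
|   )
|   embeddings.index([
|       {"id": "p1", "text": "Aspirin reduces inflammation."},
|       {"id": "p2", "text": "Statins lower LDL cholesterol."},
|   ])
|
|   # Semantic search returns the closest match by embedding
|   print(embeddings.search("anti-inflammatory drugs", 1))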
|
| While arbitrary chunking/splitting can work, I've found that
| parsing with knowledge of medical/scientific paper structure
| improves both the accuracy and the overall experience of
| downstream applications.
| panabee wrote:
| these are awesome projects. thanks for sharing.
|
| it would accelerate research so much if LLM accuracy increased
| on biomedical papers.
|
| very much agreed on the potential to extract signal from paper
| structures.
|
| two questions if you don't mind:
|
| 1. did you post a summary of your chunking analysis somewhere?
| i'm curious which method maximized accuracy, and which
| sentence-overlap methods were most effective.
|
| 2. do you think general tokenization methods limit LLMs on
| scientific/biomedical papers?
| dmezzetti wrote:
| Appreciate it!
|
| > 1. did you post a summary of your chunking analysis
| somewhere? i'm curious which method maximized accuracy, and
| which sentence-overlap methods were most effective.
|
| Good idea on this, but nothing posted. In general, grouping
| by sections of a paper has worked best (e.g. methods,
| results, conclusions). GROBID is helpful with arbitrary
| papers.
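|
| A minimal sketch of that idea, assuming the paper has
| already been parsed into (section, text) pairs (the data
| shape and helper name here are hypothetical):
|
|   # Hypothetical parsed output: (section, text) pairs,
|   # e.g. extracted from GROBID's TEI XML
|   parsed = [
|       ("methods", "We enrolled 120 participants..."),
|       ("results", "Accuracy improved by 12%..."),
|       ("conclusions", "Findings support the hypothesis..."),
|   ]
|
|   def chunk_by_section(sections, max_chars=1000):
|       """Split each section independently so no chunk
|       ever spans a section boundary."""
|       chunks = []
|       for name, text in sections:
|           for i in range(0, len(text), max_chars):
|               chunks.append((name, text[i:i + max_chars]))
|       return chunks
|
|   print(chunk_by_section(parsed))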
|
| > 2. do you think general tokenization methods limit LLMs on
| scientific/biomedical papers?
|
| Possibly. For vectorization, specifically with medical text,
| I do have this model
| (https://huggingface.co/NeuML/pubmedbert-base-embeddings),
| which is a sentence embeddings model fine-tuned from this
| base model
| (https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-u...).
| The base model does have a custom vocabulary.
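|
| For reference, the embeddings model loads with
| sentence-transformers in the usual way (the sentences below
| are just placeholders):
|
|   from sentence_transformers import SentenceTransformer
|
|   # Load the fine-tuned medical sentence embeddings model
|   model = SentenceTransformer("NeuML/pubmedbert-base-embeddings")
|
|   # Encode sentences into dense vectors for similarity search
|   vectors = model.encode([
|       "Metformin is a first-line treatment for diabetes.",
|       "The study measured fasting glucose levels.",
|   ])
|   print(vectors.shape)  # (2, 768)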
|
| In terms of LLMs, I've found that this model
| (https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca) works
| well, but I haven't experimented with domain-specific LLMs.
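|
| A minimal sketch of prompting that model through txtai's LLM
| pipeline (assumes enough GPU memory to load a 7B model; the
| prompt is a placeholder):
|
|   from txtai.pipeline import LLM
|
|   # Load Mistral-7B-OpenOrca through txtai's LLM pipeline
|   llm = LLM("Open-Orca/Mistral-7B-OpenOrca")
|
|   # Prompt the model directly
|   print(llm("Summarize what GROBID does in two sentences."))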
___________________________________________________________________
(page generated 2023-11-26 23:00 UTC)