[HN Gopher] Using Llamafiles for Embeddings in Local RAG Applica...
___________________________________________________________________
Using Llamafiles for Embeddings in Local RAG Applications
Author : tosh
Score : 82 points
Date : 2024-05-16 16:20 UTC (6 hours ago)
(HTM) web link (future.mozilla.org)
(TXT) w3m dump (future.mozilla.org)
| jinzo wrote:
| I'm always curious how the licensing aspects will play out. It's
| quite clear to me that most LLMs contain copyrighted material
| used without the proper rights. And then they turn around and
| put restrictive licensing on the result (for example the
| Salesforce model here).
|
| I know they add valuable work of their own, but CC-BY-NC
| really rubs me the wrong way.
| andy99 wrote:
| It's really two different things. If using data for training
| is either fair use or not a use at all (I believe it is; for
| all intents and purposes it's the same as reading it), then
| the copyright on the training data is irrelevant.
|
| Whether weights can be copyrighted at all (which is the basis
| of these licenses) is also unclear. Again, I think they
| should be: for any nontrivial model release they are just as
| creative a work as a computer program (though really I think
| the main reason to want that is to be able to enforce
| copyleft on them).
|
| Also, all these laws work to the benefit of whoever has the
| deepest pockets, so Salesforce will win against most others
| by virtue of that, regardless of how the law shakes out.
| tossandthrow wrote:
| But then again, my program is merely reading these weights.
| terhechte wrote:
| I'd love it if Firefox fed the text content of each website I
| visit into a local database and let me RAG-search it. So
| often I want to revisit a website I saw weeks earlier but
| can't find it again.
| stavros wrote:
| https://historio.us
| 2024throwaway wrote:
| The OP asked for the data to be stored locally. You linked to
| a hosted service with a subscription model. Very much not the
| same.
| bravura wrote:
| I've been curious about historious for a while, but I want to
| ask how it's different from pinboard.
|
| From what I can see, they both have the ability to archive
| and full-text search (FTS) your bookmarks.
|
| But in terms of API access, historious only allows WRITE
| access (ugh), whereas pinboard at least allows read/write.
|
| What else am I missing?
| stavros wrote:
| historious indexes everything, whereas Pinboard (as far as
| I know) only indexes things you select. I haven't used
| Pinboard much, though, so I can't say for sure.
| rdli wrote:
| I'm working on something like this! It's simple in concept,
| but there are lots of fiddly bits. A big one is performance
| (at least, without spending $$$$$ on GPUs). I haven't found
| much written about how to tune/deploy LLMs on commodity cloud
| hardware, which is what I'm trying this out on.
| leobg wrote:
| You can use ONNX versions of embedding models. Those run
| faster on CPU.
|
| Also, don't discount plain old BM25 and fastText. For many
| queries, keyword or bag-of-words search works just as well
| as fancy 1536-dim vectors.
|
| You can also do things like tokenize your text using the
| tokenizer that GPT-4 uses (via tiktoken for instance) and
| then index those tokens instead of words in BM25.
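|
| A minimal sketch of that last idea (assuming the tiktoken and
| rank_bm25 packages are installed; the tiny corpus here is
| just illustrative):
|
|   import tiktoken
|   from rank_bm25 import BM25Okapi
|
|   enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's tokenizer
|
|   docs = ["llamafiles bundle model weights and a runtime",
|           "BM25 is a classic keyword ranking function"]
|   # index BPE token ids instead of whitespace-split words
|   bm25 = BM25Okapi([enc.encode(d) for d in docs])
|
|   query = enc.encode("keyword ranking")
|   print(bm25.get_top_n(query, docs, n=1))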
| rdli wrote:
| Thanks! I should have been clearer -- embeddings are pretty
| fast (relatively) -- it's inference that's slow (I'm at 5
| tokens/second on AKS).
| jnnnthnn wrote:
| Could you sidestep inference altogether? Just return the
| top N results by cosine similarity (or full text search)
| and let the user find what they need?
|
| https://ollama.com models also work really well on most
| modern hardware.
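|
| A rough sketch of that no-inference path (the query and
| document embeddings are assumed to be precomputed; numpy
| only, names illustrative):
|
|   import numpy as np
|
|   def top_n(query_vec, doc_vecs, docs, n=5):
|       # cosine similarity = dot product of L2-normalized vectors
|       q = query_vec / np.linalg.norm(query_vec)
|       d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
|       scores = d @ q
|       best = np.argsort(scores)[::-1][:n]
|       return [(docs[i], float(scores[i])) for i in best]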
| rdli wrote:
| I'm running ollama, but it's still slow on the cloud VM
| (it's actually quite fast on my M2). My working theory is
| that with standard cloud VMs, memory <-> CPU bandwidth is
| the bottleneck. I'm looking into vLLM.
|
| And as to sidestepping inference, I can totally do that.
| But I think it's so much better to be able to ask the LLM
| a question, run a vector similarity search to pull
| relevant content, and then have the LLM summarize this
| all in a way that answers my question.
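|
| A sketch of that ask / retrieve / summarize flow (assuming
| the ollama Python client and already-pulled models; the
| model names here are just illustrative):
|
|   import numpy as np
|   import ollama
|
|   def embed(text):
|       r = ollama.embeddings(model="nomic-embed-text", prompt=text)
|       return np.array(r["embedding"])
|
|   def answer(question, docs, k=3):
|       doc_vecs = np.stack([embed(d) for d in docs])
|       q = embed(question)
|       # cosine similarity against every chunk, keep the top k
|       sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1)
|                              * np.linalg.norm(q))
|       ctx = "\n\n".join(docs[i] for i in np.argsort(sims)[::-1][:k])
|       prompt = ("Answer the question using only this context:\n"
|                 f"{ctx}\n\nQuestion: {question}")
|       reply = ollama.chat(model="llama3",
|                           messages=[{"role": "user",
|                                      "content": prompt}])
|       return reply["message"]["content"]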
| jnnnthnn wrote:
| Oh yeah! What I meant is having Ollama run on the user's
| machine. Might not work for the use case you're trying to
| build for though :)
| pizza wrote:
| This style of embeddings could be quite lightweight, cheap,
| and efficient: https://github.com/cohere-ai/BinaryVectorDB
| Tostino wrote:
| Embedding models are generally lightweight enough to run on
| CPU, and the embedding can be done in the background while
| the user isn't using their device.
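|
| For instance, a sketch with a small model that runs fine on
| CPU (assuming the sentence-transformers package; the model
| name is illustrative):
|
|   from sentence_transformers import SentenceTransformer
|
|   model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
|   vecs = model.encode(["page text one", "page text two"])
|   print(vecs.shape)  # (2, 384) for this model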
| bravura wrote:
| You could use archivebox with the archivebox web extension. And
| then use a separate offline / batch process to embed and RAG
| your archive.
| jnnnthnn wrote:
| You might not want them to have that information, but I think
| Google's history search now supports that for Chrome users:
| https://myactivity.google.com/myactivity
| superkuh wrote:
| I'd be really impressed with Mozilla if they could do the
| entire thing (llamafile + llamaindex) in one or even two
| files. Having to set up a separate Python install just for
| this task and pull in all the llamaindex Python deps defeats
| the point of using llamafile.
___________________________________________________________________
(page generated 2024-05-16 23:00 UTC)