[HN Gopher] Using Llamafiles for Embeddings in Local RAG Applications
       ___________________________________________________________________
        
       Using Llamafiles for Embeddings in Local RAG Applications
        
       Author : tosh
       Score  : 82 points
       Date   : 2024-05-16 16:20 UTC (6 hours ago)
        
 (HTM) web link (future.mozilla.org)
 (TXT) w3m dump (future.mozilla.org)
        
       | jinzo wrote:
        | I'm always curious how the licensing aspects will play out.
        | It seems quite clear to me that most LLMs contain copyrighted
        | material used without the proper rights. And then the vendors
        | turn around and put restrictive licensing on the result (for
        | example the Salesforce model here).
        | 
        | I know they add valuable work of their own, but CC-BY-NC
        | really rubs me the wrong way.
        
         | andy99 wrote:
          | It's really two different things. If using data to train is
          | either fair use or not a use at all (I believe it is; it's
          | for all intents and purposes the same as reading it), then
          | the copyright on the training data is irrelevant.
         | 
          | Whether weights can be copyrighted at all (which is the
          | basis of these licenses) is also unclear. Though again, I
          | think they should be: for any nontrivial model release they
          | are just as creative a work as a computer program (though
          | really, what matters most to me is being able to enforce
          | copyleft on them).
         | 
          | Also, all these laws work to the benefit of whoever has the
          | deepest pockets, so Salesforce will win against most others
          | by virtue of this, regardless of how the law shakes out.
        
           | tossandthrow wrote:
           | But then again, my program is merely reading these weights.
        
       | terhechte wrote:
        | I'd love it if Firefox would locally store the text content
        | of each website I visit and let me RAG-search that database.
        | So often I want to re-visit a website I saw weeks earlier but
        | can't find it again.
        
         | stavros wrote:
         | https://historio.us
        
           | 2024throwaway wrote:
           | The OP asked for the data to be stored locally. You linked to
           | a hosted service with a subscription model. Very much not the
           | same.
        
           | bravura wrote:
           | I've been curious about historious for a while, but I want to
           | ask how it's different from pinboard.
           | 
           | From what I can see, they both have the ability to archive /
           | FTS your bookmarks.
           | 
          | But in terms of API access, historious only allows WRITE
          | access (ugh), whereas pinboard at least allows read/write.
           | 
           | What else am I missing?
        
             | stavros wrote:
              | historious indexes everything, whereas Pinboard (as far
              | as I know) only indexes things you select. I haven't
              | used Pinboard much, though, so I can't say for certain.
        
         | rdli wrote:
          | I'm working on something like this! It's simple in concept,
          | but there are lots of fiddly bits. A big one is performance
          | (at least, without spending $$$$$ on GPUs). I haven't found
          | much guidance on tuning/deploying LLMs on commodity cloud
          | hardware, which is what I'm trying this out on.
        
           | leobg wrote:
            | You can use ONNX versions of embedding models. Those run
            | faster on CPU.
           | 
            | Also, don't discount plain old BM25 and fastText. For
            | many queries, keyword or bag-of-words search works just
            | as well as fancy 1536-dim vectors.
           | 
           | You can also do things like tokenize your text using the
           | tokenizer that GPT-4 uses (via tiktoken for instance) and
           | then index those tokens instead of words in BM25.
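            | 
            | Roughly like this (untested sketch; assumes the tiktoken
            | and rank_bm25 packages, and note that BPE token ids only
            | match when the surrounding whitespace merges the same
            | way):
            | 
            |     import tiktoken
            |     from rank_bm25 import BM25Okapi
            | 
            |     enc = tiktoken.encoding_for_model("gpt-4")
            | 
            |     docs = ["the quick brown fox",
            |             "llamafiles serve embeddings locally"]
            |     corpus = [enc.encode(d) for d in docs]  # token ids
            | 
            |     bm25 = BM25Okapi(corpus)
            |     query = enc.encode("local embeddings")
            |     print(bm25.get_top_n(query, docs, n=1))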
        
             | rdli wrote:
             | Thanks! I should have been clearer -- embeddings are pretty
             | fast (relatively) -- it's inference that's slow (I'm at 5
             | tokens/second on AKS).
        
               | jnnnthnn wrote:
               | Could you sidestep inference altogether? Just return the
               | top N results by cosine similarity (or full text search)
               | and let the user find what they need?
               | 
                | Models from https://ollama.com also work really well
                | on most modern hardware.
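                | 
                | The ranking step itself is tiny (numpy-only sketch;
                | doc_vecs here stands in for your stored embeddings):
                | 
                |     import numpy as np
                | 
                |     def top_n(query_vec, doc_vecs, n=5):
                |         # normalize so dot product == cosine similarity
                |         q = query_vec / np.linalg.norm(query_vec)
                |         d = doc_vecs / np.linalg.norm(
                |             doc_vecs, axis=1, keepdims=True)
                |         return np.argsort(d @ q)[::-1][:n]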
        
               | rdli wrote:
                | I'm running ollama, but it's still slow on the cloud
                | VMs (it's actually quite fast on my M2). My working
                | theory is that with standard cloud VMs, memory <-> CPU
                | bandwidth is the bottleneck. I'm looking into vLLM.
               | 
               | And as to sidestepping inference, I can totally do that.
               | But I think it's so much better to be able to ask the LLM
               | a question, run a vector similarity search to pull
               | relevant content, and then have the LLM summarize this
               | all in a way that answers my question.
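                | 
                | i.e. something like this (hand-wavy sketch against a
                | local ollama server; the model names are placeholders
                | and the payload shapes may differ by ollama version):
                | 
                |     import requests
                | 
                |     OLLAMA = "http://localhost:11434"
                | 
                |     def embed(text):
                |         r = requests.post(OLLAMA + "/api/embeddings",
                |             json={"model": "nomic-embed-text",
                |                   "prompt": text})
                |         return r.json()["embedding"]
                | 
                |     def answer(question, chunks):
                |         # chunks: hits from the vector similarity search
                |         prompt = ("Context:\n" + "\n\n".join(chunks)
                |                   + "\n\nQuestion: " + question)
                |         r = requests.post(OLLAMA + "/api/generate",
                |             json={"model": "llama3", "prompt": prompt,
                |                   "stream": False})
                |         return r.json()["response"]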
        
               | jnnnthnn wrote:
                | Oh yeah! What I meant was having Ollama run on the
                | user's machine. Might not work for the use case you're
                | building for, though :)
        
           | pizza wrote:
            | This style of embeddings could be quite
            | lightweight/cheap/efficient:
            | https://github.com/cohere-ai/BinaryVectorDB
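            | 
            | The core trick is easy to sketch (toy version, not that
            | library's actual API): quantize each embedding to its
            | sign bits and rank by Hamming distance:
            | 
            |     import numpy as np
            | 
            |     def to_bits(vecs):
            |         # float vectors -> packed sign bits (32x smaller)
            |         return np.packbits(vecs > 0, axis=1)
            | 
            |     def search(query_bits, db_bits, n=5):
            |         xor = np.unpackbits(query_bits ^ db_bits, axis=1)
            |         return np.argsort(xor.sum(axis=1))[:n]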
        
           | Tostino wrote:
            | Embedding models are generally lightweight enough to run
            | on CPU, and the embedding can be done in the background
            | while the user isn't using their device.
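            | 
            | For instance, against an embedding llamafile running
            | locally (it exposes a llama.cpp-style server, so I'm
            | assuming its /embedding endpoint and payload here):
            | 
            |     import requests
            | 
            |     r = requests.post("http://localhost:8080/embedding",
            |                       json={"content": "some page text"})
            |     vec = r.json()["embedding"]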
        
         | bravura wrote:
          | You could use archivebox with the archivebox web extension,
          | and then use a separate offline/batch process to embed and
          | RAG your archive.
        
         | jnnnthnn wrote:
         | You might not want them to have that information, but I think
         | Google's history search now supports that for Chrome users:
         | https://myactivity.google.com/myactivity
        
       | superkuh wrote:
        | I'd be really impressed with Mozilla if they could do the
        | entire thing (llamafile + llamaindex) in one, or even two,
        | files. Having to set up a separate Python install just for
        | this task, and pull in all the llamaindex Python deps,
        | defeats the point of using llamafile.
        
       ___________________________________________________________________
       (page generated 2024-05-16 23:00 UTC)