[HN Gopher] VectorDB: Vector Database Built by Kagi Search
       ___________________________________________________________________
        
       VectorDB: Vector Database Built by Kagi Search
        
       Author : promiseofbeans
       Score  : 272 points
       Date   : 2023-11-26 10:21 UTC (12 hours ago)
        
 (HTM) web link (vectordb.com)
 (TXT) w3m dump (vectordb.com)
        
       | altdataseller wrote:
       | Where does it save the data to? How is it persisted?
       | 
        | Is there any limitation to this? Does it work with text of
        | 500-1000 words? Does it work well with text that isn't full
        | sentences, i.e. that is just a collection of phrases?
        
         | tyingq wrote:
          | The README.md example just keeps it in memory. Looking at
          | the source, the Storage class uses a plain file and
          | Python's pickle module:
         | 
         | https://github.com/kagisearch/vectordb/blob/main/vectordb/st...
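          | 
          | A minimal sketch of that usage, going by the README (the
          | Memory/save/search names are from there; the memory_file
          | argument is, as far as I can tell, what routes saves
          | through that pickle-backed Storage class):
          | 
          |     from vectordb import Memory
          | 
          |     # with memory_file set, the index is pickled to disk;
          |     # leaving it out keeps everything in memory
          |     memory = Memory(memory_file="vectors.pkl")
          |     memory.save(["The quick brown fox",
          |                  "jumped over the lazy dog"])
          |     results = memory.search("fast fox", top_n=1)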
        
       | mathverse wrote:
       | Wonder why Crystal was not used.
        
         | freedomben wrote:
          | Crystal is a super cool and underrated language. But why
          | do you mention it here? Does Kagi use Crystal in other
          | places?
        
           | mathverse wrote:
           | They use it for search afaik.
        
             | SushiHippie wrote:
              | Yep [0], 70k lines of Crystal code for Kagi Search [1]
             | 
             | [0] https://help.kagi.com/orion/company/hiring-
             | kagi.html#full-ti...
             | 
              | [1] CrystalConf talk from a Kagi Tech Lead
             | https://www.youtube.com/watch?v=r7t9xPajjTM
        
               | darnthenuggets wrote:
               | I can't decide if I hate architecture-by-drawio or not.
        
           | YellowSuB wrote:
            | Yes, it is made with Crystal.
        
             | synergy20 wrote:
              | I read somewhere that Crystal is typically adopted by
              | Ruby developers.
              | 
              | Not sure if there is something similar for Python.
        
               | tokai wrote:
               | Nim maybe?
        
               | 7thaccount wrote:
                | Nim is probably closest, but it's probably more
                | influenced by Pascal, Modula-2, Oberon... something
                | like that, iirc.
        
               | synergy20 wrote:
                | It is, but it's still fairly complex.
                | 
                | I wish there were something simpler but statically
                | compiled, with Python syntax: something like Crystal,
                | but for Python developers.
        
               | fabianhjr wrote:
               | Scala 3 with Braceless Syntax (scala-native for x86)
               | 
               | https://docs.scala-lang.org/scala3/reference/other-new-
               | featu...
        
               | synergy20 wrote:
                | Kotlin seems to be the Java trend now. Anyway, I
                | don't know much about Java.
        
           | promiseofbeans wrote:
           | Based on their job postings, they use it for most of their
           | back-end: https://help.kagi.com/kagi/company/hiring-
           | kagi.html#full-tim...
        
       | ipsum2 wrote:
        | This is a wrapper around FAISS, a vector search library.
        | FAISS has a simple API, so it might be a better fit for your
        | use case if you don't need the heavyweight libraries that
        | VectorDB requires, which include PyTorch, TensorFlow, and
        | Transformers.
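        | 
        | For reference, a minimal sketch of using FAISS directly with
        | vectors you've already computed (the dimension and random
        | data here are stand-ins):
        | 
        |     import faiss
        |     import numpy as np
        | 
        |     d = 384  # embedding dimension (example value)
        |     vectors = np.random.rand(1000, d).astype("float32")
        | 
        |     index = faiss.IndexFlatIP(d)  # exact inner-product search
        |     faiss.normalize_L2(vectors)   # normalized IP = cosine
        |     index.add(vectors)
        | 
        |     query = np.random.rand(1, d).astype("float32")
        |     faiss.normalize_L2(query)
        |     scores, ids = index.search(query, 5)  # top-5 neighbours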
        
         | dmezzetti wrote:
          | This is possible, but how would you encode your vectors?
        
           | sharemywin wrote:
           | what's wrong with these?
           | 
           | https://www.sbert.net/docs/pretrained_models.html
        
             | dmezzetti wrote:
              | Agreed, hence the question. If you take out the vector
              | encoding part and just use Faiss, then what would be
              | the proposal for encoding vectors?
        
               | lmeyerov wrote:
                | ... Whatever you want? Even with a vector DB,
                | embeddings are typically BYO to begin with:
                | dependencies that live outside your DB, computed in
                | pipelines well before it. It's handy for small apps
                | to do it in the DB and have in-DB support for some
                | queries, but as things get big, doing it all in the
                | DB gets weirder...
               | 
               | Edit: I see you run a vectordb company, so your question
               | makes more sense
        
               | dmezzetti wrote:
                | True, but the comment suggested removing the
                | "heavyweight libraries". Unless you use an API
                | service, those libraries will need to be imported
                | somewhere; it doesn't necessarily have to be on the
                | same server as the database.
        
               | lmeyerov wrote:
               | Vector embedding at the app tier & orchestration /
               | compute tier make more sense for managing the dependency
               | than the vector DB tier for the bulk of ML/AI projects
               | I've worked on. Round tripping through the vectordb would
               | be an architectural, code, and perf headache. Ex: Just
               | one of the many ways we use embeddings is as part of
               | prefiltering what goes into the DB, running in a bulk
               | pipeline in a diff VPC, and in a way we dont want to
               | interfere with DB utilization.
               | 
                | We generally avoid using embedding services to begin
                | with... outside calls are the special case. Imagine
               | something heavy like video transcription via Google APIs,
               | not the typical one of 'just' text embedding. The actual
               | embedding is generally one step of broader data
               | wrangling, so there needs to be a good reason for doing
               | something heavy and outside our control... Which has been
               | rare.
               | 
                | Doing it in the DB tier is nice for tiny projects,
               | simplifying occasional business logic, etc, but generally
               | it's not a big deal for us to run encode(str) when
               | building a DB query.
               | 
               | Where DB embedding support gets more interesting to me
               | here is layering on additional representation changes on
               | top, like IVF+PQ... But that can be done after afaict?
               | (And supporting raw vectors generically here, vs having
               | to align our python & model deps to our DB's, is a big
               | feature)
        
           | ipsum2 wrote:
           | Bring your own embeddings. PyTorch and TensorFlow packages
           | are 2GB+ each (don't quote me on that), which is unnecessary
           | if you're making a network call to your favorite embedding
           | service.
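            | 
            | A sketch of that pattern, with a hypothetical HTTP
            | embedding endpoint standing in for whichever service you
            | use (the URL and response shape here are made up):
            | 
            |     import faiss
            |     import numpy as np
            |     import requests
            | 
            |     def embed(texts):
            |         # hypothetical service; swap in your provider
            |         resp = requests.post(
            |             "https://example.com/v1/embeddings",
            |             json={"input": texts})
            |         return np.array(resp.json()["embeddings"],
            |                         dtype="float32")
            | 
            |     docs = ["first document", "second document"]
            |     vectors = embed(docs)
            |     index = faiss.IndexFlatL2(vectors.shape[1])
            |     index.add(vectors)
            |     _, ids = index.search(embed(["a query"]), 2)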
        
       | PossiblyKyle wrote:
       | I fail to understand why I should use it over a different
       | embedded vector DB like LanceDB or Chroma. Both are written in
        | more performant languages, and have simple APIs with a lot
        | of integrations and power if one needs it.
        
         | dmezzetti wrote:
          | To be fair, Chroma is also written in Python. And while
          | LanceDB and others are written in Rust, that doesn't
          | automatically give them superpowers.
        
           | PossiblyKyle wrote:
            | Fair point; then you could claim it's similar to this DB
            | with its reliance on Faiss. Despite that, Chroma at this
            | point is more feature-rich. I was mostly referring to
            | this: https://thedataquarry.com/posts/vector-db-1/
            | 
            | You are not wrong about the performance from Rust, but
            | LanceDB is inherently written with performance in mind:
            | SIMD support for both x86 and ARM, and an underlying
            | vector storage approach that's built for speed (Lance).
        
             | marcinzm wrote:
             | >more feature rich
             | 
              | Not necessarily a good thing when the product is made
              | by a VC-backed startup that may die or pivot in six
              | months, leaving you to maintain it yourself.
        
             | dmezzetti wrote:
              | I've seen a number of projects come along over the
              | last couple of years. I'm the author of txtai
              | (https://github.com/neuml/txtai), which I started in
              | 2020. How you approach performance is the key point.
             | 
             | You can write performant code in any language. For example,
             | for standard keyword search, I wrote a component to make
             | sparse/keyword search just as efficient as Apache Lucene in
             | Python. https://neuml.hashnode.dev/building-an-efficient-
             | sparse-keyw....
        
           | 6r17 wrote:
            | Python programmer for 15 years, and I picked up Rust to
            | write an OAuth gateway not long ago; I had written it in
            | Python beforehand. Rust DOES give you superpowers,
            | especially compared to something like Python, which is
            | nowhere near as fast and has no typing.
        
             | DSingularity wrote:
              | Python does have typing. Although it doesn't feel as
              | "first class" as in Rust or Go, it gets the job done.
        
             | dmezzetti wrote:
              | There are plenty of examples of Python libraries that
              | are performant, such as NumPy and PyTorch (which both
              | rely on C/C++). Some libraries, such as Hugging Face's
              | tokenizers, even use Rust.
             | 
             | I referenced this article below but will reference it again
             | here too. https://neuml.hashnode.dev/building-an-efficient-
             | sparse-keyw....
             | 
             | You can write performant code in any language if you try.
        
               | iopq wrote:
               | So to make Python fast you just need to write a library
               | in another language, brilliant
        
               | dmezzetti wrote:
                | If you read the referenced article, I discussed a
                | number of ways to write performant Python, such as
                | using this package
                | (https://docs.python.org/3/library/array.html).
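                | 
                | A rough sketch of the tradeoff that package is
                | getting at (exact numbers vary by machine and Python
                | version):
                | 
                |     import sys
                |     from array import array
                | 
                |     floats = [float(i) for i in range(1_000_000)]
                |     packed = array("f", floats)  # raw 4-byte floats
                | 
                |     # the list stores a pointer per element, each to
                |     # a separately boxed PyFloat object
                |     print(sys.getsizeof(floats)
                |           + sum(sys.getsizeof(x) for x in floats))
                |     # the array is one contiguous ~4 MB buffer
                |     print(packed.buffer_info()[1] * packed.itemsize)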
        
               | aldanor wrote:
                | NumPy is a C library with a Python frontend;
                | moreover, lots of its functionality is based on
                | other existing C libraries like BLAS etc.
                | 
                | PyTorch, quoting themselves, is a Python binding
                | into a monolithic C++ framework, also optionally
                | depending on existing libs like MKL etc.
               | 
               | > You can write performant code in any language if you
               | try.
               | 
                | Unfortunately, only to a certain extent. Sure, if
                | you just need to multiply a handful of matrices and
                | you want your BLAS ops to be BLAS'ed, where the
                | sheer size of the data outweighs any of your actual
                | code, it doesn't really matter. Once you need to
                | implement lower-level logic, i.e. traversing and
                | processing the data in some custom way, especially
                | without eating extra memory, you're out of luck with
                | Python/numpy and the rest.
        
               | benrutter wrote:
               | > NumPy is a C library with Python frontend
               | 
                | I guess this is a pretty legitimate take, but in
                | that case VectorDB looks like (from the git repo) it
                | makes heavy use of libraries like PyTorch and NumPy.
               | 
               | If numpy is fast but "doesn't count" because the
               | operations aren't happening in python, then I guess
               | VectorDB isn't in python either by that logic?
               | 
               | On the other hand, if it is in Python despite shipping
               | operations out to C/C++ code, then I guess numpy shows
               | that can be an effective approach?
        
               | bee_rider wrote:
                | BLAS can be implemented in any language. In terms of
                | LOC, most BLAS implementations might be C libraries,
                | but the best open source BLAS, BLIS, is totally
                | structured around the idea of writing custom, likely
                | assembly, kernels for a platform. So, FLOPs-wise it
                | is probably more accurate to call it an assembly
                | library.
               | 
               | LAPACK and other ancillary stuff could be Fortran or C.
               | 
               | Anyway, every language calls out to functions and
               | runtimes, and compiles (or jits or whatever) down to
               | lower level languages. I think it is just not that
               | productive to attribute performance to particular
               | languages. Numpy calls BLAS and LAPACK code, sure, but
               | the flexibility of Python also provides a lot of value.
               | 
               | How does Numba fit into this hierarchy?
        
               | _a_a_a_ wrote:
                | I don't accept that. In the referenced article
                | you're pulling in stuff which I believe is written
                | in a different language (probably C). If you used
                | native Python, I'm sure you would accept that it
                | would be much slower and take up much more memory.
                | So we have to disagree here.
        
               | dmezzetti wrote:
                | Where do you draw the line? Most of CPython is
                | written in C, including the array package
                | (https://docs.python.org/3/library/array.html)
                | mentioned in that article.
               | 
               | Yes, pure Python is slower and takes up more memory. But
               | that doesn't mean it can't be productive and performant
               | using these types of strategies to speed up where
               | necessary.
        
               | _a_a_a_ wrote:
                | With respect, I think you're clouding things by
                | trying to defend what is really indefensible. Okay
                | then.
               | 
               | > Where do you draw the line?
               | 
                | Drawing the line at native Python, i.e. not pulling
                | in packages that are written in another language.
                | Packages written only in Python are acceptable in
                | this argument.
               | 
               | > But that doesn't mean it can't be productive and
               | performant using these types of strategies to speed up
               | where necessary.
               | 
                | No one said it couldn't. What we're saying is that
                | pure Python is 'slow' and you _need_ to escape from
                | pure Python to get the speedups.
        
               | dmezzetti wrote:
                | I agree that pure Python isn't as fast as other
                | options. It just comes down to a productivity
                | tradeoff for developers. And it doesn't have to be
                | one or the other.
        
               | _a_a_a_ wrote:
               | Agreed, then!
        
               | mgl wrote:
                | This is unfortunately not correct once you start
                | pushing the boundaries, requiring careful management
                | of memory, the CPU cache, and the CPU itself; see
                | this table:
               | 
               | https://stratoflow.com/efficient-and-environment-
               | friendly-pr...
        
             | stavros wrote:
             | > Python programmer for 15 years [...] [Python] has no
             | typing
             | 
             | Ok, I have to call this statement out. Mypy was released 15
             | years ago, so Python has had optional static typing for as
             | long as you've been programming in it, and you don't know
             | about it?
             | 
             | I guess it's going to take another fifteen years for this
             | 2008 trope to die.
        
               | aerhardt wrote:
               | I'm primarily a Python programmer, I love mypy and the
               | general typing experience in Python (I think it's better
               | than TypeScript - fight me), but are you seriously
               | comparing it to something - anything - with proper types
               | like Rust?
        
               | stavros wrote:
               | > but are you seriously comparing it to something -
               | anything - with proper types like Rust?
               | 
               | Re-reading my comment, no, I did not. I said it has
               | static typing.
        
               | kamov wrote:
               | > I think it's better than TypeScript - fight me
               | 
                | I used Python type hints and mypy long before I used
                | TypeScript, and I have to say that TypeScript's take
                | on types is just plain better (that doesn't mean
                | it's good, though).
               | 
                | 1. More TypeScript packages are properly typed
                | thanks to DefinitelyTyped. Some Python packages such
                | as NumPy could not be properly typed last I checked
                | (I think it might change with 3.11, though).
                | Packages such as OpenCV didn't have any types last I
                | checked.
                | 
                | 2. TypeScript's type system is more complete, with
                | better support for generics; this might change with
                | 3.11/3.12, though.
               | 
                | 3. TypeScript has a more powerful type system than
                | most languages, as it is Turing-complete and similar
                | in functionality to a purely functional language
                | (this could also be a con).
        
               | Gracana wrote:
               | > I have to call this statement out.
               | 
               | Why? That was just mean for no reason!
        
               | stavros wrote:
               | Is that mean? Sorry, English is not my native language. I
               | just meant that I have to express my doubt of the
               | veracity of the statement.
        
               | djbusby wrote:
               | English is my native language. "I have to call out" is a
               | perfectly fine (and polite) way to express doubt of
               | veracity.
        
               | Gracana wrote:
                | Your language is fine (I've enjoyed your blog posts
                | too, and never gave it a thought that English wasn't
                | your first language). I just thought it was
                | unnecessarily hurtful to say they must be a phony
                | because they didn't know something.
                | 
                | But everybody else seems to agree, so maybe I've
                | been had.
        
               | stavros wrote:
               | I didn't mean to say they are a phony, just that that
               | statement is inaccurate/poorly thought out.
        
             | HumanOstrich wrote:
             | Yea everyone should just rewrite EVERYTHING in Rust! /s
        
         | hantusk wrote:
          | I thought the API here was quite neat. It would be fairly
          | simple to implement a LanceDB backend for it instead of
          | sklearn/faiss/mrpt, as the source code is easy to follow.
          | 
          | This repo is basically just a nice API plus the needed
          | chunking and batching logic. Using LanceDB, you'd still
          | have to write that yourself, as exemplified here:
          | https://github.com/prrao87/lancedb-
          | study/blob/main/lancedb/i...
        
         | mark_l_watson wrote:
          | Same for me. I started using Chroma about a year ago, I am
          | used to it, and if I am using Python I look no further.
         | 
         | When I use Common Lisp or Racket I roll my own simple vector
         | embeddings data store, but that is just me having fun.
        
         | freediver wrote:
          | We needed a low-latency, on-premise solution that we can
          | run on edge nodes, with sane defaults, that anyone on the
          | team can whip up in a sec. Also worth noting is that our
          | use case is end-to-end retrieval of usually a few hundred
          | to a few thousand chunks of text (for example in Kagi
          | Assistant research mode) that need to be processed once at
          | run time with minimal latency.
         | 
          | The result is this. We periodically benchmark the
          | performance of different embeddings to ensure the best
          | defaults:
         | 
         | https://github.com/kagisearch/vectordb#embeddings-performanc...
        
       | andy99 wrote:
        | The two most interesting things to me in a "minimal"
        | framework would be eliminating the dependency on HF
        | transformers and helping customize chunking.
        | 
        | Not a knock against this project; I see where it can be
        | helpful.
        
         | dmezzetti wrote:
          | My assumption is that this gives you the ability to encode
          | vectors locally, which is useful for those not using API
          | services to build their vectors.
        
           | andy99 wrote:
            | Transformer inference is ~60 lines of numpy [0] (closer
            | to 500 once you add tokenization etc.). It would be nice
            | to have just this, and not all of PyTorch and
            | Transformers.
           | 
           | [0] https://jaykmody.com/blog/gpt-from-scratch/
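            | 
            | The core really is compact. For instance, scaled
            | dot-product attention, the piece everything else wraps,
            | is a few lines of numpy (a sketch: single head, no
            | masking):
            | 
            |     import numpy as np
            | 
            |     def attention(q, k, v):
            |         # q, k, v: (seq_len, d) arrays
            |         scores = q @ k.T / np.sqrt(q.shape[-1])
            |         # numerically stable softmax over each row
            |         w = np.exp(scores - scores.max(-1, keepdims=True))
            |         w /= w.sum(-1, keepdims=True)
            |         return w @ v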
        
             | dmezzetti wrote:
              | What about models besides GPT? Most of the popular
              | vector encoding models don't use this architecture.
             | 
              | If you really don't want PyTorch/Transformers, you
              | could consider exporting your models to ONNX and
              | running them with onnxruntime
              | (https://github.com/microsoft/onnxruntime).
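              | 
              | A sketch of what inference looks like after such an
              | export (the model path is a placeholder, and input
              | names vary by export; some encoders also expect
              | token_type_ids):
              | 
              |     import numpy as np
              |     import onnxruntime as ort
              | 
              |     # assumes an encoder already exported to ONNX,
              |     # e.g. with Hugging Face Optimum
              |     session = ort.InferenceSession("model.onnx")
              |     outputs = session.run(None, {
              |         "input_ids": np.array([[101, 7592, 102]],
              |                               dtype=np.int64),
              |         "attention_mask": np.array([[1, 1, 1]],
              |                                    dtype=np.int64),
              |     })
              |     tokens = outputs[0]  # (1, seq_len, hidden)
              |     sentence_vector = tokens.mean(axis=1)  # mean pool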
        
             | cherryteastain wrote:
              | It's 60 lines for CPU-only inference, which will be
              | slow. If you want GPU acceleration, it'll be a lot
              | more than 60 lines.
        
         | jsimian wrote:
          | Yeah, I just tried this out (props to the devs, super easy
          | to set up) and my main gripe is that the chunking
          | algorithms aren't great; it could be a lot more useful
          | with a context option that returns the text surrounding
          | each search result. The sliding window chunking method
          | always cuts off the start of sentences.
        
           | andy99 wrote:
            | I've found it works better to chunk by logical sections
            | in the document, e.g. headers (h2, h3, h4, etc.) or
            | numbered sections (1.1, 1.1.1, ...), plus being able to
            | ignore some content (headers and footers), plus other
            | customizations.
            | 
            | At least for use cases where there are clusters of many
            | similarly formatted documents, it would be cool to have
            | a way of easily customizing chunking.
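            | 
            | A rough sketch of that kind of section-based chunking
            | for markdown-ish text (real documents need more care
            | around code blocks, footers, and so on):
            | 
            |     import re
            | 
            |     def chunk_by_headings(text, max_level=4):
            |         # split before each heading of level 1..4
            |         pat = rf"(?m)^(?=#{{1,{max_level}}} )"
            |         return [c.strip() for c in re.split(pat, text)
            |                 if c.strip()]
            | 
            |     doc = "# Intro\nsome text\n## Details\nmore text\n"
            |     print(chunk_by_headings(doc))
            |     # ['# Intro\nsome text', '## Details\nmore text']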
        
       | osigurdson wrote:
       | What is being used to actually create the embeddings?
        
         | dmezzetti wrote:
          | The library is open source. Here's where you can see how
          | they create the embeddings:
          | https://github.com/kagisearch/vectordb/blob/main/vectordb/em...
        
         | simonw wrote:
         | https://github.com/kagisearch/vectordb/blob/453bb658bb710838...
         | 
         | Looks like it uses one of these, depending on your settings:
         | 
         | Fast model: google/universal-sentence-encoder/4
         | 
         | Multilingual model: universal-sentence-encoder-multilingual-
         | large/3
         | 
         | Normal model (Alternative): BAAI/bge-small-en-v1.5
         | 
         | Best model: BAAI/bge-base-en-v1.5
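          | 
          | For comparison, loading one of those directly via
          | sentence-transformers looks roughly like this (a sketch;
          | VectorDB wires this up behind its own defaults):
          | 
          |     from sentence_transformers import SentenceTransformer
          | 
          |     model = SentenceTransformer("BAAI/bge-small-en-v1.5")
          |     vectors = model.encode(
          |         ["an example sentence", "another one"],
          |         normalize_embeddings=True,  # dot = cosine
          |     )
          |     print(vectors.shape)  # (2, 384) for bge-small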
        
       | syntaxing wrote:
        | I always run into the issue of loading existing embeddings.
        | For instance, if I want to embed a folder which has five
        | files, two of which are new, is there a way to add only the
        | two new files to the stored embeddings using this or
        | ChromaDB?
        
         | rasengan wrote:
          | Keep track of the files, and when you see new ones, only
          | add those.
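          | 
          | A minimal sketch of that bookkeeping with a content-hash
          | manifest (the file names and the embed_and_store call are
          | hypothetical; the same idea works in front of this library
          | or ChromaDB):
          | 
          |     import hashlib, json, pathlib
          | 
          |     MANIFEST = pathlib.Path("embedded.json")
          |     seen = (json.loads(MANIFEST.read_text())
          |             if MANIFEST.exists() else {})
          | 
          |     for path in pathlib.Path("docs").glob("*.txt"):
          |         digest = hashlib.sha256(
          |             path.read_bytes()).hexdigest()
          |         if seen.get(str(path)) == digest:
          |             continue  # unchanged, already embedded
          |         # embed_and_store(path.read_text())  # your call
          |         seen[str(path)] = digest
          | 
          |     MANIFEST.write_text(json.dumps(seen))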
        
       | conception wrote:
       | https://github.com/wallabag/wallabag
       | 
        | No one has mentioned wallabag yet, so I wanted to. It's been
        | working well for me and has apps and extensions. If you're
        | not excited to self-host, https://www.wallabag.it/en has
        | been flawless at the exorbitant price of... 11 euro a year.
        
         | 0x6c6f6c wrote:
         | This isn't even slightly related to a vector database. I like
         | Wallabag but this comes off as a shameless plug.
        
       | politelemon wrote:
        | If a team already has an operational store in Postgres,
        | wouldn't it be best to just use the pgvector extension?
        | Then the data and the vector search functionality sit
        | together, and there's one less moving part in the tech stack
        | to manage.
        
         | dmezzetti wrote:
         | This recent thread discussed this -
         | https://news.ycombinator.com/item?id=38416994
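          | 
          | For the curious, the pgvector side of that is small. A
          | sketch using psycopg2 and the pgvector-python adapter (the
          | DSN, table, and random vector are placeholders):
          | 
          |     import numpy as np
          |     import psycopg2
          |     from pgvector.psycopg2 import register_vector
          | 
          |     conn = psycopg2.connect("dbname=app")
          |     conn.cursor().execute(
          |         "CREATE EXTENSION IF NOT EXISTS vector")
          |     register_vector(conn)
          | 
          |     cur = conn.cursor()
          |     cur.execute("CREATE TABLE IF NOT EXISTS docs (id"
          |                 " serial PRIMARY KEY, body text,"
          |                 " embedding vector(384))")
          |     vec = np.random.rand(384).astype(np.float32)
          |     cur.execute("INSERT INTO docs (body, embedding)"
          |                 " VALUES (%s, %s)", ("hello world", vec))
          |     # <-> is L2 distance; <=> is cosine distance
          |     cur.execute("SELECT body FROM docs ORDER BY"
          |                 " embedding <-> %s LIMIT 5", (vec,))
          |     print(cur.fetchall())
          |     conn.commit()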
        
       | m3kw9 wrote:
       | It shouldn't be legal to call your vectordb vectordb
        
         | andy99 wrote:
          | I prefer it to, say, "vectr" (such contractions used to be
          | popular), or "barbershop" or a similar single irrelevant
          | word as a name, as is popular now.
        
           | worldsayshi wrote:
            | At least with "barbershop" it becomes much more
            | searchable: a query for "barbershop db" will very likely
            | succeed.
        
             | andy99 wrote:
              | Good point. It trades off searchability against
              | understanding (or getting an indication of) what it is
              | when you first hear of it.
        
         | LtWorf wrote:
          | Someone decided to call a piece of software "Informatica".
          | 
          | A British recruiter was quite insistent on having a
          | call... it turned out she had read "informatica" on my
          | profile (as in "laurea in informatica", Italian for a
          | degree in computer science) and asked me how many years of
          | experience I had with the tool. I replied, "I just heard
          | of it 2 minutes ago when you first mentioned it". She then
          | got mad at me for having written the word "informatica" on
          | my profile.
        
         | arbuge wrote:
         | Well, they do have the .com domain, so maybe an exception
         | applies here.
        
         | bee_rider wrote:
         | Sometimes I wonder if Microsoft was actually on to something
         | with their naming scheme.
         | 
         | Something like "Kagi vector search for databases" at least
         | doesn't leave anything up for misinterpretation.
        
       | richardanaya wrote:
        | Kagi Search is amazing; I've used it for the last few
        | months. If this is what they are using to power it, I'm
        | optimistic.
        
         | iTokio wrote:
         | From the GitHub repo:
         | 
         | > Thanks to its low latency and small memory footprint,
         | VectorDB is used to power AI features inside Kagi Search.
        
       | stainablesteel wrote:
        | Super cool! Looks like I can make a local search engine out
        | of the massive data hoard of PDF books and articles I have.
        
       | hubraumhugo wrote:
        | Is there any kind of comparison of the different vector DBs?
        | What would you choose for different use cases? How do they
        | differ?
        
         | dmezzetti wrote:
         | This thread from a few months ago is a good read -
         | https://news.ycombinator.com/item?id=36943318
        
       | pknerd wrote:
        | Can anyone suggest this or some other db/lib for local
        | QnA-style testing on a Mac?
        
         | dmezzetti wrote:
          | I just posted this thread, which gives an in-depth look at
          | LLM frameworks (local ones included):
          | https://news.ycombinator.com/item?id=38422264
        
       | ianpurton wrote:
       | https://bionic-gpt.com/blog/you-dont-need-a-vector-database/
        
       | freediver wrote:
        | Dev here. Thanks for submitting. To be fair, this is not
        | really a database but a wrapper around a few primitives such
        | as locally-run embeddings and FAISS/mrpt, with a ton of
        | benchmarking behind the scenes to offer sane defaults that
        | minimize latency.
       | 
       | Here is an example Colab notebook [1] where this is used to
       | filter the content of the massive Kagi Small Web [2] RSS feed
       | based on stated user interests:
       | 
       | [1]
       | https://colab.research.google.com/drive/1pecKGCCru_Jvx7v0WRN...
       | 
       | [2] https://kagi.com/smallweb
        
       ___________________________________________________________________
       (page generated 2023-11-26 23:01 UTC)