[HN Gopher] VectorDB: Vector Database Built by Kagi Search
___________________________________________________________________
VectorDB: Vector Database Built by Kagi Search
Author : promiseofbeans
Score : 272 points
Date : 2023-11-26 10:21 UTC (12 hours ago)
(HTM) web link (vectordb.com)
(TXT) w3m dump (vectordb.com)
| altdataseller wrote:
| Where does it save the data to? How is it persisted?
|
| Are there any limitations to this? Does it work with text of
| 500-1000 words? Does it work well with text that isn't full
| sentences, i.e. is just a collection of phrases?
| tyingq wrote:
| The README.md example just has it in memory. Looking at the
| source, the Storage class uses a plain file and Python's pickle
| module:
|
| https://github.com/kagisearch/vectordb/blob/main/vectordb/st...
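|
| Roughly this pattern (a toy sketch for illustration, not the
| project's actual code):
|
|     import pickle
|
|     class Storage:
|         """Pickle a list of (text, vector) records to a plain file."""
|
|         def __init__(self, path="vectors.pkl"):
|             self.path = path
|
|         def save(self, records):
|             with open(self.path, "wb") as f:
|                 pickle.dump(records, f)
|
|         def load(self):
|             try:
|                 with open(self.path, "rb") as f:
|                     return pickle.load(f)
|             except FileNotFoundError:
|                 return []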
| mathverse wrote:
| Wonder why Crystal was not used.
| freedomben wrote:
| Crystal is a super cool and underrated language. But why do you
| mention it here? Does kagi use crystal in other places?
| mathverse wrote:
| They use it for search afaik.
| SushiHippie wrote:
| Yep [0], 70k lines of Crystal code for Kagi Search [1]
|
| [0] https://help.kagi.com/orion/company/hiring-
| kagi.html#full-ti...
|
| [1] CrystalConf talk from a Kagi tech lead
| https://www.youtube.com/watch?v=r7t9xPajjTM
| darnthenuggets wrote:
| I can't decide if I hate architecture-by-drawio or not.
| YellowSuB wrote:
| Yes, it is made with Crystal.
| synergy20 wrote:
| I read somewhere that Crystal is typically adopted by Ruby
| developers.
|
| Not sure if there is something similar for Python.
| tokai wrote:
| Nim maybe?
| 7thaccount wrote:
| Nim is probably closest, but it's probably more
| influenced by Pascal, Modula-2, Oberon... something like
| that, IIRC.
| synergy20 wrote:
| It is, but it's still fairly complex.
|
| I wish there were something simpler but statically
| compiled with Python syntax - something like Crystal, but
| for Python developers.
| fabianhjr wrote:
| Scala 3 with Braceless Syntax (scala-native for x86)
|
| https://docs.scala-lang.org/scala3/reference/other-new-
| featu...
| synergy20 wrote:
| Kotlin seems to be the Java trend now. Anyway, I don't
| know much about Java.
| promiseofbeans wrote:
| Based on their job postings, they use it for most of their
| back-end: https://help.kagi.com/kagi/company/hiring-
| kagi.html#full-tim...
| ipsum2 wrote:
| This is a wrapper around FAISS, a vector search library. FAISS
| has a simple API, so it might be a better fit for your use case
| if you don't need the heavyweight libraries that VectorDB
| requires, which include PyTorch, TensorFlow, and Transformers.
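|
| For reference, the core FAISS calls are only a few lines
| (a sketch; assumes you already have float32 vectors):
|
|     import faiss
|     import numpy as np
|
|     d = 384                                   # embedding dimension
|     xb = np.random.rand(10000, d).astype("float32")  # stored vectors
|     xq = np.random.rand(5, d).astype("float32")      # query vectors
|
|     index = faiss.IndexFlatL2(d)              # exact L2 search, no training
|     index.add(xb)
|     distances, ids = index.search(xq, 10)     # top-10 neighbors per query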
| dmezzetti wrote:
| This is possible, but how would you encode your vectors?
| sharemywin wrote:
| what's wrong with these?
|
| https://www.sbert.net/docs/pretrained_models.html
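|
| e.g., encoding with sentence-transformers is two lines (though
| note that it pulls in PyTorch itself, which is the point made
| above):
|
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")  # a pretrained model
|     vectors = model.encode(["chunk one", "chunk two"])  # (2, 384) float32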
| dmezzetti wrote:
| Agreed and the question. If you take out the vector
| encoding part and just use Faiss, then what would be the
| proposal to encode vectors.
| lmeyerov wrote:
| ... Whatever you want? Even with a vector DB, embeddings
| are typically BYO to begin with - dependencies not in your
| DB and computed in pipelines well before it. It's handy
| for small apps to do it in the DB and have in-DB support
| for some queries, but as things get big, doing it all in
| the DB gets weirder...
|
| Edit: I see you run a vectordb company, so your question
| makes more sense
| dmezzetti wrote:
| True, but the comment suggested removing the "heavyweight
| libraries". Unless you use an API service, those libraries
| will need to be imported somewhere; it just doesn't
| necessarily have to be on the same server as the database.
| lmeyerov wrote:
| Vector embedding at the app tier & orchestration /
| compute tier make more sense for managing the dependency
| than the vector DB tier for the bulk of ML/AI projects
| I've worked on. Round tripping through the vectordb would
| be an architectural, code, and perf headache. Ex: Just
| one of the many ways we use embeddings is as part of
| prefiltering what goes into the DB, running in a bulk
| pipeline in a diff VPC, and in a way we don't want to
| interfere with DB utilization.
|
| We generally avoid using embedding services to begin
| with... outside calls are the special case. Imagine
| something heavy like video transcription via Google APIs,
| not the typical one of 'just' text embedding. The actual
| embedding is generally one step of broader data
| wrangling, so there needs to be a good reason for doing
| something heavy and outside our control... Which has been
| rare.
|
| Doing it in the DB tier is nice for tiny projects,
| simplifying occasional business logic, etc., but generally
| it's not a big deal for us to run encode(str) when
| building a DB query.
|
| Where DB embedding support gets more interesting to me
| here is layering additional representation changes on
| top, like IVF+PQ... But that can be done after, afaict?
| (And supporting raw vectors generically here, vs having
| to align our Python & model deps to our DB's, is a big
| feature)
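|
| For instance, layering IVF+PQ onto BYO vectors with FAISS
| is a separate, after-the-fact step (a sketch):
|
|     import faiss
|     import numpy as np
|
|     d, nlist, m = 384, 256, 48          # dim, coarse clusters, PQ splits
|     xb = np.random.rand(50000, d).astype("float32")
|
|     quantizer = faiss.IndexFlatL2(d)
|     index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits/code
|     index.train(xb)                     # learn centroids + PQ codebooks
|     index.add(xb)
|     index.nprobe = 16                   # clusters probed at query time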
| ipsum2 wrote:
| Bring your own embeddings. PyTorch and TensorFlow packages
| are 2GB+ each (don't quote me on that), which is unnecessary
| if you're making a network call to your favorite embedding
| service.
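|
| i.e., something like this (hypothetical endpoint and
| response shape; substitute your provider's actual API):
|
|     import requests
|
|     def embed(texts):
|         resp = requests.post(
|             "https://api.example.com/v1/embeddings",  # hypothetical
|             json={"model": "text-embedding-small", "input": texts},
|             timeout=30,
|         )
|         resp.raise_for_status()
|         return [item["embedding"] for item in resp.json()["data"]]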
| PossiblyKyle wrote:
| I fail to understand why I should use this over a different
| embedded vector DB like LanceDB or Chroma. Both are written in
| more performant languages, and have simple APIs with lots of
| integrations and power if one needs it.
| dmezzetti wrote:
| To be fair, Chroma is also written in Python. And while LanceDB
| and others are written in Rust, that doesn't automatically give
| them superpowers.
| PossiblyKyle wrote:
| Fair point - then you could claim it's similar to this DB,
| given its reliance on Faiss. Despite that, Chroma at this
| point is more feature-rich. I was mostly referring to this:
| https://thedataquarry.com/posts/vector-db-1/
|
| You are not wrong about the performance from Rust, but
| LanceDB is inherently written with performance in mind:
| SIMD support for both x86 and ARM, and an underlying
| vector storage format that's built for speed (Lance).
| marcinzm wrote:
| >more feature rich
|
| Not necessarily a good thing when the product is made by a
| VC backed startup that may die or pivot in six months
| leaving you the need to maintain it yourself.
| dmezzetti wrote:
| I've seen a number of projects come along over the last
| couple of years. I'm the author of txtai
| (https://github.com/neuml/txtai), which I started in 2020.
| How you approach performance is the key point.
|
| You can write performant code in any language. For example,
| for standard keyword search, I wrote a component to make
| sparse/keyword search just as efficient as Apache Lucene in
| Python. https://neuml.hashnode.dev/building-an-efficient-
| sparse-keyw....
| 6r17 wrote:
| I've been a Python programmer for 15 years, and I picked up
| Rust to write an OAuth gateway not long ago; I had written
| it in Python beforehand. Rust DOES give you superpowers,
| especially compared to something like Python, which isn't
| anywhere near as fast and has no typing.
| DSingularity wrote:
| Python does have typing. Although it doesn't feel as "first
| class" as in Rust or Golang, it gets the job done.
| dmezzetti wrote:
| There are plenty of examples of Python libraries that can
| be performant such as NumPy and PyTorch (which both rely on
| C/C++). Some libraries such as Hugging Face's tokenizers
| even use Rust.
|
| I referenced this article below but will reference it again
| here too. https://neuml.hashnode.dev/building-an-efficient-
| sparse-keyw....
|
| You can write performant code in any language if you try.
| iopq wrote:
| So to make Python fast you just need to write a library
| in another language, brilliant
| dmezzetti wrote:
| If you read the referenced article, I discussed a number
| of ways to write performant Python, such as using this
| package (https://docs.python.org/3/library/array.html).
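|
| e.g., array gives packed C-style storage without leaving
| the standard library (a sketch):
|
|     from array import array
|
|     vec = array("f", [0.1, 0.2, 0.3])  # packed 32-bit floats
|     vec.extend([0.4, 0.5])
|     print(vec.itemsize, len(vec))      # 4 bytes per element, 5 elements
|     # versus a list of Python float objects at ~24+ bytes each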
| aldanor wrote:
| NumPy is a C library with a Python frontend; moreover,
| lots of its functionality is based on other existing C
| libraries like BLAS etc.
|
| PyTorch, quoting themselves, is a Python binding into a
| monolithic C++ framework, also optionally depending on
| existing libs like MKL etc.
|
| > You can write performant code in any language if you
| try.
|
| Unfortunately, only to a certain extent. Sure, if you
| just need to multiply a handful of matrices and you want
| your BLAS ops to be BLAS'ed, where the sheer size of data
| outweighs any of your actual code, it doesn't really
| matter. Once you need to implement lower-level logic,
| i.e. traversing and processing the data in some custom
| way, especially without eating extra memory, you're out
| of luck with Python/numpy and the rest.
| benrutter wrote:
| > NumPy is a C library with Python frontend
|
| I guess this is a pretty legitimate take, but in that
| case VectorDB looks like (from the git repo) it makes
| heavy use of libraries like PyTorch and NumPy.
|
| If numpy is fast but "doesn't count" because the
| operations aren't happening in python, then I guess
| VectorDB isn't in python either by that logic?
|
| On the other hand, if it is in Python despite shipping
| operations out to C/C++ code, then I guess numpy shows
| that can be an effective approach?
| bee_rider wrote:
| BLAS can be implemented in any language. In terms of LOC,
| most BLAS might be C libraries, but the best open source
| BLAS, BLIS, is totally structured around the idea of
| writing custom, likely assembly, kernels for a platform.
| So, FLOPs-wise it is probably more accurate to call it an
| assembly library.
|
| LAPACK and other ancillary stuff could be Fortran or C.
|
| Anyway, every language calls out to functions and
| runtimes, and compiles (or jits or whatever) down to
| lower level languages. I think it is just not that
| productive to attribute performance to particular
| languages. Numpy calls BLAS and LAPACK code, sure, but
| the flexibility of Python also provides a lot of value.
|
| How does Numba fit into this hierarchy?
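|
| (For context: Numba JIT-compiles plain Python functions
| to machine code via LLVM, e.g.:
|
|     import numpy as np
|     from numba import njit
|
|     @njit                    # compile this loop to machine code
|     def dot(a, b):
|         total = 0.0
|         for i in range(a.shape[0]):
|             total += a[i] * b[i]
|         return total
|
|     a = np.random.rand(1000000)
|     dot(a, a)                # first call compiles; later calls are fast
|
| so it sits somewhere between "pure Python" and "a C
| library with a Python frontend".)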
| _a_a_a_ wrote:
| I don't accept that. In the referenced article you're
| pulling in stuff which I believe is written in a
| different language (probably C). If you use native
| python, I'm sure you would accept it would be much slower
| and take up much more memory. So we have to disagree
| here.
| dmezzetti wrote:
| Where do you draw the line? Most of CPython is written in
| C, including the array module
| (https://docs.python.org/3/library/array.html) mentioned
| in that article.
|
| Yes, pure Python is slower and takes up more memory. But
| that doesn't mean it can't be productive and performant
| using these types of strategies to speed up where
| necessary.
| _a_a_a_ wrote:
| With respect, I think you're clouding things by trying to
| defend what is really indefensible. Okay then.
|
| > Where do you draw the line?
|
| Drawing the line at native python, not pulling in
| packages that are written in another language. Packages
| written in python only are acceptable in this argument.
|
| > But that doesn't mean it can't be productive and
| performant using these types of strategies to speed up
| where necessary.
|
| No one said it couldn't. What we're saying is that pure
| python is 'slow' and you _need_ to escape from pure
| python to get the speedups.
| dmezzetti wrote:
| I agree that pure Python isn't as fast as other options.
| Just comes down to a productivity tradeoff for
| developers. And it doesn't have to be one or the other.
| _a_a_a_ wrote:
| Agreed, then!
| mgl wrote:
| This is unfortunately not correct once you start pushing
| the boundaries, requiring careful allocation of memory,
| CPU cache, and the CPU itself; see this table:
|
| https://stratoflow.com/efficient-and-environment-
| friendly-pr...
| stavros wrote:
| > Python programmer for 15 years [...] [Python] has no
| typing
|
| Ok, I have to call this statement out. Mypy was released
| over a decade ago, so Python has had optional static
| typing for most of the time you've been programming in
| it, and you don't know about it?
|
| I guess it's going to take another fifteen years for this
| 2008 trope to die.
| aerhardt wrote:
| I'm primarily a Python programmer, I love mypy and the
| general typing experience in Python (I think it's better
| than TypeScript - fight me), but are you seriously
| comparing it to something - anything - with proper types
| like Rust?
| stavros wrote:
| > but are you seriously comparing it to something -
| anything - with proper types like Rust?
|
| Re-reading my comment, no, I did not. I said it has
| static typing.
| kamov wrote:
| > I think it's better than TypeScript - fight me
|
| I had used Python type hints and MyPy long before I used
| TypeScript, and I have to say that TypeScript's take on
| types is just plain better (that doesn't mean it's good,
| though).
|
| 1. More TypeScript packages are properly typed, thanks to
| DefinitelyTyped. Some Python packages such as NumPy could
| not be properly typed last I checked, though I think that
| might change with 3.11. Packages such as OpenCV didn't
| have any types last I checked.
|
| 2. TypeScript's type system is more complete, with better
| support for generics; this might change with 3.11/3.12
| though.
|
| 3. TypeScript has a more powerful type system than most
| languages, as it is Turing-complete and similar in
| functionality to a purely functional language (this could
| also be a con).
| Gracana wrote:
| > I have to call this statement out.
|
| Why? That was just mean for no reason!
| stavros wrote:
| Is that mean? Sorry, English is not my native language. I
| just meant that I had to express my doubt about the
| veracity of the statement.
| djbusby wrote:
| English is my native language. "I have to call out" is a
| perfectly fine (and polite) way to express doubt of
| veracity.
| Gracana wrote:
| Your language is fine (I've enjoyed your blog posts too,
| never gave it a thought that English wasn't your first
| language), I just thought it was unnecessarily hurtful to
| say they must be a phony because they didn't know
| something.
|
| But, everybody else seems to agree so maybe I've been
| had.
| stavros wrote:
| I didn't mean to say they are a phony, just that that
| statement is inaccurate/poorly thought out.
| HumanOstrich wrote:
| Yea everyone should just rewrite EVERYTHING in Rust! /s
| hantusk wrote:
| I thought the API here was quite neat. It would be fairly
| simple to implement a LanceDB backend for it instead of
| sklearn/faiss/mrpt, as the source code is very readable.
|
| This repo is basically just a nice API plus the needed chunking
| and batching logic. Using LanceDB, you'd still have to write
| that yourself, as exemplified here: https://github.com/prrao87/lancedb-
| study/blob/main/lancedb/i...
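|
| The chunking and batching parts are small but fiddly - roughly
| like this (a hypothetical sketch, not the repo's actual code):
|
|     def chunk(text, size=200, overlap=50):
|         """Sliding window over whitespace tokens."""
|         words = text.split()
|         step = size - overlap
|         return [" ".join(words[i:i + size])
|                 for i in range(0, max(len(words) - overlap, 1), step)]
|
|     def batches(items, n=64):
|         for i in range(0, len(items), n):
|             yield items[i:i + n]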
| mark_l_watson wrote:
| Same for me. I started using Chroma (about) a year ago, I am
| used to it, and if I am using Python I look no further.
|
| When I use Common Lisp or Racket I roll my own simple vector
| embeddings data store, but that is just me having fun.
| freediver wrote:
| We needed a low-latency, on-premise solution that we can run on
| edge nodes, with sane defaults, that anyone on the team can use
| on a whim in a sec. Also worth noting: our use case is end-to-
| end retrieval of usually a few hundred to a few thousand chunks
| of text (for example in Kagi Assistant research mode) that need
| to be processed once at run time with minimal latency.
|
| The result is this. We periodically benchmark the performance
| of different embeddings to ensure the best defaults:
|
| https://github.com/kagisearch/vectordb#embeddings-performanc...
| andy99 wrote:
| The two most interesting things to me in a "minimal" framework
| would be eliminating the dependency on HF transformers and
| helping customize chunking.
|
| Not a knock against this project, I see where it can be helpful.
| dmezzetti wrote:
| My assumption is that this gives you the ability to locally
| encode vectors. This is useful for those not using API services
| to build their vectors.
| andy99 wrote:
| Transformer inference is ~60 lines of numpy[0] (closer to 500
| once you add tokenization etc). It would be nice to just have
| this and not all of pytorch and transformers.
|
| [0] https://jaykmody.com/blog/gpt-from-scratch/
| dmezzetti wrote:
| What about models besides GPT? Most of the popular vector
| encoding models aren't using this architecture.
|
| If you really didn't want PyTorch/Transformers, you could
| consider exporting your models to ONNX
| (https://github.com/microsoft/onnxruntime).
| cherryteastain wrote:
| It's 60 lines for CPU only inference, which'll be slow. If
| you want GPU acceleration it'll be a lot more than 60
| lines.
| jsimian wrote:
| Yeah, I just tried this out (props to the devs, super easy to
| set up) and my main gripe is that the chunking algorithms aren't
| great - it could be a lot more useful with a context option that
| returns the text surrounding each search result. The sliding-
| window chunking method always cuts off the start of sentences.
| andy99 wrote:
| I've found it works better to chunk by logical sections in
| the document, e.g. headers (h2, h3, h4, etc.) or numbered
| sections (1.1, 1.1.1, ...), plus to be able to ignore some
| content (headers and footers), plus other customizations.
|
| At least for use cases where there are clusters of many
| similarly formatted documents, it would be cool to have a way
| of easily customizing chunking.
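|
| e.g., a minimal header-based chunker for markdown-ish
| documents (a sketch):
|
|     import re
|
|     def chunk_by_headers(doc):
|         """Split a document into (heading, body) sections at h2-h4."""
|         sections, heading, body = [], "(preamble)", []
|         for line in doc.splitlines():
|             if re.match(r"^#{2,4} ", line):
|                 if body:
|                     sections.append((heading, "\n".join(body).strip()))
|                 heading, body = line.lstrip("# "), []
|             else:
|                 body.append(line)
|         if body:
|             sections.append((heading, "\n".join(body).strip()))
|         return sections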
| osigurdson wrote:
| What is being used to actually create the embeddings?
| dmezzetti wrote:
| The library is open source. Here's where you can see how
| they're creating the embeddings.
| https://github.com/kagisearch/vectordb/blob/main/vectordb/em...
| simonw wrote:
| https://github.com/kagisearch/vectordb/blob/453bb658bb710838...
|
| Looks like it uses one of these, depending on your settings:
|
| Fast model: google/universal-sentence-encoder/4
|
| Multilingual model: universal-sentence-encoder-multilingual-
| large/3
|
| Normal model (Alternative): BAAI/bge-small-en-v1.5
|
| Best model: BAAI/bge-base-en-v1.5
| syntaxing wrote:
| I always run into the issue of loading existing embeddings. For
| instance, if I want to embed a folder which has five files and
| two are new, is there a way to add only the two new files to the
| stored embeddings using this or ChromaDB?
| rasengan wrote:
| Keep track of the files, and when you see new ones, add only
| those.
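|
| e.g., a content-hash manifest makes the diff cheap (a sketch):
|
|     import hashlib, json, pathlib
|
|     def new_files(folder, manifest=".embedded.json"):
|         """Return files not yet embedded, then update the manifest."""
|         mpath = pathlib.Path(manifest)
|         seen = json.loads(mpath.read_text()) if mpath.exists() else {}
|         fresh = []
|         for f in sorted(pathlib.Path(folder).glob("*")):
|             if f.is_file():
|                 digest = hashlib.sha256(f.read_bytes()).hexdigest()
|                 if seen.get(str(f)) != digest:
|                     fresh.append(f)
|                     seen[str(f)] = digest
|         mpath.write_text(json.dumps(seen, indent=2))
|         return fresh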
| conception wrote:
| https://github.com/wallabag/wallabag
|
| No one has mentioned wallabag yet, so I wanted to. It's been
| working well for me - it has apps and extensions. If you're not
| excited to self-host, https://www.wallabag.it/en has been
| flawless at the exorbitant price of... 11 euros a year.
| 0x6c6f6c wrote:
| This isn't even slightly related to a vector database. I like
| Wallabag but this comes off as a shameless plug.
| politelemon wrote:
| If a team already has an operational store in Postgres,
| wouldn't it be best to just use the pgvector extension, so that
| the data and the vector search functionality sit together and
| there's one less moving part in the tech stack to manage?
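|
| e.g., the whole search path stays in SQL (a sketch using
| psycopg; assumes the pgvector extension is available):
|
|     import psycopg
|
|     conn = psycopg.connect("dbname=app", autocommit=True)
|     conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
|     conn.execute("CREATE TABLE IF NOT EXISTS docs "
|                  "(id serial PRIMARY KEY, body text, embedding vector(384))")
|     qvec = "[" + ",".join(["0.1"] * 384) + "]"  # query embedding literal
|     rows = conn.execute(
|         "SELECT body FROM docs ORDER BY embedding <-> %s::vector LIMIT 5",
|         (qvec,),
|     ).fetchall()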
| dmezzetti wrote:
| This recent thread discussed this -
| https://news.ycombinator.com/item?id=38416994
| m3kw9 wrote:
| It shouldn't be legal to call your vectordb vectordb
| andy99 wrote:
| I prefer it to, say, vectr (such contractions used to be
| popular) or 'barbershop' or a similar single irrelevant word
| name as is popular now.
| worldsayshi wrote:
| At least with barbershop it becomes much more searchable.
| Searching "barbershop db" will very likely succeed.
| andy99 wrote:
| Good point. It trades off searchability against
| understanding (or getting an indication of) what it is
| when you first hear of it.
| LtWorf wrote:
| Someone decided to call a piece of software "Informatica".
|
| A British recruiter was quite insistent on having a call... it
| turns out it was because she read "informatica" on my profile
| (as in, "laurea in informatica" - a degree in computer
| science). She asked me how many years of experience I had with
| the tool; I replied, "I just heard of it 2 minutes ago when you
| first mentioned it". Then she got mad at me for having written
| the word "informatica" on my profile.
| arbuge wrote:
| Well, they do have the .com domain, so maybe an exception
| applies here.
| bee_rider wrote:
| Sometimes I wonder if Microsoft was actually on to something
| with their naming scheme.
|
| Something like "Kagi vector search for databases" at least
| doesn't leave anything up for misinterpretation.
| richardanaya wrote:
| Kagi Search is amazing; I've used it for the last few months. If
| this is what they are using to power it, I'm optimistic.
| iTokio wrote:
| From the GitHub repo:
|
| > Thanks to its low latency and small memory footprint,
| VectorDB is used to power AI features inside Kagi Search.
| stainablesteel wrote:
| Super cool! Looks like I can make a local search engine out of
| the massive data hoard of PDF books and articles I have.
| hubraumhugo wrote:
| Is there any kind of comparison of the different vector DBs? What
| would you choose for different use cases? How do they differ?
| dmezzetti wrote:
| This thread from a few months ago is a good read -
| https://news.ycombinator.com/item?id=36943318
| pknerd wrote:
| This, or can anyone suggest some other DB/library for local
| QnA-style testing on a Mac?
| dmezzetti wrote:
| Just posted this thread that gives an in-depth look at LLM
| frameworks (local ones included) -
| https://news.ycombinator.com/item?id=38422264
| ianpurton wrote:
| https://bionic-gpt.com/blog/you-dont-need-a-vector-database/
| freediver wrote:
| Dev here. Thanks for submitting. To be fair, this is not really
| a database, but a wrapper around a few primitives such as
| locally-run embeddings and FAISS/mrpt, with a ton of
| benchmarking behind the scenes to offer sane defaults that
| minimize latency.
|
| Here is an example Colab notebook [1] where this is used to
| filter the content of the massive Kagi Small Web [2] RSS feed
| based on stated user interests:
|
| [1]
| https://colab.research.google.com/drive/1pecKGCCru_Jvx7v0WRN...
|
| [2] https://kagi.com/smallweb
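|
| Basic usage looks roughly like this (an illustrative sketch;
| see the repo README for the exact API):
|
|     from vectordb import Memory
|
|     memory = Memory()  # in-memory by default
|     memory.save(
|         ["post about self-hosting email", "post about sourdough"],
|         [{"url": "https://example.com/1"}, {"url": "https://example.com/2"}],
|     )
|     results = memory.search("home server administration", top_n=1)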
___________________________________________________________________
(page generated 2023-11-26 23:01 UTC)