[HN Gopher] Creating a search engine with PostgreSQL
___________________________________________________________________
Creating a search engine with PostgreSQL
Author : todsacerdoti
Score : 251 points
Date : 2023-07-12 18:05 UTC (4 hours ago)
(HTM) web link (xata.io)
(TXT) w3m dump (xata.io)
| 0xbadcafebee wrote:
| When I was a teenager I made a search engine and database from
| scratch, not knowing anything about either. I just wanted to see
| how far I could get, how fast I could make it, returning somewhat
| relevant search results.
|
| It's actually very easy to make a fast basic database and search
| engine, even as an amateur programmer. As long as you understand
| basic CS algorithms and how to exploit the operating system and
| hardware, you can put one together in a month or two. Speed is
| not bad even with high-level languages; something like 250K QPS,
| back in 2003, on a laptop. Scalability isn't much of an issue
| either if you shard it. Indexing, locking, and consistency are
| more complicated than the storage and retrieval parts.
|
| The big problem to overcome is the subjective nature of search.
| What do I _really_ want to find? How do I find something when I
| don't know what I'm looking for? How do I get around people
| trying to game the system? How do I handle complex queries and
| datasets? That's when it gets orders of magnitude harder.
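|
| The core retrieval part is just an inverted index - roughly this
| toy sketch (illustrative only, not my actual code):
|
|       # Toy inverted index: token -> set of document ids.
|       from collections import defaultdict
|
|       index = defaultdict(set)
|
|       def add_document(doc_id, text):
|           for token in text.lower().split():
|               index[token].add(doc_id)
|
|       def search(query):
|           # AND semantics: intersect the posting sets
|           postings = [index.get(t, set()) for t in query.lower().split()]
|           return set.intersection(*postings) if postings else set()
|
|       add_document(1, "fast basic database and search engine")
|       add_document(2, "fast search results")
|       print(search("fast search"))  # -> {1, 2}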
| gomezjdaniel wrote:
| Is it open sourced anywhere? I am curious.
| daitangio wrote:
| Very nice and clear article. SQLite also offers advanced
| indexing functionality and stemming via a standard plugin. For
| English, SQLite works well too.
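|
| A minimal sketch using FTS5's built-in porter stemmer (assuming
| your SQLite build has FTS5 enabled, as Python's bundled one
| usually does):
|
|       import sqlite3
|
|       con = sqlite3.connect(":memory:")
|       con.execute(
|           "CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='porter')")
|       con.execute("INSERT INTO docs VALUES ('searching databases quickly')")
|       # 'search' matches 'searching': both stem to the same lexeme
|       print(con.execute(
|           "SELECT body FROM docs WHERE docs MATCH 'search'").fetchall())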
| lettergram wrote:
| I've actually done this before and made it pretty damn fast -
|
| https://austingwalters.com/fast-full-text-search-in-postgres...
|
| The website is https://askhn.ai for the moment
| rane wrote:
| How does SQLite fare against PostgreSQL for this particular use
| case?
| shortrounddev2 wrote:
| Why do I see so much discussion about SQLite on HN? I've
| literally never seen it used in production; my usage of it is
| purely as a database for local testing.
| Mustachio wrote:
| We do natural language processing with some custom
| tokenisation; the data it uses at runtime is put into an SQLite
| DB at build time. Works fine for our use case.
| 0xbadcafebee wrote:
| HN has fads like everything else. If this were 10 years ago
| they'd be asking about MongoDB. 5 years ago they'd be asking
| if the Blockchain could power their search engine. Now they
| want an AI search engine.
| appletrotter wrote:
| I feel like MongoDB was on the wane even 5 years ago.
| callalex wrote:
| Also, the interest may not have all been...organic.
| thewataccount wrote:
| SQLite has nothing to do with AI searching in the way I think
| you're suggesting.
|
| It was not designed for "AI usage" or as an "AI database".
|
| SQLite is from 2000 and is more comparable to a single-file
| postgres/mysql database.
|
| It's also on literally billions of devices including
| whatever you're reading this on.
| thewataccount wrote:
| Put simply - SQLite is incredibly powerful and simple - it can
| handle a lot more than people might think - and AFAIK it is the
| most "used in production" database whether you're aware of it or
| not.
|
| Depending on the use case it can scale incredibly well, and
| is tiny and battle hardened.
|
| It's the most deployed database, it's on every smartphone,
| "smart device", computer, it's inside of many of your apps,
| powers many of the websites you use, etc.
|
| ---
|
| tl;dr - If you're following "use the simplest, but reliable tool
| for the job" then SQLite is a valid option for a crapload of use
| cases; most people aren't aware of what it can actually do
| because it doesn't "web scale".
|
| SQLite isn't new, it's old, and its "used in production" count
| is in the literal billions.
| osrec wrote:
| It is used in a huge number of production environments,
| including within your browser (most probably).
|
| Probably not something you want to run a multi-tenant SaaS
| database with, but it is useful if you are going the
| one-tenant-per-database route.
| bshipp wrote:
| SQLite is extremely powerful for locally embedded databases,
| such as apps with an offline component - for example, searching
| text in notes, books, cached news, etc. It's very simple to
| implement inside SQLite with much less overhead than including a
| Postgres/MySQL instance.
| klysm wrote:
| It's used in production quite a bit, probably just not in
| your field as much. Anytime I'm thinking about storing
| something in a file I use SQLite instead
| TX81Z wrote:
| Lol, you're using it "in production" right now! SQLite is
| everywhere, almost definitely in the browser you're using to
| view this post.
| shortrounddev2 wrote:
| I guess it's a difference in industry. Everything I do is
| accessed over a network, and for caching we use simpler
| key-value store libraries rather than relational databases. I
| don't develop client software.
| samwillis wrote:
| With the move to the "edge" for compute, having your database at
| the edge reduces latency and basically makes it viable. So the
| theory is that SQLite read replicas will be the answer to a lot
| of these plans. See what Fly and Cloudflare are investing in
| with their edge networks and SQLite plans.
|
| On top of that, SQLite is now viable in the browser with WASM
| and the "origin private file system" API. And so people are
| increasingly looking at moving more state and persistence to the
| browser with an eventually consistent sync. This is what the
| "Local First" movement, myself included, is excited about.
| rahkiin wrote:
| There is https://www.sqlite.org/fts5.html but I cannot deduce
| whether it has the same features; the way it works is completely
| different.
| bshipp wrote:
| I've had good luck using FTS5 in SQLite for small-scale use
| cases, such as examining scraped data. It's remarkably powerful
| and quick.
|
| https://www.sqlitetutorial.net/sqlite-full-text-search/
|
| https://www.sqlite.org/fts5.html
| quartz wrote:
| You can also marry these techniques with pgvector to find related
| content through embeddings. I've found this to be pretty magical.
| jstummbillig wrote:
| Could you get into some more detail or point at something that
| does?
| kaycebasques wrote:
| https://supabase.com/blog/openai-embeddings-postgres-vector
|
| https://supabase.com/blog/chatgpt-supabase-docs
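|
| A rough sketch of the pattern those posts describe (hypothetical
| schema; assumes psycopg2, a Postgres server with the pgvector
| extension installed, and embeddings computed elsewhere, e.g. by
| an embeddings API):
|
|       import psycopg2
|
|       conn = psycopg2.connect("dbname=app")
|       cur = conn.cursor()
|       cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
|       cur.execute("""
|           CREATE TABLE IF NOT EXISTS docs (
|               id bigserial PRIMARY KEY,
|               body text,
|               embedding vector(1536)  -- your model's dimension
|           )
|       """)
|       # stand-in query vector; use a real model embedding here
|       query_embedding = "[" + ", ".join(["0.0"] * 1536) + "]"
|       # nearest neighbours by cosine distance (pgvector's <=>)
|       cur.execute(
|           "SELECT id, body FROM docs"
|           " ORDER BY embedding <=> %s::vector LIMIT 5",
|           (query_embedding,))
|       print(cur.fetchall())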
| binarymax wrote:
| It's interesting that folks go down this hacky route when they
| can use something like Vespa, which is orders of magnitude
| better from a performance, relevance, scalability, and
| developer ergonomics perspective.
| moomoo11 wrote:
| Is that a different system?
|
| Sorry I'm on spotty mobile that can't open anything besides
| HN lol (God bless this website).
|
| Sometimes it is just easier to use the existing systems and
| squeeze them as much as possible. Especially when it's a
| small team or solo without much $$
| binarymax wrote:
| When it comes to search I cannot disagree more.
|
| https://vespa.ai is a purpose built search engine.
|
| If you start bolting search onto your database, your
| relevance will be terrible, you'll be rewriting a lot of
| table stakes tools/features from scratch, and your
| technical debt will skyrocket.
| [deleted]
| LunaSea wrote:
| The advantage of pg_vector is that you don't need a second,
| specialised database and you also don't need to synchronise
| data.
|
| It makes much more operational sense to use pg_vector if your
| use case can be implemented that way.
| binarymax wrote:
| It makes terrible operational sense. What are the HA/DR,
| sharding, replica, and backup strategies and tools for
| pg_vector? What are the embedding integration and relevance
| tools? What are the reindexing strategies? What are the
| scaling, caching, and CPU thrashing resolution paths?
|
| You're going to spend a bunch of time writing integrations
| that already exist for actual search engines, and you're
| going to be stuck and need to back out when search becomes
| a necessity rather than an afterthought.
| [deleted]
| djbusby wrote:
| The HA/DR, Sharding, Replica and Backup would all be the same
| as before. It's all in PG so you use the existing method.
|
| If you have two systems, then you have two (unique)
| answers for HA,DR,Shard,Replica,Backup - the PG set and
| the Vespa.
|
| That's more complicated, from an operational perspective.
|
| PG FTS is quite good, and there are in-pg methods that
| can improve it.
| whakim wrote:
| What if you don't need those things yet and you just have
| some embeddings you want to query for cosine similarity?
| A dedicated vector database is way, way overkill for many
| people.
| philipbjorge wrote:
| What makes most operational sense is going to depend on
| your context.
|
| From my vantage point, you're both right in the
| appropriate context.
| esafak wrote:
| p50/p99 retrieval times at realistic loads or it didn't happen. A
| search engine that returns results in, say, a minute is not
| "advanced". Of course a relational database like Postgres can do
| it on paper.
| lelanthran wrote:
| > p50/p99 retrieval times at realistic loads or it didn't
| happen.
|
| Therein lies the problem - how do you generate actual realistic
| loads for a search engine without having a large number of
| people use it for searches? Simply hitting it with random
| search terms isn't realistic.
|
| Some people will be on slow connections, search terms for
| something specific might spike in only a certain region
| (earthquake, etc), etc.
|
| If your terms are too random, it'll perform worse than it
| should (results not in the cache), and if not random enough it
| will perform better than it should.
| gleenn wrote:
| One actual solution is to use historical search logs. Just
| because "random" is a bad answer doesn't mean people don't
| try and make reasonable reproductions of load to replay and
| benchmark. Caching is also a big factor.
| ramraj07 wrote:
| We use postgres FTS and it works fine; you just have to watch
| how you rank-order the rows in the query. If you only use
| ts_rank then it's perfect, but you likely want to use some other
| relevance metric to adjust the ranking - and then you can't
| rank-order primarily by that metric. Once you nail this, the
| results are as fast as any other typical db table query with an
| index.
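|
| Roughly this shape (hypothetical table and columns, including
| the popularity score; assumes psycopg2 and a tsvector column
| with a GIN index):
|
|       import psycopg2
|
|       conn = psycopg2.connect("dbname=app")
|       cur = conn.cursor()
|       # filter with @@ so the GIN index is used; rank with
|       # ts_rank, letting the secondary metric only break ties
|       cur.execute("""
|           SELECT id, title, ts_rank(tsv, query) AS rank
|           FROM docs, websearch_to_tsquery('english', %s) AS query
|           WHERE tsv @@ query
|           ORDER BY rank DESC, popularity DESC
|           LIMIT 20
|       """, ("postgres search",))
|       print(cur.fetchall())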
| glintik wrote:
| Actually, good ranking is the most valuable thing in a search
| engine.
| qingcharles wrote:
| Good ranking is what made Google a trillion dollar company.
|
| Google indexed the same sites as Altavista, but Google's
| PageRank made the right sites bubble to the top and made Sergey
| and Larry billionaires...
| karmakaze wrote:
| Lucene (which Elasticsearch and Solr use) is based on inverted
| indexes, which is what the GIN in the article also uses.
|
| So the benefits of ES/etc are being able to scale horizontally
| across nodes, plus any additional features it adds on top of the
| main index.
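|
| The GIN setup looks roughly like this (hypothetical table;
| assumes psycopg2 and PG 12+ for the generated column):
|
|       import psycopg2
|
|       conn = psycopg2.connect("dbname=app")
|       cur = conn.cursor()
|       # keep a tsvector column in sync automatically, then
|       # build the inverted (GIN) index over its lexemes
|       cur.execute("""
|           ALTER TABLE docs ADD COLUMN IF NOT EXISTS tsv tsvector
|           GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
|       """)
|       cur.execute(
|           "CREATE INDEX IF NOT EXISTS docs_tsv_idx"
|           " ON docs USING GIN (tsv)")
|       conn.commit()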
| tudorg wrote:
| (disclaimer: author of the post)
|
| I plan a follow-up to compare it with Elasticsearch; however, I
| don't think I'm going to attempt benchmarking, because whatever
| realistic scenario I come up with will not necessarily be
| relevant to your use case.
|
| I mostly agree with you, and I probably wouldn't use this at
| large scale (say, more than a few million records). I was
| primarily interested in how much of the functionality I can
| replicate. Because for small search use cases this has some
| clear advantages: less infra to maintain, strong consistency,
| joins, etc.
|
| Also, at Xata we're thinking about having a smooth transition
| between using Postgres at small scale, then migrating to use
| Elasticsearch with minimal breaking changes.
| bagels wrote:
| Sure, benchmarks won't prove that it works in ALL cases, but
| they can at least prove it works in SOME cases?
| TX81Z wrote:
| I had done something similar recently. The goal was to take a
| huge Postgres database and make it searchable and usable.
|
| It ended up that doing an offline batch job to boil the much
| bigger dataset down into a single pre-optimized table was the
| best approach for us. Once optimized and tsvectored, it was
| fairly performant, and Elastic wasn't a huge gain over it. Still
| keeping the elastic code around "in case", but yeah, Postgres
| search can be "good enough" when you aren't serving a ton of
| clients.
| bshipp wrote:
| Using measured comparisons and making informed choices is
| certainly the best way.
|
| If you can start with Postgres to have a relational database
| with the benefit of Full Text Search (i.e. avoid Elasticsearch)
| as well as JSON fields (i.e. avoid MongoDB), then you end up
| simplifying initial hardware/software requirements while
| retaining the ability to migrate to those solutions when user
| demand requires it.
|
| So many developers seem to build with the idea that they'll
| become the next FAANG when actual (or reasonably forecasted)
| user load doesn't remotely require such a complex software
| stack.
| [deleted]
| [deleted]
| nanidin wrote:
| I interviewed somewhere with an ex-Googler who revealed they
| cache all of the search results for terms they have seen
| before, then when they update the index they also update the
| cached results. From that perspective, fast search results
| aren't actually that exciting, since you can constantly run a
| background task to update the cached results and just serve
| those as the requests come in. The response time users see is
| then orthogonal to the speed of actually computing the search
| results.
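|
| A toy sketch of the idea (my reconstruction, not Google's actual
| design): serve cached results for queries seen before, and
| refresh every cached entry when the index updates.
|
|       results_cache = {}
|
|       def handle_query(q, run_search):
|           key = q.strip().lower()
|           if key not in results_cache:
|               # first sighting of this term: compute and remember
|               results_cache[key] = run_search(key)
|           return results_cache[key]
|
|       def on_index_update(run_search):
|           # background task: recompute all previously seen queries
|           for key in list(results_cache):
|               results_cache[key] = run_search(key)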
| TX81Z wrote:
| Almost everything is faster if you can precompute.
| winrid wrote:
| Yup, all FastComments threads are precomputed for all sort
| directions! Each page comes from the DB in < 1.5ms 75% of
| the time.
|
| It gets tough with pages with 100k+ comments though, so
| there are different tricks and switches for different flows
| and data sizes.
| qingcharles wrote:
| Your site is killing me. Every page I visit I have to
| accept the cookies :(
| bshipp wrote:
| For sure, but the approach is quite viable. If 19 out of 20
| searches by a user are almost instantaneous and the single novel
| one requires a few seconds, they'll assume a hiccup in their
| internet connection and still view the site as "really fast".
| It's certainly useful for limiting demands on expensive
| hardware.
| teej wrote:
| 20% of all Google searches are brand new
| andrelaszlo wrote:
| And use 80% of the resources?
| bagels wrote:
| Then 80% aren't, and those ones will perform very well.
| winddude wrote:
| which is why es uses multiple levels of caching
| _s_a_m_ wrote:
| Has the word "advanced", must be good ..
| endisneigh wrote:
| For something like this I'd use FoundationDB to store the entries
| redundantly and consistently with high availability and TypeSense
| for searching through the records and regenerating the index as
| necessary from FDB.
| esafak wrote:
| Does Typesense work with external vector databases?
| jabo wrote:
| Typesense has a vector store / search built-in:
| https://typesense.org/docs/0.24.1/api/vector-search.html
|
| In the upcoming version, we've also added the ability to
| automatically generate embeddings from within Typesense, using
| either the OpenAI API, the PaLM API, or a built-in model like
| S-BERT or E5. So you only have to send JSON and pick a model;
| Typesense will then do a hybrid vector+keyword search for
| queries.
| esafak wrote:
| I see you run hnswlib but do you (plan to) support external
| vector databases, so users can upgrade?
| jabo wrote:
| We don't plan to support external vector databases, since
| we want to build Typesense as a vector + keyword search
| datastore by itself.
| esafak wrote:
| I see. Do you plan to replace hnswlib with your own
| technology?
| b1zguy wrote:
| I've been meaning to attempt running a custom search engine for
| particular sites I've 'bookmarked'. Some sites contain gold that
| could be useful in the future but is not often surfaced in
| Google results.
|
| Should I go the Postgres/Elasticsearch route, or are somewhat
| out-of-the-box solutions available?
| binarymax wrote:
| For something small with a minimal footprint, I'd recommend
| Typesense. https://github.com/typesense/typesense
|
| Elasticsearch is heavy, and relational databases with search
| bolted on (like Postgres or SQLite) aren't great.
| bshipp wrote:
| It depends on what the user requirements are. FTS works
| pretty well with both Postgres and SQLite, in my experience.
|
| Here's a git repo someone can modify to do a cross comparison
| on a specific dataset, if they are interested. It doesn't seem
| to indicate the RDBMSs are outclassed in a small-scale FTS
| implementation.
|
| https://github.com/VADOSWARE/fts-benchmark
| hardwaresofton wrote:
| Also I've got a small project in which I try to compare
| Meilisearch and Postgres FTS w/ pg_trgm; it's called
| podcastsaver:
|
| Podcastsaver.com (click on the nerds tab in the top right)
|
| Never got to it but there are a bunch of other search
| engines worth adding -- Sonic, Typesense, etc. Maybe some
| day
| binarymax wrote:
| For personal use nobody cares about 100ms vs 10ms response.
| What they do care about is relevance. Consider the
| following from that repo's output:
|
|             Typesense
|               [timing] phrase [superman]: returned [28] results in 4.222797 ms
|               [timing] phrase [suprman]:  returned [28] results in 3.663458 ms
|
|             SQLite
|               [timing] phrase [superman]: returned [47] results in 0.351138 ms
|               [timing] phrase [suprman]:  returned [0] results in 0.07513 ms
|
| So SQLite is faster, but who cares? I want things like
| relevance and typo resilience without having to configure
| anything.
| evdubs wrote:
| The article covers typo resilience in the section "Typo
| tolerance / fuzzy search".
|
| This adds a step between query entry and text search
| where you find the similarity of query words to unique
| lexemes if the word is not a lexeme. Seems like a
| reasonable compromise to me?
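|
| A sketch of that step (assumes psycopg2, the pg_trgm extension,
| and a hypothetical 'words' table of unique lexemes built with
| ts_stat, along the lines of the article):
|
|       import psycopg2
|
|       conn = psycopg2.connect("dbname=app")
|       cur = conn.cursor()
|       # map a possibly misspelled query word to its closest
|       # known lexeme before building the tsquery
|       cur.execute("""
|           SELECT word FROM words
|           ORDER BY similarity(word, %s) DESC
|           LIMIT 1
|       """, ("suprman",))
|       corrected = cur.fetchone()[0]  # e.g. 'superman'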
| bshipp wrote:
| I'm not trying to be argumentative. As long as people
| find a solution they're happy with, I think that's great.
| For me, I'm far less interested in handling typos, but I
| can see how it would be valuable in many applications.
| I'm usually less interested in tying in and learning another
| set of services if I can get 90% of the way there with one,
| while leaving the option of adding it later if additional
| requirements make it necessary.
| bshipp wrote:
| For such light demand and fixed site requirements, a single-
| file SQLite DB is probably best. Modern SQLite has full-text
| capabilities that are quite powerful and relatively easy to
| implement.
|
| https://www.sqlite.org/fts5.html
| tudorg wrote:
| Do you prefer it locally or in the cloud? If in the cloud,
| check out Xata (the domain of the blog post here).
| b1zguy wrote:
| Edit: I forgot to ask how I would add webpages to the databases
| already suggested here. Do I need to use a separate program to
| spider/index each site and check for updates?
| bshipp wrote:
| If you're looking for a turn-key solution, I'd have to dig a
| little. I generally write a scraper in python that dumps into
| a database or flat file (depending on the number of records I'm
| hunting).
|
| Scraping is a separate subject, but once you write one you
| can generally reuse relevant portions for many others. If you
| can get adept at a scraping framework like Scrapy you can do
| it fairly quickly, but there aren't many tools that work out
| of the box for every site you'll encounter.
|
| Once you've written the spider, it's generally able to be rerun
| for updates unless the site code is dramatically altered. It
| really comes down to how brittle the spider's code is (e.g.
| hunting for specific heading sizes or fonts) versus grabbing
| the underlying JSON/XHR, which doesn't usually change
| frequently.
|
| 1. https://scrapy.org
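|
| A minimal Scrapy spider looks something like this (hypothetical
| site and selectors; real spiders need per-site tweaks):
|
|       import scrapy
|
|       class BookmarkSpider(scrapy.Spider):
|           name = "bookmarks"
|           start_urls = ["https://example.com/articles"]
|
|           def parse(self, response):
|               # one record per article; dump to a db or flat file
|               for article in response.css("article"):
|                   yield {
|                       "title": article.css("h2::text").get(),
|                       "url": response.urljoin(
|                           article.css("a::attr(href)").get()),
|                       "body": " ".join(
|                           article.css("p::text").getall()),
|                   }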
| Alifatisk wrote:
| Searchkick gem + Elasticsearch is a good combo
| sudobash1 wrote:
| I've been wanting to do something similar. ArchiveBox seems to
| be the best solution for this sort of self-hosted, searchable
| web archive. It has multiple search back-ends and plugins to
| sync browser bookmarks (or even history).
|
| I haven't finished getting it set up though, so take this
| recommendation with a hefty grain of salt.
| SpriglyElixir12 wrote:
| How would something like this work in practice? Would you
| generate any tags or summaries per site when inserting it
| into the db?
| janalsncm wrote:
| You could run a full text search or search against an auto-
| generated summary. Or if you want to be fancy, use semantic
| search like in Retrieval Augmented Generation.
| fizx wrote:
| One of the many problems with search inside Postgres is that
| search is very CPU-heavy (and starting to become GPU-heavy).
| Ideally, you'd like to reserve the CPU in your database for
| transactional updates to your core data model.
|
| I've seen a lot of ES and Solr clusters running 10+ nodes at
| 100% CPU during a re-index, or at 30-50% during normal
| operation. The corresponding database would be, say, an AWS L/XL
| instance with 50-100GB of data at 30% CPU utilization. Moving
| all of that search CPU into your primary DB means you'd now have
| to shard it.
|
| But I love PG extensions for search, recursive joins, vectors,
| etc on side projects. It can keep things fun and simple.
| nbgoodall wrote:
| Could you solve this by searching on a read-only replica?
| asadawadia wrote:
| https://rcs.aawadia.dev/ is also using pg search
| jsdwarf wrote:
| Looking forward to part 2 / Postgres vs Elasticsearch. One
| application at my company uses PG for CRUD on objects and
| Elastic for searching them. We completely underestimated the
| effort of keeping the two datastores in sync and are actually
| contemplating getting rid of Elasticsearch.
| cameronpm wrote:
| Normally, best practice is to return primary keys from ES and
| then do a lookup query in the db. Did you attempt this?
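|
| A sketch of that pattern (hypothetical index/table names;
| assumes psycopg2 and the official elasticsearch client):
|
|       import psycopg2
|       from elasticsearch import Elasticsearch
|
|       es = Elasticsearch("http://localhost:9200")
|       conn = psycopg2.connect("dbname=app")
|       cur = conn.cursor()
|
|       # search ES for ids only, then hydrate the rows from PG
|       hits = es.search(index="docs",
|                        query={"match": {"body": "postgres"}})
|       ids = [int(h["_id"]) for h in hits["hits"]["hits"]]
|       cur.execute("SELECT id, title FROM docs WHERE id = ANY(%s)",
|                   (ids,))
|       print(cur.fetchall())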
| dqv wrote:
| >We completely underestimated the effort of keeping the two
| datastores in sync and are actually contemplating getting rid
| of Elasticsearch.
|
| Can you expand on this? Is it that it's tedious to write code
| that updates both? I've been meaning to play around with
| Meilisearch and was trying to think through the synchronization
| issue.
| valzam wrote:
| I had a similar setup at a previous gig and didn't find it
| particularly challenging. Whenever an update happens to an
| entity in PG, we send a message to asynchronously replicate
| that entity into ES (via ID lookup in PG). As always, you need
| good monitoring and retries for async computation, but ES is
| quite stable and fast, so we rarely had any issues.
|
| Now, we had pretty lax consistency requirements: as long as the
| "latest state" of PG ended up in ES within a reasonable
| timeframe, everything was fine - so maybe your requirements are
| different.
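|
| Roughly this shape, as a sketch with hypothetical names (assumes
| psycopg2 and the official elasticsearch client): the app
| enqueues an id on every write, and a worker re-reads the row
| from PG and upserts it into ES, so ES converges on the latest
| state.
|
|       import psycopg2
|       from elasticsearch import Elasticsearch
|
|       conn = psycopg2.connect("dbname=app")
|       es = Elasticsearch("http://localhost:9200")
|
|       def replicate(entity_id):
|           # re-read from PG so we always index the latest state
|           cur = conn.cursor()
|           cur.execute(
|               "SELECT id, title, body FROM docs WHERE id = %s",
|               (entity_id,))
|           row = cur.fetchone()
|           if row:
|               es.index(index="docs", id=row[0],
|                        document={"title": row[1], "body": row[2]})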
___________________________________________________________________
(page generated 2023-07-12 23:00 UTC)