[HN Gopher] Creating a search engine with PostgreSQL
       ___________________________________________________________________
        
       Creating a search engine with PostgreSQL
        
       Author : todsacerdoti
       Score  : 251 points
       Date   : 2023-07-12 18:05 UTC (4 hours ago)
        
 (HTM) web link (xata.io)
 (TXT) w3m dump (xata.io)
        
       | 0xbadcafebee wrote:
       | When I was a teenager I made a search engine and database from
       | scratch, not knowing anything about either. I just wanted to see
        | how far I could get and how fast I could make it while
        | returning somewhat relevant search results.
       | 
       | It's actually very easy to make a fast basic database and search
       | engine, even as an amateur programmer. As long as you understand
       | basic CS algorithms and how to exploit the operating system and
       | hardware, you can put one together in a month or two. Speed is
       | not bad even with high-level languages; something like 250K QPS,
       | back in 2003, on a laptop. Scalability isn't much of an issue
       | either if you shard it. Indexing, locking, and consistency are
       | more complicated than the storage and retrieval parts.
       | 
       | The big problem to overcome is the subjective nature of search.
        | What do I _really_ want to find? How do I find something when I
        | don't know what I'm looking for? How do I get around people
       | trying to game the system? How do I handle complex queries and
       | datasets? That's when it gets orders of magnitude harder.
        
         | gomezjdaniel wrote:
          | Is it open sourced anywhere? I'm curious.
        
       | daitangio wrote:
        | Very nice and clear article. SQLite also offers advanced
        | indexing functionality and stemming via a standard plugin. For
        | English, SQLite works well too.
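        | 
        | For example, a minimal FTS5 sketch with Porter stemming (the
        | table and data here are made up):
        | 
        |   CREATE VIRTUAL TABLE docs
        |     USING fts5(title, body, tokenize = 'porter');
        | 
        |   INSERT INTO docs VALUES
        |     ('pg search', 'notes on stemming and searching');
        | 
        |   -- 'stem' matches 'stemming' thanks to the stemmer
        |   SELECT * FROM docs WHERE docs MATCH 'stem';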
        
       | lettergram wrote:
       | I've actually done this before and made it pretty damn fast -
       | 
       | https://austingwalters.com/fast-full-text-search-in-postgres...
       | 
       | The website is https://askhn.ai for the moment
        
       | rane wrote:
       | How does SQLite fare against PostgreSQL for this particular use
       | case?
        
         | shortrounddev2 wrote:
          | Why do I see so much discussion about SQLite on HN? I've
          | literally never seen it used in production; my usage of it is
          | purely as a database for local testing.
        
           | Mustachio wrote:
            | We do natural language processing with some custom
            | tokenisation; the data it uses at runtime is put into an
            | SQLite db at build time. Works fine for our use case.
        
           | 0xbadcafebee wrote:
           | HN has fads like everything else. If this were 10 years ago
           | they'd be asking about MongoDB. 5 years ago they'd be asking
           | if the Blockchain could power their search engine. Now they
           | want an AI search engine.
        
             | appletrotter wrote:
             | I feel like MongoDB was on the wane even 5 years ago.
        
               | callalex wrote:
               | Also, the interest may not have all been...organic.
        
             | thewataccount wrote:
              | SQLite has nothing to do with AI searching in the way I
              | think you're suggesting.
              | 
              | It was not designed for "AI usage" or as an "AI database".
              | 
              | SQLite is from 2000 and is more comparable to a single-
              | file Postgres/MySQL database.
              | 
              | It's also on literally billions of devices, including
              | whatever you're reading this on.
        
           | thewataccount wrote:
            | Put simply - SQLite is incredibly powerful and simple - it
            | can handle a lot more than people might think - and AFAIK it
            | is the most "used in production" database, whether you're
            | aware of it or not.
            | 
            | Depending on the use case it can scale incredibly well, and
            | it is tiny and battle-hardened.
            | 
            | It's the most deployed database: it's on every smartphone,
            | "smart device", and computer, it's inside many of your apps,
            | and it powers many of the websites you use.
            | 
            | ---
            | 
            | tl;dr - If you're following "use the simplest, but reliable,
            | tool for the job", then SQLite is a valid option for a
            | crapload of use cases; most people aren't aware of what it
            | can actually do because it doesn't "web scale".
            | 
            | SQLite isn't new, it's old, and its "used in production"
            | count is in the literal billions.
        
           | osrec wrote:
           | It is used in a huge number of production environments,
           | including within your browser (most probably).
           | 
            | Probably not something you want to run a multi-tenant SaaS
            | database with, but it is useful if you are going the one-
            | tenant-per-database route.
        
           | bshipp wrote:
            | SQLite is extremely powerful for locally-embedded
            | databases, such as apps with an offline component - for
            | example, searching text in notes, books, cached news, etc.
            | It's very simple to implement inside SQLite, with much less
            | overhead than including a Postgres/MySQL instance.
        
           | klysm wrote:
           | It's used in production quite a bit, probably just not in
           | your field as much. Anytime I'm thinking about storing
           | something in a file I use SQLite instead
        
           | TX81Z wrote:
           | Lol, you're using it "in production" right now! SQLite is
           | everywhere, almost definitely in the browser you're using to
           | view this post.
        
             | shortrounddev2 wrote:
              | I guess it's a difference in industry. Everything I do is
              | accessed over a network, and for caching we use simpler
              | key-value store libraries rather than relational
              | databases. I don't develop client software.
        
           | samwillis wrote:
            | With the move to the "edge" for compute, having your
            | database at the edge reduces latency and basically makes
            | that viable. So the theory is that SQLite read replicas
            | will be the answer to a lot of these plans. See what Fly
            | and Cloudflare are investing in with their edge networks
            | and SQLite plans.
            | 
            | On top of that, SQLite is now viable in the browser with
            | WASM and the "origin private file system" API. And so
            | people are increasingly looking at moving more state and
            | persistence to the browser with an eventually consistent
            | sync. This is what the "Local First" movement, myself
            | included, is excited about.
        
         | rahkiin wrote:
         | There is https://www.sqlite.org/fts5.html but I cannot deduce
         | if it has the same features; the way of working is completely
         | different
        
         | bshipp wrote:
         | I've had good luck using FTS5 in Sqlite for small-scale use
         | scenarios, such as examining scraped data, etc. It's remarkably
         | powerful and quick.
         | 
         | https://www.sqlitetutorial.net/sqlite-full-text-search/
         | 
         | https://www.sqlite.org/fts5.html
        
       | quartz wrote:
       | You can also marry these techniques with pgvector to find related
       | content through embeddings. I've found this to be pretty magical.
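        | 
        | A rough sketch of the pgvector side (the table/column names
        | are made up, and real embeddings have far more dimensions than
        | the toy vector(3) here):
        | 
        |   CREATE EXTENSION IF NOT EXISTS vector;
        |   ALTER TABLE articles ADD COLUMN embedding vector(3);
        | 
        |   -- "related content" = nearest neighbors by cosine distance
        |   SELECT id, title
        |     FROM articles
        |    ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector
        |    LIMIT 10;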
        
         | jstummbillig wrote:
         | Could you get into some more detail or point at something that
         | does?
        
           | kaycebasques wrote:
           | https://supabase.com/blog/openai-embeddings-postgres-vector
           | 
           | https://supabase.com/blog/chatgpt-supabase-docs
        
         | binarymax wrote:
         | It's interesting that folks go down this hacky route when they
         | can use something like Vespa, which is orders of magnitude
         | better from a performance, relevance, scalability, and
         | developer ergonomics perspective.
        
           | moomoo11 wrote:
           | Is that a different system?
           | 
           | Sorry I'm on spotty mobile that can't open anything besides
           | HN lol (God bless this website).
           | 
           | Sometimes it is just easier to use the existing systems and
           | squeeze them as much as possible. Especially when it's a
           | small team or solo without much $$
        
             | binarymax wrote:
             | When it comes to search I cannot disagree more.
             | 
             | https://vespa.ai is a purpose built search engine.
             | 
             | If you start bolting search onto your database, your
             | relevance will be terrible, you'll be rewriting a lot of
             | table stakes tools/features from scratch, and your
             | technical debt will skyrocket.
        
           | [deleted]
        
           | LunaSea wrote:
           | The advantage of pg_vector is that you don't need a second,
           | specialised database and you also don't need to synchronise
           | data.
           | 
            | It makes much more operational sense to use pg_vector if
            | your use case can be implemented that way.
        
             | binarymax wrote:
             | It makes terrible operational sense. What are the HA/DR,
             | sharding, replica, and backup strategies and tools for
             | pg_vector? What are the embedding integration and relevance
             | tools? What are the reindexing strategies? What are the
             | scaling, caching, and CPU thrashing resolution paths?
             | 
             | You're going to spend a bunch of time writing integrations
             | that already exist for actual search engines, and you're
             | going to be stuck and need to back out when search becomes
             | a necessity rather than an afterthought.
        
               | [deleted]
        
               | djbusby wrote:
                | The HA/DR, Sharding, Replica and Backup would all be
                | the same as before. It's all in PG, so you use the
                | existing methods.
               | 
               | If you have two systems, then you have two (unique)
                | answers for HA, DR, Shard, Replica, and Backup - the PG
                | set and the Vespa set.
               | 
               | That's more complicated, from an operational perspective.
               | 
               | PG FTS is quite good, and there are in-pg methods that
               | can improve it.
        
               | whakim wrote:
               | What if you don't need those things yet and you just have
               | some embeddings you want to query for cosine similarity?
               | A dedicated vector database is way, way overkill for many
               | people.
        
               | philipbjorge wrote:
               | What makes most operational sense is going to depend on
               | your context.
               | 
               | From my vantage point, you're both right in the
               | appropriate context.
        
       | esafak wrote:
       | p50/p99 retrieval times at realistic loads or it didn't happen. A
       | search engine that returns results in, say, a minute is not
       | "advanced". Of course a relational database like Postgres can do
       | it on paper.
        
         | lelanthran wrote:
         | > p50/p99 retrieval times at realistic loads or it didn't
         | happen.
         | 
         | Therein lies the problem - how do you generate actual realistic
         | loads for a search engine without having a large number of
         | people use it for searches? Simply hitting it with random
         | search terms isn't realistic.
         | 
         | Some people will be on slow connections, search terms for
         | something specific might spike in only a certain region
         | (earthquake, etc), etc.
         | 
         | If your terms are too random, it'll perform worse than it
         | should (results not in the cache), and if not random enough it
         | will perform better than it should.
        
           | gleenn wrote:
           | One actual solution is to use historical search logs. Just
           | because "random" is a bad answer doesn't mean people don't
           | try and make reasonable reproductions of load to replay and
            | benchmark. Caching is also a big factor.
        
         | ramraj07 wrote:
          | We use Postgres FTS and it works fine; you just gotta watch
          | how you rank order the rows in the query. If you only use
          | ts_rank then it's perfect, but you likely want to use some
          | other relevance metric to adjust the ranking - and then you
          | can't rank order primarily by that metric. Once you nail
          | this, the results are as fast as any other typical db table
          | query with an index.
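          | 
          | A minimal sketch of that query shape (the names are
          | illustrative; assumes a stored tsvector column):
          | 
          |   CREATE INDEX docs_tsv_idx ON docs USING GIN (tsv);
          | 
          |   SELECT id, ts_rank(tsv, q) AS rank
          |     FROM docs, to_tsquery('english', 'postgres & search') q
          |    WHERE tsv @@ q
          |    ORDER BY rank DESC
          |    LIMIT 20;
          | 
          | The tricky part is blending in a second signal, e.g. ORDER
          | BY ts_rank(tsv, q) * popularity_boost DESC, where
          | popularity_boost is whatever per-row metric you keep.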
        
           | glintik wrote:
            | Actually, good ranking is the most valuable thing in a
            | search engine.
        
             | qingcharles wrote:
             | Good ranking is what made Google a trillion dollar company.
             | 
              | Google indexed the same sites as Altavista, but Google's
              | PageRank made the right sites bubble to the top and made
              | Sergey and Larry billionaires...
        
         | karmakaze wrote:
          | Lucene (what Elasticsearch and Solr use) is based on inverted
          | indexes, which is what the GIN in the article also uses.
          | 
          | So the benefits of ES/etc. are being able to scale
          | horizontally across nodes, plus any additional features it
          | adds on top of the main index.
        
         | tudorg wrote:
         | (disclaimer: author of the post)
         | 
          | I plan a follow-up to compare it with Elasticsearch; however,
          | I don't think I'm going to attempt benchmarking, because
          | whatever realistic scenario I come up with will not
          | necessarily be relevant to your use case.
         | 
          | I mostly agree with you, and I probably wouldn't use this at
          | large scale (say, more than a few million records). I was
          | primarily interested in how much of the functionality I could
          | replicate, because for small search use cases this has some
          | clear advantages: less infra to maintain, strong consistency,
          | joins, etc.
         | 
          | Also, at Xata we're thinking about offering a smooth
          | transition from using Postgres at small scale to using
          | Elasticsearch, with minimal breaking changes.
        
           | bagels wrote:
            | Sure, benchmarks won't prove that it works in ALL cases, but
            | they can at least prove it works in SOME cases?
        
           | TX81Z wrote:
            | I had done something similar recently. The goal was to take
            | a huge Postgres database and make it searchable and usable.
            | 
            | It ended up that doing an offline batch job to boil down
            | the much bigger dataset into a single pre-optimized table
            | was the best approach for us. Once optimized and
            | tsvectored, it was fairly performant, and Elastic wasn't a
            | huge gain over it. Still keeping the Elastic code around
            | "in case", but yeah, Postgres search can be "good enough"
            | when you aren't serving a ton of clients.
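            | 
            | The "tsvectored" part can be as simple as a generated
            | column plus a GIN index on the pre-optimized table (a
            | sketch with made-up names; needs Postgres 12+ for
            | generated columns):
            | 
            |   ALTER TABLE search_docs ADD COLUMN tsv tsvector
            |     GENERATED ALWAYS AS (
            |       to_tsvector('english',
            |         coalesce(title, '') || ' ' || coalesce(body, ''))
            |     ) STORED;
            | 
            |   CREATE INDEX search_docs_tsv_idx
            |     ON search_docs USING GIN (tsv);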
        
           | bshipp wrote:
           | Using measured comparisons and making informed choices is
           | certainly the best way.
           | 
            | If you can start with Postgres to have a relational
            | database with the benefit of full-text search (i.e. avoid
            | Elasticsearch) as well as JSON fields (i.e. avoid MongoDB),
            | then you end up simplifying initial hardware/software
            | requirements while retaining the ability to migrate to
            | those solutions when user demand requires it.
           | 
           | So many developers seem to build with the idea that they'll
           | become the next FAANG when actual (or reasonably forecasted)
           | user load doesn't remotely require such a complex software
           | stack.
        
         | [deleted]
        
         | [deleted]
        
         | nanidin wrote:
          | I interviewed somewhere with an ex-Googler who revealed they
          | cache all of the search results for terms they have seen
          | before; when they update the index, they also update the
          | cached results. From that perspective, fast search results
          | aren't actually that exciting, since you can constantly run a
          | background task to update the cached results and just serve
          | those as the requests come in. With that caching, response
          | time is orthogonal to the speed of actually computing the
          | search results.
        
           | TX81Z wrote:
           | Almost everything is faster if you can precompute.
        
             | winrid wrote:
             | Yup, all FastComments threads are precomputed for all sort
             | directions! Each page comes from the DB in < 1.5ms 75% of
             | the time.
             | 
             | It gets tough with pages with 100k+ comments though, so
             | there are different tricks and switches for different flows
             | and data sizes.
        
               | qingcharles wrote:
               | Your site is killing me. Every page I visit I have to
               | accept the cookies :(
        
             | bshipp wrote:
             | For sure, but the approach is quite viable. If 19 out of 20
              | searches by a user are almost instantaneous and the single
              | novel one requires a few seconds, they'll assume a hiccup
             | in their internet connection and still view the site as
             | "really fast". It's certainly useful for limiting demands
             | on expensive hardware.
        
               | teej wrote:
               | 20% of all Google searches are brand new
        
               | andrelaszlo wrote:
               | And use 80% of the resources?
        
               | bagels wrote:
               | Then 80% aren't, and those ones will perform very well.
        
             | winddude wrote:
              | Which is why ES uses multiple levels of caching.
        
       | _s_a_m_ wrote:
       | Has the word "advanced", must be good ..
        
       | endisneigh wrote:
       | For something like this I'd use FoundationDB to store the entries
       | redundantly and consistently with high availability and TypeSense
       | for searching through the records and regenerating the index as
       | necessary from FDB.
        
         | esafak wrote:
         | Does Typesense work with external vector databases?
        
           | jabo wrote:
           | Typesense has a vector store / search built-in:
           | https://typesense.org/docs/0.24.1/api/vector-search.html
           | 
           | In the upcoming version, we've also added the ability to
            | automatically generate embeddings from within Typesense,
            | either using OpenAI, the PaLM API, or a built-in model like
            | s-bert or E5. So you only have to send JSON and pick a
            | model; Typesense will then do a hybrid vector+keyword
            | search for queries.
        
             | esafak wrote:
             | I see you run hnswlib but do you (plan to) support external
             | vector databases, so users can upgrade?
        
               | jabo wrote:
               | We don't plan to support external vector databases, since
               | we want to build Typesense as a vector + keyword search
               | datastore by itself.
        
               | esafak wrote:
               | I see. Do you plan to replace hnswlib with your own
               | technology?
        
       | b1zguy wrote:
       | I've been meaning to attempt running a custom search engine for
        | particular sites I've 'bookmarked'. Some sites contain gold
        | that could be useful in the future and isn't often surfaced in
        | Google results.
        | 
        | Should I go the Postgres/Elasticsearch route, or are there
        | somewhat out-of-the-box solutions available?
        
         | binarymax wrote:
         | For something small with a minimal footprint, I'd recommend
         | Typesense. https://github.com/typesense/typesense
         | 
         | Elasticsearch is heavy, and relational databases with search
         | bolted on (like Postgres or SQLite) aren't great.
        
           | bshipp wrote:
           | It depends on what the user requirements are. FTS works
           | pretty well with both Postgres and SQLite, in my experience.
           | 
           | Here's a git repo someone can modify to do a cross comparison
           | on a specific dataset, if they are interested. It doesn't
            | seem to indicate the RDBMSs are outclassed in a small-scale
           | FTS implementation.
           | 
           | https://github.com/VADOSWARE/fts-benchmark
        
             | hardwaresofton wrote:
              | Also, I've got a small project in which I try to compare
              | Meilisearch and Postgres FTS w/ pg_trgm; it's called
              | PodcastSaver:
              | 
              | Podcastsaver.com (click on the nerds tab in the top right)
              | 
              | Never got to it, but there are a bunch of other search
              | engines worth adding -- Sonic, Typesense, etc. Maybe some
              | day.
        
             | binarymax wrote:
             | For personal use nobody cares about 100ms vs 10ms response.
             | What they do care about is relevance. Consider the
              | following from that repo's outputs:
             | 
              | Typesense
              |   [timing] phrase [superman]: returned [28] results in 4.222797.ms
              |   [timing] phrase [suprman]: returned [28] results in 3.663458.ms
              | 
              | SQLite
              |   [timing] phrase [superman]: returned [47] results in 0.351138.ms
              |   [timing] phrase [suprman]: returned [0] results in 0.07513.ms
             | 
             | So SQLite is faster, but who cares? I want things like
             | relevance and typo resilience without having to configure
             | anything.
        
               | evdubs wrote:
               | The article covers typo resilience in the section "Typo
               | tolerance / fuzzy search".
               | 
               | This adds a step between query entry and text search
               | where you find the similarity of query words to unique
               | lexemes if the word is not a lexeme. Seems like a
               | reasonable compromise to me?
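                | 
                | Roughly, with pg_trgm (assuming a
                | unique_lexemes table built ahead of time,
                | e.g. from ts_stat over the corpus):
                | 
                |   CREATE EXTENSION IF NOT EXISTS pg_trgm;
                | 
                |   SELECT word
                |     FROM unique_lexemes
                |    ORDER BY similarity(word, 'suprman') DESC
                |    LIMIT 1;
                | 
                | The best match (presumably "superman" here)
                | then gets fed into to_tsquery for the actual
                | search.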
        
               | bshipp wrote:
               | I'm not trying to be argumentative. As long as people
               | find a solution they're happy with, I think that's great.
               | For me, I'm far less interested in handling typos, but I
               | can see how it would be valuable in many applications.
                | I'm usually less interested in tying in and learning
                | another set of services if I can get 90% of the way
                | there with one, while leaving the option of adding
                | more later if additional requirements make it
                | necessary.
        
         | bshipp wrote:
          | For such light demand and fixed site requirements, a single-
          | file SQLite DB is probably best. Modern SQLite has full-text
          | capabilities that are quite powerful and relatively easy to
          | implement.
         | 
         | https://www.sqlite.org/fts5.html
        
         | tudorg wrote:
         | Do you prefer it locally or in the cloud? If in the cloud,
         | check out Xata (the domain of the blog post here).
        
         | b1zguy wrote:
          | Edit: I forgot to ask how I would add the web pages to the
          | databases already suggested here. Do I need to use a separate
          | program to spider/index each site, and check for its updates?
        
           | bshipp wrote:
           | If you're looking for a turn-key solution, I'd have to dig a
            | little. I generally write a scraper in Python that dumps
            | into a database or flat file (depending on the number of
            | records I'm hunting).
           | 
           | Scraping is a separate subject, but once you write one you
           | can generally reuse relevant portions for many others. If you
           | can get adept at a scraping framework like Scrapy you can do
           | it fairly quickly, but there aren't many tools that work out
           | of the box for every site you'll encounter.
           | 
           | Once you've written the spider, it's generally able to be
           | rerun for updates unless the site code is dramatically
            | altered. It really comes down to how brittle the spider is
            | coded (e.g. hunting for specific heading sizes or fonts)
            | instead of grabbing the underlying JSON/XHR, which doesn't
            | usually change frequently.
           | 
           | 1. https://scrapy.org
        
         | Alifatisk wrote:
         | Searchkick gem + Elasticsearch is a good combo
        
         | sudobash1 wrote:
          | I've been wanting to do something similar. Archivebox seems to be
         | the best solution for this sort of self-hosted, searchable web
         | archive. It has multiple search back-ends and plugins to sync
         | browser bookmarks (or even history).
         | 
         | I haven't finished getting it set up though, so take this
         | recommendation with a hefty grain of salt.
        
           | SpriglyElixir12 wrote:
           | How would something like this work in practice? Would you
           | generate any tags or summaries per site when inserting it
           | into the db?
        
             | janalsncm wrote:
             | You could run a full text search or search against an auto-
             | generated summary. Or if you want to be fancy, use semantic
             | search like in Retrieval Augmented Generation.
        
       | fizx wrote:
       | One of the many problems with search inside Postgres is that
       | search is very CPU-heavy (and starting to become GPU-heavy).
       | Ideally, you'd like to reserve CPU in your database for
       | transactional updates to your core data model.
       | 
       | I've seen a lot of ES and Solr clusters operating at 100% of 10+
       | nodes during a re-index, or just 30-50% of 10+ nodes during
       | normal operation. The corresponding database would be say an AWS
       | L/XL instance at 50-100GB of data and 30% CPU utilization. Moving
       | all of the search CPU into your primary DB means now you'd have
       | to shard it.
       | 
       | But I love PG extensions for search, recursive joins, vectors,
       | etc on side projects. It can keep things fun and simple.
        
         | nbgoodall wrote:
         | Could you solve this by searching on a read-only replica?
        
       | asadawadia wrote:
       | https://rcs.aawadia.dev/ is also using pg search
        
       | jsdwarf wrote:
        | Looking forward to part 2 / Postgres vs Elasticsearch. One
        | application at my company uses PG for CRUD on objects and
        | Elastic for searching them. We completely underestimated the
        | effort of keeping the two datastores in sync and are actually
        | contemplating getting rid of Elasticsearch.
        
         | cameronpm wrote:
         | Normally best practice is to return primary keys in ES and then
         | do a lookup query in the db. Did you attempt this?
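          | 
          | e.g., to hydrate in ES's ranking order on the PG side (a
          | sketch; table and IDs are made up):
          | 
          |   SELECT d.*
          |     FROM docs d
          |     JOIN unnest(ARRAY[42, 7, 19])
          |          WITH ORDINALITY AS r(id, ord)
          |       ON d.id = r.id
          |    ORDER BY r.ord;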
        
         | dqv wrote:
          | > We completely underestimated the effort of keeping the two
          | datastores in sync and are actually contemplating getting rid
          | of Elasticsearch.
         | 
         | Can you expand on this? Is it that it's tedious to write code
         | that updates both? I've been meaning to play around with
         | meilisearch and was trying to think about the synchronization
         | issue.
        
         | valzam wrote:
         | I had a similar setup at a previous gig and didn't find it
         | particularly challenging. Whenever an update happens to an
         | entity in PG we send a message to async replicate that entity
         | in ES (via ID lookup in PG). As always you need good monitoring
         | and retries for async computation but ES is quite stable and
         | fast so we rarely had any issues.
         | 
         | Now, we had pretty lax consistency requirements, as long as the
         | "latest state" of PG ended up in ES within a reasonable
         | timeframe everything was fine so maybe your requirements are
         | different.
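          | 
          | For flavor, the Postgres end of such a pipeline can be as
          | small as a trigger that notifies a worker, which then
          | re-reads the row and pushes it to ES (a sketch with made-up
          | names, not our exact setup; needs Postgres 11+ for EXECUTE
          | FUNCTION):
          | 
          |   CREATE OR REPLACE FUNCTION notify_es_sync()
          |   RETURNS trigger AS $$
          |   BEGIN
          |     PERFORM pg_notify('es_sync', NEW.id::text);
          |     RETURN NEW;
          |   END;
          |   $$ LANGUAGE plpgsql;
          | 
          |   CREATE TRIGGER docs_es_sync
          |     AFTER INSERT OR UPDATE ON docs
          |     FOR EACH ROW EXECUTE FUNCTION notify_es_sync();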
        
       ___________________________________________________________________
       (page generated 2023-07-12 23:00 UTC)