[HN Gopher] Evolution of Search Engines Architecture - Algolia S...
___________________________________________________________________
Evolution of Search Engines Architecture - Algolia Search
Architecture Part 1
Author : PretzelFisch
Score : 154 points
Date : 2021-08-08 12:36 UTC (10 hours ago)
(HTM) web link (highscalability.com)
(TXT) w3m dump (highscalability.com)
| polote wrote:
| The issue with Algolia is that they have insane technology but it
| is mostly used only to search documentation.
|
| They are struggling to sell their technology to the people who
| need it most, for a lot of reasons. One of them is that they
| are a tricky choice: it is not a database technology, so it is
| not a developer's choice, yet the technology is only useful to
| developers.
|
| As a result they have to try to sell their product when you need
| search but no developers are working on it. That's how you end
| up powering external and internal documentation portals. That's
| really a waste of resources.
| petulla wrote:
| no BERT?
| ramoz wrote:
| I've scaled large transformer based models that supplement a
| lucene-based search engine. The architecture supports an
| ensemble approach where Lucene results are first-class and then
| we tailor similarity rankings with the models.
|
| It looks a lot like this:
| https://huggingface.co/blog/bert-cpu-scaling-part-1
|
| We have to store large "index" embeddings on SSDs and use
| leveldb for value retrievals of the lucene results.
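|
| A minimal sketch of the ensemble idea above, not the actual
| system: lexical hits stay first-class and an embedding
| similarity only adjusts their order. The rerank function, the
| weight parameter and the in-memory dict standing in for a
| key-value store such as leveldb are all assumptions made for
| illustration.
```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rerank(lexical_hits, query_vec, embedding_store, weight=0.5):
    """Blend normalized lexical scores with embedding similarity.

    lexical_hits: list of (doc_id, lexical_score) from the keyword engine.
    embedding_store: mapping doc_id -> vector (hypothetical stand-in for
    an on-disk key-value store).
    """
    top = max((score for _, score in lexical_hits), default=0.0) or 1.0
    scored = []
    for doc_id, lex in lexical_hits:
        vec = embedding_store.get(doc_id)
        sim = cosine(query_vec, vec) if vec is not None else 0.0
        scored.append((doc_id, (1 - weight) * (lex / top) + weight * sim))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up sample data: three keyword hits and their 2-d embeddings.
hits = [("a", 3.0), ("b", 2.5), ("c", 1.0)]
store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, 0.8]}
ranked = rerank(hits, [0.0, 1.0], store)
print([doc_id for doc_id, _ in ranked])
```
| Only documents the keyword engine already returned get
| rescored, so the models never add results of their own.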
| lmeyerov wrote:
| Yep I was surprised -- google and others have long moved to
| neural search, afaict, where we are seeing things like faiss
| for indexes based on embeddings, and all sorts of deploy pain
| around training+inference. I knew that was still true for
| elastic, but hadn't realized it also held for their
| replacements. So this article is about clustering for
| pre-neural search, and I guess enterprise search is still
| getting there.
| cabbagehead wrote:
| What's the "snapshot"/"snapshat" use case mentioned?
| ww520 wrote:
| Are suffix trees/arrays used at all? How about n-grams with a
| Bloom filter for filtering documents?
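|
| A rough sketch of what n-gram plus Bloom filter pre-filtering
| could look like, assuming one small filter per document built
| over character trigrams. The class, sizes and hash scheme are
| illustrative choices, not something taken from the article.
```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_filters(docs):
    filters = {}
    for doc_id, text in docs.items():
        bf = BloomFilter()
        for gram in ngrams(text):
            bf.add(gram)
        filters[doc_id] = bf
    return filters

def search(query, docs, filters):
    grams = ngrams(query)
    # Cheap pass: skip documents whose filter rules out any query trigram.
    candidates = [d for d, bf in filters.items()
                  if all(bf.might_contain(g) for g in grams)]
    # Bloom filters can give false positives, so verify against the text.
    return [d for d in candidates if query.lower() in docs[d].lower()]

docs = {1: "Evolution of search engines", 2: "Sharding and clustering"}
filters = build_filters(docs)
print(search("search", docs, filters))
```
| The filter can only say "definitely not here" or "maybe here",
| which is why the final substring check is still needed.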
| avereveard wrote:
| idk, this seems more like an evolution of clustering. When I
| think about search engines I think more of the progression
| toward stemming, lemmatization, synonym matching and context
| matching.
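|
| For readers unfamiliar with those steps, a toy sketch of
| query normalization with stemming and synonym matching. The
| suffix list and the synonym map are made up for illustration;
| a real engine would use a proper stemmer or lemmatizer.
```python
SYNONYMS = {"laptop": "notebook", "cellphone": "phone"}  # hypothetical map
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Lowercase, stem, then map each stem to its canonical synonym."""
    stems = [crude_stem(token) for token in text.lower().split()]
    return [SYNONYMS.get(stem, stem) for stem in stems]

print(normalize("Searching laptops"))
print(normalize("searched notebooks"))
```
| Both queries normalize to the same tokens, so they would match
| the same indexed documents.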
| ramraj07 wrote:
| Also doing it in memory (which is what all the regular search
| engines do right?)
| jabo wrote:
| No, ElasticSearch for example uses a disk-first approach.
| Grimm1 wrote:
| ES uses a disk first approach but only on first load, and
| is smart enough to load similar results for frequently
| searched items into cache as well. That's why search return
| times differ significantly between hot and cold queries.
| This is actually such a problem that a lot of the time in
| older ES versions you wind up prewarming the ES cache
| before you can actually let it be used in production. Most
| alternative search engine implementations, especially
| vector based search engines, load into memory when brought
| up rather than at query time. ES still isn't great at
| this, which is one reason they're falling behind in modern
| search; that, and their vector search support was kind of
| abysmal as of even 6 months ago.
| ramoz wrote:
| They're falling behind compared to what exactly?
|
| Elastic is great on-disk, especially on SSDs and avoiding
| issues like write amplification.
|
| Loading large indexes in memory isn't simple/cheap and
| when it comes to vectors we're talking apples/oranges I
| feel. Modern search architectures need to embrace
| ensemble approaches, but boolean-based content search is
| often the primary utility in enterprise (and search is
| supplemented by a customizable tf-idf). Vector-based
| retrieval & similarity is still useful, but it's not
| something you necessarily need elastic to do for you, and
| the two can co-exist.
| Grimm1 wrote:
| I've scaled a cluster that was in the 100s of millions of
| results range. The experience was not great, and tuning
| for our use case, which was decidedly not a typical
| enterprise search problem, made it a complete pain. So
| it's great that it works for that particular case, and we
| ultimately made it work ourselves much like you're
| suggesting: we used vector search with something like
| FAISS as a pre-filtering step and then a final search
| through a much reduced set of ids in ES. But it's pretty
| clear a new player could come in here and make a much
| better experience. Basically ES is, if not unsuitable, a
| big pain for large non-enterprise search such as web
| search, where things like vector search are one major
| signal and provide a better search experience. And that's
| the exact problem: there aren't off-the-shelf open source
| solutions if you're not doing a fairly standard ecom or
| internal business search type problem like log
| aggregation or internal documents.
|
| I'm also suggesting that use cases that aren't enterprise
| type search problems are more common than you'd think
| these days.
|
| Edit: Additionally, you have classical boolean search
| systems like ES and vector search solutions like Milvus,
| but no one's gotten around to making something that does
| both well. From what I can see, a lot of the players in
| the space are trying to go in that direction, but it's a
| slow, painful crawl. That results in situations like ours,
| where we had to do a lot of custom gluing of these systems
| together, and keeping them in parity was super annoying,
| expensive and time consuming, though not necessarily
| performance inhibiting.
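|
| A minimal sketch of the two-stage pipeline described above,
| under loud assumptions: a brute-force cosine scan stands in
| for an ANN index such as FAISS, and a plain boolean AND over
| an in-memory dict stands in for the final ES query. All
| function names and data are hypothetical.
```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def vector_prefilter(query_vec, vectors, k):
    """Stage 1: keep the k nearest doc ids (brute force, not a real ANN index)."""
    ranked = sorted(vectors, key=lambda d: cosine(query_vec, vectors[d]),
                    reverse=True)
    return set(ranked[:k])

def keyword_search(terms, texts, allowed_ids):
    """Stage 2: boolean AND over terms, restricted to the pre-filtered ids."""
    hits = []
    for doc_id in sorted(allowed_ids):
        words = set(texts[doc_id].lower().split())
        if all(term.lower() in words for term in terms):
            hits.append(doc_id)
    return hits

# Made-up corpus with 2-d embeddings.
texts = {1: "fast vector search", 2: "slow disk search",
         3: "fast disk cache", 4: "vector math"}
vectors = {1: [0.9, 0.1], 2: [0.2, 0.8], 3: [0.1, 0.9], 4: [1.0, 0.0]}

candidates = vector_prefilter([1.0, 0.0], vectors, k=2)
results = keyword_search(["search"], texts, candidates)
print(candidates, results)
```
| Document 2 also contains "search" but never reaches the
| keyword stage, which shows both the saving and the recall
| risk of pre-filtering.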
| stingraycharles wrote:
| It's a highscalability blog post, though, which usually focuses
| on precisely the clustering, sharding, etc aspects.
|
| Not saying you're wrong, but it's just a different audience
| that would be interested in the actual search algorithms.
| manojlds wrote:
| No https in 2021?
| ilrwbwrkhv wrote:
| No. In fact most websites don't need HTTPS and pointless data
| transfer. Wish we could go back a few years on this zeitgeist.
| Xorlev wrote:
| This is false. Just because the page content isn't sensitive,
| that doesn't mean that TLS is worthless.
|
| TLS prevents your run of the mill MITM scenarios. Like ISPs
| inserting ads (something Comcast actually did), or public
| wifi doing the same. Or worse, more malicious scripts.
|
| You could argue that all I'm really looking for in most cases
| is message integrity (signing), but if you're going to do
| that, you might as well just encrypt it too and avoid
| accidents where sensitive information is sent over unencrypted
| channels.
| pornel wrote:
| Every visited HTTP website is a network vulnerability.
|
| It doesn't matter what is supposed to be on these sites. From a
| security perspective they contain a MITM attacker's content.
| They are effectively an API for issuing arbitrary commands to
| the browser. To shut down this attack API, all sites have to
| stop using HTTP, no exceptions.
| merliossu wrote:
| In-memory search works well as long as you don't care about
| persisting your data. For most companies that would be a big
| chunk of their strategic assets.
| bwb wrote:
| I am about to roll out search on Shepherd.com and looking at
| using Algolia. I've been impressed with Algolia on Hacker News...
|
| Is anyone else using them? What are your impressions so far?
|
| Much appreciated
| gervwyk wrote:
| We have it configured for https://docs.lowdefy.com
|
| Really happy with the service it provides and the ease of
| implementation. Note that because the docs can take a few
| seconds to load, their crawler times out and misses some
| content some of the time. With better page performance this
| should not be an issue.
|
| (We are actively working on some cool ideas to make Lowdefy
| apps super fast)
| ushakov wrote:
| I'm using MeiliSearch, which is an open source alternative
|
| worth giving a look
|
| https://github.com/meilisearch/MeiliSearch
| gervwyk wrote:
| Did not know about MeiliSearch. Looks really great! Thanks
| for sharing.
| thefounder wrote:
| It's easy to use and set up. If pricing and closed source are OK
| with you then it's worth it. We used them a few years ago and
| then switched to ES. Think of it like a pre-Docker Heroku.
| oakfr wrote:
| Out of curiosity, what made you choose ES over Algolia?
| thefounder wrote:
| As @kirubakaran said it was the price and the closed source
| license. If search becomes a very important part of your
| business you better own it rather than outsource it.
|
| Algolia is great to get started but it doesn't make sense
| at scale. If you have large indexes it's just too
| expensive.
| jabo wrote:
| What did the migration effort look like when moving from
| Algolia to ElasticSearch? Also, were you able to
| replicate the same user experience?
| kirubakaran wrote:
| From the comment, I guess "pricing and closed source"
| became not OK
| jabo wrote:
| I work on an open source alternative to Algolia called
| Typesense.
|
| Algolia is a great product but can get quite expensive at even
| moderate scale. If I had a dollar for every time I've heard
| this from Algolia users switching over...
|
| I recently put together this comparison page, comparing a few
| search engines, including Algolia, you might find interesting:
| https://typesense.org/typesense-vs-algolia-vs-elasticsearch-...
| arbitrandomuser wrote:
| I heard a joke about FTS engines, but Whoosh !
| notdang wrote:
| It's missing the most important thing: speed. We moved to
| Algolia mainly because of this. Elastic Search and Solr could
| not compete.
| jabo wrote:
| Oh yes. Speed is an important point. ElasticSearch & Solr
| use disk-first indexing (with RAM as just a cache), whereas
| Algolia and Typesense use a RAM-first approach where the
| entire index is stored in memory. This is what makes
| Algolia/Typesense return results much much faster than
| ES/Solr, and lets you build search-as-you-type experiences
| for each keystroke.
|
| I was thinking about adding a row about speed to the
| comparison matrix, but couldn't find a way to express the
| comparison clearly... Imagine a row that said:
|
| Search Speed | Super-fast | Super-fast | Slow? ...
|
| That felt a little off. So I resorted to just mentioning
| primary index location as a proxy.
|
| Open to suggestions on how to express this succinctly.
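|
| To make the RAM-first, search-as-you-type point concrete, a
| small sketch of an index held entirely in memory and queried
| on every keystroke. This is an assumption-laden illustration
| using a sorted term list and binary search, not how Algolia
| or Typesense are actually implemented.
```python
import bisect

class PrefixIndex:
    """Sorted (term, doc_id) pairs kept entirely in memory."""

    def __init__(self, titles):
        self.titles = titles
        self.entries = sorted(
            (word, doc_id)
            for doc_id, title in enumerate(titles)
            for word in title.lower().split()
        )

    def search(self, prefix, limit=5):
        prefix = prefix.lower()
        # Binary search to the first term that is >= the prefix.
        start = bisect.bisect_left(self.entries, (prefix,))
        seen = []
        for word, doc_id in self.entries[start:]:
            if not word.startswith(prefix):
                break
            if doc_id not in seen:
                seen.append(doc_id)
            if len(seen) == limit:
                break
        return [self.titles[i] for i in seen]

# Made-up titles; each loop iteration simulates one more keystroke.
index = PrefixIndex(["Search engines", "Sharding basics", "Secure search"])
for typed in ["s", "se", "sea"]:
    print(typed, index.search(typed))
```
| Every keystroke costs one binary search plus a short scan, with
| no disk access on the query path.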
| Nextgrid wrote:
| What index sizes are we talking about? If it's a few
| hundred gigs there's always the possibility of putting
| the entire ElasticSearch index into a ramdisk, or even
| just leaving lots of "free" RAM meaning the underlying OS
| will use it to speed up I/O transparently. Bare-metal
| machines with insane RAM sizes are a thing, and at
| massive scale could make sense.
|
| I've had great success at a client where simply upgrading
| a DB to an instance with enough RAM to fit 80% of the
| entire data set fixed all performance problems and
| significantly reduced I/O "pressure" at least for reads
| (writes were never a problem).
| jabo wrote:
| I haven't tried to do this myself so I can't speak to it.
|
| But one thing I would add is ElasticSearch is quite
| versatile and flexible, so I wouldn't be surprised if you
| can contort it to get it to work for a wide variety of
| use cases. This is a blessing and a curse - blessing
| because it's so flexible, curse because the flexibility
| breeds complexity and brings with it a steep learning
| curve and operational complexity.
|
| Where I think Algolia / Typesense help is that things
| work out of the box without the learning curve or
| operational overhead.
| NicoJuicy wrote:
| Why not state the main algorithm used for search speed, so
| users can look up the difference on another page?
| tommoor wrote:
| Does Typesense support searching in non-latin languages?
| jabo wrote:
| Yes it does - all languages except logographic ones
| (Chinese, Japanese and Korean) which we are actively
| working on:
| https://github.com/typesense/typesense/issues/228
| kqr wrote:
| Big up-front disclaimer: my job is making software at Loop54
| and my salary comes from happy customers of our service.
|
| One of our goals is similar to yours: browsing an online store
| should be like walking around in a physical store. The
| navigation system on the site should be as adept as a
| knowledgeable store employee in helping you find exactly what
| you're looking for.
|
| At Loop54 many of our customers come from Algolia. It's very
| popular, and nobody ever gets fired for buying Algolia. In that
| sense, it's a safe option.
|
| On the other hand, customers come to us from Algolia because
| Algolia requires a bit of hand-holding and it still doesn't
| quite seem to get what users are really looking for. When our
| prospects run randomised controlled trials, our search
| consistently seems to give users what they want better than
| Algolia does, with less effort. I can ask about specific
| numbers if you want.
|
| However, one strength of Algolia where Loop54 is currently
| behind is the surrounding tooling. For better or worse,
| with Algolia, you'll have more knobs and levers to play with
| (and you'll need them much more often!)
|
| We do have one or two customers that have a majority of books
| in their product catalogues, and we know there are some unique
| challenges that come with that domain.
|
| Loop54 is a very competent, but smaller player. If you think
| it's interesting, it's worth talking to us. I can't evaluate
| how good a fit your site would be for us, but that's why we
| have people who do that for a living!
|
| Edit: I should also say that yes, Loop54 is even more
| expensive. You shouldn't blindly trust us (or any other
| provider.) I would strongly suggest running a randomised
| controlled trial to see whether any expense at all is worth it
| in your case.
|
| I say this in part because I'm a man of science and believe in
| experiments to measure things, but also out of self-interest;
| anyone can throw out impressive marketing, but our search truly
| shines when put to the test against the alternatives.
| Redsquare wrote:
| I am sorry but https://www.loop54.com/pricing is just totally
| snied. No monetary information whatsoever. Why even blag me
| to a pricing page with less than zero pricing honesty.
| geraneum wrote:
| The comfort that they provide is a trap sometimes! Algolia suggests
| that the frontend sends queries directly to its service instead
| of going through our backend, which is good if you want to have a
| good search engine fast. But don't go for it without considering
| the consequences. It will take over part of the frontend and your
| product will depend on Algolia to the point that implementing a
| single favourite functionality for your users may need to be
| integrated with their service if you're not careful!
| cinntaile wrote:
| It's strange, I don't really like using the HN Algolia search. I
| think it's because the responsiveness doesn't fit HN and the
| results are okay but not great? What are some other big sites
| that use Algolia as their search backend? It would be interesting
| to compare.
| adamveld12 wrote:
| We use it for general search and similar items results on
| www.liveauctioneers.com
| polote wrote:
| > the results are okay but not great
|
| What results do you expect beyond keyword search ranked by
| upvotes on HN? I find it great honestly, it's fast and doesn't
| do magic.
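|
| As a sketch of how little "magic" that involves, here is
| keyword matching followed by a sort on points. The story
| records are made-up sample data and the function is
| hypothetical, not the HN or Algolia implementation.
```python
def search(stories, query):
    """Return stories whose title contains every query word, best-voted first."""
    terms = query.lower().split()
    matches = [
        story for story in stories
        if all(term in story["title"].lower().split() for term in terms)
    ]
    return sorted(matches, key=lambda story: story["points"], reverse=True)

# Made-up sample data for illustration only.
stories = [
    {"title": "Evolution of search engines architecture", "points": 154},
    {"title": "Building a search engine in Rust", "points": 320},
    {"title": "Show HN: my weekend project", "points": 45},
]
top = search(stories, "search")
print([story["title"] for story in top])
```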
___________________________________________________________________
(page generated 2021-08-08 23:00 UTC)