[HN Gopher] Building a full-text search engine in 150 lines of P...
___________________________________________________________________
Building a full-text search engine in 150 lines of Python code
Author : bartdegoede
Score : 206 points
Date : 2021-03-25 16:17 UTC (6 hours ago)
(HTM) web link (bart.degoe.de)
(TXT) w3m dump (bart.degoe.de)
| KAdot wrote:
| The article looks suspiciously similar to
| https://artem.krylysov.com/blog/2020/07/28/lets-build-a-full....
| Very similar examples, code and structure.
| jeroenhd wrote:
| Does it? Both are implementations and explanations of a well-
| known algorithm. Most articles on quicksort will also look
| alike, but there's no reason to assume the author has
| plagiarized anything.
| diogenesjunior wrote:
| Except that one is written in go and the other in python...
| johnwheeler wrote:
| Yes, but where's the attribution?
| pmiller2 wrote:
| How many different ways do you think there are to write
| this code?
| pmiller2 wrote:
| So? Wikipedia is one of the most convenient, large English
| corpora available, and I doubt there are many significantly
| different ways to write the bit of functionality that's built
| up here. I'm not sure if that's what you're meaning to suggest,
| or that there was some kind of plagiarism / inspiration going
| on here.
| argaba wrote:
| Everything is a paraphrase of the original source - whatever
| and/or whenever that maybe from.
| mlthoughts2018 wrote:
| I worked previously for a very high traffic ecommerce company
| (Alexa top 300 site).
|
| As part of the search team, I worked on a project where we
| deliberately rewrote the whole product search engine in Python
| and Cython, including our own algorithms manipulating documents
| for deletion, low latency reindexing after edits, and more.
|
| We did this because SOLR was too slow and the process of defining
| custom sort orders (for example, new sort orders defined by
| machine learning ranking algorithms, and needing to be A/B tested
| regularly) was awful and performance in SOLR was poor.
|
| It was a really fun project. One of the slogans for our group at
| the time was "rewriting search in Python _for speed_."
|
| The ultimate system we deployed was insanely fast. I doubt you
| could have made it faster even writing the entire thing directly
| in C or C++. It became a reference project in the company to help
| avoid various flavors of arrogant dismissal of Python as a
| backend service language for performance critical systems.
| NicoJuicy wrote:
| Could i send you a mail somewhere? I'd like to have a casual
| conversation about e-commerce and some of your insights.
| inertiatic wrote:
| Defining custom sort orders in Solr is as simple as uploading a
| text file with the values you intend to use for ranking. This
| is a great feature that is in fact missing from Elasticsearch
| and saves you so much reindexing time.
|
| There certainly are usecases where Lucene based solutions
| aren't the best fit. But I think the claim that you couldn't
| make something faster by moving away from Python is outlandish.
| IncRnd wrote:
| > There certainly are usecases where Lucene based solutions
| aren't the best fit. But I think the claim that you couldn't
| make something faster by moving away from Python is
| outlandish.
|
| I read that as a statement that they implemented a proper and
| bespoke algorithm, not that the speed of Python is greater
| than C. I am surprised that you read it that way. Who in
| their right mind would say Python speed is faster than C
| speed?
| inertiatic wrote:
| You read
|
| >I doubt you could have made it faster even writing the
| entire thing directly in C or C++.
|
| as
|
| > a statement that they implemented a proper and bespoke
| algorithm, not that the speed of Python is greater than C.
|
| ?
| happyweasel wrote:
| As Joel spolsky said: "Just do me a favor and search the damned
| hard drive, quickly, for the string I typed, using full-text
| indexes and other technologies that were boring in 1973."
| arafalov wrote:
| Nice introduction. A good ramp-up from basic splitting to
| ranking.
|
| It does need to be said that when Lucene had a set of features
| this small, it was also pretty tiny. And, if those are the needs,
| one could still download it and it will probably run on modern
| JVM:
|
| https://archive.apache.org/dist/lucene/java/
|
| lucene-1.4.3.jar 2004-11-29 14:13 316K
|
| It is a bit bigger of course, but that's because it already had
| stemmers, multilingual support, multiple Query Parsers including
| phrase, RAM and Disk directory implementations and a bunch of
| other advanced features one can easily see by unzipping the jar
| file.
| machiaweliczny wrote:
| Cool article.
|
| I recently build program in Go that takes wikipedia article and
| gets all dependencies then using tfidf*count ranks concepts in
| order of "importance". Seems quite good for math articles to get
| list of more basic concepts to understand first.
| throwaway320912 wrote:
| Fun times. I once applied pagerank onto a set of 8000 math
| articles and ran the result as a web app in 2009/10.
|
| http://web.archive.org/web/20091230103939/http://myyn.org/m
|
| As a gimmick, I created 36 groups, with group 1 containing the
| most important concepts:
|
| http://web.archive.org/web/20100109055506/http://myyn.org/m/...
|
| Spoiler, the top ten concepts were:
|
| Function * Set * Number * Integer * Real Number * Point *
| Property * Finite * Ring * Relation Theory
|
| BTW: Glad the archive has a copy, because I do not.
| habibur wrote:
| Excellent read. Was searching for a full text search engine but
| not finding any suitable one. Plan to implement one just this
| way.
| bartdegoede wrote:
| Author here: you may want to check out something like Whoosh
| https://whoosh.readthedocs.io/en/latest/intro.html (it's like a
| clone of Lucene but in pure Python). I've used this to build
| some basic search for a small Python website and it was more
| than fast enough for my purposes :-)
| rakoo wrote:
| SQLite has a pretty good built-in fts engine:
| https://www.sqlite.org/fts5.html
| habibur wrote:
| Problem is FTS5 isn't included in the most default
| installation through package managers [I use Fedora]. And
| recompiling from source breaks a lot of things, as sqlite
| libraries are generally linked with all apps that use it.
| rakoo wrote:
| I admit I only used sqlite through the go driver
| (https://github.com/mattn/go-sqlite3) where using fts5
| amounts to one flag during the compile phase.
| ignoramous wrote:
| Reminds of this David Crawshaw (CTO at Tailscale) presentation on
| full-text search with SQLite which probably requires 10 lines or
| less: https://www.youtube-nocookie.com/embed/RqubKSF3wig
| edoceo wrote:
| Yea! FTS5 FTW! Amazing for what it costs.
| asdfasgasdgasdg wrote:
| It's not relevant or an indicator of power that this example
| requires ten lines of SQL. This presentation just uses SQLite's
| builtin full-text search system [1]. Of course it's going to
| require less code to call a library than to implement it.
|
| [1]: https://sqlite.org/fts5.html
| ignoramous wrote:
| Relevant because "building a full-text search engine" is
| solved using SQLite just as much as it can be solved by
| writing Python on top of lxml and py-stemmer.
|
| SQLite is no heavyweight dependency ala Apache Lucene. It
| also helps that it is bundled in billions of Android and iOS
| devices.
| denysvitali wrote:
| Does anybody work for Atlassian here? If so, can you please share
| this with the Confluence team? Thanks...
| boyter wrote:
| Do people really find the search in confluence that bad? Would
| they actually be willing to pay to have it fixed?
| denysvitali wrote:
| I wouldn't pay to fix it, Confluence isn't free. There are a
| lot of Open Source alternatives that provide such features
| out of the box. It's the only feature that shouldn't be
| broken.
|
| I really hate Confluence nowadays (and I really liked in the
| past, when I was using it randomly): - Counter productive
| syntax that is not even equal to the one of Jira
|
| - Search functionality that cannot find anything (I spend
| HOURS trying to find my content, let alone somebody else's).
| Some days I just wish I had the DB at hand / a folder with
| the whole pages exported to grep it myself.
|
| - Proprietary file format - good luck migrating away from it
|
| I'm really sad to say these things, I like Atlassian as a
| company but the more I use Atlassian products, the more I
| realize they're bloated and have _broken_ features that were
| implemented quickly just to tick some more boxes on their
| features sheet.
| tyingq wrote:
| If you want to feel extra sad about it, see this:
| https://jira.atlassian.com/browse/CONFSERVER-13499
| nova22033 wrote:
| _If a wiki page has the word adrecordset, a search on
| recordset should include the page. Currently only a search on
| adrecordset or a search with wildcards returns the page._
|
| Wait..this seems unreasonable?
|
| Isn't this how it works with elasticsearch or solr? or even
| google for that matter..a search for green won't return
| evergreen..
| tyingq wrote:
| They don't support leading wildcards. So foo* will find
| foobar but *bar won't find foobar. Elastic does fine with
| that as of 7.9.
| denysvitali wrote:
| Well, in my experience they cannot even find URLs, IP
| addresses, or wildcarded words like foo*
| bigtones wrote:
| And the Jira team
| 12ian34 wrote:
| JQL really isn't that bad is it?
| fteem wrote:
| Not bad - horrible.
| runningmike wrote:
| Using nlp techniques to create tokens is for a fast and simple
| python search often not needed. It makes things slow and python
| standard string function are often good enough. Issues arise when
| searching for combinations of words like 'machine learning' in a
| sentence. Nice read, but the GitHub repro needs a license to be
| more useful.
| bartdegoede wrote:
| Added an MIT license. Have fun :-)
| creamytaco wrote:
| Why would one choose Python for this instead of Go or Rust?
|
| I imagine Python performance would be terrible.
| Thaxll wrote:
| The article is good and so can be applied to any language, also
| Python is not that slow.
| roelschroeven wrote:
| As a lot of the work is done by code library code, there's a
| very good chance that the code is not all that slow.
| theplague42 wrote:
| Because it's easier for the average developer to follow Python
| code? This is a tutorial/demo, not a library.
| bartdegoede wrote:
| The idea was to illustrate the concepts of an inverted index
| and tf-idf :-) I've built some stuff like this for a smaller
| project written in Python because it was Easier(tm) than
| spinning up an Elasticsearch cluster (ie it definitely didn't
| need to scale :-D)
| sodapopcan wrote:
| Yes, I've been really interested in FTS recently and love
| articles like this. I'm currently implementing a paired-down
| version myself in Elixir because I'm searching one "table"
| (not postgres) and I don't want to bring in external
| dependencies for it.
| neolog wrote:
| If the dataset is small, speed doesn't matter. Also, there's
| PyPy if you need speed.
| superyesh wrote:
| Neat! One production ready python library I love to use in this
| space is https://radimrehurek.com/gensim/ It is quite mature and
| can handle a good amount of data well! It offers topic modeling
| too and can b helpful to find similar documents also.
| andrewmatte wrote:
| This doesn't account for synonyms. I'd rather use a document
| embedding search.
___________________________________________________________________
(page generated 2021-03-25 23:00 UTC)