[HN Gopher] Phrase matching in Marginalia Search
___________________________________________________________________
Phrase matching in Marginalia Search
Author : marginalia_nu
Score : 127 points
Date : 2024-09-30 11:42 UTC (11 hours ago)
(HTM) web link (www.marginalia.nu)
(TXT) w3m dump (www.marginalia.nu)
| ColinHayhurst wrote:
| Congrats Viktor.
|
| > The feedback cycle in web search engine development is very
| long....Overall the approach taken to improving search result
| quality is looking at a query that does not give good results,
| asking what needs to change for that to improve, and then making
| that change. Sometimes it's a small improvement, sometimes it's a
| huge game changer.
|
| Yes, this resonates with our experience.
| outime wrote:
| This comment felt like indirect spam, as it doesn't really
| contribute anything IMHO. The phrase "in our experience"
| implies they're in the same business, and upon checking the
| profile, I found that aside from the bio linking to their own
| thing, most comments resemble (in)direct spam. Everyone has
| their strategies, but I really disliked seeing this.
| marginalia_nu wrote:
| Eh, I think it's always interesting to compare notes with
| other search projects.
|
| It's a small niche, and I think we're all rooting for
| each other.
| gary_0 wrote:
| > To make the most of phrase matching, stop words need to go.
|
| Perhaps I am misunderstanding; does this mean occurrences of stop
| words like "the" are stored now instead of ignored? That seems
| like it would add a lot of bloat. Are there any optimizations in
| place?
|
| Just a shot-in-the-dark suggestion, but if you are storing some
| bits with each keyword occurrence, can you add a few more bits to
| store whether the term is adjacent to a common stop word? So
| maybe if you assign to=0 and or=1, "to be or not to be" could be
| indexed as `0be 1not 0be`, where only "be" and "not" are actual
| keywords. But the extra metadata bits can be ignored, so
| pages containing "The Clash" will match both the literal query
| (via the "the" bit), and just "clash" (without the "the" bit).
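|
| A rough sketch of that idea (a hypothetical encoding, not
| Marginalia's actual index format): pack a couple of extra bits
| per keyword occurrence that record which stop word, if any,
| precedes it.
|
|   // Hypothetical sketch of the adjacency-bit idea above; this is
|   // not Marginalia's actual on-disk format.
|   public class StopWordBits {
|       static final int STOP_TO = 0;  // example codes, as in the comment
|       static final int STOP_OR = 1;
|       static final int NO_STOP = 3;  // no adjacent stop word
|
|       // Pack a document position (upper bits) with a 2-bit stop code.
|       static int encode(int position, int stopCode) {
|           return (position << 2) | (stopCode & 0b11);
|       }
|
|       static int position(int packed) { return packed >>> 2; }
|       static int stopCode(int packed) { return packed & 0b11; }
|
|       public static void main(String[] args) {
|           // "to be or not to be": keywords be(1), not(3), be(5),
|           // each tagged with the stop word that precedes it.
|           int[] postings = {
|               encode(1, STOP_TO),  // "to be"
|               encode(3, STOP_OR),  // "or not"
|               encode(5, STOP_TO),  // "to be"
|           };
|           for (int p : postings) {
|               System.out.printf("pos=%d stop=%d%n", position(p), stopCode(p));
|           }
|           // A query for just "be" ignores the stop bits; a phrase
|           // query like "to be" can additionally require STOP_TO.
|       }
|   }
|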
| marginalia_nu wrote:
| It's not as bad as you might think; we're talking dozens of GB
| across the entire index.
|
| I don't think stopwords as an optimization makes sense when you
| go beyond BM25. The search engine behaves worse and adding a
| bunch of optimizations makes an already incredibly complex piece
| of software more so.
|
| So overall I don't think the juice is worth the squeeze.
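|
| A back-of-the-envelope illustration of the BM25 point above (not
| Marginalia's code): under plain BM25, a stop word's IDF is close
| to zero, so dropping it barely changes scores; once you do
| phrase matching, its positions carry signal and it has to be
| kept.
|
|   public class Bm25Idf {
|       // Standard BM25 IDF: ln((N - n + 0.5) / (n + 0.5) + 1)
|       static double idf(long totalDocs, long docsWithTerm) {
|           return Math.log((totalDocs - docsWithTerm + 0.5)
|                   / (docsWithTerm + 0.5) + 1);
|       }
|
|       public static void main(String[] args) {
|           long n = 100_000_000L; // hypothetical corpus size
|           // "the" appears nearly everywhere: IDF ~ 0.05
|           System.out.println(idf(n, 95_000_000L));
|           // "clash" is comparatively rare: IDF ~ 6.2
|           System.out.println(idf(n, 200_000L));
|       }
|   }
|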
| ValleZ wrote:
| Removing stop words is usually bad advice; it's beneficial
| only in a limited set of circumstances. Google keeps all the
| "the": https://www.google.com/search?q=the
| heikkilevanto wrote:
| One of the problems with stop words is that they vary between
| languages. "The" is a good candidate in English, but in Danish
| it just means "tea", which should be a valid search term. And
| even in English, what looks like a serious stop word can be an
| integral part of a phrase: "How to use The in English".
| pmdulaney wrote:
| Amazing! "bicycle touring in France" as a search target produces
| a huge number of spot-on returns beautifully formatted.
| mdaniel wrote:
| Based solely upon the title and the first commit's date, I'm
| guessing it's this:
| https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99
| marginalia_nu wrote:
| Correct!
| hosteur wrote:
| Always nice to see updates on marginalia.
| arromatic wrote:
| 1. Is the index public ? 2. Any chance for a rss feed search ?
| marginalia_nu wrote:
| 1. I'm not sure what you mean. The code is open source[3], but
| the data is, for logistical reasons, not available. Common
| Crawl is far more comprehensive though.
|
| 2. I've got such plans in the pipe. Not sure when I'll have
| time to implement it, as I'm in the middle of moving in with my
| girlfriend this month. Soon-ish.
|
| [3] at https://git.marginalia.nu/ , though still some rough
| edges to sand down before it's easy to self-host (as easy as
| hosting a full blown internet search engine gets).
| arromatic wrote:
| Thanks. What you answered at 1. is what I meant. I was
| looking for a small-web dataset, but CC is too big for me to
| process.
|
| 1. Do you know of any dataset of RSS feeds that isn't
| hundreds of GB?
|
| 2. How does your crawler handle malicious sites when
| crawling?
| marginalia_nu wrote:
| 1. Here are all RSS feeds known to the search engine as of
| some point in 2023:
| https://downloads.marginalia.nu/exports/feeds.csv -- it's
| quite noisy though, a fair number of them are anything but
| small web. You should be able to fetch them all in a few
| hours I'd reckon, and have a sample dataset to play with (a
| rough fetch sketch follows after this comment).
| There's also more data at
| https://downloads.marginalia.nu/exports/ , e.g. a domain
| level link graph, if you want to experiment more in this
| space.
|
| 2. It's a constant whack-a-mole to reverse-engineer and
| prevent search engine spam. Luckily I kinda like the game.
| It's also helpful that it's a search engine so it's quite
| possible to use the search engine itself to find the
| malicious results, by searching for the sorts of topics
| where they tend to crop up, e.g. e-pharma, prostitution,
| etc.
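|
| For point 1 above, a minimal fetch sketch (assuming the
| first column of feeds.csv is the feed URL; the real layout
| may differ), just to get a small-web sample to play with:
|
|   import java.net.URI;
|   import java.net.http.HttpClient;
|   import java.net.http.HttpRequest;
|   import java.net.http.HttpResponse;
|   import java.nio.file.Files;
|   import java.nio.file.Path;
|   import java.time.Duration;
|
|   public class FetchFeeds {
|       public static void main(String[] args) throws Exception {
|           HttpClient client = HttpClient.newBuilder()
|                   .connectTimeout(Duration.ofSeconds(10))
|                   .followRedirects(HttpClient.Redirect.NORMAL)
|                   .build();
|           Path outDir = Files.createDirectories(Path.of("feeds"));
|           int i = 0;
|           for (String line : Files.readAllLines(Path.of("feeds.csv"))) {
|               // Assumption: first CSV column holds the feed URL.
|               String url = line.split(",")[0].trim();
|               if (!url.startsWith("http")) continue;
|               try {
|                   HttpRequest req = HttpRequest.newBuilder(URI.create(url))
|                           .timeout(Duration.ofSeconds(20))
|                           .header("User-Agent", "small-web-sample/0.1")
|                           .GET().build();
|                   HttpResponse<byte[]> resp = client.send(req,
|                           HttpResponse.BodyHandlers.ofByteArray());
|                   if (resp.statusCode() == 200) {
|                       Files.write(outDir.resolve("feed-" + (i++) + ".xml"),
|                               resp.body());
|                   }
|               } catch (Exception e) {
|                   System.err.println("skip " + url + ": " + e.getMessage());
|               }
|               Thread.sleep(500); // be polite to small-web servers
|           }
|       }
|   }
|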
| arromatic wrote:
| On 2. I meant malware that could affect your crawling
| server, not spam. And thanks for the data.
| marginalia_nu wrote:
| Malware authors typically focus on more common targets,
| like web browsers. I'm quite possibly the only person
| doing crawling with the stack I'm on, which means it's
| not a very appealing target. It also helps that the
| crawler is written in Java, which is a relatively robust
| language.
| arromatic wrote:
| Apologies for too many questions, but resources on search
| engines are scarce. How do I visualize or process the link
| graphs? Is there any tool, preferably FOSS? Majestic seems
| to have one, but it's their own.
| senkora wrote:
| > turned up nothing but vietnamese computer scientists, and
| nothing about the _famous blog post_ "ORM is the vietnam of
| computer science". [emphasis added]
|
| This points in the direction of the kinds of queries that I tend
| to use with Marginalia. I've found it to be very helpful in
| finding well-written blog posts about a variety of subjects, not
| just technical. I tend to use Marginalia when I am in the mood to
| find and read such articles.
|
| This is also largely the same reason that I read HN. My current
| approach is to 1) read HN on a regular schedule, 2) search
| Marginalia if there is a specific topic that I want, and then 3)
| add interesting blogs from either to my RSS reader app.
___________________________________________________________________
(page generated 2024-09-30 23:00 UTC)