[HN Gopher] Phrase matching in Marginalia Search
       ___________________________________________________________________
        
       Phrase matching in Marginalia Search
        
       Author : marginalia_nu
       Score  : 127 points
       Date   : 2024-09-30 11:42 UTC (11 hours ago)
        
 (HTM) web link (www.marginalia.nu)
 (TXT) w3m dump (www.marginalia.nu)
        
       | ColinHayhurst wrote:
       | Congrats Viktor.
       | 
       | > The feedback cycle in web search engine development is very
       | long....Overall the approach taken to improving search result
       | quality is looking at a query that does not give good results,
       | asking what needs to change for that to improve, and then making
       | that change. Sometimes it's a small improvement, sometimes it's a
       | huge game changer.
       | 
        | Yes, this resonates with our experience.
        
         | outime wrote:
         | This comment felt like indirect spam, as it doesn't really
         | contribute anything IMHO. The phrase "in our experience"
         | implies they're in the same business, and upon checking the
         | profile, I found that aside from the bio linking to their own
         | thing, most comments resemble (in)direct spam. Everyone has
         | their strategies, but I really disliked seeing this.
        
           | marginalia_nu wrote:
           | Eh, I think it's always interesting to compare notes with
           | other search projects.
           | 
           | It's a small niche, and I think we're all rooting for
            | each other.
        
       | gary_0 wrote:
       | > To make the most of phrase matching, stop words need to go.
       | 
       | Perhaps I am misunderstanding; does this mean occurrences of stop
       | words like "the" are stored now instead of ignored? That seems
       | like it would add a lot of bloat. Are there any optimizations in
       | place?
       | 
       | Just a shot-in-the-dark suggestion, but if you are storing some
       | bits with each keyword occurrence, can you add a few more bits to
       | store whether the term is adjacent to a common stop word? So
       | maybe if you have to=0 or=1, "to be or not to be" would be able
       | to match the data `0be 1not 0be`, where only "be" and "not" are
       | actual keywords. But the extra metadata bits can be ignored, so
       | pages containing "The Clash" will match both the literal query
       | (via the "the" bit), and just "clash" (without the "the" bit).
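
        A toy sketch of the encoding suggested above, under assumed names
        and layout (illustrative Python, not Marginalia's actual index
        format): index only non-stop-word keywords, but tag each
        occurrence with the stop word, if any, that directly preceded it.
        A literal query can honour the tag; a loose one can ignore it.

          # Hypothetical flag-tagged keyword index, as described in the
          # comment above. Not Marginalia's code.
          STOP_WORDS = {"the", "a", "to", "or", "not", "of"}

          def index_document(doc_id, text, index):
              """index: keyword -> doc_id -> [(keyword_pos, preceding_stop)]."""
              pos, pending_stop = 0, None
              for tok in text.lower().split():
                  if tok in STOP_WORDS:
                      pending_stop = tok        # the "extra metadata bits"
                      continue
                  index.setdefault(tok, {}).setdefault(doc_id, []).append(
                      (pos, pending_stop))
                  pos, pending_stop = pos + 1, None

          def lookup(index, keyword, preceding_stop=None, literal=False):
              """Doc ids with the keyword; if literal, the tag must match too."""
              return {doc_id
                      for doc_id, postings in index.get(keyword, {}).items()
                      if not literal
                      or any(stop == preceding_stop for _, stop in postings)}

          index = {}
          index_document(1, "the clash were an english rock band", index)
          index_document(2, "a clash of civilizations", index)
          print(lookup(index, "clash", preceding_stop="the", literal=True))  # {1}
          print(lookup(index, "clash"))                                      # {1, 2}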
        
         | marginalia_nu wrote:
          | It's not as bad as you might think; we're talking dozens of
          | GB across the entire index.
         | 
          | I don't think stop words make sense as an optimization once
          | you go beyond BM25. The search engine behaves worse, and
          | adding a bunch of optimizations makes an already incredibly
          | complex piece of software more so.
         | 
         | So overall I don't think the juice is worth the squeeze.
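
          For context, the alternative being described, keeping every
          word (stop words included) in a positional index and verifying
          phrases by position, looks roughly like this toy Python sketch
          (the general technique, not Marginalia's implementation):

            from collections import defaultdict

            def build_index(docs):
                """docs: doc_id -> text. Returns term -> doc_id -> [positions]."""
                index = defaultdict(lambda: defaultdict(list))
                for doc_id, text in docs.items():
                    for pos, term in enumerate(text.lower().split()):
                        index[term][doc_id].append(pos)
                return index

            def phrase_docs(index, phrase):
                """Docs where the terms occur consecutively, stop words kept."""
                terms = phrase.lower().split()
                docs = set(index[terms[0]])
                for term in terms[1:]:
                    docs &= set(index[term])
                hits = set()
                for doc_id in docs:
                    starts = set(index[terms[0]][doc_id])
                    for offset, term in enumerate(terms[1:], start=1):
                        starts &= {p - offset for p in index[term][doc_id]}
                    if starts:
                        hits.add(doc_id)
                return hits

            idx = build_index({1: "to be or not to be", 2: "be not afraid"})
            print(phrase_docs(idx, "to be or not to be"))  # {1}
            print(phrase_docs(idx, "be not"))              # {2}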
        
         | ValleZ wrote:
          | Removing stop words is usually bad advice which is beneficial
         | only in a limited set of circumstances. Google keeps all the
         | "the": https://www.google.com/search?q=the
        
         | heikkilevanto wrote:
         | One of the problems with stop words is that they vary between
         | languages. "The" is a good candidate in English, but in Danish
         | it just means "tea", which should be a valid search term. And
          | even in English, what looks like a serious stop word can be an
         | integral part of the phrase. "How to use The in English".
        
       | pmdulaney wrote:
        | Amazing! "bicycle touring in France" as a search target produces
        | a huge number of spot-on, beautifully formatted results.
        
       | mdaniel wrote:
       | Based solely upon the title and the first commit's date, I'm
       | guessing it's this:
       | https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99
        
         | marginalia_nu wrote:
         | Correct!
        
       | hosteur wrote:
       | Always nice to see updates on marginalia.
        
       | arromatic wrote:
        | 1. Is the index public? 2. Any chance of an RSS feed search?
        
         | marginalia_nu wrote:
         | 1. I'm not sure what you mean. The code is open source[3], but
         | the data is, for logistical reasons, not available. Common
         | Crawl is far more comprehensive though.
         | 
         | 2. I've got such plans in the pipe. Not sure when I'll have
         | time to implement it, as I'm in the middle of moving in with my
         | girlfriend this month. Soon-ish.
         | 
         | [3] at https://git.marginalia.nu/ , though still some rough
         | edges to sand down before it's easy to self-host (as easy as
          | hosting a full-blown internet search engine gets).
        
           | arromatic wrote:
            | Thanks. What you answered at 1 is what I meant. I was
            | looking for a small-web dataset, but CC is too big for me to
            | process.
           | 
            | 1. Do you know of any dataset of RSS feeds that isn't
            | hundreds of GBs?
           | 
            | 2. How does your crawler handle malicious sites when
            | crawling?
        
             | marginalia_nu wrote:
             | 1. Here are all RSS feeds known to the search engine as of
             | some point in 2023:
             | https://downloads.marginalia.nu/exports/feeds.csv -- it's
              | quite noisy though; a fair number of them are anything but
              | small web. You should be able to fetch them all in a few
              | hours I'd reckon, and have a sample dataset to play with
              | (see the sketch after this comment).
             | There's also more data at
             | https://downloads.marginalia.nu/exports/ , e.g. a domain
             | level link graph, if you want to experiment more in this
             | space.
             | 
             | 2. It's a constant whac-a-mole to reverse-engineer and
             | prevent search engine spam. Luckily I kinda like the game.
             | It's also helpful that it's a search engine so it's quite
             | possible to use the search engine itself to find the
             | malicious results, by searching for the sorts of topics
              | where they tend to crop up, e.g. e-pharma, prostitution,
             | etc.
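
              A rough sketch of pulling a small sample from the feeds.csv
              export mentioned above (Python; the column layout of the
              file is a guess, it is only assumed that feed URLs appear
              somewhere in each row):

                import pathlib, urllib.request

                EXPORT_URL = "https://downloads.marginalia.nu/exports/feeds.csv"
                OUT_DIR = pathlib.Path("feed_sample")
                OUT_DIR.mkdir(exist_ok=True)

                # Grab the export and keep anything that looks like a URL,
                # whichever column it happens to sit in.
                with urllib.request.urlopen(EXPORT_URL) as resp:
                    text = resp.read().decode("utf-8", errors="replace")
                feed_urls = [cell.strip() for line in text.splitlines()
                             for cell in line.split(",")
                             if cell.strip().startswith("http")]

                # Fetch a small sample; dead hosts and junk entries are expected.
                for i, url in enumerate(feed_urls[:50]):
                    try:
                        with urllib.request.urlopen(url, timeout=10) as feed:
                            out = OUT_DIR / f"feed_{i:04d}.xml"
                            out.write_bytes(feed.read())
                    except Exception as exc:
                        print(f"skipping {url}: {exc}")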
        
               | arromatic wrote:
                | On 2, I meant malware that could affect your crawling
                | server, not spam. And thanks for the data.
        
               | marginalia_nu wrote:
               | Malware authors typically focus on more common targets,
               | like web browsers. I'm quite possibly the only person
               | doing crawling with the stack I'm on, which means it's
               | not a very appealing target. It also helps that the
               | crawler is written in Java, which is a relatively robust
               | language.
        
               | arromatic wrote:
                | Apologies for too many questions, but resources on
                | search engines are scarce. How do I visualize or process
                | the link graphs? Is there any tool, preferably FOSS?
                | Majestic seems to have one, but it's their own.
        
       | senkora wrote:
       | > turned up nothing but vietnamese computer scientists, and
       | nothing about the _famous blog post_ "ORM is the vietnam of
       | computer science". [emphasis added]
       | 
       | This points in the direction of the kinds of queries that I tend
       | to use with Marginalia. I've found it to be very helpful in
       | finding well-written blog posts about a variety of subjects, not
       | just technical. I tend to use Marginalia when I am in the mood to
       | find and read such articles.
       | 
       | This is also largely the same reason that I read HN. My current
       | approach is to 1) read HN on a regular schedule, 2) search
       | Marginalia if there is a specific topic that I want, and then 3)
       | add interesting blogs from either to my RSS reader app.
        
       ___________________________________________________________________
       (page generated 2024-09-30 23:00 UTC)