[HN Gopher] Indexing a billion pages
       ___________________________________________________________________
        
       Indexing a billion pages
        
       Author : daoudc
       Score  : 73 points
       Date   : 2023-12-23 13:51 UTC (9 hours ago)
        
 (HTM) web link (blog.mwmbl.org)
 (TXT) w3m dump (blog.mwmbl.org)
        
       | hcfman wrote:
       | Quite curious. What indexing and retrieval software is this
       | using? I couldn't find reference to it.
       | 
       | Does it index phrases?
        
         | bdcravens wrote:
         | "... who crawl the web using the Firefox extension and command
         | line script"
         | 
         | https://addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-cra...
         | 
         | https://github.com/mwmbl/crawler-script
        
       | xnx wrote:
       | How does the homepage of https://mwmbl.org/ not have a single
       | sentence explaining what it is or even an "About" link?
       | 
       | From Github: "Mwmbl is a non-profit, ad-free, free-libre and
       | free-lunch search engine with a focus on useability and speed."
        
         | Kiro wrote:
         | Not everything is a product that needs to be sold.
        
           | xnx wrote:
           | Totally agree. I was just trying to figure out what it is.
           | Even something as small as a subheading like Wikipedia ("The
           | Free Encyclopedia") does would be very helpful.
        
           | hawski wrote:
           | Everything is something. It is helpful to know what this
           | particular something is - regardless of whether it is sold or
           | not.
        
         | CharlesW wrote:
         | Here's its mission:
         | https://blog.mwmbl.org/articles/non-profit-search-engine/
        
         | renegat0x0 wrote:
         | I suggest also providing og title and og image fields for
         | social media.
        
       | mdaniel wrote:
       | > The biggest expense was purchasing a PyCharm professional
       | license at $116.58
       | 
       | I mean, awesome that they value good tooling to spend on it but
       | https://www.jetbrains.com/community/opensource/ almost certainly
       | means they qualify for a complimentary license
        
       | mdaniel wrote:
       | I thought I recalled seeing this before because of its Welsh
       | name. As is often the case, some submissions are from their
       | domain and some are from the GitHub repo; the ones with over 100
       | comments are:
       | 
       | https://news.ycombinator.com/item?id=37561155
       | 
       | https://news.ycombinator.com/item?id=29690877
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _We are entering a new era of web search_ -
         | https://news.ycombinator.com/item?id=38465864 - Nov 2023 (2
         | comments)
         | 
         |  _Mwmbl: Free, open-source and non-profit search engine_ -
         | https://news.ycombinator.com/item?id=37561155 - Sept 2023 (122
         | comments)
         | 
         |  _The Book of Mwmbl: a free, non-profit search engine_ -
         | https://news.ycombinator.com/item?id=33828087 - Dec 2022 (8
         | comments)
         | 
         |  _Show HN: An open source web crawler for the Mwmbl non-profit
         | search engine_ - https://news.ycombinator.com/item?id=31765015
         | - June 2022 (4 comments)
         | 
         |  _Show HN: I'm building a non-profit search engine_ -
         | https://news.ycombinator.com/item?id=29690877 - Dec 2021 (199
         | comments)
        
       | marginalia_nu wrote:
       | I'll race you there ;-)
        
       | bdcravens wrote:
       | Most impressive part:
       | 
       | > Our estimated annual budget is $752.36 and we have spent
       | $174.49.
        
       | jetrink wrote:
       | > We've indexed over 100 million pages
       | 
       | > [W]e're crawling up to a million pages a day, as you can see on
       | our stats page.
       | 
       | > Given that Mwmbl is still relatively unknown, it seems
       | plausible that we can reach our target of crawling three billion
       | pages a day, to refresh the entire index in one month.
       | 
       | I think this is supposed to read "it seems plausible that we can
       | reach our target of crawling three _million_ pages a day."
        
       | jmclnx wrote:
       | Very interesting and was quick for me. Nice work!
        
       ___________________________________________________________________
       (page generated 2023-12-23 23:01 UTC)