[HN Gopher] Indexing a billion pages
___________________________________________________________________
Indexing a billion pages
Author : daoudc
Score : 73 points
Date : 2023-12-23 13:51 UTC (9 hours ago)
(HTM) web link (blog.mwmbl.org)
(TXT) w3m dump (blog.mwmbl.org)
| hcfman wrote:
| Quite curious. What indexing and retrieval software is this
| using? I couldn't find reference to it.
|
| Does it index phrases?
| bdcravens wrote:
| "... who crawl the web using the Firefox extension and command
| line script"
|
| https://addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-cra...
|
| https://github.com/mwmbl/crawler-script
| xnx wrote:
| How does the homepage of https://mwmbl.org/ not have a single
| sentence explaining what it is or even an "About" link?
|
| From Github: "Mwmbl is a non-profit, ad-free, free-libre and
| free-lunch search engine with a focus on useability and speed."
| Kiro wrote:
| Not everything is a product that needs to be sold.
| xnx wrote:
| Totally agree. I was just trying to figure out what it is.
| Even something as small as a subheading like Wikipedia ("The
| Free Encyclopedia") does would be very helpful.
| hawski wrote:
| Everything is something. It is helpful to know what this
| particular something is - regardless of whether it is sold or not.
| CharlesW wrote:
| Here's its mission:
| https://blog.mwmbl.org/articles/non-profit-search-engine/
| renegat0x0 wrote:
| I suggest also providing og title and og image fields for
| social media.
| mdaniel wrote:
| > The biggest expense was purchasing a PyCharm professional
| license at $116.58
|
| I mean, awesome that they value good tooling enough to spend on
| it, but https://www.jetbrains.com/community/opensource/ almost
| certainly means they qualify for a complimentary license
| mdaniel wrote:
| I recalled seeing this before because of its Welsh name. As is
| often the case, some earlier submissions are from their domain and
| some are from the GitHub repo; the ones with over 100 comments are
|
| https://news.ycombinator.com/item?id=37561155
|
| https://news.ycombinator.com/item?id=29690877
| dang wrote:
| Thanks! Macroexpanded:
|
| _We are entering a new era of web search_ -
| https://news.ycombinator.com/item?id=38465864 - Nov 2023 (2
| comments)
|
| _Mwmbl: Free, open-source and non-profit search engine_ -
| https://news.ycombinator.com/item?id=37561155 - Sept 2023 (122
| comments)
|
| _The Book of Mwmbl: a free, non-profit search engine_ -
| https://news.ycombinator.com/item?id=33828087 - Dec 2022 (8
| comments)
|
| _Show HN: An open source web crawler for the Mwmbl non-profit
| search engine_ - https://news.ycombinator.com/item?id=31765015
| - June 2022 (4 comments)
|
| _Show HN: I'm building a non-profit search engine_ -
| https://news.ycombinator.com/item?id=29690877 - Dec 2021 (199
| comments)
| marginalia_nu wrote:
| I'll race you there ;-)
| bdcravens wrote:
| Most impressive part:
|
| > Our estimated annual budget is $752.36 and we have spent
| $174.49.
| jetrink wrote:
| > We've indexed over 100 million pages
|
| > [W]e're crawling up to a million pages a day, as you can see on
| our stats page.
|
| > Given that Mwmbl is still relatively unknown, it seems
| plausible that we can reach our target of crawling three billion
| pages a day, to refresh the entire index in one month.
|
| I think this is supposed to read "it seems plausible that we can
| reach our target of crawling three _million_ pages a day."
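|
| A rough sanity check of that reading, using only the figures
| quoted above. The 30-day month and the function names are
| assumptions made for illustration, not anything from the post:

```python
# Back-of-the-envelope check of the crawl-rate figures quoted above.
# Assumption: "one month" is treated as 30 days.
INDEXED_PAGES = 100_000_000   # "We've indexed over 100 million pages"
CURRENT_RATE = 1_000_000      # "up to a million pages a day"
DAYS_PER_MONTH = 30           # assumed length of the refresh window


def pages_per_day(index_size: int, days: int = DAYS_PER_MONTH) -> float:
    """Daily crawl rate needed to revisit every page once within `days`."""
    return index_size / days


def days_to_refresh(index_size: int, rate: int) -> float:
    """Days needed to revisit every page once at `rate` pages per day."""
    return index_size / rate


if __name__ == "__main__":
    needed = pages_per_day(INDEXED_PAGES)
    print(f"Rate to refresh the current index monthly: {needed:,.0f} pages/day")
    print(f"Refresh time at today's rate: "
          f"{days_to_refresh(INDEXED_PAGES, CURRENT_RATE):.0f} days")
    print(f"Rate to refresh a billion pages monthly: "
          f"{pages_per_day(1_000_000_000):,.0f} pages/day")
```

| Under those assumptions, refreshing roughly 100 million pages in a
| month takes a bit over three million pages a day, which fits the
| "million" reading; a billion-page index would need about ten times
| that rate.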
| jmclnx wrote:
| Very interesting and was quick for me. Nice work!
___________________________________________________________________
(page generated 2023-12-23 23:01 UTC)