Post Aa9tkpLTkf80NDxu4G by rohden@fe.disroot.org
(DIR) Post #AZz5vrocAYoP0s00ci by djoerd@idf.social
2023-09-21T07:10:07Z
0 likes, 0 repeats
#Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine 🙂 Mwmbl crawls the web with the help of volunteers who install a Firefox extension that crawls in the background, retrieving one page per second. https://github.com/mwmbl/mwmbl
(DIR) Post #AZz60t2t1U1EHC4IXw by djoerd@idf.social
2023-09-21T07:11:02Z
0 likes, 0 repeats
Get the extension here: https://addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-crawler/
(DIR) Post #AZz6EbTt7kflr5Erdw by djoerd@idf.social
2023-09-21T07:13:30Z
0 likes, 0 repeats
#Mwmbl got its name from the Welsh word Mwmbwls (Now I love Welsh) https://en.wikipedia.org/wiki/Mumbles
(DIR) Post #AZz9hNqfBl1bpoJ4ro by berkes@mastodon.nl
2023-09-21T07:52:17Z
0 likes, 0 repeats
@djoerd according to that lemma it comes from the French or Latin for breasts.
(DIR) Post #Aa9tkpLTkf80NDxu4G by rohden@fe.disroot.org
2023-09-25T07:18:37.696834Z
0 likes, 0 repeats
@djoerd I really wonder how this index thingy works. In the end, it is stored on a central server, I guess. I mean, if 100 people per day crawl, isn't that going to be quite a large data volume after a few weeks? Moreover, does the crawler skip harmful and embarrassing URLs? Is the developer in the #fediverse? As a side note, it is integrated as a search engine and auto-completion tool in the latest #searxNG releases. #Mwmbl --- Update 1: After some looking, I found an answer regarding the crawling of harmful and embarrassing URLs. It seems they use a block list. https://github.com/mwmbl/mwmbl/blob/main/mwmbl/settings.py However, the block list does not seem complete (EXCLUDED_DOMAINS =). --- Update 2: I dug a bit further. The main developer is on #matrix. They have a goal of indexing 1 billion pages a month. 🦾 https://daoudclarke.net/search%20engines/2022/07/10/non-profit-search-engine
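A minimal sketch of how a domain block list like the one mentioned above could be applied before a URL is crawled. This is not Mwmbl's actual code: only the name EXCLUDED_DOMAINS comes from the post, while the example domains and the helper is_excluded are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical entries; the real contents of mwmbl's EXCLUDED_DOMAINS
# are not shown in this thread.
EXCLUDED_DOMAINS = {"blocked.example", "spam.example"}


def is_excluded(url: str, excluded: set[str] = EXCLUDED_DOMAINS) -> bool:
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    # Match the domain itself and any subdomain, but not look-alike hosts
    # such as "notblocked.example".
    return any(host == domain or host.endswith("." + domain) for domain in excluded)


candidates = [
    "https://blocked.example/page",
    "https://sub.spam.example/x",
    "https://notblocked.example/",
]
# Keep only the URLs a crawler would be allowed to fetch.
allowed = [url for url in candidates if not is_excluded(url)]
```

A list like this only filters domains that someone has already added, which is why an incomplete list lets unwanted pages through.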
(DIR) Post #Aa9tkquFwqXbDNFBui by djoerd@idf.social
2023-09-26T12:15:25Z
0 likes, 0 repeats
@rohden Thanks for asking and answering these questions. I learnt about Mwmbl just the other day too
(DIR) Post #Aa9toJReIaPIauVsfo by rohden@fe.disroot.org
2023-09-24T21:49:04.579383Z
0 likes, 0 repeats
@djoerd #mwmbl offers autocompletion. I hope they took a look at this: https://www.cs.ru.nl/~hiemstra/deck.js/ossym2020.html#slide-10 Reducing Misinformation in Query Auto-completions. Hiemstra, Djoerd. Appeared in: Open Search Symposium 2020, 12-14 October 2020, CERN, Geneva, Switzerland. https://djoerdhiemstra.com/publications/
(DIR) Post #Aa9toKAfbCQ4qXZpqa by djoerd@idf.social
2023-09-26T12:16:07Z
0 likes, 0 repeats
@rohden 😊
(DIR) Post #AaBVPZxvjGVS0cCo76 by rohden@fe.disroot.org
2023-09-26T18:41:51.745930Z
0 likes, 0 repeats
@djoerd Are you considering investing some energy in this project? I mean, it is something worth looking at from a scientific perspective. Additionally, any university could offer some crawling capacity. The goals he has are quite ambitious.
(DIR) Post #AaBVPaaZPbPlwSHfLE by djoerd@idf.social
2023-09-27T06:52:09Z
0 likes, 0 repeats
@rohden I am! We are in the process of making an open web index at Radboud University together with partners in Europe, so that may be mutually beneficial
(DIR) Post #AaDbuMFfkrHIUiU4Bs by rohden@fe.disroot.org
2023-09-28T06:58:19.766946Z
0 likes, 0 repeats
@djoerd May I ask, what is your approach? I have started to crawl for #mwmbl. In theory, one mwmbl crawler manages 1 web page per second. This means that in 30 days, 2,592,000 pages get indexed (30 × 86,400 seconds). There are 201,898,446 active websites (https://siteefy.com/how-many-websites-are-there/). Counting one page per website, one crawler covers about 1.3 % of all active websites per month. However, I suppose that mwmbl does not distinguish between active and inactive websites. The strategy employed by mwmbl involves starting with #HackerNews and then progressing randomly. Therefore, it is highly unlikely that a limited number of crawlers will accidentally traverse the same page simultaneously. At the same time, it means that it takes quite some time to index everything.
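A quick way to check the back-of-the-envelope estimate in the post above. The sketch assumes exactly one page per second with no pauses and treats one crawled page as one website, both simplifications. The website count is the figure quoted in the post, and the function names are illustrative only.

```python
PAGES_PER_SECOND = 1            # crawl rate stated in the thread
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
ACTIVE_WEBSITES = 201_898_446   # figure quoted from siteefy.com in the post


def pages_crawled(days: int, crawlers: int = 1) -> int:
    """Pages fetched by `crawlers` running non-stop for `days` days."""
    return PAGES_PER_SECOND * SECONDS_PER_DAY * days * crawlers


def coverage_percent(days: int, crawlers: int = 1) -> float:
    """Share of active websites reached, counting one page per website."""
    return 100 * pages_crawled(days, crawlers) / ACTIVE_WEBSITES


def days_to_cover_all(crawlers: int = 1) -> float:
    """Days needed to fetch one page for every active website."""
    return ACTIVE_WEBSITES / (PAGES_PER_SECOND * SECONDS_PER_DAY * crawlers)


print(pages_crawled(30))               # 2592000
print(round(coverage_percent(30), 2))  # 1.28
print(round(days_to_cover_all(100)))   # days needed with 100 crawlers
```

Under these assumptions a single crawler reaches a little over 1 % of the quoted number of active websites in 30 days, so coverage depends mostly on how many volunteers crawl in parallel.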
(DIR) Post #AaDbuN4ih46x32Mpl2 by djoerd@idf.social
2023-09-28T07:14:20Z
0 likes, 0 repeats
@rohden More info on the project at: https://openwebsearch.eu/ and specifically about our crawler here: https://openwebsearch.eu/owler/ We wrote a paper discussing the development of the open web index here: https://djoerdhiemstra.com/2023/impact-and-development-of-an-open-web-index-for-open-web-search/ About sharing the index, we recently wrote a small paper about some of the challenges: https://djoerdhiemstra.com/2023/challenges-of-index-exchange-for-search-engine-interoperability/
(DIR) Post #AaKd9CWxdk58kVvp5M by rohden@fe.disroot.org
2023-09-30T11:52:28.402127Z
0 likes, 0 repeats
@djoerd Warning. I just took a look at the crawling process of #mwmbl. The block list of mwmbl is crap. I can no longer recommend participating in crawling.
(DIR) Post #AaKd9DbbdxOM5D72xs by djoerd@idf.social
2023-10-01T16:31:14Z
0 likes, 0 repeats
@rohden Too bad. There's no easy solution, I guess