Posts by rohden@fe.disroot.org
(DIR) Post #ASbFKqE9XN0keqPNdw by rohden@fe.disroot.org
2023-02-11T21:03:06.399846Z
0 likes, 0 repeats
@fitheach They have a list of recommended e-mail providers: https://delta.chat/en/2022-01-16-dapsi2blogpost

Alternatively, setting up one's own mail server is also an option, it seems: https://delta.chat/en/2023-01-27-upcoming-mail-server-workshops
(DIR) Post #ATZJLwEvP6fx7dOtTE by rohden@fe.disroot.org
2023-03-11T22:55:05.118080Z
1 likes, 0 repeats
@muppeth @antilopa It is not magic. It is art!
(DIR) Post #Aa9tkpLTkf80NDxu4G by rohden@fe.disroot.org
2023-09-25T07:18:37.696834Z
0 likes, 0 repeats
@djoerd I really wonder how this index thingy works. In the end, it is stored on a central server, I guess. I mean, if 100 people per day crawl, isn't that going to be quite a large data volume after a few weeks? Moreover, does the crawler skip harmful and embarrassing URLs? Is the developer in the #fediverse?

As a side note, it is integrated as a search engine and auto-completion tool in the latest #searxNG releases.

#Mwmbl

---

Update 1: After some digging, I found an answer regarding the crawling of harmful and embarrassing URLs. It seems they use a block list: https://github.com/mwmbl/mwmbl/blob/main/mwmbl/settings.py
However, the block list (EXCLUDED_DOMAINS) does not seem complete.

---

Update 2: I dug a bit further. The main developer is on #matrix. They have a goal of indexing 1 billion pages a month. 🦾 https://daoudclarke.net/search%20engines/2022/07/10/non-profit-search-engine
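To illustrate how such a block list could be applied during crawling, here is a minimal sketch. The domain names and the helper function are hypothetical; the real list lives in mwmbl's settings.py and is larger, and mwmbl's actual filtering logic may differ.

```python
from urllib.parse import urlparse

# Hypothetical blocklist in the spirit of mwmbl's EXCLUDED_DOMAINS setting;
# the real entries are in mwmbl/settings.py and differ from this sketch.
EXCLUDED_DOMAINS = {"example-spam.com", "example-adult.net"}

def is_allowed(url: str) -> bool:
    """Return False if the URL's host (or any parent domain) is blocklisted."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # Check the host itself and every parent domain against the blocklist,
    # so "blog.example-spam.com" is caught via "example-spam.com".
    return not any(".".join(parts[i:]) in EXCLUDED_DOMAINS
                   for i in range(len(parts)))

print(is_allowed("https://blog.example-spam.com/post"))  # False
print(is_allowed("https://mwmbl.org/"))                  # True
```

The parent-domain walk is what makes a short list effective: one entry covers every subdomain, which is also why gaps in the list matter so much.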
(DIR) Post #Aa9toJReIaPIauVsfo by rohden@fe.disroot.org
2023-09-24T21:49:04.579383Z
0 likes, 0 repeats
@djoerd #mwmbl offers autocompletion. I hope they took a look at this: https://www.cs.ru.nl/~hiemstra/deck.js/ossym2020.html#slide-10

"Reducing Misinformation in Query Auto-completions", Hiemstra, Djoerd. Appeared in: Open Search Symposium 2020, 12-14 October 2020, CERN, Geneva, Switzerland. https://djoerdhiemstra.com/publications/
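To make the idea concrete: one simple (and naive) way to reduce misinformation in autocompletion is to filter candidate suggestions against a term blocklist before ranking. This sketch is not the paper's actual method; the suggestion corpus and blocked terms are invented for illustration.

```python
# Naive sketch: prefix autocompletion that drops suggestions containing
# blocklisted terms. Corpus and filter list are hypothetical examples.
SUGGESTIONS = ["climate change facts", "climate change hoax",
               "climate change effects", "climbing gear"]
BLOCKED_TERMS = {"hoax"}  # hypothetical filter list

def autocomplete(prefix: str, limit: int = 3) -> list[str]:
    """Return up to `limit` suggestions matching the prefix, minus blocked ones."""
    matches = [s for s in SUGGESTIONS
               if s.startswith(prefix)
               and not any(term in s.split() for term in BLOCKED_TERMS)]
    return sorted(matches)[:limit]

print(autocomplete("climate"))  # ['climate change effects', 'climate change facts']
```

A real system would rank by query popularity rather than alphabetically, and the paper's point is precisely that popularity-based ranking can amplify misinformation, so filtering has to happen before or during ranking.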
(DIR) Post #AaBVPZxvjGVS0cCo76 by rohden@fe.disroot.org
2023-09-26T18:41:51.745930Z
0 likes, 0 repeats
@djoerd Are you considering investing some energy in this project? It is something worth looking at from a scientific perspective. Additionally, any university could offer some crawling capacity. The goals he has are quite ambitious.
(DIR) Post #AaDbuMFfkrHIUiU4Bs by rohden@fe.disroot.org
2023-09-28T06:58:19.766946Z
0 likes, 0 repeats
@djoerd May I ask, what is your approach?

I have started to crawl for #mwmbl. In theory, one mwmbl crawler manages 1 web page per second. This means that, in 30 days, 2,592,000 pages get indexed. There are 201,898,446 active websites (https://siteefy.com/how-many-websites-are-there/), so one crawler covers about 1.3 % of all active websites per month. However, I suppose that mwmbl does not distinguish between active and inactive websites.

The strategy employed by mwmbl involves starting with #HackerNews and then progressing randomly. Therefore, it is highly unlikely that a limited number of crawlers will accidentally traverse the same page simultaneously. At the same time, it means that it takes quite some time to index everything.
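The rate arithmetic above can be sanity-checked in a few lines (assuming a sustained 1 page per second; the active-website count is siteefy's approximate figure):

```python
# Back-of-the-envelope check of the crawl-rate estimate:
# 1 page/s sustained over 30 days vs. the number of active websites.
pages_per_second = 1
seconds_per_month = 30 * 24 * 60 * 60   # 2,592,000 seconds in 30 days
active_websites = 201_898_446            # approximate figure from siteefy

pages_per_month = pages_per_second * seconds_per_month
coverage = pages_per_month / active_websites
print(f"{pages_per_month:,} pages/month -> {coverage:.1%} of active sites")
```

So covering every active website once would take roughly 78 crawler-months at this rate, before even considering re-crawls of updated pages.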
(DIR) Post #AaKd9CWxdk58kVvp5M by rohden@fe.disroot.org
2023-09-30T11:52:28.402127Z
0 likes, 0 repeats
@djoerd Warning: I just took a look at the crawling process of #mwmbl. The block list of mwmbl is crap. I can no longer recommend participating in the crawling.
(DIR) Post #AbMl81bgdk06U1tNMu by rohden@fe.disroot.org
2023-11-01T09:18:27.203252Z
0 likes, 1 repeats
Rohan Kumar took a look in 2021 at #searchengines with their own indexes and last updated the list on 2023-09-02. Here is what he found: https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/

Some of them are open-source:
#stract https://trystract.com/
#SeSe Engine https://sese.yyj.moe/
#ChatNoir https://www.chatnoir.eu/
#marginalia https://search.marginalia.nu/
#teclis http://teclis.com/

It is a very comprehensive summary.

#mwmbl #yep #Alexandria #Mojeek #seekport #Infotiger #Bloopish #Ichido #Teclis #Gnomit #fastbot #wiby
(DIR) Post #AbMt3fXjWui2WqTdAW by rohden@fe.disroot.org
2023-11-01T15:11:43.079976Z
0 likes, 0 repeats
@amerika Yes, that is correct. It is a complex task to create an index. Besides the technical difficulties (storage space, bandwidth), there are many other problems, for example how to make sure that the search results are safe, unbiased, legal ..., sustainable. Currently there are many interesting projects like @djoerd's, or projects based on the raw web archive (#WARC, #commoncrawl) like #ChatNoir. But I'm not sure if they can survive if the current leaders leave the project.

To get an idea of what ChatNoir is using, see https://www.chatnoir.eu/doc/architecture/:

"Hardware and Index Statistics

Indexing billions of web documents and providing a fast web search service for them isn't possible without some beefy hardware. ChatNoir runs on the 145-node Webis Betaweb Cluster at Bauhaus-Universität Weimar (Germany), which provides a total of over 1700 CPU threads, almost 30 TB RAM and more than 4 PB of storage. Thanks to this large amount of fast main memory, we can serve search requests in only a few milliseconds despite the considerable index size.

The Elasticsearch indices are distributed over 120 nodes with 40 shards per index and a replica count of 2 (resulting in full allocation of one shard per data node). The remaining nodes are used as data-less manager nodes or serve other purposes. Shards are between 80 and 250 GB in size. In total, we have indexed a little over 3 billion documents with a total size of 50 TB (including replicas). Another 41 TB is needed for storing the map files."

The approach by #mwmbl is different: https://github.com/mwmbl/mwmbl
It is thus more feasible, while accepting that searches are not as successful.
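The shard layout in the quoted ChatNoir description can be verified with a quick calculation (figures taken from the quote above; nothing here beyond what it states):

```python
# Sanity check of the shard layout from the quoted ChatNoir docs:
# 40 primary shards per index with a replica count of 2 means each shard
# exists in 3 physical copies, which should exactly fill 120 data nodes
# ("full allocation of one shard per data node").
primary_shards = 40
replicas = 2
data_nodes = 120

total_shards = primary_shards * (1 + replicas)
shards_per_node = total_shards / data_nodes
print(total_shards, shards_per_node)  # 120 1.0
```

So the 40-shard / 2-replica choice is not arbitrary: it is sized so that every one of the 120 Elasticsearch data nodes holds exactly one physical shard of each index.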
(DIR) Post #Aqa6GVvAH4uX5zfTMG by rohden@fe.disroot.org
2025-01-29T17:22:23.696915Z
0 likes, 0 repeats
@vidzy I need more input. Online or offline?