Post AbMt3fXjWui2WqTdAW by rohden@fe.disroot.org
(DIR) Post #AbMl81bgdk06U1tNMu by rohden@fe.disroot.org
2023-11-01T09:18:27.203252Z
0 likes, 1 repeats
In 2021, Rohan Kumar took a look at #searchengines with their own indexes, and last updated it 2023-09-02. Here is what he found out: https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/

Some of them are open source:
#stract https://trystract.com/
#SeSe Engine https://sese.yyj.moe/
#ChatNoir https://www.chatnoir.eu/
#marginalia https://search.marginalia.nu/
#teclis http://teclis.com/

It is a very comprehensive summary.

#mwmbl #yep #Alexandria #Mojeek #seekport #Infotiger #Bloopish #Ichido #Teclis #Gnomit #fastbot #wiby
(DIR) Post #AbMl9xWAoUtONgEfR2 by amerika@noagendasocial.com
2023-11-01T15:02:35Z
0 likes, 0 repeats
@rohden We have to escape the Google-Bing duopoly somehow. As far as I know, Brave, DuckDuckGo, and most other "alternatives" use the Bing index.
(DIR) Post #AbMt3fXjWui2WqTdAW by rohden@fe.disroot.org
2023-11-01T15:11:43.079976Z
0 likes, 0 repeats
@amerika Yes, that is correct. Creating an index is a complex task. Besides the technical difficulties (storage space, bandwidth), there are many other problems: for example, how to make sure that the search results are safe, unbiased, legal, and sustainable. Currently there are many interesting projects, like @djoerd's, or projects built on raw web-archive data (#WARC, #commoncrawl) like #ChatNoir. But I'm not sure whether they can survive if the current leaders leave the project.

To get an idea what ChatNoir is using, see https://www.chatnoir.eu/doc/architecture/:

"Hardware and Index Statistics

Indexing billions of web documents and providing a fast web search service for them isn't possible without some beefy hardware.

ChatNoir runs on the 145-node Webis Betaweb Cluster at Bauhaus-Universität Weimar (Germany), which provides a total of over 1700 CPU threads, almost 30 TB RAM and more than 4 PB of storage. Thanks to this large amount of fast main memory, we can serve search requests in only a few milliseconds despite the considerable index size.

The Elasticsearch indices are distributed over 120 nodes with 40 shards per index and a replica count of 2 (resulting in full allocation of one shard per data node). The remaining nodes are used as data-less manager nodes or serve other purposes.

Shards are between 80 and 250 GB in size. In total, we have indexed a little over 3 billion documents with a total size of 50 TB (including replicas). Another 41 TB is needed for storing the map files."

The approach taken by #mwmbl is different: https://github.com/mwmbl/mwmbl. That makes it more feasible, at the cost of less successful searches.
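The shard numbers in the quote line up, by the way. A quick sketch (plain Python arithmetic, no ChatNoir internals assumed; note that a "replica count of 2" in Elasticsearch means two replica copies in addition to the primary) shows why 40 shards with 2 replicas fully allocate one shard copy per data node on 120 nodes:

```python
# Shard math from the quoted ChatNoir architecture page.
# In Elasticsearch, number_of_replicas counts copies *in addition to*
# the primary, so each of the 40 primaries exists as 3 physical copies.
primary_shards = 40
replica_count = 2

copies_per_shard = 1 + replica_count            # primary + replicas = 3
total_shard_copies = primary_shards * copies_per_shard

data_nodes = 120
print(total_shard_copies)                       # 120 shard copies
print(total_shard_copies / data_nodes)          # 1.0 -> one copy per data node
```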
(DIR) Post #AbMt3gZBizT1heAJ4i by amerika@noagendasocial.com
2023-11-01T16:31:05Z
0 likes, 0 repeats
@rohden Running a crawler these days must be a major undertaking. The Microsoft guys said back in the day that you needed a server farm, and even with that, their bots regularly strangle hosts. I think many of us are ready for an alternative to the catalog-style, advertising-based search results from Google and Bing.