[HN Gopher] Show HN: I'm building a non-profit search engine
       ___________________________________________________________________
        
       Show HN: I'm building a non-profit search engine
        
       Author : daoudc
       Score  : 376 points
       Date   : 2021-12-26 09:11 UTC (13 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | wolfgarbe wrote:
       | A laudable effort. Two questions:
       | 
        | 1. What is the rationale behind choosing Python as an
        | implementation language? Performance and efficiency are paramount
        | in keeping operational costs low and ensuring a good user
        | experience, especially when the search engine is used by many
        | users. I guess Python is not the best choice for this, compared
        | to C, Rust or Java.
       | 
       | 2. What is the rationale behind implementing a search engine from
       | scratch versus using existing Open Source search engine libraries
       | like Apache Lucene, Apache Solr and Apache Nutch (crawler)?
        
         | Faaak wrote:
         | Premature optimization is the root of all evil. Best to
         | concentrate on the algorithm first, and then, maybe, improve it
         | with a faster language.
         | 
         | Apart from that, the misconception that "python is slow" should
         | die :-)
        
           | Debug_Overload wrote:
           | > Premature optimization is the root of all evil.
           | 
           | Is keeping performance in mind and choosing the tech stack
           | accordingly really a premature optimization?
           | 
           | This might be the most abused phrase in CS history. Perhaps
           | we should add "Premature optimization fallacy" to the list of
           | cognitive errors programmers use as an excuse to not
           | seriously think about performance.
        
             | stevenally wrote:
             | It's a trade off between speed of development and
             | performance. Speed of development seems like a good
             | optimization for an experimental project?
        
           | authed wrote:
           | "Apart from that, the misconception that "python is slow"
           | should die :-) "
           | 
           | Yeah it's not python that is slow, it's the interpreter.
        
             | nulbyte wrote:
              | Which interpreter? There are multiple. I found PyPy to be
              | quite reasonable; often faster than the standard CPython
              | interpreter.
        
           | wolfgarbe wrote:
            | There is no contradiction: you can concentrate on the
            | algorithm in a faster language too :-)
           | https://benchmarksgame-
           | team.pages.debian.net/benchmarksgame/...
           | https://benchmarksgame-
           | team.pages.debian.net/benchmarksgame/...
           | https://benchmarksgame-
           | team.pages.debian.net/benchmarksgame/...
        
           | daoudc wrote:
           | Agreed, this was my thinking - and since I'm better at
           | Python, it's faster for me to get stuff done. I would like to
           | rewrite it in Rust though, all help from Rustaceans gladly
           | accepted!
        
           | marginalia_nu wrote:
           | In general, speed isn't the problem with search (at least the
            | retrieval aspect), but memory efficiency is. Things like low
            | per-object overhead and the ability to memory-map large
            | data ranges make a language much better suited to
            | implementing a search index.
           | 
           | But I agree, get it working first, then re-implement it in
           | another language if it turns out to be necessary.
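
The memory-mapping point above can be illustrated with a minimal Python sketch. The file layout and function name here are hypothetical, not from mwmbl or Marginalia: an index file holding a flat array of little-endian 32-bit document IDs, read without loading the whole file into RAM.

```python
import mmap
import struct

def read_posting_list(path, offset, count):
    """Read `count` 32-bit doc IDs starting at byte `offset` of a
    hypothetical on-disk posting-list file, via mmap, so only the
    touched pages are brought into memory."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            data = mm[offset:offset + 4 * count]
            return list(struct.unpack(f"<{count}I", data))
```

The same slicing works on multi-gigabyte files; the OS page cache, not the process heap, holds the data.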
        
       | freediver wrote:
       | Congrats! Very nice to see results being lightning fast, I am
       | getting 100-120ms response with network overhead included and
       | that is impressive. The payload size of only 10-20kb helps
       | immensely, good job!
       | 
       | I've built something similar called Teclis [1] and in my
       | experience a new search engine should focus on a niche and try to
       | be really, really good at it (I focused on non-commercial content
       | for example).
       | 
       | The reason is to be able to narrow down the scope of content to
       | crawl/index/rank and hopefully with enough specialization to be
        | able to offer better results than Google for that niche. This
        | could open doors to additional monetization paths, such as API
        | access.
       | Newscatcher [2] is an example of where this approach worked (they
       | specialized on "news").
       | 
       | [1] http://teclis.com
       | 
       | [2] https://newscatcherapi.com/
        
       | [deleted]
        
       | [deleted]
        
       | gkasev wrote:
        | Congrats on the MVP path you took to launch your product.
       | Generally, I think that there is a place for other variations of
       | web search, be it in the way you crawl or perhaps how you
       | monetize. I genuinely believe that it is really hard to build a
       | general purpose search engine like DDG, Google and the like, but
       | you can build a fairly good niche search engine. I'm particularly
       | fond of the idea of community powered curation in search. Just
        | today I launched my own take on a community-driven search engine
        | - https://github.com/gkasev/chainguide. If you'd like to bounce
        | ideas
       | back and forth with somebody, I'll be very interested to talk to
       | you.
        
       | ChuckMcM wrote:
       | Okay, the cynical quip is "All search engines other than Google's
       | are 'non-profit'." :-) But the reasons for that won't fit in the
       | margin here.
       | 
        | Building search engines is cool and fun! They have what seems
       | like an endless source of hard problems that have to be solved
       | before they are even close to useful!
       | 
       | As a result people who start on this journey often end up crushed
       | by the lack of successes between the start and the point where
       | there is something useful. So if I may, allow me to suggest some
       | alternatives which have all the fun of building a search engine
       | and yet can get you to a useful place sooner.
       | 
       | Consider a 'spam' search engine. Which is to say a crawler that
       | you work to train on finding spammy useless web sites. Trust me
       | when I say the current web is a "target rich environment" here.
        | The purpose would be not so much to provide a full search
        | engine as to provide something like what the realtime blackhole
        | lists did for email spam: come up with a list of URLs that could
        | be easily checked with a modified DNS-type server (using the DNS
        | protocol, but expressly for answering the query 'Is this URI
        | hosting spam?' in a rapid fashion).
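
A DNSBL-style lookup like the one described could be sketched as follows; the zone name is made up and the encoding convention (reversed labels, as used by email RBLs) is an assumption, not a spec from the comment.

```python
import socket

# Hypothetical blocklist zone, in the style of DNS-based blackhole
# lists (DNSBLs) used for email spam.
SPAM_ZONE = "spam.example.org"

def query_name(domain):
    """Build the lookup name: the site's labels reversed, with the
    blocklist zone appended, e.g.
    'example.com' -> 'com.example.spam.example.org'."""
    labels = domain.lower().strip(".").split(".")
    return ".".join(reversed(labels)) + "." + SPAM_ZONE

def is_listed(domain):
    """An A record in the zone means 'listed as spam'; NXDOMAIN
    (surfacing as a socket error) means the site is not listed."""
    try:
        socket.gethostbyname(query_name(domain))
        return True
    except socket.gaierror:
        return False
```

A browser plugin or proxy would call `is_listed()` before following a link, exactly as mail servers consult RBLs before accepting a message.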
       | 
       | There are two "go to market" strategies for such a site. One is a
       | web browser plugin that would either pop up an interstitial page
       | that said, "Don't go here, it is just spam" when someone clicked
       | on a link. Or a monkey-script kind of thing which would add an
       | indication to a displayed page that a link was spammy (like set
       | the anchor display tag to blinking red or something). The second
       | is to sell access to this service to web proxies, web filters,
       | and Bing which could in the course of their operation simply
       | ignore sites that appeared on your list as if they didn't exist.
       | 
       | You will know you are successful when you are approached by shady
       | people trying to buy you out.
       | 
       | Another might be a "fact finding" search engine. This would be
       | something like Wolfram Alpha but for "facts." There are lots of
       | good AI problems here, one which develops a knowledge tree based
       | on crawled and parsed data, and one which answers factual queries
        | like 'capital of alaska' or 'recipe for baked alaska'. The nice
        | thing about facts is that they are well protected against claims
        | of copyright infringement, so people really can't come after
        | you for reproducing the fact that the speed of light is about
        | 300,000 km/s, even if they can prove you crawled their web site
        | to get that fact.
        
       | amelius wrote:
       | What literature did you use to obtain suitable algorithms for
       | search/NLP?
        
       | prohobo wrote:
       | Thank you for this, but I feel like you _should_ make a profit
       | and this is currently a missed opportunity to use web3 principles
       | to do that.
       | 
       | Free and open software is a great ideal, but the reality is that
       | people need money to live - and ads are the way to make that
       | money on web2 platforms, which is why Google is in such a sad
       | state. Why not do something similar to Brave? You can add
       | tokenomics to the search engine and make money while keeping it
       | 100% useable and open source.
        
         | astoor wrote:
         | How would web3 and "tokenomics" solve search? If the underlying
         | problem is that spamdexers are given a profit incentive to game
         | search engine results with low effort and low quality content,
         | does it make a difference whether the profit incentive is via
         | pages generating advertising revenue or pages generating
         | cryptocurrency revenue?
        
           | prohobo wrote:
           | I don't really want to address your questions; this has been
           | talked to death. How do you think tokens could help? Maybe
           | they'd create incentives for users to use the engine. Maybe
           | they'd open a market for said tokens so the dev can extract
           | currency out of it. Maybe tokens can be a meta-system on top
           | of the search engine, so that the search functionality can be
           | left to solving the search problem without interference.
           | 
           | Do you think there's an alternative to tokens in order to
           | fund the project without degrading the search algorithm? If
           | so, I'm all ears.
           | 
           | In fact, I think we all are. Please enlighten us. But you
           | haven't proposed a solution while crypto devs have been
           | working on one since 2008.
        
         | everydaybro wrote:
          | exactly, free and open source is amazing, but the developer
          | should also earn money to live. maybe add donations, or maybe
          | add some web3 features
        
           | [deleted]
        
       | ortuman84 wrote:
       | > Marginalia Search is fantastic, but it is more of a personal
       | project than an open source community.
       | 
       | And where's the community behind mwmbl project?
        
         | daoudc wrote:
         | Feel free to email me if you want to be involved!
        
       | Closi wrote:
       | Hey, great project - the more competition in this space the
       | better. To be honest, at the moment the algorithm doesn't return
       | any sensible results for anything (at least that I can find), but
       | I hope that you can find a way past this as it's a great place to
       | have a project.
       | 
       | I've included some search terms below that I've tried - I've not
       | cherrypicked these and believe they are indicative of current
       | performance. Some of these might be the size of the index -
       | however I suspect it's actually how the search is being
       | parsed/ranked (in particular I think the top two examples show
       | that).
       | 
       | > Search "best car brands"
       | 
       | Expected: Car Reviews
       | 
       | Returns a page showing the best mobile phone brands.
       | 
       | then...
       | 
       | > Then searching "Best Mobile Phone"
       | 
       | Expected: The article from the search above.
       | 
       | Returns a gizmodo page showing the best apps to buy... "App
       | Deals: Discounted iOS iPhone, iPad, Android, Windows Phone Apps"
       | 
       | > Searching "What is a test?"
       | 
       | Expected result: Some page describing what a test is, maybe
       | wikipedia?
       | 
       | Returns "Test could confirm if Brad Pitt does suffer from face
       | blindness"
       | 
       | > Searching "Duck Duck Go"
       | 
       | Expected result: DDG.com
       | 
       | Returns "There be dragons? Why net neutrality groups won't go to
       | Congress"
       | 
       | > Searching "Google"
       | 
       | Expected result: Google.com
       | 
       | Returns: An article from the independent, "Google has just
       | created the world's bluest jeans"
        
         | rmbyrro wrote:
         | I guess that's the real problem. People like to wonder what
         | would be the "ideal world" in a search engine. It may be
         | wishful thinking, I don't know.
         | 
         | It seems really hard to produce quality search results. Takes a
         | lot of investment. Makes it an expensive product. But no one
          | wants to pay. So selling ads is the only way forward.
         | 
         | Maybe there's a way to convince people to pay what it takes? I
         | dunno...
        
           | bigyikes wrote:
           | I would gladly pay $5 a month for a Google-quality search
           | service that doesn't track me. I've been using Duck Duck Go
           | for most of this year, but frequently find myself falling
           | back to !g because Google's results really are much better.
           | 
            | I wonder how much money Google search makes per average
           | user. Is it more than $5/mo?
        
             | danuker wrote:
             | > because Google's results really are much better.
             | 
             | Or maybe they're average, but you only see the ones where
             | DDG fails.
             | 
             | Next time also try Yandex and Baidu.
        
             | berkut wrote:
             | Likewise, I'd pay that as well (I'd like no ads as well,
             | but tracking is the main issue for me).
             | 
             | Similarly, I do have DDG as my main search on all machines
             | and devices just out of principle, but its region-aware
              | searching (I'm in NZ, and often only want NZ results) is
              | very close to useless in my experience (with NZ as the
             | region ticked, it will still return results from .ca and
             | .co.uk domains, which I would have hoped would be almost
             | trivial to remove), and Google seems much better in this
             | area (but not perfect).
             | 
             | Similarly, there's often technical/programming things I'll
             | search for that DDG doesn't have indexed at all, and Google
             | does.
             | 
             | Google also seems a lot better at ignoring spelling
             | differences (color/colour, favorite/favourite) than DDG,
             | which is often (but not always!) useful.
        
             | _benj wrote:
              | I've been enjoying Neeva a lot. Not affiliated at all,
              | just a happy user :-)
              | 
              | I think they are $4.95/mo or something? I haven't paid a
              | cent yet since there are a few discounts that they offer
              | to prompt you to learn how to use it (I really liked that,
              | and it definitely made me more likely to stick with it!)
        
         | daoudc wrote:
         | Thanks for the feedback! I'll take a look at your examples and
         | see if I can improve the rankings.
        
           | devoutsalsa wrote:
           | The first thing I do to test a search engine is to search for
           | my own username on various public sites to see if it can find
           | me. It didn't find me. But keep it up and I'm sure I'll be in
            | there eventually (or maybe I overestimate how interesting I
            | am, hehe).
        
             | rmbyrro wrote:
              | I get this is your usual testing case for search engines,
              | but if you'd read their README you'd have seen it's
              | inappropriate for the project at the current stage.
        
           | GraemeMeyer wrote:
           | Fun idea. It seems to be getting stuck on the first word you
           | enter.
           | 
           | e.g. you get the same results for "London" as you do for
           | "London cats", "London cat rescue" and "London test".
        
         | gjm11 wrote:
         | I was curious and tried a bunch of other searches, with
         | similarly disappointing results. My searches were a bit more
         | esoteric than Closi's.
         | 
         | "langlands program" (pure mathematics thing): yup, top result
         | is indeed related to the Langlands program, though it isn't
         | obviously what anyone would want as their first result for that
         | search. Not bad.
         | 
         | "asmodeus" (evil spirit in one of the deuterocanonical books of
         | the Bible, features extensively in later demonology, name used
         | for an evil god in Dungeons & Dragons, etc.): completely blank
         | page, no results, no "sorry, we have no results" message,
         | nothing. Not good.
         | 
         | "clerihew" (a kind of comic biographical short poem popular in
         | the late 19th / early 20th century): completely blank page. Not
         | good.
         | 
         | "marlon brando" (Hollywood actor): first few results are at
         | least related to the actor -- good! -- but I'd have expected to
         | see something like his Wikipedia or IMDB page near the top,
          | rather than the tangentially related things I actually got.
         | 
         | "b minor mass" (one of J S Bach's major compositions): nothing
         | to do with Bach anywhere in the results; putting quotation
         | marks around the search string doesn't help.
         | 
         | "top quark" (fundamental particle): results -- of which there
         | were only 7 -- do seem to be about particle physics, and in
         | some cases about the top quark, but as with Marlon Brando
         | they're not exactly the results one would expect.
         | 
         | "ferrucio busoni" (composer and pianist): blank page.
         | 
         | "dry brine goose" (a thing one might be interested in doing at
         | this time of year): five results, none relevant; top two were
         | about Untitled Goose Game.
         | 
         | "alphazero" (game-playing AI made by Google): blank page.
         | Putting a space in results in lots of results related to the
         | word "alpha", none of which has anything to do with AlphaZero.
         | 
         | OK, let's try some more mainstream things.
         | 
         | "harry potter": blank page. Wat. Tried again; did give some
         | results this time. They are indeed relevant to Harry Potter,
         | though the unexpected first-place hit is Eric Raymond's rave
         | review of Eliezer Yudkowsky's "Harry Potter and the Methods of
         | Rationality", which I am fairly sure is not what Google gives
         | as its first result for "harry potter" :-).
         | 
         | "iphone 12" (confession: I couldn't remember what the current
         | generation was, and actually this is last year's): top results
         | _are_ all iPhone-related, but first one is about the iPhone 6,
          | second is from 2007, third is about the iPhone 6, fourth is
          | from
         | 2007, fifth is about the iPhone 4S, etc.
         | 
         | "pfizer vaccine": does give fairly relevant-looking results,
         | yay.
        
           | daoudc wrote:
            | Thanks for the detailed feedback! I think most of the
            | problems here are because we have a _really_ small index
           | right now. Increasing the number of documents is our top
           | priority. I agree that some kind of feedback when there are
           | no results would be a good idea.
        
             | Closi wrote:
             | I actually think it's probably the algorithm too - if I
             | take one of the search items returned from a search that I
             | know is in the index, but then search for it with slightly
             | different terminology (or a different tense /
             | pluralisation), the same item doesn't come up.
        
         | clay-dreidels wrote:
         | What does a search engine algorithm look like, and where can I
         | find examples to build from?
        
           | marginalia_nu wrote:
           | Depends on which algorithm you are looking for, but these are
           | commonly used:
           | 
           | * Okapi BM25 for determining the relevance of a result to a
           | query.
           | 
           | * TF-IDF for determining the relevance of a term to a
           | document.
           | 
           | * PageRank for ranking domains.
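
To make the first of these concrete, here is a minimal Python sketch of Okapi BM25 scoring. The corpus representation and parameter defaults (`k1`, `b`) are illustrative, not taken from any particular engine.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 relevance of one document (a list of terms) to a
    query, given the whole corpus as a list of tokenised documents."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N     # average doc length
    tf = Counter(doc)                           # term frequencies
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # doc frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        freq = tf[term]
        denom = freq + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * freq * (k1 + 1) / denom
    return score

corpus = [["cat", "sat", "mat"], ["dog", "ran"], ["cat", "cat", "dog"]]
scores = [bm25_score(["cat"], d, corpus) for d in corpus]
```

Documents without the term score zero; among documents of the same length, higher term frequency scores higher, with diminishing returns governed by `k1`.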
        
       | aantix wrote:
        | How does someone economically store the tens of thousands of
        | terabytes of data needed for the indexes of a large-scale search
       | engine?
       | 
       | And have the large server instances (lots of ram)?
        
         | marginalia_nu wrote:
         | Why would you need that much data? The average website has
         | maybe 10kB worth of textual information without compression. To
          | get tens of thousands of terabytes of data, you'd need to
          | index on the order of 10^12 websites. That seems a bit much.
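
The back-of-envelope arithmetic here checks out; as a quick sketch (the ~10 kB/page figure is the comment's own assumption):

```python
# At ~10 kB of uncompressed text per page, how many pages does
# "tens of thousands of terabytes" of index imply?
avg_page_bytes = 10 * 1000           # ~10 kB per page (assumed)
target_bytes = 10_000 * 10**12       # 10,000 TB
pages_needed = target_bytes // avg_page_bytes
print(f"{pages_needed:.2e}")         # on the order of 10^12 pages
```

For comparison, estimates of the indexable web are usually in the tens of billions (10^10) of pages, hence the comment's skepticism.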
        
         | supernovae wrote:
         | I tried this back in 2006 - mozdex (only a wikipedia article
         | survives) - it's not cheap. I was a fan of lucene which led to
         | nutch and eventually hadoop. so lots of servers running hdfs
         | doing map reduce jobs to compile and update indexes. No one in
         | the end cared about open search... duck seems to do alright
         | under the guise of security but most non major searches are
         | just meta searches these days because of economies of scale
         | being highly disadvantageous to any upcoming search - and
         | people just don't search like they used to either.
         | 
          | i was spending $2,500 a month just on indexers, and had query
          | traffic taken off, my costs would have shot through the roof,
          | since you want query nodes to be all in-memory cache and that
          | was expensive back then. today i would have used some modern
          | in-memory distributed doc dbs instead of query masters with
          | heavy block buffer caches. i learned a lot but lost my shirt :)
        
       | [deleted]
        
       | tmnstr85 wrote:
       | GuideStar is the veteran in this space. I agree that doing this
       | based on web scrapes and robots.txt is probably going to be
        | pretty tough to get quality results. GuideStar always sells
        | their product on the premise that they're vetting the financial
        | statements of non-profits to surface the best results. The real
        | money might
       | be on figuring out a way to scale reading and classifying non-
       | profit financials - then see if you can quality control using a
       | set of patterns.
        
         | AlphaWeaver wrote:
         | This isn't a search engine for nonprofits, it's a search engine
         | that's designed to be run by a nonprofit one day.
        
           | blueatlas wrote:
            | Or perhaps a for-profit company could run this search
            | engine without garnering profit by directly changing search
            | results, or with a profit model that does not affect search
            | rankings.
        
       | marcodiego wrote:
        | Non-profit search engines are needed. They will probably still
        | be vulnerable to SEO, but will more likely be resistant to
        | becoming corrupted by the interests of "investors".
        
       | daoudc wrote:
       | Update: there's been interest from a few people so I've started a
       | Matrix chat here for anyone that wants to help out or provide
       | feedback: https://matrix.to/#/#mwmbl:matrix.org
        
       | marban wrote:
       | I've recently built one just for business news w/ obligatory
       | zero-tracking. (https://yup.is)
        
       | tibbar wrote:
       | It's really fast - nice job! Can you elaborate on the ranking
       | algorithm you are using? It seems that this will become more
       | important as you index more pages.
        
         | daoudc wrote:
         | Thanks! A really simple one for now: number of matching terms,
         | and then prioritising matches earlier in the result string. But
         | this is something I'm looking forward to working on properly
         | when I get a bigger index.
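
A toy Python sketch of the ranking just described; the function and data shapes are made up for illustration, and mwmbl's actual code may differ.

```python
def rank_key(query_terms, title):
    """Toy version of the scheme described above: prefer results
    matching more query terms, then break ties by how early the
    first match appears in the result string."""
    words = title.lower().split()
    matches = sum(1 for t in query_terms if t in words)
    first = min((words.index(t) for t in query_terms if t in words),
                default=len(words))
    return (-matches, first)   # ascending sort: more matches, earlier hit

results = ["Phone deals this week", "Best mobile phone brands"]
results.sort(key=lambda r: rank_key(["best", "phone"], r))
```

Both criteria are query-dependent only; as the thread notes elsewhere, adding a query-independent signal like PageRank is the usual next step.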
        
           | daoudc wrote:
           | I also want to incorporate a community aspect to ranking,
           | allowing upvoting and downvoting of results. I've not yet
           | figured out how to reconcile this idea with not having any
           | tracking though. Perhaps a separate interface for logged-in
           | users.
        
             | foxfluff wrote:
             | One ambitious project I've thought about over and over
             | again over the years is search (and social sites / forums)
             | where the votes, tags, and flags make a public dataset and
             | users can manipulate their own weights (or even the ranking
             | algorithm) to construct a "web of trust" that yields
             | favorable results.
             | 
             | This way you can escape spammers, powertripping moderators,
             | and the tyranny of the hive mind; it doesn't matter if
             | there's a large population of spammers, shills, and idiots
             | upvoting crap because you set their weights to zero (or
             | negative). In fact, that becomes a feature, because by
             | upvoting crap, they generate a crap filter for you. If the
             | weights are also public, then you can automatically &
             | algorithmically seed your web of trust (simplest algo for
             | sake of example: give positive weight to identities who
             | upvoted and downvoted the same way you did) but you could
             | still override the algo with manually set values if it gave
             | too much weight to bad actors.
             | 
             | Obviously this has privacy implications (all your votes and
             | your network becomes public), and can generate a large
             | dataset (performance challenge, how do you distribute it /
             | give access to it?), so it's far from a trivial project.
             | For the privacy angle, I'd start by keeping identities
             | pseudonymous (e.g. a public key or random id -- you don't
             | know who's behind the identity unless they blurt it out).
             | Furthermore, I think it'd be useful to automagically split
             | your actions across multiple identities so it's harder to
             | link all your activity. I think the system should also
             | explicitly allow switching identities, for privacy but also
             | because sometimes you just want a different "filter bubble"
             | which helps tailor the content you get to what you're
             | looking for. Maybe the network that yields best shopping
             | results isn't the same network that yields best cooking
             | recipes or technical docs.
             | 
             | With this model, everyone is a moderator and everyone can
             | defer moderation to identities they trust, but neither the
             | hive mind nor individuals have the ultimate power to
             | dictate what you see. If you want to read spam or
             | conspiracy theories, you just switch to your identity which
             | upvotes such content and has positive weights towards other
             | identities with similar votes.
             | 
             | I doubt you're going to build this; I doubt people want
             | this. I certainly want it. Maybe one day I'll try, but it
             | probably won't work well without network effects
             | (=reasonably large quantity of users). I just wanted to let
             | you know about the idea because your project is inspiring
             | and inspiring things inspire me to share ideas.. :)
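
The "simplest algo" in the comment could be sketched like this in Python; all names and data shapes are hypothetical.

```python
def seed_weights(my_votes, others):
    """Simplest seeding, as described above: positive weight for
    identities that voted the same way I did on the same URLs,
    negative for those that voted the opposite way."""
    weights = {}
    for ident, their_votes in others.items():
        agree = sum(1 for url, v in my_votes.items()
                    if their_votes.get(url) == v)
        disagree = sum(1 for url, v in my_votes.items()
                       if url in their_votes and their_votes[url] != v)
        weights[ident] = agree - disagree
    return weights

def weighted_score(votes_on_result, weights):
    """Score one search result: each identity's +1/-1 vote scaled
    by the weight I assign that identity (unknown identities
    default to 0, i.e. muted)."""
    return sum(v * weights.get(ident, 0)
               for ident, v in votes_on_result.items())
```

Spammers who consistently upvote crap end up with negative weight, so their votes actively push results down for you, which is the "crap filter" effect the comment describes.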
        
               | ffhhj wrote:
               | This sounds like the sorting hat algorithm (tiktok)
               | applied to query search engines. If there could be a way
               | to visualize your recommendations network and switch to
               | others without logging out, this could work really well.
               | But a lot of research needs to be done, and the interest
               | of big actors is to keep users blind inside their webs.
               | 
               | This topic is interesting to me because I'm building a
               | faster search engine for programming queries and trying
               | to solve the core issues that got us stuck with crappy
               | engines.
        
               | daoudc wrote:
               | Actually I was thinking of something vaguely along those
               | lines!
        
       | legofr wrote:
       | > All other search engines that I've come across are for-profit.
       | Please let me know if I've missed one!
       | 
       | https://www.ecosia.org/
       | 
       | https://ask.moe/
       | 
       | https://ekoru.org/
       | 
       | I remember seeing one more non-profit search engine on HN but
       | can't seem to find it right now.
        
         | m-i-l wrote:
         | Also https://searchmysite.net/ for personal and independent
         | websites (essentially a loss-leader for its open source self-
         | hostable search as a service).
        
         | mlinksva wrote:
         | https://web.archive.org/web/20171130000415/https://about.com...
         | was one 5+ years ago
         | https://news.ycombinator.com/item?id=11281700
         | https://github.com/commonsearch
        
         | daoudc wrote:
         | Thanks, but these are not technically non-profit:
         | 
         | "Ecosia is a search engine based in Berlin, Germany. It donates
         | 80% of its profits to nonprofit organizations that focus on
         | reforestation" [1]
         | 
         | "80% of profits will be distributed among charities and non-
         | profit organizations. The remaining 20% will be put aside for a
         | rainy day." [2]
         | 
         | "Ekoru.org is a search engine dedicated to saving the planet.
         | The company donates 60% of revenue generated from clicks on
         | sponsored search results to partner organizations who work on
         | climate change issues" [3]
         | 
         | [1] https://en.wikipedia.org/wiki/Ecosia [2] https://ask.moe/
         | [3] https://www.forbes.com/sites/meimeifox/2020/01/19/how-the-
         | se...
        
           | toper-centage wrote:
           | While Ecosia is not technically a non-profit, no one can sell
           | Ecosia shares at a profit.
           | 
           | > Ecosia says that it was built on the premise that profits
           | wouldn't be taken out of the company. In 2018 this commitment
           | was made legally binding when the company sold a 1% share to
           | The Purpose Foundation, entering into a 'steward-ownership'
           | relationship.
           | 
           | > The Purpose Foundation's steward-ownership of Ecosia
           | legally binds Ecosia in the following ways: - Shares can't be
           | sold at a profit or owned by people outside of the company
           | and - No profits can be taken out of the company.
           | 
           | https://www.ethicalconsumer.org/technology/how-ethical-
           | searc...
        
             | gardenfelder wrote:
              | A reasonable approach seems to be a B Corp, where
              | shareholders cannot sue for financial gains.
        
         | asicsp wrote:
         | >I remember seeing one more non-profit search engine on HN but
         | can't seem to find it right now.
         | 
         | Probably this one? "A search engine that favors text-heavy
         | sites and punishes modern web design"
         | https://news.ycombinator.com/item?id=28550764 _(3 months ago,
         | 717 comments)_
        
         | Minor49er wrote:
         | I just tried ask.moe, but it clearly noted that the search
         | results were provided by Google
        
       | kova12 wrote:
       | What do you do in order for your crawler to not accidentally veer
       | into some naughty-naughty site and yield you a visit from your
       | friendly FBI squad? That concern is why I decided to stay away
       | from YaCy.
        
       | montebicyclelo wrote:
       | How much compute, storage, and network speed would a minimal up-
       | to-date web search engine need?
        
         | ValleZ wrote:
         | I'd estimate that Google uses ~1000 TB of fast storage, Bing
         | 500 TB and Yandex 100 TB, so the most basic useful search
         | engine would use at least... 10 TB?
        
           | dqv wrote:
           | What is fast storage? Is that, for right now, the fastest
           | SSDs available?
        
             | ValleZ wrote:
              | HDD is definitely not enough because of low IOPS; Google
              | likely keeps the index in RAM. I think NVMe should be good
              | enough, but I don't know for sure.
        
           | fauigerzigerk wrote:
           | _" The Google Search index contains hundreds of billions of
           | web pages and is well over 100,000,000 gigabytes in size."_
           | 
           | https://www.google.com/intl/en_uk/search/howsearchworks/craw.
           | ..
        
             | ValleZ wrote:
              | Actually I doubt that this is a true statement rather than
              | something to discourage others. Check out these queries:
              | 
              | https://www.google.com/search?q=1 - 12B results
              | https://www.google.com/search?q=an - 9B results
              | https://www.google.com/search?q=the - 6B results
              | 
              | If we estimate that about half of all English pages
              | contain the article 'the' or 'an', we'll have about 15B
              | English pages. If half of all pages contain '1', then the
              | total number of pages is about 24B. If half of all pages
              | are in English, then the total number of pages is 30B. So
              | even the maximum is far below "hundreds of billions".
              | Similar numbers are at https://www.worldwidewebsize.com/
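The back-of-the-envelope arithmetic above can be reproduced in a few lines (the result counts and the 50% fractions are the commenter's assumptions, not measured values):

```python
# Result counts Google reports for single-word queries, in billions
# (figures quoted in the comment above; treat them as rough).
hits = {"1": 12, "an": 9, "the": 6}

# Assumption: about half of all English pages contain 'the' or 'an',
# so English pages ~ 2 * average of the two counts.
english_pages = 2 * (hits["the"] + hits["an"]) / 2        # ~15 billion

# Assumption: half of all pages contain '1'.
total_pages_via_1 = 2 * hits["1"]                         # ~24 billion

# Assumption: half of all pages are in English.
total_pages_via_english = 2 * english_pages               # ~30 billion

print(english_pages, total_pages_via_1, total_pages_via_english)
```

Either way the estimate tops out in the tens of billions, which is the commenter's point.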
        
             | [deleted]
        
         | iopq wrote:
         | If you have to ask, you can't afford it
        
           | marginalia_nu wrote:
            | You might think that. My search engine runs on <$5k worth of
            | consumer hardware over domestic broadband. It survived the
            | Hacker News front page for a week and saw a sustained load
            | of 8000 searches per hour for nearly a day.
           | 
           | It's got a fairly small index, but yeah, it's not
           | particularly hardware-hungry.
        
             | daibo wrote:
             | What's the size of your index in records?
        
               | marginalia_nu wrote:
                | I run three separate indices at about 10-20mn documents
                | each. But I'm fairly far off any sort of limit (RAM- and
                | disk-wise I'm at maybe 40%).
                | 
                | I'm confident 100mn is doable with the current code,
                | maybe .5bn if I did some additional space optimization.
                | There is some low-hanging fruit that seems very
                | promising. Sorted integers are highly compressible, and
                | right now I'm not compressing them at all.
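The compression opportunity mentioned here - sorted doc-id lists compress well - is typically exploited with delta encoding plus a variable-length integer code. A minimal sketch (the function names are illustrative, not from any actual search engine's code):

```python
def encode_varint(n: int) -> bytes:
    """LEB128-style varint: 7 payload bits per byte, high bit = 'more'."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | 0x80 if n else b)
        if not n:
            return bytes(out)

def compress_doclist(doc_ids: list[int]) -> bytes:
    """Store the gaps between sorted doc ids instead of the ids themselves."""
    out, prev = bytearray(), 0
    for d in doc_ids:
        out += encode_varint(d - prev)
        prev = d
    return bytes(out)

def decompress_doclist(data: bytes) -> list[int]:
    """Invert compress_doclist: decode varints, then cumulative-sum the gaps."""
    ids, cur, acc, shift = [], 0, 0, 0
    for b in data:
        acc |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            cur += acc
            ids.append(cur)
            acc = shift = 0
    return ids

ids = [1000, 1003, 1004, 1100, 5000]
packed = compress_doclist(ids)
# 7 bytes instead of 40 for five 8-byte integers
```

Because gaps between neighbouring doc ids are small, most of them fit in one or two bytes, which is where the space saving comes from.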
        
               | daibo wrote:
               | Yes doclist compression is a must. Higher intersection
               | throughput and less bandwidth stress. Are you loading
               | your doclists from persistent storage? What is your
               | current max rps?
        
               | marginalia_nu wrote:
                | I'm loading the data off a memory-mapped SSD. Trivial
                | queries will probably be answered entirely from memory,
                | although the disk-read performance doesn't seem terrible
                | either.
               | 
               | > What is your current max rps?
               | 
               | It depends on the complexity of the request, and repeated
               | retrievals are cached, so I'm not even sure there is a
               | good answer to this.
        
           | dotancohen wrote:
           | Fair point.
           | 
           | How is it to be funded?
        
             | daoudc wrote:
             | The plan is to fund it through donations
        
               | smt88 wrote:
               | Can you make it a contributory database? I wouldn't mind
               | "donating" my browsing history and page downloads to
               | build the index and train the algorithm.
               | 
               | You'd have to find a way to verify reputation to make
               | sure no bad actors could contribute.
        
               | aspenmayer wrote:
               | How does Internet Archive verify dumps submitted by
               | Archive Team and other groups? This may already be a
               | solved problem.
               | 
               | Not knowing their implementation details, I'm guessing it
               | could be doable without reinventing much. An oracle could
               | dispatch a P2P archive job to a pool of clients randomly
               | assigned tasks, with both the first to archive and the
               | first to validate being recognized by the swarm somehow,
               | with periodic re-archiving and re-verification, rate
               | adjusted by popularity of site and of search keywords.
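One hedged sketch of that dispatch-and-validate loop (everything here - the k-client assignment, the SHA-256 majority quorum - is hypothetical, not how Internet Archive or Archive Team actually work):

```python
import hashlib
import random
from collections import Counter

def assign_tasks(urls, clients, k=3, seed=None):
    """Oracle step: dispatch each URL to k randomly chosen clients."""
    rng = random.Random(seed)
    return {url: rng.sample(clients, k) for url in urls}

def validate(submissions):
    """Accept a capture only if a strict majority of content hashes agree."""
    votes = Counter(hashlib.sha256(body).hexdigest() for body in submissions)
    digest, count = votes.most_common(1)[0]
    return digest if count > len(submissions) // 2 else None
```

A tampered submission from one bad actor is outvoted by two honest ones; when no majority emerges (e.g. the page genuinely differs per fetch), the URL would simply be re-dispatched.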
        
               | daoudc wrote:
               | Yes, I'm planning to do something like this.
        
             | hirako2000 wrote:
              | Is the size of the common Web already way too large to
              | play catch-up against Google/Bing at this point?
              | 
              | My dream is a distributed/p2p index: each browser
              | contributes to storing part of the overall index and
              | handles queries coming from other users, so that how to
              | fund huge data centers never becomes a question.
        
               | dotancohen wrote:
                | > Is the size of the common Web already way too large to
                | play catch-up against Google/Bing at this point?
               | 
               | Probably. But I would prefer a search engine that didn't
               | search the whole web. I would prefer a search engine that
               | searched the sites related to fields that I'm interested
               | in.
               | 
                | So I would pay for or donate to a search engine that
                | provided me good results in e.g. software development.
                | They could add additional fields as demand warrants, so
                | long as quality is maintained. I would even like to see
                | a faceting feature, so I could search for e.g. Matrix
                | and get results on the mathematical concept when need
                | be, without having to wade through movie reviews or
                | fiddle with magic search keywords.
        
               | hirako2000 wrote:
                | Not searching the whole Web makes sense, except that it
                | isn't clear what belongs to your field of interest and
                | what doesn't. Should indexing skip a blog page because
                | the author usually writes about algorithms but here goes
                | on and on about business while sporadically mentioning
                | algorithmic technical aspects? And what if you want to
                | search about pottery that Sunday morning - turn to
                | Google? I think segmenting search results by topic is a
                | useful consumer query feature, but I'm not sure
                | segmenting what gets indexed would provide a useful
                | service other than covering niches, hence not really
                | fulfilling the role of a web search engine. I find the
                | idea less ambitious, so maybe that's how an open search
                | engine should approach the problem: a federation of
                | hosts could cover the whole Web eventually.
        
               | dotancohen wrote:
                | > Should indexing skip a blog page because the author
                | usually writes about algorithms but here goes on and on
                | about business while sporadically mentioning algorithmic
                | technical aspects?
                | 
                | Yes, because that author's pages wouldn't even be
                | fetched at the point of development that we are
                | discussing.
                | 
                | > And what if you want to search about pottery that
                | Sunday morning - turn to Google?
               | 
               | Yes, Google still exists. Why not?
        
               | daoudc wrote:
               | Check out YaCy. It's great if you're happy with slow
               | search!
        
               | charcircuit wrote:
                | YaCy's results weren't great, and the ranking was very
                | bad, letting sites game it by just spamming a ton of
                | keywords.
        
               | jesprenj wrote:
                | I had problems with YaCy: slow search and slow crawling.
                | Regarding search, I think it could be improved if a more
                | efficient protocol (UDP) were used instead of HTTP
                | requests to other peers. Regarding crawling, it's quite
                | possible that doing this in Java and running it on a
                | Rock64 may not be the best combination (: It started
                | OOMing after some days.
        
           | arpa wrote:
            | See, that's the problem with engineers today. AltaVista ran
            | on 3x300MHz 64-bit processors with 512M of RAM back in 1995.
            | Resources cost peanuts these days. It's just that we're so
            | used to bloat and digital inflation that we can't even start
            | considering unbloated implementations, as we perceive them
            | as "not modern". Apparently we are also stuck in the
            | centralized/ownership mentality. If you crowdsource search
            | indexing and processing, it scales along with the number of
            | users. Oh, and another bane of the internet is the need to
            | monetize. FFS, if email were invented today, we'd probably
            | have to buy NFT postage stamps.
        
             | adtac wrote:
             | I'm sorry what? Have you seen the rate at which data is
             | being created today? I mean, if you want to index the size
             | of the 1995 web with your raspberry pi, go ahead, but it
             | costs insane amounts of money to index the December 2021
             | web and keep the index up-to-date.
             | 
             | edit / full disclosure: I work at google but nothing
             | related to search
        
               | noogle wrote:
               | Is it really necessary to index EVERYTHING? It's true
               | that we have much more data today than 26 years ago, but
               | not all of these websites qualify or provide value
               | (duplicate results, promotional content, outdated
               | content).
               | 
               | The challenge then moves to the curation, but it's no
               | longer infeasible.
        
               | Closi wrote:
               | I think the words "minimally viable" are being ignored
               | here.
               | 
                | I think OP's point is: assume you only have 10/100
                | terabytes of space and limited compute - how would you
                | approach the problem? I assume 90% of Google's searches
                | probably come from less than 1% of their total index,
                | not to mention that Google also keeps full cached
                | versions of whole websites, including images.
        
               | adtac wrote:
               | I just eyeballed my browser history from the last 2-3
               | days and I'd estimate 15% is current/latest news related,
               | some 25% is programming related, 15% is e-commerce stuff,
               | the rest random crap. I'd imagine 10-100 TB can easily
               | serve all of _my_ search space (even the links I didn't
               | look at on page 10) from the past few years, but that's
               | the thing -- it's just my search space. How do you serve
               | the rest of the world? I wish I knew the answer :)
        
               | foxfluff wrote:
                | Well, Google doesn't know the answer either. Results are
                | complete trash when you try to find something niche or
                | in a local language.
               | 
               | The challenge isn't to index the entire web, it's to
               | index the useful parts of it, and I think an index
               | covering most of the useful web can be seeded quite
               | easily with some community effort.
        
               | [deleted]
        
               | [deleted]
        
       | quantum2021 wrote:
       | Two big things that annoy me about Google:
       | 
       | 1. Their regular search doesn't actually search by area (they
       | somewhat get around this with their Maps feature); you always get
       | national websites that optimize the best. Area-aware search would
       | be a nice feature to have starting out, without having to type in
       | the specific area you're looking for.
       | 
       | 2. Search results for hotels that actually work - not only ones
       | set up on OTAs! This could actually get your search engine some
       | traction as the go-to engine for making travel plans, which would
       | give you a nice niche to start out in.
        
       | slmjkdbtl wrote:
       | As a normal human I naturally typed in "fuck" in a new search
       | engine and it led me to this article
       | https://mathbabe.org/2015/06/22/fuck-trigonometry/ which I quite
       | enjoyed!
        
       | born-jre wrote:
       | There seem to be a lot of comments about some form of distributed
       | trust/reputation-based system.
        
       | tomxor wrote:
       | > We plan to start work on a distributed crawler, probably
       | implemented as a browser extension that can be installed by
       | volunteers.
       | 
       | Is there a concern that volunteers could manipulate results
       | through their crawler?
       | 
       | You already mentioned distributed search engines have their own
       | set of issues. I'm wondering if a simple centralised non-profit
       | fund a la Wikipedia could work better to fund crawling without
       | these concerns. One anecdote: personally, I would not install a
       | crawler extension - not because I don't want to help, but because
       | my internet connection is pitifully slow. I'd rather donate a
       | small sum that would go way further in a datacenter... although I
       | realise the broader community might be the other way around.
       | 
       | [edit]
       | 
       | Unless the crawler was clever enough to merely feed off the sites
       | I'm already visiting and use minimal upload bandwidth. The only
       | concern then would be privacy - oh, the irony - but trust goes a
       | long way.
        
         | qwertox wrote:
         | There could be legal issues if the crawler starts to crawl into
         | regions which should be left alone. I don't mean things like
         | the dark web, but for example if someone is a subscriber to an
         | online magazine it could start crawling paywalled content if
         | 3rd party cookies enable this.
        
           | tomxor wrote:
            | Google already seems to crawl paywalled content somehow.
            | This doesn't seem to be much of a legal issue since you
            | cannot click through - it's just annoying as a user.
           | 
           | This might even be intentional through robots.txt ... A
           | browser extension that passively crawls visited sites could
           | easily download robots.txt as the single extra but minimal
           | download requirement.
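For the passive-crawler idea, Python's standard library already parses robots.txt; a sketch of the check an extension's backend might run before submitting a visited page to the index (the agent name is made up):

```python
from urllib.robotparser import RobotFileParser

def allowed_to_index(robots_txt: str, url: str,
                     agent: str = "mwmbl-crawler") -> bool:
    """Honour the site's robots.txt before adding a visited page to the index."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

robots = "User-agent: *\nDisallow: /private/\n"
allowed_to_index(robots, "https://example.com/blog/post")  # True
allowed_to_index(robots, "https://example.com/private/x")  # False
```

Since the extension has already downloaded the page the user is viewing, robots.txt really is the only extra request needed per site.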
        
             | dreamcompiler wrote:
             | Pretty sure most paywalled sites explicitly allow the
             | googlebot to enter. If you spoof your UserAgent to be that
             | of the googlebot they check your IP address to make sure
             | you really are Google.
             | 
              | The new fly in the crawler ointment is Cloudflare: if
              | you're not the googlebot and you hit a Cloudflare
              | customer, you need to be running JavaScript so they can
              | verify you're _not_ a bot. It's a continual arms race.
        
             | imglorp wrote:
             | Google has instructions for paywall sites to allow
             | crawling. I suppose it brings them traffic when users click
             | on a search result and arrive at the sign up page.
             | 
             | https://developers.google.com/search/docs/advanced/structur
             | e...
        
         | daoudc wrote:
         | Yes, that is a concern. I'd probably worry about it if and when
         | it started happening, however.
        
           | Schiendelman wrote:
           | If you wait until then, it may become too late to mitigate.
           | Unless you have a plan to remain in complete control of who
           | contributes.
        
         | altdataseller wrote:
          | There are already loads of browser extensions that do a lot of
          | screen scraping of all the sites you visit, without you even
          | realizing it.
        
           | tomxor wrote:
            | I can imagine, but I only use one: uBlock.
        
       | gravypod wrote:
       | If you filed to become a non-profit, could people "donate" their
       | engineering time as a tax write-off? If you find out the legality
       | of something like this and make it easy to do, it could inspire a
       | lot of collaboration on the project, and I can see a bunch of
       | other areas (outside of search) where services could be provided
       | like this. I'm also sure having a non-profit would make it easier
       | to find cheap hosting, which is a large part of the cost.
        
       | champagnois wrote:
       | If I were to work on building a search engine from scratch, I
       | would probably approach this from these directions:
       | 
       | (1) Investigate if running a DNS server will help me get a more
       | robust picture of what websites exist.
       | 
       | (2) Investigate if supplying a custom browser would help me to
       | leverage client PCs to do the crawling / processing for me.
       | 
       | (3) Investigate if there is any point in building a search engine
       | with the data gathered in a non-profit way... Non-profits are not
       | as sustainable as for-profit corporations.
        
       | [deleted]
        
       | yuhong wrote:
        
       | btdmaster wrote:
       | Any reason to prefer GPLv3 over AGPLv3? It might be useful to use
       | the latter so that distributing it over a network requires
       | distributing modifications as well.
        
         | daoudc wrote:
         | Good suggestion, thanks.
        
       | amenod wrote:
       | Off-topic [0]: I would be very interested in an economic model
       | that would work for such a search engine. Donations are fine, but
       | (imho) it will take much more than that to keep the lights on,
       | let alone expand...
       | 
       | The "fairest" solution for both sides I can think of is ads which
       | do not send tracking information and are shown primarily based on
       | search terms and country, or on other parameters that the visitor
       | has set explicitly. Any other ideas on how to finance such an
       | engine so that incentives are aligned?
       | 
       | [0]: EDIT: off-topic because the page clearly states that this
       | project will be financed with donations only.
        
         | hosteur wrote:
         | Ads as a business model ends with surveillance as a business
         | model. We know this now.
        
         | m-i-l wrote:
          | The model my search engine uses is for the public search to
          | essentially be a loss leader for the search as a service -
          | site owners can pay a small fee to access extra features such
          | as being able to configure what is indexed, trigger reindexing
          | on demand, etc. It also heavily downranks pages with adverts,
          | to try to eliminate the incentive for spamdexing.
        
         | daoudc wrote:
         | Wikimedia has an estimated $157m in donations this year. If we
         | could get a small fraction of this amount we should be able to
         | build something pretty good.
        
           | klohto wrote:
            | Get real lol. Why would the general public care about you?
            | Happy to donate, but it won't keep the lights on. You're
            | serving a niche community.
        
             | daoudc wrote:
             | Niche for now, but I think a lot of people can see the
             | value of search without ads.
        
               | abraae wrote:
               | A journey of a thousand miles begins with a single step.
        
               | oefrha wrote:
               | I wish you luck, but I mean, I use Google and I haven't
               | seen a search ad for what, a decade (okay, less than a
               | decade considering iOS)? Most people who don't want to
               | see search ads can pretty easily find an ad blocker.
        
             | wodenokoto wrote:
              | An online encyclopedia was very niche when Wikipedia
              | started.
        
               | interator7 wrote:
               | I mean, it's not even remotely comparable. It's not like
               | we have to look at paper search engines as an alternative
               | to online search engines. The whole point is that the
               | general public has no real reason to switch, never mind
               | donate.
        
         | luckylion wrote:
         | Aren't ads super ineffective, especially when you don't make
         | them very invasive?
         | 
         | I think donations are probably workable. It works in the
         | private tracker scene; the larger ones have "donation meters"
         | and never seem to fall behind.
         | 
         | It could also work on a subscription model which is essentially
         | just formalizing the donations and making it easier to plan
         | cash flow.
        
           | daoudc wrote:
           | Yes, I think a subscription model is the way to go.
        
         | marginalia_nu wrote:
         | > but (imho) it will take much more than that to keep the
         | lights on, let alone expand...
         | 
         | You'd be surprised how cheap a search engine can be to operate.
         | My search.marginalia.nu has a burn rate of less than
         | $100/month.
        
           | daoudc wrote:
           | Impressive! Does that include the startup costs of the server
           | you bought, and maintaining it?
        
             | marginalia_nu wrote:
              | The hardware cost about $5k all in all, including a UPS,
              | and I'd estimate it will chew through a 1TB SSD once every
              | 12-18 months.
        
       | [deleted]
        
       | alexdowad wrote:
       | Some idle words from a passer-by: It would have been good if this
       | project had a pronounceable name.
       | 
       | "To Google" has entered the English lexicon as a verb, but I
       | don't think anybody will ever say they "mwmbled" something.
        
         | wodenokoto wrote:
          | In the early Web 2.0 days, it was very in for things to be
          | spelled unpronounceably. For the life of me I can only
          | remember twttr, but I wanna say Spotify also had an unreadable
          | name in the early days.
        
           | thenthenthen wrote:
           | Flickr
        
             | marginalia_nu wrote:
             | del.ico.us
        
             | KnobbleMcKnees wrote:
             | Tumblr
        
         | daoudc wrote:
         | It's pronounced "mumble". I live in Mumbles, which is spelt
         | Mwmbwls in Welsh.
        
           | yetanother-1 wrote:
            | Nice, but still not very intuitive nor common for the
            | general public.
        
             | danpalmer wrote:
             | Spelling it "mumble" wouldn't be accurately pronounceable
             | for most of the world, and billions couldn't even read the
             | letters.
             | 
             | I get your point but I think we should normalise things
             | that don't come completely naturally for English speakers.
        
               | jesprenj wrote:
               | Mumble is already a group talk protocol.
               | http://mumble.info
        
             | scottmcdot wrote:
             | If it took off we all might start swapping e with w as a
             | nod to our preferred search engine.
        
       | discordance wrote:
       | Is there a test suite of expected search results that could be
       | used with these sort of projects?
        
       | KarlKemp wrote:
       | The central problem with this and similar endeavors: nobody is
       | willing to pay what they are worth in ads. Let's say the average
       | Google user in the US earns them $30/year. Are you willing to pay
       | $30/year for an ad-free Google experience? Great! We now know
       | that you are worth at least $60/year.
       | 
       | That little thought experiment is true for many online services,
       | from social networking to (marginally) publishing. But nowhere is
       | it more true than for search results, which differ in two
       | fundamental ways: being text-only, they don't bother me to
       | anywhere near the degree of other ads. And, second, they are an
       | order of magnitude more valuable than drive-by display ads,
       | because people have indicated a need and a willingness to visit a
       | website that isn't among their bookmarks. These two, combined,
       | make this the worst possible case for replacing an ad-based
       | business with a donation model.
       | 
       | The idea mentioned in this readme that "Google intentionally
       | degrades search results to make you also view the second page" is
       | also wrong, bordering on self-delusion. The typical answer to
       | conspiracy theories works here: there are tens of thousands of
       | people at Google. Such self-sabotage would be obvious to many
       | people on the inside, far too many to keep something like this
       | secret.
        
         | daoudc wrote:
         | TBF I don't think Google intentionally degrades results, but
         | they have less incentive to improve the results.
        
           | dazc wrote:
           | The same way people don't intentionally break the law, they
           | just overlook certain aspects of it when it suits them.
        
         | timeon wrote:
         | > The central problem with this and similar endeavors: nobody
         | is willing to pay what they are worth in ads. Let's say the
         | average Google user in the US earns them $30/year. Are you
         | willing to pay $30/year for an ad-free Google experience?
         | Great! We now know that you are worth at least $60/year.
         | 
         | Is this relevant for non-profit project? Do you pay $30/year
         | for Wikipedia?
        
           | KarlKemp wrote:
            | Do you think "non-profit" means they don't have to pay for
            | servers and employees?
            | 
            | And yes.
        
         | dazc wrote:
          | DuckDuckGo is profitable despite not blanketing the first page
          | with ads (just like Google once upon a time); you can have no
          | ads at all if you like, too. Do they make money in other ways?
          | Sure, but not in a way that degrades the user experience.
          | 
          | Are DDG results inferior? For 95% of users, no.
        
           | nyuszika7h wrote:
           | > Are DDG results inferior, for 95% of users no.
           | 
           | Do you have a source for this figure? Maybe it's mostly true
           | for the "average non-tech-savvy user in an English-speaking
           | country", but I've found DuckDuckGo and everything other than
           | Google inferior in many cases, especially when looking for
           | Hungarian content.
        
             | dazc wrote:
             | Sorry, for the "average non-tech-savvy user in an English-
             | speaking country" is more or less what I meant. Although
             | this would have been true for Google also in the early
             | days.
        
         | nyuszika7h wrote:
         | I would consider Google randomly excluding the most relevant
         | words from my search query intentionally degrading results.
          | It's incredibly frustrating. This shouldn't be the default
          | behavior - at most an optional link the user can click to try
          | again with some of the terms excluded.
         | 
         | Yes, I know verbatim mode exists, but I always forget to enable
         | it, and the setting eventually gets lost when my cookies are
         | cleared or something.
         | 
         | Unfortunately I can't switch to another search engine because
         | in my experience every other search engine has far inferior
         | results, despite not having the annoying behaviors Google does.
         | DuckDuckGo is only useful for !bangs for me.
        
       | ZeroGravitas wrote:
       | I have this feeling that most of the time I "search" for
       | something I already know what I'm looking for, but Google via
       | Firefox's omnibox is just the fastest way to get there, even
       | though it's a bit indirect. Are they getting paid for that, or am
       | I costing them money in the short term while they get to build up
       | a profile on me to provide more effective ads later?
       | 
       | I wonder if it's possible to take advantage of that type of search
       | by putting a facade in front of the "search engine" and based on
       | the search term and the private local user history, then go
       | direct to a known site, or if it seems a search is needed, go to
       | a specific search engine. This may open up opportunities for say
       | program language specific search engines, or error messages from
       | a program specific search, or shopping for X sites.
        
         | medstrom wrote:
         | I bookmark every site I might possibly want to revisit - make a
         | habit of Ctrl+D. They're totally unsorted, but the key is to
         | wipe the regular history on exit, leaving only the bookmarks as
         | source material for completion. That way I can type something
         | in the url bar and get completion to interesting sites. The url
         | bar (or omnibox) matches on page title as well as the actual
         | address, so it's easy, and always faster than a search engine.
        
         | bluecatswim wrote:
          | Most wikis and resource/documentation sites have a local
          | search bar on their homepage, and Firefox has a feature that
          | lets you add a search keyword for a specific site. So if you
          | add, say, pydocs as a keyword for docs.python.org, you can
          | type "@pydocs <query>" and it looks up the query on that
          | site.
        
         | yellowsir wrote:
          | If you set DuckDuckGo as your default search provider, you
          | can use bangs in the omnibox. You can also toggle between
          | local-area and global search. https://duckduckgo.com/bang
          | e.g. !yt !osm !gi
        
       | WheelsAtLarge wrote:
        | Make it open source and syndicate it. The goal is to get people
        | to contribute both resources and code. Think of Shopify as the
        | model, where many people contribute to create a huge shopping
        | place. Each person cares only about their own shop, but
        | together they create a useful shopping area.
       | 
        | Also set up a foundation to guide its development and to be
        | able to hire a management team.
       | 
       | The real challenge is not the code development but setting up an
       | organization that will outlast all the challenges that will
       | appear. Wikipedia is the model to follow.
        
       | juliushuijnk wrote:
        | Here's my open data attempt from a couple of years ago:
        | 
        | http://www.charitius.org
        | 
        | The goal was (and is) to include all charities in the world,
        | based on open data and open software.
        | 
        | It's been on pause for a while, but it still works and is open
        | to new sources of data to incorporate.
        
       | ChemSpider wrote:
       | Really, I don't care if it is for-profit or not. Just a search
       | engine with transparent ranking would be great.
       | 
        | Ideally with explainable AI (XAI) that can tell me WHY result
        | A is ranked higher than result B. I would even pay a monthly
        | subscription to use it.
        
       | igammarays wrote:
       | This is a business model I've been thinking about: what if users
       | earned credits for running a crawler on their machine? In other
       | words, as much as I hate crypto scams, a "tokenized" search
        | engine where the "mining" power was put to good use, i.e.
        | crawling and indexing.
        
         | thebeastie wrote:
         | How would you judge that they had actually done the work? The
         | output needs to be verifiable.
        
           | igammarays wrote:
           | There would have to be some aspect of centralized moderation,
           | I suppose. This is beyond my knowledge: Is there a way to
           | accept output only from signed binaries, so that we assume if
           | X cycles of work were performed by a signed binary, then it
           | is legitimate output?
        
             | born-jre wrote:
              | No, not really. There may be some theoretical way using
              | zk-SNARKs or encrypted enclaves (Intel SGX), but it's not
              | practical. It probably wouldn't work anyway, because the
              | oracle/enclave also needs input (raw crawl data / network
              | HTTP bytes), which itself has to be trusted. One option
              | would be for the project to build an unbreakable
              | mining/crawling chip/box/OS/anticheat-layer-with-VM and
              | supply it to everyone, with varying levels of
              | breakability and complexity to build.
              | 
              | One way is to send each crawl_task to N random nodes and
              | accept the result that most of them agree on?
              | 
              | Another way could be to build a messy network to solve a
              | messy problem: a reputation-based graph network in which
              | you accept indexes only from nodes you trust, so people
              | start unfollowing misbehaving nodes. There is no
              | universal root view of the network; instead it is dynamic
              | and different from the perspective of each node. Or it
              | could have one root view if we store the reputation data
              | on a blockchain, with some type of quadratic voting to
              | modify the chain?
              | 
              | Yes, Bitcoin showed us a way to build a mathematically
              | secure system without any trusted party, but it could
              | only do that because the problem it was solving is
              | mathematically provable. For a problem like collecting
              | and indexing crawl data, you have to trust somebody.
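[Editor's note: the "send each crawl_task to N random nodes and accept what most of them agree on" idea above can be sketched as a simple quorum check. This is a minimal illustration, not a real protocol; the byte-identical comparison is a deliberate simplification, since real pages vary between fetches and a production system would compare normalized or fuzzy-hashed content.]

```python
import hashlib
from collections import Counter

def majority_snapshot(snapshots, quorum):
    """Return the snapshot that at least `quorum` of the N nodes agree on,
    or None if no version reaches the quorum.

    Agreement is judged by SHA-256 digest of the raw bytes, which is a
    simplifying assumption (real fetches of the same page often differ).
    """
    digests = Counter(hashlib.sha256(s).hexdigest() for s in snapshots)
    digest, count = digests.most_common(1)[0]
    if count < quorum:
        return None
    # Return one representative copy of the agreed-on content.
    return next(s for s in snapshots if hashlib.sha256(s).hexdigest() == digest)
```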
        
           | [deleted]
        
           | GistNoesis wrote:
            | You build a system based on "trust but verify". If the
            | output is the result of a known deterministic program on a
            | known input, anyone can verify it. So workers just have to
            | sign their work. If someone is later found to have lied and
            | provided a false result, they lose reputation/staked coins.
            | 
            | The second, associated problem is how to prevent others
            | from appropriating the work by simply re-signing it.
            | 
            | One way would be to allow the worker to introduce a few
            | voluntary errors but hold a secret joker that allows them
            | to pass the challenge of a failed verification.
            | 
            | An alternative is based on data malleability. The worker
            | picks a secret one-way function and computes F(data +
            | secretFunction(data, epsilon)) ~ F(data), then publishes
            | the values of secretFunction(data, epsilon) but not the
            | secretFunction itself. Only someone with knowledge of the
            | secretFunction can make a claim on the work done. If there
            | is a challenge, only the real worker will be able to
            | publish the secret of the secretFunction (or use a zero-
            | knowledge proof to convince you they know it).
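[Editor's note: a much simpler cousin of this idea is a hash-based commit-reveal scheme, where the worker binds a secret to their result and only the holder of the secret can answer a challenge. This is a minimal sketch, not the malleability construction the comment describes; the function names are hypothetical.]

```python
import hashlib
import os

def do_work(data):
    """Stand-in for the deterministic indexing task: anyone can re-run it."""
    return hashlib.sha256(b"index:" + data).digest()

def commit(result, secret):
    """Published alongside the result; only the holder of `secret` can open it."""
    return hashlib.sha256(secret + result).digest()

def open_commitment(result, secret, commitment):
    """Challenge phase: the real worker reveals `secret` to prove ownership."""
    return commit(result, secret) == commitment
```

A copycat who re-signs the published result cannot open the commitment, while the original worker can; the malleability scheme above goes further by letting the claim survive recomputation of slightly perturbed inputs.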
        
             | charcircuit wrote:
             | >If later someone find that they have lied and provided a
             | false result, they lose reputation/stacked coins.
             | 
              | A web page isn't an immutable piece of text. It can
              | change on every visit, and it sometimes returns errors.
        
               | GistNoesis wrote:
                | That's why you don't index the webpage itself but a
                | snapshot of it. For example, you index the Common Crawl
                | archives, or some content-addressable storage like IPFS
                | or a torrent file.
        
               | charcircuit wrote:
               | What's the point of crawling the common crawl archive?
               | It's pointless. You can simply download it.
        
               | GistNoesis wrote:
               | I think crawling and indexing should be treated
               | differently. Indexing is about extracting value from the
               | data, while crawling is about gathering data.
               | 
               | Once a reference snapshot has been crawled, the indexing
               | task is more easily verifiable.
               | 
                | The crawling task is harder to verify, because an
                | external website could lie to the crawler. So the
                | sensible thing to do is have multiple people crawl the
                | same site and compare their results. Every crawler
                | publishes its snapshots (which may or may not contain
                | errors), and then it's the indexer's job to combine the
                | snapshots from various crawlers, filter the errors out,
                | and do the de-duplication.
                | 
                | The crawling task is less necessary now than it was a
                | few years ago, because there is already plenty of
                | available data. Also, most of the valuable data is
                | locked in walled gardens, and companies like Cloudflare
                | make crawling difficult for the rest of the fat tail.
                | So it's better to only have data submitted to you, and
                | to outsource the crawling.
        
           | b3kart wrote:
           | Well they are in a sense: you can just do the task yourself.
           | It's expensive of course, so you can use methods applied to
           | human labelling for ML: _periodically_ injecting tasks with
           | known results and checking how trustworthy the party is,
           | vending the task to multiple parties and aggregating results,
           | blocking parties that make many "mistakes", etc.
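[Editor's note: the gold-task injection approach from ML labelling pipelines can be sketched as below. The gold-task table, injection rate, and scoring rule are illustrative assumptions, not a real protocol.]

```python
import random

# Hypothetical "gold" tasks: URLs whose correct output we precomputed ourselves.
GOLD_TASKS = {"https://example.org/a": "digest-a",
              "https://example.org/b": "digest-b"}

def make_batch(real_tasks, gold_count=1, rng=random):
    """Mix a few known-answer tasks into a batch before handing it to an
    untrusted worker; the worker cannot tell which tasks are gold."""
    batch = list(real_tasks) + rng.sample(sorted(GOLD_TASKS), k=gold_count)
    rng.shuffle(batch)
    return batch

def score_worker(results):
    """Fraction of injected gold tasks the worker answered correctly.
    Workers scoring poorly here would be blocked or down-weighted."""
    checked = [url for url in results if url in GOLD_TASKS]
    if not checked:
        return 1.0  # no gold tasks seen; nothing to judge on yet
    return sum(results[u] == GOLD_TASKS[u] for u in checked) / len(checked)
```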
        
           | hericium wrote:
            | It depends on who provides the URIs.
            | 
            | If the miner does, they could deliver any garbage, so
            | content deliveries would have to be judged in some way and
            | rewarded accordingly.
            | 
            | If the pool provides the addresses to crawl, the miner
            | could be given crafted/dedicated URIs from time to time,
            | and failure to deliver a Proof of Crawl for them could
            | result in a penalty chosen so as to render _"cheating"_
            | unprofitable. But then fresh URIs have to come from
            | somewhere.
        
           | zomglings wrote:
           | One idea I have been kicking around is the idea of federating
           | the indices, not the crawling.
           | 
           | If every contributor maintained their own index, then you
           | could reward contributors based on how many hits their index
           | generated.
           | 
           | This would open up the possibility of people maintaining
           | indices for specialized topics that they were experts in, and
           | give the federated search engine a shot at taking on Google.
        
             | bspammer wrote:
             | You don't want to reward people for quantity, but quality.
             | The cost of creating a new webpage is effectively zero, so
             | if you attach an incentive for creating them you are
             | doomed.
        
               | zomglings wrote:
               | You and I are talking about completely different types of
               | people.
               | 
               | For most people, the cost of creating and maintaining a
               | website is high. This is why products like Wix and
               | Squarespace exist (and are not cheap).
               | 
                | I am thinking of a simple dashboard where _anyone_
                | could go and curate a list of content they find useful,
                | which they could share with the world.
               | 
               | The interface should be so simple that my parents could
               | use it - and they aren't going to be putting up websites
               | anytime soon.
        
             | DarylZero wrote:
             | This is what Yahoo was doing before Google took over.
             | 
             | (But of course it wasn't a volunteer public benefit effort
             | like you describe.)
        
         | thebeastie wrote:
          | Actually, I have an idea for you: I think you can use
          | cryptography to prove that an SSL session really happened, so
          | you could prove indexing of HTTPS sites.
        
           | thebeastie wrote:
            | I think the way this works is having the code that executes
            | an SSL session encoded in a zk-SNARK. One of the zk-SNARK-
            | based blockchains is doing it.
        
           | detaro wrote:
           | You can prove that a TLS session happened, but nothing about
           | its contents, so you can't really prove indexing.
        
         | g105b wrote:
         | I'm very intrigued by this concept.
        
         | Piezoid wrote:
         | YaCy is decentralized, but without the credit system. Some
         | tokens, like QBUX, have tried to develop decentralized hosting
         | infrastructure.
         | 
         | I also have been wondering how this would play out with some
         | kind of decentralized indexes. The nodes could automatically
         | cluster with other nodes of users sharing the same interests,
         | using some notion of distances between query distributions. The
         | caching and crawling tasks could then be distributed between
         | neighbors.
        
           | igammarays wrote:
           | YaCy is too slow for mainstream use. I believe the indices
           | still need to be centralized, only index-building and
           | crawling can be distributed.
        
             | marginalia_nu wrote:
             | A big part of the problem I see with decentralized search
             | is that you basically need to traverse the index in
             | orthogonal axes to assemble search results. First you need
             | to search word-wise in order to get result candidates, then
             | sort them rank-wise to get relevant results (this also
             | hinges upon an agreed-upon ranking of domains). That's a
             | damn hard nut to crack for a distributed system.
             | 
              | Crawling is also not as resource-consuming as you might
              | think. Sure, you _can_ distribute it, but there isn't a
              | huge benefit to this.
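[Editor's note: the two orthogonal traversals described above can be illustrated with a toy in-memory index. The corpus and rank scores are made up; the point is that candidate retrieval walks the index term-wise while ranking walks it rank-wise, and in a distributed engine those two axes would live on different nodes.]

```python
from collections import defaultdict

# Toy corpus and a toy agreed-upon rank per document (e.g. a domain score).
docs = {1: "python search engine", 2: "rust search engine", 3: "python tutorial"}
rank = {1: 0.9, 2: 0.7, 3: 0.4}

# Axis 1, word-wise: an inverted index mapping each term to its posting list.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Intersect posting lists term-wise to get candidates, then sort the
    candidates rank-wise. A distributed engine must coordinate both
    traversals across nodes, which is the hard part."""
    terms = query.split()
    if not terms:
        return []
    candidates = set.intersection(*(index[t] for t in terms))
    return sorted(candidates, key=lambda d: rank[d], reverse=True)
```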
        
       ___________________________________________________________________
       (page generated 2021-12-26 23:00 UTC)