[HN Gopher] Show HN: I'm building a non-profit search engine
___________________________________________________________________
Show HN: I'm building a non-profit search engine
Author : daoudc
Score : 376 points
Date : 2021-12-26 09:11 UTC (13 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| wolfgarbe wrote:
| A laudable effort. Two questions:
|
| 1. What is the rationale behind choosing Python as an
| implementation language? Performance and efficiency are paramount
| in keeping operational costs low and ensuring a good user
| experience, especially if the search engine will be used by many
| users. I guess Python is not the best choice for this, compared
| to C, Rust or Java.
|
| 2. What is the rationale behind implementing a search engine from
| scratch versus using existing Open Source search engine libraries
| like Apache Lucene, Apache Solr and Apache Nutch (crawler)?
| Faaak wrote:
| Premature optimization is the root of all evil. Best to
| concentrate on the algorithm first, and then, maybe, improve it
| with a faster language.
|
| Apart from that, the misconception that "python is slow" should
| die :-)
| Debug_Overload wrote:
| > Premature optimization is the root of all evil.
|
| Is keeping performance in mind and choosing the tech stack
| accordingly really a premature optimization?
|
| This might be the most abused phrase in CS history. Perhaps
| we should add "Premature optimization fallacy" to the list of
| cognitive errors programmers use as an excuse to not
| seriously think about performance.
| stevenally wrote:
| It's a trade off between speed of development and
| performance. Speed of development seems like a good
| optimization for an experimental project?
| authed wrote:
| "Apart from that, the misconception that "python is slow"
| should die :-) "
|
| Yeah it's not python that is slow, it's the interpreter.
| nulbyte wrote:
| Which interpreter? There are multiple. I found pypy to be
| quite reasonable; often faster than the standard C python
| interpreter.
| wolfgarbe wrote:
| That's no contradiction; you can concentrate on the
| algorithm in a faster language too :-)
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
| daoudc wrote:
| Agreed, this was my thinking - and since I'm better at
| Python, it's faster for me to get stuff done. I would like to
| rewrite it in Rust though, all help from Rustaceans gladly
| accepted!
| marginalia_nu wrote:
| In general, speed isn't the problem with search (at least the
| retrieval aspect), but memory efficiency is. Things like low
| per-object overhead and the ability to memory-map large data
| ranges are extremely valuable in a language if you want to
| implement a search index.
|
| But I agree, get it working first, then re-implement it in
| another language if it turns out to be necessary.
| freediver wrote:
| Congrats! Very nice to see results being lightning fast; I am
| getting 100-120ms responses with network overhead included, and
| that is impressive. The payload size of only 10-20kb helps
| immensely, good job!
|
| I've built something similar called Teclis [1] and in my
| experience a new search engine should focus on a niche and try to
| be really, really good at it (I focused on non-commercial content
| for example).
|
| The reason is to be able to narrow down the scope of content to
| crawl/index/rank and hopefully, with enough specialization, to be
| able to offer better results than Google for that niche. This
| could open doors to additional monetization paths, such as API
| access. Newscatcher [2] is an example of where this approach
| worked (they specialized in "news").
|
| [1] http://teclis.com
|
| [2] https://newscatcherapi.com/
| [deleted]
| [deleted]
| gkasev wrote:
| Congrats on the MVP path you took to launch your product.
| Generally, I think that there is a place for other variations of
| web search, be it in the way you crawl or perhaps how you
| monetize. I genuinely believe that it is really hard to build a
| general purpose search engine like DDG, Google and the like, but
| you can build a fairly good niche search engine. I'm particularly
| fond of the idea of community powered curation in search. Just
| today I launched my own take on a community driven search engine -
| https://github.com/gkasev/chainguide. If you'd like to bounce
| ideas back and forth with somebody, I'll be very interested to
| talk to you.
| ChuckMcM wrote:
| Okay, the cynical quip is "All search engines other than Google's
| are 'non-profit'." :-) But the reasons for that won't fit in the
| margin here.
|
| Building search engines is cool and fun! They offer what seems
| like an endless supply of hard problems that have to be solved
| before they are even close to useful!
|
| As a result people who start on this journey often end up crushed
| by the lack of successes between the start and the point where
| there is something useful. So if I may, allow me to suggest some
| alternatives which have all the fun of building a search engine
| and yet can get you to a useful place sooner.
|
| Consider a 'spam' search engine, which is to say a crawler that
| you train to find spammy, useless web sites. Trust me when I say
| the current web is a "target rich environment" here. The purpose
| would not be so much to provide a complete search engine as to
| provide something like what the Realtime Blackhole List did for
| email spam: come up with a list of URLs that could be easily
| checked with a modified DNS-type server (using the DNS protocol,
| but expressly for the purpose of answering the query 'Is this URI
| hosting spam?' in a rapid fashion).
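|
| The DNSBL analogy, roughly, in Python (the zone name is
| hypothetical; this mirrors how existing URI blocklists work - a
| host is listed iff the lookup resolves):
|
|     import socket
|
|     def is_spam(domain, zone="spam.example.org"):
|         """DNSBL-style check: query <domain>.<zone>; an answer
|         means the domain is on the blocklist."""
|         try:
|             socket.gethostbyname(f"{domain}.{zone}")
|             return True   # listed
|         except socket.gaierror:
|             return False  # NXDOMAIN -> not listed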
|
| There are two "go to market" strategies for such a site. One is a
| web browser plugin that would either pop up an interstitial page
| that said, "Don't go here, it is just spam" when someone clicked
| on a link, or a monkey-script kind of thing which would add an
| indication to a displayed page that a link was spammy (like set
| the anchor display tag to blinking red or something). The second
| is to sell access to this service to web proxies, web filters,
| and Bing, which could in the course of their operation simply
| ignore sites that appeared on your list as if they didn't exist.
|
| You will know you are successful when you are approached by shady
| people trying to buy you out.
|
| Another might be a "fact finding" search engine. This would be
| something like Wolfram Alpha but for "facts." There are lots of
| good AI problems here: one which develops a knowledge tree based
| on crawled and parsed data, and one which answers factual queries
| like 'capital of alaska' or 'recipe for baked alaska'. The nice
| thing about facts is that they are well protected against claims
| of copyright infringement, so people really can't come after you
| for reproducing the fact that the speed of light is roughly
| 300,000 km/s, even if they can prove you crawled their web site
| to get that fact.
| amelius wrote:
| What literature did you use to obtain suitable algorithms for
| search/NLP?
| prohobo wrote:
| Thank you for this, but I feel like you _should_ make a profit
| and this is currently a missed opportunity to use web3 principles
| to do that.
|
| Free and open software is a great ideal, but the reality is that
| people need money to live - and ads are the way to make that
| money on web2 platforms, which is why Google is in such a sad
| state. Why not do something similar to Brave? You can add
| tokenomics to the search engine and make money while keeping it
| 100% usable and open source.
| astoor wrote:
| How would web3 and "tokenomics" solve search? If the underlying
| problem is that spamdexers are given a profit incentive to game
| search engine results with low effort and low quality content,
| does it make a difference whether the profit incentive is via
| pages generating advertising revenue or pages generating
| cryptocurrency revenue?
| prohobo wrote:
| I don't really want to address your questions; this has been
| talked to death. How do you think tokens could help? Maybe
| they'd create incentives for users to use the engine. Maybe
| they'd open a market for said tokens so the dev can extract
| currency out of it. Maybe tokens can be a meta-system on top
| of the search engine, so that the search functionality can be
| left to solving the search problem without interference.
|
| Do you think there's an alternative to tokens in order to
| fund the project without degrading the search algorithm? If
| so, I'm all ears.
|
| In fact, I think we all are. Please enlighten us. But you
| haven't proposed a solution while crypto devs have been
| working on one since 2008.
| everydaybro wrote:
| Exactly. Free and open source is amazing, but the developer
| should also earn money to live. Maybe add donations, or maybe
| add some web3 features.
| [deleted]
| ortuman84 wrote:
| > Marginalia Search is fantastic, but it is more of a personal
| project than an open source community.
|
| And where's the community behind mwmbl project?
| daoudc wrote:
| Feel free to email me if you want to be involved!
| Closi wrote:
| Hey, great project - the more competition in this space the
| better. To be honest, at the moment the algorithm doesn't return
| any sensible results for anything (at least that I can find), but
| I hope that you can find a way past this as it's a great space
| for a project.
|
| I've included some search terms below that I've tried - I've not
| cherrypicked these and believe they are indicative of current
| performance. Some of these might be down to the size of the index
| - however I suspect it's actually how the search is being
| parsed/ranked (in particular I think the top two examples show
| that).
|
| > Search "best car brands"
|
| Expected: Car Reviews
|
| Returns a page showing the best mobile phone brands.
|
| then...
|
| > Then searching "Best Mobile Phone"
|
| Expected: The article from the search above.
|
| Returns a gizmodo page showing the best apps to buy... "App
| Deals: Discounted iOS iPhone, iPad, Android, Windows Phone Apps"
|
| > Searching "What is a test?"
|
| Expected result: Some page describing what a test is, maybe
| wikipedia?
|
| Returns "Test could confirm if Brad Pitt does suffer from face
| blindness"
|
| > Searching "Duck Duck Go"
|
| Expected result: DDG.com
|
| Returns "There be dragons? Why net neutrality groups won't go to
| Congress"
|
| > Searching "Google"
|
| Expected result: Google.com
|
| Returns: An article from the independent, "Google has just
| created the world's bluest jeans"
| rmbyrro wrote:
| I guess that's the real problem. People like to wonder what
| would be the "ideal world" in a search engine. It may be
| wishful thinking, I don't know.
|
| It seems really hard to produce quality search results. It takes
| a lot of investment, which makes it an expensive product. But no
| one wants to pay, so selling ads is the only way forward.
|
| Maybe there's a way to convince people to pay what it takes? I
| dunno...
| bigyikes wrote:
| I would gladly pay $5 a month for a Google-quality search
| service that doesn't track me. I've been using Duck Duck Go
| for most of this year, but frequently find myself falling
| back to !g because Google's results really are much better.
|
| I wonder how much money Google search makes from the average
| user. Is it more than $5/mo?
| danuker wrote:
| > because Google's results really are much better.
|
| Or maybe they're average, but you only see the ones where
| DDG fails.
|
| Next time also try Yandex and Baidu.
| berkut wrote:
| Likewise, I'd pay that as well (I'd like no ads as well,
| but tracking is the main issue for me).
|
| Similarly, I do have DDG as my main search on all machines
| and devices just out of principle, but its region-aware
| searching (I'm in NZ, and often only want NZ results) is
| very close to useless in my experience (with NZ as the
| region ticked, it will still return results from .ca and
| .co.uk domains, which I would have hoped would be almost
| trivial to remove), and Google seems much better in this
| area (but not perfect).
|
| Similarly, there's often technical/programming things I'll
| search for that DDG doesn't have indexed at all, and Google
| does.
|
| Google also seems a lot better at ignoring spelling
| differences (color/colour, favorite/favourite) than DDG,
| which is often (but not always!) useful.
| _benj wrote:
| I've been enjoying Neeva a lot. Not affiliated at all, just
| a happy user :-)
|
| I think they are $4.95/mo or something? I haven't paid a
| cent yet since there are a few discounts that they do to
| prompt you to learn how to use it (I really liked that, and
| it def made me more likely to stick with it!)
| daoudc wrote:
| Thanks for the feedback! I'll take a look at your examples and
| see if I can improve the rankings.
| devoutsalsa wrote:
| The first thing I do to test a search engine is to search for
| my own username on various public sites to see if it can find
| me. It didn't find me. But keep it up and I'm sure I'll be in
| there eventually (or maybe I overestimate how interesting I
| am, hehe).
| rmbyrro wrote:
| I get this is your usual testing case for search engines,
| but if you'd read their README you'd have seen it's
| inappropriate for the project at the current stage.
| GraemeMeyer wrote:
| Fun idea. It seems to be getting stuck on the first word you
| enter.
|
| e.g. you get the same results for "London" as you do for
| "London cats", "London cat rescue" and "London test".
| gjm11 wrote:
| I was curious and tried a bunch of other searches, with
| similarly disappointing results. My searches were a bit more
| esoteric than Closi's.
|
| "langlands program" (pure mathematics thing): yup, top result
| is indeed related to the Langlands program, though it isn't
| obviously what anyone would want as their first result for that
| search. Not bad.
|
| "asmodeus" (evil spirit in one of the deuterocanonical books of
| the Bible, features extensively in later demonology, name used
| for an evil god in Dungeons & Dragons, etc.): completely blank
| page, no results, no "sorry, we have no results" message,
| nothing. Not good.
|
| "clerihew" (a kind of comic biographical short poem popular in
| the late 19th / early 20th century): completely blank page. Not
| good.
|
| "marlon brando" (Hollywood actor): first few results are at
| least related to the actor -- good! -- but I'd have expected to
| see something like his Wikipedia or IMDB page near the top,
| rather than the tangentially related things I actually god.
|
| "b minor mass" (one of J S Bach's major compositions): nothing
| to do with Bach anywhere in the results; putting quotation
| marks around the search string doesn't help.
|
| "top quark" (fundamental particle): results -- of which there
| were only 7 -- do seem to be about particle physics, and in
| some cases about the top quark, but as with Marlon Brando
| they're not exactly the results one would expect.
|
| "ferrucio busoni" (composer and pianist): blank page.
|
| "dry brine goose" (a thing one might be interested in doing at
| this time of year): five results, none relevant; top two were
| about Untitled Goose Game.
|
| "alphazero" (game-playing AI made by Google): blank page.
| Putting a space in results in lots of results related to the
| word "alpha", none of which has anything to do with AlphaZero.
|
| OK, let's try some more mainstream things.
|
| "harry potter": blank page. Wat. Tried again; did give some
| results this time. They are indeed relevant to Harry Potter,
| though the unexpected first-place hit is Eric Raymond's rave
| review of Eliezer Yudkowsky's "Harry Potter and the Methods of
| Rationality", which I am fairly sure is not what Google gives
| as its first result for "harry potter" :-).
|
| "iphone 12" (confession: I couldn't remember what the current
| generation was, and actually this is last year's): top results
| _are_ all iPhone-related, but first one is about the iPhone 6,
| second is from 2007, this is about the iPhone 6, fourth is from
| 2007, fifth is about the iPhone 4S, etc.
|
| "pfizer vaccine": does give fairly relevant-looking results,
| yay.
| daoudc wrote:
| Thanks for the detailed feedback! I think most of the
| problems here are because we have a _really_ small index
| right now. Increasing the number of documents is our top
| priority. I agree that some kind of feedback when there are
| no results would be a good idea.
| Closi wrote:
| I actually think it's probably the algorithm too - if I
| take one of the search items returned from a search that I
| know is in the index, but then search for it with slightly
| different terminology (or a different tense /
| pluralisation), the same item doesn't come up.
| clay-dreidels wrote:
| What does a search engine algorithm look like, and where can I
| find examples to build from?
| marginalia_nu wrote:
| Depends on which algorithm you are looking for, but these are
| commonly used (a rough sketch of BM25 follows the list):
|
| * Okapi BM25 for determining the relevance of a result to a
| query.
|
| * TF-IDF for determining the relevance of a term to a
| document.
|
| * PageRank for ranking domains.
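|
| For reference, a minimal sketch of BM25 scoring in Python (a
| textbook formula, not any particular engine's code; corpus
| statistics are assumed to be precomputed):
|
|     import math
|
|     def bm25(query_terms, doc_terms, doc_freq, n_docs, avg_len,
|              k1=1.2, b=0.75):
|         """Score one document against a query with Okapi BM25."""
|         score = 0.0
|         for term in query_terms:
|             tf = doc_terms.count(term)  # term frequency in the doc
|             if tf == 0 or term not in doc_freq:
|                 continue
|             idf = math.log((n_docs - doc_freq[term] + 0.5)
|                            / (doc_freq[term] + 0.5) + 1)
|             norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
|             score += idf * tf * (k1 + 1) / (tf + norm)
|         return score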
| aantix wrote:
| How does someone economically store the tens of thousands of
| terabytes of data needed for the indexes of a large-scale search
| engine?
|
| And have the large server instances (lots of ram)?
| marginalia_nu wrote:
| Why would you need that much data? The average website has
| maybe 10kB worth of textual information without compression. To
| get tens of thousands of terabytes of data, you'd need to index
| on the order of 10^12 websites. That seems a bit much.
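|
| As a sanity check on that arithmetic (the 10kB/page figure is
| the assumption above):
|
|     index_size = 10_000 * 10**12    # "tens of thousands of TB"
|     page_size = 10 * 10**3          # ~10kB of text per page
|     print(index_size // page_size)  # 10**12 pages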
| supernovae wrote:
| I tried this back in 2006 - mozdex (only a Wikipedia article
| survives) - it's not cheap. I was a fan of Lucene, which led to
| Nutch and eventually Hadoop, so: lots of servers running HDFS,
| doing map-reduce jobs to compile and update indexes. No one in
| the end cared about open search... DDG seems to do alright
| under the guise of security, but most non-major search engines
| are just meta searches these days, because economies of scale
| are highly disadvantageous to any upcoming search engine - and
| people just don't search like they used to, either.
|
| I was spending $2,500 a month just on indexers, and if query
| traffic had taken off, my costs would have shot through the
| roof, since you want query nodes to be entirely in memory cache
| and that was expensive back then. Today I would use some modern
| in-memory distributed doc DBs instead of query masters with
| heavy block buffer caches. I learned a lot but lost my shirt :)
| [deleted]
| tmnstr85 wrote:
| GuideStar is the veteran in this space. I agree that doing this
| based on web scrapes and robots.txt is probably going to make it
| pretty tough to get quality results. GuideStar always sells their
| product on the premise that they're vetting the financial
| statements of non-profits for the best results. The real money
| might be in figuring out a way to scale reading and classifying
| non-profit financials - then see if you can quality control
| using a set of patterns.
| AlphaWeaver wrote:
| This isn't a search engine for nonprofits, it's a search engine
| that's designed to be run by a nonprofit one day.
| blueatlas wrote:
| Or perhaps a for-profit company running this search engine
| without garnering profit from it by directly changing search
| results, or with a different model for profit that does not
| affect search rankings.
| marcodiego wrote:
| Non-profit search engines are needed. One will probably still be
| vulnerable to SEO, but it is more likely to resist being
| corrupted by the interests of "investors".
| daoudc wrote:
| Update: there's been interest from a few people so I've started a
| Matrix chat here for anyone that wants to help out or provide
| feedback: https://matrix.to/#/#mwmbl:matrix.org
| marban wrote:
| I've recently built one just for business news w/ obligatory
| zero-tracking. (https://yup.is)
| tibbar wrote:
| It's really fast - nice job! Can you elaborate on the ranking
| algorithm you are using? It seems that this will become more
| important as you index more pages.
| daoudc wrote:
| Thanks! A really simple one for now: number of matching terms,
| and then prioritising matches earlier in the result string. But
| this is something I'm looking forward to working on properly
| when I get a bigger index.
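|
| A minimal sketch of ranking along those lines (illustrative
| only, not mwmbl's actual code):
|
|     def rank_results(query_terms, results):
|         """Sort by number of matching terms, then by how early
|         the first match appears in the result string."""
|         def key(result):
|             text = result.lower()
|             matches = sum(1 for t in query_terms if t in text)
|             first = min((text.find(t) for t in query_terms
|                          if t in text), default=len(text))
|             return (-matches, first)
|         return sorted(results, key=key)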
| daoudc wrote:
| I also want to incorporate a community aspect to ranking,
| allowing upvoting and downvoting of results. I've not yet
| figured out how to reconcile this idea with not having any
| tracking though. Perhaps a separate interface for logged-in
| users.
| foxfluff wrote:
| One ambitious project I've thought about over and over
| again over the years is search (and social sites / forums)
| where the votes, tags, and flags make a public dataset and
| users can manipulate their own weights (or even the ranking
| algorithm) to construct a "web of trust" that yields
| favorable results.
|
| This way you can escape spammers, powertripping moderators,
| and the tyranny of the hive mind; it doesn't matter if
| there's a large population of spammers, shills, and idiots
| upvoting crap because you set their weights to zero (or
| negative). In fact, that becomes a feature, because by
| upvoting crap, they generate a crap filter for you. If the
| weights are also public, then you can automatically &
| algorithmically seed your web of trust (simplest algo for
| sake of example: give positive weight to identities who
| upvoted and downvoted the same way you did) but you could
| still override the algo with manually set values if it gave
| too much weight to bad actors.
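|
| A minimal sketch of that seeding step in Python (all names
| hypothetical; votes are +1/-1 keyed by item id):
|
|     def seed_weights(my_votes, others_votes):
|         """Weight identities by how often they voted the same
|         way I did on the items we both voted on."""
|         weights = {}
|         for identity, votes in others_votes.items():
|             shared = set(votes) & set(my_votes)
|             if shared:
|                 agreement = sum(1 if votes[i] == my_votes[i]
|                                 else -1 for i in shared)
|                 weights[identity] = agreement / len(shared)
|         return weights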
|
| Obviously this has privacy implications (all your votes and
| your network becomes public), and can generate a large
| dataset (performance challenge, how do you distribute it /
| give access to it?), so it's far from a trivial project.
| For the privacy angle, I'd start by keeping identities
| pseudonymous (e.g. a public key or random id -- you don't
| know who's behind the identity unless they blurt it out).
| Furthermore, I think it'd be useful to automagically split
| your actions across multiple identities so it's harder to
| link all your activity. I think the system should also
| explicitly allow switching identities, for privacy but also
| because sometimes you just want a different "filter bubble"
| which helps tailor the content you get to what you're
| looking for. Maybe the network that yields best shopping
| results isn't the same network that yields best cooking
| recipes or technical docs.
|
| With this model, everyone is a moderator and everyone can
| defer moderation to identities they trust, but neither the
| hive mind nor individuals have the ultimate power to
| dictate what you see. If you want to read spam or
| conspiracy theories, you just switch to your identity which
| upvotes such content and has positive weights towards other
| identities with similar votes.
|
| I doubt you're going to build this; I doubt people want
| this. I certainly want it. Maybe one day I'll try, but it
| probably won't work well without network effects
| (=reasonably large quantity of users). I just wanted to let
| you know about the idea because your project is inspiring
| and inspiring things inspire me to share ideas.. :)
| ffhhj wrote:
| This sounds like the sorting hat algorithm (tiktok)
| applied to query search engines. If there could be a way
| to visualize your recommendations network and switch to
| others without logging out, this could work really well.
| But a lot of research needs to be done, and the interest
| of big actors is to keep users blind inside their webs.
|
| This topic is interesting to me because I'm building a
| faster search engine for programming queries and trying
| to solve the core issues that got us stuck with crappy
| engines.
| daoudc wrote:
| Actually I was thinking of something vaguely along those
| lines!
| legofr wrote:
| > All other search engines that I've come across are for-profit.
| Please let me know if I've missed one!
|
| https://www.ecosia.org/
|
| https://ask.moe/
|
| https://ekoru.org/
|
| I remember seeing one more non-profit search engine on HN but
| can't seem to find it right now.
| m-i-l wrote:
| Also https://searchmysite.net/ for personal and independent
| websites (essentially a loss-leader for its open source self-
| hostable search as a service).
| mlinksva wrote:
| https://web.archive.org/web/20171130000415/https://about.com...
| was one 5+ years ago
| https://news.ycombinator.com/item?id=11281700
| https://github.com/commonsearch
| daoudc wrote:
| Thanks, but these are not technically non-profit:
|
| "Ecosia is a search engine based in Berlin, Germany. It donates
| 80% of its profits to nonprofit organizations that focus on
| reforestation" [1]
|
| "80% of profits will be distributed among charities and non-
| profit organizations. The remaining 20% will be put aside for a
| rainy day." [2]
|
| "Ekoru.org is a search engine dedicated to saving the planet.
| The company donates 60% of revenue generated from clicks on
| sponsored search results to partner organizations who work on
| climate change issues" [3]
|
| [1] https://en.wikipedia.org/wiki/Ecosia
|
| [2] https://ask.moe/
|
| [3] https://www.forbes.com/sites/meimeifox/2020/01/19/how-the-se...
| toper-centage wrote:
| While Ecosia is not technically a non-profit, no one can sell
| Ecosia shares at a profit.
|
| > Ecosia says that it was built on the premise that profits
| wouldn't be taken out of the company. In 2018 this commitment
| was made legally binding when the company sold a 1% share to
| The Purpose Foundation, entering into a 'steward-ownership'
| relationship.
|
| > The Purpose Foundation's steward-ownership of Ecosia
| legally binds Ecosia in the following ways:
| > - Shares can't be sold at a profit or owned by people
| outside of the company, and
| > - No profits can be taken out of the company.
|
| https://www.ethicalconsumer.org/technology/how-ethical-searc...
| gardenfelder wrote:
| Seems like a reasonable approach is to use a b-corp where
| shareholders cannot sue for financial gains.
| asicsp wrote:
| >I remember seeing one more non-profit search engine on HN but
| can't seem to find it right now.
|
| Probably this one? "A search engine that favors text-heavy
| sites and punishes modern web design"
| https://news.ycombinator.com/item?id=28550764 _(3 months ago,
| 717 comments)_
| Minor49er wrote:
| I just tried ask.moe, but it clearly noted that the search
| results were provided by Google
| kova12 wrote:
| What do you do in order for your crawler to not accidentally veer
| into some naughty-naughty site and yield you a visit from your
| friendly FBI squad? That concern is why I decided to stay away
| from YaCy.
| montebicyclelo wrote:
| How much compute, storage, and network speed would a minimal up-
| to-date web search engine need?
| ValleZ wrote:
| I'd estimate that Google uses ~1000 TB of fast storage, Bing
| 500 TB and Yandex 100 TB, so the most basic useful search
| engine would use at least... 10 TB?
| dqv wrote:
| What is fast storage? Is that, for right now, the fastest
| SSDs available?
| ValleZ wrote:
| HDD is definitely not enough because of low IOPS; Google
| likely keeps the index in RAM. I think NVMe should be good
| enough, idk for sure.
| fauigerzigerk wrote:
| _" The Google Search index contains hundreds of billions of
| web pages and is well over 100,000,000 gigabytes in size."_
|
| https://www.google.com/intl/en_uk/search/howsearchworks/craw...
| ValleZ wrote:
| Actually I doubt that statement is true, rather than just
| something to discourage others. Check out these queries:
|
| https://www.google.com/search?q=1 - 12B results
| https://www.google.com/search?q=an - 9B results
| https://www.google.com/search?q=the - 6B results
|
| If we estimate that about half of all English pages contain the
| article 'the' or 'an', we get about 15B English pages. If half
| of all pages contain '1', then the total number of pages is
| about 24B. If half of all pages are in English, then the total
| is about 30B. So even the maximum is well short of "hundreds of
| billions". Similar numbers are at
| https://www.worldwidewebsize.com/
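|
| Spelling that estimate out (Python; the assumptions are the
| ones above, not measurements):
|
|     counts = {"1": 12e9, "an": 9e9, "the": 6e9}  # reported results
|     english = (counts["an"] * 2 + counts["the"] * 2) / 2  # ~15e9
|     total = counts["1"] * 2                               # ~24e9
|     upper_bound = english * 2                             # ~30e9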
| [deleted]
| iopq wrote:
| If you have to ask, you can't afford it
| marginalia_nu wrote:
| You might think that. My search engine runs off <$5k worth of
| consumer hardware on domestic broadband. It survived the
| Hacker News front page for a week, and saw a sustained load of
| 8000 searches per hour for nearly a day.
|
| It's got a fairly small index, but yeah, it's not
| particularly hardware-hungry.
| daibo wrote:
| What's the size of your index in records?
| marginalia_nu wrote:
| I run three separate indices at about 10-20mn documents
| each. But I'm fairly far off any sort of limit (ram and
| disk-wise I'm at maybe 40%).
|
| I'm confident 100mn is doable with the current code,
| maybe 0.5bn if I did some additional space optimization.
| There is some low-hanging fruit that seems very
| promising: sorted integers are highly compressible, and
| right now I'm not doing that at all.
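|
| For the curious, the standard trick being alluded to, sketched
| in Python: delta-encode the sorted doc ids, then varint-encode
| the gaps so that small values take a single byte.
|
|     def delta_varint_encode(sorted_doc_ids):
|         out = bytearray()
|         prev = 0
|         for doc_id in sorted_doc_ids:
|             gap = doc_id - prev  # gaps are small in a dense list
|             prev = doc_id
|             while gap >= 0x80:   # 7 bits per byte, high bit = more
|                 out.append((gap & 0x7F) | 0x80)
|                 gap >>= 7
|             out.append(gap)
|         return bytes(out)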
| daibo wrote:
| Yes doclist compression is a must. Higher intersection
| throughput and less bandwidth stress. Are you loading
| your doclists from persistent storage? What is your
| current max rps?
| marginalia_nu wrote:
| I'm loading the data off a memory mapped SSD, trivial
| questions will probably be answered entirely from memory,
| although the disk-read performance doesn't seem terrible
| either.
|
| > What is your current max rps?
|
| It depends on the complexity of the request, and repeated
| retrievals are cached, so I'm not even sure there is a
| good answer to this.
| dotancohen wrote:
| Fair point.
|
| How is it to be funded?
| daoudc wrote:
| The plan is to fund it through donations
| smt88 wrote:
| Can you make it a contributory database? I wouldn't mind
| "donating" my browsing history and page downloads to
| build the index and train the algorithm.
|
| You'd have to find a way to verify reputation to make
| sure no bad actors could contribute.
| aspenmayer wrote:
| How does Internet Archive verify dumps submitted by
| Archive Team and other groups? This may already be a
| solved problem.
|
| Not knowing their implementation details, I'm guessing it
| could be doable without reinventing much. An oracle could
| dispatch a P2P archive job to a pool of clients randomly
| assigned tasks, with both the first to archive and the
| first to validate being recognized by the swarm somehow,
| with periodic re-archiving and re-verification, rate
| adjusted by popularity of site and of search keywords.
| daoudc wrote:
| Yes, I'm planning to do something like this.
| hirako2000 wrote:
| Is the size of the common web already way too large to play
| catch-up against Google/Bing at this point?
|
| My dream is a distributed/p2p index: each browser
| contributes to storing part of the overall index and handles
| queries coming from other users, so that how to fund huge
| data centers never becomes a question.
| dotancohen wrote:
| > Is the size of the common web already way too large to
| play catch-up against Google/Bing at this point?
|
| Probably. But I would prefer a search engine that didn't
| search the whole web. I would prefer a search engine that
| searched the sites related to fields that I'm interested
| in.
|
| So I would pay for or donate to a search engine that
| provided me good results in e.g. software development.
| They could add additional fields as demand warrants, so
| long as quality is maintained. I would even like to see a
| faceting feature, so I could search for e.g. Matrix and
| get results on the mathematical concept when need be,
| without having to wade through movie reviews or fiddle
| with magic search keywords.
| hirako2000 wrote:
| Not searching the whole web makes sense, except that it
| isn't clear what belongs to your field of interest and
| what doesn't. Should indexing skip a blog page because the
| author usually writes about algorithms but here goes on
| and on about business while sporadically mentioning
| algorithmic technical aspects? And what if you want to
| search about pottery that Sunday morning - turn to Google?
| I think segmenting search results by topic is a useful
| consumer query feature; I'm not sure segmenting what gets
| indexed would provide a useful service other than
| covering niches, hence not really fulfilling the role of a
| web search engine. I find the idea less ambitious, so
| maybe that's how an open search engine should approach the
| problem; a federation of hosts could cover the whole web
| eventually.
| dotancohen wrote:
| > Should indexing skip a blog page because the author
| usually writes about algorithms but here goes on and on
| about business while sporadically mentioning algorithmic
| technical aspects?
|
| Yes, because that author's pages wouldn't even be fetched
| at the point of development that we are discussing.
| > and what if you want to search about pottery that
| Sunday morning, turn to Google?
|
| Yes, Google still exists. Why not?
| daoudc wrote:
| Check out YaCy. It's great if you're happy with slow
| search!
| charcircuit wrote:
| YaCy's results weren't great, and the ranking was very bad,
| letting sites game it by just spamming a ton of keywords.
| jesprenj wrote:
| I had problems with YaCy. Slow search and slow crawling.
| Regarding search I think it could be improved if instead
| of HTTP requests to other peers a more efficient protocol
| (UDP) could be used. Regarding crawling it's quite
| possible that doing this in Java and running it on a
| Rock64 may not be the best combination (: It started
| OOMing after some days.
| arpa wrote:
| See, that's the problem with engineers today. AltaVista ran
| on 3x300MHz 64-bit processors with 512M of RAM back in 1995.
| Resources cost peanuts these days. It's just that we're so
| used to bloat and digital inflation, we can't even start
| considering unbloated implementations, as we perceive them as
| "not modern". Apparently we are also stuck in the
| centralized/ownership mentality. If you crowdsource search
| indexing and processing, it scales along with the number of
| users. Oh, and another bane of the internet is the need to
| monetize. FFS, if email were invented today, we'd probably
| have to buy NFT poststamps.
| adtac wrote:
| I'm sorry what? Have you seen the rate at which data is
| being created today? I mean, if you want to index the size
| of the 1995 web with your raspberry pi, go ahead, but it
| costs insane amounts of money to index the December 2021
| web and keep the index up-to-date.
|
| edit / full disclosure: I work at google but nothing
| related to search
| noogle wrote:
| Is it really necessary to index EVERYTHING? It's true
| that we have much more data today than 26 years ago, but
| not all of these websites qualify or provide value
| (duplicate results, promotional content, outdated
| content).
|
| The challenge then moves to curation, but the problem is no
| longer infeasible.
| Closi wrote:
| I think the words "minimally viable" are being ignored
| here.
|
| I think OP's point is, assume you only have 10/100
| terabytes of space and limited compute ability - how
| would you approach the problem? I assume 90% of google's
| searches probably come from less than 1% of their total
| index, not to mention that Google is also keeping full
| cached versions of the whole website including images.
| adtac wrote:
| I just eyeballed my browser history from the last 2-3
| days and I'd estimate 15% is current/latest news related,
| some 25% is programming related, 15% is e-commerce stuff,
| the rest random crap. I'd imagine 10-100 TB can easily
| serve all of _my_ search space (even the links I didn't
| look at on page 10) from the past few years, but that's
| the thing -- it's just my search space. How do you serve
| the rest of the world? I wish I knew the answer :)
| foxfluff wrote:
| Well, Google doesn't know the answer either. Results are
| complete trash when you try to find something niche or in a
| local language.
|
| The challenge isn't to index the entire web, it's to
| index the useful parts of it, and I think an index
| covering most of the useful web can be seeded quite
| easily with some community effort.
| [deleted]
| [deleted]
| quantum2021 wrote:
| Two big things that annoy me about google:
|
| 1. They somewhat get around this with their maps feature, but
| their regular search doesn't actually search by area; you always
| get national websites that optimize the best. Searching by area
| would be a nice feature to have starting out, without having to
| type in the specific area you're looking for.
|
| 2. Search results for hotels that actually work! Not only if
| they're set up on OTA's! This could actually get your search
| engine some traction as the search engine to go to when making
| travel plans which would give you a nice niche to start out in.
| slmjkdbtl wrote:
| As a normal human I naturally typed in "fuck" in a new search
| engine and it led me to this article
| https://mathbabe.org/2015/06/22/fuck-trigonometry/ which I quite
| enjoyed!
| born-jre wrote:
| There seem to be a lot of comments about some form of distributed
| trust/reputation based system.
| tomxor wrote:
| > We plan to start work on a distributed crawler, probably
| implemented as a browser extension that can be installed by
| volunteers.
|
| Is there a concern that volunteers could manipulate results
| through their crawler?
|
| You already mentioned distributed search engines have their own
| set of issues. I'm wondering if a simple centralised non-profit
| fund a la Wikipedia could work better to fund crawling without
| these concerns. One anecdote: personally I would not install a
| crawler extension, not because I don't want to help, but because
| my internet connection is pitifully slow. I'd rather donate a
| small sum that would go way further in a datacenter... although I
| realise the broader community might be the other way around.
|
| [edit]
|
| Unless the crawler was clever enough to merely feed off the
| sites I'm already visiting and use minimal upload bandwidth. The
| only concern then would be privacy. Oh, the irony - but trust
| goes a long way.
| qwertox wrote:
| There could be legal issues if the crawler starts to crawl into
| regions which should be left alone. I don't mean things like
| the dark web, but for example if someone is a subscriber to an
| online magazine it could start crawling paywalled content if
| 3rd party cookies enable this.
| tomxor wrote:
| Google already seems to crawl paywalled content somehow, this
| doesn't seem to be much of a legal issue since you cannot
| click through - it's just annoying as a user.
|
| This might even be intentional through robots.txt ... A
| browser extension that passively crawls visited sites could
| easily download robots.txt as the single extra but minimal
| download requirement.
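|
| Respecting robots.txt is cheap; a sketch with the Python
| standard library for illustration (a real extension would be
| JavaScript; the URLs and user-agent here are hypothetical):
|
|     from urllib import robotparser
|
|     rp = robotparser.RobotFileParser()
|     rp.set_url("https://example.com/robots.txt")
|     rp.read()  # the single extra download per site
|     if rp.can_fetch("mwmbl-crawler", "https://example.com/page"):
|         pass  # safe to submit this page's content for indexing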
| dreamcompiler wrote:
| Pretty sure most paywalled sites explicitly allow the
| googlebot to enter. If you spoof your UserAgent to be that
| of the googlebot they check your IP address to make sure
| you really are Google.
|
| The new fly in the crawler ointment is Cloudflare: if
| you're not the googlebot and you hit a Cloudflare customer,
| you need to be running javascript so they can verify you're
| _not_ a bot. It's a continual arms race.
| imglorp wrote:
| Google has instructions for paywall sites to allow
| crawling. I suppose it brings them traffic when users click
| on a search result and arrive at the sign up page.
|
| https://developers.google.com/search/docs/advanced/structure...
| daoudc wrote:
| Yes, that is a concern. I'd probably worry about it if and when
| it started happening, however.
| Schiendelman wrote:
| If you wait until then, it may become too late to mitigate.
| Unless you have a plan to remain in complete control of who
| contributes.
| altdataseller wrote:
| There are already loads of browser extensions that do a lot of
| screen scraping of all the sites you visit, without you even
| realizing it.
| tomxor wrote:
| I can imagine, but I only use one: uBlock.
| gravypod wrote:
| If you filed to become a non-profit, could people "donate" their
| engineering time as a tax write-off? If you find out the legality
| of something like this and make it easy to do, that could inspire
| a lot of collaboration on the project, and I can see a bunch of
| other areas (outside of search) where services could be provided
| like this. I'm also sure having a non-profit would make it
| easier to find cheap hosting, which is a large part of the cost
| there.
| champagnois wrote:
| If I were to work on building a search engine from scratch, I
| would probably approach this from these directions:
|
| (1) Investigate if running a DNS server will help me get a more
| robust picture of what websites exist.
|
| (2) Investigate if supplying a custom browser would help me to
| leverage client PCs to do the crawling / processing for me.
|
| (3) Investigate if there is any point in building a search engine
| with the data gathered in a non-profit way... Non-profits are
| not as sustainable as for-profit corporations.
| [deleted]
| yuhong wrote:
| btdmaster wrote:
| Any reason to prefer GPLv3 over AGPLv3? It might be useful to use
| the latter so that distributing it over a network requires
| distributing modifications as well.
| daoudc wrote:
| Good suggestion, thanks.
| amenod wrote:
| Off-topic [0]: I would be very interested in an economic model
| that would work for such a search engine. Donations are fine, but
| (imho) it will take much more than that to keep the lights on,
| let alone expand...
|
| The "fairest" solution for both sides I can think of is ads which
| no not send tracking information, and are shown primarily based
| on search terms and country, or even other parameters that the
| visitor has set explicitly. Any other ideas on how to finance
| such an engine so that incentives are aligned?
|
| [0]: EDIT: off-topic because the page clearly states that this
| project will be financed with donations only.
| hosteur wrote:
| Ads as a business model ends with surveillance as a business
| model. We know this now.
| m-i-l wrote:
| The model my search engine uses is for the public search to
| essentially be a loss leader for the search as a service - site
| owners can pay a small fee to access extra features such as
| being able to configure what is indexed, trigger reindexing on
| demand, etc. It also heavily downranks pages with adverts, to
| try to eliminate the incentive for spamdexing.
| daoudc wrote:
| Wikimedia has an estimated $157m in donations this year. If we
| could get a small fraction of this amount we should be able to
| build something pretty good.
| klohto wrote:
| Get real lol. Why would the general public care about you?
| Happy to donate, but it won't keep the lights on. You're
| serving a niche community.
| daoudc wrote:
| Niche for now, but I think a lot of people can see the
| value of search without ads.
| abraae wrote:
| A journey of a thousand miles begins with a single step.
| oefrha wrote:
| I wish you luck, but I mean, I use Google and I haven't
| seen a search ad for what, a decade (okay, less than a
| decade considering iOS)? Most people who don't want to
| see search ads can pretty easily find an ad blocker.
| wodenokoto wrote:
| Online encyclopedia was very niche when Wikipedia started.
| interator7 wrote:
| I mean, it's not even remotely comparable. It's not like
| we have to look at paper search engines as an alternative
| to online search engines. The whole point is that the
| general public has no real reason to switch, never mind
| donate.
| luckylion wrote:
| Aren't ads super ineffective, especially when you don't make
| them very invasive?
|
| I think donations are probably workable. It works in the
| private tracker scene; the larger ones have "donation meters"
| and never seem to fall behind.
|
| It could also work on a subscription model which is essentially
| just formalizing the donations and making it easier to plan
| cash flow.
| daoudc wrote:
| Yes, I think a subscription model is the way to go.
| marginalia_nu wrote:
| > but (imho) it will take much more than that to keep the
| lights on, let alone expand...
|
| You'd be surprised how cheap a search engine can be to operate.
| My search.marginalia.nu has a burn rate of less than
| $100/month.
| daoudc wrote:
| Impressive! Does that include the startup costs of the server
| you bought, and maintaining it?
| marginalia_nu wrote:
| The hardware cost about $5k all-in-all including a UPS, and
| I'd estimate it will chew through a 1TB SSD once every
| 12-18 months.
| [deleted]
| alexdowad wrote:
| Some idle words from a passer-by: It would have been good if this
| project had a pronounceable name.
|
| "To Google" has entered the English lexicon as a verb, but I
| don't think anybody will ever say they "mwmbled" something.
| wodenokoto wrote:
| In the early web 2.0 days, it was very in for things to be
| spelled unpronounceably. For the life of me I can only remember
| Twttr, but I wanna say Spotify also had an unreadable name in
| the early days.
| thenthenthen wrote:
| Flickr
| marginalia_nu wrote:
| del.icio.us
| KnobbleMcKnees wrote:
| Tumblr
| daoudc wrote:
| It's pronounced "mumble". I live in Mumbles, which is spelt
| Mwmbwls in Welsh.
| yetanother-1 wrote:
| Nice, but still not very intuitive nor common for the general
| public.
| danpalmer wrote:
| Spelling it "mumble" wouldn't be accurately pronounceable
| for most of the world, and billions couldn't even read the
| letters.
|
| I get your point but I think we should normalise things
| that don't come completely naturally for English speakers.
| jesprenj wrote:
| Mumble is already a group talk protocol.
| http://mumble.info
| scottmcdot wrote:
| If it took off we all might start swapping e with w as a
| nod to our preferred search engine.
| discordance wrote:
| Is there a test suite of expected search results that could be
| used with these sorts of projects?
| KarlKemp wrote:
| The central problem with this and similar endeavors: nobody is
| willing to pay what they are worth in ads. Let's say the average
| Google user in the US earns them $30/year. Are you willing to pay
| $30/year for an ad-free Google experience? Great! We now know
| that you are worth at least $60/year.
|
| That little thought experiment is true for many online services,
| from social networking to (marginally) publishing. But nowhere is
| it more true than for search ads, which differ in two
| fundamental ways: first, being text-only, they don't bother me
| anywhere near as much as other ads. And, second, they are an
| order of magnitude more valuable than drive-by display ads,
| because people have indicated a need and a willingness to visit a
| website that isn't among their bookmarks. These two, combined,
| make this the worst possible case for replacing an ad-based
| business with a donation model.
|
| The idea mentioned in this readme that "Google intentionally
| degrades search results to make you also view the second page" is
| also wrong, bordering on self-delusion. The typical answer to
| conspiracy theories works here: there are tens of thousands of
| people at Google. Such self-sabotage would be obvious to many
| people on the inside, far too many to keep something like this
| secret.
| daoudc wrote:
| TBF I don't think Google intentionally degrades results, but
| they have less incentive to improve the results.
| dazc wrote:
| The same way people don't intentionally break the law, they
| just overlook certain aspects of it when it suits them.
| timeon wrote:
| > The central problem with this and similar endeavors: nobody
| is willing to pay what they are worth in ads. Let's say the
| average Google user in the US earns them $30/year. Are you
| willing to pay $30/year for an ad-free Google experience?
| Great! We now know that you are worth at least $60/year.
|
| Is this relevant for non-profit project? Do you pay $30/year
| for Wikipedia?
| KarlKemp wrote:
| Do you think "non profit" means they don't have to pay for
| servers and employees?
|
| And yes.
| dazc wrote:
| Duck Duck Go is profitable despite not blanketing the first
| page with ads (just like Google once upon a time); you can
| also have no ads at all if you like. Do they make money in
| other ways? Sure, but not in a way that degrades the user
| experience.
|
| Are DDG results inferior? For 95% of users, no.
| nyuszika7h wrote:
| > Are DDG results inferior? For 95% of users, no.
|
| Do you have a source for this figure? Maybe it's mostly true
| for the "average non-tech-savvy user in an English-speaking
| country", but I've found DuckDuckGo and everything other than
| Google inferior in many cases, especially when looking for
| Hungarian content.
| dazc wrote:
| Sorry, for the "average non-tech-savvy user in an English-
| speaking country" is more or less what I meant. Although
| this would have been true for Google also in the early
| days.
| nyuszika7h wrote:
| I would consider Google randomly excluding the most relevant
| words from my search query intentionally degrading results.
| It's incredibly frustrating. This shouldn't be the default
| behavior, maybe an optional link the user can click to try
| again with some of the terms excluded.
|
| Yes, I know verbatim mode exists, but I always forget to enable
| it, and the setting eventually gets lost when my cookies are
| cleared or something.
|
| Unfortunately I can't switch to another search engine because
| in my experience every other search engine has far inferior
| results, despite not having the annoying behaviors Google does.
| DuckDuckGo is only useful for !bangs for me.
| ZeroGravitas wrote:
| I have this feeling that most of the time I "search" for
| something, I already know what I'm looking for, but Google via
| Firefox's omnibox is just the fastest way to get there, even
| though it's a bit indirect. Are they getting paid for that, or
| am I costing them money in the short term while they get to
| build up a profile on me to provide more effective ads later?
|
| I wonder if it's possible to take advantage of that type of
| search by putting a facade in front of the "search engine" that,
| based on the search term and the private local user history,
| goes direct to a known site or, if it seems a search is needed,
| goes to a specific search engine. This may open up opportunities
| for, say, programming-language-specific search engines, error-
| message-specific search, or shopping-for-X sites.
| medstrom wrote:
| I bookmark every site I might possibly want to revisit - make a
| habit of Ctrl+D. They're totally unsorted, but the key is to
| wipe the regular history on exit, leaving only the bookmarks as
| source material for completion. That way I can type something
| in the url bar and get completion to interesting sites. The url
| bar (or omnibox) matches on page title as well as the actual
| address, so it's easy, and always faster than a search engine.
| bluecatswim wrote:
| Most wikis or resource/documentation sites have a local search
| bar on their homepage, and Firefox has a feature that lets you
| add a search keyword for that specific site. So if you add,
| say, pydocs as a keyword for docs.python.org, you can type
| "@pydocs <query>" and it looks up the query on that site.
| yellowsir wrote:
| If you set DuckDuckGo as your default search provider, you can
| use bangs in the omnibox. You can also toggle between local-area
| and global search. https://duckduckgo.com/bang e.g. !yt !osm !gi
| WheelsAtLarge wrote:
| Make it open source and syndicate it. The goal is to get people
| to contribute both resources and code. Think of Shopify as the
| model, where many people contribute to create a huge shopping
| place. People care only about their own shop, but ultimately
| they create a useful shopping area.
|
| Also set up a foundation to guide its development and be able to
| hire a management team.
|
| The real challenge is not the code development but setting up an
| organization that will outlast all the challenges that will
| appear. Wikipedia is the model to follow.
| juliushuijnk wrote:
| Here's my open data attempt from a couple of years ago:
|
| http://www.charitius.org
|
| The goal was/is to include all charities in the world, based on
| open data and open software.
|
| It's been on pause for a while, but it still works, and it's
| open to new sources of data to incorporate.
| ChemSpider wrote:
| Really, I don't care if it is for-profit or not. Just a search
| engine with transparent ranking would be great.
|
| Ideally with explainable AI (XAI) that can tell me WHY is result
| A ranked higher than result B. I would even pay a monthly
| subscription to use it.
| igammarays wrote:
| This is a business model I've been thinking about: what if users
| earned credits for running a crawler on their machine? In other
| words, as much as I hate crypto scams, a "tokenized" search
| engine where the "mining" power was put to good use, i.e crawling
| and indexing.
| thebeastie wrote:
| How would you judge that they had actually done the work? The
| output needs to be verifiable.
| igammarays wrote:
| There would have to be some aspect of centralized moderation,
| I suppose. This is beyond my knowledge: Is there a way to
| accept output only from signed binaries, so that we assume if
| X cycles of work were performed by a signed binary, then it
| is legitimate output?
| born-jre wrote:
| Not really. There may be some theoretical way using ZK-SNARKs
| or encrypted enclaves (Intel SGX), but it's not practical. It
| probably doesn't work anyway, because the oracle/enclave also
| needs input (raw crawl data / network HTTP bytes), which has
| to be trusted. One way could be for the project to make an
| unbreakable mining/crawling chip/box/OS/anticheat layer with a
| VM and supply it to everyone, with different levels of
| breakability and complexity to build.
|
| Another way is to send each crawl_task to N random nodes and
| accept the result most of them agree on.
|
| Yet another way could be to build a messy network to solve a
| messy problem: a reputation-based graph network where you
| accept indexes only from nodes you trust, so people will
| unfollow misbehaving nodes. There is no universal root view of
| the network; instead it's dynamic and different from the
| perspective of each node. Or it could have one root view if we
| store reputation data in a blockchain, with some type of
| quadratic voting to modify the chain?
|
| Yeah, Bitcoin showed us a way to build a mathematically secure
| system without any trusted party, but it could do that because
| the problem it was solving is mathematically provable. For a
| problem like collecting crawl/index data, you have to trust
| somebody.
| [deleted]
| GistNoesis wrote:
| You build a system based on trust but verify. If the output
| is the result of a known deterministic program on a known
| input, anyone can verify it, so workers just have to sign
| their work. If someone is later found to have lied and
| provided a false result, they lose reputation/staked coins.
|
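| A minimal sketch of that verify loop (Python; `index_fn`,
| `fetch_input` and the claim fields are hypothetical):
|
|     import hashlib, random
|
|     def spot_check(claims, index_fn, fetch_input, rate=0.05):
|         """Recompute a random sample of signed claims; return
|         the workers whose published output doesn't match."""
|         cheaters = set()
|         k = max(1, int(len(claims) * rate))
|         for claim in random.sample(claims, k):
|             data = fetch_input(claim["input_id"])  # known input
|             digest = hashlib.sha256(index_fn(data)).hexdigest()
|             if digest != claim["output_hash"]:
|                 cheaters.add(claim["worker"])  # caught lying
|         return cheaters
|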
| The second, associated problem is how one would prevent workers
| from appropriating the work of others by just re-signing it.
|
| One way would be to allow the worker to introduce a few
| voluntary errors, but have a secret joker that allows them to
| pass the challenge of a failed verification.
|
| An alternative way is based on data malleability. The worker
| picks a secret one-way function and computes F(data +
| secretFunction(data, epsilon)) ~ F(data), then publishes the
| values of secretFunction(data, epsilon) but not the
| secretFunction itself. Only someone with knowledge of the
| secretFunction can make a claim on the work done. If there is
| a challenge, only the real worker will be able to publish the
| secret of the secretFunction (or use some zero-knowledge
| proof to convince you they know it).
| charcircuit wrote:
| > If someone is later found to have lied and provided a
| false result, they lose reputation/staked coins.
|
| A web page isn't an immutable piece of text. It can change
| on every visit, and it can sometimes return errors.
| GistNoesis wrote:
| That's why you don't index the webpage but a snapshot of
| it. For example you index the commoncrawl archives, or
| some content addressable storage like ipfs or torrent
| file.
| charcircuit wrote:
| What's the point of crawling the common crawl archive?
| It's pointless. You can simply download it.
| GistNoesis wrote:
| I think crawling and indexing should be treated
| differently. Indexing is about extracting value from the
| data, while crawling is about gathering data.
|
| Once a reference snapshot has been crawled, the indexing
| task is more easily verifiable.
|
| The crawling task is harder to verify, because an external
| website could lie to the crawler. So the sensible thing
| to do is have multiple people crawl the same site and
| compare their results. Every crawler will publish its
| snapshots (which may or may not contain errors), and
| then it's the job of the indexer to combine multiple
| snapshots from various crawlers, filter the errors out,
| and do the de-duplication.
|
| The crawling task is less necessary now than it was a few
| years ago, because there is already plenty of available
| data. Also, most of the valuable data is locked in walled
| gardens, and companies like Cloudflare make crawling
| difficult for the rest of the fat tail. So it's better to
| only have data submitted to you, and outsource the
| crawling.
| b3kart wrote:
| Well, they are in a sense: you can just redo the task yourself.
| It's expensive of course, so you can use methods applied to
| human labelling for ML: _periodically_ injecting tasks with
| known results and checking how trustworthy each party is,
| handing the task to multiple parties and aggregating results,
| blocking parties that make many "mistakes", etc. (a sketch
| follows).
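|
| A sketch of the gold-task idea (Python; all names are
| hypothetical): occasionally hand out a task whose answer is
| already known and track each party's accuracy.
|
|     import random
|
|     def pick_task(real_tasks, gold_answers, gold_rate=0.05):
|         """Occasionally inject a task with a known answer."""
|         if random.random() < gold_rate:
|             return random.choice(list(gold_answers))
|         return random.choice(real_tasks)
|
|     def update_trust(trust, worker, task, answer, gold_answers):
|         """Exponential moving average of accuracy on gold tasks."""
|         if task in gold_answers:
|             correct = answer == gold_answers[task]
|             trust[worker] = (0.9 * trust.get(worker, 0.5)
|                              + 0.1 * correct)
|         return trust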
| hericium wrote:
| It depends on who would be providing the URIs.
|
| If the miner, they would be able to deliver any crap, so
| content deliveries would have to be judged in some way and
| awarded differently.
|
| If the pool provides addresses to crawl, the miner could be
| given crafted/dedicated URIs from time to time, and lack of
| delivery of Proof Of Crawl could result in a penalty chosen
| in a way rendering _"cheating"_ unprofitable. But then fresh
| URIs have to come from somewhere.
| zomglings wrote:
| One idea I have been kicking around is federating the indices,
| not the crawling.
|
| If every contributor maintained their own index, then you
| could reward contributors based on how many hits their index
| generated.
|
| This would open up the possibility of people maintaining
| indices for specialized topics that they were experts in, and
| give the federated search engine a shot at taking on Google.
| bspammer wrote:
| You don't want to reward people for quantity, but quality.
| The cost of creating a new webpage is effectively zero, so
| if you attach an incentive for creating them you are
| doomed.
| zomglings wrote:
| You and I are talking about completely different types of
| people.
|
| For most people, the cost of creating and maintaining a
| website is high. This is why products like Wix and
| Squarespace exist (and are not cheap).
|
| I am thinking a simple dashboard where _anyone_ could go
| and curate a list of content they find useful. They could
| share this with the world.
|
| The interface should be so simple that my parents could
| use it - and they aren't going to be putting up websites
| anytime soon.
| DarylZero wrote:
| This is what Yahoo was doing before Google took over.
|
| (But of course it wasn't a volunteer public benefit effort
| like you describe.)
| thebeastie wrote:
| Actually I have an idea for you: I think you can use
| cryptography to prove that an SSL session really happened, so
| you could prove indexing of HTTPS sites.
| thebeastie wrote:
| I think the way this works is having the code to execute an
| SSL session encoded in a zkSNARK. One of the zkSNARK-based
| blockchains is doing it.
| detaro wrote:
| You can prove that a TLS session happened, but nothing about
| its contents, so you can't really prove indexing.
| g105b wrote:
| I'm very intrigued by this concept.
| Piezoid wrote:
| YaCy is decentralized, but without the credit system. Some
| tokens, like QBUX, have tried to develop decentralized hosting
| infrastructure.
|
| I also have been wondering how this would play out with some
| kind of decentralized indexes. The nodes could automatically
| cluster with other nodes of users sharing the same interests,
| using some notion of distances between query distributions. The
| caching and crawling tasks could then be distributed between
| neighbors.
| igammarays wrote:
| YaCy is too slow for mainstream use. I believe the indices
| still need to be centralized, only index-building and
| crawling can be distributed.
| marginalia_nu wrote:
| A big part of the problem I see with decentralized search
| is that you basically need to traverse the index in
| orthogonal axes to assemble search results. First you need
| to search word-wise in order to get result candidates, then
| sort them rank-wise to get relevant results (this also
| hinges upon an agreed-upon ranking of domains). That's a
| damn hard nut to crack for a distributed system.
|
| Crawling is also not as resource-consuming as you might
| think. Sure you _can_ distribute it, but there isn't a
| huge benefit to this.
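|
| A sketch of those two traversals (Python; `inverted_index`,
| `domain_rank` and `domain_of` are hypothetical):
|
|     def search(query_terms, inverted_index, domain_rank, domain_of):
|         """Word-wise: intersect posting lists for candidates;
|         rank-wise: sort them by the agreed-upon domain rank."""
|         postings = [inverted_index.get(t, set()) for t in query_terms]
|         candidates = set.intersection(*postings) if postings else set()
|         return sorted(candidates,
|                       key=lambda d: domain_rank.get(domain_of(d), 0),
|                       reverse=True)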
___________________________________________________________________
(page generated 2021-12-26 23:00 UTC)