hngopher.com

       [HN Gopher] A search engine that favors text-heavy sites and pun...
       ___________________________________________________________________
        
       A search engine that favors text-heavy sites and punishes modern
       web design
        
       Author : Funes-
       Score  : 2915 points
       Date   : 2021-09-16 12:16 UTC (1 days ago)
        
 (HTM) web link (search.marginalia.nu)
 (TXT) w3m dump (search.marginalia.nu)
        
       | ryankrage77 wrote:
       | The results are fantastic, but I can't see how the excerpts
       | relate to the search term.
       | 
       | For example, a search term for 'Scotichronicon' returns some
       | fascinating results, but the search term itself doesn't appear in
       | the title or excerpts of most of the results.
       | 
       | This makes it harder to judge how relevant they are.
        
         | marginalia_nu wrote:
         | The excerpts are static and very best effort. You just have to
         | visit the website and find out I'm afraid.
         | 
         | I can do a lot with what I have, but I can't do full text
         | search on millions of documents with dynamic excerpts off a
         | single computer in my living room.
        
       | bencollier49 wrote:
       | This is brilliant - I can actually surf the web for fun again.
       | This engine's actually a nice complement to another mainstream
       | engine, as the regular one is good for searches during the
       | working day, whilst "marginalia" (?) is great for recreational
       | reading, and actual learning.
        
       | snakeboy wrote:
       | Wow, that's awesome. Great work!
       | 
       | For a simple test, I searched "fall of the roman empire". In your
       | search engine, I got wikipedia, followed by academic talks,
       | chapters of books, and long-form blogs. All extremely useful
       | resources.
       | 
       | When I search on google, I get wikipedia, followed by a listicle
       | "8 Reasons Why Rome Fell", then the imdb page for a movie by the
       | same name, and then two Amazon book links, which are totally
       | useless.
        
         | adventured wrote:
         | I did a search for "George Washington"
         | 
         | First result after Wikipedia:
         | 
         | "Radiophone Transmitter on the U.S.S. George Washington (1920)
         | 
         | In 1906, Reginald Fessenden contracted with General Electric to
         | build the first alternator transmitter. G.E. continued to
         | perfect alternator transmitter design, and at the time of this
         | report, the Navy was operating one of G.E.'s 200 kilowatt
         | alternators http://earlyradiohistory.us/1919wsh.htm "
         | 
         | Another result in the first few:
         | 
         | " - VANDERBILT, GEORGE WASHINGTON
         | 
         | PH: (800) ###-#233 FX: (#03) 641-5###.
         | https://www.ScottWinslow.com/manufacturer/VANDERBILT_GEORGE_...
         | "
         | 
         | And just below that terrible result:
         | 
         | "I Looked and I Listened -- George Washington Hill extract
         | (1954)
         | 
         | Although the events described in this account are undated, they
         | appear to have occurred in late 1928. I Looked and I Listened,
         | Ben Gross, 1954, pages 104-105: Programs such as these called
         | for the expenditure of larger sums than NBC had anticipated. It
         | be http://earlyradiohistory.us/1954ayl2.htm "
         | 
         | Dramatically worse than Google.
         | 
         | ---
         | 
         | Ok, how about a search for "Rome" then? Surely it'll pull some
         | great text results for the city or the ancient empire.
         | 
         | First result after Wikipedia:
         | 
         | "Home | Rome Daily Sentinel
         | 
         | Reliable Community News for Oneida, Madison and Lewis County
         | http://romesentinel.com/"
         | 
         | The fourth result for searching "Rome":
         | 
         | "Glenn's Pens - Stores of Note
         | 
         | Glenn's Pens, web site about pens, inks, stores, companies -
         | the pleasure of owning and using a pen of choice. Direcdtory of
         | pen stores in Europe.
         | http://www.marcuslink.com/pens/storesofnote/roma.html"
         | 
         | Again, dramatically worse than Google.
         | 
         | ---
         | 
         | Ok, how about if I search for "British"?
         | 
         | First result after Wikipedia:
         | 
         | "BRITISH MINING DATABASE
         | 
         | British_Mining_Database
         | http://www.users.globalnet.co.uk/~lizcolin/bmd.htm "
         | 
         | And after that:
         | 
         | "British Virgin Islands
         | 
         | Many of these photos were taken on board the Spirit of
         | Massachusetts. The sailing trip was organized by Toto Tours.
         | Images Copyright (c) Lowell Greenberg Home Up Spring Quail
         | Gardens Forest Home Lake Hodges Cape Falcon Cape Lookout,
         | Oregon Wahkeena
         | http://www.earthrenewal.org/british_virgin_islands2.htm"
         | 
         | Again, far off the mark and dramatically worse than Google.
         | 
         | I like the idea of Google having lots of search competition,
         | this isn't there yet (and I wouldn't expect it to be). I don't
         | think overhyping its results does it any favors.
        
           | burkaman wrote:
           | This is not a Google competitor, it's a different type of
           | search engine with different goals.
           | 
           | > If you are looking for fact, this is almost certainly the
           | wrong tool. If you are looking for serendipity, you're on the
           | right track. When was the last time you just stumbled onto
           | something interesting, by the way?
        
           | JasonFruit wrote:
           | Hobby project leads angry person to interesting and
           | unexpected material; angry person remains angry. Details at
           | six.
        
             | fouric wrote:
             | The project explicitly bills itself as a "search engine",
             | not an "interesting and unexpected material surfacer".
             | Moreover, projecting emotions like "angry" onto a comment
             | in order to discredit the content of the comment (hey! is
             | that an ad-hominem?) is just about exactly the opposite of
             | the discussions that the HN mods are trying to curate, and
             | the discussions that I like to see here.
        
               | withinboredom wrote:
               | In the early days of google, I found what I was looking
               | for on page 5+. On the way, I'd discover many interesting
               | things I didn't even know I was looking for, often
               | completely unrelated to what I was searching for.
        
               | kwertyoowiyop wrote:
               | And now Google hides that more than one page even exists,
               | as they populate their first page with buttons to ask
               | similar questions and go to the first page of THOSE
               | results.
        
               | kews wrote:
               | I miss those old days of even being permitted to go many
               | pages in.
        
               | allknowingfrog wrote:
               | If you click through to the About page, I think you'll
               | see that "interesting and unexpected material surfacer"
               | is a fairly apt description of the project.
        
             | adventured wrote:
             | > Hobby project leads angry person to interesting and
             | unexpected material; angry person remains angry.
             | 
             | Not angry in the least. I'm thrilled someone is working on
             | a search competitor to Google.
             | 
             | I understand you're attempting to dismiss my pointing out
             | the bad results by calling me angry though. You're focusing
             | your content on me personally, instead of what I pointed
             | out.
             | 
             | The parent was far overhyping the results in a way that was
             | very misleading (look, it's better than Google!). I tried
             | various searches, they were not great results. The parent
             | was very clearly implying something a lot better than that
             | by what they said. The product isn't close to being at that
             | level at this point, overhyping it to such an absurd degree
             | isn't reasonable or fair to the person that is working on
             | it.
             | 
             | I would specifically suggest people not compare it to
             | Google. Let it be its own thing, at least for a good while.
             | Google (Alphabet) is a trillion dollar company. Don't press
             | the expectations so far and stage it to compete with Google
             | at this point. I wouldn't even reference Google in relation
             | to this search engine, let it be its own thing and find its
             | own mindshare.
        
               | bityard wrote:
               | > I'm thrilled someone is working on a search competitor
               | to Google.
               | 
               | Except the author goes to quite some lengths to explain
               | that his search engine is not a competitor to Google, and
               | is in fact exactly the opposite of Google in many ways:
               | https://memex.marginalia.nu/projects/edge/about.gmi
        
           | kwhitefoot wrote:
           | What were you expecting to see for British? There must be
           | millions of pages containing that term. Anyway the first
           | screenful from Google is unadulterated crap, advertising
           | mixed with the usual trivia questions.
           | 
           | If you are going to claim something is wide of the mark then
           | you really ought to tell us at least roughly where the mark
           | is.
        
           | duckmysick wrote:
           | I checked the results of the same query and they seem fine.
           | Lots of speeches and articles about George Washington the US
           | president. There's even his beer recipe.
           | 
           | As for the results you linked, it's part of the zeitgeist to
           | list other entities sharing the same name. Sure, they could
           | use some subtle changes in ranking, but overall the returned
           | links satisfy my curiosity.
        
         | Nition wrote:
         | The Wikipedia link at the top is always given. It would maybe
         | be good to make it a little clearer that it's not one of the
         | true results.
        
           | Ajef wrote:
           | I think this is just because of terms you have searched. In
           | my test-searches Wikipedia has not come up once in first
           | position (I think the highest was 3rd in the list).
           | 
           | Here's what I've tried with a few variations: golang generics
           | proposal, machine learning transformer, covid hospitalization
           | germany
           | 
           | [edit] formatting
        
             | Nition wrote:
             | I think maybe it's a special insert at the top, but only if
             | a Wikipedia page is found that matches your search term? I'm
             | not sure now though.
        
         | titzer wrote:
         | Search engines whose revenue is based on advertising will
         | ultimately be tuned to steer you to the ad foodchain. All the
         | incentives are aligned towards and all the metrics ultimately
         | in service of, profit for advertisers. Not in the 99% of people
         | who can be convinced to consume something by ads? Welp, screw you.
        
           | phendrenad2 wrote:
           | Search engines should be something you pay for. Surely search
           | engine powerusers can afford to pay for such a service. If
           | Google makes $1 per user per month or something, that's not
           | too high a bar to get over.
        
             | titzer wrote:
             | Search engines should be like libraries. At least some tiny
             | sliver of the billions we spend on education and research
             | should go to, you know, actually organizing the world's
             | information and making it universally available.
        
             | wizzwizz4 wrote:
             | In which case, consider paying for something like Infinity:
             | https://infinitysearch.co/
        
         | Siira wrote:
         | I tried some queries for Harry Potter fanfictions, and the
         | results were pretty much completely unrelated. There weren't
         | that many results, either.
        
         | acchow wrote:
         | If this search engine ever takes off, the listicle writers will
         | just start optimizing for it too, right?
        
           | dotancohen wrote:
           | Mission accomplished, then.
        
             | acchow wrote:
             | If the goal was to remove modern web design, ok sure
             | mission accomplished.
             | 
             | If your goal was to create a search engine that ignored
             | listicles and other fluff and instead got you meatier
             | results like "academic talks" and such, then no.
        
         | klntsky wrote:
         | However, when searching for "haskell type inference algorithm"
         | I get completely useless results.
        
           | [deleted]
        
           | klntsky wrote:
           | Since it does not use synonyms, it looks like it is unable to
           | answer "how's that thing called"-queries.
        
           | burkaman wrote:
           | That query is too long apparently. But if you shorten to
           | "haskell type inference", I think it delivers on its promise:
           | 
           | > If you are looking for fact, this is almost certainly the
           | wrong tool. If you are looking for serendipity, you're on the
           | right track. When was the last time you just stumbled onto
           | something interesting, by the way?
        
             | marginalia_nu wrote:
             | The search engine doesn't do any type of re-ordering or
             | synonym stuff, it only tries to construct different N-grams
             | from the search query.
             | 
             | So if you for example compare "SDL tutorial" with "SDL
             | tutorials". On google you'd get the same stuff, this search
             | engine, for better or worse doesn't.
             | 
             | This is a design decision, for now anyway, mostly because
             | I'm incredibly annoyed when algorithms are second-guessing
             | me. On the other hand, it does mean you sometimes have to
             | try different searches to get relevant results.
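As a rough illustration of the behaviour described above (not the engine's actual code), here is a minimal sketch of building word N-grams from a query with no stemming and no synonym expansion. The function name `query_ngrams`, the `max_n` parameter, and the lowercasing step are all assumptions made for the example.

```python
def query_ngrams(query, max_n=3):
    """Return every contiguous word N-gram of the query, longest first.

    Hypothetical helper: terms are lowercased (an assumption) but never
    stemmed or swapped for synonyms, so "tutorial" and "tutorials"
    remain two distinct terms.
    """
    words = query.lower().split()
    ngrams = []
    # Start with the longest N-gram the query allows, then shrink.
    for n in range(min(max_n, len(words)), 0, -1):
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams


print(query_ngrams("SDL tutorial"))
print(query_ngrams("SDL tutorials"))
```

Because nothing normalizes the terms, the two queries share only the "sdl" unigram, which is why they can return different results.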
        
               | ford_o wrote:
               | Maybe list the synonyms under the query, so it's easier to
               | try different formulations.
        
               | akavel wrote:
               | Oh this sounds like it could be a really cool idea! This
               | way it could also be subtly teaching users that the
               | engine doesn't do automatic synonyms translation so it's
               | worth experimenting; also kinda like giving the synonyms
               | feature while still keeping user in full control.
        
               | Razengan wrote:
               | It could simply become an option.
        
               | OneLeggedCat wrote:
               | Don't change it. It's good this way.
        
               | leephillips wrote:
               | I like this design decision. It pays you back for
               | choosing your search terms carefully.
        
               | mananaysiempre wrote:
               | I'm not against a stemmer, actually, just against the
               | aggressive concordances (?) that Google now employs, like
               | when it shows me X in Banach spaces (the classical,
               | textbook case) when I'm specifically searching for X in
               | Frechet spaces (the generalization I want to find but am
               | not sure exists); of _course_ Banach spaces and Frechet
               | spaces are almost exclusively encountered in the same
               | context, but it doesn't mean that one is a popular typo
               | for the other! (The relative rarity of both of these in
               | the corpus probably doesn't help. The farcical case is
               | BRST, or Becchi-Rouet-Stora-Tyutin, in physics, as it is
               | literally a single key away from "best" and thus almost
               | impossible to search for.)
               | 
               | On the other hand, Google's unawareness of (extensive and
               | ubiquitous) Russian noun morphology is essentially what
               | allowed Yandex to exist: both 2011 Yandex and 2021 Google
               | are _much_ more helpful for Russian than 2011 Google. I
               | suspect (but have not checked) that the engine under
               | discussion is utterly unusable for it. English (along
               | with other Germanic and Romance languages to a lesser
               | extent) is quite unusual in being meaningfully searchable
               | without any understanding of morphology, globally
               | speaking.
        
               | dahauns wrote:
               | English is more the outlier in regard to Germanic
               | languages, try German or Finnish, with their wonderful
               | compounds :)
               | 
               | https://e.humanities.uva.nl/publications/2004/kamp_lang04
               | .pd...
        
               | medstrom wrote:
               | I thought you could fix that by enclosing "BRST" in
               | quotes, but apparently not. DuckDuckGo (which uses
               | Google) returns a couple of results that do contain
               | "BRST" in a medical context, but most results don't
               | contain this string at all. What's going on?
        
               | mananaysiempre wrote:
               | I'm not certain what DDG actually uses (wasn't it Bing?),
               | but in my experience from the last couple of months it
               | ignores quotes substantially _more_ eagerly than Google
               | does. For this particular term, a little bit of domain
               | knowledge helps: even without quotes, _brst becchi_,
               | _brst formalism_, _brst quantization_ or perhaps _bv
               | brst_ will get you reasonable results. (I could swear
               | Google corrected _brst quantization_ to _best
               | quantization_ a year ago, but apparently not anymore.)
               | Searching for stuff in the context of BRST is still
               | somewhat unpleasant, though.
               | 
               | I... don't think anything particularly surprising is
               | happening here, except for quotes being apparently
               | ignored? I've had it explained to me that a rare word is
               | essentially indistinguishable from a popular misspelling
               | by NLP techniques as they currently exist, except by
               | feeding the machine a massive dictionary (and perhaps not
               | even then). BRST is a thing that you essentially can't
               | even define satisfactorily without at the very least four
               | years of university-level physics (going by the
               | conventional broad approach--the most direct possible
               | road can of course be shorter if not necessarily more
               | illuminating). "Best" is a very popular word both
               | generally and in searches, and the R key is next to E on
               | a Latin keyboard. If you are a perfect probabilistic
               | reasoner with only these facts for context (and
               | especially if you ignore case), I can very well believe
               | that your best possible course of action is to assume a
               | typo.
               | 
               | How to permit overriding that decision (and indeed how to
               | recognize you've actually made one worth worrying about
               | without massive human input-- _e.g._ Russian adjectives
               | can have more than 20 distinct forms, can be made up on
               | the spot by following productive word-formation
               | processes, and you don't want to learn all of the world's
               | languages!) is simply a very difficult problem for what
               | is probably a marginal benefit in the grand scheme of
               | things.
               | 
               | I just dislike hitting these margins so much.
        
               | leephillips wrote:
               | It would not be a difficult problem if they allowed the "
               | " operator to work as they claim it does, or revive the +
               | operator.
        
               | mananaysiempre wrote:
               | In English, maybe; in Russian, I frequently find myself
               | reaching for the nonexistent "morphology but not
               | synonyms" operator (as the same noun phrase can take a
               | different form depending on whether it is the subject or
               | the object of a verb, or even on which verb it is the
               | object of); even German should have the same problem
               | AFAIU, if a bit milder. I don't dare think about how
               | speakers of agglutinative languages (Finnish, Turkish,
               | Malayalam) suffer.
               | 
               | (DDG docs do say it supports +... and even +"...", but I
               | can't seem to get them to do what I want.)
        
               | leephillips wrote:
               | Ah, OK. I don't know anything about Russian. This is a
               | hard problem. I think the solution is something like what
               | you suggest: more operators allowing different
               | transformations. Even in English, I would like a "you may
               | pluralize but nothing else" operator.
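The "you may pluralize but nothing else" operator could look roughly like the sketch below. No search engine is claimed to offer this; `plural_variants` and `matches` are hypothetical names, and the rule only handles regular English "-s" plurals, which is a deliberate simplification.

```python
def plural_variants(term):
    """Return the term plus a naive singular/plural counterpart.

    Hypothetical helper: only regular "-s" plurals are handled, so
    irregular forms ("mice") and "-es"/"-ies" endings are not covered.
    """
    if len(term) > 1 and term.endswith("s"):
        return {term, term[:-1]}
    return {term, term + "s"}


def matches(query_term, document_tokens):
    """True if the document contains the term or its plural variant.

    No typo correction and no synonyms are applied, so a rare term is
    never rewritten into a more popular one.
    """
    return not plural_variants(query_term).isdisjoint(document_tokens)


print(matches("tutorial", ["sdl", "tutorials"]))
print(matches("brst", ["best", "quantization"]))
```

The design choice is that the user, not the engine, decides which transformation is allowed, so "brst" is never matched against "best".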
        
           | LanceH wrote:
           | It would be nice if we could pipe search engines.
        
             | BenoitP wrote:
             | Definitely; We could create a meta search engine that
             | queries them all, in desktop application format.
             | 
             | Let's name it after a famous old scientist, and maybe add
             | the year to prove it's modern: Galileo 2021.
        
               | overkalix wrote:
               | ... is this Galileo 2021 a reference that I am not
               | understanding?
        
               | BenoitP wrote:
               | Yup, but so far no one got it.
               | 
               | There was such an app in the early 2000's, before Google
               | went mainstream, and Altavista-like engines were not
               | good: Copernic 2000.
               | 
               | I guess I'm officially old now.
        
               | squeaky-clean wrote:
               | I was always a dogpile user :p
        
               | LightG wrote:
               | Hotbot!
        
               | tomerv wrote:
               | FWIW, I got the reference. Maybe I'm old too?
        
               | genewitch wrote:
               | For years I wanted to try Copernic Summarizer. It seemed
               | like it actually worked. Then software that did summaries
               | disappeared, maybe? And about 5 years ago bots on Reddit
               | were doing summaries of news stories (and then links in
               | comments).
               | 
               | This is a pattern I see over and over again, some
               | research group or academics show that something can be
               | done (summaries that make sense and are true summaries,
               | evolutionary algorithm FPGA programming, real time gaze
               | prediction, etc) and there's a few published code repos
               | and a bit of news, then 'poof' - nowhere to be seen for
               | 15 years or more.
        
               | PaulHoule wrote:
               | Meta search engines leave a bad taste in everyone's mouth
               | because they've always failed. Here is why
               | 
               | https://en.wikipedia.org/wiki/Arrow%27s_impossibility_the
               | ore...
               | 
               | You can't combine a few different ranked lists and expect
               | to get results better than any of the original ranked
               | lists.
        
               | robrenaud wrote:
               | > You can't combine a few different ranked lists and
               | expect to get results better than any of the original
               | ranked lists.
               | 
               | I am skeptical of this application of the theorem. Here
               | is my proposal:
               | 
               | Take the top 10 Google and Bing results. If the top
               | result from Bing is in the top 10 from Google, display
               | Google results. If the top result from Bing is not in the
               | top 10 from Google, place it at the 10th position. You'd
               | have an algorithm that ties with Google, say 98% of the
               | time, beats it say, 1.2% of the time, and loses .8% of
               | the time.
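The proposal above can be sketched as a small merge function. This is only an illustration of the stated rule: `merge_results` is a hypothetical name, the inputs are plain ranked lists of URLs, and nothing here calls a real search API.

```python
def merge_results(primary, secondary, k=10):
    """Blend two ranked result lists using the rule described above.

    primary and secondary are ranked lists of URLs, best first
    (standing in for the Google and Bing results respectively).
    """
    top = list(primary[:k])
    if not secondary:
        return top
    candidate = secondary[0]
    if candidate in top:
        # The secondary engine agrees with the primary top k: no change.
        return top
    # Otherwise the secondary engine's top hit takes the last (k-th) slot.
    return top[:k - 1] + [candidate]


google = [f"g{i}" for i in range(1, 11)]
print(merge_results(google, ["g3", "b1"]))
print(merge_results(google, ["b1", "g3"]))
```

The merged list differs from the primary list in at most one position, which is what keeps the downside small in the cases where the secondary engine is wrong.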
        
               | vikingerik wrote:
               | Right. Arrow's theorem just says it's impossible to do it
               | in _all_ cases. It's still quite possible to get an
               | improvement in a large proportion of cases, as you're
               | proposing.
        
               | random314 wrote:
               | Arrow's theorem simply doesn't apply here. We don't need
               | our personalized search results to satisfy the majority.
        
               | PaulHoule wrote:
               | But in both cases you face the problem of aggregating
               | preferences of many into one. In one case you are
               | combining personal preferences in the other case
               | aggregating 'preferences' expressed by search engines.
        
               | random314 wrote:
               | But search engines aren't voting to maximize the chances
               | that their preferred candidate shows up on top. The mixed
               | ranker has no requirement to satisfy Arrow's integrity
               | constraints. It has to satisfy the end user, which is
               | quite possible in theory.
               | 
               | Conditions the mixed ranker doesn't have to satisfy
               | "ranking while also meeting a specified set of criteria:
               | unrestricted domain, non-dictatorship, Pareto efficiency,
               | and independence of irrelevant alternatives"
        
               | PaulHoule wrote:
               | Sure, but the problem that conventional IR ranking
               | functions are not meaningful other than by ordering leads
               | you to the dismal world of political economy where you
               | can't aggregate people's utility functions. (Thus you
               | can't say anything about inequality, only about Pareto
               | efficiency)
               | 
               | Hypothetically you could treat these functions as
               | meaningful but when you try you find that they aren't
               | very meaningful.
               | 
               | For instance IBM Watson aggregated multiple search
               | sources by converting all the relevance scores to "the
               | probability that this result is relevant".
               | 
               | A conventional search engine will do horribly in that
               | respect, you can fit a logit curve to make a probability
               | estimator and you might get p=0.7 at the most and very
               | rarely get that, in fact, you rarely get p>0.5.
               | 
               | If you are combining search results from search engines
               | that use similar approaches you know those p's are not
               | independent so you can't take a large numbers of p=0.7's
               | and turn that into a higher p.
               | 
               | If you are using search engines that use radically
               | different matching strategies (say they return only
               | p=0.99 results with low recall) the Watson approach
               | works, but you need a big team to develop a long tail of
               | matching strategies.
               | 
               | If you had a good p-estimator for search you could do all
               | sorts of things that normal search engines do poorly,
               | such as "get an email when a p>0.5 document is added to
               | the collection."
               | 
               | For now alerting features are either absent or useless
               | and most people have no idea why.
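To make "fit a logit curve to make a probability estimator" concrete, here is a minimal sketch that maps raw relevance scores to probabilities with a one-variable logistic fit. It is not a description of how Watson or any particular engine did it; the function names are hypothetical, and the scores and relevance judgments are made-up toy data.

```python
import math


def fit_logistic(scores, labels, lr=0.1, epochs=2000):
    """Fit p(relevant | score) = sigmoid(a * score + b) by gradient descent.

    scores are raw ranking scores; labels are 1 (judged relevant) or 0.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def relevance_probability(score, a, b):
    """Convert a raw score into an estimated probability of relevance."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Toy data (invented for illustration): higher scores are relevant more
# often, but the classes overlap, as they do for real ranking functions.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 0, 1, 1, 1]
a, b = fit_logistic(toy_scores, toy_labels)

# An alert rule like the one mentioned above: notify only when p > 0.5.
for s in (0.2, 0.9):
    p = relevance_probability(s, a, b)
    print(s, round(p, 3), "alert" if p > 0.5 else "no alert")
```

A calibrated estimate like this is what a threshold such as p>0.5 needs; the raw score alone only supports ordering.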
        
               | PaulHoule wrote:
               | I've had jobs tuning up the relevance of search engines
               | with methods like
               | 
               | https://ccc.inaoep.mx/~villasen/bib/AN%20OVERVIEW%20OF%20
               | EVA...
               | 
               | and the first conclusion is "something that you think
               | will improve relevance probably won't"; the TREC
               | conference went for about five years before making the
               | first real discovery
               | 
               | https://en.wikipedia.org/wiki/Okapi_BM25
               | 
               | It's true that Arrow's Theorem doesn't strictly apply,
               | but thinking about it makes it clear that the aggregation
               | problem is ill-defined and tricky. (e.g. note also that a
               | ranking function for full text search might have a range
               | of 0-1 but is not a meaningful number, like a probability
               | estimate that a document is relevant, but it just means
               | that a result with a higher score is likely to be more
               | relevant than one with a lower score.)
               | 
               | Another way to think about it is that for any given
               | feature architecture (say "bag of words") there is an
               | (unknown) ideal ranking function.
               | 
               | You might think that a real ranking function is the ideal
               | ranking function plus an error and that averaging several
               | ranking functions would keep the contribution of the
               | ideal ranking function and the errors would average out,
               | but actually the errors are correlated.
               | 
               | In the case of BM25 for instance, it turns out you have
               | to carefully tune between the biases of "long documents
               | get more hits because they have more words in them" and
               | "short documents rank higher because the document vectors
               | are spiky like the query vectors". Until BM25 there
               | wasn't a function that could be tuned up properly and
               | just averaging several bad functions doesn't solve the
               | real problem.
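The length trade-off described above shows up directly in the BM25 formula, where the parameter b sets how strongly document length is normalized. The sketch below is a plain-Python version using the commonly cited defaults k1=1.2 and b=0.75 and the "+1 inside the log" IDF variant; the function name and the toy documents are assumptions, and it expects a non-empty document list.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25.

    query is a list of tokens; documents is a non-empty list of token
    lists. b=0 ignores document length entirely, b=1 normalizes fully
    by length relative to the collection average.
    """
    n_docs = len(documents)
    avg_len = sum(len(d) for d in documents) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in documents for term in set(d))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        # Length normalization: long documents get a larger penalty term.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [
    ["rome", "fell"],                # short document
    ["rome"] + ["filler"] * 20,      # long document, same term count
    ["pens", "ink"],                 # no match
]
print(bm25_scores(["rome"], docs))
print(bm25_scores(["rome"], docs, b=0.0))
```

With the default b the short document outranks the long one for the same term count; with b=0 the two tie, which is the bias that has to be tuned.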
        
               | gnramires wrote:
               | That's an invalid application of this theorem. (It
               | doesn't necessarily hold)
               | 
               | Suppose there's an unambiguous ranked preference by all
               | people among a set (webpages, ranking). Suppose one
               | search engine ranks correctly the top 5 results and
               | incorrectly the next 5 results, while another ranks
               | incorrectly the top 5 and correctly the next 5.
               | 
               | What can happen is that there may be no universally
               | preferred search engine (likely). In practice, as another
               | commenter noted, most users may also prefer a certain
               | combination of results (that's not difficult to imagine,
               | for example one that combines the top independent results
               | from different engines).
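               |
               | One simple way to combine results, sketched here as an
               | assumption rather than anything the commenters specified,
               | is a round-robin interleave that takes the next result
               | from each engine in turn and drops duplicates. The
               | function name is hypothetical.

```python
from itertools import zip_longest


def interleave_results(*ranked_lists):
    """Round-robin merge of ranked result lists, keeping first occurrences."""
    merged = []
    seen = set()
    # zip_longest pads shorter lists with None so longer lists are not cut off.
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "d.example"]
combined = interleave_results(engine_a, engine_b)
```

               | Here combined starts with the top hit of each engine, so
               | neither engine's ordering dominates the merged list.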
        
               | collinmanderson wrote:
               | Brave browser currently has "Google fallback" which
               | sometimes mixes in Google search results with Brave's own
               | search engine.
               | 
               | https://search.brave.com/help/google-fallback
        
               | Torwald wrote:
               | I need that with a simpler interface, so I call it after
               | a famous detective: Sherlock.
        
               | artificial wrote:
               | _a magic pop sound is faintly audible as a new side
               | project is appended to several lists_ Excellent, thank
               | you!
        
               | mkr-hn wrote:
               | Likely trademark collision with this:
               | https://www.galileo.usg.edu/
        
               | gmueckl wrote:
               | Not an app, but probably comes quite close in all other
               | respects: https://metager.org
        
         | foofoo4u wrote:
         | Good comparison. Reminds me of an analogy I like to make of
         | today's web, which is it feels like browsing through a magazine
         | store -- full of top 10s, shallow wow-factoids, and baity
         | material. I genuinely believe terrible results like this are
         | making society dumber.
        
           | bluGill wrote:
           | what I really want is a true AI to search through all that
           | and figure out the useful truth. I don't know how to do this
           | (and of course whoever writes the AI needs to be unbiased...)
        
             | idiotsecant wrote:
             | >whoever writes the AI needs to be unbiased...)
             | 
             | I'm not sure the idea of a sentient being not having a bias
             | is meaningful. Reality, once you get past the trivial bits,
             | is subjective.
        
               | jimbokun wrote:
               | Isn't there a fundamental ML postulate that learning
               | without bias is impossible?
               | 
               | Maybe not the same kind of bias we think of in terms of
               | politics and such, but I wonder if there's a connection.
        
               | bluGill wrote:
               | I didn't say the AI should be unbiased, just whoever
               | writes it.
               | 
               | I want an AI that is biased to the truth when there is an
               | objective one, and my tastes otherwise. (that is when
               | asked to find a good book it should give me fantasy even
               | though romance is the most popular genre and so will have
               | better reviews)
        
             | hnick wrote:
             | I think that is the goal, it's just what we currently have
             | is an AI that's like a naive child who is easily tricked
             | and distracted by clickbait.
        
               | Dah00n wrote:
               | >an AI that's <snip> easily tricked and distracted by
               | clickbait.
               | 
               | So, AIs are actually on par with most adults now? (Sorry)
        
           | anonuser123456 wrote:
           | > I genuinely believe terrible results like this are making
           | society dumber.
           | 
           | You have the causality reversed. Google results reflect the
           | fact that society is dumb.
        
             | jimbokun wrote:
             | Google results reflect the fact that educating and
             | informing people has low profit margins.
        
           | rchaud wrote:
           | The context matters. I'd happily read "Top 10" lists on a
           | website if the site itself was dedicated to that one thing.
           | "Top 10 Prog Rock albums", while a lazy, SEO-bait title,
           | would at least be credible if it were on a music-oriented
           | website.
           | 
           | But no, these stories all come from cookie-cutter "new media"
           | blog sites, written by an anonymous content writer who's
           | repackaged Wikipedia/Discogs info into Buzzfeed-style copy
           | writing designed to get people to "share to Twitter/FB". No
           | passion, no expertise. Just eyeballs at any cost.
        
             | 1vuio0pswjnm7 wrote:
             | It's the "healthy web" Mozilla^1 and Google keep telling
             | their blog audiences about. :)
             | 
             | 1 Accept quid pro quo to send all queries to Google by
             | default
             | 
             | If what these companies were telling their readers was
             | true, i.e., that advertising is "essential" for the web to
             | survive, then how are the sites returned by this search
             | engine for text-heavy websites (that are not discoverable
             | through Google, the default search engine for Chrome,
             | Firefox, etc.) able to remain online? Advertising is
             | essential for the "tech" company middleman business to
             | survive.
        
             | foofoo4u wrote:
             | This got me thinking that maybe one of the other big
             | reasons for this is that the algorithms prioritize newer
             | pages over older pages. This produces the problem where
             | instead of covering a topic and refining it over time, the
             | incentive is to repackage it over and over again.
             | 
             | It reminds me of an annoyance I have with the Kindle store.
             | If I wanted to find a book on, let's say, Psychology, there
             | is no option to find all-time respected books of the past
             | century. Amazon's algorithms constantly push to recommend
             | the latest hot book of the year. But I don't want that. A
             | year is not enough time to have society determine if the
             | material withstands time. I want something that has stood
             | the test of time and is recommended by reputable
             | institutions.
        
               | echelon wrote:
               | The new evergreen is refreshed sludge for bottom dollar.
               | College kids stealing Reddit comments or moving around
               | paragraphs from old articles. Or linking to linked blogs
               | that link elsewhere.
               | 
               | It's all stamped with Google Ads, of course, and then
               | Google ranks these pages high enough to rake in eyeballs
               | and ad dollars.
               | 
               | Also there's the fact that each year, the average webpage
               | picks up two more video elements / ad players, one or two
               | more ad overlays, a cookie banner, and half a dozen
               | banner/interstitials. It's 3-5% content spread thinly
               | over an ad engine.
               | 
               | The Google web is about squeezing ads down your throat.
        
               | andrepd wrote:
               | Really makes you wonder: you play whack a mole and tackle
               | the symptoms with initiatives like this search engine.
               | But the root of that problem and many many others is the
               | same: advertising. Why don't we try to tackle that?
        
               | echelon wrote:
               | Exactly.
               | 
               | The only reason people make content they aren't
               | passionate about is advertising.
        
               | jamra wrote:
               | This is just a guess, but I believe that they use machine
               | learning and rank it by the clicks. I took some Coursera
               | courses and Andrew Ng sort of suggested that as their
               | strategy.
               | 
               | The problem is that clickbait and low effort articles
               | could be good enough to get the click, but low effort
               | enough to drag society into the gutter. As time passes,
               | the system is gamified more and more where the least
               | effort for the most clicks is optimized.
        
               | amenod wrote:
               | > is that the algorithms prioritize newer pages over
               | older pages.
               | 
               | They do? That would explain a lot - but ironically, I
               | can't find a good source on this. Do you have one at
               | hand?
        
               | dvogel wrote:
               | It is pretty obvious if you search for any old topic that
               | is also covered incessantly by the news. "royal family"
               | is a good example. There's no way those news stories
               | published an hour ago are listed first due to a high
               | PageRank score (which necessarily depends on time to
               | accumulate inbound links).
        
               | RattleyCooper wrote:
               | It depends on the content. The flip side is looking up a
               | programming-related question and getting results from
               | 2012.
               | 
               | I think they take different things into account based on
               | the thing being searched.
        
               | II2II wrote:
               | Even your example would depend upon the context. There
               | are many cases where a programming question in 2021 is
               | identical to one from 2012, along with the answer. In
               | those instances, would you rather a shallow answer from
               | 2021 or an in-depth answer from 2012? This is not meant to
               | imply that older answers offer greater depth, yet a heavy
               | bias towards recent material can produce that outcome in
               | some circumstances.
        
               | valvar wrote:
               | If you're using tools/languages that change rapidly (like
               | Kotlin, in my case), syntax from a few years ago will
               | often be outdated.
        
               | II2II wrote:
               | Yes, yet there are programming questions that go beyond
               | "how do I do X in language Y" or "how do I do X with
               | library Y". The language and library specific questions
               | are the ones where I would be less inclined to want
               | additional depth anyhow, well, provided they aren't
               | dependent upon some language or library specific
               | implementation detail.
        
               | rchaud wrote:
               | Your Google search results show the date on articles, do
               | they not? If people are more likely to click on
               | "Celebrity Net Worth (2021)" than "Celebrity Net Worth
               | (2012)", then the algo will update to favour those
               | results, because people are clicking on them.
               | 
               | The only definitive source on this would be the
               | gatekeeper itself. But Google never says anything
               | explicitly, because they don't want people gaming search
               | rankings. Even though it happens anyway.
        
               | WalterBright wrote:
               | Amazon search clearly does _not_ prioritize exact title
               | matches.
        
               | hodgesrm wrote:
               | > This got me thinking that maybe one of the other big
               | reasons for this is that the algorithms prioritize newer
               | pages over older pages.
               | 
               | Actually that's not always the case. We publish a lot of
               | blog content and it's really hard to publish new content
               | that replaces old articles. We still see articles from
               | 2017 coming up as more popular than newer, better
               | treatments of the same subject. If somebody knows the SEO
               | magic to get around this I'm all ears.
        
             | Dah00n wrote:
             | I'm not sure I agree with your example. It seems to me it
             | is the exact same as a "Top ten drinks to drink on a rainy
             | day" list. There are simply too many good albums and opinions
             | differ, so a top ten would -just like the drinks- end up
             | being a list of the most popular ones with maybe one the
             | author picks to stir some controversy or discussion. In my
             | opinion the world would be a smarter place if Google ranked
             | all such sites low. Then we might at least get fluff like
             | "Top ten prog rock albums if you love X, hate Y and listen
             | to Z when no one is around" instead.
        
               | dageshi wrote:
               | Google won't rank them low because they actually do serve
               | an important purpose. They're there for people who don't
               | really know what they want specifically, they're looking
               | for an overview. A top 10 gives a digestible overview on
               | some topic, which helps the searcher narrow down what
               | they really want.
               | 
               | A "Top 10 albums of all time" post is actually better off
               | going through 10 genres of popular music from the past 50
               | years and picking the top album (plus mentioning some
               | other top albums in the genre) for each one.
               | 
               | That gives the user the overview they're probably looking
               | for, whether those are the top 10 albums of all time or
               | not. It's a case of what the user searched for vs what
               | they actually really want.
        
             | bythckr wrote:
             | "The best minds of my generation are thinking about how to
             | make people click ads"
        
           | wwweston wrote:
           | It's also possible that it's the other way around: a certain
           | "common denominator" + algorithms that chase broad engagement
           | = mediocre results.
           | 
           | The real trick would be some kind of engine that can aim just
           | above where the user's at.
        
         | [deleted]
        
         | rasz wrote:
         | >followed by a listicle "8 Reasons Why Rome Fell"
         | 
         | but aren't you curious about the 7th reason? it will surprise
         | you!
        
           | coldtea wrote:
           | You won't believe how Claudius looks today!
        
             | kahrl wrote:
             | Doctors HATE him!!!
        
         | [deleted]
        
         | resynth1943 wrote:
         | Yeah, Google tends to send a lot of junk back.
        
         | purplefruit wrote:
         | Wow I used "personality test" and actually got useful articles
         | about personality theory. I'll actually use this!
        
         | hdjjhhvvhga wrote:
         | As long as few people use it, it will be great. Rest assured
         | that the moment it becomes popular, the people who want to game
         | it will appear.
        
           | marginalia_nu wrote:
           | Specialization is mostly a problem in monocultures.
           | 
           | If you almost only plant wheat, you are going to end up with
           | one hell of a pest problem.
           | 
           | If you almost only have Windows XP, you are going to have one
           | hell of a virus problem.
           | 
           | If you almost only have SearchRank-style search engines (or
           | just the one), you are going to have one hell of a content
           | spam problem.
           | 
           | Even though they have some pretty dodgy incentives, I don't
           | think Google suffers quality problems because they are evil,
           | I think ultimately they suffer because they're so dominant.
           | Whatever they do, the spammers adapt almost instantly.
           | 
           | A diverse ecosystem on the other hand limits the viability of
           | specialization by its very nature. If one actor is attacked,
           | it shrinks and that reduces the opportunity for attacking it.
        
           | Nextgrid wrote:
           | I don't think the existing media-heavy websites are gaming
           | Google to rank higher. It's that Google itself prefers media
           | heavy content; they don't have to "game" anything.
           | 
           | I also think a search engine like this would be quite hard to
           | game. An ML-based classifier trained on thousands of text-
           | heavy and media-heavy screenshots should be quite robust and
           | I think would be very hard to evade, so the "game" will
           | become more about how to _identify_ the crawler so you can serve
           | it a high-ranking page while serving crap to the real users,
           | and it seems fairly easy to defeat if the search engine
           | does a second pass using residential proxies and standard
           | browser user agents to detect this behavior (it could also
           | threaten huge penalties like the entire domain being banned
           | for a month to even deter attempts at this).
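           |
           | The second-pass check could look roughly like the sketch
           | below. It assumes the page text has already been fetched
           | twice, once as the crawler and once as an ordinary browser;
           | the function name and the 0.6 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def looks_cloaked(crawler_text: str, browser_text: str, threshold: float = 0.6) -> bool:
    """Flag a page whose crawler-facing and browser-facing text diverge."""
    # Compare word sequences so whitespace differences do not matter.
    matcher = SequenceMatcher(None, crawler_text.split(), browser_text.split())
    return matcher.ratio() < threshold


same_page = looks_cloaked("long essay about rome", "long essay about rome")
swapped_page = looks_cloaked("long essay about rome", "buy pills now cheap")
```

           | A real check would need to tolerate legitimate differences
           | such as rotating ads or timestamps, so the threshold would
           | have to be tuned on actual pages.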
        
             | fragmede wrote:
             | With the advances in machine text generation that looks
             | accurate but isn't _quite_ (aka GPT-3), it seems like
             | it would be _easily_ gamed (given access to GPT-3). Even
             | without GPT-3, if the content being prioritized is mere
             | text, I'm sure that for a pile of money, I could generate
             | something that looks like Wikipedia, in the sense that it's
             | a giant pile of mostly text, but it would make zero sense
             | to a human reader. (Building an SEO farm to boost ranking
             | of not-Wikipedia is left as an exercise for the reader.)
        
           | jandrese wrote:
           | This sort of optimization is why simple recipes are typically
           | found at the end of a rambling pointless blog post now.
           | 
           | Still, the best way to break SEO is to have actual
           | competition in the search space. As long as SEO remains
           | focused on Google there is an opportunity for these companies
           | to thrive by evading SEO braindamage.
        
             | SerLava wrote:
             | That's not really for SEO, which favors readily accessible
             | information.
             | 
             | That's ads. When mobile users have to scroll past 10 ads,
             | they'll click on some of them and make the blog money.
        
             | ggggtez wrote:
             | I've noticed this pattern start to pop up elsewhere. I've
             | started to train my skimming skills, skipping a paragraph
             | or two at a time to get past the fluff.
             | 
             | Like an article about some current event will undoubtedly
             | begin with "when I was traveling ten years ago...".
        
             | Aeolun wrote:
             | Searching for 'chocolate' on this search engine turned up a
             | surprisingly large amount of chocolate based recipes.
        
             | zerd wrote:
             | It's also because that's a way of trying to copyright
             | protect recipes, which are normally not copyright
             | protected.
             | 
             | > "Mere listings of ingredients as in recipes, formulas,
             | compounds, or prescriptions are not subject to copyright
             | protection. However, when a recipe or formula is
             | accompanied by substantial literary expression in the form
             | of an explanation or directions, or when there is a
             | combination of recipes, as in a cookbook, there may be a
             | basis for copyright protection."
        
               | JohnFen wrote:
               | But that copyright protection only extends to the
               | literary expression. The recipe itself is still not
               | covered by copyright, even if accompanied by an essay.
        
             | YeGoblynQueenne wrote:
             | >> This sort of optimization is why simple recipes are
             | typically found at the end of a rambling pointless blog
             | post now.
             | 
             | I continue to be curious about this kind of complaint. If
             | all you want is a recipe list, without any of the fluff,
             | why would you click on a link to a blog, rather than on a
             | link to a recipe aggregator?
             | 
             | Foodie blogs exist specifically for the people who want a
             | foodie discussion and not just an ingredients' list.
             | 
             | Is it because blogs tend to have better recipes overall? In
             | that case, isn't there a bit of entitlement involved in
             | asking that the author self-sacrificingly provides only the
             | information that you want, without taking care of their own
             | needs and wants, also?
        
               | Loughla wrote:
               | It's the same thing that people always complain about.
               | This thing is not in a format that I like, so it must be
               | not what anyone likes.
               | 
               | If you want JUST recipes, pay money instead of just
               | randomly googling around. America's Test Kitchen has a
               | billion vetted, really good recipes. That solves
               | that problem.
        
               | joegahona wrote:
               | I think the complaint is that those blogs rank higher
               | than nuts-and-bolts recipes now. It wasn't that way a few
               | years ago. Yes, scrolling down the results to Food
               | Network or Martha Stewart or whatever is possible, as is
               | going directly to those sites and using their site
               | search, but it's noticeable and annoying.
        
               | YeGoblynQueenne wrote:
               | Not my experience. For a very quick test, I searched DDG
               | for "omelette recipe", "carbonara recipe" and "peking duck
               | recipe" (just to spice it up a bit) and all my top
               | results are aggregators. Even "avgolemono recipe" (which
               | I'd think is very specialised) is aggregators on top.
               | 
               | To be honest, I don't follow recipes when I cook unless
               | it's a dish I've never had before. At that point what I
               | want is to understand the point of the dish. A list of
               | ingredients and preparation instructions don't tell me
               | what it's supposed to taste and smell like. The foodie
               | blogs at least try to create a certain... feeling of
               | place, I suppose, some kind of impression that guides you
               | when you cook. I wouldn't say it always works but I
               | appreciate the effort.
               | 
               | My real complaint with recipe writers is that they know
               | how to cook one or two dishes well and they crib the rest
               | off each other so even with all the information they
               | provide, you still can't reliably cook a good meal from a
               | recipe unless you've had the dish before. But that's my
               | personal opinion.
        
               | jandrese wrote:
               | Because when you search for a recipe you get the link to
               | the blog, not the aggregator.
        
             | WorldMaker wrote:
             | That sort of recipe blog hasn't happened just for SEO. It's
             | also a bit of a "two audiences" problem: if you are coming
             | to that food blogger from a search you certainly would
             | prefer the recipe first and then maybe any commentary on it
             | below if the recipe looks good. If you are a regular reader
             | of that food blogger you are probably invested in the
             | stories up top and that parasocial connection and the
             | recipes themselves are sometimes incidental to why you are
             | a regular reader.
             | 
             | You see some of that "two readers" divide sometimes even in
             | classic cookbooks, where "celebrity" chefs of the day might
             | spend much of a cookbook on a long rambling memoir.
             | Admittedly such books were generally well indexed and had
             | a table of contents to jump right to the recipes or
             | particular recipes, but the concept of "long personal
             | ramble of what these recipes mean to me" is an old one in
             | cookbooks too.
        
               | giantrobot wrote:
               | > If you are a regular reader of that food blogger
               | 
               | I think this assumes facts not in evidence. It certainly
               | seems like an overwhelming number of "blogs" are not
               | actual blogs but SEO content farms. There's no regular
               | readers of such things because there's no actual authors,
               | just someone that took a job on Fiverr to spew out some
               | SEO garbage. Old content gets reposted almost verbatim
               | because new results rank better according to Google.
               | 
               | The only reason these "blogs" exist is to show ads and
               | hopefully get someone's e-mail (and implied consent) for
               | a marke....newsletter.
        
               | WorldMaker wrote:
               | I know at least a few that I commonly see in top search
               | results that I have friends that read them like
               | personalized soap operas where most of the drama revolves
               | around food and family and serving food to family.
               | 
               | It's at least half the business models of Food Network
               | shows: aspirational kitchens and the people that live in
               | them and also sometimes here's their recipes. (The other
               | half being competitions, obviously.) I've got friends
               | that could deliver entire doctoral theses on the Bon
               | Appetit Test Kitchen (and its many YouTube shows and
               | blogs) and the huge soap operatic drama of 2020's events
               | where the entire brand milkshake ducked itself; falling
               | into people's hearts as "feel good" entertainment early
               | in 2020/the pandemic and then exploding very dramatically
               | with revelations and betrayals that Fall.
               | 
               | Which isn't to say that there _aren't_ garbage SEO farms
               | out there in the food blogging space _as well_, but a
               | lot of the big ones people commonly complain about seeing
               | in Google's results do have regular fans/audiences. (ETA:
               | And many of the smaller blogs _want_ to have regular
               | fans/audiences. It's an active influencer/"content creator"
               | space with relatively low barrier to entry that people
               | love. Everyone's family loves food, it's a part of the
               | human condition.)
        
               | run-types wrote:
               | I've basically never been taken to a recipe without a
               | rambling preamble from Google. While food blogs may serve
               | two audiences, a long introduction seems to be a
               | requirement to appear in the top Google search results.
        
               | WorldMaker wrote:
               | Personally, I think that has a lot more to do with the
               | fact that Google killed the Recipe Databases. There did
               | used to be a few startups that tried to be Recipe
               | Aggregators with advertising based business models, that
               | would show recipes and then link to source blogs and/or
               | cookbooks, and in the brief period where they existed
               | Google scraped them _entirely_ and showed entire recipes
               | on search results and ate their ad revenue out from under
               | them.
        
               | tomrod wrote:
               | That is a really bad thing by Google. Their core business
               | is not recipes.
        
               | kwertyoowiyop wrote:
               | Their core business is making money from other people's
               | content, no matter what it is.
        
               | WorldMaker wrote:
               | Their core business is advertising and they have always
               | been in a direct conflict-of-interest by competing with
               | content sites for ad revenue buys.
        
               | dspillett wrote:
               | Such databases would get battered by demands to remove
               | content these days, if not already back then. No one wants
               | a database listing their stuff for ad revenue like that
               | because many wouldn't follow the links to see _their_
               | adverts or be subject to _their_ tracking.
               | 
               | A couple of browser add-ons specifically geared around
               | trimming recipe pages down have been taken down due to
               | similar complaints.
        
               | inanutshellus wrote:
               | I see your point, but argue you've misidentified the two
               | audiences.
               | 
               | One audience matches your description and is the invested
               | reader. They want _that_ blogger's storytelling. They
               | might make the recipe, but they're a dedicated reader.
               | 
               | The other audience is not the recipe-searcher, but
               | instead Google. Food bloggers know that recipe-searchers
               | are there to drop in, get an ingredient list, and move
               | on. They won't even remember the blog's name. So the site
               | isn't optimized for them. It's optimized for Google.
               | 
               | "Slow the parasitic recipe-searcher down. They're
               | leeches, here for a freebie. Well they'll pay me in
               | Google Rank time blocks."
        
             | xtracto wrote:
             | That's why I use Saffron [1], it magically converts those
             | sites into a page in my recipe book. I found it when the
             | developer commented here on HN. Also, a lot of cooking
             | websites have started to add a link with "jump to recipe"
             | functionality allowing you to skip all the crap.
             | 
             | [1] https://www.mysaffronapp.com/
        
               | Funes- wrote:
               | There's also https://based.cooking.
        
               | eigengrau5150 wrote:
               | Run by Luke Smith, an admitted neo-reactionary and
               | possible white supremacist who writes like a 4chan
               | reject.
        
           | the_other wrote:
           | If there were a wider variety of popular search engines, with
           | different ranking criteria, would sites begin to move away
           | from gaming the system? Surely it would be too hard to game
           | more than one search engine at a time?
        
             | Nasrudith wrote:
             | It would be a matter of numbers anyway about which they
             | optimize for. A/B testing is already in place and doesn't
             | care about where it comes from, just which one does better.
        
           | new_guy wrote:
           | > the people who want to game it will appear.
           | 
           | So just add human review to the mix, if a site is obviously
           | trying to game the system (listicles, seo spam etc) just drop
           | and ban them from the search index.
        
             | hdjjhhvvhga wrote:
             | Congratulations, you've just invented negative SEO.
        
           | phendrenad2 wrote:
           | There should be some perfect balance where this search engine
           | is N% as popular as Google, where Google soaks up all of the
           | gamifiers, but this search engine is still popular enough to
           | derive revenue and do ML and other search-engine-useful
           | stuff.
        
         | eterevsky wrote:
         | Imagine if you were looking for the movie.
        
           | neltnerb wrote:
           | Imagine including the search term "movie".
        
             | yreg wrote:
             | That doesn't do anything useful.
        
           | lucideer wrote:
           | I tend to prefer Wikipedia for movies. The exception is actor
           | headshots if I'm trying to identify someone, which Wikipedia
           | lacks for licensing reasons, but otherwise Wikipedia tends to
           | be better than IMDB for most needs. Wikipedia has an IMDB
           | link on every article anyway.
           | 
           | Another need I guess might be reviews, for which RT or MC are
           | better than IMDB: not sure if either of those two will fare
           | better than IMDB in this search engine but again Wiki has
           | links out (in addition to good reception summaries)
        
             | mountainboy wrote:
             | For me, imdb was much better when they had user
             | comments/discussion.
             | 
             | I never even posted on it myself, but browsing the
             | discussions one could learn all sorts of trivia, inside
             | info, speculation, etc about each movie.
             | 
             | Since they (inexplicably) killed that feature, I rarely
             | even visit anymore. You're right, for many purposes wikipedia
             | is better, especially for TV series episode lists with
             | summaries.
        
               | ncphil wrote:
               | IMDB management thought it was their brilliant editorial
               | work that drew people to their site. Morons. It was the
               | comments all along. Of course they also believed they
               | could create gravity-free zones by sheer force of
               | executive will (and maybe still do).
        
               | _blu wrote:
               | Especially for old and lesser known movies, the
               | discussion board for the movie was a brilliant addition
               | that could give the movie an extra dimension. Context is
               | very important in order to understand, and ultimately
               | enjoy something.
               | 
               | I think they removed it in part because new movies, like
               | star wars and superhero movies, had a lot of negative
               | activity.
        
             | sellyme wrote:
             | I find IMDb to be more convenient than RT/MC/Wikipedia for
             | finding release dates of movies - nearly every other
             | website lists only the American release date, maybe one or
             | two others if the movie was disproportionately popular in
             | certain regions.
        
           | shuntress wrote:
           | _?q=imdb.com:fall of the roman empire_
        
           | jazzyjackson wrote:
           | !imdb
        
           | MisterTea wrote:
           | Then you'd use a different search engine. Why does everything
           | have to be a Swiss Army knife?
        
             | zozbot234 wrote:
             | Or you could just search for 'rome movie'. Though for more
             | complex disambiguation you would need to resort to, e.g.
             | schema.org descriptions (which are supported by most search
             | engines, and the foundation for most "smart" search result
             | snippets).
        
             | eterevsky wrote:
             | That's a fair point. This engine would be useful if you
             | need grep over the internet (but without regexes), i.e. when
             | you want to find exact phrases. But that's a relatively
             | narrow use case.
        
         | psadri wrote:
         | Interesting choice of search topic. Are you trying to make an
         | additional point?
        
         | hn_throwaway_99 wrote:
         | I had the exact opposite experience. I searched the site for
         | "java", got a Wikipedia link first (for the island, not the
         | programming language), and the 2nd result was to a random JEP
         | page, and all the rest of the results were random tidbits about
         | Java (e.g. "XZ compression algorithm in Java"). Didn't get any
         | high level results pointing to an overview of the language,
         | getting started guides, etc.
        
           | withinboredom wrote:
           | You need to use some old school search techniques and search
           | for "Java overview"
        
           | _wldu wrote:
           | I'm not sure that's a bad thing.
        
           | rovr138 wrote:
           | well, they're results to java related items...
           | 
           | What kind of links were you expecting to find?
        
         | SPBS wrote:
         | Cool, it appears that the trend towards JS may be causing self-
         | selection -- if a page has a high amount of JS, it is highly
         | unlikely to contain anything of value.
        
           | dv_dt wrote:
           | If one could create a metric of ad-to-content ratio from the
           | JS used, I would guess that would be a nice differentiator
           | too.
        
           | Dah00n wrote:
           | Huh. A weighted algorithm, somewhere between Google and the
           | one linked, where you could subtract from sites by amount of
           | JavaScript might be interesting.
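
A minimal sketch of the weighting idea floated in the two comments above, assuming a base relevance score already exists. The names `script_ratio` and `penalized_score` and the `weight` value are hypothetical and not taken from any real engine; only inline script is measured, so externally loaded scripts are invisible to it.

```python
from html.parser import HTMLParser


class ScriptWeigher(HTMLParser):
    """Counts characters inside inline <script> tags versus other text."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_chars += len(data)
        else:
            self.text_chars += len(data.strip())


def script_ratio(html):
    """Fraction of counted characters that are inline script (0.0 to 1.0)."""
    parser = ScriptWeigher()
    parser.feed(html)
    total = parser.script_chars + parser.text_chars
    return parser.script_chars / total if total else 0.0


def penalized_score(relevance, html, weight=0.5):
    """Hypothetical ranking: subtract a JavaScript penalty from a base score."""
    return relevance - weight * script_ratio(html)
```

With `weight=0` this reduces to the plain relevance score, and larger weights move it toward a "punish JavaScript" ranking.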
        
           | sjtindell wrote:
           | True. Unfortunately many large corporate websites through
           | which you pay bills, order tickets, etc. are becoming
           | infested with JS widgets and bulky, slow interfaces. These
           | are hard to avoid.
        
             | artificial wrote:
             | Conversely no software to install. Browser as a platform.
             | Don't have to boot to Windows to pay your bills with
             | activex for example
        
               | foxfluff wrote:
               | The mostly JS-less web was fine, fast, and reliable 20
               | years ago and I never had ActiveX.
               | 
               | I hear stories about Flash and ActiveX but I literally
               | never needed these to shop or pay bills online. Payments
               | also didn't require scripts from a dozen domains and four
               | redirects..
        
               | TeMPOraL wrote:
               | The platform isn't the problem. The problem is with the
               | amount of code that does something other than letting you
               | "pay bills, order tickets, etc.".
        
           | hinkley wrote:
           | Browsers should be cherry picking the most compelling things
           | that people accomplish with complex code and supporting them
           | as a native feature. Maybe the Browser Wars aren't keeping up
           | anymore.
        
           | lugged wrote:
           | Was that ever in doubt?
        
       | zimpenfish wrote:
       | Searched for my initials - got back a bunch of raw binary results
       | (mp4, pdf, img, txz, etc.) which was disconcerting. Although it
       | did find one reference to actual-me which is better than Google
       | manages on the first 4 pages...
       | 
       | https://imgur.com/a/n2xro2Y
        
         | marginalia_nu wrote:
         | Yeah, there was unfortunately a problem with the content-type
         | code recently: it categorized some binary data as HTML and
         | tried to process it best-effort. So there's some binary soup
         | in the index.
         | 
         | The bug has since been fixed, but the fix won't take effect
         | for a few weeks.
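
One way a crawler might guard against this class of bug is to sniff the first bytes of a response before treating it as HTML. The sketch below is an assumption-laden illustration, not the engine's actual fix: `looks_like_html` and the signature list are hypothetical, and the list is far from exhaustive.

```python
# A few well-known magic-number prefixes for binary formats (not exhaustive).
BINARY_SIGNATURES = (
    b"%PDF-",                # PDF
    b"\x89PNG\r\n\x1a\n",    # PNG
    b"\xff\xd8\xff",         # JPEG
    b"PK\x03\x04",           # ZIP
    b"\xfd7zXZ\x00",         # XZ
)


def looks_like_html(body, declared_type="text/html"):
    """Return True only if the declared type is HTML and the bytes look textual."""
    if not declared_type.lower().startswith("text/html"):
        return False
    head = body[:512]
    if head.startswith(BINARY_SIGNATURES):
        return False
    # NUL bytes almost never occur in HTML served as UTF-8 or Latin-1.
    # Assumption: UTF-16 pages are rare enough to be rejected here too.
    if b"\x00" in head:
        return False
    return True
```

Documents that fail the check could be skipped rather than processed best-effort.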
        
       | ape4 wrote:
       | Fast and doesn't crash when on the front page of Hacker News!
        
         | ricardo81 wrote:
         | That crossed my mind too, considering vanilla webpages
         | sometimes struggle with a top page HN thread, never mind a
         | search engine backend.
        
         | marginalia_nu wrote:
         | Well,... yet. Load average is at 1.2, not that bad. But the
         | services are getting a solid workout.
        
           | marginalia_nu wrote:
           | The real test is now: the index server is reconstructing its
           | index. It does this every 6 hours if there are new pages.
           | Takes half an hour or so usually.
           | 
           | It's supposed to be able to handle searches at the same time,
           | but jeepers, it's gonna have to chew through nearly 400 Gb of
           | data while dealing with over 1 request per second.
        
             | 0xbadcafebee wrote:
             | Is your site/code on GitHub? I would be happy to give
             | performance tips/tweaks. Also Fyi, https://marginalia.nu/
             | gives a certificate error (I know that's not the search
             | site)
        
       | criddell wrote:
       | Have you given any thought on what you will do if you get a DMCA
       | take down request or a request from a person asking you to remove
       | them from search results?
        
       | AlexCoventry wrote:
       | I don't really care about a website's design, as long as it gets
       | out of the way of me reading it.
        
       | mrkramer wrote:
       | Awesome work! I had a similar idea in mind but I'm glad to see
       | someone else was able to pull it off.
        
       | sailorganymede wrote:
       | Love this! Is there any way someone could help contribute to
       | this?
        
       | palijer wrote:
       | This has been needed in my life for a while. I am growing really
       | apathetic about the internet lately, but I realize that is
       | because my entry point is always a google search.
       | 
       | I miss finding blog posts and scholarly articles in long form. I
       | hate the SEO sites with unreadable UI because the information in
       | them is often a lot lower quality as well.
        
       | raflemakt wrote:
       | I tried two searches in Norwegian ("norsk ordbok" [norwegian
       | dictionary] and "stortinget" [the parliament]), and they both
       | returned many extreme or "alternative" websites. It was
       | especially striking that the neo-nazi group Vigrid's website was
       | the top hit for both searches. Maybe these sites just have less
       | modern web design?
        
         | marginalia_nu wrote:
         | Yeah this is actually a bit of a concern of mine.
         | 
         | As very much a friend of Voltaire's, I don't think it's my
         | place to police people's opinions no matter how disagreeable,
         | but I also don't want my search engine to become branded as the
         | search engine of choice for nazis because it's decent at
         | cataloguing extremist sites.
        
       | claytn wrote:
       | Searching for your own name will turn up some interesting
       | results! I got some early 90s webpages that just contain
       | obituaries or marriage records. I never knew cities maintained
       | these records online!
        
       | xtiansimon wrote:
       | Does this also penalize pages with tons of ads and three
       | paragraphs of text? Or anything from Medium?
        
       | IlliOnato wrote:
       | Pretty cool. I am not sure yet how useful, but cool it is.
       | 
       | However, it seems that it currently does not support non-Latin
       | alphabets. Which I understand in an early version. Still, its
       | handling of such "exception cases" could be improved:
       | 
       | when I search for a Russian word, say "Akvarium", I get <<Search
       | "Akvarium" needs to be a word>>, which is rather rude...
        
         | foxfluff wrote:
         | "It also focuses on websites in English, Swedish and Latin and
         | tries to identify and ignore the rest (best-effort)."
         | 
         | https://news.ycombinator.com/item?id=28551183
        
           | IlliOnato wrote:
           | Fine; still "Not a supported language" would be much better
           | response than "not a word".
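
A rough sketch of the friendlier error message suggested above, assuming the engine only accepts Latin-script queries. The function name and message strings are hypothetical; the check relies on Unicode character names from Python's `unicodedata` module.

```python
import unicodedata


def query_error(query):
    """Return a user-facing error for unsupported input, or None if it looks fine."""
    letters = [ch for ch in query if ch.isalpha()]
    if not letters:
        return "Search needs to contain at least one word"
    for ch in letters:
        # Unicode names of Latin letters start with "LATIN", including
        # accented ones such as "LATIN SMALL LETTER A WITH RING ABOVE".
        if not unicodedata.name(ch, "").startswith("LATIN"):
            return "Not a supported language"
    return None
```

A Cyrillic query would then be reported as an unsupported language instead of "not a word".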
        
       | beepbooptheory wrote:
       | need a duck duck go `bang!` for this
        
       | jordache wrote:
       | how about a search engine that bans all pinterest content.
       | 
       | I hate pinterest with a passion. I may need to get a "his" laptop
       | separate from my wife, since she needs that darn pinterest
       | extension for pinning photos.
        
       | AQXt wrote:
       | I searched for "giraffe evolution" (without quotes) and received
       | the following links on the first page:
       | 
       | - _Evolutionist scientists say the theory is unscientific and
       | worthless_
       | 
       | - _Seven Mysteries of Evolution_
       | 
       | - _OTHER EVIDENCE AGAINST EVOLUTION_
       | 
       | - _Evolution Falsified_
       | 
       | Not a single result about the evolution of giraffes...
        
         | phendrenad2 wrote:
         | Unfortunately, as you've discovered, giraffes are often used by
         | crackpots to try to disprove evolution. Google seems to get
         | around this by heavily boosting known authoritative sources
         | like National Geographic and NIH. But, sadly, those are
         | JS/image heavy sites.
        
       | mattowen_uk wrote:
       | Nice! Typing my name in, gets my own site back as 3 of the top 5
       | results. I suddenly feel important ;)
        
       | schmorptron wrote:
       | I've also found that brave search gets much better results than
       | google for some programming related topics, simply by not being
       | targeted by blogspam SEO as much. It's refreshing to not have to
       | click through 3 auto generated "articles" but to either a) get
       | the documentation straight away or b) find actually human written
       | blog entries.
        
       | arethuza wrote:
       | I did a quick check using the name of the Scottish village I am
       | originally from (or as I should say "far am fae") and this
       | produced a _much_ more interesting set of links for me than that
       | produced by Google
        
       | abhiminator wrote:
       | Absolute textgasm.
       | 
       | Wonder how 'text-only first' prioritization is being implemented,
       | algorithmically speaking?
        
         | ryankrage77 wrote:
         | This post - https://news.ycombinator.com/item?id=28551183 -
         | suggests it's a simple set of heuristics, looking for things
         | like javascript, link/SEO spam, language, amount of text
         | content, etc, filtering out unwanted results and only indexing
         | wanted ones.
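
As an illustration only, a filter built from the kinds of signals listed above might look like the following. Every threshold and field name here is an assumption made for the sketch; the language set reflects the "English, Swedish and Latin" focus quoted elsewhere in the thread, and nothing else is known about the real rules.

```python
from dataclasses import dataclass


@dataclass
class PageFeatures:
    word_count: int       # amount of text content
    script_tags: int      # how much JavaScript the page pulls in
    outbound_links: int   # crude proxy for link/SEO spam
    language: str         # ISO 639-1 code guessed by some language detector


ALLOWED_LANGUAGES = {"en", "sv", "la"}  # English, Swedish, Latin


def should_index(page, min_words=200, max_scripts=5, max_links_per_100_words=10.0):
    """Hypothetical gate: index only text-heavy, script-light pages."""
    if page.language not in ALLOWED_LANGUAGES:
        return False
    if page.word_count < min_words:
        return False
    if page.script_tags > max_scripts:
        return False
    link_density = page.outbound_links / max(page.word_count, 1) * 100
    return link_density <= max_links_per_100_words
```

The point of the sketch is that each rule is a cheap, independent check, so unwanted pages can be dropped before anything expensive happens.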
        
           | abhiminator wrote:
           | Thank you!
        
       | michaelcampbell wrote:
       | (Old man yells at cloud.)
        
         | [deleted]
        
       | yrds96 wrote:
       | Damn, that's an interesting search engine. It's great for
       | searching simple terms and finding a bunch of blog articles
       | about the term.
        
       | turtlebits wrote:
       | Great for a text-focused site- however, the results are a bit
       | confusing. Would help if there were more details on the criteria
       | used for a site to be included in the index.
       | 
       | Suggestion - Use system fonts (the site downloads almost 300k of
       | fonts)
        
       | lishzen wrote:
       | Love it! Searched for Shigeru Miyamoto and this was the third
       | result: https://www.glitterberri.com/developer-
       | interviews/miyamoto-h...
        
       | wizzwizz4 wrote:
       | Add an OpenSearch file: https://developer.mozilla.org/en-
       | US/docs/Web/OpenSearch
       | 
       | Maybe something like:
       | 
       |     <OpenSearchDescription
       |         xmlns="http://a9.com/-/spec/opensearch/1.1/"
       |         xmlns:moz="http://www.mozilla.org/2006/browser/search/">
       |       <ShortName>Marginalia</ShortName>
       |       <Description>Marginalia Search - a search engine that favors text-heavy sites and punishes modern web design</Description>
       |       <InputEncoding>UTF-8</InputEncoding>
       |       <Image width="16" height="16" type="image/x-icon">https://search.marginalia.nu/favicon.ico</Image>
       |       <Url type="text/html" template="https://search.marginalia.nu/search?query={searchTerms}" />
       |       <moz:SearchForm>https://search.marginalia.nu/</moz:SearchForm>
       |     </OpenSearchDescription>
       | 
       | and then add:
       | 
       |     <link rel="search"
       |           type="application/opensearchdescription+xml"
       |           title="Marginalia"
       |           href="/opensearch.xml" />
       | 
       | to your page's <head>.
        
         | marginalia_nu wrote:
         | It's been added a few moments ago :-)
         | 
         | Thanks everyone who suggested this.
        
       | tentacleuno wrote:
       | I'm all for nostalgia but IMO the information should be the most
       | important thing. Presentation is a close second though, and I can
       | kind of get behind this project when I see what the web is like
       | without an ad-blocker.
        
       | marcosdumay wrote:
       | I've got some results where the same site has more than 70% of
       | the links. It was a very on topic and high quality site, but
       | still, all the results shouldn't point to the same place.
       | 
       | I think some grouping by site (and capping to only the few most
       | relevant links there) would improve the engine.
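
The capping suggested above can be sketched in a few lines, assuming results arrive as a list of URLs already sorted by relevance. `cap_per_site` and its default limit are hypothetical names and values.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_per_site(ranked_urls, max_per_site=2):
    """Keep at most max_per_site results per hostname, preserving rank order."""
    seen = defaultdict(int)
    kept = []
    for url in ranked_urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < max_per_site:
            seen[host] += 1
            kept.append(url)
    return kept
```

Because the input order is preserved, the links that survive for each site are its most relevant ones.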
        
       | aabajian wrote:
       | Gotta say, sometimes the results really are nice. I searched for
       | "Land Cruiser 70." The first result is a simple, short blog post
       | about a couple who traveled across Europe and Asia in their Troop
       | Carrier (http://www.destoop.com/trip/1%20PREPARATION/2%20Vehicle%
       | 20sp...).
       | 
       | The first results on Google are Australian sites for buying an
       | LC70 (news flash: I can't buy one in the USA). There is also a
       | MotorTrend article about the LC70...also irrelevant since it's
       | only sold in Australia.
        
       | dzhiurgis wrote:
       | Search for playwright waitForSelector and you land on a pretty
       | useless page. I'm all in for text websites, but something like
       | playwright.dev documentation is top notch - fuzzy search being
       | a key thing.
        
         | marginalia_nu wrote:
         | Yeah I wasn't really planning for this to blow up like it did
         | today. It's currently sitting at about 35% of the index size I
         | usually aim for, so besides the stuff I can't index because
         | it's behind CDNs, there's a lot of pages it just hasn't gotten
         | to yet. playwright.dev is pretty low on the priority list
         | because it has a metric crap-ton of javascript on its front
         | page. The crawler has visited it, looked at it, and put it very
         | far down the priority queue.
        
           | soheil wrote:
           | Even though some sites have a metric crap-ton of js they
           | sometimes render very minimally for certain screen sizes or
           | mobile devices without any of the js crap. Does your crawler
           | pay attention to any of that?
        
             | marginalia_nu wrote:
             | It doesn't look at what the javascript does, just how much
             | there is.
        
       | rukuu001 wrote:
       | This is great. The results for my search were like a suggested
       | reading list.
        
       | arnaudsm wrote:
       | I wish we could configure Google's algorithm to our needs, and
       | blacklist websites.
        
         | MarioMan wrote:
         | It could get tedious depending on how many sites you want to
         | block, but you can add "-site:google.com" to exclude
         | google.com, for instance.
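
To make the tedium manageable, the exclusions can be appended programmatically. This small sketch only builds the query string using the `-site:` operator mentioned above; the function name and the blocklist contents are hypothetical.

```python
def with_exclusions(query, blocked_domains):
    """Append a -site: term for every domain in a personal blocklist."""
    exclusions = " ".join(f"-site:{domain}" for domain in blocked_domains)
    return f"{query} {exclusions}".strip()
```

For example, `with_exclusions("chicken stir fry", ["pinterest.com"])` produces `chicken stir fry -site:pinterest.com`, which can then be pasted into or sent to the search engine.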
        
           | arnaudsm wrote:
           | I mean a blacklist system like Twitter's, where you block a
           | website forever. Pinterest would be the first to go.
        
       | marto1 wrote:
       | Super! How far along would you say you are in indexing the
       | blogosphere? I tried the engine a few times, but I mostly get
       | academic papers and I know most (good) blogs are in fact text-heavy.
        
       | Method5440 wrote:
       | A big fan of your work! Just wanted to let you know that iOS
       | devices provide quotes as “ rather than " - you may need to
       | support the character “ or at least let people know that iOS is
       | not supported etc... right now I get a generic character error.
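
iOS keyboards commonly substitute typographic (curly) quotes for straight ones, so one possible fix is to normalize them before parsing the query. This is a sketch under that assumption; `normalize_quotes` is a hypothetical helper, not part of the engine.

```python
# Map typographic quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_quotes(query):
    """Replace curly quotes with straight ones so phrase searches parse."""
    return query.translate(QUOTE_MAP)
```

Running every incoming query through this step would let phrase searches typed on a phone behave the same as on a desktop keyboard.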
        
       | phreack wrote:
       | One nitpick that kind of bothered me - on a large desktop
       | monitor, the results page was like 70% whitespace margins with
       | the results squished in the middle like a portrait cellphone.
       | Hopefully it's easy to fix, I like to research at home and this
       | website could help a lot!
        
       | seoulmetro wrote:
       | This would be awesome if the search actually worked. Typed in
       | 'runescape' and expected a few websites left over from the early
       | 2000s. But I got nothing, just a lot of hits to other keywords.
        
         | marginalia_nu wrote:
         | Huh. That's an interesting case you've found.
         | 
         | I think part of the problem is gold sellers were displacing all
         | the good results. I blocked a few of them and got a few more
         | relevant results, but it's not great.
         | 
         | But still, the search engine only finds two great hits. I
         | wonder why not. Maybe there just aren't that many runescape
         | pages around still? Or it may just be that it hasn't found
         | anything better yet. The index is pretty shallow right now,
         | only 20M URLs, I aim for more than double.
         | 
         | I honestly wasn't planning on this blowing up on HN at this
         | stage.
        
       | yakubin wrote:
       | Wikipedia links point to <https://encyclopedia.marginalia.nu/>
       | instead, which to my eyes is less readable. The justified text,
       | done with CSS, instead of the LaTeX algorithm, looks wild. The
       | font used for quotations is even worse (very thin).
       | 
       | Wikipedia is perfectly usable without JavaScript and it's one of
       | the nicest sites out there typography-wise, so I'd reconsider
       | this redirection.
        
         | marginalia_nu wrote:
         | I guess it's a matter of taste. I can barely read anything on
         | regular wikipedia because the inline links disrupt my flow.
        
       | wolpoli wrote:
       | I wish niche search engines had an option to group results by
       | domain names. There are a few major sites that dominate Google
       | search results with low effort content. As long as Google stands
       | as the largest search engine, it's unlikely that these major
       | sites will want to rearchitect themselves into different domain
       | names.
        
       | gjm11 wrote:
       | I tried a few searches.
       | 
       | <<javascript pipe syntax>>: none of the search results appeared
       | to have anything to do with Javascript pipe syntax. (Which
       | doesn't exist yet, but it's under discussion.) Google gives a
       | bunch of highly-relevant results.
       | 
       | <<hans reichenbach relativity>>: first result is a list of books
       | about relativity, one of which is Reichenbach's "Philosophy of
       | space and time"; good, but there's no real _information_ there.
       | Second is about Reichenbach but nothing to do with relativity or
       | even, really, philosophy of science. Third is about philosophy of
       | science and mentions some of Reichenbach's work but not related
       | to relativity. Fourth mentions Reichenbach's "Philosophy of space
       | and time" as part of a list of books relevant to a seminar on
       | "time and eternity". None of this is _bad_, but it's not great
       | either. Google gives a couple of online philosophy encyclopaedia
       | entries, then a journal article on "Hans Reichenbach's relativity
       | of geometry", then the Wikipedia article on Reichenbach ... much
       | more informative.
       | 
       | <<luna lovegood actress>>: I thought this would be an easy one.
       | It was easy for Google, which gave me her name in large friendly
       | letters at the top, then her IMDB entry, and a bunch of other
       | relevant things. Literally nothing in the Marginalia results was
       | relevant to the query.
       | 
       | I guess maybe popular culture is just too monetizable, so no one
       | is going to write about it on the sites that Marginalia crawls?
       | Let's try some slightly less popular culture.
       | 
       | <<wilde "a handbag">>: First result is kinda-relevant but weird:
       | it's about a musical adaptation of _The Importance of Being
       | Earnest_. It doesn't mention that famous line from the play, but
       | one of the numbers in the musical has the words "a handbag" in
       | the title. Second result is a review of a CD of musicals,
       | including the same work. Third is a bunch of short reviews of
       | theatrical items from the Buxton Festival Fringe, one of which is
       | a three-man adaptation of TIOBE. Next four are 100% irrelevant.
       | Next is a list of names of plays. Last one is actually relevant;
       | it's an article about "Lady Bracknell through the decades".
       | Google puts that one first (after, sigh, a bunch of YouTube
       | videos which look as if they might actually be relevant).
       | 
       | I really like the _idea_ of this, and many of the things it turns
       | up look like they might be interesting, but it isn't doing very
       | well at producing results that are actually relevant to the thing
       | being searched for.
        
         | cowvin wrote:
         | i don't know, people here like to complain about google, but
         | google still works pretty well for me.
        
         | exikyut wrote:
         | TIL about https://github.com/tc39/proposal-pipeline-operator,
         | which I am immediately looking forward to playing with once it
         | gains traction Some Time From Now(tm)
         | 
         | (I have no earnest reason to transpile)
        
           | seph-reed wrote:
           | Yeah, this seems pretty nice. I don't think the "deep
           | nesting" issue is quite so realistic... I very rarely have a
           | logic tree that's easier to identify by its leaves than its
           | root. And I'd really hate to have code where you have to
           | scroll to the end of a bunch of pipes to figure out what
           | they're adding up to
           | 
           | But I have plenty of single use "temp variables" and cutting
           | those out could be cool.
        
         | silent_cal wrote:
         | To be fair, those searches are pretty weird.
        
           | JxLS-cpgbe0 wrote:
           | We don't _search_ for things because they're easy to find
        
             | colinmhayes wrote:
             | I mean most of my searches are probably pretty easy to
             | find, I just don't want to go to the website I'm thinking
             | of and click through 5 pages to get there.
        
           | HaloZero wrote:
           | The pop culture one is fairly common. Me and my wife both
           | search "who the fuck is that" in that TV show movie all the
           | time. Or who is the author of X book?
        
             | cormacrelf wrote:
             | It's trying to surface long articles and you're asking it
             | for a one word answer. What did you expect? A long article
             | consisting of "Emma Stone played Cruella" repeated 800
             | times?
        
             | gibspaulding wrote:
             | I think perhaps the usefulness here is less finding what
             | you're looking for, but rather finding something
             | interesting.
        
           | dwaltrip wrote:
           | They seem reasonable to me.
        
       | exporectomy wrote:
       | Wow. I tested it on recipes which Google has destroyed and this
       | was the first result, a simple clear recipe:
       | 
       | http://demont.myds.me/leerecipes/mainmeals/mainmeals1/chicke...
       | 
       | compared to Google's endless drivel of "This chicken stir fry
       | recipe will become a staple in your home. It's so quick to make
       | and you can use whatever vegetables you have on hand. It tastes
       | wonderful regardless of how you alter the ingredients. ... " JUST
       | SHUT UP AND GIVE ME THE RECIPE!
       | 
       | https://natashaskitchen.com/chicken-stir-fry-recipe/
        
         | srcreigh wrote:
         | Fwiw this website, Natasha's kitchen, seems to be one of the
         | more performant ad filled recipe websites.
         | 
         | Make use of the "Jump to recipe" button to get to the recipe
         | faster.
        
         | 1f60c wrote:
         | This isn't entirely Google's fault. Recipes on their own aren't
         | copyrighted in the US, and adding this fluff text is a way
         | around that.
        
           | AdamN wrote:
           | Yes it is. They're giving a lower quality result to the user
           | (their customer ... but not for long if competitors can get
           | just a little bit better)
        
             | isubkhankulov wrote:
             | A small nit: I think Google's customers are actually the
             | companies paying for ads. It's an important distinction and
             | probably explains why their search results' quality has
             | gone down
        
           | mdoms wrote:
           | It is entirely Google's fault. Google knows people don't want
           | this dreck (everyone knows it) but still serves it up.
        
       | b-x wrote:
       | Too bad it rejects non-Latin words, as if the definition of
       | "text" is a sequence of alphabetical letters originated from
       | Latin.
       | 
       | I thought that we've reached the time to embrace all cultures in
       | the world, but this retrogressive engine proves that most _modern
       | tech_ designers are myopic about other civilizations in the
       | globe.
        
         | fghfghfghfghfgh wrote:
         | It's one guy. Making a useful tool. It even has an altruistic
         | purpose.
         | 
         | Shame on you for twisting a well intended effort into a
         | negative statement that suits your narrow identity political
         | world view.
        
           | b-x wrote:
           | Less insulting error messages would be more welcome than
           | casting out others without any consideration.
           | 
           | You may distribute "shame" however you want, but this only
           | helps enforcing the damaging insults and amplifying them.
        
         | hombre_fatal wrote:
         | No, it just proves that a one-man hobby project with finite
         | resources found it reasonable to restrict the scope.
         | 
         | Maybe when they find out they're an immortal billionaire they
         | can build all the additional things you've entitled yourself to
         | expect from the freely shared work of others.
        
         | marginalia_nu wrote:
         | Understand that this is something I built for myself, by
         | myself, so it focuses on languages I understand. It's hosted on a
         | single consumer grade computer in my living room. I built it
         | out of pocket and anyone is free to use it. Does this make me a
         | villain?
         | 
         | If I can do this, what's preventing some guy in Japan or India
         | or Peru from doing the same, of course focusing on their
         | languages?
        
           | b-x wrote:
           | Maybe a better suited choice for errors than an insulting
           | message: when I provided a query in my native language it
           | regurgitated the error "needs to be a word" instead of more
           | acceptable "not a supported language".
           | 
           | When you claim that a word in some other culture is not "a
           | word", just because it's not recognized by your machine,
           | that's demeaning to say the least.
        
             | marginalia_nu wrote:
             | Again it's a one man hobby project, I don't have a team of
             | people to go through every formulation and every error
             | message to ensure nobody can read them in a way that
             | offends them. It's just me, writing code on an unfinished
             | project that HN discovered.
             | 
             | In this case, the code doesn't match the word regexp, like
             | it may be a @TwitterHandle or a "comp.lang.c" with periods
             | in it, or an unsupported Unicode range. It doesn't know why
             | it is not matching, just that it doesn't.
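The situation described above, a single word regexp that rejects a term without knowing why, can be sketched as follows. This is only an illustration and not the search engine's actual code: the pattern `WORD_RE`, the accepted character range, and the function `explain_rejection` are all assumptions. It shows one way to separate unsupported letters from punctuation so the message can be more specific.

```python
import re

# Hypothetical word pattern: ASCII letters and digits plus Latin-1 letters.
# The real pattern used by the search engine is not shown in the thread.
WORD_RE = re.compile(r"[0-9A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")


def explain_rejection(term):
    """Return None if the term is accepted, otherwise a user-facing reason."""
    if WORD_RE.fullmatch(term):
        return None
    if not term:
        return "the search term is empty"
    # Find the first character the pattern does not accept and classify it.
    for ch in term:
        if not WORD_RE.fullmatch(ch):
            if ch.isalpha():
                # A letter, just not from a supported range (e.g. CJK, Cyrillic).
                return f'the term "{term}" contains characters that are not currently supported'
            # Not a letter at all: "@", ".", and similar.
            return f'the term "{term}" contains unsupported punctuation ({ch!r})'
```

With this split, "comp.lang.c" and "@TwitterHandle" get a punctuation message, while a query in an unsupported script gets the "not currently supported" wording.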
        
               | b-x wrote:
               | I must congratulate you on this achievement. That's
               | certainly a useful take on search.
               | 
               | Nonetheless, even when coding, one should also consider
               | thoroughly the UX and how it would be addressing the
               | others.
               | 
               | Saying "unsupported word" is much more sympathetic than
               | "needs to be a word" (where you define what a "word" is,
               | and the general user is unaware of such definition).
        
               | marginalia_nu wrote:
               | Fair point, I refined the phrasing a bit.
               | 
               | > The term "" contains characters that are not currently
               | supported
        
               | soheil wrote:
               | Stop feeding the trolls, great job on this project and
               | keep it up, hope at least most of HN is more empathetic.
        
       | leavenotracks wrote:
       | Really impressed with the results I'm seeing so far. In all
       | searches I have done so far, the results are truly lightweight,
       | and haven't had to click through any modals, subscription pop-ups
       | or any other junk thus far! Will be using more in the days to
       | come.
        
       | ncfausti wrote:
       | This is incredible. I just got goosebumps as I stumbled upon
       | https://solitaryroad.com after searching for "linear algebra
       | homomorphism". It reminds me of the magical feelings of the early
       | Internet. Keep up the great work!
        
       | pajko wrote:
       | Does it filter out ad-heavy copy-paste/autogenerated fake sites?
       | Tired of seeing those on the first few pages of Google. Bing gets
       | more and more usable, but far from perfect.
        
         | marginalia_nu wrote:
         | It tries.
        
       | AltruisticGapHN wrote:
       | I like the idea. However, results take too much space vertically;
       | it's slow and cumbersome to scan through them.
       | 
       | I think it would benefit from using a responsive layout, allow
       | the text to expand to a wide 1000+ px, make the font smaller, so the
       | excerpt can fit one or two lines below the links.
       | 
       | Google has problems but their search results layout is easy to
       | scan.
       | 
       | Otherwise I genuinely wish I would use it, because the Google
       | search's "self referential reality bubble" is really annoying.
        
       | titzer wrote:
       | I think I want a BBS. Text mode, fixed width font, keyboard-
       | driven menus, no (or very little) bitmapped graphics. I've been
       | thinking about the UIs for a lot of sites that I use to "do
       | things" on the web. E.g. search for flights. Do I need _any_ of
       | that "beautiful" web design with pretty forms and fonts,
       | bevelled edges, drop shadows, drop-down menus, hovers? Hell, do I
       | even need a map? Heck no, I need three text entry fields and
       | output a bulleted list, maybe table of results. Just give me the
       | raw data and do as little presentation as possible, thanks.
       | 
       | I really think I want an internet console, not an animated
       | magazine.
        
       | javajosh wrote:
       | This is really good; I'll actually use it!
        
       | macksd wrote:
       | This is a really cool idea. I tried a few technical queries I did
       | on DDG today and didn't get amazing results - hence the warning
       | in the About page about this engine giving you things you didn't
       | know you were looking for, rather than specific facts. But the
       | examples others have posted sound promising and refreshing. I
       | would love to read about the algorithms behind this and how
       | modern web design gets detected in order to punish it...
        
       | Lapsa wrote:
       | I like the design.
        
       | BlackLotus89 wrote:
       | Are there any technical infos about the search engine? Found some
       | information here
       | https://memex.marginalia.nu/projects/edge/about.gmi
       | 
       | Author said they threw it together on consumer hardware. How big
       | is the index? (TB used or entries) how is it realised?
       | 
       | I'm pretty much interested in this since I myself am crawling
       | some pages for my own "search index".
       | 
       | Oh and thx for making and posting. Added it as a keyword to
       | firefox
       | 
       | Edit: Just realized that my question is a bit shallow. What I'm
       | particularly interested in is the storage before the indexing. I'm
       | trying to store the raw html so that I can reindex everything
       | with better algorithms, but I'm hitting many limits. It takes a
       | few minutes getting the size of a site-directory (every site has
       | its own dir) and I'm at a point where I can't reasonably manage
       | the scrape-versioning over git and I cycled through a few
       | filesystems only to find that the metadata management kind of
       | sucks for most of them. It's rather interesting how we store such
       | files and I'm thinking about storing a few sites in a simple
       | sqlite format for easy access and search. I'm thinking about a
       | few low-overhead solutions like Facebook's Haystack project
       | (implemented open source in seaweedfs) or something similar...
       | Hopefully this gives some context to the question of storage and
       | sites that are indexed
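The "simple sqlite format" idea mentioned above could look roughly like the sketch below. It is a guess at one workable layout, not something the commenter describes: the table name `page`, its columns, zlib compression, and ISO-8601 timestamps (so that string order equals time order) are all assumptions. Each crawl of a URL becomes one row, which gives versioning without git and lets a per-site size be computed with a query instead of a directory walk.

```python
import sqlite3
import zlib


def open_archive(path=":memory:"):
    """Open (or create) the archive database. Schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS page (
               url        TEXT NOT NULL,
               fetched_at TEXT NOT NULL,  -- ISO-8601, sorts chronologically
               body       BLOB NOT NULL,  -- zlib-compressed raw HTML
               PRIMARY KEY (url, fetched_at)
           )"""
    )
    return db


def store(db, url, fetched_at, html):
    """Append one crawled version of a page."""
    db.execute(
        "INSERT INTO page (url, fetched_at, body) VALUES (?, ?, ?)",
        (url, fetched_at, zlib.compress(html.encode("utf-8"))),
    )
    db.commit()


def latest(db, url):
    """Return the most recently fetched HTML for a URL, or None."""
    row = db.execute(
        "SELECT body FROM page WHERE url = ? ORDER BY fetched_at DESC LIMIT 1",
        (url,),
    ).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")


def site_size(db, prefix):
    """Total compressed bytes stored for all URLs starting with prefix."""
    (total,) = db.execute(
        "SELECT COALESCE(SUM(LENGTH(body)), 0) FROM page "
        "WHERE substr(url, 1, ?) = ?",
        (len(prefix), prefix),
    ).fetchone()
    return total
```

Keeping the raw HTML in the table means everything can be re-indexed later with a better algorithm without re-crawling.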
        
         | [deleted]
        
         | marginalia_nu wrote:
         | The index is tiny, not even a terabyte. Right now it's a few
         | hundred gigabytes for ~20 million URLs. But it's stored in an
         | extremely dense binary format.
         | 
         | Honestly you may just want to roll your own solution for
         | storing a ton of files. If you don't need a general-purpose
         | filesystem, but an append-only archive with extra metadata,
         | then you can cut a lot of corners. Like if you have a file
         | system that is fixed-size and append-only, you can build it in
         | a way no off-the-shelf stuff can.
         | 
         | This line of thinking is a large part of why my index is so
         | small and fast. I have a lot of special built data-structures
         | that are built for their exact use case. Like a fixed size
         | append-only hash map that uses mapped memory and can in theory
         | be larger than the system memory. Very good for a search
         | engine, absolutely useless almost everywhere else.
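A rough sketch of the kind of structure described above, a fixed-size, append-only hash map backed by mapped memory, is given below. It is not the author's implementation: the class name `FixedHashMap`, the 16-byte slot layout, linear probing, the multiplicative hash constant, and the choice to reserve key 0 as the empty marker are all assumptions made for illustration. Because the table lives in a memory-mapped file, the operating system pages it in and out, so the file can in principle be larger than RAM.

```python
import mmap
import os
import struct

SLOT = struct.Struct("<QQ")  # one slot = (key, value), both unsigned 64-bit


class FixedHashMap:
    """Fixed-capacity, insert-only map from nonzero u64 keys to u64 values."""

    def __init__(self, path, capacity):
        self.capacity = capacity
        size = capacity * SLOT.size
        new = not os.path.exists(path)
        self._file = open(path, "w+b" if new else "r+b")
        if new:
            # Extending the file is assumed to zero-fill it (true on POSIX),
            # which makes every slot start out empty (key == 0).
            self._file.truncate(size)
        self._mem = mmap.mmap(self._file.fileno(), size)

    def _probe(self, key):
        """Yield byte offsets of slots to try, starting at the key's home slot."""
        start = (key * 0x9E3779B97F4A7C15 >> 32) % self.capacity
        for i in range(self.capacity):
            yield ((start + i) % self.capacity) * SLOT.size

    def put(self, key, value):
        if key == 0:
            raise ValueError("key 0 is reserved as the empty-slot marker")
        for off in self._probe(key):
            k, _ = SLOT.unpack_from(self._mem, off)
            if k == 0:
                SLOT.pack_into(self._mem, off, key, value)
                return
            if k == key:
                raise KeyError("append-only: key already present")
        raise RuntimeError("map is full")

    def get(self, key, default=None):
        if key == 0:
            return default
        for off in self._probe(key):
            k, v = SLOT.unpack_from(self._mem, off)
            if k == key:
                return v
            if k == 0:  # hit an empty slot: the key was never inserted
                return default
        return default

    def close(self):
        self._mem.flush()
        self._mem.close()
        self._file.close()
```

Never deleting or resizing is what keeps this simple: there are no tombstones, no rehashing, and a lookup can stop at the first empty slot.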
        
       | eitland wrote:
       | Tested with the first person to settle in Iceland:
       | https://search.marginalia.nu/search?query=Ingolf+Arnarson
       | 
       | and it worked surprisingly well.
       | 
       | Does anyone else have good examples?
        
       | abdullahkhalids wrote:
       | Can we submit text-heavy sites for possible inclusion? Assuming
       | they pass your filters.
        
       | sealthedeal wrote:
       | lol this is great, reminds me of the old school search engines we
       | would use in school back in the day before Google haha.
        
       | pomian wrote:
       | Congratulations. Truly impressive search results. I tried two,
       | one word searches. The results were interesting, useful, and
       | would have been impossible (well, really really hard) to find, on
       | standard search engines. Plus, no garbage, ads, recommendations,
       | etc etc. As another commenter suggested, it is what World Wide
       | Web searches results were like, twenty years ago!
        
         | pomian wrote:
         | PS. I added Marginalia as a search option (even the default for
         | now) in Firefox Nightly (on Android). In case others want to,
         | under settings for search, you can add other, then name, and
         | then: https://search.marginalia.nu/search?query=%s
        
           | freddref wrote:
           | After a good amount of searching it doesn't seem possible to
           | add Marginalia as default search in firefox (84.0b8) on
           | Debian.
           | 
           | I did not expect this to not be available.
        
             | mmphosis wrote:
             | I am running Linux Mint, and I did not expect adding a
             | custom search engine to be missing from Firefox. There are
             | plug-ins, but I don't like adding plugins.
             | 
             | https://mmphosis.netlify.app/search.marginalia.nu/
        
       | lumost wrote:
       | This is a fascinating tool, I estimated that the corpus of the
       | factual web was between 1 and 10 TB when I last played around
       | with BigQuery using domain names which had low amounts of click
       | bait. Seeing these search results I suspect my estimate was off
       | by a couple orders of magnitude.
       | 
       | Although a search for "Fractional Reserve Banking" shows that
       | some further ranking improvements can be made to exclude
       | unrelated results, and potentially penalize old conspiracy sites.
       | 
       | https://search.marginalia.nu/search?query=fractional+reserve...
        
       | rchaud wrote:
       | Is it fair to assume that text-heavy sites that are inactive (but
       | still online) don't have SSL?
       | 
       | If so, would you ever tweak the parameters to surface sites that
       | aren't served with "HTTPS"?
        
       | asjdflakjsdf wrote:
       | You should monetise this with amazon affiliate links that are
       | relevant to each search. And then use that money to keep this
       | project going. Google is fantastic, but it has become something
       | different from what it was, the company and the product. It is so
       | refreshing to see a modern tool that encourages exploration of
       | the actual world wide web.
        
         | Funes- wrote:
         | That would be an absolutely awful decision.
        
         | marginalia_nu wrote:
         | I might add a donate button or something if people want to help
         | support the project, hardware isn't cheap and all. But I have a
         | job and decent income. I think if this search engine became the
         | way I earned money, it would influence the project in a bad
         | way, and corrupt its purpose, which is to help people explore
         | the less-commercial internet.
        
           | eigenhombre wrote:
           | +1 for a donate button; much preferred over affiliate links
           | or ads of any kind. Thank you for making this beautiful
           | little(?) product!
        
           | kews wrote:
           | Appreciated. The more things fill up with monetizing shit,
           | the more I stay away. There's something beautiful in having
           | higher purposes than grubbing for cash.
        
           | RistrettoMike wrote:
           | I'd donate to continued expansion/development of something
           | like this. Where is somewhere good to follow you for any
           | thoughts/updates?
        
             | marginalia_nu wrote:
             | I have something of a blog here, with an Atom feed.
             | 
             | https://memex.marginalia.nu/log/
             | 
             | It's not very well optimized for mobile, really it's more
             | of a bridge for my geminispace content.
        
           | marginalia_nu wrote:
           | I added a patron, and huh, a few people are actually
           | donating. I don't really know what to say. Thanks!
        
       | streamofdigits wrote:
       | Let a thousand search engines bloom.
       | 
       | btw, interesting how many http (as opposed to https) sites show
       | up...
        
       | pjs_ wrote:
       | This kicks ass!!
        
       | 0xbadcafebee wrote:
       | I love it. Even though it didn't give me the results I was
       | looking for. I searched "new york fishing license", and it didn't
       | give me any links to the actual new york fishing license
       | websites. But it did give me a ton of really cute little websites
       | related to lakes and fishing in New York. This one has _amazing_
       | information about fishing all over Western New York:
       | http://www.huntfishnyoutdoors.com/fishing.php
        
       | spookthesunset wrote:
       | This is really cool! So retro!
       | 
       | Here is the second result when you search for "cat food". It
       | takes you to some old dude's entire family tree with full history
       | and biographies... it even uses sub domains and everything!
       | Crazy!
       | 
       | http://www.torrens.org/
        
       | kews wrote:
       | There's probably a more suitable term than "modern" that we
       | should generally be using, since "modern" consistently has a
       | positive connotation.
        
         | marginalia_nu wrote:
         | Dunno, I prefer to use neutral or positive terminology even
         | when I talk about things I don't like. I think it very easily
         | comes off as juvenile ranting when you start throwing around
         | terms with strong negative connotations.
        
       | mumblemumble wrote:
       | I like it.
       | 
       | Coincidentally, the other day I was daydreaming about a search
       | engine that favors sites that are updated less frequently. The
       | thought being, the kinds of labors of love that characterized the
       | 1990s Web that I still sometimes miss are still out there, it's
       | just harder to find them amidst the flood of SEO dreck. So
       | perhaps they could be made discoverable again with the help of a
       | contrarian search engine that specifically looks for the kinds of
       | things that Google and Bing _don't_ like to see.
        
         | gibspaulding wrote:
         | Million Short [1] offers an option to omit results from popular
         | domains. It's a different approach from what you describe, but
         | I think the goal is similar.
         | 
         | [1] https://millionshort.com/
        
         | itzworm wrote:
         | I had this problem recently trying to fix an Atari. There's a
         | guy out there who has tons of guides on doing video out mods
         | but the newer guide references the older one. However googling the OG
         | guide didn't find it so I manually scoured his old web page.
        
         | zachguo wrote:
         | Try this https://wiby.me/
         | 
         | Clicking 'Surprise me' gave me an interesting article from 1994
         | http://milk.com/wall-o-shame/bucket.html
        
           | appel wrote:
           | That was a great read, thanks for sharing.
        
         | capableweb wrote:
         | Similarly, I wish there was a recommendation engine (for web,
         | music, movies, whatever) that can show you what is the furthest
         | away from your existing tastes. I've learned to re-create my
         | Spotify account once every 6 months or so, as their
         | recommendation engine becomes a boring machine after using it
         | daily for some months.
         | 
         | I'd love to discover new content that is different from what I
         | read/watch/listen to now, but it's really hard to know about
         | genres you don't know about.
        
           | ebiester wrote:
           | It's hard, though. I simultaneously want something far from
           | my tastes, but I don't want to see Plandemic-style Ivermectin
           | material, or Focus On The Family-style material. I want
           | things that will push me out of my comfort zone sometimes,
           | but it turns out I really don't want the thing furthest from
           | my tastes; I want things marginally adjacent. I want them
           | close enough to feel familiarity, but far enough that it
           | challenges my worldview.
           | 
           | I don't think a recommendation engine can do that.
        
           | gverrilla wrote:
           | Doing that takes real work and curiosity. I'm afraid an
           | algorithm will never be able to do it, particularly if you're
           | into niche stuff. For instance I enjoy a lot a Japanese band
           | called The Boredoms - but few people like it, and there's
           | only 2 of their albums available in spotify.
        
         | potatoman22 wrote:
         | I like the idea. Tangentially, I wonder how one would find the
         | right 'penalty' for more updated sites?
        
       | timvisee wrote:
       | Cool!
       | 
       | There do seem to be some text encoding issues though. For
       | example: https://search.marginalia.nu/search?query=tim+visee
        
         | marginalia_nu wrote:
         | Yeah I think the charset detection needs work.
         | 
         | It understands the "Content-type: text/html;charset=utf-8"
         | -header, and <meta charset="UTF-8">
         | 
         | but not
         | 
         | <meta http-equiv="content-type" content="text/html;
         | charset=utf-8">
         | 
         | It turns out HTML has a lot of corner cases. I'm constantly
         | marveling at how web browsers hold together as well as they do.
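The three charset declarations listed above can all be handled with a small amount of code. The sketch below is an illustration in Python rather than the search engine's own code; the names `detect_charset` and `decode_page`, the two regular expressions, and the UTF-8 fallback are assumptions. It checks the HTTP header first, then scans the first 1024 bytes of the body for either form of `<meta>` declaration.

```python
import codecs
import re

# "charset=..." inside a Content-Type header value.
CHARSET_IN_HEADER = re.compile(
    r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE | re.ASCII
)
# Matches both <meta charset="UTF-8"> and
# <meta http-equiv="content-type" content="text/html; charset=utf-8">.
CHARSET_IN_META = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE
)


def _known(name):
    """Return Python's canonical codec name, or None if the label is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def detect_charset(content_type, body, default="utf-8"):
    """Pick an encoding from the header, then the markup, then a default."""
    if content_type:
        m = CHARSET_IN_HEADER.search(content_type)
        if m and _known(m.group(1)):
            return _known(m.group(1))
    # Only the start of the document is scanned for a <meta> declaration.
    m = CHARSET_IN_META.search(body[:1024])
    if m:
        name = _known(m.group(1).decode("ascii"))
        if name:
            return name
    return default


def decode_page(content_type, body):
    """Decode raw bytes, replacing undecodable bytes instead of failing."""
    return body.decode(detect_charset(content_type, body), errors="replace")
```

Decoding with `errors="replace"` means a wrong guess degrades to a few replacement characters rather than an exception, which matters when the default is applied to older pages that declare nothing.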
        
           | timvisee wrote:
           | Thanks for your response! Hope you can implement this as well
           | without too much trouble.
           | 
           | I wonder if you could just assume UTF-8 to be the default
           | these days. I imagine that to fix many other cases as well.
        
             | marginalia_nu wrote:
             | Haha! I did actually assume UTF-8 at first, but since the
             | search engine has a lot of older websites, I sadly got a
             | lot of encoding errors doing that, too.
        
               | soheil wrote:
               | Maybe just like js-heavy sites also punish non-
               | conforming-encoding sites.
        
       | enduku wrote:
       | Fantastic project! Found very interesting links to a lot of
       | compiler related keywords. A similar service, yet different in
       | their approach to cut through the e-commerce and seo optimized
       | websites I found useful is Million Short [0]
       | 
       | [0] millionshort.com
        
       | oytis wrote:
       | Designed for serendipity indeed. Tried a few searches, results
       | are quite fun, but none of them relevant.
        
       | pjspycha wrote:
       | This is really refreshing work, and we can all benefit from other
       | search engines focused on improving the field. I tried a bunch of
       | searches and some of them were quite wonderful, others were a
       | little dry on results. But overall I enjoyed going through it.
       | Here are some critiques, if you don't mind:
       | 
       | I did search for "Daria Bilodid" and the results were a bit
       | troublesome. First the Wikipedia result did not work:
       | https://en.wikipedia.org/wiki/Daria_Bilodid vs
       | https://encyclopedia.marginalia.nu/wiki/Daria_Bilodid
       | 
       | Secondly the results matched a few judoinside.com results which
       | is ok, including sites to her competitors, but seemed to miss the
       | judoinside website for her:
       | https://www.judoinside.com/judoka/92660/Daria_Bilodid.
       | 
       | The design is hard on my eyes. I have an average-size screen and
       | it's using less than half of the width. The line-height is
       | enormous and seems to break up the flow, making it uncomfortable for me
       | to read. The spacing around each result is the same as between
       | titles and paragraph items, which again was unpleasant to read.
        
         | ASalazarMX wrote:
         | > Secondly the results matched a few judoinside.com results
         | which is ok, including sites to her competitors, but seemed to
         | miss the judoinside website for her:
         | https://www.judoinside.com/judoka/92660/Daria_Bilodid.
         | 
         | The title says this search engine punishes modern websites
         | (images, videos, MB of JS, I suppose), and this site looks
         | scarce on text and heavy on images, maybe that's confusing the
         | ranking.
         | 
         | I certainly find the results very refreshing, but you'll have
         | to complement with other search engines if they're not enough.
         | In fact, I think the days when we could use a single search
         | engine have already passed.
        
       | dukeofdoom wrote:
       | "corporate speak" bs detector and filter on google search engine
       | would be nice.
        
       | kag0 wrote:
       | An interesting concept and awesome work!
       | 
       | I searched for high pressure air (HPA) regulator trying to find a
       | description of how one works. I didn't find that, but did find
       | some interesting links on how they're used in scuba, and one
       | guy's homemade gas laser.
        
       | throwawaysea wrote:
       | Is it possible to also make a site that favors a diverse set of
       | information sources? For instance a lot of searches turn up
       | results from Pinterest or Wikipedia or Amazon or whatever else. I
       | wonder if there's room for a search engine that is all about
       | favoring a greater diversity of smaller sources, for those who
       | are less interested in staying within walled gardens.
        
       | michaelgrafl wrote:
       | I just looked up my last name and found a World class heavyweight
       | weightlifter named Josef Grafl born in 1872 who has an awesome
       | portrait of him on Wikipedia. Never before have I read about that
       | man.
       | 
       | I love this.
        
       | dumbfounder wrote:
       | Based on a few searches it seems to favor sites with very long
       | passages of text. Search for a name and you get pages with
       | massive lists of names. It quite simply isn't very good at
       | everyday searches. But it does bring up the point, shouldn't I be
       | able to tell my search engine I want results like this? It should
       | be a feature of google I can turn on and off. It should be one of
       | many ways to impact relevance.
        
       | runnerup wrote:
       | Seems like this is still very very hard! I searched for "hart
       | protocol" hoping to find this: http://www.romilly.co.uk/
        
       | mrpf1ster wrote:
       | I searched "c strtok" and got one result saying '"strtok" could
       | be spelled "stroke", "stork", "sarto", "strop"'.
       | 
       | Cool concept though!
        
         | marginalia_nu wrote:
         | The spelling suggestions are presented whenever there aren't any
         | results, but sometimes they can be pretty misleading.
         | 
         | What happens is that C, as a word, isn't indexed because it's
         | deemed too short, and the bigram "c strtok" can't be found
         | anywhere.
         | 
         | Try 'strtok' instead.
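Why "c strtok" returns nothing while "strtok" works can be reproduced with a toy inverted index. The sketch below is an assumption-heavy illustration, not the engine's code: the cutoff `MIN_WORD_LENGTH`, whitespace tokenization, and the functions `terms`, `build_index` and `search` are all made up for the example. Single words shorter than the cutoff are never indexed, so a query containing one can only match through a bigram that appears verbatim in some document.

```python
from collections import defaultdict

MIN_WORD_LENGTH = 2  # hypothetical cutoff; the real threshold is not stated


def terms(text):
    """Split text into indexable single words and adjacent-word bigrams."""
    words = text.lower().split()
    unigrams = [w for w in words if len(w) >= MIN_WORD_LENGTH]
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return unigrams, bigrams


def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        unigrams, bigrams = terms(text)
        for term in unigrams + bigrams:
            index[term].add(doc_id)
    return index


def search(index, query):
    words = query.lower().split()
    # A two-word query can match as an exact bigram.
    phrase = " ".join(words)
    if len(words) == 2 and phrase in index:
        return set(index[phrase])
    # Otherwise every word must be present; an unindexed word matches nothing.
    hits = [index.get(w, set()) for w in words]
    return set.intersection(*hits) if hits else set()
```

For a document such as "the strtok function splits a string in c programs", the word "c" is dropped and the only bigrams containing it are "in c" and "c programs", so "c strtok" finds nothing and the spelling suggestions kick in.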
        
       | llbeansandrice wrote:
       | Is there any way to add this as a favored search engine in the
       | browser?
       | 
       | I currently use google as it's set as the default search when I
       | type in the address bar but would love to switch and move
       | google/ddg to an added character like "<search terms> @g"
        
         | tomerv wrote:
         | All the major browsers support adding custom search engines.
         | You just need to specify the URL template to do the search. The
         | common format is to put "%s" as the search term. You can use it
         | for any site, not just things that are considered search
         | engines.
         | 
         | Firefox is a bit different, since you do it by adding a
         | bookmark, and giving that bookmark a keyword. The other
         | browsers I checked have an option under the search engine
         | settings.
         | 
         | After defining the custom search engine, you just type
         | "<keyword> <search term>" in the URL bar.
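What the browser does with a "%s" template and a keyword can be sketched in a few lines. This is an illustration only: the keyword "m", the `ENGINES` table, and the function `resolve` are hypothetical, while the URL template itself is the one quoted elsewhere in this thread.

```python
from urllib.parse import quote_plus

# keyword -> URL template; "%s" marks where the search terms go.
ENGINES = {
    "m": "https://search.marginalia.nu/search?query=%s",  # keyword is hypothetical
}


def resolve(address_bar_input, engines=ENGINES):
    """Turn '<keyword> <search term>' into a search URL, or None if no match."""
    keyword, _, search_terms = address_bar_input.partition(" ")
    template = engines.get(keyword)
    if template is None or not search_terms:
        return None
    # URL-encode the terms (spaces become '+') and substitute them for %s.
    return template.replace("%s", quote_plus(search_terms))
```

Typing "m Ingolf Arnarson" would then resolve to https://search.marginalia.nu/search?query=Ingolf+Arnarson, and input without a known keyword falls through to the default engine.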
        
           | freddref wrote:
           | Is there a way to set marginalia as default search in
           | firefox?
        
           | llbeansandrice wrote:
           | Ah I'm on FF so I'll have to do the weird bookmark method. A
           | little annoying since it supports other search engines.
        
       | hilbert42 wrote:
       | It's excellent, I looked up some physics topics and got some
       | excellent results - real meaty stuff full of text, equations and
       | applicable diagrams, etc.
       | 
       | I've not only bookmarked it but also I've an icon linked to it on
       | the taskbar. Will watch its progress with interest.
        
       | aetherspawn wrote:
       | Beware, I got the impression straight away that some sites were
       | censored from the results for no good reason.
       | 
       | For example, if you search "jehovahs witnesses", all pages from
       | jw.org are missing.
       | 
       | Exactly the same thing happened when I searched "mormons" - the
       | official website is missing and it only brings up
       | sects/hate/conspiracies against mormons.
        
         | edrxty wrote:
         | jw.org appears to be the kind of modern web design this is
         | trying to avoid. I seriously doubt it has anything to do with
         | the cult.
        
         | marginalia_nu wrote:
         | If you want mormons in positive light, you should search for
         | "latter-day saints", as that's how they typically brand
         | themselves.
         | 
         | jw.org is on the indexing list, just pretty far down, based on
         | the fact that previous times it's been visited it's had a ton of
         | JavaScript.
         | 
         | I don't have an axe to grind with fringe religious movements, I
         | actually love them to bits. Try searching for Nag Hammadi or
         | Hermes Trismegistos.
        
       | ilrwbwrkhv wrote:
       | This is soooo good. I'm finally finding sites I haven't heard of
       | with good content.
       | 
       | I didn't realize how much I missed this stuff.
       | 
       | The popular web has become so bad nowadays.
        
       | protontorpedo wrote:
       | I searched for "Starlink satellites" and found this Y2K-style
       | Canadian UFO blog [1] explaining it isn't aliens. I might just
       | waste my weekend with this search engine.
       | 
       | [1] https://www.ufobc.ca/Reports/stringoflights.html
        
       | snuser wrote:
       | Just searching for 'dogs' gave me more interesting results than
       | I've seen from google in years
        
       | prionassembly wrote:
       | The website itself seems generated with some kind of kick-ass
       | generator from template files (.gmi?)
       | 
       | I feel like I'm stuck with Wordpress.com because it brings me
       | _some_ traffic (whereas something hand-rolled on nsfspeech or
       | digital ocean or whatever would literally be off the edge of the
       | web), but the structure of that is so cool.
        
         | NhanH wrote:
         | That would be gemini protocol!
        
         | huijzer wrote:
         | You can easily do proper SEO with static site generators too.
         | Even more, static sites can be hosted via GitHub or GitLab
         | Pages, Netlify or CloudFlare, and the speed will
         | outperform Wordpress in almost all cases. Also, you have way
         | more control over the output than with Wordpress.
        
       | dzink wrote:
       | One use case to always test: "online wishlist" or "make a
       | wishlist". If you start seeing tools like
       | https://www.DreamList.com or others, you are on the right path.
       | If you start seeing random web pages linking to individual wish
       | lists, then people are likely not able to find tools on your
       | search engine.
        
       | platz wrote:
       | > Don't be afraid to scroll down in the search results
       | 
       | I never knew it was fear that was preventing me from scrolling
        
       | prewett wrote:
       | Kudos for taking on this project, and I like the idea! I think
       | it'll be a big project to take it to the next level, but would
       | love to have a search engine that's more useful.
       | 
       | Some reactions:
       | 
       | - The font is really big and the columns really narrow, so I get
       | 3 - 4 entries per page, something like 8 words per line, and huge
       | spacings between lines, which makes it a frustrating experience.
       | I've been using the recommendations in
       | https://practicaltypography.com/, which recommends 60 - 90
       | characters in a line I think, and line spacing of 120% - 140% (I
       | like 125%). The line lengths here might technically fall within
       | the lower bound, but it's really short, and for search results
       | I'm going to try scanning the text to see if there's something
       | relevant, so I think going on the long side is better here. At
       | least make the width somewhat variable so that I can shrink the
       | rather large font and fit more on the line.
       | 
       | - The results are eclectic, but I'm not sure it's usable at the
       | moment. "scala append list" did not get me much that's helpful,
       | while Google will usually at least put up some click-farming
       | tutorial that although minimal effort does tend to answer the
       | question. Both "mapo doufu recipe" and "ma po do fu recipe" had
       | very few recipes, although the latter did have one.
       | Unfortunately, recipe websites are some of the worst, with about
       | 10 pages of description, ads, pictures, what-have-you until the
       | recipe at the very bottom. "collection unmitigated pedantry" did
       | return the acoup.blog entry at the top, though.
       | 
       | Good luck on the project!
        
       | lbriner wrote:
       | My pet peeve with search results is simply that there are ancient
       | technical results that in many cases are irrelevant. If I am
       | searching for a Windows error message, I don't want some old forum
       | post from 2001, especially if it didn't have any answers!
       | 
       | What would be cool would be for people who host old stuff to
       | "archive" it at some point so it doesn't appear in normal
       | results, only if you tick "include archives".
        
         | athenot wrote:
         | As much as the release names for macOS over the years were
         | marketing gimmicks, it does make it a lot easier to zero in on
         | the correct version when doing these types of searches.
        
       | platz wrote:
       | modern design = low information density?
        
         | monkeybutton wrote:
         | Definitely low signal to noise. Looking at you recipe websites
         | and cooking blogs.
        
       | dfdz wrote:
       | I like the concept, but it did not work on any of the search
       | phrases I entered consisting of the full title of a computer
       | science article or book.
       | 
       | It also does not work for subjects. For example, if you search
       | "discrete math" it links to academic webpages, but most of them
       | do not have any notes posted. It is just a plain text website
       | with the syllabus of a class.
        
       | grae_QED wrote:
       | Obligatory https://wiby.me/ plug. If you're looking for a decent
       | minimalist website search engine, this fits the bill pretty well.
        
       | davuinci wrote:
       | Congrats for the effort, I really like the idea and it works
       | wonderfully for some searches.
       | 
       | However, I searched "infiniband" and the results are far away
       | from what I would expect or like to see. Most of the results that
       | appear first are completely unrelated to the topic.
        
       | cyral wrote:
       | I tried a few queries and got extremely irrelevant results
        
         | marginalia_nu wrote:
         | It really depends on what you search for. A major drawback is
         | that there needs to be text-heavy sites to find, in order for
         | the search engine to find them.
         | 
         | Compare for example the results for "Duke Nukem 3D" with those
         | for "Cyberpunk 2077".
        
       | adriangrigore wrote:
       | A little bit harsh "punishes". It's a cool search engine.
        
       | OneEyedRobot wrote:
       | Very cool. A person can really appreciate simple web design
       | looking at something like Luke Smith's recipe page.
       | 
       | So how on earth do you take an idea like this and scale it for
       | both broad web coverage and high traffic? For that matter, just
       | how much 'useful' text is there on the net?
        
       | vanattab wrote:
       | Cool idea but it needs to be able to handle special characters.
       | Right now searching "Hello World c#" returns no results because
       | the search term can't handle #. I also can't just delete the #
       | because then I would be stuck writing C...
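       | 
       | A minimal sketch of the kind of query tokenizer that would keep
       | such terms intact. This is an assumption for illustration, not
       | how the engine actually parses queries; the name tokenize_query
       | and the pattern are hypothetical.

```python
import re
from typing import List

# Hypothetical pattern: a run of letters/digits, optionally followed by
# '#' or one or two '+' signs, so "c#" and "c++" stay whole terms.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:#|\+{1,2})?")


def tokenize_query(query: str) -> List[str]:
    """Lowercase the query and split it into search terms."""
    return TOKEN_RE.findall(query.lower())
```

       | With a tokenizer like this, "c#" and plain "c" end up as
       | different terms instead of the '#' being rejected or dropped.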
        
       | jteppinette wrote:
       | This is awesome! We should definitely move in this direction.
        
       | xipho wrote:
       | Fascinating. I studied an "obscure" group of insects. My go-to
       | search term to test an engine is their family name as it is a
       | rarely used word and I know most (all?) of the major data sources
       | that have accumulated data on it. When Wolfram Alpha added
       | species names, I checked with the name, boring, Duck Duck,
       | boring, Google (well we know Google isn't for search anymore,
       | it's absolutely horrible) boring, Bing, boring... you get the
       | idea.
       | 
       | This was a little different, extremely few results, but a couple
       | of them really made me grin, and all(?) made me curious or raise
       | an eyebrow or reflect on who/what might have been the source of
       | the link, or remember some obscure connection from grad-school.
       | So, if anything a crawled list of results worthy of ponder,
       | thanks for this!
        
         | tentacleuno wrote:
         | > well we know Google isn't for search anymore,
         | 
         | If you're talking about the ads, I would bear in mind Google's
         | whole business model is basically online advertising. Search is
         | just the vehicle to deliver those ads; I'd say Google is pretty
         | good at throwing things back.
        
           | jevgeni wrote:
           | But what's their UVP? I'd say quick and relevant search
           | results. And that seems to be constantly degrading.
        
             | tentacleuno wrote:
             | Well, the unique value proposition is their gigantic index,
             | really fast search and a bunch of other things.
             | 
             | I'm not sure about the quality of the results, I just use
             | DuckDuckGo these days, but IMO the unique technical
             | advancements are pretty unique to Google.
        
         | trutannus wrote:
         | > well we know Google isn't for search anymore
         | 
         | Do you suggest anything better? As far as I can tell, all the
         | other search engines are either repackaged Bing (ie: DuckDuck),
         | or are just as bad.
        
           | ColinHayhurst wrote:
           | This needs an update but is an easy look see.
           | https://www.searchenginemap.com/
           | 
           | Broad and longer Twitter lists maintained here:
           | https://twitter.com/SearchEngineMap/lists
        
           | ricardo81 wrote:
           | Mojeek was built in the same spirit (one server living in a
           | house) and has 4.5 bn pages indexed now, and a bunch more
           | servers. A lot of people comment in a similar vein that it
           | reminds them of an older Internet, or generally less
           | branded results. It's definitely an alternative point of
           | view. Disclaimer: I work for them.
        
           | giancarlostoro wrote:
           | Not sure, but I remember when Google could find literally
           | anything. Then they started adding a bunch of exceptions and
           | crapped out their quality. I wonder how insanely different
           | results would be to get the older Google Engine from the
           | 2000s search result wise.
           | 
           | I now have to play games with Google to find things. I feel
           | like I do less than I used to for some reason.
        
             | bbarnett wrote:
             | The other day, I was searching for something, and google's
             | suggested, on-site answers took up 1/2 the first page. All
             | wrong.
             | 
             | The actual search results were another 1/4 page of
             | completely identical results, followed by google ad placed
             | search results.
             | 
             | I thought to myself, they've finally done it. Real
             | responses are no longer first page.
             | 
             | A lot of the cause for google getting crappy, is "ok
             | google", another "all platforms are the same" form of
             | sickness.
             | 
             | No, a desktop is not a phone. No, voice searching is not
             | the same as phone, or desktop.
        
               | foobarian wrote:
               | I was just thinking that they finally became Lycos. It's
               | what all the search engines except Google looked like
               | back in the early 2000s - ad laden cesspools of
               | irrelevant search results and other content. And it's why
               | we all switched to Google at the time.
        
               | habibur wrote:
               | It's time to disrupt the market. As Google can't compete
               | with a newcomer that penalizes ads on page.
        
               | eitland wrote:
               | Seriously, yes.
               | 
               | Moore's law means a modern day 2007-style Google should be
               | significantly less expensive to run now than back then.
               | 
               | Also the most relevant patents are now free to use.
               | 
               | 2021 Google is a sad story compared to 2007 Google and
               | I'd actually pay to get back 2007 Google - ads included -
               | meaning a double revenue source :-)
        
             | trutannus wrote:
             | You're absolutely correct, and a lot came from their
             | nerfing of search modifiers like + - "search term" and
             | whatnot. There's also a lot of ads and "PSA" type nonsense.
             | If I'm looking for anything COVID related for example, I
             | have to sift through a heap of PSA nonsense that's not even
             | related to my search query.
        
             | blowski wrote:
             | Wacky idea: instead of Google changing its algorithm every
             | couple of years, it could run 50 algorithms in parallel
             | leaving no way for sites to "optimise" for the current one.
        
               | vikingerik wrote:
               | The output of the parallelism is itself an algorithm,
               | that can and will be optimized for.
        
           | mda wrote:
           | IMHO, that is a trendy claim in HN with little evidence.
        
             | mhh__ wrote:
             | You're downvoted but in my experience I have never really
             | been burned by this Google-decline
        
               | Ygg2 wrote:
               | "I haven't seen a black swan, ergo it's not real."
               | 
               | I've been burned by this decline in the past.
               | 
               | From creepy results i.e. first suggestion before typing
               | was something I spoke near the Android and I never
               | searched for before; to not finding what I was searching
               | for before successfully, Google has started declining.
        
               | mda wrote:
               | It would be nice if you provided a few real examples so
               | that we would see how Google was so fantastic and found
               | everything magically but then went to shit.
        
               | bbarnett wrote:
               | You are a lobster. (or frog, depending upon parable)
        
             | xipho wrote:
             | You want evidence? Search for a plumber/tradesperson in
             | your area THEN try to find rational discourse about your
             | options. There are literally 100s of results of websites
             | remixing a small set of data, presenting it to you, and
             | asking you to buy something to see more, when you _know_
             | there is nothing behind the scenes.
             | 
             | This type of engine would punish these sites, in theory,
             | and may turn up a discussion in some forum, newsgroup, etc.
             | that is actually relevant, or insightful.
        
               | krapp wrote:
               | > Search for a plumber/tradesperson in your area THEN try
               | to find rational discourse about your options.
               | 
               | I searched "plumber Austin TX" in Google and got a map
               | and list of company websites near me. There are a lot of
               | "top x y in z" list sites, but the top results were still
               | the most relevant. I don't know what "rational discourse"
               | I'm expected to find, though, or why I should assume the
               | discourse I would find through Google is less rational
               | than discourse I would find elsewhere.
               | 
               | I searched the same thing in OP and found nothing even
               | remotely significant. Not even anything related to
               | plumbing.
               | 
               | OP's project isn't optimized for relevance, it's
               | optimized for nostalgia - providing a filter that keeps
               | the modern web away and dropping quirky, interesting
               | breadcrumbs to distract you and remind you of what it was
               | like to wander around the web of the 90's.
               | 
               | Which is all well and good if that's what you want, and
               | judging from the comments it is what a lot of people here
               | want, but Google giving me a list of company names,
               | numbers, websites and a map showing their location by
               | distance is more useful, even if it uses "modern web
               | design" and javascript.
        
               | xipho wrote:
               | > I searched "plumber Austin TX" in Google and got a map
               | and list of company websites near me.
               | 
               | I think you could have done this historically in a Yellow
               | Pages phone book. My OP used "boring". A list of plumbers
               | is boring, been done on dead wood. I'm not saying boring
               | != !useful.
               | 
               | > There are a lot of "top x y in z" list sites
               | 
               | This is an understatement. I actually want to know the
               | top x in y, to do that I need "rational discourse".
               | Rational discourse is recognizable as well written,
               | insightful, humble, reflective, self-countering,
               | anecdotal etc. By "search is terrible" I mean with
               | respect to finding this.
               | 
               | > OP's project isn't optimized for relevance, it's
               | optimized for nostalgia
               | 
               | Nostalgia is highly relevant if it's on topic, but
               | agreeing with you as to what this engine is about.
        
               | krapp wrote:
               | >Rational discourse is recognizable as well written,
               | insightful, humble, reflective, self-countering,
               | anecdotal etc. By "search is terrible" I mean with
               | respect to finding this.
               | 
               | I believe a search engine that ignores results based on
               | superficial and aesthetic qualities like "modern web
               | design" would be even worse in that regard, unless you're
               | assuming no relevant discourse about any subject has
               | taken place on the web since the early 2000's.
               | 
               | I admit, I have no idea what heuristic you would actually
               | use to find "well written, insightful, humble,
               | reflective, self-countering, anecdotal etc" content, but
               | I've seen it on modern sites (even on Twitter,) and I've
               | seen a lot of garbage on old sites, so a simple text
               | search of only old websites doesn't seem like it.
               | 
               | It is fun, though.
        
               | mda wrote:
               | well it displays a map of plumbers in my area, is it not
               | useful? Besides do you remember what it was displaying
               | before "it became useless"? This whole thread is full of
               | hand wavy claims with pretty much no good examples about
               | how Google actually became worse in time. Hence my point.
        
             | Spivak wrote:
             | Google Search is a fantastic product because it's
             | essentially Spotlight for the web. It's by far the fastest
             | way to get to things you already vaguely know are there and
             | acts as a metasearch for large sites.
             | 
             | But as a result it's now less useful as a tool for scouring
             | the web.
        
         | mountain_peak wrote:
         | Likewise, I co-maintain the only "fan" site on one of my all-
         | time favourite composers/performers, and gave the engine a shot
         | with a unique string query. While my text-heavy WP-driven site
         | didn't seem to make the cut, the results were highly relevant
         | in that they were links to former band members and
         | collaborators - a couple of which I didn't realize existed.
         | That being said, there were a few sites (including my own) I
         | expected to be returned, but no dice. Still, a fascinating
         | experiment that many at HN have been clamouring for.
        
           | xipho wrote:
           | Exactly this. A couple results returned reference to obscure
           | now-defunct newsletters and clubs, people that I know were
           | historically important for past researchers, but only because
           | this was my research focus for so long would I have known
           | this.
        
           | marginalia_nu wrote:
           | The search engine doesn't actually do full text search, so
           | maybe your query was too... unique.
           | 
           | But do first of all verify that you haven't been hacked.
           | There's about a quarter of a million domains I've flagged that,
           | besides their wordpress content, also host a ton of link spam
           | crap off in some hidden folder. This reflects on the quality
           | rating extremely negatively to the point where you may have
           | not been indexed at all.
           | 
           | Secondly, are you behind cloudflare or some other big-name
           | CDN? Because, as I mentioned in another comment, I can't
           | crawl their pages without getting captchad until they approve
           | of my humble request to be classified as a good bot.
           | 
           | There are some other hosting providers I flat out block on a
           | subnet level because they host a large amount of link farms.
           | This is currently Alibaba, Psychz, eSited, Cloud Yuqu and
           | 1Blu.
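           | 
           | To illustrate what blocking on a subnet level can look like,
           | here is a rough sketch. It is not the engine's actual code:
           | is_blocked is a hypothetical helper, and the CIDR ranges are
           | reserved documentation addresses, not the real ranges of any
           | provider named above.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), standing in for the
# subnets of hosting providers one has decided to block wholesale.
BLOCKED_SUBNETS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked subnet."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_SUBNETS)
```

           | A crawler could run a check like this on the resolved address
           | before fetching, so every domain hosted in a flagged range is
           | skipped without maintaining a per-domain list.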
        
             | mountain_peak wrote:
             | Thanks for the advice; not hacked, but I have "resurrected"
             | many WP sites that have been (including my wife's non-
             | profit). Just running on an EC2 micro instance, but I tried
             | adding "site:" and received "No such domain". Actually, I
             | think it's because I haven't enabled "HTTPS" yet! That's on
             | my to-do along with migrating off EC2-Classic to VPC...
        
               | marginalia_nu wrote:
               | Vanilla HTTP should be fine. I think 80% of the urls are
               | HTTP.
               | 
               | If you're getting no such domain, it's either blocked
               | because it looks too much like a spam domain, or it
               | simply hasn't been discovered yet.
               | 
               | What's the TLD? I severely restrict some cheaper TLDs
               | because they gave so much spam.
               | 
               | For example, cr.yp.to is an example of a baby I know I've
               | definitely thrown out with the bathwater.
        
               | wolverine876 wrote:
               | www.ft.com gets 'no such domain'
        
               | marginalia_nu wrote:
               | I added it now, but it turns out it's behind a CDN so I
               | still can't crawl it.
        
               | mountain_peak wrote:
               | Is a good ol' .com with no ads and minimal JS -
               | originally launched in 2011. Thanks again for your
               | insights; I've bookmarked your site and will check back
               | every so often to see if my site's been indexed.
        
             | withinboredom wrote:
             | It'd be nice if you had a page to get the current index
             | status for a domain.
        
               | marginalia_nu wrote:
               | Try a query of the form site:www.example.com ;-)
        
               | wolverine876 wrote:
               | > site:www.washingtonpost.com
               | 
               | > Blacklisted false
               | 
               | > site:www.wsj.com
               | 
               | > Blacklisted false
               | 
               | > site:www.rt.com
               | 
               | > Blacklisted false
               | 
               | > site:www.nytimes.com
               | 
               | > Blacklisted true
               | 
               | ?
        
               | marginalia_nu wrote:
               | Hmm, not sure what caused it to end up there, but I
               | removed it from the blacklist. It still doesn't seem to
               | want to index the domain however, probably CDN-related.
        
               | rovr138 wrote:
               | Would it be possible to have a link to a page with
               | operators?
        
         | duckmysick wrote:
         | I'm intrigued by this experiment but I can't visualize it. What
         | do you mean by boring results? Would combing through a library
         | (the one with paper books) also produce boring results? What's
         | your ideal results?
        
           | xipho wrote:
           | Perhaps a counter example, something that is interesting.
           | Anecdotally. This, of all things, is the _top_ result in my
           | search: https://tft.brainiac.com/archive/0303/msg00037.html.
           | Which is strange to me because I don't recognize
           | tft.brainiac. I click, it's a list of biological
           | relationships among Hymenoptera, including a reference to
           | genus of the wasps I studied, presumably in a biological
           | relationship (host/parasite) context. I cataloged every
           | relationship known at one point, so my brain wants to know
           | where this came from, is it something I caught. Then I go
           | look for more context, and find it's part of a thread about
           | D&D(?) and hymenoptera, and it's epic, and a chunk of my
           | morning is lost figuring out why and how this came to be.
        
             | duckmysick wrote:
             | Yes, thanks. That helps.
             | 
             | If I understand it correctly, you're interested in bits and
             | pieces of new information that's indirectly related to your
             | object of interest. Degree 2 and 3 in Six Degrees of Kevin
             | Bacon, so to speak. You know degree 0 like the back of your
             | hand and you've seen almost everything closely connected.
             | Finding novel, interesting things is getting more
             | difficult.
             | 
             | Have you thought about cataloging all the related stuff you
             | stumble upon? Something in between loose notes and what
             | Moby Dick is to cetology.
        
               | xipho wrote:
               | Exactly.
               | 
               | > Have you thought about cataloging all the related stuff
               | you stumble upon? Something in between loose notes and
               | what Moby Dick is to cetology.
               | 
               | Tongue in cheek- new app time, to facilitate this. It
               | should have the name "Degree4". Entries can only be made
               | if degrees 2 and 3 are "defined". Scoffs at degrees 5 and
               | 6, just because. Startup developing can probably
               | unethically seed content by mining
               | https://www.everything2.com/. Should use concepts of "AI"
               | and "persistent homology"... profit!
               | 
               | But no, I don't outside a mental note. Closest I would
               | come would be adding '!! <some note>' to my potwiki text
               | notes (see my past comments) if it's something I want to
               | have come back with a grep, or think might be interesting
               | to explore "when I retire". If it's a scientific fact in
               | my field after researching it further it would go into
               | this https://taxonworks.org (or its precursor).
        
           | xipho wrote:
           | In part, by boring results I mean I instantly recognize the
           | top results, and I know exactly what will be in them, and I
           | know which ones will actually contain potentially interesting
           | new stuff, i.e. _I didn't have to search for these, I'd go
           | there directly_. Then next results are all obscure, and I've
           | already visited them, and/or I know they are historical and
           | not something I have to revisit.
           | 
           | With this engine with at least 1/2 the links (to be fair
           | there were < 20) I didn't recognize the URL at all, and it
           | was clear in the text or the URL that there was an
           | interesting bit to check out (i.e. what Google should have
           | also returned after they barfed out the things I don't need
           | to know about), but had never succinctly done in my
           | experience.
           | 
           | I suppose the magic in this engine would have to be alerting
           | the searcher that they found more of this type of link, as
           | once I visited the 10 or so sites they would fall back into
           | the "been there, done that" link category that Google appends
           | somewhere after the ads and "big" sites, mixed in with a
           | million search term spam sites, etc.
        
           | xattt wrote:
           | There's certain grey literature that's not captured in
           | university library federated searches nor easily found with
           | mainstream search engines.
        
             | xipho wrote:
             | There are decades of academic research not digitized. The
             | digitization window used to only hit around 1990, I haven't
             | looked at it hard recently, but I suspect this still
             | remains true for many important journals. This is grey only
             | to those who do not know how to use a library.
        
       | nagyf wrote:
       | This is great, I like the results. Couple of things I noticed:
       | 
       | - Search results are often very old, from the early 2000s (I guess
       | because back then more websites were text oriented). Are you
       | taking into account the age of the page when showing results? It
       | would be great to see more up-to-date results at the top
       | 
       | - I noticed a few results which directed me to websites with
       | security risks, Firefox didn't even let me open them. Is it
       | possible to filter these out from the results?
        
       | horsh1 wrote:
       | No cyrillic or hiragana support :-(
        
       | 300bps wrote:
       | What we need now is a search engine that weeds out sites that
       | have been SEO optimized for keyword density.
       | 
       | I'm tired of searching for "generic keyword" and getting a page
       | with an extremely low signal to noise ratio written like this:
       | 
       | "Many people search for generic keyword. That is why you can find
       | all about generic keyword here. In fact we specialize in generic
       | keyword and slight alterations of generic keyword."
       | 
       | It's like Google stopped caring that people were gaming it.
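       | 
       | As a rough illustration of the idea, a keyword-density check
       | could be sketched like this. It is a toy heuristic under stated
       | assumptions, not something any search engine is known to use:
       | keyword_density and looks_stuffed are hypothetical names and the
       | 0.1 threshold is arbitrary.

```python
import re
from typing import List


def _words(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_density(text: str, phrase: str) -> float:
    """Fraction of the words in `text` that belong to exact repeats of `phrase`."""
    words = _words(text)
    target = _words(phrase)
    if not words or not target:
        return 0.0
    n = len(target)
    # Slide a window over the text and count exact phrase matches.
    hits = sum(
        1 for i in range(len(words) - n + 1) if words[i:i + n] == target
    )
    return hits * n / len(words)


def looks_stuffed(text: str, phrase: str, threshold: float = 0.1) -> bool:
    """Flag pages where the phrase makes up an implausibly large share of the text."""
    return keyword_density(text, phrase) > threshold
```

       | Run on the sample paragraph above with the phrase "generic
       | keyword", a check like this would flag it, since the phrase
       | accounts for roughly a quarter of all the words.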
        
       | Nicksil wrote:
       | - Semantic HTML; not everything is a div; correct use of markup.
       | 
       | - Search results are not overrun with commercial, SEO stuffing,
       | "content" farms.
       | 
       | I don't know what to say. This is such a refreshing sight. Well
       | done.
        
       | thetanil wrote:
       | Yes please! More of this!
        
       | hulitu wrote:
       | "Search results Search "alt.sysadmin.recovery" needs to be a word
       | Those were all the results,"
       | 
       | No comment.
        
       | hdjjhhvvhga wrote:
       | Congratulations, great work!
        
       | tomaszs wrote:
       | I like the concept of a search engine that does not try to figure
       | out what I should learn based on what I search. I know what I
       | search for
        
       | camillomiller wrote:
       | Great idea, awful UI
        
         | muxator wrote:
         | How so? It's intuitive and super fast. I wish there were more
         | websites with such a simple UI.
        
           | feikname wrote:
           | it's too "uncompact". Font size too big and could use a bit
           | more horizontal space.
           | 
           | I find it comfortable to use at 60% zoom level
        
           | marginalia_nu wrote:
           | Some people like a flashy UI, the modern look is important
           | for them. It's ok to have aesthetic preferences, let's not
           | pretend we don't all have them.
           | 
           | In the end, it's a niche search engine I've made, the
           | intended audience is the long tail. It just isn't for
           | everyone, and if it was for everyone, it probably would be
           | lesser for it.
        
             | silent_cal wrote:
             | I like the UI.
        
             | ravenstine wrote:
             | I'm glad you aren't trying to please anyone. I'd like a
             | return to an internet with fewer colors, gadgets and
             | gizmos, custom fonts, TypeKit, JavaScript requirements, and
             | so on. Most of the time I'm reading articles, so just give
             | me more text and less fluff!
        
         | jbj wrote:
         | What makes you find the user interface awful?
         | 
         | it is literally a search website with a text box for a search
         | term and a button to do the search.
        
         | r00t4ccess wrote:
         | The page isn't prompting for cookie preferences, asking to
         | allow notifications, popping up a mailing list or coupon
         | halfway down the page, playing a full page video with sound, or
         | loading 97 million lines of javascript. I'd say it's pretty much
         | perfect.
        
         | IggleSniggle wrote:
         | Huh. I think it's a great UI. What did you not like about it?
        
         | stronglikedan wrote:
         | Ironic comment, considering that this is a search engine to
         | weed out sites with awful UIs. This gives us exactly what we
         | need in a search UI - no more, no less - in a clean and
         | intuitive way.
        
         | fouc wrote:
         | Yeah the design could use some work. The search results are not
         | compact - I only see 1 result without scrolling, not counting
         | the related wikipedia link that apparently has no description.
         | 
         | I don't particularly like that it seems to be a column
         | constrained to 550px width, instead of being responsive and
         | taking advantage of greater widths.
         | 
         | to the author of the site, if you're not really into
         | design/css, take a look at tailwindcss, it makes it fairly easy
         | to produce a minimal amount of css that is responsive.
        
       | agumonkey wrote:
       | Very nice. Start a trend :)
        
       | jccalhoun wrote:
       | It says it punishes modern web design but it has my most
       | irritating feature of modern web design: a narrow strip of text
       | on an otherwise blank page.
        
       | marginalia_nu wrote:
       | Yeah so this is my project. It's very much a work in progress,
       | but occasionally I think it works remarkably well for something I
       | cobbled together alone out of consumer hardware and home-made
       | code :-)
        
         | eigengrau5150 wrote:
         | I like this. Thanks for doing it.
        
         | scrollaway wrote:
         | I searched Warcraft and got a gold selling/ level boosting
         | site. Some things never change :)
        
         | bityard wrote:
         | This is awesome. I've been looking for a long time for a search
         | engine that basically takes everything Google does and does the
         | opposite. Thank you for doing this, I will definitely be
         | bookmarking it.
         | 
         | Is there a way to suggest or add sites? I went looking for
         | woodgears.ca and only got one result. I also think my personal
         | blog would be a good candidate for being indexed here but I
         | couldn't find any results for it.
        
         | peterburkimsher wrote:
         | Thank you so much for creating such a useful search engine!
         | 
         | Is there any way that you can get an HTTP certificate?
         | 
         | I use an old iPhone 4S, and most of the modern web is
         | inaccessible due to TLS. Hacker News and mbasic.facebook are
         | two of the last sites I can use.
         | 
         | Usually text-based sites are more accessible, so this could be
         | really useful to help me continue using my antique devices!
        
         | ColinHayhurst wrote:
         | Great work. Working on an alternative search engine too. Take a
         | look at my profile.
        
         | soheil wrote:
         | Awesome project! How are you able to keep the site running
         | after HN kiss of death? What is your stack, elastic search or
         | something simpler? How did you crawl so many websites for a
         | project this size? Did you use any APIs like duck duck go or
         | data from other search engines? Are you still incorporating
         | something like PageRank to ensure good results are prioritized
         | or is it just the text-based-ness factor?
        
           | marginalia_nu wrote:
           | > How are you able to keep the site running after HN kiss of
           | death?
           | 
           | I originally targeted a Raspberry Pi4-cluster. It was only
           | able to deal with about 200k pages at that stage, but it did
           | shape the design in a way that makes very thrifty use of the
           | available hardware.
           | 
           | My day job is also developing this sort of high-performance
           | Java applications, I guess it helps.
           | 
           | > What is your stack, elastic search or something simpler?
           | 
           | It's a custom index engine I built for this. I do use mariadb
           | for some ancillary data and to support the crawler, but it's
           | only doing trivial queries.
           | 
           | > How did you crawl so many websites for a project this size?
           | 
           | It's not that hard. Like it seems like it would be, and there
           | certainly is an insane number of edge cases, but if you just
           | keep tinkering you can easily crawl dozens of pages per
           | second even on modest hardware (of course distributed across
           | different domains).
           | 
           | > Did you use any APIs like duck duck go or data from other
           | search engines?
           | 
           | Nope, it's all me.
           | 
           | > Are you still incorporating something like PageRank to
           | ensure good results are prioritized or is it just the text-
           | based-ness factor?
           | 
           | I'm using a somewhat convoluted algorithm that takes into
           | consideration the text-based-ness of the page, but also how
           | many incoming links the domain has, but it's a weighted value
           | that factors in the text-based-ness of the origin domains.
           | 
           | It would be interesting to try a PageRank-style approach,
           | but my thinking is that because it's _the_ algorithm, it's
           | also the algorithm everyone is trying to game.
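Editor's sketch of the ranking idea described above. The formula, the
function name `domain_score`, and the example domains are all
assumptions made for illustration; the comment only says that incoming
links are weighted by how text-based the linking domains are.

```python
def domain_score(domain, textiness, inbound):
    """Hypothetical ranking value for a domain.

    textiness: {domain: 0.0..1.0}, how text-heavy each domain is (assumed input).
    inbound:   {domain: [domains that link to it]}.
    """
    # Each incoming link counts for as much as its origin domain is text-heavy.
    link_weight = sum(textiness.get(src, 0.0) for src in inbound.get(domain, ()))
    # Combine with the domain's own text-based-ness. The product is illustrative only.
    return textiness.get(domain, 0.0) * (1.0 + link_weight)


# Made-up data: a text-heavy blog, a script-heavy shop, a wiki.
textiness = {"blog.example": 0.9, "shop.example": 0.2, "wiki.example": 0.8}
inbound = {
    "blog.example": ["wiki.example"],
    "shop.example": ["wiki.example", "blog.example"],
}
ranked = sorted(textiness, key=lambda d: domain_score(d, textiness, inbound), reverse=True)
print(ranked)
```

With this toy weighting, a link from a text-heavy domain is worth more
than a link from a script-heavy one, so buying links from storefronts
would help less than it does under a plain link count.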
        
         | noduerme wrote:
         | I love this idea, and admire the work you put into it. I'm a
         | fan of long reads and historical non-fiction, and Google's
         | results are truly garbage.
         | 
         | I have a criticism that I think may pertain to the ranking
         | methodology. I searched for "discovery of Australia". Among the
         | top results were:
         | 
         | * A site claiming that the biblical flood was caused by Earth
         | colliding with a comet (with several other pages from that site
         | also making the top search results with other wild claims, e.g.
         | that the Egyptians discovered Arizona);
         | 
         | * Another site claiming the first inhabitants of Australia were
         | a lost tribe of Israel;
         | 
         | * A third site claiming that Australia was discovered and
         | founded by members of a secret society of Rosicrucians who had
         | infiltrated the Dutch East India Company and planned to build
         | an Australian utopia...
         | 
         | These were all pages heavy with HTML4 tags and virtually devoid
         | of Javascript, the kinds of pages you'd frequently see in the
         | late 1990s from people who had built their own static websites
         | in a text editor, or exported HTML from MS Word. At that time,
         | there were millions of those sites with people paying for their
         | own unique domain names, and so the proportion of them that
         | were home to wild-eyed conspiracy theories was relatively
         | small. What I think has happened is that kooks continued to
         | keep these sites up - to the point where it's almost a visual
         | trope now to see a red <h1> tag in Times New Roman and think,
         | uh oh, I've stumbled on an "ancient aliens" site. Whereas
         | scholars and journals offering higher quality information have
         | moved to more modern platforms that rely more heavily on modern
         | browsers - with or without their own domain names. So as a
         | result what seemed to surface here were the fragments of the
         | old web that remain live - possibly because people living in
         | cabins in Montana forget to cancel their web hosting, or
         | because the nature of old-school conspiracy theorists is to
         | just keep packing their old sites with walls of text surrounded
         | by <p> tags.
         | 
         | Arguably, this seems to rank the way Google's engine used to,
         | since it couldn't run JS and they wanted to punish sites that
         | used code to change markup at render time. At least, when I
         | used to have to do onsite SEO work, it was always about simple
         | tag hierarchies.
         | 
         | I wonder whether there isn't some better metric of validity and
         | information quality than what markup is used. Some of the sites
         | that surfaced further down could be considered interesting and
         | valuable resources. I think _not punishing_ simple wall-of-text
         | content is a good thing. But to punish more complicated layouts
         | may have the perverse effect of downranking higher-quality
         | sources of information - i.e. people and organizations who can
         | afford to build a decent website, or who care to migrate to a
         | modern blogging platform.
        
           | _dain_ wrote:
           | those three pages sound pretty interesting, I don't see this
           | as a problem
        
         | crocodiletears wrote:
         | It's very rare that I see a project on HN I can see myself
         | using. This is one. Like others have said, the results can be a
         | little rough. But they're rough in a way I think is much more
         | manageable than the idiosyncrasies of more 'clever' search
         | engines.
        
           | marginalia_nu wrote:
           | I think you need to approach it more like grep than google.
           | It's a forgotten art, dealing with this type of dumb search
           | engine.
           | 
           | Like if you search for "How do I make a steak", you aren't
           | going to get very good results. But a better query is "Steak
           | Recipe", as that is at least a conceivable H1-tag.
        
             | AQXt wrote:
             | So, you are re-implementing Altavista, Lycos and other old
             | search engines.
             | 
             | They used the naive approach: you searched for "steak", and
             | they would bring the pages which included the word "steak".
             | 
             | The problem is that people could fool these engines by
             | adding a long sequence like "steak, steak, steak, steak,
             | steak, steak" to their site -- to pretend that they were
             | the most authoritative page about steaks.
             | 
             | Google's big innovation was to count the referrers -- how
             | many pages used the word "steak" to link to that particular
             | page.
             | 
             | The rest is history.
        
               | wolverine876 wrote:
               | > The problem is that people could fool these engines by
               | adding a long sequence like "steak, steak, steak, steak,
               | steak, steak" to their site -- to pretend that they were
               | the most authoritative page about steaks.
               | 
               | I don't see a lot of people investing in SEO to boost
               | their Marginalia results.
        
               | frogpelt wrote:
               | Effective Google search is also history.
               | 
               | I understand they are trying to maximize ad revenue and
               | search does work very well for people who are looking for
               | products or services.
               | 
               | But it no longer works well for finding information that
               | is even slightly obscure.
        
             | crocodiletears wrote:
             | This is exactly how I prefer to use my search engines.
        
               | quaintdev wrote:
               | I searched like this all my life and always got expected
               | results.
               | 
               | But just a week ago I found out that these "how", "what"
               | questions give better and faster results on Google.
        
               | LeftHandPath wrote:
               | That switch happened some years ago. I've been unlearning
               | and relearning how to use google for what feels like at
               | least three or four years now.
               | 
               | The main pain-point, though, is that a lot of long-tail
               | searches you could've used to find different results in
               | years past, now seem to funnel you to the same set of
               | results based on your apparent intent. At least, it has
               | felt that way -- I'm not entirely sure how the modern
               | google algorithm works.
        
         | bluefox wrote:
         | This is a very cool project! Thank you.
        
         | BugsJustFindMe wrote:
         | I love this, and I love (many of) the results so far! What I
         | can't find on the site is detail about what "too many modern
         | web design features" means. Is it just penalizing sites with
         | tons of JavaScript?
        
           | marginalia_nu wrote:
           | Javascript tags are penalized the hardest, but it also takes
           | into consideration density of text per HTML. There's also
           | some adjustments based on text length, which words occur in
           | the page, etc.
        
         | ad404b8a372f2b9 wrote:
         | Very cool project! How many websites do you have in your index?
         | And how did you go about building it?
         | 
         | I've been working on an engine for personal websites, currently
         | trying to build a classifier to extract them from commoncrawl,
         | if you have any general tips on that kind of project they'd be
         | very welcome.
        
         | davegauer wrote:
         | This is absolutely wonderful. I am LOVING the results I'm
         | getting back from it: the sort of content-rich sites that have
         | become nigh unreachable using traditional search engines. Thank
         | you for building this!
        
         | asah wrote:
         | Love it, kudos! This is great for developers and others who
         | Just Need Answers and not shopping or entertainment.
         | 
         | If you're looking for feedback, both from a UI design and
         | utility standpoint, you might consider "inlining" results from
         | selected sites, e.g. Wikipedia, Stack Exchange, etc. Having
         | worked on search for a long time, inlining (onebox etc) is a
         | big reason users choose Google, and that challengers fail to get
         | traction. If you're Serious(tm), dig into the publisher
         | structure formats and format those, create a test suite, etc.
         | 
         | A word of caution: if this takes off, as a business it's
         | vulnerable to Google shifting its algorithms slightly to
         | identify the segment of users+queries who prefer these results
         | and give the same results to those queries.
         | 
         | Hope this helps!
        
           | marginalia_nu wrote:
           | If Google starts showing interesting text-heavy links instead
           | of vapid listicles and storefronts, I have accomplished
           | everything I ever could dream of.
        
             | MaysonL wrote:
             | Google Info - for when you're looking for information, not
             | shopping advice or lists!
        
               | addandsubtract wrote:
               | Google info? Can you give me a sample query of what you
               | mean?
        
               | wolverine876 wrote:
               | Maybe you're joking, but this is a good idea for a search
               | engine. Better: Credible info.
        
             | 0xbadcafebee wrote:
             | Thank you for doing this important work.
        
             | palijer wrote:
             | Haha, reminds me exactly of this.
             | 
             | https://xkcd.com/810/
        
             | aaron5 wrote:
             | haha, great answer! thanks for your work on this :)
        
         | santamex wrote:
         | Which software do you use to index the sites?
        
           | marginalia_nu wrote:
           | I wrote it myself from scratch. I have some metadata in
           | mariadb, but the index is bespoke.
           | 
           | A design sketch of the index is that it uses one file with
           | sorted URL IDs, one with IDs of N-grams (i.e. words and word-
           | pairs) referring to ranges in the URL file; as well as a
           | dictionary for relating words to word-IDs; that's a GNU Trove
           | hash map I modified to use memory map data instead of direct
           | allocated arrays.
           | 
           | So when you search for two words, it translates them into IDs
           | using the special hash map, goes to the words file and finds
           | the least common of the words; starts with that.
           | 
           | Then it goes to the words file and looks up the URL range of
           | the first word.
           | 
           | Then it goes to the words file and looks up the URL range of
           | the second word.
           | 
           | Then it goes through the less common word's range and does a
           | binary search for each of those in the range of the more
           | common word.
           | 
           | Then it grabs the first N results, and translates them into
           | URLs (through mariadb); and that's your search result.
           | 
           | I'm skipping over a few steps, but that's the very crudest of
           | outlines.
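Editor's sketch of the file layout outlined above, held in memory
instead of in memory-mapped files. It indexes single words only (no
word pairs), and `build_index` and its return values are hypothetical
names rather than the author's code.

```python
def build_index(docs):
    """Build a toy index from {url_id: text}.

    Returns:
      word_ids: {word: word_id}          (stands in for the dictionary hash map)
      ranges:   [(start, end), ...]      (stands in for the words file, by word_id)
      urls:     [url_id, ...]            (stands in for the URL file)
    """
    postings = {}
    for url_id, text in docs.items():
        for word in set(text.lower().split()):
            postings.setdefault(word, []).append(url_id)

    word_ids, ranges, urls = {}, [], []
    for word in sorted(postings):
        word_ids[word] = len(ranges)
        start = len(urls)
        # URL IDs are sorted within each word's range so they can be
        # binary-searched at query time.
        urls.extend(sorted(postings[word]))
        ranges.append((start, len(urls)))
    return word_ids, ranges, urls


docs = {1: "hello there", 5: "hello world", 7: "hello", 2: "world", 8: "world"}
word_ids, ranges, urls = build_index(docs)
start, end = ranges[word_ids["hello"]]
print(urls[start:end])
```

The size of a word's range (`end - start`) is what lets a query start
with the least common word.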
        
             | q3k wrote:
             | Good stuff. I've also been toying with doing some homegrown
             | search engine indexing (as an exercise in scalable
             | systems), and this is a fantastic result and great
             | inspiration.
             | 
             | Definitely want to see more people doing that kind of low-
             | level work instead of falling back to either 'use
             | elasticsearch' or 'you can't, you're not google'.
        
               | marginalia_nu wrote:
               | Well just crunching the numbers should indicate what is
               | possible and what isn't.
               | 
               | For the moment I have just south of 20 million URLs
               | indexed.
               | 
               | 1 x 20 million bytes = 20 MB.
               | 
               | 10 x 20 million bytes = 200 MB.
               | 
               | 100 x 20 million bytes = 2 GB.
               | 
               | 1,000 x 20 million bytes = 20 GB.
               | 
               | 10,000 x 20 million bytes = 200 GB.
               | 
               | 100,000 x 20 million bytes = 2 TB.
               | 
               | 1,000,000 x 20 million bytes = 20 TB.
               | 
               | This is still within what consumer hardware can deal
               | with. It's getting expensive, but you don't need a
               | datacenter to store 20 TB worth of data.
               | 
               | How many bytes do you need, per document, for an index?
               | Do you need 1 MB of data to store index information about
               | a page that, in terms of text alone, is perhaps 10 KB?
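Editor's sketch reproducing the arithmetic above, assuming decimal
units (1 kB = 1000 bytes) and 20 million URLs. The helper names are
made up for the example.

```python
def index_size_bytes(num_urls, bytes_per_doc):
    """Total index size if every document costs bytes_per_doc bytes."""
    return num_urls * bytes_per_doc


def human(n):
    """Format a byte count using decimal units, capped at TB."""
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if n < 1000 or unit == "TB":
            return f"{n:g} {unit}"
        n /= 1000


NUM_URLS = 20_000_000
for bytes_per_doc in (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
    size = index_size_bytes(NUM_URLS, bytes_per_doc)
    print(f"{bytes_per_doc:>9,} bytes/doc -> {human(size)}")
```

The last row is the 1 MB-per-document case, which comes to 20 TB.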
        
             | rvnx wrote:
             | It's a great project!
        
             | Aeolun wrote:
             | I'm not sure how you go from word to url range? Range
             | implies contiguous, but how can you make that happen for a
             | bunch of words without keeping track of a list of urls for
             | each word (or URL ids, the idea is the same)?
        
               | marginalia_nu wrote:
               | The trick is that the list of URLs for each word already
               | _is_ in the URLs file.
               | 
               | The URLs in a range are sorted. A sorted list (or list-
               | range) forms an implicit set-like data structure, where
               | you can do binary searches to test for existence.
               | 
               | Consider a words file with two words, "hello" and
               | "world", corresponding to the ranges (0,3), (3,6). The
               | URLs file contains URLs 1, 5, 7, 2, 5, 8.
               | 
               | The first range corresponds to the URLs 1, 5, 7; and the
               | second 2, 5, 8.
               | 
               | If you search for hello world, it will first pick a
               | range, the range for "hello", let's say (1,5,7); and then
               | do binary searches in the second range -- the range
               | corresponding to "world" -- (2,5,8) to find the overlap.
               | 
               | This seems like it would be very slow, but since you can
               | trivially find the size of the ranges, it's possible to
               | always do them in an order of increasing range-sizes. 10
               | x log(100000) is a lot smaller than 100000 x log(10)
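Editor's sketch of the lookup described above, using the same "hello"
and "world" ranges. `contains` and `intersect` are hypothetical names,
and Python's `bisect` stands in for whatever binary search the real
engine uses.

```python
from bisect import bisect_left


def contains(urls, start, end, target):
    """Binary search for target within the sorted slice urls[start:end]."""
    i = bisect_left(urls, target, start, end)
    return i < end and urls[i] == target


def intersect(urls, ranges):
    """URL IDs present in every range, walking the smallest range first."""
    ordered = sorted(ranges, key=lambda r: r[1] - r[0])
    (start, end), rest = ordered[0], ordered[1:]
    return [
        u for u in urls[start:end]
        if all(contains(urls, a, b, u) for a, b in rest)
    ]


urls = [1, 5, 7, 2, 5, 8]                  # the URLs file
words = {"hello": (0, 3), "world": (3, 6)}  # the words file: word -> range
print(intersect(urls, [words["hello"], words["world"]]))  # [5]
```

Walking the smallest range and binary-searching the larger ones is the
"10 x log(100000)" ordering from the comment: the work grows with the
rarest word's range, not the most common one's.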
        
         | axelroze wrote:
         | Hi,
         | 
         | Interesting idea. Definitely see an overlap with eReader
         | markets and looking at text only contents.
         | 
         | How does it work?
         | 
         | It ignores pages on which it detects frameworks for ui and ads
         | or any javascript code at all?
        
         | agumonkey wrote:
         | is there a json endpoint ? I'd love to make an emacs bridge :)
        
           | maddyboo wrote:
           | Seconded, I'd like to incorporate it into a project of mine.
        
         | artembugara wrote:
         | Nice, what are you using to crawl the web?
        
           | marginalia_nu wrote:
           | It's pretty much all bespoke.
           | 
           | I use external libraries for parsing HTML (JSoup) and
           | robots.txt; but that's about it.
        
             | soheil wrote:
             | What was the starting site you fed to the crawler to follow
             | the links from to build the index?
        
               | marginalia_nu wrote:
               | Just my (swedish) personal website. The first iteration
               | of the search engine was probably mainly seeded by these
               | links:
               | 
               | https://www.marginalia.nu/00-l%C3%A4nkar/
               | 
               | But I've since expanded my websites, so now I think these
               | play a decent role in later iterations, although they are
               | virtually all of them pages I've found eating my own
               | dogfood:
               | 
               | https://memex.marginalia.nu/links/fragments-old-web.gmi
               | 
               | https://memex.marginalia.nu/links/bookmarks.gmi
        
         | zizee wrote:
         | Love the idea. A little feedback: layout needs tweaking for
         | mobile. FWIW: I'm on mobile Firefox for Android.
        
         | blondin wrote:
         | fantastic project, thank you!
        
         | habibur wrote:
         | How are you doing the crawling without getting blocked? -- the
         | hardest part.
        
           | judge2020 wrote:
           | Not OP but crawling is easy if you don't try scanning 5+
           | pages a second - almost all rate limiting/heuristic based
           | 'keep server costs low' engines, including Cloudflare, don't
           | care if you request every page, but will take action if you
           | do something like burst every page and take up just as many
           | server resources as a hundred concurrent users.
           | 
           | Now, that is assuming you aren't on some VPS provider. If
           | you're going to crawl, you'll have the best chance when you
           | use your own IPs on your own ASN, with DNS and reverse DNS
           | set up correctly. This makes it so the IP reputation systems
           | can detect you as a crawler but not one that hammers every
           | site it visits.
           | 
           | Also, I imagine that, for a search engine like this, it
           | doesn't expect content to change much anyways - so it can
           | take its time crawling every site only once every month or
           | two, instead of the multiple times a week (or day) search
           | engines like Google have to for the constantly-updated
           | content being churned out.
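Editor's sketch of the per-site pacing described above. The class name
`PoliteScheduler` and the delay value are assumptions, and no real
fetching is done; the caller supplies the clock so the logic can be
checked without sleeping.

```python
from urllib.parse import urlsplit


class PoliteScheduler:
    """Spaces out requests to the same host by at least min_delay seconds."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self._next_ok = {}  # host -> earliest time the next request may start

    def wait_time(self, url, now):
        """Return how long to wait before fetching url, and reserve that slot."""
        host = urlsplit(url).hostname or ""
        start = max(self._next_ok.get(host, now), now)
        self._next_ok[host] = start + self.min_delay
        return start - now


sched = PoliteScheduler(min_delay=2.0)
print(sched.wait_time("https://a.example/1", now=0.0))  # first hit: no wait
print(sched.wait_time("https://a.example/2", now=0.0))  # same host: wait
print(sched.wait_time("https://b.example/", now=0.0))   # other host: no wait
```

Because the delay is tracked per host, the crawler can still fetch many
pages per second overall by spreading requests across different
domains.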
        
         | androceium wrote:
         | Pretty neat!!!
         | 
         | You may already be aware of this, but the page doesn't seem to
         | be formatted correctly on mobile. The content shows in a single
         | thin column in the middle.
        
           | marginalia_nu wrote:
           | Hmm, which OS? I only have a single Android phone so I've
           | only fixed the CSS for that.
        
             | androceium wrote:
             | I was seeing it on Android w/ Firefox. Seems like it's
             | fixed now though. :)
        
               | marginalia_nu wrote:
               | Curious, I haven't touched the stylesheets.
        
             | ant6n wrote:
             | For example Firefox on Android.
        
           | khimaros wrote:
           | Fennec F-Droid on Android 11 has some rendering issues.
        
         | edbaskerville wrote:
         | This has amazing potential. I'd encourage you to form a non-
         | profit, turn this into something that can last as an
         | organization without becoming what you're trying to avoid
         | becoming. This is a good enough start that I bet you could
         | raise a sizeable startup fund very soon from a combination of
         | crowdfunding and foundation grants--I bet the Sloan Foundation
         | would love this!
        
       | egberts1 wrote:
       | I tried "Error 49" as a search phrase.
       | 
       | It's rudimentary but no IT-related result.
        
       | thrtythreeforty wrote:
       | > New: You can now look up dictionary definitions for words. If
       | you for example don't know what the definition of is is, you can
       | inquire thus: define:is.
       | 
       | Oh man, I love subtle jabs and tongue in cheek writing like this.
       | Very Robin Williams-esque.
        
         | marginalia_nu wrote:
         | I am the first to admit it's a pretty dated reference.
        
       | earthbee wrote:
       | I love this! I've been searching random words with no aim in
       | particular and keep finding lots of interesting tiny personal
       | webpages. It feels like the old web
        
       | [deleted]
        
       | arduinomancer wrote:
       | Wow this is immediately useful
       | 
       | If you figure out some sort of funding model (maybe even just
       | Patreon) I could totally see this as a viable side project
       | 
       | Already discovered this recipe site: https://based.cooking/
       | 
       | I love how adding recipes is through pull requests:
       | https://github.com/LukeSmithxyz/based.cooking/pulls
        
         | MrBoomixer wrote:
         | Thank you for this, it really makes me love the web and the
         | people making things like this, Forked!
        
       | dmje wrote:
       | Love it. You should provide a link to Patreon / whatever so
       | people can support you financially. Hosting is probably not cheap
       | for you. Given the love here on HN I suspect you'd do well.
        
         | marginalia_nu wrote:
         | Hosting is actually surprisingly cheap, but that's because I'm
         | hosting it on consumer hardware in my living room, off my
         | domestic broadband connection.
         | 
         | That's both a blessing and a curse. It works okay as long as I
         | don't touch it, but I can't do maintenance without shutting it
         | down. I can't implement crawler changes without a week of
         | shitty results as it needs to visit half the Internet before it
         | gets decently good. I can only afford a production machine, so
         | all testing that can't be done with unit tests gets done there.
         | 
         | Anyway I added a patreon in case anyone wants to toss a coin.
        
       | spandrew wrote:
       | All of my searches are turning up unrelated results ("college
       | life after the pandemic", "post-pandemic teaching in higher
       | education", "football news NFL" etc.)
       | 
       | NFL one had 'some' decently related results, but the websites
       | were all strangely disreputable.
        
         | mmmpop wrote:
         | > the websites were all strangely disreputable
         | 
         | Interesting you'd feel that way when sites without "modern
         | design" are encountered. Is this your own bias perhaps creating
         | a judgment or are they sites that you already know have a bad
         | reputation?
        
           | typon wrote:
           | Or perhaps the websites being returned are garbage? I have
           | the same experience trying a few searches and following the
           | top 5 links. Besides wikipedia, I haven't found a single
           | useful website.
        
             | spandrew wrote:
             | This!
             | 
             | Maybe modern or 'non-modern' web design just isn't a great
             | litmus test for quality content? Could just need some work.
             | At any rate I wasn't clicking on the results.
        
       | abhinav22 wrote:
       | Great work and congrats!
        
       | skyfaller wrote:
       | This is a fantastic search engine. It delivers on its promise of
       | "serendipity". I found pages featuring my name that I'm not sure
       | I've ever seen before, after many years of searching myself to
       | test out search engines.
       | 
       | Perhaps more importantly, it delivers the most correct result
       | when searching for my username: the first result is not any of my
       | social media accounts, or even my own blog, but the text of the
       | obscure science fiction story that I took my username from! Well
       | done.
       | 
       | I've immediately added this as a search keyword in Firefox, and
       | I'll be using it more in the future.
       | 
       | Could meta search engines like DuckDuckGo include this as a
       | source? Should they?
        
       | voidnullnil wrote:
       | How does this have 2.5K upvotes when every single HN related
       | project needs JS and a quad core CPU (for the browser to open a
       | blank page) to view a paragraph of text?
        
       | pietroppeter wrote:
       | from About page:
       | 
       | > If you search for "Plato", you might for example end up at the
       | Canterbury Tales. Go looking for the Canterbury Tales, and you
       | may stumble upon Neil Gaiman's blog.
       | 
       | I know it is just a suggestion, but had to try searching both,
       | with no luck in getting the expected unexpected.
        
         | marginalia_nu wrote:
         | Yeah I did some work very recently aimed at improving the
         | relevance a bit. It was a bit too random in the state it was
         | before. Now it, perhaps, isn't random enough anymore.
        
           | pietroppeter wrote:
           | It looks very nice anyway, great job! I did try with other
           | queries and results were in general interesting.
        
       | fsflover wrote:
       | See also: https://wiby.me/
        
         | [deleted]
        
         | Tade0 wrote:
         | The "surprise me..." button is adequately labelled.
        
           | twobitshifter wrote:
           | Shades of stumbleupon.
        
           | tpmx wrote:
           | Great link to drag to the bookmark bar.
        
       | scopio918 wrote:
       | All search engines favor more text and fewer graphics.
        
       | JohnFen wrote:
       | Oh, this is brilliant! I think I'll make this my "first stop"
       | search engine.
        
       | bovermyer wrote:
       | I adore this. Unfortunately, searching for my own name - with or
       | without quotes - doesn't actually find my site.
       | 
       | It does find a handful of references to me from over twenty years
       | ago, though, which I thought was fascinating.
        
         | tgv wrote:
         | My name retrieved the "dead pornstar list". Unexpected.
        
       | hop34s3w wrote:
       | If the website is targeted towards an international audience,
       | then it's nice to have the first page link to content in English.
       | All four links on the main page https://www.marginalia.nu/ lead
       | to non-English content, which is not useful.
       | 
       | Disclaimer: I am not a native English speaker. English is my
       | second language.
        
         | marginalia_nu wrote:
         | Yeah my main site is a bit of a disorderly mess. It started as
         | a Swedish blog, but I've since added a few services aimed at a
         | global audience. Haven't quite figured out how to unify it all
         | just yet.
        
       | JohnJamesRambo wrote:
       | Saving this forever. Thank you for making it.
        
       | Valkhyr wrote:
       | As a quick test, I searched for the name of one of my favorite
       | game series: "Baldur's Gate" (on its own, no qualifiers, properly
       | spelled - I would usually spell it "baldurs gate" on Google, but
       | I decided to give this one the best chance). I search for info
       | around video games a lot, so that's quite representative of a
       | good chunk of my web searches, and I pretty much know the top
       | sites Google would give me for that query (on its own, without
       | any further qualifiers).
       | 
       | The results were all either barely relevant, outdated (sites that
       | covered the game back in the 90s/2000s before it was re-
       | released), at best tangentially relevant or complete garbage
       | noise. Some of the most highly relevant pages (such as the Steam
       | store listing, the fandom wiki, the publisher/developer's forums
       | for the re-releases, the Baldur's Gate 3 website and the
       | subreddit) were not included at all. Those are all fairly text
       | heavy by any reasonable standard, so I assume they were
       | "punished" because they use JS? Would make sense that nearly all
       | of them are way out of date.
       | 
       | Then I searched specifically for "Baldur's Gate Wiki" but still
       | out of luck - some results, but nothing vaguely Wiki-like.
       | 
       | Finally I searched for "Baldur's Gate Fandom Wiki". This is
       | basically "search engine easy mode", by giving essentially the
       | name of the site I am looking for. I got ZERO results. At this
       | point I gave up and decided that this thing is useless.
       | 
       | Look, I'm all for unearthing good long-form content (in fact I
       | would say that much of the content around this specific game
       | would qualify), and I do get as annoyed at modern SPAs as the
       | next grumpy neckbeard.
       | 
       | I think considering both of those in a search engine is not a bad
       | idea in and of itself. But I have to wonder what's the point of a
       | search engine that weights some arbitrary aspect of web design
       | higher than the relevancy of the subject matter (to the point of
       | not returning any results at all)? In fact, considering that
       | generally speaking more recent websites tend to include more
       | scripting, you are intentionally skewing the results towards
       | (very) old content, which is probably doing the user a
       | disservice.
        
         | matesz wrote:
         | > But I have to wonder what's the point of a search engine that
         | weights some arbitrary aspect of web design higher than the
         | relevancy of the subject matter (to the point of not returning
         | any results at all)?
         | 
         | Because in some cases it returns arguably better, more to the
         | point results than other search engines - for example, search
         | for "Douglas Engelbart" or "Ted Nelson". I thought that I'd
         | searched everything for those two, yet marginalia gave results
         | I would otherwise never have seen.
        
         | marginalia_nu wrote:
         | This just isn't the place to go for promotional materials about
         | upcoming video games. It's a niche search engine for
         | discovering stuff off the beaten path, the stuff you _can't_
         | find on mainstream search engines. Some of it is junk,
         | admittedly, and not everyone will see the point, that's fine
         | too.
         | 
         | Despite what some people seem to think, it's never been meant
         | as a google-replacement. I have never claimed otherwise.
        
       | zvxczvvzxzcxzm wrote:
       | This is great!
       | 
       | I tried with "covid tyranny", and got some very interesting
       | results I'd never get with any of the other search engines!
        
       | m1117 wrote:
       | Love it! I can punish my employees by setting this as a default
       | search engine on their work laptops.
        
       | kodeninja wrote:
       | Ivermectin (marginalia):
       | https://search.marginalia.nu/search?query=ivermectin+
       | 
       | Ivermectin (Google): https://www.google.com/search?q=ivermectin
       | 
       | The difference in the overall _thrust_ of the results is
       | remarkable.
       | 
       | Very interesting! Thanks for building it.
        
         | typon wrote:
         | The Google results tell you why Ivermectin is not a good
         | replacement for vaccination against Covid, the Marginalia
         | results tell you that Ivermectin is a miracle drug for treating
         | Covid 19. Really shows how much technology has the power to
         | change reality in today's world.
        
           | lame-robot-hoax wrote:
           | Google links to the FDA, CDC, WHO, NIH, WebMD, drugs.com, and
           | a pro ivermectin journal article from the American Journal of
           | Therapeutics.
           | 
           | The Marginalia results point you mostly to random blogs.
        
           | marginalia_nu wrote:
           | Part of what I wanted to show with this project is that there
           | is no such thing as an objective search engine. Even
           | seemingly irrelevant technological decisions drastically
           | impact the narrative.
        
             | Shingbogle wrote:
             | It's definitely not irrelevant. My first search term was
             | covid related because I knew the non-official, random
             | person on the internet blog wouldn't have the money to
             | create flashy sites.
        
             | Drew_ wrote:
             | Well the focus on text content isn't the only technical
             | difference here. Google is obviously weighing hundreds of
             | signals in its search results that your engine is not
             | accounting for. These omitted signals are also relevant.
        
               | sundarurfriend wrote:
               | > These omitted signals are also relevant.
               | 
               | Certainly. And sometimes they're relevant in a good way,
               | sometimes in a bad user-hostile way. Every search engine
               | requires discrimination and intelligent usage by the
               | person doing the search, just in different areas.
        
               | marginalia_nu wrote:
               | Right, but that is still a technical decision on their
               | side. They presumably don't sit down and have a meeting
               | about what world view they should present. Well I hope
               | they don't.
        
               | kevin_thibedeau wrote:
               | Google is actively fighting the spread of disinformation.
               | You can see this clearly in the forced row of COVID PSA
               | links on the YouTube front page that's been up for the
               | last year regardless of whether you have any history
               | viewing such content. There is manual intervention going
               | on to prevent the garbage their normal algorithms will
               | automatically surface. This is the greatest tragedy of
               | the internet in that it allows people with crazy notions
               | to find each other and build echo chambers with the aid
               | of unbiased ML.
        
         | silent_cal wrote:
         | Lol!
        
         | daxfohl wrote:
         | Though before basing life-and-death decisions on this, consider
         | reading the "about" page first:
         | https://memex.marginalia.nu/projects/edge/about.gmi
         | 
         | > The purpose of the tool is primarily to help you find and
         | navigate the strange parts of the internet. Where, for sure,
         | you'll find crack-pots, communists, libertarians, anarchists,
         | strange religious cults, snake oil peddlers, really strong
         | opinions.
         | 
         | and
         | 
         | > If you are looking for fact, this is almost certainly the
         | wrong tool.
        
         | lame-robot-hoax wrote:
         | Yes, google returns results from the FDA, American Journal of
         | Therapeutics pro Ivermectin study, WebMD, the CDC, the NIH,
         | Wikipedia, the WHO, and New York Times.
         | 
         | Marginalia returns results from Wikipedia, a faculty member's
         | university blog regarding river blindness, a website called
         | truthsummit promoting it as a miracle cure, a website called
         | vaxxchoice promoting it as a cure, vitamindsstopcovid, etc.
         | 
         | I'd say the quality of the results are quite different.
        
         | motoxpro wrote:
         | Second result: "Ivermectin, a miracle drug against Covid
         | Ivermectin, a miracle drug against Covid. 100% effective as
         | preventative and for early stage Covid. Over 90% cut in
         | fatality rate for late-stage cases.
         | 
         | https://truthsummit.info/blog/ivermectin-against-covid.html "
         | 
         | Eh I think I'll take the google search results on this one.
        
       | pkamb wrote:
       | Great results for "sauna". Lots of Web 1.0 pages discussing
       | building plans and displaying pictures of individually built,
       | traditional, unique, old saunas on some property.
       | 
       | The Google results are all blogspam or sales pages for cheap
       | shipped saunas. Lots of "IR" results. Phony health benefit pages.
       | Stock photos solely of beautiful new hotel gyms.
       | 
       | I've noticed this problem with Google results for quite some
       | time. Sadly, the _new_ content being created of the top variety
       | is mostly being done within private Facebook groups that can't
       | be easily searched, linked, or archived.
        
       | sireat wrote:
       | Fantastic idea and it works quite well for short phrases that I
       | tried.
       | 
       | As expected I am getting a lot of early 2000s sites which is
       | something that I miss on regular Google.
       | 
       | Hilariously searching for "array data structure" got me one of
       | the top results this little tiny page:
       | http://infolab.stanford.edu/~backrub/google.html
        
         | marginalia_nu wrote:
         | > We have designed Google to be scalable in the near term to a
         | goal of 100 million web pages
         | 
         | Funny, that's about where I see my search engine capping out as
         | well.
        
       | rfrey wrote:
       | This is stunning. I searched "winemaking" because it's my latest
       | obsession, and turned up dozens of links to high-quality pages
       | I'd never seen despite spending an hour a day for three months
       | cruising Google on the topic.
       | 
       | Please do announce it here if you ever decide to solicit help or
       | contributors. My stab at this problem was to have a search index
       | of only ad-free pages, on the hypothesis it would turn up self-
       | hosted blogs, university personal pages, that sort of thing. But
       | the results were too thin, your approach is much better.
        
       | winddude wrote:
       | hmm, I dream of a recipe search engine that punishes recipe pages
       | with too much text. lol
        
         | mint2 wrote:
         | Yeah, recipe sites have both too much text and too many
         | pictures.
         | 
         | But they do illustrate what this search engine needs to watch
         | out for. If they rank more text higher and their search site
         | becomes popular, won't everyone just spam recipe-site word
         | salad, maybe even AI-generated word salad?
         | 
         | But in the interval, until that day comes, they are going to
         | have a very useful service.
        
       | Paul_S wrote:
       | Looking for an arm assembly instruction, instead I get this
       | strange website as the result
       | http://mailstar.net/coronavirus.html
       | 
       | Is that accidental or is this website promoted because it's text
       | heavy and will surface for any search without many results?
        
         | marginalia_nu wrote:
         | Looks like that page just has an absurd amount of keywords.
         | Those sometimes surface when there aren't any good results.
         | Haven't found a foolproof detection method that doesn't
         | unjustly punish innocent pages with large amounts of content.
        
       | tbojanin wrote:
       | this is sweet
        
       | gen_greyface wrote:
       | Hi, it'd be nice if you could add an OpenSearch description
       | document for your site.
       | 
       | https://developer.mozilla.org/en-US/docs/Web/OpenSearch
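A minimal sketch of what such a description document could look like, generated here with Python's standard library. The `query` parameter in the URL template follows the search URLs quoted elsewhere in this thread; the short name and description are placeholder assumptions, not anything the site actually publishes.

```python
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def build_opensearch_description(short_name, description, template):
    """Return an OpenSearch 1.1 description document as an XML string."""
    # Emit the OpenSearch namespace as the default (unprefixed) namespace.
    ET.register_namespace("", OPENSEARCH_NS)
    root = ET.Element(f"{{{OPENSEARCH_NS}}}OpenSearchDescription")
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}ShortName").text = short_name
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}Description").text = description
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}InputEncoding").text = "UTF-8"
    # The browser substitutes {searchTerms} with the user's query.
    ET.SubElement(
        root,
        f"{{{OPENSEARCH_NS}}}Url",
        {"type": "text/html", "template": template},
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder name and description; only the URL shape comes from the thread.
    print(build_opensearch_description(
        "Marginalia",
        "Search engine favoring text-heavy sites",
        "https://search.marginalia.nu/search?query={searchTerms}",
    ))
```

Browsers typically discover such a file through a `<link rel="search" type="application/opensearchdescription+xml">` element in the page head.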
        
         | gen_greyface wrote:
         | until then i'll keep the site bookmarked. :-)
        
       | SlapperKoala wrote:
       | I like the idea but could use some tweaking. I keep getting
       | conservative christian websites for some reason. And foreign
       | language sites
        
       | josefresco wrote:
       | If you like wacky search engines, there's also Million Short:
       | https://millionshort.com where you can search and remove the top
       | 100/1K/10k/100K/1M results.
        
       | ephbit wrote:
       | Quoted from the linked site:
       | 
       | > Convenience functions have been added, and the search engine
       | can now perform simple calculations and unit conversions. Try 1
       | pint in cubic centimeters, or 50+sqrt(pi). This functionality is
       | still under development, be patient if it doesn't work.
       | 
       | Why would you make any ever so small effort to implement
       | calculations? I don't get it.
       | 
       | If your search engine enabled me to find more useful search
       | results to my queries than google or yacy or whatever, I wouldn't
       | care one tiny bit about being able to do calculations with it.
       | 
       | Why not focus on the search functionality?
        
         | marginalia_nu wrote:
         | I implemented calculations because easily 80% of my google
         | queries are calculations, unit conversions, etc.
         | 
         | Search functionality is a larger priority. Calculations and unit
         | conversions were an afternoon's break from the search
         | functionality :-)
        
           | ephbit wrote:
           | Ok, I guess people use google (or search engines in general)
           | differently ... I rarely if ever use a search engine to
           | calculate stuff or do unit conversions.
           | 
           | I use google only to search and when ecosia/bing doesn't
           | return anything useful.
        
         | exporectomy wrote:
         | How else do you do unit conversions? I use Google because it's
         | far easier than any other software I've tried. Mainly because
         | it's more forgiving of errors. It knows that "34 fset in
         | msters" is 10.3632 meters. This search engine isn't as
         | forgiving, though, so I wouldn't waste time trying to discover
         | its unit conversion syntax rules.
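The forgiving behaviour described above can be approximated with fuzzy matching of unit names. This is only a sketch using Python's `difflib`; the tiny unit table, the `<number> <unit> in <unit>` query shape, and the 0.6 similarity cutoff are all assumptions made for illustration, and nothing here reflects how Google or Marginalia actually parse such queries.

```python
import difflib

# Conversion factors to meters: a deliberately tiny, illustrative table.
TO_METERS = {
    "feet": 0.3048,
    "inches": 0.0254,
    "meters": 1.0,
    "kilometers": 1000.0,
    "miles": 1609.344,
}


def closest_unit(word):
    """Return the known unit whose spelling is most similar to `word`."""
    matches = difflib.get_close_matches(
        word.lower(), list(TO_METERS), n=1, cutoff=0.6
    )
    if not matches:
        raise ValueError(f"unknown unit: {word!r}")
    return matches[0]


def convert(query):
    """Parse '<number> <unit> in <unit>' and return (value, target unit)."""
    amount, source, keyword, target = query.split()
    if keyword != "in":
        raise ValueError("expected '<number> <unit> in <unit>'")
    src, dst = closest_unit(source), closest_unit(target)
    return float(amount) * TO_METERS[src] / TO_METERS[dst], dst


if __name__ == "__main__":
    value, unit = convert("34 fset in msters")
    print(f"{value:.4f} {unit}")
```

With this table, the misspelled "fset" resolves to feet and "msters" to meters, so the query from the comment converts as intended.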
        
           | samhh wrote:
           | On macOS for example I'd use Spotlight.
        
           | soheil wrote:
           | You can also use the Chrome address bar without hitting the
           | enter, just start typing
        
       | drusepth wrote:
       | Interesting approach.
       | 
       | I always search myself on new search engines to compare the
       | results. Most engines return my personal blog/website,
       | books/stories I've written, news stories, my github
       | projects/contributions, social links, etc.
       | 
       | This search engine surfaces just three obscure IRC logs that
       | contain my nick in join/part messages (nothing said from me!)
       | from 2009. And nothing else.
       | 
       | There's probably some things this approach is really good at but
       | I'm not sure what they'd be for me off hand. Always cool to see
       | new approaches to search, though.
        
       | fhackernewz wrote:
       | fuck you hacker news
        
       | fsckboy wrote:
       | I've read most of the comments here and people are evaluating the
       | search results: all good information.
       | 
       | I'm looking at "punishes modern web design"... This thing IS
       | modern web design. I think it's called "marginalia" in reference
       | to the huge margins they chose!
       | 
       | I'm using a browser on a linux desktop and side-by-side, HN's
       | page design is old-fashioned tasteful making pretty good use of
       | space, and marginalia has a font that's more than twice the 2D
       | pointsize and is so spread out with whitespace that the "Tips" on
       | the home page are off the bottom of my window.
        
       | gtmb wrote:
       | As everything in life flows in cycles, I predict the search engine
       | that will de-throne Google will be like Google when it started -
       | a simple variation of page rank.
       | 
       | No smarts, no bubble, no signals decided by over fitting to a
       | biased engineer preference.
        
         | __MatrixMan__ wrote:
         | I agree, except it'll optionally accept the ID of your node in
         | a web of trust, and it'll use a page rank customized for you.
         | 
         | Or you can put in two ID's and have it find sources that both
         | parties trust.
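One way to read the "page rank customized for you" idea is personalized PageRank, where the random jumps always land on the nodes you trust. The comment does not say which algorithm is meant, so this is a sketch under that assumption; the graph shape, the damping factor of 0.85, and taking the minimum of two scores for the "both parties trust" case are illustrative choices.

```python
def personalized_pagerank(links, trusted, damping=0.85, iterations=100):
    """Power iteration where random jumps land only on `trusted` nodes."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes.update(trusted)
    # Jump distribution: uniform over the trusted set, zero elsewhere.
    jump = {n: (1.0 / len(trusted) if n in trusted else 0.0) for n in nodes}
    rank = dict(jump)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * jump[n] for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Dangling node: hand its rank back to the trusted set.
                for n in nodes:
                    nxt[n] += damping * rank[node] * jump[n]
        rank = nxt
    return rank


def jointly_trusted(links, first, second):
    """Order nodes by the lower of their two personalized scores."""
    a = personalized_pagerank(links, {first})
    b = personalized_pagerank(links, {second})
    scores = {n: min(a[n], b[n]) for n in a.keys() & b.keys()}
    return sorted(scores, key=lambda n: (-scores[n], n))
```

Pages that are unreachable from a trusted node end up with a score of zero for that user, which is what keeps unrelated link clusters out of the customized ranking.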
        
         | jerrre wrote:
         | I wouldn't say the existence of this page proves your
         | prediction right (as it's not dethroning Google anytime soon).
         | 
         | It's easy to forget that the goal of Google isn't to provide a
         | useful search engine (at least not anymore), but the search
         | engine is a by product of them wanting to show ads.
        
           | marcos100 wrote:
           | If Google isn't useful then nobody will use it.
           | 
           | The search engine and the ads are tightly coupled. A better
           | search engine means it can predict with more accuracy what
           | you are looking for and can serve you an even more targeted
           | ad that increases the chance you'll click.
        
             | rchaud wrote:
             | > If Google isn't useful then nobody will use it.
             | 
             | Or they'll continue to use it out of sheer inertia. Google
             | is paying Apple $15 billion to keep its place as iOS
             | default search engine.
             | 
             | IE6 didn't die overnight when Firefox arrived.
        
           | popcube wrote:
           | Now Google tries to imitate a document system on your computer;
           | usually I rely on Google to know what I need :(
        
         | antupis wrote:
         | As a dev I would love a search engine that would only search
         | Stack Overflow, GitHub issues, documentation, etc.
        
           | axelroze wrote:
           | You can limit the search query per website in DDG (and
           | probably in others)
           | 
           | Example: `rust slow compilation site:stackoverflow.com`
        
             | antupis wrote:
             | Yeah but usually I want some set of sites not just
             | stackoverflow.
        
           | goodpoint wrote:
           | ...especially if you could group online resources by category
           | (e.g. software eng, cooking, ...)
        
         | axelroze wrote:
         | Wouldn't the dethroner of Google be some new technology which
         | is not a search engine like Google but better at solving the
         | original task of finding information on how to solve problems?
         | 
         | Just like how iPad dethroned Windows PCs for average home user
         | but not Mac because Windows had the monopoly and then an
         | innovation destroyed MS in this space and not a competitor.
         | 
         | I don't think Google dethrones Yahoo and AltaVista scenario
         | will occur again.
        
           | gverrilla wrote:
           | > iPad dethroned Windows PCs for average home user
           | 
           | is this true? in the US, perhaps? because in south america it
           | couldn't be further from the truth - didn't happen at all
        
       | amznbyebyebye wrote:
       | Wow. Love this.
       | 
       | Searched for "Ramanujan", one of my heroes.
       | 
       | Found this gem- https://math.ucr.edu/home/baez/ramanujan/
       | 
       | Ramanujan's "easiest" formula.
       | 
       | Awesome!!
        
       | amelius wrote:
       | Question: how do we _benchmark_ search engines? Are there any
       | groups attempting to provide (open) solutions in this space?
       | 
       | (It seems to me that if you want to build a good search engine,
       | this is the question you need to address first.)
        
         | X6S1x6Okd1st wrote:
         | The search term you might be looking for is "information
         | retrieval" there are pretty standard measurements for whether
         | you are getting good results, but they are generally
         | conditioned on stuff like click through rate, comparing to
         | expert ranking and other signals that the user gives you that
         | it was a good or bad return of search results.
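Two of the standard offline measurements, precision at k and normalized discounted cumulative gain (nDCG), can be sketched as follows. The document ids and expert grades used with them would come from a human-labeled test set; the data structures here (a ranked list plus a set or dict of judgments) are an assumption for illustration.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that experts marked relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def dcg(gains):
    """Discounted cumulative gain: later positions count for less."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))


def ndcg_at_k(ranked, grades, k):
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    gains = [grades.get(doc, 0) for doc in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0
```

Precision at k ignores ordering within the top k, while nDCG rewards putting the highest-graded documents first, so the two are usually reported together.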
        
       | lightsurfer wrote:
       | thank you!
        
       | NuNotNon wrote:
       | I have an interest in logic and CS curricula, and I like geneses
       | in general (in the last few days I've read Russell's introduction
       | to mathematical philosophy and an ACM report on the CS
       | curriculum). I searched for "cs curriculum" and this is the
       | first link: https://www.cs.rice.edu/~vardi/sigcse/ It feels so
       | good to receive good answers so easily. Thanks.
        
       | megraf wrote:
       | Thank you so much. This is wonderful
        
       | timdaub wrote:
       | I'm developing a text-heavy site and philosophically I'm trying
       | to view documents as just that... documents [1].
       | 
       | But I don't get good results for "rug pull".
       | 
       | - 1 https://rugpullindex.com
        
         | marginalia_nu wrote:
         | Yeah it's hosted by cloudflare. I'm currently IP-blocking them,
         | because they keep prompting my crawler with a captcha,
         | presumably because it's made millions of requests from their
         | CDN.
         | 
         | Some rigmarole getting recognized as a good bot by the CDNs.
         | I've submitted a request fairly recently, but haven't heard
         | back from them yet.
         | 
         | Like I would like to be on good terms with them, and other
         | websites that block small independent crawlers.
         | 
         | I can't blame them though, there's a lot of bad bots out there.
         | But I'm doing my best not to be part of the problem.
        
           | [deleted]
        
           | petercooper wrote:
           | Aha, I was going to ask how you were coping with CDNs like
           | Cloudflare blocking bots. It's sad we've got to this point
           | where basically only the established search engines are
           | grandfathered in to be able to crawl sites.
        
         | BugsJustFindMe wrote:
         | > _I'm developing a text-heavy site_
         | 
         | I looked at the source for your site's front page. That's not
         | text-heavy; that's markup-heavy. I didn't bother looking at the
         | rest of the pages because it appears to be yet another crypto
         | market site.
        
       | greggturkington wrote:
       | Wouldn't this just skew towards really old sites?
       | 
       | The _third_ search result for "dog" is this page on how to
       | remove AOL Instant Messenger, published in 2002.
       | 
       | https://sillydog.org/netscape/kb/removeaim.html
       | 
       | No one wants to see newsletter signup popovers, but "modern web
       | design" includes good performance and relevant content. (The
       | search engine itself takes about 2 seconds to first contentful
       | paint, not great.)
        
         | bityard wrote:
         | This search engine pretty much takes everything that Google is
         | doing and does the opposite. For instance, Google has decided
         | that "relevant" usually also means "recent". Thus, when
         | searching for something on Google, you mainly get results from
         | blogspam farms and almost never do you see anything more than a
         | few years old.
         | 
         | An implication of this is that old sites tend to disappear
         | (either into obscurity or by being taken down) because Google
         | penalizes them in search rankings. The author of this search
         | engine says, however:
         | 
         | > If a webpage has been around for a long time, then odds are
         | it has fundamental redeeming quality that has motivated keeping
         | it around all for that time.
         | 
         | I don't know that I agree 100% with this (there was lots of
         | crap on the "old" web too), but it makes a certain amount of
         | sense.
        
           | greggturkington wrote:
           | What "fundamental redeeming quality" about uninstalling AIM
           | from Windows 3.x motivated making that the 3rd result for
           | "dog"?
           | 
           | The 5th result is a tutorial on CSS. This search engine
           | decided it's relevant because it has "dog" in the URL. Is
           | that a better reasoning than Google's?
           | https://htmldog.com/guides/css/beginner/
           | 
           | Core Web Vitals ranks sites higher that perform well. Text-
           | heavy sites that are also optimized and relevant would
           | already perform well.
        
             | marginalia_nu wrote:
             | What are you searching for when you enter the query "dog",
             | keeping in mind the search engine deliberately does not
             | examine synonyms and deliberately seeks out the path
             | less taken?
             | 
             | Dog facts? Then search "dog facts"
             | 
             | Famous dogs? Then search "famous dogs"
             | 
             | Rappers? Try "snoop dogg"
        
               | [deleted]
        
               | greggturkington wrote:
               | I'm searching for information on "dog".
               | 
               | Your suggestion of "dog facts" returns 6 pages from the
               | same domain, dogquotes.com. It's unreadable on mobile
               | because it's so old, all the facts are unsourced, and
               | often wrong:
               | 
               | > Never assume that a barking dog won't bute _[sic]_ ,
               | unless you're absolutely certain the dog believes it too.
               | 
               | Also on the 1st SERP, this odd blog post ranting about
               | 4th amendment rights [1], "Media Glamorization of the
               | Psychopath" [2], and this (image-heavy) page about
               | dolphin encounters in the Bahamas ("Sea Dog Facts" is a
               | link on the page).
               | 
               | 1. http://www.rexcurry.net/drugdogsdan.html
               | 
               | 2. https://www.metaphoricalplatypus.com/articles/psychology/psychopathysociopathy/media-glamorization-of-the-psychopath/
               | 
               | 3. https://www.dolphinencounters.com/education/
        
       | samsaga2 wrote:
       | Where does the data come from? Do you index the whole web
       | yourself? I see it totally impossible for a personal project. I'm
       | very curious about that.
        
         | marginalia_nu wrote:
         | I do indeed index the web myself. Not the _entire_ web, just a
         | subset of it. The crawler quickly loses interest in
         | javascript:y websites and only indexes at depth those websites
         | that are simple. It also focuses on websites in English,
         | Swedish and Latin and tries to identify and ignore the rest
         | (best-effort).
         | 
         | You'd be surprised how much you can do with modern hardware if
         | you are scrappy. The current index is about 17.7 million URLs.
         | I've gone as far as 50 million and could _probably_ double that
         | if I really wanted to. The difficulty isn't having a small
         | enough index, but rather having a relevant enough index,
         | weeding out the link farms and stuff that just take space.
         | 
         | I only index N-grams of up to 4 words, carefully chosen to be
         | useful. The search engine, right now, is backed by a 317 Gb
         | reverse index and a 5.2 Gb dictionary.
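As a rough illustration of a reverse index keyed by word N-grams of up to 4 words, here is a minimal sketch. It indexes every N-gram, whereas the comment above says the real N-grams are carefully chosen; the selection criteria are not described in the thread, so that step is left out, and the tokenization (lowercase, whitespace split) is an assumption.

```python
from collections import defaultdict


def word_ngrams(text, max_n=4):
    """Yield every run of 1..max_n consecutive words in `text`."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            yield " ".join(words[start:start + n])


def build_reverse_index(docs, max_n=4):
    """Map each N-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in word_ngrams(text, max_n):
            index[gram].add(doc_id)
    return index
```

Looking up a phrase is then a single dictionary access, at the cost of an index that grows quickly with document length, which is presumably one reason to be selective about which N-grams are kept.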
        
           | omoikane wrote:
           | > It also focuses on websites in English, Swedish and Latin
           | and tries to identify and ignore the rest
           | 
           | When I search for Japanese terms, it says "<query> needs to
           | be a word", which wasn't the best error message. Maybe the
           | error message should say something like "sorry, your language
           | isn't supported yet"?
        
             | marginalia_nu wrote:
             | I've rephrased the wording for that one a bit.
        
           | throwaway47292 wrote:
           | Amazing!
           | 
           | I have only one recommendation that might make the search a
           | bit more relevant, e.g when searching for 'linux locking' or
           | 'kernel locking' kind of things.
           | 
           | Try to upsort things that match near the top of the content,
           | like the top of the man page vs middle vs bottom.
           | 
           | One easy way to do it without having to store the positions
           | is to index the ngrams with min(sqrt(line), 8) of their line
           | number, which covers the first 64 lines. You can also use
           | log(), or just decide ad hoc (top, middle, bottom of the
           | document) so that you use only 3 values.
           | 
           | e.g. https://www.kernel.org/doc/html/v5.0/kernel-
           | hacking/locking.... would do unreliable_1 guide_1 locking_1
           | ... then at line 4 kernel_2 locking_2 ... after line 50 ...
           | then_7 ... and after that everything will be _8.
           | 
           | then just make the query "kernel locking" to "dismax(kernel_1
           | OR kernel_2 OR kernel_3...) AND dismax(locking_1 OR locking_2
           | ...)" with some tiebreaker of 0.1 or so. You can also say "I
           | want to upsort things on the same line, or few lines apart"
           | by modifying the query a bit.
           | 
           | It works really well and costs very little in terms of space.
           | I tried it at https://github.com/jackdoe/zr while searching
           | all of Stack Overflow, man pages, etc. and was pretty
           | surprised by the result.
           | 
           | This approach is a bit cheaper than storing the positions
           | because positions are (let's say) 4 bytes per term per doc,
           | while this approach has a fixed upper bound cost of 8*4 per
           | document (assuming 4 byte document ids) plus some amortized
           | cost for the terms.
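The bucketing idea above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the linked repository: the bucket is taken as floor(sqrt(line)) capped at 8, the weight 1/bucket is an arbitrary choice for the example, and the names `line_bucket`, `index_terms`, `dismax` and `score` are hypothetical. The dismax rule used here is the usual one: best match plus tiebreaker times the remaining matches.

```python
import math

MAX_BUCKET = 8  # lines past 64 all share the last bucket


def line_bucket(line_number):
    """Map a 1-based line number to floor(sqrt(line)), capped at MAX_BUCKET."""
    return min(math.isqrt(line_number), MAX_BUCKET)


def index_terms(lines):
    """Return the set of bucketed terms (e.g. 'locking_1') for one document."""
    terms = set()
    for number, line in enumerate(lines, start=1):
        bucket = line_bucket(number)
        for word in line.lower().split():
            terms.add(f"{word}_{bucket}")
    return terms


def dismax(word, doc_terms, tiebreaker=0.1):
    """Best bucket match plus tiebreaker * the other matches, or None if absent."""
    # Assumed weighting: earlier buckets score higher (1/bucket).
    scores = [1.0 / b for b in range(1, MAX_BUCKET + 1)
              if f"{word}_{b}" in doc_terms]
    if not scores:
        return None
    best = max(scores)
    return best + tiebreaker * (sum(scores) - best)


def score(query, doc_terms):
    """AND the per-word dismax scores; 0.0 if any query word is missing."""
    total = 0.0
    for word in query.lower().split():
        word_score = dismax(word, doc_terms)
        if word_score is None:
            return 0.0
        total += word_score
    return total


top = index_terms(["Unreliable guide to locking", "", "", "Kernel locking basics"])
late = index_terms([""] * 70 + ["kernel locking appendix"])
print(score("kernel locking", top), score("kernel locking", late))
```

A document that mentions the query words in its first few lines outranks one that only mentions them after line 64, and each term contributes at most 8 postings per document regardless of how often it occurs.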
        
           | kews wrote:
           | Do you know what proportion of the texty web instructs
           | unknown crawlers to go away (or blocks them)?
        
             | marginalia_nu wrote:
             | It's hard to give numbers, it doesn't seem to be very many,
             | but losing out on a few key sites does make a pretty big
             | impact.
             | 
             | You see stuff like this sometimes, makes me a bit sad.
             | 
             | https://linux.die.net/robots.txt
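For readers unfamiliar with how a crawler respects such a file, here is a small sketch using Python's standard `urllib.robotparser`. The rules below are a made-up example of the "named crawlers only" pattern, not the contents of the file linked above, and the user agent `my-hobby-crawler` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one well-known crawler is allowed, everyone else is not.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/man/ls"
big_crawler_ok = parser.can_fetch("Googlebot", url)
hobby_crawler_ok = parser.can_fetch("my-hobby-crawler", url)
print(big_crawler_ok, hobby_crawler_ok)
```

A polite crawler runs a check like `can_fetch` before every request, which is why a catch-all `Disallow: /` removes a site from small indexes entirely.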
        
           | c0wb0yc0d3r wrote:
           | How did you go about seeding your web crawler with URLs to
           | crawl?
        
             | marginalia_nu wrote:
             | I just started with my website and did a crawl.
             | Subsequently I've been seeding it with the best results
             | from my previous crawls.
             | 
             | It's a directed search so it doesn't seem to need a
             | particularly solid seed to get decent results.
        
               | c0wb0yc0d3r wrote:
               | So how long did it take to get to 17 million URLs?
        
             | dannyw wrote:
             | Not OP, but if I was to do this, I'd start by downloading
             | Wikipedia and all its external links and references, and
             | crawling from there. You should eventually reach most of
             | the publicly visible internet.
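Seeding and crawling outward, as discussed in this subthread, amounts to a breadth-first traversal of the link graph. The sketch below assumes an in-memory dictionary in place of real HTTP fetches; `crawl`, `get_links` and the `.example` hosts are hypothetical names, and this is not the engine's actual crawler.

```python
from collections import deque


def crawl(seeds, get_links, limit=1000):
    """Breadth-first crawl: visit the seeds first, then whatever they link to."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    frontier = deque(seeds)
    order = []
    while frontier and len(order) < limit:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical link graph standing in for fetched pages.
LINKS = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["c.example", "seed.example"],
    "b.example": ["c.example"],
    "c.example": [],
    "island.example": [],
}

visited = crawl(["seed.example"], lambda url: LINKS.get(url, []))
print(visited)
```

Pages that nothing links to, like `island.example` here, are never reached, which is the practical limit on "eventually reaching most of the publicly visible internet" from any seed set.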
        
               | c0wb0yc0d3r wrote:
               | I feel a little embarrassed that I didn't think of
               | something like that.
               | 
               | When I did some crawler experimenting in my younger
               | years, I thought I was pretty clever using sites that
               | would let you perform random Google searches. I would
               | just crawl all the pages from the results returned.
               | 
               | Your method would undoubtedly be more interesting I
               | think. It would certainly lead to interesting performance
               | problems quicker, I bet.
        
           | dannyw wrote:
           | This is unbelievably impressive on a technical and ambition
           | level for a solo, self-hosted hardware project. Kudos.
        
           | jillesvangurp wrote:
           | Cool, I've been thinking on this topic a bit lately. Crawling
           | is indeed not that hard of a problem. Google could do it 23
           | years ago. The web is a bit bigger now of course but it's not
           | that bad. Those numbers are well within the range of a very
           | modest search cluster (pick your favorite technology; it
           | shouldn't be challenging for any of them). 10x or 1000x would
           | not matter a lot for this. Although it would raise your cost
           | a little.
           | 
           | The hard problem is indeed separating the good stuff from the
           | bad stuff; or rather labeling the stuff such that you can
           | tell the difference at query time. Page rank was nice back in
           | the day; until people figured out how to game things. And now
           | we have bot farms filling the web with nonsense to drive
           | political agendas, create memes, or to drown out criticism.
           | Page rank is still a useful ranking signal; just not by
           | itself.
           | 
           | The one thing no search engine has yet figured out is
           | reputability of sources. Content isn't anonymous mostly. It's
           | produced and consumed by people. And those people have
           | reputations. Bot content is bad because it comes from sources
           | without a credible reputation. Reputations are built over
           | time and people value having them. What if we could value
           | people's appreciation relative to their reputability? That
           | could filter out a lot of nonsense. A simple like button + a
           | flag button combined with verified domain ownership (ssl
           | certificates) could do the trick. You like a lot of content
           | that other people disliked, your reputation goes down the
           | drain. If you produce a lot of content that people like, your
           | reputation goes up. If a lot of reputable people flag your
           | content, your reputation tanks.
           | 
           | The hard part is keeping the system fair and balanced. And
           | reputability is of course a subjective notion and there is a
           | danger of creating recommendation bubbles, politicizing
           | certain topics, or even creating alternative reality type
           | bubbles. It's basically what's happening. But it's mostly
           | powered by search engines and social media that actually
           | completely ignore reputability.
        
       | silent_cal wrote:
       | Wonderful work (':
        
       | afrcnc wrote:
       | except it doesn't actually return that many results
        
       | lazybreather wrote:
       | Amazing! How do I make this my search engine on browser? Not home
       | page.
        
       | winddude wrote:
       | curious how do you afford the infrastructure? I found that the
       | hardest part of running a search engine.
        
         | marginalia_nu wrote:
         | I'm self-hosting, and the server is a Ryzen 7 3900x with 128 Gb
         | of non-ECC RAM. It sits in my living room next to a cheap UPS.
         | I did snag one of the last remaining Optane 900Ps off Amazon,
         | and it powers the index and the database--and I really do think
         | this is among the best hardware choices for this use case. But
         | beyond that it's really nothing special, hardware-wise. Like
         | it's less than a month's salary.
         | 
         | It runs Debian, and all the services run bare metal with zero
         | containerization.
         | 
         | Modern consumer hardware can be absurdly powerful if you let
         | it.
         | 
         | Like I have no doubt a thousand engineers could spend a hundred
         | times as much time building a search engine that did pretty
         | much the same thing mine does, it would require a full data
         | center to stay running and be much slower. But that's just a
         | cost of large scale software development I don't have to pay as
         | a solo developer with no deadline, no planning and a shoestring
         | budget.
        
       | kevinob11 wrote:
       | I tested this with "Caribbean Vacation" and wow what a
       | difference. Everything on Google is "TOP X LIST" and "BEST XYZ"
       | which are just the worst when trying to find real interesting
       | information about experiences you can have on vacation somewhere.
       | I had used those as starting points then searched for long-form
       | blogs of real experiences people have had. This surfaced those
       | kinds of things immediately. I love it.
        
       | jokoon wrote:
       | I don't even want to imagine how google and other search engines
       | crawl websites that make heavy usage of react or other ajax
       | stuff. I don't want to be that guy.
       | 
       | I wonder if some browser engineers are trying to come up with
       | ideas for a solution to this. Personally, I would just make a
       | browser that breaks backward compatibility, removes old
       | features, etc. I guess browsers would be much lighter, faster
       | and simpler if some hard choices were made.
       | 
       | Mozilla already decided to break some websites with the strict
       | cookie policy. I wish they would do the same for everything else
       | that sucks on the modern web.
       | 
       | I honestly don't think I have much respect for "web developers".
       | In a way I want mobile apps to kill the modern web, just to prove
       | a point.
        
       | yewenjie wrote:
       | Related question - suppose I want to create a meta search engine
       | for myself, and I want it to be as fast as possible. What are the
       | things I should be optimizing for?
        
       | FractalHQ wrote:
       | Ok this is great if all I want to do is read text, but often
       | times that is very much not all I want to do. The web is much
       | more than text and images these days. I can appreciate this as
       | long as it's branded as a search engine for blogs and articles
       | specifically, as opposed to being touted as a drop-in replacement
       | for the modern search engine.
        
         | lukas099 wrote:
         | Is this a criticism? It doesn't at all seem touted as a drop-in
         | replacement for the modern search engine.
        
           | FractalHQ wrote:
           | I do think this is a great tool, I didn't intend to come off
           | as being critical. I think I had some misdirected frustration
           | from people on HN talking about how the modern web is bad and
           | should be replaced with a text protocol. I aspire to create
           | web apps that push the boundaries of what is possible today,
           | and feel disappointed whenever I encounter people advocating
           | for regression.
           | 
           | But I digress. In hindsight, my comment was obnoxious and
           | under-appreciative of the tool being shared, and my rant was
           | only tangentially related.
        
       | ahthat wrote:
       | Very cool idea! Room for lots of improvement, keep working on it,
       | I like the direction this is going.
        
         | marginalia_nu wrote:
         | I just got it working reasonably well just this week. I've had
         | it "working" for a few months, but the results were always
         | extremely chaotic, bordering on random.
        
       | Phileosopher wrote:
       | Wow, if this catches on, my original content will actually
       | matter![1] I've always had a love-hate relationship with modern
       | web design principles because my design choices have all the
       | excitement and polish of what we get on HN.
       | 
       | I'm sure I'm not the only one, either. Content-rich sites need
       | more love.
       | 
       | [1] https://adequate.life
        
       | justinzollars wrote:
       | It works. Nice job.
        
       | chx wrote:
       | Not sure what to do with this.
       | 
       | https://search.marginalia.nu/search?query=gan+charger
       | 
       | aside from nexperia none of this looks even remotely relevant.
        
       | tejtm wrote:
       | There is an open standard way for an engine like this to provide
       | a mechanism for your standards-aware browser to add the site as
       | an alternative search with a click.
       | 
       | That way I would not have to remember or bookmark it; I could
       | just use my search bar as normal and choose which engine to use
       | for this query, or set it as default.
       | 
       | https://developer.mozilla.org/en-US/docs/Web/OpenSearch
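As an illustration of what such an OpenSearch description contains, the sketch below builds a minimal one with Python's standard library. The short name and description strings are placeholders, the query URL pattern is taken from the search links posted in this thread, and this is not claimed to be the descriptor the site actually serves.

```python
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def opensearch_description(short_name, description, template):
    """Build a minimal OpenSearch 1.1 description document as a string."""
    root = ET.Element("OpenSearchDescription", {"xmlns": OPENSEARCH_NS})
    ET.SubElement(root, "ShortName").text = short_name
    ET.SubElement(root, "Description").text = description
    # {searchTerms} is the placeholder the browser replaces with the query.
    ET.SubElement(root, "Url", {"type": "text/html", "template": template})
    return ET.tostring(root, encoding="unicode")


xml_doc = opensearch_description(
    "Marginalia",                      # placeholder name
    "Search text-heavy sites",         # placeholder description
    "https://search.marginalia.nu/search?query={searchTerms}",
)
print(xml_doc)
```

A site advertises a file like this from its pages with a `<link rel="search" type="application/opensearchdescription+xml" ...>` element, which is what lets the browser offer it as a search engine.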
        
       | NotAnOtter wrote:
       | "Don't be afraid to scroll down in the search results, unlike in
       | many other search engines, depending on what you are looking for,
       | you may find the best results in the middle of the listing."
       | 
       | This is a very polite way of saying "this engine isn't very good"
       | 
       | Overall impressed with the project but I thought the word play
       | there was funny
        
         | marginalia_nu wrote:
         | I felt I needed to add it to help people taught by other search
         | engines that they only get 1-2 good results, and the rest is
         | useless. The reason I'm providing a hundred results is that
         | there are often a lot of results to choose from. If the point
         | is to find something unexpected, and that indeed is the entire
         | point, then that is the only sane design choice.
         | 
         | Like you search for something on Google and similar, and you
         | know what you are going to find. They are so good at searching
         | the Internet and predicting what you are going to click on that
         | you never see something new.
         | 
         | It's a great feat of engineering, but a huge tragedy, because
         | discovering new things, outside of what you or your
         | demographic has previously demonstrated an interest in, can
         | be absolutely life changing.
        
       | jarbus wrote:
       | This engine is fantastic for recipes
        
       | NmAmDa wrote:
       | Let's make the internet a great place for knowledge again. I
       | really loved the engine and tried it with different terms and am
       | very happy. Good job
        
       | godshatter wrote:
       | Instead of looking for something specific, I decided to try a
       | category of some sort to see what came up. Thinking about
       | Jeopardy categories, I tried "potent potables" and found a lot of
       | random pages that may or may not have made sense given that
       | category but that I had a lot of fun reading. Definitely a win
       | for me.
        
       | exabrial wrote:
       | I would like a search that punishes 'modern' SPAs that load 87mb
       | of the author's pet JS projects to display simple text. Basically
       | every modern SPA.
        
       | rc_mob wrote:
       | blessings upon you sir for making this
        
       | sabujp wrote:
       | effort is good, but needs some work, no results here :
       | 
       | https://search.marginalia.nu/search?query=rxjava+2+api+docs
       | 
       | https://www.google.com/search?q=rxjava+2+api+docs&oq=rxjava+...
        
         | marginalia_nu wrote:
         | It is very much a work in progress, still struggling with some
         | areas. I only really got into the territory of "sometimes
         | actually useful" like this weekend. Wasn't planning on blowing
         | up on HN just yet.
        
       | rafael_c wrote:
       | I liked this one... I searched for 'George Harrison' and among
       | the first results there was a page with interesting comments
       | about Harrison's solo career; someone reminiscing about the time
       | they got to talk to him about guitars for half an hour at a bar
       | at the airport; a transcript for an interview he gave on TV...
       | Whereas on GOOGLE: an intrusive 'People also ask' in which I was
       | not interested; thumbnails for videos on youtube that I was not
       | looking for; previews to garbage clickbaity news articles; and
       | then finally for the search items: a bunch of websites for
       | lyrics; his Instagram (!) and fb pages; his imdb page; some more
       | news articles I was not looking for...
       | 
       | Granted, google's web results above are perhaps what people are
       | looking for 75% of the time, but how limiting and boring.
       | 
       | I'm also a sucker for the simplistic text-centric, information-
       | laden pages from the pre-facebook era.
       | 
       | For 'global warming', however - since Marginalia excludes modern
       | web-design pages - the results are of dubious relevance and
       | interest, since they are, well, 'old'.
       | 
       | I see myself using this engine a lot.
        
       | leephillips wrote:
       | This is wonderful and stupendous.
       | 
       | I've often thought that Google could be turned back into a good
       | search engine by simply eliminating the crap and letting the
       | useful sites float to the top of the results.
       | 
       | marginalia.nu seems to like my sites, so it must be good!
       | 
       | Some results are prefixed with ! or an arrow dingbat. What does
       | that mean?
        
       | PieUser wrote:
       | searching for covid gives a bunch of bogus crap of fake news
        
       | isaacgreyed wrote:
       | A common use case, how to do random thing in programming:
       | 
       | I searched python make a bar chart and it returned a live coding
       | video with an AI generated text transcript and two articles which
       | mentioned a different kind of bar.
       | 
       | I then narrowed it down to just python bar chart, and got a blog
       | post about scripting with a bar chart in it, this
       | http://www.nitcentral.com/voyager4/hellyear.htm with monty
       | python, bars, and charts from 1996 and among some other things I
       | found this https://python-
       | course.eu/naive_bayes_classifier_introduction..., which had an
       | example of a python bar chart even though the title of the page
       | made me think it wasn't what I wanted.
       | 
       | So for what I imagine to be a difficult search because of all the
       | different meanings of the words, I found my result on the second
       | query pretty quickly, and found some cool unrelated stuff too.
       | 
       | I like mostly that I get what I type in, and not exactly what I
       | want, but what I want is there too.
        
         | CapmCrackaWaka wrote:
         | I would probably use this if I wanted to find interesting blog
         | posts/websites about a topic I want to learn more about in
         | general. It seems less useful for returning exact answers to
         | specific questions.
        
       | jerhewet wrote:
       | I use webcrawler.com, and IMO it's better than any other search
       | engine for finding _exactly_ what I'm looking for. Not what's
       | "trending", or "popular", or what the sheeple are searching for.
       | It finds the _exact matching keywords_ that I'm looking for. No
       | inference or other bullshit -- just the matches.
       | 
       | Such a relief to not wade through oceans of worthless crap any
       | more.
        
       | api wrote:
       | This is the most amazing thing I have seen on here in at least a
       | year!
       | 
       | It's... no... it can't be... a search engine that finds _actual
       | information_ instead of 5 megabyte blobs of tracking code and SEO
       | crap!
        
       | optimalsolver wrote:
       | I predict it will return a disproportionate amount of sites by
       | schizophrenic conspiracists.
        
       | smoyer wrote:
       | You can add this to Firefox as search engine option by right
       | clicking on the URL and selecting "Add Marginalia". From there,
       | setting it as your default search engine is done from the
       | "Settings" panel as with other predefined search engines.
       | 
       | I'm experimenting with using it as my default ...
        
       | slim wrote:
       | This is a search engine indexing the internet on a mariadb
       | database hosted on consumer hardware maintained by a single
       | person as a hobby and it does not suffer from HN hug of death
        
         | swyx wrote:
         | how on earth do you index so much on consumer hardware? my
         | frontend developer mind is blown.
        
           | foxfluff wrote:
           | Wait till you learn that modern CPUs run _billions_ of cycles
           | per second. With multiple cores in parallel! And they can
           | reach transfer rates of tens of gigabytes per second to RAM,
           | or around a terabyte per second into L3.
        
             | ThalesX wrote:
             | And then you add a single HTTP request and everything tones
             | down to the speed of the web. Or I/O. Or DB call.
        
               | goodpoint wrote:
               | More like: you add some javascript library and now the
               | browser needs 5 seconds to run 10 MB of javascript.
        
               | knuthsat wrote:
               | You can still support millions of these requests per
               | second if you just bake all of the dependencies directly
               | in a small OS running on your fastest raspberry pi.
        
           | paxys wrote:
           | Consumer hardware today is simply what was cutting-edge and
           | crazy expensive 5 years ago.
        
       | FriendWithMoon wrote:
       | As a sufferer of Tinnitus, and having spent near 100 hours
       | researching it, I found a few sites I had never seen offering
       | great data and tools. Thank you
        
         | marginalia_nu wrote:
         | Honestly it's probably not a great source for medical advice.
         | At least take what you read with a healthy grain of salt.
        
       | necovek wrote:
       | Too bad the search index is currently restricted to ASCII-only
       | (or at least Cyrillic and Latin-2 characters were rejected as
       | "contains characters that are not currently supported").
       | 
       | I love the idea definitely, and I've long toyed around with
       | building a similar thing that starts crawling off my own
       | bookmarks (a personal small-deep-web if you wish).
       | 
       | I also love the "Small Web" name: this is the first I hear of it,
       | and it's what I've long complained about -- the web today hides
       | all of the cool gems search engines of old would have given you!
       | 
       | I am also a bit split on the "www" prefix restriction (iiuc,
       | domains which do not have "www" subdomain too are dropped from
       | the index because many of them are spammy): it might for sure be
       | a useful heuristic, but I've advocated for dropping "www" back in
       | late 90s and early 2000s already (one reason being that in, e.g.,
       | Serbian, "w" is not in the alphabet, so you can't reasonably
       | quote it, as Serbian is otherwise a phonetic language).
        
       | talrand wrote:
       | Gave it a go with two different queries. For the first, "amazon
       | vendor services", I didn't get a single result about the
       | topic.
       | 
       | The second query was a nation+city (in the nation). Got a lot of
       | results that were in no way related to either.
       | 
       | It seems to be biased towards IT topics (based uniquely on the
       | two queries).
        
       | deadalus wrote:
       | Very interesting because of the interesting results from random
       | websites. It's a great discovery tool.
       | 
         | Now hoping for a search engine that favors text-heavy sites and
       | punishes paywalls
        
         | the__alchemist wrote:
         | I built one!
        
       | billyharris wrote:
       | Search engines always like websites with more text and less
       | graphics.
        
       | mattchew wrote:
       | Oh, I dream of a day where there are multiple useful search
       | engines, specialized for different purposes.
       | 
       | You're doing God's work here. Thanks and good luck.
        
         | scns wrote:
         | The Flying Spaghetti Monster wants to have a word with you.
         | 
         | (edit) That is a nice dream though.
        
         | brian_herman wrote:
         | Kind of reminds me of the past like alta vista and dogpile.
        
         | BoxOfRain wrote:
         | I wonder if there's any mileage in an extension of something
         | like uBlock Origin's lists of ad networks to block but instead
         | it's a list of known content mills and SEO spam factories to
         | remove from search results?
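A blocklist of this kind could be applied to result URLs with a few lines of code. The sketch below assumes a plain set of blocked domains; the function name `filter_results` and the `.example` domains are hypothetical, and no existing extension's list format is implied.

```python
from urllib.parse import urlsplit


def filter_results(urls, blocked_domains):
    """Drop result URLs whose host is a blocked domain or one of its subdomains."""
    kept = []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        blocked = any(host == domain or host.endswith("." + domain)
                      for domain in blocked_domains)
        if not blocked:
            kept.append(url)
    return kept


BLOCKED = {"contentmill.example", "seo-spam.example"}  # hypothetical list
results = [
    "https://blog.example/long-form-post",
    "https://www.contentmill.example/top-10-things",
    "https://seo-spam.example/best-xyz",
    "https://notcontentmill.example/essay",
]
clean = filter_results(results, BLOCKED)
print(clean)
```

Matching on the host with a leading dot avoids accidentally blocking unrelated domains that merely end in the same letters.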
        
       | high_byte wrote:
       | I'd like a chrome extension that marks links that target text-
       | heavy vs "modern" so I know beforehand what to expect - paywall,
       | ads, popups, clickbaits, etc.
        
       | kebos wrote:
       | This is really cool, it filters out all fluff.
       | 
       | It's not always taking me to totally relevant sites but the
       | results contain my favourite type of content.
       | 
       | Full of _writing_ and pure html - usually the hallmark of someone
       | who knows what they are doing, wants to communicate but doesn't
       | want to waste their time.
        
       | SimplGy wrote:
       | Searched for "chocolate chip cookie recipe"
       | 
       | First result had a recipe where I could see both recipe and
       | directions on a single page, no ads, no scrolling, no fake seo
       | anecdotes about kids and grandmas.
       | 
       | (Pls make the search query box fit small mobile devices)
       | 
       | Great project idea!
        
       ___________________________________________________________________
       (page generated 2021-09-17 23:01 UTC)