hngopher.com

       [HN Gopher] A search engine that favors text-heavy sites and pun...
       ___________________________________________________________________
        
       A search engine that favors text-heavy sites and punishes modern
       web design
        
       Author : Funes-
       Score  : 1765 points
       Date   : 2021-09-16 12:16 UTC (10 hours ago)
        
 (HTM) web link (search.marginalia.nu)
        
       | ryankrage77 wrote:
       | The results are fantastic, but I can't see how the excerpts
       | relate to the search term.
       | 
       | For example, a search term for 'Scotichronicon' returns some
       | fascinating results, but the search term itself doesn't appear in
       | the title or excerpts of most of the results.
       | 
       | This makes it harder to judge how relevant they are.
        
         | marginalia_nu wrote:
         | The excerpts are static and very best effort. You just have to
         | visit the website and find out I'm afraid.
         | 
         | I can do a lot with what I have, but I can't do full text
         | search on millions of documents with dynamic excerpts off a
         | single computer in my living room.
        
       | snakeboy wrote:
       | Wow, that's awesome. Great work!
       | 
       | For a simple test, I searched "fall of the roman empire". In your
       | search engine, I got wikipedia, followed by academic talks,
       | chapters of books, and long-form blogs. All extremely useful
       | resources.
       | 
       | When I search on google, I get wikipedia, followed by a listicle
       | "8 Reasons Why Rome Fell", then the imdb page for a movie by the
       | same name, and then two Amazon book links, which are totally
       | useless.
        
         | adventured wrote:
         | I did a search for "George Washington"
         | 
         | First result after Wikipedia:
         | 
         | "Radiophone Transmitter on the U.S.S. George Washington (1920)
         | 
         | In 1906, Reginald Fessenden contracted with General Electric to
         | build the first alternator transmitter. G.E. continued to
         | perfect alternator transmitter design, and at the time of this
         | report, the Navy was operating one of G.E.'s 200 kilowatt
         | alternators http://earlyradiohistory.us/1919wsh.htm "
         | 
         | Another result in the first few:
         | 
         | " - VANDERBILT, GEORGE WASHINGTON
         | 
         | PH: (800) ###-#233 FX: (#03) 641-5###.
         | https://www.ScottWinslow.com/manufacturer/VANDERBILT_GEORGE_...
         | "
         | 
         | And just below that terrible result:
         | 
         | "I Looked and I Listened -- George Washington Hill extract
         | (1954)
         | 
         | Although the events described in this account are undated, they
         | appear to have occurred in late 1928. I Looked and I Listened,
         | Ben Gross, 1954, pages 104-105: Programs such as these called
         | for the expenditure of larger sums than NBC had anticipated. It
         | be http://earlyradiohistory.us/1954ayl2.htm "
         | 
         | Dramatically worse than Google.
         | 
         | ---
         | 
         | Ok, how about a search for "Rome" then? Surely it'll pull some
         | great text results for the city or the ancient empire.
         | 
         | First result after Wikipedia:
         | 
         | "Home | Rome Daily Sentinel
         | 
         | Reliable Community News for Oneida, Madison and Lewis County
         | http://romesentinel.com/"
         | 
         | The fourth result for searching "Rome":
         | 
         | "Glenn's Pens - Stores of Note
         | 
         | Glenn's Pens, web site about pens, inks, stores, companies -
         | the pleasure of owning and using a pen of choice. Direcdtory of
         | pen stores in Europe.
         | http://www.marcuslink.com/pens/storesofnote/roma.html"
         | 
         | Again, dramatically worse than Google.
         | 
         | ---
         | 
         | Ok, how about if I search for "British"?
         | 
         | First result after Wikipedia:
         | 
         | "BRITISH MINING DATABASE
         | 
         | British_Mining_Database
         | http://www.users.globalnet.co.uk/~lizcolin/bmd.htm "
         | 
         | And after that:
         | 
         | "British Virgin Islands
         | 
         | Many of these photos were taken on board the Spirit of
         | Massachusetts. The sailing trip was organized by Toto Tours.
         | Images Copyright (c) Lowell Greenberg Home Up Spring Quail
         | Gardens Forest Home Lake Hodges Cape Falcon Cape Lookout,
         | Oregon Wahkeena
         | http://www.earthrenewal.org/british_virgin_islands2.htm"
         | 
         | Again, far off the mark and dramatically worse than Google.
         | 
         | I like the idea of Google having lots of search competition,
         | this isn't there yet (and I wouldn't expect it to be). I don't
         | think overhyping its results does it any favors.
        
           | burkaman wrote:
           | This is not a Google competitor, it's a different type of
           | search engine with different goals.
           | 
           | > If you are looking for fact, this is almost certainly the
           | wrong tool. If you are looking for serendipity, you're on the
           | right track. When was the last time you just stumbled onto
           | something interesting, by the way?
        
           | JasonFruit wrote:
           | Hobby project leads angry person to interesting and
           | unexpected material; angry person remains angry. Details at
           | six.
        
             | fouric wrote:
             | The project explicitly bills itself as a "search engine",
             | not an "interesting and unexpected material surfacer".
             | Moreover, projecting emotions like "angry" onto a comment
             | in order to discredit the content of the comment (hey! is
             | that an ad-hominem?) is just about exactly the opposite of
             | the discussions that the HN mods are trying to curate, and
             | the discussions that I like to see here.
        
               | withinboredom wrote:
               | In the early days of google, I found what I was looking
               | for on page 5+. On the way, I'd discover many interesting
               | things I didn't even know I was looking for, often
               | completely unrelated to what I was searching for.
        
               | kwertyoowiyop wrote:
               | And now Google hides that more than one page even exists,
               | as they populate their first page with buttons to ask
               | similar questions and go to the first page of THOSE
               | results.
        
               | kews wrote:
               | I miss those old days of even being permitted to go many
               | pages in.
        
               | allknowingfrog wrote:
               | If you click through to the About page, I think you'll
               | see that "interesting and unexpected material surfacer"
               | is a fairly apt description of the project.
        
             | adventured wrote:
             | > Hobby project leads angry person to interesting and
             | unexpected material; angry person remains angry.
             | 
             | Not angry in the least. I'm thrilled someone is working on
             | a search competitor to Google.
             | 
             | I understand you're attempting to dismiss my pointing out
             | the bad results by calling me angry though. You're focusing
             | your content on me personally, instead of what I pointed
             | out.
             | 
             | The parent was far overhyping the results in a way that was
             | very misleading (look, it's better than Google!). I tried
             | various searches, they were not great results. The parent
             | was very clearly implying something a lot better than that
             | by what they said. The product isn't close to being at that
             | level at this point, overhyping it to such an absurd degree
             | isn't reasonable or fair to the person that is working on
             | it.
             | 
             | I would specifically suggest people not compare it to
             | Google. Let it be its own thing, at least for a good while.
             | Google (Alphabet) is a trillion dollar company. Don't press
             | the expectations so far and stage it to compete with Google
             | at this point. I wouldn't even reference Google in relation
             | to this search engine, let it be its own thing and find its
             | own mindshare.
        
               | bityard wrote:
               | > I'm thrilled someone is working on a search competitor
               | to Google.
               | 
               | Except the author goes to quite some lengths to explain
               | that his search engine is not a competitor to Google, and
               | is in fact exactly the opposite of Google in many ways:
               | https://memex.marginalia.nu/projects/edge/about.gmi
        
           | kwhitefoot wrote:
           | What were you expecting to see for British? There must be
           | millions of pages containing that term. Anyway the first
           | screenful from Google is unadulterated crap, advertising
           | mixed with the usual trivia questions.
           | 
           | If you are going to claim something is wide of the mark then
           | you really ought to tell us at least roughly where the mark
           | is.
        
           | duckmysick wrote:
           | I checked the results of the same query and they seem fine.
           | Lots of speeches and articles about George Washington the US
           | president. There's even his beer recipe.
           | 
           | As for the results you linked, it's part of the zeitgeist to
           | list other entities sharing the same name. Sure, they could
           | use some subtle changes in ranking, but overall the returned
           | links satisfy my curiosity.
        
         | Nition wrote:
         | The Wikipedia link at the top is always given. It would maybe
         | be good to make it a little clearer that it's not one of the
         | true results.
        
         | klntsky wrote:
         | However, when searching for "haskell type inference algorithm"
         | I get completely useless results.
        
           | [deleted]
        
           | klntsky wrote:
           | Since it does not use synonyms, it looks like it is unable to
           | answer "how's that thing called"-queries.
        
           | burkaman wrote:
           | That query is too long apparently. But if you shorten to
           | "haskell type inference", I think it delivers on its promise:
           | 
           | > If you are looking for fact, this is almost certainly the
           | wrong tool. If you are looking for serendipity, you're on the
           | right track. When was the last time you just stumbled onto
           | something interesting, by the way?
        
             | marginalia_nu wrote:
             | The search engine doesn't do any type of re-ordering or
             | synonym stuff, it only tries to construct different N-grams
             | from the search query.
             | 
             | So if you, for example, compare "SDL tutorial" with "SDL
             | tutorials": on Google you'd get the same stuff; this search
             | engine, for better or worse, doesn't.
             | 
             | This is a design decision, for now anyway, mostly because
             | I'm incredibly annoyed when algorithms are second-guessing
             | me. On the other hand, it does mean you sometimes have to
             | try different searches to get relevant results.
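
A minimal sketch of the behaviour described above, not the search engine's actual code: it builds every contiguous word n-gram from a query and does no stemming or synonym expansion. The function name `query_ngrams`, the `max_n` limit, and the lowercasing step are assumptions made for illustration.

```python
from typing import List


def query_ngrams(query: str, max_n: int = 3) -> List[str]:
    """Return every contiguous word n-gram of the query, longest first.

    Hypothetical helper: no stemming, no synonyms, no re-ordering.
    Lowercasing is an assumption, not something the comment states.
    """
    words = query.lower().split()
    grams = []
    for n in range(min(max_n, len(words)), 0, -1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


print(query_ngrams("SDL tutorial"))   # ['sdl tutorial', 'sdl', 'tutorial']
print(query_ngrams("SDL tutorials"))  # ['sdl tutorials', 'sdl', 'tutorials']
```

Because the two queries share only the term "sdl", an index keyed on these exact strings would return different documents for each, which matches the "SDL tutorial" versus "SDL tutorials" difference described in the comment.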
        
               | ford_o wrote:
               | Maybe list the synonyms under the query, so it's easier to
               | try different formulations.
        
               | Razengan wrote:
               | It could simply become an option.
        
               | OneLeggedCat wrote:
               | Don't change it. It's good this way.
        
               | leephillips wrote:
               | I like this design decision. It pays you back for
               | choosing your search terms carefully.
        
           | LanceH wrote:
           | It would be nice if we could pipe search engines.
        
             | BenoitP wrote:
             | Definitely; We could create a meta search engine that
             | queries them all, in desktop application format.
             | 
             | Let's name it after a famous old scientist, and maybe add
             | the year to prove it's modern: Galileo 2021.
        
               | overkalix wrote:
               | ... is this Galileo 2021 a reference that I am not
               | understanding?
        
               | BenoitP wrote:
               | Yup, but so far no one got it.
               | 
               | There was such an app in the early 2000's, before Google
               | went mainstream, and Altavista-like engines were not
               | good: Copernic 2000.
               | 
               | I guess I'm officially old now.
        
               | tomerv wrote:
               | FWIW, I got the reference. Maybe I'm old too?
        
               | PaulHoule wrote:
               | Meta search engines leave a bad taste in everyone's mouth
               | because they've always failed. Here is why
               | 
               | https://en.wikipedia.org/wiki/Arrow%27s_impossibility_the
               | ore...
               | 
               | You can't combine a few different ranked lists and expect
               | to get results better than any of the original ranked
               | lists.
        
               | robrenaud wrote:
               | > You can't combine a few different ranked lists and
               | expect to get results better than any of the original
               | ranked lists.
               | 
               | I am skeptical of this application of the theorem. Here
               | is my proposal:
               | 
               | Take the top 10 Google and Bing results. If the top
               | result from Bing is in the top 10 from Google, display
               | Google results. If the top result from Bing is not in the
               | top 10 from Google, place it at the 10th position. You'd
               | have an algorithm that ties with Google, say 98% of the
               | time, beats it say, 1.2% of the time, and loses .8% of
               | the time.
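
The proposal above can be sketched in a few lines. This is only an illustration of the merge rule as the comment states it; `primary` and `secondary` are hypothetical stand-ins for ranked result lists from Google and Bing, and no real search API is called. The percentages in the comment are the commenter's guesses and are not modelled here.

```python
from typing import List


def merge_top10(primary: List[str], secondary: List[str]) -> List[str]:
    """Show the primary engine's top 10, unless the secondary engine's
    top hit is missing from it; in that case put that hit in 10th place."""
    top = primary[:10]
    if not secondary or secondary[0] in top:
        return top
    # Keep the first nine primary results and use the last slot
    # for the secondary engine's best result.
    return top[:9] + [secondary[0]]


google = [f"g{i}" for i in range(1, 11)]
print(merge_top10(google, ["g3", "b2"]))  # unchanged primary list
print(merge_top10(google, ["b1", "g1"]))  # 'b1' replaces the 10th result
```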
        
               | vikingerik wrote:
               | Right. Arrow's theorem just says it's impossible to do it
               | in _all_ cases. It's still quite possible to get an
               | improvement in a large proportion of cases, as you're
               | proposing.
        
               | random314 wrote:
               | Arrow's theorem simply doesn't apply here. We don't need
               | our personalized search results to satisfy the majority.
        
               | PaulHoule wrote:
               | But in both cases you face the problem of aggregating
               | preferences of many into one. In one case you are
               | combining personal preferences in the other case
               | aggregating 'preferences' expressed by search engines.
        
               | PaulHoule wrote:
               | I've had jobs tuning up the relevance of search engines
               | with methods like
               | 
               | https://ccc.inaoep.mx/~villasen/bib/AN%20OVERVIEW%20OF%20
               | EVA...
               | 
               | and the first conclusion is "something that you think
               | will improve relevance probably won't"; the TREC
               | conference went for about five years before making the
               | first real discovery
               | 
               | https://en.wikipedia.org/wiki/Okapi_BM25
               | 
               | It's true that Arrow's Theorem doesn't strictly apply,
               | but thinking about it makes it clear that the aggregation
               | problem is ill-defined and tricky. (e.g. note also that a
               | ranking function for full text search might have a range
               | of 0-1 but is not a meaningful number, like a probability
               | estimate that a document is relevant, but it just means
               | that a result with a higher score is likely to be more
               | relevant than one with a lower score.)
               | 
               | Another way to think about it is that for any given
               | feature architecture (say "bag of words") there is an
               | (unknown) ideal ranking function.
               | 
               | You might think that a real ranking function is the ideal
               | ranking function plus an error and that averaging several
               | ranking functions would keep the contribution of the
               | ideal ranking function and the errors would average out,
               | but actually the errors are correlated.
               | 
               | In the case of BM25 for instance, it turns out you have
               | to carefully tune between the biases of "long documents
               | get more hits because they have more words in them" and
               | "short documents rank higher because the document vectors
               | are spiky like the query vectors". Until BM25 there
               | wasn't a function that could be tuned up properly and
               | just averaging several bad functions doesn't solve the
               | real problem.
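
For readers who have not seen it, here is a small sketch of the standard Okapi BM25 formula, showing the length-normalisation parameter `b` that the comment refers to. The defaults `k1=1.2` and `b=0.75` are common textbook values, the IDF uses the smoothed "+1" variant so it never goes negative, and the toy documents are invented for illustration. It is not taken from any particular search engine.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenised document against the query with Okapi BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    # Fall back to 1.0 so an all-empty corpus does not divide by zero.
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # b=0 ignores document length; b=1 normalises by it fully.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


short_doc = ["rome", "fell"]
long_doc = ["rome"] + ["filler"] * 50
print(bm25_scores(["rome"], [short_doc, long_doc]))         # short doc scores higher
print(bm25_scores(["rome"], [short_doc, long_doc], b=0.0))  # identical scores
```

With `b=0.75` the short document wins on the same single hit; with `b=0` length is ignored and both score the same. Tuning `b` between those extremes is the trade-off between the two biases described above.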
        
               | gnramires wrote:
               | That's an invalid application of this theorem. (It
               | doesn't necessarily hold)
               | 
               | Suppose there's an unambiguous ranked preference by all
               | people among a set (webpages, ranking). Suppose one
               | search engine ranks correctly the top 5 results and
               | incorrectly the next 5 results, while another ranks
               | incorrectly the top 5 and correctly the next 5.
               | 
               | What can happen is that there may be no universally
               | preferred search engine (likely). In practice, as another
               | commenter noted, most users may also prefer a certain
               | combination of results (that's not difficult to imagine,
               | for example by combining the top independent results from
               | different engines).
        
               | Torwald wrote:
               | I need that with a simpler interface, so I call it after
               | a famous detective: Sherlock.
        
               | artificial wrote:
               | _a magic pop sound is faintly audible as a new side
               | project is appended to several lists_ Excellent, thank
               | you!
        
               | mkr-hn wrote:
               | Likely trademark collision with this:
               | https://www.galileo.usg.edu/
        
               | gmueckl wrote:
               | Not an app, but probably comes quite close in all other
               | respects: https://metager.org
        
         | foofoo4u wrote:
         | Good comparison. Reminds me of an analogy I like to make of
         | today's web, which is it feels like browsing through a magazine
         | store -- full of top 10s, shallow wow-factoids, and baity
         | material. I genuinely believe terrible results like this are
         | making society dumber.
        
           | bluGill wrote:
           | what I really want is a true AI to search through all that
           | and figure out the useful truth. I don't know how to do this
           | (and of course whoever writes the AI needs to be unbiased...)
        
           | rchaud wrote:
           | The context matters. I'd happily read "Top 10" lists on a
           | website if the site itself was dedicated to that one thing.
           | "Top 10 Prog Rock albums", while a lazy, SEO-bait title,
           | would at least be credible if it were on a music-oriented
           | website.
           | 
           | But no, these stories all come from cookie-cutter "new media"
           | blog sites, written by an anonymous content writer who's
           | repackaged Wikipedia/Discogs info into Buzzfeed-style copy
           | writing designed to get people to "share to Twitter/FB". No
           | passion, no expertise. Just eyeballs at any cost.
        
             | foofoo4u wrote:
             | This got me thinking that maybe one of the other big
             | reasons for this is that the algorithms prioritize newer
             | pages over older pages. This produces the problem where
             | instead of covering a topic and refining it over time, the
             | incentive is to repackage it over and over again.
             | 
             | It reminds me of an annoyance I have with the Kindle store.
             | If I wanted to find a book on, let's say, Psychology, there
             | is no option to find all-time respected books of the past
             | century. Amazon's algorithms constantly push to recommend
             | the latest hot book of the year. But I don't want that. A
             | year is not enough time to have society determine if the
             | material withstands time. I want something that has stood
             | the test of time and is recommended by reputable
             | institutions.
        
               | amenod wrote:
               | > is that the algorithms prioritize newer pages over
               | older pages.
               | 
               | They do? That would explain a lot - but ironically, I
               | can't find a good source on this. Do you have one at
               | hand?
        
               | dvogel wrote:
               | It is pretty obvious if you search for any old topic that
               | is also covered incessantly by the news. "royal family"
               | is a good example. There's no way those news stories
               | published an hour ago are listed first due to a high
               | PageRank score (which necessarily depends on time to
               | accumulate inbound links).
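
To illustrate the point about inbound links, here is a textbook power-iteration sketch of PageRank on a tiny invented link graph. It is not Google's production ranking; the page names, the damping factor of 0.85, and the iteration count are assumptions. Every link target is assumed to appear as a key in `links`.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Classic PageRank by power iteration over an adjacency list."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over every page.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


graph = {
    "old_article": ["hub"],
    "hub": ["old_article"],
    "blog": ["old_article"],
    "fresh_news": ["old_article"],  # just published: nobody links to it yet
}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph `fresh_news` can never rise above the baseline `(1 - damping) / n`, because nothing links to it. A story published an hour ago that ranks first must therefore be getting its position from some other signal, which is the inference drawn in the comment.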
        
               | RattleyCooper wrote:
               | It depends on the content. The flip side is looking up a
               | programming-related question and getting results from
               | 2012.
               | 
               | I think they take different things into account based on
               | the thing being searched.
        
               | rchaud wrote:
               | Your Google search results show the date on articles do
               | they not? If people are more likely to click on
               | "Celebrity Net Worth (2021)" than "Celebrity Net Worth
               | (2012)", then the algo will update to favour those
               | results, because people are clicking on them.
               | 
               | The only definitive source on this would be the
               | gatekeeper itself. But Google never says anything
               | explicitly, because they don't want people gaming search
               | rankings. Even though it happens anyway.
        
           | wwweston wrote:
           | It's also possible that it's the other way around: a certain
           | "common denominator" + algorithms that chase broad engagement
           | = mediocre results.
           | 
           | The real trick would be some kind of engine that can aim just
           | above where the user's at.
        
         | [deleted]
        
         | hdjjhhvvhga wrote:
         | As long as few people use it, it will be great. Rest assured
         | that the moment it becomes popular, the people who want to game
         | it will appear.
        
           | Nextgrid wrote:
           | I don't think the existing media-heavy websites are gaming
           | Google to rank higher. It's that Google itself prefers media
           | heavy content; they don't have to "game" anything.
           | 
           | I also think a search engine like this would be quite hard to
           | game. An ML-based classifier trained on thousands of text-
           | heavy and media-heavy screenshots should be quite robust and
           | I think would be very hard to evade, so the "game" will
           | become more about how to _identify_ the crawler so you can serve
           | it a high-ranking page while serving crap to the real users,
           | and it seems fairly easy to defeat if the search engine
           | does a second pass using residential proxies and standard
           | browser user agents to detect this behavior (it could also
           | threaten huge penalties like the entire domain being banned
           | for a month to even deter attempts at this).
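
The "second pass" idea can be sketched as a comparison between the text a site served to the crawler and the text it served to an ordinary-looking browser. This is a rough illustration only: fetching is left out, the two strings are assumed to have been retrieved already, and the function name `looks_cloaked` and the 0.5 threshold are arbitrary assumptions rather than anything from the thread.

```python
from difflib import SequenceMatcher


def looks_cloaked(crawler_view: str, browser_view: str,
                  threshold: float = 0.5) -> bool:
    """Flag a page whose crawler-facing and browser-facing text differ a lot.

    Both arguments are page text that was already fetched, once with the
    crawler's identity and once through a browser-like second pass.
    """
    a = crawler_view.split()
    b = browser_view.split()
    # ratio() is 1.0 for identical word sequences and 0.0 for disjoint ones.
    similarity = SequenceMatcher(None, a, b).ratio()
    return similarity < threshold


essay = "a long essay about the fall of the roman empire"
print(looks_cloaked(essay, essay))                             # False
print(looks_cloaked(essay, "buy cheap pills click here now"))  # True
```

A real check would need to tolerate legitimate differences between visits (rotating ads, timestamps, personalised sidebars), so the threshold and the comparison method would need tuning against real pages.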
        
             | fragmede wrote:
             | With the advances in machine-generated text that looks
             | accurate but isn't _quite_ (aka GPT-3), it seems like it
             | would be _easily_ gamed (given access to GPT-3). Even
             | without GPT-3, if the content being prioritized is mere
             | text, I'm sure that for a pile of money, I could generate
             | something that looks like Wikipedia, in the sense that it's
             | a giant pile of mostly text, but it would make zero sense
             | to a human reader. (Building an SEO farm to boost ranking
             | of not-Wikipedia is left as an exercise for the reader.)
        
           | jandrese wrote:
           | This sort of optimization is why simple recipes are typically
           | found at the end of a rambling pointless blog post now.
           | 
           | Still, the best way to break SEO is to have actual
           | competition in the search space. As long as SEO remains
           | focused on Google there is an opportunity for these companies
           | to thrive by evading SEO braindamage.
        
             | SerLava wrote:
             | That's not really for SEO, which favors readily accessible
             | information.
             | 
             | That's ads. When mobile users have to scroll past 10 ads,
             | they'll click on some of them and make the blog money.
        
             | ggggtez wrote:
             | I've noticed this pattern start to pop up elsewhere. I've
             | started to train my skimming skills, skipping a paragraph
             | or two at a time to get past the fluff.
             | 
             | Like an article about some current event will undoubtedly
             | begin with "when I was traveling ten years ago...".
        
             | zerd wrote:
             | It's also because that's a way of trying to copyright
             | protect recipes, which are normally not copyright
             | protected.
             | 
             | > "Mere listings of ingredients as in recipes, formulas,
             | compounds, or prescriptions are not subject to copyright
             | protection. However, when a recipe or formula is
             | accompanied by substantial literary expression in the form
             | of an explanation or directions, or when there is a
             | combination of recipes, as in a cookbook, there may be a
             | basis for copyright protection."
        
             | YeGoblynQueenne wrote:
             | >> This sort of optimization is why simple recipes are
             | typically found at the end of a rambling pointless blog
             | post now.
             | 
             | I continue to be curious about this kind of complaint. If
             | all you want is a recipe list, without any of the fluff,
             | why would you click on a link to a blog, rather than on a
             | link to a recipe aggregator?
             | 
             | Foodie blogs exist specifically for the people who want a
             | foodie discussion and not just an ingredients' list.
             | 
             | Is it because blogs tend to have better recipes overall? In
             | that case, isn't there a bit of entitlement involved in
             | asking that the author self-sacrificingly provides only the
             | information that you want, without taking care of their own
             | needs and wants, also?
        
               | Loughla wrote:
               | It's the same thing that people always complain about.
               | This thing is not in a format that I like, so it must be
               | not what anyone likes.
               | 
               | If you want JUST recipes, pay money instead of just
               | randomly googling around. America's test kitchen has a
               | billion, vetted, and really good recipes. That solves
               | that problem.
        
               | joegahona wrote:
               | I think the complaint is that those blogs rank higher
               | than nuts-and-bolts recipes now. It wasn't that way a few
               | years ago. Yes, scrolling down the results to Food
               | Network or Martha Stewart or whatever is possible, as is
               | going directly to those sites and using their site
               | search, but it's noticeable and annoying.
        
               | jandrese wrote:
               | Because when you search for a recipe you get the link to
               | the blog, not the aggregator.
        
             | WorldMaker wrote:
             | That sort of recipe blog hasn't happened just for SEO. It's
             | also a bit of a "two audiences" problem: if you are coming
             | to that food blogger from a search you certainly would
             | prefer the recipe first and then maybe any commentary on it
             | below if the recipe looks good. If you are a regular reader
             | of that food blogger you are probably invested in the
             | stories up top and that parasocial connection and the
             | recipes themselves are sometimes incidental to why you are
             | a regular reader.
             | 
             | You see some of that "two readers" divide sometimes even in
             | classic cookbooks, where "celebrity" chefs of the day might
             | spend much of a cookbook on a long rambling memoir.
             | Admittedly such books were generally well indexed and had
             | table of contents to jump right to the recipes or
             | particular recipes, but the concept of "long personal
             | ramble of what these recipes mean to me" is an old one in
             | cookbooks too.
        
               | giantrobot wrote:
               | > If you are a regular reader of that food blogger
               | 
               | I think this assumes facts not in evidence. It certainly
               | seems like an overwhelming number of "blogs" are not
               | actual blogs but SEO content farms. There's no regular
               | readers of such things because there's no actual authors,
               | just someone that took a job on Fiverr to spew out some
               | SEO garbage. Old content gets reposted almost verbatim
               | because new results rank better according to Google.
               | 
               | The only reason these "blogs" exist is to show ads and
               | hopefully get someone's e-mail (and implied consent) for
               | a marke....newsletter.
        
               | WorldMaker wrote:
               | I know at least a few that I commonly see in top search
               | results that I have friends that read them like
               | personalized soap operas where most of the drama revolves
               | around food and family and serving food to family.
               | 
               | It's at least half the business models of Food Network
               | shows: aspirational kitchens and the people that live in
               | them and also sometimes here's their recipes. (The other
               | half being competitions, obviously.) I've got friends
               | that could deliver entire doctoral theses on the Bon
               | Appetit Test Kitchen (and its many YouTube shows and
               | blogs) and the huge soap operatic drama of 2020's events
               | where the entire brand milkshake ducked itself; falling
               | into people's hearts as "feel good" entertainment early
               | in 2020/the pandemic and then exploding very dramatically
               | with revelations and betrayals that Fall.
               | 
               | Which isn't to say that there _aren't_ garbage SEO farms
               | out there in the food blogging space _as well_, but a
               | lot of the big ones people commonly complain about seeing
               | in google's results do have regular fans/audiences. (ETA:
               | And many of the smaller blogs _want_ to have regular
               | fans/audiences. It's an active influencer/"content creator"
               | space with relatively low barrier to entry that people
               | love. Everyone's family loves food, it's a part of the
               | human condition.)
        
               | run-types wrote:
               | I've basically never been taken to a recipe without a
               | rambling preamble from Google. While food blogs may serve
               | two audiences, a long introduction seems to be a
               | requirement to appear in the top Google search results.
        
               | WorldMaker wrote:
               | Personally, I think that has a lot more to do with the
               | fact that Google killed the Recipe Databases. There did
               | used to be a few startups that tried to be Recipe
               | Aggregators with advertising based business models, that
               | would show recipes and then link to source blogs and/or
               | cookbooks, and in the brief period where they existed
               | Google scraped them _entirely_ and showed entire recipes
               | on search results and ate their ad revenue out from under
               | them.
        
               | tomrod wrote:
               | That is a really bad thing by Google. Their core business
               | is not recipes.
        
               | kwertyoowiyop wrote:
               | Their core business is making money from other people's
               | content, no matter what it is.
        
               | WorldMaker wrote:
               | Their core business is advertising and they have always
               | been in a direct conflict-of-interest by competing with
               | content sites for ad revenue buys.
        
               | inanutshellus wrote:
               | I see your point, but argue you've misidentified the two
               | audiences.
               | 
               | One audience matches your description and is the invested
               | reader. They want _that_ blogger's storytelling. They
               | might make the recipe, but they're a dedicated reader.
               | 
               | The other audience is not the recipe-searcher, but
               | instead Google. Food bloggers know that recipe-searchers
               | are there to drop in, get an ingredient list, and move
               | on. They won't even remember the blog's name. So the site
               | isn't optimized for them. It's optimized for Google.
               | 
               | "Slow the parasitic recipe-searcher down. They're
               | leeches, here for a freebie. Well they'll pay me in
               | Google Rank time blocks."
        
             | xtracto wrote:
             | That's why I use Saffron [1], it magically converts those
             | sites into a page in my recipe book. I found it when the
             | developer commented here in HN. Also, a lot of cooking
             | websites have started to add a link with "jump to recipe"
             | functionality allowing you to skip all the crap.
             | 
             | [1] https://www.mysaffronapp.com/
        
               | Funes- wrote:
               | There's also https://based.cooking.
        
               | eigengrau5150 wrote:
               | Run by Luke Smith, an admitted neo-reactionary and
               | possible white supremacist who writes like a 4chan
               | reject.
        
           | the_other wrote:
           | If there were a wider variety of popular search engines, with
           | different ranking criteria, would sites begin to move away
           | from gaming the system? Surely it would be too hard to game
           | more than one search engine at a time?
        
             | Nasrudith wrote:
             | It would be a matter of numbers anyway about which they
             | optimize for. A/B testing is already in place and doesn't
             | care about where it comes from, just which one does better.
        
           | new_guy wrote:
           | > the people who want to game it will appear.
           | 
           | So just add human review to the mix, if a site is obviously
           | trying to game the system (listicles, seo spam etc) just drop
           | and ban them from the search index.
        
             | hdjjhhvvhga wrote:
             | Congratulations, you've just invented negative SEO.
        
         | eterevsky wrote:
         | Imagine if you were looking for the movie.
        
           | neltnerb wrote:
           | Imagine including the search term "movie".
        
             | yreg wrote:
             | That doesn't do anything useful.
        
           | lucideer wrote:
           | I tend to prefer Wikipedia for movies. The exception is actor
           | headshots if I'm trying to identify someone, which Wikipedia
           | lacks for licensing reasons, but otherwise Wikipedia tends to
           | be better than IMDB for most needs. Wikipedia has an IMDB
           | link on every article anyway.
           | 
           | Another need I guess might be reviews, for which RT or MC are
           | better than IMDB: not sure if either of those two will fare
           | better than IMDB in this search engine but again Wiki has
           | links out (in addition to good reception summaries)
        
             | mountainboy wrote:
             | For me, imdb was much better when they had user
             | comments/discussion.
             | 
             | I never even posted on it myself, but browsing the
             | discussions one could learn all sorts of trivia, inside
             | info, speculation, etc about each movie.
             | 
             | Since they (inexplicably) killed that feature, I rarely
             | even visit anymore. You're right, for many purposes wikipedia
             | is better, especially for TV series episode lists with
             | summaries.
        
               | ncphil wrote:
               | IMDB management thought it was their brilliant editorial
               | work that drew people to their site. Morons. It was the
               | comments all along. Of course they also believed they
               | could create gravity-free zones by sheer force of
               | executive will (and maybe still do).
        
           | shuntress wrote:
           | _?q=imdb.com:fall of the roman empire_
        
           | jazzyjackson wrote:
           | !imdb
        
           | MisterTea wrote:
           | Then you'd use a different search engine. Why does everything
           | have to be a Swiss Army knife?
        
             | zozbot234 wrote:
             | Or you could just search for 'rome movie'. Though for more
             | complex disambiguation you would need to resort to, e.g.
             | schema.org descriptions (which are supported by most search
             | engines, and the foundation for most "smart" search result
             | snippets).
        
         | hn_throwaway_99 wrote:
         | I had the exact opposite experience. I searched the site for
         | "java", got a Wikipedia link first (for the island, not the
         | programming language), and the 2nd result was to a random JEP
         | page, and all the rest of the results were random tidbits about
         | Java (e.g. "XZ compression algorithm in Java"). Didn't get any
         | high level results pointing to an overview of the language,
         | getting started guides, etc.
        
           | withinboredom wrote:
           | You need to use some old school search techniques and search
           | for "Java overview"
        
           | _wldu wrote:
           | I'm not sure that's a bad thing.
        
           | rovr138 wrote:
           | well, they're results to java related items...
           | 
           | What kind of links were you expecting to find?
        
         | SPBS wrote:
         | Cool, it appears that the trend towards JS may be causing self-
         | selection -- if a page has a high amount of JS, it is highly
         | unlikely to contain anything of value.
        
           | dv_dt wrote:
           | If one could create a metric of ad to content ratio from the
           | js used, I would guess that would be a nice differentiator
           | too.
        
           | sjtindell wrote:
           | True. Unfortunately many large corporate websites through
           | which you pay bills, order tickets, etc. are becoming
           | infested with JS widgets and bulky, slow interfaces. These
           | are hard to avoid.
        
       | zimpenfish wrote:
       | Searched for my initials - got back a bunch of raw binary results
       | (mp4, pdf, img, txz, etc.) which was disconcerting. Although it
       | did find one reference to actual-me which is better than Google
       | manages on the first 4 pages...
       | 
       | https://imgur.com/a/n2xro2Y
        
         | marginalia_nu wrote:
         | Yeah, there was unfortunately a problem with the content-type
         | code recently: it categorized some binary data as HTML and
         | tried to process it best-effort. So there's some binary soup
         | in the index.
         | 
         | The bug has since been fixed, but the fix won't come into
         | effect for a few weeks.
        
       | ape4 wrote:
       | Fast and doesn't crash when on the front page of Hacker News!
        
         | ricardo81 wrote:
         | That crossed my mind too, considering vanilla webpages
         | sometimes struggle with a top page HN thread, never mind a
         | search engine backend.
        
         | marginalia_nu wrote:
         | Well,... yet. Load average is at 1.2, not that bad. But the
         | services are getting a solid workout.
        
           | marginalia_nu wrote:
           | The real test is now: the index server is reconstructing its
           | index. It does this every 6 hours if there are new pages.
           | Takes half an hour or so usually.
           | 
           | It's supposed to be able to handle searches at the same time,
           | but jeepers, it's gonna have to chew through nearly 400 Gb of
           | data while dealing with over 1 request per second.
        
             | 0xbadcafebee wrote:
             | Is your site/code on GitHub? I would be happy to give
             | performance tips/tweaks. Also Fyi, https://marginalia.nu/
             | gives a certificate error (I know that's not the search
             | site)
        
       | criddell wrote:
       | Have you given any thought on what you will do if you get a DMCA
       | take down request or a request from a person asking you to remove
       | them from search results?
        
       | mrkramer wrote:
       | Awesome work! I had similar idea in mind but I'm glad to see
       | someone else was able to pull it off.
        
       | sailorganymede wrote:
       | Love this! Is there any way someone could help contribute to
       | this?
        
       | palijer wrote:
       | This has been needed in my life for a while. I am growing really
       | apathetic about the internet lately, but I realize that is
       | because my entry point is always a google search.
       | 
       | I miss finding blog posts and scholarly articles in long form. I
       | hate the SEO sites with unreadable UI because the information in
       | them is often a lot lower quality as well.
        
       | claytn wrote:
       | Searching for your own name will turn up some interesting
       | results! I got some early 90s webpages that just contain
       | obituaries or marriage records. I never knew cities maintained
       | these records online!
        
       | IlliOnato wrote:
       | Pretty cool. I am not sure yet how useful, but cool it is.
       | 
       | However, it seems that it currently does not support non-Latin
       | alphabets. Which I understand in an early version. Still, its
       | handling of such "exception cases" could be improved:
       | 
       | when I search for a Russian word, say "Akvarium", I get <<Search
       | "Akvarium" needs to be a word>>, which is rather rude...
        
         | foxfluff wrote:
         | "It also focuses on websites in English, Swedish and Latin and
         | tries to identify and ignore the rest (best-effort)."
         | 
         | https://news.ycombinator.com/item?id=28551183
        
       | beepbooptheory wrote:
       | need a duck duck go `bang!` for this
        
       | jordache wrote:
       | how about a search engine that bans all pinterest content.
       | 
       | I hate pinterest with a passion. I may need to get a "his" laptop
       | separate from my wife, since she needs that darn pinterest
       | extension for pinning photos.
        
       | mattowen_uk wrote:
       | Nice! Typing my name in, gets my own site back as 3 of the top 5
       | results. I suddenly feel important ;)
        
       | schmorptron wrote:
       | I've also found that brave search gets much better results than
       | google for some programming related topic, simply by not being
       | targeted by blogspam SEO as much. It's refreshing to not have to
       | click through 3 auto generated "articles" but to either a) get
       | the documentation straight away or b) find actually human written
       | blog entries.
        
       | arethuza wrote:
       | I did a quick check using the name of the Scottish village I am
       | originally from (or as I should say "far am fae") and this
       | produced a _much_ more interesting set of links for me than that
       | produced by Google
        
       | abhiminator wrote:
       | Absolute textgasm.
       | 
       | Wonder how 'text-only first' prioritization is being implemented,
       | algorithmically speaking?
        
         | ryankrage77 wrote:
         | This post - https://news.ycombinator.com/item?id=28551183 -
         | suggests it's a simple set of heuristics, looking for things
         | like javascript, link/SEO spam, language, amount of text
         | content, etc, filtering out unwanted results and only indexing
         | wanted ones.
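
The heuristics summarized above can be sketched roughly as follows. This is a minimal illustration, not the search engine's actual code: the function names (`page_features`, `should_index`) and every threshold are assumptions chosen for the example, and the regex-based tag stripping is a deliberate simplification of real HTML parsing.

```python
import re

# Hypothetical patterns; a real crawler would use a proper HTML parser.
SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
LINK_RE = re.compile(r"<a\b", re.IGNORECASE)


def page_features(html: str) -> dict:
    """Count script blocks, links and visible words in a page."""
    script_tags = len(SCRIPT_RE.findall(html))
    without_scripts = SCRIPT_RE.sub(" ", html)
    visible_text = TAG_RE.sub(" ", without_scripts)
    return {
        "script_tags": script_tags,
        "links": len(LINK_RE.findall(html)),
        "words": len(visible_text.split()),
    }


def should_index(html: str, min_words: int = 50, max_scripts: int = 3,
                 max_links_per_100_words: float = 20.0) -> bool:
    """Keep text-heavy pages; drop thin, script-heavy or link-stuffed ones.

    All thresholds are made-up defaults for illustration only.
    """
    features = page_features(html)
    if features["words"] < min_words:
        return False  # too little text
    if features["script_tags"] > max_scripts:
        return False  # too much javascript
    links_per_100 = 100 * features["links"] / features["words"]
    return links_per_100 <= max_links_per_100_words  # link-spam check
```

A page with a few hundred words and no scripts passes, while the same text wrapped in ten script blocks does not.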
        
           | abhiminator wrote:
           | Thank you!
        
       | michaelcampbell wrote:
       | (Old man yells at cloud.)
        
         | [deleted]
        
       | yrds96 wrote:
       | Damn, that's an interesting search engine. It's great for
       | searching simple terms and finding a bunch of blog articles
       | about the term.
        
       | turtlebits wrote:
       | Great for a text-focused site- however, the results are a bit
       | confusing. Would help if there were more details on the criteria
       | used for a site to be included in the index.
       | 
       | Suggestion - Use system fonts (the site downloads almost 300k of
       | fonts)
        
       | marcosdumay wrote:
       | I've got some results where the same site has more than 70% of
       | the links. It was a very on topic and high quality site, but
       | still, all the results shouldn't point to the same place.
       | 
       | I think some grouping by site (and capping to only the few most
       | relevant links there) would improve the engine.
        
       | aabajian wrote:
       | Gotta say, sometimes the results really are nice. I searched for
       | "Land Cruiser 70." The first result is a simple, short blog post
       | about a couple who traveled across Europe and Asia in their Troop
       | Carrier (http://www.destoop.com/trip/1%20PREPARATION/2%20Vehicle%
       | 20sp...).
       | 
       | The first result on Google is an Australian site for buying an
       | LC70 (news-flash, I can't buy one in the USA). There is also a
       | MotorTrend article about the LC70...also irrelevant since it's
       | only sold in Australia.
        
       | dzhiurgis wrote:
       | Search for playwright waitForSelector and you land in pretty
       | useless page. I'm all in for text websites, but something like
       | playwright.dev documentation is top notch - fuzzy search being
       | key thing.
        
         | marginalia_nu wrote:
         | Yeah I wasn't really planning for this to blow up like it did
         | today. It's currently sitting at about 35% of the index size I
         | usually aim for, so besides the stuff I can't index because
         | it's behind CDNs, there's a lot of pages it just hasn't gotten
         | to yet. playwright.dev is pretty low on the priority list
         | because it has a metric crap-ton of javascript on its front
         | page. The crawler has visited it, looked at it, and put it very
         | far down the priority queue.
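
A crawl queue that pushes javascript-heavy sites to the back, as described above, might be sketched like this. The `CrawlQueue` class and the use of a raw script count as the priority are assumptions for illustration; the comment does not say how the real crawler computes its priorities.

```python
import heapq
import itertools


class CrawlQueue:
    """Min-heap of sites: fewer scripts on the front page means crawled sooner."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, url, script_count):
        # Hypothetical priority: the number of scripts seen on the front page.
        heapq.heappush(self._heap, (script_count, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

With this ordering a site is never rejected outright; it simply waits until lighter pages have been crawled.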
        
       | arnaudsm wrote:
       | I wish we could configure Google's algorithm to our needs, and
       | blacklist websites.
        
         | MarioMan wrote:
         | It could get tedious depending on how many sites you want to
         | block, but you can add "-site:google.com" to exclude
         | google.com, for instance.
        
           | arnaudsm wrote:
           | I mean a blacklist system like Twitter's, where you block a
           | website forever. Pinterest would be the first to go.
        
       | marto1 wrote:
       | super! How far along would you say you are in indexing the
       | blogosphere? I tried the engine a few times, but I mostly get
       | academic papers and I know most (good) blogs are in fact
       | text-heavy.
        
       | phreack wrote:
       | One nitpick that kind of bothered me - on a large desktop
       | monitor, the results page was like 70% whitespace margins with
       | the results squished in the middle like a portrait cellphone.
       | Hopefully it's easy to fix, I like to research at home and this
       | website could help a lot!
        
       | yakubin wrote:
       | Wikipedia links point to <https://encyclopedia.marginalia.nu/>
       | instead, which to my eyes is less readable. The justified text,
       | done with CSS, instead of the LaTeX algorithm, looks wild. The
       | font used for quotations is even worse (very thin).
       | 
       | Wikipedia is perfectly usable without JavaScript and it's one of
       | the nicest sites out there typography-wise, so I'd reconsider
       | this redirection.
        
         | marginalia_nu wrote:
         | I guess it's a matter of taste. I can barely read anything on
         | regular wikipedia because the inline links disrupt my flow.
        
       | wolpoli wrote:
       | I wish niche search engines had an option to group results by
       | domain names. There are a few major sites that dominate Google
       | search results with low effort content. As long as Google stands
       | as the largest search engine, it's unlikely that these major
       | sites will want to rearchitect themselves into different domain
       | names.
        
       | gjm11 wrote:
       | I tried a few searches.
       | 
       | <<javascript pipe syntax>>: none of the search results appeared
       | to have anything to do with Javascript pipe syntax. (Which
       | doesn't exist yet, but it's under discussion.) Google gives a
       | bunch of highly-relevant results.
       | 
       | <<hans reichenbach relativity>>: first result is a list of books
       | about relativity, one of which is Reichenbach's "Philosophy of
       | space and time"; good, but there's no real _information_ there.
       | Second is about Reichenbach but nothing to do with relativity or
       | even, really, philosophy of science. Third is about philosophy of
       | science and mentions some of Reichenbach's work but not related
       | to relativity. Fourth mentions Reichenbach's "Philosophy of space
       | and time" as part of a list of books relevant to a seminar on
       | "time and eternity". None of this is _bad_, but it's not great
       | either. Google gives a couple of online philosophy encyclopaedia
       | entries, then a journal article on "Hans Reichenbach's relativity
       | of geometry", then the Wikipedia article on Reichenbach ... much
       | more informative.
       | 
       | <<luna lovegood actress>>: I thought this would be an easy one.
       | It was easy for Google, which gave me her name in large friendly
       | letters at the top, then her IMDB entry, and a bunch of other
       | relevant things. Literally nothing in the Marginalia results was
       | relevant to the query.
       | 
       | I guess maybe popular culture is just too monetizable, so no one
       | is going to write about it on the sites that Marginalia crawls?
       | Let's try some slightly less popular culture.
       | 
       | <<wilde "a handbag">>: First result is kinda-relevant but weird:
       | it's about a musical adaptation of _The Importance of Being
       | Earnest_. It doesn't mention that famous line from the play, but
       | one of the numbers in the musical has the words "a handbag" in
       | the title. Second result is a review of a CD of musicals,
       | including the same work. Third is a bunch of short reviews of
       | theatrical items from the Buxton Festival Fringe, one of which is
       | a three-man adaptation of TIOBE. Next four are 100% irrelevant.
       | Next is a list of names of plays. Last one is actually relevant;
       | it's an article about "Lady Bracknell through the decades".
       | Google puts that one first (after, sigh, a bunch of YouTube
       | videos which look as if they might actually be relevant).
       | 
       | I really like the _idea_ of this, and many of the things it turns
       | up look like they might be interesting, but it isn't doing very
       | well at producing results that are actually relevant to the thing
       | being searched for.
        
         | exikyut wrote:
         | TIL about https://github.com/tc39/proposal-pipeline-operator,
         | which I am immediately looking forward to playing with once it
         | gains traction Some Time From Now(tm)
         | 
         | (I have no earnest reason to transpile)
        
           | seph-reed wrote:
           | Yeah, this seems pretty nice. I don't think the "deep
           | nesting" issue is quite so realistic... I very rarely have a
           | logic tree that's easier to identify by its leaves than its
           | root. And I'd really hate to have code where you have to
           | scroll to the end of a bunch of pipes to figure out what
           | they're adding up to
           | 
           | But I have plenty of single use "temp variables" and cutting
           | those out could be cool.
        
         | silent_cal wrote:
         | To be fair, those searches are pretty weird.
        
           | JxLS-cpgbe0 wrote:
           | We don't _search_ for things because they're easy to find
        
           | HaloZero wrote:
           | The pop culture one is fairly common. Me and my wife both
           | search "who the fuck is that" in that TV show movie all the
           | time. Or who is the author of X book?
        
             | cormacrelf wrote:
             | It's trying to surface long articles and you're asking it
             | for a one word answer. What did you expect? A long article
             | consisting of "Emma Stone played Cruella" repeated 800
             | times?
        
             | gibspaulding wrote:
             | I think perhaps the usefulness here is less finding what
             | you're looking for, but rather finding something
             | interesting.
        
       | exporectomy wrote:
       | Wow. I tested it on recipes which Google has destroyed and this
       | was the first result, a simple clear recipe:
       | 
       | http://demont.myds.me/leerecipes/mainmeals/mainmeals1/chicke...
       | 
       | compared to Google's endless drivel of "This chicken stir fry
       | recipe will become a staple in your home. It's so quick to make
       | and you can use whatever vegetables you have on hand. It tastes
       | wonderful regardless of how you alter the ingredients. ... " JUST
       | SHUT UP AND GIVE ME THE RECIPE!
       | 
       | https://natashaskitchen.com/chicken-stir-fry-recipe/
        
         | srcreigh wrote:
         | Fwiw this website, Natasha's kitchen, seems to be one of the
         | more performant ad filled recipe websites.
         | 
         | Make use of the "Jump to recipe" button to get to the recipe
         | faster.
        
         | 1f60c wrote:
         | This isn't entirely Google's fault. Recipes on their own aren't
         | copyrighted in the US, and adding this fluff text is a way
         | around that.
        
           | AdamN wrote:
           | Yes it is. They're giving a lower quality result to the user
           | (their customer ... but not for long if competitors can get
           | just a little bit better)
        
           | mdoms wrote:
           | It is entirely Google's fault. Google knows people don't want
           | this dreck (everyone knows it) but still serves it up.
        
       | b-x wrote:
       | Too bad it rejects non-Latin words, as if the definition of
       | "text" is a sequence of alphabetical letters originated from
       | Latin.
       | 
       | I thought that we've reached the time to embrace all cultures in
       | the world, but this retrogressive engine proves that most _modern
       | tech_ designers are myopic about other civilizations in the
       | globe.
        
         | fghfghfghfghfgh wrote:
         | It's one guy. Making a useful tool. It even has an altruistic
         | purpose.
         | 
         | Shame on you for twisting a well intended effort into a
         | negative statement that suits your narrow identity political
         | world view.
        
           | b-x wrote:
           | Less insulting error messages would be more welcome than
           | casting out others without any consideration.
           | 
           | You may distribute "shame" however you want, but this only
           | helps enforcing the damaging insults and amplifying them.
        
         | hombre_fatal wrote:
         | No, it just proves that a one-man hobby project with finite
         | resources found it reasonable to restrict the scope.
         | 
         | Maybe when they find out they're an immortal billionaire they
         | can build all the additional things you've entitled yourself to
         | expect from the freely shared work of others.
        
         | marginalia_nu wrote:
         | Understand that this is something I built for myself, by
         | myself, so it focuses on languages I understand. It's hosted on a
         | single consumer grade computer in my living room. I built it
         | out of pocket and anyone is free to use it. Does this make me a
         | villain?
         | 
         | If I can do this, what's preventing some guy in Japan or India
         | or Peru from doing the same, of course focusing on their
         | languages?
        
           | b-x wrote:
           | Maybe a better suited choice for errors than an insulting
           | message: when I provided a query in my native language it
           | regurgitated the error "needs to be a word" instead of more
           | acceptable "not a supported language".
           | 
           | When you claim that a word in some other culture is not "a
           | word", just because it's not recognized by your machine,
           | that's demeaning to say the least.
        
             | marginalia_nu wrote:
             | Again it's a one man hobby project, I don't have a team of
             | people to go through every formulation and every error
             | message to ensure nobody can read them in a way that
             | offends them. It's just me, writing code on an unfinished
             | project that HN discovered.
             | 
             | In this case, the code doesn't match the word regexp, like
             | it may be a @TwitterHandle or a "comp.lang.c" with periods
             | in it, or an unsupported Unicode range. It doesn't know why
             | it is not matching, just that it doesn't.
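
One hypothetical way to recover a reason after the word regexp fails is sketched below. The pattern, the accepted character ranges and the `explain_rejection` helper are all assumptions made for this example and are not the search engine's real regexp; the first message borrows the wording quoted later in the thread.

```python
import re

# Assumed definition of a "word": ASCII letters and digits plus Latin-1
# accented letters (enough for Swedish), optionally joined by ' or -.
LETTER = r"A-Za-z0-9À-ÖØ-öø-ÿ"
WORD_RE = re.compile(rf"[{LETTER}]+(?:['-][{LETTER}]+)*")
LETTER_RE = re.compile(rf"[{LETTER}]")


def explain_rejection(term):
    """Return None if the term is accepted, else a human-readable reason."""
    if WORD_RE.fullmatch(term):
        return None
    # Letters outside the assumed ranges, e.g. Cyrillic or CJK.
    if any(ch.isalpha() and not LETTER_RE.fullmatch(ch) for ch in term):
        return f'The term "{term}" contains characters that are not currently supported'
    # Otherwise something like a leading @, embedded periods, or an empty term.
    return f'The term "{term}" contains punctuation that is not currently supported'
```

The second check only runs after the match fails, so the fast path for ordinary queries stays a single regexp test.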
        
               | b-x wrote:
               | I must congratulate you on this achievement. That's
               | certainly a useful take on search.
               | 
               | Nonetheless, even when coding, one should also consider
               | thoroughly the UX and how it would be addressing the
               | others.
               | 
               | Saying "unsupported word" is much more sympathetic than
               | "needs to be a word" (where you define what a "word" is,
               | and the general user is unaware of such definition).
        
               | marginalia_nu wrote:
               | Fair point, I refined the phrasing a bit.
               | 
               | > The term "" contains characters that are not currently
               | supported
        
       | leavenotracks wrote:
       | Really impressed with the results I'm seeing so far. In all
       | searches I have done so far, the results are truly lightweight,
       | and haven't had to click through any modals, subscription pop-ups
       | or any other junk thus far! Will be using more in the days to
       | come.
        
       | ncfausti wrote:
       | This is incredible. I just got goosebumps as I stumbled upon
       | https://solitaryroad.com after searching for "linear algebra
       | homomorphism". It reminds me of the magical feelings of the early
       | Internet. Keep up the great work!
        
       | pajko wrote:
       | Does it filter out ad-heavy copy-paste/autogenerated fake sites?
       | Tired of seeing those on the first few pages of Google. Bing gets
       | more and more usable, but far from perfect.
        
         | marginalia_nu wrote:
         | It tries.
        
       | titzer wrote:
       | I think I want a BBS. Text mode, fixed width font, keyboard-
       | driven menus, no (or very little) bitmapped graphics. I've been
       | thinking about the UIs for a lot of sites that I use to "do
       | things" on the web. E.g. search for flights. Do I need _any_ of
       | that "beautiful" web design with pretty forms and fonts,
       | bevelled edges, drop shadows, drop-down menus, hovers? Hell, do I
       | even need a map? Heck no, I need three text entry fields and
       | output a bulleted list, maybe table of results. Just give me the
       | raw data and do as little presentation as possible, thanks.
       | 
       | I really think I want an internet console, not an animated
       | magazine.
        
       | javajosh wrote:
       | This is really good; I'll actually use it!
        
       | eitland wrote:
       | Tested with the first person to settle in Iceland:
       | https://search.marginalia.nu/search?query=Ingolf+Arnarson
       | 
       | and it worked surprisingly well.
       | 
       | Does anyone else have good examples?
        
       | abdullahkhalids wrote:
       | Can we submit text-heavy sites for possible inclusion? Assuming
       | they pass your filters.
        
       | sealthedeal wrote:
       | lol this is great, reminds me of the old school search engines we
       | would use in school back in the day before Google haha.
        
       | pomian wrote:
       | Congratulations. Truly impressive search results. I tried two
       | one-word searches. The results were interesting, useful, and
       | would have been impossible (well, really really hard) to find on
       | standard search engines. Plus, no garbage, ads, recommendations,
       | etc etc. As another commenter suggested, it is what World Wide
       | Web search results were like twenty years ago!
        
         | pomian wrote:
         | PS. I added Marginalia as a search option (even the default for
         | now) in Firefox Nightly (on Android). In case others want to,
         | under settings for search, you can add other, then name, and
         | then: https://search.marginalia.nu/search?query=%s
        
           | freddref wrote:
           | After a good amount of searching it doesn't seem possible to
           | add Marginalia as default search in firefox (84.0b8) on
           | Debian.
           | 
           | I did not expect this to not be available.
        
       | lumost wrote:
       | This is a fascinating tool. I estimated that the corpus of the
       | factual web was between 1 and 10 TB when I last played around
       | with BigQuery using domain names which had low amounts of click
       | bait. Seeing these search results I suspect my estimate was off
       | by a couple orders of magnitude.
       | 
       | Although a search for "Fractional Reserve Banking" shows that
       | some further ranking improvements can be made to exclude
       | unrelated results, and potentially penalize old conspiracy sites.
       | 
       | https://search.marginalia.nu/search?query=fractional+reserve...
        
       | rchaud wrote:
       | Is it fair to assume that text-heavy sites that are inactive (but
       | still online) don't have SSL?
       | 
       | If so, would you ever tweak the parameters to surface sites that
       | aren't served with "HTTPS"?
        
       | asjdflakjsdf wrote:
       | You should monetise this with amazon affiliate links that are
       | relevant to each search. And then use that money to keep this
       | project going. Google is fantastic, but it has become something
       | different from what it was, the company and the product. It is so
       | refreshing to see a modern tool that encourages exploration of
       | the actual world wide web.
        
         | Funes- wrote:
         | That would be an absolutely awful decision.
        
         | marginalia_nu wrote:
         | I might add a donate button or something if people want to help
         | support the project, hardware isn't cheap and all. But I have a
         | job and decent income. I think if this search engine became the
         | way I earned money, it would influence the project in a bad
         | way, and corrupt its purpose, which is to help people explore
         | the less-commercial internet.
        
           | eigenhombre wrote:
           | +1 for a donate button; much preferred over affiliate links
           | or ads of any kind. Thank you for making this beautiful
           | little(?) product!
        
           | kews wrote:
           | Appreciated. The more things fill up with monetizing shit,
           | the more I stay away. There's something beautiful in having
           | higher purposes than grubbing for cash.
        
           | RistrettoMike wrote:
           | I'd donate to continued expansion/development of something
           | like this. Where is somewhere good to follow you for any
           | thoughts/updates?
        
             | marginalia_nu wrote:
             | I have something of a blog here, with an Atom feed.
             | 
             | https://memex.marginalia.nu/log/
             | 
             | It's not very well optimized for mobile, really it's more
             | of a bridge for my geminispace content.
        
       | streamofdigits wrote:
       | Let a thousand search engines bloom.
       | 
       | btw, interesting how many http (as opposed to https) sites show
       | up...
        
       | pjs_ wrote:
       | This kicks ass!!
        
       | 0xbadcafebee wrote:
       | I love it. Even though it didn't give me the results I was
       | looking for. I searched "new york fishing license", and it didn't
       | give me any links to the actual new york fishing license
       | websites. But it did give me a ton of really cute little websites
       | related to lakes and fishing in New York. This one has _amazing_
       | information about fishing all over Western New York:
       | http://www.huntfishnyoutdoors.com/fishing.php
        
       | kews wrote:
       | There's probably a more suitable term than "modern" that we
       | should generally be using, since "modern" consistently has a
       | positive connotation.
        
         | marginalia_nu wrote:
         | Dunno, I prefer to use neutral or positive terminology even
         | when I talk about things I don't like. I think it very easily
         | comes off as juvenile ranting when you start throwing around
         | terms with strong negative connotations.
        
       | mumblemumble wrote:
       | I like it.
       | 
       | Coincidentally, the other day I was daydreaming about a search
       | engine that favors sites that are updated less frequently. The
       | thought being, the kinds of labors of love that characterized the
       | 1990s Web that I still sometimes miss are still out there, it's
       | just harder to find them amidst the flood of SEO dreck. So
       | perhaps they could be made discoverable again with the help of a
       | contrarian search engine that specifically looks for the kinds of
       | things that Google and Bing _don't_ like to see.
        
         | gibspaulding wrote:
         | Million Short [1] offers an option to omit results from popular
         | domains. It's a different approach from what you describe, but
         | I think the goal is similar.
         | 
         | [1] https://millionshort.com/
        
         | itzworm wrote:
         | I had this problem recently trying to fix an Atari. There's a
         | guy out there who has tons of guides on doing video out mods,
         | but the newer guide references the older one. However,
         | googling the OG guide didn't find it, so I manually scoured his
         | old web page.
        
         | capableweb wrote:
         | Similarly, I wish there was a recommendation engine (for web,
         | music, movies, whatever) that can show you what is the furthest
         | away from your existing tastes. I've learned to re-create my
         | Spotify account once every 6 months or so, as their
         | recommendation engine becomes a boring machine after using it
         | daily for some months.
         | 
         | I'd love to discover new content that is different from what I
         | read/watch/listen to now, but it's really hard to know about
         | genres you don't know about.
        
           | ebiester wrote:
           | It's hard, though. I simultaneously want something far from
           | my tastes, but I don't want to see Plandemic-style Ivermectin
           | material, or Focus On The Family-style material. I want
           | things that will push me out of my comfort zone sometimes,
           | but it turns out I really don't want the thing furthest from
           | my tastes; I want things marginally adjacent. I want them
           | close enough to feel familiarity, but far enough that it
           | challenges my worldview.
           | 
           | I don't think a recommendation engine can do that.
        
           | gverrilla wrote:
           | Doing that takes real work and curiosity. I'm afraid an
           | algorithm will never be able to do it, particularly if you're
           | into niche stuff. For instance, I really enjoy a Japanese
           | band called The Boredoms - but few people like it, and there
           | are only 2 of their albums available on Spotify.
        
         | potatoman22 wrote:
         | I like the idea. Tangentially, I wonder how one would find the
         | right 'penalty' for more updated sites?
        
       | timvisee wrote:
       | Cool!
       | 
       | There do seem to be some text encoding issues though. For
       | example: https://search.marginalia.nu/search?query=tim+visee
        
         | marginalia_nu wrote:
         | Yeah I think the charset detection needs work.
         | 
         | It understands the "Content-type: text/html;charset=utf-8"
         | -header, and <meta charset="UTF-8">
         | 
         | but not
         | 
         | <meta http-equiv="content-type" content="text/html;
         | charset=utf-8">
         | 
         | It turns out HTML has a lot of corner cases. I'm constantly
         | marveling at how web browsers hold together as well as they do.
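
The three declaration forms mentioned above can be handled by one small routine. The following is a minimal sketch, not the search engine's actual code: the names `charset_from_content_type`, `MetaCharsetParser`, and `detect_charset` are hypothetical, and the precedence (HTTP header first, then `<meta>` tags) is an assumption. It uses only the Python standard library.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lower-cased, or None when absent


class MetaCharsetParser(HTMLParser):
    """Remembers the first charset declared by a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = {k.lower(): (v or "") for k, v in attrs}
        if attrs.get("charset"):
            # Form: <meta charset="UTF-8">
            self.charset = attrs["charset"].strip().lower()
        elif attrs.get("http-equiv", "").lower() == "content-type":
            # Form: <meta http-equiv="content-type" content="text/html; charset=utf-8">
            self.charset = charset_from_content_type(attrs.get("content", ""))


def detect_charset(header_value, body_prefix):
    """Assumed precedence: HTTP header, then <meta> tags, else None."""
    if header_value:
        found = charset_from_content_type(header_value)
        if found:
            return found
    parser = MetaCharsetParser()
    # latin-1 maps every byte to a character, so this decode cannot fail;
    # the tag and attribute names we care about are plain ASCII.
    parser.feed(body_prefix.decode("latin-1"))
    return parser.charset
```

When nothing is declared, `detect_charset` returns `None` and the caller has to pick a fallback.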
        
           | timvisee wrote:
           | Thanks for your response! Hope you can implement this as well
           | without too much trouble.
           | 
           | I wonder if you could just assume UTF-8 to be the default
           | these days. I imagine that to fix many other cases as well.
        
             | marginalia_nu wrote:
             | Haha! I did actually assume UTF-8 at first, but since a
             | search engine indexes a lot of older websites, I sadly got
             | a lot of encoding errors doing that, too.
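
One common way to soften the "assume UTF-8" approach is to try it strictly and fall back to a legacy single-byte encoding only when decoding fails. This is a sketch under assumptions, not what the search engine does: the function name `decode_with_fallback` and the choice of windows-1252 as the fallback are both hypothetical.

```python
def decode_with_fallback(raw, declared=None, fallback="windows-1252"):
    """Try the declared charset, then strict UTF-8, then a legacy fallback.

    Returns (text, encoding_used).
    """
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: the declared name is not a codec Python knows.
            continue
    # Last resort: undecodable bytes become U+FFFD instead of raising.
    return raw.decode(fallback, errors="replace"), fallback
```

Strict UTF-8 decoding rejects most legacy single-byte text, which is why the failure can be used as a signal here; it will not distinguish between different legacy encodings.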
        
       | enduku wrote:
       | Fantastic project! Found very interesting links to a lot of
       | compiler related keywords. A similar service I found useful,
       | though different in its approach to cutting through e-commerce
       | and SEO-optimized websites, is Million Short [0].
       | 
       | [0] millionshort.com
        
       | oytis wrote:
       | Designed for serendipity indeed. Tried a few searches, results
       | are quite fun, but none of them relevant.
        
       | pjspycha wrote:
       | This is really refreshing work, and we can all benefit from other
       | search engines focused on improving the field. I tried a bunch of
       | searches and some of them were quite wonderful, others were a
       | little dry on results. But overall I enjoyed going through it.
       | Here are some critiques if you don't mind:
       | 
       | I did search for "Daria Bilodid" and the results were a bit
       | troublesome. First the Wikipedia result did not work:
       | https://en.wikipedia.org/wiki/Daria_Bilodid vs
       | https://encyclopedia.marginalia.nu/wiki/Daria_Bilodid
       | 
       | Secondly the results matched a few judoinside.com results which
       | is ok, including sites to her competitors, but seemed to miss the
       | judoinside website for her:
       | https://www.judoinside.com/judoka/92660/Daria_Bilodid.
       | 
       | The design is hard on my eyes. I have an average-size screen and
       | it's using less than half of the width. The line-height is
       | enormous and seems to break up the flow, making it uncomfortable
       | for me to read. The spacing around each result is the same as
       | between titles and paragraph items, which again was unpleasant to
       | read.
        
         | ASalazarMX wrote:
         | > Secondly the results matched a few judoinside.com results
         | which is ok, including sites to her competitors, but seemed to
         | miss the judoinside website for her:
         | https://www.judoinside.com/judoka/92660/Daria_Bilodid.
         | 
         | The title says this search engine punishes modern websites
         | (images, videos, MB of JS, I suppose), and this site looks
         | scarce on text and heavy on images, maybe that's confusing the
         | ranking.
         | 
         | I certainly find the results very refreshing, but you'll have
         | to complement with other search engines if they're not enough.
         | In fact, I think the days when we could use a single search
         | engine have already passed.
        
       | dukeofdoom wrote:
       | "corporate speak" bs detector and filter on google search engine
       | would be nice.
        
       | throwawaysea wrote:
       | Is it possible to also make a site that favors a diverse set of
       | information sources? For instance a lot of searches turn up
       | results from Pinterest or Wikipedia or Amazon or whatever else. I
       | wonder if there's room for a search engine that is all about
       | favoring a greater diversity of smaller sources, for those who
       | are less interested in staying within walled gardens.
        
       | michaelgrafl wrote:
       | I just looked up my last name and found a World class heavyweight
       | weightlifter named Josef Grafl born in 1872 who has an awesome
       | portrait of him on Wikipedia. Never before have I read about that
       | man.
       | 
       | I love this.
        
       | dumbfounder wrote:
       | Based on a few searches it seems to favor sites with very long
       | passages of text. Search for a name and you get pages with
       | massive lists of names. It quite simply isn't very good at
       | everyday searches. But it does bring up the point, shouldn't I be
       | able to tell my search engine I want results like this? It should
       | be a feature of google I can turn on and off. It should be one of
       | many ways to impact relevance.
        
       | runnerup wrote:
       | Seems like this is still very very hard! I searched for "hart
       | protocol" hoping to find this: http://www.romilly.co.uk/
        
       | mrpf1ster wrote:
       | I searched "c strtok" and got one result saying '"strtok" could
       | be spelled "stroke", "stork", "sarto", "strop"'.
       | 
       | Cool concept though!
        
         | marginalia_nu wrote:
         | The spelling suggestions are presented whenever there aren't
         | any results, but sometimes they can be pretty misleading.
         | 
         | What happens is that C, as a word, isn't indexed because it's
         | deemed too short, and the bigram "c strtok" can't be found
         | anywhere.
         | 
         | Try 'strtok' instead.
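
The failure described above can be reproduced with a toy inverted index. This is only an illustrative sketch: `MIN_WORD_LEN`, `build_index`, and `search` are hypothetical names, the length cutoff of 2 is an assumption (the real threshold is not stated), and the real index is certainly more involved.

```python
from collections import defaultdict

MIN_WORD_LEN = 2  # assumed cutoff; single-letter words such as "c" are skipped


def build_index(docs):
    """Map each sufficiently long word, and each adjacent word pair, to doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for word in words:
            if len(word) >= MIN_WORD_LEN:
                index[word].add(doc_id)
        for first, second in zip(words, words[1:]):
            index[first + " " + second].add(doc_id)
    return index


def search(index, query):
    """Try the two-word query as a bigram, then require every word to match."""
    words = query.lower().split()
    if len(words) == 2:
        phrase_hits = index.get(" ".join(words), set())
        if phrase_hits:
            return set(phrase_hits)
    result = None
    for word in words:
        postings = index.get(word, set())
        result = set(postings) if result is None else result & postings
    return result or set()


docs = {1: "strtok splits a string into tokens in c programs"}
index = build_index(docs)
print(search(index, "c strtok"))  # empty: "c" has no postings, bigram never occurs
print(search(index, "strtok"))    # the document is found
```

With no hits, an engine like this has nothing left to offer but spelling suggestions, which matches the behaviour reported above.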
        
       | llbeansandrice wrote:
       | Is there any way to add this as a favored search engine in the
       | browser?
       | 
       | I currently use google as it's set as the default search when I
       | type in the address bar but would love to switch and move
       | google/ddg to an added character like "<search terms> @g"
        
         | tomerv wrote:
         | All the major browsers support adding custom search engines.
         | You just need to specify the URL template to do the search. The
         | common format is to put "%s" as the search term. You can use it
         | for any site, not just things that are considered search
         | engines.
         | 
         | Firefox is a bit different, since you do it by adding a
         | bookmark, and giving that bookmark a keyword. The other
         | browsers I checked have an option under the search engine
         | settings.
         | 
         | After defining the custom search engine, you just type
         | "<keyword> <search term>" in the URL bar.
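
What the browser does with such a template is roughly the following. This is a sketch, not any browser's actual code: the function names and the keyword `m` are made up for illustration, while the template URL is the one quoted earlier in the thread.

```python
from urllib.parse import quote_plus

# Hypothetical keyword -> URL template table, as a user might configure it.
ENGINES = {
    "m": "https://search.marginalia.nu/search?query=%s",
}


def build_search_url(template, terms):
    """Substitute URL-encoded search terms for the %s placeholder."""
    return template.replace("%s", quote_plus(terms))


def resolve(bar_input, engines, default="m"):
    """Treat '<keyword> <search term>' as a keyword search, else use the default."""
    keyword, _, rest = bar_input.partition(" ")
    if keyword in engines and rest:
        return build_search_url(engines[keyword], rest)
    return build_search_url(engines[default], bar_input)


print(build_search_url(ENGINES["m"], "Ingolf Arnarson"))
# https://search.marginalia.nu/search?query=Ingolf+Arnarson
```

Because the placeholder is plain string substitution, the same mechanism works for any site that takes its query in the URL, not only search engines.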
        
           | freddref wrote:
           | Is there a way to set marginalia as default search in
           | firefox?
        
           | llbeansandrice wrote:
           | Ah I'm on FF so I'll have to do the weird bookmark method. A
           | little annoying since it supports other search engines.
        
       | ilrwbwrkhv wrote:
       | This is soooo good. I'm finally finding sites I haven't heard of
       | with good content.
       | 
       | I didn't realize how much I missed this stuff.
       | 
       | The popular web has become so bad nowadays.
        
       | snuser wrote:
       | Just searching for 'dogs' gave me more interesting results than
       | I've seen from google in years
        
       | prionassembly wrote:
       | The website itself seems generated with some kind of kick-ass
       | generator from template files (.gmi?)
       | 
       | I feel like I'm stuck with Wordpress.com because it brings me
       | _some_ traffic (whereas something hand-rolled on nsfspeech or
       | digital ocean or whatever would literally be off the edge of the
       | web), but the structure of that is so cool.
        
         | NhanH wrote:
         | That would be gemini protocol!
        
         | huijzer wrote:
         | You can easily do proper SEO with static site generators too.
         | Even more, static sites can be hosted via GitHub or GitLab
         | Pages, Netlify or CloudFlare, and the speed will outperform
         | Wordpress in almost all cases. Also, you have way more control
         | over the output than with Wordpress.
        
       | dzink wrote:
       | One use case to always test: "online wishlist" or "make a
       | wishlist". If you start seeing tools like
       | https://www.DreamList.com or others, you are on the right path.
       | If you start seeing random web pages linking to individual wish
       | lists, then people are likely not able to find tools on your
       | search engine.
        
       | platz wrote:
       | > Don't be afraid to scroll down in the search results
       | 
       | I never knew it was fear that was preventing me from scrolling
        
       | prewett wrote:
       | Kudos for taking on this project, and I like the idea! I think
       | it'll be a big project to take it to the next level, but would
       | love to have a search engine that's more useful.
       | 
       | Some reactions:
       | 
       | - The font is really big and the columns really narrow, so I get
       | 3 - 4 entries per page, something like 8 words per line, and huge
       | spacings between lines, which makes it a frustrating experience.
       | I've been using the recommendations in
       | https://practicaltypography.com/, which recommends 60 - 90
       | characters in a line I think, and line spacing of 120% - 140% (I
       | like 125%). The line lengths here might technically fall within
       | the lower bound, but it's really short, and for search results
       | I'm going to try scanning the text to see if there's something
       | relevant, so I think going on the long side is better here. At
       | least make the width somewhat variable so that I can shrink the
       | rather large font and fit more on the line.
       | 
       | - The results are eclectic, but I'm not sure it's usable at the
       | moment. "scala append list" did not get me much that's helpful,
       | while Google will usually at least put up some click-farming
       | tutorial that although minimal effort does tend to answer the
       | question. Both "mapo doufu recipe" and "ma po do fu recipe" had
       | very few recipes, although the latter did have one.
       | Unfortunately, recipe websites are some of the worst, with about
       | 10 pages of description, ads, pictures, what-have-you until the
       | recipe at the very bottom. "collection unmitigated pedantry" did
       | return the acoup.blog entry at the top, though.
       | 
       | Good luck on the project!
        
       | lbriner wrote:
       | My pet peeve with search results is simply that there are ancient
       | technical results that in many cases are irrelevant. If I am
       | searching for a Windows error message, I don't want some old forum
       | post from 2001, especially if it didn't have any answers!
       | 
       | What would be cool would be for people who host old stuff to
       | "archive" it at some point so it doesn't appear in normal
       | results, only if you tick "include archives".
        
         | athenot wrote:
         | As much as the release names for macOS over the years were
         | marketing gimmicks, it does make it a lot easier to zero in on
         | the correct version when doing these types of searches.
        
       | platz wrote:
       | modern design = low information density?
        
         | monkeybutton wrote:
         | Definitely low signal to noise. Looking at you, recipe websites
         | and cooking blogs.
        
       | dfdz wrote:
       | I like the concept, but it did not work on any of the search
       | phrases I entered consisting of the full title of a computer
       | science article or book.
       | 
       | It also does not work for subjects. For example, if you search
       | "discrete math" it links to academic webpages, but most of them
       | do not have any notes posted. It is just a plain text website
       | with the syllabus of a class.
        
       | cyral wrote:
       | I tried a few queries and got extremely irrelevant results
        
         | marginalia_nu wrote:
         | It really depends on what you search for. A major drawback is
         | that there need to be text-heavy sites to find, in order for
         | the search engine to find them.
         | 
         | Compare for example the results for "Duke Nukem 3D" with those
         | for "Cyberpunk 2077".
        
       | adriangrigore wrote:
       | "Punishes" is a little bit harsh. It's a cool search engine.
        
       | OneEyedRobot wrote:
       | Very cool. A person can really appreciate simple web design
       | looking at something like Luke Smith's recipe page.
       | 
       | So how on earth do you take an idea like this and scale it for
       | both broad web coverage and high traffic? For that matter, just
       | how much 'useful' text is there on the net?
        
       | jteppinette wrote:
       | This is awesome! We should definitely move in this direction.
        
       | xipho wrote:
       | Fascinating. I studied an "obscure" group of insects. My go-to
       | search term to test an engine is their family name as it is a
       | rarely used word and I know most (all?) of the major data sources
       | that have accumulated data on it. When Wolfram Alpha added
       | species names, I checked with the name, boring, Duck Duck,
       | boring, Google (well we know Google isn't for search anymore,
       | it's absolutely horrible) boring, Bing, boring... you get the
       | idea.
       | 
       | This was a little different, extremely few results, but a couple
       | of them really made me grin, and all(?) made me curious or raise
       | an eyebrow or reflect on who/what might have been the source of
       | the link, or remember some obscure connection from grad-school.
       | So, if anything, a crawled list of results worthy of pondering,
       | thanks for this!
        
         | trutannus wrote:
         | > well we know Google isn't for search anymore
         | 
         | Do you suggest anything better? As far as I can tell, all the
         | other search engines are either repackaged Bing (ie: DuckDuck),
         | or are just as bad.
        
           | ColinHayhurst wrote:
           | This needs an update but is an easy look see.
           | https://www.searchenginemap.com/
           | 
           | Broad and longer Twitter lists maintained here:
           | https://twitter.com/SearchEngineMap/lists
        
           | ricardo81 wrote:
           | Mojeek was built in the same spirit (one server living in a
           | house) and has 4.5 bn pages indexed now, and a bunch more
           | servers. A lot of people comment in similar style of it
           | reminding them of an older Internet, or generally less
           | branded results. It's definitely an alternative point of
           | view. Disclaimer: I work for them.
        
           | giancarlostoro wrote:
           | Not sure, but I remember when Google could find literally
           | anything. Then they started adding a bunch of exceptions and
           | crapped out their quality. I wonder how insanely different
           | results would be to get the older Google Engine from the
           | 2000s search result wise.
           | 
           | I now have to play games with Google to find things. I feel
           | like I do less than I used to for some reason.
        
             | bbarnett wrote:
             | The other day, I was searching for something, and google's
             | suggested, on-site answers took up 1/2 the first page. All
             | wrong.
             | 
             | The actual search results were another 1/4 page of
             | completely identical results, followed by google ad placed
             | search results.
             | 
             | I thought to myself, they've finally done it. Real
             | responses are no longer first page.
             | 
             | A lot of the cause for google getting crappy, is "ok
             | google", another "all platforms are the same" form of
             | sickness.
             | 
             | No, a desktop is not a phone. No, voice searching is not
             | the same as phone, or desktop.
        
               | foobarian wrote:
               | I was just thinking that they finally became Lycos. It's
               | what all the search engines except Google looked like
               | back in the early 2000s - ad laden cesspools of
               | irrelevant search results and other content. And it's why
               | we all switched to Google at the time.
        
               | habibur wrote:
               | It's time to disrupt the market. Google can't compete
               | with a newcomer that penalizes ads on pages.
        
               | eitland wrote:
               | Seriously, yes.
               | 
               | Moore's law means a modern day 2007-style Google should be
               | significantly less expensive to run now than back then.
               | 
               | Also the most relevant patents are now free to use.
               | 
               | 2021 Google is a sad story compared to 2007 Google and
               | I'd actually pay to get back 2007 Google - ads included -
               | meaning a double revenue source :-)
        
             | trutannus wrote:
             | You're absolutely correct, and a lot came from their
             | nerfing of search modifiers like + - "search term" and
             | whatnot. There's also a lot of ads and "PSA" type nonsense.
             | If I'm looking for anything COVID related for example, I
             | have to sift through a heap of PSA nonsense that's not even
             | related to my search query.
        
             | blowski wrote:
             | Wacky idea: instead of Google changing its algorithm every
             | couple of years, it could run 50 algorithms in parallel
             | leaving no way for sites to "optimise" for the current one.
        
               | vikingerik wrote:
               | The output of the parallelism is itself an algorithm,
               | that can and will be optimized for.
        
           | mda wrote:
           | IMHO, that is a trendy claim in HN with little evidence.
        
             | mhh__ wrote:
             | You're downvoted but in my experience I have never really
             | been burned by this Google-decline
        
               | Ygg2 wrote:
               | "I haven't seen a black swan, ergo it's not real."
               | 
               | I've been burned by this decline in the past.
               | 
               | From creepy results i.e. first suggestion before typing
               | was something I spoke near the Android and I never
               | searched for before; to not finding what I was searching
               | for before successfully, Google has started declining.
        
               | bbarnett wrote:
               | You are a lobster. (or frog, depending upon parable)
        
             | xipho wrote:
             | You want evidence? Search for a plumber/tradesperson in
             | your area THEN try to find rational discourse about your
             | options. There are literally 100s of results of websites
             | remixing a small set of data, presenting it to you, and
             | asking you to buy something to see more, when you _know_
             | there is nothing behind the scenes.
             | 
             | This type of engine would punish these sites, in theory,
             | and may turn up a discussion in some forum, newsgroup, etc.
             | that is actually relevant, or insightful.
        
               | krapp wrote:
               | > Search for a plumber/tradesperson in your area THEN try
               | to find rational discourse about your options.
               | 
               | I searched "plumber Austin TX" in Google and got a map
               | and list of company websites near me. There are a lot of
               | "top x y in z" list sites, but the top results were still
               | the most relevant. I don't know what "rational discourse"
               | I'm expected to find, though, or why I should assume the
               | discourse I would find through Google is less rational
               | than discourse I would find elsewhere.
               | 
               | I searched the same thing in OP and found nothing even
               | remotely significant. Not even anything related to
               | plumbing.
               | 
               | OP's project isn't optimized for relevance, it's
               | optimized for nostalgia - providing a filter that keeps
               | the modern web away and dropping quirky, interesting
               | breadcrumbs to distract you and remind you of what it was
               | like to wander around the web of the 90's.
               | 
               | Which is all well and good if that's what you want, and
               | judging from the comments it is what a lot of people here
               | want, but Google giving me a list of company names,
               | numbers, websites and a map showing their location by
               | distance is more useful, even if it uses "modern web
               | design" and javascript.
        
               | xipho wrote:
               | > I searched "plumber Austin TX" in Google and got a map
               | and list of company websites near me.
               | 
               | I think you could have done this historically in a Yellow
               | Pages phone book. My OP used "boring". A list of plumbers
               | is boring, been done on dead wood. I'm not saying boring
               | != !useful.
               | 
               | > There are a lot of "top x y in z" list sites
               | 
               | This is an understatement. I actually want to know the
               | top x in y, to do that I need "rational discourse".
               | Rational discourse is recognizable as well written,
               | insightful, humble, reflective, self-countering,
               | anecdotal etc. By "search is terrible" I mean with
               | respect to finding this.
               | 
               | > OP's project isn't optimized for relevance, it's
               | optimized for nostalgia
               | 
               | Nostalgia is highly relevant if it's on topic, but
               | agreeing with you as to what this engine is about.
        
               | krapp wrote:
               | >Rational discourse is recognizable as well written,
               | insightful, humble, reflective, self-countering,
               | anecdotal etc. By "search is terrible" I mean with
               | respect to finding this.
               | 
               | I believe a search engine that ignores results based on
               | superficial and aesthetic qualities like "modern web
               | design" would be even worse in that regard, unless you're
               | assuming no relevant discourse about any subject has
               | taken place on the web since the early 2000's.
               | 
               | I admit, I have no idea what heuristic you would actually
               | use to find "well written, insightful, humble,
               | reflective, self-countering, anecdotal etc" content, but
               | I've seen it on modern sites (even on Twitter,) and I've
               | seen a lot of garbage on old sites, so a simple text
               | search of only old websites doesn't seem like it.
               | 
               | It is fun, though.
        
             | Spivak wrote:
             | Google Search is a fantastic product because it's
             | essentially Spotlight for the web. It's by far the fastest
             | way to get to things you already vaguely know are there and
             | acts as a metasearch for large sites.
             | 
             | But as a result it's now less useful as a tool for scouring
             | the web.
        
         | mountain_peak wrote:
         | Likewise, I co-maintain the only "fan" site on one of my all-
         | time favourite composers/performers, and gave the engine a shot
         | with a unique string query. While my text-heavy WP-driven site
         | didn't seem to make the cut, the results were highly relevant
         | in that they were links to former band members and
         | collaborators - a couple of which I didn't realize existed.
         | That being said, there were a few sites (including my own) I
         | expected to be returned, but no dice. Still, a fascinating
         | experiment that many at HN have been clamouring for.
        
           | xipho wrote:
           | Exactly this. A couple results returned reference to obscure
           | now-defunct newsletters and clubs, people that I know were
           | historically important for past researchers, but only because
           | this was my research focus for so long would I have known
           | this.
        
           | marginalia_nu wrote:
           | The search engine doesn't actually do full text search, so
           | maybe your query was too... unique.
           | 
           | But do first of all verify that you haven't been hacked.
           | There's about a quarter of a million domains I've flagged that,
           | besides their wordpress content, also host a ton of link spam
           | crap off in some hidden folder. This reflects on the quality
           | rating extremely negatively to the point where you may have
           | not been indexed at all.
           | 
           | Secondly, are you behind cloudflare or some other big-name
           | CDN? Because, as I mentioned in another comment, I can't
           | crawl their pages without getting captchad until they approve
           | of my humble request to be classified as a good bot.
           | 
           | There are some other hosting providers I flat out block on a
           | subnet level because they host a large amount of link farms.
           | This is currently Alibaba, Psychz, eSited, Cloud Yuqu and
           | 1Blu.
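
Editor's sketch of the subnet-level block described above, under stated assumptions: the comment does not say how the block is implemented, and the networks below are reserved documentation ranges standing in for the real provider subnets, which are not given in the thread. `BLOCKED` and `is_blocked` are hypothetical names.

```python
import ipaddress

# Placeholder networks (RFC 5737 documentation ranges), NOT the real
# subnets of any hosting provider named in the thread.
BLOCKED = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(ip):
    """Return True if the address falls inside any blocked subnet."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)
```

A crawler could call `is_blocked` on a host's resolved address before fetching, so an entire link-farm range is skipped with one rule instead of one rule per domain.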
        
             | mountain_peak wrote:
             | Thanks for the advice; not hacked, but I have "resurrected"
             | many WP sites that have been (including my wife's non-
             | profit). Just running on an EC2 micro instance, but I tried
             | adding "site:" and received "No such domain". Actually, I
             | think it's because I haven't enabled "HTTPS" yet! That's on
             | my to-do along with migrating off EC2-Classic to VPC...
        
               | marginalia_nu wrote:
               | Vanilla HTTP should be fine. I think 80% of the urls are
               | HTTP.
               | 
               | If you're getting no such domain, it's either blocked
               | because it looks too much like a spam domain, or it
               | simply hasn't been discovered yet.
               | 
               | What's the TLD? I severely restrict some cheaper TLDs
               | because they gave so much spam.
               | 
               | For example, cr.yp.to is an example of a baby I know I've
               | definitely thrown out with the bathwater.
        
             | withinboredom wrote:
             | It'd be nice if you had a page to get the current index
             | status for a domain.
        
               | marginalia_nu wrote:
               | Try a query on the form site:www.example.com ;-)
        
               | rovr138 wrote:
               | Would it be possible to have a link to a page with
               | operators?
        
         | duckmysick wrote:
         | I'm intrigued by this experiment but I can't visualize it. What
         | do you mean by boring results? Would combing through a library
         | (the one with paper books) also produce boring results? What's
         | your ideal results?
        
           | xipho wrote:
           | Perhaps a counter example, something that is interesting.
           | Anecdotally. This, of all things, is the _top_ result in my
           | search: https://tft.brainiac.com/archive/0303/msg00037.html.
           | Which is strange to me because I don't recognize
           | tft.brainiac. I click, it's a list of biological
           | relationships among Hymenoptera, including a reference to
           | genus of the wasps I studied, presumably in a biological
           | relationship (host/parasite) context. I cataloged every
           | relationship known at one point, so my brain wants to know
             | where this came from, is it something I caught. Then I go
           | look for more context, and find it's part of a thread about
           | D&D(?) and hymenoptera, and it's epic, and a chunk of my
           | morning is lost figuring out why and how this came to be.
        
             | duckmysick wrote:
             | Yes, thanks. That helps.
             | 
             | If I understand it correctly, you're interested in bits and
             | pieces of new information that's indirectly related to your
             | object of interest. Degree 2 and 3 in Six Degrees of Kevin
             | Bacon, so to speak. You know degree 0 like the back of your
             | hand and you've seen almost everything closely connected.
             | Finding novel, interesting things is getting more
             | difficult.
             | 
             | Have you thought about cataloging all the related stuff you
             | stumble upon? Something in between loose notes and what
             | Moby Dick is to cetology.
        
               | xipho wrote:
               | Exactly.
               | 
               | > Have you thought about cataloging all the related stuff
               | you stumble upon? Something in between loose notes and
               | what Moby Dick is to cetology.
               | 
               | Tongue in cheek- new app time, to facilitate this. It
               | should have the name "Degree4". Entries can only be made
               | if degrees 2 and 3 are "defined". Scoffs at degrees 5 and
               | 6, just because. Startup developing can probably
               | unethically seed content by mining
               | https://www.everything2.com/. Should use concepts of "AI"
               | and "persistent homology"... profit!
               | 
               | But no, I don't outside a mental note. Closest I would
               | come would be adding '!! <some note>' to my potwiki text
               | notes (see my past comments) if it's something I want to
               | have come back with a grep, or think might be interesting
               | to explore "when I retire". If it's a scientific fact in
               | my field after researching it further it would go into
               | this https://taxonworks.org (or its precursor).
        
           | xipho wrote:
           | In part, by boring results I mean I instantly recognize the
           | top results, and I know exactly what will be in them, and I
           | know which ones will actually contain potentially interesting
           | new stuff, i.e. _I didn't have to search for these, I'd go
           | there directly_. Then next results are all obscure, and I've
           | already visited them, and/or I know they are historical and
           | not something I have to revisit.
           | 
           | With this engine with at least 1/2 the links (to be fair
           | there were < 20) I didn't recognize the URL at all, and it
           | was clear in the text or the URL that there was an
           | interesting bit to check out (i.e. what Google should have
           | also returned after they barfed out the things I don't need
           | to know about), but had never succinctly done in my
           | experience.
           | 
           | I suppose the magic in this engine would have to be alerting
           | the searcher that they found more of this type of link, as
           | once I visited the 10 or so sites they would fall back into
           | the "been there, done that" link category that Google appends
           | somewhere after the ads and "big" sites, mixed in with a
           | million search term spam sites, etc.
        
           | xattt wrote:
           | There's certain grey literature that's not captured in
           | university library federated searches nor easily found with
           | mainstream search engines.
        
             | xipho wrote:
             | There are decades of academic research not digitized. The
             | digitization window used to only hit around 1990, I haven't
             | looked at it hard recently, but I suspect this still
             | remains true for many important journals. This is grey only
             | to those who do not know how to use a library.
        
       | nagyf wrote:
       | This is great, I like the results. Couple of things I noticed:
       | 
       | - Search results are often very old, from the early 2000s (I guess
       | because back then more websites were text oriented). Are you
       | taking into account the age of the page when showing results? It
       | would be great to see more up-to-date results at the top
       | 
       | - I noticed a few results which directed me to websites with
       | security risks, Firefox didn't even let me open them. Is it
       | possible to filter these out from the results?
        
       | horsh1 wrote:
       | No Cyrillic or hiragana support :-(
        
       | 300bps wrote:
       | What we need now is a search engine that weeds out sites that
       | have been SEO optimized for keyword density.
       | 
       | I'm tired of searching for "generic keyword" and getting a page
       | with an extremely low signal to noise ratio written like this:
       | 
       | "Many people search for generic keyword. That is why you can find
       | all about generic keyword here. In fact we specialize in generic
       | keyword and slight alterations of generic keyword."
       | 
       | It's like Google stopped caring that people were gaming it.
        
       | Nicksil wrote:
       | - Semantic HTML; not everything is a div; correct use of markup.
       | 
       | - Search results are not overrun with commercial, SEO stuffing,
       | "content" farms.
       | 
       | I don't know what to say. This is such a refreshing sight. Well
       | done.
        
       | thetanil wrote:
       | Yes please! More of this!
        
       | hulitu wrote:
       | "Search results Search "alt.sysadmin.recovery" needs to be a word
       | Those were all the results,"
       | 
       | No comment.
        
       | hdjjhhvvhga wrote:
       | Congratulations, great work!
        
       | tomaszs wrote:
       | I like the concept of a search engine that does not try to figure
       | out what I should learn based on what I search. I know what I
       | search for
        
       | camillomiller wrote:
       | Great idea, awful UI
        
         | muxator wrote:
         | How so? It's intuitive and super fast. I wish there were more
         | websites with such a simple UI.
        
           | feikname wrote:
           | it's too "uncompact". Font size too big and could use a bit
           | more horizontal space.
           | 
           | I find it comfortable to use at 60% zoom level
        
           | marginalia_nu wrote:
           | Some people like a flashy UI, the modern look is important
           | for them. It's ok to have aesthetic preferences, let's not
           | pretend we don't all have them.
           | 
           | In the end, it's a niche search engine I've made, the
           | intended audience is the long tail. It just isn't for
           | everyone, and if it was for everyone, it probably would be
           | lesser for it.
        
             | silent_cal wrote:
             | I like the UI.
        
             | ravenstine wrote:
             | I'm glad you aren't trying to please anyone. I'd like a
             | return to an internet with fewer colors, gadgets and
             | gizmos, custom fonts, TypeKit, JavaScript requirements, and
             | so on. Most of the time I'm reading articles, so just give
             | me more text and less fluff!
        
         | jbj wrote:
         | What makes you find the user interface awful?
         | 
         | it is literally a search website with a text box for a search
         | term and a button to do the search.
        
         | r00t4ccess wrote:
         | The page isn't prompting for cookie preferences, asking to
         | allow notifications, popping up a mailing list or coupon halfway
         | down the page, playing a full page video with sound, or loading
         | 97 million lines of javascript. I'd say it's pretty much perfect.
        
         | IggleSniggle wrote:
         | Huh. I think it's a great UI. What did you not like about it?
        
         | stronglikedan wrote:
         | Ironic comment, considering that this is a search engine to
         | weed out sites with awful UIs. This gives us exactly what we
         | need in a search UI - no more, no less - in a clean and
         | intuitive way.
        
         | fouc wrote:
         | Yeah the design could use some work. The search results are not
         | compact - I only see 1 result without scrolling, not counting
         | the related wikipedia link that apparently has no description.
         | 
         | I don't particularly like that it seems to be a column
         | constrained to 550px width, instead of being responsive and
         | taking advantage of greater widths.
         | 
         | to the author of the site, if you're not really into
         | design/css, take a look at tailwindcss, it makes it fairly easy
         | to produce a minimal amount of css that is responsive.
        
       | agumonkey wrote:
       | Very nice. Start a trend :)
        
       | marginalia_nu wrote:
       | Yeah so this is my project. It's very much a work in progress,
       | but occasionally I think it works remarkably well for something I
       | cobbled together alone out of consumer hardware and home-made
       | code :-)
        
         | eigengrau5150 wrote:
         | I like this. Thanks for doing it.
        
         | scrollaway wrote:
         | I searched Warcraft and got a gold selling/ level boosting
         | site. Some things never change :)
        
         | bityard wrote:
         | This is awesome. I've been looking for a long time for a search
         | engine that basically takes everything Google does and does the
         | opposite. Thank you for doing this, I will definitely be
         | bookmarking it.
         | 
         | Is there a way to suggest or add sites? I went looking for
         | woodgears.ca and only got one result. I also think my personal
         | blog would be a good candidate for being indexed here but I
         | couldn't find any results for it.
        
         | ColinHayhurst wrote:
         | Great work. Working on an alternative search engine too. Take a
         | look at my profile.
        
         | soheil wrote:
         | Awesome project! How are you able to keep the site running
         | after HN kiss of death? What is your stack, elastic search or
         | something simpler? How did you crawl so many websites for a
         | project this size? Did you use any APIs like duck duck go or
         | data from other search engines? Are you still incorporating
         | something like PageRank to ensure good results are prioritized
         | or is it just the text-based-ness factor?
        
         | noduerme wrote:
         | I love this idea, and admire the work you put into it. I'm a
         | fan of long reads and historical non-fiction, and Google's
         | results are truly garbage.
         | 
         | I have a criticism that I think may pertain to the ranking
         | methodology. I searched for "discovery of Australia". Among the
         | top results were:
         | 
         | * A site claiming that the biblical flood was caused by Earth
         | colliding with a comet (with several other pages from that site
         | also making the top search results with other wild claims, e.g.
         | that the Egyptians discovered Arizona);
         | 
         | * Another site claiming the first inhabitants of Australia were
         | a lost tribe of Israel;
         | 
         | * A third site claiming that Australia was discovered and
         | founded by members of a secret society of Rosicrucians who had
         | infiltrated the Dutch East India Company and planned to build
         | an Australian utopia...
         | 
         | These were all pages heavy with HTML4 tags and virtually devoid
         | of Javascript, the kinds of pages you'd frequently see in the
         | late 1990s from people who had built their own static websites
         | in a text editor, or exported HTML from MS Word. At that time,
         | there were millions of those sites with people paying for their
         | own unique domain names, and so the proportion of them that
         | were home to wild-eyed conspiracy theories was relatively
         | small. What I think has happened is that kooks continued to
         | keep these sites up - to the point where it's almost a visual
         | trope now to see a red <h1> tag in Times New Roman and think,
         | uh oh, I've stumbled on an "ancient aliens" site. Whereas
         | scholars and journals offering higher quality information have
         | moved to more modern platforms that rely more heavily on modern
         | browsers - with or without their own domain names. So as a
         | result what seemed to surface here were the fragments of the
         | old web that remain live - possibly because people living in
         | cabins in Montana forget to cancel their web hosting, or
         | because the nature of old-school conspiracy theorists is to
         | just keep packing their old sites with walls of text surrounded
         | by <p> tags.
         | 
         | Arguably, this seems to rank the way Google's engine used to,
         | since it couldn't run JS and they wanted to punish sites that
         | used code to change markup at render time. At least, when I
         | used to have to do onsite SEO work, it was always about simple
         | tag hierarchies.
         | 
         | I wonder whether there isn't some better metric of validity and
         | information quality than what markup is used. Some of the sites
         | that surfaced further down could be considered interesting and
         | valuable resources. I think _not punishing_ simple wall-of-text
         | content is a good thing. But to punish more complicated layouts
         | may have the perverse effect of downranking higher-quality
         | sources of information - i.e. people and organizations who can
         | afford to build a decent website, or who care to migrate to a
         | modern blogging platform.
        
         | crocodiletears wrote:
         | It's very rare that I see a project on HN I can see myself
         | using. This is one. Like others have said, the results can be a
         | little rough. But they're rough in a way I think is much more
         | manageable than the idiosyncrasies of more 'clever' search
         | engines.
        
           | marginalia_nu wrote:
           | I think you need to approach it more like grep than google.
           | It's a forgotten art, dealing with this type of dumb search
           | engine.
           | 
           | Like if you search for "How do I make a steak", you aren't
           | going to get very good results. But a better query is "Steak
           | Recipe", as that is at least a conceivable H1-tag.
        
         | bluefox wrote:
         | This is a very cool project! Thank you.
        
         | BugsJustFindMe wrote:
         | I love this, and I love (many of) the results so far! What I
         | can't find on the site is detail about what "too many modern
         | web design features" means. Is it just penalizing sites with
         | tons of JavaScript?
        
           | marginalia_nu wrote:
           | Javascript tags are penalized the hardest, but it also takes
           | into consideration density of text per HTML. There's also
           | some adjustments based on text length, which words occur in
           | the page, etc.
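
Editor's sketch of the kind of heuristic described above, assuming a simple formula that the thread does not confirm: score a page by its ratio of visible text to raw HTML, minus a fixed penalty per script tag. The `quality` function, the `script_penalty` weight and the `_Stats` helper are all hypothetical; the engine's actual weights, and its adjustments for text length and vocabulary, are not given here.

```python
from html.parser import HTMLParser


class _Stats(HTMLParser):
    """Counts <script> tags and visible text characters."""

    def __init__(self):
        super().__init__()
        self.scripts = 0
        self.text_chars = 0
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Script and style bodies are not visible text, so skip them.
        if not self._skip:
            self.text_chars += len(data.strip())


def quality(html, script_penalty=0.2):
    """Hypothetical score in [0, 1]: text density minus a per-script penalty."""
    stats = _Stats()
    stats.feed(html)
    stats.close()
    density = stats.text_chars / max(len(html), 1)
    return max(0.0, density - script_penalty * stats.scripts)
```

Under this toy formula, a hand-written page that is mostly paragraphs scores close to 1, while the same text wrapped in a few script tags drops quickly toward 0.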
        
         | ad404b8a372f2b9 wrote:
         | Very cool project! How many websites do you have in your index?
         | And how did you go about building it?
         | 
         | I've been working on an engine for personal websites, currently
         | trying to build a classifier to extract them from commoncrawl,
         | if you have any general tips on that kind of project they'd be
         | very welcome.
        
         | davegauer wrote:
         | This is absolutely wonderful. I am LOVING the results I'm
         | getting back from it: the sort of content-rich sites that have
         | become nigh unreachable using traditional search engines. Thank
         | you for building this!
        
         | asah wrote:
         | Love it, kudos! This is great for developers and others who
         | Just Need Answers and not shopping or entertainment.
         | 
         | If you're looking for feedback, both from a UI design and
         | utility standpoint, you might consider "inlining" results from
         | selected sites, e.g. Wikipedia, Stack Exchange, etc. Having
         | worked on search for a long time, inlining (onebox etc) is a
         | big reason users choose Google, and that challengers fail to get
         | traction. If you're Serious(tm), dig into the publisher
         | structure formats and format those, create a test suite, etc.
         | 
         | A word of caution: if this takes off, as a business it's
         | vulnerable to Google shifting its algorithms slightly to
         | identify the segment of users+queries who prefer these results
         | and give the same results to those queries.
         | 
         | Hope this helps!
        
           | marginalia_nu wrote:
           | If Google starts showing interesting text-heavy links instead
           | of vapid listicles and storefronts, I have accomplished
           | everything I ever could dream of.
        
             | 0xbadcafebee wrote:
             | Thank you for doing this important work.
        
             | palijer wrote:
             | Haha, reminds me exactly of this.
             | 
             | https://xkcd.com/810/
        
         | santamex wrote:
         | Which software do you use to index the sites?
        
           | marginalia_nu wrote:
           | I wrote it myself from scratch. I have some metadata in
           | mariadb, but the index is bespoke.
           | 
           | A design sketch of the index is that it uses one file with
           | sorted URL IDs, one with IDs of N-grams (i.e. words and word-
           | pairs) referring to ranges in the URL file; as well as a
           | dictionary for relating words to word-IDs; that's a GNU Trove
           | hash map I modified to use memory-mapped data instead of
           | directly allocated arrays.
           | 
           | So when you search for two words, it translates them into IDs
           | using the special hash map, goes to the words file and finds
           | the least common of the words; starts with that.
           | 
           | Then it goes to the words file and looks up the URL range of
           | the first word.
           | 
           | Then it goes to the words file and looks up the URL range of
           | the second word.
           | 
           | Then it goes through the less common word's range and does a
           | binary search for each of those in the range of the more
           | common word.
           | 
           | Then it grabs the first N results, and translates them into
           | URLs (through mariadb); and that's your search result.
           | 
           | I'm skipping over a few steps, but that's the very crudest of
           | outlines.
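
Editor's sketch of the two-word lookup outlined above, assuming small in-memory lists in place of the memory-mapped files and skipping the final URL lookup in mariadb. The names `word_ids`, `ranges`, `url_ids`, `contains` and `search` are illustrative and not taken from the author's code.

```python
from bisect import bisect_left

# Hypothetical toy index; in the described design these live in files.
word_ids = {"steak": 0, "recipe": 1}  # dictionary: word -> word ID
# One flat array of URL IDs; each word owns a sorted slice of it.
url_ids = [3, 7, 9, 12, 2, 3, 5, 9, 11, 12, 20]
ranges = {0: (0, 4), 1: (4, 11)}  # word ID -> [start, end) in url_ids


def contains(start, end, target):
    """Binary search for target within the sorted slice url_ids[start:end]."""
    i = bisect_left(url_ids, target, start, end)
    return i < end and url_ids[i] == target


def search(first, second, limit=10):
    """Return up to `limit` URL IDs whose documents contain both words."""
    try:
        a, b = ranges[word_ids[first]], ranges[word_ids[second]]
    except KeyError:
        return []  # an unknown word matches nothing
    # Walk the rarer word's range, probing the more common word's range.
    rare, common = sorted((a, b), key=lambda r: r[1] - r[0])
    hits = []
    for i in range(rare[0], rare[1]):
        if contains(common[0], common[1], url_ids[i]):
            hits.append(url_ids[i])
            if len(hits) == limit:
                break
    return hits
```

Starting from the less common word keeps the work proportional to the shorter range times the logarithm of the longer one, which is why the outline picks the rarest word first.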
        
             | q3k wrote:
             | Good stuff. I've also been toying with doing some homegrown
             | search engine indexing (as an exercise in scalable
             | systems), and this is a fantastic result and great
             | inspiration.
             | 
             | Definitely want to see more people doing that kind of low-
             | level work instead of falling back to either 'use
             | elasticsearch' or 'you can't, you're not google'.
        
               | marginalia_nu wrote:
               | Well just crunching the numbers should indicate what is
               | possible and what isn't.
               | 
               | For the moment I have just south of 20 million URLs
               | indexed.
               | 
               | 1 x 20 million bytes = 20 MB.
               | 
               | 10 x 20 million bytes = 200 MB.
               | 
               | 100 x 20 million bytes = 2 GB.
               | 
               | 1,000 x 20 million bytes = 20 GB.
               | 
               | 10,000 x 20 million bytes = 200 GB.
               | 
               | 100,000 x 20 million bytes = 2 TB.
               | 
               | 1,000,000 x 20 million bytes = 20 TB.
               | 
               | This is still within what consumer hardware can deal
               | with. It's getting expensive, but you don't need a
               | datacenter to store 20 TB worth of data.
               | 
               | How many bytes do you need, per document, for an index?
               | Do you need 1 MB of data to store index information about
               | a page that, in terms of text alone, is perhaps 10 KB?
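
Editor's sketch reproducing the back-of-the-envelope table above, assuming decimal units and rounding the document count up to 20 million. `URLS`, `index_size` and `human` are hypothetical helper names.

```python
URLS = 20_000_000  # "just south of 20 million URLs", rounded up


def index_size(bytes_per_doc, urls=URLS):
    """Total index size in bytes for a fixed per-document budget."""
    return bytes_per_doc * urls


def human(n):
    """Format a byte count with decimal units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1000:
            return f"{n:g} {unit}"
        n /= 1000
    return f"{n:g} PB"


if __name__ == "__main__":
    for budget in (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
        print(f"{budget:>9,} bytes/doc -> {human(index_size(budget))}")
```

The useful knob is the per-document budget: at 1,000 bytes per document the whole index fits in the RAM of an ordinary desktop, and it only reaches tens of terabytes at a megabyte per document.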
        
             | rvnx wrote:
             | It's a great project!
        
         | axelroze wrote:
         | Hi,
         | 
         | Interesting idea. Definitely see an overlap with eReader
         | markets and looking at text only contents.
         | 
         | How does it work?
         | 
         | It ignores pages on which it detects frameworks for ui and ads
         | or any javascript code at all?
        
         | agumonkey wrote:
         | is there a json endpoint ? I'd love to make an emacs bridge :)
        
         | artembugara wrote:
         | Nice, what are you using to crawl the web?
        
           | marginalia_nu wrote:
           | It's pretty much all bespoke.
           | 
           | I use external libraries for parsing HTML (JSoup) and
           | robots.txt; but that's about it.
        
         | blondin wrote:
         | fantastic project, thank you!
        
         | habibur wrote:
         | How are you doing the crawling without getting blocked? -- the
         | hardest part.
        
           | judge2020 wrote:
           | Not OP but crawling is easy if you don't try scanning 5+
           | pages a second - almost all rate limiting/heuristic based
           | 'keep server costs low' engines, including Cloudflare, don't
           | care if you request every page, but will take action if you
           | do something like burst every page and take up just as many
           | server resources as a hundred concurrent users.
           | 
           | Now, that is assuming you aren't on some VPS provider. If
           | you're going to crawl, you'll have the best chance when you
           | use your own IPs on your own ASN, with DNS and reverse DNS
           | set up correctly. This makes it so the IP reputation systems
           | can detect you as a crawler but not one that hammers every
           | site it visits.
           | 
           | Also, I imagine that, for a search engine like this, it
           | doesn't expect content to change much anyways - so it can
           | take its time crawling every site only once every month or
           | two, instead of the multiple times a week (or day) search
           | engines like Google have to for the constantly-updated
           | content being churned out.
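
Editor's sketch of the "don't burst" advice above: a per-host scheduler that enforces a minimum gap between requests to the same site. The one-second default, the `PoliteScheduler` class and its parameters are assumptions for illustration; the thread only suggests staying well under 5 pages a second.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._next_ok = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until `url`'s host may be fetched; return seconds waited."""
        host = urlparse(url).netloc
        now = self._clock()
        delay = max(0.0, self._next_ok.get(host, now) - now)
        if delay:
            self._sleep(delay)
        # Reserve the next slot for this host after the current request.
        self._next_ok[host] = now + delay + self.min_delay
        return delay
```

A crawl loop would call `scheduler.wait(url)` immediately before each fetch. Requests to different hosts are not delayed by each other, so overall throughput stays reasonable while no single server sees a burst.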
        
         | androceium wrote:
         | Pretty neat!!!
         | 
         | You may already be aware of this, but the page doesn't seem to
         | be formatted correctly on mobile. The content shows in a single
         | thin column in the middle.
        
           | marginalia_nu wrote:
           | Hmm, which OS? I only have a single Android phone so I've
           | only fixed the CSS for that.
        
             | androceium wrote:
             | I was seeing it on Android w/ Firefox. Seems like it's
             | fixed now though. :)
        
             | ant6n wrote:
             | For example Firefox on Android.
        
       | egberts1 wrote:
       | I tried "Error 49" as a search phrase.
       | 
       | It's rudimentary but no IT-related result.
        
       | thrtythreeforty wrote:
       | > New: You can now look up dictionary definitions for words. If
       | you for example don't know what the definition of is is, you can
       | inquire thus: define:is.
       | 
       | Oh man, I love subtle jabs and tongue in cheek writing like this.
       | Very Robin Williams-esque.
        
         | marginalia_nu wrote:
         | I am the first to admit it's a pretty dated reference.
        
       | earthbee wrote:
       | I love this! I've been searching random words with no aim in
       | particular and keep finding lots of interesting tiny personal
       | webpages. It feels like the old web
        
       | [deleted]
        
       | arduinomancer wrote:
       | Wow this is immediately useful
       | 
       | If you figure out some sort of funding model (maybe even just
       | Patreon) I could totally see this as a viable side project
       | 
       | Already discovered this recipe site: https://based.cooking/
       | 
       | I love how adding recipes is through pull requests:
       | https://github.com/LukeSmithxyz/based.cooking/pulls
        
       | dmje wrote:
       | Love it. You should provide a link to Patreon / whatever so
       | people can support you financially. Hosting is probably not cheap
       | for you. Given the love here on HN I suspect you'd do well.
        
       | spandrew wrote:
       | All of my searches are turning up unrelated results ("college
       | life after the pandemic", "post-pandemic teaching in higher
       | education", "football news NFL" etc.)
       | 
       | NFL one had 'some' decently related results, but the websites
       | were all strangely disreputable.
        
         | mmmpop wrote:
         | > the websites were all strangely disreputable
         | 
         | Interesting you'd feel that way when sites without "modern
         | design" are encountered. Is this your own bias perhaps creating
         | a judgment or are they sites that you already know have a bad
         | reputation?
        
           | typon wrote:
           | Or perhaps the websites being returned are garbage? I have
           | the same experience trying a few searches and following the
           | top 5 links. Besides wikipedia, I haven't found a single
           | useful website.
        
       | abhinav22 wrote:
       | Great work and congrats!
        
       | skyfaller wrote:
       | This is a fantastic search engine. It delivers on its promise of
       | "serendipity". I found pages featuring my name that I'm not sure
       | I've ever seen before, after many years of searching myself to
       | test out search engines.
       | 
       | Perhaps more importantly, it delivers the most correct result
       | when searching for my username: the first result is not any of my
       | social media accounts, or even my own blog, but the text of the
       | obscure science fiction story that I took my username from! Well
       | done.
       | 
       | I've immediately added this as a search keyword in Firefox, and
       | I'll be using it more in the future.
       | 
       | Could meta search engines like DuckDuckGo include this as a
       | source? Should they?
        
       | pietroppeter wrote:
       | from About page:
       | 
       | > If you search for "Plato", you might for example end up at the
       | Canterbury Tales. Go looking for the Canterbury Tales, and you
       | may stumble upon Neil Gaiman's blog.
       | 
       | I know it is just a suggestion, but had to try searching both,
       | with no luck in getting the expected unexpected.
        
         | marginalia_nu wrote:
         | Yeah I did some work very recently aimed at improving the
         | relevance a bit. It was a bit too random in the state it was
         | before. Now it, perhaps, isn't random enough anymore.
        
           | pietroppeter wrote:
           | It looks very nice anyway, great job! I did try with other
           | queries and results were in general interesting.
        
       | fsflover wrote:
       | See also: https://wiby.me/
        
         | [deleted]
        
         | Tade0 wrote:
         | The "surprise me..." button is adequately labelled.
        
           | twobitshifter wrote:
           | Shades of stumbleupon.
        
           | tpmx wrote:
           | Great link to drag to the bookmark bar.
        
       | bovermyer wrote:
       | I adore this. Unfortunately, searching for my own name - with or
       | without quotes - doesn't actually find my site.
       | 
       | It does find a handful of references to me from over twenty years
       | ago, though, which I thought was fascinating.
        
         | tgv wrote:
         | My name retrieved the "dead pornstar list". Unexpected.
        
       | JohnJamesRambo wrote:
       | Saving this forever. Thank you for making it.
        
       | kodeninja wrote:
       | Ivermectin (marginalia):
       | https://search.marginalia.nu/search?query=ivermectin+
       | 
       | Ivermectin (Google): https://www.google.com/search?q=ivermectin
       | 
       | The difference in the overall _thrust_ of the results is
       | remarkable.
       | 
       | Very interesting! Thanks for building it.
        
         | typon wrote:
         | The Google results tell you why Ivermectin is not a good
         | replacement for vaccination against Covid, the Marginalia
         | results tell you that Ivermectin is a miracle drug for treating
         | Covid 19. Really shows how much technology has the power to
         | change reality in today's world.
        
           | lame-robot-hoax wrote:
           | Google links to the FDA, CDC, WHO, NIH, WebMD, drugs.com, and
           | a pro ivermectin journal article from the American Journal of
           | Therapeutics.
           | 
           | The Marginalia results point you mostly to random blogs.
        
           | marginalia_nu wrote:
           | Part of what I wanted to show with this project is that there
           | is no such thing as an objective search engine. Even
           | seemingly irrelevant technological decisions drastically
           | impact the narrative.
        
             | Drew_ wrote:
             | Well the focus on text content isn't the only technical
             | difference here. Google is obviously weighing hundreds of
             | signals in its search results that your engine is not
             | accounting for. These omitted signals are also relevant.
        
               | sundarurfriend wrote:
               | > These omitted signals are also relevant.
               | 
               | Certainly. And sometimes they're relevant in a good way,
               | sometimes in a bad user-hostile way. Every search engine
               | requires discrimination and intelligent usage by the
               | person doing the search, just in different areas.
        
               | marginalia_nu wrote:
               | Right, but that is still a technical decision on their
               | side. They presumably don't sit down and have a meeting
               | about what world view they should present. Well I hope
               | they don't.
        
         | silent_cal wrote:
         | Lol!
        
         | daxfohl wrote:
         | Though before basing life-and-death decisions on this, consider
         | reading the "about" page first:
         | https://memex.marginalia.nu/projects/edge/about.gmi
         | 
         | > The purpose of the tool is primarily to help you find and
         | navigate the strange parts of the internet. Where, for sure,
         | you'll find crack-pots, communists, libertarians, anarchists,
         | strange religious cults, snake oil peddlers, really strong
         | opinions.
         | 
         | and
         | 
         | > If you are looking for fact, this is almost certainly the
         | wrong tool.
        
         | lame-robot-hoax wrote:
         | Yes, google returns results from the FDA, American Journal of
         | Therapeutics pro Ivermectin study, WebMD, the CDC, the NIH,
         | Wikipedia, the WHO, and New York Times.
         | 
         | Marginalia returns results from Wikipedia, a faculty member's
         | university blog regarding river blindness, a website called
         | truthsummit promoting it as a miracle cure, a website called
         | vaxxchoice promoting it as a cure, vitamindsstopcovid, etc.
         | 
         | I'd say the quality of the results are quite different.
        
         | motoxpro wrote:
         | Second result: "Ivermectin, a miracle drug against Covid
         | Ivermectin, a miracle drug against Covid. 100% effective as
         | preventative and for early stage Covid. Over 90% cut in
         | fatality rate for late-stage cases.
         | 
         | https://truthsummit.info/blog/ivermectin-against-covid.html "
         | 
         | Eh I think I'll take the google search results on this one.
        
       | pkamb wrote:
       | Great results for "sauna". Lots of Web 1.0 pages discussing
       | building plans and displaying pictures of individually built,
       | traditional, unique, old saunas on some property.
       | 
       | The Google results are all blogspam or sales pages for cheap
       | shipped saunas. Lots of "IR" results. Phony health benefit pages.
       | Stock photos solely of beautiful new hotel gyms.
       | 
       | I've noticed this problem with Google results for quite some
       | time. Sadly, the _new_ content being created of the top variety
       | is mostly being done within private Facebook groups that can't
       | be easily searched, linked, or archived.
        
       | rfrey wrote:
       | This is stunning. I searched "winemaking" because it's my latest
       | obsession, and turned up dozens of links to high-quality pages
       | I'd never seen despite spending an hour a day for three months
       | cruising Google on the topic.
       | 
       | Please do announce it here if you ever decide to solicit help or
       | contributors. My stab at this problem was to have a search index
       | of only ad-free pages, on the hypothesis it would turn up self-
       | hosted blogs, university personal pages, that sort of thing. But
       | the results were too thin, your approach is much better.
        
       | winddude wrote:
       | hmm, I dream of recipes search engine that punishes recipes pages
       | with too much text. lol
        
         | mint2 wrote:
         | Yeah, recipes sites have both too much text and too many
         | pictures.
         | 
         | But they do illustrate what this search engine needs to watch
         | out for. If they rank more text higher and their search site
         | becomes popular, won't everyone just spam recipe site word
         | salad, maybe even ai generated word salad.
         | 
         | But in the interval, until that day comes, they are going to
         | have a very useful service.
        
       | Paul_S wrote:
       | Looking for an arm assembly instruction, instead I get this
       | strange website as the result
       | http://mailstar.net/coronavirus.html
       | 
       | Is that accidental or is this website promoted because it's text
       | heavy and will surface for any search without many results?
        
         | marginalia_nu wrote:
         | Looks like that page just has an absurd amount of keywords.
         | Those sometimes surface when there aren't any good results.
         | Haven't found a foolproof detection method that doesn't
         | unjustly punish innocent pages with large amounts of content.
        
       | tbojanin wrote:
       | this is sweet
        
       | gen_greyface wrote:
       | Hi, it'd be nice if you could add an OpenSearch description
       | document for your site.
       | 
       | https://developer.mozilla.org/en-US/docs/Web/OpenSearch
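       | 
       | A rough sketch of what such a description document could look
       | like, generated with Python's standard library. The short name
       | and description are placeholders; the URL template is inferred
       | from the search URLs posted elsewhere in this thread, so treat
       | it as an assumption:

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OpenSearch 1.1 specification.
NS = "http://a9.com/-/spec/opensearch/1.1/"


def build_description(short_name, description, template):
    """Return an OpenSearch description document as an XML string."""
    ET.register_namespace("", NS)  # emit NS as the default namespace
    root = ET.Element(f"{{{NS}}}OpenSearchDescription")
    ET.SubElement(root, f"{{{NS}}}ShortName").text = short_name
    ET.SubElement(root, f"{{{NS}}}Description").text = description
    ET.SubElement(root, f"{{{NS}}}InputEncoding").text = "UTF-8"
    # {searchTerms} is the placeholder the browser replaces with the query.
    ET.SubElement(root, f"{{{NS}}}Url",
                  {"type": "text/html", "template": template})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_description(
        "Marginalia",                      # placeholder name
        "Marginalia Search",               # placeholder description
        "https://search.marginalia.nu/search?query={searchTerms}",
    ))
```

       | The site would then advertise the file with a <link rel="search"
       | type="application/opensearchdescription+xml"> tag in its pages.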
        
         | gen_greyface wrote:
         | until then i'll keep the site bookmarked. :-)
        
       | josefresco wrote:
       | If you like wacky search engines, there's also Million Short:
       | https://millionshort.com where you can search and remove the top
       | 100/1K/10k/100K/1M results.
        
       | ephbit wrote:
       | Quoted from the linked site:
       | 
       | > Convenience functions have been added, and the search engine
       | can now perform simple calculations and unit conversions. Try 1
       | pint in cubic centimeters, or 50+sqrt(pi). This functionality is
       | still under development, be patient if it doesn't work.
       | 
       | Why would you make any ever so small effort to implement
       | calculations? I don't get it.
       | 
       | If your search engine enabled me to find more useful search
       | results to my queries than google or yacy or whatever, I wouldn't
       | care one tiny bit about being able to do calculations with it.
       | 
       | Why not focus on the search functionality?
        
         | marginalia_nu wrote:
         | I implemented calculations because easily 80% of my google
         | queries are calculations, unit conversions, etc.
         | 
         | Search functionality is a larger priority. Calculations and unit
         | conversions were an afternoon's break from the search
         | functionality :-)
        
         | exporectomy wrote:
         | How else do you do unit conversions? I use Google because it's
         | far easier than any other software I've tried. Mainly because
         | it's more forgiving of errors. It knows that "34 fset in
         | msters" is 10.3632 meters. This search engine isn't, though, so
         | I wouldn't waste time trying to discover its unit conversion
         | syntax rules.
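         | 
         | One way to get that kind of forgiveness is fuzzy-matching unit
         | names before converting. A small Python sketch, assuming a
         | hand-picked unit table and difflib's similarity cutoff; this is
         | not how Google or this search engine actually does it:

```python
import difflib

# Conversion factors to meters for a few length units (illustrative subset).
TO_METERS = {
    "feet": 0.3048,
    "inches": 0.0254,
    "meters": 1.0,
    "miles": 1609.344,
}


def match_unit(word):
    """Return the closest known unit name, or None if nothing is similar."""
    hits = difflib.get_close_matches(word.lower(), list(TO_METERS),
                                     n=1, cutoff=0.6)
    return hits[0] if hits else None


def convert(query):
    """Parse '<number> <unit> in <unit>' and return the converted value."""
    amount, source, keyword, target = query.split()
    if keyword != "in":
        raise ValueError("expected '<number> <unit> in <unit>'")
    src, dst = match_unit(source), match_unit(target)
    if src is None or dst is None:
        raise ValueError("unrecognized unit")
    return float(amount) * TO_METERS[src] / TO_METERS[dst]


if __name__ == "__main__":
    print(convert("34 fset in msters"))
```

         | With this table "fset" resolves to "feet" and "msters" to
         | "meters"; a larger unit table would need a stricter cutoff to
         | avoid confusing similar names.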
        
           | samhh wrote:
           | On macOS for example I'd use Spotlight.
        
       | drusepth wrote:
       | Interesting approach.
       | 
       | I always search myself on new search engines to compare the
       | results. Most engines return my personal blog/website,
       | books/stories I've written, news stories, my github
       | projects/contributions, social links, etc.
       | 
       | This search engine surfaces just three obscure IRC logs that
       | contain my nick in join/part messages (nothing said from me!)
       | from 2009. And nothing else.
       | 
       | There's probably some things this approach is really good at but
       | I'm not sure what they'd be for me off hand. Always cool to see
       | new approaches to search, though.
        
       | fhackernewz wrote:
       | fuck you hacker news
        
       | fsckboy wrote:
       | I've read most of the comments here and people are evaluating the
       | search results: all good information.
       | 
       | I'm looking at "punishes modern web design"... This thing IS
       | modern web design. I think it's called "marginalia" in reference
       | to the huge margins they chose!
       | 
       | I'm using a browser on a linux desktop and side-by-side, HN's
       | page design is old-fashioned tasteful making pretty good use of
       | space, and marginalia has a font that's more than twice the 2D
       | pointsize and is so spread out with whitespace that the "Tips" on
       | the home page are off the bottom of my window.
        
       | gtmb wrote:
       | As everything in life flows in cycle, I predict the search engine
       | that will de-throne Google will be like Google when it started -
       | a simple variation of page rank.
       | 
       | No smarts, no bubble, no signals decided by over fitting to a
       | biased engineer preference.
        
         | __MatrixMan__ wrote:
         | I agree, except it'll optionally accept the ID of your node in
         | a web of trust, and it'll use a page rank customized for you.
         | 
         | Or you can put in two ID's and have it find sources that both
         | parties trust.
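         | 
         | A sketch of that idea as personalized PageRank in Python: the
         | random jump lands only on nodes you trust rather than anywhere.
         | The graph format, damping factor and iteration count are
         | assumptions chosen for illustration:

```python
def pagerank(links, personalization=None, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [outgoing nodes]}.

    personalization maps trusted nodes to weights; the random jump lands
    on them in proportion to those weights (uniform if omitted).
    """
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    if personalization is None:
        personalization = {n: 1.0 for n in nodes}
    total = sum(personalization.values())
    teleport = {n: personalization.get(n, 0.0) / total for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        new = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = links.get(n, [])
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: hand its rank back via the teleport vector.
                for m in nodes:
                    new[m] += damping * rank[n] * teleport[m]
        rank = new
    return rank
```

         | Passing two IDs in personalization, for example {"alice": 1.0,
         | "bob": 1.0}, biases the ranking toward pages reachable from
         | either party; requiring trust from both would need an extra
         | step such as taking the minimum of two separate runs.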
        
         | jerrre wrote:
         | I wouldn't say the existence of this page proves your
         | prediction right (as it's not dethroning Google anytime soon).
         | 
         | It's easy to forget that the goal of Google isn't to provide a
         | useful search engine (at least not anymore), but the search
         | engine is a by product of them wanting to show ads.
        
           | marcos100 wrote:
           | If Google isn't useful then nobody will use it.
           | 
           | The search engine and the ads are tightly coupled. A better
           | search engine means it can predict with more accuracy what
           | you are looking for and can serve you an even more targeted
           | ad that increases the chance you'll click.
        
             | rchaud wrote:
             | > If Google isn't useful then nobody will use it.
             | 
             | Or they'll continue to use it out of sheer inertia. Google
             | is paying Apple $15 billion to keep its place as iOS
             | default search engine.
             | 
             | IE6 didn't die overnight when Firefox arrived.
        
           | popcube wrote:
           | Now Google tries to imitate a document system on your computer;
           | usually I rely on Google to know what I need :(
        
         | antupis wrote:
         | As a dev I would love a search engine that only searched
         | Stack Overflow, GitHub issues, documentation, etc.
        
           | axelroze wrote:
           | You can limit the search query per website in DDG (and
           | probably in others)
           | 
           | Example: `rust slow compilation site:stackoverflow.com`
        
         | axelroze wrote:
         | Wouldn't the dethroner of Google be some new technology which
         | is not a search engine like Google but better at solving the
         | original task of finding information on how to solve problems?
         | 
         | Just like how iPad dethroned Windows PCs for average home user
         | but not Mac because Windows had the monopoly and then an
         | innovation destroyed MS in this space and not a competitor.
         | 
         | I don't think Google dethrones Yahoo and AltaVista scenario
         | will occur again.
        
           | gverrilla wrote:
           | > iPad dethroned Windows PCs for average home user
           | 
           | is this true? in the US, perhaps? because in south america it
           | couldn't be further from the truth - didn't happen at all
        
       | lightsurfer wrote:
       | thank you!
        
       | timdaub wrote:
       | I'm developing a text-heavy site and philosophically I'm trying
       | to view documents as just that... documents [1].
       | 
       | But I don't get good results for "rug pull".
       | 
       | - 1 https://rugpullindex.com
        
         | marginalia_nu wrote:
         | Yeah it's hosted by cloudflare. I'm currently IP-blocking them,
         | because they keep prompting my crawler with a captcha,
         | presumably because it's made millions of requests from their
         | CDN.
         | 
         | Some rigmarole getting recognized as a good bot by the CDNs.
         | I've submitted a request fairly recently, but haven't heard
         | back from them yet.
         | 
         | Like I would like to be on good terms with them, and other
         | websites that block small independent crawlers.
         | 
         | I can't blame them though, there's a lot of bad bots out there.
         | But I'm doing my best not to be part of the problem.
        
           | [deleted]
        
           | petercooper wrote:
           | Aha, I was going to ask how you were coping with CDNs like
           | Cloudflare blocking bots. It's sad we've got to this point
           | where basically only the established search engines are
           | grandfathered in to be able to crawl sites.
        
         | BugsJustFindMe wrote:
         | > _I'm developing a text-heavy site_
         | 
         | I looked at the source for your site's front page. That's not
         | text-heavy; that's markup-heavy. I didn't bother looking at the
         | rest of the pages because it appears to be yet another crypto
         | market site.
        
       | greggturkington wrote:
       | Wouldn't this just skew towards really old sites?
       | 
       | The _third_ search result for "dog" is this page on how to
       | remove AOL Instant Messenger, published in 2002.
       | 
       | https://sillydog.org/netscape/kb/removeaim.html
       | 
       | No one wants to see newsletter signup popovers, but "modern web
       | design" includes good performance and relevant content. (The
       | search engine itself takes about 2 seconds to first contentful
       | paint, not great.)
        
         | bityard wrote:
         | This search engine pretty much takes everything that Google is
         | doing and does the opposite. For instance, Google has decided
         | that "relevant" usually also means "recent". Thus, when
         | searching for something on Google, you mainly get results from
         | blogspam farms and almost never do you see anything more than a
         | few years old.
         | 
         | An implication of this is that old sites tend to disappear
         | (either into obscurity or by being taken down) because Google
         | penalizes them in search rankings. The author of this search
         | engine says, however:
         | 
         | > If a webpage has been around for a long time, then odds are
         | it has fundamental redeeming quality that has motivated keeping
         | it around all for that time.
         | 
         | I don't know that I agree 100% with this (there was lots of
         | crap on the "old" web too), but it makes a certain amount of
         | sense.
        
           | greggturkington wrote:
           | What "fundamental redeeming quality" about uninstalling AIM
           | from Windows 3.x motivated making that the 3rd result for
           | "dog"?
           | 
           | The 5th result is a tutorial on CSS. This search engine
           | decided it's relevant because it has "dog" in the URL. Is
           | that a better reasoning than Google's?
           | https://htmldog.com/guides/css/beginner/
           | 
           | Core Web Vitals ranks sites higher that perform well. Text-
           | heavy sites that are also optimized and relevant would
           | already perform well.
        
             | marginalia_nu wrote:
             | What are you searching for when you enter the query "dog",
             | keeping in mind the search engine deliberately does not
             | examine synonyms and deliberately seeks out the path
             | less taken?
             | 
             | Dog facts? Then search "dog facts"
             | 
             | Famous dogs? Then search "famous dogs"
             | 
             | Rappers? Try "snoop dogg"
        
               | [deleted]
        
               | greggturkington wrote:
               | I'm searching for information on "dog".
               | 
               | Your suggestion of "dog facts" returns 6 pages from the
               | same domain, dogquotes.com. It's unreadable on mobile
               | because it's so old, all the facts are unsourced, and
               | often wrong:
               | 
               | > Never assume that a barking dog won't bute _[sic]_ ,
               | unless you're absolutely certain the dog believes it too.
               | 
               | Also on the 1st SERP, this odd blog post ranting about
               | 4th amendment rights [1], "Media Glamorization of the
               | Psychopath" [2], and this (image-heavy) page about
               | dolphin encounters in the Bahamas ("Sea Dog Facts" is a
               | link on the page).
               | 
               | 1. http://www.rexcurry.net/drugdogsdan.html
               | 2. https://www.metaphoricalplatypus.com/articles/psychology/psychopathysociopathy/media-glamorization-of-the-psychopath/
               | 3. https://www.dolphinencounters.com/education/
        
       | samsaga2 wrote:
       | Where does the data come from? Do you index the whole web
       | yourself? I see it totally impossible for a personal project. I'm
       | very curious about that.
        
         | marginalia_nu wrote:
         | I do indeed index the web myself. Not the _entire_ web, just a
         | subset of it. The crawler quickly loses interest in
         | javascript:y websites and only indexes at depth those websites
         | that are simple. It also focuses on websites in English,
         | Swedish and Latin and tries to identify and ignore the rest
         | (best-effort).
         | 
         | You'd be surprised how much you can do with modern hardware if
         | you are scrappy. The current index is about 17.7 million URLs.
         | I've gone as far as 50 million and could _probably_ double that
         | if I really wanted to. The difficulty isn't having a small
         | enough index, but rather having a relevant enough index,
         | weeding out the link farms and stuff that just take space.
         | 
         | I only index N-grams of up to 4 words, carefully chosen to be
         | useful. The search engine, right now, is backed by a 317 Gb
         | reverse index and a 5.2 Gb dictionary.
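         | 
         | For a sense of what indexing N-grams of up to 4 words means,
         | here is a naive Python sketch that enumerates every candidate.
         | It deliberately skips the "carefully chosen" part, which is
         | where the real work is; the function name and whitespace
         | tokenization are assumptions:

```python
def word_ngrams(text, max_n=4):
    """Return all word n-grams of length 1..max_n, in order of n."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


if __name__ == "__main__":
    for gram in word_ngrams("fall of the roman empire"):
        print(gram)
```

         | A five-word phrase yields 14 candidates, so pruning which
         | N-grams are kept matters a great deal for index size.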
        
           | omoikane wrote:
           | > It also focuses on websites in English, Swedish and Latin
           | and tries to identify and ignore the rest
           | 
           | When I search for Japanese terms, it says "<query> needs to
           | be a word", which wasn't the best error message. Maybe the
           | error message should say something like "sorry, your language
           | isn't supported yet"?
        
             | marginalia_nu wrote:
             | I've rephrased the wording for that one a bit.
        
           | throwaway47292 wrote:
           | Amazing!
           | 
           | I have only one recommendation that might make the search a
           | bit more relevant, e.g when searching for 'linux locking' or
           | 'kernel locking' kind of things.
           | 
           | Try to upsort things that match near the top of the content,
           | like the top of the man page vs middle vs bottom.
           | 
           | One easy way to do it without having to store the positions,
           | is to index the ngrams with min(sqrt(line), 8) of their line
           | number, this will cover first 64 lines, you can also use log()
           | or just decide ad hoc, top, middle, bottom of the document, so
           | you can use only 3 values.
           | 
           | e.g. https://www.kernel.org/doc/html/v5.0/kernel-
           | hacking/locking.... would do unreliable_1 guide_1 locking_1
           | ... then at line 4 kernel_2 locking_2 ... after line 50 ...
           | then_7 ... and after that everything will be _8.
           | 
           | then just make the query "kernel locking" to "dismax(kernel_1
           | OR kernel_2 OR kernel_3...) AND dismax(locking_1 OR locking_2
           | ...) with some tiebreaker of 0.1 or so, you can also say "i
           | want to upsort things on the same line, or few lines apart"
           | by modifying the query a bit.
           | 
           | It works really well and costs very little in terms of space,
           | i tried it at https://github.com/jackdoe/zr while searching
           | all of stackoverflow/man pages etc. and was pretty
           | surprised by the result.
           | 
           | This approach is a bit cheaper than storing the positions
           | because positions are (let's say) 4 bytes per term per doc,
           | while this approach has fixed upper bound cost of 8*4 per
           | document (assuming 4 byte document ids) plus some amortized
           | cost for the terms
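           | 
           | A compact Python sketch of the bucketing idea, reading the cap
           | as min(floor(sqrt(line)), 8) so that line 4 lands in bucket 2
           | and everything from line 64 on shares bucket 8. The 1/bucket
           | weights and the in-memory token set are assumptions made for
           | illustration; zr itself may differ:

```python
import math

MAX_BUCKET = 8


def line_bucket(line_number):
    """Map a 1-based line number to a bucket 1..8 (line 64+ shares 8)."""
    return min(math.isqrt(line_number), MAX_BUCKET)


def index_terms(text):
    """Return the set of 'word_bucket' tokens for one document."""
    tokens = set()
    for line_number, line in enumerate(text.splitlines(), start=1):
        bucket = line_bucket(line_number)
        for word in line.lower().split():
            tokens.add(f"{word}_{bucket}")
    return tokens


def score(query, tokens, tiebreaker=0.1):
    """AND over query words, each scored dismax-style over its buckets."""
    total = 0.0
    for word in query.lower().split():
        # Assumed weights: bucket 1 counts 1.0, bucket 8 counts 0.125.
        hits = [1.0 / b for b in range(1, MAX_BUCKET + 1)
                if f"{word}_{b}" in tokens]
        if not hits:
            return 0.0  # AND semantics: every word must appear somewhere
        best = max(hits)
        # dismax: best match plus a small share of the other matches
        total += best + tiebreaker * (sum(hits) - best)
    return total
```

           | A document that mentions "kernel locking" on its first line
           | then outranks one that only mentions it after line 64, with at
           | most 8 extra tokens per distinct term per document.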
        
           | kews wrote:
           | Do you know what proportion of the texty web instructs
           | unknown crawlers to go away (or blocks them)?
        
           | c0wb0yc0d3r wrote:
           | How did you go about seeding your web crawler with URLs to
           | crawl?
        
             | marginalia_nu wrote:
             | I just started with my website and did a crawl.
             | Subsequently I've been seeding it with the best results
             | from my previous crawls.
             | 
             | It's a directed search so it doesn't seem to need a
             | particularly solid seed to get decent results.
        
               | c0wb0yc0d3r wrote:
               | So how long did it take to get to 17 million URLs?
        
             | dannyw wrote:
             | Not OP, but if I was to do this, I'd start by downloading
             | Wikipedia and all its external links and references, and
             | crawling from there. You should eventually reach most of
             | the publicly visible internet.
        
               | c0wb0yc0d3r wrote:
               | I feel a little embarrassed that I didn't think of
               | something like that.
               | 
               | When I did some crawler experimenting in my younger
               | years, I thought I was pretty clever using sites that
               | would let you perform random Google searches. I would
               | just crawl all the pages from the results returned.
               | 
               | Your method would undoubtedly be more interesting I
               | think. It would certainly lead to interesting performance
               | problems quicker, I bet.
        
           | dannyw wrote:
           | This is unbelievably impressive on a technical and ambition
           | level for a solo, self-hosted hardware project. Kudos.
        
           | jillesvangurp wrote:
           | Cool, I've been thinking on this topic a bit lately. Crawling
           | is indeed not that hard of a problem. Google could do it 23
           | years ago. The web is a bit bigger now of course but it's not
           | that bad. Those numbers are well within the range of a very
           | modest search cluster (pick your favorite technology; it
           | shouldn't be challenging for any of them). 10x or 1000x would
           | not matter a lot for this. Although it would raise your cost
           | a little.
           | 
           | The hard problem is indeed separating the good stuff from the
           | bad stuff; or rather labeling the stuff such that you can
           | tell the difference at query time. Page rank was nice back in
           | the day; until people figured out how to game things. And now
           | we have bot farms filling the web with nonsense to drive
           | political agendas, create memes, or to drown out criticism.
           | Page rank is still a useful ranking signal; just not by
           | itself.
           | 
           | The one thing no search engine has yet figured out is
           | reputability of sources. Content isn't anonymous mostly. It's
           | produced and consumed by people. And those people have
           | reputations. Bot content is bad because it comes from sources
           | without a credible reputation. Reputations are built over
           | time and people value having them. What if we could value
           | people's appreciation relative to their reputability? That
           | could filter out a lot of nonsense. A simple like button + a
           | flag button combined with verified domain ownership (ssl
           | certificates) could do the trick. You like a lot of content
           | that other people disliked, your reputation goes down the
           | drain. If you produce a lot of content that people like, your
           | reputation goes up. If a lot of reputable people flag your
           | content, your reputation tanks.
           | 
           | The hard part is keeping the system fair and balanced. And
           | reputability is of course a subjective notion and there is a
           | danger of creating recommendation bubbles, politicizing
           | certain topics, or even creating alternative reality type
           | bubbles. It's basically what's happening. But it's mostly
           | powered by search engines and social media that actually
           | completely ignore reputability.
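
The like/flag scheme described in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not anything the commenter or any search engine actually implements: the function names, the 0..1 reputation scale, the fixed `step` size, and the rule that voters with no known reputation carry zero weight are all hypothetical.

```python
from collections import defaultdict


def clamp(value, low=0.0, high=1.0):
    """Keep a reputation inside the assumed 0..1 range."""
    return max(low, min(high, value))


def content_scores(votes, reputation):
    """Sum likes (+1) and flags (-1), each weighted by the voter's reputation.

    Voters with no known reputation get weight 0.0 (an assumption), so
    accounts without a track record cannot move a score.
    """
    scores = defaultdict(float)
    for voter, content_id, value in votes:
        scores[content_id] += value * reputation.get(voter, 0.0)
    return dict(scores)


def update_reputations(votes, authors, reputation, step=0.1):
    """Return a new reputation table after one round of votes.

    votes:      list of (voter, content_id, +1 or -1)
    authors:    dict mapping content_id -> author (e.g. a verified domain)
    reputation: dict mapping identity -> score in 0..1
    step:       hypothetical fixed adjustment per event
    """
    scores = content_scores(votes, reputation)
    updated = dict(reputation)

    # Authors gain reputation for liked content and lose it for flagged content.
    for content_id, score in scores.items():
        author = authors[content_id]
        if score > 0:
            updated[author] = clamp(updated.get(author, 0.0) + step)
        elif score < 0:
            updated[author] = clamp(updated.get(author, 0.0) - step)

    # Voters who went against the reputation-weighted consensus lose reputation.
    for voter, content_id, value in votes:
        if value * scores[content_id] < 0:
            updated[voter] = clamp(updated.get(voter, 0.0) - step)

    return updated
```

Because every vote is weighted by the voter's current reputation, a large number of fresh accounts has little effect on a score, which is the property the comment is after. The sketch does nothing about the fairness and bubble problems raised in the comment's last paragraph.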
        
       | silent_cal wrote:
       | Wonderful work (':
        
       | afrcnc wrote:
       | except it doesn't actually return that many results
        
       | winddude wrote:
       | curious how do you afford the infrastructure? I found that the
       | hardest part of running a search engine.
        
         | marginalia_nu wrote:
         | I'm self-hosting, and the server is a Ryzen 7 3900x with 128 GB
         | of non-ECC RAM. It sits in my living room next to a cheap UPS.
         | I did snag one of the last remaining Optane 900Ps off Amazon,
         | and it powers the index and the database--and I really do think
         | this is among the best hardware choices for this use case. But
         | beyond that it's really nothing special, hardware-wise. Like
         | it's less than a month's salary.
         | 
         | It runs Debian, and all the services run bare metal with zero
         | containerization.
         | 
         | Modern consumer hardware can be absurdly powerful if you let
         | it.
         | 
         | Like I have no doubt a thousand engineers could spend a hundred
         | times as much time building a search engine that did pretty
         | much the same thing mine does, but it would require a full data
         | center to stay running and be much slower. But that's just a
         | cost of large scale software development I don't have to pay as
         | a solo developer with no deadline, no planning and a shoestring
         | budget.
        
       | yewenjie wrote:
       | Related question - suppose I want to create a meta search engine
       | for myself, and I want it to be as fast as possible. What are the
       | things I should be optimizing for?
        
       | FractalHQ wrote:
       | Ok this is great if all I want to do is read text, but often
       | times that is very much not all I want to do. The web is much
       | more than text and images these days. I can appreciate this as
       | long as it's branded as a search engine for blogs and articles
       | specifically, as opposed to being touted as a drop-in replacement
       | for the modern search engine.
        
         | lukas099 wrote:
         | Is this a criticism? It doesn't at all seem touted as a drop-in
         | replacement for the modern search engine.
        
       | Phileosopher wrote:
       | Wow, if this catches on, my original content will actually
       | matter![1] I've always had a love-hate relationship with modern
       | web design principles because my design choices have all the
       | excitement and polish of what we get on HN.
       | 
       | I'm sure I'm not the only one, either. Content-rich sites need
       | more love.
       | 
       | [1] https://adequate.life
        
       | justinzollars wrote:
       | It works. Nice job.
        
       | NotAnOtter wrote:
       | "Don't be afraid to scroll down in the search results, unlike in
       | many other search engines, depending on what you are looking for,
       | you may find the best results in the middle of the listing."
       | 
       | This is a very polite way of saying "this engine isn't very good"
       | 
       | Overall impressed with the project but I thought the word play
       | there was funny
        
         | marginalia_nu wrote:
         | I felt I needed to add it to help people taught by other search
         | engines that they only get 1-2 good results, and the rest is
         | useless. The reason I'm providing a hundred results is that
         | there are often a lot of results to choose from. If the point
         | is to find something unexpected, and that indeed is the entire
         | point, then that is the only sane design choice.
         | 
         | Like you search for something on Google and similar, and you
         | know what you are going to find. They are so good at searching
         | the Internet and predicting what you are going to click on that
         | you never see something new.
         | 
         | It's a great feat of engineering, but a huge tragedy, because
         | discovering new things, outside of what you or your
         | demographic has previously demonstrated an interest in, can be
         | absolutely life changing.
        
       | jarbus wrote:
       | This engine is fantastic for recipes
        
       | exabrial wrote:
       | I would like a search that punishes 'modern' SPAs that load 87 MB
       | of the author's pet JS projects to display simple text. Basically
       | every modern SPA.
        
       | rc_mob wrote:
       | blessings upon you sir for making this
        
       | sabujp wrote:
       | effort is good, but needs some work, no results here :
       | 
       | https://search.marginalia.nu/search?query=rxjava+2+api+docs
       | 
       | https://www.google.com/search?q=rxjava+2+api+docs&oq=rxjava+...
        
         | marginalia_nu wrote:
         | It is very much a work in progress, still struggling with some
         | areas. I only really got into the territory of "sometimes
         | actually useful" like this weekend. Wasn't planning on blowing
         | up on HN just yet.
        
       | rafael_c wrote:
       | I liked this one... I searched for 'George Harrison' and among
       | the first results there was a page with interesting comments
       | about Harrison's solo career; someone reminiscing about the time
       | they got to talk to him about guitars for half an hour at a bar
       | at the airport; a transcript for an interview he gave on TV...
       | Whereas on GOOGLE: an intrusive 'People also ask' which I was
       | not interested in; thumbnails for videos on youtube that I was not
       | looking for; previews to garbage clickbaity news articles; and
       | then finally for the search items: a bunch of websites for
       | lyrics; his Instagram (!) and fb pages; his imdb page; some more
       | news articles I was not looking for...
       | 
       | Granted, google's web results above are perhaps what people are
       | looking for 75% of the time, but how limiting and boring.
       | 
       | I'm also a sucker for the simplistic text-centric, information-
       | laden pages from the pre-facebook era.
       | 
       | For 'global warming', however - since Marginalia excludes modern
       | web-design pages - the results are of dubious relevance and
       | interest, since they are, well, 'old'.
       | 
       | I see myself using this engine a lot.
        
       | leephillips wrote:
       | This is wonderful and stupendous.
       | 
       | I've often thought that Google could be turned back into a good
       | search engine by simply eliminating the crap and letting the
       | useful sites float to the top of the results.
       | 
       | marginalia.nu seems to like my sites, so it must be good!
       | 
       | Some results are prefixed with ! or an arrow dingbat. What does
       | that mean?
        
       | PieUser wrote:
       | searching for covid gives a bunch of bogus crap of fake news
        
       | isaacgreyed wrote:
       | A common use case, how to do random thing in programming:
       | 
       | I searched python make a bar chart and it returned a live coding
       | video with an AI generated text transcript and two articles which
       | mentioned a different kind of bar.
       | 
       | I then narrowed it down to just python bar chart, and got a blog
       | post about scripting with a bar chart in it, this
       | http://www.nitcentral.com/voyager4/hellyear.htm with monty
       | python, bars, and charts from 1996 and among some other things I
       | found this
       | https://python-course.eu/naive_bayes_classifier_introduction...,
       | which had an
       | example of a python bar chart even though the title of the page
       | made me think it wasn't what I wanted.
       | 
       | So for what I imagine to be a difficult search because of all the
       | different meanings of the words, I found my result on the second
       | query pretty quickly, and found some cool unrelated stuff too.
       | 
       | I like mostly that I get what I type in, and not exactly what I
       | want, but what I want is there too.
        
         | CapmCrackaWaka wrote:
         | I would probably use this if I wanted to find interesting blog
         | posts/websites about a topic I want to learn more about in
         | general. It seems less useful for returning exact answers to
         | specific questions.
        
       | jerhewet wrote:
       | I use webcrawler.com, and IMO it's better than any other search
       | engine for finding _exactly_ what I'm looking for. Not what's
       | "trending", or "popular", or what the sheeple are searching for.
       | It finds the _exact matching keywords_ that I'm looking for. No
       | inference or other bullshit -- just the matches.
       | 
       | Such a relief to not wade through oceans of worthless crap any
       | more.
        
       | api wrote:
       | This is the most amazing thing I have seen on here in at least a
       | year!
       | 
       | It's... no... it can't be... a search engine that finds _actual
       | information_ instead of 5 megabyte blobs of tracking code and SEO
       | crap!
        
       | optimalsolver wrote:
       | I predict it will return a disproportionate amount of sites by
       | schizophrenic conspiracists.
        
       | slim wrote:
       | This is a search engine indexing the internet on a mariadb
       | database hosted on consumer hardware maintained by a single
       | person as a hobby and it does not suffer from HN hug of death
        
         | swyx wrote:
         | how on earth do you index so much on consumer hardware? my
         | frontend developer mind is blown.
        
           | foxfluff wrote:
           | Wait till you learn that modern CPUs run _billions_ of cycles
           | per second. With multiple cores in parallel! And they can
           | reach transfer rates of tens of gigabytes per second to RAM,
           | or around a terabyte per second into L3.
        
             | ThalesX wrote:
             | And then you add a single HTTP request and everything tones
             | down to the speed of the web. Or I/O. Or DB call.
        
           | paxys wrote:
           | Consumer hardware today is simply what was cutting-edge and
           | crazy expensive 5 years ago.
        
       | deadalus wrote:
       | Very interesting because of the interesting results from random
       | websites. It's a great discovery tool.
       | 
       | Now hoping for a search engine that favors text-heavy sites and
       | punishes paywalls
        
         | the__alchemist wrote:
         | I built one!
        
       | mattchew wrote:
       | Oh, I dream of a day where there are multiple useful search
       | engines, specialized for different purposes.
       | 
       | You're doing God's work here. Thanks and good luck.
        
         | scns wrote:
         | The Flying Spaghetti Monster wants to have a word with you.
         | 
         | (edit) That is a nice dream though.
        
         | brian_herman wrote:
         | Kind of reminds me of the past like alta vista and dogpile.
        
         | BoxOfRain wrote:
         | I wonder if there's any mileage in an extension of something
         | like uBlock Origin's lists of ad networks to block but instead
         | it's a list of known content mills and SEO spam factories to
         | remove from search results?
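
The blocklist idea in the comment above could look roughly like this. It is a minimal sketch, assuming the list is a plain set of domain names and that a listed domain also covers its subdomains. The function names and the `.example` domains are made up for illustration, and this is not how uBlock Origin's filter lists are actually formatted or matched.

```python
from urllib.parse import urlsplit


def is_blocked(url, blocklist):
    """Return True if the URL's host, or any parent domain of it, is listed."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Check "www.contentmill.example", then "contentmill.example", then "example".
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels)))


def filter_results(urls, blocklist):
    """Drop search results whose host is on the blocklist, preserving order."""
    return [url for url in urls if not is_blocked(url, blocklist)]


if __name__ == "__main__":
    blocklist = {"contentmill.example"}  # hypothetical community-maintained list
    results = [
        "https://www.contentmill.example/10-best-things",
        "https://blog.example/long-form-post",
    ]
    print(filter_results(results, blocklist))
```

Matching on whole domain labels rather than substrings avoids blocking an unrelated site whose name merely contains a listed one.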
        
       | high_byte wrote:
       | I'd like a chrome extension that marks links that target text-
       | heavy vs "modern" so I know beforehand what to expect - paywall,
       | ads, popups, clickbaits, etc.
        
       | kebos wrote:
       | This is really cool, it filters out all fluff.
       | 
       | It's not always taking me to totally relevant sites but the
       | results contain my favourite type of content.
       | 
       | Full of _writing_ and pure html - usually the hallmark of someone
       | who knows what they are doing, wants to communicate but doesn't
       | want to waste their time.
        
       ___________________________________________________________________
       (page generated 2021-09-16 23:00 UTC)