[HN Gopher] A search engine that favors text-heavy sites and pun...
___________________________________________________________________
A search engine that favors text-heavy sites and punishes modern
web design
Author : Funes-
Score : 2915 points
Date : 2021-09-16 12:16 UTC (1 days ago)
(HTM) web link (search.marginalia.nu)
(TXT) w3m dump (search.marginalia.nu)
| ryankrage77 wrote:
| The results are fantastic, but I can't see how the excerpts
| relate to the search term.
|
| For example, a search term for 'Scotichronicon' returns some
| fascinating results, but the search term itself doesn't appear in
| the title or excerpts of most of the results.
|
| This makes it harder to judge how relevant they are.
| marginalia_nu wrote:
| The excerpts are static and very best effort. You just have to
| visit the website and find out I'm afraid.
|
| I can do a lot with what I have, but I can't do full text
| search on millions of documents with dynamic excerpts off a
| single computer in my living room.
| bencollier49 wrote:
| This is brilliant - I can actually surf the web for fun again.
| This engine's actually a nice complement to another mainstream
| engine, as the regular one is good for searches during the
| working day, whilst "marginalia" (?) is great for recreational
| reading, and actual learning.
| snakeboy wrote:
| Wow, that's awesome. Great work!
|
| For a simple test, I searched "fall of the roman empire". In your
| search engine, I got wikipedia, followed by academic talks,
| chapters of books, and long-form blogs. All extremely useful
| resources.
|
| When I search on google, I get wikipedia, followed by a listicle
| "8 Reasons Why Rome Fell", then the imdb page for a movie by the
| same name, and then two Amazon book links, which are totally
| useless.
| adventured wrote:
| I did a search for "George Washington"
|
| First result after Wikipedia:
|
| "Radiophone Transmitter on the U.S.S. George Washington (1920)
|
| In 1906, Reginald Fessenden contracted with General Electric to
| build the first alternator transmitter. G.E. continued to
| perfect alternator transmitter design, and at the time of this
| report, the Navy was operating one of G.E.'s 200 kilowatt
| alternators http://earlyradiohistory.us/1919wsh.htm "
|
| Another result in the first few:
|
| " - VANDERBILT, GEORGE WASHINGTON
|
| PH: (800) ###-#233 FX: (#03) 641-5###.
| https://www.ScottWinslow.com/manufacturer/VANDERBILT_GEORGE_...
| "
|
| And just below that terrible result:
|
| "I Looked and I Listened -- George Washington Hill extract
| (1954)
|
| Although the events described in this account are undated, they
| appear to have occurred in late 1928. I Looked and I Listened,
| Ben Gross, 1954, pages 104-105: Programs such as these called
| for the expenditure of larger sums than NBC had anticipated. It
| be http://earlyradiohistory.us/1954ayl2.htm "
|
| Dramatically worse than Google.
|
| ---
|
| Ok, how about a search for "Rome" then? Surely it'll pull some
| great text results for the city or the ancient empire.
|
| First result after Wikipedia:
|
| "Home | Rome Daily Sentinel
|
| Reliable Community News for Oneida, Madison and Lewis County
| http://romesentinel.com/"
|
| The fourth result for searching "Rome":
|
| "Glenn's Pens - Stores of Note
|
| Glenn's Pens, web site about pens, inks, stores, companies -
| the pleasure of owning and using a pen of choice. Direcdtory of
| pen stores in Europe.
| http://www.marcuslink.com/pens/storesofnote/roma.html"
|
| Again, dramatically worse than Google.
|
| ---
|
| Ok, how about if I search for "British"?
|
| First result after Wikipedia:
|
| "BRITISH MINING DATABASE
|
| British_Mining_Database
| http://www.users.globalnet.co.uk/~lizcolin/bmd.htm "
|
| And after that:
|
| "British Virgin Islands
|
| Many of these photos were taken on board the Spirit of
| Massachusetts. The sailing trip was organized by Toto Tours.
| Images Copyright (c) Lowell Greenberg Home Up Spring Quail
| Gardens Forest Home Lake Hodges Cape Falcon Cape Lookout,
| Oregon Wahkeena
| http://www.earthrenewal.org/british_virgin_islands2.htm"
|
| Again, far off the mark and dramatically worse than Google.
|
| I like the idea of Google having lots of search competition,
| but this isn't there yet (and I wouldn't expect it to be). I don't
| think overhyping its results does it any favors.
| burkaman wrote:
| This is not a Google competitor, it's a different type of
| search engine with different goals.
|
| > If you are looking for fact, this is almost certainly the
| wrong tool. If you are looking for serendipity, you're on the
| right track. When was the last time you just stumbled onto
| something interesting, by the way?
| JasonFruit wrote:
| Hobby project leads angry person to interesting and
| unexpected material; angry person remains angry. Details at
| six.
| fouric wrote:
| The project explicitly bills itself as a "search engine",
| not an "interesting and unexpected material surfacer".
| Moreover, projecting emotions like "angry" onto a comment
| in order to discredit the content of the comment (hey! is
| that an ad-hominem?) is just about exactly the opposite of
| the discussions that the HN mods are trying to curate, and
| the discussions that I like to see here.
| withinboredom wrote:
| In the early days of google, I found what I was looking
| for on page 5+. On the way, I'd discover many interesting
| things I didn't even know I was looking for, often
| completely unrelated to what I was searching for.
| kwertyoowiyop wrote:
| And now Google hides that more than one page even exists,
| as they populate their first page with buttons to ask
| similar questions and go to the first page of THOSE
| results.
| kews wrote:
| I miss those old days of even being permitted to go many
| pages in.
| allknowingfrog wrote:
| If you click through to the About page, I think you'll
| see that "interesting and unexpected material surfacer"
| is a fairly apt description of the project.
| adventured wrote:
| > Hobby project leads angry person to interesting and
| unexpected material; angry person remains angry.
|
| Not angry in the least. I'm thrilled someone is working on
| a search competitor to Google.
|
| I understand you're attempting to dismiss my pointing out
| the bad results by calling me angry though. You're focusing
| your content on me personally, instead of what I pointed
| out.
|
| The parent was far overhyping the results in a way that was
| very misleading (look, it's better than Google!). I tried
| various searches, they were not great results. The parent
| was very clearly implying something a lot better than that
| by what they said. The product isn't close to being at that
| level at this point, overhyping it to such an absurd degree
| isn't reasonable or fair to the person that is working on
| it.
|
| I would specifically suggest people not compare it to
| Google. Let it be its own thing, at least for a good while.
| Google (Alphabet) is a trillion dollar company. Don't press
| the expectations so far and stage it to compete with Google
| at this point. I wouldn't even reference Google in relation
| to this search engine, let it be its own thing and find its
| own mindshare.
| bityard wrote:
| > I'm thrilled someone is working on a search competitor
| to Google.
|
| Except the author goes to quite some lengths to explain
| that his search engine is not a competitor to Google, and
| is in fact exactly the opposite of Google in many ways:
| https://memex.marginalia.nu/projects/edge/about.gmi
| kwhitefoot wrote:
| What were you expecting to see for British? There must be
| millions of pages containing that term. Anyway the first
| screenful from Google is unadulterated crap, advertising
| mixed with the usual trivia questions.
|
| If you are going to claim something is wide of the mark then
| you really ought to tell us at least roughly where the mark
| is.
| duckmysick wrote:
| I checked the results of the same query and they seem fine.
| Lots of speeches and articles about George Washington the US
| president. There's even his beer recipe.
|
| As for the results you linked, it's part of the zeitgeist to
| list other entities sharing the same name. Sure, they could
| use some subtle changes in ranking, but overall the returned
| links satisfy my curiosity.
| Nition wrote:
| The Wikipedia link at the top is always given. It would maybe
| be good to make it a little clearer that it's not one of the
| true results.
| Ajef wrote:
| I think this is just because of terms you have searched. In
| my test-searches Wikipedia has not come up once in first
| position (I think the highest was 3rd in the list).
|
| Here's what I've tried with a few variations: golang generics
| proposal, machine learning transformer, covid hospitalization
| germany
|
| [edit] formatting
| Nition wrote:
| I think maybe it's a special insert at the top, but only if
| a Wikipedia page is found that matches your search term? I'm
| not sure now though.
| titzer wrote:
| Search engines whose revenue is based on advertising will
| ultimately be tuned to steer you to the ad foodchain. All the
| incentives are aligned towards, and all the metrics ultimately
| in service of, profit for advertisers. Not in the 99% of people
| who can be convinced to consume something by ads? Welp, screw you.
| phendrenad2 wrote:
| Search engines should be something you pay for. Surely search
| engine powerusers can afford to pay for such a service. If
| Google makes $1 per user per month or something, that's not
| too high a bar to get over.
| titzer wrote:
| Search engines should be like libraries. At least some tiny
| sliver of the billions we spend on education and research
| should go to, you know, actually organizing the world's
| information and making it universally available.
| wizzwizz4 wrote:
| In which case, consider paying for something like Infinity:
| https://infinitysearch.co/
| Siira wrote:
| I tried some queries for Harry Potter fanfictions, and the
| results were pretty much completely unrelated. There weren't
| that many results, either.
| acchow wrote:
| If this search engine ever takes off, the listicle writers will
| just start optimizing for it too, right?
| dotancohen wrote:
| Mission accomplished, then.
| acchow wrote:
| If the goal was to remove modern web design, ok sure
| mission accomplished.
|
| If your goal was to create a search engine that ignored
| listicles and other fluff and instead got you meatier
| results like "academic talks" and such, then no.
| klntsky wrote:
| However, when searching for "haskell type inference algorithm"
| I get completely useless results.
| [deleted]
| klntsky wrote:
| Since it does not use synonyms, it looks like it is unable to
| answer "how's that thing called"-queries.
| burkaman wrote:
| That query is too long apparently. But if you shorten to
| "haskell type inference", I think it delivers on its promise:
|
| > If you are looking for fact, this is almost certainly the
| wrong tool. If you are looking for serendipity, you're on the
| right track. When was the last time you just stumbled onto
| something interesting, by the way?
| marginalia_nu wrote:
| The search engine doesn't do any type of re-ordering or
| synonym stuff, it only tries to construct different N-grams
| from the search query.
|
| So if you, for example, compare "SDL tutorial" with "SDL
| tutorials": on Google you'd get the same stuff; on this search
| engine, for better or worse, you don't.
|
| This is a design decision, for now anyway, mostly because
| I'm incredibly annoyed when algorithms are second-guessing
| me. On the other hand, it does mean you sometimes have to
| try different searches to get relevant results.
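|
| A minimal sketch of the behaviour described above, assuming
| whitespace tokenization and contiguous word n-grams. The helper
| name `query_ngrams` and the `max_n` limit are hypothetical and are
| not taken from the engine's source. It only illustrates why, with
| no stemming or synonym expansion, "SDL tutorial" and "SDL
| tutorials" yield different lookup terms.

```python
def query_ngrams(query: str, max_n: int = 3) -> list[str]:
    """Return all contiguous word n-grams (n = 1..max_n) of a query.

    Hypothetical helper: tokens are lowercased and split on whitespace,
    with no stemming, re-ordering, or synonym expansion.
    """
    tokens = query.lower().split()
    grams = []
    for n in range(1, min(max_n, len(tokens)) + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


if __name__ == "__main__":
    print(query_ngrams("SDL tutorial"))
    print(query_ngrams("SDL tutorials"))
```

| The two queries share only the term "sdl", so an index keyed on
| exact terms would be expected to return different documents.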
| ford_o wrote:
| Maybe list the synonyms under the query, so it's easier to
| try different formulations.
| akavel wrote:
| Oh this sounds like it could be a really cool idea! This
| way it could also be subtly teaching users that the
| engine doesn't do automatic synonyms translation so it's
| worth experimenting; also kinda like giving the synonyms
| feature while still keeping user in full control.
| Razengan wrote:
| It could simply become an option.
| OneLeggedCat wrote:
| Don't change it. It's good this way.
| leephillips wrote:
| I like this design decision. It pays you back for
| choosing your search terms carefully.
| mananaysiempre wrote:
| I'm not against a stemmer, actually, just against the
| aggressive concordances (?) that Google now employs, like
| when it shows me X in Banach spaces (the classical,
| textbook case) when I'm specifically searching for X in
| Frechet spaces (the generalization I want to find but am
| not sure exists); of _course_ Banach spaces and Frechet
| spaces are almost exclusively encountered in the same
| context, but it doesn't mean that one is a popular typo
| for the other! (The relative rarity of both of these in
| the corpus probably doesn't help. The farcical case is
| BRST, or Becchi-Rouet-Stora-Tyutin, in physics, as it is
| literally a single key away from "best" and thus almost
| impossible to search for.)
|
| On the other hand, Google's unawareness of (extensive and
| ubiquitous) Russian noun morphology is essentially what
| allowed Yandex to exist: both 2011 Yandex and 2021 Google
| are _much_ more helpful for Russian than 2011 Google. I
| suspect (but have not checked) that the engine under
| discussion is utterly unusable for it. English (along
| with other Germanic and Romance languages to a lesser
| extent) is quite unusual in being meaningfully searchable
| without any understanding of morphology, globally
| speaking.
| dahauns wrote:
| English is more the outlier in regard to Germanic
| languages, try German or Finnish, with their wonderful
| compounds :)
|
| https://e.humanities.uva.nl/publications/2004/kamp_lang04
| .pd...
| medstrom wrote:
| I thought you could fix that by enclosing "BRST" in
| quotes, but apparently not. DuckDuckGo (which uses
| Google) returns a couple of results that do contain
| "BRST" in a medical context, but most results don't
| contain this string at all. What's going on?
| mananaysiempre wrote:
| I'm not certain what DDG actually uses (wasn't it Bing?),
| but in my experience from the last couple of months it
| ignores quotes substantially _more_ eagerly than Google
| does. For this particular term, a little bit of domain
| knowledge helps: even without quotes, _brst becchi_ ,
| _brst formalism_ , _brst quantization_ or perhaps _bv
| brst_ will get you reasonable results. (I could swear
| Google corrected _brst quantization_ to _best
| quantization_ a year ago, but apparently not anymore.)
| Searching for stuff in the context of BRST is still
| somewhat unpleasant, though.
|
| I... don't think anything particularly surprising is
| happening here, except for quotes being apparently
| ignored? I've had it explained to me that a rare word is
| essentially indistinguishable from a popular misspelling
| by NLP techniques as they currently exist, except by
| feeding the machine a massive dictionary (and perhaps not
| even then). BRST is a thing that you essentially can't
| even define satisfactorily without at the very least four
| years of university-level physics (going by the
| conventional broad approach--the most direct possible
| road can of course be shorter if not necessarily more
| illuminating). "Best" is a very popular word both
| generally and in searches, and the R key is next to E on
| a Latin keyboard. If you are a perfect probabilistic
| reasoner with only these facts for context (and
| especially if you ignore case), I can very well believe
| that your best possible course of action is to assume a
| typo.
|
| How to permit overriding that decision (and indeed how to
| recognize you've actually made one worth worrying about
| without massive human input-- _e.g._ Russian adjectives
| can have more than 20 distinct forms, can be made up on
| the spot by following productive word-formation
| processes, and you don't want to learn all of the world's
| languages!) is simply a very difficult problem for what
| is probably a marginal benefit in the grand scheme of
| things.
|
| I just dislike hitting these margins so much.
| leephillips wrote:
| It would not be a difficult problem if they allowed the "
| " operator to work as they claim it does, or revive the +
| operator.
| mananaysiempre wrote:
| In English, maybe; in Russian, I frequently find myself
| reaching for the nonexistent "morphology but not
| synonyms" operator (as the same noun phrase can take a
| different form depending on whether it is the subject or
| the object of a verb, or even on which verb it is the
| object of); even German should have the same problem
| AFAIU, if a bit milder. I don't dare think about how
| speakers of agglutinative languages (Finnish, Turkish,
| Malayalam) suffer.
|
| (DDG docs do say it supports +... and even +"...", but I
| can't seem to get them to do what I want.)
| leephillips wrote:
| Ah, OK. I don't know anything about Russian. This is a
| hard problem. I think the solution is something like what
| you suggest: more operators allowing different
| transformations. Even in English, I would like a "you may
| pluralize but nothing else" operator.
| LanceH wrote:
| It would be nice if we could pipe search engines.
| BenoitP wrote:
| Definitely; We could create a meta search engine that
| queries them all, in desktop application format.
|
| Let's name it after a famous old scientist, and maybe add
| the year to prove it's modern: Galileo 2021.
| overkalix wrote:
| ... is this Galileo 2021 a reference that I am not
| understanding?
| BenoitP wrote:
| Yup, but so far no one got it.
|
| There was such an app in the early 2000's, before Google
| went mainstream, and Altavista-like engines were not
| good: Copernic 2000.
|
| I guess I'm officially old now.
| squeaky-clean wrote:
| I was always a dogpile user :p
| LightG wrote:
| Hotbot!
| tomerv wrote:
| FWIW, I got the reference. Maybe I'm old too?
| genewitch wrote:
| For years I wanted to try Copernic Summarizer. It seemed
| like it actually worked. Then software that did summaries
| disappeared, maybe? And about 5 years ago bots on Reddit
| were doing summaries of news stories (and then links in
| comments).
|
| This is a pattern I see over and over again, some
| research group or academics show that something can be
| done (summaries that make sense and are true summaries,
| evolutionary algorithm FPGA programming, real time gaze
| prediction, etc) and there's a few published code repos
| and a bit of news, then 'poof' - nowhere to be seen for
| 15 years or more.
| PaulHoule wrote:
| Meta search engines leave a bad taste in everyone's mouth
| because they've always failed. Here is why
|
| https://en.wikipedia.org/wiki/Arrow%27s_impossibility_the
| ore...
|
| You can't combine a few different ranked lists and expect
| to get results better than any of the original ranked
| lists.
| robrenaud wrote:
| > You can't combine a few different ranked lists and
| expect to get results better than any of the original
| ranked lists.
|
| I am skeptical of this application of the theorem. Here
| is my proposal:
|
| Take the top 10 Google and Bing results. If the top
| result from Bing is in the top 10 from Google, display
| Google results. If the top result from Bing is not in the
| top 10 from Google, place it at the 10th position. You'd
| have an algorithm that ties with Google, say 98% of the
| time, beats it say, 1.2% of the time, and loses .8% of
| the time.
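|
| The proposal above can be sketched as a small merge function.
| This is an illustration only: the function name `merge_results`
| and the representation of results as plain lists of URLs are
| assumptions, and no real search API is called.

```python
def merge_results(primary: list[str], secondary: list[str], k: int = 10) -> list[str]:
    """Merge two ranked URL lists following the rule proposed above.

    If the secondary engine's top hit already appears in the primary
    engine's top k, return the primary top k unchanged. Otherwise keep
    the first k-1 primary results and put the secondary top hit in the
    k-th (last) slot.
    """
    top = primary[:k]
    if not secondary or secondary[0] in top:
        return top
    return top[:k - 1] + [secondary[0]]


if __name__ == "__main__":
    google = [f"g{i}" for i in range(1, 11)]   # placeholder result IDs
    bing = ["b1", "g1", "g2"]
    print(merge_results(google, bing))
```

| Positions 1 to 9 always come from the primary list, so the merged
| list can differ from it only in the last slot.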
| vikingerik wrote:
| Right. Arrow's theorem just says it's impossible to do it
| in _all_ cases. It's still quite possible to get an
| improvement in a large proportion of cases, as you're
| proposing.
| random314 wrote:
| Arrow's theorem simply doesn't apply here. We don't need
| our personalized search results to satisfy the majority.
| PaulHoule wrote:
| But in both cases you face the problem of aggregating
| preferences of many into one. In one case you are
| combining personal preferences in the other case
| aggregating 'preferences' expressed by search engines.
| random314 wrote:
| But search engines aren't voting to maximize the chances
| that their preferred candidate shows up on top. The mixed
| ranker has no requirement to satisfy Arrow's integrity
| constraints. It has to satisfy the end user, which is
| quite possible in theory.
|
| Conditions the mixed ranker doesn't have to satisfy:
| "ranking while also meeting a specified set of criteria:
| unrestricted domain, non-dictatorship, Pareto efficiency,
| and independence of irrelevant alternatives"
| PaulHoule wrote:
| Sure, but the problem that conventional IR ranking
| functions are not meaningful other than by ordering leads
| you to the dismal world of political economy where you
| can't aggregate people's utility functions. (Thus you
| can't say anything about inequality, only about Pareto
| efficiency)
|
| Hypothetically you could treat these functions as
| meaningful but when you try you find that they aren't
| very meaningful.
|
| For instance IBM Watson aggregated multiple search
| sources by converting all the relevance scores to "the
| probability that this result is relevant".
|
| A conventional search engine will do horribly in that
| respect, you can fit a logit curve to make a probability
| estimator and you might get p=0.7 at the most and very
| rarely get that, in fact, you rarely get p>0.5.
|
| If you are combining search results from search engines
| that use similar approaches you know those p's are not
| independent so you can't take a large number of p=0.7's
| and turn that into a higher p.
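|
| A small sketch of what independence would buy, for contrast. It
| assumes each source reports p_i = P(relevant | evidence_i), that
| the evidence is conditionally independent given relevance, and a
| prior of 0.5; the function name `combine_independent` is
| hypothetical. Correlated sources do not satisfy the independence
| assumption, so this multiplication would overstate their
| combined confidence.

```python
def combine_independent(ps: list[float], prior: float = 0.5) -> float:
    """Combine per-source relevance probabilities under independence.

    Each source multiplies the prior odds by its own likelihood ratio,
    (p / (1 - p)) / prior_odds. This is only valid if the sources'
    evidence is conditionally independent, which is an assumption here.
    """
    prior_odds = prior / (1 - prior)
    odds = prior_odds
    for p in ps:
        odds *= (p / (1 - p)) / prior_odds
    return odds / (1 + odds)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, round(combine_independent([0.7] * n), 3))
```

| Under these assumptions three independent p=0.7 signals would
| combine to roughly 0.93; the point of the paragraph above is
| that engines built on similar approaches do not earn that boost.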
|
| If you are using search engines that use radically
| different matching strategies (say they return only
| p=0.99 results with low recall) the Watson approach
| works, but you need a big team to develop a long tail of
| matching strategies.
|
| If you had a good p-estimator for search you could do all
| sorts of things that normal search engines do poorly,
| such as "get an email when a p>0.5 document is added to
| the collection."
|
| For now alerting features are either absent or useless
| and most people have no idea why.
| PaulHoule wrote:
| I've had jobs tuning up the relevance of search engines
| with methods like
|
| https://ccc.inaoep.mx/~villasen/bib/AN%20OVERVIEW%20OF%20
| EVA...
|
| and the first conclusion is "something that you think
| will improve relevance probably won't"; the TREC
| conference went for about five years before making the
| first real discovery
|
| https://en.wikipedia.org/wiki/Okapi_BM25
|
| It's true that Arrow's Theorem doesn't strictly apply,
| but thinking about it makes it clear that the aggregation
| problem is ill-defined and tricky. (e.g. note also that a
| ranking function for full text search might have a range
| of 0-1 but is not a meaningful number, like a probability
| estimate that a document is relevant, but it just means
| that a result with a higher score is likely to be more
| relevant than one with a lower score.)
|
| Another way to think about it is that for any given
| feature architecture (say "bag of words") there is an
| (unknown) ideal ranking function.
|
| You might think that a real ranking function is the ideal
| ranking function plus an error and that averaging several
| ranking functions would keep the contribution of the
| ideal ranking function and the errors would average out,
| but actually the errors are correlated.
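|
| The effect of correlated errors can be made concrete with a
| short derivation. It is a sketch under assumptions that are not
| in the text above: each of n ranking functions scores a document
| as s_i = f + e_i, where f is the ideal score and the errors e_i
| have zero mean, a common variance sigma^2, and a common pairwise
| correlation rho.

```latex
\begin{align*}
\bar{s} &= \frac{1}{n}\sum_{i=1}^{n} s_i
         = f + \frac{1}{n}\sum_{i=1}^{n} e_i \\
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right)
        &= \frac{1}{n^{2}}\left( n\sigma^{2} + n(n-1)\rho\sigma^{2} \right)
         = \sigma^{2}\left( \rho + \frac{1-\rho}{n} \right)
         \;\xrightarrow{\,n\to\infty\,}\; \rho\,\sigma^{2}
\end{align*}
```

| With rho = 0 the error variance shrinks like sigma^2 / n, but
| with rho > 0 it never drops below rho * sigma^2, however many
| ranking functions are averaged.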
|
| In the case of BM25 for instance, it turns out you have
| to carefully tune between the biases of "long documents
| get more hits because they have more words in them" and
| "short documents rank higher because the document vectors
| are spiky like the query vectors". Until BM25 there
| wasn't a function that could be tuned up properly and
| just averaging several bad functions doesn't solve the
| real problem.
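|
| A minimal sketch of the standard Okapi BM25 formula, to show
| where the length trade-off is tuned. The function name
| `bm25_scores` is hypothetical, documents are assumed to be
| pre-tokenized lists of words in a non-empty corpus, and k1 = 1.2,
| b = 0.75 are commonly used defaults rather than values from this
| thread. The parameter b controls length normalization: b = 0
| ignores document length, b = 1 normalizes fully.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: above 1 for longer-than-average documents.
        length_norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [["rome", "fell"], ["rome"] + ["pad"] * 18, ["other", "words"]]
    print(bm25_scores(["rome"], corpus))          # short doc outranks long doc
    print(bm25_scores(["rome"], corpus, b=0.0))   # length ignored: a tie
```

| Raising b penalizes long documents more, lowering it favors
| them; that single knob is the tuning described above.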
| gnramires wrote:
| That's an invalid application of this theorem. (It
| doesn't necessarily hold)
|
| Suppose there's an unambiguous ranked preference by all
| people among a set (webpages, ranking). Suppose one
| search engine ranks correctly the top 5 results and
| incorrectly the next 5 results, while another ranks
| incorrectly the top 5 and correctly the next 5.
|
| What can happen is that there may be no universally
| preferred search engine (likely). In practice, as another
| commenter noted, most users may also prefer a certain
| combination of results (not difficult to imagine, for
| example by combining top independent results from
| different engines).
| collinmanderson wrote:
| Brave browser currently has "Google fallback" which
| sometimes mixes in Google search results with Brave's own
| search engine.
|
| https://search.brave.com/help/google-fallback
| Torwald wrote:
| I need that with a simpler interface, so I call it after
| a famous detective: Sherlock.
| artificial wrote:
| _a magic pop sound is faintly audible as a new side
| project is appended to several lists_ Excellent, thank
| you!
| mkr-hn wrote:
| Likely trademark collision with this:
| https://www.galileo.usg.edu/
| gmueckl wrote:
| Not an app, but probably comes quite close in all other
| respects: https://metager.org
| foofoo4u wrote:
| Good comparison. Reminds me of an analogy I like to make of
| today's web, which is it feels like browsing through a magazine
| store -- full of top 10s, shallow wow-factoids, and baity
| material. I genuinely believe terrible results like this are
| making society dumber.
| bluGill wrote:
| what I really want is a true AI to search through all that
| and figure out the useful truth. I don't know how to do this
| (and of course whoever writes the AI needs to be unbiased...)
| idiotsecant wrote:
| >whoever writes the AI needs to be unbiased...)
|
| I'm not sure the idea of a sentient being not having a bias
| is meaningful. Reality, once you get past the trivial bits,
| is subjective.
| jimbokun wrote:
| Isn't there a fundamental ML postulate that learning
| without bias is impossible?
|
| Maybe not the same kind of bias we think of in terms of
| politics and such, but I wonder if there's a connection.
| bluGill wrote:
| I didn't say the AI should be unbiased, just whoever
| writes it.
|
| I want an AI that is biased to the truth when there is an
| objective one, and my tastes otherwise. (that is when
| asked to find a good book it should give me fantasy even
| though romance is the most popular genre and so will have
| better reviews)
| hnick wrote:
| I think that is the goal, it's just what we currently have
| is an AI that's like a naive child who is easily tricked
| and distracted by clickbait.
| Dah00n wrote:
| >an AI that's <snip> easily tricked and distracted by
| clickbait.
|
| So, AIs are actually on par with most adults now? (Sorry)
| anonuser123456 wrote:
| > I genuinely believe terrible results like this are making
| society dumber.
|
| You have the causality reversed. Google results reflect the
| fact that society is dumb.
| jimbokun wrote:
| Google results reflect the fact that educating and
| informing people has low profit margins.
| rchaud wrote:
| The context matters. I'd happily read "Top 10" lists on a
| website if the site itself was dedicated to that one thing.
| "Top 10 Prog Rock albums", while a lazy, SEO-bait title,
| would at least be credible if it were on a music-oriented
| website.
|
| But no, these stories all come from cookie-cutter "new media"
| blog sites, written by an anonymous content writer who's
| repackaged Wikipedia/Discogs info into Buzzfeed-style copy
| writing designed to get people to "share to Twitter/FB". No
| passion, no expertise. Just eyeballs at any cost.
| 1vuio0pswjnm7 wrote:
| It's the "healthy web" Mozilla^1 and Google keep telling
| their blog audiences about. :)
|
| 1 Accept quid pro quo to send all queries to Google by
| default
|
| If what these companies were telling their readers was
| true, i.e., that advertising is "essential" for the web to
| survive, then how are the sites returned by this search
| engine for text-heavy websites (that are not discoverable
| through Google, the default search engine for Chrome,
| Firefox, etc.) able to remain online? Advertising is
| essential for the "tech" company middleman business to
| survive.
| foofoo4u wrote:
| This got me thinking that maybe one of the other big
| reasons for this is that the algorithms prioritize newer
| pages over older pages. This produces the problem where
| instead of covering a topic and refining it over time, the
| incentive is to repackage it over and over again.
|
| It reminds me of an annoyance I have with the Kindle store.
| If I wanted to find a book on, let's say, Psychology, there
| is no option to find all-time respected books of the past
| century. Amazon's algorithms constantly push to recommend
| the latest hot book of the year. But I don't want that. A
| year is not enough time to have society determine if the
| material withstands time. I want something that has stood
| the test of time and is recommended by reputable
| institutions.
| echelon wrote:
| The new evergreen is refreshed sludge for bottom dollar.
| College kids stealing Reddit comments or moving around
| paragraphs from old articles. Or linking to linked blogs
| that link elsewhere.
|
| It's all stamped with Google Ads, of course, and then
| Google ranks these pages high enough to rake in eyeballs
| and ad dollars.
|
| Also there's the fact that each year, the average webpage
| picks up two more video elements / ad players, one or two
| more ad overlays, a cookie banner, and half a dozen
| banner/interstitials. It's 3-5% content spread thinly
| over an ad engine.
|
| The Google web is about squeezing ads down your throat.
| andrepd wrote:
| Really makes you wonder: you play whack a mole and tackle
| the symptoms with initiatives like this search engine.
| But the root of that problem and many many others is the
| same: advertising. Why don't we try to tackle that?
| echelon wrote:
| Exactly.
|
| The only reason people make content they aren't
| passionate about is advertising.
| jamra wrote:
| This is just a guess, but I believe that they use machine
| learning and rank it by the clicks. I took some coursera
| courses and Andrew Ng sort of suggested that as their
| strategy.
|
| The problem is that clickbait and low effort articles
| could be good enough to get the click, but low effort
| enough to drag society into the gutter. As time passes,
| the system is gamed more and more where the least
| effort for the most clicks is optimized.
| amenod wrote:
| > is that the algorithms prioritize newer pages over
| older pages.
|
| They do? That would explain a lot - but ironically, I
| can't find a good source on this. Do you have one at
| hand?
| dvogel wrote:
| It is pretty obvious if you search for any old topic that
| is also covered incessantly by the news. "royal family"
| is a good example. There's no way those news stories
| published an hour ago are listed first due to a high
| PageRank score (which necessarily depends on time to
| accumulate inbound links).
| RattleyCooper wrote:
| It depends on the content. The flip side is looking up a
| programming-related question and getting results from
| 2012.
|
| I think they take different things into account based on
| the thing being searched.
| II2II wrote:
| Even your example would depend upon the context. There
| are many cases where a programming question in 2021 is
| identical to one from 2012, along with the answer. In
| those instances, would you rather a shallow answer from
| 2021 or an indepth answer from 2012? This is not meant to
| imply that older answers offer greater depth, yet a heavy
| bias towards recent material can produce that outcome in
| some circumstances.
| valvar wrote:
| If you're using tools/languages that change rapidly (like
| Kotlin, in my case), syntax from a few years ago will
| often be outdated.
| II2II wrote:
| Yes, yet there are programming questions that go beyond
| "how do I do X in language Y" or "how do I do X with
| library Y". The language and library specific questions
| are the ones where I would be less inclined to want
| additional depth anyhow, well, provided they aren't
| dependent upon some language or library specific
| implementation detail.
| rchaud wrote:
| Your Google search results show the date on articles, do
| they not? If people are more likely to click on
| "Celebrity Net Worth (2021)" than "Celebrity Net Worth
| (2012)", then the algo will update to favour those
| results, because people are clicking on them.
|
| The only definitive source on this would be the
| gatekeeper itself. But Google never says anything
| explicitly, because they don't want people gaming search
| rankings. Even though it happens anyway.
| WalterBright wrote:
| Amazon search clearly does _not_ prioritize exact title
| matches.
| hodgesrm wrote:
| > This got me thinking that maybe one of the other big
| reasons for this is that the algorithms prioritize newer
| pages over older pages.
|
| Actually that's not always the case. We publish a lot of
| blog content and it's really hard to publish new content
| that replaces old articles. We still see articles from
| 2017 coming up as more popular than newer, better
| treatments of the same subject. If somebody knows the SEO
| magic to get around this I'm all ears.
| Dah00n wrote:
| I'm not sure I agree with your example. It seems to me it
| is the exact same as a "Top ten drinks to drink on a rainy
| day" list. There's simply too many good albums and opinions
| differ, so a top ten would -just like the drinks- end up
| being a list of the most popular ones with maybe one the
| author picks to stir some controversy or discussion. In my
| opinion the world would be a smarter place if Google ranked
| all such sites low. Then we might at least get fluff like
| "Top ten prog rock albums if you love X, hate Y and listen
| to Z when no one is around" instead.
| dageshi wrote:
| Google won't rank them low because they actually do serve
| an important purpose. They're there for people who don't
| really know what they want specifically, they're looking
| for an overview. A top 10 gives a digestible overview on
| some topic, which helps the searcher narrow down what
| they really want.
|
| A "Top 10 albums of all time" post is actually better off
| going through 10 genres of popular music from the past 50
| years and picking the top album (plus mentioning some
| other top albums in the genre) for each one.
|
| That gives the user the overview they're probably looking
| for, whether those are the top 10 albums of all time or
| not. It's a case of what the user searched for vs what
| they actually really want.
| bythckr wrote:
| "The best minds of my generation are thinking about how to
| make people click ads"
| wwweston wrote:
| It's also possible that it's the other way around: a certain
| "common denominator" + algorithms that chase broad engagement
| = mediocre results.
|
| The real trick would be some kind of engine that can aim just
| above where the user's at.
| [deleted]
| rasz wrote:
| >followed by a listicle "8 Reasons Why Rome Fell"
|
| But aren't you curious about the 7th reason? It will surprise
| you!
| coldtea wrote:
| You won't believe how Claudius looks today!
| kahrl wrote:
| Doctors HATE him!!!
| [deleted]
| resynth1943 wrote:
| Yeah, Google tends to send a lot of junk back.
| purplefruit wrote:
| Wow I used "personality test" and actually got useful articles
| about personality theory. I'll actually use this!
| hdjjhhvvhga wrote:
| As long as few people use it, it will be great. Rest assured
| that the moment it becomes popular, the people who want to game
| it will appear.
| marginalia_nu wrote:
| Specialization is mostly a problem in monocultures.
|
| If you almost only plant wheat, you are going to end up with
| one hell of a pest problem.
|
| If you almost only have Windows XP, you are going to have one
| hell of a virus problem.
|
| If you almost only have SearchRank-style search engines (or
| just the one), you are going to have one hell of a content
| spam problem.
|
| Even though they have some pretty dodgy incentives, I don't
| think google suffers quality problems because they are evil,
| I think ultimately they suffer because they're so dominant.
| Whatever they do, the spammers adapt almost instantly.
|
| A diverse ecosystem on the other hand limits the viability of
| specialization by its very nature. If one actor is attacked,
| it shrinks and that reduces the opportunity for attacking it.
| Nextgrid wrote:
| I don't think the existing media-heavy websites are gaming
| Google to rank higher. It's that Google itself prefers media
| heavy content; they don't have to "game" anything.
|
| I also think a search engine like this would be quite hard to
| game. An ML-based classifier trained on thousands of text-
| heavy and media-heavy screenshots should be quite robust and
| I think would be very hard to evade, so the "game" will
| become more about how to _identify_ the crawler so you can serve
| it a high-ranking page while serving crap to the real users,
| and it seems fairly easy to defeat if the search engine
| does a second pass using residential proxies and standard
| browser user agents to detect this behavior (it could also
| threaten huge penalties like the entire domain being banned
| for a month to even deter attempts at this).
| fragmede wrote:
| With the advances in machine text generation that looks
| accurate but isn't _quite_ (aka GPT-3), this seems like it
| would be _easily_ gamed (given access to GPT-3). Even
| without GPT-3, if the content being prioritized is mere
| text, I'm sure that for a pile of money, I could generate
| something that looks like Wikipedia, in the sense that it's
| a giant pile of mostly text, but it would make zero sense
| to a human reader. (Building an SEO farm to boost ranking
| of not-wikipedia is left as an exercise for the reader.)
| jandrese wrote:
| This sort of optimization is why simple recipes are typically
| found at the end of a rambling pointless blog post now.
|
| Still, the best way to break SEO is to have actual
| competition in the search space. As long as SEO remains
| focused on Google there is an opportunity for these companies
| to thrive by evading SEO braindamage.
| SerLava wrote:
| That's not really for SEO, which favors readily accessible
| information.
|
| That's ads. When mobile users have to scroll past 10 ads,
| they'll click on some of them and make the blog money.
| ggggtez wrote:
| I've noticed this pattern start to pop up elsewhere. I've
| started to train my skimming skills, skipping a paragraph
| or two at a time to get past the fluff.
|
| Like an article about some current event will undoubtedly
| begin with "when I was traveling ten years ago...".
| Aeolun wrote:
| Searching for 'chocolate' on this search engine turned up a
| surprisingly large amount of chocolate based recipes.
| zerd wrote:
| It's also because that's a way of trying to copyright
| protect recipes, which are normally not copyright
| protected.
|
| > "Mere listings of ingredients as in recipes, formulas,
| compounds, or prescriptions are not subject to copyright
| protection. However, when a recipe or formula is
| accompanied by substantial literary expression in the form
| of an explanation or directions, or when there is a
| combination of recipes, as in a cookbook, there may be a
| basis for copyright protection."
| JohnFen wrote:
| But that copyright protection only extends to the
| literary expression. The recipe itself is still not
| covered by copyright, even if accompanied by an essay.
| YeGoblynQueenne wrote:
| >> This sort of optimization is why simple recipes are
| typically found at the end of a rambling pointless blog
| post now.
|
| I continue to be curious about this kind of complaint. If
| all you want is a recipe list, without any of the fluff,
| why would you click on a link to a blog, rather than on a
| link to a recipe aggregator?
|
| Foodie blogs exist specifically for the people who want a
| foodie discussion and not just an ingredients' list.
|
| Is it because blogs tend to have better recipes overall? In
| that case, isn't there a bit of entitlement involved in
| asking that the author self-sacrificingly provides only the
| information that you want, without taking care of their own
| needs and wants, also?
| Loughla wrote:
| It's the same thing that people always complain about.
| This thing is not in a format that I like, so it must be
| not what anyone likes.
|
| If you want JUST recipes, pay money instead of just
| randomly googling around. America's test kitchen has a
| billion, vetted, and really good recipes. That solves
| that problem.
| joegahona wrote:
| I think the complaint is that those blogs rank higher
| than nuts-and-bolts recipes now. It wasn't that way a few
| years ago. Yes, scrolling down the results to Food
| Network or Martha Stewart or whatever is possible, as is
| going directly to those sites and using their site
| search, but it's noticeable and annoying.
| YeGoblynQueenne wrote:
| Not my experience. For a very quick test, I searched DDG
| for "omelette recipe", "carbonara recipe" and "peking duck
| recipe" (just to spice it up a bit) and all my top
| results are aggregators. Even "avgolemeono recipe" (which
| I'd think is very specialised) is aggregators on top.
|
| To be honest, I don't follow recipes when I cook unless
| it's a dish I've never had before. At that point what I
| want is to understand the point of the dish. A list of
| ingredients and preparation instructions don't tell me
| what it's supposed to taste and smell like. The foodie
| blogs at least try to create a certain... feeling of
| place, I suppose, some kind of impression that guides you
| when you cook. I wouldn't say it always works but I
| appreciate the effort.
|
| My real complaint with recipe writers is that they know
| how to cook one or two dishes well and they crib the rest
| off each other so even with all the information they
| provide, you still can't reliably cook a good meal from a
| recipe unless you've had the dish before. But that's my
| personal opinion.
| jandrese wrote:
| Because when you search for a recipe you get the link to
| the blog, not the aggregator.
| WorldMaker wrote:
| That sort of recipe blog hasn't happened just for SEO. It's
| also a bit of a "two audiences" problem: if you are coming
| to that food blogger from a search you certainly would
| prefer the recipe first and then maybe any commentary on it
| below if the recipe looks good. If you are a regular reader
| of that food blogger you are probably invested in the
| stories up top and that parasocial connection and the
| recipes themselves are sometimes incidental to why you are
| a regular reader.
|
| You see some of that "two readers" divide sometimes even in
| classic cookbooks, where "celebrity" chefs of the day might
| spend much of a cookbook on a long rambling memoir.
| Admittedly such books were generally well indexed and had
| table of contents to jump right to the recipes or
| particular recipes, but the concept of "long personal
| ramble of what these recipes mean to me" is an old one in
| cookbooks too.
| giantrobot wrote:
| > If you are a regular reader of that food blogger
|
| I think this assumes facts not in evidence. It certainly
| seems like an overwhelming number of "blogs" are not
| actual blogs but SEO content farms. There's no regular
| readers of such things because there's no actual authors,
| just someone that took a job on Fiverr to spew out some
| SEO garbage. Old content gets reposted almost verbatim
| because new results rank better according to Google.
|
| The only reason these "blogs" exist is to show ads and
| hopefully get someone's e-mail (and implied consent) for
| a marke....newsletter.
| WorldMaker wrote:
| I know at least a few that I commonly see in top search
| results that I have friends that read them like
| personalized soap operas where most of the drama revolves
| around food and family and serving food to family.
|
| It's at least half the business models of Food Network
| shows: aspirational kitchens and the people that live in
| them and also sometimes here's their recipes. (The other
| half being competitions, obviously.) I've got friends
| that could deliver entire doctoral theses on the Bon
| Appetit Test Kitchen (and its many YouTube shows and
| blogs) and the huge soap operatic drama of 2020's events
| where the entire brand milkshake ducked itself; falling
| into people's hearts as "feel good" entertainment early
| in 2020/the pandemic and then exploding very dramatically
| with revelations and betrayals that Fall.
|
| Which isn't to say that there _aren't_ garbage SEO farms
| out there in the food blogging space _as well_, but a
| lot of the big ones people commonly complain about seeing
| in google's results do have regular fans/audiences. (ETA:
| And many of the smaller blogs _want_ to have regular
| fans/audiences. It's an active influencer/"content creator"
| space with relatively low barrier to entry that people
| love. Everyone's family loves food, it's a part of the
| human condition.)
| run-types wrote:
| I've basically never been taken to a recipe without a
| rambling preamble from Google. While food blogs may serve
| two audiences, a long introduction seems to be a
| requirement to appear in the top Google search results.
| WorldMaker wrote:
| Personally, I think that has a lot more to do with the
| fact that Google killed the Recipe Databases. There did
| used to be a few startups that tried to be Recipe
| Aggregators with advertising based business models, that
| would show recipes and then link to source blogs and/or
| cookbooks, and in the brief period where they existed
| Google scraped them _entirely_ and showed entire recipes
| on search results and ate their ad revenue out from under
| them.
| tomrod wrote:
| That is a really bad thing by Google. Their core business
| is not recipes.
| kwertyoowiyop wrote:
| Their core business is making money from other people's
| content, no matter what it is.
| WorldMaker wrote:
| Their core business is advertising and they have always
| been in a direct conflict-of-interest by competing with
| content sites for ad revenue buys.
| dspillett wrote:
| Such databases would get battered by demands to remove
| content these days, if not already back then. No one wants
| a database listing their stuff for ad revenue like that
| because many wouldn't follow the links to see _their_
| adverts or be subject to _their_ tracking.
|
| A couple of browser add-ons specifically geared around
| trimming recipe pages down have been taken down due to
| similar complaints.
| inanutshellus wrote:
| I see your point, but argue you've misidentified the two
| audiences.
|
| One audience matches your description and is the invested
| reader. They want _that_ blogger's storytelling. They
| might make the recipe, but they're a dedicated reader.
|
| The other audience is not the recipe-searcher, but
| instead Google. Food bloggers know that recipe-searchers
| are there to drop in, get an ingredient list, and move
| on. They won't even remember the blog's name. So the site
| isn't optimized for them. It's optimized for Google.
|
| "Slow the parasitic recipe-searcher down. They're
| leeches, here for a freebie. Well they'll pay me in
| Google Rank time blocks."
| xtracto wrote:
| That's why I use Saffron [1], it magically converts those
| sites into a page in my recipe book. I found it when the
| developer commented here in HN. Also, a lot of cooking
| websites have started to add a link with "jump to recipe"
| functionality allowing you to skip all the crap.
|
| [1] https://www.mysaffronapp.com/
| Funes- wrote:
| There's also https://based.cooking.
| eigengrau5150 wrote:
| Run by Luke Smith, an admitted neo-reactionary and
| possible white supremacist who writes like a 4chan
| reject.
| the_other wrote:
| If there were a wider variety of popular search engines, with
| different ranking criteria, would sites begin to move away
| from gaming the system? Surely it would be too hard to game
| more than one search engine at a time?
| Nasrudith wrote:
| It would be a matter of numbers anyway about which they
| optimize for. A/B testing is already in place and doesn't
| care about where it comes from, just which one does better.
| new_guy wrote:
| > the people who want to game it will appear.
|
| So just add human review to the mix, if a site is obviously
| trying to game the system (listicles, seo spam etc) just drop
| and ban them from the search index.
| hdjjhhvvhga wrote:
| Congratulations, you've just invented negative SEO.
| phendrenad2 wrote:
| There should be some perfect balance where this search engine
| is N% as popular as Google, where Google soaks up all of the
| gamifiers, but this search engine is still popular enough to
| derive revenue and do ML and other search-engine-useful
| stuff.
| eterevsky wrote:
| Imagine if you were looking for the movie.
| neltnerb wrote:
| Imagine including the search term "movie".
| yreg wrote:
| That doesn't do anything useful.
| lucideer wrote:
| I tend to prefer Wikipedia for movies. The exception is actor
| headshots if I'm trying to identify someone, which Wikipedia
| lacks for licensing reasons, but otherwise Wikipedia tends to
| be better than IMDB for most needs. Wikipedia has an IMDB
| link on every article anyway.
|
| Another need I guess might be reviews, for which RT or MC are
| better than IMDB: not sure if either of those two will fare
| better than IMDB in this search engine but again Wiki has
| links out (in addition to good reception summaries)
| mountainboy wrote:
| For me, imdb was much better when they had user
| comments/discussion.
|
| I never even posted on it myself, but browsing the
| discussions one could learn all sorts of trivia, inside
| info, speculation, etc about each movie.
|
| Since they (inexplicably) killed that feature, I rarely
| even visit anymore. You're right, for many purposes wikipedia
| is better, especially for TV series episode lists with
| summaries.
| ncphil wrote:
| IMDB management thought it was their brilliant editorial
| work that drew people to their site. Morons. It was the
| comments all along. Of course they also believed they
| could create gravity-free zones by sheer force of
| executive will (and maybe still do).
| _blu wrote:
| Especially for old and lesser known movies, the
| discussion board for the movie was a brilliant addition
| that could give the movie an extra dimension. Context is
| very important in order to understand, and ultimately
| enjoy something.
|
| I think they removed it in part because new movies, like
| star wars and superhero movies, had a lot of negative
| activity.
| sellyme wrote:
| I find IMDb to be more convenient than RT/MC/Wikipedia for
| finding release dates of movies - nearly every other
| website lists only the American release date, maybe one or
| two others if the movie was disproportionately popular in
| certain regions.
| shuntress wrote:
| _?q=imdb.com:fall of the roman empire_
| jazzyjackson wrote:
| !imdb
| MisterTea wrote:
| Then you'd use a different search engine. Why does everything
| have to be a Swiss Army knife?
| zozbot234 wrote:
| Or you could just search for 'rome movie'. Though for more
| complex disambiguation you would need to resort to, e.g.
| schema.org descriptions (which are supported by most search
| engines, and the foundation for most "smart" search result
| snippets).
| eterevsky wrote:
| That's a fair point. This engine would be useful if you
| need grep over internet (but without regexes), i.e. when you
| want to find the exact phrases. But that's a relatively
| narrow use case.
| psadri wrote:
| Interesting choice of search topic. Are you trying to make an
| additional point?
| hn_throwaway_99 wrote:
| I had the exact opposite experience. I searched the site for
| "java", got a Wikipedia link first (for the island, not the
| programming language), and the 2nd result was to a random JEP
| page, and all the rest of the results were random tidbits about
| Java (e.g. "XZ compression algorithm in Java"). Didn't get any
| high level results pointing to an overview of the language,
| getting started guides, etc.
| withinboredom wrote:
| You need to use some old school search techniques and search
| for "Java overview"
| _wldu wrote:
| I'm not sure that's a bad thing.
| rovr138 wrote:
| well, they're results to java related items...
|
| What kind of links were you expecting to find?
| SPBS wrote:
| Cool, it appears that the trend towards JS may be causing self-
| selection -- if a page has a high amount of JS, it is highly
| unlikely to contain anything of value.
| dv_dt wrote:
| If one could create a metric of ad to content ratio from the
| js used, I would guess that would be a nice differentiator
| too.
| Dah00n wrote:
| Huh. A weighted algorithm, somewhere between Google and the
| one linked, where you could subtract from sites by amount of
| JavaScript might be interesting.
| sjtindell wrote:
| True. Unfortunately many large corporate websites through
| which you pay bills, order tickets, etc. are becoming
| infested with JS widgets and bulky, slow interfaces. These
| are hard to avoid.
| artificial wrote:
| Conversely no software to install. Browser as a platform.
| Don't have to boot to Windows to pay your bills with
| activex for example
| foxfluff wrote:
| The mostly JS-less web was fine, fast, and reliable 20
| years ago and I never had ActiveX.
|
| I hear stories about Flash and ActiveX but I literally
| never needed these to shop or pay bills online. Payments
| also didn't require scripts from a dozen domains and four
| redirects..
| TeMPOraL wrote:
| The platform isn't the problem. The problem is with the
| amount of code that does something other than letting you
| "pay bills, order tickets, etc.".
| hinkley wrote:
| Browsers should be cherry picking the most compelling things
| that people accomplish with complex code and supporting them
| as a native feature. Maybe the Browser Wars aren't keeping up
| anymore.
| lugged wrote:
| Was that ever in doubt?
| zimpenfish wrote:
| Searched for my initials - got back a bunch of raw binary results
| (mp4, pdf, img, txz, etc.) which was disconcerting. Although it
| did find one reference to actual-me which is better than Google
| manages on the first 4 pages...
|
| https://imgur.com/a/n2xro2Y
| marginalia_nu wrote:
| Yeah there was unfortunately a problem with the content-type
| code recently; it categorized some binary data as HTML and
| tried to process it best-effort. So there's some binary soup
| in the index.
|
| The bug has since been fixed, but it won't come into effect for
| a few weeks.
| ape4 wrote:
| Fast and doesn't crash when on the front page of Hacker News!
| ricardo81 wrote:
| That crossed my mind too, considering vanilla webpages
| sometimes struggle with a top page HN thread, never mind a
| search engine backend.
| marginalia_nu wrote:
| Well,... yet. Load average is at 1.2, not that bad. But the
| services are getting a solid workout.
| marginalia_nu wrote:
| The real test is now: the index server is reconstructing its
| index. It does this every 6 hours if there are new pages.
| Takes half an hour or so usually.
|
| It's supposed to be able to handle searches at the same time,
| but jeepers, it's gonna have to chew through nearly 400 Gb of
| data while dealing with over 1 request per second.
| 0xbadcafebee wrote:
| Is your site/code on GitHub? I would be happy to give
| performance tips/tweaks. Also Fyi, https://marginalia.nu/
| gives a certificate error (I know that's not the search
| site)
| criddell wrote:
| Have you given any thought on what you will do if you get a DMCA
| take down request or a request from a person asking you to remove
| them from search results?
| AlexCoventry wrote:
| I don't really care about a website's design, as long as it gets
| out of the way of me reading it.
| mrkramer wrote:
| Awesome work! I had a similar idea in mind but I'm glad to see
| someone else was able to pull it off.
| sailorganymede wrote:
| Love this! Is there any way someone could help contribute to
| this?
| palijer wrote:
| This has been needed in my life for a while. I am growing really
| apathetic about the internet lately, but I realize that is
| because my entry point is always a google search.
|
| I miss finding blog posts and scholarly articles in long form. I
| hate the SEO sites with unreadable UI because the information in
| them is often a lot lower quality as well.
| raflemakt wrote:
| I tried two searches in Norwegian ("norsk ordbok" [norwegian
| dictionary] and "stortinget" [the parliament]), and they both
| returned many extreme or "alternative" websites. It was
| especially striking that the neo-nazi group Vigrid's website was
| the top hit for both searches. Maybe these sites just have less
| modern web design?
| marginalia_nu wrote:
| Yeah this is actually a bit of a concern of mine.
|
| As very much a friend of Voltaire's, I don't think it's my
| place to police people's opinions no matter how disagreeable,
| but I also don't want my search engine to become branded as the
| search engine of choice for nazis because it's decent at
| cataloguing extremist sites.
| claytn wrote:
| Searching for your own name will turn up some interesting
| results! I got some early 90s webpages that just contain
| obituaries or marriage records. I never knew cities maintained
| these records online!
| xtiansimon wrote:
| Does this also penalize pages with tons of ads and three
| paragraphs of text? Or anything from Medium?
| IlliOnato wrote:
| Pretty cool. I am not sure yet how useful, but cool it is.
|
| However, it seems that it currently does not support non-Latin
| alphabets. Which I understand in an early version. Still, its
| handling of such "exception cases" could be improved:
|
| when I search for a Russian word, say "Akvarium", I get <<Search
| "Akvarium" needs to be a word>>, which is rather rude...
| foxfluff wrote:
| "It also focuses on websites in English, Swedish and Latin and
| tries to identify and ignore the rest (best-effort)."
|
| https://news.ycombinator.com/item?id=28551183
| IlliOnato wrote:
| Fine; still "Not a supported language" would be much better
| response than "not a word".
| beepbooptheory wrote:
| need a duck duck go `bang!` for this
| jordache wrote:
| how about a search engine that bans all pinterest content.
|
| I hate pinterest with a passion. I may need to get a "his" laptop
| separate from my wife, since she needs that darn pinterest
| extension for pinning photos.
| AQXt wrote:
| I searched for "giraffe evolution" (without quotes) and received
| the following links on the first page:
|
| - _Evolutionist scientists say the theory is unscientific and
| worthless_
|
| - _Seven Mysteries of Evolution_
|
| - _OTHER EVIDENCE AGAINST EVOLUTION_
|
| - _Evolution Falsified_
|
| Not a single result about the evolution of giraffes...
| phendrenad2 wrote:
| Unfortunately, as you've discovered, giraffes are often used by
| crackpots to try to disprove evolution. Google seems to get
| around this by heavily boosting known authoritative sources
| like National Geographic and NIH. But, sadly, those are
| JS/image heavy sites.
| mattowen_uk wrote:
| Nice! Typing my name in, gets my own site back as 3 of the top 5
| results. I suddenly feel important ;)
| schmorptron wrote:
| I've also found that brave search gets much better results than
| google for some programming related topic, simply by not being
| targeted by blogspam SEO as much. It's refreshing to not have to
| click through 3 auto generated "articles" but to either a) get
| the documentation straight away or b) find actually human written
| blog entries.
| arethuza wrote:
| I did a quick check using the name of the Scottish village I am
| originally from (or as I should say "far am fae") and this
| produced a _much_ more interesting set of links for me than that
| produced by Google
| abhiminator wrote:
| Absolute textgasm.
|
| Wonder how 'text-only first' prioritization is being implemented,
| algorithmically speaking?
| ryankrage77 wrote:
| This post - https://news.ycombinator.com/item?id=28551183 -
| suggests it's a simple set of heuristics, looking for things
| like javascript, link/SEO spam, language, amount of text
| content, etc, filtering out unwanted results and only indexing
| wanted ones.
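|
| A rough sketch of what such a filter might look like, purely as an
| illustration. The feature names and every threshold here are made
| up, not taken from the search engine; only the language list
| (English, Swedish, Latin) comes from the linked post.
|
```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page features; the field names are illustrative."""
    text_chars: int   # length of the visible text
    script_tags: int  # number of <script> elements
    links: int        # number of outbound links
    language: str     # detected language code, e.g. "en"


# English, Swedish and Latin, per the linked post.
ALLOWED_LANGUAGES = {"en", "sv", "la"}


def should_index(page: PageStats, min_text: int = 500,
                 max_scripts: int = 5,
                 max_links_per_kchar: float = 20.0) -> bool:
    """Return True only if the page passes every (assumed) threshold."""
    if page.language not in ALLOWED_LANGUAGES:
        return False
    if page.text_chars < min_text:
        return False
    if page.script_tags > max_scripts:
        return False
    # Link density: many links per 1000 characters of text suggests link spam.
    links_per_kchar = page.links / (max(page.text_chars, 1) / 1000)
    return links_per_kchar <= max_links_per_kchar
```
|
| A real crawler would presumably rank pages rather than just accept
| or reject them, but the same features could feed a score.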
| abhiminator wrote:
| Thank you!
| michaelcampbell wrote:
| (Old man yells at cloud.)
| [deleted]
| yrds96 wrote:
| Damn, that's an interesting search engine. It's great for
| searching simple terms and finding a bunch of blog articles
| about the term.
| turtlebits wrote:
| Great for a text-focused site- however, the results are a bit
| confusing. Would help if there were more details on the criteria
| used for a site to be included in the index.
|
| Suggestion - Use system fonts (the site downloads almost 300k of
| fonts)
| lishzen wrote:
| Love it! Searched for Shigeru Miyamoto and this was the third
| result: https://www.glitterberri.com/developer-
| interviews/miyamoto-h...
| wizzwizz4 wrote:
| Add an OpenSearch file:
| https://developer.mozilla.org/en-US/docs/Web/OpenSearch
|
| Maybe something like:
|
|     <OpenSearchDescription
|         xmlns="http://a9.com/-/spec/opensearch/1.1/"
|         xmlns:moz="http://www.mozilla.org/2006/browser/search/">
|       <ShortName>Marginalia</ShortName>
|       <Description>Marginalia Search - a search engine that favors text-heavy sites and punishes modern web design</Description>
|       <InputEncoding>UTF-8</InputEncoding>
|       <Image width="16" height="16" type="image/x-icon">https://search.marginalia.nu/favicon.ico</Image>
|       <Url type="text/html" template="https://search.marginalia.nu/search?query={searchTerms}" />
|       <moz:SearchForm>https://search.marginalia.nu/</moz:SearchForm>
|     </OpenSearchDescription>
|
| and then add:
|
|     <link rel="search" type="application/opensearchdescription+xml" title="Marginalia" href="/opensearch.xml" />
|
| to your page's <head>.
| marginalia_nu wrote:
| It's been added a few moments ago :-)
|
| Thanks everyone who suggested this.
| tentacleuno wrote:
| I'm all for nostalgia but IMO the information should be the most
| important thing. Presentation is a close second though, and I can
| kind of get behind this project when I see what the web is like
| without an ad-blocker.
| marcosdumay wrote:
| I've got some results where the same site has more than 70% of the
| links. It was a very on topic and high quality site, but still,
| all the results shouldn't point to the same place.
|
| I think some grouping by site (and capping to only the few most
| relevant links there) would improve the engine.
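|
| A minimal sketch of that capping step, assuming the engine already
| has a ranked list of result URLs. The function name and the default
| cap of 2 are hypothetical.
|
```python
from collections import Counter
from urllib.parse import urlsplit


def cap_per_site(ranked_urls, max_per_site=2):
    """Keep the ranking order but allow at most max_per_site results per host."""
    seen = Counter()
    kept = []
    for url in ranked_urls:
        host = urlsplit(url).netloc.lower()
        if seen[host] < max_per_site:
            seen[host] += 1
            kept.append(url)
    return kept
```
|
| Because the list is walked in rank order, the results that survive
| for each site are its most relevant ones.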
| aabajian wrote:
| Gotta say, sometimes the results really are nice. I searched for
| "Land Cruiser 70." The first result is a simple, short blog post
| about a couple who traveled across Europe and Asia in their Troop
| Carrier (http://www.destoop.com/trip/1%20PREPARATION/2%20Vehicle%
| 20sp...).
|
| The first results on Google are an Australian site for buying a LC70
| (news-flash, I can't buy one in the USA). There is also a
| MotorTrend article about the LC70...also irrelevant since it's
| only sold in Australia.
| dzhiurgis wrote:
| Search for playwright waitForSelector and you land on a pretty
| useless page. I'm all in for text websites, but something like
| playwright.dev documentation is top notch - fuzzy search being
| the key thing.
| marginalia_nu wrote:
| Yeah I wasn't really planning for this to blow up like it did
| today. It's currently sitting at about 35% of the index size I
| usually aim for, so besides the stuff I can't index because
| it's behind CDNs, there's a lot of pages it just hasn't gotten
| to yet. playwright.dev is pretty low on the priority list
| because it has a metric crap-ton of javascript on its front
| page. The crawler has visited it, looked at it, and put it very
| far down the priority queue.
| soheil wrote:
| Even though some sites have a metric crap-ton of js they
| sometimes render very minimally for certain screen sizes or
| mobile devices without any of the js crap. Does your crawler
| pay attention to any of that?
| marginalia_nu wrote:
| It doesn't look at what the javascript does, just how much
| there is.
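Measuring "how much javascript there is" without executing any of it could look roughly like the sketch below. The thread does not say what the crawler actually counts, so the two metrics here (number of script tags and characters of inline script) and the names `ScriptCounter` and `script_weight` are assumptions.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Counts <script> tags and the amount of inline script text."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.inline_script_chars = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Only text between <script> and </script> counts as script.
        if self._in_script:
            self.inline_script_chars += len(data)


def script_weight(html):
    """Return (number of script tags, characters of inline script)."""
    counter = ScriptCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags, counter.inline_script_chars
```

A crawler could compare these numbers against thresholds of its own choosing to decide how far down the queue a page goes.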
| rukuu001 wrote:
| This is great. The results for my search were like a suggested
| reading list.
| arnaudsm wrote:
| I wish we could configure Google's algorithm to our needs, and
| blacklist websites.
| MarioMan wrote:
| It could get tedious depending on how many sites you want to
| block, but you can add "-site:google.com" to exclude
| google.com, for instance.
| arnaudsm wrote:
| I mean a blacklist system like Twitter's, where you block a
| website forever. Pinterest would be the first to go.
| marto1 wrote:
| Super! How far along would you say you are in indexing the
| blogosphere? I tried the engine a few times, but I mostly get academic
| papers and I know most (good) blogs are in fact text-heavy.
| Method5440 wrote:
| A big fan of your work! Just wanted to let you know that iOS
| devices provide quotes as “ rather than " - you may need to
| support the character “ or at least let people know that iOS is
| not supported etc... right now I get a generic character error.
| phreack wrote:
| One nitpick that kind of bothered me - on a large desktop
| monitor, the results page was like 70% whitespace margins with
| the results squished in the middle like a portrait cellphone.
| Hopefully it's easy to fix, I like to research at home and this
| website could help a lot!
| seoulmetro wrote:
| This would be awesome if the search actually worked. Typed in
| 'runescape' and expected a few websites left over from the early
| 2000s. But I got nothing, just a lot of hits to other keywords.
| marginalia_nu wrote:
| Huh. That's an interesting case you've found.
|
| I think part of the problem is gold sellers were displacing all
| the good results. I blocked a few of them and got a few more
| relevant results, but it's not great.
|
| But still, the search engine only finds two great hits. I
| wonder why not. Maybe there just aren't that many runescape
| pages around still? Or it may just be that it hasn't found
| anything better yet. The index is pretty shallow right now,
| only 20M URLs, I aim for more than double.
|
| I honestly wasn't planning on this blowing up on HN at this
| stage.
| yakubin wrote:
| Wikipedia links point to <https://encyclopedia.marginalia.nu/>
| instead, which to my eyes is less readable. The justified text,
| done with CSS, instead of the LaTeX algorithm, looks wild. The
| font used for quotations is even worse (very thin).
|
| Wikipedia is perfectly usable without JavaScript and it's one of
| the nicest sites out there typography-wise, so I'd reconsider
| this redirection.
| marginalia_nu wrote:
| I guess it's a matter of taste. I can barely read anything on
| regular wikipedia because the inline links disrupt my flow.
| wolpoli wrote:
| I wish niche search engines had an option to group results by
| domain names. There are a few major sites that dominate Google
| search results with low effort content. As long as Google stands
| as the largest search engine, it's unlikely that these major
| sites will want to rearchitect themselves into different domain
| names.
| gjm11 wrote:
| I tried a few searches.
|
| <<javascript pipe syntax>>: none of the search results appeared
| to have anything to do with Javascript pipe syntax. (Which
| doesn't exist yet, but it's under discussion.) Google gives a
| bunch of highly-relevant results.
|
| <<hans reichenbach relativity>>: first result is a list of books
| about relativity, one of which is Reichenbach's "Philosophy of
| space and time"; good, but there's no real _information_ there.
| Second is about Reichenbach but nothing to do with relativity or
| even, really, philosophy of science. Third is about philosophy of
| science and mentions some of Reichenbach's work but not related
| to relativity. Fourth mentions Reichenbach's "Philosophy of space
| and time" as part of a list of books relevant to a seminar on
| "time and eternity". None of this is _bad_, but it's not great
| either. Google gives a couple of online philosophy encyclopaedia
| entries, then a journal article on "Hans Reichenbach's relativity
| of geometry", then the Wikipedia article on Reichenbach ... much
| more informative.
|
| <<luna lovegood actress>>: I thought this would be an easy one.
| It was easy for Google, which gave me her name in large friendly
| letters at the top, then her IMDB entry, and a bunch of other
| relevant things. Literally nothing in the Marginalia results was
| relevant to the query.
|
| I guess maybe popular culture is just too monetizable, so no one
| is going to write about it on the sites that Marginalia crawls?
| Let's try some slightly less popular culture.
|
| <<wilde "a handbag">>: First result is kinda-relevant but weird:
| it's about a musical adaptation of _The Importance of Being
| Earnest_. It doesn't mention that famous line from the play, but
| one of the numbers in the musical has the words "a handbag" in
| the title. Second result is a review of a CD of musicals,
| including the same work. Third is a bunch of short reviews of
| theatrical items from the Buxton Festival Fringe, one of which is
| a three-man adaptation of TIOBE. Next four are 100% irrelevant.
| Next is a list of names of plays. Last one is actually relevant;
| it's an article about "Lady Bracknell through the decades".
| Google puts that one first (after, sigh, a bunch of YouTube
| videos which look as if they might actually be relevant).
|
| I really like the _idea_ of this, and many of the things it turns
| up look like they might be interesting, but it isn't doing very
| well at producing results that are actually relevant to the thing
| being searched for.
| cowvin wrote:
| i don't know, people here like to complain about google, but
| google still works pretty well for me.
| exikyut wrote:
| TIL about https://github.com/tc39/proposal-pipeline-operator,
| which I am immediately looking forward to playing with once it
| gains traction Some Time From Now(tm)
|
| (I have no earnest reason to transpile)
| seph-reed wrote:
| Yeah, this seems pretty nice. I don't think the "deep
| nesting" issue is quite so realistic... I very rarely have a
| logic tree that's easier to identify by its leaves than its
| root. And I'd really hate to have code where you have to
| scroll to the end of a bunch of pipes to figure out what
| they're adding up to
|
| But I have plenty of single use "temp variables" and cutting
| those out could be cool.
| silent_cal wrote:
| To be fair, those searches are pretty weird.
| JxLS-cpgbe0 wrote:
| We don't _search_ for things because they're easy to find
| colinmhayes wrote:
| I mean most of my searches are probably pretty easy to
| find, I just don't want to go to the website I'm thinking
| of and click through 5 pages to get there.
| HaloZero wrote:
| The pop culture one is fairly common. Me and my wife both
| search "who the fuck is that" in that TV show movie all the
| time. Or who is the author of X book?
| cormacrelf wrote:
| It's trying to surface long articles and you're asking it
| for a one word answer. What did you expect? A long article
| consisting of "Emma Stone played Cruella" repeated 800
| times?
| gibspaulding wrote:
| I think perhaps the usefulness here is less finding what
| you're looking for, but rather finding something
| interesting.
| dwaltrip wrote:
| They seem reasonable to me.
| exporectomy wrote:
| Wow. I tested it on recipes which Google has destroyed and this
| was the first result, a simple clear recipe:
|
| http://demont.myds.me/leerecipes/mainmeals/mainmeals1/chicke...
|
| compared to Google's endless drivel of "This chicken stir fry
| recipe will become a staple in your home. It's so quick to make
| and you can use whatever vegetables you have on hand. It tastes
| wonderful regardless of how you alter the ingredients. ... " JUST
| SHUT UP AND GIVE ME THE RECIPE!
|
| https://natashaskitchen.com/chicken-stir-fry-recipe/
| srcreigh wrote:
| Fwiw this website, Natasha's kitchen, seems to be one of the
| more performant ad filled recipe websites.
|
| Make use of the "Jump to recipe" button to get to the recipe
| faster.
| 1f60c wrote:
| This isn't entirely Google's fault. Recipes on their own aren't
| copyrighted in the US, and adding this fluff text is a way
| around that.
| AdamN wrote:
| Yes it is. They're giving a lower quality result to the user
| (their customer ... but not for long if competitors can get
| just a little bit better)
| isubkhankulov wrote:
| A small nit: I think Google's customers are actually the
| companies paying for ads. It's an important distinction and
| probably explains why their search results' quality has
| gone down
| mdoms wrote:
| It is entirely Google's fault. Google knows people don't want
| this dreck (everyone knows it) but still serves it up.
| b-x wrote:
| Too bad it rejects non-Latin words, as if the definition of
| "text" is a sequence of alphabetical letters originated from
| Latin.
|
| I thought that we've reached the time to embrace all cultures in
| the world, but this retrogressive engine proves that most _modern
| tech_ designers are myopic about other civilizations in the
| globe.
| fghfghfghfghfgh wrote:
| It's one guy. Making a useful tool. It even has an altruistic
| purpose.
|
| Shame on you for twisting a well intended effort into a
| negative statement that suits your narrow identity political
| world view.
| b-x wrote:
| Less insulting error messages would be more welcome than
| casting out others without any consideration.
|
| You may distribute "shame" however you want, but this only
| helps enforcing the damaging insults and amplifying them.
| hombre_fatal wrote:
| No, it just proves that a one-man hobby project with finite
| resources found it reasonable to restrict the scope.
|
| Maybe when they find out they're an immortal billionaire they
| can build all the additional things you've entitled yourself to
| expect from the freely shared work of others.
| marginalia_nu wrote:
| Understand that this is something I built for myself, by
| myself, so it focuses on languages I understand. It's hosted on a
| single consumer grade computer in my living room. I built it
| out of pocket and anyone is free to use it. Does this make me a
| villain?
|
| If I can do this, what's preventing some guy in Japan or India
| or Peru from doing the same, of course focusing on their
| languages?
| b-x wrote:
| Maybe a better suited choice for errors than an insulting
| message: when I provided a query in my native language it
| regurgitated the error "needs to be a word" instead of more
| acceptable "not a supported language".
|
| When you claim that a word in some other culture is not "a
| word", just because it's not recognized by your machine,
| that's demeaning to say the least.
| marginalia_nu wrote:
| Again it's a one man hobby project, I don't have a team of
| people to go through every formulation and every error
| message to ensure nobody can read them in a way that
| offends them. It's just me, writing code on an unfinished
| project that HN discovered.
|
| In this case, the code doesn't match the word regexp, like
| it may be a @TwitterHandle or a "comp.lang.c" with periods
| in it, or an unsupported Unicode range. It doesn't know why
| it is not matching, just that it doesn't.
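A matcher of the kind described, one that can say a term did not match but not why, might look like the sketch below. The actual pattern is not shown anywhere in the thread; the ASCII-only `WORD_RE` here is purely an assumption chosen so that the three rejected examples (a handle with "@", a name with periods, a non-Latin script) all fail.

```python
import re

# Assumed pattern: runs of ASCII letters or digits, optionally joined by
# single apostrophes or hyphens. The real engine's regexp is not public here.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def is_supported_term(term):
    """True if the whole term matches the word pattern, False otherwise.

    The check has no notion of *why* a term fails: punctuation and an
    unsupported Unicode range look exactly the same to it.
    """
    return WORD_RE.fullmatch(term) is not None
```

With a single yes/no regexp like this, an error message cannot distinguish "@TwitterHandle" from a word in an unsupported script unless further checks are added.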
| b-x wrote:
| I must congratulate you on this achievement. That's
| certainly a useful take on search.
|
| Nonetheless, even when coding, one should also consider
| thoroughly the UX and how it would be addressing the
| others.
|
| Saying "unsupported word" is much more sympathetic than
| "needs to be a word" (where you define what a "word" is,
| and the general user is unaware of such definition).
| marginalia_nu wrote:
| Fair point, I refined the phrasing a bit.
|
| > The term "" contains characters that are not currently
| supported
| soheil wrote:
| Stop feeding the trolls, great job on this project and
| keep it up, hope at least most of HN is more empathetic.
| leavenotracks wrote:
| Really impressed with the results I'm seeing so far. In all
| searches I have done so far, the results are truly lightweight,
| and haven't had to click through any modals, subscription pop-ups
| or any other junk thus far! Will be using more in the days to
| come.
| ncfausti wrote:
| This is incredible. I just got goosebumps as I stumbled upon
| https://solitaryroad.com after searching for "linear algebra
| homomorphism". It reminds me of the magical feelings of the early
| Internet. Keep up the great work!
| pajko wrote:
| Does it filter out ad-heavy copy-paste/autogenerated fake sites?
| Tired of seeing those on the first few pages of Google. Bing gets
| more and more usable, but far from perfect.
| marginalia_nu wrote:
| It tries.
| AltruisticGapHN wrote:
| I like the idea. However, results take too much space vertically;
| it's slow and cumbersome to scan through them.
|
| I think it would benefit from using a responsive layout, allow
| the text to expand to a wide 1000+ px, make the font smaller, so the
| excerpt can fit one or two lines below the links.
|
| Google has problems but their search results layout is easy to
| scan.
|
| Otherwise I genuinely wish I would use it, because the Google
| search's "self referential reality bubble" is really annoying.
| titzer wrote:
| I think I want a BBS. Text mode, fixed width font, keyboard-
| driven menus, no (or very little) bitmapped graphics. I've been
| thinking about the UIs for a lot of sites that I use to "do
| things" on the web. E.g. search for flights. Do I need _any_ of
| that "beautiful" web design with pretty forms and fonts,
| bevelled edges, drop shadows, drop-down menus, hovers? Hell, do I
| even need a map? Heck no, I need three text entry fields and
| output a bulleted list, maybe table of results. Just give me the
| raw data and do as little presentation as possible, thanks.
|
| I really think I want an internet console, not an animated
| magazine.
| javajosh wrote:
| This is really good; I'll actually use it!
| macksd wrote:
| This is a really cool idea. I tried a few technical queries I did
| on DDG today and didn't get amazing results - hence the warning
| in the About page about this engine giving you things you didn't
| know you were looking for, rather than specific facts. But the
| examples others have posted sound promising and refreshing. I
| would love to read about the algorithms behind this and how
| modern web design gets detected in order to punish it...
| Lapsa wrote:
| I like the design.
| BlackLotus89 wrote:
| Are there any technical infos about the search engine? Found some
| information here
| https://memex.marginalia.nu/projects/edge/about.gmi
|
| Author said they threw it together on consumer hardware. How big
| is the index? (TB used or entries) how is it realised?
|
| I'm pretty much interested in this since I myself am crawling
| some pages for my own "search index".
|
| Oh and thx for making and posting. Added it as a keyword to
| firefox
|
| Edit: Just realized that my question is a bit shallow. What I'm
| particularly interested in is the storage before the indexing. I'm
| trying to store the raw html so that I can reindex everything
| with better algorithms, but I'm hitting many limits. It takes a
| few minutes getting the size of a site-directory (every site has
| its own dir) and I'm at a point where I can't reasonably manage
| the scrape-versioning over git and I cycled through a few
| filesystems only to find that the metadata management kind of
| sucks for most of them. It's rather interesting how we store such
| files and I'm thinking about storing a few sites in a simple
| sqlite format for easy access and search. I'm thinking about a
| few low overhead solutions like Facebook's Haystack project
| (implemented open source in seaweedfs) or something similar...
| Hopefully this gives some context to the question of storage and
| sites that are indexed
| [deleted]
| marginalia_nu wrote:
| The index is tiny, not even a terabyte. Right now it's a few
| hundred gigabytes for ~20 million URLs. But it's stored in an
| extremely dense binary format.
|
| Honestly you may just want to roll your own solution for
| storing a ton of files. If you don't need a general-purpose
| filesystem, but an append-only archive with extra metadata,
| then you can cut a lot of corners. Like if you have a file
| system that is fixed-size and append-only, you can build it in
| a way no off-the-shelf stuff can.
|
| This line of thinking is a large part of why my index is so
| small and fast. I have a lot of special built data-structures
| that are built for their exact use case. Like a fixed size
| append-only hash map that uses mapped memory and can in theory
| be larger than the system memory. Very good for a search
| engine, absolutely useless almost everywhere else.
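To make the "fixed size append-only hash map that uses mapped memory" idea concrete, here is a minimal sketch. It is not the author's data structure: the slot layout (two unsigned 64-bit integers), linear probing, the reserved key 0, and the class name `AppendOnlyMap` are all assumptions made for illustration.

```python
import mmap
import struct

SLOT = struct.Struct("<QQ")  # one slot = (key, value), both unsigned 64-bit


class AppendOnlyMap:
    """Fixed-capacity, insert-only map from nonzero u64 keys to u64 values."""

    def __init__(self, slots):
        self.slots = slots
        # Anonymous mapping for the demo. A file-backed mmap would let the OS
        # page the table in and out, so it could exceed physical memory.
        self.buf = mmap.mmap(-1, slots * SLOT.size)

    def _find(self, key):
        # Linear probing from the key's home slot; key 0 marks an empty slot.
        start = key % self.slots
        for i in range(self.slots):
            offset = ((start + i) % self.slots) * SLOT.size
            k, v = SLOT.unpack_from(self.buf, offset)
            if k == key or k == 0:
                return offset, k, v
        return None  # every slot is taken by other keys

    def put(self, key, value):
        if key == 0:
            raise ValueError("key 0 is reserved for empty slots")
        found = self._find(key)
        if found is None:
            raise MemoryError("table is full")
        offset, k, _ = found
        if k == key:
            raise KeyError("append-only: key already present")
        SLOT.pack_into(self.buf, offset, key, value)

    def get(self, key, default=None):
        found = None if key == 0 else self._find(key)
        if found is None or found[1] != key:
            return default
        return found[2]
```

Dropping deletes, updates, and resizing is what keeps this small: there are no tombstones and no rehashing, which is exactly the corner-cutting the comment describes.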
| eitland wrote:
| Tested with the first person to settle on Iceland:
| https://search.marginalia.nu/search?query=Ingolf+Arnarson
|
| and it worked surprisingly well.
|
| Does anyone else have good examples?
| abdullahkhalids wrote:
| Can we submit text-heavy sites for possible inclusion? Assuming
| they pass your filters.
| sealthedeal wrote:
| lol this is great, reminds me of the old school search engines we
| would use in school back in the day before Google haha.
| pomian wrote:
| Congratulations. Truly impressive search results. I tried two
| one-word searches. The results were interesting, useful, and
| would have been impossible (well, really really hard) to find, on
| standard search engines. Plus, no garbage, ads, recommendations,
| etc etc. As another commenter suggested, it is what World Wide
| Web search results were like, twenty years ago!
| pomian wrote:
| PS. I added Marginalia as a search option (even the default for
| now) in Firefox Nightly (on Android). In case others want to,
| under settings for search, you can add other, then name, and
| then: https://search.marginalia.nu/search?query=%s
| freddref wrote:
| After a good amount of searching it doesn't seem possible to
| add Marginalia as default search in firefox (84.0b8) on
| Debian.
|
| I did not expect this to not be available.
| mmphosis wrote:
| I am running Linux Mint, and I did not expect adding a
| custom search engine to be missing from Firefox. There are
| plug-ins, but I don't like adding plugins.
|
| https://mmphosis.netlify.app/search.marginalia.nu/
| lumost wrote:
| This is a fascinating tool, I estimated that the corpus of the
| factual web was between 1 and 10 TB when I last played around
| with BigQuery using domain names which had low amounts of click
| bait. Seeing these search results I suspect my estimate was off
| by a couple orders of magnitude.
|
| Although a search for "Fractional Reserve Banking" shows that
| some further ranking improvements can be made to exclude
| unrelated results, and potentially penalize old conspiracy sites.
|
| https://search.marginalia.nu/search?query=fractional+reserve...
| rchaud wrote:
| Is it fair to assume that text-heavy sites that are inactive (but
| still online) don't have SSL?
|
| If so, would you ever tweak the parameters to surface sites that
| aren't served with "HTTPS"?
| asjdflakjsdf wrote:
| You should monetise this with amazon affiliate links that are
| relevant to each search. And then use that money to keep this
| project going. Google is fantastic, but it has become something
| different from what it was, the company and the product. It is so
| refreshing to see a modern tool that encourages exploration of
| the actual world wide web.
| Funes- wrote:
| That would be an absolutely awful decision.
| marginalia_nu wrote:
| I might add a donate button or something if people want to help
| support the project, hardware isn't cheap and all. But I have a
| job and decent income. I think if this search engine became the
| way I earned money, it would influence the project in a bad
| way, and corrupt its purpose, which is to help people explore
| the less-commercial internet.
| eigenhombre wrote:
| +1 for a donate button; much preferred over affiliate links
| or ads of any kind. Thank you for making this beautiful
| little(?) product!
| kews wrote:
| Appreciated. The more things fill up with monetizing shit,
| the more I stay away. There's something beautiful in having
| higher purposes than grubbing for cash.
| RistrettoMike wrote:
| I'd donate to continued expansion/development of something
| like this. Where is somewhere good to follow you for any
| thoughts/updates?
| marginalia_nu wrote:
| I have something of a blog here, with an Atom feed.
|
| https://memex.marginalia.nu/log/
|
| It's not very well optimized for mobile, really it's more
| of a bridge for my geminispace content.
| marginalia_nu wrote:
| I added a Patreon, and huh, a few people are actually
| donating. I don't really know what to say. Thanks!
| streamofdigits wrote:
| Let a thousand search engines bloom.
|
| btw, interesting how many http (as opposed to https) sites show
| up...
| pjs_ wrote:
| This kicks ass!!
| 0xbadcafebee wrote:
| I love it. Even though it didn't give me the results I was
| looking for. I searched "new york fishing license", and it didn't
| give me any links to the actual new york fishing license
| websites. But it did give me a ton of really cute little websites
| related to lakes and fishing in New York. This one has _amazing_
| information about fishing all over Western New York:
| http://www.huntfishnyoutdoors.com/fishing.php
| spookthesunset wrote:
| This is really cool! So retro!
|
| Here is the second result when you search for "cat food". It
| takes you to some old dude's entire family tree with full history
| and biographies... it even uses sub domains and everything!
| Crazy!
|
| http://www.torrens.org/
| kews wrote:
| There's probably a more suitable term than "modern" that we
| should generally be using, since "modern" consistently has a
| positive connotation.
| marginalia_nu wrote:
| Dunno, I prefer to use neutral or positive terminology even
| when I talk about things I don't like. I think it very easily
| comes off as juvenile ranting when you start throwing around
| terms with strong negative connotations.
| mumblemumble wrote:
| I like it.
|
| Coincidentally, the other day I was daydreaming about a search
| engine that favors sites that are updated less frequently. The
| thought being, the kinds of labors of love that characterized the
| 1990s Web that I still sometimes miss are still out there, it's
| just harder to find them amidst the flood of SEO dreck. So
| perhaps they could be made discoverable again with the help of a
| contrarian search engine that specifically looks for the kinds of
| things that Google and Bing _don't_ like to see.
| gibspaulding wrote:
| Million Short [1] offers an option to omit results from popular
| domains. It's a different approach from what you describe, but
| I think the goal is similar.
|
| [1] https://millionshort.com/
| itzworm wrote:
| I had this problem recently trying to fix an Atari. There's a
| guy out there who has tons of guides on doing video out mods
| but the newer guides reference the older ones. However googling the OG
| guide didn't find it so I manually scoured his old web page.
| zachguo wrote:
| Try this https://wiby.me/
|
| Clicking 'Surprise me' gave me an interesting article from 1994
| http://milk.com/wall-o-shame/bucket.html
| appel wrote:
| That was a great read, thanks for sharing.
| capableweb wrote:
| Similarly, I wish there was a recommendation engine (for web,
| music, movies, whatever) that can show you what is the furthest
| away from your existing tastes. I've learned to re-create my
| Spotify account once every 6 months or so, as their
| recommendation engine becomes a boring machine after using it
| daily for some months.
|
| I'd love to discover new content that is different from what I
| read/watch/listen to now, but it's really hard to know about
| genres you don't know about.
| ebiester wrote:
| It's hard, though. I simultaneously want something far from
| my tastes, but I don't want to see Plandemic-style Ivermectin
| material, or Focus On The Family-style material. I want
| things that will push me out of my comfort zone sometimes,
| but it turns out I really don't want the thing furthest from
| my tastes; I want things marginally adjacent. I want them
| close enough to feel familiarity, but far enough that it
| challenges my worldview.
|
| I don't think a recommendation engine can do that.
| gverrilla wrote:
| Doing that takes real work and curiosity. I'm afraid an
| algorithm will never be able to do it, particularly if you're
| into niche stuff. For instance I enjoy a lot a Japanese band
| called The Boredoms - but few people like it, and there's
| only 2 of their albums available in spotify.
| potatoman22 wrote:
| I like the idea. Tangentially, I wonder how one would find the
| right 'penalty' for more updated sites?
| timvisee wrote:
| Cool!
|
| There do seem to be some text encoding issues though. For
| example: https://search.marginalia.nu/search?query=tim+visee
| marginalia_nu wrote:
| Yeah I think the charset detection needs work.
|
| It understands the "Content-type: text/html;charset=utf-8"
| -header, and <meta charset="UTF-8">
|
| but not
|
| <meta http-equiv="content-type" content="text/html;
| charset=utf-8">
|
| It turns out HTML has a lot of corner cases. I'm constantly
| marveling at how web browsers hold together as well as they do.
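The three declarations listed above can be handled by one small sniffing function. This is a simplified sketch, not the engine's code and far short of the sniffing algorithm browsers follow; the name `sniff_charset`, the 4096-byte scan window, and the fallback default are assumptions.

```python
import codecs
import re

# charset=... inside a Content-Type header value.
HEADER_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.I)
# charset=... anywhere inside a <meta ...> tag. This one pattern covers both
# <meta charset="UTF-8"> and the http-equiv form, where charset sits inside
# the content attribute (possibly after a line break).
META_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?\s*([\w.:-]+)", re.I)


def sniff_charset(content_type, body, default="utf-8"):
    """Pick a charset from the HTTP header, then from <meta>, else default."""
    candidates = []
    header_match = HEADER_RE.search(content_type or "")
    if header_match:
        candidates.append(header_match.group(1))
    meta_match = META_RE.search(body[:4096])  # only scan the start of the page
    if meta_match:
        candidates.append(meta_match.group(1).decode("ascii"))
    for name in candidates:
        try:
            return codecs.lookup(name).name  # normalized codec name
        except LookupError:
            continue  # unknown label, try the next candidate
    return default
```

Checking that the declared label is a codec Python knows about guards against pages that announce a charset that does not exist, which is one of the corner cases older sites tend to produce.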
| timvisee wrote:
| Thanks for your response! Hope you can implement this as well
| without too much trouble.
|
| I wonder if you could just assume UTF-8 to be the default
| these days. I imagine that to fix many other cases as well.
| marginalia_nu wrote:
| Haha! I did actually assume UTF-8 at first, but with a
| search engine that has a lot of older websites, I sadly got a
| lot of encoding errors doing that, too.
| soheil wrote:
| Maybe just like js-heavy sites also punish non-
| conforming-encoding sites.
| enduku wrote:
| Fantastic project! Found very interesting links to a lot of
| compiler related keywords. A similar service, yet different in
| their approach to cut through the e-commerce and seo optimized
| websites I found useful is Million Short [0]
|
| [0] millionshort.com
| oytis wrote:
| Designed for serendipity indeed. Tried a few searches, results
| are quite fun, but none of them relevant.
| pjspycha wrote:
| This is really refreshing work, and we can all benefit from other
| search engines focused on improving the field. I tried a bunch of
| searches and some of them were quite wonderful, others were a
| little dry on results. But overall I enjoyed going through it.
| Here are some critiques if you don't mind:
|
| I did search for "Daria Bilodid" and the results were a bit
| troublesome. First the Wikipedia result did not work:
| https://en.wikipedia.org/wiki/Daria_Bilodid vs
| https://encyclopedia.marginalia.nu/wiki/Daria_Bilodid
|
| Secondly the results matched a few judoinside.com results which
| is ok, including sites to her competitors, but seemed to miss the
| judoinside website for her:
| https://www.judoinside.com/judoka/92660/Daria_Bilodid.
|
| The design is hard on my eyes, I have an average size screen and
| it's using less than half of the width. The line-height is
| enormous and seems to break up flow making it uncomfortable for me
| to read. The spacing around each result is the same as between
| titles and paragraph items, which again was unpleasant to read.
| ASalazarMX wrote:
| > Secondly the results matched a few judoinside.com results
| which is ok, including sites to her competitors, but seemed to
| miss the judoinside website for her:
| https://www.judoinside.com/judoka/92660/Daria_Bilodid.
|
| The title says this search engine punishes modern websites
| (images, videos, MB of JS, I suppose), and this site looks
| scarce on text and heavy on images, maybe that's confusing the
| ranking.
|
| I certainly find the results very refreshing, but you'll have
| to complement with other search engines if they're not enough.
| In fact, I think the days when we could use a single search
| engine have already passed.
| dukeofdoom wrote:
| "corporate speak" bs detector and filter on google search engine
| would be nice.
| kag0 wrote:
| An interesting concept and awesome work!
|
| I searched for high pressure air (HPA) regulator trying to find a
| description of how one works. I didn't find that, but did find
| some interesting links on how they're used in scuba, and one
| guy's homemade gas laser.
| throwawaysea wrote:
| Is it possible to also make a site that favors a diverse set of
| information sources? For instance a lot of searches turn up
| results from Pinterest or Wikipedia or Amazon or whatever else. I
| wonder if there's room for a search engine that is all about
| favoring a greater diversity of smaller sources, for those who
| are less interested in staying within walled gardens.
| michaelgrafl wrote:
| I just looked up my last name and found a World class heavyweight
| weightlifter named Josef Grafl born in 1872 who has an awesome
| portrait of him on Wikipedia. Never before have I read about that
| man.
|
| I love this.
| dumbfounder wrote:
| Based on a few searches it seems to favor sites with very long
| passages of text. Search for a name and you get pages with
| massive lists of names. It quite simply isn't very good at
| everyday searches. But it does bring up the point, shouldn't I be
| able to tell my search engine I want results like this? It should
| be a feature of google I can turn on and off. It should be one of
| many ways to impact relevance.
| runnerup wrote:
| Seems like this is still very very hard! I searched for "hart
| protocol" hoping to find this: http://www.romilly.co.uk/
| mrpf1ster wrote:
| I searched "c strtok" and got one result saying '"strtok" could
| be spelled "stroke", "stork", "sarto", "strop"'.
|
| Cool concept though!
| marginalia_nu wrote:
| The spelling suggestions are presented whenever there aren't any
| results, but sometimes they can be pretty misleading.
|
| What happens is that C, as a word, isn't indexed because it's
| deemed too short, and the bigram "c strtok" can't be found
| anywhere.
|
| Try 'strtok' instead.
| llbeansandrice wrote:
| Is there any way to add this as a favored search engine in the
| browser?
|
| I currently use google as it's set as the default search when I
| type in the address bar but would love to switch and move
| google/ddg to an added character like "<search terms> @g"
| tomerv wrote:
| All the major browsers support adding custom search engines.
| You just need to specify the URL template to do the search. The
| common format is to put "%s" as the search term. You can use it
| for any site, not just things that are considered search
| engines.
|
| Firefox is a bit different, since you do it by adding a
| bookmark, and giving that bookmark a keyword. The other
| browsers I checked have an option under the search engine
| settings.
|
| After defining the custom search engine, you just type
| "<keyword> <search term>" in the URL bar.
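What the browser does with such a template is simple substitution: the typed search term is URL-encoded and dropped in where "%s" sits. A minimal sketch follows, with a hypothetical helper name `build_search_url` and the Marginalia template quoted elsewhere in this thread.

```python
from urllib.parse import quote_plus


def build_search_url(template, terms):
    """Substitute URL-encoded search terms for the %s placeholder."""
    return template.replace("%s", quote_plus(terms))


url = build_search_url(
    "https://search.marginalia.nu/search?query=%s",
    "Ingolf Arnarson",
)
# url == "https://search.marginalia.nu/search?query=Ingolf+Arnarson"
```

The same helper works for any site whose search results are reachable through a query string, which is why the trick is not limited to search engines.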
| freddref wrote:
| Is there a way to set marginalia as default search in
| firefox?
| llbeansandrice wrote:
| Ah I'm on FF so I'll have to do the weird bookmark method. A
| little annoying since it supports other search engines.
| hilbert42 wrote:
| It's excellent, I looked up some physics topics and got some
| excellent results - real meaty stuff full of text, equations and
| applicable diagrams, etc.
|
| I've not only bookmarked it but also I've an icon linked to it on
| the taskbar. Will watch its progress with interest.
| aetherspawn wrote:
| Beware, I got the impression straight away that some sites were
| censored from the results for no good reason.
|
| For example, if you search "jehovahs witnesses", all pages from
| jw.org are missing.
|
| Exactly the same thing happened when I searched "mormons" - the
| official website is missing and it only brings up
| sects/hate/conspiracies against mormons.
| edrxty wrote:
| jw.org appears to be the kind of modern web design this is
| trying to avoid. I seriously doubt it has anything to do with
| the cult.
| marginalia_nu wrote:
| If you want mormons in positive light, you should search for
| "latter-day saints", as that's how they typically brand
| themselves.
|
| jw.org is on the indexing list, just pretty far down, based on
| the fact that previous times it's been visited it's had a ton of
| javascript.
|
| I don't have an axe to grind with fringe religious movements, I
| actually love them to bits. Try searching for Nag Hammadi or
| Hermes Trismegistos.
| ilrwbwrkhv wrote:
| This is soooo good. I'm finally finding sites I haven't heard of
| with good content.
|
| I didn't realize how much I missed this stuff.
|
| The popular web has become so bad nowadays.
| protontorpedo wrote:
| I searched for "Starlink satellites" and found this Y2K-style
| Canadian UFO blog [1] explaining it isn't aliens. I might just
| waste my weekend with this search engine.
|
| [1] https://www.ufobc.ca/Reports/stringoflights.html
| snuser wrote:
| Just searching for 'dogs' gave me more interesting results than
| I've seen from google in years
| prionassembly wrote:
| The website itself seems generated with some kind of kick-ass
| generator from template files (.gmi?)
|
| I feel like I'm stuck with Wordpress.com because it brings me
| _some_ traffic (whereas something hand-rolled on nsfspeech or
| digital ocean or whatever would literally be off the edge of the
| web), but the structure of that is so cool.
| NhanH wrote:
| That would be gemini protocol!
| huijzer wrote:
| You can easily do proper SEO with static site generators too.
| Even more, static sites can be hosted via GitHub or GitLab
| Pages, Netlify or CloudFlare and in all cases the speed will
| outperform Wordpress in almost all cases. Also, you have way
| more control over the output than with Wordpress.
| dzink wrote:
| One use case to always test: "online wishlist" or "make a
| wishlist". If you start seeing tools like
| https://www.DreamList.com or others, you are on the right path.
| If you start seeing random web pages linking to individual wish
| lists, then people are likely not able to find tools on your
| search engine.
| platz wrote:
| > Don't be afraid to scroll down in the search results
|
| I never knew it was fear that was preventing me from scrolling
| prewett wrote:
| Kudos for taking on this project, and I like the idea! I think
| it'll be a big project to take it to the next level, but would
| love to have a search engine that's more useful.
|
| Some reactions:
|
| - The font is really big and the columns really narrow, so I get
| 3 - 4 entries per page, something like 8 words per line, and huge
| spacings between lines, which makes it a frustrating experience.
| I've been using the recommendations in
| https://practicaltypography.com/, which recommends 60 - 90
| characters in a line I think, and line spacing of 120% - 140% (I
| like 125%). The line lengths here might technically fall within
| the lower bound, but it's really short, and for search results
| I'm going to try scanning the text to see if there's something
| relevant, so I think going on the long side is better here. At
| least make the width somewhat variable so that I can shrink the
| rather large font and fit more on the line.
|
| - The results are eclectic, but I'm not sure it's usable at the
| moment. "scala append list" did not get me much that's helpful,
| while Google will usually at least put up some click-farming
| tutorial that although minimal effort does tend to answer the
| question. Both "mapo doufu recipe" and "ma po do fu recipe" had
| very few recipes, although the latter did have one.
| Unfortunately, recipe websites are some of the worst, with about
| 10 pages of description, ads, pictures, what-have-you until the
| recipe at the very bottom. "collection unmitigated pedantry" did
| return the acoup.blog entry at the top, though.
|
| Good luck on the project!
| lbriner wrote:
| My pet peeve with search results is simply that there are ancient
| technical results that in many cases are irrelevant. If I am
| searching for a Windows error message, I don't want some old forum
| post from 2001, especially if it didn't have any answers!
|
| What would be cool would be for people who host old stuff to
| "archive" it at some point so it doesn't appear in normal
| results, only if you tick "include archives".
| athenot wrote:
| As much as the release names for macOS over the years were
| marketing gimmicks, it does make it a lot easier to zero in on
| the correct version when doing these types of searches.
| platz wrote:
| modern design = low information density?
| monkeybutton wrote:
| Definitely low signal to noise. Looking at you recipe websites
| and cooking blogs.
| dfdz wrote:
| I like the concept, but it did not work on any of the search
| phrases I entered consisting of the full title of a computer
| science article or book.
|
| It also does not work for subjects. For example, if you search
| "discrete math" it links to academic webpages, but most of them
| do not have any notes posted. It is just a plain text website
| with the syllabus of a class.
| grae_QED wrote:
| Obligatory https://wiby.me/ plug. If you're looking for a decent
| minimalist website search engine, this fits the bill pretty well.
| davuinci wrote:
| Congrats for the effort, I really like the idea and it works
| wonderfully for some searches.
|
| However, I searched "infiniband" and the results are far away
| from what I would expect or like to see. Most of the results that
| appear first are completely unrelated to the topic.
| cyral wrote:
| I tried a few queries and got extremely irrelevant results
| marginalia_nu wrote:
| It really depends on what you search for. A major drawback is
| that there needs to be text-heavy sites to find, in order for
| the search engine to find them.
|
| Compare for example the results for "Duke Nukem 3D" with those
| for "Cyberpunk 2077".
| adriangrigore wrote:
| A little bit harsh "punishes". It's a cool search engine.
| OneEyedRobot wrote:
| Very cool. A person can really appreciate simple web design
| looking at something like Luke Smith's recipe page.
|
| So how on earth do you take an idea like this and scale it for
| both broad web coverage and high traffic? For that matter, just
| how much 'useful' text is there on the net?
| vanattab wrote:
| Cool idea but it needs to be able to handle special characters.
| Right now searching "Hello World c#" returns no results because
| the search term can't handle #. I also can't just delete the #
| because then I would be stuck writing C...
| jteppinette wrote:
| This is awesome! We should definitely move in this direction.
| xipho wrote:
| Fascinating. I studied an "obscure" group of insects. My go-to
| search term to test an engine is their family name as it is a
| rarely used word and I know most (all?) of the major data sources
| that have accumulated data on it. When Wolfram Alpha added
| species names, I checked with the name, boring, Duck Duck,
| boring, Google (well we know Google isn't for search anymore,
| it's absolutely horrible) boring, Bing, boring... you get the
| idea.
|
| This was a little different, extremely few results, but a couple
| of them really made me grin, and all(?) made me curious or raise
| an eyebrow or reflect on who/what might have been the source of
| the link, or remember some obscure connection from grad-school.
| So, if anything a crawled list of results worthy of ponder,
| thanks for this!
| tentacleuno wrote:
| > well we know Google isn't for search anymore,
|
| If you're talking about the ads, I would bear in mind Google's
| whole business model is basically online advertising. Search is
| just the vehicle to deliver those ads; I'd say Google is pretty
| good at throwing things back.
| jevgeni wrote:
| But what's their UVP? I'd say quick and relevant search
| results. And that seems to be constantly degrading.
| tentacleuno wrote:
| Well, the unique value proposition is their gigantic index,
| really fast search and a bunch of other things.
|
| I'm not sure about the quality of the results, I just use
| DuckDuckGo these days, but IMO the unique technical
| advancements are pretty unique to Google.
| trutannus wrote:
| > well we know Google isn't for search anymore
|
| Do you suggest anything better? As far as I can tell, all the
| other search engines are either repackaged Bing (ie: DuckDuck),
| or are just as bad.
| ColinHayhurst wrote:
| This needs an update but is an easy look see.
| https://www.searchenginemap.com/
|
| Broad and longer Twitter lists maintained here:
| https://twitter.com/SearchEngineMap/lists
| ricardo81 wrote:
| Mojeek was built in the same spirit (one server living in a
| house) and has 4.5 bn pages indexed now, and a bunch more
| servers. A lot of people comment in similar style of it
| reminding them of an older Internet, or generally less
| branded results. It's definitely an alternative point of
| view. Disclaimer: I work for them.
| giancarlostoro wrote:
| Not sure, but I remember when Google could find literally
| anything. Then they started adding a bunch of exceptions and
| crapped out their quality. I wonder how insanely different
| results would be to get the older Google Engine from the
| 2000s search result wise.
|
| I now have to play games with Google to find things. I feel
| like I do less than I used to for some reason.
| bbarnett wrote:
| The other day, I was searching for something, and google's
| suggested, on-site answers took up 1/2 the first page. All
| wrong.
|
| The actual search results were another 1/4 page of
| completely identical results, followed by google ad placed
| search results.
|
| I thought to myself, they've finally done it. Real
| responses are no longer first page.
|
| A lot of the cause for google getting crappy, is "ok
| google", another "all platforms are the same" form of
| sickness.
|
| No, a desktop is not a phone. No, voice searching is not
| the same as phone, or desktop.
| foobarian wrote:
| I was just thinking that they finally became Lycos. It's
| what all the search engines except Google looked like
| back in the early 2000s - ad laden cesspools of
| irrelevant search results and other content. And it's why
| we all switched to Google at the time.
| habibur wrote:
| It's time to disrupt the market. As Google can't compete
| with a newcomer that penalizes ads on page.
| eitland wrote:
| Seriously, yes.
|
| Moore's law means a modern day 2007-style Google should be
| significantly less expensive to run now than back then.
|
| Also the most relevant patents are now free to use.
|
| 2021 Google is a sad story compared to 2007 Google and
| I'd actually pay to get back 2007 Google - ads included -
| meaning a double revenue source :-)
| trutannus wrote:
| You're absolutely correct, and a lot came from their
| nerfing of search modifiers like + - "search term" and
| whatnot. There's also a lot of ads and "PSA" type nonsense.
| If I'm looking for anything COVID related for example, I
| have to sift through a heap of PSA nonsense that's not even
| related to my search query.
| blowski wrote:
| Wacky idea: instead of Google changing its algorithm every
| couple of years, it could run 50 algorithms in parallel
| leaving no way for sites to "optimise" for the current one.
| vikingerik wrote:
| The output of the parallelism is itself an algorithm,
| that can and will be optimized for.
| mda wrote:
| IMHO, that is a trendy claim in HN with little evidence.
| mhh__ wrote:
| You're downvoted but in my experience I have never really
| been burned by this Google-decline
| Ygg2 wrote:
| "I haven't seen a black swan, ergo it's not real."
|
| I've been burned by this decline in the past.
|
| From creepy results i.e. first suggestion before typing
| was something I spoke near the Android and I never
| searched for before; to not finding what I was searching
| for before successfully, Google has started declining.
| mda wrote:
| It would be nice if you provided a few real examples so
| that we would see how Google was so fantastic and found
| everything magically but then went to shit.
| bbarnett wrote:
| You are a lobster. (or frog, depending upon parable)
| xipho wrote:
| You want evidence? Search for a plumber/tradesperson in
| your area THEN try to find rational discourse about your
| options. There are literally 100s of results of websites
| remixing a small set of data, presenting it to you, and
| asking you to buy something to see more, when you _know_
| there is nothing behind the scenes.
|
| This type of engine would punish these sites, in theory,
| and may turn up a discussion in some forum, newsgroup, etc.
| that is actually relevant, or insightful.
| krapp wrote:
| > Search for a plumber/tradesperson in your area THEN try
| to find rational discourse about your options.
|
| I searched "plumber Austin TX" in Google and got a map
| and list of company websites near me. There are a lot of
| "top x y in z" list sites, but the top results were still
| the most relevant. I don't know what "rational discourse"
| I'm expected to find, though, or why I should assume the
| discourse I would find through Google is less rational
| than discourse I would find elsewhere.
|
| I searched the same thing in OP and found nothing even
| remotely significant. Not even anything related to
| plumbing.
|
| OP's project isn't optimized for relevance, it's
| optimized for nostalgia - providing a filter that keeps
| the modern web away and dropping quirky, interesting
| breadcrumbs to distract you and remind you of what it was
| like to wander around the web of the 90's.
|
| Which is all well and good if that's what you want, and
| judging from the comments it is what a lot of people here
| want, but Google giving me a list of company names,
| numbers, websites and a map showing their location by
| distance is more useful, even if it uses "modern web
| design" and javascript.
| xipho wrote:
| > I searched "plumber Austin TX" in Google and got a map
| and list of company websites near me.
|
| I think you could have done this historically in a Yellow
| Pages phone book. My OP used "boring". A list of plumbers
| is boring, been done on dead wood. I'm not saying boring
| != !useful.
|
| > There are a lot of "top x y in z" list sites
|
| This is an understatement. I actually want to know the
| top x in y, to do that I need "rational discourse".
| Rational discourse is recognizable as well written,
| insightful, humble, reflective, self-countering,
| anecdotal etc. By "search is terrible" I mean with
| respect to finding this.
|
| > OP's project isn't optimized for relevance, it's
| optimized for nostalgia
|
| Nostalgia is highly relevant if it's on topic, but
| agreeing with you as to what this engine is about.
| krapp wrote:
| >Rational discourse is recognizable as well written,
| insightful, humble, reflective, self-countering,
| anecdotal etc. By "search is terrible" I mean with
| respect to finding this.
|
| I believe a search engine that ignores results based on
| superficial and aesthetic qualities like "modern web
| design" would be even worse in that regard, unless you're
| assuming no relevant discourse about any subject has
| taken place on the web since the early 2000's.
|
| I admit, I have no idea what heuristic you would actually
| use to find "well written, insightful, humble,
| reflective, self-countering, anecdotal etc" content, but
| I've seen it on modern sites (even on Twitter,) and I've
| seen a lot of garbage on old sites, so a simple text
| search of only old websites doesn't seem like it.
|
| It is fun, though.
| mda wrote:
| well it displays a map of plumbers in my area, is it not
| useful? Besides do you remember what it was displaying
| before "it became useless"? This whole thread is full of
| hand wavy claims with pretty much no good examples about
| how Google actually became worse in time. Hence my point.
| Spivak wrote:
| Google Search is a fantastic product because it's
| essentially Spotlight for the web. It's by far the fastest
| way to get to things you already vaguely know are there and
| acts as a metasearch for large sites.
|
| But as a result it's now less useful as a tool for scouring
| the web.
| mountain_peak wrote:
| Likewise, I co-maintain the only "fan" site on one of my all-
| time favourite composers/performers, and gave the engine a shot
| with a unique string query. While my text-heavy WP-driven site
| didn't seem to make the cut, the results were highly relevant
| in that they were links to former band members and
| collaborators - a couple of which I didn't realize existed.
| That being said, there were a few sites (including my own) I
| expected to be returned, but no dice. Still, a fascinating
| experiment that many at HN have been clamouring for.
| xipho wrote:
| Exactly this. A couple results returned reference to obscure
| now-defunct newsletters and clubs, people that I know were
| historically important for past researchers, but only because
| this was my research focus for so long would I have known
| this.
| marginalia_nu wrote:
| The search engine doesn't actually do full text search, so
| maybe your query was too... unique.
|
| But do first of all verify that you haven't been hacked.
| There's about a quarter of a million domains I've flagged that,
| besides their wordpress content, also host a ton of link spam
| crap off in some hidden folder. This reflects on the quality
| rating extremely negatively to the point where you may have
| not been indexed at all.
|
| Secondly, are you behind cloudflare or some other big-name
| CDN? Because, as I mentioned in another comment, I can't
| crawl their pages without getting captchad until they approve
| of my humble request to be classified as a good bot.
|
| There are some other hosting providers I flat out block on a
| subnet level because they host a large amount of link farms.
| This is currently Alibaba, Psychz, eSited, Cloud Yuqu and
| 1Blu.
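Blocking at the subnet level can be sketched with Python's standard `ipaddress` module. The ranges below are placeholders from the reserved documentation blocks, not the real ranges of any provider named above, and `is_blocked` is a hypothetical helper rather than the crawler's actual code.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), standing in for
# the address blocks of hosting providers a crawler wants to skip.
BLOCKED_SUBNETS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked subnet."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_SUBNETS)


print(is_blocked("192.0.2.17"))   # inside the first placeholder range
print(is_blocked("203.0.113.5"))  # outside both ranges
```

A crawler would run a check like this on the resolved address of each domain before fetching anything from it.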
| mountain_peak wrote:
| Thanks for the advice; not hacked, but I have "resurrected"
| many WP sites that have been (including my wife's non-
| profit). Just running on an EC2 micro instance, but I tried
| adding "site:" and received "No such domain". Actually, I
| think it's because I haven't enabled "HTTPS" yet! That's on
| my to-do along with migrating off EC2-Classic to VPC...
| marginalia_nu wrote:
| Vanilla HTTP should be fine. I think 80% of the urls are
| HTTP.
|
| If you're getting no such domain, it's either blocked
| because it looks too much like a spam domain, or it
| simply hasn't been discovered yet.
|
| What's the TLD? I severely restrict some cheaper TLDs
| because they gave so much spam.
|
| For example, cr.yp.to is an example of a baby I know I've
| definitely thrown out with the bathwater.
| wolverine876 wrote:
| www.ft.com gets 'no such domain'
| marginalia_nu wrote:
| I added it now, but it turns out it's behind a CDN so I
| still can't crawl it.
| mountain_peak wrote:
| Is a good ol' .com with no ads and minimal JS -
| originally launched in 2011. Thanks again for your
| insights; I've bookmarked your site and will check back
| every so often to see if my site's been indexed.
| withinboredom wrote:
| It'd be nice if you had a page to get the current index
| status for a domain.
| marginalia_nu wrote:
| Try a query on the form site:www.example.com ;-)
| wolverine876 wrote:
| > site:www.washingtonpost.com
|
| > Blacklisted false
|
| > site:www.wsj.com
|
| > Blacklisted false
|
| > site:www.rt.com
|
| > Blacklisted false
|
| > site:www.nytimes.com
|
| > Blacklisted true
|
| ?
| marginalia_nu wrote:
| Hmm, not sure what caused it to end up there, but I
| removed it from the blacklist. It still doesn't seem to
| want to index the domain however, probably CDN-related.
| rovr138 wrote:
| Would it be possible to have a link to a page with
| operators?
| duckmysick wrote:
| I'm intrigued by this experiment but I can't visualize it. What
| do you mean by boring results? Would combing through a library
| (the one with paper books) also produce boring results? What's
| your ideal results?
| xipho wrote:
| Perhaps a counter example, something that is interesting.
| Anecdotally. This, of all things, is the _top_ result in my
| search: https://tft.brainiac.com/archive/0303/msg00037.html.
| Which is strange to me because I don't recognize
| tft.brainiac. I click, it's a list of biological
| relationships among Hymenoptera, including a reference to
| genus of the wasps I studied, presumably in a biological
| relationship (host/parasite) context. I cataloged every
| relationship known at one point, so my brain wants to know
| where this came from, is it something I caught. Then I go
| look for more context, and find it's part of a thread about
| D&D(?) and hymenoptera, and it's epic, and a chunk of my
| morning is lost figuring out why and how this came to be.
| duckmysick wrote:
| Yes, thanks. That helps.
|
| If I understand it correctly, you're interested in bits and
| pieces of new information that's indirectly related to your
| object of interest. Degree 2 and 3 in Six Degrees of Kevin
| Bacon, so to speak. You know degree 0 like the back of your
| hand and you've seen almost everything closely connected.
| Finding novel, interesting things is getting more
| difficult.
|
| Have you thought about cataloging all the related stuff you
| stumble upon? Something in between loose notes and what
| Moby Dick is to cetology.
| xipho wrote:
| Exactly.
|
| > Have you thought about cataloging all the related stuff
| you stumble upon? Something in between loose notes and
| what Moby Dick is to cetology.
|
| Tongue in cheek- new app time, to facilitate this. It
| should have the name "Degree4". Entries can only be made
| if degrees 2 and 3 are "defined". Scoffs at degrees 5 and
| 6, just because. Startup developing can probably
| unethically seed content by mining
| https://www.everything2.com/. Should use concepts of "AI"
| and "persistent homology"... profit!
|
| But no, I don't outside a mental note. Closest I would
| come would be adding '!! <some note>' to my potwiki text
| notes (see my past comments) if its something I want to
| have come back with a grep, or think might be interesting
| to explore "when I retire". If it's a scientific fact in
| my field after researching it further it would go into
| this https://taxonworks.org (or its precursor).
| xipho wrote:
| In part, by boring results I mean I instantly recognize the
| top results, and I know exactly what will be in them, and I
| know which ones will actually contain potentially interesting
| new stuff, i.e. _I didn't have to search for these, I'd go
| there directly_. Then next results are all obscure, and I've
| already visited them, and/or I know they are historical and
| not something I have to revisit.
|
| With this engine with at least 1/2 the links (to be fair
| there were < 20) I didn't recognize the URL at all, and it
| was clear in the text or the URL that there was an
| interesting bit to check out (i.e. what Google should have
| also returned after they barfed out the things I don't need
| to know about), but had never succinctly done in my
| experience.
|
| I suppose the magic in this engine would have to be alerting
| the searcher that they found more of this type of link, as
| once I visited the 10 or so sites they would fall back into
| the "been there, done that" link category that Google appends
| somewhere after the ads and "big" sites, mixed in with a
| million search term spam sites, etc.
| xattt wrote:
| There's certain grey literature that's not captured in
| university library federated searches nor easily found with
| mainstream search engines.
| xipho wrote:
| There are decades of academic research not digitized. The
| digitization window used to only hit around 1990, I haven't
| looked at it hard recently, but I suspect this still
| remains true for many important journals. This is grey only
| to those who do not know how to use a library.
| nagyf wrote:
| This is great, I like the results. Couple of things I noticed:
|
| - Search results often very old, from the early 2000s (I guess
| because back then more websites were text oriented). Are you
| taking into account the age of the page when showing results? It
| would be great to see more up-to-date results at the top
|
| - I noticed a few results which directed me to websites with
| security risks, Firefox didn't even let me open them. Is it
| possible to filter these out from the results?
| horsh1 wrote:
| No cyrillic or hiragana support :-(
| 300bps wrote:
| What we need now is a search engine that weeds out sites that
| have been SEO optimized for keyword density.
|
| I'm tired of searching for "generic keyword" and getting a page
| with an extremely low signal to noise ratio written like this:
|
| "Many people search for generic keyword. That is why you can find
| all about generic keyword here. In fact we specialize in generic
| keyword and slight alterations of generic keyword."
|
| It's like Google stopped caring that people were gaming it.
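One crude signal for this kind of stuffing is keyword density: the share of a page's words taken up by the target phrase. The sketch below is an assumption about how such a measure could be computed, with a hypothetical `keyword_density` helper. It does not describe what any real search engine does.

```python
import re

SAMPLE = (
    "Many people search for generic keyword. That is why you can find "
    "all about generic keyword here. In fact we specialize in generic "
    "keyword and slight alterations of generic keyword."
)


def keyword_density(text: str, phrase: str) -> float:
    """Fraction of the words in text that belong to an occurrence of phrase."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    if not words or n == 0:
        return 0.0
    # Count positions where the next n words equal the target phrase.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)
    return hits * n / len(words)


print(f"{keyword_density(SAMPLE, 'generic keyword'):.2f}")
```

A page where one phrase accounts for over a quarter of all words, as in the sample, would be an obvious candidate for a penalty.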
| Nicksil wrote:
| - Semantic HTML; not everything is a div; correct use of markup.
|
| - Search results are not overrun with commercial, SEO stuffing,
| "content" farms.
|
| I don't know what to say. This is such a refreshing sight. Well
| done.
| thetanil wrote:
| Yes please! More of this!
| hulitu wrote:
| "Search results Search "alt.sysadmin.recovery" needs to be a word
| Those were all the results,"
|
| No comment.
| hdjjhhvvhga wrote:
| Congratulations, great work!
| tomaszs wrote:
| I like the concept of a search engine that does not try to figure
| out what I should learn based on what I search..I know what I
| search for
| camillomiller wrote:
| Great idea, awful UI
| muxator wrote:
| How so? It's intuitive and super fast. I wish there were more
| websites with such a simple UI.
| feikname wrote:
| it's too "uncompact". Font size too big and could use a bit
| more horizontal space.
|
| I find it comfortable to use at 60% zoom level
| marginalia_nu wrote:
| Some people like a flashy UI, the modern look is important
| for them. It's ok to have aesthetic preferences, let's not
| pretend we don't all have them.
|
| In the end, it's a niche search engine I've made, the
| intended audience is the long tail. It just isn't for
| everyone, and if it was for everyone, it probably would be
| lesser for it.
| silent_cal wrote:
| I like the UI.
| ravenstine wrote:
| I'm glad you aren't trying to please anyone. I'd like a
| return to an internet with fewer colors, gadgets and
| gizmos, custom fonts, TypeKit, JavaScript requirements, and
| so on. Most of the time I'm reading articles, so just give
| me more text and less fluff!
| jbj wrote:
| What makes you find the user interface awful?
|
| it is literally a search website with a text box for a search
| term and a button to do the search.
| r00t4ccess wrote:
| The page isn't prompting for cookie preferences, asking to
| allow notifications, popping up a mailing list or coupon half
| way down the page, playing a full page video with sound, or loading
| 97 million lines of javascript. I'd say it's pretty much perfect.
| IggleSniggle wrote:
| Huh. I think it's a great UI. What did you not like about it?
| stronglikedan wrote:
| Ironic comment, considering that this is a search engine to
| weed out sites with awful UIs. This gives us exactly what we
| need in a search UI - no more, no less - in a clean and
| intuitive way.
| fouc wrote:
| Yeah the design could use some work. The search results are not
| compact - I only see 1 result without scrolling, not counting
| the related wikipedia link that apparently has no description.
|
| I don't particularly like that it seems to be a column
| constrained to 550px width, instead of being responsive and
| taking advantage of greater widths.
|
| to the author of the site, if you're not really into
| design/css, take a look at tailwindcss, it makes it fairly easy
| to produce a minimal amount of css that is responsive.
| agumonkey wrote:
| Very nice. Start a trend :)
| jccalhoun wrote:
| It says it punishes modern web design but it has my most
| irritating feature of modern web design: a narrow strip of text
| on an otherwise blank page.
| marginalia_nu wrote:
| Yeah so this is my project. It's very much a work in progress,
| but occasionally I think it works remarkably well for something I
| cobbled together alone out of consumer hardware and home-made
| code :-)
| eigengrau5150 wrote:
| I like this. Thanks for doing it.
| scrollaway wrote:
| I searched Warcraft and got a gold selling/ level boosting
| site. Some things never change :)
| bityard wrote:
| This is awesome. I've been looking for a long time for a search
| engine that basically takes everything Google does and does the
| opposite. Thank you for doing this, I will definitely be
| bookmarking it.
|
| Is there a way to suggest or add sites? I went looking for
| woodgears.ca and only got one result. I also think my personal
| blog would be a good candidate for being indexed here but I
| couldn't find any results for it.
| peterburkimsher wrote:
| Thank you so much for creating such a useful search engine!
|
| Is there any way that you can get an HTTP certificate?
|
| I use an old iPhone 4S, and most of the modern web is
| inaccessible due to TLS. Hacker News and mbasic.facebook are
| two of the last sites I can use.
|
| Usually text-based sites are more accessible, so this could be
| really useful to help me continue using my antique devices!
| ColinHayhurst wrote:
| Great work. Working on an alternative search engine too. Take a
| look at my profile.
| soheil wrote:
| Awesome project! How are you able to keep the site running
| after HN kiss of death? What is your stack, elastic search or
| something simpler? How did you crawl so many websites for a
| project this size? Did you use any APIs like duck duck go or
| data from other search engines? Are you still incorporating
| something like PageRank to ensure good results are prioritized
| or is it just the text-based-ness factor?
| marginalia_nu wrote:
| > How are you able to keep the site running after HN kiss of
| death?
|
| I originally targeted a Raspberry Pi4-cluster. It was only
| able to deal with about 200k pages at that stage, but it did
| shape the design in a way that makes very thrifty use of the
| available hardware.
|
| My day job is also developing this sort of high-performance
| Java applications, I guess it helps.
|
| > What is your stack, elastic search or something simpler?
|
| It's a custom index engine I built for this. I do use mariadb
| for some ancillary data and to support the crawler, but it's
| only doing trivial queries.
|
| > How did you crawl so many websites for a project this size?
|
| It's not that hard. Like it seems like it would be, and there
| certainly is an insane number of edge cases, but if you just
| keep tinkering you can easily crawl dozens of pages per
| second even on modest hardware (of course distributed across
| different domains).
|
| > Did you use any APIs like duck duck go or data from other
| search engines?
|
| Nope, it's all me.
|
| > Are you still incorporating something like PageRank to
| ensure good results are prioritized or is it just the text-
| based-ness factor?
|
| I'm using a somewhat convoluted algorithm that takes into
| consideration the text-based-ness of the page, but also how
| many incoming links the domain has, but it's a weighted value
| that factors in the text-based-ness of the origin domains.
|
| It would be interesting to try a page rank-style approach,
| but my thinking is that because it's _the_ algorithm, it's
| also the algorithm everyone is trying to game.
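The weighting described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `textiness` scores in the range 0 to 1, and the way the two signals are combined (a log-damped sum) are all hypothetical, since the thread does not give the actual formula.

```python
import math


def weighted_inlinks(linking_domains, textiness):
    """Sum the text-based-ness of every domain that links in.

    A link from a text-heavy domain counts for more than a link from a
    script-heavy one. Unknown domains contribute nothing (assumption).
    """
    return sum(textiness.get(domain, 0.0) for domain in linking_domains)


def rank_score(page_textiness, linking_domains, textiness):
    """Combine the page's own score with its weighted incoming links.

    The log1p damping is an arbitrary illustrative choice, not the
    search engine's real formula.
    """
    boost = math.log1p(weighted_inlinks(linking_domains, textiness))
    return page_textiness * (1.0 + boost)


# Hypothetical per-domain text-based-ness scores in [0, 1].
TEXTINESS = {"oldweb.example": 0.9, "essays.example": 0.8, "spa.example": 0.1}

quiet_links = rank_score(0.7, ["oldweb.example", "essays.example"], TEXTINESS)
flashy_links = rank_score(0.7, ["spa.example", "spa.example"], TEXTINESS)
```

With these made-up numbers, two links from text-heavy domains lift a page further than two links from a script-heavy one, which is the property the comment describes.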
| noduerme wrote:
| I love this idea, and admire the work you put into it. I'm a
| fan of long reads and historical non-fiction, and Google's
| results are truly garbage.
|
| I have a criticism that I think may pertain to the ranking
| methodology. I searched for "discovery of Australia". Among the
| top results were:
|
| * A site claiming that the biblical flood was caused by Earth
| colliding with a comet (with several other pages from that site
| also making the top search results with other wild claims, e.g.
| that the Egyptians discovered Arizona);
|
| * Another site claiming the first inhabitants of Australia were
| a lost tribe of Israel;
|
| * A third site claiming that Australia was discovered and
| founded by members of a secret society of Rosicrucians who had
| infiltrated the Dutch East India Company and planned to build
| an Australian utopia...
|
| These were all pages heavy with HTML4 tags and virtually devoid
| of Javascript, the kinds of pages you'd frequently see in the
| late 1990s from people who had built their own static websites
| in a text editor, or exported HTML from MS Word. At that time,
| there were millions of those sites with people paying for their
| own unique domain names, and so the proportion of them that
| were home to wild-eyed conspiracy theories was relatively
| small. What I think has happened is that kooks continued to
| keep these sites up - to the point where it's almost a visual
| trope now to see a red <h1> tag in Times New Roman and think,
| uh oh, I've stumbled on an "ancient aliens" site. Whereas
| scholars and journals offering higher quality information have
| moved to more modern platforms that rely more heavily on modern
| browsers - with or without their own domain names. So as a
| result what seemed to surface here were the fragments of the
| old web that remain live - possibly because people living in
| cabins in Montana forget to cancel their web hosting, or
| because the nature of old-school conspiracy theorists is to
| just keep packing their old sites with walls of text surrounded
| by <p> tags.
|
| Arguably, this seems to rank the way Google's engine used to,
| since it couldn't run JS and they wanted to punish sites that
| used code to change markup at render time. At least, when I
| used to have to do onsite SEO work, it was always about simple
| tag hierarchies.
|
| I wonder whether there isn't some better metric of validity and
| information quality than what markup is used. Some of the sites
| that surfaced further down could be considered interesting and
| valuable resources. I think _not punishing_ simple wall-of-text
| content is a good thing. But to punish more complicated layouts
| may have the perverse effect of downranking higher-quality
| sources of information - i.e. people and organizations who can
| afford to build a decent website, or who care to migrate to a
| modern blogging platform.
| _dain_ wrote:
| those three pages sound pretty interesting, I don't see this
| as a problem
| crocodiletears wrote:
| It's very rare that I see a project on HN I can see myself
| using. This is one. Like others have said, the results can be a
| little rough. But they're rough in a way I think is much more
| manageable than the idiosyncrasies of more 'clever' search
| engines.
| marginalia_nu wrote:
| I think you need to approach it more like grep than google.
| It's a forgotten art, dealing with this type of dumb search
| engine.
|
| Like if you search for "How do I make a steak", you aren't
| going to get very good results. But a better query is "Steak
| Recipe", as that is at least a conceivable H1-tag.
| AQXt wrote:
| So, you are re-implementing Altavista, Lycos and other old
| search engines.
|
| They used the naive approach: you searched for "steak", and
| they would bring the pages which included the word "steak".
|
| The problem is that people could fool these engines by
| adding a long sequence like "steak, steak, steak, steak,
| steak, steak" to their site -- to pretend that they were
| the most authoritative page about steaks.
|
| Google's big innovation was to count the referrers -- how
| many pages used the word "steak" to link to that particular
| page.
|
| The rest is history.
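A toy comparison of the two approaches described in this comment, as a rough sketch. The pages, the links, and both function names are made up for illustration, and counting anchor-text mentions is a heavy simplification of what any real engine does.

```python
from collections import Counter

# Hypothetical corpus: URL -> page text.
PAGES = {
    "stuffed.example": "steak " * 6,
    "butcher.example": "how to choose and cook a steak",
}

# Hypothetical link graph: (source page, anchor text, target page).
LINKS = [
    ("blog-a.example", "great steak guide", "butcher.example"),
    ("blog-b.example", "steak tips", "butcher.example"),
    ("stuffed.example", "steak", "stuffed.example"),  # self-link, ignored below
]


def best_by_term_count(term):
    """Naive ranking: the page that repeats the term most often wins."""
    return max(PAGES, key=lambda url: PAGES[url].lower().split().count(term))


def best_by_anchor_text(term):
    """Referrer ranking: count other pages that link in using the term."""
    votes = Counter(
        target
        for source, anchor, target in LINKS
        if source != target and term in anchor.lower().split()
    )
    return votes.most_common(1)[0][0] if votes else None
```

Here the keyword-stuffed page wins under the naive count, while the page that other sites actually link to wins under the anchor-text count.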
| wolverine876 wrote:
| > The problem is that people could fool these engines by
| adding a long sequence like "steak, steak, steak, steak,
| steak, steak" to their site -- to pretend that they were
| the most authoritative page about steaks.
|
| I don't see a lot of people investing in SEO to boost
| their Marginalia results.
| frogpelt wrote:
| Effective Google search is also history.
|
| I understand they are trying to maximize ad revenue and
| search does work very well for people who are looking for
| products or services.
|
| But it no longer works well for finding information that
| is even slightly obscure.
| crocodiletears wrote:
| This is exactly how I prefer to use my search engines.
| quaintdev wrote:
| I searched like this all my life and always got expected
| results.
|
| But just a week ago I found out that these "how", "what"
| questions give better and faster results on Google.
| LeftHandPath wrote:
| That switch happened some years ago. I've been unlearning
| and relearning how to use google for what feels like at
| least three or four years now.
|
| The main pain-point, though, is that a lot of long-tail
| searches you could've used to find different results in
| years past, now seem to funnel you to the same set of
| results based on your apparent intent. At least, it has
| felt that way -- I'm not entirely sure how the modern
| google algorithm works.
| bluefox wrote:
| This is a very cool project! Thank you.
| BugsJustFindMe wrote:
| I love this, and I love (many of) the results so far! What I
| can't find on the site is detail about what "too many modern
| web design features" means. Is it just penalizing sites with
| tons of JavaScript?
| marginalia_nu wrote:
| Javascript tags are penalized the hardest, but it also takes
| into consideration density of text per HTML. There's also
| some adjustments based on text length, which words occur in
| the page, etc.
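One way such a signal could be computed is sketched below, using only Python's standard `html.parser`. The `text_score` name, the `script_penalty` weight, and the formula are assumptions made for illustration; the thread only says that script tags are penalized and that text density per HTML is considered.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts <script> tags and visible text characters in a document."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def text_score(html, script_penalty=0.2):
    """Visible text per byte of markup, divided down per script tag.

    Both the formula and the default penalty are hypothetical.
    """
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    density = counter.text_chars / max(len(html), 1)
    return density / (1 + script_penalty * counter.script_tags)
```

A page that is mostly paragraphs scores close to 1, and the score drops both as markup grows relative to text and as script tags are added.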
| ad404b8a372f2b9 wrote:
| Very cool project! How many websites do you have in your index?
| And how did you go about building it?
|
| I've been working on an engine for personal websites, currently
| trying to build a classifier to extract them from commoncrawl,
| if you have any general tips on that kind of project they'd be
| very welcome.
| davegauer wrote:
| This is absolutely wonderful. I am LOVING the results I'm
| getting back from it: the sort of content-rich sites that have
| become nigh unreachable using traditional search engines. Thank
| you for building this!
| asah wrote:
| Love it, kudos! This is great for developers and others who
| Just Need Answers and not shopping or entertainment.
|
| If you're looking for feedback, both from a UI design and
| utility standpoint, you might consider "inlining" results from
| selected sites, e.g. Wikipedia, Stack Exchange, etc. Having
| worked on search for a long time, inlining (onebox etc) is a
| big reason users choose Google, and that challengers fail to get
| traction. If you're Serious(tm), dig into the publisher
| structure formats and format those, create a test suite, etc.
|
| A word of caution: if this takes off, as a business it's
| vulnerable to Google shifting its algorithms slightly to
| identify the segment of users+queries who prefer these results
| and give the same results to those queries.
|
| Hope this helps!
| marginalia_nu wrote:
| If Google starts showing interesting text-heavy links instead
| of vapid listicles and storefronts, I have accomplished
| everything I ever could dream of.
| MaysonL wrote:
| Google Info - for when you're looking for information, not
| shopping advice or lists!
| addandsubtract wrote:
| Google info? Can you give me a sample query of what you
| mean?
| wolverine876 wrote:
| Maybe you're joking, but this is a good idea for a search
| engine. Better: Credible info.
| 0xbadcafebee wrote:
| Thank you for doing this important work.
| palijer wrote:
| Haha, reminds me exactly of this.
|
| https://xkcd.com/810/
| aaron5 wrote:
| haha, great answer! thanks for your work on this :)
| santamex wrote:
| Which software do you use to index the sites?
| marginalia_nu wrote:
| I wrote it myself from scratch. I have some metadata in
| mariadb, but the index is bespoke.
|
| A design sketch of the index is that it uses one file with
| sorted URL IDs, one with IDs of N-grams (i.e. words and word-
| pairs) referring to ranges in the URL file; as well as a
| dictionary for relating words to word-IDs; that's a GNU Trove
| hash map I modified to use memory-mapped data instead of direct
| allocated arrays.
|
| So when you search for two words, it translates them into IDs
| using the special hash map, goes to the words file and finds
| the least common of the words; starts with that.
|
| Then it goes to the words file and looks up the URL range of
| the first word.
|
| Then it goes to the words file and looks up the URL range of
| the second word.
|
| Then it goes through the less common word's range and does a
| binary search for each of those in the range of the more
| common word.
|
| Then it grabs the first N results, and translates them into
| URLs (through mariadb); and that's your search result.
|
| I'm skipping over a few steps, but that's the very crudest of
| outlines.
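The lookup steps above can be sketched in a few lines. This is a minimal in-memory stand-in, not the author's Java implementation: the names `DICTIONARY`, `WORD_RANGES`, `URL_IDS` and `URL_TABLE` are hypothetical, the data is made up, and plain Python lists and dicts replace the memory-mapped files and the mariadb lookup.

```python
from bisect import bisect_left

# word -> word ID (stands in for the dictionary hash map)
DICTIONARY = {"steak": 0, "recipe": 1}

# word ID -> [start, end) range in URL_IDS (stands in for the words file)
WORD_RANGES = {0: (0, 2), 1: (2, 7)}

# Flat list of URL IDs; each word's range is sorted on its own.
URL_IDS = [4, 9, 1, 4, 6, 9, 12]

# URL ID -> URL (stands in for the mariadb lookup)
URL_TABLE = {i: f"https://site{i}.example/" for i in (1, 4, 6, 9, 12)}


def range_contains(start, end, url_id):
    """Binary search for url_id inside URL_IDS[start:end]."""
    i = bisect_left(URL_IDS, url_id, start, end)
    return i < end and URL_IDS[i] == url_id


def search(word_a, word_b, limit=10):
    """Return URLs containing both words, walking the rarer word's range."""
    ranges = [WORD_RANGES[DICTIONARY[word]] for word in (word_a, word_b)]
    ranges.sort(key=lambda r: r[1] - r[0])  # least common word first
    (rare_start, rare_end), (common_start, common_end) = ranges
    hits = [
        url_id
        for url_id in URL_IDS[rare_start:rare_end]
        if range_contains(common_start, common_end, url_id)
    ]
    return [URL_TABLE[url_id] for url_id in hits[:limit]]
```

Calling `search("steak", "recipe")` walks the two-entry range for "steak" and probes the five-entry range for "recipe"; the argument order does not matter because the ranges are sorted by size first.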
| q3k wrote:
| Good stuff. I've also been toying with doing some homegrown
| search engine indexing (as an exercise in scalable
| systems), and this is a fantastic result and great
| inspiration.
|
| Definitely want to see more people doing that kind of low-
| level work instead of falling back to either 'use
| elasticsearch' or 'you can't, you're not google'.
| marginalia_nu wrote:
| Well just crunching the numbers should indicate what is
| possible and what isn't.
|
| For the moment I have just south of 20 million URLs
| indexed.
|
| 1 x 20 million bytes = 20 MB.
|
| 10 x 20 million bytes = 200 MB.
|
| 100 x 20 million bytes = 2 GB.
|
| 1,000 x 20 million bytes = 20 GB.
|
| 10,000 x 20 million bytes = 200 GB.
|
| 100,000 x 20 million bytes = 2 TB.
|
| 1,000,000 x 20 million bytes = 20 TB.
|
| This is still within what consumer hardware can deal
| with. It's getting expensive, but you don't need a
| datacenter to store 20 TB worth of data.
|
| How many bytes do you need, per document, for an index?
| Do you need 1 MB of data to store index information about
| a page that, in terms of text alone, is perhaps 10 kB?
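The same back-of-the-envelope table can be reproduced with a short script. The 20 million document count comes from the comment; the function names are hypothetical, and sizes use decimal units (1 kB = 1000 bytes).

```python
URLS_INDEXED = 20_000_000  # "just south of 20 million URLs"


def index_size_bytes(bytes_per_document, documents=URLS_INDEXED):
    """Total index size if every document costs the same number of bytes."""
    return bytes_per_document * documents


def human(size_bytes):
    """Format a byte count using decimal units."""
    size = float(size_bytes)
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if size < 1000:
            return f"{size:g} {unit}"
        size /= 1000
    return f"{size:g} PB"


if __name__ == "__main__":
    for per_doc in (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
        print(f"{per_doc:>9,} bytes/doc -> {human(index_size_bytes(per_doc))}")
```

The budget question then becomes which row is affordable: at 1,000 bytes per document the whole index is 20 GB, while 1,000,000 bytes per document needs 20 TB.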
| rvnx wrote:
| It's a great project!
| Aeolun wrote:
| I'm not sure how you go from word to url range? Range
| implies contiguous, but how can you make that happen for a
| bunch of words without keeping track of a list of urls for
| each word (or URL ids, the idea is the same)?
| marginalia_nu wrote:
| The trick is that the list of URLs for each word already
| _is_ in the URLs file.
|
| The URLs in a range are sorted. A sorted list (or list-
| range) forms an implicit set-like data structure, where
| you can do binary searches to test for existence.
|
| Consider a words file with two words, "hello" and
| "world", corresponding to the ranges (0,3), (3,6). The
| URLs file contains URLs 1, 5, 7, 2, 5, 8.
|
| The first range corresponds to the URLs 1, 5, 7; and the
| second 2, 5, 8.
|
| If you search for hello world, it will first pick a
| range, the range for "hello", let's say (1,5,7); and then
| do binary searches in the second range -- the range
| corresponding to "world" -- (2,5,8) to find the overlap.
|
| This seems like it would be very slow, but since you can
| trivially find the size of the ranges, it's possible to
| always do them in an order of increasing range-sizes. 10
| x log(100000) is a lot smaller than 100000 x log(10)
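The closing estimate can be checked numerically. The sketch below assumes base-2 logarithms (one comparison per halving in a binary search) and a hypothetical helper name; the comment itself does not state the base.

```python
import math


def estimated_probes(outer_size, inner_size):
    """Rough comparison count: one binary search in the inner range
    for every entry of the outer range."""
    return outer_size * math.log2(inner_size)


# Walk the 10-entry range, binary-search the 100,000-entry range.
rare_first = estimated_probes(10, 100_000)

# Walk the 100,000-entry range, binary-search the 10-entry range.
common_first = estimated_probes(100_000, 10)
```

Under this assumption, walking the small range costs on the order of a couple of hundred comparisons, versus several hundred thousand the other way around, which is why the ranges are processed in order of increasing size.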
| axelroze wrote:
| Hi,
|
| Interesting idea. Definitely see an overlap with eReader
| markets and looking at text only contents.
|
| How does it work?
|
| It ignores pages on which it detects frameworks for ui and ads
| or any javascript code at all?
| agumonkey wrote:
| is there a json endpoint ? I'd love to make an emacs bridge :)
| maddyboo wrote:
| Seconded, I'd like to incorporate it into a project of mine.
| artembugara wrote:
| Nice, what are you using to crawl the web?
| marginalia_nu wrote:
| It's pretty much all bespoke.
|
| I use external libraries for parsing HTML (JSoup) and
| robots.txt; but that's about it.
| soheil wrote:
| What was the starting site you fed to the crawler to follow
| the links from to build the index?
| marginalia_nu wrote:
| Just my (Swedish) personal website. The first iteration
| of the search engine was probably mainly seeded by these
| links:
|
| https://www.marginalia.nu/00-l%C3%A4nkar/
|
| But I've since expanded my websites, so now I think these
| play a decent role in later iterations, although they are
| virtually all of them pages I've found eating my own
| dogfood:
|
| https://memex.marginalia.nu/links/fragments-old-web.gmi
|
| https://memex.marginalia.nu/links/bookmarks.gmi
| zizee wrote:
| Love the idea. A little feedback: layout needs tweaking for
| mobile. FWIW: I'm on mobile Firefox for Android.
| blondin wrote:
| fantastic project, thank you!
| habibur wrote:
| How are you doing the crawling without getting blocked? -- the
| hardest part.
| judge2020 wrote:
| Not OP but crawling is easy if you don't try scanning 5+
| pages a second - almost all rate limiting/heuristic based
| 'keep server costs low' engines, including Cloudflare, don't
| care if you request every page, but will take action if you
| do something like burst every page and take up just as many
| server resources as a hundred concurrent users.
|
| Now, that is assuming you aren't on some VPS provider. If
| you're going to crawl, you'll have the best chance when you
| use your own IPs on your own ASN, with DNS and reverse DNS
| set up correctly. This makes it so the IP reputation systems
| can detect you as a crawler but not one that hammers every
| site it visits.
|
| Also, I imagine that, for a search engine like this, it
| doesn't expect content to change much anyways - so it can
| take its time crawling every site only once every month or
| two, instead of the multiple times a week (or day) search
| engines like Google have to for the constantly-updated
| content being churned out.
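The "don't hammer any single site" advice above can be sketched as a per-host delay tracker. Everything here is an assumption made for illustration: the `PoliteScheduler` name, the one-second default, and the choice to pass the clock in explicitly so that the example does no sleeping and no network I/O.

```python
from urllib.parse import urlsplit


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may start."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self._next_allowed = {}  # host -> earliest allowed start time

    def wait_time(self, url, now):
        """Reserve a slot for url and return how long to wait before fetching.

        Requests to different hosts never delay each other, so overall
        throughput can stay high while each site sees a slow trickle.
        """
        host = urlsplit(url).hostname
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_delay
        return start - now
```

A crawler loop would call `wait_time(url, time.monotonic())` and sleep for the returned number of seconds before fetching; two back-to-back URLs on the same host get spaced out, while a URL on a different host can go immediately.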
| androceium wrote:
| Pretty neat!!!
|
| You may already be aware of this, but the page doesn't seem to
| be formatted correctly on mobile. The content shows in a single
| thin column in the middle.
| marginalia_nu wrote:
| Hmm, which OS? I only have a single Android phone so I've
| only fixed the CSS for that.
| androceium wrote:
| I was seeing it on Android w/ Firefox. Seems like it's
| fixed now though. :)
| marginalia_nu wrote:
| Curious, I haven't touched the stylesheets.
| ant6n wrote:
| For example Firefox on Android.
| khimaros wrote:
| Fennec F-Droid on Android 11 has some rendering issues.
| edbaskerville wrote:
| This has amazing potential. I'd encourage you to form a non-
| profit, turn this into something that can last as an
| organization without becoming what you're trying to avoid
| becoming. This is a good enough start that I bet you could
| raise a sizeable startup fund very soon from a combination of
| crowdfunding and foundation grants--I bet the Sloan Foundation
| would love this!
| egberts1 wrote:
| I tried "Error 49" as a search phrase.
|
| It's rudimentary but no IT-related result.
| thrtythreeforty wrote:
| > New: You can now look up dictionary definitions for words. If
| you for example don't know what the definition of is is, you can
| inquire thus: define:is.
|
| Oh man, I love subtle jabs and tongue in cheek writing like this.
| Very Robin Williams-esque.
| marginalia_nu wrote:
| I am the first to admit it's a pretty dated reference.
| earthbee wrote:
| I love this! I've been searching random words with no aim in
| particular and keep finding lots of interesting tiny personal
| webpages. It feels like the old web
| [deleted]
| arduinomancer wrote:
| Wow this is immediately useful
|
| If you figure out some sort of funding model (maybe even just
| Patreon) I could totally see this as a viable side project
|
| Already discovered this recipe site: https://based.cooking/
|
| I love how adding recipes is through pull requests:
| https://github.com/LukeSmithxyz/based.cooking/pulls
| MrBoomixer wrote:
| Thank you for this, it really makes me love the web and the
| people making things like this, Forked!
| dmje wrote:
| Love it. You should provide a link to Patreon / whatever so
| people can support you financially. Hosting is probably not cheap
| for you. Given the love here on HN I suspect you'd do well.
| marginalia_nu wrote:
| Hosting is actually surprisingly cheap, but that's because I'm
| hosting it on consumer hardware in my living room, off my
| domestic broadband connection.
|
| That's both a blessing and a curse. It works okay as long as I
| don't touch it, but I can't do maintenance without shutting it
| down. I can't implement crawler changes without a week of
| shitty results as it needs to visit half the Internet before it
| gets decently good. I can only afford a production machine, so
| all testing that can't be done with unit tests gets done there.
|
| Anyway I added a patreon in case anyone wants to toss a coin.
| spandrew wrote:
| All of my searches are turning up unrelated results ("college
| life after the pandemic", "post-pandemic teaching in higher
| education", "football news NFL" etc.)
|
| NFL one had 'some' decently related results, but the websites
| were all strangely disreputable.
| mmmpop wrote:
| > the websites were all strangely disreputable
|
| Interesting you'd feel that way when sites without "modern
| design" are encountered. Is this your own bias perhaps creating
| a judgment or are they sites that you already know have a bad
| reputation?
| typon wrote:
| Or perhaps the websites being returned are garbage? I have
| the same experience trying a few searches and following the
| top 5 links. Besides wikipedia, I haven't found a single
| useful website.
| spandrew wrote:
| This!
|
| Maybe modern or 'non-modern' web design just isn't a great
| litmus test for quality content? Could just need some work.
| At any rate I wasn't clicking on the results.
| abhinav22 wrote:
| Great work and congrats!
| skyfaller wrote:
| This is a fantastic search engine. It delivers on its promise of
| "serendipity". I found pages featuring my name that I'm not sure
| I've ever seen before, after many years of searching myself to
| test out search engines.
|
| Perhaps more importantly, it delivers the most correct result
| when searching for my username: the first result is not any of my
| social media accounts, or even my own blog, but the text of the
| obscure science fiction story that I took my username from! Well
| done.
|
| I've immediately added this as a search keyword in Firefox, and
| I'll be using it more in the future.
|
| Could meta search engines like DuckDuckGo include this as a
| source? Should they?
| voidnullnil wrote:
| How does this have 2.5K upvotes when every single HN related
| project needs JS and a quad core CPU (for the browser to open a
| blank page) to view a paragraph of text?
| pietroppeter wrote:
| from About page:
|
| > If you search for "Plato", you might for example end up at the
| Canterbury Tales. Go looking for the Canterbury Tales, and you
| may stumble upon Neil Gaiman's blog.
|
| I know it is just a suggestion, but had to try searching both,
| with no luck in getting the expected unexpected.
| marginalia_nu wrote:
| Yeah I did some work very recently aimed at improving the
| relevance a bit. It was a bit too random in the state it was
| before. Now it, perhaps, isn't random enough anymore.
| pietroppeter wrote:
| It looks very nice anyway, great job! I did try with other
| queries and results were in general interesting.
| fsflover wrote:
| See also: https://wiby.me/
| [deleted]
| Tade0 wrote:
| The "surprise me..." button is adequately labelled.
| twobitshifter wrote:
| Shades of stumbleupon.
| tpmx wrote:
| Great link to drag to the bookmark bar.
| scopio918 wrote:
| All search engines favor more text and less graphics.
| JohnFen wrote:
| Oh, this is brilliant! I think I'll make this my "first stop"
| search engine.
| bovermyer wrote:
| I adore this. Unfortunately, searching for my own name - with or
| without quotes - doesn't actually find my site.
|
| It does find a handful of references to me from over twenty years
| ago, though, which I thought was fascinating.
| tgv wrote:
| My name retrieved the "dead pornstar list". Unexpected.
| hop34s3w wrote:
| If the website is targeted towards an international audience then
| it's nice to have the first page link to content in English. All
| four links on the main page https://www.marginalia.nu/ lead to
| non-English content, which is not useful.
|
| Disclaimer: I am not a native English speaker. English is my
| second language.
| marginalia_nu wrote:
| Yeah my main site is a bit of a disorderly mess. It started as
| a Swedish blog, but I've since added a few services aimed at a
| global audience. Haven't quite figured out how to unify it all
| just yet.
| JohnJamesRambo wrote:
| Saving this forever. Thank you for making it.
| Valkhyr wrote:
| As a quick test, I searched for the name of one of my favorite
| game series: "Baldur's Gate" (on its own, no qualifiers, properly
| spelled - I would usually spell it "baldurs gate" on Google, but
| I decided to give this one the best chance). I search for info
| around video games a lot, so that's quite representative of a
| good chunk of my web searches, and I pretty much know the top
| sites Google would give me for that query (on its own, without
| any further qualifiers).
|
| The results were all either barely relevant, outdated (sites that
| covered the game back in the 90s/2000s before it was re-
| released), at best tangentially relevant or complete garbage
| noise. Some of the most highly relevant pages (such as the Steam
| store listing, the fandom wiki, the publisher/developer's forums
| for the re-releases, the Baldur's Gate 3 website and the
| subreddit) were not included at all. Those are all fairly text
| heavy by any reasonable standard, so I assume they were
| "punished" because they use JS? Would make sense that nearly all
| of them are way out of date.
|
| Then I searched specifically for "Baldur's Gate Wiki" but still
| out of luck - some results, but nothing vaguely Wiki-like.
|
| Finally I searched for "Baldur's Gate Fandom Wiki". This is
| basically "search engine easy mode", by giving essentially the
| name of the site I am looking for. I got ZERO results. At this
| point I gave up and decided that this thing is useless.
|
| Look, I'm all for unearthing good long-form content (in fact I
| would say that much of the content around this specific game
| would qualify), and I do get as annoyed at modern SPAs as the
| next grumpy neckbeard.
|
| I think considering both of those in a search engine is not a bad
| idea in and of itself. But I have to wonder what's the point of a
| search engine that weights some arbitrary aspect of web design
| higher than the relevancy of the subject matter (to the point of
| not returning any results at all)? In fact, considering that
| generally speaking more recent websites tend to include more
| scripting, you are intentionally skewing the results towards
| (very) old content, which is probably doing the user a
| disservice.
| matesz wrote:
| > But I have to wonder what's the point of a search engine that
| weights some arbitrary aspect of web design higher than the
| relevancy of the subject matter (to the point of not returning
| any results at all)?
|
| Because in some cases it returns arguably better/more to the
| point results, than other search engines - for example search
| for "Douglas Engelbart" or "Ted Nelson". I thought that I'd
| searched everything for those two, yet marginalia gave results
| I would otherwise never have seen.
| marginalia_nu wrote:
| This just isn't the place to go for promotional materials about
| upcoming video games. It's a niche search engine for
| discovering stuff off the beaten path, the stuff you _can't_
| find on mainstream search engines. Some of it is junk,
| admittedly, and not everyone will see the point, that's fine
| too.
|
| Despite what some people seem to think, it's never been meant
| as a google-replacement. I have never claimed otherwise.
| zvxczvvzxzcxzm wrote:
| This is great!
|
| I tried with "covid tyranny", and got some very interesting
| results I'd never get with any of the other search engines!
| m1117 wrote:
| Love it! I can punish my employees by setting this as a default
| search engine on their work laptops.
| kodeninja wrote:
| Ivermectin (marginalia):
| https://search.marginalia.nu/search?query=ivermectin+
|
| Ivermectin (Google): https://www.google.com/search?q=ivermectin
|
| The difference in the overall _thrust_ of the results is
| remarkable.
|
| Very interesting! Thanks for building it.
| typon wrote:
| The Google results tell you why Ivermectin is not a good
| replacement for vaccination against Covid, the Marginalia
| results tell you that Ivermectin is a miracle drug for treating
| Covid 19. Really shows how much technology has the power to
| change reality in today's world.
| lame-robot-hoax wrote:
| Google links to the FDA, CDC, WHO, NIH, WebMD, drugs.com, and
| a pro ivermectin journal article from the American Journal of
| Therapeutics.
|
| The Marginalia results point you mostly to random blogs.
| marginalia_nu wrote:
| Part of what I wanted to show with this project is that there
| is no such thing as an objective search engine. Even
| seemingly irrelevant technological decisions drastically
| impact the narrative.
| Shingbogle wrote:
| It's definitely not irrelevant. My first search was covid
| related because I knew the non-official, random person on
| the internet blog wouldn't have the money to create flashy
| sites.
| Drew_ wrote:
| Well the focus on text content isn't the only technical
| difference here. Google is obviously weighing hundreds of
| signals in its search results that your engine is not
| accounting for. These omitted signals are also relevant.
| sundarurfriend wrote:
| > These omitted signals are also relevant.
|
| Certainly. And sometimes they're relevant in a good way,
| sometimes in a bad user-hostile way. Every search engine
| requires discrimination and intelligent usage by the
| person doing the search, just in different areas.
| marginalia_nu wrote:
| Right, but that is still a technical decision on their
| side. They presumably don't sit down and have a meeting
| about what world view they should present. Well I hope
| they don't.
| kevin_thibedeau wrote:
| Google is actively fighting the spread of disinformation.
| You can see this clearly in the forced row of COVID PSA
| links on the YouTube front page that's been up for the
| last year regardless of whether you have any history
| viewing such content. There is manual intervention going
| on to prevent the garbage their normal algorithms will
| automatically surface. This is the greatest tragedy of
| the internet in that it allows people with crazy notions
| to find each other and build echo chambers with the aid
| of unbiased ML.
| silent_cal wrote:
| Lol!
| daxfohl wrote:
| Though before basing life-and-death decisions on this, consider
| reading the "about" page first:
| https://memex.marginalia.nu/projects/edge/about.gmi
|
| > The purpose of the tool is primarily to help you find and
| navigate the strange parts of the internet. Where, for sure,
| you'll find crack-pots, communists, libertarians, anarchists,
| strange religious cults, snake oil peddlers, really strong
| opinions.
|
| and
|
| > If you are looking for fact, this is almost certainly the
| wrong tool.
| lame-robot-hoax wrote:
| Yes, google returns results from the FDA, American Journal of
| Therapeutics pro Ivermectin study, WebMD, the CDC, the NIH,
| Wikipedia, the WHO, and New York Times.
|
| Marginalia returns results from Wikipedia, a faculty member's
| university blog regarding river blindness, a website called
| truthsummit promoting it as a miracle cure, a website called
| vaxxchoice promoting it as a cure, vitamindsstopcovid, etc.
|
| I'd say the quality of the results is quite different.
| motoxpro wrote:
| Second result: "Ivermectin, a miracle drug against Covid
| Ivermectin, a miracle drug against Covid. 100% effective as
| preventative and for early stage Covid. Over 90% cut in
| fatality rate for late-stage cases.
|
| https://truthsummit.info/blog/ivermectin-against-covid.html "
|
| Eh I think I'll take the google search results on this one.
| pkamb wrote:
| Great results for "sauna". Lots of Web 1.0 pages discussing
| building plans and displaying pictures of individually built,
| traditional, unique, old saunas on some property.
|
| The Google results are all blogspam or sales pages for cheap
| shipped saunas. Lots of "IR" results. Phony health benefit pages.
| Stock photos solely of beautiful new hotel gyms.
|
| I've noticed this problem with Google results for quite some
| time. Sadly, the _new_ content being created of the top variety
| is mostly being done within private Facebook groups that can't
| be easily searched, linked, or archived.
| sireat wrote:
| Fantastic idea and it works quite well for short phrases that I
| tried.
|
| As expected I am getting a lot of early 2000s sites which is
| something that I miss on regular Google.
|
| Hilariously searching for "array data structure" got me one of
| the top results this little tiny page:
| http://infolab.stanford.edu/~backrub/google.html
| marginalia_nu wrote:
| > We have designed Google to be scalable in the near term to a
| goal of 100 million web pages
|
| Funny, that's about where I see my search engine capping out as
| well.
| rfrey wrote:
| This is stunning. I searched "winemaking" because it's my latest
| obsession, and turned up dozens of links to high-quality pages
| I'd never seen despite spending an hour a day for three months
| cruising Google on the topic.
|
| Please do announce it here if you ever decide to solicit help or
| contributors. My stab at this problem was to have a search index
| of only ad-free pages, on the hypothesis it would turn up self-
| hosted blogs, university personal pages, that sort of thing. But
| the results were too thin, your approach is much better.
| winddude wrote:
| hmm, I dream of recipes search engine that punishes recipes pages
| with too much text. lol
| mint2 wrote:
| Yeah, recipes sites have both too much text and too many
| pictures.
|
| But they do illustrate what this search engine needs to watch
| out for. If they rank more text higher and their search site
| becomes popular, won't everyone just spam recipe site word
| salad, maybe even ai generated word salad.
|
| But in the interval, until that day comes, they are going to
| have a very useful service.
| Paul_S wrote:
| Looking for an arm assembly instruction, instead I get this
| strange website as the result
| http://mailstar.net/coronavirus.html
|
| Is that accidental or is this website promoted because it's text
| heavy and will surface for any search without many results?
| marginalia_nu wrote:
| Looks like that page just has an absurd amount of keywords.
| Those sometimes surface when there aren't any good results.
| Haven't found a foolproof detection method that doesn't
| unjustly punish innocent pages with large amounts of content.
| tbojanin wrote:
| this is sweet
| gen_greyface wrote:
| Hi, It'd be nice if you could add an OpenSearch description
| document for your site.
|
| https://developer.mozilla.org/en-US/docs/Web/OpenSearch
| gen_greyface wrote:
| until then i'll keep the site bookmarked. :-)
| SlapperKoala wrote:
| I like the idea, but it could use some tweaking. I keep getting
| conservative Christian websites for some reason, and foreign
| language sites.
| josefresco wrote:
| If you like wacky search engines, there's also Million Short:
| https://millionshort.com where you can search and remove the top
| 100/1K/10k/100K/1M results.
| ephbit wrote:
| Quoted from the linked site:
|
| > Convenience functions have been added, and the search engine
| can now perform simple calculations and unit conversions. Try 1
| pint in cubic centimeters, or 50+sqrt(pi). This functionality is
| still under development, be patient if it doesn't work.
|
| Why would you make any ever so small effort to implement
| calculations? I don't get it.
|
| If your search engine enabled me to find more useful search
| results to my queries than google or yacy or whatever, I wouldn't
| care one tiny bit about being able to do calculations with it.
|
| Why not focus on the search functionality?
| marginalia_nu wrote:
| I implemented calculations because easily 80% of my google
| queries are calculations, unit conversions, etc.
|
| Search functionality is a larger priority. Calculations and unit
| conversions were an afternoon's break from the search
| functionality :-)
| ephbit wrote:
| Ok, I guess people use google (or search engines in general)
| differently ... I rarely if ever use a search engine to
| calculate stuff or do unit conversions.
|
| I use google only to search and when ecosia/bing doesn't
| return anything useful.
| exporectomy wrote:
| How else do you do unit conversions? I use Google because it's
| far easier than any other software I've tried. Mainly because
| it's more forgiving of errors. It knows that "34 fset in
| msters" is 10.3632 meters. This search engine isn't, though, so
| I wouldn't waste time trying to discover its unit conversion
| syntax rules.
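|
| A minimal sketch of what typo-tolerant unit conversion could look
| like, using the "34 fset in msters" example above. The unit table
| and the names resolve_unit and convert are assumptions made for
| illustration; this is not how Google or this search engine
| actually does it.

```python
import difflib

# Illustrative table: how many meters one of each unit is (assumed subset).
METERS_PER_UNIT = {"feet": 0.3048, "meters": 1.0, "inches": 0.0254}


def resolve_unit(word):
    """Map a possibly misspelled unit name to the closest known unit."""
    matches = difflib.get_close_matches(
        word.lower(), list(METERS_PER_UNIT), n=1, cutoff=0.6
    )
    if not matches:
        raise ValueError(f"unknown unit: {word}")
    return matches[0]


def convert(value, from_word, to_word):
    """Convert a length by going through meters as the common base unit."""
    src = resolve_unit(from_word)
    dst = resolve_unit(to_word)
    return value * METERS_PER_UNIT[src] / METERS_PER_UNIT[dst]


print(round(convert(34, "fset", "msters"), 4))  # 10.3632
```

| Fuzzy matching against a fixed unit vocabulary is only one way to
| be forgiving; a larger table would need a stricter cutoff to
| avoid confusing similar unit names.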
| samhh wrote:
| On macOS for example I'd use Spotlight.
| soheil wrote:
| You can also use the Chrome address bar without hitting the
| enter, just start typing
| drusepth wrote:
| Interesting approach.
|
| I always search myself on new search engines to compare the
| results. Most engines return my personal blog/website,
| books/stories I've written, news stories, my github
| projects/contributions, social links, etc.
|
| This search engine surfaces just three obscure IRC logs that
| contain my nick in join/part messages (nothing said from me!)
| from 2009. And nothing else.
|
| There's probably some things this approach is really good at but
| I'm not sure what they'd be for me off hand. Always cool to see
| new approaches to search, though.
| fhackernewz wrote:
| fuck you hacker news
| fsckboy wrote:
| I've read most of the comments here and people are evaluating the
| search results: all good information.
|
| I'm looking at "punishes modern web design"... This thing IS
| modern web design. I think it's called "marginalia" in reference
| to the huge margins they chose!
|
| I'm using a browser on a linux desktop and side-by-side, HN's
| page design is old-fashioned tasteful making pretty good use of
| space, and marginalia has a font that's more than twice the 2D
| pointsize and is so spread out with whitespace that the "Tips" on
| the home page are off the bottom of my window.
| gtmb wrote:
| As everything in life flows in cycle, I predict the search engine
| that will de-throne Google will be like Google when it started -
| a simple variation of page rank.
|
| No smarts, no bubble, no signals decided by over fitting to a
| biased engineer preference.
| __MatrixMan__ wrote:
| I agree, except it'll optionally accept the ID of your node in
| a web of trust, and it'll use a page rank customized for you.
|
| Or you can put in two ID's and have it find sources that both
| parties trust.
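|
| A rough sketch of the "page rank customized for you" idea,
| assuming it means personalized PageRank: the random jump goes
| back to the nodes you trust instead of to a uniformly random
| page. The function name, the toy graph and the damping value are
| illustrative assumptions.

```python
def personalized_pagerank(links, trusted, damping=0.85, iterations=50):
    """Power iteration where teleportation lands only on trusted nodes.

    links:   dict mapping a node to the list of nodes it links to
    trusted: non-empty set of nodes the user trusts (the "web of trust")
    """
    nodes = set(links) | {t for outs in links.values() for t in outs}
    nodes |= set(trusted)
    teleport = {n: (1.0 / len(trusted) if n in trusted else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            outs = links.get(n, [])
            if outs:
                share = damping * rank[n] / len(outs)
                for target in outs:
                    nxt[target] += share
            else:
                # Dangling node: hand its rank back to the trusted set.
                for target in nodes:
                    nxt[target] += damping * rank[n] * teleport[target]
        rank = nxt
    return rank


graph = {
    "me": ["blog"],
    "blog": ["wiki"],
    "wiki": ["blog"],
    "spam": ["spam2"],
    "spam2": ["spam"],
}
ranks = personalized_pagerank(graph, {"me"})
print(sorted(ranks, key=ranks.get, reverse=True))
```

| For the two-ID case, one option is to run this once per party and
| keep, for every page, the smaller of the two scores, so a page
| only ranks well if both parties' trust reaches it.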
| jerrre wrote:
| I wouldn't say the existence of this page proves your
| prediction right (as it's not dethroning Google anytime soon).
|
| It's easy to forget that the goal of Google isn't to provide a
| useful search engine (at least not anymore), but the search
| engine is a by product of them wanting to show ads.
| marcos100 wrote:
| If Google isn't useful then nobody will use it.
|
| The search engine and the ads are tightly coupled. A better
| search engine means it can predict with more accuracy what
| you are looking for and can serve you an even more targeted
| ad that increases the chance you'll click.
| rchaud wrote:
| > If Google isn't useful then nobody will use it.
|
| Or they'll continue to use it out of sheer inertia. Google
| is paying Apple $15 billion to keep its place as iOS
| default search engine.
|
| IE6 didn't die overnight when Firefox arrived.
| popcube wrote:
| Now Google tries to imitate a document system on your computer;
| usually I rely on Google knowing what I need :(
| antupis wrote:
| As a dev I would love a search engine which would only search
| Stack Overflow, GitHub issues, documentation, etc.
| axelroze wrote:
| You can limit the search query per website in DDG (and
| probably in others)
|
| Example: `rust slow compilation site:stackoverflow.com`
| antupis wrote:
| Yeah but usually I want some set of sites not just
| stackoverflow.
| goodpoint wrote:
| ...especially if you could group online resources by category
| (e.g. software eng, cooking, ...)
| axelroze wrote:
| Wouldn't the dethroner of Google be some new technology which
| is not a search engine like Google but better at solving the
| original task of finding information on how to solve problems?
|
| Just like how iPad dethroned Windows PCs for average home user
| but not Mac because Windows had the monopoly and then an
| innovation destroyed MS in this space and not a competitor.
|
| I don't think Google dethrones Yahoo and AltaVista scenario
| will occur again.
| gverrilla wrote:
| > iPad dethroned Windows PCs for average home user
|
| is this true? in the US, perhaps? because in south america it
| couldn't be further from the truth - didn't happen at all
| amznbyebyebye wrote:
| Wow. Love this.
|
| Searched for "Ramanujan", one of my heroes.
|
| Found this gem- https://math.ucr.edu/home/baez/ramanujan/
|
| Ramanujan's "easiest" formula.
|
| Awesome!!
| amelius wrote:
| Question: how do we _benchmark_ search engines? Are there any
| groups attempting to provide (open) solutions in this space?
|
| (It seems to me that if you want to build a good search engine,
| this is the question you need to address first.)
| X6S1x6Okd1st wrote:
| The search term you might be looking for is "information
| retrieval" there are pretty standard measurements for whether
| you are getting good results, but they are generally
| conditioned on stuff like click through rate, comparing to
| expert ranking and other signals that the user gives you that
| it was a good or bad return of search results.
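|
| As a small illustration of the "comparing to expert ranking"
| style of measurement, here is a sketch of two standard
| information retrieval metrics, precision@k and nDCG@k. The
| document IDs and relevance grades are made up for the example.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that a judge marked as relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def dcg(gains):
    """Discounted cumulative gain: later positions count for less."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked, grades, k):
    """DCG of the engine's ranking divided by the DCG of the ideal ranking.

    grades: dict mapping a document to an expert relevance grade (0 = bad)
    """
    gains = [grades.get(doc, 0) for doc in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for one query.
grades = {"a": 3, "b": 1, "c": 2}
print(precision_at_k(["a", "x", "c", "y"], set(grades), 4))
print(ndcg_at_k(["b", "c", "a"], grades, 3))
```

| Averaging such scores over a fixed set of judged queries gives a
| repeatable benchmark, at the cost of needing the judgments.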
| lightsurfer wrote:
| thank you!
| NuNotNon wrote:
| I have an interest in logic and CS curricula, and I like geneses
| in general (in the last few days I've read Russell's introduction
| to mathematical philosophy and an ACM report on the CS
| curriculum). I searched for "cs curriculum" and this is the
| first link: https://www.cs.rice.edu/~vardi/sigcse/ It feels so
| good to receive good answers so easily. Thanks.
| megraf wrote:
| Thank you so much. This is wonderful
| timdaub wrote:
| I'm developing a text-heavy site and philosophically I'm trying
| to view documents as just that... documents [1].
|
| But I don't get good results for "rug pull".
|
| - 1 https://rugpullindex.com
| marginalia_nu wrote:
| Yeah it's hosted by cloudflare. I'm currently IP-blocking them,
| because they keep prompting my crawler with a captcha,
| presumably because it's made millions of requests from their
| CDN.
|
| Some rigmarole getting recognized as a good bot by the CDNs.
| I've submitted a request fairly recently, but haven't heard
| back from them yet.
|
| Like I would like to be on good terms with them, and other
| websites that block small independent crawlers.
|
| I can't blame them though, there's a lot of bad bots out there.
| But I'm doing my best not to be part of the problem.
| [deleted]
| petercooper wrote:
| Aha, I was going to ask how you were coping with CDNs like
| Cloudflare blocking bots. It's sad we've got to this point
| where basically only the established search engines are
| grandfathered in to be able to crawl sites.
| BugsJustFindMe wrote:
| > _I'm developing a text-heavy site_
|
| I looked at the source for your site's front page. That's not
| text-heavy; that's markup-heavy. I didn't bother looking at the
| rest of the pages because it appears to be yet another crypto
| market site.
| greggturkington wrote:
| Wouldn't this just skew towards really old sites?
|
| The _third_ search result for "dog" is this page on how to
| remove AOL Instant Messenger, published in 2002.
|
| https://sillydog.org/netscape/kb/removeaim.html
|
| No one wants to see newsletter signup popovers, but "modern web
| design" includes good performance and relevant content. (The
| search engine itself takes about 2 seconds to first contentful
| paint, not great.)
| bityard wrote:
| This search engine pretty much takes everything that Google is
| doing and does the opposite. For instance, Google has decided
| that "relevant" usually also means "recent". Thus, when
| searching for something on Google, you mainly get results from
| blogspam farms and almost never do you see anything more than a
| few years old.
|
| An implication of this is that old sites tend to disappear
| (either into obscurity or by being taken down) because Google
| penalizes them in search rankings. The author of this search
| engine says, however:
|
| > If a webpage has been around for a long time, then odds are
| it has fundamental redeeming quality that has motivated keeping
| it around all for that time.
|
| I don't know that I agree 100% with this (there was lots of
| crap on the "old" web too), but it makes a certain amount of
| sense.
| greggturkington wrote:
| What "fundamental redeeming quality" about uninstalling AIM
| from Windows 3.x motivated making that the 3rd result for
| "dog"?
|
| The 5th result is a tutorial on CSS. This search engine
| decided it's relevant because it has "dog" in the URL. Is
| that a better reasoning than Google's?
| https://htmldog.com/guides/css/beginner/
|
| Core Web Vitals ranks sites higher that perform well. Text-
| heavy sites that are also optimized and relevant would
| already perform well.
| marginalia_nu wrote:
| What are you searching for when you enter the query "dog",
| keeping in mind the search engine deliberately does not
| examine synonyms and deliberately seeks out the path
| less taken?
|
| Dog facts? Then search "dog facts"
|
| Famous dogs? Then search "famous dogs"
|
| Rappers? Try "snoop dogg"
| [deleted]
| greggturkington wrote:
| I'm searching for information on "dog".
|
| Your suggestion of "dog facts" returns 6 pages from the
| same domain, dogquotes.com. It's unreadable on mobile
| because it's so old, all the facts are unsourced, and
| often wrong:
|
| > Never assume that a barking dog won't bute _[sic]_ ,
| unless you're absolutely certain the dog believes it too.
|
| Also on the 1st SERP, this odd blog post ranting about
| 4th amendment rights [1], "Media Glamorization of the
| Psychopath" [2], and this (image-heavy) page about
| dolphin encounters in the Bahamas ("Sea Dog Facts" is a
| link on the page) [3].
|
| 1. http://www.rexcurry.net/drugdogsdan.html
| 2. https://www.metaphoricalplatypus.com/articles/psychology/psychopathysociopathy/media-glamorization-of-the-psychopath/
| 3. https://www.dolphinencounters.com/education/
| samsaga2 wrote:
| Where does the data come from? Do you index the whole web
| yourself? I see it totally impossible for a personal project. I'm
| very curious about that.
| marginalia_nu wrote:
| I do indeed index the web myself. Not the _entire_ web, just a
| subset of it. The crawler quickly loses interest in
| javascript:y websites and only indexes at depth those websites
| that are simple. It also focuses on websites in English,
| Swedish and Latin and tries to identify and ignore the rest
| (best-effort).
|
| You'd be surprised how much you can do with modern hardware if
| you are scrappy. The current index is about 17.7 million URLs.
| I've gone as far as 50 million and could _probably_ double that
| if I really wanted to. The difficulty isn't having a small
| enough index, but rather having a relevant enough index,
| weeding out the link farms and stuff that just take space.
|
| I only index N-grams of up to 4 words, carefully chosen to be
| useful. The search engine, right now, is backed by a 317 GB
| reverse index and a 5.2 GB dictionary.
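|
| For readers unfamiliar with the term, a minimal sketch of word
| N-grams of up to 4 words. It enumerates every candidate; how the
| useful ones are "carefully chosen" is not described here, so no
| selection step is shown. The function name is an assumption.

```python
def word_ngrams(text, max_n=4):
    """Yield every run of 1..max_n consecutive words from the text."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            yield " ".join(words[start:start + n])


for gram in word_ngrams("fall of the roman empire"):
    print(gram)
```

| A five-word phrase already produces 14 candidates, which hints at
| why pruning matters for keeping the index small.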
| omoikane wrote:
| > It also focuses on websites in English, Swedish and Latin
| and tries to identify and ignore the rest
|
| When I search for Japanese terms, it says "<query> needs to
| be a word", which wasn't the best error message. Maybe the
| error message should say something like "sorry, your language
| isn't supported yet"?
| marginalia_nu wrote:
| I've rephrased the wording for that one a bit.
| throwaway47292 wrote:
| Amazing!
|
| I have only one recommendation that might make the search a
| bit more relevant, e.g when searching for 'linux locking' or
| 'kernel locking' kind of things.
|
| Try to upsort things that match near the top of the content,
| like the top of the man page vs middle vs bottom.
|
| One easy way to do it without having to store the positions
| is to index the ngrams with min(floor(sqrt(line number)), 8);
| this will cover the first 64 lines. You can also use log() or
| just decide ad hoc (top, middle, bottom of the document), so
| you can use only 3 values.
|
| e.g. https://www.kernel.org/doc/html/v5.0/kernel-
| hacking/locking.... would do unreliable_1 guide_1 locking_1
| ... then at line 4 kernel_2 locking_2 ... after line 50 ...
| then_7 ... and after that everything will be _8.
|
| Then just rewrite the query "kernel locking" to "dismax(kernel_1
| OR kernel_2 OR kernel_3...) AND dismax(locking_1 OR locking_2
| ...)" with some tiebreaker of 0.1 or so. You can also say "I
| want to upsort things on the same line, or a few lines apart"
| by modifying the query a bit.
|
| It works really well and costs very little in terms of space.
| I tried it at https://github.com/jackdoe/zr while searching
| all of stackoverflow, man pages, etc., and was pretty
| surprised by the result.
|
| This approach is a bit cheaper than storing the positions
| because positions are (let's say) 4 bytes per term per doc,
| while this approach has a fixed upper bound cost of 8*4 per
| document (assuming 4 byte document ids) plus some amortized
| cost for the terms.
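|
| A small sketch of the bucketing scheme described above, assuming
| the bucket is the integer square root of the 1-based line number
| capped at 8, and assuming a bucket weight of 1/bucket for
| scoring. The names and the weighting are illustrative; the linked
| project may do this differently.

```python
import math

MAX_BUCKET = 8


def line_bucket(line_number):
    """Map a 1-based line number to a position bucket between 1 and 8."""
    return min(math.isqrt(line_number), MAX_BUCKET)


def index_document(lines):
    """Return the set of 'term_bucket' tokens for one document."""
    postings = set()
    for number, line in enumerate(lines, start=1):
        bucket = line_bucket(number)
        for word in line.lower().split():
            postings.add(f"{word}_{bucket}")
    return postings


def score(postings, query, tiebreaker=0.1):
    """AND over query terms; each term is a dismax over its buckets."""
    total = 0.0
    for word in query.lower().split():
        weights = [
            1.0 / bucket  # assumed weighting: earlier buckets count more
            for bucket in range(1, MAX_BUCKET + 1)
            if f"{word}_{bucket}" in postings
        ]
        if not weights:
            return 0.0  # a missing term fails the AND
        best = max(weights)
        total += best + tiebreaker * (sum(weights) - best)
    return total


top = index_document(["Unreliable guide to kernel locking"])
bottom = index_document(["intro"] * 70 + ["kernel locking"])
print(score(top, "kernel locking"), score(bottom, "kernel locking"))
```

| With these assumptions a match on the first lines scores about
| eight times higher per term than a match past line 64, without
| storing any exact positions.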
| kews wrote:
| Do you know what proportion of the texty web instructs
| unknown crawlers to go away (or blocks them)?
| marginalia_nu wrote:
| It's hard to give numbers, it doesn't seem to be very many,
| but losing out on a few key sites does make a pretty big
| impact.
|
| You see stuff like this sometimes, makes me a bit sad.
|
| https://linux.die.net/robots.txt
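|
| For context, a sketch of how a polite crawler can check such a
| file with Python's standard urllib.robotparser. The rules below
| are a hypothetical example of an allowlist-style robots.txt, not
| the actual contents of the file linked above, and the crawler
| name is made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one big crawler is allowed, everyone else is not.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits this fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.org/man/1/ls"
print(may_fetch(EXAMPLE_ROBOTS_TXT, "Googlebot", url))
print(may_fetch(EXAMPLE_ROBOTS_TXT, "small-indie-crawler", url))
```

| Files shaped like this are what lock out small independent
| crawlers while leaving the established engines untouched.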
| c0wb0yc0d3r wrote:
| How did you go about seeding your web crawler with URLs to
| crawl?
| marginalia_nu wrote:
| I just started with my website and did a crawl.
| Subsequently I've been seeding it with the best results
| from my previous crawls.
|
| It's a directed search so it doesn't seem to need a
| particularly solid seed to get decent results.
| c0wb0yc0d3r wrote:
| So how long did it take to get to 17 million URLs?
| dannyw wrote:
| Not OP, but if I was to do this, I'd start by downloading
| Wikipedia and all its external links and references, and
| crawling from there. You should eventually reach most of
| the publicly visible internet.
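|
| The seeding idea boils down to a breadth-first walk from a set of
| start URLs. A minimal sketch, assuming a get_links callable that
| returns the outgoing links of a page (a stand-in for real
| fetching and parsing, which is not shown):

```python
from collections import deque


def crawl(seeds, get_links, limit=1000):
    """Visit pages breadth-first from the seeds, up to `limit` pages."""
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never enqueue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for the web.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
print(crawl(["a"], lambda url: toy_web.get(url, [])))
```

| A real crawler would add politeness delays, robots.txt checks and
| some ranking of the queue, which is where a "directed" crawl
| differs from this plain walk.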
| c0wb0yc0d3r wrote:
| I feel a little embarrassed that I didn't think of
| something like that.
|
| When I did some crawler experimenting in my younger
| years, I thought I was pretty clever using sites that
| would let you perform random Google searches. I would
| just crawl all the pages from the results returned.
|
| Your method would undoubtedly be more interesting I
| think. It would certainly lead to interesting performance
| problems quicker, I bet.
| dannyw wrote:
| This is unbelievably impressive on a technical and ambition
| level for a solo, self-hosted hardware project. Kudos.
| jillesvangurp wrote:
| Cool, I've been thinking on this topic a bit lately. Crawling
| is indeed not that hard of a problem. Google could do it 23
| years ago. The web is a bit bigger now of course but it's not
| that bad. Those numbers are well within the range of a very
| modest search cluster (pick your favorite technology; it
| shouldn't be challenging for any of them). 10x or 1000x would
| not matter a lot for this. Although it would raise your cost
| a little.
|
| The hard problem is indeed separating the good stuff from the
| bad stuff; or rather labeling the stuff such that you can
| tell the difference at query time. Page rank was nice back in
| the day; until people figured out how to game things. And now
| we have bot farms filling the web with nonsense to drive
| political agendas, create memes, or to drown out criticism.
| Page rank is still a useful ranking signal; just not by
| itself.
|
| The one thing no search engine has yet figured out is
| reputability of sources. Content isn't anonymous mostly. It's
| produced and consumed by people. And those people have
| reputations. Bot content is bad because it comes from sources
| without a credible reputation. Reputations are built over
| time and people value having them. What if we could value
| people's appreciation relative to their reputability? That
| could filter out a lot of nonsense. A simple like button + a
| flag button combined with verified domain ownership (ssl
| certificates) could do the trick. You like a lot of content
| that other people disliked, your reputation goes down the
| drain. If you produce a lot of content that people like, your
| reputation goes up. If a lot of reputable people flag your
| content, your reputation tanks.
|
| The hard part is keeping the system fair and balanced. And
| reputability is of course a subjective notion and there is a
| danger of creating recommendation bubbles, politicizing
| certain topics, or even creating alternative reality type
| bubbles. It's basically what's happening. But it's mostly
| powered by search engines and social media that actually
| completely ignore reputability.
| silent_cal wrote:
| Wonderful work (':
| afrcnc wrote:
| except it doesn't actually return that many results
| lazybreather wrote:
| Amazing! How do I make this my search engine on browser? Not home
| page.
| winddude wrote:
| curious how do you afford the infrastructure? I found that the
| hardest part of running a search engine.
| marginalia_nu wrote:
| I'm self-hosting, and the server is a Ryzen 7 3900x with 128 GB
| of non-ECC RAM. It sits in my living room next to a cheap UPS.
| I did snag one of the last remaining Optane 900Ps off Amazon,
| and it powers the index and the database--and I really do think
| this is among the best hardware choices for this use case. But
| beyond that it's really nothing special, hardware-wise. Like
| it's less than a month's salary.
|
| It runs Debian, and all the services run bare metal with zero
| containerization.
|
| Modern consumer hardware can be absurdly powerful if you let
| it.
|
| Like I have no doubt a thousand engineers could spend a hundred
| times as much time building a search engine that did pretty
| much the same thing mine does, it would require a full data
| center to stay running and be much slower. But that's just a
| cost of large scale software development I don't have to pay as
| a solo developer with no deadline, no planning and a shoestring
| budget.
| kevinob11 wrote:
| I tested this with "Caribbean Vacation" and wow what a
| difference. Everything on Google is "TOP X LIST" and "BEST XYZ"
| which are just the worst when trying to find real interesting
| information about experiences you can have on vacation somewhere.
| I had used those as starting points then searched for long-form
| blogs of real experiences people have had. This surfaced those
| kinds of things immediately. I love it.
| jokoon wrote:
| I don't even want to imagine how google and other search engines
| crawl websites that make heavy usage of react or other ajax
| stuff. I don't want to be that guy.
|
| I wonder if some browser engineers are trying to have some ideas
| on how to find a solution on this. Personally, I would just make
| a browser that breaks backward compatibility, remove old
| features, etc. I guess browsers would be much lighter, fast and
| simple if some hard choices were made.
|
| Mozilla already decided to break some websites with the strict
| cookie policy. I wish they would do the same for everything else
| that sucks on the modern web.
|
| I honestly don't think I have much respect for "web developers".
| In a way I want mobile apps to kill the modern web, just to prove
| a point.
| yewenjie wrote:
| Related question - suppose I want to create a meta search engine
| for myself, and I want it to be as fast as possible. What are the
| things I should be optimizing for?
| FractalHQ wrote:
| Ok this is great if all I want to do is read text, but often
| times that is very much not all I want to do. The web is much
| more than text and images these days. I can appreciate this as
| long as it's branded as a search engine for blogs and articles
| specifically, as opposed to being touted as a drop-in replacement
| for the modern search engine.
| lukas099 wrote:
| Is this a criticism? It doesn't at all seem touted as a drop-in
| replacement for the modern search engine.
| FractalHQ wrote:
| I do think this is a great tool, I didn't intend to come off
| as being critical. I think I had some misdirected frustration
| from people on HN talking about how the modern web is bad and
| should be replaced with a text protocol. I aspire to create
| web apps that push the boundaries of what is possible today,
| and feel disappointed whenever I encounter people advocating
| for regression.
|
| But I digress. In hindsight, my comment was obnoxious and
| under-appreciative of the tool being shared, and my rant was
| only tangentially related.
| ahthat wrote:
| Very cool idea! Room for lots of improvement, keep working on it,
| I like the direction this is going.
| marginalia_nu wrote:
| I just got it working reasonably well just this week. I've had
| it "working" for a few months, but the results were always
| extremely chaotic, bordering on random.
| Phileosopher wrote:
| Wow, if this catches on, my original content will actually
| matter![1] I've always had a love-hate relationship with modern
| web design principles because my design choices have all the
| excitement and polish of what we get on HN.
|
| I'm sure I'm not the only one, either. Content-rich sites need
| more love.
|
| [1] https://adequate.life
| justinzollars wrote:
| It works. Nice job.
| chx wrote:
| Not sure what to do with this.
|
| https://search.marginalia.nu/search?query=gan+charger
|
| aside from nexperia none of this looks even remotely relevant.
| tejtm wrote:
| There is an open standard way for an engine like this to provide
| a mechanism for your standards aware browser to add the site as a
| alternative search with a click.
|
| That way I would not have to remember or bookmark just use my
| search bar as normal and choose which engine for this query or
| set it as default.
|
| https://developer.mozilla.org/en-US/docs/Web/OpenSearch
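|
| A sketch of what such an OpenSearch description document could
| contain, generated with Python's standard library. The element
| names and namespace come from the OpenSearch 1.1 format; the
| short name and description are placeholders, and the URL template
| is an assumption based on the search URLs posted in this thread.

```python
import xml.etree.ElementTree as ET

NS = "http://a9.com/-/spec/opensearch/1.1/"


def opensearch_description(short_name, description, template):
    """Build a minimal OpenSearch description document as a string."""
    ET.register_namespace("", NS)  # serialize with a default namespace
    root = ET.Element(f"{{{NS}}}OpenSearchDescription")
    ET.SubElement(root, f"{{{NS}}}ShortName").text = short_name
    ET.SubElement(root, f"{{{NS}}}Description").text = description
    ET.SubElement(
        root,
        f"{{{NS}}}Url",
        {"type": "text/html", "template": template},
    )
    return ET.tostring(root, encoding="unicode")


print(
    opensearch_description(
        "Marginalia",
        "Search engine for text-heavy sites",
        "https://search.marginalia.nu/search?query={searchTerms}",
    )
)
```

| Browsers discover the document through a link element with
| rel="search" and type "application/opensearchdescription+xml" in
| the page head.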
| NotAnOtter wrote:
| "Don't be afraid to scroll down in the search results, unlike in
| many other search engines, depending on what you are looking for,
| you may find the best results in the middle of the listing."
|
| This is a very polite way of saying "this engine isn't very good"
|
| Overall impressed with the project but I thought the word play
| there was funny
| marginalia_nu wrote:
| I felt I needed to add it to help people taught by other search
| engines that they only get 1-2 good results, and the rest is
| useless. The reason I'm providing a hundred results is that
| there are often a lot of results to choose from. If the point
| is to find something unexpected, and that indeed is the entire
| point, then that is the only sane design choice.
|
| Like you search for something on Google and similar, and you
| know what you are going to find. They are so good at searching
| the Internet and predicting what you are going to click on that
| you never see something new.
|
| It's a great feat of engineering, but a huge tragedy, because
| discovering new things, outside of what you or your
| demographic has previously demonstrated an interest in, can
| be absolutely life changing.
| jarbus wrote:
| This engine is fantastic for recipes
| NmAmDa wrote:
| Let's make the internet a great place for knowledge again. I
| really loved the engine and tried different terms, and I'm very
| happy. Good job.
| godshatter wrote:
| Instead of looking for something specific, I decided to try a
| category of some sort to see what came up. Thinking about
| Jeopardy categories, I tried "potent potables" and found a lot of
| random pages that may or may not have made sense given that
| category but that I had a lot of fun reading. Definitely a win
| for me.
| exabrial wrote:
| I would like a search that punishes 'modern' SPAs that load 87mb
| of the author's pet JS projects to display simple text. Basically
| every modern SPA.
| rc_mob wrote:
| blessings upon you sir for making this
| sabujp wrote:
| effort is good, but needs some work, no results here :
|
| https://search.marginalia.nu/search?query=rxjava+2+api+docs
|
| https://www.google.com/search?q=rxjava+2+api+docs&oq=rxjava+...
| marginalia_nu wrote:
| It is very much a work in progress, still struggling with some
| areas. I only really got into the territory of "sometimes
| actually useful" like this weekend. Wasn't planning on blowing
| up on HN just yet.
| rafael_c wrote:
| I liked this one... I searched for 'George Harrison' and among
| the first results there was a page with interesting comments
| about Harrison's solo career; someone reminiscing about the time
| they got to talk to him about guitars for half an hour at a bar
| at the airport; a transcript for an interview he gave on TV...
| Whereas on GOOGLE: an intrusive 'People also ask' box in which I
| was not interested; thumbnails for videos on youtube that I was not
| looking for; previews to garbage clickbaity news articles; and
| then finally for the search items: a bunch of websites for
| lyrics; his Instagram (!) and fb pages; his imdb page; some more
| news articles I was not looking for...
|
| Granted, google's web results above are perhaps what people are
| looking for 75% of the time, but how limiting and boring.
|
| I'm also a sucker for the simplistic text-centric, information-
| laden pages from the pre-facebook era.
|
| For 'global warming', however - since Marginalia excludes modern
| web-design pages - the results are of dubious relevance and
| interest, since they are, well, 'old'.
|
| I see myself using this engine a lot.
| leephillips wrote:
| This is wonderful and stupendous.
|
| I've often thought that Google could be turned back into a good
| search engine by simply eliminating the crap and letting the
| useful sites float to the top of the results.
|
| marginalia.nu seems to like my sites, so it must be good!
|
| Some results are prefixed with ! or an arrow dingbat. What does
| that mean?
| PieUser wrote:
| searching for covid gives a bunch of bogus crap of fake news
| isaacgreyed wrote:
| A common use case, how to do random thing in programming:
|
| I searched python make a bar chart and it returned a live coding
| video with an AI generated text transcript and two articles which
| mentioned a different kind of bar.
|
| I then narrowed it down to just python bar chart, and got a blog
| post about scripting with a bar chart in it, this
| http://www.nitcentral.com/voyager4/hellyear.htm with monty
| python, bars, and charts from 1996 and among some other things I
| found this https://python-
| course.eu/naive_bayes_classifier_introduction..., which had an
| example of a python bar chart even though the title of the page
| made me think it wasn't what I wanted.
|
| So for what I imagine to be a difficult search because of all the
| different meanings of the words, I found my result on the second
| query pretty quickly, and found some cool unrelated stuff too.
|
| I like mostly that I get what I type in, and not exactly what I
| want, but what I want is there too.
| CapmCrackaWaka wrote:
| I would probably use this if I wanted to find interesting blog
| posts/websites about a topic I want to learn more about in
| general. It seems less useful for returning exact answers to
| specific questions.
| jerhewet wrote:
| I use webcrawler.com, and IMO it's better than any other search
| engine for finding _exactly_ what I'm looking for. Not what's
| "trending", or "popular", or what the sheeple are searching for.
| It finds the _exact matching keywords_ that I'm looking for. No
| inference or other bullshit -- just the matches.
|
| Such a relief to not wade through oceans of worthless crap any
| more.
| api wrote:
| This is the most amazing thing I have seen on here in at least a
| year!
|
| It's... no... it can't be... a search engine that finds _actual
| information_ instead of 5 megabyte blobs of tracking code and SEO
| crap!
| optimalsolver wrote:
| I predict it will return a disproportionate amount of sites by
| schizophrenic conspiracists.
| smoyer wrote:
| You can add this to Firefox as search engine option by right
| clicking on the URL and selecting "Add Marginalia". From there,
| setting it as your default search engine is done from the
| "Settings" panel as with other predefined search engines.
|
| I'm experimenting with using it as my default ...
| slim wrote:
| This is a search engine indexing the internet on a mariadb
| database hosted on consumer hardware maintained by a single
| person as a hobby and it does not suffer from HN hug of death
| swyx wrote:
| how on earth do you index so much on consumer hardware? my
| frontend developer mind is blown.
| foxfluff wrote:
| Wait till you learn that modern CPUs run _billions_ of cycles
| per second. With multiple cores in parallel! And they can
| reach transfer rates of tens of gigabytes per second to RAM,
| or around a terabyte per second into L3.
| ThalesX wrote:
| And then you add a single HTTP request and everything tones
| down to the speed of the web. Or I/O. Or DB call.
| goodpoint wrote:
| More like: you add some javascript library and now the
| browser needs 5 seconds to run 10 MB of javascript.
| knuthsat wrote:
| You can still support millions of these requests per
| second if you just bake all of the dependencies directly
| in a small OS running on your fastest raspberry pi.
| paxys wrote:
| Consumer hardware today is simply what was cutting-edge and
| crazy expensive 5 years ago.
| FriendWithMoon wrote:
| As a sufferer of Tinnitus, and having spent near 100 hours
| researching it, I found a few sites I had never seen offering
| great data and tools. Thank you
| marginalia_nu wrote:
| Honestly it's probably not a great source for medical advice.
| At least take what you read with a healthy grain of salt.
| necovek wrote:
| Too bad the search index is currently restricted to ASCII-only
| (or at least Cyrillic and Latin-2 characters were rejected as
| "contains characters that are not currently supported").
|
| I love the idea definitely, and I've long toyed around with
| building a similar thing that starts crawling off my own
| bookmarks (a personal small-deep-web if you wish).
|
| I also love the "Small Web" name: this is the first I hear of it,
| and it's what I've long complained about -- the web today hides
| all of the cool gems search engines of old would have given you!
|
| I am also a bit split on the "www" prefix restriction (iiuc,
| domains which do not have "www" subdomain too are dropped from
| the index because many of them are spammy): it might for sure be
| a useful heuristic, but I've advocated for dropping "www" back in
| late 90s and early 2000s already (one reason being that for, e.g.,
| Serbian, "w" is not in the alphabet, so you can't reasonably
| quote it as Serbian is otherwise a phonetic-language).
| talrand wrote:
| Gave it a go with two different queries. The first I chose was
| "amazon vendor services"; it didn't get a single result about the
| topic.
|
| The second query was a nation+city (in the nation). Got a lot of
| results that were in no way related to either.
|
| It seems to be biased towards IT topics (based uniquely on the
| two queries).
| deadalus wrote:
| Very interesting because of the interesting results from random
| websites. It's a great discovery tool.
|
| Now hoping for a search engine that favors text-heavy sites and
| punishes paywalls
| the__alchemist wrote:
| I built one!
| billyharris wrote:
| Search engines always like websites with more text and less
| graphics.
| mattchew wrote:
| Oh, I dream of a day where there are multiple useful search
| engines, specialized for different purposes.
|
| You're doing God's work here. Thanks and good luck.
| scns wrote:
| The Flying Spaghetti Monster wants to have a word with you.
|
| (edit) That is a nice dream though.
| brian_herman wrote:
| Kind of reminds me of the past, like AltaVista and Dogpile.
| BoxOfRain wrote:
| I wonder if there's any mileage in an extension of something
| like uBlock Origin's lists of ad networks to block but instead
| it's a list of known content mills and SEO spam factories to
| remove from search results?
| high_byte wrote:
| I'd like a chrome extension that marks links that target text-
| heavy vs "modern" so I know beforehand what to expect - paywall,
| ads, popups, clickbaits, etc.
| kebos wrote:
| This is really cool, it filters out all fluff.
|
| It's not always taking me to totally relevant sites but the
| results contain my favourite type of content.
|
| Full of _writing_ and pure html - usually the hallmark of someone
| who knows what they are doing, wants to communicate but doesn't
| want to waste their time.
| SimplGy wrote:
| Searched for "chocolate chip cookie recipe"
|
| The first result had a recipe where I could see both the recipe and
| the directions on a single page: no ads, no scrolling, no fake SEO
| anecdotes about kids and grandmas.
|
| (Pls make the search query box fit small mobile devices)
|
| Great project idea!
___________________________________________________________________
(page generated 2021-09-17 23:01 UTC)