[HN Gopher] A search engine that favors text-heavy sites and pun...
___________________________________________________________________
A search engine that favors text-heavy sites and punishes modern
web design
Author : Funes-
Score : 1765 points
Date : 2021-09-16 12:16 UTC (10 hours ago)
(HTM) web link (search.marginalia.nu)
(TXT) w3m dump (search.marginalia.nu)
| ryankrage77 wrote:
| The results are fantastic, but I can't see how the excerpts
| relate to the search term.
|
| For example, a search term for 'Scotichronicon' returns some
| fascinating results, but the search term itself doesn't appear in
| the title or excerpts of most of the results.
|
| This makes it harder to judge how relevant they are.
| marginalia_nu wrote:
| The excerpts are static and very best effort. You just have to
| visit the website and find out I'm afraid.
|
| I can do a lot with what I have, but I can't do full text
| search on millions of documents with dynamic excerpts off a
| single computer in my living room.
| snakeboy wrote:
| Wow, that's awesome. Great work!
|
| For a simple test, I searched "fall of the roman empire". In your
| search engine, I got wikipedia, followed by academic talks,
| chapters of books, and long-form blogs. All extremely useful
| resources.
|
| When I search on google, I get wikipedia, followed by a listicle
| "8 Reasons Why Rome Fell", then the imdb page for a movie by the
| same name, and then two Amazon book links, which are totally
| useless.
| adventured wrote:
| I did a search for "George Washington"
|
| First result after Wikipedia:
|
| "Radiophone Transmitter on the U.S.S. George Washington (1920)
|
| In 1906, Reginald Fessenden contracted with General Electric to
| build the first alternator transmitter. G.E. continued to
| perfect alternator transmitter design, and at the time of this
| report, the Navy was operating one of G.E.'s 200 kilowatt
| alternators http://earlyradiohistory.us/1919wsh.htm "
|
| Another result in the first few:
|
| " - VANDERBILT, GEORGE WASHINGTON
|
| PH: (800) ###-#233 FX: (#03) 641-5###.
| https://www.ScottWinslow.com/manufacturer/VANDERBILT_GEORGE_...
| "
|
| And just below that terrible result:
|
| "I Looked and I Listened -- George Washington Hill extract
| (1954)
|
| Although the events described in this account are undated, they
| appear to have occurred in late 1928. I Looked and I Listened,
| Ben Gross, 1954, pages 104-105: Programs such as these called
| for the expenditure of larger sums than NBC had anticipated. It
| be http://earlyradiohistory.us/1954ayl2.htm "
|
| Dramatically worse than Google.
|
| ---
|
| Ok, how about a search for "Rome" then? Surely it'll pull some
| great text results for the city or the ancient empire.
|
| First result after Wikipedia:
|
| "Home | Rome Daily Sentinel
|
| Reliable Community News for Oneida, Madison and Lewis County
| http://romesentinel.com/"
|
| The fourth result for searching "Rome":
|
| "Glenn's Pens - Stores of Note
|
| Glenn's Pens, web site about pens, inks, stores, companies -
| the pleasure of owning and using a pen of choice. Direcdtory of
| pen stores in Europe.
| http://www.marcuslink.com/pens/storesofnote/roma.html"
|
| Again, dramatically worse than Google.
|
| ---
|
| Ok, how about if I search for "British"?
|
| First result after Wikipedia:
|
| "BRITISH MINING DATABASE
|
| British_Mining_Database
| http://www.users.globalnet.co.uk/~lizcolin/bmd.htm "
|
| And after that:
|
| "British Virgin Islands
|
| Many of these photos were taken on board the Spirit of
| Massachusetts. The sailing trip was organized by Toto Tours.
| Images Copyright (c) Lowell Greenberg Home Up Spring Quail
| Gardens Forest Home Lake Hodges Cape Falcon Cape Lookout,
| Oregon Wahkeena
| http://www.earthrenewal.org/british_virgin_islands2.htm"
|
| Again, far off the mark and dramatically worse than Google.
|
| I like the idea of Google having lots of search competition,
| this isn't there yet (and I wouldn't expect it to be). I don't
| think overhyping its results does it any favors.
| burkaman wrote:
| This is not a Google competitor, it's a different type of
| search engine with different goals.
|
| > If you are looking for fact, this is almost certainly the
| wrong tool. If you are looking for serendipity, you're on the
| right track. When was the last time you just stumbled onto
| something interesting, by the way?
| JasonFruit wrote:
| Hobby project leads angry person to interesting and
| unexpected material; angry person remains angry. Details at
| six.
| fouric wrote:
| The project explicitly bills itself as a "search engine",
| not an "interesting and unexpected material surfacer".
| Moreover, projecting emotions like "angry" onto a comment
| in order to discredit the content of the comment (hey! is
| that an ad-hominem?) is just about exactly the opposite of
| the discussions that the HN mods are trying to curate, and
| the discussions that I like to see here.
| withinboredom wrote:
| In the early days of google, I found what I was looking
| for on page 5+. On the way, I'd discover many interesting
| things I didn't even know I was looking for, often
| completely unrelated to what I was searching for.
| kwertyoowiyop wrote:
| And now Google hides that more than one page even exists,
| as they populate their first page with buttons to ask
| similar questions and go to the first page of THOSE
| results.
| kews wrote:
| I miss those old days of even being permitted to go many
| pages in.
| allknowingfrog wrote:
| If you click through to the About page, I think you'll
| see that "interesting and unexpected material surfacer"
| is a fairly apt description of the project.
| adventured wrote:
| > Hobby project leads angry person to interesting and
| unexpected material; angry person remains angry.
|
| Not angry in the least. I'm thrilled someone is working on
| a search competitor to Google.
|
| I understand you're attempting to dismiss my pointing out
| the bad results by calling me angry though. You're focusing
| your content on me personally, instead of what I pointed
| out.
|
| The parent was far overhyping the results in a way that was
| very misleading (look, it's better than Google!). I tried
| various searches, they were not great results. The parent
| was very clearly implying something a lot better than that
| by what they said. The product isn't close to being at that
| level at this point, overhyping it to such an absurd degree
| isn't reasonable or fair to the person that is working on
| it.
|
| I would specifically suggest people not compare it to
| Google. Let it be its own thing, at least for a good while.
| Google (Alphabet) is a trillion dollar company. Don't press
| the expectations so far and stage it to compete with Google
| at this point. I wouldn't even reference Google in relation
| to this search engine, let it be its own thing and find its
| own mindshare.
| bityard wrote:
| > I'm thrilled someone is working on a search competitor
| to Google.
|
| Except the author goes to quite some lengths to explain
| that his search engine is not a competitor to Google, and
| is in fact exactly the opposite of Google in many ways:
| https://memex.marginalia.nu/projects/edge/about.gmi
| kwhitefoot wrote:
| What were you expecting to see for British? There must be
| millions of pages containing that term. Anyway the first
| screenful from Google is unadulterated crap, advertising
| mixed with the usual trivia questions.
|
| If you are going to claim something is wide of the mark then
| you really ought to tell us at least roughly where the mark
| is.
| duckmysick wrote:
| I checked the results of the same query and they seem fine.
| Lots of speeches and articles about George Washington the US
| president. There's even his beer recipe.
|
| As for the results you linked, it's part of the zeitgeist to
| list other entities sharing the same name. Sure, they could
| use some subtle changes in ranking, but overall the returned
| links satisfy my curiosity.
| Nition wrote:
| The Wikipedia link at the top is always given. It would maybe
| be good to make it a little clearer that it's not one of the
| true results.
| klntsky wrote:
| However, when searching for "haskell type inference algorithm"
| I get completely useless results.
| [deleted]
| klntsky wrote:
| Since it does not use synonyms, it looks like it is unable to
| answer "how's that thing called"-queries.
| burkaman wrote:
| That query is too long apparently. But if you shorten to
| "haskell type inference", I think it delivers on its promise:
|
| > If you are looking for fact, this is almost certainly the
| wrong tool. If you are looking for serendipity, you're on the
| right track. When was the last time you just stumbled onto
| something interesting, by the way?
| marginalia_nu wrote:
| The search engine doesn't do any type of re-ordering or
| synonym stuff, it only tries to construct different N-grams
| from the search query.
|
| So if you, for example, compare "SDL tutorial" with "SDL
| tutorials": on Google you'd get the same stuff; on this search
| engine, for better or worse, you don't.
|
| This is a design decision, for now anyway, mostly because
| I'm incredibly annoyed when algorithms are second-guessing
| me. On the other hand, it does mean you sometimes have to
| try different searches to get relevant results.
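A minimal sketch of what "constructing N-grams from the search query" could look like. This is not the engine's actual code; the function name `query_ngrams` and the `max_n` cutoff are assumptions made for illustration. It shows why "SDL tutorial" and "SDL tutorials" behave as different queries when no stemming or synonym expansion is applied.

```python
def query_ngrams(query, max_n=3):
    """Return contiguous word n-grams of the query, longest first.

    Hypothetical helper: no stemming, no synonyms, no re-ordering,
    so every term is matched exactly as the user typed it.
    """
    words = query.split()
    grams = []
    # Start with the longest n-grams so exact phrases are tried first.
    for n in range(min(max_n, len(words)), 0, -1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


print(query_ngrams("SDL tutorial"))
print(query_ngrams("SDL tutorials"))
```

Under these assumptions the two queries share only the unigram "SDL", which is consistent with the two searches returning different results.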
| ford_o wrote:
| Maybe list the synonyms under the query, so it's easier to
| try different formulations.
| Razengan wrote:
| It could simply become an option.
| OneLeggedCat wrote:
| Don't change it. It's good this way.
| leephillips wrote:
| I like this design decision. It pays you back for
| choosing your search terms carefully.
| LanceH wrote:
| It would be nice if we could pipe search engines.
| BenoitP wrote:
| Definitely; We could create a meta search engine that
| queries them all, in desktop application format.
|
| Let's name it after a famous old scientist, and maybe add
| the year to prove it's modern: Galileo 2021.
| overkalix wrote:
| ... is this Galileo 2021 a reference that I am not
| understanding?
| BenoitP wrote:
| Yup, but so far no one got it.
|
| There was such an app in the early 2000's, before Google
| went mainstream, and Altavista-like engines were not
| good: Copernic 2000.
|
| I guess I'm officially old now.
| tomerv wrote:
| FWIW, I got the reference. Maybe I'm old too?
| PaulHoule wrote:
| Meta search engines leave a bad taste in everyone's mouth
| because they've always failed. Here is why
|
| https://en.wikipedia.org/wiki/Arrow%27s_impossibility_the
| ore...
|
| You can't combine a few different ranked lists and expect
| to get results better than any of the original ranked
| lists.
| robrenaud wrote:
| > You can't combine a few different ranked lists and
| expect to get results better than any of the original
| ranked lists.
|
| I am skeptical of this application of the theorem. Here
| is my proposal:
|
| Take the top 10 Google and Bing results. If the top
| result from Bing is in the top 10 from Google, display
| Google results. If the top result from Bing is not in the
| top 10 from Google, place it at the 10th position. You'd
| have an algorithm that ties with Google, say, 98% of the
| time, beats it, say, 1.2% of the time, and loses 0.8% of
| the time.
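The proposal above can be sketched in a few lines. The function name `merge_results` is hypothetical, and one detail is an assumption: "place it at the 10th position" is read here as replacing the 10th Google result rather than growing the list to eleven entries.

```python
def merge_results(google, bing, k=10):
    """Combine two ranked lists following the rule described above.

    Assumption: a Bing top hit that Google's top k does not contain
    replaces Google's k-th result instead of being appended as an
    extra entry.
    """
    top_google = list(google[:k])
    if not bing:
        return top_google
    top_bing = bing[0]
    if top_bing in top_google:
        # Both engines agree closely enough: show Google unchanged.
        return top_google
    if len(top_google) < k:
        # Fewer than k Google results: just put Bing's hit last.
        return top_google + [top_bing]
    return top_google[:k - 1] + [top_bing]


google = [f"g{i}" for i in range(1, 11)]
print(merge_results(google, ["g3", "b2"]))  # Bing's top hit is already present
print(merge_results(google, ["b1", "g1"]))  # Bing's top hit takes slot 10
```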
| vikingerik wrote:
| Right. Arrow's theorem just says it's impossible to do it
| in _all_ cases. It's still quite possible to get an
| improvement in a large proportion of cases, as you're
| proposing.
| random314 wrote:
| Arrow's theorem simply doesn't apply here. We don't need
| our personalized search results to satisfy the majority.
| PaulHoule wrote:
| But in both cases you face the problem of aggregating
| preferences of many into one. In one case you are
| combining personal preferences in the other case
| aggregating 'preferences' expressed by search engines.
| PaulHoule wrote:
| I've had jobs tuning up the relevance of search engines
| with methods like
|
| https://ccc.inaoep.mx/~villasen/bib/AN%20OVERVIEW%20OF%20
| EVA...
|
| and the first conclusion is "something that you think
| will improve relevance probably won't"; the TREC
| conference went for about five years before making the
| first real discovery
|
| https://en.wikipedia.org/wiki/Okapi_BM25
|
| It's true that Arrow's Theorem doesn't strictly apply,
| but thinking about it makes it clear that the aggregation
| problem is ill-defined and tricky. (e.g. note also that a
| ranking function for full text search might have a range
| of 0-1 but is not a meaningful number, like a probability
| estimate that a document is relevant, but it just means
| that a result with a higher score is likely to be more
| relevant than one with a lower score.)
|
| Another way to think about it is that for any given
| feature architecture (say "bag of words") there is an
| (unknown) ideal ranking function.
|
| You might think that a real ranking function is the ideal
| ranking function plus an error and that averaging several
| ranking functions would keep the contribution of the
| ideal ranking function and the errors would average out,
| but actually the errors are correlated.
|
| In the case of BM25 for instance, it turns out you have
| to carefully tune between the biases of "long documents
| get more hits because they have more words in them" and
| "short documents rank higher because the document vectors
| are spiky like the query vectors". Until BM25 there
| wasn't a function that could be tuned up properly and
| just averaging several bad functions doesn't solve the
| real problem.
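For reference, a small sketch of the commonly cited Okapi BM25 scoring function, to make the length-bias trade-off concrete. The function name `bm25_scores`, the whitespace tokenizer, and the default values `k1=1.2` and `b=0.75` are assumptions for illustration (the IDF form with `+ 1` inside the logarithm is one widely used variant). The parameter `b` is the knob that balances "long documents get more hits" against "short documents rank higher".

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in a non-empty list against the query.

    b = 0 switches length normalization off; b = 1 applies it fully.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Longer-than-average documents get a larger denominator.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["rome fell", "rome " + " ".join(["filler"] * 20), "pens and inks"]
print(bm25_scores("rome", docs))         # the short document outranks the long one
print(bm25_scores("rome", docs, b=0.0))  # with b=0 the first two tie
```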
| gnramires wrote:
| That's an invalid application of this theorem. (It
| doesn't necessarily hold)
|
| Suppose there's an unambiguous ranked preference by all
| people among a set (webpages, ranking). Suppose one
| search engine ranks correctly the top 5 results and
| incorrectly the next 5 results, while another ranks
| incorrectly the top 5 and correctly the next 5.
|
| What can happen is that there may be no universally
| preferred search engine (likely). In practice, as another
| commenter noted, you can also have most users prefer a
| certain combination of results (that's not difficult to
| imagine, for example by combining top independent results
| from different engines).
| Torwald wrote:
| I need that with a simpler interface, so I call it after
| a famous detective: Sherlock.
| artificial wrote:
| _a magic pop sound is faintly audible as a new side
| project is appended to several lists_ Excellent, thank
| you!
| mkr-hn wrote:
| Likely trademark collision with this:
| https://www.galileo.usg.edu/
| gmueckl wrote:
| Not an app, but probably comes quite close in all other
| respects: https://metager.org
| foofoo4u wrote:
| Good comparison. Reminds me of an analogy I like to make of
| today's web, which is it feels like browsing through a magazine
| store -- full of top 10s, shallow wow-factoids, and baity
| material. I genuinely believe terrible results like this are
| making society dumber.
| bluGill wrote:
| what I really want is a true AI to search through all that
| and figure out the useful truth. I don't know how to do this
| (and of course whoever writes the AI needs to be unbiased...)
| rchaud wrote:
| The context matters. I'd happily read "Top 10" lists on a
| website if the site itself was dedicated to that one thing.
| "Top 10 Prog Rock albums", while a lazy, SEO-bait title,
| would at least be credible if it were on a music-oriented
| website.
|
| But no, these stories all come from cookie-cutter "new media"
| blog sites, written by an anonymous content writer who's
| repackaged Wikipedia/Discogs info into Buzzfeed-style copy
| writing designed to get people to "share to Twitter/FB". No
| passion, no expertise. Just eyeballs at any cost.
| foofoo4u wrote:
| This got me thinking that maybe one of the other big
| reasons for this is that the algorithms prioritize newer
| pages over older pages. This produces the problem where
| instead of covering a topic and refining it over time, the
| incentive is to repackage it over and over again.
|
| It reminds me of an annoyance I have with the Kindle store.
| If I wanted to find a book on, let's say, Psychology, there
| is no option to find all-time respected books of the past
| century. Amazon's algorithms constantly push to recommend
| the latest hot book of the year. But I don't want that. A
| year is not enough time to have society determine if the
| material withstands time. I want something that has stood
| the test of time and is recommended by reputable
| institutions.
| amenod wrote:
| > is that the algorithms prioritize newer pages over
| older pages.
|
| They do? That would explain a lot - but ironically, I
| can't find a good source on this. Do you have one at
| hand?
| dvogel wrote:
| It is pretty obvious if you search for any old topic that
| is also covered incessantly by the news. "royal family"
| is a good example. There's no way those news stories
| published an hour ago are listed first due to a high
| PageRank score (which necessarily depends on time to
| accumulate inbound links).
| RattleyCooper wrote:
| It depends on the content. The flip side is looking up a
| programming-related question and getting results from
| 2012.
|
| I think they take different things into account based on
| the thing being searched.
| rchaud wrote:
| Your Google search results show the date on articles do
| they not? If people are more likely to click on
| "Celebrity Net Worth (2021)" than "Celebrity Net Worth
| (2012)", then the algo will update to favour those
| results, because people are clicking on them.
|
| The only definitive source on this would be the
| gatekeeper itself. But Google never says anything
| explicitly, because they don't want people gaming search
| rankings. Even though it happens anyway.
| wwweston wrote:
| It's also possible that it's the other way around: a certain
| "common denominator" + algorithms that chase broad engagement
| = mediocre results.
|
| The real trick would be some kind of engine that can aim just
| above where the user's at.
| [deleted]
| hdjjhhvvhga wrote:
| As long as few people use it, it will be great. Rest assured
| that the moment it becomes popular, the people who want to game
| it will appear.
| Nextgrid wrote:
| I don't think the existing media-heavy websites are gaming
| Google to rank higher. It's that Google itself prefers media
| heavy content; they don't have to "game" anything.
|
| I also think a search engine like this would be quite hard to
| game. An ML-based classifier trained on thousands of text-
| heavy and media-heavy screenshots should be quite robust and
| I think would be very hard to evade, so the "game" will
| become more about how to _identify_ the crawler so you can
| serve it a high-ranking page while serving crap to the real
| users, and it seems fairly easy to defeat if the search engine
| does a second pass using residential proxies and standard
| browser user agents to detect this behavior (it could also
| threaten huge penalties like the entire domain being banned
| for a month to even deter attempts at this).
| fragmede wrote:
| With the advances in machine-generated text that looks right
| but isn't _quite_ accurate (aka GPT-3), it seems like
| it would be _easily_ gamed (given access to GPT-3). Even
| without GPT-3, if the content being prioritized is mere
| text, I'm sure that for a pile of money, I could generate
| something that looks like Wikipedia, in the sense that it's
| a giant pile of mostly text, but it would make zero sense
| to a human reader. (Building an SEO farm to boost ranking
| of not-wikipedia is left as an exercise for the reader.)
| jandrese wrote:
| This sort of optimization is why simple recipes are typically
| found at the end of a rambling pointless blog post now.
|
| Still, the best way to break SEO is to have actual
| competition in the search space. As long as SEO remains
| focused on Google there is an opportunity for these companies
| to thrive by evading SEO braindamage.
| SerLava wrote:
| That's not really for SEO, which favors readily accessible
| information.
|
| That's ads. When mobile users have to scroll past 10 ads,
| they'll click on some of them and make the blog money.
| ggggtez wrote:
| I've noticed this pattern start to pop up elsewhere. I've
| started to train my skimming skills, skipping a paragraph
| or two at a time to get past the fluff.
|
| Like an article about some current event will undoubtedly
| begin with "when I was traveling ten years ago...".
| zerd wrote:
| It's also because that's a way of trying to copyright
| protect recipes, which are normally not copyright
| protected.
|
| > "Mere listings of ingredients as in recipes, formulas,
| compounds, or prescriptions are not subject to copyright
| protection. However, when a recipe or formula is
| accompanied by substantial literary expression in the form
| of an explanation or directions, or when there is a
| combination of recipes, as in a cookbook, there may be a
| basis for copyright protection."
| YeGoblynQueenne wrote:
| >> This sort of optimization is why simple recipes are
| typically found at the end of a rambling pointless blog
| post now.
|
| I continue to be curious about this kind of complaint. If
| all you want is a recipe list, without any of the fluff,
| why would you click on a link to a blog, rather than on a
| link to a recipe aggregator?
|
| Foodie blogs exist specifically for the people who want a
| foodie discussion and not just an ingredients' list.
|
| Is it because blogs tend to have better recipes overall? In
| that case, isn't there a bit of entitlement involved in
| asking that the author self-sacrificingly provides only the
| information that you want, without taking care of their own
| needs and wants, also?
| Loughla wrote:
| It's the same thing that people always complain about.
| This thing is not in a format that I like, so it must be
| not what anyone likes.
|
| If you want JUST recipes, pay money instead of just
| randomly googling around. America's test kitchen has a
| billion, vetted, and really good recipes. That solves
| that problem.
| joegahona wrote:
| I think the complaint is that those blogs rank higher
| than nuts-and-bolts recipes now. It wasn't that way a few
| years ago. Yes, scrolling down the results to Food
| Network or Martha Stewart or whatever is possible, as is
| going directly to those sites and using their site
| search, but it's noticeable and annoying.
| jandrese wrote:
| Because when you search for a recipe you get the link to
| the blog, not the aggregator.
| WorldMaker wrote:
| That sort of recipe blog hasn't happened just for SEO. It's
| also a bit of a "two audiences" problem: if you are coming
| to that food blogger from a search you certainly would
| prefer the recipe first and then maybe any commentary on it
| below if the recipe looks good. If you are a regular reader
| of that food blogger you are probably invested in the
| stories up top and that parasocial connection and the
| recipes themselves are sometimes incidental to why you are
| a regular reader.
|
| You see some of that "two readers" divide sometimes even in
| classic cookbooks, where "celebrity" chefs of the day might
| spend much of a cookbook on a long rambling memoir.
| Admittedly such books were generally well indexed and had
| table of contents to jump right to the recipes or
| particular recipes, but the concept of "long personal
| ramble of what these recipes mean to me" is an old one in
| cookbooks too.
| giantrobot wrote:
| > If you are a regular reader of that food blogger
|
| I think this assumes facts not in evidence. It certainly
| seems like an overwhelming number of "blogs" are not
| actual blogs but SEO content farms. There's no regular
| readers of such things because there's no actual authors,
| just someone that took a job on Fiverr to spew out some
| SEO garbage. Old content gets reposted almost verbatim
| because newer results rank better according to Google.
|
| The only reason these "blogs" exist is to show ads and
| hopefully get someone's e-mail (and implied consent) for
| a marke....newsletter.
| WorldMaker wrote:
| I know at least a few that I commonly see in top search
| results that I have friends that read them like
| personalized soap operas where most of the drama revolves
| around food and family and serving food to family.
|
| It's at least half the business models of Food Network
| shows: aspirational kitchens and the people that live in
| them and also sometimes here's their recipes. (The other
| half being competitions, obviously.) I've got friends
| that could deliver entire doctoral theses on the Bon
| Appetit Test Kitchen (and its many YouTube shows and
| blogs) and the huge soap operatic drama of 2020's events
| where the entire brand milkshake ducked itself; falling
| into people's hearts as "feel good" entertainment early
| in 2020/the pandemic and then exploding very dramatically
| with revelations and betrayals that Fall.
|
| Which isn't to say that there _aren't_ garbage SEO farms
| out there in the food blogging space _as well_, but a
| lot of the big ones people commonly complain about seeing
| in google's results do have regular fans/audiences. (ETA:
| And many of the smaller blogs _want_ to have regular
| fans/audiences. It's an active influencer/"content creator"
| space with relatively low barrier to entry that people
| love. Everyone's family loves food, it's a part of the
| human condition.)
| run-types wrote:
| I've basically never been taken to a recipe without a
| rambling preamble from Google. While food blogs may serve
| two audiences, a long introduction seems to be a
| requirement to appear in the top Google search results.
| WorldMaker wrote:
| Personally, I think that has a lot more to do with the
| fact that Google killed the Recipe Databases. There did
| used to be a few startups that tried to be Recipe
| Aggregators with advertising based business models, that
| would show recipes and then link to source blogs and/or
| cookbooks, and in the brief period where they existed
| Google scraped them _entirely_ and showed entire recipes
| on search results and ate their ad revenue out from under
| them.
| tomrod wrote:
| That is a really bad thing by Google. Their core business
| is not recipes.
| kwertyoowiyop wrote:
| Their core business is making money from other people's
| content, no matter what it is.
| WorldMaker wrote:
| Their core business is advertising and they have always
| been in a direct conflict-of-interest by competing with
| content sites for ad revenue buys.
| inanutshellus wrote:
| I see your point, but argue you've misidentified the two
| audiences.
|
| One audience matches your description and is the invested
| reader. They want _that_ blogger's storytelling. They
| might make the recipe, but they're a dedicated reader.
|
| The other audience is not the recipe-searcher, but
| instead Google. Food bloggers know that recipe-searchers
| are there to drop in, get an ingredient list, and move
| on. They won't even remember the blog's name. So the site
| isn't optimized for them. It's optimized for Google.
|
| "Slow the parasitic recipe-searcher down. They're
| leeches, here for a freebie. Well they'll pay me in
| Google Rank time blocks."
| xtracto wrote:
| That's why I use Saffron [1], it magically converts those
| sites into a page in my recipe book. I found it when the
| developer commented here in HN. Also, a lot of cooking
| websites have started to add a link with "jump to recipe"
| functionality allowing you to skip all the crap.
|
| [1] https://www.mysaffronapp.com/
| Funes- wrote:
| There's also https://based.cooking.
| eigengrau5150 wrote:
| Run by Luke Smith, an admitted neo-reactionary and
| possible white supremacist who writes like a 4chan
| reject.
| the_other wrote:
| If there were a wider variety of popular search engines, with
| different ranking criteria, would sites begin to move away
| from gaming the system? Surely it would be too hard to game
| more than one search engine at a time?
| Nasrudith wrote:
| It would be a matter of numbers anyway about which they
| optimize for. A/B testing is already in place and doesn't
| care about where it comes from, just which one does better.
| new_guy wrote:
| > the people who want to game it will appear.
|
| So just add human review to the mix, if a site is obviously
| trying to game the system (listicles, seo spam etc) just drop
| and ban them from the search index.
| hdjjhhvvhga wrote:
| Congratulations, you've just invented negative SEO.
| eterevsky wrote:
| Imagine if you were looking for the movie.
| neltnerb wrote:
| Imagine including the search term "movie".
| yreg wrote:
| That doesn't do anything useful.
| lucideer wrote:
| I tend to prefer Wikipedia for movies. The exception is actor
| headshots if I'm trying to identify someone, which Wikipedia
| lacks for licensing reasons, but otherwise Wikipedia tends to
| be better than IMDB for most needs. Wikipedia has an IMDB
| link on every article anyway.
|
| Another need I guess might be reviews, for which RT or MC are
| better than IMDB: not sure if either of those two will fare
| better than IMDB in this search engine but again Wiki has
| links out (in addition to good reception summaries)
| mountainboy wrote:
| For me, imdb was much better when they had user
| comments/discussion.
|
| I never even posted on it myself, but browsing the
| discussions one could learn all sorts of trivia, inside
| info, speculation, etc about each movie.
|
| Since they (inexplicably) killed that feature, I rarely
| even visit anymore. You're right, for many purposes wikipedia
| is better, especially for TV series episode lists with
| summaries.
| ncphil wrote:
| IMDB management thought it was their brilliant editorial
| work that drew people to their site. Morons. It was the
| comments all along. Of course they also believed they
| could create gravity-free zones by sheer force of
| executive will (and maybe still do).
| shuntress wrote:
| _?q=imdb.com:fall of the roman empire_
| jazzyjackson wrote:
| !imdb
| MisterTea wrote:
| Then you'd use a different search engine. Why does everything
| have to be a Swiss Army knife?
| zozbot234 wrote:
| Or you could just search for 'rome movie'. Though for more
| complex disambiguation you would need to resort to, e.g.
| schema.org descriptions (which are supported by most search
| engines, and the foundation for most "smart" search result
| snippets).
| hn_throwaway_99 wrote:
| I had the exact opposite experience. I searched the site for
| "java", got a Wikipedia link first (for the island, not the
| programming language), and the 2nd result was to a random JEP
| page, and all the rest of the results were random tidbits about
| Java (e.g. "XZ compression algorithm in Java"). Didn't get any
| high level results pointing to an overview of the language,
| getting started guides, etc.
| withinboredom wrote:
| You need to use some old school search techniques and search
| for "Java overview"
| _wldu wrote:
| I'm not sure that's a bad thing.
| rovr138 wrote:
| well, they're results to java related items...
|
| What kind of links were you expecting to find?
| SPBS wrote:
| Cool, it appears that the trend towards JS may be causing self-
| selection -- if a page has a high amount of JS, it is highly
| unlikely to contain anything of value.
| dv_dt wrote:
| If one could create a metric of ad to content ratio from the
| js used, I would guess that would be a nice differentiator
| too.
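A rough sketch of one such metric, using only Python's standard `html.parser`. It does not detect ads; it measures visible text against inline script size as a crude proxy. The names `TextScriptCounter` and `text_to_script_ratio` are hypothetical, and externally loaded scripts (`<script src=...>`) are not fetched, so they count as zero here.

```python
from html.parser import HTMLParser


class TextScriptCounter(HTMLParser):
    """Count characters of visible text and of inline script code."""

    def __init__(self):
        super().__init__()
        self.inside = None  # "script" or "style" while inside such a tag
        self.text_chars = 0
        self.script_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.inside = tag

    def handle_endtag(self, tag):
        if tag == self.inside:
            self.inside = None

    def handle_data(self, data):
        if self.inside == "script":
            self.script_chars += len(data)
        elif self.inside is None:
            # Stylesheet content is ignored; only visible text counts.
            self.text_chars += len(data.strip())


def text_to_script_ratio(html):
    """Higher values mean more text per character of inline script."""
    counter = TextScriptCounter()
    counter.feed(html)
    counter.close()
    # The +1 avoids division by zero on script-free pages.
    return counter.text_chars / (counter.script_chars + 1)


page = "<html><body><p>Hello world</p><script>var x = 1;</script></body></html>"
print(text_to_script_ratio(page))
```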
| sjtindell wrote:
| True. Unfortunately many large corporate websites through
| which you pay bills, order tickets, etc. are becoming
| infested with JS widgets and bulky, slow interfaces. These
| are hard to avoid.
| zimpenfish wrote:
| Searched for my initials - got back a bunch of raw binary results
| (mp4, pdf, img, txz, etc.) which was disconcerting. Although it
| did find one reference to actual-me which is better than Google
| manages on the first 4 pages...
|
| https://imgur.com/a/n2xro2Y
| marginalia_nu wrote:
| Yeah there was unfortunately a problem with the content-type
| code recently: it categorized some binary data as HTML and
| tried to process it best-effort. So there's some binary soup
| in the index.
|
| The bug has since been fixed, but the fix won't take effect for
| a few weeks.
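
A guard against the kind of misclassification described above could
look roughly like this. It is a minimal sketch and not the search
engine's actual code: the function name `looks_like_html`, the
512-byte sniff window and the list of magic numbers are all
assumptions chosen for illustration.

```python
def looks_like_html(body: bytes, content_type: str = "") -> bool:
    """Heuristically decide whether a response body is HTML rather than binary."""
    declared = content_type.split(";", 1)[0].strip().lower()
    # Trust an explicit non-HTML declaration such as application/pdf.
    if declared and declared not in ("text/html", "application/xhtml+xml"):
        return False
    head = body[:512]  # hypothetical sniff window
    # NUL bytes are common in binary formats and essentially absent from HTML.
    if b"\x00" in head:
        return False
    # Magic numbers for PDF, PNG, JPEG, ZIP and XZ (illustrative, not exhaustive).
    magic = (b"%PDF-", b"\x89PNG", b"\xff\xd8\xff", b"PK\x03\x04", b"\xfd7zXZ")
    if head.startswith(magic):
        return False
    return True
```

A crawler would call this before handing the body to the HTML
parser and skip the document when it returns False.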
| ape4 wrote:
| Fast and doesn't crash when on the front page of Hacker News!
| ricardo81 wrote:
| That crossed my mind too, considering vanilla webpages
| sometimes struggle with a top page HN thread, never mind a
| search engine backend.
| marginalia_nu wrote:
| Well,... yet. Load average is at 1.2, not that bad. But the
| services are getting a solid workout.
| marginalia_nu wrote:
| The real test is now: the index server is reconstructing its
| index. It does this every 6 hours if there are new pages.
| Takes half an hour or so usually.
|
| It's supposed to be able to handle searches at the same time,
| but jeepers, it's gonna have to chew through nearly 400 Gb of
| data while dealing with over 1 request per second.
| 0xbadcafebee wrote:
| Is your site/code on GitHub? I would be happy to give
| performance tips/tweaks. Also Fyi, https://marginalia.nu/
| gives a certificate error (I know that's not the search
| site)
| criddell wrote:
| Have you given any thought on what you will do if you get a DMCA
| take down request or a request from a person asking you to remove
| them from search results?
| mrkramer wrote:
| Awesome work! I had similar idea in mind but I'm glad to see
| someone else was able to pull it off.
| sailorganymede wrote:
| Love this! Is there any way someone could help contribute to
| this?
| palijer wrote:
| This has been needed in my life for a while. I am growing really
| apathetic about the internet lately, but I realize that is
| because my entry point is always a google search.
|
| I miss finding blog posts and scholarly articles in long form. I
| hate the SEO sites with unreadable UI because the information in
| them is often a lot lower quality as well.
| claytn wrote:
| Searching for your own name will turn up some interesting
| results! I got some early 90s webpages that just contain
| obituaries or marriage records. I never knew cities maintained
| these records online!
| IlliOnato wrote:
| Pretty cool. I am not sure yet how useful, but cool it is.
|
| However, it seems that it currently does not support non-Latin
| alphabets. Which I understand in an early version. Still, its
| handling of such "exception cases" could be improved:
|
| when I search for a Russian word, say "Akvarium", I get <<Search
| "Akvarium" needs to be a word>>, which is rather rude...
| foxfluff wrote:
| "It also focuses on websites in English, Swedish and Latin and
| tries to identify and ignore the rest (best-effort)."
|
| https://news.ycombinator.com/item?id=28551183
| beepbooptheory wrote:
| need a duck duck go `bang!` for this
| jordache wrote:
| how about a search engine that bans all pinterest content.
|
| I hate pinterest with a passion. I may need to get a "his" laptop
| separate from my wife, since she needs that darn pinterest
| extension for pinning photos.
| mattowen_uk wrote:
| Nice! Typing my name in, gets my own site back as 3 of the top 5
| results. I suddenly feel important ;)
| schmorptron wrote:
| I've also found that brave search gets much better results than
| google for some programming related topic, simply by not being
| targeted by blogspam SEO as much. It's refreshing to not have to
| click through 3 auto generated "articles" but to either a) get
| the documentation straight away or b) find actually human written
| blog entries.
| arethuza wrote:
| I did a quick check using the name of the Scottish village I am
| originally from (or as I should say "far am fae") and this
| produced a _much_ more interesting set of links for me than that
| produced by Google
| abhiminator wrote:
| Absolute textgasm.
|
| Wonder how 'text-only first' prioritization is being implemented,
| algorithmically speaking?
| ryankrage77 wrote:
| This post - https://news.ycombinator.com/item?id=28551183 -
| suggests it's a simple set of heuristics, looking for things
| like javascript, link/SEO spam, language, amount of text
| content, etc, filtering out unwanted results and only indexing
| wanted ones.
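
To make the idea of such heuristics concrete, here is a toy scoring
function. It is only a sketch under loud assumptions: the name
`text_heaviness`, the regular expressions and the weights 10 and 2
are invented for illustration and are not taken from the linked post
or from the search engine itself.

```python
import re

def text_heaviness(html: str) -> float:
    """Toy score: higher means more visible text relative to scripts and links."""
    scripts = len(re.findall(r"<script\b", html, flags=re.IGNORECASE))
    links = len(re.findall(r"<a\b", html, flags=re.IGNORECASE))
    # Drop script/style bodies, then every remaining tag, to approximate visible text.
    stripped = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    stripped = re.sub(r"(?s)<[^>]+>", " ", stripped)
    words = len(stripped.split())
    # Each script or link "costs" some words; the weights are arbitrary placeholders.
    return words / (1 + 10 * scripts + 2 * links)
```

Pages scoring below some threshold could be skipped at indexing
time, while the rest are ranked by the usual relevance signals.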
| abhiminator wrote:
| Thank you!
| michaelcampbell wrote:
| (Old man yells at cloud.)
| [deleted]
| yrds96 wrote:
| Damn, that's an interesting search engine. This is great for
| searching simple terms and finding a bunch of blog articles about
| the term.
| turtlebits wrote:
| Great for a text-focused site- however, the results are a bit
| confusing. Would help if there were more details on the criteria
| used for a site to be included in the index.
|
| Suggestion - Use system fonts (the site downloads almost 300k of
| fonts)
| marcosdumay wrote:
| I've got some results where the same site has more than 70% of the
| links. It was a very on topic and high quality site, but still,
| all the results shouldn't point to the same place.
|
| I think some grouping by site (and capping to only the few most
| relevant links there) would improve the engine.
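
The capping suggested above can be sketched in a few lines. The
function name `cap_per_site` and the default limit of two results
per host are assumptions for illustration; the input is assumed to
be a list of URLs already sorted by relevance.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def cap_per_site(urls, max_per_site=2):
    """Keep ranked order but allow at most max_per_site results per hostname."""
    seen = defaultdict(int)
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < max_per_site:
            seen[host] += 1
            kept.append(url)
    return kept
```

Because the loop preserves the incoming order, the most relevant
links from each site survive and lower-ranked duplicates are dropped.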
| aabajian wrote:
| Gotta say, sometimes the results really are nice. I searched for
| "Land Cruiser 70." The first result is a simple, short blog post
| about a couple who traveled across Europe and Asia in their Troop
| Carrier (http://www.destoop.com/trip/1%20PREPARATION/2%20Vehicle%
| 20sp...).
|
| The first results on Google are Australian sites for buying a LC70
| (news-flash, I can't buy one in the USA). There is also a
| MotorTrend article about the LC70...also irrelevant since it's
| only sold in Australia.
| dzhiurgis wrote:
| Search for playwright waitForSelector and you land in pretty
| useless page. I'm all in for text websites, but something like
| playwright.dev documentation is top notch - fuzzy search being
| key thing.
| marginalia_nu wrote:
| Yeah I wasn't really planning for this to blow up like it did
| today. It's currently sitting at about 35% of the index size I
| usually aim for, so besides the stuff I can't index because
| it's behind CDNs, there's a lot of pages it just hasn't gotten
| to yet. playwright.dev is pretty low on the priority list
| because it has a metric crap-ton of javascript on its front
| page. The crawler has visited it, looked at it, and put it very
| far down the priority queue.
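
A crawl queue that pushes script-heavy sites to the back can be
sketched with a binary heap. This is an illustration only: the class
name `CrawlQueue` and the use of a raw script count as the priority
are assumptions, not a description of how the crawler actually
orders its work.

```python
import heapq

class CrawlQueue:
    """Min-heap of URLs; a lower penalty means the site is crawled sooner."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal penalties pop in insertion order

    def push(self, url, script_count):
        # Hypothetical penalty: the number of scripts seen on the front page.
        heapq.heappush(self._heap, (script_count, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

With this ordering a site with dozens of scripts is still visited
eventually, but only after the lighter sites ahead of it.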
| arnaudsm wrote:
| I wish we could configure Google's algorithm to our needs, and
| blacklist websites.
| MarioMan wrote:
| It could get tedious depending on how many sites you want to
| block, but you can add "-site:google.com" to exclude
| google.com, for instance.
| arnaudsm wrote:
| I mean a blacklist system like Twitter's, where you block a
| website forever. Pinterest would be the first to go.
| marto1 wrote:
| super! How far would you say are you in indexing the blogosphere
| ? I tried the engine a few times, but I mostly get academic
| papers and I know most (good) blogs are in fact text-heavy.
| phreack wrote:
| One nitpick that kind of bothered me - on a large desktop
| monitor, the results page was like 70% whitespace margins with
| the results squished in the middle like a portrait cellphone.
| Hopefully it's easy to fix, I like to research at home and this
| website could help a lot!
| yakubin wrote:
| Wikipedia links point to <https://encyclopedia.marginalia.nu/>
| instead, which to my eyes is less readable. The justified text,
| done with CSS, instead of the LaTeX algorithm, looks wild. The
| font used for quotations is even worse (very thin).
|
| Wikipedia is perfectly usable without JavaScript and it's one of
| the nicest sites out there typography-wise, so I'd reconsider
| this redirection.
| marginalia_nu wrote:
| I guess it's a matter of taste. I can barely read anything on
| regular wikipedia because the inline links disrupt my flow.
| wolpoli wrote:
| I wish niche search engines had an option to group results by
| domain names. There are a few major sites that dominate Google
| search results with low effort content. As long as Google stands
| as the largest search engine, it's unlikely that these major
| sites will want to rearchitect themselves into different domain
| names.
| gjm11 wrote:
| I tried a few searches.
|
| <<javascript pipe syntax>>: none of the search results appeared
| to have anything to do with Javascript pipe syntax. (Which
| doesn't exist yet, but it's under discussion.) Google gives a
| bunch of highly-relevant results.
|
| <<hans reichenbach relativity>>: first result is a list of books
| about relativity, one of which is Reichenbach's "Philosophy of
| space and time"; good, but there's no real _information_ there.
| Second is about Reichenbach but nothing to do with relativity or
| even, really, philosophy of science. Third is about philosophy of
| science and mentions some of Reichenbach's work but not related
| to relativity. Fourth mentions Reichenbach's "Philosophy of space
| and time" as part of a list of books relevant to a seminar on
| "time and eternity". None of this is _bad_, but it's not great
| either. Google gives a couple of online philosophy encyclopaedia
| entries, then a journal article on "Hans Reichenbach's relativity
| of geometry", then the Wikipedia article on Reichenbach ... much
| more informative.
|
| <<luna lovegood actress>>: I thought this would be an easy one.
| It was easy for Google, which gave me her name in large friendly
| letters at the top, then her IMDB entry, and a bunch of other
| relevant things. Literally nothing in the Marginalia results was
| relevant to the query.
|
| I guess maybe popular culture is just too monetizable, so no one
| is going to write about it on the sites that Marginalia crawls?
| Let's try some slightly less popular culture.
|
| <<wilde "a handbag">>: First result is kinda-relevant but weird:
| it's about a musical adaptation of _The Importance of Being
| Earnest_. It doesn't mention that famous line from the play, but
| one of the numbers in the musical has the words "a handbag" in
| the title. Second result is a review of a CD of musicals,
| including the same work. Third is a bunch of short reviews of
| theatrical items from the Buxton Festival Fringe, one of which is
| a three-man adaptation of TIOBE. Next four are 100% irrelevant.
| Next is a list of names of plays. Last one is actually relevant;
| it's an article about "Lady Bracknell through the decades".
| Google puts that one first (after, sigh, a bunch of YouTube
| videos which look as if they might actually be relevant).
|
| I really like the _idea_ of this, and many of the things it turns
| up look like they might be interesting, but it isn't doing very
| well at producing results that are actually relevant to the thing
| being searched for.
| exikyut wrote:
| TIL about https://github.com/tc39/proposal-pipeline-operator,
| which I am immediately looking forward to playing with once it
| gains traction Some Time From Now(tm)
|
| (I have no earnest reason to transpile)
| seph-reed wrote:
| Yeah, this seems pretty nice. I don't think the "deep
| nesting" issue is quite so realistic... I very rarely have a
| logic tree that's easier to identify by its leaves than its
| root. And I'd really hate to have code where you have to
| scroll to the end of a bunch of pipes to figure out what
| they're adding up to
|
| But I have plenty of single use "temp variables" and cutting
| those out could be cool.
| silent_cal wrote:
| To be fair, those searches are pretty weird.
| JxLS-cpgbe0 wrote:
| We don't _search_ for things because they're easy to find
| HaloZero wrote:
| The pop culture one is fairly common. Me and my wife both
| search "who the fuck is that" in that TV show movie all the
| time. Or who is the author of X book?
| cormacrelf wrote:
| It's trying to surface long articles and you're asking it
| for a one word answer. What did you expect? A long article
| consisting of "Emma Stone played Cruella" repeated 800
| times?
| gibspaulding wrote:
| I think perhaps the usefulness here is less finding what
| you're looking for, but rather finding something
| interesting.
| exporectomy wrote:
| Wow. I tested it on recipes which Google has destroyed and this
| was the first result, a simple clear recipe:
|
| http://demont.myds.me/leerecipes/mainmeals/mainmeals1/chicke...
|
| compared to Google's endless drivel of "This chicken stir fry
| recipe will become a staple in your home. It's so quick to make
| and you can use whatever vegetables you have on hand. It tastes
| wonderful regardless of how you alter the ingredients. ... " JUST
| SHUT UP AND GIVE ME THE RECIPE!
|
| https://natashaskitchen.com/chicken-stir-fry-recipe/
| srcreigh wrote:
| Fwiw this website, Natasha's kitchen, seems to be one of the
| more performant ad filled recipe websites.
|
| Make use of the "Jump to recipe" button to get to the recipe
| faster.
| 1f60c wrote:
| This isn't entirely Google's fault. Recipes on their own aren't
| copyrighted in the US, and adding this fluff text is a way
| around that.
| AdamN wrote:
| Yes it is. They're giving a lower quality result to the user
| (their customer ... but not for long if competitors can get
| just a little bit better)
| mdoms wrote:
| It is entirely Google's fault. Google knows people don't want
| this dreck (everyone knows it) but still serves it up.
| b-x wrote:
| Too bad it rejects non-Latin words, as if the definition of
| "text" is a sequence of alphabetical letters originated from
| Latin.
|
| I thought that we've reached the time to embrace all cultures in
| the world, but this retrogressive engine proves that most _modern
| tech_ designers are myopic about other civilizations in the
| globe.
| fghfghfghfghfgh wrote:
| It's one guy. Making a useful tool. It even has an altruistic
| purpose.
|
| Shame on you for twisting a well intended effort into a
| negative statement that suits your narrow identity political
| world view.
| b-x wrote:
| Less insulting error messages would be more welcome than
| casting out others without any consideration.
|
| You may distribute "shame" however you want, but this only
| helps enforcing the damaging insults and amplifying them.
| hombre_fatal wrote:
| No, it just proves that a one-man hobby project with finite
| resources found it reasonable to restrict the scope.
|
| Maybe when they find out they're an immortal billionaire they
| can build all the additional things you've entitled yourself to
| expect from the freely shared work of others.
| marginalia_nu wrote:
| Understand that this is something I built for myself, by
| myself, so it focuses on languages I understand. It's hosted on a
| single consumer grade computer in my living room. I built it
| out of pocket and anyone is free to use it. Does this make me a
| villain?
|
| If I can do this, what's preventing some guy in Japan or India
| or Peru from doing the same, of course focusing on their
| languages?
| b-x wrote:
| Maybe a better suited choice for errors than an insulting
| message: when I provided a query in my native language it
| regurgitated the error "needs to be a word" instead of more
| acceptable "not a supported language".
|
| When you claim that a word in some other culture is not "a
| word", just because it's not recognized by your machine,
| that's demeaning to say the least.
| marginalia_nu wrote:
| Again it's a one man hobby project, I don't have a team of
| people to go through every formulation and every error
| message to ensure nobody can read them in a way that
| offends them. It's just me, writing code on an unfinished
| project that HN discovered.
|
| In this case, the code doesn't match the word regexp, like
| it may be a @TwitterHandle or a "comp.lang.c" with periods
| in it, or an unsupported Unicode range. It doesn't know why
| it is not matching, just that it doesn't.
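
One way to let the code say why a term was rejected is to replace
the single pass/fail pattern with a few per-character checks. The
sketch below is hypothetical: the function names, the cut-off at
U+024F (the end of the Latin Extended-B block) and the message
wording are assumptions, not the engine's real rules.

```python
def is_supported_char(ch: str) -> bool:
    # Assumption: letters and digits up to Latin Extended-B are searchable.
    return ch.isalnum() and ord(ch) <= 0x024F

def explain_rejection(term: str) -> str:
    """Return 'ok' or a short reason why the term cannot be searched."""
    if term and all(is_supported_char(ch) for ch in term):
        return "ok"
    if any(ch.isalpha() and ord(ch) > 0x024F for ch in term):
        return "contains characters that are not currently supported"
    if any(not ch.isalnum() for ch in term):
        return "contains punctuation that is not supported"
    return "not a searchable term"
```

The order of the checks matters: script problems are reported before
punctuation, so a Cyrillic word gets the more specific message.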
| b-x wrote:
| I must congratulate you on this achievement. That's
| certainly a useful take on search.
|
| Nonetheless, even when coding, one should also consider
| thoroughly the UX and how it would be addressing the
| others.
|
| Saying "unsupported word" is much more sympathetic than
| "needs to be a word" (where you define what a "word" is,
| and the general user is unaware of such definition).
| marginalia_nu wrote:
| Fair point, I refined the phrasing a bit.
|
| > The term "" contains characters that are not currently
| supported
| leavenotracks wrote:
| Really impressed with the results I'm seeing so far. In all
| searches I have done so far, the results are truly lightweight,
| and haven't had to click through any modals, subscription pop-ups
| or any other junk thus far! Will be using more in the days to
| come.
| ncfausti wrote:
| This is incredible. I just got goosebumps as I stumbled upon
| https://solitaryroad.com after searching for "linear algebra
| homomorphism". It reminds me of the magical feelings of the early
| Internet. Keep up the great work!
| pajko wrote:
| Does it filter out ad-heavy copy-paste/autogenerated fake sites?
| Tired of seeing those on the first few pages of Google. Bing gets
| more and more usable, but far from perfect.
| marginalia_nu wrote:
| It tries.
| titzer wrote:
| I think I want a BBS. Text mode, fixed width font, keyboard-
| driven menus, no (or very little) bitmapped graphics. I've been
| thinking about the UIs for a lot of sites that I use to "do
| things" on the web. E.g. search for flights. Do I need _any_ of
| that "beautiful" web design with pretty forms and fonts,
| bevelled edges, drop shadows, drop-down menus, hovers? Hell, do I
| even need a map? Heck no, I need three text entry fields and
| output a bulleted list, maybe table of results. Just give me the
| raw data and do as little presentation as possible, thanks.
|
| I really think I want an internet console, not an animated
| magazine.
| javajosh wrote:
| This is really good; I'll actually use it!
| eitland wrote:
| Tested with the first person to settle on Iceland:
| https://search.marginalia.nu/search?query=Ingolf+Arnarson
|
| and it worked surprisingly well.
|
| Anyone else have good examples?
| abdullahkhalids wrote:
| Can we submit text-heavy sites for possible inclusion? Assuming
| they pass your filters.
| sealthedeal wrote:
| lol this is great, reminds me of the old school search engines we
| would use in school back in the day before Google haha.
| pomian wrote:
| Congratulations. Truly impressive search results. I tried two,
| one word searches. The results were interesting, useful, and
| would have been impossible (well, really really hard) to find, on
| standard search engines. Plus, no garbage, ads, recommendations,
| etc etc. As another commenter suggested, it is what World Wide
| Web searches results were like, twenty years ago!
| pomian wrote:
| PS. I added Marginalia as a search option (even the default for
| now) in Firefox Nightly (on Android). In case others want to,
| under settings for search, you can add other, then name, and
| then: https://search.marginalia.nu/search?query=%s
| freddref wrote:
| After a good amount of searching it doesn't seem possible to
| add Marginalia as default search in firefox (84.0b8) on
| Debian.
|
| I did not expect this to not be available.
| lumost wrote:
| This is a fascinating tool, I estimated that the corpus of the
| factual web was between 1 and 10 TB when I last played around
| with BigQuery using domain names which had low amounts of click
| bait. Seeing these search results I suspect my estimate was off
| by a couple orders of magnitude.
|
| Although a search for "Fractional Reserve Banking" shows that
| some further ranking improvements can be made to exclude
| unrelated results, and potentially penalize old conspiracy sites.
|
| https://search.marginalia.nu/search?query=fractional+reserve...
| rchaud wrote:
| Is it fair to assume that text-heavy sites that are inactive (but
| still online) don't have SSL?
|
| If so, would you ever tweak the parameters to surface sites that
| aren't served with "HTTPS"?
| asjdflakjsdf wrote:
| You should monetise this with amazon affiliate links that are
| relevant to each search. And then use that money to keep this
| project going. Google is fantastic, but it has become something
| different from what it was, the company and the product. It is so
| refreshing to see a modern tool that encourages exploration of
| the actual world wide web.
| Funes- wrote:
| That would be an absolutely awful decision.
| marginalia_nu wrote:
| I might add a donate button or something if people want to help
| support the project, hardware isn't cheap and all. But I have a
| job and decent income. I think if this search engine became the
| way I earned money, it would influence the project in a bad
| way, and corrupt its purpose, which is to help people explore
| the less-commercial internet.
| eigenhombre wrote:
| +1 for a donate button; much preferred over affiliate links
| or ads of any kind. Thank you for making this beautiful
| little(?) product!
| kews wrote:
| Appreciated. The more things fill up with monetizing shit,
| the more I stay away. There's something beautiful in having
| higher purposes than grubbing for cash.
| RistrettoMike wrote:
| I'd donate to continued expansion/development of something
| like this. Where is somewhere good to follow you for any
| thoughts/updates?
| marginalia_nu wrote:
| I have something of a blog here, with an Atom feed.
|
| https://memex.marginalia.nu/log/
|
| It's not very well optimized for mobile, really it's more
| of a bridge for my geminispace content.
| streamofdigits wrote:
| Let a thousand search engines bloom.
|
| btw, interesting how many http (as opposed to https) sites show
| up...
| pjs_ wrote:
| This kicks ass!!
| 0xbadcafebee wrote:
| I love it. Even though it didn't give me the results I was
| looking for. I searched "new york fishing license", and it didn't
| give me any links to the actual new york fishing license
| websites. But it did give me a ton of really cute little websites
| related to lakes and fishing in New York. This one has _amazing_
| information about fishing all over Western New York:
| http://www.huntfishnyoutdoors.com/fishing.php
| kews wrote:
| There's probably a more suitable term than "modern" that we
| should generally be using, since "modern" consistently has a
| positive connotation.
| marginalia_nu wrote:
| Dunno, I prefer to use as neutral or positive terminology even
| when I talk about things I don't like. I think it very easily
| comes off as juvenile ranting when you start throwing around
| terms with strong negative connotations.
| mumblemumble wrote:
| I like it.
|
| Coincidentally, the other day I was daydreaming about a search
| engine that favors sites that are updated less frequently. The
| thought being, the kinds of labors of love that characterized the
| 1990s Web that I still sometimes miss are still out there, it's
| just harder to find them amidst the flood of SEO dreck. So
| perhaps they could be made discoverable again with the help of a
| contrarian search engine that specifically looks for the kinds of
| things that Google and Bing _don't_ like to see.
| gibspaulding wrote:
| Million Short [1] offers an option to omit results from popular
| domains. It's a different approach from what you describe, but
| I think the goal is similar.
|
| [1] https://millionshort.com/
| itzworm wrote:
| I had this problem recently trying to fix an Atari. There's a
| guy out there who has tons of guides on doing video out mods
| but the newer guide references the older. However googling the OG
| guide didn't find it so I manually scoured his old web page.
| capableweb wrote:
| Similarly, I wish there was a recommendation engine (for web,
| music, movies, whatever) that can show you what is the furthest
| away from your existing tastes. I've learned to re-create my
| Spotify account once every 6 months or so, as their
| recommendation engine becomes a boring machine after using it
| daily for some months.
|
| I'd love to discover new content that is different from what I
| read/watch/listen to now, but it's really hard to know about
| genres you don't know about.
| ebiester wrote:
| It's hard, though. I simultaneously want something far from
| my tastes, but I don't want to see Plandemic-style Ivermectin
| material, or Focus On The Family-style material. I want
| things that will push me out of my comfort zone sometimes,
| but it turns out I really don't want the thing furthest from
| my tastes; I want things marginally adjacent. I want them
| close enough to feel familiarity, but far enough that it
| challenges my worldview.
|
| I don't think a recommendation engine can do that.
| gverrilla wrote:
| Doing that takes real work and curiosity. I'm afraid an
| algorithm will never be able to do it, particularly if you're
| into niche stuff. For instance I enjoy a lot a Japanese band
| called The Boredoms - but few people like it, and there's
| only 2 of their albums available in spotify.
| potatoman22 wrote:
| I like the idea. Tangentially, I wonder how one would find the
| right 'penalty' for more updated sites?
| timvisee wrote:
| Cool!
|
| There do seem to be some text encoding issues though. For
| example: https://search.marginalia.nu/search?query=tim+visee
| marginalia_nu wrote:
| Yeah I think the charset detection needs work.
|
| It understands the "Content-type: text/html;charset=utf-8"
| -header, and <meta charset="UTF-8">
|
| but not
|
| <meta http-equiv="content-type" content="text/html;
| charset=utf-8">
|
| It turns out HTML has a lot of corner cases. I'm constantly
| marveling at how web browsers hold together as well as they do.
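
The three declarations mentioned above can be handled by one lookup
that checks the HTTP header first and then scans the start of the
document for either form of the meta tag. This is a sketch with
assumed names (`detect_charset`, `_META_CHARSET`); the 1024-byte
scan window and the UTF-8 default are placeholders, and a real
parser would need to cover more corner cases than a regular
expression can.

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="content-type" content="text/html; charset=...">
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def detect_charset(content_type_header: str, body: bytes, default: str = "utf-8") -> str:
    # 1. HTTP header, e.g. "text/html;charset=utf-8"
    m = re.search(r"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)",
                  content_type_header or "", re.IGNORECASE)
    if m:
        return m.group(1).lower()
    # 2. Either meta form near the top of the document.
    m = _META_CHARSET.search(body[:1024])
    if m:
        return m.group(1).decode("ascii").lower()
    # 3. Nothing declared: fall back to the caller's default.
    return default
```

For a corpus with many older pages the default could be set to a
legacy encoding instead, since assuming UTF-8 everywhere is reported
further down the thread to cause errors of its own.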
| timvisee wrote:
| Thanks for your response! Hope you can implement this as well
| without too much trouble.
|
| I wonder if you could just assume UTF-8 to be the default
| these days. I imagine that to fix many other cases as well.
| marginalia_nu wrote:
| Haha! I did actually assume UTF-8 at first, but this being a
| search engine with a lot of older websites, I sadly got a
| lot of encoding errors doing that, too.
| enduku wrote:
| Fantastic project! Found very interesting links to a lot of
| compiler related keywords. A similar service I found useful, yet
| different in its approach to cutting through the e-commerce and
| SEO-optimized websites, is Million Short [0].
|
| [0] millionshort.com
| oytis wrote:
| Designed for serendipity indeed. Tried a few searches, results
| are quite fun, but none of them relevant.
| pjspycha wrote:
| This is really refreshing work, and we can all benefit from other
| search engines focused on improving the field. I tried a bunch of
| searches and some of them were quite wonderful, others were a
| little dry on results. But overall I enjoyed going through it.
| Here are some critiques if you don't mind:
|
| I did search for "Daria Bilodid" and the results were a bit
| troublesome. First the Wikipedia result did not work:
| https://en.wikipedia.org/wiki/Daria_Bilodid vs
| https://encyclopedia.marginalia.nu/wiki/Daria_Bilodid
|
| Secondly the results matched a few judoinside.com results which
| is ok, including sites to her competitors, but seemed to miss the
| judoinside website for her:
| https://www.judoinside.com/judoka/92660/Daria_Bilodid.
|
| The design is hard on my eyes, I have an average size screen and
| it's using less than half of the width. The line-height is
| enormous and seems to break up the flow, making it uncomfortable for me
| to read. The spacing around each result is the same as between
| titles and paragraph items, which again was unpleasant to read.
| ASalazarMX wrote:
| > Secondly the results matched a few judoinside.com results
| which is ok, including sites to her competitors, but seemed to
| miss the judoinside website for her:
| https://www.judoinside.com/judoka/92660/Daria_Bilodid.
|
| The title says this search engine punishes modern websites
| (images, videos, MB of JS, I suppose), and this site looks
| scarce on text and heavy on images, maybe that's confusing the
| ranking.
|
| I certainly find the results very refreshing, but you'll have
| to complement with other search engines if they're not enough.
| In fact, I think the days when we could use a single search
| engine have already passed.
| dukeofdoom wrote:
| "corporate speak" bs detector and filter on google search engine
| would be nice.
| throwawaysea wrote:
| Is it possible to also make a site that favors a diverse set of
| information sources? For instance a lot of searches turn up
| results from Pinterest or Wikipedia or Amazon or whatever else. I
| wonder if there's room for a search engine that is all about
| favoring a greater diversity of smaller sources, for those who
| are less interested in staying within walled gardens.
| michaelgrafl wrote:
| I just looked up my last name and found a World class heavyweight
| weightlifter named Josef Grafl born in 1872 who has an awesome
| portrait of him on Wikipedia. Never before have I read about that
| man.
|
| I love this.
| dumbfounder wrote:
| Based on a few searches it seems to favor sites with very long
| passages of text. Search for a name and you get pages with
| massive lists of names. It quite simply isn't very good at
| everyday searches. But it does bring up the point, shouldn't I be
| able to tell my search engine I want results like this? It should
| be a feature of google I can turn on and off. It should be one of
| many ways to impact relevance.
| runnerup wrote:
| Seems like this is still very very hard! I searched for "hart
| protocol" hoping to find this: http://www.romilly.co.uk/
| mrpf1ster wrote:
| I searched "c strtok" and got one result saying '"strtok" could
| be spelled "stroke", "stork", "sarto", "strop"'.
|
| Cool concept though!
| marginalia_nu wrote:
| The spelling suggestions are presented whenever there aren't any
| results, but sometimes they can be pretty misleading.
|
| What happens is that C, as a word, isn't indexed because it's
| deemed too short, and the bigram "c strtok" can't be found
| anywhere.
|
| Try 'strtok' instead.
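
The behaviour described above, where a one-letter word is dropped
and the two-word phrase is looked up as a bigram, can be illustrated
with a toy index. Everything here is an assumption made for the
example: the names `index_terms` and `search`, the minimum length of
two characters, and in particular the fallback to single-word
matches, which the thread does not describe.

```python
def index_terms(text, min_len=2):
    """Split text into indexable single words and adjacent word pairs."""
    words = text.lower().split()
    unigrams = [w for w in words if len(w) >= min_len]  # "c" is too short
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return unigrams, bigrams

def search(query, index):
    """index maps a term (single word or bigram) to a list of document ids."""
    unigrams, bigrams = index_terms(query)
    for phrase in bigrams:
        if phrase in index:
            return index[phrase]
    # Hypothetical fallback: intersect the hits of the remaining single words.
    hits = [set(index.get(w, ())) for w in unigrams]
    return sorted(set.intersection(*hits)) if hits else []
```

With such a fallback the query "c strtok" would return the documents
indexed under "strtok" instead of an empty result page.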
| llbeansandrice wrote:
| Is there any way to add this as a favored search engine in the
| browser?
|
| I currently use google as it's set as the default search when I
| type in the address bar but would love to switch and move
| google/ddg to an added character like "<search terms> @g"
| tomerv wrote:
| All the major browsers support adding custom search engines.
| You just need to specify the URL template to do the search. The
| common format is to put "%s" as the search term. You can use it
| for any site, not just things that are considered search
| engines.
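
What the browser does with such a template is simple string
substitution with URL encoding. The sketch below shows the idea; the
function name `build_search_url` is made up for the example, and the
template is the one quoted elsewhere in this thread.

```python
from urllib.parse import quote_plus

def build_search_url(template: str, terms: str) -> str:
    """Fill the %s placeholder of a search URL template with encoded terms."""
    return template.replace("%s", quote_plus(terms))

# Template taken from the thread; the query string is just an example.
TEMPLATE = "https://search.marginalia.nu/search?query=%s"
print(build_search_url(TEMPLATE, "Ingolf Arnarson"))
```

Spaces become plus signs and reserved characters are
percent-encoded, so the result can be pasted directly into the
address bar.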
|
| Firefox is a bit different, since you do it by adding a
| bookmark, and giving that bookmark a keyword. The other
| browsers I checked have an option under the search engine
| settings.
|
| After defining the custom search engine, you just type
| "<keyword> <search term>" in the URL bar.
| freddref wrote:
| Is there a way to set marginalia as default search in
| firefox?
| llbeansandrice wrote:
| Ah I'm on FF so I'll have to do the weird bookmark method. A
| little annoying since it supports other search engines.
| ilrwbwrkhv wrote:
| This is soooo good. I'm finally finding sites I haven't heard of
| with good content.
|
| I didn't realize how much I missed this stuff.
|
| The popular web has become so bad nowadays.
| snuser wrote:
| Just searching for 'dogs' gave me more interesting results than
| I've seen from google in years
| prionassembly wrote:
| The website itself seems generated with some kind of kick-ass
| generator from template files (.gmi?)
|
| I feel like I'm stuck with Wordpress.com because it brings me
| _some_ traffic (whereas something hand-rolled on nsfspeech or
| digital ocean or whatever would literally be off the edge of the
| web), but the structure of that is so cool.
| NhanH wrote:
| That would be gemini protocol!
| huijzer wrote:
| You can easily do proper SEO with static site generators too.
| Even more, static sites can be hosted via GitHub or GitLab
| Pages, Netlify or CloudFlare, and in almost all cases the speed
| will outperform Wordpress. Also, you have way
| more control over the output than with Wordpress.
| dzink wrote:
| One use case to always test: "online wishlist" or "make a
| wishlist". If you start seeing tools like
| https://www.DreamList.com or others, you are on the right path.
| If you start seeing random web pages linking to individual wish
| lists, then people are likely not able to find tools on your
| search engine.
| platz wrote:
| > Don't be afraid to scroll down in the search results
|
| I never knew it was fear that was preventing me from scrolling
| prewett wrote:
| Kudos for taking on this project, and I like the idea! I think
| it'll be a big project to take it to the next level, but would
| love to have a search engine that's more useful.
|
| Some reactions:
|
| - The font is really big and the columns really narrow, so I get
| 3 - 4 entries per page, something like 8 words per line, and huge
| spacings between lines, which makes it a frustrating experience.
| I've been using the recommendations in
| https://practicaltypography.com/, which recommends 60 - 90
| characters in a line I think, and line spacing of 120% - 140% (I
| like 125%). The line lengths here might technically fall within
| the lower bound, but it's really short, and for search results
| I'm going to try scanning the text to see if there's something
| relevant, so I think going on the long side is better here. At
| least make the width somewhat variable so that I can shrink the
| rather large font and fit more on the line.
|
| - The results are eclectic, but I'm not sure it's usable at the
| moment. "scala append list" did not get me much that's helpful,
| while Google will usually at least put up some click-farming
| tutorial that although minimal effort does tend to answer the
| question. Both "mapo doufu recipe" and "ma po do fu recipe" had
| very few recipes, although the latter did have one.
| Unfortunately, recipe websites are some of the worst, with about
| 10 pages of description, ads, pictures, what-have-you until the
| recipe at the very bottom. "collection unmitigated pedantry" did
| return the acoup.blog entry at the top, though.
|
| Good luck on the project!
| lbriner wrote:
| My pet peeve with search results is simply that there are ancient
| technical results that in many cases are irrelevant. If I am
| searching for a Windows error message, I don't want some old forum
| post from 2001, especially if it didn't have any answers!
|
| What would be cool would be for people who host old stuff to
| "archive" it at some point so it doesn't appear in normal
| results, only if you tick "include archives".
| athenot wrote:
| As much as the release names for macOS over the years were
| marketing gimmicks, it does make it a lot easier to zero in on
| the correct version when doing these types of searches.
| platz wrote:
| modern design = low information density?
| monkeybutton wrote:
| Definitely low signal to noise. Looking at you recipe websites
| and cooking blogs.
| dfdz wrote:
| I like the concept, but it did not work on any of the search
| phrases I entered consisting of the full title of a computer
| science article or book.
|
| It also does not work for subjects. For example, if you search
| "discrete math" it links to academic webpages, but most of them
| do not have any notes posted. It is just a plain text website
| with the syllabus of a class.
| cyral wrote:
| I tried a few queries and got extremely irrelevant results
| marginalia_nu wrote:
| It really depends on what you search for. A major drawback is
| that there needs to be text-heavy sites to find, in order for
| the search engine to find them.
|
| Compare for example the results for "Duke Nukem 3D" with those
| for "Cyberpunk 2077".
| adriangrigore wrote:
| A little bit harsh "punishes". It's a cool search engine.
| OneEyedRobot wrote:
| Very cool. A person can really appreciate simple web design
| looking at something like Luke Smith's recipe page.
|
| So how on earth do you take an idea like this and scale it for
| both broad web coverage and high traffic? For that matter, just
| how much 'useful' text is there on the net?
| jteppinette wrote:
| This is awesome! We should definitely move in this direction.
| xipho wrote:
| Fascinating. I studied an "obscure" group of insects. My go-to
| search term to test an engine is their family name as it is a
| rarely used word and I know most (all?) of the major data sources
| that have accumulated data on it. When Wolfram Alpha added
| species names, I checked with the name, boring, Duck Duck,
| boring, Google (well we know Google isn't for search anymore,
| it's absolutely horrible) boring, Bing, boring... you get the
| idea.
|
| This was a little different, extremely few results, but a couple
| of them really made me grin, and all(?) made me curious or raise
| an eyebrow or reflect on who/what might have been the source of
| the link, or remember some obscure connection from grad-school.
| So, if anything a crawled list of results worthy of ponder,
| thanks for this!
| trutannus wrote:
| > well we know Google isn't for search anymore
|
| Do you suggest anything better? As far as I can tell, all the
| other search engines are either repackaged Bing (ie: DuckDuck),
| or are just as bad.
| ColinHayhurst wrote:
| This needs an update but is an easy look see.
| https://www.searchenginemap.com/
|
| Broad and longer Twitter lists maintained here:
| https://twitter.com/SearchEngineMap/lists
| ricardo81 wrote:
| Mojeek was built in the same spirit (one server living in a
| house) and has 4.5 bn pages indexed now, and a bunch more
| servers. A lot of people similarly comment that it reminds
| them of an older Internet, or of generally less branded
| results. It's definitely an alternative point of
| view. Disclaimer: I work for them.
| giancarlostoro wrote:
| Not sure, but I remember when Google could find literally
| anything. Then they started adding a bunch of exceptions and
| crapped out their quality. I wonder how insanely different
| results would be to get the older Google Engine from the
| 2000s search result wise.
|
| I now have to play games with Google to find things. I feel
| like I do less than I used to for some reason.
| bbarnett wrote:
| The other day, I was searching for something, and google's
| suggested, on-site answers took up 1/2 the first page. All
| wrong.
|
| The actual search results were another 1/4 page of
| completely identical results, followed by google ad placed
| search results.
|
| I thought to myself, they've finally done it. Real
| responses are no longer first page.
|
| A lot of the cause for google getting crappy, is "ok
| google", another "all platforms are the same" form of
| sickness.
|
| No, a desktop is not a phone. No, voice searching is not
| the same as phone, or desktop.
| foobarian wrote:
| I was just thinking that they finally became Lycos. It's
| what all the search engines except Google looked like
| back in the early 2000s - ad laden cesspools of
| irrelevant search results and other content. And it's why
| we all switched to Google at the time.
| habibur wrote:
| It's time to disrupt the market. As Google can't compete
| with a newcomer that penalizes ads on page.
| eitland wrote:
| Seriously, yes.
|
| Moore's law means a modern day 2007-style Google should be
| significantly less expensive to run now than back then.
|
| Also the most relevant patents are now free to use.
|
| 2021 Google is a sad story compared to 2007 Google and
| I'd actually pay to get back 2007 Google - ads included -
| meaning a double revenue source :-)
| trutannus wrote:
| You're absolutely correct, and a lot came from their
| nerfing of search modifiers like + - "search term" and
| whatnot. There's also a lot of ads and "PSA" type nonsense.
| If I'm looking for anything COVID related for example, I
| have to sift through a heap of PSA nonsense that's not even
| related to my search query.
| blowski wrote:
| Wacky idea: instead of Google changing its algorithm every
| couple of years, it could run 50 algorithms in parallel
| leaving no way for sites to "optimise" for the current one.
| vikingerik wrote:
| The output of the parallelism is itself an algorithm,
| that can and will be optimized for.
| mda wrote:
| IMHO, that is a trendy claim in HN with little evidence.
| mhh__ wrote:
| You're downvoted but in my experience I have never really
| been burned by this Google-decline
| Ygg2 wrote:
| "I haven't seen a black swan, ergo it's not real."
|
| I've been burned by this decline in the past.
|
| From creepy results i.e. first suggestion before typing
| was something I spoke near the Android and I never
| searched for before; to not finding what I was searching
| for before successfully, Google has started declining.
| bbarnett wrote:
| You are a lobster. (or frog, depending upon parable)
| xipho wrote:
| You want evidence? Search for a plumber/tradesperson in
| your area THEN try to find rational discourse about your
| options. There are literally 100s of results of websites
| remixing a small set of data, presenting it to you, and
| asking you to buy something to see more, when you _know_
| there is nothing behind the scenes.
|
| This type of engine would punish these sites, in theory,
| and may turn up a discussion in some forum, newsgroup, etc.
| that is actually relevant, or insightful.
| krapp wrote:
| > Search for a plumber/tradesperson in your area THEN try
| to find rational discourse about your options.
|
| I searched "plumber Austin TX" in Google and got a map
| and list of company websites near me. There are a lot of
| "top x y in z" list sites, but the top results were still
| the most relevant. I don't know what "rational discourse"
| I'm expected to find, though, or why I should assume the
| discourse I would find through Google is less rational
| than discourse I would find elsewhere.
|
| I searched the same thing in OP and found nothing even
| remotely significant. Not even anything related to
| plumbing.
|
| OP's project isn't optimized for relevance, it's
| optimized for nostalgia - providing a filter that keeps
| the modern web away and dropping quirky, interesting
| breadcrumbs to distract you and remind you of what it was
| like to wander around the web of the 90's.
|
| Which is all well and good if that's what you want, and
| judging from the comments it is what a lot of people here
| want, but Google giving me a list of company names,
| numbers, websites and a map showing their location by
| distance is more useful, even if it uses "modern web
| design" and javascript.
| xipho wrote:
| > I searched "plumber Austin TX" in Google and got a map
| and list of company websites near me.
|
| I think you could have done this historically in a Yellow
| Pages phone book. My OP used "boring". A list of plumbers
| is boring, been done on dead wood. I'm not saying boring
| != !useful.
|
| > There are a lot of "top x y in z" list sites
|
| This is an understatement. I actually want to know the
| top x in y, to do that I need "rational discourse".
| Rational discourse is recognizable as well written,
| insightful, humble, reflective, self-countering,
| anecdotal etc. By "search is terrible" I mean with
| respect to finding this.
|
| > OP's project isn't optimized for relevance, it's
| optimized for nostalgia
|
| Nostalgia is highly relevant if it's on topic, but
| agreeing with you as to what this engine is about.
| krapp wrote:
| >Rational discourse is recognizable as well written,
| insightful, humble, reflective, self-countering,
| anecdotal etc. By "search is terrible" I mean with
| respect to finding this.
|
| I believe a search engine that ignores results based on
| superficial and aesthetic qualities like "modern web
| design" would be even worse in that regard, unless you're
| assuming no relevant discourse about any subject has
| taken place on the web since the early 2000's.
|
| I admit, I have no idea what heuristic you would actually
| use to find "well written, insightful, humble,
| reflective, self-countering, anecdotal etc" content, but
| I've seen it on modern sites (even on Twitter,) and I've
| seen a lot of garbage on old sites, so a simple text
| search of only old websites doesn't seem like it.
|
| It is fun, though.
| Spivak wrote:
| Google Search is a fantastic product because it's
| essentially Spotlight for the web. It's by far the fastest
| way to get to things you already vaguely know are there and
| acts as a metasearch for large sites.
|
| But as a result it's now less useful as a tool for scouring
| the web.
| mountain_peak wrote:
| Likewise, I co-maintain the only "fan" site on one of my all-
| time favourite composers/performers, and gave the engine a shot
| with a unique string query. While my text-heavy WP-driven site
| didn't seem to make the cut, the results were highly relevant
| in that they were links to former band members and
| collaborators - a couple of which I didn't realize existed.
| That being said, there were a few sites (including my own) I
| expected to be returned, but no dice. Still, a fascinating
| experiment that many at HN have been clamouring for.
| xipho wrote:
| Exactly this. A couple results returned reference to obscure
| now-defunct newsletters and clubs, people that I know were
| historically important for past researchers, but only because
| this was my research focus for so long would I have known
| this.
| marginalia_nu wrote:
| The search engine doesn't actually do full text search, so
| maybe your query was too... unique.
|
| But do first of all verify that you haven't been hacked.
| There's about a quarter of a million domains I've flagged that,
| besides their wordpress content, also host a ton of link spam
| crap off in some hidden folder. This reflects on the quality
| rating extremely negatively to the point where you may have
| not been indexed at all.
|
| Secondly, are you behind cloudflare or some other big-name
| CDN? Because, as I mentioned in another comment, I can't
| crawl their pages without getting captchad until they approve
| of my humble request to be classified as a good bot.
|
| There are some other hosting providers I flat out block on a
| subnet level because they host a large amount of link farms.
| This is currently Alibaba, Psychz, eSited, Cloud Yuqu and
| 1Blu.
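|
| A rough sketch of what a subnet-level block can look like, in
| Python. The address ranges below are placeholders from the
| documentation-reserved blocks, not the ranges of the providers
| named above, and the function name is made up for illustration.

```python
import ipaddress

# Placeholder subnets (reserved for documentation), NOT real hosting ranges.
BLOCKED_SUBNETS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked subnet."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_SUBNETS)


if __name__ == "__main__":
    for candidate in ("192.0.2.17", "203.0.113.5"):
        print(candidate, "blocked" if is_blocked(candidate) else "allowed")
```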
| mountain_peak wrote:
| Thanks for the advice; not hacked, but I have "resurrected"
| many WP sites that have been (including my wife's non-
| profit). Just running on an EC2 micro instance, but I tried
| adding "site:" and received "No such domain". Actually, I
| think it's because I haven't enabled "HTTPS" yet! That's on
| my to-do along with migrating off EC2-Classic to VPC...
| marginalia_nu wrote:
| Vanilla HTTP should be fine. I think 80% of the urls are
| HTTP.
|
| If you're getting no such domain, it's either blocked
| because it looks too much like a spam domain, or it
| simply hasn't been discovered yet.
|
| What's the TLD? I severely restrict some cheaper TLDs
| because they gave so much spam.
|
| For example, cr.yp.to is an example of a baby I know I've
| definitely thrown out with the bathwater.
| withinboredom wrote:
| It'd be nice if you had a page to get the current index
| status for a domain.
| marginalia_nu wrote:
| Try a query of the form site:www.example.com ;-)
| rovr138 wrote:
| Would it be possible to have a link to a page with
| operators?
| duckmysick wrote:
| I'm intrigued by this experiment but I can't visualize it. What
| do you mean by boring results? Would combing through a library
| (the one with paper books) also produce boring results? What's
| are your ideal results?
| xipho wrote:
| Perhaps a counter example, something that is interesting.
| Anecdotally. This, of all things, is the _top_ result in my
| search: https://tft.brainiac.com/archive/0303/msg00037.html.
| Which is strange to me because I don't recognize
| tft.brainiac. I click, it's a list of biological
| relationships among Hymenoptera, including a reference to
| genus of the wasps I studied, presumably in a biological
| relationship (host/parasite) context. I cataloged every
| relationship known at one point, so my brain wants to know
| where this came from, and whether it is something I caught.
| Then I go
| look for more context, and find it's part of a thread about
| D&D(?) and hymenoptera, and it's epic, and a chunk of my
| morning is lost figuring out why and how this came to be.
| duckmysick wrote:
| Yes, thanks. That helps.
|
| If I understand it correctly, you're interested in bits and
| pieces of new information that's indirectly related to your
| object of interest. Degree 2 and 3 in Six Degrees of Kevin
| Bacon, so to speak. You know degree 0 like the back of your
| hand and you've seen almost everything closely connected.
| Finding novel, interesting things is getting more
| difficult.
|
| Have you thought about cataloging all the related stuff you
| stumble upon? Something in between loose notes and what
| Moby Dick is to cetology.
| xipho wrote:
| Exactly.
|
| > Have you thought about cataloging all the related stuff
| you stumble upon? Something in between loose notes and
| what Moby Dick is to cetology.
|
| Tongue in cheek- new app time, to facilitate this. It
| should have the name "Degree4". Entries can only be made
| if degrees 2 and 3 are "defined". Scoffs at degrees 5 and
| 6, just because. Startup developing can probably
| unethically seed content by mining
| https://www.everything2.com/. Should use concepts of "AI"
| and "persistent homology"... profit!
|
| But no, I don't outside a mental note. Closest I would
| come would be adding '!! <some note>' to my potwiki text
| notes (see my past comments) if it's something I want to
| have come back with a grep, or think might be interesting
| to explore "when I retire". If it's a scientific fact in
| my field after researching it further it would go into
| this https://taxonworks.org (or its precursor).
| xipho wrote:
| In part, by boring results I mean I instantly recognize the
| top results, and I know exactly what will be in them, and I
| know which ones will actually contain potentially interesting
| new stuff, i.e. _I didn't have to search for these, I'd go
| there directly_. The next results are all obscure, and I've
| already visited them, and/or I know they are historical and
| not something I have to revisit.
|
| With this engine with at least 1/2 the links (to be fair
| there were < 20) I didn't recognize the URL at all, and it
| was clear in the text or the URL that there was an
| interesting bit to check out (i.e. what Google should have
| also returned after they barfed out the things I don't need
| to know about), but had never succinctly done in my
| experience.
|
| I suppose the magic in this engine would have to be alerting
| the searcher that they found more of this type of link, as
| once I visited the 10 or so sites they would fall back into
| the "been there, done that" link category that Google appends
| somewhere after the ads and "big" sites, mixed in with a
| million search term spam sites, etc.
| xattt wrote:
| There's certain grey literature that's not captured in
| university library federated searches nor easily found with
| mainstream search engines.
| xipho wrote:
| There are decades of academic research not digitized. The
| digitization window used to only hit around 1990, I haven't
| looked at it hard recently, but I suspect this still
| remains true for many important journals. This is grey only
| to those who do not know how to use a library.
| nagyf wrote:
| This is great, I like the results. Couple of things I noticed:
|
| - Search results are often very old, from the early 2000s (I guess
| because back then more websites were text oriented). Are you
| taking into account the age of the page when showing results? It
| would be great to see more up-to-date results at the top
|
| - I noticed a few results which directed me to websites with
| security risks, Firefox didn't even let me open them. Is it
| possible to filter these out from the results?
| horsh1 wrote:
| No cyrillic or hiragana support :-(
| 300bps wrote:
| What we need now is a search engine that weeds out sites that
| have been SEO optimized for keyword density.
|
| I'm tired of searching for "generic keyword" and getting a page
| with an extremely low signal to noise ratio written like this:
|
| "Many people search for generic keyword. That is why you can find
| all about generic keyword here. In fact we specialize in generic
| keyword and slight alterations of generic keyword."
|
| It's like Google stopped caring that people were gaming it.
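|
| For illustration, a naive way to measure the keyword density
| complained about above, in Python. This is only a sketch of the
| metric; the tokenizer is an assumption and nothing here reflects
| how any real search engine scores pages.

```python
import re


def keyword_density(text: str, phrase: str) -> float:
    """Fraction of the words in `text` that belong to occurrences of `phrase`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    target = phrase.lower().split()
    if not words or not target:
        return 0.0
    n = len(target)
    # Count every position where the phrase occurs as consecutive words.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)
    return hits * n / len(words)


if __name__ == "__main__":
    spam = (
        "Many people search for generic keyword. That is why you can find "
        "all about generic keyword here. In fact we specialize in generic "
        "keyword and slight alterations of generic keyword."
    )
    print(f"{keyword_density(spam, 'generic keyword'):.2f}")
```

| A ranking function could penalize pages whose density for the
| query phrase is far above what ordinary prose produces.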
| Nicksil wrote:
| - Semantic HTML; not everything is a div; correct use of markup.
|
| - Search results are not overrun with commercial, SEO stuffing,
| "content" farms.
|
| I don't know what to say. This is such a refreshing sight. Well
| done.
| thetanil wrote:
| Yes please! More of this!
| hulitu wrote:
| "Search results Search "alt.sysadmin.recovery" needs to be a word
| Those were all the results,"
|
| No comment.
| hdjjhhvvhga wrote:
| Congratulations, great work!
| tomaszs wrote:
| I like the concept of a search engine that does not try to figure
| out what I should learn based on what I search. I know what I
| search for
| camillomiller wrote:
| Great idea, awful UI
| muxator wrote:
| How so? It's intuitive and super fast. I wish there were more
| websites with such a simple UI.
| feikname wrote:
| it's too "uncompact". Font size too big and could use a bit
| more horizontal space.
|
| I find it comfortable to use at 60% zoom level
| marginalia_nu wrote:
| Some people like a flashy UI, the modern look is important
| for them. It's ok to have aesthetic preferences, let's not
| pretend we don't all have them.
|
| In the end, it's a niche search engine I've made, the
| intended audience is the long tail. It just isn't for
| everyone, and if it was for everyone, it probably would be
| lesser for it.
| silent_cal wrote:
| I like the UI.
| ravenstine wrote:
| I'm glad you aren't trying to please anyone. I'd like a
| return to an internet with fewer colors, gadgets and
| gizmos, custom fonts, TypeKit, JavaScript requirements, and
| so on. Most of the time I'm reading articles, so just give
| me more text and less fluff!
| jbj wrote:
| What makes you find the user interface awful?
|
| It is literally a search website with a text box for a search
| term and a button to do the search.
| r00t4ccess wrote:
| The page isn't prompting for cookie preferences, asking to
| allow notifications, popping up a mailing list or coupon
| halfway down the page, playing a full page video with sound,
| or loading 97 million lines of javascript. I'd say it's pretty
| much perfect.
| IggleSniggle wrote:
| Huh. I think it's a great UI. What did you not like about it?
| stronglikedan wrote:
| Ironic comment, considering that this is a search engine to
| weed out sites with awful UIs. This gives us exactly what we
| need in a search UI - no more, no less - in a clean and
| intuitive way.
| fouc wrote:
| Yeah the design could use some work. The search results are not
| compact - I only see 1 result without scrolling, not counting
| the related wikipedia link that apparently has no description.
|
| I don't particularly like that it seems to be a column
| constrained to 550px width, instead of being responsive and
| taking advantage of greater widths.
|
| to the author of the site, if you're not really into
| design/css, take a look at tailwindcss, it makes it fairly easy
| to produce a minimal amount of css that is responsive.
| agumonkey wrote:
| Very nice. Start a trend :)
| marginalia_nu wrote:
| Yeah so this is my project. It's very much a work in progress,
| but occasionally I think it works remarkably well for something I
| cobbled together alone out of consumer hardware and home-made
| code :-)
| eigengrau5150 wrote:
| I like this. Thanks for doing it.
| scrollaway wrote:
| I searched Warcraft and got a gold selling/ level boosting
| site. Some things never change :)
| bityard wrote:
| This is awesome. I've been looking for a long time for a search
| engine that basically takes everything Google does and does the
| opposite. Thank you for doing this, I will definitely be
| bookmarking it.
|
| Is there a way to suggest or add sites? I went looking for
| woodgears.ca and only got one result. I also think my personal
| blog would be a good candidate for being indexed here but I
| couldn't find any results for it.
| ColinHayhurst wrote:
| Great work. Working on an alternative search engine too. Take a
| look at my profile.
| soheil wrote:
| Awesome project! How are you able to keep the site running
| after HN kiss of death? What is your stack, elastic search or
| something simpler? How did you crawl so many websites for a
| project this size? Did you use any APIs like duck duck go or
| data from other search engines? Are you still incorporating
| something like PageRank to ensure good results are prioritized
| or is it just the text-based-ness factor?
| noduerme wrote:
| I love this idea, and admire the work you put into it. I'm a
| fan of long reads and historical non-fiction, and Google's
| results are truly garbage.
|
| I have a criticism that I think may pertain to the ranking
| methodology. I searched for "discovery of Australia". Among the
| top results were:
|
| * A site claiming that the biblical flood was caused by Earth
| colliding with a comet (with several other pages from that site
| also making the top search results with other wild claims, e.g.
| that the Egyptians discovered Arizona);
|
| * Another site claiming the first inhabitants of Australia were
| a lost tribe of Israel;
|
| * A third site claiming that Australia was discovered and
| founded by members of a secret society of Rosicrucians who had
| infiltrated the Dutch East India Company and planned to build
| an Australian utopia...
|
| These were all pages heavy with HTML4 tags and virtually devoid
| of Javascript, the kinds of pages you'd frequently see in the
| late 1990s from people who had built their own static websites
| in a text editor, or exported HTML from MS Word. At that time,
| there were millions of those sites with people paying for their
| own unique domain names, and so the proportion of them that
| were home to wild-eyed conspiracy theories was relatively
| small. What I think has happened is that kooks continued to
| keep these sites up - to the point where it's almost a visual
| trope now to see a red <h1> tag in Times New Roman and think,
| uh oh, I've stumbled on an "ancient aliens" site. Whereas
| scholars and journals offering higher quality information have
| moved to more modern platforms that rely more heavily on modern
| browsers - with or without their own domain names. So as a
| result what seemed to surface here were the fragments of the
| old web that remain live - possibly because people living in
| cabins in Montana forget to cancel their web hosting, or
| because the nature of old-school conspiracy theorists is to
| just keep packing their old sites with walls of text surrounded
| by <p> tags.
|
| Arguably, this seems to rank the way Google's engine used to,
| since it couldn't run JS and they wanted to punish sites that
| used code to change markup at render time. At least, when I
| used to have to do onsite SEO work, it was always about simple
| tag hierarchies.
|
| I wonder whether there isn't some better metric of validity and
| information quality than what markup is used. Some of the sites
| that surfaced further down could be considered interesting and
| valuable resources. I think _not punishing_ simple wall-of-text
| content is a good thing. But to punish more complicated layouts
| may have the perverse effect of downranking higher-quality
| sources of information - i.e. people and organizations who can
| afford to build a decent website, or who care to migrate to a
| modern blogging platform.
| crocodiletears wrote:
| It's very rare that I see a project on HN I can see myself
| using. This is one. Like others have said, the results can be a
| little rough. But they're rough in a way I think is much more
| manageable than the idiosyncrasies of more 'clever' search
| engines.
| marginalia_nu wrote:
| I think you need to approach it more like grep than google.
| It's a forgotten art, dealing with this type of dumb search
| engine.
|
| Like if you search for "How do I make a steak", you aren't
| going to get very good results. But a better query is "Steak
| Recipe", as that is at least a conceivable H1-tag.
| bluefox wrote:
| This is a very cool project! Thank you.
| BugsJustFindMe wrote:
| I love this, and I love (many of) the results so far! What I
| can't find on the site is detail about what "too many modern
| web design features" means. Is it just penalizing sites with
| tons of JavaScript?
| marginalia_nu wrote:
| Javascript tags are penalized the hardest, but it also takes
| into consideration density of text per HTML. There's also
| some adjustments based on text length, which words occur in
| the page, etc.
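|
| A toy sketch of a score along those lines, in Python: count
| script tags and measure visible text per byte of HTML. The
| penalty weight and the formula are invented for illustration and
| are not the search engine's actual ranking.

```python
from html.parser import HTMLParser

SCRIPT_PENALTY = 0.25  # hypothetical weight per <script> tag


class PageStats(HTMLParser):
    """Counts <script> tags and visible text characters."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.text_chars = 0
        self._skip = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.text_chars += len(data.strip())


def page_stats(html: str):
    """Return (number of script tags, number of visible text characters)."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    return parser.script_tags, parser.text_chars


def quality(html: str) -> float:
    """Text density per byte of HTML, minus a penalty per script tag."""
    scripts, text_chars = page_stats(html)
    density = text_chars / max(len(html), 1)
    return density - SCRIPT_PENALTY * scripts
```

| With these made-up weights, a short hand-written page scores
| higher than a script-laden page with the same amount of text.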
| ad404b8a372f2b9 wrote:
| Very cool project! How many websites do you have in your index?
| And how did you go about building it?
|
| I've been working on an engine for personal websites, currently
| trying to build a classifier to extract them from commoncrawl,
| if you have any general tips on that kind of project they'd be
| very welcome.
| davegauer wrote:
| This is absolutely wonderful. I am LOVING the results I'm
| getting back from it: the sort of content-rich sites that have
| become nigh unreachable using traditional search engines. Thank
| you for building this!
| asah wrote:
| Love it, kudos! This is great for developers and others who
| Just Need Answers and not shopping or entertainment.
|
| If you're looking for feedback, both from a UI design and
| utility standpoint, you might consider "inlining" results from
| selected sites, e.g. Wikipedia, Stack Exchange, etc. Having
| worked on search for a long time, inlining (onebox etc) is a
| big reason users choose Google, and that challengers fail to get
| traction. If you're Serious(tm), dig into the publisher
| structure formats and format those, create a test suite, etc.
|
| A word of caution: if this takes off, as a business it's
| vulnerable to Google shifting its algorithms slightly to
| identify the segment of users+queries who prefer these results
| and give the same results to those queries.
|
| Hope this helps!
| marginalia_nu wrote:
| If Google starts showing interesting text-heavy links instead
| of vapid listicles and storefronts, I have accomplished
| everything I ever could dream of.
| 0xbadcafebee wrote:
| Thank you for doing this important work.
| palijer wrote:
| Haha, reminds me exactly of this.
|
| https://xkcd.com/810/
| santamex wrote:
| Which software do you use to index the sites?
| marginalia_nu wrote:
| I wrote it myself from scratch. I have some metadata in
| mariadb, but the index is bespoke.
|
| A design sketch of the index is that it uses one file with
| sorted URL IDs, one with IDs of N-grams (i.e. words and word-
| pairs) referring to ranges in the URL file; as well as a
| dictionary for relating words to word-IDs; that's a GNU Trove
| hash map I modified to use memory-mapped data instead of
| directly allocated arrays.
|
| So when you search for two words, it translates them into IDs
| using the special hash map, goes to the words file and finds
| the least common of the words; starts with that.
|
| Then it goes to the words file and looks up the URL range of
| the first word.
|
| Then it goes to the words file and looks up the URL range of
| the second word.
|
| Then it goes through the less common word's range and does a
| binary search for each of those in the range of the more
| common word.
|
| Then it grabs the first N results, and translates them into
| URLs (through mariadb); and that's your search result.
|
| I'm skipping over a few steps, but that's the very crudest of
| outlines.
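|
| A rough sketch of the two-word lookup outlined above, not the
| actual implementation: the in-memory INDEX dict is a hypothetical
| stand-in for the dictionary, words file and URL file, and the
| final ID-to-URL translation step is left out.

```python
from bisect import bisect_left

# Hypothetical toy index: word -> sorted list of URL IDs. In the design
# described above this would be a range in an on-disk file of sorted IDs.
INDEX = {
    "roman": [2, 5, 8, 13, 21, 34],
    "empire": [1, 2, 3, 5, 8, 9, 13, 40],
}


def contains(sorted_ids, target):
    """Binary search for target in a sorted list of IDs."""
    i = bisect_left(sorted_ids, target)
    return i < len(sorted_ids) and sorted_ids[i] == target


def search(word_a, word_b, limit=10):
    """Return up to `limit` URL IDs whose documents contain both words."""
    range_a = INDEX.get(word_a, [])
    range_b = INDEX.get(word_b, [])
    # Walk the range of the less common word, probe the more common one.
    rare, common = sorted((range_a, range_b), key=len)
    hits = []
    for url_id in rare:
        if contains(common, url_id):
            hits.append(url_id)
            if len(hits) == limit:
                break
    return hits


print(search("roman", "empire"))
```

| Walking the shorter range keeps the number of binary searches
| proportional to the rarer word, which is presumably why the
| lookup starts there.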
| q3k wrote:
| Good stuff. I've also been toying with doing some homegrown
| search engine indexing (as an exercise in scalable
| systems), and this is a fantastic result and great
| inspiration.
|
| Definitely want to see more people doing that kind of low-
| level work instead of falling back to either 'use
| elasticsearch' or 'you can't, you're not google'.
| marginalia_nu wrote:
| Well just crunching the numbers should indicate what is
| possible and what isn't.
|
| For the moment I have just south of 20 million URLs
| indexed.
|
| 1 x 20 million bytes = 20 Mb.
|
| 10 x 20 million bytes = 200 Mb.
|
| 100 x 20 million bytes = 2 Gb.
|
| 1,000 x 20 million bytes = 20 Gb.
|
| 10,000 x 20 million bytes = 200 Gb.
|
| 100,000 x 20 million bytes = 2 Tb.
|
| 1,000,000 x 20 million bytes = 20 Tb.
|
| This is still within what consumer hardware can deal
| with. It's getting expensive, but you don't need a
| datacenter to store 20 Tb worth of data.
|
| How many bytes do you need, per document, for an index?
| Do you need 1 Mb of data to store index information about
| a page that, in terms of text alone, is perhaps 10 Kb?
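|
| The arithmetic above can be reproduced with a few lines; this is
| only a sketch, the function names are made up for illustration,
| and it uses decimal units (1 MB = 1,000,000 bytes).

```python
URLS = 20_000_000  # roughly the number of indexed URLs mentioned above


def index_size_bytes(bytes_per_doc, urls=URLS):
    """Total index size if every document costs `bytes_per_doc` bytes."""
    return bytes_per_doc * urls


def human(n):
    """Format a byte count with decimal units."""
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if n < 1000:
            return f"{n:g} {unit}"
        n /= 1000
    return f"{n:g} PB"


for per_doc in (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{per_doc:>9,} B/doc -> {human(index_size_bytes(per_doc))}")
```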
| rvnx wrote:
| It's a great project!
| axelroze wrote:
| Hi,
|
| Interesting idea. Definitely see an overlap with eReader
| markets and looking at text only contents.
|
| How does it work?
|
| It ignores pages on which it detects frameworks for ui and ads
| or any javascript code at all?
| agumonkey wrote:
| is there a json endpoint ? I'd love to make an emacs bridge :)
| artembugara wrote:
| Nice, what are you using to crawl the web?
| marginalia_nu wrote:
| It's pretty much all bespoke.
|
| I use external libraries for parsing HTML (JSoup) and
| robots.txt; but that's about it.
| blondin wrote:
| fantastic project, thank you!
| habibur wrote:
| How are you doing the crawling without getting blocked? -- the
| hardest part.
| judge2020 wrote:
| Not OP but crawling is easy if you don't try scanning 5+
| pages a second - almost all rate limiting/heuristic based
| 'keep server costs low' engines, including Cloudflare, don't
| care if you request every page, but will take action if you
| do something like burst every page and take up just as many
| server resources as a hundred concurrent users.
|
| Now, that is assuming you aren't on some VPS provider. If
| you're going to crawl, you'll have the best chance when you
| use your own IPs on your own ASN, with DNS and reverse DNS
| set up correctly. This makes it so the IP reputation systems
| can detect you as a crawler but not one that hammers every
| site it visits.
|
| Also, I imagine that, for a search engine like this, it
| doesn't expect content to change much anyways - so it can
| take its time crawling every site only once every month or
| two, instead of the multiple times a week (or day) search
| engines like Google have to for the constantly-updated
| content being churned out.
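|
| One way to stay under that kind of threshold is a per-host delay.
| The sketch below is an assumption about how a polite crawler might
| schedule requests, not a description of any real crawler; the
| class name and the one-second default are invented.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay_s=1.0, clock=time.monotonic):
        self.min_delay_s = min_delay_s  # assumed gap between hits on one host
        self.clock = clock
        self.next_allowed = {}

    def wait_time(self, url):
        """Seconds to sleep before fetching `url`; also reserves the slot."""
        host = urlparse(url).netloc
        now = self.clock()
        ready_at = self.next_allowed.get(host, now)
        wait = max(0.0, ready_at - now)
        # The following request to this host must come min_delay_s later.
        self.next_allowed[host] = max(ready_at, now) + self.min_delay_s
        return wait
```

| A crawl loop would call time.sleep(scheduler.wait_time(url))
| before each fetch, so different hosts can be interleaved while no
| single host sees a burst.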
| androceium wrote:
| Pretty neat!!!
|
| You may already be aware of this, but the page doesn't seem to
| be formatted correctly on mobile. The content shows in a single
| thin column in the middle.
| marginalia_nu wrote:
| Hmm, which OS? I only have a single Android phone so I've
| only fixed the CSS for that.
| androceium wrote:
| I was seeing it on Android w/ Firefox. Seems like it's
| fixed now though. :)
| ant6n wrote:
| For example Firefox on Android.
| egberts1 wrote:
| I tried "Error 49" as a search phrase.
|
| It's rudimentary but no IT-related result.
| thrtythreeforty wrote:
| > New: You can now look up dictionary definitions for words. If
| you for example don't know what the definition of is is, you can
| inquire thus: define:is.
|
| Oh man, I love subtle jabs and tongue in cheek writing like this.
| Very Robin Williams-esque.
| marginalia_nu wrote:
| I am the first to admit it's a pretty dated reference.
| earthbee wrote:
| I love this! I've been searching random words with no aim in
| particular and keep finding lots of interesting tiny personal
| webpages. It feels like the old web
| [deleted]
| arduinomancer wrote:
| Wow this is immediately useful
|
| If you figure out some sort of funding model (maybe even just
| Patreon) I could totally see this as a viable side project
|
| Already discovered this recipe site: https://based.cooking/
|
| I love how adding recipes is through pull requests:
| https://github.com/LukeSmithxyz/based.cooking/pulls
| dmje wrote:
| Love it. You should provide a link to Patreon / whatever so
| people can support you financially. Hosting is probably not cheap
| for you. Given the love here on HN I suspect you'd do well.
| spandrew wrote:
| All of my searches are turning up unrelated results ("college
| life after the pandemic", "post-pandemic teaching in higher
| education", "football news NFL" etc.)
|
| NFL one had 'some' decently related results, but the websites
| were all strangely disreputable.
| mmmpop wrote:
| > the websites were all strangely disreputable
|
| Interesting you'd feel that way when sites without "modern
| design" are encountered. Is this your own bias perhaps creating
| a judgment or are they sites that you already know have a bad
| reputation?
| typon wrote:
| Or perhaps the websites being returned are garbage? I have
| the same experience trying a few searches and following the
| top 5 links. Besides wikipedia, I haven't found a single
| useful website.
| abhinav22 wrote:
| Great work and congrats!
| skyfaller wrote:
| This is a fantastic search engine. It delivers on its promise of
| "serendipity". I found pages featuring my name that I'm not sure
| I've ever seen before, after many years of searching myself to
| test out search engines.
|
| Perhaps more importantly, it delivers the most correct result
| when searching for my username: the first result is not any of my
| social media accounts, or even my own blog, but the text of the
| obscure science fiction story that I took my username from! Well
| done.
|
| I've immediately added this as a search keyword in Firefox, and
| I'll be using it more in the future.
|
| Could meta search engines like DuckDuckGo include this as a
| source? Should they?
| pietroppeter wrote:
| from About page:
|
| > If you search for "Plato", you might for example end up at the
| Canterbury Tales. Go looking for the Canterbury Tales, and you
| may stumble upon Neil Gaiman's blog.
|
| I know it is just a suggestion, but had to try searching both,
| with no luck in getting the expected unexpected.
| marginalia_nu wrote:
| Yeah I did some work very recently aimed at improving the
| relevance a bit. It was a bit too random in the state it was
| before. Now it, perhaps, isn't random enough anymore.
| pietroppeter wrote:
| It looks very nice anyway, great job! I did try with other
| queries and results were in general interesting.
| fsflover wrote:
| See also: https://wiby.me/
| [deleted]
| Tade0 wrote:
| The "surprise me..." button is adequately labelled.
| twobitshifter wrote:
| Shades of stumbleupon.
| tpmx wrote:
| Great link to drag to the bookmark bar.
| bovermyer wrote:
| I adore this. Unfortunately, searching for my own name - with or
| without quotes - doesn't actually find my site.
|
| It does find a handful of references to me from over twenty years
| ago, though, which I thought was fascinating.
| tgv wrote:
| My name retrieved the "dead pornstar list". Unexpected.
| JohnJamesRambo wrote:
| Saving this forever. Thank you for making it.
| kodeninja wrote:
| Ivermectin (marginalia):
| https://search.marginalia.nu/search?query=ivermectin+
|
| Ivermectin (Google): https://www.google.com/search?q=ivermectin
|
| The difference in the overall _thrust_ of the results is
| remarkable.
|
| Very interesting! Thanks for building it.
| typon wrote:
| The Google results tell you why Ivermectin is not a good
| replacement for vaccination against Covid, the Marginalia
| results tell you that Ivermectin is a miracle drug for treating
| Covid 19. Really shows how much technology has the power to
| change reality in today's world.
| lame-robot-hoax wrote:
| Google links to the FDA, CDC, WHO, NIH, WebMD, drugs.com, and
| a pro ivermectin journal article from the American Journal of
| Therapeutics.
|
| The Marginalia results point you mostly to random blogs.
| marginalia_nu wrote:
| Part of what I wanted to show with this project is that there
| is no such thing as an objective search engine. Even
| seemingly irrelevant technological decisions drastically
| impact the narrative.
| Drew_ wrote:
| Well the focus on text content isn't the only technical
| difference here. Google is obviously weighing hundreds of
| signals in its search results that your engine is not
| accounting for. These omitted signals are also relevant.
| sundarurfriend wrote:
| > These omitted signals are also relevant.
|
| Certainly. And sometimes they're relevant in a good way,
| sometimes in a bad user-hostile way. Every search engine
| requires discrimination and intelligent usage by the
| person doing the search, just in different areas.
| marginalia_nu wrote:
| Right, but that is still a technical decision on their
| side. They presumably don't sit down and have a meeting
| about what world view they should present. Well I hope
| they don't.
| silent_cal wrote:
| Lol!
| daxfohl wrote:
| Though before basing life-and-death decisions on this, consider
| reading the "about" page first:
| https://memex.marginalia.nu/projects/edge/about.gmi
|
| > The purpose of the tool is primarily to help you find and
| navigate the strange parts of the internet. Where, for sure,
| you'll find crack-pots, communists, libertarians, anarchists,
| strange religious cults, snake oil peddlers, really strong
| opinions.
|
| and
|
| > If you are looking for fact, this is almost certainly the
| wrong tool.
| lame-robot-hoax wrote:
| Yes, google returns results from the FDA, American Journal of
| Therapeutics pro Ivermectin study, WebMD, the CDC, the NIH,
| Wikipedia, the WHO, and New York Times.
|
| Marginalia returns results from Wikipedia, a faculty member's
| university blog regarding river blindness, a website called
| truthsummit promoting it as a miracle cure, a website called
| vaxxchoice promoting it as a cure, vitamindsstopcovid, etc.
|
| I'd say the quality of the results is quite different.
| motoxpro wrote:
| Second result: "Ivermectin, a miracle drug against Covid
| Ivermectin, a miracle drug against Covid. 100% effective as
| preventative and for early stage Covid. Over 90% cut in
| fatality rate for late-stage cases.
|
| https://truthsummit.info/blog/ivermectin-against-covid.html "
|
| Eh I think I'll take the google search results on this one.
| pkamb wrote:
| Great results for "sauna". Lots of Web 1.0 pages discussing
| building plans and displaying pictures of individually built,
| traditional, unique, old saunas on some property.
|
| The Google results are all blogspam or sales pages for cheap
| shipped saunas. Lots of "IR" results. Phony health benefit pages.
| Stock photos solely of beautiful new hotel gyms.
|
| I've noticed this problem with Google results for quite some
| time. Sadly, the _new_ content being created of the top variety
| is mostly being done within private Facebook groups that can't
| be easily searched, linked, or archived.
| rfrey wrote:
| This is stunning. I searched "winemaking" because it's my latest
| obsession, and turned up dozens of links to high-quality pages
| I'd never seen despite spending an hour a day for three months
| cruising Google on the topic.
|
| Please do announce it here if you ever decide to solicit help or
| contributors. My stab at this problem was to have a search index
| of only ad-free pages, on the hypothesis it would turn up self-
| hosted blogs, university personal pages, that sort of thing. But
| the results were too thin, your approach is much better.
| winddude wrote:
| hmm, I dream of recipes search engine that punishes recipes pages
| with too much text. lol
| mint2 wrote:
| Yeah, recipes sites have both too much text and too many
| pictures.
|
| But they do illustrate what this search engine needs to watch
| out for. If they rank more text higher and their search site
| becomes popular, won't everyone just spam recipe site word
| salad, maybe even ai generated word salad?
|
| But in the interval, until that day comes, they are going to
| have a very useful service.
| Paul_S wrote:
| Looking for an arm assembly instruction, instead I get this
| strange website as the result
| http://mailstar.net/coronavirus.html
|
| Is that accidental or is this website promoted because it's text
| heavy and will surface for any search without many results?
| marginalia_nu wrote:
| Looks like that page just has an absurd amount of keywords.
| Those sometimes surface when there aren't any good results.
| Haven't found a foolproof detection method that doesn't
| unjustly punish innocent pages with large amounts of content.
| tbojanin wrote:
| this is sweet
| gen_greyface wrote:
| Hi, It'd be nice if you could add an OpenSearch description
| document for your site.
|
| https://developer.mozilla.org/en-US/docs/Web/OpenSearch
| gen_greyface wrote:
| until then i'll keep the site bookmarked. :-)
| josefresco wrote:
| If you like wacky search engines, there's also Million Short:
| https://millionshort.com where you can search and remove the top
| 100/1K/10k/100K/1M results.
| ephbit wrote:
| Quoted from the linked site:
|
| > Convenience functions have been added, and the search engine
| can now perform simple calculations and unit conversions. Try 1
| pint in cubic centimeters, or 50+sqrt(pi). This functionality is
| still under development, be patient if it doesn't work.
|
| Why would you make any ever so small effort to implement
| calculations? I don't get it.
|
| If your search engine enabled me to find more useful search
| results to my queries than google or yacy or whatever, I wouldn't
| care one tiny bit about being able to do calculations with it.
|
| Why not focus on the search functionality?
| marginalia_nu wrote:
| I implemented calculations because easily 80% of my google
| queries are calculations, unit conversions, etc.
|
| Search functionality is a larger priority. Calculations and unit
| conversions were an afternoon's break from the search
| functionality :-)
| exporectomy wrote:
| How else do you do unit conversions? I use Google because it's
| far easier than any other software I've tried. Mainly because
| it's more forgiving of errors. It knows that "34 fset in
| msters" is 10.3632 meter. This search engine isn't, though, so
| I wouldn't waste time trying to discover its unit conversion
| syntax rules.
| samhh wrote:
| On macOS for example I'd use Spotlight.
| drusepth wrote:
| Interesting approach.
|
| I always search myself on new search engines to compare the
| results. Most engines return my personal blog/website,
| books/stories I've written, news stories, my github
| projects/contributions, social links, etc.
|
| This search engine surfaces just three obscure IRC logs that
| contain my nick in join/part messages (nothing said from me!)
| from 2009. And nothing else.
|
| There's probably some things this approach is really good at but
| I'm not sure what they'd be for me off hand. Always cool to see
| new approaches to search, though.
| fhackernewz wrote:
| fuck you hacker news
| fsckboy wrote:
| I've read most of the comments here and people are evaluating the
| search results: all good information.
|
| I'm looking at "punishes modern web design"... This thing IS
| modern web design. I think it's called "marginalia" in reference
| to the huge margins they chose!
|
| I'm using a browser on a linux desktop and side-by-side, HN's
| page design is old-fashioned tasteful making pretty good use of
| space, and marginalia has a font that's more than twice the 2D
| pointsize and is so spread out with whitespace that the "Tips" on
| the home page are off the bottom of my window.
| gtmb wrote:
| As everything in life flows in cycle, I predict the search engine
| that will de-throne Google will be like Google when it started -
| a simple variation of page rank.
|
| No smarts, no bubble, no signals decided by over fitting to a
| biased engineer preference.
| __MatrixMan__ wrote:
| I agree, except it'll optionally accept the ID of your node in
| a web of trust, and it'll use a page rank customized for you.
|
| Or you can put in two ID's and have it find sources that both
| parties trust.
| jerrre wrote:
| I wouldn't say the existence of this page proves your
| prediction right (as it's not dethroning Google anytime soon).
|
| It's easy to forget that the goal of Google isn't to provide a
| useful search engine (at least not anymore), but the search
| engine is a by product of them wanting to show ads.
| marcos100 wrote:
| If Google isn't useful then nobody will use it.
|
| The search engine and the ads are tightly coupled. A better
| search engine means it can predict with more accuracy what
| you are looking for and can serve you an even more targeted
| ad that increases the chance you'll click.
| rchaud wrote:
| > If Google isn't useful then nobody will use it.
|
| Or they'll continue to use it out of sheer inertia. Google
| is paying Apple $15 billion to keep its place as iOS
| default search engine.
|
| IE6 didn't die overnight when Firefox arrived.
| popcube wrote:
| Now Google tries to imitate a document system on your computer;
| usually I rely on Google to know what I need :(
| antupis wrote:
| As a dev I would love a search engine which would only search
| Stack Overflow, GitHub issues, documentation, etc.
| axelroze wrote:
| You can limit the search query per website in DDG (and
| probably in others)
|
| Example: `rust slow compilation site:stackoverflow.com`
| axelroze wrote:
| Wouldn't the dethroner of Google be some new technology which
| is not a search engine like Google but better at solving the
| original task of finding information on how to solve problems?
|
| Just like how iPad dethroned Windows PCs for average home user
| but not Mac because Windows had the monopoly and then an
| innovation destroyed MS in this space and not a competitor.
|
| I don't think Google dethrones Yahoo and AltaVista scenario
| will occur again.
| gverrilla wrote:
| > iPad dethroned Windows PCs for average home user
|
| is this true? in the US, perhaps? because in south america it
| couldn't be further from the truth - didn't happen at all
| lightsurfer wrote:
| thank you!
| timdaub wrote:
| I'm developing a text-heavy site and philosophically I'm trying
| to view documents as just that... documents [1].
|
| But I don't get good results for "rug pull".
|
| - 1 https://rugpullindex.com
| marginalia_nu wrote:
| Yeah it's hosted by cloudflare. I'm currently IP-blocking them,
| because they keep prompting my crawler with a captcha,
| presumably because it's made millions of requests from their
| CDN.
|
| Some rigmarole getting recognized as a good bot by the CDNs.
| I've submitted a request fairly recently, but haven't heard
| back from them yet.
|
| Like I would like to be on good terms with them, and other
| websites that block small independent crawlers.
|
| I can't blame them though, there's a lot of bad bots out there.
| But I'm doing my best not to be part of the problem.
| [deleted]
| petercooper wrote:
| Aha, I was going to ask how you were coping with CDNs like
| Cloudflare blocking bots. It's sad we've got to this point
| where basically only the established search engines are
| grandfathered in to be able to crawl sites.
| BugsJustFindMe wrote:
| > _I'm developing a text-heavy site_
|
| I looked at the source for your site's front page. That's not
| text-heavy; that's markup-heavy. I didn't bother looking at the
| rest of the pages because it appears to be yet another crypto
| market site.
| greggturkington wrote:
| Wouldn't this just skew towards really old sites?
|
| The _third_ search result for "dog" is this page on how to
| remove AOL Instant Messenger, published in 2002.
|
| https://sillydog.org/netscape/kb/removeaim.html
|
| No one wants to see newsletter signup popovers, but "modern web
| design" includes good performance and relevant content. (The
| search engine itself takes about 2 seconds to first contentful
| paint, not great.)
| bityard wrote:
| This search engine pretty much takes everything that Google is
| doing and does the opposite. For instance, Google has decided
| that "relevant" usually also means "recent". Thus, when
| searching for something on Google, you mainly get results from
| blogspam farms and almost never do you see anything more than a
| few years old.
|
| An implication of this is that old sites tend to disappear
| (either into obscurity or by being taken down) because Google
| penalizes them in search rankings. The author of this search
| engine says, however:
|
| > If a webpage has been around for a long time, then odds are
| it has fundamental redeeming quality that has motivated keeping
| it around all for that time.
|
| I don't know that I agree 100% with this (there was lots of
| crap on the "old" web too), but it makes a certain amount of
| sense.
| greggturkington wrote:
| What "fundamental redeeming quality" about uninstalling AIM
| from Windows 3.x motivated making that the 3rd result for
| "dog"?
|
| The 5th result is a tutorial on CSS. This search engine
| decided it's relevant because it has "dog" in the URL. Is
| that a better reasoning than Google's?
| https://htmldog.com/guides/css/beginner/
|
| Core Web Vitals ranks sites higher that perform well. Text-
| heavy sites that are also optimized and relevant would
| already perform well.
| marginalia_nu wrote:
| What are you searching for when you enter the query "dog",
| keeping in mind the search engine deliberately does not
| examine synonyms and deliberately seeks out the path
| less taken?
|
| Dog facts? Then search "dog facts"
|
| Famous dogs? Then search "famous dogs"
|
| Rappers? Try "snoop dogg"
| [deleted]
| greggturkington wrote:
| I'm searching for information on "dog".
|
| Your suggestion of "dog facts" returns 6 pages from the
| same domain, dogquotes.com. It's unreadable on mobile
| because it's so old, all the facts are unsourced, and
| often wrong:
|
| > Never assume that a barking dog won't bute _[sic]_ ,
| unless you're absolutely certain the dog believes it too.
|
| Also on the 1st SERP, this odd blog post ranting about
| 4th amendment rights [1], "Media Glamorization of the
| Psychopath" [2], and this (image-heavy) page about
| dolphin encounters in the Bahamas ("Sea Dog Facts" is a
| link on the page) [3].
|
| 1. http://www.rexcurry.net/drugdogsdan.html
|
| 2. https://www.metaphoricalplatypus.com/articles/psychology/psychopathysociopathy/media-glamorization-of-the-psychopath/
|
| 3. https://www.dolphinencounters.com/education/
| samsaga2 wrote:
| Where does the data come from? Do you index the whole web
| yourself? I see it totally impossible for a personal project. I'm
| very curious about that.
| marginalia_nu wrote:
| I do indeed index the web myself. Not the _entire_ web, just a
| subset of it. The crawler quickly loses interest in
| javascript:y websites and only indexes at depth those websites
| that are simple. It also focuses on websites in English,
| Swedish and Latin and tries to identify and ignore the rest
| (best-effort).
|
| You'd be surprised how much you can do with modern hardware if
| you are scrappy. The current index is about 17.7 million URLs.
| I've gone as far as 50 million and could _probably_ double that
| if I really wanted to. The difficulty isn't having a small
| enough index, but rather having a relevant enough index,
| weeding out the link farms and stuff that just take space.
|
| I only index N-grams of up to 4 words, carefully chosen to be
| useful. The search engine, right now, is backed by a 317 Gb
| reverse index and a 5.2 Gb dictionary.
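|
| For illustration only, word N-grams of up to 4 words can be
| enumerated as below. This sketch emits every N-gram; how the
| useful ones are "carefully chosen" is not described here, and the
| function name is hypothetical.

```python
def word_ngrams(text, max_n=4):
    """Return all word N-grams of length 1..max_n, shortest first."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


print(word_ngrams("Fall of the Roman Empire"))
```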
| omoikane wrote:
| > It also focuses on websites in English, Swedish and Latin
| and tries to identify and ignore the rest
|
| When I search for Japanese terms, it says "<query> needs to
| be a word", which wasn't the best error message. Maybe the
| error message should say something like "sorry, your language
| isn't supported yet"?
| marginalia_nu wrote:
| I've rephrased the wording for that one a bit.
| throwaway47292 wrote:
| Amazing!
|
| I have only one recommendation that might make the search a
| bit more relevant, e.g when searching for 'linux locking' or
| 'kernel locking' kind of things.
|
| Try to upsort things that match near the top of the content,
| like the top of the man page vs middle vs bottom.
|
| One easy way to do it without having to store the positions,
| is to index the ngrams with min(sqrt(line number), 8),
| this will cover first 64 lines, you can also use log() or
| just decide ad hoc, top, middle, bottom of the document, so
| you can use only 3 values.
|
| e.g. https://www.kernel.org/doc/html/v5.0/kernel-
| hacking/locking.... would do unreliable_1 guide_1 locking_1
| ... then at line 4 kernel_2 locking_2 ... after line 50 ...
| then_7 ... and after that everything will be _8.
|
| then just make the query "kernel locking" to "dismax(kernel_1
| OR kernel_2 OR kernel_3...) AND dismax(locking_1 OR locking_2
| ...)" with some tiebreaker of 0.1 or so, you can also say "i
| want to upsort things on the same line, or few lines apart"
| by modifying the query a bit.
|
| It works really well and costs very little in terms of space,
| i tried it at https://github.com/jackdoe/zr while searching
| all of stackoverflow/man pages and etc and was pretty
| surprised by the result.
|
| This approach is a bit cheaper than storing the positions
| because positions are (lets say) 4 bytes per term per doc,
| while this approach has fixed upper bound cost of 8*4 per
| document (assuming 4 byte document ids) plus some amortized
| cost for the terms
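|
| A minimal sketch of the bucketing idea, assuming the bucket is
| floor(sqrt(line number)) capped at 8, which matches the _1, _2,
| _7, _8 example above. The function names are invented, and the
| dismax helper follows the usual "best score plus tiebreaker times
| the rest" definition rather than any particular engine's API.

```python
import math

MAX_BUCKET = 8  # lines 64 and beyond all share the last bucket


def line_bucket(line_number):
    """Map a 1-based line number to a position bucket in 1..8."""
    return min(math.isqrt(line_number), MAX_BUCKET)


def index_terms(text):
    """Return the set of position-tagged terms for a document."""
    terms = set()
    for line_number, line in enumerate(text.splitlines(), start=1):
        bucket = line_bucket(line_number)
        for word in line.lower().split():
            terms.add(f"{word}_{bucket}")
    return terms


def expand_query(word):
    """All tagged variants of a query word, to be OR-ed together."""
    return [f"{word}_{b}" for b in range(1, MAX_BUCKET + 1)]


def dismax(scores, tie_breaker=0.1):
    """Best matching score plus a small contribution from the others."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)
```

| Giving the low-numbered variants a higher weight at query time is
| what would push matches near the top of a page up the ranking.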
| kews wrote:
| Do you know what proportion of the texty web instructs
| unknown crawlers to go away (or blocks them)?
| c0wb0yc0d3r wrote:
| How did you go about seeding your web crawler with URLs to
| crawl?
| marginalia_nu wrote:
| I just started with my website and did a crawl.
| Subsequently I've been seeding it with the best results
| from my previous crawls.
|
| It's a directed search so it doesn't seem to need a
| particularly solid seed to get decent results.
| c0wb0yc0d3r wrote:
| So how long did it take to get to 17 million URLs?
| dannyw wrote:
| Not OP, but if I was to do this, I'd start by downloading
| Wikipedia and all its external links and references, and
| crawling from there. You should eventually reach most of
| the publicly visible internet.
| c0wb0yc0d3r wrote:
| I feel a little embarrassed that I didn't think of
| something like that.
|
| When I did some crawler experimenting in my younger
| years, I thought I was pretty clever using sites that
| would let you perform random Google searches. I would
| just crawl all the pages from the results returned.
|
| Your method would undoubtedly be more interesting I
| think. It would certainly lead to interesting performance
| problems quicker, I bet.
| dannyw wrote:
| This is unbelievably impressive on a technical and ambition
| level for a solo, self-hosted hardware project. Kudos.
| jillesvangurp wrote:
| Cool, I've been thinking on this topic a bit lately. Crawling
| is indeed not that hard of a problem. Google could do it 23
| years ago. The web is a bit bigger now of course but it's not
| that bad. Those numbers are well within the range of a very
| modest search cluster (pick your favorite technology; it
| shouldn't be challenging for any of them). 10x or 1000x would
| not matter a lot for this. Although it would raise your cost
| a little.
|
| The hard problem is indeed separating the good stuff from the
| bad stuff; or rather labeling the stuff such that you can
| tell the difference at query time. Page rank was nice back in
| the day; until people figured out how to game things. And now
| we have bot farms filling the web with nonsense to drive
| political agendas, create memes, or to drown out criticism.
| Page rank is still a useful ranking signal; just not by
| itself.
|
| The one thing no search engine has yet figured out is
| reputability of sources. Content isn't anonymous mostly. It's
| produced and consumed by people. And those people have
| reputations. Bot content is bad because it comes from sources
| without a credible reputation. Reputations are built over
| time and people value having them. What if we could value
| people's appreciation relative to their reputability? That
| could filter out a lot of nonsense. A simple like button + a
| flag button combined with verified domain ownership (ssl
| certificates) could do the trick. You like a lot of content
| that other people disliked, your reputation goes down the
| drain. If you produce a lot of content that people like, your
| reputation goes up. If a lot of reputable people flag your
| content, your reputation tanks.
|
| The hard part is keeping the system fair and balanced. And
| reputability is of course a subjective notion and there is a
| danger of creating recommendation bubbles, politicizing
| certain topics, or even creating alternative reality type
| bubbles. It's basically what's happening. But it's mostly
| powered by search engines and social media that actually
| completely ignore reputability.
| silent_cal wrote:
| Wonderful work (':
| afrcnc wrote:
| except it doesn't actually return that many results
| winddude wrote:
| curious how do you afford the infrastructure? I found that the
| hardest part of running a search engine.
| marginalia_nu wrote:
| I'm self-hosting, and the server is a Ryzen 7 3900x with 128 Gb
| of non-ECC RAM. It sits in my living room next to a cheap UPS.
| I did snag one of the last remaining Optane 900Ps off Amazon,
| and it powers the index and the database--and I really do think
| this is among the best hardware choices for this use case. But
| beyond that it's really nothing special, hardware-wise. Like
| it's less than a month's salary.
|
| It runs Debian, and all the services run bare metal with zero
| containerization.
|
| Modern consumer hardware can be absurdly powerful if you let
| it.
|
| Like I have no doubt a thousand engineers could spend a hundred
| times as much time building a search engine that did pretty
| much the same thing mine does, it would require a full data
| center to stay running and be much slower. But that's just a
| cost of large scale software development I don't have to pay as
| a solo developer with no deadline, no planning and a shoestring
| budget.
| yewenjie wrote:
| Related question - suppose I want to create a meta search engine
| for myself, and I want it to be as fast as possible. What are the
| things I should be optimizing for?
| FractalHQ wrote:
| Ok this is great if all I want to do is read text, but often
| times that is very much not all I want to do. The web is much
| more than text and images these days. I can appreciate this as
| long as it's branded as a search engine for blogs and articles
| specifically, as opposed to being touted as a drop-in replacement
| for the modern search engine.
| lukas099 wrote:
| Is this a criticism? It doesn't at all seem touted as a drop-in
| replacement for the modern search engine.
| Phileosopher wrote:
| Wow, if this catches on, my original content will actually
| matter![1] I've always had a love-hate relationship with modern
| web design principles because my design choices have all the
| excitement and polish of what we get on HN.
|
| I'm sure I'm not the only one, either. Content-rich sites need
| more love.
|
| [1] https://adequate.life
| justinzollars wrote:
| It works. Nice job.
| NotAnOtter wrote:
| "Don't be afraid to scroll down in the search results, unlike in
| many other search engines, depending on what you are looking for,
| you may find the best results in the middle of the listing."
|
| This is a very polite way of saying "this engine isn't very good"
|
| Overall impressed with the project but I thought the word play
| there was funny
| marginalia_nu wrote:
| I felt I needed to add it to help people taught by other search
| engines that they only get 1-2 good results, and the rest is
| useless. The reason I'm providing a hundred results is that
| there are often a lot of results to choose from. If the point
| is to find something unexpected, and that indeed is the entire
| point, then that is the only sane design choice.
|
| Like you search for something on Google and similar, and you
| know what you are going to find. They are so good at searching
| the Internet and predicting what you are going to click on that
| you never see something new.
|
| It's a great feat of engineering, but a huge tragedy, because
| discovering new things, outside of what you or your
| demographic has previously demonstrated an interest in, can
| be absolutely life changing.
| jarbus wrote:
| This engine is fantastic for recipes
| exabrial wrote:
| I would like a search that punishes 'modern' SPAs that load 87mb
| of the author's pet JS projects to display simple text. Basically
| every modern SPA.
| rc_mob wrote:
| blessings upon you sir for making this
| sabujp wrote:
| effort is good, but needs some work, no results here :
|
| https://search.marginalia.nu/search?query=rxjava+2+api+docs
|
| https://www.google.com/search?q=rxjava+2+api+docs&oq=rxjava+...
| marginalia_nu wrote:
| It is very much a work in progress, still struggling with some
| areas. I only really got into the territory of "sometimes
| actually useful" like this weekend. Wasn't planning on blowing
| up on HN just yet.
| rafael_c wrote:
| I liked this one... I searched for 'George Harrison' and among
| the first results there was a page with interesting comments
| about Harrison's solo career; someone reminiscing about the time
| they got to talk to him about guitars for half an hour at a bar
| at the airport; a transcript for an interview he gave on TV...
| Whereas on GOOGLE: an intrusive 'People also ask' which I was
| not interested in; thumbnails for videos on youtube that I was not
| looking for; previews to garbage clickbaity news articles; and
| then finally for the search items: a bunch of websites for
| lyrics; his Instagram (!) and fb pages; his imdb page; some more
| news articles I was not looking for...
|
| Granted, google's web results above are perhaps what people are
| looking for 75% of the time, but how limiting and boring.
|
| I'm also a sucker for the simplistic text-centric, information-
| laden pages from the pre-facebook era.
|
| For 'global warming', however - since Marginalia excludes modern
| web-design pages - the results are of dubious relevance and
| interest, since they are, well, 'old'.
|
| I see myself using this engine a lot.
| leephillips wrote:
| This is wonderful and stupendous.
|
| I've often thought that Google could be turned back into a good
| search engine by simply eliminating the crap and letting the
| useful sites float to the top of the results.
|
| marginalia.nu seems to like my sites, so it must be good!
|
| Some results are prefixed with ! or an arrow dingbat. What does
| that mean?
| PieUser wrote:
| searching for covid gives a bunch of bogus crap of fake news
| isaacgreyed wrote:
| A common use case, how to do random thing in programming:
|
| I searched python make a bar chart and it returned a live coding
| video with an AI generated text transcript and two articles which
| mentioned a different kind of bar.
|
| I then narrowed it down to just python bar chart, and got a blog
| post about scripting with a bar chart in it, this
| http://www.nitcentral.com/voyager4/hellyear.htm with monty
| python, bars, and charts from 1996 and among some other things I
| found this
| https://python-course.eu/naive_bayes_classifier_introduction..., which had an
| example of a python bar chart even though the title of the page
| made me think it wasn't what I wanted.
|
| So for what I imagine to be a difficult search because of all the
| different meanings of the words, I found my result on the second
| query pretty quickly, and found some cool unrelated stuff too.
|
| I like mostly that I get what I type in, and not exactly what I
| want, but what I want is there too.
| CapmCrackaWaka wrote:
| I would probably use this if I wanted to find interesting blog
| posts/websites about a topic I want to learn more about in
| general. It seems less useful for returning exact answers to
| specific questions.
| jerhewet wrote:
| I use webcrawler.com, and IMO it's better than any other search
| engine for finding _exactly_ what I'm looking for. Not what's
| "trending", or "popular", or what the sheeple are searching for.
| It finds the _exact matching keywords_ that I'm looking for. No
| inference or other bullshit -- just the matches.
|
| Such a relief to not wade through oceans of worthless crap any
| more.
| api wrote:
| This is the most amazing thing I have seen on here in at least a
| year!
|
| It's... no... it can't be... a search engine that finds _actual
| information_ instead of 5 megabyte blobs of tracking code and SEO
| crap!
| optimalsolver wrote:
| I predict it will return a disproportionate amount of sites by
| schizophrenic conspiracists.
| slim wrote:
| This is a search engine indexing the internet on a mariadb
| database hosted on consumer hardware maintained by a single
| person as a hobby and it does not suffer from HN hug of death
| swyx wrote:
| how on earth do you index so much on consumer hardware? my
| frontend developer mind is blown.
| foxfluff wrote:
| Wait till you learn that modern CPUs run _billions_ of cycles
| per second. With multiple cores in parallel! And they can
| reach transfer rates of tens of gigabytes per second to RAM,
| or around a terabyte per second into L3.
| ThalesX wrote:
| And then you add a single HTTP request and everything tones
| down to the speed of the web. Or I/O. Or DB call.
| paxys wrote:
| Consumer hardware today is simply what was cutting-edge and
| crazy expensive 5 years ago.
| deadalus wrote:
| Very interesting because of the interesting results from random
| websites. It's a great discovery tool.
|
| Now hoping for a search engine that favors text-heavy sites and
| punishes paywalls
| the__alchemist wrote:
| I built one!
| mattchew wrote:
| Oh, I dream of a day where there are multiple useful search
| engines, specialized for different purposes.
|
| You're doing God's work here. Thanks and good luck.
| scns wrote:
| The Flying Spaghetti Monster wants to have a word with you.
|
| (edit) That is a nice dream though.
| brian_herman wrote:
| Kind of reminds me of the past like alta vista and dogpile.
| BoxOfRain wrote:
| I wonder if there's any mileage in an extension of something
| like uBlock Origin's lists of ad networks to block but instead
| it's a list of known content mills and SEO spam factories to
| remove from search results?
| high_byte wrote:
| I'd like a chrome extension that marks links that target text-
| heavy vs "modern" so I know beforehand what to expect - paywall,
| ads, popups, clickbaits, etc.
| kebos wrote:
| This is really cool, it filters out all fluff.
|
| It's not always taking me to totally relevant sites but the
| results contain my favourite type of content.
|
| Full of _writing_ and pure html - usually the hallmark of someone
| who knows what they are doing, wants to communicate but doesn't
| want to waste their time.
___________________________________________________________________
(page generated 2021-09-16 23:00 UTC)