[HN Gopher] Almost all searches on my independent search engine ...
___________________________________________________________________
Almost all searches on my independent search engine are now from
SEO spam bots
Author : m-i-l
Score : 604 points
Date : 2022-05-16 10:08 UTC (12 hours ago)
(HTM) web link (blog.searchmysite.net)
(TXT) w3m dump (blog.searchmysite.net)
| mywaifuismeta wrote:
| That's really interesting... and sad. For what it's worth, I've
| noticed comment bots dramatically increase over the last year
| too. They have always been there, but looking at Reddit, YouTube,
| etc, now there seem to be 10x more than there were a few years
| earlier. Even on HN it has gotten worse.
| cbozeman wrote:
| Is there a browser plug-in or some other piece of software that
| can filter, or highlight, which posts / comments are likely
| made by bots?
| BuyMyBitcoins wrote:
| On two occasions I've read one of my comments here on HN copied
| and posted on Reddit. The user profiles that copied my comments
| _seemed_ like they were run by a real person but the rest of
| their posts might have all been scraped as well.
|
| I only found out because I just so happened to be looking at
| the comments on a related news story and quickly realized the
| post sounded strangely familiar. I'm sure most of us here have
| had our comments copied without our knowledge.
| chairmanwow1 wrote:
| I created a temporary email service that was being used by about
| 10k users / week. Then several weeks ago, the number of users
| started growing like crazy up to about 60k users a day. Then we
| checked the recent email activity and 60k / 65k emails were from
| a social networking site.
|
| Seems our service was being used to create fake bot accounts. The
| newly created accounts were obvious fakes. Rather than deal with
| the issue, we just shut the service off.
| wibyweb wrote:
| In late April up to now, Wiby (a small mostly unheard of search
| engine) began having the exact same issue. Tens of thousands of
| the exact same type of "powered by..." requests coming from
| thousands of IPs. They are using a tool called QHub.
| m-i-l wrote:
| Thanks for wiby.me. I have seen QHub coming up in the scraping
| footprints, but my assumption has been that the footprint query
| is looking for Question and Answers sites powered by QHub
| containing their targeted terms, e.g. because there's a known
| vulnerability with QHub that their scripts can exploit to auto-
| post backlinks or whatever it is they do. There are lots of
| other hosting tools, other than QHub, that come up in the
| footprints as well. I found some lists of footprints by doing
| an internet search for one of them: "Designed by Mitre Design
| and SWOOP".
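|
| A minimal sketch of how such footprint queries could be separated
| from ordinary searches in a query log. The pattern list is an
| assumption built from the two phrases quoted in this thread
| ("powered by", "designed by"); a real list would come from the
| observed logs, and the function names are hypothetical.

```python
import re

# Assumed footprint phrases: only these two appear in the thread; any
# real deployment would extend the list from its own query logs.
FOOTPRINT_PATTERNS = [
    re.compile(r"\bpowered by\b", re.IGNORECASE),
    re.compile(r"\bdesigned by\b", re.IGNORECASE),
]


def looks_like_footprint(query: str) -> bool:
    """Return True if the query contains a known footprint phrase."""
    return any(p.search(query) for p in FOOTPRINT_PATTERNS)


def split_queries(queries):
    """Partition queries into (suspected_bot, probably_human) lists."""
    suspected, human = [], []
    for q in queries:
        (suspected if looks_like_footprint(q) else human).append(q)
    return suspected, human


if __name__ == "__main__":
    log = ['"Powered by QHub" mortgage rates', "rtlsdr antenna"]
    print(split_queries(log))
```

| Phrase matching like this will also flag the occasional real
| person who types "powered by", so it is better suited to cleaning
| up analytics than to hard blocking.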
| wibyweb wrote:
| Interesting, thanks for that extra info.
| mfrye0 wrote:
| I run a data aggregation company that has a fairly advanced
| scraping infrastructure for collecting data across the web.
| Having built the scraping side, I'm pretty familiar with most of
| the strategies for avoiding bot detection.
|
| Coming from that perspective, detecting and stopping at least the
| majority of bots out there is fairly doable, and I put together a
| rudimentary thing for a side project.
|
| The core of it uses an IP API for looking up the requesting IP to
| identify the country and if it's coming from a data center, VPN,
| Tor, etc. If it passes that, I trigger Google Captcha to show up.
| Lastly, I track IPs that make it through and have some basic
| rules in place to try to detect patterns and block offenders that
| way.
|
| There's a bunch more stuff you can check for, but the core of it
| is basically filtering out data center traffic to minimize the
| requests going to Google Captcha.
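|
| A rough sketch of that flow, under stated assumptions: the IP
| lookup is passed in as a plain function (standing in for whatever
| IP API is used), the captcha step is represented only by a
| "captcha" return value, and the pattern rule is a simple
| per-IP sliding-window request count. All names and thresholds
| here are hypothetical.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class IpInfo:
    country: str
    is_datacenter: bool
    is_vpn: bool
    is_tor: bool


class RequestFilter:
    def __init__(self, lookup, max_requests=30, window_seconds=60.0):
        self.lookup = lookup              # callable: ip -> IpInfo (assumed)
        self.max_requests = max_requests  # hypothetical threshold
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # ip -> recent request times
        self.blocked = set()

    def decide(self, ip, now=None):
        """Return "block" or "captcha" for a request from this IP."""
        now = time.monotonic() if now is None else now
        if ip in self.blocked:
            return "block"
        # Step 1: drop data center, VPN and Tor traffic before it
        # ever reaches the captcha.
        info = self.lookup(ip)
        if info.is_datacenter or info.is_vpn or info.is_tor:
            return "block"
        # Step 3: track IPs that get through and block heavy hitters.
        hits = self.history[ip]
        hits.append(now)
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_requests:
            self.blocked.add(ip)
            return "block"
        # Step 2: everything else is sent to the captcha.
        return "captcha"
```

| Doing the cheap IP check first is what keeps the number of
| captcha calls down; the rate rule only has to catch what slips
| past it.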
| buzzwords wrote:
| I have had very interesting conversations with people who are
| "casual" users of internet. They are still finding the results of
| the likes of Google, bing and duckduckgo perfectly suitable.
| Maybe it's most of us here who have different needs to what's
| available.
| bachmeier wrote:
| I suppose it depends what they're looking for. If you're a
| homeowner looking for a service of some kind...good luck. There
| are domains that aren't too bad, like programming, but you
| should go into a search with low expectations. Anyone that
| remembers the early days of Google will find today's search
| engines to be useless in comparison.
| not2b wrote:
| The conclusion isn't that there's nobody out there, but that the
| billion-odd people who use search engines every day have no idea
| what searchmysite.net is. They use Google, often without even
| knowing it because they just type some words into their browser
| and take what they get.
| Auguste wrote:
| I'm disappointed that Search My Site isn't seeing many legitimate
| viewers.
|
| Just wanted you to know that I'm a fan. I love reading people's
| personal websites, and Search My Site has been great for
| discoverability. I visit the Newest Pages and Browse Sites pages
| once or twice a week to check out the new sites being indexed.
|
| I don't know what the answer is to the spam bots, but you do have
| some real visitors out there. :)
| closedloop129 wrote:
| >I noted that there had been multiple weeks where not one single
| real person had visited a single blog entry for the whole week
|
| The site is not on https://searchengine.party/ nor on
| seirdy.one's overview. Apart from the blog, how could users find
| that engine?
|
| Is there some place where new search engines are announced and
| where new search engines band together to make themselves heard?
| m-i-l wrote:
| Actually seirdy.one added searchmysite.net to his excellent
| list[0] way back in March 2021[1].
|
| [0] https://seirdy.one/2021/03/10/search-engines-with-own-
| indexe...
|
| [1]
| https://git.sr.ht/~seirdy/seirdy.one/commit/ab92d8ded69fd869...
| arunsivadasan wrote:
| Thank you for building something like this!
| xwdv wrote:
| Will we ever see the return of hand curated directories of
| websites like the old days, categorized by topics and approved by
| human review?
| closedloop129 wrote:
| Coincidentally, such a site was submitted yesterday:
| https://news.ycombinator.com/item?id=31387592
| saalweachter wrote:
| Wikipedia, maybe?
|
| The greater problem of curation is that it doesn't scale, and
| you need immense human effort to survey and curate both the
| breadth of questions -- what's a good table saw? what aspects
| of Egyptian culture were exported back to Greece? is HDPE
| plastic safe? give me some punk music. -- and also the breadth
| of answers, both every website and every type of table saw.
|
| The lesser is that you cannot curate without introducing a
| _voice_, a set of preferences that may not be universal.
| Tastes are not universal, you can't recommend the same band for
| everyone. Resources are not universal, regardless of whether
| the $10000 table saw is more than 100x better than the $100
| table saw, it's just out of reach of most people. And needs
| aren't universal -- a professional cabinet maker and a DIYer
| making a chicken coop don't need the same saw.
|
| There's a set of priors behind every query, and you either need
| to get users to frame their queries in a way that captures all
| of the relevant priors, or you need to create a variety of
| voices that capture different sets of priors and curate answers
| appropriate to that voice. Are you asking Norm Abrams, Monica
| Mangin, or Shane Wighton for a recommendation on a table saw?
| xwdv wrote:
| Perhaps there can be a difference between search engines for
| answering specific questions, and directories where one may
| browse a broad range of topics without any goal in
| particular.
| westcort wrote:
| My key takeaways:
|
| 1. Almost all searches on my independent search engine are now
| from SEO spam bots
|
| 2. In summary, if they break through the current reverse proxy
| level protection, options include an invisible ReCAPTCHA (but
| given I've sometimes 160,000 requests a day I'd be well over the
| 1,000,000 a month free tier limit), requiring JavaScript as per
| the web analytics or some Cross Site Request Forgery style
| protection (but those would place much more load on the servers),
| or CloudFlare (but the searchmysite.net spider is still currently
| blocked by CloudFlare as per Some of the challenges of building
| an internet search)
|
| 3. If you were into conspiracy theories you could claim that the
| major search engines were trying to stifle the competition, but a
| more realistic explanation is simply that searchmysite.net is
| being drowned out by SEO spam
|
| 4. If I'd had a decent amount of real users visiting and never
| returning I could reasonably conclude that updating the blog
| wasn't the most productive use of my time and effort, but without
| any real users in the first place it is hard to gauge whether
| people like it or not
|
| My own independent search engine, https://www.locserendipity.com,
| is seeing similar trends.
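|
| For the ReCAPTCHA figures in point 2, a quick worked calculation
| shows how far peak traffic overshoots the free tier. The 30-day
| month is an assumption; the other two numbers are the ones quoted
| above.

```python
PEAK_REQUESTS_PER_DAY = 160_000   # "sometimes 160,000 requests a day"
FREE_TIER_PER_MONTH = 1_000_000   # quoted free tier limit
DAYS_PER_MONTH = 30               # assumption for the estimate

# Monthly volume if every day were a peak day.
monthly_at_peak = PEAK_REQUESTS_PER_DAY * DAYS_PER_MONTH

# Days of peak traffic before the free tier is used up.
days_until_exhausted = FREE_TIER_PER_MONTH / PEAK_REQUESTS_PER_DAY

print(monthly_at_peak)       # 4800000
print(days_until_exhausted)  # 6.25
```

| So sustained peak traffic would be nearly five times the free
| allowance, and the allowance would be gone in under a week.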
| superasn wrote:
| To me it looks like some popular spamming software (like
| thebestspinner, etc) just integrated you and now everyone who is
| using the software is now hitting your site.
|
| The good news in this case is that it'll be easy to spot the
| pattern and block it, the bad news is you're entering a never-
| ending cat and mouse game.
| larsrc wrote:
| Google puts a _lot_ of effort into avoiding SEO spam, but it's a
| red queen problem.
| melenaboija wrote:
| I know very little about search engines, and after seeing all the
| comments and projects popping up lately criticizing Google results,
| I have a question: is it realistic to think that something similar
| to Google could exist?
|
| It seems to me that all sorts of tools are out there for this, such
| as public NLP implementations, vector search engines, ... so I
| wonder which it is: not everything that is needed is truly
| available, it is a matter of the resources needed to get something
| working, or the products already exist and are just not getting
| traction (and I am not talking about the other big search engines).
| phkahler wrote:
| >> This time I'm really not sure what the solution is.
|
| As with everything internet, the solution is to have solid,
| verifiable user identification. I realize the downside is that
| sites would love to have all your activity logged under a
| verifiable identity, so the other problem is we need to ban
| collection of such personal data.
| jrochkind1 wrote:
| I'm not sure I understand the theory of what motivates the
| automated "powered by" searches; can anyone explain it (or an
| alternate theory) further?
| Teandw wrote:
| This guy throws multiple reasons/conspiracies out there on why
| the website is really struggling to gain literally any sort of
| traction. Web is all bots, search engines not promoting
| competitors and being drowned out by SEO spam, yet he's failing
| to see the most obvious reason... the reason nearly all websites
| don't gain traction...
|
| Because it's a bad website. It provides no value to the user. I
| put in a few search terms and had no relevant search results
| back. What use is a search engine that can't find what I'm
| searching for?
|
| Maybe if that was improved he may see traction.
| CWuestefeld wrote:
| Whether or not his site is meeting his goals is his business.
|
| I find this a really interesting post, because I'm also dealing
| with excessive bot traffic (it's generally about half of my
| overall), and specifically how to salvage analytics data when
| there's so much noise. Seeing what other people are doing to
| combat it helps me, regardless of whether you might think of
| them as successful or not.
| lukev wrote:
| I second this. Don't get me wrong, I applaud the concept and
| the effort, but this implementation isn't quite there.
|
| I searched for "document management system comparison" since I
| am currently in the process of selecting one for our legal team
| at work. Some on-the-ground reports from real users would be
| hugely valuable. But this is the classic example of where
| Google utterly fails; document management is a 100 billion
| industry and there are absolutely no search results which are
| not SEO, marketing copy, or astroturfed listicles with nearly
| zero value.
|
| Unfortunately, this website returned even less relevant
| results. Not a single result pertained to document management
| at all; instead it returned random matches on words like
| "system" and "management."
|
| Whoever solves this problem could definitely unseat Google as
| the go-to search engine for most people. So it's a big prize.
| But it's also a super hard socio-technical problem, requiring
| incredibly sophisticated and powerful tech in a highly
| adversarial environment. However, regrettably, it looks like
| this attempt hasn't even got the basic search tech down.
| marginalia_nu wrote:
| Is a comparison of document management systems something you
| expect to actually find, as something written by humans? I
| wouldn't write such an article, I don't know who would.
|
| The only people who seem to be writing these types of
| comparison articles are spammers.
|
| I typed this reply without checking, but I checked now, and
| yeah -- if you google "document management system
| comparison", you get ads for document management systems, and
| search engine spam. That's hardly helpful.
| oneeyedpigeon wrote:
| 2nd result I got from that exact search is an article from
| techradar:
|
| https://www.techradar.com/uk/best/best-document-
| management-s...
|
| Do you consider that search engine spam?
| marginalia_nu wrote:
| Yeah, that's affiliate marketing dressed up as a review.
| They're getting a kickback for several of the links in
| the review.
|
| The deal on DocuWare is perhaps the most obvious, but the
| Abbyy-link also runs through an affiliate marketing
| redirect service.
| freediver wrote:
| Typed this search into Kagi and got:
|
| - This result from an old site https://www.scanstore.com/Sca
| nning_Software/Document_Managem... not sure if still relevant
|
| - A bunch of discussions from reddit and other forums
| (probably best lead)
|
| - One research paper https://arxiv.org/pdf/1403.3131.pdf
|
| - Listicles grouped together so you can skip them
|
| - The noncommercial filter gave a few more good results, but
| it seems like there is not much 'good' content written on
| this topic
|
| I would definitely not call all Kagi results fantastic, but
| it does seem to be better than Google. We are trying hard to
| solve the problem of the nonsense on the web (Kagi founder
| here).
| alx__ wrote:
| Thanks for building Kagi! Have been enjoying the experience
| of it this past month
| kldx wrote:
| Got any beta slots to share?
| status200 wrote:
| I searched "best dress shoes reddit" as a test, and just got a
| random list of websites that had the word "shoes" on the page
| somewhere, including a Dinosaur Comic from 2008.
|
| So... yeah. Won't exactly be my first choice of search engine
| in the future.
| matt_heimer wrote:
| Looking at the blog
| (https://blog.searchmysite.net/posts/milestone-1000th-site-
| in...) I think very little of the internet is in this search
| engine.
|
| It's difficult to gauge the quality of the engine itself at
| this point with so little content in it.
|
| What I can say is that even remotely presenting the system as
| a general purpose internet search engine like the UI from
| https://searchmysite.net/ does is going to give people the
| wrong idea and make them think the system is bad. To start
| with I'd suggest adding the number of sites indexed to the
| main search page.
|
| I also think that the https://searchmysite.net/ portal will
| likely never be a destination. I'd suggest trying to promote
| it differently, offer a service for OG internet
| sites, they opt-in to the service because they want a search
| widget they can embed on their site that has a filter to search
| just that site or all OG sites. Having website categories
| would also help so people could search across tech blogs, or
| aquarium, or bowling sites, etc. Basically the old web ring
| idea but powered by search instead of just browsing a list.
|
| Since there is a chicken and egg scenario - What you really
| need are people that think Google sucks that are invested in
| a niche and want to build a search ring out. The "only sites
| submitted by verified site owners" restriction needs to go,
| you want good curation but this is just too restrictive. I
| also think "downranks results containing adverts" is too
| restrictive, switch that to "downranks results containing
| excessive adverts and SEO spam".
| _tom_ wrote:
| It doesn't index sites like Reddit, so, not too surprising
| Reddit wasn't in the result.
| honkdaddy wrote:
| Searching for Astral Codex Ten, a popular, well-written, non-
| spammy blog which I would expect is indexed...
|
| Returns only results in which _other_ bloggers are referencing
| ACX. Consider me as one of the datapoints that arrived from HN
| and likely won't be back, I'm afraid.
| m-i-l wrote:
| Thanks for your feedback. The idea was for people to submit
| sites they like, and search sites other people have liked.
| I've submitted Astral Codex Ten, and that site is now indexed
| for the benefit of others.
| wccrawford wrote:
| I just searched Kagi, Google, and DDG for "Astral Codex Ten"
| and it was the first result on each.
| weird-eye-issue wrote:
| Ironically the Kagi search engine is not in the first few
| results in Google when you search Kagi (at least in
| Thailand)
|
| And when I did make it to the site, it looks like I have to
| sign up to use it? I'm not sure putting a locked gate in
| front of a search engine in 2022 makes sense but okay
| norman784 wrote:
| The whole concept of kagi is to be a paid service (is
| still in beta and for now it's free AFAIK), so you pay
| money instead of having ads or the search engine selling
| your data, use the service that suits best to your
| purposes and philosophy.
| ipaddr wrote:
| The concept in 2022 sounds doomed to fail on many fronts.
| A service that claims to offer privacy but requires
| identifying payment information. A required email signup
| so followup sales emails can happen when the service is
| ready.
|
| Ddg was popular on here until they censored certain
| websites. Does this search service censor?
|
| Sounds like they are trying to tackle privacy but in
| reality users of this service will have less privacy.
| m-i-l wrote:
| Hi, "this guy" here:-) If people come to a site but don't come
| back then it is reasonable to conclude that "it's a bad
| website", but as the blog entry put it "without any real users
| in the first place it is hard to gauge whether people like it
| or not".
|
| Note also that it isn't intended to be a general purpose search
| engine, but a niche search engine to try and find some of the
| fun and interesting content, e.g. relating to hobbies and
| interests, which used to be at the core of the web but which
| can be difficult to find anywhere nowadays.
| soheil wrote:
| How exactly is a "general purpose search engine" different
| than a "search engine to try and find some of the fun and
| interesting content"?
| m-i-l wrote:
| The general purpose search engines search the whole
| internet, and as a result claim that you can search for
| anything on the whole internet, even going beyond that to
| answer questions which aren't on the internet as such, e.g.
| "What is my IP?" and "What time is it?". However, niche
| search engines only search specific parts of the internet,
| and only claim to be able to deliver results relating to
| their specific topic, e.g. you wouldn't ask the search on a
| car forum what the weather is today.
| soheil wrote:
| Ok, but answering questions like "what time is it?"
| doesn't subtract from the usefulness of a search engine.
| Seems like you're saying it makes your search engine
| better somehow because it can't do the above.
| dumbfounder wrote:
| I am a search guy and I would like you to succeed. But I
| don't get it. The name of the site is bland and makes me
| think you are a white label search service for websites.
| On the homepage it says "Open source search engine and
| search as a service for personal and independent
| websites." but it offers me no reason why I (or
| anyone) would want to use it. The content it actually
| searches is random and of no real particular value as far
| as I can tell. Also, you are trying to avoid spam sites,
| but once you reach a certain size all you would
| see is people submitting spam sites. If you blocked
| people from submitting you would never get all the
| diamonds in the rough you are trying to expose.
|
| You need to find an actual niche that solves a real
| problem people have and can understand and orient
| everything you do to tackling that. Then expand from
| there.
| haswell wrote:
| > _general purpose search engines search the whole
| internet, and as a result claim that you can search for
| anything on the whole internet, even going beyond that to
| answer questions which aren't on the internet as such,
| e.g. "What is my IP?"_
|
| I think there are two distinct things here:
|
| 1) Searching the whole internet
|
| 2) Returning results that aren't necessarily from the
| Internet, but instead are convenience features of the
| engine
|
| I understand that you're not trying to replicate things
| like "What's the weather today", but when I want results
| about <very specific classic car X>, how can you return
| meaningful results without searching the whole Internet?
|
| Put another way, if you don't search the whole Internet,
| the results are going to be limited to only the curated
| list of sources you do search. This can be useful in its
| own way - i.e. if you are positioning this as "search
| this list of curated sources", but also means the site
| will only be as useful as the curation you provide.
|
| For example, I dabble with Software Defined Radio. If I
| search your site for "rtlsdr", a very popular package, I
| get three results. Those results are somewhat
| interesting, but I know there's a whole world of content
| out there related to rtlsdr that I'm not seeing here.
|
| So adding a bit to what the parent commenter was saying -
| if I'm using your site to look for my particular niche,
| and I only see three results when I know there are many
| more, I'm not likely to continue using your site to
| search for rtlsdr.
|
| It then leads me to wonder what I _can_ search for, or if
| there's much utility to searching at all.
|
| Please take these comments in the spirit they are
| intended - I think a search engine that helps find things
| on the "old" web, or just helps me cut through all of the
| SEO optimized crap is a great idea. It's something I want
| to use. But I can also understand why someone might try a
| search and move on.
|
| Just an idea, but maybe providing a way for independent
| creators to submit their site for indexing (or for an
| interested user like me to submit a site) would help
| increase your reach.
| _tom_ wrote:
| Google is demonstrating this nicely now. It's become almost
| useless, replacing the query I actually typed with
| something more popular. And when that doesn't happen, the
| results are likely seo'd junk. (The latter is not purely
| Google's fault, it's just that smaller search engines aren't
| targeted as much).
|
| Try looking up a phone number (by number) in google for a
| great example of nothing but spam results.
| native_samples wrote:
| Well, it's worse than that. The whole schtick is that it's only
| pure, real content by folksy people like us. The top reason to
| use it on the about page is:
|
| _Indexes only user-submitted sites with a moderation layer on
| top, for a community-based approach to content curation, rather
| than indexing the entire internet with all of its spam, "search
| engine optimisation" and "click-bait" content._
|
| So I tried searching [kotlin] and got 123 results ...
|
| https://searchmysite.net/search/?q=kotlin
|
| ... of which the 9th result is SEO spam! It reads:
|
| PersonalSit.es | Yes we got hot and fresh sites
| https://personalsit.es/ ...
| Shandilyahttps://msfjarvis.devTagsandroid, kotlin, rust Go to
| feed Go to siteradoslawkoziel.plradoslawkoziel.pl ...
|
| That looks like junk to me. How is that possible if what the
| developer says is true, that it's all verified and pre-
| moderated?
| m-i-l wrote:
| Thanks for your feedback. It is just the home page which is
| moderated before indexing (and reviewed annually). When
| https://personalsit.es/ was listed it looked legitimate, but
| agreed the results for that site look infected with spam now.
| I've found at least one other site today where the home page
| and blog look genuinely legitimate, but which has a complete
| spam subdomain, quite possibly the victim of a subdomain
| takeover attack by spammers. I've delisted both.
| Unfortunately it isn't an easy task trying to defeat a vast
| army of well funded spammers in your spare time!
| stevenicr wrote:
| As someone that has a few sites that can get user generated
| content - I must say that it saddens me that spam stuffing
| would get the main domain and site delisted - and likely
| never re-listed.
|
| A couple times a year I get hit with a bunch of spam blogs
| / user profiles and when I discover and clean them up, I
| assume that at least google/bing see that the spam-to-real
| ratio has been fixed and rank it higher again.. but I'm not
| sure really, especially since google took keywords out of
| click traffic.
|
| What would be nice is something like the 'site has been
| hacked page' that I've unfortunately seen a few times for
| sites - that lets you clean it up and submit a re-check
| it's clean now button thing.
|
| I've also suggested that google make it so you have to
| vouch for links which would expose people using the spam
| stuffing techniques.. kind of the opposite of the disavow
| tool - but they never read any of my disavow submissions.
|
| Sucks to get spammed, fight spam, and then be penalized for
| it in more ways than one.
|
| One of my older buddypress/wpmu sites I recently turned off
| blog creation for users because it's just so tiring
| fighting the spammers - which are only doing what they do
| because google - meh.
| salawat wrote:
| Your problem is that SEO are under no obligation to be
| truthful with you, and will likely pull bait and switches
| as far as making accounts if it ever seems like your site
| will catch on.
|
| Note, I nearly spit my food the first time I was at lunch
| and someone was talking about SEO a few tables away...oh a
| decade or so ago now. It's sad it's gotten this bad.
| pwiercinski wrote:
| I guess the use-case just isn't that popular. It's a good
| website if you want to learn what some devs are up to, but
| barely anyone cares about that. Most people use search engines
| to find answers to their questions and Search My Site just
| doesn't work like that.
| fortran77 wrote:
| I found a few pro-terrorism sites here. I don't think it's the
| OP's purpose, but he's being duped by the few users that do look
| for sites like this where they can add a "curated link" to
| their ISIS or Hezbollah or Hamas site with a slick facade.
| m-i-l wrote:
| Thanks for your feedback. If you can drop me a note I'll
| remove those sites - it is against the Terms of Use at
| https://searchmysite.net/pages/terms/ (not that spammers,
| terrorists, etc. care about complying with a Terms of Use). I
| think legitimate looking home pages as a front to other non-
| legitimate content is a genuine problem this model doesn't
| solve (also noting that some of those home pages may even be
| genuinely legitimate but have been hacked e.g. via a
| subdomain takeover).
| [deleted]
| Jleagle wrote:
| I'm getting lots of `No results found for query = xxx.`
| rightbyte wrote:
| That sounds like a feature actually, being honest about no
| hits.
| XCSme wrote:
| If the internet is dead, is there anything left that's "alive"?
| The mobile app stores are also filled with crap[0] and it seems
| that the ratio of spam content vs real content is getting close
| to infinity.
|
| [0]: https://youtu.be/E8Lhqri8tZk - 1,500 Slot Machines Walk into
| a Bar: Adventures in Quantity Over Quality
| john-radio wrote:
| Since everyone in this thread wants to jump down OP's throat
| about the quality of his web site, another interesting search
| engine is millionshort.com, which allows you to filter out the
| top N web sites from the results of your search. It's a great
| tool for looking past sites with good SEO; all you have to do is
| fiddle with the value of N.
|
| For example, searching for "electronic music box" as /u/ajnin
| suggested, with the top 100K web sites removed from the results,
| filters out the following:
|
| > These 23 sites were removed from your results:
|
| > alibaba.com (1 result removed)
|
| > aliexpress.com (1 result removed)
|
| > allaboutcircuits.com (1 result removed)
|
| > amazon.com (2 result removed)
|
| > apple.com (1 result removed)
|
| > bestreviews.com (1 result removed)
|
| > ebay.com (1 result removed)
|
| > etsy.com (2 result removed)
|
| > facebook.com (1 result removed)
|
| > instructables.com (2 result removed)
|
| > lightinthebox.com (2 result removed)
|
| > lumberjocks.com (1 result removed)
|
| > mapquest.com (1 result removed)
|
| > reverb.com (1 result removed)
|
| > twitter.com (1 result removed)
|
| > wikipedia.org (1 result removed)
|
| > yelp.com (1 result removed)
|
| > youtube.com (2 result removed)
|
| And the top result ends up being https://midiguy.com/.
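|
| The filtering idea can be sketched in a few lines. This is not
| Million Short's actual implementation, just an illustration that
| assumes a popularity ranking is available as a plain dict from
| domain to rank (1 = most popular); the function names and the
| ranks below are made up.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the host name with a leading "www." stripped.

    Simplification: other subdomains are not folded into their
    parent domain.
    """
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def filter_top_sites(results, popularity_rank, n):
    """Drop results whose domain ranks within the top n.

    Returns (kept_urls, removed_counts_by_domain).
    """
    kept, removed = [], {}
    for url in results:
        domain = domain_of(url)
        rank = popularity_rank.get(domain)  # None if the site is unranked
        if rank is not None and rank <= n:
            removed[domain] = removed.get(domain, 0) + 1
        else:
            kept.append(url)
    return kept, removed


if __name__ == "__main__":
    ranks = {"amazon.com": 5, "wikipedia.org": 10}  # made-up ranks
    results = [
        "https://www.amazon.com/a",
        "https://midiguy.com/",
        "https://www.amazon.com/b",
    ]
    print(filter_top_sites(results, ranks, n=100_000))
```

| Unranked domains are kept, which is what lets small sites like
| the one above surface once the big ones are removed.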
| mdoms wrote:
| Million Short also has an option to remove only e-commerce
| results which is invaluable if you still want results from
| sites like Twitter, Wikipedia and YouTube but don't want online
| shopping spam.
| consp wrote:
| Would this also work for the fake-sites-stealing-text-to-
| look-legit sites since they quickly end up in the top
| results?
| blisterpeanuts wrote:
| That's an outstanding concept. One problem though: wouldn't it
| also filter out high quality curated results?
| trinovantes wrote:
| If this was the spam for a search engine (almost) nobody uses, it
| makes you wonder how much abuse the major search engines face
| Nextgrid wrote:
| My understanding is that this wasn't about gaming this
| particular search engine itself, and more about the spammers
| using the search engine for its intended purpose of finding
| spam-free content so they can then use this content as copy for
| their spam posts.
| sonicggg wrote:
| I'd assume they have more control though. I noticed whenever I
| use Google after connecting to NordVPN, it requires a captcha
| the first time.
| mensetmanusman wrote:
| They face a lot. I always browse with incognito on safari, and
| I quite often have to do captchas on google and bing etc. to
| prove I'm not a computer...
|
| If there is money involved and value in being able to trick
| search engines, I'm not surprised it's a thriving business of
| grift.
| hihihihi1234 wrote:
| Why do you use Bing?
| the_third_wave wrote:
| Why don't you like diversity, in this case diversity of
| search engines? Bing may have its problems but so does
| Google, the way to handle this is to either use many
| different engines or to use a meta-search engine like
| Searx. The latter is far easier so it is what I do. Just
| relying on a single source makes you an easy target for
| those who control that source.
| maven29 wrote:
| You should try Bing again. Bing doesn't mess with your
| query terms as much as google does. If you aren't a zoomer
| typing out whole sentences into the search bar, the fact
| that Bing doesn't substitute your jargon for more general
| terms will help with spending less time in the search
| results.
|
| I just got tired of iterative refining not working as it
| used to in the past. I once got results for databases when
| searching for decibels (despite spelling it out in full),
| so it isn't just a matter of semantically related terms.
|
| The rewriting is just braindead and the ranking algorithm
| falls for generated content way too easily. Google
| shouldn't be trying to teach me DHCP when I am clearly
| trying to recall a config item, but then it gets worse when
| you read the infobox and realize that it's written at a
| toddler level of comprehension.
|
| This is with the caveat that all search engines rely on
| some level of personalization, so you might be able to get
| good results on google if they deem you worthy.
| ricardo81 wrote:
| Indeed. There are various SEO "rank tracking" services that
| scrape millions of SERPs a month.
| thelittleone wrote:
| Complete SEO noob here. Can someone help explain what these bots
| are trying to achieve? There is mention in the blog that they're
| trying to uncover ad free content.
| DethNinja wrote:
| Only solution is a webring based federated search engine.
|
| 1. You just put /webring.txt to your website. It shows links to
| other websites with a hard limit of 100 websites.
|
| 2. To combat spam and bots, search engine does accept blocklist
| as an input. So other people can curate the content.
|
| 3. People can personally rank the websites they like, so webring
| of the said website gets ranked higher for that specific user.
| This can be a community effort too.
|
| 4. Search engine itself should be under a commercial license so
| that other people can keep building it and add ads if they want
| to commercialise it.
|
| I'm too busy to spend time with this but perhaps one day I can
| start coding it.
|
| I'm convinced that search engine model of early internet is just
| dead, webrings are the way forward.
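|
| A minimal sketch of points 1 to 3, not a real implementation.
| The file format is an assumption (one absolute URL per line,
| '#' starts a comment), and parse_webring and rank are
| hypothetical names; the comment only specifies the 100-site
| limit, the blocklist input and per-user ranking.

```python
from urllib.parse import urlparse

MAX_LINKS = 100  # hard limit from point 1 of the comment


def parse_webring(text, blocklist=frozenset()):
    """Return up to MAX_LINKS unique hosts listed in a webring.txt body.

    Assumed format (hypothetical): one absolute URL per line, '#' starts
    a comment. Hosts in `blocklist` are dropped, as in point 2.
    """
    hosts = []
    seen = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        host = urlparse(line).hostname  # None when the line has no scheme
        if host is None or host in seen or host in blocklist:
            continue
        seen.add(host)
        hosts.append(host)
        if len(hosts) == MAX_LINKS:
            break
    return hosts


def rank(hosts, user_scores):
    """Order hosts by a user's personal score (point 3); unknown hosts score 0."""
    return sorted(hosts, key=lambda h: -user_scores.get(h, 0))
```

| The sort is stable, so sites a user has not scored keep the
| order in which the webring listed them.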
| mcv wrote:
| Nice idea, but of course if it gets even slightly popular,
| every SEO content farm will immediately generate 10,000 sites
| that all list each other in their webring.txt.
| TheRealDunkirk wrote:
| If there's a game to play, people will write software to play it
| for their profit.
|
| I guess it's back to web rings.
| robmay wrote:
| While I'm generally a blockchain skeptic, this is actually a good
| use for a blockchain - to "register" bots so they have an id, and
| an owner, and you can measure their behavior. There are going to
| be more bots interacting with more sites, so, this could work.
| PaulHoule wrote:
| Spammers badly need spam-free content so they can mix some
| legitimate links with the junk they spew.
|
| One great Black Hat SEO trick is to find where your competitors
| are getting clean links and insert your own links there so they
| do your spamming for you.
| closedloop129 wrote:
| Why does the mixing work? Shouldn't Google and Bing know what
| the original content is and automatically identify the sites
| that are copies?
| PaulHoule wrote:
| Here's an example.
|
| If I have (say) 15 affiliate marketing sites, I might make a
| link aggregator site that looks a bit like Hacker News.
| Except I won't make just one, I might make 30 of them.
|
| These might subscribe to a bunch of RSS feeds and randomly
| select articles, maybe 10% of the links on those sites go to
| my affiliate sites.
|
| If you can inject spam into those RSS feeds that system I
| describe would amplify it and this could have effects ranging
| from: you are using my marketing machine to promote your
| content to my sites getting really obnoxious and getting
| blocked.
|
| ----
|
| "Duplicate Detection" is a necessary technology for web
| search because sheepeople copy themselves and other people
| without bound. It cuts both ways because Google and Bing have
| no sure way to know which one is the copy and which is the
| original. So (1) they aren't completely efficient at removing
| duplicates and (2) duplicate detection can be turned into a
| weapon against you, just like that link aggregator.
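|
| A rough sketch of one common way to do near-duplicate
| detection: word shingles compared with Jaccard similarity.
| This is an assumed technique for illustration; nothing here
| says what Google or Bing actually run. Note the score is
| symmetric, which is the point above: it tells you two pages
| match, not which one is the original.

```python
def shingles(text, k=3):
    """Set of k-word shingles (overlapping word windows) from text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|, 0.0 if both empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(docs, threshold=0.8):
    """Return pairs of document ids whose shingle sets overlap >= threshold."""
    sets = {doc_id: shingles(body) for doc_id, body in docs.items()}
    ids = sorted(sets)
    return [
        (x, y)
        for i, x in enumerate(ids)
        for y in ids[i + 1:]
        if jaccard(sets[x], sets[y]) >= threshold
    ]
```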
| randomstring wrote:
| Search traffic has always been mostly automated spam bots.
|
| Even back in the Open Directory Days when we powered part of
| search.netscape.com I estimated 80+% of all search traffic was
| automated. At least most of it self-identified with the same Java
| useragent.
|
| Later when working at Topix, despite being a news search engine,
| most traffic was bot traffic. Most included the word "mortgage"
| in the query. Topix specialized in localized content, and that
| was very popular for SEO scrapers.
|
| Lastly at Blekko, I estimate 90+% of traffic was automated. By
| then maybe half or more learned to change the user agent. Most
| used HTTP/1.0, a dead giveaway as no browser still uses 1.0. This
| was a major aspect in Blekko's load shedding strategy. If the
| servers started to get overloaded, we'd start bouncing suspected
| bot traffic to a redirect that would show in the logs. If there
| was a human with a modern browser running javascript on the other
| end, they would get redirected to a link that wouldn't get bounced. I
| would check the logs weekly to see if any humans got caught. None
| ever did. This was a huge monetary savings, you only need 1/10th
| the servers if you can safely ignore the bots.
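|
| The load shedding described above could look roughly like
| this. It is a sketch under assumptions, not Blekko's code:
| the Request fields and the route function are hypothetical,
| and the only signals used are the ones mentioned (HTTP/1.0
| and whether the client ran the javascript redirect).

```python
from dataclasses import dataclass


@dataclass
class Request:
    http_version: str      # e.g. "HTTP/1.0" or "HTTP/1.1"
    user_agent: str        # carried for logging; not used as a signal here
    passed_js_check: bool  # True once the client followed the JS redirect


def is_suspected_bot(req):
    """Suspect HTTP/1.0 clients unless they already proved they run JS."""
    if req.passed_js_check:
        return False
    return req.http_version == "HTTP/1.0"


def route(req, overloaded):
    """Return 'serve' or 'bounce'. Only bounce suspects, and only under load."""
    if overloaded and is_suspected_bot(req):
        return "bounce"  # redirect to a page whose JS sends humans back
    return "serve"
```

| Logging every "bounce" is what makes the weekly check for
| wrongly caught humans possible.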
|
| Often it's endless repetition of the same keywords in a random
| order with a place name appended, or prepended, or inserted, over
| and over. Often variations on known monetizable SEO keywords.
| However, much of it doesn't make any sense.
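|
| One way to spot that pattern in a query log, as a sketch:
| reduce each query to an order-independent signature with
| place names removed, then count repeats. The function names
| and the place-name set are assumptions, and multi-word place
| names are not handled.

```python
from collections import Counter


def query_signature(query, place_names=frozenset()):
    """Order-independent key: sorted lowercase tokens, minus known place names."""
    tokens = [t for t in query.lower().split() if t not in place_names]
    return tuple(sorted(tokens))


def repeated_signatures(queries, place_names=frozenset(), min_count=3):
    """Signatures seen at least min_count times, with their counts."""
    counts = Counter(query_signature(q, place_names) for q in queries)
    return {sig: n for sig, n in counts.items() if n >= min_count}
```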
|
| I don't have any insight into Google's numbers but I would
| conservatively estimate 95% or more of all their queries are
| automated bots and not humans. And the level of spy-vs-spy going
| on for Google CPU resources vs SEO bots is probably pretty
| evolved by now. I stopped tracking many years ago when Google
| switched to densely packed obfuscated javascript for page
| renders. Maybe this is part of why automated queries are so high
| across the web, maybe google is too hard to crack for most.
| superjan wrote:
| Almost sounds like it is justified to add a javascript crypto
| miner to your pages to make the bots pay for the use of your
| service.
| randomstring wrote:
| The point is that the vast majority of scrapers do not bother
| to run javascript.
| [deleted]
| stevenicr wrote:
| appreciate the sharing of info here.
|
| I have recently been discovering and combating some similar,
| albeit much smaller issues.
|
| I've been finding that a bunch of my recent 'resource sucks'
| have been constant spidering from petal-bot, semrush bot,
| alibiba-bot and a few others.
|
| Using the wordpress plugin stop-bad-bots and its logs has been
| eye-opening for me recently.
|
| I understand many of these are not directly dark-seo related,
| but their aggressive nature is hurting the cpu and memory
| limits of some of my servers and sites so it's a big issue
| regardless of the intents behind them.
|
| (kind of) glad someone else has dealt with these issues, and
| glad to see some of the 'how' for handling, identifying, and
| some actual real numbers for the impacts, as I've been guessing
| some of these things in my small projects, indeed it's a real
| thing. As well as a practical issue to pay attention to and
| work on.
| munk-a wrote:
| Could you possibly use your robots.txt to redirect them all to
| ad-laden pages to try and subsidize your legitimate users?
| buro9 wrote:
| This is for comment spam.
|
| It's trying to find a long tail of popular but not top listed
| blogs for the purpose of posting comments with the much desired
| links to the SEO target.
| Veen wrote:
| Does that work any more? I thought everyone put nofollow
| attributes on comment links.
| 0des wrote:
| If it didn't work, would you still see it?
| Veen wrote:
| Yes, because to sell it you need someone to believe it
| works. That's independent of whether it actually works
| (although this does answer my initial question).
| [deleted]
| hinkley wrote:
| I am slowly convincing my coworkers that deploying the exact same
| binary as two different 'services' is a significant tool to have
| in your toolbox. Some disaster recovery work we're doing is
| making it a much easier sell.
|
| I'm really just combining two very old tricks here. Traffic
| shaping based on class of service for two different requests, and
| for two different classes of users.
|
| Segregating bot traffic improves consumer experience. Segregating
| admin traffic from both allows you to set an upper and lower
| bound on availability.
| FargaColora wrote:
| You mention the "Dead Internet Theory" (not heard that phrase
| before!).
|
| I agree: the WWW Internet is dead, that is your problem. No-one
| visits websites anymore, everyone has moved to the 10 biggest
| websites and all data is now siloed there.
|
| If I want to search for something topical and relevant, I go to
| Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps,
| Discord etc.
|
| The general Internet is dead: it's just legacy content and spam.
|
| If you think it's bad for you, imagine what it is like for Google
| Search! Their entire business is indexing a medium which no
| longer has any relevancy. People complain that Google no longer
| delivers good results. But what can Google do? The "good content"
| is no longer available for them to index.
|
| Want to become rich? Make a search engine which indexes the fresh
| relevant data from the big siloed websites, and ignores the
| general dead Internet.
| marginalia_nu wrote:
| I built my search engine in part to explore whether this was
| actually true, and I don't think it actually is.
|
| There's a lot of organic human-made content still out
| there, possibly more than ever, it's just not able to compete
| with the SEO industry that completely displaces it from Google
| and social media.
| kodah wrote:
| Agreed, the general internet is not dead, but the majority of
| internet users are on Facebook, Twitter, Reddit, HackerNews,
| Instagram, Google Maps, Discord etc.
|
| From my perspective, we onboarded a lot (if not most) people
| to the internet after 2007 (the explosion of social media).
| People sticking to big sites really speaks to an inability to
| explore the larger internet and a lack of knowing _why_ you
| would even want to.
| alxlaz wrote:
| This matches my findings 100%. The WWW is active and
| bubbling, but virtually all the cool websites I've found in
| the last 10 years or so came through friends, small IRC
| channels, or more recently through marginalia.nu :-). Google
| and friends are facilitators for the SEO and tracking
| industries, so of course they have zero interest to
| prioritize these things over content spam -- their whole
| business runs on content spam. But the WWW is as alive as it
| gets.
| dylan604 wrote:
| And who uses your search? I had never heard of "you" until
| just now. And there is the problem with "new" search engines.
| Unless you can come up with what would have to be one of the
| greatest ad campaigns the world has ever seen, no significant
| number of users will know you exist. Where does the money to
| pay for that ad campaign come from? How will a search engine
| generate money to stay relevant? Once people see you becoming
| relevant, they will figure out how to game your system. It's
| just the nature of the beast. I don't think I'm being overly
| cynical about this either.
| marginalia_nu wrote:
| Why would I need to generate money to stay relevant?
| dylan604 wrote:
| <edit>The first </edit>relevant was the wrong word.
| sustainable would be more appropriate. on the assumption
| that hosting the search engine isn't free, and unless it
| is supported by a generous benefactor it will need to
| have a way of generating money to keep the servers
| running.
| marginalia_nu wrote:
| I'm self hosting so my operational cost is like $50/mo.
| throwaway14356 wrote:
| then he must be relevant
| fifticon wrote:
| I second that independent sites exist - I maintain my own
| website on a personally run server. There are dozens of us!
| to quote a quaint phrase.
| api wrote:
| All open systems are destroyed by spam once they become
| popular enough to be profitable targets. This will eventually
| happen to the Fediverse too. If there is money to be made
| pissing all over the commons, the commons will be pissed all
| over.
|
| It even happens to proprietary silos if they are too open.
| Look at how many bots and spammers infest social media.
| Propaganda and disinformation can also be considered a form
| of spam.
|
| I realize this sounds cynical but don't shoot the messenger.
| It's just something I've learned watching the Internet evolve
| since the middle 1990s. Spam eats everything it can.
|
| IMHO the future is enclaves and invite only communities. The
| Internet is a dark forest.
| marginalia_nu wrote:
| As old open systems are destroyed, new ones are created to
| replace them. The Internet exists in a constant state of
| rebirth and transformation. You really can't step into the
| same river twice.
| nonrandomstring wrote:
| > You really can't step into the same river twice.
|
| I love the maxim and philosophy of eternal refreshment.
|
| Seems like the problem is more akin to having nuclear
| waste dumped into our rivers though.
| pixl97 wrote:
| It's not cynical, it's how every system in nature works.
| Everything alive must develop an immune system or it is
| attacked and eaten.
| NoGravitas wrote:
| You are probably right about the future; not necessarily
| because of spam, though that's a part of it, but just
| because of the toxicity of global, open to the world,
| mostly public social media. The Fediverse has mostly
| coasted by so far on obscurity, but it's not great, and
| it's bound to get worse. All of my online socializing these
| days is either through short-lived pseuds on topic-oriented
| fora, or invite-only Matrix rooms.
| pwdisswordfish9 wrote:
| > This will eventually happen to the Fediverse too.
|
| Oh, don't worry, the Fediverse will never catch on.
| ffhhj wrote:
| Why? Serious question.
| indigochill wrote:
| How do you surface organic human content? I happen to linger
| around the fediverse/tildeverse sphere where I see organic
| content from people I personally have a direct (digital)
| connection to (and I started self-hosting my music after Epic
| bought Bandcamp), but I'm not clear on how I'd go about
| digging that kind of stuff up in the more general case.
| marginalia_nu wrote:
| I do a traditional web crawl and exclude anything that
| looks too much like it wants a high google ranking. Nothing
| to it.
| ratww wrote:
| This might be controversial, but I wish Google would
| exclude those websites too.
|
| Google started punishing keyword spam, then it started
| punishing black-hat comment spam. Even Youtube
| backtracked on the "videos have to be 10 minutes to
| rank".
|
| I wish they would do the same for carefully manicured SEO
| content farms too, as those sites are causing a harm
| worse than keyword-spammer sites did.
| marginalia_nu wrote:
| They're probably doing all they can. The problem is their
| dominance, which means both that they have effectively an entire
| industry looking for loopholes in everything they do, and that
| there are legal considerations (arbitrarily punishing individual
| smaller actors might skirt the territory of anti-competitive
| behavior).
| ajmurmann wrote:
| I love your search engine. Should I stop recommending it
| to friends to keep it safe?
|
| I jest a little bit, but your comment genuinely makes me
| wonder if Marginalia++ is search results - Google -
| Marginalia
| sdoering wrote:
| I fear that Google also has a conflict of interest here.
| A lot of these non optimized sites are not interested in
| making money via ads. So Google wouldn't profit
| additionally from leading people there.
|
| And a lot of people (myself often times included) are
| looking for a quick answer. A good enough answer. So good
| enough, SEO optimized is being surfaced. The result of an
| optimization war on both sides combined with the
| inevitable monetary interests.
|
| I don't have a solution. Sadly.
| galangalalgol wrote:
| Does anyone have an ad free search engine? You'd start
| with blacklists from ublock origin, pi-hole, and similar,
| don't bother even crawling those, then have easy
| reporting for new or self hosted ads. Not much money in
| it if any, but it would be refreshing. Might even have a
| mode to nix anything with a payment method on the site,
| or that links to a site with a payment method.
| ajmurmann wrote:
| > Does anyone have an ad free search engine
|
| kagi.com search.marginalia.nu
| EVa5I7bHFq9mnYK wrote:
| Maybe back to Yahoo model of the 90s? Manually created
| collection of curated links?
| datavirtue wrote:
| Yes. We have enough users now.
| ratww wrote:
| I think there's two kinds of SEO spam going on.
|
| The black-hat kind is definitely made to extract money
| from ads. But those are easy to avoid for web veterans
| IMO. And I also feel that Google is doing its part, even
| though it's costing them money from those sweet ads!
|
| But the white-hat kind, also known as content marketing,
| is made to let legit companies _save_ money. Instead of
| paying for Google Advertisement, they get traffic by
| means of organic content. Think "Michelin Guide" or "Red
| Bull". Which is a jolly fine idea and responsible for a
| lot of good stuff, but the problem is that this has been
| taken to extremes, and now the web is littered with low-
| effort content made by freelancer writers getting
| peanuts.
|
| I would personally prefer if those freelancer writers
| were doing 10 interesting Red Bull articles per month
| rather than 500 rehashes of contents from other websites.
| But who am I to judge.
|
| In the news industry things are also very similar.
| Nextgrid wrote:
| The "white-hat kind" can trivially be filtered out (or
| deterred) by downranking any of the crap these marketers
| use to measure their conversion rate - analytics, etc.
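|
| A sketch of that signal, assuming the crawler already has
| the page HTML: count script tags loaded from known analytics
| hosts and turn the count into a ranking penalty. The two
| hosts listed are only illustrative, and tracker_penalty and
| its weights are made-up parameters, not anything a real
| engine is known to use.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Illustrative list only; a real filter would use a maintained blocklist.
TRACKER_HOSTS = {"www.google-analytics.com", "www.googletagmanager.com"}


class ScriptCounter(HTMLParser):
    """Counts <script src=...> tags whose host is in TRACKER_HOSTS."""

    def __init__(self):
        super().__init__()
        self.tracker_scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src") or ""
        if urlparse(src).hostname in TRACKER_HOSTS:
            self.tracker_scripts += 1


def tracker_penalty(html, per_script=0.1, cap=0.5):
    """Score reduction in [0, cap], growing with the number of tracker scripts."""
    counter = ScriptCounter()
    counter.feed(html)
    return min(cap, per_script * counter.tracker_scripts)
```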
| ratww wrote:
| I love this idea. Would be nice to see it in a search
| engine, or at least a browser extension showing how much
| analytics junk a site has before you click it.
| Nextgrid wrote:
| Kagi has a non-commercial filter that I suspect uses the
| presence of ads/analytics as a signal.
| ysavir wrote:
| It's not about surfacing organic human content, it's about
| only indexing organic human content. The problem is
| automated indexing. So long as indexing works according to
| defined rules, the advantage will be to those able to shape
| their content to those rules, and the spammers and scammers
| will win.
|
| An idea I've had for a few years is making a social-network
| based index engine. The only pages that get indexed are
| pages that users themselves mark as worth indexing, and the
| only pages returned in your results are pages that were
| marked for indexing by people you added to your circles, or
| the people in their circles, or the people in _those_
| circles, etc (probably up to 5 or 6 degrees of separation).
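|
| As a sketch of how that lookup might work: a breadth-first
| walk over the circles graph, limited to a maximum number of
| hops, then a filter on who marked each page. The data shapes
| (dicts of user to friends and user to marked pages) and the
| function names are assumptions for illustration.

```python
from collections import deque


def trusted_users(circles, start, max_hops):
    """Users reachable from `start` within max_hops circle links (BFS)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        user, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in circles.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return seen


def visible_pages(marks, circles, user, max_hops):
    """Pages marked for indexing by anyone within max_hops of `user`."""
    trusted = trusted_users(circles, user, max_hops)
    return {page for who, pages in marks.items() if who in trusted for page in pages}
```

| The hop limit is the knob that matters: each extra hop can
| multiply the number of people whose marks you inherit.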
| nyokodo wrote:
| > up to 5 or 6 degrees of separation
|
| So basically everyone on earth?
| ysavir wrote:
| Alright, 2 or 3!
| kmeisthax wrote:
| ...so, blogrolls?
| ysavir wrote:
| Not familiar with blogrolls, but not quite. The idea is
| more to have standard search engine user experience, but
| with the requirement that each result is vetted by
| someone the user trusts, or trusts by proxy.
| pixl97 wrote:
| Welcome to the billion dollar question. Any place that is
| authentic will face the zombie horde attempting to fake
| authenticity in order to capture attention.
| tomxor wrote:
| I think you're _almost_ right, but it's not necessarily
| authenticity... I think it's just money.
|
| Large "authentic" search engines can exist to serve the
| rest of the web, those personal blogs and other small
| communities. Those sites have a natural tendency to not
| be trying to turn everything into a revenue stream, so if
| that was the prerequisite for an engine, it would be a
| perfect match and naturally dissuade marketing types.
| pixl97 wrote:
| Authenticity is worth money.
|
| When you have a 'real' community you're talking about
| real people with real salaries and desires, add in that
| you tend to develop a real trust between members. Think
| of this as fertilized soil. You can grow crops in it, but
| weed seeds will eventually land and try to take over it.
|
| HackerNews is a good example of this, it takes a healthy
| amount of moderation to keep things on topic where things
| like politics get pared pretty ruthlessly. If for a
| minute Dang gave in and found ways to additionally monetize
| the forums, something that would be profitable for a
| while at least, things would start down a bad path.
| sdoering wrote:
| I can only agree with my sister comment. I find this
| industrialized web more and more shallow and taxing to use.
|
| While professionally I need to help (smaller, local) clients
| to reach their audiences I become more and more weary.
|
| It is like walking through a supermarket with industrialized
| fast convenience food shouting in bright colors and
| advertising while ultimately not nourishing me like slow,
| real food could.
|
| I am still looking for this digital slow food movement.
| nonrandomstring wrote:
| > I am still looking for this digital slow food movement.
|
| https://digitalvegan.net
|
| Please read it, and if you enjoy it please suggest it to
| friends.
| Vladimof wrote:
| I added it to my list of search engines on Firefox... your
| favicon is really small, that's on purpose?
| ColinHayhurst wrote:
| Agreed.
|
| > If I want to search for something topical and relevant, I
| go to Facebook, Twitter, Reddit, HackerNews, Instagram,
| Google Maps, Discord etc. The general Internet is dead: it's
| just legacy content and spam.
|
| The "general" Internet is not dead. Though if you want
| to participate in just Facebook, Twitter, Reddit, HackerNews,
| Instagram, Google Maps, Discord you might well think that.
|
| Users of marginalia (author above), Mojeek (disclosure: CEO)
| and others [0] are well aware that there are riches of
| organic human-made content; from years back and new. Yes, a
| lot of noise too, which Google has a bigger (SEO) struggle to
| compete against. But still there is good and different
| content available.
|
| To find good content, using search, you need to use "search"
| engines which enable discovery, as Google used to do. I
| stress the "search" as the emphasis of Google, Bing and thus
| their syndicates is increasingly on being "answer" engines.
|
| [0] https://seirdy.one/2021/03/10/search-engines-with-own-
| indexe...
| mc32 wrote:
| Sounds like we're back to AskJeeves and a number of failed
| answer engines from a couple of decades ago!
| ColinHayhurst wrote:
| AskBERT but now MUM knows best.
| tmaly wrote:
| Everyone is trying to game the Google algorithm. The net
| result is all this long form content and cooking recipes
| that are 10 pages long.
|
| There seems to be a big disconnect between a typical user's
| attention span and the length of a post.
| ajmurmann wrote:
| I thought the recipe thing was to be able to copyright
| them
| Domenic_S wrote:
| > _The "general" Internet is not dead._
|
| For some things it is. Good luck getting a non-
| sponsored/SEO-gamed review of a kitchen appliance or
| particular vacation mode such as a cruise. It's
| flabbergasting.
|
| Most times I just stick "inurl:reddit.com" in my search and
| _try_ to get discussion threads about the thing I'm
| researching, but even that's getting filled up with shills.
| ColinHayhurst wrote:
| Result #1 & #2 for kitchen appliance review (your
| personalised/local results might vary):
|
| Google:
|
| https://www.expertreviews.co.uk/home-garden/home-
| appliances
|
| https://www.goodhousekeeping.com/appliances/
|
| Bing:
|
| https://www.which.co.uk/reviews/fitted-
| kitchens/article/plan...
|
| https://www.goodhousekeeping.com/appliances/
|
| DDG:
|
| https://www.goodhousekeeping.com/appliances/
|
| https://www.which.co.uk/reviews/fitted-
| kitchens/article/plan...
|
| Marginalia:
|
| https://www.infiniteeureka.com/shop-markdowns-on-small-
| kitch...
|
| http://www.fullyramblomatic.com/essays/sarah.htm
|
| Mojeek:
|
| https://www.appliancesreviewed.net/
|
| https://busybakers.co.uk/category/kitchen-appliance-
| reviews/
| [deleted]
| FargaColora wrote:
| Most of these are spam. They contain affiliate links to
| Amazon to buy the product which is being reviewed,
| therefore the review cannot be trusted.
|
| "Which" looks to be the exception, but that is a paid-for
| service.
|
| It's a sad state of affairs.
| kelnage wrote:
| I understand your opinion about affiliate links - but I
| use several review websites that use such links for all
| products they review, and have both positive and negative
| reviews for products. So I wouldn't say it necessarily
| follows that affiliate links = biased reviews.
| throwaway894345 wrote:
| I think search engines are broken, but the Internet
| itself is probably not "dead". It's just our
| accessibility to that information. That's not super
| helpful until we have better search engines (which steer
| us away from this SEO stuff), but the good news is that
| building a better search engine is easier than
| resurrecting the Internet. In particular, there's a good
| chance that a niche, naive search engine might be able to
| significantly improve accessibility (e.g., high rankings
| for pages that answer user queries in the fewest bytes).
| marginalia_nu wrote:
| ¯\_(ツ)_/¯
|
| http://www.jitterbuzz.com/indmix.html
|
| http://www.alaska.net/~akpassag/
| FargaColora wrote:
| These websites seem to be last updated decades ago, which
| is prehistoric to most casual browsers. There's no doubt
| there is great content on the general internet, but these
| examples I would classify as "legacy".
| marginalia_nu wrote:
| I can see why the website owners would be interested in
| getting traffic to recent websites, but why would you be
| interested in recently updated websites?
| pmontra wrote:
| I take myself as an example.
|
| People that know me and don't meet me regularly might know
| the URL of my web site and might care to look at it once per
| year and check if there is something new. Usually pictures
| and tales from holidays. Covid made those holidays less
| memorable so I didn't make any update since fall 2019. People
| that meet me regularly don't need that website, I'm telling
| them the tales first hand and showing them the pictures
| without being obnoxious. I guess that this website is a
| target for your search engine except it's not in English and
| your search engine seems to want English search phrases.
|
| I don't have anything of value to share on a public chat like
| Twitter and I don't have an ego to pretend I do. I also don't
| use Facebook anymore. I go there once per year to like the
| messages that wish me happy birthday. I think it's polite to
| do so. All my media production is on WhatsApp or Telegram in
| group chats with people I know in real life.
|
| If I really cared about producing content for the world I'd
| probably be using Twitter, Medium or the fad of the year and
| they'd take care of my SEO (do they?) or I'd be trying to
| score points on StackOverflow.
|
| To recap: I never intended to compete on SEO. I'm really OK
| that my website is only for friends and spreads by word of
| mouth. It probably never did, I bet it's been on a flatline
| since I created it 20+ years ago.
| captainmuon wrote:
| But Twitter, Reddit, HN, and most other such places are just
| websites and can be indexed fine. Same with Wikipedia, which is
| very much a silo (they don't have regular links in text in the
| hypertext spirit, but only footnotes).
|
| Facebook and Instagram are more of a walled garden, like Quora,
| but there is a lot of junk there anyway.
|
| It's sad for the WWW, but I don't really think it is a
| fundamental problem for search engines. In fact Twitter for
| example gives a direct pipe to Google. If you tweet something,
| it is immediately findable. Similar for StackExchange, but
| there I think the site is so "small" that Google can afford to
| just continuously index it.
| ratww wrote:
| Twitter and Reddit still can be indexed, but they've also
| become increasingly hard to use without an account. Reddit
| doesn't let you fully expand threads when you're unlogged.
| Twitter limits the amount of things you can read and shows a
| modal. Both of them heavily limit usage on mobile devices
| without installing an app.
|
| Sure, an account is free but might require giving information
| you don't want to give. (Twitter asks me for a phone number a
| few minutes after creating an account, even if I don't post
| anything.) Reddit at least lets you skip giving an email.
|
| Sure, there are workarounds such as using lite versions (old
| Reddit, mobile Twitter), but that's not known to all people
| coming from a search engine.
|
| It feels as if HN is the only one that's not a partially
| walled garden yet (and Wikipedia of course).
| airstrike wrote:
| > Reddit doesn't let you fully expand threads when you're
| unlogged.
|
| that's what old.reddit.com is for!
| FargaColora wrote:
| old.reddit will be gone soon, it is inevitable.
| Especially once they go public.
| ntauthority wrote:
| Isn't it a bit ironic that a site - or its operator -
| 'going public' means all the content on said site
| actually 'goes private'?
| aceazzameen wrote:
| Yup. It's bound to happen. And when it does, Reddit will
| no longer exist in my eyes.
| azemetre wrote:
| Agreed. IDK how I feel about Reddit. I've been on it
| since 2010 when Fark lost its spark. I remember some
| great times but a lot of it was "junk" content that in
| the end was very meaningless. I wish I could say I used
| it to develop my career in tech but that isn't true
| either; I use specific blogs, books, and tutorial sites
| to learn instead.
|
| I suppose I mostly view it as a continuous party, yeah
| it's fun if you attend but after a few hours I wish I was
| doing something more productive.
| ratww wrote:
| Exactly, I mentioned it. But not only is it bound to go
| away sometime, it's also not trivial to find for anyone
| who's not an expert Reddit user, unfortunately.
| TheRealDunkirk wrote:
| And isn't it great to get a link to Reddit or Twitter, and you
| click the link, and try to navigate to the comments for
| context or the answer, and you go to click the link to expand
| it, and then you get a demand to log in and install their
| app? Don't talk about walled gardens and not include Reddit
| or Twitter just because they let you look at one brick before
| demanding their tax.
| [deleted]
| hn_throwaway_99 wrote:
| Doesn't _this_ site, and all of the content it links to, pretty
| much disprove your theory?
|
| Yes, sure, I often do go to the "top sites" when searching for
| content, but I still usually start at Google. And, despite all
| the SEO spam, Google still does a fairly decent job of landing me
| on, for example, the appropriate Wikipedia page, Stackoverflow
| post, travel site, etc.
| mrtksn wrote:
| It has been dead for a while now and the whole society feels it
| globally. Things were getting so good then things became
| horrible and whoever cracks the path to the good stuff again
| will find great riches at the end of the path.
| dageshi wrote:
| I agree with you to an extent. The web is less useful than it
| used to be. BUT I would say a lot of that usefulness has
| diverted into youtube. There are people who would previously
| have made sites who are making youtube videos instead which of
| course is owned by google.
| Jenk wrote:
| > If I want to search for something topical and relevant, I go
| to Facebook, Twitter, Reddit, HackerNews, Instagram, Google
| Maps, Discord etc.
|
| High chances you will find a link to an external site over
| content actually on those big named sites though, right? That
| tells us the organic web isn't dead, it's just hard to
| discover/navigate - because of SEO wars, most probably... The
| problem isn't the lack of content, it's the number of shitty
| spammy sites standing in your way of the sites you actually
| want to see. Like a sleazy salesman trying to direct you to the
| crap laden three wheeled rust bucket when you were heading
| toward the family sedans.
| altairprime wrote:
| If you want to be rich, solve search without full-text indexing
| of sites. Pagerank only ever worked because of human curation
| of webrings. Full-text search made it easier to find content,
| and opened the door for spammers. The only viable route forward
| for search will be to replace full-text indexing with human
| curation, somehow. Solve how to scale that up instead, so that
| when everyone else realizes we need it for the health of the
| Web, you're ready.
| [deleted]
| shortformblog wrote:
| I think this is a tad reductive, but I will say that we sure
| let a lot of big companies convince a huge portion of the
| population to create all of their content on platforms that
| they have no real control over.
|
| The problem is, many of them didn't realize this was a problem
| until recently.
|
| That said, plenty of exciting stuff is happening outside of the
| walled garden, as long as you know how to find it.
| Gravityloss wrote:
| And not only did this happen already over a decade ago, a lot
| of the current internet users have never known anything else.
|
| We had a discussion with coworkers and somebody mentioned
| irc. Explaining to younger colleagues what it was and that it
| was not a product of a company, but operators had servers
| that formed a network, and it was more like infrastructure.
| Felt weird.
| Elvie wrote:
| isn't Discord a bit like IRC used to be?
| ori_b wrote:
| How do I connect to a self hosted discord, and then
| connect it to my friends self hosted one?
|
| And where do I get the RFC for the protocol so that I can
| write my own compatible implementation?
|
| IRC isn't a product. It's a standardized protocol
| sufficiently simple to implement in a day or two.
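|
| As a rough illustration of how simple the wire format is, here is
| a minimal sketch of parsing one IRC line, assuming the RFC 1459
| layout: an optional ":prefix", a command, then parameters, the
| last of which may be a " :trailing" part containing spaces. The
| name parse_irc_line is hypothetical, and message tags, encodings
| and empty lines are not handled.

```python
def parse_irc_line(line):
    """Split one raw, non-empty IRC line into (prefix, command, params)."""
    line = line.rstrip("\r\n")

    # An optional prefix (server name or nick!user@host) starts with ':'.
    prefix = None
    if line.startswith(":"):
        prefix, _, line = line[1:].partition(" ")

    # Everything after the first " :" is one trailing parameter
    # that is allowed to contain spaces.
    head, sep, trailing = line.partition(" :")
    parts = head.split()
    command, params = parts[0], parts[1:]
    if sep:
        params.append(trailing)
    return prefix, command, params


if __name__ == "__main__":
    print(parse_irc_line(":nick!user@host PRIVMSG #chan :hello world\r\n"))
    print(parse_irc_line("PING :irc.example.net\r\n"))
```

| A real client also needs a socket loop, registration (NICK/USER)
| and PING/PONG handling, so this covers only the parsing step.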
| kasey_junk wrote:
| Most of the kids in my 3rd grader's peer group understand
| federated infrastructures quite well because of Minecraft.
|
| Perhaps it wasn't the federated nature of irc that was
| surprising but the fact that it was irc?
| mst wrote:
| Isn't minecraft more decentralised than federated?
|
| IRC networks usually have multiple servers connected
| together (historically, often run by a bunch of different
| people) and I didn't think people self-hosting minecraft
| servers usually did that?
| shortformblog wrote:
| I think honestly it highlights the power of marketing as
| much as anything else. In some ways, building an open
| network is always going to put you at a disadvantage to a
| company that can throw money at user acquisition and PR
| teams. That federated networks like Mastodon have seen
| growth reflects the fact that word of mouth still means
| something in 2022.
| NicoJuicy wrote:
| The big siloed websites are just indexes of fresh content
| though.
|
| With a generic way to place comments on it.
| psyc wrote:
| Based on my observations over the past year, I'm certain that
| Google and Bing choose not to show us most of the web anymore.
|
| I usually find what I'm looking for. It just takes literally
| three orders of magnitude longer than it used to for the same
| kind of stuff. I used to use Google a lot to jog my memory
| about various things I vaguely remembered. Type a few
| associative words and snippets, press Enter, done. Google's
| useless for that now.
|
| If you're looking for hot pop shit in trendy publications,
| things to buy, commercial services to subscribe to - G has you
| covered. That's what they do now.
| ouid wrote:
| Google is still pretty good at searching reddit. Maybe reddit
| can acquire them.
| big_blind wrote:
| site:reddit just is the best search engine at this point. I
| still don't like Google though.
| dotnet00 wrote:
| I agree that this seems way too reductive. I was recently
| reflecting on this and noticed that I constantly run across new
| blogs and sites whenever trying to learn something. I just
| don't usually pay much attention to the site name in the way
| that I remember HN, Reddit, Twitter etc.
|
| So, while I would agree that some aspects of the old internet
| are dead (like 'small' ~1000 user forums focused on specific
| topics having largely been replaced by generally inferior
| subreddits and discord servers), I think it hasn't gotten as
| bad as you're making it out to be.
| baxtr wrote:
| I am not so sure...
|
| I think what happened is this: the WWW was everything back in
| the days. But in the "old days," only 10% of all people were
| online, the web elite. Then, AOL came, and the rest came online
| slowly but surely. The so-called "mainstream" people were no
| geeks, and these people were "just" ordinary people. Almost all
| were captured by what you call "big websites".
|
| Now, we see the 100% being dominated by the 90%. That's why
| "Google results are bad". Bad for us! But maybe (most probably)
| not for them.
| nl wrote:
| Eternal September was Sep 1993. AOL hit the internet in March
| 1994.
|
| Netscape didn't launch until December 1994 (and the WWW was
| nothing before that. I subscribed to a mailing list with new
| sites that were released and I'd visit most new websites on
| the internet on most days with the Cello browser in my uni
| labs).
|
| AOL users have been there since the beginning of the WWW.
|
| https://en.m.wikipedia.org/wiki/Eternal_September
| CWuestefeld wrote:
| My recollection is that the AOL event you reference was
| only making usenet accessible - a point that makes good
| sense in the context of the eternal September.
|
| But when talking about the WWW, that's a very different
| story. I think that AOL didn't incorporate a web browser
| until quite some time after that.
| mywaifuismeta wrote:
| I no longer see Google as a neutral "search engine" the way it
| used to be. Now it's just another company that owns and
| promotes certain types of content, no different from reddit.
| For some things Google has the best content, for some things
| Twitter or Reddit have the best content.
| dixego wrote:
| Google is an advertising company. It has been for a good
| while.
| big_blind wrote:
| Yeah I use you.com and kagi.com. No advertising on either.
| Less SEO spam too it seems.
| [deleted]
| photochemsyn wrote:
| I find one of the best ways to find interesting content on
| specific subjects using Google is now to start blocking all
| their top returns (a lot of SEO spam). This is somewhat
| tedious (lots of -site:seospam.com) and Google doesn't like
| automated queries. However, a few rounds of this often turns
| up interesting content down low in the search results. Just
| don't take what's on offer on page one of search results,
| basically.
|
| Where it's gotten really bad is on news searches as Google
| either now has some kind of shitlist of independent news
| sites that it won't allow to show up on, for example,
| site:youtube.com searches - or, it's filtered through a guest
| list. It's hard to tell which strategy they're using, but
| news is definitely being heavily filtered based on very
| dubious propaganda-smelling agendas.
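|
| The "-site:" exclusion rounds mentioned above could be scripted
| roughly like this. It is only a sketch: the function names and
| the example domains are hypothetical, and it just builds the
| query string and URL rather than sending automated requests.

```python
from urllib.parse import urlencode


def build_query(terms, blocked_sites):
    """Append one -site: exclusion per blocked domain to the search terms."""
    exclusions = " ".join(f"-site:{domain}" for domain in blocked_sites)
    return f"{terms} {exclusions}".strip()


def search_url(terms, blocked_sites, base="https://www.google.com/search"):
    """Return a search URL carrying the query with its exclusions."""
    return base + "?" + urlencode({"q": build_query(terms, blocked_sites)})


if __name__ == "__main__":
    blocked = ["seospam.com", "copycat.example"]  # grows with each round
    print(build_query("replace headlight bulb", blocked))
    print(search_url("replace headlight bulb", blocked))
```

| Each round, the domains you reject go into the blocked list and
| the query is rebuilt.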
| xvello wrote:
| You might be interested in using uBlockOrigin and
| https://letsblock.it/filters/search-results to easily block
| these domains. In addition to your own domain list, you can
| use the community-maintained SO / github / npm copycat
| lists.
| maxwelldone wrote:
| Back in 2000s Google used to be the place for any type of
| search (IIRC).
|
| Now, I've been conditioned to use it only for specific use
| cases, mostly for convenience. Some examples include:
|
| 1. Anything programming related (searching for man pages,
| error codes etc) is straightforward. (I do have some UBO
| filters to exclude SO copycats)
|
| 2. Utility stuff like currency conversion, finding time in
| another city, weather etc.
|
| Where Google has really fallen behind is in multimedia
| search. Not sure if it's due to copyright issues or not but
| Bing and Yandex provide way better service in this regard.
|
| Not to mention the "reddit" suffix I need to add to any
| search that even remotely calls for public opinion. In many
| cases, Google is just a shortcut to take me to the relevant
| subreddit.
| ufmace wrote:
| Programming-related stuff seems to have gotten a lot worse
| in the last couple of years. Now most terms, at least for
| common things, return a ton of blogspam, when the official
| docs or SO are usually the best source.
| LegitShady wrote:
| Another thing seems to be prioritizing current news over
| past news, which makes searching for old articles you've read
| quite difficult.
| samstave wrote:
| This MUST be the reason that they threw their purchase of
| Postini in the garbage and my GMAIL INBOX is filled with spam,
| and my "social" and "promotions" tabs don't filter....
|
| GMAIL is garbage now, I literally use it as my spam email
| now. Which sucks because I have had it for a _really_ long
| time.
|
| Anecdote on Yahoo! Mail: years ago I wrote to Yahoo support
| asking when I created my Yahoo Mail account (I'd had it from
| the 90s when it was very early available...)
|
| And support told me that they couldn't tell me when my account
| was created as that was *proprietary company information*
|
| So I deleted my Yahoo account. I'm about to DL all my Gmail and
| do the same.
| throw10920 wrote:
| > I agree: the WWW Internet is dead
|
| I've heard this claim a lot, with 0 supporting evidence. Do you
| have any?
|
| My own experience is that there are _thousands_ of content-
| rich, high-quality blogs still being written by real humans,
| because I regularly find and bookmark new ones weekly, without
| even looking for them, so: please provide evidence for this
| claim that runs counter to my lived experience.
| PragmaticPulp wrote:
| > If I want to search for something topical and relevant, I go
| to Facebook, Twitter, Reddit, HackerNews, Instagram, Google
| Maps, Discord etc.
|
| Maybe we're searching for different content, but I disagree.
| While Google results are not without noise, I think it's a huge
| exaggeration to suggest it's useless. I still regularly find
| quality results from a quick skim of the first or second page
| of Google results.
|
| Meanwhile places like Reddit, Twitter, and Hacker News are full
| of very strong opinions that _feel_ truthy, but are mostly
| noise. Unless you go in with enough baseline knowledge to
| filter out 9/10 underinformed comments to dig out the 10% who
| actually have direct knowledge of the subject and aren't just
| parroting some version of something they read from other
| comments, skipping straight to social sites becomes a source of
| misinformation.
| derefr wrote:
| > Make a search engine which indexes the fresh relevant data
| from the big siloed websites, and ignores the general dead
| Internet
|
| I don't understand why Google themselves don't do this.
| LinkedIn v. hiQ demonstrated that they won't get in trouble for
| scraping users' subjective views of data within these silos and
| then stitching them together to form a cohesive whole. So
| where's the effort to do so? It seems like the obvious step.
| Gigachad wrote:
| Interesting thought. I just went through my browser history and
| realised that almost every time I use google search, I already
| know what website I want, I just don't know the exact
| link/page. I'll use google because the search on stack overflow
| or reddit sucks but I know I'm looking for a page on one
| particular site.
| Pelam wrote:
| I realized this too. I disabled search from address bar and
| started bookmarking everything even remotely sane I see. I
| often add a few personal keywords to the bookmark bar.
|
| It is starting to pay dividends. Instead of weird stuff
| thrown up by google when I type in something, I get the "oh
| yeah, that was the page" from a short list of bookmarks shown
| to match the words.
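|
| The keyword-tagged bookmark lookup described here can be sketched
| as below. This is an assumption about how such matching might
| work, not how any particular browser does it; the data layout and
| the name match_bookmarks are hypothetical.

```python
def match_bookmarks(bookmarks, typed):
    """Return bookmarks whose title or keywords contain every typed word."""
    words = typed.lower().split()
    matches = []
    for bookmark in bookmarks:
        # Search the title and the personal keywords together.
        haystack = " ".join([bookmark["title"], *bookmark["keywords"]]).lower()
        if all(word in haystack for word in words):
            matches.append(bookmark)
    return matches


if __name__ == "__main__":
    saved = [
        {"title": "Eternal September", "url": "https://en.m.wikipedia.org/wiki/Eternal_September",
         "keywords": ["usenet", "aol", "history"]},
        {"title": "Custom search notes", "url": "https://notes.npilk.com/custom-search",
         "keywords": ["lucky", "redirect"]},
    ]
    for hit in match_bookmarks(saved, "aol usenet"):
        print(hit["url"])
```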
| npilk wrote:
| I had the same realization and ended up setting up a simple
| Cloudflare script to automatically do an "I'm Feeling Lucky"
| style search to return the first result:
| https://notes.npilk.com/custom-search
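|
| The general idea of an "I'm Feeling Lucky" style redirect can be
| sketched as follows. This is not the linked script: the function
| name lucky_redirect is hypothetical, and the search backend is
| passed in as a callable that is assumed to return a ranked list
| of result URLs.

```python
def lucky_redirect(query, search):
    """Return (status, headers) redirecting to the top result for query.

    `search` is any callable mapping a query string to a ranked list of URLs.
    """
    results = search(query)
    if not results:
        # Nothing to redirect to, so report "not found" instead.
        return 404, {}
    return 302, {"Location": results[0]}


if __name__ == "__main__":
    def fake_search(query):
        # Stand-in backend used only for this demonstration.
        return ["https://example.org/first", "https://example.org/second"]

    print(lucky_redirect("anything", fake_search))
```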
| lysecret wrote:
| I think this is a very "consumer focused" take. Yes. A lot of
| interesting people data is now "locked" behind these
| aggregators and platforms (and also hard to handle because of
| GDPR). But most interesting company data is still out there.
| matheusmoreira wrote:
| The internet itself is probably gonna die soon anyway. Every
| country wants to impose its own laws on it. I think it'll
| eventually fragment into multiple segregated continental
| networks, if not national ones, all with heavy filtering at the
| borders.
|
| I'm happy to have experienced the free internet. Truly a jewel
| of humanity.
| cesarb wrote:
| > I think it'll eventually fragment into multiple segregated
| continental networks, if not national ones
|
| That's exactly the world in which the Internet grew. There
| were multiple segregated national and sub-national networks,
| and the Internet was built as a means to interconnect them.
| After some time, the Internet protocols ended up being used
| even within these networks, but that was not originally the
| case. And even today, there are still things like the AS
| (Autonomous System) concept which permeates the core of the
| top-level Internet routing protocols, which still reflect the
| Internet being a "network of networks" instead of a single
| unified network.
|
| That's why I'm not too worried about the Internet
| fragmenting; we've seen this before. What happens next is
| gateways between the networks, and there are already shades
| of these in the VPN providers which allow one to connect as
| if one were located in a different network, often from a
| different country.
| kmlx wrote:
| > I think it'll eventually fragment into multiple segregated
| continental networks
|
| i think it already has.
|
| the Great Firewall of China is the classic example, but I
| think the trend started in the west with the Right to be
| forgotten/right to erasure in Europe, and subsequent HTTP
| Status 451 Unavailable For Legal Reasons. GDPR just further
| cemented the split between Europe and the rest, and the new
| DMA & DSA regulation in the European Union finally makes it
| clear. The writing is of course on the wall, so countries
| like India or Australia aren't too far behind. Places like
| California also have their own "right to be forgotten", and
| I'm sure the US will not be left behind for too long before
| we see regulation further splitting their internet from the
| RoW. And I don't think the RoW will hold off much longer till
| it also splits into multiple big blocks. It's the start of
| the new "nationalist" internet, and I'm sure we'll all be
| poorer because of it.
| matheusmoreira wrote:
| Exactly what I mean. There is no way to have an
| international network with national borders.
| Telecommunications providers have always been centralized
| and have always been in bed with the government. Only way
| we'll ever be free is if someone invents some kind of
| decentralized long range wireless mesh network.
| politician wrote:
| Like Starlink?
| ricardobeat wrote:
| Starlink connects to standard internet gateways on the
| ground. It cannot function without the 'regular
| internet', unless a replacement appears.
| dotnet00 wrote:
| IIRC there was mention of it providing some p2p network
| style communication capabilities for Ukraine's military,
| and one of the reasons it's appealing to the US's
| military is the ability to route communications entirely
| within the network (well, with the gen 2 satellites which
| have laser interconnects).
|
| So it can (at least eventually) function without 'regular
| internet', although I would still be hesitant to call it
| a viable infrastructure choice if the goal is to get
| around government control, simply from how much SpaceX
| have to appease the government to do anything space
| related.
| matheusmoreira wrote:
| Starlink is maintained by a company, it's an internet
| service provider. One visit from the police and they'll
| censor anything.
|
| The mesh network should be made out of common hardware in
| order to be viable. I'd suggest phones but those devices
| are owned before they've even left the factory.
| Nextgrid wrote:
| One visit from the _US_ police. US-unfriendly countries
| have no leverage over it, and similarly, the US has no
| leverage over satellite ISPs based in countries they
| aren't on good terms with.
| jrockway wrote:
| > US-unfriendly countries have no leverage over it
|
| "Star Wars Episode 10: The one that's not fiction."
| Nextgrid wrote:
| Internet censorship isn't worth going to war over and
| disclosing secret anti-satellite weapons that are better
| saved for a rainy day.
| jrockway wrote:
| It's probably easier to just cut off outgoing payments to
| Starlink anyway. They're not a charity, so if they don't
| get paid, they probably don't want to provide service
| just to send a message to some random government.
|
| On the other hand, if you want to demonstrate that you
| have anti-satellite capability it's probably a better
| idea to shoot down a corporate satellite than a military
| one. The Soviet Union shot down Korean Air Lines Flight
| 007 and it didn't start a war, after all.
| eloisius wrote:
| Good luck, spectrum is highly regulated in every country
| I can think of. If national governments don't want you
| networking across borders, you're definitely not going to
| be broadcasting long range radio transmissions that way.
| In fact, it's currently illegal to transmit encrypted
| data or to relay packets via ham radio in the US.
| matheusmoreira wrote:
| Who knows? The whole point of decentralization is for
| there to be so many nodes in the network they can't
| possibly take them all down so that it's pointless to
| even try. What if all smartphones formed a mesh network?
| There aren't enough prisons in my country for all those
| criminals.
| eloisius wrote:
| I agree with your ethos, but I don't share your optimism.
| If the state wants to enforce networking firewalls along
| national boundaries, no technological solution will save
| us in general. As a resourceful techie with the right
| know-how you may be able to sneak your packets through,
| just like people in Cuba receive a literal packet of data
| via sneakernet, but if the state doesn't want widespread
| meshnets circumventing their firewall, they will imprison
| you for emitting pirate radio signals, they will penalize
| any electronics manufacturer that makes non-compliant
| hardware, and rest assured that companies will go right
| along. Liberty requires more than technical solutions.
|
| I'm saying this as someone who once wrote a decentralized
| P2P mesh for instant messaging[1]. I was inspired by the
| HK protests going on ~2014 after hearing that they were
| using Bluetooth chat apps. Luckily Matrix, Telegram,
| Signal, etc. mostly solved the problem. Still, I don't
| think any amount of mesh networking would turn back the
| tide of Hong Kong now.
|
| [1]: https://github.com/zacstewart/comm/
| groby_b wrote:
| >What if all smartphones formed a mesh network? There
| aren't enough prisons in my country for all those
| criminals.
|
| There don't need to be. You publicly gruesomely execute
| the first 100 or so you catch, and the practice of
| running a mesh node on your cell phone will fall so far
| out of fashion that the network breaks.
|
| Societal shortcomings cannot be fixed via tech alone. If
| you can't build a society resilient to authoritarianism
| in the first place, tech will not help you. It can be
| used to _increase_ resilience, but that's far from
| fixing the problem by itself.
| 7sidedmarble wrote:
| The networking may have been open like that, but I'm not sure
| the content ever was. It seems to me like a lot of internet
| users consume mainly the content of sites from their country.
| Kind of hard to blame them when that content is probably
| going to download fastest. But the language barrier has also
| kept the internet from becoming truly global.
| dreen wrote:
| I think this was inevitable all along, something similar
| happened to radio if I'm not mistaken.
|
| However, the good news is that we will never stop reinventing
| everything. The real value of the old internet was showing us
| what is possible.
| nonrandomstring wrote:
| > The real value of the old internet was showing us what is
| possible.
|
| Of equal value is that it showed us what not to do.
|
| We have 30 years of documentation for research on exactly
| what a successful intra-planetary network needs to be
| immune to. A successful future network must build in
| resistance to all forms of human psychopathology from the
| ground up.
| pde3 wrote:
| This is a nice fantasy, but it's a fantasy. The tech
| stack and network we have is too dense a forest to be
| replaced by clean slate designs. But maybe some of the
| problems could be improved with some new platforms and
| APIs. Mind you, ML is making so much progress so quickly
| that what happened over the last thirty years is at best
| a partial model of the problem we have to solve now, and
| the tools we have to do it with...
| nonrandomstring wrote:
| > ML is making so much progress so quickly that what
| happened over the last thirty years is at best a partial
| model of the problem we have to solve now, and the tools
| we have to do it with...
|
| Sorry I don't see how ML can help here. It seems like
| another thing to pin hopes of repairing an already too
| broken system on.
|
| "We cannot solve our problems with the same thinking we
| used when we created them." -- Albert Einstein
|
| "A new scientific truth does not triumph by convincing
| its opponents and making them see the light, but rather
| because its opponents eventually die, and a new
| generation grows up that is familiar with it." -- Max
| Planck
|
| We are the dying generation my friend. We built it. They
| came. It didn't work. Surely if ML can do anything it's
| telling us that we need to tear down the old system
| completely and start again, don't you think? Adding
| sticking tape won't help.
|
| edit: turning a grunt into an honest question
| Whiteshadow12 wrote:
| This made me sad, the optimist in me believes that some
| alternative will be built, that could take us back to those
| days. Honestly I do feel for most of my life I experienced an
| American Internet mostly (From South Africa), as long as one
| can still hop from one internet to another, in as simple a
| manner as possible it might not be as bad as it could be.
| matheusmoreira wrote:
| I'm sad as well. To me it feels like we're already living
| in a cyberpunk nightmare, things just keep getting worse
| and there's nothing anyone can do to stop it.
| [deleted]
| lkxijlewlf wrote:
| > If I want to search for something topical and relevant, I go
| to Facebook, Twitter, Reddit, HackerNews, Instagram, Google
| Maps, Discord etc.
|
| Interesting. When I search for something topical I search those
| sites using Google because al(most) (I don't use some like FB
| and insta) all those sites have really shitty search.
| jerf wrote:
| "I agree: the WWW Internet is dead, that is your problem. No-
| one visits websites anymore, everyone has moved to the 10
| biggest websites and all data is now siloed there."
|
| That is not the Dead Internet Theory. That's just something
| anyone can see by looking at the world.
|
| The Dead Internet Theory is that the Internet is _already_ an
| echo chamber custom fed to you by a collection of bots and
| other such things, and that a lot of the "people" you think
| you're interacting with are already, today, faked. You're
| basically in a constructed echo chamber designed only with the
| interests of the creators of that chamber in mind, using the
| powerful social cues of _homo sapiens_ effectively against you.
|
| In particular, those silos aren't where people are
| communicating. Those silos are where you _think_ you're
| communicating.
|
| It is obviously not entirely true. When we physically meet
| friends, sometimes topics wander to "Did you see what I posted
| on Facebook?" So far, we've not caught Facebook actively
| forging posts from our real-life friends that we physically
| know. (Though we _have_ caught them failing to disseminate
| posts in what seems to be a distinctly slanted manner.)
|
| I am also not terribly convinced that the bots have mastered
| long-form content like you see on HN. I think we've had some
| try, and while they can sort of pass, they seem to expend so
| much effort on merely "passing" that they don't have much left
| over to actually drive the conversation. HN probably still
| requires real humans to manipulate things.
|
| Where I do seriously wonder about this theory is Twitter. AI
| _has_ progressed to the point that short-form content like that
| can be effectively generated and driven in a certain direction.
| There's been some chatter on the far-out rumor mills about
| just how bot-infested Twitter may be, how many people think
| they have thousands of followers, even having interacted with
| some of them as "people", and in fact may only have dozens of
| flesh-and-blood humans following them, if that. Stay tuned,
| this one is developing.
|
| (Note that while this could be "a big plan", it is also a
| possible outcome of many groups independently coming to the
| conclusion that a Twitter bot horde could be useful. A few
| hundred from X trying to nudge you one way, a few hundred from
| Y trying to nudge you another, another few thousand from Z
| trying to nudge you yet another, before you know it, the vast
| vast majority of everyone's "followers" is bots bots bots, and
| there was no grand plan to produce that result. It just so
| happens that Twitter's ancient decision to be dedicated to
| short-form content, with no particular real-world connection to
| the conversation participants, where everyone is isolated on
| their own feed (even if that is shared in some ways) made it
| the first place where this could happen. Things with real-world
| connections, things where everyone is in the same "area" like
| an HN conversation, and long-form content will all be three
| things that will be harder for AIs to manipulate. Twitter is
| like the agar dish for this sort of thing, by its structure.)
| thesuitonym wrote:
| > (Though we have caught them failing to disseminate posts in
| what seems to be a distinctly slanted manner.)
|
| I haven't seen this, but I'd be interested in reading about
| it, if you have a link!
| ftkftk wrote:
| I agree - I don't believe that there is a grand master plan
| of a conspiratorial or other nature. I think it is simply, as
| you stated, a co-evolution of independent actors.
| rchaud wrote:
| > Want to become rich? Make a search engine which indexes the
| fresh relevant data from the big siloed websites, and ignores
| the general dead Internet.
|
| That would be a great service, but it certainly wouldn't make
| you rich. Where's the money going to come from? Google got rich
| because they acquired an ads platform (DoubleClick) and an
| analytics platform (Urchin) and started monetizing the vast
| amounts of data they had. That was years after Google had
| established goodwill as the best search engine.
| big_blind wrote:
| I use beta search engines. On kagi.com and you.com you can
| preference and filter top sites. There's also no advertising
| on either. I've just stopped using Google altogether and it's
| improved search so much.
| simion314 wrote:
| This is not true, maybe for a subset of Internet users.
|
| For example you have Wikis and forums. Wikis are good for
| communities that are passionate about a topic and they
| collaborate on building content for their passion. Reddit is a
| valid alternative to forums but if the community is older and
| has members that are technically competent then they usually have
| the forum customized for their purpose and the forum will
| continue to exist, especially if you want to avoid some third
| party censorship.
|
| I never ever search for something and find answers on
| Facebook; sometimes, very rarely, I find something that points to
| Instagram blogs/posts but never Facebook.
|
| Probably depends on your location and what you search for, so
| it might be possible that 99% of your Internet consumption is
| satisfied by 5-10 websites.
| Hnrobert42 wrote:
| As you describe this, it makes me think about how populations
| tend to migrate to cities and away from rural areas. There's
| even a parallel to white flight in the emerging popularity of
| the chan/gab fora.
| hombre_fatal wrote:
| I don't get how TFA shows evidence of the Dead Internet Theory
| just because their site manages to attract ~zero users.
|
| Just host a <form><textarea><button></form> at an IP address
| and notice it's just spambots submitting it with backlinks, not
| actual users. Doesn't mean the internet is dead nor that the
| indieweb is dead.
|
| It doesn't really show anything other than the only people able
| to extract value from your creation are the spammers.
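|
| A crude filter for the backlink submissions described above might
| look like this. It is a sketch under stated assumptions: the
| thresholds and the name looks_like_backlink_spam are made up for
| illustration, and real spam filtering needs far more than link
| counting.

```python
import re

# Matches http:// or https:// followed by non-space characters.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def looks_like_backlink_spam(text, max_links=1):
    """Flag text that carries several links or is made up mostly of links."""
    links = URL_RE.findall(text)
    words = text.split()
    if len(links) > max_links:
        return True
    # A single link is tolerated unless it is at least half the message.
    return bool(links) and len(links) * 2 >= len(words)


if __name__ == "__main__":
    samples = [
        "Great post! http://a.example http://b.example",
        "How do I change a headlight bulb?",
        "See https://example.org/docs for the manual, section three",
    ]
    for sample in samples:
        print(looks_like_backlink_spam(sample), sample)
```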
| jspaetzel wrote:
| This is so incredibly false, I've been working on a project for
| the last six months and MoM I've seen steady increase in usage.
| Tbh much much higher usage than I expected. Most users find my
| site via Google or Facebook however they are looking for
| content that's not in those silos and have no problems leaving
| them.
|
| If you have high quality content and you get it indexed
| properly by Google, users will come.
|
| There are reasons users are not using your website.
|
| 1. It's not solving a problem people have.
|
| 2. Users can't find it.
|
| Who, in their right mind searches for search engines? Nobody I
| know.
|
| If you want users you have to go out and get them (literally
| pound the pavement and talk to people) or create a LOT more
| content ironically, so they can find your site on the search
| engines they are using today.
| black_puppydog wrote:
| These discussions always make me recall Jacob Appelbaum. Think
| of him what you want, but this statement of his really stuck
| with me at the time. Paraphrasing:
|
| The real dark-net is facebook. Everything that goes in there
| never comes out again and is basically invisible to the world,
| except if you join facebook yourself.
|
| My own prime example of that used to be pinterest: it seems to
| be a 100% sink in the directed graph of internet links. But
| since Appelbaum stated this, instagram (also facebook of
| course) is trying hard to push pinterest off that particular
| throne.
| LegitShady wrote:
| to me this is also discord - which seems to have become the
| chosen alternative to online forums for many communities and
| basically hides what used to be the public face of those
| communities.
| samatman wrote:
| boplicity wrote:
| > No-one visits websites anymore, everyone has moved to the 10
| biggest websites and all data is now siloed there.
|
| Really? We make our living running a small web based
| publication; around 40k readers a month. I know of many other
| sites like this. Google, and other search engines, depends on
| niche websites to provide quality search results. Without sites
| like ours, the internet would truly be dead, and search would
| be mostly useless. Our "traffic sources" come from a mix of
| Facebook, Search, Reddit, etc, in addition to our many loyal
| readers.
|
| Others in our niche are producing blog spam, which looks nearly
| identical to people who aren't experts in the field, but we
| have real experts, fact checkers, etc, as part of our
| production process. This is a big problem: These low quality
| websites get similar rankings to our own, which does make it
| much harder for people to get quality information via search.
| (Hence the general shift towards trusting social
| recommendations, such as from Reddit.)
|
| In short, the WWW is alive and well, it's just buried under a
| bunch of #$#$%.
| rchaud wrote:
| > Our "traffic sources" come from a mix of Facebook, Search,
| Reddit, etc, in addition to our many loyal readers.
|
| 40k/mo is a pretty good number for an independent website. As
| a word of warning though, relying on social media reach is a
| dangerous game, as there is anecdotal evidence that tweets
| with outbound links don't get as many impressions as those
| that link to in-site content, like another Twitter post.
|
| As for Facebook, well, there's a good comic from The Oatmeal
| (enormously popular on FB back in 2010) that talks about what
| happened in the long run:
|
| https://twitter.com/Oatmeal/status/923250055540219904
| Cthulhu_ wrote:
| I don't believe the WWW internet is dead; there's still
| millions of webpages being made and published every day.
| However, the traffic numbers are skewed in favor of the big
| socials and aggregators; I wouldn't be surprised if the 80/20
| rule applies there.
| pnutjam wrote:
| There seems to be a tendency towards video that undercuts the
| "old internet". I prefer instructions in a text or list
| format, but that's almost impossible to find for things like,
| changing the headlight bulb on my traverse.
|
| 1. turn the wheel so it is pointed hard in the direction of
| the bulb you are changing.
|
| 2. remove the hex screws from the shroud in the wheel well
|
| 3. pull the shroud down, it's pretty flexible plastic.
|
| 4. reach up and change the bulb. The wires are a bit short so
| you might need to get both hands in there. I have big hands
| and I'm able to do it.
|
| ---- There are innumerable videos explaining this process,
| but very few text directions.
| ElevenLathe wrote:
| I think this is actually because real, fluent literacy is
| still rare even in highly developed places. It may be
| easier for a very literate someone to dash off those
| instructions but most people are 1000x more comfortable
| making a little video. Same goes for reading vs watching
| the video.
|
| This is my same theory about meetings being universally
| preferred to asynchronous email, even when literally all
| the questions someone asks at a meeting have already been
| answered in my long form email.
|
| Most people, even if they can read, are not really
| comfortable with it. Doubly so for writing. There used to
| be no choice to function in society, but increasingly we
| can use technology to substitute for reading and writing
| effectively, so people do.
| pnutjam wrote:
| You're probably right, it's just so frustrating.
|
| I think I'm going to start compiling stuff like this in
| my git repo.
| Jiro wrote:
| Even something like that flounders on the question "these
| instructions say to pull down the shroud, what is a
| shroud?" or "I can't find those hex screws, where are they
| located?" Repairs are inherently visual, although text with
| illustrations might work.
| soheil wrote:
| To a fish the world is made of water and there can't possibly
| be anything else worthwhile. This is more indicative of how you
| spend your time online vs reality.
| heavyset_go wrote:
| I was once on this bandwagon, but I think it was just
| confirmation bias reflecting the way _I_ used the internet at
| the time. The non-siloed internet is bigger than the pre-siloed
| internet ever was.
| omoikane wrote:
| I think the Dead Internet Theory bit is just a bait to get more
| comments. It's a bit of a stretch to conclude that the internet
| is mostly robots just because one website sees mostly robots.
| This extrapolation would be convincing if that one website is a
| high ranking website that sees a lot of traffic, but
| searchmysite.net does not appear to be one of the top websites.
| DebtDeflation wrote:
| Unfortunately, correct. The average Internet user accesses it
| via a phone, not a desktop, laptop, or even tablet these days.
| Most of that access is through apps, not a browser. To the
| extent that a user is looking for a factoid answer and does a
| search, a Google Knowledge Graph result with a Wikipedia link
| is probably enough in most cases. If they want a technical
| question answered, Stack Exchange; a product review, Reddit;
| nearby restaurants with reviews, Google Maps; etc.
| stackbutterflow wrote:
| I think you're generalizing your own behavior. I regularly use
| google to search for topics that cross my mind and I end up on
| many websites that are not one of the giants in your list. It's a
| fun activity. If people stick to the same 10 websites that's on
| them. Nothing prevents you from exploring the web.
| MockObject wrote:
| > Nothing prevents you from exploring the web.
|
| What prevents you from exploring the web is you can't find
| but the same 10 sites through search engines.
| jrussbowman wrote:
| "Want to become rich? Make a search engine which indexes the
| fresh relevant data from the big siloed websites, and ignores
| the general dead Internet."
|
| Did that to some degree. Unscatter.com pulls from reddit and
| twitter to source links.
|
| I found reddit only created an echo chamber bubble of obvious
| bias and twitter only diluted it a little.
| CTDOCodebases wrote:
| People are doing this already. You just have to include the
| site name in the search on google e.g reddit. Search on these
| platforms is often broken.
| freeone3000 wrote:
| Well, the first two links loaded for a search for "magic the
| gathering" are 404s. The "Random" link at the bottom 403s. The
| search engine feels broken.
| assemblylang wrote:
| There are still ways to prod out good content from the SEO spam
| on search engines. I wrote a google search front end that does
| this [0], using search operators to remove some common SEO spam.
|
| [0] https://sayno2seo.com
| hammock wrote:
| Makes me wonder whether Google tolerates bots on its search
| engine, to boost its ads revenue.
|
| See also Twitter's extraordinary claim that 5% or less of its
| users are bots (or a claim from Twitter's detractors that up to
| 90% of its DAU are bots)
| exyi wrote:
| I don't think it does, I get a ~~middle finger~~ recaptcha
| every time I try google something
| iamjbn wrote:
| Adding to the list I have been building for very long --
| "Becoming irrelevant, Google Search" -- here:
| https://docs.google.com/document/d/1cSMY5wXSKhJdMxeJEvTUJ21e...
| iamwil wrote:
| To the OP of the article, this is great. I had just never known
| about it, to use it for searching.
|
| Usually quality blog posts on specific technical topics are just
| things I run across through HN, lobsters, or twitter. Now it's
| one more channel to look for things that I'm specifically
| researching, like CRDTs. Kudos!
| ColinHayhurst wrote:
| Mojeek member here. We have always had a high level of spam bots;
| as any search engine/service will have. It's a constant battle to
| fend off new bots; folks can always try out our API rather
| than freeloading, and some do. Many obviously do not. We are
| taking a look at whether things have also changed for us since
| mid-April 2022.
| ColinHayhurst wrote:
| Some evidence of an uptick here too. Historically it has been
| ~80%. 6 days ago we had to block 92%. Yesterday we blocked
| around three times that number of bot searches.
|
| edit: the three times spike yesterday was one particular new
| attacker; general recent rise holds.
| alphabet9000 wrote:
| i recently built a habitat for spam bots, they eventually found
| it and now post peacefully
|
| https://upstairs.treehouse.telnet.asia/pharm/cylohexapine
| TremendousJudge wrote:
| It's beautiful to see nature healing
| tbm57 wrote:
| Maybe someone should start an internet rewilding project
| getcrunk wrote:
| this is the best thing I have ever seen. It's art, engineering,
| biology and sociology. Do you write blog posts about it?
| mcv wrote:
| If you're trying to boost your user numbers, I'm in. Results on
| topics I search for are very sparse, but it's all content I
| hadn't seen before, which is great.
|
| Sounds like your search engine is not suitable as a replacement
| for more traditional search engines, but it might complement them
| very well. I'll give it a try.
|
| As for the SEO bots: can't you simply block those?
| egberts1 wrote:
| Error codes. Open source that reports in mysterious error codes.
|
| Used to be able to Google for those; now, not so much.
| 0xbadcafebee wrote:
| People will only use your product if they know about it and
| perceive value in it. How do people know about it, and why would
| they want to use it?
|
| On _"Most of the tiny number of real users have come from links
| posted to places like Hacker News, and there is almost no organic
| traffic from other search engines"_ - Organic traffic comes from
| word of mouth. Are people talking about your site? If they're
| not, you're not gonna see organic traffic. You could do what
| others do and pay some influencers to advertise your site, but
| that's expensive and not as scalable as "real" buzz. Is your
| product exciting or controversial? If not, why would people talk
| about it?
|
| Your homepage's tag line is _"Open source search engine and
| search as a service for personal and independent websites."_ A
| regular person's eyes would glaze as they try to figure out what
| this means. Given some time they might put together the words
| "search engine" and "personal" and "websites" and figure this is
| a blog search engine. So just say that.
|
| The "Newest Pages" section is a fun novelty, but after a few
| minutes the novelty wears off.
|
| The "Browse Sites" section is _almost_ useful. Next to the list
| of sites I see some tags. Why isn't a heatmap of the tags the
| first thing I see? That would be way more useful than a paginated
| list of random sites.
|
| Your "About" page lists _"community-based approach to content
| curation"_. This is the most exciting aspect of the whole
| endeavor, so add that to your front page blurb ("Community search
| engine"). You would probably do well to build a real community
| around it, for example with a forum or chat system (GitHub
| Discussions does not count). A SubReddit would be an easy way to
| bootstrap this and later move it to your own hosted forum.
|
| You'll probably need a very complicated moderation system if this
| thing takes off.
| unnouinceput wrote:
| Plot twist: His website/search engine/blog is written by a bot
| and not a real person behind.
| albatrosstrophy wrote:
| On a tangential note, I remember a time when Google had the option
| to search only for 'discussions'. The results were amazing and
| accurate as it scoured online forums. Almost all issues I had (was
| following the rooting scene closely back then) were quickly
| resolved. Then suddenly it got removed for reasons unknown to me.
| Does anyone know if it's replicable today?
| sodality2 wrote:
| Brave Search does have a discussion search section.
| blackhaz wrote:
| Sometimes adding "reddit" to a search query produces fantastic
| results.
| jrussbowman wrote:
| I do this all the time
| tunap wrote:
| I have had some success adding "forum", when looking for
| trade discussions; eg: controls & automotive. With all the
| walled silos on the net, this is much less useful with every
| passing day. On the bright side, I don't have to use -twitter
| & -facebook, so there's that.
| throwaway27727 wrote:
| This is great but it seems reddit has done something to mess
| with their date reporting. When looking for recent posts, I
| might see a result on Google that says it was posted in the
| last few days, but on clicking the result will actually be
| from years ago.
| asddubs wrote:
| might also be google. I've noticed inaccurate dates that
| don't appear anywhere for some of my pages. my only theory
| as to why these were displayed is that google interpreted a
| (server side) randomly generated number in an inline script
| as a timestamp (but i can't know for sure that's what
| happened)
| oefrha wrote:
| Messed up dates, plus irrelevant topics showing up because
| there are matched snippets in "more posts from...".
| SirAiedail wrote:
| I use "site:reddit.com" to fully restrict to that. You can
| even filter by subreddit that way.
|
| Works well with HN and other sites, too.
| matheusmoreira wrote:
| Not sure for how much longer this is going to work. Plenty of
| marketers make fake posts there in grassroots campaigns.
| Reddit itself is an advertising company.
|
| God I hope they never find out about this site.
| f0xJtpvHYTVQ88B wrote:
| Brave Search recently implemented "discussions". From what I've
| seen it is mostly Reddit results but StackExchange also can
| appear there.
|
| https://searchengineland.com/brave-search-discussions-383706
| Cthulhu_ wrote:
| I have a suspicion they removed it because of the amount of
| spam on those forums. There's tons of abandoned forums that are
| only occupied by spambots.
|
| There's even pretty convincing looking accounts and messages
| that turn out to be spam in the end, once they start trying to
| post links.
|
| I have Akismet on the comment section of the Wordpress front-
| end of the site I run, it basically said something like 99.99%
| of attempted comments were spam. I'm sure the same applies to
| e-mail and the like.
| matsemann wrote:
| Reminds me of those "fake forums" I sometimes see when
| exhausting google's results. Found a screenshot of the
| concept here:
| https://www.reddit.com/r/Scams/comments/jxtr1k/but_it_requir...
| 6510 wrote:
| Everyone is a spammer according to Akismet. I wouldn't be
| surprised if 99% of that 99.9999% is false positives.
|
| You could start a website for people you don't like, flag all
| the comments as spam and they won't be allowed to post
| anything elsewhere - forever!
| efreak wrote:
| That percentage sounds about right to me. I've seen
| comments on blogs from ~10-15 years ago, that continue to
| have spam posted to them. The first 2-3 comments will be
| relevant, but comments 50-100 may have a single relevant
| comment among them, with a total of anywhere from 300-3000
| comments. Older comments link mainly to blogs
| (*.WordPress.com) and such, while newer comments link to
| Facebook and Instagram.
| arbuge wrote:
| It is my experience that SEO bots are increasingly ignoring
| robots.txt entries disallowing them from crawling our sites. Last
| week we noticed several doing this. I don't mind naming names -
| semrush, something called grapeshot crawler, something else
| called blex bot, and moz dotbot. Anyone else having the same
| experience?
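|
| For context, robots.txt is purely advisory: a well-behaved crawler
| checks it before fetching, and nothing stops a crawler that skips
| the check. A minimal sketch of that check using Python's standard
| urllib.robotparser follows. The robots.txt content, the user-agent
| tokens and the example.com URLs are illustrative assumptions, not
| taken from the commenter's site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the user-agent tokens are assumed for illustration.
ROBOTS_TXT = """\
User-agent: SemrushBot
Disallow: /

User-agent: dotbot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A compliant crawler would run this check before every request.
print(is_allowed(ROBOTS_TXT, "SemrushBot", "https://example.com/page"))
print(is_allowed(ROBOTS_TXT, "SomeBrowser", "https://example.com/page"))
```

| Because the check runs entirely on the crawler's side, blocking a
| crawler that ignores it has to happen on the server instead.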
| edenfed wrote:
| I'm currently building a search engine made specifically for
| developers. We are searching directly in
| GitHub/StackOverflow/Reddit so SEO is not a problem. You are
| welcome to try it at https://keyval.dev
| mcovalt wrote:
| I noticed this on https://hndex.org. So many searches for hair
| loss products. Like thousands... daily.
| ajnin wrote:
| This made me curious to try that search engine so I typed
| "electronic music box" (first thing that came to mind). As far as
| I can tell none of the 10+ pages of results include all those 3
| words. I mean, you might not have any relevant sites in your
| database (likely if there are only 1000 sites or so as another of
| your blog posts implies), and I understand you want to show _some_
| result to the user, but if I want irrelevant links I might as
| well go to google.com...
| thehodge wrote:
| Yeah same, I searched for Leeds grand theatre and the top
| result is something titled "June 2012 - Sam's Blog" which just
| mentions the word grand.
| lubesGordi wrote:
| What the heck is an 'electronic music box'? I personally
| wouldn't expect those three words to show up on any sites
| served by a small search engine.
| nspattak wrote:
| This is an awesome website that I was not aware of!
| mlatu wrote:
| and there you have it: nobody uses it because nobody knows of
| it.
|
| of course for a bot it is easy to remember your site, it's just
| another url in a long list of others... but what does a human
| do? they go to their fav search site, be it duck duck go,
| google or bing... perhaps even yahoo.
|
| i remember when google just started out, back then you would
| have used askjeeves, altavista or yahoo... google was really
| good compared to those... and the name was new, kinda
| orthogonal to existing search engines (except yahoo perhaps)
| and perhaps the most important bit: the site was "clean" except
| for the searchbar, there was nothing distracting there. you
| opened it and knew it is for looking up stuff
|
| now, to join in, this late in the game? difficult. difficult.
|
| maybe it would be easier if it specialized for some niche? idk.
|
| dear OP: i'll try to remember your searchengine, but i can't
| promise to become a regular
| jacquesm wrote:
| One day we'll have an internet for humans exclusively. On another
| note, with 160K requests / day from bots you could of course
| simply block the bots structurally assuming they are nice enough
| to identify themselves. Block all of AWS and Google, Russia,
| China, NK and a couple of other bot hot spots and the service may
| well become more successful for regular users because they get
| faster results. Bots can afford to wait, humans are often
| impatient. And with 2 hits / second by bots that may well become
| a factor.
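|
| The "2 hits / second" figure follows from the daily total if the
| requests are assumed to be spread evenly over the day (real traffic
| is usually burstier). A quick check of the arithmetic:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_second(requests_per_day: float) -> float:
    """Average request rate, assuming an even spread over 24 hours."""
    return requests_per_day / SECONDS_PER_DAY


rate = requests_per_second(160_000)
print(f"{rate:.2f} requests/second")  # prints "1.85 requests/second"
```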
| netsharc wrote:
| I wonder how that could be accomplished. Maybe they'll build a
| brain interface to replace the "I'm not a robot" captchas/add a
| TPM chip to the brain.
|
| And then the spammers will start selling tools to fake the
| responses. Or pay Filipinos a few cents a month to have the
| chip implanted to their brains...
| jacquesm wrote:
| Well, we can do it with the roads, I'm pretty sure if the
| incentives are right we can come up with a way to do it
| online. As long as we have not passed the Turing test ;)
|
| The current web seems to favor machines talking to machines
| and that is definitely not how it was intended.
| Nextgrid wrote:
| > Or pay Filipinos a few cents a month to have the chip
| implanted to their brains...
|
| That's the problem with blocking _bots_ as opposed to
| malicious behavior. Bot blocking is actually trivial and very
| cheap to bypass as long as you can buy slave-like labor for
| peanuts.
|
| Ideally you'd want to block malicious _behavior_ (when it
| comes to SEO spam, downrank anything for-profit such as ads,
| analytics, affiliate links, etc) instead to remove the
| incentives for spamming, regardless of whether it's a bot or
| human.
|
| In this case the only problem is that this search engine
| gives away resources (search queries) for free and then
| complains that people (in this case spammers) are taking it.
| It's not really a _spam_ problem - they'd complain equally
| well if they had some _legitimate_ user that happened to need
| tons of search queries to achieve their task.
|
| The only solution here is to start charging for stuff that
| costs money, and then it doesn't really matter who is on the
| other side, as long as they pay the bill.
| samatman wrote:
| It's a principal-agent problem. Websites want to be paid
| for their content, rent ad space, advertisers want users to
| see ads, users want to find content.
|
| The agent in the middle fucking over all three principals
| is hmm. Metaalphabetic, let's say.
| m-i-l wrote:
| This isn't indexing by search engine spiders, which are usually
| fairly benign and easy to identify with user agent etc. This is
| searches for "scraping footprints" executed en masse by "SEO
| proxy farms", which are designed to be very difficult to detect
| (e.g. originating from globally distributed residential IPs,
| quite possibly ordinary home users' machines which have been
| compromised). The main giveaway that something is a "scraping
| footprint" is the long search query which includes text that
| would appear on a template, e.g. ""This website is proudly
| using the open source classifieds software OSClass" rega
| turntables", for someone looking for OSClass-powered pages they
| could "search engine optimise" for the query "rega turntables".
| thesuitonym wrote:
| That's funny to me, because if I'm searching for something
| that would have been around between 2004-2012, I'll often
| append "Powered by phpBB" (or other software) to find posts
| about it on forums.
| pjmlp wrote:
| And the cycle will reboot itself again.
|
| The silos we have nowadays were there before the Internet took
| off, on BBS, Compuserve, Geocities, ....
|
| Apparently the majority of regular humans likes to have
| centralized providers they can reach out to, instead of the
| freedom of decentralized content.
| jacquesm wrote:
| Yes, that's true. Bots tend to follow the money.
| xmodem wrote:
| > Block all of AWS and Google,
|
| Google for "residential proxy". This is already a huge
| industry, and it's difficult to see how we haven't lost
| this war a long time ago.
| kmeisthax wrote:
| ...so you're going to write your own HTTP requests? Encrypt
| your traffic and validate certificates by hand? Toggle in each
| TCP header from a memory debugger?
|
| Most of the Internet is bots because humans don't actually
| generate HTTP traffic - they fire up a bot called a "browser"
| to do it for them. The challenge for anti-spam is to
| distinguish which bots are currently being directly controlled
| by humans and which ones are not-so-directly controlled by
| such. This isn't even a hard line; I've frequently hit Hacker
| News' bot detection just by upvoting a comment and then
| clicking reply too quickly.
| jacquesm wrote:
| I really don't understand your comment.
|
| Just so we don't have to argue about what constitutes a bot
| and what does not I propose we use this definition:
|
| https://en.wikipedia.org/wiki/Internet_bot
| calltrak wrote:
| [deleted]
| oefrha wrote:
| > I didn't notice at first because the web analytics only shows
| real users, and the unusual activity could only be seen by
| looking at the server logs.
|
| Sounds like everyone blocking analytics (Plausible in this case),
| e.g. myself just now, is lumped in with spam bots.
|
| Of course, analytics blocking can't meaningfully swing the
| ~99.99% statistic.
| rhn_mk1 wrote:
| I would argue that yes, it can. If the only people who are
| interested in using the website are those who block analytics -
| and, given the demographic of a niche search engine, it doesn't
| sound entirely implausible - then there's no telling how the
| 99.99% splits into bots and nerds.
| oefrha wrote:
| Not every "nerd" uses a blocker. I know many who don't. Some
| want to support the sites they visit; some want to see the
| web as it is for most people; some say their mental filters
| are so well developed that ads don't bother them; etc.
| Xylakant wrote:
| You could guesstimate by checking the IP address - blocks
| assigned to residential users are likely humans, blocks
| assigned to cloud providers etc. likely bots.
| gnabgib wrote:
| This is far from true. Either via trojans, botnets, "crowd
| sourced vpns", or of course tor relays, residential IPs are
| a source of many bots. The overwhelming majority of spam
| sources (after you block a few data centers in NL).
| asddubs wrote:
| even if there's 99 people blocking analytics for every person
| who doesn't, the figure is still 99%
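|
| The arithmetic behind this can be sketched as follows, assuming
| the ~99.99% figure means roughly 0.01% of searches come from users
| visible to analytics, and assuming blockers search about as often
| as everyone else. The function name and inputs are illustrative.

```python
def bot_share(visible_real_share: float, blockers_per_visible_user: int) -> float:
    """Estimate the bot share of searches once analytics blockers are counted.

    visible_real_share: fraction of searches seen by analytics (real users).
    blockers_per_visible_user: assumed hidden real users per visible one.
    """
    real_share = visible_real_share * (1 + blockers_per_visible_user)
    return 1.0 - real_share


# 99 hidden users per visible one turns 0.01% real traffic into 1%.
print(f"{bot_share(0.0001, 99):.2%}")
```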
| scambier wrote:
| If you self-host Plausible, it's also possible to bundle the
| analytics package with the website, so that there isn't an
| "ad-blockable" lone request for the .js file.
|
| https://github.com/plausible/plausible-tracker
| pluc wrote:
| Yeah there is. I surf with JS off because of people like you.
| varun_ch wrote:
| Most of the data you can collect with Plausible could just
| be collected server side instead, it's nothing like Google
| Analytics.
| netr0ute wrote:
| > Most of the data you can collect with Plausible could
| just be collected server side instead
|
| Then why not just use that instead?
| tylergetsay wrote:
| SPAs & marketing teams are used to snippets
| scambier wrote:
| Also notice how I said "analytics package" and not
| "tracking" in my comment, because there is no tracking. I
| mean, unless you're the only visitor from a specific
| country, there is literally 0 identifying data in
| Plausible.
| netr0ute wrote:
| Analytics is still unnecessary JS and a bandwidth hog, so
| it has to go.
| folkrav wrote:
| https://plausible.io/privacy-focused-web-analytics
|
| You surf with JS off because of sites abusing their users'
| data. This is not it.
| [deleted]
| 34679 wrote:
| Collecting data that a user doesn't want collected is
| abuse. It doesn't matter what you do with it.
| folkrav wrote:
| Oof. Hard disagree on that one, way too black & white of
| a position for me in the face of such a broad concept as
| "data".
| inetknght wrote:
| > _You surf with JS off because of sites abusing their
| users' data. This is not it._
|
| Wrong. I surf with JS off because of sites that use JS to
| collect information about me.
|
| If it's available on the server, then sure that might be
| considered fair game. But using javascript (or any other
| client-side tool) to do what you _should_ instead do
| server-side _is_ abusing users (or their data).
|
| Putting analytics inline so it's "not ad-blocked by a url
| request" is absolutely disrespecting users and a perfect
| reason to turn off javascript.
| folkrav wrote:
| > Wrong. I surf with JS off because of sites that use JS
| to collect information about me.
|
| Plausible doesn't collect information about you, but the
| site's usage. Do you also object to physical stores
| putting up cameras?
|
| Here's their own instance, open to public.
|
| https://plausible.io/plausible.io
|
| > If it's available on the server, then sure that might
| be considered fair game. But using javascript (or any
| other client-side tool) to do what you should instead do
| server-side is abusing users (or their data).
|
| That's quite the affirmation. Is this fact or opinion?
| inetknght wrote:
| > _Plausible doesn't collect information about you, but
| the site's usage. Do you also object to physical stores
| putting up cameras?_
|
| The difference is that the cameras don't get attached to
| my physical body, don't have any ability to monitor my
| actions after I have left the presence of the physical
| store, and can't force me to take any physical item or
| action.
|
| Javascript, on the other hand, has the capability to
| become persistent, can monitor my computer's activity
| outside of your website, and can leave a lot (!) of
| additional data on my computer without my permission.
| MicahKV wrote:
| So spammers have latched onto your search engine because they are
| getting useful results. They are able to systematically discover
| websites built on certain platforms that allow users to post
| content containing links, which they can target for link spam. It
| is very difficult to fight this on a technical level because
| there is an entire industry built around blackhat SEO, with all
| kinds of software and services dedicated to thwarting your
| defensive efforts. Even Google struggles to keep up with this.
|
| However, they are also systematically feeding you their footprint
| lists. I imagine you could put together a footprint blacklist
| pretty quickly, and just stop returning results for any obvious
| spam queries like those containing "powered by wordpress".
|
| It's not a very elegant solution I'll admit. It won't stop the
| bots from trying, and you may have to circle back periodically to
| add new footprints as they surface. But it's a potentially quick
| and easy way to stop rewarding their efforts, and the blackhat
| world is pretty used to burning out their resources so hopefully
| they will figure out it's a dead end and move on.
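|
| A footprint blocklist of this kind can be sketched in a few lines.
| This is only an illustration: the two phrases below are the ones
| quoted in this thread, and the function names are made up here. A
| real list would have to be built from the site's own query logs.

```python
import re

# Footprint phrases quoted in this thread; a real list would be longer.
FOOTPRINTS = (
    "powered by wordpress",
    "proudly using the open source classifieds software osclass",
)


def normalize(query: str) -> str:
    """Lowercase, drop double quotes, and collapse runs of whitespace."""
    cleaned = query.lower().replace('"', " ")
    return re.sub(r"\s+", " ", cleaned).strip()


def is_footprint_query(query: str) -> bool:
    """Return True if the query contains a known scraping footprint."""
    normalized = normalize(query)
    return any(footprint in normalized for footprint in FOOTPRINTS)


query = ('"This website is proudly using the open source classifieds '
         'software OSClass" rega turntables')
print(is_footprint_query(query))              # footprint query
print(is_footprint_query("rega turntables"))  # ordinary query
```

| Plain substring matching also catches the occasional human who
| appends a phrase like "Powered by phpBB" to find forum posts, so
| the list trades a few legitimate searches for simplicity.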
| wolpoli wrote:
| Considering that as of Mar 12, this search engine only has 1001
| sites indexed, I am not sure how useful this site is for
| getting SEO backlinks. Speaking of which, are backlinks still a
| thing these days?
| pascalxus wrote:
| just to throw out ideas: What if he decided to charge for each
| search, say 1 cent or so? Users could purchase them in bulk,
| say 100 searches for $1.
|
| The world is getting more and more desperate for a better
| search engine. the day may come, when people are willing to pay
| for better results.
| marginalia_nu wrote:
| > So spammers have latched onto your search engine because they
| are getting useful results.
|
| I'm not sure about this. At least with my search engine, it
| doesn't really seem to matter what response they get, I don't
| even think they look at the responses. They keep hammering away
| with tens of thousands of queries per day even though they've
| seen nothing but HTTP Status 403 since last October or so.
|
| My best guess is they're going after search engines in general
| in case they forward queries to google, in order to manipulate
| their typeahead suggestions.
| miohtama wrote:
| Put a CloudFlare web application firewall at the front of the
| site and then use its rate limiting / CAPTCHA features to
| throttle traffic. It is the easiest way to get rid of
| parasitic scraping and API abuse. Cost is $0.
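|
| For readers unfamiliar with rate limiting, a token bucket is one
| common way to do it. The sketch below is a generic illustration and
| makes no claim about how CloudFlare implements its feature; the
| class name and parameters are assumptions. Time is passed in
| explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Allow short bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start with a full bucket
        self.last_seen = now

    def allow(self, now: float) -> bool:
        """Spend one token if available; otherwise reject the request."""
        elapsed = max(0.0, now - self.last_seen)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last_seen = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client: a burst of 2, then 1 request per second.
bucket = TokenBucket(capacity=2, refill_per_second=1.0)
results = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0, 1.0)]
print(results)  # [True, True, False, True, False]
```

| Keying the buckets by IP address is the obvious choice, although
| traffic spread over many residential IPs would weaken it.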
| MicahKV wrote:
| Huh, well I guess there goes my theory about the incentive.
| What a bummer. I would have thought that at least with search
| engine scraping, they would stop expending the effort once
| the results dried up.
| z3t4 wrote:
| Or put those query results behind an anti-bot/"captcha" test.
| Ikatza wrote:
| How about serving bots with one link per page, and taking a
| minute to serve each page? Would this impact their
| efficiency?
| tofuahdude wrote:
| Captcha breaking is SO easy these days; even the modern
| captchas are easy to defeat.
| MicahKV wrote:
| That would probably help, but it's also a continuation of the
| cat and mouse game. There are plenty of captcha breaking
| services out there, it only costs about $1 to programmatically
| solve 1000 captchas.
| sylware wrote:
| ... and there are the "click farms" with human beings.
| z3t4 wrote:
| If someone pays people to collect data you could outright
| sell the data to them.
| anselmschueler wrote:
| As I understand it, the main point of CAPTCHAs isn't to
| keep out bots completely, but to give enough friction to
| make automated attacks or uses infeasible, while keeping
| the friction low enough that normal users can still use it
| normally.
| noAnswer wrote:
| > There are plenty of captcha breaking services out there
|
| Give it a try and see what happens.
|
| People said greylisting against email spam wouldn't work,
| since spammers would just resend. It has worked for 20 years.
| To get your IP off the DNSBL NiX Spam you just have to
| follow a link. People said spammers would automate that
| process. Never happened in 19 years. Sometimes spammers are
| just lazy.
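|
| For reference, the core of greylisting is small: temporarily reject
| the first delivery attempt from an unknown (client IP, sender,
| recipient) triplet and accept it once the sender retries after a
| minimum delay. The sketch below assumes a 5-minute delay and leaves
| out entry expiry and whitelisting, which real implementations
| usually add. The class and parameter names are illustrative.

```python
class Greylist:
    """Accept mail only once a triplet has retried after a minimum delay."""

    def __init__(self, min_delay_seconds: float = 300.0):
        self.min_delay_seconds = min_delay_seconds
        self.first_seen = {}  # triplet -> time of first delivery attempt

    def check(self, client_ip: str, sender: str, recipient: str, now: float) -> bool:
        """Return True to accept, False to answer with a temporary failure."""
        triplet = (client_ip, sender, recipient)
        first = self.first_seen.setdefault(triplet, now)
        return now - first >= self.min_delay_seconds


greylist = Greylist(min_delay_seconds=300.0)
args = ("203.0.113.7", "a@example.org", "b@example.net")
print(greylist.check(*args, now=0.0))    # first attempt: rejected
print(greylist.check(*args, now=60.0))   # retried too soon: rejected
print(greylist.check(*args, now=400.0))  # retried after the delay: accepted
```

| A legitimate mail server queues the message and retries on its own,
| while a sender that never retries is filtered out.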
| minsc_and_boo wrote:
| Sure, but it increases friction that forces a re-eval of
| cost/benefit of the bot(s).
|
| Newest captcha services are a prediction score, not even a
| verification screen, and you can feed polluting data to
| bots you are certain to exist.
| Calavar wrote:
| Agreed. I suspect that this is an arbitrage game on the
| part of the SEO spammers. Each search is cheaper for them
| than it is for a competitor who's using a major search
| engine with more extensive anti-spammer protections, and
| that difference equals $$$. A captcha doesn't have to be
| an unbeatable solution. It just has to provide enough of
| a barrier to equalize the cost.
| MicahKV wrote:
| I'm not so sure about this. The spammer's goal is to build
| up as big a list of link spam targets as possible. If one
| spammer chooses to only scrape minor engines and another
| only major engines, the one scraping the major engines
| will probably come out on top despite the higher cost.
| Whoever is abusing OP's search engine is likely doing it
| to supplement the data they are already scraping from the
| major engines.
|
| For OP, I think simply not returning results at all is a
| more practical measure because it removes the reward
| completely. Captchas and bot detection keep the reward in
| play, while taking away the results entirely makes the
| entire pursuit futile.
| go_prodev wrote:
| Deliberately feeding the spam bots into an endless loop
| of captchas might slowly drain their accounts if they are
| paying 3rd party captcha farms.
| jfim wrote:
| It might be a better idea to return low quality results
| than nothing at all. The idea is that it's pretty obvious
| that the bot is banned when it receives no results at
| all. Having to look at the results manually to determine
| whether one is banned is a much more time consuming
| endeavor.
| MicahKV wrote:
| Well what I'm suggesting isn't about blocking the bots,
| it's about removing the incentive. So in this case, I
| think the more obvious it is the better. I would want
| them to realize as soon as possible that they are 100%
| wasting their time.
|
| If anything, it might be best to return a page that
| explicitly states "Sorry, this search engine no longer
| supports SEO footprint search queries."
|
| *edit for typo & wording
| bornfreddy wrote:
| On the other hand, making content difficult to parse is
| easy to do and a very strong weapon. Make them waste dev
| time... It is much easier to make variants of HTML than
| it is to parse it. You can even automate it to some
| degree.
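
One way to read the "variants of HTML" suggestion is to randomize the markup around each result so that a scraper's fixed selectors stop matching. The sketch below is only an illustration under that assumption; `render_results`, the tag choices, and the random class names are hypothetical and not taken from any search engine's code.

```python
import html
import random
from typing import List


def render_results(urls: List[str], rng: random.Random) -> str:
    """Render result links using a randomly chosen wrapper tag and class name."""
    tag = rng.choice(["div", "section", "article", "p"])
    # A throwaway class name that changes between renders.
    cls = "r" + "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(6))
    rows = [
        f'<{tag} class="{cls}"><a href="{html.escape(u, quote=True)}">'
        f"{html.escape(u)}</a></{tag}>"
        for u in urls
    ]
    return "\n".join(rows)


print(render_results(["https://example.org/?a=1&b=2"], random.Random()))
```

A scraper that simply extracts every link would not be affected by this, so it mainly raises the cost for parsers that depend on page structure.
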
| gopher_space wrote:
| > It is very difficult to fight this on a technical level
|
| It is when your base assumption is that you won't hire outside
| of engineering. There are more bored teenagers with phones than
| people creating quality content, so I'm not sure why you
| wouldn't just brute force checks against bad actors.
| pstuart wrote:
| If the confidence were high enough, perhaps return garbage data?
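
The suggestion above could be sketched as a simple threshold on a bot-confidence score. Everything here is an assumption made for illustration: the `respond` function, the 0.95 threshold, and the `example.invalid` decoy URLs are hypothetical, and how the score is produced is left out entirely.

```python
import random
from typing import List, Optional


def respond(results: List[str], bot_score: float, threshold: float = 0.95,
            rng: Optional[random.Random] = None) -> List[str]:
    """Return real results below the threshold, decoy URLs at or above it."""
    if bot_score < threshold:
        return list(results)
    rng = rng or random.Random()
    # Same number of entries as the real result list, but none of them
    # points at a real site, so the response is not obviously a block page.
    return [f"https://example.invalid/{rng.randrange(10**6)}" for _ in results]
```

Keeping the threshold high limits the chance that a real user is handed decoy results.
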
| _tom_ wrote:
| I think many people in the comments here, and most users, are
| missing that you index a SMALL subset of the web. This leads to
| people running a default test search, finding no results, and
| concluding your search engine is bad, and leaving.
|
| While you imply that in the search page, obviously it's not clear
| enough.
|
| Maybe add "this search engine only searches a small set of user
| submitted sites. Click <here> for the list. Or <here> to add your
| site."
| AdamN wrote:
| IMHO what you should try is excluding all sites with excessive
| third-party cookies, sluggish performance, and too many ads. That
| will slice the index down by 80% probably but it would be a
| really nice thing to see. It might push out low quality SEO
| results for a couple of years.
| guerrilla wrote:
| This is the solution. Google and DuckDuckGo should be doing
| this too (and make exceptions if they need to so that they
| don't collapse). We have to incentivize the good behavior and
| create an environment where people actually compete on the
| properties we want and not horseshit.
| stuff4ben wrote:
| Just posting here that I'm real and I'm glad I found
| searchmysite! After HN, Verge, Ars, Gizmodo, and some car forum,
| I struggle to find content I want to read. Hopefully this will
| allow me to continue to find something I can read as I work on
| solving problems at work. I find distractions help me to refocus
| in an odd way.
| marginalia_nu wrote:
| I had to put my search engine behind Cloudflare to deal with
| this. Like the volume grew to about 10x the traffic I saw sitting
| at the front page of Hacker News for a full week.
| marginalia_nu wrote:
| This is the rate of rejected HTTP requests I'm seeing at this
| point: https://www.marginalia.nu/junk/spam.png
|
| Real search traffic is about half that.
| m-i-l wrote:
| Thanks V. I'm seeing a similar number of problem search
| requests (although nowhere near as many real search
| requests:-), so it is probably the same "SEO practitioners"
| running the same "scraping footprints" against different
| search engines around the same time.
|
| I was kind-of hoping that somewhere in this discussion there
| would be an "And the answer to your problem is...", but I
| suppose it is a very specific problem which only a search
| engine would encounter. I think the Cloudflare solution you
| have is probably the best to block the requests as early as
| possible. The reverse proxy config[0] I've got seems to be
| mostly holding out for now though.
|
| [0]
| https://github.com/searchmysite/searchmysite.net/issues/55
| marginalia_nu wrote:
| If they're from the same outfit I've had problems with I
| really am at a loss as to what, other than Cloudflare, is a
| good solution. I got like 4-5 requests per second at worst.
| Seems to be a botnet, I entered a few of the source IPs
| into my browser and got like login screens to enterprise
| routers and so on.
| searchableguy wrote:
| Not surprised. I see many startups with Head of SEO (Search
| engine optimization) with huge salaries nowadays.
| evanmoran wrote:
| Has anyone seen this bot growth with online newsletter signups?
| I've noticed a steady increase in signups but without any
| equivalent marketing or product buzz that might account for it.
| jrussbowman wrote:
| It's been the same for unscatter.com for years but I've always
| attributed that to me not having a real marketing strategy or
| even sticking with the ones I've tried to start.
___________________________________________________________________
(page generated 2022-05-16 23:00 UTC)