[HN Gopher] Crawler Hints supports Microsoft's IndexNow in helping users find new content
       ___________________________________________________________________
        
       Crawler Hints supports Microsoft's IndexNow in helping users find
       new content
        
       Author : jgrahamc
       Score  : 37 points
        Date   : 2022-08-26 13:02 UTC (1 day ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | franze wrote:
        | For Google there is a workaround: update the sitemap.xml and
        | then ping that sitemap to Google.
        | 
        | I usually have a whole-sites-inventory-sitemap.xml which gets
        | updated once per day/week/x, and a limited update.rss (RSS is a
        | valid sitemap format) that gets pinged either in real time with
        | every update or on an if-changed, 5-minute interval.
        | 
        | I wrote a book about this and other distribution concepts.
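The ping workaround described above amounts to a single GET request. A minimal sketch in Python, where the sitemap URL at example.com is a placeholder:

```python
import urllib.parse

def google_sitemap_ping_url(sitemap_url: str) -> str:
    # Google's sitemap ping endpoint takes the sitemap's own URL as a
    # query parameter; fetching the returned URL performs the ping.
    return ("https://www.google.com/ping?sitemap="
            + urllib.parse.quote(sitemap_url, safe=""))

print(google_sitemap_ping_url("https://example.com/update.rss"))
# https://www.google.com/ping?sitemap=https%3A%2F%2Fexample.com%2Fupdate.rss
```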
        
       | richdougherty wrote:
       | Here's the IndexNow standard that CloudFlare Crawler Hints is
       | using:
       | 
       | https://www.indexnow.org/
       | 
       | The idea is that you can push a notification to a search engine
       | when content changes instead of waiting for the crawler to
       | notice.
       | 
       | https://<searchengine>/indexnow?url=url-changed&key=your-key
       | 
       | You can also submit more than one URL with a POST.
       | 
       | You can notify Bing at https://www.bing.com/indexnow?url=url-
       | changed&key=your-key
       | 
       | If you notify the IndexNow API endpoint it notifies Bing plus
       | other search engines on your behalf:
       | 
       | https://api.indexnow.org/indexnow?url=url-changed&key=your-k...
       | 
       | This announcement is about how CloudFlare can now do this
       | automatically for sites it hosts.
       | 
       | Some other hosts and CDNs support IndexNow, eg Akamai. See:
       | https://blogs.bing.com/webmaster/october-2021/IndexNow-Insta...
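As a rough sketch of the batch (POST) form described in the comment, following the JSON body documented at indexnow.org; the host, key, and URLs below are placeholders:

```python
import json

def indexnow_payload(host: str, key: str, urls: list[str]) -> str:
    # JSON body for a POST to https://api.indexnow.org/indexnow, sent
    # with a "Content-Type: application/json; charset=utf-8" header.
    return json.dumps({"host": host, "key": key, "urlList": urls})

body = indexnow_payload(
    "example.com",
    "your-key",
    ["https://example.com/page-1", "https://example.com/page-2"],
)
```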
        
         | speedgoose wrote:
          | So it's an alternative to sitemaps (documents that
          | incrementally list all the web pages of a website with their
          | last modification date-times), but with a push model instead
          | of a pull model?
         | 
         | https://www.sitemaps.org/
        
           | rob-olmos wrote:
            | Yes, but of course only for the search engines that support
            | it. E.g., Google is absent from the list, although it
            | already has a push/ping sitemap feature.
        
         | orf wrote:
         | Why would they not use the /.well-known/ prefix for the default
         | index now key?
         | 
         | The default being at the root seems... stupid.
        
           | richdougherty wrote:
            | Not defending the standard, but I guess since this is a
            | shared secret you don't want to put it at a well-known
            | location. There's a (slight) attack vector in an attacker
            | knowing the secret, since they can "launch" a crawl against
            | a site. Maybe they could get a crawler to access private
            | URLs or something?
           | 
           | Another interesting feature I saw in the standard is that you
           | can host keys in subdirectories too.
           | 
           | "the location of a key file determines the set of URLs that
           | can be included with this key. A key file located at
           | http://example.com/catalog/key12457EDd.txt can include any
           | URLs starting with http://example.com/catalog/ but cannot
           | include URLs starting with http://example.com/help/."
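The scoping rule quoted from the standard can be sketched as a simple prefix check, using the URLs from the quoted example:

```python
def key_covers(key_file_url: str, url: str) -> bool:
    # Per the quoted rule, a key file authorizes only URLs under the
    # directory it sits in.
    key_dir = key_file_url.rsplit("/", 1)[0] + "/"
    return url.startswith(key_dir)

key_covers("http://example.com/catalog/key12457EDd.txt",
           "http://example.com/catalog/item1")   # True
key_covers("http://example.com/catalog/key12457EDd.txt",
           "http://example.com/help/faq")        # False
```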
        
             | orf wrote:
             | This wouldn't be in a place like ".well-known/secret-key",
             | the key would still be part of the path. It's just a well
             | known prefix to put exactly this kind of thing.
        
       | dgivney wrote:
       | Sharing your key seems like the most 90s approach to system
       | design.
       | 
       | "Only you and the search engines should know the key.. so
       | obviously, we want you to host it in plain text, in the root
       | directory."
        
         | rstupek wrote:
          | The key doesn't appear to be a fixed value, so unless your
          | server allows directory listings it seems reasonably secure?
        
           | dgivney wrote:
           | I agree, in a 90s system design meeting - security through
           | obscurity is reasonably secure.
        
             | zhfliz wrote:
              | Without referring to this particular case, how is `/.well-
             | known/LsyrYyZGDMMPwS1lAUS7qXo7c81XLaxPeRrSZdSReFk5zPaJaD`
             | less secure than `/.well-known/key` requiring an
             | `Authorization:
             | LsyrYyZGDMMPwS1lAUS7qXo7c81XLaxPeRrSZdSReFk5zPaJaD` header?
        
       | kwerk wrote:
        | Is CommonCrawl one of the IndexNow recipients for this? If so,
        | it seems like a big win for the open web, making an open index
        | more efficient to hydrate.
        
       | taylorfinley wrote:
       | "We're also hopeful that Google, the world's largest search
       | engine and most wasteful user of Internet resources, will adopt
       | IndexNow or a similar standard and lower the burden of search
       | crawling on the planet." Pretty blunt language from CloudFlare!
        
       | cmroanirgo wrote:
       | Although I agree heartily with the idea of a push model for
       | search engines, I can't help but notice that it seems to provide
       | more centralisation to the search engines out there.
       | 
       | Here on HN we've been seeing posts of alternate search engines.
       | How will those small bespoke engines make use of IndexNow unless
       | the website participates?
       | 
        | The way I see IndexNow, I'll still get crawled relentlessly by
        | the bots I don't want crawling my site (because robots.txt
        | never seems to apply to _them_ unless there's a special listing
        | explicitly for them).
        | 
        | So, unless a crawler is a participating search engine, a
        | website will still be getting crawled regardless, and the
        | problem isn't alleviated.
       | 
       | A good compromise would be something like an RSS feed, which a
       | site can publish, and crawlers can hit for updated changes. It
       | would also allow easier management for those domains that have
       | many moving parts: individual search engines can be pinged, but
       | the search engine just grabs the changes.xml file... Or
       | something.
        
         | [deleted]
        
         | rstupek wrote:
         | It looks like a search engine could get listed here:
         | https://www.indexnow.org/searchengines.json and any website
         | which implements IndexNow could utilize that list to know where
         | to publish?
         | 
          | There already is such an "RSS" feed: it's called a sitemap,
          | available at /sitemap.xml, or alternatively you can list its
          | URL in the robots.txt file.
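For reference, advertising a sitemap from robots.txt (as mentioned above) looks like this, with example.com as a placeholder:

```
User-agent: *
Sitemap: https://example.com/sitemap.xml
```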
        
       | pacifika wrote:
       | I'm concerned about the centralisation aspect of this (quick
       | reading, could be wrong) which makes it harder for innovation to
       | happen in the search industry.
        
         | rob-olmos wrote:
         | IndexNow claims to distribute the pings to other participating
         | search engines[1], and to participate they need to have
         | "noticeable presence in at least one market", so yea still
         | seems to be somewhat exclusive?
         | 
         | Eg, would Ahrefs or Semrush qualify to join the party?
         | 
         | 1: https://www.indexnow.org/searchengines
        
           | richbell wrote:
           | There is a noticeable lack of information about how to
           | enroll. I suspect it's more of a "if you need to ask you
           | aren't big enough" type deal.
        
             | marginalia_nu wrote:
             | Or maybe just ask?
             | 
              | I've had staggering success with just sending emails to
              | people, including businesses.
        
           | wumpus wrote:
           | Most search engine people don't call ahrefs or semrush search
           | engines -- they're SEO tools.
           | 
           | Having to have existing market share as a search engine is
           | going to really limit the number of participants.
        
       | metadat wrote:
       | This is a novel concept. I wonder if / when Marginalia will get
       | onboard and implement support for it, too.
       | 
        | It would be cool to be able to push the update signal to a
        | bunch of search engines when I publish a new page (even if all
        | of my websites get virtually no traffic and don't even come up
        | for the appropriate, highly unique keyword combos in Google,
        | Bing, or Marginalia; they don't have any ads or anything
        | terrible either, though perhaps they're terribly boring or
        | SEO unoptimized, haha).
       | 
       | I wonder if there could be a market for something which collects
       | website change information and offers an API to query for new
       | pages / updated pages over the past X time interval across Y set
       | of [interesting] websites. I could see this being useful, sort of
       | like RSS but 100% general purpose.
       | 
       | I'd call it something like Invertdex.
        
         | marginalia_nu wrote:
         | > I wonder if / when Marginalia will get onboard and implement
         | support for it, too.
         | 
         | I created an issue for it as a reminder. Probably not gonna
         | implement it in the short term, because a lot of my model is
         | based on sort of full-site crawls at 8 week intervals. While
         | something like this would help with identifying new content, it
         | doesn't do much to identify when links go dead, so you still
         | need to crawl passively.
         | 
          | Although, on some level, I'm a bit uneasy; I have a hunch
          | this may be a bit of a vulnerability. In general, giving
          | websites tools to control the crawling process beyond
          | robots.txt and the like seems a bit sketch. Maybe it's
          | possible to build checks and balances to prevent that,
          | though.
         | 
         | Would take some work to support real-time updates. Not
         | impossible, but I've got a lot of work to do with regard to
         | search result accuracy that I feel is more important.
         | 
          | I've looked at using RSS feeds before as a signal for when to
          | re-index sites. I'm not doing that now, but I think the
          | general idea works.
        
       | pingiun wrote:
        
         | richbell wrote:
         | > Eschew flamebait. Avoid unrelated controversies, generic
         | tangents, and internet tropes.
         | 
         | https://news.ycombinator.com/newsguidelines.html
        
         | [deleted]
        
         | BonoboIO wrote:
         | Never heard of kiwifarms before... what a cesspool.
         | 
          | Cloudflare also hosts/protects a number of Austrian and
          | German right-wing and covid-hoax "news sites".
          | 
          | I contacted them, but they don't care.
          | 
          | It's a fine line between freedom of speech and censorship,
          | but these websites are just absolute garbage! Spreading hate,
          | racism, and hoaxes, and helping fascists take power.
        
       ___________________________________________________________________
       (page generated 2022-08-27 23:01 UTC)