[HN Gopher] Crawler Hints supports Microsoft's IndexNow in helpi...
___________________________________________________________________
Crawler Hints supports Microsoft's IndexNow in helping users find
new content
Author : jgrahamc
Score : 37 points
Date : 2022-08-26 13:02 UTC (1 day ago)
(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)
| franze wrote:
| For Google there is a workaround: update the sitemap.xml and
| then ping that sitemap to Google.
|
| I usually have a whole-sites-inventory-sitemap.xml, which gets
| updated once per day/week/x, and a limited update.rss (RSS is a
| valid sitemap format) that gets pinged either in real time with
| every update or on an if-changed, 5-minute interval.
|
| I wrote a book about it and other distribution concepts.
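|
| The ping workaround above can be sketched as follows. This is a
| minimal illustration, not a full client: the endpoint is the
| Google sitemap-ping pattern of the time (since deprecated), and
| the sitemap URL is a placeholder.

```python
# Minimal sketch of the "ping the sitemap" workaround: Google's
# (since-deprecated) ping endpoint accepted a plain GET with the
# sitemap URL as a query parameter.
from urllib.parse import urlencode

def sitemap_ping_url(sitemap_url: str) -> str:
    """Build the Google sitemap ping URL for a given sitemap."""
    return "https://www.google.com/ping?" + urlencode({"sitemap": sitemap_url})

# Ping the small, frequently-updated feed rather than the full
# site inventory:
ping = sitemap_ping_url("https://example.com/update.rss")
# urllib.request.urlopen(ping)  # send the actual GET in production
```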
| richdougherty wrote:
| Here's the IndexNow standard that CloudFlare Crawler Hints is
| using:
|
| https://www.indexnow.org/
|
| The idea is that you can push a notification to a search engine
| when content changes instead of waiting for the crawler to
| notice.
|
| https://<searchengine>/indexnow?url=url-changed&key=your-key
|
| You can also submit more than one URL with a POST.
|
| You can notify Bing at
| https://www.bing.com/indexnow?url=url-changed&key=your-key
|
| If you notify the IndexNow API endpoint it notifies Bing plus
| other search engines on your behalf:
|
| https://api.indexnow.org/indexnow?url=url-changed&key=your-k...
|
| This announcement is about how CloudFlare can now do this
| automatically for sites it hosts.
|
| Some other hosts and CDNs support IndexNow, eg Akamai. See:
| https://blogs.bing.com/webmaster/october-2021/IndexNow-Insta...
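|
| The two submission styles described above can be sketched like
| this. The key and host values are placeholders; the JSON field
| names (host, key, urlList) follow the IndexNow documentation.

```python
# Sketch of the two IndexNow submission styles: a GET with url/key
# query parameters, and a POST with a JSON body for many URLs.
import json
from urllib.parse import urlencode

ENDPOINT = "https://api.indexnow.org/indexnow"
KEY = "your-key"  # must match the key file hosted on your site

def single_url_ping(changed_url: str) -> str:
    """GET-style submission: one changed URL per request."""
    return ENDPOINT + "?" + urlencode({"url": changed_url, "key": KEY})

def batch_payload(host: str, urls: list[str]) -> str:
    """POST-style submission: JSON body listing many changed URLs."""
    return json.dumps({"host": host, "key": KEY, "urlList": urls})

# The POST body is sent to ENDPOINT with
# Content-Type: application/json; charset=utf-8.
```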
| speedgoose wrote:
| So it's an alternative to sitemaps (documents that can
| incrementally list all the web pages of a website with their
| last-modification date-times), but with a push model instead
| of a pull model?
|
| https://www.sitemaps.org/
| rob-olmos wrote:
| Yes, but of course only for the search engines that support
| it. E.g., Google is absent from the list, although it already
| has a push/ping sitemap feature.
| orf wrote:
| Why would they not use the /.well-known/ prefix for the default
| index now key?
|
| The default being at the root seems... stupid.
| richdougherty wrote:
| Not defending the standard, but I guess since this is a
| shared secret you don't want to put it at a well known
| location. There's a (slight) attack vector from having an
| attacker know the secret, since they can "launch" a crawl
| against a site. Maybe they could get a crawler to access
| private URLs or something?
|
| Another interesting feature I saw in the standard is that you
| can host keys in subdirectories too.
|
| "the location of a key file determines the set of URLs that
| can be included with this key. A key file located at
| http://example.com/catalog/key12457EDd.txt can include any
| URLs starting with http://example.com/catalog/ but cannot
| include URLs starting with http://example.com/help/."
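|
| The scoping rule quoted above amounts to a plain prefix match
| on the key file's directory. A small sketch (the item and help
| page paths are hypothetical examples):

```python
# Sketch of the IndexNow key-file scoping rule: the directory
# holding a key file determines which URLs that key may be used
# for (a plain prefix match).
def key_covers_url(key_file_url: str, url: str) -> bool:
    """True if `url` falls under the directory containing the key file."""
    key_dir = key_file_url.rsplit("/", 1)[0] + "/"
    return url.startswith(key_dir)

# The example from the standard: a key under /catalog/ covers
# /catalog/ URLs but not /help/ URLs.
assert key_covers_url("http://example.com/catalog/key12457EDd.txt",
                      "http://example.com/catalog/item1.html")
assert not key_covers_url("http://example.com/catalog/key12457EDd.txt",
                          "http://example.com/help/faq.html")
```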
| orf wrote:
| This wouldn't be in a place like ".well-known/secret-key",
| the key would still be part of the path. It's just a well
| known prefix to put exactly this kind of thing.
| dgivney wrote:
| Sharing your key seems like the most 90s approach to system
| design.
|
| "Only you and the search engines should know the key.. so
| obviously, we want you to host it in plain text, in the root
| directory."
| rstupek wrote:
| The key appears not to be a fixed value, so unless your
| server allows directory listings it seems reasonably secure?
| dgivney wrote:
| I agree, in a 90s system design meeting - security through
| obscurity is reasonably secure.
| zhfliz wrote:
| Without referring to this particular case, how is `/.well-
| known/LsyrYyZGDMMPwS1lAUS7qXo7c81XLaxPeRrSZdSReFk5zPaJaD`
| less secure than `/.well-known/key` requiring an
| `Authorization:
| LsyrYyZGDMMPwS1lAUS7qXo7c81XLaxPeRrSZdSReFk5zPaJaD` header?
| kwerk wrote:
| Is CommonCrawl one of the IndexNow recipients for this? If
| so, it seems like a big win for the open web, making an open
| index more efficient to hydrate.
| taylorfinley wrote:
| "We're also hopeful that Google, the world's largest search
| engine and most wasteful user of Internet resources, will adopt
| IndexNow or a similar standard and lower the burden of search
| crawling on the planet." Pretty blunt language from CloudFlare!
| cmroanirgo wrote:
| Although I agree heartily with the idea of a push model for
| search engines, I can't help but notice that it seems to provide
| more centralisation to the search engines out there.
|
| Here on HN we've been seeing posts of alternate search engines.
| How will those small bespoke engines make use of IndexNow unless
| the website participates?
|
| The way I see IndexNow, I'll still get crawled relentlessly
| by the bots I don't want crawling my site (because robots.txt
| never seems to apply to _them_ unless there's a special
| listing explicitly for them).
|
| So a website will still get crawled by every non-
| participating crawler, which doesn't alleviate the problem.
|
| A good compromise would be something like an RSS feed, which a
| site can publish, and crawlers can hit for updated changes. It
| would also allow easier management for those domains that have
| many moving parts: individual search engines can be pinged, but
| the search engine just grabs the changes.xml file... Or
| something.
| [deleted]
| rstupek wrote:
| It looks like a search engine could get listed here:
| https://www.indexnow.org/searchengines.json and any website
| which implements IndexNow could utilize that list to know
| where to publish?
|
| There already is such an "RSS" feed: it's called a sitemap,
| available at /sitemap.xml, or you can alternatively list its
| URL in your robots.txt file.
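|
| Fanning out notifications from that list could look roughly
| like this. The exact schema of searchengines.json isn't
| reproduced here; this assumes you've extracted a list of
| participating engine hostnames from it, and the key is a
| placeholder.

```python
# Sketch: build one IndexNow ping URL per participating engine
# host, taken from a list like the one at
# https://www.indexnow.org/searchengines.json (schema assumed).
from urllib.parse import urlencode

def fanout_ping_urls(engine_hosts: list[str], changed_url: str,
                     key: str) -> list[str]:
    """Build one IndexNow ping URL per participating engine host."""
    return [
        f"https://{host}/indexnow?" + urlencode({"url": changed_url,
                                                 "key": key})
        for host in engine_hosts
    ]

pings = fanout_ping_urls(["api.indexnow.org", "www.bing.com"],
                         "https://example.com/new-page", "your-key")
```

| In practice IndexNow says pinging one participating engine
| propagates to the rest, so the fan-out is only needed if you
| want to notify each engine directly.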
| pacifika wrote:
| I'm concerned about the centralisation aspect of this (quick
| reading, could be wrong) which makes it harder for innovation to
| happen in the search industry.
| rob-olmos wrote:
| IndexNow claims to distribute the pings to other participating
| search engines[1], and to participate they need to have
| "noticeable presence in at least one market", so yea still
| seems to be somewhat exclusive?
|
| Eg, would Ahrefs or Semrush qualify to join the party?
|
| 1: https://www.indexnow.org/searchengines
| richbell wrote:
| There is a noticeable lack of information about how to
| enroll. I suspect it's more of a "if you need to ask you
| aren't big enough" type deal.
| marginalia_nu wrote:
| Or maybe just ask?
|
| I've had staggering success with just sending emails to
| people, including businesses.
| wumpus wrote:
| Most search engine people don't call ahrefs or semrush search
| engines -- they're SEO tools.
|
| Having to have existing market share as a search engine is
| going to really limit the number of participants.
| metadat wrote:
| This is a novel concept. I wonder if / when Marginalia will get
| onboard and implement support for it, too.
|
| It would be cool to be able to push the update signal to a
| bunch of search engines when I publish a new page (even if
| all of my websites get virtually no traffic and don't even
| come up for the appropriate, highly unique keyword combos in
| Google, Bing, or Marginalia; they don't even have any ads or
| anything terrible, though perhaps they're terribly boring or
| SEO-unoptimized, haha).
|
| I wonder if there could be a market for something which collects
| website change information and offers an API to query for new
| pages / updated pages over the past X time interval across Y set
| of [interesting] websites. I could see this being useful, sort of
| like RSS but 100% general purpose.
|
| I'd call it something like Invertdex.
| marginalia_nu wrote:
| > I wonder if / when Marginalia will get onboard and implement
| support for it, too.
|
| I created an issue for it as a reminder. Probably not gonna
| implement it in the short term, because a lot of my model is
| based on full-site crawls at roughly 8-week intervals. While
| something like this would help with identifying new content,
| it doesn't do much to identify when links go dead, so you
| still need to crawl regardless.
|
| Although, on some level I'm a bit uneasy: I have a hunch this
| may be a bit of a vulnerability. In general, giving websites
| tools to control the crawling process beyond robots.txt and
| so on seems a bit sketchy. Maybe it's possible to build
| checks and balances to prevent that, though.
|
| Would take some work to support real-time updates. Not
| impossible, but I've got a lot of work to do with regard to
| search result accuracy that I feel is more important.
|
| I've looked at using RSS feeds as a signal for when to
| re-index sites before. I'm not doing that now, but I think
| the general idea works.
| pingiun wrote:
| richbell wrote:
| > Eschew flamebait. Avoid unrelated controversies, generic
| tangents, and internet tropes.
|
| https://news.ycombinator.com/newsguidelines.html
| [deleted]
| BonoboIO wrote:
| Never heard of kiwifarms before... what a cesspool.
|
| Cloudflare also hosts/protects a number of Austrian and
| German right-wing and covid-hoax "news sites".
|
| I contacted them, but they don't care.
|
| It's a fine line between freedom of speech and censorship,
| but these websites are just absolute garbage, spreading hate,
| racism, and hoaxes, and helping fascists take power.
___________________________________________________________________
(page generated 2022-08-27 23:01 UTC)