Post AwQ121mkBT7Cw8FMci by bortzmeyer@mastodon.gougere.fr
(DIR) Post #AwPxdDiEOzXYpHcD1E by bortzmeyer@mastodon.gougere.fr
2025-07-23T07:20:30Z
0 likes, 0 repeats
Good morning, Madrid! Third day of #IETF123 https://www.ietf.org/meeting/123/ For me, today: MAPRG (crawlers) and the plenary.
(DIR) Post #AwPyYr9kQJZ5fIy8Su by bortzmeyer@mastodon.gougere.fr
2025-07-23T07:30:57Z
0 likes, 0 repeats
MAPRG deals with measurement and analysis of Internet protocols. Today's session is about the impact of #AI crawlers. [Personal opinion: whether a bot crawls for AI or for any other reason does not change the impact.] #IETF123
(DIR) Post #AwQ07cILljwlfC8DFA by bortzmeyer@mastodon.gougere.fr
2025-07-23T07:48:26Z
1 likes, 0 repeats
Testimony from #CommonCrawl https://commoncrawl.org/ people. Blocking bots is more and more common, for instance by Cloudflare, so research projects like CommonCrawl (bot "CCBot") suffer. They use 147 regular expressions to identify and classify refusals. HTTP status codes can be wrong, such as the 430 returned by Shopify, or the 429 returned for non-transient refusals. Many sites can become unreachable at once because they are centralized under one company, like Newfold Digital. #IETF123
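A minimal sketch of what "regular expressions to identify and classify refusals" could look like, since the status code alone cannot be trusted. The three patterns, the labels, and the set of status codes are assumptions made up for illustration; they are not Common Crawl's actual 147 rules.

```python
import re

# Hypothetical patterns, for illustration only (not Common Crawl's rules).
REFUSAL_PATTERNS = [
    ("bot_challenge", re.compile(r"verify (that )?you are (a )?human|captcha", re.I)),
    ("access_denied", re.compile(r"access denied|forbidden|blocked", re.I)),
    ("rate_limited", re.compile(r"too many requests|rate limit", re.I)),
]

# Status codes treated as refusals when the body gives no hint (assumption).
# 430 is non-standard; 429 is sometimes used for permanent refusals.
REFUSAL_STATUSES = {401, 403, 429, 430}


def classify_response(status: int, body: str) -> str:
    """Return a coarse label for a crawler's HTTP response.

    The body is inspected first because the status code can be misleading
    (for example a 200 that carries a "verify you are human" page).
    """
    for label, pattern in REFUSAL_PATTERNS:
        if pattern.search(body):
            return label
    if status in REFUSAL_STATUSES:
        return "refused_by_status"
    return "ok" if 200 <= status < 300 else "other_error"
```

Such crude patterns would misclassify a legitimate page that merely contains the word "blocked", which probably explains why a real classifier needs many more, and more specific, expressions.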
(DIR) Post #AwQ121mkBT7Cw8FMci by bortzmeyer@mastodon.gougere.fr
2025-07-23T07:58:38Z
0 likes, 0 repeats
On the other side (the content servers), how to defend against crawlers? [With the usual confusion between the usages we don't like, such as gen AI, and the stress that crawling puts on the server. Those are two different issues.] robots.txt is not always respected (for instance by TikTok's ByteSpider). The vast majority of artist-related HTTP servers don't use robots.txt (lack of awareness? Also, many platforms do not allow the user to edit robots.txt). #IETF123
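For readers unfamiliar with robots.txt, here is a small sketch of the check a well-behaved crawler performs before fetching a URL, using Python's standard urllib.robotparser. The robots.txt content, the example.org URLs and the user-agent tokens are assumptions chosen for illustration. The check runs on the crawler's side, so nothing forces a bot to perform it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an artist's site (illustration only).
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Tell whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "Bytespider", "https://example.org/art/1.png"))
print(is_allowed(ROBOTS_TXT, "CCBot", "https://example.org/art/1.png"))
```

The first call reports a refusal and the second an authorization, but only a crawler that bothers to ask will ever see the difference.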
(DIR) Post #AwQ1icEL4XpY8DSg9w by bortzmeyer@mastodon.gougere.fr
2025-07-23T08:06:21Z
0 likes, 0 repeats
I notice that Cloudflare blocks serious AI bots (those that respect robots.txt) but not the many unknown bots that make up most of the traffic and trouble. #IETF123
(DIR) Post #AwQ2m2ygBMu1uUKxVI by bortzmeyer@mastodon.gougere.fr
2025-07-23T08:18:09Z
0 likes, 0 repeats
Now, the #IETF itself: impact of crawlers on the many IETF Web services. The IETF data must be public, so blocking is not an option in most cases. 20 Mbytes (out of 277 per month) are sent to bots. (AI crawlers are a minority of these bots.) #IETF123
(DIR) Post #AwQ3mEKc4M1N4rFPOq by bortzmeyer@mastodon.gougere.fr
2025-07-23T08:29:24Z
0 likes, 0 repeats
Now the testimony from #Wikimedia: network use is increasing (+50 % in the last year), partly because of bots (not always AI bots). Bots have broader interests than humans (they don't go to the most popular page of the day) so they are less often served from cache. Bots make up 35 % of the traffic but 65 % of the expensive [not served from caches] traffic. But like the IETF, Wikimedia does not want to block: the goal is to make knowledge available. Heavy users should download the dumps? #IETF123
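A quick back-of-the-envelope computation on the two Wikimedia percentages, to see how much more often bot traffic misses the cache. It assumes that everything that is not a bot is a human and that both percentages are measured in the same unit; neither assumption comes from the talk.

```python
def relative_miss_likelihood(bot_share_total: float, bot_share_expensive: float) -> float:
    """How many times more likely a unit of bot traffic is to be expensive
    (not served from cache) than a unit of human traffic.

    Assumes traffic is split between bots and humans only.
    """
    human_share_total = 1.0 - bot_share_total
    human_share_expensive = 1.0 - bot_share_expensive
    # Expensive traffic per unit of total traffic, up to a common factor
    # that cancels out in the ratio.
    bot_rate = bot_share_expensive / bot_share_total
    human_rate = human_share_expensive / human_share_total
    return bot_rate / human_rate


ratio = relative_miss_likelihood(0.35, 0.65)
print(f"{ratio:.2f}")  # about 3.45
```

Under these assumptions, a unit of bot traffic is roughly three and a half times as likely to bypass the caches as a unit of human traffic.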
(DIR) Post #AwQ4qJmbqHmV66r7LM by bortzmeyer@mastodon.gougere.fr
2025-07-23T08:41:18Z
0 likes, 0 repeats
A report from Cloudflare: https://radar.cloudflare.com/bots #IETF123
(DIR) Post #AwQ5xrUd0XVrd1bdyK by bortzmeyer@mastodon.gougere.fr
2025-07-23T08:53:54Z
0 likes, 0 repeats
A proposal from Microsoft / Bing to drive crawling from a push by the Web site (and not a pull from the crawler, which is typically quite inefficient, as polling often is): https://www.indexnow.org/ [With a lot of greenwashing.] #IETF123 IndexNow is not currently enabled on my site https://www.bortzmeyer.org/ It apparently requires my publication process ('make install') to call an API at Microsoft.
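A sketch of what such a push could look like if called from a publication script. The endpoint and the JSON field names follow my understanding of the public IndexNow documentation and should be checked against it; the host, the key and the URL are made up. The request is only built, not sent, so that the example stays offline.

```python
import json
import urllib.request

# Shared endpoint, as far as I understand the IndexNow documentation (assumption).
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host: str, key: str, urls: list[str]) -> urllib.request.Request:
    """Build (without sending) a POST announcing new or changed URLs."""
    payload = {
        "host": host,
        "key": key,
        # The key must also be published in a text file on the site itself,
        # to prove that the submitter controls the host.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


# Hypothetical host, key and URL.
request = build_indexnow_request(
    "www.example.org",
    "0123456789abcdef",
    ["https://www.example.org/new-article.html"],
)
# urllib.request.urlopen(request) would actually submit it.
```

A 'make install' target could run such a script with the list of files that changed, which is precisely the dependency on a third-party API mentioned above.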
(DIR) Post #AwQCB01tBE0ZaqoMue by ondrej@mastodon.rfc1925.org
2025-07-23T09:29:19Z
0 likes, 0 repeats
@bortzmeyer @Milena_Hime Something has been hitting our GitLab instance very hard doing remote git blame on large files in loops and when we blocked that it switched to raw data. We have no idea if it’s malicious or just very stupid.
(DIR) Post #AwQCB0lyPss5tmNAkC by bortzmeyer@mastodon.gougere.fr
2025-07-23T10:03:27Z
0 likes, 0 repeats
@ondrej @Milena_Hime Software forges are especially sensitive since they have an unlimited number of URLs, and each is often costly.
(DIR) Post #AwQGhJeQAA7zX1V2HY by nw8man@fosstodon.org
2025-07-23T10:53:39Z
0 likes, 0 repeats
@bortzmeyer I found the easiest way to block the bulk of bots and crawlers was to just block the entire USA from entering my site. Does the trick. 🤣
(DIR) Post #AwQHw9YS5C2SW5LXO4 by bortzmeyer@mastodon.gougere.fr
2025-07-23T11:08:03Z
0 likes, 0 repeats
@nw8man The most annoying bots on my HTTP server come from AWS in Singapore.
(DIR) Post #AwQLYSQqe4FhvjkKoK by nw8man@fosstodon.org
2025-07-23T11:48:35Z
0 likes, 0 repeats
@bortzmeyer I use MaxMind GeoIP with my Apache setup. It was NOT straightforward to get working: I had to compile the plugin myself, but that part was fairly easy. Once it is running, you can select the countries you want to allow or disallow based on IP. I found it easiest to block all countries except my own, as it's only stuff I want to view, not really for general world viewing, such as Nextcloud.
(DIR) Post #AwQfBdTHQxBEagHMGm by bortzmeyer@mastodon.gougere.fr
2025-07-23T15:28:33Z
0 likes, 0 repeats
And now the #IETF123 plenary, in the coldest room of the meeting venue. https://datatracker.ietf.org/meeting/123/materials/agenda-123-ietf-sessb-01 #WinterIsComing
(DIR) Post #AwQgDuccjrTIJeIaae by bortzmeyer@mastodon.gougere.fr
2025-07-23T15:40:11Z
0 likes, 0 repeats
The plenary has not started yet. Network issue (too many network engineers in the room). #IETF123
(DIR) Post #AwQhC7OsMHYdK1Zovo by bortzmeyer@mastodon.gougere.fr
2025-07-23T15:51:04Z
0 likes, 0 repeats
1098 persons at the #IETF123 meeting, plus 597 remote (largest post-Covid meeting). 20 % newcomers. 325 got a fee waiver. 12 children at the IETF child care. Countries by decreasing number of participants: USA, Germany, China, Great Britain, India, Spain. 678 persons for the hackathon. 63 projects (this is why the reporting was so long).
(DIR) Post #AwQhOxE5UpthX4eRd2 by Beldeche@toot.aquilenet.fr
2025-07-23T15:53:14Z
0 likes, 0 repeats
@bortzmeyer EID-6370 plenary encountered a too many network engineers exception during start phase.