http://avodonosov.blogspot.com/2022/08/a-robotstxt-problem.html

Monday, 15 August 2022

A robots.txt Problem

To prevent a part of our web application from being scanned by search engines and other web crawlers, we add a robots.txt like:

User-agent: *
Disallow: /path

It's so simple, what can go wrong?

Here is a real story that happened to me. It turns out my cloud platform, Google App Engine, has a caching and compression layer between the application and the Internet. It can gzip content for one client, cache it, and then return the same gzipped response to other clients, even if they haven't sent the Accept-Encoding: gzip header, or have even explicitly requested uncompressed content. This behaviour, unwise in my opinion, is documented here: https://cloud.google.com/appengine/docs/legacy/standard/java/how-requests-are-handled#response_caching

Example:

# Force a gzipped response
$ curl -v -H 'Accept-Encoding: gzip' -H 'User-Agent: gzip' https://yourapp.appspot.com/robots.txt
...
content-encoding: gzip
...
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.

# Now explicitly request uncompressed robots.txt
$ curl -v -H 'Accept-Encoding: identity' https://yourapp.appspot.com/robots.txt
...
content-encoding: gzip
...
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.

(BTW, although the doc says the default caching duration is 10 minutes, I observed Google App Engine returning gzipped responses for at least 30 minutes.)

A web crawler (Dotbot, from moz.com) encountered such a gzipped robots.txt response and was unable to parse it, so it considered all URLs in the app's domain as allowed for crawling. Moreover, the crawler cached this gzipped response. All its subsequent requests for robots.txt were conditional (ETag based, I think) and resulted in 304 Not Modified, so the crawler continued to rely on the gzipped version it could not parse and kept visiting the unwanted URLs.

Luckily, Dotbot clearly identifies itself in the User-Agent header, and moz.com has a working support email, so after five months of communication in a support ticket I discovered the cause.

Fixed the Google App Engine behaviour by adding an explicit caching configuration to appengine-web.xml.
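The exact snippet did not survive in this copy of the post. As a minimal sketch, assuming robots.txt is served as a static file and the fix was to attach an explicit Cache-Control header to it (static-files, include, and http-header are documented appengine-web.xml elements; the header value is my assumption, not necessarily the one from the original fix):

<static-files>
    <include path="/robots.txt">
        <!-- Assumption: forbid intermediary caching, so a response
             gzipped for one client is never replayed to another -->
        <http-header name="Cache-Control" value="no-cache"/>
    </include>
</static-files>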
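For completeness, the revalidation loop that kept the crawler stuck on the stale gzipped copy can be reproduced with curl; the ETag value below is hypothetical:

# First request: note the ETag attached to the response
$ curl -sI https://yourapp.appspot.com/robots.txt | grep -i etag
etag: "abc123"

# A conditional request with a matching ETag yields 304 Not Modified and
# no body, so the client keeps whatever representation it cached earlier
$ curl -sI -H 'If-None-Match: "abc123"' https://yourapp.appspot.com/robots.txt | head -n 1
HTTP/2 304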
Also made a little modification to the robots.txt file itself, to be sure its ETag changes and clients revalidating the old cached copy get a fresh 200 response instead of 304 Not Modified.

Posted by A. V. at 15:06
Labels: english, programming