[HN Gopher] Faking a JPEG
       ___________________________________________________________________
        
       Faking a JPEG
        
       Author : todsacerdoti
       Score  : 371 points
       Date   : 2025-07-11 22:57 UTC (1 day ago)
        
 (HTM) web link (www.ty-penguin.org.uk)
 (TXT) w3m dump (www.ty-penguin.org.uk)
        
       | lblume wrote:
       | Given that current LLMs do not consistently output total garbage,
       | and can be used as judges in a fairly efficient way, I highly
       | doubt this could even in theory have any impact on the
       | capabilities of future models. Once (a) models are capable enough
       | to distinguish between semi-plausible garbage and possibly
       | relevant text and (b) companies are aware of the problem, I do
       | not think data poisoning will be an issue at all.
        
         | jesprenj wrote:
         | Yes, but you still waste their processing power.
        
         | immibis wrote:
         | There's no evidence that the current global DDoS is related to
         | AI.
        
           | lblume wrote:
           | The linked page claims that most identified crawlers are
           | related to scraping for training data of LLMs, which seems
           | likely.
        
           | ykonstant wrote:
           | We have investigated nobody and found no evidence of
           | malpractice!
        
         | Zecc wrote:
         | > Once (a) models are capable enough to distinguish between
         | semi-plausible garbage and possibly relevant text
         | 
         | https://xkcd.com/810/
        
       | bschwindHN wrote:
       | You should generate fake but believable EXIF data to go along
       | with your JPEGs too.
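        | 
        | Roughly like this, as a sketch (assuming the third-party piexif
        | library alongside Pillow; the tag values are made up):
        | 
        |     import io, random
        |     import piexif
        |     from PIL import Image
        | 
        |     def fake_exif():
        |         # Plausible camera plus randomised capture time/location.
        |         zeroth = {piexif.ImageIFD.Make: b"Canon",
        |                   piexif.ImageIFD.Model: b"Canon EOS 80D"}
        |         when = "2023:%02d:%02d 12:%02d:00" % (
        |             random.randint(1, 12), random.randint(1, 28),
        |             random.randint(0, 59))
        |         exif = {piexif.ExifIFD.DateTimeOriginal: when.encode()}
        |         gps = {piexif.GPSIFD.GPSLatitudeRef: b"N",
        |                piexif.GPSIFD.GPSLatitude:
        |                    ((random.randint(0, 89), 1), (0, 1), (0, 1)),
        |                piexif.GPSIFD.GPSLongitudeRef: b"W",
        |                piexif.GPSIFD.GPSLongitude:
        |                    ((random.randint(0, 179), 1), (0, 1), (0, 1))}
        |         return piexif.dump({"0th": zeroth, "Exif": exif,
        |                             "GPS": gps})
        | 
        |     buf = io.BytesIO()
        |     Image.new("RGB", (64, 64), "green").save(
        |         buf, format="JPEG", exif=fake_exif())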
        
         | derektank wrote:
         | From the headline that's actually what I was expecting the link
         | to discuss
        
         | russelg wrote:
         | They're taking the valid JPEG headers from images already on
         | their site, so it's possible those are already in place.
        
           | electroglyph wrote:
           | there's no metadata in the example image
        
         | bigiain wrote:
         | Fake exif data with lat/longs showing the image was taken
         | inside Area 51 or The Cheyenne Mountain Complex or Guantanamo
         | Bay...
        
       | mrbluecoat wrote:
       | > I felt sorry for its thankless quest and started thinking about
       | how I could please it.
       | 
       | A refreshing (and amusing) attitude versus getting angry and
       | venting on forums about aggressive crawlers.
        
         | ASalazarMX wrote:
         | Helped without doubt by the capacity to inflict pain and
         | garbage unto those nasty crawlers.
        
       | dheera wrote:
       | > So the compressed data in a JPEG will look random, right?
       | 
       | I don't think JPEG data is compressed enough to be
       | indistinguishable from random.
       | 
       | SD VAE with some bits lopped off gets you better compression than
       | JPEG and yet the latents don't "look" random at all.
       | 
       | So you might think Huffman encoded JPEG coefficients "look"
       | random when visualized as an image but that's only because
       | they're not intended to be visualized that way.
        
         | maxbond wrote:
         | Encoded JPEG data is random in the same way cows are spherical.
        
           | BlaDeKke wrote:
           | Cows can be spherical.
        
             | bigiain wrote:
             | And have uniform density.
        
               | anyfoo wrote:
               | Yeah, but in practice you only get that in a perfect
               | vacuum.
        
               | nasretdinov wrote:
               | I imagine gravity matters more here than atmosphere
        
       | EspadaV9 wrote:
       | I like this one
       | 
       | https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...
       | 
       | Some kind of statement piece
        
         | myelinsheep wrote:
         | Anything with Shakespeare in it?
        
           | EspadaV9 wrote:
           | Looks like he didn't get time to finish
           | 
           | https://www.ty-
           | penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...
           | 
           | Terry Pratchett has one I'd like to think he'd approve of.
           | Just a shame I'm unable to see the 8th colour, I'm sure it's
           | in there somewhere.
           | 
           | https://www.ty-
           | penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...
        
         | creatonez wrote:
         | For the full experience:
         | 
         | Firefox: Press F12, go to Network, click No Throttling > change
         | it to GPRS
         | 
         | Chromium: Press F12, go to Network, click No Throttling >
         | Custom > Add Profile > Set it to 20kbps and set the profile
        
           | extraduder_ire wrote:
           | Good mention. There's probably some good art to be made by
           | serving similar jpeg images with the speed limited server-
           | side.
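            | 
            | A minimal sketch of that idea (my own, not Spigot's, though
            | the thread notes Spigot is a Flask app; "fake.jpg" is a
            | placeholder): stream the bytes from a generator and sleep
            | between chunks.
            | 
            |     import time
            |     from flask import Flask, Response
            | 
            |     app = Flask(__name__)
            | 
            |     @app.route("/slow.jpg")
            |     def slow_jpeg():
            |         data = open("fake.jpg", "rb").read()
            |         def trickle(chunk=512, delay=0.2):  # ~20 kbit/s
            |             for i in range(0, len(data), chunk):
            |                 yield data[i:i + chunk]
            |                 time.sleep(delay)
            |         return Response(trickle(), mimetype="image/jpeg")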
        
       | hashishen wrote:
       | the hero we needed and deserved
        
       | derefr wrote:
       | > It seems quite likely that this is being done via a botnet -
       | illegally abusing thousands of people's devices. Sigh.
       | 
       | Just because traffic is coming from thousands of devices on
       | residential IPs, doesn't mean it's a botnet in the classical
       | sense. It could just as well be people signing up for a "free VPN
       | service" -- or a tool that "generates passive income" for them --
        | where the actual cost of running the software is that you become
        | an exit node for both other "free VPN service" users' traffic
       | and the traffic of users of the VPN's sibling commercial brand.
       | (E.g. scrapers like this one.)
       | 
       | This scheme is known as "proxyware" -- see
       | https://www.trendmicro.com/en_ca/research/23/b/hijacking-you...
        
         | cAtte_ wrote:
         | sounds like a botnet to me
        
           | ronsor wrote:
           | because it is, but it's a legal botnet
        
           | derefr wrote:
           | Eh. To me, a bot is something users don't know they're
           | running, and would shut off if they knew it was there.
           | 
           | Proxyware is more like a crypto miner -- the original kind,
           | from back when crypto-mining was something a regular computer
           | could feasibly do with pure CPU power. It's something users
            | intentionally install and run and even _maintain_, because
           | they see it as providing them some potential amount of value.
           | Not a bot; just a P2P network client.
           | 
           | Compare/contrast: https://en.wikipedia.org/wiki/Winny /
           | https://en.wikipedia.org/wiki/Share_(P2P) /
           | https://en.wikipedia.org/wiki/Perfect_Dark_(P2P) -- pieces of
           | software which offer users a similar devil's bargain, but
           | instead of "you get a VPN; we get to use your computer as a
           | VPN", it's "you get to pirate things; we get to use your hard
           | drive as a cache node in our distributed, encrypted-and-
           | striped pirated media cache."
           | 
           | (And both of these are different still to something like
           | BitTorrent, where the user only ever seeds what they
           | themselves have previously leeched -- which is much less
           | questionable in terms of what sort of activity you're
           | agreeing to play host to.)
        
             | tgsovlerkhgsel wrote:
             | AFAIK much of the proxyware runs without the _informed_
             | consent of the user. Sure, there may be some note on page
             | 252 of the EULA of whatever adware the user downloaded, but
              | most users wouldn't be aware of it.
        
           | whatsupdog wrote:
           | Botnet with extra steps.
        
         | jeroenhd wrote:
         | That's just a variant of a botnet that the users are willingly
         | joining. Someone well-intentioned should probably redirect
         | those IP addresses to a "you are part of a botnet" page just in
         | case they find the website on a site like HN and don't know
         | what their family members are up to.
         | 
          | The easiest way to deal with them is just to block them
          | regardless, because the probability that someone who knows what
          | to do about this software, and why it's bad, will ever read any
          | particular botnetted website is close to zero.
        
       | marcod wrote:
       | Reading about Spigot made me remember
       | https://www.projecthoneypot.org/
       | 
       | I was very excited 20 years ago, every time I got emails from
       | them that the scripts and donated MX records on my website had
        | helped catch a harvester.
       | 
       | > Regardless of how the rest of your day goes, here's something
       | to be happy about -- today one of your donated MXs helped to
       | identify a previously unknown email harvester (IP:
        172.180.164.102). The harvester was caught by a spam trap email
       | address created with your donated MX:
        
         | notpushkin wrote:
         | This is very neat. Honeypot scripts are fairly outdated though
         | (and you can't modify them according to ToS). The Python one
         | only supports CGI and Zope out of the box, though I think you
         | can make a wrapper to make it work with WSGI apps as well.
        
       | puttycat wrote:
       | > compression tends to increase the entropy of a bit stream.
       | 
       | Does it? Encryption increases entropy, but not sure about
       | compression.
        
         | JCBird1012 wrote:
          | I can see what was meant by that statement. I do think
          | compression increases Shannon entropy by virtue of removing
         | repeating patterns of data - Shannon entropy per byte of
         | compressed data increases since it's now more "random" - all
         | the non-random patterns have been compressed out.
         | 
         | Total information entropy - no. The amount of information
         | conveyed remains the same.
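          | 
          | A quick way to see that empirically (my own sketch, with zlib
          | standing in for the compressor):
          | 
          |     import math, zlib
          |     from collections import Counter
          | 
          |     def bits_per_byte(data):
          |         # Empirical Shannon entropy of the byte distribution.
          |         n = len(data)
          |         return -sum(c / n * math.log2(c / n)
          |                     for c in Counter(data).values())
          | 
          |     raw = b"".join(b"row,%06d,green,2025-03-25\n" % i
          |                    for i in range(50000))
          |     packed = zlib.compress(raw, 9)
          |     print(bits_per_byte(raw))     # low: few, predictable bytes
          |     print(bits_per_byte(packed))  # close to 8 bits per byte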
        
           | gary_0 wrote:
           | Technically with lossy compression, the amount of information
           | conveyed will likely change. It could even _increase_ the
           | amount of information of the decompressed image, for instance
           | if you compress a cartoon with simple lines and colors, a
           | lossy algorithm might introduce artifacts that appear as
           | noise.
        
         | gregdeon wrote:
         | Yes: the reason why some data can be compressed is because many
         | of its bits are predictable, meaning that it has low entropy
         | per bit.
        
       | Modified3019 wrote:
       | Love the effort.
       | 
       | That said, these seem to be heavily biased towards displaying
        | green, so one "sanity" check would be: if your bot is suddenly
       | scraping thousands of green images, something might be up.
        
         | recursive wrote:
         | Mission accomplished I guess
        
         | lvncelot wrote:
         | Nature photographers around the world rejoice as their content
         | becomes safe from scraping.
        
         | ykonstant wrote:
         | Next we do it with red and blue :D
        
       | jandrese wrote:
       | I wonder if you could mess with AI input scrapers by adding fake
       | captions to each image? I imagine something like:
        |     (big green blob)           "My cat playing with his new
        |                                catnip ball"
        |     (blue mess of an image)    "Robins nesting"
        
         | Dwedit wrote:
         | A well-written scraper would check the image against a CLIP
         | model or other captioning model to see if the text there
         | actually agrees with the image contents.
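          | 
          | For illustration, roughly what such a check could look like
          | with an off-the-shelf CLIP model via Hugging Face transformers
          | (checkpoint name, filenames and threshold are just examples):
          | 
          |     from PIL import Image
          |     from transformers import CLIPModel, CLIPProcessor
          | 
          |     name = "openai/clip-vit-base-patch32"
          |     model = CLIPModel.from_pretrained(name)
          |     processor = CLIPProcessor.from_pretrained(name)
          | 
          |     image = Image.open("candidate.jpg")
          |     texts = ["My cat playing with his new catnip ball",
          |              "random coloured noise"]
          |     inputs = processor(text=texts, images=image,
          |                        return_tensors="pt", padding=True)
          |     probs = model(**inputs).logits_per_image.softmax(dim=1)
          |     # Drop the page if the claimed caption doesn't clearly win.
          |     suspicious = probs[0, 0] < 0.5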
        
           | Simran-B wrote:
           | Then captions that are somewhat believable? "Abstract digital
           | art piece by F. U. Botts resembling wide landscapes in
           | vibrant colors"
        
           | Someone wrote:
           | Do scrapers actually do such things on every page they
           | download? Sampling a small fraction of a site to check how
           | trustworthy it is, I can see happen, but I would think they'd
           | rather scrape many more pages than spend resources doing such
           | checks on every page.
           | 
           | Or is the internet so full of garbage nowadays that it is
           | necessary to do that on every page?
        
       | 112233 wrote:
       | So how do I set up an instance of this beautiful flytrap? Do I
       | need a valid personal blog, or can I plop something on cloudflare
       | to spin on their edge?
        
         | ffsm8 wrote:
          | It's a Flask app; he linked to it:
         | 
         | https://github.com/gw1urf/spigot/
        
       | superjan wrote:
        | There is a particular pattern (block/tag marker) that is illegal
        | in the compressed JPEG stream. If I recall correctly you should
        | insert a 0x00 after a 0xFF byte in the output to avoid it. If
        | there is interest I can follow up later (not today).
        
         | superjan wrote:
          | OK, this is correct for traditional JPEG. Other flavors like
          | JPEG 2000 use a similar (but lower-overhead) version of this
          | byte stuffing to prevent JPEG markers from appearing in the
          | compressed stream.
         | 
         | Related:
         | https://en.wikipedia.org/wiki/JPEG#Syntax_and_structure
        
         | ethan_smith wrote:
         | You're referring to JPEG's byte stuffing rule: any 0xFF byte in
         | the entropy-coded data must be followed by a 0x00 byte to
         | prevent it from being interpreted as a marker segment.
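          | 
          | A rough sketch of how that fits with reusing a real file's
          | headers, as described elsewhere in the thread (not the
          | article's exact code, which reportedly also tweaks a Huffman
          | bitmask so decoders complain less):
          | 
          |     import os
          | 
          |     def fake_jpeg(template_path, n=200_000):
          |         data = open(template_path, "rb").read()
          |         sos = data.find(b"\xff\xda")  # naive SOS marker search
          |         hdr = int.from_bytes(data[sos + 2:sos + 4], "big")
          |         head = data[:sos + 2 + hdr]   # headers + scan header
          |         body = bytearray()
          |         for b in os.urandom(n):
          |             body.append(b)
          |             if b == 0xFF:             # stuff 0x00 after 0xFF
          |                 body.append(0x00)
          |         return head + bytes(body) + b"\xff\xd9"  # EOI marker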
        
       | tomsmeding wrote:
       | They do have a robots.txt [1] that disallows robot access to the
       | spigot tree (as expected), but removing the /spigot/ part from
       | the URL seems to still lead to Spigot. [2] The /~auj namespace is
       | not disallowed in robots.txt, so even _well-intentioned_
       | crawlers, if they somehow end up there, can get stuck in the
        | infinite page zoo. That's not very nice.
       | 
       | [1]: https://www.ty-penguin.org.uk/robots.txt
       | 
       | [2]: https://www.ty-penguin.org.uk concatenated with /~auj/cheese
       | (don't want to create links there)
        
         | josephg wrote:
         | > even well-intentioned crawlers, if they somehow end up there,
         | can get stuck in the infinite page zoo. That's not very nice.
         | 
            | So? What duty do website operators have to be "nice" to
            | people scraping their website?
        
           | gary_0 wrote:
           | The Marginalia search engine or archive.org probably don't
           | deserve such treatment--they're performing a public service
           | that benefits everyone, for free. And it's generally not in
           | one's best interests to serve a bunch of garbage to Google or
           | Bing's crawlers, either.
        
             | darkwater wrote:
             | If you have such a website, then you will just serve normal
             | data. But it seems perfectly legit to serve fake random
             | gibberish from your website if you want to. A human would
             | just stop reading it.
        
           | suspended_state wrote:
           | The point is that not every web crawler is out there to
           | scrape websites.
        
             | andybak wrote:
             | Unless you define "scrape" to be inherently nefarious -
             | then surely they are? Isn't the definition of a web crawler
             | based on scraping websites?
        
               | suspended_state wrote:
                | I think that web scraping is usually understood as the
                | act of extracting information from a website for
                | ulterior, self-centered motives. However, it is clear
                | that this ulterior motive cannot be assessed by a
                | website owner. Only the observable behaviour of a data-
                | collecting process can be categorized as morally good or
                | bad. While the badly behaving people are usually also
                | the ones with morally wrong motives, one doesn't entail
                | the other. I chose to call the badly behaving ones
                | scrapers, and the well-behaved ones crawlers.
                | 
                | That being said, the author is perhaps concerned by the
                | growing number of collecting processes, which take a
                | toll on his server, and thus chose to simply penalize
                | them all.
        
         | bstsb wrote:
         | previously the author wrote in a comment reply about not
         | configuring robots.txt at all:
         | 
         | > I've not configured anything in my robots.txt and yes, this
         | is an extreme position to take. But I don't much like the
         | concept that it's my responsibility to configure my web site so
         | that crawlers don't DOS it. In my opinion, a legitimate crawler
         | ought not to be hitting a single web site at a sustained rate
         | of > 15 requests per second.
        
           | yorwba wrote:
           | The spigot doesn't seem to distinguish between crawlers that
           | make more than 15 requests per second and those that make
           | less. I think it would be nicer to throw up a "429 Too Many
           | Requests" page when you think the load is too much and only
           | poison crawlers that don't back off afterwards.
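            | 
            | A sketch of the naive per-IP version of that (assuming the
            | Flask setup mentioned elsewhere in the thread; state is kept
            | in memory and never expired, so it's only a sketch):
            | 
            |     import time
            |     from collections import defaultdict, deque
            |     from flask import Flask, request, abort
            | 
            |     app = Flask(__name__)
            |     hits = defaultdict(deque)
            | 
            |     @app.before_request
            |     def throttle(limit=15, window=1.0):
            |         now = time.time()
            |         q = hits[request.remote_addr]
            |         while q and now - q[0] > window:
            |             q.popleft()
            |         q.append(now)
            |         if len(q) > limit:
            |             abort(429)   # Too Many Requests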
        
             | evgpbfhnr wrote:
              | When crawlers use a botnet to make only one request per IP
              | over a long duration, that's not realistic to implement,
              | though...
        
             | DamonHD wrote:
              | Almost no bot I have seen responds usefully to 429, and a
              | few respond to it like 500 and 503 by _speeding up_ /
              | retrying / polling more.
        
       | kazinator wrote:
       | Faking a JPEG is not only less CPU intensive than making one
        | properly, but by doing so you are fuzzing whatever malware is on
       | the other end; if it is decoding the JPEG and isn't robust, it
       | may well crash.
        
       | Szpadel wrote:
        | The worst offender I've seen is Meta.
        | 
        | They have the facebookexternalhit bot (which sometimes uses the
        | default Python requests user agent) that, as they document,
        | explicitly ignores robots.txt.
        | 
        | It's supposedly used to check whether links contain malware. But
        | if someone wanted to serve malware, the first thing they would
        | do is serve an innocent page to Facebook's AS and their user
        | agent.
        | 
        | They also re-check every URL every month to validate that it
        | still doesn't contain malware.
        | 
        | The issue is as follows: some bad actors spam Facebook with URLs
        | to expensive endpoints (like a search with random filters), and
        | Facebook then provides a free DDoS service for your competition.
        | They flood you with > 10 r/s for days, every month.
        
         | kuschku wrote:
         | Since when is 10r/s flooding?
         | 
         | That barely registers as a blip even if you're hosting your
         | site on a single server.
        
           | Szpadel wrote:
            | In our case this was a very heavy, specialized endpoint,
            | and because each request used a different set of parameters
            | it could not benefit from caching (actually, in this case it
            | thrashed caches with useless entries).
            | 
            | This forced us to scale up. When handling such a bot costs
            | more than the rest of the users and bots combined, that's an
            | issue, especially for our customers with smaller traffic.
            | 
            | The request rate varied from site to site, but it ranged
            | from half to 75% of total traffic and was basically
            | saturating many servers for days if not blocked.
        
           | anarazel wrote:
           | That depends on what you're hosting. Good luck if it's e.g. a
           | web interface for a bunch of git repositories with a long
           | history. You can't cache effectively because there's too many
           | pages and generating each page isn't cheap.
        
       | BubbleRings wrote:
        | Is there a reason you couldn't generate your images by grabbing
        | random rectangles of pixels from one source image and pasting
        | them into random locations in another source image? Then you
        | would have a fully valid JPEG that no AI could easily identify
        | as generated junk. I guess that would require much more CPU than
        | your current method, huh?
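        | 
        | Roughly, with Pillow (a sketch; assumes both source images are
        | at least 128 px in each dimension):
        | 
        |     import random
        |     from PIL import Image
        | 
        |     def shuffle_patches(src_path, dst_path, n=25):
        |         src = Image.open(src_path)
        |         dst = Image.open(dst_path).copy()
        |         for _ in range(n):
        |             w = random.randint(32, 128)
        |             h = random.randint(32, 128)
        |             x = random.randint(0, src.width - w)
        |             y = random.randint(0, src.height - h)
        |             patch = src.crop((x, y, x + w, y + h))
        |             dst.paste(patch, (random.randint(0, dst.width - w),
        |                               random.randint(0, dst.height - h)))
        |         return dst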
        
         | adgjlsfhk1 wrote:
         | Given the amount of money AI companies have, you need at least
         | ~100x work amplification for this to begin to be a punishment.
        
       | jekwoooooe wrote:
       | It's our moral imperative to make crawling cost prohibitive and
       | also poison LLM training.
        
       | a-biad wrote:
        | I am a bit confused about the context. What exactly is the point
        | of exposing fake data to web crawlers?
        
         | whitten wrote:
         | penalizing the web spider for scraping their site
        
         | gosub100 wrote:
         | They crawl for data, usually to train a model. Poisoning the
          | model's training data makes it less useful and therefore less
          | valuable.
        
       | jeroenhd wrote:
       | This makes me wonder if there are more efficient image formats
       | that one might want to feed botnets. JPEG is highly complex, but
       | PNG uses a relatively simple DEFLATE stream as well as some basic
       | filters. Perhaps one could make a zip-bomb like PNG that only
       | consists of a few bytes?
        
         | john01dav wrote:
         | That might be challenging because you can trivially determine
          | the output file size based on the dimensions in pixels and
         | pixel format, so if the DEFLATE stream goes beyond that you can
         | stop decoding and discard the image as malformed. Of course,
         | some decoders may not do so and thus would be vulnerable.
        
           | _ache_ wrote:
            | Is it a problem though? I'm pretty sure that any check is
            | on the file size of the PNG, not the actual dimensions of
            | the image.
            | 
            | PNG doesn't have a size limitation on the image dimensions
            | (4 bytes each), so I bet you can break at least one scraper
            | bot with that.
        
         | sltkr wrote:
         | DEFLATE has a rather low maximum compression ratio of 1:1032,
         | so a file that would take 1 GB of memory uncompressed still
         | needs to be about 1 MB.
         | 
         | ZIP bombs rely on recursion or overlapping entries to achieve
         | higher ratios, but the PNG format is too simple to allow such
         | tricks (at least in the usual critical chunks that all decoders
         | are required to support).
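          | 
          | As a rough illustration of that ceiling (my own sketch,
          | hand-rolling a minimal all-zero truecolor PNG):
          | 
          |     import struct, zlib
          | 
          |     def chunk(ctype, data):
          |         return (struct.pack(">I", len(data)) + ctype + data +
          |                 struct.pack(">I", zlib.crc32(ctype + data)))
          | 
          |     w = h = 10000            # ~300 MB of raw RGB when decoded
          |     row = b"\x00" * (1 + w * 3)   # filter byte + zero pixels
          |     comp = zlib.compressobj(9)
          |     idat = (b"".join(comp.compress(row) for _ in range(h)) +
          |             comp.flush())
          | 
          |     ihdr = struct.pack(">IIBBBBB", w, h, 8, 2, 0, 0, 0)
          |     png = (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr) +
          |            chunk(b"IDAT", idat) + chunk(b"IEND", b""))
          |     # Ends up around 300 KB on disk: DEFLATE's ~1:1032 cap.
          |     open("big.png", "wb").write(png)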
        
       | Domainzsite wrote:
       | This is pure internet mischief at its finest. Weaponizing fake
       | JPEGs with valid structure and random payloads to burn botnet
       | cycles? Brilliant. Love the tradeoff thinking: maximize crawler
       | cost, minimize CPU. The Huffman bitmask tweak is chef's kiss.
       | Spigot feels like a spiritual successor to robots.txt flipping
       | you off in binary.
        
       | time0ut wrote:
       | JPEG is fascinating and quite complex. Here is a really excellent
       | high level explanation of how it works:
       | 
       | https://www.youtube.com/watch?v=0me3guauqOU
        
       | larcanio wrote:
        | Happy to realize real heroes do exist.
        
       | ardme wrote:
       | Old man yells at cloud, then creates a labyrinth of mirrors for
       | the images of the clouds to reflect back on each other.
        
       | bvan wrote:
       | Love this.
        
       | sim7c00 wrote:
       | love how u speak about pleasing bots and them getting excited :D
       | fun read, fun project. thanks!
        
       | mhuffman wrote:
        | I don't understand the reasoning behind the "feed them a bunch of
        | trash" option when it seems that if you identify them (for
        | example, because they ignore robots.txt) you can just keep them
        | hung up on network connections or similar, without paying to
        | serve infinite garbage for crawlers to ingest.
        
         | schroeding wrote:
         | The "poisoning the data supply" angle seems to be a common
         | motive, similar to tools like nightshade[1] (for actual images
         | and not just garbage data).
         | 
         | [1] https://nightshade.cs.uchicago.edu/whatis.html
        
       | thayne wrote:
        | I'm curious how the author identifies the crawlers that use
        | random User-Agents and distinct IP addresses per request. Is
        | there some other indicator that can be used to identify them?
        | 
        | On a different note, if the goal is to waste the bot's resources,
        | one potential improvement could be to use, for the templates,
        | very large images with repeating structure that compress
        | extremely well as JPEGs, so that they take a lot of RAM and CPU
        | to decode while requiring relatively little CPU and RAM to
        | generate and little bandwidth to transfer.
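        | 
        | Something along those lines, as a sketch (template built once,
        | offline, with Pillow; it briefly needs ~1.2 GB of RAM to
        | create):
        | 
        |     from PIL import Image
        | 
        |     # A flat-colour 20000x20000 JPEG: a few MB on disk, but
        |     # ~1.2 GB of RGB once a scraper fully decodes it.
        |     Image.new("RGB", (20000, 20000), (20, 110, 40)).save(
        |         "template.jpg", quality=30)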
        
       ___________________________________________________________________
       (page generated 2025-07-12 23:00 UTC)