[HN Gopher] Faking a JPEG
___________________________________________________________________
Faking a JPEG
Author : todsacerdoti
Score : 371 points
Date   : 2025-07-11 22:57 UTC (1 day ago)
(HTM) web link (www.ty-penguin.org.uk)
(TXT) w3m dump (www.ty-penguin.org.uk)
| lblume wrote:
| Given that current LLMs do not consistently output total garbage,
| and can be used as judges in a fairly efficient way, I highly
| doubt this could even in theory have any impact on the
| capabilities of future models. Once (a) models are capable enough
| to distinguish between semi-plausible garbage and possibly
| relevant text and (b) companies are aware of the problem, I do
| not think data poisoning will be an issue at all.
| jesprenj wrote:
| Yes, but you still waste their processing power.
| immibis wrote:
| There's no evidence that the current global DDoS is related to
| AI.
| lblume wrote:
| The linked page claims that most identified crawlers are
| related to scraping for training data of LLMs, which seems
| likely.
| ykonstant wrote:
| We have investigated nobody and found no evidence of
| malpractice!
| Zecc wrote:
| > Once (a) models are capable enough to distinguish between
| semi-plausible garbage and possibly relevant text
|
| https://xkcd.com/810/
| bschwindHN wrote:
| You should generate fake but believable EXIF data to go along
| with your JPEGs too.
| derektank wrote:
| From the headline that's actually what I was expecting the link
| to discuss
| russelg wrote:
| They're taking the valid JPEG headers from images already on
| their site, so it's possible those are already in place.
| electroglyph wrote:
| there's no metadata in the example image
| bigiain wrote:
| Fake exif data with lat/longs showing the image was taken
| inside Area 51 or The Cheyenne Mountain Complex or Guantanamo
| Bay...
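|
| A minimal sketch of how one might write such fake GPS tags with
| the piexif library (the coordinates and file name here are made
| up, and this isn't part of the article's Spigot code):
|
|     import piexif  # third-party: pip install piexif
|
|     def to_rational(deg, minutes, seconds):
|         # EXIF stores GPS values as (numerator, denominator)
|         # rationals.
|         return ((deg, 1), (minutes, 1),
|                 (int(seconds * 100), 100))
|
|     # Made-up coordinates, roughly Area 51.
|     gps = {
|         piexif.GPSIFD.GPSLatitudeRef: "N",
|         piexif.GPSIFD.GPSLatitude: to_rational(37, 14, 3.6),
|         piexif.GPSIFD.GPSLongitudeRef: "W",
|         piexif.GPSIFD.GPSLongitude: to_rational(115, 48, 23.9),
|     }
|     exif = piexif.dump({"0th": {}, "Exif": {}, "GPS": gps,
|                         "1st": {}, "thumbnail": None})
|     piexif.insert(exif, "fake.jpg")  # rewrites EXIF in place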
| mrbluecoat wrote:
| > I felt sorry for its thankless quest and started thinking about
| how I could please it.
|
| A refreshing (and amusing) attitude versus getting angry and
| venting on forums about aggressive crawlers.
| ASalazarMX wrote:
| Helped without doubt by the capacity to inflict pain and
| garbage unto those nasty crawlers.
| dheera wrote:
| > So the compressed data in a JPEG will look random, right?
|
| I don't think JPEG data is compressed enough to be
| indistinguishable from random.
|
| SD VAE with some bits lopped off gets you better compression than
| JPEG and yet the latents don't "look" random at all.
|
| So you might think Huffman encoded JPEG coefficients "look"
| random when visualized as an image but that's only because
| they're not intended to be visualized that way.
| maxbond wrote:
| Encoded JPEG data is random in the same way cows are spherical.
| BlaDeKke wrote:
| Cows can be spherical.
| bigiain wrote:
| And have uniform density.
| anyfoo wrote:
| Yeah, but in practice you only get that in a perfect
| vacuum.
| nasretdinov wrote:
| I imagine gravity matters more here than atmosphere
| EspadaV9 wrote:
| I like this one
|
| https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...
|
| Some kind of statement piece
| myelinsheep wrote:
| Anything with Shakespeare in it?
| EspadaV9 wrote:
| Looks like he didn't get time to finish
|
| https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...
|
| Terry Pratchett has one I'd like to think he'd approve of.
| Just a shame I'm unable to see the 8th colour, I'm sure it's
| in there somewhere.
|
| https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...
| creatonez wrote:
| For the full experience:
|
| Firefox: Press F12, go to Network, click No Throttling > change
| it to GPRS
|
| Chromium: Press F12, go to Network, click No Throttling >
| Custom > Add Profile > Set it to 20kbps and set the profile
| extraduder_ire wrote:
| Good mention. There's probably some good art to be made by
| serving similar jpeg images with the speed limited server-
| side.
| hashishen wrote:
| the hero we needed and deserved
| derefr wrote:
| > It seems quite likely that this is being done via a botnet -
| illegally abusing thousands of people's devices. Sigh.
|
| Just because traffic is coming from thousands of devices on
| residential IPs, doesn't mean it's a botnet in the classical
| sense. It could just as well be people signing up for a "free VPN
| service" -- or a tool that "generates passive income" for them --
| where the actual cost of running the software is that you become
| an exit node for both other "free VPN service" users' traffic,
| and the traffic of users of the VPN's sibling commercial brand.
| (E.g. scrapers like this one.)
|
| This scheme is known as "proxyware" -- see
| https://www.trendmicro.com/en_ca/research/23/b/hijacking-you...
| cAtte_ wrote:
| sounds like a botnet to me
| ronsor wrote:
| because it is, but it's a legal botnet
| derefr wrote:
| Eh. To me, a bot is something users don't know they're
| running, and would shut off if they knew it was there.
|
| Proxyware is more like a crypto miner -- the original kind,
| from back when crypto-mining was something a regular computer
| could feasibly do with pure CPU power. It's something users
| intentionally install and run and even _maintain_, because
| they see it as providing them some potential amount of value.
| Not a bot; just a P2P network client.
|
| Compare/contrast: https://en.wikipedia.org/wiki/Winny /
| https://en.wikipedia.org/wiki/Share_(P2P) /
| https://en.wikipedia.org/wiki/Perfect_Dark_(P2P) -- pieces of
| software which offer users a similar devil's bargain, but
| instead of "you get a VPN; we get to use your computer as a
| VPN", it's "you get to pirate things; we get to use your hard
| drive as a cache node in our distributed, encrypted-and-
| striped pirated media cache."
|
| (And both of these are different still to something like
| BitTorrent, where the user only ever seeds what they
| themselves have previously leeched -- which is much less
| questionable in terms of what sort of activity you're
| agreeing to play host to.)
| tgsovlerkhgsel wrote:
| AFAIK much of the proxyware runs without the _informed_
| consent of the user. Sure, there may be some note on page
| 252 of the EULA of whatever adware the user downloaded, but
| most users wouldn't be aware of it.
| whatsupdog wrote:
| Botnet with extra steps.
| jeroenhd wrote:
| That's just a variant of a botnet that the users are willingly
| joining. Someone well-intentioned should probably redirect
| those IP addresses to a "you are part of a botnet" page just in
| case they find the website on a site like HN and don't know
| what their family members are up to.
|
| Easiest way to deal with them is just to block them regardless,
| because the probability that someone who knows what to do about
| this software and why it's bad will read any particularly
| botnetted website is close to zero.
| marcod wrote:
| Reading about Spigot made me remember
| https://www.projecthoneypot.org/
|
| I was very excited 20 years ago, every time I got emails from
| them that the scripts and donated MX records on my website had
| helped catch a harvester.
|
| > Regardless of how the rest of your day goes, here's something
| to be happy about -- today one of your donated MXs helped to
| identify a previously unknown email harvester (IP:
| 172.180.164.102). The harvester was caught by a spam trap email
| address created with your donated MX:
| notpushkin wrote:
| This is very neat. Honeypot scripts are fairly outdated though
| (and you can't modify them according to ToS). The Python one
| only supports CGI and Zope out of the box, though I think you
| can make a wrapper to make it work with WSGI apps as well.
| puttycat wrote:
| > compression tends to increase the entropy of a bit stream.
|
| Does it? Encryption increases entropy, but not sure about
| compression.
| JCBird1012 wrote:
| I can see what was meant by that statement. I do think
| compression increases Shannon entropy by virtue of removing
| repeating patterns of data - Shannon entropy per byte of
| compressed data increases since it's now more "random" - all
| the non-random patterns have been compressed out.
|
| Total information entropy - no. The amount of information
| conveyed remains the same.
| gary_0 wrote:
| Technically with lossy compression, the amount of information
| conveyed will likely change. It could even _increase_ the
| amount of information of the decompressed image, for instance
| if you compress a cartoon with simple lines and colors, a
| lossy algorithm might introduce artifacts that appear as
| noise.
| gregdeon wrote:
| Yes: the reason some data can be compressed is that many of
| its bits are predictable, meaning that it has low entropy per
| bit.
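|
| A rough sketch of that effect, measuring Shannon entropy per
| byte of some repetitive text before and after zlib compression
| (not from the article):
|
|     import math, zlib
|     from collections import Counter
|
|     def entropy_per_byte(data: bytes) -> float:
|         # Shannon entropy in bits per byte, from the byte
|         # histogram.
|         n = len(data)
|         return -sum(c / n * math.log2(c / n)
|                     for c in Counter(data).values())
|
|     raw = b"the quick brown fox jumps over the lazy dog " * 1000
|     comp = zlib.compress(raw, 9)
|
|     print(len(raw), entropy_per_byte(raw))    # low bits/byte
|     print(len(comp), entropy_per_byte(comp))  # close to 8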
| Modified3019 wrote:
| Love the effort.
|
| That said, these seem to be heavily biased towards displaying
| green, so one "sanity" check would be: if your bot is suddenly
| scraping thousands of green images, something might be up.
| recursive wrote:
| Mission accomplished I guess
| lvncelot wrote:
| Nature photographers around the world rejoice as their content
| becomes safe from scraping.
| ykonstant wrote:
| Next we do it with red and blue :D
| jandrese wrote:
| I wonder if you could mess with AI input scrapers by adding fake
| captions to each image? I imagine something like:
|
|     (big green blob) "My cat playing with his new catnip ball"
|     (blue mess of an image) "Robins nesting"
| Dwedit wrote:
| A well-written scraper would check the image against a CLIP
| model or other captioning model to see if the text there
| actually agrees with the image contents.
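|
| A rough sketch of that kind of check with the
| openai/clip-vit-base-patch32 model via Hugging Face
| transformers (the file name, captions and threshold here are
| arbitrary):
|
|     from PIL import Image
|     from transformers import CLIPModel, CLIPProcessor
|
|     name = "openai/clip-vit-base-patch32"
|     model = CLIPModel.from_pretrained(name)
|     processor = CLIPProcessor.from_pretrained(name)
|
|     image = Image.open("scraped.jpg")  # hypothetical image
|     captions = ["my cat playing with his new catnip ball",
|                 "random coloured noise"]
|
|     inputs = processor(text=captions, images=image,
|                        return_tensors="pt", padding=True)
|     probs = model(**inputs).logits_per_image.softmax(dim=1)[0]
|
|     # If the page's caption doesn't clearly beat the "noise"
|     # label, treat the page as suspect.
|     if probs[0] < 0.5:
|         print("caption does not match image")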
| Simran-B wrote:
| Then captions that are somewhat believable? "Abstract digital
| art piece by F. U. Botts resembling wide landscapes in
| vibrant colors"
| Someone wrote:
| Do scrapers actually do such things on every page they
| download? Sampling a small fraction of a site to check how
| trustworthy it is, I can see happen, but I would think they'd
| rather scrape many more pages than spend resources doing such
| checks on every page.
|
| Or is the internet so full of garbage nowadays that it is
| necessary to do that on every page?
| 112233 wrote:
| So how do I set up an instance of this beautiful flytrap? Do I
| need a valid personal blog, or can I plop something on cloudflare
| to spin on their edge?
| ffsm8 wrote:
| It's a Flask app; he linked to it:
|
| https://github.com/gw1urf/spigot/
| superjan wrote:
| There is a particular pattern (block/tag marker) that is
| illegal in the compressed JPEG stream. If I recall correctly,
| you should insert a 0x00 after a 0xFF byte in the output to
| avoid it. If there is interest I can follow up later (not
| today).
| superjan wrote:
| OK, this is correct for traditional JPEG. Other flavors like
| JPEG 2000 use a similar (but lower-overhead) version of this
| byte stuffing to keep JPEG markers from appearing in the
| compressed stream.
|
| Related:
| https://en.wikipedia.org/wiki/JPEG#Syntax_and_structure
| ethan_smith wrote:
| You're referring to JPEG's byte stuffing rule: any 0xFF byte in
| the entropy-coded data must be followed by a 0x00 byte to
| prevent it from being interpreted as a marker segment.
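|
| A minimal sketch of that stuffing step applied to an arbitrary
| buffer of pseudo "entropy-coded" bytes (this is not the
| article's actual generator, and stuffing alone doesn't make the
| bits valid Huffman data; it only keeps spurious markers out):
|
|     import os
|
|     def stuff_ff_bytes(scan: bytes) -> bytes:
|         # Emit 0x00 after every 0xFF so a decoder doesn't
|         # mistake the data for a marker (e.g. 0xFF 0xD9,
|         # end of image).
|         out = bytearray()
|         for b in scan:
|             out.append(b)
|             if b == 0xFF:
|                 out.append(0x00)
|         return bytes(out)
|
|     fake_scan = stuff_ff_bytes(os.urandom(4096))
|     assert b"\xff\xd9" not in fake_scan  # no early end-of-image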
| tomsmeding wrote:
| They do have a robots.txt [1] that disallows robot access to the
| spigot tree (as expected), but removing the /spigot/ part from
| the URL seems to still lead to Spigot. [2] The /~auj namespace is
| not disallowed in robots.txt, so even _well-intentioned_
| crawlers, if they somehow end up there, can get stuck in the
| infinite page zoo. That's not very nice.
|
| [1]: https://www.ty-penguin.org.uk/robots.txt
|
| [2]: https://www.ty-penguin.org.uk concatenated with /~auj/cheese
| (don't want to create links there)
| josephg wrote:
| > even well-intentioned crawlers, if they somehow end up there,
| can get stuck in the infinite page zoo. That's not very nice.
|
| So? What duty do web site operators have to be "nice" to people
| scraping their websites?
| gary_0 wrote:
| The Marginalia search engine or archive.org probably don't
| deserve such treatment--they're performing a public service
| that benefits everyone, for free. And it's generally not in
| one's best interests to serve a bunch of garbage to Google or
| Bing's crawlers, either.
| darkwater wrote:
| If you have such a website, then you will just serve normal
| data. But it seems perfectly legit to serve fake random
| gibberish from your website if you want to. A human would
| just stop reading it.
| suspended_state wrote:
| The point is that not every web crawler is out there to
| scrape websites.
| andybak wrote:
| Unless you define "scrape" to be inherently nefarious -
| then surely they are? Isn't the definition of a web crawler
| based on scraping websites?
| suspended_state wrote:
| I think that web scraping is usually understood as the
| act of extracting information from a website for ulterior,
| self-centered motives. However, it is clear that this
| ulterior motive cannot be assessed by a website owner.
| Only the observable behaviour of a data-collecting
| process can be categorized as morally good or bad. While
| the badly behaved people are usually also the ones with
| morally wrong motives, one doesn't entail the other. I
| chose to qualify the badly behaved ones as scrapers, and
| the well-behaved ones as crawlers.
|
| That being said, the author is perhaps concerned by the
| growing number of collecting processes, which take a
| toll on his server, and thus chose to simply penalize
| them all.
| bstsb wrote:
| previously the author wrote in a comment reply about not
| configuring robots.txt at all:
|
| > I've not configured anything in my robots.txt and yes, this
| is an extreme position to take. But I don't much like the
| concept that it's my responsibility to configure my web site so
| that crawlers don't DOS it. In my opinion, a legitimate crawler
| ought not to be hitting a single web site at a sustained rate
| of > 15 requests per second.
| yorwba wrote:
| The spigot doesn't seem to distinguish between crawlers that
| make more than 15 requests per second and those that make
| less. I think it would be nicer to throw up a "429 Too Many
| Requests" page when you think the load is too much and only
| poison crawlers that don't back off afterwards.
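|
| A rough sketch of that kind of check in a Flask app (Spigot
| itself is a Flask app, but this is not its code; the window and
| limit here are arbitrary):
|
|     import time
|     from collections import defaultdict, deque
|     from flask import Flask, Response, request
|
|     app = Flask(__name__)
|     WINDOW, LIMIT = 1.0, 15           # seconds, requests
|     hits = defaultdict(deque)         # client IP -> timestamps
|
|     @app.before_request
|     def rate_limit():
|         now = time.time()
|         q = hits[request.remote_addr]
|         while q and now - q[0] > WINDOW:
|             q.popleft()
|         q.append(now)
|         if len(q) > LIMIT:
|             return Response("slow down", status=429,
|                             headers={"Retry-After": "60"})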
| evgpbfhnr wrote:
| when crawlers use a botnet to make only one request per IP
| over a long period, that's not realistic to implement,
| though...
| DamonHD wrote:
| Almost no bot responds usefully to 429 that I have seen,
| and a few respond to it like 500 and 503 to _speed up_ /
| retry / poll more.
| kazinator wrote:
| Faking a JPEG is not only less CPU-intensive than making one
| properly, but by doing so you are fuzzing whatever malware is on
| the other end; if it is decoding the JPEG and isn't robust, it
| may well crash.
| Szpadel wrote:
| The worst offender I've seen is Meta.
|
| They have the facebookexternalhit bot (which sometimes uses the
| default Python requests user agent) that (as they document)
| explicitly ignores robots.txt.
|
| It's (as they say) used to check whether links contain malware.
| But if someone wanted to serve malware, the first thing they
| would do is serve an innocent page to Facebook's AS and their
| user agent.
|
| They also re-check every URL every month to validate that it
| still doesn't contain malware.
|
| The issue is as follows: some bad actors spam Facebook with URLs
| to expensive endpoints (like a search with random filters), and
| Facebook then provides a free DDoS service for your competition.
| They flood you with > 10 r/s for days, every month.
| kuschku wrote:
| Since when is 10r/s flooding?
|
| That barely registers as a blip even if you're hosting your
| site on a single server.
| Szpadel wrote:
| In our case this was a very heavy, specialized endpoint, and
| because each request used a different set of parameters it
| could not benefit from caching (actually, in this case it
| thrashed the caches with useless entries).
|
| This resulted in scaling up. When handling such a bot costs
| more than handling the rest of the users and bots, that's an
| issue, especially for our customers with smaller traffic.
|
| The request rate varied from site to site, but it ranged from
| half to 75% of all traffic and was basically saturating many
| servers for days if not blocked.
| anarazel wrote:
| That depends on what you're hosting. Good luck if it's e.g. a
| web interface for a bunch of git repositories with a long
| history. You can't cache effectively because there's too many
| pages and generating each page isn't cheap.
| BubbleRings wrote:
| Is there a reason you couldn't generate your images by grabbing
| random rectangles of pixels from one source image and pasting
| them into random locations in another source image? Then you
| would have a fully valid JPEG that no AI could easily identify
| as generated junk. I guess that would require much more CPU
| than your current method, huh?
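|
| For what it's worth, a sketch of that idea with Pillow (file
| names are placeholders; it assumes both source images are
| larger than 256x256):
|
|     import random
|     from PIL import Image
|
|     src = Image.open("photo_a.jpg")
|     dst = Image.open("photo_b.jpg").copy()
|
|     for _ in range(20):
|         w = random.randint(32, 256)
|         h = random.randint(32, 256)
|         x = random.randint(0, src.width - w)
|         y = random.randint(0, src.height - h)
|         patch = src.crop((x, y, x + w, y + h))
|         dst.paste(patch, (random.randint(0, dst.width - w),
|                           random.randint(0, dst.height - h)))
|
|     dst.save("junk.jpg", quality=75)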
| adgjlsfhk1 wrote:
| Given the amount of money AI companies have, you need at least
| ~100x work amplification for this to begin to be a punishment.
| jekwoooooe wrote:
| It's our moral imperative to make crawling cost prohibitive and
| also poison LLM training.
| a-biad wrote:
| I am a bit confused about the context. What exactly is the
| point of exposing fake data to web crawlers?
| whitten wrote:
| penalizing the web spider for scraping their site
| gosub100 wrote:
| They crawl for data, usually to train a model. Poisoning the
| model's training data makes it less useful and therefore less
| valuable.
| jeroenhd wrote:
| This makes me wonder if there are more efficient image formats
| that one might want to feed botnets. JPEG is highly complex, but
| PNG uses a relatively simple DEFLATE stream as well as some basic
| filters. Perhaps one could make a zip-bomb-like PNG that
| consists of only a few bytes?
| john01dav wrote:
| That might be challenging because you can trivially determine
| the output file size based on the dimensions in pixels and
| pixel format, so if the DEFLATE stream goes beyond that you can
| stop decoding and discard the image as malformed. Of course,
| some decoders may not do so and thus would be vulnerable.
| _ache_ wrote:
| Is it a problem though? I'm pretty sure that any check is
| on the size of the PNG file, not the actual dimensions of
| the image.
|
| PNG doesn't have a size limitation on the image dimensions
| (4 bytes each). So I bet you can break at least one scraper
| bot with that.
| sltkr wrote:
| DEFLATE has a rather low maximum compression ratio of 1:1032,
| so a file that would take 1 GB of memory uncompressed still
| needs to be about 1 MB.
|
| ZIP bombs rely on recursion or overlapping entries to achieve
| higher ratios, but the PNG format is too simple to allow such
| tricks (at least in the usual critical chunks that all decoders
| are required to support).
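|
| Easy to check with Pillow: a big solid-colour image decodes to
| a large buffer, but the PNG file can't shrink much below
| decoded_size / 1032 plus per-row filter bytes and chunk
| overhead (rough sketch; needs roughly 256 MB of RAM):
|
|     import os
|     from PIL import Image
|
|     # 16384 x 16384 8-bit grayscale = 256 MiB once decoded.
|     img = Image.new("L", (16384, 16384), 0)
|     img.save("big.png", optimize=True)
|
|     print("decoded bytes:", 16384 * 16384)
|     print("file bytes   :", os.path.getsize("big.png"))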
| Domainzsite wrote:
| This is pure internet mischief at its finest. Weaponizing fake
| JPEGs with valid structure and random payloads to burn botnet
| cycles? Brilliant. Love the tradeoff thinking: maximize crawler
| cost, minimize CPU. The Huffman bitmask tweak is chef's kiss.
| Spigot feels like a spiritual successor to robots.txt flipping
| you off in binary.
| time0ut wrote:
| JPEG is fascinating and quite complex. Here is a really excellent
| high level explanation of how it works:
|
| https://www.youtube.com/watch?v=0me3guauqOU
| larcanio wrote:
| Happy to realize real heroes do exist.
| ardme wrote:
| Old man yells at cloud, then creates a labyrinth of mirrors for
| the images of the clouds to reflect back on each other.
| bvan wrote:
| Love this.
| sim7c00 wrote:
| love how u speak about pleasing bots and them getting excited :D
| fun read, fun project. thanks!
| mhuffman wrote:
| I don't understand the reasoning behind the "feed them a bunch of
| trash" option when it seems that if you identify them (for
| example by ignoring a robots.txt file) you can just keep them
| hung up on network connections or similar without paying for
| infinite garbage for crawlers to ingest.
| schroeding wrote:
| The "poisoning the data supply" angle seems to be a common
| motive, similar to tools like nightshade[1] (for actual images
| and not just garbage data).
|
| [1] https://nightshade.cs.uchicago.edu/whatis.html
| thayne wrote:
| I'm curious how the author identifies the crawlers that use
| random user agents and distinct IP addresses per request. Is
| there some other indicator that can be used to identify them?
|
| On a different note, if the goal is to waste the bot's
| resources, one potential improvement could be to use very
| large images with repeating structure that compress extremely
| well as JPEGs for the templates, so that it takes more RAM and
| CPU to decode them, with relatively little CPU and RAM required
| to generate them and little bandwidth to transfer them.
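|
| A rough sketch of that idea with Pillow (dimensions picked
| arbitrarily): an image with trivially repeating content stays
| small on disk as a JPEG but costs the client a large decode
| buffer.
|
|     import os
|     from PIL import Image
|
|     # Flat green 8192 x 8192 RGB: ~192 MiB once decoded.
|     big = Image.new("RGB", (8192, 8192), (0, 128, 0))
|     big.save("template.jpg", quality=75)
|
|     print("file KB  :", os.path.getsize("template.jpg") // 1024)
|     print("decode MB:", 8192 * 8192 * 3 // (1024 * 1024))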
___________________________________________________________________
(page generated 2025-07-12 23:00 UTC)