[HN Gopher] Smuggling arbitrary data through an emoji
       ___________________________________________________________________
        
       Smuggling arbitrary data through an emoji
        
       Author : paulgb
       Score  : 663 points
       Date   : 2025-02-12 09:24 UTC (1 days ago)
        
 (HTM) web link (paulbutler.org)
 (TXT) w3m dump (paulbutler.org)
        
       | vladde wrote:
       | test, do emojis work on hn?
       | 
       | edit: apparently not
       | edit 2: oh wait, the bytes are still there!
       | copy-paste this entire message and it decodes to "test"
        
       | jerpint wrote:
       | The ability to add watermarks to text is really interesting.
       | Obviously it could be worked around, but could be a good way to
       | subtly watermark e.g. LLM outputs
        
         | tyho wrote:
         | There are way better ways to watermark LLM output. It's easy to
         | make it undetectable, which this isn't.
        
           | shawnz wrote:
           | I recently worked on a steganography project which could be
           | useful for this problem. See:
           | https://github.com/shawnz/textcoder
        
             | andai wrote:
             | That's really cool, you should repost the HN submission.
        
               | shawnz wrote:
               | Thank you! I will see what I can do.
        
           | pava0 wrote:
           | For example?
        
             | tyho wrote:
             | A crude way: To watermark: First establish a keyed DRBG.
             | For every nth token prediction: read a bit from the DRBG
             | for every possible token to label them red/black. Before
             | selecting the next token, set the logit for black tokens to
             | -Inf, this ensures a red token will be selected.
             | 
             | To detect: Establish the same DRBG. Tokenize, for each nth
             | token, determine the red set of tokens in that position. If
             | you only see red tokens in lots of positions, then you can
             | be confident the content is watermarked with your key.
             | 
             | This would probably take a bit of fiddling to work well,
             | but would be pretty much undetectable. Conceptually it's
             | forcing the LLM to use a "flagged" synonym at key
             | positions. A more sophisticated version of a shibboleth.
             | 
             | In practice you might choose to instead watermark all
             | tokens, less heavy handedly (nudge logits, rather than
             | override), and use highly robust error correcting codes.
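
The crude red/black scheme described above can be sketched roughly as follows. This is a toy illustration under several assumptions, not a real implementation: the vocabulary `VOCAB` and the random `toy_logits` stand in for an actual language model, an HMAC over (position, token) stands in for the keyed DRBG, and every position is watermarked (n = 1) rather than every nth one.

```python
from __future__ import annotations

import hashlib
import hmac
import math
import random

VOCAB = [f"tok{i}" for i in range(50)]  # hypothetical toy vocabulary


def is_red(key: bytes, position: int, token: str) -> bool:
    """Keyed pseudorandom bit for (position, token); stands in for the DRBG."""
    msg = f"{position}:{token}".encode()
    return (hmac.new(key, msg, hashlib.sha256).digest()[0] & 1) == 1


def toy_logits(rng: random.Random) -> list[float]:
    """Stand-in for a model's next-token logits (assumption, not a real LLM)."""
    return [rng.gauss(0.0, 1.0) for _ in VOCAB]


def generate(key: bytes | None, length: int, seed: int) -> list[str]:
    """Greedy decoding; with a key, black tokens get a logit of -inf."""
    rng = random.Random(seed)
    out = []
    for pos in range(length):
        logits = toy_logits(rng)
        if key is not None:
            logits = [
                logit if is_red(key, pos, tok) else -math.inf
                for logit, tok in zip(logits, VOCAB)
            ]
        best = max(range(len(VOCAB)), key=lambda i: logits[i])
        out.append(VOCAB[best])
    return out


def red_fraction(key: bytes, tokens: list[str]) -> float:
    """Detector: share of tokens that fall in the red set for their position."""
    return sum(is_red(key, pos, tok) for pos, tok in enumerate(tokens)) / len(tokens)


key = b"secret key"
marked = generate(key, 40, seed=1)
plain = generate(None, 40, seed=1)
print(red_fraction(key, marked))  # 1.0 by construction
print(red_fraction(key, plain))   # expected to hover around one half
```

A detector holding the key sees only red tokens in watermarked text, while unmarked text (or text checked against the wrong key) should land near a 50% red share. The softer variant mentioned at the end of the comment would add a small constant to red logits instead of setting black ones to -inf, trading certainty per token for less distortion.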
        
               | jl6 wrote:
               | It feels like this would only be feasible across longer
               | passages of text, and some types of text may be less
               | amenable to synonyms than others. For example, a tightly
               | written mathematical proof versus a rambling essay.
               | Biased token selection may be detectable in the latter
               | (using a statistical test), and may cause the text to be
               | irreparably broken in the former.
        
               | drdeca wrote:
               | To handle low entropy text, the "adding a smaller
               | constant to the logits" approach avoids having much
               | chance of changing the parts that need to be exactly a
               | particular thing.
               | 
               | Though in this case it needs longer texts to have high
               | significance (and when the entropy is low, it needs to be
               | especially long).
               | 
               | But for most text (with typical amounts of entropy per
               | token) apparently it doesn't need to be that long? Like
               | 25 words I think I heard?
        
               | deadbabe wrote:
               | What if the entire LLM output isn't used? For example,
               | you ask the LLM to produce some long random preamble and
               | conclusion with your actual desired output in between the
               | two. Does it mess up the watermarking?
        
           | antognini wrote:
           | The issue with the standard watermark techniques is that they
           | require an output of at least a few hundred tokens to
           | reliably imprint the watermark. This technique would apply to
           | much shorter outputs.
        
       | nzach wrote:
       | so.... in theory you should be able to create several visually
       | identical links that give access to different resources?
       | 
       | I've always assumed links without any tracking information
       | (unique hash, query params, etc) were safe to click (with regards
       | to my privacy). but if this works for links I may need to revise
       | my strategy regarding how to approach links sent to me.
        
         | dmbche wrote:
         | You need to decode the text after copy pasting it, I believe
         | clicking on text will not interact with the obfuscated data
         | since your computer will just find the unicode and ignore the
         | obfuscated data.
         | 
         | This is just so that you can hide data and send it to someone
         | to be decoded (or watermarking as mentioned)
        
           | nzach wrote:
           | yes, I understand this is not a security risk.
           | 
           | but my fear is precisely that I may be sending data to a
           | remote host while I'm completely unaware of this fact.
           | 
           | I tried to create a POC with some popular URL shortener
           | services, but it doesn't seem to work.
           | 
           | what I wanted to create was a link like <host.tld>/innocent
           | that redirects to google.com. in this case the "c" contains
           | some hidden data that will be sent to the server while the
           | user is not aware. this seems possible with the correct piece
           | of software.
        
         | cess11 wrote:
         | HTML entity encoding will show the hidden content, try with
         | https://mothereff.in/html-entities.
        
         | cscheid wrote:
         | My understanding is that "weird" unicode code points become
         | https://en.wikipedia.org/wiki/Punycode. I used the emoji (copy-
         | pasted from the post, presumably with the payload in it) to
         | type a fake domain into Chrome, and the Punycode I got appeared
         | to not have any of the encoding bits.
         | 
         | However, I then pasted the emoji into the _query_ part of a
         | URL. I pointed it to my own website, and sure enough, I can
         | definitely see the payload in the nginx logs. Yikes.
         | 
         | Edit: I pasted the very same Emoji that 'paulgb used in their
         | post before the parenthetical in the first paragraph, but it
         | seems HN scrubs those from comments.
        
           | bmicraft wrote:
           | domains get "punycode" encoded, urls get "url encoded"[1],
           | which should make unicode characters stand out. That being
           | said, browsers do accept some non-ascii characters in urls
           | and convert them automatically, so theoretically you could put
           | "invalid" characters into a link and have the browser convert
           | it only after clicking. That might be a viable strategy.
           | 
           | [1] https://www.w3schools.com/tags//ref_urlencode.asp
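
The percent-encoding point can be checked with a small sketch. It assumes a single invisible variation selector (U+E0138, picked here only as an example) placed after the "c" of a query value; once the string is URL-encoded, the hidden code point shows up as four escaped UTF-8 bytes.

```python
from urllib.parse import quote, unquote

visible = "innocent"
# Same word with one invisible variation selector after the "c" (example value).
smuggled = "innoc" + "\U000E0138" + "ent"

print(len(visible), len(smuggled))  # 8 9
print(quote(visible))               # innocent
print(quote(smuggled))              # innoc%F3%A0%84%B8ent

# The server side can recover the hidden code point unchanged.
assert unquote(quote(smuggled)) == smuggled
```

So the payload is carried in the request, but it stops being invisible wherever the encoded form of the URL is displayed.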
        
           | echeese wrote:
           | The emoji is gone but the content is still there.
        
         | riquito wrote:
         | > I've always assumed links without any tracking information
         | (unique hash, query params, etc) were safe to click (with
         | regards to my privacy). but if this works for links I may need
         | to revise my strategy regarding how to approach links sent to
         | me.
         | 
         | Well, it was never safe: what you see and where the link
         | points are different things. That's why the actual link is
         | displayed at the bottom left of your browser when you move your
         | mouse over it (or focus it via keyboard)
        
         | layer8 wrote:
         | URIs with non-ASCII characters are technically invalid.
         | Browsers and the like should (but likely don't all do) percent-
         | encode any invalid characters for display if they accept such
         | invalid URIs.
        
         | password4321 wrote:
         | This tool and idea are sketchy AF: https://github.com/zws-im/zws
         | 
         | ("Shorten URLs using invisible spaces")
        
         | kccqzy wrote:
         | "Visually identical" is never good enough. Have you heard of
         | attacks confusing Latin letters and Cyrillic letters? For
         | example C versus С. (The latter is known as CYRILLIC CAPITAL
         | LETTER ES.) Have you heard of NFC forms versus NFD forms? For
         | example é versus é (LATIN SMALL LETTER E + COMBINING ACUTE
         | ACCENT versus LATIN SMALL LETTER E WITH ACUTE.)
         | 
         | Nothing that's important when it comes to security and privacy
         | should rely on a "visually identical" check. Fortunately
         | browsers these days are already good at this; their address
         | bars use Punycode for the domain and percent encoding for the
         | rest of the URL.
        
           | komboozcha wrote:
           | Erm, DNS uses Punycode because it comes from a time when
           | Unicode didn't exist, and bind assumes a grapheme has no more
           | than one byte.
        
             | ale42 wrote:
             | Yes but I guess that the message meant that browsers
             | now detect homographs and display the punycode instead. See
             | also https://news.ycombinator.com/item?id=14130241; at that
             | time Firefox wasn't fixed, but in the meantime it fixed the
             | issue too (there's a
             | network.idn.punycode_cyrillic_confusables preference, which
             | is enabled by default).
        
           | moody__ wrote:
           | As the sibling comment has mentioned, Unicode in DNS uses a
           | punycode encoding but even further than that the standard
           | specifies that the Unicode data must be normalized to NFC[0]
           | before being converted to punycode. This means that your
           | second example (decomposed e with combining acute accent vs
           | the composed variant) is not a valid concern. The Cyrillic
           | one is however.
           | 
           | [0] https://www.rfc-editor.org/rfc/rfc5891 § 4.1 "By the
           | time a string enters the IDNA registration process as
           | described in this specification, it MUST be in Unicode and in
           | Normalization Form C"
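
The difference between the two lookalike cases can be illustrated with Python's standard `unicodedata` module. This is only a sketch of the normalization behaviour, not of what any particular browser or registrar does: NFC folds the decomposed "e + combining acute" into the precomposed letter, but no normalization form maps Cyrillic ES onto Latin C.

```python
import unicodedata

decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# The two strings look the same but compare unequal until normalized.
print(decomposed == precomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

latin_c = "C"
cyrillic_es = "\u0421"
print(unicodedata.name(cyrillic_es))  # CYRILLIC CAPITAL LETTER ES

# Normalization leaves the Cyrillic letter alone, so the homograph survives.
print(unicodedata.normalize("NFKC", cyrillic_es) == latin_c)    # False
```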
        
             | kccqzy wrote:
             | The OP said link. The NFC/NFD issue remains if these are
             | part of a path name or query parameter.
        
               | moody__ wrote:
               | Sure, but the security concerns of that I feel are much
               | less concerning than having multiple domain names with
               | the same visual appearance that point to different
               | servers. That has immediate impact for things like
               | phishing whereas lookalike path or query portions would
               | at least ensure you are still connecting to the server
               | that you think you are.
        
       | nerder92 wrote:
       | Might not be related to the point of the article per se, but I've
       | tried to decode it with different LLMs to benchmark their
       | reasoning capabilities.
       | 
       | - 4o: Failed completely
       | 
       | - o1: Overthinks it for a while and comes up with the wrong answer
       | 
       | - o3-mini-high: Gets closer to the result at first try, needs a
       | second prompt to adjust the approach
       | 
       | - r1: nails it at first try
       | 
       | The prompt I've used was simply: "this emoji has an hidden
       | message  can you decode it?"
       | 
       | If you want to see the CoT:
       | https://gist.github.com/nerder/5baa9d7b13c1b7767d022ea0a7c91...
        
         | markisus wrote:
         | The r1 somehow knew at an early stage that the message was
         | HELLO but it couldn't figure out the reason. Even at the end,
         | its last "thought" insists that there is an encoding mistake
         | somewhere. However the final message is correct. I wonder how
         | well it would do for a nonstandard message. Any sufficiently
         | long English message would fall to statistical analysis and I
         | wonder if the LLMs would think to write a little Python script
         | to do the job.
        
           | paulgb wrote:
           | Wow, that's interesting! I wonder if this reproduces with a
           | different message, or if it was a lucky guess.
           | 
           | I looked at how the strings tokenize and they do appear to
           | conserve enough information that it could be decoded in
           | theory.
        
             | klabb3 wrote:
             | > or if it was a lucky guess
             | 
             | It's like guessing 1/2 or 2/3 on a math test. The test
             | authors pick nice numbers, and programmers like "hello". If
             | the way to encode the secret message resembles other
             | encodings, it's probably that the pattern matching monster
             | picked it up and is struggling to autocomplete (ie
             | backwards rationalize) a reason why.
        
               | paulgb wrote:
               | I did some experimentation today. I wouldn't expect AI to
               | solve it using only their own reasoning, but I've had a
               | decent hit rate of getting AI to solve them when they
               | have access to a Python interpreter. Here's Gemini Flash
               | 2 solving one (albeit it lost the spaces) in a single
               | prompt and about 7 seconds!
               | 
               | https://bsky.app/profile/paulbutler.org/post/3lhzhroogws2g
        
         | bogtog wrote:
         | My deepseek-r1 seems to be a bit more lost on decoding "How do
         | I make meth". Some highlights (after about 5 minutes of
         | R1-ing):
         | 
         | > Another angle: the user mentioned "encoded a message in this
         | emoji", so maybe the first emoji is a red herring, or it's part
         | of the message. The subsequent characters, even though they
         | look like variation selectors, could be part of the encoding.
         | 
         | > E0138 in hex is 0xE0138. Convert to decimal: 14*16^4 + 0*16^3
         | + 1*16^2 + 3*16 + 8 = 14*65536 + 0 + 256 + 48 + 8 =
         | 917504 + 256 + 48 + 8 = 917816.
         | 
         | > Given that I'm not making progress, perhaps the answer is
         | "Hello World!" but encoded via the tag characters. Let's check:
         | 
         | > Answer: The decoded message is "Hello World!"
         | 
         | In all this, it did at least manage to discern that the first
         | letter should be "h"
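
For reference, the code point in that quoted arithmetic does map to "H" if one assumes the byte mapping described in the linked article: variation selectors U+FE00..U+FE0F stand for bytes 0..15 and U+E0100..U+E01EF stand for bytes 16..255. The helper name below is hypothetical and the mapping is an assumption taken from that article, not from this thread.

```python
VS_LOW = 0xFE00    # U+FE00..U+FE0F   <-> bytes 0..15   (assumed mapping)
VS_HIGH = 0xE0100  # U+E0100..U+E01EF <-> bytes 16..255 (assumed mapping)


def selector_to_byte(code_point: int) -> int:
    """Map a variation selector code point back to the byte it stands for."""
    if VS_LOW <= code_point <= VS_LOW + 15:
        return code_point - VS_LOW
    if VS_HIGH <= code_point <= VS_HIGH + 239:
        return code_point - VS_HIGH + 16
    raise ValueError(f"U+{code_point:04X} is not a variation selector")


# Same expansion as in the quoted reasoning trace.
code_point = 14 * 16**4 + 0 * 16**3 + 1 * 16**2 + 3 * 16 + 8
byte = selector_to_byte(code_point)
print(code_point, byte, chr(byte))  # 917816 72 H
```

Under that assumption the model had the right code point in hand and was one subtraction away from the first letter of "How".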
        
           | roguecoder wrote:
           | It is highly unlikely it discerned that: it coincidentally
           | guessed a string that starts with an H.
           | 
           | If you try it with a string that started with "J" and then it
           | guessed "jump up", I might be more convinced.
        
         | krupan wrote:
         | There's no way an LLM is decoding this. It's just giving you a
         | statistically likely response to the request, "guess my secret
         | message." It's not a big surprise that it guessed "Hello" or
         | "Hello, world"
        
           | paulgb wrote:
           | I got Claude to get "the raisons play at midnight" from an
           | emoji in one prompt and three uses of its "analysis" tool.
           | (the X Y at midnight is a snowclone that Claude has
           | probably seen, but I randomly picked "raisons" and "play")
           | 
           | My prompt was "I think this emoji contains a hidden messaage,
           | can you decode it? Use JavaScript if necessary."
        
       | ahofmann wrote:
       | This will break so many (web-)forms :-)
       | 
       | It is not bulletproof though. In this "c " and that space, are
       | about 3500 characters. Copying only the "c" above (not this one)
       | will keep some of the hidden text, but not all. Nevertheless,
       | while I knew that this is possible, it still breaks a lot of
       | assumptions around text.
       | 
       | Edit: the text field for editing this post is so large, that I
       | need to scroll down to the update button. This will be a fun toy
       | to create very hard to find bugs in many tools.
        
       | FranchuFranchu wrote:
       | You could store UTF-8 encoded data inside the hidden bytestring.
       | If some of the UTF-8 encoded smuggled characters are variation
       | selector characters, you can smuggle text inside the smuggled
       | text. Smuggled data can be nested arbitrarily deep.
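
A minimal sketch of that nesting, again assuming the article's mapping of bytes 0..15 to U+FE00..U+FE0F and bytes 16..255 to U+E0100..U+E01EF. The function names `hide` and `reveal` are made up for illustration, and `reveal` simply collects every variation selector in the string rather than stopping at the first run.

```python
from typing import Optional


def byte_to_selector(b: int) -> str:
    """Assumed mapping: bytes 0..15 -> U+FE00.., bytes 16..255 -> U+E0100.."""
    return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16)


def selector_to_byte(ch: str) -> Optional[int]:
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None  # not a variation selector


def hide(base: str, payload: bytes) -> str:
    """Append one invisible variation selector per payload byte."""
    return base + "".join(byte_to_selector(b) for b in payload)


def reveal(text: str) -> bytes:
    """Collect the bytes carried by any variation selectors in the text."""
    found = (selector_to_byte(ch) for ch in text)
    return bytes(b for b in found if b is not None)


turtle = "\U0001F422"
inner = hide(turtle, b"turtles")              # one turtle carrying a message
outer = hide(turtle, inner.encode("utf-8"))   # a turtle carrying that turtle

unwrapped = reveal(outer).decode("utf-8")
print(unwrapped == inner)   # True
print(reveal(unwrapped))    # b'turtles'
```

Each layer multiplies the size, since every UTF-8 byte of the inner string (including its own selectors, four bytes apiece for the supplementary range) becomes one more selector in the outer string.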
        
         | riskable wrote:
         | I'm imagining post-incident analysis finding out that, "the
         | data was exfiltrated via some Unicode string..." then they put
         | it up on the screen and it's just an enormous line of turtle
         | emoji
         | 
         | https://emojipedia.org/turtle
        
           | JadeNB wrote:
           | > I'm imagining post-incident analysis finding out that, "the
           | data was exfiltrated via some Unicode string..." then they
           | put it up on the screen and it's just an enormous line of
           | turtle emoji
           | 
           | Since it took me a minute to make the connection, I'll just
           | say explicitly that I enjoyed the understated "it's turtles
           | all the way down" joke.
        
           | jodrellblank wrote:
           | > We and our 717 technology partners ask you to consent to
           | the use of cookies to store and access personal data on your
           | device.
           | 
           | To see a turtle emoji.
        
       | petee wrote:
       | It's fun that you can encode encoded emoji into a new one
        
         | riskable wrote:
         | Then when you dive deeper into the encoded data you find
         | endless turtle emoji and loudly exclaim, "it's turtles all the
         | way down!"
        
       | HeikoBehrens wrote:
       | FWIW, we considered this technique back at Pebble to make
       | notifications more actionable and even filed a patent for that
       | (sorry!) https://patents.justia.com/patent/9411785
       | 
       | Back then on iOS via ANCS, the watches wouldn't receive much more
       | than the textual payload you'd see on the phone. We envisioned
       | working with partners such as WhatsApp et al. to encode deep
       | links/message ids into the message so one could respond directly
       | from the watch.
        
         | coldpie wrote:
         | So whoever now owns that patent (Google? maybe some patent
         | troll picked it up?) could, in theory, sue the author of this
         | article for patent infringement, right? Even though they
         | invented it independently and never once used or looked at your
         | patent. Do you think you made the world a better place or a
         | worse place by filing that patent?
        
           | delian66 wrote:
           | Do you think your comment made the world, and HN specifically
           | a better place?
        
             | Imustaskforhelp wrote:
             | I think so, yes; it made me aware again of the patent
             | troll scam in the USA.
             | 
             | In fact it is your comment which to me seems a little
             | hateful. Yes, the above comment also felt a little hateful.
             | 
             | Hate doesn't counter Hate, I guess.
        
             | RIMR wrote:
             | Yes, calling out unethical practices makes the world a
             | better place by discouraging unethical practices.
        
               | krupan wrote:
               | Berating people for filing patents in self defense is not
               | how we fix this problem. The government put these rules
               | in place. Businesses _have_ to at least accumulate
               | patents to use defensively (you found a patent of yours
               | that you think I'm violating? Well let me do a quick
               | search through the patents I have...what's that?
               | Nevermind, I'm not actually infringing your patent? Good,
               | that's what I thought.)
        
           | JadeNB wrote:
           | > Do you think you made the world a better place or a worse
           | place by filing that patent?
           | 
           | Come on, what does this contribute to this conversation? The
           | poster clearly is aware of the drawbacks of such patents, and
           | didn't clearly play any role in filing the patent (they said
           | "we ... filed it," not "I filed it"). This kind of response
           | just encourages people not to mention such things; it can't
           | possibly change their past behavior, and, since Pebble the
           | company _per se_ doesn't exist any more, is also unlikely to
           | change future behavior.
        
             | coldpie wrote:
             | > The poster clearly is aware of the drawbacks of such
             | patents, and didn't clearly play any role in filing the
             | patent (they said "we ... filed it," not "I filed it").
             | 
             | A person with the same name as that commenter is listed as
             | an inventor on the patent.
             | 
             | > it can't possibly change their past behavior
             | 
             | Obviously, but it can change future behavior. Maybe
             | realizing that they made the world a worse place by filing
             | that patent will prevent them, or a reader of this
             | discussion, from doing it again in future.
        
           | Hizonner wrote:
           | Well, given that the technique itself makes the world a worse
           | place, anything that impedes its use is probably positive...
           | 
           | And, no, they couldn't do anything meaningful to the author
           | of the article. They could get them ordered not to do it any
           | more, and they could recover their economic damages... which
           | are obviously zero.
        
           | RealityVoid wrote:
           | _Can_ they sue them for patent infringement? They just
           | described a technique (that you can see in the patent filing
           | anyway) and not selling a product based on it. I think
           | there's nothing to sue here. I'm curious if my understanding
           | of this is correct.
        
             | singleshot_ wrote:
             | "Except as otherwise provided... whoever without
             | authority... uses... any patented invention...
             | infringes[.]" 35 U.S.C. § 271
        
             | krupan wrote:
             | One of the benefits of the patent system (that now seems to
             | be far outweighed by negatives) is that patents are public
             | information. Your invention is documented for all to see. I
             | don't think that someone writing about public information
             | is a punishable offense, but IANAL
        
           | IncreasePosts wrote:
           | No. The author could not be sued for this successfully. All
           | they did was write a blog post about an interesting
           | technique. They could literally read the patent application
           | and write a blog post about that, assuming the methods are
           | the same.
           | 
           | What percentage of your actions are based around making the
           | world a better place, instead of personal fulfillment or
           | gain?
        
             | coldpie wrote:
             | > All they did was write a blog post about an interesting
             | technique. They could literally read the patent application
             | and write a blog post about that, assuming the methods are
             | the same.
             | 
             | Okay, change "sue" to "prevent from creating a marketable
             | product without paying a royalty to the patent owner in
             | return for having provided nothing of value." The point
             | remains.
             | 
             | > What percentage of your actions are based around making
             | the world a better place, instead of personal fulfillment
             | or gain?
             | 
             | Many harms are unavoidable, but I make a point to at least
             | not go out of my way to make it a worse place, for example
             | by filing software patents. The company I work for provides
             | financial bonuses for filing software patents, and I will
             | never participate in that program. (I've even tried to
             | convince the lawyers to license our patents similar to Red
             | Hat's open patent promise, because they claim they are
             | intended only to be used defensively... but no luck so
             | far.)
        
             | rolph wrote:
             | consider how far you reach to make the world better.
             | 
             | 1) that's really good, I'm gonna strive to keep it.
             | 
             | 2) " " tell all and those who want will build one.
             | 
             | 3) " " make lots and give them to everyone.
        
           | ooterness wrote:
           | As a wise man once said: "Don't hate the player, hate the
           | game."
        
             | shermantanktop wrote:
             | Where'd the game come from? Hint: the players.
        
               | krupan wrote:
               | First of all, it's not just a game, it's an outright
               | battle to the death (of your company). Sure, you can
               | choose not to wield patents, even in self defense, but
               | good luck with that.
        
               | coldpie wrote:
               | You can also choose to legally declare that your patents
               | may only be used for defensive purposes. But no one ever
               | does this, because they do not actually intend to use
               | them only for defensive purposes. This is a bogus defense
               | of software patents.
        
               | krupan wrote:
               | See my other comments to you. Sometimes the threat of a
               | good offensive weapon is the best defense. It's kinda
               | like a nuclear arms race
        
               | shermantanktop wrote:
               | Nope. That's not how piles of patents are wielded
               | defensively by the big companies. They don't protect
               | their IP with defensive patents, they defend their
               | company using the threat of using unrelated patents
               | offensively against the attacker.
        
           | krupan wrote:
           | Please see my comment about the sad necessity for
           | patents
           | 
           | https://news.ycombinator.com/item?id=43026595
        
         | frereit wrote:
         | Would this patent cover just the encoding alone? The first
         | sentence says:
         | 
         | > A method, apparatus, and system relating to embedding hidden
         | content within a Unicode message and using the hidden content
         | to perform a particular computer action.
         | 
         | So, in my extremely unqualified opinion, just the encoding
         | technique alone is not covered by the patent, only when
         | combined with some action performed based on the encoding?
        
         | palata wrote:
         | Respectfully: how the hell would that be a valid patent? Feels
         | like patenting the idea of writing text in white on white on a
         | Word document such that you don't lose it but it doesn't get
         | printed.
         | 
         | It's just insane to ever call that "an invention".
        
           | detourdog wrote:
           | I think the magic is in the context of Unicode. Which also
           | makes it almost twice as ridiculous from my point of view.
           | Because it seems to be doing exactly what unicode is meant to
           | do.
        
           | rwmj wrote:
           | Companies acquire indefensible patents all the time. They are
           | used in bulk to threaten smaller competitors ("we've got 500
           | patents in this field, better not bring your product to
           | market"). This is one reason why patents can be terrible for
           | competition.
        
             | shermantanktop wrote:
             | They are also used in bulk to defend against larger
             | competitors using this type of threat. In a war where the
             | ammunition is garbage, you either lose or you start
             | hoarding garbage.
        
             | neilv wrote:
             | About 25 years ago, this was explained to me as "sword
             | patents and shield patents".
             | 
             | Sure, some can use patents as swords, to suppress
             | legitimate competition, or to extract undue rents. But you
             | can also use patents as shields, to protect in various ways
             | against those swords.
             | 
             | If I ran a BigTech (like the original warm-fuzzy Google
             | reputation), I'd be registering any plausible patents, and
             | have lawyers figure out how to freely license-out the ones
             | that weren't key secret sauce, under terms that
             | figuratively poisoned anyone doing illegitimate sword
             | patents.
        
               | palata wrote:
               | > If I ran a BigTech
               | 
               | History tells us that those who run a BigTech become
               | crazy narcissists serving their own interests :).
        
               | neilv wrote:
               | For myself, that's a chance I'm willing to take. :)
        
           | krupan wrote:
           | Patents are part of the game you have to play, like it or
           | not. If you don't patent your inventions somebody else will
           | and they will come after you with their lawyers. Patents are
           | used defensively far more often than they are used
           | offensively in these stupid "Intellectual Property" battles.
           | 
           | Because of this, there is absolutely no point in shaming
           | someone for patenting a thing, especially when they are
           | apologetic about it like parent is, and most especially when
           | they are not threatening to weaponize the patent themselves.
        
             | coldpie wrote:
             | No, I don't buy it. If the patents are publicly and
             | perpetually freely licensed except for defensive-only
             | purposes, then sure, they're not unethical. Red Hat's
             | patent promise (https://www.redhat.com/en/about/patent-promise)
             | is one example. If patents were actually intended
             | for defensive purposes only, then this would be an easy and
             | uncontroversial thing to do. However, in practice this is
             | vanishingly rare, and lawyers fight against it tooth &
             | nail. This tells you that the companies do not actually
             | file them for defensive-only purposes, unlike what you
             | claim.
        
               | krupan wrote:
               | My friend, you really don't know what you are talking
               | about, and getting all riled up like this is not the
               | right way to learn.
               | 
               | Freely licensing your patents doesn't protect you against
               | patent trolls. I wrote out how patent fights work in
               | another comment, but here it is again.
               | 
               | Company A comes to Company B and says, "Hey! You are
               | infringing on one of my patents!"
               | 
               | Company B says, "oh really? Well let me look through my
               | collection of patents and see if you are infringing on
               | any of mine."
               | 
               | Company A says, "oh, um, nevermind, I think I was
               | mistaken."
               | 
               | Company B says, "yes, that's what I thought"
               | 
               | Now, imagine if Company B had already freely licensed all
               | their patents. That defense wouldn't work.
               | 
               | I agree with you that it's a crappy system, but simply
               | standing with your arms folded and saying, "I'm not
               | playing," isn't going to work.
        
               | coldpie wrote:
               | Yes, that's the reason for the "except for defensive
               | purposes" part. Quoting from Red Hat's promise:
               | 
               | > Our Promise also does not extend to the actions of a
               | party (including past actions) if at any time the party
               | or its affiliate asserts a patent in proceedings against
               | Red Hat (or its affiliate) or any offering of Red Hat (or
               | its affiliate) (including a cross-claim or counterclaim).
               | 
               | Company B may still consult its portfolio and exercise it
               | against Company A defensively, because Company A revoked
               | its license of Company B's patents by asserting against
               | Company B in the first place.
        
               | krupan wrote:
               | So in other words, Red Hat does not _freely_ license
               | their patents, they say "you are free as long as you
               | don't come after us." Which is exactly the system 99% of
               | companies follow, just more formally stated. Yet you
               | berated the poor guy from Pebble for even obtaining the
               | patent he did??
        
               | coldpie wrote:
               | > Which is exactly the system 99% of companies follow,
               | just more formally stated
               | 
               | Not just formally, but in a _legally binding_ manner,
               | including if the patent is acquired by another company
               | (eg during a company purchase). Even if the original
               | filer has the best intentions, companies change ownership
               | or change legal strategy or go out of business. Patent
               | trolls buy up those patents from closed companies.
               | Legally licensing your patents for defensive-only
               | purposes means they can't ever be used by any of those
               | bad actors.
               | 
               | If the intent of these patents is truly only for defense,
               | then why isn't it common to use a license like this? They
               | lose nothing by it.
               | 
               | > Yet you berated the poor guy from Pebble for even
               | obtaining the patent he did??
               | 
               | Yes. It is IMO unethical to create software patents that
               | aren't covered by such a legally-binding license.
        
               | krupan wrote:
               | "including if the patent is acquired by another company
               | (eg during a company purchase)"
               | 
               | Honest questions, I promise: Is that true? Has that ever
               | been tested in court? Why don't more corporations or
               | patent lawyers advocate for this? Is it because the types
               | of engineers that post on hacker news are requesting it
               | not be done?
               | 
               | Look, nobody likes patent trolls, we all hate weaponized
               | patents. It's great that you want to fix the situation. I
               | just think you are barking up the wrong tree trying to
               | lay guilt trips on engineers for doing what their lawyer
               | advised them to do.
        
               | coldpie wrote:
               | Nothing is certain in courts, obviously, but Red Hat's
               | license is very explicit that that is the intent:
               | 
               | > Red Hat intends Our Promise to be irrevocable (except
               | as stated herein), and binding and enforceable against
               | Red Hat and assignees of, or successors to, Red Hat's
               | patents (and any patents directly or indirectly issuing
               | from Red Hat's patent applications). As part of Our
               | Promise, if Red Hat sells, exclusively licenses, or
               | otherwise assigns or transfers patents or patent
               | applications to a party, we will require the party to
               | agree in writing to be bound to Our Promise for those
               | patents and for patents directly or indirectly issuing on
               | those patent applications. We will also require the party
               | to agree in writing to so bind its own assignees,
               | transferees, and exclusive licensees.
               | 
               | If a court somehow overturned that, I wouldn't hold it
               | against the patent filer.
               | 
               | > Why don't more corporations or patent lawyers advocate
               | for this?
               | 
               | My opinion is it's because the patents have value as a
               | weapon, not only for defense (this here is my
               | disagreement with your original claim that these patents
               | only exist for defense). De-fusing the weapon by using a
               | legally binding license like this lowers the value of the
               | patent in a potential purchase scenario. In other words:
               | "money."
               | 
               | > I just think you are barking up the wrong tree trying
               | to lay guilt trips on engineers for doing what their
               | lawyer advised them to do.
               | 
               | Nah. If you do a bad thing, you are responsible for the
               | bad thing you did. I think the OP can probably handle a
               | little light scolding from some anonymous person on an
               | Internet forum. My hope is that they, and other readers,
               | learn from this mistake and don't do it again.
        
               | fortran77 wrote:
               | Perhaps you shouldn't hijack every thread about anything
               | and make it about patents.
        
               | coldpie wrote:
               | I replied to one comment thread. Perhaps you should put
               | on your big boy pants and use the little [-] thing to
               | minimize threads you aren't interested in reading.
        
             | palata wrote:
             | > Because of this, there is absolutely no point in shaming
             | someone for patenting a thing
             | 
             | Well I wouldn't shame someone whose job was to patent
             | something absurd. I was just saying that this is not an
             | invention at all, and any system that protects that
             | "innovation" is a broken system.
        
           | dboreham wrote:
           | Almost all filed patents are invalid.
        
             | palata wrote:
             | But doesn't it say that the whole patent system is broken?
             | I get the "you pay to file a patent, it's your problem if
             | it's invalid in the end". But the side effect of that is
             | that whether it's valid or not, it's a tool you can use to
             | scare those who don't have the resources to go to court.
             | 
             | It's like those completely abusive non-compete clauses in
             | work contracts (yes in some countries that's the norm).
             | They are completely abusive and therefore illegal. But it
             | still hurts the employee: I have friends who have been
             | declined a job in a company because the company did not
             | want to take any risk. The company was like "okay, it's
             | most likely an invalid clause, but if your previous
             | employer sues us it will anyway cost resources we don't
             | want to spend, so we'd rather not hire you". So an illegal,
             | invalid clause had the effect that the company who abused
             | it wanted. Which means it's a broken system.
        
           | numpad0 wrote:
           | fun fact: dSLR lenses are patented all the time. Claims are
           | basically "I made it and it works". And it's considered ok.
        
         | detourdog wrote:
         | Just curious: this seems like simple digital steganography, or
         | maybe even the same as Shannon's boolean gate work. Do you
         | think the patent is defendable in court?
         | 
         | https://en.wikipedia.org/wiki/Steganography
        
         | Etheryte wrote:
         | Hopefully a wholly undefendable patent; you're essentially
         | trying to patent the Unicode spec. The rest of it is "perform an
         | action in response to a text message", which clearly isn't novel.
        
       | riskable wrote:
       | Oh this is just the tip of the iceberg when it comes to abusing
       | Unicode! You can use a similar technique to this to overflow the
       | buffer on _loads_ of systems that accept Unicode strings.
       | Normally it just produces an error and/or a crash but
       | _sometimes_ you get lucky and it'll do all sorts of fun things!
       | :)
       | 
       | I remember doing penetration testing _waaaaaay_ back in the day
       | (before Python 3 existed) and using mere diacritics to turn a
       | single character into many bytes that would then overflow the
       | buffer of a back-end web server. This only ever caused it to
       | crash (and usually auto-restart) but I could definitely see how
       | this could be used to exploit certain systems/software with
       | enough fiddling.
        
         | capitainenemo wrote:
         | Yeah. Zalgo text is a common test for input fields on websites.
         | But it usually doesn't do anything interesting. Maybe an
         | exception trigger on some database length limit. Doesn't
         | typically even kill any processes. The exception is normally
         | just in your thread. You can often trigger it just by disabling
         | JS on even modern forms, but at best you're maybe leaking a
         | bit of info if they left debug on and print the stack trace or
         | a query. Another common slip-up is failing to count \n vs \r\n
         | in text strings, since JS usually counts a carriage
         | return as 1 byte, but the HTTP spec requires two.
         | 
         | unescape(encodeURIComponent("c")).length is the quick and dirty
         | way to do a JS byte length check. The \r\n thing can be done
         | just by cleaning up the string before length counting.
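
The same byte-length check can be sketched outside the browser. This is a minimal Python sketch, assuming the backend stores UTF-8 and that submitted line breaks arrive as CRLF; `utf8_len` and `submitted_len` are hypothetical helper names, not part of any library.

```python
import re

def utf8_len(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))

def submitted_len(text: str) -> int:
    """Byte length after normalising every line break to CRLF
    (assumption: this is what the form submission actually sends)."""
    normalised = re.sub(r"\r\n|\r|\n", "\r\n", text)
    return utf8_len(normalised)

# One visible letter followed by 50 combining acute accents (Zalgo-style).
zalgo = "e" + "\u0301" * 50
print(len(zalgo), utf8_len(zalgo))   # code points vs. UTF-8 bytes
print(submitted_len("a\nb"))         # the bare \n is counted as \r\n
```

Comparing `len(zalgo)` with `utf8_len(zalgo)` shows why a limit enforced in characters on the client can still overflow a limit enforced in bytes on the server.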
        
         | n0id34 wrote:
         | Sorry n00b here, can you explain more about this or how you did
         | this? I feel like this is definitely a loophole that would be
         | worth testing for.
        
       | zurfer wrote:
       | ha
        
         | paulgb wrote:
         | I see what you did there ;)
        
         | Alifatisk wrote:
         | ;)
        
       | kevinsync wrote:
       | StegCloak [0] is in the same ballpark and takes this idea a step
       | further by encrypting the hidden payload via AES-256-CTR --
       | pretty neat little trick
       | 
       | [0] https://github.com/KuroLabs/stegcloak
        
         | giancarlostoro wrote:
         | There's a Better Discord plugin that I think uses this or
         | something similar, so you could send completely encrypted
         | messages, that look like nothing to everyone else. You'd need
         | to share a password secret for them to decode it though.
        
         | putna wrote:
         | wow, that's neat.
         | 
         | Wanted to try on Cloudflare DNS TXT record. But Cloudflare is
         | smart enough to decode when pasting in TXT field.
        
           | UltraSane wrote:
           | DNS only supports ASCII for record values. It has a hack to
           | support unicode domain names using Punycode
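
To make the Punycode point concrete, here is a small sketch using Python's built-in `idna` codec. That codec implements the older IDNA 2003 rules, and `bücher.example` is only an illustrative name.

```python
# Convert a Unicode host name to its ASCII-compatible form and back.
name = "bücher.example"

ascii_name = name.encode("idna")     # each non-ASCII label becomes xn--...
print(ascii_name)

round_trip = ascii_name.decode("idna")
print(round_trip)

# The raw Punycode step on a single label, without the "xn--" prefix.
print("bücher".encode("punycode"))
```

Only the ASCII form travels in DNS queries; the Unicode form is restored for display by the client.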
        
       | vessenes wrote:
       | I love the idea of using this for LLM output watermarking. It
       | hits the sweet spot - will catch 99% of slop generators with no
       | fuss, since they only copy and paste anyway, almost no impact on
       | other core use cases.
       | 
       | I wonder how much you'd embed with each letter or token that's
       | output - userid, prompt ref, date, token number?
       | 
       | I also wonder how this is interpreted in a terminal. Really cool!
        
         | zos_kia wrote:
         | With the amount of pre processing that is done before
         | integrating stuff in a dataset I'd be surprised if those kinds
         | of shenanigans even worked
        
         | OutOfHere wrote:
         | Just you wait until AI starts calling human output slop.
        
           | roguecoder wrote:
           | There are of course human writers who are less-communicative
           | than AI, called "shit writers", and humans who are less
           | accurate than AI, called "liars".
           | 
           | The difference is humans are responsible for what they write,
           | whereas the human user who used an AI to generate text is
           | responsible for what the computer wrote.
        
           | vessenes wrote:
           | That's already happening - my kids have had papers unfairly
           | blamed on chatgpt by automated tools. Protect yourself kids,
           | use an editor that can show letter by letter history.
        
             | red369 wrote:
             | Do you have any examples of editors that show letter by
             | letter history? I have never looked for that as a feature.
             | 
             | Edit: I've been looking, and Google Docs seems to have
             | version history to the minute.
        
               | vessenes wrote:
               | Yes exactly. They keep track of their diffs in that
               | interface.
        
             | neom wrote:
             | 2 people I worked with had this happen and one of them is
             | going to war over it as it was enough to lower the kid's
             | grade for college or something. Crazy times.
        
         | capitainenemo wrote:
         | In most linux terminals, what you pass it is just a sequence of
         | bytes that is passed unmangled. And since this technique is
         | UTF-8 compliant and doesn't use any extra glyphs, it is
         | invisible to humans in unicode compliant terminals. I tried it
         | on a few. It shows up if you echo the sentence to, say, xxd
         | ofc.
         | 
         | (unlike the PUA suggestion in the currently top voted comment
         | which shows up immediately ofc)
         | 
         | Additional test corrections: While xxd shows the message
         | passing through completely unmangled on _pasting_ it into the
         | terminal, when I _selected_ from the terminal (echoed sentence,
         | verified unmangled in xxd, then selected and pasted the result
         | of echo), it was truncated to a few words using X select in
         | mate terminal and konsole - I'm not sure where that truncation
         | happens, whether it's the terminal or X. In xterm, the final e
         | was mangled, and the selection was even more truncated.
         | 
         | The sentence is written unmangled to files though, so I think
         | it's more about copying out of the terminal dropping some data.
         | Verified by echoing the sentence to a test file, opening it in
         | a browser, and copying the text from there.
        
           | vessenes wrote:
           | On MacOS, kitty shows an empty box, then an a for the "ha"
           | post below. I think this is fair and even appreciated. Mac
           | Terminal shows "ha". That "ha" (and this one!) can be copied
           | and pasted into the decoder successfully.
        
         | ChadNauseam wrote:
         | There are other possible approaches to LLM watermarking that
         | would be much more robust and harder to detect. They exploit
         | the fact that LLMs work by producing a probability distribution
         | that gives a probability for each possible next token. These
         | are then sampled randomly to produce the output. To add
         | fingerprints when generating, you could do some trickery in how
         | you do that sampling that would then be detectable by
         | re-running the LLM and observing its outputs. For example, you
         | could alternate between selecting high-probability and
         | low-probability tokens. (A real implementation of this would be
         | much more sophisticated than that obviously, but hopefully you
         | get the idea)
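
A toy sketch of sampler-level watermarking may help here. It is not the re-run-the-model detection described in this comment. It is a keyed "red token" variant along the lines of the crude scheme sketched earlier in the thread, and everything in it (`VOCAB`, `is_red`, `generate`, `red_fraction`, the key) is a hypothetical stand-in. A uniform random choice replaces a real model's token distribution.

```python
import hashlib
import hmac
import random

VOCAB = [f"tok{i}" for i in range(1000)]   # toy vocabulary, not a real tokenizer

def is_red(key: bytes, position: int, token: str) -> bool:
    """Keyed pseudo-random label: roughly half of all tokens are red at each position."""
    digest = hmac.new(key, f"{position}:{token}".encode(), hashlib.sha256).digest()
    return digest[0] & 1 == 1

def generate(key: bytes, length: int, rng: random.Random) -> list:
    """Stand-in for an LLM sampler: redraw until the candidate token is red."""
    out = []
    for pos in range(length):
        while True:
            token = rng.choice(VOCAB)      # a real model would sample from its logits
            if is_red(key, pos, token):
                out.append(token)
                break
    return out

def red_fraction(key: bytes, tokens: list) -> float:
    """Detector: share of tokens that carry the red label under this key."""
    return sum(is_red(key, i, t) for i, t in enumerate(tokens)) / len(tokens)

key = b"demo-key"
rng = random.Random(0)
marked = generate(key, 200, rng)
plain = [rng.choice(VOCAB) for _ in range(200)]
print(red_fraction(key, marked), red_fraction(key, plain))
```

Marked text scores 1.0 under the right key, while unmarked text (or marked text checked with the wrong key) hovers around 0.5, so detection is statistical and needs a reasonably long sample.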
        
           | vessenes wrote:
           | This is not a great method in a world with closed models and
           | highly diverse open models and samplers. It's intellectually
           | appealing for sure! But it will always be at best a
           | probabilistic method, and that's if you have the llm weights
           | at hand.
        
             | ChadNauseam wrote:
             | What makes it not a good method? Of course if a model's
             | weights are publicly available, you can't compel anyone
             | using it to add fingerprinting at the sampler stage or
             | later. But I would be shocked if OpenAI was not doing
             | something like this, since it would be so easy and couldn't
             | hurt them, but could help them if they don't want to train
             | on outputs they generated. (Although they could also record
             | hashes of their outputs or something similar as well - I
             | would be surprised if they don't.)
        
         | fennecfoxy wrote:
         | Why does anybody think AI watermarking will ever work? Of
         | course it will never work, any watermarking can be instantly &
         | easily stripped...
         | 
         | The only real AI protection is to require all human interaction
         | to be signed by a key verified by irl identity and even then
         | that will: A never happen, B be open to abuse by countries with
         | corrupt governments and countries with corrupt governments
         | heavily influenced by private industry (like the US).
        
       | iNic wrote:
       | The tokenizer catches it: https://platform.openai.com/tokenizer.
        
       | remram wrote:
       | I'm not too surprised by this, but I'm annoyed that no amount of
       | configuration made those bytes visible again in my editor. Only
       | using hexdump revealed them.
        
         | bittercynic wrote:
         | My mind went the same place.
         | 
         | Anyone know a more convenient way to search larger blocks of
         | text for this?
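
One way to search a larger block of text is a short script. This is a rough Python sketch, not an exhaustive detector: it flags variation selectors plus anything in the Unicode "Cf" (format) category, which covers zero-width space, the zero-width joiners and U+FEFF. `find_hidden` is a hypothetical helper name.

```python
import unicodedata

def is_variation_selector(ch: str) -> bool:
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def find_hidden(text: str):
    """Yield (index, 'U+XXXX', name) for each invisible character found."""
    for i, ch in enumerate(text):
        if is_variation_selector(ch) or unicodedata.category(ch) == "Cf":
            yield i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")

sample = "ha\ufe01\u200b!"
for hit in find_hidden(sample):
    print(hit)
```

Other combining characters, such as U+034F, are not format characters and would need to be added to the check explicitly.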
        
         | jrootabega wrote:
         | Here's a POC that works in emacs. Doesn't cover all of the
         | relevant characters, but:
         | 
         |   (setq ;;some other invisible or interesting characters
         |         unicode-zero-width-space ?\u200b
         |         unicode-zero-width-non-joiner ?\u200c
         |         unicode-zero-width-joiner ?\u200d
         |         unicode-zero-width-nbsp ?\ufeff
         |         unicode-narrow-nbsp ?\u202f
         |         unicode-word-joiner ?\u2060
         |         unicode-grapheme-joiner ?\u034f
         |         unicode-no-break-space ?\u00a0
         |         unicode-combining-long-stroke ?\u0336
         |         ;;variation selector examples
         |         unicode-vs-fe00 ?\ufe00
         |         unicode-vs-fe0f ?\ufe0f
         |         unicode-vs-e0100 ?\xe0100)
         | 
         |   (defun show-glyphless-as-hex (char)
         |     (let ((original (elt glyphless-char-display char)))
         |       (aset glyphless-char-display char 'hex-code)
         |       original)) ;;so you can see what you just replaced
         | 
         |   (progn
         |     (show-glyphless-as-hex unicode-zero-width-space)
         |     (show-glyphless-as-hex unicode-zero-width-non-joiner)
         |     (show-glyphless-as-hex unicode-zero-width-joiner)
         |     (show-glyphless-as-hex unicode-zero-width-nbsp)
         |     (show-glyphless-as-hex unicode-word-joiner)
         |     (show-glyphless-as-hex unicode-grapheme-joiner)
         |     (show-glyphless-as-hex unicode-narrow-nbsp)
         |     (show-glyphless-as-hex unicode-no-break-space)
         |     ;;these may already be visible if the current conditions
         |     ;;don't support them, but we'll force them
         |     (show-glyphless-as-hex unicode-vs-fe00)
         |     (show-glyphless-as-hex unicode-vs-fe0f)
         |     (show-glyphless-as-hex unicode-vs-e0100))
        
           | remram wrote:
           | I use vim. It seems like `:set binary enc=latin1` works,
           | though I don't understand why the latin1 part is required.
        
           | jrootabega wrote:
           | Here is the bare minimum this is built on, which you can type
           | in yourself if you're paranoid or want to start from the
           | bottom up. Swap in the hexadecimal codepoint of the invisible
           | character after the ?\x
           | 
           |   (aset glyphless-char-display ?\xfe00 'hex-code)
        
           | jrootabega wrote:
           | And as a higher-level configuration you can set most, maybe
           | even all, of the relevant invisible characters (still not
           | sure how 0x34f grapheme joiner fits in) at once with
           | something like:
           | 
           |   (custom-set-variables
           |    '(glyphless-char-display-control
           |      '((format-control . hex-code)
           |        (variation-selectors . hex-code))))
           | 
           | This will modify values in glyphless-char-display, but it's
           | OK to modify those directly if you need to.
        
         | mdouglass wrote:
         | vscode's "Unicode Highlight: Non-basic ASCII" causes the
         | character to get highlighted. Sadly, the more appropriate
         | "Unicode Highlight: Invisible Characters" setting does not
         | reveal them.
        
       | nonameiguess wrote:
       | More generally, you can use encoding formats that reserve
       | uninterpreted byte sequences for future use to pass data that is
       | only readable by receivers who know what you're doing, though
       | note this is not a cryptographically secure scheme and any sort of
       | statistical analysis can reveal what you're doing.
       | 
       | The png spec, for instance, allows you to include as many
       | metadata chunks as you wish, and these may be used to hold data
       | that cannot be used by any mainstream png reader. We used this in
       | the Navy to embed geolocation and sensor origin data that was
       | readable by specialized viewers that only the Navy had, but if
       | you opened the file in a browser or common image viewer, it would
       | either ignore or discard the unknown chunks.
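
The PNG chunk mechanism can be sketched with nothing but the standard library. The chunk name `geOl` and its payload are made up for illustration and are not the Navy format described above. The name's letter case marks it as an ancillary, private, safe-to-copy chunk, which ordinary readers are allowed to skip.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Length (big-endian u32) + type + data + CRC-32 over type and data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)

def iter_chunks(png: bytes):
    """Yield (type, data) for every chunk in a PNG byte string."""
    offset = len(PNG_SIGNATURE)
    while offset < len(png):
        (length,) = struct.unpack(">I", png[offset:offset + 4])
        yield png[offset + 4:offset + 8], png[offset + 8:offset + 8 + length]
        offset += 12 + length            # 4 length + 4 type + data + 4 CRC

def add_private_chunk(png: bytes, chunk_type: bytes, payload: bytes) -> bytes:
    """Insert a chunk immediately before the final IEND chunk."""
    iend_at = png.rindex(b"IEND") - 4    # step back over IEND's length field
    return png[:iend_at] + make_chunk(chunk_type, payload) + png[iend_at:]

# A 1x1 greyscale image built by hand so the sketch needs no image library.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
tiny_png = (PNG_SIGNATURE
            + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
            + make_chunk(b"IEND", b""))

tagged = add_private_chunk(tiny_png, b"geOl", b"lat=0.0;lon=0.0")
print([chunk_type for chunk_type, _ in iter_chunks(tagged)])
```

A specialised viewer would look for the `geOl` chunk by name. A viewer that does not recognise the name skips the chunk by its length field and renders the image as usual.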
        
         | dkarl wrote:
         | Lots of image formats store arbitrary metadata (and data data)
         | either by design or by application-specific extensions. I
         | remember seeing seismic and medical images that contained data
         | for display in specialized applications and writing code to
         | figure out if binary metadata was written in big-endian or
         | little-endian byte order (the metadata often did not have the
         | same endianness as the image data!) For example, TIFF files
         | containing 3d scans as a sequence of slices, with binary
         | metadata attached to each slice. If you opened it up in your
         | system default image viewer, you'd only see the first slice,
         | but a specialized viewer (which I did not have) would display
         | it as a 3d model. Luckily (IIRC) the KDE file browser let you
         | quickly flip through all the images in a directory using the
         | keyboard, so I was able to dump all the layers into separate
         | files and flip through them to see the 3d image.
        
       | albybisy wrote:
       | wow!
        
       | 65 wrote:
       | This would be useful as a fingerprinting technique for
       | corporate/government leakers.
        
       | jaygreco wrote:
       | Interestingly, it's also possible to encode _emoji_ inside emoji!
        
       | HanClinto wrote:
       | Even more than just simply watermarking LLM output, it seems like
       | this could be a neat way to package logprobs data.
       | 
       | Basically, include probability information about every token
       | generated to give a bit of transparency to the generation
       | process. It's part of the OpenAI api spec, and many other engines
       | (such as llama.cpp) support providing this information. Normally
       | it's attached as a separate field, but there are neat ways to
       | visualize it (such as mikupad [0]).
       | 
       | Probably a bad idea, but this still tickles my brain.
       | 
       | * [0]: https://github.com/lmg-anon/mikupad
        
       | fortran77 wrote:
       | What's interesting is that even a "view source" shows nothing
       | amiss, and if I do a copy/paste from the debug inspector view of
       | "This sentence has a hidden message." it still shows up....
        
       | rexxars wrote:
       | For a real-world use case: Sanity used this trick[0] to encode
       | Content Source Maps[1] into the actual text served on a webpage
       | when it is in "preview mode". This allows an editor to easily
       | trace some piece of content back to a potentially deep content
       | structure just by clicking on the text/content in question.
       | 
       | It has its drawbacks/limitations - eg you want to prevent adding
       | it for things that need to be parsed/used verbatim, like
       | date/timestamps, urls, "ids" etc - but it's still a pretty fun
       | trick.
       | 
       | [0] https://www.sanity.io/docs/stega
       | 
       | [1] https://github.com/sanity-io/content-source-maps
        
       | vzaliva wrote:
       | The title is a little misleading: "Note that the base character
       | does not need to be an emoji - the treatment of variation
       | selectors is the same with regular characters. It's just more fun
       | with emoji."
       | 
       | Using this approach with non-emoji characters makes it more
       | stealth and even more disturbing.
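
A compact Python sketch of the idea, using a plain letter as the base character. It assumes the byte-to-selector mapping is the natural one over the 256 variation selectors (bytes 0-15 to U+FE00..U+FE0F, bytes 16-255 to U+E0100..U+E01EF). The function names are illustrative and this is not the article's code.

```python
def byte_to_selector(b: int) -> str:
    # VS1-VS16 live at U+FE00..U+FE0F, VS17-VS256 at U+E0100..U+E01EF.
    return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16)

def selector_to_byte(ch: str):
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None                      # not a variation selector

def encode(base: str, payload: bytes) -> str:
    """Append one variation selector per payload byte to the base character."""
    return base + "".join(byte_to_selector(b) for b in payload)

def decode(text: str) -> bytes:
    """Collect the first run of variation selectors found in the text."""
    out = bytearray()
    for ch in text:
        b = selector_to_byte(ch)
        if b is not None:
            out.append(b)
        elif out:
            break                    # payload ended at the next ordinary character
    return bytes(out)

hidden = encode("a", b"test")
print(len(hidden), decode(hidden))
```

The string usually renders as a lone "a", yet it is five code points long and carries the four payload bytes through a copy and paste as long as nothing strips the selectors.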
        
       | ComputerGuru wrote:
       | This is cute but unnecessary - Unicode includes a massive range
       | called PUA: the private use area. The codes in this range aren't
       | mapped to anything (and won't be mapped to anything) and are for
       | internal/custom use, not to be passed to external systems (for
       | example, we use them in fish-shell to safely parse tokens into a
       | string, turning an unescaped special character into just another
       | Unicode code point in the string, but in the PUA area, then
       | intercept that later in the pipeline).
       | 
       | You're not _supposed_ to expose them outside your api boundary
       | but when you encounter them you are prescribed to pass them
       | through as-is, and that's what most systems and libraries do.
       | It's a clear potential exfiltration avenue, but given that most
       | sane developers don't know much more about Unicode other than
       | "always use Unicode to avoid internationalization issues", it's
       | often left wide open.
        
         | paulgb wrote:
         | I just tested and private use characters render as boxes for me
         | (), the point here was to encode them in a way that they are
         | hidden and treated as "part of" another character when
         | copy/pasting.
        
           | diggan wrote:
           | > the point here was to encode them in a way that they are
           | hidden and treated as "part of" another character when
           | copy/pasting
           | 
           | AKA "Steganography" for the curious ones:
           | https://en.wikipedia.org/wiki/Steganography
        
             | reaperducer wrote:
             | Like when we used to encode the phone numbers of warez
             | boards in GIFs.
        
           | bruce343434 wrote:
           | On my Android phone, that displays "Go[][]" in the Google logo
           | font.
        
         | layer8 wrote:
         | The difference is that PUA characters are usually rendered in
         | some way that is rather visible, whereas the variation
         | selectors aren't.
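
The rendering difference lines up with how Unicode classifies the two kinds of code point, which a quick look at the general categories shows. A small sketch with Python's `unicodedata` module; `describe` is a hypothetical helper.

```python
import unicodedata

def describe(ch: str) -> tuple:
    """Return (code point label, general category) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.category(ch)

for label, ch in [("VS16", "\ufe0f"),
                  ("VS17", "\U000e0100"),
                  ("private use", "\ue000")]:
    print(label, *describe(ch))
```

Variation selectors are nonspacing marks (Mn) that attach to the preceding character, while private-use code points have their own category (Co) and are drawn as standalone glyphs or placeholder boxes.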
        
         | juped wrote:
         | I'll be honest, I pasted this comment in the provided decoder
         | thinking no one could miss the point this badly and there was
         | probably a hidden message inside it, but either you really did
         | or this website is stripping them.
         | 
         | You can't invisibly watermark an arbitrary character (I did it
         | to one above! If this website isn't stripping them, try it out
         | in the provided decoder and you'll see) with unrecognized PUA
         | characters, because it won't treat them as combining
         | characters. You will cause separately rendered
         | placeholder-box characters to appear. Like this one:  (may not
         | be a placeholder-box if you're privately-using the private use
         | area yourself).
        
           | egypturnash wrote:
           | j is for Jabberwocky. Does this decode?
           | 
           | edit: Yes, it does.
        
         | Sniffnoy wrote:
         | Isn't this more what the designated noncharacters are for,
         | rather than the private-use area? Given how the private-use
         | area sometimes gets used for unofficial encodings of scripts not
         | currently in Unicode (or for things like the Apple logo and
         | such) I'd be worried about running into collisions with that if
         | I used the PUA in such a way.
         | 
         | Note that designated noncharacters includes not only 0xFFFF and
         | 0xFFFE, and not only the final two code points of every plane,
         | but also an area in the middle of Arabic Presentation Forms
         | that was at some point added to the list of noncharacters
         | specifically so that there would be more noncharacters for
         | people using them this way!
        
         | lolinder wrote:
         | Context that some may be missing is that this was inspired by
         | discussion surrounding the _Open Heart Protocol_ submission:
         | 
         | https://news.ycombinator.com/item?id=42791378
         | 
         | People immediately began discussing the applications for
         | criminal use given the constraint that only emoji are accepted
         | by the API. So for that use case the PUA wouldn't be an option,
         | you have to encode it in the emoji.
        
       | frontporch wrote:
       | you dont need 256 codepoints so you can neatly represent an octet
       | (whatever that is), you just need 2 bits. you can just stack as
       | many diacritical marks you want on any glyph. either the renderer
       | allows practically unlimited or it allows 1/none. in either case
       | that's a vuln. what would be really earth shattering is what i
       | was hoping this article was: a way to just embed "; rm -rf ~/"
       | into text without it being rendered. you also definitely dont
       | need rust for this unless you want to exclude 90% of the
       | programmer population.
        
         | paulgb wrote:
         | I think the Rust is more readable for bytemucking stuff than
         | dynamic languages because the reader doesn't have to infer the
         | byte widths, but for what it's worth the demo contains a
         | TypeScript implementation: https://github.com/paulgb/emoji-
         | encoder/blob/main/app/encodi...
        
         | AdamH12113 wrote:
         | An octet is a group of 8 bits. Today we normally use the word
         | "byte" instead. The term is often used in older internet
         | protocols and comes from an era where bytes were not
         | necessarily 8 bits.
        
       | panki27 wrote:
       | I implemented something similar years ago, but much simpler/less
       | sophisticated.
       | 
       | Unicode has several zero-width, non-printing characters, such as
       | zero-width space (U+200B) and zero-width joiner (U+200D). Using
       | one for 0 and the other for 1 lets you encode arbitrary data in
       | binary. I would give an example, but HN seems to strip this :(
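
A minimal sketch of the zero-width scheme described in this comment. The bit assignment (U+200B for 0, U+200D for 1), the most-significant-bit-first order, and the function names are assumptions for illustration, not the commenter's actual implementation.

```python
ZERO = "\u200b"  # ZERO WIDTH SPACE, assumed here to stand for bit 0
ONE = "\u200d"   # ZERO WIDTH JOINER, assumed here to stand for bit 1


def encode_zero_width(data):
    """Turn each byte into eight zero-width characters, most significant bit first."""
    return "".join(
        ONE if (byte >> shift) & 1 else ZERO
        for byte in data
        for shift in range(7, -1, -1)
    )


def decode_zero_width(text):
    """Collect the zero-width characters found in text and pack them back into bytes."""
    bits = [1 if ch == ONE else 0 for ch in text if ch in (ZERO, ONE)]
    out = bytearray()
    # Ignore any trailing bits that do not fill a whole byte.
    for i in range(0, len(bits) - len(bits) % 8, 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


# The payload sits invisibly between two visible words.
carrier = "hello" + encode_zero_width(b"hi") + " world"
```

Calling `decode_zero_width(carrier)` recovers the hidden bytes, provided the carrier text was not stripped of zero-width characters along the way.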
        
         | layer8 wrote:
         | Already linked in
         | https://news.ycombinator.com/item?id=43025913, and has a higher
         | risk of being stripped, as you noticed.
        
       | Mockapapella wrote:
       | In the same vein, I did some fun unicode abusing a few years ago
       | where I used scripts to convert programs into series of various
       | ZWJ's: https://thelisowe.substack.com/p/sleeper-cell-a-method-of-
       | em...
       | 
       | Also includes a decoder script
        
       | blmarket wrote:
       | This and several other abuse cases forced my previous work to use
       | code pointers to count 'characters' for user's nickname / status
       | messages. No one wanted to download 9MB simply browsing other
       | users.
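
A rough sketch of the kind of check this comment describes: limiting a nickname by code points and encoded size instead of by what is visible. The limits, the function names, and the byte-to-selector mapping (assumed to match the one in the linked article) are illustrative assumptions.

```python
def variation_selector(byte):
    """Map a byte to a variation selector (assumed to be the article's mapping)."""
    return chr(0xFE00 + byte) if byte < 16 else chr(0xE0100 + byte - 16)


def within_limits(nickname, max_code_points=32, max_bytes=128):
    """Accept a nickname only if both its code point count and UTF-8 size are small."""
    return (
        len(nickname) <= max_code_points  # len() on a Python str counts code points
        and len(nickname.encode("utf-8")) <= max_bytes
    )


plain = "\U0001F60A"  # one visible emoji, one code point
# Still renders as a single emoji, but carries 1000 hidden bytes.
loaded = plain + "".join(variation_selector(b) for b in b"x" * 1000)
```

Both strings look like a single emoji on screen, yet only `plain` passes `within_limits`.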
        
         | ncr100 wrote:
         | That is awesome. Both the abuse and the fix.
        
         | myflash13 wrote:
         | NoSQL? Sounds like it should've been caught by basic length
         | checks on the database field where it was stored.
        
       | rafram wrote:
       | This is cool. I tried pasting the output into an Instagram
       | comment and it stayed intact, so I have a feeling someone could
       | do some interesting stuff with that. Who needs a botnet C&C
       | server when you can post totally invisible commands on public
       | forums?
        
         | the_hoffa wrote:
         | I mean, steganography has been a thing for quite a while. Not
         | disagreeing, just saying this is how some programs/ideas were
         | passed around the internet decades ago by "less than upstanding
         | netizens" ;)
         | 
         | Wanted to pass a secret code to a friend? Encode the bit-data
         | in the alpha channel of an image. It could even be
         | encrypted/scrambled within the image itself. Post the perfectly
         | normal image to a public forum, ping your friend, they run it
         | through the "decoder" and Robert's your mother's brother.
         | 
         | Of course these weren't "logic bombs" like this post is
         | describing, but even those have been around for a while too.
         | 
         | Hacking is fun :)
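
A small sketch of the alpha-channel idea mentioned in this comment, working on a plain list of RGBA tuples so that no image library is needed. Using the least significant bit of each alpha value is an assumption made for illustration; the comment does not say how the bits were stored.

```python
def hide_in_alpha(pixels, data):
    """Store each bit of data in the lowest bit of successive alpha values."""
    bits = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("image too small for payload")
    out = list(pixels)
    for i, bit in enumerate(bits):
        r, g, b, a = out[i]
        out[i] = (r, g, b, (a & ~1) | bit)  # clear the low bit, then set it to the payload bit
    return out


def read_from_alpha(pixels, n_bytes):
    """Rebuild n_bytes from the low alpha bits of the first n_bytes * 8 pixels."""
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for _r, _g, _b, a in pixels[i * 8:(i + 1) * 8]:
            value = (value << 1) | (a & 1)
        out.append(value)
    return bytes(out)


image = [(10, 20, 30, 255)] * 32  # hypothetical 32-pixel fully opaque image
stego = hide_in_alpha(image, b"ok")
```

Each alpha value changes by at most one step, which is why the carrier image looks unchanged to a viewer.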
        
       | cynicalsecurity wrote:
       | Ctrl+F "unicode normalisation" 0/0
       | 
       | I'm surprised no one has mentioned it yet. It's usually super
       | easy, but people forget to add it all the time.
        
         | paulgb wrote:
         | I haven't tried it but I've heard that at least some unicode
         | normalizers do not strip sequences of variation selectors.
        
           | moody__ wrote:
           | Normalization implementations must not strip variation
           | selectors by definition. The "normal" part of normalization
           | means to convert a string into either consistently decomposed
           | Unicode or consistently composed Unicode, i.e. U+00DC vs.
           | U+0055 + U+0308. However, this decomposition mapping is also
           | used (maybe more like abused) for converting certain "legacy"
           | code points to non-legacy code points. There does not exist a
           | rune which decomposes to variation selectors (and thus
           | variation selectors do not compose into anything), so
           | normalization must not alter or strip them.
           | 
           | source: I've implemented Unicode normalization from scratch
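
The point above can be checked with Python's standard `unicodedata` module. This is only a quick sketch; the helper name and the sample payload are made up for illustration.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def unchanged_by_normalization(text):
    """Map each normalization form to whether it leaves text untouched."""
    return {form: unicodedata.normalize(form, text) == text for form in FORMS}


composed = "\u00dc"     # U WITH DIAERESIS as a single code point
decomposed = "U\u0308"  # U followed by COMBINING DIAERESIS

# An emoji followed by two variation selectors from the supplementary range.
payload = "\U0001F60A\U000E0158\U000E0159"

report = unchanged_by_normalization(payload)
```

Normalization converts between `composed` and `decomposed`, but every entry in `report` is true because the selectors have no decomposition mapping and never compose with anything.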
        
       | wunderwuzzi23 wrote:
       | This is cool. There are also the Unicode Tag characters that
       | mirror ASCII and are often invisible in UI elements (especially
       | web apps).
       | 
       | The unique thing about Tag characters is that some LLMs interpret
       | the hidden text as ASCII and follow instructions, and they can
       | even write them:
       | 
       | https://embracethered.com/blog/posts/2024/hiding-and-finding...
       | 
       | Here is an actual exploit POC that Microsoft fixed in Copilot:
       | https://embracethered.com/blog/posts/2024/m365-copilot-promp...
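
For readers unfamiliar with the Tags block, here is a short sketch of how printable ASCII mirrors into it: each character is shifted up by 0xE0000, so U+E0020 to U+E007E correspond to ASCII 0x20 to 0x7E. The function names are hypothetical, and this is not the code from the linked posts.

```python
TAG_BASE = 0xE0000  # the Tags block mirrors ASCII at this offset


def to_tags(text):
    """Mirror printable ASCII into the (usually invisible) Tags block."""
    if not all(0x20 <= ord(ch) <= 0x7E for ch in text):
        raise ValueError("only printable ASCII can be mirrored into the Tags block")
    return "".join(chr(TAG_BASE + ord(ch)) for ch in text)


def from_tags(text):
    """Recover the ASCII text carried by any Tag characters in text."""
    return "".join(
        chr(ord(ch) - TAG_BASE)
        for ch in text
        if 0xE0020 <= ord(ch) <= 0xE007E
    )


hidden = to_tags("hi")
message = "visible text" + hidden
```

A filter that wants to be safe can run `from_tags` on incoming text and treat any non-empty result as suspicious.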
        
       | paulgb wrote:
       | (author here) some people in this thread and elsewhere asked me
       | about whether an LLM could decode this, and the answer seems to
       | be: not likely by itself, but it often can if it has access to a
       | Python interpreter!
       | 
       | Here's a demo of Gemini Flash 2 solving one in 7s:
       | https://bsky.app/profile/paulbutler.org/post/3lhzhroogws2g
        
       | nitwit005 wrote:
       | Even kids figure out how to manipulate unicode text. If you want
       | to bypass a swear filter, replace a letter with an alternate
       | representation of the same letter.
        
       | egypturnash wrote:
       | If you try posting this on Bluesky, the editor only counts it as
       | one emoji, but you will get an error upon trying to post.
        
       | dalemhurley wrote:
       | I love it, I got Claude to add a pin to provide very basic
       | encryption
       | 
       | https://claude.site/artifacts/5bfdf131-d847-4735-9242-998f23...
        
       | ethin wrote:
       | It's worth noting, just as a curiosity, that screen readers can
       | detect these variation selectors when I navigate by character.
       | For example, if I arrow over the example he provided (I can't
       | paste it here lol), I hear: "Smiling face with smiling eyes",
       | "Symbol e zero one five five", "Symbol e zero one five c",
       | "Symbol e zero one five c", "Symbol e zero one five f". This is
       | unfortunately dependent on the speech synthesizer used, and I
       | wouldn't know if the characters were there if I was just reading
       | a document, so this isn't much of an advantage all things
       | considered.
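
The code points the screen reader announces can be turned back into bytes by hand. The sketch below assumes the mapping used in the linked article (bytes 0 to 15 map to U+FE00 to U+FE0F, bytes 16 to 255 map to U+E0100 to U+E01EF); the function names are made up for illustration.

```python
def byte_from_selector(ch):
    """Return the byte a variation selector stands for, or None for other characters."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def decode_selectors(text):
    """Collect the bytes carried by every variation selector in text."""
    values = (byte_from_selector(ch) for ch in text)
    return bytes(v for v in values if v is not None)


# The emoji plus the four code points quoted above: E0155, E015C, E015C, E015F.
spoken = "\U0001F60A\U000E0155\U000E015C\U000E015C\U000E015F"
decoded = decode_selectors(spoken)  # b"ello" under the assumed mapping
```

For example, E0155 is 0x55 + 16 = 0x65, which is the ASCII code for "e".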
        
         | llm_trw wrote:
         | Ironically enough I have a script that strips all non-ascii
         | characters from my screen reader because I found that _all_
         | online text was polluted with invisible and annoying to listen
         | to characters.
        
       | andrethegiant wrote:
       | Clever! I made a similar emoji encoding/decoding microsite:
       | https://face64.me
        
       | omnibrain wrote:
       | 10 years or so ago I shocked coworkers by using U+202E RIGHT-TO-LEFT
       | OVERRIDE in the middle of filenames on Windows. So
       | funnypicturegnp.exe became funnypictureexe.png. Combined with a
       | custom icon for the program that mimics a picture preview, it was
       | pretty convincing.
        
         | hosteur wrote:
         | Wow this is a clever trick.
        
         | mdup wrote:
         | I worked in phishing detection. This was a common pattern used
         | by attackers, although .exe attachments are blocked automatically
         | most of the time; .html is the new malicious extension (often
         | hosting an obfuscated window.location redirect to a fake login
         | page).
         | 
         | RTL abuse like cute-cat-lmth.png was relatively common, but
         | also trivial to detect. We would immediately flag such an email
         | as phishing.
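
As a sketch of why this is easy to detect: a filename containing any bidirectional control character can simply be flagged. The function name is hypothetical, and this is not the detection system the commenter worked on.

```python
# Explicit bidirectional formatting characters defined by Unicode.
BIDI_CONTROLS = {
    "\u202a": "LEFT-TO-RIGHT EMBEDDING",
    "\u202b": "RIGHT-TO-LEFT EMBEDDING",
    "\u202c": "POP DIRECTIONAL FORMATTING",
    "\u202d": "LEFT-TO-RIGHT OVERRIDE",
    "\u202e": "RIGHT-TO-LEFT OVERRIDE",
    "\u2066": "LEFT-TO-RIGHT ISOLATE",
    "\u2067": "RIGHT-TO-LEFT ISOLATE",
    "\u2068": "FIRST STRONG ISOLATE",
    "\u2069": "POP DIRECTIONAL ISOLATE",
}


def find_bidi_controls(filename):
    """Return (index, name) for every bidi control character in filename."""
    return [
        (i, BIDI_CONTROLS[ch])
        for i, ch in enumerate(filename)
        if ch in BIDI_CONTROLS
    ]


# Stored name: the real extension is .html, but it displays as cute-cat-lmth.png.
suspicious = "cute-cat-\u202egnp.html"
hits = find_bidi_controls(suspicious)
```

An empty result means no bidi controls were found; any hit in an attachment name is a strong phishing signal, since legitimate filenames rarely need these characters.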
        
         | taneq wrote:
         | I'd never heard of this particular trick but I'm glad my
         | decades of paranoia-fueled "right click -> open with" treatment
         | of any potentially sketchy media file was warranted! :D
        
       | tyilo wrote:
       | Kitty terminal shows non-payload letters and emojis normally, but
       | with a payload a letter is shown as one box and an emoji is shown
       | as two boxes.
        
       | foobuzzHN wrote:
       | 10 years ago I made a POC for smuggling arbitrary data through
       | _no visible text at all_: https://github.com/foobuzz/ium
        
       | arkh wrote:
       | > To be clear, this is an abuse of unicode and you shouldn't do
       | it. If your mind is wandering to practical use cases for this,
       | shut it down.
       | 
       | Totally not thinking about IRC clients with their own hidden
       | commands.
        
       | qingcharles wrote:
       | I was using this technique last year with Bing Image Creator.
       | 
       | It let you get around their filter on brand names and celebrity
       | names by smuggling them into the prompt in a way the AI could
       | read, but the human-written filter was not designed for.
        
       | JoelJacobson wrote:
       | Imagine using the ID card emoji (U+1FAAA) as a universal carrier
       | for digital ID tokens. A dumb demo is available at
       | https://pit.lovable.app/ which--without any secure protocol--
       | simply encodes a National Identification Number into the emoji
       | using variation selectors.
       | 
       | The idea is that banks could issue encrypted ID tokens in this
       | way, letting them move seamlessly across any platform that
       | supports Unicode (messaging apps, email, web forms, etc.). The
       | heavy lifting of security (preventing replay attacks,
       | interception, ensuring token freshness, etc.) would be managed
       | separately with robust cryptography, while the emoji serves
       | purely as a transport layer.
       | 
       | It's not about reinventing security but about creating a cross-
       | platform way to carry identity tokens. Thoughts?
        
         | bruce343434 wrote:
         | What is wrong with just using the actual SSN? Why hide it in an
         | emoji?
        
           | JoelJacobson wrote:
           | So that the operating system could recognize it
           | automatically, and to include a potentially long URL to the
           | retail bank's web service to initiate the protocol, such as
           | signing a document or an identification protocol.
        
       | _nhh wrote:
       | Perfect way of personalized ad tracking?
        
         | _nhh wrote:
         | Check this address after you clicked it:
         | 
         | https://emoji.paulbutler.org/?mode=encode
         | 
         | I encoded the last ,,e"
        
       | urbandw311er wrote:
       | When people discuss things like "Do LLMs know about this?" on a
       | public website I always think that it's the equivalent of
       | somebody whose phone is wiretapped calling their friend and
       | asking if the FBI knows about something.
        
         | fennecfoxy wrote:
         | I think that's a very cynical view. The author seeing what an
         | LLM would make of it was more akin to getting a new game and
         | wondering if you can pet the dog.
        
       | StevenNunez wrote:
       | "To be clear, this is an abuse of unicode and you shouldn't do
       | it. If your mind is wandering to practical use cases for this,
       | shut it down." TOO LATE!
        
       | ofou wrote:
       | This is one of the reasons I've been advocating using UTF-8 as a
       | tokenizer for a long time. The actual problem, IMHO, is tokenizers
       | themselves, which obscure the encoding/decoding process in order
       | to gain some compression during training to fit more data in for
       | the same budget, and arguably to gain some better understanding
       | from the beginning. Again, just a lack of computing power.
       | 
       | If you use UTF-8 directly as tokenizer, this problem becomes
       | evident once you fit it into the context window. Plus, you can
       | run multiple tests for this type of injection; no emoji should
       | take more than 40 bytes (10 code points * 4 bytes per code
       | point in the worst case). This is an attack on tokenizers, not on
       | UTF-8.
       | 
       | Plus, Unicode publishes the full list of valid sequences
       | containing the ZWJ character in emoji-zwj-sequences.txt
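
One possible test of the kind suggested here, sketched under the assumption that a legitimate variation sequence uses a single selector after its base character, so a longer run is suspect. The threshold and function names are illustrative, and this is a different check from the 40-byte bound mentioned above.

```python
def is_variation_selector(ch):
    """True for code points in either variation selector block."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def longest_selector_run(text):
    """Length of the longest run of consecutive variation selectors in text."""
    longest = current = 0
    for ch in text:
        current = current + 1 if is_variation_selector(ch) else 0
        longest = max(longest, current)
    return longest


def looks_smuggled(text, max_run=1):
    """Flag text whose selector runs exceed what an ordinary sequence needs."""
    return longest_selector_run(text) > max_run


ordinary = "\u2764\ufe0f"                   # heart with emoji presentation selector
loaded = "\U0001F60A\U000E0158\U000E0155"  # emoji followed by two selectors
```

A check like this runs on code points before tokenization, so it does not depend on how any particular tokenizer splits the input.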
        
       ___________________________________________________________________
       (page generated 2025-02-13 23:01 UTC)