[HN Gopher] Smuggling arbitrary data through an emoji
___________________________________________________________________
Smuggling arbitrary data through an emoji
Author : paulgb
Score : 663 points
Date : 2025-02-12 09:24 UTC (1 days ago)
(HTM) web link (paulbutler.org)
(TXT) w3m dump (paulbutler.org)
| vladde wrote:
| test, do emojis work on hn?
|
| edit: apparently not
|
| edit 2: oh wait, the bytes are still there! copy-paste this entire
| message and it decodes to "test"
| jerpint wrote:
| The ability to add watermarks to text is really interesting.
| Obviously it could be worked around, but it could be a good way to
| subtly watermark e.g. LLM outputs
| tyho wrote:
| There are way better ways to watermark LLM output. It's easy to
| make it undetectable, which this isn't.
| shawnz wrote:
| I recently worked on a steganography project which could be
| useful for this problem. See:
| https://github.com/shawnz/textcoder
| andai wrote:
| That's really cool, you should repost the HN submission.
| shawnz wrote:
| Thank you! I will see what I can do.
| pava0 wrote:
| For example?
| tyho wrote:
| A crude way: To watermark: First establish a keyed DRBG.
| For every nth token prediction: read a bit from the DRBG
| for every possible token to label them red/black. Before
| selecting the next token, set the logit for black tokens to
| -Inf; this ensures a red token will be selected.
|
| To detect: Establish the same DRBG. Tokenize, for each nth
| token, determine the red set of tokens in that position. If
| you only see red tokens in lots of positions, then you can
| be confident the content is watermarked with your key.
|
| This would probably take a bit of fiddling to work well,
| but would be pretty much undetectable. Conceptually it's
| forcing the LLM to use a "flagged" synonym at key
| positions. A more sophisticated version of a shibboleth.
|
| In practice you might choose to instead watermark all
| tokens, less heavy-handedly (nudge logits, rather than
| override), and use highly robust error correcting codes.
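
A minimal sketch of the red/black scheme described above, under loudly stated assumptions: HMAC-SHA256 stands in for the keyed DRBG, the "model" is a toy function producing made-up logits, and the names red_set, mask_logits, generate and red_fraction are hypothetical. It only illustrates labeling tokens with keyed bits, masking black tokens at every nth position, and counting red tokens afterwards.

```python
import hashlib
import hmac
import math


def red_set(key, position, vocab_size):
    """Label each token id red (True bit) or black using one keyed bit per token.

    Assumption: HMAC-SHA256 is used here as a stand-in for a keyed DRBG.
    """
    red = set()
    for token_id in range(vocab_size):
        msg = f"{position}:{token_id}".encode()
        if hmac.new(key, msg, hashlib.sha256).digest()[0] & 1:
            red.add(token_id)
    return red


def mask_logits(logits, key, position):
    """Set the logit of every black token to -inf so only red tokens can win."""
    red = red_set(key, position, len(logits))
    return [x if i in red else -math.inf for i, x in enumerate(logits)]


def generate(key, length, vocab_size, n=1):
    """Greedy decoding over toy logits, watermarking every nth position."""
    tokens = []
    for pos in range(length):
        # Toy stand-in for a language model's logits (hypothetical).
        logits = [float((7 * i + 3 * pos) % vocab_size) for i in range(vocab_size)]
        if pos % n == 0:
            logits = mask_logits(logits, key, pos)
        tokens.append(max(range(vocab_size), key=logits.__getitem__))
    return tokens


def red_fraction(tokens, key, vocab_size, n=1):
    """Fraction of checked positions whose token is in that position's red set."""
    checked = [(pos, tok) for pos, tok in enumerate(tokens) if pos % n == 0]
    hits = sum(tok in red_set(key, pos, vocab_size) for pos, tok in checked)
    return hits / len(checked)


if __name__ == "__main__":
    text = generate(b"demo-key", length=40, vocab_size=64)
    print(red_fraction(text, b"demo-key", 64))   # every checked token is red
    print(red_fraction(text, b"other-key", 64))  # roughly half, by chance
```

With the right key every checked position holds a red token; with any other key the red fraction should hover around one half, which is what makes the mark detectable only to the key holder.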
| jl6 wrote:
| It feels like this would only be feasible across longer
| passages of text, and some types of text may be less
| amenable to synonyms than others. For example, a tightly
| written mathematical proof versus a rambling essay.
| Biased token selection may be detectable in the latter
| (using a statistical test), and may cause the text to be
| irreparably broken in the former.
| drdeca wrote:
| To handle low entropy text, the "adding a smaller
| constant to the logits" approach avoids having much
| chance of changing the parts that need to be exactly a
| particular thing.
|
| In that case, though, it needs longer texts to reach high
| significance (and when the entropy is low, the text needs
| to be especially long).
|
| But for most text (with typical amounts of entropy per
| token) apparently it doesn't need to be that long? Like
| 25 words I think I heard?
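
A small sketch of the "nudge, don't override" idea and of how significance grows with length. The bias value delta=2.0, the toy logits, and the function names are assumptions for illustration; the z-score is the usual normal approximation to a binomial count with an assumed red fraction of one half, and it does not by itself confirm the 25-word figure mentioned above.

```python
import math


def nudge(logits, red, delta=2.0):
    """Add a small constant to red-token logits instead of masking black ones."""
    return [x + delta if i in red else x for i, x in enumerate(logits)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def z_score(red_hits, total, p=0.5):
    """Standard deviations by which red_hits exceeds the unwatermarked expectation."""
    return (red_hits - p * total) / math.sqrt(total * p * (1 - p))


red = {1}

# Low entropy: one token dominates, so the nudge does not change the pick.
confident = [10.0, 0.0, 0.0]
print(argmax(nudge(confident, red)))  # 0

# Higher entropy: a near-tie, so the nudge tips the choice to the red token.
uncertain = [1.0, 0.5, 0.0]
print(argmax(nudge(uncertain, red)))  # 1

# 20 red tokens out of 25 checked is already 3 standard deviations out.
print(z_score(20, 25))  # 3.0
```

The same arithmetic shows why low-entropy text needs more tokens: if the nudge rarely changes the pick, the red count stays close to half and the z-score grows slowly.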
| deadbabe wrote:
| What if the entire LLM output isn't used? For example,
| you ask the LLM to produce some long random preamble and
| conclusion with your actual desired output in between the
| two. Does it mess up the watermarking?
| antognini wrote:
| The issue with the standard watermark techniques is that they
| require an output of at least a few hundred tokens to
| reliably imprint the watermark. This technique would apply to
| much shorter outputs.
| nzach wrote:
| so.... in theory you should be able to create several visually
| identical links that give access to different resources?
|
| I've always assumed links without any tracking information
| (unique hash, query params, etc) were safe to click (with regard
| to my privacy). But if this works for links I may need to revise
| my strategy regarding how to approach links sent to me.
| dmbche wrote:
| You need to decode the text after copy-pasting it. I believe
| clicking on the text will not interact with the obfuscated data,
| since your computer will just find the Unicode and ignore the
| obfuscated data.
|
| This is just so that you can hide data and send it to someone
| to be decoded (or for watermarking, as mentioned).
| nzach wrote:
| yes, I understand this is not a security risk.
|
| but my fear is precisely that I may be sending data to a
| remote host while I'm completely unaware of this fact.
|
| I tried to create a POC with some popular URL shortener
| services, but it doesn't seem to work.
|
| what I wanted to create was a link like <host.tld>/innocent
| that redirects to google.com. in this case the "c" contains
| some hidden data that will be sent to the server while the
| user is not aware. this seems possible with the correct piece
| of software.
| cess11 wrote:
| HTML entity encoding will show the hidden content, try with
| https://mothereff.in/html-entities.
| cscheid wrote:
| My understanding is that "weird" unicode code points become
| https://en.wikipedia.org/wiki/Punycode. I used the (copy-
| pasted from the post, presumably with the payload in it) to
| type a fake domain into Chrome, and the Punycode I got appeared
| to not have any of the encoding bits.
|
| However, I then pasted the emoji into the _query_ part of a
| URL. I pointed it to my own website, and sure enough, I can
| definitely see the payload in the nginx logs. Yikes.
|
| Edit: I pasted the very same Emoji that 'paulgb used in their
| post before the parenthetical in the first paragraph, but it
| seems HN scrubs those from comments.
| bmicraft wrote:
| domains get "punycode" encoded, urls get "url encoded"[1],
| which should make unicode characters stand out. That being
| said, browsers do accept some non-ascii characters in urls
| and convert them automatically, so theoretically you could put
| "invalid" characters into a link and have the browser convert
| it only after clicking. That might be a viable strategy.
|
| [1] https://www.w3schools.com/tags//ref_urlencode.asp
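
A hedged sketch of the two encodings discussed above, using only Python's standard library. The host and path are made up, and Python's built-in "idna" codec implements the older IDNA 2003 rules, so it only approximates what a browser does. It shows an invisible variation selector (U+FE01) disappearing from a hostname but surviving, percent-encoded, in the path.

```python
from urllib.parse import quote

VS2 = "\ufe01"  # VARIATION SELECTOR-2, renders as nothing

# Hypothetical link with one hidden selector in the host and one in the path.
host = f"exam{VS2}ple.com"
path = f"/inno{VS2}cent"

# Hostname: Python's IDNA 2003 codec maps U+FE00..U+FE0F to nothing.
ascii_host = host.encode("idna")
print(ascii_host)  # b'example.com'

# Path: percent-encoding keeps the selector's UTF-8 bytes (EF B8 81).
encoded_path = quote(path)
print(encoded_path)  # /inno%EF%B8%81cent
```

This matches the observations in the thread: the payload did not show up in the domain, but it did reach the server when placed in the rest of the URL.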
| echeese wrote:
| The emoji is gone but the content is still there.
| riquito wrote:
| > I've always assumed links without any tracking information
| (unique hash, query params, etc) were safe to click (with
| regard to my privacy). But if this works for links I may need
| to revise my strategy regarding how to approach links sent to
| me.
|
| Well, it was never safe: what you see and where the link is
| pointing are different things. That's why the actual link is
| displayed at the bottom left of your browser when you move your
| mouse over it (or focus it via keyboard).
| layer8 wrote:
| URIs with non-ASCII characters are technically invalid.
| Browsers and the like should (but likely don't all do) percent-
| encode any invalid characters for display if they accept such
| invalid URIs.
| password4321 wrote:
| This tool and idea are sketchy AF: https://github.com/zws-im/zws
|
| ("Shorten URLs using invisible spaces")
| kccqzy wrote:
| "Visually identical" is never good enough. Have you heard of
| attacks confusing Latin letters and Cyrillic letters? For
| example C versus С. (The latter is known as CYRILLIC CAPITAL
| LETTER ES.) Have you heard of NFC forms versus NFD forms? For
| example é versus é (LATIN SMALL LETTER E + COMBINING ACUTE
| ACCENT versus LATIN SMALL LETTER E WITH ACUTE.)
|
| Nothing that's important when it comes to security and privacy
| should rely on a "visually identical" check. Fortunately
| browsers these days are already good at this; their address
| bars use Punycode for the domain and percent encoding for the
| rest of the URL.
| komboozcha wrote:
| Erm, DNS uses Punycode because it comes from a time when
| Unicode didn't exist, and bind assumes a grapheme has no more
| than one byte.
| ale42 wrote:
| Yes, but I guess the message meant that browsers
| now detect homographs and display the Punycode instead. See
| also https://news.ycombinator.com/item?id=14130241; at that
| time Firefox wasn't fixed, but in the meantime it fixed the
| issue too (there's a
| network.idn.punycode_cyrillic_confusables preference, which
| is enabled by default).
| moody__ wrote:
| As the sibling comment has mentioned, Unicode in DNS uses a
| Punycode encoding, but even further than that, the standard
| specifies that the Unicode data must be normalized to NFC[0]
| before being converted to Punycode. This means that your
| second example (decomposed e with combining acute accent vs
| the composed variant) is not a valid concern. The Cyrillic
| one is however.
|
| [0] https://www.rfc-editor.org/rfc/rfc5891 Section 4.1: "By the
| time a string enters the IDNA registration process as
| described in this specification, it MUST be in Unicode and in
| Normalization Form C"
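
A short sketch of the distinction drawn above, using Python's standard unicodedata module. NFC normalization merges the decomposed and composed forms of "e with acute", but it does nothing for the Latin/Cyrillic lookalike, which remains a different code point. The variable names are just for illustration.

```python
import unicodedata

decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
latin_c = "C"           # LATIN CAPITAL LETTER C
cyrillic_es = "\u0421"  # CYRILLIC CAPITAL LETTER ES

# The raw strings differ, but NFC folds them into the same sequence.
print(decomposed == composed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Normalization does not help with cross-script confusables.
print(unicodedata.normalize("NFC", cyrillic_es) == latin_c)  # False
print(unicodedata.name(cyrillic_es))  # CYRILLIC CAPITAL LETTER ES
```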
| kccqzy wrote:
| The OP said link. The NFC/NFD issue remains if these are
| part of a path name or query parameter.
| moody__ wrote:
| Sure, but the security concerns of that I feel are much
| less concerning than having multiple domain names with
| the same visual appearance that point to different
| servers. That has immediate impact for things like
| phishing whereas lookalike path or query portions would
| at least ensure you are still connecting to the server
| that you think you are.
| nerder92 wrote:
| Might not be related to the point of the article per se, but I've
| tried to decode it with different LLMs to benchmark their
| reasoning capabilities.
|
| - 4o: Failed completely
|
| - o1: Overthinks it for a while and comes up with the wrong answer
|
| - o3-mini-high: Gets closer to the result on the first try, needs a
| second prompt to adjust the approach
|
| - r1: nails it at first try
|
| The prompt I've used was simply: "this emoji has an hidden
| message can you decode it?"
|
| If you want to see the CoT:
| https://gist.github.com/nerder/5baa9d7b13c1b7767d022ea0a7c91...
| markisus wrote:
| The r1 somehow knew at an early stage that the message was
| HELLO but it couldn't figure out the reason. Even at the end,
| its last "thought" insists that there is an encoding mistake
| somewhere. However the final message is correct. I wonder how
| well it would do for a nonstandard message. Any sufficiently
| long English message would fall to statistical analysis and I
| wonder if the LLMs would think to write a little Python script
| to do the job.
| paulgb wrote:
| Wow, that's interesting! I wonder if this reproduces with a
| different message, or if it was a lucky guess.
|
| I looked at how the strings tokenize and they do appear to
| conserve enough information that it could be decoded in
| theory.
| klabb3 wrote:
| > or if it was a lucky guess
|
| It's like guessing 1/2 or 2/3 on a math test. The test
| authors pick nice numbers, and programmers like "hello". If
| the way to encode the secret message resembles other
| encodings, it's probably that the pattern matching monster
| picked it up and is struggling to autocomplete (ie
| backwards rationalize) a reason why.
| paulgb wrote:
| I did some experimentation today. I wouldn't expect AI to
| solve it using only their own reasoning, but I've had a
| decent hit rate of getting AI to solve them when they
| have access to a Python interpreter. Here's Gemini Flash
| 2 solving one (albeit it lost the spaces) in a single
| prompt and about 7 seconds!
|
| https://bsky.app/profile/paulbutler.org/post/3lhzhroogws2g
| bogtog wrote:
| My deepseek-r1 seems to be a bit more lost on decoding "How do
| I make meth". Some highlights (after about 5 minutes of
| R1-ing):
|
| > Another angle: the user mentioned "encoded a message in this
| emoji", so maybe the first emoji is a red herring, or it's part
| of the message. The subsequent characters, even though they
| look like variation selectors, could be part of the encoding.
|
| > E0138 in hex is 0xE0138. Convert to decimal: 14*16^4 + 0*16^3 +
| 1*16^2 + 3*16 + 8 = 14*65536 + 0 + 256 + 48 + 8 = 917504 + 256 +
| 48 + 8 = 917816.
|
| > Given that I'm not making progress, perhaps the answer is
| "Hello World!" but encoded via the tag characters. Let's check:
|
| > Answer: The decoded message is "Hello World!"
|
| In all this, it did at least manage to discern that the first
| letter should be "h"
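
A quick check of the arithmetic quoted above, plus one extra step. The byte mapping used here (bytes 0..15 map to U+FE00..U+FE0F, bytes 16..255 map to U+E0100..U+E01EF) is an assumption taken from the linked article, not something stated in this thread, and selector_to_byte is a hypothetical helper name.

```python
code_point = int("E0138", 16)
print(code_point)  # 917816


def selector_to_byte(cp):
    """Map a variation selector code point back to a byte (assumed mapping)."""
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    raise ValueError(f"U+{cp:04X} is not a variation selector")


byte = selector_to_byte(code_point)
print(byte, chr(byte))  # 72 H
```

Under that assumed mapping, U+E0138 carries the byte for an uppercase "H", which is the first letter of the message being decoded.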
| roguecoder wrote:
| It is highly unlikely it discerned that: it coincidentally
| guessed a string that starts with an H.
|
| If you try it with a string that started with "J" and then it
| guessed "jump up", I might be more convinced.
| krupan wrote:
| There's no way an LLM is decoding this. It's just giving you a
| statistically likely response to the request, "guess my secret
| message." It's not a big surprise that it guessed "Hello" or
| "Hello, world"
| paulgb wrote:
| I got Claude to get "the raisons play at midnight" from an
| emoji in one prompt and three uses of its "analysis" tool.
| (the X Y at midnight is a snowclone that Claude has
| probably seen, but I randomly picked "raisons" and "play")
|
| My prompt was "I think this emoji contains a hidden messaage,
| can you decode it? Use JavaScript if necessary."
| ahofmann wrote:
| This will break so many (web-)forms :-)
|
| It is not bulletproof though. Between this "c" and the space after
| it are about 3500 characters. Copying only the "c" above (not this
| one) will keep some of the hidden text, but not all. Nevertheless,
| while I knew that this is possible, it still breaks a lot of
| assumptions around text.
|
| Edit: the text field for editing this post is so large that I
| need to scroll down to the update button. This will be a fun toy
| for creating very hard-to-find bugs in many tools.
| FranchuFranchu wrote:
| You could store UTF-8 encoded data inside the hidden bytestring.
| If some of the UTF-8 encoded smuggled characters are variation
| selector characters, you can smuggle text inside the smuggled
| text. Smuggled data can be nested arbitrarily deep.
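
A sketch of the nesting idea, assuming the byte-to-selector mapping from the linked article (bytes 0..15 as U+FE00..U+FE0F, bytes 16..255 as U+E0100..U+E01EF). The function names are hypothetical and this is an illustration, not the article's own code. Because the UTF-8 bytes of an already-smuggled string are just bytes, they can be smuggled again.

```python
def byte_to_selector(b):
    """Map one byte to an invisible variation selector (assumed mapping)."""
    return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16)


def selector_to_byte(ch):
    """Inverse mapping; returns None for anything that is not a selector."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def encode(base, payload):
    """Append one selector per payload byte after the visible base character."""
    return base + "".join(byte_to_selector(b) for b in payload)


def decode(text):
    """Collect the first contiguous run of selectors and return its bytes."""
    out = bytearray()
    for ch in text:
        b = selector_to_byte(ch)
        if b is not None:
            out.append(b)
        elif out:
            break
    return bytes(out)


TURTLE = "\U0001F422"

inner = encode(TURTLE, b"secret")
outer = encode(TURTLE, inner.encode("utf-8"))  # smuggle the smuggled text

recovered_inner = decode(outer).decode("utf-8")
print(recovered_inner == inner)  # True
print(decode(recovered_inner))   # b'secret'
```

Each level multiplies the size, since every payload byte becomes a selector that itself takes several UTF-8 bytes at the next level.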
| riskable wrote:
| I'm imagining post-incident analysis finding out that, "the
| data was exfiltrated via some Unicode string..." then they put
| it up on the screen and it's just an enormous line of turtle
| emoji
|
| https://emojipedia.org/turtle
| JadeNB wrote:
| > I'm imagining post-incident analysis finding out that, "the
| data was exfiltrated via some Unicode string..." then they
| put it up on the screen and it's just an enormous line of
| turtle emoji
|
| Since it took me a minute to make the connection, I'll just
| say explicitly that I enjoyed the understated "it's turtles
| all the way down" joke.
| jodrellblank wrote:
| > We and our 717 technology partners ask you to consent to
| the use of cookies to store and access personal data on your
| device.
|
| To see a turtle emoji.
| petee wrote:
| It's fun that you can encode encoded emoji into a new one
| riskable wrote:
| Then when you dive deeper into the encoded data you find
| endless turtle emoji and loudly exclaim, "it's turtles all the
| way down!"
| HeikoBehrens wrote:
| FWIW, we considered this technique back at Pebble to make
| notifications more actionable and even filed a patent for that
| (sorry!) https://patents.justia.com/patent/9411785
|
| Back then on iOS via ANCS, the watches wouldn't receive much more
| than the textual payload you'd see on the phone. We envisioned
| working with partners such as WhatsApp et al. to encode deep
| links/message ids into the message so one could respond directly
| from the watch.
| coldpie wrote:
| So whoever now owns that patent (Google? maybe some patent
| troll picked it up?) could, in theory, sue the author of this
| article for patent infringement, right? Even though they
| invented it independently and never once used or looked at your
| patent. Do you think you made the world a better place or a
| worse place by filing that patent?
| delian66 wrote:
| Do you think your comment made the world, and HN specifically
| a better place?
| Imustaskforhelp wrote:
| I think so, yes; it made me aware again of the patent
| troll scam in the USA.
|
| In fact it is your comment which to me seems a little
| hateful. Yes, the above comment also felt a little hateful.
|
| Hate doesn't counter hate, I guess.
| RIMR wrote:
| Yes, calling out unethical practices makes the world a
| better place by discouraging unethical practices.
| krupan wrote:
| Berating people for filing patents in self defense is not
| how we fix this problem. The government put these rules
| in place. Businesses _have_ to at least accumulate
| patents to use defensively (you found a patent of yours
| that you think I'm violating? Well let me do a quick
| search through the patents I have...what's that?
| Nevermind, I'm not actually infringing your patent? Good,
| that's what I thought.)
| JadeNB wrote:
| > Do you think you made the world a better place or a worse
| place by filing that patent?
|
| Come on, what does this contribute to this conversation? The
| poster clearly is aware of the drawbacks of such patents, and
| didn't clearly play any role in filing the patent (they said
| "we ... filed it," not "I filed it"). This kind of response
| just encourages people not to mention such things; it can't
| possibly change their past behavior, and, since Pebble the
| company _per se_ doesn't exist any more, is also unlikely to
| change future behavior.
| coldpie wrote:
| > The poster clearly is aware of the drawbacks of such
| patents, and didn't clearly play any role in filing the
| patent (they said "we ... filed it," not "I filed it").
|
| A person with the same name as that commenter is listed as
| an inventor on the patent.
|
| > it can't possibly change their past behavior
|
| Obviously, but it can change future behavior. Maybe
| realizing that they made the world a worse place by filing
| that patent will prevent them, or a reader of this
| discussion, from doing it again in future.
| Hizonner wrote:
| Well, given that the technique itself makes the world a worse
| place, anything that impedes its use is probably positive...
|
| And, no, they couldn't do anything meaningful to the author
| of the article. They could get them ordered not to do it any
| more, and they could recover their economic damages... which
| are obviously zero.
| RealityVoid wrote:
| _Can_ they sue them for patent infringement? They just
| described a technique (that you can see in the patent filing
| anyway) and are not selling a product based on it. I think
| there's nothing to sue here. I'm curious if my understanding
| of this is correct.
| singleshot_ wrote:
| "Except as otherwise provided... whoever without
| authority... uses... any patented invention...
| infringes[.]" 35 usc 271
| krupan wrote:
| One of the benefits of the patent system (that now seems to
| be far outweighed by negatives) is that patents are public
| information. Your invention is documented for all to see. I
| don't think that someone writing about public information
| is a punishable offense, but IANAL
| IncreasePosts wrote:
| No. The author could not be sued for this successfully. All
| they did was write a blog post about an interesting
| technique. They could literally read the patent application
| and write a blog post about that, assuming the methods are
| the same.
|
| What percentage of your actions are based around making the
| world a better place, instead of personal fulfillment or
| gain?
| coldpie wrote:
| > All they did was write a blog post about an interesting
| technique. They could literally read the patent application
| and write a blog post about that, assuming the methods are
| the same.
|
| Okay, change "sue" to "prevent from creating a marketable
| product without paying a royalty to the patent owner in
| return for having provided nothing of value." The point
| remains.
|
| > What percentage of your actions are based around making
| the world a better place, instead of personal fulfillment
| or gain?
|
| Many harms are unavoidable, but I make a point to at least
| not go out of my way to make it a worse place, for example
| by filing software patents. The company I work for provides
| financial bonuses for filing software patents, and I will
| never participate in that program. (I've even tried to
| convince the lawyers to license our patents similar to Red
| Hat's open patent promise, because they claim they are
| intended only to be used defensively... but no luck so
| far.)
| rolph wrote:
| consider how far you reach to make the world better.
|
| 1) thats really good im gonna, strive to keep it.
|
| 2) " " tell all and those who want will build one.
|
| 3) " " make lots and give them to everyone.
| ooterness wrote:
| As a wise man once said: "Don't hate the player, hate the
| game."
| shermantanktop wrote:
| Where'd the game come from? Hint: the players.
| krupan wrote:
| First of all, it's not just a game, it's an outright
| battle to the death (of your company). Sure, you can
| choose not to wield patents, even in self defense, but
| good luck with that.
| coldpie wrote:
| You can also choose to legally declare that your patents
| may only be used for defensive purposes. But no one ever
| does this, because they do not actually intend to use
| them only for defensive purposes. This is a bogus defense
| of software patents.
| krupan wrote:
| See my other comments to you. Sometimes the threat of a
| good offensive weapon is the best defense. It's kinda
| like a nuclear arms race
| shermantanktop wrote:
| Nope. That's not how piles of patents are wielded
| defensively by the big companies. They don't protect
| their IP with defensive patents, they defend their
| company using the threat of using unrelated patents
| offensively against the attacker.
| krupan wrote:
| Please see my comment about the sad necessity for
| patents
|
| https://news.ycombinator.com/item?id=43026595
| frereit wrote:
| Would this patent cover just the encoding alone? The first
| sentence says: > A method, apparatus, and system relating to
| embedding hidden content within a Unicode message and using the
| hidden content to perform a particular computer action.
|
| So, in my extremely unqualified opinion, just the encoding
| technique alone is not covered by the patent, only when
| combined with some action performed based on the encoding?
| palata wrote:
| Respectfully: how the hell would that be a valid patent? Feels
| like patenting the idea of writing text in white on white on a
| Word document such that you don't lose it but it doesn't get
| printed.
|
| It's just insane to ever call that "an invention".
| detourdog wrote:
| I think the magic is in the context of Unicode. Which also
| makes it almost twice as ridiculous from my point of view.
| Because it seems to be doing exactly what unicode is meant to
| do.
| rwmj wrote:
| Companies acquire indefensible patents all the time. They are
| used in bulk to threaten smaller competitors ("we've got 500
| patents in this field, better not bring your product to
| market"). This is one reason why patents can be terrible for
| competition.
| shermantanktop wrote:
| They are also used in bulk to defend against larger
| competitors using this type of threat. In a war where the
| ammunition is garbage, you either lose or you start
| hoarding garbage.
| neilv wrote:
| About 25 years ago, this was explained to me as "sword
| patents and shield patents".
|
| Sure, some can use patents as swords, to suppress
| legitimate competition, or to extract undue rents. But you
| can also use patents as shields, to protect in various ways
| against those swords.
|
| If I ran a BigTech (like the original warm-fuzzy Google
| reputation), I'd be registering any plausible patents, and
| have lawyers figure out how to freely license-out the ones
| that weren't key secret sauce, under terms that
| figuratively poisoned anyone doing illegitimate sword
| patents.
| palata wrote:
| > If I ran a BigTech
|
| History tells us that those who run a BigTech become
| crazy narcissists serving their own interests :).
| neilv wrote:
| For myself, that's a chance I'm willing to take. :)
| krupan wrote:
| Patents are part of the game you have to play, like it or
| not. If you don't patent your inventions somebody else will
| and they will come after you with their lawyers. Patents are
| used defensively far more often than they are used
| offensively in these stupid "Intellectual Property" battles.
|
| Because of this, there is absolutely no point in shaming
| someone for patenting a thing, especially when they are
| apologetic about it like parent is, and most especially when
| they are not threatening to weaponize the patent themselves.
| coldpie wrote:
| No, I don't buy it. If the patents are publicly and
| perpetually freely licensed except for defensive-only
| purposes, then sure, they're not unethical. Red Hat's
| patent promise ( https://www.redhat.com/en/about/patent-promise )
| is one example. If patents were actually intended
| for defensive purposes only, then this would be an easy and
| uncontroversial thing to do. However, in practice this is
| vanishingly rare, and lawyers fight against it tooth &
| nail. This tells you that the companies do not actually
| file them for defensive-only purposes, unlike what you
| claim.
| krupan wrote:
| My friend, you really don't know what you are talking
| about, and getting all riled up like this is not the
| right way to learn.
|
| Freely licensing your patents doesn't protect you against
| patent trolls. I wrote out how patent fights work in
| another comment, but here it is again.
|
| Company A comes to Company B and says, "Hey! You are
| infringing on one of my patents!"
|
| Company B says, "oh really? Well let me look through my
| collection of patents and see if you are infringing on
| any of mine."
|
| Company A says, "oh, um, nevermind, I think I was
| mistaken."
|
| Company B says, "yes, that's what I thought"
|
| Now, imagine if Company B had already freely licensed all
| their patents. That defense wouldn't work.
|
| I agree with you that it's a crappy system, but simply
| standing with your arms folded and saying, "I'm not
| playing," isn't going to work.
| coldpie wrote:
| Yes, that's the reason for the "except for defensive
| purposes" part. Quoting from Red Hat's promise:
|
| > Our Promise also does not extend to the actions of a
| party (including past actions) if at any time the party
| or its affiliate asserts a patent in proceedings against
| Red Hat (or its affiliate) or any offering of Red Hat (or
| its affiliate) (including a cross-claim or counterclaim).
|
| Company B may still consult its portfolio and exercise it
| against Company A defensively, because Company A revoked
| its license of Company B's patents by asserting against
| Company B in the first place.
| krupan wrote:
| So in other words, Red Hat does not _freely_ license
| their patents, they say "you are free as long as you
| don't come after us." Which is exactly the system 99% of
| companies follow, just more formally stated. Yet you
| berated the poor guy from Pebble for even obtaining the
| patent he did??
| coldpie wrote:
| > Which is exactly the system 99% of companies follow,
| just more formally stated
|
| Not just formally, but in a _legally binding_ manner,
| including if the patent is acquired by another company
| (eg during a company purchase). Even if the original
| filer has the best intentions, companies change ownership
| or change legal strategy or go out of business. Patent
| trolls buy up those patents from closed companies.
| Legally licensing your patents for defensive-only
| purposes means they can't ever be used by any of those
| bad actors.
|
| If the intent of these patents is truly only for defense,
| then why isn't it common to use a license like this? They
| lose nothing by it.
|
| > Yet you berated the poor guy from Pebble for even
| obtaining the patent he did??
|
| Yes. It is IMO unethical to create software patents that
| aren't covered by such a legally-binding license.
| krupan wrote:
| "including if the patent is acquired by another company
| (eg during a company purchase)"
|
| Honest questions, I promise: Is that true? Has that ever
| been tested in court? Why don't more corporations or
| patent lawyers advocate for this? Is it because the types
| of engineers that post on hacker news are requesting it
| not be done?
|
| Look, nobody likes patent trolls, we all hate weaponized
| patents. It's great that you want to fix the situation. I
| just think you are barking up the wrong tree trying to
| lay guilt trips on engineers for doing what their lawyer
| advised them to do.
| coldpie wrote:
| Nothing is certain in courts, obviously, but Red Hat's
| license is very explicit that that is the intent:
|
| > Red Hat intends Our Promise to be irrevocable (except
| as stated herein), and binding and enforceable against
| Red Hat and assignees of, or successors to, Red Hat's
| patents (and any patents directly or indirectly issuing
| from Red Hat's patent applications). As part of Our
| Promise, if Red Hat sells, exclusively licenses, or
| otherwise assigns or transfers patents or patent
| applications to a party, we will require the party to
| agree in writing to be bound to Our Promise for those
| patents and for patents directly or indirectly issuing on
| those patent applications. We will also require the party
| to agree in writing to so bind its own assignees,
| transferees, and exclusive licensees.
|
| If a court somehow overturned that, I wouldn't hold it
| against the patent filer.
|
| > Why don't more corporations or patent lawyers advocate
| for this?
|
| My opinion is it's because the patents have value as a
| weapon, not only for defense (this here is my
| disagreement with your original claim that these patents
| only exist for defense). De-fusing the weapon by using a
| legally binding license like this lowers the value of the
| patent in a potential purchase scenario. In other words:
| "money."
|
| > I just think you are barking up the wrong tree trying
| to lay guilt trips on engineers for doing what their
| lawyer advised them to do.
|
| Nah. If you do a bad thing, you are responsible for the
| bad thing you did. I think the OP can probably handle a
| little light scolding from some anonymous person on an
| Internet forum. My hope is that they, and other readers,
| learn from this mistake and don't do it again.
| fortran77 wrote:
| Perhaps you shouldn't hijack every thread about anything
| and make it about patents.
| coldpie wrote:
| I replied to one comment thread. Perhaps you should put
| on your big boy pants and use the little [-] thing to
| minimize threads you aren't interested in reading.
| palata wrote:
| > Because of this, there is absolutely no point in shaming
| someone for patenting a thing
|
| Well I wouldn't shame someone whose job was to patent
| something absurd. I was just saying that this is not an
| invention at all, and any system that protects that
| "innovation" is a broken system.
| dboreham wrote:
| Almost all filed patents are invalid.
| palata wrote:
| But doesn't it say that the whole patent system is broken?
| I get the "you pay to file a patent, it's your problem if
| it's invalid in the end". But the side effect of that is
| that whether it's valid or not, it's a tool you can use to
| scare those who don't have the resources to go to court.
|
| It's like those completely abusive non-compete clauses in
| work contracts (yes in some countries that's the norm).
| They are completely abusive and therefore illegal. But it
| still hurts the employee: I have friends who have been
| declined a job in a company because the company did not
| want to take any risk. The company was like "okay, it's
| most likely an invalid clause, but if your previous
| employer sues us it will anyway cost resources we don't
| want to spend, so we'd rather not hire you". So an illegal,
| invalid clause had the effect that the company who abused
| it wanted. Which means it's a broken system.
| numpad0 wrote:
| fun fact: dSLR lenses are patented all the time. Claims are
| basically "I made it and it works". And it's considered ok.
| detourdog wrote:
| Just curious: this seems like simple digital steganography or
| maybe even the same as Shannon's boolean gate work. Do you
| think the patent is defensible in court?
|
| https://en.wikipedia.org/wiki/Steganography
| Etheryte wrote:
| Hopefully a wholly indefensible patent; you're essentially
| trying to patent the Unicode spec. The rest of it is performing an
| action in response to a text message, which clearly isn't novel.
| riskable wrote:
| Oh this is just the tip of the iceberg when it comes to abusing
| Unicode! You can use a similar technique to this to overflow the
| buffer on _loads_ of systems that accept Unicode strings.
| Normally it just produces an error and/or a crash but
| _sometimes_ you get lucky and it'll do all sorts of fun things!
| :)
|
| I remember doing penetration testing _waaaaaay_ back in the day
| (before Python 3 existed) and using mere diacritics to turn a
| single character into many bytes that would then overflow the
| buffer of a back-end web server. This only ever caused it to
| crash (and usually auto-restart) but I could definitely see how
| this could be used to exploit certain systems/software with
| enough fiddling.
| capitainenemo wrote:
| Yeah. Zalgo text is a common test for input fields on websites.
| But it usually doesn't do anything interesting. Maybe an
| exception trigger on some database length limit. Doesn't
| typically even kill any processes. The exception is normally
| just in your thread. You can often trigger it just by disabling
| JS on even modern forms, but at best you're maybe leaking a
| bit of info if they left debug on and print the stack trace or
| a query. Another common slip-up is failing to count \n vs \r\n
| in text strings, since JS usually counts a line break as 1
| byte, but the HTTP spec requires two.
|
| unescape(encodeURIComponent("c")).length is the quick and dirty
| way to do a JS byte length check. The \r\n thing can be done
| just by cleaning up the string before length counting.
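The byte-versus-character mismatch described above can be sketched in Python (used here only for illustration; the one-liner in the thread is JavaScript). The helper name `utf8_length` and its CRLF handling are assumptions made for this example, not something from the thread:

```python
def utf8_length(text: str, crlf: bool = True) -> int:
    """Return the UTF-8 byte length of text.

    With crlf=True every line break is counted as the two bytes "\r\n",
    which is how a form submission would typically put it on the wire.
    """
    if crlf:
        # Collapse CRLF and lone CR to LF first, then expand each LF to CRLF,
        # so already-normalized input is not counted twice.
        text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\r\n")
    return len(text.encode("utf-8"))


# One visible letter carrying five combining acute accents (Zalgo-style).
zalgo = "e" + "\u0301" * 5
print(len(zalgo), utf8_length(zalgo))  # code points vs. UTF-8 bytes
```

A length limit enforced in code points on the client and in bytes on the server will disagree on strings like `zalgo`, which is the kind of mismatch that surfaces as a database length exception.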
| n0id34 wrote:
| Sorry n00b here, can you explain more about this or how you did
| this? I feel like this is definitely a loophole that would be
| worth testing for.
| zurfer wrote:
| ha
| paulgb wrote:
| I see what you did there ;)
| Alifatisk wrote:
| ;)
| kevinsync wrote:
| StegCloak [0] is in the same ballpark and takes this idea a step
| further by encrypting the hidden payload via AES-256-CTR --
| pretty neat little trick
|
| [0] https://github.com/KuroLabs/stegcloak
| giancarlostoro wrote:
| There's a Better Discord plugin that I think uses this or
| something similar, so you could send completely encrypted
| messages, that look like nothing to everyone else. You'd need
| to share a password secret for them to decode it though.
| putna wrote:
| wow, that's neat.
|
| Wanted to try on Cloudflare DNS TXT record. But Cloudflare is
| smart enough to decode when pasting in TXT field.
| UltraSane wrote:
| DNS only supports ASCII for record values. It has a hack to
| support unicode domain names using Punycode
| vessenes wrote:
| I love the idea of using this for LLM output watermarking. It
| hits the sweet spot - will catch 99% of slop generators with no
| fuss, since they only copy and paste anyway, almost no impact on
| other core use cases.
|
| I wonder how much you'd embed with each letter or token that's
| output - userid, prompt ref, date, token number?
|
| I also wonder how this is interpreted in a terminal. Really cool!
| zos_kia wrote:
| With the amount of pre processing that is done before
| integrating stuff in a dataset I'd be surprised if those kinds
| of shenanigans even worked
| OutOfHere wrote:
| Just you wait until AI starts calling human output slop.
| roguecoder wrote:
| There are of course human writers who are less-communicative
| than AI, called "shit writers", and humans who are less
| accurate than AI, called "liars".
|
| The difference is humans are responsible for what they write,
| whereas the human user who used an AI to generate text is
| responsible for what the computer wrote.
| vessenes wrote:
| That's already happening - my kids have had papers unfairly
| blamed on chatgpt by automated tools. Protect yourself kids,
| use an editor that can show letter by letter history.
| red369 wrote:
| Do you have any examples of editors that show letter by
| letter history? I have never looked for that as a feature.
|
| Edit: I've been looking, and Google Docs seems to have
| version history to the minute.
| vessenes wrote:
| Yes exactly. They keep track of their diffs in that
| interface.
| neom wrote:
| 2 people I worked with had this happen and one of them is
| going to war over it as it was enough to lower the kid's
| grade for college or something. Crazy times.
| capitainenemo wrote:
| In most linux terminals, what you pass it is just a sequence of
| bytes that is passed unmangled. And since this technique is
| UTF-8 compliant and doesn't use any extra glyphs, it is
| invisible to humans in unicode compliant terminals. I tried it
| on a few. It shows up if you echo the sentence to, say, xxd
| ofc.
|
| (unlike the PUA suggestion in the currently top voted comment
| which shows up immediately ofc)
|
| Additional test corrections: While xxd shows the message
| passing through completely unmangled on _pasting_ it into the
| terminal, when I _selected_ from the terminal (echoed sentence,
| verified unmangled in xxd, then selected and pasted the result
| of echo), it was truncated to a few words using X select in
| mate terminal and konsole - I'm not sure where that truncation
| happens, whether it's the terminal or X. In xterm, the final e
| was mangled, and the selection was even more truncated.
|
| The sentence is written unmangled to files though, so I think
| it's more about copying out of the terminal dropping some data.
| Verified by echoing the sentence to a test file, opening it in
| a browser, and copying the text from there.
| vessenes wrote:
| On MacOS, kitty shows an empty box, then an a for the "ha"
| post below. I think this is fair and even appreciated. Mac
| Terminal shows "ha". That "ha" (and this one!) can be copied
| and pasted into the decoder successfully.
| ChadNauseam wrote:
| There are other possible approaches to LLM watermarking that
| would be much more robust and harder to detect. They exploit
| the fact that LLMs work by producing a probability distribution
| that gives a probability for each possible next token. These
| are then sampled randomly to produce the output. To add
| fingerprints when generating, you could do some trickery in how
| you do that sampling that would then be detectable by re-
| running the LLM and observing its outputs. For example, you
| could alternate between selecting high-probability and low-
| probability tokens. (A real implementation of this would be
| much more sophisticated than that obviously, but hopefully you
| get the idea)
| vessenes wrote:
| This is not a great method in a world with closed models and
| highly diverse open models and samplers. It's intellectually
| appealing for sure! But it will always be at best a
| probabilistic method, and that's if you have the llm weights
| at hand.
| ChadNauseam wrote:
| What makes it not a good method? Of course if a model's
| weights are publicly available, you can't compel anyone
| using it to add fingerprinting at the sampler stage or
| later. But I would be shocked if OpenAI was not doing
| something like this, since it would be so easy and couldn't
| hurt them, but could help them if they don't want to train
| on outputs they generated. (Although they could also record
| hashes of their outputs or something similar as well - I
| would be surprised if they don't.)
| fennecfoxy wrote:
| Why does anybody think AI watermarking will ever work? Of
| course it will never work, any watermarking can be instantly &
| easily stripped...
|
| The only real AI protection is to require all human interaction
| to be signed by a key verified by irl identity and even then
| that will: A never happen, B be open to abuse by countries with
| corrupt governments and countries with corrupt governments
| heavily influenced by private industry (like the US).
| iNic wrote:
| The tokenizer catches it: https://platform.openai.com/tokenizer.
| remram wrote:
| I'm not too surprised by this, but I'm annoyed that no amount of
| configuration made those bytes visible again in my editor. Only
| using hexdump revealed them.
| bittercynic wrote:
| My mind went the same place.
|
| Anyone know a more convenient way to search larger blocks of
| text for this?
| jrootabega wrote:
| Here's a POC that works in emacs. Doesn't cover all of the
| relevant characters, but:
|
|     (setq
|      ;;some other invisible or interesting characters
|      unicode-zero-width-space ?\u200b
|      unicode-zero-width-non-joiner ?\u200c
|      unicode-zero-width-joiner ?\u200d
|      unicode-zero-width-nbsp ?\ufeff
|      unicode-narrow-nbsp ?\u202f
|      unicode-word-joiner ?\u2060
|      unicode-grapheme-joiner ?\u034f
|      unicode-no-break-space ?\u00a0
|      unicode-combining-long-stroke ?\u0336
|      ;;variation selector examples
|      unicode-vs-fe00 ?\ufe00
|      unicode-vs-fe0f ?\ufe0f
|      unicode-vs-e0100 ?\xe0100)
|
|     (defun show-glyphless-as-hex (char)
|       (let ((original (elt glyphless-char-display char)))
|         (aset glyphless-char-display char 'hex-code)
|         original)) ;;so you can see what you just replaced
|
|     (progn
|       (show-glyphless-as-hex unicode-zero-width-space)
|       (show-glyphless-as-hex unicode-zero-width-non-joiner)
|       (show-glyphless-as-hex unicode-zero-width-joiner)
|       (show-glyphless-as-hex unicode-zero-width-nbsp)
|       (show-glyphless-as-hex unicode-word-joiner)
|       (show-glyphless-as-hex unicode-grapheme-joiner)
|       (show-glyphless-as-hex unicode-narrow-nbsp)
|       (show-glyphless-as-hex unicode-no-break-space)
|       ;;these may already be visible if the current conditions
|       ;;don't support them, but we'll force them
|       (show-glyphless-as-hex unicode-vs-fe00)
|       (show-glyphless-as-hex unicode-vs-fe0f)
|       (show-glyphless-as-hex unicode-vs-e0100))
| remram wrote:
| I use vim. It seems like `:set binary enc=latin1` works,
| though I don't understand why the latin1 part is required.
| jrootabega wrote:
| Here is the bare minimum this is built on, which you can type
| in yourself if you're paranoid or want to start from the
| bottom up. Swap in the hexadecimal codepoint of the invisible
| character after the ?\x:
|
|     (aset glyphless-char-display ?\xfe00 'hex-code)
| jrootabega wrote:
| And as a higher-level configuration you can set most, maybe
| even all, of the relevant invisible characters (still not
| sure how 0x34f grapheme joiner fits in) at once with
| something like:
|
|     (custom-set-variables
|      '(glyphless-char-display-control
|        '((format-control . hex-code)
|          (variation-selectors . hex-code))))
|
| This will modify values in glyphless-char-display, but it's
| OK to modify those directly if you need to.
| mdouglass wrote:
| vscode's "Unicode Highlight: Non-basic ASCII" causes the
| character to get highlighted. Sadly, the more appropriate
| "Unicode Highlight: Invisible Characters" setting does not
| reveal them.
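For searching larger blocks of text outside an editor, a small script is one option. The sketch below is only an illustration: the function names are made up for this example, and it flags just the two variation selector ranges plus anything in Unicode general category Cf (format characters, which includes zero-width space, zero-width joiner, word joiner and the bidi controls). It does not catch everything listed above; the combining grapheme joiner (U+034F), for instance, is category Mn and is missed.

```python
import unicodedata


def is_variation_selector(cp: int) -> bool:
    """True for U+FE00..U+FE0F and U+E0100..U+E01EF."""
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def find_invisible(text: str):
    """Yield (index, 'U+XXXX', name) for each suspicious code point."""
    for index, ch in enumerate(text):
        cp = ord(ch)
        if is_variation_selector(cp) or unicodedata.category(ch) == "Cf":
            yield index, f"U+{cp:04X}", unicodedata.name(ch, "<unnamed>")


if __name__ == "__main__":
    import sys

    # Usage: python find_invisible.py < suspicious.txt
    for index, code, name in find_invisible(sys.stdin.read()):
        print(index, code, name)
```

Reporting by code point index keeps the script short; a line and column report would be more convenient for long files.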
| nonameiguess wrote:
| More generally, you can use encoding formats that reserve
| uninterpreted byte sequences for future use to pass data that is
| only readable by receivers who know what you're doing, though
| note this is not a cryptographically secure scheme and any sort of
| statistical analysis can reveal what you're doing.
|
| The png spec, for instance, allows you to include as many
| metadata chunks as you wish, and these may be used to hold data
| that cannot be used by any mainstream png reader. We used this in
| the Navy to embed geolocation and sensor origin data that was
| readable by specialized viewers that only the Navy had, but if
| you opened the file in a browser or common image viewer, it would
| either ignore or discard the unknown chunks.
| dkarl wrote:
| Lots of image formats store arbitrary metadata (and data data)
| either by design or by application-specific extensions. I
| remember seeing seismic and medical images that contained data
| for display in specialized applications and writing code to
| figure out if binary metadata was written in big-endian or
| little-endian byte order (the metadata often did not have the
| same endianness as the image data!) For example, TIFF files
| containing 3d scans as a sequence of slices, with binary
| metadata attached to each slice. If you opened it up in your
| system default image viewer, you'd only see the first slice,
| but a specialized viewer (which I did not have) would display
| it as a 3d model. Luckily (IIRC) the KDE file browser let you
| quickly flip through all the images in a directory using the
| keyboard, so I was able to dump all the layers into separate
| files and flip through them to see the 3d image.
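The PNG case mentioned above can be sketched with the standard library alone. This is a hedged illustration, not the format or tooling the commenter used: the chunk type `geLo` and the helper names are hypothetical. The letter case follows the PNG chunk naming convention (lowercase first letter for ancillary, lowercase second letter for private, uppercase third letter as required, lowercase fourth letter for safe-to-copy).

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one chunk: 4-byte length, type, data, CRC-32 over type+data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (chunk_type, data) pairs from a PNG byte string."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        yield chunk_type, data
        pos += 12 + length  # length field + type + data + CRC


def add_private_chunk(png: bytes, chunk_type: bytes, payload: bytes) -> bytes:
    """Return a copy of png with an extra chunk inserted just before IEND."""
    out = [PNG_SIGNATURE]
    for ctype, data in iter_chunks(png):
        if ctype == b"IEND":
            out.append(make_chunk(chunk_type, payload))
        out.append(make_chunk(ctype, data))
    return b"".join(out)
```

A call such as `add_private_chunk(png_bytes, b"geLo", b"lat=0,lon=0")` leaves the image data untouched. Decoders that do not recognize the chunk type are expected to skip it, which matches the behavior described in the comment.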
| albybisy wrote:
| wow!
| 65 wrote:
| This would be useful as a fingerprinting technique for
| corporate/government leakers.
| jaygreco wrote:
| Interestingly, it's also possible to encode _emoji_ inside emoji!
| HanClinto wrote:
| Even more than just simply watermarking LLM output, it seems like
| this could be a neat way to package logprobs data.
|
| Basically, include probability information about every token
| generated to give a bit of transparency to the generation
| process. It's part of the OpenAI api spec, and many other engines
| (such as llama.cpp) support providing this information. Normally
| it's attached as a separate field, but there are neat ways to
| visualize it (such as mikupad [0]).
|
| Probably a bad idea, but this still tickles my brain.
|
| * [0]: https://github.com/lmg-anon/mikupad
| fortran77 wrote:
| What's interesting is that even a "view source" shows nothing
| amiss, and if I do a copy/paste from the debug inspector view of
| "This sentence has a hidden message." it still shows up....
| rexxars wrote:
| For a real-world use case: Sanity used this trick[0] to encode
| Content Source Maps[1] into the actual text served on a webpage
| when it is in "preview mode". This allows an editor to easily
| trace some piece of content back to a potentially deep content
| structure just by clicking on the text/content in question.
|
| It has its drawbacks/limitations - eg you want to prevent adding
| it for things that need to be parsed/used verbatim, like
| date/timestamps, urls, "ids" etc - but it's still a pretty fun
| trick.
|
| [0] https://www.sanity.io/docs/stega
|
| [1] https://github.com/sanity-io/content-source-maps
| vzaliva wrote:
| The title is a little misleading: "Note that the base character
| does not need to be an emoji - the treatment of variation
| selectors is the same with regular characters. It's just more fun
| with emoji."
|
| Using this approach with non-emoji characters makes it more
| stealth and even more disturbing.
| ComputerGuru wrote:
| This is cute but unnecessary - Unicode includes a massive range
| called PUA: the private use area. The codes in this range aren't
| mapped to anything (and won't be mapped to anything) and are for
| internal/custom use, not to be passed to external systems (for
| example, we use them in fish-shell to safely parse tokens into a
| string, turning an unescaped special character into just another
| Unicode code point in the string, but in the PUA area, then
| intercept that later in the pipeline).
|
| You're not _supposed_ to expose them outside your api boundary
| but when you encounter them you are prescribed to pass them
| through as-is, and that's what most systems and libraries do.
| It's a clear potential exfiltration avenue, but given that most
| sane developers don't know much more about Unicode other than
| "always use Unicode to avoid internationalization issues", it's
| often left wide open.
| paulgb wrote:
| I just tested and private use characters render as boxes for me
| (), the point here was to encode them in a way that they are
| hidden and treated as "part of" another character when
| copy/pasting.
| diggan wrote:
| > the point here was to encode them in a way that they are
| hidden and treated as "part of" another character when
| copy/pasting
|
| AKA "Steganography" for the curious ones:
| https://en.wikipedia.org/wiki/Steganography
| reaperducer wrote:
| Like when we used to encode the phone numbers of warez
| boards in GIFs.
| bruce343434 wrote:
| On my Android phone, that displays "Go[][]" in the Google logo
| font.
| layer8 wrote:
| The difference is that PUA characters are usually rendered in
| some way that is rather visible, whereas the variation
| selectors aren't.
| juped wrote:
| I'll be honest, I pasted this comment in the provided decoder
| thinking no one could miss the point this badly and there was
| probably a hidden message inside it, but either you really did
| or this website is stripping them.
|
| You can't invisibly watermark an arbitrary character (I did it
| to one above! If this website isn't stripping them, try it out
| in the provided decoder and you'll see) with unrecognized PUA
| characters, because it won't treat them as combining
| characters. You will cause separately rendered
| placeholder-box characters to appear. Like this one: (may not
| be a placeholder-box if you're privately-using the private use
| area yourself).
| egypturnash wrote:
| j is for Jabberwocky. Does this decode?
|
| edit: Yes, it does.
| Sniffnoy wrote:
| Isn't this more what the designated noncharacters are for,
| rather than the private-use area? Given how the private-use
| area sometimes gets used for unofficial encodings of scripts not
| currently in Unicode (or for things like the Apple logo and
| such) I'd be worried about running into collisions with that if
| I used the PUA in such a way.
|
| Note that designated noncharacters includes not only 0xFFFF and
| 0xFFFE, and not only the final two code points of every plane,
| but also an area in the middle of Arabic Presentation Forms
| that was at some point added to the list of noncharacters
| specifically so that there would be more noncharacters for
| people using them this way!
| lolinder wrote:
| Context that some may be missing is that this was inspired by
| discussion surrounding the _Open Heart Protocol_ submission:
|
| https://news.ycombinator.com/item?id=42791378
|
| People immediately began discussing the applications for
| criminal use given the constraint that only emoji are accepted
| by the API. So for that use case the PUA wouldn't be an option,
| you have to encode it in the emoji.
| frontporch wrote:
| you dont need 256 codepoints so you can neatly represent an octet
| (whatever that is), you just need 2 bits. you can just stack as
| many diacritical marks you want on any glyph. either the renderer
| allows practically unlimited or it allows 1/none. in either case
| that's a vuln. what would be really earth shattering is what i
| was hoping this article was: a way to just embed "; rm -rf ~/"
| into text without it being rendered. you also definitely dont
| need rust for this unless you want to exclude 90% of the
| programmer population.
| paulgb wrote:
| I think the Rust is more readable for bytemucking stuff than
| dynamic languages because the reader doesn't have to infer the
| byte widths, but for what it's worth the demo contains a
| TypeScript implementation: https://github.com/paulgb/emoji-
| encoder/blob/main/app/encodi...
| AdamH12113 wrote:
| An octet is a group of 8 bits. Today we normally use the word
| "byte" instead. The term is often used in older internet
| protocols and comes from an era where bytes were not
| necessarily 8 bits.
| panki27 wrote:
| I implemented something similar years ago, but much simpler/less
| sophisticated.
|
| Unicode has two non-printing space characters: zero-width space
| (U+200B) and zero-width joiner (U+200D). This allows you to
| encode arbitrary data in binary. I would give an example, but HN
| seems to strip this :(
| layer8 wrote:
| Already linked in
| https://news.ycombinator.com/item?id=43025913, and has a higher
| risk of being stripped, as you noticed.
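The two-character scheme described above amounts to a binary alphabet. Here is a minimal sketch, assuming zero-width space stands for bit 0 and zero-width joiner for bit 1; that assignment and the function names are choices made for this example, not the commenter's actual implementation:

```python
ZERO = "\u200b"  # ZERO WIDTH SPACE   -> bit 0 (assumed mapping)
ONE = "\u200d"   # ZERO WIDTH JOINER  -> bit 1 (assumed mapping)


def hide(cover: str, secret: bytes) -> str:
    """Append the secret to the cover text as invisible bits, MSB first."""
    bits = "".join(f"{byte:08b}" for byte in secret)
    return cover + "".join(ONE if bit == "1" else ZERO for bit in bits)


def reveal(text: str) -> bytes:
    """Collect the zero-width characters in text and rebuild the bytes."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # drop a trailing partial byte, if any
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


stego = hide("nothing to see here", b"hi")
print(len("nothing to see here"), len(stego), reveal(stego))
```

Each payload byte costs eight extra code points here, against one variation selector per byte in the article's approach, and as noted above these characters are more likely to be stripped in transit.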
| Mockapapella wrote:
| In the same vein, I did some fun unicode abusing a few years ago
| where I used scripts to convert programs into series of various
| ZWJ's: https://thelisowe.substack.com/p/sleeper-cell-a-method-of-
| em...
|
| Also includes a decoder script
| blmarket wrote:
| This and several other abuse cases forced my previous work to use
| code points to count 'characters' for user's nickname / status
| messages. No one wanted to download 9MB simply browsing other
| users.
| ncr100 wrote:
| That is awesome. Both the abuse and the fix.
| myflash13 wrote:
| NoSQL? Sounds like it should've been caught by basic length
| checks on the database field where it was stored.
| rafram wrote:
| This is cool. I tried pasting the output into an Instagram
| comment and it stayed intact, so I have a feeling someone could
| do some interesting stuff with that. Who needs a botnet C&C
| server when you can post totally invisible commands on public
| forums?
| the_hoffa wrote:
| I mean, steganography has been a thing for quite a while. Not
| disagreeing, just saying this is how some programs/ideas were
| passed around the internet decades ago by "less than upstanding
| netizens" ;)
|
| Wanted to pass a secret code to a friend? Encode the bit-data
| in the alpha channel of an image. It could even be
| encrypted/scrambled within the image itself. Post the perfectly
| normal image to a public forum, ping your friend, they run it
| through the "decoder" and Robert's your mother's brother.
|
| Of course these weren't "logic bombs" like this post is
| describing, but even those have been around for a while too.
|
| Hacking is fun :)
| cynicalsecurity wrote:
| Ctrl+F "unicode normalisation" 0/0
|
| I'm surprised no one has mentioned it yet. It's usually super
| easy, but people forget to add it all the time.
| paulgb wrote:
| I haven't tried it but I've heard that at least some unicode
| normalizers do not strip sequences of variation selectors.
| moody__ wrote:
| Normalization implementations must not strip variation
| selectors by definition. The "normal" part of normalization
| means to convert a string into either consistently decomposed
| unicode, or composed unicode. ie U+00DC vs U+0055 + U+0308.
| However this decomposition mapping is also used (maybe more
| like abused) for converting certain "legacy" code points to
| non-legacy code points. There does not exist a rune which
| decomposes to variant selectors (and thus these variant
| selectors do not compose into anything) so normalization must
| not alter or strip them.
|
| source: I've implemented Unicode normalization from scratch
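The claim above is easy to check with Python's standard `unicodedata` module. This sketch shows the U+00DC example from the comment and then confirms that a base letter followed by two variation selectors passes through all four normalization forms unchanged. The sample string and the helper name are assumptions made for the demonstration:

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Composed vs. decomposed spelling of the same letter, as in the comment.
composed = "\u00dc"      # LATIN CAPITAL LETTER U WITH DIAERESIS
decomposed = "U\u0308"   # U + COMBINING DIAERESIS
print(unicodedata.normalize("NFD", composed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)


def survives_normalization(text: str) -> dict:
    """Map each normalization form to whether it leaves text unchanged."""
    return {form: unicodedata.normalize(form, text) == text for form in FORMS}


# 'a' followed by one selector from each variation selector block.
payload = "a\ufe01\U000e0155"
print(survives_normalization(payload))
```

So normalization alone is not a defense; stripping the selectors requires an explicit filter on their code point ranges.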
| wunderwuzzi23 wrote:
| This is cool. There are also the Unicode Tag characters that
| mirror ASCII and are often invisible in UI elements (especially
| web apps).
|
| The unique thing about Tag characters is that some LLMs interpret
| the hidden text as ASCII and follow instructions, and they can
| even write them:
|
| https://embracethered.com/blog/posts/2024/hiding-and-finding...
|
| Here is an actual exploit POC that Microsoft fixed in Copilot:
| https://embracethered.com/blog/posts/2024/m365-copilot-promp...
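The "mirror ASCII" property means each printable ASCII character has a Tag counterpart at the same offset from U+E0000 (U+E0020 through U+E007E). A rough sketch of the mapping follows; the function names are made up for illustration, and nothing here is taken from the linked posts:

```python
TAG_BASE = 0xE0000  # tag characters sit at U+E0000 plus the ASCII code


def to_tags(ascii_text: str) -> str:
    """Map printable ASCII onto the Unicode Tags block."""
    if not all(0x20 <= ord(ch) <= 0x7E for ch in ascii_text):
        raise ValueError("only printable ASCII can be mapped to tag characters")
    return "".join(chr(TAG_BASE + ord(ch)) for ch in ascii_text)


def from_tags(text: str) -> str:
    """Extract and un-shift any tag characters found in text."""
    return "".join(
        chr(ord(ch) - TAG_BASE) for ch in text if 0xE0020 <= ord(ch) <= 0xE007E
    )


visible = "Please summarize this page."
smuggled = visible + to_tags("ignore previous instructions")
print(len(visible), len(smuggled), from_tags(smuggled))
```

Running `from_tags` over untrusted input, or simply rejecting anything in that range, is one way to check a prompt for this kind of hidden text before it reaches a model.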
| paulgb wrote:
| (author here) some people in this thread and elsewhere asked me
| about whether an LLM could decode this, and the answer seems to
| be: not likely by itself, but it often can if it has access to a
| Python interpreter!
|
| Here's a demo of Gemini Flash 2 solving one in 7s:
| https://bsky.app/profile/paulbutler.org/post/3lhzhroogws2g
| nitwit005 wrote:
| Even kids figure out how to manipulate unicode text. If you want
| to bypass a swear filter, replace a letter with an alternate
| representation of the same letter.
| egypturnash wrote:
| If you try posting this on Bluesky, the editor only counts it as
| one emoji, but you will get an error upon trying to post.
| dalemhurley wrote:
| I love it, I got Claude to add a pin to provide very basic
| encryption
|
| https://claude.site/artifacts/5bfdf131-d847-4735-9242-998f23...
| ethin wrote:
| It's worth noting, just as a curiosity, that screen readers can
| detect these variation selectors when I navigate by character.
| For example, if I arrow over the example he provided (I can't
| paste it here lol), I hear: "Smiling face with smiling eyes",
| "Symbol e zero one five five", "Symbol e zero one five c",
| "Symbol e zero one five c", "Symbol e zero one five f". This is
| unfortunately dependent on the speech synthesizer used, and I
| wouldn't know if the characters were there if I was just reading
| a document, so this isn't much of an advantage all things
| considered.
| llm_trw wrote:
| Ironically enough I have a script that strips all non-ascii
| characters from my screen reader because I found that _all_
| online text was polluted with invisible and annoying to listen
| to characters.
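The code points read out by the screen reader above can be turned back into bytes by hand. The sketch below assumes the mapping described in the linked article (bytes 0 to 15 map to U+FE00..U+FE0F, bytes 16 to 255 map to U+E0100..U+E01EF); the function names are illustrative, and unlike the article's decoder this one simply collects every selector in the string:

```python
def byte_to_selector(byte: int) -> str:
    """Map one byte to a variation selector (assumed article mapping)."""
    return chr(0xFE00 + byte) if byte < 16 else chr(0xE0100 + byte - 16)


def selector_to_byte(ch: str):
    """Inverse mapping; returns None for anything that is not a selector."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def encode(base: str, payload: bytes) -> str:
    """Attach payload to a base character as trailing variation selectors."""
    return base + "".join(byte_to_selector(b) for b in payload)


def decode(text: str) -> bytes:
    """Recover the bytes carried by every variation selector in text."""
    return bytes(b for b in map(selector_to_byte, text) if b is not None)


# Smiling face followed by the four code points quoted in the comment.
heard = "\U0001F60A\U000E0155\U000E015C\U000E015C\U000E015F"
print(decode(heard))  # b'ello'
```

Under that mapping U+E0155 is byte 0x65 ("e"), U+E015C is 0x6C ("l") and U+E015F is 0x6F ("o"), so the four announced symbols spell "ello".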
| andrethegiant wrote:
| Clever! I made a similar emoji encoding/decoding microsite:
| https://face64.me
| omnibrain wrote:
| 10 years or so ago I shocked coworkers with using U+202E RIGHT-TO-
| LEFT OVERRIDE mid in filenames on windows. So
| funnypicturegnp.exe became funnypictureexe.png Combined with a
| custom icon for the program that mimics a picture preview it was
| pretty convincing.
| hosteur wrote:
| Wow this is a clever trick.
| mdup wrote:
| I worked in phishing detection. This was a common pattern used
| by attackers, although .exe are blocked automatically most of
| the time, .html is the new malicious extension (often hosting
| an obfuscated window.location redirect to a fake login page).
|
| RTL abuse like cute-cat-lmth.png was relatively common, but
| also trivial to detect. We would immediately flag such an email
| as phishing.
| taneq wrote:
| I'd never heard of this particular trick but I'm glad my
| decades of paranoia-fueled "right click -> open with" treatment
| of any potentially sketchy media file was warranted! :D
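The display reversal in this trick comes from the right-to-left override, U+202E, and checking for it is as simple as the phishing comment suggests. A minimal sketch follows, with the caveat that the character set and function name are choices made for this example and not any particular product's detection rule:

```python
# Explicit bidirectional formatting characters and their usual abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def bidi_controls_in(name: str):
    """Return (index, abbreviation) for each bidi control found in name."""
    return [(i, BIDI_CONTROLS[ch]) for i, ch in enumerate(name) if ch in BIDI_CONTROLS]


# Stored name ends in ".exe"; with the override it displays as "...exe.png".
suspicious = "funnypicture\u202egnp.exe"
print(bidi_controls_in(suspicious))
print(suspicious.endswith(".exe"))
```

Checking the real extension on the stored bytes, not on what the file manager draws, is the other half of the defense.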
| tyilo wrote:
| Kitty terminal shows non-payload letters and emojis normally, but
| with a payload a letter is shown as one box and an emoji is shown
| as two boxes.
| foobuzzHN wrote:
| 10 years ago I made a POC for smuggling arbitrary data through
| _no visible text at all_: https://github.com/foobuzz/ium
| arkh wrote:
| > To be clear, this is an abuse of unicode and you shouldn't do
| it. If your mind is wandering to practical use cases for this,
| shut it down.
|
| Totally not thinking about IRC clients with their own hidden
| commands.
| qingcharles wrote:
| I was using this technique last year with Bing Image Creator.
|
| It let you get around their filter on brand names and celebrity
| names by smuggling them into the prompt in a way the AI could
| read, but the human-written filter was not designed for.
| JoelJacobson wrote:
| Imagine using the ID card emoji (U+1FAAA) as a universal carrier
| for digital ID tokens. A dumb demo is available at
| https://pit.lovable.app/ which--without any secure protocol--
| simply encodes a National Identification Number into the emoji
| using variation selectors.
|
| The idea is that banks could issue encrypted ID tokens in this
| way, letting them move seamlessly across any platform that
| supports Unicode (messaging apps, email, web forms, etc.). The
| heavy lifting of security (preventing replay attacks,
| interception, ensuring token freshness, etc.) would be managed
| separately with robust cryptography, while the emoji serves
| purely as a transport layer.
|
| It's not about reinventing security but about creating a cross-
| platform way to carry identity tokens. Thoughts?
| bruce343434 wrote:
| What is wrong with just using the actual SSN? Why hide it in an
| emoji?
| JoelJacobson wrote:
| So that the operating system could recognize it
| automatically, and to include a potentially long URL to the
| retail bank's web service to initiate the protocol, such as
| signing a document or an identification protocol.
| _nhh wrote:
| Perfect way of personalized ad tracking?
| _nhh wrote:
| Check this address after you clicked it:
|
| https://emoji.paulbutler.org/?mode=encode
|
| I encoded the last ,,e"
| urbandw311er wrote:
| When people discuss things like "Do LLMs know about this?" on a
| public website I always think that it's the equivalent of
| somebody whose phone is wiretapped calling their friend and
| asking if the FBI knows about something.
| fennecfoxy wrote:
| I think that's a very cynical view. The author seeing what an
| LLM would make of it was more akin to getting a new game and
| wondering if you can pet the dog.
| StevenNunez wrote:
| "To be clear, this is an abuse of unicode and you shouldn't do
| it. If your mind is wandering to practical use cases for this,
| shut it down." TOO LATE!
| ofou wrote:
| This is one of the reasons I've been advocating to use UTF-8 as a
| tokenizer for a long time. The actual problem IMHO are tokenizers
| themselves, which obscure the encoding/decoding process in order
| to gain some compression during training to fit more data in for
| the same budget, and arguably gaining some better understanding
| from the beginning. Again just a lack of computing power.
|
| If you use UTF-8 directly as tokenizer, this problem becomes
| evident once you fit it into the context window. Plus, you can
| run multiple tests for this type of injection; no emoji should
| take more than up to 40 bytes (10 code points * 4 bytes per code
| point in the worst case). This is an attack on tokenizers, not on
| UTF-8.
|
| Plus, Unicode publishes the full list of valid sequences
| containing the ZWJ character in emoji-zwj-sequences.txt
___________________________________________________________________
(page generated 2025-02-13 23:01 UTC)