[HN Gopher] The Invisible JavaScript Backdoor
       ___________________________________________________________________
        
       The Invisible JavaScript Backdoor
        
       Author : davidbarker
       Score  : 517 points
       Date   : 2021-11-10 04:04 UTC (18 hours ago)
        
 (HTM) web link (certitude.consulting)
 (TXT) w3m dump (certitude.consulting)
        
       | rafaelturk wrote:
       | Nice article, it was fun reading it. Albeit I don't think this
       | kind of attack is restricted to just JavaScript.
        
       | dlsa wrote:
       | Compilers and interpreters need a new pass to detect these
       | characters in code and treat them as hard errors. This doesn't
       | stop their use in comments where presumably they are still ok.
       | 
       | Alternatively, there needs to be an uptake of the use of code
       | linters and pretty printers.
       | 
       | A bit of both perhaps.
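
A rough sketch of the kind of detection pass described above, written as a standalone Python script rather than as part of any real compiler or linter. The function name `find_hidden_characters` and the short `INVISIBLE_LETTERS` list are assumptions made for illustration. The list is not exhaustive, and the check does not exempt comments or strings.

```python
import unicodedata

# Letters that render as blank space yet are legal in identifiers.
# This short list is an assumption, not an exhaustive inventory.
INVISIBLE_LETTERS = {"\u115F", "\u1160", "\u3164", "\uFFA0"}


def find_hidden_characters(source):
    """Return (line, column, code point) for each suspicious character."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # Category "Cf" covers zero-width and bidirectional controls.
            if ch in INVISIBLE_LETTERS or unicodedata.category(ch) == "Cf":
                findings.append((line_no, col, f"U+{ord(ch):04X}"))
    return findings


sample = "const { timeout, \u3164 } = req.query;\nlet ok = 1;\u200b\n"
for finding in find_hidden_characters(sample):
    print(finding)
```

Checking the general category alone would miss filler characters such as U+3164, which Unicode classifies as letters, hence the explicit list.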
        
       | speleding wrote:
       | I think the recommendation to disallow any non-ASCII character is
       | throwing out the baby with the bathwater.
       | 
       | How about code that wants to display some emojis? It would be
       | cumbersome to use hex unicode everywhere. And while localisations
       | should typically happen in a separate language file, it's very
       | common to want some text in code intended for a single audience.
       | 
       | Blocking all the confusables might be tricky, and an allow list
       | would be endless. Perhaps some magic pre-processor comment that
       | says "allow unicode in this file".
        
         | YetAnotherNick wrote:
         | Do you really have to write emoji in the code string? Similarly
         | with international language characters. The sane thing is to
         | use either json config files or i18n libraries.
        
           | speleding wrote:
           | If you are writing something intended for a single audience
           | using i18n libraries can be unnecessary overhead. And emoji
           | can also be icons like [?] that can be useful to display in
           | the UI.
        
         | josteink wrote:
         | > I think the recommendation to disallow any non-ASCII
         | character is throwing out the baby with the bathwater.
         | 
         | Not throwing out all non-ASCII characters from code-files. Just
         | throwing them out as being invalid _identifiers_ in your code
         | (think variables, function-names, etc).
         | 
         | > How about code that wants to display some emojis?
         | 
         | Fine. You quote that emoji in a string, and it's golden.
         | 
         | You try to make a variable with the name of an emoji however,
         | your code crashes.
         | 
         | That sounds fine to me.
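
The identifier-versus-string distinction can be sketched with Python's standard `tokenize` module, which reports identifiers and string literals as different token types. This illustrates the idea on Python source only; it is not how any JavaScript engine works, and `non_ascii_identifiers` is a hypothetical helper name.

```python
import io
import tokenize


def non_ascii_identifiers(source):
    """Return (line, name) for every identifier containing non-ASCII characters."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens are identifiers and keywords; string literals are STRING tokens.
        if tok.type == tokenize.NAME and not tok.string.isascii():
            hits.append((tok.start[0], tok.string))
    return hits


sample = 'π = 3.14\ngreeting = "héllo 🎉"\n'
print(non_ascii_identifiers(sample))
```

The emoji inside the string literal is left alone; only the `π` variable name is reported.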
        
           | speleding wrote:
           | That would close this particular attack (but not the BIDI one
           | the article mentions). But there is probably already too much
           | code out there with π=3.14 in it to be feasible to do this.
        
             | smcl wrote:
             | I really thought that using the greek letter for pi (or
             | theta, etc) was something you do to show your programming
             | language supports unicode identifiers but that nobody
             | actually does in real life. I wonder how people input this,
             | do they know the Alt+xyz combo, do they select-copy-paste
             | or is there another way to write these characters that
             | I'm not aware of?
             | 
             | Just to be clear, I don't mean people who are actually
             | using Greek language for input - it's pretty obvious how
             | they would type that character :)
        
             | josteink wrote:
             | > But there is probably already too much code out there
             | with π=3.14 in it to be feasible to do this.
             | 
             | So for JS let it break in new, module based strict-mode
             | code.
             | 
             | That's going to be processed by tooling prior to shipping
             | anyway, so that'll get caught.
             | 
             | For other platforms do the same. In some forward-looking
             | revision of the language/compiler.
             | 
             | People have to fix obsolete/deprecated stuff in newer
             | compilers/class libraries all the time. This is no
             | different.
        
       | est wrote:
       | IO operations, especially those involving a subprocess, are prone
       | to having backdoors. They are practically unverifiable.
       | 
       | I've yet to see a unicode backdoor in pure algorithmic flows.
        
         | willvarfar wrote:
         | Some examples of historic attacks you could embed in
         | algorithms:
         | 
         | "Salami slicing" is a kind of embezzlement where eg an insider
         | programs the computer to credit small amounts to the last
         | account (and then opens an account with a name beginning with
         | Z).
         | 
         | In the 90s there was a massive hushed up scandal where the
         | programmers developing the early Barclaycard made the pseudo
         | random number generator for pin codes just issue three distinct
         | pins. This meant that a stolen card could be easily used
         | because they could guess any pin in three goes before the ATM
         | swallowed the card.
         | 
         | This is hardly an exhaustive list. It's just to get people's
         | cogs turning... :)
        
           | null_object wrote:
           | > In the 90s there was a massive hushed up scandal where the
           | programmers developing the early Barclaycard made the pseudo
           | random number generator for pin codes just issue three
           | distinct pins. This meant that a stolen card could be easily
           | used because they could guess any pin in three goes before
           | the ATM swallowed the card.
           | 
           | Citation for this?
        
             | willvarfar wrote:
             | Took some digging to find any working links these days. The
             | three pin thing is on page two but it doesn't name which
             | bank; I may have misremembered and it might not have been
             | Barclays. The whole article is a good starting point for
             | digging into other vulnerabilities and exploits too
             | https://www.theregister.com/2005/10/21/phantoms_and_rogues/
        
       | More-nitors wrote:
       | maybe someone should make a linter for this...
        
         | pabs3 wrote:
         | There is one for Go called glyphcheck:
         | 
         | https://github.com/NebulousLabs/glyphcheck
        
         | trevinhofmann wrote:
         | Added to my own ESLint config:
         | https://github.com/trevinhofmann/eslint-config-principled/pu...
        
       | gostsamo wrote:
       | The benefit of being blind: the screen reader announces invisible
       | characters and I could detect the invisible variable.
        
         | infomax wrote:
         | T[?]h[?][?]i[?][?][?]s c[?]o[?][?]m[?][?][?]m[?][?][?][?]e[?][?
         | ][?][?][?]n[?][?][?][?][?][?]t s[?]h[?][?]o[?][?][?]u[?][?][?][
         | ?]l[?][?][?][?][?]d[?][?][?][?][?][?]n[?][?][?][?][?][?][?]'[?]
         | [?][?][?][?][?][?][?]t b[?]e e[?]a[?][?]s[?][?][?]y t[?]o
         | r[?]e[?][?]a[?][?][?]d b[?]y
         | s[?]c[?][?]r[?][?][?]e[?][?][?][?]e[?][?][?][?][?]n r[?]e[?][?]
         | a[?][?][?]d[?][?][?][?]e[?][?][?][?][?]r[?][?][?][?][?][?]s
        
           | gostsamo wrote:
           | yep, it is not.
        
           | geocar wrote:
           | It is difficult to _see_ on an iPhone, but it sounds fine in
           | VoiceOver.
        
           | mwcampbell wrote:
           | With NVDA on Windows, when I read the comment normally, it's
           | spelled out. When I read it character by character, I get
           | "symbol FFF8" for each of the hidden Unicode characters. And
           | when I move line by line through NVDA's linear representation
           | of the web page, the hidden characters count against the
           | length of the line for the purpose of word wrapping.
           | 
           | Narrator's behavior is weirder. If I turn on scan mode and
           | move onto the line with the up or down arrow key, Narrator
           | says nothing. If I read the current line with Insert+Up
           | Arrow, Narrator spells it out like NVDA does. When moving
           | character by character, Narrator says nothing for the hidden
           | Unicode characters. And because Narrator doesn't do its own
           | line wrapping but defers to the application to determine what
           | counts as a line, the text only counts as one line.
           | 
           | Disclosure: I used to work on the Windows accessibility team
           | at Microsoft, on Narrator among other things.
        
         | WesolyKubeczek wrote:
         | The benefit of being sighted is being able to use accessibility
         | features while also being sighted.
         | 
         | Take a peek at those technologies sometimes, those things
         | improve work comfort for everyone.
        
           | [deleted]
        
           | mwcampbell wrote:
           | Still, it would not occur to most sighted programmers to
           | review code using a screen reader. To me, this is another
           | argument for having a truly diverse team (or community, in
           | the case of an open-source project); a blind programmer who's
           | already involved with the project would catch something like
           | this. So in this particular case, blindness is truly not a
           | disability.
        
             | marginalia_nu wrote:
             | Being able to perceive BOM markers is tantamount to a
             | superpower in programming.
        
         | IceWreck wrote:
         | How hard is it to program while being blind? What sort of
         | development do you do? I understand that frontend is
         | impossible, but what other difficulties do you face?
         | 
         | Are indent based languages like Python harder than bracket
         | based languages?
        
           | gostsamo wrote:
           | Hi,
           | 
           | Front end is not entirely impossible, though pixel-perfect
           | design is. Otherwise, I know blind people who do FE, though
           | I'm not sure whether most of it is professional.
           | 
           | Indent based languages are actually easier. Every screen
           | reader has a way to announce indentation in code, while
           | brackets could be confusing if not formatted or verbose if
           | properly announced.
           | 
           | My main issues are dev tools with bad accessibility. Also, it
           | takes me more time to get acquainted with new code, and
           | sometimes homophones in the source code require extra
           | attention. Filtering through logs is also a bitch in most
           | cases. Besides the dev tools, you can summarize the rest as
           | bad IO speed.
        
             | MathCodeLove wrote:
             | I've been struggling with eye strain and have considered
             | trying to approach development in a fashion similar to that
             | taken by blind devs. Any suggestions for guides or
             | overviews for how I can get set up?
        
               | gostsamo wrote:
               | Hi,
               | 
               | it depends on what you are working on and what you want
               | to do. Generally, screen readers are not as good for
               | programming as they are for plain text stuff, so they will
               | be a limited substitute for whatever you are using now.
               | If you are okay with working slower, they can help you
               | listen through code and tool's messages providing relief
               | for your eyes.
               | 
               | If you are using Windows, NVDA is the screen reader. Jaws
               | is a bit too expensive for my taste without any
               | significant edge over NVDA. The builtin narrator is still
               | immature in my opinion. VSCode has excellent
               | accessibility with a dedicated and involved team. Visual
               | Studio also has extremely good accessibility support
               | though I'm not using it. IntelliJ sucks. Not completely,
               | but enough that people do not see the benefit of using
               | it. Eclipse is not popular these days, but it has good
               | accessibility as well as far as I know. Sublime is not
               | accessible.
               | 
               | If you are on Linux, the screen reader is Orca. It does
               | not have the same level of support as the Windows stuff,
               | but I know people who are developing on linux boxes so it
               | is doable. Emacs must be good enough because it has self-
               | voicing plugin and people who like and use it. As far as
               | I know, VSCode for Linux has some accessibility features
               | but I don't know how they compare to Windows.
               | 
               | If you are on Mac, your only choice is VoiceOver by
               | Apple as screen reader. It is good but not always perfect
               | to my knowledge. I know people who use TextMate, Xcode,
               | VSCode, and Emacs, but I don't have much feedback from
               | there. It is totally doable though.
               | 
               | On Windows, I'm also using notepad++ as secondary editor
               | because it is faster and works better for large files.
               | Also, it is a good notetaking tool.
               | 
               | We can connect offline if you need some more info.
        
             | mrlemke wrote:
             | I am very interested in how blind developers work. I have
             | been pondering how to make computers and development more
             | accessible. If you don't mind:
             | 
             | Do you have preference between CLI, TUI, or GUI dev tools?
             | 
             | Is highly symbolic code harder to understand using a screen
             | reader than plain language code? By symbolic, I
             | specifically mean any characters that are not alphanumeric.
        
               | gostsamo wrote:
               | Hi,
               | 
               | I don't have preferences on the interface. As far as it
               | is accessible, I can learn to work with it. E.g. the VSCode
               | team does everything possible to make their interface
               | accessible and they are continuously fixing any reported
               | issues.
               | 
               | When it comes to code, verbose is better. Abbreviations
               | take effort to decode. I can remap some symbols to have
               | different pronunciations, but it does not always work.
               | E.g. I've made the screen reader speak the ":=" operator in
               | python as "assigned from", but brackets have nesting and
               | orientation, and too many of them get nasty to listen to
               | or follow.
        
               | mrlemke wrote:
               | Thanks for answering. What is your favorite programming
               | language to work in? If you could use any language you
               | wanted, what would be your top pick?
        
               | gostsamo wrote:
               | Well, this is highly subjective. I'm paid to do python
               | and node js from time to time and python really rocks for
               | me. Not a small reason why I like python more is for the
               | much better tracebacks. When looked in a console, it is
               | much more pleasant to have the erroring line at the
               | bottom which spares me copying the entire console in npp
               | in trying to find the top of it.
               | 
               | That said, I know many blind devs who do java, c#, swift,
               | c++ and so on. I had bad experiences with ide-s when I
               | was starting to study software development on those
               | languages and it has stayed with me, but it is not
               | universal.
               | 
               | If I had the choice, I would not drop python, but I might
               | add some of the functional languages or rust for the new
               | ways of thinking they might teach me. So far, I've looked
               | at them, but I haven't done anything serious there.
        
               | mrlemke wrote:
               | Interesting, thanks for sharing!
        
             | akavel wrote:
             | Do you have some tricks for how you handle filtering
             | through logs? Or some ideas if there could be a tool that
             | could help you or mitigate your most critical issue[s]?
             | 
             | I found filtering through logs a major pain even for a
             | fully sighted person like me, so I wrote a tool to help me
             | with that, but it's fully in a "TUI" paradigm (i.e. curses-
             | like), so I presume it wouldn't help you much
             | (https://github.com/akavel/up). No promises, given that the
             | tool as is scratched my itch, but I am honestly curious if
             | something similar could reduce your PITA, including whether
             | this specific tool could be made useful for you through
             | some minimal effort on my side.
        
               | gostsamo wrote:
               | Hi,
               | 
               | usually grep saves the day. I will check your tool, but
               | what I need is for a terminal command that can recognize
               | the meta fields from a log record and put them on a line
               | separated from the main message. Also, it must be
               | installed everywhere I work, which is not so easy.
               | Putting logs in a table with filtering capabilities might
               | be best, but this means web access to the location of the
               | logs which is again tricky.
        
               | ryanianian wrote:
               | > what I need is for a terminal command that can
               | recognize the meta fields from a log record and put them
               | on a line separated from the main message
               | 
               | Isn't this the exact use-case of structured logging?
               | 
               | Log events have {timestamp, log level, log category, string
               | message, ...arbitrary key/value pairs}
               | 
               | Usually serializing each message as a single json line in
               | a file.
               | 
               | Since it's all on one line you can still use grep, but
               | then since it's machine-readable you can pipe the grep to
               | anything that can parse json. Vanilla python3 works and
               | tends to be a part of most ops toolkits. Such tooling can
               | split out the fields onto other lines etc or in a more
               | reader-friendly format.
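
A minimal sketch of that last step, assuming one JSON object per log line. The field names in `META_FIELDS` and the helper `format_record` are made up for illustration; real logs may use different keys.

```python
import json

# Field names below are assumptions; real logs may use different keys.
META_FIELDS = ("timestamp", "level", "category")


def format_record(line):
    """Turn one JSON log line into a meta line followed by a message line."""
    record = json.loads(line)
    meta = ", ".join(f"{key}={record[key]}" for key in META_FIELDS if key in record)
    extras = {k: v for k, v in record.items() if k not in META_FIELDS and k != "message"}
    lines = [meta, record.get("message", "")]
    if extras:
        # Remaining key/value pairs go on their own line.
        lines.append(json.dumps(extras, sort_keys=True))
    return "\n".join(lines)


sample = (
    '{"timestamp": "2021-11-10T04:04:00Z", "level": "ERROR", '
    '"category": "db", "message": "connection lost", "retries": 3}'
)
print(format_record(sample))
```

To use this behind grep, one could loop over `sys.stdin` and call `format_record` on each line, since only the standard library is needed.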
        
               | gostsamo wrote:
               | Yes, this has been my idea in many cases, but it is not
               | always that I have a say over the logging format.
        
             | IceWreck wrote:
             | Hey, that's cool. Thank you.
        
         | threatripper wrote:
         | So, it's a backdoor that only the blind can see?
        
         | mwcampbell wrote:
         | The next time someone tries to tell me that a true screen
         | reader should use computer vision and machine learning
         | (including OCR) rather than requiring applications to implement
         | accessibility APIs, I will bring up this case.
        
           | SilasX wrote:
           | HN exchange:
           | 
           | "Why can't we just, you know, direct blind users to a special
           | protocol that structures the data appropriately and then lets
           | them parse it however they want?"
           | 
           | Me: 'We did! It's called HTML! Designers just broke it!'
           | 
           | https://news.ycombinator.com/item?id=20224961
        
             | mwcampbell wrote:
             | IMO, HTML is still closer to that ideal than anything else
             | we have. My guess is that given a random web application
             | and a random non-web GUI (especially if the latter is
             | multi-platform), the web application will be more usable
             | with a screen reader.
        
               | joquarky wrote:
               | And now many people are excited about throwing all of
               | that away with canvas and web assembly.
        
               | skyde wrote:
               | right! But we would not need to use canvas if updating
               | the DOM was not super slow.
               | 
               | I suspect the #1 reason is the layout/reflow engine, but I
               | might be wrong. Game engines do run physics at 60fps, which
               | is harder than CSS reflow.
        
               | slaymaker1907 wrote:
               | I'd say markdown is even better than HTML for writing
               | generic documents since it enforces simplicity. In
               | particular, it forces a linear flow of the document and
               | does not have any support for stuff like JS.
        
             | HeavyStorm wrote:
             | Html could have been that - or better, it was at first -
             | but instead of creating a more specialized solution for
             | running rich apps we decided to exploit html.
             | 
             | Right now we are in what I'd call the worst of both worlds,
             | because we rely on html to do things it wasn't designed to,
             | and there's no longer purity in any html out in the wild.
        
           | gostsamo wrote:
           | Yep, together with the ml screen reading, they do not offer
           | subsidized infinite battery life and machine learning
           | hardware for the inferring model.
        
       | robertrbairdii wrote:
       | There's definitely a benefit to using a linter and a tool such as
       | prettier. Using prettier pushes the hidden character onto an
       | additional line in the checkCommands array which makes it much
       | easier to spot that something is wrong even if you're not using
       | the trailingCommas setting.
       | 
       | https://imgur.com/a/gYKylyH
       | 
       | I think this eslint rule would also be able to defend against the
       | initial destructuring of the query object by defining a regex
       | that identifiers have to match which would exclude those
       | invisible characters https://eslint.org/docs/rules/id-match
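
To show what such an identifier pattern would accept and reject, here is a small Python check. The pattern itself is an assumption (ASCII letters, digits, `_` and `$`); an equivalent anchored pattern could be passed to the `id-match` rule as its regex string, but the exact policy is up to the project.

```python
import re

# Assumed policy: identifiers are limited to ASCII letters, digits, "_" and "$".
ASCII_IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def is_allowed_identifier(name):
    """True when the whole identifier matches the ASCII-only pattern."""
    return ASCII_IDENTIFIER.fullmatch(name) is not None


# The last two names contain an invisible filler and a Cyrillic "a".
for name in ["timeout", "checkCommands", "\u3164", "p\u0430ssword"]:
    print(repr(name), is_allowed_identifier(name))
```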
        
       | sihox wrote:
       | Just being curious I've pasted the example to Geany and VSCode
       | and in both this invisible character was visible :) I can't
       | remember setting some special character / whitespace visibility
       | options but I think it is good to have this kind of options
       | always on.
        
       | Eriks wrote:
       | good reason to not use comma when destructuring an object
        
         | smhg wrote:
         | You mean the last trailing comma? Or not destructuring into
         | multiple variables?
        
       | jabbany wrote:
       | IIRC Rust has some compiler-level defenses against these glyph
       | based attacks (ref:
       | https://twitter.com/skyslasher11/status/1152824207555698688)
       | 
       | Perhaps one could do something similar in JS as well. Like have a
       | config that will make an interpreter fail if it encounters
       | unescaped unicode in variable names. It does not prevent any
       | unicode variable names, but you just have to escape them if they
       | are from some list of "abusable characters".
       | 
       | (At least Chrome seems to be happy with `var \u6D4B\u8BD5 = 1;`)
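
A sketch of a helper that produces that escaped form, written in Python as a source-rewriting aid. `escape_identifier` is a hypothetical name, and this version escapes every non-ASCII character rather than consulting a list of "abusable characters".

```python
def escape_identifier(name):
    """Rewrite non-ASCII characters as JavaScript Unicode escapes."""
    parts = []
    for ch in name:
        code = ord(ch)
        if code < 0x80:
            parts.append(ch)
        elif code <= 0xFFFF:
            parts.append(f"\\u{code:04X}")
        else:
            # Code points beyond the BMP need the braced ES2015 form.
            parts.append(f"\\u{{{code:X}}}")
    return "".join(parts)


print(escape_identifier("\u6d4b\u8bd5"))
print(escape_identifier("timeout\u3164"))
```

Once escaped, an invisible character in a name shows up as plain ASCII text in any editor or diff.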
        
         | goldsteinq wrote:
         | You can just do `#![forbid(non_ascii_idents)]` in Rust. It'll
         | prevent this kind of attack completely and you shouldn't need
         | non-ASCII idents anyway.
        
           | Tepix wrote:
           | That seems like throwing out the baby with the bathwater. We
           | don't all want to go back to the IT stone age of 1963.
        
             | goldsteinq wrote:
             | You still can use Unicode in comments and string literals.
             | You just can't use non-ASCII characters in identifiers.
             | 
             | Unicode in identifiers is just a bad idea.
             | 
             | 1. It creates a security consideration with confusable
             | identifiers (and lints don't always catch these)
             | 
             | 2. It breaks tooling with RTL identifiers
             | 
             | 3. It may not render correctly depending on fonts
             | 
             | 4. It may be hard to type depending on keyboard layout
             | 
             | 5. There really isn't a good reason to use non-ASCII idents
             | anyway
        
               | lifthrasiir wrote:
               | > 1. It creates a security consideration with confusable
               | identifiers (and lints don't always catch these)
               | 
               | O/0 and I/1/l are confusable characters within ASCII. I'm
               | not kidding here, they are actual entries in the Unicode
               | confusables database [1]. But no one wants to remove
               | those characters from identifiers.
               | 
               | [1] For example, https://util.unicode.org/UnicodeJsps/con
               | fusables.jsp?a=0&r=N...
               | 
               | > 2. It breaks tooling with RTL identifiers
               | 
               | It rather unbreaks tooling with no RTL support.
               | 
               | > 3. It may not render correctly depending on fonts
               | 
               | So does Unicode in comments and string literals. In fact
               | the purported Trojan "attack" was mostly about string
               | literals. So why should they be allowed in strings but
               | disallowed in identifiers?
               | 
               | > 4. It may be hard to type depending on keyboard layout
               | 
               | Did you know that not every Latin keyboard layout
               | supports a backquote (`)? This was the actual reason that
               | the repr(expr) shortcut got removed from Python 3 [2].
               | 
               | [2] https://mail.python.org/pipermail/python-
               | ideas/2007-January/...
               | 
               | > 5. There really isn't a good reason to use non-ASCII
               | idents anyway
               | 
               | My canonical answer from experience is that not every
               | programmer who can understand English documentation can
               | easily write and comprehend English in general. For those
               | people having a non-ASCII identifier support is a great
               | relief, as it frees them from choosing "correct" English
               | identifiers. You can disallow them for your project if
               | you want (or conversely, make it an optional feature
               | disabled by default), but they are relevant for someone
               | else.
        
               | cedilla wrote:
               | > it frees them from choosing "correct" English
               | identifiers
               | 
               | Even if you have fluent English skills, sometimes
               | translations just confuse the issue. It's sometimes
               | better to use an untranslated word instead of introducing
               | ambiguity, especially when a term originates from a local
               | law.
        
               | lifthrasiir wrote:
               | Like CNLabelContactRelationYoungerCousinMothersSiblingsDa
               | ughterOrFathersSistersDaughter [1]? :-) You are very much
               | correct.
               | 
               | [1] https://news.ycombinator.com/item?id=28712667
        
               | josephcsible wrote:
               | > O/0 and I/1/l are confusable characters within ASCII.
               | 
               | You're mixing up two different ways that people use the
               | word "confusable": things that look similar in some
               | fonts, versus things that look exactly the same
               | regardless of font. I want the latter to be banned from
               | source files but not the former.
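
One cheap heuristic for the "looks exactly the same" case is to flag identifiers that mix scripts, which leaves O/0 and I/1/l alone. The sketch below guesses the script from the first word of each character's Unicode name. That is an approximation chosen for brevity, not the Unicode confusables algorithm, and it would miss names written entirely in a look-alike script. Both function names are hypothetical.

```python
import unicodedata


def letter_scripts(identifier):
    """Approximate the scripts used, taken from the first word of each character name."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(identifier):
    """True when the letters of the identifier come from more than one script."""
    return len(letter_scripts(identifier)) > 1


print(is_mixed_script("paypal"))
print(is_mixed_script("p\u0430ypal"))  # second letter is Cyrillic
```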
        
               | mkl wrote:
               | Non-ASCII identifiers can be useful for maths too. E.g. I
               | use λ sometimes, especially in Python where "lambda" is a
               | keyword. (I have AutoHotKey and Espanso hotstrings to
               | make typing such symbols easy.)
        
               | __s wrote:
               | > O/0 and I/1/l are confusable characters within ASCII
               | 
               | Which is why the first thing I make sure of when looking
               | at programming fonts is how well they differentiate these
               | characters
        
               | koheripbal wrote:
               | Agreed - this is literally the first thing I check when
               | selecting the editor's font
        
             | lifthrasiir wrote:
             | `#![forbid(...)]` is a crate-wide attribute, so it is more
             | like a policy (that is good to have if your code would be
             | entirely ASCII).
        
             | Ygg2 wrote:
             | Agreed. I use Unicode identifiers to spot shitcode. This
             | would really hamper my detection abilities.
             | 
             | Jokes aside, if you're writing Unicode identifiers it means
             | you're not writing your code to be read by a broad
             | audience.
        
             | CodesInChaos wrote:
             | The compiler disallowing them globally might count as that.
             | But individual crates enforcing an "ascii only" policy
             | makes sense, if they never plan to use non-ascii.
             | 
             | Personally I'd prefer even one step further: the compiler
             | would disallow them by default, and you can opt into
             | specific character sets/languages at a crate level. e.g.
             | `AllowSpecialCharacters("de")` to enable only special
             | characters common in German.
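
A sketch of how such an opt-in could behave, in Python. Everything here is hypothetical, including the `EXTRA_LETTERS` table and the `disallowed_characters` helper. No compiler is known to offer this switch, and a real list of German letters would need more care.

```python
# Hypothetical per-project allow-lists, keyed by language code.
EXTRA_LETTERS = {
    "de": set("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df"),  # ä ö ü Ä Ö Ü ß
}


def disallowed_characters(identifier, enabled=()):
    """Return the non-ASCII characters not covered by the enabled languages."""
    allowed = set()
    for lang in enabled:
        allowed |= EXTRA_LETTERS.get(lang, set())
    return [ch for ch in identifier if not ch.isascii() and ch not in allowed]


print(disallowed_characters("gr\u00f6\u00dfe", enabled=("de",)))
print(disallowed_characters("gr\u00f6\u00dfe"))
print(disallowed_characters("gr\u3164e", enabled=("de",)))
```

With "de" enabled the German identifier passes, while an invisible filler is still rejected because it is on nobody's allow-list.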
        
           | UncleMeat wrote:
           | > It'll prevent this kind of attack completely
           | 
           | It won't. The same approach works just fine in your build
           | specification or other config files. And it doesn't solve the
           | root of this problem, which is that you are compiling source
           | code you don't control and don't audit closely into your
           | binary. Sneaky text is not the only way of getting malicious
           | code through code review.
        
           | matheusmoreira wrote:
           | > you shouldn't need non-ASCII idents anyway
           | 
           | Yes, we do. People from all over the world write software
           | too. They should be able to use the words they know in code.
           | 
           | Also, it's totally cool to have mathematical symbols in code.
           | λ, for example. Much more readable than the word lambda. The
           | only reason these symbols are hard to type is our keyboards
           | suck. They can be made easy to type with editor support
           | though.
        
             | tytso wrote:
             | Good code is maintainable code. And while you, as the
             | original programmer, might be perfectly comfortable writing
             | your code using Arabic variables and comments, what if the
             | next person who has to maintain the code is from Korea? Or
             | Russia? Or France? Or China?
             | 
             | OK, maybe you're a small startup in Taiwan and so you don't
             | care about the next maintainer in your company not being
             | able to read or write Chinese. What if you decide to open
             | source your code? Or Meta decides to offer you a zillion
             | dollars to buy you out, but after they do their due
             | diligence, realize that the code is utterly unmaintainable
             | should they decide to outsource internationalizing the code
             | so it will work in Brazil, so that requires native
             | Portuguese speakers (who can preferably be paid low, low
             | wages) --- but they can't understand the code because it's
             | using Chinese variables and comments. And then Meta decides
             | to back out from the deal?
        
               | matheusmoreira wrote:
               | If you're likely to work with an international team, it
               | makes sense to use english. That's not always the case
               | though. Plenty of those low-paid brazilian programmers
               | you cited will never do that. Many of them don't speak
               | english to begin with.
               | 
               | For example, the school I went to had a simple web
               | application for student feedback. Attachments were
               | allowed. People started running into issues due to non-
               | ASCII characters in file names. I reported the issue to
               | the IT department and even helped them fix it. The Python
               | code was written in portuguese, accents and everything.
               | Why shouldn't accents be used in this case? It's unlikely
               | this code will ever be used in an international context.
        
             | jimmaswell wrote:
             | ASCII is the standard for code for good reason. Everyone
             | can type it. Put whatever you want in comments, but you
             | shouldn't make people have to copy/paste your variable
             | names.
        
               | matheusmoreira wrote:
               | > you shouldn't make people have to copy/paste your
               | variable names
               | 
               | The people working on a non-english codebase don't have
               | to. Their keyboards have the symbols they're typing.
        
               | jimmaswell wrote:
               | You're assuming no international collaboration.
        
             | antris wrote:
             | >Yes, we do. People from all over the world write software
             | too. They should be able to use the words they know in code
             | 
             | My native language has non-ASCII characters and I do not
             | expect nor do I want to be able to type them outside string
             | literals. Specifically for the reasons stated in the blog
             | post, among others. Writing in my native language is far,
             | far down in the list of priorities as a professional coder,
             | when security / compatibility are there too. Suggesting
             | that non-native English speakers have to be able to code in
             | their native language also would suggest that non-native
             | coders do not take security / compatibility seriously,
             | which would mean that they are unprofessional. I'm pretty
             | sure that it's not your intention to suggest that, but
             | that's kind of how it comes across. With all the problems
             | eliminated by the use of English and ASCII, it would strike
             | me as amateurish to not use English and ASCII wherever
             | possible.
        
               | matheusmoreira wrote:
               | > non-native coders do not take security / compatibility
               | seriously
               | 
               | That's not what I said at all. I don't see how you came
               | to this conclusion.
               | 
               | > With all the problems eliminated by the use of English
               | and ASCII, it would strike me as amateurish to not use
               | English and ASCII wherever possible.
               | 
               | Not everybody speaks english. I've taught programming to
               | quite a few people and they all attempted to use normal
               | characters while writing code. There's absolutely no
               | reason why that shouldn't work. I don't see how
               | characters like ç or ã or ú could possibly cause security
               | issues. Go ahead and ban the invisible unicode stuff but
               | there's absolutely no reason why these common letters
               | shouldn't work.
        
               | antris wrote:
               | It is funny that you are using the existence of a segment
               | of the population that I am a part of, to make your claim
               | but aren't willing to listen when a member of the segment
               | is trying to explain how non-ASCII characters and coding
               | do not mix well.
               | 
               | Sure, you could make a fix for this specific case, but
               | the problem mentioned in the blog post is not even close
               | to the only problem of non-ASCII characters. In _theory_,
               | yes, we could make a language and a full suite of
               | tooling that would play nice with non-ASCII characters.
               | But it's not like the whole non English speaking world is
               | waiting for this to happen. People code in English even
               | in teams where everyone speaks Finnish. Nobody even
               | questions it, because it's so obvious that all code
               | should be in English and ASCII. Everyone has shot their
               | foot, putting in non-ASCII characters in the source code
               | at some point of their career, if they have ever dared to
               | try. That's how the reality is, and at the same time I
               | hear people saying that the existence of those Finnish
               | programmers means we _have to_ have Unicode in source
               | code.
               | 
               | >That's not what I said at all. I don't see how you came
               | to this conclusion.
               | 
               | I didn't say you said it. I said that's how it (probably
               | accidentally) comes across when you talk about something
               | so carelessly. Non English speakers care about
               | compatibility and security and take those seriously,
               | therefore we pretty much always write code in English and
               | ASCII.
        
               | matheusmoreira wrote:
               | > It is funny that you are using the existence of a
               | segment of the population that I am a part of, to make
               | your claim but aren't willing to listen when a member of
               | the segment is trying to explain how non-ASCII characters
               | and coding do not mix well.
               | 
               | Why is it funny? I'm also a member of that group. English
               | is not my native language.
               | 
               | > But it's not like the whole non English speaking world
               | is waiting for this to happen.
               | 
               | I don't think we should have to wait for this to happen.
               | In many ways, it's already happened: most modern
               | languages already support unicode symbols.
               | 
               | > People code in English even in teams where everyone
               | speaks Finnish. Nobody even questions it, because it's so
               | obvious that all code should be in English and ASCII.
               | 
               | Relatively few people speak english in my country. I have
               | only a few friends who do. A whole team of people writing
               | code in english just doesn't seem likely where I live. I
               | actually tried writing english code in such a context
               | once, the result was a mixed language mess that I quickly
               | reverted back to my native language. Unicode support is
               | great because it makes the non-english code much more
               | readable.
               | 
               | Europeans in general seem to know english very well. This
               | is _not_ the case everywhere. Somehow making english a
               | requirement for programming just doesn't sound fair to
               | me.
        
               | capitainenemo wrote:
               | I brought this up last time
               | (https://news.ycombinator.com/item?id=29066760) but:
               | 
               | https://github.com/reinderien/mimic
               | 
               | It applies to other contexts besides code. For our user
               | table we have a MariaDB collation based on the Unicode
               | confusables list, which prevents confusable usernames
               | (a confusable name is treated as already existing).
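
A rough JavaScript sketch of the username check described above. Everything here is an assumption made for illustration: the `CONFUSABLES` table is a tiny hand-picked subset, not the Unicode confusables data, and `skeleton` and `isUsernameTaken` are hypothetical names. The comment does this in a database collation; this only shows the idea of comparing names by a folded key.

```javascript
// Tiny, hand-picked subset for illustration only (NOT the real Unicode
// confusables data): maps look-alike characters to a Latin prototype.
const CONFUSABLES = new Map([
  ['\u0430', 'a'], // CYRILLIC SMALL LETTER A
  ['\u0435', 'e'], // CYRILLIC SMALL LETTER IE
  ['\u043E', 'o'], // CYRILLIC SMALL LETTER O
  ['\u0440', 'p'], // CYRILLIC SMALL LETTER ER
  ['\u03BF', 'o'], // GREEK SMALL LETTER OMICRON
]);

// Reduce a name to a comparison key: compatibility-normalize, lowercase,
// then replace every known look-alike with its prototype.
function skeleton(name) {
  const normalized = name.normalize('NFKC').toLowerCase();
  return Array.from(normalized, ch => CONFUSABLES.get(ch) ?? ch).join('');
}

// A candidate counts as taken if any existing name folds to the same key.
function isUsernameTaken(candidate, existingNames) {
  const key = skeleton(candidate);
  return existingNames.some(name => skeleton(name) === key);
}
```

With this table, `isUsernameTaken('p\u0430ypal', ['paypal'])` is true because the Cyrillic letter folds to Latin `a`.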
        
       | wnevets wrote:
       | If you use Sublime Text the Gremlins[0] package will detect and
       | light up these kinds of characters
       | 
       | [0] https://packagecontrol.io/packages/Gremlins
        
       | Mockapapella wrote:
       | If anyone is interested, I wrote an article a while back
       | exploring which unicode characters Python allows you to set
       | variables equal to: https://www.thelisowe.com/why-can-be-a-
       | variable-in-python-bu...
       | 
       | This was originally done with the goal of trying to hide/encode
       | one program within another using non-displayable characters (such
       | as zero width spaces), I just never got around to it. But reading
       | this article has kind of reignited that interest for me and I
       | think I might take another crack at that soon.
        
       | geoduck14 wrote:
       | This is MOST interesting.
       | 
       | I wonder if Git or Stack Overflow should highlight non Ascii
       | characters to reduce malicious actors using this in code.
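
A minimal sketch of what such highlighting would need underneath: a scan that reports every non-ASCII code point with its position. `findNonAscii` is a hypothetical helper, not something Git or Stack Overflow provides, and it makes no attempt to distinguish code from strings or comments.

```javascript
// Report every non-ASCII code point in a piece of source text, with its
// 1-based line and column, so a reviewer or tool can highlight it.
function findNonAscii(source) {
  const findings = [];
  source.split('\n').forEach((line, lineIndex) => {
    let column = 1;
    for (const ch of line) { // iterates by code point, not UTF-16 unit
      const codePoint = ch.codePointAt(0);
      if (codePoint > 0x7f) {
        findings.push({
          line: lineIndex + 1,
          column,
          codePoint: 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0'),
        });
      }
      column += 1;
    }
  });
  return findings;
}
```

Columns are counted in code points, so an editor that counts UTF-16 units may show a different column for characters outside the Basic Multilingual Plane.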
        
       | lovasoa wrote:
       | The `cmd &&` looks fishy in their example, and would probably
       | have been removed in a review. Instead, one could write:
       | 
       |     const { ping, curl,  } = req.query;
       |     const checkCommands = [
       |         ping && 'ping -c 1 google.com',
       |         curl && 'curl -s http://example.com/',
       |     ];
       | 
       |     await Promise.all(checkCommands.map(cmd => cmd && exec(cmd, { timeout: 5_000 })));
       | 
       | This way the `cmd &&` is justified
        
         | testASW2 wrote:
         | d
        
       | kuon wrote:
       | My editor (vim) will warn me with a loud visual red block for any
       | non ascii char outside a string literal. But I do not think that
       | is enough. Compiler and interpreter must be more strict.
        
         | jrochkind1 wrote:
         | I'm surprised VS Code doesn't at least have that option. (Or
         | does it?)
        
           | myfonj wrote:
           | It has `editor.renderControlCharacters`, but it only recently
           | started displaying a few dangerous, previously invisible ones
           | (directional overrides) natively [1]. Beyond that, you have to
           | use an extension that highlights non-ASCII characters that are
           | not whitelisted [2] or that are on a predefined list [3].
           | 
           | [1] https://github.com/microsoft/vscode/issues/116939
           | [2] https://marketplace.visualstudio.com/items?itemName=nachocab...
           | [3] https://marketplace.visualstudio.com/items?itemName=nhoizey....
        
         | tomxor wrote:
         | Mind sharing the relevant config line?
         | 
         | I thought this was default but just realised it only does the
         | <FFFF> thing when there is no printable glyph available.
         | 
         | Allowing printable unicode in strings seems like a nice balance
         | if it can be done reliably.
        
           | fatheart wrote:
           | After seeing this thread I added the following to my vimrc:
           | 
           |     highlight link NonASCII Error
           |     autocmd Syntax * :syntax match NonASCII "[^\d0-\d127]"
           | 
           | Obviously haven't been using it long, and I'm not confident
           | enough in my vim knowledge to vouch for its correctness, but
           | it works in the limited number of scenarios I tested so far.
        
         | gpvos wrote:
         | That's not the default vim configuration though.
        
       | aww_dang wrote:
       | Yes, the Unicode characters are a problem. But do the norms and
       | tooling play a role here as well?
       | 
       | Explicitly casting types, like String parameters to integers,
       | would make this much more explicit. So would giving up the
       | convenience of accessing parameters via destructuring in favor of
       | an explicit request.getParameter("\u3164"), or having a static
       | array of permissible commands declared elsewhere.
       | 
       | There's something to be said for verbosity and explicitness.
       | Where the tooling and norms shun it, these 'invisible' backdoors
       | can gain advantage.
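
One way to read the "static array of permissible commands" idea, sketched in JavaScript. `ALLOWED_CHECKS` and `selectCommands` are hypothetical names, and the two command strings are simply the ones used elsewhere in this thread. The sketch only closes the request-parameter route; it would not stop a malicious commit from adding an invisible key to the allow list itself.

```javascript
// Hypothetical allow list, declared once and frozen. Keys are the only
// query parameter names the handler will ever look at.
const ALLOWED_CHECKS = Object.freeze({
  ping: 'ping -c 1 google.com',
  curl: 'curl -s http://example.com/',
});

// Return the commands to run for a parsed query object. Parameters that are
// not in the allow list (including invisibly named ones) are ignored, and
// no part of a command string is ever taken from the request.
function selectCommands(query) {
  return Object.keys(ALLOWED_CHECKS)
    .filter(name =>
      Object.prototype.hasOwnProperty.call(query, name) && Boolean(query[name]))
    .map(name => ALLOWED_CHECKS[name]);
}
```

Because the lookup is by explicit ASCII key, a request carrying a `\u3164` parameter contributes nothing to the result.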
        
       | laktak wrote:
       | This shows up in standard Vim as a [HF] symbol.
        
         | kreetx wrote:
         | Same in vanilla emacs.
        
       | josteink wrote:
       | Listen guys, don't get me wrong. As someone with Ø in my name,
       | and both Å and Ø in my address, don't get me started on poorly
       | written systems which cannot handle unicode properly. I've seen
       | my name and address mangled in shipping forms, in airline tickets
       | (every time) and even in my marriage-papers since I married
       | abroad.
       | 
       | I literally have _personal_ reasons for getting everyone, and I
       | mean everyone, on the unicode bandwagon.
       | 
       | That said... Maybe it's because I'm a child of the late 70s and
       | early 80s and learned to program on computers which simply didn't
       | have non-ASCII characters at all...
       | 
       | But can't we all just sit down and admit that allowing non-ASCII
       | characters in programming-language identifiers was a bad idea?
       | Can't we in the next revision of EcmaScript (or Rust, or
       | whatever) mandate ASCII-only identifiers when in strict mode or
       | using modules or whatever? Having _invisible characters_
       | represent executable code is not just a dumb idea, it's so
       | hazardous that you might call it borderline malicious.
       | 
       | There _has_ to be some way to undo this damage, without breaking
       | compatibility with the code which is already out there, right?
        
         | rtoway wrote:
         | Rust has a lint against this kind of attack + you can
         | explicitly disable non-ASCII identifiers if you really want to
        
           | est31 wrote:
           | Ideally that lint would be on by default though. Most code
           | doesn't use non-ASCII identifiers. It's not happened though
           | because of uhm. political reasons.
        
             | rtoway wrote:
             | The lint is on by default in the latest version of the
             | compiler
        
             | drran wrote:
             | Most code made by English speakers contains English words
             | and Latin characters, so other languages and alphabets must
             | be abandoned, and their native speakers must be imprisoned
             | until they understand their mistakes.
        
               | toastal wrote:
               | Abugidas and logographies banished
        
               | drran wrote:
               | OK, OK, we can start with a warning in the compiler that
               | use of any language except English is unsafe.
        
         | auggierose wrote:
         | Yeah, let's just switch to Cosmopolitan Identifiers:
         | https://obua.com/publications/cosmo-id/3/ :-)
         | 
         | But yeah, it would break existing code, sorry.
        
         | Dagonfly wrote:
         | Adding a variable decorator/annotation like
         | @Unicode(german,french) would be a good stop-gap. You could
         | only use ASCII characters unless you specified the script that
         | you want to use. One could even set a max limit on how many
         | scripts per variable. Because while I have used German
         | characters in variables before (only if I'm referring to some
         | law or spec), I never had a use case for more than 2 scripts
         | within one variable.
        
           | silvestrov wrote:
           | I think this is a good idea because once in a while you need
           | to write non-ascii characters in names.
           | 
           | This mostly comes up when implementing tax rules or
           | government administrative divisions as some countries have
           | names/concepts which have no good translation into English,
           | so you are left with using the non-English name, which often
           | contains non-ASCII characters.
        
           | est31 wrote:
           | The multiple scripts per variable thing is implemented in
           | Rust via a lint. For the explicit enabling of single scripts,
           | I have suggested that for Rust, but sadly people preferred
           | allowing all identifiers (while giving an option to only have
           | ascii but I'd argue this is unfair for anyone who only wants
           | to use a specific non-ascii language, why do they have to
           | suddenly allow _all_ languages in their code base?). There
           | are also practical concerns, like who says what a language
           | is, which characters it contains, how that language is
           | called, etc? Someone has to maintain all these lists.
        
             | chrismorgan wrote:
             | > _who says what a language is, which characters it
             | contains, how that language is called, etc?_
             | 
             | The Unicode Consortium already maintains all of that data
             | in the CLDR (Common Locale Data Repository).
        
           | lifthrasiir wrote:
           | For your information the relevant Unicode specification is
           | the Script_Extensions property [1]. (You can't easily filter
           | by languages, so you should filter by scripts.)
           | 
           | [1] https://www.unicode.org/reports/tr24/tr24-32.html#Script_Ext...
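
Filtering by script can be sketched with JavaScript's Unicode property escapes, which expose `Script_Extensions` directly. The script list, `scriptsUsed`, and `exceedsScriptLimit` are all assumptions for illustration; a real tool would cover every script it cares about and decide how to treat characters that belong to none of them.

```javascript
// Scripts to test for. This list is an arbitrary choice for the sketch.
const SCRIPTS = ['Latin', 'Greek', 'Cyrillic', 'Hangul', 'Han'];

// One regular expression per script, built on the Script_Extensions property.
const SCRIPT_PATTERNS = SCRIPTS.map(script => ({
  script,
  pattern: new RegExp(`\\p{Script_Extensions=${script}}`, 'u'),
}));

// List which of the scripts above occur somewhere in the identifier.
function scriptsUsed(identifier) {
  return SCRIPT_PATTERNS
    .filter(({ pattern }) => pattern.test(identifier))
    .map(({ script }) => script);
}

// True if the identifier mixes more scripts than the chosen limit allows.
function exceedsScriptLimit(identifier, maxScripts = 1) {
  return scriptsUsed(identifier).length > maxScripts;
}
```

An identifier such as `größe` uses only Latin, while `p\u0430ypal` (with a Cyrillic `а`) reports both Latin and Cyrillic and would exceed a limit of one.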
        
         | jillesvangurp wrote:
         | The issue with this is less that this is possible and more that
         | a lot of javascript ends up in production without ever getting
         | compiled, linted, type-checked, etc. Stuff like this is
         | designed to bypass what little human oversight there is to
         | prevent bad things from happening. What is actually visible
         | also depends on what fonts you have installed on your system.
         | So, it's less clear cut than you think.
         | 
         | The problem is not so much that humans can't see this but that
         | they are not looking very hard to begin with (otherwise, they'd
         | be using the appropriate tools) and that we should rely less on
         | them actively looking. Blind trust that things will be fine is
         | the root problem here.
        
           | josteink wrote:
           | > The problem is not so much that humans can't see this but
           | that they are not looking very hard to begin with (otherwise,
           | they'd be using the appropriate tools) and that we should
           | rely less on them actively looking.
           | 
           | And simply not allowing non-ASCII identifiers in the first
           | place would be a move in that direction. Now you have one
           | thing less to look for.
        
         | badsectoracula wrote:
         | You can only type ~27% of my name with just ASCII (and even
         | then one letter will not be exact)... and i agree with you.
         | If anything i'd go a bit further and say that, sure, use
         | Unicode in places where you can find arbitrary text like
         | documents, messages, etc but anything that has to do with the
         | 'guts' of the computer should stay away from Unicode (or at
         | least treat it as data, like how filenames are treated on
         | Linux).
         | 
         | I disagree with the getting everyone on the Unicode bandwagon
         | though, IMO Unicode has introduced a ton of problems exactly
         | because it tries to be a ton of stuff at the same time. I don't
         | know what exactly a better solution would be but i have a very
         | hard time accepting that such a convoluted and error prone
         | system is the best solution. IMO if decades later there are
         | still issues with getting it right then there is something
         | fundamentally wrong with the system itself and not with the
         | applications and developers trying to work with it.
        
           | gpderetta wrote:
           | An existing working solution, even if imperfect, patched, and
           | with a lot of baggage and technical debt, is infinitely better
           | than a not-yet-invented ideal, perfect solution.
           | 
           | And even if the perfect solution existed right now, in a few
           | decades it will be as filled with baggage as the current one.
           | 
           | Sometimes one has to realize that hard problems are hard.
        
         | AzzieElbab wrote:
         | a lot of software used in shipping/logistics predates unicode
        
         | lifthrasiir wrote:
         | > But can't we all just sit down and admit that allowing non-
         | ASCII characters in programming-language identifiers was a bad
         | idea?
         | 
         | It's a bad idea only if all members in your team can easily
         | produce and comprehend ASCII-only code.
         | 
         | > Having invisible characters represent executable code is not
         | just a dumb idea, it's so hazardous that you might call it
         | borderline malicious.
         | 
         | Not if those invisible characters do affect the rendering.
         | Invisible formatting characters like ZWJ and ZWNJ are allowed
         | because they are used in some scripts. The relevant Unicode
         | specification [1] even provides a guideline to limit ZWJ and
         | ZWNJ strictly to the context where they do affect the
         | rendering.
         | 
         | That said, the Hangul filler and half-width Hangul filler were
         | mistakes. They are purely legacy characters and never have been
         | used in practice, so I encourage new languages to exclude them
         | from the default (X)ID_Start/Continue set (Unicode can't do
         | that because of the compatibility, maybe they can introduce
         | another pair of properties without those characters).
         | 
         | [1]
         | https://unicode.org/reports/tr31/#Layout_and_Format_Control_...
        
           | josteink wrote:
           | > The relevant Unicode specification [1] even provides a
           | guideline to limit ZWJ and ZWNJ strictly to the context where
           | they do affect the rendering.
           | 
           | Which is exactly what I am suggesting by saying non-ASCII
           | characters should be banned from being used as _identifiers_,
           | not from being present in the code-file altogether or in
           | the form of strings, etc.
           | 
           | If the formatting of your output in your applications (as
           | seen by the user) depends on the _names you've declared your
           | variables with_, then you are doing something horribly wrong.
        
             | lifthrasiir wrote:
             | You seem to think of those formatting characters as something
             | that should be in the higher-level protocol like HTML. They
             | are not. They are used when two consecutive abstract
             | characters can be combined in two or more different ways.
             | _And those different renderings frequently have different
             | meanings._ That's why they can't be simply removed when
             | normalized; doing so will destroy the text.
        
               | josteink wrote:
               | We seem to be talking past one another. What I'd like to
               | see banned is non-ascii in identifiers, variables-
               | _names_ and nothing else.
               | 
               | While you respond as if I want to banish anything non-
               | ASCII from all parts of all code-files except from HTML-
               | templates. That's certainly not what I'm advocating.
               | 
               | The following is IMO perfectly _fine_:
               | 
               |     var greeting = "Hello (cowboy emoji)";
               | 
               | The following is IMO not:
               | 
               |     var (emoji) = "Let's party!"; // note identifier contains non-ascii
               | 
               | Do you still disagree? If so, can you outline why?
        
               | lifthrasiir wrote:
               | Okay, I think I see where you got confused. There are
               | multiple levels of Unicode identifier support and you are
               | probably not aware of all possible levels. Those levels
               | are:
               | 
               | 1. Identifiers can contain any octet with the highest bit
               | set. Different octet sequences denote different names.
               | 
               | 2. Identifiers can contain any Unicode code point (or
               | scalar value, the fine distinction is not required here)
               | above U+007F. Different (but possibly same-looking) code
               | point sequences denote different names.
               | 
               | 3. Identifiers can contain any Unicode code point in a
               | predefined set, or two if the first character and
               | subsequent characters are distinguished. Different code
               | point sequences denote different names.
               | 
               | 4. Same as 3, but these predefined sets derive from the
               | Unicode Identifier and Pattern Syntax specification
               | [1]---namely (X)ID_Start/Continue.
               | 
               | 5. Same as 4, but now identifiers are normalized
               | according to one of the Unicode normalization algorithms.
               | So some different code point sequences now map to the
               | same name, but only if they are semantically the same
               | according to Unicode.
               | 
               | 6. Same as 5, but also has a rule to reduce unwanted
               | identifiers. This may include confusable characters,
               | virtually indistinguishable names and names with multiple
               | unrelated scripts. Unicode itself provides many
               | guidelines in the Unicode Security Mechanisms standard
               | [2].
               | 
               | Levels 3, 4 and 5 are most common choices in programming
               | languages. In particular emojis are not allowed for 4, so
               | your example wouldn't work in such languages. For example
               | JavaScript is one of them so `eval('var \u{1f600} = 42')`
               | doesn't work (where U+1F600 is a smiling face). Both
               | Python and Rust are at the level 5. Possibly
               | unexpectedly, both C and C++ are at the level 3. Levels 1
               | and 2 are rare especially in modern languages; PHP is a
               | famous example of the level 1.
               | 
               | Level 6 is a complex topic and there are varying degrees
               | of implementations (for example Rust partially supports
               | the level 6 via lints), but there is a notable example
               | outside of programming languages: the Internationalized
               | Domain Names. They have very strong constraints because
               | any pair of confusable labels is a security problem. It
               | seems that they have been successful in keeping the
               | security of non-ASCII domains on par with ASCII-only
               | domains, that is, not fully satisfactory but reasonable
               | enough. (If you don't see the security issues of ASCII-
               | only domains, PaypaI and rnastercard are examples of
               | problematic ASCII labels that were never forbidden.)
               | 
               | I argue that the level 3+ is necessary and the level 5+
               | is desirable for international audiences. The level 5
               | would for example mean that `var annyeonghaseyo =
               | "annyonghaseyo";` (Korean) is allowed but `var (emoji) =
               | "oh no";` is forbidden. I have outlined why the former is
               | required in the last paragraph of [3]. Does my clarified
               | stance make sense to you?
               | 
               | [1] https://unicode.org/reports/tr31/
               | 
               | [2] https://unicode.org/reports/tr39/
               | 
               | [3] https://news.ycombinator.com/item?id=29170954
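
Levels 4 and 5 from the list above can be poked at from JavaScript itself. This is a sketch under stated assumptions: `isIdentifierName` and `sameAfterNormalization` are hypothetical helpers, the regular expression ignores escape sequences and reserved words, and the normalization check only illustrates the comparison a level-5 language performs (JavaScript does not normalize identifiers).

```javascript
// Level 4: ECMAScript identifier names are defined in terms of the Unicode
// ID_Start / ID_Continue properties (plus $, _, ZWNJ and ZWJ). Escape
// sequences and reserved words are ignored in this sketch.
const IDENTIFIER_NAME = /^[$_\p{ID_Start}][$\u200C\u200D\p{ID_Continue}]*$/u;

function isIdentifierName(text) {
  return IDENTIFIER_NAME.test(text);
}

// Level 5: two code point sequences name the same identifier if they are
// equal after normalization. JavaScript itself does NOT do this for
// identifiers; the function only illustrates the comparison.
function sameAfterNormalization(a, b, form = 'NFC') {
  return a.normalize(form) === b.normalize(form);
}
```

Under this check the Hangul filler `'\u3164'` is accepted as an identifier name while the emoji `'\u{1F600}'` is rejected, and `'caf\u00e9'` and `'cafe\u0301'` compare equal only after normalization.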
        
               | josteink wrote:
               | To be clear I'm completely oblivious to what Unicode
               | identifiers are. As such I'm not talking about them, and
               | they are out of scope wrt my point.
               | 
               | What I am advocating is that identifiers used for symbols
               | in the programming language (variables-names, function-
               | names, class-names, etc), should be strictly ASCII-based.
               | 
               | That's simple, understandable and should be a sane
               | default anywhere.
               | 
               | My opinion is that since nobody without a doctorate in
               | Unicode _actually fully understands Unicode_ , having a
               | rule-set for identifiers built on top of the already
               | bewildering Unicode rule-set is a sure-fire way to
               | engineer for unexpected consequences and/or security
               | issues.
               | 
               | Sure. Allow it if you _must_. But you must opt in to use
               | it. It should be a non-default feature everywhere where
               | it's available.
        
       | ludovicianul wrote:
       | On a similar note, if you want to test your REST APIs for weird
       | characters, I built a tool for this:
       | (https://github.com/Endava/cats#leadingcontrolcharsinfieldstr...)
        
       | lifthrasiir wrote:
       | Hey, you have missed U+FFA0 HALFWIDTH HANGUL FILLER which has
       | about the same property as U+3164 HANGUL FILLER!
       | 
       | Surely I expected this coming ever since I've seen the purported
       | Trojan "attack", as the Hangul fillers are pretty much the only
       | characters that are (X)ID_Start and have no visible glyphs [1].
       | If (X)ID_Continue is also considered, ZWJ and ZWNJ would be
       | other contenders. Attacks using those characters have much
       | better chance than the Trojan "attack", but you need a very
       | specific code to execute the attack. It should be obvious that a
       | typical coding convention easily prevents them.
       | 
       | Much like the purported Trojan "attack", this kind of attack
       | needs better code review and tooling. You don't need to remove
       | non-ASCII identifiers from existing languages: they have their
       | uses when your entire team speaks languages that don't use the
       | Latin script. But you should be able to catch a _new_ use of
       | non-ASCII characters throughout your code base and compare that
       | with your expectation.
       | 
       | [1] The Hangul filler comes from a legacy mechanism of KS X 1001
       | for unencoded Hangul syllables (it had only 2,350 out of 11,172
       | modern syllables). The half-width Hangul filler probably comes
       | from a duplicate encoding of the filler in the IBM code page 933
       | to ensure round-trip conversion. Neither is used in practice,
       | except perhaps for the Hangul filler, which was briefly
       | implemented by Mozilla and removed due to a compatibility
       | issue.
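
One way to "catch a new use of non-ASCII characters" is to list each one by code point and name, so an invisible glyph is reported in words. The sketch below uses Python's standard `unicodedata` module; the helper names and the sample line are illustrative assumptions.

```python
import unicodedata


def describe(ch: str) -> str:
    """Render one character as 'U+XXXX NAME category=Xx'."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name} category={unicodedata.category(ch)}"


def non_ascii_report(text: str):
    """Yield (line, column, description) for every non-ASCII character."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if not ch.isascii():
                yield line_no, col, describe(ch)


# Illustrative input: a destructuring line with a Hangul filler hidden in it.
sample = "const { timeout,\u3164} = req.query;\n"
for hit in non_ascii_report(sample):
    print(hit)
```

Both fillers report general category `Lo` (a letter), which is why identifier rules based on letter categories accept them.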
        
       | fergie wrote:
       | Running prettier on the code makes the "hidden" variables fairly
       | obvious -> https://imgur.com/a/MhhRpwq
       | 
       | That said, nothing on my buildchain actually throws an error or
       | warning.
        
         | DyslexicAtheist wrote:
         | >> That said, nothing on my buildchain actually throws an error
         | or warning.
         | 
         | use hooks for CI on pre-commit / merge and pull requests e.g.
         | like this pre-commit which would catch bi-directional trojan
         | sources:
         | 
         |     #!/usr/bin/env python3
         |     import sys
         |     import subprocess
         | 
         |     bidi_chars = '\u202A\u202B\u202D\u202E\u2066\u2067\u2068\u202C\u2069'
         |     for line in sys.stdin:
         |         old, new, ref = line.split()
         |         diff = subprocess.run(['git', 'diff', old, new],
         |                               stdout=subprocess.PIPE,
         |                               stderr=subprocess.STDOUT,
         |                               text=True)
         |         if diff.returncode != 0:
         |             print(diff.stdout)
         |             sys.exit(f'git diff ended with rc={diff.returncode}, receive TERMINATED')
         |         if any(c in diff.stdout for c in bidi_chars):
         |             print(diff.stdout)
         |             sys.exit('Possible Trojan Source Attack, receive REFUSED')
         | 
         | I wish GitHub/GitLab provided such features out of the box,
         | following best practice, so people can stop pasting them from
         | the web or reinventing their own version in every team ...
        
         | brabel wrote:
         | Obvious as in an empty line? Not very obvious to me.
        
       | onion2k wrote:
       | If you combined this attack with Whitespace
       | (https://en.wikipedia.org/wiki/Whitespace_(programming_langua...)
       | you could embed entire programs in your JS code.
        
         | tyingq wrote:
         | Perl has a module called Acme::Bleach that does that for you.
        
       | FrankyHollywood wrote:
       | > In our experience non-ASCII characters are pretty rare in code.
       | Many development teams chose to use English as the primary
       | development language
       | 
       | Is this true for the whole world, or just Europe/US?
        
         | afavour wrote:
         | It is. The way it was explained to me, all the APIs you use are
         | in English so naming variables in your local language is futile
         | at best and would just require constant context switching.
         | 
         | I remember many moons ago MooTools announced international API
         | translations as an April Fools joke. It did make me wonder if
         | there's an interesting programming experiment to be done
         | there... but I'm a native English speaker so I'm not best
         | positioned to know!
        
           | mijamo wrote:
           | Not my experience. Plenty of codebases have variables in
           | the local language in France, Germany and Sweden (where I
           | have experience).
           | 
           | I have actually encountered a lot of problems with English
           | codebases in those countries as they often try to translate
           | regional concepts that are not directly translatable. This is
           | particularly annoying when it comes to administrative stuff
           | where one English word can refer to different local concepts
           | (ex: geographical divisions of the territory) and
           | translations are always clumsy. I have even seen nasty bugs
           | come from there, where a "county" had a different meaning in
           | different places of the code as different teams had different
           | ideas of what a county was but didn't discuss it.
        
         | utrack wrote:
         | Yep, pretty much (Russian here)
        
       | reilly3000 wrote:
       | While this is an interesting hack, the larger issue in the
       | example is allowing any query parameter to write into a
       | subprocess. exec() immediately throws flags for me, especially
       | when it isn't necessary like in the case of making an http call.
       | Even when it isn't passing arbitrary inputs from the web to the
       | command line, it's susceptible to DoS that could crash the whole
       | kernel instead of just the web server. I get that this is just a
       | contrived example to show the risk of hidden characters, but
       | please don't use process.exec() unless you have no other options.
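
A sketch of the safer alternative in Python, for cases where a subprocess really is unavoidable: pass the untrusted value as a single argument-list entry so that no shell ever parses it. The helper name and payload are illustrative, and the child process is just the Python interpreter echoing its argument.

```python
import subprocess
import sys


def echo_argument(value: str) -> str:
    """Run a child process with `value` as one argv entry and return its output."""
    result = subprocess.run(
        # A list (and no shell=True) means `value` is never interpreted by a shell.
        [sys.executable, "-c", "import sys; print(sys.argv[1])", value],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


# Shell metacharacters arrive in the child as plain text.
print(echo_argument("example.com; echo INJECTED"))
```

For something like an HTTP request, a library call (for example `urllib.request` in the standard library) avoids the subprocess entirely.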
        
       | Klaster_1 wrote:
       | A similar thing to the Reddit post mentioned in the article
       | happened to me too: I used a not-a-space character that looks
       | like a space once, the text editor autocompletion remembered it
       | and would occasionally substitute it for space. The code looked
       | OK, but compilation failed or threw syntax errors at run time.
       | This continued for several years until I completely reinstalled
       | the editor, with full cleanup.
        
         | yepthatsreality wrote:
         | _Glances at Microsoft GitHub Copilot._
        
         | bavell wrote:
         | Dev nightmare fuel
        
         | Enginerrrd wrote:
         | Oh my God. That's nightmare level error-inducing.
        
         | Arrath wrote:
         | Auto-complete gremlins can be the absolute worst, I would have
         | been tearing my hair out.
        
       | nosianu wrote:
       | First thing I did when I first read the story was check my
       | editor. I already had the "Zero Width Characters locator" plugin
       | installed, but that covered less than a handful of specific space
       | character type codes.
       | 
       | Still, the result was good: Looks like IDEA editors like Webstorm
       | show invisible characters with colored background and a warning.
       | 
       | My test, for that first article and now for this one, was to
       | copy the example code they contained or linked to from the
       | browser into an open file.
       | 
       | Screenshot: https://i.imgur.com/ColuRNB.png
        
         | JeremyNT wrote:
         | While not as fancy, font choice may save you here too. I use
         | vim and while the editor doesn't treat this character as
         | special, my font (Iosevka term) doesn't include this character,
         | and so it's rendered as the generic "missing unicode" glyph
         | with the code inside it.
        
         | dotancohen wrote:
         | Interesting. PhpStorm highlights the variable after `timeout`
         | but does not highlight the variable after
         | `http://example.com/`. Even pressing F2 to go to the next error
         | goes to the first variable (the highlighted one) but not the
         | second.
         | 
         | However, placing the cursor on either does highlight the
         | second.
         | 
         | I'm using the Darcula scheme. Your screenshot obscures the
         | second occurrence, so we cannot see if your light theme has the
         | same issue with the second occurrence not being highlighted as
         | Darcula has.
         | 
         | Screenshot: https://i.imgur.com/FxwUkVz.png
        
           | nosianu wrote:
           | You are right, I missed the other one, it is not reported.
           | You can see there is something because it takes space, but
           | you have to deliberately go there to see it. There also is no
           | warning from having the "No trailing spaces" setting active,
           | so it is not seen as a space character even if it shows as
           | such.
           | 
           | I'll write an Issue on youtrack, I'm sure they'll fix it.
           | Of the well over a hundred issues I have ever reported, about
           | 2/3 were fixed (the rest are obsolete, only a few are
           | really still open).
           | 
           | EDIT: Bug report submitted.
        
             | dotancohen wrote:
             | Please link it, I'll comment as well.
             | 
             | Yes, I file bug reports with lots of places, and Jetbrains
             | is one of the best for actually doing something with them.
             | It is one of the few non-FOSS applications that I am
             | willing to integrate into my workflow (hmm, the only one I
             | think).
        
               | nosianu wrote:
               | Ticket: IDEA-282266 Not all invisible characters are
               | reported
               | 
               | I didn't want to link because of loss of anonymity... :(
               | 
               | https://youtrack.jetbrains.com/issue/IDEA-282266
        
               | nosianu wrote:
               | EDIT (new comment because edit-period is long gone):
               | 
               | It's not too severe an issue, maybe not one at all(?), at
               | least in this concrete example, because after removing
               | the first occurrence of the hidden variable it now
               | becomes a "not defined" real error and not just a warning
               | in the second location.
        
           | tdrdt wrote:
           | Some time ago I managed to add a zero-width character in my
           | PHP code. Because it had no width PhpStorm did not highlight
           | it and I had absolutely no clue why there was an error in my
           | code. So it only highlights when it has width.
           | 
           | Edit: just added some non-space characters and at least in
           | Rider they are now displayed as a warning. So I think this is
           | fixed now.
        
         | kingcharles wrote:
         | Is there a plugin for detecting homoglyphs like these
         | genderless vs. male zombies I made:
         | https://kingcharles.one/unistrange.html
        
       | Cameri wrote:
       | Is there an eslint plugin to prevent invisible characters on
       | .js/.ts files?
        
       | cphoover wrote:
       | I imagine this can be defended against pretty easily with a lint
       | rule that prevents these Unicode characters in variables. Pretty
       | ingenious little hack though.
       | 
       | The ESLint rule _id-match_, which requires identifiers to match a
       | specified regular expression, would be useful here. For example:
       | "id-match": ["error", "^[a-z]+([A-Z][a-z]+)*$"]
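
To see what the quoted pattern accepts and rejects, here it is exercised in Python. This only illustrates the regular expression; it does not run ESLint, and the helper name is hypothetical.

```python
import re

# Same pattern as the quoted id-match example, minus the ^...$ anchors,
# because fullmatch() already requires the whole name to match.
CAMEL_CASE = re.compile(r"[a-z]+([A-Z][a-z]+)*")


def id_allowed(name: str) -> bool:
    """True if the identifier is plain ASCII lowerCamelCase."""
    return CAMEL_CASE.fullmatch(name) is not None


# ascii() escapes the Hangul filler so it is visible in the output.
for name in ["timeout", "hostName", "\u3164", "host\u3164", "MAX_SIZE", "x1"]:
    print(ascii(name), id_allowed(name))
```

The pattern rejects the Hangul filler, but it also rejects digits, underscores and upper-case constants, so a real code base would probably need a looser expression.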
        
         | bluepnume wrote:
         | A malicious PR could also add the character to your eslintrc
         | too though. You'd be forgiven for seeing the line change in the
         | diff and thinking it was just some reformatting.
        
           | cphoover wrote:
           | That would show up in a diff and would hopefully elicit a
           | question in code review about why the .eslintrc file was being
           | changed in this way. This is another good argument for a
           | comprehensive code review process.
           | 
           | Also, you could lock this file down with a CODEOWNERS file so
           | only certain trusted contributors could modify the lint
           | configuration. You could also do exclusionary pattern
           | matching to make sure none of the bad characters exist in
           | identifier names... Or you could write your ESLint
           | configuration as a separate module to be npm installed... or
           | you could write an ESLint rule plugin that disallows non-ASCII
           | identifiers and then npm install that... lots of different
           | ways to skin this cat to add security.
        
           | jraph wrote:
           | It seems like things displaying diffs could use a specific
           | color for lines only changed by formatting or indentation
           | (indentation can have significant meaning like in Python but
           | this would probably be good enough)
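
A rough sketch of how a diff viewer might decide which changed lines are "formatting only", assuming a line-by-line comparison. The function names are hypothetical, and it ignores that whitespace inside string literals can be significant.

```python
def strip_whitespace(line: str) -> str:
    """Remove every character Python counts as whitespace."""
    return "".join(line.split())


def classify_change(old: str, new: str) -> str:
    """Label a changed line as 'unchanged', 'whitespace-only' or 'content'."""
    if old == new:
        return "unchanged"
    if strip_whitespace(old) == strip_whitespace(new):
        return "whitespace-only"
    return "content"


print(classify_change("foo( a,b )", "foo(a, b)"))
print(classify_change("{ timeout }", "{ timeout,\u3164}"))
```

Since U+3164 is classified as a letter and not as whitespace, a line that gains a Hangul filler is labelled as a content change here, so it would not be coloured as mere reformatting.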
        
             | dotancohen wrote:
             | I believe that Git diff - which has features not supported
             | by regular diff such as --word-diff - can distinguish
             | whitespace-only line changes from other changes. The
             | Jetbrains IDE, which I believe uses Git diff behind the
             | scenes, will show who originally wrote a line even if it has
             | been whitespace-reformatted later.
        
       | vjeux wrote:
       | Another reason to use prettier, this will be formatted in a
       | confusing way and has a higher chance of being spotted by a
       | human!
        
       | nathell wrote:
       | What was wrong with only allowing ASCII in identifiers?
        
         | jgalt212 wrote:
         | seriously, I'd fire anyone who put any emoji in an identifier.
        
           | chrismorgan wrote:
           | Sensible languages follow UAX #31
           | <https://www.unicode.org/reports/tr31/> for Unicode
           | identifiers, which doesn't allow emoji.
        
             | skrebbel wrote:
             | But it does allow hangul fillers, apparently.
        
       | SenpaiHurricane wrote:
       | TS + Intellij
       | 
       | https://ibb.co/MfLLNQL
        
       | Suvitruf wrote:
       | This issue and example used are more about data validation and
       | escaping characters.
        
       | mirekrusin wrote:
       | I don't think this is JavaScript specific. It's like saying "BMW
       | cars in Paris stop working if you pour sugar in tank".
        
         | bestham wrote:
         | The point is not that the vulnerability is a trait of
         | javascript but to make a demonstration on how different unicode
         | characters can be used to create a vulnerability, exemplified
         | by a piece of javascript.
        
           | mirekrusin wrote:
           | Yes, I understand that, but the title and content don't
           | explicitly mention this "detail" that the same attack vector
           | exists for other languages and data formats.
        
         | dspillett wrote:
         | _> I don't think this is JavaScript specific._
         | 
         | It isn't, any relatively dynamic language is going to have
         | these or similar issues. Many moons ago I saw similar examples
         | in bash, I'm sure they are possible in PHP, ..., ..., ...
         | 
         | In fact, even the more strict languages probably do too: the
         | "accidentally run something malicious via care-free use of
         | exec" is an issue in just about every language that has
         | "exec"/similar - it is a data trusting error in the
         | programmer's logic not an issue with the language itself. The
         | dynamic nature of some of JS's syntax is just one way to
         | pollute the data being fed to exec amongst the other sources
         | (user input, being too trusting of config in the DB or
         | filesystem, and so forth).
         | 
         | Javascript is a very good option to use for examples though:
         | most devs know it well enough and it is _everywhere_ so the
         | potential scale of the danger is obvious, even more so in light
         | of people being far too trusting of dependencies pulled via NPM
         | and the recent examples of malicious updates getting into
         | common packages.
         | 
         | Maybe the title could be a bit less click-baity, though I'm not
         | sure what would be used instead that wouldn't be overly wordy
         | for a punchy article title.
        
           | fergie wrote:
           | Strong typing doesn't fix the issue that invisible characters
           | can be used as variable names.
        
         | UncleMeat wrote:
         | This whole story has been stupid. These ideas have been around
         | for ages and are not novel to the security community. Yet we've
         | seen headlines like "all programs ever are vulnerable to this
         | new hack." The root cause is not unicode characters but instead
         | _untrusted text_. It isn't like a malicious library would be
         | unable to sneak backdoors in through ascii source anyway. Heck,
         | we _just_ had a big kerfuffle over this happening in the linux
         | kernel this year.
         | 
         | Or worse! Go look at the dependencies for some large enterprise
         | system built in java. How many _raw jars_ do you think are
         | being included in there? Has _anybody_ looked at the bytecode
         | of these jars?
        
         | pumpum wrote:
         | Good point. FYI, sugar in the gas tank is a myth. The sugar
         | will cause basically no harm at all, I've witnessed it
         | attempted.
        
       | cyberpsybin wrote:
       | This needs to be patched. Although at this point there might be
       | code depending on this.
        
       ___________________________________________________________________
       (page generated 2021-11-10 23:02 UTC)