[HN Gopher] The Invisible JavaScript Backdoor
___________________________________________________________________
The Invisible JavaScript Backdoor
Author : davidbarker
Score : 517 points
Date : 2021-11-10 04:04 UTC (18 hours ago)
(HTM) web link (certitude.consulting)
(TXT) w3m dump (certitude.consulting)
| rafaelturk wrote:
| Nice article, it was fun reading it. Albeit I don't think this
| kind of attack is restricted to just JavaScript.
| dlsa wrote:
| Compilers and interpreters need a new pass to detect these
| characters in code and treat them as hard errors. This doesn't
| stop their use in comments where presumably they are still ok.
|
| Alternatively, there needs to be an uptake of the use of code
| linters and pretty printers.
|
| A bit of both perhaps.
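|
| A minimal sketch of such a detection pass, assuming a hand-picked
| (and certainly incomplete) list of invisible and bidirectional
| code points. The name findSuspiciousCharacters is hypothetical,
| and unlike the suggestion above it does not exempt comments:

```javascript
// Illustrative subset of code points that render as nothing.
// This list is an assumption, not an authoritative catalogue.
const SUSPICIOUS = new Map([
  [0x200b, "ZERO WIDTH SPACE"],
  [0x200c, "ZERO WIDTH NON-JOINER"],
  [0x200d, "ZERO WIDTH JOINER"],
  [0x2060, "WORD JOINER"],
  [0x3164, "HANGUL FILLER"],
  [0xfeff, "ZERO WIDTH NO-BREAK SPACE (BOM)"],
]);

// Bidirectional controls: U+202A..U+202E and U+2066..U+2069.
function isBidiControl(cp) {
  return (cp >= 0x202a && cp <= 0x202e) || (cp >= 0x2066 && cp <= 0x2069);
}

// Returns one finding per suspicious code point, with 1-based
// line and column (columns are counted in code points).
function findSuspiciousCharacters(source) {
  const findings = [];
  source.split("\n").forEach((line, lineIndex) => {
    let column = 1;
    for (const ch of line) {
      const cp = ch.codePointAt(0);
      if (SUSPICIOUS.has(cp) || isBidiControl(cp)) {
        findings.push({
          line: lineIndex + 1,
          column,
          codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
          name: SUSPICIOUS.get(cp) ?? "BIDI CONTROL",
        });
      }
      column += 1;
    }
  });
  return findings;
}
```

| A build step could fail whenever the returned array is non-empty.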
| speleding wrote:
| I think the recommendation to disallow any non-ASCII character is
| throwing out the baby with the bathwater.
|
| How about code that wants to display some emojis? It would be
| cumbersome to use hex unicode everywhere. And while localisations
| should typically happen in a separate language file, it's very
| common to want some text in code intended for a single audience.
|
| Blocking all the confusables might be tricky, and an allow list
| would be endless. Perhaps some magic pre-processor comment that
| says "allow unicode in this file".
| YetAnotherNick wrote:
| Do you really have to write emoji in the code string? Similarly
| with international language characters. The sane thing is to
| use either json config files or i18n libraries.
| speleding wrote:
| If you are writing something intended for a single audience
| using i18n libraries can be unnecessary overhead. And emoji
| can also be icons like [?] that can be useful to display in
| the UI.
| josteink wrote:
| > I think the recommendation to disallow any non-ASCII
| character is throwing out the baby with the bathwater.
|
| Not throwing out all non-ASCII characters from code-files. Just
| throwing them out as being invalid _identifiers_ in your code
| (think variables, function-names, etc).
|
| > How about code that wants to display some emojis?
|
| Fine. You quote that emoji in a string, and it's golden.
|
| If you try to make a variable with the name of an emoji, however,
| your code crashes.
|
| That sounds fine to me.
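|
| A rough sketch of that rule, assuming a deliberately naive
| scanner: it skips quoted strings, template literals and comments,
| and reports any non-ASCII character left in code. It does not
| understand template interpolation or regex literals, and the name
| nonAsciiOutsideStrings is hypothetical:

```javascript
// Reports non-ASCII characters that appear outside string literals
// and comments. A naive state machine, not a real JavaScript lexer.
function nonAsciiOutsideStrings(source) {
  const hits = [];
  // "code", "line-comment", "block-comment", or the quote character
  // that opened the string we are currently inside.
  let state = "code";
  const chars = Array.from(source); // split by code point
  for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    const next = chars[i + 1];
    if (state === "code") {
      if (ch === '"' || ch === "'" || ch === "`") {
        state = ch;
      } else if (ch === "/" && next === "/") {
        state = "line-comment";
        i++;
      } else if (ch === "/" && next === "*") {
        state = "block-comment";
        i++;
      } else if (ch.codePointAt(0) > 0x7f) {
        hits.push({ index: i, char: ch });
      }
    } else if (state === "line-comment") {
      if (ch === "\n") state = "code";
    } else if (state === "block-comment") {
      if (ch === "*" && next === "/") {
        state = "code";
        i++;
      }
    } else if (ch === "\\") {
      i++; // skip the escaped character inside a string
    } else if (ch === state) {
      state = "code"; // matching closing quote
    }
  }
  return hits;
}
```

| An emoji inside a quoted string yields no hits, while the same
| character used as a variable name does.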
| speleding wrote:
| That would close this particular attack (but not the BIDI one
| the article mentions). But there is probably already too much
| code out there with π=3.14 in it to be feasible to do this.
| smcl wrote:
| I really thought that using the greek letter for pi (or
| theta, etc) was something you do to show your programming
| language supports unicode identifiers but that nobody
| actually does in real life. I wonder how people input this,
| do they know the Alt+xyz combo, do they select-copy-paste
| or is there another way to write these characters that
| I'm not aware of?
|
| Just to be clear, I don't mean people who are actually
| using Greek language for input - it's pretty obvious how
| they would type that character :)
| josteink wrote:
| > But there is probably already too much code out there
| with π=3.14 in it to be feasible to do this.
|
| So for JS let it break in new, module based strict-mode
| code.
|
| That's going to be processed by tooling prior to shipping
| anyway, so that'll get caught.
|
| For other platforms do the same. In some forward-looking
| revision of the language/compiler.
|
| People have to fix obsolete/deprecated stuff in newer
| compilers/class libraries all the time. This is no
| different.
| est wrote:
| IO operations, especially those involving a subprocess, are prone
| to backdoors. They are practically unverifiable.
|
| I've yet to see a unicode backdoor in pure algorithmic flows.
| willvarfar wrote:
| Some examples of historic attacks you could embed in
| algorithms:
|
| "Salami slicing" is a kind of embezzlement where eg an insider
| programs the computer to credit small amounts to the last
| account (and then opens an account with a name beginning with
| Z).
|
| In the 90s there was a massive hushed up scandal where the
| programmers developing the early Barclaycard made the pseudo
| random number generator for pin codes just issue three distinct
| pins. This meant that a stolen card could be easily used
| because they could guess any pin in three goes before the ATM
| swallowed the card.
|
| This is hardly an exhaustive list. It's just to get people's
| cogs turning... :)
| null_object wrote:
| > In the 90s there was a massive hushed up scandal where the
| programmers developing the early Barclaycard made the pseudo
| random number generator for pin codes just issue three
| distinct pins. This meant that a stolen card could be easily
| used because they could guess any pin in three goes before
| the ATM swallowed the card.
|
| Citation for this?
| willvarfar wrote:
| Took some digging to find any working links these days. The
| three pin thing is on page two but it doesn't name which
| bank; I may have misremembered and it might not have been
| Barclays. The whole article is a good starting point for
| digging into other vulnerabilities and exploits too
| https://www.theregister.com/2005/10/21/phantoms_and_rogues/
| More-nitors wrote:
| maybe someone should make a linter for this...
| pabs3 wrote:
| There is one for Go called glyphcheck:
|
| https://github.com/NebulousLabs/glyphcheck
| trevinhofmann wrote:
| Added to my own ESLint config:
| https://github.com/trevinhofmann/eslint-config-principled/pu...
| gostsamo wrote:
| The benefit of being blind: the screen reader announces invisible
| characters and I could detect the invisible variable.
| infomax wrote:
| T[?]h[?][?]i[?][?][?]s c[?]o[?][?]m[?][?][?]m[?][?][?][?]e[?][?
| ][?][?][?]n[?][?][?][?][?][?]t s[?]h[?][?]o[?][?][?]u[?][?][?][
| ?]l[?][?][?][?][?]d[?][?][?][?][?][?]n[?][?][?][?][?][?][?]'[?]
| [?][?][?][?][?][?][?]t b[?]e e[?]a[?][?]s[?][?][?]y t[?]o
| r[?]e[?][?]a[?][?][?]d b[?]y
| s[?]c[?][?]r[?][?][?]e[?][?][?][?]e[?][?][?][?][?]n r[?]e[?][?]
| a[?][?][?]d[?][?][?][?]e[?][?][?][?][?]r[?][?][?][?][?][?]s
| gostsamo wrote:
| yep, it is not.
| geocar wrote:
| It is difficult to _see_ on an iPhone, but it sounds fine in
| Voiceover.
| mwcampbell wrote:
| With NVDA on Windows, when I read the comment normally, it's
| spelled out. When I read it character by character, I get
| "symbol FFF8" for each of the hidden Unicode characters. And
| when I move line by line through NVDA's linear representation
| of the web page, the hidden characters count against the
| length of the line for the purpose of word wrapping.
|
| Narrator's behavior is weirder. If I turn on scan mode and
| move onto the line with the up or down arrow key, Narrator
| says nothing. If I read the current line with Insert+Up
| Arrow, Narrator spells it out like NVDA does. When moving
| character by character, Narrator says nothing for the hidden
| Unicode characters. And because Narrator doesn't do its own
| line wrapping but defers to the application to determine what
| counts as a line, the text only counts as one line.
|
| Disclosure: I used to work on the Windows accessibility team
| at Microsoft, on Narrator among other things.
| WesolyKubeczek wrote:
| The benefit of being sighted is being able to use accessibility
| features while also being sighted.
|
| Take a peek at those technologies sometimes, those things
| improve work comfort for everyone.
| [deleted]
| mwcampbell wrote:
| Still, it would not occur to most sighted programmers to
| review code using a screen reader. To me, this is another
| argument for having a truly diverse team (or community, in
| the case of an open-source project); a blind programmer who's
| already involved with the project would catch something like
| this. So in this particular case, blindness is truly not a
| disability.
| marginalia_nu wrote:
| Being able to perceive BOM markers is tantamount to a
| superpower in programming.
| IceWreck wrote:
| How hard is it to program while being blind? What sort of
| development do you do? I understand that frontend is
| impossible, but what other difficulties do you face?
|
| Are indent-based languages like Python harder than
| bracket-based languages?
| gostsamo wrote:
| Hi,
|
| Front end is not entirely impossible, but pixel-perfect designs
| are. Otherwise, I know blind people who do FE, though I'm not
| sure whether most of it is professional.
|
| Indent based languages are actually easier. Every screen
| reader has a way to announce indentation in code, while
| brackets could be confusing if not formatted or verbose if
| properly announced.
|
| My main issues are dev tools with bad accessibility. Also, it
| takes me more time to get acquainted with new code, and
| homophones in the source code sometimes require extra
| attention. Filtering through logs is also a bitch in most
| cases. Besides the dev tools, you can summarize the rest as
| bad IO speed.
| MathCodeLove wrote:
| I've been struggling with eye strain and have considered
| trying to approach development in a fashion similar to that
| taken by blind devs. Any suggestions for guides or
| overviews for how I can get set up?
| gostsamo wrote:
| Hi,
|
| it depends on what you are working on and what you want
| to do. Generally, screen readers are not as good for
| programming as they are for plain text stuff, so they will
| be a limited substitute for whatever you are using now.
| If you are okay with working slower, they can help you
| listen through code and tool messages, providing relief
| for your eyes.
|
| If you are using Windows, NVDA is the screen reader. Jaws
| is a bit too expensive for my taste without any
| significant edge over NVDA. The builtin narrator is still
| immature in my opinion. VSCode has excellent
| accessibility with a dedicated and involved team. Visual
| Studio also has extremely good accessibility support
| though I'm not using it. IntelliJ sucks. Not completely,
| but enough that people do not see the benefit of using
| it. Eclipse is not popular these days, but it has good
| accessibility as well as far as I know. Sublime is not
| accessible.
|
| If you are on Linux, the screen reader is Orca. It does
| not have the same level of support as the Windows stuff,
| but I know people who are developing on linux boxes so it
| is doable. Emacs must be good enough because it has a self-
| voicing plugin and people who like and use it. As far as
| I know, VSCode for Linux has some accessibility features
| but I don't know how they compare to Windows.
|
| If you are on Mac, your only choice of screen reader is
| VoiceOver by Apple. It is good but not always perfect
| to my knowledge. I know people who use TextMate, Xcode,
| VSCode, and Emacs, but I don't have much feedback from
| there. It is totally doable though.
|
| On Windows, I'm also using notepad++ as secondary editor
| because it is faster and works better for large files.
| Also, it is a good notetaking tool.
|
| We can connect offline if you need some more info.
| mrlemke wrote:
| I am very interested in how blind developers work. I have
| been pondering how to make computers and development more
| accessible. If you don't mind:
|
| Do you have preference between CLI, TUI, or GUI dev tools?
|
| Is highly symbolic code harder to understand using a screen
| reader than plain language code? By symbolic, I
| specifically mean any characters that are not alphanumeric.
| gostsamo wrote:
| Hi,
|
| I don't have preferences on the interface. As long as it
| is accessible, I can learn to work with it. E.g. the VSCode
| team does everything possible to make their interface
| accessible, and they are continuously fixing any reported
| issues.
|
| When it comes to code, verbose is better. Abbreviations
| take effort to decode. I can remap some symbols to have
| different pronunciations, but it does not always work.
| E.g. I've made the screen reader speak the ":=" operator in
| python as "assigned from", but brackets have nesting and
| orientation, and too many of them get nasty to listen to
| or follow.
| mrlemke wrote:
| Thanks for answering. What is your favorite programming
| language to work in? If you could use any language you
| wanted, what would be your top pick?
| gostsamo wrote:
| Well, this is highly subjective. I'm paid to do python
| and node js from time to time and python really rocks for
| me. No small reason why I like python more is the
| much better tracebacks. When reading a console, it is
| much more pleasant to have the erroring line at the
| bottom, which spares me copying the entire console into npp
| to find the top of it.
|
| That said, I know many blind devs who do java, c#, swift,
| c++ and so on. I had bad experiences with IDEs when I
| was starting to study software development in those
| languages and it has stayed with me, but it is not
| universal.
|
| If I had the choice, I would not drop python, but I might
| add some of the functional languages or rust for the new
| ways of thinking they might teach me. So far, I've looked
| at them, but I haven't done anything serious there.
| mrlemke wrote:
| Interesting, thanks for sharing!
| akavel wrote:
| Do you have some tricks for how you handle filtering
| through logs? Or some ideas if there could be a tool that
| could help you or mitigate your most critical issue[s]?
|
| I found filtering through logs a major pain even for a
| fully sighted person like me, so I wrote a tool to help me
| with that, but it's fully in a "TUI" paradigm (i.e. curses-
| like), so I presume it wouldn't help you much
| (https://github.com/akavel/up). No promises, given that the
| tool as is scratched my itch, but I am honestly curious if
| something similar could reduce your PITA, including whether
| this specific tool could be made useful for you through
| some minimal effort on my side.
| gostsamo wrote:
| Hi,
|
| usually grep saves the day. I will check your tool, but
| what I need is for a terminal command that can recognize
| the meta fields from a log record and put them on a line
| separated from the main message. Also, it must be
| installed everywhere I work, which is not so easy.
| Putting logs in a table with filtering capabilities might
| be best, but this means web access to the location of the
| logs which is again tricky.
| ryanianian wrote:
| > what I need is for a terminal command that can
| recognize the meta fields from a log record and put them
| on a line separated from the main message
|
| Isn't this the exact use-case of structured logging?
|
| Log events have {timestamp, log level,
| log category, string message, ...arbitrary key/value
| pairs}
|
| Usually serializing each message as a single json line in
| a file.
|
| Since it's all on one line you can still use grep, but
| then since it's machine-readable you can pipe the grep to
| anything that can parse json. Vanilla python3 works and
| tends to be a part of most ops toolkits. Such tooling can
| split out the fields onto other lines etc or in a more
| reader-friendly format.
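|
| A small sketch of that last step, written in JavaScript here
| rather than the Python mentioned above. It assumes each log line
| is one JSON object and that the fields are named timestamp,
| level, category and message; those names, and formatLogLine
| itself, are assumptions rather than any standard:

```javascript
// Turns one JSON log line into up to three text lines:
// meta fields, extra key/value pairs, then the message on its own line.
function formatLogLine(jsonLine) {
  const { timestamp, level, category, message, ...extra } = JSON.parse(jsonLine);
  const meta = [timestamp, level, category]
    .filter((field) => field !== undefined)
    .join(" ");
  const extras = Object.entries(extra)
    .map(([key, value]) => `${key}=${JSON.stringify(value)}`)
    .join(" ");
  return [meta, extras, message]
    .filter((part) => part !== undefined && part !== "")
    .join("\n");
}
```

| Because the message ends up on a separate line, a screen reader
| can skip straight past the meta fields.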
| gostsamo wrote:
| Yes, this has been my idea in many cases, but it is not
| always that I have a say over the logging format.
| IceWreck wrote:
| Hey, that's cool. Thank you.
| threatripper wrote:
| So, it's a backdoor that only the blind can see?
| mwcampbell wrote:
| The next time someone tries to tell me that a true screen
| reader should use computer vision and machine learning
| (including OCR) rather than requiring applications to implement
| accessibility APIs, I will bring up this case.
| SilasX wrote:
| HN exchange:
|
| "Why can't we just, you know, direct blind users to a special
| protocol that structures the data appropriately and then lets
| them parse it however they want?"
|
| Me: 'We did! It's called HTML! Designers just broke it!'
|
| https://news.ycombinator.com/item?id=20224961
| mwcampbell wrote:
| IMO, HTML is still closer to that ideal than anything else
| we have. My guess is that given a random web application
| and a random non-web GUI (especially if the latter is
| multi-platform), the web application will be more usable
| with a screen reader.
| joquarky wrote:
| And now many people are excited about throwing all of
| that away with canvas and web assembly.
| skyde wrote:
| right! But we would not need to use canvas if updating
| the DOM was not super slow.
|
| I suspect the #1 reason is the layout/reflow engine, but I
| might be wrong. Game engines do run physics at 60fps, which
| is harder than CSS reflow.
| slaymaker1907 wrote:
| I'd say markdown is even better than HTML for writing
| generic documents since it enforces simplicity. In
| particular, it forces a linear flow of the document and
| does not have any support for stuff like JS.
| HeavyStorm wrote:
| Html could have been that - or better, it was at first -
| but instead of creating a more specialized solution for
| running rich apps we decided to exploit html.
|
| Right now we are in what I'd call the worst of both worlds,
| because we rely on html to do things it wasn't designed to,
| and there's no longer purity in any html out in the wild.
| gostsamo wrote:
| Yep, together with the ml screen reading, they do not offer
| subsidized infinite battery life and machine learning
| hardware for the inferring model.
| robertrbairdii wrote:
| There's definitely a benefit to using a linter and a tool such as
| prettier. Using prettier pushes the hidden character onto an
| additional line in the checkCommands array which makes it much
| easier to spot that something is wrong even if you're not using
| the trailingCommas setting.
|
| https://imgur.com/a/gYKylyH
|
| I think this eslint rule would also be able to defend against the
| initial destructuring of the query object by defining a regex
| that identifiers have to match which would exclude those
| invisible characters https://eslint.org/docs/rules/id-match
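|
| A sketch of what such a configuration might look like. The
| ASCII-only pattern is an assumption chosen for illustration, and
| the object would still need to be exported from your ESLint
| config file in whatever form your setup expects:

```javascript
// Assumed pattern: identifiers may contain only ASCII letters,
// digits, underscore and dollar sign, and must not start with a digit.
const asciiIdentifier = "^[A-Za-z_$][A-Za-z0-9_$]*$";

const eslintConfig = {
  rules: {
    // id-match reports identifiers that do not match the pattern;
    // "properties: true" extends the check to property names.
    "id-match": ["error", asciiIdentifier, { properties: true }],
  },
};
```

| An identifier consisting of U+3164 does not match this pattern,
| so the rule should report it, though that is worth verifying
| against your ESLint version.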
| sihox wrote:
| Just being curious I've pasted the example to Geany and VSCode
| and in both this invisible character was visible :) I can't
| remember setting some special character / whitespace visibility
| options but I think it is good to have this kind of options
| always on.
| Eriks wrote:
| good reason to not use comma when destructuring an object
| smhg wrote:
| You mean the last trailing comma? Or not destructuring into
| multiple variables?
| jabbany wrote:
| IIRC Rust has some compiler-level defenses against these glyph
| based attacks (ref:
| https://twitter.com/skyslasher11/status/1152824207555698688)
|
| Perhaps one could do something similar in JS as well. Like have a
| config that will make an interpreter fail if it encounters
| unescaped unicode in variable names. It does not prevent any
| unicode variable names, but you just have to escape them if they
| are from some list of "abusable characters".
|
| (At least Chrome seems to be happy with `var \u6D4B\u8BD5 = 1;`)
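|
| To illustrate: an escaped identifier and its literal spelling
| refer to the same binding, so escaping changes only how the
| source looks. The helper escapeIdentifier is hypothetical and
| simply shows how a tool might produce the escaped form:

```javascript
// The escaped and literal spellings name the same variable.
var \u6D4B\u8BD5 = 1;
const sameBinding = 测试 === 1;

// Rewrites every non-ASCII code point as a \u{...} escape,
// which JavaScript accepts inside identifiers.
function escapeIdentifier(name) {
  return Array.from(name, (ch) => {
    const cp = ch.codePointAt(0);
    return cp <= 0x7f ? ch : "\\u{" + cp.toString(16).toUpperCase() + "}";
  }).join("");
}
```

| A linter could then demand the escaped form only for characters
| on its "abusable" list and leave everything else alone.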
| goldsteinq wrote:
| You can just do `#![forbid(non_ascii_idents)]` in Rust. It'll
| prevent this kind of attacks completely and you shouldn't need
| non-ASCII idents anyway.
| Tepix wrote:
| That seems like throwing out the baby with the bathwater. We
| don't all want to go back to the IT stone age of 1963.
| goldsteinq wrote:
| You still can use Unicode in comments and string literals.
| You just can't use non-ASCII characters in identifiers.
|
| Unicode in identifiers is just a bad idea.
|
| 1. It creates a security consideration with confusable
| identifiers (and lints don't always catch these)
|
| 2. It breaks tooling with RTL identifiers
|
| 3. It may not render correctly depending on fonts
|
| 4. It may be hard to type depending on keyboard layout
|
| 5. There really isn't a good reason to use non-ASCII idents
| anyway
| lifthrasiir wrote:
| > 1. It creates a security consideration with confusable
| identifiers (and lints don't always catch these)
|
| O/0 and I/1/l are confusable characters within ASCII. I'm
| not kidding here, they are actual entries in the Unicode
| confusables database [1]. But no one wants to remove
| those characters from identifiers.
|
| [1] For example,
| https://util.unicode.org/UnicodeJsps/confusables.jsp?a=0&r=N...
|
| > 2. It breaks tooling with RTL identifiers
|
| It rather unbreaks tooling with no RTL support.
|
| > 3. It may not render correctly depending on fonts
|
| So does Unicode in comments and string literals. In fact
| the purported Trojan "attack" was mostly about string
| literals. So why should they be allowed in strings but
| disallowed in identifiers?
|
| > 4. It may be hard to type depending on keyboard layout
|
| Did you know that not every Latin keyboard layout
| supports a backquote (`)? This was the actual reason that
| the repr(expr) shortcut got removed from Python 3 [2].
|
| [2] https://mail.python.org/pipermail/python-ideas/2007-January/...
|
| > 5. There really isn't a good reason to use non-ASCII
| idents anyway
|
| My canonical answer from experience is that not every
| programmer who can understand English documentation can
| easily write and comprehend English in general. For those
| people having a non-ASCII identifier support is a great
| relief, as it frees them from choosing "correct" English
| identifiers. You can disallow them for your project if
| you want (or conversely, make it an optional feature
| disabled by default), but they are relevant for someone
| else.
| cedilla wrote:
| > it frees them from choosing "correct" English
| identifiers
|
| Even if you have fluent English skills, sometimes
| translations just confuse the issue. It's sometimes
| better to use an untranslated word instead of introducing
| ambiguity, especially when a term originates from a local
| law.
| lifthrasiir wrote:
| Like
| CNLabelContactRelationYoungerCousinMothersSiblingsDaughterOrFathersSistersDaughter
| [1]? :-) You are very much correct.
|
| [1] https://news.ycombinator.com/item?id=28712667
| josephcsible wrote:
| > O/0 and I/1/l are confusable characters within ASCII.
|
| You're mixing up two different ways that people use the
| word "confusable": things that look similar in some
| fonts, versus things that look exactly the same
| regardless of font. I want the latter to be banned from
| source files but not the former.
| mkl wrote:
| Non-ASCII identifiers can be useful for maths too. E.g. I
| use λ sometimes, especially in Python where "lambda" is a
| keyword. (I have AutoHotKey and Espanso hotstrings to
| make typing such symbols easy.)
| __s wrote:
| > O/0 and I/1/l are confusable characters within ASCII
|
| Which is why the first thing I make sure of when looking
| at programming fonts is how well they differentiate these
| characters
| koheripbal wrote:
| Agreed - this is literally the first thing I check when
| selecting the editor's font
| lifthrasiir wrote:
| `#![forbid(...)]` is a crate-wide attribute, so it is more
| like a policy (that is good to have if your code would be
| entirely ASCII).
| Ygg2 wrote:
| Agreed. I use Unicode identifiers to spot shitcode. This
| would really hamper my detection abilities.
|
| Jokes aside, if you're writing Unicode identifiers it means
| you're not writing your code to be read by a broad
| audience.
| CodesInChaos wrote:
| The compiler disallowing them globally might count as that.
| But individual crates enforcing an "ascii only" policy
| makes sense, if they never plan to use non-ascii.
|
| Personally I'd prefer even one step further: the compiler
| would disallow them by default, and you can opt into
| specific character sets/languages at a crate level. e.g.
| `AllowSpecialCharacters("de")` to enable only the special
| characters common in German.
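|
| A sketch of how such an opt-in might behave, written in
| JavaScript for illustration only. Nothing here is a real Rust
| feature: the table EXTRA_LETTERS, its "de" entry and the function
| isIdentifierAllowed are all hypothetical:

```javascript
// Hypothetical per-language allow-list of extra identifier letters.
const EXTRA_LETTERS = {
  de: "\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC\u00DF", // ä ö ü Ä Ö Ü ß
};

// ASCII identifiers are always allowed; other letters only if one
// of the opted-in languages lists them.
function isIdentifierAllowed(name, languages = []) {
  const extra = new Set(
    languages.flatMap((lang) => Array.from(EXTRA_LETTERS[lang] ?? ""))
  );
  const chars = Array.from(name);
  return (
    chars.length > 0 &&
    chars.every(
      (ch, i) =>
        /^[A-Za-z_$]$/.test(ch) ||
        (i > 0 && /^[0-9]$/.test(ch)) ||
        extra.has(ch)
    )
  );
}
```

| With this shape, a project that opts into "de" accepts German
| identifiers but still rejects invisible characters and letters
| from other scripts.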
| UncleMeat wrote:
| > It'll prevent this kind of attacks completely
|
| It won't. The same approach works just fine in your build
| specification or other config files. And it doesn't solve the
| root of this problem, which is that you are compiling source
| code you don't control and don't audit closely into your
| binary. Sneaky text is not the only way of getting malicious
| code through code review.
| matheusmoreira wrote:
| > you shouldn't need non-ASCII idents anyway
|
| Yes, we do. People from all over the world write software
| too. They should be able to use the words they know in code.
|
| Also, it's totally cool to have mathematical symbols in code.
| λ, for example. Much more readable than the word lambda. The
| only reason these symbols are hard to type is our keyboards
| suck. They can be made easy to type with editor support
| though.
| tytso wrote:
| Good code is maintainable code. And while you, as the
| original programmer, might be perfectly comfortable writing
| your code using Arabic variables and comments, what if the
| next person who has to maintain the code is from Korea? Or
| Russia? Or France? Or China?
|
| OK, maybe you're a small startup in Taiwan and so you don't
| care about the next maintainer in your company not being
| able to read or write Chinese. What if you decide to open
| source your code? Or Meta decides to offer you a zillion
| dollars to buy you out, but after they do their due
| diligence, realize that the code is utterly unmaintainable
| should they decide to outsource internationalizing the code
| so it will work in Brazil, so that requires native
| Portuguese speakers (who can preferably be paid low, low
| wages) --- but they can't understand the code because it's
| using Chinese variables and comments. And then Meta decides
| to back out from the deal?
| matheusmoreira wrote:
| If you're likely to work with an international team, it
| makes sense to use english. That's not always the case
| though. Plenty of those low-paid brazilian programmers
| you cited will never do that. Many of them don't speak
| english to begin with.
|
| For example, the school I went to had a simple web
| application for student feedback. Attachments were
| allowed. People started running into issues due to non-
| ASCII characters in file names. I reported the issue to
| the IT department and even helped them fix it. The Python
| code was written in portuguese, accents and everything.
| Why shouldn't accents be used in this case? It's unlikely
| this code will ever be used in an international context.
| jimmaswell wrote:
| ASCII is the standard for code for good reason. Everyone
| can type it. Put whatever you want in comments, but you
| shouldn't make people have to copy/paste your variable
| names.
| matheusmoreira wrote:
| > you shouldn't make people have to copy/paste your
| variable names
|
| The people working on a non-english codebase don't have
| to. Their keyboards have the symbols they're typing.
| jimmaswell wrote:
| You're assuming no international collaboration.
| antris wrote:
| >Yes, we do. People from all over the world write software
| too. They should be able to use the words they know in code
|
| My native language has non-ASCII characters and I do not
| expect nor do I want to be able to type them outside string
| literals. Specifically for the reasons stated in the blog
| post, among others. Writing in my native language is far,
| far down in the list of priorities as a professional coder,
| when security / compatibility are there too. Suggesting
| that non-native English speakers have to be able to code in
| their native language also would suggest that non-native
| coders do not take security / compatibility seriously,
| which would mean that they are unprofessional. I'm pretty
| sure that it's not your intention to suggest that, but
| that's kind of how it comes across. With all the problems
| eliminated by the use of English and ASCII, it would strike
| me as amateurish to not use English and ASCII wherever
| possible.
| matheusmoreira wrote:
| > non-native coders do not take security / compatibility
| seriously
|
| That's not what I said at all. I don't see how you came
| to this conclusion.
|
| > With all the problems eliminated by the use of English
| and ASCII, it would strike me as amateurish to not use
| English and ASCII wherever possible.
|
| Not everybody speaks english. I've taught programming to
| quite a few people and they all attempted to use normal
| characters while writing code. There's absolutely no
| reason why that shouldn't work. I don't see how
| characters like c or a or u could possibly cause security
| issues. Go ahead and ban the invisible unicode stuff but
| there's absolutely no reason why these common letters
| shouldn't work.
| antris wrote:
| It is funny that you are using the existence of a segment
| of the population that I am a part of, to make your claim
| but aren't willing to listen when a member of the segment
| is trying to explain how non-ASCII characters and coding
| do not mix well.
|
| Sure, you could make a fix for this specific case, but
| the problem mentioned in the blog post is not even close
| to the only problem of non-ASCII characters. In _theory_,
| yes, we could make a language and a full suite of
| tooling that would play nice with non-ASCII characters.
| But it's not like the whole non English speaking world is
| waiting for this to happen. People code in English even
| in teams where everyone speaks Finnish. Nobody even
| questions it, because it's so obvious that all code
| should be in English and ASCII. Everyone has shot themselves
| in the foot by putting non-ASCII characters in the source code
| at some point in their career, if they have ever dared to
| try. That's how the reality is, and at the same time I
| hear people saying that the existence of those Finnish
| programmers means we _have to_ have Unicode in source
| code.
|
| >That's not what I said at all. I don't see how you came
| to this conclusion.
|
| I didn't say you said it. I said that's how it (probably
| accidentally) comes across when you talk about something
| so carelessly. Non English speakers care about
| compatibility and security and take those seriously,
| therefore we pretty much always write code in English and
| ASCII.
| matheusmoreira wrote:
| > It is funny that you are using the existence of a
| segment of the population that I am a part of, to make
| your claim but aren't willing to listen when a member of
| the segment is trying to explain how non-ASCII characters
| and coding do not mix well.
|
| Why is it funny? I'm also a member of that group. English
| is not my native language.
|
| > But it's not like the whole non English speaking world
| is waiting for this to happen.
|
| I don't think we should have to wait for this to happen.
| In many ways, it's already happened: most modern
| languages already support unicode symbols.
|
| > People code in English even in teams where everyone
| speaks Finnish. Nobody even questions it, because it's so
| obvious that all code should be in English and ASCII.
|
| Relatively few people speak english in my country. I have
| only a few friends who do. A whole team of people writing
| code in english just doesn't seem likely where I live. I
| actually tried writing english code in such a context
| once, the result was a mixed language mess that I quickly
| reverted back to my native language. Unicode support is
| great because it makes the non-english code much more
| readable.
|
| Europeans in general seem to know english very well. This
| is _not_ the case everywhere. Somehow making english a
| requirement for programming just doesn't sound fair to
| me.
| capitainenemo wrote:
| I brought this up last time
| (https://news.ycombinator.com/item?id=29066760) but:
|
| https://github.com/reinderien/mimic
|
| It applies to other contexts besides code. For our user
| table we have a MariaDB collation based on the Unicode
| confusables list, which prevents confusable usernames
| (treated as already existing).
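|
| The same idea can be sketched at the application level: reduce
| each name to a "skeleton" and compare skeletons instead of raw
| strings. The mapping below is a tiny hand-picked sample, not the
| Unicode confusables data, and both function names are
| hypothetical:

```javascript
// Tiny illustrative sample of Cyrillic letters that resemble ASCII ones.
const CONFUSABLE_TO_ASCII = new Map([
  ["\u0430", "a"], // CYRILLIC SMALL LETTER A
  ["\u0435", "e"], // CYRILLIC SMALL LETTER IE
  ["\u043E", "o"], // CYRILLIC SMALL LETTER O
  ["\u0440", "p"], // CYRILLIC SMALL LETTER ER
  ["\u0441", "c"], // CYRILLIC SMALL LETTER ES
]);

// Normalizes, lowercases, then replaces known look-alikes.
function skeleton(name) {
  return Array.from(
    name.normalize("NFKC").toLowerCase(),
    (ch) => CONFUSABLE_TO_ASCII.get(ch) ?? ch
  ).join("");
}

// A candidate counts as taken if its skeleton matches an existing one.
function isUsernameTaken(candidate, existingNames) {
  const target = skeleton(candidate);
  return existingNames.some((name) => skeleton(name) === target);
}
```

| A real deployment would load the full confusables table rather
| than this five-entry sample.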
| wnevets wrote:
| If you use Sublime Text, the Gremlins[0] package will detect and
| light up these kinds of characters
|
| [0] https://packagecontrol.io/packages/Gremlins
| Mockapapella wrote:
| If anyone is interested, I wrote an article a while back
| exploring which unicode characters Python allows you to set
| variables equal to: https://www.thelisowe.com/why-can-be-a-
| variable-in-python-bu...
|
| This was originally done with the goal of trying to hide/encode
| one program within another using non-displayable characters (such
| as zero width spaces), I just never got around to it. But reading
| this article has kind of reignited that interest for me and I
| think I might take another crack at that soon.
| geoduck14 wrote:
| This is MOST interesting.
|
| I wonder if Git or Stack Overflow should highlight non Ascii
| characters to reduce malicious actors using this in code.
| lovasoa wrote:
| The `cmd &&` looks fishy in their example, and would probably
| have been removed in a review. Instead, one could write :
|     const { ping, curl, } = req.query;
|     const checkCommands = [
|       ping && 'ping -c 1 google.com',
|       curl && 'curl -s http://example.com/',
|     ];
|     await Promise.all(
|       checkCommands.map(cmd => cmd && exec(cmd, { timeout: 5_000 }))
|     );
|
| This way the `cmd &&` is justified
| testASW2 wrote:
| d
| kuon wrote:
| My editor (vim) will warn me with a loud visual red block for any
| non ascii char outside a string literal. But I do not think that
| is enough. Compiler and interpreter must be more strict.
| jrochkind1 wrote:
| I'm surprised VS Code doesn't at least have that option. (Or
| does it?)
| myfonj wrote:
| It has `editor.renderControlCharacters`, but it only recently
| started displaying a few dangerous, previously invisible ones
| (directional overrides) natively [1]. Beyond that, you need
| an extension that highlights non-ASCII characters that are
| either not whitelisted [2] or on a predefined list [3].
|
| [1] https://github.com/microsoft/vscode/issues/116939
| [2] https://marketplace.visualstudio.com/items?itemName=nachocab...
| [3] https://marketplace.visualstudio.com/items?itemName=nhoizey....
| tomxor wrote:
| Mind sharing the relevant config line?
|
| I thought this was default but just realised it only does the
| <FFFF> thing when there is no printable glyph available.
|
| Allowing printable unicode in strings seems like a nice balance
| if it can be done reliably.
| fatheart wrote:
| After seeing this thread I added the following to my vimrc:
|     highlight link NonASCII Error
|     autocmd Syntax * :syntax match NonASCII "[^\d0-\d127]"
|
| Obviously haven't been using it long, and I'm not confident
| enough in my vim knowledge to vouch for its correctness, but
| it works in the limited amount of scenarios I tested so far.
| gpvos wrote:
| That's not the default vim configuration though.
| aww_dang wrote:
| Yes, the Unicode characters are a problem. But do the norms and
| tooling play a role here as well?
|
| Explicitly casting types, such as String parameters to integers,
| would make this much more explicit. Compare the convenience of
| accessing parameters via destructuring with an explicit
| request.getParameter("\u3164"). Another option is having a static
| array of permissible commands declared elsewhere.
|
| There's something to be said for verbosity and explicitness.
| Where the tooling and norms shun it, these 'invisible' backdoors
| can gain advantage.
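|
| A minimal sketch of that more explicit style, under stated
| assumptions: the names ALLOWED_CHECKS, parseTimeout and selectChecks
| are hypothetical, the two commands are the ones from the article's
| example, and the 60-second cap is arbitrary. Nothing taken from the
| request is ever used as a command.

```javascript
// Hypothetical allowlist (not from the article): each check name maps to a
// fixed argv array, so no request value ever becomes part of a command.
const ALLOWED_CHECKS = Object.freeze({
  ping: ['ping', '-c', '1', 'google.com'],
  curl: ['curl', '-s', 'http://example.com/'],
});

// Explicitly cast the timeout parameter to a bounded positive integer.
// The 60000 ms upper bound is an arbitrary assumption for illustration.
function parseTimeout(raw, fallback = 5000) {
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 && n <= 60000 ? n : fallback;
}

// Select checks by exact, visible key names. Any other query key,
// including one named "\u3164", is simply ignored.
function selectChecks(query) {
  return Object.keys(ALLOWED_CHECKS)
    .filter((name) => Object.prototype.hasOwnProperty.call(query, name))
    .map((name) => ALLOWED_CHECKS[name]);
}

// A query carrying the invisible parameter only selects the known check.
console.log(selectChecks({ '\u3164': 'rm -rf /', ping: '1' }));
```

| Each selected argv array could then be handed to something like
| child_process.execFile(argv[0], argv.slice(1), { timeout }), which
| avoids going through a shell.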
| laktak wrote:
| This shows up in standard Vim as a [HF] symbol.
| kreetx wrote:
| Same in vanilla emacs.
| josteink wrote:
| Listen guys, don't get me wrong. As someone with O in my name,
| and both A and O in my address, don't get me started on poorly
| written systems which cannot handle unicode properly. I've seen
| my name and address mangled in shipping forms, in airline tickets
| (every time) and even in my marriage-papers since I married
| abroad.
|
| I literally have _personal_ reasons for getting everyone, and I
| mean everyone, on the unicode bandwagon.
|
| That said... Maybe it's because I'm a child of the late 70s and
| early 80s and learned to program on computers which simply didn't
| have non-ASCII characters at all...
|
| But can't we all just sit down and admit that allowing non-ASCII
| characters in programming-language identifiers was a bad idea?
| Can't we in the next revision of EcmaScript (or Rust, or
| whatever) mandate ASCII-only identifiers when in strict mode or
| using modules or whatever? Having _invisible characters_
| represent executable code is not just a dumb idea, it's so
| hazardous that you might call it borderline malicious.
|
| There _has_ to be some way to undo this damage, without breaking
| compatibility with the code which is already out there, right?
| rtoway wrote:
| Rust has a lint against this kind of attack + you can
| explicitly disable non-ASCII identifiers if you really want to
| est31 wrote:
| Ideally that lint would be on by default though. Most code
| doesn't use non-ASCII identifiers. It's not happened though
| because of uhm. political reasons.
| rtoway wrote:
| The lint is on by default in the latest version of the
| compiler
| drran wrote:
| Most code made by English speakers contains English words
| and Latin characters, so other languages and alphabets must
| be abandoned, and their native speakers must be imprisoned
| until they understand their mistakes.
| toastal wrote:
| Abugidas and logographies banished
| drran wrote:
| OK, OK, we can start with a warning in the compiler that
| use of any language except English is unsafe.
| auggierose wrote:
| Yeah, let's just switch to Cosmopolitan Identifiers:
| https://obua.com/publications/cosmo-id/3/ :-)
|
| But yeah, it would break existing code, sorry.
| Dagonfly wrote:
| Adding a variable decorator/annotation like
| @Unicode(german,french) would be a good stop-gap. You could
| only use ASCII characters unless you specified the script that
| you want to use. One could even set a max limit on how many
| scripts per variable. Because while I have used German
| characters in variables before (only if I'm referring to some
| law or spec), I never had a use case for more than 2 scripts
| within one variable.
| silvestrov wrote:
| I think this is a good idea because once in a while you need
| to write non-ascii characters in names.
|
| This mostly comes up when implementing tax rules or
| government administrative divisions as some countries have
| names/concepts which have no good translation into English,
| so you are left with using the non-English name, which often
| contains non-ASCII characters.
| est31 wrote:
| The multiple scripts per variable thing is implemented in
| Rust via a lint. For the explicit enabling of single scripts,
| I have suggested that for Rust, but sadly people preferred
| allowing all identifiers (while giving an option to only have
| ascii but I'd argue this is unfair for anyone who only wants
| to use a specific non-ascii language, why do they have to
| suddenly allow _all_ languages in their code base?). There
| are also practical concerns, like who says what a language
| is, which characters it contains, how that language is
| called, etc? Someone has to maintain all these lists.
| chrismorgan wrote:
| > _who says what a language is, which characters it
| contains, how that language is called, etc?_
|
| The Unicode Consortium already maintains all of that data
| in the CLDR (Common Locale Data Repository).
| lifthrasiir wrote:
| For your information the relevant Unicode specification is
| the Script_Extensions property [1]. (You can't easily filter
| by languages, so you should filter by scripts.)
|
| [1] https://www.unicode.org/reports/tr24/tr24-32.html#Script_Ext...
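|
| A rough sketch of filtering by script, assuming a JavaScript engine
| with Unicode property escapes. CANDIDATE_SCRIPTS and scriptsIn are
| hypothetical names, and the script list is a short illustrative
| sample rather than a complete one.

```javascript
// Candidate scripts to test for; a short hypothetical list, not exhaustive.
const CANDIDATE_SCRIPTS = ['Latin', 'Greek', 'Cyrillic', 'Hangul', 'Han'];

// Build one regex per script from its Script_Extensions property.
const SCRIPT_TESTS = CANDIDATE_SCRIPTS.map((name) => [
  name,
  new RegExp(`\\p{Script_Extensions=${name}}`, 'u'),
]);

// Return the candidate scripts that occur somewhere in an identifier.
// Digits, "_" and "$" belong to no candidate script and are ignored.
function scriptsIn(identifier) {
  return SCRIPT_TESTS
    .filter(([, re]) => re.test(identifier))
    .map(([name]) => name);
}

console.log(scriptsIn('paypal'));       // only Latin letters
console.log(scriptsIn('p\u0430ypal'));  // U+0430 is CYRILLIC SMALL LETTER A
```

| A lint could then reject any identifier whose script list is longer
| than the allowed maximum, or contains a script the project has not
| opted into.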
| jillesvangurp wrote:
| The issue with this is less that this is possible and more that
| a lot of javascript ends up in production without ever getting
| compiled, linted, type-checked, etc. Stuff like this is
| designed to bypass what little human oversight there is to
| prevent bad things from happening. What is actually visible
| also depends on what fonts you have installed on your system.
| So, it's less clear cut than you think.
|
| The problem is not so much that humans can't see this but that
| they are not looking very hard to begin with (otherwise, they'd
| be using the appropriate tools) and that we should rely less on
| them actively looking. Blind trust that things will be fine is
| the root problem here.
| josteink wrote:
| > The problem is not so much that humans can't see this but
| that they are not looking very hard to begin with (otherwise,
| they'd be using the appropriate tools) and that we should
| rely less on them actively looking.
|
| And simply not allowing non-ASCII identifiers in the first
| place would be a move in that direction. Now you have one
| thing less to look for.
| badsectoracula wrote:
| You can only type ~27% of my name with just ASCII (and even
| then one letter will not be exactly)... and i agree with you.
| If anything i'd go a bit further and say that, sure, use
| Unicode in places where you can find arbitrary text like
| documents, messages, etc but anything that has to do with the
| 'guts' of the computer should stay away from Unicode (or at
| least treat it as data, like how filenames are treated on
| Linux).
|
| I disagree with the getting everyone on the Unicode bandwagon
| though, IMO Unicode has introduced a ton of problems exactly
| because it tries to be a ton of stuff at the same time. I don't
| know how exactly a better solution would be but i have a very
| hard time accepting that such a convoluted and error prone
| system is the best solution. IMO if decades later there are
| still issues with getting it right then there is something
| fundamentally wrong with the system itself and not with the
| applications and developers trying to work with it.
| gpderetta wrote:
| An existing working solution, even if not perfect, patched,
| with lots of baggage and technical debt, is infinitely better
| than a not-yet-invented ideal, perfect solution.
|
| And even if the perfect solution existed right now, in a few
| decades it would be as filled with baggage as the current one.
|
| Sometimes one has to realize that hard problems are hard.
| AzzieElbab wrote:
| a lot of software used in shipping/logistics predates unicode
| lifthrasiir wrote:
| > But can't we all just sit down and admit that allowing non-
| ASCII characters in programming-language identifiers was a bad
| idea?
|
| It's a bad idea only if all members of your team can easily
| produce and comprehend ASCII-only code.
|
| > Having invisible characters represent executable code is not
| just a dumb idea, it's so hazardous that you might call it
| borderline malicious.
|
| Not if those invisible characters do affect the rendering.
| Invisible formatting characters like ZWJ and ZWNJ are allowed
| because they are used in some scripts. The relevant Unicode
| specification [1] even provides a guideline to limit ZWJ and
| ZWNJ strictly to the context where they do affect the
| rendering.
|
| That said, the Hangul filler and half-width Hangul filler were
| mistakes. They are purely legacy characters and never have been
| used in practice, so I encourage new languages to exclude them
| from the default (X)ID_Start/Continue set (Unicode can't do
| that because of the compatibility, maybe they can introduce
| another pair of properties without those characters).
|
| [1]
| https://unicode.org/reports/tr31/#Layout_and_Format_Control_...
| josteink wrote:
| > The relevant Unicode specification [1] even provides a
| guideline to limit ZWJ and ZWNJ strictly to the context where
| they do affect the rendering.
|
| Which is exactly what I am suggesting by saying non-ASCII
| characters should be banned from being used as _identifiers_,
| not from being present in the code-file altogether or in
| the form of strings, etc.
|
| If the formatting of your output in your applications (as
| seen by the user) depends on the _names you've declared your
| variables with_, then you are doing something horribly wrong.
| lifthrasiir wrote:
| You seem to think of those formatting characters as something
| that should be in a higher-level protocol like HTML. They
| are not. They are used when two consecutive abstract
| characters can be combined in two or more different ways.
| _And those different renderings frequently have different
| meanings._ That's why they can't simply be removed during
| normalization; doing so would destroy the text.
| josteink wrote:
| We seem to be talking past one another. What I'd like to
| see banned is non-ASCII in identifiers, variable
| _names_, and nothing else.
|
| While you respond as if I want to banish anything non-
| ASCII from all parts of all code-files except from HTML-
| templates. That's certainly not what I'm advocating.
|
| The following is IMO perfectly _fine_:
|     var greeting = "Hello (cowboy emoji)";
|
| The following is IMO not:
|     var (emoji) = "Let's party!"; // note identifier contains non-ascii
|
| Do you still disagree? If so, can you outline why?
| lifthrasiir wrote:
| Okay, I think I see where you got confused. There are
| multiple levels of Unicode identifier support and you are
| probably not aware of all possible levels. Those levels
| are:
|
| 1. Identifiers can contain any octet with the highest bit
| set. Different octet sequences denote different names.
|
| 2. Identifiers can contain any Unicode code point (or
| scalar value, the fine distinction is not required here)
| above U+007F. Different (but possibly same-looking) code
| point sequences denote different names.
|
| 3. Identifiers can contain any Unicode code point in a
| predefined set, or in one of two sets if the first character
| and subsequent characters are distinguished. Different code
| point sequences denote different names.
|
| 4. Same as 3, but these predefined sets derive from the
| Unicode Identifier and Pattern Syntax specification
| [1], namely (X)ID_Start/Continue.
|
| 5. Same as 4, but now identifiers are normalized
| according to one of the Unicode normalization algorithms.
| So some different code point sequences now map to the
| same name, but only if they are semantically the same
| according to Unicode.
|
| 6. Same as 5, but also has a rule to reduce unwanted
| identifiers. This may include confusable characters,
| virtually indistinguishable names and names with multiple
| unrelated scripts. Unicode itself provides many
| guidelines in the Unicode Security Mechanisms standard
| [2].
|
| Levels 3, 4 and 5 are most common choices in programming
| languages. In particular emojis are not allowed for 4, so
| your example wouldn't work in such languages. For example
| JavaScript is one of them so `eval('var \u{1f600} = 42')`
| doesn't work (where U+1F600 is a smiling face). Both
| Python and Rust are at the level 5. Possibly
| unexpectedly, both C and C++ are at the level 3. Levels 1
| and 2 are rare especially in modern languages; PHP is a
| famous example of the level 1.
|
| Level 6 is a complex topic and there are varying degrees
| of implementations (for example Rust partially supports
| the level 6 via lints), but there is a notable example
| outside of programming languages: the Internationalized
| Domain Names. They have very strong constraints because
| any pair of confusable labels is a security problem. It
| seems that they have been successful in keeping the
| security of non-ASCII domains on par with ASCII-only
| domains, that is, not fully satisfactory but reasonable
| enough. (If you don't see the security issues of ASCII-
| only domains, PaypaI and rnastercard are examples of
| problematic ASCII labels that were never forbidden.)
|
| I argue that the level 3+ is necessary and the level 5+
| is desirable for international audiences. The level 5
| would for example mean that `var annyeonghaseyo =
| "annyonghaseyo";` (Korean) is allowed but `var (emoji) =
| "oh no";` is forbidden. I have outlined why the former is
| required in the last paragraph of [3]. Does my clarified
| stance make sense to you?
|
| [1] https://unicode.org/reports/tr31/
|
| [2] https://unicode.org/reports/tr39/
|
| [3] https://news.ycombinator.com/item?id=29170954
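|
| The difference between levels 4 and 5 can be sketched with plain
| strings standing in for identifier names. The variable names are
| hypothetical, and NFC is picked only for illustration; a language
| at level 5 may use a different normalization form.

```javascript
// One visible name ("cafe" with an acute accent on the e), spelled with
// two different code point sequences.
const precomposed = 'caf\u00e9'; // U+00E9 LATIN SMALL LETTER E WITH ACUTE
const decomposed = 'cafe\u0301'; // "e" followed by U+0301 COMBINING ACUTE ACCENT

// Levels 3 and 4: names are compared code point by code point,
// so these two spellings are different names.
const sameWithoutNormalization = precomposed === decomposed;

// Level 5: names are normalized (NFC here) before comparison,
// so both spellings map to the same name.
const sameAfterNfc =
  precomposed.normalize('NFC') === decomposed.normalize('NFC');

console.log(sameWithoutNormalization, sameAfterNfc); // false true
```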
| josteink wrote:
| To be clear I'm completely oblivious to what Unicode
| identifiers are. As such I'm not talking about them, and
| they are out of scope wrt to my point.
|
| What I am advocating is that identifiers used for symbols
| in the programming language (variables-names, function-
| names, class-names, etc), should be strictly ASCII-based.
|
| That's simple, understandable and should be a sane
| default anywhere.
|
| My opinion is that since nobody without a doctorate in
| Unicode _actually fully understands Unicode_ , having a
| rule-set for identifiers built on top of the already
| bewildering Unicode rule-set is a sure-fire way to
| engineer for unexpected consequences and/or security
| issues.
|
| Sure. Allow it if you _must_. But you must opt in to use
| it. It should be a non-default feature everywhere where
| it's available.
| ludovicianul wrote:
| On a similar note, if you want to test your REST APIs for weird
| characters, I built a tool for this:
| (https://github.com/Endava/cats#leadingcontrolcharsinfieldstr...)
| lifthrasiir wrote:
| Hey, you have missed U+FFA0 HALFWIDTH HANGUL FILLER which has
| about the same property as U+3164 HANGUL FILLER!
|
| I have expected this ever since I saw the purported Trojan
| "attack", as the Hangul fillers are pretty much the only
| characters that are (X)ID_Start and have no visible glyphs [1].
| If (X)ID_Continue is also considered, ZWJ and ZWNJ would be
| further contenders. Attacks using those characters have a much
| better chance than the Trojan "attack", but you need very
| specific code to execute the attack. It should be obvious that a
| typical coding convention easily prevents them.
|
| Much like the purported Trojan "attack", this kind of attack
| calls for better code review and tooling. You don't need to remove
| non-ASCII identifiers from existing languages: they have their
| uses when your entire team speaks languages not written in
| Latin script. But you should be able to catch a _new_ use of non-
| ASCII characters throughout your code base and compare that with
| your expectation.
|
| [1] The Hangul filler comes from a legacy mechanism of KS X 1001
| for unencoded Hangul syllables (it had only 2,350 out of 11,172
| modern syllables). The half-width Hangul filler probably comes
| from a duplicate encoding of the filler in the IBM code page 933
| to ensure round-trip conversion. Both are never used in practice,
| except for probably the Hangul filler that was briefly
| implemented by Mozilla and removed due to the compatibility
| issue.
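|
| As a sketch of the kind of tooling meant here, the function below
| scans source text for the invisible characters named in this
| comment. INVISIBLE and findInvisible are hypothetical names, and
| the character list is illustrative rather than exhaustive.

```javascript
// Code points that render as nothing (or nearly nothing) yet may be accepted
// inside identifiers. An illustrative list, not an exhaustive one.
const INVISIBLE = new Map([
  [0x115f, 'HANGUL CHOSEONG FILLER'],
  [0x1160, 'HANGUL JUNGSEONG FILLER'],
  [0x3164, 'HANGUL FILLER'],
  [0xffa0, 'HALFWIDTH HANGUL FILLER'],
  [0x200c, 'ZERO WIDTH NON-JOINER'],
  [0x200d, 'ZERO WIDTH JOINER'],
]);

// Report every listed character with its 1-based line and column
// (columns are counted in code points, not UTF-16 units).
function findInvisible(source) {
  const findings = [];
  source.split('\n').forEach((text, index) => {
    let column = 0;
    for (const ch of text) {
      column += 1;
      const name = INVISIBLE.get(ch.codePointAt(0));
      if (name) findings.push({ line: index + 1, column, name });
    }
  });
  return findings;
}

// The destructuring line from the article, with U+3164 written as an escape.
console.log(findInvisible('const { timeout,\u3164} = req.query;'));
```

| Run only over the lines a change adds, a check like this would flag
| a _new_ use of such characters without touching existing code.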
| fergie wrote:
| Running prettier on the code makes the "hidden" variables fairly
| obvious -> https://imgur.com/a/MhhRpwq
|
| That said, nothing on my buildchain actually throws an error or
| warning.
| DyslexicAtheist wrote:
| >> That said, nothing on my buildchain actually throws an error
| or warning.
|
| use hooks for CI on pre-commit / merge and pull requests, e.g.
| a hook like this one, which would catch bi-directional Trojan
| Source attacks (it reads "old new ref" lines from stdin, the
| way a Git pre-receive hook does):
|
|     #!/usr/bin/env python3
|     import sys
|     import subprocess
|
|     bidi_chars = '\u202A\u202B\u202D\u202E\u2066\u2067\u2068\u202C\u2069'
|
|     for line in sys.stdin:
|         old, new, ref = line.split()
|         diff = subprocess.run(['git', 'diff', old, new],
|                               stdout=subprocess.PIPE,
|                               stderr=subprocess.STDOUT, text=True)
|         if diff.returncode != 0:
|             print(diff.stdout)
|             sys.exit(f'git diff ended with rc={diff.returncode}, receive TERMINATED')
|         if any(c in diff.stdout for c in bidi_chars):
|             print(diff.stdout)
|             sys.exit('Possible Trojan Source Attack, receive REFUSED')
|
| I wish github/gitlab would provide such features out of the box,
| following best practice, so people can stop pasting them from the
| web or reinventing their own version in every team ...
| brabel wrote:
| Obvious as in an empty line? Not very obvious to me.
| onion2k wrote:
| If you combined this attack with Whitespace
| (https://en.wikipedia.org/wiki/Whitespace_(programming_langua...)
| you could embed entire programs in your JS code.
| tyingq wrote:
| Perl has a module called Acme::Bleach that does that for you.
| FrankyHollywood wrote:
| > In our experience non-ASCII characters are pretty rare in code.
| Many development teams chose to use English as the primary
| development language
|
| Is this true for the whole world, or just Europe/US?
| afavour wrote:
| It is. The way it was explained to me, all the APIs you use are
| in English so naming variables in your local language is futile
| at best and would just require constant context switching.
|
| I remember many moons ago MooTools announced international API
| translations as an April Fools joke. It did make me wonder if
| there's an interesting programming experiment to be done
| there... but I'm a native English speaker so I'm not best
| positioned to know!
| mijamo wrote:
| Not my experience. Plenty of codebases have variables in
| local language in France Germany and Sweden (where I have
| experience).
|
| I actually have encountered a lot of problems with English
| codebases in those countries as they often try to translate
| regional concepts that are not directly translatable. This is
| particularly annoying when it comes to administrative stuff
| where one English word can refer to different local concepts
| (ex: geographical divisions of the territory) and
| translations are always clumsy. I have even seen nasty bugs come
| from there, where a "county" had a different meaning in
| different places of the code as different teams had different
| ideas of what a county was but didn't discuss it.
| utrack wrote:
| Yep, pretty much (Russian here)
| reilly3000 wrote:
| While this is an interesting hack, the larger issue in the
| example is allowing any query parameter to write into a
| subprocess. exec() immediately throws flags for me, especially
| when it isn't necessary like in the case of making an http call.
| Even when it isn't passing arbitrary inputs from the web to the
| command line, it's susceptible to DoS that could crash the whole
| kernel instead of just the web server. I get that this is just a
| contrived example to show the risk of hidden characters, but
| please don't use process.exec() unless you have no other options.
| Klaster_1 wrote:
| A similar thing to the Reddit post mentioned in the article
| happened to me too: I used a not-a-space character that looks
| like a space once, the text editor autocompletion remembered it
| and would occasionally substitute it for space. The code looked
| OK, but compilation failed or threw syntax errors in run time.
| This continued for several years until I completely reinstalled
| the editor, with full cleanup.
| yepthatsreality wrote:
| _Glances at Microsoft GitHub Copilot._
| bavell wrote:
| Dev nightmare fuel
| Enginerrrd wrote:
| Oh my God. That's nightmare level error-inducing.
| Arrath wrote:
| Auto-complete gremlins can be the absolute worst, I would have
| been tearing my hair out.
| nosianu wrote:
| First thing I did when I first read the story was check my
| editor. I already had the "Zero Width Characters locator" plugin
| installed, but that covered less than a handful of specific space
| character type codes.
|
| Still, the result was good: Looks like IDEA editors like Webstorm
| show invisible characters with colored background and a warning.
|
| My test was from that first article and also now from this one
| copy the example code they contained or linked to from the
| browser into an open file.
|
| Screenshot: https://i.imgur.com/ColuRNB.png
| JeremyNT wrote:
| While not as fancy, font choice may save you here too. I use
| vim and while the editor doesn't treat this character as
| special, my font (Iosevka term) doesn't include this character,
| and so it's rendered as the generic "missing unicode" glyph
| with the code inside it.
| dotancohen wrote:
| Interesting. PhpStorm highlights the variable after `timeout`
| but does not highlight the variable after
| `http://example.com/`. Even pressing F2 to go to the next error
| goes to the first variable (the highlighted one) but not the
| second.
|
| However, placing the cursor on either does highlight the
| second.
|
| I'm using the Darcula scheme. Your screenshot obscures the
| second occurrence, so we cannot see if your light theme has the
| same issue with the second occurrence not being highlighted as
| Darcula has.
|
| Screenshot: https://i.imgur.com/FxwUkVz.png
| nosianu wrote:
| You are right, I missed the other one, it is not reported.
| You can see there is something because it takes space, but
| you have to deliberately go there to see it. There also is no
| warning from having the "No trailing spaces" setting active,
| so it is not seen as a space character even if it shows as
| such.
|
| I'll write an Issue on youtrack, I'm sure they'll fix it.
| Of the well over a hundred issues I have reported, about two
| thirds were fixed (the rest are obsolete; only a few are really
| still open).
|
| EDIT: Bug report submitted.
| dotancohen wrote:
| Please link it, I'll comment as well.
|
| Yes, I file bug reports with lots of places, and Jetbrains
| is one of the best for actually doing something with them.
| It is one of the few non-FOSS applications that I am
| willing to integrate into my workflow (hmm, the only one I
| think).
| nosianu wrote:
| Ticket: IDEA-282266 Not all invisible characters are
| reported
|
| I didn't want to link because of loss of anonymity... :(
|
| https://youtrack.jetbrains.com/issue/IDEA-282266
| nosianu wrote:
| EDIT (new comment because edit-period is long gone):
|
| It's not too severe an issue, maybe not one at all(?), at
| least in this concrete example, because after removing
| the first occurrence of the hidden variable it now
| becomes a "not defined" real error and not just a warning
| in the second location.
| tdrdt wrote:
| Some time ago I managed to add a zero-width character to my
| PHP code. Because it had no width, PhpStorm did not highlight
| it and I had absolutely no clue why there was an error in my
| code. So it only highlights when it has width.
|
| Edit: just added some non-space characters and at least in
| Rider they are now displayed as a warning. So I think this is
| fixed now.
| kingcharles wrote:
| Is there a plugin for detecting homoglyphs like these
| genderless vs. male zombies I made:
| https://kingcharles.one/unistrange.html
| Cameri wrote:
| Is there an eslint plugin to prevent invisible characters on
| .js/.ts files?
| cphoover wrote:
| I imagine this can be defended against pretty easily with a lint
| rule that prevents these unicode characters in variables. Pretty
| ingenious little hack though.
|
| The eslint rule _id-match_, which requires identifiers to match a
| specified regular expression, would be useful here. For example:
|     "id-match": ["error", "^[a-z]+([A-Z][a-z]+)*$"]
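|
| To see what that pattern accepts, it can be tried as a plain RegExp
| outside of eslint. This is only a sketch: idPattern and the sample
| names are hypothetical, and eslint's own handling of the rule is
| not reproduced here.

```javascript
// The pattern from the id-match example, compiled as a plain RegExp.
const idPattern = new RegExp('^[a-z]+([A-Z][a-z]+)*$');

// Sample identifier names: two from the article's code, the invisible
// Hangul filler, an accented name, and two common ASCII styles.
const samples = ['timeout', 'checkCommands', '\u3164', 'caf\u00e9', 'MAX_SIZE', 'cmd2'];

for (const name of samples) {
  // JSON.stringify makes the otherwise invisible name show up in the output.
  console.log(JSON.stringify(name), idPattern.test(name));
}
```

| Note that this particular pattern is stricter than "ASCII only": it
| also rejects digits, underscores and all-caps constants, so a real
| project would likely need to loosen it.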
| bluepnume wrote:
| A malicious PR could also add the character to your eslintrc
| too though. You'd be forgiven for seeing the line change in the
| diff and thinking it was just some reformatting.
| cphoover wrote:
| that would show up on a diff and would elicit a question in
| code review hopefully of why the .eslintrc file was being
| changed in this way. This is another good argument for a
| comprehensive code review process.
|
| also you could lock this file down with a CODEOWNERS file so
| only certain trusted contributors could modify the lint
| configuration. You could also do exclusionary pattern
| matching to make sure none of the bad characters do exist in
| identifier names... Or you could write your eslint
| configuration as a separate module to be npm installed... or
| you could write an eslint rule plugin that disallows non-ascii
| identifiers and then npm install that... lots of different
| ways to skin this cat to add security.
| jraph wrote:
| It seems like things displaying diffs could use a specific
| color for lines only changed by formatting or indentation
| (indentation can have significant meaning like in Python but
| this would probably be good enough)
| dotancohen wrote:
| I believe that Git diff - which has features not supported
| by regular diff such as --word-diff - can differentiate
| between whitespace-only line changes. The Jetbrains IDE,
| which I believe uses Git diff behind the scenes, will show
| who originally wrote a line even if it has been whitespace-
| reformatted later.
| vjeux wrote:
| Another reason to use prettier: this will be formatted in a
| confusing way and has a higher chance of being spotted by a
| human!
| nathell wrote:
| What was wrong with only allowing ASCII in identifiers?
| jgalt212 wrote:
| seriously, I'd fire anyone who put any emoji in an identifier.
| chrismorgan wrote:
| Sensible languages follow UAX #31
| <https://www.unicode.org/reports/tr31/> for Unicode
| identifiers, which doesn't allow emoji.
| skrebbel wrote:
| But it does allow hangul fillers, apparently.
| SenpaiHurricane wrote:
| TS + Intellij
|
| https://ibb.co/MfLLNQL
| Suvitruf wrote:
| This issue and example used are more about data validation and
| escaping characters.
| mirekrusin wrote:
| I don't think this is JavaScript specific. It's like saying "BMW
| cars in Paris stop working if you pour sugar in tank".
| bestham wrote:
| The point is not that the vulnerability is a trait of
| javascript but to make a demonstration on how different unicode
| characters can be used to create a vulnerability, exemplified
| by a piece of javascript.
| mirekrusin wrote:
| Yes, I understand it, but the title and content don't
| explicitly mention this "detail" that the same attack vector
| exists for other languages and data formats.
| dspillett wrote:
| _> I don't think this is JavaScript specific._
|
| It isn't, any relatively dynamic language is going to have
| these or similar issues. Many moons ago I saw similar examples
| in bash, I'm sure they are possible in PHP, ..., ..., ...
|
| In fact, even the more strict languages probably do too: the
| "accidentally run something malicious via care-free use of
| exec" is an issue in just about every language that has
| "exec"/similar - it is a data trusting error in the
| programmer's logic, not an issue with the language itself. The
| dynamic nature of some of JS's syntax is just one way to
| pollute the data being fed to exec amongst the other sources
| (user input, being too trusting of config in the DB or
| filesystem, and so forth).
|
| Javascript is a very good option to use for examples though:
| most devs know it well enough and it is _everywhere_ so the
| potential scale of the danger is obvious, even more so in light
| of people being far too trusting of dependencies pulled via NPM
| and the recent examples of malicious updates getting into
| common packages.
|
| Maybe the title could be a bit less click-baity, though I'm not
| sure what would be used instead that wouldn't be overly wordy
| for a punchy article title.
| fergie wrote:
| Strong typing doesn't fix the issue that invisible characters
| can be used as variable names.
| UncleMeat wrote:
| This whole story has been stupid. These ideas have been around
| for ages and are not novel to the security community. Yet we've
| seen headlines like "all programs ever are vulnerable to this
| new hack." The root cause is not unicode characters but instead
| _untrusted text_. It isn't like a malicious library would be
| unable to sneak backdoors in through ascii source anyway. Heck,
| we _just_ had a big kerfuffle over this happening in the linux
| kernel this year.
|
| Or worse! Go look at the dependencies for some large enterprise
| system built in java. How many _raw jars_ do you think are
| being included in there? Has _anybody_ looked at the bytecode
| of these jars?
| pumpum wrote:
| Good point. FYI, sugar in the gas tank is a myth. The sugar
| will cause basically no harm at all, I've witnessed it
| attempted.
| cyberpsybin wrote:
| This needs to be patched. Although at this point there might be
| code depending on this.
___________________________________________________________________
(page generated 2021-11-10 23:02 UTC)