[HN Gopher] The complete guide to working with strings in modern...
___________________________________________________________________
The complete guide to working with strings in modern JavaScript
Author : davethedevguy
Score : 54 points
Date : 2021-04-13 08:16 UTC (14 hours ago)
(HTM) web link (www.baseclass.io)
(TXT) w3m dump (www.baseclass.io)
| mgerullis wrote:
| > let website = new String("BaseClass")
|
| > website.rating = "great!"
|
| Ok that's new to me. Can anyone point out where this could be
| useful, if anywhere?
| greggturkington wrote:
| Primarily for adding properties, and making objects that can be
| inherited from (unlike primitives). This is a great answer with
| even more cases:
|
| https://stackoverflow.com/a/39113022/110241
| mgerullis wrote:
| > Primarily for adding properties
|
| I personally would never expect a property on something that
| is treated as a string. That's what POJOs are for, but that is
| just opinion I guess. Still weird they let you do this.
|
| > and making objects that can be inherited from (unlike
| primitives).
|
| I mean the SO gives an example of how to do it but I would
| still love to see some place where this is _needed_ or even
| _pragmatic_ to any degree.
| Ciantic wrote:
| Only useful thing I can come up with is monkey patching without
| changing the function signature, e.g. smuggling additional data
| to a function which expects a string-like thing.
|
| But I think the whole `new String` is problematic, because it
| breaks the strict comparison:
|
|     let a = "Foo";
|     let b = new String("Foo");
|     a === "Foo" // true
|     b === "Foo" // false
| mgerullis wrote:
| > Only useful thing I can come up with is monkey patching without
| changing the function signature, e.g. smuggling additional
| data to a function which expects a string-like thing.
|
| Yeah this sounds terrifying to me, if I were to debug
| something like this.
|
| Also the fact they identify as objects:
| typeof new String("Foo") === 'object';
|
| Case closed, I will never use it :D
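
The differences discussed in this sub-thread can be checked side by side. The following is only a minimal sketch; the variable names (`primitive`, `wrapped`, `report`) are made up for illustration and do not come from the article.

```javascript
// Primitive string vs. String wrapper object.
const primitive = "Foo";
const wrapped = new String("Foo");

// A wrapper object accepts arbitrary properties, like any other object.
wrapped.rating = "great!";

const report = {
  typeofPrimitive: typeof primitive,            // "string"
  typeofWrapped: typeof wrapped,                // "object"
  strictEqual: wrapped === "Foo",               // false: object vs. primitive
  looseEqual: wrapped == "Foo",                 // true: the wrapper is coerced
  unwrappedEqual: wrapped.valueOf() === "Foo",  // true: valueOf() unwraps it
  rating: wrapped.rating,                       // "great!"
};

console.log(report);
```

If a wrapper object does end up in your code, `valueOf()` (or `String(wrapped)`) gets the primitive back before any strict comparison.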
| dwheeler wrote:
| This guide omits an IMPORTANT issue: handling characters outside
| the Basic Multilingual Plane (BMP). JavaScript, like some other
| languages, suffers from the UCS-2 curse - that is, it assumes
| that all characters fit inside 16 bits, even though that is no
| longer true.
|
| For example, the cited text says: "You can return the UTF-16 code
| for a character in a string using charCodeAt()...".
|
| Not true. This only works if the character's code point fits in
| 16 bits; if it needs more than 16 bits, charCodeAt will only
| return one UTF-16 code unit, i.e. a _part_ of the character.
|
| There are lots of discussions about this, here's one:
| https://stackoverflow.com/questions/3744721/javascript-strin...
|
| JavaScript _can_ handle characters outside the BMP, but you
| sometimes have to be aware of the problem & carefully code
| around it when such characters are possible.
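
To make the charCodeAt() point concrete, here is a small sketch comparing it with codePointAt() on a character outside the BMP. The sample string and the `info` object are illustrative choices, not taken from the article.

```javascript
// "a" followed by U+1F600, a code point outside the BMP.
const s = "a\u{1F600}";

const info = {
  length: s.length,                // 3: one unit for "a", two for U+1F600
  firstHalf: s.charCodeAt(1),      // 0xD83D, the high surrogate only
  secondHalf: s.charCodeAt(2),     // 0xDE00, the low surrogate only
  wholeCodePoint: s.codePointAt(1) // 0x1F600, the full code point
};

// String.fromCodePoint() is the inverse of codePointAt().
const rebuilt = "a" + String.fromCodePoint(info.wholeCodePoint);

console.log(info, rebuilt === s);
```

Note that codePointAt() still takes an index in code units, so calling it at index 2 here would return the lone low surrogate.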
| crazygringo wrote:
| Exactly, and emoji are outside the BMP, so it's not exactly an
| edge case, but the norm where two code _units_ (UTF-16 double-
| bytes) are used to make one code _point_ (Unicode character)
| [1].
|
| And it gets even worse, when you consider that for many
| purposes you're not even interested in code _points_ but in
| _graphemes_ which can be sequences of code points -- e.g. a
| single visible emoji might actually be a sequence of 5 code
| _points_, represented by 8 UTF-16 code _units_, taking up 16
| _bytes_ [2]. Similarly a single accented character will often
| be two code points (letter plus combining diacritic).
|
| If you want to split a string by graphemes -- e.g. to count its
| visible length, or delete its last visible character -- you can
| either use a library for it [3], or the relatively new
| Intl.Segmenter [4] which is in Chrome and Safari, but hasn't
| made it to Firefox [5].
|
| Kind of amazing it's 2021 and you _still_ can't calculate the
| number of visible characters (graphemes) in a string using
| native functions across all modern browsers.
|
| [1] https://blog.jonnew.com/posts/poo-dot-length-equals-two
|
| [2] https://www.contentful.com/blog/2016/12/06/unicode-
| javascrip...
|
| [3] https://github.com/orling/grapheme-splitter
|
| [4] https://github.com/tc39/proposal-intl-segmenter
|
| [5] https://bugzilla.mozilla.org/show_bug.cgi?id=1423593
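
The three ways of counting mentioned above can be sketched as follows. This assumes a runtime that may or may not ship Intl.Segmenter, so the hypothetical helper `countGraphemes` returns null when it is missing; the family emoji is just one example of a 5-code-point sequence.

```javascript
// Family emoji: man + ZWJ + woman + ZWJ + girl (5 code points).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

// Hypothetical helper: counts grapheme clusters, or returns null if the
// runtime has no Intl.Segmenter.
function countGraphemes(text, locale = "en") {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return null;
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count++;
  }
  return count;
}

const counts = {
  codeUnits: family.length,          // 8 UTF-16 code units
  utf16Bytes: family.length * 2,     // 16 bytes
  codePoints: [...family].length,    // 5 code points
  graphemes: countGraphemes(family), // 1 where Intl.Segmenter exists
};

console.log(counts);
```

Where Intl.Segmenter is unavailable, a library such as the one linked in [3] would have to fill the gap.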
| davethedevguy wrote:
| Thank you for these comments. I didn't know about this at
| all!
|
| I'll read up on it until I understand it, and then add
| something to the article that covers it.
| crazygringo wrote:
| This might also help you:
|
| https://stackoverflow.com/questions/4547609/how-to-get-
| chara...
|
| e.g. use Array.from() to at least process code points
| rather than code units, though that's still not graphemes.
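
As a rough illustration of the Array.from() suggestion, the sketch below truncates a string by code points instead of code units. `takeCodePoints` is a made-up helper name, and as noted above it still does not respect graphemes.

```javascript
// Hypothetical helper: keep the first n code points of a string.
function takeCodePoints(text, n) {
  // Array.from() iterates by code point, so surrogate pairs stay together.
  return Array.from(text).slice(0, n).join("");
}

const text = "hi\u{1F600}!";

// slice() counts code units and cuts the emoji in half,
// leaving a lone high surrogate at the end.
const byCodeUnits = text.slice(0, 3);

// Counting code points keeps the emoji whole.
const byCodePoints = takeCodePoints(text, 3);

// Still not graphemes: "e" + combining acute accent is two code points.
const accented = Array.from("e\u0301").length; // 2

console.log(byCodeUnits.length, byCodePoints, accented);
```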
| dwheeler wrote:
| You're right about the graphemes. If you need it, you need
| it, but I recommend writing code that does _not_ need to
| count graphemes _if_ you can avoid it.
|
| In many cases strings are best considered units that can be
| concatenated at will, but it's best if you avoid splitting
| them, and if you _must_ split them, generally only split them
| on ASCII character boundaries. Don't consider "lengths" as
| something that has a meaning to humans (it doesn't), and
| don't assume that a "character" is a single JavaScript
| character (it isn't). If you normally just work with strings as
| opaque sequences of "characters" that can be later displayed,
| you can avoid many complications (though obviously NOT all of
| them).
| rav wrote:
|     // Correct: returns '1'
|     'Resume'.localeCompare('RESUME', undefined, {sensitivity: 'accent'})
|
| localeCompare() returns 0 if the strings are equal and a negative or
| positive number (typically -1/+1) if they're different. Since this
| section is about comparing two
| strings that only differ in case and accents, I would expect to
| see a method I could use that would consider the strings to be
| equal. Instead, this example just shows two ways to compare
| strings (=== and localeCompare) that both consider the strings to
| be different.
| davethedevguy wrote:
| Thank you, you're right that's a mistake.
|
| This example was supposed to use: {sensitivity: 'base'}
|
| I've corrected it.
| mormegil wrote:
| Not really. Case (in)sensitivity and accent (in)sensitivity
| are two orthogonal things. If you want to compare two
| strings, ignoring case differences, converting both to
| lowercase (or uppercase) is completely fine in JavaScript (it
| might be problematic in other languages because of Turkish
| dotted and dotless i, but in JS, the obvious first-choice
| toLowerCase() is locale-ignorant; you would need to use
| toLocaleLowerCase() to be bitten by the problem, and why would
| you do that?). Obviously, the method considers "A" and "a" to be
| different, but why wouldn't it? Those characters differ in
| accents, not only in case.
| davethedevguy wrote:
| > Case (in)sensitivity and accent (in)sensitivity are two
| orthogonal things
|
| Is that definitely true?
|
| As I understand it (and I admit I'm no expert), it's common
| to omit accents in some languages when changing case.
| mixmastamyk wrote:
| I came to write this. Why are accents being disregarded
| under the case-insensitive comparison section?
|
| The proper solution is typically "case fold", however I
| only know it from Python, not sure if Javascript supports
| it natively.
| davethedevguy wrote:
| Thanks for this.
|
| I've split 'Handling diacritics' into a section separate
| from 'Case sensitivity'.
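
The case-versus-accent distinction argued over in this sub-thread maps onto the `sensitivity` option of localeCompare(). The sketch below assumes a runtime with Intl (ECMA-402) support and pins the locale to "en" for repeatable results; `sameText` is a hypothetical helper, not something from the article.

```javascript
// Hypothetical helper: true when localeCompare() treats a and b as equal
// under the given sensitivity ("base", "accent", "case" or "variant").
function sameText(a, b, sensitivity) {
  return a.localeCompare(b, "en", { sensitivity }) === 0;
}

const plain = "resume";
const upper = "RESUME";
const accented = "r\u00E9sum\u00E9"; // "resume" with acute accents on both e's

const results = {
  // "base" ignores both case and accents.
  baseCase: sameText(plain, upper, "base"),         // true
  baseAccent: sameText(plain, accented, "base"),    // true
  // "accent" ignores case but not accents.
  accentCase: sameText(plain, upper, "accent"),     // true
  accentAccent: sameText(plain, accented, "accent"),// false
  // "case" ignores accents but not case.
  caseCase: sameText(plain, upper, "case"),         // false
  caseAccent: sameText(plain, accented, "case"),    // true
  // Plain lowercasing handles case only and leaves accents alone.
  lowerCase: plain.toLowerCase() === upper.toLowerCase(),        // true
  lowerAccent: plain.toLowerCase() === accented.toLowerCase(),   // false
};

console.log(results);
```

So for a case-insensitive comparison that still respects diacritics, either `sensitivity: "accent"` or lowercasing both sides would do.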
| cproctor wrote:
| > If you're not sure [which equality operator] to use, prefer
| strict equality using ====.
|
| Super-strict!
| davethedevguy wrote:
| Ha! Thank you for noticing that. I've fixed it :)
| mmabbq wrote:
| I think there's a minor typo in the trimming examples:
|
|     " Trim Me ".trim() // "Trim"
|
| should be
|
|     " Trim Me ".trim() // "Trim Me"
| davethedevguy wrote:
| Thank you! I've fixed it.
| fabiospampinato wrote:
| I'll add something: if a string [^1] contains essentially only
| ASCII [^2] characters, V8 will use 1 byte per character; if that
| string contains _any_ character other than ASCII characters,
| it will use 2 bytes for each character in the string. Put
| differently, storing strings as separate lines, so that only the
| lines containing such a character pay the 2-byte cost, may save
| you up to 50% of memory usage, depending on your use case.
|
| [^1]: It actually depends on how that string was made, if
| internally it still references the parent string then slicing it
| up into lines won't save you any memory. I'm referring to
| "flattened" strings.
|
| [^2]: I don't remember what the exact character set is, I think
| it's not exactly ASCII but close enough.
| esprehn wrote:
| It's latin1. The same is true of DOM strings in Chromium, like
| attributes, blocks of text, and inline scripts.
|
| Webkit and the JDK implement the same string optimization,
| while .NET unfortunately doesn't:
| https://github.com/dotnet/runtime/issues/6612
| davethedevguy wrote:
| I didn't know that! I'll add something to the post.
|
| Thank you!
| faitswulff wrote:
| Regarding [^2], is it code page 437 by any chance?
| https://en.m.wikipedia.org/wiki/Code_page_437
| esprehn wrote:
| It's Latin1 which I think is code page 850:
| https://en.m.wikipedia.org/wiki/Code_page_850
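
The engine's internal representation is not observable from script, but whether a string is even eligible for the one-byte form described above can be approximated. This is only a sketch under the assumption, taken from the comments, that the cutoff is Latin-1 (code units up to 0xFF); `fitsInLatin1` is a made-up helper.

```javascript
// Hypothetical helper: true if every UTF-16 code unit is <= 0xFF,
// i.e. the string could in principle be stored with 1 byte per character.
function fitsInLatin1(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0xff) {
      return false;
    }
  }
  return true;
}

const lines = [
  "plain ASCII line",      // eligible
  "caf\u00E9 au lait",     // e-acute is U+00E9, still within Latin-1
  "price: \u20AC5",        // euro sign is U+20AC, outside Latin-1
];

// Per-line check: a single wide character only affects its own line.
const eligibility = lines.map(fitsInLatin1);

// The joined text is ineligible as a whole because of the one euro sign.
const joinedEligible = fitsInLatin1(lines.join("\n"));

console.log(eligibility, joinedEligible);
```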
| timyim wrote:
| Great site. Great idea. Keep it up and add more!
| davethedevguy wrote:
| Thank you!
___________________________________________________________________
(page generated 2021-04-13 23:02 UTC)