hngopher.com

       [HN Gopher] The complete guide to working with strings in modern...
       ___________________________________________________________________
        
       The complete guide to working with strings in modern JavaScript
        
       Author : davethedevguy
       Score  : 54 points
       Date   : 2021-04-13 08:16 UTC (14 hours ago)
        
        
       | mgerullis wrote:
       | > let website = new String("BaseClass")
       | 
       | > website.rating = "great!"
       | 
       | Ok that's new to me. Can anyone point out where this could be
       | useful, if anywhere?
        
         | greggturkington wrote:
         | Primarily for adding properties, and making objects that can be
         | inherited from (unlike primitives). This is a great answer with
         | even more cases:
         | 
         | https://stackoverflow.com/a/39113022/110241
        
           | mgerullis wrote:
           | > Primarily for adding properties
           | 
           | I personally would never expect a property on something that
           | is treated as a string. That's what POJOs are for, but that
           | is just opinion, I guess. Still weird they let you do this.
           | 
           | > and making objects that can be inherited from (unlike
           | primitives).
           | 
           | I mean the SO answer gives an example of how to do it, but I
           | would still love to see some place where this is _needed_ or
           | even _pragmatic_ to any degree.
        
         | Ciantic wrote:
         | The only useful thing I can come up with is monkey patching
         | without changing the function signature, e.g. smuggling
         | additional data to a function which expects a string-like thing.
         | 
         | But I think the whole `new String` is problematic, because it
         | breaks the strict comparison:
         | 
         |     let a = "Foo";
         |     let b = new String("Foo");
         |     a === "Foo"   // true
         |     b === "Foo"   // false
        
           | mgerullis wrote:
           | > The only useful thing I can come up with is monkey patching
           | without changing the function signature, e.g. smuggling
           | additional data to a function which expects a string-like
           | thing.
           | 
           | Yeah this sounds terrifying to me, if I were to debug
           | something like this.
           | 
           | Also the fact they identify as objects:
           | typeof new String("Foo") === 'object';
           | 
           | Case closed, I will never use it :D
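
The behaviours raised in this subthread can be checked with a short sketch. The variable names are illustrative only; the `rating` property mirrors the quoted example from the article.

```javascript
// A primitive string and a String wrapper object holding the same text.
const primitive = "Foo";
const wrapped = new String("Foo");

// Wrapper objects are ordinary objects, so they accept extra properties.
wrapped.rating = "great!";

const results = {
  typeofPrimitive: typeof primitive,            // "string"
  typeofWrapped: typeof wrapped,                // "object"
  strictEqual: wrapped === "Foo",               // false: an object is never === a primitive
  looseEqual: wrapped == "Foo",                 // true: the wrapper is converted to a primitive first
  unwrappedEqual: wrapped.valueOf() === "Foo",  // true: valueOf() returns the primitive
  rating: wrapped.rating,                       // "great!"
};

console.log(results);
```

Calling `valueOf()` (or `String(wrapped)`) recovers a primitive, which is one way to guard a function that might receive a wrapper object.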
        
       | dwheeler wrote:
       | This guide omits an IMPORTANT issue: handling characters outside
       | the Basic Multilingual Plane (BMP). JavaScript, like some other
       | languages, suffers from the UCS-2 curse - that is, it assumes
       | that all characters fit inside 16 bits, even though that is no
       | longer true.
       | 
       | For example, the cited text says: "You can return the UTF-16 code
       | for a character in a string using charCodeAt()...".
       | 
       | Not true. This only works if the UTF-16 code fits in 16 bits; if
       | it's more than 16 bits, charCodeAt will only return a _part_ of
       | the character.
       | 
       | There are lots of discussions about this, here's one:
       | https://stackoverflow.com/questions/3744721/javascript-strin...
       | 
       | JavaScript _can_ handle characters outside the BMP, but you
       | sometimes have to be aware of the problem & carefully code around
       | it when such characters are possible.
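
As a minimal sketch of the point above: for a character outside the BMP, `charCodeAt()` returns one 16-bit code unit (half of a surrogate pair), while `codePointAt()` returns the full code point. The sample character U+1F4A9 is chosen here purely for illustration.

```javascript
// U+1F4A9 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const pile = "\u{1F4A9}";

const unitCount = pile.length;        // 2 code units for one code point
const highUnit = pile.charCodeAt(0);  // 0xD83D, the high surrogate only
const lowUnit = pile.charCodeAt(1);   // 0xDCA9, the low surrogate only
const codePoint = pile.codePointAt(0); // 0x1F4A9, the whole code point

// String.fromCodePoint is the inverse of codePointAt for a full code point.
const roundTrip = String.fromCodePoint(codePoint) === pile; // true

console.log(unitCount, highUnit.toString(16), lowUnit.toString(16), codePoint.toString(16), roundTrip);
```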
        
         | crazygringo wrote:
         | Exactly, and emoji are outside the BMP, so it's not exactly an
         | edge case, but the norm where two code _units_ (UTF-16 double-
         | bytes) are used to make one code _point_ (Unicode character)
         | [1].
         | 
         | And it gets even worse, when you consider that for many
         | purposes you're not even interested in code _points_ but in
         | _graphemes_ which can be sequences of code points -- e.g. a
         | single visible emoji might actually be a sequence of 5 code
         | _points_, represented by 8 UTF-16 code _units_, taking up 16
         | _bytes_ [2]. Similarly a single accented character will often
         | be two code points (letter plus combining diacritic).
         | 
         | If you want to split a string by graphemes -- e.g. to count its
         | visible length, or delete its last visible character -- you can
         | either use a library for it [3], or the relatively new
         | Intl.Segmenter [4] which is in Chrome and Safari, but hasn't
         | made it to Firefox [5].
         | 
         | Kind of amazing it's 2021 and you _still_ can't calculate the
         | number of visible characters (graphemes) in a string using
         | native functions across all modern browsers.
         | 
         | [1] https://blog.jonnew.com/posts/poo-dot-length-equals-two
         | 
         | [2] https://www.contentful.com/blog/2016/12/06/unicode-
         | javascrip...
         | 
         | [3] https://github.com/orling/grapheme-splitter
         | 
         | [4] https://github.com/tc39/proposal-intl-segmenter
         | 
         | [5] https://bugzilla.mozilla.org/show_bug.cgi?id=1423593
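
The unit/point/grapheme distinction above can be sketched as follows. The sample emoji (man, zero-width joiner, woman, zero-width joiner, girl) is an assumption picked to match the 5 / 8 / 16 figures in the comment, and `countGraphemes` is a hypothetical helper that returns `null` where `Intl.Segmenter` is unavailable.

```javascript
// One visible emoji built from three emoji joined by two zero-width joiners.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

const codeUnits = family.length;        // 8 UTF-16 code units
const codePoints = [...family].length;  // 5 code points (string iteration is per code point)
const utf16Bytes = codeUnits * 2;       // 16 bytes when stored as UTF-16

// Hypothetical helper: counts graphemes with Intl.Segmenter when the
// runtime provides it, and returns null otherwise so the caller can
// fall back to a library.
function countGraphemes(text) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return null;
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

const graphemes = countGraphemes(family);
console.log({ codeUnits, codePoints, utf16Bytes, graphemes });
```

Where the segmenter exists, the grapheme count depends on the Unicode data shipped with the runtime, so it is best treated as a display-oriented measure rather than a stable identifier.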
        
           | davethedevguy wrote:
           | Thank you for these comments. I didn't know about this at
           | all!
           | 
           | I'll read up on it until I understand it, and then add
           | something to the article that covers it.
        
             | crazygringo wrote:
             | This might also help you:
             | 
             | https://stackoverflow.com/questions/4547609/how-to-get-
             | chara...
             | 
             | e.g. use Array.from() to at least process code points
             | rather than code units, though that's still not graphemes.
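
A small sketch of the `Array.from()` suggestion, using a made-up sample string: it iterates by code point, so surrogate pairs stay intact, but a letter followed by a combining accent still comes out as two entries.

```javascript
// "a", a non-BMP emoji (U+1F600), then "e" followed by a combining acute accent.
const sample = "a\u{1F600}e\u0301";

// split("") works on code units and tears the surrogate pair apart.
const byCodeUnit = sample.split("");

// Array.from uses the string iterator, which yields whole code points.
const byCodePoint = Array.from(sample);

const unitCount = byCodeUnit.length;                  // 5
const pointCount = byCodePoint.length;                // 4
const emojiIntact = byCodePoint[1] === "\u{1F600}";   // true

// Still not graphemes: "e" and the combining accent are separate entries.
const accentSeparate = byCodePoint[2] === "e" && byCodePoint[3] === "\u0301"; // true

// NFC normalization composes this particular pair into a single code point.
const composedCount = Array.from(sample.normalize("NFC")).length; // 3

console.log({ unitCount, pointCount, emojiIntact, accentSeparate, composedCount });
```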
        
           | dwheeler wrote:
           | You're right about the graphemes. If you need it, you need
           | it, but I recommend writing code that does _not_ need to
           | count graphemes _if_ you can avoid it.
           | 
           | In many cases strings are best considered units that can be
           | concatenated at will, but it's best if you avoid splitting
           | them, and if you _must_ split them, generally only split them
           | on ASCII character boundaries. Don't consider "lengths" as
           | something that has a meaning to humans (it doesn't), and
           | don't assume that a "character" is a single JavaScript
           | character (it isn't). If you normally just work with strings
           | as opaque sequences of "characters" that can be later
           | displayed, you can avoid many complications (though obviously
           | NOT all of them).
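
To illustrate the advice above with an invented record: splitting on an ASCII delimiter never lands inside a surrogate pair, whereas slicing at an arbitrary index can.

```javascript
// A comma-separated record whose fields contain non-ASCII text.
const record = "caf\u00E9,\u{1F468}\u200D\u{1F469}\u200D\u{1F467},na\u00EFve";

// Splitting on an ASCII character is safe: an ASCII code unit never
// appears inside a UTF-16 surrogate pair.
const fields = record.split(",");
const rejoined = fields.join(",");

// Slicing by index is not safe: this cuts U+1F600 in half and leaves
// a lone high surrogate behind.
const emoji = "\u{1F600}";
const broken = emoji.slice(0, 1);
const brokenUnit = broken.charCodeAt(0); // 0xD83D

console.log(fields.length, rejoined === record, brokenUnit.toString(16));
```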
        
       | rav wrote:
       |     // Correct: returns '1'
       |     'Resume'.localeCompare('RESUME', undefined, {sensitivity: 'accent'})
       | 
       | localeCompare() returns 0 if the strings are equal and -1/+1 if
       | they're different. Since this section is about comparing two
       | strings that only differ in case and accents, I would expect to
       | see a method I could use that would consider the strings to be
       | equal. Instead, this example just shows two ways to compare
       | strings (=== and localeCompare) that both consider the strings to
       | be different.
        
         | davethedevguy wrote:
         | Thank you, you're right that's a mistake.
         | 
         | This example was supposed to use: {sensitivity: 'base'}
         | 
         | I've corrected it.
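
A hedged sketch of how the `sensitivity` option changes the outcome. The accented sample strings are an assumption (the accents do not survive in this text dump), and the locale is pinned to `"en"` here only to keep the result reproducible; the quoted example passes `undefined`.

```javascript
// Two strings that differ in both case and accents.
const accented = "R\u00E9sum\u00E9"; // "Résumé"
const plain = "RESUME";

// Hypothetical helper: compare the two samples under a given sensitivity.
function compareWith(sensitivity) {
  return accented.localeCompare(plain, "en", { sensitivity });
}

const base = compareWith("base");       // 0: case and accents are both ignored
const accent = compareWith("accent");   // non-zero: accents still count
const byCase = compareWith("case");     // non-zero: case still counts
const variant = compareWith("variant"); // non-zero: everything counts

console.log({ base, accent, byCase, variant, strict: accented === plain });
```

Only `'base'` treats the pair as equal, which matches the correction described in the reply above.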
        
           | mormegil wrote:
           | Not really. Case (in)sensitivity and accent (in)sensitivity
           | are two orthogonal things. If you want to compare two
           | strings, ignoring case differences, converting both to
           | lowercase (or uppercase) is completely fine in JavaScript (it
           | might be problematic in other languages because of Turkish
           | dotted and dotless i, but in JS, the obvious first-choice
           | toLowerCase() is locale-ignorant, you would need to use
           | toLocaleLowerCase() to be bitten by the problem and why would
           | you do that?). Obviously, the method considers "A" and "a" to
           | be different, but why wouldn't it? Those characters differ in
           | accents, not only in case.
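
The approach described above might look like the sketch below. `equalsIgnoreCase` is a hypothetical helper name, and the accented sample is an assumption since accents are lost in this dump.

```javascript
// Hypothetical helper: case-insensitive, accent-sensitive comparison
// using the locale-independent toLowerCase().
function equalsIgnoreCase(left, right) {
  return left.toLowerCase() === right.toLowerCase();
}

const sameLetters = equalsIgnoreCase("Resume", "RESUME");           // true
const accentDiffers = equalsIgnoreCase("R\u00E9sum\u00E9", "RESUME"); // false: accents are kept

// toLowerCase() ignores the locale, so "I" always becomes "i".
const plainLower = "I".toLowerCase(); // "i"

// toLocaleLowerCase("tr") applies Turkish rules where the engine supports
// them, mapping "I" to dotless "\u0131"; this is the case to watch out for.
const turkishLower = "I".toLocaleLowerCase("tr");

console.log({ sameLetters, accentDiffers, plainLower, turkishLower });
```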
        
             | davethedevguy wrote:
             | > Case (in)sensitivity and accent (in)sensitivity are two
             | orthogonal things
             | 
             | Is that definitely true?
             | 
             | As I understand it (and I admit I'm no expert), it's common
             | to omit accents in some languages when changing case.
        
             | mixmastamyk wrote:
             | I came to write this. Why are accents being disregarded
             | under the case-insensitive comparison section?
             | 
             | The proper solution is typically "case folding"; however, I
             | only know it from Python, not sure if JavaScript supports
             | it natively.
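
As far as I know, JavaScript has no built-in counterpart to Python's `str.casefold()`. The sketch below is only a rough approximation under that assumption: `roughCaseFold` is a hypothetical helper that round-trips through `toUpperCase()`, which happens to handle the German sharp s but is not the Unicode case-folding algorithm.

```javascript
// Hypothetical helper: upper-casing first expands characters such as
// "\u00DF" (sharp s) to "SS" before lower-casing. This approximates
// case folding for some inputs; it is not full Unicode case folding.
function roughCaseFold(text) {
  return text.toUpperCase().toLowerCase();
}

const street = "Stra\u00DFe"; // "Straße"
const shouted = "STRASSE";

// Plain lower-casing keeps the sharp s, so the strings do not match.
const lowerMatch = street.toLowerCase() === shouted.toLowerCase(); // false

// The round trip turns both into "strasse".
const foldedMatch = roughCaseFold(street) === roughCaseFold(shouted); // true

// Accents are left alone, so this stays a case-only comparison.
const accentsKept = roughCaseFold("R\u00E9sum\u00E9") === roughCaseFold("RESUME"); // false

console.log({ lowerMatch, foldedMatch, accentsKept });
```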
        
               | davethedevguy wrote:
               | Thanks for this.
               | 
               | I've split 'Handling diacritics' into a separate section
               | from 'Case sensitivity'.
        
       | cproctor wrote:
       | > If you're not sure [which equality operator] to use, prefer
       | strict equality using ====.
       | 
       | Super-strict!
        
         | davethedevguy wrote:
         | Ha! Thank you for noticing that. I've fixed it :)
        
       | mmabbq wrote:
       | I think there's a minor typo in the trimming examples:
       | 
       |     "  Trim Me  ".trim() // "Trim"
       | 
       | should be
       | 
       |     "  Trim Me  ".trim() // "Trim Me"
        
         | davethedevguy wrote:
         | Thank you! I've fixed it.
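
For reference, a quick check of the corrected behaviour: `trim()` removes only leading and trailing whitespace and leaves interior spaces alone. The variable names are illustrative.

```javascript
const padded = "  Trim Me  ";

const both = padded.trim();       // "Trim Me"
const start = padded.trimStart(); // "Trim Me  "
const end = padded.trimEnd();     // "  Trim Me"

console.log(JSON.stringify([both, start, end]));
```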
        
       | fabiospampinato wrote:
       | I'll add something: if a string [^1] contains essentially only
       | ASCII [^2] characters, V8 will use 1 byte per character; if that
       | string contains _any_ character other than ASCII characters,
       | then it will use 2 bytes for each character in the string. Said
       | differently, storing strings as lines may save you up to 50% of
       | memory usage, depending on your use case.
       | 
       | [^1]: It actually depends on how that string was made, if
       | internally it still references the parent string then slicing it
       | up into lines won't save you any memory. I'm referring to
       | "flattened" strings.
       | 
       | [^2]: I don't remember what the exact character set is, I think
       | it's not exactly ASCII but close enough.
        
         | esprehn wrote:
         | It's latin1. The same is true of DOM strings in Chromium, like
         | attributes, blocks of text, and inline scripts.
         | 
         | Webkit and the JDK implement the same string optimization,
         | while .NET unfortunately doesn't:
         | https://github.com/dotnet/runtime/issues/6612
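
The engine's internal representation cannot be observed from script, so the following is only a back-of-the-envelope sketch of the rule described above. `fitsInLatin1` and `estimatePayloadBytes` are hypothetical helpers, and the byte counts ignore object headers and any string-sharing the engine may do.

```javascript
// Hypothetical helper: true when every code unit is within Latin-1 (0x00-0xFF).
function fitsInLatin1(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

// Hypothetical estimate: 1 byte per code unit for Latin-1 strings,
// 2 bytes per code unit otherwise.
function estimatePayloadBytes(text) {
  return fitsInLatin1(text) ? text.length : text.length * 2;
}

const lines = ["plain ascii line", "caf\u00E9 au lait", "snowman \u2603"];

// One big string pays 2 bytes per unit because of a single wide character.
const joinedEstimate = estimatePayloadBytes(lines.join("\n"));

// Stored per line, only the line containing U+2603 pays double.
const perLineEstimate = lines.reduce((sum, line) => sum + estimatePayloadBytes(line), 0);

console.log({ joinedEstimate, perLineEstimate });
```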
        
         | davethedevguy wrote:
         | I didn't know that! I'll add something to the post.
         | 
         | Thank you!
        
         | faitswulff wrote:
         | Regarding [^2], is it code page 437 by any chance?
         | https://en.m.wikipedia.org/wiki/Code_page_437
        
           | esprehn wrote:
           | It's Latin1 which I think is code page 850:
           | https://en.m.wikipedia.org/wiki/Code_page_850
        
       | timyim wrote:
       | Great site. Great idea. Keep it up and add more!
        
         | davethedevguy wrote:
         | Thank you!
        
       ___________________________________________________________________
       (page generated 2021-04-13 23:02 UTC)