[HN Gopher] We don't need a string type (2013)
       ___________________________________________________________________
        
       We don't need a string type (2013)
        
       Author : grep_it
       Score  : 25 points
       Date   : 2021-02-09 05:59 UTC (2 days ago)
        
 (HTM) web link (mortoray.com)
 (TXT) w3m dump (mortoray.com)
        
       | [deleted]
        
       | 37ef_ced3 wrote:
       | Go's immutable UTF-8 string type is one of the nice things about
       | the language
       | 
        | A Go string is almost exactly like this C struct:
        | 
        |     struct String {
        |         uint8_t* addr;
        |         ptrdiff_t len;
        |     };
       | 
       | The language guarantees you can't modify the bytes in memory
       | range [addr, addr+len)
       | 
       | Go's garbage collection makes it simple and natural to have one
       | string alias ("point into", "overlap") part of another string.
       | This works because strings are immutable. Compare this to the
       | nightmare in C++, where substrings require copying or explicit
       | handling
       | 
       | The rune (UTF-8) iterator and other facilities make Unicode
       | handling natural in Go
       | 
       | In summary, Go's string type is a huge win
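A minimal sketch (not from the comment above) of the substring aliasing being described: slicing a Go string shares the parent's backing bytes rather than copying them.

```go
package main

import "fmt"

func main() {
	// A Go substring shares the parent string's backing bytes; no copy
	// is made, and immutability makes the aliasing safe.
	s := "hello, world"
	sub := s[7:12] // points into s's memory
	fmt.Println(sub) // world
}
```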
        
         | jrimbault wrote:
          | I'd argue Go's string type is "somewhat unusable"* since it
          | doesn't enforce the guarantees it says/implies it does. The
          | byte slice it points to is not guaranteed to be valid UTF-8.
         | 
         | * of course to a degree, let's be reasonable, it's usable in a
         | _lot_ of contexts, but I like my types to actually mean
         | something.
        
           | DougBTX wrote:
           | Go doesn't guarantee any encoding for strings, very
           | deliberately (so that, eg, they can be used to represent file
           | names).
        
             | KMag wrote:
              | Filesystem paths are not strings. Linux doesn't enforce an
              | encoding. Windows at least didn't use to enforce proper
              | use of UTF-16 surrogate pairs (see the WTF-8 encoding).
             | 
             | I think OS X does perform UTF-8 normalization, which might
             | include sanity checking and rejecting malformed UTF-8, but
             | I'm not sure.
             | 
             | A byte array (or a ref-counted singly-linked list of
             | immutable byte arrays to save space/copying) is a much
             | better representation for a file system path. That doesn't
             | have great interaction with GUIs, but there are other
             | corner cases that are often problematic for GUIs. In high
             | school, one of my friends had a habit of putting games on
             | the school library computers, and renaming them to names
             | with non-printable characters using alt+number pad. (He
             | used 129, IIRC, which isn't assigned a character in
             | CP-1252.) The Windows 95 graphical shell would convert the
             | non-printable characters to spaces for display, but when
             | the librarian tried to delete the games, it would pass the
             | display name to the kernel, which would complain that the
             | presented path didn't exist.
        
               | throwaway894345 wrote:
               | A string is a byte array for all intents and purposes. In
               | Go specifically, it's an immutable byte slice with some
               | built-in operator overloading, some of which is sugar for
               | dealing with utf-8, but there's nothing that suggests a
               | string must be encoded any particular way.
        
               | Koshkin wrote:
               | > _A string is a byte array for all intents and
               | purposes._
               | 
               | This smacks of reductionism. String as an abstract type
               | only needs to conform to a number of certain axioms and
               | support certain operations. (Thus, for example, a text
               | editor, where a string can be mutable, could choose a
               | _representation_ of this type that is different from a
               | simple byte array.)
        
               | throwaway894345 wrote:
               | Based on the context of the thread, the definition of
               | "string" used in this thread must also include the
               | properties possessed by Go strings in order for the
               | original criticism to be coherent. It seems more likely
               | (and charitable) that the criticism is incorrect rather
               | than incoherent.
               | 
               | In whatever case, Go strings have all of the relevant
               | properties for modeling file paths.
        
               | KMag wrote:
                | I'm saying that it's useful to not conflate the types for
                | sequences of Unicode codepoints and filesystem paths.
               | Using the same type for both is likely to result in code
               | with baked-in assumptions that for any path, there is a
               | standard encoding that will yield a sequence of Unicode
               | codepoints.
               | 
               | Pervasive code with this sort of type confusion in the
               | wild in Python2 is why Python3 separated bytes and
               | strings.
        
               | throwaway894345 wrote:
               | Maybe, but a decade of experience with Go suggests that
               | this isn't a significant problem (i.e., more than a
               | handful of instances).
        
               | GoblinSlayer wrote:
                | Posix thinks paths are strings. See
                | https://pubs.opengroup.org/onlinepubs/009695399/functions/op...
        
               | msla wrote:
               | "String" has multiple meanings in this context. In the
               | context of that manpage, it means "nul-terminated array
               | of char" which is the C language meaning. In the context
               | of what you're replying to, a "string" is a sequence of
               | bytes (octets) in a specific Unicode Transformation
               | Format. Those are very different things when it comes to
               | programmatic manipulation of those things.
        
               | jerf wrote:
               | It is not clear to me if you're elaborating or think
               | you're disagreeing, but that is what Go does. It is
                | generally assumed in Go that strings are UTF-8, but in
                | practice they are actually just bags of bytes.
               | Nothing really "UTF-y" will happen to them until you
               | directly call UTF functions on them, which may produce
               | new strings.
               | 
               | It's something that I don't think could work unless your
               | language is as recent as Go, and perhaps even Go 1.0 was
               | pushing it, but it is an increasingly viable answer. For
               | as thin as Go's encoding support really is in some sense,
               | it has almost never caused me any trouble. The contexts
               | where you are actively unsafe in assuming UTF-8 are
               | decreasing, and the ones that are going to survive are
               | the ones where there's some sort of explicit label, like
               | in email. (Not that those are always trustworthy either.)
        
               | KMag wrote:
               | I'm saying it's useful to have valid strings and paths as
               | separate types, but Go conflates the two types.
               | Conflating the two is likely to lead to confused usage
               | (such as programmers assuming there's a bijective mapping
               | between valid paths and valid sequences of Unicode
               | codepoints.)
               | 
               | Pervasive confused usage of this sort in the wild in
               | Python 2 was the motivation behind splitting bytes and
               | strings in Python 3.
        
           | skybrian wrote:
           | Go's standard library works with both possibly-malformed and
           | verified UTF-8 strings, which is a nice property.
           | 
           | The type system needed to explain what they actually do (take
           | one of two possible input types and return the corresponding
           | output type) would require generics, which we don't have yet.
           | 
           | An alternative would be to duplicate the code to account for
           | the different types, but we already have that for []byte
           | versus string and that's bad enough already.
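Generics did land later (Go 1.18), so the dual string/[]byte signature described above can now be sketched. This is an illustrative sketch, not standard-library code; `firstRune` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune accepts either input type via a union constraint.
// (Generics arrived in Go 1.18, after this comment was written.)
func firstRune[T string | []byte](s T) rune {
	// The conversion is valid for every type in the constraint's set.
	r, _ := utf8.DecodeRuneInString(string(s))
	return r
}

func main() {
	fmt.Println(string(firstRune("héllo")))         // h
	fmt.Println(string(firstRune([]byte("héllo")))) // h
}
```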
        
           | 37ef_ced3 wrote:
           | In Go, malformed UTF-8 encodings are expected
           | 
           | They are handled in a well-defined and graceful manner by all
           | aspects of the language, runtime, and library
        
           | throwaway894345 wrote:
           | I think of a string as an immutable byte slice. This is a
           | little confusing since the language supports utf-8 literals
           | only and it also lets you iterate over individual runes with
           | for loops, but those are just conveniences over the fact that
           | these are really just immutable byte slices. You could
           | probably make your own "UTF8" type with the invariants you
           | want (or at least someone would have to drop down into unsafe
           | to violate the invariants) but in general Go programs don't
           | typically go that far, presumably because it doesn't add much
           | value in practice, which would suggest that your "somewhat
           | unusable" claim (even with its caveat) is too strong. That
           | said, I think it would be nice if Go made it a little
           | easier/clearer to model a type that can only be created by a
           | particular constructor or some such.
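A sketch of the validated type described above, using the usual Go idiom of an unexported field plus a checking constructor (the `UTF8` name and API are hypothetical):

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// UTF8 holds a string guaranteed valid at construction time; the
// unexported field means other packages must go through NewUTF8.
type UTF8 struct{ s string }

func NewUTF8(s string) (UTF8, error) {
	if !utf8.ValidString(s) {
		return UTF8{}, errors.New("invalid UTF-8")
	}
	return UTF8{s}, nil
}

func (u UTF8) String() string { return u.s }

func main() {
	_, err := NewUTF8(string([]byte{0xff}))
	fmt.Println(err) // invalid UTF-8
}
```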
        
         | Koshkin wrote:
          | Oh well. They could've (should've?) used the layout of _bstr_t
          | instead.
        
         | foo_barrio wrote:
          | Java's substrings used to work sort of like this but were
          | changed to use copy semantics. The structure was a "char[]"
          | and an "offset" into that array. This allowed substrings to
          | share the underlying array. However, if you had a 1-char
          | substring of a 1 GB array, the underlying array was never
          | trimmed for garbage collection.
         | 
         | In the case of a 1 char substring to a 1 GB string, is Go smart
         | enough to free the rest of the array and keep only the 1 char?
        
           | KMag wrote:
           | I wonder how hard the JVM folks looked into specialized weak
           | references for solving this issue. The mark phase would treat
           | all Strings with zero offset and full length as strongly
           | referencing the byte[], and weakly otherwise. At the end of
           | each full GC, you could iterate over all of your Strings
           | (custom allocate/compact them to their own ranges of the heap
           | for faster scanning), use some heuristics and probabilistic
           | sampling to select some of the weakly reachable byte[]s for
           | size reduction. A specialized copy/compact pass over the
           | Strings could in-place replace byte[] references and fix up
           | offsets.
           | 
           | You'd probably also want to modify String.equals() to
           | internally mutate equal strings to point to the same byte[],
           | preferring smaller offsets, and when offsets are equal,
            | preferring lower addresses. This is a lightweight lazy
           | version of the background String byte[] interning done by
           | some JVMs.
        
           | 37ef_ced3 wrote:
           | In Go, as in C, there's no magic. A programmer using Go
           | thinks of a string variable as a pointer/length pair, and
           | knows what will happen. Just like with slices
           | 
           | If you keep a pointer into an allocation (in your example, a
           | small Go string pointing into a much larger Go string) the
           | allocation is preserved by the garbage collector
           | 
           | You should explicitly copy the substring out (instead of
           | aliasing the underlying string) if retaining the underlying
           | string causes you a problem
        
             | foo_barrio wrote:
             | Okay I understand. In Java, the default String class has
             | the option to "intern" a string which just maintains a list
             | of strings that can be shared.
             | 
             | The change was made because often the devs were unaware of
             | the string manipulation taking place in a third party
             | library (eg XML/JSON/HTML parsing). You'd see the memory
             | balloon, investigate and notice that String/char[]
             | instances were dominating your heap. Instead of changing
             | the entire implementation of the standard String class,
             | they changed the semantics of the "substring()" call from
             | O(1) to O(n) + memory side-effects.
        
       | giardini wrote:
       | Surprising to a Tcl programmer!8-)) b/c
       | 
       | "Everything is a String":
       | 
       | https://wiki.tcl-lang.org/page/everything+is+a+string
       | 
       | and
       | 
       | "Everything is a Symbol":
       | 
       | https://wiki.tcl-lang.org/page/Everything+is+a+Symbol
        
         | BlueTemplar wrote:
          | Looks like what Tcl means by 'string', the author names
          | 'text' ?
         | 
         | What does Tcl mean by 'character' ?
         | 
         | See for instance, the author's HTML example :
         | 
         | > Combining characters can create an accented version of that
         | symbol, <. In text this is clearly a different symbol: it's a
         | distinct grapheme cluster. The HTML parser doesn't care about
         | that. It sees code #60 followed by #807 (combining cedilla). It
         | thus sees the opening of an element. However, since it isn't
         | followed by a valid naming character most parsers just ignore
         | this element (I'm not positive that is correct to do). This is
         | not the case with an accented quote, like ". Here the parsers
         | (at least the browsers I tested), let the quote end an
         | attribute and then have a garbage character lying around.
         | 
         | https://mortoray.com/2014/03/17/strings-and-text-are-not-the...
         | 
         | EDIT: Ok, it looks like by 'character', Tcl means what the
         | author (and Unicode ?) calls a 'grapheme cluster' ?
         | 
         | https://wiki.tcl-lang.org/page/Characters%2C+glyphs%2C+code%...
         | 
         | https://mortoray.com/2016/04/28/what-is-the-length-of-a-stri...
        
       | BlueTemplar wrote:
        | Anyone else think that we missed an opportunity to make text
       | much simpler to deal with by not increasing the size of a byte
       | from 8 to 32 bits when we moved from 32-bit to 64-bit word length
       | CPUs ?
       | 
       | I mean, isn't the 7-bit ASCII text the reason why the byte length
       | was standardized to the next power of two bits ?
       | 
       | (With e-mail still supporting non-padded 7-bit ASCII until
       | recently for performance reasons.)
        
       | DougBTX wrote:
       | The date should be (2013) not (2018), as that dates it before
       | Rust 1.0 (which does have a UTF-8 string type) and before the
       | Julia 1.0 release date (which implements UTF-8 strings as arrays
       | with irregularly spaced indexes, eg, the valid indexes may be 1,
       | 2, 4, 5, if the character at 2 takes up two bytes). Both would be
       | interesting examples to compare against if this article was
       | written today.
        
         | dang wrote:
         | I've fixed the date now. Actually the date at the top of the
         | article "2013-08-13" is in a font that somehow makes it look
         | like 2018. I had to squint a couple times to make sure I was
         | reading it right! The year in the URL is easier to read.
        
       | tyingq wrote:
        | I can't speak for C++, but for C, the recurring issue is that
        | null-terminated strings have lots of handy utility routines for
        | manipulating them. Without 3rd-party libraries, plain
        | length-prefixed buffers don't. Hence things like Antirez's sds
        | library, which is by nature a compromise. I get that you can't
        | fundamentally change C now, but a buffer type with a rich
        | manipulation library would have been nice.
        
       | ncmncm wrote:
       | The article is an argument against types, in general.
       | 
       | The point that characters can be stored in other containers is
       | meaningless: the question is whether, conceptually, a specific
       | sequence of character values distinct from another sequence has
       | compile-time meaning. It does. Therefore, it needs a type.
       | 
       | Such a sequence has numerous special characteristics. In
       | particular, element at [i] often has an essential connection to
       | element at [i+1] such that swapping them could turn a valid
       | string to an invalid one. In fact, that an invalid sequence is
       | even possible is another such characteristic.
        
         | arcbyte wrote:
          | Let me respond to you again in a different way, this time
          | referencing some Unicode definitions I like
          | (https://stackoverflow.com/a/27331885).
         | 
         | I don't think we can have a meaningful conversation in terms of
         | characters so I'm going to ignore that and reference your last
         | paragraph. You seem to be arguing that string as a type has use
         | when viewing it as a collection of methods that allow access to
         | Code Points given an underlying storage of Code Units. The
         | article is arguing that unless you're writing a unicode
         | encoder/decoder, you probably don't care about manipulating
         | Code Units (except that modern languages have given you these
         | byte arrays that you reference the length of for memory
         | purposes). What you really usually care about is searching,
         | replacing, concating, and cutting collections of Code Points.
          | But languages have only given you this hodgepodge grouping of
          | Code Unit arrays and specialty methods for Code Point access, so
          | that's what you're used to dealing with, and of course you want
         | some kind of abstraction, like a string type, to deal with so
         | you don't end up with the scenario you describe where you screw
         | up a Code Unit sequence trying to manipulate a Code Point.
         | 
         | So the final point is that unless you're working with unicode
         | encoding/decoding, you really only care about Code Points. And
         | once you create a String class that only exposes Code Points,
         | you have got something equivalent to a simple array.
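In Go terms, `[]rune` is the closest thing to the code-point array described above: indexing and swapping then operate on whole code points rather than bytes, so the manipulation hazards disappear. A quick illustration:

```go
package main

import "fmt"

func main() {
	// Converting to []rune decodes the UTF-8 once; each element is a
	// full code point, so this swap cannot corrupt a multi-byte sequence.
	cps := []rune("naïve")
	cps[0], cps[1] = cps[1], cps[0]
	fmt.Println(string(cps)) // anïve
}
```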
        
         | arcbyte wrote:
          | I actually read it as an argument FOR types and against modern
          | languages' choice to make the String class a weak proxy for
          | typeless byte arrays. See all the arguments (in these HN
          | comments, no less!) for just using UTF-8 byte arrays as strings.
         | 
          | He's saying semantically there's no difference between arrays
          | and string classes except that with string classes we let you
          | do all kinds of dangerous byte manipulation that we would never
          | dream of with any other type. Moreover, most of the uses for
         | this dangerous access aren't real usages because if you're
         | manipulating strings you're almost certainly actually
         | manipulating code points. So why wouldn't you just use a code
         | point array and give yourself real type safety instead?
        
           | ncmncm wrote:
           | I did not get that at all. Anyway a code point array would
           | not serve the purpose: most possible sequences of valid code
           | points are not valid strings.
           | 
            | A variable-size array of code points is _also_ useful, just
            | as, in C++, a std::vector<char> is useful, but that doesn't
            | make it a string.
           | 
            | That C++ std::string is wrong for what we now think of as
            | strings is a whole other argument. People once hoped that
            | std::basic_string<wchar_t> or std::basic_string<char32_t>
            | might be the useful string, but they were disappointed. C++
            | does not have a useful string type at this time, but there is
            | ongoing work on one. It should appear in C++26.
        
             | AnimalMuppet wrote:
             | > most possible sequences of valid code points are not
             | valid strings.
             | 
             | Could you clarify? In what way are they not valid strings?
        
               | KMag wrote:
               | Some code points are characters. Others are operators
               | with constrained contexts in which they operate.
               | Sufficiently long random sequences of characters and
               | these context-specific operators are likely to apply the
               | operators in invalid contexts. Invalid characters mean
               | invalid strings.
               | 
               | For instance, there are code points that are effectively
               | operators that add continental European accents (umlaut,
               | accent grave, etc.) to Latin characters. (Also, there are
               | redundant code points for accented characters.) There's a
               | whole set of code points that are combinators for
               | primitive components of Han characters, etc. (Also, there
               | are redundant code points for pre-composed Han
               | characters.) One way of writing Korean syllables strictly
               | requires triplets of individual jamo components: initial
               | consonant jamo, vowel jamo, and final consonant jamo.
               | (Also, there are redundant code points for every valid
               | triple-jamo syllable in Korean.)
               | 
               | A Han character with an ancient Greek digamma in its
               | "radical" position, a poo emoji inside a box, a thousand
               | umlauts, all three French accents, a Hangul jamo vowel
               | sticking through its center, a Hebrew vowel point, and a
               | Thai tone mark is not a valid character. Any string
               | containing invalid characters is not a valid string.
        
         | GoblinSlayer wrote:
         | You can mess up any ordered sequence in this way.
        
       | shadowgovt wrote:
       | I think the author started from an assertion ("This primary
       | difference between a C++ 'string' and 'vector' is really just a
       | historical oddity that many programs don't even need anymore")
       | that highlights an error in the C++ model of strings, not in the
       | way we must think about strings.
       | 
        | Contrast NSString in Cocoa (https://developer.apple.com/documentation/foundation/nsstrin...).
        | The Cocoa string is extremely
       | opaque; it's basically an object. And under the hood, that
       | opacity allows for piles of optimization that are unsafe if the
       | developer is allowed to treat the thing as just a vector of bytes
       | or codepoints. Under the hood, Cocoa does all kinds of fanciness
       | to the memory representation of the string (automatically
       | building and cutting cords, "interning" short strings so that
       | multiple copies of the string are just pointers to the same
       | memory, caching of some transforms under the assumption that if
       | it's needed once, it's often needed again).
       | 
       | Taken this way, one can even start to talk about things like "Why
       | does 'indexing' into a string always return a character, instead
       | of, say, a word?" and other questions that are harder to get into
       | if one assumes a string is just 'vector of characters' or 'vector
       | of bytes.'
        
         | BlueTemplar wrote:
          | Today I learned that Python does interning of short strings
          | too :
         | 
         | https://news.ycombinator.com/item?id=26097732
        
       | BlueTemplar wrote:
       | The author has these followup blogposts :
       | 
       | 2013 : https://mortoray.com/2013/11/27/the-string-type-is-broken/
       | 
        | 2014 : https://mortoray.com/2014/03/17/strings-and-text-are-not-the...
        | 
        | (See also : https://thehardcorecoder.com/2014/04/15/data-text-and-string... )
        | 
        | 2016 : https://mortoray.com/2016/04/28/what-is-the-length-of-a-stri...
        
       | hollasch wrote:
        | Curious. I have come to exactly the opposite conclusion --
        | that we should drop the idea of a fixed-length character type,
       | and instead _only_ have (Unicode) string types. Actually, I'd
       | prefer something like `std::text` to finally be free of the
       | baggage of "string". Operations on text should work on logical
       | text concepts. For example, something like
       | `someText.firstCharacter()` would have a return type of `text`,
        | with logical length 1. Its _data_ length is variable, since a
       | Unicode character is variable length. So many Unicode-containing
       | string design problems arise because of the stubborn insistence
       | of having an integral character type.
       | 
       | I should be able to extract UTF-8, UTF-16 or whatever encoding I
       | want from a `text` value. Something like `c_str()` would be
       | pretty important, but the semantics would be a design problem,
       | not an encoding problem. Any Unicode-encoding string should be
       | able to encode U+0000, so you'd need to figure out how to handle
       | that from `c_str()` (perhaps a substitution ASCII character could
       | be specified to encode embedded nulls).
       | 
       | Basically, users should definitely _not_ need to understand the
       | deeper details of Unicode. They shouldn't need to understand and
       | worry about different entities such as code units, code points,
       | graphemes, and the like, though they should be able to extract
       | such encodings on demand.
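A hypothetical sketch of the `text` type proposed above, in Go for concreteness. The `Text` name and methods are invented here; "character" is simplified to mean code point, whereas a full version would return the first grapheme cluster.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Text keeps its storage opaque; character-level operations return
// Text values, never a fixed-width char.
type Text struct{ s string }

// FirstCharacter returns a Text of logical length 1.
func (t Text) FirstCharacter() Text {
	// Simplified to the first code point; a real implementation would
	// take the first grapheme cluster.
	_, size := utf8.DecodeRuneInString(t.s)
	return Text{t.s[:size]}
}

// UTF8 extracts a concrete encoding on demand.
func (t Text) UTF8() []byte { return []byte(t.s) }

func main() {
	t := Text{"héllo"}
	fmt.Printf("%s\n", t.FirstCharacter().UTF8()) // h
}
```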
        
         | lisper wrote:
         | I fully endorse the general idea here, but this:
         | 
         | > `someText.firstCharacter()` would have a return type of
         | `text`, with logical length 1
         | 
         | is a huge mistake. There are operations that make sense on
         | characters that do not make sense on texts whose length happens
         | to be 1. The most obvious of these is inquiring about the
         | numerical value of the unicode code point of a character.
         | Conflating characters and texts-of-length-1 is a mistake of the
         | same order as conflating strings and byte vectors. Python makes
         | this mistake even in version 3. As a result, a function like
         | this:
         | 
         | def f(s, n, m): return ord(s[n:m])
         | 
         | will return a value iff m is one more than n. Not good.
        
           | donaldihunter wrote:
           | Only if you ignore the rest of what the post said. First it
           | should make things easy for 'normal' tasks, then it should
           | make everything else possible.
           | 
           | > Basically, users should definitely _not_ need to understand
           | the deeper details of Unicode. They shouldn't need to
           | understand and worry about different entities such as code
           | units, code points, graphemes, and the like, though they
           | should be able to extract such encodings on demand.
        
             | lisper wrote:
             | Except that the "users" of a string type are programmers,
             | and a "normal task" for a programmer often requires things
             | like this. I'll give you an example from a project I am
             | currently working on: a spam filter. One of the things my
             | filter does is count the number of Chinese characters in a
             | string. I implement this as n1<=ord(c)<=n2 where n1 and n2
             | are integers representing the start and end of the range of
             | Unicode Chinese characters. This seems like a "normal task"
             | to me and I don't see how conflating characters and texts-
             | of-length-1 would make this any easier.
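A sketch of the spam-filter check described above, in Go rather than Python, using Unicode's Han range table instead of a hand-coded ord() range (the `countHan` helper is illustrative):

```go
package main

import (
	"fmt"
	"unicode"
)

// countHan counts the Chinese (Han script) code points in s.
func countHan(s string) int {
	n := 0
	for _, r := range s {
		if unicode.Is(unicode.Han, r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countHan("abc漢字def")) // 2
}
```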
        
           | Koshkin wrote:
           | Functors are everywhere. That's why we need monads!
        
             | lisper wrote:
             | Gnats and sledgehammers something something...
        
         | BlueTemplar wrote:
         | > Actually, I'd prefer something like `std::text` to finally be
         | free of the baggage of "string". Operations on text should work
         | on logical text concepts. For example, something like
         | `someText.firstCharacter()` would have a return type of `text`,
          | with logical length 1. Its _data_ length is variable, since a
         | Unicode character is variable length.
         | 
         | I don't see how you came to the "opposite conclusion" when the
         | author basically says the same thing ?
        
         | donaldihunter wrote:
         | This!
         | 
         | Raku introduced the concept of NFG - Normal Form Grapheme - as
         | a way to represent any Unicode string in its logical 'visual
         | character' grapheme form. Sequences of combining characters
         | that don't have a canonical single codepoint form are given a
         | synthetic codepoint so that string methods including regexes
         | can operate on grapheme characters without ever causing
         | splitting side effects.
         | 
         | Of course there are methods for manipulating at the codepoint
         | level as well.
        
         | shadowgovt wrote:
         | Essentially, different tools for different applications.
         | 
         | "A string is a vector of characters, which happen to each be
         | one byte in length" was more of an artifact of a time where
         | there happened to be representational overlap than some deep
         | truism about proper data structure. Strings intended to be
         | displayed to humans are specialized constructs, much as a
         | "button" or a "file handle" are. A buffer of unstructured bytes
         | is a separate specialized construct, suitable for tasks
         | unrelated to "displaying text to a human."
        
       | pca006132 wrote:
        | I think the problem is that, a lot of the time when we deal
        | with strings, we are thinking about ASCII strings instead of
        | other encodings like UTF-8. If we treat them as ASCII strings,
        | an array of characters makes sense, but it is not that simple
        | for other encodings.
        | 
        | One of the languages that has considered the issue is Rust. In
        | Rust, we don't really index into strings, but use iterators or
        | other methods to do the operations required.
        | https://doc.rust-lang.org/std/string/struct.String.html
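The Rust approach mentioned above can be sketched briefly (string values are UTF-8 bytes; integer indexing is disallowed, and iteration is explicit about its level):

```rust
fn main() {
    let s = "héllo"; // 'é' takes 2 bytes in UTF-8
    // s[1] would not compile: &str cannot be indexed by an integer,
    // because a byte offset may fall inside a multi-byte codepoint.
    assert_eq!(s.len(), 6);           // length in bytes
    assert_eq!(s.chars().count(), 5); // length in codepoints
    // Iterate at whichever level is actually meant:
    assert_eq!(s.chars().nth(1), Some('é'));
    assert_eq!(s.bytes().nth(0), Some(b'h'));
    // Byte-range slicing exists but panics off a char boundary:
    assert_eq!(&s[0..1], "h");
    assert!(!s.is_char_boundary(2));  // index 2 is inside 'é'
}
```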
        
         | sfvisser wrote:
         | I really don't think many programmers nowadays actually think
         | this.
        
           | bobthepanda wrote:
           | I would hazard that very few people think about what an
           | underlying String is at all.
           | 
           | String encoding is something I encountered as a problem in
           | college, but is up there with implementing a homemade red-
           | black tree in terms of "things that are asked in interviews
           | but have little to no bearing on my day-to-day."
        
             | BlueTemplar wrote:
             | Really, they don't run into string/character issues
             | regularly ? Because I do...
        
               | bobthepanda wrote:
               | I certainly run into them rarely, and if I do have an
               | issue it is usually solved by bunging it into some
               | purpose built standard or third party library and calling
               | it a day.
               | 
               | I'm sure people have jobs that deal with this, but the
               | low-level form of the problem is not something that I
               | could see one encountering in a meaningful way for
               | building a standard CRUD app or service.
        
       | dang wrote:
       | Discussed at the time:
       | https://news.ycombinator.com/item?id=6204427
        
       | BlueTemplar wrote:
       | TL;DR : Characters and Strings considered harmful.
       | 
       | And he's right, they totally are ! (Also, 'string' can mean an
       | ordered sequence of similar objects of any kind, not just
       | characters.)
       | 
       | But (as these discussions also mention) replacing them by much
       | more clearly defined concepts like byte arrays, codepoints,
       | glyphs, grapheme clusters and text fields is only the first
       | step...
       | 
       | The big question (these days) is what to do with text,
       | specifically the 'code' kind of text (either programming or
       | markup, and poor separation between 'plain' text and code keeps
       | causing security issues).
       | 
       | To start with, even code needs formatting, specifically some way
       | to signal a new line, or it will end up unreadable.
       | 
       | Then, code can't be just arbitrary Unicode text, some limits have
       | to apply, because Unicode can get verrrry 'fancy' ! (Arbitrary
       | Unicode is fine in text fields and comments embedded in code.)
       | 
       | So, I'm curious, is there any Unicode normalization specifically
       | designed for code ? (If not, why, and which is the closest one ?)
       | 
       | I'm thinking of Python (3), which has what seems to be a somewhat
       | arbitrary list of what can and what can't be used as a variable
       | name ? (And the language itself seemingly only uses ASCII, though
       | this shouldn't be a restriction for programming/markup languages
       | !)
       | 
       | Also I hear that Julia goes much further than that (with even
       | (La)TeX-like shortcuts for characters that might not be available
       | on some keyboards), what kind of 'normalization' have they
       | adopted ?
        
         | eigenspace wrote:
          | Yes, Julia really lets one get wild with Unicode. There are
          | certain classes of Unicode characters that we have marked as
          | invalid for identifiers, some that are used for infix
          | operators, and some that count as modifiers on previously
          | typed characters, which is useful for creating new infix
          | operators. E.g. one might define
          | 
          |     julia> +²(x, y) = x^2 + y^2
          |     +² (generic function with 1 method)
         | 
          | such that
          | 
          |     julia> -2 +² 3
          |     13
         | 
          | If someone doesn't know how to type this, they can just hit
          | the `?` button to open help mode in the repl and then paste
          | it:
          | 
          |     help?> +²
          |     "+²" can be typed by +\^2<tab>
          |     search: +²
          | 
          |     No documentation found.
          | 
          |     +² is a Function.
          | 
          |     # 1 method for generic function "+²":
          |     [1] +²(x, y) in Main at REPL[65]:1
          | 
          | Note how it says
          | 
          |     "+²" can be typed by +\^2<tab>
         | 
          | Generally speaking we don't have a ton of strict rules on
          | Unicode, but it's a community convention that if you have a
          | public-facing API that uses Unicode, you should provide an
          | alternative Unicode-free API. This works pretty well for us,
          | and I think it can be quite useful for some mathematical code
          | if you don't overdo it (the above example was not an example
          | of 'responsible' use).
         | 
          | I know we have a code formatter, but it doesn't do any
          | Unicode normalization. We generally just accept Unicode as a
          | first-class citizen in code. This tends to cause some
          | programmers to 'clutch their pearls' and act horrified, but
          | in practice it works well. Maybe just because we have a
          | cohesive community, though.
        
           | BlueTemplar wrote:
            | Nice ! Python allows defining operators too, but AFAIK you
            | can't use Unicode in those ? And ² (or any other
            | sub/superscript _number_ - at least some letters are fine)
            | is not allowed in identifiers either.
           | 
            | The point is to get closer to math notation though, if
            | anything x +² y is IMHO even farther away than (x + y)*2 !
           | 
            | Any way to have (x + y)² or √(x + y) to work ?
           | 
           | ----
           | 
            | The new AZERTY has a _lot_ of improvements : [?], +-, [?],
            | [?], the whole Greek alphabet, () and [] and {} next to
            | each other... but for some reason they've removed the ²
            | that the old AZERTY had ?
           | 
           | http://norme-azerty.fr/
        
             | eigenspace wrote:
              | > if anything x +² y is IMHO even farther away than (x +
              | y)*2 !
             | 
             | Yeah, it was just a random example that came to mind, not
             | to be taken seriously. Here's perhaps one example of
             | unicode being used in a way that's pleasing to some and
             | upsetting to others: https://www.reddit.com/r/programmingho
             | rror/comments/jqdi4i/y...
             | 
              | > Any way to have (x + y)² or √(x + y) to work ?
             | 
              | The sqrt one works out of the box actually, no new
              | definitions required:
              | 
              |     julia> √(1 + 3)
              |     2.0
             | 
             | The second one does not work because we specifically ban
             | identifiers from starting with superscript or subscript
             | numbers. If it was allowed, we could work some black magic
             | with juxtaposition to make it work.
             | 
              | Here's an example with the transpose of an array:
              | 
              |     julia> struct ᵀ end
              | 
              |     julia> Base.:(*)(x, ::Type{ᵀ}) = transpose(x)
              | 
              |     julia> [1 2 3 4] * ᵀ
              |     4×1 transpose(::Matrix{Int64}) with eltype Int64:
              |      1
              |      2
              |      3
              |      4
        
       | irogers wrote:
        | String should be an interface/protocol. When I log a message, I
        | want to pass a string. If I have to append large strings for a
        | log message I don't want to run out of memory; I should be able
        | to pass a rope/cord [1]. We've known how to abstract this
        | forever and should work to optimize our compilers/runtimes
        | accordingly. I'm not aware of a language which has got this
        | right; for example, Java has the ugly CharSequence interface
        | that nobody uses. StringProtocol in Swift (can I implement it?)
        | makes you pay a character tax rather than just passing a
        | string. Rust/C++ give various non-abstracted types.
       | 
       | [1] https://en.wikipedia.org/wiki/Rope_(data_structure)
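One way to sketch the "string as an interface" idea (a hypothetical example, not the API of any language named above): in Rust, a function can accept any `Display` value, so a rope can be streamed into a log leaf-by-leaf without ever being flattened into one contiguous buffer:

```rust
use std::fmt;

// Hypothetical sketch: treat "string" as an interface (Display) rather
// than a concrete type. A rope-like structure is written out piecewise.
enum Rope {
    Leaf(&'static str),
    Concat(Box<Rope>, Box<Rope>),
}

impl fmt::Display for Rope {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Rope::Leaf(s) => f.write_str(s),
            // Recurse into both halves; nothing is ever concatenated
            // into a single allocation.
            Rope::Concat(a, b) => write!(f, "{}{}", a, b),
        }
    }
}

fn log(msg: impl fmt::Display) {
    println!("LOG: {}", msg);
}

fn main() {
    log("plain &str");
    log(String::from("owned String"));
    let rope = Rope::Concat(
        Box::new(Rope::Leaf("hello, ")),
        Box::new(Rope::Leaf("world")),
    );
    log(rope); // streamed leaf-by-leaf; no single big buffer built
}
```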
        
         | 60secz wrote:
        | Can't agree more. Java in particular suffers greatly from
        | Object's toString with a weak contract and no global String
        | interface. If String were an interface instead of an
        | implementation, then any method signature could accept multiple
        | implementations. This allows for really effective type aliases
        | which even support strong typing, so if you have a signature
        | with multiple String values you can use the strong types to
        | ensure you don't transpose arguments.
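The strong-typing point above can be sketched in Rust with newtype wrappers (the type and field names here are made up for illustration):

```rust
// Hypothetical sketch: distinct wrapper types around strings make
// transposed arguments a compile-time error.
struct UserId(String);
struct Email(String);

fn notify(user: &UserId, email: &Email) -> String {
    format!("notify {} at {}", user.0, email.0)
}

fn main() {
    let user = UserId("u123".to_string());
    let email = Email("a@b.example".to_string());
    let msg = notify(&user, &email);
    // notify(&email, &user) would not compile: the types differ
    // even though both just wrap a String.
    assert_eq!(msg, "notify u123 at a@b.example");
}
```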
        
         | pdimitar wrote:
        | Erlang/Elixir's iolists, which are heavily utilized in
        | Phoenix's templating engine, are a rope-like structure and are
        | extremely efficient (for a dynamic language). Phoenix's
        | templating is very fast.
        
       ___________________________________________________________________
       (page generated 2021-02-11 23:02 UTC)