[HN Gopher] We don't need a string type (2013)
___________________________________________________________________
We don't need a string type (2013)
Author : grep_it
Score : 25 points
Date : 2021-02-09 05:59 UTC (2 days ago)
(HTM) web link (mortoray.com)
(TXT) w3m dump (mortoray.com)
| [deleted]
| 37ef_ced3 wrote:
| Go's immutable UTF-8 string type is one of the nice things about
| the language
|
| A Go string is almost exactly like this C struct:
|     struct String { uint8_t *addr; ptrdiff_t len; };
|
| The language guarantees you can't modify the bytes in memory
| range [addr, addr+len)
|
| Go's garbage collection makes it simple and natural to have one
| string alias ("point into", "overlap") part of another string.
| This works because strings are immutable. Compare this to the
| nightmare in C++, where substrings require copying or explicit
| handling
|
| The rune (UTF-8) iterator and other facilities make Unicode
| handling natural in Go
|
| In summary, Go's string type is a huge win
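The aliasing behavior described above can be sketched in a few lines (the helper name is mine, for illustration only):

```go
package main

import "fmt"

// suffixFrom returns a view of s starting at byte i: a new string
// header pointing into the same backing bytes, with no copy made.
func suffixFrom(s string, i int) string {
	return s[i:]
}

func main() {
	s := "hello, world"
	sub := suffixFrom(s, 7) // "world" aliases the tail of s
	fmt.Println(sub)

	// Immutability makes the sharing safe: s[0] = 'H' would not
	// compile, so neither view can change the shared bytes.
}
```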
| jrimbault wrote:
| I'd argue Go's string type is "somewhat unusable"* since it
| doesn't enforce the guarantees it says/implies it does: the
| byte slice it points to is not guaranteed to be valid UTF-8.
|
| * of course to a degree, let's be reasonable, it's usable in a
| _lot_ of contexts, but I like my types to actually mean
| something.
| DougBTX wrote:
| Go doesn't guarantee any encoding for strings, very
| deliberately (so that, eg, they can be used to represent file
| names).
| KMag wrote:
| Filesystem paths are not strings. Linux doesn't enforce an
| encoding. Windows at least didn't use to enforce proper
| use of surrogate UTF-16 pairs (see the WTF-8 encoding).
|
| I think OS X does perform UTF-8 normalization, which might
| include sanity checking and rejecting malformed UTF-8, but
| I'm not sure.
|
| A byte array (or a ref-counted singly-linked list of
| immutable byte arrays to save space/copying) is a much
| better representation for a file system path. That doesn't
| have great interaction with GUIs, but there are other
| corner cases that are often problematic for GUIs. In high
| school, one of my friends had a habit of putting games on
| the school library computers, and renaming them to names
| with non-printable characters using alt+number pad. (He
| used 129, IIRC, which isn't assigned a character in
| CP-1252.) The Windows 95 graphical shell would convert the
| non-printable characters to spaces for display, but when
| the librarian tried to delete the games, it would pass the
| display name to the kernel, which would complain that the
| presented path didn't exist.
| throwaway894345 wrote:
| A string is a byte array for all intents and purposes. In
| Go specifically, it's an immutable byte slice with some
| built-in operator overloading, some of which is sugar for
| dealing with utf-8, but there's nothing that suggests a
| string must be encoded any particular way.
| Koshkin wrote:
| > _A string is a byte array for all intents and
| purposes._
|
| This smacks of reductionism. String as an abstract type
| only needs to conform to a number of certain axioms and
| support certain operations. (Thus, for example, a text
| editor, where a string can be mutable, could choose a
| _representation_ of this type that is different from a
| simple byte array.)
| throwaway894345 wrote:
| Based on the context of the thread, the definition of
| "string" used in this thread must also include the
| properties possessed by Go strings in order for the
| original criticism to be coherent. It seems more likely
| (and charitable) that the criticism is incorrect rather
| than incoherent.
|
| In whatever case, Go strings have all of the relevant
| properties for modeling file paths.
| KMag wrote:
| I'm saying that it's useful not to conflate the types for
| sequences of Unicode codepoints and filesystem paths.
| Using the same type for both is likely to result in code
| with baked-in assumptions that for any path, there is a
| standard encoding that will yield a sequence of Unicode
| codepoints.
|
| Pervasive code with this sort of type confusion in the
| wild in Python2 is why Python3 separated bytes and
| strings.
| throwaway894345 wrote:
| Maybe, but a decade of experience with Go suggests that
| this isn't a significant problem (i.e., more than a
| handful of instances).
| GoblinSlayer wrote:
| POSIX thinks paths are strings. See
| https://pubs.opengroup.org/onlinepubs/009695399/functions/op...
| msla wrote:
| "String" has multiple meanings in this context. In the
| context of that manpage, it means "nul-terminated array
| of char" which is the C language meaning. In the context
| of what you're replying to, a "string" is a sequence of
| bytes (octets) in a specific Unicode Transformation
| Format. Those are very different things when it comes to
| programmatic manipulation of those things.
| jerf wrote:
| It is not clear to me whether you're elaborating or think
| you're disagreeing, but that is what Go does. It is
| generally assumed in Go that strings are UTF-8, but in
| practice they are really just bags of bytes. Nothing
| "UTF-y" will happen to them until you directly call UTF
| functions on them, which may produce new strings.
|
| It's something that I don't think could work unless your
| language is as recent as Go, and perhaps even Go 1.0 was
| pushing it, but it is an increasingly viable answer. For
| as thin as Go's encoding support really is in some sense,
| it has almost never caused me any trouble. The contexts
| where you are actively unsafe in assuming UTF-8 are
| decreasing, and the ones that are going to survive are
| the ones where there's some sort of explicit label, like
| in email. (Not that those are always trustworthy either.)
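The "bags of bytes until you ask" behavior can be seen directly (a small sketch; `runeLen` is my own helper name):

```go
package main

import "fmt"

// runeLen counts decoded runes. Only a range loop (or the
// unicode/utf8 package) actually decodes UTF-8; len and indexing
// see raw bytes.
func runeLen(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	s := "héllo"
	fmt.Println(len(s), runeLen(s)) // 6 5: 'é' occupies two bytes
}
```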
| KMag wrote:
| I'm saying it's useful to have valid strings and paths as
| separate types, but Go conflates the two types.
| Conflating the two is likely to lead to confused usage
| (such as programmers assuming there's a bijective mapping
| between valid paths and valid sequences of Unicode
| codepoints.)
|
| Pervasive confused usage of this sort in the wild in
| Python 2 was the motivation behind splitting bytes and
| strings in Python 3.
| skybrian wrote:
| Go's standard library works with both possibly-malformed and
| verified UTF-8 strings, which is a nice property.
|
| The type system needed to explain what they actually do (take
| one of two possible input types and return the corresponding
| output type) would require generics, which we don't have yet.
|
| An alternative would be to duplicate the code to account for
| the different types, but we already have that for []byte
| versus string and that's bad enough already.
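Go has since gained generics (in Go 1.18, after this comment was written); a sketch of the kind of dual-representation signature being described, with hypothetical names:

```go
package main

import "fmt"

// byteSeq mirrors the two representations the comment mentions:
// accept either type and operate on it without duplicated code.
type byteSeq interface{ ~string | ~[]byte }

// isASCII works on string and []byte alike; len and byte indexing
// are defined for both members of the constraint's type set.
func isASCII[T byteSeq](s T) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isASCII("plain"))         // true
	fmt.Println(isASCII([]byte("héllo"))) // false
}
```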
| 37ef_ced3 wrote:
| In Go, malformed UTF-8 encodings are expected
|
| They are handled in a well-defined and graceful manner by all
| aspects of the language, runtime, and library
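A sketch of that graceful handling (simplified: comparing against `utf8.RuneError` also matches a legitimately encoded U+FFFD, which a real decoder would distinguish by size):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// badByteOffsets reports where decoding hit malformed UTF-8: the
// range loop never fails, it yields the replacement rune instead.
func badByteOffsets(s string) []int {
	var bad []int
	for i, r := range s {
		if r == utf8.RuneError {
			bad = append(bad, i)
		}
	}
	return bad
}

func main() {
	fmt.Println(badByteOffsets("a\xffb")) // [1]
}
```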
| throwaway894345 wrote:
| I think of a string as an immutable byte slice. This is a
| little confusing since the language supports UTF-8 literals
| only and also lets you iterate over individual runes with
| for loops, but those are just conveniences over the fact
| that these really are just immutable byte slices. You could
| probably make your own "UTF8" type with the invariants you
| want (or at least someone would have to drop down into unsafe
| to violate the invariants) but in general Go programs don't
| typically go that far, presumably because it doesn't add much
| value in practice, which would suggest that your "somewhat
| unusable" claim (even with its caveat) is too strong. That
| said, I think it would be nice if Go made it a little
| easier/clearer to model a type that can only be created by a
| particular constructor or some such.
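A minimal sketch of such a validated wrapper type (all names here are hypothetical, not a standard API):

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// UTF8 holds its string in an unexported field, so (short of
// unsafe) it can only be created via the validating constructor.
type UTF8 struct{ s string }

// NewUTF8 rejects byte sequences that are not well-formed UTF-8.
func NewUTF8(s string) (UTF8, error) {
	if !utf8.ValidString(s) {
		return UTF8{}, errors.New("not valid UTF-8")
	}
	return UTF8{s}, nil
}

func (u UTF8) String() string { return u.s }

func main() {
	if _, err := NewUTF8("\xff"); err != nil {
		fmt.Println("rejected:", err)
	}
	u, _ := NewUTF8("héllo")
	fmt.Println(u) // héllo
}
```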
| Koshkin wrote:
| Oh well. They could've (should've?) used the layout of _bstr_t
| instead.
| foo_barrio wrote:
| Java's sub-strings used to work sort of like this, but were
| changed to use copy semantics. The structure was a "char[]"
| plus an "offset" into that array, which allowed sub-strings
| to share the underlying array. However, if you had a 1-char
| sub-string of a 1 GB array, the underlying array was never
| trimmed for garbage collection.
|
| In the case of a 1 char substring to a 1 GB string, is Go smart
| enough to free the rest of the array and keep only the 1 char?
| KMag wrote:
| I wonder how hard the JVM folks looked into specialized weak
| references for solving this issue. The mark phase would treat
| all Strings with zero offset and full length as strongly
| referencing the byte[], and weakly otherwise. At the end of
| each full GC, you could iterate over all of your Strings
| (custom allocate/compact them to their own ranges of the heap
| for faster scanning), use some heuristics and probabilistic
| sampling to select some of the weakly reachable byte[]s for
| size reduction. A specialized copy/compact pass over the
| Strings could in-place replace byte[] references and fix up
| offsets.
|
| You'd probably also want to modify String.equals() to
| internally mutate equal strings to point to the same byte[],
| preferring smaller offsets, and when offsets are equal,
| preferring lower addresses. This is a light weight lazy
| version of the background String byte[] interning done by
| some JVMs.
| 37ef_ced3 wrote:
| In Go, as in C, there's no magic. A programmer using Go
| thinks of a string variable as a pointer/length pair, and
| knows what will happen. Just like with slices
|
| If you keep a pointer into an allocation (in your example, a
| small Go string pointing into a much larger Go string) the
| allocation is preserved by the garbage collector
|
| You should explicitly copy the substring out (instead of
| aliasing the underlying string) if retaining the underlying
| string causes you a problem
| foo_barrio wrote:
| Okay I understand. In Java, the default String class has
| the option to "intern" a string which just maintains a list
| of strings that can be shared.
|
| The change was made because often the devs were unaware of
| the string manipulation taking place in a third party
| library (eg XML/JSON/HTML parsing). You'd see the memory
| balloon, investigate and notice that String/char[]
| instances were dominating your heap. Instead of changing
| the entire implementation of the standard String class,
| they changed the semantics of the "substring()" call from
| O(1) to O(n) + memory side-effects.
| giardini wrote:
| Surprising to a Tcl programmer!8-)) b/c
|
| "Everything is a String":
|
| https://wiki.tcl-lang.org/page/everything+is+a+string
|
| and
|
| "Everything is a Symbol":
|
| https://wiki.tcl-lang.org/page/Everything+is+a+Symbol
| BlueTemplar wrote:
| Looks like what Tcl means by 'string' is what the author names
| 'text' ?
|
| What does Tcl mean by 'character' ?
|
| See for instance, the author's HTML example :
|
| > Combining characters can create an accented version of that
| symbol, <. In text this is clearly a different symbol: it's a
| distinct grapheme cluster. The HTML parser doesn't care about
| that. It sees code #60 followed by #807 (combining cedilla). It
| thus sees the opening of an element. However, since it isn't
| followed by a valid naming character most parsers just ignore
| this element (I'm not positive that is correct to do). This is
| not the case with an accented quote, like ". Here the parsers
| (at least the browsers I tested), let the quote end an
| attribute and then have a garbage character lying around.
|
| https://mortoray.com/2014/03/17/strings-and-text-are-not-the...
|
| EDIT: Ok, it looks like by 'character', Tcl means what the
| author (and Unicode ?) calls a 'grapheme cluster' ?
|
| https://wiki.tcl-lang.org/page/Characters%2C+glyphs%2C+code%...
|
| https://mortoray.com/2016/04/28/what-is-the-length-of-a-stri...
| BlueTemplar wrote:
| Does anyone else think that we missed an opportunity to make
| text much simpler to deal with by not increasing the size of a
| byte from 8 to 32 bits when we moved from 32-bit to 64-bit word
| length CPUs ?
|
| I mean, isn't 7-bit ASCII text the reason why the byte length
| was standardized to the next power of two bits ?
|
| (With e-mail still supporting non-padded 7-bit ASCII until
| recently for performance reasons.)
| DougBTX wrote:
| The date should be (2013) not (2018), as that dates it before
| Rust 1.0 (which does have a UTF-8 string type) and before the
| Julia 1.0 release date (which implements UTF-8 strings as arrays
| with irregularly spaced indexes, eg, the valid indexes may be 1,
| 2, 4, 5, if the character at 2 takes up two bytes). Both would be
| interesting examples to compare against if this article was
| written today.
| dang wrote:
| I've fixed the date now. Actually the date at the top of the
| article "2013-08-13" is in a font that somehow makes it look
| like 2018. I had to squint a couple times to make sure I was
| reading it right! The year in the URL is easier to read.
| tyingq wrote:
| I can't speak for C++, but for C, the repeated issue is that a
| null-terminated string has lots of utility routines that are
| handy for manipulating them. Without 3rd party libraries, plain
| length-header buffers don't. Hence things like Antirez's sds
| library, which by nature, is a compromise. I get you can't
| fundamentally change C now, but a buffer type with a rich
| manipulation library would have been nice.
| ncmncm wrote:
| The article is an argument against types, in general.
|
| The point that characters can be stored in other containers is
| meaningless: the question is whether, conceptually, a specific
| sequence of character values distinct from another sequence has
| compile-time meaning. It does. Therefore, it needs a type.
|
| Such a sequence has numerous special characteristics. In
| particular, element at [i] often has an essential connection to
| element at [i+1] such that swapping them could turn a valid
| string to an invalid one. In fact, that an invalid sequence is
| even possible is another such characteristic.
| arcbyte wrote:
| Let me respond to you again in a different way, this time
| referencing some Unicode definitions I like
| (https://stackoverflow.com/a/27331885).
|
| I don't think we can have a meaningful conversation in terms of
| characters so I'm going to ignore that and reference your last
| paragraph. You seem to be arguing that string as a type has use
| when viewing it as a collection of methods that allow access to
| Code Points given an underlying storage of Code Units. The
| article is arguing that unless you're writing a unicode
| encoder/decoder, you probably don't care about manipulating
| Code Units (except that modern languages have given you these
| byte arrays that you reference the length of for memory
| purposes). What you really usually care about is searching,
| replacing, concating, and cutting collections of Code Points.
| But languages have only given you this hodge-podge grouping of
| Code Unit arrays and specialty methods for Code Point access,
| so that's what you're used to dealing with, and of course you
| want some kind of abstraction, like a string type, so you
| don't end up in the scenario you describe where you screw up a
| Code Unit sequence while trying to manipulate a Code Point.
|
| So the final point is that unless you're working with unicode
| encoding/decoding, you really only care about Code Points. And
| once you create a String class that only exposes Code Points,
| you have got something equivalent to a simple array.
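In Go terms, the "code point array" being described is just `[]rune`:

```go
package main

import "fmt"

func main() {
	// Converting to []rune gives a true code point array;
	// converting back re-encodes it as UTF-8.
	s := "naïve"
	cps := []rune(s)

	fmt.Println(len(s), len(cps)) // 6 5

	cps[2] = 'i' // safe per-code-point edit, no byte surgery
	fmt.Println(string(cps)) // naive
}
```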
| arcbyte wrote:
| I actually read it as an argument FOR types and against modern
| languages' choice to make the String class a weak proxy for
| typeless byte arrays. See all the arguments (in these HN
| comments no less!) for just using UTF-8 byte arrays as strings.
|
| He's saying semantically there's no difference between arrays
| and string classes except that with string classes we let you
| do all kinds of dangerous byte manipulation that we would never
| dream of with any other type. Moreover, most of the uses for
| this dangerous access aren't real usages because if you're
| manipulating strings you're almost certainly actually
| manipulating code points. So why wouldn't you just use a code
| point array and give yourself real type safety instead?
| ncmncm wrote:
| I did not get that at all. Anyway a code point array would
| not serve the purpose: most possible sequences of valid code
| points are not valid strings.
|
| A variable-size array of code points is _also_ useful, just
| as, in C++, a std::vector <char> is useful, but that doesn't
| make it a string.
|
| That C++ std::string is wrong for what we now think of as
| strings is a whole other argument. People once hoped that
| std::basic_string<wchar_t> or std::basic_string<char32_t> might
| be the useful string, but they were disappointed. C++ does not
| have a useful string type at this time, but there is ongoing
| work on one. It should appear in C++26.
| AnimalMuppet wrote:
| > most possible sequences of valid code points are not
| valid strings.
|
| Could you clarify? In what way are they not valid strings?
| KMag wrote:
| Some code points are characters. Others are operators
| with constrained contexts in which they operate.
| Sufficiently long random sequences of characters and
| these context-specific operators are likely to apply the
| operators in invalid contexts. Invalid characters mean
| invalid strings.
|
| For instance, there are code points that are effectively
| operators that add continental European accents (umlaut,
| accent grave, etc.) to Latin characters. (Also, there are
| redundant code points for accented characters.) There's a
| whole set of code points that are combinators for
| primitive components of Han characters, etc. (Also, there
| are redundant code points for pre-composed Han
| characters.) One way of writing Korean syllables strictly
| requires triplets of individual jamo components: initial
| consonant jamo, vowel jamo, and final consonant jamo.
| (Also, there are redundant code points for every valid
| triple-jamo syllable in Korean.)
|
| A Han character with an ancient Greek digamma in its
| "radical" position, a poo emoji inside a box, a thousand
| umlauts, all three French accents, a Hangul jamo vowel
| sticking through its center, a Hebrew vowel point, and a
| Thai tone mark is not a valid character. Any string
| containing invalid characters is not a valid string.
| GoblinSlayer wrote:
| You can mess up any ordered sequence in this way.
| shadowgovt wrote:
| I think the author started from an assertion ("This primary
| difference between a C++ 'string' and 'vector' is really just a
| historical oddity that many programs don't even need anymore")
| that highlights an error in the C++ model of strings, not in the
| way we must think about strings.
|
| Contrast NSString in Cocoa
| (https://developer.apple.com/documentation/foundation/nsstrin...).
| The Cocoa string is extremely
| opaque; it's basically an object. And under the hood, that
| opacity allows for piles of optimization that are unsafe if the
| developer is allowed to treat the thing as just a vector of bytes
| or codepoints. Under the hood, Cocoa does all kinds of fanciness
| to the memory representation of the string (automatically
| building and cutting cords, "interning" short strings so that
| multiple copies of the string are just pointers to the same
| memory, caching of some transforms under the assumption that if
| it's needed once, it's often needed again).
|
| Taken this way, one can even start to talk about things like "Why
| does 'indexing' into a string always return a character, instead
| of, say, a word?" and other questions that are harder to get into
| if one assumes a string is just 'vector of characters' or 'vector
| of bytes.'
| BlueTemplar wrote:
| Today I learned that Python does interning of short strings
| too :
|
| https://news.ycombinator.com/item?id=26097732
| BlueTemplar wrote:
| The author has these followup blogposts :
|
| 2013 : https://mortoray.com/2013/11/27/the-string-type-is-broken/
|
| 2014 : https://mortoray.com/2014/03/17/strings-and-text-are-not-
| the...
|
| (See also : https://thehardcorecoder.com/2014/04/15/data-text-
| and-string... )
|
| 2016 : https://mortoray.com/2016/04/28/what-is-the-length-of-a-
| stri...
| hollasch wrote:
| Curious. I have come to exactly the opposite conclusion --
| that we should drop the idea of a fixed-length character type,
| and instead _only_ have (Unicode) string types. Actually, I'd
| prefer something like `std::text` to finally be free of the
| baggage of "string". Operations on text should work on logical
| text concepts. For example, something like
| `someText.firstCharacter()` would have a return type of `text`,
| with logical length 1. Its _data_ length is variable, since a
| Unicode character is variable length. So many Unicode-containing
| string design problems arise because of the stubborn insistence
| on having an integral character type.
|
| I should be able to extract UTF-8, UTF-16 or whatever encoding I
| want from a `text` value. Something like `c_str()` would be
| pretty important, but the semantics would be a design problem,
| not an encoding problem. Any Unicode-encoding string should be
| able to encode U+0000, so you'd need to figure out how to handle
| that from `c_str()` (perhaps a substitution ASCII character could
| be specified to encode embedded nulls).
|
| Basically, users should definitely _not_ need to understand the
| deeper details of Unicode. They shouldn't need to understand and
| worry about different entities such as code units, code points,
| graphemes, and the like, though they should be able to extract
| such encodings on demand.
| lisper wrote:
| I fully endorse the general idea here, but this:
|
| > `someText.firstCharacter()` would have a return type of
| `text`, with logical length 1
|
| is a huge mistake. There are operations that make sense on
| characters that do not make sense on texts whose length happens
| to be 1. The most obvious of these is inquiring about the
| numerical value of the unicode code point of a character.
| Conflating characters and texts-of-length-1 is a mistake of the
| same order as conflating strings and byte vectors. Python makes
| this mistake even in version 3. As a result, a function like
| this:
|
| def f(s, n, m): return ord(s[n:m])
|
| will return a value iff m is one more than n. Not good.
| donaldihunter wrote:
| Only if you ignore the rest of what the post said. First it
| should make things easy for 'normal' tasks, then it should
| make everything else possible.
|
| > Basically, users should definitely _not_ need to understand
| the deeper details of Unicode. They shouldn't need to
| understand and worry about different entities such as code
| units, code points, graphemes, and the like, though they
| should be able to extract such encodings on demand.
| lisper wrote:
| Except that the "users" of a string type are programmers,
| and a "normal task" for a programmer often requires things
| like this. I'll give you an example from a project I am
| currently working on: a spam filter. One of the things my
| filter does is count the number of Chinese characters in a
| string. I implement this as n1<=ord(c)<=n2 where n1 and n2
| are integers representing the start and end of the range of
| Unicode Chinese characters. This seems like a "normal task"
| to me and I don't see how conflating characters and texts-
| of-length-1 would make this any easier.
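For comparison, the same counting task in Go can use the Unicode script tables rather than a hand-coded ordinal range (a sketch, not lisper's actual code):

```go
package main

import (
	"fmt"
	"unicode"
)

// countHan counts Chinese (Han) characters using the standard
// library's script table instead of explicit codepoint bounds.
func countHan(s string) int {
	n := 0
	for _, r := range s {
		if unicode.Is(unicode.Han, r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countHan("abc中文def")) // 2
}
```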
| Koshkin wrote:
| Functors are everywhere. That's why we need monads!
| lisper wrote:
| Gnats and sledgehammers something something...
| BlueTemplar wrote:
| > Actually, I'd prefer something like `std::text` to finally be
| free of the baggage of "string". Operations on text should work
| on logical text concepts. For example, something like
| `someText.firstCharacter()` would have a return type of `text`,
| with logical length 1. Its _data_ length is variable, since a
| Unicode character is variable length.
|
| I don't see how you came to the "opposite conclusion" when the
| author basically says the same thing ?
| donaldihunter wrote:
| This!
|
| Raku introduced the concept of NFG - Normal Form Grapheme - as
| a way to represent any Unicode string in its logical 'visual
| character' grapheme form. Sequences of combining characters
| that don't have a canonical single codepoint form are given a
| synthetic codepoint so that string methods including regexes
| can operate on grapheme characters without ever causing
| splitting side effects.
|
| Of course there are methods for manipulating at the codepoint
| level as well.
| shadowgovt wrote:
| Essentially, different tools for different applications.
|
| "A string is a vector of characters, which happen to each be
| one byte in length" was more of an artifact of a time where
| there happened to be representational overlap than some deep
| truism about proper data structure. Strings intended to be
| displayed to humans are specialized constructs, much as a
| "button" or a "file handle" are. A buffer of unstructured bytes
| is a separate specialized construct, suitable for tasks
| unrelated to "displaying text to a human."
| pca006132 wrote:
| I think the problem is that, a lot of the time when we deal
| with strings, we are thinking about ASCII strings instead of
| other encodings like UTF-8. If we treat them as ASCII strings,
| an array of characters makes sense, but it is not that simple
| for other encodings.
|
| One of the languages that considered the issue is Rust. In
| Rust, we don't really index into strings, but use iterators or
| other methods to do the operations required.
| https://doc.rust-lang.org/std/string/struct.String.html
| sfvisser wrote:
| I really don't think many programmers nowadays actually think
| this.
| bobthepanda wrote:
| I would hazard that very few people think about what an
| underlying String is at all.
|
| String encoding is something I encountered as a problem in
| college, but is up there with implementing a homemade red-
| black tree in terms of "things that are asked in interviews
| but have little to no bearing on my day-to-day."
| BlueTemplar wrote:
| Really, they don't run into string/character issues
| regularly ? Because I do...
| bobthepanda wrote:
| I certainly run into them rarely, and if I do have an
| issue it is usually solved by bunging it into some
| purpose built standard or third party library and calling
| it a day.
|
| I'm sure people have jobs that deal with this, but the
| low-level form of the problem is not something that I
| could see one encountering in a meaningful way for
| building a standard CRUD app or service.
| dang wrote:
| Discussed at the time:
| https://news.ycombinator.com/item?id=6204427
| BlueTemplar wrote:
| TL;DR : Characters and Strings considered harmful.
|
| And he's right, they totally are ! (Also, 'string' can mean an
| ordered sequence of similar objects of any kind, not just
| characters.)
|
| But (as these discussions also mention) replacing them by much
| more clearly defined concepts like byte arrays, codepoints,
| glyphs, grapheme clusters and text fields is only the first
| step...
|
| The big question (these days) is what to do with text,
| specifically the 'code' kind of text (either programming or
| markup, and poor separation between 'plain' text and code keeps
| causing security issues).
|
| To start with, even code needs formatting, specifically some way
| to signal a new line, or it will end up unreadable.
|
| Then, code can't be just arbitrary Unicode text, some limits have
| to apply, because Unicode can get verrrry 'fancy' ! (Arbitrary
| Unicode is fine in text fields and comments embedded in code.)
|
| So, I'm curious, is there any Unicode normalization specifically
| designed for code ? (If not, why, and which is the closest one ?)
|
| I'm thinking of Python (3), which has what seems to be a somewhat
| arbitrary list of what can and what can't be used as a variable
| name ? (And the language itself seemingly only uses ASCII, though
| this shouldn't be a restriction for programming/markup languages
| !)
|
| Also I hear that Julia goes much further than that (with even
| (La)TeX-like shortcuts for characters that might not be available
| on some keyboards), what kind of 'normalization' have they
| adopted ?
| eigenspace wrote:
| Yes, Julia really lets one get wild with Unicode. There are
| certain classes of unicode characters that we have marked as
| invalid for identifiers, some which are used for infix
| operators, and some which count as modifiers on previously
| typed characters which is useful for creating new infix
| operators, e.g. one might define
|     julia> +²(x, y) = x^2 + y^2
|     +² (generic function with 1 method)
|
| such that
|     julia> -2 +² 3
|     13
|
| If someone doesn't know how to type this, they can just hit the
| `?` button to open help mode in the repl and then paste it:
|     help?> +²
|     "+²" can be typed by +\^2<tab>
|     search: +²
|     No documentation found.
|     +² is a Function.
|     # 1 method for generic function "+²":
|     [1] +²(x, y) in Main at REPL[65]:1
|
| Note how it says "+²" can be typed by +\^2<tab>
|
| Generally speaking we don't have a ton of strict rules on
| unicode, but it's a community convention that if you have a
| public facing API that uses unicode, you should provide an
| alternative unicode-free API. This works pretty well for us,
| and I think can be quite useful for some mathematical code if
| you don't overdo it (the above example was not an example of
| 'responsible' use).
|
| I know we have a code formatter, but it doesn't do any unicode
| normalization. We generally just accept unicode as a first
| class citizen in code. This tends to cause some programmers to
| 'clutch their pearls' and act horrified, but in practice it
| works well. Maybe just because we have a cohesive community
| though
| BlueTemplar wrote:
| Nice ! Python allows you to define operators too, but AFAIK you
| can't use Unicode in those ? And ² (or any other
| sub/superscript _number_ - at least some letters are fine) is
| not allowed in identifiers either.
|
| The point is to get closer to math notation though; if
| anything, x +² y is IMHO even farther away than (x + y)*2 !
|
| Any way to have (x + y)² or √(x + y) to work ?
|
| ----
|
| The new AZERTY has a _lot_ of improvements : [?], ±, [?],
| [?], the whole Greek alphabet, () and [] and {} next to each
| other... but for some reason they've removed the ² that the
| old AZERTY had ?
|
| http://norme-azerty.fr/
| eigenspace wrote:
| > if anything x +² y is IMHO even farther away than (x + y)*2 !
|
| Yeah, it was just a random example that came to mind, not
| to be taken seriously. Here's perhaps one example of
| unicode being used in a way that's pleasing to some and
| upsetting to others: https://www.reddit.com/r/programmingho
| rror/comments/jqdi4i/y...
|
| > Any way to have (x + y)² or √(x + y) to work ?
|
| The sqrt one works out of the box actually, no new
| definitions required:
|     julia> √(1 + 3)
|     2.0
|
| The second one does not work because we specifically ban
| identifiers from starting with superscript or subscript
| numbers. If it was allowed, we could work some black magic
| with juxtaposition to make it work.
|
| Here's an example with the transpose of an array:
|     julia> struct ᵀ end
|
|     julia> Base.:(*)(x, ::Type{ᵀ}) = transpose(x)
|
|     julia> [1 2 3 4]ᵀ
|     4×1 transpose(::Matrix{Int64}) with eltype Int64:
|      1
|      2
|      3
|      4
| irogers wrote:
| String should be an interface/protocol. When I log a message, I
| want to pass a string. If I have to append large strings for a
| log message I don't want to run out of memory, I should be able
| to pass a rope/cord [1]. We've known how to abstract this for
| forever and should work to optimize our compilers/runtimes
| accordingly. I'm not aware of a language which has got this
| right, for example, Java has the ugly CharSequence interface that
| nobody uses. StringProtocol in Swift (can I implement it?) makes
| you pay a character tax rather than to just pass a string.
| Rust/C++ give various non-abstracted types.
|
| [1] https://en.wikipedia.org/wiki/Rope_(data_structure)
| 60secz wrote:
| Can't agree more. Java in particular suffers greatly from
| Object toString with a weak contract and no global String
| interface. If String were an interface instead of an
| implementation then any method signature could accept multiple
| implementations. This allows for really effective type aliases
| which even support strong typing so if you have a signature
| with multiple String values you can use the strong types to
| ensure you don't transpose arguments.
| pdimitar wrote:
| Erlang/Elixir's iolists, which are heavily utilized in
| Phoenix's templating engine, are rope-like and extremely
| efficient (for a dynamic language). Phoenix's templating is
| very fast.
___________________________________________________________________
(page generated 2021-02-11 23:02 UTC)