[HN Gopher] Why Rust strings seem hard
___________________________________________________________________
Why Rust strings seem hard
Author : brundolf
Score : 108 points
Date : 2021-04-14 19:27 UTC (3 hours ago)
(HTM) web link (www.brandons.me)
(TXT) w3m dump (www.brandons.me)
| rsc wrote:
| If you can indulge a non-Rust point of view, if I'd been faced
| with this design problem, I think I would have put a "this is a
| rodata-backed string" bit into String, and then the
| representation would be {ptr, len, rodata-bit}
|
| (or squeeze a bit out of the ptr or len if the space matters).
|
| Then "abc" could have type String, and the only difference
| between let abc1: String = "abc"
|
| and let a: String = "a" let abc2:
| String = a + "bc"
|
| would be that (assuming the compiler doesn't get cute) abc1 is
| pointing at the rodata bytes for "abc" and abc2 is pointing at
| allocated bytes. (But it can tell them apart so the deallocation
| is not ambiguous.)
|
| It seems like this would have avoided a lot of the ink spilled
| over &str vs String. I know it's too late now, but was this
| considered at all, and if so what were the reasons not to adopt
| it?
|
| Thanks.
| tene wrote:
| When you want a reference that can transparently become an
| owned value when mutated, Rust has a type Cow that implements
| this generally. It's an enum that has two variants, Owned and
| Borrowed. If you mutate a Cow::Borrowed, it first copies the
| data to a new owned allocation, and replaces itself with
| Cow::Owned.
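|
| A rough sketch of that behavior, using Cow::to_mut (which is
| what actually performs the clone-on-first-write):
|
|     use std::borrow::Cow;
|
|     fn exclaim(s: &mut Cow<str>) {
|         // to_mut() copies the borrowed data into an owned
|         // String the first time it's called on a Borrowed.
|         s.to_mut().push('!');
|     }
|
|     fn main() {
|         let mut c: Cow<str> = Cow::Borrowed("hello");
|         exclaim(&mut c);
|         assert_eq!(c, "hello!");
|         assert!(matches!(c, Cow::Owned(_)));
|     }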
|
| However, the difference between String and &str has nothing to
| do with mutability, or whether the data is in a read-only page
| or not. If you have &mut str, you can mutate the values, and if
| you don't declare your String as mut, it's not mutable.
|
| The difference is that String owns its allocation, whereas &str
| is a reference to memory that someone else owns. It's exactly
| the same as Vec vs &[u8], and if you check the source, you'll
| see that String is just a wrapper around Vec<u8>:
| https://doc.rust-lang.org/src/alloc/string.rs.html#278-280
|
| The general principle is that if you own the allocation, you
| can reallocate it to change its size. If you only have a
| reference to memory that someone else owns, you can't do that.
|
| Consider for example Go's slices, which work kind of like you
| describe, where they point to the original array's memory until
| someone grows the array, at which time they might or might not
| make a new allocation. Appending to a Go slice from some inner
| function can suddenly break code that calls it, because the
| slice it's operating on suddenly points to new memory.
|
| Rust's Big Idea is to make ownership and borrowing more
| explicit. Having your default stdlib text type be ambiguous
| about whether it's owning or borrowing is both weird and makes
| things a lot more awkward to deal with.
|
| If you use Cow<str>, you'll see that its API declares that it
| borrows from some source, and can't outlive that source. That's
| fine if what it's borrowing from is static text in the binary,
| but that really constrains what you can do with a string that
| you've dynamically allocated.
|
| Just like all other data structures, having a distinction
| between an owned value and a reference to the value is very
| useful. It's easy to build a variety of shared or ambiguous
| ownership structures on top of owned values and references, but
| it's much more complicated to go the other direction.
| int_19h wrote:
| Where would abc allocate the bytes from, if a is in rodata?
| brundolf wrote:
| I hadn't heard of rodata before, but based on a quick Google, I
| think what you're describing is similar to Cow<str>. I can't
| speak to the reasons why this wasn't made the default, but I
| believe it is at least possible.
| cesarb wrote:
| It's probably the same reason why IIRC the default Rust
| String does not have a "small string optimization": it would
| add a lot of unpredictable branches. And it would be even
| worse, since unlike the "small string optimization", even
| small mutations which don't change the size would have to
| allocate when the original was read-only.
| pornel wrote:
| Another problem is that `&str` is more than just "read-
| only". It also tracks lifetime to prevent use-after-free.
|
| A universal owned-or-readonly-borrowed String would be
| unsafe without also adding reference counting or GC to keep
| the borrowed-from string alive.
| holmium wrote:
| In addition to what the others have said, Rust's `String` is
| three pointer-sized words: `(ptr, len, capacity)`, while a
| `&str` is only two: `(ptr, len)`. So a `String` has 50% more
| overhead than a `&str`.
| devit wrote:
| Cow<'static, str> is exactly what you are asking for.
|
| In general the String type is not very good, and you should use
| something appropriate to your use of strings for any string
| that is used more than a constant number of times in your
| program. That can be, for instance:
|
| - String
|
| - Box<str>
|
| - Cow<'static, str>
|
| - Cow<'a, str>
|
| - Interned strings
|
| - Rc<str> or Arc<str>
|
| - A rope data structure
|
| In fact I think putting String in the standard library might
| have been a mistake since it's almost always a suboptimal
| choice for anything except string builders in local variables.
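|
| For instance, a function can hand back either a borrowed
| literal or a freshly allocated string under one type; a rough
| sketch:
|
|     use std::borrow::Cow;
|
|     fn greeting(name: Option<&str>) -> Cow<'static, str> {
|         match name {
|             // no allocation, points at static data
|             None => Cow::Borrowed("hello, world"),
|             // allocates only when needed
|             Some(n) => Cow::Owned(format!("hello, {}", n)),
|         }
|     }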
| StreamBright wrote:
| More articles like this and I might get started with Rust again.
| Well async is still a problem but at least strings are much
| clearer now after reading this.
| eximius wrote:
| An alternative to `"string".to_owned() + "foo"` is just using a
| macro such as `concat!("string", "other string")`.
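|
| Note that concat! only takes literals and is evaluated at
| compile time, producing a &'static str:
|
|     let s: &'static str = concat!("string", "other string");
|     assert_eq!(s, "stringother string");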
| alexchamberlain wrote:
| Very interesting write up. Would be interested in a comparison to
| a COW string too.
| armSixtyFour wrote:
| This was one of the more confusing parts of Rust when I first
| started using it. I find that I don't necessarily run into it a
| lot. After a while you sort of change how you organize your code
| and you find that you're not fighting strings or the borrow
| checker very often. I'm not really sure how I got to that point
| however, it's more just practice than anything else.
| seoaeu wrote:
| Yeah, in some ways it is kind of weird how much discussion
| there is about Rust's borrow checker compared to how much time
| practitioners actually spend dealing with it. I see less about
| string handling (this post being an exception) but that is also
| basically a non-issue once you get the hang of it.
| brundolf wrote:
| Exactly. It's not _actually_ hard once it clicks, but I think
| a certain subset of newcomers spend lots of time being
| frustrated with the fact that their understanding doesn't
| seem to work, when they could just be given the key bits of
| information and be able to move forward with things making
| sense. That was the motive of this post :)
| nikisweeting wrote:
| Why can't every literal "abc" just be instantiated as a heap
| String by default? You could have a separate notation like
| &"abc" when you want a slice, similar to Python's b"abc",
| r"abc", etc. prefixes. Heap Strings seem much more useful in
| general.
| Daishiman wrote:
| System programming languages avoid allocations and unnecessary
| resource consumption. I'd say it's one of their hallmark
| characteristics.
|
| The programming convenience of higher-level languages comes at
| a very substantial cost of requiring a complex runtime, more
| memory for data structures, unpredictable performance, and
| pushing complexity where it's not visible. One philosophy
| favors visibility over resources, the other favors convenience
| of use.
| brundolf wrote:
| It would be pretty wasteful; in any read-only context you'll
| need a &str anyway, and making them all Strings would cause
| tons of unneeded allocations. Many Rust devs care a lot about
| avoiding unnecessary allocations: some people even use Rust on
| embedded systems that don't allow allocations _at all_ , so
| building allocation into a language fundamental would likely be
| a mistake.
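|
| For a sense of the difference:
|
|     // a literal is a borrowed view of bytes baked into the
|     // binary, so no allocation happens here
|     let s: &'static str = "abc";
|     // turning it into a String copies those bytes to the heap
|     let owned: String = s.to_string();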
| CJefferson wrote:
| One thing I've often wondered about Rust strings. I often hear
| that &str is 'a string slice'. But, Rust has a notation for
| slices -- &[T]. Why are strings the only thing (that I know of)
| that don't use the same slice notation as everything else?
| hansihe wrote:
| In Rust a slice `&[T]` is a fixed length sequence of `T`s laid
| out linearly in memory. Every `T` is required to be of the same
| size.
|
| Strings in Rust are (normally) represented as UTF-8. Both
| `String` and `str` represent data that is guaranteed to be
| valid UTF-8.
|
| This means that if Rust's UTF-8 strings were represented as
| normal slices, they would have to be slices of UTF-8 code
| units.
|
| Rust wants to provide a safe and correct String data type, and
| therefore, indexing a string on a byte (code-unit) level would
| be incorrect behavior.
|
| Having a custom type `String` and `str` instead of just a
| `Vec<u8>` enables you to have more correct behavior implemented
| on top of the data type that doesn't implement normal slice
| indexing and such.
|
| ---
|
| As a note, even though you probably don't want to normally, you
| can quite easily access the backing data of your string using
| `String::as_bytes`
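|
| For example, the byte view and the char view can disagree on
| length:
|
|     let s = String::from("héllo");
|     assert_eq!(s.as_bytes().len(), 6); // "é" is 2 bytes in UTF-8
|     assert_eq!(s.chars().count(), 5);  // but a single char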
| brundolf wrote:
| That's a great question. I don't have a complete answer, but I
| do know that &str has lots of string-specific functionality
| that's really helpful. The .chars() method for example gives
| you an iterator over actual unicode chars, as opposed to bytes,
| because the former can have variable byte-widths. There may be
| other reasons; I'm not sure.
| littlestymaar wrote:
| > But, Rust has a notation for slices -- &[T]
|
| &[T] is an "array-slice", even if it's called just "slice".
|
| See this example ( _[..]_ is the syntax to create a slice of
| something): https://play.rust-
| lang.org/?version=stable&mode=debug&editio...
| edflsafoiewq wrote:
| What would they be, &[u8]? That's already a thing: it's an
| arbitrary byte sequence. &str is specifically UTF-8 data.
|
| &OsStr and &Path are the same way.
| __s wrote:
| Indeed, &str has as_bytes, which returns itself as &[u8].
|
| But str is a subset of [u8]: the type's contract is that it
| must hold valid UTF-8 (it is unsafe for it to contain invalid
| UTF-8 data), hence
| https://doc.rust-lang.org/std/str/fn.from_utf8.html
| can error, offering the unsafe variant
| https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.htm...
|
| This is all very different from &[char], which would be an
| array of 4-byte characters (i.e. a UCS-4 string).
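|
| A quick illustration of that contract:
|
|     let ok = std::str::from_utf8(&[0x68, 0x69]);  // b"hi"
|     let bad = std::str::from_utf8(&[0xFF, 0xFE]); // not UTF-8
|     assert_eq!(ok, Ok("hi"));
|     assert!(bad.is_err());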
| aliceryhl wrote:
| Because there's no possible choice for T. You can't use u8,
| because then you would allow non-utf8 data. You can't use char,
| because a &[char] uses four bytes per character, whereas a &str
| stores the characters in utf-8, which is a variable-width
| encoding.
|
| A &str is really a different kind of thing from other slices.
| In any other slice, each element in the slice takes up a
| constant number of bytes, but this is not the case for a &str.
| mamcx wrote:
| Another: a &str is immutable AND treated as a whole. &[T] is
| parts, and when declared as &mut [T] it is mutable.
|
| Because of Unicode, &[T] makes it easy to write wrong code
| (code that assumes here that T = char).
|
| It CAN'T be char, because char is larger than u8:
|
| https://doc.rust-lang.org/std/primitive.char.html
|
| and it means a Unicode code point.
|
| In other words: Rust is using types to PREVENT the wrong
| behavior.
| pornel wrote:
| Fun fact: `&mut str` exists. You don't get random access, but
| in controlled scenarios it's fine to mutate str in-place,
| e.g. `make_ascii_lowercase`
|
| https://doc.rust-
| lang.org/stable/std/primitive.str.html#meth...
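|
| For example:
|
|     let mut s = String::from("HeLLo");
|     // in-place mutation is fine here because the length and
|     // UTF-8 validity are preserved
|     s.as_mut_str().make_ascii_lowercase();
|     assert_eq!(s, "hello");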
| lmkg wrote:
| The regular slice type &[T] lets you access and manipulate
| individual elements. But Rust strings enforce the invariant
| that they are valid Unicode, which puts restrictions on
| element-wise operations.
|
| Calling &str a "string slice" is really more about the contrast
| with String, and how the relationship there mirrors the
| relationship between &[T] and Vec<T>. It's more of an analogy
| than a concrete description of the interface.
| tomjakubowski wrote:
| `str` contractually guarantees UTF-8 contents, so because of
| multi-byte codepoints it cannot be sliced at arbitrary indexes
| like a `[u8]` can be.
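|
| A small demonstration of why arbitrary byte indexes are
| rejected:
|
|     let s = "héllo";                // 'é' occupies bytes 1 and 2
|     assert_eq!(&s[0..1], "h");      // ok: both ends on boundaries
|     assert!(s.get(1..2).is_none()); // byte 2 is mid-codepoint
|     // &s[1..2] would panic at runtime for the same reason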
|
| As a side note, it is possible to define your own "unsized"
| slice type which wraps `[u8]`. This can be useful for binary
| serialization formats which can be subdivided / sliced into
| smaller data units.
| Const-me wrote:
| > `str` contractually guarantees UTF-8 contents
|
| I don't think any other language does that. Instead, most of
| them implement as much as they can while viewing the storage
| as a blob of UTF-8/UTF-16 bytes/words, and throw exceptions
| from the methods which interpret the data as codepoints.
|
| Strings are used a lot in all kinds of APIs. For instance,
| strings are used for file and directory names. The OS kernels
| don't require these strings to be valid UTF-8 (Linux) or
| UTF-16 (Windows).
|
| To address that use case, the Rust standard library needs yet
| another string type, OsString. This contributes to complexity
| and the learning curve.
| estebank wrote:
| The number of things that you have to learn remains
| constant. I could even make the argument that the number of
| things you need to learn up-front is lowered when only
| talking about the distinction between String and
| OsString/CString. The difference is that rustc will be
| pedantic and complain about all of these cases, asking you
| to specify exactly what you wanted, while other languages
| will fail at runtime.
| Const-me wrote:
| > rustc will be pedantic and complain about all of these
| cases, asking you to specify exactly what you wanted
|
| So, they're offloading complexity onto programmers. Being such
| a programmer, I don't like their attitude.
|
| > other languages will fail at runtime
|
| In practice, other languages usually print squares, or
| sometimes backslash escape codes, for encoding errors in their
| strings. That's not always the best thing to do, but I think
| that's what people want in the majority of use cases.
| pornel wrote:
| The complexity already exists regardless of what the
| language does.
|
| The only choice is whether it's explicit and managed by the
| language, or hidden, so that you need knowledge and experience
| to handle it yourself without the language's help. If you want
| "squares" for broken encoding, Rust has `to_string_lossy()`
| for you. It's explicit, so you won't get that error by
| accident.
|
| Avoiding "mojibake" in other languages is usually a major
| pain. For example, PHP is completely hands-off when it
| comes to string encodings. To actually encode characters
| properly you need to know which ini settings to tweak,
| remember to use mb_ functions when appropriate, and don't
| lose track of which string has what encoding. There's
| internal encoding, filesystem encoding, output encoding,
| etc. They may be incompatible, but PHP doesn't care and
| won't help you.
| Const-me wrote:
| > It's explicit, so you won't get that error by accident
|
| I would want it to be implicit.
|
| Ideally, for rare 20% of cases when I care about UTF
| encoding errors, I'd want a compiler switch or something
| similar to re-introduce these checks, but I can live
| without that.
|
| > For example, PHP
|
| When you compare Rust with PHP it's no surprise Rust is
| better; many people think PHP is a notoriously bad language:
| https://eev.ee/blog/2012/04/09/php-a-fractal-of-bad-design/
|
| I like C# strings the best, but I also have lots of experience
| with C++, and some experience with Java, Objective-C, Python,
| and a few others. None of them expose as many different string
| types to programmers as Rust does; many higher-level languages
| have exactly one string type.
|
| Interestingly, some languages like Swift use similar
| representations internally, but they don't expose the
| complexity to programmers; they manage to provide a
| higher-level abstraction over the memory layout. Compared to
| Rust, that improves usability a lot.
| steveklabnik wrote:
| So, there's like, a few things here. First is, technically they
| both do use the same notation, &T, where T=str and T=[u8]. This
| is the whole "unsized types" thing. &Path is another example of
| this, String : &str :: PathBuf : &Path.
|
| Beyond that though, &[T] implies a slice of Ts, that is,
| multiple Ts in a row. But a &str is a slice of a single string.
| So &[str] would feel wrong; that is, a &str is a slice of a
| String or another &str or something else, but isn't like, a
| list of multiple things. It's String, not Vec<str>.
|
| Basically, Strings are just weird.
| Covzire wrote:
| Are they though? I've long wondered why the Rust team hasn't
| imitated C# or other languages ease of use with strings while
| also retaining the existing functionality for lower-level use
| cases. I suppose it's a kind of gauntlet that a Rust dev
| would have to go through which could be a good thing but
| personally hitting walls with strings really turned me off on
| Rust the first time I tried it simply because my expectations
| were diametrically opposed to the reality of strings in rust.
| adkadskhj wrote:
| Yea but i don't think that's what GP was asking, imo. Rather
| than `[str]`, i think they were asking why it's `str` and not
| `[char]`, no?
|
| Just as `[u8]` is to `Vec<u8>`, `[char]` is hypothetically to
| `Vec<char>`.. and `Vec<char>` is basically a `String`, no?
|
| _edit_: Though looking at the docs, `char` is always 4 bytes,
| so I guess that's where the breakdown would be? `char` would
| need to be unsized I guess, but then it would be an awkward
| `[unsized_char]`, which is like two unsized types... hence
| `str` maybe?
| codys wrote:
| `char` is (internally) a `u32` because it represents any
| single unicode character. `str` is not a `[char]`, because
| rust doesn't store strings as utf-32 (system APIs don't accept
| utf-32, and it tends to waste space in many cases).
|
| `str`'s data layout happens to be `[u8]`, but its type
| provides additional guarantees about the structure of the data
| within its internal `[u8]` (for example, forbidding sequences
| of u8 that don't encode valid utf-8).
| adkadskhj wrote:
| Well yea, i wasn't saying `[char]` _is_ a `str`; rather i was
| positing that the GP comment was asking why it's `str` rather
| than some hypothetical `[unsized_char]`.
|
| I think `char` _would_ work, if it was similarly unsized like
| a single piece of `str`. The problem, as i see it, is that
| `[unsized_char]` seems odd.
| steveklabnik wrote:
| To be extra pedantic, char represents a single Unicode
| Scalar Value.
| erik_seaberg wrote:
| This is a big deal because adding an accent mark to a
| letter (often) means a single char can no longer store
| it. APIs should not orient around isolated scalar values
| or codepoints because most devs will misuse them, not
| being experts on combining and normalization.
| [deleted]
| Blikkentrekker wrote:
| `str` is not `[char]`; there is no possible datatype `char`
| for which this would hold, and the name is already taken.
|
| `str` is not a slice; this is itself already a wrong
| statement. A slice is a dynamically sized type, a region of
| memory that contains any nonnegative number of elements of
| another type.
|
| A `str` is dynamically sized, but is not guaranteed to contain
| a succession of elements of any particular type. It's simply a
| dynamically sized sequence of bytes guaranteed to be _UTF-8_.
|
| `strs` aren't slices; all they have in common with them is
| that they are both dynamically sized types.
|
| `Vec<char>` is also not the same as `String`; a string is
| not a vector of `char`, which is already a type that has a
| size of 4 bytes.
|
| This all results from the fact that _UTF-8_ is a variable
| width encoding, and since slices are homogeneous, all elements
| have the same size.
| aliceryhl wrote:
| Yes, the reason you can't use char here is that a char is
| always 4 bytes, so a &[char] is a type that already exists,
| and that type uses four bytes per character.
| dralley wrote:
| I wish they had been named &path and Path instead, it would
| feel more consistent than String, &str, PathBuf, &Path
| steveklabnik wrote:
| I joke the only way that I'll sign off on Rust 2.0 is if we
| get to do the exact opposite, rename String to StrBuf,
| haha! (Same reason though, for consistency.)
| [deleted]
| CJefferson wrote:
| Related question (if you don't mind another).
|
| Why is 'str' a "primitive type"? What about 'str' means it
| has to be primitive, instead of being a light-weight wrapper
| around a '&[u8]' (that obviously enforces UTF8 requirements
| as appropriate).
| yazaddaruvala wrote:
| You can't just index into a string:
|
|     let hello: &[u8] = "Hello".as_bytes();
|     // hello[1] is just the byte 101 (b'e'), not a char, and
|     // for non-ASCII text an index like this can land in the
|     // middle of a multi-byte code point.
|
| To use &[u8] would be _very_ non-ergonomic.
| SAI_Peregrinus wrote:
| One could, in theory, make an ExtendedGraphemeCluster
| type, and make str a slice of ExtendedGraphemeClusters.
| So &[ExtendedGraphemeCluster] could be indexed into
| without having things not make sense. Of course that's
| much more complicated than most other primitives, and
| most people don't have any idea what an Extended Grapheme
| Cluster even _is_. But since they're the Unicode notion
| that most naturally maps to a "character" you could just
| call the type Character or char, and confuse the hell out
| of the C programmers by having a variable-width char
| type.
| josephg wrote:
| Sure - but iterating by extended graphemes isn't the only
| thing you want to do with strings. Sometimes you want to
| treat them as a bunch of UTF-8 bytes. Sometimes you want
| to iterate / index by Unicode code points (eg for CRDTs).
| And sometimes you want to render them, however the system
| fonts group them.
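|
| A tiny example of those different views of the same text:
|
|     // "e" followed by U+0301 (combining acute) renders as "é"
|     let s = "e\u{0301}";
|     assert_eq!(s.len(), 3);           // UTF-8 bytes
|     assert_eq!(s.chars().count(), 2); // Unicode scalar values
|     // with the unicode-segmentation crate,
|     // s.graphemes(true).count() would be 1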
|
| It makes sense to have a special type because it can support
| all of this stuff through separate methods on the type (or
| through nearby APIs). It's confusing, but I
| think it's the right choice.
|
| Although, I think the most confusing thing about rust
| strings isn't that &str isn't &[u8]. It's that &str isn't
| just &String or something like that.
| steveklabnik wrote:
| Well, &str is the type of string literals, so on some
| level, it has to be special.
|
| There was a PR implementing it as a wrapper around &[u8],
| but it didn't really provide any actual advantages, so it
| was decided to not do that.
|
| https://github.com/rust-lang/rust/pull/19612
| nicklecompte wrote:
| In particular: strings aren't actually simple arrays of
| characters in Rust like they are in C, but there is an
| underlying array of bytes (on the heap, for String), and the
| notion of "slicing" that array still makes sense semantically.
| knodi123 wrote:
| > strings aren't actually simple arrays of characters in
| Rust
|
| Are you talking about them being a container with a pointer
| to the actual array, and also a size and etc?
| Skunkleton wrote:
| char is usually some single byte of data. Characters can
| be multiple bytes. Slicing a string on character
| boundaries is more coarse grained than slicing on byte
| boundaries.
| estebank wrote:
| char in Rust is 32bits, so it has a 1:1 mapping to
| Unicode glyphs. You might also want to care about
| grapheme clusters, but those are not part of the stdlib.
| toast0 wrote:
| Unicode codepoints; glyphs may take multiple codepoints.
| estebank wrote:
| You are, of course, correct.
| nicklecompte wrote:
| Right - the point is that &str isn't syntactic sugar /
| alias / etc for &[u8] and it would be confusing to have a
| notation that suggested otherwise.
| seoaeu wrote:
| In C, strings can hold invalid unicode. However, in Rust
| a str is guaranteed to be valid utf-8.
|
| For added confusion, Rust has a `char` type which is
| actually 32-bits. You can create arrays of them, but the
| resulting string would be in utf-32 and thus incompatible
| with the normal `str` type.
| Animats wrote:
|     let s = "hello world".to_string();
|     for ch in s.chars() {
|         print!("{}", ch);
|     }
|
| will iterate through a string character by character. That's
| the most common use of the "char" type - one at a time, not
| arrays of them.
|
| Although the proper grapheme form is:
|
|     use unicode_segmentation::UnicodeSegmentation; // 1.7.1
|     let s = "hello world".to_string();
|     for gr in UnicodeSegmentation::graphemes(s.as_str(), true) {
|         print!("{}", gr)
|     }
|
| This will handle accented characters and emoji modifiers. A
| line break in the middle of a grapheme will mess up output.
|
| By the way, open season for proposing new emoji starts
| tomorrow.[1]
|
| [1]
| http://blog.unicode.org/2020/09/emoji-150-submissions-re-
| ope...
| yazaddaruvala wrote:
| Are there potentially other situations where `&[T + !Sized]`
| makes sense?
|
| The majority of the functions on `&str` seem to make sense
| for all `&[T + !Sized]` where `type str = [unsized_char]`.
| steveklabnik wrote:
| I don't really know what unsized_char would even mean,
| chars have a size, and str is not a sequence of chars.
|
| That said, I'm also not sure in the general case.
| kzrdude wrote:
| "string slice" is just the name of the one thing, and "slice"
| the name of the other, and they are different things with
| similar names and related features.
| geodel wrote:
| > It's a testament to Rust's breadth and accessibility that even
| people who have never done low-level programming before are
| giving it a try.
|
| Umm, no, I think it is a testament to Rust evangelism that
| people start using Rust even where it is least appropriate. So
| we have folks who start coding five web pages' worth of
| project in Rust because, you know, low-level control, high
| performance, next generation, secure software, etc.
| zabzonk wrote:
| > "foofoo" takes up twice as much space in memory as "foo".
|
| Not in the imaginary language you keep talking about as "C/C++".
| pornel wrote:
| Right; in case they're heap allocated, the worst-case
| alignment requirement of malloc, as well as free() not taking
| a size, is likely to force implementations to round both up to
| at least 8 or 16 bytes.
| ncmncm wrote:
| No. The result of std::string("foo") is _exactly_ the same
| size as that of std::string("foofoo"). They take up exactly
| the same number of bytes in the free store, on the stack, in a
| hash table, what have you.
| pornel wrote:
| We just interpret "space in memory" differently.
|
| The {ptr,len,cap} part may be fixed-size, but you also need
| to hold the letters somewhere in memory.
|
| If you try to create std::string("verylong...") with more
| characters than you have memory, you run out of memory, so
| it "takes up space in memory".
|
| Bonus fact: *"foo" and *"foofoo" in Rust actually take
| different amounts of memory even when using your definition.
| One dereferences to 3 bytes, the other to 6 (not a pointer).
| zabzonk wrote:
| I actually (and pedantically) meant that in neither C nor C++
| is it true that "foofoo" takes up double the size of "foo" -
| both will have a zero character at the end.
| rectang wrote:
| If you've been handling Unicode properly in other languages, then
| Rust strings seem _easy_ in comparison.
|
| * All the 2-byte-char languages which were designed for UCS-2
| before the Unicode Consortium pulled the rug out from underneath
| them and obsolesced constant-width UCS-2 in favor of variable-
| width UTF-16.
|
| * Languages which result in silent corruption when you
| concatenate encoded bytes with a string type (e.g. Perl but there
| are many examples.)
|
| * C, where NUL-terminated strings are the rule, and the standard
| library is of no help and so Unicode string handling needs to be
| built from scratch.
|
| All those checks which you have to fight to opt into, defying
| both the language and other lazy programmers (either inside your
| org, or at an org which develops dependencies you use)? Those
| checks either happen automatically or are _much_ easier to use
| without making mistakes in Rust.
| ajross wrote:
| Alternatively: if you have been handling Unicode and using wide
| characters, you have _not_ been handling Unicode properly.
|
| Obviously the world is a big place and there is room for lots
| of paradigms and worldviews and we aren't supposed to judge too
| much.
|
| But come on. If new code isn't working naturally in UTF-8 in
| 2021 then it's wrong, period.
| estebank wrote:
| > if you have been handling Unicode and using wide
| characters, you have not been handling Unicode properly.
|
| Paradoxically, trying to do "the right thing" and being an
| "early adopter" of (the now called) UCS-2 was a "mistake", as
| both Java and Windows can attest, by getting "stuck"
| supporting the worst possible Unicode encoding ad-infinitum.
| UTF-8 is the "obviously correct" choice (from the hindsight
| afforded by us talking about this in 2021).
|
| I still find it funny that emojis of all things are what
| actually got the anglosphere to actually write software that
| isn't _completely_ broken for the other 5.5 billion people
| out there.
| spamizbad wrote:
| It's the edge-cases that get you.
| diroussel wrote:
| My understanding is that UTF-8 is not a good representation
| for non-European alphabets.
|
| So do you think UTF-8 is always the best internal string
| representation? Or just for English speakers?
|
| For Mandarin, what would be optimal?
| klodolph wrote:
| So, the advantage of UTF-16 is that CJK text will use 33%
| less space.
|
| Does this mean that "UTF-8 is not a good representation for
| non-European alphabets?" It may be less efficient but the
| difference does not seem shocking to me, considering that
| for most applications, the storage required for text is not
| a major concern--and when it is, you can use compression.
| rectang wrote:
| Mandarin is an interesting case. Most of the Han characters
| used by Mandarin fall within the basic multilingual plane
| and thus occupy 2 bytes in UTF-16 but 3 bytes in UTF-8.
| However, for web documents, most markup is ASCII which is
| only one byte. So for Mandarin web documents, the space
| requirements for UTF-8 and UTF-16 are about a wash.
|
| When you add in interoperability concerns, since so much
| text these days is UTF-8, for Mandarin at least UTF-8 is a
| perfectly defensible choice.
|
| (A harder problem is Japanese -- Japan really got screwed
| over with Han unification, so choosing Shift-JIS over any
| Unicode encoding is often best.)
|
| FWIW I covered the space requirements of various encodings
| and various languages in this talk for Papers We Love
| Seattle:
|
| https://www.youtube.com/watch?v=mhvaeHoIE24&t=39m14s
| amelius wrote:
| I genuinely wonder: is the space requirement of text
| encodings really an important issue in this age of large
| photo and video content?
| magicalhippo wrote:
| > if you have been handling Unicode and using wide
| characters, you have not been handling Unicode properly.
|
| How so? Delphi for example has wide character-based strings
| as default, what's wrong with that?
| josephg wrote:
| Wide-character-based strings have a .length field which is
| easy to reach for and never what you want, because its value
| is meaningless:
|
| - It isn't the number of bytes, unless your string only
| contains ASCII characters. Works in testing, fails in
| production.
|
| - It isn't the number of characters because 16 bits isn't
| enough space to store the newer Unicode characters. And
| even if it could, many code sequences (eg emoji) turn
| multiple code points into a single glyph.
|
| I know all this, and I still get tripped up on a regular
| basis because .length is _right there_ and works with
| simple strings I type. I have muscle memory. But no, in
| javascript at least the correct approaches require thought
| and sometimes pulling in libraries from npm to just make
| simple string operations be correct.
|
| Rust does the right thing here. Strings are UTF-8
| internally. They check the encoding is valid when they're
| created (so you always know if you have a string, it is
| valid). You have string.chars().count() and other standard
| ways to figure out byte length and codepoint length and all
| the other things you want to know, all right there, built
| into the standard library.
| estebank wrote:
| The reasoning behind using UTF-16/UCS-2 is that then you
| can plug your ears and treat 1 char == 1 user-visible glyph
| on the screen, so programmers that acted as if ASCII was
| the only encoding in existence could continue treating
| strings in the same way (using their length to calculate
| their user-visible length, indexing directly on specific
| characters to change them, etc).
|
| All of those practices became immediately wrong once Unicode
| outgrew 16 bits and UTF-16 became a variable-length encoding.
| But even if _that_ hadn't happened, what you want to be
| operating on is _not_ characters, but grapheme clusters, which
| are sequences of chars. Otherwise you won't handle the
| distinction between a precomposed é and an e plus a combining
| accent, or emoji, correctly.
| magicalhippo wrote:
| But how is that different from the underlying encoding
| being UTF-8?
|
| edit:
|
| For example, we do a lot of string manipulation in
| Delphi. We might split a string in multiple pieces and
| glue them together again somehow. But our separators are
| fixed, say a tab character, or a semicolon. So this
| stitching and joining is oblivious to whatever emojis and
| other funky stuff that might be in between.
|
| How is this doing it wrong?
|
| I mean yea sure you CAN screw it up by individually
| manipulating characters. But I don't see how an UTF-8
| encoded string _in itself_ prevents you from doing the
| same kind of mistakes.
| josephg wrote:
| Splitting and glueing is fine. But imagine 3 systems:
| system A is obviously wrong. It crashes on any input.
| System B is subtly wrong. It works most of the time, but
| you're getting reports that it crashes if you input
| Korean characters and you don't know Korean or how to
| type those characters. System C is correct.
|
| Obviously C is better than A or B, because you want
| people to have a good experience with your software. But
| weirdly, system A (broken always) is usually better than
| system B (broken in weird hard to test ways). The reason
| is that code that's broken can be easily debugged and
| fixed, and will not be shipped to customers until it
| works. Code that is broken in subtle ways will get
| shipped and cause user frustration, churn, support calls,
| and so on.
|
| The problem with UCS-2 is it falls into system B. It
| works most of the time, for all the languages I can type.
| It breaks with some inputs I can't type on my keyboard.
| So the bugs make it through to production.
|
| UTF-8 is more like system A than system B. You get
| multibyte code sequences as soon as you leave ASCII, so
| it's easier to break. (Though it really took emoji for
| people to be serious about making everything work.)
| barrkel wrote:
| I was part of that. Delphi has all the string types you
| want, since you can declare your preferred code page.
| String is an alias for UnicodeString (to distinguish from
| COM WideString) and is UTF-16 for compatibility with Win32
| API more than anything. UTF-8 would have meant a lot more
| temporaries and awkward memory management.
| magicalhippo wrote:
| All in all, while the Unicode transition took its time, I
| must admit it was very smooth when it did happen.
|
| At work we have a codebase that does a lot of string
| handling. Both in reading and writing all kinds of text
| files, as well as doing string operations on entered
| data. Several hundred kLOC of code across the project.
|
| We had one guy who spent less than a week of wall time to move
| the whole project, and the only issue we've had since is
| when other people send us crappy data... if I got a
| dollar for each XML file with encoding="utf-8" in the
| header and Windows-1252 encoded data we've received I'd
| have a fair fortune.
| shadowgovt wrote:
| This is my way of thinking about the topic these days. It's not
| that strings are more complicated in Rust than in other
| languages, it's that a lot of the other low-level languages are
| presenting an abstraction that assumes implicitly that a string
| is some type of sequence of uniform-sized cells, one cell per
| character, and that representation was an artifact of a
| specific time in computational history. It's like many other
| abstractions those languages provide... Seemingly simple at
| first glance, but if you do the details wrong you're just going
| to get undefined behavior and your program will be incorrect.
|
| Languages that don't expose strings as that abstraction are, in
| my humble opinion, more reflective of the underlying concept in
| the modern era.
| alerighi wrote:
| All of this is true, IF you assume that you want Unicode
| strings. Especially in systems/embedded software (the kind of
| software that Rust is targeting), you often don't really care
| about Unicode and can simply treat strings as arrays of bytes.
|
| And I live in a country where you usually use Unicode
| characters. But for the purpose of the software that I write, I
| mostly stick with ASCII. For example I use strings to print
| debug messages to a serial terminal, or read commands from the
| serial terminal, or to put URL in the code, make HTTP requests,
| publish on MQTT topics... for all of these application I just
| use ASCII strings.
|
| Even if I have to represent something on the screen... as long
| as I have a compiler that supports Unicode as input files (all
| do these days) I can put Unicode string constants in the code
| and even print them on screen. It's the terminal (or the GUI I
| guess, but I don't write software with a GUI) that translates
| the bytes that I send on the line as Unicode characters.
|
| And yes, of course the length of the string doesn't correspond
| to the characters shown on the screen... but even with Unicode
| you cannot say that! You can count (and that's what Rust does)
| how many Unicode code points you have, but a character can be
| made of multiple code points (stupid example: an emoji with a
| skin-tone modifier is the base emoji followed by a modifier
| code point).
|
| So to me it's pointless, and I care more about knowing how many
| bytes a string takes and being able to index the string in
| O(1), or take pointers in the middle of the string (useful when
| you are parsing some kind of structured data), and so on.
|
| In conclusion Rust is better when you have to handle Unicode
| strings, but most applications don't have to handle them. And
| by handling them I don't mean passing them around as a black
| box, not caring what they contain (yes, in theory you should
| care about not truncating the string in the middle of a code
| point when truncating strings... in reality, how often do you
| truncate strings?)
| jasonhansel wrote:
| My way of thinking about this (which I _think_ is correct) is:
| "&str" is like "&[u8]", except that "&str" is guaranteed to
| contain valid UTF-8.
| jstanley wrote:
| > Then if your program later says "actually that wasn't enough,
| now I need Y bytes", it'll have to take the new chunk, copy
| everything into it from the old one, and then give the old one
| back to the system.
|
| This is mostly true. If you get lucky, there may already be
| enough unused space past the end of the existing allocation, and
| then realloc() can return the same address again, no copying
| required.
|
| But if you know you're going to be doing lots of realloc() (and
| you're not unusually tightly memory-constrained) then instead
| of growing by 1 byte each time, it's often worth starting with
| some sensible minimum size and doubling the allocated size each
| time you need more space. That way you "waste" O(N) memory, but
| only spend O(N) total time copying the data around instead of
| O(N^2).
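|
| A rough sketch of the shape of that policy (a hypothetical
| helper; Vec and String already do this bookkeeping for you):
|
|     fn next_capacity(len: usize, cap: usize, extra: usize) -> usize {
|         let needed = len + extra;
|         if needed <= cap {
|             cap                         // still fits, no realloc
|         } else {
|             needed.max(cap * 2).max(16) // double, with a floor
|         }
|     }
|
|     fn main() {
|         assert_eq!(next_capacity(10, 16, 4), 16); // no reallocation
|         assert_eq!(next_capacity(16, 16, 1), 32); // double, not +1
|     }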
| axiosgunnar wrote:
| Funnily enough this exact topic came up on Hackernews just
| recently when a Googler started benchmarking AssemblyScript (a
| language for WebAssembly) and realized that AssemblyScript was
| increasing the size by +1 instead of doubling when
| reallocating...
|
| Here is the HN thread:
| https://news.ycombinator.com/item?id=26804454
| sumtechguy wrote:
| I personally use a 'stride' instead of doubling. On small sizes
| doubling works OK. But when you get past about 8k-16k the empty
| memory starts to stack up.
| koverstreet wrote:
| Never do this - you're introducing a hidden O(n^2).
|
| Folks, this is why you take the algorithms class in CS.
| axiosgunnar wrote:
| Could you elaborate? This seems very interesting.
| magicsmoke wrote:
| Asymptotically, there's no difference between allocating
| an n+1 buffer and an n+k buffer before copying your old
| data in. You'll still get O(n^2).
|
| In reality, it depends on the data you're handling. You
| may never end up handling sizes where the O(n^2)
| asymptote is significant and end up wasting memory in a
| potentially memory constrained situation. At the end of
| the day, it all depends on the actual application instead
| of blind adherence to textbook answers. Program both,
| benchmark, and use what the data tells you about your
| system.
|
| If I've got a 500 MB buffer that I append some data to
| once every 5 minutes, I might want to reconsider before I
| spike my memory usage to fit a 1 GB buffer just to close
| the program 15 minutes later.
| deathanatos wrote:
| The O(n^2) here is the time spent copying the data; it's not
| about the size of the buffer, or that you'll temporarily use
| 2x the space. The program would die by becoming unusably slow.
|
| Take your 500 MB example, and say we start with a 4 KiB
| buffer. If we grow it by a constant 4 KiB each time it runs
| out of space, by the time the buffer is 500 MiB we've copied
| ~30 TiB of data. If instead we grow the buffer by doubling it,
| we will have had to copy ~1000 MiB (0.001 TiB) by the time it
| hits 500 MiB, a difference of ~30,000x. (Which is why the
| program would slow to a crawl.)
| magicsmoke wrote:
| Yes I'm aware of how the algorithm works. I also know
| that if I allocated 500 MiB at the beginning of the
| program expecting my memory usage to be roughly that
| size, and my prediction was off by 50 MiB maybe I don't
| want to go hunting for another 500 MiB of space before my
| program ends or I stop using the buffer and free it.
|
| But your point about the virtual memory makes that moot
| anyways. Thank god for modern OSes. I've clearly been
| spending too much time around microcontrollers.
| kaslai wrote:
| Except on non-embedded platforms, oftentimes large blocks of
| allocated memory aren't occupying physical memory until you
| write to them. There's not much reason to avoid using
| exponential buffer growth on a system with a robust virtual
| memory implementation.
| brundolf wrote:
| > and then realloc() can return the same address again, no
| copying required
|
| Interesting, I didn't know that!
|
| Regardless, though, I intentionally skimmed over certain
| nuances like exponential buffer growth for the sake of the main
| point
| irrational wrote:
| I love articles like this. I imagine this is the kind of stuff
| you learn if you study CS or Software Engineering in college.
| Maybe when I retire I will go and get a CS degree so I can learn
| all the things I should have learned before I began working
| professionally as a programmer 25 years ago.
| brundolf wrote:
| We used C++ throughout most of my CS program, and I hated it at
| the time and never wanted to write it ever again (and
| haven't!), but good lord did it help me understand how
| computers work. I've benefitted from that perspective ever
| since.
|
| I'm not sure you have to do a whole degree to get that
| perspective, though. Just learning C or C++ should get you a
| good chunk of it.
| v8dev123 wrote:
| > understand how computers work
|
| Nobody understands it all the way down to electron orbitals.
| Your understanding will be wrong in the next decade if it
| isn't wrong already; computer architectures are proprietary.
|
| Give C++17 a try.
|
| Zero-cost abstractions, RAII.
|
| It's a beast.
| amilios wrote:
| Nah, I wish. You just learn how to implement quicksort in Java
| and stuff like that. The only languages I saw during my degree
| were Java, C, Python, Perl, and a little bit of Prolog. And I
| finished my Bachelor's last year.
| jeltz wrote:
| Hm, I feel Rust's strings are the easiest I have worked with in
| any programming language, but that might be due to my knowledge
| of Unicode and of C.
|
| The only thing which surprised me was the &str type. Why isn't it
| just an ordinary struct (called e.g. Str) consisting of a
| pointer/reference and a length?
| TheCoelacanth wrote:
| Rust's strings are one of the easiest to use correctly and one
| of the hardest to use incorrectly, but the vast majority of
| string handling is done incorrectly, so that makes Rust seem
| hard.
| steveklabnik wrote:
| I linked to a PR above which implemented this idea, and was
| rejected.
| kimundi wrote:
| With dynamically sized types like `str`, Rust lets you
| separate "how to access data behind a pointer" from "what data
| is behind the pointer". So you can for example have the types
| `str`, `[T]` or `Path`, and can have them behind the pointer
| types `&T`, `Box<T>` or `Arc<T>`.
|
| If Rust had defined a special struct `Str` for `&str`, then it
| would have to define special structs for all the combinations
| possible: Str, ArcSlice, BoxPath, etc...
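|
| For example, the same unsized str can sit behind several
| pointer types:
|
|     use std::sync::Arc;
|
|     let borrowed: &str = "hi";              // borrowed view
|     let boxed: Box<str> = "hi".into();      // owned, no capacity
|     let shared: Arc<str> = Arc::from("hi"); // shared, refcounted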
| brundolf wrote:
| Yeah, this article was specifically aimed at people who
| _haven't_ worked with C/C++, and instead have the higher-level
| mental model for what strings are and how they work.
| alkonaut wrote:
| I wish more languages had primitives for "simple strings". Most
| of the time when I use a string I could live with a restriction
| that it's ascii only and can fit in 64bytes. "Programming string"
| vs "human string". For example a textual value of some symbol or
| a name of a resource file I can control myself or a translation
| lookup key (which I can make sure is always short and ascii). An
| XML element name or json property in an enormous file with a
| schema I control. It seems weird to use the same type for the
| "human string" e.g. user input, a name, the value _looked up_
| from the translation key and so on. For the simple strings it
| _feels_ wasteful to make heap allocations, to use two bytes
| per char (e.g. C#), or to worry about encoding in a UTF-8
| string (e.g. Rust).
| brundolf wrote:
| You can pretty much do this in Rust:
|
|     let simple_str: &[u8] = b"hello world";
|     println!("{}", simple_str[6] as char); // 'w'
|
| Though I would advise against it in most cases, because even
| many "non-human" formats like XML and JSON do allow for Unicode
| characters
| cdcarter wrote:
| This is, as I understand it, how Symbols work in Ruby. They are
| prefixed with a colon, and are interned and immutable.
| bsder wrote:
| I would say UTF-8, but I really do miss old-school Pascal
| strings (aka strings with a length field and a _fixed_
| allocation) sometimes.
|
| Pascal strings could _automatically_ #[derive] a whole bunch of
| the Rust traits (Copy, Clone, Send, Sync, Eq, PartialEq, ...)
| that would help sidestep a whole bunch of ownership issues when
| you start throwing strings around in Rust.
|
| The downside would be that you would occasionally get a runtime
| panic!() if your strings overflowed.
|
| Sometimes, I can live with that. Embedded in particular would
| mostly prefer Pascal strings.
|
| I suspect that Rust is powerful enough to create a PString type
| like that and actually fit it into the language cleanly. The
| lifetime annotations may be the trickiest part (although--maybe
| not as everything is a fixed size).
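|
| A rough sketch of the idea (a hypothetical type; crates such
| as arrayvec's ArrayString offer polished versions of this,
| and it assumes N <= 255, like classic Pascal strings):
|
|     #[derive(Copy, Clone, PartialEq, Eq)]
|     struct PString<const N: usize> {
|         len: u8,
|         buf: [u8; N],
|     }
|
|     impl<const N: usize> PString<N> {
|         fn new(s: &str) -> Self {
|             // runtime panic!() on overflow, as noted above
|             assert!(s.len() <= N, "overflow");
|             // zero-fill so the derived Eq compares consistently
|             let mut buf = [0u8; N];
|             buf[..s.len()].copy_from_slice(s.as_bytes());
|             PString { len: s.len() as u8, buf }
|         }
|
|         fn as_str(&self) -> &str {
|             std::str::from_utf8(&self.buf[..self.len as usize])
|                 .unwrap()
|         }
|     }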
| brundolf wrote:
| > Pascal strings could automatically #[derive] a whole bunch
| of the Rust traits (Copy, Clone, Send, Sync, Eq, PartialEq,
| ...)
|
| Worth noting that the explicitness of #[derive] was a design
| decision; particularly when exposing a library API, it's good
| to have control over the set of interfaces you actually
| support, so that (for example) if one of them stops being
| derivable later you won't break people's code downstream
| dralley wrote:
| This is a great article! The only omission is that you can
| concatenate two &str's at compile time using concat!().
| brundolf wrote:
| I didn't know about that! It's pretty interesting, though also
| fairly niche, because the most common use case is a string
| addition where at least one member is a variable, not a
| literal.
| Animats wrote:
| He's working too hard. You don't need all those type
| declarations.
|
|     let a = "hello".to_string();
|     let b = "world".to_string();
|     let c = a + " " + &b;
|     println!("Sentence: {}", c);
|
| It's a bit confusing that you need "&b". "+" was defined as
| String + &str, which is somewhat confusing but convenient.
|
| _If you 've been handling Unicode properly in other languages,
| then Rust strings seem easy in comparison._
|
| Yes. C is awful. C++ is still sometimes UTF-16. C#/.net is still
| mostly UTF-16. Windows is still partly in the UTF-16 era. So is
| Java. So is Javascript. Python 2 came in 2-byte and 4-byte
| versions of Unicode. The UTF-16 systems use a "surrogate pair"
| kludge to deal with characters above 2^16. CPython 3 has 1, 2,
| and 4 byte Unicode, and switches dynamically, although the user
| does not see that.
|
| Linux and Rust are pretty much UTF-8 everywhere now, but everyone
| else hasn't killed off the legacy stuff yet.
| brundolf wrote:
| The type declarations are there for maximum clarity, since the
| inferred types may not be obvious to the target audience
| jdmichal wrote:
| Yes, please. One of my biggest peeves are posts that are
| meant to be educational, but yet don't define the types used
| for variables nor define the namespace / package they are
| from. Java posts are rife with the latter, and I'm really not
| looking forward to the `var` keyword making the former a
| thing too. And quite a few get bonus points for doing such on
| types that require a new dependency -- but good luck figuring
| out which dependency without knowing the dependency name nor
| the package name of the class!
| Someone1234 wrote:
| This article was absolutely fantastic.
|
| It kind of lost me right up until the first example code, where I
| did in fact have different expectations. Then how it broke down
| _why_ , and gave different potential solutions was just
| wonderful. I learned a lot.
|
| I will add though that other languages have a distinction
| between a constant and a string object too (at least under the
| hood), they just go to great lengths to hide it from
| programmers. For example a const + const concatenation might
| do the same thing as Rust under the hood, but it is
| transparent. Rust seems like it requires the extra steps
| because it wants the programmer to make a choice here (or more
| specifically, to stop them from making mistakes: like having
| immutable data _stay_ immutable, rather than automatically
| converting to a mutable string, including the data-copy cost
| that involves; automatic conversion is more convenient but
| also a performance footgun).
|
| I don't think Rust is wrong, I think it is opinionated and
| honestly as someone that like immutability I kinda dig it.
| brundolf wrote:
| Glad you enjoyed it :)
|
| And yeah, under the hood those other languages do all kinds of
| wild optimizations with their "immutable" strings like sharing
| substrings between different strings and pooling to reduce
| allocations. I intentionally left out those nuances because
| from the user's perspective, those are all implementation
| details (even if they can surface in the form of performance
| changes).
| ncmncm wrote:
| It is well-written, where it treats Rust, but almost lost me,
| too, for its use of "C/C++", treating the two very different
| languages as if they were trivial variations. Where string
| handling is concerned, as in so many other places, they are
| fundamentally different.
|
| This "C/C++" bad habit is very commonly used, around Rust, to
| slyly imply there is no effective difference, but in a way that
| permits an injured response to criticism that "it just means 'C
| or C++'". But _it doesn't_, unless you are talking about object
| file symbols or compiler optimizers, and often enough not even
| then. What it does do is encourage sloppy thinking and
| resultant falsehoods. These falsehoods show up in the article,
| revealed if you change "C/C++" to "C or C++" in each case.
|
| In several places it says just "C++", yet is still talking
| about C. It is OK not to know C++; many don't. Things are hard
| enough without falsehoods.
| brundolf wrote:
| The only thing I claimed about the two as a category was that
| "strings are not immutable, they're data structures" (which
| applies to Rust too). I purposely didn't go into much more
| detail than that because it wasn't really the point of the
| article. I did mention that C works with strings as raw char
| arrays, and C++ has a struct around a char array that manages
| length-tracking and reallocation automatically.
|
| I believe these two statements are accurate, though I'm happy
| to be corrected if they aren't. It's been a few years since I
| wrote C++. Beyond that, I see my claims as "abstract" and not
| "falsehoods".
| ncmncm wrote:
| Amusingly, Rust strings, whether &str or String, are unable to
| represent filenames, which in many, many programs is the
| overwhelmingly most common use for character sequences that
| people want to call strings.
|
| The Rust people invented the wonderful "WTF-8" notion to talk
| about these things. It gets awkward when you want to display a
| filename in a box on the screen because those boxes like to hold
| actual strings _qua strings_ , not these godforsaken abominations
| that show up in file system directory listings.
|
| Handling WTF-8 will take a whole nother article. I don't know a
| name for WTF-8 sequences; I have been calling them sthrings,
| which is hard to say, and awkward, but that is kind of
| appropriate to the case.
| int_19h wrote:
| It doesn't really make things any more complicated than they
| already are. If you take a filename in C, and then have to
| display it somewhere, you're facing the same problem - except
| that you might not even be aware of it, because all types are
| the same, and you won't notice the problem unless you happen to
| run into an unprintable filename.
|
| Rust is doing the right thing here by forcing developers to
| deal with those issues explicitly, rather than sweeping them
| under the rug. The real issue is filenames that aren't proper
| strings - i.e. an OS/FS design defect - but this ship has
| sailed long ago.
| ncmncm wrote:
| That sthrings are equally as awkward to handle correctly in
| C, C++, Python, Ruby, Javascript, or Perl, as in Rust makes
| them no less awkward.
|
| Nobody said Rust has done anything wrong to ban them from
| String.
| brink wrote:
| Nicely written!
|
| I wish I had this article when I first started Rust. Would have
| saved me some trouble.
| sequoia wrote:
| Incredibly clear and compassionate writing (callout boxes throw a
| bone to readers who aren't well versed in concepts like the heap,
| character arrays etc.).
|
| Big kudos to the author!
| brundolf wrote:
| Thanks :)
| skybrian wrote:
| This is a great intro, but clarifying one more thing might be
| useful: how do you return a string?
| brundolf wrote:
| A String can be returned like any other owned value; whether or
| not a &str can be returned depends on lifetimes, as it does
| with any other reference.
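|
| A tiny sketch of the two cases:
|
|     // returning an owned String is always fine
|     fn uppercased(s: &str) -> String {
|         s.to_uppercase()
|     }
|
|     // returning a &str ties the result to the input's
|     // lifetime (elided here, it's the same as `s`)
|     fn first_word(s: &str) -> &str {
|         s.split_whitespace().next().unwrap_or("")
|     }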
|
| Lifetimes seem out of scope for this post, and the lifetimes
| story for strings isn't really strings-specific enough that it
| felt important to cover. There are other resources out there
| that thoroughly cover the topic of lifetimes; in fact I wrote a
| short summary myself :)
|
| https://www.brandons.me/blog/favorite-rust-function
___________________________________________________________________
(page generated 2021-04-14 23:01 UTC)