[HN Gopher] Why Rust strings seem hard
___________________________________________________________________
Why Rust strings seem hard
Author : brundolf
Score : 108 points
Date : 2021-04-14 19:27 UTC (3 hours ago)
(HTM) web link (www.brandons.me)
(TXT) w3m dump (www.brandons.me)
| rsc wrote:
| If you can indulge a non-Rust point of view, if I'd been faced
| with this design problem, I think I would have put a "this is a
| rodata-backed string" bit into String, and then the
| representation would be {ptr, len, rodata-bit}
|
| (or squeeze a bit out of the ptr or len if the space matters).
|
| Then "abc" could have type String, and the only difference
| between let abc1: String = "abc"
|
| and let a: String = "a" let abc2:
| String = a + "bc"
|
| would be that (assuming the compiler doesn't get cute) abc1 is
| pointing at the rodata bytes for "abc" and abc2 is pointing at
| allocated bytes. (But it can tell them apart so the deallocation
| is not ambiguous.)
|
| It seems like this would have avoided a lot of the ink spilled
| over &str vs String. I know it's too late now, but was this
| considered at all, and if so what were the reasons not to adopt
| it?
|
| Thanks.
| tene wrote:
| When you want a reference that can transparently become an
| owned value when mutated, Rust has a type Cow that implements
| this generally. It's an enum that has two variants, Owned and
| Borrowed. If you mutate a Cow::Borrowed, it first copies the
| data to a new owned allocation, and replaces itself with
| Cow::Owned.
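|
| A rough sketch of that behavior, using Cow::to_mut (which is
| what actually performs the clone-on-first-write):
|
|     use std::borrow::Cow;
|
|     fn exclaim(s: &mut Cow<str>) {
|         // to_mut() copies the borrowed data into an owned
|         // String the first time it's called on a Borrowed.
|         s.to_mut().push('!');
|     }
|
|     fn main() {
|         let mut c: Cow<str> = Cow::Borrowed("hello");
|         exclaim(&mut c);
|         assert_eq!(c, "hello!");
|         assert!(matches!(c, Cow::Owned(_)));
|     }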
|
| However, the difference between String and &str has nothing to
| do with mutability, or whether the data is in a read-only page
| or not. If you have &mut str, you can mutate the values, and if
| you don't declare your String as mut, it's not mutable.
|
| The difference is that String owns its allocation, whereas &str
| is a reference to memory that someone else owns. It's exactly
| the same as Vec vs &[u8], and if you check the source, you'll
| see that String is just a wrapper around Vec<u8>:
| https://doc.rust-lang.org/src/alloc/string.rs.html#278-280
|
| The general principle is that if you own the allocation, you
| can reallocate it to change its size. If you only have a
| reference to memory that someone else owns, you can't do that.
|
| Consider for example Go's slices, which work kind of like you
| describe, where they point to the original array's memory until
| someone grows the array, at which time they might or might not
| make a new allocation. Appending to a Go slice from some inner
| function can suddenly break code that calls it, because the
| slice it's operating on suddenly points to new memory.
|
| Rust's Big Idea is to make ownership and borrowing more
| explicit. Having your default stdlib text type be ambiguous
| about whether it's owning or borrowing is both weird and makes
| things a lot more awkward to deal with.
|
| If you use Cow<str>, you'll see that its API declares that it
| borrows from some source, and can't outlive that source. That's
| fine if what it's borrowing from is static text in the binary,
| but that really constrains what you can do with a string that
| you've dynamically allocated.
|
| Just like all other data structures, having a distinction
| between an owned value and a reference to the value is very
| useful. It's easy to build a variety of shared or ambiguous
| ownership structures on top of owned values and references, but
| it's much more complicated to go the other direction.
| int_19h wrote:
| Where would abc allocate the bytes from, if a is in rodata?
| brundolf wrote:
| I hadn't heard of rodata before, but based on a quick Google, I
| think what you're describing is similar to Cow<str>. I can't
| speak to the reasons why this wasn't made the default, but I
| believe it is at least possible.
| cesarb wrote:
| It's probably the same reason why IIRC the default Rust
| String does not have a "small string optimization": it would
| add a lot of unpredictable branches. And it would be even
| worse, since unlike the "small string optimization", even
| small mutations which don't change the size would have to
| allocate when the original was read-only.
| pornel wrote:
| Another problem is that `&str` is more than just "read-
| only". It also tracks lifetime to prevent use-after-free.
|
| A universal owned-or-readonly-borrowed String would be
| unsafe without also adding reference counting or GC to keep
| the borrowed-from string alive.
| holmium wrote:
| In addition to what the others have said, Rust's `String` is
| three pointer-sized words: `(ptr, len, capacity)`, while a
| `&str` is only two: `(ptr, len)`. So a `String` has 50% more
| overhead than a `&str`.
| devit wrote:
| Cow<'static, str> is exactly what you are asking for.
|
| In general the String type is not very good, and you should use
| something appropriate to your use of strings for any string
| that is used more than a constant number of times in your
| program. That can be, for instance:
|
| - String
|
| - Box<str>
|
| - Cow<'static, str>
|
| - Cow<'a, str>
|
| - Interned strings
|
| - Rc<str> or Arc<str>
|
| - A rope data structure
|
| In fact I think putting String in the standard library might
| have been a mistake since it's almost always a suboptimal
| choice for anything except string builders in local variables.
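|
| For instance, a function can hand back either a borrowed
| literal or a freshly allocated string under one type; a rough
| sketch:
|
|     use std::borrow::Cow;
|
|     fn greeting(name: Option<&str>) -> Cow<'static, str> {
|         match name {
|             // no allocation, points at static data
|             None => Cow::Borrowed("hello, world"),
|             // allocates only when needed
|             Some(n) => Cow::Owned(format!("hello, {}", n)),
|         }
|     }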
| StreamBright wrote:
| More articles like this and I might get started with Rust again.
| Well async is still a problem but at least strings are much
| clearer now after reading this.
| eximius wrote:
| An alternative to `"string".to_owned() + "foo"` is just using a
| macro such as `concat!("string", "other string")`.
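|
| Note that concat! only takes literals and is evaluated at
| compile time, producing a &'static str:
|
|     let s: &'static str = concat!("string", "other string");
|     assert_eq!(s, "stringother string");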
| alexchamberlain wrote:
| Very interesting write up. Would be interested in a comparison to
| a COW string too.
| armSixtyFour wrote:
| This was one of the more confusing parts of Rust when I first
| started using it. I find that I don't necessarily run into it a
| lot. After a while you sort of change how you organize your code
| and you find that you're not fighting strings or the borrow
| checker very often. I'm not really sure how I got to that point
| however, it's more just practice than anything else.
| seoaeu wrote:
| Yeah, in some ways it is kind of weird how much discussion
| there is about Rust's borrow checker compared to how much time
| practitioners actually spend dealing with it. I see less about
| string handling (this post being an exception) but that is also
| basically a non-issue once you get the hang of it.
| brundolf wrote:
| Exactly. It's not _actually_ hard once it clicks, but I think
| a certain subset of newcomers spend lots of time being
| frustrated with the fact that their understanding doesn't
| seem to work, when they could just be given the key bits of
| information and be able to move forward with things making
| sense. That was the motive of this post :)
| nikisweeting wrote:
| Why can't every literal "abc" just be instantiated as a heap
| String by default? You could have a separate notation like
| &"abc" when you want a slice, similar to Python's b"abc",
| r"abc", etc. prefixes. Heap Strings seem much more useful in
| general.
| Daishiman wrote:
| System programming languages avoid allocations and unnecessary
| resource consumption. I'd say it's one of their hallmark
| characteristics.
|
| The programming convenience of higher-level languages comes at
| a very substantial cost of requiring a complex runtime, more
| memory for data structures, unpredictable performance, and
| pushing complexity where it's not visible. One philosophy
| favors visibility over resources, the other favors convenience
| of use.
| brundolf wrote:
| It would be pretty wasteful; in any read-only context you'll
| need a &str anyway, and making them all Strings would cause
| tons of unneeded allocations. Many Rust devs care a lot about
| avoiding unnecessary allocations: some people even use Rust on
| embedded systems that don't allow allocations _at all_ , so
| building allocation into a language fundamental would likely be
| a mistake.
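|
| For a sense of the difference:
|
|     // a literal is a borrowed view of bytes baked into the
|     // binary, so no allocation happens here
|     let s: &'static str = "abc";
|     // turning it into a String copies those bytes to the heap
|     let owned: String = s.to_string();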
| CJefferson wrote:
| One thing I've often wondered about Rust strings. I often hear
| that &str is 'a string slice'. But, Rust has a notation for
| slices -- &[T]. Why are strings the only thing (that I know of)
| that don't use the same slice notation as everything else?
| hansihe wrote:
| In Rust a slice `&[T]` is a fixed length sequence of `T`s laid
| out linearly in memory. Every `T` is required to be of the same
| size.
|
| Strings in Rust are (normally) represented as UTF-8. Both
| `String` and `str` represent data that is guaranteed to be
| valid UTF-8.
|
| This means that if Rust's UTF-8 strings were represented as
| normal slices, they would have to be slices of UTF-8 code
| units.
|
| Rust wants to provide a safe and correct String data type, and
| therefore, indexing a string on a byte (code-unit) level would
| be incorrect behavior.
|
| Having a custom type `String` and `str` instead of just a
| `Vec<u8>` enables you to have more correct behavior implemented
| on top of the data type that doesn't implement normal slice
| indexing and such.
|
| ---
|
| As a note, even though you probably don't want to normally, you
| can quite easily access the backing data of your string using
| `String::as_bytes`
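|
| For example, the byte view and the char view can disagree on
| length:
|
|     let s = String::from("héllo");
|     assert_eq!(s.as_bytes().len(), 6); // "é" is 2 bytes in UTF-8
|     assert_eq!(s.chars().count(), 5);  // but a single char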
| brundolf wrote:
| That's a great question. I don't have a complete answer, but I
| do know that &str has lots of string-specific functionality
| that's really helpful. The .chars() method for example gives
| you an iterator over actual unicode chars, as opposed to bytes,
| because the former can have variable byte-widths. There may be
| other reasons; I'm not sure.
| littlestymaar wrote:
| > But, Rust has a notation for slices -- &[T]
|
| &[T] is an "array-slice", even if it's called just "slice".
|
| See this example ( _[..]_ is the syntax to create a slice of
| something): https://play.rust-
| lang.org/?version=stable&mode=debug&editio...
| edflsafoiewq wrote:
| What would they be, &[u8]? That's already a thing: it's an
| arbitrary byte sequence. &str is specifically UTF-8 data.
|
| &OsStr and &Path are the same way.
| __s wrote:
| Indeed, &str has as_bytes, which returns itself as &[u8].
|
| But str is a subset of [u8]: the type's contract is that it
| must hold valid UTF-8 (it is unsafe for it to contain invalid
| UTF-8 data), hence
| https://doc.rust-lang.org/std/str/fn.from_utf8.html
| can error, offering the unsafe variant
| https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.htm...
|
| This is all very different from &[char], which would be an
| array of 4-byte characters (i.e. a UCS-4 string).
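|
| A quick illustration of that contract:
|
|     let ok = std::str::from_utf8(&[0x68, 0x69]);  // b"hi"
|     let bad = std::str::from_utf8(&[0xFF, 0xFE]); // not UTF-8
|     assert_eq!(ok, Ok("hi"));
|     assert!(bad.is_err());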
| aliceryhl wrote:
| Because there's no possible choice for T. You can't use u8,
| because then you would allow non-utf8 data. You can't use char,
| because a &[char] uses four bytes per character, whereas a &str
| stores the characters in utf-8, which is a variable-width
| encoding.
|
| A &str is really a different kind of thing from other slices.
| In any other slice, each element in the slice takes up a
| constant number of bytes, but this is not the case for a &str.
| mamcx wrote:
| Another: a &str is immutable AND treated as a whole. &[T] is
| parts, and when declared as &mut [T] it is mutable.
|
| Because of Unicode, &[T] makes it easy to write wrong code
| (code that assumes here that T = char).
|
| It CAN'T be char, because char is larger than u8:
|
| https://doc.rust-lang.org/std/primitive.char.html
|
| and it means a Unicode code point.
|
| In other words: Rust is using types to PREVENT the wrong
| behavior.
| pornel wrote:
| Fun fact: `&mut str` exists. You don't get random access, but
| in controlled scenarios it's fine to mutate str in-place,
| e.g. `make_ascii_lowercase`
|
| https://doc.rust-
| lang.org/stable/std/primitive.str.html#meth...
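|
| For example:
|
|     let mut s = String::from("HeLLo");
|     // in-place mutation is fine here because the length and
|     // UTF-8 validity are preserved
|     s.as_mut_str().make_ascii_lowercase();
|     assert_eq!(s, "hello");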
| lmkg wrote:
| The regular slice type &[T] lets you access and manipulate
| individual elements. But Rust strings enforce the invariant
| that they are valid Unicode, which puts restrictions on
| element-wise operations.
|
| Calling &str a "string slice" is really more about the contrast
| with String, and how the relationship there mirrors the
| relationship between &[T] and Vec<T>. It's more of an analogy
| than a concrete description of the interface.
| tomjakubowski wrote:
| `str` contractually guarantees UTF-8 contents, so because of
| multi-byte codepoints it cannot be sliced at arbitrary indexes
| like a `[u8]` can be.
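|
| A small demonstration of why arbitrary byte indexes are
| rejected:
|
|     let s = "héllo";                // 'é' occupies bytes 1 and 2
|     assert_eq!(&s[0..1], "h");      // ok: both ends on boundaries
|     assert!(s.get(1..2).is_none()); // byte 2 is mid-codepoint
|     // &s[1..2] would panic at runtime for the same reason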
|
| As a side note, it is possible to define your own "unsized"
| slice type which wraps `[u8]`. This can be useful for binary
| serialization formats which can be subdivided / sliced into
| smaller data units.
| Const-me wrote:
| > `str` contractually guarantees UTF-8 contents
|
| I don't think any other language does that. Instead, most of
| them implement as much as they can while viewing the storage
| as a blob of UTF-8/UTF-16 bytes/words, and throw exceptions
| from the methods which interpret the data as codepoints.
|
| Strings are used a lot in all kinds of APIs. For instance,
| strings are used for file and directory names. The OS kernels
| don't require these strings to be valid UTF-8 (Linux) or
| UTF-16 (Windows).
|
| To address that use case, the Rust standard library needs yet
| another string type, OsString. This contributes to complexity
| and the learning curve.
| estebank wrote:
| The number of things that you have to learn remains
| constant. I could even make the argument that the number of
| things you need to learn up-front is lowered when only
| talking about the distinction between String and
| OsString/CString. The difference is that rustc will be
| pedantic and complain about all of these cases, asking you
| to specify exactly what you wanted, while other languages
| will fail at runtime.
| Const-me wrote:
| > rustc will be pedantic and complain about all of these
| cases, asking you to specify exactly what you wanted
|
| So, they're offloading complexity onto programmers. Being such
| a programmer, I don't like their attitude.
|
| > other languages will fail at runtime
|
| In practice, other languages usually print squares, or
| sometimes backslash escape codes, for encoding errors in their
| strings. That's not always the best thing to do, but I think
| that's what people want in the majority of use cases.
| pornel wrote:
| The complexity already exists regardless of what the
| language does.
|
| The only choice is whether it's explicit and managed by the
| language, or hidden, so that you need knowledge and experience
| to handle it yourself without the language's help. If you want
| "squares" for broken encoding, Rust has `to_string_lossy()`
| for you. It's explicit, so you won't get that error by
| accident.
|
| Avoiding "mojibake" in other languages is usually a major
| pain. For example, PHP is completely hands-off when it
| comes to string encodings. To actually encode characters
| properly you need to know which ini settings to tweak,
| remember to use mb_ functions when appropriate, and don't
| lose track of which string has what encoding. There's
| internal encoding, filesystem encoding, output encoding,
| etc. They may be incompatible, but PHP doesn't care and
| won't help you.
| Const-me wrote:
| > It's explicit, so you won't get that error by accident
|
| I would want it to be implicit.
|
| Ideally, for rare 20% of cases when I care about UTF
| encoding errors, I'd want a compiler switch or something
| similar to re-introduce these checks, but I can live
| without that.
|
| > For example, PHP
|
| When you compare Rust with PHP it's no surprise Rust is
| better; many people think PHP is a notoriously bad language:
| https://eev.ee/blog/2012/04/09/php-a-fractal-of-bad-design/
|
| I like C# strings the best, but I also have lots of experience
| with C++, and some experience with Java, Objective-C, Python,
| and a few others. None of them expose as many different string
| types to programmers as Rust does; many higher-level languages
| have exactly one string type.
|
| Interestingly, some languages like Swift use similar
| representations internally, but they don't expose the
| complexity to programmers; they manage to provide a
| higher-level abstraction over the memory layout. Compared to
| Rust, that improves usability a lot.
| steveklabnik wrote:
| So, there's like, a few things here. First is, technically they
| both do use the same notation, &T, where T=str and T=[u8]. This
| is the whole "unsized types" thing. &Path is another example of
| this, String : &str :: PathBuf : &Path.
|
| Beyond that though, &[T] implies a slice of Ts, that is,
| multiple Ts in a row. But a &str is a slice of a single string.
| So &[str] would feel wrong; that is, a &str is a slice of a
| String or another &str or something else, but isn't like, a
| list of multiple things. It's String, not Vec<str>.
|
| Basically, Strings are just weird.
| Covzire wrote:
| Are they though? I've long wondered why the Rust team hasn't
| imitated C# or other languages ease of use with strings while
| also retaining the existing functionality for lower-level use
| cases. I suppose it's a kind of gauntlet that a Rust dev
| would have to go through which could be a good thing but
| personally hitting walls with strings really turned me off on
| Rust the first time I tried it simply because my expectations
| were diametrically opposed to the reality of strings in rust.
| adkadskhj wrote:
| Yea but i don't think that's what GP was asking, imo. Rather
| than `[str]`, i think they were asking why it's `str` and not
| `[char]`, no?
|
| Just as `[u8]` is to `Vec<u8>`, `[char]` is hypothetically to
| `Vec<char>`.. and `Vec<char>` is basically a `String`, no?
|
| _edit_: Though looking at the docs, `char` is always 4 bytes,
| so I guess that's where the breakdown would be? `char` would
| need to be unsized I guess, but then it would be an awkward
| `[unsized_char]`, which is like two unsized types... hence
| `str` maybe?
| codys wrote:
| `char` is (internally) a `u32` because it represents any
| single unicode character. `str` is not a `[char]`, because
| rust doesn't store strings as utf-32 (system APIs don't accept
| utf-32, and it tends to waste space in many cases).
|
| `str`'s data layout happens to be `[u8]`, but its type
| provides additional guarantees about the structure of the data
| within its internal `[u8]` (for example, forbidding sequences
| of u8 that don't encode valid utf-8).
| adkadskhj wrote:
| Well yea, i wasn't saying `[char]` _is_ a `str`; rather i was
| positing that the GP comment was asking why it's `str` rather
| than some hypothetical `[unsized_char]`.
|
| I think `char` _would_ work, if it was similarly unsized like
| a single piece of `str`. The problem, as i see it, is that
| `[unsized_char]` seems odd.
| steveklabnik wrote:
| To be extra pedantic, char represents a single Unicode
| Scalar Value.
| erik_seaberg wrote:
| This is a big deal because adding an accent mark to a
| letter (often) means a single char can no longer store
| it. APIs should not orient around isolated scalar values
| or codepoints because most devs will misuse them, not
| being experts on combining and normalization.
| [deleted]
| Blikkentrekker wrote:
| `str` is not `[char]`; there is no possible datatype `char`
| for which this would hold, and the name is already taken.
|
| `str` is not a slice; this is itself already a wrong
| statement. A slice is a dynamically sized type, a region of
| memory that contains any nonnegative number of elements of
| another type.
|
| A `str` is dynamically sized, but is not guaranteed to contain
| a succession of elements of any particular type. It's simply a
| dynamically sized sequence of bytes guaranteed to be _UTF-8_.
|
| `strs` aren't slices; all they have in common with them is
| that they are both dynamically sized types.
|
| `Vec<char>` is also not the same as `String`; a string is
| not a vector of `char`, which is already a type that has a
| size of 4 bytes.
|
| This all results from the fact that _UTF-8_ is a variable
| width encoding, and since slices are homogeneous, all elements
| have the same size.
| aliceryhl wrote:
| Yes, the reason you can't use char here is that a char is
| always 4 bytes, so a &[char] is a type that already exists,
| and that type uses four bytes per character.
| dralley wrote:
| I wish they had been named &path and Path instead, it would
| feel more consistent than String, &str, PathBuf, &Path
| steveklabnik wrote:
| I joke the only way that I'll sign off on Rust 2.0 is if we
| get to do the exact opposite, rename String to StrBuf,
| haha! (Same reason though, for consistency.)
| [deleted]
| CJefferson wrote:
| Related question (if you don't mind another).
|
| Why is 'str' a "primitive type"? What about 'str' means it
| has to be primitive, instead of being a light-weight wrapper
| around a '&[u8]' (that obviously enforces UTF8 requirements
| as appropriate).
| yazaddaruvala wrote:
| You can't just index into a string:
|
|     let hello: &[u8] = "Hello".as_bytes();
|     // hello[1] is just the byte 101 (b'e'), not a char, and
|     // for non-ASCII text an index like this can land in the
|     // middle of a multi-byte code point.
|
| To use &[u8] would be _very_ non-ergonomic.
| SAI_Peregrinus wrote:
| One could, in theory, make an ExtendedGraphemeCluster
| type, and make str a slice of ExtendedGraphemeClusters.
| So &[ExtendedGraphemeCluster] could be indexed into
| without having things not make sense. Of course that's
| much more complicated than most other primitives, and
| most people don't have any idea what an Extended Grapheme
| Cluster even _is_. But since they're the Unicode notion
| that most naturally maps to a "character" you could just
| call the type Character or char, and confuse the hell out
| of the C programmers by having a variable-width char
| type.
| josephg wrote:
| Sure - but iterating by extended graphemes isn't the only
| thing you want to do with strings. Sometimes you want to
| treat them as a bunch of UTF-8 bytes. Sometimes you want
| to iterate / index by Unicode code points (eg for CRDTs).
| And sometimes you want to render them, however the system
| fonts group them.
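|
| A tiny example of those different views of the same text:
|
|     // "e" followed by U+0301 (combining acute) renders as "é"
|     let s = "e\u{0301}";
|     assert_eq!(s.len(), 3);           // UTF-8 bytes
|     assert_eq!(s.chars().count(), 2); // Unicode scalar values
|     // with the unicode-segmentation crate,
|     // s.graphemes(true).count() would be 1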
|
| It makes sense to have a special type because it can support
| all of this stuff through separate methods on the type (or
| through nearby APIs). It's confusing, but I
| think it's the right choice.
|
| Although, I think the most confusing thing about rust
| strings isn't that &str isn't &[u8]. It's that &str isn't
| just &String or something like that.
| steveklabnik wrote:
| Well, &str is the type of string literals, so on some
| level, it has to be special.
|
| There was a PR implementing it as a wrapper around &[u8],
| but it didn't really provide any actual advantages, so it
| was decided to not do that.
|
| https://github.com/rust-lang/rust/pull/19612
| nicklecompte wrote:
| In particular: strings aren't actually simple arrays of
| characters in Rust like they are in C, but there is an
| underlying array of bytes (on the heap, for String), and the
| notion of "slicing" that array still makes sense semantically.
| knodi123 wrote:
| > strings aren't actually simple arrays of characters in
| Rust
|
| Are you talking about them being a container with a pointer
| to the actual array, and also a size and etc?
| Skunkleton wrote:
| char is usually some single byte of data. Characters can
| be multiple bytes. Slicing a string on character
| boundaries is more coarse grained than slicing on byte
| boundaries.
| estebank wrote:
| char in Rust is 32bits, so it has a 1:1 mapping to
| Unicode glyphs. You might also want to care about
| grapheme clusters, but those are not part of the stdlib.
| toast0 wrote:
| Unicode codepoints; glyphs may take multiple codepoints.
| estebank wrote:
| You are, of course, correct.
| nicklecompte wrote:
| Right - the point is that &str isn't syntactic sugar /
| alias / etc for &[u8] and it would be confusing to have a
| notation that suggested otherwise.
| seoaeu wrote:
| In C, strings can hold invalid unicode. However, in Rust
| a str is guaranteed to be valid utf-8.
|
| For added confusion, Rust has a `char` type which is
| actually 32-bits. You can create arrays of them, but the
| resulting string would be in utf-32 and thus incompatible
| with the normal `str` type.
| Animats wrote:
|     let s = "hello world".to_string();
|     for ch in s.chars() {
|         print!("{}", ch);
|     }
|
| will iterate through a string character by character. That's
| the most common use of the "char" type - one at a time, not
| arrays of them.
|
| Although the proper grapheme form is:
|
|     use unicode_segmentation::UnicodeSegmentation; // 1.7.1
|     let s = "hello world".to_string();
|     for gr in UnicodeSegmentation::graphemes(s.as_str(), true) {
|         print!("{}", gr)
|     }
|
| This will handle accented characters and emoji modifiers. A
| line break in the middle of a grapheme will mess up output.
|
| By the way, open season for proposing new emoji starts
| tomorrow.[1]
|
| [1]
| http://blog.unicode.org/2020/09/emoji-150-submissions-re-
| ope...
| yazaddaruvala wrote:
| Are there potentially other situations where `&[T + !Sized]`
| makes sense?
|
| The majority of the functions on `&str` seem to make sense
| for all `&[T + !Sized]` where `type str = [unsized_char]`.
| steveklabnik wrote:
| I don't really know what unsized_char would even mean,
| chars have a size, and str is not a sequence of chars.
|
| That said, I'm also not sure in the general case.
| kzrdude wrote:
| "string slice" is just the name of the one thing, and "slice"
| the name of the other, and they are different things with
| similar names and related features.
| geodel wrote:
| > It's a testament to Rust's breadth and accessibility that even
| people who have never done low-level programming before are
| giving it a try.
|
| Umm, no, I think it is a testament to Rust evangelism that
| people start using Rust even where it is least appropriate. So
| we have folks who start coding five web pages' worth of
| project in Rust because, you know, low-level control, high
| performance, next generation, secure software, etc.
| zabzonk wrote:
| > "foofoo" takes up twice as much space in memory as "foo".
|
| Not in the imaginary language you keep talking about as "C/C++".
| pornel wrote:
| Right; in case they're heap allocated, the worst-case
| alignment requirement of malloc, as well as free() not taking
| a size, is likely to force implementations to round both up to
| at least 8 or 16 bytes.
| ncmncm wrote:
| No. The result of std::string("foo") is _exactly_ the same
| size as that of std::string("foofoo"). They take up exactly
| the same number of bytes in the free store, on the stack, in a
| hash table, what have you.
| pornel wrote:
| We just interpret "space in memory" differently.
|
| The {ptr,len,cap} part may be fixed-size, but you also need
| to hold the letters somewhere in memory.
|
| If you try to create std::string("verylong...") with more
| characters than you have memory, you run out of memory, so
| it "takes up space in memory".
|
| Bonus fact: *"foo" and *"foofoo" in Rust actually take
| different amounts of memory even when using your definition.
| One dereferences to 3 bytes, the other to 6 (not a pointer).
| zabzonk wrote:
| I actually (and pedantically) meant that in neither C nor C++
| is it true that "foofoo" takes up double the size of "foo" -
| both will have a zero character at the end.
| rectang wrote:
| If you've been handling Unicode properly in other languages, then
| Rust strings seem _easy_ in comparison.
|
| * All the 2-byte-char languages which were designed for UCS-2
| before the Unicode Consortium pulled the rug out from underneath
| them and obsolesced constant-width UCS-2 in favor of variable-
| width UTF-16.
|
| * Languages which result in silent corruption when you
| concatenate encoded bytes with a string type (e.g. Perl but there
| are many examples.)
|
| * C, where NUL-terminated strings are the rule, and the standard
| library is of no help and so Unicode string handling needs to be
| built from scratch.
|
| All those checks which you have to fight to opt into, defying
| both the language and other lazy programmers (either inside your
| org, or at an org which develops dependencies you use)? Those
| checks either happen automatically or are _much_ easier to use
| without making mistakes in Rust.
| ajross wrote:
| Alternatively: if you have been handling Unicode and using wide
| characters, you have _not_ been handling Unicode properly.
|
| Obviously the world is a big place and there is room for lots
| of paradigms and worldviews and we aren't supposed to judge too
| much.
|
| But come on. If new code isn't working naturally in UTF-8 in
| 2021 then it's wrong, period.
| estebank wrote:
| > if you have been handling Unicode and using wide
| characters, you have not been handling Unicode properly.
|
| Paradoxically, trying to do "the right thing" and being an
| "early adopter" of (the now called) UCS-2 was a "mistake", as
| both Java and Windows can attest, by getting "stuck"
| supporting the worst possible Unicode encoding ad-infinitum.
| UTF-8 is the "obviously correct" choice (from the hindsight
| afforded by us talking about this in 2021).
|
| I still find it funny that emojis of all things are what
| actually got the anglosphere to actually write software that
| isn't _completely_ broken for the other 5.5 billion people
| out there.
| spamizbad wrote:
| It's the edge-cases that get you.
| diroussel wrote:
| My understanding is that UTF-8 is not a good representation
| for non-European alphabets.
|
| So do you think UTF-8 is always the best internal string
| representation? Or just for English speakers?
|
| For Mandarin, what would be optimal?
| klodolph wrote:
| So, the advantage of UTF-16 is that CJK text will use 33%
| less space.
|
| Does this mean that "UTF-8 is not a good representation for
| non-European alphabets?" It may be less efficient but the
| difference does not seem shocking to me, considering that
| for most applications, the storage required for text is not
| a major concern--and when it is, you can use compression.
| rectang wrote:
| Mandarin is an interesting case. Most of the Han characters
| used by Mandarin fall within the basic multilingual plane
| and thus occupy 2 bytes in UTF-16 but 3 bytes in UTF-8.
| However, for web documents, most markup is ASCII which is
| only one byte. So for Mandarin web documents, the space
| requirements for UTF-8 and UTF-16 are about a wash.
|
| When you add in interoperability concerns, since so much
| text these days is UTF-8, for Mandarin at least UTF-8 is a
| perfectly defensible choice.
|
| (A harder problem is Japanese -- Japan really got screwed
| over with Han unification, so choosing Shift-JIS over any
| Unicode encoding is often best.)
|
| FWIW I covered the space requirements of various encodings
| and various languages in this talk for Papers We Love
| Seattle:
|
| https://www.youtube.com/watch?v=mhvaeHoIE24&t=39m14s
| amelius wrote:
| I genuinely wonder: is the space requirement of text
| encodings really an important issue in this age of large
| photo and video content?
| magicalhippo wrote:
| > if you have been handling Unicode and using wide
| characters, you have not been handling Unicode properly.
|
| How so? Delphi for example has wide character-based strings
| as default, what's wrong with that?
| josephg wrote:
| Wide-character-based strings have a .length field which is
| easy to reach for and never what you want, because its value
| is meaningless:
|
| - It isn't the number of bytes, unless your string only
| contains ASCII characters. Works in testing, fails in
| production.
|
| - It isn't the number of characters because 16 bits isn't
| enough space to store the newer Unicode characters. And
| even if it could, many code sequences (eg emoji) turn
| multiple code points into a single glyph.
|
| I know all this, and I still get tripped up on a regular
| basis because .length is _right there_ and works with
| simple strings I type. I have muscle memory. But no, in
| javascript at least the correct approaches require thought
| and sometimes pulling in libraries from npm to just make
| simple string operations be correct.
|
| Rust does the right thing here. Strings are UTF-8
| internally. They check the encoding is valid when they're
| created (so you always know if you have a string, it is
| valid). You have string.chars().count() and other standard
| ways to figure out byte length and codepoint length and all
| the other things you want to know, all right there, built
| into the standard library.
| estebank wrote:
| The reasoning behind using UTF-16/UCS-2 is that then you
| can plug your ears and treat 1 char == 1 user-visible glyph
| on the screen, so programmers that acted as if ASCII was
| the only encoding in existence could continue treating
| strings in the same way (using their length to calculate
| their user-visible length, indexing directly on specific
| characters to change them, etc).
|
| All of those practices became immediately wrong once Unicode
| outgrew 16 bits and UTF-16 became a variable-length encoding.
| But even if _that_ hadn't happened, what you want to be
| operating on is _not_ characters, but grapheme clusters, which
| are sequences of chars. Otherwise you won't handle the
| distinction between a precomposed é and an e plus a combining
| accent, or emoji, correctly.
| magicalhippo wrote:
| But how is that different from the underlying encoding
| being UTF-8?
|
| edit:
|
| For example, we do a lot of string manipulation in
| Delphi. We might split a string in multiple pieces and
| glue them together again somehow. But our separators are
| fixed, say a tab character, or a semicolon. So this
| stitching and joining is oblivious to whatever emojis and
| other funky stuff that might be in between.
|
| How is this doing it wrong?
|
| I mean yea sure you CAN screw it up by individually
| manipulating characters. But I don't see how an UTF-8
| encoded string _in itself_ prevents you from doing the
| same kind of mistakes.
| josephg wrote:
| Splitting and glueing is fine. But imagine 3 systems:
| system A is obviously wrong. It crashes on any input.
| System B is subtly wrong. It works most of the time, but
| you're getting reports that it crashes if you input
| Korean characters and you don't know Korean or how to
| type those characters. System C is correct.
|
| Obviously C is better than A or B, because you want
| people to have a good experience with your software. But
| weirdly, system A (broken always) is usually better than
| system B (broken in weird hard to test ways). The reason
| is that code that's broken can be easily debugged and
| fixed, and will not be shipped to customers until it
| works. Code that is broken in subtle ways will get
| shipped and cause user frustration, churn, support calls,
| and so on.
|
| The problem with UCS-2 is it falls into system B. It
| works most of the time, for all the languages I can type.
| It breaks with some inputs I can't type on my keyboard.
| So the bugs make it through to production.
|
| UTF-8 is more like system A than system B. You get
| multibyte code sequences as soon as you leave ASCII, so
| it's easier to break. (Though it really took emoji for
| people to be serious about making everything work.)
| barrkel wrote:
| I was part of that. Delphi has all the string types you
| want, since you can declare your preferred code page.
| String is an alias for UnicodeString (to distinguish from
| COM WideString) and is UTF-16 for compatibility with Win32
| API more than anything. UTF-8 would have meant a lot more
| temporaries and awkward memory management.
| magicalhippo wrote:
| All in all, while the Unicode transition took its time, I
| must admit it was very smooth when it did happen.
|
| At work we have a codebase that does a lot of string
| handling. Both in reading and writing all kinds of text
| files, as well as doing string operations on entered
| data. Several hundred kLOC of code across the project.
|
| We had one guy who spent less than a week of wall time to move
| the whole project, and the only issue we've had since is
| when other people send us crappy data... if I got a
| dollar for each XML file with encoding="utf-8" in the
| header and Windows-1252 encoded data we've received I'd
| have a fair fortune.
| shadowgovt wrote:
| This is my way of thinking about the topic these days. It's not
| that strings are more complicated in Rust than in other
| languages, it's that a lot of the other low-level languages are
| presenting an abstraction that assumes implicitly that a string
| is some type of sequence of uniform-sized cells, one cell per
| character, and that representation was an artifact of a
| specific time in computational history. It's like many other
| abstractions those languages provide... Seemingly simple at
| first glance, but if you do the details wrong you're just going
| to get undefined behavior and your program will be incorrect.
|
| Languages that don't expose strings as that abstraction are, in
| my humble opinion, more reflective of the underlying concept in
| the modern era.
| alerighi wrote:
| All of this is true, IF you assume that you want Unicode
| strings. Especially in systems/embedded software (the kind of
| software that Rust is targeting), you often don't really care
| about Unicode and can simply treat strings as arrays of bytes.
|
| And I live in a country where you usually use Unicode
| characters. But for the purpose of the software that I write, I
| mostly stick with ASCII. For example I use strings to print
| debug messages to a serial terminal, or read commands from the
| serial terminal, or to put URL in the code, make HTTP requests,
| publish on MQTT topics... for all of these application I just
| use ASCII strings.
|
| Even if I have to represent something on the screen... as long
| as I have a compiler that supports Unicode as input files (all
| do these days) I can put Unicode string constants in the code
| and even print them on screen. It's the terminal (or the GUI I
| guess, but I don't write software with a GUI) that translates
| the bytes that I send on the line as Unicode characters.
|
| And yes, of course the length of the string doesn't correspond
| to the characters shown on the screen... but even with Unicode
| you cannot say that! You can count (and that's what Rust does)
| how many Unicode code points you have, but a character can be
| made of multiple code points (stupid example: an emoji with a
| skin-tone modifier is the base emoji followed by a modifier
| code point).
|
| So to me it's pointless, and I care more about knowing how many
| bytes a string takes and being able to index the string in
| O(1), or take pointers in the middle of the string (useful when
| you are parsing some kind of structured data), and so on.
|
| In conclusion Rust is better when you have to handle Unicode
| strings, but most applications don't have to handle them. And
| by handling them I don't mean passing them around as a black
| box, not caring what they contain (yes, in theory you should
| care about not truncating the string in the middle of a code
| point when truncating strings... in reality, how often do you
| truncate strings?)
| jasonhansel wrote:
| My way of thinking about this (which I _think_ is correct) is:
| "&str" is like "&[u8]", except that "&str" is guaranteed to
| contain valid UTF-8.
| jstanley wrote:
| > Then if your program later says "actually that wasn't enough,
| now I need Y bytes", it'll have to take the new chunk, copy
| everything into it from the old one, and then give the old one
| back to the system.
|
| This is mostly true. If you get lucky, there may already be
| enough unused space past the end of the existing allocation, and
| then realloc() can return the same address again, no copying
| required.
|
| But if you know you're going to be doing lots of realloc() (and
| you're not unusually tightly memory-constrained) then instead
| of growing by 1 byte each time, it's often worth starting with
| some sensible minimum size and doubling the allocated size each
| time you need more space. That way you "waste" O(N) memory, but
| only spend O(N) total time copying the data around instead of
| O(N^2).
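|
| A rough sketch of the shape of that policy (a hypothetical
| helper; Vec and String already do this bookkeeping for you):
|
|     fn next_capacity(len: usize, cap: usize, extra: usize) -> usize {
|         let needed = len + extra;
|         if needed <= cap {
|             cap                         // still fits, no realloc
|         } else {
|             needed.max(cap * 2).max(16) // double, with a floor
|         }
|     }
|
|     fn main() {
|         assert_eq!(next_capacity(10, 16, 4), 16); // no reallocation
|         assert_eq!(next_capacity(16, 16, 1), 32); // double, not +1
|     }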
| axiosgunnar wrote:
| Funnily enough this exact topic came up on Hackernews just
| recently when a Googler started benchmarking AssemblyScript (a
| language for WebAssembly) and realized that AssemblyScript was
| increasing the size by +1 instead of doubling when
| reallocating...
|
| Here is the HN thread:
| https://news.ycombinator.com/item?id=26804454
| sumtechguy wrote:
| I personally use a 'stride' instead of doubling. On small sizes
| doubling works OK. But when you get past about 8k-16k the empty
| memory starts to stack up.
| koverstreet wrote:
| Never do this - you're introducing a hidden O(n^2).
|
| Folks, this is why you take the algorithms class in CS.
| axiosgunnar wrote:
| Could you elaborate? This seems very interesting.
| magicsmoke wrote:
| Asymptotically, there's no difference between allocating
| an n+1 buffer and an n+k buffer before copying your old
| data in. You'll still get O(n^2).
|
| In reality, it depends on the data you're handling. You
| may never end up handling sizes where the O(n^2)
| asymptote is significant and end up wasting memory in a
| potentially memory constrained situation. At the end of
| the day, it all depends on the actual application instead
| of blind adherence to textbook answers. Program both,
| benchmark, and use what the data tells you about your
| system.
|
| If I've got a 500 MB buffer that I append some data to
| once every 5 minutes, I might want to reconsider before I
| spike my memory usage to fit a 1 GB buffer just to close
| the program 15 minutes later.
| deathanatos wrote:
| The O(n^2) here is the time spent copying the data; it's not
| about the size of the buffer, or that you'll temporarily use
| 2x the space. The program would die by becoming unusably slow.
|
| Take your 500 MB example, and say we start with a 4 KiB
| buffer. If we grow it by a constant 4 KiB each time it runs
| out of space, by the time the buffer is 500 MiB we've copied
| ~30 TiB of data. If instead we grow the buffer by doubling it,
| we will have had to copy ~1000 MiB (0.001 TiB) by the time it
| hits 500 MiB, a difference of ~30,000x. (Which is why the
| program would slow to a crawl.)
| magicsmoke wrote:
| Yes I'm aware of how the algorithm works. I also know
| that if I allocated 500 MiB at the beginning of the
| program expecting my memory usage to be roughly that
| size, and my prediction was off by 50 MiB maybe I don't
| want to go hunting for another 500 MiB of space before my
| program ends or I stop using the buffer and free it.
|
| But your point about the virtual memory makes that moot
| anyways. Thank god for modern OSes. I've clearly been
| spending too much time around microcontrollers.
| kaslai wrote:
| Except on non-embedded platforms, oftentimes large blocks of
| allocated memory aren't occupying physical memory until you
| write to them. There's not much reason to avoid using
| exponential buffer growth on a system with a robust virtual
| memory implementation.
| brundolf wrote:
| > and then realloc() can return the same address again, no
| copying required
|
| Interesting, I didn't know that!
|
| Regardless, though, I intentionally skimmed over certain
| nuances like exponential buffer growth for the sake of the main
| point
| irrational wrote:
| I love articles like this. I imagine this is the kind of stuff
| you learn if you study CS or Software Engineering in college.
| Maybe when I retire I will go and get a CS degree so I can learn
| all the things I should have learned before I began working
| professionally as a programmer 25 years ago.
| brundolf wrote:
| We used C++ throughout most of my CS program, and I hated it at
| the time and never wanted to write it ever again (and
| haven't!), but good lord did it help me understand how
| computers work. I've benefitted from that perspective ever
| since.
|
| I'm not sure you have to do a whole degree to get that
| perspective, though. Just learning C or C++ should get you a
| good chunk of it.
| v8dev123 wrote:
| > understand how computers work
|
| Nobody understands it all the way down to electron orbitals.
| Your understanding will be wrong in the next decade if it
| isn't wrong already; computer architectures are proprietary.
|
| Give C++17 a try.
|
| Zero-cost abstractions, RAII.
|
| It's a beast.
| amilios wrote:
| Nah, I wish. You just learn how to implement quicksort in Java
| and stuff like that. The only languages I saw during my degree
| were Java, C, Python, Perl, and a little bit of Prolog. And I
| finished my Bachelor's last year.
| jeltz wrote:
| Hm, I feel Rust's strings are the easiest I have worked with in
| any programming language, but that might be due to my knowledge
| of Unicode and of C.
|
| The only thing which surprised me was the &str type. Why isn't it
| just an ordinary struct (called e.g. Str) consisting of a
| pointer/reference and a length?
| TheCoelacanth wrote:
| Rust's strings are one of the easiest to use correctly and one
| of the hardest to use incorrectly, but the vast majority of
| string handling is done incorrectly, so that makes Rust seem
| hard.
| steveklabnik wrote:
| I linked to a PR above which implemented this idea, and was
| rejected.
| kimundi wrote:
| With dynamically sized types like `str`, Rust lets you
| separate "how to access data behind a pointer" from "what data
| is behind the pointer". So you can for example have the types
| `str`, `[T]` or `Path`, and can have them behind the pointer
| types `&T`, `Box<T>` or `Arc<T>`.
|
| If Rust had defined a special struct `Str` for `&str`, then it
| would have to define special structs for all the combinations
| possible: Str, ArcSlice, BoxPath, etc...
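|
| For example, the same unsized str can sit behind several
| pointer types:
|
|     use std::sync::Arc;
|
|     let borrowed: &str = "hi";              // borrowed view
|     let boxed: Box<str> = "hi".into();      // owned, no capacity
|     let shared: Arc<str> = Arc::from("hi"); // shared, refcounted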
| brundolf wrote:
| Yeah, this article was specifically aimed at people who
| _haven't_ worked with C/C++, and instead have the higher-level
| mental model for what strings are and how they work.
| alkonaut wrote:
| I wish more languages had primitives for "simple strings". Most
| of the time when I use a string I could live with a restriction
| that it's ascii only and can fit in 64bytes. "Programming string"
| vs "human string". For example a textual value of some symbol or
| a name of a resource file I can control myself or a translation
| lookup key (which I can make sure is always short and ascii). An
| XML element name or json property in an enormous file with a
| schema I control. It seems weird to use the same type for the
| "human string" e.g. user input, a name, the value _looked up_
| from the translation key and so on. For the simple strings it
| _feels_ wasteful to make heap allocations, to use two bytes
| per char (e.g. C#), or to worry about encoding in a UTF-8
| string (e.g. Rust).
| brundolf wrote:
| You can pretty much do this in Rust:
|
|     let simple_str: &[u8] = b"hello world";
|     println!("{}", simple_str[6] as char); // 'w'
|
| Though I would advise against it in most cases, because even
| many "non-human" formats like XML and JSON do allow for Unicode
| characters
| cdcarter wrote:
| This is, as I understand it, how Symbols work in Ruby. They are
| prefixed with a colon, and are interned and immutable.
| bsder wrote:
| I would say UTF-8, but I really do miss old-school Pascal
| strings (aka strings with a length field and a _fixed_
| allocation) sometimes.
|
| Pascal strings could _automatically_ #[derive] a whole bunch of
| the Rust traits (Copy, Clone, Send, Sync, Eq, PartialEq, ...)
| that would help sidestep a whole bunch of ownership issues when
| you start throwing strings around in Rust.
|
| The downside would be that you would occasionally get a runtime
| panic!() if your strings overflowed.
|
| Sometimes, I can live with that. Embedded in particular would
| mostly prefer Pascal strings.
|
| I suspect that Rust is powerful enough to create a PString type
| like that and actually fit it into the language cleanly. The
| lifetime annotations may be the trickiest part (although--maybe
| not as everything is a fixed size).
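|
| A rough sketch of the idea (a hypothetical type; crates such
| as arrayvec's ArrayString offer polished versions of this,
| and it assumes N <= 255, like classic Pascal strings):
|
|     #[derive(Copy, Clone, PartialEq, Eq)]
|     struct PString<const N: usize> {
|         len: u8,
|         buf: [u8; N],
|     }
|
|     impl<const N: usize> PString<N> {
|         fn new(s: &str) -> Self {
|             // runtime panic!() on overflow, as noted above
|             assert!(s.len() <= N, "overflow");
|             // zero-fill so the derived Eq compares consistently
|             let mut buf = [0u8; N];
|             buf[..s.len()].copy_from_slice(s.as_bytes());
|             PString { len: s.len() as u8, buf }
|         }
|
|         fn as_str(&self) -> &str {
|             std::str::from_utf8(&self.buf[..self.len as usize])
|                 .unwrap()
|         }
|     }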
| brundolf wrote:
| > Pascal strings could automatically #[derive] a whole bunch
| of the Rust traits (Copy, Clone, Send, Sync, Eq, PartialEq,
| ...)
|
| Worth noting that the explicitness of #[derive] was a design
| decision; particularly when exposing a library API, it's good
| to have control over the set of interfaces you actually
| support, so that (for example) if one of them stops being
| derivable later you won't break people's code downstream
| dralley wrote:
| This is a great article! The only omission is that you can
| concatenate two &str's at compile time using concat!().
| brundolf wrote:
| I didn't know about that! It's pretty interesting, though also
| fairly niche, because the most common use case is a string
| addition where at least one member is a variable, not a
| literal.
| Animats wrote:
| He's working too hard. You don't need all those type
| declarations.
|
|     let a = "hello".to_string();
|     let b = "world".to_string();
|     let c = a + " " + &b;
|     println!("Sentence: {}", c);
|
| It's a bit confusing that you need "&b". "+" was defined as
| String + &str, which is somewhat confusing but convenient.
|
| _If you 've been handling Unicode properly in other languages,
| then Rust strings seem easy in comparison._
|
| Yes. C is awful. C++ is still sometimes UTF-16. C#/.net is still
| mostly UTF-16. Windows is still partly in the UTF-16 era. So is
| Java. So is Javascript. Python 2 came in 2-byte and 4-byte
| versions of Unicode. The UTF-16 systems use a "surrogate pair"
| kludge to deal with characters above 2^16. CPython 3 has 1, 2,
| and 4 byte Unicode, and switches dynamically, although the user
| does not see that.
|
| Linux and Rust are pretty much UTF-8 everywhere now, but everyone
| else hasn't killed off the legacy stuff yet.
| brundolf wrote:
| The type declarations are there for maximum clarity, since the
| inferred types may not be obvious to the target audience
| jdmichal wrote:
| Yes, please. One of my biggest peeves are posts that are
| meant to be educational, but yet don't define the types used
| for variables nor define the namespace / package they are
| from. Java posts are rife with the latter, and I'm really not
| looking forward to the `var` keyword making the former a
| thing too. And quite a few get bonus points for doing such on
| types that require a new dependency -- but good luck figuring
| out which dependency without knowing the dependency name nor
| the package name of the class!
| Someone1234 wrote:
| This article was absolutely fantastic.
|
| It kind of lost me right up until the first example code, where I
| did in fact have different expectations. Then how it broke down
| _why_ , and gave different potential solutions was just
| wonderful. I learned a lot.
|
| I will add though that other languages have a distinction
| between a constant and a string object too (at least under the
| hood), they just go to great lengths to hide it from
| programmers. For example a const + const concatenation might
| do the same thing as Rust under the hood, but it is
| transparent. Rust seems like it requires the extra steps
| because it wants the programmer to make a choice here (or more
| specifically, to stop them from making mistakes: like having
| immutable data _stay_ immutable, rather than automatically
| converting to a mutable string, including the data-copy cost
| that involves; automatic conversion is more convenient but
| also a performance footgun).
|
| I don't think Rust is wrong, I think it is opinionated and
| honestly as someone that like immutability I kinda dig it.
| brundolf wrote:
| Glad you enjoyed it :)
|
| And yeah, under the hood those other languages do all kinds of
| wild optimizations with their "immutable" strings like sharing
| substrings between different strings and pooling to reduce
| allocations. I intentionally left out those nuances because
| from the user's perspective, those are all implementation
| details (even if they can surface in the form of performance
| changes).
| ncmncm wrote:
| It is well-written, where it treats Rust, but almost lost me,
| too, for its use of "C/C++", treating the two very different
| languages as if they were trivial variations. Where string
| handling is concerned, as in so many other places, they are
| fundamentally different.
|
| This "C/C++" bad habit is very commonly used, around Rust, to
| slyly imply there is no effective difference, but in a way that
| permits an injured response to criticism that "it just means 'C
| or C++'". But _it doesn't_, unless you are talking about object
| file symbols or compiler optimizers, and often enough not even
| then. What it does do is encourage sloppy thinking and
| resultant falsehoods. These falsehoods show up in the article,
| revealed if you change "C/C++" to "C or C++" in each case.
|
| In several places it says just "C++", yet is still talking
| about C. It is OK not to know C++; many don't. Things are hard
| enough without falsehoods.
| brundolf wrote:
| The only thing I claimed about the two as a category was that
| "strings are not immutable, they're data structures" (which
| applies to Rust too). I purposely didn't go into much more
| detail than that because it wasn't really the point of the
| article. I did mention that C works with strings as raw char
| arrays, and C++ has a struct around a char array that manages
| length-tracking and reallocation automatically.
|
| I believe these two statements are accurate, though I'm happy
| to be corrected if they aren't. It's been a few years since I
| wrote C++. Beyond that, I see my claims as "abstract" and not
| "falsehoods".
| ncmncm wrote:
| Amusingly, Rust strings, whether &str or String, are unable to
| represent filenames, which in many, many programs is the
| overwhelmingly most common use for character sequences that
| people want to call strings.
|
| The Rust people invented the wonderful "WTF-8" notion to talk
| about these things. It gets awkward when you want to display a
| filename in a box on the screen because those boxes like to hold
| actual strings _qua strings_ , not these godforsaken abominations
| that show up in file system directory listings.
|
| Handling WTF-8 will take a whole nother article. I don't know a
| name for WTF-8 sequences; I have been calling them sthrings,
| which is hard to say, and awkward, but that is kind of
| appropriate to the case.
| int_19h wrote:
| It doesn't really make things any more complicated than they
| already are. If you take a filename in C, and then have to
| display it somewhere, you're facing the same problem - except
| that you might not even be aware of it, because all types are
| the same, and you won't notice the problem unless you happen to
| run into an unprintable filename.
|
| Rust is doing the right thing here by forcing developers to
| deal with those issues explicitly, rather than sweeping them
| under the rug. The real issue is filenames that aren't proper
| strings - i.e. an OS/FS design defect - but this ship has
| sailed long ago.
| ncmncm wrote:
| That sthrings are equally as awkward to handle correctly in
| C, C++, Python, Ruby, Javascript, or Perl, as in Rust makes
| them no less awkward.
|
| Nobody said Rust has done anything wrong to ban them from
| String.
| brink wrote:
| Nicely written!
|
| I wish I had this article when I first started Rust. Would have
| saved me some trouble.
| sequoia wrote:
| Incredibly clear and compassionate writing (callout boxes throw a
| bone to readers who aren't well versed in concepts like the heap,
| character arrays etc.).
|
| Big kudos to the author!
| brundolf wrote:
| Thanks :)
| skybrian wrote:
| This is a great intro, but clarifying one more thing might be
| useful: how do you return a string?
| brundolf wrote:
| A String can be returned like any other owned value; whether or
| not a &str can be returned depends on lifetimes, as it does
| with any other reference.
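|
| A tiny sketch of the two cases:
|
|     // returning an owned String is always fine
|     fn uppercased(s: &str) -> String {
|         s.to_uppercase()
|     }
|
|     // returning a &str ties the result to the input's
|     // lifetime (elided here, it's the same as `s`)
|     fn first_word(s: &str) -> &str {
|         s.split_whitespace().next().unwrap_or("")
|     }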
|
| Lifetimes seem out of scope for this post, and the lifetimes
| story for strings isn't really strings-specific enough that it
| felt important to cover. There are other resources out there
| that thoroughly cover the topic of lifetimes; in fact I wrote a
| short summary myself :)
|
| https://www.brandons.me/blog/favorite-rust-function
___________________________________________________________________
(page generated 2021-04-14 23:01 UTC)