[HN Gopher] You probably don't need to validate UTF-8 strings
       ___________________________________________________________________
        
       You probably don't need to validate UTF-8 strings
        
       Author : jakobnissen
       Score  : 41 points
       Date   : 2024-05-16 18:21 UTC (4 hours ago)
        
 (HTM) web link (viralinstruction.com)
 (TXT) w3m dump (viralinstruction.com)
        
       | remram wrote:
       | > In Rust, strings are always valid UTF8, and attempting to
       | create a string with invalid UTF8 will panic at runtime:
       | 
       | > [piece of code explicitly calling .unwrap()]
       | 
       | You misspelled "returns an error".
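        | 
        | A minimal sketch of the difference (standard library API; my
        | example, not the article's):
        | 
        |     fn main() {
        |         let bytes = vec![0xf0, 0x28, 0x8c, 0x28]; // not UTF-8
        |         // from_utf8 returns a Result; it only panics if the
        |         // caller opts in with .unwrap().
        |         match String::from_utf8(bytes) {
        |             Ok(s) => println!("valid: {s}"),
        |             Err(e) => println!("invalid UTF-8: {e}"),
        |         }
        |     }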
       | 
       | It might be worth considering Python, where the most central
       | change from 2 to 3 was that strings would now be validated UTF-8.
       | I don't understand why it gets discarded with "it was designed in
       | the 1990's" when that change happened so recently.
        
         | lucb1e wrote:
         | There are countries where Python 3's stable release is old
         | enough to legally get married, and this change was being
         | planned since 2004. It's not _that_ recent!
         | 
         | This is the oldest concrete plan I can quickly spot:
         | https://github.com/python/peps/blob/b5815e3e638834e28233fc20...
        
         | mixmastamyk wrote:
          | Python strings are Unicode, not UTF-8, at least until
          | encoding time, at which point they become bytes.
        
           | deathanatos wrote:
            | I'm going to define a "Unicode string" as Rust does: a
            | sequence of Unicode scalar values (USVs), i.e., content
            | that can be validly represented as UTF-8. By that
            | definition, no, sadly, Python's strings are not Unicode:
            | they're sequences of arbitrary Unicode code points, which
            | include unpaired surrogates. Because of that,
            | 
            |     a_string.encode('utf-8')
            | 
            | ... can raise in Python. For example:
            | 
            |     In [1]: '\uD83D'.encode('utf-8')
            |     ---------------------------------------------------------
            |     UnicodeEncodeError        Traceback (most recent call last)
            |     Cell In[1], line 1
            |     ----> 1 '\uD83D'.encode('utf-8')
            | 
            |     UnicodeEncodeError: 'utf-8' codec can't encode character
            |     '\ud83d' in position 0: surrogates not allowed
           | 
           | (The underlying encoding of str in Python these days is
           | either [u8], [u16], or [u32], essentially, depending on the
           | value of the largest code point in the string. So, for some
           | values, e.g., 'hello world', the underlying representation is
           | UTF-8, essentially.)
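            | 
            | For contrast, a sketch of the Rust side of that definition
            | (my example): char is a Unicode scalar value, so a lone
            | surrogate can't even be constructed.
            | 
            |     fn main() {
            |         // Surrogate code points are rejected outright...
            |         assert_eq!(char::from_u32(0xD83D), None);
            |         // ...while non-surrogate code points up to
            |         // 0x10FFFF are fine.
            |         assert_eq!(char::from_u32(0x1F600),
            |                    Some('\u{1F600}'));
            |     }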
        
       | kstrauser wrote:
       | > As always, immutability comes with a performance penalty:
       | Mutating values is generally faster than creating new ones.
       | 
       | I get what they're saying, but I'm not sure I agree with it.
       | Mutating one specific value is faster than making a copy and then
       | altering that. Knowing that a value can't be mutated and using
       | that to optimize the rest of the system can be faster yet. I
       | think it's more likely the case that allowing mutability comes
       | with a performance penalty.
        
         | ghusbands wrote:
         | It's easy to see that altering an entry in an array (size N) is
         | O(1) with mutability and at least O(log N) with immutability,
         | and that affects many algorithms. Altering any small part of a
         | larger data structure has similar issues. In the end, many
         | algorithms gain a factor of log N in their time complexity.
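          | 
          | A crude illustration (a sketch; real persistent structures
          | use structural sharing to get O(log N), not the full O(N)
          | copy shown here):
          | 
          |     fn main() {
          |         let a: Vec<u32> = (0..1_000).collect();
          | 
          |         // With mutability, updating one entry is O(1):
          |         let mut b = a.clone();
          |         b[500] = 42;
          | 
          |         // Without it, you rebuild the whole thing: O(N).
          |         // Tree-backed persistent vectors share unchanged
          |         // subtrees to get this down to O(log N).
          |         let c: Vec<u32> = a.iter().copied().enumerate()
          |             .map(|(i, x)| if i == 500 { 42 } else { x })
          |             .collect();
          |         assert_eq!(b, c);
          |     }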
        
           | kstrauser wrote:
           | Right, but look at the bigger picture. Immutability removes a
           | whole lot of order of operations concerns. That frees a smart
           | compiler to use faster algorithms, parallelization, etc. in
           | ways that might not be safely possible if the data _might_
            | change in place. Yes, that may mean it's slower to deal with
           | individual values. It may also mean that the resulting system
           | can be faster than otherwise.
        
             | ghusbands wrote:
             | Well, language benchmarks fairly uniformly show that to be
             | untrue in general. None of the fastest languages have
             | forced immutability. It's not like it's a novel, untested
             | idea.
        
         | bawolff wrote:
         | I mean, that's kind of misinterpreting their point. The authors
         | are not claiming otherwise.
        
       | LegionMammal978 wrote:
       | There is one property about UTF-8 that distinguishes it from
       | opaque byte strings of unknown encoding: its codepoints are self-
       | delimiting, so you can naively locate instances of a substring
       | (and delete them, replace them, split on them, etc.) without
       | worrying that you've grabbed something else with the same bytes
       | as the substring.
       | 
        | Contrast with UTF-16, where a substring might match the bytes at
       | an odd index in the original string, corresponding to totally
       | different characters.
       | 
       | Identifying a substring is valid in every human language I know
       | of, as long as the substring itself is semantically meaningful
       | (e.g., it doesn't end in part of a grapheme cluster; though if
       | you want to avoid breaking up words, you may also want a \b-like
       | mechanism). So it does seem to refute the author's notion that
       | you can do nothing with knowledge only of the encoding.
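        | 
        | A sketch of why the naive search is safe (my example):
        | continuation bytes (0b10xxxxxx) can never be confused with
        | leading bytes, so a bytewise match can't start in the middle
        | of a character.
        | 
        |     fn main() {
        |         let haystack = "héllo wörld";
        |         let needle = "ö".as_bytes(); // [0xC3, 0xB6]
        |         let pos = haystack.as_bytes()
        |             .windows(needle.len())
        |             .position(|w| w == needle);
        |         assert_eq!(pos, Some(8)); // byte offset of 'ö'
        |     }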
        
       | kayodelycaon wrote:
       | Personally, I prefer Ruby's behavior of explicit encoding on
       | strings and being very cranky when invalid codepoints show up in
       | a UTF-8 string.
       | 
       | If you want to ignore invalid UTF-8, use String#scrub to replace
       | heretical values with \uFFFD and life is good. :)
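        | 
        | For comparison, the closest Rust analogue I know of (a
        | sketch, not from the article):
        | 
        |     fn main() {
        |         let bytes = [b'a', 0xff, b'b'];
        |         // Like String#scrub, from_utf8_lossy replaces
        |         // invalid sequences with U+FFFD.
        |         let s = String::from_utf8_lossy(&bytes);
        |         assert_eq!(s, "a\u{FFFD}b");
        |     }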
        
         | adgjlsfhk1 wrote:
         | life is good until you try to read a file path that is not a
         | valid string and discover that you read the wrong file.
        
       | singpolyma3 wrote:
        | Honestly, if you don't know that it's valid Unicode then it's
        | not a string at all, but a bytestring.
        
       | 3pm wrote:
       | Good paper on UTF-8 validation performance:
        | https://arxiv.org/pdf/2010.03090
        | 
        |     The relatively simple algorithm (lookup) can be several
        |     times faster than conventional algorithms at a common task
        |     using nothing more than the instructions available on
        |     commodity processors. It requires fewer than an
        |     instruction per input byte in the worst case.
        
       | bawolff wrote:
        | In terms of actual issues, I think normalizing to NFC is much
       | more important than validating.
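        | 
        | A sketch of why (my example; the unicode-normalization crate
        | is an assumption, not something the article uses): the same
        | text can be spelled as different code point sequences, and
        | only normalization makes them compare equal.
        | 
        |     use unicode_normalization::UnicodeNormalization;
        | 
        |     fn main() {
        |         let composed = "\u{00E9}";    // é, one code point
        |         let decomposed = "e\u{0301}"; // e + combining acute
        |         assert_ne!(composed, decomposed);
        |         let a: String = composed.nfc().collect();
        |         let b: String = decomposed.nfc().collect();
        |         assert_eq!(a, b); // equal after NFC
        |     }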
        
       | camgunz wrote:
       | You do have to validate UTF-8 strings:
       | 
       | - You can't just skip stuff if you run any kind of normalization
       | 
       | - How would you index into or split an invalid UTF-8 string?
       | 
       | - How would you apply a regex?
       | 
       | - What is its length?
       | 
       | - How do you deal with other systems that _do_ validate UTF-8
       | strings?
       | 
        | Meta point: scanning a sequence of bytes for invalid UTF-8
       | sequences is validating. The decision to skip them is just
       | slightly different code than "raise error". It's probably also a
       | lot slower as you have to always do this for every operation,
       | whereas once you've validated a string you can operate on it with
       | impunity.
       | 
       | Love this for the hot take/big swing, but it's a whiff.
        
         | fanf2 wrote:
         | - delay validation until normalization
         | 
         | - treat non-utf-8 bytes as bytes
         | 
          | - regexes match bytes just fine: the article has a whole
          | section on ripgrep's use of bstr (see the sketch after this
          | list)
         | 
         | - another section discusses how the length of a string is not a
         | well defined quantity
         | 
         | - the article says you can delay validation until required
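          | 
          | A sketch of the bytes-matching point, using the regex
          | crate's bytes API (my example; ripgrep pairs this with
          | bstr):
          | 
          |     use regex::bytes::Regex;
          | 
          |     fn main() {
          |         let re = Regex::new(r"ab+").unwrap();
          |         // The haystack need not be valid UTF-8.
          |         let hay: &[u8] = &[0xff, b'a', b'b', b'b', 0xfe];
          |         let m = re.find(hay).unwrap();
          |         assert_eq!((m.start(), m.end()), (1, 4));
          |     }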
        
           | paulddraper wrote:
           | Regexes normally match characters, not bytes.
        
       | creatonez wrote:
       | My understanding is that Rust has designed the rest of the String
       | API under the assumption of validity. You can't create an invalid
       | String because the methods that operate on strings strive to be
       | tightly optimized UTF-8 manipulation algorithms that assume the
       | string has already been cleaned. Pack all of the actual
        | robustness into the guarantee that you are working with UTF-8, and
       | you can avoid unnecessary CPU cycles, which is one of the goals
       | of systems languages. If you want to skip it, go for raw strings
       | or CStr -- all raw byte buffers have the basic ASCII functions
       | available, which are designed to be robust against whatever you
        | throw at them, and it shouldn't be too hard to introduce genericity
       | for an API to accept both strings and raw data.
       | 
        | That being said, I'm not sure how this is actually implemented; I
       | assume there is still some degree of robustness when running
       | methods on strings generated using `unsafe fn
       | from_utf8_unchecked` just by nature of UTF-8's self-
       | synchronization, which may be what the article is pointing out.
       | It's possible that some cleverly optimized UTF-8 algorithms don't
       | need valid data to avoid memory issues / UB that trips the
       | execution of the entire program, and can instead catch the error
       | or perform a lossy transformation on the spot without incurring
       | too much overhead.
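        | 
        | Concretely, the escape hatch looks like this (a sketch of the
        | std API, not of String's internals):
        | 
        |     fn main() {
        |         let bytes = vec![0x68, 0x69]; // "hi"
        |         // SAFETY: bytes is valid UTF-8. The unchecked
        |         // constructor skips validation entirely; passing
        |         // invalid bytes here would be undefined behavior,
        |         // since every String method may assume validity.
        |         let s = unsafe { String::from_utf8_unchecked(bytes) };
        |         assert_eq!(s, "hi");
        |     }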
        
       | kdheepak wrote:
       | In my opinion, one argument for internally representing `String`s
        | as UTF8 is that it prevents accidentally saving a file as Latin1 or
       | other encodings. I would like to read a file my coworker sent me
       | in my favorite language without having to figure out what the
       | encoding of the file is.
       | 
        | For example, my most recent Julia project has the following
        | line:
        | 
        |     windows1252_to_utf8(s) =
        |         decode(Vector{UInt8}(String(coalesce(s, ""))),
        |                "Windows-1252")
       | 
       | Figuring out that I had to use Windows-1252 (and not Latin1) took
       | a lot more time than I would have liked it to.
       | 
        | I get that there are some ergonomic challenges around this in
       | languages like Julia that are optimized for data analysis
       | workflows, but imho all data analysis languages/scripts should be
       | forced to explicitly list encodings/decodings whenever
       | reading/writing a file or default to UTF-8.
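        | 
        | For comparison, a Rust sketch of the same explicit-decoding
        | step, using the encoding_rs crate (my example, not from the
        | project above):
        | 
        |     use encoding_rs::WINDOWS_1252;
        | 
        |     fn main() {
        |         let bytes = [b'f', 0xe9, b't', b'e']; // "fête"
        |         let (text, _, had_errors) = WINDOWS_1252.decode(&bytes);
        |         assert!(!had_errors);
        |         assert_eq!(text, "f\u{00E9}te");
        |     }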
        
       ___________________________________________________________________
       (page generated 2024-05-16 23:01 UTC)