[HN Gopher] You probably don't need to validate UTF-8 strings
___________________________________________________________________
You probably don't need to validate UTF-8 strings
Author : jakobnissen
Score : 41 points
Date : 2024-05-16 18:21 UTC (4 hours ago)
(HTM) web link (viralinstruction.com)
(TXT) w3m dump (viralinstruction.com)
| remram wrote:
| > In Rust, strings are always valid UTF8, and attempting to
| create a string with invalid UTF8 will panic at runtime:
|
| > [piece of code explicitly calling .unwrap()]
|
| You misspelled "returns an error".
|
| It might be worth considering Python, where the most central
| change from 2 to 3 was that strings would now be validated UTF-8.
| I don't understand why it gets discarded with "it was designed in
| the 1990's" when that change happened so recently.
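|
| (A minimal Rust sketch of the distinction, using only the standard
| library; this is not the article's snippet:)
|             fn main() {
|                 let bytes = vec![0xff, 0xfe];
|                 // String::from_utf8 returns a Result; nothing panics here.
|                 match String::from_utf8(bytes) {
|                     Ok(s) => println!("valid: {s}"),
|                     Err(e) => println!("invalid UTF-8: {e}"),
|                 }
|                 // A panic only happens if the caller opts into it:
|                 // String::from_utf8(vec![0xff]).unwrap(); // would panic
|             }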
| lucb1e wrote:
| There are countries where Python 3's stable release is old
| enough to legally get married, and this change was being
| planned since 2004. It's not _that_ recent!
|
| This is the oldest concrete plan I can quickly spot:
| https://github.com/python/peps/blob/b5815e3e638834e28233fc20...
| mixmastamyk wrote:
| Python strings are Unicode, not UTF-8, at least until encoding
| time, at which point they become bytes.
| deathanatos wrote:
| I'm going to define a "Unicode string" as Rust does: a
| sequence of USVs / content that can be validly represented as
| UTF-8. Thus, no, sadly, Python's strings are not Unicode, as
| they're a sequence of Unicode code points. Because of that,
| a_string.encode('utf-8')
|
| ... can raise in Python. For example:
|
|             In [1]: '\uD83D'.encode('utf-8')
|             ---------------------------------------------------------
|             UnicodeEncodeError        Traceback (most recent call last)
|             Cell In[1], line 1
|             ----> 1 '\uD83D'.encode('utf-8')
|
|             UnicodeEncodeError: 'utf-8' codec can't encode character
|             '\ud83d' in position 0: surrogates not allowed
|
| (The underlying encoding of str in Python these days is
| either [u8], [u16], or [u32], essentially, depending on the
| value of the largest code point in the string. So, for some
| values, e.g., 'hello world', the underlying representation is
| UTF-8, essentially.)
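|
| (For contrast, a small Rust check of the USV-not-code-point rule;
| a sketch, not taken from the comment:)
|             fn main() {
|                 // char is a Unicode scalar value, so a lone surrogate
|                 // is rejected up front rather than at encode time.
|                 assert_eq!(char::from_u32(0xD83D), None);
|                 assert_eq!(char::from_u32(0x1F600), Some('😀'));
|                 println!("surrogates never make it into a Rust String");
|             }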
| kstrauser wrote:
| > As always, immutability comes with a performance penalty:
| Mutating values is generally faster than creating new ones.
|
| I get what they're saying, but I'm not sure I agree with it.
| Mutating one specific value is faster than making a copy and then
| altering that. Knowing that a value can't be mutated and using
| that to optimize the rest of the system can be faster yet. I
| think it's more likely the case that allowing mutability comes
| with a performance penalty.
| ghusbands wrote:
| It's easy to see that altering an entry in an array (size N) is
| O(1) with mutability and at least O(log N) with immutability,
| and that affects many algorithms. Altering any small part of a
| larger data structure has similar issues. In the end, many
| algorithms gain a factor of log N in their time complexity.
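|
| (A rough Rust illustration of the gap; the O(log N) figure assumes a
| persistent tree structure such as the `im` crate's Vector, which is
| not shown here:)
|             fn main() {
|                 let mut a = vec![0u32; 1_000];
|                 a[42] = 7;              // in place: O(1)
|
|                 // Without a persistent structure, an "immutable
|                 // update" is a full copy: O(N).
|                 let mut b = a.clone();
|                 b[42] = 8;
|                 assert_ne!(a[42], b[42]);
|             }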
| kstrauser wrote:
| Right, but look at the bigger picture. Immutability removes a
| whole lot of order of operations concerns. That frees a smart
| compiler to use faster algorithms, parallelization, etc. in
| ways that might not be safely possible if the data _might_
| change in place. Yes, that may mean it 's slower to deal with
| individual values. It may also mean that the resulting system
| can be faster than otherwise.
| ghusbands wrote:
| Well, language benchmarks fairly uniformly show that to be
| untrue in general. None of the fastest languages have
| forced immutability. It's not like it's a novel, untested
| idea.
| bawolff wrote:
| I mean, that's kind of misinterpreting their point. The authors
| are not claiming otherwise.
| LegionMammal978 wrote:
| There is one property about UTF-8 that distinguishes it from
| opaque byte strings of unknown encoding: its codepoints are self-
| delimiting, so you can naively locate instances of a substring
| (and delete them, replace them, split on them, etc.) without
| worrying that you've grabbed something else with the same bytes
| as the substring.
|
| Contrast with UTF-16, where a substring might match the bytes at
| an odd index in the original string, corresponding to totally
| different characters.
|
| Identifying a substring is valid in every human language I know
| of, as long as the substring itself is semantically meaningful
| (e.g., it doesn't end in part of a grapheme cluster; though if
| you want to avoid breaking up words, you may also want a \b-like
| mechanism). So it does seem to refute the author's notion that
| you can do nothing with knowledge only of the encoding.
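|
| (A minimal Rust sketch of that property: a byte-level match of one
| valid UTF-8 string inside another always lands on a character
| boundary, so no decoding is needed to search:)
|             fn find_bytes(haystack: &str, needle: &str) -> Option<usize> {
|                 haystack
|                     .as_bytes()
|                     .windows(needle.len())
|                     .position(|w| w == needle.as_bytes())
|             }
|
|             fn main() {
|                 let hay = "naïve café";
|                 let pos = find_bytes(hay, "café").unwrap();
|                 assert!(hay.is_char_boundary(pos)); // never mid-codepoint
|                 println!("found at byte offset {pos}");
|             }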
| kayodelycaon wrote:
| Personally, I prefer Ruby's behavior of explicit encoding on
| strings and being very cranky when invalid codepoints show up in
| a UTF-8 string.
|
| If you want to ignore invalid UTF-8, use String#scrub to replace
| heretical values with \uFFFD and life is good. :)
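|
| (A rough Rust analogue of String#scrub, for comparison:)
|             fn main() {
|                 let bytes: &[u8] = b"ab\xFFcd";
|                 // Invalid sequences are replaced with U+FFFD, as scrub does.
|                 let cleaned = String::from_utf8_lossy(bytes);
|                 assert_eq!(cleaned, "ab\u{FFFD}cd");
|                 println!("{cleaned}");
|             }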
| adgjlsfhk1 wrote:
| life is good until you try to read a file path that is not a
| valid string and discover that you read the wrong file.
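|
| (A Unix-only Rust sketch of that pitfall -- scrubbing a path changes
| which file it names:)
|             use std::ffi::OsStr;
|             use std::os::unix::ffi::OsStrExt;
|             use std::path::Path;
|
|             fn main() {
|                 let raw = Path::new(OsStr::from_bytes(b"data\xFF.txt"));
|                 let lossy = raw.to_string_lossy();  // "data\u{FFFD}.txt"
|                 // Round-tripping through the scrubbed string yields a
|                 // different path, i.e. potentially the wrong file.
|                 assert_ne!(Path::new(lossy.as_ref()), raw);
|             }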
| singpolyma3 wrote:
| Honestly, if you don't know that it's valid Unicode, then it's not
| a string at all, but a bytestring.
| 3pm wrote:
| Good paper on UTF-8 validation performance:
| https://arxiv.org/pdf/2010.03090
|
|             The relatively simple algorithm (lookup) can be several
|             times faster than conventional algorithms at a common
|             task using nothing more than the instructions available
|             on commodity processors. It requires fewer than an
|             instruction per input byte in the worst case.
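|
| (For reference, the validation primitive in std Rust -- a scalar
| baseline, not the paper's SIMD lookup algorithm; the simdutf8 crate
| is, as far as I know, an implementation of that approach:)
|             fn main() {
|                 let input: &[u8] = b"fine so far \xF0\x9F\x91\x8D then \xC0\xAF";
|                 match std::str::from_utf8(input) {
|                     Ok(s) => println!("valid: {s}"),
|                     Err(e) => println!(
|                         "invalid after {} bytes (error spans {:?} bytes)",
|                         e.valid_up_to(),
|                         e.error_len()
|                     ),
|                 }
|             }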
| bawolff wrote:
| In terms of actual issues - i think normalizing to NFC is much
| more important than validating.
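|
| (A small sketch of why, assuming the unicode-normalization crate:
| the same text can arrive in byte-wise different forms:)
|             use unicode_normalization::UnicodeNormalization;
|
|             fn main() {
|                 let decomposed = "e\u{0301}";     // 'e' + combining acute
|                 let composed: String = decomposed.nfc().collect();
|                 assert_eq!(composed, "\u{00E9}"); // precomposed 'é'
|                 assert_ne!(decomposed, composed.as_str()); // bytes differ
|             }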
| camgunz wrote:
| You do have to validate UTF-8 strings:
|
| - You can't just skip stuff if you run any kind of normalization
|
| - How would you index into or split an invalid UTF-8 string?
|
| - How would you apply a regex?
|
| - What is its length?
|
| - How do you deal with other systems that _do_ validate UTF-8
| strings?
|
| Meta point: scanning a sequence of bytes for invalid UTF-8
| sequences is validating. The decision to skip them is just
| slightly different code than "raise error". It's probably also a
| lot slower as you have to always do this for every operation,
| whereas once you've validated a string you can operate on it with
| impunity.
|
| Love this for the hot take/big swing, but it's a whiff.
| fanf2 wrote:
| - delay validation until normalization
|
| - treat non-utf-8 bytes as bytes
|
| - regexes match bytes just fine: the article has a whole
| section on ripgrep's use of bstr
|
| - another section discusses how the length of a string is not a
| well defined quantity
|
| - the article says you can delay validation until required
| paulddraper wrote:
| Regexes normally match characters, not bytes.
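|
| (The byte-matching variant does exist, though; a sketch assuming the
| regex crate's regex::bytes API, which searches &[u8] directly:)
|             use regex::bytes::Regex;
|
|             fn main() {
|                 let haystack: &[u8] = b"id=42 \xFF id=7"; // not valid UTF-8
|                 let re = Regex::new(r"id=(\d+)").unwrap();
|                 for cap in re.captures_iter(haystack) {
|                     // The invalid byte never matches; the digits still do.
|                     println!("{:?}", &cap[1]);
|                 }
|             }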
| creatonez wrote:
| My understanding is that Rust has designed the rest of the String
| API under the assumption of validity. You can't create an invalid
| String because the methods that operate on strings strive to be
| tightly optimized UTF-8 manipulation algorithms that assume the
| string has already been cleaned. Pack all of the actual
| robustness in the guarantee that you are working with UTF-8, and
| you can avoid unnecessary CPU cycles, which is one of the goals
| of systems languages. If you want to skip it, go for raw strings
| or CStr -- all raw byte buffers have the basic ASCII functions
| available, which are designed to be robust against whatever you
| throw at them, and it shouldn't be too hard to introduce genericity
| for an API to accept both strings and raw data.
|
| That being said, I'm not sure how this is actually implemented. I
| assume there is still some degree of robustness when running
| methods on strings generated using `unsafe fn
| from_utf8_unchecked` just by nature of UTF-8's self-
| synchronization, which may be what the article is pointing out.
| It's possible that some cleverly optimized UTF-8 algorithms don't
| need valid data to avoid memory issues / UB that trips the
| execution of the entire program, and can instead catch the error
| or perform a lossy transformation on the spot without incurring
| too much overhead.
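|
| (A sketch of the two entry points being discussed -- the checked
| constructor validates once; the unchecked one is `unsafe` precisely
| because later str methods may assume validity:)
|             fn main() {
|                 let good = b"hi \xE2\x9C\x93".to_vec();       // "hi ✓"
|                 let s = String::from_utf8(good).expect("validated once");
|                 println!("{s}");
|
|                 // Violating the contract below would let invalid bytes
|                 // reach code that assumes valid UTF-8, so it stays
|                 // commented out:
|                 // let s = unsafe { String::from_utf8_unchecked(vec![0xFF]) };
|             }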
| kdheepak wrote:
| In my opinion, one argument for internally representing `String`s
| as UTF-8 is that it prevents accidentally saving a file as Latin1 or
| other encodings. I would like to read a file my coworker sent me
| in my favorite language without having to figure out what the
| encoding of the file is.
|
| For example, my most recent Julia project has the following line:
|
|             windows1252_to_utf8(s) =
|                 decode(Vector{UInt8}(String(coalesce(s, ""))), "Windows-1252")
|
| Figuring out that I had to use Windows-1252 (and not Latin1) took
| a lot more time than I would have liked it to.
|
| I get that there's some ergonomic challenges around this in
| languages like Julia that are optimized for data analysis
| workflows, but imho all data analysis languages/scripts should be
| forced to explicitly list encodings/decodings whenever
| reading/writing a file or default to UTF-8.
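|
| (The same chore in Rust, as a sketch assuming the encoding_rs crate:)
|             use encoding_rs::WINDOWS_1252;
|
|             fn main() {
|                 let raw: &[u8] = &[0x63, 0x61, 0x66, 0xE9]; // "café" in Windows-1252
|                 let (text, _, had_errors) = WINDOWS_1252.decode(raw);
|                 assert!(!had_errors);
|                 assert_eq!(text, "café");
|             }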
___________________________________________________________________
(page generated 2024-05-16 23:01 UTC)