Post A2VDssQrcfYoPR818y by vurpo@mstdn.io
 (DIR) More posts by vurpo@mstdn.io
 (DIR) Post #A2Uu2PHsq5mvpZC84u by xerz@fedi.xerz.one
       2020-12-23T15:56:12.289732Z
       
       0 likes, 0 repeats
       
       Wait what https://twitter.com/steveklabnik/status/1341478150988042240
       
 (DIR) Post #A2Uv2id31p2plro4sy by codewiz@mstdn.io
       2020-12-23T16:07:27Z
       
       0 likes, 0 repeats
       
       @xerz I approve of this. Strings should just be a dumb sequence of bytes. Why? Because it makes everything work just fine for pass-through apps that don't need to interpret the contents. In most cases, that's what the user wants. One example: I have my Amiga source code in a directory, and some of the files have names in ISO-Latin-1. The Linux kernel and the GNU tools don't mind at all. But anything written in Python throws.
       
 (DIR) Post #A2UwdpgjqOFSKnnIHI by xerz@fedi.xerz.one
       2020-12-23T16:25:24.037824Z
       
       0 likes, 0 repeats
       
       @codewiz this seems more like a debate on typing and IPC than on strings themselves… if you ask me, I'm not comfortable writing any code that will be put in prod having a parser figuring out what kind of data is being given, so I would rather reduce that as much as possible. And that's not what Rust is doing, anyway.
       
 (DIR) Post #A2Uz9gMIGcToZdXuTI by codewiz@mstdn.io
       2020-12-23T16:53:31Z
       
       0 likes, 0 repeats
       
       @xerz It's too bad that tweet didn't come with any references. Did something actually change in #Rust recently?
       
 (DIR) Post #A2UzGfQs6oQHjE1BDs by xerz@fedi.xerz.one
       2020-12-23T16:54:49.199668Z
       
       0 likes, 0 repeats
       
       @codewiz it's explained here https://twitter.com/lcnr7/status/1341478831090221061
       
 (DIR) Post #A2UzTaVri7MEEfVyiG by codewiz@mstdn.io
       2020-12-23T16:57:07Z
       
       1 likes, 0 repeats
       
       @xerz Ah, and here: https://github.com/rust-lang/rust/issues/71033 After skimming through this, I think it was a good decision. Same validity guarantees as std::string in #cpp. Not sure about #golang...
       
 (DIR) Post #A2UzbxAtL3xepuh3Zo by xerz@fedi.xerz.one
       2020-12-23T16:58:39.939158Z
       
       0 likes, 0 repeats
       
       @codewiz when in doubt, Golang just does the most simple, generic thing possible - that is, it's just a byte slice
       
 (DIR) Post #A2V5VGVZer92UpYsme by friend@linuxrocks.online
       2020-12-23T18:04:39Z
       
       0 likes, 0 repeats
       
       @codewiz I think it would be better to just use byte arrays if you really do not want to interpret the meaning of a string. In fact, paths on Unix are defined to just be a byte array, so Python exploding is caused solely by it being incompetent here: https://changelog.complete.org/archives/10063-the-fundamental-problem-in-python-3 @xerz
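
The "paths are just bytes" point can be sketched in Rust, which keeps paths in `OsStr`/`Path` rather than `str`. This is a minimal sketch assuming a Unix target (the `OsStrExt` trait used here is Unix-only); the helper name `path_from_bytes` and the file name are made up for illustration.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only extension trait
use std::path::Path;

/// Hypothetical helper: builds a path from raw bytes without decoding them.
fn path_from_bytes(raw: &[u8]) -> &Path {
    Path::new(OsStr::from_bytes(raw))
}

fn main() {
    // "café.s" as ISO-8859-1 bytes: a lone 0xE9 is not valid UTF-8.
    let raw: &[u8] = b"caf\xE9.s";
    let path = path_from_bytes(raw);

    // The path cannot be viewed as a &str ...
    assert!(path.to_str().is_none());
    // ... but the original bytes pass through untouched.
    assert_eq!(path.as_os_str().as_bytes(), raw);

    // A lossy conversion exists for display purposes only.
    println!("{}", path.to_string_lossy());
}
```

The path stays usable for file-system calls; decoding only becomes necessary when the name has to be shown as text.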
       
 (DIR) Post #A2VDssQrcfYoPR818y by vurpo@mstdn.io
       2020-12-23T19:34:46Z
       
       0 likes, 0 repeats
       
       @codewiz @xerz Rust has u8 for when you want to deal with bytes and not Strings. Read the file into a vector of bytes if you don't want to guarantee it's valid UTF-8. String manipulation functions won't work on that vector of bytes, but that's by design since those string manipulation functions will also assume it's UTF-8.
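
A rough illustration of the bytes-versus-`String` split described above. The byte literal stands in for what `std::fs::read` would return, and the helper `describe` is a hypothetical name used only for this sketch.

```rust
/// Hypothetical helper: tries the strict conversion first and falls back to a
/// lossy one. Returns the text plus a flag saying whether the bytes were valid UTF-8.
fn describe(bytes: &[u8]) -> (String, bool) {
    match std::str::from_utf8(bytes) {
        Ok(s) => (s.to_owned(), true),
        Err(_) => (String::from_utf8_lossy(bytes).into_owned(), false),
    }
}

fn main() {
    // Stand-in for std::fs::read(path), which yields io::Result<Vec<u8>>.
    let data: Vec<u8> = b"na\xEFve".to_vec(); // "naïve" in ISO-8859-1

    // Slice operations still work on the raw bytes.
    assert!(data.starts_with(b"na"));
    assert_eq!(data.len(), 5);

    // Conversion to String is checked, so invalid input is rejected.
    assert!(String::from_utf8(data.clone()).is_err());

    let (text, valid) = describe(&data);
    println!("{text} (valid UTF-8: {valid})");
}
```

The string methods become available only after the checked conversion succeeds, which is the design choice the post refers to.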
       
 (DIR) Post #A2VjaG13jCl4uoitWa by teek_eh@aus.social
       2020-12-24T01:33:42Z
       
       0 likes, 0 repeats
       
       @codewiz @xerz It looks to me that nothing has changed regarding str/String containing valid UTF-8. The difference is that if you use `unsafe` to force invalid data into that memory location, that's no longer insta-UB per the language spec - it will just crash or do something horrible when you try to call one of the methods that assumed it was valid. Am I misunderstanding something?
       
 (DIR) Post #A2VmFQZ9bmSZPtQiKu by codewiz@mstdn.io
       2020-12-24T02:03:36Z
       
       0 likes, 0 repeats
       
       @friend @xerz Good article. Filenames are just one of the many things that should be treated as a byte array. What about html pages? Source code? Configuration files? Let's say you're writing an indenter. If you want input to loopback cleanly through your code, you must scan for certain keywords in the text while leaving the rest alone. In the end, the most useful string type is one that doesn't enforce valid utf-8, but still lets you perform typical string operations on the binary data.
       
 (DIR) Post #A2Vn26fCOUXPsbT9lI by codewiz@mstdn.io
       2020-12-24T02:12:24Z
       
       0 likes, 0 repeats
       
       @friend @xerz @vurpo Good article. Filenames are just one of the many things that should be treated as byte arrays. Let's say you're writing an indenter for html, json or maybe Python code. If you want any input to loopback cleanly, you must scan for certain keywords in the text while leaving the rest alone. For many tasks, the most useful string type is one that's not too picky about the content being valid utf-8, while still providing all the common string operations.
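
One way the pass-through indenter idea might look on plain byte slices, sketched in Rust. Both function names are hypothetical, and the sketch assumes an ASCII-compatible input encoding, since it treats byte 0x0A as the line break and matches keywords as ASCII bytes.

```rust
/// Returns the byte offset of the first occurrence of `needle` in `haystack`.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0); // `windows(0)` would panic, so handle this case first
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Prefixes every line with `indent` and copies all other bytes unchanged.
fn indent_lines(input: &[u8], indent: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    for line in input.split_inclusive(|&b| b == b'\n') {
        out.extend_from_slice(indent);
        out.extend_from_slice(line);
    }
    out
}

fn main() {
    // Source text containing a stray ISO-8859-1 byte inside a string literal.
    let src: &[u8] = b"if x:\n  s = '\xE9'\n";

    // Keyword search works without decoding anything.
    assert_eq!(find_bytes(src, b"if"), Some(0));

    // The invalid byte survives the rewrite untouched.
    let out = indent_lines(src, b"    ");
    assert!(out.contains(&0xE9));
    assert!(std::str::from_utf8(&out).is_err());
}
```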
       
 (DIR) Post #A2VnFTxz478puvdMdU by codewiz@mstdn.io
       2020-12-24T02:14:49Z
       
       0 likes, 0 repeats
       
       @teek_eh @xerz Do you have to use unsafe though?
       
 (DIR) Post #A2Vnh0d5jVhP2RzHYu by teek_eh@aus.social
       2020-12-24T02:19:46Z
       
       0 likes, 0 repeats
       
       @codewiz @xerz Yes, Strings and string slices are still guaranteed to be UTF-8 in safe code per the docs.
       
 (DIR) Post #A2VpHCbUn3o0YU3yIi by vurpo@mstdn.io
       2020-12-24T02:37:34Z
       
       0 likes, 0 repeats
       
       @codewiz @friend @xerz Even if you're not really strict about having it be entirely valid UTF-8, you still always need to have some starting point regarding the encoding of your text, right? I mean, if you write your string operations (say, indentation) for this "ASCII/UTF-8-ish but not strictly checked" encoding, they would still fail completely if presented with e.g. UTF-16 or UTF-32 or one of the many non-Unicode-related character encodings that are common.
       
 (DIR) Post #A2VpTlENsBBdQOF0vg by vurpo@mstdn.io
       2020-12-24T02:39:47Z
       
       0 likes, 0 repeats
       
       @codewiz @friend @xerz What I mean is just, it's impossible to provide literally any of the common string operations without first assuming/expecting the strings to be encoded with some specific encoding.
       
 (DIR) Post #A2W3vl9BB338dB3GMq by friend@linuxrocks.online
       2020-12-24T05:21:43Z
       
       0 likes, 0 repeats
       
       @vurpo Good point. Inserting 0x20 for a space will work correctly a lot of the time, because most encodings do share the ASCII range, but with UTF-16, you need to insert another empty byte and with UTF-32, you need three empty bytes. Even just searching for certain bytes could get you fucked over, as 00 00 00 20 is a space in UTF-32, but 00 00 20 00 is not. @codewiz
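
The byte patterns mentioned above can be checked with a short Rust sketch. The two encoder helpers are hypothetical names and assume big-endian byte order with no byte-order mark.

```rust
/// Encodes `text` as big-endian UTF-16 bytes (no BOM).
fn utf16_be(text: &str) -> Vec<u8> {
    text.encode_utf16().flat_map(|u| u.to_be_bytes()).collect()
}

/// Encodes `text` as big-endian UTF-32 bytes (no BOM).
fn utf32_be(text: &str) -> Vec<u8> {
    text.chars().flat_map(|c| (c as u32).to_be_bytes()).collect()
}

fn main() {
    // A space needs one padding byte in UTF-16 and three in UTF-32.
    assert_eq!(utf16_be(" "), vec![0x00u8, 0x20]);
    assert_eq!(utf32_be(" "), vec![0x00u8, 0x00, 0x00, 0x20]);

    // U+2000 (EN QUAD) is a different character whose encoding contains 0x20.
    assert_eq!(utf32_be("\u{2000}"), vec![0x00u8, 0x00, 0x20, 0x00]);

    // A naive scan for byte 0x20 reports a "space" where the text has none.
    let text = "a\u{2000}b";
    assert!(!text.contains(' '));
    assert!(utf16_be(text).contains(&0x20));
}
```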
       
 (DIR) Post #A2WiZZumM14JXorFRY by friend@linuxrocks.online
       2020-12-24T05:35:53Z
       
       1 likes, 0 repeats
       
       @vurpo @codewiz For code indentation, we can usually assume that the keywords of a programming language will be correctly decoded via ASCII, so it would work for finding those keywords, but yeah, I'm not convinced that the majority of cases should be handled without decoding. I'm still discovering new gotchas after having intensively worked with encodings for a few months, so I would recommend either fully decoding or not decoding at all. Custom code to interpret individual bytes will bite you.
       
 (DIR) Post #A2WiZeNdkeX1OUOLz6 by codewiz@mstdn.io
       2020-12-24T12:57:07Z
       
       0 likes, 0 repeats
       
       @friend @vurpo To make everyone happy, those who want strings to use a valid encoding and those who want loopback of binary data, we could just say that strings are encoded in wtf-8: http://simonsapin.github.io/wtf-8/
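
For context, the linked spec describes WTF-8 as UTF-8 extended to also encode unpaired surrogates, which is the case strict Unicode strings reject. The Rust sketch below only shows that rejection using standard-library calls; it is not a WTF-8 implementation, and `decode_strict` is a hypothetical helper name. Arbitrary bytes such as the Latin-1 examples earlier in the thread are a separate problem that this does not address.

```rust
/// Hypothetical helper: strict UTF-16 decoding, `None` if the input is ill-formed.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

fn main() {
    // 0xD800 is a lead surrogate with no trail surrogate after it.
    let ill_formed: [u16; 3] = [0x0061, 0xD800, 0x0062];

    // Strict decoding refuses the input ...
    assert_eq!(decode_strict(&ill_formed), None);
    // ... and a surrogate is not a valid `char` on its own.
    assert!(char::from_u32(0xD800).is_none());

    // Lossy decoding substitutes U+FFFD, so the round trip is not clean.
    assert_eq!(String::from_utf16_lossy(&ill_formed), "a\u{FFFD}b");
}
```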