[HN Gopher] The Byte Order Fiasco (2021)
       ___________________________________________________________________
        
       The Byte Order Fiasco (2021)
        
       Author : fanf2
       Score  : 35 points
       Date   : 2024-06-30 17:42 UTC (5 hours ago)
        
 (HTM) web link (justine.lol)
        
       | throw0101b wrote:
       | > _Clang and GCC are reaching for any optimization they can get.
       | Undefined Behavior may be hostile and dangerous, but it's legal.
       | So don't let your code become a casualty._
       | 
       | Perhaps we need a _-Wundefined-behaviour_ so that compilers print
       | out messages when they use those types of 'tricks'. If you see
       | them you can then choose to adjust your code in a way that it
       | follows defined path(s) of the standard(s) in question.
        
         | Uvix wrote:
         | Isn't that what _-fsanitize=undefined_ does?
        
           | MaulingMonkey wrote:
           | `-fsanitize=undefined` enables some runtime warnings/checks,
           | `-Wundefined-behaviour` would presumably enable some kind of
           | compile time warnings/checks.
        
         | saagarjha wrote:
         | You'll have far too many messages.
        
           | amelius wrote:
           | Another approach would be to implement undefined behavior in
           | a way that is random.
           | 
           | https://en.wikipedia.org/wiki/Chaos_engineering
        
         | tedunangst wrote:
         | Do you want a warning on every for (int i = 0; i < n; i++)
         | loop?
        
         | karatinversion wrote:
         | I always thought the problem with this was that the compilers
         | do loads of these optimisations in very mundane ways. Eg if I
         | have a
         | 
         |     #define FOO 17
         |     void bar(int x, int y) {
         |       if (x + y >= FOO) {
         |         //do stuff
         |       }
         |     }
         |     void baz(int x) {
         |       bar(x, FOO);
         |     }
         | 
         | the compiler can inline the call to bar in baz, and then
         | optimise the condition to (x>=0)... because signed integer
         | overflow is undefined, so can't happen, so the two conditions
         | are equivalent.
         | 
         | The countless messages about optimisations like that would
         | swamp ones about real dangerous optimisations.
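
A minimal sketch of the equivalence described in the comment above, assuming `y` is fixed to `FOO` as in `baz`. The names `cond_original` and `cond_folded` are hypothetical and exist only for illustration. The precondition assert keeps the sum inside defined behavior; for every such `x` the two predicates agree, which is why a compiler may substitute one for the other.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define FOO 17

/* The condition as written in bar(), with y fixed to FOO as in baz().
   Precondition: x + FOO must not overflow, otherwise behavior is undefined. */
static bool cond_original(int x) {
    assert(x <= INT_MAX - FOO);  /* stay inside defined behavior */
    return x + FOO >= FOO;
}

/* What an optimizer may reduce the condition to, because it is allowed
   to assume signed overflow never happens. */
static bool cond_folded(int x) {
    return x >= 0;
}
```

Only inputs that would overflow could tell the two apart, and those are exactly the inputs the standard leaves undefined.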
        
         | userbinator wrote:
         | Or better yet, perhaps they should just do the sane thing
         | instead of being a hostile pedantic smartass.
         | 
         | Undefined shouldn't mean "whatever", it should be an
         | opportunity to consider what makes the most sense.
        
           | amelius wrote:
           | But then you have to write a standard.
        
       | IshKebab wrote:
       | These days I think the sane option is to just add a static assert
       | that the machine is little endian and move on with your life.
       | Unless you're writing glibc or something do you _really_ care
       | about supporting ancient IBM mainframes?
       | 
       | Also recommending octal is sadistic!
        
         | bvrmn wrote:
         | > add a static assert that the machine is little endian and
         | move on with your life
         | 
         | It's not clear how it would free you from interpreting BE data
         | from incoming streams/blobs.
        
           | forrestthewoods wrote:
           | I feel like we're at a point where you should assume little
           | endian serialization and treat anything big endian as a slow
           | path you don't care about. There's no real reason for any
           | blob, stream, or socket to use big endian for anything
           | afaict.
           | 
           | If some legacy system still serializes big endian data then
           | call bswap and call it a day.
        
             | bvrmn wrote:
             | AFAIK quite a number of protocols and file formats use BE,
             | with no sign of becoming legacy even in the distant
             | future.
        
             | saagarjha wrote:
             | You do realize that most of the networking stack is big-
             | endian, right?
        
             | syncsynchalt wrote:
             | The internet is big-endian, and generally data sent over
             | the wire is converted to/from BE. For example the numbers
             | in IP or TCP headers are big-endian, and any RFC that
             | defines a protocol including binary data will generally go
             | with big-endian numbers.
             | 
             | I believe this dates from Bolt Beranek and Newman basing
             | the IMP on a BE architecture. Similarly computers tend to
             | be LE these days because that's what the "winning" PC
             | architecture (x86) uses.
        
               | userbinator wrote:
               | Only the early protocols below the application layer are
               | BE. A lot of the later stuff switched to LE.
        
               | kortilla wrote:
               | Yes, those "early protocols" carry everything. Until
               | applications stop opening sockets, this problem doesn't
               | go away.
        
               | forrestthewoods wrote:
               | > any RFC that defines a protocol including binary data
               | will generally go with big-endian numbers
               | 
               | I'm not sure this is true. And if it is true it really
               | shouldn't be. There are effectively no modern big endian
               | CPUs. If designing a new protocol there is, afaict, zero
               | benefit to serializing anything as big endian.
               | 
               | It's unfortunate that TCP headers and networking are big
               | endian. It's a historical artifact.
               | 
               | Converting data to/from BE is a waste. I've designed and
               | implemented a variety of simple communication protocols.
               | They all define the wire format to be LE. Works great,
               | zero issues, zero regrets.
        
           | bla3 wrote:
           | https://commandcenter.blogspot.com/2012/04/byte-order-
           | fallac... covers that part.
        
         | MenhirMike wrote:
         | Big Endian is also called Network Order because some networking
         | protocols use it. And of course, UTF-16 BE is a thing.
         | 
         | There is a non-trivial chance that you will have to deal with
         | BE data regardless of whether your machine is LE or BE.
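
One way to handle big-endian data independently of the host's byte order is to assemble the value from individual bytes with shifts. The sketch below assumes a 32-bit field; the helper names `read32be` and `write32be` are illustrative, not taken from any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit big-endian value from a byte buffer.
   Each byte is widened to uint32_t before shifting, so no shift
   ever operates on a signed int. */
static uint32_t read32be(const unsigned char *p) {
    assert(p != NULL);
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] << 8  | (uint32_t)p[3];
}

/* Encode a 32-bit value into a byte buffer in big-endian order. */
static void write32be(unsigned char *p, uint32_t v) {
    assert(p != NULL);
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}
```

Because the code only talks about byte positions and arithmetic, it gives the same result on little-endian and big-endian hosts and needs no `#ifdef`.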
        
       | o11c wrote:
       | When I dealt with this, there were a couple major gotchas:
       | 
       | * Compilers seem to reliably detect byteswap, but are(were) very
       | hit-or-miss with the shift patterns for reading/writing to memory
       | directly, so you still need(ed) an ifdef. I know compilers have
       | improved but there are so many patterns that I'm still paranoid.
       | 
       | * There are a _lot_ of "optimized" headers that actually
       | pessimize by inserting inline assembly that the compiler can't
       | optimize through (in particular, the compiler can't inline
       | constants and can't choose `movbe` instead of `bswap`), so do not
       | trust any standard API; write your own with memcpy + ifdef'd
       | C-only swapping.
       | 
       | * For speaking wire protocols, generating (struct-based?) code is
       | far better than writing code that mentions offsets directly,
       | which in turn is far better than the `mempcpy`-like code which
       | the link suggests.
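
A rough sketch of the "memcpy + ifdef'd C-only swapping" approach from the second bullet above. It assumes a GCC- or Clang-style compiler that predefines `__BYTE_ORDER__`; the names `swap32` and `load_be32` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain-C byte swap with no inline assembly, so the compiler remains
   free to constant-fold it or pick whichever instruction it prefers. */
static uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Load a big-endian 32-bit value from possibly unaligned memory. */
static uint32_t load_be32(const void *src) {
    uint32_t v;
    assert(src != NULL);
    memcpy(&v, src, sizeof v);  /* sidesteps alignment and aliasing issues */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return v;                   /* host order already matches the wire */
#elif defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return swap32(v);
#else
#error "unknown byte order: this sketch relies on GCC/Clang __BYTE_ORDER__"
#endif
}
```

Whether this compiles to a single load-and-swap depends on the compiler and flags, so it is worth checking the generated assembly rather than assuming.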
        
       | akira2501 wrote:
       | > Now you don't need to use those APIs because you know the
       | secret.
       | 
       | Was that a desired outcome? The endian.3 and byteorder.3 manual
       | pages make it easy.
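
For comparison, a small sketch using the POSIX `htonl`/`ntohl` functions documented in byteorder(3). The wrapper names `put_net32` and `get_net32` are made up for this example; `memcpy` is used so the buffer does not have to be aligned.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX; see byteorder(3)) */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a host-order value into a buffer in network (big-endian) order. */
static void put_net32(unsigned char *dst, uint32_t host_value) {
    uint32_t net = htonl(host_value);
    assert(dst != NULL);
    memcpy(dst, &net, sizeof net);
}

/* Read a network-order value from a buffer and return it in host order. */
static uint32_t get_net32(const unsigned char *src) {
    uint32_t net;
    assert(src != NULL);
    memcpy(&net, src, sizeof net);
    return ntohl(net);
}
```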
        
       | pizlonator wrote:
       | > Modern compiler policies don't even permit code that looks
       | like that anymore. Your compiler might see that and emit assembly
       | that formats your hard drive with btrfs.
       | 
       | This is total FUD. Some sanitizers might be unhappy with that
       | code, but that's just sanitizers creating problems where there
       | need not be any.
       | 
       | The LLVM policy here is that must-alias trumps TBAA, so clang
       | will reliably compile the cast of char* to uint32_t* and do what
       | systems programmers expect. If it didn't then too much stuff
       | would break.
        