[HN Gopher] Falsehoods programmers believe about null pointers
       ___________________________________________________________________
        
       Falsehoods programmers believe about null pointers
        
       Author : HeliumHydride
       Score  : 61 points
       Date   : 2025-02-01 00:25 UTC (7 hours ago)
        
 (HTM) web link (purplesyringa.moe)
 (TXT) w3m dump (purplesyringa.moe)
        
       | alain94040 wrote:
       | I would add one more: the address you are dereferencing could be
       | non-zero, it could be an offset from 0 because the code is
       | accessing a field in a structure or method in a class. That
       | offset can be quite large, so if you see an error accessing
       | address 0x420, it's probably because you do have a null pointer
       | and are trying to access a field. As a bonus, the offending
       | offset may give you a hint as to which field and therefore where
       | in your code the bad dereferencing is happening.
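
The offset trick described above can be sketched in C with `offsetof`. The struct layout and both helper names below are hypothetical, chosen so that one field lands at 0x420; the sketch assumes a platform where a null pointer is address 0, and it never dereferences anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: `state` happens to sit 0x420 bytes into the struct. */
struct session {
    char name[0x420];
    int  state;
    long last_seen;
};

/* Checked at build time so the example fails loudly if the layout differs. */
static_assert(offsetof(struct session, state) == 0x420,
              "example assumes state is at offset 0x420");

/* Address a crash report would show if `state` were read through a null
 * `struct session *` on a platform where null is address 0.
 * Pure arithmetic; nothing is dereferenced. */
uintptr_t fault_address_for_state(void) {
    return (uintptr_t)offsetof(struct session, state);
}

/* Reverse lookup: map a small fault address back to a field name.
 * Padding bytes are attributed to the field that precedes them. */
const char *field_at_offset(uintptr_t addr) {
    if (addr < offsetof(struct session, state))     return "name";
    if (addr < offsetof(struct session, last_seen)) return "state";
    if (addr < sizeof(struct session))              return "last_seen";
    return "(outside struct session)";
}
```

With a layout like this, a fault at 0x420 points at `state` rather than at a wild pointer.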
        
         | catlifeonmars wrote:
         | Now this is a really interesting one. I'm assuming this
         | trivially applies to index array access as well?
        
           | kevincox wrote:
           | I think this technically wouldn't be a null pointer anymore.
           | As array indexing `p[n]` is defined as `*(p + n)` so first
           | you create a new pointer by doing math on a null pointer
           | (which is UB in C) then dereferencing this new pointer (which
           | doesn't even really exist because you have already committed
           | UB).
        
           | tialaramex wrote:
           | In C, and this article seems to be almost exclusively about
           | C, a[b] is basically sugar for (*((a) + (b)))
           | 
           | C does actually have arrays (don't let people tell you it
           | doesn't) but they decay to pointers at ABI fringes and the
           | index operation is, as we just saw, merely a pointer
           | addition, it's not anything more sophisticated - so the
           | arrays count for very little in practice.
        
             | juped wrote:
             | Not just basically sugar, the classic parlor trick is doing
             | 3[arr] or whatever
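
A minimal sketch of the equivalence discussed above, using a valid array so that nothing undefined happens. The function name is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when all four spellings of "element i of arr" agree.
 * The caller must ensure i is a valid index into arr. */
bool index_spellings_agree(const int *arr, size_t i) {
    assert(arr != NULL);
    int a = arr[i];        /* usual form */
    int b = *(arr + i);    /* what arr[i] is defined to mean */
    int c = *(i + arr);    /* pointer + integer addition commutes */
    int d = i[arr];        /* so the "parlor trick" spelling compiles too */
    return a == b && b == c && c == d;
}
```

Because `arr[i]` is only pointer addition plus a dereference, calling this with a null `arr` would perform arithmetic on a null pointer, which is the case kevincox describes.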
        
         | nyanpasu64 wrote:
         | One interesting failure mode is if (like the Linux kernel) a
         | function returns a union of a pointer or a negative errno
         | value, dereferencing a negative errno gives an offset (below or
         | above zero) _different_ from the field being accessed.
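
The pointer-or-errno convention can be sketched in user space as follows. The `sketch_` names are local stand-ins modelled on the kernel's `ERR_PTR`, `PTR_ERR` and `IS_ERR` helpers, not the kernel's own code, and the integer/pointer casts are implementation-defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095  /* local name; mirrors the kernel's MAX_ERRNO */

/* Encode a negative errno value in a pointer-sized result. */
static inline void *sketch_err_ptr(long err) {
    assert(err < 0 && err >= -SKETCH_MAX_ERRNO);
    return (void *)(intptr_t)err;
}

/* Recover the errno value from an encoded pointer. */
static inline long sketch_ptr_err(const void *p) {
    return (long)(intptr_t)p;
}

/* True if p lies in the top SKETCH_MAX_ERRNO bytes of the address space. */
static inline bool sketch_is_err(const void *p) {
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Address that would be touched if a field at `offset` were accessed
 * through `base`. Unsigned arithmetic, so wrap-around is well defined. */
static inline uintptr_t sketch_field_address(const void *base, size_t offset) {
    return (uintptr_t)base + offset;
}
```

For an encoded error of -12 and a field at offset 0x20, the faulting address wraps to 0x14, which matches neither the field offset nor the errno value.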
        
       | bitwize wrote:
       | Nowadays, UB is used pretty much as a license to make optimizer
       | go brrrr. But back in the day, I think it was used to allow
       | implementations wiggle room on whether a particular construct was
       | erroneous or not -- in contrast to other specifications like "it
       | is an error" (always erroneous) or "implementation-defined
       | behavior" (always legitimate; compiler must emit _something_
       | sensible, exactly what is not specified). In the null pointer
       | case, it makes sense for kernel-mode code to potentially indirect
       | to address 0 (or 0xffffffff, or whatever your architecture
       | designates as null), while user-space code can be reasonably
       | considered never to legitimately access that address because the
       | virtual memory never maps it as a valid address. So accessing
       | null is an error in one case and perfectly cromulent in the
       | other. So the standard shrugs its shoulders and says "it's
       | undefined".
        
         | lmm wrote:
         | The original motivation was to permit implementations to do the
         | reasonable, friendly thing, and trap whenever the program
         | dereferences a null pointer. Since C compilers want to reorder
         | or elide memory accesses, you can't really define explicit
         | semantics for that (e.g. you want it to be ok to move the
         | memory access before or after a sequence point) - the JVM has
         | to do a lot of work to ensure that it throws
         | NullPointerException at the correct point when it happens, and
         | this slows down all programs even though no-one sane has their
         | program intentionally trigger one. But the intention was to
         | permit Java-like behaviour where your code would crash with a
         | specific error immediately-ish, maybe not on the exact line
         | where you dereferenced null but close to it. Ironically
         | compiler writers then took that standard and used it to do the
         | exact opposite, making null dereference far more dangerous than
         | even just unconditionally reading memory address 0.
        
       | Hizonner wrote:
       | > Instead of translating what you'd like the hardware to perform
       | to C literally, treat C as a higher-level language, because it is
       | one.
       | 
       | Alternately, _stop writing code in C_.
        
         | oguz-ismail wrote:
         | impossible
         | 
         | no serious alternative
        
           | tialaramex wrote:
           | For _very_ small platforms, where it's a struggle to have a
           | C compiler because a "long int" of 32 bits is already a huge
           | challenge to implement, let alone "long long int" - stop
           | using high level languages. Figure out the few dozen machine
           | code instructions you want for your program, write them down,
           | review, translate to binary, done.
           | 
           | For the bigger systems where that's not appropriate, you'll
           | value a more expressive language. I recommend Rust
           | particularly, even though Rust isn't available everywhere
           | there's an excellent chance it covers every platform you
           | actually care about.
        
         | mcdeltat wrote:
         | IMO one of the most disappointing things about C: it smells
         | like it should be a straightforward translation to assembly,
         | but actually completely is not because of the "virtual machine"
         | magic the Standard uses which opens the door to almost
         | anything.
         | 
         | Oh you would like a byte? Is that going to be a 7 bit, 8 bit,
         | 12 bit, or 64 bit byte? It's not specified, yay! Have fun
         | trying to write robust code.
        
           | HeliumHydride wrote:
           | C++ has made efforts to fix some of this. Recently, they
           | enforced that signed integers must be two's complement. There
           | is a proposal currently to fix the size of bytes to 8 bits.
        
             | mcdeltat wrote:
             | Yes, which is excellent (although 50 years too late, I'll
             | try not to be too cynical...).
             | 
             | The problem is that C++ is a huge language which is complex
             | and surely not easy to implement. If I want a small, easy
             | language for my next microprocessor project, it probably
             | won't be C++20. It seems like C is a good fit, but really
             | it's not because it's a high level language with a myriad
             | of weird semantics. AFAIK we don't have a simple "portable
             | assembler + a few niceties" language. We either use
             | assembly (too low level), or C (slightly too high level and
             | full of junk).
        
           | bobmcnamara wrote:
           | Ahem, it's specified to not be 7.
        
           | tialaramex wrote:
           | Abstract. It's an Abstract machine, not a Virtual machine.
        
           | zajio1am wrote:
           | Size of byte is implementation-defined, not unspecified. Why
           | is that a problem for writing robust code? It is okay to
           | assume implementation-defined behavior as long as you are
           | targeting a subset of systems where these assumptions hold,
           | and if you check them at build-time.
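
A build-time check of this kind might look like the sketch below, assuming a C11 compiler. `pack_be32` is a made-up example of code that relies on the checked assumptions.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>
#include <stdint.h>

/* Refuse to compile on targets where the assumptions do not hold. */
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
static_assert(UINT_MAX >= 0xFFFFFFFFu, "this code assumes int has at least 32 bits");

/* Packs four octets, most significant first, into one 32-bit value.
 * Treating each array element as exactly one octet relies on the
 * CHAR_BIT check above. */
uint32_t pack_be32(const unsigned char b[4]) {
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}
```

On a platform with a different byte size, the build stops at the first assertion instead of producing a binary that misbehaves.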
        
           | keldaris wrote:
           | Luckily, little of it matters if you simply write C for your
           | actual target platforms, whatever they may be. C thankfully
           | discourages the very notion of "general purpose" code, so
           | unless you're writing a compiler, I've never really
           | understood why some C programmers actually care about the
           | standard as such.
           | 
           | In reality, if you're writing C in 2025, you have a finite
           | set of specific target platforms and a finite set of
           | compilers you care about. Those are what matter. Whether my
           | code is robust with respect to some 80s hardware that did
           | weird things with integers, I have no idea and really
           | couldn't care less.
        
             | msla wrote:
             | > I've never really understood why some C programmers
             | actually care about the standard as such.
             | 
             | Because I want the _next version_ of the compiler to agree
             | with me about what my code means.
             | 
             | The standard is an agreement: If you write code which
             | conforms to it, the compiler will agree with you about what
             | it means and not, say, optimize your important conditionals
             | away because some "Can't Happen" optimization was triggered
             | and the "dead" code got removed. This gets rather important
             | as compilers get better about optimization.
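
A small sketch of the kind of conditional that can disappear. The struct and function names are invented. Whether a given compiler actually removes the check depends on its version and flags, so treat the first function as a pattern to avoid, not as a prediction of generated code.

```c
#include <assert.h>
#include <stddef.h>

struct packet { size_t len; };

/* Risky ordering: p->len is read before the check. Dereferencing a null
 * pointer is undefined, so an optimizer is permitted to assume p != NULL
 * after the first line and treat the `if` as dead code. */
size_t payload_len_risky(const struct packet *p) {
    size_t len = p->len;
    if (p == NULL)
        return 0;
    return len;
}

/* Conforming ordering: the check precedes any dereference, so a
 * conforming compiler has to preserve its effect. */
size_t payload_len_checked(const struct packet *p) {
    if (p == NULL)
        return 0;
    return p->len;
}
```

Only the second version has a meaning the standard pins down for a null argument, which is the sense in which the next compiler release must "agree" with it.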
        
         | 1over137 wrote:
         | I'm excited about -fbounds-safety coming soon:
         | https://github.com/llvm/llvm-project/commit/64360899c76c
        
         | liontwist wrote:
         | No. I don't think I will.
        
         | kerkeslager wrote:
         | That's just not an option in a lot of cases, and it's not a
         | _good_ option in other cases.
         | 
         | Like it or not, C can run on more systems than anything else,
         | and it's by far the easiest language for doing a lot of low-
         | level things. The ease of, for example, accessing pointers,
         | does make it easier to shoot yourself in the foot, but when you
         | need to do that _all the time_ it's pretty hard to justify the
         | tradeoffs of another language.
         | 
         | Before you say "Rust": I've used it extensively, it's a great
         | language, and probably an ideal replacement for C in a lot of
         | cases (such as writing a browser). But it is absolutely
         | unacceptable for the garbage collector work I'm using C for,
         | because I'm doing complex stuff with memory which cannot
         | reasonably be done under the tyranny of the borrow checker. I
         | did spend about six weeks of my life trying to translate my
         | work into Rust and I can see a path to doing it, but you spend
         | so much time bypassing the borrow checker that you're clearly
         | not getting much value from it, and you're getting a massive
         | amount of faffing that makes it very difficult to see what the
         | code is actually doing.
         | 
         | I know HN loves to correct people on things they know nothing
         | about, so if you are about to Google "garbage collector in
         | Rust" to show me that it can be done, just stop. I know it can
         | be done, because I did it; I'm saying it's _not worth it_.
        
       | EtCepeyd wrote:
       | > Dereferencing a null pointer always triggers "UB".
       | 
       | Calling this a "falsehood" is utter bullshit.
        
       | mcdeltat wrote:
       | "falsehoods 'falsehoods programmers believe about X' authors
       | believe about X"...
       | 
       | All you need to know about null pointers in C or C++ is that
       | dereferencing them gives undefined behaviour. That's it. The buck
       | stops there. Anything else is you trying to be smart about it.
       | These articles are annoying because they try to sound smart by
       | going through generally useless technicalities the average
       | programmer shouldn't even be considering in the first place.
        
         | MathMonkeyMan wrote:
         | Useless, but interesting. I used to work with somebody who
         | would ask: What happens with this code?
         |     #include <iostream>
         |
         |     int main() {
         |         const char *p = 0;
         |         std::cout << p;
         |     }
         | 
         | You might answer "it's undefined behavior, so there is no point
         | reasoning about what happens." Is it undefined behavior?
         | 
         | The idea behind this question was to probe at the candidate's
         | knowledge of the sorts of things discussed in the article:
         | virtual memory, signals, undefined behavior, machine
         | dependence, compiler optimizations. And edge cases in iostream.
         | 
         | I didn't like this question, but I see the point.
         | 
         | FWIW, on my machine, clang produces a program that segfaults,
         | while gcc produces a program that doesn't. With "-O2", gcc
         | produces a program that doesn't attempt any output.
        
           | mcdeltat wrote:
           | I'm assuming it's meant to be:
           |
           |     std::cout << *p;
           | 
           | ?
           | 
           | I still think discussing it is largely pointless. It's UB and
           | the compiler can do about anything, as your example shows.
           | Unless you want to discuss compiler internals, there's no
           | point. Maybe the compiler assumes the code can't execute and
           | removes it all - ok that's valid. Maybe it segfaults because
           | some optimisation doesn't get triggered - ok that's valid. It
           | could change between compiler flags and compiler versions.
           | From the POV of the programmer it's effectively arbitrary
           | what the result is.
           | 
           | Where it gets harmful IMO is when programmers _think_ they
           | understand UB because they've seen a few articles, and start
           | getting smart about it. "I checked the code gen and the
           | compiler does X which means I can do Y, Z". No. Please stop.
           | You will pay the price in bugs later.
        
             | MathMonkeyMan wrote:
             | > I'm assuming it's meant to be: [...]
             | 
             | Nope, I mean inserting the character pointer ("string")
             | into the stream, not the character to which it maybe
             | points.
             | 
             | Your second paragraph demonstrates, I think, why my former
             | colleague asked the question. And I agree with your third
             | paragraph.
        
               | mcdeltat wrote:
               | Ah, I got confused for a minute why printing a character
               | pointer is UB. I was thinking of printing the address,
               | which is valid. But of course char* has a different
               | overload because it's a string. You can tell how much I
               | use std::string and std::string_view lol.
               | 
               | I reckon we are generally in agreement. Perhaps I am not
               | the best person to comment on the purpose of discussing
               | UB, since I already know all the ins and outs of it...
               | "Been there done that" kind of thing.
        
           | gerdesj wrote:
           | I think that reasoning about things is a good idea and
           | looking at failure modes is an engineer's job. However, I
           | gather that the standard says "undefined", so a correct
           | answer to what "happens with this code" might be: "wankery"
           | (on the part of the questioner). You even demonstrate that
           | undefined status with concrete examples.
           | 
           | In another discipline you might ask what happens
           | when you stress a material near to or beyond its plastic
           | limit? It's quite hard to find that limit precisely, without
           | imposing lots of constraints. For example take a small metal
           | thing eg a paper clip and bend it repeatedly. Eventually it
           | will snap due to quite a few effects - work hardening,
           | plastic limit and all that stuff. Your body heat will affect
           | it, along with ambient temperature. That's before we worry
           | about the material itself, which for a paper clip will be
           | pretty straightforward ... ish!
           | 
           | OK, let's take a deeper look at that crystalline metallic
           | structure ... or let's see what happens with concrete or
           | concrete with steel in it, ooh let's stress that stuff and
           | bend it in strange ways.
           | 
           | Anyway, my point is: if you have something as simple as a
           | standard that says: "this will go weird if you do it" then
           | accept that fact and move on - don't try to be clever.
        
         | SR2Z wrote:
         | Haha all of the examples in the article are basically "here's
         | some really old method for making address 0 a valid pointer."
         | 
         | This isn't like timezones or zip codes where there are lots of
         | unavoidable footguns - pretty much everyone at every layer of
         | the stack thinks that a zero pointer should never point to
         | valid data and should result in, at the very least, a segfault.
        
         | SAI_Peregrinus wrote:
         | Not quite.
         | 
         | Trivially, `&*E` is equivalent to `E`, even if `E` is a null
         | pointer (C23 standard, footnote 114 from section 6.5.3.2
         | paragraph 4, page 80). So since `&*` is a no-op that's not UB.
         | 
         | Also `*(a+b)` where `a` is NULL but `b` is a nonzero integer
         | never dereferences the NULL pointer, but is still undefined
         | behavior since conversions from null pointers to pointers of
         | other types still do not compare equal to pointers to any
         | actual objects or functions (6.3.2.3 paragraph 3) and addition
         | or subtraction of pointers into array objects with integers
         | that produce results that don't point into the same array
         | object are UB (6.5.6).
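
The first point can be sketched in a few lines, assuming the file is compiled as C (C99 or later). Both function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* In C, `&*p` evaluates neither operator; the result is simply p.
 * That makes this well defined even when p is a null pointer. */
int *address_of_deref(int *p) {
    return &*p;
}

/* Likewise `&a[i]` means `a + i`, with no access to a[i] itself.
 * The addition must still stay within the array or one past its end,
 * so the caller must keep i in [0, n] for an n-element array. */
int *address_of_element(int *a, size_t i) {
    return &a[i];
}
```

Passing a null `a` with a nonzero `i` to the second function would hit the pointer-arithmetic rule from 6.5.6 cited above, even though no dereference takes place.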
        
       | rstuart4133 wrote:
       | > In ye olden times, the C standard was considered guidelines
       | rather than a ruleset, undefined behavior was closer to
       | implementation-defined behavior than dark magic, and optimizers
       | were stupid enough to make that distinction irrelevant. On a
       | majority of platforms, dereferencing a null pointer compiled and
       | behaved exactly like dereferencing a value at address 0.
       | 
       | Let me unpack that for you. Old compilers didn't recognise
       | undefined behaviour, and so compiled the code that triggered
       | undefined behaviour in exactly the same way they compiled all
       | other code. The result was implementation defined, as the article
       | says.
       | 
       | Modern compilers can recognise undefined behaviour. When they
       | recognise it they don't warn the programmer "hey, you are doing
       | something non-portable here". Instead they may take advantage of
       | it in any way they damned well please. Most of those ways will be
       | contrary to what the programmer is expecting, consequently
       | yielding a buggy program.
       | 
       | But not in all circumstances. The icing on the cake is some
       | undefined behaviour (like dereferencing null pointers) is
       | tolerated (ie treated in the old way), and some not. In fact most
       | large C programs will rely on undefined behaviour of some sort,
       | such as what happens when integers overflow or signed is
       | converted to unsigned.
       | 
       | Despite that, what is acceptable undefined behaviour and what is
       | not isn't defined by the standard, or anywhere else really. So
       | the behaviour of most large C programs is legally allowed to
       | change if you use a different compiler, a different version of
       | the same compiler, or just different optimisation flags.
       | Consequently most C programs depend on compiler writers doing
       | the same thing with some undefined behaviour, despite there being
       | no guarantees that will happen.
       | 
       | This state of affairs, which is to say having a language standard
       | that doesn't standardise major features of the language, is
       | apparently considered perfectly acceptable by the C standards
       | committee.
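
For the integer-overflow case mentioned above, one way to avoid depending on what a particular compiler does is to test before adding. This is a sketch of that one technique only; `checked_add_int` is a made-up name.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Adds a and b without ever executing an overflowing signed addition.
 * Returns true and stores the sum in *out when it is representable;
 * returns false and leaves *out untouched otherwise. */
bool checked_add_int(int a, int b, int *out) {
    assert(out != NULL);
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;
    *out = a + b;   /* cannot overflow: the range was checked above */
    return true;
}
```

The comparisons are written so that `INT_MAX - b` and `INT_MIN - b` themselves stay in range, which keeps the check free of the overflow it guards against.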
        
         | ahartmetz wrote:
         | I see at least two possible reasons why it happened: 1. "Don't
         | do something counterproductive" or "Don't get your priorities
         | wrong" do not usually need to be said explicitly, 2. Standards
         | "culture" values precision so much that they'd balk at writing
         | fuzzy things like "Do typical null pointer things when trying
         | to deref a null pointer".
         | 
         | Then later 3. "But we implemented it that way for the
         | benchmarks, can't regress there!"
        
       | caspper69 wrote:
       | The article wasn't terrible. I give it a C+ (no pun intended).
       | 
       | Too general, too much trivia without explaining the underlying
       | concepts. Questionable recommendations (without covering
       | potential pitfalls).
       | 
       | I have to say that the discourse here is refreshing. I got a
       | headache reading the 190+ comments on the /r/prog post of this
       | article. They are a lively bunch though.
        
       | metalcrow wrote:
       | > asking for forgiveness (dereferencing a null pointer and then
       | recovering) instead of permission (checking if the pointer is
       | null before dereferencing it) is an optimization. Comparing all
       | pointers with null would slow down execution when the pointer
       | isn't null, i.e. in the majority of cases. In contrast, signal
       | handling is zero-cost until the signal is generated, which
       | happens exceedingly rarely in well-written programs.
       | 
       | Is this actually a real optimization? I understand the principle,
       | that you can bypass explicit checks by using exception handlers
       | and then massage the stack/registers back to a running state, but
       | does this actually optimize speed? A null pointer check is
       | literally a single TEST on a register, followed by a conditional
       | jump the branch predictor is 99.9% of the time going to know what
       | to do with. How much processing time is using an exception
       | actually going to save? Or is there a better example?
        
         | oguz-ismail wrote:
         | Signal handling is done anyway. The cost of null pointer checks
         | is a net overhead, if minuscule.
        
         | liontwist wrote:
         | > A null pointer check is literally a single TEST on a register
         | 
         | On every pointer deref in your entire program. Not for release
         | mode.
        
         | nickff wrote:
         | The OP is offering terrible advice based on a falsehood they
         | believe about null pointers. In many applications (including
         | the STM32H743 microcontroller that I am currently working on),
         | address zero (which is how "NULL" is defined by default in my
         | IDE) points to RAM or FLASH. In my current application, NULL is
         | ITCM (instruction tightly coupled memory), and it's where I've
         | put my interrupt vector table. If I read it, I don't get an
         | error, but I may get dangerously wrong data.
        
         | gwbas1c wrote:
         | > Is this actually a real optimization?
         | 
         | No... And yes.
         | 
         | No: _Because throwing and catching the null pointer exception
         | is hideously slow compared to doing a null check._ In Java /
         | C#, the exception is an allocated object, and the stack is
         | walked to generate a stack trace. This is in addition to any
         | additional lower-level overhead (panic) that I don't understand
         | the details well enough to explain.
         | 
         | Yes: If, in practice, the pointer is never null, (and thus a
         | null pointer is truly an _exceptional_ situation,) carefully-
         | placed exception handlers are an optimization. Although, yes,
         | the code will technically be faster because it's not doing
         | null checks, _the most important optimization is developer time
         | and code cleanliness._ The developer doesn't waste time adding
         | redundant null checks, and the next developer finds code that
         | is easier to read because it isn't littered with redundant null
         | checks.
        
           | mr_00ff00 wrote:
           | "Most important optimization is developer time and code
           | cleanliness"
           | 
           | True for 99% of programming jobs, but if you are worried
           | about the speed of null checks, you are in that 1%.
           | 
           | In high frequency trading, if you aren't first you're last, and
           | this is the exact type of code optimizations you need for the
           | "happy path"
        
           | wging wrote:
           | If you're actually paying a significant cost generating stack
           | traces for NPEs, there's a JVM option to deal with that
           | (-XX:-OmitStackTraceInFastThrow). It still generates a stack
           | trace the first time; if you're able to go search for that
           | first one it shouldn't be a problem for debugging.
        
         | Aloisius wrote:
         | OpenJVM does it, iirc. If the handler is triggered too often at
         | a location, it will swap back to emitting null checks though
         | since it is rather expensive.
         | 
         | Of course, there's a big difference between doing it in a VM
         | and doing it in a random piece of software.
        
         | toast0 wrote:
         | Sure, the cost of the check is small, and if you actually hit a
         | null pointer, the cost is much higher if it's flagged by the
         | MMU instead of a simple check.
         | 
         | But you're saving probably two bytes in the instruction stream
         | for the test and conditional jump (more if you don't have
         | variable length instructions), and maybe that adds up over your
         | whole program so you can keep meaningfully more code in cache.
        
       | Blikkentrekker wrote:
       | > _In ye olden times, the C standard was considered guidelines
       | rather than a ruleset, undefined behavior was closer to
       | implementation-defined behavior than dark magic, and optimizers
       | were stupid enough to make that distinction irrelevant. On a
       | majority of platforms, dereferencing a null pointer compiled and
       | behaved exactly like dereferencing a value at address 0._
       | 
       | > _For all intents and purposes, UB as we understand it today
       | with spooky action at a distance didn't exist._
       | 
       | The first official C standard was from 1989, the second real
       | change was in 1995, and the infamous "nasal daemons" quote was
       | from 1992. So evidently the first C standard was already
       | interpreted that way, that compilers were really allowed to do
       | anything in the face of undefined behavior. As far as I know
        
       | userbinator wrote:
       | "Falsehoods programmers believe" is the "considered harmful" of
       | the modern dogma cult.
        
         | butter999 wrote:
         | It's a genre. It's neither dogmatic, modern, nor unique to
         | programming.
        
       | megous wrote:
       | Dereferencing a null pointer is how I boot half of my systems. :D
       | On Rockchip platforms address 0 is start of DRAM, and a location
       | where [U-Boot] SPL is loaded after DRAM is initialized. :)
        
         | SAI_Peregrinus wrote:
         | That's not a null pointer. Address `0` can be valid. A null
         | pointer critically _does not compare equal to any non-null
         | pointer, including a pointer to address 0 on platforms where
         | that's allowed_.
         | 
         | > An integer constant expression with the value `0`, such an
         | expression cast to type `void *`, or the predefined constant
         | `nullptr` is called a null pointer constant ^69). If a null
         | pointer constant or a value of the type `nullptr_t` (which is
         | necessarily the value `nullptr`) is converted to a pointer
         | type, the resulting pointer, called a null pointer, is
         | guaranteed to compare unequal to a pointer to any object or
         | function.
         | 
         | C23 standard, 6.3.2.3 paragraph 3
         | 
         | Also this is point 6 in the article.
        
       | heraclius1729 wrote:
       | > In both cases, asking for forgiveness (dereferencing a null
       | pointer and then recovering) instead of permission (checking if
       | the pointer is null before dereferencing it) is an optimization.
       | Comparing all pointers with null would slow down execution when
       | the pointer isn't null, i.e. in the majority of cases. In
       | contrast, signal handling is zero-cost until the signal is
       | generated, which happens exceedingly rarely in well-written
       | programs.
       | 
       | At least from a C/C++ perspective, I can't help but feel like
       | this isn't great advice. There isn't a "null dereference" signal
       | that gets sent--it's just a standard SIGSEGV that cannot be
       | distinguished easily from other memory access violations
       | (memprotect, buffer overflows, etc). In principle I suppose you
       | could write a fairly sophisticated signal handler that accounts
       | for this--but at the end of the day it _must_ replace the pointer
       | with a not null one, as the memory read will be immediately
       | retried when the handler returns. You'll get stuck in an
       | infinite loop (READ, throw SIGSEGV, handler doesn't resolve the
       | issue, READ, throw SIGSEGV, &c.) unless you do something to the
       | value of that pointer.
       | 
       | All this to avoid the cost of an if-statement that almost always
       | has the same result (not null), which is perfect conditions for
       | the CPU branch predictor.
       | 
       | I'm not saying that it is definitely better to just do the check.
       | But without any data to suggest that it is actually more
       | performant, I don't really buy this.
       | 
       | EDIT: Actually, this is made a bit worse by the fact that
       | dereferencing nullptr is undefined behavior. Most implementations
       | set the nullptr to 0 and mark that page as unreadable, but that
       | isn't a sure thing. The author says as much later in this
       | article, which makes the above point even weirder.
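
The retry loop described above can be avoided by leaving the faulting code with `siglongjmp` instead of returning from the handler. The sketch below shows that alternative, not the pointer-patching approach. It assumes Linux-like POSIX behaviour where touching a `PROT_NONE` page raises SIGSEGV (other systems may deliver a different signal), and it faults on such a page, not on a null pointer, so the example itself stays out of undefined behaviour. All function names are invented.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>

static sigjmp_buf recover_point;

/* Returning normally would restart the faulting load, so jump out. */
static void on_segv(int sig) {
    (void)sig;
    siglongjmp(recover_point, 1);
}

/* One page with no access rights, standing in for the unmapped page at
 * address 0. Returns NULL if the mapping could not be created. */
void *make_inaccessible_page(void) {
    void *p = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Tries to read *addr. Returns 0 and stores the value on success, or -1
 * if the read raised SIGSEGV or the handler could not be installed.
 * Not thread-safe: it uses one global jump buffer and temporarily
 * replaces the process-wide SIGSEGV handler. */
int try_read_int(const volatile int *addr, int *out) {
    struct sigaction sa, old;
    int status;

    assert(out != NULL);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    /* Second argument 1 saves the signal mask, so SIGSEGV is unblocked
     * again after the jump out of the handler. */
    if (sigsetjmp(recover_point, 1) == 0) {
        *out = *addr;    /* may fault */
        status = 0;
    } else {
        status = -1;     /* reached via siglongjmp from on_segv */
    }

    sigaction(SIGSEGV, &old, NULL);   /* restore the previous handler */
    return status;
}
```

Even in this small form the handler cannot tell a null dereference from any other bad access, which is the objection raised in the comment. It also says nothing about cost, so the question of whether this beats a predicted branch remains a matter for measurement.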
        
       ___________________________________________________________________
       (page generated 2025-02-01 08:00 UTC)