hngopher.com

       [HN Gopher] The provenance memory model for C
       ___________________________________________________________________
        
       The provenance memory model for C
        
       Author : HexDecOctBin
       Score  : 193 points
       Date   : 2025-06-30 09:25 UTC (13 hours ago)
        
 (HTM) web link (gustedt.wordpress.com)
 (TXT) w3m dump (gustedt.wordpress.com)
        
       | zombot wrote:
       | Does C allow Unicode identifiers now, or is that pseudo code? The
       | code snippets also contain `&amp;`, so something definitely went
       | wrong with the transcoding to HTML.
        
         | qsort wrote:
         | Quoting cppreference:
         | 
         | An identifier is an arbitrarily long sequence of digits,
         | underscores, lowercase and uppercase Latin letters, and Unicode
         | characters specified using \u and \U escape notation(since
         | C99), of class XID_Continue(since C23). A valid identifier must
         | begin with a non-digit character (Latin letter, underscore, or
         | Unicode non-digit character(since C99)(until C23), or Unicode
         | character of class XID_Start)(since C23)). Identifiers are
         | case-sensitive (lowercase and uppercase letters are distinct).
         | Every identifier must conform to Normalization Form C.(since
         | C23)
         | 
         | In practice depends on the compiler.
        
           | dgrunwald wrote:
           | But the source character set remains implementation-defined,
           | so compilers do not have to directly support unicode names,
           | only the escape notation.
           | 
           | Definitely a questionable choice to throw off readers with
           | unicode weirdness in the very first code example.
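
A minimal sketch of the escape notation mentioned above, assuming a compiler that accepts universal character names in identifiers (C99 or later). The function name scale and its parameters are hypothetical and do not come from the article.

```c
/* \u03b1 is the universal character name for GREEK SMALL LETTER ALPHA.
 * A compiler only has to accept this escaped spelling; whether the raw
 * UTF-8 character is accepted in the source file is up to the
 * implementation. */
int scale(int \u03b1, int x)
{
    return \u03b1 * x;
}
```

Calling scale(3, 4) multiplies the two arguments, exactly as if the parameter had an ASCII name.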
        
             | qsort wrote:
             | If it were up to me, anything outside the basic character
             | set in a source file would be a syntax error, I'm simply
             | reporting what the spec says.
        
               | ncruces wrote:
               | I use unicode for math in comments, and think it makes
               | certain complicated formulas far more readable.
        
               | kzrdude wrote:
               | I've just been learning pinyin notation, so now I think
               | the variable r[?] should have a value that first goes
               | down a bit and then up.
        
               | zelphirkalt wrote:
               | I am not sure it is a good idea to mix such specific
               | phonetic script ideas about diacritic marks with the
               | behavior of the program over time. Even considering the
               | shape, it does not align with the idea of first down a
               | little, then up a lot.
        
               | guipsp wrote:
               | What a "basic character set" is depends on locale
        
               | qsort wrote:
               | https://en.cppreference.com/w/c/language/charset.html
        
               | account42 wrote:
               | Anything except US-ASCII in source code outside comments
               | and string constants should be a syntax error.
        
               | guipsp wrote:
               | You are aware other languages exist? Some of which don't
               | even use the Latin script?
        
               | Y_Y wrote:
               | What? like APL!?
        
               | nottorp wrote:
               | Dunno about the OP but I'm very aware as I'm not an
               | english speaker.
               | 
               | I still don't want anything as unpredictable as Unicode
               | in my code. How many different encodings will display as
               | the same variable name and how is the compiler supposed
               | to decide?
               | 
               | If you're thinking of comments and user facing strings,
               | the OP already excluded those.
        
         | unwind wrote:
         | I can't even view the post, I just get some kind of content
         | management system-like with the page as JSON or something, in
         | pink-on-white. I'm super confused. :|
         | 
         | The answer to your question seems to (still) be "no".
        
         | pjmlp wrote:
         | Besides the sibling comment on C23, it does work fine on GCC.
         | 
         | https://godbolt.org/z/qKejzc1Kb
         | 
         | Whereas clang loudly complains,
         | 
         | https://godbolt.org/z/qWrccWzYW
        
         | Y_Y wrote:
         | Implementation-defined until C99, explicitly possible via UCNs
         | since C99, possible with explicit encoding since C23, but
         | literals are _still_ implementation defined.
        
       | tialaramex wrote:
       | Presumably this was converted from markdown or similar and the
       | conversion partly failed or the input was broken.
       | 
       | From the PVI section onward it seems to recover, but if the
       | author sees this please fix and re-convert your post.
       | 
       | [Edited, nope, there are more errors further in the text, this
       | needed proper proofreading before it was posted, I can somewhat
       | struggle through because I already know this topic but if this
       | was intended to introduce newcomers it's probably very confusing]
        
         | gustedt wrote:
         | The problem is that wordpress changes these things once you
         | edit in some part. I will probably regenerate the whole.
        
       | lioeters wrote:
       | Looks like a code block didn't get closed properly, before this
       | phrase:
       | 
       | > the functions `recip` and `recip+` and not equivalent
       | 
       | Several paragraphs after this got swallowed by the code block.
       | 
       | Edit: Oh, I didn't realize the article is by the author of the
       | book, Modern C. I've seen it recommended in many places.
       | 
       | > The C23 edition of Modern C is now available for free download
       | from https://hal.inria.fr/hal-02383654
        
         | johnisgood wrote:
         | It is a great book. I prefer the second edition, not the latest
         | one though with what I call "bloated C".
        
           | laqq3 wrote:
           | I'm wondering if you could elaborate? I'd be curious to hear
           | more about "bloated C" and the differences between the 2nd
           | and 3rd edition.
        
         | shakabrah wrote:
         | It made immediate sense to me it was Jen once I saw the code
         | samples given
        
         | zmodem wrote:
         | > Looks like a code block didn't get closed properly
         | 
         | This seems to have been fixed now.
        
           | perching_aix wrote:
           | I still see it, even after clearing caches, visiting from a
           | separate browser from a separate computer (even a separate
           | network).
        
       | gavinray wrote:
       | Also of interest to folks looking at this might be TySan, the
       | recently-merged LLVM Type-Based Aliasing sanitizer:
       | 
       | https://clang.llvm.org/docs/TypeSanitizer.html
       | 
       | https://www.phoronix.com/news/LLVM-Merge-TySan-Type-Sanitize...
        
         | aengelke wrote:
         | It's probably worth noting that TySan currently only catches
         | aliasing violations that LLVM would be able to exploit. For
         | some types, e.g. unions, Clang doesn't emit accurate type-based
         | aliasing information and therefore TySan won't catch these.
        
           | flohofwoe wrote:
           | Which is fine I think, considering that union type punning is
           | legal in C (and even in C++ where union type punning is UB I
           | have never seen it break - theoretically it might of course).
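
A small sketch of the union type punning discussed above, next to the memcpy alternative that is also well defined in C++. It assumes float is a 32-bit IEEE 754 value; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reads the object representation of a float through a union member.
 * C allows reading a member other than the one last written; the bytes
 * are reinterpreted in the new type. */
uint32_t float_bits_union(float f)
{
    union { float f; uint32_t u; } pun = { .f = f };
    return pun.u;
}

/* Same result via memcpy, which does not rely on union semantics. */
uint32_t float_bits_memcpy(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Under the stated assumption, both functions return 0x3f800000 for 1.0f.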
        
       | jvanderbot wrote:
       | I love Rust, but I miss C. If C can be updated to make it
       | generally socially acceptable for new projects, I'd happily go
       | back for some decent subset of things I do. However, there's a
       | lot of anxiety and even angst around using C in production code.
        
         | mikewarot wrote:
         | If you can stomach the occasional Begin and End, and a far less
         | confusing pointer syntax, Pascal might be the language for you.
         | Free Pascal has some great string handling, so you never have
         | to worry about allocating and freeing them, and they can store
         | gigabytes of text, even Unicode. ;-)
        
           | tgv wrote:
           | Or try Ada.
        
           | jvanderbot wrote:
           | If my fellow devs cringe at C, imagine their reaction to
           | Pascal
        
             | mikewarot wrote:
             | C has all the things to hate in a programming language:
             | CaSe Sensitivity
             | Weird pointer syntax
             | Lack of a separate assignment token
             | Null terminated strings
             | Macros - the evil scourge of the universe
             | 
             | On the plus side, it's installed everywhere, and it's not
             | indent sensitive
        
               | jvanderbot wrote:
               | At this point, you're talking to someone who isn't here
        
               | ioasuncvinvaer wrote:
               | Except for null terminated strings these don't seem like
               | major issues to me. Can you elaborate?
        
               | 1718627440 wrote:
               | > Lack of a separate assignment token
               | 
               | What does that mean?
        
               | kbolino wrote:
               | Assignment is = which is too close to equality == and
               | thus has been the source of bugs in the past, especially
               | since C treats assignment as an expression and coerces
               | lots of non-boolean values to true/false wherever a
               | condition is expected (if, while, for). Most compilers
               | warn about this at least nowadays.
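
A short sketch of the bug class described above. The function names are hypothetical, and the doubled parentheses are only there because they are the usual way to tell GCC and Clang that the assignment is intentional, so the example compiles without a warning.

```c
#include <stdbool.h>

/* Intended to test x == 0, but assigns instead. The assignment
 * expression has the value 0, which is treated as false, so the
 * branch is never taken, whatever x was. */
bool is_zero_buggy(int x)
{
    if ((x = 0))
        return true;
    return false;
}

/* The comparison that was meant. */
bool is_zero(int x)
{
    return x == 0;
}
```

With a single pair of parentheses, `if (x = 0)` still compiles, and that is the case compilers typically warn about.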
        
               | tialaramex wrote:
               | Even with warnings this is just terrible. People need to
               | stop inventing languages where "False" is true, or an
               | empty container is false or other insane "coercions" of
               | this kind.
               | 
               | True is true, and false is false, if you're wondering
               | whether this Doodad is Wibbly, you should _ask that
               | question_ not rely on a convention that Wibbly Doodads
               | are somehow "truthy" while the non-Wibbly ones are not.
        
               | zelphirkalt wrote:
               | You mean "mere string replacement macros, instead of
               | hygienic macros", of course : )
        
         | flohofwoe wrote:
         | > to make it generally socially acceptable for new projects...
         | 
         | Or better yet, don't let 'social pressure' influence your
         | choice of programming language ;)
         | 
         | If your workplace has a clear rule to not use memory-unsafe
         | languages for production code that's a different matter of
         | course. But nothing can stop you from writing C code as a hobby
         | - C99 and later is a very enjoyable and fun language.
        
           | xxs wrote:
           | I was about to reply no amount of pressure can tell me how
           | to program. C was totally fine for esp32
        
           | TimorousBestie wrote:
           | > Or better yet, don't let 'social pressure' influence your
           | choice of programming language ;)
           | 
           | It's hard. Programming is a social discipline, and the more
           | people who work in a language, the more love it gets.
        
             | spauldo wrote:
             | If you're on UNIX or working in the embedded space, C is
             | still everywhere and gets lots of love. C tends to get lots
             | of libraries anyway because everything can FFI to it.
        
           | Y_Y wrote:
           | I don't want to summon WB, but honest-to-god, D is a good
           | middle ground here.
        
         | bnferguson wrote:
         | Feels like Zig is starting to fill that role in some ways.
         | Fewer sharp edges and a bit more safety than C, more modern
         | approach, and even interops really well with C (even being
         | possible to mix the two). Know a couple Rust devs that have
         | said it seems to scratch that C itch while being more modern.
         | 
         | Of course it's still really nice to just have C itself being
         | updated into something that's nicer to work with and easier to
         | write safely, but Zig seems to be a decent other option.
        
           | pjmlp wrote:
           | As usual, the remark that much of Zig's safety over C has
           | been present since the late 1970's in languages like
           | Modula-2, Object Pascal and Ada, but sadly they weren't born
           | with curly brackets, nor did they bring a free OS to the uni
           | party.
        
           | dnautics wrote:
           | (self-promotion) in principle one should be able to implement
           | a fairly mature pointer provenance checker for zig, without
           | changing the language. A basic proof of concept (don't use
           | this, branches and loops have not been implemented yet):
           | 
           | https://www.youtube.com/watch?v=ZY_Z-aGbYm8
        
           | purplesyringa wrote:
           | How close are Zig's safety guarantees to Rust's? Honest
           | question; I don't follow Zig development. I can't take C
           | seriously because it hasn't even bothered to define
           | provenance until now, but as far as I'm aware, Zig doesn't
           | even try to touch these topics.
           | 
           | Does Zig document the precise mechanics of noalias? Does it
           | provide a mechanism for controllably exposing or not exposing
           | provenance of a pointer? Does it specify the provenance ABA
           | problem in atomics on compare-exchange somehow or is that
           | undefined? Are there any plans to make allocation
           | optimizations sound? (This is still a problem even in Rust
           | land; you can write a program that is guaranteed to exhibit
           | OOM according to the language spec, but LLVM outputs code
           | that doesn't OOM.) Does it at least have a sanitizer like
           | Miri to make sure UB (e.g. data races, type confusion, or
           | aliasing problems) is absent?
           | 
           | If the answer to most of the above is "Zig doesn't care", why
           | do people even consider it better than C?
        
             | dnautics wrote:
             | safety-wise, zig is better than C because if you don't do
             | "easily flaggable things"[0] it doesn't have buffer
             | overruns (including protection in the case of sentinel
             | strings), or null pointer exceptions. Where this lies on
             | the spectrum of "C to Rust" is a matter of judgement, but
             | if I'm not mistaken it is easily a majority of memory-
             | safety related CVEs. There's also no UB in debug, test, or
             | release-safe. Note: you can opt-out of release-safe on a
             | function-by-function basis. IIUC noalias is safety checked
             | in debug, test, and release-safe.
             | 
             | In a sibling comment, I mentioned a proof of concept I did
             | that if I had the time to complete/do correctly, it should
             | give you near-rust-level checking on memory safety, plus
             | automatically flags sites where you need to inspect the
             | code. At the point where you are using MIRI, you're already
             | bringing extra stuff into rust, so in practice zig + zig-
             | clr could be the equivalent of the result of "what if you
             | moved borrow checking from rustc into miri"
             | 
             | [0] type erasure, or using "known dangerous types, like c
             | pointers, or non-slice multipointers".
        
               | tialaramex wrote:
               | This is very much a "Draw the rest of the fucking owl"
               | approach to safety.
        
               | dnautics wrote:
               | what percentage of CVEs are null pointer problems or
               | buffer overflows? That's what percentage of the owl has
               | been drawn. If someone (or me) builds out a proper zig-
               | clr, then we get to, what? 90%. Great. Probably good
               | enough, that's not far off from where rust is.
        
               | comex wrote:
               | Probably >50% of exploits these days target use-after-
               | frees, not buffer overflows. I don't have hard data
               | though.
               | 
               | As for null pointer problems, while they may result in
               | CVEs, they're a pretty minor security concern since they
               | generally only result in denial of service.
               | 
               | Edit 2: Here's some data: In an analysis by Google, the
               | "most frequently exploited" vulnerability types for zero-
               | day exploitation were use-after-free, command injection,
               | and XSS [3]. Since command injection and XSS are not
               | memory-unsafety vulnerabilities, that implies that use-
               | after-frees are significantly more frequently exploited
               | than other types of memory unsafety.
               | 
               | Edit: Zig previously had a GeneralPurposeAllocator that
               | prevented use-after-frees of heap allocations by never
               | reusing addresses. But apparently, four months ago [1],
               | GeneralPurposeAllocator was renamed to DebugAllocator and
               | a comment was added saying that the safety features
               | "require the allocator to be quite slow and wasteful". No
               | explicit reasoning was given for this change, but it
               | seems to me like a concession that applications that need
               | high performance generally shouldn't be using this type of
               | allocator. In addition, it appears that use-after-free is
               | not caught for stack allocations [2], or allocations from
               | some other types of allocators.
               | 
               | Note that almost the entire purpose of Rust's borrow
               | checker is to prevent use-after-free. And the rest of its
               | purpose is to prevent other issues that Zig also doesn't
               | protect against: tagged-union type confusion and data
               | races.
               | 
               | [1] https://github.com/ziglang/zig/commit/cd99ab32294a3c2
               | 2f09615...
               | 
               | [2] https://github.com/ziglang/zig/issues/3180.
               | 
               | [3] https://cloud.google.com/blog/topics/threat-
               | intelligence/202...
        
         | modeless wrote:
         | Fil-C is a modified version of Clang that makes C and C++
         | memory safe. It supports things you wouldn't expect to work
         | like signal handling or setjmp/longjmp. It can compile real C
         | projects like SQLite and OpenSSL with minimal to no changes,
         | today. https://github.com/pizlonator/llvm-project-
         | deluge/blob/delug...
        
           | tialaramex wrote:
           | Fil-C does seem like a quicker route if your existing idea
           | was something like "rewrite it in Java" and it exists today
           | whereas both C and C++ have only vague ambitions to deliver
           | some future language which might meet your needs.
           | 
           | I will be very surprised if there's widespread adoption of
           | Fil-C for many new projects though.
        
         | uecker wrote:
         | Do you really love Rust, or do you feel pressured to say so?
        
           | grg0 wrote:
           | He grew up in a very stringent household. Everybody was
           | writing Rust and he was like, "damn, I wish I could write C."
        
       | briandw wrote:
       | The code blocks are very difficult to read on this page. I had
       | ChatGPT O3 rewrite this in a more accessible format.
       | https://chatgpt.com/share/68629096-0624-8005-846f-7c0d655061...
        
         | cenobyte wrote:
         | So much better. Thank you!
        
       | b0a04gl wrote:
       | provenance model basically turns memory back into a typed value.
       | finally malloc wont just be a dumb number generator, it'll act
       | more like a capability issuer. and access is not 'is this address
       | in range' anymore, but "does this pointer have valid provenance".
       | way more deterministic, decouples gcc -wall
        
         | HexDecOctBin wrote:
         | Will this create more nasal demons? I always disable strict
         | aliasing, and it's not clear to me after reading the whole
         | article whether provenance is about making sane code illegal,
         | or making previously illegal sane code legal.
        
           | jcranmer wrote:
           | All C compilers have some notion of pointer provenance
           | embedded in them, and this is true going back decades.
           | 
           | The problem is that the documented definitions of pointer
           | provenance (which generally amount to "you must somehow have
           | a data dependency from the original object definition (e.g.,
           | malloc)") aren't really upheld by the optimizer, and the
           | effective definition of the optimizer is generally internally
           | inconsistent because people don't think about side effects of
           | pointer-to-integer conversion. The one-past-the-end pointer
           | being equal (but of different provenance) to a different
           | object is a particular vexatious case.
           | 
           | The definition given in TS6010 is generally the closest
           | you'll get to a formal description of the behavior that
           | optimizers are already generally following, except for cases
           | that are clearly agreed to be bugs. The biggest problem is
           | that it makes pointer-to-int an operation with side effects
           | that need to be preserved, and compilers today generally fail
           | to preserve those side effects (especially when pointer-to-
           | int conversion happens more as an implicit operation).
           | 
           | The practical effect of provenance--that you can't magic a
           | pointer to an object out of thin air--has always been true.
           | This is largely trying to clarify what it means to actually
           | magic a pointer out of thin air; it's not a perfect answer,
           | but it's the best answer anyone's come up with to date.
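
A minimal sketch of the one-past-the-end case mentioned above. The objects x and y and both function names are hypothetical; whether the two addresses actually coincide depends on how the implementation lays out x and y, so no particular result is claimed for the comparison.

```c
#include <stdbool.h>
#include <stdint.h>

static int x = 1;
static int y = 2;

/* Compares the addresses of "one past x" and "y" as integers.
 * Forming &x + 1 and converting it to uintptr_t is allowed. */
bool addresses_coincide(void)
{
    int *p = &x + 1;   /* provenance: x (one past its end) */
    int *q = &y;       /* provenance: y */
    return (uintptr_t)p == (uintptr_t)q;
}

/* Stores through a pointer derived from y itself, which is always fine.
 * Storing through &x + 1 instead would be undefined behaviour even if
 * addresses_coincide() returned true, because that pointer was derived
 * from x, not from y. */
int write_y(void)
{
    int *q = &y;
    *q = 11;
    return y;
}
```

The point of the sketch is that an equal address alone does not make a pointer usable for an object; the pointer also has to be derived from that object.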
        
           | layer8 wrote:
           | This is basically a formalization of the general
           | understanding one already had when reading the C standard
           | thoroughly 25 years ago. At least I was nodding along
           | throughout the article. It cleans up the parts where the
           | standard was too imprecise and handwavy.
        
           | Diggsey wrote:
           | It's standardizing the contract between the programmer and
           | the compiler.
           | 
           | Previously a lot of C code was non-portable because it relied
           | on behaviour that wasn't defined as part of the standard. If
           | you compiled it with the wrong compiler or the wrong flags
           | you might get miscompilations.
           | 
           | The provenance memory model draws a line in the sand and says
           | "all C code on this side of the line should behave in this
           | well defined way". Any optimizations implemented by compiler
           | authors which would miscompile code on that side of the line
           | would need to be disabled.
           | 
           | Assuming the authors of the model have done a good job, the
           | impact on compiler optimizations should be minimized whilst
           | making as much existing C code fall on the "right" side of
           | the line as possible.
           | 
           | For new C code it provides programmers a way to write useful
           | code that is also portable, since we now have a line that we
           | can all hopefully agree on.
        
       | cenobyte wrote:
       | Please fix the code in your post.
        
       | eqvinox wrote:
       | Using the "register" storage class feels _really_ alien for C
       | code written in 2025...
        
         | flohofwoe wrote:
         | It has a slightly different meaning now, instead of hinting to
         | the compiler that the variable should be placed in a register
         | it now means that it is illegal to take the address of the
         | variable (e.g. cannot create a pointer from it):
         | 
         | https://www.godbolt.org/z/eEYf5c59f
         | 
         | Might be useful in some situations although I currently can't
         | think of any :)
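
A small sketch of the restriction described above. The function name sum_to is hypothetical; the commented-out line is the one a conforming compiler has to reject.

```c
/* Sums the integers 1..n using register-qualified locals. */
int sum_to(int n)
{
    register int total = 0;
    for (register int i = 1; i <= n; ++i)
        total += i;
    /* int *p = &total;    not allowed: the address of an object
     *                     declared register cannot be taken */
    return total;
}
```

Since no pointer to total can exist, nothing outside this function can alias it.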
        
       | smcameron wrote:
       | Ugh. Are unicode variable names allowed in C now? That's
       | horrific.
        
         | mananaysiempre wrote:
         | "Now" as in since C99, twenty-five years ago, yes. (It seemed
         | like a good idea at the time.)
        
           | 90s_dev wrote:
           | See also https://www.ethiocloud.com/bunnascript.aspx and
           | https://en.wikipedia.org/wiki/Non-English-
           | based_programming_...
        
           | kevincox wrote:
           | Being able to program in languages that don't fit into ASCII
           | is a good idea. Using one-character variable names is a bad
           | idea.
        
             | adrianN wrote:
             | Using variable names that are different but render (almost)
             | the same can be a bad idea.
        
             | RossBencina wrote:
             | Mathematics is a language that doesn't fit into ASCII and
             | commonly uses one-character variable names. If you are
             | implementing a documented mathematical algorithm (i.e. one
             | with a description in a paper or book) then sticking to the
             | notation of the paper (i.e. using one character variable
             | names) makes sense to me.
        
               | mananaysiempre wrote:
               | Unfortunately, many of the things of this nature that
               | you'll want to implement use indices, which are
               | inevitably going to start at 1. So you've still got
               | plenty of hours of unpleasant debugging ahead of you, and
               | a non-obvious correspondence to the original paper at the
               | end of it.
        
               | kevincox wrote:
               | I find math far easier to read when the authors use
               | proper names for variables. But I understand that it
               | isn't the idiomatic style and agree that it can be useful
               | to match the paper when re-implementing an algorithm.
        
         | 1over137 wrote:
         | Horrific? You might not think so if your (human) language used
         | a different alphabet.
        
           | ajross wrote:
           | Little to no source code is written for single (human)
           | language development teams. Sure, everyone would like the
           | ability to write source code in their native language. That's
           | natural.
           | 
           | Literally no one, anywhere, wants to be forced to _read_
           | source written in a language they can't read (or more
           | specifically in this case: written in glyphs they can't even
           | produce on their keyboard). That idea, for almost everyone,
           | seems "horrific", yeah.
           | 
           | So a lingua franca is a firm requirement for modern software
           | development outside of extremely specific environments (FSB
           | malware authors probably don't care about anyone else reading
           | their cyrillic variable names, etc...). Must it be ASCII-
           | encoded English? No. But that's what the market has picked
           | and most people seem happy enough with it.
        
             | OkayPhysicist wrote:
             | > Little to no source code is written for single (human)
             | language development teams.
             | 
             | This is blatantly false. I'd posit that a solid 90% of all
             | source code written is done so by single, co-located teams
             | (a substantial portion of which are teams of 1). That
             | certainly fits the bill for most companies I've worked at.
        
           | eqvinox wrote:
           | Yes but also no. The thing about software is that 90% of it
           | is not culturally bound. If you're writing, say, some tax
           | reporting tool, a grammar reference, or something
           | religious... sure, it makes sense to write that in your
           | language. So, yeah, C should support that.
           | 
           | However, everything else, from spreadsheet software to CAD
           | tools to OS kernels to JavaScript frameworks is universal
           | across cultures and languages. And for better or for worse
           | (I'm not a native English speaker either), the world has gone
           | with English for a lot of code commons.
           | 
           | And the thing with the examples in that post isn't about
           | supporting language diversity, it's math symbols which are
           | no one's native language. _And you pretty much can't type
           | them on any keyboard._ Which really makes it a rather poor
           | flex IMHO. Did the author reconfigure their keyboard layout
           | for that specific math use case? It can't generically cover
           | "all of math" either. Or did they copy&paste it around?
           | That's just silly.
           | 
           | [...could some of the downvoters explain why they're
           | downvoting?]
        
             | OkayPhysicist wrote:
             | When I was doing a lot of Physics simulation in Julia, I
             | had a Vim extension which would just allow me to type
             | something like \gamma, hit tab, and get γ. This was worth
             | the (minimal) hassle, because it made it very easy to spot
             | check formulas. When you're shuffling data around in a
             | loosely-described space like most of web dev, descriptive
             | function and variable names are important because the
             | description of what you're doing and what you're doing it
             | to is the important information, and the actual operations
             | you're taking are typically approximately trivial.
             | 
             | In heavily mathematical contexts, most of those assumptions
             | get turned on their head. Anybody qualified to be modifying
             | a model of electromagnetism is going to be intimately
             | familiar with the language of the formulas: mu for
             | permeability, epsilon for permittivity, etc. With that
             | shared context,
             | 
             | 1/(4*π*ε)*(q_electron * q_proton)/r^2 is going to be a lot
             | easier to see, at a glance, as Coulomb's law
             | 
             | compared to
             | 
             | 1 / (4 * Math.Pi *
             | permittivity_of_free_space)*(charge_electron *
             | charge_proton)/distance_of_separation
             | 
             | Source code, like any other language built for humans, is
             | meant to be read by humans. If those humans have a shared
             | context, utilizing that shared context improves the quality
             | and ease of that communication.
        
               | eqvinox wrote:
               | Hrm. Fair point. But will the other humans, even if they
               | have the shared context, also have the ability to type in
               | these symbols, if they want to edit the code? They
               | probably don't have your vim extension...
               | 
               | I guess maybe this is an argument for better UI/UX for
               | symbolic input...
        
           | Joker_vD wrote:
           | My language uses Cyrillic and I personally prefer English-
           | based keywords and variable names precisely because they are
           | _not_ words of my (human) language. It introduces an easy and
           | obvious distinction between the machine-oriented and the
           | human-oriented.
        
             | ZoomZoomZoom wrote:
             | I know what you mean and I shudder when I see code that
             | uses words from my native lang, but most code _is_ human-
             | oriented.
        
         | OkayPhysicist wrote:
         | Why shouldn't they be? It's not the 00's anymore, Unicode
         | support is universal. You'd have to dust off some truly ancient
         | tech to find something incapable of rendering it.
         | 
         | Source code is for humans, and thus should be written in
         | whatever way makes it easiest to read, write, and understand
         | for humans. If your language doesn't map onto ASCII, then
         | Unicode support improves that goal. If your code is meant to
         | directly implement some physics formula, then using the
         | appropriate unicode characters might make it easier to read
         | (and thus spot transcription errors, something I find _far_ too
         | often in physics simulations).
        
           | wheybags wrote:
           | Hot take, but I've always felt the world would be better
           | served if mathematicians and physicists would stop using
           | terrible short variable names and use
           | longCamelCaseDescriptiveNames like the rest of us, because
           | paper is cheap, and abbreviations are confusing. I know it's
           | nicer when you're writing by hand, but when you clean up a
           | proof or formula for publishing, would it really be so hard
           | to switch to descriptive names?
           | 
           | I'm a practitioner of neither though, so I can't condemn the
           | practice wholeheartedly as an outsider, but it does make me
           | groan.
        
             | senbrow wrote:
             | Long names are good for short expressions, but they
             | obfuscate complex ones because the identifiers visually
             | crowd out the operators.
             | 
             | This can be especially difficult if the author is trying to
             | map 1:1 to a complex algorithm in a white paper that uses
             | domain-standard mathematical notation.
             | 
             | The alternative is to break the "full formula" into simpler
             | expression chunks, but then naming those partial expression
             | results descriptively can be even more challenging.
        
             | nsingh2 wrote:
             | Better served to students and those unfamiliar with the
             | field, but noisy to those familiar. Considering that much
             | of mathematical work is done using pen/paper, it would be a
             | total pain to write out huge variable names every time.
             | 
             | Consider a simple programming example, in C blocks are
             | delimited by `{}`, why not use `block_begin` and
             | `block_end`? Because it's noisy, and it doesn't take much
             | to internalize the meaning of braces.
        
           | someplaceguy wrote:
           | > using the appropriate unicode characters might make it
           | easier to read
           | 
           | It's probably also a great way to introduce almost
           | undetectable security vulnerabilities by using Unicode
           | characters that look similar to each other but in fact are
           | different.
        
             | OkayPhysicist wrote:
             | This would cause your compilation to fail, unless you were
             | deliberately declaring and using near identical symbols.
             | Which would violate the whole "Code is meant to be easily
             | read by humans" thing.
        
               | someplaceguy wrote:
               | > unless you were deliberately declaring and using near
               | identical symbols.
               | 
               | Yes, that would probably be one way to do it.
               | 
               | > Which would violate the whole "Code is meant to be
               | easily read by humans" thing.
               | 
               | I'd think someone who's deliberately and sneakily
               | introducing a security vulnerability would want it to be
               | undetectable, rather than easily readable.
        
           | bigstrat2003 wrote:
           | They shouldn't be precisely _because_ it makes the code
           | harder to read and write when you include non-ASCII
           | characters.
        
         | loeg wrote:
         | Math people shouldn't be allowed to write code. It's not the
         | unicode, so much as the extremely terse variable names.
        
           | perching_aix wrote:
           | Isn't that basically all C/C++ code? Admittedly I don't have
           | much exposure to it, but it's pretty much a trope in and of
           | itself, along with Java and C# suffering from the opposite
           | problem.
           | 
           | Such a silly issue too, you'd think we'd have come up with
           | some automated wrangling for this, so that those experienced
           | with a codebase can switch over and see super short versions
           | of identifiers, while people new to it all will see the long
           | stuff.
        
             | flohofwoe wrote:
             | > Isn't that basically all C/C++ code?
             | 
             | Maybe for code that was written in the early 90's, but the
             | only 'tradition' that has survived is calling the vanilla
             | loop variable 'i'.
        
         | SV_BubbleTime wrote:
         | > void recip(double* a[?], double* r[?])
         | > {
         | >   for (;;)
         | >   {
         | >     register double P = (*a[?])*(*r[?]);
         | 
         | My first thought before I saw this was _"I wonder is this
         | going to be an article from people who build things or
         | something from "academics" that don't."_
         | 
         | At least it was answered quickly.
        
       | Joker_vD wrote:
       | > Here the term "same representation and alignment" covers for
       | example the possibility to look at [...] one would be a structure
       | and the other would be another structure that sits at the
       | beginning of the first.
       | 
       | Does it? It is quite simple for a struct A that has struct B as
       | its first member to have radically different alignment:
       | 
       |     struct B { char x; };
       |     struct A { struct B b; long long y; };
       | 
       | Also, accidentally coinciding pointers are nothing "rare" because
       | all objects are allowed to be treated as 1-element arrays: so any
       | pointer to an e.g. struct field is also a pointer one-past the
       | previous field of this struct; also, malloc() allocations easily
       | may produce "touching" objects. So thanks for allowing
       | implementations to _not_ have padding between almost every two
       | objects, I guess.
        
         | layer8 wrote:
         | This is about the representation and alignment of the pointer
         | object, not about the object being pointed to. And C requires
         | struct pointer types to all have the same representation and
         | alignment. This is generally necessary due to the possibility
         | of having pointers to opaque struct declarations in a
         | translation unit.
         | 
         | Regarding your second point, if I understand the model
         | correctly, there is only an ambiguity in pointer provenance if
         | the adjacent objects are independent "storage instances", i.e.
         | separately malloc'ed objects or separate variables on the stack
         | -- not between fields of the same struct.
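
The distinction drawn in this exchange (alignment of the pointed-to structs versus representation and alignment of the pointer objects themselves) can be checked with a small C11 sketch. The helper names `pointee_align_A`, `pointee_align_B` and `first_member_shares_address` are hypothetical and exist only for illustration; the two structs are the ones from the comment above.

```c
#include <stddef.h>

struct B { char x; };
struct A { struct B b; long long y; };

/* The pointer objects: C requires all pointers to structure types to have
   the same representation and alignment, so these should hold on any
   conforming implementation. */
_Static_assert(sizeof(struct A *) == sizeof(struct B *),
               "struct pointers share size");
_Static_assert(_Alignof(struct A *) == _Alignof(struct B *),
               "struct pointers share alignment");

/* The pointed-to objects: their alignments are free to differ
   (typically 8 versus 1 on mainstream 64-bit targets). */
size_t pointee_align_A(void) { return _Alignof(struct A); }
size_t pointee_align_B(void) { return _Alignof(struct B); }

/* A pointer to a struct, suitably converted, points to its first member. */
int first_member_shares_address(void) {
    struct A a = { { 'x' }, 0 };
    return (void *)&a == (void *)&a.b;
}
```

On a typical platform `pointee_align_A()` is larger than `pointee_align_B()`, while the static assertions about the pointer types still pass, which is the point layer8 makes.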
        
       | dsp_person wrote:
       | if ((P- &lt; P) &amp;&amp; (P &lt; P+)) {
       | 
       | I spent way too long trying to figure this out as C code instead
       | of
       | 
       |     if ((P- < P) && (P < P+)) {
        
       | gustedt wrote:
       | Randomly introduced translation errors from markdown to
       | wordpress-internal should be fixed, now. Sorry for the
       | inconvenience!
        
       | nikic wrote:
       | At least at a skim, what this specifies for exposure/synthesis
       | for reads/writes of the object representation is concerning. One
       | of the consequences is that dead integer loads cannot be
       | eliminated, as they may have an exposure side effect. I guess C
       | _might_ be able to get away with it due to the interaction with
       | strict aliasing rules. Still quite surprised that they are going
       | against consensus here (which reduces the likelihood that these
       | semantics will get adopted by implementers).
        
         | uecker wrote:
         | (Never mind, I misread your comment at first.) Yes, the
         | representation access needs to be discussed... I took a couple
         | of years to publish this document. More important would be if
         | the ptr2int exposure could be implemented.
        
         | comex wrote:
         | > I guess C might be able to get away with it due to the
         | interaction with strict aliasing rules.
         | 
         | But not for char-typed accesses. And even for larger types, I
         | think you would have to worry about the combo of first
         | memcpying from pointer-typed memory to integer-typed memory,
         | then loading the integer. If you eliminate dead integer loads,
         | then you would have to not eliminate the memcpy.
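
To make the two routes in this subthread concrete, here is a minimal C sketch, assuming a flat-address platform where `uintptr_t` exists and has the same size as `void *`. The function names are hypothetical. The first function uses an explicit pointer-to-integer cast; the second copies the pointer's bytes into an integer object with `memcpy` and then loads that integer. Whether that integer load must count as an exposure, and so cannot be removed as dead code, is the open question raised above; the sketch only shows the shape of the code, not a ruling on the model.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: pointers and uintptr_t have the same size on this platform. */
_Static_assert(sizeof(uintptr_t) == sizeof(void *),
               "sketch assumes pointer-sized uintptr_t");

/* Route 1: an explicit cast to an integer and back. */
int roundtrip_via_cast(void) {
    int x = 42;
    uintptr_t bits = (uintptr_t)(void *)&x;
    int *p = (int *)(void *)bits;
    return *p;
}

/* Route 2: copy the pointer's object representation into integer-typed
   memory, then load the integer (the combination comex describes). */
int roundtrip_via_memcpy(void) {
    int x = 42;
    void *src = &x;
    uintptr_t bits;
    memcpy(&bits, &src, sizeof bits);
    int *p = (int *)(void *)bits;
    return *p;
}
```

On mainstream compilers both functions return the stored value; the debate is about which of the intermediate steps an optimizer is allowed to drop.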
        
       | hinkley wrote:
       | > Unfortunately no C compiler can do this optimization
       | automatically:
       | 
       | > The functions recip and recip+ are not equivalent.
       | 
       | This is one of those examples of how optimizing code can improve
       | legibility, robustness, or both.
       | 
       | The first implementation allows for side effects to change the
       | outcome of the function. But the problem is that the code is not
       | written expecting someone to modify the values in the middle of
       | the loop. It's incorrect behavior, and you're paying a
       | performance penalty for it to boot.
       | 
       | Functional Core code tends not to have this problem, in that we
       | pass in a snapshot of data and it either gets an answer or an
       | error.
       | 
       | I've seen too much code that checks 3 times if a user is either
       | still logged in or has permission to do a task, and not one of
       | them was set up to deal with one answer for the first call and a
       | different one for any of the subsequent ones. They just go into
       | undefined behavior.
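
The difference between re-reading through pointers and working on a snapshot can be shown with a small C sketch. This is not the article's `recip`; `add_twice_reload` and `add_twice_snapshot` are hypothetical functions chosen so the arithmetic is easy to follow. They agree when the two pointers refer to different objects and disagree when they alias, which is why a compiler cannot turn one into the other on its own.

```c
/* Re-reads *a and *r on every use, so it observes its own writes
   whenever a and r point to the same object. */
void add_twice_reload(double *a, double *r) {
    *r += *a;
    *r += *a;   /* if a == r, *a was already changed by the line above */
}

/* Takes a snapshot of both inputs first, then writes the result once. */
void add_twice_snapshot(double *a, double *r) {
    double av = *a;
    double rv = *r;
    rv += av;
    rv += av;
    *r = rv;
}
```

With distinct objects holding 1.0 and 10.0, both versions store 12.0. With a single object holding 1.0 passed as both arguments, the reloading version ends at 4.0 and the snapshot version at 3.0.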
        
       | jaisio wrote:
       | The root cause of all this is that C programs are not much more
       | than glorified assembly programs. Any effort to retrofit higher
       | level reasoning will always be defeated by somebody doing some dirty
       | pointer tricks. This can only be solved by more abstract ways to
       | express programs which necessarily restricts the bare metal dirty
       | things one can do. But what you gain is that the compiler will
       | easily be able to do lots of things which a C compiler can't do
       | or only with a lot of headache. The kind of stuff this article is
       | about is really trying to solve the wrong problem IMO.
        
       | RossBencina wrote:
       | After reading the fine article I'm left wondering what if you
       | implement your own heterogeneous allocation scheme on top of
       | malloc? (e.g. TLSF) In this case all of your objects will belong
       | to the same malloced storage region, and you will compute object
       | offsets using raw pointers, but I'd expect provenance to
       | potentially treat each returned object as if it were
       | allocated from a separate disjoint storage.
       | 
       | I guess my question is: does this provenance model allow for
       | recursive nesting of allocators with a separate notion of
       | "storage" at each level?
        
         | f33d5173 wrote:
         | The compiler knows about malloc, and hence knows that the
         | pointer returned by malloc won't alias any other pointer. Your
         | compiler might support some attribute to mark a function as
         | behaving like malloc in this respect. Otherwise the compiler
         | will be forced to assume the return value could alias any other
         | pointer.
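
As a sketch of the attribute mentioned above: GCC and Clang accept `__attribute__((malloc))` on a function to state that the returned pointer does not alias any other live pointer. The bump allocator below is hypothetical (`arena_alloc`, `arena_pool` and the `ARENA_MALLOC_LIKE` macro are made-up names) and uses a static buffer instead of a malloc'ed region to keep it short. It says nothing about how the provenance model itself treats nested allocators; it only shows how the non-aliasing promise is spelled as a compiler extension.

```c
#include <stddef.h>

/* The attribute is a GCC/Clang extension; other compilers get nothing. */
#if defined(__GNUC__) || defined(__clang__)
#define ARENA_MALLOC_LIKE __attribute__((malloc))
#else
#define ARENA_MALLOC_LIKE
#endif

static _Alignas(max_align_t) unsigned char arena_pool[1024];
static size_t arena_used = 0;

/* Hands out disjoint, suitably aligned chunks of arena_pool.
   Returns NULL when the request does not fit. Nothing is ever freed. */
ARENA_MALLOC_LIKE void *arena_alloc(size_t n) {
    const size_t align = _Alignof(max_align_t);
    if (n == 0) n = 1;                       /* keep returned pointers distinct */
    size_t rounded = (n + align - 1) / align * align;
    if (rounded < n || rounded > sizeof arena_pool - arena_used)
        return NULL;                         /* overflow or out of space */
    void *p = arena_pool + arena_used;
    arena_used += rounded;
    return p;
}
```

Two successive calls return non-overlapping blocks, so the promise made by the attribute is kept by construction; an allocator that could hand out overlapping regions must not carry it.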
        
       ___________________________________________________________________
       (page generated 2025-06-30 23:00 UTC)