[HN Gopher] The provenance memory model for C
___________________________________________________________________
The provenance memory model for C
Author : HexDecOctBin
Score : 193 points
Date : 2025-06-30 09:25 UTC (13 hours ago)
(HTM) web link (gustedt.wordpress.com)
(TXT) w3m dump (gustedt.wordpress.com)
| zombot wrote:
| Does C allow Unicode identifiers now, or is that pseudo code? The
| code snippets also contain `&amp;`, so something definitely went
| wrong with the transcoding to HTML.
| qsort wrote:
| Quoting cppreference:
|
| An identifier is an arbitrarily long sequence of digits,
| underscores, lowercase and uppercase Latin letters, and Unicode
| characters specified using \u and \U escape notation(since
| C99), of class XID_Continue(since C23). A valid identifier must
| begin with a non-digit character (Latin letter, underscore, or
| Unicode non-digit character(since C99)(until C23), or Unicode
| character of class XID_Start)(since C23)). Identifiers are
| case-sensitive (lowercase and uppercase letters are distinct).
| Every identifier must conform to Normalization Form C.(since
| C23)
|
| In practice depends on the compiler.
| dgrunwald wrote:
| But the source character set remains implementation-defined,
| so compilers do not have to directly support unicode names,
| only the escape notation.
|
| Definitely a questionable choice to throw off readers with
| unicode weirdness in the very first code example.
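To illustrate the escape notation mentioned above, here is a minimal sketch that spells an identifier with a universal character name (UCN) instead of a raw Unicode character. The function name `circle_area` is made up for this example, and the sketch assumes a C99-or-later compiler that accepts UCNs in identifiers.

```c
/* Hypothetical helper: area of a circle with radius r.
   The identifier below is the Greek letter pi, written as a UCN so the
   source file itself stays within the basic character set. */
static double circle_area(double r) {
    const double \u03c0 = 3.14159265358979323846;
    return \u03c0 * r * r;
}
```

Whether the same identifier may also be typed directly as a Unicode character depends on the compiler's source character set, as noted in the comment above.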
| qsort wrote:
| If it were up to me, anything outside the basic character
| set in a source file would be a syntax error, I'm simply
| reporting what the spec says.
| ncruces wrote:
| I use unicode for math in comments, and think it makes
| certain complicated formulas far more readable.
| kzrdude wrote:
| I've just been learning pinyin notation, so now i think
| the variable r[?] should have a value that first goes
| down a bit and then up.
| zelphirkalt wrote:
| I am not sure it is a good idea to mix such specific
| phonetic script ideas about diacritic marks with the
| behavior of the program over time. Even considering the
| shape, it does not align with the idea of first down a
| little, then up a lot.
| guipsp wrote:
| What a "basic character set" is depends on locale
| qsort wrote:
| https://en.cppreference.com/w/c/language/charset.html
| account42 wrote:
| Anything except US-ASCII in source code outside comments
| and string constants should be a syntax error.
| guipsp wrote:
| You are aware other languages exist? Some of which don't
| even use the Latin script?
| Y_Y wrote:
| What? like APL!?
| nottorp wrote:
| Dunno about the OP but I'm very aware as I'm not an
| English speaker.
|
| I still don't want anything as unpredictable as Unicode
| in my code. How many different encodings will display as
| the same variable name and how is the compiler supposed
| to decide?
|
| If you're thinking of comments and user facing strings,
| the OP already excluded those.
| unwind wrote:
| I can't even view the post, I just get some kind of content
| management system-like with the page as JSON or something, in
| pink-on-white. I'm super confused. :|
|
| The answer to your question seems to (still) be "no".
| pjmlp wrote:
| Besides the sibling comment on C23, it does work fine on GCC.
|
| https://godbolt.org/z/qKejzc1Kb
|
| Whereas clang loudly complains,
|
| https://godbolt.org/z/qWrccWzYW
| Y_Y wrote:
| Implementation-defined until C99, explicitly possible via UCNs
| since C99, possible with explicit encoding since C23, but
| literals are _still_ implementation defined.
| tialaramex wrote:
| Presumably this was converted from markdown or similar and the
| conversion partly failed or the input was broken.
|
| From the PVI section onward it seems to recover, but if the
| author sees this please fix and re-convert your post.
|
| [Edited, nope, there are more errors further in the text, this
| needed proper proofreading before it was posted, I can somewhat
| struggle through because I already know this topic but if this
| was intended to introduce newcomers it's probably very confusing]
| gustedt wrote:
| The problem is that wordpress changes these things once you
| edit in some part. I will probably regenerate the whole.
| lioeters wrote:
| Looks like a code block didn't get closed properly, before this
| phrase:
|
| > the functions `recip` and `recip+` and not equivalent
|
| Several paragraphs after this got swallowed by the code block.
|
| Edit: Oh, I didn't realize the article is by the author of the
| book, Modern C. I've seen it recommended in many places.
|
| > The C23 edition of Modern C is now available for free download
| from https://hal.inria.fr/hal-02383654
| johnisgood wrote:
| It is a great book. I prefer the second edition, not the latest
| one though with what I call "bloated C".
| laqq3 wrote:
| I'm wondering if you could elaborate? I'd be curious to hear
| more about "bloated C" and the differences between the 2nd
| and 3rd edition.
| shakabrah wrote:
| It made immediate sense to me it was Jen once I saw the code
| samples given
| zmodem wrote:
| > Looks like a code block didn't get closed properly
|
| This seems to have been fixed now.
| perching_aix wrote:
| I still see it, even after clearing caches, visiting from a
| separate browser from a separate computer (even a separate
| network).
| gavinray wrote:
| Also of interest to folks looking at this might be TySan, the
| recently-merged LLVM Type-Based Aliasing sanitizer:
|
| https://clang.llvm.org/docs/TypeSanitizer.html
|
| https://www.phoronix.com/news/LLVM-Merge-TySan-Type-Sanitize...
| aengelke wrote:
| It's probably worth noting that TySan currently only catches
| aliasing violations that LLVM would be able to exploit. For
| some types, e.g. unions, Clang doesn't emit accurate type-based
| aliasing information and therefore TySan won't catch these.
| flohofwoe wrote:
| Which is fine I think, considering that union type punning is
| legal in C (and even in C++ where union type punning is UB I
| have never seen it break - theoretically it might of course).
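A minimal sketch of the union type punning discussed above, next to the `memcpy` form that is also well defined in C. The function names are made up for illustration, and the sketch assumes that `float` is a 32-bit IEEE 754 value, which the C standard does not require.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "this sketch assumes a 32-bit float");

/* Reads the bytes of a float through a different union member.
   In C this reinterprets the stored bytes as the other type. */
static uint32_t float_bits_union(float f) {
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;
}

/* Same result by copying the object representation byte by byte. */
static uint32_t float_bits_memcpy(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

On an IEEE 754 platform both functions return `0x3f800000` for `1.0f`.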
| jvanderbot wrote:
| I love Rust, but I miss C. If C can be updated to make it
| generally socially acceptable for new projects, I'd happily go
| back for some decent subset of things I do. However, there's a
| lot of anxiety and even angst around using C in production code.
| mikewarot wrote:
| If you can stomach the occasional Begin and End, and a far less
| confusing pointer syntax, Pascal might be the language for you.
| Free Pascal has some great string handling, so you never have
| to worry about allocating and freeing them, and they can store
| gigabytes of text, even Unicode. ;-)
| tgv wrote:
| Or try Ada.
| jvanderbot wrote:
| If my fellow devs cringe at C, imagine their reaction to
| Pascal
| mikewarot wrote:
| C has all the things to hate in a programming language:
|
|   CaSe Sensitivity
|   Weird pointer syntax
|   Lack of a separate assignment token
|   Null terminated strings
|   Macros - the evil scourge of the universe
|
| On the plus side, it's installed everywhere, and it's not
| indent sensitive
| jvanderbot wrote:
| At this point, you're talking to someone who isn't here
| ioasuncvinvaer wrote:
| Except for null terminated strings these don't seem like
| major issues to me. Can you elaborate?
| 1718627440 wrote:
| > Lack of a separate assignment token
|
| What does that mean?
| kbolino wrote:
| Assignment is = which is too close to equality == and
| thus has been the source of bugs in the past, especially
| since C treats assignment as an expression and coerces
| lots of non-boolean values to true/false wherever a
| condition is expected (if, while, for). Most compilers
| warn about this at least nowadays.
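A small sketch of the bug class described above. Both function names are hypothetical. The doubled parentheses in the buggy version are only there because GCC and Clang treat them as "yes, I meant the assignment" and stay silent; with single parentheses they typically warn.

```c
/* Buggy: assigns 0 to x, then tests the assigned value.
   The condition is therefore always false, whatever was passed in. */
static int is_zero_buggy(int x) {
    if ((x = 0)) {
        return 1;
    }
    return 0;
}

/* Intended: compares x with 0. */
static int is_zero(int x) {
    if (x == 0) {
        return 1;
    }
    return 0;
}
```

`is_zero_buggy(0)` returns 0 while `is_zero(0)` returns 1, and the two differ by a single character.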
| tialaramex wrote:
| Even with warnings this is just terrible. People need to
| stop inventing languages where "False" is true, or an
| empty container is false or other insane "coercions" of
| this kind.
|
| True is true, and false is false, if you're wondering
| whether this Doodad is Wibbly, you should _ask that
| question_ not rely on a convention that Wibbly Doodads
| are somehow "truthy" while the non-Wibbly ones are not.
| zelphirkalt wrote:
| You mean "mere string replacement macros, instead of
| hygienic macros", of course : )
| flohofwoe wrote:
| > to make it generally socially acceptable for new projects...
|
| Or better yet, don't let 'social pressure' influence your
| choice of programming language ;)
|
| If your workplace has a clear rule to not use memory-unsafe
| languages for production code that's a different matter of
| course. But nothing can stop you from writing C code as a hobby
| - C99 and later is a very enjoyable and fun language.
| xxs wrote:
| I was about to reply that no amount of pressure can tell me how
| to program. C was totally fine for esp32
| TimorousBestie wrote:
| > Or better yet, don't let 'social pressure' influence your
| choice of programming language ;)
|
| It's hard. Programming is a social discipline, and the more
| people who work in a language, the more love it gets.
| spauldo wrote:
| If you're on UNIX or working in the embedded space, C is
| still everywhere and gets lots of love. C tends to get lots
| of libraries anyway because everything can FFI to it.
| Y_Y wrote:
| I don't want to summon WB, but honest-to-god, D is a good
| middle ground here.
| bnferguson wrote:
| Feels like Zig is starting to fill that role in some ways.
| Fewer sharp edges and a bit more safety than C, more modern
| approach, and even interops really well with C (even being
| possible to mix the two). Know a couple Rust devs that have
| said it seems to scratch that C itch while being more modern.
|
| Of course it's still really nice to just have C itself being
| updated into something that's nicer to work with and easier to
| write safely, but Zig seems to be a decent other option.
| pjmlp wrote:
| As usual, the remark that much of Zig's safety over C has
| been present since the late 1970's in languages like
| Modula-2, Object Pascal and Ada, but sadly they weren't born
| with curly brackets, nor did they bring a free OS to the uni party.
| dnautics wrote:
| (self-promotion) in principle one should be able to implement
| a fairly mature pointer provenance checker for zig, without
| changing the language. A basic proof of concept (don't use
| this, branches and loops have not been implemented yet):
|
| https://www.youtube.com/watch?v=ZY_Z-aGbYm8
| purplesyringa wrote:
| How close are Zig's safety guarantees to Rust's? Honest
| question; I don't follow Zig development. I can't take C
| seriously because it hasn't even bothered to define
| provenance until now, but as far as I'm aware, Zig doesn't
| even try to touch these topics.
|
| Does Zig document the precise mechanics of noalias? Does it
| provide a mechanism for controllably exposing or not exposing
| provenance of a pointer? Does it specify the provenance ABA
| problem in atomics on compare-exchange somehow or is that
| undefined? Are there any plans to make allocation
| optimizations sound? (This is still a problem even in Rust
| land; you can write a program that is guaranteed to exhibit
| OOM according to the language spec, but LLVM outputs code
| that doesn't OOM.) Does it at least have a sanitizer like
| Miri to make sure UB (e.g. data races, type confusion, or
| aliasing problems) is absent?
|
| If the answer to most of the above is "Zig doesn't care", why
| do people even consider it better than C?
| dnautics wrote:
| safety-wise, zig is better than C because if you don't do
| "easily flaggable things"[0] it doesn't have buffer
| overruns (including protection in the case of sentinel
| strings), or null pointer exceptions. Where this lies on
| the spectrum of "C to Rust" is a matter of judgement, but
| if I'm not mistaken it is easily a majority of memory-
| safety related CVEs. There's also no UB in debug, test, or
| release-safe. Note: you can opt-out of release-safe on a
| function-by-function basis. IIUC noalias is safety checked
| in debug, test, and release-safe.
|
| In a sibling comment, I mentioned a proof of concept I did
| that if I had the time to complete/do correctly, it should
| give you near-rust-level checking on memory safety, plus
| automatically flags sites where you need to inspect the
| code. At the point where you are using MIRI, you're already
| bringing extra stuff into rust, so in practice zig + zig-
| clr could be the equivalent of the result of "what if you
| moved borrow checking from rustc into miri"
|
| [0] type erasure, or using "known dangerous types, like c
| pointers, or non-slice multipointers".
| tialaramex wrote:
| This is very much a "Draw the rest of the fucking owl"
| approach to safety.
| dnautics wrote:
| what percentage of CVEs are null pointer problems or
| buffer overflows? That's what percentage of the owl has
| been drawn. If someone (or me) builds out a proper zig-
| clr, then we get to, what? 90%. Great. Probably good
| enough, that's not far off from where rust is.
| comex wrote:
| Probably >50% of exploits these days target use-after-
| frees, not buffer overflows. I don't have hard data
| though.
|
| As for null pointer problems, while they may result in
| CVEs, they're a pretty minor security concern since they
| generally only result in denial of service.
|
| Edit 2: Here's some data: In an analysis by Google, the
| "most frequently exploited" vulnerability types for zero-
| day exploitation were use-after-free, command injection,
| and XSS [3]. Since command injection and XSS are not
| memory-unsafety vulnerabilities, that implies that use-
| after-frees are significantly more frequently exploited
| than other types of memory unsafety.
|
| Edit: Zig previously had a GeneralPurposeAllocator that
| prevented use-after-frees of heap allocations by never
| reusing addresses. But apparently, four months ago [1],
| GeneralPurposeAllocator was renamed to DebugAllocator and
| a comment was added saying that the safety features
| "require the allocator to be quite slow and wasteful". No
| explicit reasoning was given for this change, but it
| seems to me like a concession that applications that need high
| performance generally shouldn't be using this type of
| allocator. In addition, it appears that use-after-free is
| not caught for stack allocations [2], or allocations from
| some other types of allocators.
|
| Note that almost the entire purpose of Rust's borrow
| checker is to prevent use-after-free. And the rest of its
| purpose is to prevent other issues that Zig also doesn't
| protect against: tagged-union type confusion and data
| races.
|
| [1] https://github.com/ziglang/zig/commit/cd99ab32294a3c2
| 2f09615...
|
| [2] https://github.com/ziglang/zig/issues/3180.
|
| [3] https://cloud.google.com/blog/topics/threat-
| intelligence/202...
| modeless wrote:
| Fil-C is a modified version of Clang that makes C and C++
| memory safe. It supports things you wouldn't expect to work
| like signal handling or setjmp/longjmp. It can compile real C
| projects like SQLite and OpenSSL with minimal to no changes,
| today. https://github.com/pizlonator/llvm-project-
| deluge/blob/delug...
| tialaramex wrote:
| Fil-C does seem like a quicker route if your existing idea
| was something like "rewrite it in Java" and it exists today
| whereas both C and C++ have only vague ambitions to deliver
| some future language which might meet your needs.
|
| I will be very surprised if there's widespread adoption of
| Fil-C for many new projects though.
| uecker wrote:
| Do you really love Rust, or do you feel pressured to say so?
| grg0 wrote:
| He grew up in a very stringent household. Everybody was
| writing Rust and he was like, "damn, I wish I could write C."
| briandw wrote:
| The code blocks are very difficult to read on this page. I had
| ChatGPT O3 rewrite this in a more accessible format.
| https://chatgpt.com/share/68629096-0624-8005-846f-7c0d655061...
| cenobyte wrote:
| So much better. Thank you!
| b0a04gl wrote:
| provenance model basically turns memory back into a typed value.
| finally malloc won't just be a dumb number generator, it'll act
| more like a capability issuer. and access is not 'is this address
| in range' anymore, but "does this pointer have valid provenance".
| way more deterministic, decouples gcc -wall
| HexDecOctBin wrote:
| Will this create more nasal demons? I always disable strict
| aliasing, and it's not clear to me after reading the whole
| article whether provenance is about making sane code illegal,
| or making previously illegal sane code legal.
| jcranmer wrote:
| All C compilers have some notion of pointer provenance
| embedded in them, and this is true going back decades.
|
| The problem is that the documented definitions of pointer
| provenance (which generally amount to "you must somehow have
| a data dependency from the original object definition (e.g.,
| malloc)") aren't really upheld by the optimizer, and the
| effective definition of the optimizer is generally internally
| inconsistent because people don't think about side effects of
| pointer-to-integer conversion. The one-past-the-end pointer
| being equal (but of different provenance) to a different
| object is a particularly vexatious case.
|
| The definition given in TS6010 is generally the closest
| you'll get to a formal description of the behavior that
| optimizers are already generally following, except for cases
| that are clearly agreed to be bugs. The biggest problem is
| that it makes pointer-to-int an operation with side effects
| that need to be preserved, and compilers today generally fail
| to preserve those side effects (especially when pointer-to-
| int conversion happens more as an implicit operation).
|
| The practical effect of provenance--that you can't magic a
| pointer to an object out of thin air--has always been true.
| This is largely trying to clarify what it means to actually
| magic a pointer out of thin air; it's not a perfect answer,
| but it's the best answer anyone's come up with to date.
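A minimal sketch of the two situations mentioned above: a pointer-to-integer round trip, and a one-past-the-end pointer whose address may coincide with a neighbouring object. The function names are made up, `uintptr_t` is an optional type that is assumed to exist, and the comments paraphrase the exposure idea rather than quote TS 6010.

```c
#include <stdint.h>

/* Round trip through an integer. Converting &y to an integer is the
   step that "exposes" y's storage; the conversion back can then pick
   up that provenance, so the store through py is a store to y. */
static int write_via_roundtrip(void) {
    int y = 2;
    uintptr_t iy = (uintptr_t)&y;
    int *py = (int *)iy;
    *py = 42;
    return y;
}

/* Forming &x + 1 is allowed, dereferencing it is not. Whether its
   address equals &y depends on how the compiler lays out the stack,
   so the result is 0 or 1. Even when the addresses coincide, the two
   pointers refer to different storage instances. */
static int one_past_coincides(void) {
    int x = 1, y = 2;
    return (uintptr_t)(&x + 1) == (uintptr_t)&y;
}
```

The comparison is done on the integer values on purpose, to keep the example away from the question of how pointer equality itself is evaluated.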
| layer8 wrote:
| This is basically a formalization of the general
| understanding one already had when reading the C standard
| thoroughly 25 years ago. At least I was nodding along
| throughout the article. It cleans up the parts where the
| standard was too imprecise and handwavy.
| Diggsey wrote:
| It's standardizing the contract between the programmer and
| the compiler.
|
| Previously a lot of C code was non-portable because it relied
| on behaviour that wasn't defined as part of the standard. If
| you compiled it with the wrong compiler or the wrong flags
| you might get miscompilations.
|
| The provenance memory model draws a line in the sand and says
| "all C code on this side of the line should behave in this
| well defined way". Any optimizations implemented by compiler
| authors which would miscompile code on that side of the line
| would need to be disabled.
|
| Assuming the authors of the model have done a good job, the
| impact on compiler optimizations should be minimized whilst
| making as much existing C code fall on the "right" side of
| the line as possible.
|
| For new C code it provides programmers a way to write useful
| code that is also portable, since we now have a line that we
| can all hopefully agree on.
| cenobyte wrote:
| Please fix the code in your post.
| eqvinox wrote:
| Using the "register" storage class feels _really_ alien for C
| code written in 2025...
| flohofwoe wrote:
| It has a slightly different meaning now, instead of hinting to
| the compiler that the variable should be placed in a register
| it now means that it is illegal to take the address of the
| variable (e.g. cannot create a pointer from it):
|
| https://www.godbolt.org/z/eEYf5c59f
|
| Might be useful in some situations although I currently can't
| think of any :)
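A short sketch of that restriction, with a hypothetical function name. The commented-out line is the one a conforming C compiler has to reject.

```c
/* Sums 1..n using objects declared with the register storage class. */
static int sum_to(int n) {
    register int total = 0;
    for (register int i = 1; i <= n; ++i) {
        total += i;
    }
    /* int *p = &total;   not allowed: the address of a register
                          object cannot be taken */
    return total;
}
```

Because no pointer to `total` can exist, nothing outside this function can modify it behind the compiler's back.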
| smcameron wrote:
| Ugh. Are unicode variable names allowed in C now? That's
| horrific.
| mananaysiempre wrote:
| "Now" as in since C99, twenty-five years ago, yes. (It seemed
| like a good idea at the time.)
| 90s_dev wrote:
| See also https://www.ethiocloud.com/bunnascript.aspx and
| https://en.wikipedia.org/wiki/Non-English-
| based_programming_...
| kevincox wrote:
| Being able to program in languages that don't fit into ASCII
| is a good idea. Using one-character variable names is a bad
| idea.
| adrianN wrote:
| Using variable names that are different but render (almost)
| the same can be a bad idea.
| RossBencina wrote:
| Mathematics is a language that doesn't fit into ASCII and
| commonly uses one-character variable names. If you are
| implementing a documented mathematical algorithm (i.e. one
| with a description in a paper or book) then sticking to the
| notation of the paper (i.e. using one character variable
| names) makes sense to me.
| mananaysiempre wrote:
| Unfortunately, many of the things of this nature that
| you'll want to implement use indices, which are
| inevitably going to start at 1. So you'll still have
| plenty of hours of unpleasant debugging ahead of you, and
| a non-obvious correspondence to the original paper at the
| end of it.
| kevincox wrote:
| I find math far easier to read when the authors use
| proper names for variables. But I understand that it
| isn't the idiomatic style and agree that it can be useful
| to match the paper when re-implementing an algorithm.
| 1over137 wrote:
| Horrific? You might not think so if your (human) language used
| a different alphabet.
| ajross wrote:
| Little to no source code is written for single (human)
| language development teams. Sure, everyone would like the
| ability to write source code in their native language. That's
| natural.
|
| Literally no one, anywhere, wants to be forced to _read_
| source written in a language they can't read (or more
| specifically in this case: written in glyphs they can't even
| produce on their keyboard). That idea, for almost everyone,
| seems "horrific", yeah.
|
| So a lingua franca is a firm requirement for modern software
| development outside of extremely specific environments (FSB
| malware authors probably don't care about anyone else reading
| their cyrillic variable names, etc...). Must it be ASCII-
| encoded English? No. But that's what the market has picked
| and most people seem happy enough with it.
| OkayPhysicist wrote:
| > Little to no source code is written for single (human)
| language development teams.
|
| This is blatantly false. I'd posit that a solid 90% of all
| source code written is done so by single, co-located teams
| (a substantial portion of which are teams of 1). That
| certainly fits the bill for most companies I've worked at.
| eqvinox wrote:
| Yes but also no. The thing about software is that 90% of it
| is not culturally bound. If you're writing, say, some tax
| reporting tool, a grammar reference, or something
| religious... sure, it makes sense to write that in your
| language. So, yeah, C should support that.
|
| However, everything else, from spreadsheet software to CAD
| tools to OS kernels to JavaScript frameworks is universal
| across cultures and languages. And for better or for worse
| (I'm not a native English speaker either), the world has gone
| with English for a lot of code commons.
|
| And the thing with the examples in that post isn't about
| supporting language diversity, it's math symbols which are
| no one's native language. _And you pretty much can't type
| them on any keyboard._ Which really makes it a rather poor
| flex IMHO. Did the author reconfigure their keyboard layout
| for that specific math use case? It can't generically cover
| "all of math" either. Or did they copy&paste it around?
| That's just silly.
|
| [...could some of the downvoters explain why they're
| downvoting?]
| OkayPhysicist wrote:
| When I was doing a lot of Physics simulation in Julia, I
| had a Vim extension which would just allow me to type
| something like \gamma, hit tab, and get γ. This was worth
| the (minimal) hassle, because it made it very easy to spot
| check formulas. When you're shuffling data around in a
| loosely-described space like most of web dev, descriptive
| function and variable names are important because the
| description of what you're doing and what you're doing it
| to is the important information, and the actual operations
| you're taking are typically approximately trivial.
|
| In heavily mathematical contexts, most of those assumptions
| get turned on their head. Anybody qualified to be modifying
| a model of electromagnetism is going to be intimately
| familiar with the language of the formulas: mu for
| permeability, epsilon for permittivity, etc. With that
| shared context,
|
| 1/(4*π*ε)*(q_electron * q_proton)/r^2 is going to be a lot
| easier to see, at a glance, as Coulomb's law
|
| compared to
|
| 1 / (4 * Math.Pi *
| permittivity_of_free_space)*(charge_electron *
| charge_proton)/distance_of_separation^2
|
| Source code, like any other language built for humans, is
| meant to be read by humans. If those humans have a shared
| context, utilizing that shared context improves the quality
| and ease of that communication.
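For comparison, a sketch of the same formula in plain C using short ASCII names that follow the physics notation. The function name and the constant names are assumptions of this example; the value used for the vacuum permittivity is the commonly tabulated one in F/m.

```c
/* Coulomb force in newtons between charges q1 and q2 (coulombs)
   separated by r (metres). Negative means attractive. */
static double coulomb_force(double q1, double q2, double r) {
    const double pi = 3.14159265358979323846;
    const double e0 = 8.8541878128e-12; /* vacuum permittivity, F/m */
    return 1.0 / (4.0 * pi * e0) * (q1 * q2) / (r * r);
}
```

For an electron and a proton one metre apart this gives an attractive force of roughly 2.3e-28 N.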
| eqvinox wrote:
| Hrm. Fair point. But will the other humans, even if they
| have the shared context, also have the ability to type in
| these symbols, if they want to edit the code? They
| probably don't have your vim extension...
|
| I guess maybe this is an argument for better UI/UX for
| symbolic input...
| Joker_vD wrote:
| My language uses Cyrillic and I personally prefer English-
| based keywords and variable names precisely because they are
| _not_ words of my (human) language. It introduces an easy and
| obvious distinction between the machine-oriented and the
| human-oriented.
| ZoomZoomZoom wrote:
| I know what you mean and I shudder when I see code that
| uses words from my native lang, but most code _is_ human-
| oriented.
| OkayPhysicist wrote:
| Why shouldn't they be? It's not the 00's anymore, Unicode
| support is universal. You'd have to dust off some truly ancient
| tech to find something incapable of rendering it.
|
| Source code is for humans, and thus should be written in
| whatever way makes it easiest to read, write, and understand
| for humans. If your language doesn't map onto ASCII, then
| Unicode support improves that goal. If your code is meant to
| directly implement some physics formula, then using the
| appropriate unicode characters might make it easier to read
| (and thus spot transcription errors, something I find _far_ too
| often in physics simulations).
| wheybags wrote:
| Hot take, but I've always felt the world would be better
| served if mathematicians and physicists would stop using
| terrible short variable names and use
| longCamelCaseDescriptiveNames like the rest of us, because
| paper is cheap, and abbreviations are confusing. I know it's
| nicer when you're writing by hand, but when you clean up a
| proof or formula for publishing, would it really be so hard
| to switch to descriptive names?
|
| I'm a practitioner of neither though, so I can't condemn the
| practice wholeheartedly as an outsider, but it does make me
| groan.
| senbrow wrote:
| Long names are good for short expressions, but they
| obfuscate complex ones because the identifiers visually
| crowd out the operators.
|
| This can be especially difficult if the author is trying to
| map 1:1 to a complex algorithm in a white paper that uses
| domain-standard mathematical notation.
|
| The alternative is to break the "full formula" into simpler
| expression chunks, but then naming those partial expression
| results descriptively can be even more challenging.
| nsingh2 wrote:
| Better served to students and those unfamiliar with the
| field, but noisy to those familiar. Considering that much
| of mathematical work is done using pen/paper, it would be a
| total pain to write out huge variable names every time.
|
| Consider a simple programming example, in C blocks are
| delimited by `{}`, why not use `block_begin` and
| `block_end`? Because it's noisy, and it doesn't take much
| to internalize the meaning of braces.
| someplaceguy wrote:
| > using the appropriate unicode characters might make it
| easier to read
|
| It's probably also a great way to introduce almost
| undetectable security vulnerabilities by using Unicode
| characters that look similar to each other but in fact are
| different.
| OkayPhysicist wrote:
| This would cause your compilation to fail, unless you were
| deliberately declaring and using near identical symbols.
| Which would violate the whole "Code is meant to be easily
| read by humans" thing.
| someplaceguy wrote:
| > unless you were deliberately declaring and using near
| identical symbols.
|
| Yes, that would probably be one way to do it.
|
| > Which would violate the whole "Code is meant to be
| easily read by humans" thing.
|
| I'd think someone who's deliberately and sneakily
| introducing a security vulnerability would want it to be
| undetectable, rather than easily readable.
| bigstrat2003 wrote:
| They shouldn't be precisely _because_ it makes the code
| harder to read and write when you include non-ASCII
| characters.
| loeg wrote:
| Math people shouldn't be allowed to write code. It's not the
| unicode, so much as the extremely terse variable names.
| perching_aix wrote:
| Isn't that basically all C/C++ code? Admittedly I don't have
| much exposure to it, but it's pretty much a trope in and of
| itself, along with Java and C# suffering from the opposite
| problem.
|
| Such a silly issue too, you'd think we'd have come up with
| some automated wrangling for this, so that those experienced
| with a codebase can switch over and see super short versions
| of identifiers, while people new to it all will see the long
| stuff.
| flohofwoe wrote:
| > Isn't that basically all C/C++ code?
|
| Maybe for code that was written in the early 90's, but the
| only 'tradition' that has survived is calling the vanilla
| loop variable 'i'.
| SV_BubbleTime wrote:
| > void recip(double* a[?], double* r[?])
| > {
| >   for (;;)
| >   {
| >     register double P = (*a[?])*(*r[?]);
|
| My first thought before I saw this was _"I wonder is this
| going to be an article from people who build things or
| something from "academics" that don't."_
|
| At least it was answered quickly.
| Joker_vD wrote:
| > Here the term "same representation and alignment" covers for
| example the possibility to look at [...] one would be a structure
| and the other would be another structure that sits at the
| beginning of the first.
|
| Does it? It is quite simple for a struct A that has struct B as
| its first member to have radically different alignment:
|     struct B { char x; };
|     struct A { struct B b; long long y; };
|
| Also, accidentally coinciding pointers are nothing "rare" because
| all objects are allowed to be treated as 1-element arrays: so any
| pointer to an e.g. struct field is also a pointer one-past the
| previous field of this struct; also, malloc() allocations easily
| may produce "touching" objects. So thanks for allowing
| implementations to _not_ have padding between almost every two
| objects, I guess.
| layer8 wrote:
| This is about the representation and alignment of the pointer
| object, not about the object being pointed to. And C requires
| struct pointer types to all have the same representation and
| alignment. This is generally necessary due to the possibility
| of having pointers to opaque struct declarations in a
| translation unit.
|
| Regarding your second point, if I understand the model
| correctly, there is only an ambiguity in pointer provenance if
| the adjacent objects are independent "storage instances", i.e.
| separately malloc'ed objects or separate variables on the stack
| -- not between fields of the same struct.
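A sketch that separates the two notions of alignment in this exchange: the alignment of the structs themselves, which may differ, and the representation and alignment of pointers to them, which C requires to match for all struct pointer types. The helper function names are made up for illustration.

```c
#include <stddef.h>

struct B { char x; };
struct A { struct B b; long long y; };

/* Pointers to any two struct types share size and alignment. */
_Static_assert(sizeof(struct A *) == sizeof(struct B *),
               "struct pointers have the same size");
_Static_assert(_Alignof(struct A *) == _Alignof(struct B *),
               "struct pointers have the same alignment");

/* The pointed-to types are a different matter: struct A is usually
   more strictly aligned than struct B because of its long long member. */
static size_t align_of_A(void) { return _Alignof(struct A); }
static size_t align_of_B(void) { return _Alignof(struct B); }

/* A pointer to a struct, suitably converted, points to its first member. */
static int first_member_matches(void) {
    struct A a = { { 'x' }, 0 };
    return (void *)&a == (void *)&a.b;
}
```

On typical 64-bit platforms `align_of_A()` is 8 and `align_of_B()` is 1, but only the ordering between them is relied on here.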
| dsp_person wrote:
| if ((P- &lt; P) &amp;&amp; (P &lt; P+)) {
|
| I spent way too long trying to figure this out as C code instead
| of if ((P- < P) && (P < P+)) {
| gustedt wrote:
| Randomly introduced translation errors from markdown to
| wordpress-internal should be fixed, now. Sorry for the
| inconvenience!
| nikic wrote:
| At least at a skim, what this specifies for exposure/synthesis
| for reads/writes of the object representation is concerning. One
| of the consequences is that dead integer loads cannot be
| eliminated, as they may have an exposure side effect. I guess C
| _might_ be able to get away with it due to the interaction with
| strict aliasing rules. Still quite surprised that they are going
| against consensus here (which also reduces the likelihood that
| these semantics will get adopted by implementers).
| uecker wrote:
| (Never mind, I misread your comment at first.) Yes, the
| representation access needs to be discussed... I took a couple
| of years to publish this document. More important would be if
| the ptr2int exposure could be implemented.
| comex wrote:
| > I guess C might be able to get away with it due to the
| interaction with strict aliasing rules.
|
| But not for char-typed accesses. And even for larger types, I
| think you would have to worry about the combo of first
| memcpying from pointer-typed memory to integer-typed memory,
| then loading the integer. If you eliminate dead integer loads,
| then you would have to not eliminate the memcpy.
| hinkley wrote:
| > Unfortunately no C compiler can do this optimization
| automatically:
|
| > The functions recip and recip+ and not equivalent.
|
| This is one of those examples of how optimizing code can improve
| legibility, robustness, or both.
|
| The first implementation allows for side effects to change the
| outcome of the function. But the problem is that the code is not
| written expecting someone to modify the values in the middle of
| the loop. It's incorrect behavior, and you're paying a
| performance penalty for it to boot.
|
| Functional Core code tends not to have this problem, in that we
| pass in a snapshot of data and it either gets an answer or an
| error.
|
| I've seen too much code that checks 3 times if a user is either
| still logged in or has permission to do a task, and not one of
| them was set up to deal with one answer for the first call and a
| different one for any of the subsequent ones. They just go into
| undefined behavior.
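A sketch of the difference between re-reading through a pointer on every iteration and working on a snapshot. This is not the article's `recip` function; `scale_reload` and `scale_snapshot` are hypothetical names chosen to show the same effect on a smaller scale.

```c
#include <stddef.h>

/* Re-reads *factor on every iteration. If factor points into out,
   an earlier store changes the factor used by later iterations. */
static void scale_reload(double *out, const double *factor, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] *= *factor;
    }
}

/* Takes a snapshot first. The loop no longer depends on what the
   stores to out do, and the compiler can keep f in a register. */
static void scale_snapshot(double *out, const double *factor, size_t n) {
    const double f = *factor;
    for (size_t i = 0; i < n; ++i) {
        out[i] *= f;
    }
}
```

With an array `{2, 2, 2}` and `factor` pointing at its first element, the first version produces `{4, 8, 8}` and the second `{4, 4, 4}`. The two functions only agree when the pointers do not alias, which is why a compiler cannot substitute one for the other on its own.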
| jaisio wrote:
| The root cause of all this is that C programs are not much more
| than glorified assembly programs. Any effort to retrofit higher
| level reasoning will always be defeated by somebody doing some dirty
| pointer tricks. This can only be solved by more abstract ways to
| express programs which necessarily restricts the bare metal dirty
| things one can do. But what you gain is that the compiler will
| easily be able to do lots of things which a C compiler can't do
| or only with a lot of headache. The kind of stuff this article is
| about is really trying to solve the wrong problem IMO.
| RossBencina wrote:
| After reading the fine article I'm left wondering what if you
| implement your own heterogeneous allocation scheme on top of
| malloc? (e.g. TLSF) In this case all of your objects will belong
| to the same malloced storage region, and you will compute object
| offsets using raw pointers, but I'd expect provenance to
| potentially treat each returned object as if it were
| allocated from a separate disjoint storage.
|
| I guess my question is: does this provenance model allow for
| recursive nesting of allocators with a separate notion of
| "storage" at each level?
| f33d5173 wrote:
| The compiler knows about malloc, and hence knows that the
| pointer returned by malloc won't alias any other pointer. Your
| compiler might support some attribute to mark a function as
| behaving like malloc in this respect. Otherwise the compiler
| will be forced to assume the return value could alias any other
| pointer.
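A rough sketch of the setup under discussion: a bump allocator that carves objects out of one `malloc`ed region, annotated with the `malloc` function attribute that GCC and Clang document. The `arena` type, its functions, and the `ARENA_MALLOC_LIKE` macro are all hypothetical names. Whether the annotation is justified for a particular allocator is a promise the programmer makes to the compiler, and this sketch says nothing about how TS 6010 treats such nested storage.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumption: the attribute is only applied where the compiler is
   known to understand it; elsewhere the macro expands to nothing. */
#if defined(__GNUC__)
#define ARENA_MALLOC_LIKE __attribute__((malloc))
#else
#define ARENA_MALLOC_LIKE
#endif

typedef struct {
    unsigned char *base; /* start of the single malloc'ed region */
    size_t cap;          /* size of the region in bytes */
    size_t used;         /* bytes handed out so far */
} arena;

/* Returns 1 on success, 0 if the backing allocation failed. */
static int arena_init(arena *a, size_t cap) {
    a->base = malloc(cap);
    a->cap = cap;
    a->used = 0;
    return a->base != NULL;
}

/* Hands out size bytes at an offset aligned for any object type.
   Memory is never reused, so returned blocks never overlap. */
ARENA_MALLOC_LIKE static void *arena_alloc(arena *a, size_t size) {
    const size_t align = _Alignof(max_align_t);
    size_t start = (a->used + align - 1) / align * align;
    if (start > a->cap || size > a->cap - start) {
        return NULL; /* out of space */
    }
    a->used = start + size;
    return a->base + start;
}
```

Every pointer returned by `arena_alloc` is computed from `a->base`, so all of them derive from the same original `malloc` call, which is the situation the question above is about.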
___________________________________________________________________
(page generated 2025-06-30 23:00 UTC)