[HN Gopher] C2Rust Transpiler
       ___________________________________________________________________
        
       C2Rust Transpiler
        
       Author : Aissen
       Score  : 104 points
       Date   : 2022-10-20 12:54 UTC (3 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | gwbas1c wrote:
       | Really cool concept! (Rust is my favorite language for tinkering;
       | I haven't touched C since I was in school.)
       | 
       | What would really help is success stories: Who's used it? What
       | have they used it for? What challenges did they encounter? Then
       | again, maybe this is so new that there aren't a lot of success
       | stories yet. :)
        
         | mastax wrote:
         | From what I remember it was primarily funded by a DoD project
         | of some sort. There probably isn't a lot of info about that.
         | There have been a few conference talks or blogposts about it
         | (I'm trying to remember) that talked about the process of using
         | it.
        
           | mshockwave wrote:
           | It's actually funded by DARPA so basically everything in this
           | project is publicly available.
        
         | dtolnay wrote:
         | I used it with great success for transpiling libyaml from C to
         | Rust. I even set up Miri to run the upstream library's entire
         | transpiled test suite, and the fact it passes is validation of
         | absence of UB in the original C code.
         | 
         | The transpiled library now serves as the YAML backend for the
         | widely used serde_yaml crate. Having serde_yaml be pure-Rust
         | code instead of linking C is advantageous for painless cross-
         | compilation as well as making downstream projects runnable in
         | Miri.
         | 
         | https://github.com/dtolnay/unsafe-libyaml
        
           | tialaramex wrote:
           | A few questions, if I may:
           | 
           | Is the intent that this continues to evolve by newer libyaml
           | code getting transpiled, or that it's effectively a fork and
           | might gradually become more idiomatic Rust which does the
           | same job as the C but won't track any changes? Or is this
           | basically "done" and only small changes (to both this code
           | and the C libyaml) are anticipated anyway?
           | 
           | The c2rust documentation cautions that any platform
           | independence isn't preserved, so if libyaml has types which
           | are different on platform A versus platform B, the c2rust
           | transpilation on platform A just gives you Rust types for
           | platform A, losing that independence. Was this an issue for
           | libyaml?
        
           | pitaj wrote:
           | Very cool. Miri is seriously awesome.
        
       | dahfizz wrote:
       | I would be interested to see performance numbers of the C version
       | and the transpiled Rust version of some program.
        
       | fbdab103 wrote:
       | A potential use case I see is for security auditing. Even if you
       | cannot port an existing C codebase to Rust, you could run this
       | tool to examine the unsafe hotspots. Any place where the
       | translation has to rely upon unsafe is a region of the code more
       | likely to contain any of the mistakes Rust is designed to
       | prevent. Of course, this pre-supposes that 90% of the translation
       | does not have to lean on unsafe annotation.
        
         | electroly wrote:
         | This only produces unsafe code. Every translated function has
         | the unsafe keyword. It's up to the programmer to clean it up
         | afterwards.
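
A minimal sketch of the cleanup step described above. This is hand-written in the style of a literal translation, not actual c2rust output; the C function and the names `sum_raw` and `sum_slice` are assumptions made for illustration.

```rust
// Assumed C original, for illustration only:
//   int sum(const int *xs, size_t n) {
//       int s = 0;
//       for (size_t i = 0; i < n; i++) s += xs[i];
//       return s;
//   }

/// Literal-style translation: raw pointer plus separate length, hence `unsafe fn`.
///
/// Safety: `xs` must point to at least `n` readable `i32` values.
pub unsafe fn sum_raw(xs: *const i32, n: usize) -> i32 {
    let mut s: i32 = 0;
    let mut i: usize = 0;
    while i < n {
        // wrapping_add is a choice made for this sketch; signed overflow is UB in C.
        s = s.wrapping_add(unsafe { *xs.add(i) });
        i += 1;
    }
    s
}

/// Manual cleanup: the slice carries its own length, so no `unsafe` is needed.
pub fn sum_slice(xs: &[i32]) -> i32 {
    xs.iter().fold(0i32, |acc, &x| acc.wrapping_add(x))
}

fn main() {
    let data = [1, 2, 3, 4];
    // The caller of the raw version has to vouch for the pointer and length.
    let raw = unsafe { sum_raw(data.as_ptr(), data.len()) };
    assert_eq!(raw, sum_slice(&data));
    println!("sum = {raw}");
}
```

The two functions compute the same result; the difference is who is responsible for the length being right.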
        
         | kazinator wrote:
         | I suspect that this relies on unsafe pretty much everywhere.
         | Even handling argc and argv in your C main function, in
         | idiomatic ways, is unsafe.
         | 
         | There is no 80/20 rule for C unsafety, other than maybe an
         | inverted one: 80% of the unsafety of a large C program might be
         | spread into 80% or more of the code. :)
        
           | a1369209993 wrote:
           | > There is no 80/20 rule for C unsafety,
           | 
           | Actually, there is; the problem is that the transpiler can't
           | tell the difference between code that _relies_ on unsafety
           | for its semantics, versus code that would still work if
           | appropriate annotations (potentially causing functions to not
           | be callable in intended contexts), run-time checks (possibly
           | causing code to error out on what were intended to be valid
           | inputs), etc, were added to make it safe.
           | 
           | 80% of the code contains 20% of the cases where safety would
           | require deviating from the _intended_ semantics, not just the
           | incidental ones. (A general-purpose transpiler can't (in
           | general) tell the difference between intended semantics and
           | incidental ones, so it has to conservatively assume all
           | semantics are intended, and write everything as the most-
           | general (ie most-unsafe) interpretation.)
        
             | kazinator wrote:
             | Any C code that performs a calculation which would silently
             | be wrong or crash if the values were not correct (even
             | though they are) is inherently unsafe.
        
               | turminal wrote:
               | That's a weird way to put it. If your function assumes
               | some constraints on the input it gets and you give it
               | data that violates its constraints, it's going to fail in
               | some way. Sure, C makes it worse by making it harder to
               | verify the assumptions and constraints, but by your
               | definition every function that operates on sorted arrays
               | and doesn't verify the input is sorted is inherently
               | unsafe, regardless of the language.
        
           | counttheforks wrote:
           | > Even handling argc and argv in your C main function, in
           | idiomatic ways, is unsafe.
           | 
           | Not knowing C very well, could you clarify what makes it
           | unsafe? Thanks!
        
             | insanitybit wrote:
             | You'd have to access the stack frame above `main` and then
             | treat some of the bytes within that frame as your env. This
             | means forging pointers/bounds based on the inputs. `execve`
             | basically sets that up for you but Rust doesn't know about
             | that.
             | 
             | Then if you wanted to handle dynamically set environment
             | variables you'll need to call into your libc
             | implementation, which crosses an ffi boundary, which means
             | rust doesn't know about what that code is doing and
             | therefore it requires `unsafe`.
             | 
             | edit: Question for others - is main a separate stackframe?
             | I actually don't recall.
        
             | hn92726819 wrote:
             | "safe" and "unsafe" in rust are well defined, but in my
             | opinion it's very confusing and I wish they used different
             | terms.
             | 
             | In rust, unsafe means accessing memory that's already freed
             | or unallocated or things like that. You can look up the
             | full definition.
             | 
             | I think the comment you replied to mistakenly used the term
             | "unsafe" (that's part of the reason I dislike the term; it
             | can mean multiple things). In rust context though, it isn't
             | unsafe to index an array that's out of bounds. I.e. if
             | argc=10 and you call argv[99], that will crash your program
             | but isn't considered "unsafe".
        
             | initplus wrote:
             | C arrays are just sugar for pointer arithmetic. [] just
             | calculates the sum and dereferences the result.
             | 
             | arr[n] == *(arr + n) == n[arr]
             | 
             | All these forms are valid C and gcc will happily compile
             | them all without complaining.
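
For comparison, a small sketch of the same idea in Rust, where the pointer-arithmetic form has to be spelled out inside `unsafe` and the bounds-checked form does not. The helper names `deref_offset` and `checked_index` are made up for this example.

```rust
/// Rough equivalent of C's `*(arr + n)`: no bounds check at all.
///
/// Safety: `arr` must point to at least `n + 1` readable `i32` values.
unsafe fn deref_offset(arr: *const i32, n: usize) -> i32 {
    unsafe { *arr.add(n) }
}

/// Bounds-checked lookup: an out-of-range index yields `None`
/// instead of reading whatever memory happens to be there.
fn checked_index(arr: &[i32], n: usize) -> Option<i32> {
    arr.get(n).copied()
}

fn main() {
    let arr = [10, 20, 30];
    // In range, both forms agree.
    let via_pointer = unsafe { deref_offset(arr.as_ptr(), 2) };
    assert_eq!(via_pointer, arr[2]);
    assert_eq!(checked_index(&arr, 2), Some(30));
    // Out of range, only the checked form has a defined answer.
    assert_eq!(checked_index(&arr, 99), None);
}
```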
        
             | pjmlp wrote:
             | Paraphrasing a common meme, how much time do you have?
             | 
             | Just scratching the surface, we have:
             | 
             | - The language doesn't really have vector and strings as
             | data types, they are pointers to memory sections without
             | any kind of protection
             | 
             | - All functions in the standard library deemed safe, added
             | as mitigations for possible memory corruption, have gotchas
             | in their use; there isn't a single one that is safe,
             | especially because all of them expect the developer to
             | never get the buffer size parameters wrong.
             | 
             | - Enumerations are not type safe, decay implicitly to
             | integers when used in numeric context, and all numeric
             | values can be converted into an enumeration, even if there
             | isn't a mapping available
             | 
             | - Implicit numeric conversions everywhere, and since there
             | is no overflow/underflow checking, every single numeric
             | operation can wrap around, or be the source for clever
             | compiler optimizations
             | 
             | - ISO C documents at least around 200 cases of UB, where
             | the compiler can take the liberty to optimize the code as
             | it pleases
             | 
             | - Type casts that convert complex data types into others
             | can be a source of surprises when moving across compilers
             | and platforms
             | 
             | - Speaking of which, even if you restrict yourself to ISO
             | C, without any compiler specific extensions, there are
             | behaviours that are implementation defined, which can vary
             | across compilers and platforms.
             | 
             | - Variables defined as const, aren't really constant and
             | one can subvert their value
             | 
             | - There is no null checking, so whatever happens depends on
             | the platform.
             | 
             | This is just a short overview, open the man page for GCC or
             | clang and go through the list of all warnings that you can
             | enable to try to write safer code, especially all that are
             | enabled via -Wall and -Wextra.
             | 
             | All the above flaws are also present in Objective-C,
             | Objective-C++ and C++, due to their copy-paste
             | compatibility with C (yes C++ isn't 100% compatible).
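
As a contrast to the enumeration point in the list above, here is a hedged sketch of how an integer-to-enum conversion is usually made explicit in Rust. The `Color` type and its values are invented for the example.

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq, Clone, Copy)]
enum Color {
    Red = 0,
    Green = 1,
    Blue = 2,
}

impl TryFrom<i32> for Color {
    // The rejected integer is handed back as the error value.
    type Error = i32;

    fn try_from(v: i32) -> Result<Self, Self::Error> {
        match v {
            0 => Ok(Color::Red),
            1 => Ok(Color::Green),
            2 => Ok(Color::Blue),
            other => Err(other),
        }
    }
}

fn main() {
    // Enum to integer is an explicit cast.
    assert_eq!(Color::Blue as i32, 2);
    // Integer to enum must go through a checked conversion.
    assert_eq!(Color::try_from(1), Ok(Color::Green));
    // A value with no mapping is rejected rather than silently accepted.
    assert_eq!(Color::try_from(7), Err(7));
}
```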
        
               | int_19h wrote:
               | IIRC, in C++ at least, mutating an object that is
               | originally const (whether it's a variable declared as
               | such, or a heap object created with "new const ..."), is
               | UB regardless of how you do it - pointer casts etc.
        
             | kazinator wrote:
             | For instance it means that an expression like argv[i], even
             | though correct, could be wrong in a way that won't be
             | diagnosed. Code is "unsafe" to the extent that its
             | predictable behavior depends only on the programmer.
        
             | tinco wrote:
             | The word unsafe has a specific meaning in Rust. It doesn't
             | mean every C program that uses argc and argv is unsafe. In
             | this specific case however I don't think it would actually
             | require much unsafe. The only unsafe thing I'd introduce is
             | a way of casting the *argv[] to a type that safely deals
             | with null terminated strings. Maybe such a type is already
             | in Rust's standard library and I wouldn't even need that.
             | 
             | edit: eh sorry I wasn't thinking straight, you of course
             | need unsafe to cast the argv itself to a type that carries a
             | separate argc as well. That assumes such a type is
             | available; if it's not, its implementation would also have
             | unsafe all over the place.
             | 
             | Maybe to answer the underlying question. What makes it
             | unsafe is that in C it is assumed the programmer knows to
             | keep all indexes into argv under argc. In Rust such an
             | assumption must be made explicit by specifying "unsafe". It
             | is idiomatic Rust to have all instances of "unsafe" in
             | libraries whose implementation is vetted by the community,
             | so ideally there are little to no instances of "unsafe" in
             | the application logic itself. Rust's compiler and type
             | system have various tricks that reduce the amount of
             | "unsafe" you would think you'd need for even quite complex
             | problems.
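
A short sketch of what that looks like from application code: `std::env::args` is the standard library's safe wrapper over the process arguments, so asking for an argument that was never passed gives `None`. The helper `nth_arg` is a made-up name, written over any iterator so it can be exercised without real command-line input.

```rust
use std::env;

/// Return the n-th argument if it exists; never reads out of bounds.
fn nth_arg<I: Iterator<Item = String>>(mut args: I, n: usize) -> Option<String> {
    args.nth(n)
}

fn main() {
    // Note: env::args() panics if an argument is not valid Unicode;
    // env::args_os() is the alternative when that matters.
    match nth_arg(env::args(), 99) {
        Some(arg) => println!("argument 99 is {arg}"),
        None => println!("fewer than 100 arguments were passed"),
    }
}
```

Any `unsafe` needed to read the platform's argument block lives inside the standard library rather than in this code.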
        
           | moomin wrote:
           | You're right, but they're hoping to improve upon that.
        
         | WalterBright wrote:
         | Since 90% (a wild guesstimate) of C code is pointers, I suspect
         | this is hopeless.
         | 
         | I've translated a lot of C code to D, and manually converting
         | `*` to `ref` (D's safe pointers), and converting to slices,
         | cleans up most of the C code nicely and you get buffer overflow
         | checks for free.
        
       | Animats wrote:
       | How does this compare to Corrode? The trouble with these things
       | is that the Rust that comes out is usually too awful to maintain.
       | Corrode, too, said that someday they'd generate more reasonable
       | Rust. But that never happened. Converting C into Rust with unsafe
       | raw pointers is not all that useful.
       | 
       | What's needed is some way to provide key information C doesn't
       | have. Mostly about array sizes. Some way to annotate
       | 
       |     int read(int fd, void* buf, size_t len)
       | 
       | to tell the system that buf has size len.
       | 
       | A file of translation hints with such info could guide the
       | translator into producing decent Rust. Most of the things done
       | with pointer arithmetic can be expressed with slices. (Things
       | being done with pointer arithmetic which can't be expressed as
       | slices should be viewed with deep suspicion.) But you need size
       | info to do that.
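
To illustrate the kind of result such a hint could enable, here is a sketch in Rust. `read_raw` is a hypothetical stand-in for a C-style `read(fd, buf, len)` (it copies from an in-memory source instead of a file descriptor so the example is self-contained), and `read_into` is the slice-based wrapper that the "buf has size len" hint would justify.

```rust
use std::ptr;

/// Hypothetical stand-in for a C-style read: pointer and length travel separately.
///
/// Safety: `buf` must be valid for writes of `len` bytes.
unsafe fn read_raw(src: &[u8], buf: *mut u8, len: usize) -> isize {
    let n = src.len().min(len);
    unsafe { ptr::copy_nonoverlapping(src.as_ptr(), buf, n) };
    n as isize
}

/// Slice-based wrapper: the buffer's size comes from the slice itself,
/// so callers cannot pass a mismatched length.
fn read_into(src: &[u8], buf: &mut [u8]) -> usize {
    let n = unsafe { read_raw(src, buf.as_mut_ptr(), buf.len()) };
    n as usize
}

fn main() {
    let mut buf = [0u8; 4];
    let n = read_into(b"hello", &mut buf);
    assert_eq!(n, 4);
    assert_eq!(&buf, b"hell");
}
```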
        
         | andolanra wrote:
         | The short form is that Corrode is effectively deprecated in
         | favor of c2rust. Indeed, Corrode hasn't been updated since
         | 2017, while c2rust still gets active development--last commit
         | as of my writing this was 2 days ago.
         | 
         | It's worth noting that the developer of Corrode was consulted
         | on the early design of c2rust, which means c2rust was able to
         | benefit from hindsight on architectural decisions in Corrode.
         | That ended up leading to a bit of a messy history between the
         | two (cf.
         | https://jamey.thesharps.us/2018/06/30/c2rust-vs-corrode/ with
         | HN discussion https://news.ycombinator.com/item?id=17436371
         | --although I believe that after that blog post the c2rust
         | developers did end up acknowledging their inspiration and
         | apologized for not doing so earlier.)
        
         | masklinn wrote:
         | The goal of C2rust is not to produce good maintainable Rust.
         | 
         | It's to produce buildable rust which exactly matches the
         | original code, which you can then migrate to _proper_ rust.
         | 
         | So your query is really in the "not even wrong" category.
        
           | Animats wrote:
           | _which you can then migrate to proper rust._
           | 
           | Which means you have to manually work on that awful code that
           | comes out. In the chart at [1], this step is represented by a
           | magic wand.
           | 
           | (I wanted to give some examples, but https://c2rust.com/
           | seems to not be translating today.)
           | 
           | [1] https://c2rust.com/manual/
        
         | Arnavion wrote:
         | Yes, Windows has been doing that with SAL annotations for
         | years.
        
         | int_19h wrote:
         | > What's needed is some way to provide key information C
         | doesn't have. Mostly about array sizes.
         | 
         | You also want to know which way the data flows (i.e. is buf
         | read from, written to, or both). And then you end up with
         | something like this:
         | 
         | https://learn.microsoft.com/en-us/cpp/code-quality/understan...
        
         | WalterBright wrote:
         | I proposed an extension to C which adds slices:
         | 
         | https://www.digitalmars.com/articles/C-biggest-mistake.html
        
           | Animats wrote:
           | Me too.[1]
           | 
           | But it would have meant years of work on language politics.
           | 
           | It might be worth looking at this sort of thing again,
           | because machine learning is far enough along that recognizing
           | and converting the usual array idioms is feasible. If the
           | output code with array bounds is run time checked, then
           | errors in translation will result in detected array bounds
           | errors.
           | 
           | [1] http://animats.com/papers/languages/safearraysforc43.pdf
        
             | [deleted]
        
       | SaddledBounding wrote:
       | I think a demo of the transpiler output for a short function
       | would make a great addition to the readme.
        
         | [deleted]
        
       | kazinator wrote:
       | Sampling some directories and files in the test suite of this
       | project, I see a problem: testing is done by translating C to
       | rust and compiling it, and then testing the run-time behavior of
       | the result. I don't see test cases which cover the behavior of
       | the translator directly: like that a certain C language input
       | maps to a certain Rust output.
        
         | maxbond wrote:
         | I'd argue that these tests are more robust to changes in Rust
         | and changes in C2Rust that change the output in trivial ways. I
         | don't see how you could maintain the sort of test suite you're
         | describing in a project like this. If you made a change that
         | changed the Rust output, you'd invalidate huge parts of your
         | test suite and generate lots of noisy failures. It'd make it
         | far too expensive to introduce all but the most critical
         | changes.
         | 
         | We don't care what Rust gets generated; we care that the Rust
         | which is generated has the correct behavior. Testing that is
         | where the value is.
        
           | kazinator wrote:
           | > _If you made a change that changed the Rust output, you'd
           | invalidate huge parts of your test suite and generate lots of
           | noisy failures._
           | 
           | If that change was unintended, you'd be thanking yourself for
           | unit tests.
           | 
           | The unit test suite doesn't have to be all that large. It's
           | the behavioral test suite which has to be large in order to
           | generate confidence.
           | 
           | > _It'd make it far too expensive to introduce all but the
           | most critical changes._
           | 
           | You can easily have diffs between the expected output of
           | those cases and the new output.
           | 
           | You can review those and merge them, which is time-consuming
           | work, but of great value. You can spot bugs in the review,
           | like whoa, this thing is now being translated in a bad way.
           | 
           | If the project is in a state of flux, the new expected
           | outputs can be more or less blindly merged; still better than
           | nothing, and there is a record that can be revisited. Ah,
           | that test case is actually confirming the wrong thing, which
           | was right previous to this commit when the output changed.
        
         | samus wrote:
         | This tool is there to automate the boring parts of converting
         | codebases. The real work is verifying the parts that rely on
         | undefined behavior.
         | 
         | High fidelity and reliability would be a concern if the
         | transpiler is used to regularly sync between C and Rust
         | versions of a same codebase. It's less of a problem for one-off
         | efforts. Those will usually be heavily tested and inspected
         | before the result is trusted.
         | 
         | Edit: it would of course be nice to see proper quality control
         | applied, but prototypes have to prove quickly that they are
         | worth the time spent refining them.
        
           | kazinator wrote:
           | A source-to-source translator is a textbook example of
           | something that is slam-dunk unit testable. There is almost no
           | valid argument against doing it.
           | 
           | Note: maybe this project has it in there; I haven't
           | exhaustively looked into every subdirectory.
           | 
           | > _if the transpiler is used to regularly sync between C and
           | Rust versions of a same codebase_
           | 
           | How do you know it won't be used that way? Because the
           | maintainers of every C codebase will stop what they are doing
           | in C, and follow the Rust conversion as soon as they hear
           | about it?
           | 
           | Even if you use this tool to permanently cut some code base
           | over to Rust, it would be nice for there to be some assurance
           | about _what_ it's doing beyond just "the converted code
           | seems to do the same thing". A conversion could be done two
           | or more times even if the C code isn't changing. Say you do
           | the conversion. Then hack on the converted code. A new
           | version of the converter comes out claiming to fix bugs. You
           | might want to re-run it on the original code again, see if
           | anything changed, and merge those changes to the current code
           | stream that already contains modifications.
           | 
           | > _The real work is verifying the parts that rely on
           | undefined behavior._
           | 
           | That is neither here nor there. A construct that is confirmed
           | undefined in C can be translated to the call to a Rust
           | function that makes demons fly out of your nose. And there
           | can be a couple of unit tests confirming this translation
           | strategy.
           | 
           | Or else, something else can be done. E.g. let's say the C
           | code relies on wraparound two's complement arithmetic. The
           | translator can oblige and generate code which makes that work
           | (making that translator more helpful than some modern C
           | compilers).
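
A brief sketch of the wraparound case in Rust. Which of these a translator should emit is a design decision; the function names are invented for the example, while `wrapping_add` and `checked_add` are standard integer methods.

```rust
/// Two's-complement wraparound, made explicit rather than left to the compiler.
fn add_wrapping(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

/// Alternative: report overflow to the caller instead of wrapping.
fn add_checked(a: i32, b: i32) -> Option<i32> {
    a.checked_add(b)
}

fn main() {
    // Overflow wraps from the largest value to the smallest.
    assert_eq!(add_wrapping(i32::MAX, 1), i32::MIN);
    // The checked form turns the same overflow into a value the caller must handle.
    assert_eq!(add_checked(i32::MAX, 1), None);
    assert_eq!(add_checked(2, 3), Some(5));
}
```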
        
             | samus wrote:
             | Tests could verify that the transpiler correctly maps
             | certain C constructs, but that almost doesn't matter. There
             | is a fair chance that the translation won't work anyways if
             | there's too much C sorcery and undefined behavior involved.
             | 
             | The idea of continuous two-way synchronisation between
             | codebases is migraine-inducing to begin with. Even though
             | Rust can probably be transpiled to C with way less risk of
             | losing fidelity. But I wouldn't be so sure that this always
             | works out the more `unsafe` blocks the Rust version
             | requires. C compilers are not designed to minimize UB after
             | all, and the results can be very surprising.
             | 
             | Yes, I hope people are sane enough to eventually commit to
             | a somewhat tidied-up Rust version and to only keep the C
             | version around to conduct software archeology. Of course,
             | this will produce a hard fork of the codebase, and the
             | usual political reasons specific to such efforts apply.
        
       | WalterBright wrote:
       | C translates quite readily to D. I've been able to translate
       | thousands of lines at a time in less than an hour, usually with
       | some global search & replace and then making adjustments after
       | running it through the D compiler. We relied on being able to do
       | this in the D community for quite a while. There also have been
       | three translators built, with more or less effectiveness.
       | It is nice to get the code into D, and then take advantage of D's
       | safety features.
       | 
       | The fundamental problem with translation, followed by some hand
       | tweaking, is that it only works if the C version is to be
       | abandoned. If the C code is maintained by anyone else, as soon as
       | they make changes, the translation gets out of date. Updating the
       | translation turns out to be impractical because of the hand
       | tweaking necessary.
       | 
       | Then there are some frustrating structural limits. The largest is
       | that C doesn't have modules. The preprocessor puts everything
       | into one file, and every C compilation is for one file.
       | Declarations get duplicated across every translation unit.
       | Somehow, these need to get teased apart into modules. This
       | structural redo gets done by hand, and requires pretty good
       | familiarity with the C code's design.
       | 
       | The preprocessor poses another major problem. The preprocessor
       | language and the core C compiler have no knowledge of each other.
       | They are completely separate languages, with their own syntax,
       | keywords, semantics, etc. The preprocessor, aside from trivial
       | use of it, simply does not translate into other languages. I also
       | have yet to find a C programmer who could resist using the
       | preprocessor as a metaprogramming language, which does a great
       | job at obstructing all efforts at converting to another language.
       | 
       | All this stuff raises a lot of friction for D interacting with C
       | code. Programmers don't like friction, they don't want to deal
       | with C code they are unfamiliar with, they don't want to fold in
       | maintenance changes in C code to the translation, etc. They want
       | it to "just work".
       | 
       | The eventual solution I came up with is obvious, but I'd always
       | dismissed it as impractical. Just fix the D compiler to be able
       | to compile C code directly, and internally make the C
       | declarations and constructs available to D code. This turned out
       | to be fairly easy to do, and is ridiculously effective. It
       | sometimes works even better than C++'s ability to #include C code
       | (C++ doesn't support things like _Generic, old style C
       | declarations, etc.). All you have to do is import .c code just
       | like importing any D module, and the D compiler takes care of all
       | the dirty work for you.
       | 
       | It isn't perfect, for example, C compilers have lots of
       | extensions, and dealing with all of them is hopeless. But we just
       | do the common ones, as it turns out most of them are rarely used.
        
         | [deleted]
        
         | bachmeier wrote:
         | > The preprocessor poses another major problem. The
         | preprocessor language and the core C compiler have no knowledge
         | of each other. They are completely separate languages, with
         | their own syntax, keywords, semantics, etc.
         | 
         | "I wrote my program in C."
         | 
         | No, you wrote your program in a custom language that only you
         | (at most) understand, and you gave the file a .c extension.
        
         | jeffparsons wrote:
         | Zig also takes this approach, and even exposes its C compiler
         | (which if I recall correctly is basically Clang plus diverse
         | sysroots and other customisation out of the box) as a separate
         | `zig cc`.
         | 
         | I do a lot of work in Rust, and cross-compilation can be a pain
         | when you have a lot of C dependencies. Fortunately
         | https://github.com/messense/cargo-zigbuild exists. It sounds
         | crazy, but using Zig's inbuilt C compiler to help build my Rust
         | projects has been the smoothest option I've found.
         | 
         | I can't help but wonder if it would be worth it for Rust to
         | follow D and Zig by shipping its own inbuilt C compiler, even
         | if they still want to also support external C toolchains. It
         | should be roughly the same effort as it was for Zig, given that
         | they both use LLVM.
        
           | WalterBright wrote:
           | D can compile and link C programs with:
           | 
           |     dmd hello.c
           | 
           | C and D code can be mixed with:
           | 
           |     dmd mars.d pluto.c
           | 
           | C code can be imported by D code:
           | 
           |     import stdio;  // looks for stdio.d, stdio.h, stdio.c in that order
           |     void main() { printf("using C printf from D!"); }
           | 
           | It keys off of the file extension.
           | 
           | Amusingly, C code can also import D code:
           | 
           |     ----- D file ----
           |     int square(int x) { return x * x; }
           | 
           |     ---- C file ----
           |     __import square;
           |     int test() { return square(3); }
           | 
           | closing the circle, enabling D libraries to be written and
           | accessed by C.
        
       | als0 wrote:
       | You can try it out on the main website https://c2rust.com where
       | they have a web version. Unfortunately it isn't working (HTTP 503
       | error)
        
         | samus wrote:
         | Classic Hackernews hug of death.
        
           | Jalad wrote:
           | Nah I'm pretty sure it's just broken. I took a look at it a
           | week ish ago and it was down too
        
             | dataking wrote:
             | Can confirm it is broken. With a little luck, it should be
             | back up and running early next week.
        
         | mastax wrote:
         | Works for me
        
           | als0 wrote:
           | Did you press Translate?
        
             | mastax wrote:
             | No :)
        
               | metadat wrote:
               | Pressing "Translate" appears to do nothing.
        
       | mastax wrote:
       | Their blogpost about translating Quake 3 was interesting:
       | https://immunant.com/blog/2020/01/quake3/
        
       ___________________________________________________________________
       (page generated 2022-10-23 23:01 UTC)