[HN Gopher] C2Rust Transpiler
___________________________________________________________________
C2Rust Transpiler
Author : Aissen
Score : 104 points
Date : 2022-10-20 12:54 UTC (3 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| gwbas1c wrote:
| Really cool concept! (Rust is my favorite language for tinkering;
| I haven't touched C since I was in school.)
|
| What would really help is success stories: Who's used it? What
| have they used it for? What challenges did they encounter? Then
| again, maybe this is so new that there aren't a lot of success
| stories yet. :)
| mastax wrote:
| From what I remember it was primarily funded by a DoD project
| of some sort. There probably isn't a lot of info about that.
| There have been a few conference talks or blogposts about it
| (I'm trying to remember) that talked about the process of using
| it.
| mshockwave wrote:
| It's actually funded by DARPA so basically everything in this
|       project is publicly available.
| dtolnay wrote:
|     I used it with great success for transpiling libyaml from C to
| Rust. I even set up Miri to run the upstream library's entire
| transpiled test suite, and the fact it passes is validation of
| absence of UB in the original C code.
|
| The transpiled library now serves as the YAML backend for the
| widely used serde_yaml crate. Having serde_yaml be pure-Rust
| code instead of linking C is advantageous for painless cross-
| compilation as well as making downstream projects runnable in
| Miri.
|
| https://github.com/dtolnay/unsafe-libyaml
| tialaramex wrote:
| A few questions, if I may:
|
| Is the intent that this continues to evolve by newer libyaml
| code getting transpiled, or that it's effectively a fork and
| might gradually become more idiomatic Rust which does the
| same job as the C but won't track any changes? Or is this
| basically "done" and only small changes (to both this code
| and the C libyaml) are anticipated anyway?
|
| The c2rust documentation cautions that any platform
| independence isn't preserved, so if libyaml has types which
| are different on platform A versus platform B, the c2rust
| transpilation on platform A just gives you Rust types for
| platform A, losing that independence, was this an issue for
| libyaml ?
| pitaj wrote:
| Very cool. Miri is seriously awesome.
| dahfizz wrote:
| I would be interested to see performance numbers of the C version
|   and the transpiled Rust version of some program.
| fbdab103 wrote:
| A potential use case I see is for security auditing. Even if you
| cannot port an existing C codebase to Rust, you could run this
| tool to examine the unsafe hotspots. Any place where the
| translation has to rely upon unsafe is a region of the code more
| likely to contain any of the mistakes Rust is designed to
| prevent. Of course, this pre-supposes that 90% of the translation
| does not have to lean on unsafe annotation.
| electroly wrote:
| This only produces unsafe code. Every translated function has
| the unsafe keyword. It's up to the programmer to clean it up
| afterwards.
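To make the point above concrete, here is a minimal sketch of the two styles. It is a hand-written illustration, not actual c2rust output; the names `sum_raw` and `sum_safe` are made up for this example. The first function has the shape a mechanical translation tends to take (raw pointer plus separate length, `unsafe` on the whole function); the second is what a manual cleanup might look like.

```rust
use std::os::raw::c_int;

// Shape of a mechanically translated C function: raw pointer, C integer
// type, and `unsafe` on the whole function.
// Hand-written illustration, NOT actual c2rust output.
unsafe fn sum_raw(a: *const c_int, n: usize) -> c_int {
    let mut total: c_int = 0;
    let mut i: usize = 0;
    while i < n {
        // The caller must guarantee that `a` points to at least `n` ints;
        // nothing here can check it.
        total = total.wrapping_add(unsafe { *a.add(i) });
        i += 1;
    }
    total
}

// Hand-cleaned version: the slice carries its own length, so the function
// needs no `unsafe` at all.
fn sum_safe(a: &[i32]) -> i32 {
    a.iter().fold(0i32, |acc, &x| acc.wrapping_add(x))
}
```

Both functions return the same result for valid input; only the second lets the compiler rule out an out-of-bounds read.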
| kazinator wrote:
| I suspect that this relies on unsafe pretty much everywhere.
| Even handling argc and argv in your C main function, in
| idiomatic ways, is unsafe.
|
| There is no 80/20 rule for C unsafety, other than maybe an
| inverted one: 80% of the unsafety of a large C program might be
| spread into 80% or more of the code. :)
| a1369209993 wrote:
| > There is no 80/20 rule for C unsafety,
|
| Actually, there is; the problem is that the transpiler can't
| tell the difference between code that _relies_ on unsafety
| for its semantics, versus code that would still work if
|       appropriate annotations (potentially causing functions to not
| be callable in intended contexts), run-time checks (possibly
| causing code to error out on what were intended to be valid
| inputs), etc, were added to make it safe.
|
| 80% of the code contains 20% of the cases where safety would
| require deviating from the _intended_ semantics, not just the
|       incidental ones. (A general-purpose transpiler can't (in
| general) tell the difference between intended semantics and
| incidental ones, so it has to conservatively assume all
| semantics are intended, and write everything as the most-
| general (ie most-unsafe) interpretation.)
| kazinator wrote:
| Any C code that performs a calculation which would silently
| be wrong or crash if the values were not correct (even
| though they are) is inherently unsafe.
| turminal wrote:
| That's a weird way to put it. If your function assumes
| some constraints on the input it gets and you give it
| data that violates its constraints, it's going to fail in
| some way. Sure, C makes it worse by making it harder to
| verify the assumptions and constraints, but by your
| definition every function that operates on sorted arrays
| and doesn't verify the input is sorted is inherently
| unsafe, regardless of the language.
| counttheforks wrote:
| > Even handling argc and argv in your C main function, in
| idiomatic ways, is unsafe.
|
| Not knowing C very well, could you clarify what makes it
| unsafe? Thanks!
| insanitybit wrote:
| You'd have to access the stack frame above `main` and then
| treat some of the bytes within that frame as your env. This
| means forging pointers/bounds based on the inputs. `execve`
| basically sets that up for you but Rust doesn't know about
| that.
|
| Then if you wanted to handle dynamically set environment
| variables you'll need to call into your libc
| implementation, which crosses an ffi boundary, which means
| rust doesn't know about what that code is doing and
|       therefore it requires `unsafe`.
|
| edit: Question for others - is main a separate stackframe?
| I actually don't recall.
| hn92726819 wrote:
| "safe" and "unsafe" in rust are well defined, but in my
| opinion it's very confusing and I wish they used different
| terms.
|
| In rust, unsafe means accessing memory that's already freed
|       or unallocated or things like that. You can look up the
|       Rust reference for the full definition.
|
| I think the comment you replied to mistakenly used the term
| "unsafe" (that's part of the reason I dislike the term; it
| can mean multiple things). In rust context though, it isn't
| unsafe to index an array that's out of bounds. I.e. if
| argc=10 and you call argv[99], that will crash your program
| but isn't considered "unsafe".
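A small sketch of the distinction drawn above, assuming the arguments have already been collected into a slice of `String`s (the helper name `arg_at` is made up for this example). In safe Rust, indexing past the end with `args[99]` panics deterministically instead of reading arbitrary memory, and `get` turns the same condition into an `Option`.

```rust
// Look up argument `i`, returning None instead of reading out of bounds.
fn arg_at(args: &[String], i: usize) -> Option<&str> {
    args.get(i).map(|s| s.as_str())
}
```

In a real program the slice could come from `std::env::args().collect::<Vec<String>>()`, which is the standard library's safe view of `argc`/`argv`.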
| initplus wrote:
| C arrays are just sugar for pointer arithmetic. [] just
|       calculates the sum and dereferences the result.
|
| arr[n] == *(arr + n) == n[arr]
|
| All these forms are valid C and gcc will happily compile
| them all without complaining.
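For comparison, a rough Rust rendering of the same identity. This is only a sketch; `read_via_pointer` is a hypothetical helper. The pointer form needs an `unsafe` block and performs no bounds check of its own, so the check is written out by hand here.

```rust
// Read element `n` through pointer arithmetic, which is how C defines
// `arr[n]`. The raw pointer read itself checks nothing.
fn read_via_pointer(arr: &[i32], n: usize) -> i32 {
    // Hand-written bounds check; plain C indexing has no equivalent.
    assert!(n < arr.len(), "index out of bounds");
    // SAFETY: `n` was checked against the slice length just above.
    unsafe { *arr.as_ptr().add(n) }
}
```

Safe indexing (`arr[n]`) gives the same value and does the bounds check implicitly.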
| pjmlp wrote:
| Paraphrasing a common meme, how much time do you have?
|
| Just scratching the surface, we have:
|
| - The language doesn't really have vector and strings as
| data types, they are pointers to memory sections without
| any kind of protection
|
|       - All functions in the standard library deemed safe, added as
|       mitigations for possible memory corruption, have gotchas in
|       their use; there isn't a single one that is safe, especially
|       because all of them expect the developer to never get the
|       buffer size parameters wrong.
|
| - Enumerations are not type safe, decay implicitly to
| integers when used in numeric context, and all numeric
| values can be converted into an enumeration, even if there
| isn't a mapping available
|
|       - Implicit numeric conversions everywhere, and since there
| is no overflow/underflow checking, every single numeric
| operation can wrap around, or be the source for clever
| compiler optimizations
|
| - ISO C documents at least around 200 cases of UB, where
| the compiler can take the liberty to optimize the code as
| it pleases
|
| - Type casts that convert complex data types into others
| can be a source of surprises when moving across compilers
| and platforms
|
| - Speaking of which, even if you restrict yourself to ISO
| C, without any compiler specific extensions, there are
| behaviours that are implementation defined, which can vary
| across compilers and platforms.
|
| - Variables defined as const, aren't really constant and
| one can subvert their value
|
| - There is no null checking, so whatever happens depends on
| the platform.
|
| This is just a short overview, open the man page for GCC or
| clang and go through the list of all warnings that you can
|       enable to try to write safer code, especially all that are
| enabled via -Wall and -Wextra.
|
| All the above flaws are also present in Objective-C,
| Objective-C++ and C++, due to their copy-paste
| compatibility with C (yes C++ isn't 100% compatible).
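As a contrast to the enumeration point in the list above, here is a minimal sketch of how the integer-to-enum direction is typically handled in Rust. The `Color` type is invented for this example. Going from enum to integer is an explicit `as` cast, and going from integer to enum has to name what happens to unmapped values.

```rust
use std::convert::TryFrom;

// Hypothetical enumeration used only for this illustration.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Color {
    Red = 0,
    Green = 1,
    Blue = 2,
}

impl TryFrom<i32> for Color {
    // The rejected integer is handed back as the error value.
    type Error = i32;

    fn try_from(v: i32) -> Result<Self, Self::Error> {
        match v {
            0 => Ok(Color::Red),
            1 => Ok(Color::Green),
            2 => Ok(Color::Blue),
            other => Err(other), // no silent conversion of unmapped values
        }
    }
}
```

With this in place, `Color::try_from(7)` is an `Err` rather than an enum holding a value with no mapping.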
| int_19h wrote:
| IIRC, in C++ at least, mutating an object that is
| originally const (whether it's a variable declared as
| such, or a heap object created with "new const ..."), is
| UB regardless of how you do it - pointer casts etc.
| kazinator wrote:
| For instance it means that an expression like argv[i], even
| though correct, could be wrong in a way that won't be
|       diagnosed. Code is "unsafe" to the extent that its
|       predictable behavior depends only on the programmer.
| tinco wrote:
| The word unsafe has a specific meaning in Rust. It doesn't
| mean every C program that uses argc and argv is unsafe. In
| this specific case however I don't think it would actually
| require much unsafe. The only unsafe thing I'd introduce is
| a way of casting the *argv[] to a type that safely deals
| with null terminated strings. Maybe such a type is already
| in Rust's standard library and I wouldn't even need that.
|
| edit: eh sorry I wasn't thinking straight, you of course
| need unsafe to cast the argv itself to a type that has a
|       separate argc as well. Assuming such a type is available,
| if it's not its implementation would also have unsafe all
| over the place.
|
| Maybe to answer the underlying question. What makes it
| unsafe is that in C it is assumed the programmer knows to
| keep all indexes into argv under argc. In Rust such an
| assumption must be made explicit by specifying "unsafe". It
| is idiomatic Rust to have all instances of "unsafe" in
| libraries whose implementation is vetted by the community,
| so ideally there are little to no instances of "unsafe" in
| the application logic itself. Rust's compiler and type
| system have various tricks that reduce the amount of
| "unsafe" you would think you'd need for even quite complex
| problems.
| moomin wrote:
| You're right, but they're hoping to improve upon that.
| WalterBright wrote:
| Since 90% (a wild guesstimate) of C code is pointers, I suspect
| this is hopeless.
|
| I've translated a lot of C code to D, and manually converting
| `*` to `ref` (D's safe pointers), and converting to slices,
| cleans up most of the C code nicely and you get buffer overflow
| checks for free.
| Animats wrote:
| How does this compare to Corrode? The trouble with these things
| is that the Rust that comes out is usually too awful to maintain.
| Corrode, too, said that someday they'd generate more reasonable
| Rust. But that never happened. Converting C into Rust with unsafe
| raw pointers is not all that useful.
|
| What's needed is some way to provide key information C doesn't
| have. Mostly about array sizes. Some way to annotate
| int read(int fd, void* buf, size_t len)
|
| to tell the system that buf has size len.
|
| A file of translation hints with such info could guide the
| translator into producing decent Rust. Most of the things done
| with pointer arithmetic can be expressed with slices. (Things
| being done with pointer arithmetic which can't be expressed as
| slices should be viewed with deep suspicion.) But you need size
| info to do that.
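A sketch of what such a hint buys on the Rust side, under the assumption that `buf` really does have `len` bytes. The functions `read_raw` and `read_into` are hypothetical, and an in-memory `src` slice stands in for the file descriptor so the example stays self-contained. Once the pointer and length are joined into a slice, everything downstream is bounds-checked.

```rust
// Safe core: the destination slice knows its own length, so the copy
// cannot run past the end of the buffer.
fn read_into(src: &[u8], buf: &mut [u8]) -> usize {
    let n = src.len().min(buf.len());
    buf[..n].copy_from_slice(&src[..n]);
    n
}

// C-style entry point: `buf` and `len` travel separately, and only the
// caller knows that they belong together.
unsafe fn read_raw(src: &[u8], buf: *mut u8, len: usize) -> usize {
    // The "translation hint" applied by hand: buf has size len.
    let slice = unsafe { std::slice::from_raw_parts_mut(buf, len) };
    read_into(src, slice)
}
```

The single `unsafe` line is the only place where the size assumption has to be trusted; a translator given that information could emit the slice form directly.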
| andolanra wrote:
| The short form is that Corrode is effectively deprecated in
| favor of c2rust. Indeed, Corrode hasn't been updated since
| 2017, while c2rust still gets active development--last commit
| as of my writing this was 2 days ago.
|
| It's worth noting that the developer of Corrode was consulted
| on the early design of c2rust, which means c2rust was able to
| benefit from hindsight on architectural decisions in Corrode.
| That ended up leading to a bit of a messy history between the
|     two (c.f. https://jamey.thesharps.us/2018/06/30/c2rust-vs-corrode/
|     with HN discussion
|     https://news.ycombinator.com/item?id=17436371 --although I
| believe that after that blog post the c2rust developers did end
| up acknowledging their inspiration and apologized for not doing
| so earlier.)
| masklinn wrote:
| The goal of C2rust is not to produce good maintainable Rust.
|
| It's to produce buildable rust which exactly matches the
| original code, which you can then migrate to _proper_ rust.
|
| So your query is really in the "not even wrong" category.
| Animats wrote:
| _which you can then migrate to proper rust._
|
| Which means you have to manually work on that awful code that
| comes out. In the chart at [1], this step is represented by a
| magic wand.
|
| (I wanted to give some examples, but https://c2rust.com/
| seems to not be translating today.)
|
| [1] https://c2rust.com/manual/
| Arnavion wrote:
| Yes, Windows has been doing that with SAL annotations for
| years.
| int_19h wrote:
| > What's needed is some way to provide key information C
| doesn't have. Mostly about array sizes.
|
| You also want to know which way the data flows (i.e. is buf
| read from, written to, or both). And then you end up with
| something like this:
|
| https://learn.microsoft.com/en-us/cpp/code-quality/understan...
| WalterBright wrote:
| I proposed an extension to C which adds slices:
|
| https://www.digitalmars.com/articles/C-biggest-mistake.html
| Animats wrote:
| Me too.[1]
|
| But it would have meant years of work on language politics.
|
| It might be worth looking at this sort of thing again,
| because machine learning is far enough along that recognizing
| and converting the usual array idioms is feasible. If the
| output code with array bounds is run time checked, then
| errors in translation will result in detected array bounds
| errors.
|
| [1] http://animats.com/papers/languages/safearraysforc43.pdf
| SaddledBounding wrote:
| I think a demo of the transpiler output for a short function
| would make a great addition to the readme.
| kazinator wrote:
| Sampling some directories and files in the test suite of this
| project, I see a problem: testing is done by translating C to
| rust and compiling it, and then testing the run-time behavior of
| the result. I don't see test cases which cover the behavior of
| the translator directly: like that a certain C language input
| maps to a certain Rust output.
| maxbond wrote:
| I'd argue that these tests are more robust to changes in Rust
| and changes in C2Rust that change the output in trivial ways. I
| don't see how you could maintain the sort of test suite you're
| describing in a project like this. If you made a change that
| changed the Rust output, you'd invalidate huge parts of your
|     test suite and generate lots of noisy failures. It'd make it
|     far too expensive to introduce all but the most critical
| changes.
|
| We don't care what Rust gets generated; we care that the Rust
| which is generated has the correct behavior. Testing that is
| where the value is.
| kazinator wrote:
|       > _If you made a change that changed the Rust output, you'd
|       invalidate huge parts of your test suite and generate lots of
|       noisy failures._
|
| If that change was unintended, you'd be thanking yourself for
| unit tests.
|
| The unit test suite doesn't have to be all that large. It's
| the behavioral test suite which has to be large in order to
| generate confidence.
|
|       > _It'd make it far too expensive to introduce all but the
|       most critical changes._
|
| You can easily have diffs between the expected output of
| those cases and the new output.
|
| You can review those and merge them, which is time-consuming
| work, but of great value. You can spot bugs in the review,
| like whoa, this thing is now being translated in a bad way.
|
| If the project is in a state of flux, the new expected
| outputs can be more or less blindly merged; still better than
| nothing, and there is a record that can be revisited. Ah,
| that test case is actually confirming the wrong thing, which
| was right previous to this commit when the output changed.
| samus wrote:
| This tool is there to automate the boring parts of converting
| codebases. The real work is verifying the parts that rely on
| undefined behavior.
|
| High fidelity and reliability would be a concern if the
| transpiler is used to regularly sync between C and Rust
| versions of a same codebase. It's less of a problem for one-off
| efforts. Those will usually be heavily tested and inspected
| before the result is trusted.
|
| Edit: it would of course be nice to see proper quality control
|     applied, but prototypes have to prove quickly that they are
| worth the time spent refining them.
| kazinator wrote:
| A source-to-source translator is a textbook example of
| something that is slam-dunk unit testable. There is almost no
| valid argument against doing it.
|
| Note: maybe this project has it in there; I haven't
| exhaustively looked into every subdirectory.
|
| > _if the transpiler is used to regularly sync between C and
| Rust versions of a same codebase_
|
| How do you know it won't be used that way? Because the
| maintainers of every C codebase will stop what they are doing
| in C, and follow the Rust conversion as soon as they hear
| about it?
|
| Even if you use this tool to permanently cut some code base
|       over to Rust, it would be nice for there to be some assurance
|       about _what_ it's doing beyond just "the converted code
| seems to do the same thing". A conversion could be done two
| or more times even if the C code isn't changing. Say you do
| the conversion. Then hack on the converted code. A new
| version of the converter comes out claiming to fix bugs. You
| might want to re-run it on the original code again, see if
| anything changed, and merge those changes to the current code
| stream that already contains modifications.
|
| > _The real work is verifying the parts that rely on
| undefined behavior._
|
| That is neither here nor there. A construct that is confirmed
| undefined in C can be translated to the call to a Rust
|       function that makes demons fly out of your nose. And there
| can be a couple of unit tests confirming this translation
| strategy.
|
| Or else, something else can be done. E.g. let's say the C
| code relies on wraparound two's complement arithmetic. The
| translator can oblige and generate code which makes that work
| (making that translator more helpful than some modern C
| compilers).
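In Rust the wraparound case mentioned above can be spelled out directly. A minimal sketch with made-up wrapper names: `wrapping_add` gives two's complement wraparound in every build mode, while `checked_add` reports the overflow instead.

```rust
// Explicit two's complement wraparound, well defined for all inputs.
fn add_wrapping(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

// Overflow is reported as None instead of silently wrapping.
fn add_checked(a: i32, b: i32) -> Option<i32> {
    a.checked_add(b)
}
```

A translator that knows the C code depends on wraparound could emit the first form; the second suits code where overflow would be a bug.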
| samus wrote:
| Tests could verify that the transpiler correctly maps
| certain C constructs, but that almost doesn't matter. There
| is a fair chance that the translation won't work anyways if
| there's too much C sorcery and undefined behavior involved.
|
| The idea of continuous two-way synchronisation between
| codebases is migraine-inducing to begin with. Even though
|         Rust can probably be transpiled to C with way less risk of
| losing fidelity. But I wouldn't be so sure that this always
| works out the more `unsafe` blocks the Rust version
| requires. C compilers are not designed to minimize UB after
| all, and the results can be very surprising.
|
| Yes, I hope people are sane enough to eventually commit to
| a somewhat tidied-up Rust version and to only keep the C
| version around to conduct software archeology. Of course,
| this will produce a hard fork of the codebase, and the
| usual political reasons specific to such efforts apply.
| WalterBright wrote:
| C translates quite readily to D. I've been able to translate
| thousands of lines at a time in less than an hour, usually with
| some global search & replace and then making adjustments after
| running it through the D compiler. We relied on being able to do
| this in the D community for quite a while. There also have been
|   three translators built, with more or less effectiveness.
| It is nice to get the code into D, and then take advantage of D's
| safety features.
|
| The fundamental problem with translation, followed by some hand
| tweaking, is that it only works if the C version is to be
| abandoned. If the C code is maintained by anyone else, as soon as
| they make changes, the translation gets out of date. Updating the
| translation turns out to be impractical because of the hand
| tweaking necessary.
|
| Then there are some frustrating structural limits. The largest is
| that C doesn't have modules. The preprocessor puts everything
| into one file, and every C compilation is for one file.
| Declarations get duplicated across every translation unit.
| Somehow, these need to get teased apart into modules. This
| structural redo gets done by hand, and requires pretty good
| familiarity with the C code's design.
|
| The preprocessor poses another major problem. The preprocessor
| language and the core C compiler have no knowledge of each other.
| They are completely separate languages, with their own syntax,
| keywords, semantics, etc. The preprocessor, aside from trivial
| use of it, simply does not translate into other languages. I also
| have yet to find a C programmer who could resist using the
| preprocessor as a metaprogramming language, which does a great
| job at obstructing all efforts at converting to another language.
|
| All this stuff raises a lot of friction for D interacting with C
| code. Programmers don't like friction, they don't want to deal
| with C code they are unfamiliar with, they don't want to fold in
| maintenance changes in C code to the translation, etc. They want
| it to "just work".
|
| The eventual solution I came up with is obvious, but I'd always
| dismissed it as impractical. Just fix the D compiler to be able
| to compile C code directly, and internally make the C
| declarations and constructs available to D code. This turned out
| to be fairly easy to do, and is ridiculously effective. It
| sometimes works even better than C++'s ability to #include C code
| (C++ doesn't support things like _Generic, old style C
| declarations, etc.). All you have to do is import .c code just
| like importing any D module, and the D compiler takes care of all
| the dirty work for you.
|
| It isn't perfect, for example, C compilers have lots of
| extensions, and dealing with all of them is hopeless. But we just
| do the common ones, as it turns out most of them are rarely used.
| bachmeier wrote:
| > The preprocessor poses another major problem. The
| preprocessor language and the core C compiler have no knowledge
| of each other. They are completely separate languages, with
| their own syntax, keywords, semantics, etc.
|
| "I wrote my program in C."
|
| No, you wrote your program in a custom language that only you
| (at most) understand, and you gave the file a .c extension.
| jeffparsons wrote:
| Zig also takes this approach, and even exposes its C compiler
| (which if I recall correctly is basically Clang plus diverse
| sysroots and other customisation out of the box) as a separate
| `zig cc`.
|
| I do a lot of work in Rust, and cross-compilation can be a pain
| when you have a lot of C dependencies. Fortunately
| https://github.com/messense/cargo-zigbuild exists. It sounds
| crazy, but using Zig's inbuilt C compiler to help build my Rust
| projects has been the smoothest option I've found.
|
| I can't help but wonder if it would be worth it for Rust to
| follow D and Zig by shipping its own inbuilt C compiler, even
| if they still want to also support external C toolchains. It
| should be roughly the same effort as it was for Zig, given that
| they both use LLVM.
| WalterBright wrote:
|       D can compile and link C programs with:
|
|           dmd hello.c
|
|       C and D code can be mixed with:
|
|           dmd mars.d pluto.c
|
|       C code can be imported by D code:
|
|           import stdio; // looks for stdio.d, stdio.h, stdio.c in that order
|           void main() { printf("using C printf from D!"); }
|
|       It keys off of the file extension.
|
|       Amusingly, C code can also import D code:
|
|           ----- D file ----
|           int square(int x) { return x * x; }
|           ---- C file ----
|           __import square;
|           int test() { return square(3); }
|
| closing the circle, enabling D libraries to be written and
| accessed by C.
| als0 wrote:
| You can try it out on the main website https://c2rust.com where
| they have a web version. Unfortunately it isn't working (HTTP 503
| error)
| samus wrote:
| Classic Hackernews hug of death.
| Jalad wrote:
| Nah I'm pretty sure it's just broken. I took a look at it a
| week ish ago and it was down too
| dataking wrote:
| Can confirm it is broken. With a little luck, it should be
| back up and running early next week.
| mastax wrote:
| Works for me
| als0 wrote:
| Did you press Translate?
| mastax wrote:
| No :)
| metadat wrote:
| Pressing "Translate" appears to do nothing.
| mastax wrote:
| Their blogpost about translating Quake 3 was interesting:
| https://immunant.com/blog/2020/01/quake3/
___________________________________________________________________
(page generated 2022-10-23 23:01 UTC)