[HN Gopher] Comparing the C FFI overhead on various languages
       ___________________________________________________________________
        
       Comparing the C FFI overhead on various languages
        
       Author : generichuman
       Score  : 105 points
       Date   : 2022-05-14 10:49 UTC (12 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | KSPAtlas wrote:
       | What about Common Lisp?
        
         | medo-bear wrote:
          | There is a pretty powerful CFFI package (library) to achieve
          | this, but performance will be very implementation-dependent.
          | If someone wants to try it, the de facto standard, free,
          | open-source, speedy implementation is SBCL.
        
       | cube2222 wrote:
        | Just a caveat (not sure if it matters in practice), but this
        | benchmark is using very old versions of many of the languages
        | it's comparing (5-year-old ones).
        
         | kllrnohj wrote:
         | It probably matters for a few of the slower ones, like Java,
          | Go, or Dart. It's also going to matter which platform you're
          | on. E.g., Java may have better FFI on x86 than on ARM, or
          | Dart's FFI may be better on ARM than on x86, particularly
          | given that Flutter is the primary user these days.
         | 
          | And then, to make it even more complicated, it's also going
          | to potentially depend on the GC being used. For example, for
          | Java's JNI it's actually the bookkeeping for the GC that
          | takes the most time in the FFI transition (you can't pause
          | the thread to mark the stack for a concurrent GC while it's
          | executing random C code, after all), which in turn depends
          | on what the specific GC being used requires.
        
       | haberman wrote:
       | Some of the results look outdated. The Dart results look bad (25x
       | slower than C), but looking at the code
       | (https://github.com/dyu/ffi-overhead/tree/master/dart) it appears
       | to be five years old. Dart has a new FFI as of Dart 2.5 (2019):
       | https://medium.com/dartlang/announcing-dart-2-5-super-charge...
       | I'm curious how the new FFI would fare in these benchmarks.
        
         | elcritch wrote:
          | Actually, it looks like most of the languages are seriously
          | outdated. Nim and Julia are both way out of date, and Elixir
          | is pretty outdated.
        
         | kcb wrote:
         | Because their environment is using an Ubuntu version from 8
         | years ago. So a better title would be "Comparing the C FFI
         | overhead on various languages in 2014"
        
         | meibo wrote:
          | Same with C#: it only benchmarks an old version of Mono, not
          | .NET Core, which has received several big performance boosts
          | in recent releases.
        
       | SemanticStrengh wrote:
        | Java has a new FFI API: the Foreign Function & Memory API,
        | from Project Panama.
        
       | throw827474737 wrote:
        | So why isn't C the baseline (Zig and Rust being pretty close
        | to it is quite expected), while both LuaJIT and Julia are
        | significantly faster?
        
         | gallexme wrote:
         | https://nullprogram.com/blog/2018/05/27/
        
           | eatonphil wrote:
           | > For the C "FFI" he used standard dynamic linking, not
           | dlopen(). This distinction is important, since it really
           | makes a difference in the benchmark. There's a potential
           | argument about whether or not this is a fair comparison to an
           | actual FFI, but, regardless, it's still interesting to
           | measure
        
             | arinlen wrote:
             | > There's a potential argument about whether or not this is
             | a fair comparison to an actual FFI, but, regardless, it's
             | still interesting to measure (...)
             | 
             | If there's interest in measuring dynamic linking then
             | wouldn't there be an interest in measuring it on all
             | languages that support dynamic linking?
        
             | jcelerier wrote:
              | With clang, just compiling with -fno-plt gives me:
              | 
              |     jit: 1.003483 ns/call
              |     plt: 1.254158 ns/call
              |     ind: 1.254616 ns/call
              | 
              | GCC does not seem to support it, though: it accepts the
              | flag but gives me
              | 
              |     jit: 1.003483 ns/call
              |     plt: 1.502089 ns/call
              |     ind: 1.254616 ns/call
              | 
              | (I tried everything I could think of that might make the
              | PLT disappear:
              | 
              |     cc -fno-plt -Bsymbolic -fno-semantic-interposition \
              |         -flto -std=c99 -Wall -Wextra -O3 -g3 \
              |         -Wl,-z,relro,-z,now -o benchmark benchmark.c \
              |         ./empty.so -ldl
              | 
              | without any change on GCC.)
        
           | miohtama wrote:
            | Is there anything akin to FFI, but with static linking,
            | for any foreign (non-C) language?
        
             | junon wrote:
             | The question as it stands makes a few assumptions I don't
             | think one can make, and as such is a bit tricky to answer
             | cleanly, but I'll try.
             | 
             | Yes it's just called linking. The language needs to be
             | aware of calling conventions and perhaps side effects and
             | be prepared for no additional intrinsic support for higher
             | level features.
             | 
              | It probably also needs to be able to read C headers,
              | because C symbols do not encode type signatures the way
              | the mangled names emitted by many C++ compilers do.
             | 
              | There's no "library" or out-of-the-box solution for
              | this, if that's what you're asking. This boils down to
              | how programs are constructed and, more so, how CPUs
              | work.
             | 
             | In most (all?) cases, anything higher level than straight-
             | up linking is headed toward FFI territory.
        
             | tlb wrote:
             | Calling WebAssembly from Javascript, sort of?
             | 
             | In the early Python 2 era there was an option to build an
             | interpreter binary with statically linked C stubs, and it
             | was noticeably faster and let you access Python data
             | structures from C. I used it for robotics code for speed.
             | It was inconvenient because you had to link in all the
             | modules you needed.
        
             | fweimer wrote:
             | For OpenJDK, there is JEP 178:
             | https://openjdk.java.net/jeps/178 I haven't seen it used in
             | practice.
             | 
             | Ocaml's C-implemented functions are linked statically. But
             | like JNI, the C functions have special names and type
             | signatures, so it is slightly different from, say, ctypes
             | in Python.
             | 
             | CGO for Go is statically linked, too. Its overhead stems
             | from significant differences between the Go and C world.
             | The example uses dynamic linking, but it would not have to
             | do that.
        
             | samatman wrote:
              | LuaJIT can use the FFI against statically linked object
              | code just fine. I'm not sure whether that answers your
              | question, since in this context LuaJIT must be embedded
              | in a C program.
              | 
              | Then again, it's a hard requirement of static linking
              | that you end up with just one binary, so it might answer
              | your question after all.
        
           | qalmakka wrote:
            | I'm always pretty surprised when I find that most people
            | writing C or C++ have no idea PLTs exist. They have a
            | small but non-negligible cost.
        
         | [deleted]
        
         | bachmeier wrote:
         | C, C++, Zig, Rust, D, and Haskell are all similar because
         | they're basically doing the same thing. Someone else linked to
         | the blog post, but Lua and Julia aren't doing the same thing,
         | so they get different results.
         | 
         | > both luajit and julia are significantly faster
         | 
         | I would be interested if anyone has an example where the
         | difference matters in practice. As soon as you move to the more
         | realistic scenario where you're writing a program that does
         | something other than what is measured by these benchmarks,
         | that's not going to be your biggest concern.
        
           | mananaysiempre wrote:
           | ETA: I see now I was answering the wrong question: you were
           | asking about the comparison between C and LuaJIT, not heavier
           | FFIs and C/LuaJIT.
           | 
           | Honestly I think of the difference (as discussed in Wellons's
           | post among others) not as a performance optimization but as
           | an anti-stupidity optimization: regardless of the performance
           | impact, it's _stupid_ that the standard ELF ABI forces us to
           | jump through these hoops for every foreign call, and even
           | stupider that plain inter- and even intra-compilation-unit
           | calls can also be affected unless you take additional
           | measures. Things are also being fixed on the C side with
           | things such as -fvisibility=, -fno-semantic-interposition,
           | -fno-plt, and new relocation types.
           | 
           | Can this be relevant to performance? Probably--aside from
           | just doing more stuff, there are trickier-to-predict parts of
           | the impact such as buffer pressure on the indirect branch
           | predictor. Does it? Not sure. The theoretical possibility of
           | interposition preventing inlining of publicly-accessible
           | functions is probably much more important, at the very least
           | I _have_ seen it make a difference. But this falls outside
           | the scope of FFI, strictly speaking, even if the cause is
           | related.
           | 
           | ---
           | 
           | I don't have a readily available example, but in the LuaJIT
           | case there are two considerations that I can mention:
           | 
           | - FFI is not just cheap but gets into the realm of a native
           | call (perhaps an indirect one), so a well-adapted inner loop
           | is not ruined even if it makes several FFI calls per
           | iteration (it will still be slower, but this is fractions not
           | multiples unless the loop did not allocate at all before the
           | change). What this influences is perhaps not even the final
           | performance but the shape of the API boundary: similarly to
           | the impact of promise pipelining for RPC[1], you're no longer
           | forced into the "construct job, submit job" mindset and
           | coarse-grained calls (think NumPy). Even calling libm
           | functions through the FFI, while probably not very smart,
           | isn't an instant death sentence, so not as many things are
           | forced to be reimplemented in the language as you're used to.
           | 
           | - The JIT is wonderfully speedy and simple, but draws much of
           | that speed and simplicity from the fact that it really only
           | understands two shapes of control flow: straight-line code;
           | and straight-line code leading into a loop with straight-line
           | code in the body. Other control transfers aren't banned as
           | such, but are built on top of these, can only be optimized
           | across to a limited extent, and can confuse the machinery
           | that decides what to trace. This has the unpleasant corollary
           | that builtins, which are normally implemented as baked-in
           | bytecode, can't usefully have loops in them. The solution
           | uses something LuaJIT 2.1 calls _trace stitching_ : the
           | problematic builtins are implemented in normal C and are free
           | to have arbitrarily complex control flow inside, but instead
           | of outright aborting the trace due to an unJITtable builtin
           | the compiler puts what is effectively an FFI call into it.
           | 
           | [1] https://capnproto.org/rpc.html
        
           | vvanders wrote:
            | Oh, it totally matters: you will pay for any sort of
            | chatty interface over FFI.
           | 
            | There's a reason a lot of gamedev uses LuaJIT. I've
            | personally had to refactor many interfaces to avoid JNI
            | calls as much as possible, as there was significant
            | overhead (both in the call itself and from the VM not
            | being able to optimize around it).
        
             | kllrnohj wrote:
              | The reason a lot of gamedev uses LuaJIT is the ease with
              | which it can be embedded.
             | 
             | And that's not really even true anymore as the majority of
             | gamedev is using Unreal or Unity, neither of which use
             | luajit.
        
               | vvanders wrote:
                | It's not just how easy it is to embed; it's also
                | really small in both code and runtime size. I've
                | shipped it on systems with less than 8 MB of total
                | system memory (we used a preallocated 400 KB block);
                | until QuickJS came along there really wasn't anything
                | comparable. It was also much faster than anything else
                | at the time and regularly beat V8 in the benchmarks I
                | ran.
               | 
               | Unity+Unreal are the public engines out there but there's
               | plenty of in-house engines and tool chains you don't
               | really hear about. I wouldn't be surprised if it's still
               | deployed in quite a few contexts.
        
           | kllrnohj wrote:
           | > I would be interested if anyone has an example where the
           | difference matters in practice.
           | 
           | Vulkan. Any sort of binding to Vulkan over a non-trivial FFI
           | (so like, not from C++, Rust, etc...) is going to be murdered
           | by this FFI overhead cost. Especially since for bindings from
           | something like Java you're either paying FFI on every field
           | set on a struct, or you're paying non-trivial marshalling
           | costs to convert from a Java class to a C struct to then
           | finally call the corresponding Vulkan function.
        
             | bachmeier wrote:
             | > Especially since for bindings from something like Java
             | 
             | I guess I wasn't clear, but I meant the difference between
             | C and Luajit.
        
               | kllrnohj wrote:
               | Ah. The answer to that is a lot more murky, since in an
               | actual C/C++ program you're going to have a mix of local,
               | static, and dynamic linking. You're generally not putting
               | super chatty stuff across a dynamic linkage, since that
               | tends to be where the stable API boundaries go. Anything
               | internal is then going to be static linkage, so
                | comparable to LuaJIT, or inlined (either by the
                | compiler initially or with something like LTO) and
                | then even faster than LuaJIT.
        
             | joeld42 wrote:
             | Not really, you're usually setting up commands and buffers
             | and stuff in Vulkan. If you're making millions of calls a
             | frame, you're going to have other bottlenecks.
             | 
              | My favorite example is something like Substance
              | Designer's node graph or Disney's SeExpr. You often want
              | custom nodes that do something trivial, like a lookup
              | from a custom data format or a small math evaluation,
              | but the node is potentially called a handful of times
              | per pixel, on millions of pixels. The calling overhead
              | often comes out to as much time as the operation or
              | more, but there's no easy way to rearrange the
              | operations without making things a lot more complicated
              | for everyone.
             | 
              | I kind of like Python's approach: make it so slow that
              | it's easy to notice when you're hitting the bottleneck.
              | It encourages you to write stuff that works in larger
              | operations, and you get things like NumPy and
              | TensorFlow, which are some of the fastest things out
              | there despite having the slowest bindings.
             | 
             | https://www.disneyanimation.com/technology/seexpr-
             | expression...
        
       | exebook wrote:
        | I developed a terminal emulator, file manager, and text editor
        | (Deodar) 8 years ago in JavaScript/V8 with native C++ calls.
        | It worked, but I was extremely disappointed by the speed: it
        | felt like you had to go through passport control every time
        | you called a C++ function.
        
         | Koromix wrote:
          | The official solutions, node-ffi and node-ffi-napi, are
          | extremely slow, with overhead hundreds of times higher than
          | it should be. I don't know what they do to be so slow.
          | 
          | I'm making my own FFI module for Node.js, Koffi, as a much
          | faster alternative. You can see some benchmarks here,
          | compared with node-ffi-napi:
          | https://www.npmjs.com/package/koffi#benchmarks
        
           | SemanticStrengh wrote:
            | An interesting alternative would be to have no FFI at all
            | and use transparent polyglot interop between JavaScript
            | and C++ via GraalJS/Sulong.
        
         | ZiiS wrote:
         | Probably butter smooth after eight years of V8 development and
         | Moore's Law (well whatever passes for it now).
        
           | kllrnohj wrote:
           | Probably not since single threaded improvement has barely
           | advanced over the last 8 years, and JS/V8 are still in a
           | single threaded world that stopped existing a decade ago.
        
         | exikyut wrote:
         | Oh nice, with a Norton Commander-alike terminal UI.
         | 
         | Screenshots at https://sourceforge.net/projects/deodar/
         | 
         | Last-modified 2018 over at https://github.com/exebook/deodar
         | 
         | I'm not sure where the use of Yacc ends and
         | https://github.com/exebook/elfu begins in the .yy files (which
         | are sprinkled with very algebraic-looking Unicode throughout).
         | The Pascal-like class definitions may be defined in
         | https://github.com/exebook/intervision.
         | 
         | Very interesting project.
        
       | dgan wrote:
        | I had to run it to believe it: I confirm it's 183 seconds(!)
        | for python3 on my laptop.
        | 
        | Also OCaml, because I was interested (milliseconds):
        | 
        |     ocaml(int,noalloc,native)   = 2022
        |     ocaml(int,alloc,native)     = 2344
        |     ocaml(int,untagged,native)  = 1912
        |     ocaml(int32,noalloc,native) = 1049
        |     ocaml(int32,alloc,native)   = 1556
        |     ocaml(int32,boxed,native)   = 7544
        
       | khoobid_shoma wrote:
        | I guess it is better to measure CPU time instead of wall time
        | (e.g. using clock()).
        
       | tomas789 wrote:
        | There is no Python benchmark, but you can find a PR claiming
        | it takes 123,198 ms. That would be the worst result by a wide
        | margin.
       | 
       | https://github.com/dyu/ffi-overhead/pull/18
        
         | sk0g wrote:
         | C FFI takes 123 seconds?! That's pretty insane, but if you mean
         | 123.2 ms, it's still very bad.
         | 
          | It doesn't feel like that would be the case from using
          | NumPy, PyTorch, and the like, but they also typically run
          | 'fat' functions, where one function takes a lot of data and
          | returns something. You usually don't chain or loop much
          | there.
         | 
         | Edit: the number was for 500 million calls. Yeah, don't think
         | I've ever made that many calls. 123 seconds feels fairly short
         | then, except for demanding workflows like game dev maybe.
        
           | seniorsassycat wrote:
           | 500 million calls in 123 seconds
        
           | NoahKAndrews wrote:
           | I think that's the time to run the whole benchmark suite.
           | Compare to the results for go, for example.
        
         | remram wrote:
         | cffi is probably the canonical way to do this on Python, I
         | wonder what the performance is there.
         | 
         | edit: 30% improvement, still 100x slower than e.g. Rust.
        
           | kevin_thibedeau wrote:
           | If you need a fast loop in Python then switch to Cython.
        
             | cycomanic wrote:
              | Using a Cython binding compared to the ctypes one gives
              | a speedup of a factor of 3. That's still not very fast,
              | so now putting the whole thing into a Cython program,
              | like so:
              | 
              |     cdef extern from "newplus/plus.h":
              |         cpdef int plusone(int x)
              | 
              |     cdef extern from "newplus/plus.h":
              |         cpdef long long current_timestamp()
              | 
              |     def run(int count):
              |         cdef int start
              |         cdef int out
              |         cdef int x = 0
              |         start = current_timestamp()
              |         while x < count:
              |             x = plusone(x)
              |         out = current_timestamp() - start
              |         return out
              | 
              | actually yields 597, compared to the pure C program
              | yielding 838.
        
             | remram wrote:
             | That's fine for a _tight loop_. Performance might still
             | matter in a bigger application. This benchmark is measuring
             | the overhead, which is relevant in all contexts; the fact
             | that it does it with a loop is a statistical detail.
        
         | spullara wrote:
          | All Python code generally does is call C/C++ code, and
          | you're telling me it's slow to do even that? Yikes.
        
       | ta988 wrote:
        | Java has Project Panama coming, which may improve things a
        | little.
        
       | [deleted]
        
       | planetis wrote:
       | That Nim version has just left kindergarten and is prepping for
       | elementary.
        
       | WalterBright wrote:
        | The D programming language has literally zero overhead when
        | interfacing with C: the same calling conventions are used,
        | and the types are the same.
        | 
        | D can also access C code by simply importing a .c file:
        | 
        |     import foo;  // call functions from foo.c
        | 
        | analogously to how you can `#include "foo.h"` in C++.
        
       | mhh__ wrote:
        | This needs LTO; with that, it will have zero overhead in the
        | compiled languages.
        | 
        | D can actually compile the C code in this test now.
        
       | [deleted]
        
       | sdze wrote:
       | Can you try PHP?
        
       | sk0g wrote:
        | For a game scripting language, Wren posts a pretty bad result
        | here. I think it isn't explicitly game-focused, though. The
        | version tested is also quite old, having been released in
        | 2016.
        
       | kllrnohj wrote:
       | Another major caveat to this benchmark is it doesn't include any
       | significant marshalling costs. For example, passing strings or
       | arrays from Java to C is much, much slower than passing a single
        | integer. The same is going to be true for a lot (all?) of the
        | GC'd languages, and it's especially true for strings when the
        | language isn't natively UTF-8 (as in: even though Java can
        | store strings in UTF-8 internally, it doesn't expose that
        | publicly, so JNI doesn't benefit).
        
         | jimmaswell wrote:
         | How is that possible? It's not just passing pointers?
        
           | glouwbug wrote:
            | Likely a malloc'd copy to appease the 8-bit char ABI.
        
           | lelanthran wrote:
           | > How is that possible? It's not just passing pointers?
           | 
           | No. A Java string is a "pointer" to an array of 16-bit
           | integers (each element is a 2-byte character). A C string is
           | a pointer to an array of 8-bit integers.
           | 
           | You have to first convert the Java string to UTF8, then
           | allocate an array of 1-byte _unsigned_ integers, then copy
           | the UTF8 into it, and only then can you pass it to a C
           | function that expects a string.
        
             | vvanders wrote:
              | Let's not forget that it's _modified_ UTF-8 [1] you get
              | back from JNI, lest you think you'll be able to use the
              | buffer as-is.
             | 
             | [1] https://docs.oracle.com/javase/10/docs/specs/jni/types.
             | html#...
        
             | jimmaswell wrote:
             | Guess I was missing the context, I thought this was just
             | within Java.
        
             | ReactiveJelly wrote:
             | Qt (C++ framework) is also UTF-16, so maybe if you're lucky
             | you could pass strings between Java and Qt without
             | transcoding?
        
             | spullara wrote:
             | Will be interesting to see how Project Panama does on this
             | kind of benchmark.
             | 
             | https://openjdk.java.net/projects/panama/
        
         | adgjlsfhk1 wrote:
          | Julia (one of the two fastest languages here) is GCed. GC
          | only makes C interop hard if you move objects.
        
       | alkonaut wrote:
       | Any idea why mono is used rather than .NET here?
        
         | DoingIsLearning wrote:
         | From the readme:
         | 
         | > My environment:
         | 
         | > [...]
         | 
         | > - Ubuntu 14.04 x64
         | 
         | > [...]
         | 
          | Mono can run on *nix targets, but .NET itself (not .NET
          | Core) is still very much Windows-only.
        
           | alkonaut wrote:
           | The "normal" .NET runs on Linux as much as Java or python.
           | 
           | The framework/core divide no longer exists. That's why I'm
           | asking.
        
             | richardwhiuk wrote:
             | The benchmark is ancient.
        
           | juki wrote:
           | .NET core is .NET nowadays. The old Windows only .NET
           | framework isn't being developed anymore (other than fixes).
        
           | vopi wrote:
            | .NET Core doesn't exist anymore; it's now just .NET. It is
            | the successor to the old Windows-only .NET Framework, and
            | it is very much not Windows-only.
        
       | ryukoposting wrote:
        | This is a cool concept, but the implementation is contrived
        | (as many others have described); e.g., JNI array
        | marshalling/unmarshalling has a lot of overhead. The Nim
        | version is super outdated too (not sure about the other
        | languages).
        
       | dunefox wrote:
       | > - julia 0.6.3
       | 
       | That's an ancient version, the current version is v1.7.2.
        
         | Sukera wrote:
         | I don't think that matters here - the FFI interface hasn't
         | changed and I wouldn't expect it to differ significantly.
        
       | TazeTSchnitzel wrote:
       | It seems Rust has basically no overhead versus C, but it could
       | have _negative_ overhead if you use cross-language LTO. Of
       | course, you can do LTO between C files too, so that would be
        | unfair. But I think this sets it apart from languages that,
        | even with a highly optimised FFI, don't have compiler support
        | for LTO with C code.
        
         | DonHopkins wrote:
         | >negative overhead
         | 
         | underhead
        
           | otikik wrote:
           | Overtail
        
         | Normal_gaussian wrote:
         | For those of us not up on our compilation acronyms - this is
         | Link Time Optimisation.
        
           | TazeTSchnitzel wrote:
           | Ah yes indeed. LTO means the compiler can do certain
           | optimisations across compilation units (in C/C++ those are
           | your .c/.cpp files, in Rust the unit is the entire crate),
           | notably inlining.
        
       ___________________________________________________________________
       (page generated 2022-05-14 23:01 UTC)