[HN Gopher] Comparing the C FFI overhead on various languages
___________________________________________________________________
Comparing the C FFI overhead on various languages
Author : generichuman
Score : 105 points
Date : 2022-05-14 10:49 UTC (12 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| KSPAtlas wrote:
| What about Common Lisp?
| medo-bear wrote:
| there is a pretty powerful CFFI package (library) to achieve
| this, however performance will be very implementation
| dependent. in case someone wants to try this, the de facto
| standard, free, open-source, speedy implementation is SBCL
| cube2222 wrote:
| Just a caveat, not sure if it matters in practice, but this
| benchmark is using very old versions of many languages it's
| comparing (5 year old ones).
| kllrnohj wrote:
| It probably matters for a few of the slower ones, like Java,
| Go, or Dart. It's also going to matter on what platform. eg,
| Java may have better FFI on x86 than on ARM. Or similarly
| Dart's FFI may be better on ARM than on x86, particularly given
| Flutter is the primary user these days.
|
| And then to make it even more complicated it's also going to
| potentially depend on the GC being used. For example for Java's
| JNI it's actually the bookkeeping for the GC that takes the
| most time in that FFI transition (can't pause the thread to
| mark the stack for a concurrent GC when it's executing random C
| code, after all). Which is going to potentially depend on what
| the specific GC being used requires.
| haberman wrote:
| Some of the results look outdated. The Dart results look bad (25x
| slower than C), but looking at the code
| (https://github.com/dyu/ffi-overhead/tree/master/dart) it appears
| to be five years old. Dart has a new FFI as of Dart 2.5 (2019):
| https://medium.com/dartlang/announcing-dart-2-5-super-charge...
| I'm curious how the new FFI would fare in these benchmarks.
| elcritch wrote:
| Actually looks like most of the languages are seriously
| outdated. Nim and Julia are both way outdated, Elixir is pretty
| outdated.
| kcb wrote:
| Because their environment is using an Ubuntu version from 8
| years ago. So a better title would be "Comparing the C FFI
| overhead on various languages in 2014"
| meibo wrote:
| Same with C#, it only benchmarks an old version of mono and not
| .NET Core, which has received several big performance boosts in
| recent releases.
| SemanticStrengh wrote:
| Java has a new API for FFI: the Foreign Function & Memory API
| throw827474737 wrote:
| So why isn't C the baseline (and zig and rust being pretty close
| to it quite expected), but both luajit and julia are
| significantly faster??
| gallexme wrote:
| https://nullprogram.com/blog/2018/05/27/
| eatonphil wrote:
| > For the C "FFI" he used standard dynamic linking, not
| dlopen(). This distinction is important, since it really
| makes a difference in the benchmark. There's a potential
| argument about whether or not this is a fair comparison to an
| actual FFI, but, regardless, it's still interesting to
| measure
| arinlen wrote:
| > There's a potential argument about whether or not this is
| a fair comparison to an actual FFI, but, regardless, it's
| still interesting to measure (...)
|
| If there's interest in measuring dynamic linking then
| wouldn't there be an interest in measuring it on all
| languages that support dynamic linking?
| jcelerier wrote:
| With clang, just compiling with -fno-plt gives me:
|
|     jit: 1.003483 ns/call
|     plt: 1.254158 ns/call
|     ind: 1.254616 ns/call
|
| GCC does not seem to support it though, even if it accepts
| the flag, and gives me:
|
|     jit: 1.003483 ns/call
|     plt: 1.502089 ns/call
|     ind: 1.254616 ns/call
|
| (tried everything I could think of that would have a chance
| to make the PLT disappear:
|
|     cc -fno-plt -Bsymbolic -fno-semantic-interposition -flto \
|        -std=c99 -Wall -Wextra -O3 -g3 -Wl,-z,relro,-z,now \
|        -o benchmark benchmark.c ./empty.so -ldl
|
| without any change on GCC)
| miohtama wrote:
| Is there anything akin to FFI but with static linking, for
| any foreign (non-C) language?
| junon wrote:
| The question as it stands makes a few assumptions I don't
| think one can make, and as such is a bit tricky to answer
| cleanly, but I'll try.
|
| Yes it's just called linking. The language needs to be
| aware of calling conventions and perhaps side effects and
| be prepared for no additional intrinsic support for higher
| level features.
|
| It probably also needs to be able to read C headers, because
| C symbols do not carry type signatures the way the mangled
| names emitted by many C++ compilers do.
|
| There's no "library" or some out of the box solution for
| this, if that's what you're asking. This boils down to how
| programs are constructed and, moreso, how CPUs work.
|
| In most (all?) cases, anything higher level than straight-
| up linking is headed toward FFI territory.
| tlb wrote:
| Calling WebAssembly from Javascript, sort of?
|
| In the early Python 2 era there was an option to build an
| interpreter binary with statically linked C stubs, and it
| was noticeably faster and let you access Python data
| structures from C. I used it for robotics code for speed.
| It was inconvenient because you had to link in all the
| modules you needed.
| fweimer wrote:
| For OpenJDK, there is JEP 178:
| https://openjdk.java.net/jeps/178 I haven't seen it used in
| practice.
|
| Ocaml's C-implemented functions are linked statically. But
| like JNI, the C functions have special names and type
| signatures, so it is slightly different from, say, ctypes
| in Python.
|
| CGO for Go is statically linked, too. Its overhead stems
| from significant differences between the Go and C world.
| The example uses dynamic linking, but it would not have to
| do that.
| samatman wrote:
| LuaJIT can use the FFI against statically linked object
| code just fine, I'm not sure if that answers your question
| since in this context it must be embedded in a C program.
|
| It's a hard requirement of static linking that you end up
| with just one binary, so it might answer your question.
| qalmakka wrote:
| I'm always pretty surprised when I find out most people
| writing C or C++ have no idea that PLTs exist. They have a
| small but not negligible cost.
| [deleted]
| bachmeier wrote:
| C, C++, Zig, Rust, D, and Haskell are all similar because
| they're basically doing the same thing. Someone else linked to
| the blog post, but Lua and Julia aren't doing the same thing,
| so they get different results.
|
| > both luajit and julia are significantly faster
|
| I would be interested if anyone has an example where the
| difference matters in practice. As soon as you move to the more
| realistic scenario where you're writing a program that does
| something other than what is measured by these benchmarks,
| that's not going to be your biggest concern.
| mananaysiempre wrote:
| ETA: I see now I was answering the wrong question: you were
| asking about the comparison between C and LuaJIT, not heavier
| FFIs and C/LuaJIT.
|
| Honestly I think of the difference (as discussed in Wellons's
| post among others) not as a performance optimization but as
| an anti-stupidity optimization: regardless of the performance
| impact, it's _stupid_ that the standard ELF ABI forces us to
| jump through these hoops for every foreign call, and even
| stupider that plain inter- and even intra-compilation-unit
| calls can also be affected unless you take additional
| measures. Things are also being fixed on the C side with
| things such as -fvisibility=, -fno-semantic-interposition,
| -fno-plt, and new relocation types.
|
| Can this be relevant to performance? Probably--aside from
| just doing more stuff, there are trickier-to-predict parts of
| the impact such as buffer pressure on the indirect branch
| predictor. Does it? Not sure. The theoretical possibility of
| interposition preventing inlining of publicly-accessible
| functions is probably much more important, at the very least
| I _have_ seen it make a difference. But this falls outside
| the scope of FFI, strictly speaking, even if the cause is
| related.
|
| ---
|
| I don't have a readily available example, but in the LuaJIT
| case there are two considerations that I can mention:
|
| - FFI is not just cheap but gets into the realm of a native
| call (perhaps an indirect one), so a well-adapted inner loop
| is not ruined even if it makes several FFI calls per
| iteration (it will still be slower, but this is fractions not
| multiples unless the loop did not allocate at all before the
| change). What this influences is perhaps not even the final
| performance but the shape of the API boundary: similarly to
| the impact of promise pipelining for RPC[1], you're no longer
| forced into the "construct job, submit job" mindset and
| coarse-grained calls (think NumPy). Even calling libm
| functions through the FFI, while probably not very smart,
| isn't an instant death sentence, so not as many things are
| forced to be reimplemented in the language as you're used to.
|
| - The JIT is wonderfully speedy and simple, but draws much of
| that speed and simplicity from the fact that it really only
| understands two shapes of control flow: straight-line code;
| and straight-line code leading into a loop with straight-line
| code in the body. Other control transfers aren't banned as
| such, but are built on top of these, can only be optimized
| across to a limited extent, and can confuse the machinery
| that decides what to trace. This has the unpleasant corollary
| that builtins, which are normally implemented as baked-in
| bytecode, can't usefully have loops in them. The solution
| uses something LuaJIT 2.1 calls _trace stitching_ : the
| problematic builtins are implemented in normal C and are free
| to have arbitrarily complex control flow inside, but instead
| of outright aborting the trace due to an unJITtable builtin
| the compiler puts what is effectively an FFI call into it.
|
| [1] https://capnproto.org/rpc.html
| vvanders wrote:
| Oh it totally matters, any sort of chatty interface over FFI
| you will pay for it.
|
| There's a reason a lot of gamedev uses luajit, I've
| personally had to refactor many interfaces to avoid JNI calls
| as much as possible as there was significant overhead(both in
| the call and from the VM not being able to optimize around
| it).
| kllrnohj wrote:
| The reason a lot of gamedev uses luajit is the ease at
| which it can be embedded.
|
| And that's not really even true anymore as the majority of
| gamedev is using Unreal or Unity, neither of which use
| luajit.
| vvanders wrote:
| It's not just how easy it is to embed, it's also really
| small in both code + runtime size. I've shipped it on
| systems with sub 8mb of total system memory(we used a
| preallocated 400kb block), until quickjs came along there
| really wasn't anything comparable. It was also much
| faster than anything else at the time and regularly beat
| v8 in the benchmarks I ran.
|
| Unity+Unreal are the public engines out there but there's
| plenty of in-house engines and tool chains you don't
| really hear about. I wouldn't be surprised if it's still
| deployed in quite a few contexts.
| kllrnohj wrote:
| > I would be interested if anyone has an example where the
| difference matters in practice.
|
| Vulkan. Any sort of binding to Vulkan over a non-trivial FFI
| (so like, not from C++, Rust, etc...) is going to be murdered
| by this FFI overhead cost. Especially since for bindings from
| something like Java you're either paying FFI on every field
| set on a struct, or you're paying non-trivial marshalling
| costs to convert from a Java class to a C struct to then
| finally call the corresponding Vulkan function.
| bachmeier wrote:
| > Especially since for bindings from something like Java
|
| I guess I wasn't clear, but I meant the difference between
| C and Luajit.
| kllrnohj wrote:
| Ah. The answer to that is a lot more murky, since in an
| actual C/C++ program you're going to have a mix of local,
| static, and dynamic linking. You're generally not putting
| super chatty stuff across a dynamic linkage, since that
| tends to be where the stable API boundaries go. Anything
| internal is then going to be static linkage, so
| comparable to luajit, or inlined (either by the compiler
| initially or with something like LTO) and then even
| faster than luajit
| joeld42 wrote:
| Not really, you're usually setting up commands and buffers
| and stuff in Vulkan. If you're making millions of calls a
| frame, you're going to have other bottlenecks.
|
| My favorite example is something like Substance designer's
| node graph or Disney's SeExpr. You'd often want custom
| nodes that do often something trivial like a lookup from a
| custom data format or a small math evaluation, but you're
| calling the node potentially a handful of times per pixel,
| on millions of pixels. The calling overhead often comes out
| to take as much time or more than the operation, but
| there's no easy way to rearrange the operations without
| making things a lot more complicated for everyone.
|
| I kind of like python's approach: make it so slow that it's
| easy to notice when you're hitting the bottleneck. It
| encourages you to write stuff that works in larger
| operations, and you get things like numpy and tensorflow,
| which are some of the fastest things out there despite the
| slowest bindings.
|
| https://www.disneyanimation.com/technology/seexpr-
| expression...
| exebook wrote:
| I developed a terminal emulator, file manager and text editor
| Deodar 8 years ago in JavaScript/V8 with native C++ calls, it
| worked but I was extremely disappointed by speed, it felt so slow
| like you need to do a passport control each time you call a C++
| function.
| Koromix wrote:
| The official solutions, node-ffi and node-ffi-napi, are
| extremely slow, with an overhead hundreds of times higher
| than it should be. I don't know what they do to be so slow.
|
| I'm making my own FFI module for Node.js, Koffi, as a much
| faster alternative. You can see some benchmarks here, to
| compare with node-ffi-napi:
| https://www.npmjs.com/package/koffi#benchmarks
| SemanticStrengh wrote:
| An interesting alternative would be to not have any FFI and
| to use transparent polyglot interop between javascript and
| c++ via GraalJs /sulong
| ZiiS wrote:
| Probably butter smooth after eight years of V8 development and
| Moore's Law (well whatever passes for it now).
| kllrnohj wrote:
| Probably not since single threaded improvement has barely
| advanced over the last 8 years, and JS/V8 are still in a
| single threaded world that stopped existing a decade ago.
| exikyut wrote:
| Oh nice, with a Norton Commander-alike terminal UI.
|
| Screenshots at https://sourceforge.net/projects/deodar/
|
| Last-modified 2018 over at https://github.com/exebook/deodar
|
| I'm not sure where the use of Yacc ends and
| https://github.com/exebook/elfu begins in the .yy files (which
| are sprinkled with very algebraic-looking Unicode throughout).
| The Pascal-like class definitions may be defined in
| https://github.com/exebook/intervision.
|
| Very interesting project.
| dgan wrote:
| I had to run it to believe it; I confirm it's 183 seconds(!)
| for python3 on my laptop.
|
| Also, OCaml, because I was interested (milliseconds):
|
|     ocaml(int,noalloc,native)   = 2022
|     ocaml(int,alloc,native)     = 2344
|     ocaml(int,untagged,native)  = 1912
|     ocaml(int32,noalloc,native) = 1049
|     ocaml(int32,alloc,native)   = 1556
|     ocaml(int32,boxed,native)   = 7544
| khoobid_shoma wrote:
| I guess it is better to measure CPU time instead of wall time
| (e.g. using clock() ).
| tomas789 wrote:
| There is no Python benchmark, but you can find a PR claiming
| it takes 123,198ms. That would be the worst one by a wide
| margin.
|
| https://github.com/dyu/ffi-overhead/pull/18
| sk0g wrote:
| C FFI takes 123 seconds?! That's pretty insane, but if you mean
| 123.2 ms, it's still very bad.
|
| Doesn't feel like that would be the case from using NumPy,
| PyTorch and the likes, but they also typically run 'fat'
| functions, where it's one function with a lot of data that
| returns something. Usually don't chain or loop much there.
|
| Edit: the number was for 500 million calls. Yeah, don't think
| I've ever made that many calls. 123 seconds feels fairly short
| then, except for demanding workflows like game dev maybe.
| seniorsassycat wrote:
| 500 million calls in 123 seconds
| NoahKAndrews wrote:
| I think that's the time to run the whole benchmark suite.
| Compare to the results for go, for example.
| remram wrote:
| cffi is probably the canonical way to do this on Python, I
| wonder what the performance is there.
|
| edit: 30% improvement, still 100x slower than e.g. Rust.
| kevin_thibedeau wrote:
| If you need a fast loop in Python then switch to Cython.
| cycomanic wrote:
| Using a cython binding compared to the ctypes one gives a
| speedup of a factor of 3. That's still not very fast; now
| putting the whole thing into a cython program, like so:
|
|     cdef extern from "newplus/plus.h":
|         cpdef int plusone(int x)
|
|     cdef extern from "newplus/plus.h":
|         cpdef long long current_timestamp()
|
|     def run(int count):
|         cdef long long start
|         cdef long long out
|         cdef int x = 0
|         start = current_timestamp()
|         while x < count:
|             x = plusone(x)
|         out = current_timestamp() - start
|         return out
|
| actually yields 597, compared to the pure C program yielding
| 838.
| remram wrote:
| That's fine for a _tight loop_. Performance might still
| matter in a bigger application. This benchmark is measuring
| the overhead, which is relevant in all contexts; the fact
| that it does it with a loop is a statistical detail.
| spullara wrote:
| All python code generally does is call C/C++ code and you're
| telling me it is slow to do that as well? Yikes.
| ta988 wrote:
| Java has project Panama coming that may improve things a little.
| [deleted]
| planetis wrote:
| That Nim version has just left kindergarten and is prepping for
| elementary.
| WalterBright wrote:
| The D programming language has literally zero overhead when
| interfacing with C. The same calling conventions are used,
| and the types are the same.
|
| D can also access C code by simply importing a .c file:
|
|     import foo; // call functions from foo.c
|
| analogously to how you can `#include "foo.h"` in C++.
| mhh__ wrote:
| Needs LTO, with that it will have 0 overhead in the compiled
| languages.
|
| D can actually compile the C code in this test now.
| [deleted]
| sdze wrote:
| Can you try PHP?
| sk0g wrote:
| For a game scripting language, Wren posts a pretty bad result
| here. I think it isn't explicitly game-focused though. The
| version tested is quite old, however, having been released in
| 2016.
| kllrnohj wrote:
| Another major caveat to this benchmark is it doesn't include any
| significant marshalling costs. For example, passing strings or
| arrays from Java to C is much, much slower than passing a single
| integer. Same is going to be true for a lot (all?) of the GC'd
| languages, and especially true for strings when the language
| isn't utf8 natively (as in, even though Java can store in utf8
| internally, it doesn't expose that publicly so JNI doesn't
| benefit)
| jimmaswell wrote:
| How is that possible? It's not just passing pointers?
| glouwbug wrote:
| Likely a malloc'd copy to appease the 8bit char ABI
| lelanthran wrote:
| > How is that possible? It's not just passing pointers?
|
| No. A Java string is a "pointer" to an array of 16-bit
| integers (each element is a UTF-16 code unit). A C string is
| a pointer to an array of 8-bit integers.
|
| You have to first convert the Java string to UTF8, then
| allocate an array of 1-byte _unsigned_ integers, then copy
| the UTF8 into it, and only then can you pass it to a C
| function that expects a string.
| vvanders wrote:
| Let's not forget that it's _modified_ UTF-8[1] you get back
| from JNI, lest you think that you'll be able to use the
| buffer as-is.
|
| [1] https://docs.oracle.com/javase/10/docs/specs/jni/types.
| html#...
| jimmaswell wrote:
| Guess I was missing the context, I thought this was just
| within Java.
| ReactiveJelly wrote:
| Qt (C++ framework) is also UTF-16, so maybe if you're lucky
| you could pass strings between Java and Qt without
| transcoding?
| spullara wrote:
| Will be interesting to see how Project Panama does on this
| kind of benchmark.
|
| https://openjdk.java.net/projects/panama/
| adgjlsfhk1 wrote:
| Julia (one of the 2 fastest languages here) Is GCed. GC only
| make C interop hard if you move objects.
| alkonaut wrote:
| Any idea why mono is used rather than .NET here?
| DoingIsLearning wrote:
| From the readme:
|
| > My environment:
|
| > [...]
|
| > - Ubuntu 14.04 x64
|
| > [...]
|
| Mono can run on *nix targets but .NET itself (not .NET Core)
| is still very much Windows-only.
| alkonaut wrote:
| The "normal" .NET runs on Linux as much as Java or python.
|
| The framework/core divide no longer exists. That's why I'm
| asking.
| richardwhiuk wrote:
| The benchmark is ancient.
| juki wrote:
| .NET core is .NET nowadays. The old Windows only .NET
| framework isn't being developed anymore (other than fixes).
| vopi wrote:
| .NET core doesn't exist anymore. It's now just .NET. It is
| the successor to the old windows only .NET. It is very much
| not windows only.
| ryukoposting wrote:
| This is a cool concept, but the implementation is contrived
| (as many others describe), e.g. JNI array
| marshalling/unmarshalling has a lot of overhead. The Nim
| version is super outdated too (not sure about the other
| languages).
| dunefox wrote:
| > - julia 0.6.3
|
| That's an ancient version, the current version is v1.7.2.
| Sukera wrote:
| I don't think that matters here - the FFI interface hasn't
| changed and I wouldn't expect it to differ significantly.
| TazeTSchnitzel wrote:
| It seems Rust has basically no overhead versus C, but it could
| have _negative_ overhead if you use cross-language LTO. Of
| course, you can do LTO between C files too, so that would be
| unfair. But I think this sets it apart from languages that,
| even with a highly optimised FFI, don't have compiler support
| for LTO with C code.
| DonHopkins wrote:
| >negative overhead
|
| underhead
| otikik wrote:
| Overtail
| Normal_gaussian wrote:
| For those of us not up on our compilation acronyms - this is
| Link Time Optimisation.
| TazeTSchnitzel wrote:
| Ah yes indeed. LTO means the compiler can do certain
| optimisations across compilation units (in C/C++ those are
| your .c/.cpp files, in Rust the unit is the entire crate),
| notably inlining.
___________________________________________________________________
(page generated 2022-05-14 23:01 UTC)