[HN Gopher] Cost of an Integer Cast in Go
___________________________________________________________________
Cost of an Integer Cast in Go
Author : kiyanwang
Score : 78 points
Date : 2022-08-30 08:06 UTC (14 hours ago)
(HTM) web link (boyter.org)
(TXT) w3m dump (boyter.org)
| foota wrote:
| It's weird to see this analysis without any look at the generated
| assembly. They've also added a data dependency between benchmark
| loops, which is going to matter. Depending on the size they're
| testing with they're also just going to be benchmarking memory
| bandwidth.
| anonymoushn wrote:
| The code does not seem to perform any reads or writes other
| than to registers, so I would be surprised if memory bandwidth
| mattered.
| foota wrote:
| Iiuc it's iterating over a list of integers? Reading those
| from memory requires bandwidth.
|
| Edit: nevermind, I thought i was from an array, not just
| incremented.
| thayne wrote:
| So all the tests were done on an M1 Mac. Is there any difference
| on x86_64?
|
| I wouldn't expect there to be, but unless this program is only
| going to be running on new Macs, this analysis isn't really
| complete.
| wruza wrote:
| This kind of benchmark makes zero sense. Sorry for being blunt:
| it would be okay for someone who is just starting to program,
| but this "have been doing interviews" part leaves an aftertaste.
| One must know their shit before they "present a small chunk of
| code and ask them to review it over 15 minutes pointing out any
| issues they see".
| beanjuiceII wrote:
| Honestly I can't even understand how this is real... this seems
| like exactly the person I wouldn't want doing interviews, or
| even working as a developer at my company.
| tehmillhouse wrote:
| As a compiler guy, I'd appreciate some look at the layers of
| abstraction in-between (so, ASM). Microbenchmarks are famously
| jittery on modern CPUs due to cache effects, branch prediction,
| process eviction, data dependencies, pipeline stalls, OoO
| execution, instruction level parallelism, etc.
|
| You have to be really careful to ensure the numbers you're
| getting are really coming from the thing you're trying to
| benchmark, and aren't corrupted beyond recognition by your
| benchmarking harness.
|
| Some of these questions would surely be trivial if I actually
| knew any Go, but I'm left wondering:
|
| * What does the machine code / assembly look like for this? What
| does the cast compile down to?
|
| * What's `int` an alias for? I assume a 64-bit signed integer?
|
| * Are integer casts checked in Go? Would an overflowing cast
| fault?
| silvestrov wrote:
| Microbenchmarks are especially problematic in real-world code
| where you load stuff from random addresses in memory.
|
| If the code after the cast is blocked on a memory load, then
| you have a lot of free instructions while the CPU is waiting
| for the memory load to complete. In that case it doesn't matter
| whether the cast is free or takes a handful of instructions.
|
| Sometimes code becomes faster by using more instructions to
| make the data more compact so more of the data stays in the
| caches.
| Cthulhu_ wrote:
| Microbenchmarks are only valid for the one function under
| test, and only on the current machine; they're all right for
| optimizing on particular hardware, but not so much to go out
| into the world and go "X is faster than Y"
|
| That said, I did like this website where you could set up JS
| benchmarks; they would run on your own machine and you could
| compare how they ran on other people's systems. It wasn't
| perfect, but it gave a decent indication of whether X was
| faster than Y. Of course, it's a snapshot in time; JS engines
| have gone through tons of optimizations over the years.
| silvestrov wrote:
| My point is that the context of the function can invalidate
| a microbenchmark completely.
|
| If you only call this function once in a while, then the
| context is more important than the function.
|
| You can only ignore the context when you do video decoding
| or matrix inversion or similar "context free" long running
| code.
| anonymoushn wrote:
| int is defined to be the pointer width (or something like
| this), so probably int64 where the OP is running their code.
|
| Integer casts are unchecked.
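|
| To make "unchecked" concrete, a quick sketch (nothing from the
| article, just the language behaviour): conversions silently keep
| the low bits and never panic.
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         big := int64(1<<40 + 7)
|         fmt.Println(int32(big)) // keeps the low 32 bits: prints 7
|
|         n := 300
|         fmt.Println(uint8(n)) // wraps modulo 256: prints 44
|
|         m := 200
|         fmt.Println(int8(m)) // low 8 bits, signed: prints -56
|     }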
| stargrazer wrote:
| and is it aligned properly
| titzer wrote:
| All are good points. Also, loop unrolling and induction
| variable analysis. LLVM is particularly aggressive at trying to
| optimize inductions. It will literally turn a "sum of i from 0
| to N" into "n(n+1)/2", among others.
|
| It's really important to look at the actual machine code.
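|
| Concretely, the rewrite is something like this (Go syntax just
| for illustration; as far as I know Go's own gc compiler does not
| do this particular transformation):
|
|     package sum
|
|     // What you write: O(n) additions.
|     func sumLoop(n int) int {
|         total := 0
|         for i := 0; i <= n; i++ {
|             total += i
|         }
|         return total
|     }
|
|     // What LLVM can turn it into: the closed form n(n+1)/2.
|     func sumClosedForm(n int) int {
|         return n * (n + 1) / 2
|     }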
| zasdffaa wrote:
| An aside.
|
| > It will literally turn a "sum of i from 0 to N" into
| "n(n+1)/2", among others[1]
|
| Yeah, seen that in a godbolt YouTube vid. The question is,
| should it do this? Or should it force you to use a library, by
| reporting what you're trying to do and telling you there's an
| easier way ("sum of 1 to n is <formula>; instead of a loop use
| the library function 'sum1toN()'")?
|
| I think getting too clever risks hurting the user by not
| letting them know there's a better way.
|
| [1] actually it seems to do a slightly different version of
| this to prevent risk of overflow, but same result.
| fredrikholm wrote:
| * https://godbolt.org/z/3dEdha44W
|
| * Architecture-dependent, so a 32- or 64-bit signed integer.
|
| * No faults. Signed become -1 and unsigned become MAX.
| Thaxll wrote:
| This is using GCC Go, which no one actually uses.
|
| Edit: why I'm being downvoted:
| https://go.godbolt.org/z/699d7KjWr
| bayindirh wrote:
| I for one develop on GCC-Go. The only reason I chose Go as
| my next project's language is that it has a GCC
| implementation (I have a strict rule about GPL-licensed
| development toolchains for my own projects).
| pjmlp wrote:
| Anyone who cares about the extra performance that the
| decades-old, battle-tested GCC backend is able to squeeze
| out of Go code.
| Thaxll wrote:
| I did not test it, but I assume the Go compiler is "much"
| faster than the "semi-old" GCC implementation.
| erik_seaberg wrote:
| I think he meant the performance of the compiled binary,
| not the build.
| pjmlp wrote:
| I was referring to the quality of generated machine code.
| Thaxll wrote:
| I was also talking about that compiled code.
| pjmlp wrote:
| Doesn't look like it, since you mentioned the Go compiler
| being faster, which is true but isn't about the generated
| machine code; the reference compiler doesn't do many of
| GCC's optimization passes.
| bayindirh wrote:
| As far as I can see, GCCGo implements Go 1.18 as of today
| [0]. Moreover, as noted on the official Go webpage for
| GCCGo [1], there are even some advantages to using
| GCCGo, quoting:
|
| "On x86 GNU/Linux systems the gccgo compiler is able to
| use a small discontiguous stack for goroutines. This
| permits programs to run many more goroutines, since each
| goroutine can use a relatively small stack. Doing this
| requires using the gold linker version 2.22 or later. You
| can either install GNU binutils 2.22 or later, or you can
| build gold yourself."
|
| Considering GCC can optimize well-written code to the
| point of saturating the target processor, given the
| correct target flags, I'm not entirely sure that "gc"
| would be "much" faster than GCCGo. I'm relatively new
| to Go, but I've used GCC (especially g++) for a long
| time, so I assume the same optimization prowess carries
| over to GCCGo.
|
| Last but not least, GCCGo has been part of GCC as a
| primary language since 4.7, which is an eternity in
| software terms.
|
| [0]: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgo/VE
| RSION;h=...
|
| [1]: https://go.dev/doc/install/gccgo
| traceroute66 wrote:
| I have to say I don't understand the point of debates like this
| when it comes to Go (it's like the age-old "+" vs
| "strings.Builder" debate for concatenation).
|
| The point of Go is that it's quick to learn, quick to get stuff
| done, and it's got a fantastic stdlib that lets you do most
| modern things out of the box.
|
| If someone is worrying about the cost of an integer cast in Go,
| then either: (a) they are using the wrong language and should be
| using something more low-level (e.g. Rust, C, etc.) instead, or
| (b) like many who went before them, they are engaging in
| premature optimisation.
|
| I'd put money on most people falling head-first into (b).
| anonymoushn wrote:
| It seems like there are a lot of posts online about how any
| effort at all to understand how code is executed and what
| tradeoffs are being made is a case of premature optimization.
| Retric wrote:
| I don't think that's what people are complaining about. Doing
| a poor job is.
|
| For example, += on floats is slower than += on integers. So
| does casting actually take 3x as long, or is that an artifact
| of poor test design?
| dvfjsdhgfv wrote:
| By your logic, any attempt to optimize code written in Go makes
| no sense. But there are many of us who use Go for the reasons
| you described and still care about efficiency, and especially
| about cases where trivial changes have significant consequences
| (one can argue if this particular article qualifies or not).
| avgcorrection wrote:
| No, not by their logic. The question is presented as a
| filter. What a filter tests needs to be either (1) so obvious
| that a good practitioner learns it by osmosis, or (2) so useful
| that they will need it in their daily work. Why would either of
| those hold for a Go dev job?
| kasey_junk wrote:
| The post is suggesting that they were testing a surprising
| answer to the filter. A candidate took a look at the cast
| and claimed it was inefficient. The author was double
| checking their intuition that it shouldn't be.
|
| We don't know the shape or result of the filter at all from
| the article.
| svnpenn wrote:
| > its like the age-old "+" vs "strings.Builder" debate for
| concatenation
|
| that's not even a debate. Have you done tests on this? If I
| remember right, once the string runs over the GC default (I
| think 1 GB), "+" turns from linear to exponential time. So yeah,
| I guess for most stuff it doesn't matter, but when it does
| matter it's a huge time difference.
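|
| If anyone wants to measure it themselves, a rough sketch of the
| comparison (the crossover point will depend on string sizes and
| the Go version, so treat whatever numbers you get as indicative
| only):
|
|     package concat
|
|     import (
|         "strings"
|         "testing"
|     )
|
|     var out string // sink so the result stays live
|
|     func BenchmarkPlus(b *testing.B) {
|         for i := 0; i < b.N; i++ {
|             s := ""
|             for j := 0; j < 1000; j++ {
|                 s += "x" // copies the whole string every time
|             }
|             out = s
|         }
|     }
|
|     func BenchmarkBuilder(b *testing.B) {
|         for i := 0; i < b.N; i++ {
|             var sb strings.Builder
|             for j := 0; j < 1000; j++ {
|                 sb.WriteString("x") // amortised buffer growth
|             }
|             out = sb.String()
|         }
|     }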
| HideousKojima wrote:
| While not Go, in C# you can avoid a ton of heap allocations
| by using StringBuilder instead of concatenation, and the
| crossover point for one being more efficient than the other
| is at something like 4 concatenations. But people will always
| say nonsense about "premature optimization" the moment you
| care about performance.
| svnpenn wrote:
| In regard to Go, in most cases it's a pointless
| optimization. Unless the string is quite large, it's the
| same result. Go is nothing like C# in this case.
| makapuf wrote:
| In Python since 2.5 (and I'm sure in other language
| implementations) the compiler/runtime recognizes this
| += use case and implements it as a fast path.
| traceroute66 wrote:
| > once the string runs over the GC default (I think 1 GB),
|
| Not wishing to sound snarky, but is there even a valid use
| case for concatenating GB+ quantities of strings (or even
| hundreds of MB)?
|
| I mean, surely you'll (potentially) run into memory exhaustion
| issues and other fun things?
| svnpenn wrote:
| personally I don't see a need for it, but I'm sure someone
| will do something dumb that requires it
| IshKebab wrote:
| Of course there is. 1 GB is not even that big these days.
| trelane wrote:
| The benchmarks are interesting, but that's not what caught my
| attention.
|
| I'm not convinced finding increasingly esoteric bugs in an
| interviewer-selected language is an effective gauge of coding
| ability. I expect this is actually _worse_ than whiteboard
| coding.
| robertlagrant wrote:
| Why is that?
| trelane wrote:
| I had a long answer, but my phone eated it. (thanks, bus
| jounce + accidental swipe down + hn forgetting the post
| between reloads)
|
| How about: why is finding increasingly esoteric bugs in an
| interviewer-specified language a great way of hiring software
| engineers?
| robertlagrant wrote:
| > How about: why is finding increasingly esoteric bugs in
| an interviewer-specified language a great way of hiring
| software engineers?
|
| I don't know. You were saying it wasn't, so I wondered why.
| tedunangst wrote:
| What's wrong with asking candidates to find non-esoteric
| bugs?
| nicce wrote:
| While the programmer should produce clean, good and optimised
| code, finding bugs and errors is more of a job for a pentester.
| Experienced programmers can easily detect these, but that is
| not their main job.
|
| A programmer should be able to solve problems and apply the
| selected language to solving those problems as efficiently
| as possible.
| garren wrote:
| Why a pentester and not a QA team more broadly? QA won't
| necessarily review the code (haven't met a team that did),
| but they will typically hammer a system with test cases and
| scenarios that expose unusual behavior and uncover bugs.
|
| I've had pentesters review code looking for things like
| insecure hashing or encryption, or low-hanging fruit like
| creds in the code, but I wouldn't be inclined to leave what
| is essentially a QA process to a pentester.
| robertlagrant wrote:
| You think a penetration tester should find bugs in code?
| They're looking for security weaknesses only, and almost
| never by looking at the code. Your method may be the most
| expensive possible.
| nicce wrote:
| Every bug is a weakness at some level. Currently the most
| efficient way to make money in bug bounties is by reviewing
| open-source code. Manual testing takes a huge amount of time.
| You can also automate code review for low-hanging fruit with
| tools like semgrep or GitHub's code scanning.
|
| Of course programmers should test their code themselves
| and minimise the bugs, but their job is not to look for
| them.
| eptcyka wrote:
| > Of course programmers should test their code themselves
| and minimise the bugs, but their job is not to look for
| them.
|
| I have to respectfully WAT. Code review should be a part
| of everybody's workflow. And all programmers involved
| should be looking out for bugs. The best bugs are those
| which were never merged in the first place.
| nicce wrote:
| > I have to respectfully WAT. Code review should be a
| part of everybody's workflow.
|
| Exactly, it is one workflow. Not their whole job. Their
| overall job is to solve problems with code.
| robertlagrant wrote:
| It's a subset of the job, yes. Not to be left to a pen
| tester.
| wtetzner wrote:
| I would argue their job is to solve problems with working
| code, which requires finding/avoiding bugs.
| kazinator wrote:
| We don't know what the actual coding question was; the claim
| that casting int to int32 is expensive seems to be an
| unprompted remark coming from an interviewee. I'm guessing they
| didn't get the job.
| IshKebab wrote:
| It didn't sound like they were required to find esoteric bugs.
| This is just something a candidate offered unprompted.
| trelane wrote:
| TFA said
|
| > The filter for this is a simple review exercise. We present
| a small chunk of code and ask them to review it over 15
| minutes pointing out any issues they see. The idea is to
| respect their and our time. It works pretty well and we can
| determine how much experience someone has by their ability to
| pick up the obvious vs subtle bugs.
|
| So maybe there's more? But it's definitely a mandatory,
| initial part of the interview process at the very least.
| tialaramex wrote:
| The code may be pretty bad, and it's interesting to see
| whether a candidate _spots_ that it's bad (if they don't,
| you're presumably going to get more poor code from them if
| hired) and what about it in particular they point out. It's
| an opportunity for them to show you what they know, rather
| than a box-checking exercise where you know the answers.
|
| I actually did an exercise like this about a year ago, in
| Java. I remember I really cared about iterators at the
| time, and so even though it's probably not the worst thing
| wrong with the code I just kept coming back to these
| C-style for loops, which I thought were reprehensible. I
| should ask my interviewer (whose department I now work in)
| what their impression was as a result of this obsession.
| craigching wrote:
| They even went one further and explained the answer they're
| looking for:
|
| _You lose data due to overflows which is what we expect_
|
| Seems like a reasonable question, though without more
| context that may or may not be true.
| Const-me wrote:
| > about 3x the overhead to convert from int to float.
|
| That 3x overhead has nothing to do with converting types.
|
| A CPU can't compute x += y before it has the old value of x.
| This is called a "data dependency chain".
|
| Modern AMD64 CPUs have 1 cycle of latency to add integers, 3
| cycles of latency to add FP32 or FP64 numbers. This means no
| matter what else is in the loop, a loop which increments a single
| float or double variable can't possibly be faster than 3 cycles /
| iteration.
|
| As demonstrated by the OP, Apple M1 is very similar in that
| regard.
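|
| In Go terms, the two loops below differ mostly in the latency of
| the add in that dependency chain, not in the conversion itself
| (a sketch of the shape of such a benchmark, not the article's
| actual code):
|
|     package addbench
|
|     import "testing"
|
|     var sinkInt int
|     var sinkFloat float64
|
|     func BenchmarkAddInt(b *testing.B) {
|         x := 0
|         for i := 0; i < b.N; i++ {
|             x += i // integer add: ~1 cycle of latency per step
|         }
|         sinkInt = x
|     }
|
|     func BenchmarkAddFloat(b *testing.B) {
|         x := 0.0
|         for i := 0; i < b.N; i++ {
|             x += float64(i) // FP add: ~3 cycles of latency per step
|         }
|         sinkFloat = x
|     }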
| vikingerik wrote:
| Parallelism can break that speed limit, at least in theory. You
| could use four different registers to hold x+y, x+2y, x+3y,
| x+4y and run four instances of the loop in parallel, if there
| are no other data dependencies.
|
| I'm not sure if hardware microcode can be smart enough to
| detect that and do it with register renaming and such, but a
| sufficiently smart compiler could. (Or realistically it would
| be optimized into SIMD instructions; what I said in the first
| paragraph is really an ad-hoc implementation of SIMD.)
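|
| A hand-rolled version of that idea in Go might look like the
| sketch below (note it changes the order of the float additions,
| so the result can differ slightly from a single accumulator):
|
|     package sum
|
|     // Four independent accumulators break the single dependency
|     // chain, so adds from different "lanes" can overlap.
|     func sumFloats(vals []float64) float64 {
|         var a, b, c, d float64
|         i := 0
|         for ; i+4 <= len(vals); i += 4 {
|             a += vals[i]
|             b += vals[i+1]
|             c += vals[i+2]
|             d += vals[i+3]
|         }
|         for ; i < len(vals); i++ { // leftover tail
|             a += vals[i]
|         }
|         return a + b + c + d
|     }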
| Const-me wrote:
| > run four instances of the loop in parallel
|
| Indeed, here's an example which uses even higher latency FMA
| instructions to accumulate:
| https://stackoverflow.com/a/59495197/126995 The example uses
| 4 independent accumulators to saturate the throughput,
| instead of stalling on the latency. BTW, on most modern CPUs
| that code doesn't saturate the compute throughput; the
| bottleneck is loads.
|
| > a sufficiently smart compiler could
|
| They tend to avoid doing that for floats, for accuracy
| reasons: float addition ain't associative.
| makapuf wrote:
| > They tend to avoid doing that for floats, for accuracy
| reasons: float addition ain't associative.
|
| Indeed, unless you use -ffast-math (also known as
| -fidontcareaboutthe results). See
| https://stackoverflow.com/questions/7420665/what-does-
| gccs-f...
| jeffbee wrote:
| Isn't a cast from int to float going to generate CVTSI2SD or
| a similar instruction, which is slow and high-latency? On Ice
| Lake Xeon these are faster, but as recently as Skylake they had
| a reciprocal throughput of 2.
| Const-me wrote:
| Yeah, cvtsi2sd has 3-9 cycles of latency on Zen3, but its
| throughput ain't too bad. The table on uops.info says that on
| Zen3, cvtsi2sd has 1 cycle throughput, meaning a core can
| complete one of these instructions every cycle. There are no
| other dependencies, so the pipeline will deal with the latency
| just fine by dispatching instructions for multiple loop
| iterations.
|
| It can't deal with the latency of the ARM64 equivalent of
| addss/addsd due to the dependency chain. I think that's the
| bottleneck of the OP's microbenchmark.
| [deleted]
| ThinkBeat wrote:
| Wouldn't the number of instructions the code uses be better
| assessed by looking at the assembler output (might need to
| disassemble to get it, I'm not too familiar with the Go
| toolchain), or with a good debugger?
| anonymoushn wrote:
| It may be useful to have a look at the assembly. There's probably
| no instruction corresponding to the cast; you just get a
| different arithmetic instruction that uses the bottom X bits. For
| a comparison between code with and without a cast between signed
| and unsigned ints of the same width, I would expect exactly the
| same instructions to be used.
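|
| Dumping the assembly to check is easy enough with the standard
| toolchain, for example (the binary and symbol names here are
| just placeholders):
|
|     go build -gcflags=-S ./...                    # print the assembly gc emits
|     go tool objdump -s BenchmarkCast ./pkg.test   # disassemble a built test binary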
| tankenmate wrote:
| This may also depend on the architecture; what may be free or
| close to free on AArch64 may not be on x86_64, for example.
| And of course the relative costs (e.g. 3x for a float cast)
| may be different as well.
| Someone wrote:
| Not only useful, but required. They have to check that the code
| is performing the intended operation and, even, that the loop
| is still there (https://kristerw.blogspot.com/2019/04/how-llvm-
| optimizes-geo... shows how gcc and clang remove some loops. The
| loops used here have simple forms, too, and could fall to the
| same type of optimization).
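|
| The usual guard in Go benchmarks is to store the loop's result
| somewhere the compiler has to treat as live, e.g. a
| package-level variable (a common pattern, not something from
| the article):
|
|     package cast
|
|     import "testing"
|
|     var sink int32
|
|     func BenchmarkCast(b *testing.B) {
|         var x int32
|         for i := 0; i < b.N; i++ {
|             x += int32(i)
|         }
|         sink = x // without this, the loop's work is unobservable
|     }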
| PaulKeeble wrote:
| The line x += int32(i)
|
| Produces the assembler line
|
| 0x001e 00030 (.\cast_test.go:10) ADDL CX, DX
|
| In comparison the code
|
| x += i
|
| is using
|
| 0x001e 00030 (.\cast_test.go:19) ADDQ CX, DX
|
| The results are definitely different, since ADDL is the 32-bit
| add and ADDQ is the 64-bit add, and that is how it's achieving
| the cast. I'm not sure the intent of the benchmark (the cost of
| a cast) is being captured directly, since the add instruction
| itself is doing the casting. But when you think about it, an
| ADDL with 0 achieves a cast and it's 1 instruction, so it's not
| expensive at all; there may be something better than that for
| explicit casts, but I doubt there is anything worse. It's
| basically free. I see no performance difference between them,
| and that fits expectations, since casting is just throwing away
| all the top bits.
| anonymoushn wrote:
| Thanks. This is what I said I expected, given that int32 and
| int are different widths.
| gbin wrote:
| I agree... The compiler can make a lot of assumptions about such
| a simple operation; you have to look at the assembly. In other,
| more complex cast cases it might be slower, so I would not
| assume anything.
| kubb wrote:
| Checking how many language subtleties you can point out is no
| doubt a good interview if you're looking for people experienced
| in a specific language.
|
| It strikes me as so different from what the FAANGs are doing:
| they ask you to write trivial algorithms on the whiteboard, which
| apparently filters 99% of engineers.
|
| Theoretically you can have a good candidate that doesn't know
| e.g. that appending to a slice passed in a function argument is a
| recipe for disaster. You'd show them the door over something that
| they could learn in a day.
| Barrin92 wrote:
| Theoretically, sure, but chances are any good engineer knows
| that stuff and doesn't need to learn it, and FAANG can filter
| heavily because they get a lot of applicants.
|
| Basically, if you don't know something you could have learned
| in a day in preparation for an interview at a top-paying
| company, that in itself is a sufficient signal.
| kergonath wrote:
| There are many more "things you can learn in a day" than
| there are available days to do it, though. Nobody will ever
| know all the possible tiny simple things.
|
| I would think in general that it's better to select people
| who adapt and learn quickly rather than people who seem to
| know everything, but you're right: we are not fishing in the
| same pond as the FAANGs.
| nemo1618 wrote:
| > Theoretically you can have a good candidate that doesn't know
| e.g. that appending to a slice passed in a function argument is
| a recipe for disaster.
|
| Huh? This is a common pattern. (hash.Hash).Sum,
| strconv.AppendInt, etc.
|
| I think you're referring to the fact that append can mutate
| existing memory (that is, it doesn't always allocate new
| memory), but in practice, this is rarely an issue, because, by
| convention, functions that append to a slice always return that
| slice.
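|
| For anyone unfamiliar, the sharp edge is aliasing of the
| backing array when the slice has spare capacity; a toy example:
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         buf := make([]byte, 0, 64)
|         a := append(buf, 'a') // writes into buf's spare capacity
|         b := append(buf, 'b') // reuses the same backing array
|         fmt.Println(string(a), string(b)) // prints "b b": a was clobbered
|     }
|
| Returning the appended slice (as the stdlib functions above do)
| and not reusing the original is what avoids the surprise in
| practice.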
| kevincox wrote:
| You are making a lot of assumptions about the interview. Code
| review is an important part of the development process and
| seeing candidates who can suggest improvements to code is
| valuable.
|
| You seem to be assuming that the intended answers are language
| subtleties, but I don't see evidence of that in the post.
___________________________________________________________________
(page generated 2022-08-30 23:02 UTC)