[HN Gopher] Cost of an Integer Cast in Go
       ___________________________________________________________________
        
       Cost of an Integer Cast in Go
        
       Author : kiyanwang
       Score  : 78 points
       Date   : 2022-08-30 08:06 UTC (14 hours ago)
        
 (HTM) web link (boyter.org)
 (TXT) w3m dump (boyter.org)
        
       | foota wrote:
       | It's weird to see this analysis without any look at the generated
       | assembly. They've also added a data dependency between benchmark
       | loops, which is going to matter. Depending on the size they're
       | testing with they're also just going to be benchmarking memory
       | bandwidth.
        
         | anonymoushn wrote:
         | The code does not seem to perform any reads or writes other
         | than to registers, so I would be surprised if memory bandwidth
         | mattered.
        
           | foota wrote:
            | IIUC it's iterating over a list of integers? Reading those
            | from memory requires bandwidth.
            | 
            | Edit: never mind, I thought i was being read from an array,
            | not just incremented.
        
       | thayne wrote:
       | So all the tests were done on an m1 mac. Is there any difference
       | on x86_64?
       | 
       | I wouldn't expect there to be, but unless this program is only
       | going to be running on new macs, this analysis isn't really
       | complete.
        
       | wruza wrote:
        | This kind of benchmark makes zero sense. Sorry for being blunt:
        | it would be understandable from someone who is just starting to
        | program, but the "have been doing interviews" part leaves an
        | aftertaste. One should know their shit before they "present a
        | small chunk of code and ask them to review it over 15 minutes
        | pointing out any issues they see".
        
         | beanjuiceII wrote:
          | Honestly, I can't even understand how this is real... this
          | seems like exactly the wrong person to have doing interviews,
          | or even working as a developer at my company.
        
       | tehmillhouse wrote:
       | As a compiler guy, I'd appreciate some look at the layers of
       | abstraction in-between (so, ASM). Microbenchmarks are famously
       | jittery on modern CPUs due to cache effects, branch prediction,
       | process eviction, data dependencies, pipeline stalls, OoO
       | execution, instruction level parallelism, etc.
       | 
       | You have to be really careful to ensure the numbers you're
       | getting are really coming from the thing you're trying to
       | benchmark, and aren't corrupted beyond recognition by your
       | benchmarking harness.
       | 
       | Some of these questions would surely be trivial if I actually
       | knew any Go, but I'm left wondering:
       | 
       | * What does the machine code / assembly look like for this? What
       | does the cast compile down to?
       | 
        | * What's `int` an alias for? I assume a 64-bit signed integer?
       | 
        | * Are integer casts checked in Go? Would an overflowing cast
       | fault?
        
         | silvestrov wrote:
          | Micro-benchmarks are especially problematic in real-world code
         | where you load stuff from random addresses in memory.
         | 
         | If the code after the cast is blocked on a memory load, then
         | you have a lot of free instructions while the cpu is waiting
          | for the memory load to complete. In this case it doesn't matter
          | if the cast is free or takes a handful of instructions.
         | 
         | Sometimes code becomes faster by using more instructions to
         | make the data more compact so more of the data stays in the
         | caches.
        
           | Cthulhu_ wrote:
           | Microbenchmarks are only valid for the one function under
           | test, and only on the current machine; they're all right for
            | optimizing on particular hardware, but not so much for going
            | out into the world and declaring "X is faster than Y".
           | 
            | That said, I did like this website where you could set up JS
            | benchmarks; they would run on your own machine and you could
            | compare how they ran on other people's systems. It wasn't
            | perfect, but it gave a decent indication of whether X was
            | faster than Y. Of course, it's a snapshot in time; JS engines
            | have gone through tons of optimizations over the years.
        
             | silvestrov wrote:
             | My point is that the context of the function can invalidate
             | a microbenchmark completely.
             | 
             | If you only call this function once in a while, then the
             | context is more important than the function.
             | 
             | You can only ignore the context when you do video decoding
             | or matrix inversion or similar "context free" long running
             | code.
        
         | anonymoushn wrote:
         | int is defined to be the pointer width (or something like
         | this), so probably int64 where the OP is running their code.
         | 
          | Integer casts are unchecked.
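          | 
          | A minimal sketch of that unchecked behaviour (my example, not
          | from the article): a conversion that doesn't fit keeps only
          | the low bits, and typed overflow wraps instead of faulting.
          | 
          |     package main
          | 
          |     import "fmt"
          | 
          |     func main() {
          |         big := int64(1) << 40   // does not fit in 32 bits
          |         fmt.Println(int32(big)) // 0: high bits silently dropped
          | 
          |         var m int32 = 1<<31 - 1 // max int32
          |         fmt.Println(m + 1)      // -2147483648: wraps, no fault
          |     }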
        
         | stargrazer wrote:
         | and is it aligned properly
        
         | titzer wrote:
         | All are good points. Also, loop unrolling and induction
         | variable analysis. LLVM is particularly aggressive at trying to
         | optimize inductions. It will literally turn a "sum of i from 0
         | to N" into "n(n+1)/2", among others.
         | 
         | It's really important to look at the actual machine code.
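          | 
          | For reference, the shape of loop meant here and its closed
          | form (a sketch in Go; whether a given Go toolchain performs
          | this rewrite is a separate question):
          | 
          |     // naive: n additions
          |     func sumLoop(n int64) int64 {
          |         var s int64
          |         for i := int64(0); i <= n; i++ {
          |             s += i
          |         }
          |         return s
          |     }
          | 
          |     // what an aggressive optimizer can reduce it to
          |     func sumClosed(n int64) int64 {
          |         return n * (n + 1) / 2
          |     }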
        
           | zasdffaa wrote:
           | An aside.
           | 
           | > It will literally turn a "sum of i from 0 to N" into
           | "n(n+1)/2", among others[1]
           | 
           | Yeah, seen that on a godbolt youtube vid. Question is, should
           | it do this? Or should it force you to use a library, by
           | reporting what you're trying to do and telling you there's an
           | easier way ( "sum of 1 to n is <formula>, instead of a loop
           | use library function 'sum1toN()" )
           | 
           | I think getting too clever risks hurting the user by not
           | letting them know there's a better way.
           | 
           | [1] actually it seems to do a slightly different version of
           | this to prevent risk of overflow, but same result.
        
         | fredrikholm wrote:
         | * https://godbolt.org/z/3dEdha44W
         | 
         | * Architecture based, so 32 or 64 bit signed integer.
         | 
         | * No faults. Signed become -1 and unsigned become MAX.
        
           | Thaxll wrote:
            | This is using GCC Go, which no one actually uses.
           | 
           | Edit: why I'm being downvoted:
           | https://go.godbolt.org/z/699d7KjWr
        
             | bayindirh wrote:
             | I for one develop on GCC-Go. The only reason I chose Go as
             | my next project's language is because it has a GCC
              | implementation (I have a strict rule about GPL-licensed
              | development toolchains for my own projects).
        
             | pjmlp wrote:
              | Anyone who cares about the extra performance that the
              | decades-old, battle-tested GCC backend is able to squeeze
              | out of Go code.
        
               | Thaxll wrote:
                | I did not test, but I assume the Go compiler is "much"
                | faster than the "semi-old" GCC implementation.
        
               | erik_seaberg wrote:
               | I think he meant the performance of the compiled binary,
               | not the build.
        
               | pjmlp wrote:
               | I was referring to the quality of generated machine code.
        
               | Thaxll wrote:
               | I was also talking about that compiled code.
        
               | pjmlp wrote:
                | Doesn't look like it, since you mention the Go compiler
                | being faster, which is true but is about compile speed,
                | not the generated machine code; the reference compiler
                | doesn't do many of the optimization passes GCC does.
        
               | bayindirh wrote:
               | As far as I can see, GCCGo implements Go 1.18 as of today
               | [0]. Moreover, as noted on official Golang webpage for
               | GCCGo [1], there are even some advantages for using
               | GCCGo, quoting:
               | 
               | "On x86 GNU/Linux systems the gccgo compiler is able to
               | use a small discontiguous stack for goroutines. This
               | permits programs to run many more goroutines, since each
               | goroutine can use a relatively small stack. Doing this
               | requires using the gold linker version 2.22 or later. You
               | can either install GNU binutils 2.22 or later, or you can
               | build gold yourself."
               | 
                | Considering GCC can optimize well-written code to the
               | point of saturating the target processor, given the
               | correct target flags, I'm not entirely sure that "gc"
               | would be "much" faster than GCCGo. I'm relatively new
               | with Go, but equally old with GCC, esp. with g++, so
               | assuming the optimization prowess is equally valid for
               | GCCGo.
               | 
               | Last but not least, GCCGo is a part of GCC as a primary
               | language since 4.7, which is an eternity in software
               | terms.
               | 
               | [0]: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgo/VE
               | RSION;h=...
               | 
               | [1]: https://go.dev/doc/install/gccgo
        
       | traceroute66 wrote:
        | I have to say I don't understand the point of debates like this
        | when it comes to Go (it's like the age-old "+" vs
        | "strings.Builder" debate for concatenation).
        | 
        | The point of Go is that it's quick to learn, quick to get stuff
        | done, and it's got a fantastic stdlib that enables you to get
        | most modern things out of the box.
        | 
        | If someone is worrying about the cost of an integer cast in Go,
        | then either:
        | 
        |   (a) They are using the wrong language and should be using
        |       something more low-level (e.g. Rust, C etc.) instead.
        |   (b) Like many who went before them, they are engaging in
        |       premature optimisation.
        | 
        | I'd put money on most people falling head-first into (b).
        
         | anonymoushn wrote:
         | It seems like there are a lot of posts online about how any
         | effort at all to understand how code is executed and what
         | tradeoffs are being made is a case of premature optimization.
        
           | Retric wrote:
           | I don't think that's what people are complaining about. Doing
           | a poor job is.
           | 
           | For example += on floats is slower than += on integers. So
           | does casting actually take 3x as long or is that an artifact
           | of poor test design?
        
         | dvfjsdhgfv wrote:
         | By your logic, any attempt to optimize code written in Go makes
          | no sense. But there are many of us who use Go for the reasons
         | you described and still care about efficiency, and especially
         | about cases where trivial changes have significant consequences
         | (one can argue if this particular article qualifies or not).
        
           | avgcorrection wrote:
            | No, not by their logic. The question is presented as a
            | filter. A filter question needs to test something that is
            | either (1) so basic that a good practitioner picks it up by
            | osmosis, or (2) so useful that they will need it in their
            | daily work. Why would either of these hold for a Go dev job?
        
             | kasey_junk wrote:
             | The post is suggesting that they were testing a surprising
             | answer to the filter. A candidate took a look at the cast
             | and claimed it was inefficient. The author was double
             | checking their intuition that it shouldn't be.
             | 
             | We don't know the shape or result of the filter at all from
             | the article.
        
         | svnpenn wrote:
         | > its like the age-old "+" vs "strings.Builder" debate for
         | concatenation
         | 
         | that's not even a debate. Have you done tests on this? If I
         | remember right, once the string runs over the GC default (I
         | think 1 GB), "+" turns from linear to exponential time. So yeah
          | I guess for most stuff it doesn't matter, but when it does
          | matter it's a huge time difference.
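          | 
          | If you want to measure it yourself, a minimal benchmark sketch
          | (the names and sizes are made up for illustration; put it in a
          | _test.go file and run `go test -bench=Concat`):
          | 
          |     package concat_test
          | 
          |     import (
          |         "strings"
          |         "testing"
          |     )
          | 
          |     func BenchmarkConcatPlus(b *testing.B) {
          |         for i := 0; i < b.N; i++ {
          |             s := ""
          |             for j := 0; j < 1000; j++ {
          |                 s += "x" // allocates a new string each time
          |             }
          |             _ = s
          |         }
          |     }
          | 
          |     func BenchmarkConcatBuilder(b *testing.B) {
          |         for i := 0; i < b.N; i++ {
          |             var sb strings.Builder
          |             for j := 0; j < 1000; j++ {
          |                 sb.WriteString("x") // amortised buffer growth
          |             }
          |             _ = sb.String()
          |         }
          |     }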
        
           | HideousKojima wrote:
           | While not Go, in C# you can avoid a ton of heap allocations
           | by using StringBuilder instead of concatenation, and the
           | crossover point for one being more efficient than the other
           | is at something like 4 concatenations. But people will always
           | say nonsense about "premature optimization" the moment you
           | care about performance.
        
             | svnpenn wrote:
              | In regard to Go, in most cases it's a pointless
              | optimization. Unless the string is quite large, it's the
              | same result. Go is nothing like C# in this case.
        
               | makapuf wrote:
                | In Python since 2.5 (and I'm sure in other language
                | implementations) the compiler/implementation recognizes
                | this += use case and implements it as a fast one.
        
           | traceroute66 wrote:
           | > once the string runs over the GC default (I think 1 GB),
           | 
            | Not wishing to sound snarky, but is there even a valid use
            | case for concatenating GB+ quantities of strings (or even
            | hundreds of MB)?
            | 
            | I mean surely you'll (potentially) run into memory exhaustion
            | issues and other fun things?
        
             | svnpenn wrote:
             | personally I don't see a need for it, but I'm sure someone
             | will do something dumb that requires it
        
             | IshKebab wrote:
             | Of course there is. 1 GB is not even that big these days.
        
       | trelane wrote:
       | The benchmarks are interesting, but that's not what caught my
       | attention.
       | 
       | I'm not convinced finding increasingly esoteric bugs in an
       | interviewer-selected language is an effective gauge of coding
       | ability. I expect this is actually _worse_ than whiteboard
       | coding.
        
         | robertlagrant wrote:
         | Why is that?
        
           | trelane wrote:
           | I had a long answer, but my phone eated it. (thanks, bus
           | jounce + accidental swipe down + hn forgetting the post
           | between reloads)
           | 
           | How about: why is finding increasingly esoteric bugs in an
           | interviewer-specified language a great way of hiring software
           | engineers?
        
             | robertlagrant wrote:
             | > How about: why is finding increasingly esoteric bugs in
             | an interviewer-specified language a great way of hiring
             | software engineers?
             | 
             | I don't know. You were saying it wasn't, so I wondered why.
        
             | tedunangst wrote:
             | What's wrong with asking candidates to find nonesoteric
             | bugs?
        
           | nicce wrote:
            | While the programmer should produce clean, good and optimised
            | code, finding bugs and errors is more of a job for a
            | pentester. Experienced programmers can easily detect these,
            | but that is not their main job.
            | 
            | A programmer should be able to solve problems and apply the
            | selected language to solving those problems as efficiently
            | as possible.
        
             | garren wrote:
             | Why a pentester and not a QA team more broadly? QA won't
             | necessarily review the code (haven't met a team that did),
             | but they will typically hammer a system with test cases and
             | scenarios that expose unusual behavior and uncover bugs.
             | 
              | I've had pentesters review code looking for things like
              | insecure hashing or encryption, or low-hanging fruit like
              | credentials in the code, but I wouldn't be inclined to
              | leave what is essentially a QA process to a pentester.
        
             | robertlagrant wrote:
             | You think a penetration tester should find bugs in code?
             | They're looking for security weaknesses only, and almost
             | never by looking at the code. Your method may be the most
             | expensive possible.
        
               | nicce wrote:
                | Every bug is a weakness at some level. The most efficient
                | way to make money in bug bounties currently is by
                | reviewing open-source code. Manual testing takes a huge
                | amount of time. You can also automate code review for
                | low-hanging fruit with tools like semgrep or GitHub's
                | code scanning.
                | 
                | Of course programmers should test their code themselves
                | and minimise the bugs, but their job is not to look for
                | them.
        
               | eptcyka wrote:
                | > Of course programmers should test their code themselves
                | and minimise the bugs, but their job is not to look for
                | them.
               | 
               | I have to respectfully WAT. Code review should be a part
               | of everybody's workflow. And all programmers involved
               | should be looking out for bugs. The best bugs are those
               | which were never merged in the first place.
        
               | nicce wrote:
               | > I have to respectfully WAT. Code review should be a
               | part of everybody's workflow.
               | 
               | Exactly, it is one workflow. Not their whole job. Their
               | overall job is to solve problems with code.
        
               | robertlagrant wrote:
               | It's a subset of the job, yes. Not to be left to a pen
               | tester.
        
               | wtetzner wrote:
               | I would argue their job is to solve problems with working
               | code, which requires finding/avoiding bugs.
        
         | kazinator wrote:
         | We don't know what the actual coding question was; the claim
         | that casting int to int32 is expensive seems to be an
         | unprompted remark coming from an interviewee. I'm guessing they
         | didn't get the job.
        
         | IshKebab wrote:
         | It didn't sound like they were required to find esoteric bugs.
         | This is just something a candidate offered unprompted.
        
           | trelane wrote:
           | TFA said
           | 
           | > The filter for this is a simple review exercise. We present
           | a small chunk of code and ask them to review it over 15
           | minutes pointing out any issues they see. The idea is to
           | respect their and our time. It works pretty well and we can
           | determine how much experience someone has by their ability to
           | pick up the obvious vs subtle bugs.
           | 
           | So maybe there's more? But it's definitely a mandatory,
           | initial part of the interview process at the very least.
        
             | tialaramex wrote:
             | The code may be pretty bad, and it's interesting to see
              | whether a candidate _spots_ that it's bad (if they don't,
              | you're presumably going to get more poor code from them if
             | hired) and what about it in particular they point out. It's
             | an opportunity for them to show you what they know, rather
             | than a box checking exercise where you know the answers.
             | 
             | I've actually done an exercise like this about a year ago,
             | in Java, I remember I really cared about iterators at the
             | time, and so even though it's probably not the worst thing
             | wrong with the code I just kept coming back to these
             | C-style for loops which I thought were reprehensible. I
             | should ask my interviewer (whose department I now work in)
             | what their impression was as a result of this obsession.
        
             | craigching wrote:
              | They even went one step further and explained the answer
              | they're
             | looking for:
             | 
             |  _You lose data due to overflows which is what we expect_
             | 
             | Seems like a reasonable question, though without more
             | context that may or may not be true.
        
       | Const-me wrote:
       | > about 3x the overhead to convert from int to float.
       | 
       | That 3x overhead has nothing to do with converting types.
       | 
        | A CPU can't compute x += y before it has the old value of x.
        | This is called a "data dependency chain".
       | 
       | Modern AMD64 CPUs have 1 cycle of latency to add integers, 3
       | cycles of latency to add FP32 or FP64 numbers. This means no
       | matter what else is in the loop, a loop which increments a single
       | float or double variable can't possibly be faster than 3 cycles /
       | iteration.
       | 
       | As demonstrated by the OP, Apple M1 is very similar in that
       | regard.
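        | 
        | You can see the dependency chain rather than the conversion with
        | a benchmark shaped roughly like this (a sketch for a _test.go
        | file; package and names are mine, and the package-level sinks
        | just stop the compiler from discarding the work):
        | 
        |     package cast_test
        | 
        |     import "testing"
        | 
        |     var (
        |         sinkInt   int
        |         sinkFloat float64
        |     )
        | 
        |     func BenchmarkIntAccum(b *testing.B) {
        |         var x int
        |         for i := 0; i < b.N; i++ {
        |             x += i // 1-cycle integer add on the critical path
        |         }
        |         sinkInt = x
        |     }
        | 
        |     func BenchmarkFloatAccum(b *testing.B) {
        |         var x float64
        |         for i := 0; i < b.N; i++ {
        |             x += float64(i) // ~3-cycle FP add dominates, not the conversion
        |         }
        |         sinkFloat = x
        |     }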
        
         | vikingerik wrote:
         | Parallelism can break that speed limit, at least in theory. You
         | could use four different registers to hold x+y, x+2y, x+3y,
         | x+4y and run four instances of the loop in parallel, if there
         | are no other data dependencies.
         | 
         | I'm not sure if hardware microcode can be smart enough to
         | detect that and do it with register renaming and such, but a
         | sufficiently smart compiler could. (Or realistically it would
         | be optimized into SIMD instructions; what I said in the first
         | paragraph is really an ad-hoc implementation of SIMD.)
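          | 
          | In Go, the manual version of that would look roughly like this
          | (a sketch; whether it actually helps depends on the CPU and
          | the compiler):
          | 
          |     // four independent accumulators hide the FP add latency
          |     func sum4(n int) float64 {
          |         var a0, a1, a2, a3 float64
          |         i := 0
          |         for ; i+3 < n; i += 4 {
          |             a0 += float64(i)
          |             a1 += float64(i + 1)
          |             a2 += float64(i + 2)
          |             a3 += float64(i + 3)
          |         }
          |         for ; i < n; i++ { // remainder
          |             a0 += float64(i)
          |         }
          |         return a0 + a1 + a2 + a3
          |     }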
        
           | Const-me wrote:
           | > run four instances of the loop in parallel
           | 
           | Indeed, here's an example which uses even higher latency FMA
           | instructions to accumulate:
           | https://stackoverflow.com/a/59495197/126995 The example uses
           | 4 independent accumulators to saturate the throughput,
           | instead of stalling on the latency. BTW, on most modern CPUs
           | that code doesn't saturate the compute throughput, the
           | bottleneck is loads.
           | 
           | > a sufficiently smart compiler could
           | 
           | They tend to avoid doing that for floats, for accuracy
           | reasons: float addition ain't associative.
        
             | makapuf wrote:
             | > They tend to avoid doing that for floats, for accuracy
             | reasons: float addition ain't associative.
             | 
              | Indeed, unless you use -ffast-math (also known as
              | -fidontcareabouttheresults). See
             | https://stackoverflow.com/questions/7420665/what-does-
             | gccs-f...
        
         | jeffbee wrote:
          | Isn't a cast from int to float going to generate CVTSI2SD or a
          | similar instruction, which are slow and high-latency? On Ice
          | Lake Xeon these are faster, but as recently as Skylake they had
          | a reciprocal throughput of 2.
        
           | Const-me wrote:
            | Yeah, cvtsi2sd has 3-9 cycles of latency on Zen3, but the
            | throughput ain't too bad. The table on uops.info says on
            | Zen3, cvtsi2sd has 1 cycle throughput, meaning a core can
            | complete one of these instructions every cycle. There are no
            | other dependencies, so the pipeline will deal with the
            | latency just fine by dispatching instructions for multiple
            | loop iterations.
           | 
           | It can't deal with the latency of ARM64 equivalent of
           | addss/addsd due to the dependency chain. I think that's the
           | bottleneck of the OP's microbenchmark.
        
       | [deleted]
        
       | ThinkBeat wrote:
        | Wouldn't the number of instructions the code uses be better
        | assessed by looking at the assembler output (might need to
        | disassemble to get it; not too familiar with the Go toolchain),
        | or with a good debugger, rather than by guessing?
        
       | anonymoushn wrote:
       | It may be useful to have a look at the assembly. There's probably
       | no instruction corresponding to the cast, you just get a
       | different arithmetic instruction that uses the bottom X bits. For
       | a comparison between code with and without a cast between signed
       | and unsigned ints of the same width, I would expect exactly the
       | same instructions to be used.
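        | 
        | One way to check with the standard toolchain (regular go
        | commands; the symbol regexp and binary name below are just
        | examples, the test binary is whatever `go test -c` produces):
        | 
        |     go build -gcflags=-S .     # print the compiler's assembly
        |     go test -c
        |     go tool objdump -s Cast ./yourpkg.test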
        
         | tankenmate wrote:
          | This may also depend on the architecture: what may be free or
          | close to free on AArch64 may not be the case on x86_64, for
          | example. Also, of course, the relative costs (e.g. 3x for a
          | float cast) may be different as well.
        
         | Someone wrote:
         | Not only useful, but required. They have to check that the code
         | is performing the intended operation and, even, that the loop
         | still is there (https://kristerw.blogspot.com/2019/04/how-llvm-
         | optimizes-geo... shows how gcc and clang remove some loops. The
         | loops used here have simple forms, too, and could fall to the
         | same type of optimization)
        
         | PaulKeeble wrote:
         | The line x += int32(i)
         | 
         | Produces the assembler line
         | 
         | 0x001e 00030 (.\cast_test.go:10) ADDL CX, DX
         | 
         | In comparison the code
         | 
         | x += i
         | 
         | is using
         | 
         | 0x001e 00030 (.\cast_test.go:19) ADDQ CX, DX
         | 
          | The results are definitely different, since ADDL is for 32
          | bits and ADDQ is for 64 bits, and that is how it's achieving
          | the cast. Not sure the intent of the benchmark (the cost of a
          | cast) is being captured directly, since we have a specialised
          | add instruction doing the casting. But when you think about
          | it, ADDL 0, $Val achieves a cast and it's 1 instruction, so
          | it's not expensive at all; there may be something better than
          | that for explicit casts, and I doubt there is worse. It's
          | basically free. I see no performance difference between them,
          | and that fits with expectations, since casting is just
          | throwing away all the top bits.
        
           | anonymoushn wrote:
           | Thanks. This is what I said I expected, given that int32 and
           | int are different widths.
        
         | gbin wrote:
          | I agree... The compiler can make a lot of assumptions about
          | such a simple operation; you have to look at the assembly. In
          | other, more complex cast cases it might be slower, so I would
          | not assume anything.
        
       | kubb wrote:
       | Checking how many language subtleties you can point out is no
       | doubt a good interview if you're looking for people experienced
       | in a specific language.
       | 
       | It strikes me as so different from what the FAANGs are doing:
       | they ask you to write trivial algorithms on the whiteboard, which
       | apparently filters 99% of engineers.
       | 
       | Theoretically you can have a good candidate that doesn't know
       | e.g. that appending to a slice passed in a function argument is a
       | recipe for disaster. You'd show them the door over something that
       | they could learn in a day.
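        | 
        | For anyone who hasn't hit it, the kind of surprise meant here (a
        | small sketch; the names are made up):
        | 
        |     package main
        | 
        |     import "fmt"
        | 
        |     func addSuffix(s []int) []int {
        |         return append(s, 99) // may write into the caller's backing array
        |     }
        | 
        |     func main() {
        |         base := make([]int, 2, 4) // len 2, cap 4: room to grow in place
        |         a := append(base, 1)
        |         b := addSuffix(base)      // reuses the same backing array...
        |         fmt.Println(a, b)         // [0 0 99] [0 0 99]: a's 1 was overwritten
        |     }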
        
         | Barrin92 wrote:
          | Theoretically, sure, but chances are any good engineer knows
          | that stuff and doesn't need to learn it, and FAANG can filter
          | heavily because they get a lot of applicants.
          | 
          | Basically, if you don't know something you could have learned
          | in a day in preparation for an interview at a top-paying
          | company, that in itself is a sufficient signal.
        
           | kergonath wrote:
           | There are many more "things you can learn in a day" than
           | there are available days to do it, though. Nobody will ever
           | know all the possible tiny simple things.
           | 
           | I would think in general that it's better to select people
           | who adapt and learn quickly rather than people who seem to
            | know everything, but you're right: we are not fishing in the
           | same pond as the FAANGs.
        
         | nemo1618 wrote:
         | > Theoretically you can have a good candidate that doesn't know
         | e.g. that appending to a slice passed in a function argument is
         | a recipe for disaster.
         | 
         | Huh? This is a common pattern. (hash.Hash).Sum,
         | strconv.AppendInt, etc.
         | 
         | I think you're referring to the fact that append can mutate
         | existing memory (that is, it doesn't always allocate new
         | memory), but in practice, this is rarely an issue, because, by
         | convention, functions that append to a slice always return that
         | slice.
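          | 
          | For reference, the convention looks like this (real stdlib
          | call; the buffer name is mine):
          | 
          |     package main
          | 
          |     import (
          |         "fmt"
          |         "strconv"
          |     )
          | 
          |     func main() {
          |         buf := make([]byte, 0, 64)
          |         buf = strconv.AppendInt(buf, 42, 10) // caller keeps the returned slice
          |         buf = append(buf, '!')
          |         fmt.Println(string(buf)) // 42!
          |     }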
        
         | kevincox wrote:
         | You are making a lot of assumptions about the interview. Code
         | review is an important part of the development process and
         | seeing candidates who can suggest improvements to code is
         | valuable.
         | 
          | You seem to be assuming that the intended answers are language
          | subtleties, but I don't see evidence of that in the post.
        
       ___________________________________________________________________
       (page generated 2022-08-30 23:02 UTC)