[HN Gopher] Debugging an evil Go runtime bug: From heat guns to ...
       ___________________________________________________________________
        
       Debugging an evil Go runtime bug: From heat guns to kernel compiler
       flags
        
       Author : goranmoomin
       Score  : 102 points
       Date   : 2024-07-19 13:24 UTC (9 hours ago)
        
 (HTM) web link (marcan.st)
 (TXT) w3m dump (marcan.st)
        
       | mseepgood wrote:
       | (2017) Previous discussion:
       | https://news.ycombinator.com/item?id=15845118
        
       | BobbyJo wrote:
       | This is honestly wild. 99% of devs would have found a workaround
       | and moved on. Going so far as to create a multi-kernel test bench
       | to narrow down the source of the instability is a level of
       | dedication I have not personally seen, and I respect it.
        
         | randomdata wrote:
         | By the same token, you might be the first person I have ever
         | seen give respect to the 1x developer. I respect that. We could
         | no doubt all learn a thing or two from the 1x developer that
         | doesn't rush through everything with quick solutions.
        
           | sanbor wrote:
           | If you check the issue[1] he reported the crash on November
           | 7th and reported the issue is related to gcc and the kernel
           | on November 8th. At least he was very quick going through the
           | rabbit hole.
           | 
           | [1]: https://github.com/prometheus/node_exporter/issues/730
        
             | randomdata wrote:
             | Seems about right. As a lowly developer with 10x
             | tendencies, I'm not sure I have ever spent more than an
             | hour solving a problem. I'll just apply the quick hack that
             | addresses the immediate concern and move on. Frankly, I
             | don't have the attention span to dive in like this guy has.
             | Which, mathematically, means that a 1x developer will take
             | around 10 hours to get to the same place (no doubt with a
             | better solution, as demonstrated here), which is
             | approximately in line with your findings.
             | 
             | Respect to the talent that can pull off 1x greatness.
             | Something for us weak 10x developers to strive towards.
        
               | prerok wrote:
               | Emm, the 10x developer is a myth. The real 10x developer
               | is the one who creates the
               | infrastructure/libraries/culture that enables 10 other
               | engineers to move fast. The 10x developer is not the
               | person who is 10x faster than other developers on the
               | code base simply because they can hold the spaghetti code
               | they wrote in their head.
        
               | gouggoug wrote:
               | > means that a 1x developer will take around 10 hours to
               | get to the same place
               | 
               | A "workaround" isn't an adequate substitute for actually
               | understanding and fixing the root cause of a bug.
               | 
               | What you think is a 10x developer is, in fact, a short-
               | term 10x developer, medium-term 1x developer, long-term
               | -10x developer. Their work while seemingly great at first
               | is just accrued debt with an incredibly high interest
               | rate. But they're rarely the ones fixing it.
               | 
               | Now, like everything, a balance needs to be struck
               | between spending hours fixing a bug _the right way_, or,
               | finding a temporary workaround. The real 10x developer is
               | incredibly good at finding this balance.
        
               | randomdata wrote:
               | _> A "workaround" isn't an adequate substitute for
               | actually understanding and fixing the root cause of
               | a bug._
               | 
               | Right, hence why we recognize that a 10x developer is a
               | weaker developer. Was there something that implied that a
               | weak developer is a substitute for a talented developer
               | for you to say this, or are you just pulling words out of
               | thin air?
        
               | gouggoug wrote:
               | Merely arguing about your _definition_ of 10x engineer
               | and 1x engineer.
               | 
               | Your original comment at the top of the thread implies
               | the engineer who wrote the blog post is a 1x engineer
               | because they spent so much time finding and fixing this
               | bug.
        
               | randomdata wrote:
               | Yes, and as the comment before it asserts, most engineers
               | would never take that kind of time. They'd bang out some
               | workaround as quickly as possible and move on with life.
               | 
               | But my original comment at the top also praised the value
               | of the 1x engineer; noting that the rest of us could
               | learn a thing or two from them. There is no denying that
               | 1x developers are the better developers.
               | 
               | The question remains outstanding: Where did you pick up
               | the suggestion that the quick fix is a suitable
               | replacement for the talented engineers who can fully
               | understand a problem that prompted the rebuttal?
        
             | BobbyJo wrote:
             | From the outside looking in, that's mind blowing.
        
           | eikenberry wrote:
           | Is the 10x thing really only about speed? I would say this
           | guy is a perfect 10x example as he actually gets to the root
           | cause of a difficult problem. When I think of 1x (or less)
           | devs they are usually the type that don't get things done
           | because they can't (without a lot of help), not because they
           | are slow. I.e., overall technical chops, not just speed.
        
         | abbbi wrote:
         | Marcan is a beast; this guy really loves to go down rabbit
         | holes.
        
         | toast0 wrote:
         | Problems like this tend to come back and haunt you though.
         | Sure, you can set max threads to 1 and move on with what you're
         | doing for a while... but a lot of people run Go so they can
         | have a lot more than 1 thread.
         | 
         | I've run into some of these where it's a lot more rare to hit,
         | and so then it's reasonable to not do the thing that hurts, but
         | watch out for it in the future. Sometimes you get lucky and it
         | magically fixes itself forever; sometimes the weird case that
         | you only hit with internal traffic ends up getting hit by
         | public traffic a lot.
         | 
         | Crashes like this where a wild write breaks something at a
         | distance are always a PITA to debug (especially here, where the
         | wild write is harmless if there's no data race)
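
A minimal sketch of the "set max threads to 1" workaround mentioned above, assuming it refers to Go's GOMAXPROCS setting (the comment does not name it). GOMAXPROCS caps the number of OS threads that execute Go code simultaneously; threads blocked in system calls do not count toward the limit. The helper names limitParallelism and currentParallelism are made up for this illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// limitParallelism (hypothetical helper) sets GOMAXPROCS to n and
// returns the previous setting so the caller can restore it later.
func limitParallelism(n int) int {
	return runtime.GOMAXPROCS(n)
}

// currentParallelism reports the current setting without changing it;
// calling runtime.GOMAXPROCS with an argument below 1 only queries.
func currentParallelism() int {
	return runtime.GOMAXPROCS(0)
}

func main() {
	prev := limitParallelism(1)
	fmt.Printf("GOMAXPROCS was %d, now %d\n", prev, currentParallelism())
}
```

The same limit can be applied without code changes by setting the GOMAXPROCS=1 environment variable before starting the program. As the comment notes, this hides a data race rather than fixing it, and it gives up the parallelism most people want from Go.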
        
       | im3w1l wrote:
       | I feel like I still didn't fully understand what's going on here.
       | Is the following correct? "Threads have a 'canonical' stack that
       | the OS auto-grows for you as you use more of it. But you can also
       | create your own stack by putting any value you want in RSP. This
       | is what the Go program did, and the vDSO, assuming it ran on an
       | auto-growing stack, tried to probe it, which led to corruption."
        
         | derefr wrote:
         | I believe that Golang, as a green-threaded runtime, is
         | allocating a separate carrier thread for running syscalls on,
         | so that a blocking syscall won't block the green-threads. These
         | syscall carrier threads are allocated with a distinct initial
         | stack size + stack size limit than green-thread-scheduler
         | carrier threads are, because it's expected that syscalls will
         | always just enter kernel code (which has its own stack.)
         | 
         | But vDSOs don't enter the kernel. They just run as userland
         | code; and so they depend on the userland allocated stack to be
         | arbitrarily deep in a way that kernel calls don't.
         | 
         | As shown in the article, Golang seems to have code specifically
         | for dealing with vDSO-type pseudo-syscalls -- but this is
         | likely a specialization of the pre-existing syscall-carrier-
         | thread allocation code, and so started off with a bad
         | assumption about how much stack should be allocated for the
         | created threads.
         | 
         | (I should also point out that the OS stack size specified in
         | the ELF executable headers only guarantees the stack size of
         | the initial thread of a process created by exec(2). All further
         | threads get their stacks allocated explicitly in userland code
         | by libpthread or the like, typically via mmap(2). Normally these
         | abstractions just reuse the same config params from the
         | executable (unless you override them, using e.g.
         | pthread_attr_setstacksize). But, as the article says, Golang
         | implements its own support for things like this, and so can
         | implement special thread-allocation strategies per carrier
         | thread type.)
        
         | jeffbee wrote:
         | Thread stacks in Linux are demand-paged: if you touch the next
         | page then it magically exists, up to a limit. But the machine
         | is not concerned with the convenient properties of this virtual
         | memory area. To the CPU the register RSP is just an operand,
         | expressed or implied, to some instructions.
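
To make the contrast with OS-grown thread stacks concrete, here is a rough sketch (not code from the article) showing that goroutine stacks are managed by the Go runtime itself: they start small and the runtime moves them to larger allocations as needed, so deep recursion succeeds without relying on the kernel to extend the stack. The function names sumTo and sumInGoroutine and the recursion depth are arbitrary choices for this illustration.

```go
package main

import "fmt"

// sumTo recurses n levels deep. Go does not eliminate tail calls, so
// every level consumes goroutine stack space, which forces the runtime
// to grow the stack several times along the way.
func sumTo(n int64) int64 {
	if n == 0 {
		return 0
	}
	return n + sumTo(n-1)
}

// sumInGoroutine runs sumTo on a fresh goroutine, which begins with a
// small runtime-allocated stack rather than an OS thread stack.
func sumInGoroutine(n int64) int64 {
	done := make(chan int64)
	go func() { done <- sumTo(n) }()
	return <-done
}

func main() {
	fmt.Println(sumInGoroutine(100000))
}
```

Growth happens in checks that the Go compiler inserts into Go functions. Code that is not compiled by Go, such as the vDSO, performs no such checks and simply uses whatever memory lies below RSP.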
        
       | wolf550e wrote:
       | You can follow Hector Martin @marcan at
       | https://social.treehouse.systems/@marcan/
       | 
       | He works on Asahi Linux, a Linux port to arm64 Apple hardware.
        
       | Thaxll wrote:
       | One thing I learned from that post back then is that you can
       | instruct Grub to ignore some part of your physical memory. Really
       | nice trick, not sure this is doable on Windows / Mac?
        
       | Agingcoder wrote:
       | This is very elegant. I've had my share of nasty system bugs
       | (compilers and kernels), but the dedication and the speed with
       | which he went through it is quite remarkable.
       | 
       | The explanations are also very clear. Thanks for posting.
        
       | ncruces wrote:
       | Related (for the hash based bisecting):
       | https://research.swtch.com/bisect
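
The general idea behind hash-based bisection can be sketched as follows. This is a simplified illustration under stated assumptions, not the tool described at the link nor the script from the article: it assumes exactly one culprit, uses FNV-1a as an arbitrary hash, and the names bisect, matching, and failsWhenEnabled are hypothetical. Each item is hashed, and the search repeatedly enables only the items whose hash ends in a given bit suffix, extending the suffix by one bit per round in whichever direction still reproduces the failure.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashOf returns a 32-bit FNV-1a hash of name.
func hashOf(name string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(name))
	return h.Sum32()
}

// matching returns the names whose hash has its low `bits` bits equal
// to suffix. With bits == 0 every name matches.
func matching(names []string, suffix uint32, bits uint) []string {
	mask := uint32(1)<<bits - 1
	var out []string
	for _, n := range names {
		if hashOf(n)&mask == suffix {
			out = append(out, n)
		}
	}
	return out
}

// bisect narrows names down to the single item whose presence makes
// fails return true. It reports false if the failure does not
// reproduce or cannot be pinned on one item.
func bisect(names []string, fails func(enabled []string) bool) (string, bool) {
	if !fails(names) {
		return "", false
	}
	var suffix uint32
	for bits := uint(0); bits < 32; bits++ {
		if cur := matching(names, suffix, bits); len(cur) == 1 {
			return cur[0], true
		}
		found := false
		for b := uint32(0); b < 2; b++ {
			cand := suffix | b<<bits
			if fails(matching(names, cand, bits+1)) {
				suffix, found = cand, true
				break
			}
		}
		if !found {
			return "", false // needs several items together
		}
	}
	return "", false
}

// failsWhenEnabled builds a stand-in test predicate for the demo: the
// "build" fails whenever target is among the enabled items.
func failsWhenEnabled(target string) func([]string) bool {
	return func(enabled []string) bool {
		for _, n := range enabled {
			if n == target {
				return true
			}
		}
		return false
	}
}

func main() {
	names := []string{"a.o", "b.o", "c.o", "d.o", "e.o"}
	culprit, ok := bisect(names, failsWhenEnabled("c.o"))
	fmt.Println(culprit, ok)
}
```

Selecting by hash rather than by list position means the enabled set for a given suffix stays the same even when items are added, removed, or reordered between runs.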
        
       ___________________________________________________________________
       (page generated 2024-07-19 23:05 UTC)