[HN Gopher] Debugging an evil Go runtime bug: From heat guns to ...
___________________________________________________________________
Debugging an evil Go runtime bug: From heat guns to kernel compiler
flags
Author : goranmoomin
Score : 102 points
Date : 2024-07-19 13:24 UTC (9 hours ago)
(HTM) web link (marcan.st)
(TXT) w3m dump (marcan.st)
| mseepgood wrote:
| (2017) Previous discussion:
| https://news.ycombinator.com/item?id=15845118
| BobbyJo wrote:
| This is honestly wild. 99% of devs would have found a workaround
| and moved on. Going so far as to create a multi-kernel test bench
| to narrow down the source of the instability is a level of
| dedication I have not personally seen, and I respect it.
| randomdata wrote:
| By the same token, you might be the first person I have ever
| seen give respect to the 1x developer. I respect that. We could
| no doubt all learn a thing or two from the 1x developer that
| doesn't rush through everything with quick solutions.
| sanbor wrote:
| If you check the issue[1] he reported the crash on November
| 7th and reported the issue is related to gcc and the kernel
| on November 8th. At least he was very quick going through the
| rabbit hole.
|
| [1]: https://github.com/prometheus/node_exporter/issues/730
| randomdata wrote:
| Seems about right. As a lowly developer with 10x
| tendencies, I'm not sure I have ever spent more than an
| hour solving a problem. I'll just apply the quick hack that
| addresses the immediate concern and move on. Frankly, I
| don't have the attention span to dive in like this guy has.
| Which, mathematically, means that a 1x developer will take
| around 10 hours to get to the same place (no doubt with a
| better solution, as demonstrated here), which is
| approximately in line with your findings.
|
| Respect to the talent that can pull off 1x greatness.
| Something for us weak 10x developers to strive towards.
| prerok wrote:
| Emm, 10x developer is a myth. The real 10x developer is
| the one that creates such an
| infrastructure/libraries/culture that enables 10 other
| engineers to move fast. The 10x developer is not the
| person that is 10x faster than other developers on the
| code base simply because they can hold the spaghetti code
| they wrote in their head.
| gouggoug wrote:
| > means that a 1x developer will take around 10 hours to
| get to the same place
|
| A "workaround" isn't an adequate substitute for actually
| understanding and fixing the root cause of a bug.
|
| What you think is a 10x developer is, in fact, a short-
| term 10x developer, medium-term 1x developer, long-term
| -10x developer. Their work while seemingly great at first
| is just accrued debt with an incredibly high interest
| rate. But they're rarely the ones fixing it.
|
| Now, like everything, a balance needs to be struck
| between spending hours fixing a bug _the right way_, or,
| finding a temporary workaround. The real 10x developer is
| incredibly good at finding this balance.
| randomdata wrote:
| _> A "workaround" isn't an adequate substitute for
| actually understanding and fixing the root cause of
| a bug._
|
| Right, hence why we recognize that a 10x developer is a
| weaker developer. Was there something that implied that a
| weak developer is a substitute for a talented developer
| for you to say this, or are you just pulling words out of
| thin air?
| gouggoug wrote:
| Merely arguing about your _definition_ of 10x engineer
| and 1x engineer.
|
| Your original comment at the top of the thread implies
| the engineer who wrote the blog post is a 1x engineer
| because they spent so much time finding and fixing this
| bug.
| randomdata wrote:
| Yes, and as the comment before it asserts, most engineers
| would never take that kind of time. They'd bang out some
| workaround as quickly as possible and move on with life.
|
| But my original comment at the top also praised the value
| of the 1x engineer; noting that the rest of us could
| learn a thing or two from them. There is no denying that
| 1x developers are the better developers.
|
| The question remains outstanding: Where did you pick up
| the suggestion that the quick fix is a suitable
| replacement for the talented engineers who can fully
| understand a problem that prompted the rebuttal?
| BobbyJo wrote:
| From the outside looking in, that's mind blowing.
| eikenberry wrote:
| Is the 10x thing really only about speed? I would say this
| guy is a perfect 10x example as he actually gets to the root
| cause of a difficult problem. When I think of 1x (or less)
| devs they are usually the type that don't get things done
| because they can't (without a lot of help), not because they
| are slow. That is, overall technical chops, not just speed.
| abbbi wrote:
| Marcan is a beast; this guy really loves to go down rabbit
| holes.
| toast0 wrote:
| Problems like this tend to come back and haunt you though.
| Sure, you can set max threads to 1 and move on with what you're
| doing for a while... but a lot of people run Go so they can
| have a lot more than 1 thread.
|
| I've run into some of these where it's a lot more rare to hit,
| and so then it's reasonable to not do the thing that hurts, but
| watch out for it in the future. Sometimes you get lucky and it
| magically fixes itself forever; sometimes the weird case that
| you only hit with internal traffic ends up getting hit by
| public traffic a lot.
|
| Crashes like this where a wild write breaks something at a
| distance are always a PITA to debug (especially here, where the
| wild write is harmless if there's no data race)
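
The "set max threads to 1" workaround mentioned above can be sketched in Go. This assumes the comment refers to the GOMAXPROCS setting, which the comment itself does not name. The helper names limitToOneThread and currentLimit are hypothetical. Note that GOMAXPROCS caps the number of OS threads executing Go code at the same time; threads blocked in system calls do not count against it.

```go
package main

import (
	"fmt"
	"runtime"
)

// limitToOneThread restricts the runtime to one OS thread executing Go
// code at a time and returns the previous setting.
// (Hypothetical helper; assumes "max threads" means GOMAXPROCS.)
func limitToOneThread() int {
	return runtime.GOMAXPROCS(1)
}

// currentLimit reports the current setting. Passing 0 to
// runtime.GOMAXPROCS queries the value without changing it.
func currentLimit() int {
	return runtime.GOMAXPROCS(0)
}

func main() {
	prev := limitToOneThread()
	fmt.Println("previous GOMAXPROCS:", prev)
	fmt.Println("current GOMAXPROCS:", currentLimit())
}
```

The same limit can be applied without a rebuild by setting the GOMAXPROCS environment variable before starting the program. As the comment says, this gives up the parallelism many people run Go for, so it hides the race rather than fixing it.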
| im3w1l wrote:
| I feel like I still didn't fully understand what's going on here.
| Is the following correct? "Threads have a 'canonical' stack that
| the OS auto-grows for you as you use more of it. But you can also
| create your own stack by putting any value you want in RSP. This
| is what the Go program did, and the vDSO, assuming it ran on an
| auto-growing stack, tried to probe it, which led to corruption."
| derefr wrote:
| I believe that Golang, as a green-threaded runtime, is
| allocating a separate carrier thread for running syscalls on,
| so that a blocking syscall won't block the green-threads. These
| syscall carrier threads are allocated with a different initial
| stack size and stack size limit than green-thread-scheduler
| carrier threads are, because it's expected that syscalls will
| always just enter kernel code (which has its own stack).
|
| But vDSOs don't enter the kernel. They just run as userland
| code; and so they depend on the userland allocated stack to be
| arbitrarily deep in a way that kernel calls don't.
|
| As shown in the article, Golang seems to have code specifically
| for dealing with vDSO-type pseudo-syscalls -- but this is
| likely a specialization of the pre-existing syscall-carrier-
| thread allocation code, and so started off with a bad
| assumption about how much stack should be allocated for the
| created threads.
|
| (I should also point out that the OS stack size specified in
| the ELF executable headers only guarantees the stack size of
| the initial thread of a process created by exec(2). All further
| threads get their stacks allocated explicitly in userland code
| by libpthread or the like calling mmap(2). Normally these
| abstractions just reuse the same config params from the
| executable (unless you override them, using e.g.
| pthread_attr_setstacksize). But, as the article says, Golang
| implements its own support for things like this, and so can
| implement special thread-allocation strategies per carrier
| thread type.)
| jeffbee wrote:
| Thread stacks in Linux are demand-paged: if you touch the next
| page then it magically exists, up to a limit. But the machine
| is not concerned with the convenient properties of this virtual
| memory area. To the CPU the register RSP is just an operand,
| expressed or implied, to some instructions.
| wolf550e wrote:
| You can follow Hector Martin @marcan at
| https://social.treehouse.systems/@marcan/
|
| He works on Asahi Linux, a Linux port to arm64 Apple hardware.
| Thaxll wrote:
| One thing I learned from that post back then is that you can
| instruct Grub to ignore some part of your physical memory. Really
| nice trick, not sure this is doable on Windows / Mac?
| Agingcoder wrote:
| This is very elegant. I've had my share of nasty system bugs
| (compilers and kernels), but the dedication and the speed with
| which he went through it is quite remarkable.
|
| The explanations are also very clear. Thanks for posting.
| ncruces wrote:
| Related (for the hash based bisecting):
| https://research.swtch.com/bisect
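
The general idea behind hash-based bisecting can be sketched in Go as follows. This is a minimal illustration, not the tool described at the link above and not the script from the article. It assumes exactly one culprit, a deterministic test, and distinct hashes for all names. The function names, the file names, and the choice of FNV-1a are all hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashOf maps a name to a stable 64-bit value (FNV-1a, chosen for brevity).
func hashOf(name string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(name))
	return h.Sum64()
}

// bisect narrows names down to the single one that makes fails return true.
// Each round splits the remaining candidates on one more bit of their hash
// and keeps the half that still reproduces the failure.
func bisect(names []string, fails func(enabled []string) bool) (string, bool) {
	candidates := names
	for bit := uint(0); bit < 64 && len(candidates) > 1; bit++ {
		var zero, one []string
		for _, n := range candidates {
			if (hashOf(n)>>bit)&1 == 0 {
				zero = append(zero, n)
			} else {
				one = append(one, n)
			}
		}
		if fails(zero) {
			candidates = zero
		} else {
			candidates = one
		}
	}
	// Confirm the surviving candidate really triggers the failure.
	if len(candidates) == 1 && fails(candidates) {
		return candidates[0], true
	}
	return "", false
}

func main() {
	// Hypothetical object files; "vdso.o" stands in for the culprit.
	names := []string{"sched.o", "mm.o", "vdso.o", "net.o", "fs.o"}
	fails := func(enabled []string) bool {
		for _, n := range enabled {
			if n == "vdso.o" {
				return true
			}
		}
		return false
	}
	culprit, ok := bisect(names, fails)
	fmt.Println(culprit, ok)
}
```

Keying on a hash rather than on list position means the split stays stable when items are added or removed between runs, so a build wrapper can decide per file from the name alone.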
___________________________________________________________________
(page generated 2024-07-19 23:05 UTC)