[HN Gopher] Follow up to "I booted Linux 293k times"
___________________________________________________________________
Follow up to "I booted Linux 293k times"
Author : pabs3
Score : 285 points
Date : 2023-06-18 12:26 UTC (10 hours ago)
(HTM) web link (rwmj.wordpress.com)
(TXT) w3m dump (rwmj.wordpress.com)
| samwillis wrote:
| Previous thread from original post:
| https://news.ycombinator.com/item?id=36325986
|
| 808 points, 3 days ago, 266 comments.
| amtamt wrote:
| > You see there are 52,363 commits between those two kernels,
|
| Would it have been possible to generate a list of commits on
| timeline and perform bisects in a method similar to binary
| search? Or that was already done and still 293k boots were
| needed? I am genuinely curious.
| detaro wrote:
| > _in a method similar to binary search?_
|
| That's what bisect means: you do a binary search for the
| breaking commit.
| kangalioo wrote:
| Exactly, and the reason 293k boots were still needed is that
| the bug was hard to reproduce: it took thousands of boots to
| potentially trigger it even once.
| hinkley wrote:
| That's why I always push telemetry onto a team. Concurrency
| and resource bugs are extremely expensive per iteration.
| Anything that can hint that the problem you are certain
| happened somewhere in a two month period was more likely to
| have happened in a particular week saves you a ton of work.
|
| The human brain struggles when it has to execute almost, but
| not quite, the same exercises over and over again. It becomes
| a blur, and after a while you can't recall whether you ran all
| of the steps this cycle (if you can't automate the test well).
| All it takes is typing "bad" when you meant "good" to end in
| failure. That doesn't sound likely, but when you're testing a
| negative it's much easier to get your wires crossed.
| cwillu wrote:
| Expanding on other comments: binary search is extremely
| sensitive to false positives and false negatives... false
| information generally. While other approaches will still put
| you in the right ballpark even with some amount of incorrect
| information, a binary search with a single flipped bit will
| consistently land you in entirely unrelated areas, so you need
| to be _damn_ sure your "git bisect good/bad" is correct.
| rwmj wrote:
| In the other post there was an interesting discussion about
| different approaches to searching for a bug (other than
| straight bisection) when your test is unreliable:
| https://news.ycombinator.com/item?id=36327263
| wging wrote:
| There _is_ `git bisect skip` (https://git-scm.com/docs/git-
| bisect#_bisect_skip), which lets you cope with a testing
| process that sometimes doesn't yield a true/false result. But
| it's more useful for untestable commits than for inconclusive
| results like you'd get from a single boot here.
| swordbeta wrote:
| I am unfamiliar with kernel dev, but shouldn't that commit
| also add some kind of unit test?
| rattray wrote:
| Bit of a spoiler, but I did find this a fun followup worth
| skimming:
|
| > So the commit I had bisected to was the wrong one again - it
| exposed a latent bug simply because it ran the same code as a
| sleep.
|
| The fix was by someone else, and is here:
| https://lore.kernel.org/stable/20230615091830.RxMV2xf_@linut...
| mitko wrote:
| It is not that the commit was wrong. It was the right commit,
| in the sense that it did expose and trigger the bug. OP is
| being too modest :)
|
| But the root cause of the bug was in another earlier commit. In
| a complex system such as the linux kernel, there could be
| multiple contributing factors. Both the "introducing a sleep"
| and the "root bug" are contributing factors, but we as
| engineers tend to think of the "root bug" as the actual
| problem, because we like more elegant explanations. Also, the
| "root bug" is likely something that goes against the ethos of
| the software project, and "common sense best practices", to a
| much larger degree than "introducing a sleep" is.
| asylteltine wrote:
| Can someone explain the problem and fix in plain English
| smaddox wrote:
| Based on my reading, under some circumstances the scheduler
| wasn't dedicating large enough chunks of time to a particular
| boot process for it to actually make progress. The fix appears
| to have been to change how the time chunk size was calculated.
| dev_tty01 wrote:
| As a curious onlooker, was the calculated time below a
| reasonable minimum?
| paulclinger wrote:
| Based on my reading, it looks like the timer accuracy wasn't
| sufficient, so they moved the calculation to a slightly later
| point in boot, when the high-resolution timer was already
| available.
| jseutter wrote:
| In retrospect you could say yes, but it was hard to tell.
| It ended up being either okay or not okay depending on
| Intel vs AMD and clock speed.
| delecti wrote:
| There was a rare race condition [1] bug in the linux kernel,
| and they didn't know precisely where it was introduced (which
| made it difficult to know what to change to fix it). They knew
| it was introduced at some point between two versions, but there
| were 52,363 code changes between those two versions. Because
| the bug was rare, you would have to test any given code change
| multiple times (thousands of times, in their case) to be
| confident that the bug was still present (if working
| backwards), or still absent (if working forwards). A way to do
| that faster is a binary search [2], or commit bisection.
|
| Basically, you know commit 0 is fine, and change 52,363 is bad,
| so you run 1000 tests on change 26,180 to cut in half the
| number of commits that could have introduced the problem. If
| the bug is still there, then you run 1000 tests on change
| 13,090; if the bug is not in change 26,180, then you run 1000
| tests on change 39,270. Eventually (after 15 or 16 iterations)
| you'll narrow down the problem to a single change which
| introduced the problem. Unfortunately code is complicated, so
| the first time they ran that procedure, their tests didn't find
| the right bug, so they did it again but slightly differently.
|
| (many details omitted for simplicity)
|
| [1] https://en.wikipedia.org/wiki/Race_condition
|
| [2] https://en.wikipedia.org/wiki/Binary_search_algorithm
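The procedure described above can be sketched as a simulation. Everything here is hypothetical except the 52,363 commit count: the culprit commit number, reproduction probability, and boots-per-step are made up for illustration.

```python
import random

def boot_once(commit, first_bad, p):
    """One simulated boot: the bug (if present) hangs the boot with probability p."""
    return commit >= first_bad and random.random() < p

def bisect_flaky(n_commits, first_bad, p=0.001, boots_per_step=20000):
    """Bisect with an unreliable reproducer: call a commit 'bad' only if
    the hang is seen at least once in boots_per_step attempts."""
    lo, hi = 0, n_commits
    while lo < hi:
        mid = (lo + hi) // 2
        bad = any(boot_once(mid, first_bad, p) for _ in range(boots_per_step))
        if bad:
            hi = mid          # bug already present at mid
        else:
            lo = mid + 1      # no repro: assume mid is good (risky!)
    return lo

random.seed(0)
print(bisect_flaky(52363, 30000))  # overwhelmingly likely to print 30000
```

The `lo = mid + 1` branch is the dangerous one: if the bug simply failed to reproduce in those boots, the search silently discards the half that contains the culprit, which is exactly the false-negative problem discussed elsewhere in the thread.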
| sh34r wrote:
| It's worth mentioning that this generalized problem in user
| space can often be solved much more easily using concurrent
| fuzz testing (ex: https://www.microsoft.com/en-
| us/research/project/cuzz-concur...).
|
| Unfortunately, when the problem is in kernel boot software,
| that is not a practical solution for obvious reasons, and
| you're left using more basic techniques like running a binary
| search on the commit history.
| ufo wrote:
| Furthermore, in this case there were actually two commits
| related to the bug. An older commit that introduced the race
| condition, and a second commit that triggered it (by subtly
| changing how long a certain function took to run).
|
| The long bisection process identified the second commit but
| it took further work to identify the true cause of the bug.
| MR4D wrote:
| "After 21 hours I got bored..."
|
| There are devs and then there are _kernel_ devs.
|
| The relentless pursuit of bugs is truly impressive.
| vasco wrote:
| Bored of waiting for the parallel run; I don't think they were
| investigating for 21h straight. Previous paragraph:
|
| > Paolo Bonzini suggested to me that I boot the kernel in
| parallel, in a loop, for 24 hours at the point immediately
| before the commit
|
| Otherwise I don't disagree that perseverance is important to
| get to the bottom of this stuff.
| kneebonian wrote:
| A great article by James Mickens entitled "The night watch"
| explains what it really means to be a kernel dev.
|
| https://scholar.harvard.edu/files/mickens/files/thenightwatc...
| plq wrote:
| Man, what a masterpiece, even the title gives me chuckles :)
| Thanks for bringing it up
| weinzierl wrote:
| Not surprised Thomas Gleixner ultimately found the root cause.
| He did a lot of real-time work, and this was a timer-related
| bug.
| mcdonje wrote:
| >but thanks to a heroic effort of many people (and not me,
| really)...
|
| That's funny because when I was reading through this I kept
| thinking, "Wow, this is a heroic effort."
|
| It was a heroic individual effort that led to a heroic team
| effort. The team wouldn't have been assembled had the individual
| not raised the alarm and focused the attention. Then the team was
| able to handle it through their collective efforts. Doubly
| inspiring.
| noir_lord wrote:
| I've observed this many times over my career: a thorny problem
| that no one wants to touch or look at sits there for ages
| until _I_ take a run at it; as soon as I do, everyone piles in
| and helps.
|
| It's the Bystander Effect in code.
| aendruk wrote:
| Reminds me of the psychology behind crowdmatching a la
| Snowdrift.coop. If no one else is working on the problem I
| also don't feel inclined to; it doesn't seem worth it because
| the chance of success is low. But if I know that other people
| are in it with me I'm suddenly motivated because I can expect
| my contributions to help actually solve the problem.
| alsetmusic wrote:
| Previous discussion:
| https://news.ycombinator.com/item?id=36325986
| x86x87 wrote:
| I think that the author is too modest. Refreshing to see.
|
| The amount of attention this received probably led to a way
| faster fix.
___________________________________________________________________
(page generated 2023-06-18 23:01 UTC)