[HN Gopher] Follow up to "I booted Linux 293k times"
       ___________________________________________________________________
        
       Follow up to "I booted Linux 293k times"
        
       Author : pabs3
       Score  : 285 points
       Date   : 2023-06-18 12:26 UTC (10 hours ago)
        
 (HTM) web link (rwmj.wordpress.com)
 (TXT) w3m dump (rwmj.wordpress.com)
        
       | samwillis wrote:
       | Previous thread from original post:
       | https://news.ycombinator.com/item?id=36325986
       | 
       | 808 points, 3 days ago, 266 comments.
        
       | amtamt wrote:
       | > You see there are 52,363 commits between those two kernels,
       | 
        | Would it have been possible to generate a list of commits on
        | a timeline and perform bisects in a method similar to binary
        | search? Or was that already done, and 293k boots were still
        | needed? I am genuinely curious.
        
         | detaro wrote:
         | > _in a method similar to binary search?_
         | 
          | That's what bisect means: you do a binary search for the
          | breaking commit.
        
           | kangalioo wrote:
            | Exactly, and the reason 293k boots were needed anyway is
            | that the bug was hard to reproduce and needed thousands of
            | boots to potentially trigger.
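The arithmetic behind "thousands of boots" is worth making concrete. A minimal sketch, assuming a hypothetical per-boot trigger probability (the real rate wasn't known in advance):

```python
import math

# If the race fires on any given boot with probability p, a buggy commit
# survives n boots with probability (1 - p) ** n. To call a commit "good"
# with 99.9% confidence, that survival probability must drop below 0.1%.
p = 0.002        # hypothetical per-boot trigger probability
target = 0.001   # acceptable chance of mislabeling a bad commit "good"
n = math.ceil(math.log(target) / math.log(1 - p))
print(n)         # 3451 boots per tested commit
```

Even a modest per-boot trigger rate forces thousands of boots before a "good" verdict can be trusted.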
        
             | hinkley wrote:
             | That's why I always push telemetry onto a team. Concurrency
             | and resource bugs are extremely expensive per iteration.
             | Anything that can hint that the problem you are certain
             | happened somewhere in a two month period was more likely to
             | have happened in a particular week saves you a ton of work.
             | 
              | The human brain struggles when it has to execute almost,
              | but not quite, the same exercises over and over again. It
              | becomes a blur, and after a while you can't recall whether
              | you ran all of the steps this cycle (if you can't automate
              | the test well), and all it takes is typing "bad" when you
              | meant "good" to send the search off the rails. That doesn't
              | sound likely, but when you're testing a negative it's much
              | easier to get your wires crossed.
        
         | cwillu wrote:
         | Expanding on other comments: binary search is extremely
         | sensitive to false positives and false negatives... false
         | information generally. While other approaches will still put
         | you in the right ballpark even with some amount of incorrect
         | information, a binary search with a single flipped bit will
         | consistently land you in entirely unrelated areas, so you need
         | to be _damn_ sure your "git bisect good/bad" is correct.
        
           | rwmj wrote:
           | In the other post there was an interesting discussion about
           | different approaches to searching for a bug (other than
           | straight bisection) when your test is unreliable:
           | https://news.ycombinator.com/item?id=36327263
        
           | wging wrote:
            | There _is_ `git bisect skip`
            | (https://git-scm.com/docs/git-bisect#_bisect_skip), which
            | lets you cope with a testing process that sometimes doesn't
            | yield a true/false result. But it's more useful for
            | untestable commits than for inconclusive results like you'd
            | get from a single boot here.
        
       | swordbeta wrote:
        | I am unfamiliar with kernel dev, but shouldn't that commit
        | also have included some kind of unit test?
        
       | rattray wrote:
       | Bit of a spoiler, but I did find this a fun followup worth
       | skimming:
       | 
       | > So the commit I had bisected to was the wrong one again - it
       | exposed a latent bug simply because it ran the same code as a
       | sleep.
       | 
       | The fix was by someone else, and is here:
       | https://lore.kernel.org/stable/20230615091830.RxMV2xf_@linut...
        
         | mitko wrote:
         | It is not that the commit was wrong. It was the right commit,
         | in the sense that it did expose and trigger the bug. OP is
         | being too modest :)
         | 
         | But the root cause of the bug was in another earlier commit. In
          | a complex system such as the Linux kernel, there could be
         | multiple contributing factors. Both the "introducing a sleep"
         | and the "root bug" are contributing factors, but we as
         | engineers tend to think of the "root bug" as the actual
         | problem, because we like more elegant explanations. Also, the
         | "root bug" is likely something that goes against the ethos of
         | the software project, and "common sense best practices", to a
         | much larger degree than "introducing a sleep" is.
        
       | asylteltine wrote:
        | Can someone explain the problem and fix in plain English?
        
         | smaddox wrote:
          | Based on my reading, under some circumstances the scheduler
          | wasn't dedicating large enough chunks of time to a particular
          | boot process for it to actually make progress. The fix appears
          | to have been to change how the time chunk size was calculated.
        
           | dev_tty01 wrote:
           | As a curious onlooker, was the calculated time below a
           | reasonable minimum?
        
             | paulclinger wrote:
              | Based on my reading, the timer's accuracy wasn't
              | sufficient, so they moved the calculation to a slightly
              | later point in boot, when the high-resolution timer was
              | already available.
        
             | jseutter wrote:
             | In retrospect you could say yes, but it was hard to tell.
             | It ended up being either okay or not okay depending on
             | Intel vs AMD and clock speed.
        
         | delecti wrote:
         | There was a rare race condition [1] bug in the linux kernel,
         | and they didn't know precisely where it was introduced (which
         | made it difficult to know what to change to fix it). They knew
         | it was introduced at some point between two versions, but there
         | were 52,363 code changes between those two versions. Because
         | the bug was rare, you would have to test any given code change
         | multiple times (thousands of times, in their case) to be
         | confident that the bug was still present (if working
         | backwards), or still absent (if working forwards). A way to do
         | that faster is a binary search [2], or commit bisection.
         | 
         | Basically, you know commit 0 is fine, and change 52,363 is bad,
         | so you run 1000 tests on change 26,180 to cut in half the
         | number of commits that could have introduced the problem. If
         | the bug is still there, then you run 1000 tests on change
         | 13,090; if the bug is not in change 26,180, then you run 1000
         | tests on change 39,270. Eventually (after 15 or 16 iterations)
         | you'll narrow down the problem to a single change which
         | introduced the problem. Unfortunately code is complicated, so
         | the first time they ran that procedure, their tests didn't find
         | the right bug, so they did it again but slightly differently.
         | 
         | (many details omitted for simplicity)
         | 
         | [1] https://en.wikipedia.org/wiki/Race_condition
         | 
          | [2] https://en.wikipedia.org/wiki/Binary_search_algorithm
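The "15 or 16 iterations" figure above checks out. A quick sketch, with the first bad change placed at a hypothetical position 30,000:

```python
# Count the bisection steps needed to isolate one change out of 52,363.
lo, hi = 0, 52_363            # known-good and known-bad endpoints
steps = 0
while hi - lo > 1:            # more than one candidate remains
    mid = (lo + hi) // 2
    if mid >= 30_000:         # hypothetical: bug introduced at 30,000
        hi = mid              # bug reproduces: first bad change <= mid
    else:
        lo = mid              # bug absent: first bad change > mid
    steps += 1
print(steps, hi)              # 16 30000
```

Sixteen halvings, each needing on the order of a thousand boots before a "good" verdict can be trusted, is how the totals climb so quickly.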
        
           | sh34r wrote:
           | It's worth mentioning that this generalized problem in user
           | space can often be solved much more easily using concurrent
           | fuzz testing (ex: https://www.microsoft.com/en-
           | us/research/project/cuzz-concur...).
           | 
           | Unfortunately, when the problem is in kernel boot software,
           | that is not a practical solution for obvious reasons, and
           | you're left using more basic techniques like running a binary
           | search on the commit history.
        
           | ufo wrote:
           | Furthermore, in this case there were actually two commits
           | related to the bug. An older commit that introduced the race
           | condition, and a second commit that triggered it (by subtly
           | changing how long a certain function took to run).
           | 
           | The long bisection process identified the second commit but
           | it took further work to identify the true cause of the bug.
        
       | MR4D wrote:
       | "After 21 hours I got bored..."
       | 
       | There are devs and then there are _kernel_ devs.
       | 
       | The relentless pursuit of bugs is truly impressive.
        
         | vasco wrote:
          | They were bored of waiting for the parallel run; I don't
          | think they were investigating for 21h straight. From the
          | previous paragraph:
         | 
         | > Paolo Bonzini suggested to me that I boot the kernel in
         | parallel, in a loop, for 24 hours at the point immediately
         | before the commit
         | 
         | Otherwise I don't disagree that perseverance is important to
         | get to the bottom of this stuff.
        
         | kneebonian wrote:
         | A great article by James Mickens entitled "The night watch"
         | explains what it really means to be a kernel dev.
         | 
         | https://scholar.harvard.edu/files/mickens/files/thenightwatc...
        
           | plq wrote:
           | Man, what a masterpiece, even the title gives me chuckles :)
           | Thanks for bringing it up
        
       | weinzierl wrote:
       | Not surprised Thomas Gleixner ultimately found the root cause. He
       | did a lot of real time work and this bug was a timer related bug.
        
       | mcdonje wrote:
       | >but thanks to a heroic effort of many people (and not me,
       | really)...
       | 
       | That's funny because when I was reading through this I kept
       | thinking, "Wow, this is a heroic effort."
       | 
       | It was a heroic individual effort that led to a heroic team
       | effort. The team wouldn't have been assembled had the individual
       | not raised the alarm and focused the attention. Then the team was
       | able to handle it through their collective efforts. Doubly
       | inspiring.
        
         | noir_lord wrote:
          | I've observed this many times over my career: a thorny
          | problem that no one wants to touch or look at sits there for
          | ages until _I_ take a run at it; as soon as I do, everyone
          | piles in and helps.
         | 
         | It's the Bystander Effect in code.
        
           | aendruk wrote:
           | Reminds me of the psychology behind crowdmatching a la
           | Snowdrift.coop. If no one else is working on the problem I
           | also don't feel inclined to; it doesn't seem worth it because
           | the chance of success is low. But if I know that other people
           | are in it with me I'm suddenly motivated because I can expect
           | my contributions to help actually solve the problem.
        
       | alsetmusic wrote:
       | Previous discussion:
       | https://news.ycombinator.com/item?id=36325986
        
       | x86x87 wrote:
       | I think that the author is too modest. Refreshing to see.
       | 
       | The amount of attention this received probably led to a way
       | faster fix.
        
       ___________________________________________________________________
       (page generated 2023-06-18 23:01 UTC)