[HN Gopher] Debugging: Indispensable rules for finding even the ...
___________________________________________________________________
Debugging: Indispensable rules for finding even the most elusive
problems (2004)
Author : omkar-foss
Score : 387 points
Date : 2025-01-13 12:07 UTC (10 hours ago)
(HTM) web link (dwheeler.com)
(TXT) w3m dump (dwheeler.com)
| mootoday wrote:
| Did anyone say debugging?
|
| I've followed https://debugbetter.com/ for a few weeks and the
| content has been great!
| spawarotti wrote:
| Very good online course on debugging: Software Debugging on
| Udacity by Andreas Zeller
|
| https://www.udacity.com/course/debugging--cs259
| belter wrote:
| Udacity is owned by Accenture? That is...Surprising.
| k3vinw wrote:
| The unspoken rule is talking to the rubber duck :)
| zarq wrote:
| That is literally #8 in the list
| dkdbejwi383 wrote:
| Rule #10 - read everything twice
| k3vinw wrote:
| Hmm. Perhaps, but mannequin is not nearly as whimsical
| sounding as a rubber duck which inspires you to bounce your
| ideas off of the inanimate object.
| apples_oranges wrote:
| A good bug is the most fun thing about software development
| jamesblonde wrote:
| Just let a LLM create even better bugs for you - Erik Meijer
| https://www.youtube.com/live/SsJqmV3Wtkg?si=MUoiNbWpsunsZ39y...
| waynecochran wrote:
| Sometimes I am actually happy when there is an obvious bug. It
| is like solving a murder mystery.
| BobbyTables2 wrote:
| And often you're the culprit too!
| urbandw311er wrote:
| Rule #10 - it's probably DNS
| tmountain wrote:
| Years ago, my boss thought he was being clever and set our
| server's DNS to the root nameservers. We kept getting sporadic
| timeouts on requests. That took a while to track down... I
| think I got a pizza out of the deal.
| knlb wrote:
| I wrote a fairly similar take on this a few years ago (without
| having read the original book mentioned here) --
| https://explog.in/notes/debugging.html
|
| Julia Evans also has a very nice zine on debugging:
| https://wizardzines.com/zines/debugging-guide/
| Hackbraten wrote:
| I love Julia Evans' zine! Bought several copies when it came
| out, gave some to coworkers and donated one to our office
| library.
| david_draco wrote:
| Step 10, add the bug as a test to the CI to prevent regressions?
| Make sure the CI fails before the fix and works after the fix.
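|
| A minimal sketch of what such a regression test could look like in
| Python. The function `parse_port` and its whitespace bug are
| hypothetical, invented only for illustration. The point is that the
| first test fails if the `.strip()` fix is removed and passes with it.

```python
import unittest


def parse_port(text: str) -> int:
    """Hypothetical function under test: parse a TCP port number.

    The imagined bug: input with surrounding whitespace, such as "8080\n",
    was rejected. The fix is the .strip() call below.
    """
    cleaned = text.strip()
    if not cleaned.isdigit():
        raise ValueError(f"not a port number: {text!r}")
    port = int(cleaned)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortRegressionTest(unittest.TestCase):
    def test_trailing_newline_is_accepted(self):
        # Reproduces the original report; fails without the fix.
        self.assertEqual(parse_port("8080\n"), 8080)

    def test_garbage_is_still_rejected(self):
        # Guards against "fixing" the bug by accepting everything.
        with self.assertRaises(ValueError):
            parse_port("80 80")
```

| Running the test once against the unfixed code, and seeing it fail,
| is what shows that it really exercises the bug.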
| soco wrote:
| What do you do with the years old bug fixes? How fast can one
| run the CI after a long while of accumulating tests? Do they
| still make sense to be kept in the long run?
| hobs wrote:
| Why would you want to stop knowing that your old bug fixes
| still worked in the context of your system?
|
| Saying "oh it's been good for a while now" has nothing to do
| with breaking it in the future.
| simmonmt wrote:
| This is a great problem to have, if (IME) rare. Step 1
| Understand the System helps you figure out when tests can be
| eliminated as no longer relevant and/or which tests can be
| merged.
| hsbauauvhabzb wrote:
| I think for some types of bugs a CI test would be valuable if
| the developer believes regressions may occur, for other bugs
| they would be useless.
| jerf wrote:
| I'm not particularly passionate about arguing the exact
| details of "unit" versus "integration" testing, let alone
| breaking down the granularity beyond that as some do, but I
| am passionate that they need to be _fast_, and this is why.
| By that, I mean, it is a perfectly viable use of engineering
| time to make changes that deliberately make running the tests
| faster.
|
| A lot of slow tests are slow because nobody has even tried to
| speed them up. They just wrote something that worked,
| probably literally years ago, that does something horrible
| like fully build a docker container and fully initialize a
| complicated database and fully do an install of the system
| and starts processes for everything and liberally uses
| "sleep"-based concurrency control and so on and so forth,
| which was fine when you were doing that 5 times but becomes a
| problem when you're trying to run it hundreds of times, and
| that's a problem, because we really ought to be running it
| hundreds of thousands or millions of times.
|
| I would love to work on a project where we had so many well-
| optimized automated tests that despite their speed they were
| still a problem for building. I'm sure there's a few out
| there, but I doubt it's many.
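|
| As one illustration of replacing "sleep"-based concurrency control, a
| test can wait on an explicit readiness signal instead of sleeping for
| a fixed time. This is only a sketch in Python: `start_worker` and
| `wait_until_ready` are hypothetical names, and a real service would
| more likely signal readiness through a socket or a health endpoint.

```python
import threading


def start_worker(ready: threading.Event) -> threading.Thread:
    """Hypothetical background service that signals when it is ready."""
    def run() -> None:
        # ... real setup work would go here ...
        ready.set()  # tell waiters that startup has finished

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def wait_until_ready(ready: threading.Event, timeout: float = 5.0) -> None:
    # Returns as soon as the event is set. The timeout is only an upper
    # bound for the failure case, not a fixed delay like time.sleep(5).
    if not ready.wait(timeout):
        raise TimeoutError(f"worker not ready after {timeout} s")
```

| A test would create the event, call `start_worker(ready)`, then
| `wait_until_ready(ready)`, and so pays only for the time startup
| actually takes.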
| gregthelaw wrote:
| I would say yes, your CI should accumulate all of those
| regression tests. Where I work we now have many, many
| thousands of regression test cases. There's a subset to be
| run prior to merge which runs in reasonable time, but the
| full CI just cycles through.
|
| For this to work all the regression tests must be fast, and
| 100% reliable. It's worth it though. If the mistake was made
| once, unless there's a regression test to catch it, it'll be
| made again at some point.
| Tade0 wrote:
| The largest purely JavaScript repo I ever worked on (150k LoC)
| had this rule and it was a life saver, particularly because the
| project had commits dating back more than five years and since
| it was a component/library, it had quite a few strange hacks for
| IE.
| nonrandomstring wrote:
| Yes, just more generally _document it_
|
| I've lost count of how many things I've fixed only to see:
|
| 1) It recurs because a deeper "cause" of the bug reactivated
| it.
|
| 2) Nobody knew I fixed something so everyone continued to
| operate workarounds as if the bug was still there.
|
| I realise these are related and arguably already fall under
| "You didn't fix it". That said a bit of writing-up and root-
| cause analysis after getting to "It's fixed!" seems helpful to
| others.
| seanwilson wrote:
| I don't think this is always worth it. Some tests can be time
| consuming or complex to write, have to be maintained, and we
| accept that a test suite won't be testing all edge cases
| anyway. A bug that made it to production can mean that
| particular bug might happen again, but it could be a silly
| mistake and no more likely to happen again than 100s of other
| potential silly mistakes. It depends, and writing tests isn't
| free.
| Ragnarork wrote:
| Writing tests isn't free but writing non-regression tests for
| bugs that were actually fixed is one of the best test cases
| to consider writing right away, before the bug is fixed.
| You'll be reproducing the bug anyway (so already consider how
| to reproduce). You'll also have the most information about it
| to make sure the test is well written anyway, after building
| a mental model around the bug.
|
| Writing tests isn't free, I agree, but in this case a good
| chunk of the cost of writing them will have already been paid
| in a way.
| jerf wrote:
| For people who aren't getting the value of unit tests, this
| is my intro to the idea. You had to do some sort of testing
| on your code. At its core, the concept of unit testing is
| just, what if instead of throwing away that code, you
| _kept_ it?
|
| To the extent that other concerns get in the way of the
| concept, like the general difficulty of testing that GUIs
| do what they are supposed to do, I don't blame the concept
| of unit testing; I blame the techs that make the testing
| hard.
| Ragnarork wrote:
| I also think that this is a great way to emphasize their
| value.
|
| If anything I'd only keep those if it's hard to write
| them, if people push back against it (and I myself don't
| like them sometimes, e.g. when the goal is just to push
| up the coverage metric but without _actually_ testing
| much, which only adds test code to maintain but no real
| testing value...).
|
| Like any topic there's no universal truth and lots of
| ways to waste time and effort, but this specifically is
| extremely practical and useful in a very explicit manner:
| just fix it once and catch it the next time before
| production. Massively reduce the chance one thing has to
| be fixed twice or more.
| n144q wrote:
| I can't count how many times when other people ask me
| "how can I use this API?", I just send a test case to
| them. Best example you can give to someone that is never
| out of sync.
| seanwilson wrote:
| > Writing tests isn't free, I agree, but in this case a
| good chunk of the cost of writing them will have already
| been paid in a way.
|
| Some examples that come to mind are bugs to do with UI
| interactions, visuals/styling, external online
| APIs/services, gaming/simulation stuff, and
| asynchronous/thread code, where it can be a substantial
| effort to write tests for, vs fixing the bug that might
| just be a typo. This is really different compared to if
| you're testing some pure functions that only need a few
| inputs.
|
| It depends on what domain you're working in, but I find
| people very rarely mention how much work certain kinds of
| test can be to write, especially if there aren't similar
| tests written already and you have to do a ton of setup
| like mocking, factories, and UI automation.
| Ragnarork wrote:
| Definitely agree with you on the fact that there are
| tests which are complicated to write and will take
| effort.
|
| But I think all other things considered my impression
| still holds, and that I should maybe rather say they're
| _easier_ to write in a way, though not necessarily easy.
| n144q wrote:
| You (or other people) will be thankful in a few months/years
| when refactoring the code, knowing there's no need to worry
| about missing edge cases, because all known edge cases are
| covered by these non-regression tests.
| seanwilson wrote:
| There's really no situation you wouldn't write a test? Have
| you not had situations where writing the test would take a
| lot of effort vs the risk/impact of the bug it's checking
| for? Your test suite isn't going to be exhaustive anyway so
| there's always a balance of weighing up what's worth
| testing. Going overkill with tests can actually get in the
| way of refactoring as well when a change to the UI or API
| isn't done because it would require updating too many
| tests.
| qwertox wrote:
| Make sure you're editing the correct file on the correct machine.
| chupasaurus wrote:
| Poor Yorick!
| spacebanana7 wrote:
| Also that it's in the correct folder
| eddd-ddde wrote:
| How much time I've wasted unknowingly editing generated files,
| out of version files, forgetting to save, ... only god knows.
| ZedZark wrote:
| Yep, this is a variation of "check the plug"
|
| I find myself doing this all the time now: I will temporarily
| add a line to cause a fatal error, to check that it's the right
| file (and, depending on the situation, also the right line)
| shmoogy wrote:
| I'm glad I'm not the only one doing this after I wasted too
| much time trying to figure out why my docker build was not
| reflecting the changes ... never again..
| overhead4075 wrote:
| This is also covered by "make it fail"
| ajuc wrote:
| That's why you make it break differently first. To see your
| changes have any effect.
| snowfarthing wrote:
| When working on a test that has several asserts, I have
| adopted the process of adding one final assert, "assert 'TEST
| DEBUGGED' is False", so that even when I succeed, the test
| fails -- and I could review to consider if any other tests
| should be added or adjusted.
|
| Once I'm satisfied with the test, I remove the line.
| reverendsteveii wrote:
| very first order of business: git stash && git checkout main &&
| git pull
| n144q wrote:
| ...and you are building and running the correct clone of a
| repository
| netcraft wrote:
| The biggest thing I've always told myself and anyone I've
| taught: make sure you're running the code you think you're
| running.
| waynecochran wrote:
| I also think it is worthwhile stepping thru working code with a
| debugger. The actual control flow reveals what is actually
| happening and will tell you how to improve the code. It is also a
| great way to demystify how others' code runs.
| ajross wrote:
| I think that fits nicely under rule 1 ("Understand the
| system"). The rules aren't about tools and methods, they're
| about core tasks and the reason behind them.
| nthingtohide wrote:
| Make sure through pure logic that you have correctly identified
| the Root Cause. Don't fix other probable causes. This is very
| important.
| sumtechguy wrote:
| That is rule #3. quit thinking and look. Use whatever tool you
| need and look at what is going on. The next few rules (4-6) are
| what you need to do while you are doing step #3.
| nottorp wrote:
| This is necessary sometimes when you're simply working on an
| existing code base.
| berikv wrote:
| Personally, I'd start with divide and conquer. If you're working
| on a relevant code base chances are that you can't learn all the
| API spec and documentation because it's just too much.
| berikv wrote:
| Also: Fix every bug twice: Both the implementation and the
| "call site" -- if at all possible
| BobbyTables2 wrote:
| Ye ol' "belt and suspenders" approach?
| causal wrote:
| Check the plug should be first
| begueradj wrote:
| This is related to the classic debugging book with the same
| title. I first discovered it here in HN.
| fn-mote wrote:
| The article is a 2024 "review" (really more of a very brief
| summary) of a 2002 book about debugging.
|
| The list is fun for us to look at because it is so familiar. The
| enticement to read the book is the stories it contains. Plus the
| hope that it will make our juniors more capable of handling
| complex situations that require meticulous care...
|
| The discussion on the article looks nice but the submitted title
| breaks the HN rule about numbering (IMO). It's a catchy take on
| the post anyway. I doubt I would have looked at a more mundane
| title.
| bananapub wrote:
| > The article is a 2024 "review"
|
| 200 _4_.
| TheLockranore wrote:
| Rule 11: If you haven't solved it and reach this rule, one of
| your assertions is incorrect. Start over.
| duxup wrote:
| I'm so bad at #1.
|
| I know it is the best route, I do know the system (maybe I wrote
| it) and yet time and again I don't take the time to read what I
| should... and I make assumptions in hopes of speeding up the
| process/ fix, and I cost myself time...
| nickjj wrote:
| For #4 (divide and conquer), I've found `git bisect` helps a lot.
| If you have a known good commit and one of dozens or hundreds of
| commits after that is bad, this can help you identify the bad
| commit / code in a few steps.
|
| Here's a walk through on using it:
| https://nickjanetakis.com/blog/using-git-bisect-to-help-find...
|
| I jumped into a pretty big unknown code base in a live consulting
| call and we found the problem pretty quickly using this method.
| Without that, the scope of where things could be broken was too
| big given the context (unfamiliar code base, multiple people
| working on it, only able to chat with 1 developer on the project,
| etc.).
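|
| Roughly what bisecting does, as a Python sketch rather than git's
| actual implementation. It assumes a linear history ordered oldest
| to newest, a known-good first commit, a known-bad last commit, and
| a bug that stays once introduced. `first_bad` and `is_bad` are
| hypothetical names.

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return the first bad commit and the number of commits tested."""
    good, bad = 0, len(commits) - 1  # indices of known-good / known-bad
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1
        if is_bad(commits[mid]):
            bad = mid   # bug already present: culprit is at mid or earlier
        else:
            good = mid  # still fine: culprit is after mid
    return commits[bad], tested
```

| Each test halves the remaining range, so a range of 200 commits
| needs at most 8 tests instead of up to 198.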
| Icathian wrote:
| Tacking on my own article about git bisect run. It really is an
| amazing little tool.
|
| https://andrewrepp.com/git_bisect_run
| jerf wrote:
| "git bisect" is why I maintain the discipline that all commits
| to the "real" branch, however you define that term, should all
| individually build and pass all (known-at-the-time) tests and
| generally be deployable in the sense that they would "work" to
| the best of your knowledge, even if you do not actually want to
| deploy that literal release. I use this as my #1 principle,
| above "I should be able to see every keystroke ever written" or
| "I want every last 'Fixes.' commit" that is sometimes advocated
| for here, because those principles make bisect useless.
|
| The thing is, I don't even bisect that often... the discipline
| necessary to maintain that in your source code heavily overlaps
| with the disciplines to prevent code regression and bugs in the
| first place, but when I do finally use it, it can pay for
| itself in literally one shot once a year, because we get bisect
| out for the biggest, most mysterious bugs, the ones that I know
| from experience can involve devs staring at code for
| potentially weeks, and while I'm yet to have a bisect that
| points at a one-line commit, I've definitely had it hand us
| multiple-day's-worth of clue in one shot.
|
| If I was maintaining that discipline _just_ for bisect we might
| quibble with the cost/benefits, but since there's a lot of
| other reasons to maintain that discipline anyhow, it's a big
| win for those sorts of disciplines.
| skydhash wrote:
| Same. Every branch apart from the "real" one and release
| snapshots is transient and WIP. They don't get merged back
| unless tests pass.
| forrestthewoods wrote:
| > why I maintain the discipline that all commits to the
| "real" branch, however you define that term, should all
| individually build and pass all (known-at-the-time) tests and
| generally be deployable in the sense that they would "work"
| to the best of your knowledge, even if you do not actually
| want to deploy that literal release
|
| You're spot on.
|
| However it's clearly a missing feature that Git/Mercurial
| can't tag diffs as "passes" or "bisectable".
|
| This is especially annoying when you want to merge a stack of
| commits and the top passes all tests but the middle does not.
| It's a monumental and valueless waste of time to fix the
| middle of the stack. But it's required if you want to
| maintain bisectability.
|
| It's very annoying and wasteful. :(
| Izkata wrote:
| If there's a way to identify those incomplete commits, git
| bisect does support "skip" - a commit that's neither good
| nor bad, just ignored.
| michalsustr wrote:
| This is why we use squash like here
| https://docs.gitlab.com/ee/user/project/merge_requests/squas...
| forrestthewoods wrote:
| I explicitly don't want squash. The commits are still
| worth keeping separate. There's lots of distinct pieces
| of work. But sometimes you break something and fix it
| later. Or you add something new but support different
| environments/platforms later.
| gregthelaw wrote:
| But if you don't squash, doesn't this render git bisect
| almost useless?
|
| I think every commit that gets merged to main should be
| an atomic believed-to-work thing. Not only does this make
| bisect way more effective, but it's a much better
| narrative for others to read. You should write code to be
| as readable by others as possible, and your git history
| likewise.
| snowfarthing wrote:
| As someone who doesn't like to see history lost via
| "rebase" and "squashing" branches, I have had to think
| through some of these things, since my personal
| preferences are often trampled on by company policy.
|
| I have only been in one place where "rebase" is used
| regularly, and now that I'm a little more familiar with
| it, I don't mind using it to bring in changes from a
| parent branch into a working branch, if the working
| branch hasn't been pushed to origin. It still weirds me
| out somewhat, and I don't see why a simple merge can't
| just be the preferred way.
|
| I have, however, seen "squashing" regularly (and my
| current position uses it as well as rebasing) -- and I
| don't particularly like it, because sometimes I put in
| notes and trials that get "lost" as the task progresses,
| but nonetheless might be helpful for future work. While
| it's often standard to delete "squashed" branches, I
| cannot help but think that, for history-minded folks like
| me, a good compromise would be to "squash and keep" -- so
| that the individual commits don't pollute the parent
| branch, while the details are kept around for anyone
| needing to review them.
|
| Having said that, I've never been in a position where I
| felt like I need to "forcibly" push for my preferences. I
| just figure I might as well just "go with the flow", even
| if a tiny bit of me dies every time I squash or rebase
| something, or delete a branch upon merging!
| rlkf wrote:
| I use git-format-patch to create a list of diffs for the
| individual commits before the branch gets squashed, and
| tuck them away in a private directory. Several times have
| I gone back to peek at those lists to understand my own
| thoughts later.
| rlkf wrote:
| Could you not use --first-parent option to test only at the
| merge-points?
| SoftTalker wrote:
| I do bisecting almost as a last resort. I've used it when all
| else fails only a few times. Especially as I've never worked
| on code where it was very easy to just build and deploy a
| working debug system from a random past commit.
|
| Edit to add: I will study old diffs when there is a bug,
| particularly for bugs that seem correlated with a new
| release. Asking "what has changed since this used to work?"
| often leads to an obvious cause or at least helps narrow
| where to look. Also asking the person who made those changes
| for help looking at the bug can be useful, as the code may be
| more fresh in their mind than in yours.
| aag wrote:
| Sometimes you'll find a repo where that isn't true.
| Fortunately, git bisect has a way to deal with failed builds,
| etc: three-value logic. The test program that git bisect runs
| can return an exit value that means that the failure didn't
| happen, a different value that means that it did, or a third
| that means that it neither failed nor succeeded. I wrote up
| an example here:
|
| https://speechcode.com/blog/git-bisect
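|
| A sketch of such a test program in Python, for use with `git bisect
| run`. The exit codes follow the git documentation: 0 marks the
| commit good, 125 asks git to skip it, and any other code from 1 to
| 127 marks it bad. The `make` commands and the names `classify` and
| `main` are placeholders, not taken from the linked post.

```python
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def classify(build_cmd: list, test_cmd: list) -> int:
    """Map build and test results onto git bisect's three outcomes."""
    if subprocess.run(build_cmd).returncode != 0:
        return SKIP  # cannot build this commit, so it cannot be judged
    return GOOD if subprocess.run(test_cmd).returncode == 0 else BAD


def main() -> None:
    # Placeholder commands: substitute the project's real build and test.
    sys.exit(classify(["make"], ["make", "test"]))
```

| Saved as a hypothetical `bisect_check.py` that ends by calling
| `main()`, it could be driven with `git bisect start <bad> <good>`
| followed by `git bisect run python3 bisect_check.py`.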
| Noumenon72 wrote:
| Very handy! I forgot about `git bisect skip`.
| jvans wrote:
| git bisect is an absolute power feature everybody should be
| aware of. I use it maybe once or twice a year at most but it's
| the difference between fixing a bug in an hour vs spending days
| or weeks spinning your wheels
| ajross wrote:
| Not to complain about bisect, which is great. But IMHO it's
| _really_ important to distinguish the philosophy and mindspace
| aspect to this book (the "rules") from the practical advice
| ("tools").
|
| Someone who thinks about a problem via "which tool do I want"
| (c.f. "git bisect helps a lot"[1]) is going to be at a huge
| disadvantage to someone else coming at the same decisions via
| "didn't this used to work?"[2]
|
| The world is filled to the brim with tools. Trying to file away
| all the tools in your head just leads to madness. Embrace
| philosophy first.
|
| [1] Also things like "use a time travel debugger", "enable
| logging", etc...
|
| [2] e.g. "This state is illegal, where did it go wrong?", "What
| command are we trying to process here?"
| nottorp wrote:
| Just be careful to not contradict #3 "Quit thinking and
| look".
| ajross wrote:
| Touche
| gregthelaw wrote:
| I've spent the past two decades working on a time travel
| debugger so obviously I'm massively biassed, but IMO most
| programmers are not nearly as proficient in the available
| debug tooling as they should be. Consider how long it takes
| to pick up a tool so that you at least have a vague
| understanding of what it can do, and compare to how much time
| a programmer spends debugging. Too many just spend hour after
| hour hammering out printf's.
| smcameron wrote:
| Back in the 1990s, while debugging some network configuration
| issue a wiser older colleague taught me the more general
| concept that lies behind git bisect, which is "compare the
| broken system to a working system and systematically eliminate
| differences to find the fault." This can apply to things other
| than software or computer hardware. Back in the 90s my friend
| and I had identical jet-skis on a trailer we shared. When
| working on one of them, it was nice to have its twin right
| there to compare it to.
| epolanski wrote:
| Bisection is also useful when debugging css.
|
| When you don't know what is breaking that specific scroll or
| layout somewhere in the page, you can just remove half the DOM
| in the dev tools and check if the problem is still there.
|
| Rinse and repeat, it's a basic binary search.
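|
| The same halving idea as a small Python sketch. It assumes exactly
| one item is responsible and that the problem can be re-checked with
| only a subset present. `find_culprit` and `still_broken` are
| hypothetical names, and the list stands in for DOM nodes or rules.

```python
from typing import Callable, List, Sequence


def find_culprit(items: Sequence[str],
                 still_broken: Callable[[List[str]], bool]) -> str:
    """Narrow a non-empty list of suspects down to the one at fault."""
    suspects = list(items)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        # Keep whichever half still reproduces the problem.
        suspects = half if still_broken(half) else suspects[len(half):]
    return suspects[0]


sections = ["header", "nav", "sidebar", "carousel", "footer"]
culprit = find_culprit(sections, lambda subset: "carousel" in subset)
print(culprit)  # prints "carousel"
```

| When several items interact to cause the problem, this simple
| version can miss it, and the halves have to be recombined by hand.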
|
| I am often surprised that leetcode black belts are absolutely
| unable to apply what they learn in the real world, neither in
| code nor debugging which always reminds me of what a useless
| metric to hire engineers it is.
| tetha wrote:
| You can also use divide and conquer when dealing with a complex
| system.
|
| Like, traffic going from A to B can turn ... complicated with
| VPNs and such. You kinda have source firewalls, source routing,
| connectivity of the source to a router, routing on the router,
| firewalls on the router, various VPN configs that can go wrong,
| and all of that on the destination side as well. There can
| easily be 15+ things that can cause the traffic to disappear.
|
| That's why our runbook recommends to start troubleshooting by
| dumping traffic on the VPN nodes. That's a very low-effort,
| quick step to figure out which of the six-ish legs of the
| journey drops traffic - to VPN, through VPN, to destination,
| back to VPN node, back through VPN, back to source. Then you
| realize traffic back to VPN node disappears and you can dig
| into that.
|
| And this is a powerful concept to think through in system
| troubleshooting: Can I understand my system as a number of
| connected tubes, so that I have a simple, low-effort way to
| pinpoint one tube to look further into?
|
| As another example, for many services, the answer here is to
| look at the requests on the loadbalancer. This quickly isolates
| which services are throwing errors blowing up requests, so you
| can start looking at those. Or, system metrics can help - which
| services / servers are burning CPU and thus do something, and
| which aren't? Does that pattern make sense? Sometimes this can
| tell you what step in a pipeline of steps on different systems
| fails.
| rozap wrote:
| Binary search rules. Being systematic about dividing the
| problem in half, determining which half the issue is in, and
| then repeating applies to non software problems quite well. I
| use the strategy all the time while troubleshooting issues with
| cars, etc.
| yuliyp wrote:
| The principle here "bisection" is a lot more general than just
| "git bisect" for identifying ranges of commits. It can also be
| used for partitioning the _space_ of systems. For instance, if
| a workflow with 10 steps is broken, can you perform some tests
| to confirm that 5 of the steps functioned correctly? Can you
| figure out that it's definitely not a hardware issue (or
| definitely a hardware issue) somewhere?
|
| This is critical to apply in cases where the problem might not
| even be caused by a code commit in the repo you're bisecting!
| heikkilevanto wrote:
| Some additional rules:
|
| - "It is your own fault". Always suspect your code changes before
| anything else. It can be a compiler bug or even a hardware error,
| but those are very rare.
| - "When you find a bug, go back and hunt down its family and
| friends". Think where else the same kind of thing could have
| happened, and check those.
| - "Optimize for the user first, the maintenance programmer second,
| and last if at all for the computer".
| ajuc wrote:
| It's healthier to assume your code is wrong than otherwise. But
| it's best to simply bisect the cause-effect chain a few more
| times and be sure.
| wormlord wrote:
| I always have the mindset of "it's my fault". My Linux
| workstation constantly crashing because of the i9-13900k in it
| was honestly humiliating. Was very relieved when I realized it
| was the CPU and not some impossible to find code error.
| dehrmann wrote:
| Linux is like an abusive relationship in that way--it's
| always your fault.
| physicles wrote:
| The first one is known in the Pragmatic Programmer as "select
| isn't broken." Summarized at
| https://blog.codinghorror.com/the-first-rule-of-programming-...
| bsammon wrote:
| Alternatively, I've found the "Maybe it's a bug. I'll try and
| make a test case I can report on the mailing list" approach
| useful at times.
|
| Usually, in the process of reducing my error-generating code
| down to a simpler case, I find the bug in my logic. I've been
| fortunate that heisenbugs have been rare.
|
| Once or twice, I have ended up with something to report to the
| devs. Generally, those were libraries (probably from
| sourceforge/github) with only a few hundred or less users that
| did not get a lot of testing.
| astrobe_ wrote:
| About "family and friends", a couple of times by fixing minor
| and _a priori_ unrelated side issues, it revealed the bug I was
| after.
| ChrisArchitect wrote:
| (2004)
|
| Title is: David A. Wheeler's Review of Debugging by David J.
| Agans
| Tepix wrote:
| My first rule for debugging debutants:
|
| _Don't be too embarrassed to scatter debug log messages in the
| code. It helps._
|
| My second rule:
|
| _Don't forget to remove them when you're done._
| lanstin wrote:
| My rule for a long time has been anytime I add a print or log,
| except for the first time I am writing some new code with
| tricky logic, which I try not to do, never delete it. Lower it
| to the lowest possible debug or trace level but if it was
| useful once it will be useful again, even if only to document
| the flow thru the code on full debug.
|
| The nicest log package I had would always count the number of
| times a log msg was hit even if the debug level meant nothing
| happened. The C preprocessor made this easy, haven't been able
| to get a short way to do this counting efficiently in other
| languages.
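|
| One way to approximate that counting behaviour in Python, as a
| sketch only. `CountingLogger` is a hypothetical wrapper, not part
| of the standard `logging` module, and unlike a preprocessor-based
| version it pays for a dictionary update on every call.

```python
import logging
from collections import Counter


class CountingLogger:
    """Wrap a logging.Logger and count every debug call by its message
    template, whether or not the current level lets it through."""

    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger
        self.hits: Counter = Counter()

    def debug(self, msg: str, *args: object) -> None:
        self.hits[msg] += 1             # counted even if DEBUG is disabled
        self._logger.debug(msg, *args)  # emitted only if DEBUG is enabled


log = CountingLogger(logging.getLogger("demo"))
for key in range(3):
    log.debug("cache miss for key %s", key)
```

| Afterwards `log.hits` maps each message template to the number of
| times that line was reached, which documents the flow through the
| code even with debug output switched off.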
| sitkack wrote:
| I really like this.
| condour75 wrote:
| One good timesaver: debug in the easiest environment that you can
| reproduce the bug in. For instance, if it's an issue with a
| website on an iPad, first see if you reproduce in chrome using
| the responsive tools in web developer. If that doesn't work, see
| if it reproduces in desktop safari. Then the iPad simulator, and
| only then the real hardware. Saves a lot of frustration and time,
| and each step towards the actual hardware eliminates a whole
| category of bugs.
| ChrisMarshallNY wrote:
| _#7 Check the plug: Question your assumptions, start at the
| beginning, and test the tool._
|
| I have found that 90% of network problems, are bad cables.
|
| That's not an exaggeration. Most IT folks I know, throw out
| ethernet cables immediately. They don't bother testing them. They
| just toss 'em in the trash, and break a new one out of the
| package.
| nickcw wrote:
| I prefer to cut the connectors off with savage vengeance before
| tossing the faulty cable ;-)
| teleforce wrote:
| The tenth golden rule:
|
| 10) Enable frame pointers [1].
|
| [1] The return of the frame pointers:
|
| https://news.ycombinator.com/item?id=39731824
| PhunkyPhil wrote:
| I would almost change 4 into "Binary search".
|
| Wheeler gets close to it by suggesting to locate which side of
| the bug you're on, but often I find myself doing this recursively
| until I locate it.
| ajuc wrote:
| Yeah people say use git bisect but that's another dimension
| (which change introduced the bug).
|
| Bisecting is just as useful when searching for the layer of
| application which has the bug (including external libraries,
| OS, hardware, etc.) or data ranges that trigger the bug.
| There's just no handy tools like git bisect for that. So this
| amounts to writing down what you tested and removing the
| possibilities that you excluded with each test.
| analog31 wrote:
| One I learned on Friday: Check your solder connections under a
| microscope before hacking the firmware.
| InitialLastName wrote:
| The worst is when it works only when your oscilloscope probe is
| pushing down on the pin.
| GuB-42 wrote:
| Rule 0: Don't panic
|
| Really, that's important. You need to think clearly, deadlines
| and angry customers are a distraction. That's also when having a
| good manager who can trust you is important, his job is to shield
| you from all that so that you can devote all of your attention to
| solving the problem.
| Cerium wrote:
| Slow is smooth and smooth is fast. If you don't have time to do
| it right, what makes you think there is time to do it twice?
| sitkack wrote:
| "We have to do _something_!"
| Bootvis wrote:
| And this is something, so I'm doing it.
| adamc wrote:
| I had a boss who used to say that her job was to be a crap
| umbrella, so that the engineers under her could focus on their
| actual jobs.
| dazzawazza wrote:
| Ideally it's crap umbrellas all the way down. Everyone should
| be shielding everyone below them from the crap slithering its
| way down.
| generic92034 wrote:
| That must be the trickle-down effect everyone was talking about
| in the '80s. ;)
| airblade wrote:
| At first I thought you meant an umbrella that doesn't work
| very well.
| chrsig wrote:
| Also a pager/phone going off incessantly isn't useful either.
| Manage your alarms or you'll be throwing your phone at a wall.
| ianmcgowan wrote:
| There's a story in the book - on nuclear submarines there's a
| brass bar in front of all the dials and knobs, and the
| engineers are trained to "grab the bar" when something goes
| wrong rather than jumping right to twiddling knobs to see what
| happens.
| throwawayfks wrote:
| I read this book and took this advice to heart. I don't have
| a brass bar in the office, but when I'm about to push a
| button that could cause destructive changes, especially in
| prod, my hands reflexively fly up into the air while I
| double-check everything.
| tetha wrote:
| A weird, yet effective recommendation from someone at my
| last job: If it's a destructive or dangerous action in
| prod, touch both your elbows first. This forces you to take
| your hands away from the keyboard, stop any possible
| autopilot and look at what you're doing.
| gregthelaw wrote:
| Related: write down what you're seeing (or rather, what
| you _think_ you're seeing), and do so with pen and paper,
| not the keyboard. You can type way faster than you can
| write, and the slowness of writing makes you think harder
| about what you think you know. Often you do know the
| answer, you just have to tease it out. Or there are gaps
| in your knowledge that you hadn't clocked. After all, an
| assumption is something you don't realise you've made.
|
| This also works well in conjunction with debug tooling --
| the tooling gives you the raw information, writing down
| that information helps join the dots.
| toolslive wrote:
| "a good chess player sits on his hands" (NN). It's good
| advice as it prevents you from playing an impulsive move.
| stronglikedan wrote:
| I have to sit on my hands at the dentist to prevent impulse
| moves.
| Shorn wrote:
| I always wanted that to be a true story, but I don't think it
| is.
| ahci8e wrote:
| Uff, yeah. I used to work with a guy who would immediately turn
| the panic up to 11 at the first thought of a bug in prod. We
| would end up with worse architecture after his "fix" or he
| would end up breaking something else.
| bilekas wrote:
| This is very underrated. An extension to this is: don't be
| afraid to break things further to probe. I often see a lot of
| devs, mid-level included, panicking, which prevents them from
| even knowing where to start. I've come to believe that some people
| just have an inherent intuition and some just need to learn it.
| jimmySixDOF wrote:
| Yes, sometimes instinct takes over when you're on the spot
| in a pinch, but there are institutional things you can do to
| be prepared in advance that expand your set of options in the
| moment, much like a pre-prepared fire-drill playbook you can
| pull from. There are also training courses like Kepner-Tregoe.
| But you are right, there are just some people who do better
| than others when _it's hitting the fan.
| augbog wrote:
| 100% agree. I remember I had an on-call and our pagerduty
| started going off for a SEV-2 and naturally a lot of managers
| from teams that are affected are in there sweating bullets
| because their products/features/metrics are impacted. It can
| get pretty frustrating having so many people try to be cooks in
| the kitchen. We had a great manager who literally just moved us
| to a different call/meeting and he told us "ignore everything
| those people are saying; just stay focused and I'll handle
| them." Everyone's respect for our manager really went up from
| there.
| CobrastanJorji wrote:
| I once worked on a team where, when a serious visible incident
| occurred, a company VP would pace the floor, occasionally
| yelling, describing how much money we were losing per second
| (or how much customer trust if that number was too low) or
| otherwise communicating that we were in a battlefield situation
| and things were Very Critical.
|
| Later I worked for a company with a much bigger and more
| critical website, and the difference in tone during urgent
| incidents was amazing. The management made itself available for
| escalations and took a role in externally communicating what
| was going on, but besides that they just trusted us to do our
| jobs. We could even go get a glass of water during the incident
| without a VP yelling at us. I hadn't realized until that point
| that being calm adults was an option.
| o_nate wrote:
| A corollary to this is always have a good roll-back plan. It's
| much nicer to be able to roll-back to a working version and
| then be able to debug without the crisis-level pressure.
| sroussey wrote:
| Rollback ability is a must--it can be the most used
| mitigation if done right.
|
| Not all issues can be fixed with a rollback though.
| BlueUmarell wrote:
| Post: "9 rules of debugging"
|
| Each comment: "..and this is my 10th rule: <insert witty rule>"
|
| Total number of rules when reaching the end of the post: 9 + n +
| n * m, with n being number of users commenting, m being the
| number of users not posting but still mentally commenting on the
| other users' comments.
| fedeb95 wrote:
| rule -1: don't trust the bug issuer
| reverendsteveii wrote:
| Review was good enough to make me snag the entire book. I'm
| taking a break from algorithmic content for a bit and this will
| help. Besides, I've got an OOM bug at work and it will be fun to
| formalize the steps of troubleshooting it. Thanks, OP!
| sumtechguy wrote:
| I recommend this book to all Jr. devs. Many feel very
| overwhelmed by the process. Putting it into nice interesting
| stories and how to be methodical is a good lesson for everyone.
| omkar-foss wrote:
| For folks who love to read books, here's an excerpt from the
| Debugging book's accompanying website
| (https://debuggingrules.com/):
|
| "Dave was asked as the author of Debugging to create a list of 5
| books he would recommend to fans, and came up with this.
|
| https://shepherd.com/best-books/to-give-engineers-new-perspe..."
| burrish wrote:
| thanks for the link
| shahzaibmushtaq wrote:
| I can't comment further on David A. Wheeler's review because his
| words were from 2004 (He said everything true), and I can't
| comment on the book either because I haven't read it yet.
|
| Thank you for introducing me to this book.
|
| One of my favorite rules of debugging is to read the code in
| plain language. If the words don't make sense somewhere, you have
| found the problem or part of it.
| BWStearns wrote:
| > Check the plug
|
| I just spent a whole day trying to figure out what was going on
| with a radio. Turns out I had tx/rx swapped. When I went to check
| tx/rx alignment I misread the documentation in the same way as
| the first. So, I would even add "try switching things anyways" to
| the list. If you have solid (but wrong) reasoning for why you did
| something then you won't see the error later even if it's right
| in front of you.
| SoftTalker wrote:
| Yes the human brain can really be blind when its a priori
| assumptions turn out to be wrong.
| jgrahamc wrote:
| Wasn't Bryan Cantrill writing a book about debugging? I'd love to
| read that.
| bcantrill wrote:
| I was! (Along with co-author Dave Pacheco.) And I still have
| the dream that we'll finish it one day: we had written probably
| a third of it, but then life intervened in various dimensions.
| And indeed, as part of our preparation to write our book (which
| we titled _The Joy of Debugging_), we read Wheeler's
| _Debugging_. On the one hand, I think it's great to have
| anything written about debugging, as it's a subject that has
| not been treated with the weight that it deserves. But on the
| other, the "methodology" here is really more of a collection of
| aphorisms; if folks find it helpful, great -- but I came away
| from _Debugging_ thinking that the canonical book on debugging
| has yet to be written.
|
| Fortunately, my efforts with Dave weren't for naught: as part
| of testing our own ideas on the subject, I gave a series of
| presentations from ~2015 to ~2017 that described our thinking.
| A talk that pulls many of these together is my GOTO Chicago
| talk in 2017, on debugging production systems.[0] That talk
| doesn't incorporate all of our thinking, but I think it gets to
| a lot of it -- and I do think it stands at a bit of contrast to
| Wheeler's work.
|
| [0] https://www.youtube.com/watch?v=30jNsCVLpAE
| gregthelaw wrote:
| It's a great talk! I have stolen your "if you smell smoke,
| find the source" advice and put it in some of my own talks on
| the subject.
| Zolomon wrote:
| I have been bitten more than once thinking that my initial
| assumption was correct, diving deeper and deeper - only to
| realize I had to ascend and look outside of the rabbit hole to
| find the actual issue.
|
| > Assumption is the mother of all screwups.
| sitkack wrote:
| This is how I view debugging, aligning my mental model with how
| the system actually works. Assumptions are bugs in the mental
| model. The problem is conflating what is knowledge with what is
| an assumption.
| astrobe_ wrote:
| I've once heard from an RTS game caster (IIRC it was Day9 about
| Starcraft) "Assuming... Is killing you".
| ianmcgowan wrote:
| I used to manage a team that supported an online banking platform
| and gave a copy of this book to each new team member. If nothing
| else, it helped create a shared vocabulary.
|
| It's useful to get the poster and make sure everyone knows the
| rules.
|
| https://debuggingrules.com/download-the-poster/
| nottorp wrote:
| I'd add "a logging module done today will save you a lot of
| overtime next year".
| goshx wrote:
| > Quit thinking and look (get data first, don't just do
| complicated repairs based on guessing)
|
| From my experience, this is the single most important part of the
| process. Once you keep in mind that nothing paranormal ever
| happens in systems and everything has an explanation, it is your
| job to find the reason for things, not guess them.
|
| I tell my team: just put your brain aside and start following the
| flow of events checking the data and eventually you will find
| where things mismatch.
| pbalau wrote:
| Acquire a rubber duck. Teach the duck how the system works.
| throwawayfks wrote:
| I worked at a place once where the process was "Quit thinking,
| and have a meeting where everyone speculates about what it
| might be." "Everyone" included all the nontechnical staff to
| whom the computer might as well be magic, and all the engineers
| who were sitting there guessing and as a consequence not at a
| keyboard looking.
|
| I don't miss working there.
| drivers99 wrote:
| There's a book I love and always talk about called "Stop
| Guessing: The 9 Behaviors of Great Problem Solvers" by Nat
| Greene. It's coincidental, I guess, that they both have 9
| steps. Some of the steps are similar so I think the two books
| would be complementary, so I'm going to check out "Debugging"
| as well.
| __MatrixMan__ wrote:
| > Check that it's really fixed, check that it's really your fix
| that fixed it, know that it never just goes away by itself
|
| I wish this were true, and maybe it was in 2004, but when you've
| got noise coming in from the cloud provider and noise coming in
| from all of your vendors I think it's actually quite likely that
| you'll see a failure once and never again.
|
| I know I've fixed things for people without asking if
| they ever noticed it was broken, and I'm sure people are doing
| that to me also.
| jwpapi wrote:
| I'm not sure; that doesn't sit well with me.
|
| Rule 1 should be: Reproduce with most minimal setup.
|
| 99% you'll already have found the bug.
|
| 1% for me was a font that couldn't do a combination of letters in
| a row. life ft, just didn't work and that's why it made mistakes
| in the PDF.
|
| No way I could've ever known that if I hadn't reproduced
| it down to the letter.
|
| Just split code in half till you find what's the exact part that
| goes wrong.
| physicles wrote:
| Related: decrease your iterating time as much as possible. If
| you can test your fix in 30 seconds vs 5 minutes, you'll fix it
| in hours instead of days.
| gnufx wrote:
| Then, after successful debugging your job isn't finished. The
| outline of "Three Questions About Each Bug You Find"
| <http://www.multicians.org/thvv/threeq.html> is:
|
| 1. Is this mistake somewhere else also?
|
| 2. What next bug is hidden behind this one?
|
| 3. What should I do to prevent bugs like this?
| sandbar wrote:
| Taking the time to speed up my iteration cycles has always been
| incredibly valuable. It can be really painful because it's not
| directly contributing to determining/fixing the bug (which could
| be exacerbated if there is external pressure), but it's _always_
| been worth it. Of course, this only applies to instances where it
| takes ~4+ minutes to run a single 'experiment' (test, startup
| etc). I find when I do just try to push through with long running
| tests I'll often forget the exact variable I tweaked during the
| course of the run. Further, these tweaks can be very nuanced and
| require you to maintain a lot of the larger system in your head.
| astrobe_ wrote:
| Also sometimes: the bug is not in the code, it's in the data.
|
| A few times I looked for a bug like "something is not happening
| when it should" or "This is not the expected result", when the
| issue was with some config file, database records, or thing sent
| by a server.
|
| For instance, particularly nasty are non-printable characters in
| text files that you don't see when you open the file.
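|
| A small sketch of how such characters can be surfaced, assuming
| the file has already been read into a string. The function name
| find_invisible_chars and the sample text are made up for
| illustration; the check relies on Unicode general categories
| (Cc control, Cf format, Zs space separators other than a plain
| space).

```python
import unicodedata


def find_invisible_chars(text: str):
    """Yield (line, column, codepoint, name) for easy-to-miss characters."""
    # Split on "\n" only, so a stray "\r" stays in the line and is reported.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            category = unicodedata.category(ch)
            suspicious = category in ("Cc", "Cf") or (
                category == "Zs" and ch != " "
            )
            if suspicious and ch != "\t":  # tabs are usually intentional
                name = unicodedata.name(ch, "<unnamed>")
                yield line_no, col_no, f"U+{ord(ch):04X}", name


# Hypothetical config text with a zero-width space and a no-break space.
sample = "host=example.com\nport=80\u200b80\nname=a\u00a0b\n"
for hit in find_invisible_chars(sample):
    print(hit)
# (2, 8, 'U+200B', 'ZERO WIDTH SPACE')
# (3, 7, 'U+00A0', 'NO-BREAK SPACE')
```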
|
| "simulate the failure" is sometimes useful, actually. Ask
| yourself "how would I implement this behavior", maybe even do it.
|
| Also: never reason on the absence of a specific log line. The
| logs can be wrong (bugged) too, sometimes. If you are
| printf-debugging a problem around a conditional, for instance,
| log both branches.
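|
| To illustrate logging both branches: a sketch with a hypothetical
| should_retry function (not from the thread). Every call emits
| exactly one line, so a missing line means the function was never
| reached, rather than "the other branch was probably taken".

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("retry_debug")


def should_retry(status: int, attempts: int, max_attempts: int = 3) -> bool:
    """Hypothetical decision function, instrumented on both sides."""
    if status >= 500 and attempts < max_attempts:
        log.debug("retry: status=%d attempts=%d max=%d",
                  status, attempts, max_attempts)
        return True
    # The "nothing happened" path gets its own line, with the same inputs.
    log.debug("no retry: status=%d attempts=%d max=%d",
              status, attempts, max_attempts)
    return False


should_retry(503, 1)  # logs "retry: ..."
should_retry(503, 3)  # logs "no retry: ..."
```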
| hughdbrown wrote:
| In my experience, the most pernicious temptation is to take the
| buggy, non-working code you have now and to try to modify it with
| "fixes" until the code works. In my experience, you often cannot
| get broken code to become working code because there are too many
| possible changes to make. In my view, it is much easier to break
| working code than it is to fix broken code.
|
| Suppose you have a complete chain of N Christmas lights and they
| do not work when turned on. The temptation is to go through all
| the lights and to substitute in a single working light until you
| identify the non-working light.
|
| But suppose there are multiple non-working lights? You'll never
| find the error with this approach. Instead, you need to start
| with the minimal working approach -- possibly just a single light
| (if your Christmas lights work that way), adding more lights
| until you hit an error. In fact, the best case is if you have a
| broken string of lights and a similar but working string of
| lights! Then you can easily swap a test bulb out of the broken
| string and into the working chain until you find all the bad
| bulbs in the broken string.
|
| Starting with a minimal working example is the best way I have
| found to fix a bug. And you will find you resist this because you
| believe that you are close and it is too time-consuming to start
| from scratch. In practice, it tends to be a real time-saver, not
| the opposite.
| __mharrison__ wrote:
| Go on a walk or take a shower...
| kazinator wrote:
| I've had trouble keeping the audit trail. It can distract from
| the _flow_ of debugging, and there can be lots of details to it,
| many of which end up being irrelevant; i.e. all the blind rabbit
| holes that were not on the maze path to the bug. Unless you're a
| consultant who needs to account for the hours, or a teller of
| engaging debugging war stories, the red herrings and blind alleys
| are not that useful later.
| sitkack wrote:
| If folks want to instill this mindset in their kids, themselves
| or others I would recommend at least
|
| The Martian by Andy Weir
| https://en.wikipedia.org/wiki/The_Martian_(Weir_novel)
|
| https://en.wikipedia.org/wiki/Zen_and_the_Art_of_Motorcycle_...
|
| https://en.wikipedia.org/wiki/The_Three-Body_Problem_(novel)
|
| To Engineer Is Human - The Role of Failure in Successful Design
| By Henry Petroski
| https://pressbooks.bccampus.ca/engineeringinsociety/front-ma...
|
| https://en.wikipedia.org/wiki/Surely_You%27re_Joking,_Mr._Fe...!
| nox101 wrote:
| > #1 Understand the system: Read the manual, read everything in
| depth, know the fundamentals, know the road map, understand your
| tools, and look up the details.
|
| Maybe I'm misunderstanding, but "Read the manual, read everything
| in depth" sounds like: Oh, I have a bug in my code, first read the
| entire manual of the library I'm using, all 700 pages, then read
| 7 books on the library details, and now that a month or two has
| passed, go look at the bug.
|
| I'd be curious if there's a single programmer that follows this
| advice.
| scudsworth wrote:
| great strawman, guy that refuses to read documentation
| mox1 wrote:
| I mean he has a point. Things are incredibly complex nowadays,
| I don't think most people have time to "understand the
| system."
|
| I would be much more interested in rules that don't start
| with that... Like "Rules for debugging when you don't have
| the capacity to fully understand every part of the system."
|
| Bisecting is a great example here. If you are Bisecting, by
| definition you don't fully understand the system (or you
| would know which change caused the problem!)
| adolph wrote:
| This was written in 2004, the year of Google's IPO. Atwood and
| Spolsky didn't found Stack Overflow until 2008. [0] People knew
| things such as the "Camel book" [1] and generally just knew things.
|
| 0. https://stackoverflow.blog/2021/12/14/podcast-400-an-oral-
| hi...
|
| 1. https://www.perl.com/article/extracting-the-list-of-o-
| reilly...
| feoren wrote:
| Essentially yes, that's correct. Your mistake is thinking that
| the outcome of those months of work is being able to kinda-
| probably fix one single bug. No: the point of all that effort
| is to _truly_ fix _all_ the bugs of that kind (or as close to
| "all" as is feasible), and to stop writing them in the first
| place.
|
| The alternative is paradropping into an unknown system with a
| weird bug, messing randomly with things you don't understand
| until the tests turn green, and then submitting a PR and hoping
| you didn't just make everything even worse. The alternative is
| never really knowing whether your system actually works or not.
| While I understand that is sometimes how it goes, doing that
| regularly is my nightmare.
|
| P.S. if the manual of a library you're using is 700 pages,
| you're using the wrong library.
| gregthelaw wrote:
| I love the "if you didn't fix it, it ain't fixed". It's too easy
| to convince yourself something is fixed when you haven't fully
| root-caused it. If you don't understand exactly how the thing
| you're seeing manifested, papering over the cracks will only cause
| more pain later on.
|
| As someone who has been working on a debugging tool
| (https://undo.io) for close to two decades now, I totally agree
| that it's just weird how little attention debugging as a whole
| gets. I'm somewhat encouraged to see this topic staying near the
| top of hacker news for as long as it has.
| andypi_swfc wrote:
| I found this book so helpful I created a worksheet based on it -
| might be helpful for some:
| https://andypi.co.uk/2024/01/26/concise-guide-to-debugging-a...
| pcblues wrote:
| Over twenty five odd years, I have found the path to a general
| debugging prowess can best be achieved by doing it. I'd recommend
| taking the list/buying the book, using https://up-for-grabs.net
| to find bugs on github/bugzilla, etc. and doing the following:
|
| 1. set up the dev environment
|
| 2. fork/clone the code
|
| 3. create a new branch to make changes and tests
|
| 4. use the list to try to find the root cause
|
| 5. create a pull request if you think you have fixed the bug
|
| And use Rule 0 from GuB-42: Don't panic
|
| (edited for line breaks)
| samsquire wrote:
| One thing I have been doing is to have the software create a
| directory called "debug" and write lots of different files with
| debugging information as the main program executes. I only write
| these files outside of hot loops, and then visually inspect the
| logs once the program has exited.
|
| For intermediate representations this is better than printf to
| stdout
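|
| A rough sketch of that approach, assuming the intermediate
| representations are JSON-serializable. The helper dump_stage, the
| stage names, and the toy tokenize/count pipeline are hypothetical
| stand-ins, not the commenter's code.

```python
import json
from pathlib import Path


def dump_stage(name: str, data, debug_dir: Path = Path("debug")) -> Path:
    """Write one intermediate representation to <debug_dir>/<name>.json."""
    debug_dir.mkdir(parents=True, exist_ok=True)
    path = debug_dir / f"{name}.json"
    path.write_text(json.dumps(data, indent=2, sort_keys=True),
                    encoding="utf-8")
    return path


def tokenize(source: str) -> list:
    return source.split()


def count_tokens(tokens: list) -> dict:
    counts = {}
    for token in tokens:  # hot loop: no file writes in here
        counts[token] = counts.get(token, 0) + 1
    return counts


if __name__ == "__main__":
    tokens = tokenize("a b a c b a")
    dump_stage("01_tokens", tokens)   # one file per pipeline stage
    counts = count_tokens(tokens)
    dump_stage("02_counts", counts)
```

| Numbering the file names keeps the stages in execution order when
| the directory is listed, and each file can be diffed against the
| one from a known-good run.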
___________________________________________________________________
(page generated 2025-01-13 23:00 UTC)