[HN Gopher] My Hardest Bug Ever (2013)
___________________________________________________________________
My Hardest Bug Ever (2013)
Author : whack
Score : 67 points
Date : 2023-03-07 19:52 UTC (3 hours ago)
(HTM) web link (www.gamedeveloper.com)
(TXT) w3m dump (www.gamedeveloper.com)
| metadat wrote:
| Previous discussions (slightly tricky to find because the URL has
| changed)
|
| https://news.ycombinator.com/item?id=6654905 (November 2013; 81
| comments)
|
| https://news.ycombinator.com/item?id=9738302 (June 2015; 29
| comments)
|
| https://news.ycombinator.com/item?id=14394095 (May 2017; 7
| comments)
| ezekg wrote:
| > As a programmer, you learn to blame your code first, second,
| and third... and somewhere around 10,000th you blame the
| compiler. Well down the list after that, you blame the hardware.
|
| I wish this were the case. The average programmer blames whatever
| library/third-party/etc. they're using, then somewhere around the
| 10,000th they might blame their own code.
|
| (I run a third-party service and everything is always my fault,
| even syntax errors.)
| bena wrote:
| Like everything, it really depends on the person.
|
| I also like to espouse a philosophy that problems should be
| investigated from inside out. Start with what you had direct
| control over, assume the issue is with something you did. Then
| work your way out.
|
| However, I have watched more than one person do the exact
| opposite: assume everything else was wrong before even looking
| at their own contributions.
|
| And this holds not just for programming, but for any endeavor.
| yifanl wrote:
| Why would they make their own life harder?
|
| If there's a bug in 3p code, they'd need to open up a PR to the
| open source library and be stalled for 3 weeks waiting for the
| maintainer to see it. If it's a one-line bug in their own code,
| it's one glance at a stack trace.
| ezekg wrote:
| I think you misunderstood my comment?
| happytoexplain wrote:
| There are many strange assumptions here. Why in your example
| is the library open source? Even when it is, why would the
| developer be expected to know how to fix it? Why in your
| example is the bug in the developer's code a "one-line" bug
| fixable by "one glance at a stack trace"?
|
| The point is that, if the bug's cause is not immediately
| obvious, some developers tend to jump to "it's the 3rd party
| library", because in many cases they can then claim to be
| unable to fix it, or offload the responsibility to the 3rd
| party.
| gumby wrote:
| My most memorable hardware bug was nowhere near as hard as this,
| but I'll never forget it.
|
| Intel was trying to sell the 960s and sent us a dev board with
| that CPU. Nobody in the company could get it to boot up. It would
| power up but nothing would show up on the serial port. Eventually
| it was my turn to look and for some reason I happened to notice a
| pullup _capacitor_ on the UART VCC. I looked at the schematics
| and indeed it was there. A simple jumper to bypass it (back in
| those days we had big, manly components; none of that surface
| mount shit) and what hey: the serial console responded. It had
| booted up just fine, but was mute.
|
| After that we could do development but it was immediately clear
| to me that the 960 was DoA. It's not like we were the first to
| get that board!
| einpoklum wrote:
| > As a programmer, you learn to blame your code first, second,
| and third... and somewhere around 10,000th you blame the
| compiler. Well down the list after that, you blame the hardware.
|
| So, first - in many settings, the hardware is more likely to be
| the source of the problem than your compiler; the question is
| what has more churn - the compiler code or the chip you run on.
|
| But regardless - the compiler is much higher than the 10,000'th
| item on the blame list. Even mature, popular compilers have bugs!
| Hell, they have many known, open bugs! The subtle ones, which
| don't manifest easily, can stay open for quite a long time. See:
|
| https://gcc.gnu.org/bugzilla/
|
| and:
|
| https://bugs.llvm.org/
|
| I personally have encountered and even filed several of them, and
| it's not like I was trying. Some of these were even the result of
| "Why does my code not work?" questions on StackOverflow.
|
| One tip, though: Play one compiler against another when you begin
| suspecting your compiler, or the hardware. The buggy behavior
| will often be different. And of course run multiple times to
| check for variation in behavior, like the author had.
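
One way to act on the tip above is a small probe whose output can be
diffed across compilers and optimization levels. This is only a sketch:
`suspect_computation` is a hypothetical stand-in for whatever code is
misbehaving, and `compiler_name`, `fnv1a`, and `probe_line` are
illustrative names, not anything from the article or this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Name of the compiler that built this file, from predefined macros. */
static const char *compiler_name(void)
{
#if defined(__clang__)
    return "clang";
#elif defined(__GNUC__)
    return "gcc";
#elif defined(_MSC_VER)
    return "msvc";
#else
    return "unknown";
#endif
}

/* 32-bit FNV-1a hash, used only to condense results into one number. */
static uint32_t fnv1a(const unsigned char *data, size_t len)
{
    uint32_t h = 2166136261u;            /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;                  /* FNV prime */
    }
    return h;
}

/* Hypothetical stand-in for the suspect code: replace with your own. */
static uint32_t suspect_computation(void)
{
    unsigned char buf[256];
    uint32_t x = 1u;
    for (size_t i = 0; i < sizeof buf; i++) {
        x = x * 1664525u + 1013904223u;  /* unsigned arithmetic wraps */
        buf[i] = (unsigned char)(x >> 24);
    }
    return fnv1a(buf, sizeof buf);
}

/* Writes "<compiler> <checksum>" into out; returns snprintf's result. */
static int probe_line(char *out, size_t cap)
{
    assert(out != NULL && cap > 0);
    return snprintf(out, cap, "%s %08lx", compiler_name(),
                    (unsigned long)suspect_computation());
}
```

Printing the result of `probe_line` from a `main` and building the file
with, say, `gcc -O0`, `gcc -O3`, and `clang -O2` gives one line per
build. If the checksums differ and the code has no undefined behavior,
the compiler (or the hardware) becomes a more plausible suspect.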
| AshamedCaptain wrote:
| > But regardless - the compiler is much higher than the
| 10,000'th item on the blame list. Even mature, popular
| compilers have bugs! Hell, they have many known, open bugs!
|
| I don't even understand when compilers started being thought of as
| these perfect, bug-free programs. It's been some kind of
| gradual change over the decades. A lot of people seem surprised
| when I mention that around 15 years ago -O3 in gcc was
| practically unusable. I don't mean "it would actually degrade
| performance", I mean "it would break your program".
| einpoklum wrote:
| TBH, I'm surprised by that. I would have thought compiler
| authors would not have released optimization options in this
| state - when such breakage is encountered by testers of
| nightlies or beta releases.
| glonq wrote:
| Having spent the better part of 30 years working on/with/around
| embedded systems, I can't even count how many bugs I've bumped
| into that were hiding in between software and hardware. Or between
| software and compiler/tools/OS. Or between hardware and spooky RF
| black magic.
| GlenTheMachine wrote:
| Oh man.
|
| I was writing the motor controller code for a new submersible
| robot my PhD lab was building. We had bought one of the very
| first compact PCI boards on the market, and it was so new we
| couldn't find any cPCI motor controller cards, so we bought a
| different format card and a motherboard that converted between
| compact PCI bus signals and the signals on the controller boards.
| The controller boards themselves were based around the LM629, an
| old but widely used motor controller chip.
|
| To interface with the LM629 you have to write to 8-bit registers
| that are mapped to memory addresses and then read back the
| result. The 8-bit part is important, because some of the
| registers are read or write only, and reading or writing to a
| register that cannot be read from or written to throws the chip
| into an error state.
|
| LM629s are dead simple, but my code didn't work. It. Did. Not.
| Work. The chip kept erroring out. I had no idea why. It's almost
| trivially easy to issue 8-bit reads and writes to specific memory
| addresses in C. I had been coding in C since I was fifteen years
| old. I banged my head against it for two weeks.
|
| Eventually we packed up the entire thing in a shipping crate and
| flew to Minneapolis, the site of the company that made the cards.
| They looked at my code. They thought it was fine.
|
| After three days the CEO had pity on us poor grad students and
| detailed his highly paid digital logic analyst to us for an hour.
| He carted in a crate of electronics that were probably worth
| about a million dollars. Hooked everything up. Ran my code.
|
| "You're issuing a sixteen-bit read, which is reading both the
| correct read-only register and the next adjacent register, which
| is write-only", he said.
|
| I showed him in my code where the read in question was very
| clearly a *CHAR*. 8 bits.
|
| "I dunno," he said - "I can only say what the digital logic
| analyzer shows, which is that you're issuing a sixteen bit read."
|
| Eventually, we found it. The Intel bridge chip that did the bus
| conversion had a known bug, which was clearly documented in an
| 8-point footnote on page 79 of the manual: 8 bit reads were
| translated to 16 bit reads on the cPCI bus, and then the 8 most
| significant bits were thrown away.
|
| In other words, a hardware bug. One that would only manifest in
| these _very_ specific circumstances.
|
| We fixed it by taking a razor knife to the bus address lines and
| shifting them to the right by one, and then taking the least
| significant line and mapping it all the way over to the left, so
| that even and odd addresses resolved to completely different
| memory banks. Thus, reads to odd addresses resolved to addresses
| way outside those the chip was mapped to, and it never saw them.
| Adjusted the code to the (new) correct address range. Worked like
| a charm.
|
| But I feel bad for the next grad student who had to work on that
| robot. "You are not expected to understand this."
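
The bridge bug and the address-line fix described above can be modelled
in a few lines of C. This is a simulation under stated assumptions, not
the actual robot code: the register layout (alternating read-only and
write-only registers at addresses 0..3), the 8-bit address bus, and all
function names are hypothetical and do not reflect the LM629's real
register map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ADDR_BITS 8u                       /* hypothetical bus width */
#define ADDR_MASK ((1u << ADDR_BITS) - 1u)
#define NUM_REGS  4u                       /* chip decodes addresses 0..3 */

typedef enum { REG_READ_ONLY, REG_WRITE_ONLY } reg_kind;

/* Hypothetical layout: read-only and write-only registers alternate. */
static const reg_kind reg_kinds[NUM_REGS] = {
    REG_READ_ONLY, REG_WRITE_ONLY, REG_READ_ONLY, REG_WRITE_ONLY
};
static const uint8_t reg_values[NUM_REGS] = { 0x11, 0x00, 0x33, 0x00 };

/* A read as seen by the chip; reading a write-only register is an error. */
static uint8_t chip_read(uint32_t addr, bool *error)
{
    if (addr >= NUM_REGS)
        return 0xFF;       /* outside the chip's range: it never sees this */
    if (reg_kinds[addr] == REG_WRITE_ONLY) {
        *error = true;
        return 0xFF;
    }
    return reg_values[addr];
}

/* Rewired address lines: rotate the address right by one bit. */
static uint32_t rewire(uint32_t addr)
{
    assert(addr <= ADDR_MASK);
    return (addr >> 1) | ((addr & 1u) << (ADDR_BITS - 1u));
}

/* Buggy bridge: an 8-bit read is widened to 16 bits, high byte dropped. */
static uint8_t bridge_read8(uint32_t addr, bool rewired, bool *error)
{
    uint32_t lo = addr & ADDR_MASK;
    uint32_t hi = (addr + 1u) & ADDR_MASK; /* the extra, unwanted access */
    if (rewired) {
        lo = rewire(lo);
        hi = rewire(hi);
    }
    uint8_t value = chip_read(lo, error);
    (void)chip_read(hi, error);            /* result is thrown away */
    return value;
}
```

With the original wiring, `bridge_read8(0, false, &err)` returns 0x11
but sets `err`, because the second half of the widened read lands on
the write-only register at address 1. With the rewired lines, software
reaches register r at address 2*r, and the stray access to 2*r + 1
decodes to an address with the top bit set, outside the chip's range.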
| nameoda wrote:
| It's not a bug! It's a clearly documented feature! /s
| whitewingjek wrote:
| Previously discussed:
|
| https://news.ycombinator.com/item?id=6654905 (81 comments)
|
| https://news.ycombinator.com/item?id=9738302 (29 comments)
|
| https://news.ycombinator.com/item?id=14394095 (7 comments)
| Ruq wrote:
| What can you say? It's a classic reading!
| dang wrote:
| Reposts of classics are most welcome on HN!
|
| We do try to space them out a bit, to avoid too much
| repetition, but anything up to once a year is fine. This one
| hasn't had a thread since 2017, so completely ok.
|
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
| yellow_lead wrote:
| 2013
| dang wrote:
| Added. Thanks!
| toolslive wrote:
| I once (about 10y ago) experienced hardware that got tired. A
| customer replaced the usual hard disks with shiny new Seagate SMR
| drives, because they had more storage capacity. Funny thing is
| that they could not handle the sustained 100MB/s we were feeding
| them. So after about 20 minutes they started slowing down and
| after half an hour they stopped working for about 20 minutes and
| then they were fine again. Obviously the customer complained
| about our storage product and forgot to mention this small fact.
| Once we figured it out we had a good laugh.
| _a_a_a_ wrote:
| That's interesting. My old server about 10 years ago had a
| Seagate black which died. I replaced it with a Seagate green. I
| noticed things started slowing down more and more when the disc
| writes got heavy. It could freeze up for minutes at a time,
| then recover without any errors. It took me weeks to realise
| what was happening because... Because I don't actually know
| why. In hindsight it was obvious. Maybe the Seagate green was a
| SMR drive. Either way, it was nasty and caused a lot of
| frustration.
|
| A quick check just now and it seems that the Seagate greens were
| SMR. Fuckers never put that on the box did they. Bastards.
| favorited wrote:
| A couple years ago, Western Digital quietly changed their WD
| Red line (which is explicitly marketed as being for NAS use)
| to SMR.
|
| https://www.tomshardware.com/news/wd-addresses-smr-controver...
| oifjsidjf wrote:
| I've never seen such annoying ads on any website: the ad size
| changes every ~30 seconds which rearranges the text flow of the
| article completely and I get lost.
| ezekg wrote:
| How have you survived the nets this long without an ad blocker?
___________________________________________________________________
(page generated 2023-03-07 23:00 UTC)