[HN Gopher] Spending 3 months investigating a 7-year old bug and...
___________________________________________________________________
Spending 3 months investigating a 7-year old bug and fixing it in 1
line of code
Author : asicsp
Score : 185 points
Date : 2024-06-21 13:49 UTC (9 hours ago)
(HTM) web link (lemmy.world)
(TXT) w3m dump (lemmy.world)
| m3kw9 wrote:
| These one-line fixes always seem like stupid bugs, but in reality
| most bugs are like this and the fix is in the discovery.
| xeromal wrote:
| One of the reasons I struggle to give ETAs on fixing a bug. The
| moment I know what the issue is, the solution to fix it is
| usually already figured out barring a rearchitecture of some
| services or infrastructure.
| shermantanktop wrote:
| This kind of bug is always an emotional rollercoaster of
| anticipation, discovery, disappointment, angst, self-criticality,
| and satisfaction.
| TehShrike wrote:
| I particularly liked this part:
|
| > Knowing very little about USB audio processing, but having cut
| my teeth in college on 8-bit 8051 processors, I knew what kind of
| functions tended to be slow. I did a Ctrl+F for "%" and found a
| 16-bit modulo right in the audio processing code.
|
| That feeling of saving days of work because you remember a clue
| from previous experience is so good.
| djoldman wrote:
| Those are the times one gets the opposite of imposter syndrome.
| ASalazarMX wrote:
| Fortunately it's a temporary state, otherwise there's a risk of
| entering the Dunning-Kruger effect.
|
| "You did awesome, but don't let it go to your head."
| dylan604 wrote:
| F-that! That's one of those times where I re-enact the
| scene from the Bond film GoldenEye where the guy jumps up
| extending both arms yelling "Yes! I am invincible!" Of
| course I totally expect the hubris to be short-lived, just
| maybe not with liquid nitrogen.
|
| https://www.youtube.com/watch?v=fXW02XmBGQw
| squigz wrote:
| I alternate between "I am the best programmer to ever
| exist" and "I am completely incompetent at this and I
| should quit" while debugging.
| dylan604 wrote:
| I've been known to inform people that the person that
| wrote the incredibly horrendous code that caused whatever
| problems to occur should be fired immediately knowing
| good and well that I was the only dev to write any of the
| code.
| nikanj wrote:
| This essentially is why senior engineers get much bigger
| salaries
| mulmen wrote:
| As a total newbie I saved my company a quarter million
| dollars in Oracle licensing in a single afternoon by
| rewriting a PL/SQL function. That change was a few lines of
| SQL. Seniors don't have a monopoly on good ideas.
|
| Salary is driven by market conditions and nothing else. It is
| not an approximation of merit or even delivered value.
| close04 wrote:
| Statistically speaking a senior (more experienced) engineer
| is more likely to consistently deliver time saving results,
| while a junior is more likely to occasionally do it, if
| ever.
|
| Proving it's not a one time thing is what pushes you in the
| salary and seniority ranking.
| nashashmi wrote:
| Senior engineers have less opportunity to write
| time-consumingly careful code because they get paid so much.
| Much easier to throw great new hardware at it.
| fifilura wrote:
| Senior engineers have less time to write code, period.
|
| And this is what saves the day.
|
| Code is a liability.
| JonChesterfield wrote:
| The corporate structures that reward people who prove
| especially good at building the product with more
| meetings and less time building the product are perhaps
| not optimal in their deployment of resources.
|
| Maximising the fraction of the product built by people
| who don't know what they're doing would however explain
| the emergent properties of modern software.
| dgfitz wrote:
| Senior engineers can write time-consuming, careful code
| efficiently. This is why they are seniors.
| stavros wrote:
| This is laughably false. The highly-paid, experienced
| seniors produce so much more value than juniors that it's
| not even in the same ballpark. It's also usually the kind
| of value that juniors don't even notice, because it's not
| measured in lines of code.
|
| A good junior will write a hundred lines of code in a day.
| A good senior will delete a hundred because they realize
| the user dictated a solution instead of detailing their
| problem, asked them, and figured out that they can solve
| that problem by changing a config variable somewhere.
| johnnyanmac wrote:
| Not a monopoly, but a majority. Many juniors who do have
| that potential don't ever get put in such a situation.
|
| Junior/senior isn't necessarily about skill level; I'm sure
| many can find a senior with 1YOE ten times over. It's about
| trust both in technical and sociopolitical navigation
| through the job. That's only really gained with time and
| experience (and yes, isn't perfect. Hence, the
| aforementioned 1x10 senior. Still "trusted" more than a 1
| year junior).
| drewg123 wrote:
| If modulus is expensive, and he's checking a power of 2, why
| not just use a bitwise AND?
|
| E.g., for non-negative integers, x % 16 == (x & 15). That should
| be trivially cheap.
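A minimal C sketch of the identity drewg123 describes, assuming unsigned operands and a divisor that is a power of two. The names `is_pow2` and `mod_pow2` are illustrative and do not come from the article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when n has exactly one bit set, i.e. n is a power of two. */
bool is_pow2(uint16_t n)
{
    return n != 0 && (n & (uint16_t)(n - 1)) == 0;
}

/* x % n for a power-of-two n, computed with a single AND.
 * Only valid for unsigned x; the caller must guarantee is_pow2(n). */
uint16_t mod_pow2(uint16_t x, uint16_t n)
{
    assert(is_pow2(n));
    return (uint16_t)(x & (uint16_t)(n - 1));
}
```

When writing the comparison in real C, the parentheses matter: `==` binds more tightly than `&`, so `x % 16 == x & 15` without them parses as `(x % 16 == x) & 15`.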
| ladberg wrote:
| It wasn't `x % 16`, it was `x % y`, where x and y are 16-bit
| integers. A compiler would also have taken care of it if the
| divisor were just a literal.
| drewg123 wrote:
| Whoops.. I misread what he was doing.
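For the variable-divisor case ladberg points out, a common way to avoid a division on small processors is to replace the modulo in an index wrap with a compare and a subtract. This is a sketch of that general technique only: the thread does not say what the article's one-line fix was, and `wrap_index`, `index`, `step`, and `len` are hypothetical names. It assumes `index < len` and `step <= len`, so a single subtraction is enough.

```c
#include <assert.h>
#include <stdint.h>

/* Advance a ring-buffer index by `step` and wrap it into [0, len).
 * Equivalent to (index + step) % len when index < len and step <= len,
 * but uses a compare and a subtract instead of a divide. */
uint16_t wrap_index(uint16_t index, uint16_t step, uint16_t len)
{
    assert(len != 0 && index < len && step <= len);
    uint32_t next = (uint32_t)index + step;  /* widen so the sum cannot overflow */
    if (next >= len) {
        next -= len;
    }
    return (uint16_t)next;
}
```

The trade-off is that the preconditions become the caller's responsibility; a general `x % y` makes no such assumption about its operands.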
| beebmam wrote:
| Why is this not considered a compiler optimization and/or
| language problem? It seems to me that compiler optimizations
| for expressive programming languages should be able to handle
| something like this.
| JonChesterfield wrote:
| What would you hope a compiler would optimise x % y into?
|
| Higher level change-the-algorithm aspirations haven't really
| been met by sufficiently smart compilers yet, with the
| possible exception of scalar evolution turning loops into
| direct calculation of the result. E.g. I don't know of any
| that would turn bubble sort into a more reasonable sort
| routine.
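To make the scalar-evolution case concrete: the loop below has a closed form, and an optimizer that recognises it can replace the loop with a direct calculation. Whether a particular compiler actually does so depends on its version and flags; the two functions here are hand-written illustrations with made-up names, not compiler output.

```c
#include <stdint.h>

/* Sum 0 + 1 + ... + (n - 1) the obvious way: n iterations. */
uint64_t sum_loop(uint32_t n)
{
    uint64_t total = 0;
    for (uint32_t i = 0; i < n; i++) {
        total += i;
    }
    return total;
}

/* The same value in constant time: n * (n - 1) / 2. */
uint64_t sum_closed(uint32_t n)
{
    uint64_t wide = n;  /* widen first so the product cannot overflow */
    return n == 0 ? 0 : wide * (wide - 1) / 2;
}
```

Nothing comparable exists for `x % y` with two runtime values: there is no cheaper closed form to rewrite it into, which is the point of the question above.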
| pelagicAustral wrote:
| Some of the stuff I've struggled with the most over the years
| has been SQL constraints that are not documented. I remember
| (probably like 10 years ago), I deployed an update to an ancient
| Windows Forms implementation that deprecated some login and
| instead made use of Windows Authentication. It worked like a
| charm for all users but one! Checked everything, replicated the
| machine, tried so many weird things, and in the end, what was
| happening was that the "Users" table had a constraint on the
| number of characters for the username. This username was over the
| limit and was not being validated... Another one was a report
| that was giving the wrong amount, but getting the data from the
| database seemed to do the math right... it was the damn Money
| datatype, changed to decimal, done...
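One plausible way a money-style column produces a wrong total is loss of precision in intermediate results. The sketch below is a simplified model in C, not SQL: it assumes a fixed-point value with four decimal places stored as a scaled 64-bit integer, and it assumes every intermediate result is cut back to four places. The comment does not say this was the actual cause, and all names are hypothetical.

```c
#include <stdint.h>

#define SCALE 10000  /* four decimal places: 1.0000 is stored as 10000 */

typedef int64_t fixed4;

/* a / b, with the quotient truncated to four decimal places. */
fixed4 fixed4_div(fixed4 a, fixed4 b) { return a * SCALE / b; }

/* a * b, with the product truncated to four decimal places. */
fixed4 fixed4_mul(fixed4 a, fixed4 b) { return a * b / SCALE; }

/* (a / b) * c with truncation after each step. */
fixed4 ratio_stepwise(fixed4 a, fixed4 b, fixed4 c)
{
    return fixed4_mul(fixed4_div(a, b), c);
}

/* The same expression, reordered so truncation happens only once, at the end. */
fixed4 ratio_single_truncation(fixed4 a, fixed4 b, fixed4 c)
{
    return a * c / b;
}
```

With a = 100.0000, b = 339.0000, and c = 10000.0000 (stored as 1000000, 3390000, and 100000000), the stepwise version yields 2949.0000 while the single-truncation version yields 2949.8525. A report that divides and then multiplies can drift this way even though each stored value looks right.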
| readthenotes1 wrote:
| "...it was based on a USB product we had already been making for
| PCs for almost a decade.
|
| This product was so old in fact that nobody knew how to compile
| the source code. "
|
| I think you mean "Management was so bad, nobody knew how to
| compile the source code".
|
| There are plenty of systems out there that can and plenty
| that cannot be reproduced from source. The biggest difference is
| the care taken to do so, not the age.
| brogrammernot wrote:
| This exact type of thing is why when I switched to the dark side
| (product) and sat in management meetings where often non-
| technical folks would go "we could measure by lines of code or
| similar" for productivity I often pointed out how that was a bad
| idea.
|
| Did I win? Of course not, it's hard for non-technical people to
| fully appreciate these things and any sort of larger
| infrastructure work, esp. for developer productivity, because it
| goes back to, well, how are you going to measure that ROI?
|
| Anyways, this was fun to read and brought back good engineering
| memories. I'd also like to say, as it brought back a bug I chased
| forever, fuck you ChannelFactory in C#.
| jonathanlydall wrote:
| I really miss working with WCF, said no one ever.
| neonsunset wrote:
| Troubleshooting vendor WCF SDK version mismatch was not fun,
| and the guy who had to reverse engineer it to attempt a .NET
| Core port probably lost a few years off his lifespan (this was
| before CoreWCF was a thing).
|
| When people bash gRPC today, they don't know of the horrors of
| the past.
| brogrammernot wrote:
| Yeah, I've lived the life of straddling .NET Core and ASP.NET
| while also dealing with React vs Angular2+ and having half of
| the system in the script bundling hell that was razor views
| and all sorts of craziness.
|
| That experience is actually what led me to switch over to
| Product among other things, I get it when people joke (half
| joke) about considering retirement rather than going through
| that again.
| neonsunset wrote:
| At the time, we had already been using React for front-end
| widgets so migrating most other parts to the then-latest .NET
| Core 3.1 went surprisingly smoothly. There were a couple of
| EF queries that stopped working as EF Core disabled
| application side evaluation by default, but that was
| ultimately a good thing as the intention wasn't to pull
| more data than needed.
|
| Instead, the actual source of problems was K8S and the huge
| amount of institutional knowledge it required that wasn't
| there. I still don't think K8S is that good, it's useful
| but it and containerized environments in general to this
| day have a lot of rough edges and poorly implemented design
| aspects - involved runtimes like .NET CLR and OpenJDK end
| up having to do special handling for them because reporting
| of core count and available memory is still scuffed while
| the storage is likely to be a network drive. The latter is
| not an issue in C# where pretty much all I/O code is non-
| blocking so there is no impact on application
| responsiveness, but it still violates many expectations.
| Aspects of easy horizontal scaling and focus on lean
| deployments are primarily more useful for worse languages
| with weaker runtimes that cannot scale as well within a
| single process.
|
| I suppose, a silver lining to your situation on the other
| hand is that developers get to have a PO/PM with strong
| technical background which makes so many communication
| issues go away.
| Swizec wrote:
| Have you ever suggested that management/leadership should
| measure productivity by lines of document text written? They
| might better grok how that's a bad idea. Especially since many
| of them much prefer to communicate in bullet-pointed slides
| than documents.
| pvaldes wrote:
| _"I also ended up needing to find a Perl script that was buried
| deep in some university website. I still don't know anything
| about Perl, but I got it to run"_
|
| Find dusty Perl script forgotten for years. Still works.
|
| Not the first time I've heard that.
| nikanj wrote:
| Outside of JavaScript, it's a pretty reasonable assumption that
| if you have the sources, you can get them to run
| creeble wrote:
| Ha, coincidentally, I designed and built an 8051-based MIDI
| switch in the early 90's. There weren't that many good tools at
| the time, and I designed everything from the software and UI to
| the circuit board and rack-mount case.
|
| I even wrote an 8051 assembler in C, but found a good tiny-C
| compiler for it before it went into production.
|
| You are not a programmer unless you've written key-debounce code
| :)
|
| (OTOH, some of the worst programmers I've ever had the
| displeasure of working with were amazing low-level code hackers.
| In olden times, it seems like you were either good at that level
| of abstraction, or you were good at a much different ["higher"]
| level, seldom both.)
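For readers who have not written one, a key debouncer can be sketched as a counter that accepts a new key state only after it has been sampled unchanged several times in a row. This is a generic illustration, not creeble's code; `debouncer`, `debounce_update`, `debounce_run`, and the threshold of 4 samples are all assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBOUNCE_SAMPLES 4  /* assumed: consecutive samples needed to accept a change */

typedef struct {
    bool stable;    /* last accepted (debounced) key state */
    uint8_t count;  /* consecutive samples that disagreed with `stable` */
} debouncer;

/* Feed one raw sample (call at a fixed interval, e.g. from a timer tick).
 * Returns the debounced state after taking the sample into account. */
bool debounce_update(debouncer *d, bool raw)
{
    if (raw == d->stable) {
        d->count = 0;                 /* bounce or no change: start over */
    } else if (++d->count >= DEBOUNCE_SAMPLES) {
        d->stable = raw;              /* change held long enough: accept it */
        d->count = 0;
    }
    return d->stable;
}

/* Helper for experiments: run a sequence of samples through a fresh debouncer. */
bool debounce_run(const bool *samples, int n)
{
    debouncer d = { false, 0 };
    for (int i = 0; i < n; i++) {
        debounce_update(&d, samples[i]);
    }
    return d.stable;
}
```

A contact that chatters between 1 and 0 keeps resetting the counter, so only a press that is held for four consecutive samples changes the reported state.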
| winrid wrote:
| Reminds me of fixing an ~11yr old bug in Enemy Territory. I had
| to spend a night debugging the C code only to realize the issue
| was in the UI config:
| https://github.com/etlegacy/etlegacy-deprecated/pull/100/fil...
|
| (IIRC UI scrolled twice for every mouse movement + you couldn't
| select items in server browser with mouse wheel as it would skip
| every other one)
| lostlogin wrote:
| That was such a great game but sadly it seemed to fizzle out.
| There were lots of neat exploits which made it even better. I
| also liked the communication style, with pre-canned messages you
| could give with certain key combos.
| winrid wrote:
| There's usually a full server or two. ETLegacy has plenty of
| players for me.
| Terr_ wrote:
| In a similar vein, the voice tree from Starsiege: Tribes
| (1998) was mind-blowing for the dialup era.
|
| Ex: VIAB -> "I am attacking the enemy base!"
| magwa101 wrote:
| Similarly, I spent 6 weeks on a kernel token-ring driver
| intermittent initialization issue. This required kernel restarts
| over and over to observe the issue. Breakpoints were useless as
| they hid the issue. Turns out initialization in a specific step
| was not synchronous and reading the status was a race condition.
| It took weeks of staring, joking around, thinking, bs'ing, then
| suddenly, voila. Changed the order of the code, worked.
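The shape of that bug, reading a status before an asynchronous initialization step has finished, can be sketched with a simulated device. Everything here is hypothetical (the `fake_device` type, the helper names, the poll budget); a real driver would wait on the hardware's own completion signal rather than a counter.

```c
#include <stdbool.h>

/* Simulated device: initialization completes only after some number of polls. */
typedef struct {
    int polls_until_ready;  /* stands in for hardware finishing asynchronously */
    int status;             /* only meaningful once the device is ready */
} fake_device;

/* Start asynchronous initialization; returns immediately. */
void start_init(fake_device *dev, int latency_polls)
{
    dev->polls_until_ready = latency_polls;
    dev->status = -1;       /* placeholder value until init completes */
}

/* One poll of the ready flag; the simulated hardware makes progress here. */
bool poll_ready(fake_device *dev)
{
    if (dev->polls_until_ready > 0) {
        dev->polls_until_ready--;
        return false;
    }
    dev->status = 0;        /* init done: status is now valid */
    return true;
}

/* Racy order: read status right after starting init. */
int init_racy(fake_device *dev, int latency_polls)
{
    start_init(dev, latency_polls);
    return dev->status;     /* may still be the placeholder */
}

/* Fixed order: wait (bounded) for completion, then read status.
 * Returns -2 if the device never became ready within max_polls. */
int init_ordered(fake_device *dev, int latency_polls, int max_polls)
{
    start_init(dev, latency_polls);
    for (int i = 0; i < max_polls; i++) {
        if (poll_ready(dev)) {
            return dev->status;
        }
    }
    return -2;
}
```

In this model the racy version returns the placeholder whenever the device needs any time at all, which is consistent with breakpoints hiding the problem: pausing in a debugger gives the hardware time to finish before the status is read.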
| rented_mule wrote:
| The worst I experienced in this direction was also on a consumer
| device about 15 years ago. Performance was degraded and we
| couldn't explain it. A team of 5 of us was assembled to figure it
| out.
|
| We spent over three months on it before finding a root cause. It
| was over two months before we could even understand how to
| measure it - we were seeing parts of the automated overnight test
| suite run taking longer, but every night it would be different
| tests that were slow. A key finding was that almost everything
| was slow on some boots of the device and fast on other boots of
| the device, and there was a reboot before each test was run.
| Doing some manual testing showed it being close to a 50% chance
| of a boot leading to slowness. Now what?
|
| I eventually got frustrated and took the brute force / mindless
| approach... binary search over commits. Unfortunately, that
| wasn't easy because our build was 45-60 minutes, and then there
| was a heavily manual installation process that took 10-20
| minutes, followed by several reboots to see if anything was slow.
| And there were several thousand commits since the last known good
| build (the previously shipped version of the device). The
| build/install/testing process was not easily automated, and we
| were not on git, otherwise using git-bisect would have been nice.
| Instead, I spent weeks doing the binary search manually.
|
| That yielded the offending commit. The problem was that it was a
| massive commit (tens of thousands of lines of code) from a group
| in another part of the company. It was a snapshot of all of their
| development over the course of a couple of years. The commit
| message, and the authors, stated that the commit was a no-op with
| everything behind a disabled feature flag.
|
| So now it was on to code-level binary search. Keep deleting about
| half of the code in the commit, in this case by chunks that are
| intended to be inactive. After eventually deleting all the
| inactive code, there were still a few dozen lines of changes in a
| Linux subsystem that did window compositing. Those lines of code
| were all quite interdependent, so it was hard to delete much and
| keep things functional, so now on to walking through code. At
| least I could use my brain again!
|
| Using the clue that the problem was happening about half the time
| and given that this code was in C, I started looking for
| uninitialized booleans. Sure enough, there was one called
| something like `enable_transparency`. Disabled code was setting
| it to `true`, but nothing was setting it to `false` when their
| system was disabled. Before their commit, there was no variable -
| `false` was being passed into the initializer call directly.
| Adding `= false` to the declaration was the fix.
|
| So, well over a year of engineering hours spent to figure out the
| issue. The upside is that some people on the team didn't know how
| to proceed, so they spent their time speeding up random things
| that were slow. So the device ended up being noticeably faster
| when we were done. But it was pretty stressful as we were closing
| in on our launch date with little visibility into whether we'd
| figure it out or not.
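The manual binary search over commits described above can be sketched as follows. Because a bad build was slow on only about half of its boots, the sketch boots each candidate several times before calling it good. The hook type `boot_is_slow_fn`, the `sim` struct, and the other names are hypothetical; the simulation stands in for the real build, install, and reboot cycle.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical test hook: boot build `commit` once and report whether it was slow. */
typedef bool (*boot_is_slow_fn)(size_t commit, void *ctx);

/* A bad build is slow only on some boots, so boot it several times
 * and call it bad if any boot is slow. */
bool commit_is_bad(size_t commit, int boots, boot_is_slow_fn boot_is_slow, void *ctx)
{
    for (int i = 0; i < boots; i++) {
        if (boot_is_slow(commit, ctx)) {
            return true;
        }
    }
    return false;
}

/* Invariant: `good` is known good, `bad` is known bad, and good < bad.
 * Returns the first bad commit, assuming every commit after it is also bad. */
size_t first_bad_commit(size_t good, size_t bad, int boots,
                        boot_is_slow_fn boot_is_slow, void *ctx)
{
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (commit_is_bad(mid, boots, boot_is_slow, ctx)) {
            bad = mid;
        } else {
            good = mid;
        }
    }
    return bad;
}

/* Simulation for trying the search out: builds at or after `first_bad`
 * are slow on every other boot; earlier builds are never slow. */
struct sim { size_t first_bad; unsigned boot_count; };

bool sim_boot_is_slow(size_t commit, void *ctx)
{
    struct sim *s = ctx;
    s->boot_count++;
    return commit >= s->first_bad && (s->boot_count % 2 == 0);
}

/* Search commits 0..newest, where 0 is known good and newest is known bad. */
size_t demo_bisect(size_t first_bad, size_t newest)
{
    struct sim s = { first_bad, 0 };
    return first_bad_commit(0, newest, 2, sim_boot_is_slow, &s);
}
```

Several thousand commits need only about a dozen steps, which is why the approach was feasible even with an hour-long build. With a truly random 50% failure rate, k clean boots still leave roughly a 1 in 2^k chance of mislabelling a bad build as good, assuming boots are independent, so the boot count trades time against confidence.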
| hoten wrote:
| Oh man, that sounds rough. I salute you.
|
| This probably wasn't an option back then with your toolchain,
| but it's so reassuring to know modern compilers / ASAN are
| amazing at catching this class of bugs today.
| leni536 wrote:
| AFAIK ASAN does not catch uninitialized variables, MSAN does.
| MSAN is significantly harder to set up.
| JonChesterfield wrote:
| Branch on uninit lights up beautifully in valgrind which
| has no setup, just run valgrind ./a.out
| leni536 wrote:
| Good point, although sometimes valgrind is too slow.
| namrog84 wrote:
| C++ senior dev here. On the few teams I've been on, one of the
| first things I make sure is set up right is cranking up
| warnings, with warnings as errors. That includes things like
| uninitialized variables. I then fix up the errors and make sure
| they are part of build gates.
|
| These types of problems (undefined behavior and/or
| uninitialized variables) are often hard (time-consuming) to
| diagnose and fairly common.
|
| Lots of places overlook simple static analysis or built-in
| compiler features.
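The uninitialized-flag bug from rented_mule's story has roughly the shape below. This is a hypothetical reconstruction in C based only on the description in the thread (`resolve_transparency` and `feature_enabled` are invented names), shown in its fixed form so that it is safe to run.

```c
#include <stdbool.h>

/* Hypothetical reconstruction of the pattern, not the original source. */
bool resolve_transparency(bool feature_enabled)
{
    /* The fix: give the flag a defined value at its declaration.
     * Without "= false", the value returned below is indeterminate
     * whenever feature_enabled is false. */
    bool enable_transparency = false;

    if (feature_enabled) {
        enable_transparency = true;  /* the only assignment in the buggy version */
    }
    return enable_transparency;
}
```

GCC and Clang can report some reads like this when warnings are turned on and promoted to errors (for example with `-Wall -Wextra -Werror`), though detection depends on the compiler, the optimization level, and how the variable is reached.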
| halifaxbeard wrote:
| Reminds me of a bug I fixed in yamux, simply because of how long
| I've had to deal with it. Bug existed for as long as yamux did.
| (yamux is used by hashicorp for stream muxing _everywhere_ in
| their products.)
|
| If yamux's keepalive fails/times out, and you're calling Read on
| a demuxed stream, it blocks forever.
|
| https://github.com/hashicorp/yamux/pull/127
| tommiegannert wrote:
| Kudos also to the original author for not doing premature
| optimization, of course. It wasn't until the iPad that it was
| needed. However, a TODO might have been useful. ;)
| tedunangst wrote:
| Only for users that didn't use both features at the same time.
| Users who did probably experienced the same bug, but it took
| until a critical mass of users reported the bug to get it
| fixed. At which point the fix probably took four times longer
| than necessary because the developer was unfamiliar with the
| design and the toolchain had decayed.
| omoikane wrote:
| > given a fixed denominator, any 16-bit modulo can be rewritten
| as three 8-bit modulos
|
| Anybody know what's the exact transformation here? I searched
| around and found this answer, but it doesn't work:
|
| https://stackoverflow.com/a/10441333
| o11c wrote:
| If the denominator is a constant, wouldn't it be faster to use
| the divmod identity to turn it into (divide, multiply,
| subtract), then use the usual constant-divide-is-multiply-and-
| shift optimization?
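What o11c suggests can be sketched for one concrete constant. The divisor 10 and the function names are chosen purely for illustration; the article's actual denominator is not given in this thread. The division becomes a multiply and a shift, and the remainder then follows from the divmod identity x % d == x - (x / d) * d.

```c
#include <stdint.h>

/* x / 10 for any 16-bit x without a divide instruction:
 * 0xCCCD / 2^19 is slightly more than 1/10, and the error is too small
 * to change the truncated result for x <= 65535. */
uint16_t div10(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 0xCCCDu) >> 19);
}

/* divmod identity: x % d == x - (x / d) * d */
uint16_t mod10(uint16_t x)
{
    return (uint16_t)(x - div10(x) * 10u);
}

/* Exhaustive check against the built-in operators;
 * returns the number of 16-bit inputs where the results differ. */
uint32_t count_mismatches(void)
{
    uint32_t bad = 0;
    for (uint32_t x = 0; x <= 0xFFFFu; x++) {
        if (div10((uint16_t)x) != x / 10 || mod10((uint16_t)x) != x % 10) {
            bad++;
        }
    }
    return bad;
}
```

Whether this is actually faster depends on the target. On an 8-bit part such as the 8051, a 16-bit by 16-bit multiply with a 32-bit result is itself built from several 8-bit multiplies, so it would need measuring rather than assuming.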
| figassis wrote:
| The number of times I've bumped my head against a desk, after
| missing multiple deadlines, and then out of nowhere had a
| random moment of clarity such as "this gives me X vibes, but it
| would be insane if this was actually the case", and then I do a
| quick string search and there it is.
___________________________________________________________________
(page generated 2024-06-21 23:01 UTC)