[HN Gopher] As winter approaches, here's a story about why hardw...
___________________________________________________________________
As winter approaches, here's a story about why hardware is hard
Author : mooreds
Score : 125 points
Date : 2022-12-17 14:28 UTC (8 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| lostlogin wrote:
| Thought this was going to be about Iran's drones, with which
| Russia is smashing up Ukraine. They are reported to be sensitive
| to low temperatures.
|
| https://www.nytimes.com/2022/12/14/world/europe/ukraine-russ...
| Eleison23 wrote:
| Chris Siebenmann has been blogging about how cold affects his
| workstation.
| https://utcc.utoronto.ca/~cks/space/blog/tech/ColdLockupMach...
| mdorazio wrote:
| I really wish hardware startups would at least hire people early
| on who have experience with actual manufacturing. Stuff like this
| really shouldn't happen and really shouldn't be a surprise,
| either. Temperature considerations are a standard part of the
| design criteria and testing in any mechanical device, and
| standard root cause analysis (did they even do FMEA?) would have
| forced them to find the problem the first time, not the second.
| When you're still in the early prototype phase, fine, but this is
| a 25? person team selling a product with a subscription and
| everything.
|
| Edit: just realized their stated fix is to remove two resistors.
| Did they actually test the full effects of that change? Seems
| like there's a decent chance of a third head-scratcher a few
| months from now...
| Game_Ender wrote:
| As a software person this is the first I have heard of FMEA
| (Failure Mode and Effects Analysis) [0]. Sounds like a very
| rigorous way to move through a system and identify all the ways
| it can break and develop ways to fix them.
|
| [0] https://en.wikipedia.org/wiki/Failure_mode_and_effects_analy...
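For readers who, like the commenter above, have not met FMEA before, here is a minimal sketch of the common worksheet form: each failure mode gets a severity, occurrence, and detection rating, and the product of the three (the risk priority number, RPN) decides what to look at first. The rows, the ratings, and the names `FailureMode` and `rank` are invented for illustration; nothing here comes from the robot in the story.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of an FMEA worksheet; each rating is on a 1-10 scale."""
    component: str
    failure: str
    severity: int      # 10 = worst consequence
    occurrence: int    # 10 = almost certain to happen
    detection: int     # 10 = almost impossible to catch before shipping

    @property
    def rpn(self) -> int:
        # Risk priority number: the product of the three ratings.
        return self.severity * self.occurrence * self.detection


def rank(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical rows, invented for illustration only.
worksheet = [
    FailureMode("motor coupler", "loosens under vibration", 6, 3, 4),
    FailureMode("sensor front end", "drifts out of range when cold", 8, 5, 7),
    FailureMode("battery pack", "reduced capacity below freezing", 5, 6, 2),
]

for mode in rank(worksheet):
    print(f"{mode.rpn:4d}  {mode.component}: {mode.failure}")
```

The ranking is only as good as the ratings a team assigns, so the numbers are a way to structure the discussion rather than a measurement.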
| Tempest1981 wrote:
| Yep, we used a temperature chamber that could ramp from -40°C
| to +80°C, along with humidity. We would push the limits until
| something broke, fix the design, and repeat.
|
| Surprised, but not surprised, that not everyone does at least
| some stress testing.
|
| I was expecting something more complicated, like
| electromagnetic interference from a passing vehicle. Or ESD
| (static discharge).
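The "push the limits until something broke" loop described above can be sketched as a step-stress sweep. This is an illustration under stated assumptions, not the commenter's actual procedure: `step_stress` and `simulated_unit` are hypothetical names, the 5-degree step is arbitrary, and the simulated pass band (2 to 60 degrees C) stands in for a functional check on real hardware.

```python
def step_stress(works_at, start_c, stop_c, step_c):
    """Step the chamber setpoint from start_c toward stop_c.

    works_at is a callable taking a temperature in degrees C and
    returning True while the unit still passes its functional check.
    Returns the first failing setpoint, or None if every step passed.
    """
    if step_c <= 0:
        raise ValueError("step_c must be positive")
    direction = -1 if stop_c < start_c else 1
    steps = int(abs(stop_c - start_c) // step_c)
    for i in range(steps + 1):
        setpoint = start_c + direction * i * step_c
        if not works_at(setpoint):
            return setpoint
    return None


# Hypothetical stand-in for a real unit under test: it passes its check
# only between 2 and 60 degrees C. Real hardware would be exercised here.
def simulated_unit(temp_c):
    return 2 <= temp_c <= 60


cold_limit = step_stress(simulated_unit, start_c=20, stop_c=-40, step_c=5)
hot_limit = step_stress(simulated_unit, start_c=20, stop_c=80, step_c=5)
print("first cold failure:", cold_limit, "C")
print("first hot failure:", hot_limit, "C")
```

In practice each setpoint would also need a soak time so the unit reaches the chamber temperature before the check runs; that is omitted here to keep the sketch short.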
| exmadscientist wrote:
| > standard root cause analysis (did they even do FMEA?) would
| have forced them to find the problem the first time, not the
| second
|
| Not to mention that the standard checklist for solving "why is
| this joint having problems" is "is it cold or does it get worse
| when we make it cold"....
|
| > just realized their stated fix is to remove two resistors.
| Did they actually test the full effects of that change? Seems
| like there's a decent chance of a third head-scratcher a few
| months from now...
|
| As a half-decent analog guy I can actually believe this. I can
| imagine a lot of ways something like this _could_ fix a
| problem, though of course I haven't seen any schematics here.
| pclmulqdq wrote:
| I'm guessing that a robotics company actually has very few
| analog-competent engineers. It may well have solved the issue
| if one of them had built an analog circuit. A lot of
| engineers don't think about operating conditions when they
| first learn to build circuits, and run them near the edge of
| their specified tolerances, so moving the component into a
| "comfortable" range would fix it.
|
| However, if they were working with a COTS component and took
| some resistors off that, I doubt it was actually the right
| fix.
| pifm_guy wrote:
| What's the betting that in hot weather their Texas
| customers with those resistors removed will find their robots
| nonfunctional again...?
| steve_adams_86 wrote:
| I'm just a bum with an automated hydroponic garden and I
| verified and tested all of my components for expected behaviour
| in the normal conditions the system will be in. I had to
| exclude several components as a result.
|
| On one hand I'm surprised they didn't do that, but on the
| other, I have no time constraints and I'm very, very uncertain
| of what I'm doing most of the time. I knew I wanted to have
| accurate readings and reliable performance so my garden doesn't
| die due to malfunctions. So, I goofed around and made sure
| stuff was right. I'd do the same with code in my spare time,
| but my employers have me cut corners all the time. People could
| criticize me for it, but it's not as though I don't know
| better. It might be similar for this team. By the time the bugs
| strike, it's not clear what's been properly vetted or who knows
| what about which components. Debugging becomes harder because
| the initial spec and how well it was met is no longer clear.
|
| I also discovered the ruggeduino line of arduino boards in the
| process which are pretty cool. Overkill for my use case, but I
| hope to have a use for one some day. I'm thinking of making a
| robot shop vac and metal-picker-upper, and the ruggeduino would
| be great there I think.
| ThrowawayTestr wrote:
| Temperature dependent problems are the hardest to reproduce.
| Often they only occur within a specific range of temps.
| dagw wrote:
| I met a guy who worked for IBM on the AS/400s. Apparently they
| had a 'server room' where they could control the temperature and
| humidity to basically any combination that was likely to occur in
| the real world and they would test all their hardware there under
| any extreme condition they could think of.
| dboreham wrote:
| All hardware produced in a professional manner is tested like
| this.
| h2odragon wrote:
| Wonder what the chances are that the first iteration of "problem
| solved!" actually _was_ a lurking problem that would've bitten
| as well, sooner or later.
| nixpulvis wrote:
| Reminds me of the time I bought a new iPhone and the touchscreen
| didn't work properly when it was below about 20 degrees.
|
| Leave it to Californians to forget that winter exists.
| ghaff wrote:
| In fairness, it's generally known that a lot of consumer
| electronics have problems in freezing weather unless they're
| protected somehow (e.g. kept in an inner pocket) if only
| because of the battery. Some of it is possibly California
| designers not being especially focused on the sub-zero use case
| but it's also not clear how much focus there should be on that
| use case in general if there are costs (money, physical specs)
| associated with doing so.
| nixpulvis wrote:
| Let's see... does the object go outside? If yes, it will
| probably experience below freezing temperature.
|
| This isn't rocket science.
|
| Now, if you happen to be OK with selling a product that isn't
| reliable, now's a good time! I still need some more toys for
| the holidays.
| ghaff wrote:
| Maybe you should start up your 40 below Android phone
| company. Enjoy!
| gumby wrote:
| This is why I was shocked by Unix producing core dumps (I'd
| previously gotten my experience on ITS and Lispms where
| everything ran in the debugger).
|
| In a core dump all the IO (network connections, files, etc) is
| closed while a debugger gives you access to the functioning
| environment.
| joezydeco wrote:
| Any developer, hardware _or_ software, will tell you that
| reproducibility is the key to solving a problem.
|
| But reproducibility can be vague and sometimes, when you're under
| pressure, you can be quick to point to something and declare
| "aha! that's the root cause!" and be totally wrong.
| JoeAltmaier wrote:
| Or even be totally right, but there's more to the problem.
| Peeling the onion.
|
| We hope that fixing one piece of code will solve three or four
| exhibited problems. But it's often more like, change three or
| four pieces of code to make one problem go away.
|
| As Heinlein is purported to have written, "If it's not one
| thing, it's two things."
| eesmith wrote:
| I've read about 98% of everything Heinlein published[1], and
| don't recognize that quote.
|
| An archive.org search finds a few examples, like this 1979
| Harlequin short story:
| https://archive.org/details/romanticshortsto00harl/page/30/m...
|
| If Heinlein did use a phrase like it, I expect my searches
| would have found it.
|
| It doesn't appear to be a common saying, so I'm curious how
| you acquired the association between it and Heinlein. It
| doesn't seem like a common misquote people end up spreading.
|
| [1] I've only read "Tramp Royale" up to the point where they
| left the US, I haven't read the "stinkeroos", nor his
| posthumous novels, nor most of what Wikipedia lists under
| "Other short fiction", nor a couple more non-fiction
| publications.
| morphle wrote:
| I agree, having read most Heinlein ever published (even
| including his unfortunate right wing and conservative and
| unscientific ramblings that wasted my time), I never came
| across "If it is not one thing, it is two things."
| joshmarinacci wrote:
| Maybe Niven?
| morphle wrote:
| I'm even more sure it was not written by Larry Niven. I
| read most of Heinlein twice but read Niven at least 4
| times.
| JoeAltmaier wrote:
| Maybe Spider Robinson?
| metaphor wrote:
| Agreed.
|
| While reading the thread, a red flag[1] was immediately raised
| in my mind when:
|
| > _We couldn't reproduce it, but we did come up with a theory
| for why it was happening._
|
| ...going right into mechanical subsystem redesign. Surely a
| cursory review would have challenged such a reactionary
| proposal: What meaningful steps were taken to falsify the
| prevailing theory?
|
| There's something implied about discipline when this vacant _QA
| Tester Hardware/Software_ engineering position description[2]
| bundles verification/validation test roles on the
| design/development/production/field support fronts with the
| following caveat:
|
| > _Initially, you'll be the only QA engineer and will perform
| active testing of new product releases in our lab and in the
| field at construction sites._
|
| Also, non-rhetorical question: Selenium for industrial hardware
| test automation...is that really a thing in the wild?
|
| [1] https://twitter.com/tessalau/status/1604018887603138561
|
| [2] https://boards.greenhouse.io/dustyrobotics/jobs/5373908003
| Workaccount2 wrote:
| As a hardware guy I look at software with envy. Having to deal
| with physics is such a huge fucking pain in the ass all the time.
| Reality really fucking hates low entropy systems and will try and
| sabotage you at every turn. There is also the inherent opaqueness
| to reality based systems that makes debugging them a huge pain
| that can be enormously time consuming and expensive. And scaling
| is ridiculously difficult and expensive.
|
| And worst of all, for me, there is no money in hardware. At best
| you make a trinket that requires a $9.99 subscription to really
| get use out of. At worst you make a cool trinket, get forced by
| pricing to make it in China, and then end up just having the idea
| stolen and reproduced to be sold for 1/2 the cost.
|
| Ok rant over.
| green_on_black wrote:
| I agree with everything except the first sentence. Similar to a
| meeting, where time used meets time allotted, complexity meets
| complexity allowed.
|
| https://www.stilldrinking.org/programming-sucks
| Quarrelsome wrote:
| I once lost about a month to a USB issue reported in the field on
| some custom hardware. Spent several weeks failing to reproduce it
| (setting up multiple machines to automatically hammer through
| typical usage). It eventually transpired that the issue
| correlated with cold temperatures and the recent outsourcing of
| assembly had resulted in some poor soldering.
| analog31 wrote:
| I work in hardware. The picture of the freezer with wires coming
| out under the door gasket is familiar.
| svnt wrote:
| I'm gonna snark on this one.
|
| There is no mystery or surprise here. It's basic functional
| qualification. You buy or make a thermal chamber and cycle
| release versions of your device before you ship one. This isn't
| some uncatchable mystery, they just didn't test adequately.
|
| Depending on the product size and cost you may also do this to
| every individual robot off the line. This is not uncommon.
|
| This isn't "hardware is hard"; this is "we thought it was software
| with screwdrivers."
| SkyPuncher wrote:
| I'm arm-chairing a bit. This is the type of problem I'd expect
| on a prototype, but not on a production level device -
| especially on a construction robot that will be clearly out in
| the weather.
|
| It reads to me like they didn't properly spec/source components
| that were appropriate for the weather conditions these robots
| are likely to see. The fact that this issue was reproducible at
| refrigerator temperatures is even more shocking. 39F (taken
| from a photo) is not very cold.
| iancmceachern wrote:
| Agreed. In many more regulated industries (medical devices,
| automotive, aerospace) the kind of testing you mention is
| required by law. In all cases it's good form, good practice to
| test your product to make sure it works as promised by its
| labeling. Typically you put an operating temperature and
| humidity range in your manual or labeling. In many industries
| testing to those operating parameters is required by law.
| greenbit wrote:
| "environmental qualification testing", good old EQT.
| bsder wrote:
| > You buy or make a thermal chamber and cycle release versions
| of your device before you ship one.
|
| That is a _total_ waste of time in a startup with a small
| number of units shipped.
|
| A component behaving out of spec due to temperature excursions
| simply isn't that common nowadays. If my system is mostly ADC
| to digital to DAC (standard for robotics controllers), testing
| for temperature is a waste until I'm shipping significant
| volumes.
|
| There is a video of one of the slightly famous YouTubers who
| has a high voltage thing that fails at the altitude of his lab.
| The manufacturer _did_ check it for function at the elevation
| of Denver, but his lab is higher than that. There are limits to
| how much engineering effort you should put in until you get an
| _actual_ failure. (Maybe someone can link the video as I can't
| remember it at this point.)
|
| You can waste infinite engineering effort covering all
| possibilities. Or you can ship the thing and fix the failures.
| "Good engineering" is about balancing the two--you need to
| ship, but you don't want to have too many failures in the field
| either.
| mdorazio wrote:
| Testing thermal performance is as simple as going to a
| restaurant down the street and asking if you can pay them
| $100 to borrow their walk-in fridge for an hour for cold, and
| leaving your device in a hot car for an hour for hot. It's
| also exactly the kind of thing you should be doing if you're
| selling an actual product to a customer instead of partnering
| with someone to test your prototype.
| snovv_crash wrote:
| No, often components are dependent on the stress they are
| under. If you have inconsistent reflow, or lots of rework
| happening, then each device will behave differently due to
| thermal contractions.
| justin66 wrote:
| If you specify an operating temperature range for your
| gadget, some testing at the extremes of that range is called
| for.
| KennyBlanken wrote:
| Or even just drive down to Lake Tahoe and find a winter
| construction site and try it there.
|
| This is a general problem with all these California-based
| companies and inventors (especially SV inventors cranking out
| crowd-funded bike stuff.) They seem blissfully unaware of
| things like cold weather, water, dirt/mud, and road salt...or
| combinations of them. I laugh at all those stupid fucking
| delivery bots because they'll fall apart anywhere there's snow,
| and get completely stuck on the slightest bit of ice.
|
| For many years, driving a Model S in heavy rain would cause
| water to get into the drive unit via either seals or vents that
| weren't sufficiently designed to keep water out. It "totals"
| the drive unit, causing corrosion of the motor control boards.
| And Tesla denies warranty claims on such repairs, because of
| course they do - just like they did on the windows that
| randomly shattered in parked cars.
|
| Raise your hand if you've owned a car that had problems with
| water ingress issues affecting its transmission. Or windows
| randomly shattering.
|
| What's that? Nobody? Exactly.
| trasz2 wrote:
| lostlogin wrote:
| It is not just SV startups.
|
| I have owned a couple of Philips bread makers. Basic ones and
| expensive ones.
|
| If you make sourdough with them (i.e. bread) the coating is
| stripped off the bowl and the stirrer corrodes.
|
| They will deny replacement and claim you sprayed something
| acid on it. Yes, fermentation is acid, but they don't believe
| bread would damage their unit.
| TooSmugToFail wrote:
| You are 100% right, but I would not be as dismissive to the
| engineering team.
|
| When you run a hardware startup, you can only hope for an
| experienced team that would do everything by the book and
| implement best practices from the very first production unit.
| Reality is: that's a luxury for most hardware startup teams out
| there.
|
| Typically, there's a frantic rush to get your device to market
| that you simply skip, or more likely don't even have time to
| think about stuff like climate chamber cycling.
|
| One thing I'm almost sure of: these guys have learned something
| -- the engineer's way. Good chance there's a guy there googling
| climate chambers to ask the CEO for a budget to buy one.
| toss1 wrote:
| >>Typically, there's a frantic rush to get your device to
| market that you simply skip, or more likely don't even have
| time to think about stuff like climate chamber cycling.
|
| And that, right there, is the difference between a company
| oriented around myopic management vs a company oriented
| around robust quality.
|
| Any company trying to build a quality reputation would spec
| this stuff out AT THE BEGINNING -- what are the operational
| requirements, what loads will they see, in what environments
| will they run, etc??? Then spec every component, and test the
| whole lot against those requirements. Sure, this is more like
| the dreaded "waterfall" vs "agile", but the result is a
| quality product from the start that has far fewer of these
| problems (because they did this whole test & fix routine at
| the prototype or Alpha test stages), rather than showing up
| with stories like this of how they recovered from customer-
| reported problems.
|
| If you're telling your customers that they're the Alpha
| testers because they get early access, fine. If you're
| selling it as a finished product, then we know your company
| isn't prioritizing actual quality.
| mschuster91 wrote:
| > If you're telling your customers that they're the Alpha
| testers because they get early access, fine. If you're
| selling it as a finished product, then we know your company
| isn't prioritizing actual quality.
|
| Sounds like Tesla... although it is public knowledge there
| that you _are_ still alpha testers years after a model was
| introduced.
| spaceywilly wrote:
| I've worked for a hardware startup. We didn't have the budget
| for a thermal chamber, but that doesn't mean we didn't test
| and anticipate temperature related issues. We had plenty of
| setups like the one in the blog with units in the fridge or
| on a car dashboard on a summer day.
|
| The difference is we did it before the units reached
| customers.
| exmadscientist wrote:
| We skip stuff like the temperature chamber all the time.
| The difference is that we _know_ what we're skipping and
| why. This can easily make for a "10x" team: we are
| experienced enough to know what we can get away with, and
| experienced enough to know very quickly what happened when
| we _fail_ to get away with it. (So it's often a pretty
| quick and direct fix! Well, that or a C-level "you declined
| this part of our proposal so now it's failing exactly like
| we told you it would so you're looking at this big of a
| redo, exactly like we told you...").
| dboreham wrote:
| But you probably have a cupboard full of freezer spray
| cans and a heat gun.
| bostik wrote:
| Sorry, but this is a bit rich:
|
| > _When you run a hardware startup, you can only hope for an
| experienced team_
|
| If you run a hardware startup and fail to _acknowledge_ that
| places like Alaska or Ontario exist, you fail before even
| getting close to merely inexperienced. The most charitable
| word I can think of is "myopic".
| mindslight wrote:
| It all looks so simple after someone else has debugged the
| problem and figured out it was temperature, that it's just
| too tempting to distill it down to some dismissive
| statement like "_fail to acknowledge places like Alaska or
| Ontario exist_". But ultimately it's about unknown
| unknowns. Sure, you can spend massive amounts of time and
| money trying to make the known unknowns into knowns (higher
| background radiation, higher cosmic rays, lower/higher air
| pressure, higher sun intensity, camera flashes, and so on).
| But even if you do this for a number of things including
| temperature, there will still be a bunch of factors you won't
| be preemptively testing. In which case respecting the
| general lesson will come in handy, even though it's been
| explained by someone who failed to do testing that you
| consider routine.
| mcculley wrote:
| It is not only startups and not only cold places that are
| overlooked. The iPhone documentation says that it should
| not be used outside of 32°-95°F.
|
| https://support.apple.com/en-gb/HT201678
| runnerup wrote:
| That seems reasonably accurate on the high end for me.
| For _sustained_ use, iPhones generally stop working well
| around 100-105F.
|
| I suspect I would be disappointed by the performance of
| an iPhone that could operate in >125F temperatures
| (temperatures which I have worked in outdoors for several
| years)
| Aloha wrote:
| Based on the pictures of the robots, it looks like they were
| intended for indoor use, I guess no one planned for an indoor
| space outside of 50-90 degrees - which is a pretty reasonable
| supposition, better than 80% of indoor environments are within
| that range.
| todd8 wrote:
| My story: my first real job after grad school was at Texas
| Instruments. It was a good job, and I enjoyed working there.
|
| A fellow new hire and I were tasked with fixing a machine that
| ran in a clean room where semiconductor wafers were made. On
| weekends while the line was down, we would go in, crank up one
| section of the line and let waste wafers travel through the line
| where very rarely they might get stuck in a multi-lane machine
| that would etch the wafers with some sort of acid.
|
| The machine had over two dozen asynchronous motors, actuators,
| pumps, sensors and so forth. All generating interrupts and I/O
| events that were sent to a computer that ran the whole line and
| controlled all the machines.
|
| We couldn't slow down the machine, it had to run at full speed.
| The program controlling the machine was thousands of lines of
| assembly language--everything was assembly language, including
| the homemade OS that ran the computer running the line. It took
| like an hour for us to bring up the line and two more hours to
| see the machine do something strange.
|
| The computer running all this had no user interface other than
| some front panel switches and some panel lights that would reveal
| 16 bits of its 128K of memory at a time. This was in the 1970s
| before Ethernet had been invented.
|
| It felt a bit like those escape room events where you know there
| is a solution, but you don't know if you will ever get out.
| Without my coworker cracking jokes about our plight, I'm not sure
| we would have ever triumphed over that stupid machine.
| mrkeen wrote:
| I have an HP Spectre x360 laptop. It wasn't registering some of
| its keystrokes. The same keys would fail, and for quite a while
| too - hitting them harder and repeatedly didn't help.
|
| Turns out it was the cold. Now when I take a trip out for coffee
| & coding, I boot up and let it sit for a while before starting my
| work.
| TeeMassive wrote:
| "Turns out that last year's coupler problems had the same root
| cause. While people were opening up the robot and tightening the
| coupler, the robot would warm up. By the time they put it back
| together, the problem would have gone away. It had nothing to do
| with couplers at all.
|
| By the time we had rolled out the coupler "fix" to all robots,
| the weather had warmed up enough across the country that the
| issue didn't reoccur. We thought we had fixed it, when actually
| spring fixed it."
|
| At first they _correlated_ a possible root cause and then after
| learning from that mistake they finally _understood_ the root
| cause.
|
| I've seen it happen many times where people with not enough time
| and knowledge to debug a huge system had to resort to shotgun
| debugging. IME taking the time to understand always ends up 1)
| solving the problem and 2) saving time and money.
|
| This is especially true when the problem is actually caused by
| two or more root causes.
| rileymat2 wrote:
| Is there a good name for "two root causes" where they work in
| tandem? I am blanking on it. Contributing factors? Necessary
| but not sufficient conditions?
| spaceywilly wrote:
| 2nd order effects
| jasonwatkinspdx wrote:
| https://en.wikipedia.org/wiki/Confounding
| itcrowd wrote:
| Destructive interference?
___________________________________________________________________
(page generated 2022-12-17 23:00 UTC)