[HN Gopher] Cold restart whole system after total outage
___________________________________________________________________
Cold restart whole system after total outage
Author : dmazin
Score : 95 points
Date : 2023-07-19 20:06 UTC (1 days ago)
(HTM) web link (www.evalapply.org)
(TXT) w3m dump (www.evalapply.org)
| zgluck wrote:
| So not one mention of terraform/pulumi?
| hlandau wrote:
| This is called 'blackstart', particularly in the energy sector.
|
| My earliest exposure to this concept as a child was watching the
| film Jurassic Park. As someone fascinated by systems I found the
| idea of having to bring the whole system back up from scratch
| pretty interesting.
|
| Today I still find these kinds of bootstrapping processes
| fascinating - both these megascale processes, but also the boot
| process that occurs whenever you turn on your computer. The
| latter is probably one of the most Rube Goldbergian feats of
| engineering with us today that actually still achieves a useful
| purpose. In fact it's absurd how Rube Goldbergian it is. And the
| complexity of the boot processes for modern systems (see [1] for
| a small glimpse) is extraordinary.
|
| When you turn on your computer, it's like you're re-executing the
| entire process of a civilization bringing itself into being,
| gradually developing progressively more sophisticated
| technologies: at first RAM isn't working, but then you get RAM
| working and that lets you get progressively more sophisticated
| parts of the hardware working, etc.
|
| By comparison, humans have no "automatic boot process". We're
| constructed in the 'on' state via fork(). So this repetition of the
| entire process of, ah, 'abiogenesis' whenever you turn on your
| computer is kind of insane by comparison. Entire kingdoms of
| hardware state rise and fall with the press of a power button.
|
| As an aside, I'm fond of the Red Dwarf novels, which are set on a
| massive mothership-type spaceship. The ship was constructed in
| space and never designed to enter a planet's atmosphere. In
| particular, the ship, and its engines, was constructed in the
| 'on' state by the crew that built it originally. It was never
| conceived that the ship would ever need to be _rebooted_ ,
| because it was assumed once the engines were initially fired
| during the commissioning of the ship, they would never be turned
| off until decommissioning. Thus, the ship has no automatic boot
| process for the engines, only a manual engine firing procedure
| which is extraordinarily arduous and long-winded and takes weeks
| to execute, said procedure having been included in the manual
| only as a curiosity more than anything else. This idea of a ship
| built "on" under the assumption it would never once be shut down
| or "restarted" until decommissioning is interesting, but of
| course also directly mirrors biological life.
|
| [1] https://www.devever.net/~hl/backstage-cast
| potmat wrote:
| "Under the words 'Contact Position' there's a button that says
| 'Push to Close'".
|
| "Push it."
|
| When Spielberg was on he was ON! Who else could take a scene
| like "they have to reset the circuit breakers" and make you
| absolutely on the edge of your seat over it.
| mike_hock wrote:
| > We're constructed in the 'on' state via fork().
|
| Maybe some worms are, but "we" most certainly aren't. The
| closest analogy might be execve("/proc/self/exe"), but even
| that is flawed.
| jacquesm wrote:
| This is very interesting in the context of power infrastructure
| as well. As we found out the hard way during the 2003 power
| blackout in North America.
| rdhatt wrote:
| Practical Engineering did a video on the complexity of bringing
| a power grid back online, called "black start" (not cold
| start).
|
| https://practical.engineering/blog/2022/12/5/what-is-a-black...
| jsmith45 wrote:
| Black start of a single plant seems reasonable enough, but
| black start of a whole grid seems almost absurd.
|
| How could one possibly balance the load with the plants
| coming online? If the generation and load is too mismatched,
| the generators can literally automatically trip off the grid,
| so generation and load must be carefully balanced as things
| get brought back up.
|
| One would almost need to shed nearly all the loads from the
| black grid (Which may have happened anyway as the grid
| collapsed, but any loads not already shed by the collapse
| could prove interesting), and re-add some gradually as
| plants come online, which still seems crazy difficult.
|
| And inrush current demands from many loads as they get
| reconnected must be pretty insane.
| mjevans wrote:
| Offhand, that's pretty much the plan for a black start.
|
| Something about bringing up a designated plant and feeding
| the output over 'cranking' lines that other plants along
| the route can synchronize their output against. Then
| gradually adding load and source until the system is meta-
| stable again.
|
| Edit: additional data
|
| Not only synchronize, but also use for internal needs like
| all the pumps and particularly the 'excitation current'
| that establishes the magnetic field for the generator. It
| allows control over the output voltage. There are also
| other drawbacks to the more obvious solution of fixed
| magnets which can be oversimplified as 'wear'.
|
| https://en.wikipedia.org/wiki/Permanent_magnet_synchronous_g...
|
| https://en.wikipedia.org/wiki/Excitation_(magnetic)
|
| https://en.wikipedia.org/wiki/Electric_generator
| jsmith45 wrote:
| The problem is not so much the concept, as how tricky it
| would be to add the loads back in at just the right rate
| to not trip some or all the generation back off again.
| Sure once you have enough generation and load already
| online, adding the rest is relatively straightforward.
| Still need to be careful, but after a certain point it
| would look to utility operators much like the usual work
| restoring loads and sources after a large area blackout.
|
| The trickiness seems worst close to the very beginning
| when even relatively small misestimation of a chunk of
| load being restored would have a proportionally bigger
| impact. Many loads are not completely predictable, so
| presumably they would need to favor bringing some of the
| more stable loads online early so that normal variation
| from the loads that can only be predicted well in
| aggregate won't vary enough to trip everything back
| offline.
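The ordering concern above (predictable loads first, never exceeding what the online generation can carry) can be sketched in a few lines. This is only an illustration under loud assumptions: `LoadBlock`, `restoration_plan`, the 10% reserve, and all megawatt figures are hypothetical, and real restoration also has to handle frequency, voltage, and inrush current, none of which is modelled here.

```python
from dataclasses import dataclass


@dataclass
class LoadBlock:
    name: str
    expected_mw: float      # best estimate of demand once reconnected
    uncertainty_mw: float   # how far real demand might exceed the estimate


def restoration_plan(blocks, capacity_steps_mw, reserve=0.1):
    """Greedy sketch: at each generation step, reconnect the most
    predictable blocks whose worst-case demand still fits under capacity."""
    # Most predictable (lowest relative uncertainty) blocks go first.
    pending = sorted(blocks, key=lambda b: b.uncertainty_mw / b.expected_mw)
    committed_mw = 0.0  # worst-case demand of everything already reconnected
    plan = []
    for capacity in capacity_steps_mw:
        headroom = capacity * (1.0 - reserve)  # keep a spinning reserve
        picked, still_pending = [], []
        for block in pending:
            worst_case = block.expected_mw + block.uncertainty_mw
            if committed_mw + worst_case <= headroom:
                committed_mw += worst_case
                picked.append(block.name)
            else:
                still_pending.append(block)
        pending = still_pending
        plan.append((capacity, picked))
    return plan


# Hypothetical blocks and generation steps, purely for illustration.
blocks = [
    LoadBlock("residential", 30.0, 12.0),
    LoadBlock("hospital", 10.0, 0.5),
    LoadBlock("factory", 40.0, 2.0),
]
plan = restoration_plan(blocks, capacity_steps_mw=[50, 100, 200])
```

With these made-up numbers the poorly predictable residential block waits until the last step, which is the behaviour the comment argues for.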
| rkagerer wrote:
| Or you can chaos monkey style shut off the continent on a regular
| basis.
| lantry wrote:
| In the anecdote about Bill and the DISASTER script, I'm not so
| sure that deleting the script would be such a big deal. If this
| script hasn't been touched since the 1980s and nobody knows what
| it does, presumably nobody has tested it recently.
|
| It seems like if there really was a disaster, first of all nobody
| would know that script existed, and second of all if they tried
| to run the script, it would fail because of all the changes to
| the system since the script was initially developed.
|
| Isn't there some saying like "if you don't test your backups, you
| don't have backups" or something like that?
| mastax wrote:
| I have a hard time believing the 10k line shell script didn't
| have a comment at the top saying what it did.
| LeoPanthera wrote:
| I bet it still would have been a useful template for a human to
| read to get a general idea of what things to do and in what
| order.
| wkdneidbwf wrote:
| good luck reading 10k lines of shell written decades ago. it
| would likely be an incredible waste of time.
| tivert wrote:
| > good luck reading 10k lines of shell written decades ago.
| it would likely be an incredible waste of time.
|
| If the entire telephone system was down and needed cold
| started, and the script had information someone needed to
| do that, _someone would take the time to read it._ Maybe
| not run it, but definitely read it to extract clues.
|
| I mean, it's not like it's binary. It's totally possible.
| Gabrys1 wrote:
| Based on ChatGPT, assuming 10k lines translates to around
| 30k words, it should take about 3hrs to read it. Multiply
| that by your favorite factor for read and understand.
| Split that to a few people, skim parts that are not
| applicable etc. All in all this seems easily readable in
| sensible time.
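The estimate above is easy to redo with your own numbers. A small sketch, assuming only what the comment states (10k lines, about 30k words, about 3 hours, which implies roughly 167 words per minute); the function name and the `understanding_factor` and `readers` parameters are hypothetical knobs for the "multiply by your favorite factor" and "split between a few people" steps.

```python
def reading_hours(lines, words_per_line=3.0, words_per_minute=30_000 / 180,
                  understanding_factor=1.0, readers=1):
    """Back-of-envelope reading time in hours for a script of `lines` lines."""
    words = lines * words_per_line
    minutes = words / words_per_minute * understanding_factor / readers
    return minutes / 60.0


plain_read = reading_hours(10_000)
# Four times slower to actually understand it, but split across four people.
team_read = reading_hours(10_000, understanding_factor=4, readers=4)
```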
| thelastparadise wrote:
| I know, right?
|
| Can you even imagine the alternative? "Hey, let's throw
| this maybe incredibly helpful shell script in the trash.
| Because it's too long."
| wkdneidbwf wrote:
| right? that whole bit reads like some lame parable. like who in
| their right mind is going to run a 10k line shell script named
| DISASTER they've never read and cannot read because it's 10k
| lines of shell? there is apparently no documentation (and
| positively no tests)? one guy close to retirement remembers
| what it's for and says "don't delete this critical bit of
| code!"
|
| it's just utter bullshit.
| chubot wrote:
| If tens of millions of dollars are on the line, you will be
| able to find someone who can run the script or derive enough
| knowledge from it
|
| In a disaster scenario, something is better than nothing
| pavel_lishin wrote:
| Bill is clearly still picking up the phone. He'd likely be
| amenable to picking up a paycheck as well.
| wkdneidbwf wrote:
| it's more that i don't believe it's a real scenario.
| adityaathalye wrote:
| Isn't that the entire point? People don't believe (or
| choose to not believe) a certain disaster scenario is
| valid, until it happens. We all have seen first-hand the
| many examples of colossal failures of disaster-response
| planning in our recent planet-wide emergency. As have we
| seen the creative, dogged, herculean efforts to cope with
| it.
| adityaathalye wrote:
| OP here... As I wrote here, it is better to think of the
| story as apocryphal:
| https://news.ycombinator.com/item?id=36798893
|
| Also of course it will be crazy to read a giant shell script.
| But then again if the stakes are high enough, and if it
| yields even one critical piece of information, then it's
| worth it.
|
| The larger point is that organisational knowledge clings on
| in strange ways. In a crazy disaster scenario, people may
| appreciate having access to anything they can get their hands
| on.
| Gabrys1 wrote:
| 10K lines is not _that_ large. If it was written sensibly,
| it might be very useful. Bill might have written a text
| document, but chose to use Shell as a preferred engineers'
| language. Who doesn't share a one-liner with a colleague in
| need? Bill shared a 10k-liner :-). And Shell being a
| relatively high-level language, it probably packs the
| information more densely than a text file.
| perrygeo wrote:
| "Nobody cares if you have backups. Everyone cares that you can
| restore."
|
| Classic problem of deferred costs. Backups cost money and it's
| tempting to avoid investment in them (ie fail to test them) but
| that can bite you when it's least convenient.
| pixl97 wrote:
| Heh, Microsoft AD + DNS + VMs is a common 'cold start' trap for
| the inexperienced.
|
| There was a story around this during the Iraq War where a US
| military virtual machine system went down, and had to come back
| up without internet. Problem was VMware needed DNS to start the
| VMs, one of the VMs it needed was Active Directory for security,
| AD hosts the DNS and now you're locked up without an external
| running system.
|
| DNS itself is typically a cold start nightmare.
| mikewarot wrote:
| I had an old 486dx-50 as a backup domain controller for just
| that reason.
| mikewarot wrote:
| I think that John Plant[1] and some friends could get us from the
| stone age to ironworking. He's shown how to start with stones and
| get that far, albeit on a very small scale.
|
| Going from iron to precision screws is a matter of first making
| precision flat surfaces, then lathes, and onward from there.[2]
| You can do that with just iron and heat treating, but it won't be
| easy.
|
| If you want an alternate history where something slightly less
| drastic is dealt with, the book "Ring of Fire - 1632" by the late
| Eric Flint[3] is an interesting place to start. In the book, a
| town from West Virginia circa 2000 is thrown back into the middle
| of the 30 years war in Germany. Lots of exposition of the book is
| about the supply chains we all depend on, and how they work. It's
| the start of an awesome series.
|
| Books and working knowledge are a precious resource. As long as
| we have a critical mass of them, and conditions remain reasonably
| tolerable for human life, we can recover.
|
| [1] https://www.youtube.com/channel/UCAL3JXZSzSm8AlZyD3nQdBA
|
| [2] https://ia800104.us.archive.org/20/items/FoundationsOfMechan...
|
| [3] http://www.baen.com/chapters/0671578499/0671578499.htm
| pavel_lishin wrote:
| Beware the 1632 series. You'll think that you're just picking
| up a fun "Connecticut Yankee in King Arthur's Court" adventure
| yarn, but then a year down the line you've read a full dozen,
| the library just knows to go ahead and order the next one for
| you once you pick one up, and you're wondering if you'll be
| able to finish the series before retirement.
|
| https://en.wikipedia.org/wiki/List_of_books_in_the_1632_seri...
| chipsa wrote:
| At this point, you're likely to be about to finish the
| series, because it's unlikely to get much longer. Eric
| Flint's Wikipedia entry is now past tense. He died last year.
| throwanem wrote:
| Could be worse. I found Weber's stuff as sticky once upon a
| time, and Eric Flint's a considerably more skillful writer.
| But I appreciate the warning all the same; I really don't
| need so weighty an obsession on top of all my other hobbies.
| galkk wrote:
| One of my stories of work as vendor on $large_bank is that per
| some folks from there, they weren't able to do disaster recovery
| testing of their largest Oracle database for years, and per
| procedure they should've done it like every 6 months.
| anotherhue wrote:
| IMO if you can't cold start it you probably can't develop against
| it very quickly.
|
| Then again we couldn't cold start a supply chain or a semi fab or
| humanity itself so maybe that's the default.
| hinkley wrote:
| My comfort level with an architecture is always vastly improved
| by being able to run a toy version of the entire system on a
| developer box.
|
| It doesn't just speak well for disaster recovery prospects
| (both the feasibility of doing it and the density of developers
| who could possibly pull such a thing off), it's also very, very
| useful for speculative development.
|
| When you make a high barrier to entry of making large
| modifications to the system, you also tend to create an
| underclass of developers, who never really get to understand
| how the system works.
|
| What if we split these two microservices into three, or
| combined these three into two? That's a pretty common question,
| that only gets asked if you know you won't get laughed out of
| the room for suggesting it.
| bamfly wrote:
| You may enjoy the first episode of James Burke's _Connections_
| ("The Trigger Effect"), if you've not seen it.
|
| https://www.youtube.com/watch?v=NcOb3Dilzjc
| anotherhue wrote:
| I enjoyed the one in the Witness but hadn't gotten around to
| the rest, thanks for the excellent recommendation!
| https://archive.org/details/james-burke-connections_s01e10
| potmat wrote:
| Still the best non-fiction TV ever produced in my opinion
| (the whole series I mean).
| JohnFen wrote:
| Every new semi fab that comes online was cold-started.
| anotherhue wrote:
| With the output of the prior generations was my point.
| lelandbatey wrote:
| I love this explanation for why we should make plans even though
| folks will try to shoot down planning with "no plan survives
| first contact":
|
| > Even though nothing will go as planned, it's important to have
| the memory and expertise that did the planning, because that's
| what's going to be able to think through the as-yet-unknown-unknowns,
| when the inevitable FUBAR situation suddenly happens
| later.
|
| We don't plan so that everything will go according to plan, we
| plan so that we are better equipped to _reason_ when the plan
| doesn't work.
| johngalt wrote:
| At a certain point, you aren't doing a cold restart, but a high
| speed recreation of the system based on prioritized needs.
| thelastparadise wrote:
| It seems you've been there too :)
| tivert wrote:
| > Another colleague in the chat remarked up-thread (apropos cold
| reboot thinking):
|
| > I have seen this at <Indian eCommerce Giant> and at <a FAANG>.
| Most of it is related to cached data. Cold starts with empty
| caches causes too much load on databases. And then the failures
| cascade.
|
| > -- Another M'colleague in the Slackroom.
|
| Isn't that not really a problem with cold restart per se, but
| more the restart procedure? If caches are so critical, wouldn't
| you need a feature to throttle the load to what the databases can
| handle, as the caches populate? E.g., if you're cold-rebooting
| Facebook, start by blocking all connections except those
| geolocated to North Dakota, then add other regions as your caches
| fill.
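The throttling idea above can be sketched roughly as follows. Everything here is an assumption for illustration: the function names, the simplification that each cache miss costs exactly one database query, and the region traffic shares are all invented.

```python
def admit_fraction(offered_rps, cache_hit_rate, db_capacity_qps):
    """Fraction of incoming traffic to admit so that cache misses
    stay within what the database can absorb."""
    miss_qps = offered_rps * (1.0 - cache_hit_rate)
    if miss_qps <= db_capacity_qps:
        return 1.0
    return db_capacity_qps / miss_qps


def regions_to_admit(region_shares, fraction):
    """Open regions in the given order until their combined traffic
    share would exceed the admitted fraction."""
    admitted, total = [], 0.0
    for name, share in region_shares:
        if total + share > fraction:
            break
        admitted.append(name)
        total += share
    return admitted


# Hypothetical traffic shares; smallest region first.
shares = [("North Dakota", 0.02), ("Minnesota", 0.05), ("California", 0.40)]
cold = admit_fraction(offered_rps=10_000, cache_hit_rate=0.0, db_capacity_qps=500)
first_wave = regions_to_admit(shares, cold)
```

As the measured hit rate climbs, `admit_fraction` rises towards 1.0 and more regions pass the check, so the rollout is driven by cache warmth rather than by a fixed timetable.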
| jbnorth wrote:
| That's spot on. I work at a large cloud provider and one of our
| larger eCommerce customers had an outage in a kubernetes
| cluster which handled the front end traffic routed through a
| large CDN provider. Well sure enough "just turn it back on"
| wasn't an option since the surge of traffic was too rapid for
| the services and the cluster to scale out. They ended up having
| to turn the traffic back on incrementally to let things scale
| up to the point where they could handle the load.
| donalhunt wrote:
| One of the earliest incidents I worked on in the late 90s
| involved students DDOSing a university webserver in
| anticipation of exam results being posted. The server load
| was so high we had to pull the physical plugs on the server.
| :/
| benlivengood wrote:
| Specifically you want load shedding in the servers and
| retry with back off in the clients. Clients should do their
| best to exponentially back off on retrying failed requests and
| only try to contact healthy servers and maintain internal rate-
| limiting based on error rate, and servers should do their best
| to reply with failure quickly and cheaply to drive good
| clients' backoffs/rate-limiting and just drop bad clients'
| traffic, and the service discovery should try to detect and
| spread load across healthy servers (but this isn't always
| available at first in a cold start because the metrics or
| metadata probably aren't available yet), but in the end it's up
| to servers to reliably drop traffic they can't handle instead
| of building up giant queues and slowing to a crawl. Middleware
| is the hardest because it has to be a good client and also fail
| fast on overload as a server by correctly interpreting upstream
| behavior. Deadlines in RPCs that get passed across system
| boundaries can work pretty well for tall stacks of system
| layers where service health discovery or dependency discovery
| is hard, but require careful configuration to avoid failure or
| very slow starts under heavy load.
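A minimal sketch of the two halves described above: a client that backs off exponentially with jitter, and a server that refuses work cheaply once it is full instead of queueing it. The class and function names, the limits, and the "full jitter" variant are assumptions chosen for illustration; deadlines and service discovery are left out.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=30.0):
    """Full-jitter exponential backoff: a random delay between zero and
    min(cap, base * 2**attempt) seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


class SheddingServer:
    """Rejects requests immediately once max_in_flight is reached,
    rather than building up a queue and slowing to a crawl."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_begin(self):
        if self.in_flight >= self.max_in_flight:
            return False  # fail fast and cheaply
        self.in_flight += 1
        return True

    def finish(self):
        self.in_flight -= 1


def call_with_retry(server, work, max_attempts=5, sleep=time.sleep):
    """Run work() if the server accepts it; otherwise back off and retry.
    Returns None if every attempt was shed."""
    for attempt in range(max_attempts):
        if server.try_begin():
            try:
                return work()
            finally:
                server.finish()
        sleep(backoff_delay(attempt))
    return None
```

Passing `sleep` as a parameter is only there so the retry loop can be exercised without real waiting.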
| draw_down wrote:
| [dead]
| nickdothutton wrote:
| It's been a few years but I used to run DR exercises for
| corporates. Cold start means your only possessions are the fire
| proof suitcase full of LTO-5s and the street address of the DR
| data center. 1 day to bootstrap essential infra services, after
| the end of the 2nd day you'd have most customer facing systems
| up, day 3 would be the non-essential stuff. Personally I'd do it
| without sleep, but most of the youngsters would need a break.
| Pretty exhilarating, as IT work goes. Always use the feature that
| generates multiple index tapes of what backup set is on what
| numbered tape :-)
| adityaathalye wrote:
| OP here. That would be quite the trip! Finding oneself on the
| front line of something like that would easily bring out the
| best---and the worst---in a person. A rare chance to form
| lifelong collegial bonds, and perhaps to exorcise a personal
| demon or two (fear, anger, egotism ...).
| nickdothutton wrote:
| The only reason I was brought in to do the exercise as a
| consultant, was because the guy supposed to be doing it was
| too frightened, delayed for almost a year, went on long term
| sick leave and was eventually fired/quit. The docs and ops
| procedures actually looked good, so I took on the challenge.
| In the end, just 1 flexvol that had been configured manually
| years before, and was missed by the DR automation. Not a bad
| result.
| ChoHag wrote:
| [dead]
| gumby wrote:
| The full telephone system, which the author starts with, may not
| be restartable. Sure, you could restart the SS7 databases and
| computers, but the control plane runs over the data plane, which
| is configured via...the control plane. Originally the network
| controls were literally operators (humans), but bit by bit parts
| were incrementally automated, pulling the system slowly (over
| decades) by its bootstraps, which were gradually decommissioned
| as they weren't needed any more.
|
| I have a friend who knows a _lot_ about the phone system (he has
| a security clearance for some of his telephone work). One time we
| had a long conversation about this topic, until at one point he
| said "and let's talk about something else" -- I guess from that
| point some of the details are classified. So maybe there is a
| plan, or maybe they just designed the system in such a way that
| they could convince themselves that it would not go down unless
| things were so severe that loss of the phone system would not be
| your chief worry.
|
| ---
|
| In September 2001 there was a full standdown of US airspace. That
| was accomplished pretty quickly: "you are ordered to land
| immediately on the closest airport that can handle your aircraft,
| or be shot down". Undoing that, however, took some careful
| planning! Fortunately the standdown lasted several days so there
| was time to work it out. Even if you had a plan for this (and I
| assume FAA had one), figuring out what the realities on the
| ground were and matching them up with the plan was nontrivial.
|
| Apparently some of the planes landed where they could not take off
| again unless they were empty with a small amount of fuel to get
| to an airport designed for them. I don't believe I heard that any
| planes landed where they could _never_ leave.
| protastus wrote:
| > So maybe there is a plan, or maybe they just designed the
| system in such a way that they could convince themselves that
| it would not go down unless things were so severe that loss of
| the phone system would not be your chief worry.
|
| My belief from working in very large companies, and
| (previously) in mission critical systems is that a clean
| bootstrap and recovery process is extremely unlikely, almost
| impossible. Because in complex systems full of legacy parts and
| people who have long retired, the stars won't align.
|
| The only way to truly know is to design and periodically test
| for disaster scenarios (emphasis on the plural). But due to the
| scale in time and space, cost and bureaucracy, this planning
| and rehearsing is not going to happen with the desired detail
| and intensity. People do not seriously plan for things that
| have never happened.
|
| If it does happen, there will be a small group of extremely
| capable people that will find a way to bootstrap the system. It
| won't be according to some previously laid out plans -- they
| will make the plan in real time. They're not famous and
| probably never will be.
| colechristensen wrote:
| Eh, would not surprise me in the slightest if there was a
| secret billion dollar program that specifically practiced
| disaster recovery for the phone network. The government
| doesn't have the same motivations as a business and they
| spend a lot of money in a lot of places just to be prepared
| for unlikely events. Like we spend billions on military
| hardware we don't foresee ever needing _to keep the
| engineering capacity to design and build military hardware_.
| fluoridation wrote:
| >Like we spend billions on military hardware we don't
| foresee ever needing _to keep the engineering capacity to
| design and build military hardware_.
|
| Well, more because the military industrial complex lines
| the pockets of your politicians, who in turn decide how to
| spend the budget.
| runlaszlorun wrote:
| The two are not mutually exclusive. In fact they
| reinforce each other.
| kotaKat wrote:
| > I don't believe I heard that any planes landed where they
| could never leave.
|
| This has happened before outside of 2001, albeit not really a
| DR issue but more a political issue -- if we look to the Meigs
| Field destruction by former Chicago mayor Richard Daley,
| multiple aircraft were left stranded with a now destroyed
| runway. (The solution was to just give them special clearance
| to take off on a taxiway, but still.)
|
| https://web.archive.org/web/20110720045652/http://www.aopa.o...
| donalhunt wrote:
| In Ireland, we just build a runway to allow the plane to take
| off again.
|
| https://www.rte.ie/archives/2018/0521/965058-mexican-lands-p...
| joncrane wrote:
| And in the meantime they had the pilot judge a beauty
| contest. What a story!
| LoganDark wrote:
| this is truly one of the takeoffs of all time
| myself248 wrote:
| I've long had a fascination with aviation incidents (the
| Gimli Glider happened on my birthday), but I hadn't heard of
| this one before!
|
| What a great story. Thank you for posting it.
| wongarsu wrote:
| Clear evidence that news was more entertaining in the 80s:
|
|     In the merry month of May
|     Just before the dawn of day
|     A plane flew in for Shannon to refuel.
|     Because Shannon is fogged out,
|     Their are the rite of ought
|     To touch down in Cork Airport as a rule.
|     As he flew towards Mallow town
|     His supply of fuel was down
|     But the pilot was as cool as cool could be.
|     In a racetrack west of town
|     He made a safe touch down
|     Just beside the Mallow sugar factory.
| adityaathalye wrote:
| OP here. Thanks for the remarks! That tale is apocryphal to me.
| I found it amusing, and telling in the sense that disasters are
| the last thing people think of, and for large enough systems
| (especially ones that have accreted over human generations) the
| organisational knowledge has probably been lost to retirement
| and death. And if you're lucky, maybe a scrap of it has
| survived. Then one's job is to do the requisite software
| archaeology to figure out what one's people might have entirely
| forgotten.
|
| Also, if that script remained valid at the time, I doubt it
| would _do_ any critical actions. It might have been a sort of
| literal script to follow --- run the script, see what it says,
| do a thing, run the script, and so forth. Its supposed job was
| to help humans solve a _bootstrap_ problem.
|
| I see how my wording of that passage makes it sound like the
| be-all end-all of cold booting a telco. But that's what we get
| when we wall-of-text in our Slacks :D
|
| (edit: clarifying remarks)
| walrus01 wrote:
| If you dig deep enough into the SS7 stuff running in a modern
| regional ILEC it's way more fragile than you might think.
| Mostly because it's no longer treated as an absolutely cannot
| fail thing that is also a primary source of revenue like back
| in the days when everyone had a POTS line and tons of money
| came in from long distance bills. Many operators are
| decommissioning stuff like 5ESS and Nortel switches and moving
| to modern soft switches as quickly as they can.
|
| The network stuff underpinning a lot of critical tdm phone
| traffic these days is like a collection of 23 year old Cisco
| 15454 held together by spare parts and a few people who care
| about them.
| myself248 wrote:
| I've been out of the industry just long enough to remember
| that Cerent 454's got rebranded as Cisco 15454's right near
| the end of my tenure...
|
| Yow. Way to make a guy feel old! :P
| walrus01 wrote:
| Mostly I was using the 15454 as an example, there's lots of
| other 20 to 30 year old stuff out there in the TDM
| transport sector that's only available on ebay, through
| weird used equipment dealers, or by finding a decom from
| another ISP/telco.. Stuff like T1 (or DS0!) to DS3
| mux/demux to attach to a 15454, or similar. There's
| literally 911 call center transport circuits being held
| together by the telco equivalent of duct tape and string
| right now, nobody notices until it breaks.
|
| One of the weird challenges in building a new state-of-the-
| art inter city DWDM transport network now is dealing with
| things like legacy customers that have one OC48 and are
| unlikely to drop it any time soon, it's a considerable
| monthly revenue source, and have to deal with stuffing that
| into the system along with 100Gbps and greater coherent
| circuits.
|
| Also from a customer relations perspective sometimes the
| customer literally forgets that they have this extremely
| expensive DS3 or OC48 or something in monthly recurring
| billing, and you don't want to bring it to the attention of
| management, because they might go "are we still using
| this?" and cancel it.
| LeoPanthera wrote:
| > There's literally 911 call center transport circuits
| being held together by the telco equivalent of duct tape
| and string right now, nobody notices until it breaks.
|
| And break it does.
|
| https://www.kron4.com/news/bay-area/911-dispatch-system-in-o...
| bobthepanda wrote:
| The 2001 shutdown is even more crazy when you consider that for
| the FAA administrator who ordered it, it was his first day on
| the job. Hell of a first day.
| Arrath wrote:
| Hell of a day to quit sniffing glue.
| drbawb wrote:
| I'm reminded of Bryan Cantrill's talk "Debugging Under Fire"[1],
| which includes a retrospective of sorts about an entire
| datacenter rebooting.[2] That is a pretty large-scale disaster,
| but even that is a rung below a continent-wide outage. Poor
| "Bill" must have seen the proverbial light when he heard some
| folks wanted to trash the DISASTER script.
|
| [1]: https://www.youtube.com/watch?v=30jNsCVLpAE
|
| [2]: https://www.tritondatacenter.com/blog/postmortem-for-outage-...
| js2 wrote:
| It's 1995 or so. I'm at U.F. getting my CS degree. Our facilities
| guy is giving a tour of the department's server room to some big
| wigs.
|
| For some reason, he decides to demo the UPS cut-over switch. I
| have no idea why. But he manages to toggle it the wrong way and
| instead of switching the entire room full of servers to the UPS,
| he manages to cut power to _all of them_.
|
| My recollection is that the cooling went out too and the room was
| suddenly very silent. But in retrospect that doesn't make sense.
|
| What I do remember is that it was non-trivial to bring all our
| Unix servers back up because over the years they had been set up
| with NFS mounts in a loop such that for A to boot, it needed B to
| be up, which needed C to be up, which needed A to be up.
|
| Oops.
|
| So it took a lot of manual intervention to bring everything back
| up.
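A boot-order loop like the A, B, C one above is the kind of thing that can be caught before the power goes out. A small sketch using the standard library's `graphlib` (Python 3.9+); the host names and the `boot_order` helper are hypothetical, and a real check would have to extract the dependency map from fstab or automounter configuration.

```python
from graphlib import CycleError, TopologicalSorter


def boot_order(needs):
    """needs maps each host to the set of hosts that must be up first.
    Returns (order, None) if a cold start is possible, otherwise
    (None, cycle) where cycle lists the hosts stuck waiting on each other."""
    try:
        return list(TopologicalSorter(needs).static_order()), None
    except CycleError as err:
        # graphlib reports the detected cycle as the second exception argument.
        return None, err.args[1]


# The loop from the story: A needs B, B needs C, C needs A.
looped = {"A": {"B"}, "B": {"C"}, "C": {"A"}}
order, cycle = boot_order(looped)

# Breaking one edge (bringing C up by hand, say) makes the order computable.
fixed = {"A": {"B"}, "B": {"C"}, "C": set()}
fixed_order, fixed_cycle = boot_order(fixed)
```

The manual intervention in the story amounts to deleting one edge from this graph by hand; running a check like this periodically would show which edge that has to be.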
| jen729w wrote:
| Vaguely related, in 2015 we were building a new platform based
| on Cisco UCS and NetApp filers.
|
| Cisco had a virtual router. N1000 perhaps, the hardware was a
| C220 and it had some sort of appliance running on it. Which
| depended on some sort of LUN from NetApp for its storage. But
| you couldn't stand the LUN up until the vCenter was up because
| UCS provisioned the LUN, it didn't work if you did it manually,
| and that ran on vCenter, and the vCenter depended on the router
| so that it could reach the LUN. It was pure circular logic
| hell.
|
| There _was_ a path to make this work, but it literally took
| half a dozen very ~~clever~~ expensive Cisco and NetApp
| engineers a couple of days in a room with a whiteboard to
| figure it out. It was absurd.
| dunham wrote:
| We had a blackout (seemed to happen every fall when the
| students arrived - the university had a power station), the NFS
| server needed NIS to boot, and NIS server required NFS. We
| managed to manually get the NIS service running in single user
| mode, brought up everything else, and then rebooted NIS.
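|
| The recovery above amounts to breaking the cycle at one node by
| starting it by hand, then booting the rest in dependency order.
| A rough sketch using Python's standard graphlib module; the
| service names and the `manual` parameter are assumptions made
| for illustration.

```python
from graphlib import TopologicalSorter, CycleError


def boot_order(deps, manual=()):
    """Order services so each starts after the services it depends on.

    Services listed in `manual` are assumed to be started by hand in a
    degraded mode, so their own dependencies are ignored for ordering.
    """
    relaxed = {
        service: (set() if service in manual else set(needs))
        for service, needs in deps.items()
    }
    return list(TopologicalSorter(relaxed).static_order())


# Hypothetical dependency table: NFS needs NIS, and NIS needs NFS.
deps = {"nfs": {"nis"}, "nis": {"nfs"}, "clients": {"nfs", "nis"}}

try:
    boot_order(deps)
except CycleError:
    print("no valid boot order: nis and nfs need each other")

# Start NIS by hand first, then everything else follows.
print(boot_order(deps, manual={"nis"}))
```

| The hand-started service is running without its normal
| dependencies, which is why it gets a clean reboot at the end.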
| spc476 wrote:
| I've got two such stories. Around 2000, I was working at a huge
| web hosting company (third shift, monitoring the network).
| Suddenly the power went out. Turns out, the building management
| (separate company) decided to run a UPS test and the memo got
| lost. Fun time that night.
|
| Second story, around 2005---I, along with a friend and my
| father, were in Las Vegas eating lunch at one of the major
| casinos when the power went out completely. It was _eerily_
| silent and _dark!_ (and then slowly, we started hearing the
| groans of slotzombies rising among us) I'm sure someone lost
| their job for a UPS cut-over failure.
| rout39574 wrote:
| I was in that room! Poor fellow. It was the demo of our new
| UPS; I can still remember the gradual fading fans and clicks of
| capacitors.
|
| He was really nervous showing the new install off to us, and he
| was just talking through the positions of the switch; but he
| actually moved the switch to each of them as he did it.
|
| Back in the days when the conslops ran the asylum. :)
|
| Any old school UF types looking at this, The Dog list lives on,
| and is again meeting from time to time. :) But at quieter
| venues. Our ears have unaccountably gotten old.
| zamadatix wrote:
| The heck is a conslops besides the food at the prison?
| djbusby wrote:
| Console Operators
| more_corn wrote:
| In some ways a disaster recovery plan can follow a better path
| than the original bootstrap. Imagine industrial society. On page
| two of the disaster recovery plan you have a description of germ
| theory. Something that has saved hundreds of millions of lives
| and would have saved hundreds of millions more had it been
| discovered / formalized thousands of years previously. People
| knew about sanitation in prehistory but the theory wasn't
| formalized so doctors didn't always wash their hands.
|
| Likewise knowledge around nitrogen fixation and fertilizers.
|
| There are probably a half a dozen huge improvements that could be
| made for bootstrap_society_v2.sh
|
| Perhaps we should all write down our version and tuck it away
| somewhere safe just in case. Maybe on something more durable than
| paper. And certainly more durable than electronic storage.
| ramidarigaz wrote:
| Interestingly this paragraph isn't quite true:
|
| > So much of the modern world depends on our mastery over
| materials (to make a precision screw, you need a precision-
| machined harder material--diamond / titanium--to work on a softer
| material--steel), and our ability to turn rotary motion to linear
| motion (it's stupidly difficult to reliably precision-machine a
| harder material without even more precise linear + rotary motion
| --lathe/CNC machine). Hence, a bootstrap problem.
|
| Steel is hardenable (or rather, some steels are hardenable), you
| can change its hardness through the specific application of
| heating and cooling. So you can make a crude tool with relatively
| soft steel, harden it, and use it to make a more precise steel
| tool (again machine soft, then harden). This does make the
| bootstrapping problem a bit easier, I think. Although not easy in
| the absolute.
|
| See https://www.youtube.com/watch?v=V_Mp1fNzIT8 for a great dive
| into primitive steel hardening techniques.
| smcameron wrote:
| David Gingery's books might be of interest to anyone thinking
| of bootstrapping a metal working shop starting from charcoal
| and scrap aluminum.
|
| https://www.gingerybookstore.com/
| adityaathalye wrote:
| OP here. Thanks for the critique! Yes I agree fully. The
| specific example of diamond/titanium aside, the general point
| stays, I feel. A youtube rabbit hole is nigh, clearly :)
| hinkley wrote:
| There's a way to grind mirrors for optics with polishing
| stones that aren't even flat to the naked eye. Basically the
| system arrives at tiny tolerances via the process of using the
| system.
|
| And there's a way to make three perfectly flat sharpening stones
| by starting with three raw pieces of natural sharpening stone,
| just by alternately rubbing the three stones together until
| they flatten each other out.
|
| Paul Sellers can teach you how to flatten a large board without
| a planer. He also has videos on how to get a wood plane
| perfectly flat using a large sharpening stone (which can be
| made as above or with float glass).
|
| And if memory serves, to make something perfectly round you
| first need something perfectly flat. Once you have something
| perfectly flat and something perfectly round it's off to the
| races.
|
| Edit: "The Origins of Precision" is a half hour well spent
| https://www.youtube.com/watch?v=gNRnrn5DE58
| alexwasserman wrote:
| Great read: One Good Turn: A Natural History...
| https://www.amazon.com/dp/0684867303?ref=ppx_pop_mob_ap_shar...
|
| A history of the screw. Really interesting around how it was
| developed. Some machining techniques are far older than you'd
| expect, and some capabilities far newer.
|
| The book's thesis is that the screw is the most important
| invention.
___________________________________________________________________
(page generated 2023-07-20 23:02 UTC)