[HN Gopher] Cold restart whole system after total outage
       ___________________________________________________________________
        
       Cold restart whole system after total outage
        
       Author : dmazin
       Score  : 95 points
       Date   : 2023-07-19 20:06 UTC (1 days ago)
        
 (HTM) web link (www.evalapply.org)
 (TXT) w3m dump (www.evalapply.org)
        
       | zgluck wrote:
       | So not one mention of terraform/pulumi?
        
       | hlandau wrote:
       | This is called 'blackstart', particularly in the energy sector.
       | 
       | My earliest exposure to this concept as a child was watching the
       | film Jurassic Park. As someone fascinated by systems I found the
       | idea of having to bring the whole system back up from scratch
       | pretty interesting.
       | 
       | Today I still find these kinds of bootstrapping processes
       | fascinating - both these megascale processes, but also the boot
       | process that occurs whenever you turn on your computer. The
       | latter is probably one of the most Rube Goldbergian feats of
       | engineering with us today that actually still achieves a useful
       | purpose. In fact it's absurd how Rube Goldbergian it is. And the
       | complexity of the boot processes for modern systems (see [1] for
       | a small glimpse) is extraordinary.
       | 
       | When you turn on your computer, it's like you're re-executing the
       | entire process of a civilization bringing itself into being,
       | gradually developing progressively more sophisticated
       | technologies: at first RAM isn't working, but then you get RAM
       | working and that lets you get progressively more sophisticated
       | parts of the hardware working, etc.
       | 
       | By comparison, humans have no "automatic boot process". We're
       | constructed in the 'on' state via fork(). So this repetition of
       | the entire process of, ah, 'abiogenesis' whenever you turn on your
       | computer is kind of insane by comparison. Entire kingdoms of
       | hardware state rise and fall with the press of a power button.
       | 
       | As an aside, I'm fond of the Red Dwarf novels, which are set on a
       | massive mothership-type spaceship. The ship was constructed in
       | space and never designed to enter a planet's atmosphere. In
       | particular, the ship, and its engines, was constructed in the
       | 'on' state by the crew that built it originally. It was never
       | conceived that the ship would ever need to be _rebooted_,
       | because it was assumed once the engines were initially fired
       | during the commissioning of the ship, they would never be turned
       | off until decommissioning. Thus, the ship has no automatic boot
       | process for the engines, only a manual engine firing procedure
       | which is extraordinarily arduous and long-winded and takes weeks
       | to execute, said procedure having been included in the manual
       | only as a curiosity more than anything else. This idea of a ship
       | built "on" under the assumption it would never once be shut down
       | or "restarted" until decommissioning is interesting, but of
       | course also directly mirrors biological life.
       | 
       | [1] https://www.devever.net/~hl/backstage-cast
        
         | potmat wrote:
         | "Under the words 'Contact Position' there's a button that says
         | 'Push to Close'".
         | 
         | "Push it."
         | 
         | When Spielberg was on he was ON! Who else could take a scene
         | like "they have to reset the circuit breakers" and make you
         | absolutely on the edge of your seat over it.
        
         | mike_hock wrote:
         | > We're constructed in the 'on' state via fork().
         | 
         | Maybe some worms are, but "we" most certainly aren't. The
         | closest analogy might be execve("/proc/self/exe"), but even
         | that is flawed.
        
       | jacquesm wrote:
       | This is very interesting in the context of power infrastructure
       | as well. As we found out the hard way during the 2003 power
       | blackout in North America.
        
         | rdhatt wrote:
         | Practical Engineering did a video on the complexity of bringing
         | a power grid back online, called "black start" (not cold
         | start).
         | 
         | https://practical.engineering/blog/2022/12/5/what-is-a-black...
        
           | jsmith45 wrote:
           | Black start of a single plant seems reasonable enough, but
           | black start of a whole grid seems almost absurd.
           | 
           | How could one possibly balance the load with the plants
           | coming online? If the generation and load is too mismatched,
           | the generators can literally automatically trip off the grid,
           | so generation and load must be carefully balanced as things
           | get brought back up.
           | 
           | One would almost need to shed nearly all the loads from the
           | black grid (Which may have happened anyway as the grid
           | collapsed, but any loads not already shed by the collapse
           | could prove interesting), and re-add some gradually as
           | plants come online, which still seems crazy difficult.
           | 
           | And inrush current demands from many loads as they get
           | reconnected must be pretty insane.
        
             | mjevans wrote:
             | Offhand, that's pretty much the plan for a black start.
             | 
             | Something about bringing up a designated plant and feeding
             | the output over 'cranking' lines that other plants along
             | the route can synchronize their output against. Then
             | gradually adding load and source until the system is meta-
             | stable again.
             | 
             | Edit: additional data
             | 
             | Not only synchronize, but also use for internal needs like
             | all the pumps and particularly the 'excitation current'
             | that establishes the magnetic field for the generator. It
             | allows control over the output voltage. There are also
             | other drawbacks to the more obvious solution of fixed
             | magnets which can be oversimplified as 'wear'.
             | 
             | https://en.wikipedia.org/wiki/Permanent_magnet_synchronous_
             | g...
             | 
             | https://en.wikipedia.org/wiki/Excitation_(magnetic)
             | 
             | https://en.wikipedia.org/wiki/Electric_generator
        
               | jsmith45 wrote:
               | The problem is not so much the concept, as how tricky it
               | would be to add the loads back in at just the right rate
               | to not trip some or all the generation back off again.
               | Sure once you have enough generation and load already
               | online, adding the rest is relatively straightforward.
               | Still need to be careful, but after a certain point it
               | would look to utility operators much like the usual work
               | restoring loads and sources after a large area blackout.
               | 
               | The trickiness seems worst close to the very beginning
               | when even relatively small misestimation of a chunk of
               | load being restored would have a proportionally bigger
               | impact. Many loads are not completely predictable, so
               | presumably they would need to favor bringing some of the
               | more stable loads online early so that normal variation
               | from the loads that can only be predicted well in
               | aggregate won't vary enough to trip everything back
               | offline.
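
The "proportionally bigger impact" point can be made concrete with a little arithmetic. This is a minimal sketch with made-up numbers, not real grid data; `mismatch_fraction` and its parameters are hypothetical names introduced only for illustration.

```python
def mismatch_fraction(online_mw, block_mw, estimate_error):
    """Unplanned load from one restored block, as a fraction of generation online.

    online_mw:      generation already running (illustrative figure)
    block_mw:       estimated size of the load block being reconnected
    estimate_error: relative error of that estimate (0.2 means 20% off)
    """
    return block_mw * estimate_error / online_mw


# Early in the restart: 50 MW online, a 10 MW block misjudged by 20%.
early = mismatch_fraction(online_mw=50, block_mw=10, estimate_error=0.2)

# Late in the restart: same block, same error, but 5000 MW online.
late = mismatch_fraction(online_mw=5000, block_mw=10, estimate_error=0.2)

print(f"early: {early:.2%} of generation, late: {late:.2%} of generation")
```

The same 2 MW surprise is a hundred times larger relative to the system when little generation is online, which is the reason given above for restoring the most predictable loads first.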
        
       | rkagerer wrote:
       | Or you can chaos monkey style shut off the continent on a regular
       | basis.
        
       | lantry wrote:
       | In the anecdote about Bill and the DISASTER script, I'm not so
       | sure that deleting the script would be such a big deal. If this
       | script hasn't been touched since the 1980s and nobody knows what
       | it does, presumably nobody has tested it recently.
       | 
       | It seems like if there really was a disaster, first of all nobody
       | would know that script existed, and second of all if they tried
       | to run the script, it would fail because of all the changes to
       | the system since the script was initially developed.
       | 
       | Isn't there some saying like "if you don't test your backups, you
       | don't have backups" or something like that?
        
         | mastax wrote:
         | I have a hard time believing the 10k line shell script didn't
         | have a comment at the top saying what it did.
        
         | LeoPanthera wrote:
         | I bet it still would have been a useful template for a human to
         | read to get a general idea of what things to do and in what
         | order.
        
           | wkdneidbwf wrote:
           | good luck reading 10k lines of shell written decades ago. it
           | would likely be an incredible waste of time.
        
             | tivert wrote:
             | > good luck reading 10k lines of shell written decades ago.
             | it would likely be an incredible waste of time.
             | 
             | If the entire telephone system was down and needed cold
             | started, and the script had information someone needed to
             | do that, _someone would take the time to read it._ Maybe
             | not run it, but definitely read it to extract clues.
             | 
             | I mean, it's not like it's binary. It's totally possible.
        
               | Gabrys1 wrote:
               | Based on ChatGPT, assuming 10k lines translates to around
               | 30k words, it should take about 3hrs to read it. Multiply
               | that by your favorite factor for read and understand.
               | Split that to a few people, skim parts that are not
               | applicable etc. All in all this seems easily readable in
               | sensible time.
        
               | thelastparadise wrote:
               | I know, right?
               | 
               | Can you even imagine the alternative? "Hey lets throw
               | this maybe incredibly helpful shell script in the trash.
               | Because it's too long."
        
         | wkdneidbwf wrote:
         | right? that whole bit reads like some lame parable. like who in
         | their right mind is going to run a 10k line shell script named
         | DISASTER they've never read and cannot read because it's 10k
         | lines of shell? there is apparently no documentation (and
         | positively no tests)? one guy close to retirement remembers
         | what it's for and says "don't delete this critical bit of
         | code!"
         | 
         | it's just utter bullshit.
        
           | chubot wrote:
           | If tens of millions of dollars are on the line, you will be
           | able to find someone who can run the script or derive enough
           | knowledge from it
           | 
           | In a disaster scenario, something is better than nothing
        
             | pavel_lishin wrote:
             | Bill is clearly still picking up the phone. He'd likely be
             | amenable to picking up a paycheck as well.
        
             | wkdneidbwf wrote:
             | it's more that i don't believe it's a real scenario.
        
               | adityaathalye wrote:
               | Isn't that the entire point? People don't believe (or
               | choose to not believe) a certain disaster scenario is
               | valid, until it happens. We all have seen first-hand the
               | many examples of colossal failures of disaster-response
               | planning in our recent planet-wide emergency. As have we
               | seen the creative, dogged, herculean efforts to cope with
               | it.
        
           | adityaathalye wrote:
           | OP here... As I wrote here, it is better to think of the
           | story as apocryphal:
           | https://news.ycombinator.com/item?id=36798893
           | 
           | Also of course it will be crazy to read a giant shell script.
           | But then again if the stakes are high enough, and if it
           | yields even one critical piece of information, then it's
           | worth it.
           | 
           | The larger point is that organisational knowledge clings on
           | in strange ways. In a crazy disaster scenario, people may
           | appreciate having access to anything they can get their hands
           | on.
        
             | Gabrys1 wrote:
             | 10K lines is not _that_ large. If it was written sensibly,
             | it might be very useful. Bill might have written a text
             | document, but chose to use Shell as a preferred engineers'
             | language. Who doesn't share a one-liner with a colleague in
             | need? Bill shared a 10k-liner :-). And Shell being a
             | relatively high-level language, it probably packs the
             | information more densely than a text file.
        
         | perrygeo wrote:
         | "Nobody cares if you have backups. Everyone cares that you can
         | restore."
         | 
         | Classic problem of deferred costs. Backups cost money and it's
         | tempting to avoid investment in them (ie fail to test them) but
         | that can bite you when it's least convenient.
        
       | pixl97 wrote:
       | Heh, Microsoft AD + DNS + VMs is a common 'cold start' trap for
       | the inexperienced.
       | 
       | There was a story around this during the Iraq War where a US
       | military virtual machine system went down, and had to come back
       | up without internet. Problem was VMware needed DNS to start the
       | VMs, one of the VMs it needed was Active Directory for security,
       | AD hosts the DNS and now you're locked up without an external
       | running system.
       | 
       | DNS itself is typically a cold start nightmare.
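
One way to catch this kind of loop before an outage is to write the start-up dependencies down and check them for cycles. This is a minimal sketch using Python's standard `graphlib`; the service names and the `DEPENDS_ON` map are hypothetical stand-ins for the hypervisor/AD/DNS story above, not a real inventory.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical map: each service lists what must already be up before it starts.
DEPENDS_ON = {
    "hypervisor": {"dns"},                  # the hypervisor resolves names to start VMs
    "active_directory_vm": {"hypervisor"},  # AD runs as a VM
    "dns": {"active_directory_vm"},         # DNS is hosted by AD
}


def boot_order(depends_on):
    """Return a valid start order; raises CycleError if none exists."""
    return list(TopologicalSorter(depends_on).static_order())


def find_cycle(depends_on):
    """Return the list of services forming a dependency loop, or None."""
    try:
        boot_order(depends_on)
    except CycleError as err:
        return err.args[1]  # graphlib reports the cycle as the second arg
    return None


# Breaking the loop with a standalone resolver that depends on nothing
# (again an assumption, in the spirit of a spare physical box).
FIXED = {**DEPENDS_ON, "hypervisor": {"physical_dns"}, "physical_dns": set()}

print(find_cycle(DEPENDS_ON))
print(boot_order(FIXED))
```

The check is only as good as the dependency map, so it complements an actual cold-start rehearsal rather than replacing one.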
        
         | mikewarot wrote:
         | I had an old 486dx-50 as a backup domain controller for just
         | that reason.
        
       | mikewarot wrote:
       | I think that John Plant[1] and some friends could get us from the
       | stone age to ironworking. He's shown how to start with stones and
       | get that far, albeit on a very small scale.
       | 
       | Going from iron to precision screws is a matter of first making
       | precision flat surfaces, then lathes, and onward from there.[2]
       | You can do that with just iron and heat treating, but it won't be
       | easy.
       | 
       | If you want an alternate history where something slightly less
       | drastic is dealt with, the book "Ring of Fire - 1632" by the late
       | Eric Flint[3] is an interesting place to start. In the book, a
       | town from West Virginia circa 2000 is thrown back into the middle
       | of the 30 years war in Germany. Lots of exposition of the book is
       | about the supply chains we all depend on, and how they work. It's
       | the start of an awesome series.
       | 
       | Books and working knowledge, are a precious resource. As long as
       | we have a critical mass of them, and conditions remain reasonably
       | tolerable for human life, we can recover.
       | 
       | [1] https://www.youtube.com/channel/UCAL3JXZSzSm8AlZyD3nQdBA
       | 
       | [2]
       | https://ia800104.us.archive.org/20/items/FoundationsOfMechan...
       | 
       | [3] http://www.baen.com/chapters/0671578499/0671578499.htm
        
         | pavel_lishin wrote:
         | Beware the 1632 series. You'll think that you're just picking
         | up a fun "Connecticut Yankee in King Arthur's Court" adventure
         | yarn, but then a year down the line you've read a full dozen,
         | the library just knows to go ahead and order the next one for
         | you once you pick one up, and you're wondering if you'll be
         | able to finish the series before retirement.
         | 
         | https://en.wikipedia.org/wiki/List_of_books_in_the_1632_seri...
        
           | chipsa wrote:
           | At this point, you're likely to be able to finish the
           | series, because it's unlikely to get much longer. Eric
           | Flint's Wikipedia entry is now past tense. He died last year.
        
           | throwanem wrote:
           | Could be worse. I found Weber's stuff as sticky once upon a
           | time, and Eric Flint's a considerably more skillful writer.
           | But I appreciate the warning all the same; I really don't
           | need so weighty an obsession on top of all my other hobbies.
        
       | galkk wrote:
       | One of my stories from working as a vendor at $large_bank is
       | that, per some folks there, they weren't able to do disaster
       | recovery testing of their largest Oracle database for years, and
       | per procedure they should've done it like every 6 months.
        
       | anotherhue wrote:
       | IMO if you can't cold start it you probably can't develop against
       | it very quickly.
       | 
       | Then again we couldn't cold start a supply chain or a semi fab or
       | humanity itself so maybe that's the default.
        
         | hinkley wrote:
         | My comfort level with an architecture is always vastly improved
         | by being able to run a toy version of the entire system on a
         | developer box.
         | 
         | It doesn't just speak well for disaster recovery prospects
         | (both the feasibility of doing it and the density of developers
         | who could possibly pull such a thing off), it's also very, very
         | useful for speculative development.
         | 
         | When you make a high barrier to entry of making large
         | modifications to the system, you also tend to create an
         | underclass of developers, who never really get to understand
         | how the system works.
         | 
         | What if we split these two microservices into three, or
         | combined these three into two? That's a pretty common question,
         | that only gets asked if you know you won't get laughed out of
         | the room for suggesting it.
        
         | bamfly wrote:
         | You may enjoy the first episode of James Burke's _Connections_
         | ("The Trigger Effect"), if you've not seen it.
         | 
         | https://www.youtube.com/watch?v=NcOb3Dilzjc
        
           | anotherhue wrote:
           | I enjoyed the one in the Witness but hadn't gotten around to
           | the rest, thanks for the excellent recommendation!
           | https://archive.org/details/james-burke-connections_s01e10
        
           | potmat wrote:
           | Still the best non-fiction TV ever produced in my opinion
           | (the whole series I mean).
        
         | JohnFen wrote:
         | Every new semi fab that comes online was cold-started.
        
           | anotherhue wrote:
           | With the output of the prior generations was my point.
        
       | lelandbatey wrote:
       | I love this explanation for why we should make plans even though
       | folks will try to shoot down planning with "no plan survives
       | first contact":
       | 
       | > Even though nothing will go as planned, it's important to have
       | the memory and expertise that did the planning, because that's
       | what's going to be able to think through the as-yet-unknown-
       | unknowns, when the inevitable FUBAR situation suddenly happens
       | later.
       | 
       | We don't plan so that everything will go according to plan, we
       | plan so that we are better equipped to _reason_ when the plan
       | doesn't work.
        
       | johngalt wrote:
       | At a certain point, you aren't doing a cold restart, but a high
       | speed recreation of the system based on prioritized needs.
        
         | thelastparadise wrote:
         | It seems you've been there too :)
        
       | tivert wrote:
       | > Another colleague in the chat remarked up-thread (apropos cold
       | reboot thinking):
       | 
       | > I have seen this at <Indian eCommerce Giant> and at <a FAANG>.
       | Most of it is related to cached data. Cold starts with empty
       | caches causes too much load on databases. And then the failures
       | cascade.
       | 
       | > -- Another M'colleague in the Slackroom.
       | 
       | Isn't that not really a problem with cold restart per se, but
       | more the restart procedure? If caches are so critical, wouldn't
       | you need a feature to throttle the load to what the databases can
       | handle, as the caches populate? E.g., if you're cold-rebooting
       | Facebook, start by blocking all connections except those
       | geolocated to North Dakota, then add other regions as your caches
       | fill.
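
The throttling idea can be sketched as a small calculation: admit only as much traffic as the database can absorb in cache misses, and raise the share as the hit ratio climbs. This is a minimal sketch under stated assumptions; `admit_fraction` and its parameters are hypothetical, and it assumes the admitted traffic sees the same hit ratio as the measured one.

```python
def admit_fraction(offered_rps, cache_hit_ratio, db_capacity_rps):
    """Largest share of offered traffic whose cache misses the database can absorb.

    offered_rps:      requests per second that want to come in
    cache_hit_ratio:  fraction of requests currently served from cache (0..1)
    db_capacity_rps:  requests per second the database can safely take
    """
    miss_rps = offered_rps * (1.0 - cache_hit_ratio)
    if miss_rps <= db_capacity_rps:
        return 1.0  # caches are warm enough, let everyone in
    return db_capacity_rps / miss_rps


# Illustrative numbers only: 10,000 rps offered, database good for 1,000 rps.
for hit_ratio in (0.0, 0.5, 0.9):
    share = admit_fraction(10_000, hit_ratio, 1_000)
    print(f"hit ratio {hit_ratio:.0%}: admit {share:.0%} of traffic")
```

Admitting by region, as suggested above, is one way to apply that share; a random sample of users would be another.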
        
         | jbnorth wrote:
         | That's spot on. I work at a large cloud provider and one of our
         | larger eCommerce customers had an outage in a kubernetes
         | cluster which handled the front end traffic routed through a
         | large CDN provider. Well sure enough "just turn it back on"
         | wasn't an option since the surge of traffic was too rapid for
         | the services and the cluster to scale out. They ended up having
         | to turn the traffic back on incrementally to let things scale
         | up to the point where they could handle the load.
        
           | donalhunt wrote:
           | One of the earliest incidents I worked on in the late 90s
           | involved students DDOSing a university webserver in
           | anticipation of exam results being posted. The server load
           | was so high we had to pull the physical plugs on the server.
           | :/
        
         | benlivengood wrote:
         | Specifically you want load shedding in the servers and retry
         | with backoff in the clients. Clients should do their best to
         | exponentially back off on retrying failed requests, only try
         | to contact healthy servers, and maintain internal rate-
         | limiting based on error rate. Servers should do their best
         | to reply with failure quickly and cheaply to drive good
         | clients' backoffs/rate-limiting and just drop bad clients'
         | traffic, and the service discovery should try to detect and
         | spread load across healthy servers (but this isn't always
         | available at first in a cold start because the metrics or
         | metadata probably aren't available yet), but in the end it's up
         | to servers to reliably drop traffic they can't handle instead
         | of building up giant queues and slowing to a crawl. Middleware
         | is the hardest because it has to be a good client and also fail
         | fast on overload as a server by correctly interpreting upstream
         | behavior. Deadlines in RPCs that get passed across system
         | boundaries can work pretty well for tall stacks of system
         | layers where service health discovery or dependency discovery
         | is hard, but require careful configuration to avoid failure or
         | very slow starts under heavy load.
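
The two halves of that advice can be sketched in a few lines. This is a minimal illustration, not a production client or server: `backoff_delays` uses the common "full jitter" variant of exponential backoff, and `SheddingServer` is a hypothetical in-flight counter that fails fast instead of queueing. All names and default values here are assumptions.

```python
import random


def backoff_delays(base=0.1, cap=30.0, attempts=6, rng=random.random):
    """Yield retry delays in seconds: rng() scaled by min(cap, base * 2**n).

    With the default rng this is "full jitter": each delay is drawn
    uniformly between 0 and the capped exponential bound, so clients
    that failed together do not all retry together.
    """
    for n in range(attempts):
        yield rng() * min(cap, base * 2 ** n)


class SheddingServer:
    """Rejects new work immediately once too many requests are in flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_accept(self):
        """Return True if the request is admitted, False if it is shed."""
        if self.in_flight >= self.max_in_flight:
            return False  # cheap, fast failure instead of a growing queue
        self.in_flight += 1
        return True

    def finish(self):
        """Mark one admitted request as done."""
        if self.in_flight > 0:
            self.in_flight -= 1


server = SheddingServer(max_in_flight=2)
print([server.try_accept() for _ in range(3)])
print([round(d, 3) for d in backoff_delays()])
```

A client would sleep for each yielded delay between retries and stop when the generator is exhausted; the fast `False` from the server is what gives well-behaved clients the signal to back off.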
        
       | nickdothutton wrote:
       | It's been a few years but I used to run DR exercises for
       | corporates. Cold start means your only possessions are the fire
       | proof suitcase full of LTO-5s and the street address of the DR
       | data center. 1 day to bootstrap essential infra services, after
       | the end of the 2nd day you'd have most customer facing systems
       | up, day 3 would be the non-essential stuff. Personally I'd do it
       | without sleep, but most of the youngsters would need a break.
       | Pretty exhilarating, as IT work goes. Always use the feature that
       | generates multiple index tapes of what backup set is on what
       | numbered tape :-)
        
         | adityaathalye wrote:
         | OP here. That would be quite the trip! Finding oneself on the
         | front line of something like that would easily bring out the
         | best---and the worst---in a person. A rare chance to form
         | lifelong collegial bonds, and perhaps to exorcise a personal
         | demon or two (fear, anger, egotism ...).
        
           | nickdothutton wrote:
           | The only reason I was brought in to do the exercise as a
           | consultant, was because the guy supposed to be doing it was
           | too frightened, delayed for almost a year, went on long term
           | sick leave and was eventually fired/quit. The docs and ops
           | procedures actually looked good, so I took on the challenge.
           | In the end, just 1 flexvol that had been configured manually
           | years before, and was missed by the DR automation. Not a bad
           | result.
        
       | gumby wrote:
       | The full telephone system, which the author starts with, may not
       | be restartable. Sure, you could restart the SS7 databases and
       | computers, but the control plane runs over the data plane, which
       | is configured via...the control plane. Originally the network
       | controls were literally operators (humans), but bit by bit parts
       | were incrementally automated, pulling the system slowly (over
       | decades) by its bootstraps, which were gradually decommissioned
       | as they weren't needed any more.
       | 
       | I have a friend who knows a _lot_ about the phone system (he has
       | a security clearance for some of his telephone work). One time we
       | had a long conversation about this topic, until at one point he
       | said "and let's talk about something else" -- I guess from that
       | point some of the details are classified. So maybe there is a
       | plan, or maybe they just designed the system in such a way that
       | they could convince themselves that it would not go down unless
       | things were so severe that loss of the phone system would not be
       | your chief worry.
       | 
       | ---
       | 
       | In September 2001 there was a full standdown of US airspace. That
       | was accomplished pretty quickly: "you are ordered to land
       | immediately on the closest airport that can handle your aircraft,
       | or be shot down". Undoing that, however, took some careful
       | planning! Fortunately the standdown lasted several days so there
       | was time to work it out. Even if you had a plan for this (and I
       | assume FAA had one), figuring out what the realities on the
       | ground were and matching them up with the plan was nontrivial.
       | 
       | Apparently some of the planes landed where they could not take off
       | again unless they were empty with a small amount of fuel to get
       | to an airport designed for them. I don't believe I heard that any
       | planes landed where they could _never_ leave.
        
         | protastus wrote:
         | > So maybe there is a plan, or maybe they just designed the
         | system in such a way that they could convince themselves that
         | it would not go down unless things were so severe that loss of
         | the phone system would not be your chief worry.
         | 
         | My belief from working in very large companies, and
         | (previously) in mission critical systems is that a clean
         | bootstrap and recovery process is extremely unlikely, almost
         | impossible. Because in complex systems full of legacy parts and
         | people who have long retired, the stars won't align.
         | 
         | The only way to truly know is to design and periodically test
         | for disaster scenarios (emphasis on the plural). But due to the
         | scale in time and space, cost and bureaucracy, this planning
         | and rehearsing is not going to happen with the desired detail
         | and intensity. People do not seriously plan for things that
         | have never happened.
         | 
         | If it does happen, there will be a small group of extremely
         | capable people that will find a way to bootstrap the system. It
         | won't be according to some previously laid out plans -- they
         | will make the plan in real time. They're not famous and
         | probably never will be.
        
           | colechristensen wrote:
           | Eh, would not surprise me in the slightest if there was a
           | secret billion dollar program that specifically practiced
           | disaster recovery for the phone network. The government
           | doesn't have the same motivations as a business and they
           | spend a lot of money in a lot of places just to be prepared
           | for unlikely events. Like we spend billions on military
           | hardware we don't foresee ever needing _to keep the
           | engineering capacity to design and build military hardware_.
        
             | fluoridation wrote:
             | >Like we spend billions on military hardware we don't
             | foresee ever needing _to keep the engineering capacity to
             | design and build military hardware_.
             | 
             | Well, more because the military industrial complex lines
             | the pockets of your politicians, who in turn decide how to
             | spend the budget.
        
               | runlaszlorun wrote:
               | The two are not mutually exclusive. In fact they
               | reinforce each other.
        
         | kotaKat wrote:
         | > I don't believe I heard that any planes landed where they
         | could never leave.
         | 
         | This has happened before outside of 2001, albeit not really a
         | DR issue but more a political issue -- if we look to the Meigs
         | Field destruction by former Chicago mayor Richard Daley,
         | multiple aircraft were left stranded with a now destroyed
         | runway. (The solution was to just give them special clearance
         | to take off on a taxiway, but still.)
         | 
         | https://web.archive.org/web/20110720045652/http://www.aopa.o...
        
         | donalhunt wrote:
         | In Ireland, we just build a runway to allow the plane to take
         | off again.
         | 
         | https://www.rte.ie/archives/2018/0521/965058-mexican-lands-p...
        
           | joncrane wrote:
           | And in the meantime they had the pilot judge a beauty
           | contest. What a story!
        
           | LoganDark wrote:
           | this is truly one of the takeoffs of all time
        
           | myself248 wrote:
           | I've long had a fascination with aviation incidents (the
           | Gimli Glider happened on my birthday), but I hadn't heard of
           | this one before!
           | 
           | What a great story. Thank you for posting it.
        
           | wongarsu wrote:
           | Clear evidence that news was more entertaining in the 80s:
           | 
           |     In the merry month of May
           |     Just before the dawn of day
           |     A plane flew in for Shannon to refuel.
           |     Because Shannon is fogged out,
           |     Their are the rite of ought
           |     To touch down in Cork Airport as a rule.
           | 
           |     As he flew towards Mallow town
           |     His supply of fuel was down
           |     But the pilot was as cool as cool could be.
           |     In a racetrack west of town
           |     He made a safe touch down
           |     Just beside the Mallow sugar factory.
        
         | adityaathalye wrote:
         | OP here. Thanks for the remarks! That tale is apocryphal to me.
         | I found it amusing, and telling in the sense that disasters are
         | the last thing people think of, and for large enough systems
         | (especially ones that have accreted over human generations) the
         | organisational knowledge has probably been lost to retirement
         | and death. And if you're lucky, maybe a scrap of it has
         | survived. Then one's job is to do the requisite software
         | archaeology to figure out what one's people might have entirely
         | forgotten.
         | 
         | Also, if that script remained valid at the time, I doubt it
         | would _do_ any critical actions. It might have been a sort of
         | literal script to follow --- run the script, see what it says,
         | do a thing, run the script, and so forth. Its supposed job was
         | to help humans solve a _bootstrap_ problem.
         | 
         | I see how my wording of that passage makes it sound like the
         | be-all end-all of cold booting a telco. But that's what we get
         | when we wall-of-text in our Slacks :D
         | 
         | (edit: clarifying remarks)
        
         | walrus01 wrote:
         | If you dig deep enough into the SS7 stuff running in a modern
         | regional ILEC it's way more fragile than you might think.
         | Mostly because it's no longer treated as an absolutely cannot
         | fail thing that is also a primary source of revenue like back
         | in the days when everyone had a POTS line and tons of money
         | came in from long distance bills. Many operators are
         | decommissioning stuff like 5ESS and Nortel switches and moving
         | to modern soft switches as quickly as they can.
         | 
         | The network stuff underpinning a lot of critical TDM phone
         | traffic these days is like a collection of 23 year old Cisco
         | 15454 held together by spare parts and a few people who care
         | about them.
        
           | myself248 wrote:
           | I've been out of the industry just long enough to remember
           | that Cerent 454's got rebranded as Cisco 15454's right near
           | the end of my tenure...
           | 
           | Yow. Way to make a guy feel old! :P
        
             | walrus01 wrote:
             | Mostly I was using the 15454 as an example, there's lots of
             | other 20 to 30 year old stuff out there in the TDM
             | transport sector that's only available on ebay, through
             | weird used equipment dealers, or by finding a decom from
             | another ISP/telco. Stuff like T1 (or DS0!) to DS3
             | mux/demux to attach to a 15454, or similar. There's
             | literally 911 call center transport circuits being held
             | together by the telco equivalent of duct tape and string
             | right now, nobody notices until it breaks.
             | 
             | One of the weird challenges in building a new
             | state-of-the-art inter-city DWDM transport network now is
             | dealing with things like legacy customers that have one OC48
             | and are unlikely to drop it any time soon. It's a considerable
             | monthly revenue source, so you have to deal with stuffing that
             | into the system along with 100Gbps and greater coherent
             | circuits.
             | 
             | Also, from a customer relations perspective, sometimes the
             | customer literally forgets that they have this extremely
             | expensive DS3 or OC48 or something in monthly recurring
             | billing, and you don't want to bring it to the attention of
             | their management, because they might go "are we still using
             | this?" and cancel it.
        
               | LeoPanthera wrote:
               | > There's literally 911 call center transport circuits
               | being held together by the telco equivalent of duct tape
               | and string right now, nobody notices until it breaks.
               | 
               | And break it does.
               | 
               | https://www.kron4.com/news/bay-area/911-dispatch-system-in-o...
        
         | bobthepanda wrote:
         | The 2001 shutdown is even more crazy when you consider that for
         | the FAA administrator who ordered it, it was his first day on
         | the job. Hell of a first day.
        
           | Arrath wrote:
           | Hell of a day to quit sniffing glue.
        
       | drbawb wrote:
       | I'm reminded of Bryan Cantrill's talk "Debugging Under Fire"[1],
       | which includes a retrospective of sorts about an entire
       | datacenter rebooting.[2] That is a pretty large-scale disaster,
       | but even that is a rung below a continent-wide outage. Poor
       | "Bill" must have seen the proverbial light when he heard some
       | folks wanted to trash the DISASTER script.
       | 
       | [1]: https://www.youtube.com/watch?v=30jNsCVLpAE
       | 
       | [2]: https://www.tritondatacenter.com/blog/postmortem-for-outage-...
        
       | js2 wrote:
       | It's 1995 or so. I'm at U.F. getting my CS degree. Our facilities
       | guy is giving a tour of the department's server room to some
       | bigwigs.
       | 
       | For some reason, he decides to demo the UPS cut-over switch. I
       | have no idea why. But he manages to toggle it the wrong way and
       | instead of switching the entire room full of servers to the UPS,
       | he manages to cut power to _all of them_.
       | 
       | My recollection is that the cooling went out too and the room was
       | suddenly very silent. But in retrospect that doesn't make sense.
       | 
       | What I do remember is that it was non-trivial to bring all our
       | Unix servers back up because over the years they had been set up
       | with NFS mounts in a loop such that for A to boot, it needed B to
       | be up, which needed C to be up, which needed A to be up.
       | 
       | Oops.
       | 
       | So it took a lot of manual intervention to bring everything back
       | up.
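
A loop like that one can be spotted ahead of time if the mount
dependencies are written down somewhere. What follows is only a minimal
sketch, not anything the department actually ran: the host names A, B, C
and the helper names boot_order and find_mount_loop are made up for
illustration, and it leans on Python's standard graphlib module (3.9+),
whose TopologicalSorter raises CycleError when the graph has a cycle.

```python
from graphlib import TopologicalSorter, CycleError


def boot_order(needs):
    """Return a boot order in which every host follows the hosts it needs.

    `needs` maps a host name to the set of hosts whose NFS exports it
    mounts at boot. Raises CycleError if the mounts form a loop.
    """
    return list(TopologicalSorter(needs).static_order())


def find_mount_loop(needs):
    """Return one dependency loop as a list of hosts, or None if none exists."""
    try:
        boot_order(needs)
    except CycleError as err:
        # graphlib reports the cycle as the second exception argument;
        # the first and last entries of that list are the same host.
        return err.args[1]
    return None


# Hypothetical hosts mirroring the story: A needs B, B needs C, C needs A.
looped = {"A": {"B"}, "B": {"C"}, "C": {"A"}}

# The same hosts after (hypothetically) removing C's mount of A.
fixed = {"A": {"B"}, "B": {"C"}, "C": set()}

if __name__ == "__main__":
    print("loop found:", find_mount_loop(looped))
    print("boot order once the loop is broken:", boot_order(fixed))
```

Breaking any one edge of the loop is enough to get a valid order, which
is roughly what the manual intervention amounts to: bring one machine up
without the mount it wants, then let the rest follow.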
        
         | jen729w wrote:
         | Vaguely related, in 2015 we were building a new platform based
         | on Cisco UCS and NetApp filers.
         | 
         | Cisco had a virtual router (N1000, perhaps). The hardware was a
         | C220 with some sort of appliance running on it, which depended
         | on some sort of LUN from NetApp for its storage. But you
         | couldn't stand the LUN up until the vCenter was up, because UCS
         | provisioned the LUN (it didn't work if you did it manually) and
         | that ran on vCenter, and the vCenter depended on the router so
         | that it could reach the LUN. It was pure circular logic hell.
         | 
         | There _was_ a path to make this work, but it literally took
         | half a dozen very ~~clever~~ expensive Cisco and NetApp
         | engineers a couple of days in a room with a whiteboard to
         | figure it out. It was absurd.
        
         | dunham wrote:
         | We had a blackout (seemed to happen every fall when the
         | students arrived - the university had a power station). The NFS
         | server needed NIS to boot, and the NIS server required NFS. We
         | managed to manually get the NIS service running in single user
         | mode, brought up everything else, and then rebooted NIS.
        
         | spc476 wrote:
         | I've got two such stories. Around 2000, I was working at a huge
         | web hosting company (third shift, monitoring the network).
         | Suddenly the power went out. Turns out, the building management
         | (separate company) decided to run a UPS test and the memo got
         | lost. Fun time that night.
         | 
         | Second story, around 2005---I, along with a friend and my
         | father, were in Las Vegas eating lunch at one of the major
         | casinos when the power went out completely. It was _eerily_
         | silent and _dark!_ (and then slowly, we started hearing the
         | groans of slotzombies rising among us) I'm sure someone lost
         | their job for a UPS cut-over failure.
        
         | rout39574 wrote:
         | I was in that room! Poor fellow. It was the demo of our new
         | UPS; I can still remember the gradually fading fans and clicks of
         | capacitors.
         | 
         | He was really nervous showing the new install off to us, and he
         | was just talking through the positions of the switch; but he
         | actually moved the switch to each of them as he did it.
         | 
         | Back in the days the conslops ran the asylum. :)
         | 
         | Any old school UF types looking at this, The Dog list lives on,
         | and is again meeting from time to time. :) But at quieter
         | venues. Our ears have unaccountably gotten old.
        
           | zamadatix wrote:
           | The heck is a conslops besides the food at the prison?
        
             | djbusby wrote:
             | Console Operators
        
       | more_corn wrote:
       | In some ways a disaster recovery plan can follow a better path
       | than the original bootstrap. Imagine industrial society. On page
       | two of the disaster recovery plan you have a description of germ
       | theory. Something that has saved hundreds of millions of lives
       | and would have saved hundreds of millions more had it been
       | discovered / formalized thousands of years previously. People
       | knew about sanitation in prehistory but the theory wasn't
       | formalized so doctors didn't always wash their hands.
       | 
       | Likewise knowledge around nitrogen fixation and fertilizers.
       | 
       | There are probably a half a dozen huge improvements that could be
       | made for bootstrap_society_v2.sh
       | 
       | Perhaps we should all write down our version and tuck it away
       | somewhere safe just in case. Maybe on something more durable than
       | paper. And certainly more durable than electronic storage.
        
       | ramidarigaz wrote:
       | Interestingly this paragraph isn't quite true:
       | 
       | > So much of the modern world depends on our mastery over
       | materials (to make a precision screw, you need a precision-
       | machined harder material--diamond / titanium--to work on a softer
       | material--steel), and our ability to turn rotary motion to linear
       | motion (it's stupidly difficult to reliably precision-machine a
       | harder material without even more precise linear + rotary motion
       | --lathe/CNC machine). Hence, a bootstrap problem.
       | 
       | Steel is hardenable (or rather, some steels are hardenable), you
       | can change its hardness through the specific application of
       | heating and cooling. So you can make a crude tool with relatively
       | soft steel, harden it, and use it to make a more precise steel
       | tool (again machine soft, then harden). This does make the
       | bootstrapping problem a bit easier, I think. Although not easy in
       | the absolute.
       | 
       | See https://www.youtube.com/watch?v=V_Mp1fNzIT8 for a great dive
       | into primitive steel hardening techniques.
        
         | smcameron wrote:
         | David Gingery's books might be of interest to anyone thinking
         | of bootstrapping a metal working shop starting from charcoal
         | and scrap aluminum.
         | 
         | https://www.gingerybookstore.com/
        
         | adityaathalye wrote:
         | OP here. Thanks for the critique! Yes I agree fully. The
         | specific example of diamond/titanium aside, the general point
         | stays, I feel. A youtube rabbit hole is nigh, clearly :)
        
         | hinkley wrote:
         | There's a way to grind mirrors for optics with polishing
         | stones that aren't even flat to the naked eye. Basically the
         | system arrives at tiny tolerances via the process of using the
         | system.
         | 
         | And there's a way to make three perfectly flat sharpening stones
         | by starting with three raw pieces of natural sharpening stone,
         | just by alternately rubbing the three stones together until
         | they flatten each other out.
         | 
         | Paul Sellers can teach you how to flatten a large board without
         | a planer. He also has videos on how to get a wood plane
         | perfectly flat using a large sharpening stone (which can be
         | made as above or with float glass).
         | 
         | And if memory serves, to make something perfectly round you
         | first need something perfectly flat. Once you have something
         | perfectly flat and something perfectly round it's off to the
         | races.
         | 
         | Edit: "The Origins of Precision" is a half hour well spent
         | https://www.youtube.com/watch?v=gNRnrn5DE58
        
       | alexwasserman wrote:
       | Great read: One Good Turn: A Natural History...
       | https://www.amazon.com/dp/0684867303?ref=ppx_pop_mob_ap_shar...
       | 
       | A history of the screw. Really interesting around how it was
       | developed. Some machining techniques are far older than you'd
       | expect, and some capabilities far newer.
       | 
       | The book's thesis is that the screw is the most important
       | invention.
        
       ___________________________________________________________________
       (page generated 2023-07-20 23:02 UTC)