[HN Gopher] We might want to regularly keep track of how importa...
       ___________________________________________________________________
        
       We might want to regularly keep track of how important each server
       is
        
       Author : pabs3
       Score  : 194 points
       Date   : 2024-02-06 10:51 UTC (12 hours ago)
        
 (HTM) web link (utcc.utoronto.ca)
 (TXT) w3m dump (utcc.utoronto.ca)
        
       | tlb wrote:
       | Or you might want to have redundant cooling.
       | 
       | Cooling system prices seem to scale fairly linearly with the
       | cooling power above a few kW, so instead of one 100 kW system you
       | could buy four 25 kW systems so a single failure won't be a
       | disaster.
        
         | nine_k wrote:
          | Won't 4 x 25 kW systems also mean 4x the installation cost?
        
           | mdekkers wrote:
            | What's the % of install cost over the projected lifetime of
            | the system?
        
           | dist-epoch wrote:
           | Probably less if you install them all at the same time.
        
           | bluGill wrote:
            | Maybe, but costs are not linear, and the nonlinearity goes
            | different ways for different parts of the install. The
            | installed cost of the smaller systems could be cheaper than
            | the purchase price of the large system alone, if the smaller
            | systems are standard parts.
        
         | throw0101a wrote:
         | > _Or you might want to have redundant cooling._
         | 
          | Can you provide a cost centre or credit card they can bill
          | this to? In case you didn't notice the domain, it is
         | UToronto: academic departments aren't generally flush with
         | cash.
         | 
         | Further, you have to have physical space to fit the extra
         | cooling equipment and pipes: not always easy or possible to do
         | in old university buildings.
        
           | wongarsu wrote:
           | If you designed the system like this from the start or when
            | replacing it anyway, N+1 redundancy might not be much more
           | expensive than one big cooling unit. The systems can mostly
           | share their ductwork and just have redundancy in the active
           | components, so mostly the chillers.
           | 
           | Of course these systems only get replaced every couple
           | decades, if ever, so they are pretty much stuck with the
           | setup they have.
        
             | throw0101b wrote:
             | > _If you designed_ [...]
             | 
             | University department IT is not designed, it grows over
             | decades.
             | 
             | At some point some benefactor may pay for a new building
             | and the department will move, so that could be a chance to
             | actually design. But modern architecture and architects
             | don't really go well with hosting lots of servers in what
             | is ostensibly office space.
             | 
             | I've been involved in the build-out of buildings/office
             | space on three occasions in my career, and trying to get a
              | decent IT space pencilled in has always been like pulling
             | teeth.
        
             | bluGill wrote:
             | > Of course these systems only get replaced every couple
             | decades, if ever
             | 
             | This is despite the massive energy savings they could get
             | if they replaced those older systems. Universities often
             | are full of old buildings with terrible insulation
             | heated/cooled by very old/inefficient systems. In 20 years
             | they would be money ahead by tearing down most buildings on
              | campus and rebuilding to modern standards (assuming energy
              | costs don't go up, which seems unlikely). But they consider
             | all those old buildings historic and so won't.
        
               | light_hue_1 wrote:
               | > In 20 years they would be money ahead by tearing down
               | most buildings on campus and rebuilding to modern
                | standards (assuming energy costs don't go up, which seems
                | unlikely). But they consider all those old buildings
               | historic and so won't.
               | 
               | It has nothing to do with considering those building
               | historic.
               | 
                | The problem is that unless someone wants to donate $50-100M,
               | new buildings don't happen. And big donors want to donate
               | to massive causes "Build a new building to cure cancer!"
               | not "This building is kind of crappy, let's replace it
               | with a better one".
               | 
               | It doesn't matter that over 50 years something could be
               | cheaper if there's no money to fix it now.
        
           | gregmac wrote:
           | This kind of thing is like insurance. Maybe IT failed to
           | state the consequences of not having redundancy, maybe people
           | in control of the money failed to understand.. or maybe the
           | risks were understood and accepted.
           | 
           | Either way, by not paying for the insurance (redundant
           | systems) up front the organization is explicitly taking on
           | the risk.
           | 
           | Whether the cost now is higher is impossible to say as an
           | outsider, but there's a lot of expenses: paying a premium for
           | emergency repairs/replacement; paying salaries to a bunch of
           | staff who are unable to work at full capacity (or maybe at
           | all); a bunch of IT projects delayed because staff is dealing
           | with an outage; and maybe downstream ripple effects, like
           | classes cancelled or research projects in jeopardy.
           | 
           | I've never worked in academics, but I know people that do and
           | understand the budget nonsense they go through. It doesn't
           | change the reality, though, which is systems fail and if you
           | don't plan for that you'll pay dearly.
        
         | marcus0x62 wrote:
         | With that model, you'd probably want 5 instead of 4 (N+1), but
          | the other thing to consider is whether you can duct the cold
          | air to where it needs to go when one or more of the units has
          | failed.
        
         | dotancohen wrote:
         | This will mean just about 4 times the number of failures, too.
         | And can 75% cooling still cool the server room anyway?
        
           | donkeyd wrote:
           | Maybe not, but some cooling means less servers to shut down.
        
           | bluGill wrote:
            | It means 5 times the number of failures, as you intentionally
            | put in an extra unit so that one can be taken offline at any
            | time for maintenance (which itself will keep the whole system
            | more reliable), and if one fails the whole still keeps up. The
            | cost is only slightly more to do this when there are 5 smaller
            | units. Those smaller units could be standard off-the-shelf
            | units as well, so they could be cheaper than a large unit that
            | isn't made in as large a quantity (this is a consideration
            | that needs to be made case by case).
            | 
            | Even if you cheap out and only install 4 units, odds are your
            | failure doesn't happen on the hottest day of the year and so
            | 3 can keep up just fine. It is only when you are unlucky
            | that you need to shut anything down.
        
           | evilduck wrote:
           | Four service degradations vs. one huge outage event. Pick
           | your poison.
        
       | cowsandmilk wrote:
       | It is interesting to contrast with where the wider industry has
       | gone.
       | 
       | Industry: don't treat your systems like pets.
       | 
       | Author: proudly declares himself Unix herder, wants to keep track
       | of which systems are important.
        
         | throw0101a wrote:
         | > _Author: proudly declares himself Unix herder, wants to keep
         | track of which systems are important._
         | 
          | Because not all environments are webapps with dozens or
          | hundreds of systems configured in a cookie-cutter manner.
         | Plenty of IT environments have pets because plenty of IT
         | environments are not Web Scale(tm). And plenty of IT
         | environments have staff churn and legacy systems where
         | knowledge can become murky (see reference about "archaeology").
         | 
         | An IMAP server is different than a web server is different than
         | an NFS server, and there may also be inter-dependencies between
         | them.
        
           | cookiemonster9 wrote:
           | So much this. Not everything fits cattle/chicken/etc models.
           | Even in cases where those models could fit, they are not
           | necessarily the right choice, given staffing, expertise,
           | budgets, and other factors.
        
           | LeonB wrote:
           | I work with one system where the main entity is a "sale"
           | which is processed beginning to end in some fraction of a
           | second.
           | 
           | A different system I work with, the main "entity" is,
           | conceptually, more like a murder investigation. Less than 200
           | of the main entity are created in a year. Many different
           | pieces of information are painstakingly gathered and tracked
           | over a long period of time, with input from many people and
           | oversight from legal experts and strong auditing
           | requirements.
           | 
           | Trying to apply the best lessons and principles from one
           | system to the other is rarely a good idea.
           | 
           | These kind of characteristics of different systems make a lot
           | of difference to their care and feeding.
        
         | karmarepellent wrote:
          | I would argue even the "wider industry" still administers
         | systems that must be treated as pets because they were not
         | designed to be treated as cattle.
        
         | luma wrote:
         | When you are responsible for the full infrastructure,
         | sequencing power down and power on in coordination with your
         | UPS is a common solution. Network gear needs a few minutes to
         | light up ports, core services like DNS and identity services
         | might need to light up next, then storage, then hypervisors and
         | container hosts, then you can actually start working on app
         | dependencies.
         | 
          | This sort of sequencing lends itself naturally to having a plan
         | for limited capacity "keep the lights on" workload shedding
         | when facing a situation like the OP.
         | 
         | Not everyone has elected to pay Bezos double the price for
         | things they can handle themselves, and this is part of handling
         | it.
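          | 
          | As a rough sketch of what that staged bring-up can look like as
          | a script (the hostnames and the power_on/is_healthy hooks are
          | placeholders for whatever your environment provides - IPMI, a
          | PDU API, ping):
          | 
          |   import time
          | 
          |   # Earlier stages must be healthy before the next one starts.
          |   STAGES = [
          |       ["core-sw1", "core-sw2"],  # network gear first
          |       ["dns1", "idm1"],          # DNS and identity services
          |       ["san1", "nfs1"],          # storage
          |       ["hv1", "hv2"],            # hypervisors / container hosts
          |       ["app1", "app2"],          # application dependencies last
          |   ]
          | 
          |   def power_on(host: str) -> None:
          |       print(f"powering on {host}")   # stub: IPMI/PDU call here
          | 
          |   def is_healthy(host: str) -> bool:
          |       return True                    # stub: ping or health check
          | 
          |   for stage in STAGES:
          |       for host in stage:
          |           power_on(host)
          |       while not all(is_healthy(h) for h in stage):
          |           time.sleep(10)             # wait before the next stage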
        
           | bbarnett wrote:
           | Double? Try 100x!!
        
             | marcosdumay wrote:
              | Exactly. If it was double, it would be a no-brainer.
        
             | philsnow wrote:
             | If you're running a couple ec2 instances in one AZ then
             | yeah it's closer to 100x, but if you wanted to replicate
             | the durability of S3, it would cost you a lot in terms of
             | redundancy (usually "invisible" to the customer) and
             | ongoing R&D and support headcount.
             | 
             | Yes, even when you add it all up, Amazon still charges a
             | premium even over that all-in cost. That's sweat equity.
        
         | p_l wrote:
         | Even if you're running "cattle", you still need to keep track
          | of which systems are important, because to the surprise of many,
         | the full infrastructure is more like the ranch, and cattle is
         | just part of it.
         | 
         | (and here I remind myself again to write the screed against
         | "cattle" metaphor...)
        
         | bayindirh wrote:
          | HPC admin here (and possibly managing a system topology similar
          | to theirs, machine room and all).
         | 
         | In heterogeneous system rooms, you can't stuff everything into
         | a virtualization cluster with a shared storage and migrate
         | things on the fly, thinking that every server (in hardware) is
         | a cattle and you can just herd your VMs from host to host.
         | 
          | A SLURM cluster is easy. Shut down all the nodes, and the
          | controller will say "welp, no servers to run the workloads,
          | will wait until servers come back". But storage systems are
          | not that easy (ordering, controller dependencies, volume
          | dependencies, service dependencies, etc.).
         | 
         | Also there are servers which can't be virtualized because
         | they're hardware dependent, latency dependent, or just filling
         | the server they are in, resource wise.
         | 
          | We also have some pet servers, and some cattle. We "pfft" at
          | some servers and scramble for others, for various reasons. We
          | know which server runs which service by the hostname, and never
          | install pet servers without the team's knowledge. So if
          | something important goes down, everyone can at least attend to
          | the OS or the hardware it's running on.
         | 
         | Even in a cloud environment, you can't move a VSwitch VM as you
         | want, because you can't have the root of a fat SDN tree on
         | every node. Even the most flexible infrastructure has firm
         | parts to support that flexibility. It's impossible otherwise.
         | 
         | Lastly, not knowing which servers are important is a big no-no.
         | We had "glycol everywhere" incidents and serious heatwaves, and
          | all we say is "we can't cool the room down, scale down". Everybody
         | shuts the servers they know they can, even if somebody from the
         | team is on vacation.
         | 
         | Being a sysadmin is a team game.
        
         | readscore wrote:
         | At a FAANG, our services are cattle, but we still plan which
         | services to keep running when we need to drain 50% of a DC.
         | 
         | Latency is important. Money makers > Latency sensitive >
         | Optional requests > Background requests > Batch traffic.
         | 
          | Bootstrapping is important. If A depends on B, you might need
          | to drain A first, or A and B together.
        
           | _heimdall wrote:
            | The cattle metaphor really is a bad one. Anyone raising
            | cattle should do the same thing, knowing which animals are
            | the priority in case of drought, disease, etc.
           | 
            | Hopefully one never has to face that scenario, but it's much
           | easier to pick up the pieces when you know where the
           | priorities are whether you're having to power down servers or
           | thin a herd.
        
             | bluGill wrote:
              | Cattle are often interchangeable. You cull any that catch a
              | disease (in some cases the USDA will cull the entire herd
              | if just one catches something - biosecurity is a big deal).
              | In the case of drought you pick a bunch to get rid of,
              | based on market prices (if everyone else is culling, you
              | will try to keep yours because the market is collapsing -
              | but this means managing feed and thus may mean culling more
              | of the herd later).
             | 
             | Some cattle we can measure. Milk cows are carefully managed
              | as to output - the farmer knows how much the milk from each
              | one is worth and so they can cull the low producers. However
             | milk is so much more valuable than meat that they never
             | cull based on drought - milk can always outbid meat for
              | feed. If milk demand goes down the farmer might cull some -
             | but often the farmer is under contract for X amount of milk
             | and so they cannot manage prices.
        
               | p_l wrote:
               | Honestly, big issue with the cattle metaphor is that the
               | individual services you run on servers are very much
               | often not interchangeable.
               | 
               | A DNS service is not NTP is not mail gateway is not
               | application load balancer is not database _etc etc etc_
               | 
               | At best, multiple replicas of those are cattle.
               | 
               | And while you can treat the servers underlying them as
               | interchangeable, that doesn't change the fact the
               | services you run on them _are not_.
        
               | naniwaduni wrote:
                | Cattle often aren't interchangeable either. Not gonna have a
               | great time milking the bulls.
        
               | bluGill wrote:
               | But if you are milking you don't have bulls. Maybe you
               | have one (though almost everyone uses artificial
               | insemination these days). Worrying about milking bulls is
               | like worrying about the NetWare server - once common but
               | has been obsolete since before many reading this were
               | even born.
               | 
               | Of course the pigs, cows, and chickens are not
               | interchangeable. Nor are corn, hay, soybeans.
        
             | cortesoft wrote:
             | Yeah, really the only difference is whether you are
             | tracking individual servers or TYPES of servers
        
         | pch00 wrote:
         | > Industry: don't treat your systems like pets.
         | 
         | The industry has this narrative because it suits their desire
         | to sell higher-margined cloud services. However in the real
         | world, especially in academia as cks is, the reality is that
         | many workloads are still not suitable for the cloud.
        
         | tonyarkles wrote:
         | I'm absolutely loving the term Unix herder and will probably
         | adopt it :)
         | 
         | I'm generally with you and the wider industry on the cattle-
         | not-pets thing but there are a few things to keep in mind in
         | the context of a university IT department that are different
         | than what we regularly talk about here:
         | 
         | - budgets often work differently. You have a capex budget and
         | your institution will exist long enough to fully depreciate the
         | hardware they've bought you. They won't be as happy to
         | dramatically increase your opex.
         | 
         | - storage is the ultimate pet. In a university IT department
         | you're going to have people who need access to tons and tons of
         | speedy storage both short-term and long-term.
         | 
          | I'm smiling a little bit thinking about a job 10 years ago that
         | adopted the cattle-not-pets mentality. The IT department
         | decided they were done with their pets, moved everything to a
         | big vSphere cluster, and backed it by a giant RAID-5 array.
         | There was a disk failure, but that's ok, RAID-5 can handle
         | that. And then the next day there was a second disk failure.
         | Boom. Every single VM in the engineering department is gone
         | including all of the data. It was all backed up to tape and
         | slowly got restored but the blast radius was enormous.
        
           | jodrellblank wrote:
           | At the risk of no true Scotsman, that doesn't sound like
           | "cattle not pets"; when the cattle are sent to the
           | slaughterhouse there isn't any blast radius, there's just
           | more cattle taking over. You explicitly don't have to replace
           | them with exact clones of the original cattle from tape very
           | slowly, you spin up a herd of more cattle in moments.
        
             | logifail wrote:
             | > when the cattle are sent to the slaughterhouse
             | 
             | Data isn't "sent to the slaughterhouse". Ever.
             | 
             | Data can be annoying that way.
        
               | datadrivenangel wrote:
               | The true problem with kubernetes and modern cloud. Not
               | insurmountable, but painful when your data is large
               | compared to your processing needs.
        
               | j33zusjuice wrote:
               | I think the point is that systems that aren't replaced
               | easily shouldn't be managed like they are, not ...
               | whatever it is you're getting at here.
        
               | carbotaniuman wrote:
               | Sadly you can't just spool down data either - data isn't
               | fungible!
        
             | chris_wot wrote:
             | Analogies always break down under scrutiny. Any cattle
             | farmer would find "spinning up a herd of cattle" to be
             | hilarious.
        
             | tonyarkles wrote:
             | > you spin up a herd of more cattle in moments
             | 
             | Where does the decade of data they've been collecting come
             | from when you "spin up a new herd"?
        
               | ElevenLathe wrote:
               | It's stored in another system that _is_ probably treated
               | as a pet, hopefully by somebody way better at it with a
                | huge staff (like AWS). Even if it's a local NetApp
               | cluster or something, you can leave the state to the
               | storage admins rather than random engineers that may or
               | may not even be with the company any more.
        
               | jodrellblank wrote:
               | If you are using clustered storage (Ceph, for example)
               | instead of a single RAID5 array, ideally the loss of one
               | node or one rack or one site doesn't lose your data it
               | only loses some of the replicas. When you spin up new
               | storage nodes, the data replicates from the other nodes
               | in the cluster. If you need 'the university storage
               | server' that's a pet. Google aren't keeping pet
               | webservers and pet mailbox servers for GMail - whichever
               | loadbalanced webserver you get connected to will work
               | with any storage cluster node it talks to. Microsoft
               | aren't keeping pet webservers and mailbox servers for
               | Office365, either. If they lose one storage array, or
               | rack, or one DC, your data isn't gone.
               | 
               | 'Cattle' is the idea that if you need more storage, you
               | spin up more identikit storage servers and they merge in
               | seamlessly and provide more replicated redundant storage
               | space. If some break, you replace them with identikit
               | ones which seamlessly take over. If you need data, any of
               | them will provide it.
               | 
               | 'Pets' is the idea that you need _the email storage
               | server_ , which is that HP box in the corner with the big
               | RAID5 array. If you need more storage, it needs to be
               | expansion shelves compatible with that RAID controller
               | and its specific firmware versions which needs space and
               | power in the same rack, and that's different from your
               | newer Engineering storage server, and different to your
               | Backup storage server. If the HP fails, the service is
               | down until you get parts for that specific HP server, or
               | restore that specific server's data to one new pet.
               | 
               | And yes, it's a model not a reality. It's easier to think
               | about scaling your services if you have "two large
               | storage clusters" than if you have a dozen different
               | specialist storage servers each with individual quirks
               | and individual support contracts which can only be worked
               | on by individual engineers who know what's weird and
                | unique about them. And if you can reorganise from pets to
                | cattle, it can free up time and attention, make things
                | more scalable and flexible, and improve the trade-offs of
                | maintenance time and effort.
        
           | jsjohnst wrote:
           | > moved everything to a big vSphere cluster, and backed it by
           | a giant RAID-5 array
           | 
           | I'm with sibling commenter, if said IT department genuinely
           | thought that the core point in "cattle-not-pets" was met by
           | their single SuperMegaCow, then they missed the point
           | entirely.
        
           | throw0101b wrote:
           | > _You have a capex budget_ [...]
           | 
           | As someone who has worked IT in academia: no, you do not. :)
        
             | tonyarkles wrote:
             | I'm laughing (and crying) with you, not at you.
             | 
             | From my past life in academia, you're totally right. But
             | that kind of reinforces the point... you do _occasionally_
             | get some budget for servers and then have to make them last
             | as long as possible :). Those one-time expenses are
             | generally more palatable than recurring cloud storage costs
             | though.
        
           | verticalscaler wrote:
            | The devops meme referenced, "cattle not pets", was probably
           | popularized by a book called "The Phoenix Project".
           | 
           | The real point is that pets are job security for the Unix
           | herder. If you end up with one neckbeard running the joint
           | that's even worse as a single point of failure.
        
             | datadrivenangel wrote:
             | If you end up with one cloud guru running your cloud,
             | that's maybe worse as a single point of failure.
             | 
             | AWS system admins may be more fungible these days than unix
             | sysadmins. Maybe.
        
           | fishtacos wrote:
           | >>The IT department decided they were done with their pets,
           | moved everything to a big vSphere cluster, and backed it by a
           | giant RAID-5 array. There was a disk failure, but that's ok,
           | RAID-5 can handle that.
           | 
           | Precisely why, when I was charged with setting up a 100 TB
           | array for a law firm client at previous job, I went for
           | RAID-6, even though it came with a tremendous write speed
           | hit. It was mostly archived data that needed retention for a
           | long period of time, so it wasn't bad for daily usage, and
           | read speeds were great. Had the budget been greater, RAID 10
           | would've been my choice. (requisite reminder: RAID is not
           | backup)
           | 
           | Not related, but they were hit with a million dollar
           | ransomware attack (as in: the hacker group requested a
           | million dollar payment), so that write speed limitation was
           | not the bottleneck considering internet speed when restoring.
           | Ahhh.... what a shitshow, the FBI got involved, and never
           | worked for them again. I did warn them though: zero updates
           | (disabled) and also disabled firewall on the host data server
           | (windows) was a recipe for disaster. Within 3 days they got
           | hit, and the boss had the temerity to imply I had something
           | to do with it. Glad I'm not there anymore, but what a screwy
           | opsec situation I thankfully no longer have to support.
        
             | justsomehnguy wrote:
             | > even though it came with a tremendous write speed hit
             | 
              | Only on writes smaller than a stripe. If your writes are
              | bigger, then you can have way more speed than RAID10 on the
              | same set, limited only by the RAID controller CPU.
        
               | fishtacos wrote:
               | Due to network limitations and contract budgeting, I
               | never got the chance to upgrade them to 10 Gb, but can
               | confirm I could hit 1000 Mbps (100+ MB/s) on certain
               | files on RAID-6. It sadly averaged out to about 55-60
               | MB/s writes (HDD array, Buffalo), which again, for this
               | use case was acceptable, but below expectations. I didn't
               | buy the unit, I didn't design the architecture it was
               | going into, merely a cog in the support machinery.
        
             | rjbwork wrote:
             | > the boss had the temerity to imply I had something to do
             | with it.
             | 
             | What was your response? I feel like mine would be "you are
             | now accusing me of a severe crime, all further
             | correspondence will be through my lawyer, good luck".
        
           | justsomehnguy wrote:
           | We bought a 108-drive DAS some years ago, for a backup
           | storage.
           | 
            | I needed to actively object to the idea of a 108-wide RAID5 on
           | that.
           | 
           | File systems get borked, RAID arrays can have multiple
           | failures, admins can issue rm -rf in the wrong place.
        
           | jrumbut wrote:
           | > budgets often work differently.
           | 
           | Very differently. Instead of a system you continually iterate
           | on for its entire lifetime, if you're in a more regulated
           | research area you might build it once, get it approved, and
           | then it's only critical updates for the next five (or more!)
           | years while data is collected.
           | 
           | Not many of the IT principles developed for web app startups
           | apply in the research domain. They're less like cattle or
           | pets and more like satellites which have very limited ability
           | to be changed after launch.
        
         | viraptor wrote:
         | I think you're missing some aspects of cattle. You're still
         | supposed to keep track of what happens and where. You still
         | want to understand why and how each of the servers in the
         | autoscaling group (or similar) behaves. The cattle part just
          | means they're unified and quickly replaceable. But they still
         | need to be well tagged, accounted for in planning, removed when
         | they don't fulfil the purpose anymore, identified for billing,
         | etc.
         | 
         | And also importantly: you want to make sure you have a good
         | enough description for them that you can say
         | "terraform/cloudformation/ansible: make sure those are running"
         | - without having to find them on the list and do it manually.
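          | 
          | Those declarative tools aside, the same check can be sketched
          | imperatively - here with boto3, assuming AWS-style tags (the
          | "criticality" tag key and its value are invented for the
          | example):
          | 
          |   import boto3
          | 
          |   ec2 = boto3.client("ec2")
          | 
          |   # Find tagged instances that should be up but currently aren't.
          |   resp = ec2.describe_instances(
          |       Filters=[
          |           {"Name": "tag:criticality", "Values": ["critical"]},
          |           {"Name": "instance-state-name", "Values": ["stopped"]},
          |       ]
          |   )
          |   stopped = [
          |       i["InstanceId"]
          |       for r in resp["Reservations"]
          |       for i in r["Instances"]
          |   ]
          |   if stopped:
          |       ec2.start_instances(InstanceIds=stopped)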
        
         | falcor84 wrote:
         | Where's the contrast? Herding is something you do with cattle
         | rather than pets.
        
         | _heimdall wrote:
         | I'm pretty sure anyone in the industry that draws this
         | distinction between cattle and pets has never worked with
         | cattle and only knows of general ideas about the industrial
         | cattle business.
        
           | bluGill wrote:
           | Ranchers do eat their pets. They generally do love the
           | cattle, but they also know at the end of a few years they get
           | replaced - it is the cycle of life.
        
           | antod wrote:
           | Likewise, anyone talking about civil engineering and bridges.
           | 
           | At least back in the day Slashdot was fully aware of how
           | broken their ubiquitous car analogies were and played it up.
        
         | krisoft wrote:
         | I don't see the contrast here.
         | 
         | > proudly declares himself Unix herder
         | 
          | You know what herders herd? Cattle. Not pets.
         | 
         | > wants to keep track of which systems are important.
         | 
         | I mean obviously? Industry does the same. Probably with
         | automation, tools and tagging during provisioning.
         | 
         | The pet mentality is when you create a beautiful handcrafted
         | spreadsheet showing which services run on the server named
         | "Mjolnir" and which services run on the server named "Valinor".
         | The cattle mentality is when you have the same in a distributed
          | key-value database with UUIDs instead of fanciful server names.
         | 
         | Or the pet mentality is when you prefer to not shut down
         | "Mjolnir" because it has more than 2 years of uptime or some
         | other silly reason like that. (as opposed to not shutting it
          | down because you know that you would lose more money that way
         | than by risking it overheating and having to buy a new one.)
        
         | johann8384 wrote:
         | Pets make sense sometimes. I also think there are still plenty
         | of companies, large ones, with important services and data,
         | that just don't operate in a way that allows the data center
         | teams to do this either. I have some experience with both
         | health insurance and life insurance companies for example where
         | "this critical thing #8 that we would go out of business
          | without" still lives solely on "this server right here". In
         | university settings you have systems that are owned by a wide
         | array of teams. These organizations aren't ready or even
         | looking to implement a platform model where the underlying
         | hardware can be generic.
        
         | mrighele wrote:
          | The issue here is not so much the hardware, but the services that
         | run on top of them.
         | 
         | I guess that many companies that use "current practices" have
         | plenty of services that they don't even know about running on
         | their clusters.
         | 
         | The main difference is that instead of the kind of issues that
         | the link talks about, you have those services running year
         | after year, using resources, for the joy of the cloud
         | companies.
         | 
         | This happens even at Google [1]:
         | 
         | "There are several remarkable aspects to this story. One is
         | that running a Bigtable was so inconsequential to Google's
         | scale that it took 2 years before anyone even noticed it, and
         | even then, only because the version was old. "
         | 
         | [1] https://steve-yegge.medium.com/dear-google-cloud-your-
         | deprec...
        
         | NovemberWhiskey wrote:
         | Not sure if the goal was just to make an amusing comparison,
         | but these are actually two completely different concerns.
         | 
         | Building your systems so that they don't depend on permanent
         | infrastructure and snowflake configurations is an orthogonal
         | concern from understanding how to shed load in a business-
         | continuity crisis.
        
       | karmarepellent wrote:
       | It's generally a good idea to have some documentation that states
       | what a machine is used for, the "service", and how important said
       | service is relative to others.
       | 
       | At my company we kind of enforce this by not operating machines
       | that are not explicitly assigned to any service.
       | 
       | However you have to anticipate that the quality of documentation
       | still varies immensely which might result in you shutting down a
       | service that is actually more important than stated.
       | 
       | Fortunately documentation improves after every outage because
        | service owners iterate on their part of the documentation when
       | their service was shut down as "it appeared unimportant".
       | 
       | It's a process.
        
       | spydum wrote:
       | Asset management is definitely a thing. Tag your environments,
       | tag your apps, and provide your apps criticality ratings based on
       | how important they are to running the business. Then it's a
       | matter of a query to know which servers can be shut, and which
       | absolutely must remain.
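        | 
        | A toy sketch of that query, against a hypothetical inventory
        | (the records, field names and ratings below are invented):
        | 
        |   # Each asset carries an app tag and a criticality rating,
        |   # where 1 = most critical; shedding is then just a filter.
        |   INVENTORY = [
        |       {"host": "db01",  "app": "billing",  "crit": 1},
        |       {"host": "web03", "app": "intranet", "crit": 3},
        |       {"host": "ci07",  "app": "ci",       "crit": 4},
        |   ]
        | 
        |   def can_shut_down(keep_up_to: int) -> list[str]:
        |       """Hosts whose rating is less critical than the cutoff."""
        |       return [a["host"] for a in INVENTORY
        |               if a["crit"] > keep_up_to]
        | 
        |   print(can_shut_down(2))   # -> ['web03', 'ci07']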
        
         | ethbr1 wrote:
         | > _provide your apps criticality ratings based on how important
         | they are to running the business_
         | 
         | In a decentralized, self-service model, you can add "deal with
         | convincing a stakeholder their app is anything less than most-
         | critical."
         | 
         | Although it usually works itself out if higher-criticality
         | imposes ongoing time commitments on them as well (aka stick).
        
           | j33zusjuice wrote:
           | That seems like a poorly run company. Idk. Maybe we've worked
           | in very different environments, but devs have almost always
           | been aware of the criticality of the app, so convincing
           | people wasn't hard. In most places, the answer hinges on "is
           | it customer facing?" and/or "does it break a core part of our
           | business?" If the answer is no to both, it's not critical,
           | and everyone understands that. There's always some weird
           | outlier, "well, this runs process B to report on process A,
           | and sends a custom report to the CEO ...", but hopefully
           | those exceptions are rare.
        
             | spydum wrote:
             | agree. ethbr1 is 100% right about this being a problem; if
             | politics is driving your criticality rating, it's probably
             | being done wrong. it should be as simple as your statement,
             | being mindful of some of those downstream systems that
             | aren't always obviously critical (until they are
             | unavailable for $time)
             | 
             | edit: whoops, maybe I read the meaning backward, but both
             | issues exist!
        
             | hinkley wrote:
             | I maintain a couple of apps that are pretty much free to
             | break or to revert back to an older build without much
             | consequence, except for one day a week, when half the team
             | uses them instead of just me.
             | 
             | Any other day I can use them to test new base images, new
             | coding or deployment techniques, etc. I just have to put
             | things back by the end of our cycle.
        
             | NovemberWhiskey wrote:
             | > _devs have almost always been aware of the criticality of
             | the app_
             | 
              | I'm sure that developers are aware of how important
             | their stuff is to their immediate customer, but they're
             | almost never aware of the _relative_ criticality vis-a-vis
              | stuff they don't own or have any idea about.
        
             | letsdothisagain wrote:
             | Welcome to University IT, where organizational structures
             | are basically feudal (by law!). Imagine an organization
             | where your president can't order a VP to do something, and
             | you have academia :)
        
         | Scubabear68 wrote:
         | Having an application, process and hardware inventory is a must
         | if you are going to have any hope of disaster recovery. Along
         | with regular failovers to make sure you haven't missed
         | anything.
        
         | outofpaper wrote:
         | In moments of crisis, immediate measures like physical tagging
         | can be crucial. Yet, a broader challenge looms: our dependency
         | on air conditioning. In Toronto's winter, the missed
         | opportunity to design buildings that work with the climate,
         | rather than defaulting to a universal AC solution, underscores
         | the need for thoughtful asset management tailored to specific
         | environments.
        
           | j33zusjuice wrote:
           | I upvoted, but I agree so much, I had to comment, too. I
            | wonder how long it'd take to recoup the cost of retrofitting
           | such a system. Despite this story today, this type of problem
           | must be rare. I imagine most of the savings would be found in
           | the electric bill, and it'd probably take a lot of years to
           | recoup the cost.
        
             | spydum wrote:
             | It's pretty common for hyperscalers actually:
             | https://betterbuildingssolutioncenter.energy.gov/showcase-
             | pr...
             | 
             | https://greenmountain.no/data-centers/cooling/
             | 
             | I vaguely remember some other whole building DC designs
             | that used a central vent which opened externally based on
             | external climate for some additional free cooling. Can't
             | find the reference now though. But geothermal is pretty
             | common for sure.
        
               | thakoppno wrote:
               | You may be thinking about Yahoo's approach from 2010?
               | 
               | > The Yahoo! approach is to avoid the capital cost and
               | power consumption of chillers entirely by allowing the
               | cold aisle temperatures to rise to 85F to 90F when they
               | are unable to hold the temperature lower. They calculate
               | they will only do this 34 hours a year which is less than
               | 0.4% of the year.
               | 
               | https://perspectives.mvdirona.com/2011/03/yahoo-compute-
               | coop...
        
               | spydum wrote:
               | No, what I was remembering was a building design for
               | datacenters, but I can't find a reference. Maybe it was
               | only conceptual. The design was to pull in cold exterior
               | air, pass thru the dehumidifiers to bring some of the
               | moisture levels down, and vent heat from a high rise
               | shaft out the top. All controlled to ensure humidity
               | didn't get wrecked.
        
           | monkeywork wrote:
            | Toronto's climate and winters are dramatically changing; the
            | universal AC solution is almost mandatory due to the climate
            | not being as cold in this area as it once was.
        
             | gosub100 wrote:
              | do you have a source for that? my source[1] suggests the
             | average temp hasn't changed much in the past quarter
             | century:
             | 
             | https://toronto.weatherstats.ca/metrics/temperature.html
        
               | lazyasciiart wrote:
               | Average temp probably isn't what you need here - peak
               | temperature and length of high temperature conditions
               | would be more important when figuring out if you need to
               | have artificial cooling available.
        
           | letsdothisagain wrote:
           | I know someone who did that in the Yukon during the winter,
           | just monitor temperatures and crack a window when it got too
           | hot. Seems like a great solution except that they were in a
           | different building so they had to trudge through the snow to
           | close the window if it got too cold.
        
         | datadrivenangel wrote:
         | Good documentation and metadata like this is necessary for
         | corporations to truly be organized.
        
       | marcus0x62 wrote:
       | Years ago during a week-long power outage, a telephone central
       | office where we had some equipment suffered a generator failure.
       | The telephone company had a backup plan (they kept one generator
        | on a trailer in the city for such a contingency) and they had
       | battery capacity[0] for their critical equipment to last until
       | the generator was hooked up.
       | 
       | They did have to load shed, though: they just turned off the AC
       | inverters. They figured anything _critical_ in a central office
       | was on DC power, and if you had something on AC, you were just
       | going to have to wait until the backup-backup generator was
       | installed.
       | 
       | 0 - at the time, at least, CO battery backup was usually sized
       | for 24 hours of runtime.
        
       | yevpats wrote:
       | Check out CloudQuery - https://github.com/cloudquery/cloudquery
       | for an easy cloud asset inventory.
        
       | radiowave wrote:
       | Tracking servers is one thing, but tracking the dependency
       | relationships among them is likely at least as important.
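        | 
        | Once the dependencies are written down, a safe shutdown order
        | falls out of a topological sort; a minimal sketch (the service
        | graph below is invented for illustration):
        | 
        |   from graphlib import TopologicalSorter
        | 
        |   # service -> services it depends on
        |   deps = {
        |       "webmail": {"imap", "dns"},
        |       "imap":    {"nfs", "dns"},
        |       "nfs":     set(),
        |       "dns":     set(),
        |   }
        | 
        |   # static_order() yields dependencies first, i.e. a boot order;
        |   # shutting down in the reverse order never pulls a dependency
        |   # out from under something that is still running.
        |   boot_order = list(TopologicalSorter(deps).static_order())
        |   shutdown_order = list(reversed(boot_order))
        |   print(shutdown_order)   # e.g. ['webmail', 'imap', 'nfs', 'dns']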
        
         | j33zusjuice wrote:
         | I'm really glad we realized that before disaster struck. We
          | have a project in-progress to do exactly this. It'd be even better
         | if SWE wrote ADRs (or whatever) that document all this stuff up
         | front, but ... well, there are only so many battles anyone can
         | fight, right?
        
         | h2odragon wrote:
         | Writing down and graphing out these relationships is a good way
         | to identify and normalize them.
         | 
         | I once had a system with layers of functionality; lvl 0
         | services were the most critical; lvl 3+ was "user shit" that
         | could be sloughed off at need.
         | 
         | Had some stub servers at lvl 0 and 1 that did things like
         | providing a file share of the same name as the lower level
         | services, but not populated; so that inadvertent domain
         | crossing dependencies weren't severe problems.
         | 
         | There was a "DB server" stub that only returned "no results."
         | The actual DB server for those queries was on the monster big
         | rack SPARC with the 3dz disks that took 10min to spin up fully.
         | When it came up it took over.
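          | 
          | One way to express that stub idea in code (a sketch; the names
          | are invented, and the real setup presumably swapped servers at
          | the service level rather than in the client):
          | 
          |   def query_stub(sql: str) -> list:
          |       """Stand-in 'DB server': always answers, with no results."""
          |       return []
          | 
          |   def query_real(sql: str) -> list:
          |       # imagine this talking to the slow SPARC box
          |       raise ConnectionError("database not up yet")
          | 
          |   def query(sql: str) -> list:
          |       # Degrade to the stub instead of letting an inadvertent
          |       # cross-level dependency become a hard failure.
          |       try:
          |           return query_real(sql)
          |       except ConnectionError:
          |           return query_stub(sql)
          | 
          |   print(query("SELECT 1"))   # -> [] while the real DB is down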
        
       | charcircuit wrote:
        | Turning off servers seems like the wrong call; better to
        | transition them into a lower-power state that can be exited once
        | the power budget is available again.
        
         | bluGill wrote:
         | The right answer is turn them all off - anything important is
         | in a redundant data center. But odds are they don't have that.
         | 
         | If a redundant data center isn't an option, then you should put
         | more into ensuring the system is resilient - fireproof room (if
         | a server catches on fire it can't spread to the next - there
         | are a lot of considerations here that I don't know about that
         | you need to figure out), plenty of backup power, redundant
         | HVAC, redundant connections to the internet - and you should
         | brainstorm other things that I didn't think of.
        
       | giaour wrote:
       | Cloud environments have an elegant solution in the form of "spot"
       | or "pre-emptible" instances: if your workload can tolerate
       | interruptions because it's not time sensitive or not terribly
       | important, you can get a pretty steep discount.
        
       | jll29 wrote:
        | I suspect making a list of "I think these ones are not critical"
       | is not sufficient.
       | 
       | You may overlook some subtle forms of interdependence.
       | 
       | To be sure, you need to test your documentation by actually
       | switching off the "uncritical" assets.
        
       | throwawaaarrgh wrote:
       | If your machines are all hypervisors you could migrate important
       | VMs to a couple hosts and turn off the rest. You could also
       | possibly throttle the vcpus, which would slow down the VMs but
       | allow you to run the machines cooler, or more VMs per machine.
       | Finally the ones with long running jobs could just be snapshotted
       | and powered down and restored later, resuming their computation.
       | 
       | There's a reason us old fogies were so excited when virtual
       | machines got increasingly robust. We could use them to solve
       | problems quickly that used to be nearly impossible.
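        | 
        | A rough sketch of that, assuming libvirt/KVM (the connection
        | URIs, VM names and save path are all made up; error handling
        | omitted):
        | 
        |   import libvirt
        | 
        |   src = libvirt.open("qemu:///system")
        |   dst = libvirt.open("qemu+ssh://surviving-host/system")
        | 
        |   KEEP = {"mailstore", "ldap"}   # hypothetical important VMs
        | 
        |   for dom in src.listAllDomains(
        |           libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
        |       if dom.name() in KEEP:
        |           # live-migrate the important VMs to the host staying on
        |           dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)
        |       else:
        |           # save state to disk so long jobs can resume later
        |           dom.save(f"/var/lib/libvirt/save/{dom.name()}.img")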
        
       | oasisbob wrote:
       | Good plan! I think this is a relatively common practice within
       | some corners of the telecom world.
       | 
       | At university (Western WA in B'ham), I worked for our campus
       | resnet, which had extensive involvement with other networking
       | groups on campus. They ran layers 3 and below on the resnets, we
       | took DNS+DHCP, plus egress, and everything through to layers 8
       | and 9.
       | 
       | The core network gear was co-located in a few musty basements
       | along with the telephone switches. DC and backup power was
       | available, but severely limited under certain failure scenarios.
       | 
       | All of the racked networking gear in the primary space was
       | labeled with red and green dots. Green was first to go in any
       | load-shedding scenario. Think: redundant LAN switches, switches
       | carrying local ports, network monitoring servers, other +1
       | redundant components, etc.
       | 
       | I'm not sure if the scheme was ever required in real life, but do
        | know it was based on hard-earned experiences like the author's
       | here.
        
         | red-iron-pine wrote:
         | Used to run data centers for ISPs and such around NoVA.
         | 
         | This was built into the building plan by room, with most rooms
         | going down first and Meet-Me-Rooms + the rooms immediately
         | adjacent where the big iron routers were, being the last to
         | fail. It's been a while but IIRC there weren't any specific by-
         | rack or by system protocols.
        
       | _dan wrote:
       | Similar thing (catastrophic aircon failure due to a flood in a
       | crap colocated DC) happened to us too before we shifted to AWS.
       | Photos from the colo were pretty bizarre - fans balanced on
       | random boxes, makeshift aircon ducting made of cardboard and
       | tape, and some dude flailing an open fire door back and forth all
       | day to get a little bit of fresh air in. Bizarre to see in
       | 2010-ish with multi million dollar customers.
       | 
       | We ended up having to strategically shut servers down as well,
       | but the question of what's critical, where is it in the racks,
       | and what's next to it was incredibly difficult to answer. And
       | kinda mind-bending - we'd been thinking of these things as
       | completely virtualised resources for years, suddenly having to
       | consider their physical characteristics as well was a bit of a
       | shock. Just shutting down everything non-critical wasn't enough -
        | there were still now-critical, non-redundant servers next to each
       | other overheating.
       | 
       | All we had to go on was an outdated racktables install, a readout
       | of the case temperature for each, and a map of which machine was
       | connected to which switch port which loosely related to position
       | in the rack - none completely accurate. In the end we got the
       | colo guys to send a photo of the rack front and back and (though
       | not everything was well labelled) we were able to make some
       | decisions and get things stable again.
       | 
       | In the end one server that was critical but we couldn't get to
       | run cooler we got lucky with - we were able to pull out the
       | server below and (without shutting it down) have the on site
       | engineer drop it down enough to crack the lid open and get some
       | cool air into it to keep it running (albeit with no redundancy
       | and on the edge of thermal shutdown).
       | 
       | We came really close to a major outage that day that would have
       | cost us dearly. I know it sounds like total shambles (and it
       | kinda was) but I miss those days.
        
         | macintux wrote:
         | I find it's much less stressful to rescue situations where it
         | wasn't your fault to begin with. Absent the ability to point
         | fingers at a vendor, crises like that are a miserable
         | experience for me.
        
           | organsnyder wrote:
           | Being the hero always feels better than cleaning up your own
           | messes.
        
         | SonOfLilit wrote:
         | > have the on site engineer drop it down enough to crack the
         | lid open
         | 
         | Took me four reads to find an alternative way to read it other
         | than "we asked some guy that doesn't even work for us to throw
         | it on the ground repeatedly until the cover cracks open", like
         | that Zoolander scene.
        
           | _dan wrote:
           | Honestly that was pretty much the situation.
           | 
           | In our defence, he offered. It had hit hour 6 of both the
           | primary and the backup aircon being down, on a very hot day -
           | everyone was way beyond blame and the NOC staff were
           | basically up for any creative solution they could find.
        
         | gottorf wrote:
         | > some dude flailing an open fire door back and forth all day
         | to get a little bit of fresh air in
         | 
         | That's hilarious (probably for you as well, in hindsight). Do
         | you feel comfortable naming and shaming this DC, so we know to
         | avoid it?
        
       | VikingCoder wrote:
       | I often think of the SNL sketch with the lines, "We should keep a
       | list of our clients, and how much money they have given us."
       | https://www.youtube.com/watch?v=reUjEFI0Slc
        
       | mgaunard wrote:
       | If machines aren't important, why do you have them at all?
       | 
        | Because it's a university and we don't care about justifying
       | costs?
        
         | barrucadu wrote:
         | It's possible to have two important things and yet for one to
         | be more important than the other.
        
       | syslog wrote:
       | Any server that must not fail is just not important enough.
       | 
       | (Build your services with HA in mind, so you don't have to worry
       | about a situation like this one.)
        
       | PaulKeeble wrote:
        | One place I worked did a backup power test, and when they came
        | back from the diesels to the grid the entire datacentre lost
        | power for about 10 seconds due to a software bug. It caused a
        | massive outage.
        | 
        | The problem was that a lot of the machines pulled their OS image
        | from central storage servers and there was nowhere near enough IO
        | to load everything, so they had to prioritise what to bring up
        | first to lighten the load and stop everything thrashing. It was a
        | complete nightmare even though the front end that took sales was
        | well isolated from the backend. Working out what was most
        | important across an entire corporation took as long as the
        | problem resolving slowly by just bringing things up randomly.
       | 
        | Nowadays you would just run multiple datacentres or cloud HA,
        | and we have SSDs, but I just can't see such an understanding of
        | the architecture being possible for any reasonably large company.
        | The cost of keeping it and the dependencies up to date would be
        | huge and it would always be out of date. More documentation isn't
        | the solution; it's to have multiple sites.
        
         | jiggawatts wrote:
         | That brings back memories of a similar setup with hundreds of
         | Windows servers booting over the network. We had regular
         | "brownouts" even during the day just because the image
         | streaming servers couldn't handle the IOPS. Basic maintenance
         | would slow down the servers for ten thousand users and generate
         | support tickets.
         | 
         | I jumped up and down and convinced management to buy one of the
         | first enterprise SSDs on the market. It was a PCIe card form
         | factor and cost five digits for a tiny amount of storage.
         | 
         | We squeezed in the images using block-level deduplication and
         | clever copy scripts that would run the compaction routine after
         | each file was copied.
         | 
         | The difference was staggering. Just two of those cards made
         | hundreds of other servers run like greased lightning. Boot
         | times dropped to single digit seconds instead of minutes.
         | Maintenance changes could be done at any time with zero impact
         | on users. The whole cluster could be rebooted all at once with
         | only a slight slowdown. Fun times.
        
       | quickthrower2 wrote:
       | This is where the cloud kicks ass. Run multiple nodes with geo
       | redundancy (where based on various concerns: cost, contracts,
       | legal). But nodes should cross data centres. Maybe if one city
       | gets nuked (literally or a fire/power outage) you still have
       | uptime. Use Kubernetes maybe.
        
       | iwontberude wrote:
       | Isn't this the point of decoupling your compute and datastores
        | using CSI with disaggregated storage in Kubernetes? So long as
        | you keep your datastores available, whatever compute you can
        | manage to attach to it from Kubernetes can run whatever you
        | truly need at capacities that you can handle with that level of
        | hardware.
       | Similarly, you could scale down the workloads on all the machines
       | so they generated less heat without turning anything off at the
       | expense of performance.
        
       ___________________________________________________________________
       (page generated 2024-02-06 23:00 UTC)