[HN Gopher] We might want to regularly keep track of how importa...
___________________________________________________________________
We might want to regularly keep track of how important each server
is
Author : pabs3
Score : 194 points
Date : 2024-02-06 10:51 UTC (12 hours ago)
(HTM) web link (utcc.utoronto.ca)
(TXT) w3m dump (utcc.utoronto.ca)
| tlb wrote:
| Or you might want to have redundant cooling.
|
| Cooling system prices seem to scale fairly linearly with the
| cooling power above a few kW, so instead of one 100 kW system you
| could buy four 25 kW systems so a single failure won't be a
| disaster.
| nine_k wrote:
| Won't 4 x 25 kW systems also mean 4x the installation cost?
| mdekkers wrote:
| What's the % of install cost over the projected lifetime of the
| system?
| dist-epoch wrote:
| Probably less if you install them all at the same time.
| bluGill wrote:
| Maybe, but costs are not linear, and the nonlinearity goes
| different ways for different parts of the install. The installed
| cost of the smaller systems could be less than the cost of the
| large system before installation, if the smaller systems are
| standard parts.
| throw0101a wrote:
| > _Or you might want to have redundant cooling._
|
| Can you provide a cost centre or credit card that they can bill
| this to? In case you didn't notice the domain, it is
| UToronto: academic departments aren't generally flush with
| cash.
|
| Further, you have to have physical space to fit the extra
| cooling equipment and pipes: not always easy or possible to do
| in old university buildings.
| wongarsu wrote:
| If you designed the system like this from the start or when
| replacing it anyway, N+1 redundancy might not be much more
| expensive than one big cooling unit. The systems can mostly
| share their ductwork and just have redundancy in the active
| components, so mostly the chillers.
|
| Of course these systems only get replaced every couple
| decades, if ever, so they are pretty much stuck with the
| setup they have.
| throw0101b wrote:
| > _If you designed_ [...]
|
| University department IT is not designed, it grows over
| decades.
|
| At some point some benefactor may pay for a new building
| and the department will move, so that could be a chance to
| actually design. But modern architecture and architects
| don't really go well with hosting lots of servers in what
| is ostensibly office space.
|
| I've been involved in the build-out of buildings/office
| space on three occasions in my career, and trying to get a
| decent IT space pencilled in has always been like pulling
| teeth.
| bluGill wrote:
| > Of course these systems only get replaced every couple
| decades, if ever
|
| This is despite the massive energy savings they could get
| if they replaced those older systems. Universities often
| are full of old buildings with terrible insulation
| heated/cooled by very old/inefficient systems. In 20 years
| they would be money ahead by tearing down most buildings on
| campus and rebuilding to modern standards (assuming energy
| costs don't go up, which seems unlikely). But they consider
| all those old buildings historic and so won't.
| light_hue_1 wrote:
| > In 20 years they would be money ahead by tearing down
| most buildings on campus and rebuilding to modern
| standards (assuming energy costs don't go up, which seems
| unlikely). But they consider all those old buildings
| historic and so won't.
|
| It has nothing to do with considering those building
| historic.
|
| The problem is unless someone wants to donate a $50-100M,
| new buildings don't happen. And big donors want to donate
| to massive causes "Build a new building to cure cancer!"
| not "This building is kind of crappy, let's replace it
| with a better one".
|
| It doesn't matter that over 50 years something could be
| cheaper if there's no money to fix it now.
| gregmac wrote:
| This kind of thing is like insurance. Maybe IT failed to
| state the consequences of not having redundancy, maybe people
| in control of the money failed to understand... or maybe the
| risks were understood and accepted.
|
| Either way, by not paying for the insurance (redundant
| systems) up front the organization is explicitly taking on
| the risk.
|
| Whether the cost now is higher is impossible to say as an
| outsider, but there are a lot of expenses: paying a premium for
| emergency repairs/replacement; paying salaries to a bunch of
| staff who are unable to work at full capacity (or maybe at
| all); a bunch of IT projects delayed because staff is dealing
| with an outage; and maybe downstream ripple effects, like
| classes cancelled or research projects in jeopardy.
|
| I've never worked in academics, but I know people that do and
| understand the budget nonsense they go through. It doesn't
| change the reality, though, which is systems fail and if you
| don't plan for that you'll pay dearly.
| marcus0x62 wrote:
| With that model, you'd probably want 5 instead of 4 (N+1), but
| the other thing to consider is whether you can duct the cold air to
| where it needs to go when one or more of the units has failed.
| dotancohen wrote:
| This will mean just about 4 times the number of failures, too.
| And can 75% cooling still cool the server room anyway?
| donkeyd wrote:
| Maybe not, but some cooling means fewer servers to shut down.
| bluGill wrote:
| It means 5 times the number of failures as you intentionally
| put in an extra unit so that one can be taken offline at any
| time for maintenance (which itself will keep the whole system
| more reliable), and if one fails the whole system keeps up. The cost
| is only slightly more to do this when there are 5 smaller
| units. Those smaller units could be standard off the shelf
| units as well, so it could be cheaper than a large unit that
| isn't made in as large a quantity (this is a consideration
| that needs to be made case by case).
|
| Even if you cheap out and only install 4 units, odds are your
| failure doesn't happen on the hottest day of the year and so
| 3 can keep up just fine. It is only when you are unlucky
| that you need to shut anything down.
| evilduck wrote:
| Four service degradations vs. one huge outage event. Pick
| your poison.
| cowsandmilk wrote:
| It is interesting to contrast with where the wider industry has
| gone.
|
| Industry: don't treat your systems like pets.
|
| Author: proudly declares himself Unix herder, wants to keep track
| of which systems are important.
| throw0101a wrote:
| > _Author: proudly declares himself Unix herder, wants to keep
| track of which systems are important._
|
| Because not all environments are webapps with dozens or
| hundreds of systems configured in a cookie-cutter manner.
| Plenty of IT environments have pets because plenty of IT
| environments are not Web Scale(tm). And plenty of IT
| environments have staff churn and legacy systems where
| knowledge can become murky (see reference about "archaeology").
|
| An IMAP server is different than a web server is different than
| an NFS server, and there may also be inter-dependencies between
| them.
| cookiemonster9 wrote:
| So much this. Not everything fits cattle/chicken/etc models.
| Even in cases where those models could fit, they are not
| necessarily the right choice, given staffing, expertise,
| budgets, and other factors.
| LeonB wrote:
| I work with one system where the main entity is a "sale"
| which is processed beginning to end in some fraction of a
| second.
|
| A different system I work with, the main "entity" is,
| conceptually, more like a murder investigation. Less than 200
| of the main entity are created in a year. Many different
| pieces of information are painstakingly gathered and tracked
| over a long period of time, with input from many people and
| oversight from legal experts and strong auditing
| requirements.
|
| Trying to apply the best lessons and principles from one
| system to the other is rarely a good idea.
|
| These kind of characteristics of different systems make a lot
| of difference to their care and feeding.
| karmarepellent wrote:
| I would argue even the "wider industry" still administers
| systems that must be treated as pets because they were not
| designed to be treated as cattle.
| luma wrote:
| When you are responsible for the full infrastructure,
| sequencing power down and power on in coordination with your
| UPS is a common solution. Network gear needs a few minutes to
| light up ports, core services like DNS and identity services
| might need to light up next, then storage, then hypervisors and
| container hosts, then you can actually start working on app
| dependencies.
|
| This sort of sequencing lends itself naturally to having a plan
| for limited capacity "keep the lights on" workload shedding
| when facing a situation like the OP.
|
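| A rough sketch of that kind of tiered plan (Python; the tier
| names, hosts and health check below are made up, and the
| power_on/shed hooks are placeholders for IPMI/PDU calls):
|
|     import time
|
|     # Power-on order; shedding/shutdown walks it in reverse.
|     TIERS = [
|         ("network",       ["core-sw1", "core-sw2"]),
|         ("core services", ["dns1", "idp1"]),
|         ("storage",       ["san1", "nfs1"]),
|         ("hypervisors",   ["hv1", "hv2", "hv3"]),
|         ("applications",  ["app1", "app2"]),
|     ]
|
|     def is_healthy(host):
|         return True   # placeholder: ping, TCP check, API probe
|
|     def power_on(host):
|         print("powering on", host)   # placeholder: IPMI/iLO/PDU
|
|     def bring_up():
|         for tier, hosts in TIERS:
|             for h in hosts:
|                 power_on(h)
|             # don't start the next tier until this one is ready
|             while not all(is_healthy(h) for h in hosts):
|                 time.sleep(10)
|             print("tier ready:", tier)
|
|     def shed_load(keep_tiers):
|         # "keep the lights on": drop everything past keep_tiers
|         for tier, hosts in reversed(TIERS[keep_tiers:]):
|             for h in hosts:
|                 print("shutting down", h)
|
|     bring_up()       # full cold start
|     # shed_load(3)   # emergency: keep network/core/storage only
|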
| Not everyone has elected to pay Bezos double the price for
| things they can handle themselves, and this is part of handling
| it.
| bbarnett wrote:
| Double? Try 100x!!
| marcosdumay wrote:
| Exactly. If it was double, it would be a no-brainer.
| philsnow wrote:
| If you're running a couple ec2 instances in one AZ then
| yeah it's closer to 100x, but if you wanted to replicate
| the durability of S3, it would cost you a lot in terms of
| redundancy (usually "invisible" to the customer) and
| ongoing R&D and support headcount.
|
| Yes, even when you add it all up, Amazon still charges a
| premium even over that all-in cost. That's sweat equity.
| p_l wrote:
| Even if you're running "cattle", you still need to keep track
| of which systems are important because, to the surprise of many,
| the full infrastructure is more like the ranch, and cattle is
| just part of it.
|
| (and here I remind myself again to write the screed against
| "cattle" metaphor...)
| bayindirh wrote:
| HPC admin here (and possibly managing a similar system topology,
| machine room and all).
|
| In heterogeneous system rooms, you can't stuff everything into
| a virtualization cluster with a shared storage and migrate
| things on the fly, thinking that every server (in hardware) is
| a cattle and you can just herd your VMs from host to host.
|
| A SLURM cluster is easy. Shut down all the nodes and the
| controller will say "welp, no servers to run the workloads, will wait
| until servers come back", but storage systems are not that easy
| (ordering, controller dependencies, volume dependencies,
| service dependencies, etc.).
|
| Also there are servers which can't be virtualized because
| they're hardware dependent, latency dependent, or just filling
| the server they are in, resource wise.
|
| We also have some pet servers, and some cattle. We "pfft" to
| some servers and scramble for others due to various reasons. We
| know what server runs which service by the hostname, and never
| install pet servers without the team's knowledge. So if something
| important goes down, everyone can at least attend to the OS or the
| hardware it's running on.
|
| Even in a cloud environment, you can't move a VSwitch VM as you
| want, because you can't have the root of a fat SDN tree on
| every node. Even the most flexible infrastructure has firm
| parts to support that flexibility. It's impossible otherwise.
|
| Lastly, not knowing which servers are important is a big no-no.
| We had "glycol everywhere" incidents and serious heatwaves, and
| all we say is, "we can't cool the room down, scale down". Everybody
| shuts the servers they know they can, even if somebody from the
| team is on vacation.
|
| Being a sysadmin is a team game.
| readscore wrote:
| At a FAANG, our services are cattle, but we still plan which
| services to keep running when we need to drain 50% of a DC.
|
| Latency is important. Money makers > Latency sensitive >
| Optional requests > Background requests > Batch traffic.
|
| Bootstrapping is important. If A depends on B, you might need to
| drain A first, or A and B together.
| _heimdall wrote:
| The cattle metaphor really is a bad one. Anyone raising
| cattle should do the same thing, knowing which animals are
| the priority in case of drought, disease, etc.
|
| Hopefully one never has to face that scenario, but it's much
| easier to pick up the pieces when you know where the
| priorities are whether you're having to power down servers or
| thin a herd.
| bluGill wrote:
| Cattle are often interchangeable. You cull any that catch a
| disease (in some cases the USDA will cull the entire herd
| if just one catches something - biosecurity is a big deal).
| In the case of drought you pick a bunch to get rid of,
| based on market prices. (If everyone else is culling, you will
| try to keep yours because the market is collapsing - but this
| means managing feed and thus may mean culling more of the
| herd later.)
|
| Some cattle we can measure. Milk cows are carefully managed
| as to output - the farmer knows how much the milk from each one
| is worth and so they can cull the low producers. However
| milk is so much more valuable than meat that they never
| cull based on drought - milk can always outbid meat for
| feed. If milk demand goes down the farmer might cull some -
| but often the farmer is under contract for X amount of milk
| and so they cannot manage prices.
| p_l wrote:
| Honestly, big issue with the cattle metaphor is that the
| individual services you run on servers are very much
| often not interchangeable.
|
| A DNS service is not NTP is not mail gateway is not
| application load balancer is not database _etc etc etc_
|
| At best, multiple replicas of those are cattle.
|
| And while you can treat the servers underlying them as
| interchangeable, that doesn't change the fact the
| services you run on them _are not_.
| naniwaduni wrote:
| Cattle often aren't interchangeable either. Not gonna have a
| great time milking the bulls.
| bluGill wrote:
| But if you are milking you don't have bulls. Maybe you
| have one (though almost everyone uses artificial
| insemination these days). Worrying about milking bulls is
| like worrying about the NetWare server - once common but
| has been obsolete since before many reading this were
| even born.
|
| Of course the pigs, cows, and chickens are not
| interchangeable. Nor are corn, hay, soybeans.
| cortesoft wrote:
| Yeah, really the only difference is whether you are
| tracking individual servers or TYPES of servers
| pch00 wrote:
| > Industry: don't treat your systems like pets.
|
| The industry has this narrative because it suits their desire
| to sell higher-margined cloud services. However in the real
| world, especially in academia as cks is, the reality is that
| many workloads are still not suitable for the cloud.
| tonyarkles wrote:
| I'm absolutely loving the term Unix herder and will probably
| adopt it :)
|
| I'm generally with you and the wider industry on the cattle-
| not-pets thing but there are a few things to keep in mind in
| the context of a university IT department that are different
| than what we regularly talk about here:
|
| - budgets often work differently. You have a capex budget and
| your institution will exist long enough to fully depreciate the
| hardware they've bought you. They won't be as happy to
| dramatically increase your opex.
|
| - storage is the ultimate pet. In a university IT department
| you're going to have people who need access to tons and tons of
| speedy storage both short-term and long-term.
|
| I'm smiling a little bit thinking about a job 10 years ago who
| adopted the cattle-not-pets mentality. The IT department
| decided they were done with their pets, moved everything to a
| big vSphere cluster, and backed it by a giant RAID-5 array.
| There was a disk failure, but that's ok, RAID-5 can handle
| that. And then the next day there was a second disk failure.
| Boom. Every single VM in the engineering department is gone
| including all of the data. It was all backed up to tape and
| slowly got restored but the blast radius was enormous.
| jodrellblank wrote:
| At the risk of no true Scotsman, that doesn't sound like
| "cattle not pets"; when the cattle are sent to the
| slaughterhouse there isn't any blast radius, there's just
| more cattle taking over. You explicitly don't have to replace
| them with exact clones of the original cattle from tape very
| slowly, you spin up a herd of more cattle in moments.
| logifail wrote:
| > when the cattle are sent to the slaughterhouse
|
| Data isn't "sent to the slaughterhouse". Ever.
|
| Data can be annoying that way.
| datadrivenangel wrote:
| The true problem with kubernetes and modern cloud. Not
| insurmountable, but painful when your data is large
| compared to your processing needs.
| j33zusjuice wrote:
| I think the point is that systems that aren't replaced
| easily shouldn't be managed like they are, not ...
| whatever it is you're getting at here.
| carbotaniuman wrote:
| Sadly you can't just spool down data either - data isn't
| fungible!
| chris_wot wrote:
| Analogies always break down under scrutiny. Any cattle
| farmer would find "spinning up a herd of cattle" to be
| hilarious.
| tonyarkles wrote:
| > you spin up a herd of more cattle in moments
|
| Where does the decade of data they've been collecting come
| from when you "spin up a new herd"?
| ElevenLathe wrote:
| It's stored in another system that _is_ probably treated
| as a pet, hopefully by somebody way better at it with a
| huge staff (like AWS). Even if it's a local NetApp
| cluster or something, you can leave the state to the
| storage admins rather than random engineers that may or
| may not even be with the company any more.
| jodrellblank wrote:
| If you are using clustered storage (Ceph, for example)
| instead of a single RAID5 array, ideally the loss of one
| node or one rack or one site doesn't lose your data it
| only loses some of the replicas. When you spin up new
| storage nodes, the data replicates from the other nodes
| in the cluster. If you need 'the university storage
| server' that's a pet. Google aren't keeping pet
| webservers and pet mailbox servers for GMail - whichever
| loadbalanced webserver you get connected to will work
| with any storage cluster node it talks to. Microsoft
| aren't keeping pet webservers and mailbox servers for
| Office365, either. If they lose one storage array, or
| rack, or one DC, your data isn't gone.
|
| 'Cattle' is the idea that if you need more storage, you
| spin up more identikit storage servers and they merge in
| seamlessly and provide more replicated redundant storage
| space. If some break, you replace them with identikit
| ones which seamlessly take over. If you need data, any of
| them will provide it.
|
| 'Pets' is the idea that you need _the email storage
| server_ , which is that HP box in the corner with the big
| RAID5 array. If you need more storage, it needs to be
| expansion shelves compatible with that RAID controller
| and its specific firmware versions which needs space and
| power in the same rack, and that's different from your
| newer Engineering storage server, and different to your
| Backup storage server. If the HP fails, the service is
| down until you get parts for that specific HP server, or
| restore that specific server's data to one new pet.
|
| And yes, it's a model not a reality. It's easier to think
| about scaling your services if you have "two large
| storage clusters" than if you have a dozen different
| specialist storage servers each with individual quirks
| and individual support contracts which can only be worked
| on by individual engineers who know what's weird and
| unique about them. And if you can reorganise from pets to
| cattle, it can free up time, attention, make things more
| scalable, more flexible, make trade offs of maintenance
| time and effort.
| jsjohnst wrote:
| > moved everything to a big vSphere cluster, and backed it by
| a giant RAID-5 array
|
| I'm with sibling commenter, if said IT department genuinely
| thought that the core point in "cattle-not-pets" was met by
| their single SuperMegaCow, then they missed the point
| entirely.
| throw0101b wrote:
| > _You have a capex budget_ [...]
|
| As someone who has worked IT in academia: no, you do not. :)
| tonyarkles wrote:
| I'm laughing (and crying) with you, not at you.
|
| From my past life in academia, you're totally right. But
| that kind of reinforces the point... you do _occasionally_
| get some budget for servers and then have to make them last
| as long as possible :). Those one-time expenses are
| generally more palatable than recurring cloud storage costs
| though.
| verticalscaler wrote:
| The devops meme referenced, "cattle not pets", was probably
| popularized by a book called "The Phoenix Project".
|
| The real point is that pets are job security for the Unix
| herder. If you end up with one neckbeard running the joint
| that's even worse as a single point of failure.
| datadrivenangel wrote:
| If you end up with one cloud guru running your cloud,
| that's maybe worse as a single point of failure.
|
| AWS system admins may be more fungible these days than unix
| sysadmins. Maybe.
| fishtacos wrote:
| >>The IT department decided they were done with their pets,
| moved everything to a big vSphere cluster, and backed it by a
| giant RAID-5 array. There was a disk failure, but that's ok,
| RAID-5 can handle that.
|
| Precisely why, when I was charged with setting up a 100 TB
| array for a law firm client at a previous job, I went for
| RAID-6, even though it came with a tremendous write speed
| hit. It was mostly archived data that needed retention for a
| long period of time, so it wasn't bad for daily usage, and
| read speeds were great. Had the budget been greater, RAID 10
| would've been my choice. (requisite reminder: RAID is not
| backup)
|
| Not related, but they were hit with a million dollar
| ransomware attack (as in: the hacker group requested a
| million dollar payment), so that write speed limitation was
| not the bottleneck considering internet speed when restoring.
| Ahhh.... what a shitshow, the FBI got involved, and never
| worked for them again. I did warn them though: zero updates
| (disabled) and also disabled firewall on the host data server
| (windows) was a recipe for disaster. Within 3 days they got
| hit, and the boss had the temerity to imply I had something
| to do with it. Glad I'm not there anymore, but what a screwy
| opsec situation I thankfully no longer have to support.
| justsomehnguy wrote:
| > even though it came with a tremendous write speed hit
|
| Only on writes smaller than the stripe size. If your writes are
| bigger, then you can have way more speed than RAID10 on the same set,
| limited only by the RAID controller CPU.
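|
| Back-of-the-envelope numbers for that (a rough sketch; the disk
| count and per-disk speed are assumptions, and controller CPU and
| caching are ignored):
|
|     n = 8           # disks in the set (assumption)
|     mb_s = 150      # sequential MB/s per spindle (assumption)
|
|     # Full-stripe writes: RAID-6 writes n-2 data chunks plus 2
|     # parity chunks per stripe, so you get (n-2) disks' worth.
|     raid6_full = (n - 2) * mb_s       # 900 MB/s
|     # RAID-10 writes every byte twice: n/2 disks' worth.
|     raid10_full = (n // 2) * mb_s     # 600 MB/s
|
|     # Sub-stripe (small random) writes: RAID-6 read-modify-write
|     # costs roughly 6 IOs per write vs 2 for RAID-10.
|     raid6_small = n * mb_s / 6        # ~200 MB/s
|     raid10_small = n * mb_s / 2       # 600 MB/s
|
|     print(raid6_full, raid10_full, raid6_small, raid10_small)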
| fishtacos wrote:
| Due to network limitations and contract budgeting, I
| never got the chance to upgrade them to 10 Gb, but can
| confirm I could hit 1000 Mbps (100+ MB/s) on certain
| files on RAID-6. It sadly averaged out to about 55-60
| MB/s writes (HDD array, Buffalo), which again, for this
| use case was acceptable, but below expectations. I didn't
| buy the unit, I didn't design the architecture it was
| going into, merely a cog in the support machinery.
| rjbwork wrote:
| > the boss had the temerity to imply I had something to do
| with it.
|
| What was your response? I feel like mine would be "you are
| now accusing me of a severe crime, all further
| correspondence will be through my lawyer, good luck".
| justsomehnguy wrote:
| We bought a 108-drive DAS some years ago, for a backup
| storage.
|
| I needed to actively object to the idea of a 108-wide RAID5 on
| that.
|
| File systems get borked, RAID arrays can have multiple
| failures, admins can issue rm -rf in the wrong place.
| jrumbut wrote:
| > budgets often work differently.
|
| Very differently. Instead of a system you continually iterate
| on for its entire lifetime, if you're in a more regulated
| research area you might build it once, get it approved, and
| then it's only critical updates for the next five (or more!)
| years while data is collected.
|
| Not many of the IT principles developed for web app startups
| apply in the research domain. They're less like cattle or
| pets and more like satellites which have very limited ability
| to be changed after launch.
| viraptor wrote:
| I think you're missing some aspects of cattle. You're still
| supposed to keep track of what happens and where. You still
| want to understand why and how each of the servers in the
| autoscaling group (or similar) behaves. The cattle part just
| means they're unified and quickly replaceable. But they still
| need to be well tagged, accounted for in planning, removed when
| they don't fulfil the purpose anymore, identified for billing,
| etc.
|
| And also importantly: you want to make sure you have a good
| enough description for them that you can say
| "terraform/cloudformation/ansible: make sure those are running"
| - without having to find them on the list and do it manually.
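|
| As a toy illustration of that "described well enough to reconcile"
| idea (plain Python standing in for terraform/ansible state; the
| group names, tags and counts are invented):
|
|     # Desired state: instances are replaceable, but each group is
|     # still tagged with a criticality and a target count.
|     desired = {
|         "web-asg":   {"count": 4, "criticality": "high"},
|         "batch-asg": {"count": 8, "criticality": "low"},
|         "mail-gw":   {"count": 2, "criticality": "high"},
|     }
|
|     # What is actually running (would come from the cloud API).
|     running = {"web-asg": 4, "batch-asg": 3, "mail-gw": 2}
|
|     def reconcile(desired, running):
|         for group, spec in desired.items():
|             have = running.get(group, 0)
|             if have < spec["count"]:
|                 print(f"scale up {group}: {have} -> {spec['count']}")
|         for group in running:
|             if group not in desired:
|                 print(f"unknown group {group}: flag for removal")
|
|     reconcile(desired, running)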
| falcor84 wrote:
| Where's the contrast? Herding is something you do with cattle
| rather than pets.
| _heimdall wrote:
| I'm pretty sure anyone in the industry that draws this
| distinction between cattle and pets has never worked with
| cattle and only knows of general ideas about the industrial
| cattle business.
| bluGill wrote:
| Ranchers do eat their pets. They generally do love the
| cattle, but they also know at the end of a few years they get
| replaced - it is the cycle of life.
| antod wrote:
| Likewise, anyone talking about civil engineering and bridges.
|
| At least back in the day Slashdot was fully aware of how
| broken their ubiquitous car analogies were and played it up.
| krisoft wrote:
| I don't see the contrast here.
|
| > proudly declares himself Unix herder
|
| You know what herders herd? Cattle. Not pets.
|
| > wants to keep track of which systems are important.
|
| I mean obviously? Industry does the same. Probably with
| automation, tools and tagging during provisioning.
|
| The pet mentality is when you create a beautiful handcrafted
| spreadsheet showing which services run on the server named
| "Mjolnir" and which services run on the server named "Valinor".
| The cattle mentality is when you have the same in a distributed
| key-value database with UUIDs instead of fanciful server names.
|
| Or the pet mentality is when you prefer to not shut down
| "Mjolnir" because it has more than 2 years of uptime or some
| other silly reason like that. (as opposed to not shutting it
| down because you know that you would lose more money that way
| than by risking it overheating and having to buy a new one.)
| johann8384 wrote:
| Pets make sense sometimes. I also think there are still plenty
| of companies, large ones, with important services and data,
| that just don't operate in a way that allows the data center
| teams to do this either. I have some experience with both
| health insurance and life insurance companies for example where
| "this critical thing #8 that we would go out of business
| without" stillnloves solely on "this server right here". In
| university settings you have systems that are owned by a wide
| array of teams. These organizations aren't ready or even
| looking to implement a platform model where the underlying
| hardware can be generic.
| mrighele wrote:
| The issue here is not so much the hardware, but the services that
| run on top of them.
|
| I guess that many companies that use "current practices" have
| plenty of services that they don't even know about running on
| their clusters.
|
| The main difference is that instead of the kind of issues that
| the link talks about, you have those services running year
| after year, using resources, for the joy of the cloud
| companies.
|
| This happens even at Google [1]:
|
| "There are several remarkable aspects to this story. One is
| that running a Bigtable was so inconsequential to Google's
| scale that it took 2 years before anyone even noticed it, and
| even then, only because the version was old. "
|
| [1] https://steve-yegge.medium.com/dear-google-cloud-your-
| deprec...
| NovemberWhiskey wrote:
| Not sure if the goal was just to make an amusing comparison,
| but these are actually two completely different concerns.
|
| Building your systems so that they don't depend on permanent
| infrastructure and snowflake configurations is an orthogonal
| concern from understanding how to shed load in a business-
| continuity crisis.
| karmarepellent wrote:
| It's generally a good idea to have some documentation that states
| what a machine is used for, the "service", and how important said
| service is relative to others.
|
| At my company we kind of enforce this by not operating machines
| that are not explicitly assigned to any service.
|
| However you have to anticipate that the quality of documentation
| still varies immensely which might result in you shutting down a
| service that is actually more important than stated.
|
| Fortunately documentation improves after every outage because
| service owners iterate on their part of the documentation after
| their service was shut down because "it appeared unimportant".
|
| It's a process.
| spydum wrote:
| Asset management is definitely a thing. Tag your environments,
| tag your apps, and provide your apps criticality ratings based on
| how important they are to running the business. Then it's a
| matter of a query to know which servers can be shut, and which
| absolutely must remain.
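|
| A minimal sketch of that query (Python; the hosts, apps and the
| 1-4 criticality scale are invented, and a real shop would pull
| this from a CMDB or cloud tags rather than a literal list):
|
|     inventory = [
|         {"host": "imap1",    "app": "mail",      "criticality": 1},
|         {"host": "nfs1",     "app": "home-dirs", "criticality": 1},
|         {"host": "ci-run-3", "app": "ci",        "criticality": 3},
|         {"host": "dev-box7", "app": "scratch",   "criticality": 4},
|     ]
|
|     # Cooling emergency: keep only criticality 1-2, shed the rest,
|     # least important first.
|     shed = sorted(
|         (h for h in inventory if h["criticality"] > 2),
|         key=lambda h: -h["criticality"],
|     )
|     for h in shed:
|         print("shut down", h["host"], "-", h["app"])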
| ethbr1 wrote:
| > _provide your apps criticality ratings based on how important
| they are to running the business_
|
| In a decentralized, self-service model, you can add "deal with
| convincing a stakeholder their app is anything less than most-
| critical."
|
| Although it usually works itself out if higher-criticality
| imposes ongoing time commitments on them as well (aka stick).
| j33zusjuice wrote:
| That seems like a poorly run company. Idk. Maybe we've worked
| in very different environments, but devs have almost always
| been aware of the criticality of the app, so convincing
| people wasn't hard. In most places, the answer hinges on "is
| it customer facing?" and/or "does it break a core part of our
| business?" If the answer is no to both, it's not critical,
| and everyone understands that. There's always some weird
| outlier, "well, this runs process B to report on process A,
| and sends a custom report to the CEO ...", but hopefully
| those exceptions are rare.
| spydum wrote:
| agree. ethbr1 is 100% right about this being a problem; if
| politics is driving your criticality rating, it's probably
| being done wrong. it should be as simple as your statement,
| being mindful of some of those downstream systems that
| aren't always obviously critical (until they are
| unavailable for $time)
|
| edit: whoops, maybe I read the meaning backward, but both
| issues exist!
| hinkley wrote:
| I maintain a couple of apps that are pretty much free to
| break or to revert back to an older build without much
| consequence, except for one day a week, when half the team
| uses them instead of just me.
|
| Any other day I can use them to test new base images, new
| coding or deployment techniques, etc. I just have to put
| things back by the end of our cycle.
| NovemberWhiskey wrote:
| > _devs have almost always been aware of the criticality of
| the app_
|
| I'm sure that developers are aware of how important
| their stuff is to their immediate customer, but they're
| almost never aware of the _relative_ criticality vis-a-vis
| stuff they don't own or have any idea about.
| letsdothisagain wrote:
| Welcome to University IT, where organizational structures
| are basically feudal (by law!). Imagine an organization
| where your president can't order a VP to do something, and
| you have academia :)
| Scubabear68 wrote:
| Having an application, process and hardware inventory is a must
| if you are going to have any hope of disaster recovery. Along
| with regular failovers to make sure you haven't missed
| anything.
| outofpaper wrote:
| In moments of crisis, immediate measures like physical tagging
| can be crucial. Yet, a broader challenge looms: our dependency
| on air conditioning. In Toronto's winter, the missed
| opportunity to design buildings that work with the climate,
| rather than defaulting to a universal AC solution, underscores
| the need for thoughtful asset management tailored to specific
| environments.
| j33zusjuice wrote:
| I upvoted, but I agree so much, I had to comment, too. I
| wonder how long it'd take to recoup the cost of retrofitting
| such a system. Despite this story today, this type of problem
| must be rare. I imagine most of the savings would be found in
| the electric bill, and it'd probably take a lot of years to
| recoup the cost.
| spydum wrote:
| It's pretty common for hyperscalers actually:
| https://betterbuildingssolutioncenter.energy.gov/showcase-
| pr...
|
| https://greenmountain.no/data-centers/cooling/
|
| I vaguely remember some other whole building DC designs
| that used a central vent which opened externally based on
| external climate for some additional free cooling. Can't
| find the reference now though. But geothermal is pretty
| common for sure.
| thakoppno wrote:
| You may be thinking about Yahoo's approach from 2010?
|
| > The Yahoo! approach is to avoid the capital cost and
| power consumption of chillers entirely by allowing the
| cold aisle temperatures to rise to 85F to 90F when they
| are unable to hold the temperature lower. They calculate
| they will only do this 34 hours a year which is less than
| 0.4% of the year.
|
| https://perspectives.mvdirona.com/2011/03/yahoo-compute-
| coop...
| spydum wrote:
| No, what I was remembering was a building design for
| datacenters, but I can't find a reference. Maybe it was
| only conceptual. The design was to pull in cold exterior
| air, pass thru the dehumidifiers to bring some of the
| moisture levels down, and vent heat from a high rise
| shaft out the top. All controlled to ensure humidity
| didn't get wrecked.
| monkeywork wrote:
| Toronto's climate and winters are dramatically changing; the
| universal AC solution is almost mandatory due to the climate
| not being as cold in this area as it once was.
| gosub100 wrote:
| do you have a source for that? my source[1] suggests the
| average temp hasn't changed much in the past quarter
| century:
|
| https://toronto.weatherstats.ca/metrics/temperature.html
| lazyasciiart wrote:
| Average temp probably isn't what you need here - peak
| temperature and length of high temperature conditions
| would be more important when figuring out if you need to
| have artificial cooling available.
| letsdothisagain wrote:
| I know someone who did that in the Yukon during the winter,
| just monitor temperatures and crack a window when it got too
| hot. Seems like a great solution except that they were in a
| different building so they had to trudge through the snow to
| close the window if it got too cold.
| datadrivenangel wrote:
| Good documentation and metadata like this is necessary for
| corporations to truly be organized.
| marcus0x62 wrote:
| Years ago during a week-long power outage, a telephone central
| office where we had some equipment suffered a generator failure.
| The telephone company had a backup plan (they kept one generator
| on a trailer in the city for such a contingency), and they had
| battery capacity[0] for their critical equipment to last until
| the generator was hooked up.
|
| They did have to load shed, though: they just turned off the AC
| inverters. They figured anything _critical_ in a central office
| was on DC power, and if you had something on AC, you were just
| going to have to wait until the backup-backup generator was
| installed.
|
| 0 - at the time, at least, CO battery backup was usually sized
| for 24 hours of runtime.
| yevpats wrote:
| Check out CloudQuery - https://github.com/cloudquery/cloudquery
| for an easy cloud asset inventory.
| radiowave wrote:
| Tracking servers is one thing, but tracking the dependency
| relationships among them is likely at least as important.
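|
| A small sketch of why those edges matter (Python's graphlib; the
| services and dependencies below are invented):
|
|     from graphlib import TopologicalSorter
|
|     # service -> what it depends on
|     deps = {
|         "webmail": {"imap", "ldap"},
|         "imap":    {"nfs", "ldap"},
|         "nfs":     {"dns"},
|         "ldap":    {"dns"},
|         "dns":     set(),
|     }
|
|     boot_order = list(TopologicalSorter(deps).static_order())
|     shutdown_order = list(reversed(boot_order))
|
|     print("bring up: ", boot_order)       # dns first, webmail last
|     print("shut down:", shutdown_order)   # webmail first, dns last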
| j33zusjuice wrote:
| I'm really glad we realized that before disaster struck. We
| have a project in progress to do exactly this. It'd be even better
| if SWE wrote ADRs (or whatever) that document all this stuff up
| front, but ... well, there are only so many battles anyone can
| fight, right?
| h2odragon wrote:
| Writing down and graphing out these relationships is a good way
| to identify and normalize them.
|
| I once had a system with layers of functionality; lvl 0
| services were the most critical; lvl 3+ was "user shit" that
| could be sloughed off at need.
|
| Had some stub servers at lvl 0 and 1 that did things like
| providing a file share of the same name as the lower level
| services, but not populated; so that inadvertent domain
| crossing dependencies weren't severe problems.
|
| There was a "DB server" stub that only returned "no results."
| The actual DB server for those queries was on the monster big
| rack SPARC with the 3dz disks that took 10min to spin up fully.
| When it came up it took over.
| charcircuit wrote:
| Turning off servers seems like the wrong call instead of
| transitioning servers into a lower powered state which can be
| exited once the power budget is available again.
| bluGill wrote:
| The right answer is turn them all off - anything important is
| in a redundant data center. But odds are they don't have that.
|
| If a redundant data center isn't an option, then you should put
| more into ensuring the system is resilient - fireproof room (if
| a server catches on fire it can't spread to the next - there
| are a lot of considerations here that I don't know about that
| you need to figure out), plenty of backup power, redundant
| HVAC, redundant connections to the internet - and you should
| brainstorm other things that I didn't think of.
| giaour wrote:
| Cloud environments have an elegant solution in the form of "spot"
| or "pre-emptible" instances: if your workload can tolerate
| interruptions because it's not time sensitive or not terribly
| important, you can get a pretty steep discount.
| jll29 wrote:
| I suspect making a list of "I think these ones are not critical."
| is not sufficient.
|
| You may overlook some subtle forms of interdependence.
|
| To be sure, you need to test your documentation by actually
| switching off the "uncritical" assets.
| throwawaaarrgh wrote:
| If your machines are all hypervisors you could migrate important
| VMs to a couple hosts and turn off the rest. You could also
| possibly throttle the vcpus, which would slow down the VMs but
| allow you to run the machines cooler, or more VMs per machine.
| Finally the ones with long running jobs could just be snapshotted
| and powered down and restored later, resuming their computation.
|
| There's a reason us old fogies were so excited when virtual
| machines got increasingly robust. We could use them to solve
| problems quickly that used to be nearly impossible.
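|
| A hedged sketch of the "save the long-running ones, keep the
| important ones" part with the libvirt Python bindings (the
| keep-list and connection URI are made up; error handling and
| the live-migration step are omitted):
|
|     import libvirt
|
|     KEEP_RUNNING = {"mailstore", "ldap01"}   # critical VMs (example)
|
|     conn = libvirt.open("qemu:///system")
|     for dom in conn.listAllDomains():
|         if not dom.isActive() or dom.name() in KEEP_RUNNING:
|             continue
|         # managedSave() writes the guest's memory state to disk and
|         # stops it; the next start resumes where it left off.
|         print("suspending to disk:", dom.name())
|         dom.managedSave(0)
|     conn.close()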
| oasisbob wrote:
| Good plan! I think this is a relatively common practice within
| some corners of the telecom world.
|
| At university (Western WA in B'ham), I worked for our campus
| resnet, which had extensive involvement with other networking
| groups on campus. They ran layers 3 and below on the resnets, we
| took DNS+DHCP, plus egress, and everything through to layers 8
| and 9.
|
| The core network gear was co-located in a few musty basements
| along with the telephone switches. DC and backup power was
| available, but severely limited under certain failure scenarios.
|
| All of the racked networking gear in the primary space was
| labeled with red and green dots. Green was first to go in any
| load-shedding scenario. Think: redundant LAN switches, switches
| carrying local ports, network monitoring servers, other +1
| redundant components, etc.
|
| I'm not sure if the scheme was ever required in real life, but do
| know it was based on hard-earned experiences like the author's
| here.
| red-iron-pine wrote:
| Used to run data centers for ISPs and such around NoVA.
|
| This was built into the building plan by room, with most rooms
| going down first and Meet-Me-Rooms + the rooms immediately
| adjacent where the big iron routers were, being the last to
| fail. It's been a while but IIRC there weren't any specific by-
| rack or by system protocols.
| _dan wrote:
| Similar thing (catastrophic aircon failure due to a flood in a
| crap colocated DC) happened to us too before we shifted to AWS.
| Photos from the colo were pretty bizarre - fans balanced on
| random boxes, makeshift aircon ducting made of cardboard and
| tape, and some dude flailing an open fire door back and forth all
| day to get a little bit of fresh air in. Bizarre to see in
| 2010-ish with multi million dollar customers.
|
| We ended up having to strategically shut servers down as well,
| but the question of what's critical, where is it in the racks,
| and what's next to it was incredibly difficult to answer. And
| kinda mind-bending - we'd been thinking of these things as
| completely virtualised resources for years, suddenly having to
| consider their physical characteristics as well was a bit of a
| shock. Just shutting down everything non-critical wasn't enough -
| there were still now critical non-redundant servers next to each
| other overheating.
|
| All we had to go on was an outdated racktables install, a readout
| of the case temperature for each, and a map of which machine was
| connected to which switch port which loosely related to position
| in the rack - none completely accurate. In the end we got the
| colo guys to send a photo of the rack front and back and (though
| not everything was well labelled) we were able to make some
| decisions and get things stable again.
|
| In the end one server that was critical but we couldn't get to
| run cooler we got lucky with - we were able to pull out the
| server below and (without shutting it down) have the on site
| engineer drop it down enough to crack the lid open and get some
| cool air into it to keep it running (albeit with no redundancy
| and on the edge of thermal shutdown).
|
| We came really close to a major outage that day that would have
| cost us dearly. I know it sounds like total shambles (and it
| kinda was) but I miss those days.
| macintux wrote:
| I find it's much less stressful to rescue situations where it
| wasn't your fault to begin with. Absent the ability to point
| fingers at a vendor, crises like that are a miserable
| experience for me.
| organsnyder wrote:
| Being the hero always feels better than cleaning up your own
| messes.
| SonOfLilit wrote:
| > have the on site engineer drop it down enough to crack the
| lid open
|
| Took me four reads to find an alternative way to read it other
| than "we asked some guy that doesn't even work for us to throw
| it on the ground repeatedly until the cover cracks open", like
| that Zoolander scene.
| _dan wrote:
| Honestly that was pretty much the situation.
|
| In our defence, he offered. It had hit hour 6 of both the
| primary and the backup aircon being down, on a very hot day -
| everyone was way beyond blame and the NOC staff were
| basically up for any creative solution they could find.
| gottorf wrote:
| > some dude flailing an open fire door back and forth all day
| to get a little bit of fresh air in
|
| That's hilarious (probably for you as well, in hindsight). Do
| you feel comfortable naming and shaming this DC, so we know to
| avoid it?
| VikingCoder wrote:
| I often think of the SNL sketch with the lines, "We should keep a
| list of our clients, and how much money they have given us."
| https://www.youtube.com/watch?v=reUjEFI0Slc
| mgaunard wrote:
| If machines aren't important, why do you have them at all?
|
| Because it's a university and we don't care about justifying
| costs?
| barrucadu wrote:
| It's possible to have two important things and yet for one to
| be more important than the other.
| syslog wrote:
| Any server that must not fail is just not important enough.
|
| (Build your services with HA in mind, so you don't have to worry
| about a situation like this one.)
| PaulKeeble wrote:
| One place I worked did a backup power test and when they came
| back from the diesels to the grid the entire datacentre lost
| power for about 10 seconds due to a software bug. It caused a
| massive outage.
|
| The problem was a lot of the machines pulled their OS image from
| central storage servers and there was nowhere near enough IO to
| load everything and they had to prioritise what to bring up first
| to lighten the load and stop everything thrashing. It was a
| complete nightmare even though the front end to take sales was
| well isolated from the backend. Working out what was most
| important across an entire corporation took as long as the
| problem resolving itself slowly by just bringing things up randomly.
|
| Nowadays you would just run multiple datacentres or cloud HA and
| we have SSDs but I just can't see such an architecture
| understanding being possible for any reasonably large company.
| The cost of keeping it and the dependencies up to date would be
| huge and it would always be out of date. More documentation isn't
| the solution, it's to have multiple sites.
| jiggawatts wrote:
| That brings back memories of a similar setup with hundreds of
| Windows servers booting over the network. We had regular
| "brownouts" even during the day just because the image
| streaming servers couldn't handle the IOPS. Basic maintenance
| would slow down the servers for ten thousand users and generate
| support tickets.
|
| I jumped up and down and convinced management to buy one of the
| first enterprise SSDs on the market. It was a PCIe card form
| factor and cost five digits for a tiny amount of storage.
|
| We squeezed in the images using block-level deduplication and
| clever copy scripts that would run the compaction routine after
| each file was copied.
|
| The difference was staggering. Just two of those cards made
| hundreds of other servers run like greased lightning. Boot
| times dropped to single digit seconds instead of minutes.
| Maintenance changes could be done at any time with zero impact
| on users. The whole cluster could be rebooted all at once with
| only a slight slowdown. Fun times.
| quickthrower2 wrote:
| This is where the cloud kicks ass. Run multiple nodes with geo
| redundancy (with the "where" based on various concerns: cost, contracts,
| legal). But nodes should cross data centres. Maybe if one city
| gets nuked (literally or a fire/power outage) you still have
| uptime. Use Kubernetes maybe.
| iwontberude wrote:
| Isn't this the point of decoupling your compute and datastores
| using CSI with disaggregated storage in Kubernetes? So long as you
| keep your datastores available, whatever compute you can manage
| to attach to it from Kubernetes can run whatever you truly need at
| capacities that you can handle with that level of hardware.
| Similarly, you could scale down the workloads on all the machines
| so they generated less heat without turning anything off at the
| expense of performance.
___________________________________________________________________
(page generated 2024-02-06 23:00 UTC)