[HN Gopher] AWS doesn't make sense for scientific computing
___________________________________________________________________
AWS doesn't make sense for scientific computing
Author : lebovic
Score : 206 points
Date : 2022-10-07 15:28 UTC (7 hours ago)
(HTM) web link (www.noahlebovic.com)
(TXT) w3m dump (www.noahlebovic.com)
| renewiltord wrote:
| > _Most scientific computing runs on queues. These queues can be
| months long for the biggest supercomputers - that 's the API
| equivalent of storing your inbound API requests, and then
| responding to them months later_
|
| Makes sense if the jobs are all low urgency.
|
| We have a similar problem in trading so we have a composite
| solution with non-cloud simulation hardware and additional AWS
| hardware. That's because we have the high utilization solution
| combined with high urgency.
| Fomite wrote:
| I did have to chuckle a bit because, working on HPC simulations
| of the pandemic during the pandemic, there was an awful lot of
| "This needs to be done tomorrow" urgency.
| prpl wrote:
| Actual computing is fine for most use cases (spot instances,
| preemptible VMs on GCP) and has been used in lots of situations,
| even at CERN. Where the cloud also excels is if you need any
| kind of infrastructure, because no HPC center has figured out a
| reasonable approach to that (some are trying with k8s). Also,
| obviously, you get a huge selection of hardware.
|
| Where cloud/AWS doesn't make sense is storage, especially if you
| need egress, and if you actually need InfiniBand.
| wistlo wrote:
| Database analyst for a large communication company here.
|
| I have similar doubts about AWS for certain kinds of intensive
| business analysis. Not API based transactions, but back-office
| analysis where complex multi-join queries are run in sequence
| against tables with 10s of millions of records.
|
| We do some of this with SQL servers running right on the desktop
| (and one still uses Excel with VLOOKUP). We have a pilot project
| to try these tasks in a new Azure instance. I look forward to
| seeing how it performs, and at what cost.
| AyyWS wrote:
| Do you have disaster recovery / high availability requirements?
| SQL server on a desktop has a lot of single points of failure.
| captainmuon wrote:
| A former colleague did his PhD in particle physics with a novel
| technique (matrix element method). I can't really explain it, but
| it is extremely CPU intensive. That working group did it on
| CERN's resources, and they had to borrow quotas from a bunch of
| other people. For fun they calculated how much it would have cost
| on AWS and came up with something ridiculous like 3 million
| euros.
| wenc wrote:
| I can't speak specifically to CERN and the exact workload. But
| bear in mind that the 3MM euros is non-negotiated sticker
| pricing. In real life, negotiated pricing can be much, much
| lower depending on your org size and spend. This is a variable
| most people neglect.
| captainmuon wrote:
| That is true, and a large part of the theoretical cost was
| probably also traffic, and the use of nonstandard nodes. They
| could have gotten a much more realistic price.
|
| I guess the point is also that scientists often don't realize
| that compute costs money when the computers are already bought.
| dguest wrote:
| The bigger experiments will routinely burn through tens of
| millions worth of computing. But 10 million euros isn't much
| for these experiments. The issue is that they are publicly
| funded: any country is much happier to build a local computing
| center and lend it to scientists than to fork the money over to
| an American cloud provider.
|
| (The expensive part of these experiments is simulating billions
| of collisions and how the thousands of outgoing particles
| propagate through a detector the size of a small building.
| Simulating a single event takes around a minute on a modern
| CPU, and the experiments will simulate billions of events in a
| few months. If AWS is charging 5 cents a minute it works out to
| tens of millions easy.)
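|
| A back-of-envelope version of that arithmetic (the 1e9 event
| count is an illustrative assumption; the ~1 CPU-minute per event
| and 5 cents/minute are the figures above):
|
|     events = 1_000_000_000        # "billions of events" (assumed 1e9)
|     cpu_minutes_per_event = 1     # ~1 minute per event on a modern CPU
|     price_per_cpu_minute = 0.05   # the 5 cents a minute quoted above
|     total = events * cpu_minutes_per_event * price_per_cpu_minute
|     print(f"${total:,.0f}")       # -> $50,000,000
|
| So "tens of millions" falls out before storage and egress are
| even counted.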
| rirze wrote:
| I would imagine CERN's resources are essentially a data center
| comparable to a small cloud provider's resources.
| somesortofthing wrote:
| The author makes a convincing argument against doing this
| workload on on-demand instances, but what about spot instances?
| AWS explicitly calls out scientific computing as a major use
| case for spot instances in its training/promotional materials.
| Given the advertised ~70-90% markdown on spot instance time, it
| seems like a great option compared to paying almost the same
| amount as the workstation but not having to pay to buy, maintain,
| or replace the hardware.
| lebovic wrote:
| Author here! Spot instance pricing is better than on-demand,
| but it doesn't include data transfer, and it's still more
| expensive than on-prem/Hetzner/etc. Data transfer costs exceed
| the cost of the instance itself if you're transferring many TB
| off AWS.
|
| For one of the more popular AWS instance types I use - a
| c5a.24xlarge, used for comparison in the post - the cheapest
| spot price over the past month in us-east-1 was $1.69/hr. That's
| still $1233.70/mo: above on-prem, colo, or Hetzner pricing.
| Data transfer is still extremely expensive.
|
| That said, for bursty loads that can't be smoothed with a
| queue, spot instances (or just normal EC2 instances) do make
| sense! I use them all the time for my computational biology
| company.
| latchkey wrote:
| I read this as a thinly veiled advertisement for the author's
| service, Toolchest.
| lebovic wrote:
| Toolchest actually runs scientific computing on AWS! I'm just
| frustrated by the limits on what we can build, because most
| scientific compute can't effectively shift to AWS.
| latchkey wrote:
| As others have noted, there are many other providers out
| there. I think your essay would have had more value if it
| didn't end with an advertisement.
| [deleted]
| gammarator wrote:
| Astronomy is moving more and more to cloud computing:
|
| https://www.nature.com/articles/d41586-020-02284-7
|
| https://arxiv.org/abs/1907.06320
| Moissanite wrote:
| This has been my exact field of work for a few years now; in
| general I have found that:
|
| When people claim it is 10x more expensive to use public cloud,
| they have no earthly idea what it actually costs to run an HPC
| service, a data centre, or do any of the associated maintenance.
|
| When the claim is 3x more expensive in the cloud, they do know
| those things but are making a bad faith comparison because their
| job involves running an on-premises cluster and they are scared
| of losing their toys.
|
| When the claim is 0-50% more to run in the cloud, someone is
| doing the math properly and aiming for a fair comparison.
|
| When the claim is that cloud is cheaper than on-prem, you are
| probably talking to a cloud vendor account manager whose
| colleagues are wincing at the fact that they just torched their
| credibility.
| lebovic wrote:
| Author here! I think running an HPC service that has a steady
| queue in AWS can be more than 3x as expensive.
|
| What type of HPC do you work in? Maybe I'm over-indexing on
| computational biology.
| Moissanite wrote:
| All types of HPC; I'm a sysadmin/consultant. I don't think
| the problem with the cost gap is overestimating cloud costs
| but rather underestimating on-prem costs. Also, failing to
| account for financing differences and opportunity-costs of
| large up-front capital purchases.
| johnklos wrote:
| This is oversimplifying things a bit.
|
| It can categorically be stated that for a year's worth of CPU
| compute, local will always cost less than Amazon. Of course,
| putting percentages on it doesn't work - there are just too
| many variables.
|
| There are many admins out there who have no idea what an Alpha
| is who'll swear that if you're not buying Dell or HP hardware
| at a premium with expensive support contracts, you're doing
| things wrong and you're not a real admin. Visit Reddit's
| /r/sysadmin if you want to see the kind of people I'm talking
| about.
|
| The point is that if people insist on the most expensive, least
| efficient type of servers such as Dell Xeons with ridiculous
| service contracts, the savings over Amazon won't be large.
|
| It's a cumulative problem, because trying to cool and house
| less efficient hardware requires more power and that hardware
| ultimately has less tolerance for non-datacenter cooling.
|
| Rethink things. You can have AMD Threadripper / EPYC systems in
| larger rooms that require less overall cooling, that have
| better temperature tolerance, and that are more reliable in
| aggregate. They cost less, and you can easily keep spare parts
| around, which gives better turnaround and availability than
| support contracts from Dell / HP. Suddenly your compute costs
| are halved, because of pricing, efficiency, overall power, real
| estate considerations...
|
| So percentages don't work, but the bottom line is that when
| you're doing lots of compute, over time it's always cheaper
| locally, even if you do things the "traditional" expensive and
| inefficient way. Arguing percentages with so many variables
| doesn't make any sense - it's still cheaper, no matter what.
| thayne wrote:
| > Most scientific computing runs on queues. These queues can be
| months long for the biggest supercomputers
|
| That sounds very much like an argument _for_ a cloud. Instead of
| waiting months to do your processing, you spin up what you need,
| then tear it down when you are done.
| withinboredom wrote:
| The queue then just turns into the bank account. The queue
| doesn't magically go away.
| adolph wrote:
| Classic iron triangle, pick two:
|
| * cheap
| * fast
| * available
|
| https://en.wikipedia.org/wiki/Project_management_triangle
| twawaaay wrote:
| On the other hand it makes sense if you just need to borrow their
| infrastructure for a while to calculate something.
|
| A lot of scientific computing doesn't happen continuously; a lot
| of it is a one-time experiment, or maybe run a couple of times,
| after which you would have to tear down and reassign the
| hardware.
|
| Another fun fact people forget is that our ability to predict
| the future is still pretty poor. Not only that, we are biased
| towards thinking we can predict it when in fact this is
| complete bullshit.
|
| You have to buy and set up infrastructure before you can use it,
| and then you have to be ready to use it. What if you are not
| ready? What if you don't need as many resources as planned? What
| if you stop needing it earlier than you thought? When you borrow
| it from AWS, you have the flexibility to start using it when you
| are ready and drop it the moment you no longer need it. That has
| value on its own.
|
| At the company I work for, we basically banned signing long-term
| contracts for discounts. We found that, on average, we pay many
| times more for unused services than whatever we gained through
| discounts. Also, when you pay per resource there is an incentive
| to improve efficiency. When you have basically prepaid for
| everything, that incentive is very small and mostly limited to
| making sure you stay within your limits.
| bsenftner wrote:
| This is the case for a large class of big data + high compute
| applications. Animation and simulation in engineering, planning,
| and forecasting - not to mention entertainment - require
| pipelines for which the typical cloud is simply too expensive.
| didip wrote:
| No way. I vehemently disagree.
|
| When a company reaches a certain mass, hardware cost is a factor
| that is considered, but not a big one.
|
| The bigger problems are lost opportunity costs and unnecessary
| churn.
|
| Businesses lose a lot when the product launch is delayed by a
| year simply because the hardware arrived late or has too many
| defects (ask your hardware fulfillment people how many defective
| RAM sticks and SSDs they get per new shipment).
|
| Churn can cost the business a lot as well. For example, imagine
| the model that everyone has been using was trained on a Mac Pro
| under XYZ's desk. Then when XYZ quits, they never properly back
| up the code and the model.
|
| Bare metal allows for sloppiness that the cloud cannot afford to
| allow. Accountability and ownership are a lot more apparent in
| the cloud.
| idiot900 wrote:
| This rings true for me. I have a federal grant that prohibits me
| from using its funds for capital acquisitions: i.e. servers. But
| I can spend it on AWS at massive cost for minimal added utility
| for my use case. Even though it would be a far better use of
| taxpayer funds to buy the servers, I have to rent them instead.
| giantrobot wrote:
| I'm not saying AWS is automatically the best option but the
| question isn't just servers. It's servers, networking hardware,
| HVAC, a facility to put them all in, and at least a couple
| people to run and maintain it all. The TCO of some servers is
| way higher than the cost of the hardware.
| adgjlsfhk1 wrote:
| Can you get your university to buy some servers for unrelated
| reasons and have them rent them to you?
| chrisseaton wrote:
| Well that's just rebuilding AWS badly. I've used academic-
| managed time-sharing setups and have some horror stories.
| lostmsu wrote:
| Doesn't have to be a university either. Depending on the
| amount of compute needed any capable IT guy can do it for you
| from their garage with a contract.
| boldlybold wrote:
| Lots of places (Hetzner for example) will rent you servers at
| 10-25% the cost of AWS if you want dedicated hardware, without
| the ability to autoscale. You can even set up a K8s cluster
| there if the overhead is worth it.
| intelVISA wrote:
| Fond memories of Hetzner asking for my driving license as ID
| for renting a $2 VPS. Lost a customer for life with that
| nonsense.
| testplzignore wrote:
| > prohibits me from using its funds for capital acquisitions
|
| What is a legitimate reason for this restriction?
| blep_ wrote:
| I can think of a few ways to abuse it while still spinning it
| as "for research". The obvious one is to buy a $9999 gaming
| machine with several of whatever the fanciest GPU on the
| market is at the time, and say you're doing machine learning.
|
| So my guess is it's an overly broad patch for that sort of
| thing.
| Fomite wrote:
| Not really - this is also true for things with no
| particular "civilian" use.
| Fomite wrote:
| Basically, the granting organization doesn't want to pay for
| the full cost of capital equipment that will - either via
| time or capacity - not be fully used for that grant.
|
| There are other grant mechanisms for large capital
| expenditures.
|
| The problem is the thresholds haven't shifted in a long time,
| so you can easily trigger it with a nice workstation. But
| then, the budget for a modular NIH R01 was set in 1999, so
| that's hardly a unique problem.
| lowbloodsugar wrote:
| >Even 2.5x over building your own infrastructure is significant
| for a $50M/yr supercomputer.
|
| Can't imagine you are paying public prices on any cloud provider
| if you have a $50M/yr budget.
|
| In addition, if, as the article states, the scientists are ok to
| wait some considerable time for results, then one can run most,
| if not all, on spot instances, and that can save 10x right there.
|
| If you don't have $50M/yr there are companies that will move your
| workload around different AWS regions to get the best price - and
| will factor in the cost of transferring the data too.
|
| I was an architect at a large scientific company using AWS.
| lebovic wrote:
| Author here. I agree that pricing is highly negotiable for any
| large cloud provider, and there are even (capped) egress fee
| waivers that you can negotiate as a part of your contract.
| There's also a place for using AWS; I used it for a smaller DNA
| sequencing facility, and I use it for my computational biology
| startup.
|
| That said, I'll repeat something that I commented somewhere
| else: most of scientific computing (by % of compute) happens in
| a context that still doesn't make sense in AWS. There's often a
| physical machine within the organization that's creating data
| (e.g. a DNA sequencer, particle accelerator, etc), and a well-
| maintained HPC cluster that analyzes that data.
|
| Spot instances are still pretty expensive for a steady queue
| (2x Hetzner monthly costs, for reference), and you still have
| to pay AWS data transfer egress costs - which are at least 30x
| more expensive than a colo or on-prem, if you're saturating a 1
| Gbps link. Moving data around to optimize for spot instance
| pricing becomes prohibitive when your job has 100 TB of raw
| data.
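|
| A rough sketch of where that "30x" comes from ($0.09/GB is the
| first paid internet-egress tier on AWS's list pricing; the
| ~$1,000/mo figure for a flat-rate 1 Gbps colo port is an
| illustrative assumption):
|
|     # egress cost of saturating a 1 Gbps link for a 30-day month
|     gbps = 1
|     seconds = 30 * 24 * 3600
|     gb_moved = gbps / 8 * seconds         # ~324,000 GB
|     aws_egress = gb_moved * 0.09          # ~$29,160 at list price
|     colo_port = 1000                      # assumed flat-rate 1 Gbps port
|     print(round(aws_egress / colo_port))  # ~29x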
| dastbe wrote:
| using on-demand for latency insensitive work, especially when
| you're also very cost sensitive, isn't the right choice. spot
| instances will get you somewhere in the realm of the hetzner/on-
| prem numbers.
| Sebb767 wrote:
| But, as the article points out, you are still paying a lot of
| money for features that you don't need for scientific
| computing.
|
| Also, AWS is notoriously easy to undercut with on-prem
| hardware, especially if your budget is large and your uptime
| requirements aren't - you'll save a few hundred thousand a year
| alone by not having to hire expert engineers for on-call duty
| and extreme reliability.
| lebovic wrote:
| Even spot instances on AWS are still over 2x more expensive per
| month than Hetzner. The cheapest c5a.24xlarge spot instance
| right now is $1.5546/hr in us-east-1c. That's $1134.86/mo,
| excluding data transfer costs. If you transfer out 10 TB over
| the course of a month, that's another $921.60/mo - or now 4x
| more expensive than Hetzner.
|
| Using the estimate from the article, spot instances are still
| over 8x more expensive than on-prem for scientific computing.
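|
| The same numbers as a quick script (730 hours/month and $0.09/GB
| egress are the usual AWS conventions; the Hetzner figure is the
| ~$512.92/mo quoted elsewhere in the thread):
|
|     spot_hourly = 1.5546          # cheapest c5a.24xlarge spot, us-east-1c
|     compute = spot_hourly * 730   # ~$1,134.86/mo
|     egress = 10 * 1024 * 0.09     # 10 TB out at $0.09/GB = $921.60
|     hetzner = 512.92              # dedicated box, per the article
|     print(round((compute + egress) / hetzner, 1))   # ~4x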
| dekhn wrote:
| Even more importantly, if you have any reasonable amount of
| spend on cloud, you can get preferred pricing agreements. As
| much as I hate to talk to "salespeople", I did manage to cut
| millions in costs per year with discounts on serving and
| storage.
|
| Personally, when I estimate the total cost of ownership of
| scientific cloud computing versus on prem (for extremely large-
| scale science with significant server, storage, and bandwidth
| requirements), the cloud ends up winning for a number
| of reasons. I've seen a lot of academics who disagree but then
| I find out they use their grad students to manage their
| clusters.
| thesausageking wrote:
| I'm suspicious of the author's actual experience.
|
| The fact that scientific computing has a different pattern than
| the typical web app is actually a good thing. If you can
| architect large batch jobs to use spot instances, it's 50-80%
| cheaper.
|
| Also this bit: "you can keep your servers at 100% utilization by
| maintaining a queue of requested jobs" isn't true in practice.
| The pattern of research is that the work normally comes in waves.
| You'll want to train a new model or run a number of large
| simulations. And then there will be periods of tweaking and work
| on other parts. And then more need for a lot of training. Yes,
| you can always find work to put on a cluster to keep it >90%
| utilization, but if it can be elastic (and compute has a budget
| attached to it), it will rise and fall.
| lebovic wrote:
| Author here! I worked on the computing infrastructure for a
| DNA sequencing facility, and I run a computational biology
| infrastructure company (trytoolchest.com, YC W22). Both are
| built on AWS, so I do think AWS in scientific computing has its
| use-cases - mostly in places where you can't saturate a queue
| or you want a fast cycle time.
|
| Spot instances are still pretty expensive for a steady queue
| (2x Hetzner monthly costs, for reference), and you still have
| to pay AWS data transfer egress costs - which are at least 30x
| more expensive than a colo or on-prem, if you're saturating a 1
| Gbps link.
|
| This post was born from frustration at AWS for their pricing
| and offerings after trying to get people to switch to AWS in
| scientific computing for years :)
| Fomite wrote:
| One of the aspects not touched on for this is PII/confidential
| data/HIPAA data, etc.
|
| For that, whether it makes sense or not, a lot of universities
| are moving to AWS, and the infrastructure cost of AWS for what
| would be a pretty modest server is still considerably less than
| the cost of complying with the policies and regulations involved
| in that.
|
| Recently at my institution I asked about housing it on premises,
| and the answer was that IT supports AWS, and if I wanted to do
| something else, supporting that - as well as the responsibility
| for a breach - would rest entirely on my shoulders. Not doing
| that.
| [deleted]
| KaiserPro wrote:
| It's much more complex than described.
|
| The author is making a brilliant argument for getting a
| secondhand workstation and shoving it under their desk.
|
| If you are doing multi-machine, batch-style processing, then you
| won't be using on-demand, you'd use spot pricing. The missing
| argument in that part is storage costs. Managing a high-speed,
| highly available synchronous file system that can do a sustained
| 50 GB/sec is hard bloody work (no, S3 isn't a good fit - too
| much management overhead).
|
| Don't get me wrong, AWS _is_ expensive if you are using a
| machine for more than a month or two.
|
| However, if you are doing highly parallel stuff, Batch and
| Lustre on demand are pretty ace.
|
| If you are doing a multi-year project, then real steel is where
| it's at - assuming you have factored in hosting, storage, and
| admin costs.
| bushbaba wrote:
| Check out Apache Iceberg, which makes it fairly trivial to get
| high throughput from S3 without much fine-tuning. Bursts from 0
| to 50 Gbps should be possible from S3 without much effort; just
| use object sizes that are in the NN+ MiB range. Personally, I
| find Lustre a mess: it's expensive and even more of a pain to
| fine-tune.
| awiesenhofer wrote:
| From https://iceberg.apache.org
|
| > Iceberg is a high-performance format for huge analytic
| tables.
|
| How would that help speed up S3? Genuine question.
| gautamdivgi wrote:
| Even for multi-year projects, if you factor in everything, does
| it still come out cheaper than AWS? Would you be running
| everything 24x7 on an HPC cluster? I don't think so. You need
| scale at some points, and there are probably times when research
| is done on your desktop.
|
| You could invest in an HPC - but I think the human cost of
| maintaining one especially if you're in a high cost of living
| area (e.g. Bay Area, NYC, etc.) is going to be pretty high.
| Admin cost, UPS, cable wiring, heat/cooling etc. can all be
| pretty expensive. Maintenance of these can be pretty pricey
| too.
|
| Are there any companies that remotely manage data centers and
| rent out bare metal infra?
| lostmsu wrote:
| Isn't 50 GB/sec like 5 NVMe Gen 5 SSDs + 1 or 2 for redundancy?
|
| Actually, you are right. Consumer SSDs I've seen only do about
| 1.5GB/s sustained.
| davidmr wrote:
| Not in the context the person you responded to meant it. Yes,
| you can very easily get 50GB/s from a few NVMe devices on a
| single box. Getting 50GB/s on a POSIX-ish filesystem exported
| to 1000 servers is very possible and common, but orders of
| magnitude more complicated. 500GB/s is tougher still. 5TB/s
| is real tough, but real fun.
| pclmulqdq wrote:
| Even (high-end) consumer SSDs can saturate a PCIe gen 4 x4
| link if you are doing sequential reads. Non-sequential hurts
| on even enterprise SSDs.
| wenc wrote:
| Calculating costs based on sticker price is sometimes misleading
| because there's another variable: negotiated pricing, which can
| be much much lower than sticker prices, depending on your
| negotiating leverage. Different companies pay different prices
| for the same product.
|
| If you've ever worked at a big company or university (any place
| where you spend at scale), you'll know you rarely pay sticker
| price. Software licensing is particularly elastic because it's
| almost zero marginal cost. Raw cloud costs are largely a function
| of energy usage and amortized hardware costs -- there's a certain
| minimum you can't go under, but there remains a huge margin
| that is open to negotiation.
|
| Startups/individuals rarely even think about this because they
| rarely qualify. But big orgs with large spends do. You can get
| negotiated cloud pricing.
| racking7 wrote:
| This is definitely true for cloud retail prices. However, in
| cases I've seen it stops being true when there is already an
| existing discount in place - reserved instances, for example.
| bee_rider wrote:
| Is genomic code typically distributed-memory parallel? I'm under
| the impression that it is more like batch processing, not a ton
| of node-to-node communication but you want lots of bandwidth and
| storage.
|
| If you are doing a big distributed-memory numerical simulation,
| on the other hand, you probably want infiniband I guess.
|
| AWS seems like an OK fit for the former, maybe not great for the
| latter...
| pclmulqdq wrote:
| The fastest way to do a lot of genomics stuff is with FPGA
| accelerators, which also aren't used by most of the other
| tenants in a multi-tenant scientific computing center. The
| cloud is perfect for that kind of work.
| bee_rider wrote:
| That's interesting. It is sort of funny that I was right
| (putting genomics in the "maybe good for cloud" bucket) for
| the wrong reason (characterizing it as more suited for
| general-purpose commodity resources, rather than suited for
| the more niche FPGA platform).
| timeu wrote:
| As an HPC sysadmin for 3 research institutes (mostly life
| sciences & biology) I can't see how a cloud HPC system could be
| any cheaper than an on-prem HPC system, especially if I look at
| the resource efficiency (how many resources were requested vs
| how many were actually used) of our users' SLURM jobs. Often the
| users request 100s of GB but only use a fraction of it. On our
| on-prem HPC system this might decrease utilization (which is not
| great), but in the cloud it would result in increased computing
| costs (because of a bigger VM flavor), which would probably be
| worse (CapEx vs OpEx). Of course you could argue that the users
| should know better and properly size/measure their resource
| requirements, but most of our users have a lab background and
| are new to computational biology, so estimating or even knowing
| what all the knobs (cores, mem per core, total memory, etc.) of
| the job specification mean is hard for them. We try to educate
| by providing trainings and job efficiency reporting, but the
| researchers/users have little incentive to optimize their job
| requests and are more interested in quick results and turnover,
| which is also understandable (the on-prem HPC system is already
| paid for). Maybe the cost transparency of the cloud would force
| them, or rather their group leaders/institute heads, to put a
| focus on this, but until you move to the cloud you won't know.
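|
| To put numbers on the over-request problem: if a job asks for
| 256 GB but peaks at 40 GB, on-prem that's just poor utilization,
| but in the cloud you pay for a VM sized to the request. A rough
| sketch (the per-GB-hour rate is an illustrative assumption, not
| a quote):
|
|     requested_gb, used_gb = 256, 40
|     runtime_hours = 48
|     price_per_gb_hour = 0.005   # assumed memory-dominated VM pricing
|     wasted = (requested_gb - used_gb) * runtime_hours * price_per_gb_hour
|     efficiency = used_gb / requested_gb
|     print(f"{efficiency:.0%} memory efficiency, ~${wasted:.2f} for idle RAM")
|
| And that is per job, across thousands of jobs.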
|
| Additionally, the typical workloads that run on our HPC system
| are often badly maintained bioinformatics software or
| R/Perl/Python throwaway scripts, and often enough a typo in the
| script causes the entire pipeline to fail after days of running
| on the HPC system, so it needs to be restarted (maybe even
| multiple times). Again, on the on-prem system you have wasted
| electricity (bad enough), but in the cloud you have to pay the
| computing costs of the failed runs. Cost transparency might
| force a fix for this, but the users are not software engineers.
|
| One thing that the cloud is really good at is elasticity and
| access to new hardware. We have seen, for example, a shift of
| workloads from pure CPUs to GPUs. A new cryo-EM microscope was
| installed whose downstream analysis relies heavily on GPUs,
| more and more research groups run AlphaFold predictions, and
| NGS analysis is now using GPUs as well. We have around 100 GPUs,
| average utilization has increased to 80-90%, and the users are
| complaining about long waiting/queueing times for their GPU
| jobs. For this, bursting to the cloud would be nice; however,
| GPUs are unfortunately prohibitively expensive in the cloud, and
| the above-mentioned caveats regarding job resource efficiency
| still apply.
|
| One thing that will hurt on-prem HPC systems, though, is the
| increase in electricity prices. We are now taking measures to
| actively save energy (e.g. by powering down idle nodes and
| powering them up again when jobs are scheduled). As far as I can
| tell, the big cloud providers (AWS, etc.) haven't increased
| their prices yet, either because they cover the electricity cost
| increase with their profit margins or because they are not
| affected as much thanks to better deals with electricity
| providers.
| kortex wrote:
| What does the landscape look like now for "terraform for bare
| metal"?. Is ansible/chef still the main name in town? I just
| wanna netboot some lightweight image, set up some basic network
| discovery on a control plane, and turn every connected box into a
| flexible worker bee I can deploy whatever cluster control layer
| (k8s/nomad) on top of and start slinging containers.
| nwilkens wrote:
| I really like this description of how baremetal infrastructure
| should work, and this is where I think (shameless self
| promotion) Triton DataCenter[1] plays really well today on-
| prem.
|
| PXE booted lightweight compute nodes with a robust API,
| including operator portal, user portal, and cli.
|
| Keep an eye out for the work we are doing with Triton Linux +
| K8s. Very lightweight Triton Linux compute node + baremetal k8s
| deployments on Triton.
|
| [1] https://www.tritondatacenter.com
| jrm4 wrote:
| I imagine what makes this especially hard is you have (at least)
| three parties in play here:
|
| - the people doing the research
|
| - the institution's IT services group
|
| - the administrator who writes the checks
|
| And in my experience, "actual knowledge of what must be done and
| what it will or could cost" can vary greatly across these three
| groups; frequently in very unintuitive ways.
| Fomite wrote:
| This is the biggest point of friction. I spent the better part
| of a year trying to get a postdoc admin access to his machine.
| rpep wrote:
| I think there are some things this misses about the scientific
| ecosystem in Universities/etc. that can make the cloud more
| attractive than it first appears:
|
| * If you want to run really big jobs e.g. with multiple multi-GPU
| nodes, this might not even be possible depending on your
| institution or your access. Most research-intensive Universities
| have a cluster but they're not normally big machines. For
| regional and national machines, you usually have to bid for
| access for specific projects, and you might not be successful.
|
| * You have control of exactly what hardware and OS you want on
| your nodes. Often you're using an out of date RHEL version and
| despite spack and easybuild gaining ground, all too often you're
| given a compiler and some old versions of libraries and that's
| it.
|
| * For many computationally intensive studies, your data transfer
| actually isn't that large. For e.g. you can often do the post-
| processing on-node and then only get aggregate statistics about
| simulation runs out.
| danking00 wrote:
| I think this post is identifying scientific computing with
| simulation studies and legacy workflows, to a fault. Scientific
| computing includes those things, but it _also_ includes
| interactive analysis of very large datasets as well as workflows
| designed around cloud computing.
|
| Interactive analysis of large datasets (e.g. genome & exome
| sequencing studies with 100s of 1000s of samples) is well suited
| to low-latency, server-less, & horizontally scalable systems
| (like Dremel/BigQuery, or Hail [1], which we build and is
| inspired by Dremel, among other systems). The load profile is
| unpredictable because after a scientist runs an analysis they
| need an unpredictable amount of time to think about their next
| step.
|
| As for productionized workflows, if we redesign the tools used
| within these workflows to directly read and write data to cloud
| storage as well as to tolerate VM-preemption, then we can exploit
| the ~1/5 cost of preemptible/spot instances.
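|
| A minimal sketch of what "tolerate VM-preemption" means in
| practice: persist progress somewhere durable and resume from it
| on the replacement VM. The local checkpoint file stands in for
| cloud storage, and the work items and processing step are
| placeholders:
|
|     import json, os
|
|     CKPT = "checkpoint.json"         # in real use: an object-store path
|     work_items = list(range(1000))   # stand-in for your shards/samples
|
|     def process(item):               # stand-in for the real analysis step
|         pass
|
|     def load_checkpoint():
|         return json.load(open(CKPT))["next"] if os.path.exists(CKPT) else 0
|
|     for i in range(load_checkpoint(), len(work_items)):
|         process(work_items[i])
|         json.dump({"next": i + 1}, open(CKPT, "w"))   # persist progress
|
| If the VM is preempted mid-loop, the replacement picks up at the
| last saved index instead of starting the run over.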
|
| One last point: for the subset of scientific computing I
| highlighted above, speed is key. I want the scientist to stay in
| a flow state, receiving feedback from their experiments as fast
| as possible, ideally within 300 ms. The only way to achieve that
| on huge datasets is through rapid and substantial scale-out
| followed by equally rapid and substantial scale-in (to control
| cost).
|
| [1] https://hail.is
| jessfyi wrote:
| I've followed Hail and applaud the Broad Institute's work wrt
| establishing better bioinformatics software and toolkits so I
| hope this doesn't come across as rude, but I can't imagine an
| instance in a real industry or academic workflow where you need
| 300ms feedback from an experiment to "maintain flow", considering
| how long experiments on data that large (especially exome
| sequencing!) take overall. My (likely lacking) imagination
| aside, I guess what I'm really saying is that I don't know
| what's preventing the use case you've described from being
| performed locally, considering there'd be even _less_ latency.
| CreRecombinase wrote:
| These MPI-based scientific computing applications make up the
| bulk of compute hours on HPC clusters, but there is a crazy long
| tail of scientists who have workloads that can't (or shouldn't)
| run on their personal computers. The other option is HPC. This
| sucks for a ton of reasons, but I think the biggest one is that
| it's more or less impossible to set up a persistent service of
| any kind. So no databases; if you want spark, be ready to spin it
| up from nothing every day (also no HDFS unless you spin that up
| in your SLURM job too). This makes getting work done harder, and
| it also makes integrating existing work so much harder, because
| everyone's workflow involves reinventing
| everything, and everyone does it in subtly incompatible ways;
| there are no natural (common) abstraction layers because there
| are no services.
| 0xbadcafebee wrote:
| AWS is _fantastic_ for scientific computing. With it you can:
|
| - Deploy a thousand servers with GPUs in 10 minutes, churn over a
| giant dataset, then turn them all off again. Nobody ever has to
| wait for access to the supercomputer.
|
| - Automatically back up everything into cold storage over time
| with a lifecycle policy.
|
| - Avoid the massive overhead of maintaining HPC clusters, labs,
| data centers, additional staff and training, capex, load
| estimation, months/years of advance planning to be ready to start
| computing.
|
| - Automation via APIs to enable very quick adaptation with little
| coding.
|
| - An entire universe of services which ramp up your capabilities
| to analyze data and apply ML without needing to build anything
| yourself.
|
| - A marketplace of B2B and B2C solutions to quickly deploy new
| tools within your account.
|
| - Share data with other organizations easily.
|
| AWS costs are also "retail costs". There are massive savings to
| be had quite easily.
| Fomite wrote:
| One thing to consider:
|
| _I_ don't control my AWS account. I don't even _have_ an AWS
| account in my professional life.
|
| I tell my IT department what I want. They tell the AWS people
| in central IT what they want. It's set up. At some point I get
| an email with login information.
|
| I email them again to turn it off.
|
| Do I hate this system? Yes. Is it the system I have to work
| with? Also yes.
|
| "AWS as implemented by any large institution" is considerably
| less agile than AWS itself.
| [deleted]
| slaymaker1907 wrote:
| Cloud worked really well for me when I was in school. A lot of
| the time, I would only need a beefy computer for a few hours at a
| time (often due to high memory usage) and you can/could rent out
| spot instances for very cheap. There are about 730 hours per
| month so the cost calculus is very different for a
| student/researcher who needs fast turnaround times (high
| performance), but only for a short period of time.
|
| However, I know not all HPC/scientific computing works that way
| and some workloads are much more continuous.
| a2tech wrote:
| That's how my department uses the cloud - we have an image we
| store up at AWS geared towards a couple of tasks, and we spin up
| a big instance when we need it, run the task, pull out the
| results, then stop the machine. Total cost: sub-100 dollars. If
| we had to go to the HPC group we'd have to fight with them to get
| the environment configured, get access to the system, get
| payment set up, teach the faculty to use the environment, etc.
| It's just a pain for very little gain.
| Mave83 wrote:
| I agree with the article. We at croit.io help customers around
| the globe build their clusters and save huge amounts. For
| example, Ceph S3 in any data center of your choice is around
| 1/10 of the AWS S3 price.
| nharada wrote:
| I'd love to buy my own servers for small-scale (i.e. startup size
| or research lab size) projects, but it's very hard to be
| utilizing them 24x7. Does anyone know of open-source software or
| tools that allow multiple people to timeshare one of these? A big
| server full of A100s would be awesome, with the ability to
| reserve the server on specific days.
| jpeloquin wrote:
| > the ability to reserve the server on specific days
|
| In an environment where there are not too many users and
| everyone is cooperative, using Google Calendar to reserve time
| slots works very well and is very low maintenance. Technical
| restrictions are needed only when the users can't be trusted to
| stay out of each other's way.
| didip wrote:
| This is just the cloud with extra steps.
| mbreese wrote:
| I completely agree for most cases. In many scientific computing
| applications, compute time isn't the factor you prioritize in the
| good/fast/cheap triad. Instead, you often need to do things as
| cheaply as possible. And your data access isn't always
| predictable, so you need to keep results around for an extended
| period of time. This makes storage costs a major factor. For us,
| this alone was enough to move workloads away from cloud and onto
| local resources.
| COGlory wrote:
| >a month-long DNA sequencing project can generate 90 TB of data
|
| Our EM facility generates 10 TB of raw data per day, and once you
| start computing it, that increases by 30%-50% depending on what
| you do with it. Plus, moving between network storage and local
| scratch for computational steps basically never ends and keeps
| multiple 10 GbE links saturated 100% of the time.
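|
| For scale, a back-of-envelope number (the 4 passes over the data
| is an illustrative assumption about how often each byte gets
| shuttled between network storage and scratch):
|
|     raw_tb_per_day = 10
|     derived_factor = 1.4   # +30-50% once you start computing on it
|     passes = 4             # assumed moves between storage and scratch
|     bytes_per_day = raw_tb_per_day * 1e12 * derived_factor * passes
|     gbps = bytes_per_day * 8 / 86400 / 1e9
|     print(f"{gbps:.1f} Gbps sustained")   # ~5.2 Gbps average, before bursts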
| bgro wrote:
| When I was looking at AWS for personal use, I first thought it
| was oddly expensive even when factoring in not having to buy the
| hardware. When I looked at just what the electricity cost to run
| it myself would be, I think that addition alone meant AWS
| was actually cheaper. This is without factoring in cooling /
| dedicated space / maintenance.
| betolink wrote:
| I see both sides of the argument; there is a reason why CERN is
| not processing its data using EC2 and Lambdas.
| thamer wrote:
| The vast majority of researchers don't need anywhere close to
| the amount of resources that CERN needs. The fact that CERN
| doesn't use EC2 and lambdas shouldn't be taken as a lesson by
| anyone who's not operating at their scale.
|
| This feels like a similar argument to the one made by people
| who use Kubernetes to ensure their web app with 100 visitors a
| day is web scale.
| harunurhan wrote:
| The cost isn't the only reason
|
| - CERN started planning its computing grid before AWS was
| launched.
|
| - It's pretty complicated (politics, mission, vision) for CERN
| to use external proprietary software/hardware for its main
| functions (they have even started to move away from MS
| Office-like products).
|
| - [cost] CERN is quite different from a small team of
| researchers doing a few years of research. The scale is enormous
| and very long-lived, continuing for decades.
|
| - and more...
|
| HPC and scientific computing aside, I would have loved to be
| able to use AWS when I worked there; the internal infra for
| running web apps and services wasn't nearly as good or reliable,
| nor did it offer as wide a catalog of services.
| betolink wrote:
| I think the spirit of the article is to put the cloud in
| perspective of the organization size and the workload type.
| There is a sweet spot where the cloud is the only option that
| makes sense: with variable loads and the capacity to scale on
| demand as big as your budget allows, there is no match for it.
| However... there are organizations with certain types of
| workloads that could afford to put infrastructure in place, and
| even with the costs of staffing, energy, etc. they will save
| millions in the long run. NASA, CERN, etc. are some. This is not
| limited to HPC; the cloud at scale is not cheap either, see:
| https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-
| cap...
| bluedino wrote:
| We have a 500-node cluster at a chemical company, and we've been
| experimenting with "hybrid cloud". This allows jobs to use
| servers with resources we just don't have, or couldn't add fast
| enough.
|
| Storage is a huge issue for us. We have a petabyte of local
| storage from a big-name vendor that's bursting at the seams and
| expensive to upgrade. A lot of our users leave big files lying
| around for a long time. Every few months we have to hound
| everyone to delete old stuff.
|
| The other thing that you get with the cloud is there's way more
| accountability for who's using how much resources. Right now we
| just let people have access and roam free. Cloud HPC is 5-10x
| more in cost and the beancounters would shut shit down real quick
| if the actual costs were divvied up.
|
| We also still have a legacy datacenter so in a similar vein, it's
| hard to say how much not having to deal with physical
| hardware/networking/power/bandwidth would be worth. Our work is
| maybe 1% of what that team does.
| adolph wrote:
| I can relate to these problems. Cloud brings positive
| accountability that is difficult to justify onprem. I have some
| hope that higher level tools for project/data/experiment
| management (as opposed to a bash prompt and a path) will bring
| some accountability without stifling flexibility.
| julienchastang wrote:
| I've also been skeptical of the commercial cloud for scientific
| computing workflows. I don't think this cost benefit analysis
| mentions it, but the commercial cloud makes even less sense when
| you take into account brick and mortar considerations. In other
| words, if your company/institution has already paid for the
| machine rooms, sys admins, networks, the physical buildings, the
| commercial cloud is even less appealing. This is especially true
| of "persistent services", for example data servers that are
| always on because they handle real-time data.
|
| Another aspect of scientific computing on the commercial cloud
| that's a pain if you work in academia is procurement or paying
| for the cloud. Academic groups are much more comfortable with the
| grant model. They often operate on shoe-string budgets and are
| simply not comfortable entering a credit card number. You can
| also get commercial cloud grants, but they often lack long-term,
| multiyear continuity.
| mattkrause wrote:
| It's often not that they're "not comfortable"; it's that we're
| often flat-out not allowed to.
| Fomite wrote:
| This. It's got nothing to do with "comfort". I use cloud
| computing all the time in the rest of my life, but the rest
| of my life isn't subject to university policies and state
| regulations.
| ordiel wrote:
| Having worked for 2 of the largest cloud providers (1 of them
| beimg the largest) i have to say "The Cloud" just doesnt makes
| sense (maybe with the exception of cloud storage) yet for most
| use cases, this including start ups, small and, mid size
| companies its just way to expensive for the benefits it provides,
| it moves your hardware acquisitions /maintainance cost to
| development costs, you just think better/cheaper because that
| cost comes in small monthly chunks rather than as a single bill,
| plus you add all security risks either those introduced by the
| vendor or those introduced by the masive complexity and poor
| training of the developers which if you want to avoid will have
| to pay by hiring a developer competent in security for that
| particular cloud provider
| manv1 wrote:
| Having worked in 3 startups that were AWS-first, I can say that
| you've learned the completely wrong lessons from your time at
| your cloud providers.
|
| Building on AWS has provided scale, security, and redundancy at
| a substantially lower cost than doing any on-prem solution
| (except for a shitty one strung together with lowendbox
| machines).
|
| The combined AWS bill for the three startups is less than the
| cost of an F5, even on a non-inflation adjusted basis.
|
| The cloud doesn't mean that you can be totally clueless. I've
| had experience in HA, scalability, redundancy, deployment,
| development, networking, etc. It means that if you do know what
| you're doing you can deliver a scalable HA solution at a
| ridiculously lower price point than a DIY solution using bare
| iron and colo.
| ordiel wrote:
| "The combined bill" during which time period?
|
| 1 month, for sure. What about 1 year? Also, did those
| companies require any training or hiring to achieve that?
| Because you also need to add that to the cost comparison.
|
| If you are comparing a one-month bill against a one-time
| purchase (which, if chosen correctly, should not happen more
| than once every 10 years), for sure the cloud will look cheaper.
| When it comes down to scalability, development, and deployment,
| you should check your tech stack rather than your
| infrastructure. Kubernetes and containerization should easily
| take care of those with on-premise hardware while also reducing
| complexity, plus you will no longer have to worry about
| off-the-chart network transit fees.
| jerjerjer wrote:
| Sure? I mean, if you have:
|
| 1) A large enough queue of tasks
|
| 2) Users/downstream willing to wait
|
| using your own infrastructure always wins (assuming free labor)
| since you can load your own infrastructure to ~95% pretty much
| 24/7 which is unbeatable.
| mrweasel wrote:
| It might also depend on how long you're actually willing to
| wait. There's nothing stopping you from having a job queue in
| AWS, and you can set things up so that instances are only
| running if the price is low enough.
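|
| A rough sketch of that price check using boto3 (the region, the
| credentials setup, and the $1.70/hr ceiling are assumptions; the
| actual launch/release step is elided):
|
|     import boto3
|     from datetime import datetime, timedelta, timezone
|
|     ec2 = boto3.client("ec2", region_name="us-east-1")
|     history = ec2.describe_spot_price_history(
|         InstanceTypes=["c5a.24xlarge"],
|         ProductDescriptions=["Linux/UNIX"],
|         StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
|     )["SpotPriceHistory"]
|     cheapest = min(float(p["SpotPrice"]) for p in history)
|     if cheapest <= 1.70:   # assumed budget ceiling, $/hour
|         pass               # release the queued jobs / request capacity
|     else:
|         pass               # leave the queue parked until prices drop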
|
| Otherwise completely agree, there might be some cases where the
| cost of labour means that you're better off running something
| in AWS, even if that requires someone to do the configuration
| as well.
| aschleck wrote:
| This is sort of a confusing article because it assumes the
| premise of "you have a fixed hardware profile" and then argues
| within that context ("Most scientific computing runs on queues.
| These queues can be months long for the biggest supercomputers".)
| Of course if you're getting 100% utilization then you'll find
| better raw pricing (and this article conveniently leaves out
| staffing costs), but this model misses one of the most powerful
| parts of cloud providers: autoscaling. Why would you want to
| waste scientist time by making them wait in a queue when you can
| just instead autoscale as high as needed? Giving scientists a
| tight iteration loop will likely be the biggest cost reduction
| and also the biggest benefit. And if you're doing that on prem
| then you need to provision for the peak load, which drives your
| utilization down and makes on prem far less cost effective.
| lebovic wrote:
| For fast-moving researchers who are blocked by a queue, cloud
| computing still makes sense. I guess I wasn't clear enough in
| the last section about how I still use AWS for startup-scale
| computational biology. My scientific computing startup
| (trytoolchest.com) is 100% built on top of AWS.
|
| Most scientific computing still happens on supercomputers in
| slower moving academic or big co settings. That's the group for
| whom cloud computing - or at least running everything on the
| cloud - doesn't make sense.
| adolph wrote:
| Another service that runs on AWS is CodeOcean. It looks like
| Toolchest is oriented toward facilitating execution of
| specific packages rather than organization and execution like
| CodeOcean. Is that a fair summary?
|
| https://codeocean.com/explore
| lebovic wrote:
| Yep, that's right! Toolchest focuses on compute, deploying
| and optimizing popular scientific computing packages.
| secabeen wrote:
| Generally, scientists aren't blocked while they are waiting on
| a computational queue. The results of a computation are needed
| eventually, but there is lots of other work that can be done
| that doesn't depend on a specific calculation.
| jefftk wrote:
| It's good to learn how not to be blocked on long-running
| calculations.
|
| On the other hand, if transitioning to a bursty cloud model
| means you can do your full run in hours instead of weeks,
| that has real impact on how many iterations you can do and
| often does appreciably affect velocity.
| secabeen wrote:
| It can, if you have the technical ability to write code
| that can leverage the scale-out that most bursty-cloud
| solutions entail. Coding for clustering can be pretty
| challenging, and I would generally recommend a user target
| a single large system with a job that takes a week over
| trying to adapt that job to a clustered solution of 100
| smaller systems that can complete it in 8 hours.
| Fomite wrote:
| This is a big part of it. In my lab, I have a lot of grad
| students who are _computational_ scientists, not computer
| scientists. The time it will take them to optimize code
| far exceeds a quick-and-dirty job array on Slurm and then
| going back to working on the introduction of the paper,
| or catching up on the literature, or any one of a dozen
| other things.
| secabeen wrote:
| The general rule of thumb in the HPC world is if you can keep a
| system computing for more than 40% of the time, it will be
| cheaper to buy.
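|
| A sketch of where a rule of thumb like that comes from (both
| dollar figures are illustrative assumptions, not quotes):
|
|     ondemand_hourly = 3.66      # assumed on-demand price of an equivalent node
|     onprem_monthly_tco = 1070   # assumed all-in monthly cost of owning it
|     hours_per_month = 730
|     breakeven = onprem_monthly_tco / (ondemand_hourly * hours_per_month)
|     print(f"owning wins above {breakeven:.0%} utilization")   # ~40%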
| tejtm wrote:
| Cloud never has made sense for scientific computing. Renting
| someone else's big computer makes good sense in a business
| setting where you are not paying for your peak capacity when you
| are not using it, and you are not losing revenue by
| underestimating whatever the peak capacity the market happens to
| dictate.
|
| For business, outsourcing the compute cost center eliminates
| both cost and risk, for a big win each quarter.
|
| Scientists never say, "Gee, it isn't the holiday season, guess
| we'd better scale things back."
|
| Instead they will always tend to push whatever compute limit
| there is, it is kinda in the job description.
|
| As for the grant argument, that is letting the tool shape the
| hand.
|
| Business-science is not science; we will pay now or pay later.
| aBioGuy wrote:
| Furthermore, scientific computing often (usually?) involves
| trainees. It can be difficult to train people when small
| mistakes can lead to five-figure bills.
| Moissanite wrote:
| This is the biggest un-addressed problem, IMO. Getting more
| scientific computing done in the cloud is where we are
| inevitably trending, but no-one yet has a good answer for
| completely ad-hoc, low-value experimentation and skill building
| in the cloud. I see universities needing to maintain clusters to
| allow PhDs and postdocs to develop their computing skills for a
| good while yet.
| avereveard wrote:
| > Hardware is amortized over five years
|
| hardware running 100% won't last five years
|
| if hardware doesn't need to run at full steam 100% of the time
| for five years, you can turn down instances in the cloud and
| not pay anything
|
| in 2 years you'll be stuck with the same hardware, while in the
| cloud you follow CPU evolution as it arrives at the provider
|
| all in all the comparison is too high level to be useful
| e63f67dd-065b wrote:
| > hardware running 100% won't last five years
|
| Five years is a pretty typical amortisation schedule for HPC
| hardware. During my sysadmin days, of CPU, memory, cooling,
| power, storage, and networking, the only things that broke were
| hard disks and a few cooling fans. Disks were replaced by just
| grabbing a spare and slotting it in, and fans were replaced by,
| well, swapping them out.
|
| Modern CPUs and memory last a very long time. I think I
| remember seeing Ivy Bridge CPUs running in Hetzner servers in a
| video they put out, and they're still fine.
| avereveard wrote:
| if you expect downtime over the 5 years to replace fans and
| whatnot, you're not getting 100% of your money/perf back -
| and I didn't see that in the article.
|
| if you have spares, value lost to downtime stays minimal, but
| the spares need to be included in the expenses. if you don't
| have spares, a 1-2 day downtime is going to be a decent hit to
| value.
| davidmr wrote:
| I'm not sure I understand what you mean. I've run HPC
| clusters for a long time now, and node failures are just a
| fact of life. If 3 or 4 nodes of your 500 node cluster are
| down for a few days while you wait for RMA parts to arrive,
| you haven't lost much value. Your cluster is still
| functioning at nearly peak capacity.
|
| You have a handful of nodes that the cluster can't function
| without (scheduler, fileservers, etc), but you buy spares
| and 24x7 contracts for those nodes.
|
| Did I misunderstand your comment?
| icedchai wrote:
| I think you underestimate how long modern hardware can last. I
| have 8 to 12 year old PCs running non-stop, in a musty and damp
| basement.
| avereveard wrote:
| they don't just die: thermal paste dries up, fans gum up, the
| GPU will live, but thermal throttling will mean it'll run at,
| say, 80%.
| aflag wrote:
| I've worked with a YARN cluster of around 200 nodes which ran
| non-stop for well over 5 years and is still kicking. There were
| a handful of failures and replacements, but I'd say 95% of the
| cluster was fine 7 years in.
| walnutclosefarm wrote:
| Having had the responsibility of providing HPC for literal
| buildings full of scientists, I can say that it may be true that
| you can get computation cheaper with owned hardware than in a
| cloud. Certainly pay-as-you-go, one-project-at-a-time
| processing will look that way to the scientist. But I can also
| say with confidence that the contest is far closer than they
| think. Scientists who make this argument almost invariably leave
| major costs out of their calculation - assuming they can put
| their servers in a closet, maintain them themselves, do all the
| security infrastructure, provide redundancy and still get to
| shared compute when they have an overflow need. When the closet
| starts to smoke because they stuffed it with too many cheaply
| sourced, hot-running cores and GPUs, or gets hacked by one of
| their postdocs resulting in an institutional HIPAA violation,
| well, that's not their fault.
|
| Put like for like in a well managed data center against
| negotiated and planned cloud services, and the former may still
| win, but it won't be dramatically cheaper, and figured over
| depreciable lifetime and including opportunity cost, may cost
| more. It takes work to figure out which is true.
| pbronez wrote:
| The article estimated:
|
| Running a modern AMD-based server that has 48 cores, at least
| 192 GB of RAM, and no included disk space costs:
|
| * ~$2670.36/mo for a c5a.24xlarge AWS on-demand instance
| * ~$1014.7/mo for a c5a.24xlarge AWS reserved instance on a
|   three-year term, paid upfront
| * ~$558.65/mo on OVH Cloud[1]
| * ~$512.92/mo on Hetzner[2]
| * ~$200/mo on your own infrastructure as a large institution[3]
|
| Footnote [3] explains this cost estimate as:
|
| "Assumes an AMD EPYC 7552 run at 100% load in Boston with high
| electricity prices of $0.23/kWh, for $33.24/mo in raw power.
| Hardware is amortized over five years, for an average monthly
| price of $67.08/mo. We assume that your large institution
| already has 24/7 security and public internet bandwidth, but
| multiply base hardware and power costs by 2x to account for
| other hardware, cooling, physical space, and a
| half-a-$120k-sysadmin amortized across 100 servers."
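|
| Reproducing that estimate (numbers come from the footnote; the
| ~$4,025 server price is implied by the $67.08/mo amortization):
|
|     power_monthly = 33.24          # EPYC 7552 at 100% load, $0.23/kWh
|     hardware_monthly = 4025 / 60   # 5-year amortization, ~$67.08/mo
|     overhead_multiplier = 2        # other hardware, cooling, space, sysadmin
|     total = (power_monthly + hardware_monthly) * overhead_multiplier
|     print(round(total, 2))         # 200.65 -> the "~$200/mo" figure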
| jacobr1 wrote:
| Also, it assumes full utilization of the hardware. If you have
| variable load (such as only needing to run compute after an
| experiment), the overhead cost of maintaining a cluster you
| don't need all the time probably exceeds the cost of resources
| you can schedule on-demand.
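|
| A minimal break-even sketch using the article's own numbers
| (assuming the on-prem cost is paid whether or not the machine is
| busy, and ignoring queueing and data-transfer effects):
|
|     on_prem = 200.0        # $/mo, article's large-institution estimate
|     on_demand = 2670.36    # $/mo, c5a.24xlarge on-demand, full month
|     print(f"{on_prem / on_demand:.1%}")   # ~7.5%
|
| Below roughly 7.5% utilization, paying on-demand rates beats
| owning the box outright; above it, ownership wins on raw cost.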
| duxup wrote:
| When I worked as a network engineer I spent months working with
| some great scientists / their team who built a crazy microscope
| (I assumed it was looking at atoms or something...) the size of
| a small building.
|
| Their budget for the network was a couple hundred bucks and
| some old garbage consumer-grade gear. This was for something
| that spat out 10s of GB a second (at least) across a ton of
| network connections (they didn't seem to know what would even
| happen when they ran it), and was so bursty that only the
| highest end of gear could handle it.
|
| Can confirm sometimes scientists aren't really up on the
| overall costs. Then they dump it ("this isn't working") on
| their university IT team to absorb the cost and manpower.
| ipaddr wrote:
| You are paying 10x more because no one gets fired for using
| IBM. AWS has many benefits, most of which you don't need. Pair
| up with another school in a different region and back up your
| data. Computers are not scary; they rarely catch fire.
| whatever1 wrote:
| Nah, for us it was the department IT guy who set everything up
| once (a full cluster of 50 R720s) and it works like a dream.
|
| Properly provisioned Linux machines need no maintenance. You
| drive them until there is a hardware failure.
| mangecoeur wrote:
| I've been running a group server (basically a shared
| workstation) for 5 years and it's been great. Way cheaper than
| cloud, no worrying about national rules on where data can be
| stored, no waiting in a SLURM batch queue, Jupyter notebooks on
| tap for everyone. A single ~$6k outlay (we don't need GPUs,
| which helps).
|
| Classic big workstations are way more capable than people think
| - but at the same time it's hard to justify buying one machine
| per user unless your department is swimming in money. Also,
| academic budgets tend to come in fixed chunks, and university
| IT departments may not have your particular group as a priority
| - so often it's just better to invest once in a standalone
| server tower that you can set up to do exactly what you need it
| to than try to get IT to support your needs or the accounting
| department to pay recurring AWS bills.
| killingtime74 wrote:
| Aren't you talking about 1 server when this is talking about
| HPC?
| mangecoeur wrote:
| Well the title is scientific computing, which includes HPC
| but not only. Anyway the fact is that a lot of "HPC" in
| university clusters is smaller jobs that are too much for
| an average PC to handle, but still fit into a single
| typical HPC node. These are usually the jobs that people
| think to farm out to AWS, but that you will generally find
| are cheaper, faster, and more reliable if you just run them
| on your own hardware.
| [deleted]
| forgomonika wrote:
| This nails so much of the discussion that should be had. When
| using any cloud service provider, you aren't just paying for
| the machines/hardware you use - you are paying for people to
| take care of a bunch of headaches of having to maintain this
| hardware. It's incredibly easy to overlook this aspect of costs
| and really easy to oversimplify what's involved if you don't
| know how these things actually work.
| prpl wrote:
| The things that tend to be "cheap" on campuses:
|
| Power (especially if there is some kind of significant
| scientific facility on premise), space (especially in reused
| buildings), manpower (undergrads, grad students, post docs,
| professional post graduates), running old/reused hardware,
| etc...
|
| You can get away with those at large research universities.
| Some of that you can get away with at national lab sorts of
| places (not going to find as much free/cheap labor, surplus
| hardware). If you start going down in scale/prestige, etc...
| none of that holds true.
|
| Running a bunch of hardware from the surplus store in a closet
| somewhere with Lasko fans taped to the door is cheap. To some
| extent, the university system encourages such subsidies.
|
| In any case, once you get to actually building a datacenter, if
| you have to factor in power, a 4-year hardware refresh
| cycle, professional staffing, etc... unless you are in one of
| those low-CoL college towns - cloud is probably no more than
| 1.5 to 3x more expensive for compute (spot, etc...). Storage on
| prem is much cheaper - erasure coded storage systems are cheap
| to buy and run, and everybody wants their own high performance
| file system.
|
| One continuing cloud obstacle though - researchers don't want
| to spend their time figuring out how to make their code friendly
| to preemptible VMs - which is the cost-effective way to run on
| cloud.
|
| Another real issue with sticking to on-prem HPC is talent
| acquisition and staff development. When you don't care about
| those things so much, it's easy to say it's cheap to run on-
| prem, but often the pay is crap for the required expertise, and
| ignoring cloud doesn't help your staff either.
| W-Stool wrote:
| Let me echo this as someone who once was responsible for HPC
| computing in a research intensive public university. Most
| career academics have NO IDEA how much enterprise computing
| infrastructure costs. If a 1 terabyte USB hard drive is $40 at
| Costco we (university IT) must be getting a much better deal
| than that. Take this argument and apply it to any aspect of HPC
| computing and that's what you're fighting against. The closet
| with racks of gear and no cooling is another fond memory. Don't
| forget the AC terminal strips that power the whole thing,
| sourced from the local dollar store.
| bluedino wrote:
| It's kind of funny around this time of year when some
| researchers have $10,000 in their budget they need to spend,
| and they want to 'gift' us with some GPUs.
| davidmr wrote:
| That was definitely one of the weirdest things of working
| in academia IT: "hey. Can you buy me a workstation that's
| as close to $6,328.45 as it is possible to get, and can you
| do it by 4pm?"
| systemvoltage wrote:
| I am dealing with the exact opposite problem: "Oh you mean,
| we should leave the EC2 instance running _24/7_??? No way,
| that would be too expensive"... to which I need to respond
| "No, it would be like $15/month. Trivial, stop worrying about
| costs in EC2 and S3, we're like 7 people here with 3 GB of
| data."
|
| I deal with scientists who think AWS is some sort of a
| massively expensive enterprise thing. It can be, but not for
| the use case they're going to be embarking on. Our budget is
| $7M spanning 4 years.
| capableweb wrote:
| > think AWS is some sort of a massively expensive
| enterprise thing
|
| Compared to using dedicated instances with way cheaper
| bandwidth, storage and compute power, it might as well be.
|
| Cloud makes sense when you have to scale up/down very
| quickly, or you'd be losing money fast. But most don't
| suffer from this problem.
| gonzo41 wrote:
| Don't say the budget out loud near AWS. They'll find a way
| to help you spend it.
| systemvoltage wrote:
| Hahaha, maybe I need to just go into the AWS ether and
| start yakking big words like "Elastic Kubernetes Service"
| to confuse the scientists and get my AWS fix. These
| people are too stingy. I want some shit running in AWS;
| what good is this admin IAM role otherwise.
| 0xbadcafebee wrote:
| I remember the first time a server caught fire in the closet
| we kept the rack in. Backups were kept on a server right
| below the one on fire. But, y'know, we saved money.
| eastbound wrote:
| Don't worry, we do incremental backups during weekdays and
| a full backup on Sunday. We use 2 tapes only, so one is
| always outside of the building. But you know, we saved
| money.
| [deleted]
| treeman79 wrote:
| We had a million dollars' worth of hardware installed in a
| closet. It had a portable AC hooked up that needed its
| water bin changed every so often.
|
| Well, I was in the middle of that when the Director
| decided to show off the new security doors. So he closed
| the room up, then found out that the new security doors
| didn't work. I found out as I was coming back to turn the
| AC back on. The room would get hot really fast.
|
| We got office security to unlock the door. He said he
| didn't have the authority; his supervisor would be by
| later in the day.
|
| Completely deadpan, and in front of several VPs of a
| Fortune 50, I turned to the guy to my right who lived
| nearby: "Go home and get your chainsaw."
|
| We were quickly let in. Also got fast approval to install
| proper cooling.
| bilbo0s wrote:
| A bit off topic, but I gotta say you guys are a riot!
|
| If there was a comedy tour for IT/Programmer types, I'd
| pay to see you guys in it.
|
| Best thing about your stuff is that it's literally all
| funny precisely because it's all true.
| [deleted]
| [deleted]
| pbronez wrote:
| This is my fear about my homelab lol
|
| Fire extinguisher nearby, smart temp sensors, but still...
| rovr138 wrote:
| oh, nice idea with temp sensor.
|
| I have extinguishers all over the house, but hadn't
| considered a temperature sensor set to send alerts.
|
| Do you have any recommendations?
| W-Stool wrote:
| What are you using for a homelab priced temperature
| sensor?
| xani_ wrote:
| Homelab-priced sensor is the temp sensor in your server,
| it's free! Actual servers have a bunch, usually have one
| at intake, "random old PC" servers can use motherboard
| temp as rough proxy for environment temp.
|
| Hell, even in DC you can look at temperatures and see in
| front of which server technican was standing just by
| those sensors.
|
| Second cheapest would be a USB-to-1-wire module + some
| DS18B20 1-wire sensors. Easy hobby job to make. They also
| come with a unique ID, which means that if you key your
| TSDB on that ID it doesn't matter where you plug the
| sensors in.
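|
| A minimal polling sketch, assuming a Linux host where the
| kernel w1 subsystem exposes the sensors under
| /sys/bus/w1/devices (the 35 C alert threshold is arbitrary):
|
|     import glob, re, time
|
|     def read_temps():
|         temps = {}
|         # DS18B20 devices appear with a "28-" family prefix
|         for path in glob.glob("/sys/bus/w1/devices/28-*/w1_slave"):
|             sensor_id = path.split("/")[-2]
|             data = open(path).read()
|             m = re.search(r"t=(-?\d+)", data)
|             if m and "YES" in data:          # CRC check passed
|                 temps[sensor_id] = int(m.group(1)) / 1000.0
|         return temps
|
|     while True:
|         for sensor_id, celsius in read_temps().items():
|             if celsius > 35:
|                 print(f"ALERT {sensor_id}: {celsius:.1f} C")
|         time.sleep(60)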
| COGlory wrote:
| >the security infrastructure, provide redundancy and still get
| to shared compute when they have an overflow need
|
| The article points out that this is mostly not necessary for
| scientific computing.
| jrumbut wrote:
| Which I thought was the best point of the article, that a lot
| of IT best practice comes from the web app world.
|
| Web apps quickly become finely tuned factory machines,
| executing a million times a day and being duplicated
| thousands of times.
|
| Scientific computing projects are often more like workshops.
| It's unpleasant to be charged by the second while you sit at a
| console trying to figure out what the giant blob you were sent
| even is. The solution you create is most likely to be run
| exactly once. If it is a big hit, it may be run a dozen times.
|
| Trying to run scientific workloads on the cloud is like
| trying to put a human shoe on a horse. It might be possible
| but it's clearly not designed for that purpose.
| onetimeusename wrote:
| Is a postdoc hacking a cluster something you have seen before?
| I am genuinely curious because I worked on a cluster owned by
| my university as an undergrad and everyone was kind of assumed
| to be trusted. If you had shell access on the main node you
| could run any job you wanted on the cluster. You could enhance
| security; I just wonder about this threat model, it's an
| interesting one. I am sure it happens, to be clear.
| ptero wrote:
| I think it really depends on the task. Where HIPAA violation is
| a real threat, the equation changes. And just for CYA purposes
| those projects can get pushed to a cloud. Which does not
| necessarily involve any attempts to make them any more secure,
| but this is a different topic.
|
| That said, many scientists _are_ operating on-premise hardware
| like this: some servers in a shared rack and an el-cheapo
| storage solution with ssh access for people working in the
| lab. And it works just fine for them.
|
| Cloud services focus on running _business_ computing in the
| cloud, emphasizing recurring revenue. Most research labs are
| _much_ more comfortable spending the hardware portion of a
| grant upfront and not worrying about some student who, instead
| of working on some fluid dynamics problem, found a script to
| retrain Stable Diffusion and left it running over winter break.
| My 2c.
| secabeen wrote:
| Thankfully, only a small part of the academic research
| enterprise involves human subjects, HIPAA, and all that.
| Neither fruit flies nor quarks have privacy rights.
| dmicah wrote:
| Research involving human subjects (psychology, cognitive
| neuroscience, behavioral economics, etc.) requires
| institutional review board approval and informed consent,
| etc. but mostly doesn't involve HIPAA either.
| charcircuit wrote:
| That is not a law.
| icedchai wrote:
| There are actually laws around such things. You can read
| about them here: https://www.hhs.gov/ohrp/index.html
| Fomite wrote:
| And many, many institutions are over cautious. My own
| university, for example, has no data classification
| between "It would be totally okay if anyone in the
| university has access" and "Regulated data", so "I mean,
| it's health information, and it's governed by our data
| use agreement with the provider..." gets it kicked to the
| same level as full-fat HIPAA data.
| crazygringo wrote:
| > _And it works just fine for them._
|
| Until it doesn't because there's a fire or huge power surge
| or whatever.
|
| That's the point -- there's a lot of risk they're not taking
| into account, and by focusing on the "it works just fine for
| them", you're cherry picking the ones that didn't suffer
| disaster.
| horsawlarway wrote:
| I'd counter by saying I think you're over-estimating how
| valuable mitigating that risk is to this crowd.
|
| I'd further say that you're probably over-estimating how
| valuable mitigating that risk is to _anyone_ , although
| there are a few limited set of customers that genuinely do
| care.
|
| There are few places I can think of that would benefit more
| by avoiding cloud costs than scientific computing...
|
| They often have limited budgets that are driven by grants,
| not derived by providing online services (computer going
| down does not impact bottom line).
|
| They have real computation needs that mean hardware is
| unlikely to sit idle.
|
| There is no compelling reason to "scale" in the way that a
| company might need to in order to handle additional
| unexpected load from customers or hit marketing campaigns.
|
| Basically... the _only_ meaningful offering from the cloud
| is likely preventing data loss, and this can be done fairly
| well with a simple backup strategy.
|
| Again - they aren't a business where losing a few
| hours/days of customer data is potentially business ending.
|
| ---
|
| And to be blunt - I can make the same risk avoidance claims
| about a lot of things that would simply get me laughed out
| of the room.
|
| "The lead researcher shouldn't be allowed in a car because
| it might crash!"
|
| "The lab work must be done in a bomb shelter in case of war
| or tornados!"
|
| "No one on the team can eat red meat because it increases
| the risk of heart attack!"
|
| and on and on and on... Simply saying "There's risk" is not
| sufficient - you must still make a compelling argument that
| the cost of avoiding that risk is justified, and you're not
| doing that.
| billythemaniam wrote:
| The counterpoint to that point is that a significant
| percentage of scientific computing doesn't care about any
| of that. They are unlikely to have enough hardware to cause
| a fire and they don't care about outages or even data loss
| in many cases. As others have said, it depends on the
| specifics of the research. In the cases where that stuff
| matters, the cloud would be a better option.
| Fomite wrote:
| This. If my lab-level server failed tomorrow, I'd be
| annoyed, order another one, and start the simulations
| again.
| vjk800 wrote:
| The point is, there's no need for everything to be 100%
| reliable in this context. If a fire destroys everything and
| their computational resources are unavailable for a few
| days, that's somewhat okay. Not ideal, but not a
| catastrophic loss either. Even data loss is not catastrophic
| - at worst it means redoing one or two weeks' worth of
| computations.
|
| Some sort of 80/20 principle is at work here. Most of the
| cost in professional cloud solutions comes from making the
| infrastructure 99.99% reliable instead of 99% reliable. It
| is totally worth it if you have millions of customers that
| expect a certain level of reliability, but a complete
| overkill if the worst case scenario from a system failure
| is some graduate student having to redo a few days worth of
| computations (which probably had to be redone several times
| anyway because of some bug in the code or something).
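|
| To put rough numbers on that gap, a small sketch of allowed
| downtime per month at each level (730 hours per month assumed):
|
|     for nines in (0.99, 0.9999):
|         downtime_h = (1 - nines) * 730
|         print(f"{nines:.2%} -> {downtime_h * 60:.0f} minutes/month")
|     # 99.00% -> 438 minutes/month (~7.3 hours)
|     # 99.99% -> 4 minutes/month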
| kijin wrote:
| Even that depends on what you're doing. Most scientists
| aren't running apps that require several 9's of
| availability, connect to an irreplaceable customer
| database, etc.
|
| An outage, or even permanent loss of hardware, might not be
| a big problem if you're running easily repeatable
| computations on data of which you have multiple copies. At
| worst, you might have to copy some data from an external
| hard drive and redo a few weeks' worth of computations.
| withinboredom wrote:
| Ummm. I've def been unable to do anything for entire days
| because our AWS region went down and we had to rebuild the
| database from scratch. AWS goes down, you twiddle your
| thumbs and the people you report to are going to be asking
| why, for how long, etc. and you can't give them an answer
| until AWS comes back to see how fubar things are.
|
| When your own hardware rack goes down, you know the
| problem, how much it costs to fix it, and when it will come
| back up; usually within a few hours (or minutes) of it
| going down.
|
| Do things catch fire? Yes. But I think you're over-
| estimating how often. In my entire life, I've had a single
| SATA connector catch fire and it just melted plastic before
| going out.
| crazygringo wrote:
| I'm not talking about temporary outages, I'm talking
| about data loss.
|
| With AWS it's extremely easy to keep an up-to-date
| database backup in a different region.
|
| And it's great that you haven't personally encountered
| disaster, but of course once again that's cherry-picking.
| And it's not just a component overheating, it's the whole
| closet on fire, it's a broken ceiling sprinkler system
| going off, it's a hurricane, it's whatever.
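|
| For what it's worth, the cross-region copy really is a few
| lines. A minimal boto3 sketch (the bucket names and regions are
| made up; production setups would more likely configure S3
| cross-region replication on the bucket itself):
|
|     import boto3
|
|     # Upload tonight's dump to the primary bucket...
|     boto3.client("s3", region_name="us-east-1").upload_file(
|         "backup.sql.gz", "lab-backups-use1", "nightly/backup.sql.gz")
|
|     # ...and copy it to a bucket that lives in a different region.
|     boto3.client("s3", region_name="eu-west-1").copy(
|         {"Bucket": "lab-backups-use1", "Key": "nightly/backup.sql.gz"},
|         "lab-backups-euw1", "nightly/backup.sql.gz")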
| withinboredom wrote:
| I was also talking about data loss. Not everything can
| be replicated, but backups can be made, and were.
|
| For the rest, there's insurance. Most calculations done
| in a research setting are dependent upon that research
| surviving. If there's a fire and the whole building goes
| down, those calculations are probably worthless now too.
|
| Hell, most companies probably can't survive their own
| building/factory burning down.
| FpUser wrote:
| >"With AWS it's extremely easy to keep an up-to-date
| database backup in a different region"
|
| It is just as "extremely easy" on Hetzner or on-premises.
| monkmartinez wrote:
| I would say it's even easier on prem as you don't need to
| wade 15 layers deep to do anything. Since I have moved to
| hosting my own stuff at my house, I have learned that
| connecting a monitor and keyboard to a 'server' is awesome
| for productivity. I know where everything is, it's fast as
| hell, and everything is locked down. Monitoring temps,
| adjusting and configuring hardware is just better in
| every imaginable way. Need more RAM, storage, compute?
| Slap those puppies in there and send it.
|
| For home gamers like myself, it has become a no-brainer
| with advances in tunneling, Docker, and cheap prices on
| eBay.
| ptero wrote:
| > there's a lot of risk they're not taking into account
|
| I see it the other way: experimental scientists operate
| with unreliable systems all the time: fickle systems,
| soldered one-time setups, shared lab space, etc. Computing
| is just one more thing that is not 100% reliable (but way
| more reliable than some other equipment), and usb data
| sticks serve as a good enough data backup.
| mangecoeur wrote:
| Or your university might have its own backup system. We
| have a massive central tape-based archive that you can
| run nightly backups to.
| noobermin wrote:
| Maybe consider that your use case and the average
| scientist's use case aren't the same? What works for you
| won't work for them and vice versa? What you consider a
| risk, I wouldn't?
|
| Consider the following: I have never considered applying
| Meltdown or Spectre mitigations if they make my code run
| slower, because I plain don't care. Assuming anyone even
| peeks at what my simulations are doing, whoopdeedo, I don't
| care. I won't do that on the laptop I use to buy shit off
| Amazon with, but the workstation I have control of? I don't
| care. I DO care if my simulation will take 10 days instead
| of a week.
|
| My use case isn't yours because my needs aren't yours. Not
| everything maps across domains.
| insane_dreamer wrote:
| Plus the supposed savings of in-house hardware only materialize
| if you have sufficiently managed and queued load to keep your
| servers running at 100% 24/7. The advantage of AWS/other is to
| be able to acquire the necessary amount of compute power for
| the duration that you need it.
|
| For a large university it probably makes sense to have and
| manage their own compute infrastructure (cheap post-doc labor,
| ftw!) but for smaller outfits, AWS can make a lot of sense for
| scientific computing (said as someone who uses AWS for
| scientific computing), especially if you have fluctuating
| loads.
|
| What works best IMO (and what we do) is have a minimum-to-
| moderate amount of compute resources in house that can satisfy
| the processing jobs most commonly run (and where you haven't
| had to overinvest in hardware), and then switch to AWS/other
| for heavier loads that run for a finite period.
|
| Another problem with in-house hardware is that you spent all
| that money on Nvidia V100s a few years ago and now there's the
| A100 that blows it away, but you can't just switch and take
| advantage of it without another huge capital investment.
| secabeen wrote:
| They leave out major costs because they don't pay those costs.
| Power, Cooling, Real Estate are all significant drivers of AWS
| costs. Researchers don't pay those costs directly. The
| university does, sure, but to the researcher, that means those
| costs are pre-paid. Going to AWS means you're essentially
| paying for those costs twice, plus all the profit margin and
| availability that AWS provides that you also don't need.
| fwip wrote:
| The killer we've seen is data egress costs. Crunching the numbers
| for some of our pipelines, we'd actually be paying more to get
| the data out of AWS than to compute it.
| bhewes wrote:
| Data movement has become the number one cost in system builds,
| energy-wise.
| boldlybold wrote:
| As in, the networking equipment consumes the most energy?
| Given the 30x markup on AWS egress I'm inclined to say it's
| more about incentives and marketing, but I'd love to learn
| otherwise.
| pclmulqdq wrote:
| Even as a big cloud detractor, I have to disagree with this.
|
| A lot of scientific computing doesn't need a persistent data
| center, since you are running a ton of simulations that only take
| a week or so, and scientific computing centers at big
| universities are a big expense that isn't always well-utilized.
| Also, when they are full, jobs can wait weeks to run.
|
| These computing centers have fairly high overhead, too, although
| some of that is absorbed by the university/nonprofit who runs
| them. It is entirely possible that this dynamic, where
| universities pay some of the cost out of your grant overhead,
| makes these computing centers synthetically cheaper for
| researchers when they are actually more expensive.
|
| One other issue here is that scientific computing really benefits
| from ultra-low-latency infiniband networks, and the cloud
| providers offer something more similar to a virtualized RoCE
| system, which is a lot slower. That means accounting for cloud
| servers potentially being slower core-for-core.
| davidmr wrote:
| This is tangential to your point, but I'll just mention that
| Azure has some properly specced out HPC gear: IB, FPGAs, the
| works. You used to be able to get time on a Cray XC with an
| Aries interconnect, but I never have occasion to use it, so I
| don't know if you still can. They've been aggressively hiring
| top-notch HPC people for a while.
| lebovic wrote:
| Author here. I agree with your points! I use AWS for a
| computational biology company I'm working on. A lot of
| scientific computing can spin up and down within a couple hours
| on AWS and benefits from fast turnaround. Most academic HPCs
| (by # of clusters) are slower than a mega-cluster on AWS, not
| well utilized, and have a lot of bureaucratic process.
|
| That said, most of scientific computing (by % of total compute)
| happens in a different context. There's often a physical
| machine within the organization that's creating data (e.g. a
| DNA sequencer, particle accelerator, etc), and a well-
| maintained HPC cluster that analyzes that data. The researchers
| have already waited months for their data, so another couple
| weeks in a queue doesn't impact their cycle.
|
| For that context, AWS doesn't really make sense. I do think
| there's room for a cloud provider that's geared towards an HPC
| use-case, and doesn't have the app-inspired limits (e.g. data
| transfer) like AWS, GCP, and Azure.
| hellodanylo wrote:
| [retracted]
| Marazan wrote:
| It says 0.09 per GB on that page.
| philipkglass wrote:
| Where do you see that? On your link I see, for Data Transfer
| OUT From Amazon EC2 To Internet:
|   First 10 TB / Month: $0.09 per GB
|   Next 40 TB / Month: $0.085 per GB
|   Next 100 TB / Month: $0.07 per GB
|   Greater than 150 TB / Month: $0.05 per GB
|
| Which means if you transfer out 90 TB in one month, it's $0.09
| * 10000 + $0.085 * 40000 + $0.07 * 40000 = $7100.
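|
| A minimal sketch of that tiered calculation (Python; assumes
| the $0.05 tier applies to everything beyond 150 TB):
|
|     def egress_cost(tb):
|         """Monthly EC2-to-internet egress in USD for `tb` terabytes."""
|         gb = tb * 1000                   # priced per GB
|         cost = 0.0
|         for tier_gb, price in [(10_000, 0.09), (40_000, 0.085),
|                                (100_000, 0.07)]:
|             used = min(gb, tier_gb)
|             cost += used * price
|             gb -= used
|         return cost + gb * 0.05          # remainder above 150 TB
|
|     print(egress_cost(90))   # 7100.0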
| hellodanylo wrote:
| Sorry, you are right. I need another coffee today.
| xani_ wrote:
| It always was that way for load where autoscaling can't save
| you money; the savings were always the convenience of not having
| to do ops and pay for ops.
|
| Then again, part of the ops cost you save is paid again in the
| salaries of devs who have to deal with AWS stuff instead of just
| throwing a blob of binaries over the wall and letting ops worry
| about the rest.
| citizenpaul wrote:
| No one seems to even consider colo data centers as an option
| anymore?
| remram wrote:
| My university owns hardware in multiple locations, plus uses
| hardware in a colocation facility, and still uses the cloud for
| bursting (overflow). You can't beat the provisioning time of
| cloud providers which is measured in seconds.
| zatarc wrote:
| Why does no one consider colocation services anymore?
|
| And why do people only know Hetzner, OVH and Linode as
| alternatives to the big cloud providers?
|
| There are so many good and inexpensive server hosting providers,
| some with decades of experience.
| lostmsu wrote:
| Any particular ones you could recommend for GPUs?
| zatarc wrote:
| I'm not in a position to recommend (or not) a particular
| provider for GPU-equipped servers, simply because I've never
| had the need for GPUs.
|
| My first thought was related to colocation services. From
| what I understand, a lot of people avoid on-premise/in-house
| solutions because they don't want to deal with server rooms,
| redundant power, redundant networks, etc.
|
| So people go to the cloud and pay horrendous prices there.
|
| Why not take a middle path? Build your own custom server with
| your preferred hardware and put it in a colocation facility.
| dkobran wrote:
| There are several tier-two clouds that offer GPUs but I think
| they generally fall prey to many of the same issues
| you'll find with AWS. There is a new generation of
| accelerator-native clouds, e.g. Paperspace
| (https://paperspace.com), that cater specifically to HPC, AI,
| etc. workloads. The main differentiators are:
|   - much larger GPU catalog
|   - support for new accelerators, e.g. Graphcore IPUs
|   - different pricing structures that address problematic
|     areas for HPC such as egress
|
| However, one of the most important differences is the _lack_
| of unrelated web-services components that pose a major
| distraction/headache to users who don't have a DevOps
| background (which AWS obviously caters to). AWS can be
| incredibly complicated. Simple tasks are encumbered by a
| whole host of unrelated options/capabilities and the learning
| curve is very steep. A platform that is specifically designed
| to serve the scientific computing audience can be much more
| streamlined and user-friendly for this audience.
|
| Disclosure: I work on Paperspace.
| latchkey wrote:
| Coreweave. I know the CTO. They are doing great work over
| there.
|
| https://www.coreweave.com
| sabalaba wrote:
| Lambda GPU Cloud has the cheapest A100s of that group.
| https://lambdalabs.com/service/gpu-cloud
|
|   Lambda A100s - $1.10 / hr
|   Paperspace A100s - $3.09 / hr
|   Genesis - no A100s, but their 3090 (about half the speed
|   of an A100) is $1.30 / hr
| lostmsu wrote:
| That's still way too expensive. 3090 is less than 2x of the
| monthly cost in Genesis. A100 is priced better here.
| tryauuum wrote:
| datacrunch.io has some 80G A100s
| theblazehen wrote:
| https://www.genesiscloud.com/ is pretty decent
| snorkel wrote:
| Buying your own fleet of dedicated servers seems like a smart
| move in the short term, but then five years from now you'll get
| someone on the team insisting that they need the latest greatest
| GPU to run their jobs. Cloud providers give you the option of
| using newer chipsets without having to re-purchase your entire
| server fleet every five years.
| lebovic wrote:
| In HPC land, most hardware is amortized over five years and
| then replaced! If you keep your servers in service for five
| years at high utilization, you're doing great.
|
| For example, the Blue Waters supercomputer at UIUC was
| originally expected to last five years, although they kept it
| in service for nine; it was considered a success:
| https://www.ncsa.illinois.edu/historic-blue-waters-supercomp...
| adamsb6 wrote:
| I've never worked in this space, but I'm curious about the need
| for massive egress. What's driving the need to bring all that
| data back to the institution?
|
| Could whatever actions have to be performed on the data also be
| performed in AWS?
|
| Also while briefly looking into this I found that AWS has an
| egress waiver for researchers and educational institutions:
| https://aws.amazon.com/blogs/publicsector/data-egress-waiver...
| COGlory wrote:
| Well for starters, if you are NIH or NSF funded, they have data
| storage requirements you must meet. So usually this involves
| something like tape backups in two locations.
|
| The other is for reproducibility - typically you need to
| preserve lots of in-between steps for peer review and proving
| that you aren't making things up. Some intermediary data is
| wiped out, but usually only if it can be quickly and easily
| regenerated.
| jpeloquin wrote:
| Regarding the waiver--"The maximum discount is 15 percent of
| total monthly spending on AWS services". Was very excited at
| first.
|
| As for leaving data in AWS, data is often (not always)
| revisited repeatedly for years after the fact. If new questions
| are raised about the results it's often much easier to check
| the output than rerun the analysis. And cloud storage is not
| cheap. But yes it sometimes makes sense to egress only summary
| statistics and discard the raw data.
| [deleted]
___________________________________________________________________
(page generated 2022-10-07 23:00 UTC)