[HN Gopher] Show HN: Cloud GPUs for Deep Learning - At 1/3 the C...
       ___________________________________________________________________
        
       Show HN: Cloud GPUs for Deep Learning - At 1/3 the Cost of AWS/GCP
        
       Author : ilmoi
       Score  : 102 points
       Date   : 2021-03-17 15:56 UTC (7 hours ago)
        
 (HTM) web link (gpu.land)
 (TXT) w3m dump (gpu.land)
        
       | wrongdonf wrote:
        | I love a Dutch programmer
        
         | ilmoi wrote:
         | hah I am European indeed but not Dutch:) But I have spent a
         | year living in the Netherlands.
        
       | ilmoi wrote:
        | I'm a self-taught ML engineer, and when I was starting on my ML
        | journey I was incredibly frustrated by cloud services like
        | AWS/GCP: a) they were super expensive, b) it took me longer to
        | set up a working GPU instance than to learn to build my first
        | model!
       | 
       | So I built https://gpu.land/.
       | 
       | It's a simple service that only does one thing: rents out Tesla
       | V100s in the cloud.
       | 
       | Why is it awesome?
       | 
       | - It's dirt-cheap. You get a Tesla V100 for $0.99/hr, which is
       | 1/3 the cost of AWS/GCP/Azure/[insert big cloud name].
       | 
        | - It's dead simple. It takes 2 minutes from registration to a
       | launched instance. Instances come pre-installed with everything
       | you need for Deep Learning, including a 1-click Jupyter server.
       | 
       | - It sports a retro, MS-DOS-like look. Because why not:)
       | 
       | The most common question I get is - how is this so cheap? The
       | answer is because AWS/GCP are charging you a huge markup and I'm
        | not. In fact I'm charging just enough to break even, and built
        | this project really to give back to the community (and to learn
        | some of the tech in the process).
       | 
       | HN special: email me a few lines about yourself and what you're
       | working on and get $10 in free credit. I'm at hi@gpu.land.
       | 
       | Otherwise I'm around for any questions!
        
         | ogiberstein wrote:
         | Wow, sounds really useful! Will check it out
        
         | typon wrote:
         | Hey ilmoi,
         | 
          | Do you support multi-node training? Specifically, can I
          | reserve, for example, 64 GPUs across 8 nodes and perform
          | distributed training?
        
           | ilmoi wrote:
           | You can! All machines within a single account are inside a
           | private VLAN, which means they all talk to each other (out of
           | the box, no setup required), but nobody else on gpu.land sees
           | them.
           | 
           | See the "Are instances within my account connected?" item in
           | the FAQ - https://gpu.land/faq
           | 
           | If you're going to need 64 GPUs I'll have to increase your
           | account limit (currently 16 GPUs per account). Email me at
           | hi@gpu.land
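The setup described above (8 nodes x 8 GPUs on a private VLAN) maps onto the usual data-parallel rank bookkeeping. A minimal, framework-free sketch; the node counts mirror the question, and in a framework like PyTorch these values are what you would feed to `torch.distributed.init_process_group`:

```python
# Rank bookkeeping for multi-node data-parallel training:
# 8 nodes x 8 GPUs = 64 workers, as in the question above.
NODES = 8
GPUS_PER_NODE = 8
WORLD_SIZE = NODES * GPUS_PER_NODE  # 64 workers total

def global_rank(node_rank: int, local_rank: int,
                gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Unique worker id across all nodes (what DDP calls the rank)."""
    return node_rank * gpus_per_node + local_rank

# GPU 3 on node 2 gets rank 2*8 + 3 = 19.
```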
        
         | etaioinshrdlu wrote:
         | Can you use GeForce cards and then make us pinky swear to only
         | use it for blockchain processing applications? This is what
         | LeaderGPU does (https://www.leadergpu.com/#chose-best), and
         | they had even lower pricing.
         | 
         | Although lately they have had demand outstrip supply.
         | 
         | Hetzner used to have GTX 1080 instances for about $100 a month,
         | no longer though, and I'm lucky to be grandfathered in to about
         | 8 of them.
         | 
         | I told myself a couple years ago that compute will only get
         | cheaper over time. In reality, compute has gotten MORE
         | expensive over time! I cannot match or scale the compute price
         | I locked in a couple years ago with Hetzner. Some of that is
         | the crypto market raising GPU prices, but it is also NVIDIA's
         | licensing making their cheaper cards unavailable in cloud
         | servers...
        
         | liuliu wrote:
         | Thank you! This looks indeed cheap.
         | 
          | Can you share more about data persistence / checkpointing? If
          | I have a job that requires 8 V100s for 3 days, what kind of
          | reliability am I looking at?
        
           | ilmoi wrote:
            | Exactly the same as if you were running an EC2 instance at
            | AWS. The machines are hosted, maintained and managed in
            | exactly the same way in a whitelabel DC.
        
             | sacheendra wrote:
              | This is highly unlikely. Hyperscalers like AWS have custom
             | power delivery, rack organization, redundant networking,
             | etc. which make their instances reliable.
             | 
             | Long running jobs often use checkpoints which require high
             | speed networking and storage, which I don't see an option
              | for. E.g., I can get EC2 instances with 100 Gbps networking.
             | 
              | Great job starting the service! But I think you have a ways
              | to go before reaching hyperscaler-level reliability.
        
           | joecot wrote:
           | I'm not very familiar with machine learning setups, but in
           | general with hosts you trade SLA for pricing. You trade the
           | 99.999% uptime and data consistency of hyperscaled hosts for
           | cheaper pricing at smaller hosts. Assuming you can backup
           | your data at checkpoints, consider running it on a setup like
           | this for very cheap, and then back it up every hour/day to a
           | consistently reliable data storage. AWS EC2 is very expensive
           | for computing, but S3 is relatively cheap for storage.
           | Backblaze B2 storage is also even cheaper, but with less
           | guaranteed reliability than S3. But the odds of both this
           | going down on you and Backblaze failing at the same time are
           | pretty low.
           | 
           | I run a credit card processor on AWS, my important personal
           | websites on Linode, and my fun time websites and video
           | conferencing on a small Chicago host called Genesis Hosting
           | with a website out of the 90s and dirt cheap pricing (but
           | excellent support). Match your price and SLA with how much
            | pain it's going to cause you if it goes down, and don't pay
            | extra to put all your stuff on the same five-nines host if
            | you don't really have to.
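The back-up-at-checkpoints pattern described above can be sketched roughly as follows. This is an illustration, not gpu.land-specific: the bucket name is hypothetical, and the upload shells out to the AWS CLI (any `aws s3 cp`-style tool, e.g. for Backblaze B2, would slot in the same way):

```python
# Sketch: train on a cheap GPU host, but copy checkpoints to durable
# storage (e.g. S3 or Backblaze B2) on a fixed interval.
import subprocess

CHECKPOINT_INTERVAL = 3600  # seconds between durable backups (1 hour)

def should_back_up(last_backup: float, now: float,
                   interval: float = CHECKPOINT_INTERVAL) -> bool:
    """True once `interval` seconds have passed since the last backup."""
    return now - last_backup >= interval

def back_up(local_path: str, bucket: str = "my-training-checkpoints") -> None:
    """Copy a checkpoint file to S3; assumes the AWS CLI is configured."""
    subprocess.run(["aws", "s3", "cp", local_path, f"s3://{bucket}/"],
                   check=True)

# In a training loop, after saving a local checkpoint:
#   if should_back_up(last_backup, time.time()):
#       back_up("ckpt.pt")
#       last_backup = time.time()
```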
        
         | ghoomketu wrote:
          | For somebody who has zero knowledge of ML but has always been
          | wary of the huge costs involved, this sure looks like a great
          | offering!
         | 
          | But I still don't know what it would cost to get something
          | useful out of it. Can you (or anybody who knows about ML) give
          | me a very, very rough ballpark of what it costs to train a
          | model like www.remove.bg that automatically removes
          | backgrounds from photos? I'm not trying to build a clone, but
          | I'm curious what sort of financial investment it takes to
          | make such things.
        
           | perennate wrote:
           | Unless you're using unsupervised learning or can find a good
           | dataset, most of the cost will be in labeling the data. I'm
           | not too familiar with background removal, there may be some
           | self-supervised/contrastive learning approaches, but
           | generally they don't work as well as supervised learning.
           | Even for a week of training, the compute cost is only $200.
           | 
           | Edit: maybe you can get some decent results just with COCO
           | segmentation labels:
           | https://towardsdatascience.com/background-removal-with-
           | deep-...
        
         | ashish01 wrote:
         | Nice! I really like the top up feature. Really gives me peace
         | of mind that I will not blow up my budget accidentally. I have
         | been using https://datacrunch.io/ which has almost same feature
         | set.
        
         | mraza007 wrote:
          | Hey, I love the product you made. As a student I'm always
          | concerned about spending on cloud services, since they can be
          | pricey. I'm just curious: I've used Google Colab in the past,
          | and it charges $10/month, so I wanted to know how this
          | compares to Google Colab.
          | 
          | I would love to hear your thoughts on this
        
         | chovybizzass wrote:
         | was going to mine but you say not to.
        
           | ilmoi wrote:
           | thanks for reading the FAQ! Most people don't seem to
           | bother:)
        
             | Sanzig wrote:
             | Is general CUDA development permitted, or can I only use
             | this for deep learning?
             | 
             | There's a simulation problem that I encounter often at work
             | which should parallelize very well, and I was considering
             | getting my feet wet with CUDA by writing a solver for it.
        
               | ilmoi wrote:
                | CUDA development is totally fine. The service was
                | designed with DL in mind, but that shouldn't stop CUDA
                | dev in any way.
        
         | teruakohatu wrote:
          | Are there any persistent storage options? I looked through the
          | FAQ and comparisons and couldn't find any mention of it.
        
       | iujjkfjdkkdkf wrote:
        | Sorry if I missed it in the FAQ: do you have finite capacity, or
        | do you have an agreement with the data center where you can
        | bring on more GPUs with demand? My worry would be to wake up one
        | day and find that all GPUs are in use.
       | 
       | And (this may sound naive) I work with other companies' data, I
       | have a responsibility to them to keep it safe, do you have any
       | concerns with this being used for "professional" applications or
       | are you targeting research / hobby?
       | 
       | To be clear, I am dealing with things that my clients are
       | comfortable with me working with on mainstream cloud providers,
       | not state secrets or data with legislative requirements.
        
       | artem_mazur wrote:
       | Looks interesting! Cool!!
        
       | dindresto wrote:
       | How does the Tesla V100 compare to the Tesla P100, which Scaleway
       | offers at a similar price of EUR1/hour?
       | 
       | https://www.scaleway.com/en/gpu-instances/
        
         | ilmoi wrote:
         | Tesla P100 is the previous generation. P100 > V100 > A100. You
          | can see detailed benchmarking here -
          | http://ai-benchmark.com/ranking_deeplearning.html
        
           | knrz wrote:
           | For others who may have been confused, like me, in order of
           | GPU power:
           | 
           | (Least) P100 < V100 < A100 (Most)
        
             | ilmoi wrote:
             | Sorry I should have been clearer in my reply. Thanks for
             | clarifying!
        
       | usmannk wrote:
       | This is seriously cool. I want to mention that on AWS you can
       | reliably get V100's for $0.918/hr (I do this all the time) by
       | using spot instances. The spot price hasn't changed by even 0.01
       | for at least a year, so you're extremely unlikely to get an
       | unexpected shutdown. Also, how did you manage to get V100s? What
       | was that like?
        
         | ilmoi wrote:
         | Yep you're correct that V100s are available as spot instances.
         | If you're willing to write some extra code for saving down
         | weights you could easily go with them.
         | 
         | Actually, I'm renting V100s. Got lucky to know the right person
         | at the right time:)
        
           | usmannk wrote:
           | Ah gotcha, did you have to move them to your DC and have them
           | racked?
        
       | mastermojo wrote:
        | Very cool. Is there a way to persist disks at a cost and
        | attach/detach Voltas for training?
       | 
       | EDIT: found in the FAQ:
       | 
       | Compute: $0.99/hr / 1x Tesla V100 (running instance only)
       | Storage: $0.02/GB/month (running and stopped instances)
        
         | ilmoi wrote:
         | Yep - most people stop instances and come back whenever they're
         | ready to train again. No charges for GPUs while the instance is
         | stopped.
         | 
          | If you have, say, a 200GB hard drive, you're only paying
          | $4/month for storage.
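Those two prices (GPU time billed only while the instance runs, storage billed either way) make monthly cost easy to estimate. A small sketch using the rates quoted above:

```python
# Monthly cost at the quoted gpu.land rates: $0.99/hr per running V100,
# $0.02/GB/month for storage whether the instance is running or stopped.
def monthly_cost(gpu_hours: float, storage_gb: float,
                 gpu_rate: float = 0.99, storage_rate: float = 0.02) -> float:
    return gpu_hours * gpu_rate + storage_gb * storage_rate

# A 200GB disk left stopped all month costs $4.00;
# the same disk plus 40 hours of training costs $43.60.
```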
        
       | ffpip wrote:
        | FYI, the site shows nothing until the scripts from stripe.com
        | load. I know that these breakages might not happen to all
        | users, but the site shouldn't be completely broken until the
        | browser connects to Stripe.
       | 
       | https://images2.imgbox.com/b2/42/n87NuBRT_o.png
        
         | ilmoi wrote:
         | Do you think it's the scripts from stripe that are blocking the
         | view? I would guess it's auth0.
         | 
         | This was my first front-end project so any feedback 100%
         | welcome.
        
           | jandrese wrote:
           | FWIW when I went to the site I had to whitelist scripts from
           | js.stripe.com to get anything more than a black screen.
        
             | ilmoi wrote:
              | Ok noted. That's for me to work on. Thanks for pointing it
              | out.
        
       | neil1 wrote:
       | This is a pretty cool project
        
       | maremmano wrote:
       | I came for the GPU, I stayed for the vintage CSS
        
       | iujjkfjdkkdkf wrote:
       | One more comment / suggestion: I watched the video and saw there
       | are lots of pre-configured environments. I see this a lot, but to
        | be honest this is not too helpful for me because I'm not doing
       | development in your environment, I'm writing and checking the
       | code somewhere else and then training on the gpus.
       | 
       | For the same reason, I'm only getting value out of the gpus when
       | I am right ready to train. So I would be much happier if I could
       | push a docker, or maybe even a conda environment spec, along with
       | my code, attach data storage, and run e.g. train.py (or more
       | likely a shell script that calls it) to completion and then
        | immediately release the GPUs. Everything else is just overhead:
        | getting the environment right as quickly as possible, running my
        | script, and shutting down the instance as soon as it's done. It
        | would be awesome to have this kind of functionality - or is
        | there a way to do that I missed?
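gpu.land doesn't expose an API for this workflow as far as the thread shows, but the run-to-completion pattern can be approximated from your own machine over SSH. A hedged sketch: the host, image, and script names are all hypothetical, and powering the instance off is what stops GPU billing:

```python
# Build an ssh command that pulls a Docker image, runs training inside it,
# then powers the instance off so GPU billing stops.
def remote_train_cmd(host: str, image: str, script: str = "train.py") -> list:
    remote = (f"docker pull {image} && "
              f"docker run --gpus all -v $HOME/work:/work {image} "
              f"python /work/{script}; "
              f"sudo shutdown -h now")
    return ["ssh", host, remote]

cmd = remote_train_cmd("ubuntu@my-gpu-instance", "myrepo/train:latest")
# subprocess.run(cmd, check=True)  # uncomment to actually launch
```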
        
       | paul_milovanov wrote:
       | this is great, except apparently there's no support for
       | containers? is that on the roadmap?
       | 
       | thanks!
        
       | 37ef_ced3 wrote:
       | AVX-512 neural net inference on inexpensive, CPU-only cloud
       | compute instances: https://NN-512.com
       | 
        | An AVX-512 Skylake-X cloud compute instance costs $10 per
        | CPU-core per month at Vultr
        | (https://www.vultr.com/products/cloud-compute/), and you can do
        | about 18 DenseNet121 inferences per CPU-core per second (in
        | series, not batched) using tools like NN-512
       | 
       | GPU cloud compute is almost unbelievably expensive. Even Linode
       | charges $1000 per month, or $1.50 per hour (look at the GPU
       | plans: https://www.linode.com/pricing/#row--compute). It's really
       | hard to keep that GPU saturated, which is what you need to do to
       | get your money's worth
       | 
       | As AVX-512 becomes better supported by Intel and AMD chips, it
       | becomes more attractive as an alternative to expensive GPU
       | instances for workloads with small amounts of inference mixed
       | with other computation
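As a sanity check on those numbers, the quoted figures ($10 per core-month, ~18 DenseNet121 inferences per core-second) work out to roughly 21 cents per million inferences:

```python
# Back-of-envelope cost per million inferences from the figures above.
COST_PER_CORE_MONTH = 10.0   # USD, Vultr Skylake-X core
INFERENCES_PER_SEC = 18      # DenseNet121, per core, serial (NN-512)
HOURS_PER_MONTH = 730        # ~ 24 * 365 / 12

inferences_per_month = INFERENCES_PER_SEC * 3600 * HOURS_PER_MONTH
cost_per_million = COST_PER_CORE_MONTH / (inferences_per_month / 1e6)
print(f"~${cost_per_million:.2f} per million inferences")  # ~$0.21
```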
        
         | rckoepke wrote:
          | What does this have to do with https://gpu.land/ ? I've been
          | very impressed with NN-512 in your past postings, but I'm
          | failing to see the direct relevance to the top-level post.
         | 
         | Perhaps readers would benefit from an apples-to-apples
         | comparison of V100 to some CPU in a training per dollar metric?
         | Preferably using something like MLPerf. You do mention
         | inference but I think most people looking at https://gpu.land/
         | will be far more interested in training rather than inference.
         | 
         | I think the most direct competitor to this ShowHN would be
         | https://vast.ai/
        
           | ilmoi wrote:
           | You are correct that vast.ai is pretty close.
           | 
           | The biggest difference is probably security / guaranteed
           | uptime. With vast you're getting what it says on the tin - a
           | machine from a marketplace. Could come from anyone /
           | anywhere. No idea what else is running on it. Ours are hosted
           | in a professional DC, managed and secured as they should be.
           | 
           | If anyone's curious, there's a detailed comparison page with
           | other platforms here - https://gpu.land/versus
        
           | perennate wrote:
           | I think Linode/Paperspace/AWS/GCP/etc. are closer to "direct
           | competitors". I would be much more comfortable trusting a
           | single entity that owns its servers than a GPU rental
           | marketplace like vast.ai.
        
           | 37ef_ced3 wrote:
           | The suggestion is that you may not want to use a GPU if you
           | can use AVX-512 CPUs for your machine learning workload.
           | Reducing the cost of the GPU helps, but it's still relatively
           | expensive
        
       | akrymski wrote:
       | Congrats on the launch! Cheaper GPUs are always welcome ;-)
        
       | stuartbman wrote:
       | This is a very cool project, but can I ask if there's a
       | particular CSS package you used to get this site appearance?
        
         | ilmoi wrote:
         | Thanks! I used Tailwind CSS and built the theme from scratch,
         | but I drew inspiration from:
         | 
         | - https://nostalgic-css.github.io/NES.css/
         | 
         | - https://jdan.github.io/98.css/
        
       | DIVx0 wrote:
       | Could this be used for bespoke cloud gaming? I was considering
       | setting up a parsec host with paperspace GPU. I want more control
       | over the game client than geforce now or stadia provides and a
       | better GPU and availability than Shadow.tech currently offers.
       | 
       | *edit: I see that only Linux hosts are available at the moment so
       | that kills my use case but I'll keep the question up for grins
       | and giggles
        
         | ilmoi wrote:
          | Not really. Firstly, it will be way too expensive for you
          | (cloud gaming is like $10/mo; this is $10/10hr). Secondly, the
          | GPUs installed (Tesla V100s) are not designed for gaming -
          | rather for ML & scientific computing.
         | 
         | You should check out some of the services people mention in
         | https://www.reddit.com/r/cloudygamer/
        
       | tombh wrote:
        | This seems very similarly priced to AWS Spot Instances. But
        | you're guaranteeing uptime?
        
         | perennate wrote:
         | In the FAQ it says:
         | 
         | > Is my instance guaranteed?
         | 
         | > Yes, unless you run out of credit. To prevent that from
         | happening be sure to setup automatic top ups.
        
       | samsammurphy wrote:
       | Cooooool!
        
       | DougN7 wrote:
       | I know nothing about crypto mining, but why wouldn't this be
       | allowed? How would you stop it?
        
         | ilmoi wrote:
          | you'll be making around $6/day, while paying $24/day for the
          | service.
        
           | ska wrote:
            | see jsnell's point, which is worth considering and
            | mitigating if your credit card processor isn't able to
            | catch it.
        
         | imhoguy wrote:
          | Wouldn't really pay off, check the FAQ.
        
         | jsnell wrote:
         | Mining cryptocurrencies is a good way to launder stolen credit
         | cards numbers to cash. It is scalable (unlike most digital
         | goods) and automatable (unlike most physical goods). It being
         | unprofitable will not dissuade anyone from this use case: they
         | are not paying with their own money to start with.
         | 
         | So it is best to forbid it in the TOS and just autokill the
         | miners, rather than wait for the chargebacks. Especially if the
         | compute is being sold at cost, since there is no buffer of
         | profits to balance out the abuse.
        
           | ilmoi wrote:
           | Wow thanks for posting this! I defo haven't thought about
           | this use case. So interesting.
           | 
           | So we actually explicitly prohibit mining in the T&Cs, it's
           | just that in the FAQ I wanted to be a bit more human and
           | dissuade people.
           | 
            | Also, there are quite a few protections built in at the
            | network layer to make sure mining isn't possible. Ports,
            | IPs, DNS, even DPI. I learnt a lot about the early days of
            | bitcoin and mining protocols when I was building gpu.land:)
           | 
           | But again, thanks for sharing this. Really good to know.
        
       | perennate wrote:
       | The website design put me off but the FAQ makes it clear that it
       | is quite professional. They have information about physical
       | security, GDPR compliance, and VLAN. Wish there was some
       | information about storage, e.g. if it's local disk or distributed
       | block storage, and how many replicas. Very nice that unused
       | credit is refundable.
        
         | ilmoi wrote:
          | Great point re storage, I'll be sure to add it to the FAQ. To
          | answer your question: it's SSD block storage.
        
       ___________________________________________________________________
       (page generated 2021-03-17 23:02 UTC)