[HN Gopher] Show HN: San Francisco Compute - 512 H100s at <$2/hr...
___________________________________________________________________
Show HN: San Francisco Compute - 512 H100s at <$2/hr for research
and startups
Hey folks! We're Alex and Evan, and we're working on putting
together a 512 H100 compute cluster for startups and researchers to
train large generative models on.
- it runs at the lowest possible margins (<$2.00/hr per H100)
- it's designed for bursty training runs, so you can take, say, 128
H100s for a week
- you don't need to commit to multiple years of compute or pay for a
year upfront
Big labs like OpenAI and DeepMind have big clusters that support
this kind of bursty allocation for their researchers, but so far
startups have had to get very small clusters on very long-term
contracts, wait out months of lead time, and try to keep them busy
all the time. Our goal is to make it about 10-20x cheaper to do an
AI startup than it is right now. Stable Diffusion only costs about
$100k to train -- in theory every YC company could get up to that
scale. It's just that no cloud provider in the world will give you
$100k of compute for just a couple of weeks, so startups have to
raise 20x that much to buy a whole year of compute. Once the cluster
is online, we're going to be pretty much the only option for
startups that want to do big training runs like that.
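A quick back-of-envelope at the quoted rate (a sketch in Python; the
$2/hr rate, the 128-GPU week, and the ~$100k figure are from the
post above):

    # Cost of a bursty training run at the quoted rate.
    RATE = 2.00  # $/H100-hour, the upper bound quoted above

    def run_cost(n_gpus, hours):
        """Total rental cost for n_gpus over the given hours."""
        return n_gpus * hours * RATE

    print(run_cost(128, 24 * 7))  # one-week burst on 128 H100s: $43,008
    print(100_000 / RATE)         # a $100k budget buys 50,000 H100-hours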
Author : flaque
Score : 330 points
Date : 2023-07-30 17:25 UTC (5 hours ago)
(HTM) web link (sfcompute.org)
(TXT) w3m dump (sfcompute.org)
| resonance1994 wrote:
| Just curious, do you guys use renewable energy to power your
| cluster?
| sillysaurusx wrote:
| I hope you succeed. TPU research cloud (TRC) tried this in 2019.
| It was how I got my start.
|
| In 2023 you can barely get a single TPU for more than an hour.
| Back then you could get literally hundreds, with an s.
|
| I believed in TRC. I thought they'd solve it by scaling, and
| building a whole continent of TPUs. But in the end, TPU time was
| cut short in favor of internal researchers -- some researchers
| being more equal than others. And how could it be any other way?
| If I made a proposal today to get these H100s to train GPT to
| play chess, people would laugh. The world is different now.
|
| Your project has a youthful optimism that I hope you won't lose
| as you go. And in fact it might be the way to win in the long
| run. So whenever someone comes knocking, begging for a tiny slice
| of your H100s for their harebrained idea, I hope you'll humor
| them. It's the only reason I was able to become anybody.
| ShamelessC wrote:
| Wow! I never thought you'd see the light. All I ever see from
| your posts is praise for TRC. As someone who got started way
| later on, I had infinitely more success with a gaming GPU I
| owned myself. Obviously not really comparable, but TRC was very
| very difficult to work with. I think I only ever had access to
| a TPUv3 once and that wasn't nearly enough time to learn the
| ropes.
|
| My understanding was that this situation changed drastically
| depending on what sort of email you had or how popular your
| Twitter handle was.
| latchkey wrote:
| What Shawn says is absolutely right. The race right now is way
| too hot for this stuff. A single customer will eat up 512 GPUs
| for 3 years.
| flaque wrote:
| > Your project has a youthful optimism that I hope you won't
| lose as you go. And in fact it might be the way to win in the
| long run.
|
| This is the nicest thing anyone has said to us about this.
| We're gonna frame this and hang it on our wall.
|
| > So whenever someone comes knocking, begging for a tiny slice
| of your H100s for their harebrained idea, I hope you'll humor
| them.
|
| Absolutely! :D
| camhart wrote:
| Optimism is (almost) always required in order to accomplish
| anything of significance. Those who lose it aren't living up
| to their potential.
|
| I'm not encouraging the false belief that everything you do
| will work out. Instead I'm encouraging the realization that
| the greatest accomplishments almost always feel like long
| shots, and require significant amounts of optimism. Fear and
| pessimism, while helpful in appropriate doses, will limit you
| greatly in life if you let them rule you too significantly.
|
| When I look back on my life, the greatest accomplishments
| I've achieved are ones where I was naive yet optimistic going
| in. This was a good thing, because I would have been too
| scared to try had I really known the challenges that lay
| ahead.
| Frost1x wrote:
| >Optimism is (almost) always required in order to
| accomplish anything of significance. Those who lose it
| aren't living up to their potential.
|
| I argue that realism trumps optimism. It's perfectly normal
| in a realist framing to see something difficult,
| acknowledge the high risk and failure potential, and still
| pursue it with intent to succeed.
|
| I've personally grown tired of over-optimism everywhere,
| because it creates unrealistic situations and passes the
| consequences of failure along in an inequitable way. The
| "visionary" is rewarded when the rare successes occur,
| while everyone else suffers the consequences of the many
| failures. No contingency plans for failure, no discussion
| of failure, and so on. Optimism just takes any idea and
| pursues it; the consequences become someone else's problem.
|
| Pessimism isn't much better: you essentially think
| everything is too risky or unlikely to succeed, so you
| never do anything. You live in a state of inaction because
| any level of risk or uncertainty is too much.
|
| To me, realism is much better. You acknowledge the
| challenge. You acknowledge the risk. You make sure everyone
| involved understands it, but you still charge forward
| knowing you might succeed. Some think that if you're not
| naively optimistic (what most people in my experience refer
| to as "optimism") you don't create enough pressure. I think
| that's nonsense.
| rileyphone wrote:
| You can be a realist visionary.
| dhash wrote:
| YC startup founder here,
|
| Mostly agree, except the market is not an optimistic place
| -- it's the market.
|
| There are a multitude of reasons you lose your optimism,
| mostly because people take it away -- your optimism is
| their money
| LoganDark wrote:
| > In 2023 you can barely get a single TPU for more than an
| hour.
|
| Um. Can't you order them from coral.ai and put them in an NVMe
| slot? Or are the cloud TPUs more powerful?
| whimsicalism wrote:
| TPU pods are not sold by Google; the Edge TPU is a
| different product
| LoganDark wrote:
| So the cloud TPUs are more powerful...? Or what are you
| saying?
| whimsicalism wrote:
| yes
| sillysaurusx wrote:
| Yeah, it's a silly branding thing.
|
| One TPU (not even a pod, just a regular old TPUv2) has 96
| CPU cores with 1.4TB of RAM, and that's not even counting
| their hardware acceleration. I'd love to buy one.
| kaycebasques wrote:
| Hi, SF lover [1] here. Anything interesting to note about your
| name? Will your hardware actually be based in SF? Any plans to
| start meetups or bring customers together for socializing or
| anything like that?
|
| [1] We have not gone the way of the Xerces blue [2] yet... we
| still exist!
|
| [2] https://en.wikipedia.org/wiki/Xerces_blue
| agajews wrote:
| Ah the hardware isn't gonna be in SF (not the cheapest
| datacenter space)
|
| But I do think a lot of our customers will be out here --- SF
| is still probably the best place to do startups. We just have
| so many more people doing hard technical stuff here. Literally
| every single place I've lived in SF, there's been another
| startup living upstairs or downstairs.
|
| Good idea to host some in-person events!
| [deleted]
| moneycantbuy wrote:
| How did you get the money to buy 512 H100s?
| taminka wrote:
| ask no questions, hear no lies
| rvnx wrote:
| EDIT: They seem to be in a fundraising / debt-raising stage.
| Great initiative.
| [deleted]
| williamstein wrote:
| Their announcement says "We can probably get a good deal
| from a bank [...]", so maybe they don't just have 20M USD
| sitting around.
| rvnx wrote:
| Well, this pushes me even further toward the view that
| they are actually good guys who need support, and that
| they are trying to bring a good deal to the table :)
| herval wrote:
| unrelated to this specific initiative, but - I keep seeing
| a lot of announcements of huge VC rounds around what are
| effectively datacenters for GPUs. Curious about the math
| behind that - I feel like these things become obsolete so
| fast that it's almost like the whole scooter-rental thing,
| where the unit economics don't add up.
|
| Anyone have an insight?
| humanistbot wrote:
| Sentence one of the post clearly states that they are VC
| funders who are doing this for a round of startups they just
| funded, and that they're looking for others to be a part of
| it.
| flaque wrote:
| Oh no, definitely not. We just got a loan.
|
| Neither Alex nor I am currently a VC, and this has no
| affiliation with any venture fund.
|
| We want to be a customer of the sf compute group too!
| xwdv wrote:
| I'm curious, how are those loans guaranteed?
| latchkey wrote:
| 554 5.7.1 <evan@sfcompute.org>: Relay access denied
|
| 554 5.7.1 <alex@sfcompute.org>: Relay access denied
| flaque wrote:
| !!!!!! fixing this. For the moment, evan at roomservice dot dev
| ranting-moth wrote:
| Ah, putting out flames live on HN. Back in the day it was on
| IRC or just on the phone with the customer. I miss those
| times.
| [deleted]
| fragmede wrote:
| fwiw, https://roomservice.dev/ is currently a 404
| latchkey wrote:
| http != smtp
|
|     roomservice.dev. 60 IN MX 5  alt1.aspmx.l.google.com.
|     roomservice.dev. 60 IN MX 5  alt2.aspmx.l.google.com.
|     roomservice.dev. 60 IN MX 1  aspmx.l.google.com.
|     roomservice.dev. 60 IN MX 10 alt3.aspmx.l.google.com.
|     roomservice.dev. 60 IN MX 10 alt4.aspmx.l.google.com.
|     roomservice.dev. 60 IN MX 15 4ig53n4pw7p3cuxm7n7xi7dpuyq6722aipexvhkngzbd2e4mudmq.mx-verification.google.com.
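| For reference, a sketch of the same lookup using the dnspython
| library (assuming dnspython 2.x):
|
|     # Print a domain's MX records, lowest preference first.
|     import dns.resolver
|
|     for rec in sorted(dns.resolver.resolve("roomservice.dev", "MX"),
|                       key=lambda r: r.preference):
|         print(rec.preference, rec.exchange)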
| fragmede wrote:
| I know the difference between an email and a web page,
| tyvm.
| [deleted]
| flaque wrote:
| Ah yeah, that's normal! Was from my old CRDT company, and
| works as a good emergency email while we debug our DNS.
| fragmede wrote:
| I assume it was a Take3 reference. I wanted to point it
| out, in case it was supposed to return more than a 404.
| [deleted]
| latchkey wrote:
| done
| 29athrowaway wrote:
| During a gold rush, sell shovels.
|
| When was the last time you spoke to a chatbot?
| netsec_burn wrote:
| For me, today and almost every day since the beginning of this
| year. Not sure if that saying applies here.
| version_five wrote:
| Chatbot in the sense I think you mean is a horrible
| application. Millions of people are using large language models
| daily though.
| lulunananaluna wrote:
| Downvoted by others, yet very true. This is a valid business
| model; nothing to be ashamed of.
| whimsicalism wrote:
| I am super interested in AI on a personal level and have been
| involved for a number of years.
|
| I have never seen a GPU crunch quite like the one right now.
| To anyone who is interested in hobbyist ML, I highly, highly
| recommend using vast.ai
| quickthrower2 wrote:
| Depends on what you class as hobbyist, but I am running a T4
| for a few minutes at a time to get acquainted with tools and
| concepts, and I found modal.com really good for this. They
| resell AWS and GCP at the moment. They also have A100s, but a
| T4 is all I need for now.
| whimsicalism wrote:
| Significantly more expensive than an equivalent 3090
| configuration, if you can do model parallelism
| quickthrower2 wrote:
| What do you mean by this? I use less than the $30/mo of free
| included usage.
|
| I am guessing you mean at some point just buy your own 3090
| as it will be cheaper than paying a cloud per second for a
| server-grade Nvidia setup.
| whimsicalism wrote:
| I think this is more applicable to training use cases. If
| you can get by with less than $30/mo in AWS compute
| (quite expensive) then it likely does not make a
| difference.
|
| What I mean is that you can rent four 3090 GPUs for much
| cheaper than renting an A100 on AWS, because you are not
| paying Nvidia's "cloud tax" on flops/$
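| As a rough worked example (all prices here are assumptions for
| illustration, not quotes):
|
|     # Hourly rental cost: four RTX 3090s on a marketplace vs. one
|     # A100 on AWS (p4d.24xlarge on-demand, ~$32.77/hr for 8 GPUs).
|     PRICE_3090_HR = 0.30        # assumed per-3090 marketplace rate
|     PRICE_A100_HR = 32.77 / 8   # ~$4.10 per A100-hour
|
|     print(f"4x 3090: ${4 * PRICE_3090_HR:.2f}/hr")  # ~$1.20/hr
|     print(f"1x A100: ${PRICE_A100_HR:.2f}/hr")      # ~$4.10/hr
|     # If model parallelism lets the 3090s' aggregate throughput
|     # land in the same ballpark, the gap is mostly the "cloud tax".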
| williamstein wrote:
| Many thanks for posting about vast.ai, which I had never heard
| of! It's a sort of "gig economy/marketplace" for GPUs. The
| first machine I tried just now worked fine: it had 512GB of
| RAM, 256 AMD CPU cores, and an A100 GPU, and I got about 4
| minutes for $0.05 (which they provided for free).
| whimsicalism wrote:
| The only caveat is that it is not really appropriate for
| private use cases.
|
| Also, many of the available options are clearly recycled
| crypto-mining rigs, which have somewhat odd configurations
| (poor GPU bandwidth, low CPU RAM).
| itissid wrote:
Noob thought: so this could be a blueprint for how mid-tier
universities with older large compute-cluster operations could do
things in 2023 to support large LLM research?

Perhaps it's also something for freshly applying grad students to
look for in a university that wants to do LLM research at
scale...
| itissid wrote:
| To clarify: a new grad student could look at the current
| group and ask "Hey, I know you are working on LLMs, but how
| many $$ of your grant are dedicated to how many TPU hours per
| grad student?"
| rsync wrote:
| "Once the cluster is online ..."
|
| Where will the cluster be hosted ?
|
| May I suggest that you get your IP transit from he.net ?
| fragmede wrote:
| Not to mention, San Francisco is not known for having cheap
| real estate, nor is it known for having cheap electricity. On
| my last (residential) PG&E bill, I paid $0.50938/kWh at peak.
| vladgur wrote:
| While business rates may be different, California cannot be a
| sensible place to host power-hungry infrastructure - our
| electrical rates are easily 5-8 times those of other
| locations in the US
| [deleted]
| williamstein wrote:
| How does this compare to https://lambdalabs.com/ ?
| jorlow wrote:
| You can usually only get a few H100s at a time unless you
| commit to reserved instances (for a longer time period)
| wongarsu wrote:
| Very similar price, but from what I gather very different
| model. One important difference might be if you regularly run
| short-ish training runs over many GPUs. Lambdalabs might not
| have 256 instances to give you right now. With OP you are
| basically buying the right to put jobs in the job queue for
| their 512 GPU cluster, so running a job that needs 256 GPUs
| isn't an issue (though you might wait behind someone running a
| 512 GPU job).
|
| No idea what capacity at Lambda Labs actually looks like,
| though. Does anyone have insight into how easy it is to spin
| up more than 2-3 instances there?
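| A minimal sketch of that queueing model (hypothetical -- OP
| hasn't published a scheduler, and a real one would also handle
| preemption and fairness):
|
|     # Toy FIFO scheduler over a shared 512-GPU pool.
|     from collections import deque
|
|     TOTAL_GPUS = 512
|
|     def admit(jobs, free=TOTAL_GPUS):
|         """jobs: deque of (name, gpus_needed), served strictly in
|         order; a job waits only for the jobs ahead of it."""
|         running = []
|         while jobs and jobs[0][1] <= free:
|             name, need = jobs.popleft()
|             free -= need
|             running.append((name, need))
|         return running, free
|
|     running, free = admit(deque([("a", 256), ("b", 512), ("c", 64)]))
|     # -> "a" starts on 256 GPUs; "b" waits for the full 512 and,
|     #    in strict FIFO, blocks "c" behind it.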
| agajews wrote:
| Yeah it's pretty hard to find a big block of GPUs that you
| can use for a short time, esp. if you need InfiniBand for
| multi-node training. Lambda, I think, needs a min reservation
| of 6-12 months if you want IB.
| flaque wrote:
| Ah, we're running a medium amount of compute at zero margin.
| The point is not to go sell to the Fortune 500, but to make
| sure a grad student can spend a $50k grant.
|
| Right now, it's pretty easy to get a few A/H100s (Lambda is
| great for this), but very hard to get more than 24 at a
| reasonable price (~$2 an hour). One often needs to put up a 6+
| month commitment, even when all one wants is an 8-hour
| training run.
|
| It's the right business decision for GPU brokers to do long
| term reservations and so on, and we might do so too if we were
| in their shoes. But we're not in their shoes and have a very
| different goal: arm the rebels! Let someone who isn't BigCorp
| train a model!
| narrator wrote:
| So what happens when some big bucks VC backed closed source
| LLM company buys all your compute inventory for the next 5
| years? This is not that unlikely. Lambda Labs a little while
| back was completely sold out of all compute inventory.
| xeromal wrote:
| I assume it's up to them to say no. They did say they're
| not in it to make beaucoup bucks
| agajews wrote:
| Yeah we aren't going to let anyone book the whole thing
| for years. If we ever have to make the choice, we'll
| choose the startups over the big companies.
| trostaft wrote:
| > but to make sure a grad student can spend a $50k grant.
|
| As a graduate student, thank you. Thankfully, my workloads
| aren't LLM crazy so I can get by on my old NVIDIA consumer
| hardware, but I have coworkers struggling to get reasonable
| prices/time for larger scale hardware.
| lulunananaluna wrote:
| This is great. Thank you very much for your work.
| theptip wrote:
| My question too. At $2/hr for H100 that seems more flexible?
| But I haven't tried to get 10k GPU-hours on any of these
| services, maybe that is where the bottleneck is.
| ivalm wrote:
| No real way to get a big block without a commitment. IIRC the
| smallest H100 commitment is 64 GPUs for 3 years (about $3M
| USD).
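| A quick sanity check on that figure, assuming the ~$2/hr rate
| from this thread (the committed rate is presumably a bit
| lower):
|
|     >>> 64 * 2.00 * 24 * 365 * 3  # 64 GPUs, 3 years, $2/GPU-hr
|     3363840.0                     # ~$3.4M, consistent with "~$3M"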
| agajews wrote:
| [dead]
| nilsbunger wrote:
| I love the idea of community assets. Could it be the start of a
| GPU co-op?
| samstave wrote:
| Serious Q, as I don't know Twitter's internal infra at all...
| but with shrinking ad revenue, or maybe less engagement by
| users, and the influx of Threads - maybe Twitter could offer
| up slices of its infra (whether rack space, VMs, containers,
| connectivity, who knows what) to support startups such as
| this?
|
| Basically Twitter devolves into the colos of the late 90s :-)
|
| -
|
| For those who didn't notice, it was tongue in cheek.
| mike_d wrote:
| Generally when you just stop paying your bills, the
| datacenter holds your hardware and eventually auctions it off
| to cover some of your debt. I seriously doubt Twitter has any
| access to the two of its three datacenters Elon decided not
| to pay for.
| version_five wrote:
| I've generally tried to give Twitter the benefit of the doubt
| but I would never trust them as an infrastructure provider in
| their current incarnation. Reliability and consistency have
| been so far from their focus.
| aionaiodfgnio wrote:
| Would you really trust a company that doesn't pay its rent to
| run your infrastructure?
| fragmede wrote:
| For consumer-grade cards, that's already here.
|
| Make money off your GPU with vast.AI
|
| https://cloud.vast.ai/host/setup
| mdaniel wrote:
| > Requirements
|
| > Ubuntu 18.04 or newer (required)
|
| > Dedicated machines only - the machine shouldn't be doing
| other stuff while rented
|
| well that's certainly not what I expected. ctrl-f "virtual"
| gives nothing, so it seems they really mean "take over your
| machine"
|
| > Note: you may need to install python2.7 to run the install
| script.
|
| what kind of nonsense is this? Did they write the script in
| 2001 and just abandon it?
| williamstein wrote:
| I just skimmed their FAQ at https://vast.ai/faq, and it
| seems like it could use an update. E.g., it says "Initially
| we are supporting Ubuntu Linux, more specifically Ubuntu
| 16.04 LTS.". That version of Ubuntu has been end-of-life'd
| for several years, and when I just tried vast.ai out, it
| seemed to be using Ubuntu 20.04. There were also a couple
| of words with letters missing (probably trivial typos) that
| could be found with a spell checker. The questions in their
| FAQ are really interesting though, in terms of highlighting
| what users care about (e.g., there's a lot devoted to "how
| do I use vast.ai + google colab together"?). I also wonder
| when vast.ai started? Sometimes you can get insight from a
| company blog page, but the vast.ai blog seems to start in
| Feb 2023: https://vast.ai/blog . There's a bunch of
| "personal experiences" with vast.ai from 3 years ago in
| this discussion though: https://www.reddit.com/r/MachineLea
| rning/comments/hv49pd/d_c...
|
| A comment in that discussion mentions yet another
| competitor in this space that I've never heard of:
| https://www.qblocks.cloud/ -- I just tried Q blocks out and
| the new user experience wasn't as good for me as with
| vast.ai: you have to put in $10 to try it, instead of
| getting to try it initially for free; there is a manual
| approval process before you can try data center class GPUs;
| you only see that your instance is in Norway (say) after
| you try to start it, not before; it seems like there's no
| ssh access, and they only provide Jupyter to connect;
| neither pytorch nor tensorflow seemed to be installed. They
| could probably update their pages too, e.g.,
| https://www.qblocks.cloud/vision is all about crypto mining
| and smartphones, which feels a bit dated... :-)
| mschuster91 wrote:
| > what kind of nonsense is this? Did they write the script
| in 2001 and just abandon it?
|
| Anything AI/ML is a hot mess of cobbled-together bits and
| pieces of Python barely holding together. I recently read
| somewhere that there should be a new specialization of "ML
| DevOps Engineer"... and hell I'm supporting that.
| p1esk wrote:
| _there should be a new specialization of "ML DevOps
| Engineer"_
|
| Do you mean MLOps? Nothing new about it. We have two
| full-time MLOps engineers at our startup.
| lgats wrote:
| check here to see the current bid prices / gpu setups
| https://cloud.vast.ai/create/
| PartiallyTyped wrote:
| My computer is sitting mostly idle at home, thanks for this.
| ucarion wrote:
| Wishing y'all the best of luck. This would be huge for a lot of
| folks.
| sashank_1509 wrote:
| Correct me if I'm wrong, but doesn't Lambda Labs already
| provide them at $1.89? What's the point if you're not starting
| out as the cheapest?
| agajews wrote:
| Ah that's only if you pay for 3 years of compute upfront. Most
| startups, especially the small ones, really can't afford that
| davidmurphy wrote:
| Looks like their site is quoting a rate of $1.99 now
| https://lambdalabs.com/
| version_five wrote:
| See this post above:
| https://news.ycombinator.com/item?id=36935032
|
| Price and market depth are very different things
| whack wrote:
| > _Rather than each of K startups individually buying clusters of
| N gpus, together we buy a cluster with NK gpus... Then we set up
| a job scheduler to allocate compute_
|
| In theory, this sounds almost identical to the business model
| behind AWS, Azure, and other cloud providers. "Instead of
| everyone buying a fixed amount of hardware for individual use,
| we'll buy a massive pool of hardware that people can time-share."
| Outside of cloud providers having to mark up prices to give
| themselves a net margin, is there something else they're
| failing to do that creates the need for projects like this?
| abraae wrote:
| AWS and Azure would slit their own throats before they created
| a way for their customers to pool instances to save money.
|
| They want to do that themselves, and keep the customer
| relationship and the profits, instead of giving them to a
| middleman or the customer.
| jiggawatts wrote:
| It's just corporate profits combined with market forces, not
| some sort of malicious conspiracy.
|
| You can rent a 2-socket AMD server with 120 available cores
| and RDMA for something like 50c to $2 per hour. That's just
| barely above the cost of the electricity and cooling!
|
| What do you want, free compute just handed to you out of the
| goodness of their hearts?
|
| There is incredible demand for high-end GPUs right now, and
| market prices reflect that.
| bnr4u wrote:
| Having hosted infrastructure in CA at multiple colos, I would
| advise you to host it elsewhere if you can; the cost of power
| and other infrastructure is much higher in CA than in AZ or
| NV.
___________________________________________________________________
(page generated 2023-07-30 23:00 UTC)