[HN Gopher] Our container platform is in production. It has GPUs...
___________________________________________________________________
Our container platform is in production. It has GPUs. Here's an
early look
Author : jgrahamc
Score : 162 points
Date : 2024-09-27 13:05 UTC (9 hours ago)
(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)
| pjmlp wrote:
| So I just discovered that Cloudflare now owns the trademark
| for Sun's "The Network is the Computer".
|
| "Cloudflare serves the entire world -- region: earth. Rather than
| asking developers to provision resources in specific regions,
| data centers and availability zones, we think "The Network is the
| Computer". "
|
| https://blog.cloudflare.com/the-network-is-the-computer/
| remram wrote:
| > The global scheduler is built on Cloudflare Workers, Durable
| Objects, and KV, and decides which Cloudflare location to
| schedule the container to run in. Each location then runs its
| own scheduler, which decides which metals within that location
| to schedule the container to run on.
|
| So they just use the term "location" instead of "region".
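|
| For intuition, a toy sketch of that two-level shape (all
| names are hypothetical, nothing from Cloudflare's code):
|
|     from dataclasses import dataclass, field
|
|     @dataclass
|     class Metal:
|         name: str
|         free_cpu: int
|
|     @dataclass
|     class Location:
|         name: str
|         rtt_ms: float               # latency to the requester
|         metals: list = field(default_factory=list)
|
|     def schedule(locations, cpu_needed):
|         # Global tier: try locations nearest-first.
|         for loc in sorted(locations, key=lambda l: l.rtt_ms):
|             # Local tier: the metal with the most free CPU
|             # that still fits the container.
|             fits = [m for m in loc.metals
|                     if m.free_cpu >= cpu_needed]
|             if fits:
|                 return loc, max(fits, key=lambda m: m.free_cpu)
|         raise RuntimeError("no capacity anywhere")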
| DonHopkins wrote:
| Did they also get the old DEC t-shirt trademark: "The Network
| Is The Network and The Computer Is The Computer. We regret the
| confusion."
|
| IBM mocked Sun with: "When they put the dot into dot-com, they
| forgot how they were going to connect the dots," after sassily
| rolling out Eclipse just to cast a dark shadow on Java. Badoom
| psssh!
|
| https://www.itbusiness.ca/news/ibm-brings-on-demand-computin...
| CSMastermind wrote:
| Does it say anywhere what GPUs they have available?
|
| I really need NVIDIA RTX 4000, 5000, A4000, or A6000 GPUs for
| their ray tracing capabilities.
|
| Sadly I've been very limited in the cloud providers I can find
| that support them.
| willtemperley wrote:
| I doubt they'll commit to specific models before containers go
| GA in 2025 but they'll likely be NVIDIA:
|
| https://www.cloudflare.com/en-gb/press-releases/2023/cloudfl...
| singhrac wrote:
| You can as of very recently get A6000s on Hetzner, which is a
| pretty good deal (but not serverless, so you need a consistent
| load).
| CSMastermind wrote:
| Super helpful, thank you!
| asciimike wrote:
| The short answer here is that NVIDIA doesn't like Cloud
| Service Providers (CSPs) using RTX cards, as they are
| "professional" cards (they are also significantly cheaper
| than the corresponding data center cards). IIRC, the A40,
| L40, and L40S have ray tracing,
| and might be more available on CSPs. Otherwise, the GPU
| marketplaces that aren't "true" CSPs will likely have RTX
| cards.
|
| Paperspace (now DO), Vultr, Coreweave, Crusoe, should all have
| something with ray tracing.
| CSMastermind wrote:
| Incredibly helpful, thank you!
|
| We did try the T4 and A10G but ray tracing failed even
| though those cards claim to support it.
|
| We ended up on Paperspace for the time being but they
| deprecated their support for Windows, so I've been looking
| for alternatives. Will check out the providers you
| mentioned. Thanks again.
| LukeLambert wrote:
| This is really cool and I can't wait to read all about it.
| Unfortunately, I've missed a month of blog posts because
| Cloudflare changed their blog's RSS URL without notice. If you
| change blogging platforms and can't implement a 301, please leave
| a post letting subscribers know where to find the new feed. RSS
| isn't dead!
| jgrahamc wrote:
| We did? That's nuts if we did. What URL were you using?
|
| EDIT: It looks like some people may have been using
| ghost.blog.cloudflare.com/rss because we used to use Ghost but
| the actual URL was/is blog.cloudflare.com/rss. We're setting up
| a redirect for anyone who was using the ghost. URL.
| Traubenfuchs wrote:
| Hacker News is my favorite C-suite-level support forum for
| Cloudflare and Stripe.
| LukeLambert wrote:
| Yes, it was the Ghost URL. Thank you for correcting it! I
| read just about every post, so I have a lot of catching up to
| do.
| jgrahamc wrote:
| Sorry about the interruption! We migrated away from Ghost
| and not sure how you ended up with that URL but we're
| adding a redirect. Have a good catch up :-)
| NicoJuicy wrote:
| Just works? https://blog.cloudflare.com/rss
| jgrahamc wrote:
| Yes, that should be the URL and I don't think that's changed.
| Just wondering what URL the parent was hitting.
| ckastner wrote:
| > _To add GPU support, the Google team introduced nvproxy which
| works using the same principles as described above for syscalls:
| it intercepts ioctls destined to the GPU and proxies a subset to
| the GPU kernel module._
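|
| (The principle, reduced to a toy -- nvproxy itself is Go and
| far more involved; the request numbers below are
| placeholders, not real NVIDIA ioctl codes:)
|
|     import fcntl
|
|     ALLOWED = {0x46, 0x47}  # vetted subset of requests
|
|     def proxied_ioctl(fd, request, arg=0):
|         # Forward only allowlisted ioctls to the real
|         # device; reject everything else.
|         if request not in ALLOWED:
|             raise PermissionError(f"ioctl {request:#x} blocked")
|         return fcntl.ioctl(fd, request, arg)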
|
| This does still expose the host's kernel to a potentially
| malicious workload, right?
|
| If so, could this be mitigated by (continuously) running a QEMU
| VM with GPUs passed through via VFIO, and running whatever
| Workers need within that VM?
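|
| (A sketch of the kind of launch I mean; the PCI address,
| memory size, and disk image are illustrative, not a real
| config. The GPU must be bound to vfio-pci on the host
| first:)
|
|     import subprocess
|
|     GPU_BDF = "0000:65:00.0"  # example PCI address
|
|     subprocess.run([
|         "qemu-system-x86_64",
|         "-enable-kvm",
|         "-machine", "q35",
|         "-cpu", "host",
|         "-m", "16G",
|         "-smp", "8",
|         # hand the whole GPU function to the guest
|         "-device", f"vfio-pci,host={GPU_BDF}",
|         "-drive", "file=ci-guest.qcow2,if=virtio",
|         "-nographic",
|     ], check=True)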
|
| The Debian ROCm Team faces a similar challenge: we want to
| run CI [1] for our stack and all our dependent packages, but
| cannot rule out potentially hostile workloads. We spawn QEMU
| VMs per test (instead of the model described above), but
| that's because our tests must also run against the relevant
| distribution's kernel and firmware.
|
| Incidentally, I've been monitoring the Firecracker VFIO
| GitHub issue linked in the article. Upstream has no use case
| for this, and thus no resources dedicated to implementing
| it, but there's a community meeting [2] coming up in October
| to discuss the future of this feature request.
|
| [1]: https://ci.rocm.debian.net
|
| [2]: https://github.com/firecracker-
| microvm/firecracker/issues/11...
| ec109685 wrote:
| If the calls first pass through a memory-safe language, as
| gVisor's do, isn't the attack surface greatly reduced?
|
| It does seem, however, that Firecracker + GPU support (or
| https://github.com/cloud-hypervisor/cloud-hypervisor) is the
| most promising route.
|
| It's surprising that AWS doesn't have a Lambda-with-GPUs
| need of its own to motivate bringing GPU support to
| Firecracker.
| ckastner wrote:
| > _If the calls first pass through a memory safe language as
| what gvisor does, isn't the attack surface greatly reduced?_
|
| The runtime may be memory safe, but I'm thinking of the GPU
| workloads which nvproxy seems to pass on to the device via
| the host's kernel. Say I find a security issue in the GPU's
| driver, and manage to exploit it with some malicious CUDA
| workload.
| ec109685 wrote:
| Would having a VM in between help in that case? It seems
| like protecting against malicious GPU workloads requires the
| GPU itself to offer virtualization to avoid this exploit.
|
| This is helpful in explaining why AWS hasn't been excited to
| ship this use case in Firecracker.
| ckastner wrote:
| It would probably not stop all theoretically possible
| attacks, but it would stop many of them.
|
| Say you find a bug in the GPU driver that lets you execute
| arbitrary code as root. That still all happens within the
| VM. To attack the host, you'd still need to break out of the
| VM, and if the VM is unprivileged (which I assume it is),
| you'd next need to gain privileges on the host.
|
| There are other channels -- perhaps you can get the GPU to
| do something funky at the PCI level, perhaps you can get the
| GPU to crash the host -- but VM isolation does add a solid
| layer of protection.
| hinkley wrote:
| I've been looking at distributed CI and for now I'm just going
| to be running workloads queued by the owner of the agent. That
| doesn't eliminate hostile workloads but it does present a
| similar surface area to simply running the builds locally.
|
| I've been thinking about QEMU or Firecracker instead of just
| containers for a more robust solution. I have some time
| before anyone would ask me about GPU workloads, but do you
| think Firecracker is on track to get there or would I be
| better off learning QEMU?
| ckastner wrote:
| Amazon/AWS has no use case for VFIO in Firecracker. They're
| open to the community adding support and have a community
| meeting soon, but I wouldn't get my hopes up.
|
| QEMU _can_ work -- I say can, because it doesn't work with
| all GPUs. And with consumer GPUs, VFIO is generally not an
| officially supported use case. We got it working, but with
| lots of trial and error, and there are still some
| problematic corner cases.
| hinkley wrote:
| What would you say is the sort of time horizon for turnkey
| operation of one commonly available video card, half a
| dozen, and OEM cards in high-end laptops (e.g., MacBook
| Pro)? Years? Decades? Heat death?
| ckastner wrote:
| I don't think I fully understand your question. If by
| turnkey operation you mean virtualization, enterprise GPUs
| already officially support it, and it already works with
| consumer GPUs, at least the discrete ones.
| tomrod wrote:
| This seems like a pretty big deal.
|
| I want to like Cloudflare over DO/AWS. I like their DevX
| focus too -- though I could see issues if devs can't get
| under the abstractions.
|
| Any red flags folks would raise regarding CF? I know they
| are widely used but I'm not sure where the gotchas are.
| ec109685 wrote:
| Their solution isn't GA yet.
|
| For headless browsers, the latency benefit of "container
| anywhere" seems high. For things like AI inference, running
| on the edge seems far less beneficial than running in the
| cheapest location possible, which would be larger regional
| data centers.
| hinkley wrote:
| One would hope that "larger regional data centers" are not
| that far from The Edge. But the problem isn't physics or the
| speed of light, it's operational.
|
| The operational excellence required to have every successful
| Internet company manage deployments to a dozen regions just
| isn't there. Most of us struggle with three. My last gig
| tried to do two, which isn't economical: since you always
| try to survive one region going dark, two regions need at
| least 200% of base capacity, while three data centers only
| need 150% plus headroom, and four need 133% plus headroom
| (see the sketch below). Two has all of the consistency
| problems of n > 1 and few if any of the advantages.
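|
| Back-of-envelope sketch of that overprovisioning math: to
| survive one of n sites going dark, each of the remaining
| n - 1 sites must absorb 1/(n - 1) of total load, so you
| provision n/(n - 1) of base capacity:
|
|     # provisioning factor to survive one of n sites failing
|     for n in range(2, 6):
|         print(f"{n} sites -> {100 * n / (n - 1):.0f}%")
|     # 2 sites -> 200%, 3 -> 150%, 4 -> 133%, 5 -> 125%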
|
| We need more help from the CDNs of the world to run compute
| heavy operations at the edge. And if they choose to send them
| 10-20ms away to a beefier data center I think that's probably
| fine. Just don't make _us_ have to have the sort of
| operational discipline that requires.
| ec109685 wrote:
| Given how slow AI inference is (and for training it doesn't
| matter at all), the advantage of it being a few
| milliseconds closer to the user is greatly diminished. The
| latency to egress to a regional data center is
| inconsequential.
|
| Good point about at the very least not exposing placement
| to customers. That is a definite win.
| wmf wrote:
| Is Cloudflare the one that goes from free to "call for pricing"
| ($100K+) at the drop of a hat?
| jgrahamc wrote:
| https://blog.cloudflare.com/cloudflares-commitment-to-free/
| ignoramous wrote:
| One data point, but _one_ of our toy services has been
| pushing 30 TB/mo to 60 TB/mo for over a year now, and we
| haven't got the call:
| https://news.ycombinator.com/item?id=39521228
| trallnag wrote:
| Can you share what kind of toy is shuffling around so much
| data?
| mnahkies wrote:
| I think they have some incredibly low pricing for what most
| small companies need. I also think they've done a very good
| job of carving out pieces that more sophisticated setups need
| into the enterprise tier, which does constitute a big jump.
|
| One that bit me was
| https://developers.cloudflare.com/cloudflare-for-
| platforms/c... - we found an alternative solution that didn't
| require upgrading to their enterprise plan (yet), but it was
| a pretty compelling reason to upgrade and if I was doing it
| again I'd probably choose upgrading over implementing our
| solution. On balance I'm not sure we actually saved money in
| the end, considering opportunity cost
| srockets wrote:
| Won't apply to everyone (most?), but some compliance
| assurances your customers may require can't be fulfilled by
| Cloudflare. And personally, I would hope their laissez-faire
| attitude toward protecting hate speech would damage their
| business, but I suspect most people not targeted by such
| speech just don't give a damn.
| jgrahamc wrote:
| _but some compliance assurances your customers may require
| can't be fulfilled by Cloudflare._
|
| Such as? See: https://www.cloudflare.com/trust-
| hub/compliance-resources/
| halfcat wrote:
| > _"Remote Browser Isolation provides Chromium browsers that run
| on Cloudflare, in containers, rather than on the end user's own
| computer. Only the rendered output is sent to the end user."_
|
| It turns out we don't need React Server Components after all. In
| the future we will just run the entire browser on the server.
| srockets wrote:
| What's old is new again.
| surfingdino wrote:
| Looks like Cloudflare will soon be using the slogan "All
| other clouds are behind ours."
| DonHopkins wrote:
| "We're the silver lining."
|
| "We'll keep you on the edge of your seat."
|
| "Nice parade you got there. It sure would be a shame if
| somebody were to rain on it."
| surfingdino wrote:
| "That's how dynamic pricing works, baby" (CF taking in
| learnings from the Oasis/Ticketmaster heist)
| lysace wrote:
| Lots of cool stuff in this blog post. Impressive work on many
| fronts!
|
| If I understand correctly, you will be running actual
| third-party compute workloads/containers in hundreds of
| Internet exchange locations.
|
| Is that in line with what the people running these locations have
| in mind? Can you scale this? Aren't these locations often very
| power/cooling-constrained?
| thefounder wrote:
| So this will be similar to Google App Engine (now Cloud
| Run)? If that's the case I would love to give it a try, but
| then I'd need a SQL server nearby and other open-source
| services as well.
| dopylitty wrote:
| I like the dig at "first generation" clouds.
|
| There really is a wide gulf between the services provided by the
| older cloud providers (AWS, Azure) and the newer ones
| (fly.io, Cloudflare, etc.).
|
| AWS/Azure provide very leaky abstractions (VMs, VPCs) on top
| of very old and badly designed protocols/systems (IP,
| Windows, Linux). That's fine for people who want to spend
| all their time janitoring VMs, operating systems, and
| networks, but for developers who just want to write code
| that provides a service it's much better to be able to say
| to the cloud provider "Here's my code, you make sure it's
| running somewhere" and let the cloud provider deal with the
| headaches. Even the older providers' PaaS services have too
| many knobs to deal with (I don't want to think about putting
| a load balancer in front of ECS or whatever).
| abadpoli wrote:
| This undersells the fact that there's a lot more to
| infrastructure management than "janitoring". You and many
| others may want to just say "here's my code, ship it", but
| there's also a massive market of people that _need_ the
| customization and deep control over things like load balancers,
| because they're pumping petabytes of data through it and using
| a cloud-managed LB is leaving money and performance on the
| table. Or there are companies that _need_ the strong isolation
| between regions for legal and security reasons, even if it
| comes with added complexity.
|
| A lot of developers get frustrated at AWS or Azure because they
| want to deploy their hobby app on it and realize it's too
| difficult dealing with stuff like IAM - it's like trying to dig
| a small hole in your garden and someone suggests you go buy a
| Caterpillar Excavator, when all you needed was a hand trowel.
| The reason this persists is because AWS doesn't target the
| hobby developer - it targets the massive enterprise that does
| need the customization and power it provides, despite the
| complexity. There are, thankfully, other companies that have
| come in to serve up cloud hand trowels.
|
| There is no "one size fits all" cloud. There probably never
| will be. They're all going to coexist for the foreseeable
| future.
| bigcat12345678 wrote:
| HN is now clearly swarmed by grandiose novice techies. Ten
| years ago, no such superficial assessment would have made
| the front page. This set of words bears little substance or
| engineering fact.
|
| > AWS/Azure provide very leaky abstractions (VMs, VPCs) on
| top of very old and badly designed protocols/systems (IP,
| Windows, Linux).
|
| AWS can't be lumped in like that; the providers span two
| generations. AWS is gen 1; Azure and GCP are gen 2. Gen 1
| is built on VMs (EC2, EBS, S3) for the web 2.0 era; gen 2
| is built on cluster computing, which VMs enabled. The then
| "leaky abstraction" was the mandated abstraction of its
| time.
|
| And GPUs today are roughly where CPUs were in the 70s. For
| example, you don't have any form of abstracted runtime on
| the GPU; it's like running DOS. It's leakier than the VMs
| of the 00s.
| SantaCruz11 wrote:
| Edgegap has been doing this for 5 years.
| vednig wrote:
| > We rely on it in Production
|
| They really have a great engineering team.
| roboben wrote:
| What I am always missing in these posts: How do they limit
| network bandwidth? Since these are all multi-tenant services, how
| do they make sure a container or isolated browser is not taking
| all the network bandwidth of a host?
| tscolari wrote:
| You can probably do this through the proc filesystem /
| cgroups. If you think about it, you can use cgroups to limit
| the bandwidth, so you can also use them to measure it. A
| rough sketch follows below.
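|
| One common recipe (cgroup v1 net_cls plus a tc cgroup
| filter; the device name, cgroup path, and rate are examples,
| not anything Cloudflare has described):
|
|     import subprocess
|
|     def tc(args):
|         subprocess.run(["tc"] + args.split(), check=True)
|
|     # HTB root qdisc with a 100 Mbit class for one tenant
|     tc("qdisc add dev eth0 root handle 1: htb")
|     tc("class add dev eth0 parent 1: classid 1:10 "
|        "htb rate 100mbit ceil 100mbit")
|
|     # cls_cgroup steers packets from tagged cgroups to 1:10
|     tc("filter add dev eth0 parent 1: handle 1: cgroup")
|
|     # net_cls classid 0xAAAABBBB maps to class AAAA:BBBB,
|     # so 0x00010010 lands in class 1:10
|     with open("/sys/fs/cgroup/net_cls/tenant1"
|               "/net_cls.classid", "w") as f:
|         f.write("0x00010010")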
| lofaszvanitt wrote:
| Why is Cloudflare trying to create a walled-garden internet
| within the internet?
___________________________________________________________________
(page generated 2024-09-27 23:01 UTC)