[HN Gopher] From S3 to R2: An economic opportunity
___________________________________________________________________
From S3 to R2: An economic opportunity
Author : dangoldin
Score : 93 points
Date : 2023-11-02 19:15 UTC (3 hours ago)
(HTM) web link (dansdatathoughts.substack.com)
(TXT) w3m dump (dansdatathoughts.substack.com)
| simonsarris wrote:
| Cloudflare has been attacking the S3 egress problem by creating
| Sippy: https://developers.cloudflare.com/r2/data-migration/sippy/
|
| It allows you to _incrementally_ migrate off of providers like S3
| and onto the egress-free Cloudflare R2. Very clever idea.
|
| He calls R2 an undiscovered gem and IMO this is the gem's
| undiscovered gem. (Understandable since Sippy is very new and
| still in beta)
| ravetcofx wrote:
| What are the economics whereby Amazon and other providers
| charge egress fees and R2 doesn't? Is it acting as a loss leader or
| does this model still make money for CloudFlare?
| NicoJuicy wrote:
| You pay for the capacity of your network.
|
| Cloudflare has huge ingress, because they need it to protect
| sites against DDoS.
|
| They basically already pay for their R2 bandwidth ( = egress)
| because of that.
|
| Additionally, with their SDN ( software defined networking)
| they can fine-tune some of the Data-Flow/bandwidth too.
|
| That's how I understood it, fyi.
|
| Some more info can be found from when they started (or
| co-founded, not sure) the Bandwidth Alliance.
|
| Eg.
|
| https://blog.cloudflare.com/aws-egregious-egress/
|
| https://blog.cloudflare.com/bandwidth-alliance/
| miselin wrote:
| Also, for the CDN case that R2 seems to be targeting -
| regardless of the origin of the data (R2 or S3), chances
| are pretty good that Cloudflare is already paying for the
| egress anyway.
| NicoJuicy wrote:
| I'm not sure about that.
|
| A CDN keeps the data nearby, reducing the need to pay
| egress to the big bandwidth providers.
|
| ( not an expert though)
| ilc wrote:
| Let's say you want to use cloudflare, or another CDN. The
| process is pretty simple.
|
| You setup your website and preferably DON'T have it talk
| to anyone other than the CDN.
|
| You then point your DNS to wherever the CDN tells you to.
| (Or let them take over DNS. Depends on the provider.)
|
| The CDN then will fetch data from your site and cache it,
| as needed.
|
| Your site is the "origin", in CDN speak.
|
| If Cloudflare can move the origin within their network,
| there are huge cost savings and reliability increases
| there. This is game-changing stuff. Do not underestimate
| it.
| kkielhofner wrote:
| It's actually worse than that.
|
| In the CDN case Cloudflare has to fetch it from the
| origin, cache (store) it anyway, and then egress it. By
| charging for R2 they're moving that cost center to a
| profit one.
| swyx wrote:
| somebody more knowledgeable please correct me if i'm
| mistaken, but i think the bandwidth alliance is really the
| linchpin of the whole thing. basically get all the non-AWS
| players in the room and agree on zero-rating traffic
| between each other, to provide a credible alternative to
| AWS networks
| Nextgrid wrote:
| _Completely_ free egress is a loss leader, but the true cost
| is so little (at least 90x less than what AWS charges) that
| it pays for itself in the form of more CloudFlare marketshare
| /mindshare.
| WJW wrote:
| I know from personal experience that "big" customers can
| negotiate incredible discounts on egress bandwidth as well.
| 90-95% discount is not impossible, only "retail" customers
| pay the sticker price.
| martinald wrote:
| That's still a 3-10x markup though. And it's also very
| dependent on your relationship with AWS. What happens if
| they don't offer the discount on renewal?
| candiddevmike wrote:
| Greed on the cloud providers' part, I think. You'd expect
| egress fees to enable cheaper compute, but there are other
| cloud providers out there like Hetzner with cheaper compute
| and egress, so the economics don't really add up.
| vidarh wrote:
| Indeed, Hetzner is so much cheaper that if you have high S3
| egress fees you can rent Hetzner boxes to sit in front of
| your S3 deployment as caching proxies and get a lot of
| extra "free" compute on top.
|
| It's an option that's often been attractive if/when you
| didn't want the hassle of building out something that could
| provide S3 level durability yourself. But with more/cheaper
| S3 competitors it's becoming a significantly less
| attractive option.
| kazen44 wrote:
| also, egress fees are a sort of vendor lock-in, because
| getting data out of the cloud is vastly more expensive than
| putting new data into the cloud.
| oaktowner wrote:
| Exactly this. Data has gravity, and this increases the
| gravity around data stored at Amazon...making it more
| likely for you to buy more compute/services at Amazon.
| kkielhofner wrote:
| The big cloud providers are Hotel California - you can
| check in but you can't check out.
|
| Of course you can (like Snap) but it's a MASSIVE
| engineering effort and initial expense.
| chatmasta wrote:
| Amazon doesn't have a unit cost for egress. They charge you for
| the stuff you put through their pipe, while paying their
| transit providers only for the size of the pipe (or more
| often, not paying them anything since they just peer directly
| with them at an exchange point).
|
| Amazon uses $/GB as a price gouging mechanism and also a QoS
| constraint. Every bit you send through their pipe is
| basically printing money for them, but they don't want to
| give you a reserved fraction of the pipe because then other
| people can't push their bits through that fraction. So they
| get the most efficient utilization by charging for the stuff
| you send through it, ripping everybody off equally.
|
| Also, this way it's not cost effective to build a competitor
| to Amazon (or any bandwidth intensive business like a CDN or
| VPN) on top of Amazon itself. You fundamentally need to
| charge more by adding a layer of virtualization, which means
| "PaaS" companies built on Amazon are never a threat to AWS
| and actually symbiotically grow the revenue of the ecosystem
| by passing the price gouging onto their own customers.
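To make the pipe-vs-metered gap concrete, a rough back-of-envelope using the ~$0.09/GB rate quoted elsewhere in the thread (full utilization of the pipe is an idealized assumption):

```python
# Revenue from a fully utilized 100 Gbps pipe billed per-GB at ~$0.09/GB,
# versus a flat port/transit cost that does not grow with bytes sent.
GBPS = 100
SECONDS_PER_MONTH = 30 * 24 * 3600

gb_per_month = GBPS / 8 * SECONDS_PER_MONTH  # gigabits/s -> gigabytes/month
metered_bill = gb_per_month * 0.09           # what per-GB billing collects

print(f"{gb_per_month / 1e6:.1f} PB/month -> ${metered_bill:,.0f}")
```

A full pipe works out to roughly 32 PB a month, i.e. on the order of millions of dollars at per-GB rates, while the provider's own cost for the port is flat.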
| specialp wrote:
| You don't get charged for transit if you are sending stuff
| IN from the internet or to any other AWS resource in that
| region. So there is no QoS constraint inside except for
| perhaps paying for the S3 GET/SELECT/LIST costs.
|
| It is pretty much exclusively to lock you into their
| services. It heavily impacts multi-cloud and outside of AWS
| service decisions when your data lives in AWS and is taxed
| at 5-9 cents a GB to come out. We have settled for inferior
| AWS solutions at times because the cost of moving things
| out is prohibitive (i.e. AWS Backup vs other providers)
| dangoldin wrote:
| Author here - have you tried using R2? As others
| mentioned there's also Sippy
| (https://developers.cloudflare.com/r2/data-migration/sippy/)
| which makes this easy to try.
| martinald wrote:
| It also makes things like just using RDS for your managed
| database and having compute nearby but with another
| provider often incredibly expensive.
| kkielhofner wrote:
| AWS egress charges blatantly take advantage of people who
| have never bought transit or done peering.
|
| To them "that's just what bandwidth costs" but anyone who's
| worked with this stuff (sounds like you and I both) can do
| the quick math and see what kind of money printing machine
| this scheme is.
| pests wrote:
| Honest question, how is this different than a toll road? An
| entity creates a road network with a certain size (lanes,
| capacity/hour, literal traffic) and pays for it by charging
| individual cars put through the road.
| dotnet00 wrote:
| There has to be more to it than a pure loss leader, since
| there's also the Bandwidth Alliance Cloudflare is in, which
| allows R2 competitors like Backblaze B2 to also offer free
| egress, which benefits those competitors while weakening the
| incentive for R2 somewhat.
| jmarbach wrote:
| Cloudflare wrote a blog post about their bandwidth egress
| charges in different parts of the world:
| https://blog.cloudflare.com/the-relative-cost-of-
| bandwidth-a...
|
| The original post also includes a link to a more recent
| Cloudflare blog post on AWS bandwidth charges:
| https://blog.cloudflare.com/aws-egregious-egress/
| dangoldin wrote:
| Author here - really cool link to Sippy. I love the idea here
| since you're migrating data as needed, so the cost you
| incur is a function of the workload. It's basically
| acting as a caching layer.
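The read-through idea can be sketched in a few lines. This is a toy illustration (plain dicts stand in for the R2 cache and the S3 origin), not Sippy's actual implementation:

```python
def read_through(key, cache, origin):
    """Serve a blob from the cache if present; otherwise fetch it from
    the origin once and backfill the cache, so origin egress is paid at
    most once per object."""
    if key in cache:
        return cache[key]
    value = origin[key]  # the one-time S3 egress happens here
    cache[key] = value   # subsequent reads are served egress-free
    return value

r2, s3 = {}, {"report.csv": b"a,b\n1,2\n"}
first = read_through("report.csv", r2, s3)   # miss: pulled from origin
second = read_through("report.csv", r2, s3)  # hit: served from cache
```

Only objects that are actually read ever incur the S3 egress charge, which is why the migration cost tracks the workload.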
| paulddraper wrote:
| Clever
| nik736 wrote:
| S3 and R2 aside, OVHs object storage offering is really robust
| and great. It performs better than S3 and is way cheaper, in
| storage and egress cost.
| drewnick wrote:
| Agree. We've used it for two years with solid performance and
| reliability.
| 9dev wrote:
| You might even say their offering is... on fire
| threatofrain wrote:
| Cloudflare has been building a micro-AWS/Vercel competitor and I
| love it; i.e., serverless functions, queues, sqlite, kv store,
| object store (R2), etc.
| davidjfelix wrote:
| FWIW, Vercel is at least partially backed by cloudflare
| services under the hood.
| zwily wrote:
| Right - Vercel's edge functions are just cloudflare workers
| with a massive markup.
| camkego wrote:
| I would love to see a good blog post or article on Cloudflare's
| KV store. I just checked it out, and it reports eventual
| consistency, so it sounds like it might be based upon CRDTs,
| but I'm just guessing.
| chatmasta wrote:
| Vercel doesn't offer any of that, without major caveats (e.g.
| must use Next.js to get a serverless endpoint). And to the
| degree they do offer any of it, it's mostly built on
| infrastructure of other companies, including Cloudflare.
| paulgb wrote:
| Since I know there will be Cloudflare people reading this (hi!),
| I'm begging you: please wrest control of the blob storage API
| standard from AWS.
|
| AWS has zero interest in S3's API being a universal standard for
| blob storage and you can tell from its design. What happens in
| practice is that everybody (including R2) implements some subset
| of the S3 API, so everyone ends up with a jagged API surface
| where developers can use a standard API library but then have to
| refer to the docs of each S3-compatible vendor to figure out
| whether the subset of the S3 API you need will be compatible with
| different vendors.
|
| This makes it harder than it needs to be to make vendor-agnostic
| open source projects that are backed by blob storage, which would
| otherwise be an excellent lowest-common-denominator storage
| option.
|
| Blob storage is the most underused cloud tech IMHO largely
| because of the lack of a standard blob storage API. Cloudflare is
| in the rare position where you have a fantastic S3 alternative
| that people love, and you would be doing the industry a huge
| service by standardizing the API.
| londons_explore wrote:
| I think the subtle API differences reflect bigger and deeper
| implementation differences...
|
| For example, "Can one append to an existing blob/resume an
| upload?" leads to lots of questions about data immutability,
| cacheability of blobs, etc.
|
| "What happens if two things are uploaded with the same name at
| the same time" leads into data models, mastership/eventual
| consistency, etc.
|
| Basically, these 'little' differences are in fact huge
| differences on the inside, and fixing them probably involves a
| total redesign.
| paulgb wrote:
| This is a good point, but just a standard for the standard
| create/read/update (replace)/delete operations combined with
| some baseline guarantees (like approximately-last-write-wins
| eventual consistency) would probably cover a whole lot of
| applications that currently use S3 (which doesn't support
| appends anyway).
|
| Heck, HTTP already provides verbs that would cover this, it
| would just require a vendor to carve out a subset of HTTP
| that a standard-compliant server would support, plus
| standardize an auth/signing mechanism.
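The subset being described could be as small as the interface below. This is a hypothetical sketch (names and semantics are illustrative, not any vendor's actual API), with the HTTP verb each operation would map to noted in comments:

```python
from typing import Iterator, Protocol

class BlobStore(Protocol):
    """Minimal vendor-neutral blob interface: CRUD plus prefix listing,
    with last-write-wins semantics and nothing S3-specific."""
    def put(self, key: str, data: bytes) -> None: ...       # HTTP PUT
    def get(self, key: str) -> bytes: ...                   # HTTP GET
    def delete(self, key: str) -> None: ...                 # HTTP DELETE
    def list(self, prefix: str = "") -> Iterator[str]: ...  # GET ?prefix=

class MemoryBlobStore:
    """In-memory reference implementation of the interface above."""
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}
    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data  # last write wins
    def get(self, key: str) -> bytes:
        return self._blobs[key]
    def delete(self, key: str) -> None:
        del self._blobs[key]
    def list(self, prefix: str = "") -> Iterator[str]:
        return iter(sorted(k for k in self._blobs if k.startswith(prefix)))
```

Anything beyond this (appends, versioning, conditional writes) is where the deep implementation differences mentioned above start to bite, so a standard would likely leave those as optional extensions.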
| maxclark wrote:
| R2 and Sippy solve a specific pipeline issue: Storage -> CDN ->
| Eyeball
|
| The real issue is how that data gets into S3 in the first place
| and what else you need to do with it.
|
| S3 and DynamoDB are the real moats for AWS.
| tehlike wrote:
| If you are storing a large amount of data: E2 is the cheapest
| ($20/TB/year, 3x egress for free)
|
| If you have lots of egress: R2 is the cheapest ($15/TB/month,
| free egress)
|
| R2 can get somewhat expensive if you have lots of mutations,
| which is not a typical use case for most.
| vladvasiliu wrote:
| What's E2? Top google result for "e2 blob storage" is azure,
| but that can't be it since the pricing table comes at around
| $18/TB/month.
| maccam912 wrote:
| I imagine it was a typo for backblaze B2? The call out that
| egress is free for the first 3x of what you have stored
| matches up.
| wfleming wrote:
| That's what I thought they meant as well, but B2 is more
| like $72/TB/yr. Maybe relevant to another story on the
| front page right now, they have a very unusual custom
| keyboard layout that makes it easy to typo e for b and 2
| for 7 ;)?
| natrys wrote:
| Seems to be this one: https://www.idrive.com/object-storage-e2/
| leiferik wrote:
| I think Backblaze B2 is probably the reference (which has
| free egress up to 3x data stored -
| https://www.backblaze.com/blog/2023-product-announcement/). I
| don't know of any public S3-compatible provider that is as
| cheap as 20$/TB/year (roughly ~$0.0016/GB/mo).
| arghwhat wrote:
| I wish the R2 access control was similar to S3's - able to issue
| keys with specific access to particular prefixes, and the
| ability to delegate key creation.
|
| It currently feels a little limited and... bolted on to the
| Cloudflare UI.
| andrewstuart wrote:
| I think the idea is to use Cloudflare Workers to add more
| sophisticated functionality.
| slig wrote:
| But then you start paying for Worker's bandwidth, correct?
| meowface wrote:
| Is there any reason to _not_ use R2 over a competing storage
| service? I already use Cloudflare for lots of other things, and
| don't personally care all that much about the "Cloudflare's
| near-monopoly as a web intermediary is dangerous" arguments or
| anything like that.
| Hasz wrote:
| As far as I know, R2 offers no storage tiers. Most of my s3
| usage is archival and sits in glacier. From Cloudflare's
| pricing page, S3 is substantially cheaper for that type of
| workload.
| gurchik wrote:
| 1. This is the most obvious one, but S3 access control is done
| via IAM. For better or for worse, IAM has a lot of
| functionality. I can configure a specific EC2 instance to have
| access to a specific file in S3 without the need to deal with
| API keys and such. I can search CloudTrail for all the times a
| specific user read a certain file.
|
| 2. R2 doesn't support file versioning like S3. As I understand
| it, Wasabi supports it.
|
| 3. R2's storage pricing is designed for frequently accessed
| files. They charge a flat $0.015 per GB-month stored. This is a
| lot cheaper than S3 Standard pricing ($0.023 per GB-
| month), but more expensive than Glacier and marginally more
| expensive than S3 Standard - Infrequent Access. Wasabi is even
| cheaper at $0.0068 per GB-month but with a 1 TB billing
| minimum.
|
| 4. If you want public access to the files in your S3 bucket
| using your own domain name, you can create a CNAME record with
| whatever DNS provider you use. With R2 you cannot use a custom
| domain unless the domain is set up in Cloudflare. I had to
| register a new domain name for this purpose since I could not
| switch DNS providers for something like this.
|
| 5. If you care about the geographical region your data is
| stored in, AWS has way more options. At a previous job I needed
| to control the specific US state my data was in, which is easy
| to do in AWS if there is an AWS Region there. In contrast R2
| and Wasabi both have few options. R2 has a "Jurisdictional
| Restriction" feature in Beta right now to restrict data to a
| specific legal jurisdiction, but they only support EU right
| now. Not helpful if you need your data to be stored in Brazil
| or something.
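Plugging the per-GB-month rates from point 3 into a quick comparison (rates as quoted above; a real bill adds request charges and, for S3, egress):

```python
# Quoted storage prices, in $ per GB-month.
PRICES = {"R2": 0.015, "S3 Standard": 0.023, "Wasabi": 0.0068}

def monthly_storage_cost(gb: float, provider: str) -> float:
    return gb * PRICES[provider]

for provider in PRICES:
    cost = monthly_storage_cost(10_000, provider)  # 10 TB stored
    print(f"{provider}: ${cost:,.2f}/month")
```

At 10 TB stored that is roughly $150 (R2) vs $230 (S3 Standard) vs $68 (Wasabi) per month, which is why the answer depends so heavily on whether your workload is storage-heavy or egress-heavy.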
| paulddraper wrote:
| If you already use Cloudflare for lots of other things, no.
|
| If you already use AWS for lots of other things, yes.
| benjaminwootton wrote:
| The other hidden cost when you are working with data hosted on S3
| is the LIST requests. Some of the data tools seem very chatty
| with S3, and you end up with thousands of them when you have
| small files buried in folders with a not insignificant cost. I
| need to dig into it more, but they are always up there towards
| the top of my AWS bills.
| thedaly wrote:
| > In fact, there's an opportunity to build entire companies that
| take advantage of this price differential and I expect we'll see
| more and more of that happening.
|
| Interesting. What sort of companies can take advantage of this?
| diamondap wrote:
| Basically any company offering special services that work with
| very large data sets. That could be a consumer backup system
| like Carbonite or a bulk photo processing service. In either
| case, legal agreements with customers are key, because you
| ultimately don't control the storage system on which your
| business and their data depend.
|
| I work for a non-profit doing digital preservation for a number
| of universities in the US. We store huge amounts of data in S3,
| Glacier and Wasabi, and provide services and workflows to help
| depositors comply with legal requirements, access controls,
| provable data integrity, archival best practices, etc.
|
| There are some for-profits in this space as well. It's not a
| huge or highly profitable space, but I do think there are other
| business opportunities out there where organizations want to
| store geographically distributed copies of their data (for
| safety) and run that data through processing pipelines.
|
| The trick, of course, is to identify which organizations have a
| similar set of needs and then build that. In our case, we've
| spent a lot of time working around data access costs, and there
| are some cases where we just can't avoid them. They can really
| be considerable when you're working with large data sets, and
| if you can solve the problem of data transfer costs from the
| get-go, you'll be way ahead of many existing services built on
| S3 and Glacier.
| dangoldin wrote:
| Author here, but some ideas I was thinking about:
|
| - An open source data pipeline built on top of R2. A way of
| keeping data on R2/S3 but then having execution handled in
| Workers/Lambda. Inspired by what https://www.boilingdata.com/
| and https://www.bauplanlabs.com/ are doing.
|
| - Related to the above, but taking data that's stored in the
| various big data formats (Parquet, Iceberg, Hudi, etc),
| generating many more combinations of the datasets, and
| choosing optimal ones based on the workload. You can do this
| with existing providers but I think the cost element just
| makes this easier to stomach.
|
| - Abstracting some of the AI/ML products out there and
| choosing the best one for the job by keeping the data on R2
| and then shipping it to the relevant providers (since data
| ingress to them is free) for specific tasks.
| gen220 wrote:
| I'm building a "media hosting site". Based on somewhat
| reasonable forecasts of egress demand vs total volume stored,
| using R2 means I'll be able to charge a low take rate that
| should (in theory) give me a good counterposition to
| competitors in the space.
|
| Basically, using R2 allows you to undercut competitors'
| pricing. It also means I don't need to build out a separate CDN
| to host my files, because Cloudflare will do that for me, too.
|
| Competitors built out and maintain their own equivalent CDNs
| and storage solutions that are ~10x more expensive to
| maintain and operate than going through Cloudflare. Basically,
| Cloudflare is doing to CDNs and storage what AWS and friends
| did to compute.
| xrd wrote:
| I just love minio. It is a drop-in replacement for S3. I have
| never done a price comparison for TCO to S3 or R2, but I have a
| good backup story and run it all inside docker/dokku so it is
| easy to recover.
| hipadev23 wrote:
| OP is missing that a correct implementation of Databricks or
| Snowflake will have those instances running inside the same
| AWS region as the data. That's not to say R2 isn't an amazing
| product, but the egregious costs aren't as high since egress is
| $0 on both sides.
| dangoldin wrote:
| Author here - it is true that costs within a region are free,
| and if you design your system appropriately you can take
| advantage of it. But I've seen accidental cases where someone
| will try to access in another region and it's nice to not even
| have to worry about it. Even that can be handled with better
| tooling/processes but the bigger point is if you want to have
| your data be available across clouds to take advantage of the
| different capabilities. I used AI as an example but imagine you
| have all your data in S3 but want to use Azure due to the
| OpenAI partnership. It's that use case that's enabled by R2.
| hipadev23 wrote:
| Yeah, for greenfield work building up on R2 is generally a
| far better deal than S3, but if you have a massive amount of
| data already on S3, especially if it's small files, you're
| going to pay a massive penalty to move the data. Sippy is
| nice but it just spreads the pain over time.
| cmgriffing wrote:
| I could be mistaken, but I believe AWS would still charge for
| one direction of an S3 to Databricks/Snowflake
| instance/cluster.
| hipadev23 wrote:
| AWS S3 Egress charges are $0.00 when the destination is AWS
| within the same region. When you setup your Databricks or
| Snowflake accounts, you need to correctly specify the same
| region as your S3 bucket(s) otherwise you'll pay egress.
| drexlspivey wrote:
| If I understand correctly when storing data to vanilla S3 (not
| their edge offering) the data live in a single zone/datacenter
| right? While on R2 they could potentially be replicated in tens
| of locations. If that is true how can Cloudflare afford the
| storage cost with basically the same pricing?
| leiferik wrote:
| As an indie dev, I recommend R2 highly. Free egress is the killer
| feature. I started using R2 earlier this year for my AI
| transcription service TurboScribe (https://turboscribe.ai/).
| Users upload audio/video files directly to R2 buckets (sometimes
| many large, multi-GB files), which are then transferred to a
| compute provider for transcription. No vendor lock-in for my
| compute (ingress is free/cheap pretty much everywhere) and I can
| easily move workloads across multiple providers. Users can even
| re-download their (again, potentially large) files with a simple
| signed R2 URL (again, no egress fees).
|
| I'm also a Backblaze B2 customer, which I also highly recommend
| and has slightly different trade-offs (R2 is slightly faster in
| my experience, but B2 is 2-3x cheaper storage, so I use it mostly
| for backups and other files that I'm likely to store a long
| time).
| jokethrowaway wrote:
| It blows my mind that anyone would consider S3 cheap.
|
| Plenty of space was always available on dedicated servers for
| way cheaper, even before the cloud.
|
| You could make an argument about the API being nicer than dealing
| with a linux server - but is AWS nice? I think it's pretty awful
| and requires tons of (different, specific, non transferable)
| knowledge.
|
| Hype, scalability buzzwords thrown around by startups with 1000
| users and a $1M contract with AWS.
|
| Sure R2 is cheaper but it's still not a low cost option. You are
| paying for a nice shiny service.
| gen220 wrote:
| I think it all depends on the volume of data you're storing,
| access requirements, and how much value you plan to generate
| per GB.
|
| It's certainly quite cheap for a set of typical "requirements"
| for media hosting companies.
|
| But yeah, if you're storing data for mainly archival purposes,
| you shouldn't be paying for R2 or S3.
| sgammon wrote:
| We absolutely love R2, especially when paired with Workers.
| johnklos wrote:
| Should we simply ignore the tremendous amount of phishing hosted
| using r2.dev? Or is this also part of "an economic opportunity"?
|
| Cloudflare may well be on their way to becoming a monopoly, but
| they certainly show they don't care about abuse. Even if it
| weren't a simple matter of principle, in case they aren't
| successful in forcing themselves down everyone's throats, I
| wouldn't want to host anything on any service that hosts phishers
| and scammers without even a modicum of concern.
| andrewstuart wrote:
| >> you're paying anywhere from $0.05/GB to $0.09/GB for data
| transfer in us-east-1. At big data scale this adds up.
|
| At small data scale this adds up.
|
| And..... it's 11 cents a GB from Australia and 15 cents a GB from
| Brazil.
|
| If you have S3 facing the Internet, a hacker can bankrupt your
| company in minutes with a simple load testing application. Not
| even a hacker - a bug in a web page could do the same thing.
| paulddraper wrote:
| 200 TB in minutes is impressive.
|
| (Assuming your company can be bankrupted for ~$20k.)
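Sanity-checking the figures in this exchange at the us-east-1 rate quoted above:

```python
# How much egress does a ~$20k bill represent at $0.09/GB?
PRICE_PER_GB = 0.09  # upper end of the quoted $0.05-$0.09/GB range
budget = 20_000

egress_gb = budget / PRICE_PER_GB
print(f"~{egress_gb / 1_000:.0f} TB of egress")
```

Roughly 222 TB at the top us-east-1 rate, and less from the pricier regions, so the two figures in this exchange are in the same ballpark.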
___________________________________________________________________
(page generated 2023-11-02 23:00 UTC)