[HN Gopher] Fly.io outage - resolved
___________________________________________________________________
Fly.io outage - resolved
Author : punkpeye
Score : 232 points
Date : 2024-11-26 01:47 UTC (21 hours ago)
(HTM) web link (status.flyio.net)
(TXT) w3m dump (status.flyio.net)
| punkpeye wrote:
| It is not reflected in their status page, but fly.io itself is
| not even loading.
| duxup wrote:
| Confirmation ;)
| nomilk wrote:
| https://fly.io/ loading for me
| arusahni wrote:
| Oof, hugops to the team.
| stevefan1999 wrote:
| Yep...can confirm my self hosted Bitwarden there is completely
| FUBAR connection wise even if it is in EA, so it should be a
| worldwide outage...lemme guess, some internal tooling error,
| consensus split brain, or someone leaked BGP routes again?
| jasonjayr wrote:
| DNS. It's always DNS. /s
| jart wrote:
| https://github.com/jart/cosmopolitan/blob/master/third_party.
| ..
| monkaiju wrote:
| Might be! Shameless plug of a DNS tool I wrote years ago for
| anyone this pushes to learn more about DNS
|
| https://dug.unfrl.com/
| satoru42 wrote:
| Mine is in Asia and it's still accessible.
| odo1242 wrote:
| It was a consensus split-brain ("database replication failure")
| it seems
| redslazer wrote:
| fly.io just has the weirdest outages. It has issues so regularly
| we don't even need to run mock outages to make sure our system
| failovers work.
| duxup wrote:
| When I worked for a company who worked with big banks /
| financial institutions we used to run disaster recovery tests.
| Effectively a simulated outage where the company would try to
| run off their backup sites. They ran everything from those
| sites, it was impressive.
|
| Once in a while we'd have a real outage that matched the test
| we ran as recently as the weekend before.
|
| I was helping a bank switch over to the DR site(s) one day
| during such a real outage and I left my mic open when someone
| asked me what the commotion was on the upper floors of our HQ.
| I said "super happy fun surprise disaster recovery test for
| company X".
|
| VP of BIG bank was on the line monitoring and laughed "I'm
| using that one on the executive call in 15, thanks!" Supposedly
| it got picked up at the bank internally after the VP made the
| joke and was an unofficial code for such an outage for a long
| time.
| NetOpWibby wrote:
| Thankfully your comment was positive!
| latch wrote:
| In most BIG banks, "Vice President" is almost an entry-level
| title. Easily have 1000s of them. For example, this article
| points out that Goldman Sachs had ~12K VPs out of more than
| 30K employees:
| https://web.archive.org/web/20150311012855/https://www.wsj.c...
| jart wrote:
| VP at Goldman is equivalent to Senior SWE according to
| levels.fyi and their entry level is Analyst. I'm surprised
| by the compensation though. I would have thought people
| working at a place with gold in the name would be making
| more. Also apparently Morgan Stanley pays their VPs
| $67k/year.
| philipwhiuk wrote:
| Tech outstripped big finance corps tech a while ago.
|
| Traders make loads, not the SWEs
| bormaj wrote:
| That VP comp number seems quite low fwiw
| jart wrote:
| Yes how much longer till we see Morgan Stanley VPs
| picketing outside demanding a living wage and humming The
| Internationale.
| SteveNuts wrote:
| Just like all Sales folks have heavily inflated titles, no
| customer wants to think they're dealing with a junior
| salesperson/loan officer when you're about to hand over
| your money.
|
| It seems like every vendor sales team I work with is an
| "executive" or "director of sales" even though in reality
| they're just regular old salespeople.
| benreesman wrote:
| In fairness to the fly.io folks (who are extremely serious
| hackers), they're standing up a whole cloud provider and
| they've priced it attractively and they're much customer-
| friendlier than most alternatives.
|
| I don't envy the difficulty of doing this, but I'm quite
| confident they'll iron the bugs out.
| redslazer wrote:
| The tech is impressive and the pricing is attractive which is
| why we use them. I just wish there was less black magic.
| benreesman wrote:
| I don't always agree with @tptacek on social/political
| issues, and I don't always agree with @xe on the direction
| of Nix, but these are legends on the technical side of
| things. And they're trying to build an equitable
| relationship between the user of cloud services and the
| provider, not fund a private space program.
|
| If I were in the market for cloud services I'd highly prize
| a long-term relationship on mutual benefit and fair
| dealings over a short-term nuisance of being an early
| adopter.
|
| I strongly suspect your investment in fly is going to pay
| off.
| verelo wrote:
| I want to believe, but in the meantime they're killing
| the product I've been working hard to build trust in with
| my own customers. There is a limit to my idealism,
| and it's well and truly in the past.
| reissbaker wrote:
| FWIW Xe was let go from Fly earlier this year during a
| round of layoffs.
| benreesman wrote:
| Unfortunate. Xe rocks.
| xena wrote:
| Xe here. As a sibling comment said, I didn't survive
| layoffs. If you're looking for someone like me, I'm on
| the market!
| benreesman wrote:
| Hiring people is above my pay grade, but I can vouch to
| my lords and masters and anyone else who cares what I
| think that a legend is up for grabs.
|
| b7r6@b7r6.net
| xena wrote:
| I'd email but I'm about to pass out in bed. Please see
| https://xeiaso.net/contact/ in case I don't get back to
| you in the morning.
| foldr wrote:
| I suspect that making a cloud service provider run
| reliably requires tons of grunt work more than it
| requires technical heroism from a small number of highly
| talented individuals.
| tptacek wrote:
| Yes.
| tptacek wrote:
| I'm several steps removed from day-to-day engineering at
| this point; the team working on this is much better than
| I am. It's just a very hard problem; biting it off is
| something you can certainly blame me for, though.
|
| (Also: not a legend, just loud.)
| benreesman wrote:
| I may be in the minority on this view, but I think that it's
| possible to be both a recognized expert aka legend and
| loud ("visible" might be a kinder word).
|
| When you talk technology, I listen, and I doubt I'm alone
| in that. Keep up the good work with fly.io!
| shubhamjain wrote:
| This is probably 5th or 6th major outage from Fly.io that I have
| personally seen. Pretty sure there were many others and some just
| went unnoticed. I recommended the service to a friend, and within
| two days he faced two outages.
|
| Fly.io seriously needs to get it together. Why it hasn't happened
| yet is a mystery to me. They have a good product but stability
| needs to be an absolute top for a hosting service. Everything
| else is secondary.
| mcqueenjordan wrote:
| Reliability is hard when your volume is (presumably) scaling
| geometrically.
| paxys wrote:
| Can't use the "reliability is hard" excuse when you are quite
| literally in the business of selling reliability.
| mcqueenjordan wrote:
| It's just not that big of a mystery. It's not an excuse;
| it's just true. Also, they're not especially selling
| reliability as much as they're selling small geo-
| distributed deployments.
| ilrwbwrkhv wrote:
| Does anyone use them beyond the free tier? Same with Vercel for
| example.
| gk1 wrote:
| Vercel has revenue of over $100M. So yes at least a few
| companies use them beyond the free tier.
| dizhn wrote:
| Which company? GitHub? As far as I know fly.io does not have
| a free tier.
| adityapatadia wrote:
| We left it about a year ago due to reliability issues. We now
| use DigitalOcean apps and they work like a charm. Zero downtime
| with DO.
| subarctic wrote:
| You mean their App Platform right? How does the pricing
| compare to fly?
| adityapatadia wrote:
| Yes, App Platform. Pricing is a little higher than Fly but way
| lower than AWS, and it is fully justified. Zero downtime in the
| last year.
|
| With Fly, we had 3-4 downtimes in 2023 in a span of 4
| months.
| SOLAR_FIELDS wrote:
| I get this but I think if people can give GitHub a pass for
| shitting the bed every two weeks maybe Fly should get a bit of
| goodwill here. I am not affiliated with Fly at all but I do
| think that people should temper their expectations when even
| mega corp can't get it right
|
| I guess the secret is to be the incumbent with no suitable
| replacement. Then you can be complete garbage in terms of
| reliability and everyone will just hand wave away your poor ops
| story
| ojame wrote:
| The biggest difference is GitHub in your infrastructure is
| (nearly always) internal. Fly in your infrastructure is
| external. Users generally don't see when you have issues with
| GitHub, but they do generally see when you have issues with
| Fly.
|
| That's the core difference.
| fragmede wrote:
| Who's giving GitHub a pass on shitting the bed? They go down
| often enough that if you don't have an internal git server
| setup for your CICD to hit, that's on you.
| SOLAR_FIELDS wrote:
| My point is made by your very post - getting off GitHub
| onto alternatives is not seriously discussed as an option -
| instead it's "well, why didn't you prepare better to deal
| with your vendor's poor ops story"
| fragmede wrote:
| I wasn't going to bring up being on an internally hosted
| gitlab instead of github, but that would be the "not
| giving them a pass" part.
| benhoyt wrote:
| My fly.io-hosted website went down for 5 minutes (6 hours ago),
| but then came right back up, and has been up ever since. I use a
| free monitoring service that checks it every 5 minutes, so it's
| possible it missed another short bit of downtime. But fly.io has
| been pretty reliable overall for me!
| nomilk wrote:
| Would be fascinated to see your data over a period of months.
|
| Application up time is flakey, but what was worse were fly
| deploys failing for no clear reason. Sometimes layers would
| just hang and eventually fail for no particular reason; I'd run
| the same command an hour or two later without any changes and
| it would just work as expected.
|
| I'd love to make a monitoring service to _deploy_ a basic app
| (i.e. run the fly deploy command) every 5 minutes and see how
| often those deploys fail or hang. I'd guess ~5% inexplicably
| fail, which is frustrating unless you've got a lot of spare
| time.
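|
| A rough sketch of what such a deploy prober could look like, in
| Python. Everything here is an assumption for illustration: the
| helper names (probe, failure_rate, run_prober), the timeout, and
| the idea that a failed or hung deploy shows up as a non-zero exit
| code or as a command that never returns.
|

```python
import subprocess
import sys
import time


def probe(cmd, timeout_s):
    """Run one command and classify the outcome as 'ok', 'failed', or 'hung'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "hung"    # subprocess.run kills the child before raising
    except OSError:
        return "failed"  # e.g. the executable is not installed
    return "ok" if result.returncode == 0 else "failed"


def failure_rate(outcomes):
    """Fraction of probes that did not end in 'ok' (0.0 for an empty list)."""
    if not outcomes:
        return 0.0
    return sum(o != "ok" for o in outcomes) / len(outcomes)


def run_prober(cmd, interval_s=300, timeout_s=600, rounds=12):
    """Probe `cmd` every `interval_s` seconds and log the running failure rate."""
    outcomes = []
    for i in range(rounds):
        outcomes.append(probe(cmd, timeout_s))
        print(
            f"{outcomes[-1]:>6}  failure rate so far: {failure_rate(outcomes):.1%}",
            file=sys.stderr,
        )
        if i < rounds - 1:
            time.sleep(interval_s)
    return outcomes
```

| Calling run_prober(["fly", "deploy"]) from the directory of a
| throwaway app would be the obvious use. Probes run one after
| another, so a hung deploy delays the next one by up to the
| timeout.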
| sanswork wrote:
| My downtimes from fly are pretty rare but generally global
| when they happen. In this outage we had no downtime but
| couldn't deploy for a few hours. I have issues with deploying
| about once per quarter (deploy most days across a few apps).
| nomilk wrote:
| If that's the case I suspect fly is getting a lot more
| reliable. I stopped using them about a year ago so haven't
| kept up on their reliability since. Glad to hear, it's good
| for a competitive market to have many providers, and fly
| might have issues but hopefully has a bright future
| sanswork wrote:
| They are definitely getting more reliable. I was an early
| user and moved off them to self hosted for quite a while
| because of the frequent downtime in early days.
|
| Their support still leaves a lot to be desired even as
| someone that pays for it but the ease of running and
| deploying a distributed front end keeps bringing me back.
| rozenmd wrote:
| This may be of interest to you:
| https://news.ycombinator.com/item?id=42243282
| jrockway wrote:
| I used to run a service that created k8s clusters on GCP for
| our customers. We did want to check that that functionality
| kept working and had a prober test it periodically. It was
| actually broken a lot.
|
| Always good to monitor your dependencies if you have the
| time. Then when someone complains about an issue in your
| service, you can check your monitoring to see if your
| upstream services are broken. If they are, at least you know
| where to start debugging.
| beezlewax wrote:
| Do you mind if I ask what monitoring service that is?
| benhoyt wrote:
| Sure, it's UptimeRobot: https://uptimerobot.com/
| andrew-jack wrote:
| Use https://pulsetic.com/
| vextea wrote:
| Is it your service?
| buzzier wrote:
| https://github.com/louislam/uptime-kuma
| rozenmd wrote:
| I externally monitor fly.io and its docs here:
| https://flyio.onlineornot.com/
|
| Looks like it lasted 16 minutes for them.
| davidgl wrote:
| Same for us, down for ~5 mins, back up and fine, error was 501
| TacticalCoder wrote:
| Someone said 16 minutes: so it's not even 5 nines service.
| dprotaso wrote:
| What free monitoring tool do you use?
| MaxfordAndSons wrote:
| Kinda funny that they've named their global state store
| "Corrosion"... not really a word I'd associate with stability and
| persistence.
| kermatt wrote:
| https://community.fly.io/t/reliability-its-not-great/11253
|
| https://github.com/superfly/corrosion
| lordofgibbons wrote:
| It's an internal project based on Rust, not a product. So I
| don't think it matters too much what they name it. It's open
| source, which is great, but still not a product that they need
| to market.
| SOLAR_FIELDS wrote:
| And to be fair, it's a bit of a cute meme to name rust
| projects things that relate to it. Oxide, etc
| dumah wrote:
| I take your point but corrosion-resistant metals such as
| Aluminum, Titanium, Weathering Steel and Stainless Steel don't
| avoid corrosion entirely but form a thin and extremely stable
| corrosion layer (under the right conditions).
| littlestymaar wrote:
| Gold and platinum really are corrosion resistant though (but
| have questionable mechanical properties...)
| toast0 wrote:
| I stored important data in mnesia, so who would I be to talk.
| :p
| throwawaymaths wrote:
| amnesia means forget, so mnesia means remember, I would
| guess?
| veggieWHITES wrote:
| I was considering these guys the other day until I saw their
| pricing page: https://fly.io/pricing/
|
| (There's not a single price on there, why even create the page?)
| schmichael wrote:
| The prices are just one click deeper. Hardly a nefarious dark
| pattern.
| rascul wrote:
| There's a link to what appears to be the actual pricing page
| https://fly.io/docs/about/pricing/
|
| There's also a link to the pricing calculator
| https://fly.io/calculator
| totetsu wrote:
| Is that calculator hourly or monthly?
| radicalriddler wrote:
| Literally says "Monthly Costs" in the green panel on the
| right that calculates the total.
| eviks wrote:
| It's right there: "Monthly Cost"
| Aeolun wrote:
| OMG, that's hilarious. I use them, and I know what my prices
| are, but I'd never noticed that the page called pricing doesn't
| actually have any.
| tptacek wrote:
| We've always had public pricing; you can't do a metered cloud
| provider without a rate sheet. But it's been part of our
| product documentation, rather than the front page of the
| website, until recently; there's a whole saga behind it,
| which gets into whether we offer "plans" or not, how support
| works, all that jazz, all of which kept us from putting
| together a marketing pricing page.
| HellsMaddy wrote:
| Suspiciously, Turso started having issues around the same time.
| Their CEO confirmed on Discord it's due to the Fly outage:
|
| > Ok.I caught up with our oncall and This seems related to the
| Fly.io incident that is reported in our status page. Our login
| does call things in the Fly.io API
|
| > we are already in touch with Fly and will see if we can speed
| this up
| pier25 wrote:
| Not the first time Turso goes down because of Fly issues. It
| must suck to have built a db service and have this downtime.
|
| Apparently Turso are going to offer an AWS tier at some point.
| jonasdoesthings wrote:
| Last month Turso released AWS-hosted databases to the public
| (still in Beta): https://turso.tech/blog/turso-aws-beta
| pier25 wrote:
| Thanks!
| DataOverload wrote:
| We switched from Fly to CF workers a while ago, and never looked
| back
| eek2121 wrote:
| congrats on not developing a playbook for the time you have to
| 'look back'.
|
| Providers will fail. good contingencies won't.
|
| ...hears faint sound...I SAID GOOD, QUIET YOU!
| punkpeye wrote:
| They are fundamentally different. If Cloudflare provided a way
| to host docker containers with volumes though, that would be
| game over for so many paas platforms.
| stoicjumbotron wrote:
| Can't wait:
| https://blog.cloudflare.com/container-platform-preview/
| punkpeye wrote:
| wow, this will be huge
| Aeolun wrote:
| Only if they can sort out their atrocity of a
| documentation website.
| rstupek wrote:
| How are they equivalent?
| frakkingcylons wrote:
| I switched from apples to oranges and never looked back.
| pier25 wrote:
| Our stuff on CF Workers has been working non stop for years
| now.
|
| About 6 months ago we migrated our most critical stuff from Fly
| to CF and boy every time Fly has issues I'm so glad we did.
| jpgvm wrote:
| Too much custom stuff too quickly, there is a lot of
| efficiency in vertical integration and a fully cohesive stack
| but it takes a very long time to stabilize if you take that
| route.
|
| We spent months trying to convince them of problems with
| their H2 implementation in their LB/proxy (they insisted
| nginx was at fault, spoiler - it wasn't) but had to leave (we
| also went to CF, which has its own problems). Eventually one
| of their employees wrote a long blog post about H2 that made
| it obvious they finally found and fixed those problems but
| months too late for my employer at the time.
|
| It would have been infinitely better for us if they could
| have just fixed their stability problems, that abstraction
| suited us as did their LB/proxy impl and SNI pricing.
|
| I wish them well, some really smart folk over there but I can
| imagine these reliability problems are probably really
| grinding down morale.
| EGreg wrote:
| What exactly does flyio.net do?
| michaelbuckbee wrote:
| Hosting service that has a lot of interesting distributed
| features.
| HellsMaddy wrote:
| If you mean specifically flyio.net and not just fly.io the
| company, I'm guessing they host their status page on a separate
| domain in case of DNS/registrar issues with their primary
| domain.
| stackghost wrote:
| IIRC their value prop is that they let you rapidly spin up
| deployments/machines in regions that are closest to your users,
| the idea being that it will be lower latency and thus better
| UX.
| eek2121 wrote:
| WEB 2.0. SEE. TOLD YA! THEY SHOULDA UPGRADED TO THAT NEWFANGLED
| 3.0! ;)
| vachina wrote:
| It's basically what Heroku used to be but with CDN-like
| presence.
| mrcwinn wrote:
| I tried Fly early. I was very excited about this service, but
| I've never had a worse hosting experience. So I left.
| Coincidentally I tried it again a few days ago. Surely things
| must be better. Nope. Auth issues in the CLI, frustrations
| deploying a Docker app to a Fly machine. I wouldn't recommend it
| to anyone.
| steve_adams_86 wrote:
| I find their user experience to be exceptional. The only flake
| I've encountered is in uptime and general reliability of
| services I don't interface with directly. They've done a
| stellar job on the stuff you actually deal with, but the glue
| holding your services together seems pretty wobbly.
| teaearlgraycold wrote:
| I'm grateful to HN for keeping me well aware of Fly's issues.
| I'll never use them.
| kachapopopow wrote:
| It's still 99.99+% SLA? Would you really pay 100% more for
| <0.01% more uptime?
| cj wrote:
| I think what a lot of people fail to understand is that there
| are certain categories of apps that simply "can never go
| down"
|
| Examples include basically any PaaS, IaaS, or any company
| that provides a mission-critical service to another company
| (B2B SaaS).
|
| If you run a basic B2C CRUD app, maybe it's not a big deal if
| your service goes down for 5 minutes. Unfortunately there are
| quite a few categories of companies where downtime simply
| isn't tolerated by customers. (I operate a company with a
| "zero downtime" expectation from customers - it's no joke,
| and I would never use any infrastructure abstraction layer
| other than AWS, GCP or Azure - preferably AWS us-east-1
| because, well, if you know the joke...)
| macNchz wrote:
| Every PaaS and IaaS I've ever used has had some amount of
| downtime, often considerably more than 5 minutes, and I've
| run production services on many of them. Plenty of random
| issues on major cloud providers as well. Certainly plenty
| of situations with dozens of Twitter posts happening but
| never any acknowledgement on the AWS status page. Nothing's
| perfect.
| cj wrote:
| Yea, when running services where 5 minutes of downtime
| results in lots of support tickets, you learn to accept
| that the incident will happen and learn to manage the
| incident rather than relying on it never occurring.
| MobiusHorizons wrote:
| you realize all of those services you mention can't give
| you zero downtime, they would never even advertise that.
| They have quite good reliability certainly, but on long
| enough time horizons absolutely no-one has zero downtime.
| littlestymaar wrote:
| If your app cannot go down ever, then you cannot use a
| cloud provider either (because even AWS and Azure do fail
| sometimes; just search for "Azure down" on HN).
|
| But the truth is everybody can afford _some_ level of
| outage, simply because nobody has the budget to provision
| an infra that can _never_ fail.
| vrosas wrote:
| I've seen a team try to be truly "multi-cloud" but then
| end up with this Frankenstein architecture where
| instead of being able to weather one cloud going down,
| their app would die if _any_ cloud had an issue. It was
| also surprisingly hard to convince people it doesn't
| matter how many globally distributed clusters you have if
| all your data is in us-east.
| toast0 wrote:
| > I think what a lot of people fail to understand is that
| there are certain categories of apps that simply "can never
| go down"
|
| I refuse to believe that this category still exists, when I
| need to keep my county's alternate number for 911 in my
| address book, because CenturyLink had a 6 hour outage in
| 2014 and a two day outage in 2018. If the phone company
| can't manage to keep 911 running anymore, I'd be very
| surprised if anything does have zero downtime over a ten year
| period.
|
| Personally, nine nines is too hard, so I shoot for eight
| eights.
| bri3d wrote:
| My experience with very large scale B2B SaaS and PaaS has
| been that customers like to get money, if allowed by
| contract, by complaining about outages, but that overall,
| B2B SaaS is actually very forgiving.
|
| Most B2B SaaS solutions have very long sales cycles and a
| high total cost to implement, so there is a lot of inertia
| to switching that "a few annoying hours of downtime a year"
| isn't going to cover. Also, the metric that will drive
| churn isn't actually zero downtime, it's "nearest
| competitor's downtime," which is usually a very different
| number.
| sgrove wrote:
| All of your examples have had multiple cases of going down,
| some for multiple days (2011 AWS was the first really long
| one I think) - or potentially worse, just deleting all
| customer data permanently and irretrievably.
|
| Meaning empirically, downtime seems to be tolerated by
| their customers up to some point?
| mrcwinn wrote:
| This is not my experience at all, as a former paying
| customer.
| runako wrote:
| No dog in this fight, all props to the Fly.io team for having
| the gumption to do what they are doing, I genuinely hope they
| are successful...
|
| > It's still 99.99+% SLA
|
| But this is simply not accurate. 99.99% uptime is < 52m 9.8s
| of downtime _annually_. They apparently blew well through
| that today. Looks like they essentially used up the equivalent
| of 4 years of a 99.99% downtime budget this evening.
|
| Four nines is so unforgiving that it's almost the case that
| if people are required to be in the loop at any point during
| an incident, you will blow the fourth nine for the whole year
| in a single incident.
|
| Again, I know it's hard. I would not want to be in the space.
| That fourth nine is really difficult to earn.
|
| In the meanwhile, <hugops> to the Fly team as they work to
| resolve this (and hopefully get some rest).
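|
| The four-nines arithmetic above can be checked with a few lines of
| Python. This is a minimal sketch: the function names are made up
| for illustration, and it assumes a 365-day year, so the exact
| seconds differ slightly depending on the year length used. Under
| that assumption 99.99% allows about 52.6 minutes of downtime per
| year.
|

```python
def downtime_budget_seconds(availability_pct, period_days=365.0):
    """Allowed downtime, in seconds, for a given availability over a period."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1.0 - availability_pct / 100.0) * period_days * 86_400


def budgets_consumed(outage_minutes, availability_pct, period_days=365.0):
    """How many periods' worth of downtime budget a single outage uses up.

    Not defined for 100% availability, where the budget is zero.
    """
    budget = downtime_budget_seconds(availability_pct, period_days)
    return outage_minutes * 60.0 / budget


if __name__ == "__main__":
    for nines in (99.9, 99.99, 99.999):
        minutes = downtime_budget_seconds(nines) / 60.0
        print(f"{nines}% -> {minutes:.1f} minutes of downtime per year")
```

| budgets_consumed(outage_minutes, 99.99) then tells you how many
| years of four-nines allowance a single incident burned.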
| fulafel wrote:
| 99.99+% SLA typically means you get some billing credits
| for the downtime exceeding 99.99+ availability. So
| technically you do get a "99.99+% SLA", but you don't get
| 99.99+% availability.
|
| Other circles use "SLO" (where the O stands for objective).
|
| (Anyone know what the details in fly.io SLA are?)
| runako wrote:
| You are correct in the legal/technical sense!
|
| Technically, anyone could offer five- or six-nines and
| just depend on most customers not to claim the credits
| :-D
|
| Actually hitting/exceeding four nines is still tough.
| PUSH_AX wrote:
| You say that like it's their only issue.
|
| Earlier in the year they had a catastrophic outage in LHR, we
| lost all our data. Yes this is also on me, I'm aware. Still,
| that's a hard nope from me, we migrated.
| akshayshah wrote:
| The series of outages early in 2023 also had some
| Corrosion-related pain:
| https://community.fly.io/t/reliability-its-not-great/11253
| __turbobrew__ wrote:
| Seems like rolling their own datastore turned out to be a bad
| bet.
|
| I'm not super familiar with their constraints but ScyllaDB can
| do eventual consistency and is generally quite flexible.
| CouchDB is also an option for multi-leader replication.
| pier25 wrote:
| My apps on Fly have not gone down this time.
| marvin-hansen wrote:
| No surprise. About a year ago, I looked at fly.io because of its
| low pricing and I was wondering where they were cutting corners
| to still make some money. Ultimately, I found the answer in their
| tech docs where it was spelled out clearly that a fly instance
| is hardwired to one physical server and thus cannot fail over in
| case that server dies. Not sure if that part still is in the
| official documentation.
|
| In practice, that means if a server goes down, they have to load
| the last snapshot of that instance from the backup and push it
| onto a new server, update the network path, and pray to god that
| no more servers fail than there is spare capacity. Otherwise
| you have to wait for a restore until the datacenter has mounted
| a few more boxes in the rack.
|
| That explains quite a bit the randomness of those outage reports
| i.e. my app is down vs the other is fine and mine came back in 5
| minutes vs the other took forever.
|
| As a business on a budget, I think anything else, e.g. a small
| Civo cluster, serves you better.
| fulafel wrote:
| The status tells a story about a high-availability/clustering
| system failure so I think in this case the problem is rather
| the complexity of the HA machinery hurting the system's
| availability vs something like a simple VPS.
| ignoramous wrote:
| Fly.io can migrate vm+volume now:
| https://fly.io/docs/reference/machine-migration/ /
| https://archive.md/rAK0V
|
| > _a fly instance is hardwired to one physical server and thus
| cannot fail over_
|
| I'm having trouble understanding how _else_ this is supposed to
| be? I understand that _live migration_ is a thing, but even in
| those cases, a VM is "hardwired" to some physical server, no?
| mzi wrote:
| > I'm having trouble understanding how else this is supposed
| to be? I understand that live migration is a thing, but even
| in those cases, a VM is "hardwired" to some physical server,
| no?
|
| You can run your workload (in this case a VM) on top of a
| scheduler, so if one node goes down the workload is just spun
| up on another available node.
|
| You will have downtime, but it will be limited.
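|
| As a toy illustration of that idea, here is a sketch in Python of
| a scheduler that restarts workloads on another node when one
| dies. The Scheduler class and its methods are hypothetical and
| say nothing about how Fly or any real scheduler is built. It
| ignores storage entirely: restarting elsewhere only helps if the
| workload's data is reachable from the new node.
|

```python
class Scheduler:
    """Toy scheduler: tracks which node runs each workload."""

    def __init__(self, nodes):
        self.capacity = dict(nodes)   # node name -> free slots
        self.healthy = set(nodes)     # nodes currently believed alive
        self.placement = {}           # workload name -> node name

    def _pick_node(self):
        candidates = [n for n in self.healthy if self.capacity[n] > 0]
        if not candidates:
            raise RuntimeError("no healthy node with spare capacity")
        # Prefer the node with the most free slots; break ties by name.
        return min(candidates, key=lambda n: (-self.capacity[n], n))

    def place(self, workload):
        """Start a workload on some healthy node and return that node."""
        node = self._pick_node()
        self.capacity[node] -= 1
        self.placement[workload] = node
        return node

    def node_failed(self, node):
        """Mark a node dead and restart its workloads on other nodes."""
        self.healthy.discard(node)
        displaced = sorted(w for w, n in self.placement.items() if n == node)
        moved = {}
        for workload in displaced:
            del self.placement[workload]
            moved[workload] = self.place(workload)  # raises if no capacity left
        return moved
```

| The RuntimeError path is the "more servers fail than there is
| spare capacity" case: the workload stays down until capacity is
| added.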
| ignoramous wrote:
| > _so if one goes down ... just spun up on another_
|
| On Fly, one can absolutely set this up. Multiple ways:
| https://fly.io/docs/apps/app-availability /
| https://archive.md/SJ32K
| sofixa wrote:
| > I'm having trouble understanding how else this is supposed
| to be? I understand that live migration is a thing, but even
| in those cases, a VM is "hardwired" to some physical server,
| no?
|
| They mean the storage part. If your VM's storage(state) is on
| one server and that server dies, you have to restore from
| backup. If your VM's storage is on remote shared storage
| mounted to that server and the server dies, your VM can be
| restarted elsewhere that has access to that shared storage.
|
| In AWS land it's the difference between instance store (local
| to a server) and EBS (remote, attached locally).
|
| There's a tradeoff in that shared storage will be slightly
| slower due to having to traverse networking, and it's harder
| to manage properly; but the reliability gain is massive.
| dilyevsky wrote:
| > Ultimately, I found the answer in their tech docs where it
| was spelled out clearly that a fly instance is hardwired to
| one physical server and thus cannot fail over in case that
| server dies.
|
| Majority of EC2 instance types did not have live migration
| until very recently. Some probably still don't (they don't
| really spell out how and when it's supposed to work). It is
| also not free - there's a noticeable brown-out when your VM
| gets migrated on GCP for example.
| ixaxaar wrote:
| Can you shed some more light on this "browning out"
| phenomenon?
| toast0 wrote:
| Here's the GCP doc [1]. Other live migration products are
| similar.
|
| Generally, you have worse performance while in the
| preparing to move state, an actual pause, then worse
| performance as the move finishes up. Depending on the
| networking setup, some inbound packets may be lost or
| delayed.
|
| [1] https://cloud.google.com/compute/docs/instances/live-
| migrati...
| pier25 wrote:
| If you want HA on Fly you need to deploy an app to multiple
| regions (multiple machines).
|
| Fly might still go down completely if their proxy layer fails
| but it's much less common.
| sb8244 wrote:
| The proxy layer was the cause of yesterday's outage according
| to support.
| pier25 wrote:
| Yes but the previous comment was about hardware failure.
| theideaofcoffee wrote:
| Color me not surprised. My few interactions with people there
| just gave off the impression of them being in a bit over their
| heads. I don't know how well that translated to their actual ops,
| but it's difficult to not connect the two when they continue to
| have major outage after major outage for a product that 'should'
| be their customer's bedrock upon which they build everything
| else.
| xyst wrote:
| Recurring pattern I notice is outages tend to occur the week of
| major holidays in US.
|
| - MS 365/Teams/Exchange had a blip in the morning
|
| - Fly.io with complete outage
|
| - then a handful of sites and services impacted due to those
| outages
|
| Usually advocate against "change freezes" but I think a change
| freeze around major holidays makes sense. Give all teams a
| recharge/pause/whatever.
|
| Don't put too much pressure on the B-squads that were unfortunate
| to draw the short stick.
| aaomidi wrote:
| What do "Freezes" mean? Like, do you stop renewing your
| certificates? Do you stop taking in security updates for your
| software?
|
| Sure maybe "unnecessary" changes, but the line gets very gray
| very fast.
| vrosas wrote:
| No unnecessary code deployments.
| Spivak wrote:
| It's not very grey, prod becomes as if you told everyone but
| your ops team to go home and then sent your ops team on a
| cruise with pagers. If it's not important enough to merit
| interrupting their vacation you don't do it.
| fragmede wrote:
| Certs shouldn't still be done by hand at this point; if
| another heartbleed comes out in the next 7 days then the risk
| can be examined, escalated, and the CISO can overrule the
| freeze. If it's a patch for remote root via Bluetooth drivers
| on a server that has no Bluetooth hardware, it's gonna wait.
|
| you're right that there's a grey line, but crossing that line
| involves waking up several people and the on call person
| makes a judgement call. if it's not important enough to wake
| up several people over, then things stay frozen.
| aaomidi wrote:
| Right, that's basically what I mean. There are a lot of
| automated changes happening in the background for services.
| I guess the whole thing I'm saying is that not every
| breakage is happening because of a code change.
| kbolino wrote:
| There's still a lot of situations where automatic
| certificate enrollment and renewal is not possible. TLS is
| not the only use of X.509 certificates, and even then,
| public facing HTTPS is not the only use of TLS.
|
| It needs to get better but it's not there yet.
| vrosas wrote:
| Then you just get devs rushing out changes before the freeze...
| fragmede wrote:
| and stampeding changes in after the thaw, also leading to
| downtime. so it depends on the org, but doing a freeze is
| still reasonable policy. Downtime on December 15th is less
| expensive than on black Friday or cyber Monday for most
| retailers, so it's just a business decision at that point.
| subarctic wrote:
| As a developer I don't see why I would rush out a change
| before the freeze when I could just wait until after. Maybe a
| stakeholder that really wants it would press for it to get
| out but personally I'd rather wait until after so I'm not
| fixing a bug during my holiday.
| vrosas wrote:
| Congrats on not working for the product team I work for
| ploxiln wrote:
| I think you can't avoid the fact that these holiday weeks are
| different from regular weeks. If you "change freeze" then you
| also freeze out the little fixes and perf tuning that usually
| happens across these systems, because they're not "critical".
|
| And then inevitably it turns out that there's a special
| marketing/product push, with special pricing logic that needs
| new code, and new UI widgets, causing a huge traffic/load
| surge, and it needs to go out NOW during the freeze, and this
| is revenue, so it is critical to the business leaders. Most of
| eng, and all of infra, didn't know about it, because the
| product team was cramming until the last minute, and it was
| kinda secret. So it turns out you can freeze the high-quality
| little fixes, but you can't really freeze the flaky brand-new
| features ...
|
| It's just a struggle, and I still advise to forget the freeze,
| and try to be reasonable and not rush things (before, during,
| or after the freeze).
| ignoramous wrote:
| Some shops conduct _game days_ as the freeze approaches.
|
| https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-2.
| .. / https://archive.md/uaJlR
| willsmith72 wrote:
| Any big tech company with large peak periods disagrees with
| you. It's absolutely worth freezing non-critical changes.
|
| Urgent business change needs to go through? Sure, be prepared
| to defend to a vp/exec why it needs to go in now.
|
| Urgent security fix? Yep same vp will approve it.
|
| It's a no-brainer to stop your typical changes which aren't
| needed for a couple of weeks. By the way, it doesn't mean
| your whole pipeline needs to stop. You can still have stuff
| ready to go to prod or pre prod after the freeze
| paxys wrote:
| Bad code rarely causes outages at this scale. The culprit is
| always configuration changes.
|
| Sure you can try and reduce those as well during the holiday
| season, but what if a certificate has to be renewed? What if a
| critical security patch needs to be applied? What if a set of
| servers need to be reprovisioned? What if a hard disk is
| running out of space?
|
| You cannot plan your way out of operational challenges,
| regardless of what time of year it is.
| bobsyourbuncle wrote:
| This is a good observation. Do you have any resources I can
| read up on to make this safer?
| jimmyl02 wrote:
| I think a good way of looking at it is risk. Is the change
| (whether it is code or configuration, etc.) worth the risk it
| brings on.
|
| For example if it's a small feature then it probably makes
| sense to wait and keep things stable. But, if it's something
| that itself causes larger imminent danger like security
| patches / hard disk space constraints, then it's worth taking
| on the risk of change to mitigate the risk of not doing it.
|
| At the end of the day no system is perfect and it ends up
| being judgement calls but I think viewing it as a risk
| tradeoff is helpful to understand.
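|
| A minimal sketch of that tradeoff, assuming each option can be
| given a rough probability and cost. The function names and every
| number below are hypothetical, chosen only to illustrate the
| comparison.

```python
def expected_cost(probability: float, impact: float) -> float:
    """Expected cost of an event: chance it happens times what it costs."""
    return probability * impact


def worth_changing_now(p_change_breaks: float, cost_of_breakage: float,
                       p_harm_if_deferred: float, cost_of_harm: float) -> bool:
    """True when deferring the change is riskier than making it."""
    risk_of_change = expected_cost(p_change_breaks, cost_of_breakage)
    risk_of_waiting = expected_cost(p_harm_if_deferred, cost_of_harm)
    return risk_of_waiting > risk_of_change


# Hypothetical inputs: a small feature that loses nothing by waiting.
small_feature = worth_changing_now(0.05, 50_000, 0.0, 0)

# Hypothetical inputs: a security patch where waiting risks a breach.
security_patch = worth_changing_now(0.05, 50_000, 0.20, 500_000)

print(small_feature, security_patch)
```

| With these made-up inputs the small feature waits and the security
| patch goes out. In practice the probabilities are judgement calls,
| so this is a way to frame the decision rather than compute it.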
| oarsinsync wrote:
| > Sure you can try and reduce those as well during the
| holiday season, but what if a certificate has to be renewed?
| What if a critical security patch needs to be applied? What
| if a set of servers need to be reprovisioned? What if a hard
| disk is running out of space?
|
| Reading this, I see two routine operational issues, one
| security issue and one hardware issue.
|
| You can't plan your way around security issues or hardware
| failures, but operational issues you both can and should plan
| around. Holiday schedules like this are fixed points in time,
| so there's absolutely no reason why you can't plan all
| routine works to be completed either a week in advance, or a
| week after, the holiday period.
|
| Certificates don't need to be near the point of expiry to be
| renewed. Capacity doesn't need to be at critical levels to be
| expanded. Ultimately, this is a risk management question (as
| a sibling has also commented). Is the organisation willing to
| take on increased risk in exchange for deferring operational
| expenses?
|
| If the operational expense is inevitable (the certificate
| will need renewing), that seems like an easy answer when it
| comes to risk management over holidays.
|
| If the operational expense is not inevitable (will we really
| need to expand capacity?), it then becomes a game of
| probabilities and financials - likelihood of expense being
| incurred, amount of expense incurred if done ahead of time,
| impact to business if something goes wrong during a holiday.
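|
| A rough sketch of planning renewals around a fixed holiday window,
| assuming you already have an inventory of certificate expiry
| dates. The inventory, the dates, and the one-week buffer are all
| hypothetical; the function only flags what to renew before the
| freeze begins.

```python
from datetime import date, timedelta


def certs_to_renew_early(expiries: dict[str, date], freeze_end: date,
                         buffer_days: int = 7) -> list[str]:
    """Names of certificates expiring before the freeze ends plus a buffer."""
    cutoff = freeze_end + timedelta(days=buffer_days)
    return sorted(name for name, expires in expiries.items()
                  if expires <= cutoff)


# Hypothetical inventory of certificate expiry dates.
expiries = {
    "api.example.com": date(2024, 11, 29),
    "internal-ca": date(2024, 12, 3),
    "www.example.com": date(2025, 3, 1),
}

# Hypothetical freeze ending on 2024-11-29.
early = certs_to_renew_early(expiries, freeze_end=date(2024, 11, 29))
print(early)
```

| Anything this returns gets renewed ahead of the freeze; everything
| else can wait until after it.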
| tptacek wrote:
| We'll have a postmortem in next week's infra log update, but
| here it was a particularly ambitious customer app pushing our
| state sync service into a corner case; it's one we knew
| about, but the solution (federating regional state sharing
| clusters rather than running one globally) is taking time to
| roll out.
| cess11 wrote:
| Blip? 365 has an ongoing incident since yesterday morning,
| european timezone. The reason I know is because I use their
| compliance tools to secure information in a rather large
| bankruptcy.
| jart wrote:
| fly.io publishes their post-mortems here: https://fly.io/infra-
| log/
|
| The last post-mortem they wrote is very interesting and full of
| details. Basically back in 2016 the heart or keystone component
| of fly.io production infrastructure was called consul, which is a
| highly secure TLS server that tracks shared state and it requires
| that both the server certificate and the client certificate be
| authenticated. Since it was centralized, it had scaling issues,
| so fly.io wrote a replacement for it in 2020 called corrosion,
| and quickly forgot about consul, but didn't have the heart to
| kill it. Then in October 2024 consul's root key signing key
| expires, which brought down all connectivity, and since it uses
| bidirectional authentication, they couldn't bring it back online
| until they deployed new SSL certificates to every machine in
| their fleet. Somehow they did this in half an hour, but the chain
| of dominoes had already been set in motion to reveal other
| weaknesses in their infrastructure that they could eliminate.
| There was this other internal service whose own independent set
| of TLS keys had also expired long ago, but they didn't notice
| until they tried rebooting it as part of the consul rekey, since
| doing so severed the TCP connections it had established way back
| when its certificate was valid. Plus the whole time this is
| happening, their logging tools are DDOSing their network
| provider. It took some real heroes to save the company and all
| their customers too when that many things explode at once.
| ignoramous wrote:
| On that Consul outage, Fly Infra concludes, "The moral of the
| story is, no more half-measures."
|
| On their careers page [1], the Fly team goes, "We're not big
| believers in tech debt."
|
| As an outsider, reads like a cacophony of contradictions?
|
| [1] https://fly.io/docs/hiring/working/#we-re-ruthless-about-
| doi...
| jart wrote:
| No one actually lives up to their principles, but it's still
| important that we have them.
|
| If you actually do live up to yours, then you need to adopt
| better principles.
| whilenot-dev wrote:
| Any principle in itself isn't without critique, agree, but
| it's still the choice being made to pick this specific
| principle that tells the whole story. There are so many
| principles to pick from and the tech debt pick follows up
| with a _"We have a 3-month "no refactoring" rule for new
| hires. This isn't everyone's preferred work style! We try
| to be up front about stuff."_, which sounds a bit like an
| additional _perform or else..._ principle that just delays
| ownership of the stuff you're supposed to work with. In
| the best case that sounds like naive optimism and in the
| worst case that's gross negligence... neither one speaks
| "engineering" to me.
| Aeolun wrote:
| Two contradictory statements do not read like a 'cacophony'
| of anything to me xD I think you need a whole lot more than
| two to do that word justice.
| JimDabell wrote:
| _"No more half-measures"_ and _"We're not big believers in
| tech debt"_ aren't even contradictory statements, let alone
| a cacophony of them.
| mattgreenrocks wrote:
| The comment section doing what it does best!
| ignoramous wrote:
| For brevity I chose to put up only the conclusion from a
| postmortem (of which I've read plenty by now) and another
| point from their otherwise comparatively shorter careers
| page, which imo capture the inherent tension between
| building out fast & building out right. This is not
| something I've started complaining about today or
| yesterday. I've used Fly in prod for 4 years and spilled
| much ink on this topic on their forums already. Even if I
| critique, I remain optimistic about Fly despite the
| seemingly endless list of failure modes building such
| complex systems entail: https://community.fly.io/t/fly-
| down/10224/15
|
| (personally speaking, I'm humble enough because I can
| hardly build a toy side-project right!)
| bdcravens wrote:
| "full measures" aren't the same thing as tech debt.
| Complexity isn't even the same thing as tech debt.
| gigapotential wrote:
| HUGOPS
|
| Everything is going to be 200 OK!
| xyst wrote:
| I can't even login to my old account. Password reset is timing
| out yet still receive password reset e-mail. Password reset link
| broken, with 500 status code.
| neya wrote:
| Personal experience between Fly.io and Railway.com - Railway wins
| for me hands down. I have used both and Railway's support is
| stellar too, in comparison. Fly.io has not responded to my query
| about data deletion to date, despite my emailing their support
| address.
|
| I have had my Railway app online to date without any major
| downtime too. I recommend anyone looking for a decent
| replacement to try them.
| punkpeye wrote:
| How does it compare in terms of price?
| justjake wrote:
| We actually only charge you for what you use. As a result
| people often see 30%+ savings when moving stuff over from
| other providers (especially Heroku)
|
| https://railway.com/pricing
| andai wrote:
| I've used Railway control panel maybe a total of 10 times in my
| life and half the time it was having weird issues. Control
| panel UI not loading or not working, actions failing, deploys
| randomly failing... I love the idea but in practice it's not
| something I'd want to use for anything serious.
| justjake wrote:
| While we've always aimed for great reliability on compute,
| the dashboard reliability wasn't very good at the start of
| the year.
|
| We ack'd this and then committed pretty heavily to making it stellar,
| so if you're still having issues please let us know (that
| should not be the case)
|
| Best, Jake from Railway
| ignoramous wrote:
| Fly builds on their own hardware. Is Railway doing the same? If
| not, that'd explain some of why Railway has relatively fewer
| outages (they're engineering fewer things).
|
| I understand that end-users want reliability (and Fly gets a
| bad rep despite pretty _significant_ investment on this front
| in the past 2 years), but such outages aren't exclusive to one
| provider & not the other. Building cloud infra is no one's
| definition of easy.
| justjake wrote:
| We run on Google Cloud, AWS, and our own hardware now since
| middle of this year :)
|
| https://railway.com/changelog/2024-09-20-railway-metal-
| beta#...
| punkpeye wrote:
| Contrary to the title of the post, Fly.io API remains
| inaccessible. Meaning, users still cannot access
| deploys/databases, etc.
|
| For accurate updates, follow https://community.fly.io/t/fly-io-
| site-is-currently-inaccess...
| cryptos wrote:
| Fly.io seems to be a bit of a mixed bag:
|
| https://news.ycombinator.com/item?id=41917436
|
| https://news.ycombinator.com/item?id=35044516
|
| https://news.ycombinator.com/item?id=34742946
|
| https://news.ycombinator.com/item?id=34229751
|
| If a cloud platform doesn't really provide reliability, I'd say
| it's probably not worth it. You could better just rent a
| (virtual) server and save the cloud tax.
| qeternity wrote:
| I don't really understand the value prop of fly.io. They seem
| to have an impressive engineering team despite the outages, but
| is edge compute really something that 99.9% of devs need? There
| are tons of large companies that operate out of a single AWS
| region and those services are used by millions around the
| globe. It just strikes me as something that enables premature
| optimization right out of the box.
| k__ wrote:
| It's basically the new Heroku with less lock-in, because it
| works with Docker.
|
| You get edge computing, autoscaling, and load balancing
| without additional configuration.
|
| Not as flexible as AWS, but also much easier to setup and
| maintain.
|
| But the reliability issues suck now and then.
| nikodotio wrote:
| This is precisely it. The ease of deploy, https domain
| configuration, scaling.
|
| Additionally, having machines that turn off when not in use
| is easy to configure, which I never managed on AWS.
| ignoramous wrote:
| > _which I never managed on AWS_
|
| I haven't looked at it recently, but _App Runner_ could
| do a few of Fly.io esque things (but slightly more
| expensive): https://aws.amazon.com/apprunner/
| ignoramous wrote:
| > _Not as flexible as AWS_
|
| Today, Fly.io is more or less in the same market as
| _Lightsail_, not AWS. And when you compare it to
| _Lightsail_, it blows it away.
| mtlynch wrote:
| _And when you compare it to Lightsail, it blows it away._
|
| This is a bit of a confusing sentence because there are
| so many pronouns. Do all of the "it"s refer to Fly.io?
| dijksterhuis wrote:
| > And when you compare [fly.io] to Lightsail, [fly.io]
| blows [Lightsail] away.
| watermelon0 wrote:
| Did you count reliability into your assessment here? I'm
| reading about Fly.io outages multiple times a year,
| whereas Lightsail seems to be as stable as AWS EC2.
| gurgunday wrote:
| DigitalOcean has been doing this for years, and their value
| proposition is unmatched IMO
|
| For $5 you get:
|
| Latest gen CPUs and RAM
|
| HTTPS
|
| DDoS protection
|
| Cloudflare CDN
|
| Autoscale
|
| Competent support
|
| I'd say the best part is the predictable monthly prices
|
| And while most people probably don't care, they are an
| established public company, so there is more chance they
| will exist in 10 years
| dijksterhuis wrote:
| are global r/w token permissions still a thing, or did
| the token scopes thing finally come out of beta?
|
| also, my experience with support was not the same as
| yours. they were utterly useless for the most part.
|
| for a personal web dev (or similar) project, like, i
| agree, they've got good value.
|
| but having worked in a small biz where DO was what they
| built everything on -- no. bad idea. spend more. use aws
| (graviton ec2 instances)/azure.
| fragmede wrote:
| the $5 droplet is underpowered and can't run anything
| substantial. it's just the price to get you in the door.
| yabones wrote:
| It doesn't really need to run anything "substantial"
| though. Running some janky wordpress site with some
| scabbed-on ecommerce customizations is like 50% of the
| internet.
| infecto wrote:
| a 1vCPU 512mb instance is plenty for most base cases.
| Maybe you need one additional machine to act as a
| background worker. I am sure there are some noisy
| neighbors but to say it's underpowered is silly.
| fragmede wrote:
| I'm calling it underpowered because the $5 one had
| trouble running my custom ssh daemon. ssh! the
| cryptography for that shouldn't chug down the server I'm
| renting from them. a bigger instance from them isn't
| having the same problems.
| pajeetz wrote:
| you wouldn't be able to run anything substantial with
| that kind of budget
|
| but Go and PocketBase are on record for supporting 10k
| concurrent requests per second on low powered VPS
| infecto wrote:
| I have asked this multiple times but is anyone really using
| edge compute and getting value out of it? I am certain
| there are cases but I have not seen any of them written up
| before.
| pier25 wrote:
| We have an embeddable audio player served globally with
| very low latency. This wouldn't be possible without edge
| compute/data.
| sofixa wrote:
| Depends on what you mean by edge compute, but you
| probably are.
|
| 5G towers are a ton of compute on the edge to secure and
| protect the traffic passing through them.
|
| Or if by edge you mean having stuff close to your
| consumers, every non trivial operation does that.
| victorbjorklund wrote:
| If half your customers are in New York and half in Sydney it
| makes your app faster if you run it in both places.
|
| There are a lot of things we do for our users that we don't
| need (no one "needs" SPA etc). But if it is easy to make your
| app faster for your users, why not?
| victorbjorklund wrote:
| And it is easier than AWS to deploy.
| jrockway wrote:
| I would take edge compute if it's free and easy. That's
| fly.io's value prop.
|
| In a world where much web browsing starts with SYN, SYN-ACK, ACK,
| it is nice if the server is close to you.
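|
| A rough sketch of why distance matters, assuming a fresh HTTPS
| connection over TCP with TLS 1.3: one round trip for the TCP
| handshake, one for the TLS handshake, and one for the request and
| the first byte of the response. The round-trip times are
| hypothetical, and server processing time is ignored.

```python
def time_to_first_byte_ms(rtt_ms: float, tls_round_trips: int = 1) -> float:
    """Network-only estimate of the delay before the first response byte.

    Counts one round trip for the TCP handshake, `tls_round_trips` for
    the TLS handshake, and one for the HTTP request and response.
    """
    round_trips = 1 + tls_round_trips + 1
    return rtt_ms * round_trips


# Hypothetical round-trip times to a nearby and a faraway server.
nearby_ms = time_to_first_byte_ms(20)
faraway_ms = time_to_first_byte_ms(200)

print(nearby_ms, faraway_ms)
```

| Every round trip is multiplied by the distance, so the gap between
| the two servers grows with each handshake step.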
| brainzap wrote:
| I typed fly launch, fly deploy and my node.js project was
| deployed. So I guess hobby projects?
| austinpena wrote:
| I have an SSR Astro project. Using Fly makes my project fast.
|
| For dynamic data I use SWR.
|
| I could use Cloudflare workers but it doesn't play so nice
| with Astro.
|
| I also have a "form submission service" where I receive a
| Post and send an email.
|
| I need maximum uptime to avoid revenue loss.
|
| It's a go service so I deploy ~6 machines across the US to
| ensure I don't drop any requests.
|
| I haven't had downtime in years.
| infecto wrote:
| I am going to go out on a limb and say there is no real value
| prop to fly.io. I could completely be wrong but it always
| feels like the modern MongoDB. Everyone wants to use it but I
| am not sure they are extracting value from it and instead it's
| a shiny toy that is fun to build from.
| huijzer wrote:
| For experiments and hobby projects the value proposition is
| amazing. Where else can you spin up an independent instance for
| $1.94 per month?*
|
| *Note this is for an instance with only 256MB RAM
| (https://fly.io/docs/about/pricing/), but it's definitely
| possible to run non-trivial projects on that. Rust-based web
| servers like Rocket require only about 10MB RAM. Basic PHP
| servers should also fit from what I can find.
| hobo_mark wrote:
| One such microVM per month used to be within the free monthly
| allowance, is that not the case anymore?
| oefrha wrote:
| There are plenty of better deals as long as you don't limit
| yourself to big clouds and clouds with startup-esque landing
| pages frequently posted to HN. LowEndTalk may be the most
| well-known place for finding such deals.
|
| (Not saying the typical cheap VPS on LowEndTalk has
| comparable PaaS features. Only responding to parent's use
| case of a single cheap instance.)
| belter wrote:
| Sounds like a Lambda function....
| input_sh wrote:
| Nowhere? Because that's a ridiculously low amount of RAM to
| offer even in your cheapest offerings?
|
| You can easily get 4 GB of RAM for $5 from the likes of
| Hetzner or Hostinger, so that's 16x more RAM for 2.5x the
| price. One relatively unknown provider I have used in the
| past offers 2 GB of RAM for EUR3.6/month (if paid monthly,
| EUR3 if annually), so 8x more RAM for 1.5-2x the price. I'm
| sure I could find something even cheaper, but I'm just
| looking at providers I have personally used.
|
| BTW that dropdown seems to be sorted cheapest > most
| expensive. If you go to the bottom of the list the price for
| that same VPS doubles.
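|
| The multiples above can be checked with a few lines of arithmetic.
| This sketch uses only the prices quoted in this thread and, as an
| assumption, compares the EUR prices with the USD price without any
| currency conversion, as the comment appears to do.

```python
# The $1.94/month, 256MB instance quoted upthread.
FLY_RAM_MB, FLY_PRICE = 256, 1.94


def compare(ram_mb: int, price: float) -> tuple[float, float]:
    """Return (RAM multiple, price multiple) relative to the Fly instance."""
    return round(ram_mb / FLY_RAM_MB, 2), round(price / FLY_PRICE, 2)


four_gb = compare(4096, 5.00)         # 4 GB for $5
two_gb_monthly = compare(2048, 3.60)  # 2 GB for EUR3.6, paid monthly
two_gb_annual = compare(2048, 3.00)   # 2 GB for EUR3, paid annually

print(four_gb, two_gb_monthly, two_gb_annual)
```

| The results come out to roughly 16x the RAM for 2.6x the price, and
| 8x the RAM for 1.5x to 1.9x the price, in line with the figures
| given.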
| KomoD wrote:
| > Nowhere? Because that's a ridiculously low amount of RAM
| to offer even in your cheapest offerings?
|
| There's definitely places that offer it... also 512m
|
| I know because I've personally bought such plans and that
| was $5-10/yr because I didn't need dedicated ipv4.
| kelvinjps10 wrote:
| I'm getting 1$ for a 2gb ram vps in ovh for the first year
| TiredOfLife wrote:
| Oracle free is one 4 core 24gb ram vps + 2 dualcore amd vps.
| treesknees wrote:
| And actually, it's the resources that are free (CPU,
| memory, network) and you're allowed to split them up into
| multiple VMs if you want to.
|
| One of my VMs had an uptime of more than 1050 days before
| the infrastructure rebooted it, so in terms of availability
| they've certainly surprised me.
|
| The only downside I've come across with Oracle Free is that
| the 'best' regions are typically full. I ended up
| provisioning my free VMs in another region/country and it
| works fine.
|
| I suppose another downside (if you want to view it this
| way) is they will delete idle unused free VMs after a
| certain time period. You have to add a credit card to your
| account to "upgrade" your account and run free resource
| indefinitely. While you're not charged for anything, it
| makes me nervous forking over a CC number to Oracle.
| throwaway63467 wrote:
| Best business model in the world, buy stuff in big bags, put
| it in smaller ones, sell at a multiple of the original price.
|
| Fly is mostly (to my knowledge) reselling Netactuate and OVH
| servers, their main innovation is the developer experience on
| top, using Docker on a MicroVM based approach. Of course not
| only that, but I think it's their main differentiator.
|
| Haven't used that in a while but Scaleway offered
| ridiculously cheap dedicated ARM hardware close to these
| price points, not sure if they still do.
| pc86 wrote:
| Maybe if you're limiting yourself to AWS-wrapper cloud
| companies. What good is a $2/mo cloud instance if it's down
| multiple times a month?
|
| Just get a $5/mo VPS instead if you're really concerned about
| a few dollars a month.
| cxr wrote:
| > What good is a $2/mo cloud instance if it's down multiple
| times a month?
|
| The perverse irony is that the most common reason cited by
| cloud providers for not letting people set a hard cap on
| charges is an insistence that surely the last thing you
| want in the world is for your service to be taken offline,
| even if it does mean avoiding a $1k-$100k bill at the end
| of the month.
| hansvm wrote:
| I used to use Racknerd for that sort of thing, and the costs
| were around there -- maybe $1.90/mo for a 512MB instance. It
| was easy to squeeze several hobby projects onto the machine.
| pajeetz wrote:
| i recommend lowendtalk. what fly.io is doing is running colocated
| baremetal servers and using firecracker to overcommit
| (probably via memory ballooning and other disk compression on
| demand)
|
| if you are going to haggle over $2/month then you are better
| off just connecting your raspberry pi with
| wireguard/cloudflare tunnel on a residential connection
| akoculu wrote:
| Also:
|
| https://news.ycombinator.com/item?id=36808296
| zackify wrote:
| The reliability is very very bad. It was really insane that 2
| times in the past few months the main dashboard was down as I'm
| demoing something. Not to mention the deploy outages and almost
| daily some random thing was unavailable or delayed.
|
| I had to leave a few months ago after the price raises and how
| many times my boss saw some issue in the project I had with
| them.
|
| They also deprecated and removed their sqlite backup service.
| Back to GCP and not worrying about so many outages now.
| pc86 wrote:
| Now just to worry about GCP getting shut down with a few
| days' notice. /s
|
| But in all seriousness the gall to raise prices before
| actually fixing the reliability problems is pretty shocking.
| I understand it's a bit of a chicken-and-egg thing where you
| maybe are tight on resources but there's no scenario where
| it's acceptable to have a product with these kinds of
| problems and then raise prices on existing customers who are
| putting up with it.
| encom wrote:
| No /s is needed. Relying on any Google product long term is
| crazy.
| sofixa wrote:
| Google's b2b products are relatively stable (relative to
| their b2c free services). You generally get somewhere
| like a year of notice if they shut it down.
| pajeetz wrote:
| there's just so many anecdotes/nightmare stories from people
| using fly.io here much more than the ones linked by GP
|
| expect to see more of these "post-mortem apologies" from
| fly.io in the future because it won't be the last
| tptacek wrote:
| You're right. It won't. Nobody could claim otherwise.
| pajeetz wrote:
| fly.io has a very bad reputation for reliability. there doesn't
| seem to be any damage control beyond hackernews and even here
| the consensus seems to be "dont run anything mission critical
| on fly.io or expect data redundancy"
|
| in fact, you can almost get the same thing fly.io does by
| running firecracker on your own bare metal servers and cheaper
| too.
|
| I'm afraid the public sentiment towards fly.io has been tainted
| for good (I can't count how many times they apologized now).
| tptacek wrote:
| This is the second place you've offered this sentiment. Was
| it your expectation that we were going to hit some point,
| sometime in the near future, where we weren't going to have
| deployment-blocking outages? I'd like to better understand
| your premise. If it's "I can get more reliability by
| deploying on a hyperscaler cloud", who ever told you
| otherwise?
| ARCarr wrote:
| I tried out Fly.io and deployed a little test app. I couldn't
| even access the app, because they put it onto a server that was
| under "emergency maintenance" and had been that way for twelve
| days.
| mattbee wrote:
| It feels like fly is trying to repeat a growth model that worked
| 20 years ago: throw interesting toys at engineers, then wait for
| engineers to recommend their services as they move on in their
| careers.
|
| Part of that playbook is the old Move Fast & Break Things. That
| can still be the right call for young projects, but it has two
| big problems:
|
| 1) AWS successfully moved themselves into the position of "safe"
| hosting choice, so it's much rarer for engineers to have
| influence on something that's seen by money men as a humdrum,
| solved problem;
|
| 2) engineers are not the internal influencers they used to be,
| being laid off left and right the last few years, and without
| time for hobby projects.
|
| (maybe also 3) it's much harder to build a useful free tier on a
| hosting service, which used to be a necessary marketing expense
| to reach those engineers).
|
| So idk, I feel like the bar is just higher for hosting stability
| than it used to be, and novelty is a much harder sell, even here.
| Or rather: if you're going to brag about reinventing so many
| wheels, they need to not come off the cart as often.
| travisgriggs wrote:
| Don't a bunch of Elixir/Erlang guys work at fly.io? It's weird to
| me that that hallmark of reliability is associated with something
| that the public sees as unreliable. What gives with that
| association?
| Huppie wrote:
| It's interesting to see this discussion about fly.io's
| reliability on a day that (after _over three days of downtime_)
| Microsoft Azure finally decided the update of Azure Static Web
| Apps they deployed last Friday is indeed broken for customers
| using specific authentication settings...
|
| ...with not a single status update from Microsoft in sight.
___________________________________________________________________
(page generated 2024-11-26 23:01 UTC)