[HN Gopher] Discovering Azure's unannounced breaking change with...
___________________________________________________________________
Discovering Azure's unannounced breaking change with Cosmos DB
Author : jmartens
Score : 99 points
Date : 2022-10-13 18:55 UTC (4 hours ago)
(HTM) web link (metrist.io)
(TXT) w3m dump (metrist.io)
| rroot wrote:
| girvo wrote:
| andrewstuart2 wrote:
| What are you doing here if you're not going to RTFA? The fifth
| paragraph pretty clearly describes the issue before they go
| into depth on how they determined that Azure did indeed publish
| a backwards-incompatible change without notice.
| speedgoose wrote:
| I believe it is easy for well-made software to immediately
| detect and report what goes wrong, with Sentry, ELK, or whatever
| else.
|
| So, let's say I'm woken up in the middle of the night because my
| black box database as a service suddenly returns errors. If I'm
| not incompetent, I should have error messages and stacktraces
| available in a few seconds. If I'm a rich cloud customer, I can
| call the premium cloud support and ask for an explanation. If
| not, I would probably have to debug it myself.
|
| With your service, I understand that I can blame the cloud
| provider faster. Maybe it can make the debugging session slightly
| faster when your monitoring also returns errors. End users don't
| care whether it's my code or the cloud provider code crashing, so
| it's a developer tool for emergencies. Did I understand correctly?
| jmartens wrote:
| You got it right, it's a developer tool. It's not hard to get
| alerted about an issue, or to suspect a cloud dependency.
| Verifying it, which is typically required to take action, is
| what can take 10-30 minutes.
| dec0dedab0de wrote:
| This seems like an accident. Microsoft should treat it as a bug,
| and set the default on their backend to fix it.
| dharmab wrote:
| Back around 2017-2018 unannounced breaking changes in Azure
| services were so common, my team coined a term "Cloud Monday"
| (echoing Patch Tuesday) because usually our integration tests
| would break between 8-10AM Pacific Time on Mondays. (They did
| eventually become far less frequent.)
| TurkTurkleton wrote:
| > my team coined a term "Cloud Monday"
|
| Azure being a shade of blue, you should've called it "Blue
| Monday"[0]. Could've even rigged up something to play the song
| when integration tests mysteriously failed. _How does it feel /
| to treat me like you do?/ When you've laid your hands upon me/
| and told me who you are?_...
|
| [0]: https://www.youtube.com/watch?v=c1GxjzHm5us
| hupt wrote:
| Cosmos was originally created for hosting massive datasets
| internally within Microsoft. For example they use it for the OS
| telemetry sent in from customer machines, and raw data for threat
| intelligence. As part of Microsoft's move of everything hosted
| on-premise to their cloud, they decided to open up Cosmos to
| other users of Azure. But the primary customer is and will likely
| always be Microsoft themselves. Which is probably why we see
| these breaking changes, it'll be in response to some internal
| ticket most likely.
| CurtHagenlocher wrote:
| CosmosDB is not the same as the internal Cosmos system.
| int0x2e wrote:
| Cosmos != CosmosDB.
|
| The two have nothing in common (and trust me, it sure is fun
| having to constantly make sure which of the two someone is
| actually referring to every time...).
| prepend wrote:
| This reinforces my idea that no one uses Cosmos because it is
| utter garbage.
|
| It sounds cool, but I was surprised when after what I think
| should be the worst and dumbest security design flaw breach [0]
| there wasn't much uproar.
|
| I thought maybe no one is using it so there wasn't much impact.
|
| Pushing out breaking changes without telling your customers also
| gets explained by there not being any (or many since these folks
| found it) users.
|
| Could you imagine how big of a deal it would be if a breaking
| change or elevated privs bug were in actually used products?
|
| [0]
| https://www.techtarget.com/searchsecurity/news/252505973/Res...
| DishyDev wrote:
| As someone whose job involves maintaining uptime of a critical
| system that's dependent on Cosmos DB this sort of thing is scary.
| Where there have been other reliability issues with Cosmos before
| we've not had an understanding customer base, and it feels very
| out of my control.
|
| I'm finding a lot of the reliability guarantees of Azure PaaS
| services are overblown or come with big caveats when you start to
| work with them in a serious way. For example I've had some bad
| reliability issues with Azure Functions not firing, or their
| premium function runtimes becoming unresponsive. And it seems
| like that's just the start of the outstanding issues with them
| https://github.com/Azure/azure-functions-host/issues
|
| I think people need to look more carefully at these PaaS
| guarantees and look at what that 99.999% reliability Microsoft
| are claiming actually means.
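
For reference, an availability percentage translates into a downtime budget with simple arithmetic. A minimal sketch, assuming a 365-day year and that the percentage is measured over the whole period (the function name is illustrative; the provider's actual SLA terms, measurement windows, and exclusions are not shown here):

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Return the downtime budget in minutes for a given availability percentage."""
    total_minutes = days * 24 * 60  # minutes in the measurement period
    return total_minutes * (1 - availability_percent / 100)


# Compare a few common "nines" over an assumed 365-day year.
for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes per year")
```

Under these assumptions, five nines leaves a budget of roughly five minutes per year, so a single incident of the kind described in this thread could consume it entirely.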
| rrdharan wrote:
| Even as bad as their reliability issues are, I'd still be more
| worried about their security issues:
|
| https://www.wiz.io/blog/chaosdb-explained-azures-cosmos-db-v...
| https://msrc-blog.microsoft.com/2021/08/27/update-on-vulnera...
| xiwenc wrote:
| Do you know what blew me away? When Azure executes maintenance
| on, for instance, PostgreSQL servers, there is no record of that
| activity in the activity logs or anything to note in service
| health. The service was unavailable during the maintenance.
| Worse yet, when the database is unusable due to an incident (the
| CPU is maxed out and it doesn't allow any successful
| connection), nothing is detected.
|
| How can this be a premium IaaS/PaaS? Azure feels like the MS
| Teams of teleconferencing. Companies buy in because they are
| already in the MS world, not because Azure is better.
| nobodyandproud wrote:
| For new projects, why wouldn't anyone use postgres?
| semicolon_storm wrote:
| It's a lot easier to sell Microsoft products to management when
| working at a Microsoft shop.
| jen20 wrote:
| One reason is wanting zero-downtime failover.
| johndfsgdgdfg wrote:
| Just be glad that they didn't shut down the service unannounced
| like Google.
| pb7 wrote:
| Reading through your past comments, it's clear that you have a
| strong dislike of Google[0] and a history of reactionary
| comments lacking both substance and clarification when
| challenged[1,2,3,4,5,6,7]. If you're not going to post anything
| worthwhile, perhaps it's best for you to skip over posts about
| Google since it's clear you have an axe to grind and nothing
| more.
|
| >HN used to be a place for interesting discussions. Now it's a
| grievance forum for entitled freeloaders.[8]
|
| Be the change you seek.
|
| [0] https://news.ycombinator.com/item?id=33120431
| [1] https://news.ycombinator.com/item?id=33183900
| [2] https://news.ycombinator.com/item?id=33158451
| [3] https://news.ycombinator.com/item?id=33102921
| [4] https://news.ycombinator.com/item?id=33102794
| [5] https://news.ycombinator.com/item?id=33102761
| [6] https://news.ycombinator.com/item?id=32937987
| [7] https://news.ycombinator.com/item?id=32868992
| [8] https://news.ycombinator.com/item?id=32657508
| metadat wrote:
| Thanks for pointing this out. As a self-admitted Google
| disliker, I would prefer to at least see more variation in
| the rhetoric. The same message spouted over and over makes
| for extremely dull reading.
| NicoJuicy wrote:
| Which service did Google shut down unannounced in their cloud
| offering?
| metadat wrote:
| Stadia, their cloud gaming platform. It was cancelled without
| warning two weeks ago:
|
| https://news.ycombinator.com/item?id=33022768
| dagss wrote:
| This rhymes with my overall impression of Cosmos. It took us a
| while to see through the smokescreen because when talking to
| Microsoft support and representatives it is the Best Thing Ever
| and they sound so confident about it. But it really is a beta
| demo product sold with an alpha premium price tag.
|
| If your traffic pattern is exactly right, and you always scale
| traffic up and never ever down and do not have spikes, I guess it
| is probably OK. The main problem is the docs are (or, at least
| were 2 years ago) not clear about all the caveats and
| restrictions but pretend it is a generic database that just
| works. So one has to discover all the caveats oneself.
|
| Microsoft thinks the exact workings of the partitioning are
| something that should work so well you don't need to know them in
| detail. But if your use case is slightly off, you end up really
| needing to know. I know at least one team who routinely copy all
| their data from one Cosmos instance to another and switch over
| traffic to the copy just to get a partitioning reset; it is one
| thing to have to do it, another to discover in production
| yourself that it has to be done, with no prior warning.
|
| Also: The ipython+portal+Cosmos security meltdown from 1 1/2
| years ago alone should be reason to look elsewhere.
|
| (No, not a competitor, just have spent way way way too much
| engineering time moving first on and then off Cosmos and yes I am
| bitter)
| VWWHFSfQ wrote:
| After suffering through the AWS SimpleDB disaster 10 years ago
| I will never use any of the cloud providers' harebrained
| databases ever again. I'll use bog-standard Postgres or MySQL
| if they host it but nothing else.
| nobodyandproud wrote:
| Is this analogous to NTFS?
|
| For you young uns, back in the 1990s Microsoft was so convinced
| that NTFS made file fragmentation impossible that they didn't
| provide a way to defrag for a very long time.
| MrBuddyCasino wrote:
| I used it a few months ago, it still is a half-baked piece of
| shit. Code quality of client libs even worse than AWS.
|
| It could have been easy. We could have used Postgres.
| PaulWaldman wrote:
| >(No, not a competitor, just have spent way way way too much
| engineering time moving first on and then off Cosmos and yes I
| am bitter)
|
| Can you share what you migrated onto and the results?
| jmartens wrote:
| We used our own product to learn about and debug the issue. It's
| rather wild that they'd roll out this change so incrementally,
| which my colleague outlines here.
| Scaevolus wrote:
| Gradual rollouts are pretty typical to give the team a chance
| to do a rollback before they cause a complete outage. This
| particular usage pattern probably just didn't appear as a
| significant enough spike in error rates.
| jmartens wrote:
| Ya, that makes sense, it really isn't a normal use-case. I
| wish we kept tracking the other regions to see if they have
| had this change roll out to them yet, or if it's still slow
| rolling.
| twodave wrote:
| Funny, I was just last week having an argument with one of our
| team leads. I'd told him to create a specific container without a
| partition key (which I wouldn't recommend except in certain
| circumstances), and he said he couldn't. I assumed he was just
| doing it wrong.
| int0x2e wrote:
| In a document store, what does it even mean to create a
| container without a partition key? The document store has to
| partition the data somehow, and doing so implicitly sounds
| dangerous to me since all you're doing is creating a hotspot on
| one of the partitions...
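
The hotspot concern above can be illustrated with a generic hash-partitioning sketch. This is an assumption-laden toy, not Cosmos DB's actual partitioning algorithm: the function names, the SHA-256 hash, and the fixed count of 8 partitions are all hypothetical choices made for illustration.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index by hashing it (toy scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def distribution(keys, num_partitions: int) -> Counter:
    """Count how many documents land on each partition."""
    return Counter(partition_for(k, num_partitions) for k in keys)


# Many distinct keys spread across partitions.
varied = distribution((f"user-{i}" for i in range(10_000)), 8)

# One shared (or implicit) key sends every document to the same partition.
constant = distribution(("same-key" for _ in range(10_000)), 8)

print("varied keys:  ", sorted(varied.items()))
print("constant key: ", sorted(constant.items()))
```

With a single effective key, all 10,000 documents map to one partition regardless of how many partitions exist, which is the hotspot being described.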
| whalesalad wrote:
| This is very typical Microsoft behavior, unfortunately.
___________________________________________________________________
(page generated 2022-10-13 23:00 UTC)