[HN Gopher] LedgerStore Supports Trillions of Indexes at Uber
       ___________________________________________________________________
        
       LedgerStore Supports Trillions of Indexes at Uber
        
       Author : imnot404
       Score  : 89 points
       Date   : 2024-06-07 14:55 UTC (8 hours ago)
        
 (HTM) web link (www.uber.com)
 (TXT) w3m dump (www.uber.com)
        
       | dmoy wrote:
       | That was a pretty cool read, thanks for posting. It always sends
       | prickly/danger signals to my brain when I see "reads also write
       | stuff", but this was laid out pretty clearly. Dissolved that
       | prickly feeling pretty quickly haha.
        
       | candiddevmike wrote:
       | > From a business impact perspective, operating LedgerStore is
       | now very cost-effective due to reduced spend on DynamoDB. The
       | estimated yearly savings are over $6 million per year.
       | 
       | I'd like to see this broken out more. You replaced a managed
       | service with a custom DB on top of MySQL. Surely there is a
       | significant amount of engineering effort that was/is being spent
       | to create/maintain this.
        
         | stackskipton wrote:
         | Yeah, $6 million is about 15 engineers at Uber's average
         | total comp.
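Here is a back-of-envelope check of that comparison. It takes the thread's numbers at face value ($6M a year, roughly 15 engineers). The per-engineer figure is derived from those two numbers and is not an official Uber compensation figure.

```python
# What average total comp does "$6M ~ 15 engineers" imply?
# Both inputs come from the comments; neither is a published Uber number.
yearly_savings_usd = 6_000_000
engineers = 15

implied_comp_per_engineer = yearly_savings_usd / engineers
print(f"Implied average total comp: ${implied_comp_per_engineer:,.0f}")
```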
        
         | BiteCode_dev wrote:
         | Managed services still require engineering efforts. I have
         | never seen a transition to a cloud service that actually didn't
         | end up requiring a new hire.
         | 
         | It's the dirty little secret of our industry.
        
           | echelon wrote:
           | This is the biggest nasty secret about cloud (and SaaS for
           | things you don't build in-house).
           | 
           | When my last company transitioned to cloud, all of the on-
           | prem folks had to additionally take on that responsibility.
           | We then had to hire a bunch of new folks to ease the
           | transition and manage the cloud pieces. And then there are
           | new teams spun up for permissions and security. Your
           | headcount requirements are not going to go down because of
           | cloud.
           | 
           | As for those that like to call DIY systems "weirdware":
           | homegrown systems are expensive to engineer, but cheap to
           | run. We 10x'd our cost going to 3rd party visibility and
           | feature flagging versus the thinly staffed [1] homegrown
           | systems. And those 3rd party systems also required a ton of
           | engineering to support, the migrations weren't clean, and
           | they missed a lot of key features that we had to upstream to
           | the third parties!
           | 
           | When SignalFx got bought out by Splunk, our annual plan got
           | rewired more hastily than any I've ever seen. Top priority
           | was moving to a new vendor, and that impacted every single
           | team in the company.
           | 
           | The only real tangible benefit to cloud that I've seen is
           | that a team can instantly provision hardware, database, etc.
           | resources without much planning. That's it. And is that worth
           | it? (I don't really think so.)
           | 
           | [1] Once built, half an engineering headcount per quarter to
           | accommodate new features. Oncall rotation for one team that
           | owned the systems. Our stuff was high resiliency and barely
           | paged at all. No SEV1 or worse outages ever.
        
             | deanCommie wrote:
             | What a short-sighted outlook.
             | 
             | 1) "homegrown systems are expensive to engineer, but cheap
             | to run" misses the most important part - how much does it
             | cost to maintain, and how much of a RISK is it?
             | 
             | Homegrown probably means you're depending on tribal
             | knowledge from a few core people who if they leave you're
             | hosed. That's a risk. You also have to train EVERYONE you
             | hire to learn the homegrown systems, instead of hiring
             | people with externally transferrable skills that can come
             | in and hit the ground running.
             | 
             | 2) "all of the on-prem folks had to additionally take on
             | that responsibility"
             | 
             | Most transitions/migrations end up with a period
             | where you have both systems running. It's not surprising
             | that at first everyone has additional responsibilities. It
             | sounds like the "on-prem folks" either didn't have the
             | skills to work in the cloud or refused to learn, and needed
             | external hires. Well, that's one way to put themselves out
             | of a job, and that would be the next step once the
             | homegrown system is deprecated.
             | 
             | Obviously there's a reason why "not invented here" syndrome
             | exists. There's a reason "build vs buy" is a complex
             | discussion because of more than just buy costs. People
             | always prefer the home-grown thing, DIY. Engineers
             | especially. On this site, particularly. And, also, plenty
             | of successful businesses exist by cutting out layers and
             | going with a shallower stack ("your margin is my
             | opportunity").
             | 
             | But at the end of the day, for the vast majority of
             | businesses, companies, and software shops, managing their
             | own in-house infrastructure is a worse decision than paying
             | a cloud vendor.
        
               | aledalgrande wrote:
               | I see both points but I have to agree with this one,
               | especially if we're talking small companies. A startup
               | should outsource everything that is outside their core
               | product (but not more than that), so they can focus on
               | making their product top class.
        
               | jcelerier wrote:
               | > Homegrown probably means you're depending on tribal
               | knowledge from a few core people who if they leave you're
               | hosed.
               | 
               | there's so much complexity and tribal knowledge with
               | cloud deployments that if tomorrow our cloud experts
               | leave we're also very definitely hosed too, despite
               | everything being documented thoroughly. I'm involved in a
               | product that leverages a cloud-based metaverse system
               | (Mozilla Hubs) that recently had organisational changes
               | requiring us to change our hosting approach, and it's
               | taking the better part of a year of work to understand
               | its logic, for something that would have been a non-issue
               | if homegrown & self-hosted.
        
             | pphysch wrote:
             | > This is the biggest nasty secret about cloud (and SaaS
             | for things you don't build in-house).
             | 
             | Yep. Huge amounts of marketing funds are spent tricking
             | decision-makers into thinking that off-the-shelf, plug-and-
             | play solutions exist for their idiosyncratic business
             | problems and schemas.
             | 
             | Sometimes such products exist, but they are so bloated with
             | features to support the other 99% of customers. So you
             | absolutely need at least one expert to unwind the
             | complexity.
             | 
             | It's known that "greenfield" software development is much
             | easier than "brownfield"/migrations. It should also be
             | widely known that brownfield SaaS integrations are
             | troublesome, because they involve enormous software+data
             | work to link the existing in-house interfaces to the third-
             | party interfaces. As a rule of thumb, there is no such
             | thing as "off-the-shelf". You WILL have to hire and build
             | and deeply understand your business processes.
        
               | dehrmann wrote:
               | This is what people tell themselves, but 99% of the time,
               | what your business does and the scale it operates at
               | aren't special.
        
               | adamisom wrote:
               | Special in the sense of scale does not equate to special
               | in the sense of adapting to your business'
               | needs/culture/sensemaking
        
               | lazide wrote:
               | They don't have to be special to be bespoke.
               | 
               | And almost all business integrations are highly bespoke.
        
             | spamizbad wrote:
             | I love the term "weirdware": I've seen systems like that
             | built (and used!) first-hand multiple times in my career.
             | 
             | They can sometimes be very odd force multipliers in
             | unexpected ways.
             | 
             | My favorite example is a home-rolled payroll application
             | built specifically for salespeople over a decade ago. When
             | I first arrived at the org I thought they were absolutely
             | insane to have created such a thing. But its killer feature
             | was that it allowed our org to pay out commissions rapidly
             | (same day) and also allowed negotiating on a per-account
             | residual commission basis as well as giving management the
             | option to buy out residuals. Off-the-shelf solutions at
             | that time kinda-sorta did this but required massive
             | upfront $$$ and a ton of integration work. A plucky startup
             | couldn't afford it. This was built, tested, and rolled out
             | in 45 days by 2 engineers and basically allowed them to
             | poach top salespeople because they knew they could get
             | paid faster and had more flexibility on residuals.
        
               | davedx wrote:
               | Ha I built an internal app just like that for field sales
               | at a mid sized company. One of the best projects I ever
               | worked on.
        
               | throwup238 wrote:
               | That's the tradeoff everyone makes when they say
               | something isn't in their "core competency" - they give up
               | any real competitive advantage that area of expertise
               | could bring them. That often makes total sense like with
               | accounting or HR, but most companies take it way too far
               | in engineering and internal support software.
               | 
               | Engineers make the same mistake all the time too: we love
               | to say that "premature optimization is the root of all
               | evil", ignoring that the full quote is "We should forget
               | about small efficiencies, say about 97% of the time:
               | premature optimization is the root of all evil. _Yet we
               | should not pass up our opportunities in that critical 3%_."
        
             | throwup238 wrote:
             | _> The only real tangible benefit to cloud that I've seen
             | is that a team can instantly provision hardware, database,
             | etc. resources without much planning. That's it. And is
             | that worth it? (I don't really think so.)_
             | 
             | Yes! Maybe not for you or me, but it's definitely worth it
             | to tons of organizations where everyone including
             | engineering was beholden to old school IT departments that
             | take months to spin up a VM after pages of bureaucratic
             | back and forth. The cloud moves it from an IT issue to
             | department budgeting.
             | 
             | The other advantage of cloud was moving capex to opex which
             | the finance people liked because *hand wave* something
             | about amortization. Those are the two reasons most
             | established companies moved to the cloud. It had little to
             | do with the actual cost but how it was spent and who was in
             | charge of doing it.
        
               | 0cf8612b2e1e wrote:
               | > where everyone including engineering was beholden to old
               | school IT departments that take months to spin up a VM
               | after pages of bureaucratic back and forth. The cloud
               | moves it from an IT issue to department budgeting.
               | 
               | My F500 company put a stop to that. There is new
               | paperwork in place to requisition cloud resources. The
               | bureaucracy will not be replaced.
        
               | ndriscoll wrote:
               | Even the startup I worked at in my last job didn't let
               | people just spin up whatever resources they felt like in
               | AWS. Changes had to go through architecture/security
               | review, and I might have been the only SDE who was
               | allowed to log into AWS at all. We were subject to SOC 2,
               | but I imagine it's relatively common for SaaS companies
               | to have some kind of compliance framework assuming they
               | want F500s as customers?
        
               | dopylitty wrote:
               | Having seen the results of everyone being able to spin up
               | a VM (or an EMR cluster, or an EKS cluster) I can say a
               | little more bureaucracy is probably a good thing.
               | 
               | You end up wasting huge amounts of money on resources
               | that are just sitting around idle (no, your fancy
               | automation to shut them down won't work because maybe
               | that dev cluster is actually needed at 2am on Sunday.
               | It's an org problem, not a technical problem).
               | 
               | But more importantly you give the cloud providers an
               | excuse to destroy another square mile of pristine forest
               | and build a giant 4 story grey box that wastes enormous
               | amounts of already limited water and energy.
        
             | sneak wrote:
             | > _Your headcount requirements are not going to go down
             | because of cloud._
             | 
             | Depends on how big you are. I'm one person and I do a lot
             | of things with the cloud that I could not do if I had to
             | rack servers (or even order dedicateds).
             | 
             | Same goes for small 5-10 person teams that are good with
             | Terraform. I've seen some orgs punching waaaay above their
             | weight given their size. Not possible without classic IaaS.
        
           | roncesvalles wrote:
           | Eh, sure but it's not the same person. There are far more
           | people who can "do cloud" than those who genuinely understand
           | distributed systems well enough to build and operate a
           | homegrown data infrastructure.
        
             | makmanalp wrote:
             | True, but anyone who's a cheaper hire because they can "do
             | cloud" but not homegrown also represents a sacrifice in
             | terms of what you get. In terms of your operational
             | capabilities, you'll likely be paying for that with a
             | crisis first where everyone yells at your team because of a
             | problem in your managed provider you can't fix, and you
             | can't get someone from the provider on the line and helpful
             | in a way that gets you out of it faster. So then you'll
             | tack on a massive yearly support retainer for a
             | response time SLA that's still bad for your business bottom
             | line and support that's not incentivized to prioritize you.
             | I have stories from name-brand PaaS/IaaS companies.
             | 
             | It's like outsourcing (I don't mean offshore, I mean any
             | kind). It can be a reasonable thing to do if you don't
             | pretend you're getting the same thing for less money and
             | really actually plan around that. But people often do
             | pretend it's the same, and the results aren't immediately
             | apparent. And when they're finally blatantly apparent,
             | either the guilty party has cashed out and left, or people
             | don't have enough perspective to point to the real root
             | cause.
             | 
             | I'm not saying building your own infra is generally a good
             | idea of course. It's just that beyond a certain size, for
             | more businesses than you'd think, reliability has to be a
             | core competency you invest in. And homegrown is a lot
             | easier than it used to be because the offerings in terms of
             | tooling and platforms are way better such that you can
             | strategically craft your level of managedness for optimum
             | cost/benefit on multiple criteria.
        
           | 999900000999 wrote:
           | It depends on your scale.
           | 
           | As a solo dev just being able to point my apps to firebase
           | and not worry about it saves a ridiculous amount of time.
           | 
           | At Uber's scale they probably have to tweak Dynamo and other
           | services so much they might as well bring em in house.
           | 
           | I'm a bit surprised we haven't seen a new AWS, something like
           | Walmart Web Services.
        
             | wongarsu wrote:
             | On the other hand as a solo dev `sudo apt-get install
             | postgresql` plus a short tweak of your pg_hba.conf and a
             | quick script to run pg_dump every night is about as much
             | work as setting up a cloud service.
             | 
             | Managed databases provide a lot of useful features,
             | especially if shit hits the fan. I'm not saying you
             | shouldn't use them. But the work you put into reasonable
             | self-hosted services and reasonable managed services often
             | scale at a surprisingly similar pace.
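Below is a minimal sketch of the "quick script to run pg_dump every night" mentioned above. It makes several assumptions. The database name `appdb` and the backup directory are hypothetical. `pg_dump` must be on the PATH. Scheduling is left to cron: for example, a nightly entry that calls `run_nightly_backup()`. `-Fc` (custom archive format, restorable with `pg_restore`) and `-f` (output file) are standard pg_dump options.

```python
import datetime
import pathlib
import subprocess

# Hypothetical values; adjust for your setup.
DB_NAME = "appdb"
BACKUP_DIR = pathlib.Path("/var/backups/postgres")


def build_pg_dump_cmd(db_name: str, backup_dir: pathlib.Path, day: datetime.date) -> list[str]:
    """Return the pg_dump command for one dated dump file."""
    outfile = backup_dir / f"{db_name}-{day.isoformat()}.dump"
    # -Fc: custom archive format (compressed, restorable with pg_restore)
    # -f:  write to the given file instead of stdout
    return ["pg_dump", "-Fc", "-f", str(outfile), db_name]


def run_nightly_backup() -> None:
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    cmd = build_pg_dump_cmd(DB_NAME, BACKUP_DIR, datetime.date.today())
    # check=True raises CalledProcessError if pg_dump exits non-zero, so a
    # failed night shows up (e.g. in cron's mailed output) instead of passing silently.
    subprocess.run(cmd, check=True)
```

Dated file names keep one dump per day. Pruning old dumps is left out to keep the sketch short.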
        
               | 999900000999 wrote:
               | Alright, what about Auth or other features Firebase has
               | built in?
               | 
               | I still have to host it somewhere.
        
               | BiteCode_dev wrote:
               | Auth is included in your stack, e.g: django comes with
               | it.
               | 
               | Plus you own the accounts, so you are not locked in.
               | 
               | Sync is the Firebase killer feature, but most apps can live
               | without it.
        
               | 999900000999 wrote:
               | For a side project it's not that serious.
               | 
               | Firebase is ready to go without me thinking of the
               | details.
               | 
               | If I'm using Flutter or React, I already have a nice
               | client side sdk to use.
               | 
               | The 12th of never when I need to scale it I can switch to
               | a different stack. Firebase functions allows me to do
               | some server side data manipulation.
               | 
               | That said, if I was at work and needed to propose
               | something I'd probably spend more time thinking of this.
               | But for all my side projects Firebase is my go-to.
               | 
               | Let's just say we're having a competition to see who can
               | crank out a basic crud app first. You're not beating me
               | with Flutter and Firebase.
        
               | faddypaddy34 wrote:
               | Then if you are successful one day you can cry on social
               | media about your large surprise bills.
        
           | rco8786 wrote:
           | Bingo. You still need your whole Ops and tooling teams. They
           | just become Cloud Ops and Cloud Tooling.
        
           | devjab wrote:
           | You wish! In my experience they just load all the cloud work
           | onto developers. Like, I'm setting up virtual networks,
           | private endpoints, firewalls and whatnot, because if I don't,
           | there won't be horrible security, there'll be no security at
           | all. (I basically knew nothing about networking before I did
           | it, and I still don't know if anything we do is correct in
           | any manner.)
           | 
           | It'll bite the company in the ass eventually, or maybe it
           | won't, but it is what it is.
        
           | diob wrote:
           | I don't think it's a secret, I think the idea is it will
           | allow you to be more resilient than a custom solution. Why?
           | If the person(s) who made the custom solution leave, you
           | can't as easily hire a replacement. What's more, hiring
           | immediately effective folks becomes harder, even if those
           | folks who made it don't leave.
           | 
           | It's a tradeoff, but I don't think folks are keeping secrets
           | around it.
        
         | stefan_ wrote:
         | When did this profession turn into a bunch of people that never
         | want to build anything ever again?
         | 
         | I guess I finally understand what they meant by "no one ever
         | got fired buying IBM".
        
         | gigatexal wrote:
         | Relative to Uber's total spend, is $6M really a worthwhile
         | chunk of change?
         | 
         | I mean I'm here for the tech of course just curious from the
         | cost benefit part.
        
           | bastawhiz wrote:
           | Six million dollars is six million dollars. Now, if that's a
           | team of eight delivering that value, yeah, it's a great deal.
           | If it's an org of 100 people, 6M is much less exciting. I'd
           | suspect it's the former.
        
       | jpollock wrote:
       | I'm confused, why does the Trip Service talk to the issuer to
       | place a hold? I would have expected the Payment Service to
       | mediate all that. Particularly since "Place Hold" is another way
       | of saying "Authorize Payment".
       | 
       | Perhaps the lines are wrong?
        
         | mschuster91 wrote:
         | > Particularly since "Place Hold" is another way of saying
         | "Authorize Payment".
         | 
         | As the article explains, _it's not_, at least from the
         | merchant's side (from the customer's side it is). A CC hold
         | tells the bank "hey guys, make sure that there will be at least
         | X dollars available when we call in the hold", and the bank can
         | either respond with "yes, hold confirmed" or "nope, declined"
         | (say, due to a lack of funds, wrong CVV, whatever). When the
         | hold is confirmed, and the customer makes another transaction
         | at another vendor that would cause the amount of open holds +
         | executed but not paid-off transactions to go above your limit,
         | that transaction gets blocked.
         | 
         | And eventually, only once you as the merchant call in the hold,
         | you will actually initiate the flow of money. Or you call in
         | less than the held amount, and you'll only get that amount of
         | money (usually seen in hotels and car rentals, where the hold
         | can be significantly larger than the bill, to account for stuff
         | like minibar expenses, room cleaning or damages to the
         | vehicle). Or you release the hold, or it expires, and you get
         | no money at all, and all you can do is try to create a new
         | transaction and hope it gets approved.
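Here is a toy model of the hold lifecycle described above, seen from the cardholder's available-credit side. It is a simplification, not any card network's API. The class and method names are hypothetical, and amounts are integer cents. It also ignores expiry timers and over-capture, which the replies point out some flows allow.

```python
from dataclasses import dataclass, field


@dataclass
class CardAccount:
    """Toy cardholder account; all amounts are integer cents."""

    credit_limit: int
    posted_balance: int = 0                              # captured charges
    holds: dict[str, int] = field(default_factory=dict)  # hold id -> held amount

    def available(self) -> int:
        # Open holds count against the limit just like posted charges do.
        return self.credit_limit - self.posted_balance - sum(self.holds.values())

    def place_hold(self, hold_id: str, amount: int) -> bool:
        """Authorize: reserve the funds, or decline if they aren't available."""
        if amount > self.available():
            return False  # "nope, declined"
        self.holds[hold_id] = amount
        return True  # "yes, hold confirmed"

    def capture(self, hold_id: str, amount: int) -> None:
        """Call in the hold for up to the held amount; any remainder is freed."""
        held = self.holds[hold_id]
        if amount > held:
            raise ValueError("over-capture is not modeled here")
        del self.holds[hold_id]
        self.posted_balance += amount

    def release(self, hold_id: str) -> None:
        """Release the hold (or let it expire): no money moves."""
        self.holds.pop(hold_id, None)


acct = CardAccount(credit_limit=100_000)           # $1,000 limit
assert acct.place_hold("hotel", 60_000)            # $600 hold at check-in
assert not acct.place_hold("other-shop", 50_000)   # blocked: only $400 left
acct.capture("hotel", 45_000)                      # final bill is $450
print(acct.available())  # 55000
```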
        
           | ComputerGuru wrote:
           | But that's what auth + capture _does_, though.
        
           | jpollock wrote:
           | Authorize Payment results in the Hold. The diagram is
           | representing a standard Auth + Capture credit card flow.
           | 
           | Capture contains the final transaction amount and can either
           | be higher or lower than the Authorized amount.
           | 
           | The odd thing is the Trip Service shouldn't know anything
           | about the Issuer. That's the Payment Service's reason to
           | exist (so Trip Service doesn't need to know about different
           | payment methods). That the Trip Service knows how to talk to
           | Visa (and Amex, and MasterCard, and Discover, and Google, and
           | Apple) about Auth'ing a card, but doesn't know how to Capture
           | the payment is the strange part.
           | 
           | My expectation is that the diagram is incorrect, and the
           | TripService talks to the PaymentService to Auth the card.
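Below is a minimal sketch of the layering this comment expects. Only the payment service knows about issuers, and the trip service just asks it to authorize and later capture. All names here (`TripService`, `PaymentService`, `IssuerClient`, `FakeIssuer`, the method ids) are hypothetical illustrations. They are not taken from Uber's article or from any real payment API.

```python
from typing import Protocol


class IssuerClient(Protocol):
    """What the payment service needs from each card-network or wallet adapter."""

    def authorize(self, card_token: str, amount_cents: int) -> str: ...
    def capture(self, auth_id: str, amount_cents: int) -> None: ...


class FakeIssuer:
    """Stand-in adapter that records captures instead of calling a real network."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.captured: dict[str, int] = {}

    def authorize(self, card_token: str, amount_cents: int) -> str:
        return f"{self.name}-auth-{card_token}"  # pretend the issuer approved

    def capture(self, auth_id: str, amount_cents: int) -> None:
        self.captured[auth_id] = amount_cents


class PaymentService:
    """The only component that knows which issuer handles which payment method."""

    def __init__(self, issuers: dict[str, IssuerClient]) -> None:
        self._issuers = issuers                         # network -> adapter
        self._methods: dict[str, tuple[str, str]] = {}  # method id -> (network, token)
        self._auths: dict[str, str] = {}                # open auth id -> network

    def add_payment_method(self, method_id: str, network: str, card_token: str) -> None:
        self._methods[method_id] = (network, card_token)

    def authorize(self, method_id: str, amount_cents: int) -> str:
        network, token = self._methods[method_id]
        auth_id = self._issuers[network].authorize(token, amount_cents)
        self._auths[auth_id] = network
        return auth_id

    def capture(self, auth_id: str, amount_cents: int) -> None:
        # The final amount may differ from what was authorized.
        network = self._auths.pop(auth_id)
        self._issuers[network].capture(auth_id, amount_cents)


class TripService:
    """Knows about trips and PaymentService, never about Visa, Amex, etc."""

    def __init__(self, payments: PaymentService) -> None:
        self._payments = payments

    def start_trip(self, method_id: str, estimate_cents: int) -> str:
        return self._payments.authorize(method_id, estimate_cents)

    def end_trip(self, auth_id: str, fare_cents: int) -> None:
        self._payments.capture(auth_id, fare_cents)


visa = FakeIssuer("visa")
payments = PaymentService({"visa": visa})
payments.add_payment_method("rider-42-card", "visa", "tok_abc")
trips = TripService(payments)
auth_id = trips.start_trip("rider-42-card", estimate_cents=2_500)
trips.end_trip(auth_id, fare_cents=2_180)
print(visa.captured)  # {'visa-auth-tok_abc': 2180}
```

With this layering, adding another network means registering another adapter in `PaymentService`. `TripService` does not change.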
        
       | johnrob wrote:
       | I'm a little confused as to what "Index" refers to here. Is it a
       | data structure providing key lookup, or is it a single entry
       | within one of these structures?
        
         | tantalor wrote:
         | It has to be entries. Also confused.
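The code below makes the two readings of "index" concrete. `by_rider` is one index (the structure), and each key-to-record mapping inside it is an entry. It cannot settle what the article's "trillions" counts; the replies' guess of entries is just the more plausible reading. The records and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical ledger records keyed by primary key.
records = {
    "txn-1": {"rider": "r1", "amount_cents": 1200},
    "txn-2": {"rider": "r2", "amount_cents": 800},
    "txn-3": {"rider": "r1", "amount_cents": 450},
}

# One secondary index: rider -> primary keys of that rider's records.
by_rider: dict[str, list[str]] = defaultdict(list)
for pk, rec in records.items():
    by_rider[rec["rider"]].append(pk)

num_indexes = 1                                            # the structure itself
num_entries = sum(len(pks) for pks in by_rider.values())  # one per indexed record
print(num_indexes, num_entries)  # 1 3
```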
        
       | tiffanyh wrote:
       | Is this at all related to the major migration from Postgres to
       | MySQL?
       | 
       | Previously discussed on HN (294 comments):
       | 
       | https://news.ycombinator.com/item?id=12166585
        
         | dmoy wrote:
         | Yea, seems like a continuation of the same fundamental problem?
         | I don't know.
         | 
         | As a commenter there put it - """The actual summary of the
         | article is "The design of Postgres means that updating existing
         | rows is inefficient compared to MySQL"."""
         | 
         | I think this is yet another approach to dealing with the cost
         | of updating rows?
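Here is a deliberately simplified model of the update-cost argument from that earlier Uber post. It counts only the index entries written per UPDATE and rests on two assumptions:

- In Postgres, a non-HOT update stores a new row version at a new physical location, so every index gets a new entry. A HOT update (no indexed column changed, and room on the same page) adds none.
- In InnoDB, secondary indexes store the primary key, so only indexes that include a changed column need rewriting.

Function names and the table are hypothetical. Real costs also involve WAL/redo, vacuum/purge and page layout.

```python
def postgres_index_writes(indexes: list[set[str]], changed: set[str], room_on_page: bool) -> int:
    """Index entries added by one UPDATE in a simplified Postgres model.

    A HOT update (no indexed column changed and room on the same heap page)
    adds no index entries; otherwise the new row version lives at a new
    physical location, so every index gets a new entry pointing at it.
    """
    touches_indexed_col = any(cols & changed for cols in indexes)
    if room_on_page and not touches_indexed_col:
        return 0
    return len(indexes)


def innodb_index_writes(indexes: list[set[str]], changed: set[str]) -> int:
    """Secondary-index entries rewritten by one UPDATE in a simplified InnoDB model.

    Secondary indexes store the primary key rather than a physical address,
    so only indexes containing a changed column need work (assumes the
    primary key itself is not updated).
    """
    return sum(1 for cols in indexes if cols & changed)


# Hypothetical table with five secondary indexes (primary key left out of
# both counts); the update changes only the indexed "status" column.
indexes = [{"rider_id"}, {"driver_id"}, {"city"}, {"created_at"}, {"status"}]
print(postgres_index_writes(indexes, {"status"}, room_on_page=True))  # 5
print(innodb_index_writes(indexes, {"status"}))                       # 1
```

For an update that touches no indexed column, both models write zero index entries (Postgres only if HOT applies). The gap opens up when one indexed column changes on a table with many indexes.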
        
       | ThinkBeat wrote:
       | When you decide to just write your own database to store stuff
       | in, because whatever is out there just doesn't fit your use case,
       | you have most likely failed.
       | 
       | You can then either write a new database that fits the use case,
       | or take a hard look at the use case and change it to fit existing
       | database systems.
       | 
       | Now as a programmer I LOVE the idea of writing a database or an
       | operating system, it would be super leet.
       | 
       | When writing a new database system, you will make a lot of
       | mistakes and introduce a lot of bugs, which you will no doubt
       | hit at inconvenient times and only track down after a lot of
       | debugging and profiling.
       | 
       | This is built on top of MySQL, which has been battle-tested
       | (though it's still not where I would put any critical data).
       | Technically you're not writing the engine, but with so much
       | functionality in two layers above MySQL, you are in essence
       | building an engine for the engine.
        
         | bdcravens wrote:
         | Perhaps but I think most of us have never worked on a problem
         | of Uber's scale. The behaviors of things change spectacularly
         | above a certain level; it's not just the same thing times N.
         | 
         | The implicit assumption of "not invented here syndrome" is that
         | it has been invented elsewhere already, which isn't always
         | true.
        
           | convolvatron wrote:
           | What's missing from the reuse picture for things like
           | operating systems, compilers (somewhat), and databases is
           | systems that use a 'lego' model of components that can be put
           | together into solutions rather than deployable systems (which
           | can clearly be built on top).
           | 
           | This provides room for doing real customization without
           | having to build all the things from ground zero, and is
           | likely to be more robust than trying to attach functionality
           | fully on the outside like was done here.
           | 
           | In the database world it would be really nice to implement
           | caching and sharding extensions inside the transactional
           | envelope.
        
         | jonathan_landy wrote:
         | That all makes sense. On the other hand, next gen workhorses
         | must often arise from people opting to do something new like
         | this, rather than make do with the current gen workhorse.
        
         | SilverBirch wrote:
         | I think this depends on the context. Let's say Uber had a
         | database problem, they hired a bunch of database people, and
         | they came to the conclusion "we just _have_ to write our own"...
         | fine. I think that's actually perfectly legit. What I see much
         | more commonly, though, is people who don't really know much
         | about databases having problems, and rather than go off and
         | hire/consult with experts, they just decide to build their own.
         | That's when you get into real trouble. The dirty truth is that
         | whilst, yes, Uber faces challenges due to its scale, it also
         | faces few real limits on building superfluous shit because of
         | its success.
        
           | makestuff wrote:
           | Yeah this smells like promotion oriented architecture. I have
           | a hard time believing that a logistics company that built
           | DynamoDB and uses it extensively can make it work but Uber
           | cannot.
           | 
           | The 6m in savings does not properly account for things like
           | ramping up new hires on some custom database, maintenance
           | (what happens in 5 years when whatever language you wrote it
           | in needs to be upgraded to a new version, or some
           | dependency), and a host of other things.
           | 
           | Yes the cloud is expensive, but the entire point of it is
           | that you are offloading all of that underlying
           | maintenance/feature work to a team that only does that all
           | day every day and is very good at it.
        
         | dangwu wrote:
         | You're implying that you should never invent new tools. But
         | every tool was a new invention upon its creation. Also, all
         | software is an "engine" for an "engine".
        
         | aeyes wrote:
         | I understand that this LedgerStore thing is just some tables
         | and materialized views in a Docstore database which they
         | already use for everything else.
         | 
         | This seems to be an in-house DynamoDB (sharded MySQL with Raft)
         | which was developed a long time ago; see:
         | https://www.uber.com/en-US/blog/schemaless-sql-database/
         | 
         | But maybe I'm totally wrong here because they also use or used
         | Google Spanner: https://www.uber.com/en-US/blog/building-ubers-fulfillment-p...
        
       | normand1 wrote:
       | Sounds like they rebuilt QLDB?
       | https://docs.aws.amazon.com/qldb/latest/developerguide/what-...
        
       | nikolay wrote:
       | Stop bragging and open-source it!
        
         | stackskipton wrote:
         | No, please don't. Some CTO at some company I work at is going
         | to think they are Uber scale, deploy this and I'll be stuck
         | supporting it.
        
       | ddoolin wrote:
       | TIL that "indexes" is also a valid plural of "index", as well as
       | "indices."
        
       | redwood wrote:
       | Feels like this would have been a good fit for zoned sharding in
       | MongoDB -- hot shards that would age out to cold shards, never
       | having to deal with the lack of strongly consistent secondary
       | indexes or transactions, and not having to deal with a completely
       | different database for the hot and cold environments.
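Here is a rough illustration of the hot/cold tiering idea behind this suggestion. It is written in plain Python rather than against MongoDB's API: documents go to a "hot" or "cold" zone based on the age of a date-based shard key. In MongoDB itself this would be set up with zone ranges on the shard key (e.g. `sh.addShardToZone` and `sh.updateZoneKeyRange` in the shell), and the balancer would migrate chunks. The 90-day window, zone names and function names below are hypothetical.

```python
import datetime

# Hypothetical policy: documents created in the last 90 days are "hot".
HOT_WINDOW = datetime.timedelta(days=90)


def zone_for(created_at: datetime.datetime, now: datetime.datetime) -> str:
    """Pick a zone from the document's creation time (the shard key here)."""
    return "hot" if created_at >= now - HOT_WINDOW else "cold"


def hot_zone_range(now: datetime.datetime) -> tuple[datetime.datetime, datetime.datetime]:
    """The [lower, upper) shard-key range assigned to the hot zone at `now`.

    Recomputing this on a schedule and updating the zone range is what lets
    older data "age out"; the balancer then moves it to the cold shards.
    """
    return (now - HOT_WINDOW, datetime.datetime.max)


now = datetime.datetime(2024, 6, 7)
print(zone_for(datetime.datetime(2024, 5, 1), now))  # hot
print(zone_for(datetime.datetime(2023, 1, 1), now))  # cold
```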
        
       | kookamamie wrote:
       | Indexes? Indices?
        
       ___________________________________________________________________
       (page generated 2024-06-07 23:00 UTC)