[HN Gopher] Scaling One Million Checkboxes to 650M checks
___________________________________________________________________
Scaling One Million Checkboxes to 650M checks
Author : todsacerdoti
Score : 157 points
Date : 2024-07-26 16:14 UTC (6 hours ago)
(HTM) web link (eieio.games)
(TXT) w3m dump (eieio.games)
| xivzgrev wrote:
| Nice write up - curious how much it ended up costing?
| eieio wrote:
| Ah I should have included this (I'll edit the post shortly)
|
| Think the total cost was about $850, which was (almost) matched
| by donations.
|
| I made a mistake and never really spun down any infrastructure
| after moving to Go, and I also could have retired the second
| Redis replica I spun up; I think I could have cut those costs
| in half if I had been focused on it. But given that donations
| were matching costs and there was so much else going on, I
| wasn't super focused on that.
|
| I kept the infra up for a while after I shut down the site (to
| prepare graphs etc) which burned a little more money, so I'm
| slightly in the hole at this point but not in a major way.
| isoprophlex wrote:
| I wonder what that would have cost you on Hetzner, for
| example. I have a dedicated 20 vCPU box with 64 GB of RAM and
| a terabyte of SSD. Bandwidth is free... and all this for 40
| EUR/month.
|
| This should, especially after your Go rewrite, be enough to
| host everything?
| eieio wrote:
| you're right, I probably could have saved some money by
| using Hetzner! But I'm used to using Digital Ocean, and
| most of my projects haven't had these scaling problems, and
| I think changing my stack ahead of launching the project
| would have been a mistake.
|
| If I was planning to keep the site up for long I would have
| moved, but in this case I knew it was temporary and so
| toughing it out on DO seemed like a better choice.
| isoprophlex wrote:
| Haha yeah of course, the click a button thing is super
| convenient.
|
| I recently had to rebuild my box because I left a
| postgres instance open on 5432 with admin:admin
| credentials, and without default firewalls in place it
| got owned immediately.
|
| That would have been less painful on DO for sure.
| eieio wrote:
| Yeah I think this was particularly handy for Redis - being
| able to click a box to upgrade my Redis instance (or add a
| replica) while keeping my data with full uptime was really,
| really nice.
|
| Painful to use managed Redis when I was debugging (let me
| log in! let me change stuff! give me the IP!! ahhhhh!!!!)
| but really nice otherwise. A little painful to think
| about giving that up, although I coulda run a really
| hefty Redis setup on Hetzner for very little money!
| whynotmaybe wrote:
| Awesome !
|
| Will your next post be a statistical analysis of which
| checkboxes were the least/most checked?
|
| I remember scrolling way down and being kind of sad that the one
| I chose was almost instantly unchecked.
| eieio wrote:
| I'm gonna share the raw data soon! Just have one more story
| about the site I need to tell first.
| geek_at wrote:
| Is the game still live?
|
| When I go to https://onemillioncheckboxes.com/ nothing is checked
| and in the JS console I just see
|
| {"total":0,"totalGold":0,"totalRed":0,"totalGreen":0,"totalPurple
| ":0,"totalOrange":0,"recentlyChecked":false}
| kube-system wrote:
| From TFA:
|
| > We passed 650 million before I sunset the site 2 weeks later.
| geek_at wrote:
| Ah thanks. Makes sense since the server costs were $800+
| xnx wrote:
| > Building the site in two days with little regard for scale was
| a good choice.
|
| Probably the key takeaway that many early-career engineers need
| to learn. Scaling's not a problem until it's a problem. At that
| point, it's a good problem to have, and it's not as hard to fix as
| you might think anyway.
| hobs wrote:
| As long as you also take "keep the system simple and basic"
| to heart - I have seen many systems where the obvious
| choice was microservices, not for scaling or team separation
| mind you, but because the devs felt like it.
|
| Scaling those systems is a total bitch.
| isoprophlex wrote:
| TIL you can run Lua scripts on your Redis server. That's so nice,
| I never knew!
| jorl17 wrote:
| This was a fantastic writeup!!! Congratulations on the website.
| To me, though, the writeup is what you should be most proud of!!
| eieio wrote:
| Thank you! I spent substantially more time on the writeup than
| I did on the site pre-launch, which is pretty funny to me
| pizzafeelsright wrote:
| Many lessons learned along with great historical knowledge of
| (distributed) systems.
|
| I think you hit every type of interruption and point of failure,
| except storage space, and it is great to see your resolutions.
|
| I wasn't aware Redis could do the Lua stuff, which makes me very
| interested in using it as an alternative state store.
|
| As for the bandwidth - it's one of my biggest gripes with cloud
| services that there is no hard limit to avoid billing overages.
| eieio wrote:
| Thank you!
|
| FWIW I certainly hit storage space in some boring ways - I
| didn't have a good logrotate setup so I almost ran out of disk,
| and I sent my box-check logs to Redis and had to set something
| up to offload old logs to disk to not break Redis. But neither
| of those were very big deals - pretty interesting to have a
| problem like this where storage just wasn't a meaningful
| problem! That's a new one for me.
|
| And yeah, thinking about bandwidth was such a headache. I was
| on edge for like 2 days, constantly checking outbound bytes on
| my NIC and redoing the math - not having a hard cap is just
| really scary. And that's with Digital Ocean, which has pretty
| sane pricing! I haven't used the popular serverless stuff at
| all, but my understanding is that you get really gouged on
| bandwidth there.
|
| (also yes, lua-in-redis is really incredible and lets you skip
| sooo many hard/racey problems as long as you're ok with a
| little performance hit, it was a joy to work with)
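|
| For anyone who hasn't used it: here's a minimal sketch (Go +
| go-redis, not the site's actual code, and the key names are
| made up) of the kind of atomic "check a box and bump the
| total" script I mean:
|
|     package boxes
|
|     import (
|         "context"
|
|         "github.com/redis/go-redis/v9"
|     )
|
|     // SETBIT + INCR run inside a single Lua call, so two
|     // clients checking the same box can't race each other.
|     var checkBox = redis.NewScript(`
|         local prev = redis.call("SETBIT", KEYS[1], ARGV[1], 1)
|         if prev == 0 then
|             return redis.call("INCR", KEYS[2])
|         end
|         return tonumber(redis.call("GET", KEYS[2]) or "0")
|     `)
|
|     // check marks box i checked and returns the new total.
|     func check(ctx context.Context, rdb *redis.Client, i int) (int64, error) {
|         return checkBox.Run(ctx, rdb, []string{"boxes", "count"}, i).Int64()
|     }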
| wonger_ wrote:
| As someone new to backend - is there a simple alternative
| architecture for this project? I hope there's an easier way to
| host a million bits of state and sync with clients. Some of the
| solutions in the post went over my head.
|
| Kudos to the author - your projects are great.
| isoprophlex wrote:
| I don't think it gets much simpler than this to be honest...
| except for un-scalable things like keeping a single global list
| of a million booleans in the same process as your backend API.
| wild_egg wrote:
| Maybe dumb question but why is keeping a million booleans in
| one process un-scalable? That's only 125KB of memory which
| can easily live inside CPU cache and be operated on
| atomically.
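|
| Rough sketch of what I mean (Go, sync/atomic - an untested
| illustration, not anyone's production code):
|
|     package bitset
|
|     import "sync/atomic"
|
|     const numBoxes = 1_000_000
|
|     // 1M bits packed into 15,625 uint64 words (~125KB).
|     var words [numBoxes / 64]uint64
|
|     // Toggle flips box i with a compare-and-swap loop and
|     // reports whether the box is now checked - no locks.
|     func Toggle(i int) bool {
|         w, mask := &words[i/64], uint64(1)<<(uint(i)%64)
|         for {
|             old := atomic.LoadUint64(w)
|             if atomic.CompareAndSwapUint64(w, old, old^mask) {
|                 return old&mask == 0
|             }
|         }
|     }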
| eieio wrote:
| (I'm the author)
|
| I don't think it's unscalable at all - it's just that if
| you do this there's not a great story for adding a second
| machine if the first one can't handle the load (but ofc a
| beefy machine with a fast implementation could handle a lot
| of load).
|
| When we did the Go rewrite we considered just getting one
| beefier box and doing everything there (and maybe moving
| Redis state in memory), but it felt easier and safer to do
| a drop-in rewrite.
| alright2565 wrote:
| > there's not a great story for adding a second machine
| if the first one can't handle the load
|
| I mean this is basically your Redis situation, right? Just
| with a very specialized "redis".
|
| You could scale this out, even after the pretty massive
| ability to scale up is exceeded. Have some front-end
| servers that act as a connection pooler to your
| datastore. Or shard the datastore, and have clients only
| request from the shards that they are currently looking
| at.
| eieio wrote:
| Right, and then the question is "is my specialized
| datastore gonna be faster than Redis" right? And it seems
| totally reasonable that you could make something faster
| _eventually_ - but I think it's not a reasonable goal
| within the timeframe of the Go rewrite (one Sunday
| afternoon and evening). Especially if you want to extend
| that system such that other services could talk to it!
|
| The entire timeframe of this project was 2 weeks, and the
| critical period (most activity / new eyes) was a couple
| of days.
| alright2565 wrote:
| Sorry, I'm talking hypothetically about how this would be
| designed, not in the context of your specific timeframe!
|
| > "is my specialized datastore gonna be faster than
| Redis"
|
| Absolutely! With how efficient this code would be, you'd
| likely never need to scale horizontally, and in that case
| it is extremely easy to compete with a network hop (at
| least 1ms latency) versus L1 cache (<50ns)
|
| The comparison with Redis otherwise only applies once you
| do need to scale horizontally.
|
| There's also the fact that Redis must be designed to
| support any query pattern efficiently; a much harder
| problem than supporting just 1-2 queries efficiently.
| Boxxed wrote:
| But it's all going through one Redis box, isn't it? That
| feels like you're still limited by your one beefy
| machine.
| isoprophlex wrote:
| Oh no that's absolutely fine. I wasn't thorough. As
| commented by the author, that probably gets you very far.
|
| However... You'd need to persist those booleans somewhere
| eventually, of course, if you want the state to survive a
| process restart. And if you want multiple concurrent
| connections from the same box, you have to somehow allow
| multiple writers to the same object. And if you want
| multiple boxes (for redundancy, load spreading, geo
| distribution...), you need a way to do the writing and
| reading from several different boxes...
|
| By this time you're basically building Redis.
| eieio wrote:
| Author here!
|
| Sorry that some of the stuff went over your head! I wanted to
| include longer descriptions of the tech I was using but the
| post was already suuuper long and I felt like I couldn't add
| more.
|
| Very happy to answer any questions that you've got here!
|
| I'm not sure how you'd simplify the architecture in a major way
| to be honest. There are certainly services you could use for
| stuff like this, but I think there you're offloading the
| complexity to someone else? Ultimately you need:
| * A database that tracks which boxes are checked (that's
|   Redis)
| * A choice about how to put your data in your database (I
|   chose to just store 1 million bits for simplicity - see
|   the sketch at the end of this comment)
| * A way to tell your clients what the current state is (I
|   chose to send them all 1 million bits - it's nice that
|   this is not that much data)
| * A way for clients to tell you when they check a box +
|   update your state (that's Flask + the websocket)
| * A way to tell your clients when a box is checked/unchecked
|   (that's also Flask + websockets. I chose to send both
|   updates about individual boxes and also updates about all
|   1 million boxes)
| * A way to avoid rendering 1 million DOM elements all the
|   time (react-window)
|
| The other stuff (nginx for static content + a reverse proxy) is
| mostly just to make things easier to scale; you could implement
| this solution without those details and the site would work
| fine, it just wouldn't be able to handle the same load.
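|
| A minimal sketch of those first few pieces (go-redis here
| rather than the site's actual Flask/Python, just to keep the
| thread's examples in one language; "boxes" is a made-up key
| name):
|
|     package boxes
|
|     import (
|         "context"
|
|         "github.com/redis/go-redis/v9"
|     )
|
|     // One Redis string holds all 1,000,000 bits (~125KB),
|     // so sending a client the full state is a single GET.
|     func fullState(ctx context.Context, rdb *redis.Client) ([]byte, error) {
|         return rdb.Get(ctx, "boxes").Bytes()
|     }
|
|     // Checking/unchecking box i flips one bit in place.
|     func setBox(ctx context.Context, rdb *redis.Client, i int64, checked bool) error {
|         v := 0
|         if checked {
|             v = 1
|         }
|         return rdb.SetBit(ctx, "boxes", i, v).Err()
|     }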
| sa46 wrote:
| Just spitballing: could you change the database to a bool
| array? Guard it with an RWMutex and persist on server
| shutdown. The bottleneck probably moves to pushing updates
| from a single server, but Go can probably handle a few tens
| of thousands of goroutines.
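|
| Roughly this (an untested sketch; the file path and sizes
| are illustrative):
|
|     package state
|
|     import (
|         "os"
|         "sync"
|     )
|
|     // The whole "database": one bool per box behind an RWMutex.
|     type Store struct {
|         mu    sync.RWMutex
|         boxes [1_000_000]bool
|     }
|
|     func (s *Store) Set(i int, v bool) {
|         s.mu.Lock()
|         s.boxes[i] = v
|         s.mu.Unlock()
|     }
|
|     // Persist dumps state once, e.g. from a SIGTERM handler;
|     // losing the last few writes on a crash is the trade-off.
|     func (s *Store) Persist(path string) error {
|         s.mu.RLock()
|         defer s.mu.RUnlock()
|         buf := make([]byte, len(s.boxes))
|         for i, b := range s.boxes {
|             if b {
|                 buf[i] = 1
|             }
|         }
|         return os.WriteFile(path, buf, 0o644)
|     }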
| summerlight wrote:
| > persist on server shutdown
|
| Probably this is not the simplest thing to do if you want a
| certain degree of reliability. It should definitely be easier
| than writing an entire storage engine, but it's likely
| overkill for this kind of overnight hobby project.
| paxys wrote:
| This is pretty much as simple as it gets tbh. A couple of web
| servers backed by a cache/pubsub queue.
|
| It _may_ have been possible to do it all in-memory on a single
| large host, but then if it's unable to meet the demand or
| fails for whatever reason you're completely out of luck.
| 10000truths wrote:
| Sure. Everything described in the article could be crammed into
| a single process. Instead of using a database, you could store
| the bitset in a file and mmap it. And instead of using a
| reverse proxy, you could handle the HTTP requests and WebSocket
| connections directly from the application.
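|
| The mmap half might look roughly like this (Go on
| Linux/macOS; error handling trimmed, file name made up):
|
|     package state
|
|     import (
|         "os"
|         "syscall"
|     )
|
|     const numBytes = 1_000_000 / 8 // one million bits
|
|     // openBitset maps the state file into memory. With
|     // MAP_SHARED the kernel writes changes back to the file,
|     // so persistence comes for free.
|     func openBitset(path string) ([]byte, error) {
|         f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
|         if err != nil {
|             return nil, err
|         }
|         if err := f.Truncate(numBytes); err != nil {
|             return nil, err
|         }
|         return syscall.Mmap(int(f.Fd()), 0, numBytes,
|             syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
|     }
|
|     // setBit checks/unchecks box i directly in the mapping.
|     func setBit(m []byte, i int, v bool) {
|         if v {
|             m[i/8] |= 1 << (uint(i) % 8)
|         } else {
|             m[i/8] &^= 1 << (uint(i) % 8)
|         }
|     }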
| dang wrote:
| Recent and related:
|
| _One Million Checkboxes_ -
| https://news.ycombinator.com/item?id=40800869 - June 2024 (305
| comments)
| winrid wrote:
| These are fun projects. About six years ago I launched Pixmap on
| Android, which is a little collaborative pixel-editing app,
| supporting larger images (like 1024x1024 grids etc). I had a
| queue that would apply each event to png images, and then clients
| would load the initial PNG on connect, and then each pixel draw
| event is just one small object sent to the client. This way I
| could take advantage of image compression on initial load, and
| then the change sets are very small. Also, since each event is
| stored in a log, you can "rewind" the images [0].
|
| [0] 22mb: https://blog.winricklabs.com/images/pixmap-rewind-
| demo.gif
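|
| Roughly the shape of it (a simplified sketch, not the actual
| wire format):
|
|     package pixmap
|
|     import "image/color"
|
|     // One draw event - small enough to broadcast to every
|     // connected client as it happens.
|     type PixelEvent struct {
|         X, Y int
|         C    color.RGBA
|         User string // shown next to the pixel
|     }
|
|     // Replaying the append-only log over a blank canvas (or
|     // any PNG snapshot) reproduces the image at any point in
|     // time - that's the "rewind".
|     func replay(events []PixelEvent, canvas [][]color.RGBA) {
|         for _, e := range events {
|             canvas[e.Y][e.X] = e.C
|         }
|     }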
| usernamed7 wrote:
| Neat! I was exploring a similar idea about per-pixel updates to
| many web clients, but I found it would be way too
| bandwidth/storage intensive to do what I wanted to do. So I've
| been tinkering with canvases that can be addressed by API calls
|
| https://x.com/RussTheMagic/status/1816749136487588311
| winrid wrote:
| I see what you're doing is submitting shapes over the wire.
| That's a bit different than what a pixel art application has
| to do, unless we were doing like color fill and stuff, which
| it doesn't support. Every pixel you draw shows your name
| next to the square, too, so it's not a "draw a shape and
| submit" kind of thing.
|
| Regarding storage of the log, compression is a thing :)
| usernamed7 wrote:
| For sure, different projects - what I was doing when I
| started (expressing individual pixel updates) was a huge
| storage hog: trying to draw the most rudimentary shapes
| resulted in quite large storage (even after gzip) and would
| have translated to large bandwidth requirements (as I was
| going for realtime). I moved over to canvas drawings
| because I could express rendering with a much more
| expressive syntax.
| winrid wrote:
| Yeah, my goal at some point was to add animations and
| stuff, and then I'd do something like you're doing. But I
| moved onto other projects :)
| butz wrote:
| This project won't be complete until it supports the elusive
| "indeterminate" state of the checkbox.
| layer8 wrote:
| It should go indeterminate whenever clicks from two users on
| the same checkbox are detected where the second click occurred
| before that copy of the checkbox received the result of the
| first click.
| junon wrote:
| Really cool, awesome followup to the original project.
___________________________________________________________________
(page generated 2024-07-26 23:07 UTC)