[HN Gopher] So You Want to Build Your Own Data Center
___________________________________________________________________
So You Want to Build Your Own Data Center
Author : dban
Score : 73 points
Date : 2025-01-17 20:41 UTC (2 hours ago)
(HTM) web link (blog.railway.com)
(TXT) w3m dump (blog.railway.com)
| dban wrote:
| This is our first post about building out data centers. If you
| have any questions, we're happy to answer them here :)
| gschier wrote:
| How do you deal with drive failures? How often does a Railway
| team member need to visit a DC? What's it like inside?
| justjake wrote:
| Everything is dual-redundant. We run RAID, so a single drive
| failure is fine; alerting pages oncall, which triggers remote
| hands onsite, where we have spares for everything in each
| datacenter.
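
A minimal sketch of that failure-detection loop, assuming Linux
software RAID (mdadm) and a hypothetical paging webhook; this is
not Railway's actual tooling, just the shape of the idea:

    # Sketch: watch /proc/mdstat and page oncall when an md array
    # reports a missing member (e.g. "[U_]" instead of "[UU]").
    import re
    import time
    import urllib.request

    PAGER_WEBHOOK = "https://pager.example.internal/v1/page"  # assumed endpoint

    def degraded_arrays() -> list[str]:
        degraded = []
        with open("/proc/mdstat") as f:
            blocks = f.read().split("\n\n")
        for block in blocks:
            name = re.search(r"^(md\d+)\s*:", block, re.M)
            # An underscore in the status brackets means a member
            # has failed or is missing.
            if name and re.search(r"\[[U_]*_[U_]*\]", block):
                degraded.append(name.group(1))
        return degraded

    while True:
        for md in degraded_arrays():
            payload = f'{{"summary": "RAID array {md} degraded"}}'.encode()
            urllib.request.urlopen(PAGER_WEBHOOK, data=payload)  # POST the page
        time.sleep(60)
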
| gschier wrote:
| How much additional overhead is there for managing the
| bare-metal vs cloud? Is it mostly fine after the big effort
| for initial setup?
| ca508 wrote:
| We built some internal tooling to help manage the hosts.
| Once a host is onboarded onto it, it's a few button
| clicks on an internal dashboard to provision a QEMU VM.
| We made a custom ansible inventory plugin so we can
| manage these VMs the same as we do machines on GCP.
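
The "manage these VMs the same as GCP machines" idea can be
sketched with Ansible's simpler dynamic-inventory-script protocol
(the real integration is an inventory plugin, and the internal VM
API below is a made-up placeholder):

    #!/usr/bin/env python3
    # Minimal dynamic inventory: emit the JSON shape Ansible expects,
    # with hosts pulled from whatever internal API tracks the QEMU VMs.
    import json
    import sys
    import urllib.request

    VM_API = "https://dcim.example.internal/api/vms"  # placeholder endpoint

    def fetch_vms():
        with urllib.request.urlopen(VM_API) as resp:
            # e.g. [{"name": "vm-01", "ip": "10.1.0.5", "site": "us-west1"}]
            return json.load(resp)

    def build_inventory(vms):
        inv = {"_meta": {"hostvars": {}}, "baremetal_vms": {"hosts": []}}
        for vm in vms:
            inv["baremetal_vms"]["hosts"].append(vm["name"])
            inv["_meta"]["hostvars"][vm["name"]] = {
                "ansible_host": vm["ip"],
                "site": vm["site"],
            }
        return inv

    if __name__ == "__main__":
        # Ansible invokes the script with --list (and --host for lookups).
        if "--list" in sys.argv:
            print(json.dumps(build_inventory(fetch_vms())))
        else:
            print(json.dumps({}))
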
|
| The host runs a custom daemon that programs FRR (an OSS
| routing stack), so that it advertises addresses assigned
| to a VM to the rest of the cluster via BGP. So zero
| config of network switches, etc... required after initial
| setup.
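
A bare-bones version of that per-VM advertisement, driving FRR
through vtysh: the ASN and address are invented, and in practice
the /32 also has to exist locally (e.g. on a dummy interface) or
import-check must be disabled for the network statement to be
originated.

    # Sketch: have the local FRR daemon originate a route for a VM
    # address so the ToR switches learn it over the existing BGP session.
    import ipaddress
    import subprocess

    LOCAL_ASN = 65001  # assumed private ASN for the host

    def advertise_vm_ip(ip: str) -> None:
        prefix = f"{ipaddress.ip_address(ip)}/32"
        subprocess.run(
            [
                "vtysh",
                "-c", "configure terminal",
                "-c", f"router bgp {LOCAL_ASN}",
                "-c", "address-family ipv4 unicast",
                "-c", f"network {prefix}",
                "-c", "end",
            ],
            check=True,
        )

    advertise_vm_ip("10.1.0.5")  # address just assigned to a new VM
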
|
| We'll blog about this system at some point in the coming
| months.
| exabrial wrote:
| I'm surprised you guys are building new!
|
| Tons of colocation is available nearly everywhere in the US, and
| in the KCMO area there are even a few dark datacenters available
| for sale!
|
| Cool project nonetheless. Bit jealous actually :P
| gschier wrote:
| More info on the cost comparison between all the options would
| be interesting
| dban wrote:
| We pulled some cost stuff out of the post in final review
| because we weren't sure it was interesting ... we'll bring it
| back for a future post
| justjake wrote:
| The requirements end up being pretty specific, based on
| workloads/power draw/supply chain
|
| So, while we could have bought something off the shelf, that
| would have been suboptimal from a specs perspective. Plus we'd
| still have had to source the supply chain, etc.
|
| By owning not just the servers but the whole supply chain, we
| have redundancy at every layer, from the machine, to the parts
| on site (for failures), to the supply chain (refilling those
| spare parts/expanding capacity/etc)
| CMCDragonkai wrote:
| Can you share a list of dark datacenters that are for sale?
| They sound interesting as a business.
| idlewords wrote:
| They're not building new, though--the post is about renting a
| cage in a datacenter.
| ramon156 wrote:
| Weird to think my final internship was running on one of these
| things. Thanks for all the free minutes! It was a nice
| experience.
| nextworddev wrote:
| First time checking out the Railway product - it seems like a
| "low code" and visual way to define and operate infrastructure?
|
| Like, if Terraform had a nice UI?
| justjake wrote:
| Kinda. It's like if you had everything from an infra stack but
| didn't need to manage it (Kubernetes for resilience, Argo for
| rollouts, Terraform for safely evolving infrastructure, DataDog
| for observability)
|
| If you've heard of serverless, this is one step farther;
| infraless
|
| Give us your code, we will spin it up, keep it up, and automate
| rollouts, service discovery, cluster scaling, monitoring, etc.
| __fst__ wrote:
| Can anyone recommend some engineering reading for building and
| running DC infrastructure?
| ca508 wrote:
| We didn't find many good up-to-date resources online on the
| hardware side of things - kinda why we wanted to write about
| it. The networking aspect was the most mystical - I highly
| recommend "BGP in the datacenter" by Dinesh Dutt on that (I
| think it's available for free via NVidia). Our design is
| heavily influenced by the ideas discussed there.
| jonatron wrote:
| Why would you call colocation "building your own data center"?
| You could call it "colocation" or "renting space in a data
| center". What are you building? You're racking. Can you say what
| you mean?
| macintux wrote:
| Dealing with power at that scale, and arranging your own ISPs,
| seems a bit beyond your normal colocation project, but I
| haven't been in the data center space in a very long time.
| j-b wrote:
| Love these kinds of posts. Tried railway for the first time a few
| days ago. It was a delightful experience. Great work!
| coolkil wrote:
| Awesome!! Hope to see more companies go this route. I had the
| pleasure of doing something similar for a company (at a much
| smaller scale, though).
|
| It was my first job out of university. I will never forget the
| awesome experience of walking into the datacenter and starting
| to plug in cables and stuff.
| sitkack wrote:
| It would be nice to have a lot more detail. The WTF sections are
| the best part. Sounds like your gear needs a "this side towards
| enemy" sign and/or the right affordances so it only goes in one
| way.
|
| Did you standardize on layout at the rack level? What poka-yoke
| processes did you put into place to prevent mistakes?
|
| What does your metal->boot stack look like?
|
| Having worked for two different cloud providers and built my own
| internal clouds with PXE-booted hosts, I too find this stuff
| fascinating.
|
| Also, take utmost advantage of a new DC when you are booting it
| to try out all the failure scenarios you can think of, and the
| ones you can't, through randomized fault injection.
| ca508 wrote:
| > It would be nice to have a lot more detail
|
| I'm going to save this for when I'm asked to cut the three
| paras on power circuit types.
|
| Re: standardising layout at the rack level; we do now! We only
| figured this out after site #2. It makes everything so much
| easier to verify. And yeah, validation is hard - we're doing it
| manually thus far; I want to play around with scraping LLDP
| data, but our switch software stack has a bug :/. It's an
| evolving process: the more we work with different contractors,
| the more edge cases we unearth and account for. The biggest
| improvement is that we've built an internal DCIM that templates
| a rack design and exports an interactive "cabling explorer" for
| the site techs - including detailed annotated diagrams of
| equipment showing port names, etc... The elevation in the post
| is a screenshot of part of that tool.
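
The cabling check described above amounts to diffing observed
LLDP neighbors against the port map the rack-design tool exports.
A host-side sketch using lldpd's "lldpctl -f json" output (the
expected map and exact JSON layout are assumptions; the real
check would scrape the switches instead):

    # Sketch: compare observed LLDP neighbors against an expected
    # cabling map exported from the rack design.
    import json
    import subprocess

    EXPECTED = {  # local interface -> (neighbor name, neighbor port)
        "eth0": ("tor-a", "Ethernet12"),
        "eth1": ("tor-b", "Ethernet12"),
    }

    def observed_neighbors() -> dict:
        out = subprocess.run(["lldpctl", "-f", "json"],
                             capture_output=True, check=True)
        section = json.loads(out.stdout).get("lldp", {}).get("interface", {})
        if isinstance(section, list):  # some lldpd versions emit a list
            merged = {}
            for entry in section:
                merged.update(entry)
            section = merged
        neighbors = {}
        for local_port, info in section.items():
            chassis = next(iter(info["chassis"]))      # neighbor system name
            remote_port = info["port"]["id"]["value"]  # neighbor port ID
            neighbors[local_port] = (chassis, remote_port)
        return neighbors

    seen = observed_neighbors()
    for port, want in EXPECTED.items():
        got = seen.get(port)
        status = "OK" if got == want else f"MISMATCH (saw {got})"
        print(f"{port}: expected {want} -> {status}")
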
|
| > What does your metal->boot stack look like?
|
| We've hacked together something on top of
| https://github.com/danderson/netboot/tree/main/pixiecore that
| serves a Debian netboot image + preseed file. We have some
| custom Temporal workers that connect to the Redfish APIs on the
| BMCs to puppeteer the contraption. Then a custom host agent
| provisions QEMU VMs and advertises the assigned IPs via BGP
| (using FRR) from the host.
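
The Redfish part of that pipeline is fairly standard: set a
one-shot PXE boot override and power-cycle the box through the
BMC's HTTPS API. A stripped-down sketch, where the BMC address,
credentials, and the "Self" system ID are placeholders:

    # Sketch: nudge a machine into a PXE boot via its BMC's Redfish
    # API so the netboot server can answer the request.
    import requests

    BMC = "https://10.0.0.50"               # placeholder BMC address
    SYSTEM = f"{BMC}/redfish/v1/Systems/Self"

    s = requests.Session()
    s.auth = ("admin", "password")          # placeholder credentials
    s.verify = False                        # BMCs often use self-signed certs

    # One-shot boot from PXE on the next power-on.
    s.patch(SYSTEM, json={
        "Boot": {"BootSourceOverrideEnabled": "Once",
                 "BootSourceOverrideTarget": "Pxe"},
    }).raise_for_status()

    # Power-cycle so the override takes effect.
    s.post(f"{SYSTEM}/Actions/ComputerSystem.Reset",
           json={"ResetType": "ForceRestart"}).raise_for_status()
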
|
| Re: new DCs for failure scenarios, yeah, we've already blown
| breakers etc... testing stuff (that's how we figured out our
| phase balancing was off), and went in with a thermal camera at
| another site. A site in AMS is coming up next week, and the
| goal for that is to see how far we can push a fully loaded
| switch fabric.
| aetherspawn wrote:
| What brand of servers was used?
___________________________________________________________________
(page generated 2025-01-17 23:00 UTC)