[HN Gopher] So You Want to Build Your Own Data Center
       ___________________________________________________________________
        
       So You Want to Build Your Own Data Center
        
       Author : dban
       Score  : 73 points
       Date   : 2025-01-17 20:41 UTC (2 hours ago)
        
 (HTM) web link (blog.railway.com)
 (TXT) w3m dump (blog.railway.com)
        
       | dban wrote:
       | This is our first post about building out data centers. If you
       | have any questions, we're happy to answer them here :)
        
         | gschier wrote:
         | How do you deal with drive failures? How often does a Railway
         | team member need to visit a DC? What's it like inside?
        
           | justjake wrote:
            | Everything is dual-redundant. We run RAID, so if a drive
            | fails it's fine; alerting will page on-call, which will
            | trigger remote hands onsite, where we have spares for
            | everything in each datacenter.
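            | 
            | For a flavour of what that looks like in practice, here is a
            | minimal sketch (illustrative only, not our actual tooling) of
            | the kind of check that feeds that alerting - scan
            | /proc/mdstat for degraded md arrays and page if any turn up:
            | 
            |     import re
            | 
            |     def degraded_md_arrays(mdstat_path="/proc/mdstat"):
            |         # An array is degraded when its status markers
            |         # show a missing member, e.g. '[U_]' vs '[UU]'.
            |         degraded, current = [], None
            |         with open(mdstat_path) as f:
            |             for line in f:
            |                 m = re.match(r"^(md\d+)\s*:", line)
            |                 if m:
            |                     current = m.group(1)
            |                 elif current and re.search(r"\[U*_+U*\]", line):
            |                     degraded.append(current)
            |         return degraded
            | 
            |     if __name__ == "__main__":
            |         bad = degraded_md_arrays()
            |         if bad:
            |             # stand-in for whatever paging hook is wired up
            |             print(f"DEGRADED ARRAYS: {bad}")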
        
             | gschier wrote:
             | How much additional overhead is there for managing the
             | bare-metal vs cloud? Is it mostly fine after the big effort
             | for initial setup?
        
               | ca508 wrote:
               | We built some internal tooling to help manage the hosts.
               | Once a host is onboarded onto it, it's a few button
               | clicks on an internal dashboard to provision a QEMU VM.
                | We made a custom Ansible inventory plugin so we can
                | manage these VMs the same way we do machines on GCP.
               | 
               | The host runs a custom daemon that programs FRR (an OSS
               | routing stack), so that it advertises addresses assigned
               | to a VM to the rest of the cluster via BGP. So zero
                | config of network switches, etc. is required after
                | initial setup.
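                | 
                | To give a flavour of the host-side piece, a simplified
                | sketch (not our actual daemon; the ASN and address are
                | made up) - once a VM gets an IP, we just need FRR to
                | originate that /32, which boils down to a few vtysh
                | commands:
                | 
                |     import subprocess
                | 
                |     LOCAL_ASN = 65001  # illustrative ASN
                | 
                |     def announce_vm_ip(ip: str) -> None:
                |         # Ask FRR to originate a /32 for the VM.
                |         # (FRR only advertises the prefix if a
                |         # matching route exists on the host, e.g.
                |         # via the VM's tap interface.)
                |         cmds = [
                |             "configure terminal",
                |             f"router bgp {LOCAL_ASN}",
                |             "address-family ipv4 unicast",
                |             f"network {ip}/32",
                |         ]
                |         args = ["vtysh"]
                |         for c in cmds:
                |             args += ["-c", c]
                |         subprocess.run(args, check=True)
                | 
                |     announce_vm_ip("10.0.10.42")  # example VM address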
               | 
               | We'll blog about this system at some point in the coming
               | months.
        
       | exabrial wrote:
       | I'm surprised you guys are building new!
       | 
        | Tons of colocation available nearly everywhere in the US, and in
       | the KCMO area, there are even a few dark datacenters available
       | for sale!
       | 
        | Cool project nonetheless. Bit jealous actually :P
        
         | gschier wrote:
         | More info on the cost comparison between all the options would
         | be interesting
        
           | dban wrote:
           | We pulled some cost stuff out of the post in final review
           | because we weren't sure it was interesting ... we'll bring it
           | back for a future post
        
         | justjake wrote:
         | The requirements end up being pretty specific, based on
         | workloads/power draw/supply chain
         | 
         | So, while we could have bought something off the shelf, that
         | would have been suboptimal from a specs perspective. Plus then
         | we'd have to source supply chain etc.
         | 
         | By owning not just the servers but the whole supply chain, we
         | have redundancy at every layer, from the machine, to the parts
         | on site (for failures), to the supply chain (refilling those
         | spare parts/expanding capacity/etc)
        
         | CMCDragonkai wrote:
          | Can you share a list of dark datacenters that are for sale?
         | They sound interesting as a business.
        
         | idlewords wrote:
         | They're not building new, though--the post is about renting a
         | cage in a datacenter.
        
       | ramon156 wrote:
       | weird to think my final internship was running on one of these
       | things. thanks for all the free minutes! it was a nice experience
        
       | nextworddev wrote:
       | First time checking out railway product- it seems like a "low
       | code" and visual way to define and operate infrastructure?
       | 
       | Like, if Terraform had a nice UI?
        
         | justjake wrote:
          | Kinda. It's like if you had everything from an infra stack but
          | didn't need to manage it (Kubernetes for resilience, Argo for
          | rollouts, Terraform for safely evolving infrastructure, Datadog
          | for observability).
          | 
          | If you've heard of serverless, this is one step further:
          | infraless.
          | 
          | Give us your code and we will spin it up, keep it up, and
          | automate rollouts, service discovery, cluster scaling,
          | monitoring, etc.
        
       | __fst__ wrote:
       | Can anyone recommend some engineering reading for building and
       | running DC infrastructure?
        
         | ca508 wrote:
         | We didn't find many good up-to-date resources online on the
         | hardware side of things - kinda why we wanted to write about
         | it. The networking aspect was the most mystical - I highly
          | recommend "BGP in the Data Center" by Dinesh Dutt on that (I
          | think it's available for free via NVIDIA). Our design is
         | heavily influenced by the ideas discussed there.
        
       | jonatron wrote:
       | Why would you call colocation "building your own data center"?
       | You could call it "colocation" or "renting space in a data
       | center". What are you building? You're racking. Can you say what
       | you mean?
        
         | macintux wrote:
          | Dealing with power at that scale and arranging your own ISPs
          | seems a bit beyond your normal colocation project, but I
          | haven't been in the data center space in a very long time.
        
       | j-b wrote:
        | Love these kinds of posts. Tried Railway for the first time a few
       | days ago. It was a delightful experience. Great work!
        
       | coolkil wrote:
        | Awesome!! Hope to see more companies go this route. I had the
        | pleasure of doing something similar for a company (a lot smaller
        | scale though).
       | 
        | It was my first job out of university. I will never forget the
        | awesome experience of walking into the datacenter and starting
        | to plug in cables and stuff.
        
       | sitkack wrote:
       | It would be nice to have a lot more detail. The WTF sections are
        | the best part. Sounds like your gear needs a "this side towards
       | enemy" sign and/or the right affordances so it only goes in one
       | way.
       | 
        | Did you standardize on layout at the rack level? What poka-yoke
       | processes did you put into place to prevent mistakes?
       | 
       | What does your metal->boot stack look like?
       | 
       | Having worked for two different cloud providers and built my own
        | internal clouds with PXE-booted hosts, I too find this stuff
       | fascinating.
       | 
        | Also, take utmost advantage of a new DC when you are booting it
        | to try out all the failure scenarios you can think of, and the
        | ones you can't, through randomized fault injection.
        
         | ca508 wrote:
         | > It would be nice to have a lot more detail
         | 
         | I'm going to save this for when I'm asked to cut the three
         | paras on power circuit types.
         | 
          | Re: standardising layout at the rack level; we do now! We only
          | figured this out after site #2. It makes everything so much
          | easier to verify. And yeah, validation is hard - we're doing it
          | manually thus far; we want to play around with scraping LLDP
          | data, but our switch software stack has a bug :/. It's an
          | evolving process: the more we work with different contractors,
          | the more edge cases we unearth and account for. The biggest
          | improvement is that we've built an internal DCIM that templates
          | a rack design and exports an interactive "cabling explorer" for
          | the site techs - including detailed annotated diagrams of
          | equipment showing port names, etc... The elevation screenshot
          | in the post is from that tool.
         | 
         | > What does your metal->boot stack look like?
         | 
          | We've hacked together something on top of
          | https://github.com/danderson/netboot/tree/main/pixiecore that
          | serves a Debian netboot + preseed file. We have some custom
          | Temporal workers that connect to the Redfish APIs on the BMCs
          | to puppeteer the contraption. Then a custom host agent
          | provisions QEMU VMs and advertises their assigned IPs via BGP
          | (using FRR) from the host.
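          | 
          | The Redfish bit is less exotic than it sounds. Roughly (a
          | simplified sketch; the system ID, creds and BMC address here
          | are illustrative and vary by vendor): set a one-shot PXE boot
          | override and power-cycle the box:
          | 
          |     import requests  # pip install requests
          | 
          |     def pxe_boot_once(bmc, user, pw, system_id="1"):
          |         # One-shot boot override to PXE, then restart so it
          |         # takes effect. verify=False because BMCs tend to
          |         # ship self-signed certs.
          |         base = f"https://{bmc}/redfish/v1/Systems/{system_id}"
          |         auth = (user, pw)
          |         requests.patch(
          |             base,
          |             json={"Boot": {
          |                 "BootSourceOverrideEnabled": "Once",
          |                 "BootSourceOverrideTarget": "Pxe",
          |             }},
          |             auth=auth, verify=False, timeout=30,
          |         ).raise_for_status()
          |         requests.post(
          |             f"{base}/Actions/ComputerSystem.Reset",
          |             json={"ResetType": "ForceRestart"},
          |             auth=auth, verify=False, timeout=30,
          |         ).raise_for_status()
          | 
          |     pxe_boot_once("10.0.0.50", "root", "password")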
         | 
          | Re: new DCs for failure scenarios - yeah, we've already blown
          | breakers etc. testing stuff (that's how we figured out our
          | phase balancing was off). We went in with a thermal camera on
          | another site. A site in AMS is coming up next week, and the
          | goal there is to see how far we can push a fully loaded switch
          | fabric.
        
       | aetherspawn wrote:
       | What brand of servers was used?
        
       ___________________________________________________________________
       (page generated 2025-01-17 23:00 UTC)