[HN Gopher] The wrong way to switch operating systems on your se...
       ___________________________________________________________________
        
       The wrong way to switch operating systems on your server
        
       Author : ohjeez
       Score  : 48 points
       Date   : 2021-06-19 17:48 UTC (3 days ago)
        
 (HTM) web link (figbert.com)
 (TXT) w3m dump (figbert.com)
        
       | gperciva wrote:
       | Wow, that's much more technically advanced than I was as a
       | teenager! Way to go!
       | 
       | To print progress with tarsnap 1.0.39, send it a SIGUSR1 or
       | SIGINFO. On FreeBSD, you can do this by pressing ctrl-t. On
       | Linux, you have to use the unfortunately-named `kill` or
       | `killall` command, such as
       | 
       | killall -SIGUSR1 tarsnap
       | 
       | https://www.tarsnap.com/tips.html#check-current
       | 
       | (Note that Tarsnap is not responsible for naming the unix `kill`
       | or `killall` commands.)
       | 
       | In the unreleased git version of tarsnap, there's a
       | `--progress-bytes SIZE` option, which prints a progress message
       | after every SIZE bytes are processed.
       | 
       | As a general note: the tarsnap-users mailing list is a great
       | place to ask for tips. As you mentioned in your lessons learned,
       | some of the options could have helped a lot (such as `--recover`)
       | https://www.tarsnap.com/lists.html
       | 
       | (Disclaimer: I'm employed by Tarsnap Backup Inc.)
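
The signal-driven progress idea above can be sketched in a few lines. This is only an illustration of the general pattern, not tarsnap's implementation; the `Progress` class and `request_progress` helper are hypothetical names, and SIGUSR1 is assumed to be available (Unix-like systems only).

```python
import os
import signal
import sys


class Progress:
    """Counts processed bytes and reports them when SIGUSR1 arrives."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self.done_bytes = 0
        self.reports = []  # kept so callers can inspect what was printed

    def install(self):
        # Must be called from the main thread; SIGUSR1 is Unix-only.
        signal.signal(signal.SIGUSR1, self._on_signal)

    def _on_signal(self, signum, frame):
        line = f"{self.done_bytes}/{self.total_bytes} bytes processed"
        self.reports.append(line)
        print(line, file=sys.stderr)

    def advance(self, nbytes):
        # The long-running job calls this after each chunk of work.
        self.done_bytes += nbytes


def request_progress(pid):
    """Python equivalent of `kill -USR1 <pid>` run from another terminal."""
    os.kill(pid, signal.SIGUSR1)
```

A long-running job would call `install()` once and `advance()` after each chunk; the job stays silent until someone asks it for a status line.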
        
       | wildmanx wrote:
       | > I woke and the backup was finished! I wiped the VPS
       | 
       | Just this single line made me scream in horror.
       | 
       | Let me get this straight. He launched some backup command, it
       | didn't output anything for hours, he suspected it hadn't done
       | anything, aborted with ctrl-c, and then learned that he aborted
       | it at 90%. Wipes the partial backup, starts again.
       | 
       | After _that_ experience, he blindly trusts the result of _the
       | same tool_, blindly wipes everything? Wtf.
        
         | wildmanx wrote:
         | Spoiler: That's not even part of his "lessons learned".
         | 
         | I know whom I won't hire for my company IT or devops or
         | whatnot.
        
           | bluehatbrit wrote:
           | One of the articles they link off to about choosing a
           | different provider explains that the author is a student. I
           | think it's reasonable to assume they don't have much
           | experience with this type of work. Hopefully they'll learn a
           | lot from this process and maybe from the comments here as
           | well. Maybe let's not write someone off before we understand
           | more about their experience and background, especially after
           | they shared an honest account of something they screwed up
           | and learned from.
        
           | gperciva wrote:
           | Do you normally hire high school students at your company?
        
           | abraae wrote:
           | Brutal honesty can be a wonderful trait all of its own in a
           | team however. We can assume the author is not a BS artist
           | from his very candid self-evaluation. This is a good thing he
           | should hang onto.
           | 
           | "My haphazard strategy resulted in three days of stress and
           | frustration as I clambered to restore a self-hosting empire
           | that I myself had reduced to ash."
        
             | argomo wrote:
             | Plus he's able to reflect on his own mistakes and
             | articulate what he learned from them. Would hire.
        
           | nick__m wrote:
           | Good idea to let him graduate from high school before hiring
           | him!
        
           | sigmonsays wrote:
           | I did things like this in high school too...
        
           | monkeywork wrote:
           | Given you phrased this "for my company IT or devops or
           | whatnot" I'm going to guess you have very little if any input
           | into who gets hired into those positions, or at least
           | shouldn't if you yourself don't know what the terminology for
           | them should be.
           | 
           | Secondly taking shots at people like this does not encourage
           | people to share learnable lessons like this - instead it
           | encourages them to hide it away. Again if you are a person in
           | management you'd think you'd know better than to encourage
           | people to hide mistakes instead of owning them and learning
           | from them.
           | 
           | Finally - you do realize this is a kid right? Not some
           | seasoned professional?
        
       | klodolph wrote:
       | These experiences are relatively common, we just don't see
       | writeups that often. Kudos to the author for writing this up. I
       | do disagree with some of the lessons here.
       | 
       | You want to switch operating systems on your server?
       | 
       | 1. Set up monitoring. If you don't have monitoring already, hack
       | something together with simple scripts.
       | 
       | 2. Start up a new server.
       | 
       | 3. Migrate services and data from your existing server to your
       | new server. Do this at whatever pace you feel is appropriate.
       | Point the monitoring scripts at your new server.
       | 
       | 4. Switch DNS records.
       | 
       | 5. Wait. You are not in a hurry to turn off the previous server.
       | Why not wait one or two months?
       | 
       | 6. Turn off the old server.
       | 
       | 7. Wait some more, and then delete the old server.
       | 
       | The idea here is that steps which might take your site offline
       | are easily reversible. For example, switching DNS records. It's
       | trivial to switch the DNS records back if your migration
       | unexpectedly failed. As much as reasonable, you want the ability
       | to go _backwards_ and undo the steps that you've done to get
       | back to a known good setup.
       | 
       | In particular, I would say that backups are usually not the right
       | tool for migration. This is missing from the lessons at the
       | bottom of the article. The way you get more confidence in your
       | backups is by doing restore tests into a sandbox environment, by
       | adding automated monitoring to your backups, etc. Trying to
       | address a lack of confidence by increasing the backup frequency
       | doesn't make sense. The backup frequency is the most trivial
       | thing to adjust and doesn't address deeper issues, like the fact
       | that you need to dump/restore databases properly and shouldn't
       | copy files from a live database. These issues are discovered
       | through restore testing.
       | 
       | The saying goes, "Nobody wants a backup system, everyone wants a
       | _restore_ system." If you are making backups but not testing
       | restores, you're gonna get bitten. Test the part of backups that
       | you care about--the ability to _restore data_--and don't test it
       | live. Test it in a sandbox.
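
For step 1 above ("hack something together with simple scripts"), a minimal sketch might look like the following. The function names `probe` and `check_all` are made up for illustration, and the `opener` parameter exists only so the check can be exercised without touching the network.

```python
import urllib.error
import urllib.request


def probe(url, opener=urllib.request.urlopen, timeout=5):
    """Return (ok, detail) for one HTTP check; never raises on network errors."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises for 4xx/5xx responses; report the status code.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, timeouts and refused connections.
        return False, f"unreachable: {exc}"
    return 200 <= status < 400, f"HTTP {status}"


def check_all(urls, opener=urllib.request.urlopen):
    """Probe every URL and return only the failures as {url: detail}."""
    failures = {}
    for url in urls:
        ok, detail = probe(url, opener=opener)
        if not ok:
            failures[url] = detail
    return failures
```

Run from cron against the old server first, then against the new one once services move (for example `check_all(["https://new.example.com/"])`, with a placeholder hostname). An empty result means nothing is obviously down, which is a much weaker claim than "everything works".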
        
         | livueta wrote:
         | This approach saved my bacon on a migration just yesterday. I
         | had gotten to testing files after having done a big baseline
         | rsync, stopping file services on the old server, and doing a
         | catch-up rsync incremental. Oh shit - one veracrypt container
         | is corrupted and won't mount. Turns out rsync diff updates and
         | mounted containers being written to don't play nice together.
         | 
         | Since the old machine was still sitting there with all the data
         | accessible, I was able to just blow away the corrupted volume,
         | confirm it was unmounted on the source, then copy the whole
         | thing over. If I just had a one-time copy that I'd thrown at b2
         | or something, I would have been very sad.
         | 
         | So, yeah, test restoring your backups. Even fancy checksumming
         | filesystems and shit won't save you from bad assumptions about
         | the integrity of your data.
        
           | lisper wrote:
           | Anything you do in production should always be reversible at
           | least one step back with a single command. You should always
           | be able to roll back to the last-known-working version at the
           | first sign of trouble with the new version. (Not that you
           | should necessarily respond to all trouble with a roll back
           | but you should always design your processes to keep this as
           | one of your options.)
           | 
           | The way I usually do this is that I have a production PUSH
           | and ROLLBACK script. There is a single symlink on the
           | production server that points to the current working version
           | of the code. PUSH makes a complete local copy of the current
           | working version, changes the symlink to point to that, then
           | pushes the new version, then changes the symlink to point to
           | that. ROLLBACK just changes the symlink back to the backup
           | copy. This is robust and easy to tweak to allow you to roll
           | back as many levels as you like, though I've never had to go
           | back more than one. If a problem doesn't manifest itself
           | immediately you probably want to fix it by going forward, not
           | back.
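
A rough sketch of the symlink-based PUSH/ROLLBACK idea described above. It is a variant, not the commenter's actual scripts: each version lives in its own directory under `releases/` instead of being copied in place, and the names `push`, `rollback`, `current` and `previous` are assumptions chosen for the example.

```python
import os
import shutil


def _point(link, target):
    """Atomically repoint symlink `link` at `target` (POSIX rename semantics)."""
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link)  # renames over the old link in a single step


def push(root, new_version_dir, name):
    """Copy a new version into root/releases/name and make it current."""
    root = os.path.abspath(root)  # absolute targets keep the links valid
    release = os.path.join(root, "releases", name)
    shutil.copytree(new_version_dir, release)
    current = os.path.join(root, "current")
    if os.path.islink(current):
        # Remember the last-known-working release before switching.
        _point(os.path.join(root, "previous"), os.readlink(current))
    _point(current, release)


def rollback(root):
    """Point `current` back at the release recorded by the last push."""
    root = os.path.abspath(root)
    previous = os.path.join(root, "previous")
    if not os.path.islink(previous):
        raise RuntimeError("no previous release recorded")
    _point(os.path.join(root, "current"), os.readlink(previous))
```

The web server is configured to serve from `root/current`, so a rollback is one rename and never involves copying data under pressure.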
        
             | klodolph wrote:
             | In the corporate world, this is also a great way to hold
             | vendors accountable.
             | 
             | Story: Company Red hires company Blue to take over the
             | company blog as part of a marketing initiative. Their point
             | of contact was the marketing team at company Red, and I'm
             | sure they thought, "We've nabbed a big customer, this is
             | going to be great."
             | 
             | However, the switch from the internal blog to the external
             | blog goes through the operations team at company Red. The
             | operations team tells company Blue, "Here is the plan for
             | rolling back to the internal blog at the touch of a
             | button."
             | 
             | Company Blue was suddenly _much_ more responsive to
             | questions from company Red.
        
       | lawrenceduk wrote:
       | This is a bit dumb and I feel like if as much effort had been
       | made doing it as writing the blog post, a more positive outcome
       | would have resulted.
        
       | pinkythepig wrote:
       | I feel like a lot of this article really should be about how bad
       | tarsnap is. Defaults to no progress updates, has no builtin
       | multithreading, has failures around large files, can't back up
       | symlinks properly, no builtin way to detect an in-progress
       | restore so you have to manually tell it to resume, etc.
       | 
       | If tarsnap didn't have a bad UX, this entire article would have
       | instead been 'the time I forgot to backup my .env file', none of
       | the other issues would have occurred.
        
       | geofft wrote:
       | > _My terminal sat empty for hours. There were no changes - the
       | process was running, but there was no feedback. I was nervous._
       | 
       | > _What if it failed silently? How can I check? What should I
       | do?_
       | 
       | On Linux, find the process ID and run e.g.
       | 
       | ls -l /proc/12345/fd
       | 
       | which will show you all the files currently open by the process.
       | For something like a backup of a whole directory, or something
       | generating a lot of output, run it again a few seconds later. If
       | it's opened different files, then you know it's making progress
       | and it's not stuck.
       | 
       | If it's something that operates on a single file, find the number
       | corresponding to that file in the list (the file descriptor), and
       | run e.g.
       | 
       | cat /proc/12345/fdinfo/3
       | 
       | which will output a "pos" field showing the position in the file,
       | in bytes. Compare it with the actual size of the file, and also
       | run it again a few seconds later to see how fast it's making
       | progress.
       | 
       | (You can also use strace, but that slows down your program and
       | potentially changes how it behaves in extremely unusual cases, so
       | it isn't the first thing I'd reach for unless I really think the
       | program is misbehaving and I want to see what it's doing in more
       | detail. And there are tools like iostat too, but they're
       | systemwide.)
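
The same check can be scripted. On Linux the `pos` field is reported under `/proc/<pid>/fdinfo/<fd>`, while `/proc/<pid>/fd/<fd>` is a symlink to the open file itself. A minimal sketch, with `parse_fdinfo` and `file_progress` as hypothetical helper names:

```python
import os


def parse_fdinfo(text):
    """Parse the 'key:<tab>value' lines of /proc/<pid>/fdinfo/<fd> into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def file_progress(pid, fd):
    """Return (position, size) in bytes for one open file of a process."""
    with open(f"/proc/{pid}/fdinfo/{fd}") as handle:
        position = int(parse_fdinfo(handle.read())["pos"])
    # stat() follows the /proc symlink to the file the process has open.
    size = os.stat(f"/proc/{pid}/fd/{fd}").st_size
    return position, size
```

Calling `file_progress(12345, 3)` twice, a few seconds apart, gives both the fraction done and a rough rate. It needs permission to read the target process's `/proc` entries, so run it as the same user or as root.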
        
       | dmuth wrote:
       | Someone tell me if I'm missing something, but isn't the whole
       | point of hosting things in a virtual environment so that when you
       | want to switch/upgrade OSes, you stand up a second server and
       | start migrating apps over one at a time?
       | 
       | I can't understand why that wasn't done here.
        
         | plorkyeran wrote:
         | Well the title is "The Wrong Way to Switch Operating Systems on
         | Your Server" so it's unsurprising that it's describing
         | completely the wrong way to do things.
         | 
         | The concerning part is that the author seems to have learned
         | the wrong lessons from doing things the wrong way.
        
           | jaywalk wrote:
           | Yep, the correct title would be "There is no Right Way to
           | Switch Operating Systems on Your Server." Just don't do it.
           | Fire up a new one, and migrate everything over.
        
       | Shank wrote:
       | Far and above, the best strategy is to spin up the new server,
       | scp/rsync the data to the new server directly from the old
       | server, and then boot services, and only decommission after
       | you've moved all DNS over and confirmed the new site is working.
       | Using Tarsnap for this is not only time consuming but needless
       | unless you already have it setup and working.
        
         | roywashere wrote:
         | Using tarsnap has one big advantage: it proves that you can
         | recover from your backups. Using this method caused the OP to
         | realize the backups were there but the secrets were missing!
         | 
         | I agree with keeping the old server in place until the new one
         | is working obv
        
       | fak3r wrote:
       | For me: rsync > tarsnap
       | 
       | I have a backup rsync script that parses a file I have that lists
       | every path I want backed up. Yes, this considers dotfiles, so the
       | poster's .env file would have been backed up. My script runs
       | locally, backs up to my main (home) server, and then does another
       | rsync to a 'cloud' server. Want to backup a new file or path? Add
       | it to the manifest file. Adding another server or device? Build
       | another manifest script, have rsync write to the same dir on the
       | server, it'll automatically get sync'd to the cloud server too.
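
A manifest-driven backup along these lines could be sketched as follows. This is not the commenter's script: the manifest format (one path per line, `#` for comments) and the function names are assumptions, and `rsync` is assumed to be installed on the machine running it.

```python
import subprocess


def read_manifest(text):
    """Return the paths listed in a manifest, skipping blanks and # comments."""
    paths = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            paths.append(line)
    return paths


def rsync_command(paths, destination):
    """Build one rsync call: -a preserves metadata, -R keeps full source paths."""
    # "--" stops a path that begins with "-" from being read as an option.
    return ["rsync", "-aR", "--", *paths, destination]


def run_backup(manifest_path, destination):
    """Back up every manifest entry; raises CalledProcessError if rsync fails."""
    with open(manifest_path) as handle:
        paths = read_manifest(handle.read())
    subprocess.run(rsync_command(paths, destination), check=True)
```

Because every path is listed explicitly, a dotfile such as `/srv/app/.env` is backed up only if someone remembered to add it, so the manifest itself deserves a review whenever a service is added.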
        
       | justin_oaks wrote:
       | Don't consider it a backup until you've successfully restored the
       | data from it.
       | 
       | The first thing I do after setting up a new data backup is test a
       | restore of the data. Only after that will I feel confident that
       | the backup procedure works right.
       | 
       | In the article author's case, an attempt to restore would have
       | caught the problem of the missing .env files and the large movie
       | files.
       | 
       | As for the Ctrl-C on both the backup and restore, you should
       | check your I/O (network and disk) before terminating a process.
       | Doing that would have confirmed that the process was still going,
       | and indicate the rate at which the process was going.
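
One way to make the restore test concrete is to restore into a scratch directory and compare it with the source by checksum. A minimal sketch, with hypothetical function names; it assumes both trees fit on local disk and contain no broken symlinks:

```python
import hashlib
import os


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:  # os.walk includes dotfiles such as .env
            full = os.path.join(dirpath, name)
            sha = hashlib.sha256()
            with open(full, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    sha.update(chunk)
            digests[os.path.relpath(full, root)] = sha.hexdigest()
    return digests


def compare_restore(source_root, restored_root):
    """Return (missing, changed): files absent from or different in the restore."""
    source = tree_digests(source_root)
    restored = tree_digests(restored_root)
    missing = sorted(set(source) - set(restored))
    changed = sorted(p for p in source if p in restored and source[p] != restored[p])
    return missing, changed
```

Run against the live data before wiping anything, a non-empty `missing` list is exactly the kind of result that would have flagged the absent `.env` files.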
        
         | spsesk117 wrote:
         | +1. `strace` can be very helpful here, to see if a process is
         | stuck waiting on something or whether it's just zooming along
         | with no output.
        
           | dspillett wrote:
           | `progress` (https://github.com/Xfennec/progress) and similar
           | can be very helpful too depending on the backup utilities
           | being used (in my case often involving rsync) even if the
           | processes normally have everything set to quiet so no
           | progress information is automatically forthcoming.
        
         | AnimalMuppet wrote:
         | I'd even go a step further. It's not a backup until you've
         | restored it _using different hardware_. You really want to know
         | that your tape (or whatever) can be read by a different tape
         | reader than the one that wrote it.
        
         | geoduck14 wrote:
         | > Don't consider it a backup until you've successfully restored
         | the data from it.
         | 
         | Ouch! I back up every 2 hours - should I REALLY restore from
         | each of those?
        
           | xgbi wrote:
           | No, not ALL backups, just the first backup you do: you test
           | that you can restore the data from it.
           | 
           | Then you can be confident that following backups will be
           | restorable.
           | 
           | Corollary: If you introduce a new folder for another service
           | in your backup, make sure you can restore it too.
        
             | foobarian wrote:
             | :itsatrap:
             | 
             | It needs to be done periodically.
        
           | AnimalMuppet wrote:
           | No, not if they're written by the same hardware using the
           | same process. You need to restore data from one of them, and
           | then another one every so often. Not every two hours, though.
        
           | dspillett wrote:
           | Maybe, it depends on your data-loss and time to restore
           | flexibility. Test regularly enough that you are confident it
           | works. If you don't test at all you can't be confident that
           | it works at all.
           | 
           | For instance, if you have a full/diff/log backup cycle for a
           | database, perhaps test restore each full backup.
           | Just be aware that if a full backup happens daily your
           | comfort zone for data loss in the case of a disaster needs to
           | be at least a day worth of work.
           | 
           | Also, if you backup a range of data and your method allows
           | for partial restores, you might do partial restores of key
           | information far more often than you test everything. You
           | could spread the testing load temporally: check the _really_
           | important parts every time and cycle through the other bits
           | less often.
           | 
           | Also if your concern is man-time, automate the process as
           | much as possible. My mail server (running Zimbra) has a small
           | replica in a VM that restores itself from the latest backup
           | once a day and sends me a message to say what the last
           | message in its queue was. This way if I don't get the
           | message, or it says the last message was too long ago, I know
           | something has gone wrong and the backup or the restore
           | failed. I manually log in regularly to inspect a little
           | deeper too, this is slow as the VM has far lower resources
           | than the main box. If properly separated from other resources
           | (so there are no single points of failure that can take it
           | out along with everything else - mine isn't currently that
           | well arranged) then this copy becomes an extra secondary
           | single-snapshot backup itself.
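
The "last message was too long ago" check can be reduced to a small freshness test run after each automated restore. A sketch under assumptions: the 26-hour threshold and the names `restore_is_fresh` and `status_line` are invented for the example, and timestamps are expected to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone


def restore_is_fresh(newest_item_time, now=None, max_age=timedelta(hours=26)):
    """True if the newest restored item is recent enough to trust the cycle."""
    now = now or datetime.now(timezone.utc)
    return now - newest_item_time <= max_age


def status_line(newest_item_time, now=None, max_age=timedelta(hours=26)):
    """One-line report suitable for a daily notification."""
    now = now or datetime.now(timezone.utc)
    age = now - newest_item_time
    verdict = "OK" if age <= max_age else "STALE"
    return f"{verdict}: newest restored item is {age} old"
```

As in the setup described above, the absence of the daily report is itself a signal, so the report should be sent on success as well as on failure.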
        
       | simlevesque wrote:
       | He forgot one mistake: never take down your existing
       | infrastructure before the new one is up.
        
       | isatty wrote:
       | Why tarsnap when wasabi or backblaze would be significantly
       | cheaper? You can just encrypt it yourself anyway.
       | 
       | Also I run my own personal infra and here's what I do:
       | 
       | * treat servers as cattle, not pets. This is really important.
       | Have mandatory reboots, never be afraid of reboots.
       | 
       | * preferably do things with an automation method, I use ansible
       | for n=5 but pick whatever you like
       | 
       | * have SOME monitoring. It's not too hard to throw up
       | prom+grafana so get on it early.
       | 
       | * VPN instead of securing internal services. Attack surface is
       | way too high if you've too many services. Just throw them all
       | behind a vpn and expose selectively. I use WireGuard.
       | 
       | * personally: don't self host critical infrastructure. I can't
       | afford downtime on email etc so I'd rather just pay someone to host
       | that. Personal infra is for fun, not a second job (and I'm an
       | SRE).
        
       | ay wrote:
       | The moment I see:
       | 
       | - shutdown
       | - backup + restore anew
       | - bring up
       | 
       | This triggers shivers. Nononono.
       | 
       | - backup (while your current gif is running)
       | - restore and test-verify (via a separate vhost)
       | - delta-backup, delta-restore and test-verify
       | - stop, delta-backup, delta restore, start at new place
       | 
       | This way you minimize the amount of unexpected. Of course can be
       | infeasible in some contexts...
        
       | juped wrote:
       | if you get nervous and ctrl-c things with no output, try using -v
       | mode
        
         | fake-name wrote:
         | This is fine, except for the annoying tools that think they
         | need to then print their software version, and exit.
        
       | teekert wrote:
       | What a nightmare! I swear I had something similar with rsync on a
       | mac once, I was very certain it finished, ran again and it
       | reports it's done. I migrate and I'm missing all these files!
       | Although it was probably my fault, I really don't trust rsync
       | anymore... Maybe it had to do with HFS+ and those strange
       | aperture libraries but man it ruined my day (week).
       | 
       | Sure migrating is 100 times more relaxed when you have the old
       | system running, but sometimes you need to reinstall. I had only
       | one MacBook, now I only have 1 server.
       | 
       | What you could do in that case is just install a new disc, unplug
       | the old one until the new system is running. It's worth the
       | money and effort.
       | 
       | I'm looking to install NixOS to my Home Server next week. All my
       | personal infra is in Docker compose, on Ubuntu 20.04 atm. I only
       | have one m.2 slot in the server and I don't want to buy a second
       | drive just for this... So I'm sweating already. Maybe I should
       | first migrate to my nuc, then back to the new server... hmmmm...
        
       | ajnin wrote:
       | My current backup strategy is to backup the whole filesystem. I
       | run services in actual VMs, not docker containers, with disks
       | mounted from LVM volumes which allows me to take snapshots and
       | back them up live without needing to shut them down. I'm using
       | bup to do the actual backups to a server I'm keeping in my home.
       | I wrote a few custom scripts to backup and restore servers, and
       | keep a history of the last x days, y weeks and z months. That way
       | it gives me more time to figure out if something's wrong, as it's
       | hobby stuff that I'm not checking every day.
       | 
       | My advice for OP would be to 1/ ditch tarsnap. A backup tool
       | that runs for hours without any feedback? A restore tool that
       | fails if the files are too big? Everything extremely slow? Just
       | forget it. 2/ keep more than 3 days of backups. It's too short if
       | you make a mistake, it took 3 days to recover from this one
       | already. 3/ backup everything. Don't try to pick and choose
       | files, you're likely to forget something, and if not now then
       | some time later when you create a new file but forget to add it
       | to the list of things to back up.
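
A "last x days, y weeks and z months" history can be expressed as a small selection function. This is a sketch of one possible policy, not the commenter's scripts; `backups_to_keep` and its default counts are assumptions, and weeks are taken to be ISO weeks.

```python
def backups_to_keep(backup_dates, days=7, weeks=4, months=6):
    """Return the set of datetime.date values to keep under the policy."""
    newest_first = sorted(set(backup_dates), reverse=True)
    keep = set()
    rules = (
        (days, lambda d: d),                     # one per calendar day
        (weeks, lambda d: d.isocalendar()[:2]),  # one per ISO (year, week)
        (months, lambda d: (d.year, d.month)),   # one per calendar month
    )
    for limit, bucket_of in rules:
        seen = []
        for d in newest_first:
            bucket = bucket_of(d)
            if bucket not in seen:
                if len(seen) == limit:
                    break  # this rule already has enough buckets
                seen.append(bucket)
                keep.add(d)  # newest backup in each bucket is kept
    return keep
```

Anything not in the returned set is a candidate for pruning. Printing the prune list for a while before deleting for real is a cheap way to check that the policy does what was intended.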
        
       | simonblack wrote:
       | On my server, I have three 'root' partitions. One for general
       | day-to-day use, one for a backup if something catastrophic
       | happens to my main system, and one for experimentation. The extra
       | disk space taken up for my having two extra 'root' partitions is
       | a miserable 40-50 gig.
       | 
       | But I know that I can swap operating systems over almost
       | instantaneously and then back again just as quickly if I did
       | something wrong.
       | 
       | Great peace of mind for practically no cost.
       | 
       | The second error I see in the article is to use software that we
       | haven't used previously for something important. My first wife
       | had the habit of trying new recipes when we had a dinner-party. I
       | tried to tell her repeatedly to try the recipe on us first, then
       | she would have it down pat when she wanted to impress.
       | 
       | The third error of course was the need to have restorable daily
       | backups and the use of them to restore the system when need be,
       | associated with modularity of the system.
       | 
       | I back up my whole system daily, but the most important part of
       | that is not backing up the distro itself (we have re-installs for
       | that) but backing up all the config files, all the databases, all
       | the local binaries, and a current list of all of the installed
       | distro packages. I can replace the whole operating system from a
       | complete wipe-out in less than two hours.
       | 
       | I store these backups in a pseudo-exponential policy. I have more
       | recent backups, fewer older backups. Currently I have 15 backups
       | covering 9 years, with five of those covering just the last 3
       | weeks, and four covering the last five days. To augment this, I
       | have a monthly snapshot backup also stashed away.
       | 
       | The other stuff, personal docs etc, is deliberately kept small.
       | Total daily backup of base system and /home is approximately 12
       | gigs. That is easily transported on a USB stick.
       | 
       | I don't store music, photos, magazine .PDFs, old software, etc in
       | my /home directory. That stuff all goes in an archive directory
       | that's write-once, and store (practically) forever. That gets
       | rsynced to two external USB drives daily. Most days, there's
       | practically nothing that gets transferred out.
       | 
       | Having several times lost much valuable data, I suppose I am
       | really paranoid, but I still think I haven't been paranoid
       | enough.
        
       | ghostly_s wrote:
       | Spent three days fighting with this and didn't think to try the
       | -v flag on his apparently-hanging process?
        
         | rhn_mk1 wrote:
         | The author may not have the necessary familiarity with it.
         | Experience is gained via mistakes too.
        
         | fak3r wrote:
         | My first thought. Also, always run long-running jobs like this
         | in Screen or Tmux! As it is, it's a good learning post that
         | others should be able to build on (and don't hit control-c just
         | bc it's taking too long!)
        
       ___________________________________________________________________
       (page generated 2021-06-22 23:02 UTC)