[HN Gopher] The wrong way to switch operating systems on your se...
___________________________________________________________________
The wrong way to switch operating systems on your server
Author : ohjeez
Score : 48 points
Date : 2021-06-19 17:48 UTC (3 days ago)
(HTM) web link (figbert.com)
(TXT) w3m dump (figbert.com)
| gperciva wrote:
| Wow, that's much more technically advanced than I was as a
| teenager! Way to go!
|
| To print progress with tarsnap 1.0.39, send it a SIGUSR1 or
| SIGINFO. On FreeBSD, you can do this by pressing ctrl-t. On
| Linux, you have to use the unfortunately-named `kill` or
| `killall` command, such as
|
| killall -SIGUSR1 tarsnap
|
| https://www.tarsnap.com/tips.html#check-current
|
| (Note that Tarsnap is not responsible for naming the unix `kill`
| or `killall` commands.)
|
| In the unreleased git version of tarsnap, there's a
| `--progress-bytes SIZE` option, which prints a progress message
| after every SIZE bytes are processed.
|
| As a general note: the tarsnap-users mailing list is a great
| place to ask for tips. As you mentioned in your lessons learned,
| some of the options could have helped a lot (such as `--recover`)
| https://www.tarsnap.com/lists.html
|
| (Disclaimer: I'm employed by Tarsnap Backup Inc.)
| wildmanx wrote:
| > I woke and the backup was finished! I wiped the VPS
|
| Just this single line made me scream in horror.
|
| Let me get this straight. He launched some backup command, it
| didn't output anything for hours, he suspected it hadn't done
| anything, aborted with ctrl-c, and then learned that he aborted
| it at 90%. Wipes the partial backup, starts again.
|
| After _that_ experience, he blindly trusts the result of _the
| same tool_, blindly wipes everything? Wtf.
| wildmanx wrote:
| Spoiler: That's not even part of his "lessons learned".
|
| I know whom I won't hire for my company IT or devops or
| whatnot.
| bluehatbrit wrote:
| One of the articles they link off to about choosing a
| different provider explains that the author is a student. I
| think it's reasonable to assume they don't have much
| experience with this type of work. Hopefully they'll learn a
| lot from this process and maybe from the comments here as
| well. Maybe let's not write someone off before we understand
| more about their experience and background, especially after
| they shared an honest account of something they screwed up
| and learned from.
| gperciva wrote:
| Do you normally hire high school students at your company?
| abraae wrote:
| Brutal honesty can be a wonderful trait all of its own in a
| team however. We can assume the author is not a BS artist
| from his very candid self-evaluation. This is a good thing he
| should hang onto.
|
| "My haphazard strategy resulted in three days of stress and
| frustration as I clambered to restore a self-hosting empire
| that I myself had reduced to ash."
| argomo wrote:
| Plus he's able to reflect on his own mistakes and
| articulate what he learned from them. Would hire.
| nick__m wrote:
| Good idea to let him graduate from highschool before hiring
| him !
| sigmonsays wrote:
| I did things like this in high school too...
| monkeywork wrote:
| Given you phrased this "for my company IT or devops or
| whatnot" I'm going to guess you have very little if any input
| into who gets hired into those positions, or at least
| shouldn't if you yourself don't know what the terminology for
| them should be.
|
| Secondly taking shots at people like this does not encourage
| people to share learnable lessons like this - instead it
| encourages them to hide it away. Again, if you are a person in
| management you'd think you'd know better than to encourage
| people to hide mistakes instead of owning them and learning
| from them.
|
| Finally - you do realize this is a kid right? Not some
| seasoned professional?
| klodolph wrote:
| These experiences are relatively common, we just don't see
| writeups that often. Kudos to the author for writing this up. I
| do disagree with some of the lessons here.
|
| You want to switch operating systems on your server?
|
| 1. Set up monitoring. If you don't have monitoring already, hack
| something together with simple scripts.
|
| 2. Start up a new server.
|
| 3. Migrate services and data from your existing server to your
| new server. Do this at whatever pace you feel is appropriate.
| Point the monitoring scripts at your new server.
|
| 4. Switch DNS records.
|
| 5. Wait. You are not in a hurry to turn off the previous server.
| Why not wait one or two months?
|
| 6. Turn off the old server.
|
| 7. Wait some more, and then delete the old server.
|
| The idea here is that steps which might take your site offline
| are easily reversible. For example, switching DNS records. It's
| trivial to switch the DNS records back if your migration
| unexpectedly failed. As much as reasonable, you want the ability
| to go _backwards_ and undo the steps that you've done to get
| back to a known good setup.
|
| In particular, I would say that backups are usually not the right
| tool for migration. This is missing from the lessons at the
| bottom of the article. The way you get more confidence in your
| backups is by doing restore tests into a sandbox environment, by
| adding automated monitoring to your backups, etc. Trying to
| address a lack of confidence by increasing the backup frequency
| doesn't make sense. The backup frequency is the most trivial
| thing to adjust and doesn't address deeper issues, like the fact
| that you need to dump/restore databases properly and shouldn't
| copy files from a live database. These issues are discovered
| through restore testing.
|
| The saying goes, "Nobody wants a backup system, everyone wants a
| _restore_ system." If you are making backups but not testing
| restores, you're gonna get bitten. Test the part of backups that
| you care about--the ability to _restore data_--and don't test it
| live. Test it in a sandbox.
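|
| For step 1, a "hack something together" monitor can be as small as
| a TCP connect check. A minimal sketch, assuming the services of
| interest listen on known ports; the `check_tcp` and `report`
| helpers and the target names are hypothetical, not taken from the
| article or from any monitoring tool:

```python
import socket


def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def report(targets: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Check every named target and return a name -> reachable mapping."""
    return {name: check_tcp(host, port) for name, (host, port) in targets.items()}
```

| Calling `report({"web": ("old.example.org", 443)})` from cron and
| mailing any False entries would cover the old server; repointing
| the (hypothetical) host names covers the new one in step 3.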
| livueta wrote:
| This approach saved my bacon on a migration just yesterday. I
| had gotten to testing files after having done a big baseline
| rsync, stopping file services on the old server, and doing a
| catch-up rsync incremental. Oh shit - one veracrypt container
| is corrupted and won't mount. Turns out rsync diff updates and
| mounted containers being written to don't play nice together.
|
| Since the old machine was still sitting there with all the data
| accessible, I was able to just blow away the corrupted volume,
| confirm it was unmounted on the source, then copy the whole
| thing over. If I just had a one-time copy that I'd thrown at b2
| or something, I would have been very sad.
|
| So, yeah, test restoring your backups. Even fancy checksumming
| filesystems and shit won't save you from bad assumptions about
| the integrity of your data.
| lisper wrote:
| Anything you do in production should always be reversible at
| least one step back with a single command. You should always
| be able to roll back to the last-known-working version at the
| first sign of trouble with the new version. (Not that you
| should necessarily respond to all trouble with a roll back
| but you should always design your processes to keep this as
| one of your options.)
|
| The way I usually do this is that I have a production PUSH
| and ROLLBACK script. There is a single symlink on the
| production server that points to the current working version
| of the code. PUSH makes a complete local copy of the current
| working version, changes the symlink to point to that, then
| pushes the new version, then changes the symlink to point to
| that. ROLLBACK just changes the symlink back to the backup
| copy. This is robust and easy to tweak to allow you to roll
| back as many levels as you like, though I've never had to go
| back more than one. If a problem doesn't manifest itself
| immediately you probably want to fix it by going forward, not
| back.
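|
| A minimal sketch of the symlink idea above. It is a variation,
| not the exact PUSH/ROLLBACK scripts described: each release gets
| its own directory, and a second symlink remembers the last-known-
| working one. The `releases`, `current` and `previous` names are
| assumptions made for illustration:

```python
import os
import shutil
from pathlib import Path


def switch_symlink(link: Path, target: Path) -> None:
    """Repoint `link` at `target` by renaming a temporary symlink over it."""
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    tmp.symlink_to(target)
    os.replace(tmp, link)  # rename is atomic on POSIX filesystems


def push(root: Path, new_release: Path, name: str) -> None:
    """Copy new_release into root/releases/name and make it current."""
    dest = root / "releases" / name
    shutil.copytree(new_release, dest)
    current = root / "current"
    if current.is_symlink():
        # Remember the last-known-working version for rollback.
        switch_symlink(root / "previous", Path(os.readlink(current)))
    switch_symlink(current, dest)


def rollback(root: Path) -> None:
    """Point `current` back at the release recorded in `previous`."""
    previous = root / "previous"
    if not previous.is_symlink():
        raise RuntimeError("no previous release to roll back to")
    switch_symlink(root / "current", Path(os.readlink(previous)))
```

| The web server or service would be configured to read from
| `current`, so a rollback is a single call and never copies data.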
| klodolph wrote:
| In the corporate world, this is also a great way to hold
| vendors accountable.
|
| Story: Company Red hires company Blue to take over the
| company blog as part of a marketing initiative. Their point
| of contact was the marketing team at company Red, and I'm
| sure they thought, "We've nabbed a big customer, this is
| going to be great."
|
| However, the switch from the internal blog to the external
| blog goes through the operations team at company Red. The
| operations team tells company Blue, "Here is the plan for
| rolling back to the internal blog at the touch of a
| button."
|
| Company Blue was suddenly _much_ more responsive to
| questions from company Red.
| lawrenceduk wrote:
| This is a bit dumb and I feel like if as much effort had been
| made doing it as writing the blog post, a more positive outcome
| would have resulted.
| pinkythepig wrote:
| I feel like a lot of this article really should be about how bad
| tarsnap is. Defaults to no progress updates, has no builtin
| multithreading, has failures around large files, can't back up
| symlinks properly, no builtin way to detect an in progress restore
| so you have to manually tell it to resume, etc.
|
| If tarsnap didn't have a bad UX, this entire article would have
| instead been 'the time I forgot to back up my .env file', none of
| the other issues would have occurred.
| geofft wrote:
| > _My terminal sat empty for hours. There were no changes - the
| process was running, but there was no feedback. I was nervous._
|
| > _What if it failed silently? How can I check? What should I
| do?_
|
| On Linux, find the process ID and run e.g.
|
| ls -l /proc/12345/fd
|
| which will show you all the files currently open by the process.
| For something like a backup of a whole directory, or something
| generating a lot of output, run it again a few seconds later. If
| it's opened different files, then you know it's making progress
| and it's not stuck.
|
| If it's something that operates on a single file, find the number
| corresponding to that file in the list (the file descriptor), and
| run e.g.
|
| cat /proc/12345/fdinfo/3
|
| which will output a "pos" field showing the position in the file,
| in bytes. Compare it with the actual size of the file, and also
| run it again a few seconds later to see how fast it's making
| progress.
|
| (You can also use strace, but that slows down your program and
| potentially changes how it behaves in extremely unusual cases, so
| it isn't the first thing I'd reach for unless I really think the
| program is misbehaving and I want to see what it's doing in more
| detail. And there are tools like iostat too, but they're
| systemwide.)
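|
| The position check can be scripted. A minimal sketch, assuming
| Linux and a process you are allowed to inspect; it reads the
| "pos" field from /proc/<pid>/fdinfo/<fd> and the file size via
| /proc/<pid>/fd/<fd>. The helper names are hypothetical:

```python
from pathlib import Path


def parse_fdinfo(text: str) -> dict[str, str]:
    """Parse the 'key:<tab>value' lines of /proc/<pid>/fdinfo/<fd>."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def file_progress(pid: int, fd: int) -> tuple[int, int]:
    """Return (position, size) in bytes for an open regular file (Linux only)."""
    proc = Path("/proc") / str(pid)
    pos = int(parse_fdinfo((proc / "fdinfo" / str(fd)).read_text())["pos"])
    size = (proc / "fd" / str(fd)).stat().st_size  # stat follows the fd symlink
    return pos, size
```

| Calling `file_progress(12345, 3)` twice, a few seconds apart,
| gives both how far along the process is and how fast it is moving.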
| dmuth wrote:
| Someone tell me if I'm missing something, but isn't the whole
| point of hosting things in a virtual environment so that when you
| want to switch/upgrade OSes, you stand up a second server and
| start migrating apps over one at a time?
|
| I can't understand why that wasn't done here.
| plorkyeran wrote:
| Well the title is "The Wrong Way to Switch Operating Systems on
| Your Server" so it's unsurprising that it's describing
| completely the wrong way to do things.
|
| The concerning part is that the author seems to have learned
| the wrong lessons from doing things the wrong way.
| jaywalk wrote:
| Yep, the correct title would be "There is no Right Way to
| Switch Operating Systems on Your Server". Just don't do it.
| Fire up a new one, and migrate everything over.
| Shank wrote:
| Far and away, the best strategy is to spin up the new server,
| scp/rsync the data to the new server directly from the old
| server, and then boot services, and only decommission after
| you've moved all DNS over and confirmed the new site is working.
| Using Tarsnap for this is not only time consuming but needless
| unless you already have it set up and working.
| roywashere wrote:
| Using tarsnap has one big advantage: it proves that you can
| recover from your backups. Using this method caused the OP to
| realize the backups were there but the secrets were missing!
|
| I agree with keeping the old server in place until the new one
| is working obv
| fak3r wrote:
| For me: rsync > tarsnap
|
| I have a backup rsync script that parses a file I have that lists
| every path I want backed up. Yes, this considers dotfiles, so the
| poster's .env file would have been backed up. My script runs
| locally, backs up to my main (home) server, and then does another
| rsync to a 'cloud' server. Want to back up a new file or path? Add
| it to the manifest file. Adding another server or device? Build
| another manifest script, have rsync write to the same dir on the
| server, it'll automatically get sync'd to the cloud server too.
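|
| A minimal sketch of a manifest-driven rsync wrapper along these
| lines. This is not the commenter's script: the manifest format
| (one path per line, "#" for comments) and the function names are
| assumptions. Only the standard rsync options `-a` and
| `--relative` are used:

```python
import subprocess
from pathlib import Path


def read_manifest(manifest: Path) -> list[str]:
    """Return the paths listed in the manifest, skipping blanks and # comments."""
    paths = []
    for line in manifest.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            paths.append(line)
    return paths


def rsync_command(paths: list[str], dest: str) -> list[str]:
    """Build an rsync invocation for the given source paths and destination."""
    # -a: archive mode; --relative: recreate each full source path under dest.
    return ["rsync", "-a", "--relative", *paths, dest]


def run_backup(manifest: Path, dest: str) -> None:
    """Run rsync for every manifest entry; raise if rsync exits non-zero."""
    subprocess.run(rsync_command(read_manifest(manifest), dest), check=True)
```

| Because the paths are listed explicitly, a dotfile such as a .env
| file is copied like any other entry, as long as it or its parent
| directory is in the manifest.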
| justin_oaks wrote:
| Don't consider it a backup until you've successfully restored the
| data from it.
|
| The first thing I do after setting up a new data backup is test a
| restore of the data. Only after that will I feel confident that
| the backup procedure works right.
|
| In the article author's case, an attempt to restore would have
| caught the problem of the missing .env files and the large movie
| files.
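|
| A minimal sketch of such a restore test, assuming the backup has
| been restored into a scratch directory next to the live data. It
| compares SHA-256 digests of every file, dotfiles included, and
| reports what is missing or different. The function names are
| hypothetical:

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files don't fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its digest (dotfiles included)."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_restore(source: Path, restored: Path) -> tuple[list[str], list[str]]:
    """Return (missing, changed): files absent from, or different in, the restore."""
    src, dst = tree_digests(source), tree_digests(restored)
    missing = sorted(p for p in src if p not in dst)
    changed = sorted(p for p in src if p in dst and src[p] != dst[p])
    return missing, changed
```

| Files that legitimately changed on the live system since the
| backup ran will also show up as "changed", so this is best run
| right after a backup or against a quiesced copy.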
|
| As for the Ctrl-C on both the backup and restore, you should
| check your I/O (network and disk) before terminating a process.
| Doing that would have confirmed that the process was still going,
| and indicate the rate at which the process was going.
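|
| On Linux, one way to do the disk side of that check is to sample
| the per-process counters in /proc/<pid>/io twice. A minimal
| sketch, assuming you have permission to read that file for the
| process in question; the helper names are hypothetical:

```python
import time
from pathlib import Path


def parse_proc_io(text: str) -> dict[str, int]:
    """Parse the 'name: number' lines of /proc/<pid>/io into a dict."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            counters[name.strip()] = int(value)
    return counters


def io_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Per-counter difference between two samples."""
    return {name: after[name] - before[name] for name in before if name in after}


def sample_io(pid: int, interval: float = 5.0) -> dict[str, int]:
    """Sample /proc/<pid>/io twice, `interval` seconds apart (Linux only)."""
    path = Path("/proc") / str(pid) / "io"
    before = parse_proc_io(path.read_text())
    time.sleep(interval)
    return io_delta(before, parse_proc_io(path.read_text()))
```

| Non-zero deltas in fields such as read_bytes or write_bytes
| suggest the process is still working; dividing by the interval
| gives a rough rate.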
| spsesk117 wrote:
| +1. `strace` can be very helpful here, to see if a process is
| stuck waiting on something or whether it's just zooming along
| with no output.
| dspillett wrote:
| `progress` (https://github.com/Xfennec/progress) and similar
| can be very helpful too depending on the backup utilities
| being used (in my case often involving rsync) even if the
| processes normally have everything set to quiet so no
| progress information is automatically forthcoming.
| AnimalMuppet wrote:
| I'd even go a step further. It's not a backup until you've
| restored it _using different hardware_. You really want to know
| that your tape (or whatever) can be read by a different tape
| reader than the one that wrote it.
| geoduck14 wrote:
| > Don't consider it a backup until you've successfully restored
| the data from it.
|
| Ouch! I back up every 2 hours - should I REALLY restore from
| each of those?
| xgbi wrote:
| No, not ALL backups, just the first backup you do: you test
| that you can restore the data from it.
|
| Then you can be confident that following backups will be
| restorable.
|
| Corollary: If you introduce a new folder for another service
| in your backup, make sure you can restore it too.
| foobarian wrote:
| :itsatrap:
|
| It needs to be done periodically.
| AnimalMuppet wrote:
| No, not if they're written by the same hardware using the
| same process. You need to restore data from one of them, and
| then another one every so often. Not every two hours, though.
| dspillett wrote:
| Maybe, it depends on your data-loss and time to restore
| flexibility. Test regularly enough that you are confident it
| works. If you don't test at all you can't be confident that
| it works at all.
|
| For instance, if you have a full/diff/log backup cycle for a
| database, perhaps test restore each full backup.
| Just be aware that if a full backup happens daily your
| comfort zone for data loss in the case of a disaster needs to
| be at least a day worth of work.
|
| Also, if you back up a range of data and your method allows
| for partial restores, you might do partial restores of key
| information far more often than you test everything. You
| could spread the testing load temporally: check the _really_
| important parts every time and cycle through the other bits less
| often.
|
| Also if your concern is man-time, automate the process as
| much as possible. My mail server (running Zimbra) has a small
| replica in a VM that restores itself from the latest backup
| once a day and sends me a message to say what the last
| message in its queue was. This way if I don't get the
| message, or it says the last message was too long ago, I know
| something has gone wrong and the backup or the restore
| failed. I manually log in regularly to inspect a little
| deeper too, this is slow as the VM has far lower resources
| than the main box. If properly separated from other resources
| (so there are no single points of failure that can take it
| out along with everything else - mine isn't currently that
| well arranged) then this copy becomes an extra secondary
| single-snapshot backup itself.
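|
| The "no message, or a message that is too old" logic can be
| sketched as a small check. This is an illustration, not the
| Zimbra setup described: the function name and the 26-hour
| thresholds (a daily job plus some slack) are assumptions:

```python
from datetime import datetime, timedelta
from typing import Optional


def restore_check(
    last_report: Optional[datetime],
    newest_message: Optional[datetime],
    now: datetime,
    max_report_age: timedelta = timedelta(hours=26),
    max_data_age: timedelta = timedelta(hours=26),
) -> str:
    """Classify the state of a daily automated test restore."""
    # No report at all, or none recently: the job itself did not run.
    if last_report is None or now - last_report > max_report_age:
        return "no recent report: the backup or the restore may have failed"
    # The job ran, but what it restored is older than expected.
    if newest_message is None or now - newest_message > max_data_age:
        return "restore ran, but the restored data is stale"
    return "ok"
```

| The first branch matters most: a check that only inspects the
| reports it receives cannot notice when reports stop arriving.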
| simlevesque wrote:
| He forgot one mistake: never take down your existing
| infrastructure before the new one is up.
| isatty wrote:
| Why tarsnap when wasabi or backblaze would be significantly
| cheaper? You can just encrypt it yourself anyway.
|
| Also I run my own personal infra and here's what I do:
|
| * treat servers as cattle, not pets. This is really important.
| Have mandatory reboots, never be afraid of reboots.
|
| * preferably do things with an automation method, I use ansible
| for n=5 but pick whatever you like
|
| * have SOME monitoring. It's not too hard to throw up
| prom+grafana so get on it early.
|
| * VPN instead of securing internal services. Attack surface is
| way too high if you've too many services. Just throw them all
| behind a vpn and expose selectively. I use WireGuard.
|
| * personally: don't self host critical infrastructure. I can't
| afford downtime on email etc so I'd rather just pay someone to host
| that. Personal infra is for fun, not a second job (and I'm an
| SRE).
| ay wrote:
| The moment I see:
|
| - shutdown
| - backup + restore anew
| - bring up
|
| This triggers shivers. Nononono.
|
| - backup (while your current gig is running)
| - restore and test-verify (via a separate vhost)
| - delta-backup, delta-restore and test-verify
| - stop, delta-backup, delta-restore, start at new place
|
| This way you minimize the number of surprises. Of course, this can
| be infeasible in some contexts...
| juped wrote:
| if you get nervous and ctrl-c things with no output, try using -v
| mode
| fake-name wrote:
| This is fine, except for the annoying tools that think they
| need to then print their software version, and exit.
| teekert wrote:
| What a nightmare! I swear I had something similar with rsync on a
| mac once, I was very certain it finished, ran again and it
| reports it's done. I migrate and I'm missing all these files!
| Although it was probably my fault, I really don't trust rsync
| anymore... Maybe it had to do with HFS+ and those strange
| aperture libraries but man it ruined my day (week).
|
| Sure migrating is 100 times more relaxed when you have the old
| system running, but sometimes you need to reinstall. I had only
| one MacBook, now I only have 1 server.
|
| What you could do in that case is just install a new disc, unplug
| the old one until the new system is running. It's worth the
| money and effort.
|
| I'm looking to install NixOS to my Home Server next week. All my
| personal infra is in Docker compose, on Ubuntu 20.04 atm. I only
| have one m.2 slot in the server and I don't want to buy a second
| drive just for this... So I'm sweating already. Maybe I should
| first migrate to my nuc, then back to the new server... hmmmm...
| ajnin wrote:
| My current backup strategy is to back up the whole filesystem. I
| run services in actual VMs, not docker containers, with disks
| mounted from LVM volumes which allows me to take snapshots and
| back them up live without needing to shut them down. I'm using
| bup to do the actual backups to a server I'm keeping in my home.
| I wrote a few custom scripts to back up and restore servers, and
| keep a history of the last x days, y weeks and z months. That way
| it gives me more time to figure out if something's wrong, as it's
| hobby stuff that I'm not checking every day.
|
| My advice for OP would be to 1/ ditch tarsnap. A backup tool
| that runs for hours without any feedback? A restore tool that
| fails if the files are too big? Everything extremely slow? Just
| forget it. 2/ keep more than 3 days of backups. It's too short if
| you make a mistake, it took 3 days to recover from this one
| already. 3/ backup everything. Don't try to pick and choose
| files, you're likely to forget something, and if not now then
| some time later when you create a new file but forget to add it
| to the list of things to back up.
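|
| The "last x days, y weeks and z months" history mentioned above
| can be sketched as a pruning rule. This is not the commenter's
| script; it assumes one backup per date and keeps the newest
| backup in each of the most recent day, ISO-week and month
| buckets. The function name is hypothetical:

```python
from datetime import date


def backups_to_keep(dates: list[date], days: int, weeks: int, months: int) -> set[date]:
    """Keep the newest backup in each of the latest day/week/month buckets."""
    keep: set[date] = set()
    rules = (
        (days, lambda d: d),                     # one bucket per calendar day
        (weeks, lambda d: d.isocalendar()[:2]),  # (ISO year, ISO week)
        (months, lambda d: (d.year, d.month)),   # (year, month)
    )
    newest_first = sorted(set(dates), reverse=True)
    for limit, bucket_of in rules:
        seen = []
        for d in newest_first:
            bucket = bucket_of(d)
            if bucket in seen:
                continue  # a newer backup already represents this bucket
            if len(seen) == limit:
                break     # enough buckets for this rule
            seen.append(bucket)
            keep.add(d)
    return keep
```

| Everything not in the returned set is a candidate for deletion;
| printing that list before deleting anything is a cheap safeguard.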
| simonblack wrote:
| On my server, I have three 'root' partitions. One for general
| day-to-day use, one for a backup if something catastrophic
| happens to my main system, and one for experimentation. The extra
| disk space taken up for my having two extra 'root' partitions is
| a miserable 40-50 gig.
|
| But I know that I can swap operating systems over almost
| instantaneously and then back again just as quickly if I did
| something wrong.
|
| Great peace of mind for practically no cost.
|
| The second error I see in the article is to use software that we
| haven't used previously for something important. My first wife
| had the habit of trying new recipes when we had a dinner-party. I
| tried to tell her repeatedly to try the recipe on us first, then
| she would have it down pat when she wanted to impress.
|
| The third error of course was the need to have restorable daily
| backups and the use of them to restore the system when need be,
| associated with modularity of the system.
|
| I back up my whole system daily, but the most important part of
| that is not backing up the distro itself (we have re-installs for
| that) but backing up all the config files, all the databases, all
| the local binaries, and a current list of all of the installed
| distro packages. I can replace the whole operating system from a
| complete wipe-out in less than two hours.
|
| I store these backups in a pseudo-exponential policy. I have more
| recent backups, fewer older backups. Currently I have 15 backups
| covering 9 years, with five of those covering just the last 3
| weeks, and four covering the last five days. To augment this, I
| have a monthly snapshot backup also stashed away.
|
| The other stuff, personal docs etc, is deliberately kept small.
| Total daily backup of base system and /home is approximately 12
| gigs. That is easily transported on a USB stick.
|
| I don't store music, photos, magazine .PDFs, old software, etc in
| my /home directory. That stuff all goes in an archive directory
| that's write-once, and store (practically) forever. That gets
| rsynced to two external USB drives daily. Most days, there's
| practically nothing that gets transferred out.
|
| Having several times lost much valuable data, I suppose I am
| really paranoid, but I still think I haven't been paranoid
| enough.
| ghostly_s wrote:
| Spent three days fighting with this and didn't think to try the
| -v flag on his apparently-hanging process?
| rhn_mk1 wrote:
| The author may not have the necessary familiarity with it.
| Experience is gained via mistakes too.
| fak3r wrote:
| My first thought too. Also, always run long-running jobs like
| this in Screen or Tmux! As it is, it's a good learning post that
| others should be able to build on (and don't hit control-c just
| bc it's taking too long!)
___________________________________________________________________
(page generated 2021-06-22 23:02 UTC)