[HN Gopher] Show HN: SadServers - Test your Linux troubleshootin...
___________________________________________________________________
Show HN: SadServers - Test your Linux troubleshooting skills
Hello, I'm building SadServers.com, a SaaS where users can test
their Linux troubleshooting skills on real Linux servers in a
"Capture the Flag" fashion. I hope this is useful. To learn more
about the project, please see https://github.com/fduran/sadservers
Author : fduran
Score : 407 points
Date : 2022-10-26 14:22 UTC (8 hours ago)
(HTM) web link (sadservers.com)
(TXT) w3m dump (sadservers.com)
| BossingAround wrote:
| I'd love to get the actual VM content offline, packaged as
| Vagrantfiles or Containerfiles. Love the idea though! Go to
| Pluralsight and pitch it to them :)
| fduran wrote:
| A few people have suggested offering content offline as a
| Docker image etc, good idea, thanks.
| computershit wrote:
| I love this idea, I'll definitely try it out when provisioning
| for scenario machines is up again. Nice work.
| N3Xxus_6 wrote:
| Well this sucks I wanted to try it lol. It's timing out for me or
| throws an error.
| deeblering4 wrote:
| > It's also my not-so-secret hope that a sophisticated enough
| version of SadServers could be used by tech companies (or for
| companies that carry on job interviews on their behalf) to
| automate or facilitate the Linux troubleshooting interview
| section.
|
| Yup, that's what I was afraid of.
| KaiserPro wrote:
| But why? It's a real test that is repeatable, realistic and not
| _overly_ hard. Sure, for a junior software engineer it's a bad
| fit, but for a devops/SRE/sysadmin it's a great fit.
|
| It's certainly better than some crappy whiteboarding session, or
| worse, a take-home test.
| pvg wrote:
| _Please don't post shallow dismissals, especially of other
| people's work.
|
| [...] Please don't pick the most provocative thing in an
| article or post to complain about in the thread._
|
| https://news.ycombinator.com/newsguidelines.html
| fduran wrote:
| That doesn't mean that I'd charge individual users :-)
|
| Heck, I'm not even asking for an email (and I had to do extra
| session management coding for that).
| technofiend wrote:
| The Redhat Certified System Admin, Redhat Certified System
| Engineer and similar tests require practical, general hands-on
| skills to solve broken systems. The performance tuning and
| troubleshooting exams go into more detail and more complex
| scenarios. No internet access, but resources are available if
| you understand how to use them. Would never suggest people
| should solely hire on those certs, but if someone takes the
| time to complete 7 hands on tests for the certified architect
| certification, it's a strong indicator they have skills.
|
| Even so, test taking can be stressful but it's arguably less
| stressful than actual production support with people waiting on
| the result. Whether people really want to put candidates in a
| stressful situation is up to them. SadServers seems like it's
| somewhere in the middle vs some of the things I've seen. One
| job interview put me in a room with a boot cd, and an ancient
| computer with a cdrom so slow you got exactly one chance to
| boot the media and recover the system in the time limit. But
| the job was for a trading company, so if you couldn't handle
| that they didn't want you. It was a fun exercise but would I do
| that to someone else? Probably not.
| lbotos wrote:
| Why are you afraid of this? My org has run a hands-on technical
| exam with a stack of linux admin basics (I won't enumerate them
| here because people do their research) but they are _based on
| real problems we've had_ and the feedback is overwhelmingly
| "this was one of the best technical interviews I've ever had."
|
| We ask the engineer who is proctoring the interview to think
| about the following question: Would you want to pair with that
| engineer again?
|
| If that answer is no, then we probably won't go further because
| _pairing with engineers to troubleshoot is what we do every
| day_.
|
| Some great resumes have died with not knowing how to see what's
| running on port 80.
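
A minimal sketch of the kind of check mentioned in the comment above, assuming Python 3 is available on the machine under test. The helper name "is_port_listening" is hypothetical, and the sketch only reports whether something accepts TCP connections on a port. Identifying the owning process is typically done with tools such as "ss -ltnp" or "lsof -i :80", usually as root.

```python
import socket


def is_port_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise,
        # so no exception handling is needed for a refused connection.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 80 listening:", is_port_listening(80))
```

This answers "is anything there?" only; it says nothing about which program owns the socket.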
| joenot443 wrote:
| If you give the person you're interviewing access to the same
| tools they'd have in a regular day on the job (Google,
| manpages, etc.), I'd say that's a fair and probably
| relatively enjoyable interview.
|
| Rejecting someone because they can't recall the correct
| netstat syntax doesn't seem like good hiring practice, but I
| assume in good faith that's not what you meant :)
| yamtaddle wrote:
| Yeah, I google, tealdeer, "--help", and manpage anything I
| don't use at least once a week, every time. Usually I don't
| remember them otherwise, and if I think I do, I don't trust
| my memory that well. Only exception is if I remember enough
| to be able to ctrl+r them out of shell history faster than
| I can do those things--and actually, for some of those, I
| _do_ use them often, but couldn't possibly tell you how
| because I only run a couple commands 99% of the time and
| always pull them out of history unless it's one of the rare
| exceptional cases--I couldn't rsync for a particular
| outcome without consulting a reference, to save my life,
| even though I use it often.
|
| And usually you only use a fairly small set of tools _that_
| often, in any job, and which set will depend on the
| employer, how things are set up, and what exactly you're
| doing.
|
| Oh and somehow I get "-r" versus "-R" for "recursive" wrong
| almost every time, even for commands I type almost daily,
| unless I check first. It's weird. If tools could get on the
| same damn page about which means "recursive", that'd be
| great.
|
| TL;DR I do have a pretty good idea what I'm doing, but look
| like an absolute idiot if anyone watches me do it. Much
| worse, even, if I _know_ they're watching and we're not in
| some kind of relatively high-trust relationship (so,
| definitely not in an interview setting).
| lbotos wrote:
| Exactly, all man pages and google is fair. We want to see
| _how they think_ not _rote memorization_.
| Multicomp wrote:
| I love this point. Joke: are you hiring?
|
| I'm quite happy to try to demonstrate how I think, but I
| hate hate hate leet code because A) it's not relevant to
| showing how one thinks and B) I've read so much dunking
| on it on HN that I'm now stopping interviews when they
| pull out the hackerrank or live code to say 'without
| using the library, reverse this linked list'.
| deeblering4 wrote:
| > Why are you afraid of this?
|
| > My org has run a hands-on technical exam with a stack of
| linux admin basics ... they are based on real problems we've
| had and the feedback is overwhelmingly "this was one of the
| best technical interviews I've ever had."
|
| You essentially answered your own question.
|
| Putting thought into the interview process and working with
| candidates through real problems is valuable. I cannot say
| the same for outsourcing or "automating" this portion of an
| interview using 3rd party SaaS.
| mathverse wrote:
| People in higher up positions like yourself will rarely be
| subjected to testing with tools like this. You are basically
| trying to remove the human from equation and industrialize
| the whole process.
| rednerrus wrote:
| What we're trying to do is respect people's time. We can
| learn more about someone's technical understanding in 30
| minutes of hands on exercises than we can in a full day of
| panel interviews. It's better for us as we have a much
| better understanding of where you're at Linux wise and it's
| better for you because you only need to come to two hours
| of interviews, total. Seems like a win win to me.
| deeblering4 wrote:
| Framing a question like "a system has a high load
| average, what commands would you use to begin diagnosing
| that?" and taking that conversation as deep as the
| candidate can go is neither time consuming nor requires a
| panel of people.
| mike_d wrote:
| In my experience this type of interview (and coding
| interviews in general) usually falls into one of two
| categories: 1) "I learned this neat trick and want to
| show candidates how smart I am" or 2) "I have this bug in
| prod and I want to see if you can fix it for me."
|
| If the interview was along the lines of upgrading the
| packages on the system, debugging why nginx was crashing,
| figuring out the specs of the system, etc. that is
| totally fine with me and I believe respectful of a
| candidate's time. Unfortunately it always turns into
| something else when people need to come up with new
| "challenges" for candidates.
| deathanatos wrote:
| No, I'm trying to make sure the person who is interviewing
| for a job where they will deal with computers on a daily
| basis appears to have seen a computer at some prior point
| in their life.
|
| I wouldn't feel the need to do this if so many candidates
| didn't fail rudimentary tests. A SWE candidate MUST be able
| to write the function min(), in the language and tooling of
| their choice. But in an interview, a sizable fraction
| cannot. (The actual bar is far higher than min(), ofc., but
| min() _ought to be trivial_.)
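
For reference, the exercise named in the comment above could look like the following in Python. This is one possible answer, not the one any particular interviewer expects; the name "minimum" and the choice to raise ValueError on empty input are assumptions.

```python
from typing import Iterable, TypeVar

T = TypeVar("T")


def minimum(items: Iterable[T]) -> T:
    """Return the smallest element of a non-empty iterable."""
    iterator = iter(items)
    try:
        smallest = next(iterator)
    except StopIteration:
        # Assumption: an empty input is treated as an error.
        raise ValueError("minimum() of an empty iterable") from None
    for item in iterator:
        if item < smallest:
            smallest = item
    return smallest
```

Usage: minimum([3, 1, 2]) returns 1.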
| deathanatos wrote:
| Yeah, we did this at a previous employer.
|
| One example: we had them ssh in, download & extract a tarball
| (the Linux source, but the content doesn't matter).
| Sometimes, they'd gunzip to stdout. The reaction tells you a
| lot "lol _whoopsie_ " followed by a quick fix: person knows
| what they're doing. "uh... what is going on? did I break it?"
| followed with general cluelessness... maybe not.
|
| That did occasionally break tmux, though.
|
| Part of it was "what are the specs of this thing you're SSH'd
| into?" and we had one candidate who was _adamant_ the numbers
| must be wrong: 2 GiB is too little RAM, no machine is that
| small! Yeah, we didn't spin up a 128 GiB VM for your
| interview...
| Volundr wrote:
| I never cease to be amazed at how few people really realize
| just how little hardware is often required for getting real
| work done. You'd be surprised just how much that 2GB vm
| with a couple cores can handle!
| sorongopowa wrote:
| I started with a single 1xx MHz core and 16MB of RAM. And
| I'm sure some with even less, lol.
|
| Supporting your point: Hardware is awesome if you use it
| wisely.
| icedchai wrote:
| My first Linux box was a 20mhz 386SX laptop with 3 megs
| of RAM (1 meg on the motherboard, 2 in an expansion.) I
| could barely run Linux 0.99.x. The distro was SLS, and it
| came on 12 or so floppy disks. I quickly upgraded to a
| 486 with 8 megs RAM, then 20... which seemed incredible
| at the time (1994-ish.)
|
| It's amazing how bloated today's software is...
| rednerrus wrote:
| We do this in our org as well. 30 minutes of troubleshooting
| linux issues is a good way to evaluate a candidate's
| experience. We run it as a team exercise with the candidate
| so that we also get the added bonus of how do they work in a
| team setting, how do they communicate, etc.
| aliqot wrote:
| I knew this is where it was headed :/
| Nextgrid wrote:
| Is it bad though? The problem with Leetcode is that it's an
| extremely unrealistic test. This on the other hand seems like
| it actually tests real-world scenarios, and you can get there
| without grinding. I'm pretty sure I can pass all the tests
| they've currently got despite having no formal sysadmin
| experience, just using common developer knowledge, common sense
| and strategic Google-fu.
| andrewmcwatters wrote:
| My only feedback is that this is unrealistic because today
| developers wouldn't try to debug something, they'd just destroy
| the instance, push a commit and hope it fixed something infra
| related then recreate it.
|
| Why would you need to understand how something works? Just use
| containers. /s
| vsareto wrote:
| Developers just need to understand everything because we need
| developers to do everything and meet all deadlines. We wouldn't
| dare consider a support role that could troubleshoot it because
| then there would be no point to having developers that can do
| everything! /s
| cube00 wrote:
| Support doesn't deliver features, we need new features! /s
| grepLeigh wrote:
| If most developers can't debug a VM, then anyone who can will
| be able to charge a premium. If you have a proficiency in ops,
| remember that the next time you negotiate a compensation
| package.
|
| [Edited my compensation numbers to avoid down votes - yikes]
| andrewmcwatters wrote:
| I feel like you definitely have to target particular
| companies and more specifically specific titles and skills to
| offer to do so.
|
| My guess is trying to sell high end services as a "principal
| software engineer" isn't going to be enough to justify that
| cash comp to a lot of people hiring.
| grepLeigh wrote:
| I wouldn't think of it as trying to sell yourself as a
| "principal software engineer" on an open market.
|
| I'd make a list of the companies where hiring/scaling the
| ops team will make or break the business's value delivery,
| and filter by companies _aware_ of this.
|
| You can knock this out at the recruiting step, just by
| asking about open developer headcount vs. open SRE ops
| headcount. Ask which direction that ratio seems to be
| going, and if there's anyone you can talk to whose job it
| is to change that ratio (director or VP mandate).
|
| The referral network from working at a hyperscaler co in
| ops is a great way to break into the space.
| andrewmcwatters wrote:
| Thanks for the heads up!
| sshd wrote:
| This is so sad but so true!
| edmcnulty101 wrote:
| If it's dumb and it works it's not dumb.
| 10g1k wrote:
| "Have you turned it off and on again?"
| hotpotamus wrote:
| Are you familiar with Trueability? https://www.trueability.com/
|
| It seems like this is a similar SaaS.
| fduran wrote:
| Didn't know about this one. There are quite a few labs/sandbox
| SaaS but what I've seen so far is that they are more for
| training with a "follow the recipe" model (do this, do that to
| configure something) rather than "this (real) server is broken,
| fix it (with possibly different solutions)", which imho is more
| real-life and useful.
| hotpotamus wrote:
| I believe the company was founded by some coworkers of mine
| way back when at Rackspace who often interviewed Linux admins
| with a lab VM and I assume they just automated the setup and
| spun it off as their own business. At least that's what
| happened as far as I can tell; I didn't know the parties
| involved.
| jer0me wrote:
| New challenge: Fix SadServers' sad servers
| Pr0ject217 wrote:
| Cool!
| imwillofficial wrote:
| This is badass, just what I need!
| dugmartin wrote:
| I'd suggest integrating https://bellard.org/jslinux/ and running
| the VM in the browser if you can - then you can scale without
| running out of resources.
| m00dy wrote:
| or linux kernel port on webassembly.
| fduran wrote:
| Thanks, I've been looking at WASM, for ex
| https://github.com/snaplet/postgres-wasm/tree/main/packages/...
| , it would certainly simplify everything to "download a fat
| file".
| jodrellblank wrote:
| Have you seen https://copy.sh/v86/ ? It doesn't run as fast
| as jslinux but is BSD Licensed, on Github, and supports
| resuming the VM from a snapshot.
|
| https://github.com/copy/v86
| fduran wrote:
| Didn't know about this, thanks!
| DeathArrow wrote:
| >Practice for your next SRE/DevOps interview.
|
| Are SREs and DevOps tasked with administration of operating
| systems?
| jen_h wrote:
| Yeah. Random data point: One of my most favorite SRE interviews
| ever (serious fun!) involved hands-on troubleshooting that
| eventually required gdb.
| asmr wrote:
| Both SRE and DevOps are essentially evolved sysadmin roles. The
| DevOps philosophy is cross-functional and many sysadmins have
| adopted a DevOps approach. The latest edition of the classic
| sysadmin book "The Practice of System and Network
| Administration" is now centered around DevOps.
| KaiserPro wrote:
| > Are SREs and DevOps tasked with administration of operating
| systems?
|
| yes, eventually.
|
| you can dress it up in all the fancy terms that you like. but
| devops and SREs are sysadmins with better PR.
|
| It's critical that SREs understand _how_ to debug a system, so
| that they can work out how to put in fixes, and/or design
| better systems.
| dsr_ wrote:
| If you have ops somewhere in your responsibilities, then yes.
| jabroni_salad wrote:
| depends on what layer the issue is happening at. I know
| everyone thinks the OS has been abstracted away but my ticket
| queue says otherwise. "yaml engineering" is just a control
| surface, I still need to pop the hood often.
| BossingAround wrote:
| How do you automate something you can't do manually?
| PanosJee wrote:
| Hack The Box -> Fix The Box
| Timja wrote:
| The idea is really cool, but all I see is "Waiting for server..."
| and nothing happens.
| kiyundai wrote:
| That's the trick: you failed the first challenge, "Did you try
| to turn it off and on again?"
| apawloski wrote:
| Based on your architecture diagram it looks like you're spinning
| up an instance per-user? As you're probably finding now, you will
| hit AWS limits quickly.
|
| You might instead want to have a smaller pool of (larger) servers
| that you run co-resident VMs on with
| https://firecracker-microvm.github.io/. That will avoid account
| limits and also keep your AWS costs more predictable.
| fduran wrote:
| Yes thanks!
| temp0826 wrote:
| I haven't fully grokked this yet, but one trick I've used in
| the past to get around limits is AWS Organizations, creating a
| sub-account per property. A bit more setup but can keep things
| cleaner administratively.
| icedchai wrote:
| AWS will raise limits if you ask. Increasing EC2 instance
| limits is usually a quick turn around.
| andrewstuart2 wrote:
| At least for the tests I've done on a small startup
| recently, they've also implemented some automatic quota
| increases for EC2. I ran commands that would have eclipsed (or
| did eclipse) my quota, and got an email that my quotas were
| bumped a few minutes later.
| ericbarrett wrote:
| Yes, the default limits are there to prevent abuse and
| runaway misconfigurations. They won't turn down revenue if
| you confirm it's intentional.
| yamtaddle wrote:
| Just run them in Linux VMs with WASM, on the users' browsers.
| Make them all pay for it with higher utility bills and greater
| wear & tear on their hardware.
|
| _trollface.jpg_
| freeone3000 wrote:
| This is actually a good idea for this -- the user wants the
| education, they can pay for it with their own hardware. Keep
| your costs low!
| cogman10 wrote:
| Probably a better experience for everyone. You just have to
| distribute the image (rather than running vms) and the user
| gets instantaneous responses.
| BossingAround wrote:
| Why not spin up containers instead of VMs? Seems to me
| containers would fit much better than VMs.
| cogman10 wrote:
| Bypassing container security is easier than bypassing VM
| security.
| tamrix wrote:
| Then wouldn't that be the ultimate test ;)
| spiffytech wrote:
| Containers have a history of escape vulnerabilities, for
| reasons like sharing a kernel with the host and other
| containers.
|
| VMs are designed from the ground up to isolate guests, rather
| than focusing on application deployment.
|
| Firecracker is the modern container alternative in untrusted
| compute scenarios, with Fly.io even converting container
| images into Firecracker VMs.
| NovemberWhiskey wrote:
| > _Containers have a history of escape vulnerabilities_
|
| Generally agreed, but for this use-case do we care?
| ilyt wrote:
| That's kind of a nice use case for the WASM machine/Linux
| emulators: then you just need to provide the image and the user
| can run it in the browser.
|
| > You might instead want to have a smaller pool of (larger)
| servers that you run co-resident VMs on with
| https://firecracker-microvm.github.io/. That will avoid account
| limits and also keep your AWS costs more predictable.
|
| I'd imagine (still waiting for it to load lmao) most of it
| could be containers too.
| twalla wrote:
| Someone else linked https://github.com/copy/v86 which seems
| really neat.
|
| I like making jokes with coworkers about implementing this or
| that bit of infra with WASM-based tools mostly to get a rise
| out of them but each time I make the joke I look into some of
| the tools or projects and the balance of joke to "I'm
| actually serious" shifts a little bit to the right.
| lagrange77 wrote:
| Really cool idea.
|
| After choosing a problem, the endpoint you poll at
| https://sadservers.com/celery-progress/xxxx repeatedly returns
| {pending: true, current: 0, total: 100, percent: 0} for me.
| b20000 wrote:
| did you read up on the problems with leetcode?
| fduran wrote:
| Hi, not sure what the question means, I came up with the
| scenarios not copying from leetcode if that's what you mean.
| pxc wrote:
| I think they mean 'are you aware of the limitations of
| Leetcode-like tests and the downsides of their (over)use in
| hiring processes?'
|
| (FWIW I think this is a very cool and fun educational project
| regardless of what usefulness it might or might not have in
| IT hiring decisions, and I'm looking forward to playing with
| it)
| vermon wrote:
| Seems like it's out of capacity:
|
|     An error occurred (VcpuLimitExceeded) when calling the
|     RunInstances operation: You have requested more vCPU capacity
|     than your current vCPU limit of 64 allows for the instance
|     bucket that the specified instance type belongs to. Please
|     visit http://aws.amazon.com/contact-us/ec2-request to request
|     an adjustment to this limit.
|
| Maybe something like
| https://leaningtech.com/webvm-server-less-x86-virtual-machin...
| would be cheaper and more reliable for this kind of thing?
| fduran wrote:
| Yes, HN effect lol-sob.
|
| Mitigation: reducing server lifetime temporarily so more
| people can try.
| warent wrote:
| Usually I roll my eyes when someone posts their own website
| to HN and it crashes under load. But given the nature and
| complexity of yours I think there's room for understanding
| and patience :)
| fduran wrote:
| Thanks, I did some stress-testing and infra is scalable
| enough but I forgot about the AWS quotas, my bad. Quota
| increase requested and servers are killed off so hopefully
| "soon" the issue will go away.
| Nextgrid wrote:
| Scaling this service without breaking the bank could become
| its own "sad server" scenario.
|
| I'd start by moving the test VMs to bare-metal servers
| running libvirt. You can get a 128GB RAM server for ~110 EUR
| and that should be able to run around 120 concurrent VMs
| assuming 1GB of RAM to each (CPU isn't a major issue in this
| case).
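
The estimate in the comment above is simple arithmetic. A sketch, assuming a hypothetical 8 GB reserved for the host OS and hypervisor (the comment gives no overhead figure, and the function name is illustrative only):

```python
def max_concurrent_vms(total_ram_gb: int, ram_per_vm_gb: int, host_reserved_gb: int) -> int:
    """Estimate how many fixed-size VMs fit in RAM after reserving some for the host."""
    if ram_per_vm_gb <= 0:
        raise ValueError("ram_per_vm_gb must be positive")
    usable_gb = total_ram_gb - host_reserved_gb
    if usable_gb <= 0:
        return 0
    # Integer division: a partially fitting VM does not count.
    return usable_gb // ram_per_vm_gb


if __name__ == "__main__":
    # 128 GB server, 1 GB per VM, 8 GB (assumed) kept back for the host.
    print(max_concurrent_vms(128, 1, 8))  # prints 120
```

RAM is treated as the only constraint here, in line with the comment's remark that CPU isn't a major issue in this case.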
| mewse-hn wrote:
| Completed the first challenge and it was a lot of fun - _spoiler_
| I've never had to use the 'lsof' command before.
| grepLeigh wrote:
| Very cool! This reminds me of the ops challenge @ Slack. I'm not
| sure if they still do this, but the SRE/platform infra interview
| used to involve a VM running a malfunctioning LAMP stack.
|
| You'd get SSH access to the VM, then submit a diagnostic report
| of what was broken (and how you fixed it).
|
| Reminded me of how Red Hat used to run their certification test
| (RHCE). I probably still have the live CDs for my RHCE laying
| around somewhere.
| stevekemp wrote:
| I've had interviews like that in the past, and really enjoyed
| them. Much better than "Draw an architecture diagram for how
| you'd handle a serverless IoT application" - where you lose
| points, silently, because you didn't pick something the
| interviewer expected you to do.
|
| Usually a simple combination of immutable files, SELinux
| policies, and typos in configuration files were enough for most
| of the challenges. Though now and again you'd find they'd given
| you a server with packages removed, or not yet installed.
| fduran wrote:
| Oh that reminds me, I loved the original Stripe CTF, it's been
| 10 years already!
| https://twitter.com/fduran/status/240321390698442753
| yubiox wrote:
| Can't get to the first problem because of HN hug but anyway there
| are fake ways to "solve" it like renaming the logfile (what they
| test for solved is provided).
| Timja wrote:
| Depends on how the broken program writes to the log.
|
| If it does
|
|     while true; do echo hello >> bad.log; done
|
| then renaming bad.log will not solve the challenge.
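
The distinction drawn in the comment above can be reproduced in a small experiment. The sketch below assumes a POSIX filesystem; every file name other than bad.log is made up for illustration. A writer that keeps its file open continues writing to the renamed file, while a writer that reopens the path on every write (as the shell loop does with ">>") simply recreates bad.log.

```python
import os
import tempfile


def count_lines(path: str) -> int:
    with open(path) as f:
        return sum(1 for _ in f)


def rename_experiment(workdir: str) -> dict:
    log = os.path.join(workdir, "bad.log")
    moved = os.path.join(workdir, "moved.log")  # hypothetical new name

    # Case 1: the writer opens the log once and keeps the descriptor.
    held = open(log, "a")
    held.write("hello\n")
    held.flush()
    os.rename(log, moved)
    held.write("hello\n")  # follows the open file, not the old name
    held.close()
    lines_in_moved = count_lines(moved)
    bad_log_exists_after_rename = os.path.exists(log)

    # Case 2: the writer reopens the path for every write, like
    # `echo hello >> bad.log` inside a shell loop.
    with open(log, "a") as f:
        f.write("hello\n")
    bad_log_recreated = os.path.exists(log)

    return {
        "lines_in_moved": lines_in_moved,
        "bad_log_exists_after_rename": bad_log_exists_after_rename,
        "bad_log_recreated": bad_log_recreated,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(rename_experiment(workdir))
```

In both cases the writing process is still running, which is why renaming the file does not meet an instruction to stop the writer.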
| teddyh wrote:
| Replace it with a symlink to /dev/null! Or /dev/full if we
| feel like it.
|
| (Yes, these are bad solutions, since the instructions
| explicitly said to stop the process which is writing.)
| fduran wrote:
| There are ways to cheat but not so simple; there's a script
| that checks for the solution and a hash of the script is
| checked for modifications.
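
A hedged sketch of how such an integrity check could work. This is not SadServers' actual code: the choice of SHA-256, the function names and the idea of comparing against a stored hex digest are all assumptions for illustration.

```python
import hashlib
import hmac


def sha256_of_file(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def script_is_unmodified(path: str, expected_hex: str) -> bool:
    """Compare the file's current digest with a previously recorded one."""
    # compare_digest avoids leaking where the two strings differ.
    return hmac.compare_digest(sha256_of_file(path), expected_hex)
```

The expected digest would have to be stored somewhere the user cannot edit, otherwise the check itself can be bypassed.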
| BossingAround wrote:
| This is a self-test, not a certification. The goal is not to
| defeat the verification goal, but to learn something. So yeah,
| it's perfectly acceptable that the tests are not bullet-proof.
| bm-rf wrote:
| I'm assuming you're spinning up an EC2 instance for each lab.
| What do you think about using pre-built docker images for each
| challenge instead? that way they can spin up in just a couple of
| seconds. Might also be cheaper?
| clvx wrote:
| probably lxd would be better.
| bravetraveler wrote:
| Not a bad idea but something to consider; this limits the
| options for kernel level things quite considerably
| fduran wrote:
| I wanted to do full VMs rather than Docker images but yes I
| could do Docker images or dedicated big instances with VMs on
| top like somebody else is suggesting.
| bravetraveler wrote:
| Commenting to give this a try later, I've routinely been the
| person to get these kinds of gremlins escalated
|
| I've long wanted some sort of mock, "things are broken - I
| want to see how you think" approach for sysad
| shagie wrote:
| In the "tricks of hacker news":
|
|     188 points by fduran 3 hours ago | unvote | flag | hide |
|     past | favorite | 68 comments
|
| If you click 'favorite' it will save it to your favorites list.
| This is a publicly visible list - yours is
| https://news.ycombinator.com/favorites?id=bravetraveler and
| mine is https://news.ycombinator.com/favorites?id=shagie which
| makes it easy to get a bookmark type style functionality within
| HN.
|
| As I tend to favorite less often than I comment, it makes it
| easier to find those things I want to find again.
| bravetraveler wrote:
| Much appreciated! I'm woeful about not using features
| like this, it's a character fault at this point.
|
| The HN interface too tends to just have my eyes filter out
| those links... but that's no defense.
|
| Especially good to know that it's publicly viewable!
|
| Not that I'm particularly worried of being outed by anything
| I favorite here, it's just good to be mindful of the data we
| make and where it goes.
___________________________________________________________________
(page generated 2022-10-26 23:00 UTC)