[HN Gopher] Fearless SSH: Short-lived certificates bring Zero Tr...
       ___________________________________________________________________
        
       Fearless SSH: Short-lived certificates bring Zero Trust to
       infrastructure
        
       Author : mfrw
       Score  : 29 points
       Date   : 2024-10-23 09:44 UTC (13 hours ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | EthanHeilman wrote:
        | I'm a member of the team that worked on this; happy to answer any
        | questions.
       | 
       | We (BastionZero) recently got bought by Cloudflare and it is
       | exciting bringing our SSH ideas to Cloudflare.
        
         | mdaniel wrote:
         | I just wanted to offer my congratulations on the acquisition. I
         | don't know any details about your specific one, but I have been
         | around enough to know that it's still worth celebrating o/
        
         | lenova wrote:
         | I'd love to hear about the acquisition story with Cloudflare.
        
       | mdaniel wrote:
       | I really enjoyed my time with Vault's ssh-ca (back when it had a
       | sane license) but have now grown up and believe that _any_ ssh
        | access is an antipattern. For context, I'm also one of those
       | "immutable OS or GTFO" chaps because in my experience the next
       | thing that happens after some rando ssh-es into a machine is they
        | launch vi or apt-get or whatever and _now_ it's a snowflake with
        | zero auditing of the actions taken on it
       | 
       | I don't mean to detract from this, because short-lived creds are
       | always better, but for my money I hope I never have sshd running
       | on any machine again
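
        For concreteness, since short-lived creds are the topic here: below
        is a minimal sketch of what minting a short-lived SSH user
        certificate can look like, using Go's golang.org/x/crypto/ssh
        package. It illustrates the general technique only - not
        Cloudflare's, BastionZero's, or Vault's implementation - and the
        principal name and five-minute lifetime are arbitrary choices.

            // Minimal sketch: a CA minting a short-lived SSH user certificate.
            // The CA and user keys are throwaway ed25519 keys generated on the
            // spot so the example runs; a real CA keeps its signing key elsewhere.
            package main

            import (
                "crypto/ed25519"
                "crypto/rand"
                "fmt"
                "log"
                "time"

                "golang.org/x/crypto/ssh"
            )

            func main() {
                // Throwaway "CA" and "user" keys, stand-ins for real key material.
                _, caPriv, _ := ed25519.GenerateKey(rand.Reader)
                caSigner, err := ssh.NewSignerFromKey(caPriv)
                if err != nil {
                    log.Fatal(err)
                }
                userPubRaw, _, _ := ed25519.GenerateKey(rand.Reader)
                userPub, err := ssh.NewPublicKey(userPubRaw)
                if err != nil {
                    log.Fatal(err)
                }

                now := time.Now()
                cert := &ssh.Certificate{
                    Key:             userPub,
                    Serial:          1,
                    CertType:        ssh.UserCert,
                    KeyId:           "alice@example",   // shows up in sshd's auth log
                    ValidPrincipals: []string{"alice"}, // login names this cert may assume
                    ValidAfter:      uint64(now.Add(-1 * time.Minute).Unix()),
                    ValidBefore:     uint64(now.Add(5 * time.Minute).Unix()), // minutes, not months
                    Permissions: ssh.Permissions{
                        Extensions: map[string]string{"permit-pty": ""},
                    },
                }
                if err := cert.SignCert(rand.Reader, caSigner); err != nil {
                    log.Fatal(err)
                }

                // This is the "-cert.pub" blob the client presents next to its key.
                fmt.Printf("%s", ssh.MarshalAuthorizedKey(cert))
            }

        On the server side, OpenSSH only needs TrustedUserCAKeys pointed at
        the CA's public key; per-user authorized_keys entries and long-lived
        user keys stop being necessary, which is what makes the short
        lifetimes practical.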
        
         | namxam wrote:
         | But what is the alternative?
        
           | mdaniel wrote:
            | There's not one answer to your question, but here's mine:
            | kubelet and AWS SSM (which, to the best of my knowledge, will
            | work on non-AWS infra; it just needs to be provided creds).
            | Bottlerocket
            | <https://github.com/bottlerocket-os/bottlerocket#setup> comes
            | batteries included with both of those things, and is cheaply
            | provisioned with _(ahem)_ TOML user-data
            | <https://github.com/bottlerocket-os/bottlerocket#description-...>
           | 
           | In that specific case, one can also have "systemd for normal
           | people" via its support for static Pod definitions, so one
           | can run containerized toys on boot even without being a
           | formal member of a kubernetes cluster
           | 
           | AWS SSM provides auditing of what a person might normally
           | type via ssh, and kubelet similarly, just at a different
            | abstraction level. For clarity, I am aware that, with some
            | sshd trickery, one could get similar audit and log egress, but
            | I haven't seen that done in practice, whereas kubelet and AWS
            | SSM provide it out of the box
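
        As a concrete illustration of the SSM route: the sketch below opens
        a Session Manager session with aws-sdk-go-v2 instead of connecting
        to sshd. The instance ID is a placeholder; in day-to-day use the aws
        CLI plus the session-manager-plugin wrap this same StartSession call
        and then speak the streaming protocol over the returned StreamUrl.

            // Rough sketch: start an SSM Session Manager session instead of sshd.
            // Assumes standard AWS credentials in the environment; the instance ID
            // below is a placeholder, not a real target.
            package main

            import (
                "context"
                "fmt"
                "log"

                "github.com/aws/aws-sdk-go-v2/aws"
                "github.com/aws/aws-sdk-go-v2/config"
                "github.com/aws/aws-sdk-go-v2/service/ssm"
            )

            func main() {
                ctx := context.Background()
                cfg, err := config.LoadDefaultConfig(ctx)
                if err != nil {
                    log.Fatal(err)
                }

                client := ssm.NewFromConfig(cfg)
                out, err := client.StartSession(ctx, &ssm.StartSessionInput{
                    Target: aws.String("i-0123456789abcdef0"), // placeholder instance ID
                })
                if err != nil {
                    log.Fatal(err)
                }

                // Sessions (and, with logging preferences set, their content) are
                // recorded centrally by SSM rather than by per-host sshd config.
                fmt.Println("session id:", aws.ToString(out.SessionId))
                fmt.Println("stream url:", aws.ToString(out.StreamUrl))
            }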
        
         | riddley wrote:
         | How do you troubleshoot?
        
           | mdaniel wrote:
           | In my world, if a developer needs access to the Node upon
           | which their app is deployed to troubleshoot, that's 100% a
            | bug in their application. I am cognizant that going whole-hog
           | on 12 Factor apps is a journey, but for my money get on the
           | train because "let me just ssh in and edit this one config
           | file" is the road to ruin when no one knows who edited what
           | to set it to what new value. Running $(kubectl edit) allows
           | $(kubectl rollout undo) to put it back, and also shows what
           | was changed from what to what
        
             | yjftsjthsd-h wrote:
             | How do you debug the worker itself?
        
               | mdaniel wrote:
               | Separate from my sibling comment about AWS SSM, I also
               | believe that if one cannot know that a Node is sick by
               | the metrics or log egress from it, that's a deployment
               | bug. I'm firmly in the "Cattle" camp, and am getting
               | closer and closer to the "Reverse Uptime" camp - made
               | easier by ASG's newfound "Instance Lifespan" setting to
               | make it basically one-click to get onboard that train
               | 
               | Even as I type all these answers out, I'm super cognizant
               | that there's not one hammer for all nails, and I am for
               | sure guilty of yanking Nodes out of the ASG in order to
               | figure out what the hell has gone wrong with them, but I
               | try very very hard not to place my Nodes in a precarious
               | situation to begin with so that such extreme
               | troubleshooting becomes a minor severity incident and not
               | Situation Normal
        
           | bigiain wrote:
           | I think ssh-ing into production is a sign of not fully mature
           | devops practices.
           | 
           | We are still stuck there, but we're striving to get to the
           | place where we can turn off sshd on Prod and rely on the
           | CI/CD pipeline to blow away and reprovision instances, and be
           | 100% confident we can test and troubleshoot in dev and stage
           | and by looking at off-instance logs from Prod.
           | 
            | How important it is to get there (and what my motivations for
            | it really are) is something I ponder - it's clearly not
            | worthwhile if your project is one or 2 prod servers perhaps
            | running something like HA WordPress, but it's obvious that at
            | Netflix type scale nobody is sshing into individual instances
            | to troubleshoot. We are a long way (a long long long long way)
            | from Netflix scale, and are unlikely to ever get there. But
            | somewhere between dozens and hundreds of instances is about
            | where I reckon the work required to get close to there starts
            | paying off.
        
             | imiric wrote:
             | Right. The answer is having systems that are resilient to
             | failure, and if they do fail being able to quickly replace
             | any node, hopefully automatically, along with solid
             | observability to give you insight into what failed and how
             | to fix it. The process of logging into a machine to
             | troubleshoot it in real-time while the system is on fire is
             | so antiquated, not to mention stressful. On-call shouldn't
             | really be a major part of our industry. Systems should be
             | self-healing, and troubleshooting done during working
             | hours.
             | 
             | Achieving this is difficult, but we have the tools to do
             | it. The hurdles are often organizational rather than
             | technical.
        
         | ozim wrote:
          | How do you handle the db?
         | 
          | Stuff I work on is write heavy, so spawning dozens of app copies
          | doesn't make sense if I just hog the db with write locks.
        
           | mdaniel wrote:
           | I must resist the urge to write "users can access the DB via
           | the APIs in front of it" :-D
           | 
           | But, seriously, Teleport (back before they did a licensing
           | rug-pull) is great at that and no SSH required. I'm super
           | positive there are a bazillion other "don't use ssh as a poor
           | person's VPN" solutions
        
       | antoniomika wrote:
        | I wrote a system that did this >5 years ago (luckily I was able to
        | open source it before the startup went under[0]). The bastion
       | would record ssh sessions in asciicast v2 format and store those
       | for later playback directly from a control panel. The main issue
       | that still isn't solved by a solution like this is user
       | management on the remote (ssh server) side. In a more recent
       | implementation, integration with LDAP made the most sense and
       | allows for separation of user and login credentials. A single
       | integrated solution is likely the holy grail in this space.
       | 
       | [0] https://github.com/notion/bastion
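
        For anyone curious what recording sessions "in asciicast v2 format"
        amounts to: the format is a JSON header line followed by one JSON
        event per line. The sketch below is a simplified recorder in Go that
        could be teed into a session's output stream - an illustration of
        the file format, not the notion/bastion implementation.

            // Sketch of an asciicast v2 recorder: a JSON header line, then one
            // JSON event per line of the form [elapsed_seconds, "o", data].
            package main

            import (
                "encoding/json"
                "io"
                "os"
                "time"
            )

            type asciicast struct {
                enc   *json.Encoder
                start time.Time
            }

            func newAsciicast(out io.Writer, width, height int) (*asciicast, error) {
                enc := json.NewEncoder(out) // Encode() appends the newline the format wants
                header := map[string]interface{}{
                    "version":   2,
                    "width":     width,
                    "height":    height,
                    "timestamp": time.Now().Unix(),
                }
                if err := enc.Encode(header); err != nil {
                    return nil, err
                }
                return &asciicast{enc: enc, start: time.Now()}, nil
            }

            // Write records p as an "o" (terminal output) event and reports len(p),
            // so the recorder can sit in an io.MultiWriter next to the real client.
            func (a *asciicast) Write(p []byte) (int, error) {
                ev := []interface{}{time.Since(a.start).Seconds(), "o", string(p)}
                if err := a.enc.Encode(ev); err != nil {
                    return 0, err
                }
                return len(p), nil
            }

            func main() {
                rec, err := newAsciicast(os.Stdout, 80, 24)
                if err != nil {
                    panic(err)
                }
                // In a bastion, the server->client byte stream would be copied
                // through io.MultiWriter(clientConn, rec) so playback matches
                // what the user saw.
                rec.Write([]byte("$ uptime\r\n"))
                rec.Write([]byte(" 09:44:00 up 13 days,  load average: 0.01\r\n"))
            }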
        
         | mdaniel wrote:
         | Out of curiosity, why ignore this PR?
         | https://github.com/notion/bastion/pull/13
         | 
         | I would think even a simple "sorry, this change does not align
         | with the project's goals" -> closed would help the submitter
         | (and others) have some clarity versus the PR limbo it's
         | currently in
         | 
         | That aside, thanks so much for pointing this out: it looks like
         | good fun, especially the Asciicast support!
        
           | antoniomika wrote:
           | Honestly never had a chance to merge it/review it. Once the
            | company wound down, I had to move on to other things (find a
            | new job, work on other priorities, etc.) and lost the access
            | needed to do anything with it after. I thought about forking
            | it and modernizing it, but it never came to fruition.
        
       | edelbitter wrote:
       | Why does the title say "Zero Trust", when the article explains
       | that this only works as long as every involved component of the
        | Cloudflare MitM keylogger and its CA can be trusted? If host
        | keys are worthless because you do not know in advance what key
        | the proxy will have... then this scheme is back to trusting
        | servers merely because they are in Cloudflare address space, no?
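
        For what it's worth, the usual answer to the host-key half of this
        objection in SSH-CA designs is that the client stops trusting
        individual host keys at all and instead pins the CA: whatever it
        connects to must present a host certificate signed by that CA.
        Whether that resolves the "trust the proxy" concern is a separate
        question; the sketch below only shows the pinning mechanics with
        Go's golang.org/x/crypto/ssh, assuming the CA public key is
        distributed to clients out of band (here it is generated on the
        spot purely so the example runs).

            // Sketch: pin a host CA instead of individual host keys. The client
            // refuses any server that cannot present a host certificate signed by
            // the pinned CA.
            package main

            import (
                "crypto/ed25519"
                "crypto/rand"
                "fmt"
                "log"

                "golang.org/x/crypto/ssh"
            )

            func hostKeyCallbackForCA(caPub ssh.PublicKey) ssh.HostKeyCallback {
                checker := &ssh.CertChecker{
                    // Accept a host certificate only if its signer is our pinned CA.
                    IsHostAuthority: func(auth ssh.PublicKey, address string) bool {
                        return string(auth.Marshal()) == string(caPub.Marshal())
                    },
                }
                return checker.CheckHostKey
            }

            func main() {
                // Stand-in CA public key (normally shipped to clients out of band).
                pub, _, _ := ed25519.GenerateKey(rand.Reader)
                caPub, err := ssh.NewPublicKey(pub)
                if err != nil {
                    log.Fatal(err)
                }

                cfg := &ssh.ClientConfig{
                    User:            "alice",
                    Auth:            []ssh.AuthMethod{ /* user-cert auth goes here */ },
                    HostKeyCallback: hostKeyCallbackForCA(caPub),
                }
                _ = cfg // a real client would now ssh.Dial("tcp", "host:22", cfg)
                fmt.Println("pinned host CA:", ssh.FingerprintSHA256(caPub))
            }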
        
       ___________________________________________________________________
       (page generated 2024-10-23 23:00 UTC)