[HN Gopher] Fearless SSH: Short-lived certificates bring Zero Tr...
___________________________________________________________________
Fearless SSH: Short-lived certificates bring Zero Trust to
infrastructure
Author : mfrw
Score : 29 points
Date : 2024-10-23 09:44 UTC (13 hours ago)
(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)
| EthanHeilman wrote:
| I'm a member of the team that worked on this; happy to answer
| any questions.
|
| We (BastionZero) were recently acquired by Cloudflare, and it's
| exciting to bring our SSH ideas to Cloudflare.
| mdaniel wrote:
| I just wanted to offer my congratulations on the acquisition. I
| don't know any details about your specific one, but I have been
| around enough to know that it's still worth celebrating o/
| lenova wrote:
| I'd love to hear about the acquisition story with Cloudflare.
| mdaniel wrote:
| I really enjoyed my time with Vault's ssh-ca (back when it had a
| sane license) but have now grown up and believe that _any_ ssh
| access is an antipattern. For context, I'm also one of those
| "immutable OS or GTFO" chaps because in my experience the next
| thing that happens after some rando ssh-es into a machine is they
| launch vi or apt-get or whatever and _now_ it's a snowflake with
| zero auditing of the actions taken on it
|
| I don't mean to detract from this, because short-lived creds are
| always better, but for my money I hope I never have sshd running
| on any machine again
| namxam wrote:
| But what is the alternative?
| mdaniel wrote:
| There's not one answer to your question, but here's mine:
| kubelet and AWS SSM (which, to the best of my knowledge, will
| work on non-AWS infra; it just needs to be provided creds).
| Bottlerocket <https://github.com/bottlerocket-
| os/bottlerocket#setup> comes batteries included with both of
| those things, and is cheaply provisioned with _(ahem)_ TOML
| user-data <https://github.com/bottlerocket-
| os/bottlerocket#description-...>
|
| In that specific case, one can also have "systemd for normal
| people" via its support for static Pod definitions, so one
| can run containerized toys on boot even without being a
| formal member of a kubernetes cluster
|
| AWS SSM provides auditing of what a person might normally
| type via ssh, and kubelet similarly, just at a different
| abstraction level. For clarity, I am aware that with some sshd
| trickery one could get similar audit and log egress, but I
| haven't seen that in practice, whereas kubelet and AWS SSM
| provide it out of the box
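[Editor's note: for reference, Bottlerocket's user-data is plain TOML. A minimal sketch of the kind of settings involved in joining a node to a Kubernetes cluster follows; the cluster name, endpoint, and certificate values are placeholders, and the exact keys should be checked against the Bottlerocket documentation linked above.]

```toml
# Hypothetical Bottlerocket user-data: values below are placeholders.
[settings.kubernetes]
cluster-name = "example-cluster"
api-server = "https://EXAMPLE.example-region.eks.amazonaws.com"
cluster-certificate = "BASE64-ENCODED-CLUSTER-CA-CERT"
```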
| riddley wrote:
| How do you troubleshoot?
| mdaniel wrote:
| In my world, if a developer needs access to the Node upon
| which their app is deployed to troubleshoot, that's 100% a
| bug in their application. I am cognizant that being whole-hog
| on 12 Factor apps is a journey, but for my money get on the
| train because "let me just ssh in and edit this one config
| file" is the road to ruin when no one knows who edited what
| to set it to what new value. Running $(kubectl edit) allows
| $(kubectl rollout undo) to put it back, and also shows what
| was changed from what to what
| yjftsjthsd-h wrote:
| How do you debug the worker itself?
| mdaniel wrote:
| Separate from my sibling comment about AWS SSM, I also
| believe that if one cannot know that a Node is sick by
| the metrics or log egress from it, that's a deployment
| bug. I'm firmly in the "Cattle" camp, and am getting
| closer and closer to the "Reverse Uptime" camp - made
| easier by ASG's newfound "Instance Lifespan" setting to
| make it basically one-click to get onboard that train
|
| Even as I type all these answers out, I'm super cognizant
| that there's not one hammer for all nails, and I am for
| sure guilty of yanking Nodes out of the ASG in order to
| figure out what the hell has gone wrong with them, but I
| try very very hard not to place my Nodes in a precarious
| situation to begin with so that such extreme
| troubleshooting becomes a minor severity incident and not
| Situation Normal
| bigiain wrote:
| I think ssh-ing into production is a sign of not fully mature
| devops practices.
|
| We are still stuck there, but we're striving to get to the
| place where we can turn off sshd on Prod and rely on the
| CI/CD pipeline to blow away and reprovision instances, and be
| 100% confident we can test and troubleshoot in dev and stage
| and by looking at off-instance logs from Prod.
|
| How important it is to get there, and why I want it, is
| something I ponder - it's clearly not worthwhile if your
| project is one or two prod servers perhaps running something
| like HA WordPress, but it's obvious that at Netflix-type
| scale nobody is sshing into individual instances to
| troubleshoot. We are a long way (a long long long long way)
| from Netflix scale, and are unlikely to ever get there. But
| somewhere between dozens and hundreds of instances is about
| where I reckon the work required to get close to there starts
| paying off.
| imiric wrote:
| Right. The answer is having systems that are resilient to
| failure, and if they do fail being able to quickly replace
| any node, hopefully automatically, along with solid
| observability to give you insight into what failed and how
| to fix it. The process of logging into a machine to
| troubleshoot it in real-time while the system is on fire is
| so antiquated, not to mention stressful. On-call shouldn't
| really be a major part of our industry. Systems should be
| self-healing, and troubleshooting done during working
| hours.
|
| Achieving this is difficult, but we have the tools to do
| it. The hurdles are often organizational rather than
| technical.
| ozim wrote:
| How do you handle the db?
|
| Stuff I work on is write heavy, so spawning dozens of app copies
| doesn't make sense if I just hog the db with write locks.
| mdaniel wrote:
| I must resist the urge to write "users can access the DB via
| the APIs in front of it" :-D
|
| But, seriously, Teleport (back before they did a licensing
| rug-pull) is great at that and no SSH required. I'm super
| positive there are a bazillion other "don't use ssh as a poor
| person's VPN" solutions
| antoniomika wrote:
| I wrote a system that did this >5 years ago (luckily was able to
| open source it before the startup went under[0]). The bastion
| would record ssh sessions in asciicast v2 format and store those
| for later playback directly from a control panel. The main issue
| that still isn't solved by a solution like this is user
| management on the remote (ssh server) side. In a more recent
| implementation, integration with LDAP made the most sense and
| allowed for separation of user and login credentials. A single
| integrated solution is likely the holy grail in this space.
|
| [0] https://github.com/notion/bastion
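[Editor's note: asciicast v2, the recording format mentioned above, is a simple newline-delimited JSON file: one header object followed by one event per line. A minimal sketch of writing a session in that shape follows; the `write_asciicast_v2` helper is illustrative, not code from the bastion project.]

```python
import json
import time

def write_asciicast_v2(path, events, width=80, height=24):
    """Write a terminal session as an asciicast v2 file:
    a JSON header line, then one JSON array per recorded event."""
    with open(path, "w") as f:
        header = {"version": 2, "width": width, "height": height,
                  "timestamp": int(time.time())}
        f.write(json.dumps(header) + "\n")
        for elapsed, data in events:
            # Each event is [seconds-since-start, "o" (output), data].
            f.write(json.dumps([elapsed, "o", data]) + "\n")

# Example: a two-event session transcript.
write_asciicast_v2("session.cast",
                   [(0.1, "$ ls\r\n"), (0.4, "README.md\r\n")])
```

A file in this shape plays back directly with the asciinema player, which is what makes a control-panel replay straightforward.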
| mdaniel wrote:
| Out of curiosity, why ignore this PR?
| https://github.com/notion/bastion/pull/13
|
| I would think even a simple "sorry, this change does not align
| with the project's goals" -> closed would help the submitter
| (and others) have some clarity versus the PR limbo it's
| currently in
|
| That aside, thanks so much for pointing this out: it looks like
| good fun, especially the Asciicast support!
| antoniomika wrote:
| Honestly, I never had a chance to review or merge it. Once the
| company wound down, I had to move on to other things (find a
| new job, work on other priorities, etc.) and afterwards lost
| access to do anything with it. I thought about forking it and
| modernizing it, but that never came to fruition.
| edelbitter wrote:
| Why does the title say "Zero Trust", when the article explains
| that this only works as long as every involved component of the
| Cloudflare MitM keylogger and its CA can be trusted? If host
| keys are worthless because you do not know in advance what key
| the proxy will have.. then this scheme is back to trusting
| servers merely because they are in Cloudflare's address space,
| no?
___________________________________________________________________
(page generated 2024-10-23 23:00 UTC)