Post Av7GliyvmwROND9mnw by ariaflame@masto.ai
 (DIR) More posts by ariaflame@masto.ai
 (DIR) Post #Av4L4ZqnmKhrtkigzo by bob_zim@infosec.exchange
       2024-02-02T16:29:56Z
       
       2 likes, 5 repeats
       
       @shortridge While working tech support, I got a call on a Monday. Some VPNs which had been working on Friday were no longer working. After a little digging, we found the negotiation was failing due to a certificate validation failure.
       The certificate validation was failing because the system couldn’t check the certificate revocation list (CRL).
       The system couldn’t check the CRL because it was too big. The software doing the validation only allocated 512kB to store the CRL, and it was bigger than that. This is from a private certificate authority, though, and 512kB is a *LOT* of revoked certificates. Shouldn’t be possible for this environment to hit within a human lifespan.
       Turns out the CRL was nearly a megabyte! What gives? We check the certificate authority, and it’s revoking and reissuing every single certificate it has signed once per second.
       The revocations say all the certificates (including the certificate authority’s) are expired. We check the expiration date of the certificate authority, and it’s set to some time in 1910. What? It was around here I started to suspect what had happened.
       The certificate authority isn’t valid before some time in 2037. It was waking up every second, seeing the current date was after the expiration date and reissuing everything. But time is linear, so it doesn’t make sense to reissue an expired certificate with an earlier not-valid-before date, so it reissued all the certs with the same dates and went to sleep. One second later, it woke up and did the whole process over again. But why the clearly invalid dates on the CA?
       The CA operation log was packed with revocations and reissues, but I eventually found the reissues which changed the validity dates of the CA’s certificate. Sure enough, it reissued itself in 2037 and the expiration date was set to 2037 plus ten years, which fell victim to the 2038 limitation.
       But it’s not 2037, so why did the system think it was?
       The OS running the CA was set to sync with NTP every 120 seconds, and it used a really bad NTP client which blindly set the time to whatever the NTP server gave it. No sanity checking, no drifting. Just get the time, set the time. OS logs showed most of the time, the clock adjustment was a fraction of a second. Then some time on Saturday, there was an adjustment of tens of thousands of seconds forward. The next adjustment was hundreds of thousands of seconds forward. Tens of millions of seconds forward. Eventually it hit billions of seconds backwards, taking the system clock back to 1904 or so. The NTP server was racing forward through the 32-bit timestamp space.
       At some point, the NTP server handed out a date in 2037 which was after the CA’s expiration. It reissued itself as I described above, and a date math bug resulted in a cert which expired before it was valid. So now we have an explanation for the CRL being so huge. On to the NTP server!
       Turns out they had an NTP “appliance” with a radio clock (i.e., a CDMA radio, GPS receiver, etc.). Whoever built it had done so in a really questionable way. It seems it had a faulty internal clock which was very fast. If it lost upstream time for a while, then reacquired it after the internal clock had accumulated a whole extra second, the server didn’t let itself step backwards or extend the duration of a second. The math it used to correct its internal clock somehow resulted in dramatically shortening the duration of a second until it wrapped in 2038 and eventually ended up at the correct time.
       Ultimately found three issues:
       • An OS with an overly-simplistic NTP client
       • A certificate authority with a bad date math system
       • An NTP server with design issues and bad hardware
       Edit: The popularity of this story has me thinking about it some more.
       The 2038 problem happens because when the first bit of a 32-bit value is 1 and you use it as a signed integer, it’s interpreted as a negative number in 2’s complement representation. But C has no protection from treating the same value as signed in some contexts and unsigned in others. If you start with a signed 32-bit integer with the value -1, it is represented in memory as 0xFFFFFFFF. If you then use it as an unsigned integer, it becomes the value 4,294,967,295.
       I bet the NTP box subtracted the internal clock’s seconds from the radio clock’s seconds as signed integers (getting -1 seconds), then treated it as an unsigned integer when figuring out how to adjust the tick rate. It suddenly thought the clock was four billion seconds behind, so it really has to sprint forward to catch up!
       In my experience, the most baffling behavior is almost always caused by very small mistakes. This small mistake would explain the behavior.
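Editor's sketch of the two integer effects described in the post above, written in Python rather than C so the reinterpretation is explicit. The helper names, the 2037-06-01 issue date, and the 365-day years are assumptions made for illustration; the appliance's and the CA's real code are not known.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_unsigned32(value: int) -> int:
    """Reinterpret the bits of a signed 32-bit integer as unsigned."""
    return struct.unpack("<I", struct.pack("<i", value))[0]


def as_signed32(value: int) -> int:
    """Keep the low 32 bits and reinterpret them as a signed integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


def from_time32(seconds: int) -> datetime:
    """Convert seconds since the Unix epoch (possibly negative) to UTC."""
    return EPOCH + timedelta(seconds=seconds)


# A clock offset of -1 second, misread as an unsigned quantity.
misread_offset = as_unsigned32(-1)

# Hypothetical issue date in 2037, plus roughly ten years of validity,
# stored back into a signed 32-bit seconds counter.
issued_at = datetime(2037, 6, 1, tzinfo=timezone.utc)
issued_seconds = int((issued_at - EPOCH).total_seconds())
ten_years = 10 * 365 * 86400  # assumption: leap days ignored
wrapped_expiry = from_time32(as_signed32(issued_seconds + ten_years))

print(misread_offset)              # about four billion "seconds behind"
print(issued_at, wrapped_expiry)   # the expiry lands before the issue date
```

With these assumed inputs the expiry wraps to a date well before 1970, which is the same shape of failure as a certificate that expires before it becomes valid.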
       
 (DIR) Post #Av4L4fbmN6FrktjmPg by bob_zim@infosec.exchange
       2024-02-03T15:01:13Z
       
       2 likes, 2 repeats
       
       @shortridge Some time later, I was no longer working tech support. I got hired to do network and firewall stuff for a fairly large company. At one point, they decided to relocate the office where a lot of the operations and monitoring staff worked. They moved the whole application monitoring team to the new building with the unproven infrastructure first, because some people in charge made very bad decisions.
       The monitoring team gets to the new building, and they can’t access any of their monitoring systems. Clearly a problem with the new office, right? They go through a few environments to get to their monitoring systems, so I log in to the remote access VPN for the first one and confirm the first firewall they hit sees their traffic and isn’t dropping it.
       I go to log in to the remote access VPN for the second environment, where the monitoring systems actually live. I’m able to start the connection, but it never prompts me for my credentials, and the tunnel never comes up. Huh. That’s weird.
       Well, I’ll just get in through the DR version of the second environment. Connection works and it prompts me for my credentials, but it rejects them. I try again, in case I made a mistake entering the passphrase for my key, but it’s still rejected. Huh. That’s weird.
       I eventually find a working way in. I’m able to ping all the relevant systems, I’m able to make TCP connections via telnet, but trying to actually use a service like SSH or MSRDP just hangs. But wait! I can connect to my firewalls via SSH! So what’s common among the broken systems?
       All the broken systems are VMs. I start testing connections to other things which I know are VMs. They all behave the same. Ping works, TCP connections work, but data over the connections gets no response.
       I bring in the virtualization team. Some of us drive in to the datacenter hosting the VMs giving us trouble. Someone quickly realizes the single SAN hosting all of the VMs’ drives was up, but wasn’t responding to storage requests. Effectively the drive had been pulled out of every single VM. Now we have an explanation for why all the VMs seem to be broken.
       With most operating systems, the network stack is wired in RAM and can’t be swapped out. The network stack handles responding to pings and opening TCP connections on listening ports. Once a TCP connection is opened, it requests a copy of the listening service from storage to handle the connection. With storage no longer responding, the network stack never gets the copy of the service to handle the connection, so data doesn’t work.
       Why couldn’t I connect to the second VPN endpoint? Well, some people in charge made very bad decisions. They had decided that since VMs are the future, the VPN endpoints in that facility should be moved from dedicated hardware to VMs stored on the SAN. They hadn’t gotten to the first VPN endpoint yet, but that environment wasn’t allowed to connect in to the second environment.
       Okay, but I could connect to the other site’s VPN endpoint, and the other site didn’t have any problems. Why didn’t it accept my credentials? Well, some people in charge made very bad decisions (you may be noticing a theme!). All authentication was run through some VMs which were stored on the SAN. The VPN boxes in the working location were set to monitor the health of the authentication boxes in the failed location by pinging them. As long as they responded to ping, they were good, so the VPN boxes wouldn’t fail over to using their local authentication boxes. And a computer with its drive pulled can still respond to ping with just the network stack in RAM.
       Once we realized what was going on, we physically connected to the WAN routers and added routes to prevent the two sites from reaching each other’s authentication boxes. Presto! We could now log in via the DR environment as normal. The other infrastructure teams were then able to start digging into their parts.
       But why is the SAN unresponsive? Turns out this particular SAN vendor had an option for what to do under certain failure conditions: it could fail read-only or fail completely silent. This one was set to fail silent, and it had filled up.
       I wasn’t directly involved in fixing the SAN. I know the manager over the SAN team had been sounding the alarm for months before it filled. I also know there were multiple levels of bad configuration, such as more space offered by LUNs than the SAN could physically provide.
       Big takeaways:
       1. Make sure your access to fix a system doesn’t depend on that system. It’s really easy to accidentally introduce dependency cycles, and it takes constant work to avoid them.
       2. Superficial tests like whether you can ping something can’t detect some pretty major failures. More significant tests are more likely to notice the problem.
       3. When something is critical to an environment, maybe have more than one of them? The SAN had internal redundancy to deal with faulty drives and so on, but all the storage was in one giant pool. Multiple SAN systems can provide a bulkhead such that breaking one would not break all VMs.
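Editor's sketch for takeaway 2: the difference between a check the kernel can satisfy on its own and a check that needs the application to do some work. Both function names are hypothetical, and the deeper check assumes a protocol where the server speaks first (SSH sends a banner; plain HTTP does not, so it would need a request instead).

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Superficial check: the TCP handshake completed.

    The kernel can answer this even when the service behind the port
    is hung, much like a ping reply.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_responds(host: str, port: int, timeout: float = 2.0) -> bool:
    """Deeper check: the service actually sent some bytes back."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            return len(conn.recv(64)) > 0
    except OSError:
        # Covers refused connections and receive timeouts alike.
        return False
```

A failover decision based on `service_responds` (or better, on a real authentication round trip) would have noticed the authentication boxes were unusable, while anything equivalent to `tcp_port_open` or ping would keep reporting them as healthy.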
       
 (DIR) Post #Av7GlTc2vkDqhtKnYG by aimaz@mstdn.social
       2024-02-02T15:37:57Z
       
       1 likes, 0 repeats
       
       @shortridge https://500mile.email A few of the classics are on this site, which is named after the all-time best one.
       
 (DIR) Post #Av7GlcAN80QvEfSKAK by Kensan@mastodon.social
       2024-02-02T20:15:13Z
       
       0 likes, 0 repeats
       
       @shortridge Have you heard of the “OpenOffice.org won’t print to Brother printers on Tuesdays (but works on other days of the week)” bug? http://catless.ncl.ac.uk/Risks/25/77#subj14.1 https://mdzlog.alcor.net/2009/08/15/bohrbugs-openoffice-org-wont-print-on-tuesdays/ Ubuntu bug: https://bugs.launchpad.net/ubuntu/+source/file/+bug/248619
       
 (DIR) Post #Av7GldIutirWlSSf7g by KevinMarks@xoxo.zone
       2024-02-04T00:04:53Z
       
       1 likes, 0 repeats
       
       @Kensan @shortridge this reminds me of the "python only parses dates correctly after the 12th of the month" problem I had. (The dates in the files I was being sent had been changed to UK dd/mm/YYYY format. Python assumes mm/dd/YYYY unless the mm>12)
       
 (DIR) Post #Av7GldR4PPNtAkHApc by Kensan@mastodon.social
       2024-04-01T23:21:52Z
       
       1 likes, 0 repeats
       
       @shortridge Sorry for resurrecting this thread, but this one belongs here: "The Wi-Fi only works when it's raining." https://predr.ag/blog/wifi-only-works-when-its-raining/
       
 (DIR) Post #Av7GldqEtpW8Qo3H3w by lacouvee@mastodon.online
       2024-02-03T16:23:22Z
       
       0 likes, 0 repeats
       
       @Kensan @shortridge is this a serious bug? Because when my Office 2013 bites the dust I'll be moving to Open Office/Libre Office and I have a Brother printer - just sayin'.
       
 (DIR) Post #Av7GleVMQwPWULI79s by Kensan@mastodon.social
       2024-02-03T21:59:44Z
       
       1 likes, 0 repeats
       
       @lacouvee The bug was fixed in 2009. ;) @shortridge
       
 (DIR) Post #Av7Glec61tnYpERUem by ariaflame@masto.ai
       2024-02-04T15:39:22Z
       
       1 likes, 0 repeats
       
       @KevinMarks @Kensan @shortridge This is why all dates should be YYYY-mm-dd
       
 (DIR) Post #Av7Glh8mdL8KfBkcsq by Kensan@mastodon.social
       2024-02-02T20:20:10Z
       
       0 likes, 0 repeats
       
       @shortridge … or the story of the “magic”/“more magic” switch of MIT AI Lab’s PDP-10. https://users.cs.utah.edu/~elb/folklore/magic.html
       
 (DIR) Post #Av7GliKsBsOkMyPnMm by KevinMarks@xoxo.zone
       2024-02-04T17:48:03Z
       
       0 likes, 0 repeats
       
       @ariaflame @Kensan @shortridge the other challenge with being back in the UK is that half the year UTC and local time are the same. Oh, and being near 0 longitude...
       
 (DIR) Post #Av7GliQtpTDcffEblA by hugovk@mastodon.social
       2024-02-04T10:14:28Z
       
       0 likes, 0 repeats
       
       @KevinMarks @Kensan @shortridge Which bit of Python assumes mm/dd/YYYY unless mm>12?
       
 (DIR) Post #Av7GliyvmwROND9mnw by ariaflame@masto.ai
       2024-02-05T01:02:42Z
       
       1 likes, 0 repeats
       
       @KevinMarks @Kensan @shortridge Why I am glad to live somewhere that doesn't do DST
       
 (DIR) Post #Av7Glj4xQXGGftybCK by KevinMarks@xoxo.zone
       2024-02-04T11:34:47Z
       
       0 likes, 0 repeats
       
       @hugovk @Kensan @shortridge https://dateutil.readthedocs.io/en/stable/parser.html see the dayfirst and yearfirst settings docs
       
 (DIR) Post #Av7GljmCpjr8q2D8bo by hugovk@mastodon.social
       2024-02-04T18:46:01Z
       
       0 likes, 0 repeats
       
       @KevinMarks @Kensan @shortridge That's quite the gotcha! Well, if you're not using ISO dates, and don't tell it what format is being used, this library has to make some sort of guess between mm/dd/YYYY and dd/mm/YYYY. And iirc you have to tell the standard library which date format to parse, it won't guess.
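Editor's sketch of the standard-library behaviour mentioned above: `datetime.strptime` only parses the format it is given, so the same text can be read two ways without any warning, and a mismatch surfaces only once the day exceeds 12. The sample strings are made up for illustration.

```python
from datetime import date, datetime

ambiguous = "03/04/2024"    # hypothetical input: 3 April or 4 March?
unambiguous = "25/12/2024"  # only valid as day/month

as_uk = datetime.strptime(ambiguous, "%d/%m/%Y").date()
as_us = datetime.strptime(ambiguous, "%m/%d/%Y").date()
print(as_uk, as_us)  # two different days from the same text

try:
    datetime.strptime(unambiguous, "%m/%d/%Y")
    us_parse_failed = False
except ValueError:
    # The wrong format is only detected when the first field exceeds 12.
    us_parse_failed = True

# ISO 8601 text has a single reading.
iso = date.fromisoformat("2024-04-03")
```

This is why feeds that switch from mm/dd/YYYY to dd/mm/YYYY can appear to work for the first twelve days of a month.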
       
 (DIR) Post #Av7GlkatnGPDNFvcci by clacke@libranet.de
       2025-06-14T08:41:52Z
       
       0 likes, 1 repeats
       
       Back in the day I wrote some LotusScript to convert a bunch of textual dates, and I implemented a heuristic "some of these we can determine mm/dd vs dd/mm because the dd ≥ 13, and we know they should be in order, so we can use those anchor dates to further narrow down the others". 😅 @hugovk @KevinMarks @Kensan @shortridge
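Editor's sketch of an anchor-date heuristic along the lines described above, in Python rather than LotusScript. It is not the original code: the `a/b/YYYY` input shape, the function names, and the assumption that the list is in non-decreasing chronological order are all illustrative.

```python
from datetime import date


def candidates(text: str) -> list[date]:
    """Return every valid reading of 'a/b/YYYY' as dd/mm or mm/dd."""
    a, b, year = (int(part) for part in text.split("/"))
    found = set()
    for day, month in ((a, b), (b, a)):
        try:
            found.add(date(year, month, day))
        except ValueError:
            pass  # e.g. month 14: this reading is impossible
    return sorted(found)


def narrow(texts: list[str]) -> list[list[date]]:
    """Drop readings that cannot fit the known chronological order."""
    options = [candidates(t) for t in texts]
    changed = True
    while changed:
        changed = False
        for i, opts in enumerate(options):
            # Must not precede the earliest possible reading of any earlier
            # entry, nor follow the latest possible reading of any later one.
            lo = max((min(o) for o in options[:i] if o), default=date.min)
            hi = min((max(o) for o in options[i + 1:] if o), default=date.max)
            kept = [d for d in opts if lo <= d <= hi]
            if kept != opts:
                options[i] = kept
                changed = True
    return options


samples = ["03/02/2024", "14/02/2024", "01/03/2024", "20/03/2024"]
resolved = narrow(samples)
```

In `samples`, the second and fourth entries are anchors (14 and 20 cannot be months), and the ordering constraint then leaves a single reading for the other two. Entries that stay ambiguous keep both candidates, and an entry that contradicts the ordering ends up with an empty list.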
       
 (DIR) Post #Av7GlmDDmGecOOrjzk by Kensan@mastodon.social
       2024-02-02T20:30:48Z
       
       0 likes, 0 repeats
       
       @shortridge Oh, and I just remembered another one that was 🤪: @mxshift tells the story of a crazy bug while talking to @jessfraz and @bcantrill in the @oxidecomputer podcast “On the Metal”: timestamp is 24m49s. The solution with the “load bearing cron job” is *chef’s kiss* https://oxide.computer/podcasts/on-the-metal/rick-altherr
       
 (DIR) Post #Av7Glu5iU3P8pRPrPM by MichaelTBacon@social.coop
       2024-02-02T16:59:20Z
       
       1 likes, 0 repeats
       
       @shortridge And yes, one of the annoying things about running Solaris back in the day is that every time you installed an OS patch, it would *remove* Sendmail 8 and put Sendmail 5 BACK ON. Like, gee, thanks, Sun.
       
 (DIR) Post #Av7Gm0ysrr6ACHH0EK by maswan@mastodon.acc.sunet.se
       2024-02-03T21:00:00Z
       
       1 likes, 0 repeats
       
       @MichaelTBacon Oh yes, the era of the mandatory post-patch-cleanup script to avoid RCE via Solaris' horribly old and exploitable sendmail. Much has changed; now we can trust Debian and Ubuntu to ship security patches to the degree that automated updates are good on (most packages for most) production servers. @shortridge