[HN Gopher] Grave flaws in BGP Error handling
       ___________________________________________________________________
        
       Grave flaws in BGP Error handling
        
       Author : greyface-
       Score  : 246 points
       Date   : 2023-08-29 10:51 UTC (12 hours ago)
        
 (HTM) web link (blog.benjojo.co.uk)
 (TXT) w3m dump (blog.benjojo.co.uk)
        
       | lexicality wrote:
       | Very comforting to know that most of the core underpinning of the
       | internet doesn't follow the robustness principle by default.
       | 
       | Still, I guess it keeps support contracts active and network
       | engineers in a perpetual hell of on call alerts
        
         | hn8305823 wrote:
         | The software on big routers that run the Internet is _far_ more
         | reliable than it was 20 years ago. It's not perfect and each
         | vendor has cycles of good and bad code releases but it's so
         | much better than it used to be.
        
         | greyface- wrote:
         | > I worry about the state of networking vendors.
         | 
         | > With that being said, I would like to thank the OpenBSD
         | security team, who very rapidly acknowledged my report, and
         | prepared a patch. My only regret with dealing with the OpenBSD
         | team was reporting the issue to them too quickly, as they are
         | uninterested in waiting for any kind of coordination in
         | disclosure.
         | 
         | Lesson to network engineers: switch your routers to
         | OpenBSD/OpenBGPd and keep up on your syspatches to escape this
         | particular hell. ;)
        
           | karavelov wrote:
           | Why not switch to bird, which was not affected at all, rather
           | than merely acknowledged and fixed? Isn't "safe by default"
           | better than "fixed after the flaw was pointed out"?
        
           | ExoticPearTree wrote:
           | > My only regret with dealing with the OpenBSD team was
           | reporting the issue to them too quickly, as they are
           | uninterested in waiting for any kind of coordination in
           | disclosure.
           | 
           | This BS with disclosure coordination and all the other crap
           | vendors have been trying to train researchers to do for years
           | needs to stop. You find an issue, you publish it, vendors need
           | to deal with bad code and products in realtime, not when it
           | suits them.
        
             | mschuster91 wrote:
             | While I agree in theory, in practice this _will_ lead to
             | bad actors ranging from trolls to cyberwar troops
             | exploiting it immediately.
             | 
             | There is IMHO no ethical alternative to responsible
             | disclosure, and using actual people who can get harmed by
             | having their information disclosed as pawns isn't the way
             | to go forward.
        
             | zbentley wrote:
             | > vendors need to deal with bad code and products in
             | realtime
             | 
             | One of the main reasons for coordinated disclosure is that,
             | over time, it has become evident that vendors will not do
             | that. We can say it until we're blue in the face, but
             | vendor behavior will remain largely unchanged regardless
             | (there are a few bright spots, but not nearly enough).
             | 
             | So you go to war with the army you have. In this case, that
             | means frequently deferring/coordinating security event
             | communications. As much as I wish the market would
             | appropriately penalize vendors that encourage researchers
             | to sit on 0-days, that rarely happens, and the cost of
             | insecurity is passed on to consumers instead.
             | 
             | Tl;dr delaying disclosure does suck, but you should still
             | do it.
        
               | salawat wrote:
               | Alas, it is the consumer that apparently drives pain
               | reception for vendors, as they have no interest in
               | minimizing consumer pain, but their own pain. Their own
               | pain is defined as lost business.
               | 
               | 'Tis the world we live in, so might as well get used to
               | it.
        
           | appplication wrote:
           | I could have sworn I saw someone on here a few weeks ago
           | saying how they switched their router to openBSD and ended up
           | in some horrible broken state after a power outage or
           | something. I can't recall the details, maybe someone else
           | would recall.
        
             | basedrum wrote:
             | Maybe because you need to go to the data center to reboot
             | open bsd
        
             | binkHN wrote:
             | Who here is running a production system without redundant
             | or backup power?
        
               | appplication wrote:
               | I should have clarified: I think they said they did it on
               | their home router.
        
             | Shared404 wrote:
             | Huh, I've been running an openbsd router as my prod home
             | network router for ~ a year now without issue, and it has
             | handled power outages and more without issues.
             | 
             | I wonder if the outage happened during patching or
             | something maybe?
        
           | literalAardvark wrote:
           | That just sounds like creating a new level of hell
        
             | martijnvds wrote:
             | One full of daemons.
        
           | Hikikomori wrote:
           | Any 100+ Terabit Openbsd routers out there?
        
             | Nextgrid wrote:
             | I would very much doubt any of those carrier-grade routers
             | run BGP in hardware. Chances are the control plane is
             | handled by a bog-standard Linux/VxWorks/BSD that does all
             | the logic of talking BGP/etc and then populates the right
             | registers in the switching/routing hardware which handles
             | the actual data plane.
             | 
             | You could technically replace all the control and decision-
             | making with whatever off-the-shelf computer platform you
             | want, and have it drive the carrier-grade router and
             | "manually" populate its routing table via some
             | API/Telnet/SNMP. In fact, as long as the hardware packet-
             | routing core is still good, this might be a good way to
             | upgrade out-of-support equipment that is no longer getting
             | updates to its control-plane computer.
        
         | treve wrote:
         | The robustness principle: "be conservative in what you do, be
         | liberal in what you accept from others", hardly applies here.
         | Whether or not it was followed, a crash shouldn't happen.
         | 
         | The robustness principle has also long been considered not that
         | great of an idea for protocols. There's a good document here
         | going into this:
         | 
         | https://datatracker.ietf.org/doc/html/rfc9413
        
           | vermilingua wrote:
           | It isn't a crash, BGP terminates the session and retries
           | because it considers the connection to be malformed. So, "be
           | liberal in what you accept" certainly does apply.
        
             | j16sdiz wrote:
             | This "terminate and retry" behaviour is specified in
             | RFC7606
        
               | fanf2 wrote:
               | That is the exact opposite of what RFC 7606 says. As it
               | explains in its introduction, it deprecates session reset
               | in favour of treat-as-withdraw in most cases, unless
               | there is an explicit exception to the general rule.
        
           | forkerenok wrote:
           | That RFC is a very insightful read. Thanks for sharing!
        
       | cratermoon wrote:
       | This seems like a good time to mention Yuan et al., "Simple
       | Testing Can Prevent Most Critical Failures."
       | 
       | "We found the majority of catastrophic failures could easily have
       | been prevented by performing simple testing on error handling
       | code - the last line of defense - even without an understanding
       | of the software design."
        
       | JonChesterfield wrote:
       | Design behaviour here was on bad input, shutdown. Some bad input
       | happened, thus shutdown happened. Seems fine to me. I'm not
       | seeing a grave flaw.
        
         | vitiral wrote:
         | The issue is how it interacts with other designs, such as
         | forwarding of unknown attributes which are now poisoned
        
         | remram wrote:
         | Imagine if Gmail and Outlook refused to talk to each other and
         | transmit any email anymore, because Gmail tried to deliver one
         | email whose content was invalid (bad MIME, invalid attachment,
         | etc).
         | 
         | If you shut down the whole session because of one bad forwarded
         | message, you are indeed a broken implementation.
        
         | Bluecobra wrote:
         | Agreed, the article title is a bit sensationalist. Really
         | should be titled "RTFM: Juniper routers shut down BGP on bad
         | input". That being said, I found it to be a good deep dive and I
         | appreciate the author looking into this. BGP isn't perfect but
         | we made it this far and I think it does a pretty good job for
         | handling ~75K autonomous systems.
        
           | sgjohnson wrote:
           | > BGP isn't perfect but we made it this far and I think it
           | does a pretty good job for handling ~75K autonomous systems.
           | 
           | Absolutely not. To me, it's a miracle that the internet even
           | works.
        
           | basedrum wrote:
           | Sure if you didn't read and only skimmed the article...
        
           | gregw2 wrote:
           | Err no, it was not just Juniper. Many other systems were
           | fuzzed, half a dozen vendors were fine, another 5 weren't and
           | some remain unpatched and vulnerable.
           | 
           | If it were just a Juniper implementation flaw it'd be less
           | interesting than explaining an aspect of BGP error handling
           | more generally which is indeed what the article was, for
           | those of us aware of BGP but not familiar with specific
           | protocol design decisions and implications.
           | 
           | If you quibbled with the word "Grave" I might agree, since
           | it's not industrywide and doesn't impact major vendors like
           | Cisco and Huawei or those implementing RFC 7606 (although
           | there is no patch for Nokia or Extreme or Bird!). I'd
           | counterpropose that a more neutral TLDR title/summary for an
           | HN audience might be "Grave flaws in non compliant BGP error
           | handling."
           | 
           | But on balance I think the author is better off with raising
           | the alarm with a stronger headline; he clearly put his
           | passion to good use and I learned a little something as a
           | result.
        
             | nsteel wrote:
             | As I understand it, Nokia equipment already does the
             | correct thing if you configure it correctly in fault-
             | tolerance mode.
        
         | buro9 wrote:
         | The impacted systems get stuck on a retry loop... they can
         | neither forward the message blindly nor consume it. They break
         | the BGP connection, reset, come back alive and get the same
         | message and break the connection again.
         | 
         | Effectively the impacted instances are now no longer able to do
         | their job, they cannot propagate more rules.
         | 
         | That's the flaw.
         | 
         | When they receive something that they don't understand they
         | need to propagate without breaking connection, and they do not.
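
A minimal sketch of the retry loop described above, under simplifying assumptions: the peer re-sends its whole table after every reset, and the receiver tears the session down on the first malformed update. The function name `run_sessions`, the `(prefix, malformed)` tuples and the sample prefixes are hypothetical and only illustrate the behaviour; this is not how any real BGP daemon is written.

```python
def run_sessions(updates, max_attempts=3):
    """Model a receiver that resets the session on any malformed update.

    `updates` is the peer's full table as (prefix, malformed) pairs.
    The peer re-sends the same table after every reset, so a single
    bad entry makes every attempt fail in exactly the same way.
    """
    for attempt in range(1, max_attempts + 1):
        learned = []  # a reset discards everything learned from the peer
        for prefix, malformed in updates:
            if malformed:
                break  # send NOTIFICATION, tear the session down, retry
            learned.append(prefix)
        else:
            # The whole table arrived without an error.
            return {"attempts": attempt, "established": True, "routes": learned}
    return {"attempts": max_attempts, "established": False, "routes": []}


table = [("35.67.0.0/16", False), ("37.61.0.0/16", True), ("78.0.0.0/8", False)]
print(run_sessions(table))
```

With the one malformed entry in `table`, every attempt ends the same way and no route from the peer is ever usable, including the two that were perfectly fine.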
        
           | fanf2 wrote:
           | Yes. It's also important to emphasize that when a BGP session
           | between two ISPs is lost, all connectivity between those ISPs
           | on that path is also lost. This is why the author emphasized
           | when you can reconfigure the routers for error tolerance
           | without affecting active BGP sessions.
        
           | throw0101c wrote:
           | > _When they receive something that they don't understand
           | they need to propagate without breaking connection, and they
           | do not._
           | 
           | And this is how things _are_ done, per the article:
           | 
           | > _On 2 June 2023, a small Brazilian network (re)announced
           | one of their internet routes with a small bit of information
           | called an attribute that was corrupted. The information on
           | this route was for a feature that had not finished
           | standardisation, but was set up in such a way that if an
           | intermediate router did not understand it, then the
           | intermediate router would pass it on unchanged._
           | 
           | > _As many routers did not understand this attribute, this
           | was no problem for them. They just took the information and
           | propagated it along._
           | 
           | However, the information was _corrupted_ at the source, and
           | those devices that did understand the attribute _detected_
           | the corruption and rejected it:
           | 
           | > _However it turned out that Juniper routers running even
           | slightly modern software did understand this attribute, and
           | since the attribute was corrupted the software in its default
           | configuration would respond by raising an error that would
           | shut down the whole BGP session._
           | 
           | The question is whether the session shutdown was appropriate
           | (at least by default).
           | 
           | I think Juniper's default is reasonable, as attributes can
           | communicate important information about the intentions of the
           | network operators, and rejecting a particular attribute and
           | basically ignoring the intentions of the operators may also
           | cause problems.
           | 
           | I also think that allowing the session to continue while
           | throwing a (loud) warning is reasonable (and there is a flag
           | for that).
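
The "pass it on unchanged" behaviour quoted above comes from the flag bits carried by every BGP path attribute. As a rough sketch, assuming the flag layout from RFC 4271 (Optional 0x80, Transitive 0x40, Partial 0x20, Extended Length 0x10), this is how a router that does not recognise an attribute type decides what to do with it. The function name `handle_unrecognized` and the action strings are hypothetical.

```python
from typing import Tuple

# Path attribute flag bits (high-order bits of the flags octet, RFC 4271).
OPTIONAL = 0x80
TRANSITIVE = 0x40
PARTIAL = 0x20
EXTENDED_LENGTH = 0x10


def handle_unrecognized(flags: int) -> Tuple[str, int]:
    """Return (action, flags_to_send) for an attribute type this router
    does not recognise."""
    if not flags & OPTIONAL:
        # Every router must understand well-known attributes.
        return ("error", flags)
    if flags & TRANSITIVE:
        # Optional transitive: forward it untouched, but set the Partial
        # bit to record that a router on the path did not understand it.
        return ("forward", flags | PARTIAL)
    # Optional non-transitive: quietly ignore it and do not pass it on.
    return ("ignore", flags)


print(handle_unrecognized(OPTIONAL | TRANSITIVE))
```

A router that skips an attribute this way never inspects the value, which is why a corrupted value can travel several hops before reaching a router that does parse it.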
        
             | ExoticPearTree wrote:
             | > However, the information was corrupted at the source, and
             | those devices that did understand the attribute detected
             | the corruption and rejected it.
             | 
             | No they did not, they shut a whole session down for a
             | malformed announcement. The right decision would have been
             | to ignore the corrupt attribute and maybe not install that
             | particular route.
             | 
             | A logged message saying "Attribute 129 with value 0 is
             | invalid" would have helped operators more than just blindly
             | shutting down sessions through which this was announced.
        
               | fanf2 wrote:
               | The right decision is specified in RFC 7606, which is
               | usually to treat the malformed announcement as a
               | withdrawal. (BGP attributes cannot in general be
               | discarded safely.)
        
             | tsimionescu wrote:
             | The current behavior is akin to an HTTP server shutting
             | down whenever it receives an invalid HTTP request. Or, for
             | an even better analogy, it is like an HTTP server that,
             | upon receiving a malformed HTTP request, drops all current
             | and future traffic from that IP. A malicious user can then
             | issue a malformed HTTP request through an open proxy, and
             | deny all other connections to that server through that proxy
             | for all other users who were sending proper requests.
        
               | throw0101c wrote:
               | > _The current behavior is akin to an HTTP server
               | shutting down whenever it receives an invalid HTTP
               | request._
               | 
               | It is nothing like that.
               | 
               | HTTP requests do not propagate across an organization's
               | entire network, or possibly the entire Internet.
               | 
               | > _Or, for an even better analogy, it is like an HTTP
               | server that, upon receiving a malformed HTTP request,
               | drops all current and future traffic from that IP._
               | 
               | So just like (e.g.) fail2ban blocking IPs that
               | continuously send bad login requests? That's awesome! I
               | would _want_ my web server to block bad clients.
               | 
               | Heck, it would be nice if web browsers would do the same
               | thing with content they downloaded. In the early days of
               | the web there were all sorts of garbage files out there
               | and instead of forcing people to correct their HTML they
               | tried to be clever in mind reading:
               | 
               | * https://en.wikipedia.org/wiki/Tag_soup
               | 
               | For a time there was an entire 'movement' to get people
               | to clean up their act:
               | 
               | * https://en.wikipedia.org/wiki/W3C_Markup_Validation_Service
               | 
               | * https://en.wikipedia.org/wiki/HTML_Tidy
               | 
               | It's fine to have a flag to have stringency be a policy
               | that can be toggled, but I think Postel's law ("be
               | liberal in what you accept") can cause issues over the
               | long-term:
               | 
               | * https://en.wikipedia.org/wiki/Robustness_principle
        
               | fanf2 wrote:
               | It's like your origin web server unceremoniously dropping
               | the http connection when its front end reverse proxy
               | (fastly, cloudfront, akamai, etc) forwards an invalid
               | request, and when your web server restarts the proxy
               | retries the bad request.
        
               | tsimionescu wrote:
               | > So just like (e.g.) fail2ban blocking IPs that
               | continuously send bad login requests? That's awesome! I
               | would want my web server to block bad clients.
               | 
               | Well, somewhat, except that it's not awesome. It's like
               | fail2ban, but implemented on a server that sits behind a
               | load balancer. So, when it receives a bad login request
               | from an IP and it bans that IP, it actually banned the IP
               | of the load balancer, so now it won't receive any other
               | requests. So a malicious client can just send 1 bad
               | request, and take down everyone else using the whole
               | server.
               | 
               | Basically, this whole discussion is not so much about
               | Postel's law, it's about limiting failure domains. A bad
               | route advertisement should make that 1 route
               | inaccessible. It is not good design for a bad route
               | advertisement to take down all the other good routes. It
               | is particularly bad design for a protocol where routes
               | are automatically propagated by devices that don't think
               | there's anything wrong with them.
               | 
               | If you want to compare it to the HTML situation, the bug
               | can also be seen as a browser that, when it receives
               | invalid HTML from www.google.com/search would not only
               | throw an error, but also refuse any other connection to
               | *.google.com/*. The proposed fix is not to interpret
               | anything as valid HTML. It is to show an error only when
               | accessing www.google.com/search.
        
         | capableweb wrote:
         | That sounds like a typical denial of service attack, which you
         | don't see a problem with?
        
           | JonChesterfield wrote:
           | It sounds like a system designed to stop doing things on
           | unexpected input. The designers chose correctness over
           | robustness. It's not an accidental vulnerability.
           | 
           | Maybe switch off on request is a dubious design choice for
           | infrastructure that you'd rather stay alive, but it sounds
           | like a new version of the spec has already been released to
           | make that an option.
        
         | mnw21cam wrote:
         | Bad input -> shutdown provides a method by which an external
         | attacker can shut down your service by providing bad input,
         | otherwise known as a denial of service attack.
        
           | throw0101c wrote:
           | The input in question was a particular attribute: these
           | attributes can have important information about the
           | intentions of how the operators wish things to work. Simply
           | ignoring the bad input and carrying on the BGP session
           | without the attribute(s) in question may lead to working
           | against the desired intentions of the network operators.
           | 
           | I think Juniper's default is reasonable: rejecting a
           | particular attribute and basically ignoring the intentions of
           | the operators could also cause problems, and instead of
           | trying to read the minds of those running the networks, the
           | software doesn't try to be clever.
           | 
           | I also think that allowing the session to continue while
           | throwing a (loud) warning is reasonable (and there is a flag
           | for that).
        
             | tsimionescu wrote:
             | The problem is that one bad route in a BGP session caused
             | _the whole session_ to shut down. So if ISP1 advertises a
             | new route with a corrupted attribute to ISP2 who doesn't
             | understand it, and ISP2 advertises it to ISP3 who does
             | understand the corrupted attribute, _all traffic_ between
             | ISP2 and ISP3 will shut down. So ISP1 has caused
             | communication failure between ISP2 and ISP3 for all
             | clients, not just for its own routes.
        
               | throw0101c wrote:
               | Yes, I know how BGP propagation works.
               | 
               | My point still stands: attributes can have important
               | information about the intentions of how a network
               | operator wishes their network to be seen to the rest of
               | the world. Having a corrupt attribute, and thinking it's
               | okay to simply assume that without that attribute the
               | rest of the advertisement is still valid is not fine
               | IMHO: you're basically trying to guess/assume the
               | intentions of the source of the advertisement.
               | 
               | For all we know ignoring that attribute and accepting the
               | rest of the advertisement as-is could have caused a flood
               | of traffic over the connection causing a just-as-
               | effective-DoS.
        
               | tsimionescu wrote:
               | That is not what is being done. If you advertise a route
               | with a corrupt attribute, your route will be considered
               | inaccessible by any device that understands the
               | attribute.
               | 
               | But! All your other routes that you advertised earlier
               | with perfectly fine attributes will continue working.
               | This is what prevents this from becoming a DoS
               | vulnerability.
               | 
               | Let's take a more complete example.
               | 
               | ISP1 advertises 2 routes to ISP2:
               | 
               |     35.67.0.0/16 attr1=val1
               |     37.61.0.0/16 attr1=corr
               | 
               | ISP2 doesn't know what attr1 is, and it advertises the
               | following routes to ISP3:
               | 
               |     35.67.0.0/16 attr1=val1
               |     37.61.0.0/16 attr1=corr
               |     78.0.0.0/8
               | The bug is that ISP3 will then stop routing any traffic
               | to 35.67.0.0/16, 37.61.0.0/16, or to 78.0.0.0/8.
               | 
               | After the fix, ISP3 will keep routing traffic to
               | 35.67.0.0/16 (respecting the value val1 for attribute
               | attr1) and to 78.0.0.0/8 through ISP2. It will not route
               | traffic to 37.61.0.0/16.
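
The ISP3 side of this example can be sketched as a toy model. It assumes a corrupted attribute is recognisable simply as `attr1 == "corr"`, and the names `process_updates` and `treat_as_withdraw` are hypothetical; a real implementation parses wire-format attributes and follows the per-attribute rules in RFC 7606.

```python
def process_updates(updates, treat_as_withdraw):
    """Return the routes ISP3 keeps from ISP2 after processing `updates`.

    updates: list of (prefix, attrs) pairs as advertised by ISP2.
    treat_as_withdraw: False models the bug (session reset on error),
    True models the fix (only the bad route is dropped).
    """
    rib = {}
    for prefix, attrs in updates:
        if attrs.get("attr1") == "corr":  # stand-in for a failed attribute parse
            if treat_as_withdraw:
                rib.pop(prefix, None)  # handle the bad route as withdrawn
                continue
            return {}  # session reset: every route from this peer is lost
        rib[prefix] = attrs
    return rib


from_isp2 = [
    ("35.67.0.0/16", {"attr1": "val1"}),
    ("37.61.0.0/16", {"attr1": "corr"}),
    ("78.0.0.0/8", {}),
]
print(process_updates(from_isp2, treat_as_withdraw=False))
print(process_updates(from_isp2, treat_as_withdraw=True))
```

The first call models the bug and leaves ISP3 with no routes via ISP2; the second keeps 35.67.0.0/16 (with attr1=val1) and 78.0.0.0/8 and drops only 37.61.0.0/16.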
        
               | fanf2 wrote:
               | As the fine article explains, your concerns are addressed
               | by RFC 7606. It recommends that in many cases, an error
               | should be treated as a route withdrawal, not by dropping
               | the session. (The RFC specifies other ways to handle
               | errors in specific situations.) There's no guessing or
               | assuming.
               | 
               | In the scenario that started off the author's
               | investigation, the bad attribute kicked COLT off the
               | Internet even though COLT did not connect directly to the
               | Brazilian ISP that sent the bad route update. When they
               | received the bad update, COLT's routers dropped their BGP
               | sessions with intermediate ISPs that were unrelated to
               | the Brazilian ISP, causing complete loss of connectivity.
               | 
               | If COLT's routers had treated the error as a route
               | withdrawal, they would have lost connectivity to the
               | Brazilian ISP but not the rest of the Internet.
        
       | sgjohnson wrote:
       | A small Brazilian network yesterday hijacked half the internet
       | too, including several of my own prefixes, all of whom have IRR
       | and RPKI set up.
       | 
       | https://bgp.tools/as/266970?show-low-vis#prefixes
       | 
       | [bgp.tools Alert] for AS200676 A hijack on 2a06:a005:2730::/44
       | from AS266970 (NET WAY PROVEDOR DE INTERNET DE CACOAL LTDA) has
       | been detected (based on your address policy you have configured)
       | 
       | [bgp.tools Alert] for AS200676 A hijack on 2a0a:6040:b1f4::/48
       | from AS266970 (NET WAY PROVEDOR DE INTERNET DE CACOAL LTDA) has
       | been detected (based on your address policy you have configured)
       | 
       | [bgp.tools Alert] for AS200676 A hijack on 2a0a:6040:b1f1::/48
       | from AS266970 (NET WAY PROVEDOR DE INTERNET DE CACOAL LTDA) has
       | been detected (based on your address policy you have configured)
       | 
       | [bgp.tools Alert] for AS200676 A hijack on 2a0a:6040:b1f0::/48
       | from AS266970 (NET WAY PROVEDOR DE INTERNET DE CACOAL LTDA) has
       | been detected (based on your address policy you have configured)
       | 
       | BGP is a complete mess.
        
       | basedrum wrote:
       | What is the response of the bird developers?
       | 
       | Also, for those responding that you don't see the problem, it's
       | simple: remote unsanitized input causes denial of service
        
         | gorkish wrote:
         | Bird is unaffected, so I assume their response is satisfaction
         | for a job well done. FRR, on the other hand, appears to be less
         | responsive to concerns -- which does actually concern me since
         | I use FRR, though not presently for eBGP. Disclosure seems to
         | have done the job thankfully, as there appears to have been a
         | little scramble over at FRRouting/frr about an hour ago.
        
       | wawwow wrote:
       | [flagged]
        
         | TheHappyOddish wrote:
         | No, these are very unrelated things.
        
           | salawat wrote:
           | How do you figure? The IP address to route to in DNS is still
           | an IP which from within that network segment will believe
           | that prefix is provided by that AS if they aren't honoring
           | RPKI.
           | 
           | Therefore, within that routing zone, that traffic would be
           | blackholed I'd think, unless I'm seriously missing some
           | nuance that I'd love it if you would share.
        
       | hannob wrote:
       | Not surprised.
       | 
       | A few years ago a team of researchers tried to use a feature
       | attribute flag in BGP. Their experiment caused an important
       | piece of BGP software to crash.
       | 
       | The BGP community reacted by thanking the researchers for
       | uncovering that flaw and worked on improving stability. No, just
       | kidding. They shouted at the researchers and asked them to stop,
       | which they eventually did.
       | 
       | I felt back then that this community had a very unhealthy
       | relationship with the quality of their software and the work of
       | security researchers.
       | 
       | Sources:
       | https://mailman.nanog.org/pipermail/nanog/2018-December/0984...
       | https://mailman.nanog.org/pipermail/nanog/2019-January/09876...
       | https://mailman.nanog.org/pipermail/nanog/2019-January/09918...
       | https://mailman.nanog.org/pipermail/nanog/2019-January/09914...
        
         | hinkley wrote:
         | Recipe as I got it from grandma
         | 
         | 1) Beleaguered people misbehave.
         | 
         | 2) You are usually the architect of your own destruction.
         | 
         | 3) It's easier to see other people's problems than your own.
         | 
         | Stir into a death spiral and let it rest for six to twelve
         | months. Bake at 375deg for two years, then serve cold with a
         | sprinkling of schadenfreude to offset the bitterness.
        
         | dogleash wrote:
         | > I felt back then that this community had a very unhealthy
         | relationship with the quality of their software and the work of
         | security researchers.
         | 
         | I suspect a Network Operators mailing list is going to be full
         | of people trying to keep production up, with the tools they've
         | got, rather than the tools they'd like. Thinking the software
         | quality on the routers they're using is dogshit is probably
         | just called "Monday" and thinking their peers' routers are even
         | worse is called "Tuesday".
         | 
         | I can understand they don't appreciate "research" done on
         | production systems that might cause issues when they already
         | know their equipment sucks.
        
           | gnfargbl wrote:
           | I _can't_ understand it. The network operators should save
           | their vitriol for the vendors who are selling them equipment
           | that sucks, and for whoever in their own organisation is
           | continuing to purchase equipment that sucks.
           | 
           | Bitching at researchers (note: not "researchers") who are
           | finding and calling out those flaws before someone really bad
           | comes along and exploits them is somewhere between unhelpful
           | and negligent. Internet security is always improved by more
           | sunlight.
        
             | hn8305823 wrote:
             | > I can't understand it. The network operators should save
             | their vitriol for the vendors who are selling them
             | equipment that sucks
             | 
             | We call out vendors when their quality suffers but you only
             | have two or three practical vendor options for high end
             | routers depending on your use case/feature requirements.
             | All three have varying cycles of hardware and software
             | quality, with hardware and software rarely being in phase
             | for the same vendor.
             | 
             | The fact that all three vendors make most of their money on
             | non-network focused enterprises with less sophisticated
             | engineers doesn't help get development attention where it
             | needs to be for global infrastructure.
        
             | tw04 wrote:
             | >I can't understand it. The network operators should save
             | their vitriol for the vendors who are selling them
             | equipment that sucks, and for whoever in their own
             | organisation is continuing to purchase equipment that
             | sucks.
             | 
             | I think you fail to realize that the internet is a bunch of
             | disparate networks that no one person controls. When
             | "fixing" a router means you break half the internet, you
             | just don't fix it. This isn't some walled garden where you
             | can just say "tough shit if you broke, fix it".
             | 
             | Most of the "issues" these researchers find are well known,
             | but backbone operators are more militant than Linus when it
             | comes to: rule #1 is don't break user space.
        
               | gnfargbl wrote:
               | > I think you fail to realize that the internet is a
               | bunch of disparate networks that no one person controls.
               | 
               | The internet is a bunch of disparate networks that no one
               | person controls, _exactly_. That means that when someone
               | tells you there's a fault in the kit that you have
               | installed in your network, then that is 100% your problem
               | and it is 100% on you to get it fixed. Once a
               | vulnerability is known then someone out there is going to
               | start exploiting it almost immediately.
               | 
               | > backbone operators are more militant than Linus when it
               | comes to: rule #1 is don't break user space.
               | 
               | Backbone operators can have all the hubris they want, but
               | it won't change the reality that the only effective
               | action they can take when a vulnerability is found is to
               | get it fixed ASAP. This is a security lesson that has
               | been learnt in recent years by many sectors of the IT
               | industry and, judging by the woeful response that Ben
               | Cartwright-Cox describes in his post, it's one that the
               | network backbone is going to be learning soon.
        
               | tw04 wrote:
               | >The internet is a bunch of disparate networks that no
               | one person controls, exactly. That means that when
               | someone tells you there's a fault in the kit that you
               | have installed in your network, then that is 100% your
               | problem and it is 100% on you to get it fixed.
               | 
               | So you've never worked in a corporate environment. Here's
               | how that conversation would go:
               | 
               | Hey guys, some researchers found out if you use a test
               | flag meant for the lab on the public internet, it breaks
               | all our BGP sessions.
               | 
               | OK, so drop their feed and block them?
               | 
               | Great, done.
               | 
               | >Once a vulnerability is known then someone out there is
               | going to start exploiting it almost immediately.
               | 
               | And yet the "vulnerability" in question was known, and
               | was not being immediately exploited if you read through
               | the mailing list or were participating in NANOG at that
               | point in time. So your statement is provably false.
               | 
               | >Backbone operators can have all the hubris they want,
               | but it won't change the reality that the only effective
               | action they can take when a vulnerability is found is to
               | get it fixed ASAP.
               | 
               | And yet we're having this conversation in 2023, they have
               | operated the same way for 40+ years, and somehow the
               | internet is still working. Bad actors get blackholed, it
               | worked in the past, it'll continue working in the future.
               | The reality is that backbone routing is expensive, and
               | expecting everyone to update their kit on YOUR timeline
               | isn't reasonable.
        
               | robertlagrant wrote:
               | > OK, so drop their feed and block them?
               | 
               | > Great, done.
               | 
               | The "great" is the problem, no?
        
               | gnfargbl wrote:
               | I am aware that network operators have behaved this way
               | for very many years. The point I am making is that _all
               | of the IT industry_ used to attempt to "work around"
               | security vulnerabilities in the same way, until log4shell
               | and all the others gradually beat that propensity out of
               | them.
               | 
               | I am prophesying that a similar reckoning is likely to
               | come upon the network backbone. You're arguing that the
               | cost of entry to the game (BGP peering) is such that the
               | old ways will continue to work. Let's hope you're right.
        
               | devonkim wrote:
               | The thing is, software ecosystems like libraries and
               | frameworks have completely different propagation and
               | remediation mechanics compared to federated systems like
               | core Internet backbone routers and switches. Try as we
               | might to conceptualize it otherwise, the modern Internet
               | from a packet's purview is more like a loose
               | confederation of ultimately privatized or state-run
               | fiefdoms than a cellular automata digraph explosion. So
               | actors that try to act maliciously against the network
               | will be basically shut out given the rule of an iron fist
               | being the default.
        
               | zamadatix wrote:
               | The internet does not owe a negligent operator peering.
               | No one person controls the internet but you sure as hell
               | can find yourself kicked out if your goal is to be right
               | on principle instead of a good peer. The only hubris
               | which could be at play is someone thinking the rest of
               | the internet should follow their desires instead of the
               | other way around.
               | 
               | Research is great but if you're poking around breaking
               | things via DoS in the internet you're going to be judged
               | as a bad peer regardless what the route daemon does. The
               | security of the internet constantly moves forward but
               | that doesn't mean the expectation of improving has to
               | hold more precedence than the expectation of running.
               | It's still very much a human to human system for all but
               | the actual route exchange.
        
             | netfortius wrote:
             | I suppose you have never seen, ever, in your professional
             | career, the "C" suits making business deals and thus
             | decisions on what's to be acquired, with big
             | network/server/storage/database/software vendors, over the
             | head and against the advice of those then called to support
             | crap in production.
        
           | themerone wrote:
           | Network operators are also the ones that pop into IPv6
           | discussions to voice their displeasure with having to support
           | another protocol.
        
             | booi wrote:
             | To be fair... supporting dual stack really does suck
        
               | yjftsjthsd-h wrote:
               | A lot about networking sucks, but that's what people are
               | paying them for.
        
             | philg_jr wrote:
             | Also to be fair, IME there seems to be a lot more vocal
             | pro-IPv6 people on the NANOG list than not.
        
         | throw0101c wrote:
         | > _I felt back then that this community had a very unhealthy
         | relationship with the quality of their software and the work of
         | security researchers._
         | 
         | Or they felt like some of us feel about Tesla/Musk unleashing
         | cars with "Full" "Self-Driving" on public roads.
        
           | adra wrote:
           | As lousy as these systems are, they kill a hell of a lot
           | fewer people (for now) than drunks, so net positive for the
           | time being.
        
             | mschuster91 wrote:
             | The solution to drunk drivers is public transport, not
             | self-driving cars.
        
               | zaroth wrote:
               | It's 2023 and driving kills like 40,000 people a year.
               | 
               | We've been building trillions of dollars of public
               | transit since about the same time as we started building
               | the National highway system. So when exactly does public
               | transit start saving all these lives?
               | 
               | A new train from SF to LA has been in the works since
               | 1996. They are expecting partial completion for $30
               | billion by 2030 which is projected to carry ~15 million
               | passengers * ~400 miles = 6 billion passenger miles per
               | year.
               | 
               | The fatality rate for passenger cars is ~0.57 per 100
               | million miles, so eliminating 6 billion passenger miles
               | could be reasonably expected to save about 60 * 0.57 = 34
               | lives a year out of the ~40,000 that are dying on the
               | roads annually.
               | 
               | This is just rough napkin math showing the scale of the
               | problem would require a roughly $30 _trillion_ investment
               | in public transit to appreciably shift passenger-miles
               | away from passenger cars.
               | 
               | Public transit can be very useful. But it's clearly not
               | the solution to eliminating driving deaths.
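               | 
               | A rough sketch of the napkin math above, using only the
               | figures quoted in this comment (none of them independently
               | checked). The 1000x scale-up is one reading of how the
               | $30 trillion figure follows, and it assumes lives saved
               | scale linearly with spending:

```python
# Figures as quoted in the comment; treat all of them as rough assumptions.
passengers_per_year = 15_000_000      # ~15 million riders per year
trip_miles = 400                      # ~400 miles, SF to LA
fatalities_per_100m_miles = 0.57      # quoted passenger-car fatality rate
annual_road_deaths = 40_000           # quoted annual driving deaths
project_cost = 30_000_000_000         # $30 billion for partial completion

# Passenger miles moved off the road, and the deaths that mileage implies.
passenger_miles = passengers_per_year * trip_miles
lives_saved = passenger_miles / 100_000_000 * fatalities_per_100m_miles
share_of_deaths = lives_saved / annual_road_deaths

# Assumption: benefits scale linearly with spending (a big simplification).
trillion_budget = 30_000_000_000_000  # $30 trillion
scale_factor = trillion_budget // project_cost
scaled_lives = lives_saved * scale_factor

print(f"{passenger_miles:,} passenger miles per year")
print(f"about {lives_saved:.1f} lives saved per year")
print(f"{share_of_deaths:.2%} of quoted annual road deaths")
print(f"{scale_factor}x the spending -> about {scaled_lives:,.0f} lives per year")
```

               | Under those assumptions the single line saves well under
               | 1% of annual road deaths, which is the point being made.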
        
               | autoexec wrote:
               | It seems unfair to say that public transport doesn't have
               | a role in reducing deaths caused by drunk drivers just
               | because the hugely flawed and half-assed public transport
               | systems in our cities (which were designed to make public
               | transport ineffective) haven't yet delivered on that
               | promise. In places where cities and public transportation
               | systems are well designed and useful people do use them
               | and those places see far fewer deaths from drunk drivers
               | than the US does as a result.
               | 
               | You argue that it would be very expensive to fix the
               | problems we've made for ourselves in order to get useful
               | public transportation and that's certainly true. It's
               | also very expensive to invent self-driving cars and part of
               | that cost has already been paid with lives lost and human
               | suffering, but that hasn't stopped us from developing
               | those systems. We could decide to invest instead in
               | improving our cities and infrastructure. We just choose
               | to let obscenely rich people kill innocent Americans
               | while they play with their profit generating spying car
               | tech instead.
        
             | hiddencost wrote:
             | You're confusing "musk killing people to compete for market
             | share because he's woefully behind" with "driverless cars".
             | 
             | There are other people building driverless cars, who have
             | safer systems that they rolled out reliably.
             | 
             | Musk isn't saving people from drunk drivers, he's trying to
             | catch up with the competition and chose to kill people to
             | catch up.
        
             | simiones wrote:
             | That only makes sense if _all_ the people using self-
             | driving cars would have driven drunk (or drugged, or
             | sleepy) instead. Otherwise, you're replacing a decent
             | driver with a sub-mediocre automatic system.
        
           | yjftsjthsd-h wrote:
           | I think the difference is that network gear is supposed to
           | handle the whole spec and gracefully deal with (drop) things
           | that aren't to spec. If there was a law that all cars had to
           | be able to get hit by another car and keep going without
           | issue (and this was physically plausible), then it would be
           | similar.
        
         | tw04 wrote:
         | If you read the links: they discovered that using this flag
         | broke EVERY installation of FRR on the internet. They let the
         | developers know. And then said they were going to test again in
         | a week.
         | 
         | Does any sane person think that 1 week is enough time to both
         | notify every user of FRR in the _WORLD_, as well as ensure
         | they have upgraded their installations for a "bug" that only is
         | triggered by someone using experimental flags meant for testing
         | purposes only?
         | 
         | Mind you, these researchers didn't test ANYTHING in a lab
         | environment first "because it's a lot of work to test all the
         | open source BGP implementations in the wild". AKA: I'm lazy.
        
           | LegionMammal978 wrote:
           | > Mind you, these researchers didn't test ANYTHING in a lab
           | environment first "because it's a lot of work to test all the
           | open source BGP implementations in the wild". AKA: I'm lazy.
           | 
           | That's a gross exaggeration: they said that they first tried
           | it "in a controlled environment with different versions of
           | Cisco, BIRD, and Quagga routers" [0] without issue.
           | Presumably, the problem we know about would have been avoided
           | if they had happened to include FRR in their test suite
           | beforehand. But that raises the next question, exactly how
           | many implementations does one need to test before one is no
           | longer "lazy"?
           | 
           | [0] https://mailman.nanog.org/pipermail/nanog/2019-January/09876...
        
         | braiamp wrote:
         | I don't think that is an accurate representation of the
         | "community", it was a "singular complaint from a company
         | advertising unallocated ASN and IPv4 resources", and got told
         | off by literally most of the mailing list. I'm counting at
         | least 20 other individual responses and all of them are in
         | support of the experiment
         | https://mailman.nanog.org/pipermail/nanog/2019-January/threa...
        
         | capableweb wrote:
         | Is there a reason why the researchers did this on the live
         | production BGP network rather than in a test environment,
         | running their own BGP and FRR routers? Seems a bit haphazard,
         | as doing so first would likely have uncovered the issue.
         | 
         | Overall I agree with your message though, and of course things
         | shouldn't stop working just because you use attributes reserved
         | for development.
        
           | LegionMammal978 wrote:
           | The researchers claimed that they had first tested it with a
           | few different vendors' routers without any issues, but they
           | had not included FRR in those tests [0]. There was a bit of
           | discussion on the mailing list regarding just how many BGP
           | implementations (and configurations) a responsible researcher
           | is obligated to test with before exposing something new on
           | the public Internet.
           | 
           | [0] https://mailman.nanog.org/pipermail/nanog/2019-January/09876...
        
             | salawat wrote:
             | The correct answer is "All of them on your own network
             | first".
             | 
             | Thou shalt not blow up another's network knowingly.
             | 
             | Blowing up another's network because you didn't test your
             | theory in your own isolated lab space first counts as
             | blowing up another's network knowingly. Because what the
             | hell else did you think would happen?
             | 
             | Cripes.
        
         | xyst wrote:
         | To be honest, if you are a responsible researcher you are NOT
         | running "experiments" in live environments that have potential
         | to impact multiple regions.
         | 
         | The outrage here is understandable. I would be pissed if I was
         | paged because halfway across the world, some asshole caused
         | this intentionally. Also this mailing list is aimed at the
         | operators, not the vendors. Different communities.
         | 
         | The vendor response here, I do agree is irresponsible. Besides
         | OpenBSD, every one here just sucked.
        
           | pixl97 wrote:
           | Ya the researcher could just sell their findings to whatever
           | blackhat groups are out there instead. You generally get less
           | pushback and more cash from them.
        
         | ExoticPearTree wrote:
         | The sane thing to do is to drop that particular announcement
         | with that invalid attribute or discard the invalid information.
         | But some people tend to be more catholic than the pope.
         | 
         | Dropping the whole session, especially in a DFZ scenario is
         | really bad. The churn that you generate is immense for all your
         | neighbors.
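         | 
         | A minimal sketch of that decision, loosely following the
         | general approach of RFC 7606 rather than quoting it. The
         | ParsedUpdate type, the choose_action function and the set of
         | "harmless" attributes are all hypothetical names and
         | assumptions made for illustration, not any vendor's logic:

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SESSION_RESET = "session reset"          # tear down the whole session
    TREAT_AS_WITHDRAW = "treat-as-withdraw"  # drop only this announcement
    ATTRIBUTE_DISCARD = "attribute discard"  # strip the bad attribute, keep the route
    ACCEPT = "accept"


@dataclass
class ParsedUpdate:
    """Hypothetical result of parsing a single UPDATE message."""
    nlri_parsable: bool         # could the announced prefixes be located at all?
    malformed_attrs: list[str]  # names of attributes that failed validation


# Assumption for this sketch: attributes with no effect on route selection.
NO_EFFECT_ON_SELECTION = {"ATOMIC_AGGREGATE", "AGGREGATOR"}


def choose_action(update: ParsedUpdate) -> Action:
    """Pick the least disruptive response to a possibly malformed UPDATE."""
    if not update.nlri_parsable:
        # Without the prefixes there is nothing to withdraw selectively.
        return Action.SESSION_RESET
    if not update.malformed_attrs:
        return Action.ACCEPT
    if all(name in NO_EFFECT_ON_SELECTION for name in update.malformed_attrs):
        return Action.ATTRIBUTE_DISCARD
    return Action.TREAT_AS_WITHDRAW


print(choose_action(ParsedUpdate(True, ["AS_PATH"])).value)
print(choose_action(ParsedUpdate(True, ["AGGREGATOR"])).value)
print(choose_action(ParsedUpdate(False, ["AS_PATH"])).value)
```

         | The session reset is kept as the last resort, so one bad
         | announcement does not churn every route learned from that
         | neighbor.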
        
           | crest wrote:
           | Just have a bunch of fuzzers running 24/7 to keep all
           | implementations honest.
        
             | clankyclanker wrote:
             | That's what Google used to do (though not for BGP), back in
             | their engineering days. That's what gave us AFL.
        
               | tialaramex wrote:
               | And indeed GREASE, the TLS feature where you just propose
               | nonsense because every other party should go "Er, no I
               | don't speak nonsense?" to prevent ossification, also comes
               | from Google engineers.
               | 
               | GREASE means that when you build your "advanced" security
               | device which "protects" customers by treating everything
               | different as hostile, it won't work, so you'll need to
               | tweak it to _at least_ just ignore such differences as
               | irrelevant, which is enough that we can come back later
               | and intentionally improve things.
               | 
               | Previously, without GREASE, we'd have to guess what
               | oversights we could exploit to avoid this "protection" to
               | deliver protocol improvements; if we guessed wrong
               | nothing works, or everybody's security is broken,
               | sometimes both. e.g. for TLS 1.2 the oversight we found
               | is, if you're resuming an existing session the "security"
               | products just wave that through because otherwise they
               | break people's real workflows. So in TLS 1.3 protocol as
               | actually spoken essentially an initial connection goes
               | like this:
               | 
               | Client: Hi some.dns.name.example I'm a TLS 1.2 Client,
               | I'd like to resume our previous conversation
               | #randomNonsense. Also, completely unrelated, I happen to
               | speak FlyCasualThisIsReallyTLSv1.3 and so I have these
               | TLSv1.3KeyAgreementParameters.
               | 
               | then either:
               | 
               | TLS 1.3 Server: Hi Client, of course, let's continue from
               | there. [ whereupon everything further is encrypted
               | because this is actually TLS 1.3, but to a dumb middlebox
               | it makes sense that they're just resuming a prior
               | encrypted conversation #randomNonsense which it doesn't
               | remember ]
               | 
               | OR
               | 
               | TLS 1.2 Server: Er, no I don't remember any such
               | conversation and I don't know
               | FlyCasualThisIsReallyTLSv1.3 so let's start a fresh
               | conversation as normal.
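               | 
               | A toy sketch of the GREASE idea, not a real TLS stack.
               | The reserved values are the 0x0A0A ... 0xFAFA pattern
               | from RFC 8701; the client_offer, tolerant_select and
               | brittle_select functions and the server's version set
               | are hypothetical, made up for illustration:

```python
import random

# GREASE reserved values (RFC 8701): 0x0A0A, 0x1A1A, ..., 0xFAFA.
GREASE_VALUES = {(n << 12) | 0x0A00 | (n << 4) | 0x0A for n in range(16)}

TLS12, TLS13 = 0x0303, 0x0304
SERVER_SUPPORTED = {TLS12, TLS13}  # hypothetical server configuration


def client_offer(rng: random.Random) -> list[int]:
    """Versions a hypothetical client advertises, with one GREASE value mixed in."""
    return [rng.choice(sorted(GREASE_VALUES)), TLS13, TLS12]


def tolerant_select(offered: list[int], supported: set[int]):
    """Ignore anything unrecognized and pick the highest shared version."""
    known = [v for v in offered if v in supported]
    return max(known) if known else None


def brittle_select(offered: list[int], supported: set[int]) -> int:
    """Anti-pattern: treat any unknown value as hostile and fail."""
    if any(v not in supported for v in offered):
        raise ValueError("unknown version offered")
    return max(offered)


offer = client_offer(random.Random())
print(hex(tolerant_select(offer, SERVER_SUPPORTED)))
try:
    brittle_select(offer, SERVER_SUPPORTED)
except ValueError as err:
    print("brittle peer failed:", err)
```

               | Because the nonsense value is always present, the
               | brittle variant breaks on day one instead of years
               | later when a real new version ships.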
        
             | hinkley wrote:
             | I think you've discovered the secret to getting people to
             | exercise more.
        
               | salawat wrote:
               | More like the secret to getting ignored/blackholed/sued
               | into the ground the moment your fuzzer hits on a combo
               | that actually does damage to someone else's system, since
               | the implication here is you're not doing it on your own
               | stuff.
               | 
               | If you are, good on ya. Continue doing $deity's/Bob's
               | work.
        
               | pixl97 wrote:
               | More like the secret is to buy up some routing equipment
               | and run these tests on your own network, sell the
               | exploits you find to nefarious actors for big bucks, and
               | have a bunker to survive the crash of the internet in.
               | 
               | If your system is on the internet it is now a potential
               | piece in a potential global war.
        
               | salawat wrote:
               | Always has been, ever since its inception. The only
               | thing that'd kept it safe and unutilized in that regard
               | had been a grassroots effort to not facilitate that
               | transformation. That only lasted so long as the main
               | population of Ops/devs/netizens were in agreement.
               | 
               | We've blasted past that age at this point, so everybody
               | gets to don their robe and wizard hat, pick up the staff,
               | and yeet patches, routing rules, and packets betwixt one
               | another.
        
           | fanf2 wrote:
           | As the fine article explains, that is what RFC 7606
           | specifies.
        
       ___________________________________________________________________
       (page generated 2023-08-29 23:01 UTC)