[HN Gopher] Grave flaws in BGP Error handling
___________________________________________________________________
Grave flaws in BGP Error handling
Author : greyface-
Score : 246 points
Date : 2023-08-29 10:51 UTC (12 hours ago)
(HTM) web link (blog.benjojo.co.uk)
(TXT) w3m dump (blog.benjojo.co.uk)
| lexicality wrote:
| Very comforting to know that most of the core underpinning of the
| internet doesn't follow the robustness principle by default.
|
| Still, I guess it keeps support contracts active and network
| engineers in a perpetual hell of on call alerts
| hn8305823 wrote:
| The software on big routers that run the Internet is _far_ more
| reliable than it was 20 years ago. It's not perfect and each
| vendor has cycles of good and bad code releases but it's so
| much better than it used to be.
| greyface- wrote:
| > I worry about the state of networking vendors.
|
| > With that being said, I would like to thank the OpenBSD
| security team, who very rapidly acknowledged my report, and
| prepared a patch. My only regret with dealing with the OpenBSD
| team was reporting the issue to them too quickly, as they are
| uninterested in waiting for any kind of coordination in
| disclosure.
|
| Lesson to network engineers: switch your routers to
| OpenBSD/OpenBGPd and keep up on your syspatches to escape this
| particular hell. ;)
| karavelov wrote:
| Why not switch to BIRD, which was not affected at all, rather than
| merely acknowledged and fixed? Isn't "safe by default" better
| than "fixed after someone pointed out the flaw"?
| ExoticPearTree wrote:
| > My only regret with dealing with the OpenBSD team was
| reporting the issue to them too quickly, as they are
| uninterested in waiting for any kind of coordination in
| disclosure.
|
| This BS with disclosure coordination and all the other crap
| vendors have been trying to train researchers to do for years
| needs to stop. You find an issue, you publish it, vendors need
| to deal with bad code and products in realtime, not when it
| suits them.
| mschuster91 wrote:
| While I agree in theory, in practice this _will_ lead to
| bad actors ranging from trolls to cyberwar troops
| exploiting it immediately.
|
| There is IMHO no ethical alternative to responsible
| disclosure, and using actual people who can get harmed by
| having their information disclosed as pawns isn't the way
| to go forward.
| zbentley wrote:
| > vendors need to deal with bad code and products in
| realtime
|
| One of the main reasons for coordinated disclosure is that,
| over time, it has become evident that vendors will not do
| that. We can say it until we're blue in the face, but
| vendor behavior will remain largely unchanged regardless
| (there are a few bright spots, but not nearly enough).
|
| So you go to war with the army you have. In this case, that
| means frequently deferring/coordinating security event
| communications. As much as I wish the market would
| appropriately penalize vendors that encourage researchers
| to sit on 0-days, that rarely happens, and the cost of
| insecurity is passed onto consumers instead.
|
| Tl;dr delaying disclosure does suck, but you should still
| do it.
| salawat wrote:
| Alas, it is the consumer that apparently drives pain
| reception for vendors, as they have no interest in
| minimizing consumer pain, but their own pain. Their own
| pain is defined as lost business.
|
| 'Tis the world we live in, so might as well get used to
| it.
| appplication wrote:
| I could have sworn I saw someone on here a few weeks ago
| saying how they switched their router to OpenBSD and ended up
| in some horrible broken state after a power outage or
| something. I can't recall the details, maybe someone else
| would recall.
| basedrum wrote:
| Maybe because you need to go to the data center to reboot
| OpenBSD
| binkHN wrote:
| Who here is running a production system without redundant
| or backup power?
| appplication wrote:
| I should have clarified: I think they said they did it on
| their home router.
| Shared404 wrote:
| Huh, I've been running an OpenBSD router as my prod home
| network router for ~ a year now without issue, and it has
| handled power outages and more without issues.
|
| I wonder if the outage happened during patching or
| something maybe?
| literalAardvark wrote:
| That just sounds like creating a new level of hell
| martijnvds wrote:
| One full of daemons.
| Hikikomori wrote:
| Any 100+ Terabit OpenBSD routers out there?
| Nextgrid wrote:
| I would very much doubt any of those carrier-grade routers
| run BGP in hardware. Chances are the control plane is
| handled by a bog-standard Linux/VxWorks/BSD that does all
| the logic of talking BGP/etc and then populates the right
| registers in the switching/routing hardware which handles
| the actual data plane.
|
| You could technically replace all the control and decision-
| making with whatever off-the-shelf computer platform you
| want, and have it drive the carrier-grade router and
| "manually" populate its routing table via some
| API/Telnet/SNMP. In fact, as long as the hardware packet-
| routing core is still good, this might be a good way to
| upgrade out-of-support equipment that is no longer getting
| updates to its control-plane computer.
| treve wrote:
| The robustness principle: "be conservative in what you do, be
| liberal in what you accept from others", hardly applies here.
| Whether or not it was followed, a crash shouldn't happen.
|
| The robustness principle has also long been considered not that
| great of an idea for protocols. There's a good document here
| going into this:
|
| https://datatracker.ietf.org/doc/html/rfc9413
| vermilingua wrote:
| It isn't a crash, BGP terminates the session and retries
| because it considers the connection to be malformed. So, "be
| liberal in what you accept" certainly does apply.
| j16sdiz wrote:
| This "terminate and retry" behaviour is specified in
| RFC7606
| fanf2 wrote:
| That is the exact opposite of what RFC 7606 says. As it
| explains in its introduction, it deprecates session reset
| in favour of treat-as-withdraw in most cases, unless
| there is an explicit exception to the general rule.
| forkerenok wrote:
| That RFC is a very insightful read. Thanks for sharing!
| cratermoon wrote:
| This seems like a good time to mention Yuan et al., "Simple
| Testing Can Prevent Most Critical Failures."
|
| "We found the majority of catastrophic failures could easily have
| been prevented by performing simple testing on error handling
| code - the last line of defense - even without an understanding
| of the software design."
| JonChesterfield wrote:
| Design behaviour here was on bad input, shutdown. Some bad input
| happened, thus shutdown happened. Seems fine to me. I'm not
| seeing a grave flaw.
| vitiral wrote:
| The issue is how it interacts with other designs, such as
| forwarding of unknown attributes, which are now poisoned
| remram wrote:
| Imagine if Gmail and Outlook refused to talk to each other and
| transmit any email anymore, because Gmail tried to deliver one
| email whose content was invalid (bad MIME, invalid attachment,
| etc).
|
| If you shut down the whole session because of one bad forwarded
| message, you are indeed a broken implementation.
| Bluecobra wrote:
| Agreed, the article title is a bit sensationalist. Really
| should be titled "RTFM: Juniper routers shut down BGP on bad
| input". That being said, I found it to be a good deep dive and I
| appreciate the author looking into this. BGP isn't perfect but
| we made it this far and I think it does a pretty good job for
| handling ~75K autonomous systems.
| sgjohnson wrote:
| > BGP isn't perfect but we made it this far and I think it
| does a pretty good job for handling ~75K autonomous systems.
|
| Absolutely not. To me, it's a miracle that the internet even
| works.
| basedrum wrote:
| Sure if you didn't read and only skimmed the article...
| gregw2 wrote:
| Err no, it was not just Juniper. Many other systems were
| fuzzed, half a dozen vendors were fine, another 5 weren't and
| some remain unpatched and vulnerable.
|
| If it were just a Juniper implementation flaw it'd be less
| interesting than explaining an aspect of BGP error handling
| more generally which is indeed what the article was, for
| those of us aware of BGP but not familiar with specific
| protocol design decisions and implications.
|
| If you quibbled with the word "Grave" I might agree since
| it's not industrywide and impacting major vendors like Cisco
| and Huawei and those implementing RFC 7606 (although there is
| no patch for Nokia or Extreme or Bird!) I'd counterpropose
| that a more neutral TLDR title/summary for an HN audience
| might be "Grave flaws in non compliant BGP error handling."
|
| But on balance I think the author is better off with raising
| the alarm with a stronger headline; he clearly put his
| passion to good use and I learned a little something as a
| result.
| nsteel wrote:
| As I understand it, Nokia equipment already does the
| correct thing if you configure it correctly in fault-
| tolerance mode.
| buro9 wrote:
| The impacted systems get stuck on a retry loop... they can
| neither forward the message blindly nor consume it. They break
| the BGP connection, reset, come back alive and get the same
| message and break the connection again.
|
| Effectively the impacted instances are now no longer able to do
| their job, they cannot propagate more rules.
|
| That's the flaw.
|
| When they receive something that they don't understand they
| need to propagate without breaking connection, and they do not.
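|
| A minimal sketch of that loop, assuming a peer that re-sends its
| full table after every reset and a receiver that drops the session
| on the first malformed update. The function name, the dictionary
| layout, and the documentation prefixes are made up for
| illustration; no real router API is involved.

```python
def run_session(peer_updates, max_attempts=5):
    """Simulate a receiver that resets the session on a malformed update.

    Returns (attempts_used, routes_installed). Hypothetical model only.
    """
    for attempt in range(1, max_attempts + 1):
        installed = []
        # After each reset the peer replays the same updates from the start.
        for update in peer_updates:
            if update["malformed"]:
                # Session torn down: everything learned so far is discarded.
                break
            installed.append(update["prefix"])
        else:
            # No malformed update was seen, so the session stays up.
            return attempt, installed
    # Every attempt hit the same bad update; nothing is ever installed.
    return max_attempts, []


poisoned = [
    {"prefix": "192.0.2.0/24", "malformed": False},
    {"prefix": "198.51.100.0/24", "malformed": True},
    {"prefix": "203.0.113.0/24", "malformed": False},
]
clean = [u for u in poisoned if not u["malformed"]]

stuck = run_session(poisoned)
healthy = run_session(clean)
```

| In this model the poisoned feed uses up every attempt and installs
| no routes at all, while the clean feed comes up on the first try.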
| fanf2 wrote:
| Yes. It's also important to emphasize that when a BGP session
| between two ISPs is lost, all connectivity between those ISPs
| on that path is also lost. This is why the author emphasized
| when you can reconfigure the routers for error tolerance
| without affecting active BGP sessions.
| throw0101c wrote:
| > _When they receive something that they don't understand
| they need to propagate without breaking connection, and they
| do not._
|
| And this is how things _are_ done, per the article:
|
| > _On 2 June 2023, a small Brazilian network (re)announced
| one of their internet routes with a small bit of information
| called an attribute that was corrupted. The information on
| this route was for a feature that had not finished
| standardisation, but was set up in such a way that if an
| intermediate router did not understand it, then the
| intermediate router would pass it on unchanged._
|
| > _As many routers did not understand this attribute, this
| was no problem for them. They just took the information and
| propagated it along._
|
| However, the information was _corrupted_ at the source, and
| those devices that did understand the attribute _detected_
| the corruption and rejected it:
|
| > _However it turned out that Juniper routers running even
| slightly modern software did understand this attribute, and
| since the attribute was corrupted the software in its default
| configuration would respond by raising an error that would
| shut down the whole BGP session._
|
| The question is whether the session shutdown was appropriate
| (at least by default).
|
| I think Juniper's default is reasonable, as attributes can
| communicate important information about the intentions of the
| network operators, and rejecting a particular attribute and
| basically ignoring the intentions of the operators may also
| cause problems.
|
| I also think that allowing the session to continue while
| throwing a (loud) warning is reasonable
| (and there is a flag for that).
| ExoticPearTree wrote:
| > However, the information was corrupted at the source, and
| those devices that did understand the attribute detected
| the corruption and rejected it.
|
| No they did not, they shut a whole session down for a
| malformed announcement. The right decision would have been
| to ignore the corrupt attribute and maybe not install that
| particular route.
|
| A logged message saying "Attribute 129 with value 0 is
| invalid" would have helped operators more than just blindly
| shutting down sessions through which this was announced.
| fanf2 wrote:
| The right decision is specified in RFC 7606, which is
| usually to treat the malformed announcement as a
| withdrawal. (BGP attributes cannot in general be
| discarded safely.)
| tsimionescu wrote:
| The current behavior is akin to an HTTP server shutting
| down whenever it receives an invalid HTTP request. Or, for
| an even better analogy, it is like an HTTP server that,
| upon receiving a malformed HTTP request, drops all current
| and future traffic from that IP. A malicious user can then
| issue a malformed HTTP request through an open proxy, and
| deny all other connections to that server through that proxy
| for all other users who were sending proper requests.
| throw0101c wrote:
| > _The current behavior is akin to an HTTP server
| shutting down whenever it receives an invalid HTTP
| request._
|
| It is nothing like that.
|
| HTTP requests do not propagate across an organization's
| entire network, or possibly the entire Internet.
|
| > _Or, for an even better analogy, it is like an HTTP
| server that, upon receiving a malformed HTTP request,
| drops all current and future traffic from that IP._
|
| So just like (e.g.) fail2ban blocking IPs that
| continuously send bad login requests? That's awesome! I
| would _want_ my web server to block bad clients.
|
| Heck, it would be nice if web browsers would do the same
| thing with content they downloaded. In the early days of
| the web there were all sorts of garbage files out there
| and instead of forcing people to correct their HTML they
| tried to be clever in mind reading:
|
| * https://en.wikipedia.org/wiki/Tag_soup
|
| For a time there was an entire 'movement' to get people
| to clean up their act:
|
| * https://en.wikipedia.org/wiki/W3C_Markup_Validation_Service
|
| * https://en.wikipedia.org/wiki/HTML_Tidy
|
| It's fine to have a flag to have stringency be a policy
| that can be toggled, but I think Postel's law ("be
| liberal in what you accept") can cause issues over the
| long-term:
|
| * https://en.wikipedia.org/wiki/Robustness_principle
| fanf2 wrote:
| It's like your origin web server unceremoniously dropping
| the http connection when its front end reverse proxy
| (fastly, cloudfront, akamai, etc) forwards an invalid
| request, and when your web server restarts the proxy
| retries the bad request.
| tsimionescu wrote:
| > So just like (e.g.) fail2ban blocking IPs that
| continuously send bad login requests? That's awesome! I
| would want my web server to block bad clients.
|
| Well, somewhat, except that it's not awesome. It's like
| fail2ban, but implemented on a server that sits behind a
| load balancer. So, when it receives a bad login request
| from an IP and it bans that IP, it actually banned the IP
| of the load balancer, so now it won't receive any other
| requests. So a malicious client can just send 1 bad
| request, and take down everyone else using the whole
| server.
|
| Basically, this whole discussion is not so much about
| Postel's law, it's about limiting failure domains. A bad
| route advertisement should make that 1 route
| inaccessible. It is not good design for a bad route
| advertisement to take down all the other good routes. It
| is particularly bad design for a protocol where routes
| are automatically propagated by devices that don't think
| there's anything wrong with them.
|
| If you want to compare it to the HTML situation, the bug
| can also be seen as a browser that, when it receives
| invalid HTML from www.google.com/search would not only
| throw an error, but also refuse any other connection to
| *.google.com/*. The proposed fix is not to accept arbitrary input
| as valid HTML. It is to show an error only when
| accessing www.google.com/search.
| capableweb wrote:
| That sounds like a typical denial of service attack, which you
| don't see a problem with?
| JonChesterfield wrote:
| It sounds like a system designed to stop doing things on
| unexpected input. The designers chose correctness over
| robustness. It's not an accidental vulnerability.
|
| Maybe switch off on request is a dubious design choice for
| infrastructure that you'd rather stay alive, but it sounds
| like a new version of the spec has already been released to
| make that an option.
| mnw21cam wrote:
| Bad input -> shutdown provides a method by which an external
| attacker can shut down your service by providing bad input,
| otherwise known as a denial of service attack.
| throw0101c wrote:
| The input in question was a particular attribute: these
| attributes can have important information about the
| intentions of how the operators wish things to work. Simply
| ignoring the bad input and carrying on the BGP session
| without the attribute(s) in question may lead to working
| against the desired intentions of the network operators.
|
| I think Juniper's default is reasonable: rejecting a
| particular attribute and basically ignoring the intentions of
| the operators could also cause problems, and instead of
| trying to read the minds of those running the networks, the
| software doesn't try to be clever.
|
| I also think that allowing the session to continue while
| throwing a (loud) warning is reasonable
| (and there is a flag for that).
| tsimionescu wrote:
| The problem is that one bad route in a BGP session caused
| _the whole session_ to shut down. So if ISP1 advertises a
| new route with a corrupted attribute to ISP2 who doesn't
| understand it, and ISP2 advertises it to ISP3 who does
| understand the corrupted attribute, _all traffic_ between
| ISP2 and ISP3 will shut down. So ISP1 has caused
| communication failure between ISP2 and ISP3 for all
| clients, not just for its own routes.
| throw0101c wrote:
| Yes, I know how BGP propagation works.
|
| My point still stands: attributes can have important
| information about the intentions of how a network
| operator wishes their network to be seen to the rest of
| the world. Having a corrupt attribute, and thinking it's
| okay to simply assume that without that attribute the
| rest of the advertisement is still valid is not fine
| IMHO: you're basically trying to guess/assume the
| intentions of the source of the advertisement.
|
| For all we know ignoring that attribute and accepting the
| rest of the advertisement as-is could have caused a flood
| of traffic over the connection, causing a just-as-effective DoS.
| tsimionescu wrote:
| That is not what is being done. If you advertise a route
| with a corrupt attribute, your route will be considered
| inaccessible by any device that understands the
| attribute.
|
| But! All your other routes that you advertised earlier
| with perfectly fine attributes will continue working.
| This is what prevents this from becoming a DoS
| vulnerability.
|
| Let's take a more complete example.
|
| ISP1 advertises 2 routes to ISP2:
|
|   35.67.0.0/16 attr1=val1
|   37.61.0.0/16 attr1=corr
|
| ISP2 doesn't know what attr1 is, and it advertises the
| following routes to ISP3:
|
|   35.67.0.0/16 attr1=val1
|   37.61.0.0/16 attr1=corr
|   78.0.0.0/8
|
| The bug is that ISP3 will then stop routing any traffic
| to 35.67.0.0/16, 37.61.0.0/16, or to 78.0.0.0/8.
|
| After the fix, ISP3 will keep routing traffic to
| 35.67.0.0/16 (respecting the value val1 for attribute
| attr1) and to 78.0.0.0/8 through ISP2. It will not route
| traffic to 37.61.0.0/16.
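|
| The example above can be sketched in a few lines, assuming a toy
| model where a route is just a prefix plus one optional attribute
| and "corr" stands for the corrupted value. Route, is_corrupt,
| session_reset and treat_as_withdraw are hypothetical names, not
| part of any real BGP implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    attr1: Optional[str] = None  # None means the attribute is absent


def is_corrupt(route: Route) -> bool:
    # Stand-in for real attribute validation; "corr" marks the bad value.
    return route.attr1 == "corr"


def session_reset(updates):
    """Buggy behaviour: one malformed update drops every route on the session."""
    if any(is_corrupt(r) for r in updates):
        return set()
    return {r.prefix for r in updates}


def treat_as_withdraw(updates):
    """Fixed behaviour: only the malformed route is treated as withdrawn."""
    return {r.prefix for r in updates if not is_corrupt(r)}


# What ISP3 receives from ISP2 in the example.
from_isp2 = [
    Route("35.67.0.0/16", "val1"),
    Route("37.61.0.0/16", "corr"),
    Route("78.0.0.0/8"),
]

before_fix = session_reset(from_isp2)
after_fix = treat_as_withdraw(from_isp2)
```

| Under these assumptions before_fix is empty, while after_fix keeps
| 35.67.0.0/16 and 78.0.0.0/8 and leaves out only 37.61.0.0/16.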
| fanf2 wrote:
| As the fine article explains, your concerns are addressed
| by RFC 7606. It recommends that in many cases, an error
| should be treated as a route withdrawal, not by dropping
| the session. (The RFC specifies other ways to handle
| errors in specific situations.) There's no guessing or
| assuming.
|
| In the scenario that started off the author's
| investigation, the bad attribute kicked COLT off the
| Internet even though COLT did not connect directly to the
| Brazilian ISP that sent the bad route update. When they
| received the bad update, COLT's routers dropped their BGP
| sessions with intermediate ISPs that were unrelated to
| the Brazilian ISP, causing complete loss of connectivity.
|
| If COLT's routers had treated the error as a route
| withdrawal, they would have lost connectivity to the
| Brazilian ISP but not the rest of the Internet.
| sgjohnson wrote:
| A small Brazilian network yesterday hijacked half the internet
| too, including several of my own prefixes, all of which have IRR
| and RPKI set up.
|
| https://bgp.tools/as/266970?show-low-vis#prefixes
|
| [bgp.tools Alert] for AS200676 A hijack on 2a06:a005:2730::/44
| from AS266970 (NET WAY PROVEDOR DE INTERNET DE CACOAL LTDA) has
| been detected (based on your address policy you have configured)
|
| [bgp.tools Alert] for AS200676 A hijack on 2a0a:6040:b1f4::/48
| from AS266970 (NET WAY PROVEDOR DE INTERNET DE CACOAL LTDA) has
| been detected (based on your address policy you have configured)
|
| [bgp.tools Alert] for AS200676 A hijack on 2a0a:6040:b1f1::/48
| from AS266970 (NET WAY PROVEDOR DE INTERNET DE CACOAL LTDA) has
| been detected (based on your address policy you have configured)
|
| [bgp.tools Alert] for AS200676 A hijack on 2a0a:6040:b1f0::/48
| from AS266970 (NET WAY PROVEDOR DE INTERNET DE CACOAL LTDA) has
| been detected (based on your address policy you have configured)
|
| BGP is a complete mess.
| basedrum wrote:
| What is the response of the bird developers?
|
| Also, for those responding that you don't see the problem, it's
| simple: remote unsanitized input causes a denial of service
| gorkish wrote:
| Bird is unaffected, so I assume their response is satisfaction
| for a job well done. FRR, on the other hand, appears to be less
| responsive to concerns -- which does actually concern me since
| I use FRR, though not presently for eBGP. Disclosure seems to
| have done the job thankfully, as there appears to have been a
| little scramble over at FRRouting/frr about an hour ago.
| wawwow wrote:
| [flagged]
| TheHappyOddish wrote:
| No, these are very unrelated things.
| salawat wrote:
| How do you figure? The IP address that DNS points to is still
| an IP, and routers within that network segment will believe
| its prefix is provided by that AS if they aren't honoring
| RPKI.
|
| Therefore, within that routing zone, that traffic would be
| blackholed I'd think, unless I'm seriously missing some
| nuance that I'd love it if you would share.
| hannob wrote:
| Not surprised.
|
| A few years ago a team of researchers tried to use a feature
| attribute flag in BGP. Their experiment caused an important piece
| of BGP software to crash.
|
| The BGP community reacted by thanking the researchers for
| uncovering that flaw and worked on improving stability. No, just
| kidding. They shouted at the researchers and asked them to stop,
| which they eventually did.
|
| I felt back then that this community had a very unhealthy
| relationship with the quality of their software and the work of
| security researchers.
|
| Sources:
| https://mailman.nanog.org/pipermail/nanog/2018-December/0984...
| https://mailman.nanog.org/pipermail/nanog/2019-January/09876...
| https://mailman.nanog.org/pipermail/nanog/2019-January/09918...
| https://mailman.nanog.org/pipermail/nanog/2019-January/09914...
| hinkley wrote:
| Recipe as I got it from grandma
|
| 1) Beleaguered people misbehave.
|
| 2) You are usually the architect of your own destruction.
|
| 3) It's easier to see other people's problems than your own.
|
| Stir into a death spiral and let it rest for six to twelve
| months. Bake at 375deg for two years, then serve cold with a
| sprinkling of schadenfreude to offset the bitterness.
| dogleash wrote:
| > I felt back then that this community had a very unhealthy
| relationship with the quality of their software and the work of
| security researchers.
|
| I suspect a Network Operators mailing list is going to be full
| of people trying to keep production up, with the tools they've
| got, rather than the tools they'd like. Thinking the software
| quality on the routers they're using is dogshit is probably
| just called "Monday" and thinking their peers' routers are even
| worse is called "Tuesday".
|
| I can understand they don't appreciate "research" done on
| production systems that might cause issues when they already
| know their equipment sucks.
| gnfargbl wrote:
| I _can't_ understand it. The network operators should save
| their vitriol for the vendors who are selling them equipment
| that sucks, and for whoever in their own organisation is
| continuing to purchase equipment that sucks.
|
| Bitching at researchers (note: not "researchers") who are
| finding and calling out those flaws before someone really bad
| comes along and exploits them is somewhere between unhelpful
| and negligent. Internet security is always improved by more
| sunlight.
| hn8305823 wrote:
| > I can't understand it. The network operators should save
| their vitriol for the vendors who are selling them
| equipment that sucks
|
| We call out vendors when their quality suffers but you only
| have two or three practical vendor options for high end
| routers depending on your use case/feature requirements.
| All three have varying cycles of hardware and software
| quality, with hardware and software rarely being in phase
| for the same vendor.
|
| The fact that all three vendors make most of their money on
| non-network focused enterprises with less sophisticated
| engineers doesn't help get development attention where it
| needs to be for global infrastructure.
| tw04 wrote:
| >I can't understand it. The network operators should save
| their vitriol for the vendors who are selling them
| equipment that sucks, and for whoever in their own
| organisation is continuing to purchase equipment that
| sucks.
|
| I think you fail to realize that the internet is a bunch of
| disparate networks that no one person controls. When
| "fixing" a router means you break half the internet, you
| just don't fix it. This isn't some walled garden where you
| can just say "tough shit if you broke, fix it".
|
| Most of the "issues" these researchers find are well known,
| but backbone operators are more militant than Linus when it
| comes to: rule #1 is don't break user space.
| gnfargbl wrote:
| > I think you fail to realize that the internet is a
| bunch of disparate networks that no one person controls.
|
| The internet is a bunch of disparate networks that no one
| person controls, _exactly_. That means that when someone
| tells you there's a fault in the kit that you have
| installed in your network, then that is 100% your problem
| and it is 100% on you to get it fixed. Once a
| vulnerability is known then someone out there is going to
| start exploiting it almost immediately.
|
| > backbone operators are more militant than Linus when it
| comes to: rule #1 is don't break user space.
|
| Backbone operators can have all the hubris they want, but
| it won't change the reality that the only effective
| action they can take when a vulnerability is found is to
| get it fixed ASAP. This is a security lesson that has
| been learnt in recent years by many sectors of the IT
| industry and, judging by the woeful response that Ben
| Cartwright-Cox describes in his post, it's one that the
| network backbone is going to be learning soon.
| tw04 wrote:
| >The internet is a bunch of disparate networks that no
| one person controls, exactly. That means that when
| someone tells you there's a fault in the kit that you
| have installed in your network, then that is 100% your
| problem and it is 100% on you to get it fixed.
|
| So you've never worked in a corporate environment. Here's
| how that conversation would go:
|
| Hey guys, some researchers found out if you use a test
| flag meant for the lab on the public internet, it breaks
| all our BGP sessions.
|
| OK, so drop their feed and block them?
|
| Great, done.
|
| >Once a vulnerability is known then someone out there is
| going to start exploiting it almost immediately.
|
| And yet the "vulnerability" in question was known, and
| was not being immediately exploited if you read through
| the mailing list or were participating in NANOG at that
| point in time. So your statement is provably false.
|
| >Backbone operators can have all the hubris they want,
| but it won't change the reality that the only effective
| action they can take when a vulnerability is found is to
| get it fixed ASAP.
|
| And yet we're having this conversation in 2023, they have
| operated the same way for 40+ years, and somehow the
| internet is still working. Bad actors get blackholed, it
| worked in the past, it'll continue working in the future.
| The reality is that backbone routing is expensive, and
| expecting everyone to update their kit on YOUR timeline
| isn't reasonable.
| robertlagrant wrote:
| > OK, so drop their feed and block them?
|
| > Great, done.
|
| The "great" is the problem, no?
| gnfargbl wrote:
| I am aware that network operators have behaved this way
| for very many years. The point I am making is that _all
| of the IT industry_ used to attempt to "work around"
| security vulnerabilities in the same way, until log4shell
| and all the others gradually beat that propensity out of
| them.
|
| I am prophesying that a similar reckoning is likely to
| come upon the network backbone. You're arguing that the
| cost of entry to the game (BGP peering) is such that the
| old ways will continue to work. Let's hope you're right.
| devonkim wrote:
| Software ecosystems like libraries and frameworks have
| completely different propagation and remediation
| mechanics compared to federated systems like core
| Internet backbone routers and switches. Try
| as we might to conceptualize it otherwise, the modern Internet
| from a packet's purview is more like a loose
| confederation of ultimately privatized or state-run
| fiefdoms than a cellular automata digraph explosion. So
| actors that try to act maliciously against the network
| will be basically shut out given the rule of an iron fist
| being the default.
| zamadatix wrote:
| The internet does not owe a negligent operator peering.
| No one person controls the internet but you sure as hell
| can find yourself kicked out if your goal is to be right
| on principle instead of a good peer. The only hubris
| which could be at play is someone thinking the rest of
| the internet should follow their desires instead of the
| other way around.
|
| Research is great but if you're poking around breaking
| things via DoS in the internet you're going to be judged
| as a bad peer regardless of what the route daemon does. The
| security of the internet constantly moves forward but
| that doesn't mean the expectation of improving has to
| hold more precedence than the expectation of running.
| It's still very much a human to human system for all but
| the actual route exchange.
| netfortius wrote:
| I suppose you have never seen, ever, in your professional
| career, the "C" suits making business deals and thus
| decisions on what's to be acquired, with big
| network/server/storage/database/software vendors, over the
| head and against the advice of those then called to support
| crap in production.
| themerone wrote:
| Network operators are also the ones that pop into IPV6
| discussions to voice their displeasure with having to support
| another protocol.
| booi wrote:
| To be fair... supporting dual stack really does suck
| yjftsjthsd-h wrote:
| A lot about networking sucks, but that's what people are
| paying them for.
| philg_jr wrote:
| Also to be fair, IME there seem to be a lot more vocal
| pro-IPv6 people on the NANOG list than not.
| throw0101c wrote:
| > _I felt back then that this community had a very unhealthy
| relationship with the quality of their software and the work of
| security researchers._
|
| Or they felt like some of us feel about Tesla/Musk unleashing
| cars with "Full" "Self-Driving" on public roads.
| adra wrote:
| As lousy as these systems are, they kill a hell of a lot
| fewer people (for now) than drunks, so net positive for the
| time being.
| mschuster91 wrote:
| The solution to drunk drivers is public transport, not
| self-driving cars.
| zaroth wrote:
| It's 2023 and driving kills like 40,000 people a year.
|
| We've been building trillions of dollars of public
| transit since about the same time as we started building
| the National highway system. So when exactly does public
| transit start saving all these lives?
|
| A new train from SF to LA has been in the works since
| 1996. They are expecting partial completion for $30
| billion by 2030 which is projected to carry ~15 million
| passengers * ~400 miles = 6 billion passenger miles per
| year.
|
| The fatality rate for passenger cars is ~0.57 per 100
| million miles, so eliminating 6 billion passenger miles
| could be reasonably expected to save about 60 * 0.57 = 34
| lives a year out of the ~40,000 that are dying on the
| roads annually.
|
| This is just rough napkin math showing the scale of the
| problem would require a roughly $30 _trillion_ investment
| in public transit to appreciably shift passenger-miles
| away from passenger cars.
|
| Public transit can be very useful. But it's clearly not
| the solution to eliminating driving deaths.
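The napkin math in the comment above can be reproduced in a few lines. This is only a sketch that restates the commenter's own figures (15 million passengers, 400 miles, 0.57 deaths per 100 million miles, 40,000 road deaths a year); none of them are independently checked here, and the function name is made up for illustration. The last step assumes benefits scale linearly with spending, which is the commenter's implied assumption rather than an established fact.

```python
def lives_saved_per_year(passengers, trip_miles, deaths_per_100m_miles):
    """Deaths avoided if these passenger-miles move off the road entirely."""
    passenger_miles = passengers * trip_miles
    return passenger_miles / 100_000_000 * deaths_per_100m_miles


# Figures quoted in the comment, not independently sourced.
saved = lives_saved_per_year(15_000_000, 400, 0.57)
share = saved / 40_000

# Assumption: a 1000x larger investment ($30 billion -> $30 trillion)
# displaces 1000x the passenger-miles.
scaled = saved * 1000

print(f"{saved:.1f} lives/year, {share:.4%} of annual road deaths")
print(f"{scaled:,.0f} lives/year at 1000x the spending")
```

Under those assumptions the single rail line accounts for well under a tenth of a percent of annual road deaths, which is where the "trillions" figure comes from.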
| autoexec wrote:
| It seems unfair to say that public transport doesn't have
| a role in reducing deaths caused by drunk drivers just
| because the hugely flawed and half-assed public transport
| systems in our cities (which were designed to make public
| transport ineffective) haven't yet delivered on that
| promise. In places where cities and public transportation
| systems are well designed and useful people do use them
| and those places see far fewer deaths from drunk drivers
| than the US does as a result.
|
| You argue that it would be very expensive to fix the
| problems we've made for ourselves in order to get useful
| public transportation and that's certainly true. It's
| also very expensive to invent self-driving cars, and part
| of that cost has already been paid with lives lost and human
| suffering, but that hasn't stopped us from developing
| those systems. We could decide to invest instead in
| improving our cities and infrastructure. We just choose
| to let obscenely rich people kill innocent Americans
| while they play with their profit generating spying car
| tech instead.
| hiddencost wrote:
| You're confusing "musk killing people to compete for market
| share because he's woefully behind" with "driverless cars".
|
| There are other people building driverless cars, who have
| safer systems that they rolled out reliably.
|
| Musk isn't saving people from drunk drivers, he's trying to
| catch up with the competition and chose to kill people to
| catch up.
| simiones wrote:
| That only makes sense if _all_ the people using self-
| driving cars would have driven drunk (or drugged, or
| sleepy) instead. Otherwise, you're replacing a decent
| driver with a sub-mediocre automatic system.
| yjftsjthsd-h wrote:
| I think the difference is that network gear is supposed to
| handle the whole spec and gracefully deal with (drop) things
| that aren't to spec. If there was a law that all cars had to
| be able to get hit by another car and keep going without
| issue (and this was physically plausible), then it would be
| similar.
| tw04 wrote:
| If you read the links: they discovered that using this flag
| broke EVERY installation of FRR on the internet. They let the
| developers know. And then said they were going to test again in
| a week.
|
| Does any sane person think that 1 week is enough time to both
| notify every user of FRR in the _WORLD_, as well as ensure
| they have upgraded their installations for a "bug" that only is
| triggered by someone using experimental flags meant for testing
| purposes only?
|
| Mind you, these researchers didn't test ANYTHING in a lab
| environment first "because it's a lot of work to test all the
| open source BGP implementations in the wild". AKA: I'm lazy.
| LegionMammal978 wrote:
| > Mind you, these researchers didn't test ANYTHING in a lab
| environment first "because it's a lot of work to test all the
| open source BGP implementations in the wild". AKA: I'm lazy.
|
| That's a gross exaggeration: they said that they first tried
| it "in a controlled environment with different versions of
| Cisco, BIRD, and Quagga routers" [0] without issue.
| Presumably, the problem we know about would have been avoided
| if they had happened to include FRR in their test suite
| beforehand. But that raises the next question, exactly how
| many implementations does one need to test before one is no
| longer "lazy"?
|
| [0] https://mailman.nanog.org/pipermail/nanog/2019-January/09
| 876...
| braiamp wrote:
| I don't think that is an accurate representation of the
| "community", it was a "singular complaint from a company
| advertising unallocated ASN and IPv4 resources", and got told
| off by literally most of the mailing list. I'm counting at
| least 20 other individual responses and all of them are in
| support of the experiment
| https://mailman.nanog.org/pipermail/nanog/2019-January/threa...
| capableweb wrote:
| Is there a reason why the researchers did this on live
| production BGP network rather than in a test environment,
| running their own BGP and FRR routers? Seems a bit haphazard,
| as doing so first would likely have uncovered the issue.
|
| Overall I agree with your message though, and of course things
| shouldn't stop working just because you use attributes reserved
| for development.
| LegionMammal978 wrote:
| The researchers claimed that they had first tested it with a
| few different vendors' routers without any issues, but they
| had not included FRR in those tests [0]. There was a bit of
| discussion on the mailing list regarding just how many BGP
| implementations (and configurations) a responsible researcher
| is obligated to test with before exposing something new on
| the public Internet.
|
| [0] https://mailman.nanog.org/pipermail/nanog/2019-January/09
| 876...
| salawat wrote:
| The correct answer is "All of them on your own network
| first".
|
| Thou shalt not blow up another's network knowingly.
|
| Blowing up another's network because you didn't test your
| theory in your own isolated lab space first counts as
| blowing up another's network knowingly. Because what the
| hell else did you think would happen?
|
| Cripes.
| xyst wrote:
| To be honest, if you are a responsible researcher you are NOT
| running "experiments" in live environments that have potential
| to impact multiple regions.
|
| The outrage here is understandable. I would be pissed if I was
| paged because halfway across the world, some asshole caused
| this intentionally. Also this mailing list is aimed at the
| operators, not the vendors. Different communities.
|
| The vendor response here, I do agree is irresponsible. Besides
| OpenBSD, everyone here just sucked.
| pixl97 wrote:
| Ya the researcher could just sell their findings to whatever
| blackhat groups are out there instead. You generally get less
| pushback and more cash from them.
| ExoticPearTree wrote:
| The sane thing to do is to drop that particular announcement
| with that invalid attribute or discard the invalid information.
| But some people tend to be more catholic than the pope.
|
| Dropping the whole session, especially in a DFZ scenario is
| really bad. The churn that you generate is immense for all your
| neighbors.
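To make the difference in the comment above concrete, here is a toy model of the two reactions to a malformed UPDATE: tearing down the whole session versus withdrawing only the affected prefixes (the "treat-as-withdraw" approach from RFC 7606). It is a sketch, not how any real BGP daemon is structured; `Policy`, `Update`, `Session`, and `replay` are hypothetical names, and the prefixes are documentation ranges.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Policy(Enum):
    SESSION_RESET = "session-reset"
    TREAT_AS_WITHDRAW = "treat-as-withdraw"


@dataclass
class Update:
    prefixes: list[str]
    malformed: bool = False  # stands in for "carries a broken attribute"


@dataclass
class Session:
    policy: Policy
    up: bool = True
    rib: set[str] = field(default_factory=set)

    def receive(self, update: Update) -> None:
        if not self.up:
            return  # a torn-down session learns nothing further
        if not update.malformed:
            self.rib.update(update.prefixes)
        elif self.policy is Policy.SESSION_RESET:
            # Old behaviour: one bad UPDATE drops every route from the peer.
            self.up = False
            self.rib.clear()
        else:
            # Treat-as-withdraw: only the prefixes in the bad UPDATE are lost.
            self.rib.difference_update(update.prefixes)


def replay(policy: Policy) -> Session:
    """Feed the same three UPDATEs to a session using the given policy."""
    session = Session(policy)
    session.receive(Update(["192.0.2.0/24", "198.51.100.0/24"]))
    session.receive(Update(["198.51.100.0/24"], malformed=True))
    session.receive(Update(["203.0.113.0/24"]))
    return session


for policy in Policy:
    result = replay(policy)
    print(policy.value, "up" if result.up else "down", sorted(result.rib))
```

With a full table the reset case means every prefix from that peer is withdrawn and later re-announced to all neighbors, which is the churn the comment is describing.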
| crest wrote:
| Just have a bunch of fuzzers running 24/7 to keep all
| implementations honest.
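A minimal sketch of what such a fuzzer checks, run against your own code rather than anyone else's network. The parser below is a hypothetical stand-in, not taken from any real implementation; it only handles the generic path-attribute framing (flags octet, type code, then a one-octet length, or two octets when the Extended Length bit 0x10 is set). The property under test is that arbitrary bytes produce either a parse result or a clean `MalformedAttribute`, never some other exception.

```python
import random


class MalformedAttribute(ValueError):
    """Raised for input the parser rejects on purpose."""


def parse_attributes(data: bytes) -> list:
    """Split a path-attribute blob into (flags, type_code, value) tuples."""
    attrs = []
    i = 0
    while i < len(data):
        if len(data) - i < 3:
            raise MalformedAttribute("truncated attribute header")
        flags, type_code = data[i], data[i + 1]
        if flags & 0x10:  # Extended Length: two-octet length field
            if len(data) - i < 4:
                raise MalformedAttribute("truncated extended length")
            length = int.from_bytes(data[i + 2:i + 4], "big")
            i += 4
        else:
            length = data[i + 2]
            i += 3
        if len(data) - i < length:
            raise MalformedAttribute("value overruns buffer")
        attrs.append((flags, type_code, data[i:i + length]))
        i += length
    return attrs


def fuzz(iterations: int = 10_000, seed: int = 0) -> int:
    """Throw random blobs at the parser; return how many were rejected.

    Any exception other than MalformedAttribute propagates, and that
    crash is the fuzzer's finding.
    """
    rng = random.Random(seed)
    rejected = 0
    for _ in range(iterations):
        size = rng.randrange(0, 64)
        blob = bytes(rng.getrandbits(8) for _ in range(size))
        try:
            parse_attributes(blob)
        except MalformedAttribute:
            rejected += 1
    return rejected


print(fuzz(), "of 10000 random blobs rejected cleanly")
```

Purely random input like this is the crudest approach; coverage-guided tools such as AFL, mentioned in the reply below, mutate inputs based on which code paths they reach.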
| clankyclanker wrote:
| That's what Google used to do (though not for BGP), back in
| their engineering days. That's what gave us AFL.
| tialaramex wrote:
| And indeed GREASE the TLS feature where you just propose
| nonsense because every other party should go "Er, no I
| don't speak nonsense?" to prevent ossification also comes
| from Google engineers.
|
| GREASE means that when you build your "advanced" security
| device which "protects" customers by treating everything
| different as hostile, it won't work, so you'll need to
| tweak it to _at least_ just ignore such differences as
| irrelevant, which is enough that we can come back later
| and intentionally improve things.
|
| Previously, without GREASE, we'd have to guess what
| oversights we could exploit to avoid this "protection" to
| deliver protocol improvements, if we guessed wrong
| nothing works, or everybody's security is broken,
| sometimes both. e.g. for TLS 1.2 the oversight we found
| is, if you're resuming an existing session the "security"
| products just wave that through because otherwise they
| break people's real workflows. So in TLS 1.3 protocol as
| actually spoken essentially an initial connection goes
| like this:
|
| Client: Hi some.dns.name.example I'm a TLS 1.2 Client,
| I'd like to resume our previous conversation
| #randomNonsense. Also, completely unrelated, I happen to
| speak FlyCasualThisIsReallyTLSv1.3 and so I have these
| TLSv1.3KeyAgreementParameters.
|
| then either:
|
| TLS 1.3 Server: Hi Client, of course, let's continue from
| there. [ whereupon everything further is encrypted
| because this is actually TLS 1.3, but to a dumb middlebox
| it makes sense that they're just resuming a prior
| encrypted conversation #randomNonsense which it doesn't
| remember ]
|
| OR
|
| TLS 1.2 Server: Er, no I don't remember any such
| conversation and I don't know
| FlyCasualThisIsReallyTLSv1.3 so let's start a fresh
| conversation as normal.
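The GREASE idea described above can be sketched as follows. The reserved GREASE code points in RFC 8701 follow the pattern 0x0A0A, 0x1A1A, ... 0xFAFA, and 0x0304 / 0x0303 are the TLS 1.3 / TLS 1.2 version numbers; everything else here (the function names, and picking the first known value in the client's order) is a simplification for illustration and not how a real TLS stack negotiates.

```python
# The 16 reserved GREASE values: 0x0A0A, 0x1A1A, ..., 0xFAFA.
GREASE_VALUES = [(n << 12) | 0x0A00 | (n << 4) | 0x0A for n in range(16)]

KNOWN_VERSIONS = {0x0304: "TLS 1.3", 0x0303: "TLS 1.2"}


def tolerant_select(offered):
    """Ignore anything unknown and pick the first version we understand."""
    for version in offered:
        if version in KNOWN_VERSIONS:
            return KNOWN_VERSIONS[version]
    return None


def ossified_select(offered):
    """A 'treat everything different as hostile' peer: reject unknown values."""
    for version in offered:
        if version not in KNOWN_VERSIONS:
            raise ValueError(f"unexpected version 0x{version:04X}")
    return tolerant_select(offered)


# A client that GREASEs puts a nonsense value in front of the real ones.
client_hello_versions = [GREASE_VALUES[3], 0x0304, 0x0303]

print(tolerant_select(client_hello_versions))
try:
    ossified_select(client_hello_versions)
except ValueError as err:
    print("ossified peer fails:", err)
```

Because clients send such values routinely, a peer written like `ossified_select` breaks on day one rather than years later when a real new version appears.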
| hinkley wrote:
| I think you've discovered the secret to getting people to
| exercise more.
| salawat wrote:
| More like the secret to getting ignored/blackholed/sued
| into the ground the moment your fuzzer hits on a combo
| that actually does damage to someone else's system, since
| the implication here is you're not doing it on your own
| stuff.
|
| If you are, good on ya. Continue doing $deity's/Bob's
| work.
| pixl97 wrote:
| More like the secret is to buy up some routing equipment
| and run these tests on your own network, sell the
| exploits you find to nefarious actors for big bucks, and
| have a bunker to survive the crash of the internet in.
|
| If your system is on the internet it is now a potential
| piece in a potential global war.
| salawat wrote:
| Always has been, ever since its inception. The only
| thing that'd kept it safe and unutilized in that regard
| had been a grassroots effort to not facilitate that
| transformation. That only lasted so long as the main
| population of Ops/devs/netizens were in agreement.
|
| We've blasted past that age at this point, so everybody
| gets to don their robe and wizard hat, pick up the staff,
| and yeet patches, routing rules, and packets betwixt one
| another.
| fanf2 wrote:
| As the fine article explains, that is what RFC 7606
| specifies.
___________________________________________________________________
(page generated 2023-08-29 23:01 UTC)