[HN Gopher] A coding error caused Rogers outage that left millio...
___________________________________________________________________
A coding error caused Rogers outage that left millions without
service
Author : kelseydh
Score : 32 points
Date : 2022-07-25 22:21 UTC (38 minutes ago)
(HTM) web link (web.archive.org)
(TXT) w3m dump (web.archive.org)
| gwen-shapira wrote:
| "Although Bell and Telus offered to help, Rogers quickly
| determined that it would not be able to transfer its customers to
| its rivals' networks because certain elements of the Rogers
| network, such as its centralized user database, were inaccessible
| as a result of the outage."
|
| It sounds like their control/management plane (including the user
| database) depended on their data plane, so a data plane outage
| was harder to mitigate than it would have been in a decoupled
| architecture. Good lesson for any architecture.
| mabbo wrote:
| Both my home internet and mobile are through Rogers (for the time
| being). I had no access to the internet for 15 hours, 6am to 9pm.
| Couldn't do my job as a remote developer.
|
| And all through the day I kept thinking to myself "I bet someone
| pushed an update to prod, causing this. And I'm glad that this
| time it wasn't me."
| amatecha wrote:
| You can see Rogers' own report (with some redactions) as provided
| to the CRTC. See the doc linked under the first (2022-07-22)
| heading here:
| https://crtc.gc.ca/otf/eng/2022/8000/c12-202203868.htm
| Victerius wrote:
| The key bit:
|
| > But, at 4:43 a.m. on July 8, a piece of code was introduced
| that deleted a routing filter. In telecom networks, packets of
| data are guided and directed by devices called routers, and
| filters prevent those routers from becoming overwhelmed, by
| limiting the number of possible routes that are presented to
| them.
|
| > Deleting the filter caused all possible routes to the internet
| to pass through the routers, resulting in several of the devices
| exceeding their memory and processing capacities. This caused the
| core network to shut down.
|
| Lesson no. 1: Do not design your system to have a single point of
| failure.
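|
| To make the quoted mechanism concrete, here is a toy sketch of a
| routing filter protecting a router with limited route capacity.
| It is only an illustration, not how Rogers' network or any real
| router OS works: apply_filter, install, RouteTableFull, the
| prefixes and the capacity of 300 are all made-up assumptions.

```python
import ipaddress


class RouteTableFull(Exception):
    """Raised when a router is offered more routes than it can hold."""


def apply_filter(routes, allowed):
    """Keep only routes that fall inside one of the allowed prefixes."""
    return [r for r in routes if any(r.subnet_of(a) for a in allowed)]


def install(routes, capacity):
    """Install routes into a (hypothetical) router with a fixed capacity."""
    if len(routes) > capacity:
        raise RouteTableFull(f"{len(routes)} routes offered, capacity {capacity}")
    return set(routes)


# Hypothetical route sets: 256 "internal" /16s plus 256 "internet" /16s.
internal = list(ipaddress.ip_network("10.0.0.0/8").subnets(new_prefix=16))
external = list(ipaddress.ip_network("100.0.0.0/8").subnets(new_prefix=16))
offered = internal + external

# The filter only lets the internal prefixes through (an assumption).
allowed = [ipaddress.ip_network("10.0.0.0/8")]

# With the filter in place, the offered routes fit in the table.
table = install(apply_filter(offered, allowed), capacity=300)

# With the filter deleted, every route is presented and the router overflows.
try:
    install(offered, capacity=300)
    overflowed = False
except RouteTableFull:
    overflowed = True
```

| In this toy model the failure is a clean exception. The quoted
| report describes real devices exceeding memory and processing
| capacity instead, which is a much less graceful way to fail.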
| erentz wrote:
| > But, in the early hours, the company's technicians had not
| yet pinpointed the cause of the catastrophe. Rogers apparently
| considered the possibility that its networks had been attacked
| by cybercriminals.
|
| I mean, if you just pushed a config change and the whole
| network went kaput, take a look at the config change before you
| start suspecting hackers.
| pitched wrote:
| I heard that the teams were having trouble communicating with
| each other and so the ones who pushed the config might not
| have been the ones looking for hackers.
|
| This is why some hospitals still use the old pager systems to
| contact people in the city. One hospital-owned antenna on a
| battery can coordinate a lot of people. I don't know what the
| equivalent to that would be in this case though.
| chx wrote:
| Ham radio.
|
| It still works, you know?
|
| Also, PagerDuty works over Wi-Fi...
___________________________________________________________________
(page generated 2022-07-25 23:00 UTC)