[HN Gopher] Mystery Blips
       ___________________________________________________________________
        
       Mystery Blips
        
       Author : bo0tzz
       Score  : 94 points
       Date   : 2022-11-19 10:28 UTC (12 hours ago)
        
 (HTM) web link (mosquitocapital.substack.com)
 (TXT) w3m dump (mosquitocapital.substack.com)
        
       | NKosmatos wrote:
       | Liked the ending of it, nice read :-) Being a data/statistics
       | freak, I would really love to see statistics and usage patterns
       | from big sites or games.
        
       | xyzelement wrote:
        | One of my clients, back when I was at a large financial software
        | and data provider, was a southern market-making firm.
       | 
       | They kept complaining about "the whole desk being slow" and we
       | spent weeks trying to figure out what was wrong with our
       | software.
       | 
       | Eventually we figured out the whole firm was nuts about golf, and
        | they'd all stream golf tournaments to their workstations at the
       | same time and saturate their ISDN or whatever they had.
        
         | steveBK123 wrote:
          | 10+ years ago, the networking team at a big European bank I
          | worked for observed that the biggest consumer of network
          | bandwidth on the trading floor desktop network was not the
          | Bloomberg Terminal, market data, or trading application
          | traffic... but YouTube :-)
        
           | bee_rider wrote:
            | Wouldn't surprise me if YouTube, Spotify, etc. were some of
            | the biggest resource users in many offices.
           | 
           | Maybe it would be useful to have a local company radio
           | streaming service, haha.
        
       | js2 wrote:
       | Early in my career, in the late 90s, I was at Cox Interactive
        | Media on the team responsible for the web farm which hosted all
        | of Cox Enterprises' news sites: newspapers, radio stations, TV
       | stations. This was before the days of SREs and SROs and dev-ops.
       | We were just system admins and programmers and some of us could
       | do both.
       | 
        | The web farm was about two dozen Sun Ultra 2s connected on an FDDI
       | loop (we eventually upgraded to GigE), with content on NetApps
       | (4GB drives). A couple Sun E450s. Apache[1] on the Ultra 2s.
       | Apache + mod_perl on the E450s. Hosted at a Global Center DC in
       | Sunnyvale (same DC that early Yahoo was hosted in).
       | 
       | Monitoring with MRTG[2].
       | 
       | It's 1998. The Ken Starr report drops. Now, we knew it was
       | coming, and we did our best to be prepared, but this is the late
        | 90s. There was only so much load testing we could do, and we
        | didn't really know how much traffic it would drive. We kept the
        | site up,
       | but it meant, as I recall, a lot of fine tuning of the mod_perl
       | box and disabling interactive parts[3] of the site.
       | 
       | I really wish I still had some of the traffic graphs.
       | 
       | Fun times.
       | 
       | [1] Receipt:
       | https://github.com/apache/httpd/blob/1.3.x/src/CHANGES#L5754
       | 
       | [2] Receipt:
       | https://github.com/oetiker/mrtg/blob/master/src/CHANGES#L310...
       | 
        | [3] The forums. OMG forum software was so terrible. I think we
        | eventually wrote our own after nothing we found, whether based
        | on Netscape Application Server or open source, worked well at
        | all.
        
       | gooseyard wrote:
        | At the end of 1999, my employer hosted what I guess was half NYE
       | Party and half Incident Response. We were optimistic that we
       | wouldn't have issues but figured it'd be wise to have a
       | sufficient crew of designated drivers, so to speak.
       | 
        | Our monitoring system included a Mercator projection on some big
        | screens, with colored markers that showed where our gear was and
        | whose color indicated whether it was down, being hammered, etc.
       | 
       | We had no equipment in UTC+14, so there was an hour of waiting
       | between that zone reaching y2k and when we'd maybe see some
       | action. The control room was mobbed since we also had some
       | televisions running cable news there and everyone wanted to see
       | it.
       | 
       | Despite the anxiety, the gear was looking fine everywhere, until
       | about 5 minutes before midnight in UTC+13, when our locations in
       | New Zealand started to turn red. A hush fell over the room as the
       | operators attempted to open connections to those boxes. The
       | connection attempts appeared to hang, and the party got somber.
       | But after the eternity of a few seconds, we were in. The machines
       | were alive but being hammered; the alerts just indicated the
        | excessive traffic caused by masses of users refreshing NZ-based
        | websites as the date change approached, to see whether
       | they'd fallen off the net or not. Breathing resumed.
       | 
       | After 15 minutes or so had passed, the traffic fell off, and by
       | the time y2k reached us in UTC+5 we'd been in full festive mode
       | for hours. I still can't look at New Zealand on a map without
       | seeing it covered in angry red circles, though.
        
         | 2b3a51 wrote:
         | Big street party here in UK on Dec 31 1999. Myself and a few
         | neighbours who thought about these things relaxed a bit after
         | 9pm (Midnight Moscow time) and started partying seriously.
        
       | retrocryptid wrote:
       | Lol. During "high velocity events" at Amazon, we would have a war
       | room with all the engineers and a couple old hands directing
       | responses.
       | 
        | Some of our metrics came in 5 minutes delayed, which wasn't a
        | problem on normal days. These metrics moved slowly enough that
       | when you got an alarm, there was still plenty of time to take
       | corrective action.
       | 
        | But for HVEs this was an issue. During Black Friday or Prime Day,
        | sometimes some metrics spiked so fast you had no time to respond
        | (usually from people hitting page reload a few minutes before a
        | sale kicked off).
       | 
        | To get an idea of what was going on, I would go on Twitter and
       | search for things like "amazon failure" or "amazon 502."
       | 
       | We often got problem reports via Twitter before they showed up on
       | our dashboards.
        
       | tonetheman wrote:
       | I did SRE for years.
       | 
        | A lot of the time it would end up being DNS or routing somewhere
       | outside of our control.
       | 
        | Sometimes a disk would get close to filling up and a cron job
        | would clear it just in time, so you would see decreasing
        | performance and then it would clear.
       | 
       | Other times we would have a customer that did something that we
       | just did not expect with the system and it would cause SQL
       | queries to slow down in ways we could not imagine. And it was
       | transient so those would be hard to find or explain.
       | 
       | Or we would hit a limit in the load balancer (haproxy) in really
       | odd ways. Too much traffic on the frontend or not enough capacity
       | in the backend. And many other various ways of things not
       | working. Haproxy was and still is amazing software. Really almost
       | magical.
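        | 
        | For anyone curious, those haproxy limits usually live in the
        | maxconn settings on the frontend and on each backend server; a
        | rough illustrative config (names and numbers made up):
        | 
        |   frontend fe_web
        |       bind *:80
        |       maxconn 10000         # frontend cap; extra connections
        |                             # wait in the kernel accept queue
        |       default_backend be_app
        | 
        |   backend be_app
        |       # per-server cap; extra requests queue inside haproxy,
        |       # which shows up as odd latency rather than hard errors
        |       server app1 10.0.0.11:8080 check maxconn 200
        |       server app2 10.0.0.12:8080 check maxconn 200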
        
       | eb0la wrote:
        | I was working at a telco that owned a big backbone network in
        | South America.
       | 
       | Our director was very sensitive to traffic changes. Every week or
       | two there was a big meeting at her office to explain what
       | happened when some traffic went down.
       | 
        | That day she was really mad at us. The afternoon before, there
        | was 20% less traffic in the backbone and nobody knew where it
        | went.
       | 
        | I was on call that week for the management systems and I was
        | questioned about why the operators hadn't received any alarm.
       | 
       | Turns out Spain was playing soccer against some Latin American
       | team. Spain was our biggest customer. I don't remember if the
       | other team was Argentina, Chile, or Brazil... but it was our 2nd
       | biggest market.
       | 
       | People just decided to watch TV instead of web browsing (that was
       | before mobile phones had an affordable internet connection).
       | 
        | Funnily enough, eMule traffic spiked during the match. That was
        | how I could justify that there was no alert in the system for
        | down interfaces: the links were all still up.
        
       | adw wrote:
       | This happens in all infrastructure, particularly power and water;
       | https://en.wikipedia.org/wiki/TV_pickup
        
       | ricardobeat wrote:
       | This footnote is almost more interesting than the post itself:
       | > There's a deep beauty to the thought that untold millions
       | of people using the app randomly, but oh so slightly
       | habitually, aggregated together, makes such a predictable pattern
       | 
       | That 'beauty' can be extremely scary from another angle. It's
       | evidence of what makes Facebook, and other social media, such
       | powerful mass manipulation tools.
        
         | retrocryptid wrote:
         | Yes. But just about every big service has marked diurnal and
         | weekly patterns.
        
           | 13of40 wrote:
           | One service I used to work on processed business-to-customer
           | and business-to-business emails almost exclusively, so you
           | could see a recurring weekly pattern, bumps throughout the
           | day when the US east coast, US west coast, Asia, and Europe
           | woke up, and spikes at the top of the hour from automation.
           | So one day we got a new boss, and he called me into his
           | office in a panic so I could explain why his chart of the
           | traffic kept going up and down. Took about four tries, but I
           | think he eventually got it.
           | 
           | I also recall seeing someone push a global update to that
           | system that was packaged wrong, and watching the graph
           | gradually drop and flatline as 20,000 VM hosts across the
           | planet stopped taking traffic. That had its own subtle
           | beauty, in no way diminished by the fact I was just a
           | bystander and couldn't get in trouble for it.
        
         | none_to_remain wrote:
         | I don't see the fright. So many things do this - electricity
         | for example - you turn the lights on and off and run the
         | laundry machine whenever it pleases you, but on the scale of
         | millions of people the power companies predict the daily usage
         | patterns very well.
        
       | encoderer wrote:
       | Maybe it's just the trained operator in me but the whole time
       | he's describing the depressed metrics and blip I'm screaming in
       | my head: it's exogenous! Check the news! He finally gets there.
       | 
       | For us (at normal company scale) I would actually go look at
       | recent activity, new customers, new workloads we are running, but
       | I guess at a Facebook scale it's a lot harder to do that.
        
         | nerdponx wrote:
         | One interesting aspect of the story is that the author was a
         | junior at the time (almost brand new to the job, if I read it
         | correctly), and it was one of the more experienced operators
         | who realized it was exogenous.
         | 
          | In addition to being a good story about site reliability, it's
          | also a great lesson in the value of collaboration,
         | mentorship, and having senior people around whom you can ask
         | for help!
        
       | prox wrote:
        | I love these kinds of stories! Any more HN'ers have these?
        
         | Kiro wrote:
         | The employee shaming on here has scared away all the people
         | with interesting war stories from big companies. Anyone active
         | on HN nowadays probably hasn't worked on anything significant.
        
         | gfv wrote:
         | Pirate video releases generate heaps of traffic, and they won't
         | be on the news. Earthquakes (mild ones, not the "cities
         | collapse" kind) cause people to immediately go online and check
         | on their friends. So do missile strikes, but those tend to
         | generate extremely popular videos as well. During the holy
         | month of Ramadan, you can see a rapid and deep traffic drop in
         | Muslim countries right at their local sunset, when people have
         | iftar, breaking their daily fast. A country in North Africa
         | shuts down their internet access completely during their school
         | exams. Popular web infrastructure sometimes reroutes your
         | requests to distant data centers, leading to request latency
         | exploding together with your queue lengths. In summer, the
         | morning user activity peaks later than in autumn because
         | schools are on their summer breaks.
         | 
         | I bet every seasoned SRE has a few to add.
        
           | prox wrote:
           | So cool to see these things happen!
        
             | jasonwatkinspdx wrote:
             | Here's a fun one I remember seeing a little video on: the
             | UK power authority has to anticipate commercial breaks in
             | major broadcasts like the World Cup because everyone
             | turning on their electric kettle spikes the grid.
        
       | retrocryptid wrote:
       | Another part of my career, I worked on a team that ran an
        | automated content detection service (like YouTube Content ID).
       | 
       | The database holding signatures for known music samples was
        | sharded by artist. Not as crazy as you might think. You get a
        | trial sample and you send it to all the shards, which chunk on it
        | in parallel, then you just wait for the servers hosting each
        | shard to respond.
       | 
       | But then Prince died.
       | 
        | The queue for the server that owned Prince's shard backed up.
       | 
       | Then our proxy that distributed trials to each shard backed up.
       | 
       | Then our regular reverse proxy backed up.
       | 
       | Then our anemic load balancer fell over.
       | 
        | What we learned for about the bazillionth time is to keep a
        | backup of each shard handy so you can add it into the rotation,
        | and to pay an intern to watch Facebook for news about recently
        | deceased musicians.
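        | 
        | A minimal sketch of that fan-out, with hypothetical names (not
        | the real service): the caller waits on every shard, so one hot
        | shard stalls every request behind it.
        | 
        |   import time
        |   from concurrent.futures import ThreadPoolExecutor
        | 
        |   # Illustrative shards with per-lookup latency in seconds;
        |   # the hot one stands in for the shard that held Prince.
        |   SHARDS = {"shard-a": 0.05, "shard-b": 0.05, "shard-hot": 2.0}
        | 
        |   def query_shard(name, sample):
        |       time.sleep(SHARDS[name])  # stand-in for signature match
        |       return []                 # matches found on this shard
        | 
        |   def identify(sample):
        |       with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        |           futures = [pool.submit(query_shard, n, sample)
        |                      for n in SHARDS]
        |           # Total latency tracks the slowest shard, so one
        |           # backed-up shard backs up every request in flight.
        |           return [m for f in futures for m in f.result()]
        | 
        |   identify(b"trial sample")  # ~2s despite two fast shards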
        
       | steveBK123 wrote:
       | Having been at a finance firm trying to implement SRE without
       | SRO, and without staffing either, I do find the story a bit
       | entertaining.
       | 
       | That is - we were mandated to implement all the SRE tooling on
       | top of our apps, with like 1 guy owning the SRE infra, but with
       | no one looking at the charts proactively.
       | 
       | The idea was sold to dev teams that it was something like a black
       | box recorder for the on-call rotation to check first when you got
       | called in the middle of the night/weekend.
       | 
       | In reality it turned into CTO reporting metrics of whether SLOs
       | were being met, lol.
       | 
        | Constantly feel like other industries read like 1/4 of a book
        | about how FAANG does stuff and then adopt the laziest, worst
        | implementation of the part they skimmed.
        
         | nerdponx wrote:
          | > Constantly feel like other industries read like 1/4 of a book
          | about how FAANG does stuff and then adopt the laziest, worst
          | implementation of the part they skimmed.
         | 
         | I think this is less wrong than you think it is. In the P&C
         | insurance industry, I saw a handful of initiatives that seemed
         | like they were started because a senior manager read about
         | something that sounded cool and high-tech in an industry
         | publication, and/or heard it in a sales pitch from a Microsoft
         | rep, without actually checking to see if it was feasible or
         | even useful.
        
       | jeroenhd wrote:
       | This reminds me of this YouTube video:
        | https://youtu.be/slDAvewWfrA. There's a monitoring room to make
        | sure people are ready to switch to backups and reroute in case
        | of some kind of grid failure, but also for very specific
        | scenarios.
       | 
       | Britain being filled with Brits, what would happen is that once
       | the show was over, half the nation got up from the couch to turn
       | on the kettle for a cup of tea. Those electric kettles are quite
       | demanding, especially if millions of them turn on at the same
       | time.
       | 
        | So every time there is a major event, dedicated people are
        | monitoring the grid frequency, with power plants on standby and
        | foreign contracts at the ready, just to increase capacity at the
        | right time. You can't just schedule this stuff,
       | because if a football match runs over its allotted time, you may
       | suddenly add power to the grid without any load, causing all
       | kinds of problems like the grid frequency increasing and making
       | digital clocks run ahead. There is a temporary demand of
       | gigawatts of power that rises within five minutes and lasts until
       | the kettles are done.
       | 
       | The YouTube video provides an example of 600MW of power being
       | requested from France... for the end of an episode of Eastenders.
       | 
       | I always knew the British like their tea and that there's some
       | kind of planning going on for events like street lights turning
       | on, but the combination of the two is a great example of complex
       | behaviour that's easy to overlook.
        
         | LeoPanthera wrote:
         | > Britain being filled with Brits, what would happen is that
         | once the show was over, half the nation got up from the couch
         | to turn on the kettle for a cup of tea. Those electric kettles
         | are quite demanding, especially if millions of them turn on at
         | the same time.
         | 
          | Particularly _British_ kettles, which pull 13 amps at 240V =
         | over 3000 watts. A lot more than (standard) American outlets
         | can deliver.
         | 
         | For a while there was a theory going around that this is why
         | Americans typically do not have electric kettles, because they
         | would boil much slower in the USA, but the real answer is that
         | Americans just don't drink much tea.
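          | 
          | A rough back-of-envelope does bear out the "slower" part
          | (assuming one litre, an 80 degree C rise, perfect efficiency,
          | and a US kettle limited to 12 amps on a standard 15 amp, 120
          | volt circuit):
          | 
          |   C_WATER = 4186                 # J per kg per degree C
          |   energy_j = 1.0 * C_WATER * 80  # one litre, 20C -> 100C
          | 
          |   for label, volts, amps in [("UK kettle", 240, 13),
          |                              ("US outlet", 120, 12)]:
          |       watts = volts * amps
          |       print(f"{label}: {watts} W, ~{energy_j/watts:.0f} s")
          | 
          |   # UK kettle: 3120 W, ~107 s
          |   # US outlet: 1440 W, ~233 s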
        
           | esaym wrote:
           | > For a while there was a theory going around that this is
           | why Americans typically do not have electric kettles, because
           | they would boil much slower in the USA, but the real answer
           | is that Americans just don't drink much tea.
           | 
           | I would drink tea and use an electric kettle if it would boil
           | faster...
        
             | seesaw wrote:
             | I use an electric kettle to boil water. I got it a while
             | back from Costco. I find it to be faster than our earlier
             | method - using the microwave to boil water.
        
       ___________________________________________________________________
       (page generated 2022-11-19 23:00 UTC)