[HN Gopher] Mystery Blips
___________________________________________________________________
Mystery Blips
Author : bo0tzz
Score : 94 points
Date : 2022-11-19 10:28 UTC (12 hours ago)
(HTM) web link (mosquitocapital.substack.com)
(TXT) w3m dump (mosquitocapital.substack.com)
| NKosmatos wrote:
| Liked the ending of it, nice read :-) Being a data/statistics
| freak, I would really love to see statistics and usage patterns
| from big sites or games.
| xyzelement wrote:
| Back when I was at a large financial software and data provider,
| one of my clients was a southern market-making firm.
|
| They kept complaining about "the whole desk being slow" and we
| spent weeks trying to figure out what was wrong with our
| software.
|
| Eventually we figured out the whole firm was nuts about golf, and
| they'd all stream golf tournaments to their workstations at the
| same time and saturate their ISDN or whatever they had.
| steveBK123 wrote:
| 10+ years ago, the networking team at a big European bank I
| worked for observed that the biggest consumer of network
| bandwidth on the trading floor desktop network was not the
| Bloomberg Terminal, market data, or trading application
| traffic... but YouTube :-)
| bee_rider wrote:
| Wouldn't surprise me if YouTube, Spotify, etc. were some of
| the biggest resource users in many offices.
|
| Maybe it would be useful to have a local company radio
| streaming service, haha.
| js2 wrote:
| Early in my career, in the late 90s, I was at Cox Interactive
| Media, on the team responsible for the web farm which hosted all
| of Cox Enterprises' news sites: newspapers, radio stations, and
| TV stations. This was before the days of SREs and SROs and
| DevOps.
| We were just system admins and programmers and some of us could
| do both.
|
| The web farm was about two dozen Sun Ultra 2s connected on an
| FDDI loop (we eventually upgraded to GigE), with content on
| NetApps (4GB drives). A couple of Sun E450s. Apache[1] on the
| Ultra 2s.
| Apache + mod_perl on the E450s. Hosted at a Global Center DC in
| Sunnyvale (same DC that early Yahoo was hosted in).
|
| Monitoring with MRTG[2].
|
| It's 1998. The Ken Starr report drops. Now, we knew it was
| coming, and we did our best to be prepared, but this is the late
| 90s. There was only so much load testing we could do, and we
| didn't really know how much traffic it would drive. We kept the
| site up, but it meant, as I recall, a lot of fine-tuning of the
| mod_perl box and disabling interactive parts[3] of the site.
|
| I really wish I still had some of the traffic graphs.
|
| Fun times.
|
| [1] Receipt:
| https://github.com/apache/httpd/blob/1.3.x/src/CHANGES#L5754
|
| [2] Receipt:
| https://github.com/oetiker/mrtg/blob/master/src/CHANGES#L310...
|
| [3] The forums. OMG, forum software was so terrible. I think we
| eventually wrote our own after nothing we found, whether based
| on Netscape Application Server or open source, worked well at
| all.
| gooseyard wrote:
| At the end of 1999, my employer hosted what I guess was half NYE
| party and half incident response. We were optimistic that we
| wouldn't have issues but figured it'd be wise to have a
| sufficient crew of designated drivers, so to speak.
|
| Our monitoring system included a Mercator projection on some big
| screens, with colored markers that showed where our gear was,
| their color indicating whether it was down, being hammered, etc.
|
| We had no equipment in UTC+14, so there was an hour of waiting
| between that zone reaching y2k and when we'd maybe see some
| action. The control room was mobbed since we also had some
| televisions running cable news there and everyone wanted to see
| it.
|
| Despite the anxiety, the gear was looking fine everywhere, until
| about 5 minutes before midnight in UTC+13, when our locations in
| New Zealand started to turn red. A hush fell over the room as the
| operators attempted to open connections to those boxes. The
| connection attempts appeared to hang, and the party got somber.
| But after the eternity of a few seconds, we were in. The machines
| were alive but being hammered; the alerts just indicated the
| excessive traffic caused by masses of users refreshing NZ-based
| websites as the date change approached, to see whether they'd
| fallen off the net or not. Breathing resumed.
|
| After 15 minutes or so had passed, the traffic fell off, and by
| the time y2k reached us in UTC+5 we'd been in full festive mode
| for hours. I still can't look at New Zealand on a map without
| seeing it covered in angry red circles, though.
| 2b3a51 wrote:
| Big street party here in the UK on Dec 31, 1999. Myself and a few
| neighbours who thought about these things relaxed a bit after
| 9pm (midnight Moscow time) and started partying seriously.
| retrocryptid wrote:
| Lol. During "high velocity events" at Amazon, we would have a war
| room with all the engineers and a couple old hands directing
| responses.
|
| Some of our metrics came in 5 minutes delayed, which wasn't a
| problem on normal days. These metrics moved slowly enough that
| when you got an alarm, there was still plenty of time to take
| corrective action.
|
| But for HVEs this was an issue. During Black Friday or Prime Day,
| sometimes some metrics spiked so fast you had no time to respond
| (usually from people hitting page reload a few minutes before a
| sale kicked off).
|
| To get an idea of what was going on, I would go on Twitter and
| search for things like "amazon failure" or "amazon 502."
|
| We often got problem reports via Twitter before they showed up on
| our dashboards.
| tonetheman wrote:
| I did SRE for years.
|
| A lot of the time it would end up being DNS or routing somewhere
| outside of our control.
|
| Sometimes a disk would get close to filling up and a cron job
| would clear it just in time, so you would see performance
| degrade and then recover.
|
| Other times a customer would do something with the system that
| we just did not expect, and it would cause SQL queries to slow
| down in ways we could not have imagined. And it was transient,
| so those would be hard to find or explain.
|
| Or we would hit a limit in the load balancer (HAProxy) in really
| odd ways: too much traffic on the frontend, or not enough
| capacity in the backend. And many other ways of things not
| working. HAProxy was and still is amazing software. Really
| almost magical.
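|
| The two knobs that usually bit us look roughly like this (an
| illustrative sketch, not our real config; names and numbers are
| made up). The frontend maxconn caps concurrent client
| connections, and the per-server maxconn caps what each backend
| instance will take, with the overflow waiting in the backend
| queue until "timeout queue" expires:
|
|     defaults
|         mode http
|         timeout connect 5s
|         timeout client  30s
|         timeout server  30s
|         # how long a request may wait for a free server slot
|         timeout queue   10s
|
|     frontend fe_web
|         bind *:80
|         # cap on concurrent client connections for this frontend
|         maxconn 5000
|         default_backend be_app
|
|     backend be_app
|         # per-server concurrency caps; excess requests queue
|         server app1 10.0.0.11:8080 maxconn 200
|         server app2 10.0.0.12:8080 maxconn 200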
| eb0la wrote:
| I was working at a telco that owns a big backbone network in
| South America.
|
| Our director was very sensitive to traffic changes. Every week or
| two there was a big meeting in her office to explain what had
| happened whenever some traffic went down.
|
| That day she was really mad at us. The afternoon before, there
| had been 20% less traffic on the backbone and nobody knew where
| it went.
|
| I was on call that week for the management systems and I was
| questioned about why the operators hadn't received any alarm.
|
| Turns out Spain was playing soccer against some Latin American
| team. Spain was our biggest customer. I don't remember if the
| other team was Argentina, Chile, or Brazil... but it was our 2nd
| biggest market.
|
| People just decided to watch TV instead of web browsing (that was
| before mobile phones had an affordable internet connection).
|
| Funnily enough, eMule traffic spiked during the match. That was
| how I could justify that there was no alert in the system for
| down interfaces.
| adw wrote:
| This happens in all infrastructure, particularly power and water;
| https://en.wikipedia.org/wiki/TV_pickup
| ricardobeat wrote:
| This footnote is almost more interesting than the post itself:
| > There's a deep beauty to the thought that untold millions
| of people using the app randomly, but oh so slightly
| habitually, aggregated together, makes such a predictable pattern
|
| That 'beauty' can be extremely scary from another angle. It's
| evidence of what makes Facebook, and other social media, such
| powerful mass manipulation tools.
| retrocryptid wrote:
| Yes. But just about every big service has marked diurnal and
| weekly patterns.
| 13of40 wrote:
| One service I used to work on processed business-to-customer
| and business-to-business emails almost exclusively, so you
| could see a recurring weekly pattern, bumps throughout the
| day when the US east coast, US west coast, Asia, and Europe
| woke up, and spikes at the top of the hour from automation.
| So one day we got a new boss, and he called me into his
| office in a panic so I could explain why his chart of the
| traffic kept going up and down. Took about four tries, but I
| think he eventually got it.
|
| I also recall seeing someone push a global update to that
| system that was packaged wrong, and watching the graph
| gradually drop and flatline as 20,000 VM hosts across the
| planet stopped taking traffic. That had its own subtle
| beauty, in no way diminished by the fact I was just a
| bystander and couldn't get in trouble for it.
| none_to_remain wrote:
| I don't see the fright. So many things do this - electricity
| for example - you turn the lights on and off and run the
| laundry machine whenever it pleases you, but on the scale of
| millions of people the power companies predict the daily usage
| patterns very well.
| encoderer wrote:
| Maybe it's just the trained operator in me but the whole time
| he's describing the depressed metrics and blip I'm screaming in
| my head: it's exogenous! Check the news! He finally gets there.
|
| For us (at normal company scale) I would actually go look at
| recent activity, new customers, new workloads we are running, but
| I guess at a Facebook scale it's a lot harder to do that.
| nerdponx wrote:
| One interesting aspect of the story is that the author was a
| junior at the time (almost brand new to the job, if I read it
| correctly), and it was one of the more experienced operators
| who realized it was exogenous.
|
| In addition to being a good story about site reliability, it's
| also a great lesson in the value of collaboration, mentorship,
| and having senior people around whom you can ask for help!
| prox wrote:
| I love these kinds of stories! Any more HN'ers have these?
| Kiro wrote:
| The employee shaming on here has scared away all the people
| with interesting war stories from big companies. Anyone active
| on HN nowadays probably hasn't worked on anything significant.
| gfv wrote:
| Pirate video releases generate heaps of traffic, and they won't
| be on the news. Earthquakes (mild ones, not the "cities
| collapse" kind) cause people to immediately go online and check
| on their friends. So do missile strikes, but those tend to
| generate extremely popular videos as well. During the holy
| month of Ramadan, you can see a rapid and deep traffic drop in
| Muslim countries right at their local sunset, when people have
| iftar, breaking their daily fast. A country in North Africa
| shuts down their internet access completely during their school
| exams. Popular web infrastructure sometimes reroutes your
| requests to distant data centers, leading to request latency
| exploding together with your queue lengths. In summer, the
| morning user activity peaks later than in autumn because
| schools are on their summer breaks.
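|
| (That last one is just Little's law at work: requests in flight
| = arrival rate x latency, so if a reroute triples your
| round-trip time while traffic stays constant, your queues
| roughly triple too.)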
|
| I bet every seasoned SRE has a few to add.
| prox wrote:
| So cool to see these things happen!
| jasonwatkinspdx wrote:
| Here's a fun one I remember seeing a little video on: the
| UK power authority has to anticipate commercial breaks in
| major broadcasts like the World Cup because everyone
| turning on their electric kettle spikes the grid.
| retrocryptid wrote:
| In another part of my career, I worked on a team that ran an
| automated content detection service (like YouTube Content ID).
|
| The database holding signatures for known music samples was
| sharded by artist. Not as crazy as you might think. You get a
| trial sample and send it to all the shards, which chunk on it
| in parallel; then you just wait for the servers hosting each
| shard to respond.
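|
| Roughly, the fan-out looked like this (a toy sketch in Python,
| not the real code; the hostnames, port, and wire protocol here
| are made up):
|
|     import asyncio
|
|     # Hypothetical shard endpoints, each owning a slice of artists.
|     SHARDS = ["shard-a", "shard-b", "shard-c"]
|
|     async def query_shard(host: str, sample: bytes) -> bytes:
|         # Stand-in for the real RPC: send the trial sample, read
|         # back whatever match (if any) this shard found.
|         reader, writer = await asyncio.open_connection(host, 9000)
|         writer.write(sample)
|         await writer.drain()
|         match = await reader.read(4096)
|         writer.close()
|         await writer.wait_closed()
|         return match
|
|     async def identify(sample: bytes) -> list[bytes]:
|         # Scatter the sample to every shard in parallel, gather
|         # the responses; end-to-end latency is whatever the
|         # slowest (or most backed-up) shard takes.
|         results = await asyncio.gather(
|             *(query_shard(h, sample) for h in SHARDS),
|             return_exceptions=True)
|         return [r for r in results if isinstance(r, bytes) and r]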
|
| But then Prince died.
|
| The queue for the server that owned the shard Prince was in
| backed up.
|
| Then our proxy that distributed trials to each shard backed up.
|
| Then our regular reverse proxy backed up.
|
| Then our anemic load balancer fell over.
|
| What we learned for about the bazillionth time: keep a backup of
| each shard handy so you can add it into the rotation, and pay an
| intern to watch Facebook for news about recently deceased
| musicians.
| steveBK123 wrote:
| Having been at a finance firm trying to implement SRE without
| SRO, and without staffing either, I do find the story a bit
| entertaining.
|
| That is - we were mandated to implement all the SRE tooling on
| top of our apps, with like 1 guy owning the SRE infra, but with
| no one looking at the charts proactively.
|
| The idea was sold to dev teams that it was something like a black
| box recorder for the on-call rotation to check first when you got
| called in the middle of the night/weekend.
|
| In reality it turned into the CTO reporting metrics on whether
| SLOs were being met, lol.
|
| Constantly feel like other industries read like 1/4 of a book
| about how FAANG does stuff and then adopts the laziest worst
| implementation of the part they skimmed.
| nerdponx wrote:
| > Constantly feel like other industries read like 1/4 of a book
| about how FAANG does stuff and then adopts the laziest worst
| implementation of the part they skimmed.
|
| I think this is less wrong than you think it is. In the P&C
| insurance industry, I saw a handful of initiatives that seemed
| like they were started because a senior manager read about
| something that sounded cool and high-tech in an industry
| publication, and/or heard it in a sales pitch from a Microsoft
| rep, without actually checking to see if it was feasible or
| even useful.
| jeroenhd wrote:
| This reminds me of this YouTube video:
| https://youtu.be/slDAvewWfrA. There's a monitoring room to make
| sure people are ready to switch to backups and reroute in case
| of some kind of grid failure, but also for very specific
| scenarios.
|
| Britain being filled with Brits, what would happen is that once
| the show was over, half the nation got up from the couch to turn
| on the kettle for a cup of tea. Those electric kettles are quite
| demanding, especially if millions of them turn on at the same
| time.
|
| So every time there is a major event, dedicated people monitor
| the grid frequency, with power plants on standby and foreign
| contracts at the ready, just to increase capacity at the right
| time. You can't just schedule this stuff,
| because if a football match runs over its allotted time, you may
| suddenly add power to the grid without any load, causing all
| kinds of problems like the grid frequency increasing and making
| digital clocks run ahead. There is a temporary demand of
| gigawatts of power that rises within five minutes and lasts until
| the kettles are done.
|
| The YouTube video provides an example of 600 MW of power being
| requested from France... for the end of an episode of EastEnders.
|
| I always knew the British like their tea and that there's some
| kind of planning going on for events like street lights turning
| on, but the combination of the two is a great example of complex
| behaviour that's easy to overlook.
| LeoPanthera wrote:
| > Britain being filled with Brits, what would happen is that
| once the show was over, half the nation got up from the couch
| to turn on the kettle for a cup of tea. Those electric kettles
| are quite demanding, especially if millions of them turn on at
| the same time.
|
| Particularly _British_ kettles, which pull 13 amps at 240v =
| over 3000 watts. A lot more than (standard) American outlets
| can deliver.
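|
| (Rough numbers: a standard US outlet tops out around 120 V x
| 12 A, about 1.4 kW for a plug-in appliance, versus 240 V x
| 13 A, roughly 3.1 kW, from a UK socket, so the same amount of
| water takes about twice as long to boil on US power.)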
|
| For a while there was a theory going around that this is why
| Americans typically do not have electric kettles, because they
| would boil much slower in the USA, but the real answer is that
| Americans just don't drink much tea.
| esaym wrote:
| > For a while there was a theory going around that this is
| why Americans typically do not have electric kettles, because
| they would boil much slower in the USA, but the real answer
| is that Americans just don't drink much tea.
|
| I would drink tea and use an electric kettle if it would boil
| faster...
| seesaw wrote:
| I use an electric kettle to boil water. I got it a while
| back from Costco. I find it to be faster than our earlier
| method - using the microwave to boil water.
___________________________________________________________________
(page generated 2022-11-19 23:00 UTC)