[HN Gopher] We found and fixed a rare race condition in our sess...
       ___________________________________________________________________
        
       We found and fixed a rare race condition in our session handling
        
       Author : todsacerdoti
       Score  : 564 points
       Date   : 2021-03-18 20:00 UTC (1 day ago)
        
 (HTM) web link (github.blog)
 (TXT) w3m dump (github.blog)
        
       | maddyboo wrote:
       | I'm very impressed by the level of detail provided in this post
       | mortem, but it leaves me with a few questions.
       | 
       | What we know based on this report:
       | 
       | - This bug was live in production for at least 3 days.
       | 
       | - It was common enough to happen numerous times in the wild.
       | 
       | - GitHub says that they identified some specific user sessions
       | that may have been impacted by this bug by analyzing their logs.
       | 
       | What we do not know:
       | 
       | - How many leaked user sessions were identified in the followup
       | investigation?
       | 
       | - Is GitHub confident that they identified all leaked user
       | sessions?
       | 
       | - Was there any evidence of malicious use of these sessions?
       | 
       | - Did GitHub notify the users whose sessions were leaked?
       | 
       | - How long/for how many requests could these sessions have been
       | valid between their leak and when they were definitively revoked
       | on the 8th?
       | 
       | - Is it possible that any private user data was leaked? If so,
       | what data?
        
         | azernik wrote:
         | > - Is GitHub confident that they identified all leaked user
         | sessions?
         | 
         | Given that they revoked _all_ current logins, rather than just
         | the affected ones, I assume that they have no way to tell which
         | sessions were affected.
        
         | dfcowell wrote:
         | Most of that information is available in the initial report
         | linked to in the first paragraph of this blog post:
         | https://github.blog/2021-03-08-github-security-update-a-bug-...
        
           | maddyboo wrote:
           | That provides more information but doesn't fully answer any
           | of my questions.
           | 
           | Things that are still unclear:
           | 
           | - GitHub provides an estimate of the prevalence at 0.001%,
           | but doesn't state how many leaked sessions were actually
           | identified.
           | 
           | - Is GitHub confident that they identified all leaked user
           | sessions? This has bearing on my question about notification
           | of users whose sessions were leaked; if they did not identify
           | all leaked sessions, they obviously could not contact those
           | users.
           | 
           | - GitHub states that this bug could not be intentionally
           | triggered by a malicious party (why?), but they don't state
           | whether any of the natural occurrences may have been used
           | maliciously.
           | 
           | - How long/for how many requests could these sessions have
           | been valid between their leak and when they were definitively
           | revoked on the 8th?
           | 
           | - Is it possible that any private user data was leaked? If
           | so, what data?
           | 
           | People with technical understanding can infer most of this
           | information, but I think GitHub should issue some
           | straightforward answers rather than beating around the bush.
           | 
           | (Correction to my original comment: the bug was live for a
           | cumulative period of "less than 2 weeks")
        
             | ac29 wrote:
             | > GitHub states that this bug could not be intentionally
             | triggered by a malicious party (why?),
             | 
             | Because this bug requires two users making authenticated
             | requests at nearly the same time. An attacker can't force
             | their target to perform actions, especially not in the same
             | small time frame that is likely required (milliseconds?
             | less?). I suppose if the malicious party didn't care what
             | account they got swapped into, and GitHub didn't rate limit
             | their requests, perhaps they could trigger this bug with
             | enough time?
        
         | Boltgolt wrote:
         | GitHub says there could have been sessions that were leaked but
         | can't be identified now:
         | 
         | "While our log analysis, conducted from March 5 through March
         | 8, confirmed that this was a rare issue, it could not rule out
         | the possibility that a session had been incorrectly returned
         | but then never used."
         | 
         | Those sessions are now invalidated and were apparently never
         | used.
        
       | polote wrote:
       | > While our log analysis, conducted from March 5 through March 8,
       | confirmed that this was a rare issue, it could not rule out the
       | possibility that a session had been incorrectly returned but then
       | never used. This was not a risk we were willing to take, given
       | the potential impact of even one of these incorrectly returned
       | sessions being used.
       | 
       | It is funny how they make it seem like they are super cautious,
       | when in fact they had no choice but to do it once you rephrase
       | the problem as "some people may have gotten admin rights to the
       | GitHub accounts of some of our customers".
        
         | lathiat wrote:
         | You say that, but resetting all the sessions is very publicly
         | visible and plenty of smaller companies would paper over such
         | things. Honestly I don't really expect any less from GitHub
         | or Microsoft at their scale, but don't kid yourself into
         | thinking that kind of thing goes unnoticed, uninvestigated,
         | or unactioned elsewhere.
        
           | ComputerGuru wrote:
           | To the contrary, I find my session tokens get reset left and
           | right by websites that don't even care about how much of an
           | inconvenience it is for the customers.
        
             | imtringued wrote:
             | Because Discord doesn't let you change avatars I have
             | created three different Discord accounts. I get logged out
             | of all three all the time. It's a massive pain.
        
               | WayToDoor wrote:
               | Huh ? You can change your avatar in discord settings...
        
               | kroltan wrote:
               | Maybe part of the sentence is missing and they meant
               | change avatars _per server_? That would be the only
               | reason I can imagine someone might want to have multiple
               | accounts.
        
             | ncann wrote:
             | Yeah, I don't visit GitHub that often, maybe once a week
             | but it logs me out every damn time. I wonder if it's
             | because I have 2FA setup or something.
        
             | yunohn wrote:
             | I use a food delivery service (website only, no app) that
             | doesn't even remember the chosen language setting, and of
             | course, never keeps you logged in past 24hrs. So
             | effectively, need to login every time I make an order...
        
               | jspash wrote:
               | This is my biggest pet peeve about delivery service
               | websites. What are the chances that I've moved house
               | since I last ordered a pizza two days ago? Why do you
               | incessantly ask for my postcode? When I log in, assume
               | I'm the same person at the same address. If I _have_
               | moved and accidentally order to the old house... well,
               | that's on me and I'd have to be pretty stupid (or hungry)
               | to do such a thing.
        
               | noir_lord wrote:
               | I mean they could do both.
               | 
               | Default to last postcode but display it on the
               | confirmation screen "Sending your Pizza to LE13 XYZ" -
               | amazon does it right (as you'd mostly expect)
        
               | alpaca128 wrote:
               | That happens to me even on YouTube. Every couple of
               | months I have to fix the language setting and switch back
               | to dark theme. What can't be fixed are Google's weird
               | attempts at auto-translating video titles from other
               | languages, making me click on videos that I simply cannot
               | understand. Don't do that, Google. Do full title,
               | thumbnail and video translation with 90% accuracy or
               | just...don't. Half-assed tech is easy but not helpful for
               | anyone.
        
             | chrisandchris wrote:
             | Personal opinion: Clear cookies every time you stop the
             | browser process, which basically is at least once a day.
             | 
             | Logs you out of all services. It's a nightmare as a user,
             | but I don't want to keep those tracking cookies around
             | longer than a single day.
        
               | jjav wrote:
               | Yes, for sure. I use cookie autodelete to remove all
               | cookies as soon as the tab is closed, so nothing lingers.
        
               | SamuelAdams wrote:
               | Firefox can do this automatically. Options > Privacy &
               | Security > check the box "Delete cookies and site data
               | when Firefox is closed".
               | 
               | Anything you intentionally want to retain can be added to
               | the exceptions list.
        
               | fsckboy wrote:
               | "Delete cookies and site data when Firefox is closed" has
               | always bothered me; it should be "Delete cookies and site
               | data when Firefox is opened", because abnormal
               | termination is not "closing", and you'd still want the
               | cookies deleted when you start up again.
        
               | motiejus wrote:
               | That way I'd lose the GDPR opt-outs from all the sites
               | I've gone through the process for.
               | 
               | Perhaps I should just accept the consent and remove
               | cookies regularly, instead of being careful with the
               | consent forms... Hmm, good idea.
        
             | nexuist wrote:
             | Is it really that hard to log in again? Using a password
             | manager it's basically two clicks to autofill the right
             | account and hit the submit button.
             | 
             | Login limits are a good thing. It means that the chances
             | of someone taking over your account and using it long term
             | are minimized.
        
               | Ennea wrote:
               | It is not. But a ton of sites seem to be inconsistent
               | about it, or it feels that way at least. Some sites I
               | frequent don't log me out for months at a time, and then
               | suddenly they do. And then they do it again a few weeks
               | later. Followed by multiple months of not doing it once
               | more.
        
         | TriNetra wrote:
         | Obviously if they are logging all actions of all users, and
         | there's some decent retention period, they can find out how
         | many people got unintended access and what they did with it.
         | Recently, I got advice from a C-level executive to
         | include such analytics in ASPSecurityKit [0], because that's
         | what companies are looking for these days. This GH incident
         | makes me consider his suggestion more seriously.
         | 
         | 0: https://ASPSecurityKit.net
        
           | abgr-y wrote:
           | This points to the last A in the AAA of security, which
           | stands for Authentication, Authorization & Accounting. The
           | AAA moniker is commonly used in reference to RADIUS or
           | Diameter (network protocols), but the concept is widely used
           | for software application security as well. Accounting covers:
           | what resources were accessed, at what time, by whom, and
           | what commands were issued?
        
       | WORMS_EAT_WORMS wrote:
       | Oof, my brain. With that giant fiasco master/slave thread the
       | other day, I glanced at "race condition" and "GitHub" and
       | immediately thought this was related to that.
        
       | naebother wrote:
       | cgi-bin for the win
        
       | beders wrote:
       | Another case where using immutable data by default would have
       | prevented any shenanigans. It also shows the highly entangled
       | relationship between state and complexity. (in this case, the env
       | object)
       | 
       | Immutable data forces you to design your system in a different
       | way and remove layers of complexity.
        
         | ed25519FUUU wrote:
         | How would that help here? They would just exchange one piece of
         | immutable data (the not-signed-in user's cookie) with another
         | piece of _wrong_ immutable data (someone else's session).
        
           | beders wrote:
           | A mutable object 'env' was shared across threads.
        
           | beders wrote:
           | "While the immediate risk has been mitigated, we worked with
           | the maintainer of Unicorn to upstream the change to make new
           | requests allocate their own environment hashes. If Unicorn
           | uses new hashes for each environment, it removes the
           | possibility of one request mistakenly getting a hold of an
           | object that can affect the next request."
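           | 
           | Roughly, the difference looks like this (a minimal Ruby
           | sketch, not Unicorn's actual code):
           | 
           |     # before: one Hash object is reused for every request
           |     # a worker handles; contents change, identity does not
           |     SHARED_ENV = {}
           | 
           |     def env_for_reused(request_headers)
           |       SHARED_ENV.clear
           |       SHARED_ENV.merge!(request_headers)
           |     end
           | 
           |     # after: a brand-new Hash per request, so a stale
           |     # reference held by a background thread only ever
           |     # sees the request it was created for
           |     def env_for_fresh(request_headers)
           |       request_headers.dup
           |     end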
        
         | brundolf wrote:
         | While in general I agree, in this case that aspect of the
         | problem entirely came down to "don't use object pools that span
         | multiple requests". That's a conscious optimization that was
         | made and then retracted, not a subtlety that accidentally
         | slipped in due to mutability itself.
        
           | beders wrote:
           | object pools == mutability (why else would you keep objects
           | around in a pool)
        
             | brundolf wrote:
             | Object pools require mutability; by no means does
             | mutability require you to use object pools
        
               | beders wrote:
               | Obvious.
               | 
               | But not what I meant. I meant that it was a conscious
               | decision to use an object pool and re-use objects.
               | 
               | That doesn't happen with immutable by default languages.
               | 
               | Ergo, these particular kinds of bugs just don't happen.
        
               | joebob42 wrote:
               | I guess the point here is that an object pool isn't the
               | default anyways: they've chosen to do something non-
               | default for performance, which you can still do in
               | immutable-by-default languages.
        
       | client4 wrote:
       | It would be super cool if they did more work to prevent brute
       | force account takeovers.
        
       | tropshop wrote:
       | Some time ago I dealt with this exact bug for an e-commerce site
       | with millions of requests per day, and it was also Rails +
       | Unicorn.
       | 
       | ANY time Ruby 1.8 raised a timeout, it was silently swallowed,
       | and Unicorn returned the adjacent request's session_id. We had to
       | constantly defend against any code that might hit the native
       | timeout.
        
         | artursapek wrote:
         | Lmao I am so glad Ruby/Unicorn/Rails are a distant memory in my
         | career at this point. So much magic and poorly defined behavior
        
           | imtringued wrote:
           | I don't write Ruby applications but I have never heard
           | anything good about Unicorn.
        
       | jacobparker wrote:
       | At D2L we have a large C# code base which gets deployed in a few
       | different subsets, but is largely a monolithic fleet of web
       | servers.
       | 
       | To prevent these kind of problems we have a few approaches, but
       | the main way is to prevent shared mutable state. To do this we
       | have a custom C# code analyzer (source here:
       | https://github.com/Brightspace/D2L.CodeStyle/tree/master/src... ,
       | but it's not documented for external consumption at this point...
       | ImmutableDefinitionChecker is the main place to look.) It goes
       | like this:
       | 
       |     [Immutable]
       |     public class Foo : Bar {
       |       // object is a scary field type, but "new object()"
       |       // is always an immutable value.
       |       private readonly object m_lock = new object();
       | 
       |       // we'll check that ISomething is [Immutable] here
       |       private readonly ISomething m_something;
       | 
       |       // Fine because lambdas in initializers can't close over
       |       // mutable state that we wouldn't otherwise catch.
       |       private readonly Action m_abc = () => {};
       | 
       |       // see the constructor
       |       private readonly Func<int, int> m_xyz;
       | 
       |       // danger: not readonly
       |       private int m_zzz;
       | 
       |       // danger: arrays are always mutable
       |       private readonly int[] m_array1 = new[]{ 1, 2, 3 };
       | 
       |       // ok
       |       private readonly ImmutableArray<int> m_array2
       |         = ImmutableArray.Create( 1, 2, 3 );
       | 
       |       public Foo( ISomething something ) {
       |         m_something = something;
       | 
       |         // ok: static lambdas can't close over mutable state
       |         m_xyz = static _ => 3;
       | 
       |         // danger: lambdas can in general close over
       |         // mutable state.
       |         int i = 0;
       |         m_xyz = x => { i += x; return i; };
       |       }
       |     }
       | 
       | Here we see that a type has the [Immutable] attribute on this
       | class, so we will check that all the members are readonly and
       | also contain immutable values (via looking at all assignments,
       | which is easy for readonly fields/properties). Additionally, we
       | will check that instances of Bar (our base class) are known to be
       | immutable.
       | 
       | Any class that extends Foo (using it as a base class) will be
       | required to be [Immutable] as well. There's a decent number of
       | aspects to this analysis (e.g. a generic type can be
       | "conditionally immutable" -- ImmutableArray<int> is, but
       | ImmutableArray<object> is not), check the source if you're
       | interested.
       | 
       | We require all static (global) variables be immutable, and any
       | "singleton" type (of which we have tens of thousands if I
       | remember correctly) also must be.
       | 
       | Another important thing we do is to cut off any deferred boot-
       | up code from accessing customer data (via a thread-local variable
       | that is checked before access through the data layer). This
       | prevents accidentally caching something for a particular customer
       | inside, say, a persistent Lazy<T> (or the constructor for a
       | singleton, etc.)
       | 
       | We've adapted our code base to be very strict about this over the
       | last few years and it discovered many obscure thread-safety bugs
       | and doubtlessly prevented many from happening since.
        
         | tybit wrote:
         | This is really cool, it seems similar in intent to Rust's
         | "shared or mutable, but not both" philosophy. I hope it spreads
         | to C# more generally somehow.
        
       | dvt wrote:
       | I love how everyone here on HN shits on Gab's mishap just because
       | they happen to host a lot of nonsense alt-right garbage and their
       | CEO is a weirdo, but everyone's totally cool with GitHub
       | _knowingly having inconsistencies across threads_:
       | 
       | > We initially thought this to be an internal reporting problem
       | only and that we would see some data logged for an otherwise
       | unrelated request from the background thread. Though
       | inconsistent, we considered this safe since each request has its
       | own request data and Rails creates a new controller object
       | instance for each request.
       | 
       | I mean this is like an SNL skit: "though inconsistent, we
       | considered this safe" -- inconsistency, no matter where it's
       | happening, _literally_ means your code isn't safe. Like, jeez...
       | this is GITHUB we're talking about. I mean, I give _money_ to
       | GitHub -- this is pretty serious and we should really stop with
       | the  "omg I love these detailed post-mortems!" stuff. A billion-
       | dollar company knowingly having code that's not thread safe in
       | their flagship product and going "eh, it's probably fine" is
       | bonkers to me.
        
         | reissbaker wrote:
         | I think there's a pretty big difference between an engineer
         | writing thread-unsafe code, testing and seeing inconsistencies
         | that (incorrectly) appear to not affect any external users,
         | shipping it, and then the security team finding the problem and
         | developing new processes to prevent this in the future -- vs
         | the CTO of Gab writing raw SQL using string concatenation,
         | shipping it without code review, and then trying to cover it up
         | afterwards.
         | 
         | No software maintained by a large team will be bug-free
         | forever, and that includes security bugs. The response to bugs
         | is what matters. In this case Github's response was quite
         | mature; in Gab's case, it wasn't.
        
           | colesantiago wrote:
           | > In this case Github's response was quite mature; in Gab's
           | case, it wasn't.
           | 
           | This is irrelevant.
           | 
           | There is no excuse for this, especially if you are a billion
           | dollar company.
        
             | djmips wrote:
             | The billions don't automatically scale development. You
             | know that. Development is still done at the human level.
             | But you still might be right that there is no excuse.
        
       | the_mitsuhiko wrote:
       | I think this issue is relevant for Python programmers as well.
       | uwsgi for many years shipped a default allocation mode for the
       | WSGI environment which just cleared it between requests. One had
       | to switch over to `wsgi-env-behaviour=holy` to prevent it. Holy
       | is now the default but many people still set it to cheap because
       | that prevented some segfaults for a while in uWSGI.
        
       | shay_ker wrote:
       | Brutal - Unicorn didn't use Rack.env in a thread safe way. It's
       | tougher given that this was in a widely used app-server and not
       | even in Rails code.
       | 
       | Impressed with GH folks in finding this, though I wonder how many
       | more of these types of bugs are out there in Ruby libs. It'd be
       | nice to make this more difficult to do, a la FP languages, by
       | making it really intentional to modify state. Maybe in Ruby 4??
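       | 
       | Ruby does have an opt-in version of that today, for what it's
       | worth (a tiny sketch): frozen objects raise if anything tries to
       | mutate them.
       | 
       |     env = { "PATH_INFO" => "/" }.freeze
       |     env["PATH_INFO"] = "/other"
       |     # => FrozenError: can't modify frozen Hash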
        
         | SilverRed wrote:
         | I feel like if you dig down deep enough, every single site on
         | the web has a serious security bug in its app code or the
         | layers it runs on top of. How many more bugs are there still in
         | the CPU your server runs on?
        
         | chucke wrote:
         | unicorn is single-threaded by design, and is not expected to
         | support any background threading workarounds that people build
         | on top of it. That "feature" in github shouldn't be passing the
         | env hash.
        
         | Thorrez wrote:
         | >Unicorn didn't use Rack.env in a thread safe way.
         | 
         | Was Unicorn itself not thread safe? It sounds like Unicorn had
         | the requirement that once you send back the response to the user
         | you must stop accessing the request object. Maybe Unicorn
         | didn't document that requirement clearly, but even if it's not
         | documented it's not clear to me that Unicorn itself is not
         | thread safe, but rather that it has undocumented pitfalls that
         | can easily lead to thread unsafety.
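         | 
         | The pitfall, roughly (a hypothetical sketch, not GitHub's or
         | Unicorn's actual code; Reporter is a made-up name):
         | 
         |     # a Rack-style app that defers error reporting to a thread
         |     def call(env)
         |       Thread.new do
         |         sleep 1                 # report "later"
         |         Reporter.notify(env["HTTP_COOKIE"])
         |         # by now the server may have recycled this env
         |         # object for a different user's request
         |       end
         |       [200, {}, ["ok"]]
         |     end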
        
       | pmarreck wrote:
       | > Because the Rack environment is the same object for all
       | requests
       | 
       | Who in the heck thought this would be a good idea? For the
       | record, back when Desk.com existed, I also saw strange session
       | issues that were probably complex and irreproducible race
       | conditions.
       | 
       | Note that in Elixir/Phoenix, this class of bug is _literally
       | impossible_, thanks to immutability-all-the-way-down. Turing
       | bless it.
        
       | makkesk8 wrote:
       | How do you even have logs for something like this? I feel like
       | the log level to see the raw cookie responses would be a no go
       | since you'd easily have hundreds of gigabytes of logs every day.
        
       | alexeiz wrote:
       | > We initially thought this to be an internal reporting problem
       | only and that we would see some data logged for an otherwise
       | unrelated request from the background thread.
       | 
       | So they already conceded that their code is fundamentally not
       | thread-safe and they decided to let it slide.
        
         | macintux wrote:
         | Yes, but in fairness, we all make tradeoffs to get more done in
         | less time. It's much easier to justify throwing your top
         | engineers at a race condition for a few days when you _know_
         | it's placing security at risk.
        
         | nexuist wrote:
         | What is the alternative? Throw the project away and start over?
        
       | tyingq wrote:
       | I may be missing something here. I do see that the patch[1] now
       | creates a new request object, and thus a new "env" Ruby hash for
       | each request.
       | 
       | But I don't see the old behavior described as _"allocates one
       | single Ruby Hash that is then cleared (using Hash#clear) between
       | each request"_... even after poking around in http_request.rb and
       | other places. It reads like the pre-patched version would just
       | re-use the hash as-is, only overwriting/adding if it saw a new
       | key or value, but not deleting any keys.
       | 
       | [1] https://yhbt.net/unicorn-
       | public/66A68DD8-83EF-4C7A-80E8-3F1F...
        
         | lstamour wrote:
         | Maybe here?
         | https://github.com/defunkt/unicorn/blob/2c347116305338710331...
         | 
         | Admittedly, that's an older copy, I'd have to look up a newer
         | copy of the source to see if that function or a similar one
         | still exists.
         | 
         | The change was made in Unicorn v6 according to
         | https://yhbt.net/unicorn/NEWS.html so I'd have to find code
         | earlier than that, maybe.
         | 
         | https://yhbt.net/unicorn.git/tree/ext/unicorn_http/unicorn_h...
         | looks like it's still there.
         | 
         | If I'm reading this right, the init function in the above
         | unicorn_http.rl function is what's referred to as
         | HttpRequest#new thanks to
         | https://yhbt.net/unicorn.git/tree/lib/unicorn/http_request.r...
         | but I might be misreading this.
         | 
         | Actually... compared to uWSGI's source code, this is quite
         | pleasant to read, warts and all. :D
        
           | tyingq wrote:
           | Ahh, yep. I hadn't thought to look at Ragel source.
        
        
       | __alexs wrote:
       | uWSGI had a similar bug for a long time that was only fixed by
       | accident and was never reported as a security issue. Concurrency
       | is hard I guess.
        
       | boltzmann_brain wrote:
       | All race conditions are rare ;P
        
       | teraku wrote:
       | This happened to a student colleague of mine in I think 2014. We
       | were working on uni projects and when he logged in to check the
       | repository, he was suddenly logged in as another user, and we
       | were all really surprised.
        
       | ibraheemdev wrote:
       | Funny, I don't believe I was auto-logged out on March 8th.
        
       | [deleted]
        
       | lambda_obrien wrote:
       | Always log important events with a timestamp and save them as
       | long as possible.
       | 
       | It saved me more than once in similar situations with several
       | threads.
        
         | oars wrote:
         | Wow, impressive!
        
           | lambda_obrien wrote:
           | I'm guessing that's sarcasm, I edited my comment.
        
         | gowld wrote:
         | > Always log important events with a timestamp and save them as
         | long as possible.
         | 
         | Except when those events are associated with private user data
         | or behavior.
        
           | lambda_obrien wrote:
           | The personal data itself should not be logged but the event
           | itself should be logged with lots of details that aren't
           | private info.
        
             | [deleted]
        
           | closeparen wrote:
           | Nope, _especially_ in those cases. The more sensitive the
           | data, the more important it is to have records of who
           | accessed or changed it and when.
        
           | temp667 wrote:
           | Huh? I want my bank to be VERY clear on when I transferred
           | money for my home closing (and keep that record). They need
           | to keep records and the browser used, IP used, authentication
           | flow etc. Why does this have to be thrown away?
        
       | ghiculescu wrote:
       | Anyone know if puma would be susceptible to the same bug or if
       | it's unicorn specific?
        
         | jakswa wrote:
         | This is what I'm going to be digging for later... I'm wondering
         | if puma also shares `env` objects across requests.
        
           | Freaky wrote:
           | Each server does share a "proto env" object, but it is
           | duplicated for each Client:
           | 
           | https://github.com/puma/puma/blob/0cc3f7d71d1550dfa8f545ece8.
           | ..
           | 
           | and Client objects are not reused:
           | 
           | https://github.com/puma/puma/blob/0cc3f7d71d1550dfa8f545ece8.
           | ..
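           | 
           | In other words (a rough sketch of the pattern described
           | above, not Puma's actual code):
           | 
           |     # one template Hash shared by the server...
           |     PROTO_ENV = { "SCRIPT_NAME" => "" }
           | 
           |     # ...but every request works on its own copy, so nothing
           |     # from one request can leak into the next
           |     def env_for(client_headers)
           |       PROTO_ENV.dup.merge!(client_headers)
           |     end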
        
         | vemv wrote:
         | It would not trigger the same bug.
         | 
         | But since it's a multi-threaded server, it will exercise
         | arbitrary code (Rails core, your app, depended-on gems) in a
         | concurrent manner.
         | 
         | This can unmask race conditions that wouldn't manifest
         | themselves under a single-threaded environment.
         | 
         | ...that's been my experience with Puma - brilliant server in
         | itself, somewhat risky approach when considering the Ruby
         | ecosystem.
        
         | rantwasp wrote:
         | nope. puma does the right thing. this is exactly what i looked
         | at when i saw unicorn in the article
        
           | ksec wrote:
           | I thought Github at one point looked at Puma and decided not
           | to switch over. I wonder if they will reconsider that now.
        
             | rantwasp wrote:
             | to be fair, switching over is probably non-trivial and/or
             | risky
        
       | rantwasp wrote:
       | food for thought: if they had used puma (which is sort of what
       | people have been using since 2017 when it comes to ruby/rails),
       | would this bug have still existed?
        
       | pianoben wrote:
       | This is some fantastic debugging. I love war stories like this!
       | They give me confidence that the world is full of mere mortals
       | like myself, and we can do great things nonetheless.
        
       | stephenr wrote:
       | > One such performance improvement involved moving the logic to
       | check a user's enabled features into a background thread that
       | refreshed on an interval rather than checking their enablement
       | while the request was being processed.
       | 
       | This type of thing is why PHP's shared-nothing model is vastly
       | undervalued IMO.
       | 
       | Does PHP have the ability to use threads if you really need it?
       | Yes.
       | 
       | Are they generally a PITA to use? Yes.
       | 
       | Is this a problem for the vast, vast, _vast_ majority of
       | projects? Not in the slightest.
       | 
       | Having a separate invocation of the runtime for each request is a
       | feature, not a failing.
        
         | joshdifabio wrote:
         | Not for much longer!
         | 
         | https://externals.io/message/113419#113555
        
       | aquarese wrote:
       | Thanks
        
       | bvanderveen wrote:
       | Well that's embarrassing. Passing a mutable dictionary--
       | containing callbacks that could be closing over heaven-even-
       | knows-what, no less!--between threads? And it's re-used between
       | requests?
       | 
       | They are lucky this wasn't much, much worse.
       | 
       | Why does the error reporting service need to call the web thread
       | back anyway?
       | 
       | Yet another case where immutable-everything would have prevented
       | this hairball from accreting in the first place.
        
         | wfleming wrote:
         | Unicorn isn't GitHub's code, though. So the object-reuse that
         | was the root cause here wasn't their code, and treating your
         | webserver as a working black box is pretty typical (right up
         | until you hit an edge case like this).
         | 
         | And Unicorn is pretty old. It wouldn't surprise me if this
         | object reuse dates back to mongrel's code base, back when doing
         | _anything_ with threads in Ruby would have been unusual.
         | 
         | It's still bad code on unicorn's part (when I got to that part
         | of the post I said "Unicorn does WHAT?" out loud), and GH
         | should have been more careful introducing threads to an
         | environment that didn't have concurrency before, but every step
         | of the chain is pretty understandable IMHO.
        
         | strictfp wrote:
         | I agree, and the bug in the report is likely not the only one
         | caused by these changes.
        
         | andix wrote:
         | > Well that's embarrassing
         | 
         | It probably wasn't that obvious to see in the code.
         | 
         | But if they had used immutable objects, or at least pure
         | functions with scalar parameters, such an error would never
         | have occurred in the first place.
        
         | arkadiyt wrote:
         | > Well that's embarrassing.
         | 
         | An IDOR is embarrassing - this was a complex bug and Github's
         | investigation/response was great here. I'd be proud if my
         | company handled a security incident like this.
        
           | brainzap wrote:
           | why the acronym
        
             | alpaca128 wrote:
             | For the people as confused as me, it stands for "Insecure
             | Direct Object Reference" and not "International Day of
             | Radiology" or "Iowa Department of Revenue".
        
           | arghwhat wrote:
           | Creating unnecessary complexity that should have been known
           | ahead of time to be fragile, and then hitting the resulting
           | issues, is embarrassing.
           | 
           | The follow-up report is good, but it doesn't matter when the
           | conclusion is "our footguns backfired".
        
             | renewiltord wrote:
             | As someone who has never added incremental complexity to
             | software until it manifested bugs, I agree.
        
           | yarg wrote:
           | It can be both.
        
         | matthewaveryusa wrote:
         | Never underestimate the inertia of working code that makes
         | money.
        
       | benmmurphy wrote:
       | This post mortem and changes to unicorn should be useful for a
       | lot of other ruby shops. I suspect github aren't the only people
       | who are not defensively copying rack data when sharing it with
       | background threads.
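       | 
       | "Defensively copying" here could look something like this (a
       | minimal sketch; ExceptionReporter is a hypothetical stand-in):
       | 
       |     def report_later(env)
       |       # copy only what the reporter needs, as plain new objects,
       |       # instead of handing the live env Hash to another thread
       |       snapshot = {
       |         path:   env["PATH_INFO"].to_s,
       |         method: env["REQUEST_METHOD"].to_s,
       |       }
       |       Thread.new { ExceptionReporter.notify(snapshot) }
       |     end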
        
       | rtpg wrote:
       | I really like Python, but the difficulties in using the same
       | process for multiple requests really get me (and I think this bug
       | highlights a similar sort of difficulty with Ruby).
       | 
       | There are of course safe ways to do all of this kind of stuff,
       | but there are _so many_ unsafe ways to do things, and
       | accidentally share things across requests. At least in the Django
       | world I think ASGI is going to make it more likely that stuff is
       | done the right way but there's a big learning curve there.
        
       | jrockway wrote:
       | Does Ruby have a data race detector, and would a race detector
       | have caught this unexpected sharing?
        
         | rantwasp wrote:
         | this is not Ruby. This is unicorn doing some dumb stuff
        
       | thecodemonkey wrote:
       | Really interesting write-up. Curious how they do logging, and how
       | they were able to get such a detailed look at the previous
       | requests, including the HTTP response. Wouldn't it be a massive
       | amount of data if all HTTP requests/responses were logged? (Not
       | to mention the security implications of that)
        
         | firebaze wrote:
         | As expensive as it is, datadog provides the required level of
         | insight via, among other drill-down methods, detailed
         | flamegraphs, to get to the bottom of problems like this one.
         | 
         | No, not affiliated with datadog, and not convinced the
         | cost/benefit ratio is positive for us in the long run.
        
           | nullsense wrote:
           | We've implemented it recently and it's helped us
           | tremendously.
        
         | ckuehl wrote:
         | Purely a guess, but I've sometimes been able to work backwards
         | from the content length (very often included in HTTP server
         | logs) to figure out what the body had to be.
        
         | bob1029 wrote:
         | It would indeed be a massive amount of data, but bear in mind
         | that only a small fraction of HTTP requests hitting GitHub are
         | actually both authenticated and attempting to change the state
         | of the system. Most requests are unauthenticated read-only and
         | cause no state changes in GitHub.
        
           | vlovich123 wrote:
           | It's a reasonable theory, but step one required to trigger
           | this bug is an unauthenticated request. It's unclear if their
           | logs indicated that, or if they got lucky during repro and
           | someone thought to try an unauthenticated request first when
           | back-to-back authenticated requests didn't repro.
        
             | whimsicalism wrote:
             | From the writeup, it _sounds_ like the latter, but we
             | obviously can't be sure.
        
             | aidos wrote:
             | Could be that there's some telltale sign in the logs.
             | 
             | We recently fixed a bug with incomplete requests by
             | noticing in nginx logs that the size of the headers + the
             | size of the body was the size of the buffer of a linux
             | socket.
        
         | andix wrote:
         | I guess the incident happened more often than they wanted to
         | tell us. So as a first step they probably turned up the logging
         | and waited for new reports...
        
         | kevingadd wrote:
         | It's pretty common to do exhaustive logging even if the data is
         | enormous. It's valuable for debugging and security
         | investigations. As far as the security of storing log data, you
         | need log messages to not contain PII regardless of how long you
         | retain them.
         | 
         | It's not uncommon to see companies talk about processing
         | terabytes or even petabytes of log data per day. In the end the
         | storage isn't that expensive [1], and you can discard
         | everything after a little while.
         | 
         | [1] Assuming of course that any company generating a petabyte
         | of logs every day is doing so because they have an enormous
         | number of customers, AFAIK the cost of storing 1PB of logs on
         | something like Glacier is in the 100-300k USD/yr range.
         | Absolutely a tolerable expense for the peace of mind you get by
         | knowing you can go back in 2 months and find every single
         | request that touched a security vulnerability or hunt down the
         | path an attacker took to breach your infrastructure.
        
         | xvector wrote:
         | Distributed tracing with a small sample size or small
         | persistence duration?
        
           | dmlittle wrote:
           | Given the rarity of this bug I doubt a small sample size
           | would have captured the right requests to be able to debug
           | this. The probability would be the probability of this bug
           | occurring * (sample percentage)^3 [ you'd need to have
           | sampled the correct 3 consecutive requests].
        
         | SilverRed wrote:
         | Yes, it is a large amount of data, but it's worth paying to
         | store it because it lets you go back in time and work out wtf
         | happened, which will save you far more hours and solve more
         | problems, making it worth it.
        
       | mtnygard wrote:
       | I'm looking forward to learning about how Rust makes this kind of
       | bug impossible.
        
         | vlovich123 wrote:
         | It doesn't. The issue here seems like a more vanilla logical
         | data race + object reuse that Rust can't really do anything
         | about. Rust only makes sure that only one thread at a time
         | writes and other reads don't race that write. It can't make
         | guarantees that the relative ordering of reads and writes makes
         | sense for your use-case.
         | 
         | It can simplify certain abstractions of course so that you can
         | provide easier tools for devs to avoid footguns.
        
           | cogman10 wrote:
           | What rust would do for this particular situation is make it
           | painful to share an object like this across threads and make
           | it obvious that such a share is happening. That wouldn't
           | prevent this problem, but it would make it harder to stumble
           | into.
        
             | vlovich123 wrote:
             | Maybe, maybe not. A claim like that is definitely in the
             | realm of "people think race condition bugs are solved by
             | switching to Rust". I don't know Rust well enough to
             | evaluate your claim but that feels like an incorrect
             | statement - sharing data between threads should be as easy
             | as wrapping something in Arc & Mutex to get similar
             | semantics I think.
             | 
             | Even if that particular bug is actually harder to stumble
             | into, there's no guarantee the space of thread
             | synchronization issues one might stumble into is inherently
             | smaller for Rust than for Ruby, so all you're doing is
             | exchanging one class of failure modes for another.
        
               | cogman10 wrote:
               | As I said, they aren't solved but they are hard to do
               | unknowingly.
               | 
               | In most other languages, sharing mutable data across
               | threads is unprotected. You are lucky if you get a
               | compiler warning. Rust disallows that completely. You
               | cannot share mutable data across threads without first
               | wrapping that data in thread safety constructs.
               | 
               | That's what happened here. A mutable variable was shared
               | between threads and updated from different threads. Rust
               | makes it obvious that this variable is accessed from
               | multiple threads even if it doesn't prevent 2 threads
               | from writing to the variable at wrong times. That's where
               | it is a partial solution to the problem.
               | 
               | Sharing is as easy as wrapping the thing in a Mutex, and
               | if you are a programmer that sees something wrapped in a
               | mutex you should immediately start thinking of the thread
               | implications. You don't have to, but you should.
               | 
               | That mutex follows the variable around anywhere you want
               | to start changing it.
               | 
               | In a language like C or Ruby, that's not something that
               | is patently obvious. Shared variables aren't wrapped with
               | anything. It's perfectly understandable for someone to
               | not realize this bit of shared state is mutated across
               | threads.
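               | 
               | For comparison, a minimal Ruby sketch of that last point:
               | nothing in the language marks the variable as shared, so
               | the race is invisible at the call site.
               | 
               |     counter = 0
               |     threads = 4.times.map do
               |       # += is a read-modify-write; increments can be lost
               |       Thread.new { 100_000.times { counter += 1 } }
               |     end
               |     threads.each(&:join)
               |     # a Mutex#synchronize wrapper fixes it, but only by
               |     # convention; nothing in Ruby enforces it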
        
       | bredren wrote:
       | Since this was essentially a potential organizational
       | administrator account takeover security vulnerability, I can see
       | why this got a write up. And as others have said, it's
       | impressive.
       | 
       | Github Actions has had many problems that affect many people this
       | year alone. While the missed deadlines and stress for people
       | can't be quantified by Github, they are happening.
       | 
       | I'd urge Github to take ongoing service interruptions in Github
       | Actions more seriously. I hope Github will take a stronger stance
       | on providing customer communication on migration timelines
       | between Packages and Container Registry.
       | 
       | Just because a bug or service interruption doesn't risk random
       | user session exposure doesn't mean it isn't important to
       | communicate about and describe how the business is working to
       | prevent it from happening again.
        
       | TedShiller wrote:
       | "Rare"? How about "catastrophic"? Is this written by PR people?
        
       | tcgv wrote:
       | > One such performance improvement involved moving the logic to
       | check a user's enabled features into a background thread that
       | refreshed on an interval rather than checking their enablement
       | while the request was being processed
       | 
       | It's never a good idea to perform multithreaded operations within
       | the web server, unless you have a strong knowledge of its
       | internal workings. For instance, if you mismanage threads in an
       | IIS web application you can end up having your web requests
       | queued, since the web server has by default a fixed number of
       | worker threads to serve web requests.
       | 
       | From their bug description it's clear that Ruby on Rails is no
       | exception to this rule. In cases where background processing is
       | required due to some condition triggered by a web request, it's
       | best to stick with a proper queue system [1]
       | 
       | [1] https://thomasvilhena.com/2019/07/using-queues-to-offload-
       | we...
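       | 
       | In Rails that usually means something like ActiveJob (a minimal
       | sketch; the job and service names here are hypothetical):
       | 
       |     class RefreshEnabledFeaturesJob < ApplicationJob
       |       queue_as :default
       | 
       |       # runs in a separate worker process; only a plain id
       |       # crosses the boundary, never live request objects
       |       def perform(user_id)
       |         FeatureFlags.refresh_for(User.find(user_id))
       |       end
       |     end
       | 
       |     # in the controller:
       |     RefreshEnabledFeaturesJob.perform_later(current_user.id)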
        
       | skytreader wrote:
       | > On March 2, 2021, we received a report via our support team
       | from a user who, while using GitHub.com logged in as their own
       | user, was suddenly authenticated as another user.
       | 
       | Sobering thought for me is how probable it is that a similar
       | issue might be present in other platforms. The title calls it
       | "rare" but given (a) the number of complex-enough web/app
       | platforms in use today, (b) just the endless number of ways/infra
       | permutations for "doing x", and (c) the size of a typical
       | company's codebase and dependencies, could it be a more common
       | occurrence than we think?
       | 
       | Total anecdote, up for you to believe: Back in April 2016, one of
       | my then-housemates bought a brand-new Macbook Air. A week into
       | his ownership, he exclaims surprise that he is logged-in to
       | another Facebook account, someone we are totally not acquainted
       | with. I borrow his MBA and try to investigate but, alas, I lack
       | expertise for it. The only relevant thing I can eke out is that
       | they both went to the same event a week or so ago and probably
       | shared a Wifi AP, where a "leak" could've happened. Or maybe a
       | hash collision in FB's side?
       | 
       | I would've reported it to FB but then I don't have enough
       | details; with the two conjectures I have, I'm basically holding
       | water with a sieve. But now the thought that someone I totally
       | don't know might end up logged-in in my FB account is a
       | possibility that bothers me. And I have no idea how to prevent it
       | other than minimizing what could be compromised (and no, "Get off
       | FB!" just isn't in the cards, I'm sorry).
       | 
       | Kudos for Github for investigating and fixing this issue. Another
       | thought that occurred to me while writing this is how many users
       | of other platforms might be filing issues like this but can't
       | provide enough detail and so their reports are eventually marked
       | as "Could Not Reproduce". Not exactly leaky sessions but just
       | general security mishaps that they would struggle to describe to
       | support.
        
         | sellyme wrote:
         | > The only relevant thing I can eke out is that they both went
         | to the same event a week or so ago and probably shared a Wifi
         | AP, where a "leak" could've happened. Or maybe a hash collision
         | in FB's side?
         | 
         | Given two users in extremely close geographic proximity, the
         | chances of them having manually logged in on the device
         | themselves with your housemate either forgetting or not knowing
         | are much much higher than the chances of an authentication
         | exploit in the world's most popular site that would have gone
         | undiscovered for a subsequent five years at this point.
         | 
         | Probably somewhere between the two (but closer to the "they
         | just logged in on that computer" side of things) is the chance
         | of the LAN being some eldritch abomination that just
         | occasionally serves responses to the wrong local IPs, but
         | that's not something any online service needs to care about.
        
           | mschuster91 wrote:
           | > is the chance of the LAN being some eldritch abomination
           | that just occasionally serves responses to the wrong local
           | IPs
           | 
           | This could have worked in ye olde plain-HTTP days, but with
           | HTTPS this _cannot_ reasonably have happened.
        
         | [deleted]
        
         | robindebaets wrote:
         | I once had the exact same thing happen on PayPal. I logged into
         | my account, but was greeted by the information of another user.
         | I immediately logged out. I never reported it and it never
         | happened again.
        
         | exdsq wrote:
         | Don't sell yourself short, I'm sure you're more than
         | experienced enough to get your own MBA!
        
         | ficklepickle wrote:
         | If it were a hash collision, I would think the odds of it
         | involving two users geographically close to each other would be
         | quite small.
         | 
         | That's a weird one, thanks for sharing it!
        
           | tcgv wrote:
           | Not if you consider that FB servers are somewhat
           | geographically partitioned to optimize response times
        
         | vanviegen wrote:
         | That sounds like the event was forcing the use of an HTTP proxy
         | server with over-eager caching. But in 2016, that would be a
         | bit anachronistic..
        
         | thraway123412 wrote:
         | Also on GOG:
         | https://www.gog.com/forum/general/warning_massive_security_r...
         | 
         | > Hello everyone! This is forum regular fronzelneekburm,
         | posting from the account of some poor Chinese guy that gog
         | randomly logged me into. I'll explain the situation in further
         | detail in another post from my actual account once I logged out
         | of this one.
        
           | alickz wrote:
           | Happened on Steam too a couple of years ago, people being
           | logged into other people's accounts.
        
             | efreak wrote:
             | My understanding was that was a caching issue and you
             | couldn't actually do anything. Unless maybe I'm thinking of
             | the wrong thing?
        
         | BurningFrog wrote:
         | > _one of my then-housemates bought a brand-new Macbook Air_
         | 
         | My most fun theories are
         | 
         | 1. Somewhere in the supply chain, someone logged in to their FB
         | account on this machine.
         | 
         | 2. During the first week he owned it, someone logged in to
         | their FB account on this machine.
         | 
         | 3. Your ex-housemate lied about this. Crazy people are rare,
         | but _MUCH_ more common than likely bugs that could cause this.
        
         | deckard1 wrote:
         | > The title calls it "rare"
         | 
         | I believe they are referring to the rarity of hitting this bug
         | during the period of time it was broken, and not the rarity of
         | the bug in general.
         | 
         | People were speculating on this here:
         | https://news.ycombinator.com/item?id=26395138, and I even left
         | a comment about this type of bug, but on Node. This category of
         | bug, where session data leaks across requests, is actually
         | incredibly common.
        
         | wholien wrote:
         | The earlier report[1] says:
         | 
         | > The underlying bug existed on GitHub.com for a cumulative
         | period of less than two weeks at various times between February
         | 8, 2021 and March 5, 2021. Once the root cause was identified
         | and a fix developed, we immediately patched GitHub.com on March
         | 5. A second patch was deployed on March 8 to implement
         | additional measures to further harden our application from this
         | type of bug. There is no indication that other GitHub.com
         | properties or products were affected by this issue, including
         | GitHub Enterprise Server. We believe that this session
         | misrouting occurred in fewer than 0.001% of authenticated
         | sessions on GitHub.com.
         | 
         | If we use the 0.001% figure and assume all 56 million of their
         | users are authenticated, then this occurred with 560 sessions
         | at most.
         | 
         | You are right that something which only happens to a
         | minuscule % of users is hard to investigate and repro. But at
         | the same time something as critical as "I logged in and was
         | authenticated as another user" should probably be immediately
         | escalated to the highest levels within your company.
         | 
         | 1: https://github.blog/2021-03-08-github-security-update-a-
         | bug-...
        
         | abdupo wrote:
         | So I am one of the users who reported this. I quickly filed a
         | support ticket but then I also searched my network to find
         | someone on GH's security team to email. I'm not sure if support
         | or the security person I emailed escalated it first.
        
           | nodesocket wrote:
           | Did you get any bounty?
        
           | codethief wrote:
           | I recently reported a phishing site to GitHub which looked
           | exactly like GitHub.com and GitHub responded _within a day_.
           | So maybe your escalation wouldn't even have been necessary.
           | :)
        
       | benlivengood wrote:
       | This is the kind of bug that makes me leery of frameworks and
       | tall application stacks. Having to trust more than one layer of
       | multithreaded code between an application and the kernel gets
       | riskier with every layer.
       | 
       | Also another reason not to mix the main application code with the
       | authentication/authorization code. Setting security cookies
       | should be done by dedicated code in a dedicated process to avoid
       | this kind of interaction, and the load-balancer layer should
       | filter the setting of security cookies from non-security backends
       | to enforce it.
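       | 
       | As a sketch of that filtering idea (illustrative only: the
       | cookie name and the WSGI framing here are my assumptions, not a
       | description of GitHub's setup), the edge layer could drop
       | session Set-Cookie headers coming from any backend other than
       | the dedicated auth service:
       | 
       |   SESSION_COOKIE = "user_session"  # assumed cookie name
       | 
       |   def is_session_cookie(name, value):
       |       return (name.lower() == "set-cookie" and
       |               value.startswith(SESSION_COOKIE + "="))
       | 
       |   def strip_session_cookies(app, backend_is_auth):
       |       """WSGI middleware for the edge/proxy layer."""
       |       def middleware(environ, start_response):
       |           def sr(status, headers, exc_info=None):
       |               if not backend_is_auth:
       |                   headers = [h for h in headers
       |                              if not is_session_cookie(*h)]
       |               return start_response(status, headers, exc_info)
       |           return app(environ, sr)
       |       return middleware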
        
       | bkartal wrote:
       | It is scary to think this kind of thing happens in banking
       | systems too.
        
       | gizmo wrote:
       | There is a lesson here: the way we share memory between threads
       | is implicit and fundamentally broken. Adding extra threading to
       | improve performance is clearly the wrong tradeoff when you're in
       | an environment where threads can introduce bugs like these. When
       | it takes 10 senior engineers to figure out why one user session
       | gets data from another session, you've already lost.
       | 
       | Shared memory environments are very unsafe and we should be
       | working towards a memory model where data that belongs to
       | different organizations can never get mixed. This isn't
       | impossible, but it requires new programming languages that allow
       | us to express which memory areas can't get mixed up.
       | 
       | We've got to start designing our backend infrastructure
       | differently such that this entire class of memory errors can't
       | ever occur.
        
         | Cloudef wrote:
         | It's called separate processes, or unshare() after creating a
         | new thread.
        
         | lstamour wrote:
         | In many languages and web frameworks, the assumption is that
         | the application server will do this for you. For example, by
         | issuing a new env object with each request so that requests
         | can't easily share memory with one another, though as noted
         | that might not stop background threads from retaining such
         | memory.
         | 
         | In other cases, you might assume containers would be enough
         | separation, or maybe different databases (sharding, for
         | example). One approach I've considered but not yet implemented
         | is to take end-user JWT cookies and use them to authenticate to
         | a database such that a user can only ever see the rows they
         | should be allowed access to, and all other SQL queries would
         | fail.
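         | 
         | Roughly (a sketch of that idea, not something I've shipped:
         | the documents table, owner_id column and app.user_id setting
         | are made-up names, JWT verification itself is elided, and it
         | assumes the app connects as a role that doesn't own the
         | table), it could be Postgres row-level security keyed on a
         | per-transaction setting holding the user id taken from the
         | already-verified JWT:
         | 
         |   # one-time schema setup (SQL), run once as the table owner:
         |   #   ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
         |   #   CREATE POLICY owner_only ON documents
         |   #       USING (owner_id =
         |   #              current_setting('app.user_id')::bigint);
         | 
         |   import psycopg2
         | 
         |   # conn = psycopg2.connect(dsn)  # non-owner role, so the
         |   #                               # policy actually applies
         | 
         |   def fetch_documents(conn, user_id):
         |       with conn, conn.cursor() as cur:
         |           # the setting lasts only for this transaction
         |           cur.execute(
         |               "SELECT set_config('app.user_id', %s, true)",
         |               (str(user_id),))
         |           cur.execute("SELECT id, title FROM documents")
         |           return cur.fetchall()  # only this user's rows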
         | 
         | But no matter how you implement a security control, it's
         | possible to make mistakes and allow others to see data you
         | didn't intend to share with them. Even Rust, famous for its
         | explicit memory ownership model, could still have similar bugs,
         | they just might be a bit more obvious to spot in some session
         | connection pool implemented incorrectly, for example, or maybe
         | exposed via a compiler optimization gone wrong. Code review
         | will still likely be one of the primary defenses of this kind
         | of bug, though hopefully better tooling and language support to
         | detect race conditions or shared ownership warnings will help
         | over time?
        
         | DougWebb wrote:
         | Replace Rails and similar server-side frameworks with CGI
         | scripts, and you're done. Every request is a new process with
         | its own memory space.
         | 
         | Yes, performance is an issue with that old approach. Around the
         | turn of the century I was responsible for a Perl web
         | application that ran as a CGI script, and to speed it up I
         | wrote my own Perl httpd server that had my web application code
         | pre-loaded. Then for each request I forked a new process, which
         | carried over all of the pre-loaded code and already-running
         | perl.exe, so there was very little startup time (compared to a
         | normal CGI script startup.) I was careful to take full
         | advantage of the operating system's Copy-on-write memory
         | semantics, so that the only memory that had to be allocated for
         | the new process was Perl's runtime stack. All of the memory
         | containing the application code was shared across processes
         | because it never got written to.
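         | 
         | In today's terms the trick looks roughly like this (a
         | simplified, POSIX-only Python sketch, not the original Perl,
         | and it leaves out the pre-loading of real application code
         | that made the copy-on-write part worthwhile):
         | 
         |   import os, signal, socket
         | 
         |   def handle(conn):  # stand-in for the pre-loaded app
         |       conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nhello\n")
         | 
         |   def serve(port=8080):
         |       # let the kernel reap exited children automatically
         |       signal.signal(signal.SIGCHLD, signal.SIG_IGN)
         |       listener = socket.create_server(("", port))
         |       while True:
         |           conn, _addr = listener.accept()
         |           if os.fork() == 0:  # child: one request, then exit
         |               listener.close()
         |               handle(conn)
         |               conn.close()
         |               os._exit(0)     # child memory is discarded here
         |           conn.close()        # parent just keeps accepting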
         | 
         | This web application is still running today, though I left the
         | company in 2010. Back then, it was handling about six million
         | requests per day, about 75% of which was during US daytime
         | working hours. So, around 160 requests/second, spread across
         | five or six servers that were getting old in 2010.
         | 
         | The modern frameworks have taken the same concept a step
         | further, and replaced the process fork with a thread pool.
         | That's faster, but it allows memory leaking in a way process
         | forking does not. You have to go out of your way to share
         | memory between processes, but with threads it's easy to do
         | accidentally.
        
           | hu3 wrote:
           | Workerman [1] uses a similar approach to put PHP pretty high
           | on TechEmpower [2]. Especially now with PHP8's JIT.
           | 
           | [1] https://github.com/walkor/Workerman
           | 
           | [2] https://www.techempower.com/benchmarks
        
       | wicket wrote:
       | It doesn't seem like they took this security incident very
       | seriously. It's a pretty serious bug that was reported by two
       | different users within a few hours. It could have been abused by
       | others, but it doesn't look like it occurred to them to roll back
       | recent changes whilst they investigated the problem.
        
       | vemv wrote:
       | This is a good example of why anything related with threading
       | makes me nervous when using Ruby. When a language is mutable and
       | thread-unsafe by default, keeping the whole stack sane at all
       | times seems prohibitively arduous. One cannot code-review a whole
       | ecosystem.
       | 
       | I believe that this could be addressed with an architecture that
       | favored single-threaded processes that can handle a single
       | request at a time. Said processes would be numerous, have a
       | (best-effort) low memory footprint, and be pooled under some
       | master process written in a concurrent/performant manner.
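       | 
       | Concretely, that shape is roughly (a minimal Python sketch;
       | worker supervision, restarts and the real request handling are
       | all elided, and the names are made up):
       | 
       |   import os, socket
       | 
       |   def worker(listener):
       |       while True:                      # single-threaded:
       |           conn, _ = listener.accept()  # one request at a time
       |           conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nok\n")
       |           conn.close()
       | 
       |   def master(nworkers=8, port=8080):
       |       listener = socket.create_server(("", port))
       |       for _ in range(nworkers):        # pre-fork the pool
       |           if os.fork() == 0:
       |               worker(listener)         # child never returns
       |       os.wait()                        # supervision elided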
       | 
       | The result wouldn't necessarily be pretty or resource-efficient
       | (especially in terms of RAM), but it can also be seen as the
       | cost of using Ruby safely.
       | 
       | An org like GH almost certainly can afford using more/beefier
       | servers if it means that entire classes of bugs will go away.
        
         | amarshall wrote:
         | > when using Ruby
         | 
         | To be fair to Ruby, most languages are largely mutable and
         | without any strong thread-safety guarantees. Does that make it
         | okay? No, but it's far from a Ruby-specific problem.
         | 
         | > I believe that this could be addressed with an architecture
         | that favored single-threaded processes that can handle a single
         | request at a time.
         | 
         | Unicorn (the application web server GitHub is using) does
         | follow this model. GitHub added their own additional thread
         | within that process that broke that design in an unforeseen
         | way.
        
       | mcherm wrote:
       | I just want to make clear that for me, write ups containing this
       | level of detail analyzing the error that occurred give me greater
       | confidence in the company rather than less confidence. Greater
       | even than if I had never heard of the original bug. There are
       | many bugs I have never heard about, but any company that uses
       | this level of care and openness is likely to be very reliable.
        
         | TedShiller wrote:
         | This is such a n00b mistake though, the "greater confidence" is
         | quickly canceled out. I end up having less confidence in this
         | company.
        
           | ShroudedNight wrote:
           | I think it's more a statement of the quality and clarity of
           | the postmortem that this problem appears so obvious in
           | hindsight.
        
           | SilverRed wrote:
           | Every bug is a "n00b mistake" when you are looking right at
           | it. There is not a single large project that is not filled
           | with or at least has had trivial bugs with serious
           | repercussions.
        
             | sellyme wrote:
             | > Every bug is a "n00b mistake" when you are looking right
             | at it.
             | 
             | Except the ones that are just the universe conspiring
             | against you, like bit flipping in the SM-1800 because of
             | heavily irradiated cattle being transported on a nearby
             | train line.
        
             | TedShiller wrote:
             | If you're a noob then yes every mistake is a noob mistake
        
         | petters wrote:
         | I agree with you in general, but in this particular case they
         | knowingly deployed thread-unsafe code in production while
         | increasing the threaded concurrency. That should have broken
         | internal development rules.
        
           | eli wrote:
           | Now imagine what happens behind the scenes at other service
           | providers. I'd bet the ones that have policies that bury the
           | details of incidents, on average, have worse problems.
        
           | yuliyp wrote:
           | Yes. Writing code with bugs breaks internal development
           | rules. The issue wasn't thread-unsafe code. It was confusion
           | about object lifecycle. An env object was assumed to no
           | longer be used by the web framework after a request
           | terminated, while in reality it was recycled for future
           | requests as a performance optimization.
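           | 
           | Roughly, the hazard looks like this (a Python sketch with
           | made-up names, not GitHub's actual code): a background
           | worker keeps a reference to the per-request env dict after
           | the response goes out, while the server recycles that same
           | dict for the next request.
           | 
           |   import queue, threading
           | 
           |   report_queue = queue.Queue()
           | 
           |   def reporting_worker():
           |       while True:
           |           env = report_queue.get()
           |           # by now env may describe the *next* request
           |           # if the server reused the same dict
           |           print("user:", env.get("session.user"))
           | 
           |   threading.Thread(target=reporting_worker,
           |                    daemon=True).start()
           | 
           |   def handle_request(env):
           |       ...  # build and send the response here
           |       report_queue.put(env)  # BUG: shares the live dict
           | 
           |   # safer: hand the worker a copy of just what it needs,
           |   # e.g. {"session.user": env.get("session.user")}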
        
           | simonw wrote:
           | It sounds to me like they created new internal development
           | rules as a result, which is a smart response to the issue.
        
         | m463 wrote:
         | I have to say the presentation is exceptional, including very
         | good diagrams of the problem.
         | 
         | But there's a thought in the back of my head wondering what
         | kind of stuff hit the fan, and wondering who had to create the
         | presentation. Was it the person who checked in the broken code,
         | was it the bug fixer or his manager, or was it a pm that got
         | stuck having to write up what happened?
        
           | fotta wrote:
           | My guess is that someone close to the remediation wrote it
           | (or it was written by the eng team) and it went through 5
           | levels of editing before publication.
        
             | m463 wrote:
             | I wonder about the email thread. how many recipients on the
             | TO, how many on the CC (how many on BCC?)
        
               | codetrotter wrote:
               | > how many on BCC?
               | 
               | I guess the only way we will know that is if there's ever
               | a vulnerability in the session handling code of the mail
               | server that they use also xD
        
         | Complexicate wrote:
         | Agreed. However, I smirked a bit when I read your comment,
         | since this is ultimately coming from Microsoft. Let's hope some
         | of this mentality seeps into other Microsoft projects
        
           | thitcanh wrote:
           | GitHub is owned by Microsoft but as far as I know it operates
           | pretty independently. Let's hope it stays that way.
        
             | craftinator wrote:
             | At this point we're all just waiting for that last E.
        
               | [deleted]
        
               | johannes1234321 wrote:
               | What would be the benefit for Microsoft in that case? -
               | They are transforming themselves from a platform company
                | based around Windows into a platform company based around
               | Azure. GitHub is a relevant part there.
               | 
               | This doesn't mean they are only doing good(tm) with
                | GitHub, but destroying it and alienating open source
                | vendors would be counter to their interests.
        
               | edoceo wrote:
               | Or, pro-actively moving to GitLab - either self-hosted or
               | paid and using GH as a simple mirror for discovery.
        
               | arthurcolle wrote:
               | I don't understand this reference, can you elucidate
        
               | xxqp wrote:
               | elucidate, extend, extinguish
        
               | artificial wrote:
               | Most likely a reference to Embrace, Extend, Extinguish. A
               | strategy Microsoft employed in the past.
        
               | arthurcolle wrote:
               | Ah thanks, yeah. Familiar with the strategy, but didn't
               | get the connection - been a long day over here
        
               | m463 wrote:
               | We should Eulogize
        
               | scintill76 wrote:
               | https://en.m.wikipedia.org/wiki/Embrace,_extend,_and_exti
               | ngu...
        
       | zelon88 wrote:
       | > We believe that transparency is key in earning and keeping the
       | trust of our users and want to share more about this bug.
       | 
       | I find it ironic that the largest repository of open source code
       | is closed source and owned by the largest closed source
       | software company on the planet, yet they somehow still "believe
       | in transparency."
       | 
       | That's like McDonald's "transparently" recalling burgers because
       | of a chemical contamination, and then when you ask what else is
       | in the burgers they tell you it's a secret.
        
         | berniemadoff69 wrote:
         | Fortunately Microsoft, like McDonalds, can use their billions
         | of dollars in the bank to produce high quality graphics to
         | 'explain the situation' - and when people visit the site, they
         | will be Delighted and Impressed with these high quality
         | graphics - instantly remedying the situation. 'Look how nice we
         | explained it - look at the cute cartoon characters solving a
         | puzzle in the header at the top of the page, see how slick we
         | are?'
        
         | spaetzleesser wrote:
         | I wonder if open sourcing their code would even be a problem
         | for them now. I don't think the code is worth much vs the brand
         | name and scale.
        
           | simonw wrote:
           | Strategic issues aside, open sourcing a codebase that is over
           | a decade old is extremely hard. You need to be absolutely
           | sure there's nothing in the code or commit history that's not
           | suitable for release.
           | 
           | Just one example: what if someone pasted an error log message
           | into a commit message that included a piece of PII?
        
           | [deleted]
        
           | xeeeeeeeeeeenu wrote:
           | Open-sourcing the codebase might affect the sales of the on-
           | premises version.
        
       | EMM_386 wrote:
       | I wonder if Microsoft has long-term plans to move Github off of
       | things like Unicorn/Rails/Erlang into a technology stack that
       | would be more familiar to them.
       | 
       | I'm not judging the choices of technology, but I'm thinking it
       | was easier for them to buy Github outright than try to compete
       | with it. I assume Github still operates independently, with
       | engineers who are familiar with this stack.
        
         | rantwasp wrote:
         | hah. good luck just moving it to another stack. this just
         | doesn't happen.
         | 
         | also, what does "familiar to them" mean in the MS context? i'm
         | sure the people that work on github are familiar with the stack
        
           | EMM_386 wrote:
           | Agreed, the comment was short sighted. Was hesitant to even
           | post it considering many involved are likely here on HN.
           | 
           | I wasn't sure how independent Github is from Microsoft after
           | the acquisition. My thoughts were initially in moving
           | engineers around internally from within Microsoft to work on
           | Github.
           | 
            | Debugging multi-threaded race conditions in a Rails app
            | under Unicorn/Rack takes a lot of experience with specific
            | tooling.
        
       | baggy_trough wrote:
       | The smartest engineers in the world are not smart enough to write
       | bug free multithreaded code.
        
         | fanf2 wrote:
         | This is the kind of multithreading bug that Rust's borrow
         | checker can prevent: the compiler would have complained that
         | the background task and the main request handler have write
         | access to the env object at the same time, which is not
         | allowed.
        
           | astrange wrote:
           | Writing your website in PHP would also have prevented it,
           | because its amateur process per request design is actually
           | the best way to do things, scales well, and has the right
           | security properties. Having an app server that can have
           | requests from different users in the same memory space is a
           | mistake.
        
             | baggy_trough wrote:
             | You can also run Ruby's unicorn or puma servers as a
             | single-threaded process cluster.
        
             | joshdifabio wrote:
             | PHP might move away from this safe approach with fibers,
             | which are about to be added to the language.
             | 
             | https://externals.io/message/113419#113555
        
         | WJW wrote:
         | Nobody is smart enough to write bug free code, but at least
         | some companies have the smarts to fix their code and blog about
         | how they did it.
        
           | robotresearcher wrote:
           | You're right but the parent's point is that multithreaded
           | code with shared objects is known to be particularly
           | difficult to do perfectly, and hard work to debug. It's so
           | difficult that the rational strategy may be to architect to
           | avoid it. We will have bugs, but not _those_ bugs. Those bugs
           | stink.
        
             | baggy_trough wrote:
             | may > definitely is, whenever possible.
        
         | trav4225 wrote:
         | It depends on the complexity.
        
       | jbverschoor wrote:
       | Is there a reason for the vague title? Any other title would
       | cause this to be the top post for 1 or 2 days
        
       ___________________________________________________________________
       (page generated 2021-03-19 23:03 UTC)