[HN Gopher] We found and fixed a rare race condition in our sess...
___________________________________________________________________
We found and fixed a rare race condition in our session handling
Author : todsacerdoti
Score : 214 points
Date : 2021-03-18 20:00 UTC (2 hours ago)
(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)
| maddyboo wrote:
| I'm very impressed by the level of detail provided in this post
| mortem, but it leaves me with a few questions.
|
| What we know based on this report:
|
| - This bug was live in production for at least 3 days.
|
| - It was common enough to happen numerous times in the wild.
|
| - GitHub says that they identified some specific user sessions
| that may have been impacted by this bug by analyzing their logs.
|
| What we do not know:
|
| - How many leaked user sessions were identified in the followup
| investigation?
|
| - Is GitHub confident that they identified all leaked user
| sessions?
|
| - Was there any evidence of malicious use of these sessions?
|
| - Did GitHub notify the users whose sessions were leaked?
|
| - How long/for how many requests could these sessions have been
| valid between their leak and when they were definitively revoked
| on the 8th?
|
| - Is it possible that any private user data was leaked? If so,
| what data?
| azernik wrote:
| > - Is GitHub confident that they identified all leaked user
| sessions?
|
| Given that they revoked _all_ current logins, rather than just
| the affected ones, I assume that they have no way to tell which
| sessions were affected.
| dfcowell wrote:
| Most of that information is available in the initial report
| linked to in the first paragraph of this blog post:
| https://github.blog/2021-03-08-github-security-update-a-bug-...
| maddyboo wrote:
| That provides more information but doesn't fully answer any
| of my questions.
|
| Things that are still unclear:
|
| - GitHub provides an estimate of the prevalence at 0.001%,
| but doesn't state how many leaked sessions were actually
| identified.
|
| - Is GitHub confident that they identified all leaked user
| sessions? This has bearing on my question about notification
| of users whose sessions were leaked; if they did not identify
| all leaked sessions, they obviously could not contact those
| users.
|
| - GitHub states that this bug could not be intentionally
| triggered by a malicious party (why?), but they don't state
| whether any of the natural occurrences may have been used
| maliciously.
|
| - How long/for how many requests could these sessions have
| been valid between their leak and when they were definitively
| revoked on the 8th?
|
| - Is it possible that any private user data was leaked? If
| so, what data?
|
| (Correction to my original comment: the bug was live for a
| cumulative period of "less than 2 weeks")
| the_mitsuhiko wrote:
| I think this issue is relevant for Python programmers as well.
| uwsgi for many years shipped a default allocation mode for the
| WSGI environment which just cleared it between requests. One had
| to switch over to `wsgi-env-behaviour=holy` to prevent it. Holy
| is now the default but many people still set it to cheap because
| that prevented some segfaults for a while in uWSGI.
| shay_ker wrote:
| Brutal - Unicorn didn't use Rack.env in a thread safe way. It's
| tougher given that this was in a widely used app-server and not
| even in Rails code.
|
| Impressed with GH folks in finding this, though I wonder how many
| more of these types of bugs are out there in Ruby libs. It'd be
| nice to make this more difficult to do, a la FP languages, by
| making it really intentional to modify state. Maybe in Ruby 4??
| alexeiz wrote:
| > We initially thought this to be an internal reporting problem
| only and that we would see some data logged for an otherwise
| unrelated request from the background thread.
|
| So they already conceded that their code is fundamentally not
| thread-safe and they decided to let it slide.
| tyingq wrote:
| I may be missing something here. I do see that the patch[1] now
| creates a new request object, and thus a new "env" Ruby hash. But
| I don't see the old behavior described as _" allocates one single
| Ruby Hash that is then cleared (using Hash#clear) between each
| request"_...even after poking around in http_request.rb and other
| places.
|
| [1] https://yhbt.net/unicorn-
| public/66A68DD8-83EF-4C7A-80E8-3F1F...
| phjesusthatguy3 wrote:
| I flagged this because it has "race" in the title and if there's
| one thing HN doesn't want to talk about, it's race.
|
| Totally serious, but also totally disgusted.
|
| Yeah, I'm done with HN. Fuck you niggers _.
|
| _ Six-to-a-room six-figure-a-year motherfuckers crying about how
| life is unfair. Fuck you.
| [deleted]
| lambda_obrien wrote:
| Always log important events with a timestamp and save them as
| long as possible.
| oars wrote:
| Wow, impressive!
| lambda_obrien wrote:
| I'm guessing that's sarcasm, I edited my comment.
| gowld wrote:
| > Always log important events with a timestamp and save them as
| long as possible.
|
| Except when those events are associated with private user data
| or behavior.
| lambda_obrien wrote:
| The personal data itself should not be logged but the event
| itself should be logged with lots of details that aren't
| private info.
| closeparen wrote:
| Nope, _especially_ in those cases. The more sensitive the
| data, the more important it is to have records of who
| accessed or changed it and when.
| temp667 wrote:
| Huh? I want my bank to be VERY clear on when I transferred
| money for my home closing (and keep that record). They need
| to keep records and the browser used, IP used, authentication
| flow etc. Why does this have to be thrown away?
| ghiculescu wrote:
| Anyone know if puma would be susceptible to the same bug or if
| it's unicorn specific?
| jakswa wrote:
| This is what I'm going to be digging for later... I'm wondering
| if puma also shares `env` objects across requests.
| Freaky wrote:
| Each server does share a "proto env" object, but it is
| duplicated for each Client:
|
| https://github.com/puma/puma/blob/0cc3f7d71d1550dfa8f545ece8.
| ..
|
| and Client objects are not reused:
|
| https://github.com/puma/puma/blob/0cc3f7d71d1550dfa8f545ece8.
| ..
| pianoben wrote:
| This is some fantastic debugging. I love war stories like this!
| They give me confidence that the world is full of mere mortals
| like myself, and we can do great things nonetheless.
| benmmurphy wrote:
| This post mortem and changes to unicorn should be useful for a
| lot of other ruby shops. I suspect github aren't the only people
| who are not defensively copying rack data when sharing it with
| background threads.
| thecodemonkey wrote:
| Really interesting write-up. Curious how they do logging, and how
| they were able to get such a detailed look at the previous
| requests, including the HTTP response. Wouldn't it be a massive
| amount of data if all HTTP requests/responses were logged? (Not
| to mention the security implications of that)
| firebaze wrote:
| As expensive as it is, datadog provides the required level of
| insight via, among other drill-down methods, detailed
| flamegraphs, to get to the bottom of problems like this one.
|
| No, not affiliated with datadog, and not convinced the
| cost/benefit ratio is positive for us in the long run.
| ckuehl wrote:
| Purely a guess, but I've sometimes been able to work backwards
| from the content length (very often included in HTTP server
| logs) to figure out what the body had to be.
| bob1029 wrote:
| It would indeed be a massive amount of data, but bear in mind
| that only a small fraction of HTTP requests hitting GitHub are
| actually both authenticated and attempting to change the state
| of the system. Most requests are unauthenticated read-only and
| cause no state changes in GitHub.
| vlovich123 wrote:
| It's a reasonable theory but step one required to trigger
| this bug is an unauthenticated request. It's unclear if their
| logs indicated that or they got lucky trying to repro that
| someone thought to try with an unauthenticated request first
| when the back to back session didn't repro.
| whimsicalism wrote:
| From the writeup, it _sounds_ like the latter, but we
| obviously can 't be sure.
| aidos wrote:
| Could be that there's some tell tale in the logs.
|
| We recently fixed a bug with incomplete requests by
| noticing in nginx logs that the size of the headers + the
| size of the body was the size of the buffer of a linux
| socket.
| xvector wrote:
| Distributed tracing with a small sample size or small
| persistence duration?
| dmlittle wrote:
| Given the rarity of this bug I doubt a small sample sized
| would have captured the right requests to be able to debug
| this. The probability would be the probability of this bug
| occurring * (sample percentage)^3 [ you'd need to have
| sampled the correct 3 consecutive requests].
| mtnygard wrote:
| I'm looking forward to learning about how Rust makes this kind of
| bug impossible.
| vlovich123 wrote:
| It doesn't. The issue here seems like a more vanilla logical
| data race + object reuse that Rust can't really do anything
| about. Rust only makes sure that only one thread at a time
| writes and other reads don't race that write. It can't make
| guarantees that the relative ordering of reads and wires makes
| sense for your use-case.
|
| It can simplify certain abstractions of course so that you can
| provide easier tools for devs to avoid footguns.
| cogman10 wrote:
| What rust would do for this particular situation is make it
| painful to share an object like this across threads and make
| it obvious that such a share is happening. That wouldn't
| prevent this problem, but it would make it harder to stumble
| into.
| skytreader wrote:
| > On March 2, 2021, we received a report via our support team
| from a user who, while using GitHub.com logged in as their own
| user, was suddenly authenticated as another user.
|
| Sobering thought for me is how probable it is that a similar
| issue might be present in other platforms. The title calls it
| "rare" but given (a) the number of complex-enough web/app
| platforms in use today, (b) just the endless number of ways/infra
| permutations for "doing x", and (c) the size of a typical
| company's codebase and dependencies, could it be a more common
| occurrence than we think?
|
| Total anecdote, up for you to believe: Back in April 2016, one of
| my then-housemates bought a brand-new Macbook Air. A week into
| his ownership, he exclaims surprise that he is logged-in to
| another Facebook account, someone we are totally not acquainted
| with. I borrow his MBA and try to investigate but, alas, I lack
| expertise for it. The only relevant thing I can eke out is that
| they both went to the same event a week or so ago and probably
| shared a Wifi AP, where a "leak" could've happened. Or maybe a
| hash collision in FB's side?
|
| I would've reported it to FB but then I don't have enough
| details; with the two conjectures I have, I'm basically holding
| water with a sieve. But now the thought that someone I totally
| don't know might end up logged-in in my FB account is a
| possibility that bothers me. And I have no idea how to prevent it
| other than minimizing what could be compromised (and no, "Get off
| FB!" just isn't in the cards, I'm sorry).
|
| Kudos for Github for investigating and fixing this issue. Another
| thought that occurred to me while writing this is how many users
| of other platforms might be filing issues like this but can't
| provide enough detail and so their reports are eventually marked
| as "Could Not Reproduce". Not exactly leaky sessions but just
| general security mishaps that they would struggle to describe to
| support.
| [deleted]
| exdsq wrote:
| Don't sell yourself short, I'm sure you're more than
| experienced to get your own MBA!
| ficklepickle wrote:
| If it were a hash collision, I would think the odds of it
| involving two users geographically close to each other would be
| quite small.
|
| That's a weird one, thanks for sharing it!
| thraway123412 wrote:
| Also on GOG:
| https://www.gog.com/forum/general/warning_massive_security_r...
|
| > Hello everyone! This is forum regular fronzelneekburm,
| posting from the account of some poor Chinese guy that gog
| randomly logged me into. I'll explain the situation in further
| detail in another post from my actual account once I logged out
| of this one.
| benlivengood wrote:
| This is the kind of bug that makes me leery of frameworks and
| tall application stacks. Having to trust more than one layer of
| multithreaded code between an application and the kernel gets
| riskier with every layer.
|
| Also another reason not to mix the main application code with the
| authentication/authorization code. Setting security cookies
| should be done by dedicated code in a dedicated process to avoid
| this kind of interaction, and the load-balancer layer should
| filter the setting of security cookies from non-security backends
| to enforce it.
| gizmo wrote:
| There is a lesson here about the way we go about sharing memory
| between threads that is implicit and fundamentally broken. Adding
| extra threading in order to improve performance is clearly the
| wrong tradeoff when you're in an environment where threads can
| introduce bugs like these. When it takes 10 senior engineers to
| figure out why one user session gets data from another session
| you've already lost.
|
| Shared memory environments are very unsafe and we should be
| working towards a memory model where data that belongs to
| different organizations can never get mixed. This isn't
| impossible, but it requires new programming languages that allow
| us to express which memory areas can't get mixed up.
|
| We've got to start designing our backend infrastructure
| differently such that this entire class of memory errors can't
| ever occur.
| lstamour wrote:
| In many languages and web frameworks, the assumption is that
| the application server will do this for you. For example, by
| issuing a new env variable with each request such that requests
| can't easily share memory with other requests, though as noted
| that might not stop background threads from retaining such
| memory.
|
| In other cases, you might assume containers would be enough
| separation, or maybe different databases (sharding, for
| example). One approach I've considered but not yet implemented
| is to take end-user JWT cookies and use them to authenticate to
| a database such that a user can only ever see the rows they
| should be allowed access to, and all other SQL queries would
| fail.
|
| But no matter how you implement a security control, it's
| possible to make mistakes and allow others to see data you
| didn't intend to share with them. Even Rust, famous for its
| explicit memory ownership model, could still have similar bugs,
| they just might be a bit more obvious to spot in some session
| connection pool implemented incorrectly, for example, or maybe
| exposed via a compiler optimization gone wrong. Code review
| will still likely be one of the primary defenses of this kind
| of bug, though hopefully better tooling and language support to
| detect race conditions or shared ownership warnings will help
| over time?
| mcherm wrote:
| I just want to make clear that for me, write ups containing this
| level of detail analyzing the error that occurred give me greater
| confidence in the company rather than less confidence. Greater
| even than if I had never heard of the original bug. There are
| many bugs I have never heard about, but any company that uses
| this level of care and openness is likely to be very reliable.
| TedShiller wrote:
| This is such a n00b mistake though, the "greater confidence" is
| quickly canceled out. I end up having less confidence in this
| company.
| petters wrote:
| I agree with you in general, but in this particular case they
| knowingly deployed thread-unsafe code in production while
| increasing the threaded concurrency. That should have broken
| internal development rules.
| eli wrote:
| Now imagine what happens behind the scenes at other service
| providers. I'd bet the ones that have policies that bury the
| details of incidents, on average, have worse problems.
| Complexicate wrote:
| Agreed. However, I smirked a bit when I read your comment,
| since this is ultimately coming from Microsoft. Let's hope some
| of this mentality seeps into other Microsoft projects
| thitcanh wrote:
| GitHub is owned by Microsoft but as far as I know it operates
| pretty independently. Let's hope it stays that way.
| craftinator wrote:
| At this point we're all just waiting for that last E.
| edoceo wrote:
| Or, pro-actively moving to GitLab - either self-hosted or
| paid and using GH as a simple mirror for discovery.
| arthurcolle wrote:
| I don't understand this reference, can you elucidate
| zelon88 wrote:
| > We believe that transparency is key in earning and keeping the
| trust of our users and want to share more about this bug.
|
| I find it ironic that the largest repository of open source code
| is closed source and and owned by the largest closed source
| software company on the planet, yet they somehow still "believe
| in transparency."
|
| That's like McDonald's "transparently" recalling burgers because
| of a chemical contamination, and then when you ask what else is
| in the burgers they tell you it's a secret.
| spaetzleesser wrote:
| I wonder if open sourcing their code would even be a problem
| for them now. I don't think the code is worth much vs the brand
| name and scale.
| baggy_trough wrote:
| The smartest engineers in the world are not smart enough to write
| bug free multithreaded code.
| fanf2 wrote:
| This is the kind of multithreading bug that Rust's borrow
| checker can prevent: the compiler would have complained that
| the background task and the main request handler have write
| access to the env object at the same time, which is not
| allowed.
| astrange wrote:
| Writing your website in PHP would also have prevented it,
| because its amateur process per request design is actually
| the best way to do things, scales well, and has the right
| security properties. Having an app server that can have
| requests from different users in the same memory space is a
| mistake.
| WJW wrote:
| Nobody is smart enough to write bug free code, but at least
| some companies have the smarts to fix their code and blog about
| how they did it.
| robotresearcher wrote:
| You're right but the parent's point is that multithreaded
| code with shared objects is known to be particularly
| difficult to do perfectly. It's so difficult that the
| rational strategy may be to architect to avoid it. We will
| have bugs, but not _those_ bugs.
___________________________________________________________________
(page generated 2021-03-18 23:00 UTC)