[HN Gopher] March 20 ChatGPT outage: Here's what happened
___________________________________________________________________
March 20 ChatGPT outage: Here's what happened
Author : zerojames
Score : 234 points
Date : 2023-03-24 16:08 UTC (6 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| Kuinox wrote:
| I managed to manually produce this bug 2 months ago. As they
| don't have any bug bounty, I didn't submit it. By starting a
| conversation and refreshing before ChatGPT had time to answer, I
| managed to reproduce this bug 2-3 times in January.
| breckenedge wrote:
| did you reach out via https://openai.com/security.txt?
| rvz wrote:
| First GitHub, then OpenAI. Two of Microsoft finest(!) (majority
| owned and acquired) companies on the top of HN announcing a
| serious security incident.
|
| It's quite unsettling to see this leak of highly sensitive
| information and a private key exposure as well. Doesn't look good
| and seems like they don't take security seriously.
| skybrian wrote:
| In the case of OpenAI, the product is more of a research demo
| that had to be drastically scaled up, though. From an
| operations point of view it's more like a startup.
| deltree7 wrote:
| Nobody cares and yet another case study of HN being out-of-
| touch with reality
| rvz wrote:
| > _" Nobody cares"_
|
| Yet another case study of absolutism, which can be simply
| dismissed.
|
| People paying for ChatGPT care when it goes down and when their
| details and chats get leaked, and that certainly extends beyond
| HN. Same with GitHub. Both have ~100M users between them.
|
| That's the reality.
| sebzim4500 wrote:
| I'm paying for ChatGPT and I don't care about this any more
| than the many, many other services I use that have at some
| point had an embarrassing security issue.
| deltree7 wrote:
| I'm paying and I don't care. If I write perfect bug-free
| code, lead a perfect life, live in a perfect world, I'd be
| upset.
|
| But, I know that shit happens and the reliability meter
| should be flexible for different things (bridges, heart
| surgery and chat agent).
|
| If I train my brain to bitch, whine, moan about every
| thing, I'd not have resources to care about really
| important things.
| kkarpkkarp wrote:
| is it only me who doesn't see any chat history since yesterday,
| and for whom chat generally does not work (you can type the
| message, but clicking the button or hitting enter / ctrl+enter
| has no effect)?
|
| in chat history there is a button to "retry", but clicking it
| and inspecting the result, you see "internal server error"
| LarsDu88 wrote:
| I called it:
| https://news.ycombinator.com/item?id=35267569#35270165
| fintechie wrote:
| I reported this race condition via ChatGPT's internal feedback
| system after I saw other user's chat titles loading on my sidebar
| a couple of times (around 7-8 weeks ago). Didn't get a response,
| so I assumed it was fixed...
|
| Hopefully they'll start a bug bounty program soon, and prioritise
| bug reports over features.
| totallyunknown wrote:
| same for me. actually only the summary of the history was from
| a different user; the content itself was mine.
| sebzim4500 wrote:
| The claim made at the time was that the titles were not from
| other people and were in fact caused by the model
| hallucinating after the input query timed out (or something
| like that). Obviously that sounds a little suspect now, but
| it might be true.
| nwienert wrote:
| That's a lie if so, if you look at the Reddit threads
| there's no way those were not specific other users
| histories as they had the logical history of reading
| browser history. Eg, one I saw had stuff like "what is X",
| then the next would be "How to X" or something. Some were
| all in Japanese, others all in Chinese. If it was random
| you wouldn't see clear logical consistency across the list.
| jetrink wrote:
| The explanation at the time was that unavailable chat data (due
| to, e.g. high load) resulted in a null input sometimes being
| presented to the chat summary system, which in turn caused the
| system to hallucinate believable chat titles. It's possible
| that they misdiagnosed the issue or that both bugs were present
| and they caught the benign one before the serious one.
| ElijahLynn wrote:
| That is a pretty good disclosure that creates trust.
| kristianpaul wrote:
| This is more a data leak than an outage...
| sebzim4500 wrote:
| It was down for quite a while, so I would call it an outage.
| qwertox wrote:
| Nice writeup, it's fair in the content presented to us.
|
| Yet I'm wondering why there is no checking if the response does
| actually belong to the issued query.
|
| The client issuing a query can pass a token and verify upon
| answer that this answer contains the token.
|
| TBH, as a user of the client I would kind of expect the library
| to have this feature built in, and if I'm using the library to
| solve a problem, handling this edge case would be a somewhat
| low priority for me if the library didn't implement it,
| probably because I'm lazy.
|
| I hope that the fix they offered to Redis Labs does contain a
| solution to this problem and that everyone of us using this
| library will be able to profit from the effort put into resolving
| the issue.
|
| It doesn't [0], so the burden is still on the developer using the
| library.
|
| [0] https://github.com/redis/redis-
| py/commit/66a4d6b2a493dd3a20c...
|
| ---
|
| Edit: Now I'm confused, this issue [1] was raised on March 17 and
| fixed on March 22, was this a regression? Or did OpenAI start
| using this library on March 19-20?
|
| Interesting comment:
|
| > drago-balto commented 3 hours ago
|
| > Yep, that's the one, and the #2641 has not fixed it fully, as I
| already commented here: #2641 (comment)
|
| > I am asking for this ticket to be re-opened, since I can still
| reproduce the problem in the latest 4.5.3. version
|
| [1] https://github.com/redis/redis-
| py/issues/2624#issue-16293351...
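[The echo-token check qwertox describes could look roughly like this. A minimal sketch only: the nonce field name and the backend's shape are invented for illustration, not taken from any real client.]

```python
# Sketch of a per-request echo token: the client attaches a unique
# nonce to each query and rejects any response that does not echo it
# back, so a desynchronized connection surfaces as an error instead
# of silently delivering another request's data.
import uuid

def send_query(backend, payload):
    nonce = str(uuid.uuid4())  # unique token for this request
    response = backend({"nonce": nonce, "payload": payload})
    if response.get("nonce") != nonce:
        # Response belongs to some other request: fail loudly and
        # let the caller discard the (corrupted) connection.
        raise RuntimeError("response does not match request; dropping connection")
    return response["payload"]

# A well-behaved backend echoes the nonce back unchanged:
echo = lambda req: {"nonce": req["nonce"], "payload": req["payload"].upper()}
print(send_query(echo, "hello"))  # HELLO
```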
| [deleted]
| menzoic wrote:
| That sounds more like a hindsight thing. In most systems
| authorization doesn't happen at the storage layer. Most queries
| fetch data by an identifier which is only assumed to be valid
| based on authorization that typically happens at the edge and
| then everything below relies on that result.
|
| It's not the safest design but I wouldn't say the client should
| be expected to implement it. That security concern is at the
| application layer and the actual needs of the implementation
| can be wildly different depending on the application. You can
| imagine use cases for redis where this isn't even relevant,
| like if it's being used to store price data for stocks that
| update every 30 seconds. There's no private data involved
| there. It's out of scope for a storage client to implement.
| [deleted]
| benmmurphy wrote:
| This is a common bug with a lot of software. For example some
| HTTP clients that do pooling won't invalidate the connection
| after timing out waiting for the response.
| picodguyo wrote:
| If you're subscribed to their status page, you'll know it's
| actually unusual for a day to go by without an outage alert from
| OpenAI. They don't usually write them up like this but I guess
| this counts as PII leak disclosure for them? For having raised
| billions of dollars, they are comically immature from a reliability
| and support perspective.
| thequadehunter wrote:
| To be fair, they accidentally made a game-changing breakthrough
| that gained millions of users overnight, and I don't think they
| were ready for it.
|
| Before chatgpt, most normal people had never heard of OpenAI.
| Their flagship product was basically an API that only
| programmers could make useful.
|
| Team leaders at OpenAI have stated that they were not expecting
| the success, let alone the highest adoption rate for any
| product in history. In their minds, it was just a cleaned-up
| version of a 2-year-old product. It was billed as a research
| preview.
|
| So, all of a sudden you go from hiring mostly researchers
| because you only have to maintain an API and some mid-traffic
| web infra, to suddenly having the fastest growing web product
| in history and having to scale up as fast as you can. Keep in
| mind that they didn't get backing from Microsoft until January
| 23, 2023-- that was only 2 months ago.
|
| I'd say we should cut them some slack.
| picodguyo wrote:
| These problems predate ChatGPT. Their API has been on the
| market for nearly 3 years. And they raised their first $1B in
| 2019. That's plenty of money and time to hire capable
| leadership.
| [deleted]
| construct0 wrote:
| The bug: https://github.com/redis/redis-py/issues/2624
| braindead_in wrote:
| Was this written by ChatGPT? Maybe it found the bug as well,
| who knows.
| photochemsyn wrote:
| > "If a request is canceled after the request is pushed onto
| the incoming queue, but before the response popped from the
| outgoing queue, we see our bug: the connection thus becomes
| corrupted and the next response that's dequeued for an
| unrelated request can receive data left behind in the
| connection."
|
| The OpenAI API was incredibly slow and lots of requests
| probably got cancelled (I certainly was doing that) for some
| days. I imagine someone could write a whole blog post about how
| that worked, it would be interesting reading.
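[The failure mode in that quote can be modeled in a few lines. A toy sketch, not redis-py's actual code: a multiplexed connection pairs requests and responses strictly by order, so an undrained response from a cancelled request shifts every later pairing by one.]

```python
from collections import deque

class PipelinedConn:
    """Toy model of a multiplexed connection: requests go out in
    order and responses are paired with requests strictly FIFO."""
    def __init__(self):
        self.wire = deque()  # responses the server will return, in order

    def send(self, key):
        self.wire.append(f"value-for-{key}")  # server queues its answer

    def recv(self):
        return self.wire.popleft()  # assumed to answer *our* request

conn = PipelinedConn()
conn.send("alice")   # request A hits the wire...
# ...request A is cancelled here, but its response is never drained...
conn.send("bob")     # unrelated request B reuses the same connection
print(conn.recv())   # value-for-alice  <- B receives A's data
```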
| construct0 wrote:
| .... "I am asking for this ticket to be re-opened, since I can
| still reproduce the problem in the latest 4.5.3. version"
| chatmasta wrote:
| The PR: https://github.com/redis/redis-py/pull/2641
|
| According to the latest comments there, the bug is only
| partially fixed.
| chatmasta wrote:
| Why did it take them _9 hours_ to notice? The problem was
| immediately obvious to anyone who used the web interface, as
| evidenced by the many threads on Reddit and HN.
|
| > between 1 a.m. and 10 a.m. Pacific time.
|
| Oh... so it was because they're based in San Francisco. Do they
| really not have a 24/7 SRE on-call rotation? Given the size of
| their funding, and the number of users they have, there is really
| no excuse not to at least have some basic monitoring system in
| place for this (although it's true that, ironically, this
| particular class of bug is difficult to detect in a monitoring
| system that doesn't explicitly check for it, despite being
| immediately obvious to a human observer).
|
| Perhaps they should consider opening an office in Europe, or
| hiring remotely, at least for security roles. Or maybe they could
| have GPT-4 keep an eye on the site!
| guessmyname wrote:
| > _[...] it was because they 're based in San Francisco. Do
| they really not have a 24/7 SRE on-call rotation?_
|
| OpenAI is hiring Site Reliability Engineers (SRE) in case you,
| or anyone you know, is interested in working for them:
| https://openai.com/careers/it-engineer-sre . Unfortunately, the
| job is an onsite role that requires 5 days a week in their San
| Francisco office, so they do not appear to be planning to have
| a 24/7 on-call rotation any time soon.
|
| Too bad because I could support them in APAC (from Japan).
|
| Over 10 years of industry experience, if anyone is interested.
| p1esk wrote:
| Also, I heard their interviews (for any technical position)
| are very tough.
| eep_social wrote:
| Staffing an actual 24x7 rotation of SREs costs about a million
| dollars a year in base salary as a floor and there are few SREs
| for hire. A metrics-based monitor probably would have triggered
| on the increased error rate but it wouldn't have been
| immediately obvious that there was also a leaking cache. The
| most plausible way to detect the problem from the user
| perspective would be a synthetic test running some affected
| workflow, built to check that the data coming back matches
| specific, expected strings (not just well-formed). All possible
| but none of this sounds easy to me. Absolutely none of this is
| plausible when your startup business is at the top of the news
| cycle every single day for the past several months.
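[The synthetic probe described above might be sketched as follows. The key name and the `write`/`read` callables are invented stand-ins for whatever client the probe would actually use.]

```python
# A canary probe writes a known marker under its own key, reads it
# back, and alerts if the returned data is not byte-for-byte the
# expected string - catching cross-user leakage, not just errors.
def synthetic_check(write, read):
    marker = "canary-7f3a"  # fixed, known-good payload
    write("synthetic:conv-title", marker)
    got = read("synthetic:conv-title")
    if got != marker:
        return f"ALERT: expected {marker!r}, got {got!r}"
    return "OK"

store = {}  # stand-in for the real cache
print(synthetic_check(store.__setitem__, store.__getitem__))  # OK
```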
| namaria wrote:
| Every system failure prompts people to exclaim "why aren't there
| safeguards?". Every time. Well guess what, if we try to do
| new stuff, we will run into new problems.
| wouldbecouldbe wrote:
| There is nothing new about using redis for cache, or
| returning a list for a user.
| namaria wrote:
| Are you trying to say cache invalidation in a distributed
| system is a trivial problem?
| oulu2006 wrote:
| It's non-trivial but it's also not that hard, there are
| well known strategies for achieving it; especially if you
| relax guarantees and only promise eventual consistency
| then it becomes fairly trivial - we do this, for example,
| and have few problems with it.
| chatmasta wrote:
| I'm not disagreeing with you, and I'm not the commenter
| you're replying to, but it's worth noting that cache
| leakage and cache invalidation are two different
| problems.
| namaria wrote:
| You're right. Thanks for pointing that out. My original
| point still stands, distributed systems are hard and
| people demanding zero failures are setting an impossible
| standard.
| sosodev wrote:
| "there are few SREs for hire"
|
| How do you figure? If you mean there are few SRE with several
| years of experience you might be right. SRE is a fairly new
| title so that's not too surprising.
|
| However, my experience with a recent job search is that most
| companies aren't hiring SRE right now because they consider
| reliability a luxury. In fact, I was in search of a new SRE
| position because I was laid off for that very reason.
| chatmasta wrote:
| You don't even need an SRE to have an on-call rotation; you
| could ping a software engineer who could at least recognize
| the problem and either push a temporary fix, or try to wake
| someone else to put a mitigation in place (e.g. disabling
| the history API, which is what they eventually did).
|
| However, I think the GP's point about this class of bug
| being difficult to detect in a monitoring system is the
| more salient issue.
| eep_social wrote:
| Well hang on! Your question was why was the time to
| detect so high and you specifically mentioned 24x7 SRE so
| I thought that's what we were talking about ;)
|
| And I do think the answer is that monitoring is easy but
| good monitoring takes a whole lot of work. Devops teams
| tend to get to sufficient observability where a SRE team
| should be dedicating its time to engineering great
| observability because the SRE team is not being pushed by
| product to deliver features. A functional org will
| protect SRE teams from that pressure, a great one will
| allow the SRE team to apply counter-pressure from the
| reliability and non-functional perspective to the product
| perspective. This equilibrium is ideal because it allows
| speed but keeps a tight leash on tech debt by developing
| rigor around what is too fast or too many errors or
| whatever your relevant metrics are.
| eep_social wrote:
| I've anecdotally observed the opposite. I have noticed SRE
| jobs remain posted, even by companies laying off or
| announcing some kind of hiring slowdown over the last
| quarter or so. More generally, businesses that have decided
| that they need SRE are often building out from some kind of
| devops baseline that has become unsustainable for the dev
| team. When you hit that limit and need to split out a
| dedicated team, there aren't a ton of alternatives to
| getting a SRE or two in and shuffling some prod-oriented
| devs to the new SRE team (or building a full team from
| scratch which is what the $$ was estimating above). Among
| other things, the SRE bailiwick includes capacity planning
| and resource efficiency; SRE will save you money in the
| long term.
|
| On a personal note, I am sorry to hear that your job search
| has not yet been fruitful. Presumably I am interested in
| different criteria from you --- I have found several
| postings that are quite appealing to the point where I am
| updating my CV and applying, despite being weakly motivated
| at the moment.
| pharmakom wrote:
| They raised a billion dollars.
| dharmab wrote:
| You don't necessarily need a full team of SREs- you can also
| have a lightly staffed ops center with escalation paths.
| eep_social wrote:
| I don't think that model has the properties you think it
| does. Someone still has to take call to back the operators.
| Someone has to build the signals that the ops folks watch.
| Someone has to write criteria for what should and should
| not be escalated, and in a larger org they will also need
| to know which escalation path is correct. And on and on --
| the work has to get done somewhere!
| majormajor wrote:
| The way those criteria usually get written in a startup
| with mission-critical customer-facing stuff (like this
| privacy issue) is that _first_ the person watching
| Twitter and email and whatever else pages the engineers,
| and _then_ there 's a retro on whether or not that
| particular one was necessary, lather, rinse, repeat.
|
| All you need on day 1 is someone to watch the
| (metaphorical) phones + a way to page an engineer. Don't
| start by spending a million bucks a year, start by having
| a first aid kit at the ready.
|
| Perhaps they could also help this person out by looking
| into some sort of fancy software to automatically
| summarize messages that were being sent to them, or their
| mentions on Reddit, or something, even?
| scarmig wrote:
| Since it now handles visual inputs, I wonder how hard it'd be
| to get GPT to monitor itself. Have it constantly observe a
| set of screenshares of automated processes starting and
| repeating ChatGPT sessions on prod, alert the on-call when it
| notices something "weird."
| inconceivable wrote:
| nobody qualified wants the 24/7 SRE job unless it pays an
| enormous amount of money. i wouldn't do it for less than 500
| grand cash. getting woken up at 3am constantly or working 3rd
| shift is the kind of thing you do with a specific monetary goal
| in mind (i.e., early retirement) or else it's absolute hell.
|
| combine that with ludicrous requirements (the same as a senior
| software engineer) and you get gaps in coverage. ask yourself
| what senior software engineer on earth would tolerate getting
| called CONSTANTLY at 3am, or working 3rd shift.
|
| the vast majority of computer systems just simply aren't as
| important as hospitals or nuclear power plants.
| nijave wrote:
| Not only that, but you probably need to follow the sun if you
| want <30 minute response time.
|
| Given a system that collects minute-based metrics, it
| generally takes around 5-10 minutes to generate an alert.
| Another 5-10 minutes for the person to get to their computer
| unless it's already in their hand (what if you get unlucky
| and on-call was taking a shower or using the toilet?). After
| that, another 5-10 minutes to see what's going on with the
| system.
|
| After all that, it usually takes some more minutes to
| actually fix the problem.
|
| Dropbox has a nice article on all the changes they made to
| streamline incident response
| https://dropbox.tech/infrastructure/lessons-learned-in-
| incid...
| mnahkies wrote:
| Timezones are a thing - your 3am is someone's 9am and may be
| a significant part of your customer base.
|
| Being paged constantly is a sign of bad alerts or bad systems
| IMO - either adjust the alert to accept the current reality
| or improve the system
| inconceivable wrote:
| spinning up a subsidiary in another country (especially one
| with very strict labor laws, like in european countries) is
| not as easy as "find some guy on the internet and pay him
| to watch your dashboard." and then give him root so he can
| actually fix stuff without calling your domestic team,
| which would defeat the whole purpose.
|
| also, even getting paged ONCE a month at 3am will fuck up
| an entire week at a time if you have a family. if it
| happens twice a month, that person is going to quit unless
| they're young and need the experience.
| mnahkies wrote:
| Sorry to be clear I was replying to this part of your
| comment
|
| > the vast majority of computer systems just simply
| aren't as important as hospitals or nuclear power plants.
|
| I agree that the stakes are lower in terms of harm, but
| was trying to express that whilst it might not be life
| and death, it might be hindering someone being able to do
| their job / use your product - eg: it still impacts
| customer experience and your (business) reputation.
|
| False pages for transient errors are bad - ideally you
| only get paged if human intervention is required, and
| this should form a feedback cycle to determine how to
| avoid it in future. If all the pages are genuine problems
| requiring human action then this should feed into tickets
| to improve things
| chatmasta wrote:
| It's really not that difficult, and there are providers
| like Deel who can manage it all for you, to the point you
| just ACH them every month.
|
| Source: co-founder of a remote startup with employees in
| five countries
| inconceivable wrote:
| like you said, timezones are a thing. now you're managing
| a global team.
| Godel_unicode wrote:
| That sounds harder than it is, especially if you already
| allow remote work. It mostly just forces you to have
| better docs.
| oulu2006 wrote:
| I did that for a few years, and wasn't on 500k a year, but
| I'm also the company co-founder, so you could argue that a
| "specific monetary goal" was applicable.
| [deleted]
| cloudking wrote:
| Probably because they launched ChatGPT as an experiment and
| didn't think it would blow up, needing full time SRE etc. I
| don't think it was designed for scale and reliability when they
| launched.
| majormajor wrote:
| You don't need 24/7 SREs, you could do it with 24/7 first-line
| customer support staff monitoring Twitter, Reddit, and official
| lines of comms that have the ability to page the regular
| engineering team.
|
| That's a lot easier to hire, and lower cost. More training
| required of what is worth waking people up over; way less in
| terms of how to fix database/cache bugs.
| CubsFan1060 wrote:
| Do events like this cause them to lose enough revenue that it
| would make sense to hire a bunch of SRE's?
| nijave wrote:
| Probably the real reason. I assume they intend to make money
| off enterprise contracts which would include SLAs. Then
| they'd set their support based off that
| chatmasta wrote:
| Given the Microsoft partnership, they might not even need
| to manage any real infrastructure. Just hand it off to
| Azure and let them handle the details.
| killerstorm wrote:
| Serious question: Why do people feel it's necessary to use a
| redis cluster?
|
| I understand in early 2000s we were using spinning disks and it
| was the only way. Well, we don't use spinning disks any more, do
| we?
|
| A modern server can easily have terabytes of RAM and petabytes of
| NVMe, so what's stopping people from just using postgres?
|
| A cluster of radishes is an anti-pattern.
| lofaszvanitt wrote:
| People know it, that's all.
| cplli wrote:
| For caching the query results you get from your database. Also
| it's easier to spin up Redis and replicate it closer to your
| user than doing that with your main database. From my
| experience anyway.
| killerstorm wrote:
| > For caching the query results you get from your database.
|
| This only makes sense if queries are computationally
| intensive. If you're fetching a single row by index you
| aren't winning much (or anything).
| dpkirchner wrote:
| Of course? I'm not really sure what the original question
| actually is if you know that users benefit from caching the
| results of computationally intensive queries.
| killerstorm wrote:
| OpenAI uses redis to store pieces of text. Fetching
| pieces of text is not computationally intensive.
| mannyv wrote:
| Most likely they have them in an rdbms, so it's more like
| joining a forum thread together. Not expensive, but why
| not prebuild and store it instead?
| acuozzo wrote:
| > This only makes sense if queries are computationally
| intensive.
|
| Or if the link to your DB is higher latency than you're
| comfortable with.
| mike_hearn wrote:
| I think the idea is that if your db can hold the working set
| in RAM and you're using a good db + prepared queries, you can
| just let it absorb the full workload because the act of
| fetching the data from the db is nearly as cheap as fetching
| it from redis.
| xp84 wrote:
| I'm confused about why they needed to complicate something as
| seemingly straightforward as a KV store into a series of queues
| that can get all mixed up. I asked ChatGPT to explain it
| though, and it sounds like the justification for its existence
| is that it doesn't "block the event loop" while a request is
| "waiting for a response from Redis."
|
| Last time I checked, Redis doesn't take that long to provide a
| response. And if your Redis servers actually are that
| overloaded that you're seeing latency in your requests, it
| seems like simple key-based sharding would allow horizontally
| scaling your Redis cluster.
|
| _Disclaimer: I am probably less smart than most people who
| work at OpenAI so I 'm sure I'm missing some details. Also this
| is apparently a Python thing and I don't know it beyond surface
| familiarity._
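[The key-based sharding mentioned above is simple to sketch. Hostnames here are hypothetical; production deployments typically use consistent hashing instead of plain modulo so that adding a node doesn't remap every key.]

```python
import hashlib

NODES = ["redis-0:6379", "redis-1:6379", "redis-2:6379"]  # hypothetical hosts

def node_for(key: str) -> str:
    # Hash the key and map it onto one node, spreading load evenly
    # across the cluster while keeping routing deterministic.
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# The same key always lands on the same node:
print(node_for("user:123:conversations") == node_for("user:123:conversations"))  # True
```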
| zmj wrote:
| I'm not familiar with the Python client specifically, but
| Redis clients generally multiplex concurrent requests onto a
| single connection per Redis server. That necessitates some
| queueing.
| adrr wrote:
| My redis clusters are 10x more cost effective than my
| postgresdb in handling load.
| amtamt wrote:
| For caching somewhat larger objects based on ETag?
| eldenring wrote:
| Yes! I have been spending the last couple months pulling out
| completely unnecessary redis caching from some of our internal
| web servers.
|
| The only loss here is network latency, which is negligible when
| you're colocated in AWS.
|
| Postgres's caches end up pulling a lot more weight too when
| you're not only hitting the db on a cache miss from the web
| server.
| [deleted]
| aadvark69 wrote:
| Better concurrency (~10k max connections vs ~200 for
| Postgres). ~20x faster than Postgres at key-value read/write
| operations. (Mostly) single-threaded, so atomicity is achieved
| without the synchronization overhead found in an RDBMS.
|
| Thus, it's much cheaper to run at massive scale like OpenAI's
| for certain workloads, including KV caching
|
| also:
|
| - robust, flexible data structures and atomic APIs to
| manipulate them are available out-of-the box
|
| - large and supportive community + tooling
| manv1 wrote:
| 1. Redis can handle a lot more connections, more quickly, than
| a database can.
|
| 2. It's still faster than a database, especially a database
| that's busy.
|
| #2 is an interesting point. When you benchmark, the normal
| process is to just set up a database then run a shitload of
| queries against it. I don't think a lot of people put actual
| production load on the database then run the same set of
| queries against it...usually because you don't have a
| production load in the prototyping phase.
|
| However, load does make a difference. It made more of a
| difference in the HDD era, but it still makes a difference
| today.
|
| I mean, redis is a cache, and you do need to ensure that stuff
| works if you purge redis (ie: be sure the rebuild process
| works), etc, etc.
|
| But just because it's old doesn't mean it's bad. OS/390 and
| AS/400 boxes are still out there doing their jobs.
| hobobaggins wrote:
| and those have reliable backup/restore infrastructure. Using
| redis as a cache is fine, just don't use it as your primary
| DB.
| nijave wrote:
| A pretty small Redis server can handle 10k clients and
| saturate a 1Gbps NIC. You'd need a pretty heavy duty Postgres
| database and definitely need a connection pooler to come
| anywhere close.
| anarazel wrote:
| I agree that redis can handle some query volumes and client
| counts that postgres can't.
|
| But FWIW I can easily saturate a 10GBit ethernet link with
| primary key-lookup read-only queries, without the results
| being ridiculously wide or anything.
|
| Because it didn't need any setup, I just used:
| SELECT * FROM pg_class WHERE oid = 'pg_class'::regclass;
|
| I don't immediately have access to a faster network,
| connecting via tcp to localhost, and using some moderate
| pipelining (common in the redis world afaik), I get up to
| 19GB/s on my workstation.
| qwerty3344 wrote:
| using chatGPT to summarize:
|
| This section summarizes the issue with the ChatGPT service which
| caused some users to be able to see another user's chat history
| and payment information. OpenAI took the ChatGPT service offline
| and published technical details of the problem. The bug was found
| in the Redis client open-source library and has been patched.
| OpenAI has reached out to affected users and improved their
| systems to prevent this from happening again. They have also
| expressed their commitment to protecting user privacy and their
| dedication to the open-source community.
| zamnos wrote:
| Interesting. I pasted in the outage report, and then asked
| ChatGPT to give me a single sentence summary and it gave me an
| answer way longer than a sentence. I pointed out that it wasn't
| a single sentence and it gave me:
|
| > ChatGPT experienced an outage due to a bug in an open-source
| library that allowed some users to see titles from another
| active user's chat history and payment-related information of
| 1.2% of ChatGPT Plus subscribers was exposed; the bug has since
| been patched and actions have been taken to prevent future
| incidents.
|
| I then asked it _why_ there was the outage and it spit out:
|
| > The exact cause of the Facebook outage is unknown, but it is
| believed to be related to a configuration change in the
| company's routers.
|
| It's likely because I ran out of tokens because the OpenAI
| outage report is long. Pasting in the text of the outage
| report, and then re-asking about why, it was able to give a
| much better answer:
|
| > There was an outage due to a bug in an open-source library
| that allowed some users to see titles from another active
| user's chat history and also unintentionally exposed payment-
| related information of 1.2% of ChatGPT Plus subscribers who
| were active during a specific nine-hour window.
|
| Querying it further, again having to repeat the whole OpenAI
| outage report, and asking it a few different ways I eventually
| managed to get this succinct answer:
|
| > The bug was caused by the redis-py library's shared pool of
| connections becoming corrupted and returning cached data
| belonging to another user when a request was cancelled before
| the corresponding response was received, due to a spike in
| Redis request cancellations caused by a server change on March
| 20.
|
| It did take me more than a few minutes to get to there, so just
| actually reading the report would have been faster, and I ended
| up having to read the report to verify that answer was correct
| and not a hallucination anyway, so our jobs are safe for now.
| flangola7 wrote:
| Try with GPT 4. The token window is quadruple.
| layer8 wrote:
| That sounds like the kind of bug that could be prevented by
| modeling with TLA+.
| m00dy wrote:
| maybe they've just scrolled over issue lists of popular tech
| stacks and cherry-picked the most compelling one to bury the
| dirt.
| w10-1 wrote:
| It's interesting (read: wrong) for an AI company to bother
| writing the user interface for their web application.
|
| This was a failure of integration testing and defensive design,
| whether the component was open-source or not. There's no reason
| to believe that an AI company would have the diligence and
| experience to do the grunt work of hardening a site.
|
| But management obviously understood the level and character of
| interest. Actual users include probably 10,000 curiosity seekers
| for every actual AI researcher, with 1,000 of those being
| commercial prospects -- people who might buy their service.
|
| This is a clear sign that the managers who've made technical
| breakthroughs in AI are not capable even of deploying the
| service at scale -- no less managing the societal consequences of
| AI.
|
| The difficulty with the board getting adults in the room is that
| leaders today give the appearance of humility and cooperation,
| with transparent disclosures and incorporation of influencers
| into advisory committees. The leaders may believe their own
| abilities because their underlings don't challenge them. So
| there's no obvious domineering friction, but the risk is still
| there, because of inability to manage.
|
| Delegation is the key to scaling, both code and organizations.
| "Know thyself" is about knowing your limits, and having the
| humility to get help instead of basking in the puffery of being
| in control.
|
| This isn't a PR problem. It's the Achilles' heel of capitalism,
| and the capitalists in OpenAI's board should nip this incipient
| Musk in the bud or risk losing 2-3 orders of magnitude return on
| their investment.
| stygiansonic wrote:
| The key part:
|
| _If a request is canceled after the request is pushed onto the
| incoming queue, but before the response popped from the outgoing
| queue, we see our bug: the connection thus becomes corrupted and
| the next response that's dequeued for an unrelated request can
| receive data left behind in the connection._
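The failure mode in that quote can be reproduced with a toy model of a pipelined connection. This is a rough sketch with hypothetical class names, not redis-py's actual code: replies queue up on the connection in request order, so an unread reply is consumed by whoever reads next.

```python
from collections import deque

class Connection:
    """Toy pipelined connection: requests go out in order and
    replies are read back in the same order."""
    def __init__(self):
        self.replies = deque()

    def send(self, command, reply):
        # pretend the server eventually produces `reply` for `command`
        self.replies.append(reply)

    def read_reply(self):
        return self.replies.popleft()

conn = Connection()

# User A's request is sent, but the caller is cancelled before
# reading the reply, which stays queued on the connection.
conn.send("GET user:A:email", reply="alice@example.com")

# The dirty connection is reused for user B. B's read now pops
# the reply that was left behind for A.
conn.send("GET user:B:email", reply="bob@example.com")
leaked = conn.read_reply()
print(leaked)  # alice@example.com -- user B receives user A's data
```

Once the connection is off by one reply, every subsequent request on it stays shifted, which is why the postmortem describes the connection as "corrupted" rather than a one-off glitch.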
| [deleted]
| qwertox wrote:
| This reminds me of a comment I made 1.5 months ago [0]:
|
| I was logging in during heavy load, and after typing the question
| I started getting responses to questions which I didn't ask.
|
| gdb answered on that comment "these are not actually messages
| from other users, but instead the model generating something
| ~random due to hitting a bug on our backend where, rather than
| submitting your question, we submitted an empty query to the
| model."
|
| I wonder if it was the same redis-py issue back then, but just at
| another point in the backend. His answer didn't really convince
| me back then.
|
| [0] https://news.ycombinator.com/item?id=34614796&p=2#34615875
| lopkeny12ko wrote:
| The original issue report is here:
| https://github.com/redis/redis-py/issues/2624
|
| This bit is particularly interesting:
|
| > I am asking for this ticket to be re-oped, since I can still
| reproduce the problem in the latest 4.5.3. version
|
| Sounds like the bug has not actually been fixed, per drago-balto.
| nvartolomei wrote:
| I wonder how much time passed between the first case of
| corruption leading to exceptions (which they ignored as "eh,
| not great, not terrible, we'll look at it later") and users
| reporting seeing other users' data?
| jchw wrote:
| Does anyone else find it a bit off-putting how much emphasis they
| keep putting on "open source library"? I don't think I've read
| about this without the word open source appearing more than once
| in their own messaging about it. Why is it so important to
| emphasize that the library with the bug is open source?
|
| The cynic in me wants to believe that it's a way of deflecting
| blame somehow, to make it seem like they did their due diligence
| but were thwarted by something outside of their control. I don't
| think it holds. If you use an open source library with no
| warranty, you are responsible (legally and otherwise) to ensure
| that it is sufficient. For example, if you break HIPAA compliance
| due to an open source library, it is still you who is responsible
| for that.
|
| But of course, they're not claiming it's anyone else's fault
| anywhere explicitly, so it's uncharitable to just assume that's
| what they meant. Still, it rubs me the wrong way. I can't fight
| the feeling that it's a _wink wink nudge nudge_ to give them more
| slack than they'd otherwise get. It feels like it's inviting you
| to just criticize redis-py and give them a break.
|
| The open postmortem and whatnot is appreciated and everything,
| but sometimes it's important to be mindful of what you emphasize
| in your postmortems. People read things even if you don't write
| them, sometimes.
| babl-yc wrote:
| I don't find it over-emphasized. Many in the Twitter-sphere are
| acting as if they aren't being appreciative of open source
| software and I don't see it that way.
|
| The technical root cause was in the open source library.
| There's a patch available and more likely than not OpenAI will
| continue to use the library.
|
| Being overly sensitive to blame would distract from the
| technical issue at hand. It's great they are posting this post-
| mortem to raise awareness that the libraries you use can have
| bugs and to consider that risk when building systems.
| fabianhjr wrote:
| A root cause analysis would likely also include the lack of
| threat modeling / security evaluation of their dependencies.
|
| It would likely also question the lack of resources allocated
| to these open source projects by companies that profit, in
| part, from using them.
| bitxbitxbitcoin wrote:
| Not surprising from a company that calls itself openai. The
| "open source" keyword stuffing is so people associate the open
| from openai with open source. Psyops I mean marketing 101.
| ishaanbahal wrote:
| The emphasis could also have been intended to educate folks
| using this combination to check their setup.
|
| The version reference should also have been included in the
| postmortem, as general guidance to their own readers, but at
| least a quick Google search leads you to it.
|
| https://github.com/redis/redis-py/releases/tag/v4.5.3
|
| For anyone reading this and using a combination of asyncio and
| py-redis, please bump your versions.
|
| Similar issues I've encountered with asyncio python and
| postgres too in the past when trying to pool connections. It's
| really not easy to debug them either.
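As a sketch of the version check suggested above (assuming plain "X.Y.Z" version strings; v4.5.3 is the release linked in the parent comment):

```python
# Check whether an installed redis-py version predates the fix
# shipped in v4.5.3 (assumes plain "X.Y.Z" version strings).
def needs_upgrade(version: str, patched: tuple = (4, 5, 3)) -> bool:
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts < patched

print(needs_upgrade("4.5.2"))  # True  -> upgrade
print(needs_upgrade("4.5.3"))  # False -> already patched
```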
| layer8 wrote:
| I think you're overreacting. What bothered me is that they
| didn't link to the actual bug or provide a reference ID.
| mlsu wrote:
| The gaping hole in this write-up goes something like:
|
| "In order to prevent a bug like this from happening in the
| future, we have stepped up our review process for external
| dependencies. In addition, we are conducting audits around code
| that involves sensitive information."
|
| Of course, we all know what actually happened here:
|
| - we did no auditing;
|
| - because our audit process consists of "blame someone else
| when our consumers are harmed";
|
| - because we would rather not waste dev time on making sure our
| consumers are not harmed
|
| If you want to know why no software "engineering" is happening
| here, this is your answer. Can you imagine if a bridge
| collapsed, and the builder of the bridge said, "iunno, it's the
| truck's fault for driving over the bridge."
| marshmellman wrote:
| Are you confident that an audit would have uncovered this
| bug? I'd be surprised if audits are effective at finding
| subtle bugs and race conditions, but I could be wrong.
| cwkoss wrote:
| If the FTC had teeth and good judgement, they'd force OpenAI to
| rename themselves.
| kobalsky wrote:
| the library is provided by the redis team themselves and the
| bug is awful [1]. I know it's not redis' fault but this bug
| could hit anyone. Connections may be left in a dirty state
| where they return data from the previous request in the
| following one.
|
| [1] https://github.com/redis/redis-py/issues/2624
| adrianmonk wrote:
| I noticed it too, but it doesn't necessarily bother me.
| Possibly they're just trying to say, "This incident may have
| made us look like we're complete amateurs who don't have any
| clue about security, but it wasn't like that."
|
| Using someone else's library doesn't absolve you of
| responsibility, but failing to be vigilant at thoroughly
| vetting and testing external dependencies is a different kind
| of mistake than creating a terrible security bug yourself
| because your engineers don't know how to code or everyone is in
| too much of a rush to care about anything.
| thefreeman wrote:
| They really skirt around the fact that they apparently
| introduced a bug which quite consistently initiated redis
| requests and terminated the connection before receiving the
| result.
| mewpmewp2 wrote:
| Yes, I agree with that sentiment, and I thought precisely the
| same. I know as an engineer that I would feel compelled to
| mention that it was an obscure bug in an open source library,
| if that was the case. Not to excuse myself of responsibility,
| but because I would feel so ashamed if I myself introduced
| such an obvious security flaw. I would still of course
| consider myself responsible for what happened.
|
| A lot of the time when people make mistakes, they explain
| themselves because they are afraid of being perceived as
| completely stupid or incompetent for making that mistake, not
| to excuse themselves of responsibility, even though people
| frequently think that an excuse or explanation means you are
| trying to absolve yourself of what you did.
|
| There's a huge difference to me between having an obscure bug
| like this and introducing that type of security issue because
| you couldn't logically consider it. The first can be resolved
| in the future by introducing processes to make sure all open
| source libraries are from trusted sources, but the second
| implies that you are fundamentally unable to reason about it,
| and therefore probably unable to improve.
| mlsu wrote:
| Why?
|
| The result for the end consumer is identical whether they
| have their PII leaked from "an external library" vs a
| vendor's own home-baked solution.
|
| It's not really a different kind of mistake, it's exactly the
| same kind of mistake, because it is exactly the same mistake!
| This is talking the talk, and not walking the walk, when it
| comes to security.
|
| Publishing a writeup that passes the buck to some (unnamed)
| overworked and underpaid open source maintainer is _worse_,
| not better!
| Veserv wrote:
| I agree, it is a different kind of mistake; it is immensely
| worse than creating a terrible security bug yourself.
|
| Outsourcing your development work without acceptance
| criteria and without validation of fitness for purpose is
| complete, abject engineering incompetence. Do you think
| bridge builders look at the rivets in the design and then
| just waltz over to Home Depot and just pick out one that
| looks kind of like the right size? No, they have exact
| specifications and it is their job to source rivets that meet
| those specifications. They then either validate the rivets
| themselves or contract with a reputable organization that
| _legally guarantees_ they meet the specifications and it
| might be prudent to validate it again anyways just to be
| sure.
|
| The fact that, in software, not validating your dependencies,
| i.e. the things your system _depends_ on, is viewed as not so
| bad is a major reason why software security is such an utter
| joke and why everybody keeps making such utterly egregious
| security errors. If one of the worst engineering practices is
| viewed as normal and not so bad, it is no wonder the entire
| thing is utterly rotten.
| jchw wrote:
| I do not believe it's necessarily _nefarious_ in nature, but
| maybe more specifically it feels kind of like they're
| implying that this is actually a valid escape hatch: "Sorry,
| we can't possibly audit this code because who audits all of
| their open source deps, amirite?"
|
| But the truth is that actually, maybe that hints at a deeper
| problem. It was a direct dependency to their application code
| in a critical path. I mean, don't get me wrong, I don't think
| everyone can be expected to audit or fund auditing for every
| single line of code that they wind up running in production,
| and frankly even doing that might not be good enough to
| prevent most bugs anyways. Like clearly, every startup fully
| auditing the Linux kernel before using it to run some HTTP
| server is just not sustainable. But let's take it back a
| step: if the point of a postmortem is to analyze what went
| wrong to prevent it in the future, then this analysis has
| failed. It almost reads as "Bug in an open source project
| screwed us over, sorry. It will happen again." I realize
| that's not the most charitable reading, but the one takeaway
| I had is this: They don't actually know how to prevent this
| from happening again.
|
| Open source software helps all of us by providing us a wealth
| of powerful libraries that we can use to build solutions, be
| we hobbyists, employees, entrepreneurs, etc. There are many
| wrinkles to the way this all works, including obviously
| discussions regarding sustainability, but I think there is
| more room for improvement to be had. Wouldn't it be nice if
| we periodically had actual security audits on even just the
| most popular libraries people use in their service code?
| Nobody in particular has an impetus to fund such a thing, but
| in a sense, everyone has an impetus to fund such work, and
| everyone stands to gain from it, too. Today it's not the
| norm, but perhaps it could become the norm some day in the
| future?
|
| Still, in any case... I don't really mean to imply that
| they're being nefarious with it, but I do feel it comes off
| as at _best_ a bit tacky.
| xxpor wrote:
| I mean, if there were ever a company in a position to
| figure out a scalable way to audit OSS before usage, it'd
| be OpenAI, right?
| jvm___ wrote:
| Doesn't bother me either. All the car companies issue recalls
| regularly, sometimes an issue only shows up when the system
| hits capacity or you run into an edge case.
| skybrian wrote:
| I think you're reading too much into it. Being an open source
| library is relevant because it means it's third party and
| doesn't come with a support agreement, so fixing a bug is a
| somewhat different process than if it were in your own code or
| from a proprietary vendor.
|
| Yes, it's technically up to you to vet all your dependencies,
| but in practice, often it doesn't happen, people make
| assumptions that the code works, and that's relevant too.
| fabianhjr wrote:
| Open source can be fixed as if it were your own code. (And
| that is a strong tenet of free/open source software)
|
| Not only do most open/free source libraries come without
| support agreements: they come with the broadest possible
| limitation of warranties. (As they should)
|
| So the company, knowing that what they are using comes
| without any warranty of quality or of fitness for the use-
| case, has a very strong burden of due diligence / vetting.
| danenania wrote:
| Also, vetting a dependency != auditing and testing every line
| of code to find all possible bugs.
|
| If this bug was an open issue in the project's repo, that
| might be concerning and indicate that proper vetting wasn't
| done. Ditto if the project is old and unmaintained, doesn't
| have tests, etc. But if they were the first to trigger the
| bug and it only occurs under heavy load in production
| conditions, well, running into some of those occasionally is
| inevitable. The alternative is not using any dependencies, in
| which case you'd just be introducing these bugs yourself
| instead. Even with very thorough testing and QA, you're never
| going to perfectly mimic high load production conditions.
| JohnFen wrote:
| > in practice, often it doesn't happen, people make
| assumptions that the code works
|
| True, but that's an inexcusable practice and always has been.
| We as an industry need to stop accepting it.
| isopede wrote:
| What do you mean by "stop accepting it?"
|
| All of us rely on millions of lines of code that we have
| not personally audited every single day. Have you audited
| every framework you use? Your kernel? Drivers? Your
| compiler? Your CPU microcode? Your bootrom? The firmware in
| every gizmo you own?
|
| If "Reflections on Trusting Trust" has taught us anything,
| it's turtles all the way down. At some point, you have to
| either trust something, or abandon all hope and trust
| nothing.
| JohnFen wrote:
| > Have you audited every framework you use? Your
| compiler? Your CPU microcode? Your bootrom?
|
| Of course not. I exclude the CPU microcode, bootrom, and
| the like from the discussion because that's not part of
| the product being shipped.
|
| But it's also true that I don't do a deep dive analyzing
| every library I use, etc. I'm not saying that we should
| have to.
|
| What I'm saying is that when a bug pops up, that's on us
| as developers even when the bug is in a library, the
| compiler, etc. A lot of developers seem to think that
| just because the bug was in code they didn't personally
| write, that means that their hands are clean.
|
| That's just not a viable stance to take. The bug should
| have been caught in testing, after all.
|
| If your car breaks down because of a design failure in a
| component the auto manufacturer bought from another
| supplier, you'll still (rightfully) hold the auto
| manufacturer responsible.
| skybrian wrote:
| > when a bug pops up
|
| That's reacting to a bug you know about. Do you mean to
| talk about how developers aren't good enough at reacting
| to bugs found in third party libraries, or how they
| should do more prevention?
|
| In this case, it seems like OpenAI reacted fairly
| appropriately, though perhaps they could have caught it
| sooner since people reported it privately.
|
| "Holding someone responsible" is somewhat ambiguous about
| what you expect. It seems reasonable that a car
| manufacturer should be prepared to do a recall and to pay
| damages without saying that they should be perfect and
| recalls should never happen.
| JohnFen wrote:
| > Do you mean to talk about how developers aren't good
| enough at reacting to bugs found in third party
| libraries, or how they should do more prevention?
|
| My point was neither of these. My point is very simple:
| the developers of a product are responsible for how that
| product behaves.
|
| I'm not saying developers have to be perfect, I'm just
| saying that there appears to be a tendency, when
| something goes wrong because of external code, to deflect
| blame and responsibility away from them and onto the
| external code.
|
| I think this is an unseemly thing. If I ship a product
| and it malfunctions, that's on me. The customer will
| rightly blame me, and it's up to me to fix the problem.
|
| Whether the bug was in code I wrote or in a library I
| used isn't relevant to that point.
| JohnFen wrote:
| > The cynic in me wants to believe that it's a way of
| deflecting blame somehow
|
| That's how it reads to me as well.
|
| Of course, it doesn't deflect blame at all. Any time you
| include code in your project, no matter where the code came
| from, you are responsible for the behavior of that code.
| amtamt wrote:
| Was the postmortem generated by ChatGPT?
| dilap wrote:
| I half agree, but I also half-sympathize with them, because it
| really wasn't their fault -- it was a quite-bad bug in a very
| fundamental library.
|
| Bugs happen, though. Especially in Python.
| airstrike wrote:
| _> Especially in Python._
|
| as opposed to...?
| moffkalast wrote:
| As opposed to not in Python.
| deathanatos wrote:
| ... like JavaScript? Bash? C? PHP?
|
| Certainly none of those are widely used and have a
| reputation for making it easy to keep the gun aimed
| squarely at the foot.
| moffkalast wrote:
| Those would be roughly similar. The main difference would
| be between dynamically typed interpreted languages and
| statically typed compiled ones I guess. At least I think
| I make fewer mistakes when the compiler literally tells me
| what's wrong before I even run the thing. It's awful and
| slow to develop that way, but it is more reliable for
| when that's a requirement.
|
| So compared to ones like Kotlin or Rust.
| dilap wrote:
| Go, for one.
|
| In my experience errors are more common (for both cultural
| and technological reasons) in Python than in Go.
|
| I would guess something similar applies to Rust, though I
| don't have personal experience.
|
| There's wide variation in C, but with careful
| discrimination, you can find very high-quality libraries or
| software (redis itself being an excellent example).
|
| I don't have rigorous data to back this stuff up, but I'm
| pretty convinced it's true, based on my own experience.
| qwertox wrote:
| I was upvoting you, but then reading
|
| > Especially in Python.
|
| made me unvote.
| kljhghfgdfjkgh wrote:
| it really _was_ their fault. they chose to ship the bug. it
| doesn't matter in the least that someone else previously
| published the code under a license with no warranty
| whatsoever.
| gkbrk wrote:
| Instead of spending engineering time, they used a free and
| open-source library to do less work.
|
| The license they agreed to in order to use this library has
| this in capital letters. [THE SOFTWARE IS PROVIDED "AS IS",
| WITHOUT WARRANTY OF ANY KIND].
|
| After agreeing to this license and using the library for
| free, they charged people money and sold them a service. And
| when that library they got for free, which they read and
| agreed that had no warranty of any kind, had a software bug,
| they wrote a blog post and blamed the outage of their paid
| service on this free library.
|
| This is not another open-source project, or a small business.
| This is a company that got billions of dollars in investment,
| and a lot of income by selling services to businesses and
| individuals. They don't get to use free, no-warranty code
| written by others to save their own money, and then blame it
| and complain about it loudly for bugs.
| JohnFen wrote:
| > it really wasn't their fault -- it was a quite-bad bug in a
| very fundamental library.
|
| It's still their fault. When you ship code, you are
| responsible for how that code behaves regardless of where the
| code came from.
| JamesBarney wrote:
| Only for some incredibly broad definition of fault that
| almost no one uses.
|
| How many people make sure all of the open source libraries
| they're using are bug free?
|
| Anyone besides maybe NASA?
| JohnFen wrote:
| > Only for some incredibly broad definition of fault that
| almost no one uses.
|
| It's a definition most laypeople use. It's developers who
| tend to use a very narrow definition.
|
| I don't think it should be controversial to say that when
| you ship a product, you are responsible for how that
| product behaves.
| pjmlp wrote:
| Anyone that has to pay from their own pocket when things
| go wrong, like consulting warranties, liabitiliy in
| security exploits,...
| majormajor wrote:
| I've never cared per se that a library was bug free but
| I've put a lot of effort/$ into making sure _the features
| that used the libraries in my product_ were bug free
| (with the amount of effort depending on the sensitivity
| of the feature, data, etc).
|
| Usually "fix the original library" wasn't as easy or
| immediate a fix as "hack around it", which is sad for the
| overall OSS ecosystem, but it's still the responsibility of
| the person releasing a product.
|
| Unfortunately these sorts of bugs are wildly difficult to
| predict. Yet it's also a wildly common architecture.
| _That's_ what's sad for all of us as engineers as a
| whole. But "caching credit card details and home
| addresses", for instance, is... particularly dicey.
| That's very sensitive, and you're tossing it into more
| DBs, without good access control restrictions?
| rschoultz wrote:
| Anywhere you have payment-related or any other PII data,
| transitive dependencies, framework and language choices,
| memory sharing, and other risks must be taken into account
| as something that you, as someone developing and operating a
| service, are solely responsible for.
| practice9 wrote:
| There have been several reports of this issue in Feb/early
| March on the r/ChatGPT subreddit - OpenAI could have known if
| they had listened to the community.
|
| Alternatively, they knew about it, and didn't fix the bug
| until it bit them
| JamesBarney wrote:
| This doesn't come across that way to me at all. They just
| described what happened. Do you expect them to jump in front of
| a bus for the library they're using, and beg for forgiveness
| for not ensuring the widely used libraries they're leveraging
| are bug free?
|
| There are very few companies that couldn't get caught by this
| type of bug.
| nickvincent wrote:
| Basically agree -- feels off-putting, but not technically a
| wrong detail to add. An additional reason it rubs me the wrong
| way, however, is that I believe open-source software code is
| especially critical to ChatGPT family's capabilities. Not just
| for code-related queries, but for everything! (e.g. see this
| "lineage-tracing" blog post: https://yaofu.notion.site/How-
| does-GPT-Obtain-its-Ability-Tr...)
|
| Thus, I honestly think firms operating generative AI should be
| walking on eggshells to avoid placing blame on "open-source".
| Rather, they really should be going out of their way to
| channel as much positive energy towards it as possible.
|
| Still, agree the charitable interpretation is that this is
| just purely descriptive.
| jatins wrote:
| I think you are reading a bit between the lines, and didn't
| feel them blaming the library as much as stating that the bug
| happened because of an issue in the library. Maybe they could
| have sugarcoated it between 10 layers of corporate jargon but
| I'd rather take this over that
| thequadehunter wrote:
| Personally, I think it was partially a virtue signal to show
| that they use open source software and collaborate with the
| maintainers.
| chamakits wrote:
| I've also noticed it, and I can't help but interpret it as
| their way of shifting blame. Which is irresponsible. It's their
| product, and they need to take accountability for the bug
| occurring.
|
| It's a serious bug, but in the grand scheme of things, not
| earth shattering, and not something that I think would
| discourage usage of their product. But their treatment of the
| bug causes more concerns than the bug itself. They are shifting
| the blame onto the library with the bug, rather than onto
| their own process by which that library made it into their
| product. And I don't understand how they can't see
| how that reflects poorly on them as an AI company.
|
| I find it so confusing that at the end of the day, OpenAI's
| biggest product is having created a good process by which to
| create value out of a massive amount of data, and build a good
| API on top of it. And the open source library is effectively
| something they processed into their product and built an API
| based off of it. So it creates (to me) some amount of doubt
| about how they will react when faced with similar challenges to
| their core product. How will they behave when the data they
| consume impacts their product negatively? From limited
| experience, they'll shift the blame to the data, not their
| process, and keep it pushing.
|
| It seems likely that this is only the beginning of OpenAI
| having a large customer base, with a high impact on many
| products. This is a disappointing result on their first test on
| how they'll manage issues and bugs with their products.
| metanonsense wrote:
| I don't know. To me it's simply an explanation of what has
| happened. I think its exactly what I would have written if I
| was in their position. And show me the one company that has
| audited all source code of all used open source projects, at
| least in a way that is able to rule out complex bugs like this.
| I once found a memory corruption bug in Berkeley DB
| wrecking our huge production database, which I would have never
| found in any pre-emptive source code audit, however detailed.
|
| Edit: On second thought, maybe they could have just written
| "external library" instead of "open source library".
| davedx wrote:
| They were/are storing payment data in redis? LOL!
| taxman22 wrote:
| The postmortem doesn't say that. It just says they were caching
| "user information". Maybe that includes a Stripe customer or
| subscription ID that they look up before sending an email, for
| example.
| tmpz22 wrote:
| Yeah probably the session id and when the wrong session id is
| returned other operations like GET User details would pull
| its data from relational storage.
| galnagli wrote:
| Well - they have had more bugs, and will have more bugs to
| worry about.
|
| https://twitter.com/naglinagli/status/1639343866313601024
| abujazar wrote:
| The disclosure provides valuable information, but the
| introduction suggests someone else, or "open-source", is to
| blame:
|
| >We took ChatGPT offline earlier this week due to a bug in an
| open-source library which allowed some users to see titles from
| another active user's chat history.
|
| Blaming an open-source library for a fault in a closed-source
| product is simply unfair. The MIT-licensed dependency explicitly
| comes without any warranties. After all, the bug went unnoticed
| until ChatGPT put it under pressure, and it was ChatGPT that
| failed to rule out the bug in their release QA.
| ajhai wrote:
| > In the hours before we took ChatGPT offline on Monday, it was
| possible for some users to see another active user's first and
| last name, email address, payment address, the last four digits
| (only) of a credit card number, and credit card expiration date
|
| This is a lot of sensitive data. It says 1.2% of ChatGPT Plus
| subscribers active during a 9 hour window, which considering
| their user base must be a lot.
| mach1ne wrote:
| It's a bit unclear whether this means that 1.2% of all
| ChatGPT Plus subscribers were active during that 9-hour
| window.
| jkern wrote:
| Funnily enough I've had a very similar bug occur in an entirely
| separate redis library. It was a pretty troubling failure mode to
| suddenly start getting back unrelated data
| pixl97 wrote:
| There are 2 hard problems in computer science: cache
| invalidation, naming things, and off-by-1 errors.
| deathanatos wrote:
| ... in this case this variant seems more appropriate:
|
| There are 3 hard problems in Computer Science:
| 1. naming things
| 2. cache invalidation
| 3.
| 4. off-by-one errors
| concurrency
| DeathArrow wrote:
| Am I the only one terribly bored by the assault of trivial AI
| news in recent months?
|
| Every fart some AI-related person makes becomes huge news. And
| it's followed by tens of random blog posts, all posted to HN.
| Nuzzerino wrote:
| At least it isn't about the Rust language this time _grumbles_
| DeathArrow wrote:
| Because Rust hasn't conquered AI the way it conquered crypto.
|
| But we will see AI stuff rewritten in Rust quite soon.
| spprashant wrote:
| For some reason I liked reading about Rust (or any other
| technology) a lot more than the AI.
|
| Part of it is that the average engineer could understand and
| grok what those articles were talking about, and I could
| appreciate, relate, and if applicable criticize it.
|
| The AI news just seems to swing between hype and doomsday
| prophecies, and little discussion about the technical aspects
| of it.
|
| Obviously OpenAI choosing to keep it closed source makes any
| in-depth discussion close to impossible, but also some of
| this is so beyond the capabilities of an average engineer
| with a laptop. It can be frustrating.
| [deleted]
| polyrand wrote:
| Commit fixing the bug:
|
| https://github.com/redis/redis-py/commit/66a4d6b2a493dd3a20c...
| ketchupdebugger wrote:
| It's surprising that OpenAI seems to be the only one being
| affected. If the issue is with redis-py reusing connections, then
| wouldn't more companies/products be affected by this?
| zzzeek wrote:
| their description of the problem seemed kind of obtuse. In
| practice, these connection-pool related issues have to do with:
| 1. request is interrupted
| 2. exception is thrown
| 3. catch exception, return connection to pool, move on
| The thing that has to be implemented is 2a: clean up the state
| of the connection when the interrupted exception is caught,
| _then_ return it to the pool.
|
| that is, this seems like a very basic programming mistake and
| not some deep issue in Redis. the strange way it was described
| makes it seem like they're trying to conceal that a bit.
| roberttod wrote:
| It's an open source library; I assume that logic is
| abstracted within it and that the "basic mistake" was the
| maintainer's.
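The "2a" cleanup step described above can be sketched roughly like this (hypothetical pool and connection classes, not redis-py's real API): on interruption, a connection with an unread reply is discarded rather than returned to the pool.

```python
class Connection:
    """Minimal stand-in for a pooled connection."""
    def __init__(self):
        self.pending = 0      # replies sent for but not yet read
        self.closed = False

    def send(self):
        self.pending += 1

    def read_reply(self):
        self.pending -= 1

    def has_pending_reply(self):
        return self.pending > 0

    def close(self):
        self.closed = True

class Pool:
    def __init__(self):
        self.free = []

    def get(self):
        return self.free.pop() if self.free else Connection()

    def release(self, conn, *, interrupted=False):
        # step 2a: a connection with an unread reply is in an
        # unknown state, so discard it instead of reusing it
        if interrupted and conn.has_pending_reply():
            conn.close()
            return
        self.free.append(conn)

pool = Pool()
conn = pool.get()
conn.send()                           # request went out...
pool.release(conn, interrupted=True)  # ...but was cancelled mid-flight
print(conn.closed, conn in pool.free)  # True False
```

Dropping the connection costs a reconnect, which is why it is tempting (and wrong) to return it to the pool as-is.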
| 19h wrote:
| It boggles my mind how they're not absolutely checking the user &
| conversation id for EVERY message in the queue given the possible
| sensitivity of the requests. How is this even remotely
| acceptable?
|
| In the reddit post that first surfaced this, the user saw
| conversations related to politics in China and other rather
| sensitive topics related to the CCP.
|
| This can absolutely get people hurt and they absolutely must
| take this seriously.
| zaroth wrote:
| It doesn't boggle my mind at all. Session data appears, and is
| used to render the page. Do you verify every time the actual
| cookie and go back to the DB to see what user it pointed to?
|
| No, everyone assumes their session object is instantiated with
| the right values at that level of the code.
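The per-message ownership check 19h argues for could, as a rough sketch, look like this (hypothetical cache layout in which each cached entry is tagged with its owner; names and structure are illustrative, not OpenAI's actual schema):

```python
class OwnershipError(Exception):
    pass

# hypothetical cache layout: each entry records which user owns it
cache = {
    "conv:123": {"owner": "user_a", "title": "travel plans"},
}

def get_conversation_title(cache, conv_id, requesting_user):
    entry = cache[conv_id]
    # even if a lower layer mixed up responses, refuse to serve a
    # payload that belongs to someone else
    if entry["owner"] != requesting_user:
        raise OwnershipError(f"{conv_id} is not owned by {requesting_user}")
    return entry["title"]

print(get_conversation_title(cache, "conv:123", "user_a"))  # travel plans
```

The trade-off zaroth raises still applies: this only works if the ownership tag travels with the cached value, rather than being inferred from the (possibly already mixed-up) session.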
| m_0x wrote:
| Did they use chat-gpt to fix the bug?
| sergiotapia wrote:
| It sounds like their redis key was not unique enough and yada
| yada yada it returned sensitive info to the wrong people.
| Jabrov wrote:
| Did you read the article? That's not at all what happened.
___________________________________________________________________
(page generated 2023-03-24 23:00 UTC)