Post ATms2g3uFlN112Ji4m by gray17@mastodon.social
 (DIR) Post #ATm65DBlGS8eXH8Lvk by TedUnderwood@sigmoid.social
       2023-03-19T13:27:58Z
       
       0 likes, 1 repeats
       
       Blog post comparing #GPT4 to other forms of text analysis. tl;dr: Yes, it's more accurate, but the real advantage is that it thinks out loud and can argue with you. #LLM #AI @dh https://tedunderwood.com/2023/03/19/using-gpt-4-to-measure-the-passage-of-time-in-fiction/
       
 (DIR) Post #ATm6y1DhSuinLIAsiG by TedUnderwood@sigmoid.social
       2023-03-19T13:37:53Z
       
       0 likes, 0 repeats
       
       Many thanks to @simon and @quinnanya who shared code for using the Turbo API. Incidentally that happened here on Mastodon: Fediverse open science FTW!
       
 (DIR) Post #ATm7v7IgVBnFHmgbAm by afamiglietti79@mastodon.social
       2023-03-19T13:48:34Z
       
       0 likes, 0 repeats
       
        @TedUnderwood @dh Thanks for this; it's a really useful pointer towards method. It's also a helpful reminder that the default "chat" interface (where you just rely on *content* from GPT's training data) is mostly just a demo/party trick. The real power (and, accordingly, the real danger) is in using the model's "understanding" of structure on more carefully selected problems and data.
       
 (DIR) Post #ATm9ve0v4eGYeYkvvU by TedUnderwood@sigmoid.social
       2023-03-19T14:11:05Z
       
       0 likes, 0 repeats
       
       @afamiglietti79 @dh Yep. The model is just a little machine like you might get in a LEGO set. The interesting thing is going to be that we can connect them together in all kinds of different ways and point them at stuff.
       
 (DIR) Post #ATmCMay0i5tQvkNGFM by talyarkoni@sigmoid.social
       2023-03-19T14:38:20Z
       
       0 likes, 0 repeats
       
        @TedUnderwood @dh Very cool. Re: the last part of your post ("humans needed to define constructs"), I wouldn't be surprised if you could take the samples where two humans disagree, feed them to GPT-4, and have it tell you how the implicit conceptions of a term appear to differ between users.
       
 (DIR) Post #ATmDTXER5yZHKMvjEG by TedUnderwood@sigmoid.social
       2023-03-19T14:50:45Z
       
       0 likes, 0 repeats
       
       @talyarkoni Interesting. This is where things are going to get fun. I tried to draw a crisp line between human intersubjectivity and what I called "backtalk" from the model ... but that line is not actually going to stay crisp.
       
 (DIR) Post #ATmE2JVK7t8Mt5Pesa by raiderrobert@mastodon.social
       2023-03-19T14:57:05Z
       
       0 likes, 0 repeats
       
       @TedUnderwood @dh "Reliance on OpenAI is still a bad idea in the long run. Universities should develop their own models and APIs."💯💯💯
       
 (DIR) Post #ATmTmfpYwpZClURZeS by fotis_jannidis@fedihum.org
       2023-03-19T17:53:32Z
       
       0 likes, 1 repeats
       
       @TedUnderwood @dh Very interesting report; thanks for sharing. Amazing how quickly it goes from !!😬!! to 'wake up each moment with eternal sunshine of the spotless mind.' Not sure about your last point: "To be confident that we’re measuring something called 'suspense' we need to show that multiple people recognize it as suspense." We can always define a concept and then apply it. The performance of the model is additional feedback on the quality/adequacy of our definition, isn't it?
       
 (DIR) Post #ATmUf8jdhKNo32e1lQ by TedUnderwood@sigmoid.social
       2023-03-19T18:03:24Z
       
       0 likes, 0 repeats
       
       @fotis_jannidis I'm not totally sure; I would really like to see a longer exploration of this question. In the post I took a stance that is conventional in the social sciences. But that doesn't necessarily mean it's correct, especially with these new models that can sort of "argue back."
       
 (DIR) Post #ATmV4zAdsfrZuhbDGa by TedUnderwood@sigmoid.social
       2023-03-19T18:08:04Z
       
       0 likes, 0 repeats
       
       @dh Over on the bird site, @dbamman adds BERT as a data point, and it equals or out-performs GPT-4: https://twitter.com/dbamman/status/1637515288584527872
       
 (DIR) Post #ATmncDO3qNHwjhs1RI by gray17@mastodon.social
       2023-03-19T21:35:45Z
       
       0 likes, 0 repeats
       
       @TedUnderwood @dh This is an interesting application, and I expect more things like this will be useful in the future. But right now, I'm still somewhat skeptical that when you ask GPT for an "explanation of its thoughts", the explanation has any connection to reality. It seems likely to be as much a confabulation/hallucination as anything else, which can be useful, but I don't see how to have confidence that it isn't just hard-to-detect nonsense.
       
 (DIR) Post #ATmrk8aPya4ByqgR5k by TedUnderwood@sigmoid.social
       2023-03-19T22:22:01Z
       
       0 likes, 0 repeats
       
        @gray17 @dh I don’t think it has an ability to introspect. But chain-of-thought prompting works because word n+2 is shaped by words n and n+1, and so on. So the trace is meaningful without any need for introspection.
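
        To make that mechanism concrete, here is a minimal sketch of a greedy autoregressive decoding loop, using GPT-2 via the transformers library purely as a stand-in (nothing here is code from the thread or the blog post). Every new token is chosen conditioned on all previously generated tokens, so any reasoning text the model emits becomes part of the context that shapes the eventual answer.

        # Illustrative only: greedy autoregressive decoding with a stand-in model.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2")

        ids = tok("Q: ... Let's think step by step.", return_tensors="pt").input_ids
        for _ in range(50):
            logits = model(ids).logits[:, -1, :]           # scores for the next word
            next_id = logits.argmax(dim=-1, keepdim=True)  # pick the most likely one
            ids = torch.cat([ids, next_id], dim=-1)        # the trace grows; every later
                                                           # word is conditioned on it
        print(tok.decode(ids[0]))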
       
 (DIR) Post #ATms2g3uFlN112Ji4m by gray17@mastodon.social
       2023-03-19T22:25:22Z
       
       0 likes, 0 repeats
       
        @TedUnderwood @dh Right, but that works only because of statistical patterns. The "explanation" is derived from the previous text, but it's not clear to me that it won't be misled by patterns that it created itself. It's very easy to fool GPT with, e.g., a riddle that looks like something it's seen before but has a small difference that makes the answer completely different.
       
 (DIR) Post #ATn8PZmXMpofSu8L2m by TedUnderwood@sigmoid.social
       2023-03-20T01:28:46Z
       
       0 likes, 0 repeats
       
       @gray17 yes, to be sure. But note that the debatable claim I’m making is not about whether the model is *right*—that part I simply measured in the post. The debatable part was that its errors often leave a trace of words. In the example you just provided, for instance, the riddle would be the trace.
       
 (DIR) Post #ATnHGknXwGwYc3TQIq by gray17@mastodon.social
       2023-03-20T03:08:02Z
       
       0 likes, 0 repeats
       
        @TedUnderwood Sure, when you know the answer, you can tell that the generated text is wrong. What bothers me about these models is that the generated text is almost always plausible, so if I don't check carefully, I might not notice that it's lying about what it said earlier. With "give an answer and explain your reasoning," sometimes it's obvious that the explanation is about an answer different from the one it gave. Sometimes the error is more subtle, and I don't know the implications of that.
       
 (DIR) Post #ATnIYBYFp83mSpCFjE by TedUnderwood@sigmoid.social
       2023-03-20T03:22:23Z
       
       0 likes, 0 repeats
       
       @gray17 fwiw, the way to prompt these models is not "give your answer and explain your reasoning" but "a) summarize the data relevant to this question b) describe step by step how you would draw inferences from that data, and only then finally c) synthesize those data in a conclusion." In other words you ask it to show the reasoning before it answers — that's *how* it reaches the answer.
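
        A rough sketch of what a prompt structured that way might look like, using the openai package's ChatCompletion interface as it existed at the time of this thread; the task wording is a paraphrase of the blog post's time-in-fiction question, not the exact prompt.

        # Illustrative sketch, not the post's actual prompt: ask for the evidence and the
        # inference steps *before* the final answer, so the answer is conditioned on them.
        import openai

        def estimate_time(passage: str) -> str:
            prompt = (
                "Read the passage below and estimate how much fictional time passes in it.\n"
                "a) Summarize the details in the passage relevant to the passage of time.\n"
                "b) Describe, step by step, how you would draw an inference from those details.\n"
                "c) Only then state your estimate of the elapsed time.\n\n"
                "Passage:\n" + passage
            )
            response = openai.ChatCompletion.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            return response["choices"][0]["message"]["content"]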
       
 (DIR) Post #ATnIkoKVTNJACvmpVY by TedUnderwood@sigmoid.social
       2023-03-20T03:24:40Z
       
       0 likes, 0 repeats
       
       @gray17 There is no mental state to describe; it's just a sequence of words and, when it's working properly, the thinking happens *in* the sequence of words.
       
 (DIR) Post #ATnJxBYn1pVbgUvqRE by gray17@mastodon.social
       2023-03-20T03:38:06Z
       
       0 likes, 0 repeats
       
        @TedUnderwood Right, I understand that. But in your experiment, you said, "5. Given the amount of speculation required in step 2, describe your certainty about the estimate--either high, moderate, or low." This "your certainty" is entirely imaginary; I don't know what it *means*.
       
 (DIR) Post #ATnLY5oPSjWwWfOu7U by TedUnderwood@sigmoid.social
       2023-03-20T03:55:59Z
       
       0 likes, 0 repeats
       
       @gray17 It means "describe the level of certainty implicit in your answer to step two." I use the term a human being would use ("your confidence"), because that's how English works. But I'm actually instructing the model to look at the text it has just written and generalize about those words.
       
 (DIR) Post #ATnLcd5pfNQqu4iHmy by gray17@mastodon.social
       2023-03-20T03:56:49Z
       
       0 likes, 0 repeats
       
       @TedUnderwood Right, and that's the point where I don't know that I can reliably detect if it's "lying" about the summary or not.
       
 (DIR) Post #ATnLy8SyDUzAHqn5Q8 by gray17@mastodon.social
       2023-03-20T03:58:11Z
       
       0 likes, 0 repeats
       
        @TedUnderwood Because it's very good at writing something that always looks plausible. If it were a human, I could build a model of the human's reliability and attention to detail, but the GPT model is known to fail in weird ways. I have to check everything it says that I don't already know is true.
       
 (DIR) Post #ATnLy8zaGF4bv038Fs by TedUnderwood@sigmoid.social
       2023-03-20T04:00:41Z
       
       0 likes, 0 repeats
       
        @gray17 Well, in this case the words are all there on the same page, so as I scan the answers I can just ask "is the answer to step 5 consistent with what it said in step 2?" Like, does it speculate a lot in 2 and then weirdly say "high confidence" at the end? And in practice, no, it doesn't do that. It can be wrong, but its answers are coherently wrong.
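
        A crude, purely illustrative version of that scan (not anything from the post; the field names and the hedging heuristic are hypothetical): flag responses whose step-5 rating says "high" while the step-2 text is full of hedging.

        import re

        HEDGES = re.compile(r"\b(might|perhaps|unclear|possibly|speculat\w*|uncertain)\b", re.I)

        def flag_inconsistent(responses):
            """responses: list of dicts with hypothetical 'step2_text' and 'step5_rating' keys."""
            flagged = []
            for r in responses:
                hedging = len(HEDGES.findall(r["step2_text"]))
                if r["step5_rating"].strip().lower() == "high" and hedging >= 3:
                    flagged.append(r)  # lots of speculation in step 2, yet "high" confidence
            return flagged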
       
 (DIR) Post #ATnMJOlent48vTfcX2 by gray17@mastodon.social
       2023-03-20T04:04:32Z
       
       0 likes, 0 repeats
       
        @TedUnderwood But have you studied this beyond "scanned a few, seems right to me"? Is it 95% or 99% correct? These models are also known to be vulnerable to adversarial inputs. How often is that a problem? I mean, yes, it's useful, but I'm really wary that it's very easy for humans to slide implicitly from "it's maybe 95% correct" to "the wording is pretty authoritative, it's probably 100% correct, I'm not going to bother checking."
       
 (DIR) Post #ATnMbSXBXQyRovVsTA by TedUnderwood@sigmoid.social
       2023-03-20T04:07:48Z
       
       0 likes, 0 repeats
       
       @gray17 I think we're going in circles. I'm not talking about correctness. Re: correctness I have human estimates for all these passages and can compare them all (every single one), precisely measuring their degree of agreement with human readers--which is close to humans' agreement with each other. But what we're talking about now is the model's habit of talking out loud, and that's not a question of correctness.
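
        A minimal sketch of the kind of comparison described here (not the post's actual code), assuming per-passage elapsed-time estimates from the model and from two human readers:

        from scipy.stats import pearsonr

        def agreement(model_est, human_a, human_b):
            """Each argument: a list of elapsed-time estimates for the same passages."""
            model_vs_human, _ = pearsonr(model_est, human_a)   # model against one reader
            human_vs_human, _ = pearsonr(human_a, human_b)     # inter-human baseline
            return model_vs_human, human_vs_human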
       
 (DIR) Post #ATnMzbmWo9jpS1Clii by TedUnderwood@sigmoid.social
       2023-03-20T04:12:10Z
       
       0 likes, 0 repeats
       
       @gray17 I know there's been this very influential discourse of "ChatGPT can make stuff up and so you have to check"—which is def true if you're asking it for a citation—but that's not very applicable to what I mean when I say the model leaves a trace of how it got to word 50 in words 1 through 49. That's just how the model works; it's not something it can fake or be wrong about, though sure interpretation may be tricky.
       
 (DIR) Post #ATnN8MN7TF3SGSUe0G by gray17@mastodon.social
       2023-03-20T04:13:44Z
       
       0 likes, 0 repeats
       
        @TedUnderwood yeah, ok. I'm not talking about correctness of that. In your experimental process, you ask GPT to 1. write a chain of reasoning to answer a question, and 2. rate how "confident" the chain of reasoning seems, and you rely on the rating to improve the prompting. You're checking that the rating makes sense for a few, but you're not checking all of them, so you're implicitly trusting that summary in your feedback cycle. How does that distort the process vs. not asking for a confidence rating?
       
 (DIR) Post #ATnNwqjKsCcItkxaBU by gray17@mastodon.social
       2023-03-20T04:17:01Z
       
       0 likes, 0 repeats
       
        @TedUnderwood Maybe this is perfectly ok! But I don't know that a priori; I don't have any reason to think that the GPT rating of its own sentences is any more reliable than anything else it says, without checking them all. And you asked it to do that so you don't have to check them all; you're using it to filter down to things that seem useful. Which might be ok! But I don't *know* a strong argument that it *is* ok.
       
 (DIR) Post #ATnNwuwFE9Jrwx1wrQ by TedUnderwood@sigmoid.social
       2023-03-20T04:22:52Z
       
       0 likes, 0 repeats
       
       @gray17 Actually, I didn't rely on the rating to improve the prompting. I don't care much about the rating. I look at its explanation of what was hard, and then at the passage, to see what was confusing about the passage. You're right that we can't necessarily trust the rating itself as an accurate description of the whole process, for one thing because "high, medium, low" isn't very information-rich. No, the way to assess improvement is "does it get closer to human responses"? Which we know.
       
 (DIR) Post #ATnO1Mq5R2O7dmGva4 by gray17@mastodon.social
       2023-03-20T04:23:41Z
       
       0 likes, 0 repeats
       
       @TedUnderwood So why ask for the rating at all?
       
 (DIR) Post #ATnOC5lpEj6vt8OPlQ by TedUnderwood@sigmoid.social
       2023-03-20T04:25:38Z
       
       0 likes, 0 repeats
       
       @gray17 Uh, why not? It's an experiment. Among other things I wanted to see if those ratings did correlate at all with the accuracy of the time estimates. So far I don't think they do.
       
 (DIR) Post #ATnOYGMWXKsB6q87iy by gray17@mastodon.social
       2023-03-20T04:29:38Z
       
       0 likes, 0 repeats
       
       @TedUnderwood ok, I guess I was misled by the statement in your post "I added step 5 (allowing the model to describe its own confidence) because in early experiments I found the model’s tendency to editorialize extremely valuable", which implies that the answer to step 5 affected your choices in some way.