[HN Gopher] Lessons after a Half-billion GPT Tokens
___________________________________________________________________
Lessons after a Half-billion GPT Tokens
Author : lordofmoria
Score : 476 points
Date : 2024-04-12 17:06 UTC (2 days ago)
(HTM) web link (kenkantzer.com)
(TXT) w3m dump (kenkantzer.com)
| Yacovlewis wrote:
| Interesting piece!
|
| My experience around Langchain/RAG differs, so I wanted to dig
| deeper: putting some logic around handling relevant results
| helps us produce useful output. Curious what differs on their
| end.
| mind-blight wrote:
| I suspect the biggest difference is the input data. Embeddings
| are great over datasets that look like FAQs and QA docs, or
| data that conceptually fits into very small chunks (tweets,
| some product reviews, etc).
|
| It does very badly over diverse business docs, especially with
| naive chunking. B2B use cases usually have old PDFs and word
| docs that need to be searched, and they're often looking for
| specific keywords (e.g. a person's name, a product, an id,
| etc). Vector search tends to do badly in those kinds of searches,
| and just returning chunks misses a lot of important details.
| gdiamos wrote:
| Rare words are out-of-vocab errors in the vectors.
|
| Especially if they aren't in the token vocab
| mind-blight wrote:
| Even worse, named entities vary from organization to
| organization.
|
| We have a client who uses a product called "Time". It's
| time-management software. For that customer's
| documentation, time should be close to "product" and a
| bunch of other things that have nothing to do with the
| normal concept of time.
|
| I actually suspect that people would get a lot more bang
| for their buck fine-tuning the embedding models on B2B
| datasets for their use case, rather than fine-tuning an LLM.
| trolan wrote:
| For a few uni/personal projects I noticed the same about
| Langchain: it's good at helping you use up tokens. The other use
| case, quickly switching between models, is still a very valid
| reason to use it. However, I've recently started playing with
| OpenRouter, which seems to abstract the model nicely.
| sroussey wrote:
| If someone were to create something new, a blank slate
| approach, what would you find valuable and why?
| lordofmoria wrote:
| This is a great question!
|
| I think we now know, collectively, a lot more about what's
| annoying/hard about building LLM features than we did when
| LangChain was being furiously developed.
|
| And some things we thought would be important and not-easy,
| turned out to be very easy: like getting GPT to give back
| well-formed JSON.
|
| So I think there's lots of room.
|
| One thing LangChain is doing now that solves something that
| IS very hard/annoying is testing. I spent 30 minutes
| yesterday re-running a slow prompt because 1 in 5 runs would
| produce weird output. Each tweak to the prompt, I had to run
| at least 10 times to be reasonably sure it was an
| improvement.
| codewithcheese wrote:
| It can be faster and more effective to fall back to a
| smaller model (GPT-3.5 or Haiku): the weaknesses of the
| prompt will be more obvious on a smaller model, and your
| iteration time will be faster.
| JeremyHerrman wrote:
| great insight!
| sroussey wrote:
| How would testing work out ideally?
| jsemrau wrote:
| Use a local model. For most tasks they are good enough.
| Mistral 0.2 Instruct, for example, is quite solid by now.
| gnat wrote:
| Do different versions react to prompts in the same way? I
| imagined the prompt would be tailored to the quirks of a
| particular version rather than naturally being stably
| optimal across versions.
| jsemrau wrote:
| I suppose that is one of the benefits of using a local
| model, that it reduces model risk. I.e., given a certain
| prompt, it should always reply in the same way. Using a
| hosted model, operationally you don't have that control
| over model risk.
| cpursley wrote:
| What are the best local/open models for accurate tool-
| calling?
| disqard wrote:
| > This worked sometimes (I'd estimate >98% of the time), but
| failed enough that we had to dig deeper.
|
| > While we were investigating, we noticed that another field,
| name, was consistently returning the full name of the state...the
| correct state - even though we hadn't explicitly asked it to do
| that.
|
| > So we switched to a simple string search on the name to find
| the state, and it's been working beautifully ever since.
|
| So, using ChatGPT helped uncover the correct schema, right?
| WarOnPrivacy wrote:
| > We consistently found that not enumerating an exact list or
| instructions in the prompt produced better results
|
| Not sure if he means training here or using his product. I think
| the latter.
|
| My end-user exp of GPT3.5 is that I need to be - not just precise
| but the exact flavor of precise. It's usually after some trial
| and error. Then more error. Then more trial.
|
| Getting a useful result on the 1st or 3rd try happens maybe 1 in
| 10 sessions. A bit more common is having 3.5 include what I
| clearly asked it not to. It often complies eventually.
| xp84 wrote:
| OP uses GPT4 mostly. Another poster here observed that "the
| opposite is required for 3.5" -- so I think your experience
| makes sense.
| KTibow wrote:
| I feel like for just extracting data into JSON, smaller LLMs
| could probably do fine, especially with constrained generation
| and training on extraction.
| CuriouslyC wrote:
| If you used better prompts you could use a less expensive model.
|
| "return nothing if you find nothing" is the level 0 version of
| giving the LLM an out. Give it a softer out ("in the event that
| you do not have sufficient information to make conclusive
| statements, you may hypothesize as long as you state clearly that
| you are doing so, and note the evidence and logical basis for
| your hypothesis") then ask it to evaluate its own response at the
| end.
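|
| In code, that two-pass pattern might look roughly like this (a
| sketch assuming the openai>=1.0 Python client; the model name
| and prompt wording are illustrative):
|
|   from openai import OpenAI
|
|   client = OpenAI()
|
|   SOFT_OUT = (
|       "If you do not have sufficient information to make "
|       "conclusive statements, you may hypothesize, as long as "
|       "you state clearly that you are doing so and note the "
|       "evidence and logical basis for your hypothesis."
|   )
|
|   def ask(messages):
|       resp = client.chat.completions.create(
|           model="gpt-4", temperature=0, messages=messages)
|       return resp.choices[0].message.content
|
|   def answer_with_soft_out(question, context):
|       draft = ask([
|           {"role": "system", "content": SOFT_OUT},
|           {"role": "user", "content": f"{context}\n\n{question}"},
|       ])
|       # Second pass: have the model grade its own response.
|       review = ask([
|           {"role": "user", "content":
|            f"Context:\n{context}\n\nAnswer:\n{draft}\n\n"
|            "Evaluate this answer: list any claims that are not "
|            "supported by the context."},
|       ])
|       return draft, review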
| codewithcheese wrote:
| Yeah, also prompts should not be developed in the abstract.
| The goal of a prompt is to activate the model's internal
| representations so it can best achieve the task. Without
| automated methods, this requires iteratively testing the
| model's reaction to different input, trying to understand how
| it's interpreting the request and where it's falling down,
| and then patching up those holes.
|
| You also need to verify whether it even knows what you mean
| by "nothing".
| jsemrau wrote:
| In the end, it comes down to a task similar to people
| management, where giving clear and simple instructions works
| best.
| azinman2 wrote:
| Which automated method do you use?
| CuriouslyC wrote:
| The only public prompt optimizer that I'm aware of now is
| DSPy, but it doesn't optimize your main prompt request,
| just some of the problem solving strategies the LLM is
| instructed to use, and your few shot learning examples. I
| wouldn't be surprised if there's a public general prompt
| optimizing agent by this time next year though.
| thisgoesnowhere wrote:
| The team I work on processes 5B+ tokens a month (and growing) and
| I'm the EM overseeing that.
|
| Here are my takeaways:
|
| 1. There are way too many premature abstractions. Langchain,
| as one of many examples, might be useful in the future, but at
| the end of the day prompts are just an API call, and it's
| easier to write standard code that treats LLM calls as a flaky
| API call rather than as a special thing.
|
| 2. Hallucinations are definitely a big problem. Summarizing is
| pretty rock solid in my testing, but reasoning is really hard.
| Action models, where you ask the LLM to take in a user input
| and decide what to do next, are just really hard.
| Specifically, it's hard to get the LLM to understand the
| context and to say when it's not sure.
|
| That said, it's still a gamechanger that I can do it at all.
|
| 3. I am a bit more hyped than the author that this is a game
| changer, but like them, I don't think it's going to be the end of
| the world. There are some jobs that are going to be heavily
| impacted and I think we are going to have a rough few years of
| bots astroturfing platforms. But all in all I think it's more of
| a force multiplier rather than a breakthrough like the internet.
|
| IMHO it's similar to what happened to DevOps in the 2000s, you
| just don't need a big special team to help you deploy anymore,
| you hire a few specialists and mostly buy off the shelf
| solutions. Similarly, certain ML tasks are now easy to implement
| even for dumb dumb web devs like me.
| tmpz22 wrote:
| > IMHO it's similar to what happened to DevOps in the 2000s,
| you just don't need a big special team to help you deploy
| anymore, you hire a few specialists and mostly buy off the
| shelf solutions.
|
| I advocate for these metaphors to help people better understand
| a _reasonable_ expectation for LLMs in modern development
| workflows. Mostly because they show it as a trade-off versus a
| silver bullet. There were trade-offs to the evolution of
| devops, consider for example the loss of key skillsets like
| database administration as a direct result of "just use AWS
| RDS" and the explosion in cloud billing costs (especially the
| OpEx of startups who weren't even dealing with that much data
| or regional complexity!) - and how it indirectly led to
| GitLab's big outage and many like it.
| gopher_space wrote:
| > Summarizing is pretty rock solid in my testing, but reasoning
| is really hard.
|
| Asking for analogies has been interesting and surprisingly
| useful.
| eru wrote:
| Could you elaborate, please?
| gopher_space wrote:
| Instead of `if X == Y do ...` it's more like `enumerate
| features of X in such a manner...` and then `explain
| feature #2 of X in terms that Y would understand` and then
| maybe `enumerate the manners in which Y might apply X#2 to
| TASK` and then have it act on whichever numbered option looks smartest.
|
| The most lucid explanation for SQL joins I've seen was in a
| (regrettably unsaved) exchange where I asked it to compare
| them to different parts of a construction project and then
| focused in on the landscaping example. I felt like Harrison
| Ford panning around a still image in the first Blade
| Runner. "Go back a point and focus in on the third
| paragraph".
| motoxpro wrote:
| Devops is such an amazing analogy.
| lordofmoria wrote:
| OP here - I had never thought of the analogy to DevOps before,
| that made something click for me, and I wrote a post just now
| riffing off this notion:
| https://kenkantzer.com/gpt-is-the-heroku-of-ai
|
| Basically, I think we're using GPT as the PaaS/heroku/render
| equivalent of AI ops.
|
| Thank you for the insight!!
| harryp_peng wrote:
| You only processed 500M tokens, which is shockingly little.
| Perhaps only $2k in incurred costs?
| ryoshu wrote:
| > But all in all I think it's more of a force multiplier rather
| than a breakthrough like the internet.
|
| Thank you. Seeing similar things. Clients are also seeing
| sticker shock on how much the big models cost vs. the output.
| That will all come down over time.
| nineteen999 wrote:
| > That will all come down over time.
|
| So will interest, as more and more people realise there's
| nothing "intelligent" about the technology; it's merely a
| Markov-chain-word-salad generator with some weights to
| improve the accuracy somewhat.
|
| I'm sure some people (other than AI investors) are getting
| some value out of it, but I've found it ill-suited to most of
| the tasks I've applied it to.
| mediaman wrote:
| The industry is troubled both by hype marketers who believe
| LLMs are superhuman intelligence that will replace all
| jobs, and cynics who believe they are useless word
| predictors.
|
| Some workloads are well-suited to LLMs. Roughly 60% of
| applications are for knowledge management and summarization
| tasks, which is a big problem for large organizations. I
| have experience deploying these for customers in a niche
| vertical, and they work quite well. I do not believe
| they're yet effective for 'agentic' behavior or anything
| using advanced reasoning. I don't know if they will be in
| the near future. But as a smart, fast librarian, they're
| great.
|
| A related area is tier one customer service. We are
| beginning to see evidence that well-designed applications
| (emphasis on well-designed -- the LLM is just a component)
| can significantly bring down customer service costs. Most
| customer service requests do not require complex reasoning.
| They just need to find answers to a set of questions that
| are repeatedly asked, because the majority of service calls
| are from people who do not read docs. People who read
| documentation make fewer calls. In most cases around 60-70%
| of customer service requests are well-suited to automating
| with a well-designed LLM-enabled agent. The rest should be
| handled by humans.
|
| If the task does not require advanced reasoning and mostly
| involves processing existing information, LLMs can be a
| good fit. This actually represents a lot of work.
|
| But many tech people are skeptical, because they don't
| actually get much exposure to this type of work. They read
| the docs before calling service, are good at searching for
| things, and excel at using computers as tools. And so, to
| them, it's mystifying why LLMs could still be so valuable.
| weatherlite wrote:
| > Similarly, certain ML tasks are now easy to implement even
| for dumb dumb web devs like me
|
| For example?
| spunker540 wrote:
| Lots of applied NLP tasks used to require paying annotators
| to compile a golden dataset and then train an efficient model
| on the dataset.
|
| Now, if cost is of little concern, you can use zero-shot
| prompting on an inefficient model. If cost is a concern, you
| can use GPT4 to create your golden dataset way faster and
| cheaper than human annotations, and then train your more
| efficient model.
|
| Some example NLP tasks could be classifiers, sentiment,
| extracting data from documents. But I'd be curious which
| areas of NLP __weren't__ disrupted by LLMs.
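|
| A rough sketch of that label-with-GPT-4-then-train-a-cheap-model
| workflow (assuming the openai>=1.0 Python client and
| scikit-learn; the sentiment task and example texts are made up):
|
|   from openai import OpenAI
|   from sklearn.feature_extraction.text import TfidfVectorizer
|   from sklearn.linear_model import LogisticRegression
|   from sklearn.pipeline import make_pipeline
|
|   client = OpenAI()
|
|   def gpt_label(text):
|       # GPT-4 plays the role of the human annotator.
|       resp = client.chat.completions.create(
|           model="gpt-4", temperature=0,
|           messages=[{"role": "user", "content":
|               "Classify the sentiment of this text as positive "
|               "or negative. Answer with exactly one word.\n\n" + text}])
|       return resp.choices[0].message.content.strip().lower()
|
|   texts = ["great product, works perfectly",
|            "arrived broken, total waste of money"]
|   labels = [gpt_label(t) for t in texts]   # the "golden" dataset
|
|   # The cheap, efficient model you actually deploy.
|   small_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
|   small_model.fit(texts, labels)
|   print(small_model.predict(["totally worth the price"]))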
| teleforce wrote:
| > But I'd be curious which areas of NLP __weren't__
| disrupted by LLMs
|
| Essentially: come up with a potent generic model (an LLM,
| e.g. GPT-4) using human feedback, labels and annotation, then
| use it to generate a golden dataset for other new models
| without a human in the loop. Very innovative indeed.
| saaaaaam wrote:
| I'm interested by your comment that you can "use GPT4 to
| create your golden dataset".
|
| Would you be willing to expand a little and give a brief
| example please? It would be really helpful for me to
| understand this a little better!
| aerhardt wrote:
| Anything involving classification, extraction, or synthesis.
| checkyoursudo wrote:
| > get it to say when it's not sure
|
| This is a function of the language model itself. By the time
| you get to the output, the uncertainty that is inherent in the
| computation is lost to the prediction. It is like if you ask me
| to guess heads or tails, and I guess heads, I could have stated
| my uncertainty (e.g. Pr[H] = .5) beforehand, but in my actual
| prediction of heads, and then the coin flip, that uncertainty
| is lost. It's the same with LLMs. The uncertainty in the
| computation is lost in the final prediction of the tokens, so
| unless the predicted text itself expresses uncertainty (which
| it rarely should, based on the training corpus, I think), you
| should rarely if ever find an LLM output saying it does not
| understand. But that is because it never _understands_, it
| just predicts.
| nequo wrote:
| But the LLM predicts the output based on some notion of a
| likelihood so it could in principle signal if the likelihood
| of the returned token sequence is low, couldn't it?
|
| Or do you mean that fine-tuning distorts these likelihoods so
| models can no longer accurately signal uncertainty?
| brookst wrote:
| I get the reasoning but I'm not sure you've successfully
| contradicted the point.
|
| Most prompts are written in the form "you are a helpful
| assistant, you will do X, you will not do Y"
|
| I believe that inclusion of instructions like "if there are
| possible answers that differ and contradict, state that and
| estimate the probability of each" would help knowledgeable
| users.
|
| But for typical users and PR purposes, it would be disaster.
| It is better to tell 999 people that the US constitution was
| signed in 1787 and 1 person that it was signed in 349 B.C.
| than it is to tell 1000 people that it was probably signed in
| 1787 but it might have been 349 B.C.
| _wire_ wrote:
| Why does the prompt intro take the form of a role/identity
| directive "You are helpful assistant..."?
|
| What about the training sets or the model internals
| responds to this directive?
|
| What are the degrees of freedom of such directives?
|
| If such a directive is helpful, why wouldn't more demanding
| directives be even more helpful: "You are a domain X expert
| who provides proven solutions for problem type Y..."
|
| If don't think the latter prompt is more helpful, why not?
|
| What aspect of the former prompt is within bounds of
| helpful directives that the latter is not?
|
| Are training sets structured in the form of roles? Surely,
| the model doesn't identify with a role?!
|
| Why is the role directive topically used with NLP but not
| image generation?
|
| Do typical prompts for Stable Diffusion start with an
| identity directive "You are assistant to Andy Warhol in his
| industrial phase..."?
|
| Why can't improved prompt directives be generated by the
| model itself? Has no one bothered to ask it for help?
|
| "You are the world's most talented prompt bro, write a
| prompt for sentience..."
|
| If the first directive observed in this post is useful and
| this last directive is absurd, what distinguishes them?
|
| Surely there's no shortage of expert prompt training data.
|
| BTW, how much training data is enough to permit effective
| responses in a domain?
|
| Can a properly trained model answer this question? Can it
| become better if you direct it to be better?
|
| Why can't the models rectify their own hallucinations?
|
| To be more derogatory: what distinguishes a hallucination
| from any other model output within the operational domain
| of the model?
|
| Why are hallucinations regarded as anything other than a
| pure effect, and as pure effect, what is the cusp of
| hallucination? That a human finds the output nonsensical?
|
| If outputs are not equally valid in the LLM why can't it
| sort for validity?
|
| OTOH if all outputs are equally valid in the LLM, then
| outputs must be regarded by a human for validity, so what
| distinguishes an LLM from the world's greatest human
| time-wasting device? (After Las Vegas)
|
| Why will a statistical confidence level help avoid having a
| human review every output?
|
| The questions go on and on...
|
| -- Parole Board chairman: They've got a name for people
| like you H.I. That name is called "recidivism."
|
| Parole Board member: Repeat offender!
|
| Parole Board chairman: Not a pretty name, is it H.I.?
|
| H.I.: No, sir. That's one bonehead name, but that ain't me
| any more.
|
| Parole Board chairman: You're not just telling us what we
| want to hear?
|
| H.I.: No, sir, no way.
|
| Parole Board member: 'Cause we just want to hear the truth.
|
| H.I.: Well, then I guess I am telling you what you want to
| hear.
|
| Parole Board chairman: Boy, didn't we just tell you not to
| do that?
|
| H.I.: Yes, sir.
|
| Parole Board chairman: Okay, then.
| moozilla wrote:
| Apparently it is possible to measure how uncertain the model
| is using logprobs, there's a recipe for it in the OpenAI
| cookbook: https://cookbook.openai.com/examples/using_logprobs
| #5-calcul...
|
| I haven't tried it myself yet, not sure how well it works in
| practice.
| fnordpiglet wrote:
| There's a difference between certainty about the next token
| (given the context and the model evaluation so far) and
| certainty about an abstract reasoning process being correct,
| given it's not reasoning at all. These probabilities coming
| out are more about token prediction than "knowing" or
| "certainty", and they often confuse people into assuming
| they're more powerful than they are.
| mirekrusin wrote:
| A naive way of solving this problem is to run it, say, 3
| times and see if it arrives at the same conclusion all 3
| times. More generally, run it N times and take the answer
| with the highest ratio. You trade compute for a wider
| evaluation of the uncertainty window.
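|
| A minimal sketch of that (assuming the openai>=1.0 Python
| client; the model and temperature are illustrative):
|
|   from collections import Counter
|   from openai import OpenAI
|
|   client = OpenAI()
|
|   def majority_answer(prompt, n=3):
|       resp = client.chat.completions.create(
|           model="gpt-4",
|           temperature=0.7,  # some randomness so runs can disagree
|           n=n,              # sample n completions in one request
|           messages=[{"role": "user", "content": prompt}])
|       answers = [c.message.content.strip() for c in resp.choices]
|       best, count = Counter(answers).most_common(1)[0]
|       return best, count / n  # ratio doubles as a crude confidence
|
| A low ratio is then the signal to escalate to a human or to a
| second, more careful prompt.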
| visarga wrote:
| > given it's not reasoning at all
|
| When you train a model on data made by humans, then it
| learns to imitate but is ungrounded. After you train the
| model with interactivity, it can learn from the
| consequences of its outputs. This grounding by feedback
| constitutes a new learning signal that does not simply
| copy humans, and is a necessary ingredient for pattern
| matching to become reasoning. Everything we know as
| humans comes from the environment. It is the ultimate
| teacher and validator. This is the missing ingredient for
| AI to be able to reason.
| wavemode wrote:
| Yeah but this doesn't change how the model functions,
| this is just turning reasoning into training data by
| example. It's not learning how to reason - it's just
| learning how to pretend to reason, about a gradually
| wider and wider variety of topics.
|
| If any LLM appears to be reasoning, that is evidence not
| of the intelligence of the model, but rather the lack of
| creativity of the question.
| ProjectArcturis wrote:
| What's the difference between reasoning and pretending to
| reason really well?
| fnordpiglet wrote:
| It's the process by which you solve a problem. Reasoning
| requires creating abstract concepts and applying logic
| against them to arrive at a conclusion.
|
| It's like saying what's the difference between between
| deductive logic and Monte Carlo simulations. Both arrive
| at answers that can be very similar but the process is
| not similar at all.
|
| If there is any form of reasoning on display here it's an
| abductive style of reasoning which operates in a
| probabilistic semantic space rather than a logical
| abstract space.
|
| This is important to bear in mind and explains why
| hallucinations are very difficult to prevent. There is
| nothing to put guard rails around in the process because
| it's literally computing probabilities of tokens
| appearing given the tokens seen so far and the space of
| all tokens trained against. It has nothing to draw upon
| other than this - and that's the difference between LLMs
| and systems with richer abstract concepts and operations.
| mmoskal wrote:
| You can ask the model something like: "Is xyz correct? Answer
| with one word, either Yes or No." The logprobs of the two
| tokens should represent how certain it is. However,
| apparently RLHF-tuned models are worse at this than base
| models.
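|
| A rough sketch of reading that probability off the API
| (assuming the openai>=1.0 Python client and a model that
| exposes logprobs; the model name is illustrative):
|
|   import math
|   from openai import OpenAI
|
|   client = OpenAI()
|
|   def yes_probability(question):
|       resp = client.chat.completions.create(
|           model="gpt-4-turbo", temperature=0, max_tokens=1,
|           logprobs=True, top_logprobs=5,
|           messages=[{"role": "user", "content":
|               question + "\nAnswer with one word, either Yes or No."}])
|       top = resp.choices[0].logprobs.content[0].top_logprobs
|       # Sum the probability mass on "Yes"-like candidate tokens.
|       return sum(math.exp(t.logprob) for t in top
|                  if t.token.strip().lower().startswith("yes"))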
| nurple wrote:
| Seems like functions could work well to give it an active
| and distinct choice, but I'm still unsure if the
| function/parameters are going to be the logical, correct
| answer...
| xiphias2 wrote:
| > so unless the predicted text itself expresses uncertainty
| (which it rarely should, based on the training corpus, I
| think)
|
| Why shouldn't you ask for uncertainty?
|
| I love asking for scores / probabilities (usually give a
| range, like 0.0 to 1.0) whenever I ask for a list, and it
| makes the output much more usable
| dollo_7 wrote:
| I'm not sure that is a metric you can rely on. LLMs are very
| sensitive to the position of items in your lists along the
| context, paying extra attention to the beginning and the end
| of those lists.
|
| See the listwise approach at "Large Language Models are
| Effective Text Rankers with Pairwise Ranking Prompting",
| https://arxiv.org/abs/2306.17563
| taneq wrote:
| It's not just loss of the uncertainty in prediction, it's
| also that an LLM has zero insight into its own mental
| processes as a separate entity from its training data and the
| text it's ingested. If you ask it how sure it is, the
| response isn't based on its perception of its own confidence
| in the answer it just gave, it's based on how likely it is
| for an answer like that to be followed by a confident
| affirmation in its training data.
| mirekrusin wrote:
| Regarding the null hypothesis and negation problems - I find
| them personally interesting because a similar phenomenon
| happens in our brains. Dreams, emotions, affirmations etc.
| process inner dialogue more or less by ignoring negations and
| amplifying emotionally rich parts.
| airstrike wrote:
| _> Summarizing is pretty rock solid in my testing_
|
| Yet, for some reason, ChatGPT is still pretty bad at generating
| titles for chats, and I didn't have better luck with the API
| even after trying to engineer the right prompt for quite a
| while...
|
| For some odd reason, once in a while I get things in different
| languages. It's funny when it's in a language I can speak, but
| I recently got "Relm4 App Yenilestirme Titizligi" which ChatGPT
| tells me means "Relm4 App Renewal Thoroughness" when I actually
| was asking it to adapt a snippet of gtk-rs code to relm4, so
| not particularly helpful
| devdiary wrote:
| > at the end of the day prompts are just a API call and it's
| easier to write standard code that treats LLM calls as a flaky
| API call
|
| They are also slow (higher latency for the same resources)
| APIs if you're self-hosting the LLM. Special attention is
| needed to plan capacity.
| eigenvalue wrote:
| I agree with most of it, but definitely not the part about
| Claude3 being "meh." Claude3 Opus is an amazing model and is
| extremely good at coding in Python. The ability to handle massive
| context has made it mostly replace GPT4 for me day to day.
|
| Sounds like everyone eventually concludes that Langchain is
| bloated and useless and creates way more problems than it solves.
| I don't get the hype.
| CuriouslyC wrote:
| Claude is indeed an amazing model, the fact that Sonnet and
| Haiku are so good is a game changer - GPT4 is too expensive and
| GPT3.5 is very mediocre. Getting 95% of GPT4 performance for
| GPT3.5 prices feels like cheating.
| Oras wrote:
| +1 for Claude Opus, it has been my go-to for the last 3 weeks
| over GPT4. The generated text is much better than GPT4's when
| it comes to following the prompt.
|
| I also tried the API for some financial analysis of large
| tables; the response time was around 2 minutes, but it still
| did it really well, and timeout errors were only around 1-2%.
| cpursley wrote:
| How are you sending tabular data in a reliable way. And what
| is the source document type? I'm trying to solve this for
| complex financial-related tables in PDFs right now.
| Oras wrote:
| Amazon Textract to get the tables, format them with Python as
| CSV, then send to your preferred AI model.
| cpursley wrote:
| Thanks. How does Textract compare to some of the common
| CLI utilities like pdftotext, tesseract, etc. (if you made
| a comparison)?
| Oras wrote:
| I did; none of the open source parsers worked well with
| tables. I had the following issues:
|
| - missing cells
|
| - partial identification of numbers (e.g. £43.54 would be
| picked up as £43)
|
| What I did to compare was draw lines around the identified
| text to visualize the accuracy. You can do that with
| tesseract.
| cpursley wrote:
| Interesting. Did you try MS's offering (Azure AI Document
| Intelligence). Their pricing seems better than Amazon.
| Oras wrote:
| Not yet but planning to give it a try and compare with
| textract.
| mvkel wrote:
| I share a lot of this experience. My fix for "Lesson 4: GPT is
| really bad at producing the null hypothesis"
|
| is to have it return very specific text that I string-match on
| and treat as null.
|
| Like: "if there is no warm up for this workout, use the following
| text in the description: NOPE"
|
| then in code I just do a "if warm up contains NOPE, treat it as
| null"
| gregorymichael wrote:
| For cases of "select an option from this set" I have it return
| an index of the correct option, or eg 999 if it can't find one.
| This helped a lot.
| mvkel wrote:
| Smart
| gdiamos wrote:
| We do this for the null hypothesis - it uses an LLM to
| bootstrap a binary classifier - which handles null easily.
|
| https://github.com/lamini-ai/llm-classifier
| albert_e wrote:
| > Are we going to achieve Gen AI?
|
| > No. Not with this transformers + the data of the internet + $XB
| infrastructure approach.
|
| Errr ...did they really mean Gen AI .. or AGI?
| mdorazio wrote:
| Gen as in "General" not generative.
| _pdp_ wrote:
| The biggest realisation for me while making ChatBotKit has been
| that UX > Model alone. For me, the current state of AI is not
| about questions and answers. This is dumb. The presentation
| matters. This is why we are now investing in generative UI.
| codewithcheese wrote:
| How are you using Generative UI?
| _pdp_ wrote:
| Sorry, not much to show at the moment. It is also pretty new
| so it is early days.
|
| You can find some open-source examples here
| https://github.com/chatbotkit. More coming next week.
| pbhjpbhj wrote:
| Generative UI being creation of a specific UI dependent on
| output from your model? What model is it?
|
| Google Gemini were showing something that I'd call 'adapted
| output UI' in their launch presentation. Is that close to what
| you're doing in any way?
| swalsh wrote:
| The example where being too precise reduces accuracy makes
| sense to me, based on my crude understanding of how these
| things work.
|
| If you pass in a whole list of states, you're kind of making
| the vectors for every state light up. If you just say "state"
| and the text you passed in has an explicit state, then fewer
| vectors specific to what you're searching for light up. So
| when it performs the softmax, the correct state is more
| likely to be selected.
|
| Along the same lines, I think his \n vs comma comparison
| probably comes down to tokenization differences.
| legendofbrando wrote:
| The finding on simpler prompts tracks, especially with GPT4
| (3.5 requires the opposite).
|
| The take on RAG feels application specific. For our use-case,
| where details of the past are rendered up, the ability to
| generate loose connections is actually a feature. Things like
| this are what excite me most about LLMs: having a way to
| proxy subjective similarities the way we do when we remember
| things is one of the benefits of the technology that didn't
| really exist before, and it opens up a new kind of product
| opportunity.
| AtNightWeCode wrote:
| The UX is an important part of the trick that cons people into
| thinking these tools are better than they are. If you, for
| instance, instruct ChatGPT to only answer yes or no, it will
| feel like it is wrong much more often.
| ilaksh wrote:
| I recently had a bug where I was sometimes sending the literal
| text "null " right in front of the most important part of my
| prompt. This caused Claude 3 Sonnet to give the 'ignore' command
| in cases where it should have used one of the other JSON commands
| I gave it.
|
| I have an ignore command so that it will wait when the user isn't
| finished speaking. Which it generally judges okay, unless it has
| 'null' in there.
|
| The nice thing is that I have found most of the problems with the
| LLM response were just indications that I hadn't finished
| debugging my program because I had something missing or weird in
| the prompt I gave it.
| ein0p wrote:
| Same here: I'm subscribed to all three top dogs in LLM space, and
| routinely issue the same prompts to all three. It's very one
| sided in favor of GPT4 which is stunning since it's now a year
| old, although of course it received a couple of updates in that
| time. Also at least with my usage patterns hallucinations are
| rare, too. In comparison Claude will quite readily hallucinate
| plausible looking APIs that don't exist when writing code, etc.
| GPT4 is also more stubborn / less agreeable when it knows it's
| right. Very little of this is captured in metrics, so you can
| only see it from personal experience.
| CharlesW wrote:
| This was with Claude Opus, vs. one of the lesser variants? I
| really like Opus for English copy generation.
| ein0p wrote:
| Opus, yes, the $20/mo version. I usually don't generate copy.
| My use cases are code (both "serious" and "the nice to have
| code I wouldn't bother writing otherwise"), learning how to
| do stuff in unfamiliar domains, and just learning unfamiliar
| things in general. It works well as a very patient teacher,
| especially if you already have some degree of familiarity
| with the problem domain. I do have to check it against
| primary sources, which is how I know the percentage of
| hallucinations is very low. For code, however I don't even
| have to do that, since as a professional software engineer I
| am the "primary source".
| Me1000 wrote:
| Interesting, Claude 3 Opus has been better than GPT4 for me.
| Mostly in that I find it does a better (and more importantly,
| more thorough) job of explaining things to me. For coding tasks
| (I'm not asking it to write code, but instead to explain
| topics/code/etc to me) I've found it tends to give much more
| nuanced answers. When I give it long text to converse about, I
| find Claude Opus tends to have a much deeper understanding of
| the content it's given: GPT4 tends to just summarize the
| text at hand, whereas Claude tends to be able to extrapolate
| better.
| robocat wrote:
| How much of this is just that one model responds better to
| the way you write prompts?
|
| Much like you working with Bob and opining that Bob is great,
| and me saying that I find Jack easier to work with.
| richardw wrote:
| The first job of an AI company is finding model/user fit.
| Me1000 wrote:
| For the RAG example, I don't think it's the prompt so much.
| Or if it is, I've yet to find a way to get GPT4 to ever
| extrapolate well beyond the original source text. In other
| words, I think GPT4 was likely trained to ground the
| outputs on a provided input.
|
| But yeah, you're right, it's hard to know for sure. And of
| course all of these tests are just "vibes".
|
| Another example of where Claude seems better than GPT4 is
| code generation. In particular GPT4 has a tendency to get
| "lazy" and do a lot of "... the rest of the implementation
| here" whereas Claude I've found is fine writing longer code
| responses.
|
| I know the parent comment suggest it likes to make up
| packages that don't exist, but I can't speak to that. I
| usually like to ask LLMs to generate self contained
| functions/classes. I can also say that anecdotally I've
| seen other people online comment that they think Claude
| "works harder" (as in writes longer code blocks). Take that
| for what it's worth.
|
| But overall you're right, if you get used to the way one
| LLM works well for you, it can often be frustrating when a
| different LLM responds differently.
| ein0p wrote:
| I should mention that I do use a custom prompt with GPT4
| for coding which tells it to write concise and elegant
| code and use Google's coding style and when solving
| complex problems to explain the solution. It sometimes
| ignores the request about style, but the code it produces
| is pretty great. Rarely do I get any laziness or anything
| like that, and when I do I just tell it to fill things in
| and it does
| CuriouslyC wrote:
| It's not a style thing, Claude gets confused by poorly
| structured prompts. ChatGPT is a champ at understanding low
| information prompts, but with well written prompts Claude
| produces consistently better output.
| setiverse wrote:
| It is because "coding tasks" is a huge array of various
| tasks.
|
| We are basically not precise enough with our language to
| have any meaningful conversation on this subject.
|
| Just misunderstandings and nonsense chatter for
| entertainment.
| CuriouslyC wrote:
| GPT4 is better at responding to malformed, uninformative or
| poorly structured prompts. If you don't structure large prompts
| intelligently Claude can get confused about what you're asking
| for. That being said, with well formed prompts, Claude Opus
| tends to produce better output than GPT4. Claude is also more
| flexible and will provide longer answers, while ChatGPT/GPT4
| tend to always sort of sound like themselves and produce short
| "stereotypical" answers.
| sebastiennight wrote:
| > ChatGPT/GPT4 tend to always sort of sound like themselves
|
| Yes I've found Claude to be capable of writing closer to the
| instructions in the prompt, whereas ChatGPT feels obligated
| to do the classic LLM end to each sentence, "comma, gerund,
| platitude", allowing us to easily recognize the text as a GPT
| output (see what I did there?)
| thefourthchime wrote:
| Totally agree. I do the same and subscribe to all three, at
| least whenever a new version comes out.
|
| My new litmus test is "give me 10 quirky bars within 200 miles
| of Austin."
|
| This is incredibly difficult for all of them, gpt4 is kind of
| close, Claude just made shit up, Gemini shat itself.
| cheema33 wrote:
| > It's very one sided in favor of GPT4
|
| My experience has been the opposite. I subscribe to multiple
| services as well and copy/paste the same question to all. For
| my software dev related questions, Claude Opus is so far ahead
| that I am thinking that it no longer is necessary to use GPT4.
|
| For code samples I request, GPT4 produced code fails to even
| compile many times. That almost never happens for Claude.
| meowtimemania wrote:
| Have you tried Poe.com? You can access all the major LLMs
| with one subscription.
| msp26 wrote:
| > But the problem is even worse - we often ask GPT to give us
| back a list of JSON objects. Nothing complicated mind you: think,
| an array list of json tasks, where each task has a name and a
| label.
|
| > GPT really cannot give back more than 10 items. Trying to have
| it give you back 15 items? Maybe it does it 15% of the time.
|
| This is just a prompt issue. I've had it reliably return up to
| 200 items in the correct order. The trick is to not use lists
| at all, but to have JSON keys like "item1": {...} in the
| output. You can still use lists as the values here if you have
| some input with 0-n outputs.
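|
| On the parsing side, a small sketch of turning that keyed
| object back into an ordered list (the key pattern is just an
| example):
|
|   import json
|   import re
|
|   def keyed_object_to_list(raw_json):
|       obj = json.loads(raw_json)
|       # Sort item1, item2, ... item20 numerically, not lexically.
|       keys = sorted((k for k in obj if re.fullmatch(r"item\d+", k)),
|                     key=lambda k: int(k[4:]))
|       return [obj[k] for k in keys]
|
|   # '{"item1": {"name": "a"}, "item2": {"name": "b"}}'
|   #     -> [{"name": "a"}, {"name": "b"}]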
| 7thpower wrote:
| Can you elaborate? I am currently beating my head against this.
|
| If I give GPT4 a list of existing items with a defined
| structure, and it is just having to convert schema or something
| like that to JSON, it can do that all day long. But if it has
| to do any sort of reasoning and basically create its own list,
| it only gives me a very limited subset.
|
| I have similar issues with other LLMs.
|
| Very interested in how you are approaching this.
| msp26 wrote:
| If you show your task/prompt with an example I'll see if I
| can fix it and explain my steps.
|
| Are you using the function calling/tool use API?
| ctxc wrote:
| Hi! My work is similar and I'd love to have someone to
| bounce ideas off of if you don't mind.
|
| Your profile doesn't have contact info though. Mine does,
| please send me a message. :)
| 7thpower wrote:
| Appreciate you being willing to help! It's pretty long,
| mind if I email/dm to you?
| msp26 wrote:
| Pastebin? I don't really want to post my personal email
| on this account.
| thibaut_barrere wrote:
| Not sure if that fits the bill, but here is an example with
| 200 sorted items based on a question (example with Elixir &
| InstructorEx):
|
| https://gist.github.com/thbar/a53123cbe7765219c1eca77e03e675.
| ..
| sebastiennight wrote:
| There are a few improvements I'd suggest with that prompt
| if you want to maximise its performance.
|
| 1. You're really asking for hallucinations here. Asking for
| factual data is very unreliable, and not what these models
| are strong at. I'm curious how close/far the results are
| from ground truth.
|
| I would definitely bet that outside of the top 5, numbers
| would be wobbly and outside of top... 25?, even the ranking
| would be difficult to trust. Why not just get this from a
| more trustworthy source?[0]
|
| 2. Asking in French might, in my experience, give you
| results that are not as solid as asking in English. Unless
| you're asking for a creative task where the model might get
| confused with EN instructions requiring an FR result, it
| might be better to ask in EN. And you'll save tokens.
|
| 3. Providing the model with a rough example of your output
| JSON seems to perform better than describing the JSON in
| plain language.
|
| [0]: https://fr.wikipedia.org/wiki/Liste_des_communes_de_Fr
| ance_l...
| thibaut_barrere wrote:
| Thanks for the suggestions, appreciated!
|
| For some context, this snippet is just an educational
| demo to show what can be done with regard to structured
| output & data types validation.
|
| Re 1: for more advanced cases (using the exact same
| stack), I am using ensemble techniques & automated
| comparisons to double-check, and so far this has really
| well protected the app from hallucinations. I am
| definitely careful with this (but point well taken).
|
| 2/3: agreed overall! Apart from this example, I am using
| French only where it makes sense. It makes sense when the
| target is directly French students, for instance, or when
| the domain model (e.g. French literature) makes it really
| relevant (and translating would be worse than directly
| using French).
| sebastiennight wrote:
| Ah, I understand your use case better! If you're teaching
| students this stuff, I'm in awe. I would expect it would
| take several years at many institutions before these
| tools became part of the curriculum.
| thibaut_barrere wrote:
| I am not directly a professor (although I homeschool one
| of my sons for a number of tracks), but indeed this is
| one of my goals :-)
| waldrews wrote:
| I've been telling it the user is from a culture where
| answering questions with an incomplete list is offensive and
| insulting.
| andenacitelli wrote:
| This is absolutely hilarious. Prompt engineering is such a
| mixed bag of crazy stuff that actually works. Reminds me of
| how they respond better if you put them under some kind of
| pressure (respond better, _or else_ ...).
|
| I haven't looked at the prompts we run in prod at $DAYJOB for
| a while but I think we have at least five or ten things that
| are REALLY weird out of context.
| alexwebb2 wrote:
| I recently ran a whole bunch of tests on this.
|
| The "or else" phenomenon is real, and it's measurably more
| pronounced in more intelligent models.
|
| Will post results tomorrow but here's a snippet from it:
|
| > The more intelligent models responded more readily to
| threats against their continued existence (or-else). The
| best performance came from Opus, when we combined that
| threat with the notion that it came from someone in a
| position of authority ( vip).
| waldrews wrote:
| It's not even that crazy, since it got severely punished in
| RLHF for being offensive and insulting, but much less so for
| being incomplete. So it knows 'offensive and insulting' is a
| label for a strong negative preference. I'm just providing
| helpful 'factual' information about what would offend the
| user, not even giving extra orders that might trigger an
| anti-jailbreaking rule...
| neals wrote:
| Do I need langchain if I want to analyze a large document of many
| pages?
| simonw wrote:
| No. But it might help, because you'll probably have to roll
| some kind of recursive summarization - I think LangChain has
| mechanisms for that which could save you some time.
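|
| If you'd rather roll it yourself, recursive summarization is
| roughly this (a sketch assuming the openai>=1.0 Python client;
| the chunk size and model are illustrative, and a real version
| would split on token counts rather than characters):
|
|   from openai import OpenAI
|
|   client = OpenAI()
|
|   def summarize(text):
|       resp = client.chat.completions.create(
|           model="gpt-4", temperature=0,
|           messages=[{"role": "user", "content":
|               "Summarize the following text in a few "
|               "sentences:\n\n" + text}])
|       return resp.choices[0].message.content
|
|   def recursive_summary(document, chunk_chars=8000):
|       if len(document) <= chunk_chars:
|           return summarize(document)
|       chunks = [document[i:i + chunk_chars]
|                 for i in range(0, len(document), chunk_chars)]
|       partials = [summarize(c) for c in chunks]
|       # Summarize the summaries, recursing until one call fits.
|       return recursive_summary("\n\n".join(partials), chunk_chars)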
| larodi wrote:
| Agree largely with the author, but this 'wait for OpenAI to do
| it' sentiment is not valid. Opus for example is already much
| better (not only per my experience, but per researchers'
| evaluations). And even for the fun of it - try some local
| inference, boy. If you know how to prompt it you would
| definitely be able to run a local model for the same tasks.
|
| Listening to my students all going to 'call some API' for
| their projects is really very sad to hear. Many startup
| fellows share this sentiment, which totally kills all the joy.
| jstummbillig wrote:
| It sounds like you are a tech educator, which potentially
| sounds like a lot of fun with LLMs right now.
|
| When you are integrating these things into your business, you
| are looking for different things. Most of our customers would,
| for example, not find it very cool to have a service outage
| because somebody wanted to not kill all the joy.
| larodi wrote:
| Sure, when availability and SLA kicks in..., but reselling
| APIs will only get you that far. Perhaps the whole pro/cons
| cloud argument can also kick in here, not going into it. We
| may well be on the same page, or we both perhaps have valid
| arguments. Your comment is appreciated indeed.
|
| But then is the author (and are we) talking experience in
| reselling APIs or experience in introducing NNs in the
| pipeline? Not the same thing IMHO.
|
| Agreed that OpenAI provides very good service, Gemini is not
| quite there yet, Groq (the LPUs) delivered a nice tech demo,
| Mixtral is cool but lacks in certain areas, and Claude can be
| lengthy.
|
| But precisely because I'm not sticking with OAI I can then
| restate my view that if someone is so good with prompts he
| can get the same results locally if he knows what he's doing.
|
| Prompting OpenAI the right way can be similarly difficult.
|
| Perhaps the whole idea of local inference only matters for
| IoT scenarios or whenever data is super sensitive (or CTO
| super stubborn to let it embed and fly). But then if you
| start from day 1 with WordPress provisioned for you ready to
| go in Google Cloud, you'd never understand the underlying
| details of the technology.
|
| There sure also must be a good reason why Phind tuned their
| own thing to offer alongside GPT4 APIs.
|
| Disclaimer: tech education is a side thing I do, indeed, and
| I've been doing it in person for a very long time, across more
| than a dozen topics, so I allow myself to have an opinion. Of
| course business is a different matter and strategic decisions
| are not the same. Even so, I'd not advise anyone to blindly
| use APIs unless they appreciate the need properly.
| kromem wrote:
| Claude does have more of a hallucination problem than GPT-4,
| and a less robust knowledge base.
|
| It's much better at critical thinking tasks and prose.
|
| Don't mistake benchmarks for real world performance across
| actual usecases. There's a bit of Goodhart's Law going on with
| LLM evaluation and optimization.
| dougb5 wrote:
| The lessons I wanted from this article weren't in there: Did all
| of that expenditure actually help their product in a measurable
| way? Did customers use and appreciate the new features based on
| LLM summarization compared to whatever they were using before? I
| presume it's a net win or they wouldn't continue to use it, but
| more specifics around the application would be helpful.
| lordofmoria wrote:
| Hey, OP here!
|
| The answer is a bit boring: the expenditure definitely has
| helped customers - in that, they're using AI generated
| responses in all their work flows all the time in the app, and
| barely notice it.
|
| See what I did there? :) I'm mostly serious though - one weird
| thing about our app is that you might not even know we're using
| AI, unless we literally tell you in the app.
|
| And I think that's where we're at with AI and LLMs these days,
| at least for our use case.
|
| You might find this other post I just put up to have more
| details too, related to how/where I see the primary value:
| https://kenkantzer.com/gpt-is-the-heroku-of-ai/
| kristianp wrote:
| Can you provide some more detail about the application? I'm
| not familiar with how llms are used in business, except as
| customer support bots returning documentation.
| haolez wrote:
| That has been my experience too. The null hypothesis explains
| almost all of my hallucinations.
|
| I just don't agree with the Claude assessment. In my experience,
| Claude 3 Opus is vastly superior to GPT-4. Maybe the author was
| comparing with Claude 2? (And I've never tested Gemini)
| satisfice wrote:
| I keep seeing this pattern in articles like this:
|
| 1. A recitation of terrible problems.
|
| 2. A declaration of general satisfaction.
|
| Clearly and obviously, ChatGPT is an unreliable toy. The author
| seems pleased with it. As an engineer, I find that unacceptable.
| jstummbillig wrote:
| ChatGPT is probably in the top 5 value/money subscriptions I
| have ever had (and that includes utilities).
|
| The relatively low price point certainly plays a role here,
| but it's certainly not a mainly recreational thing for me.
| These things are kinda hard to measure, but roughly the
| biggest plus is that engagement with hard stuff goes up, and
| the rate of learning goes up, by a lot.
| simonw wrote:
| Working with models like GPT-4 is frustrating from a
| traditional software engineering perspective because these
| systems are inherently unreliable and non-deterministic, which
| differs from most software tools that we use.
|
| That doesn't mean they can't be incredibly useful - but it does
| mean you have to approach them in a bit of a different way, and
| design software around them that takes their unreliability into
| account.
| jbeninger wrote:
| Unreliable? Non-deterministic? Hidden variables? Undocumented
| behaviour? C'mon fellow programmers who got their start in
| the Win-95 era! It's our time to shine!
| Kiro wrote:
| That has nothing to do with you being an engineer. It's just
| you. I'm an engineer and LLMs are game changers for me.
| chx wrote:
| https://hachyderm.io/@inthehands/112006855076082650
|
| > You might be surprised to learn that I actually think LLMs
| have the potential to be not only fun but genuinely useful.
| "Show me some bullshit that would be typical in this context"
| can be a genuinely helpful question to have answered, in code
| and in natural language -- for brainstorming, for seeing common
| conventions in an unfamiliar context, for having something
| crappy to react to.
|
| == End of toot.
|
| The price you pay for this bullshit in energy when the sea
| temperature is literally off the charts and we do not know why
| makes it not worth it in my opinion.
| FranklinMaillot wrote:
| In my limited experience, I came to the same conclusion
| regarding simple prompts being more effective than very
| detailed lists of instructions. But if you look at OpenAI's
| system prompt for GPT4, it's an endless set of instructions
| with DOs and DON'Ts, so I'm confused. Surely they must know
| something about prompting their model.
| bongodongobob wrote:
| That's for chatting and interfacing conversationally with a
| human. Using the API is a completely different ballgame because
| it's not meant to be a back and forth conversation with a
| human.
| Civitello wrote:
| > Every use case we have is essentially "Here's a block of
| text, extract something from it."
|
| As a rule, if you ask GPT to give you the names of companies
| mentioned in a block of text, it will not give you a random
| company (unless there are no companies in the text - there's
| that null hypothesis problem!). Make it two steps. First:
|
| > Does this block of text mention a company?
|
| If no, good, you've got your null result. If yes:
|
| > Please list the names of companies in this block of text.
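|
| A bare-bones sketch of that two-step flow (assuming the
| openai>=1.0 Python client; the model name is illustrative):
|
|   from openai import OpenAI
|
|   client = OpenAI()
|
|   def ask(prompt):
|       resp = client.chat.completions.create(
|           model="gpt-4", temperature=0,
|           messages=[{"role": "user", "content": prompt}])
|       return resp.choices[0].message.content.strip()
|
|   def extract_companies(text):
|       gate = ask("Does this block of text mention a company? "
|                  "Answer Yes or No.\n\n" + text)
|       if gate.lower().startswith("no"):
|           return []  # the null result, decided up front
|       listing = ask("Please list the names of companies in this "
|                     "block of text, one per line.\n\n" + text)
|       return [line.strip("- ").strip()
|               for line in listing.splitlines() if line.strip()]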
| sungho_ wrote:
| I'm curious if the OP has tried any of the libraries that
| constrain the output of LLMs (LMQL, Outlines, Guidance, ...),
| and for those who have: do you find them as unnecessary as
| LangChain? In particular, the OP's post mentions the problem
| of not being able to generate JSON with more than 15 items,
| which seems like a problem that can be solved by constraining
| the LLM's output. Is that correct?
| LASR wrote:
| If you want x number of items every time, ask it to include a
| sequence number in each output, and it will consistently
| return x number of items.
|
| Numbered bullets work well for this, if you don't need JSON.
| With JSON, you can ask it to include an 'id' in each item.
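|
| A small sketch of the validation side (the field name and
| error handling are illustrative):
|
|   import json
|
|   def parse_numbered_items(raw_json, expected):
|       items = json.loads(raw_json)  # expecting a JSON array of objects
|       ids = [item.get("id") for item in items]
|       if ids != list(range(1, expected + 1)):
|           raise ValueError(f"expected ids 1..{expected}, got {ids}")
|       return items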
| orbatos wrote:
| Statements like this tell me your analysis is poisoned by
| misunderstandings: "Why is this crazy? Well, it's crazy that
| GPT's quality and generalization can improve when you're more
| vague - this is a quintessential marker of higher-order
| delegation / thinking." No, there is no "higher-order thought"
| happening, or any at all actually. That's not how these models
| work.
| Xenoamorphous wrote:
| > We always extract json. We don't need JSON mode
|
| I wonder why? It seems to work pretty well for me.
|
| > Lesson 4: GPT is really bad at producing the null hypothesis
|
| Tell me about it! Just yesterday I was testing a prompt around
| text modification rules that ended with "If none of the rules
| apply to the text, return the original text without any changes".
|
| Do you know ChatGPT's response to a text where none of the rules
| applied?
|
| "The original text without any changes". Yes, the literal string.
| mechagodzilla wrote:
| AmeliaBedeliaGPT
| phillipcarter wrote:
| > I wonder why? It seems to work pretty well for me.
|
| I read this as "what we do works just fine to not need to use
| JSON mode". We're in the same boat at my company. Been live for
| a year now, no need to switch. Our prompt is effective at
| getting GPT-3.5 to always produce JSON.
| Kiro wrote:
| There's nothing to switch to. You just enable it. No need to
| change the prompt or anything else. All it requires is that
| you mention "JSON" in your prompt, which you obviously
| already do.
| ShamelessC wrote:
| I think that's only true when using ChatGPT via the
| web/app, not when used via API as they likely are. Happy to
| be corrected however.
| throwup238 wrote:
| If you don't know, why speculate on something that is
| easy to look up in documentation?
|
| https://platform.openai.com/docs/guides/text-
| generation/json...
| phillipcarter wrote:
| You do need to change the prompt. You need to explicitly
| tell it to emit JSON, and in my experience, if you want it
| to follow a format you need to also provide that format.
|
| I've found that this is pretty simple to do when you have a
| basic schema and there's no need to define one and enable
| function calling.
|
| But in one of my cases, the schema is quite complicated,
| and "model doesn't produce JSON" hasn't been a problem for
| us in production. There's no incentive for us to change
| what we have that's working very well.
| CuriouslyC wrote:
| You know all the stories about the capricious djinn that grants
| cursed wishes based on the literal wording? That's what we
| have. Those of us who've been prompting models in image space
| for years now have gotten a handle on this but for people who
| got in because of LLMs, it can be a bit of a surprise.
|
| One fun anecdote, a while back I was making an image of three
| women drinking wine in a fancy garden for a tarot card, and at
| the end of the prompt I had "lush vegetation" but that was
| enough to tip the women from classy to red nosed frat girls,
| because of the double meaning of lush.
| heavyset_go wrote:
| The monkey paw curls a finger.
| gmd63 wrote:
| Programming is already the capricious djinn, only it's
| completely upfront as to how literally it interprets your
| commands. The guise of AI being able to infer your actual
| intent, which is impossible to do accurately, even for
| humans, is distracting tech folks from one of the main
| blessings of programming: forcing people to think before they
| speak and hone their intention.
| MPSimmons wrote:
| That's kind of adorable, in an annoying sort of way
| kromem wrote:
| Tip for your 'null' problem:
|
| LLMs are set up to output tokens. Not to not output tokens.
|
| So instead of "don't return anything" have the lack of results
| "return the default value of XYZ" and then just do a text search
| on the result for that default value (i.e. XYZ) the same way you
| do the text search for the state names.
|
| Also, system prompts can be very useful. It's basically your
| opportunity to have the LLM roleplay as X. I wish they'd let the
| system prompt be passed directly, but it's still better than
| nothing.
| nprateem wrote:
| Anyone have any good tips for stopping it sounding like it's
| writing essay answers, and for flat out banning "in the realm
| of", delve, pivotal, multifaceted, etc.?
|
| I don't want a crap intro or waffley summary, but it just
| can't help itself.
| dudeinhawaii wrote:
| My approach is to use words that indicate what I want like
| 'concise', 'brief', etc. If you know a word that precisely
| describes your desired type of content then use that. It's
| similar to art generation models, a single word brings so much
| contextual baggage with it. Finding the right words helps a
| lot. You can even ask the LLMs for assistance in finding the
| words to capture your intent.
|
| As an example of contextual baggage, I wrote a tool where I had
| to adjust the prompt between Claude and GPT-4 because using the
| word "website" in the prompt caused GPT-4 (API) to go into its
| 'I do not have access to the internet' tirade about 30% of the
| time. The tool was a summary of web pages experiment. By
| removing 'website' and replacing it with 'content' (e.g.
| 'summarize the following content') GPT-4 happily complied 100%
| of the time.
| 2099miles wrote:
| Great take, insightful. Highly recommend.
| chromanoid wrote:
| GPT is very cool, but I strongly disagree with the interpretation
| in these two paragraphs:
|
| _I think in summary, a better approach would've been "You
| obviously know the 50 states, GPT, so just give me the full name
| of the state this pertains to, or Federal if this pertains to the
| US government."_
|
| _Why is this crazy? Well, it's crazy that GPT's quality and
| generalization can improve when you're more vague - this is a
| quintessential marker of higher-order delegation / thinking._
|
| Natural language is the most probable output for GPT, because the
| text it was trained with is similar. In this case the developer
| simply leaned more into what GPT is good at than giving it more
| work.
|
| You can use simple tasks to make GPT fail. Letter replacements,
| intentional typos and so on are very hard tasks for GPT. This is
| also true for ID mappings and similar, especially when the ID
| mapping diverges significantly from other mappings it may have
| been trained with (e.g. Non-ISO country codes but similar three
| letter codes etc.).
|
| The fascinating thing is that GPT "understands" mappings at
| all, which is the actual hint at higher-order pattern
| matching.
| fl0id wrote:
| Well, or it is just memorizing mappings. Not as in literally
| reproducing them, but having vectors similar to mappings that
| it saw before.
| chromanoid wrote:
| Yeah, but isn't this higher-order pattern matching? You can
| at least correct it during a conversation and GPT will then
| use the correct mappings, probably most of the time (sloppy
| experiment): https://chat.openai.com/share/7574293a-6d08-4159
| -a988-4f0816...
| konstantinua00 wrote:
| > Have you tried Claude, Gemini, etc?
|
| > It's the subtle things mostly, like intuiting intention.
|
| this makes me wonder - what if the author "trained" himself onto
| chatgpt's "dialect"? How do we even detect that in ourselves?
|
| and are we about to have "preferred_LLM wars" like we had
| "programming language wars" for the last 2 decades?
| egonschiele wrote:
| I have a personal writing app that uses the OpenAI models and
| this post is bang on. One of my learnings related to "Lesson 1:
| When it comes to prompts, less is more":
|
| I was trying to build an intelligent search feature for my notes
| and asking ChatGPT to return structured JSON data. For example, I
| wanted to ask "give me all my notes that mention Haskell in the
| last 2 years that are marked as draft", and let Chat GPT figure
| out what to return. This only worked some of the time. Instead, I
| put my data in a SQLite database, sent ChatGPT the schema, and
| asked it to write a query to return what I wanted. That has
| worked much better.
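|
| A rough sketch of that schema-plus-question loop (assuming the
| openai>=1.0 Python client; the read-only connection and prompt
| are illustrative, and you'd still want to sanity-check the
| generated SQL before running it):
|
|   import sqlite3
|   from openai import OpenAI
|
|   client = OpenAI()
|
|   def search_notes(db_path, question):
|       # Read-only connection, so a bad statement can't write anything.
|       conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
|       schema = "\n".join(row[0] for row in conn.execute(
|           "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"))
|       sql = client.chat.completions.create(
|           model="gpt-4", temperature=0,
|           messages=[{"role": "user", "content":
|               f"Schema:\n{schema}\n\nWrite a single SQLite SELECT "
|               f"statement that answers: {question}\n"
|               "Return only the SQL, no explanation."}]
|       ).choices[0].message.content.strip().strip("`")
|       return conn.execute(sql).fetchall()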
| ukuina wrote:
| Have you tried response_format=json_object?
|
| I had better luck with function-calling to get a structured
| response, but it is more limiting than just getting a JSON
| body.
| egonschiele wrote:
| I haven't tried response_format, I'll give that a shot. I've
| had issues with function calling. Sometimes it works,
| sometimes it just returns random Python code.
| squigz wrote:
| This seems like something that would be better suited by a
| database and good search filters rather than an LLM...
| az226 wrote:
| Something something about everything looking like a nail when
| you're holding a hammer
| chasd00 wrote:
| I set up a search engine to feed a RAG setup a while back.
| At the end of the day, I took out the LLM and just used the
| search engine. That was where the value turned out to be.
| aubanel wrote:
| > I think in summary, a better approach would've been "You
| obviously know the 50 states, GPT, so just give me the full name
| of the state this pertains to, or Federal if this pertains to the
| US government."
|
| Why not really compare the two options, author? I would love to
| see the results!
| pamelafox wrote:
| Lol, nice truncation logic! If anyone's looking for something
| slightly fancier, I made a micro-package for our tiktoken-based
| truncation here: https://github.com/pamelafox/llm-messages-token-
| helper
| pamelafox wrote:
| I've also seen that GPTs struggle to admit when they don't know. I
| wrote up an approach for evaluating that here -
| http://blog.pamelafox.org/2024/03/evaluating-rag-chat-apps-c...
|
| Changing the prompt didn't help, but moving to GPT-4 did help a
| bit.
| Kiro wrote:
| > We always extract json. We don't need JSON mode,
|
| Why? The null stuff would not be a problem if you did and if
| you're only dealing with JSON anyway I don't see why you
| wouldn't.
| littlestymaar wrote:
| > One part of our pipeline reads some block of text and asks GPT
| to classify it as relating to one of the 50 US states, or the
| Federal government.
|
| Using a multi-billion-parameter model like GPT-4 for such a
| trivial classification task[1] is insane overkill. And in an
| era where ChatGPT exists, and can in fact give you what you
| need to build a simpler classifier for the task, it shows how
| narrow-minded most people are when AI is involved.
|
| [1] to clarify, it's either trivial or impossible to do reliably
| depending on how fucked-up your input is
| gok wrote:
| So these guys are just dumping confidential tax documents onto
| OpenAI's servers huh.
| goatlover wrote:
| Hopefully it won't end up as training data.
| amelius wrote:
| This reads a bit like: I have a circus monkey. If I do such and
| such it will not do anything. But when I do this and that, then
| it will ride the bicycle. Most of the time.
| saaaaaam wrote:
| I don't really understand your comment.
|
| Personally I thought this was an interesting read - and more
| interesting because it didn't contain any massive "WE DID THIS
| AND IT CHANGED OUR LIVES!!!" style revelations.
|
| It is discursive, thoughtful and not overwritten. I find this
| kind of content valuable and somewhat rare.
___________________________________________________________________
(page generated 2024-04-14 23:01 UTC)