[HN Gopher] Lessons after a Half-billion GPT Tokens
       ___________________________________________________________________
        
       Lessons after a Half-billion GPT Tokens
        
       Author : lordofmoria
       Score  : 204 points
        Date   : 2024-04-12 17:06 UTC (1 day ago)
        
 (HTM) web link (kenkantzer.com)
 (TXT) w3m dump (kenkantzer.com)
        
       | Yacovlewis wrote:
       | Interesting piece!
       | 
        | My experience around Langchain/RAG differs, so I wanted to dig
        | deeper: putting some logic around handling relevant results helps
        | us produce useful output. Curious what differs on their end.
        
         | mind-blight wrote:
         | I suspect the biggest difference is the input data. Embeddings
         | are great over datasets that look like FAQs and QA docs, or
         | data that conceptually fits into very small chunks (tweets,
         | some product reviews, etc).
         | 
          | It does very badly over diverse business docs, especially with
          | naive chunking. B2B use cases usually have old PDFs and Word
          | docs that need to be searched, and users are often looking for
          | specific keywords (e.g. a person's name, a product, an id,
          | etc.). Vectors tend to do badly in those kinds of searches, and
          | just returning chunks misses a lot of important details.
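          | 
          | A rough sketch of the hybrid scoring that works better for us
          | (toy code; embed() here is just a stand-in for a real
          | embedding model):
          | 
          |   import hashlib
          |   import numpy as np
          | 
          |   def embed(text: str) -> np.ndarray:
          |       # Toy stand-in for a real embedding model (an API call).
          |       seed = int.from_bytes(
          |           hashlib.md5(text.encode()).digest()[:4], "little")
          |       v = np.random.default_rng(seed).standard_normal(64)
          |       return v / np.linalg.norm(v)
          | 
          |   def hybrid_score(query: str, chunk: str, alpha=0.5) -> float:
          |       # Exact keyword hits catch the ids/names embeddings miss.
          |       terms = query.lower().split()
          |       keyword = sum(t in chunk.lower() for t in terms) / len(terms)
          |       semantic = float(embed(query) @ embed(chunk))
          |       return alpha * keyword + (1 - alpha) * semantic
          | 
          |   chunks = ["Invoice #88123 for Acme Corp", "General payment terms"]
          |   print(max(chunks, key=lambda c: hybrid_score("acme 88123", c)))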
        
           | gdiamos wrote:
            | Rare words are effectively out-of-vocabulary errors for
            | vectors, especially if they aren't in the token vocab.
        
             | mind-blight wrote:
             | Even worse, named entities vary from organization to
             | organization.
             | 
              | We have a client who uses a product called "Time". It's
              | time-management software. For that customer's
             | documentation, time should be close to "product" and a
             | bunch of other things that have nothing to do with the
             | normal concept of time.
             | 
              | I actually suspect that people would get a lot more bang
              | for their buck fine-tuning the embedding models on B2B
              | datasets for their use case, rather than fine-tuning an LLM.
        
       | trolan wrote:
        | For a few uni/personal projects I noticed the same about
        | Langchain: it's good at helping you use up tokens. The other use
        | case, quickly switching between models, is still a valid reason
        | to use it. However, I've recently started playing with
        | OpenRouter, which seems to abstract the model nicely.
        
         | sroussey wrote:
         | If someone were to create something new, a blank slate
         | approach, what would you find valuable and why?
        
           | lordofmoria wrote:
           | This is a great question!
           | 
           | I think we now know, collectively, a lot more about what's
           | annoying/hard about building LLM features than we did when
           | LangChain was being furiously developed.
           | 
           | And some things we thought would be important and not-easy,
           | turned out to be very easy: like getting GPT to give back
           | well-formed JSON.
           | 
           | So I think there's lots of room.
           | 
           | One thing LangChain is doing now that solves something that
           | IS very hard/annoying is testing. I spent 30 minutes
           | yesterday re-running a slow prompt because 1 in 5 runs would
           | produce weird output. Each tweak to the prompt, I had to run
           | at least 10 times to be reasonably sure it was an
           | improvement.
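            | 
            | For now I hack around it with a crude harness, roughly this
            | sketch (assumes the openai v1 Python client; is_valid() is a
            | hypothetical check for whatever "weird output" means to you):
            | 
            |   import json
            |   from openai import OpenAI
            | 
            |   client = OpenAI()  # reads OPENAI_API_KEY from the env
            | 
            |   def is_valid(output: str) -> bool:
            |       # Hypothetical check - here: does it parse as JSON?
            |       try:
            |           json.loads(output)
            |           return True
            |       except ValueError:
            |           return False
            | 
            |   def pass_rate(prompt: str, runs: int = 10) -> float:
            |       ok = 0
            |       for _ in range(runs):
            |           resp = client.chat.completions.create(
            |               model="gpt-4",
            |               messages=[{"role": "user", "content": prompt}],
            |           )
            |           ok += is_valid(resp.choices[0].message.content)
            |       return ok / runs
            | 
            |   print(pass_rate("Return a JSON list of 5 colors."))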
        
             | codewithcheese wrote:
              | It can be faster and more effective to fall back to a
              | smaller model (GPT-3.5 or Haiku): the weaknesses of the
              | prompt will be more obvious on a smaller model, and your
              | iteration time will be faster.
        
             | sroussey wrote:
             | How would testing work out ideally?
        
           | jsemrau wrote:
            | Use a local model. For most tasks they are good enough.
            | Mistral 0.2 Instruct, for example, is quite solid by now.
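            | 
            | A minimal local-inference sketch (assumes a recent
            | transformers version, the accelerate package, and enough
            | memory for the 7B weights):
            | 
            |   from transformers import pipeline
            | 
            |   # Downloads ~14GB of weights on first run.
            |   pipe = pipeline(
            |       "text-generation",
            |       model="mistralai/Mistral-7B-Instruct-v0.2",
            |       device_map="auto",
            |   )
            |   messages = [{"role": "user",
            |                "content": "Extract the US state: 'We moved to Austin, TX.'"}]
            |   out = pipe(messages, max_new_tokens=50)
            |   print(out[0]["generated_text"][-1]["content"])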
        
             | gnat wrote:
             | Do different versions react to prompts in the same way? I
             | imagined the prompt would be tailored to the quirks of a
             | particular version rather than naturally being stably
             | optimal across versions.
        
               | jsemrau wrote:
               | I suppose that is one of the benefits of using a local
               | model, that it reduces model risk. I.e., given a certain
               | prompt, it should always reply in the same way. Using a
               | hosted model, operationally you don't have that control
               | over model risk.
        
             | cpursley wrote:
             | What are the best local/open models for accurate tool-
             | calling?
        
       | disqard wrote:
       | > This worked sometimes (I'd estimate >98% of the time), but
       | failed enough that we had to dig deeper.
       | 
       | > While we were investigating, we noticed that another field,
       | name, was consistently returning the full name of the state...the
       | correct state - even though we hadn't explicitly asked it to do
       | that.
       | 
       | > So we switched to a simple string search on the name to find
       | the state, and it's been working beautifully ever since.
       | 
       | So, using ChatGPT helped uncover the correct schema, right?
        
       | WarOnPrivacy wrote:
       | > We consistently found that not enumerating an exact list or
       | instructions in the prompt produced better results
       | 
       | Not sure if he means training here or using his product. I think
       | the latter.
       | 
       | My end-user exp of GPT3.5 is that I need to be - not just precise
       | but the exact flavor of precise. It's usually after some trial
       | and error. Then more error. Then more trial.
       | 
       | Getting a useful result on the 1st or 3rd try happens maybe 1 in
       | 10 sessions. A bit more common is having 3.5 include what I
       | clearly asked it not to. It often complies eventually.
        
         | xp84 wrote:
          | OP uses GPT4 mostly. Another poster here observed that "the
          | opposite is required for 3.5" -- so I think your experience
          | makes sense.
        
       | KTibow wrote:
       | I feel like for just extracting data into JSON, smaller LLMs
       | could probably do fine, especially with constrained generation
       | and training on extraction.
        
       | CuriouslyC wrote:
       | If you used better prompts you could use a less expensive model.
       | 
       | "return nothing if you find nothing" is the level 0 version of
       | giving the LLM an out. Give it a softer out ("in the event that
       | you do not have sufficient information to make conclusive
       | statements, you may hypothesize as long as you state clearly that
       | you are doing so, and note the evidence and logical basis for
       | your hypothesis") then ask it to evaluate its own response at the
       | end.
        
         | codewithcheese wrote:
          | Yeah, and prompts should not be developed in the abstract. The
          | goal of a prompt is to activate the model's internal
          | representations so it can best achieve the task. Without
          | automated methods, this requires iteratively testing the
          | model's reaction to different inputs, trying to understand how
          | it's interpreting the request and where it's falling down, and
          | then patching up those holes.
         | 
         | Need to verify if it even knows what you mean by nothing.
        
           | jsemrau wrote:
            | In the end, it comes down to a task similar to people
            | management, where giving clear and simple instructions works
            | best.
        
       | thisgoesnowhere wrote:
       | The team I work on processes 5B+ tokens a month (and growing) and
       | I'm the EM overseeing that.
       | 
        | Here are my takeaways:
        | 
        | 1. There are way too many premature abstractions. Langchain, as
        | one of many examples, might be useful in the future, but at the
        | end of the day prompts are just an API call, and it's easier to
        | write standard code that treats LLM calls as a flaky API call
        | rather than as a special thing.
        | 
        | 2. Hallucinations are definitely a big problem. Summarizing is
        | pretty rock solid in my testing, but reasoning is really hard.
        | Action models, where you take a user input and try to get the
        | LLM to decide what to do next, are just really hard:
        | specifically, it's hard to get the LLM to understand the
        | context and to say when it's not sure.
       | 
       | That said, it's still a gamechanger that I can do it at all.
       | 
       | 3. I am a bit more hyped than the author that this is a game
       | changer, but like them, I don't think it's going to be the end of
       | the world. There are some jobs that are going to be heavily
       | impacted and I think we are going to have a rough few years of
       | bots astroturfing platforms. But all in all I think it's more of
       | a force multiplier rather than a breakthrough like the internet.
       | 
       | IMHO it's similar to what happened to DevOps in the 2000s, you
       | just don't need a big special team to help you deploy anymore,
       | you hire a few specialists and mostly buy off the shelf
       | solutions. Similarly, certain ML tasks are now easy to implement
       | even for dumb dumb web devs like me.
        
         | tmpz22 wrote:
         | > IMHO it's similar to what happened to DevOps in the 2000s,
         | you just don't need a big special team to help you deploy
         | anymore, you hire a few specialists and mostly buy off the
         | shelf solutions.
         | 
         | I advocate for these metaphors to help people better understand
         | a _reasonable_ expectation for LLMs in modern development
         | workflows. Mostly because they show it as a trade-off versus a
          | silver bullet. There were trade-offs to the evolution of
          | devops: consider, for example, the loss of key skillsets like
          | database administration as a direct result of "just use AWS
          | RDS", the explosion in cloud billing costs (especially the
          | OpEx of startups that weren't even dealing with that much data
          | or regional complexity!), and how that indirectly led to
          | GitLab's big outage and many like it.
        
         | gopher_space wrote:
         | > Summarizing is pretty rock solid in my testing, but reasoning
         | is really hard.
         | 
         | Asking for analogies has been interesting and surprisingly
         | useful.
        
         | motoxpro wrote:
         | Devops is such an amazing analogy.
        
         | lordofmoria wrote:
         | OP here - I had never thought of the analogy to DevOps before,
         | that made something click for me, and I wrote a post just now
          | riffing off this notion:
          | https://kenkantzer.com/gpt-is-the-heroku-of-ai
         | 
         | Basically, I think we're using GPT as the PaaS/heroku/render
         | equivalent of AI ops.
         | 
         | Thank you for the insight!!
        
         | ryoshu wrote:
         | > But all in all I think it's more of a force multiplier rather
         | than a breakthrough like the internet.
         | 
         | Thank you. Seeing similar things. Clients are also seeing
         | sticker shock on how much the big models cost vs. the output.
         | That will all come down over time.
        
       | eigenvalue wrote:
       | I agree with most of it, but definitely not the part about
       | Claude3 being "meh." Claude3 Opus is an amazing model and is
       | extremely good at coding in Python. The ability to handle massive
       | context has made it mostly replace GPT4 for me day to day.
       | 
       | Sounds like everyone eventually concludes that Langchain is
       | bloated and useless and creates way more problems than it solves.
       | I don't get the hype.
        
         | CuriouslyC wrote:
         | Claude is indeed an amazing model, the fact that Sonnet and
         | Haiku are so good is a game changer - GPT4 is too expensive and
         | GPT3.5 is very mediocre. Getting 95% of GPT4 performance for
         | GPT3.5 prices feels like cheating.
        
         | Oras wrote:
          | +1 for Claude Opus; it has been my go-to for the last 3 weeks
          | compared to GPT4. The generated texts are much better than
          | GPT4's when it comes to following the prompt.
         | 
         | I also tried the API for some financial analysis of large
         | tables, the response time was around 2 minutes, still did it
         | really well and timeout errors were around 1 to 2% only.
        
           | cpursley wrote:
           | How are you sending tabular data in a reliable way. And what
           | is the source document type? I'm trying to solve this for
           | complex financial-related tables in PDFs right now.
        
             | Oras wrote:
              | Amazon Textract to get the tables, then format them with
              | Python as CSV and send them to your preferred AI model.
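              | 
              | Roughly this (a trimmed sketch, assuming boto3 credentials
              | are configured; real Textract parsing needs more care with
              | merged cells, and multi-page PDFs need the async API):
              | 
              |   import boto3, csv, io
              | 
              |   def tables_to_csv(doc_bytes: bytes) -> str:
              |       blocks = boto3.client("textract").analyze_document(
              |           Document={"Bytes": doc_bytes},
              |           FeatureTypes=["TABLES"],
              |       )["Blocks"]
              |       by_id = {b["Id"]: b for b in blocks}
              |       out = io.StringIO()
              |       writer = csv.writer(out)
              |       for table in (b for b in blocks
              |                     if b["BlockType"] == "TABLE"):
              |           rows = {}
              |           for rel in table.get("Relationships", []):
              |               for cid in rel["Ids"]:
              |                   cell = by_id[cid]
              |                   if cell["BlockType"] != "CELL":
              |                       continue
              |                   words = [by_id[w]["Text"]
              |                            for r in cell.get("Relationships", [])
              |                            for w in r["Ids"]
              |                            if by_id[w]["BlockType"] == "WORD"]
              |                   rows.setdefault(cell["RowIndex"], {})[
              |                       cell["ColumnIndex"]] = " ".join(words)
              |           for _, cols in sorted(rows.items()):
              |               writer.writerow(v for _, v in sorted(cols.items()))
              |       return out.getvalue()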
        
               | cpursley wrote:
                | Thanks. How does Textract compare to some of the common
                | CLI utilities like pdftotext, tesseract, etc. (if you
                | made a comparison)?
        
               | Oras wrote:
                | I did; none of the open-source parsers worked well with
                | tables. I had the following issues:
                | 
                | - missing cells
                | 
                | - partial identification of numbers (e.g. £43.54 would
                | be picked up as £43)
                | 
                | What I did to compare was draw lines around the
                | identified text to visualize the accuracy. You can do
                | that with tesseract.
        
               | cpursley wrote:
               | Interesting. Did you try MS's offering (Azure AI Document
               | Intelligence). Their pricing seems better than Amazon.
        
       | mvkel wrote:
        | I share a lot of this experience. My fix for "Lesson 4: GPT is
        | really bad at producing the null hypothesis" is to have it
        | return very specific text that I string-match on and treat as
        | null.
       | 
       | Like: "if there is no warm up for this workout, use the following
       | text in the description: NOPE"
       | 
       | then in code I just do a "if warm up contains NOPE, treat it as
       | null"
        
         | gregorymichael wrote:
         | For cases of "select an option from this set" I have it return
         | an index of the correct option, or eg 999 if it can't find one.
         | This helped a lot.
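          | 
          | e.g. (sketch):
          | 
          |   OPTIONS = ["red", "green", "blue"]
          | 
          |   def pick(index: int) -> str | None:
          |       # Model returns an index; 999 is its explicit "no match".
          |       return OPTIONS[index] if 0 <= index < len(OPTIONS) else None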
        
           | mvkel wrote:
           | Smart
        
         | gdiamos wrote:
          | We do this for the null hypothesis - it uses an LLM to
          | bootstrap a binary classifier - which handles null easily.
         | 
         | https://github.com/lamini-ai/llm-classifier
        
       | albert_e wrote:
       | > Are we going to achieve Gen AI?
       | 
       | > No. Not with this transformers + the data of the internet + $XB
       | infrastructure approach.
       | 
       | Errr ...did they really mean Gen AI .. or AGI?
        
         | mdorazio wrote:
         | Gen as in "General" not generative.
        
       | _pdp_ wrote:
       | The biggest realisation for me while making ChatBotKit has been
       | that UX > Model alone. For me, the current state of AI is not
       | about questions and answers. This is dumb. The presentation
       | matters. This is why we are now investing in generative UI.
        
         | codewithcheese wrote:
         | How are you using Generative UI?
        
           | _pdp_ wrote:
           | Sorry, not much to show at the moment. It is also pretty new
           | so it is early days.
           | 
           | You can find some open-source examples here
           | https://github.com/chatbotkit. More coming next week.
        
         | pbhjpbhj wrote:
          | Generative UI being creation of a specific UI dependent on
          | output from your model? What model is it?
         | 
         | Google Gemini were showing something that I'd call 'adapted
         | output UI' in their launch presentation. Is that close to what
         | you're doing in any way?
        
       | swalsh wrote:
        | The "being too precise reduces accuracy" example makes sense to
        | me, based on my crude understanding of how these things work.
        | 
        | If you pass in a whole list of states, you're kind of making the
        | vectors for every state light up. If you just say "state" and
        | the text you passed in has an explicit state, then only the
        | vectors specific to what you're searching for light up. So when
        | it performs the softmax, the correct state is more likely to be
        | selected.
        | 
        | Along the same lines, I think his \n vs comma comparison
        | probably comes down to tokenization differences.
        
       | legendofbrando wrote:
        | The finding on simpler prompts tracks, especially with GPT4 (3.5
        | requires the opposite).
        | 
        | The take on RAG feels application-specific. For our use case,
        | where details of the past are served up, the ability to generate
        | loose connections is actually a feature. Things like this are
        | what excite me most about LLMs: having a way to proxy subjective
        | similarities, the way we do when we remember things, is one of
        | the benefits of the technology that didn't really exist before,
        | and it opens up a new kind of product opportunity.
        
       | AtNightWeCode wrote:
        | The UX is an important part of the trick that cons people into
        | thinking these tools are better than they are. If you, for
        | instance, instruct ChatGPT to only answer yes or no, it will
        | feel like it is wrong much more often.
        
       | ilaksh wrote:
       | I recently had a bug where I was sometimes sending the literal
       | text "null " right in front of the most important part of my
       | prompt. This caused Claude 3 Sonnet to give the 'ignore' command
       | in cases where it should have used one of the other JSON commands
       | I gave it.
       | 
       | I have an ignore command so that it will wait when the user isn't
       | finished speaking. Which it generally judges okay, unless it has
       | 'null' in there.
       | 
       | The nice thing is that I have found most of the problems with the
       | LLM response were just indications that I hadn't finished
       | debugging my program because I had something missing or weird in
       | the prompt I gave it.
        
       | ein0p wrote:
       | Same here: I'm subscribed to all three top dogs in LLM space, and
        | routinely issue the same prompts to all three. It's very
        | one-sided in favor of GPT4, which is stunning since it's now a
        | year old, although of course it has received a couple of
        | updates in that time. Also, at least with my usage patterns,
        | hallucinations are rare. In comparison, Claude will quite
        | readily hallucinate
       | plausible looking APIs that don't exist when writing code, etc.
       | GPT4 is also more stubborn / less agreeable when it knows it's
       | right. Very little of this is captured in metrics, so you can
       | only see it from personal experience.
        
         | CharlesW wrote:
         | This was with Claude Opus, vs. one of the lesser variants? I
         | really like Opus for English copy generation.
        
           | ein0p wrote:
           | Opus, yes, the $20/mo version. I usually don't generate copy.
           | My use cases are code (both "serious" and "the nice to have
           | code I wouldn't bother writing otherwise"), learning how to
           | do stuff in unfamiliar domains, and just learning unfamiliar
           | things in general. It works well as a very patient teacher,
           | especially if you already have some degree of familiarity
           | with the problem domain. I do have to check it against
           | primary sources, which is how I know the percentage of
           | hallucinations is very low. For code, however I don't even
           | have to do that, since as a professional software engineer I
           | am the "primary source".
        
         | Me1000 wrote:
         | Interesting, Claude 3 Opus has been better than GPT4 for me.
         | Mostly in that I find it does a better (and more importantly,
         | more thorough) job of explaining things to me. For coding tasks
         | (I'm not asking it to write code, but instead to explain
         | topics/code/etc to me) I've found it tends to give much more
         | nuanced answers. When I give it long text to converse about, I
          | find Claude Opus tends to have a much deeper understanding of
          | the content it's given: GPT4 tends to just summarize the text
          | at hand, whereas Claude is able to extrapolate better.
        
           | robocat wrote:
           | How much of this is just that one model responds better to
           | the way you write prompts?
           | 
           | Much like you working with Bob and opining that Bob is great,
           | and me saying that I find Jack easier to work with.
        
             | richardw wrote:
             | [delayed]
        
       | msp26 wrote:
       | > But the problem is even worse - we often ask GPT to give us
       | back a list of JSON objects. Nothing complicated mind you: think,
       | an array list of json tasks, where each task has a name and a
       | label.
       | 
       | > GPT really cannot give back more than 10 items. Trying to have
       | it give you back 15 items? Maybe it does it 15% of the time.
       | 
       | This is just a prompt issue. I've had it reliably return up to
       | 200 items in correct order. The trick is to not use lists at all
       | but have JSON keys like "item1":{...} in the output. You can use
       | lists as the values here if you have some input with 0-n outputs.
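        | 
        | A sketch of the idea (hypothetical prompt wording; the point is
        | the numbered keys, not the exact phrasing):
        | 
        |   import json
        | 
        |   N = 200
        |   prompt = (
        |       f"Return a JSON object with keys item1 through item{N}. "
        |       'Each value is an object like {"name": ..., "label": ...}.'
        |   )
        |   # ...send `prompt` to the model; suppose `raw` is its reply:
        |   raw = '{"item1": {"name": "a", "label": "x"}, ' \
        |         '"item2": {"name": "b", "label": "y"}}'
        |   data = json.loads(raw)
        |   items = [data[f"item{i}"] for i in range(1, len(data) + 1)]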
        
         | 7thpower wrote:
         | Can you elaborate? I am currently beating my head against this.
         | 
         | If I give GPT4 a list of existing items with a defined
         | structure, and it is just having to convert schema or something
         | like that to JSON, it can do that all day long. But if it has
         | to do any sort of reasoning and basically create its own list,
         | it only gives me a very limited subset.
         | 
         | I have similar issues with other LLMs.
         | 
         | Very interested in how you are approaching this.
        
           | msp26 wrote:
           | If you show your task/prompt with an example I'll see if I
           | can fix it and explain my steps.
           | 
           | Are you using the function calling/tool use API?
        
             | ctxc wrote:
             | Hi! My work is similar and I'd love to have someone to
             | bounce ideas off of if you don't mind.
             | 
             | Your profile doesn't have contact info though. Mine does,
             | please send me a message. :)
        
           | thibaut_barrere wrote:
           | Not sure if that fits the bill, but here is an example with
           | 200 sorted items based on a question (example with Elixir &
           | InstructorEx):
           | 
           | https://gist.github.com/thbar/a53123cbe7765219c1eca77e03e675.
           | ..
        
         | waldrews wrote:
          | I've been telling it the user is from a culture where answering
          | questions with an incomplete list is offensive and insulting.
        
           | andenacitelli wrote:
           | This is absolutely hilarious. Prompt engineering is such a
           | mixed bag of crazy stuff that actually works. Reminds me of
           | how they respond better if you put them under some kind of
           | pressure (respond better, _or else_ ...).
           | 
           | I haven't looked at the prompts we run in prod at $DAYJOB for
           | a while but I think we have at least five or ten things that
           | are REALLY weird out of context.
        
       | neals wrote:
       | Do I need langchain if I want to analyze a large document of many
       | pages?
        
         | simonw wrote:
          | No. But it might help, because you'll probably have to roll
          | some kind of recursive summarization - I think LangChain has
          | mechanisms for that which could save you some time.
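          | 
          | The shape of it without LangChain is small, though (sketch;
          | llm_summarize() is a hypothetical single-call wrapper, stubbed
          | here):
          | 
          |   def llm_summarize(text: str) -> str:
          |       # Stand-in for one LLM call that returns a summary.
          |       return text[:200]
          | 
          |   def summarize(doc: str, chunk_size: int = 8000) -> str:
          |       # Chunk, summarize each chunk, then summarize the
          |       # joined summaries until the whole thing fits.
          |       if len(doc) <= chunk_size:
          |           return llm_summarize(doc)
          |       chunks = [doc[i:i + chunk_size]
          |                 for i in range(0, len(doc), chunk_size)]
          |       return summarize(" ".join(llm_summarize(c) for c in chunks),
          |                        chunk_size)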
        
       | larodi wrote:
        | I largely agree with the author, but this 'wait for OpenAI to
        | do it' sentiment is not valid. Opus, for example, is already
        | much better (not only per my experience, but per researchers'
        | evaluations). And even for the fun of it - try some local
        | inference. If you know how to prompt, you could definitely run
        | local models for the same tasks.
        | 
        | Listening to my students all going to 'call some API' for their
        | projects is really very sad to hear. Many startup fellows share
        | this sentiment, which totally kills all the joy.
        
         | jstummbillig wrote:
          | It sounds like you are a tech educator, which potentially
          | sounds like a lot of fun with LLMs right now.
          | 
          | When you are integrating these things into your business, you
          | are looking for different things. Most of our customers would,
          | for example, not find it very cool to have a service outage
          | because somebody didn't want to kill all the joy.
        
           | larodi wrote:
            | Sure, when availability and SLAs kick in..., but reselling
            | APIs will only get you so far. Perhaps the whole pros/cons
            | cloud argument could also kick in here; I'm not going into
            | it. We may well be on the same page, or we both may have
            | valid arguments. Your comment is appreciated indeed.
           | 
           | But then is the author (and are we) talking experience in
           | reselling APIs or experience in introducing NNs in the
           | pipeline? Not the same thing IMHO.
           | 
           | Agreed that OpenAI provides very good service, Gemini is not
           | quite there yet, Groq (the LPUs) delivered a nice tech demo,
           | Mixtral is cool but lacks in certain areas, and Claude can be
           | lengthy.
           | 
           | But precisely because I'm not sticking with OAI I can then
           | restate my view that if someone is so good with prompts he
           | can get the same results locally if he knows what he's doing.
           | 
           | Prompting OpenAI the right way can be similarly difficult.
           | 
           | Perhaps the whole idea of local inference only matters for
           | IoT scenarios or whenever data is super sensitive (or CTO
           | super stubborn to let it embed and fly). But then if you
           | start from day 1 with WordPress provisioned for you ready to
           | go in Google Cloud, you'd never understand the underlying
           | details of the technology.
           | 
           | There sure also must be a good reason why Phind tuned their
           | own thing to offer alongside GPT4 APIs.
           | 
            | Disclaimer: tech education is a side thing I do, indeed, and
            | I've been doing it in person for a very long time, across
            | more than a dozen topics, so I allow myself to have an
            | opinion. Of course, business is a different matter, and
            | strategic decisions are not the same. Even so, I'd not
            | advise anyone to blindly use APIs unless they properly
            | appreciate the need.
        
         | kromem wrote:
         | Claude does have more of a hallucination problem than GPT-4,
         | and a less robust knowledge base.
         | 
         | It's much better at critical thinking tasks and prose.
         | 
         | Don't mistake benchmarks for real world performance across
         | actual usecases. There's a bit of Goodhart's Law going on with
         | LLM evaluation and optimization.
        
       | dougb5 wrote:
       | The lessons I wanted from this article weren't in there: Did all
       | of that expenditure actually help their product in a measurable
       | way? Did customers use and appreciate the new features based on
       | LLM summarization compared to whatever they were using before? I
       | presume it's a net win or they wouldn't continue to use it, but
       | more specifics around the application would be helpful.
        
         | lordofmoria wrote:
         | Hey, OP here!
         | 
         | The answer is a bit boring: the expenditure definitely has
         | helped customers - in that, they're using AI generated
         | responses in all their work flows all the time in the app, and
         | barely notice it.
         | 
         | See what I did there? :) I'm mostly serious though - one weird
         | thing about our app is that you might not even know we're using
         | AI, unless we literally tell you in the app.
         | 
         | And I think that's where we're at with AI and LLMs these days,
         | at least for our use case.
         | 
         | You might find this other post I just put up to have more
         | details too, related to how/where I see the primary value:
         | https://kenkantzer.com/gpt-is-the-heroku-of-ai/
        
       | haolez wrote:
       | That has been my experience too. The null hypothesis explains
       | almost all of my hallucinations.
       | 
       | I just don't agree with the Claude assessment. In my experience,
       | Claude 3 Opus is vastly superior to GPT-4. Maybe the author was
       | comparing with Claude 2? (And I've never tested Gemini)
        
       | satisfice wrote:
       | I keep seeing this pattern in articles like this:
       | 
        | 1. A recitation of terrible problems.
        | 
        | 2. A declaration of general satisfaction.
       | 
       | Clearly and obviously, ChatGPT is an unreliable toy. The author
       | seems pleased with it. As an engineer, I find that unacceptable.
        
         | jstummbillig wrote:
         | ChatGPT is probably in the top 5 value/money subscriptions I
         | have ever had (and that includes utilities).
         | 
          | The relatively low price point certainly plays a role here,
          | but it's certainly not a mainly recreational thing for me.
          | These things are kinda hard to measure, but roughly: my
          | engagement with hard stuff goes up, and my rate of learning
          | goes up, by a lot.
        
         | simonw wrote:
         | Working with models like GPT-4 is frustrating from a
         | traditional software engineering perspective because these
         | systems are inherently unreliable and non-deterministic, which
         | differs from most software tools that we use.
         | 
         | That doesn't mean they can't be incredibly useful - but it does
         | mean you have to approach them in a bit of a different way, and
         | design software around them that takes their unreliability into
         | account.
        
       | FranklinMaillot wrote:
        | In my limited experience, I came to the same conclusion: a
        | simple prompt is more effective than a very detailed list of
        | instructions. But if you look at OpenAI's system prompt for
        | GPT4, it's an endless set of instructions with DOs and DON'Ts,
        | so I'm confused. Surely they must know something about prompting
        | their own model.
        
         | bongodongobob wrote:
         | That's for chatting and interfacing conversationally with a
         | human. Using the API is a completely different ballgame because
         | it's not meant to be a back and forth conversation with a
         | human.
        
       | Civitello wrote:
        | > Every use case we have is essentially "Here's a block of
        | text, extract something from it."
        | 
        | As a rule, if you ask GPT to give you the names of companies
        | mentioned in a block of text, it will not give you a random
        | company (unless there are no companies in the text - there's
        | that null hypothesis problem!). Make it two steps. First: "Does
        | this block of text mention a company?" If no, good: you've got
        | your null result. If yes: "Please list the names of companies
        | in this block of text."
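        | 
        | As a sketch (ask() is a hypothetical wrapper around one chat
        | completion call, stubbed here):
        | 
        |   def ask(prompt: str) -> str:
        |       # Stand-in for a single LLM call.
        |       return "No"
        | 
        |   def extract_companies(text: str) -> list[str]:
        |       # Step 1: a cheap yes/no gate gives a clean null result.
        |       gate = ask("Does this block of text mention a company? "
        |                  "Answer yes or no.\n\n" + text)
        |       if gate.strip().lower().startswith("no"):
        |           return []
        |       # Step 2: only now ask for the actual list.
        |       return ask("Please list the names of companies in this "
        |                  "block of text, one per line.\n\n" + text).splitlines()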
        
       | sungho_ wrote:
        | I'm curious if the OP has tried any of the libraries that
        | constrain the output of LLMs (LMQL, Outlines, Guidance, ...),
        | and for those who have: do you find them as unnecessary as
        | LangChain? In particular, the OP's post mentions the problem of
        | not being able to generate JSON with more than 15 items, which
        | seems like a problem that could be solved by constraining the
        | output of the LLM. Is that correct?
        
         | LASR wrote:
          | If you want x items every time, ask it to include a sequence
          | number in each output; it will then consistently return x
          | items.
          | 
          | Numbered bullets work well for this if you don't need JSON.
          | With JSON, you can ask it to include an 'id' in each item.
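          | 
          | The ids also make dropped items easy to detect (sketch):
          | 
          |   def complete(items: list[dict], expected: int) -> bool:
          |       # Explicit ids reveal gaps or truncation immediately.
          |       return ([it["id"] for it in items]
          |               == list(range(1, expected + 1)))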
        
       | orbatos wrote:
       | Statements like this tell me your analysis is poisoned by
       | misunderstandings: "Why is this crazy? Well, it's crazy that
       | GPT's quality and generalization can improve when you're more
       | vague - this is a quintessential marker of higher-order
       | delegation / thinking." No, there is no "higher-order thought"
       | happening, or any at all actually. That's not how these models
       | work.
        
       | Xenoamorphous wrote:
       | > We always extract json. We don't need JSON mode
       | 
       | I wonder why? It seems to work pretty well for me.
       | 
       | > Lesson 4: GPT is really bad at producing the null hypothesis
       | 
       | Tell me about it! Just yesterday I was testing a prompt around
       | text modification rules that ended with "If none of the rules
       | apply to the text, return the original text without any changes".
       | 
       | Do you know ChatGPT's response to a text where none of the rules
       | applied?
       | 
       | "The original text without any changes". Yes, the literal string.
        
         | mechagodzilla wrote:
         | AmeliaBedeliaGPT
        
         | phillipcarter wrote:
         | > I wonder why? It seems to work pretty well for me.
         | 
         | I read this as "what we do works just fine to not need to use
         | JSON mode". We're in the same boat at my company. Been live for
         | a year now, no need to switch. Our prompt is effective at
         | getting GPT-3.5 to always produce JSON.
        
       | kromem wrote:
       | Tip for your 'null' problem:
       | 
       | LLMs are set up to output tokens. Not to not output tokens.
       | 
        | So instead of "don't return anything", have the no-result case
        | "return the default value of XYZ", and then just do a text
        | search on the result for that default value (i.e. XYZ), the
        | same way you do the text search for the state names.
       | 
       | Also, system prompts can be very useful. It's basically your
       | opportunity to have the LLM roleplay as X. I wish they'd let the
       | system prompt be passed directly, but it's still better than
       | nothing.
        
       | nprateem wrote:
        | Anyone have any good tips for stopping it sounding like it's
        | writing essay answers, and for flat-out banning "in the realm
        | of", delve, pivotal, multifaceted, etc.?
        | 
        | I don't want a crap intro or a waffley summary, but it just
        | can't help itself.
        
       | 2099miles wrote:
       | Great take, insightful. Highly recommend.
        
       ___________________________________________________________________
       (page generated 2024-04-13 23:00 UTC)