[HN Gopher] Lessons after a Half-billion GPT Tokens
___________________________________________________________________
Lessons after a Half-billion GPT Tokens
Author : lordofmoria
Score : 204 points
Date : 2024-04-12 17:06 UTC (1 day ago)
(HTM) web link (kenkantzer.com)
(TXT) w3m dump (kenkantzer.com)
| Yacovlewis wrote:
| Interesting piece!
|
| My experience around Langchain/RAG differs, so I wanted to dig
| deeper: putting some logic around handling relevant results
| helps us produce useful output. Curious what differs on their
| end.
| mind-blight wrote:
| I suspect the biggest difference is the input data. Embeddings
| are great over datasets that look like FAQs and QA docs, or
| data that conceptually fits into very small chunks (tweets,
| some product reviews, etc).
|
| It does very badly over diverse business docs, especially with
| naive chunking. B2B use cases usually have old PDFs and word
| docs that need to be searched, and they're often looking for
| specific keywords (e.g. a person's name, a product, an id,
| etc). Vectors tend to do badly in those kinds of searches, and
| just returning chunks misses a lot of important details.
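|
| For illustration, a rough sketch of one mitigation - hybrid
| retrieval - where chunks are scored by vector similarity but
| exact keyword hits get boosted so names and ids still surface
| (the embedding step and chunk format here are assumed, not from
| the article):
|
|     from math import sqrt
|
|     def cosine(a, b):
|         dot = sum(x * y for x, y in zip(a, b))
|         na = sqrt(sum(x * x for x in a))
|         nb = sqrt(sum(x * x for x in b))
|         return dot / (na * nb + 1e-9)
|
|     def hybrid_search(query, query_vec, chunks, top_k=5, boost=0.5):
|         # chunks: list of dicts with "text" and a precomputed "vec"
|         scored = []
|         for chunk in chunks:
|             score = cosine(query_vec, chunk["vec"])
|             if query.lower() in chunk["text"].lower():
|                 score += boost  # exact keyword hit outranks fuzziness
|             scored.append((score, chunk["text"]))
|         return [t for _, t in sorted(scored, reverse=True)[:top_k]]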
| gdiamos wrote:
| Rare words are out-of-vocab errors in vectors, especially if
| they aren't in the token vocab.
| mind-blight wrote:
| Even worse, named entities vary from organization to
| organization.
|
| We have a client who uses a product called "Time". It's
| time-management software. For that customer's
| documentation, time should be close to "product" and a
| bunch of other things that have nothing to do with the
| normal concept of time.
|
| I actually suspect that people would get a lot more bang
| for their buck fine-tuning the embedding models on B2B
| datasets for their use case, rather than fine-tuning an LLM.
| trolan wrote:
| For a few uni/personal projects I noticed the same about
| Langchain: it's good at helping you use up tokens. The other
| use case, quickly switching between models, is still a valid
| reason to use it. However, I've recently started playing with
| OpenRouter, which seems to abstract the model nicely.
| sroussey wrote:
| If someone were to create something new, a blank slate
| approach, what would you find valuable and why?
| lordofmoria wrote:
| This is a great question!
|
| I think we now know, collectively, a lot more about what's
| annoying/hard about building LLM features than we did when
| LangChain was being furiously developed.
|
| And some things we thought would be important and not-easy,
| turned out to be very easy: like getting GPT to give back
| well-formed JSON.
|
| So I think there's lots of room.
|
| One thing LangChain is doing now that solves something that
| IS very hard/annoying is testing. I spent 30 minutes
| yesterday re-running a slow prompt because 1 in 5 runs would
| produce weird output. Each tweak to the prompt, I had to run
| at least 10 times to be reasonably sure it was an
| improvement.
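|
| The manual loop can at least be scripted - a minimal sketch
| (assuming the OpenAI Python client and a hand-rolled validity
| check; the names here are illustrative, not our actual code):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def pass_rate(prompt, is_valid, n=10, model="gpt-4"):
|         # Run the same prompt n times, report how often it passes.
|         passes = 0
|         for _ in range(n):
|             resp = client.chat.completions.create(
|                 model=model,
|                 messages=[{"role": "user", "content": prompt}],
|             )
|             if is_valid(resp.choices[0].message.content):
|                 passes += 1
|         return passes / n
|
|     # e.g. pass_rate(my_prompt, lambda out: '"state"' in out)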
| codewithcheese wrote:
| It can be faster and more effective to fall back to a
| smaller model (GPT-3.5 or Haiku); the weaknesses of the
| prompt will be more obvious on a smaller model, and your
| iteration time will be faster.
| sroussey wrote:
| How would testing work out ideally?
| jsemrau wrote:
| Use a local model. For most tasks they are good enough -
| Mistral 0.2 Instruct, say, is quite solid by now.
| gnat wrote:
| Do different versions react to prompts in the same way? I
| imagined the prompt would be tailored to the quirks of a
| particular version rather than naturally being stably
| optimal across versions.
| jsemrau wrote:
| I suppose that is one of the benefits of using a local
| model, that it reduces model risk. I.e., given a certain
| prompt, it should always reply in the same way. Using a
| hosted model, operationally you don't have that control
| over model risk.
| cpursley wrote:
| What are the best local/open models for accurate tool-
| calling?
| disqard wrote:
| > This worked sometimes (I'd estimate >98% of the time), but
| failed enough that we had to dig deeper.
|
| > While we were investigating, we noticed that another field,
| name, was consistently returning the full name of the state...the
| correct state - even though we hadn't explicitly asked it to do
| that.
|
| > So we switched to a simple string search on the name to find
| the state, and it's been working beautifully ever since.
|
| So, using ChatGPT helped uncover the correct schema, right?
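|
| For what it's worth, a minimal sketch of that fallback (the
| field name and the truncated state list are illustrative):
|
|     import json
|
|     US_STATES = ["Alabama", "Alaska", "Arizona"]  # rest elided
|
|     def state_from_response(raw_json):
|         # Scan the field the model fills reliably instead of
|         # trusting a dedicated "state" field.
|         name = json.loads(raw_json).get("name", "")
|         return next(
|             (s for s in US_STATES if s.lower() in name.lower()),
|             None)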
| WarOnPrivacy wrote:
| > We consistently found that not enumerating an exact list or
| instructions in the prompt produced better results
|
| Not sure if he means training here or using his product. I think
| the latter.
|
| My end-user exp of GPT3.5 is that I need to be - not just precise
| but the exact flavor of precise. It's usually after some trial
| and error. Then more error. Then more trial.
|
| Getting a useful result on the 1st or 3rd try happens maybe 1 in
| 10 sessions. A bit more common is having 3.5 include what I
| clearly asked it not to. It often complies eventually.
| xp84 wrote:
| OP uses GPT4 mostly. Another poster here observed that "the
| opposite is required for 3.5" -- so I think your experience
| makes sense.
| KTibow wrote:
| I feel like for just extracting data into JSON, smaller LLMs
| could probably do fine, especially with constrained generation
| and training on extraction.
| CuriouslyC wrote:
| If you used better prompts you could use a less expensive model.
|
| "return nothing if you find nothing" is the level 0 version of
| giving the LLM an out. Give it a softer out ("in the event that
| you do not have sufficient information to make conclusive
| statements, you may hypothesize as long as you state clearly that
| you are doing so, and note the evidence and logical basis for
| your hypothesis") then ask it to evaluate its own response at the
| end.
| codewithcheese wrote:
| Yeah, also prompts should not be developed in the abstract. The
| goal of a prompt is to activate the model's internal
| representations so it can best achieve the task. Without
| automated methods, this requires iteratively testing the model's
| reaction to different input, trying to understand how it's
| interpreting the request and where it's falling down, and then
| patching up those holes.
|
| You also need to verify it even knows what you mean by "nothing".
| jsemrau wrote:
| In the end, it comes down to a task similar to people
| management, where giving clear and simple instructions works
| best.
| thisgoesnowhere wrote:
| The team I work on processes 5B+ tokens a month (and growing) and
| I'm the EM overseeing that.
|
| Here are my takeaways:
|
| 1. There are way too many premature abstractions. Langchain, as
| one of many examples, might be useful in the future, but at the
| end of the day prompts are just API calls, and it's easier to
| write standard code that treats LLM calls as flaky API calls
| rather than as something special.
|
| 2. Hallucinations are definitely a big problem. Summarizing is
| pretty rock solid in my testing, but reasoning is really hard.
| Action models, where you ask the LLM to take in a user input
| and try to get it to decide what to do next, are just really
| hard. Specifically, it's hard to get the LLM to understand the
| context and to say when it's not sure.
|
| That said, it's still a gamechanger that I can do it at all.
|
| 3. I am a bit more hyped than the author that this is a game
| changer, but like them, I don't think it's going to be the end of
| the world. There are some jobs that are going to be heavily
| impacted and I think we are going to have a rough few years of
| bots astroturfing platforms. But all in all I think it's more of
| a force multiplier rather than a breakthrough like the internet.
|
| IMHO it's similar to what happened to DevOps in the 2000s, you
| just don't need a big special team to help you deploy anymore,
| you hire a few specialists and mostly buy off the shelf
| solutions. Similarly, certain ML tasks are now easy to implement
| even for dumb dumb web devs like me.
| tmpz22 wrote:
| > IMHO it's similar to what happened to DevOps in the 2000s,
| you just don't need a big special team to help you deploy
| anymore, you hire a few specialists and mostly buy off the
| shelf solutions.
|
| I advocate for these metaphors to help people better understand
| a _reasonable_ expectation for LLMs in modern development
| workflows. Mostly because they show it as a trade-off versus a
| silver bullet. There were trade-offs to the evolution of
| devops: consider, for example, the loss of key skillsets like
| database administration as a direct result of "just use AWS
| RDS", and the explosion in cloud billing costs (especially the
| OpEx of startups who weren't even dealing with that much data
| or regional complexity!) - and how it indirectly led to GitLab's
| big outage and many like it.
| gopher_space wrote:
| > Summarizing is pretty rock solid in my testing, but reasoning
| is really hard.
|
| Asking for analogies has been interesting and surprisingly
| useful.
| motoxpro wrote:
| Devops is such an amazing analogy.
| lordofmoria wrote:
| OP here - I had never thought of the analogy to DevOps before,
| that made something click for me, and I wrote a post just now
| riffing off this notion:
| https://kenkantzer.com/gpt-is-the-heroku-of-ai
|
| Basically, I think we're using GPT as the PaaS/heroku/render
| equivalent of AI ops.
|
| Thank you for the insight!!
| ryoshu wrote:
| > But all in all I think it's more of a force multiplier rather
| than a breakthrough like the internet.
|
| Thank you. Seeing similar things. Clients are also seeing
| sticker shock on how much the big models cost vs. the output.
| That will all come down over time.
| eigenvalue wrote:
| I agree with most of it, but definitely not the part about
| Claude 3 being "meh." Claude 3 Opus is an amazing model and is
| extremely good at coding in Python. The ability to handle massive
| context has made it mostly replace GPT4 for me day to day.
|
| Sounds like everyone eventually concludes that Langchain is
| bloated and useless and creates way more problems than it solves.
| I don't get the hype.
| CuriouslyC wrote:
| Claude is indeed an amazing model, the fact that Sonnet and
| Haiku are so good is a game changer - GPT4 is too expensive and
| GPT3.5 is very mediocre. Getting 95% of GPT4 performance for
| GPT3.5 prices feels like cheating.
| Oras wrote:
| +1 for Claude Opus, it has been my go-to for the last 3 weeks
| compared to GPT4. The generated texts are much better than
| GPT4's when it comes to following the prompt.
|
| I also tried the API for some financial analysis of large
| tables; the response time was around 2 minutes, but it still
| did it really well, and timeout errors were only around 1 to 2%.
| cpursley wrote:
| How are you sending tabular data in a reliable way. And what
| is the source document type? I'm trying to solve this for
| complex financial-related tables in PDFs right now.
| Oras wrote:
| Amazon Textract to get the tables, format them with Python as
| CSV, then send to your preferred AI model.
| cpursley wrote:
| Thanks. How does Textract compare to some of the common
| CLI utilities like pdftotext, tesseract, etc. (if you made
| a comparison)?
| Oras wrote:
| I did; none of the open source parsers worked well with
| tables. I had the following issues:
|
| - Missing cells.
| - Partial identification of numbers (ex: PS43.54, the parser
| would pick it up as PS43).
|
| To compare, I drew lines around the identified text to
| visualize the accuracy. You can do that with Tesseract.
| cpursley wrote:
| Interesting. Did you try MS's offering (Azure AI Document
| Intelligence)? Their pricing seems better than Amazon's.
| mvkel wrote:
| I share a lot of this experience. My fix for "Lesson 4: GPT is
| really bad at producing the null hypothesis"
|
| is to have it return very specific text that I string-match on
| and treat as null.
|
| Like: "if there is no warm up for this workout, use the following
| text in the description: NOPE"
|
| then in code I just do a "if warm up contains NOPE, treat it as
| null"
| gregorymichael wrote:
| For cases of "select an option from this set" I have it return
| an index of the correct option, or eg 999 if it can't find one.
| This helped a lot.
| mvkel wrote:
| Smart
| gdiamos wrote:
| We do this for the null hypothesis - it uses an LLM to
| bootstrap a binary classifier, which handles null easily.
|
| https://github.com/lamini-ai/llm-classifier
| albert_e wrote:
| > Are we going to achieve Gen AI?
|
| > No. Not with this transformers + the data of the internet + $XB
| infrastructure approach.
|
| Errr ...did they really mean Gen AI .. or AGI?
| mdorazio wrote:
| Gen as in "General" not generative.
| _pdp_ wrote:
| The biggest realisation for me while making ChatBotKit has been
| that UX > Model alone. For me, the current state of AI is not
| about questions and answers. This is dumb. The presentation
| matters. This is why we are now investing in generative UI.
| codewithcheese wrote:
| How are you using Generative UI?
| _pdp_ wrote:
| Sorry, not much to show at the moment. It is also pretty new
| so it is early days.
|
| You can find some open-source examples here
| https://github.com/chatbotkit. More coming next week.
| pbhjpbhj wrote:
| Generative UI being creation of a specific UI dependent on
| output from your model? What model is it?
|
| Google Gemini were showing something that I'd call 'adapted
| output UI' in their launch presentation. Is that close to what
| you're doing in any way?
| swalsh wrote:
| The "being too precise reduces accuracy" example makes sense to
| me based on my crude understanding of how these things work.
|
| If you pass in a whole list of states, you're kind of making the
| vectors for every state light up. If you just say "state" and
| the text you passed in has an explicit state, then fewer vectors
| specific to what you're searching for light up. So when it
| performs the softmax, the correct state is more likely to be
| selected.
|
| Along the same lines, I think his \n vs comma comparison
| probably comes down to tokenization differences.
| legendofbrando wrote:
| The finding on simpler prompts tracks, especially with GPT4
| (3.5 requires the opposite).
|
| The take on RAG feels application specific. For our use case,
| where details of the past get surfaced, the ability to generate
| loose connections is actually a feature. Things like this are
| what excite me most about LLMs: having a way to proxy
| subjective similarities the way we do when we remember things
| is a benefit of the technology that didn't really exist before,
| and it opens up a new kind of product opportunity.
| AtNightWeCode wrote:
| The UX is an important part of the trick that cons people into
| thinking these tools are better than they are. If you, for
| instance, instruct ChatGPT to only answer yes or no, it will
| feel like it is wrong much more often.
| ilaksh wrote:
| I recently had a bug where I was sometimes sending the literal
| text "null " right in front of the most important part of my
| prompt. This caused Claude 3 Sonnet to give the 'ignore' command
| in cases where it should have used one of the other JSON commands
| I gave it.
|
| I have an ignore command so that it will wait when the user isn't
| finished speaking. Which it generally judges okay, unless it has
| 'null' in there.
|
| The nice thing is that I have found most of the problems with the
| LLM response were just indications that I hadn't finished
| debugging my program because I had something missing or weird in
| the prompt I gave it.
| ein0p wrote:
| Same here: I'm subscribed to all three top dogs in LLM space, and
| routinely issue the same prompts to all three. It's very one
| sided in favor of GPT4 which is stunning since it's now a year
| old, although of course it received a couple of updates in that
| time. Also at least with my usage patterns hallucinations are
| rare, too. In comparison Claude will quite readily hallucinate
| plausible looking APIs that don't exist when writing code, etc.
| GPT4 is also more stubborn / less agreeable when it knows it's
| right. Very little of this is captured in metrics, so you can
| only see it from personal experience.
| CharlesW wrote:
| This was with Claude Opus, vs. one of the lesser variants? I
| really like Opus for English copy generation.
| ein0p wrote:
| Opus, yes, the $20/mo version. I usually don't generate copy.
| My use cases are code (both "serious" and "the nice to have
| code I wouldn't bother writing otherwise"), learning how to
| do stuff in unfamiliar domains, and just learning unfamiliar
| things in general. It works well as a very patient teacher,
| especially if you already have some degree of familiarity
| with the problem domain. I do have to check it against
| primary sources, which is how I know the percentage of
| hallucinations is very low. For code, however I don't even
| have to do that, since as a professional software engineer I
| am the "primary source".
| Me1000 wrote:
| Interesting, Claude 3 Opus has been better than GPT4 for me.
| Mostly in that I find it does a better (and more importantly,
| more thorough) job of explaining things to me. For coding tasks
| (I'm not asking it to write code, but instead to explain
| topics/code/etc to me) I've found it tends to give much more
| nuanced answers. When I give it long text to converse about, I
| find Claude Opus tends to have a much deeper understanding of
| the content it's given, where GPT4 tends to just summarize the
| text at hand, whereas Claude tends to be able to extrapolate
| better.
| robocat wrote:
| How much of this is just that one model responds better to
| the way you write prompts?
|
| Much like you working with Bob and opining that Bob is great,
| and me saying that I find Jack easier to work with.
| msp26 wrote:
| > But the problem is even worse - we often ask GPT to give us
| back a list of JSON objects. Nothing complicated mind you: think,
| an array list of json tasks, where each task has a name and a
| label.
|
| > GPT really cannot give back more than 10 items. Trying to have
| it give you back 15 items? Maybe it does it 15% of the time.
|
| This is just a prompt issue. I've had it reliably return up to
| 200 items in correct order. The trick is to not use lists at all
| but have JSON keys like "item1":{...} in the output. You can use
| lists as the values here if you have some input with 0-n outputs.
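|
| As a rough sketch, folding those numbered keys back into an
| ordered list can look like this (the key naming follows the
| trick above; the parsing itself is my own assumption):
|
|     import json, re
|
|     def items_from_numbered_keys(raw_json):
|         data = json.loads(raw_json)
|         keyed = []
|         for key, value in data.items():
|             match = re.fullmatch(r"item(\d+)", key)
|             if match:
|                 keyed.append((int(match.group(1)), value))
|         return [value for _, value in sorted(keyed)]
|
|     # items_from_numbered_keys(
|     #     '{"item2": {"name": "b"}, "item1": {"name": "a"}}')
|     # -> [{'name': 'a'}, {'name': 'b'}]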
| 7thpower wrote:
| Can you elaborate? I am currently beating my head against this.
|
| If I give GPT4 a list of existing items with a defined
| structure, and it is just having to convert schema or something
| like that to JSON, it can do that all day long. But if it has
| to do any sort of reasoning and basically create its own list,
| it only gives me a very limited subset.
|
| I have similar issues with other LLMs.
|
| Very interested in how you are approaching this.
| msp26 wrote:
| If you show your task/prompt with an example I'll see if I
| can fix it and explain my steps.
|
| Are you using the function calling/tool use API?
| ctxc wrote:
| Hi! My work is similar and I'd love to have someone to
| bounce ideas off of if you don't mind.
|
| Your profile doesn't have contact info though. Mine does,
| please send me a message. :)
| thibaut_barrere wrote:
| Not sure if that fits the bill, but here is an example with
| 200 sorted items based on a question (example with Elixir &
| InstructorEx):
|
| https://gist.github.com/thbar/a53123cbe7765219c1eca77e03e675...
| waldrews wrote:
| I've been telling it the user is from a culture where answering
| questions with an incomplete list is offensive and insulting.
| andenacitelli wrote:
| This is absolutely hilarious. Prompt engineering is such a
| mixed bag of crazy stuff that actually works. Reminds me of
| how they respond better if you put them under some kind of
| pressure (respond better, _or else_ ...).
|
| I haven't looked at the prompts we run in prod at $DAYJOB for
| a while but I think we have at least five or ten things that
| are REALLY weird out of context.
| neals wrote:
| Do I need langchain if I want to analyze a large document of many
| pages?
| simonw wrote:
| No. But it might help, because you'll probably have to roll
| some kind of recursive summarization - I think LangChain has
| mechanisms for that which could save you some time.
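|
| Rolling it by hand is not much code either - a minimal sketch
| of recursive summarization (OpenAI Python client assumed;
| chunking by characters and the prompt wording are
| simplifications):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def summarize(text, model="gpt-4", chunk_chars=8000):
|         if len(text) <= chunk_chars:
|             resp = client.chat.completions.create(
|                 model=model,
|                 messages=[{
|                     "role": "user",
|                     "content": "Summarize this text:\n\n" + text,
|                 }],
|             )
|             return resp.choices[0].message.content
|         # Summarize each chunk, then summarize the combined
|         # summaries, recursing until everything fits in one call.
|         chunks = [text[i:i + chunk_chars]
|                   for i in range(0, len(text), chunk_chars)]
|         partials = [summarize(c, model, chunk_chars) for c in chunks]
|         return summarize("\n\n".join(partials), model, chunk_chars)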
| larodi wrote:
| I largely agree with the author, but this 'wait for OpenAI to
| do it' sentiment is not valid. Opus, for example, is already
| much better (not only per my experience, but per researchers'
| evaluations). And even for the fun of it - try some local
| inference. If you know how to prompt, you definitely would be
| able to run local models for the same tasks.
|
| Listening to my students all going to 'call some API' for their
| projects is really very sad to hear. Many startup fellows share
| this sentiment, which totally kills all the joy.
| jstummbillig wrote:
| It sounds like you are a tech educator, which potentially
| sounds like a lot of fun with LLMs right now.
|
| When you are integrating these things into your business, you
| are looking for different things. Most of our customers would
| for example not find it very cool to have a service outage
| because somebody wanted to not kill all the joy.
| larodi wrote:
| Sure, when availability and SLAs kick in... but reselling
| APIs will only get you so far. Perhaps the whole pros/cons
| cloud argument can also kick in here; not going into it. We
| may well be on the same page, or we both perhaps have valid
| arguments. Your comment is appreciated indeed.
|
| But then, is the author (and are we) talking about experience
| in reselling APIs or experience in introducing NNs into the
| pipeline? Not the same thing IMHO.
|
| Agreed that OpenAI provides very good service, Gemini is not
| quite there yet, Groq (the LPUs) delivered a nice tech demo,
| Mixtral is cool but lacks in certain areas, and Claude can be
| lengthy.
|
| But precisely because I'm not sticking with OAI I can then
| restate my view that if someone is so good with prompts he
| can get the same results locally if he knows what he's doing.
|
| Prompting OpenAI the right way can be similarly difficult.
|
| Perhaps the whole idea of local inference only matters for
| IoT scenarios or whenever data is super sensitive (or the CTO
| is too stubborn to let it embed and fly). But then, if you
| start from day 1 with WordPress provisioned for you ready to
| go in Google Cloud, you'd never understand the underlying
| details of the technology.
|
| There sure also must be a good reason why Phind tuned their
| own thing to offer alongside GPT4 APIs.
|
| Disclaimer: tech education is a side thing I do, indeed, and
| one I've been doing in person for a very long time, across more
| than a dozen topics, so I allow myself to have an opinion. Of
| course business is a different matter, and strategic decisions
| are not the same. Even so, I'd not advise anyone to blindly use
| APIs unless they properly appreciate the need.
| kromem wrote:
| Claude does have more of a hallucination problem than GPT-4,
| and a less robust knowledge base.
|
| It's much better at critical thinking tasks and prose.
|
| Don't mistake benchmarks for real world performance across
| actual use cases. There's a bit of Goodhart's Law going on with
| LLM evaluation and optimization.
| dougb5 wrote:
| The lessons I wanted from this article weren't in there: Did all
| of that expenditure actually help their product in a measurable
| way? Did customers use and appreciate the new features based on
| LLM summarization compared to whatever they were using before? I
| presume it's a net win or they wouldn't continue to use it, but
| more specifics around the application would be helpful.
| lordofmoria wrote:
| Hey, OP here!
|
| The answer is a bit boring: the expenditure definitely has
| helped customers - in that, they're using AI generated
| responses in all their work flows all the time in the app, and
| barely notice it.
|
| See what I did there? :) I'm mostly serious though - one weird
| thing about our app is that you might not even know we're using
| AI, unless we literally tell you in the app.
|
| And I think that's where we're at with AI and LLMs these days,
| at least for our use case.
|
| You might find this other post I just put up to have more
| details too, related to how/where I see the primary value:
| https://kenkantzer.com/gpt-is-the-heroku-of-ai/
| haolez wrote:
| That has been my experience too. The null hypothesis explains
| almost all of my hallucinations.
|
| I just don't agree with the Claude assessment. In my experience,
| Claude 3 Opus is vastly superior to GPT-4. Maybe the author was
| comparing with Claude 2? (And I've never tested Gemini)
| satisfice wrote:
| I keep seeing this pattern in articles like this:
|
| 1. A recitation of terrible problems.
| 2. A declaration of general satisfaction.
|
| Clearly and obviously, ChatGPT is an unreliable toy. The author
| seems pleased with it. As an engineer, I find that unacceptable.
| jstummbillig wrote:
| ChatGPT is probably in the top 5 value/money subscriptions I
| have ever had (and that includes utilities).
|
| The relatively low price point certainly plays a role here, but
| it's certainly not a mainly recreational thing for me. These
| things are kinda hard to measure, but roughly the main plus is
| that engagement with hard stuff goes up, and the rate of
| learning goes up, by a lot.
| simonw wrote:
| Working with models like GPT-4 is frustrating from a
| traditional software engineering perspective because these
| systems are inherently unreliable and non-deterministic, which
| differs from most software tools that we use.
|
| That doesn't mean they can't be incredibly useful - but it does
| mean you have to approach them in a bit of a different way, and
| design software around them that takes their unreliability into
| account.
| FranklinMaillot wrote:
| In my limited experience, I came to the same conclusion
| regarding simple prompts being more effective than very
| detailed lists of instructions. But if you look at OpenAI's
| system prompt for GPT4, it's an endless set of instructions
| with DOs and DON'Ts, so I'm confused. Surely they must know
| something about prompting their model.
| bongodongobob wrote:
| That's for chatting and interfacing conversationally with a
| human. Using the API is a completely different ballgame because
| it's not meant to be a back and forth conversation with a
| human.
| Civitello wrote:
| > Every use case we have is essentially "Here's a block of text,
| extract something from it." As a rule, if you ask GPT to give you
| the names of companies mentioned in a block of text, it will not
| give you a random company (unless there are no companies in the
| text - there's that null hypothesis problem!).
|
| Make it two steps. First:
|
| > Does this block of text mention a company?
|
| If no, good, you've got your null result. If yes:
|
| > Please list the names of companies in this block of text.
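|
| A rough sketch of that two-step flow (OpenAI Python client
| assumed; the prompts are paraphrased from above):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def ask(prompt, model="gpt-4"):
|         resp = client.chat.completions.create(
|             model=model,
|             messages=[{"role": "user", "content": prompt}],
|         )
|         return resp.choices[0].message.content
|
|     def extract_companies(text):
|         gate = ask("Does this block of text mention a company? "
|                    "Answer yes or no.\n\n" + text)
|         if "yes" not in gate.lower():
|             return []  # the null result comes from the cheap gate
|         listing = ask("Please list the names of companies in this "
|                       "block of text, one per line:\n\n" + text)
|         return [line.strip() for line in listing.splitlines()
|                 if line.strip()]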
| sungho_ wrote:
| I'm curious if the OP has tried any of the libraries that
| control the output of LLMs (LMQL, Outlines, Guidance, ...), and
| for those who have: do you find them as unnecessary as
| LangChain? In particular, the OP's post mentions the problem of
| not being able to generate JSON with more than 15 items, which
| seems like a problem that can be solved by controlling the
| output of the LLM. Is that correct?
| LASR wrote:
| If you want x number of items every time, ask it to include a
| sequence number in each output; it will consistently return x
| number of items.
|
| Numbered bullets work well for this, if you don't need JSON.
| With JSON, you can ask it to include an 'id' in each item.
| orbatos wrote:
| Statements like this tell me your analysis is poisoned by
| misunderstandings: "Why is this crazy? Well, it's crazy that
| GPT's quality and generalization can improve when you're more
| vague - this is a quintessential marker of higher-order
| delegation / thinking." No, there is no "higher-order thought"
| happening, or any at all actually. That's not how these models
| work.
| Xenoamorphous wrote:
| > We always extract json. We don't need JSON mode
|
| I wonder why? It seems to work pretty well for me.
|
| > Lesson 4: GPT is really bad at producing the null hypothesis
|
| Tell me about it! Just yesterday I was testing a prompt around
| text modification rules that ended with "If none of the rules
| apply to the text, return the original text without any changes".
|
| Do you know ChatGPT's response to a text where none of the rules
| applied?
|
| "The original text without any changes". Yes, the literal string.
| mechagodzilla wrote:
| AmeliaBedeliaGPT
| phillipcarter wrote:
| > I wonder why? It seems to work pretty well for me.
|
| I read this as "what we do works just fine to not need to use
| JSON mode". We're in the same boat at my company. Been live for
| a year now, no need to switch. Our prompt is effective at
| getting GPT-3.5 to always produce JSON.
| kromem wrote:
| Tip for your 'null' problem:
|
| LLMs are set up to output tokens. Not to not output tokens.
|
| So instead of "don't return anything" have the lack of results
| "return the default value of XYZ" and then just do a text search
| on the result for that default value (i.e. XYZ) the same way you
| do the text search for the state names.
|
| Also, system prompts can be very useful. It's basically your
| opportunity to have the LLM roleplay as X. I wish they'd let the
| system prompt be passed directly, but it's still better than
| nothing.
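|
| Via the chat API, both tips combine roughly like this (a
| sketch; the sentinel value, persona text, and placeholder
| prompt are made up):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     SENTINEL = "NO_MATCH_FOUND"
|
|     resp = client.chat.completions.create(
|         model="gpt-4",
|         messages=[
|             {"role": "system",
|              "content": "You are a careful extraction assistant. "
|                         "If nothing matches, return the default "
|                         "value " + SENTINEL + "."},
|             {"role": "user",
|              "content": "Extract the state from: ..."},
|         ],
|     )
|     answer = resp.choices[0].message.content
|     result = None if SENTINEL in answer else answer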
| nprateem wrote:
| Anyone got any good tips for stopping it sounding like it's
| writing essay answers, and flat-out banning "in the realm of",
| delve, pivotal, multifaceted, etc.?
|
| I don't want a crap intro or waffley summary but it just can't
| help itself.
| 2099miles wrote:
| Great take, insightful. Highly recommend.
___________________________________________________________________
(page generated 2024-04-13 23:00 UTC)