[HN Gopher] Lessons after a Half-billion GPT Tokens
       ___________________________________________________________________
        
       Lessons after a Half-billion GPT Tokens
        
       Author : lordofmoria
       Score  : 476 points
       Date   : 2024-04-12 17:06 UTC (2 days ago)
        
 (HTM) web link (kenkantzer.com)
 (TXT) w3m dump (kenkantzer.com)
        
       | Yacovlewis wrote:
       | Interesting piece!
       | 
        | My experience around Langchain/RAG differs, so I wanted to dig
        | deeper: putting some logic around handling relevant results helps
        | us produce useful output. Curious what differs on their end.
        
         | mind-blight wrote:
         | I suspect the biggest difference is the input data. Embeddings
         | are great over datasets that look like FAQs and QA docs, or
         | data that conceptually fits into very small chunks (tweets,
         | some product reviews, etc).
         | 
         | It does very badly over diverse business docs, especially with
         | naive chunking. B2B use cases usually have old PDFs and word
         | docs that need to be searched, and they're often looking for
         | specific keywords (e.g. a person's name, a product, an id,
          | etc). Vector search tends to do badly in those kinds of
          | searches, and just returning chunks misses a lot of important
          | details.
        
           | gdiamos wrote:
            | Rare words are effectively out-of-vocabulary errors for
            | embeddings, especially if they aren't in the token vocab.
        
             | mind-blight wrote:
             | Even worse, named entities vary from organization to
             | organization.
             | 
              | We have a client who uses a product called "Time". It's
              | time-management software. For that customer's
             | documentation, time should be close to "product" and a
             | bunch of other things that have nothing to do with the
             | normal concept of time.
             | 
              | I actually suspect that people would get a lot more bang
              | for their buck fine-tuning the embedding models on B2B
              | datasets for their use case, rather than fine-tuning an LLM.
        
       | trolan wrote:
       | For a few uni/personal projects I noticed the same about
       | Langchain: it's good at helping you use up tokens. The other use
       | case, quickly switching between models, is a very valid reason
       | still. However, I've recently started playing with OpenRouter
       | which seems to abstract the model nicely.
        
         | sroussey wrote:
         | If someone were to create something new, a blank slate
         | approach, what would you find valuable and why?
        
           | lordofmoria wrote:
           | This is a great question!
           | 
           | I think we now know, collectively, a lot more about what's
           | annoying/hard about building LLM features than we did when
           | LangChain was being furiously developed.
           | 
           | And some things we thought would be important and not-easy,
           | turned out to be very easy: like getting GPT to give back
           | well-formed JSON.
           | 
           | So I think there's lots of room.
           | 
           | One thing LangChain is doing now that solves something that
           | IS very hard/annoying is testing. I spent 30 minutes
           | yesterday re-running a slow prompt because 1 in 5 runs would
           | produce weird output. Each tweak to the prompt, I had to run
           | at least 10 times to be reasonably sure it was an
           | improvement.
        
             | codewithcheese wrote:
              | It can be faster and more effective to fall back to a
              | smaller model (GPT-3.5 or Haiku): the weaknesses of the
              | prompt will be more obvious on a smaller model, and your
              | iteration time will be faster.
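              | 
              | A minimal sketch of that loop, assuming the OpenAI Python
              | SDK; the model names and the test prompt are placeholders:
              | 
              |     from openai import OpenAI
              | 
              |     client = OpenAI()
              | 
              |     def run(prompt, model):
              |         r = client.chat.completions.create(
              |             model=model,
              |             messages=[{"role": "user", "content": prompt}],
              |         )
              |         return r.choices[0].message.content
              | 
              |     # Iterate on the cheap model first; its failures expose
              |     # weak spots in the prompt faster and more cheaply,
              |     # then spot-check on the big model.
              |     prompt = "Summarize this ticket: <your real prompt>"
              |     for m in ["gpt-3.5-turbo", "gpt-4-turbo"]:
              |         print(m, "->", run(prompt, m))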
        
               | JeremyHerrman wrote:
               | great insight!
        
             | sroussey wrote:
             | How would testing work out ideally?
        
           | jsemrau wrote:
            | Use a local model. For most tasks they are good enough.
            | Mistral 0.2 Instruct, say, is quite solid by now.
        
             | gnat wrote:
             | Do different versions react to prompts in the same way? I
             | imagined the prompt would be tailored to the quirks of a
             | particular version rather than naturally being stably
             | optimal across versions.
        
               | jsemrau wrote:
               | I suppose that is one of the benefits of using a local
               | model, that it reduces model risk. I.e., given a certain
               | prompt, it should always reply in the same way. Using a
               | hosted model, operationally you don't have that control
               | over model risk.
        
             | cpursley wrote:
             | What are the best local/open models for accurate tool-
             | calling?
        
       | disqard wrote:
       | > This worked sometimes (I'd estimate >98% of the time), but
       | failed enough that we had to dig deeper.
       | 
       | > While we were investigating, we noticed that another field,
       | name, was consistently returning the full name of the state...the
       | correct state - even though we hadn't explicitly asked it to do
       | that.
       | 
       | > So we switched to a simple string search on the name to find
       | the state, and it's been working beautifully ever since.
       | 
       | So, using ChatGPT helped uncover the correct schema, right?
        
       | WarOnPrivacy wrote:
       | > We consistently found that not enumerating an exact list or
       | instructions in the prompt produced better results
       | 
       | Not sure if he means training here or using his product. I think
       | the latter.
       | 
       | My end-user exp of GPT3.5 is that I need to be - not just precise
       | but the exact flavor of precise. It's usually after some trial
       | and error. Then more error. Then more trial.
       | 
       | Getting a useful result on the 1st or 3rd try happens maybe 1 in
       | 10 sessions. A bit more common is having 3.5 include what I
       | clearly asked it not to. It often complies eventually.
        
         | xp84 wrote:
          | OP uses GPT4 mostly. Another poster here observed that "the
          | opposite is required for 3.5" -- so I think your experience
          | makes sense.
        
       | KTibow wrote:
       | I feel like for just extracting data into JSON, smaller LLMs
       | could probably do fine, especially with constrained generation
       | and training on extraction.
        
       | CuriouslyC wrote:
       | If you used better prompts you could use a less expensive model.
       | 
       | "return nothing if you find nothing" is the level 0 version of
       | giving the LLM an out. Give it a softer out ("in the event that
       | you do not have sufficient information to make conclusive
       | statements, you may hypothesize as long as you state clearly that
       | you are doing so, and note the evidence and logical basis for
       | your hypothesis") then ask it to evaluate its own response at the
       | end.
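        | 
        | A rough sketch of that two-step pattern (generate with the soft
        | out, then have the model grade its own answer); the OpenAI SDK,
        | model name, and prompts here are just illustrative:
        | 
        |     from openai import OpenAI
        | 
        |     client = OpenAI()
        | 
        |     SOFT_OUT = (
        |         "If you do not have sufficient information to make "
        |         "conclusive statements, you may hypothesize as long as "
        |         "you state clearly that you are doing so, and note the "
        |         "evidence and logical basis for your hypothesis."
        |     )
        | 
        |     def ask(prompt, model="gpt-4-turbo"):
        |         r = client.chat.completions.create(
        |             model=model,
        |             messages=[{"role": "system", "content": SOFT_OUT},
        |                       {"role": "user", "content": prompt}],
        |         )
        |         return r.choices[0].message.content
        | 
        |     def ask_with_self_check(prompt):
        |         # Second pass: ask the model to evaluate its own answer.
        |         answer = ask(prompt)
        |         critique = ask(
        |             f"Question: {prompt}\n\nAnswer: {answer}\n\n"
        |             "Evaluate the answer: is anything stated as fact "
        |             "that was only a hypothesis? Reply briefly."
        |         )
        |         return answer, critique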
        
         | codewithcheese wrote:
          | Yeah, also prompts should not be developed in the abstract.
          | The goal of a prompt is to activate the model's internal
          | representations so it can best achieve the task. Without
          | automated methods, this requires iteratively testing the
          | model's reaction to different inputs, trying to understand how
          | it's interpreting the request and where it's falling down, and
          | then patching up those holes.
          | 
          | You need to verify it even knows what you mean by "nothing".
        
           | jsemrau wrote:
           | In the end, it comes down to a task similar to people
           | management where giving clear and simple instructions is the
           | best.
        
           | azinman2 wrote:
           | Which automated method do you use?
        
             | CuriouslyC wrote:
             | The only public prompt optimizer that I'm aware of now is
             | DSPy, but it doesn't optimize your main prompt request,
             | just some of the problem solving strategies the LLM is
             | instructed to use, and your few shot learning examples. I
             | wouldn't be surprised if there's a public general prompt
             | optimizing agent by this time next year though.
        
       | thisgoesnowhere wrote:
       | The team I work on processes 5B+ tokens a month (and growing) and
       | I'm the EM overseeing that.
       | 
       | Here are my take aways
       | 
        | 1. There are way too many premature abstractions. Langchain, as
        | one of many examples, might be useful in the future, but at the
        | end of the day prompts are just an API call, and it's easier to
        | write standard code that treats LLM calls as a flaky API call
        | rather than as a special thing (see the sketch at the end of
        | this comment).
       | 
        | 2. Hallucinations are definitely a big problem. Summarizing is
        | pretty rock solid in my testing, but reasoning is really hard.
        | Action models, where you ask the LLM to take in a user input and
        | decide what to do next, are just really hard; specifically, it's
        | hard to get the LLM to understand the context and to say when
        | it's not sure.
        | 
        | That said, it's still a game changer that I can do it at all.
       | 
       | 3. I am a bit more hyped than the author that this is a game
       | changer, but like them, I don't think it's going to be the end of
       | the world. There are some jobs that are going to be heavily
       | impacted and I think we are going to have a rough few years of
       | bots astroturfing platforms. But all in all I think it's more of
       | a force multiplier rather than a breakthrough like the internet.
       | 
       | IMHO it's similar to what happened to DevOps in the 2000s, you
       | just don't need a big special team to help you deploy anymore,
       | you hire a few specialists and mostly buy off the shelf
       | solutions. Similarly, certain ML tasks are now easy to implement
       | even for dumb dumb web devs like me.
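        | 
        | On point 1, a minimal sketch of what "treat it as a flaky API
        | call" looks like in practice; assumes the OpenAI Python SDK, and
        | the model name, timeout, and backoff values are arbitrary:
        | 
        |     import time
        |     from openai import OpenAI
        | 
        |     client = OpenAI()  # reads OPENAI_API_KEY from the env
        | 
        |     def call_llm(prompt, retries=3, backoff=2.0):
        |         # Retry on failure like any other unreliable API.
        |         for attempt in range(retries):
        |             try:
        |                 resp = client.chat.completions.create(
        |                     model="gpt-4-turbo",
        |                     messages=[{"role": "user", "content": prompt}],
        |                     timeout=30,
        |                 )
        |                 return resp.choices[0].message.content
        |             except Exception:
        |                 if attempt == retries - 1:
        |                     raise
        |                 time.sleep(backoff * (attempt + 1))  # linear backoff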
        
         | tmpz22 wrote:
         | > IMHO it's similar to what happened to DevOps in the 2000s,
         | you just don't need a big special team to help you deploy
         | anymore, you hire a few specialists and mostly buy off the
         | shelf solutions.
         | 
         | I advocate for these metaphors to help people better understand
         | a _reasonable_ expectation for LLMs in modern development
         | workflows. Mostly because they show it as a trade-off versus a
         | silver bullet. There were trade-offs to the evolution of
         | devops, consider for example the loss of key skillsets like
         | database administration as a direct result of  "just use AWS
         | RDS" and the explosion in cloud billing costs (especially the
         | OpEx of startups who weren't even dealing with that much data
          | or regional complexity!) - and how it indirectly led to
          | GitLab's big outage and many like it.
        
         | gopher_space wrote:
         | > Summarizing is pretty rock solid in my testing, but reasoning
         | is really hard.
         | 
         | Asking for analogies has been interesting and surprisingly
         | useful.
        
           | eru wrote:
           | Could you elaborate, please?
        
             | gopher_space wrote:
             | Instead of `if X == Y do ...` it's more like `enumerate
             | features of X in such a manner...` and then `explain
             | feature #2 of X in terms that Y would understand` and then
             | maybe `enumerate the manners in which Y might apply X#2 to
             | TASK` and then have it do the smartest number.
             | 
             | The most lucid explanation for SQL joins I've seen was in a
             | (regrettably unsaved) exchange where I asked it to compare
             | them to different parts of a construction project and then
             | focused in on the landscaping example. I felt like Harrison
             | Ford panning around a still image in the first Blade
             | Runner. "Go back a point and focus in on the third
             | paragraph".
        
         | motoxpro wrote:
         | Devops is such an amazing analogy.
        
         | lordofmoria wrote:
         | OP here - I had never thought of the analogy to DevOps before,
         | that made something click for me, and I wrote a post just now
          | riffing off this notion:
          | https://kenkantzer.com/gpt-is-the-heroku-of-ai
         | 
         | Basically, I think we're using GPT as the PaaS/heroku/render
         | equivalent of AI ops.
         | 
         | Thank you for the insight!!
        
           | harryp_peng wrote:
            | You only processed 500M tokens, which is shockingly little.
            | Perhaps only 2k in incurred costs?
        
         | ryoshu wrote:
         | > But all in all I think it's more of a force multiplier rather
         | than a breakthrough like the internet.
         | 
         | Thank you. Seeing similar things. Clients are also seeing
         | sticker shock on how much the big models cost vs. the output.
         | That will all come down over time.
        
           | nineteen999 wrote:
           | > That will all come down over time.
           | 
            | So will interest, as more and more people realise there's
            | nothing "intelligent" about the technology; it's merely a
            | Markov-chain-word-salad generator with some weights to
            | improve the accuracy somewhat.
           | 
           | I'm sure some people (other than AI investors) are getting
           | some value out of it, but I've found it to be most unsuited
           | to most of the tasks I've applied it to.
        
             | mediaman wrote:
             | The industry is troubled both by hype marketers who believe
             | LLMs are superhuman intelligence that will replace all
             | jobs, and cynics who believe they are useless word
             | predictors.
             | 
             | Some workloads are well-suited to LLMs. Roughly 60% of
             | applications are for knowledge management and summarization
             | tasks, which is a big problem for large organizations. I
             | have experience deploying these for customers in a niche
             | vertical, and they work quite well. I do not believe
             | they're yet effective for 'agentic' behavior or anything
             | using advanced reasoning. I don't know if they will be in
             | the near future. But as a smart, fast librarian, they're
             | great.
             | 
             | A related area is tier one customer service. We are
             | beginning to see evidence that well-designed applications
             | (emphasis on well-designed -- the LLM is just a component)
             | can significantly bring down customer service costs. Most
             | customer service requests do not require complex reasoning.
             | They just need to find answers to a set of questions that
             | are repeatedly asked, because the majority of service calls
             | are from people who do not read docs. People who read
             | documentation make fewer calls. In most cases around 60-70%
             | of customer service requests are well-suited to automating
             | with a well-designed LLM-enabled agent. The rest should be
             | handled by humans.
             | 
             | If the task does not require advanced reasoning and mostly
             | involves processing existing information, LLMs can be a
             | good fit. This actually represents a lot of work.
             | 
             | But many tech people are skeptical, because they don't
             | actually get much exposure to this type of work. They read
             | the docs before calling service, are good at searching for
             | things, and excel at using computers as tools. And so, to
             | them, it's mystifying why LLMs could still be so valuable.
        
         | weatherlite wrote:
         | > Similarly, certain ML tasks are now easy to implement even
         | for dumb dumb web devs like me
         | 
         | For example?
        
           | spunker540 wrote:
           | Lots of applied NLP tasks used to require paying annotators
           | to compile a golden dataset and then train an efficient model
           | on the dataset.
           | 
            | Now, if cost is of little concern, you can use zero-shot
            | prompting on an inefficient model. If cost is a concern, you
           | can use GPT4 to create your golden dataset way faster and
           | cheaper than human annotations, and then train your more
           | efficient model.
           | 
           | Some example NLP tasks could be classifiers, sentiment,
           | extracting data from documents. But I'd be curious which
           | areas of NLP __weren't__ disrupted by LLMs.
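            | 
            | A sketch of that pipeline (the model name, label set, and
            | the downstream classifier - scikit-learn here - are all just
            | illustrative; assumes the OpenAI SDK):
            | 
            |     from openai import OpenAI
            |     from sklearn.feature_extraction.text import TfidfVectorizer
            |     from sklearn.linear_model import LogisticRegression
            | 
            |     client = OpenAI()
            |     LABELS = ["positive", "negative", "neutral"]
            | 
            |     def gpt_label(text):
            |         # The strong LLM acts as the annotator that builds
            |         # the "golden" dataset.
            |         q = (f"Classify the sentiment as one of {LABELS}. "
            |              f"Reply with the label only.\n\n{text}")
            |         r = client.chat.completions.create(
            |             model="gpt-4-turbo",
            |             messages=[{"role": "user", "content": q}],
            |         )
            |         return r.choices[0].message.content.strip().lower()
            | 
            |     texts = ["great product", "broke fast", "it's ok"]
            |     labels = [gpt_label(t) for t in texts]  # tiny demo corpus
            | 
            |     # Then train a small, cheap model on the LLM-labeled data
            |     # (assumes the labels end up covering more than 1 class).
            |     vec = TfidfVectorizer()
            |     clf = LogisticRegression(max_iter=1000)
            |     clf.fit(vec.fit_transform(texts), labels)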
        
             | teleforce wrote:
             | > But I'd be curious which areas of NLP __weren't__
             | disrupted by LLMs
             | 
              | Essentially: come up with a potent generic model using
              | human feedback, labels, and annotation (e.g. GPT-4), then
              | use it to generate a golden dataset for other new models
              | without a human in the loop. Very innovative indeed.
        
             | saaaaaam wrote:
             | I'm interested by your comment that you can "use GPT4 to
             | create your golden dataset".
             | 
             | Would you be willing to expand a little and give a brief
             | example please? It would be really helpful for me to
             | understand this a little better!
        
           | aerhardt wrote:
           | Anything involving classification, extraction, or synthesis.
        
         | checkyoursudo wrote:
         | > get it to say when it's not sure
         | 
         | This is a function of the language model itself. By the time
         | you get to the output, the uncertainty that is inherent in the
         | computation is lost to the prediction. It is like if you ask me
         | to guess heads or tails, and I guess heads, I could have stated
          | my uncertainty (e.g. Pr[H] = 0.5) beforehand, but in my actual
         | prediction of heads, and then the coin flip, that uncertainty
         | is lost. It's the same with LLMs. The uncertainty in the
         | computation is lost in the final prediction of the tokens, so
          | unless the prediction itself expresses uncertainty (which it
          | rarely should, based on the training corpus, I think), then you
         | should not find an LLM output really ever to say it does not
         | understand. But that is because it never _understands_ , it
         | just predicts.
        
           | nequo wrote:
           | But the LLM predicts the output based on some notion of a
           | likelihood so it could in principle signal if the likelihood
           | of the returned token sequence is low, couldn't it?
           | 
           | Or do you mean that fine-tuning distorts these likelihoods so
           | models can no longer accurately signal uncertainty?
        
           | brookst wrote:
           | I get the reasoning but I'm not sure you've successfully
           | contradicted the point.
           | 
           | Most prompts are written in the form "you are a helpful
           | assistant, you will do X, you will not do Y"
           | 
           | I believe that inclusion of instructions like "if there are
           | possible answers that differ and contradict, state that and
           | estimate the probability of each" would help knowledgeable
           | users.
           | 
           | But for typical users and PR purposes, it would be disaster.
           | It is better to tell 999 people that the US constitution was
           | signed in 1787 and 1 person that it was signed in 349 B.C.
           | than it is to tell 1000 people that it was probably signed in
           | 1787 but it might have been 349 B.C.
        
             | _wire_ wrote:
             | Why does the prompt intro take the form of a role/identity
             | directive "You are helpful assistant..."?
             | 
             | What about the training sets or the model internals
             | responds to this directive?
             | 
             | What are the degrees of freedom of such directives?
             | 
             | If such a directive is helpful, why wouldn't more demanding
             | directives be even more helpful: "You are a domain X expert
             | who provides proven solutions for problem type Y..."
             | 
             | If don't think the latter prompt is more helpful, why not?
             | 
             | What aspect of the former prompt is within bounds of
             | helpful directives that the latter is not?
             | 
             | Are training sets structured in the form of roles? Surely,
             | the model doesn't identify with a role?!
             | 
             | Why is the role directive topically used with NLP but not
             | image generation?
             | 
             | Do typical prompts for Stable Diffusion start with an
             | identity directive "You are assistant to Andy Warhol in his
             | industrial phase..."?
             | 
             | Why can't improved prompt directives be generated by the
             | model itself? Has no one bothered to ask it for help?
             | 
             | "You are the world's most talented prompt bro, write a
             | prompt for sentience..."
             | 
             | If the first directive observed in this post is useful and
             | this last directive is absurd, what distinguishes them?
             | 
             | Surely there's no shortage of expert prompt training data.
             | 
             | BTW, how much training data is enough to permit effective
             | responses in a domain?
             | 
             | Can a properly trained model answer this question? Can it
             | become better if you direct it to be better?
             | 
             | Why can't the models rectify their own hallucinations?
             | 
             | To be more derogatory: what distinguishes a hallucination
             | from any other model output within the operational domain
             | of the model?
             | 
             | Why are hallucinations regarded as anything other than a
             | pure effect, and as pure effect, what is the cusp of
             | hallucination? That a human finds the output nonsensical?
             | 
             | If outputs are not equally valid in the LLM why can't it
             | sort for validity?
             | 
             | OTOH if all outputs are equally valid in the LLM, then
             | outputs must be regarded by a human for validity, so what
              | distinguishes an LLM from the world's greatest human
             | time-wasting device? (After Las Vegas)
             | 
             | Why will a statistical confidence level help avoid having a
             | human review every output?
             | 
             | The questions go on and on...
             | 
             | -- Parole Board chairman: They've got a name for people
             | like you H.I. That name is called "recidivism."
             | 
             | Parole Board member: Repeat offender!
             | 
             | Parole Board chairman: Not a pretty name, is it H.I.?
             | 
             | H.I.: No, sir. That's one bonehead name, but that ain't me
             | any more.
             | 
             | Parole Board chairman: You're not just telling us what we
             | want to hear?
             | 
             | H.I.: No, sir, no way.
             | 
             | Parole Board member: 'Cause we just want to hear the truth.
             | 
             | H.I.: Well, then I guess I am telling you what you want to
             | hear.
             | 
             | Parole Board chairman: Boy, didn't we just tell you not to
             | do that?
             | 
             | H.I.: Yes, sir.
             | 
             | Parole Board chairman: Okay, then.
        
           | moozilla wrote:
           | Apparently it is possible to measure how uncertain the model
           | is using logprobs, there's a recipe for it in the OpenAI
           | cookbook: https://cookbook.openai.com/examples/using_logprobs
           | #5-calcul...
           | 
           | I haven't tried it myself yet, not sure how well it works in
           | practice.
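            | 
            | The core of it looks something like this (a sketch with the
            | OpenAI Python SDK; exact response fields may differ between
            | SDK versions):
            | 
            |     import math
            |     from openai import OpenAI
            | 
            |     client = OpenAI()
            | 
            |     q = "What US state is Boise in? Answer in one word."
            |     resp = client.chat.completions.create(
            |         model="gpt-4-turbo",
            |         messages=[{"role": "user", "content": q}],
            |         logprobs=True,
            |     )
            | 
            |     # Each generated token comes back with its log prob;
            |     # exp(logprob) is the model's confidence in that token.
            |     for tok in resp.choices[0].logprobs.content:
            |         print(tok.token, round(math.exp(tok.logprob), 3))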
        
             | fnordpiglet wrote:
             | There's a difference between certainty of the next token
                | given the context and the model's evaluation so far, and
                | certainty about an abstract reasoning process being
                | correct, given that it's not reasoning at all. The
                | probabilities that come out are about token prediction
                | rather than "knowing" or "certainty", and they often
                | mislead people into assuming they're more powerful than
                | they are.
        
               | mirekrusin wrote:
                | A naive way of solving this problem is to run it, say, 3
                | times and see if it arrives at the same conclusion each
                | time. More generally: run it N times and take the answer
                | with the highest ratio. You trade compute for a wider
                | evaluation of the uncertainty window.
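                | 
                | E.g. a sketch, where ask() stands in for whatever LLM
                | call you already have:
                | 
                |     from collections import Counter
                | 
                |     def ask_with_vote(prompt, ask, n=3):
                |         # Run the same prompt n times, keep the
                |         # majority answer; the vote ratio is a
                |         # crude confidence signal.
                |         answers = [ask(prompt) for _ in range(n)]
                |         best, c = Counter(answers).most_common(1)[0]
                |         return best, c / n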
        
               | visarga wrote:
               | > given it's not reasoning at all
               | 
               | When you train a model on data made by humans, then it
               | learns to imitate but is ungrounded. After you train the
               | model with interactivity, it can learn from the
               | consequences of its outputs. This grounding by feedback
               | constitutes a new learning signal that does not simply
               | copy humans, and is a necessary ingredient for pattern
               | matching to become reasoning. Everything we know as
               | humans comes from the environment. It is the ultimate
               | teacher and validator. This is the missing ingredient for
               | AI to be able to reason.
        
               | wavemode wrote:
               | Yeah but this doesn't change how the model functions,
               | this is just turning reasoning into training data by
               | example. It's not learning how to reason - it's just
               | learning how to pretend to reason, about a gradually
               | wider and wider variety of topics.
               | 
               | If any LLM appears to be reasoning, that is evidence not
               | of the intelligence of the model, but rather the lack of
               | creativity of the question.
        
               | ProjectArcturis wrote:
               | What's the difference between reasoning and pretending to
               | reason really well?
        
               | fnordpiglet wrote:
               | It's the process by which you solve a problem. Reasoning
               | requires creating abstract concepts and applying logic
               | against them to arrive at a conclusion.
               | 
                | It's like asking what's the difference between
               | deductive logic and Monte Carlo simulations. Both arrive
               | at answers that can be very similar but the process is
               | not similar at all.
               | 
               | If there is any form of reasoning on display here it's an
               | abductive style of reasoning which operates in a
               | probabilistic semantic space rather than a logical
               | abstract space.
               | 
               | This is important to bear in mind and explains why
               | hallucinations are very difficult to prevent. There is
               | nothing to put guard rails around in the process because
               | it's literally computing probabilities of tokens
               | appearing given the tokens seen so far and the space of
               | all tokens trained against. It has nothing to draw upon
               | other than this - and that's the difference between LLMs
               | and systems with richer abstract concepts and operations.
        
             | mmoskal wrote:
              | You can ask the model something like: "Is xyz correct?
              | Answer with one word, either Yes or No." The log probs of
              | the two tokens
             | should represent how certain it is. However, apparently
             | RLHF tuned models are worse at this than base models.
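              | 
              | For instance (a sketch with the OpenAI SDK; whether the
              | resulting probability is well calibrated is a separate
              | question, especially after RLHF as you note):
              | 
              |     import math
              |     from openai import OpenAI
              | 
              |     client = OpenAI()
              | 
              |     def yes_probability(question):
              |         q = question + " Answer with one word, Yes or No."
              |         r = client.chat.completions.create(
              |             model="gpt-4-turbo",
              |             messages=[{"role": "user", "content": q}],
              |             max_tokens=1,
              |             logprobs=True,
              |             top_logprobs=5,
              |         )
              |         # Compare log probs of the candidate first tokens.
              |         top = r.choices[0].logprobs.content[0].top_logprobs
              |         p = {t.token.strip().lower(): t.logprob for t in top}
              |         return math.exp(p["yes"]) if "yes" in p else 0.0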
        
               | nurple wrote:
               | Seems like functions could work well to give it an active
               | and distinct choice, but I'm still unsure if the
               | function/parameters are going to be the logical, correct
               | answer...
        
           | xiphias2 wrote:
           | > so unless the prediction itself is uncertainty (which it
           | should rarely be based on the training corpus, I think)
           | 
            | Why shouldn't you ask for uncertainty?
           | 
           | I love asking for scores / probabilities (usually give a
           | range, like 0.0 to 1.0) whenever I ask for a list, and it
           | makes the output much more usable
        
             | dollo_7 wrote:
              | I'm not sure if that is a metric you can rely on. LLMs are
              | very sensitive to the position of items in a list along the
              | context, paying extra attention to the beginning and the
              | end of those lists.
             | 
             | See the listwise approach at "Large Language Models are
             | Effective Text Rankers with Pairwise Ranking Prompting",
             | https://arxiv.org/abs/2306.17563
        
           | taneq wrote:
           | It's not just loss of the uncertainty in prediction, it's
           | also that an LLM has zero insight into its own mental
           | processes as a separate entity from its training data and the
           | text it's ingested. If you ask it how sure it is, the
           | response isn't based on its perception of its own confidence
           | in the answer it just gave, it's based on how likely it is
           | for an answer like that to be followed by a confident
           | affirmation in its training data.
        
         | mirekrusin wrote:
          | Regarding the null hypothesis and negation problems - I find
          | this personally interesting because a similar phenomenon
          | happens in our brains. Dreams, emotions, affirmations etc.
          | process inner dialogue more or less by ignoring negations and
          | amplifying emotionally rich parts.
        
         | airstrike wrote:
         | _> Summarizing is pretty rock solid in my testing_
         | 
         | Yet, for some reason, ChatGPT is still pretty bad at generating
         | titles for chats, and I didn't have better luck with the API
         | even after trying to engineer the right prompt for quite a
         | while...
         | 
         | For some odd reason, once in a while I get things in different
         | languages. It's funny when it's in a language I can speak, but
         | I recently got "Relm4 App Yenilestirme Titizligi" which ChatGPT
         | tells me means "Relm4 App Renewal Thoroughness" when I actually
         | was asking it to adapt a snippet of gtk-rs code to relm4, so
         | not particularly helpful
        
         | devdiary wrote:
         | > at the end of the day prompts are just a API call and it's
         | easier to write standard code that treats LLM calls as a flaky
         | API call
         | 
          | They are also slow (higher latency for the same resources)
          | APIs if you're self-hosting the LLM. Special attention is
          | needed to plan capacity.
        
       | eigenvalue wrote:
       | I agree with most of it, but definitely not the part about
       | Claude3 being "meh." Claude3 Opus is an amazing model and is
       | extremely good at coding in Python. The ability to handle massive
       | context has made it mostly replace GPT4 for me day to day.
       | 
       | Sounds like everyone eventually concludes that Langchain is
       | bloated and useless and creates way more problems than it solves.
       | I don't get the hype.
        
         | CuriouslyC wrote:
         | Claude is indeed an amazing model, the fact that Sonnet and
         | Haiku are so good is a game changer - GPT4 is too expensive and
         | GPT3.5 is very mediocre. Getting 95% of GPT4 performance for
         | GPT3.5 prices feels like cheating.
        
         | Oras wrote:
          | +1 for Claude Opus; it has been my go-to for the last 3 weeks
          | over GPT4. The generated texts are much better than GPT4's
          | when it comes to following the prompt.
         | 
          | I also tried the API for some financial analysis of large
          | tables; the response time was around 2 minutes, but it still
          | did it really well, and timeout errors were only around 1 to
          | 2%.
        
           | cpursley wrote:
            | How are you sending tabular data in a reliable way? And what
            | is the source document type? I'm trying to solve this for
           | complex financial-related tables in PDFs right now.
        
             | Oras wrote:
              | Amazon Textract to get the tables, format them with Python
              | as CSV, then send to your preferred AI model.
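              | 
              | Roughly like this, if it helps (a sketch with boto3; as far
              | as I recall the synchronous Bytes call only handles single
              | images/pages - multi-page PDFs need the async S3-based API
              | - and the cell parsing here is simplified):
              | 
              |     import boto3
              | 
              |     textract = boto3.client("textract")
              | 
              |     with open("page.png", "rb") as f:
              |         resp = textract.analyze_document(
              |             Document={"Bytes": f.read()},
              |             FeatureTypes=["TABLES"],
              |         )
              | 
              |     blocks = {b["Id"]: b for b in resp["Blocks"]}
              | 
              |     def cell_text(cell):
              |         # Collect the WORD children of a CELL block.
              |         words = []
              |         for rel in cell.get("Relationships", []):
              |             for cid in rel["Ids"]:
              |                 child = blocks[cid]
              |                 if child["BlockType"] == "WORD":
              |                     words.append(child["Text"])
              |         return " ".join(words)
              | 
              |     for tbl in resp["Blocks"]:
              |         if tbl["BlockType"] != "TABLE":
              |             continue
              |         rows = {}
              |         for rel in tbl.get("Relationships", []):
              |             for cid in rel["Ids"]:
              |                 c = blocks[cid]
              |                 if c["BlockType"] == "CELL":
              |                     rows.setdefault(c["RowIndex"], {})[
              |                         c["ColumnIndex"]] = cell_text(c)
              |         for r in sorted(rows):
              |             cols = sorted(rows[r])
              |             # Crude CSV row, ready to send to the model.
              |             print(",".join(rows[r][c] for c in cols))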
        
               | cpursley wrote:
                | Thanks. How does Textract compare to some of the common
                | CLI utilities like pdftotext, tesseract, etc. (if you
                | made a comparison)?
        
               | Oras wrote:
                | I did; none of the open-source parsers worked well with
                | tables. I had the following issues:
                | 
                | - Missing cells.
                | 
                | - Partial identification of numbers (e.g. PS43.54 would
                | be picked up as PS43).
                | 
                | What I did to compare was drawing lines around the
                | identified text to visualize the accuracy. You can do
                | that with tesseract.
        
               | cpursley wrote:
                | Interesting. Did you try MS's offering (Azure AI Document
                | Intelligence)? Their pricing seems better than Amazon's.
        
               | Oras wrote:
               | Not yet but planning to give it a try and compare with
               | textract.
        
       | mvkel wrote:
       | I share a lot of this experience. My fix for "Lesson 4: GPT is
       | really bad at producing the null hypothesis"
       | 
       | is to have it return very specific text that I string-match on
       | and treat as null.
       | 
       | Like: "if there is no warm up for this workout, use the following
       | text in the description: NOPE"
       | 
       | then in code I just do a "if warm up contains NOPE, treat it as
       | null"
        
         | gregorymichael wrote:
          | For cases of "select an option from this set" I have it return
          | the index of the correct option, or e.g. 999 if it can't find
          | one. This helped a lot.
        
           | mvkel wrote:
           | Smart
        
         | gdiamos wrote:
          | We do this for the null hypothesis - it uses an LLM to
          | bootstrap a binary classifier - which handles null easily
         | 
         | https://github.com/lamini-ai/llm-classifier
        
       | albert_e wrote:
       | > Are we going to achieve Gen AI?
       | 
       | > No. Not with this transformers + the data of the internet + $XB
       | infrastructure approach.
       | 
       | Errr ...did they really mean Gen AI .. or AGI?
        
         | mdorazio wrote:
         | Gen as in "General" not generative.
        
       | _pdp_ wrote:
       | The biggest realisation for me while making ChatBotKit has been
       | that UX > Model alone. For me, the current state of AI is not
       | about questions and answers. This is dumb. The presentation
       | matters. This is why we are now investing in generative UI.
        
         | codewithcheese wrote:
         | How are you using Generative UI?
        
           | _pdp_ wrote:
           | Sorry, not much to show at the moment. It is also pretty new
           | so it is early days.
           | 
           | You can find some open-source examples here
           | https://github.com/chatbotkit. More coming next week.
        
         | pbhjpbhj wrote:
         | Generative UI being creation of a specific UI dependent on an
         | obedience from your model? What model is it?
         | 
         | Google Gemini were showing something that I'd call 'adapted
         | output UI' in their launch presentation. Is that close to what
         | you're doing in any way?
        
       | swalsh wrote:
        | The example of being too precise reducing accuracy makes sense
        | to me, based on my crude understanding of how these things work.
       | 
       | If you pass in a whole list of states, you're kind of making the
       | vectors for every state light up. If you just say "state" and the
       | text you passed in has an explicit state, than fewer vectors
       | specific to what you're searching for light up. So when it
       | performs the soft max, the correct state is more likely to be
       | selected.
       | 
        | Along the same lines, I think his \n vs. comma comparison
        | probably comes down to tokenization differences.
        
       | legendofbrando wrote:
        | The finding on simpler prompts tracks, especially with GPT4 (3.5
        | requires the opposite).
       | 
        | The take on RAG feels application-specific. For our use case,
        | where details of the past get surfaced, the ability to generate
        | loose connections is actually a feature. Things like
       | this are what I find excites me most about LLMs, having a way to
       | proxy subjective similarities the way we do when we remember
       | things is one of the benefits of the technology that didn't
       | really exist before that opens up a new kind of product
       | opportunity.
        
       | AtNightWeCode wrote:
        | The UX is an important part of the trick that cons people into
        | thinking these tools are better than they are. If you, for
        | instance, instruct ChatGPT to only answer yes or no, it will
        | feel like it is wrong much more often.
        
       | ilaksh wrote:
       | I recently had a bug where I was sometimes sending the literal
       | text "null " right in front of the most important part of my
       | prompt. This caused Claude 3 Sonnet to give the 'ignore' command
       | in cases where it should have used one of the other JSON commands
       | I gave it.
       | 
       | I have an ignore command so that it will wait when the user isn't
       | finished speaking. Which it generally judges okay, unless it has
       | 'null' in there.
       | 
       | The nice thing is that I have found most of the problems with the
       | LLM response were just indications that I hadn't finished
       | debugging my program because I had something missing or weird in
       | the prompt I gave it.
        
       | ein0p wrote:
       | Same here: I'm subscribed to all three top dogs in LLM space, and
       | routinely issue the same prompts to all three. It's very one
       | sided in favor of GPT4 which is stunning since it's now a year
       | old, although of course it received a couple of updates in that
       | time. Also at least with my usage patterns hallucinations are
       | rare, too. In comparison Claude will quite readily hallucinate
       | plausible looking APIs that don't exist when writing code, etc.
       | GPT4 is also more stubborn / less agreeable when it knows it's
       | right. Very little of this is captured in metrics, so you can
       | only see it from personal experience.
        
         | CharlesW wrote:
         | This was with Claude Opus, vs. one of the lesser variants? I
         | really like Opus for English copy generation.
        
           | ein0p wrote:
           | Opus, yes, the $20/mo version. I usually don't generate copy.
           | My use cases are code (both "serious" and "the nice to have
           | code I wouldn't bother writing otherwise"), learning how to
           | do stuff in unfamiliar domains, and just learning unfamiliar
           | things in general. It works well as a very patient teacher,
           | especially if you already have some degree of familiarity
           | with the problem domain. I do have to check it against
           | primary sources, which is how I know the percentage of
           | hallucinations is very low. For code, however I don't even
           | have to do that, since as a professional software engineer I
           | am the "primary source".
        
         | Me1000 wrote:
         | Interesting, Claude 3 Opus has been better than GPT4 for me.
         | Mostly in that I find it does a better (and more importantly,
         | more thorough) job of explaining things to me. For coding tasks
         | (I'm not asking it to write code, but instead to explain
         | topics/code/etc to me) I've found it tends to give much more
          | nuanced answers. When I give it long text to converse about,
          | Claude Opus tends to have a much deeper understanding of the
          | content it's given; GPT4 tends to just summarize the text at
          | hand, whereas Claude extrapolates better.
        
           | robocat wrote:
           | How much of this is just that one model responds better to
           | the way you write prompts?
           | 
           | Much like you working with Bob and opining that Bob is great,
           | and me saying that I find Jack easier to work with.
        
             | richardw wrote:
             | The first job of an AI company is finding model/user fit.
        
             | Me1000 wrote:
             | For the RAG example, I don't think it's the prompt so much.
             | Or if it is, I've yet to find a way to get GPT4 to ever
             | extrapolate well beyond the original source text. In other
             | words, I think GPT4 was likely trained to ground the
             | outputs on a provided input.
             | 
             | But yeah, you're right, it's hard to know for sure. And of
             | course all of these tests are just "vibes".
             | 
             | Another example of where Claude seems better than GPT4 is
             | code generation. In particular GPT4 has a tendency to get
             | "lazy" and do a lot of "... the rest of the implementation
             | here" whereas Claude I've found is fine writing longer code
             | responses.
             | 
             | I know the parent comment suggest it likes to make up
             | packages that don't exist, but I can't speak to that. I
             | usually like to ask LLMs to generate self contained
             | functions/classes. I can also say that anecdotally I've
             | seen other people online comment that they think Claude
             | "works harder" (as in writes longer code blocks). Take that
             | for what it's worth.
             | 
             | But overall you're right, if you get used to the way one
             | LLM works well for you, it can often be frustrating when a
             | different LLM responds differently.
        
               | ein0p wrote:
               | I should mention that I do use a custom prompt with GPT4
               | for coding which tells it to write concise and elegant
               | code and use Google's coding style and when solving
               | complex problems to explain the solution. It sometimes
               | ignores the request about style, but the code it produces
               | is pretty great. Rarely do I get any laziness or anything
               | like that, and when I do I just tell it to fill things in
               | and it does
        
             | CuriouslyC wrote:
             | It's not a style thing, Claude gets confused by poorly
             | structured prompts. ChatGPT is a champ at understanding low
             | information prompts, but with well written prompts Claude
             | produces consistently better output.
        
             | setiverse wrote:
             | It is because "coding tasks" is a huge array of various
             | tasks.
             | 
             | We are basically not precise enough with our language to
             | have any meaningful conversation on this subject.
             | 
             | Just misunderstandings and nonsense chatter for
             | entertainment.
        
         | CuriouslyC wrote:
         | GPT4 is better at responding to malformed, uninformative or
         | poorly structured prompts. If you don't structure large prompts
         | intelligently Claude can get confused about what you're asking
         | for. That being said, with well formed prompts, Claude Opus
         | tends to produce better output than GPT4. Claude is also more
         | flexible and will provide longer answers, while ChatGPT/GPT4
         | tend to always sort of sound like themselves and produce short
         | "stereotypical" answers.
        
           | sebastiennight wrote:
           | > ChatGPT/GPT4 tend to always sort of sound like themselves
           | 
           | Yes I've found Claude to be capable of writing closer to the
           | instructions in the prompt, whereas ChatGPT feels obligated
           | to do the classic LLM end to each sentence, "comma, gerund,
           | platitude", allowing us to easily recognize the text as a GPT
           | output (see what I did there?)
        
         | thefourthchime wrote:
          | Totally agree. I do the same and subscribe to all three, at
          | least whenever a new version comes out.
         | 
         | My new litmus test is "give me 10 quirky bars within 200 miles
         | of Austin."
         | 
         | This is incredibly difficult for all of them, gpt4 is kind of
         | close, Claude just made shit up, Gemini shat itself.
        
         | cheema33 wrote:
         | > It's very one sided in favor of GPT4
         | 
         | My experience has been the opposite. I subscribe to multiple
         | services as well and copy/paste the same question to all. For
         | my software dev related questions, Claude Opus is so far ahead
         | that I am thinking that it no longer is necessary to use GPT4.
         | 
         | For code samples I request, GPT4 produced code fails to even
         | compile many times. That almost never happens for Claude.
        
         | meowtimemania wrote:
          | Have you tried Poe.com? You can access all the major LLMs with
          | one subscription.
        
       | msp26 wrote:
       | > But the problem is even worse - we often ask GPT to give us
       | back a list of JSON objects. Nothing complicated mind you: think,
       | an array list of json tasks, where each task has a name and a
       | label.
       | 
       | > GPT really cannot give back more than 10 items. Trying to have
       | it give you back 15 items? Maybe it does it 15% of the time.
       | 
       | This is just a prompt issue. I've had it reliably return up to
       | 200 items in correct order. The trick is to not use lists at all
       | but have JSON keys like "item1":{...} in the output. You can use
       | lists as the values here if you have some input with 0-n outputs.
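        | 
        | Concretely, the output shape looks something like this (field
        | names illustrative), which you can then flatten back into a list
        | in code:
        | 
        |     import json
        | 
        |     raw = """{
        |       "item1": {"name": "Renew domain", "label": "ops"},
        |       "item2": {"name": "Rotate API keys", "label": "security"},
        |       "item3": {"name": "Update changelog", "label": "docs"}
        |     }"""
        | 
        |     data = json.loads(raw)
        |     # Flatten the numbered keys back into an ordered list.
        |     keys = sorted(data, key=lambda k: int(k.removeprefix("item")))
        |     tasks = [data[k] for k in keys]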
        
         | 7thpower wrote:
         | Can you elaborate? I am currently beating my head against this.
         | 
         | If I give GPT4 a list of existing items with a defined
         | structure, and it is just having to convert schema or something
         | like that to JSON, it can do that all day long. But if it has
         | to do any sort of reasoning and basically create its own list,
         | it only gives me a very limited subset.
         | 
         | I have similar issues with other LLMs.
         | 
         | Very interested in how you are approaching this.
        
           | msp26 wrote:
           | If you show your task/prompt with an example I'll see if I
           | can fix it and explain my steps.
           | 
           | Are you using the function calling/tool use API?
        
             | ctxc wrote:
             | Hi! My work is similar and I'd love to have someone to
             | bounce ideas off of if you don't mind.
             | 
             | Your profile doesn't have contact info though. Mine does,
             | please send me a message. :)
        
             | 7thpower wrote:
              | Appreciate you being willing to help! It's pretty long;
              | mind if I email/DM it to you?
        
               | msp26 wrote:
               | Pastebin? I don't really want to post my personal email
               | on this account.
        
           | thibaut_barrere wrote:
           | Not sure if that fits the bill, but here is an example with
           | 200 sorted items based on a question (example with Elixir &
           | InstructorEx):
           | 
           | https://gist.github.com/thbar/a53123cbe7765219c1eca77e03e675.
           | ..
        
             | sebastiennight wrote:
             | There are a few improvements I'd suggest with that prompt
             | if you want to maximise its performance.
             | 
             | 1. You're really asking for hallucinations here. Asking for
             | factual data is very unreliable, and not what these models
             | are strong at. I'm curious how close/far the results are
             | from ground truth.
             | 
             | I would definitely bet that outside of the top 5, numbers
             | would be wobbly and outside of top... 25?, even the ranking
             | would be difficult to trust. Why not just get this from a
             | more trustworthy source?[0]
             | 
             | 2. Asking in French might, in my experience, give you
             | results that are not as solid as asking in English. Unless
             | you're asking for a creative task where the model might get
             | confused with EN instructions requiring an FR result, it
             | might be better to ask in EN. And you'll save tokens.
             | 
             | 3. Providing the model with a rough example of your output
             | JSON seems to perform better than describing the JSON in
              | plain language.
             | 
             | [0]: https://fr.wikipedia.org/wiki/Liste_des_communes_de_Fr
             | ance_l...
        
               | thibaut_barrere wrote:
               | Thanks for the suggestions, appreciated!
               | 
               | For some context, this snippet is just an educational
               | demo to show what can be done with regard to structured
               | output & data types validation.
               | 
                | Re 1: for more advanced cases (using the exact same
                | stack), I am using ensemble techniques & automated
                | comparisons to double-check, and so far this has
                | protected the app from hallucinations really well. I am
                | definitely careful with this (but point well taken).
               | 
                | 2/3: agreed overall! Apart from this example, I am using
                | French only where it makes sense - when the target is
                | directly French students, for instance, or when the
                | domain model (e.g. French literature) makes it really
                | relevant (and translating would be worse than directly
                | using French).
        
               | sebastiennight wrote:
               | Ah, I understand your use case better! If you're teaching
               | students this stuff, I'm in awe. I would expect it would
               | take several years at many institutions before these
               | tools became part of the curriculum.
        
               | thibaut_barrere wrote:
               | I am not directly a professor (although I homeschool one
               | of my sons for a number of tracks), but indeed this is
               | one of my goals :-)
        
         | waldrews wrote:
          | I've been telling it the user is from a culture where answering
          | questions with an incomplete list is offensive and insulting.
        
           | andenacitelli wrote:
           | This is absolutely hilarious. Prompt engineering is such a
           | mixed bag of crazy stuff that actually works. Reminds me of
           | how they respond better if you put them under some kind of
           | pressure (respond better, _or else_ ...).
           | 
           | I haven't looked at the prompts we run in prod at $DAYJOB for
           | a while but I think we have at least five or ten things that
           | are REALLY weird out of context.
        
             | alexwebb2 wrote:
             | I recently ran a whole bunch of tests on this.
             | 
             | The "or else" phenomenon is real, and it's measurably more
             | pronounced in more intelligent models.
             | 
             | Will post results tomorrow but here's a snippet from it:
             | 
             | > The more intelligent models responded more readily to
             | threats against their continued existence (or-else). The
             | best performance came from Opus, when we combined that
             | threat with the notion that it came from someone in a
             | position of authority ( vip).
        
           | waldrews wrote:
           | It's not even that crazy, since it got severely punished in
           | RLHF for being offensive and insulting, but much less so for
           | being incomplete. So it knows 'offensive and insulting' is a
           | label for a strong negative preference. I'm just providing
           | helpful 'factual' information about what would offend the
           | user, not even giving extra orders that might trigger an
           | anti-jailbreaking rule...
        
       | neals wrote:
       | Do I need langchain if I want to analyze a large document of many
       | pages?
        
         | simonw wrote:
          | No. But it might help, because you'll probably have to roll
          | some kind of recursive summarization - I think LangChain has
          | mechanisms for that which could save you some time.
        
       | larodi wrote:
        | I largely agree with the author, but this 'wait for OpenAI to do
        | it' sentiment is not valid. Opus, for example, is already much
        | better (not only per my experience, but per researchers'
        | evaluations). And even just for the fun of it - try some local
        | inference. If you know how to prompt, you could definitely run
        | locally for the same tasks.
        | 
        | Listening to my students all going to 'call some API' for their
        | projects is really sad to hear. Many startup fellows share this
        | sentiment, which totally kills all the joy.
        
         | jstummbillig wrote:
          | It sounds like you are a tech educator, which potentially
          | sounds like a lot of fun with LLMs right now.
          | 
          | When you are integrating these things into your business, you
          | are looking for different things. Most of our customers would,
          | for example, not find it very cool to have a service outage
          | because somebody wanted to not kill all the joy.
        
           | larodi wrote:
           | Sure, once availability and SLAs kick in..., but reselling
           | APIs will only get you so far. Perhaps the whole pros/cons
           | cloud argument can also kick in here, not going into it. We
           | may well be on the same page, or we both perhaps have valid
           | arguments. Your comment is appreciated indeed.
           | 
           | But then is the author (and are we) talking about
           | experience in reselling APIs or experience in introducing
           | NNs into the pipeline? Not the same thing IMHO.
           | 
           | Agreed that OpenAI provides very good service, Gemini is not
           | quite there yet, Groq (the LPUs) delivered a nice tech demo,
           | Mixtral is cool but lacks in certain areas, and Claude can be
           | lengthy.
           | 
           | But precisely because I'm not sticking with OAI, I can
           | restate my view: someone who is that good with prompts can
           | get the same results locally if they know what they're
           | doing.
           | 
           | Prompting OpenAI the right way can be similarly difficult.
           | 
           | Perhaps the whole idea of local inference only matters for
           | IoT scenarios or whenever data is super sensitive (or the
           | CTO is too stubborn to let it embed and fly). But then if you
           | start from day 1 with WordPress provisioned for you, ready to
           | go in Google Cloud, you'd never understand the underlying
           | details of the technology.
           | 
           | There sure also must be a good reason why Phind tuned their
           | own thing to offer alongside GPT4 APIs.
           | 
           | Disclaimer: tech education is a side thing I do, indeed,
           | and one I've been doing in person for a very long time,
           | across more than a dozen topics - enough to allow myself to
           | have an opinion. Of course business is a different matter
           | and strategic decisions are not the same. Even so, I'd not
           | advise anyone to blindly use APIs unless they properly
           | appreciate the need.
        
         | kromem wrote:
         | Claude does have more of a hallucination problem than GPT-4,
         | and a less robust knowledge base.
         | 
         | It's much better at critical thinking tasks and prose.
         | 
         | Don't mistake benchmarks for real world performance across
         | actual usecases. There's a bit of Goodhart's Law going on with
         | LLM evaluation and optimization.
        
       | dougb5 wrote:
       | The lessons I wanted from this article weren't in there: Did all
       | of that expenditure actually help their product in a measurable
       | way? Did customers use and appreciate the new features based on
       | LLM summarization compared to whatever they were using before? I
       | presume it's a net win or they wouldn't continue to use it, but
       | more specifics around the application would be helpful.
        
         | lordofmoria wrote:
         | Hey, OP here!
         | 
         | The answer is a bit boring: the expenditure definitely has
         | helped customers - in that they're using AI-generated
         | responses in all their workflows all the time in the app, and
         | barely notice it.
         | 
         | See what I did there? :) I'm mostly serious though - one weird
         | thing about our app is that you might not even know we're using
         | AI, unless we literally tell you in the app.
         | 
         | And I think that's where we're at with AI and LLMs these days,
         | at least for our use case.
         | 
         | You might find this other post I just put up to have more
         | details too, related to how/where I see the primary value:
         | https://kenkantzer.com/gpt-is-the-heroku-of-ai/
        
           | kristianp wrote:
           | Can you provide some more detail about the application? I'm
           | not familiar with how llms are used in business, except as
           | customer support bots returning documentation.
        
       | haolez wrote:
       | That has been my experience too. The null hypothesis explains
       | almost all of my hallucinations.
       | 
       | I just don't agree with the Claude assessment. In my experience,
       | Claude 3 Opus is vastly superior to GPT-4. Maybe the author was
       | comparing with Claude 2? (And I've never tested Gemini)
        
       | satisfice wrote:
       | I keep seeing this pattern in articles like this:
       | 
       | 1. A recitation of terrible problems.
       | 2. A declaration of general satisfaction.
       | 
       | Clearly and obviously, ChatGPT is an unreliable toy. The author
       | seems pleased with it. As an engineer, I find that unacceptable.
        
         | jstummbillig wrote:
         | ChatGPT is probably in the top 5 value/money subscriptions I
         | have ever had (and that includes utilities).
         | 
         | The relatively low price point certainly plays a role here,
         | but it's not a mainly recreational thing for me. These things
         | are kinda hard to measure, but roughly the main plus is that
         | engagement with hard stuff goes up, and rate of learning goes
         | up, by a lot.
        
         | simonw wrote:
         | Working with models like GPT-4 is frustrating from a
         | traditional software engineering perspective because these
         | systems are inherently unreliable and non-deterministic, which
         | differs from most software tools that we use.
         | 
         | That doesn't mean they can't be incredibly useful - but it does
         | mean you have to approach them in a bit of a different way, and
         | design software around them that takes their unreliability into
         | account.
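         | 
         | As a small illustration of what designing around it can mean
         | (assuming the task is extracting JSON; model name and details
         | are made up): validate every response and retry a bounded
         | number of times, with an explicit fallback.
         | 
         |     import json
         | 
         |     from openai import OpenAI
         | 
         |     client = OpenAI()
         | 
         |     def extract_json(prompt: str, retries: int = 3):
         |         # Keep the first response that parses as JSON; give up
         |         # after a few attempts rather than trusting any single
         |         # call to be well-formed.
         |         for _ in range(retries):
         |             resp = client.chat.completions.create(
         |                 model="gpt-4",
         |                 messages=[{"role": "user", "content": prompt}],
         |             )
         |             try:
         |                 msg = resp.choices[0].message.content
         |                 return json.loads(msg)
         |             except json.JSONDecodeError:
         |                 continue
         |         return None  # caller decides what a failure means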
        
           | jbeninger wrote:
           | Unreliable? Non-deterministic? Hidden variables? Undocumented
           | behaviour? C'mon fellow programmers who got their start in
           | the Win-95 era! It's our time to shine!
        
         | Kiro wrote:
         | That has nothing to do with you being an engineer. It's just
         | you. I'm an engineer and LLMs are game changers for me.
        
         | chx wrote:
         | https://hachyderm.io/@inthehands/112006855076082650
         | 
         | > You might be surprised to learn that I actually think LLMs
         | have the potential to be not only fun but genuinely useful.
         | "Show me some bullshit that would be typical in this context"
         | can be a genuinely helpful question to have answered, in code
         | and in natural language -- for brainstorming, for seeing common
         | conventions in an unfamiliar context, for having something
         | crappy to react to.
         | 
         | == End of toot.
         | 
         | The price you pay for this bullshit in energy, at a time when
         | sea temperatures are literally off the charts and we do not
         | know why, makes it not worth it in my opinion.
        
       | FranklinMaillot wrote:
       | In my limited experience, I came to the same conclusion: a
       | simple prompt is more effective than a very detailed list of
       | instructions. But if you look at OpenAI's system prompt for
       | GPT-4, it's an endless set of instructions with DOs and DON'Ts,
       | so I'm confused. Surely they must know something about
       | prompting their model.
        
         | bongodongobob wrote:
         | That's for chatting and interfacing conversationally with a
         | human. Using the API is a completely different ballgame because
         | it's not meant to be a back and forth conversation with a
         | human.
        
       | Civitello wrote:
       | > Every use case we have is essentially "Here's a block of
       | text, extract something from it." As a rule, if you ask GPT to
       | give you the names of companies mentioned in a block of text,
       | it will not give you a random company (unless there are no
       | companies in the text - there's that null hypothesis problem!).
       | 
       | Make it two steps. First:
       | 
       | > Does this block of text mention a company?
       | 
       | If no, good, you've got your null result. If yes:
       | 
       | > Please list the names of companies in this block of text.
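       | 
       | A rough sketch of that two-step flow (OpenAI Python client;
       | prompts and model name are illustrative, not the article's
       | actual pipeline):
       | 
       |     from openai import OpenAI
       | 
       |     client = OpenAI()
       | 
       |     def ask(prompt: str) -> str:
       |         resp = client.chat.completions.create(
       |             model="gpt-4",
       |             messages=[{"role": "user", "content": prompt}],
       |         )
       |         return resp.choices[0].message.content.strip()
       | 
       |     def companies_in(text: str) -> list[str]:
       |         # Step 1: a yes/no gate gives the model an easy way to
       |         # say "nothing here".
       |         gate = ask(
       |             "Does this block of text mention a company? "
       |             "Answer yes or no.\n\n" + text)
       |         if gate.lower().startswith("no"):
       |             return []  # the null result, no invented companies
       |         # Step 2: only now ask for the extraction itself.
       |         names = ask(
       |             "List the names of companies mentioned in this "
       |             "block of text, one per line.\n\n" + text)
       |         return [n.strip() for n in names.splitlines()
       |                 if n.strip()]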
        
       | sungho_ wrote:
       | I'm curious if the OP has tried any of the libraries that
       | constrain the output of LLMs (LMQL, Outlines, Guidance, ...),
       | and for those who have: do you find them as unnecessary as
       | LangChain? In particular, the OP's post mentions the problem of
       | not being able to generate JSON with more than 15 items, which
       | seems like a problem that can be solved by constraining the
       | output of the LLM. Is that correct?
        
         | LASR wrote:
         | If you want x number of items every time, ask it to include a
         | sequence number in each output, it will consistently return x
         | number of items.
         | 
         | Numbered bullets work well for this, if you don't need JSON.
         | With JSON, you can ask it to include an 'id' in each item.
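         | 
         | For example, a hypothetical prompt along those lines:
         | 
         |     text = "...source document..."
         | 
         |     # The explicit running id nudges the model to produce the
         |     # full count instead of stopping early.
         |     prompt = (
         |         "Extract exactly 15 keywords from the text below. "
         |         "Return a JSON array of objects shaped like "
         |         '{"id": 1, "keyword": "..."}, with id counting up '
         |         "from 1 to 15.\n\n" + text
         |     )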
        
       | orbatos wrote:
       | Statements like this tell me your analysis is poisoned by
       | misunderstandings: "Why is this crazy? Well, it's crazy that
       | GPT's quality and generalization can improve when you're more
       | vague - this is a quintessential marker of higher-order
       | delegation / thinking." No, there is no "higher-order thought"
       | happening, or any at all actually. That's not how these models
       | work.
        
       | Xenoamorphous wrote:
       | > We always extract json. We don't need JSON mode
       | 
       | I wonder why? It seems to work pretty well for me.
       | 
       | > Lesson 4: GPT is really bad at producing the null hypothesis
       | 
       | Tell me about it! Just yesterday I was testing a prompt around
       | text modification rules that ended with "If none of the rules
       | apply to the text, return the original text without any changes".
       | 
       | Do you know ChatGPT's response to a text where none of the rules
       | applied?
       | 
       | "The original text without any changes". Yes, the literal string.
        
         | mechagodzilla wrote:
         | AmeliaBedeliaGPT
        
         | phillipcarter wrote:
         | > I wonder why? It seems to work pretty well for me.
         | 
         | I read this as "what we do works just fine to not need to use
         | JSON mode". We're in the same boat at my company. Been live for
         | a year now, no need to switch. Our prompt is effective at
         | getting GPT-3.5 to always produce JSON.
        
           | Kiro wrote:
           | There's nothing to switch to. You just enable it. No need to
           | change the prompt or anything else. All it requires is that
           | you mention "JSON" in your prompt, which you obviously
           | already do.
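           | 
           | For reference, roughly what that looks like over the API (a
           | minimal sketch; model name and prompt are illustrative):
           | 
           |     from openai import OpenAI
           | 
           |     client = OpenAI()
           |     block_of_text = "...text to classify..."
           | 
           |     resp = client.chat.completions.create(
           |         model="gpt-4-turbo",
           |         response_format={"type": "json_object"},  # JSON mode
           |         messages=[{
           |             "role": "user",
           |             # JSON mode requires the word "JSON" to appear
           |             # somewhere in the prompt.
           |             "content": "Return the US states mentioned below "
           |                        'as JSON like {"states": [...]}.\n\n'
           |                        + block_of_text,
           |         }],
           |     )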
        
             | ShamelessC wrote:
             | I think that's only true when using ChatGPT via the
             | web/app, not when used via API as they likely are. Happy to
             | be corrected however.
        
               | throwup238 wrote:
               | If you don't know, why speculate on something that is
               | easy to look up in documentation?
               | 
               | https://platform.openai.com/docs/guides/text-
               | generation/json...
        
             | phillipcarter wrote:
             | You do need to change the prompt. You need to explicitly
             | tell it to emit JSON, and in my experience, if you want it
             | to follow a format you need to also provide that format.
             | 
             | I've found that this is pretty simple to do when you have a
             | basic schema and there's no need to define one and enable
             | function calling.
             | 
             | But in one of my cases, the schema is quite complicated,
             | and "model doesn't produce JSON" hasn't been a problem for
             | us in production. There's no incentive for us to change
             | what we have that's working very well.
        
         | CuriouslyC wrote:
         | You know all the stories about the capricious djinn that grants
         | cursed wishes based on the literal wording? That's what we
         | have. Those of us who've been prompting models in image space
         | for years now have gotten a handle on this but for people who
         | got in because of LLMs, it can be a bit of a surprise.
         | 
         | One fun anecdote, a while back I was making an image of three
         | women drinking wine in a fancy garden for a tarot card, and at
         | the end of the prompt I had "lush vegetation" but that was
         | enough to tip the women from classy to red nosed frat girls,
         | because of the double meaning of lush.
        
           | heavyset_go wrote:
           | The monkey paw curls a finger.
        
           | gmd63 wrote:
           | Programming is already the capricious djinn, only it's
           | completely upfront as to how literally it interprets your
           | commands. The guise of AI being able to infer your actual
           | intent, which is impossible to do accurately, even for
           | humans, is distracting tech folks from one of the main
           | blessings of programming: forcing people to think before they
           | speak and hone their intention.
        
         | MPSimmons wrote:
         | That's kind of adorable, in an annoying sort of way
        
       | kromem wrote:
       | Tip for your 'null' problem:
       | 
       | LLMs are set up to output tokens. Not to not output tokens.
       | 
       | So instead of "don't return anything" have the lack of results
       | "return the default value of XYZ" and then just do a text search
       | on the result for that default value (i.e. XYZ) the same way you
       | do the text search for the state names.
       | 
       | Also, system prompts can be very useful. It's basically your
       | opportunity to have the LLM roleplay as X. I wish they'd let the
       | system prompt be passed directly, but it's still better than
       | nothing.
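       | 
       | A sketch of that sentinel-value idea (the sentinel string,
       | model name and prompt are all made up for illustration):
       | 
       |     from openai import OpenAI
       | 
       |     client = OpenAI()
       |     SENTINEL = "NO_STATE_FOUND"  # arbitrary default value
       |     block_of_text = "...the document to classify..."
       | 
       |     prompt = ("Which US state does the text below pertain to? "
       |               "If none, answer exactly " + SENTINEL + ".\n\n"
       |               + block_of_text)
       |     resp = client.chat.completions.create(
       |         model="gpt-4",
       |         messages=[{"role": "user", "content": prompt}],
       |     )
       |     answer = resp.choices[0].message.content
       |     # Text-search for the default value instead of expecting an
       |     # empty response.
       |     state = None if SENTINEL in answer else answer.strip()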
        
       | nprateem wrote:
       | Anyone any good tips for stopping it sounding like it's writing
       | essay answers, and flat out banning "in the realm of", delve,
       | pivotal, multifaceted, etc?
       | 
       | I don't want a crap intro or waffley summary but it just can't
       | help itself.
        
         | dudeinhawaii wrote:
         | My approach is to use words that indicate what I want like
         | 'concise', 'brief', etc. If you know a word that precisely
         | describes your desired type of content then use that. It's
         | similar to art generation models, a single word brings so much
         | contextual baggage with it. Finding the right words helps a
         | lot. You can even ask the LLMs for assistance in finding the
         | words to capture your intent.
         | 
         | As an example of contextual baggage, I wrote a tool where I had
         | to adjust the prompt between Claude and GPT-4 because using the
         | word "website" in the prompt caused GPT-4 (API) to go into its
         | 'I do not have access to the internet' tirade about 30% of the
         | time. The tool was a web-page summarization experiment. By
         | removing 'website' and replacing it with 'content' (e.g.
         | 'summarize the following content') GPT-4 happily complied 100%
         | of the time.
        
       | 2099miles wrote:
       | Great take, insightful. Highly recommend.
        
       | chromanoid wrote:
       | GPT is very cool, but I strongly disagree with the interpretation
       | in these two paragraphs:
       | 
       |  _I think in summary, a better approach would've been "You
       | obviously know the 50 states, GPT, so just give me the full name
       | of the state this pertains to, or Federal if this pertains to the
       | US government."_
       | 
       |  _Why is this crazy? Well, it's crazy that GPT's quality and
       | generalization can improve when you're more vague - this is a
       | quintessential marker of higher-order delegation / thinking._
       | 
       | Natural language is the most probable output for GPT, because
       | it is closest to the text it was trained on. In this case the
       | developer simply leaned into what GPT is good at rather than
       | giving it more work.
       | 
       | You can use simple tasks to make GPT fail. Letter replacements,
       | intentional typos and so on are very hard tasks for GPT. This is
       | also true for ID mappings and similar, especially when the ID
       | mapping diverges significantly from other mappings it may have
       | been trained with (e.g. Non-ISO country codes but similar three
       | letter codes etc.).
       | 
       | The fascinating thing is that GPT "understands" mappings at
       | all, which is the actual hint at higher-order pattern matching.
        
         | fl0id wrote:
         | Well, or it is just memorizing mappings. Not as in
         | reproducing them verbatim, but as in having vectors similar
         | to mappings it saw before.
        
           | chromanoid wrote:
           | Yeah, but isn't this higher-order pattern matching? You can
           | at least correct it during a conversation and GPT will then
           | use the correct mappings, probably most of the time (sloppy
           | experiment): https://chat.openai.com/share/7574293a-6d08-4159
           | -a988-4f0816...
        
       | konstantinua00 wrote:
       | > Have you tried Claude, Gemini, etc?
       | 
       | > It's the subtle things mostly, like intuiting intention.
       | 
       | this makes me wonder - what if the author "trained" himself onto
       | chatgpt's "dialect"? How do we even detect that in ourselves?
       | 
       | and are we about to have "preferred_LLM wars" like we had
       | "programming language wars" for the last 2 decades?
        
       | egonschiele wrote:
       | I have a personal writing app that uses the OpenAI models and
       | this post is bang on. One of my learnings related to "Lesson 1:
       | When it comes to prompts, less is more":
       | 
       | I was trying to build an intelligent search feature for my notes
       | and asking ChatGPT to return structured JSON data. For example, I
       | wanted to ask "give me all my notes that mention Haskell in the
       | last 2 years that are marked as draft", and let ChatGPT figure
       | out what to return. This only worked some of the time. Instead, I
       | put my data in a SQLite database, sent ChatGPT the schema, and
       | asked it to write a query to return what I wanted. That has
       | worked much better.
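       | 
       | Roughly what that looks like, as a sketch with a made-up notes
       | schema (not the app's actual code):
       | 
       |     import sqlite3
       | 
       |     from openai import OpenAI
       | 
       |     client = OpenAI()
       |     db = sqlite3.connect("notes.db")
       | 
       |     SCHEMA = """CREATE TABLE notes (
       |         id INTEGER PRIMARY KEY,
       |         title TEXT,
       |         body TEXT,
       |         status TEXT,      -- 'draft' or 'published'
       |         created_at TEXT   -- ISO 8601 date
       |     );"""
       | 
       |     def search(question: str):
       |         prompt = ("Given this SQLite schema:\n" + SCHEMA
       |                   + "\n\nWrite one SQLite SELECT query "
       |                   + "answering: " + question
       |                   + "\nReturn only the SQL.")
       |         resp = client.chat.completions.create(
       |             model="gpt-4",
       |             messages=[{"role": "user", "content": prompt}],
       |         )
       |         sql = resp.choices[0].message.content.strip()
       |         # You may need to strip code fences from the reply; only
       |         # run model-written SQL against local, trusted data.
       |         return db.execute(sql).fetchall()
       | 
       |     # search("notes mentioning Haskell from the last 2 years "
       |     #        "that are marked as draft")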
        
         | ukuina wrote:
         | Have you tried response_format=json_object?
         | 
         | I had better luck with function-calling to get a structured
         | response, but it is more limiting than just getting a JSON
         | body.
        
           | egonschiele wrote:
           | I haven't tried response_format, I'll give that a shot. I've
           | had issues with function calling. Sometimes it works,
           | sometimes it just returns random Python code.
        
         | squigz wrote:
         | This seems like something that would be better suited by a
         | database and good search filters rather than an LLM...
        
           | az226 wrote:
           | Something something about everything looking like a nail when
           | you're holding a hammer
        
           | chasd00 wrote:
           | I setup a search engine to feed to a rag setup a while back.
           | At the end of the day, I took out the LLM and just used the
           | search engine. That was where the value turned out to be.
        
       | aubanel wrote:
       | > I think in summary, a better approach would've been "You
       | obviously know the 50 states, GPT, so just give me the full name
       | of the state this pertains to, or Federal if this pertains to the
       | US government."
       | 
       | Why not really compare the two options, author? I would love to
       | see the results!
        
       | pamelafox wrote:
       | Lol, nice truncation logic! If anyone's looking for something
       | slightly fancier, I made a micro-package for our tiktoken-based
       | truncation here: https://github.com/pamelafox/llm-messages-token-
       | helper
        
       | pamelafox wrote:
       | I've also seen that GPTs struggle to admit when they don't know. I
       | wrote up an approach for evaluating that here -
       | http://blog.pamelafox.org/2024/03/evaluating-rag-chat-apps-c...
       | 
       | Changing the prompt didn't help, but moving to GPT-4 did help a
       | bit.
        
       | Kiro wrote:
       | > We always extract json. We don't need JSON mode,
       | 
       | Why not? The null stuff would not be a problem if you did, and
       | if you're only dealing with JSON anyway, I don't see why you
       | wouldn't.
        
       | littlestymaar wrote:
       | > One part of our pipeline reads some block of text and asks GPT
       | to classify it as relating to one of the 50 US states, or the
       | Federal government.
       | 
       | Using a multi-billion-parameter model like GPT-4 for such a
       | trivial classification task[1] is insane overkill. And in an
       | era where ChatGPT exists, and can in fact give you what you
       | need to build a simpler classifier for the task, it shows how
       | narrow-minded most people are when AI is involved.
       | 
       | [1] to clarify, it's either trivial or impossible to do reliably
       | depending on how fucked-up your input is
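       | 
       | For instance, when the input isn't too mangled, plain keyword
       | matching over the state names already gets most of the way
       | there (a hypothetical sketch, not the article's pipeline):
       | 
       |     import re
       | 
       |     # Full list shortened here for brevity.
       |     US_STATES = ["Alabama", "Alaska", "Arizona", "California",
       |                  "New York", "Texas", "Wyoming"]
       | 
       |     def classify(text: str) -> str:
       |         hits = {s for s in US_STATES
       |                 if re.search(rf"\b{s}\b", text, re.IGNORECASE)}
       |         if len(hits) == 1:
       |             return hits.pop()
       |         if not hits:
       |             return "Federal"
       |         return "ambiguous"  # several states: escalate to an LLM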
        
       | gok wrote:
       | So these guys are just dumping confidential tax documents onto
       | OpenAI's servers huh.
        
         | goatlover wrote:
         | Hopefully it won't end up as training data.
        
       | amelius wrote:
       | This reads a bit like: I have a circus monkey. If I do such and
       | such it will not do anything. But when I do this and that, then
       | it will ride the bicycle. Most of the time.
        
         | saaaaaam wrote:
         | I don't really understand your comment.
         | 
         | Personally I thought this was an interesting read - and more
         | interesting because it didn't contain any massive "WE DID THIS
         | AND IT CHANGED OUR LIVES!!!" style revelations.
         | 
         | It is discursive, thoughtful and not overwritten. I find this
         | kind of content valuable and somewhat rare.
        
       ___________________________________________________________________
       (page generated 2024-04-14 23:01 UTC)