https://kenkantzer.com/lessons-after-a-half-billion-gpt-tokens/

Lessons after a half-billion GPT tokens

April 11, 2024 / Ken / 11 Comments
CTO @ Truss | Former VP of Engineering and Head of Security @
FiscalNote | ex-PKC co-founder | princeton tiger '11 | writes on
engineering, management, and security.

My startup (gettruss.io) released a few LLM-heavy features in the
last six months, and the narrative around LLMs that I read on Hacker
News is now starting to diverge from my reality, so I thought I'd
share some of the more "surprising" lessons after churning through
just north of 500 million tokens, by my estimate.

Some details first:

- we're using the OpenAI models, see the Q&A at the bottom if you
want my opinion of the others

- our usage is 85% GPT-4, and 15% GPT-3.5

- we deal exclusively with text, so no gpt-4-vision, Sora, whisper,
etc.

- we have a B2B use case - strongly focused on summarize/
analyze-extract, so YMMV

- 500M tokens actually isn't as much as it seems - it's about 750,000
pages of text, to put it in perspective
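
As a back-of-the-envelope check on that figure (a sketch only): 500M tokens spread over 750,000 pages is roughly 667 tokens per page. The 0.75 words-per-token ratio below is a commonly quoted rule of thumb for English text and is an assumption here, not a measured value.

```ruby
# Rough sanity check of the "750,000 pages" figure.
TOTAL_TOKENS = 500_000_000
TOTAL_PAGES = 750_000
WORDS_PER_TOKEN = 0.75 # assumption: rule of thumb for English text

def tokens_per_page(tokens = TOTAL_TOKENS, pages = TOTAL_PAGES)
  tokens.to_f / pages
end

def words_per_page(tokens = TOTAL_TOKENS, pages = TOTAL_PAGES)
  tokens_per_page(tokens, pages) * WORDS_PER_TOKEN
end

puts tokens_per_page.round # 667
puts words_per_page.round  # 500
```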

Lesson 1: When it comes to prompts, less is more

We consistently found that leaving an exact list or set of
instructions out of the prompt produced better results when that
information was already common knowledge. GPT is not dumb, and it
actually gets confused if you over-specify.

This is fundamentally different from coding, where everything has to
be explicit.

Here's an example where this bit us:

One part of our pipeline reads some block of text and asks GPT to
classify it as relating to one of the 50 US states, or the Federal
government. This is not a hard task - we probably could have used
string/regex matching, but there are enough weird corner cases that
it would've taken longer. So our first attempt was (roughly) something
like this:

Here's a block of text. One field should be "locality_id", and it should be the ID of one of the 50 states, or federal, using this list:
[{"locality": "Alabama", "locality_id": 1}, {"locality": "Alaska", "locality_id": 2} ... ]

This worked most of the time (I'd estimate >98%), but failed often
enough that we had to dig deeper.

While we were investigating, we noticed that another field, name, was
consistently returning the full name of the state - the correct
state - even though we hadn't explicitly asked it to do that.

So we switched to a simple string search on the name to find the
state, and it's been working beautifully ever since.

I think in summary, a better approach would've been "You obviously
know the 50 states, GPT, so just give me the full name of the state
this pertains to, or Federal if this pertains to the US government."
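
A minimal sketch of that string-search step, assuming the model's answer arrives as free text. The lookup table is hypothetical and abbreviated: only the IDs for Alabama and Alaska come from the prompt above, and the rest are made up for illustration. Longer names are checked first so that "Arkansas" is not mistaken for "Kansas".

```ruby
# Hypothetical, abbreviated lookup table (only Alabama and Alaska IDs
# appear in the original prompt; the others are illustrative).
LOCALITY_IDS = {
  "Alabama" => 1,
  "Alaska" => 2,
  "Arkansas" => 4,
  "Kansas" => 16,
  "Federal" => 0
}.freeze

# Find the locality whose name appears in the model's answer.
# Returns nil when no known name is found.
def locality_id_for(name_from_gpt)
  answer = name_from_gpt.to_s.downcase
  longest_first = LOCALITY_IDS.keys.sort_by { |name| -name.length }
  match = longest_first.find { |name| answer.include?(name.downcase) }
  match && LOCALITY_IDS[match]
end

puts locality_id_for("Alaska")            # 2
puts locality_id_for("State of Arkansas") # 4
```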

Why is this crazy? Well, it's crazy that GPT's quality and
generalization can improve when you're more vague - this is a
quintessential marker of higher-order delegation / thinking.

(Random side note one: GPT was failing most often with the M states --
Maryland, Maine, Massachusetts, Michigan -- which you might expect of
a fundamentally stochastic model.)

(Random side note two: when we asked GPT to choose an ID from a list
of items, it got confused a lot less when we sent the list as
prettified JSON, where each state was on its own line. I think \n is
a stronger separator than a comma.)
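
One way to get "each item on its own line" is sketched below. It is an illustration rather than the exact formatting used in the pipeline, and the helper name is hypothetical. Note that Ruby's JSON.pretty_generate would put every key on its own line, which is a different layout.

```ruby
require "json"

# Serialize a list so that every item sits on exactly one line.
def one_item_per_line(items)
  "[\n" + items.map { |item| "  " + JSON.generate(item) }.join(",\n") + "\n]"
end

localities = [
  { "locality" => "Alabama", "locality_id" => 1 },
  { "locality" => "Alaska", "locality_id" => 2 }
]
puts one_item_per_line(localities)
```

The result is still valid JSON, so it can be parsed back unchanged.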

Lesson 2: You don't need langchain. You probably don't even need
anything else OpenAI has released in their API in the last year. Just
chat. That's it.

Langchain is the perfect example of premature abstraction. We started
out thinking we had to use it because the internet said so. Instead,
millions of tokens and probably 3-4 very diverse production LLM
features later, our openai_service file still has only one 40-line
function in it:

def extract_json(prompt, variable_length_input, number_retries)

The only API we use is chat. We always extract json. We don't need
JSON mode, or function calling, or assistants (though we do all
that). Heck, we don't even use system prompts (maybe we should...).
When gpt-4-turbo was released, we updated one string in the
codebase.

This is the beauty of a powerful generalized model - less is more.

Most of the 40 lines in that function are error handling for the
OpenAI API's regular 500s and closed sockets (though it's gotten
better, and given their load, it's not surprising).
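
For illustration, here is a rough sketch of what a function with that signature might look like. It is not the actual 40-line function. The block stands in for the chat API call so the example runs without network access, and TransientApiError is a hypothetical error class representing 500s and closed sockets.

```ruby
require "json"

# Hypothetical error class standing in for 500s / closed sockets.
class TransientApiError < StandardError; end

# Sends prompt + input to the model (via the block) and parses the reply
# as JSON, retrying up to number_retries times on transient or parse errors.
def extract_json(prompt, variable_length_input, number_retries)
  attempts = 0
  begin
    attempts += 1
    raw = yield("#{prompt}\n\n#{variable_length_input}")
    JSON.parse(raw)
  rescue TransientApiError, JSON::ParserError
    retry if attempts <= number_retries
    raise
  end
end

calls = 0
result = extract_json("Extract companies as JSON.", "Acme hired Bob.", 2) do |_full_prompt|
  calls += 1
  raise TransientApiError, "500" if calls == 1 # first call fails
  '{"companies": ["Acme"]}'
end
puts result["companies"].first # Acme
```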

There's some auto-truncating we built in, so we don't have to worry
about context length limits. We have our own proprietary token-length
estimator. Here it is:

if s.length > model_context_size * 3
  # truncate it!
end

It fails in corner cases when there are a LOT of periods or numbers
(the ratio is < 3 characters / token for those). So there's
another very proprietary piece of try/catch retry logic:

if response_error_code == "context_length_exceeded"
  # shrink the input by a further 1.3x, then retry
  s = s.truncate((model_context_size * 3 / 1.3).to_i)
end

We've gotten quite far with this approach, and it's been flexible
enough for our needs.
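
Put together, the estimator and the retry might look something like the sketch below. The method names and the ContextLengthExceeded error class are hypothetical, the block stands in for the API call, and plain string slicing is used in place of Rails' truncate so the example runs on the standard library alone.

```ruby
CHARS_PER_TOKEN = 3       # rough characters-per-token ratio from the post
RETRY_SHRINK_FACTOR = 1.3 # extra shrink applied on the retry

# Hypothetical error standing in for a "context_length_exceeded" response.
class ContextLengthExceeded < StandardError; end

# Cut the input down to the estimated character budget.
def truncate_for_model(s, model_context_size)
  s[0, model_context_size * CHARS_PER_TOKEN]
end

# Call the model once; if the input is still too long, shrink and retry once.
def call_with_truncation(s, model_context_size)
  input = truncate_for_model(s, model_context_size)
  begin
    yield(input)
  rescue ContextLengthExceeded
    shorter = (model_context_size * CHARS_PER_TOKEN / RETRY_SHRINK_FACTOR).to_i
    yield(input[0, shorter])
  end
end

puts truncate_for_model("a" * 100, 10).length # 30
```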

Lesson 3: improving the latency with streaming API and showing users
variable-speed typed words is actually a big UX innovation with
ChatGPT.

We thought this was a gimmick, but users react very positively to
variable-speed "typed" characters - this feels like the mouse/cursor
UX moment for AI.

Lesson 4: GPT is really bad at producing the null hypothesis

"Return nothing if you don't find anything" - is probably the most
error-prone prompting language we came across. Not only does GPT
often choose to hallucinate rather than return nothing, but the
instruction also seems to sap its confidence, so it returns blank
more often than it should.

Most of our prompts are in the form:

    "Here's a block of text that's making a statement about a
    company, I want you to output JSON that extracts these companies.
    If there's nothing relevant, return a blank. Here's the text:
    [block of text]"

For a time, we had a bug where [block of text] could be empty. The
hallucinations were bad. Incidentally, GPT loves to hallucinate
bakeries. Here are some great ones:

  * Sunshine Bakery
  * Golden Grain Bakery
  * Bliss Bakery

Fortunately, the solution was to fix the bug and not send it a prompt
at all if there was no text (duh!). But it gets harder when "it's
empty" is difficult to define programmatically, and you actually do
need GPT to weigh in.
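
The easy half of that fix can be sketched as a guard in front of the call. The method name is hypothetical, the block stands in for the chat call, and the sketch assumes the model is asked for a bare JSON array of names.

```ruby
require "json"

# Returns [] without calling the model when there is no text to analyze.
def extract_companies(block_of_text)
  return [] if block_of_text.nil? || block_of_text.strip.empty?

  prompt = "Here's a block of text that's making a statement about a company. " \
           "Output a JSON array of the company names it mentions. " \
           "Here's the text:\n#{block_of_text}"
  JSON.parse(yield(prompt))
end

# The block is never invoked for blank input, so nothing can be hallucinated.
empty_result = extract_companies("   ") { |_prompt| '["Sunshine Bakery"]' }
puts empty_result.length # 0
```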

Lesson 5: "Context windows" are a misnomer - and they are only
growing larger for input, not output

Little known fact: GPT-4 may have a 128k token window for input, but
its output window is still a measly 4k! Calling it a "context
window" is confusing, clearly.

But the problem is even worse - we often ask GPT to give us back a
list of JSON objects. Nothing complicated, mind you: think of an
array of JSON tasks, where each task has a name and a label.

GPT really cannot give back more than 10 items. Trying to have it
give you back 15 items? Maybe it does it 15% of the time.

We originally thought this was because of the 4k context window, but
we were hitting 10 items, and it'd only be maybe 700-800 tokens, and
GPT would just stop.

Now, you can of course trade output for input: give it a prompt and
ask for a single task, then give it (prompt + task) and ask for the
next task, and so on. But now you're playing a game of telephone with
GPT, and have to deal with things like Langchain.
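
That one-item-at-a-time loop could be sketched roughly as follows. The method name and prompt wording are hypothetical, and the block stands in for the chat call; it is assumed to return the next task as a string, or nil or blank when the model has nothing more to add.

```ruby
# Ask for one task per call, feeding the tasks collected so far back in
# as input. Stops at max_tasks or when the model returns nothing.
def collect_tasks(prompt, max_tasks)
  tasks = []
  while tasks.length < max_tasks
    so_far = tasks.map { |t| "- #{t}" }.join("\n")
    next_task = yield("#{prompt}\n\nTasks so far:\n#{so_far}\n\nGive me the next task only.")
    break if next_task.nil? || next_task.strip.empty?
    tasks << next_task.strip
  end
  tasks
end

fake_answers = ["Draft summary", "Review bill text"]
puts collect_tasks("List the tasks in this memo.", 10) { |_prompt| fake_answers.shift }.length # 2
```

Each call costs a full round trip and re-sends the growing prompt, which is the price of the trade.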

Lesson 6: vector databases, and RAG/embeddings are mostly useless for
us mere mortals

I tried. I really did. But every time I thought I had a killer use
case for RAG / embeddings, I was confounded.

I think vector databases / RAG are really meant for Search. And only
search. Not search as in "oh - retrieving chunks is kind of like
search, so it'll work!", but real google-and-bing search. Here are
some reasons why:

 1. there's no cutoff for relevancy. There are some solutions out
    there, and you can create your own cutoff heuristics for
    relevancy, but they're going to be unreliable. This really kills
    RAG in my opinion - you always risk either poisoning your
    retrieval with irrelevant results or, by being too conservative,
    missing important results.
 2. why would you put your vectors in a specialized, proprietary
    database, away from all your other data? Unless you are dealing
    at a google/bing scale, this loss of context absolutely isn't
    worth the tradeoff.
 3. unless you are doing a very open-ended search, of say - the whole
    internet - users typically don't like semantic searches that
    return things they didn't directly type. For most applications of
    search within business apps, your users are domain experts - they
    don't need you to guess what they might have meant - they'll let
    you know!
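
To make the cutoff problem in point 1 concrete, here is a toy sketch using cosine similarity over made-up two-dimensional "embeddings". All names and vectors are illustrative assumptions. The only point is that the result set hinges on an arbitrary threshold.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Keep only chunks whose similarity to the query clears the cutoff,
# best match first. chunks maps chunk text to its vector.
def retrieve(query_vec, chunks, cutoff)
  chunks
    .map { |text, vec| [text, cosine_similarity(query_vec, vec)] }
    .select { |_text, score| score >= cutoff }
    .sort_by { |_text, score| -score }
    .map(&:first)
end

query = [1.0, 0.0]
chunks = {
  "relevant" => [0.9, 0.1],
  "borderline" => [0.6, 0.8],
  "unrelated" => [0.0, 1.0]
}
puts retrieve(query, chunks, 0.5).join(", ") # relevant, borderline
puts retrieve(query, chunks, 0.7).join(", ") # relevant
```

Whether "borderline" belongs in the results depends entirely on where the threshold is set, and no single value is right for every query.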

It seems to me (this is untested) that a much better use of LLMs for
most search cases is to use a normal completion prompt to convert a
user's search into a faceted search, or even a more complex query (or
heck, even SQL!). But this is not RAG at all.
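
A sketch of what the second half of that idea might look like, equally untested in production. It assumes the model has already been prompted to return a flat JSON object of facets. The facet names and the method name are hypothetical.

```ruby
require "json"

# Hypothetical whitelist of facets the application supports.
ALLOWED_FACETS = %w[state status year].freeze

# Turn the model's JSON facets into a parameterized WHERE clause.
# Keys outside the whitelist are dropped, so the model's output never
# ends up in the SQL text itself, only in bound parameters.
def facets_to_where(facets_json)
  facets = JSON.parse(facets_json).select { |key, _value| ALLOWED_FACETS.include?(key) }
  clause = facets.keys.map { |key| "#{key} = ?" }.join(" AND ")
  [clause, facets.values]
end

clause, params = facets_to_where('{"state": "Maine", "year": 2024, "mood": "grumpy"}')
puts clause        # state = ? AND year = ?
puts params.length # 2
```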

Lesson 7: Hallucination basically doesn't happen.

Every use case we have is essentially "Here's a block of text,
extract something from it." As a rule, if you ask GPT to give you the
names of companies mentioned in a block of text, it will not give you
a random company (unless there are no companies in the text - there's
that null hypothesis problem!).

Similarly -- and I'm sure you've noticed this if you're an engineer --
GPT doesn't really hallucinate code - in the sense that it doesn't
make up variables, or randomly introduce a typo in the middle of
re-writing a block of code you sent it. It does hallucinate the
existence of standard library functions when you ask it to give you
something, but again, I see that more as the null hypothesis. It
doesn't know how to say "I don't know".

But if your use case is entirely, "here's the full context of
details, analyze / summarize / extract" - it's extremely reliable. I
think you can see a lot of product releases recently that emphasize
this exact use case.

So it's all about good data in, good GPT responses out.

Where do I think all this is heading?

Rather than responding with some long-form post, here's a quick Q&A:

Are we going to achieve AGI?

No. Not with this transformers + the data of the internet + $XB
infrastructure approach.

Is GPT-4 actually useful, or is it all marketing?

It is 100% useful. This is the early days of the internet still. Will
it fire everyone? No. Primarily, I see this lowering the barrier of
entry to ML/AI that was previously only available to Google.

Have you tried Claude, Gemini, etc?

Yeah, meh. Actually, in all seriousness, we haven't done any serious
A/B testing, but I've tested these with my day-to-day coding, and it
doesn't feel even close. It's the subtle things mostly, like
intuiting intention.

How do I keep up to date with all the stuff happening with LLMs/AI
these days?

You don't need to. I've been thinking a lot about The Bitter Lesson -
that general improvements to model performance outweigh niche
improvements. If that's true, all you need to worry about is when
GPT-5 is coming out. Nothing else matters, and everything else being
released by OpenAI in the meantime (not including Sora, etc, that's a
whooolle separate thing) is basically noise.

So when will GPT-5 come out, and how good will it be?

I've been trying to read the signs with OpenAI, as has everyone else.
I think we're going to see incremental improvement, sadly. I don't
have a lot of hope that GPT-5 is going to "change everything". There
are fundamental economic reasons for that: between GPT-3 and GPT-3.5,
I thought we might be in a scenario where the models were getting
hyper-linear improvement with training: train it 2x as hard, it gets
2.2x better.

But that's not the case, apparently. Instead, what we're seeing is
logarithmic. And in fact, token speed and cost per token are growing
exponentially for incremental improvements.

If that's the case, there's some Pareto-optimal curve we're on, and
GPT-4 might be optimal: whereas I was willing to pay 20x for GPT-4
over GPT-3.5, I honestly don't think I'd pay 20x per token to go from
GPT-4 to GPT-5, not for the set of tasks that GPT-4 is used for.

GPT-5 may break that. Or, it may be the iPhone 5 to the iPhone 4. I
don't think that's a loss!

10 Comments

 1. David Vandervort

    April 12, 2024 at 10:02 am

    Some good learning here. I think the reason GPT-5 won't be soon or
    super-impressive is that we've gotten about as much improvement
    from adding training data as is going to happen. Instead, the
    next leap in capability is going to require some kind of
    enhancement to the transformer model itself. (But if I knew what
    that was, I would probably be developing it myself and getting
    ready to become a billionaire).

    Thanks for this insightful post.

 2. Yacov Lewis

    April 12, 2024 at 10:26 am

    Great piece! My experience around Langchain/RAG differs, so
    wanted to dig deeper:
    Putting some logic around handling relevant results helps us
    produce useful output. Curious what differs on your folks' end.

      + Ken (Post author)

        April 12, 2024 at 12:46 pm

        This is a great point! YMMV is very important here - it's
        very possible your use-case for RAG is just something we
        haven't had to deal with yet! Maybe some details would
        interest you:

        1. Our use case is (1) entirely text based, and (2)
        exclusively "un-creative"/bounded - meaning, we can assume a
        lot about the user's inputs for a given prompt.
        2. In the whole precision vs recall spectrum, we tend
        towards...(I had to look this up) - precision, I think! ;).
        Basically, we tend to be extremely conservative - we'd rather
        discard potentially relevant results just in case than
        return all the relevant results along with some irrelevant
        ones. RAG doesn't seem to be very conducive or helpful if you
        want precision.

 3. Wilhelm

    April 13, 2024 at 12:17 pm

    > Well, it's crazy that GPT's quality and generalization can
    improve when you're more vague - this is a quintessential marker
    of higher-order delegation / thinking.

    Would you mind justifying this statement so I can understand what
    you mean?

      + Ken (Post author)

        April 13, 2024 at 12:30 pm

        For sure! I do realize that was a bit out of nowhere.

        Generalization
        I tell my engineers that progression to more senior levels
        fundamentally is about increasing levels of delegation. What
        are the levels of delegation? Well, for entry-level folk,
        they get delegated tasks that have all the steps spelled out
        explicitly. As that entry-level engineer grows, they can
        handle tasks that are increasingly vague. At the senior
        level, the tasks can basically be:

        "Add a payment system, we need it badly, but not so badly
        that if Johnny needs help on a PR, do that first. Oh, and
        probably use Stripe unless you find something better."

        When you're CTO, your task that you've been entrusted with by
        the CEO is basically "find ways we can use technology to
        create value."

        Quality
        On top of that, at these higher levels of delegation, quality
        of the outcome can improve. Because it's vague, you can get
        solutions that you'd never have specified to a junior person
        doing the task. So not only are you giving more vague
        guidance, the results you expect are better than if you gave
        more specific guidance.

        That's why you hear so many engineering leaders say "Give
        your people freedom to explore" - it's a recognition of this
        dynamic.

        So, to see GPT improve quality when you're more vague really
        seems impressive to me when it happens!

 4. Civitello

    April 13, 2024 at 12:45 pm

    Regarding null hypothesis for asking for a list of companies in a
    block of text, would this work:
    Make it two steps, first:
    > Does this block of text mention a company?
    If no, good you've got your null result.
    If yes:
    > Please list the names of companies in this block of text.

 5. Michal Flak

    April 13, 2024 at 1:24 pm

    I think you could benefit from two things, especially since "Our
    use case is [...] exclusively "un-creative"/bounded":
    1) Using OpenAI JSON mode, it's made exactly for your use case
    and could save you retries
    2) Spinning up an open source LLM on your machine - they work
    really well for tasks like this, especially when coupled with a
    constrained generation tool like Outlines or Guidance. You can
    guarantee adherence to schema and avoid wasting time on "fluff"
    tokens like parentheses or keys, only generating the value
    tokens. It could greatly save costs.

    I've written about it some time ago and tooling has progressed
    since then, but you may want to have a look:
    https://monadical.com/posts/how-to-make-llms-speak-your-language.html

 6. alex sharp

    April 13, 2024 at 2:24 pm

    > Heck, we don't even use system prompts (maybe we should...).

    I'm using 3.5-turbo and 4 for similar-ish use-cases to extract
    json, and various text processing and classification tasks. I
    found both models were much better aligned generally for both the
    classification tasks and the "hey always give me json" (i've
    since moved to function calling, highly recommend) when using the
    system prompts on both model versions. 3.5-turbo especially was a
    big improvement when moving to a system prompt. Hope this helps.
    Cheers.

 7. Phil

    April 13, 2024 at 2:28 pm

    Awesome post. Thanks for the insights and practical takeaways
    (not over engineering things). Super useful article I'll be
    sending to my coworkers.

 8. Phillip Carter

    April 13, 2024 at 2:51 pm

    This was a great post. Genuinely enjoyed another one from someone
    learning real lessons!

    About RAG, I feel like it's certainly useful for mere mortals. My
    company and I are among them. However, several of our features
    boil down to an explicit search + generation process, and so RAG
    is quite useful for us there. It's really simple though. Just a
    bunch of vectors stored in Redis. We don't even use Redis'
    built-in vector search because it was so easy to just write code
    that fetches a group of embeddings (each group is per-user,
    effectively) and run search in memory. It's fast and nowhere near
    the performance bottleneck.

    However, I think people try to apply RAG to very complex problems
    and find that it's a lot harder than just cosine similarity on
    large blobs of text. In our case, that's actually plenty suitable
    for the job we have to do. And further, we find that there's
    often a sweet spot for GPT to sort through what's _actually_
    relevant. Vector search can often yield results that aren't
    actually useful, but if you increase the window big enough (but
    not too big) it often will include what you'd actually want, and
    GPT can figure it out. I wish there was a way where you could get
    a sense for how useful RAG would be without having to test it so
    much, though.
