[HN Gopher] Hard stuff when building products with LLMs
       ___________________________________________________________________
        
       Hard stuff when building products with LLMs
        
       Author : mavelikara
       Score  : 144 points
       Date   : 2023-05-27 18:04 UTC (4 hours ago)
        
 (HTM) web link (www.honeycomb.io)
 (TXT) w3m dump (www.honeycomb.io)
        
       | binarymax wrote:
       | The first problem, context window size, is going to bite a lot of
       | people.
       | 
       | The gotcha is that it's a search problem. The article mentions
       | embeddings and dot product, but that's the most basic and naive
       | search you can do. Search is information retrieval, and it's a
       | huge problem space.
       | 
       | You need a proper retriever that you tune for relevance. You
       | should use a search engine for this, and have multiple features
       | and do reranking.
       | 
       | That's the only way to crack the context window problem. But the
        | good news is that once you do, things get much, much better! You
       | can then apply your search/retriever skills on all kinds of other
       | problems - because search is really the backbone of all AI.
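        | 
        | A minimal sketch of that two-stage shape - naive embedding
        | recall first, then a rerank - assuming the OpenAI embeddings
        | endpoint; the helper names and stage-2 signals are illustrative,
        | not a recommendation:
        | 
        |     import numpy as np
        |     import openai  # assumes OPENAI_API_KEY is set
        | 
        |     def embed(texts):
        |         # ada-002 vectors come back unit-normalized, so a
        |         # plain dot product is the cosine similarity.
        |         resp = openai.Embedding.create(
        |             model="text-embedding-ada-002", input=texts)
        |         return np.array([d["embedding"] for d in resp["data"]])
        | 
        |     def retrieve(query, docs, k=20):
        |         # Stage 1: naive recall. (A real system would index
        |         # the doc vectors once, not re-embed per query.)
        |         scores = embed(docs) @ embed([query])[0]
        |         return [docs[i] for i in np.argsort(scores)[::-1][:k]]
        | 
        |     # Stage 2 is where relevance is actually won: rerank the
        |     # top-k with stronger signals (cross-encoder scores, BM25
        |     # field weights, recency, permissions) before filling the
        |     # context window.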
        
         | darkteflon wrote:
         | Such a great comment. The nice thing about this is that if
         | search was a key part of your product pre-LLM, you likely
         | already have something useful in place that requires very
         | little adaptation.
        
         | kristjansson wrote:
         | Exactly. LLMs are incredible at information processing, and ok-
         | to-terrible at information retrieval. All LLM applications that
         | rely on accurate information are either infeasible or kick the
         | entire can to the retrieval component.
        
         | azinman2 wrote:
         | Context window size will eventually be solved, likely with its
          | own trade-offs.
        
         | pmoriarty wrote:
         | Anthropic's Claude[1] already has a 100k token context length.
         | 
         | [1] - https://poe.com/Claude-instant-100k
        
         | phillipcarter wrote:
          | Yeah, we're definitely learning this. It's actually promising
          | how well a very simple cosine similarity pass over the data
          | before sending it to an LLM can do [0]. But as we're learning,
          | each further step towards accuracy is bigger than the last,
          | and there don't appear to be any turnkey solutions you can pay
          | for right now.
         | 
         | [0]: https://twitter.com/_cartermp/status/1657037648400117760
        
       | devjab wrote:
       | This seems more like a list of everything everybody is talking
       | about while skipping the "how to make a profit" hard stuff.
        
         | abraae wrote:
         | The article has done its job if you come away with "honeycomb
         | query" ringing in your ears.
        
         | JimtheCoder wrote:
         | "while skipping the "how to make a profit" hard stuff."
         | 
         | Some things never change...
        
       | fnordpiglet wrote:
       | TL;DR new technologies are full of sharp edges and no blueprint
       | for success. Engineering is still hard.
        
         | hobs wrote:
         | Engineering will always be hard, but I think a lot of this
         | current AI hype cycle doesn't even have a product - its just
         | "well that's cool so I want that."
        
           | dinvlad wrote:
           | When "it could be used for anything" really means only that
           | they haven't found a market fit and are just a solution in
           | search of a problem, as with most venture-baked enterprises.
        
           | fnordpiglet wrote:
            | I don't think that's very generous - I think it's "wow,
            | that's amazing, I'd like to find a way to integrate it",
            | which I think is perfectly reasonable given it _is_ amazing
            | (even though I think it's overestimated in its current form,
            | and underestimated due to its current form).
        
       | simonw wrote:
       | I think this may be the best thing I've read about real-world
       | prompt engineering - there's SO MUCH hard earned knowledge in
       | here.
       | 
       | The description of how they're handling the threat of prompt
       | injection was particularly smart.
        
       | softfalcon wrote:
       | We had a hack-a-thon at my company around using AI tooling with
       | respect to our products. The topics mentioned in this article are
       | real and come up quickly when trying to make a real product that
       | interfaces with an AI-API.
       | 
       | This was so true that there was an obvious chunk of teams in the
       | hack-a-thon who didn't even bother doing anything more than a
       | fancy version of asking ChatGPT "where should I go for dinner in
       | Brooklyn?" or straight up failed to even deliver a concept of a
       | product.
       | 
       | Asking a clear question and harvesting accurate results from AI
       | prompts is far more difficult than you might think it would be.
        
       | typpo wrote:
       | This is a great summary of why productionizing LLMs is hard. I'm
       | working on a couple LLM products, including one that's in
       | production for >10 million users.
       | 
       | The lack of formal tooling for prompt engineering drives me
       | bonkers, and it compounds the problems outlined in the article
       | around correctness and chaining.
       | 
       | Then there are the hot takes on Twitter from people claiming
       | prompt engineering will soon be obsolete, or people selling blind
       | prompts without any quality metrics. It's surprisingly hard to
       | get LLMs to do _exactly_ what you want.
       | 
       | I'm building an open-source framework for systematically
       | measuring prompt quality [0], inspired by best practices for
       | traditional engineering systems.
       | 
       | 0. https://github.com/typpo/promptfoo
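        | 
        | For the curious, the core loop of such an eval is simple to
        | sketch. This is just the shape of the idea, not promptfoo's
        | actual API - prompt variants run against a fixed test set, with
        | cheap deterministic checks on the outputs:
        | 
        |     import openai  # assumes OPENAI_API_KEY is set
        | 
        |     PROMPTS = {  # variants under test (illustrative)
        |         "v1": "Summarize in one sentence: {text}",
        |         "v2": "You are terse. One-sentence summary: {text}",
        |     }
        |     TESTS = [
        |         {"vars": {"text": "The cat sat on the mat."},
        |          "check": lambda out: out.count(".") <= 1},
        |     ]
        | 
        |     def run(prompt, case_vars):
        |         resp = openai.ChatCompletion.create(
        |             model="gpt-3.5-turbo", temperature=0,
        |             messages=[{"role": "user",
        |                        "content": prompt.format(**case_vars)}])
        |         return resp["choices"][0]["message"]["content"]
        | 
        |     for name, prompt in PROMPTS.items():
        |         ok = sum(t["check"](run(prompt, t["vars"]))
        |                  for t in TESTS)
        |         print(f"{name}: {ok}/{len(TESTS)} passed")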
        
         | darkteflon wrote:
         | This looks excellent, thank you - really nails the UI. Going to
         | use this this week.
        
         | jmccarthy wrote:
         | Very nice, thank you! Will give it a try.
        
       | muglug wrote:
       | I predict there will be another six months of these sorts of
       | articles, accompanied by a raft of LLM-powered features that
       | aren't nearly as transformative as the people currently hyping AI
       | are telling us to expect.
       | 
       | The engineers I know whose job it is to implement LLM features
       | are _much_ more skeptical about the near future than the
       | engineers who are interested in the topic but lack hands-on
       | experience.
        
         | dinvlad wrote:
         | This will be entertaining (if dangerous) to watch, as people
         | hopefully become disillusioned with the overselling and
         | overhyping of this not-new-tech-but-wrapped-in-marketing-bs
         | wave of 'AI'. But, history shows we rarely learn from past
         | mistakes.
         | 
         | I also hope there will be some whistleblower within OpenAI (and
         | others like it) that exposes its internal practices and all of
          | the hypocrisy surrounding it. And, usually the fish rots from
         | the head, as they say.
        
           | version_five wrote:
           | Is there something specific your second paragraph refers to?
           | I agree with the first one but I don't see a clear basis for
           | the second. Do you just hope they're doing something
           | untoward? And even if there is some dirt on them, how could
           | it possibly relate to the quality of llms overall? Unless
           | gpt4 is really just a room full of smart people typing really
           | quickly...
        
             | dinvlad wrote:
             | For a business that lies and deceives the general public
             | from the very beginning about the capabilities of their
             | technology, it can only mean they're as unethical on the
             | inside as on the outside. If they were truly honest and
             | open, they would not be behaving the way they are. These
             | two sorts of behaviors are simply incompatible with each
             | other.
             | 
             | To be brutally honest, I expected better from Sam, but he
             | has lost all credibility in my eyes based on how they chose
              | to roll out ChatGPT. I now see that he's even more hawkish
              | and daring and manipulative than Zuckerberg ever was.
        
         | deet wrote:
          | I suspect you're right for how people are using and deploying
          | LLMs now: hacking all kinds of functionality out of a text-
          | completion model that, although it encodes a ton of data and
          | some reasoning, is fundamentally still a text-completion model
          | and, when deployed via commercial APIs as it is today without
          | fine-tuning, isn't flexible beyond what prompt engineering,
          | chaining, etc. make possible.
         | 
         | But I think we've only scratched the surface as to what LLMs
         | fine-tuned on specific tasks, especially for abstract reasoning
         | over narrow domains, could do.
         | 
         | These applications possibly won't look anything like the chat
         | interfaces that people are getting excited about now, and fine-
         | tuning is not as accessible as prompt engineering. But there's
         | a whole lot more to explore.
        
         | evrydayhustling wrote:
         | The main thing LLMs can do is make products accessible/useful
         | to a wider range of users - either by parsing their intents or
         | by translating outputs to their needs.
         | 
         | This might result in a sort of transformation that engineers
         | and power users aren't geared to appreciate. You might look at
         | a natural language log query and say, "that would actually slow
         | me down". But if it makes Honeycomb suddenly useful to
          | stakeholders who couldn't use it before, it could lead to use
          | cases not on the radar right now.
        
           | phillipcarter wrote:
           | I probably could have elaborated more on this in the blog
           | post, but you can really distill a lot of Honeycomb's success
           | as a business down to a few things:
           | 
           | - How easily can you query stuff when you're interested
           | 
           | - How easily can you get other people on your team to use the
           | product too
           | 
           | - How quickly can you narrow down a problem (e.g., during an
           | outage) to something you can fix
           | 
           | - How relevant is your alerting (i.e., SLOs) to the success
           | or failure of something business critical
           | 
           | Our bet here is that the first two could potentially be
           | improved by using LLMs, since we hypothesized (and confirmed
           | in some new user interviews) that there's an "expressivity
           | gap" in our product. A lot of people who aren't already
           | observability experts, but do have some vested interest in
           | observability, often know what they want to look for but get
           | confused by a UX that's tailored for people who are more
           | familiar with these kinds of tools.
           | 
           | It's only been 3 weeks so it's too early to tell, but we're
           | seeing _some_ signs that the needle is being moved a bit on
            | some key metrics. We're not betting the farm on this stuff
           | just yet, and it's really cool that there's technology that
           | lets us experiment in this way without having to hire a whole
           | ML engineering team.
        
             | evrydayhustling wrote:
             | Since you are here, want to say this is one of the most
             | useful posts I've seen about pragmatic development on top
             | of LLMs!
             | 
             | And agreed re: development effort - compared to other hype
             | cycles of AI, it's important for folks to understand that
             | the results they see are coming at a fraction of the
             | experimental budget.
        
         | tbalsam wrote:
          | Am an ML engineer, was around long before LLMs. Definitely am
          | skeptical, and I think I know one dimension where we're
          | missing performance - a number of people do (in regards to
          | steerability/controllability without losing performance) -
          | it's just something that's quite hard to do, tbh. Quite hard
          | to do indeedy.
         | 
         | Those who figure out how to do that well will have quite a
         | legacy under their belts, and money if they're a profit-making
         | company and handle it well.
         | 
          | It's not about whether or not it can be done - it's not hard
          | to lay out the math and prove that pretty trivially if you
          | know where to look. Actually doing it, in a way where the
          | inductive bias translates appropriately to the solution at
          | hand, is the hard part.
        
           | dinvlad wrote:
           | The problem is, of course, these systems are fundamentally
           | incapable of human-level intelligence and cognition.
           | 
           | There will be a lot of wasted effort in pursuit of this
           | unreachable goal (with LLM technology), an effort better
           | spent elsewhere, like solving cancer or climate change, and
           | stealing young and naive people's minds away from these
           | problems.
        
         | coffeebeqn wrote:
          | I tried building with LLMs, but there's the basic problem that
          | they're totally wrong 20-50% of the time. That's very
          | meaningful for most business cases. They do fine when accuracy
          | isn't important, but that's fairly rare outside of writing
          | throwaway content.
        
         | herval wrote:
         | > The engineers I know whose job it is to implement LLM
         | features are much more skeptical about the near future than the
         | engineers who are interested in the topic but lack hands-on
         | experience.
         | 
         | Isn't that always the case?
        
         | alaskamiller wrote:
         | The last chatbot wave (when FB opened up Messenger API and when
         | Microsoft slung SKYPE as a plausible bot platform and when
         | Slack rebranded their app store) fizzled out after 18 months.
         | 
         | All to figure out the singular most important thing: chat
         | interfaces are the worst.
        
           | znpy wrote:
           | > chat interfaces are the worst
           | 
            | Can't agree more. As a user, a chatbot makes me think the
           | company has put some kind of dumb parrot in front of me in
           | order to avoid giving actual support.
        
           | muspimerol wrote:
           | I agree that chat interfaces are not great, but we shouldn't
           | reduce LLMs to implementations in chat interfaces. For
           | example, the "autocompletion" I get with copilot is a very
           | useful tool that I use daily, and I think that sort of UX
           | could be built into plenty of other interfaces. Most
           | applications where you input some form of text could benefit
           | from LLM AI.
        
             | phillipcarter wrote:
             | Yes, this exactly. That's why we didn't go with chat for
             | our UX here, and for future product areas we likely won't
              | either. We already have good UX for our kind of product,
              | and we haven't seen much feedback, or been convinced via
              | some other means, that adding chat would help more than it
              | would hurt.
        
               | JieJie wrote:
               | One of the weirdest parts of using Bing Chat is that it
                | has a tab-to-autocomplete function that is almost always
                | wrong about what I want to say. I wish there was an LLM
                | that _actually was_ an "autocorrect on steroids", because
               | that's honestly one of my most-anticipated features of
               | this technology.
               | 
               | Having an LLM spell-checker that would autocorrect my
               | spelling as I typed, based on the context of what I was
               | typing? That would be magnificent.
        
           | muglug wrote:
           | Chatbots can be really useful.
           | 
           | At my org we use a chatbot for pull requests -- you get
           | pinged by the bot when the PR is ready to merge, with a
           | button in the chat interface that merges the PR -- no need to
           | open GitHub and locate the big green button yourself.
           | 
           | That won't 10x your productivity or whatever, but it does
           | make it slightly more pleasant.
        
             | cjcenizal wrote:
             | That does sound cool, but I'm not sure that's what most
             | folks mean by "chatbot". My understanding was that a
             | chatbot is an automated chat program that will generate
             | responses to your messages, simulating a live human.
        
       | fswd wrote:
        | I could add a couple things from my own experiences. Storing
        | prompts in a database seemed like a good idea, but in practice
        | it ended up being a disaster. Storing the prompt in a
        | python/typescript file, up front at the top, works well. Using
        | the OpenAI playground, with its ability to export a prompt,
        | works well; something in gradio running in vscode with debugging
        | mode works even better. Few-shot with refinements works really
        | well. LangChain did not work well for any of my cases; I might
        | go as boldly as saying that using langchain is bad practice.
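        | 
        | To illustrate the file approach: the prompt, few-shot examples
        | included, is the first thing in the module and copy-pastes
        | cleanly into the playground. The task and names here are made
        | up for illustration:
        | 
        |     # prompts.py
        |     import openai
        | 
        |     PROMPT = """\
        |     You extract the destination city from a support ticket.
        |     Reply with the city name only.
        | 
        |     Ticket: "My order to Oslo never arrived."
        |     City: Oslo
        | 
        |     Ticket: "Refund me, shipping to Lyon was two weeks late."
        |     City: Lyon
        |     """
        | 
        |     def extract_city(ticket):
        |         resp = openai.ChatCompletion.create(
        |             model="gpt-3.5-turbo", temperature=0,
        |             messages=[
        |                 {"role": "system", "content": PROMPT},
        |                 {"role": "user",
        |                  "content": f'Ticket: "{ticket}"\nCity:'}])
        |         return resp["choices"][0]["message"]["content"].strip()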
        
         | phillipcarter wrote:
         | It's delightfully hacky, but we actually have our prompt (that
         | we parameterize later) stored in a feature flag right now, with
         | a few variations! I actually can't believe we shipped with
         | that, but hey, it works? Each variation is pulled from a
         | specific version in a separate repo where we iterate on the
         | prompt.
         | 
         | We're going to likely settle on just storing whatever version
         | of the prompt is considered "stable" as a source file, but for
         | now this isn't actively hurting us, as far as we can tell, and
         | there's a lot of prompt engineering left to do.
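          | 
          | For anyone picturing that, the shape is roughly this - the
          | flag key, templates, and dict-backed "client" are stand-ins
          | for illustration, not our actual code:
          | 
          |     # A real feature-flag client (LaunchDarkly, homegrown,
          |     # ...) would serve these variations per-customer; this
          |     # dict just stands in for it.
          |     FLAG_VARIATIONS = {
          |         "nlq-prompt-experiment": "Newer prompt {dataset} ...",
          |     }
          |     STABLE_PROMPT = "Known-good prompt for {dataset} ..."
          | 
          |     def build_prompt(dataset):
          |         # The template itself is the flag payload, so prompt
          |         # variants ship (and roll back) without a deploy.
          |         template = FLAG_VARIATIONS.get(
          |             "nlq-prompt-experiment", STABLE_PROMPT)
          |         return template.format(dataset=dataset)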
        
           | ntonozzi wrote:
           | https://thedailywtf.com/articles/the-inner-json-effect
        
         | Dwood023 wrote:
         | Could you explain how storing prompts was a disaster?
        
           | fswd wrote:
           | Prompts should be really easy to cut and paste from an editor
           | to a playground. Updating a database is unnecessary friction
           | with no real benefit.
        
         | ukuina wrote:
         | So much "Yes!" for LangChain being bad practice. An
         | unnecessarily bloated abstraction over what should be simple
         | API calls.
        
           | yinser wrote:
           | Do you have a recommendation of how to easily connect a
           | language model to a python repl, apify, bash shell, and
           | composable chaining structures if not langchain? I find those
           | structures invaluable but am curious where else I could build
           | these programs.
        
             | ntonozzi wrote:
              | It's great for prototyping and seeing what is possible,
              | but for running in production you'll likely need to write
              | it yourself - and it will just take a few minutes.
        
             | jmccarthy wrote:
             | Could be we're in a (short?) interregnum analogous to pre-
             | Rails Ruby: there are lots of nascent frameworks, but the
             | dominant one hasn't been born yet. FWIW - DIY is working
             | well for me.
        
         | darkteflon wrote:
         | I'm keeping all our prompts in a json file (along with some
         | helpful metadata for us humans).
         | 
         | No idea if I'm doing it right.
        
       | andy99 wrote:
       | I'd call all of these things specific cases of some of the
       | general problems we've faced with using neural networks for
        | years. There's a big gap between demo and product. On one hand
        | OpenAI has built a great product; on the other hand, it's not yet
       | clear if downstream users will be able to do the same.
       | 
       | http://marble.onl/posts/into-the-great-wide-open-ai.html
        
       | ZephyrBlu wrote:
       | I would argue a more appropriate title would be something about
       | integrating LLMs into complex products.
       | 
       | A lot of the problems are much more easily solved when you're
       | working on a new product from scratch.
       | 
       | Also, I think LLMs work much better with structured data when you
       | use them as a selector instead of a generator.
       | 
       | Asking an LLM to generate a structured schema is a bad idea. Ask
       | it to pick from a set of pre-defined schemas instead, for
       | example.
       | 
       | You're not using LLMs for their schema generating ability, you're
       | using them for their intelligence and creativity. Don't make them
       | do things they're not good at.
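        | 
        | Concretely, the selector version can look like this (schema
        | names, contents, and prompt wording all illustrative) - the
        | model picks a label from a closed set, and anything else falls
        | back to a safe default:
        | 
        |     import openai
        | 
        |     SCHEMAS = {  # pre-defined, hand-written shapes
        |         "time_series": {"calc": "COUNT", "over": "time"},
        |         "top_k": {"calc": "COUNT", "group_by": "?"},
        |     }
        | 
        |     def pick_schema(question):
        |         names = ", ".join(SCHEMAS)
        |         resp = openai.ChatCompletion.create(
        |             model="gpt-3.5-turbo", temperature=0,
        |             messages=[{"role": "user", "content":
        |                        f"Question: {question}\n"
        |                        f"Pick the best fit from: {names}.\n"
        |                        "Answer with one name, nothing else."}])
        |         choice = resp["choices"][0]["message"]["content"].strip()
        |         # The model only selects - it never emits the schema
        |         # itself, so there's nothing malformed to parse.
        |         return SCHEMAS.get(choice, SCHEMAS["time_series"])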
        
         | phillipcarter wrote:
         | Something we're looking to experiment with is asking the LLM to
         | produce pieces of things that we then construct a query from,
         | rather than ask it to also assemble it. The hypothesis is that
         | it's more likely to produce things we can "work with" that are
         | also "interesting" or "useful" to users.
         | 
         | FWIW we have about a ~7% failure rate (meaning it fails to
         | produce a valid, runnable query) after some work done to
         | correct what we consider correctable outputs. Not terrible, but
         | we think the above idea could help with that.
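          | 
          | In sketch form (field names invented for illustration), the
          | idea is that the model proposes pieces, and deterministic code
          | assembles, validates, and where possible repairs them:
          | 
          |     import json
          | 
          |     VALID_CALCS = {"COUNT", "AVG", "P99", "MAX"}
          | 
          |     def assemble_query(llm_output):
          |         # The model proposes pieces; assembly stays in code.
          |         pieces = json.loads(llm_output)
          |         calc = str(pieces.get("calc", "COUNT")).upper()
          |         if calc not in VALID_CALCS:
          |             calc = "COUNT"  # "correctable": fix, don't fail
          |         query = {"calculation": calc,
          |                  "filters": pieces.get("filters", [])}
          |         # Anything still invalid surfaces when we attempt to
          |         # run the query, and counts toward the failure rate.
          |         return query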
        
           | ZephyrBlu wrote:
           | Based on my personal experience I think that's a much better
           | approach, so I wish you luck with it.
           | 
           | Maybe somewhat counter-intuitively to how most people view
           | LLMs, I strongly believe they're better when you constrain
           | them a bit with some guardrails (E.g. pieces of a query, a
           | bunch of existing queries, etc).
           | 
           | Happily surprised you guys managed to get it down to only a
            | 7% failure rate though! For how temperamental LLMs are, and
            | given the seeming complexity of the task, that's impressive.
        
             | phillipcarter wrote:
             | > Happily surprised you guys managed to get it down to only
             | a 7% failure rate though!
             | 
             | Thanks! It, uhh, was quite a bit higher before we did some
             | of that work though, heh. Since we can take a query and
             | attempt to run it, we get good errors for anything that's
             | ill-specified, and we can track it. Ideally we'd address
             | everything with better prompt engineering, but it's
             | certainly quicker to just fix stuff up after the fact when
             | we know how to.
        
             | Der_Einzige wrote:
             | Re: constraints, it turns out that banning tokens in a
             | vocabulary is a great way to force models to be creative
             | and follow syntactic or semantic constraints without
             | errors:
             | 
             | https://github.com/hellisotherpeople/constrained-text-
             | genera...
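              | 
              | With hosted models, the nearest knob is logit bias. A
              | sketch against the OpenAI API, with the banned words
              | chosen arbitrarily:
              | 
              |     import openai
              |     import tiktoken
              | 
              |     enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
              |     # Push the banned words' tokens to effectively
              |     # zero probability; -100 means "never sample".
              |     banned = {t: -100
              |               for w in (" castle", " walls")
              |               for t in enc.encode(w)}
              | 
              |     resp = openai.ChatCompletion.create(
              |         model="gpt-3.5-turbo",
              |         messages=[{"role": "user",
              |                    "content": "Describe a town."}],
              |         logit_bias=banned,
              |     )
              |     print(resp["choices"][0]["message"]["content"])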
        
         | selalipop wrote:
         | That's how I built notionsmith.ai
         | 
         | I don't go directly to an LLM asking for structured data, or
         | even a final answer, so you can type literally anything into
          | the entry field and get a useful result.
         | 
         | People are trying to treat them as conversational, but I'd say
         | for most products it'll be rare to ever want more than one
         | response for a given system prompt, and instead you'll want to
         | build a final answer procedurally.
        
           | HalcyonCowboy wrote:
           | Just wanted to say that I checked out your app, and it's
           | really impressive! When building it, did you bootstrap by
           | asking it what developers like me would want out of a site
           | like that?
        
             | selalipop wrote:
              | It actually came out of my own use of the default ChatGPT
              | interface: I was working on an indie game in my spare time
              | and using it to spitball new mechanics with personas.
              | 
              | But it was really tedious to prompt ChatGPT into being
              | properly critical about an idea that doesn't exist: a basic
              | "make me a persona" prompt will give you an answer, but if
              | you really break down the generation of the persona (i.e.
              | instead of asking for the whole thing, ask who is likely to
              | use X, what range of incomes they have, etc.) you get a
              | much better answer.
              | 
              | The site just automates that process and presents chats
              | that are seeded with the result of that process, so the LLM
              | is more willing to imagine things. For example, if a
              | persona complains about a feature, you can hit 'Chat with
              | X' and interrogate them about it; instead of running into
              | 'As an LLM...', you should get an actual answer.
        
           | jonchurch_ wrote:
           | Also wanted to say this is a really cool tool, ty for
           | mentioning it.
           | 
           | I fall into the category of developers using LLMs every
           | single day, for both answering questions while working, and
           | also for more exploratory "bounce ideas off the wall"
           | exercises.
           | 
            | Every time I find a new way to explain to the LLM how I want
            | us to work together, I feel like I've unlocked new abilities
            | and use cases I didn't expect the model to have.
           | 
           | Some examples for those curious:
           | 
           | * I am interested in learning more about X because I want to
           | achieve Y. Please give me an overview of concepts that would
           | be useful to learn about X to achieve Y. [then after going
            | back and forth fleshing out what I'm interested in learning]
           | Please create a syllabus for me to begin learning about X
           | based on the information you've given me. Provide examples of
           | additional materials I can study which already exist, and
           | some exercises to test and operationalize my knowledge.
           | 
           | * [I find that the above can often make the model attempt to
           | squeeze all the info into a single response, which compresses
           | the fidelity of the knowledge and tends towards big shallow
           | lists, so I will employ this trick] I want you to go deeper
           | into each topic you have listed, one at a time. When I say
           | "next" move onto the next topic
           | 
           | * You are my personal coach for X, here is context about the
           | problem I want to work on and my goals. This is our first
           | coaching session, ask me any questions you need to gather
           | more information, but never more than 3 at once. Where should
           | we start?
        
       | hyperliner wrote:
       | [dead]
        
       | joelm wrote:
       | Latency has been the biggest challenge for me.
       | 
       | They cite "two to 15+ seconds" in this blog post for responses.
       | Via the OpenAI API I've been seeing more like 45-60 seconds for
       | responses (using GPT-3.5-turbo or GPT-4 in chat mode). Note, this
       | is using ~3500 tokens total.
       | 
       | I've had to extensively adapt to that latency in the UI of our
       | product. Maybe I should start showing funny messages while the
       | user is waiting (like I've seen porkbun do when you pay for
       | domain names).
        
         | kristjansson wrote:
         | If a user is waiting on the response, you basically have to
         | stream the result instead of waiting on the entire completion.
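          | 
          | With the OpenAI Python client the API side is a one-flag
          | change; the real work is plumbing partial output through
          | your UI:
          | 
          |     import openai
          | 
          |     resp = openai.ChatCompletion.create(
          |         model="gpt-3.5-turbo",
          |         messages=[{"role": "user",
          |                    "content": "Explain tracing."}],
          |         stream=True,  # chunks arrive as tokens generate
          |     )
          |     for chunk in resp:
          |         delta = chunk["choices"][0]["delta"]
          |         # First tokens land in about a second, so users
          |         # see progress even if the whole answer takes 45s.
          |         print(delta.get("content", ""), end="", flush=True)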
        
           | joelm wrote:
           | Yea, that is probably a better solution. Not an easy one to
           | refactor into at the moment though.
        
         | phillipcarter wrote:
         | Was this in the past week? We had much worse latency this past
         | week compared to the rest (in addition to model unavailability
         | errors), which we attributed to the Microsoft Build conference.
         | One of our customers that uses it a lot is always at the token
         | limit and their average latency was ~5 seconds, but that was
          | closer to 10 seconds last week.
         | 
         | ...also why we can't wait for other vendors to get SOC I/II
          | clearance, and I guess to eventually fine-tune our own model,
          | so we're not stuck with situations like this.
        
           | joelm wrote:
            | I've seen more errors lately, I think, but no, the latency
            | has been an issue for months. I think it has grown some over
            | the last few months, but not dramatically.
        
             | phillipcarter wrote:
             | Well poop, hope that gets resolved fast. I guess OpenAI
             | can't hire compute platform engineers fast enough!
        
       | commandlinefan wrote:
       | It's hard to build a product using humans - I don't know why
       | anybody would think using AI would make that part any easier.
        
       ___________________________________________________________________
       (page generated 2023-05-27 23:01 UTC)