[HN Gopher] Hard stuff when building products with LLMs
___________________________________________________________________
Hard stuff when building products with LLMs
Author : mavelikara
Score : 144 points
Date : 2023-05-27 18:04 UTC (4 hours ago)
(HTM) web link (www.honeycomb.io)
(TXT) w3m dump (www.honeycomb.io)
| binarymax wrote:
| The first problem, context window size, is going to bite a lot of
| people.
|
| The gotcha is that it's a search problem. The article mentions
| embeddings and dot product, but that's the most basic and naive
| search you can do. Search is information retrieval, and it's a
| huge problem space.
|
| You need a proper retriever that you tune for relevance. You
| should use a search engine for this, and have multiple features
| and do reranking.
|
| That's the only way to crack the context window problem. But the
| good news is that once you do, things get much, much better! You
| can then apply your search/retriever skills on all kinds of other
| problems - because search is really the backbone of all AI.
| darkteflon wrote:
| Such a great comment. The nice thing about this is that if
| search was a key part of your product pre-LLM, you likely
| already have something useful in place that requires very
| little adaptation.
| kristjansson wrote:
| Exactly. LLMs are incredible at information processing, and ok-
| to-terrible at information retrieval. All LLM applications that
| rely on accurate information are either infeasible or kick the
| entire can to the retrieval component.
| azinman2 wrote:
| Context window size will eventually be solved, likely with its
| own trade-offs.
| pmoriarty wrote:
| Anthropic's Claude[1] already has a 100k token context length.
|
| [1] - https://poe.com/Claude-instant-100k
| phillipcarter wrote:
| Yeah, we're definitely learning this. It's actually promising
| how well a very simple cosine similarity pass on data before
| sending it to an LLM can do [0]. But as we're learning, each
| further step towards accuracy gets bigger and bigger, and there
| don't appear to be any turnkey solutions you can pay for right
| now.
|
| [0]: https://twitter.com/_cartermp/status/1657037648400117760
| devjab wrote:
| This seems more like a list of everything everybody is talking
| about while skipping the "how to make a profit" hard stuff.
| abraae wrote:
| The article has done its job if you come away with "honeycomb
| query" ringing in your ears.
| JimtheCoder wrote:
| "while skipping the "how to make a profit" hard stuff."
|
| Some things never change...
| fnordpiglet wrote:
| TL;DR new technologies are full of sharp edges and no blueprint
| for success. Engineering is still hard.
| hobs wrote:
| Engineering will always be hard, but I think a lot of this
| current AI hype cycle doesn't even have a product - it's just
| "well that's cool so I want that."
| dinvlad wrote:
| When "it could be used for anything" really means only that
| they haven't found a market fit and are just a solution in
| search of a problem, as with most venture-backed enterprises.
| fnordpiglet wrote:
| I don't think that's very generous - I think it's "wow that's
| amazing I'd like to find a way to integrate it" which I think
| is perfectly reasonable given it _is_ amazing (even though I
| think it's overestimated in its current form, and
| underestimated due to its current form)
| simonw wrote:
| I think this may be the best thing I've read about real-world
| prompt engineering - there's SO MUCH hard earned knowledge in
| here.
|
| The description of how they're handling the threat of prompt
| injection was particularly smart.
| softfalcon wrote:
| We had a hack-a-thon at my company around using AI tooling with
| respect to our products. The topics mentioned in this article are
| real and come up quickly when trying to make a real product that
| interfaces with an AI-API.
|
| This was so true that there was an obvious chunk of teams in the
| hack-a-thon who didn't even bother doing anything more than a
| fancy version of asking ChatGPT "where should I go for dinner in
| Brooklyn?" or straight up failed to even deliver a concept of a
| product.
|
| Asking a clear question and harvesting accurate results from AI
| prompts is far more difficult than you might think it would be.
| typpo wrote:
| This is a great summary of why productionizing LLMs is hard. I'm
| working on a couple LLM products, including one that's in
| production for >10 million users.
|
| The lack of formal tooling for prompt engineering drives me
| bonkers, and it compounds the problems outlined in the article
| around correctness and chaining.
|
| Then there are the hot takes on Twitter from people claiming
| prompt engineering will soon be obsolete, or people selling blind
| prompts without any quality metrics. It's surprisingly hard to
| get LLMs to do _exactly_ what you want.
|
| I'm building an open-source framework for systematically
| measuring prompt quality [0], inspired by best practices for
| traditional engineering systems.
|
| 0. https://github.com/typpo/promptfoo
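| The general shape of that kind of measurement, sketched by hand
| (this is not promptfoo's actual API; the prompts and checks are
| made up):
|
|   import openai
|
|   PROMPTS = {
|       "v1": "Summarize in one sentence: {input}",
|       "v2": "You are terse. One-sentence summary: {input}",
|   }
|   CASES = [
|       {"input": "The quick brown fox ...", "expect": "fox"},
|   ]
|
|   def run(prompt, case):
|       resp = openai.ChatCompletion.create(
|           model="gpt-3.5-turbo", temperature=0,
|           messages=[{"role": "user",
|                      "content": prompt.format(**case)}])
|       return resp["choices"][0]["message"]["content"]
|
|   # score each prompt variant against the same fixed test cases
|   for name, prompt in PROMPTS.items():
|       passed = sum(c["expect"] in run(prompt, c) for c in CASES)
|       print(f"{name}: {passed}/{len(CASES)} checks passed")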
| darkteflon wrote:
| This looks excellent, thank you - really nails the UI. Going to
| use this this week.
| jmccarthy wrote:
| Very nice, thank you! Will give it a try.
| muglug wrote:
| I predict there will be another six months of these sorts of
| articles, accompanied by a raft of LLM-powered features that
| aren't nearly as transformative as the people currently hyping AI
| are telling us to expect.
|
| The engineers I know whose job it is to implement LLM features
| are _much_ more skeptical about the near future than the
| engineers who are interested in the topic but lack hands-on
| experience.
| dinvlad wrote:
| This will be entertaining (if dangerous) to watch, as people
| hopefully become disillusioned with the overselling and
| overhyping of this not-new-tech-but-wrapped-in-marketing-bs
| wave of 'AI'. But, history shows we rarely learn from past
| mistakes.
|
| I also hope there will be some whistleblower within OpenAI (and
| others like it) that exposes its internal practices and all of
| the hypocrisy surrounding it. And, usually the fish rots from
| the head, as they say.
| version_five wrote:
| Is there something specific your second paragraph refers to?
| I agree with the first one but I don't see a clear basis for
| the second. Do you just hope they're doing something
| untoward? And even if there is some dirt on them, how could
| it possibly relate to the quality of llms overall? Unless
| gpt4 is really just a room full of smart people typing really
| quickly...
| dinvlad wrote:
| When a business lies to and deceives the general public
| from the very beginning about the capabilities of its
| technology, it can only mean they're as unethical on the
| inside as on the outside. If they were truly honest and
| open, they would not be behaving the way they are. These
| two sorts of behaviors are simply incompatible with each
| other.
|
| To be brutally honest, I expected better from Sam, but he
| has lost all credibility in my eyes based on how they chose
| to roll out ChatGPT. I now see that he's even more hawkish,
| daring, and manipulative than Zuckerberg ever was.
| deet wrote:
| I suspect you're right for how people are using and deploying
| LLMs now: hacking all kinds of functionality out of a text-
| completion model that, although it encodes a ton of data and
| some reasoning, is fundamentally still a text-completion model,
| and that, when deployed via today's commercial APIs without
| fine-tuning, is not flexible beyond what prompt engineering,
| chaining, etc. make possible.
|
| But I think we've only scratched the surface as to what LLMs
| fine-tuned on specific tasks, especially for abstract reasoning
| over narrow domains, could do.
|
| These applications possibly won't look anything like the chat
| interfaces that people are getting excited about now, and fine-
| tuning is not as accessible as prompt engineering. But there's
| a whole lot more to explore.
| evrydayhustling wrote:
| The main thing LLMs can do is make products accessible/useful
| to a wider range of users - either by parsing their intents or
| by translating outputs to their needs.
|
| This might result in a sort of transformation that engineers
| and power users aren't geared to appreciate. You might look at
| a natural language log query and say, "that would actually slow
| me down". But if it makes Honeycomb suddenly useful to
| stakeholders who couldn't before, it could lead to use cases
| not on the radar right now.
| phillipcarter wrote:
| I probably could have elaborated more on this in the blog
| post, but you can really distill a lot of Honeycomb's success
| as a business down to a few things:
|
| - How easily can you query stuff when you're interested
|
| - How easily can you get other people on your team to use the
| product too
|
| - How quickly can you narrow down a problem (e.g., during an
| outage) to something you can fix
|
| - How relevant is your alerting (i.e., SLOs) to the success
| or failure of something business critical
|
| Our bet here is that the first two could potentially be
| improved by using LLMs, since we hypothesized (and confirmed
| in some new user interviews) that there's an "expressivity
| gap" in our product. A lot of people who aren't already
| observability experts, but do have some vested interest in
| observability, often know what they want to look for but get
| confused by a UX that's tailored for people who are more
| familiar with these kinds of tools.
|
| It's only been 3 weeks so it's too early to tell, but we're
| seeing _some_ signs that the needle is being moved a bit on
| some key metrics. We're not betting the farm on this stuff
| just yet, and it's really cool that there's technology that
| lets us experiment in this way without having to hire a whole
| ML engineering team.
| evrydayhustling wrote:
| Since you are here, want to say this is one of the most
| useful posts I've seen about pragmatic development on top
| of LLMs!
|
| And agreed re: development effort - compared to other hype
| cycles of AI, it's important for folks to understand that
| the results they see are coming at a fraction of the
| experimental budget.
| tbalsam wrote:
| Am an ML engineer, was around long before LLMs. Definitely am
| skeptical, and I think I know one dimension where we're missing
| performance, as do a number of other people (steerability/
| controllability without losing performance); it's just
| something that's quite hard to do, tbh.
|
| Those who figure out how to do that well will have quite a
| legacy under their belts, and money if they're a profit-making
| company and handle it well.
|
| It's not about whether or not it can be done; it's not hard to
| lay out the math and prove that pretty trivially if you know
| where to look. Actually doing it, in a way where the inductive
| bias translates appropriately to the solution at hand, is the
| hard part.
| dinvlad wrote:
| The problem is, of course, these systems are fundamentally
| incapable of human-level intelligence and cognition.
|
| There will be a lot of wasted effort in pursuit of this
| unreachable goal (with LLM technology), effort better spent
| elsewhere, like curing cancer or tackling climate change, and
| it will steal young and naive people's minds away from those
| problems.
| coffeebeqn wrote:
| I tried building with LLMs but it has the basic problem that
| it's totally wrong 20-50% of the time. That's very meaningful
| for most business cases. It does fine when accuracy isn't
| important, but that's fairly rare outside of writing throwaway
| content.
| herval wrote:
| > The engineers I know whose job it is to implement LLM
| features are much more skeptical about the near future than the
| engineers who are interested in the topic but lack hands-on
| experience.
|
| Isn't that always the case?
| alaskamiller wrote:
| The last chatbot wave (when FB opened up Messenger API and when
| Microsoft slung Skype as a plausible bot platform and when
| Slack rebranded their app store) fizzled out after 18 months.
|
| All to figure out the singular most important thing: chat
| interfaces are the worst.
| znpy wrote:
| > chat interfaces are the worst
|
| Can't agree more. As a user, a chatbot makes me think the
| company has put some kind of dumb parrot in front of me in
| order to avoid giving actual support.
| muspimerol wrote:
| I agree that chat interfaces are not great, but we shouldn't
| reduce LLMs to implementations in chat interfaces. For
| example, the "autocompletion" I get with copilot is a very
| useful tool that I use daily, and I think that sort of UX
| could be built into plenty of other interfaces. Most
| applications where you input some form of text could benefit
| from LLM AI.
| phillipcarter wrote:
| Yes, this exactly. That's why we didn't go with chat for
| our UX here, and for future product areas we likely won't
| either. We already have good UX for our kind of product and
| haven't seen much feedback, or been convinced by some other
| means, that adding chat would help more than it would hurt.
| JieJie wrote:
| One of the weirdest parts of using Bing Chat is that it
| has a tab-to-autocomplete function that is almost always
| wrong about what I want to say. I wish there was an LLM
| that _actually was_ an "autocorrect on steroids" because
| that's honestly one of my most-anticipated features of
| this technology.
|
| Having an LLM spell-checker that would autocorrect my
| spelling as I typed, based on the context of what I was
| typing? That would be magnificent.
| muglug wrote:
| Chatbots can be really useful.
|
| At my org we use a chatbot for pull requests -- you get
| pinged by the bot when the PR is ready to merge, with a
| button in the chat interface that merges the PR -- no need to
| open GitHub and locate the big green button yourself.
|
| That won't 10x your productivity or whatever, but it does
| make it slightly more pleasant.
| cjcenizal wrote:
| That does sound cool, but I'm not sure that's what most
| folks mean by "chatbot". My understanding was that a
| chatbot is an automated chat program that will generate
| responses to your messages, simulating a live human.
| fswd wrote:
| I could add a couple of things from my own experience. Storing
| prompts in a database seemed like a good idea, but in practice it
| ended up being a disaster. Storing the prompt in a
| python/typescript file, up front at the top, works well. The
| OpenAI playground, with its ability to export a prompt, works
| well; something in gradio running in vscode with debugging mode
| works even better. Few-shot with refinements works really well.
| LangChain did not work well for any of my cases; I might go so
| far as to say that using langchain is bad practice.
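| The "prompt at the top of the file" pattern, with a few-shot
| block, looks roughly like this (examples invented):
|
|   import openai
|
|   SYSTEM_PROMPT = (
|       "You turn plain-English requests into shell one-liners. "
|       "Reply with the command only.")
|
|   FEW_SHOT = [
|       {"role": "user", "content": "list files by size"},
|       {"role": "assistant", "content": "ls -lS"},
|       {"role": "user", "content": "count lines in foo.py"},
|       {"role": "assistant", "content": "wc -l foo.py"},
|   ]
|
|   def complete(request):
|       resp = openai.ChatCompletion.create(
|           model="gpt-3.5-turbo",
|           messages=[
|               {"role": "system", "content": SYSTEM_PROMPT},
|               *FEW_SHOT,
|               {"role": "user", "content": request}])
|       return resp["choices"][0]["message"]["content"]
|
| Cut-and-paste to the playground stays trivial because the whole
| prompt lives in one visible constant.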
| phillipcarter wrote:
| It's delightfully hacky, but we actually have our prompt (that
| we parameterize later) stored in a feature flag right now, with
| a few variations! I actually can't believe we shipped with
| that, but hey, it works? Each variation is pulled from a
| specific version in a separate repo where we iterate on the
| prompt.
|
| We're going to likely settle on just storing whatever version
| of the prompt is considered "stable" as a source file, but for
| now this isn't actively hurting us, as far as we can tell, and
| there's a lot of prompt engineering left to do.
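| Illustratively, the parameterization looks something like this
| (the flag client, key, and template here are made up, not our
| actual setup):
|
|   PROMPT_FALLBACK = "Turn this question into a query: {q}"
|
|   def build_prompt(flags, user, question):
|       # the flag payload is the full prompt template; each
|       # variation maps to a version in the prompt repo
|       template = flags.variation(
|           "nlq-prompt-template", user, PROMPT_FALLBACK)
|       return template.format(q=question)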
| ntonozzi wrote:
| https://thedailywtf.com/articles/the-inner-json-effect
| Dwood023 wrote:
| Could you explain how storing prompts was a disaster?
| fswd wrote:
| Prompts should be really easy to cut and paste from an editor
| to a playground. Updating a database is unnecessary friction
| with no real benefit.
| ukuina wrote:
| So much "Yes!" for LangChain being bad practice. An
| unnecessarily bloated abstraction over what should be simple
| API calls.
| yinser wrote:
| Do you have a recommendation of how to easily connect a
| language model to a python repl, apify, bash shell, and
| composable chaining structures if not langchain? I find those
| structures invaluable but am curious where else I could build
| these programs.
| ntonozzi wrote:
| It's great for prototyping and seeing what is possible but
| for running in production you'll likely need to write it
| yourself, and it will just take a few minutes.
| jmccarthy wrote:
| Could be we're in a (short?) interregnum analogous to pre-
| Rails Ruby: there are lots of nascent frameworks, but the
| dominant one hasn't been born yet. FWIW - DIY is working
| well for me.
| darkteflon wrote:
| I'm keeping all our prompts in a json file (along with some
| helpful metadata for us humans).
|
| No idea if I'm doing it right.
| andy99 wrote:
| I'd call all of these things specific cases of some of the
| general problems we've faced with using neural networks for
| years. There's a big gap between demo and product. On one hand
| OpenAI has built a great product; on the other hand, it's not yet
| clear if downstream users will be able to do the same.
|
| http://marble.onl/posts/into-the-great-wide-open-ai.html
| ZephyrBlu wrote:
| I would argue a more appropriate title would be something about
| integrating LLMs into complex products.
|
| A lot of the problems are much more easily solved when you're
| working on a new product from scratch.
|
| Also, I think LLMs work much better with structured data when you
| use them as a selector instead of a generator.
|
| Asking an LLM to generate a structured schema is a bad idea. Ask
| it to pick from a set of pre-defined schemas instead, for
| example.
|
| You're not using LLMs for their schema generating ability, you're
| using them for their intelligence and creativity. Don't make them
| do things they're not good at.
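| A minimal sketch of the selector pattern (the schema names are
| invented):
|
|   import openai
|
|   SCHEMAS = ["time_series_by_service",
|              "error_rate_breakdown",
|              "latency_heatmap"]
|
|   def pick_schema(question):
|       menu = "\n".join(
|           f"{i}: {s}" for i, s in enumerate(SCHEMAS))
|       resp = openai.ChatCompletion.create(
|           model="gpt-3.5-turbo", temperature=0,
|           messages=[{"role": "user", "content":
|               f"Question: {question}\n"
|               f"Pick the most relevant schema from:\n{menu}\n"
|               "Reply with the number only."}])
|       choice = resp["choices"][0]["message"]["content"].strip()
|       if choice.isdigit() and int(choice) < len(SCHEMAS):
|           return SCHEMAS[int(choice)]
|       return None  # off-menu answer: treat as a failure
|
| The model never has to emit valid structure; the worst it can
| do is pick badly or decline to pick.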
| phillipcarter wrote:
| Something we're looking to experiment with is asking the LLM to
| produce pieces of things that we then construct a query from,
| rather than ask it to also assemble it. The hypothesis is that
| it's more likely to produce things we can "work with" that are
| also "interesting" or "useful" to users.
|
| FWIW we have a ~7% failure rate (meaning it fails to
| produce a valid, runnable query) after some work done to
| correct what we consider correctable outputs. Not terrible, but
| we think the above idea could help with that.
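| Sketching the hypothesis (the field names and query shape here
| are invented, not our actual format):
|
|   import json
|
|   ALLOWED_OPS = {"COUNT", "AVG", "P99"}
|
|   def assemble(llm_output):
|       # model emits small pieces, e.g. {"op": "P99",
|       # "column": "duration_ms"}; we build the query around
|       # them and reject anything off-menu as correctable
|       pieces = json.loads(llm_output)
|       if pieces.get("op") not in ALLOWED_OPS:
|           raise ValueError("correctable: unknown op")
|       return {"calculations": [{"op": pieces["op"],
|                                 "column": pieces.get("column")}],
|               "time_range": pieces.get("time_range", 7200)}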
| ZephyrBlu wrote:
| Based on my personal experience I think that's a much better
| approach, so I wish you luck with it.
|
| Maybe somewhat counter-intuitively to how most people view
| LLMs, I strongly believe they're better when you constrain
| them a bit with some guardrails (E.g. pieces of a query, a
| bunch of existing queries, etc).
|
| Happily surprised you guys managed to get it down to only a
| 7% failure rate though! For how temperamental LLMs are and
| the seeming complexity of the task that's impressive.
| phillipcarter wrote:
| > Happily surprised you guys managed to get it down to only
| a 7% failure rate though!
|
| Thanks! It, uhh, was quite a bit higher before we did some
| of that work though, heh. Since we can take a query and
| attempt to run it, we get good errors for anything that's
| ill-specified, and we can track it. Ideally we'd address
| everything with better prompt engineering, but it's
| certainly quicker to just fix stuff up after the fact when
| we know how to.
| Der_Einzige wrote:
| Re: constraints, it turns out that banning tokens in a
| vocabulary is a great way to force models to be creative
| and follow syntactic or semantic constraints without
| errors:
|
| https://github.com/hellisotherpeople/constrained-text-genera...
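| With the hosted OpenAI API, the rough equivalent knob is
| logit_bias, like this (the banned words are arbitrary examples;
| note this over-bans, since any other word sharing those tokens
| gets blocked too):
|
|   import openai
|   import tiktoken
|
|   enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
|   bias = {}
|   for word in [" basically", " delve"]:
|       for tok in enc.encode(word):
|           bias[tok] = -100  # -100 bans the token outright
|
|   resp = openai.ChatCompletion.create(
|       model="gpt-3.5-turbo",
|       logit_bias=bias,
|       messages=[{"role": "user",
|                  "content": "Describe a forest."}])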
| selalipop wrote:
| That's how I built notionsmith.ai
|
| I don't go directly to an LLM asking for structured data, or
| even a final answer, so you can type literally anything into
| the entry field and get a useful result
|
| People are trying to treat them as conversational, but I'd say
| for most products it'll be rare to ever want more than one
| response for a given system prompt, and instead you'll want to
| build a final answer procedurally.
| HalcyonCowboy wrote:
| Just wanted to say that I checked out your app, and it's
| really impressive! When building it, did you bootstrap by
| asking it what developers like me would want out of a site
| like that?
| selalipop wrote:
| It actually came out of my own use of the default ChatGPT
| interface: I was working on an indie game in my spare time
| and using it to spitball new mechanics with personas
|
| But it was really tedious to prompt ChatGPT into being
| properly critical about an idea that doesn't exist: A basic
| "make me a persona" prompt will give you an answer, but if
| you can really break down the generation of the persona
| (i.e. instead of asking for the whole thing, ask who the
| people likely to use X are, what range of incomes they have,
| etc.) you get a much better answer.
|
| The site just automates that process and presents chats
| that are seeded with the result of that process so the LLM
| is more willing to imagine things. For example, if a
| persona complains about a feature, you can hit 'Chat with
| X' and interrogate them about it; instead of running into
| 'As an LLM...', you should get an actual answer.
| jonchurch_ wrote:
| Also wanted to say this is a really cool tool, ty for
| mentioning it.
|
| I fall into the category of developers using LLMs every
| single day, for both answering questions while working, and
| also for more exploratory "bounce ideas off the wall"
| exercises.
|
| Every time I find a new way to explain to the LLM how I want
| us to work together, I feel like I've unlocked new abilities
| and use cases I didn't expect the model to have.
|
| Some examples for those curious:
|
| * I am interested in learning more about X because I want to
| achieve Y. Please give me an overview of concepts that would
| be useful to learn about X to achieve Y. [then after going
| back and forth fleshing out what I'm interested in learning]
| Please create a syllabus for me to begin learning about X
| based on the information you've given me. Provide examples of
| additional materials I can study which already exist, and
| some exercises to test and operationalize my knowledge.
|
| * [I find that the above can often make the model attempt to
| squeeze all the info into a single response, which compresses
| the fidelity of the knowledge and tends towards big shallow
| lists, so I will employ this trick] I want you to go deeper
| into each topic you have listed, one at a time. When I say
| "next" move onto the next topic
|
| * You are my personal coach for X, here is context about the
| problem I want to work on and my goals. This is our first
| coaching session, ask me any questions you need to gather
| more information, but never more than 3 at once. Where should
| we start?
| hyperliner wrote:
| [dead]
| joelm wrote:
| Latency has been the biggest challenge for me.
|
| They cite "two to 15+ seconds" in this blog post for responses.
| Via the OpenAI API I've been seeing more like 45-60 seconds for
| responses (using GPT-3.5-turbo or GPT-4 in chat mode). Note, this
| is using ~3500 tokens total.
|
| I've had to extensively adapt to that latency in the UI of our
| product. Maybe I should start showing funny messages while the
| user is waiting (like I've seen porkbun do when you pay for
| domain names).
| kristjansson wrote:
| If a user is waiting on the response, you basically have to
| stream the result instead of waiting on the entire completion.
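| With the pre-1.0 openai SDK, a minimal sketch is just
| stream=True plus printing the deltas as they arrive:
|
|   import openai
|
|   resp = openai.ChatCompletion.create(
|       model="gpt-3.5-turbo", stream=True,
|       messages=[{"role": "user",
|                  "content": "Explain distributed tracing."}])
|
|   # each chunk carries an incremental delta, not the full text
|   for chunk in resp:
|       delta = chunk["choices"][0]["delta"]
|       print(delta.get("content", ""), end="", flush=True)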
| joelm wrote:
| Yea, that is probably a better solution. Not an easy one to
| refactor into at the moment though.
| phillipcarter wrote:
| Was this in the past week? We had much worse latency this past
| week compared to the rest (in addition to model unavailability
| errors), which we attributed to the Microsoft Build conference.
| One of our customers that uses it a lot is always at the token
| limit and their average latency was ~5 seconds, but that was
| closer to 10 seconds last week.
|
| ...also why we can't wait for other vendors to get SOC I/II
| clearance, and, I guess, to eventually fine-tune our own model,
| so we're not stuck with situations like this.
| joelm wrote:
| I've seen more errors lately, I think, but no, the latency has
| been an issue for months. I think it has grown some over the
| last few months, but not a dramatic change.
| phillipcarter wrote:
| Well poop, hope that gets resolved fast. I guess OpenAI
| can't hire compute platform engineers fast enough!
| commandlinefan wrote:
| It's hard to build a product using humans - I don't know why
| anybody would think using AI would make that part any easier.
___________________________________________________________________
(page generated 2023-05-27 23:01 UTC)