[HN Gopher] What we've learned from a year of building with LLMs
___________________________________________________________________
What we've learned from a year of building with LLMs
Author : ViktorasJucikas
Score : 82 points
Date : 2024-05-31 12:33 UTC (1 day ago)
(HTM) web link (eugeneyan.com)
(TXT) w3m dump (eugeneyan.com)
| solidasparagus wrote:
| No offense, but I'd love to see what they've successfully built
| using LLMs before taking their advice too seriously. The idea
| that fine-tuning isn't even a consideration (perhaps even
| something they think is absolutely incorrect, if the section
| titles of the unfinished section are anything to go by) is very
| strange to me and suggests a pretty narrow perspective, IMO.
| gandalfgeek wrote:
| This was kind of conventional wisdom ("fine tune only when
| absolutely necessary for your domain", "fine-tuning hurts
| factuality"), but some recent research (some of which they
| cite) has actually quantitatively shown that RAG is much
| preferable to FT for adding domain-specific knowledge to an
| LLM:
|
| - "Does Fine-Tuning LLMs on New Knowledge Encourage
| Hallucinations?" https://arxiv.org/abs/2405.05904
|
| - "Fine-Tuning or Retrieval? Comparing Knowledge Injection in
| LLMs" https://arxiv.org/abs/2312.05934
| solidasparagus wrote:
| Thanks, I'll read those more fully.
|
| But "knowledge injection" is still pretty narrow to me.
| Here's an example of a very simple but extremely valuable
| use case - taking a model that was trained on language+code
| and finetuning it on a text-to-DSL task, where the DSL is a
| custom one you created (and thus isn't in the training data).
| I would consider that close to infeasible if your only tool
| is a RAG hammer, but it's a very powerful way to leverage
| LLMs.
| CuriouslyC wrote:
| Fine tuning has been on the way out for a while. It's hard to
| do right and costly. LoRAs are better for influencing output
| style as they don't dumb down the model, and they're easier to
| create. This is on top of RAG just being better for new facts
| like the other reply mentioned.
| solidasparagus wrote:
| How much of that is just the flood of traditional engineers
| into the space and the fact that collecting data and then
| fine-tuning models is orders of magnitude more complex than
| just throwing in RAG? I suspect a huge amount of RAG's
| popularity is just that any engineer can do a version of it +
| ChatGPT API calls in a day.
|
| As for LoRA - in the context of my comment, that's just
| splitting hairs IMO. It falls in the category of fine-tuning
| for me, although I understand why you might disagree. But
| it's not like the article mentions LoRA either, nor am I
| aware of people doing LoRA without GPUs, which the article is
| against ("No GPUs before PMF").
| OutOfHere wrote:
| Fine-tuning is absolutely necessary for true AI, but even though
| it's desirable, it's currently unfeasible for any large
| model considering how expensive GPUs are. If I had infinite
| money, I'd throw it at continuous fine-tuning and would throw
| away the RAG. Fine-tuning also requires appropriate measures to
| prevent forgetting of older concepts.
| solidasparagus wrote:
| It is not unfeasible. It is absolutely realistic to do
| distributed finetuning of an 8B text model on previous
| generation hardware. You can add finetuning to your set of
| options for about the cost of one FTE - up to you whether
| that tradeoff is worth it, but in many places it is. The
| expertise to pull it off is expensive, but to get a mid-level
| AI SME capable of helping a company adopt finetuning, you are
| only going to pay about the equivalent of 1-3 senior
| engineers.
|
| Expensive? Sure, all of AI is crazy expensive. Unfeasible? No
| OutOfHere wrote:
| I don't consider a small 8B model to be worth fine-tuning.
| Fine-tuning is worthwhile when you have a larger model with
| capacity to add data, perhaps one that can even grow its
| layers with the data. In contrast, fine-tuning a small
| saturated model will easily cause it to forget older
| information.
|
| All things considered, in relative terms, as much as I
| think fine-tuning would be nice, it will remain
| significantly more expensive than just making RAG or search
| calls. I say this while being a fan of fine-tuning.
| solidasparagus wrote:
| > I don't consider a small 8B model to be worth fine-
| tuning.
|
| Going to have to disagree with you on that one. A modern
| 8B model that has been trained on enough tokens is
| ridiculously powerful.
| OutOfHere wrote:
| A well-trained 8B model will already be over-saturated
| with information from the start. It will therefore easily
| forget much old information when fine-tuned on new
| material. It just doesn't have the capacity to take in
| much more information.
|
| Don't get me wrong. I think a 70B or larger model would
| be worth fine-tuning, especially if it can be grown
| further with more layers.
| solidasparagus wrote:
| > A well-trained 8B model will already be over-saturated
| with information from the start
|
| Any evidence of that that I can look at? This doesn't
| match what I've seen nor have I heard this from the
| world-class researchers I have worked with. Would be
| interested to learn more.
| lmeyerov wrote:
| We work in some pretty serious domains and try to stay away
| from fine tuning:
|
| - Most of our accuracy ROI is from agentic loops over top
| models, and dynamic RAG example injection goes far enough here
| that the relative lift of adding fine-tuning isn't worth the
| many costs
|
| - A lot of fine-tuning is for OSS models that do worse than
| agentic loops over the proprietary GPT4/Opus3
|
| - For distribution, it's a lot easier to deploy for pluggable
| top APIs without requiring fine-tuning, e.g., "connect to your
| gpt4/opus3 + for dumber-but-bigger tasks, groq"
|
| - The resources we could put into fine-tuning are better spent
| on RAG, agentic loops, prompts/evals, etc
|
| We do use tuned smaller, dumber models, such as for a coarse
| relevancy filter in a firehose pipeline... but these are
| outliers. Likewise, we expect to be using them more... but
| again, for rarer cases and only after we've exhausted other
| stuff. I'm guessing as we do more fine-tuning, it'll be more on
| embeddings than LLMs, at least until OSS models get a lot
| better.
| solidasparagus wrote:
| See, if the article said this, I would have agreed - fine-
| tuning is a tool and it should be used thoughtfully. Although
| I personally believe that in this funding climate it makes
| sense to make data collection and model training a core
| capability of any AI product. However that will only be
| available and wise for some founders.
| OutOfHere wrote:
| Almost all of this should flow from common sense. I would use
| what makes sense for your application and not worry about the
| rest. It's a toolbox, not a rulebook. The one point that comes
| more from experience than from common sense is to always pin
| your model versions. As a final tip, if despite trying everything, you
| still don't like the LLM's output, just run it again!
|
| Here is a summary of all points:
|
      1. Focus on Prompting Techniques:
         1.1. Start with n-shot prompts to provide examples
              demonstrating tasks.
         1.2. Use Chain-of-Thought (CoT) prompting for complex tasks,
              making instructions specific.
         1.3. Incorporate relevant resources via Retrieval Augmented
              Generation (RAG).

      2. Structure Inputs and Outputs:
         2.1. Format inputs using serialization methods like XML,
              JSON, or Markdown.
         2.2. Ensure outputs are structured to integrate seamlessly
              with downstream systems.

      3. Simplify Prompts:
         3.1. Break down complex prompts into smaller, focused ones.
         3.2. Iterate and evaluate each prompt individually for
              better performance.

      4. Optimize Context Tokens:
         4.1. Minimize redundant or irrelevant context in prompts.
         4.2. Structure the context clearly to emphasize
              relationships between parts.

      5. Leverage Information Retrieval/RAG:
         5.1. Use RAG to provide the LLM with knowledge to improve
              output.
         5.2. Ensure retrieved documents are relevant, dense, and
              detailed.
         5.3. Utilize hybrid search methods combining keyword and
              embedding-based retrieval.

      6. Workflow Optimization:
         6.1. Decompose tasks into multi-step workflows for better
              accuracy.
         6.2. Prioritize deterministic execution for reliability and
              predictability.
         6.3. Use caching to save costs and reduce latency.

      7. Evaluation and Monitoring:
         7.1. Create assertion-based unit tests using real
              input/output samples (a minimal sketch follows this
              list).
         7.2. Use LLM-as-Judge for pairwise comparisons to evaluate
              outputs.
         7.3. Regularly review LLM inputs and outputs for new
              patterns or issues.

      8. Address Hallucinations and Guardrails:
         8.1. Combine prompt engineering with factual inconsistency
              guardrails.
         8.2. Use content moderation APIs and PII detection packages
              to filter outputs.

      9. Operational Practices:
         9.1. Regularly check for development-prod data skew.
         9.2. Ensure data logging and review input/output samples
              daily.
         9.3. Pin specific model versions to maintain consistency
              and avoid unexpected changes.

      10. Team and Roles:
          10.1. Educate and empower all team members to use AI
                technology.
          10.2. Include designers early in the process to improve
                user experience and reframe user needs.
          10.3. Ensure the right progression of roles and hire based
                on the specific phase of the project.

      11. Risk Management:
          11.1. Calibrate risk tolerance based on the use case and
                audience.
          11.2. Focus on internal applications first to manage risk
                and gain confidence before expanding to
                customer-facing use cases.
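
| As a concrete illustration of points 2.2, 7.1, and 9.3, here is a
| minimal Python sketch (the OpenAI client, model snapshot, and
| schema are only placeholders for whatever stack you use):

      import json
      from openai import OpenAI  # placeholder; any LLM client works the same way

      client = OpenAI()

      def extract_invoice(text: str) -> dict:
          """Ask the model for structured JSON so downstream code can consume it."""
          resp = client.chat.completions.create(
              model="gpt-4o-2024-05-13",  # pin an exact snapshot (point 9.3)
              response_format={"type": "json_object"},
              messages=[
                  {"role": "system",
                   "content": "Return JSON with keys: vendor, total, currency."},
                  {"role": "user", "content": text},
              ],
          )
          return json.loads(resp.choices[0].message.content)

      # Assertion-based unit test on a real input/output sample (point 7.1).
      def test_extract_invoice():
          out = extract_invoice("Invoice from Acme Corp. Total due: USD 1,200.00")
          assert set(out) >= {"vendor", "total", "currency"}
          assert out["currency"] == "USD"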
| felixbraun wrote:
| related discussion (3 days ago):
| https://news.ycombinator.com/item?id=40508390
| DylanSp wrote:
| Looks like the same content that was posted on oreilly.com a
| couple days ago, just on a separate site. That has some existing
| discussion: https://news.ycombinator.com/item?id=40508390.
| Multicomp wrote:
| Anyone have a convenience solution for doing multi-step
| workflows? For example, I'm filling out the basics of an NPC
| character sheet for my game prep. I'm using a certain rule system
| and give the enemy certain tactics, stats, and types of weapons.
| Right now I have a 'god prompt' trying to walk the LLM
| through creating the basic character sheet, but the responses get
| squeezed down into what one or two prompt responses can hold.
|
| If I can do node-red or a function chain for prompts and outputs,
| that would be sweet.
| CuriouslyC wrote:
| You can do multi-shot workflows pretty easily. I like to have the
| model produce markdown, then add code blocks (```json/yaml```)
| to extract the interim results. You can lay out multiple
| "phases" in your prompt and have it perform each one in turn,
| and have each one reference prior phases. Then at the end you
| just pull out the code blocks for each phase and you have your
| structured result.
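|
| A rough sketch of pulling those interim results back out in Python
| (the regex, phase names, and payloads are just illustrative):

      import json, re

      def extract_blocks(markdown: str, lang: str = "json") -> list[dict]:
          """Collect every ```json ... ``` block from the model's markdown reply."""
          pattern = rf"```{lang}\s*\n(.*?)```"
          return [json.loads(block) for block in re.findall(pattern, markdown, re.DOTALL)]

      # One reply that worked through two "phases", each ending in a JSON block:
      reply = """
      ## Phase 1: stats
      ```json
      {"hp": 42, "ac": 15}
      ```
      ## Phase 2: tactics
      ```json
      {"opening_move": "smoke bomb"}
      ```
      """
      stats, tactics = extract_blocks(reply)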
| mentos wrote:
| I still haven't played with using one LLM to oversee another.
|
| "You are in charge of game prep and must work with an LLM over
| many prompts to..."
| hugocbp wrote:
| For me, a very simple "break tasks down into a queue and store
| them in a DB" solution has helped tremendously with most requests.
|
| Instead of trying to do everything into a single chat or chain,
| add steps to ask the LLM to break down the next tasks, with
| context, and store that into SQLite or something. Then start
| new chats/chains on each of those tasks.
|
| Then just loop them back into the LLM.
|
| I find that long chats or chains just confuse most models and
| we start seeing gibberish.
|
| Right now I'm favoring something like:
|
| "We're going to do task {task}. The current situation and
| context is {context}.
|
| Break down what individual steps we need to perform to achieve
| {goal} and output these steps with their necessary context as
| {standard_task_json}. If the output is already enough to
| satisfy {goal}, just output the result as text."
|
| I find that leaving everything to the LLM in a sequence is not as
| effective as using the LLM to break things down and having a DB and
| code logic to support the development of more complex outcomes.
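|
| A bare-bones sketch of that pattern (the call_llm helper, table
| layout, and prompts are assumptions, not any particular library):

      import json, sqlite3
      from openai import OpenAI  # placeholder; any chat API works

      client = OpenAI()

      def call_llm(prompt: str) -> str:
          resp = client.chat.completions.create(
              model="gpt-4o", messages=[{"role": "user", "content": prompt}])
          return resp.choices[0].message.content

      goal = "stat out a bandit captain NPC"
      db = sqlite3.connect("tasks.db")
      db.execute("CREATE TABLE IF NOT EXISTS tasks "
                 "(id INTEGER PRIMARY KEY, context TEXT, status TEXT, result TEXT)")

      # 1. Have the LLM break the goal into tasks and queue them in the DB.
      plan = call_llm(f"Break down the steps needed to achieve: {goal}. "
                      "Output a JSON list of objects with 'task' and 'context' keys.")
      for step in json.loads(plan):  # assumes bare JSON; add parsing/retries as needed
          db.execute("INSERT INTO tasks (context, status) VALUES (?, 'pending')",
                     (json.dumps(step),))
      db.commit()

      # 2. Work through the queue, starting a fresh chat for each task.
      for task_id, context in db.execute(
              "SELECT id, context FROM tasks WHERE status = 'pending'").fetchall():
          result = call_llm(f"We're going to do this task: {context}. "
                            f"The overall goal is: {goal}. Produce the output.")
          db.execute("UPDATE tasks SET status = 'done', result = ? WHERE id = ?",
                     (result, task_id))
      db.commit()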
| gpsx wrote:
| One option for doing this is to incrementally build up the
| "document" using isolated prompts for each section. I say
| document because I am not exactly sure what the character sheet
| looks like, but I am assuming it can be constructed one section
| at a time. You create a prompt to create the first section.
| Then, you create a second prompt that gives the agent your
| existing document and prompts it to create the next section.
| You continue until all the sections are finished. In some cases
| this works better than doing a single conversation.
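|
| Something like this, for instance (section names, prompts, and the
| OpenAI call are stand-ins for whatever model and sheet format you
| use):

      from openai import OpenAI  # stand-in; any chat API works

      client = OpenAI()
      SECTIONS = ["Stats", "Weapons", "Tactics"]

      def write_section(sheet_so_far: str, section: str) -> str:
          resp = client.chat.completions.create(
              model="gpt-4o",
              messages=[
                  {"role": "system",
                   "content": "You fill out an NPC character sheet one section at a time."},
                  {"role": "user",
                   "content": f"Sheet so far:\n{sheet_so_far}\n\n"
                              f"Write only the '{section}' section."},
              ],
          )
          return resp.choices[0].message.content

      # Build the document incrementally, feeding each prompt everything written so far.
      sheet = ""
      for section in SECTIONS:
          sheet += f"\n## {section}\n" + write_section(sheet, section)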
| proc0 wrote:
| Sounds like you need an agent system, some libs are mentioned
| here: https://lilianweng.github.io/posts/2023-06-23-agent/
| dbs wrote:
| Show me the use cases you have supported in production. Then I
| might read all 30 pages praising the dozens (soon to be
| hundreds?) of "best practices" for building with LLMs.
| joe_the_user wrote:
| I have a friend who uses ChatGPT for writing quick policy
| statements for her clients (mostly schools). I have a friend who
| uses it to create images and descriptions for DnD adventures.
| LLMs have uses.
|
| The problem I see is: how can an "application" be anything but
| a little window onto the base abilities of ChatGPT, and so
| effectively offer nothing _more_ to an end-user? The final
| result still has to be checked, and regular end-users have to
| do their own prompting.
|
| Edit: Also, I should also say that anyone who's designing LLM
| apps that, rather than being end-user tools, are effectively
| gatekeepers to getting action or "a human" from a company
| deserves a big "f* you" 'cause that approach is evil.
| harrisoned wrote:
| It certainly has use cases, just not as many as the hype leads
| people to believe. For me:
|
| -Regex expressions: ChatGPT is the best multi-million regex
| parser to date.
|
| -Grammar and semantic check: It's a very good revision tool,
| helped me a lot of times, specially when writing in non-native
| languages.
|
| -Artwork inspiration: Not only for visual inspiration, in the
| case of image generators, but descriptive as well. The
| verbosity of some LLMs can help describe things in more detail
| than a person would.
|
| -General coding: While your mileage may vary on that one, it
| has helped me a lot at work building stuff in languages I'm not
| very familiar with. Just snippets, nothing big.
| thallium205 wrote:
| We have a company mail, fax, and phone room that receives
| thousands of pages a day; LLMs now sort, categorize, and
| extract useful information from them all in a completely
| automated way. Several FTEs have been reassigned
| elsewhere as a result.
| robbiemitchell wrote:
| Processing high volumes of unstructured data (text)... we're
| using a STAG architecture.
|
| - Generate targeted LLM micro summaries of every record
| (ticket, call, etc.) continually
|
| - Use layers of regex, semantic embeddings, and scoring
| enrichments to identify report rows (pivots on aggregates)
| worth attention, running on a schedule
|
| - Proactively explain each report row by identifying what's
| unusual about it and LLM summarizing a subset of the
| microsummaries.
|
| - Push the result to webhook
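|
| Roughly, in Python, the shape is something like this (the function
| names, models, and the scoring heuristic here are just
| placeholders):

      import requests
      from openai import OpenAI  # placeholder for whatever models run in the background

      client = OpenAI()

      def micro_summary(record: str) -> str:
          """One targeted micro-summary per ticket/call, generated continually."""
          return client.chat.completions.create(
              model="gpt-3.5-turbo",  # cheap model for the high-volume step
              messages=[{"role": "user", "content": f"Summarize in one sentence:\n{record}"}],
          ).choices[0].message.content

      def rows_worth_attention(summaries: list[str]) -> list[int]:
          # Placeholder for the regex / embedding / scoring enrichments that
          # pick out report rows worth attention.
          return [i for i, s in enumerate(summaries) if "refund" in s.lower()]

      def explain_row(row_summaries: list[str]) -> str:
          """LLM-summarize a subset of the micro-summaries to explain what's unusual."""
          return client.chat.completions.create(
              model="gpt-4o",
              messages=[{"role": "user",
                         "content": "Explain what is unusual about these records:\n"
                                    + "\n".join(row_summaries)}],
          ).choices[0].message.content

      def push_to_webhook(explanation: str) -> None:
          requests.post("https://example.invalid/webhook", json={"summary": explanation})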
|
| Lack of JSON schema restriction is a significant barrier to
| entry on hooking LLMs up to a multi step process.
|
| Another is preventing LLMs from adding intro or conclusion
| text.
| BoorishBears wrote:
| > Lack of JSON schema restriction is a significant barrier to
| entry on hooking LLMs up to a multi step process.
|
| How are you struggling with this, let alone as a significant
| barrier? JSON adherence with a well-thought-out schema hasn't
| been a worry in a while, between improved model performance and
| various grammar-based constraint systems.
|
| > Another is preventing LLMs from adding intro or conclusion
| text.
|
| Also trivial to work around by pre-filling and stop tokens,
| or just extremely basic text parsing.
|
| Also would recommend writing out Stream-Triggered Augmented
| Generation since the term is so barely used it might as well
| be made up from the POV of someone trying to understand the
| comment
| robbiemitchell wrote:
| Asking even a top-notch LLM to output well-formed JSON
| simply fails sometimes. And when you're running LLMs in the
| background, you can't use the best LLM available.
|
| You work around it with post-processing and retries. But
| it's still a bit brittle given how much stuff happens
| downstream without supervision.
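|
| One workable sketch of the retry loop (the parsing and the retry
| prompt are just placeholders for whatever fits the pipeline):

      import json
      from openai import OpenAI  # stand-in for whichever background model you run

      client = OpenAI()

      def get_json(prompt: str, retries: int = 3) -> dict:
          """Ask for JSON, validate it, and feed the parse error back in on failure."""
          messages = [{"role": "user", "content": prompt + "\nRespond with JSON only."}]
          for _ in range(retries):
              reply = client.chat.completions.create(
                  model="gpt-3.5-turbo", messages=messages).choices[0].message.content
              cleaned = reply.strip().removeprefix("```json").removesuffix("```")
              try:
                  return json.loads(cleaned)  # post-processing + validation
              except json.JSONDecodeError as err:
                  messages += [{"role": "assistant", "content": reply},
                               {"role": "user",
                                "content": f"That was not valid JSON ({err}). Try again."}]
          raise ValueError("model never produced valid JSON")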
___________________________________________________________________
(page generated 2024-06-01 23:00 UTC)