[HN Gopher] What we've learned from a year of building with LLMs
       ___________________________________________________________________
        
       What we've learned from a year of building with LLMs
        
       Author : ViktorasJucikas
       Score  : 82 points
       Date   : 2024-05-31 12:33 UTC (1 day ago)
        
 (HTM) web link (eugeneyan.com)
 (TXT) w3m dump (eugeneyan.com)
        
       | solidasparagus wrote:
       | No offense, but I'd love to see what they've successfully built
       | using LLMs before taking their advice too seriously. The idea
       | that fine-tuning isn't even a consideration (perhaps even
       | something they think is flatly wrong, if the section titles of
       | the unfinished section are anything to go by) is very strange
       | to me and suggests a pretty narrow perspective IMO
        
         | gandalfgeek wrote:
         | This was kind of conventional wisdom ("fine tune only when
         | absolutely necessary for your domain", "fine-tuning hurts
         | factuality"), but some recent research (some of which they
         | cite) has actually quantitatively shown that RAG is much
         | preferable to FT for adding domain-specific knowledge to an
         | LLM:
         | 
         | - "Does Fine-Tuning LLMs on New Knowledge Encourage
         | Hallucinations?" https://arxiv.org/abs/2405.05904
         | 
         | - "Fine-Tuning or Retrieval? Comparing Knowledge Injection in
         | LLMs" https://arxiv.org/abs/2312.05934
        
           | solidasparagus wrote:
           | Thanks, I'll read those more fully.
           | 
           | But "knowledge injection" is still pretty narrow to me.
           | Here's an example of a very simple but extremely valuable
           | usecase - taking a model that was trained on language+code
           | and finetuning it on a text-to-DSL task, where the DSL is a
           | custom one you created (and thus isn't in the training data).
           | I would consider that close to infeasible if your only tool
           | is a RAG hammer, but it's a very powerful way to leverage
           | LLMs.
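The text-to-DSL case above comes down to assembling supervised pairs. A minimal sketch of what such a fine-tuning dataset might look like (the DSL syntax, the example pairs, and the file name are all hypothetical; the JSONL prompt/completion layout is just one common convention):

```python
import json

# Hypothetical examples pairing natural-language requests with a custom DSL.
# The "filter ... | sort ..." syntax is invented here for illustration.
examples = [
    {"prompt": "show active users sorted by signup date",
     "completion": "filter status == 'active' | sort signup_date asc"},
    {"prompt": "top 10 orders by total, largest first",
     "completion": "sort total desc | limit 10"},
]

# Most fine-tuning APIs accept one JSON object per line (JSONL).
with open("text_to_dsl.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

The interesting work is in collecting and cleaning the pairs; the file format itself is trivial.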
        
         | CuriouslyC wrote:
         | Fine-tuning has been on the way out for a while. It's hard
         | to do right and costly. LoRAs are better for influencing
         | output style, as they don't dumb down the model, and easier to
         | create. This is on top of RAG just being better for new facts
         | like the other reply mentioned.
        
           | solidasparagus wrote:
           | How much of that is just the flood of traditional engineers
           | into the space and the fact that collecting data and then
           | fine-tuning models is orders of magnitude more complex than
           | just throwing in RAG? I suspect a huge amount of RAG's
           | popularity is just that any engineer can do a version of it +
           | ChatGPT API calls in a day.
           | 
           | As for LoRA - in the context of my comment, that's just
           | splitting hairs IMO. It falls in the category of
           | fine-tuning for me, although I understand why you might
           | disagree. But it's not like the article mentions LoRA
           | either, nor am I aware of people doing LoRA without GPUs,
           | which the article is against ("No GPUs before PMF").
        
         | OutOfHere wrote:
         | Fine-tuning is absolutely necessary for true AI, but even
         | though it's desirable, it's unfeasible for now for any large
         | model, considering how expensive GPUs are. If I had infinite
         | money, I'd throw it at continuous fine-tuning and would throw
         | away the RAG. Fine-tuning also requires appropriate measures
         | to prevent forgetting of older concepts.
        
           | solidasparagus wrote:
           | It is not unfeasible. It is absolutely realistic to do
           | distributed finetuning of an 8B text model on previous
           | generation hardware. You can add finetuning to your set of
           | options for about the cost of one FTE - up to you whether
           | that tradeoff is worth it, but in many places it is. The
           | expertise to pull it off is expensive, but to get a mid-level
           | AI SME capable of helping a company adopt finetuning, you are
           | only going to pay about the equivalent of 1-3 senior
           | engineers.
           | 
           | Expensive? Sure, all of AI is crazy expensive. Unfeasible? No
        
             | OutOfHere wrote:
             | I don't consider a small 8B model to be worth fine-tuning.
             | Fine-tuning is worthwhile when you have a larger model with
             | capacity to add data, perhaps one that can even grow its
             | layers with the data. In contrast, fine-tuning a small
             | saturated model will easily cause it to forget older
             | information.
             | 
             | All things considered, in relative terms, as much as I
             | think fine-tuning would be nice, it will remain
             | significantly more expensive than just making RAG or search
             | calls. I say this while being a fan of fine-tuning.
        
               | solidasparagus wrote:
               | > I don't consider a small 8B model to be worth fine-
               | tuning.
               | 
               | Going to have to disagree with you on that one. A modern
               | 8B model that has been trained on enough tokens is
               | ridiculously powerful.
        
               | OutOfHere wrote:
               | A well-trained 8B model will already be over-saturated
               | with information from the start. It will therefore
               | easily forget much old information when fine-tuned on
               | new material. It just doesn't have the capacity to take
               | in much more information.
               | 
               | Don't get me wrong. I think a 70B or larger model would
               | be worth fine-tuning, especially if it can be grown
               | further with more layers.
        
               | solidasparagus wrote:
               | > A well-trained 8B model will already be over-saturated
               | with information from the start
               | 
               | Any evidence of that I can look at? This doesn't
               | match what I've seen nor have I heard this from the
               | world-class researchers I have worked with. Would be
               | interested to learn more.
        
         | lmeyerov wrote:
         | We work in some pretty serious domains and try to stay away
         | from fine tuning:
         | 
         | - Most of our accuracy ROI is from agentic loops over top
         | models, and dynamic RAG example injection goes so far here
         | that the relative lift from adding fine-tuning isn't worth
         | the many costs
         | 
         | - A lot of fine-tuning is for OSS models that do worse than
         | agentic loops over the proprietary GPT4/Opus3
         | 
         | - For distribution, it's a lot easier to deploy for pluggable
         | top APIs without requiring fine-tuning, e.g., "connect to your
         | gpt4/opus3 + for dumber-but-bigger tasks, groq"
         | 
         | - The resources we could put into fine-tuning are better spent
         | on RAG, agentic loops, prompts/evals, etc
         | 
         | We do use tuned smaller, dumber models, e.g., as part of a
         | coarse relevancy filter in a firehose pipeline... but these
         | are outliers. Likewise, we expect to be using them more... but
         | again, for rarer cases and only after we've exhausted other
         | stuff. I'm guessing as we do more fine-tuning, it'll be more on
         | embeddings than LLMs, at least until OSS models get a lot
         | better.
        
           | solidasparagus wrote:
           | See, if the article said this, I would have agreed: fine-
           | tuning is a tool and should be used thoughtfully. Although
           | I personally believe that in this funding climate it makes
           | sense to make data collection and model training a core
           | capability of any AI product. However that will only be
           | available and wise for some founders.
        
       | OutOfHere wrote:
       | Almost all of this should flow from common sense. I would use
       | what makes sense for your application, and not worry about the
       | rest. It's a toolbox, not a rulebook. The one point that comes
       | more from experience than from common sense is to always pin
       | your model versions. As a final tip, if despite trying
       | everything you still don't like the LLM's output, just run it
       | again!
       | 
       | Here is a summary of all points:
       | 
       | 1. Focus on Prompting Techniques:
       |    1.1. Start with n-shot prompts to provide examples demonstrating tasks.
       |    1.2. Use Chain-of-Thought (CoT) prompting for complex tasks, making instructions specific.
       |    1.3. Incorporate relevant resources via Retrieval Augmented Generation (RAG).
       | 
       | 2. Structure Inputs and Outputs:
       |    2.1. Format inputs using serialization methods like XML, JSON, or Markdown.
       |    2.2. Ensure outputs are structured to integrate seamlessly with downstream systems.
       | 
       | 3. Simplify Prompts:
       |    3.1. Break down complex prompts into smaller, focused ones.
       |    3.2. Iterate and evaluate each prompt individually for better performance.
       | 
       | 4. Optimize Context Tokens:
       |    4.1. Minimize redundant or irrelevant context in prompts.
       |    4.2. Structure the context clearly to emphasize relationships between parts.
       | 
       | 5. Leverage Information Retrieval/RAG:
       |    5.1. Use RAG to provide the LLM with knowledge to improve output.
       |    5.2. Ensure retrieved documents are relevant, dense, and detailed.
       |    5.3. Utilize hybrid search methods combining keyword and embedding-based retrieval.
       | 
       | 6. Workflow Optimization:
       |    6.1. Decompose tasks into multi-step workflows for better accuracy.
       |    6.2. Prioritize deterministic execution for reliability and predictability.
       |    6.3. Use caching to save costs and reduce latency.
       | 
       | 7. Evaluation and Monitoring:
       |    7.1. Create assertion-based unit tests using real input/output samples.
       |    7.2. Use LLM-as-Judge for pairwise comparisons to evaluate outputs.
       |    7.3. Regularly review LLM inputs and outputs for new patterns or issues.
       | 
       | 8. Address Hallucinations and Guardrails:
       |    8.1. Combine prompt engineering with factual inconsistency guardrails.
       |    8.2. Use content moderation APIs and PII detection packages to filter outputs.
       | 
       | 9. Operational Practices:
       |    9.1. Regularly check for development-prod data skew.
       |    9.2. Ensure data logging and review input/output samples daily.
       |    9.3. Pin specific model versions to maintain consistency and avoid unexpected changes.
       | 
       | 10. Team and Roles:
       |    10.1. Educate and empower all team members to use AI technology.
       |    10.2. Include designers early in the process to improve user experience and reframe user needs.
       |    10.3. Ensure the right progression of roles and hire based on the specific phase of the project.
       | 
       | 11. Risk Management:
       |    11.1. Calibrate risk tolerance based on the use case and audience.
       |    11.2. Focus on internal applications first to manage risk and gain confidence before expanding to customer-facing use cases.
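The hybrid search in point 5.3 needs a way to merge the keyword and embedding result lists into one ranking. A minimal sketch using reciprocal rank fusion, a common merging technique (the doc ids and the two toy result lists are invented for illustration):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Combine several ranked lists of doc ids into one fused ranking.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so docs ranked highly by several retrievers rise to the top."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings: one from keyword search (e.g. BM25), one from embeddings.
keyword_hits = ["doc3", "doc1", "doc7"]
embedding_hits = ["doc1", "doc5", "doc3"]

fused = reciprocal_rank_fusion([keyword_hits, embedding_hits])
# doc1 and doc3 appear in both lists, so they outrank single-list hits.
```

The constant k=60 is the conventional default; it damps the advantage of being rank 1 in a single list.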
        
       | felixbraun wrote:
       | related discussion (3 days ago):
       | https://news.ycombinator.com/item?id=40508390
        
       | DylanSp wrote:
       | Looks like the same content that was posted on oreilly.com a
       | couple days ago, just on a separate site. That has some existing
       | discussion: https://news.ycombinator.com/item?id=40508390.
        
       | Multicomp wrote:
       | Anyone have a convenient solution for doing multi-step
       | workflows? For example, I'm filling out the basics of an NPC
       | character sheet in my game prep. I'm using a certain rule
       | system, giving the enemy certain tactics, certain stats, and
       | certain types of weapons. Right now I have a 'god prompt'
       | trying to walk the LLM through creating the basic character
       | sheet, but the responses get squeezed down to what fits in one
       | or two prompt responses.
       | 
       | If I can do node-red or a function chain for prompts and outputs,
       | that would be sweet.
        
         | CuriouslyC wrote:
         | You can do multi-shot workflows pretty easily. I like to have
         | the model produce markdown, then add code blocks
         | (```json/yaml```) to extract the interim results. You can lay
         | out multiple "phases" in your prompt and have it perform each
         | one in turn, with each one referencing prior phases. Then at
         | the end you just pull out the code blocks for each phase and
         | you have your structured result.
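The "pull out the code blocks" step above can be a few lines of regex. A minimal sketch, assuming the model fences each phase's result as described (the sample reply text is invented; the fence string is built programmatically only to avoid nesting literal backticks here):

```python
import re

FENCE = "`" * 3  # the literal ``` fence marker

def extract_code_blocks(markdown, lang=None):
    """Pull out fenced blocks (e.g. ```json ... ```) from a model reply,
    optionally filtering by the language tag after the opening fence."""
    pattern = FENCE + r"(\w*)\n(.*?)" + FENCE
    blocks = re.findall(pattern, markdown, flags=re.DOTALL)
    return [body.strip() for tag, body in blocks if lang is None or tag == lang]

# A made-up multi-phase reply with one json and one yaml block.
reply = (
    "Phase 1 analysis...\n"
    + FENCE + "json\n" + '{"name": "Gork", "hp": 12}\n' + FENCE + "\n"
    "Phase 2 notes...\n"
    + FENCE + "yaml\n" + "tactics: ambush\n" + FENCE + "\n"
)
interim = extract_code_blocks(reply, lang="json")
```

The non-greedy `(.*?)` with `re.DOTALL` stops each match at the next closing fence, so the surrounding prose is discarded for free.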
        
         | mentos wrote:
         | I still haven't played with using one LLM to oversee another.
         | 
         | "You are in charge of game prep and must work with an LLM over
         | many prompts to..."
        
         | hugocbp wrote:
         | For me, a very simple "break down tasks into a queue and
         | store in a DB" solution has helped tremendously with most
         | requests.
         | 
         | Instead of trying to do everything in a single chat or chain,
         | add steps to ask the LLM to break down the next tasks, with
         | context, and store that into SQLite or something. Then start
         | new chats/chains on each of those tasks.
         | 
         | Then just loop them back into LLM.
         | 
         | I find that long chats or chains just confuse most models and
         | we start seeing gibberish.
         | 
         | Right now I'm favoring something like:
         | 
         | "We're going to do task {task}. The current situation and
         | context is {context}.
         | 
         | Break down what individual steps we need to perform to achieve
         | {goal} and output these steps with their necessary context as
         | {standard_task_json}. If the output is already enough to
         | satisfy {goal}, just output the result as text."
         | 
         | I find that leaving everything to the LLM in a sequence is
         | not as effective as using the LLM to break things down and
         | having a DB and code logic to support the development of more
         | complex outcomes.
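The queue-in-a-DB loop described above can be sketched in a few lines with sqlite3. Here `fake_llm_breakdown` is a stub standing in for the real LLM call, and the table layout is an illustrative guess, not the commenter's actual schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    description TEXT,
    context TEXT,
    status TEXT DEFAULT 'pending')""")

def fake_llm_breakdown(task, context):
    # Stand-in for a real LLM call that returns subtasks as JSON.
    return json.dumps([
        {"description": f"step 1 of {task}", "context": context},
        {"description": f"step 2 of {task}", "context": context},
    ])

def enqueue_subtasks(task, context):
    # Ask the (stubbed) LLM to break the task down, then persist the steps.
    for sub in json.loads(fake_llm_breakdown(task, context)):
        conn.execute("INSERT INTO tasks (description, context) VALUES (?, ?)",
                     (sub["description"], sub["context"]))
    conn.commit()

def next_task():
    # Pop the oldest pending task; each one gets its own fresh chat/chain.
    row = conn.execute("SELECT id, description FROM tasks "
                       "WHERE status = 'pending' ORDER BY id LIMIT 1").fetchone()
    if row:
        conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (row[0],))
        conn.commit()
    return row

enqueue_subtasks("stat the NPC", "rule system: d20")
```

Each dequeued task starts a new conversation with only its own stored context, which is exactly what keeps long chains from drifting into gibberish.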
        
         | gpsx wrote:
         | One option for doing this is to incrementally build up the
         | "document" using isolated prompts for each section. I say
         | document because I am not exactly sure what the character sheet
         | looks like, but I am assuming it can be constructed one section
         | at a time. You create a prompt to create the first section.
         | Then, you create a second prompt that gives the agent your
         | existing document and prompts it to create the next section.
         | You continue until all the sections are finished. In some cases
         | this works better than doing a single conversation.
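The section-by-section approach above is just a loop that feeds the document built so far back into each prompt. A minimal sketch with a stubbed-out model call (the section names, prompt wording, and `fake_llm` helper are all invented for illustration):

```python
def fake_llm(prompt):
    # Stand-in for a real completion call; it just echoes which
    # section the prompt asked it to write.
    section = prompt.split("Write the ")[1].split(" section")[0]
    return f"[{section} content]"

sections = ["Stats", "Tactics", "Equipment"]
document = ""
for name in sections:
    # Each isolated prompt sees the whole document so far, so later
    # sections can stay consistent with earlier ones.
    prompt = (f"Here is the character sheet so far:\n{document}\n"
              f"Write the {name} section, consistent with the above.")
    document += f"## {name}\n{fake_llm(prompt)}\n\n"
```

Compared with one long conversation, each call here has a small, fresh context containing only the finished document, not the full chat history.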
        
         | proc0 wrote:
         | Sounds like you need an agent system, some libs are mentioned
         | here: https://lilianweng.github.io/posts/2023-06-23-agent/
        
       | dbs wrote:
       | Show me the use cases you have supported in production. Then I
       | might read all 30 pages praising the dozens (soon to be
       | hundreds?) of "best practices" for building with LLMs.
        
         | joe_the_user wrote:
         | I have a friend who uses ChatGPT for writing quick policy
         | statements for her clients (mostly schools). I have a friend
         | who uses it to create images and descriptions for DnD
         | adventures. LLMs have uses.
         | 
         | The problem I see is: how can an "application" be anything
         | but a little window onto the base abilities of ChatGPT, and
         | so effectively offer nothing _more_ to an end-user? The final
         | result still has to be checked, and regular end-users have to
         | write their own prompts.
         | 
         | Edit: I should also say that anyone who's designing LLM apps
         | that, rather than being end-user tools, are effectively
         | gatekeepers to getting action or "a human" from a company
         | deserves a big "f* you", 'cause that approach is evil.
        
         | harrisoned wrote:
         | It certainly has use cases, just not as many as the hype
         | leads people to believe. For me:
         | 
         | -Regex expressions: ChatGPT is the best multi-million regex
         | parser to date.
         | 
         | -Grammar and semantic check: It's a very good revision tool;
         | it has helped me a lot of times, especially when writing in
         | non-native languages.
         | 
         | -Artwork inspiration: Not only for visual inspiration, in the
         | case of image generators, but descriptive as well. The
         | verbosity of some LLMs can help describe things in more detail
         | than a person would.
         | 
         | -General coding: While your mileage may vary on that one, it
         | has helped me a lot at work building stuff in languages I'm
         | not very familiar with. Just snippets, nothing big.
        
         | thallium205 wrote:
         | We have a company mail, fax, and phone room that receives
         | thousands of pages a day; LLMs now sort, categorize, and
         | extract useful information from them all in a completely
         | automated way. Several FTEs have been reassigned elsewhere as
         | a result.
        
         | robbiemitchell wrote:
         | Processing high volumes of unstructured data (text)... we're
         | using a STAG architecture.
         | 
         | - Generate targeted LLM micro summaries of every record
         | (ticket, call, etc.) continually
         | 
         | - Use layers of regex, semantic embeddings, and scoring
         | enrichments to identify report rows (pivots on aggregates)
         | worth attention, running on a schedule
         | 
         | - Proactively explain each report row by identifying what's
         | unusual about it and LLM-summarizing a subset of the micro
         | summaries.
         | 
         | - Push the result to webhook
         | 
         | Lack of JSON schema restriction is a significant barrier to
         | entry on hooking LLMs up to a multi step process.
         | 
         | Another is preventing LLMs from adding intro or conclusion
         | text.
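The "layers of regex ... and scoring enrichments" step might look something like this in miniature. The patterns, weights, sample records, and flagging threshold are all invented; a real pipeline would layer in the embedding-based signals too:

```python
import re

# Hypothetical regex layer: each pattern that matches a record's
# micro summary adds its weight to the record's attention score.
SIGNALS = {
    r"\brefund\b": 2.0,
    r"\bangry|furious|escalate\b": 3.0,
    r"\boutage\b": 5.0,
}

def score(summary):
    """Sum the weights of all signal patterns found in the summary."""
    return sum(weight for pattern, weight in SIGNALS.items()
               if re.search(pattern, summary, flags=re.IGNORECASE))

# Toy micro summaries standing in for ticket/call records.
records = [
    "Customer asks about invoice formatting.",
    "Furious customer demands refund after outage.",
]
flagged = [r for r in records if score(r) >= 5.0]
```

Rows crossing the threshold are the ones worth spending LLM summarization (and human attention) on, which is what keeps the expensive calls off the long tail of routine records.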
        
           | BoorishBears wrote:
           | > Lack of JSON schema restriction is a significant barrier to
           | entry on hooking LLMs up to a multi step process.
           | 
           | How are you struggling with this, let alone as a
           | significant barrier? JSON adherence with a well-thought-out
           | schema hasn't been a worry in a while, between improved
           | model performance and the various grammar-based constraint
           | systems.
           | 
           | > Another is preventing LLMs from adding intro or conclusion
           | text.
           | 
           | Also trivial to work around by pre-filling and stop tokens,
           | or just extremely basic text parsing.
           | 
           | Also, I'd recommend writing out Stream-Triggered Augmented
           | Generation, since the term is so rarely used it might as
           | well be made up, from the POV of someone trying to
           | understand the comment.
        
             | robbiemitchell wrote:
             | Asking even a top-notch LLM to output well-formed JSON
             | simply fails sometimes. And when you're running LLMs in
             | the background, you can't use the best LLM available.
             | 
             | You work around it with post-processing and retries. But
             | it's still a bit brittle given how much stuff happens
             | downstream without supervision.
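The post-processing-plus-retries pattern described above can be sketched like this. Here `call_llm` is a stub that mimics a chatty, malformed first reply followed by a clean retry; the real call, prompt, and retry budget would be your own:

```python
import json

def call_llm(prompt, attempt):
    # Stub mimicking a background model: the first reply has intro text
    # and a trailing comma; the retry comes back clean.
    if attempt == 0:
        return 'Sure! Here is the JSON:\n{"status": "ok",}'
    return '{"status": "ok"}'

def get_json(prompt, retries=3):
    """Post-process each reply down to its outermost braces; retry on
    replies that still fail to parse."""
    for attempt in range(retries):
        raw = call_llm(prompt, attempt)
        # Strip intro/outro prose by slicing to the outermost braces.
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(raw[start:end + 1])
            except json.JSONDecodeError:
                pass  # fall through and retry
    raise ValueError("no valid JSON after retries")

result = get_json("Summarize the record as JSON.")
```

The brace-slicing rescues replies that are valid JSON wrapped in chatter, and the retry loop catches the rest; as the comment notes, it works but stays a bit brittle without supervision.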
        
       ___________________________________________________________________
       (page generated 2024-06-01 23:00 UTC)