hngopher.com

       [HN Gopher] What we've learned from a year of building with LLMs
       ___________________________________________________________________
        
       What we've learned from a year of building with LLMs
        
       Author : ViktorasJucikas
       Score  : 383 points
       Date   : 2024-05-31 12:33 UTC (2 days ago)
        
 (HTM) web link (eugeneyan.com)
 (TXT) w3m dump (eugeneyan.com)
        
       | solidasparagus wrote:
       | No offense, but I'd love to see what they've successfully built
       | using LLMs before taking their advice too seriously. The idea
       | that fine-tuning isn't even a consideration (perhaps even
       | something they think is absolutely incorrect if the section
       | titles of the unfinished section is anything to go by) is very
       | strange to me and suggests a pretty narrow perspective IMO
        
         | gandalfgeek wrote:
         | This was kind of conventional wisdom ("fine tune only when
         | absolutely necessary for your domain", "fine-tuning hurts
         | factuality"), but some recent research (some of which they
         | cite) has actually quantitatively shown that RAG is much
         | preferable to FT for adding domain-specific knowledge to an
         | LLM:
         | 
         | - "Does Fine-Tuning LLMs on New Knowledge Encourage
         | Hallucinations?" https://arxiv.org/abs//2405.05904
         | 
         | - "Fine-Tuning or Retrieval? Comparing Knowledge Injection in
         | LLMs" https://arxiv.org/abs/2312.05934
        
           | solidasparagus wrote:
           | Thanks, I'll read those more fully.
           | 
           | But "knowledge injection" is still pretty narrow to me.
           | Here's an example of a very simple but extremely valuable
           | usecase - taking a model that was trained on language+code
           | and finetuning it on a text-to-DSL task, where the DSL is a
           | custom one you created (and thus isn't in the training data).
           | I would consider that close to infeasible if your only tool
           | is a RAG hammer, but it's a very powerful way to leverage
           | LLMs.
        
             | gandalfgeek wrote:
             | Agree that your use-case is different. The papers above are
             | dealing mostly with adding a domain-specific _textual_
             | corpus, still answering questions in prose.
             | 
             | "Teaching" the LLM an entirely new language (like a DSL)
             | might actually need fine-tuning, but you can probably build
             | a pretty decent first-cut of your system with n-shot
             | prompts, then fine-tune to get the accuracy higher.
        
             | yoelhacks wrote:
             | This is exactly (one of) our use cases at Eraser - taking
             | code or natural language and producing diagram-as-code DSL.
             | 
             | As with other situations that want a custom DSL, our syntax
             | has its own quirks and details, but is similar enough to
             | e.g. Mermaid that we are able to produce valid syntax
             | pretty easily.
             | 
             | What we've found harder is controlling for edge cases about
             | how to build proper diagrams.
             | 
             | For more context: https://www.eraser.io/decision-node/on-
             | building-with-ai
        
         | CuriouslyC wrote:
         | Fine tuning has been on the way out for a while. It's hard to
         | do right and costly. LoRAs are better for influencing output
         | style as they don't dumb down the model, and they're easier to
         | create. This is on top of RAG just being better for new facts
         | like the other reply mentioned.
        
           | solidasparagus wrote:
           | How much of that is just the flood of traditional engineers
           | into the space and the fact that collecting data and then
           | fine-tuning models is orders of magnitude more complex than
           | just throwing in RAG? I suspect a huge amount of RAG's
           | popularity is just that any engineer can do a version of it +
           | ChatGPT API calls in a day.
           | 
           | As for lora - in the context of my comment, that's just
           | splitting hairs IMO. It falls in the category of finetuning
           | for me, although I understand why you might disagree. But
           | it's not like the article mentions lora either, nor am I
           | aware of people doing lora without GPUs which the article is
           | against (No GPUs before PMF)
        
             | altdataseller wrote:
             | I disagree. No amount of fine tuning will ever give the LLM
             | the relevant context with which to answer my question.
             | Maybe if your context is a static Wikipedia or something
             | that will never change, you can fine tune it. But if your
             | data and docs keep changing, how is fine tuning going to be
             | better than RAG?
        
               | solidasparagus wrote:
               | Continuous retraining and deployment maybe? But I'm
               | actually not anti-RAG (although I think it is overrated
               | because the retrieval problem is still handled extremely
               | naively), I just think that fine-tuning should _also_ be
               | in your toolkit.
        
               | altdataseller wrote:
               | Why is the retrieval part overrated? There isnt even a
               | single way to retrieve. It could be a simple keyword
               | sesrch, a vector sesrch, a combo, or just simply
               | retrieving a single doc and stuffing it in the context
        
               | solidasparagus wrote:
               | People will disagree, but my problem with retrieval is
               | that every technique that is popular uses one-hop
               | thinking - you retrieve information that is directly
               | related to the prompt using old-school techniques (even
               | though the embeddings are new, text similarity is old).
               | LLMs are most powerful, IMO, at horizontal thinking.
               | Building a prompt using one-hop narrow AI techniques and
               | then feeding it into a powerful generally capable model
               | is like building a drone but only letting it fly over
               | streets that already exist - not worthless, but only
               | using a fraction of the technology's power.
               | 
               | A concrete example is something like a tool for querying
               | an internal company wiki and the query "tell me about the
               | Backend Team's approach to sprint planning". Normal
               | retrieval approaches will pull information directly
               | related to that query. But what if there is no
               | information about the backend team's practices? As a
               | human, you would do multi-hop/horizontal information
               | extraction - you would retrieve information about who
               | makes up the backend team, you would then retrieve
               | information about them and their backgrounds/practices.
               | You might might have a hypothesis that people carry over
               | their practices from previous experiences, so you look at
               | the previous teams and their practices. Then you would
               | have the context necessary to give a good answer. I don't
               | know of many people implementing RAG like that. And what
               | I described is 100% possible for AI to do today.
               | 
               | Techniques that would get around this like iterative
               | retrieval or retrieval-as-a-tool don't seem popular.
        
               | altdataseller wrote:
               | People cant do that because of cost. If every single
               | query involved taking everything even remotely related to
               | the query, and passing it to OpenAI, it would get
               | expensive very very fast.
               | 
               | Its not a technical issue, its a practicality issue imo.
        
               | idf00 wrote:
               | Luckily it's not one or the other. You can fine tune and
               | use RAG.
               | 
               | Sometimes RAG is enough. Sometimes fine tuning on top of
               | RAG is better. It depends on the use case. I can't think
               | of any examples where you would want to fine tune and not
               | use rag as well.
               | 
               | Sometimes you fine tune a small model so it performs
               | close to a larger varient on that specific narrow task
               | and you improve inference performance by using a smaller
               | model.
        
           | phillipcarter wrote:
           | I don't see why this is seen as an either-or by people? Fine-
           | tuning doesn't eliminate the need for RAG, and RAG doesn't
           | obviate the need for fine-tuning either.
           | 
           | Note that their guidance here is quite practical:
           | 
           | > If prompting gets you 90% of the way there, then fine-
           | tuning may not be worth the investment.
        
         | OutOfHere wrote:
         | Fine-tuning is an absolutely necessary for true AI, and even if
         | it's desirable, it's unfeasible to do for now for any large
         | model considering how expensive GPUs are. If I had infinite
         | money, I'd throw it at continuous fine-tuning and would throw
         | away the RAG. Fine-tuning also requires appropriate measures to
         | prevent forgetting of older concepts.
        
           | solidasparagus wrote:
           | It is not unfeasible. It is absolutely realistic to do
           | distributed finetuning of an 8B text model on previous
           | generation hardware. You can add finetuning to your set of
           | options for about the cost of one FTE - up to you whether
           | that tradeoff is worth it, but in many places it is. The
           | expertise to pull it off is expensive, but to get a mid-level
           | AI SME capable of helping a company adopt finetuning, you are
           | only going to pay about the equivalent of 1-3 senior
           | engineers.
           | 
           | Expensive? Sure, all of AI is crazy expensive. Unfeasible? No
        
             | OutOfHere wrote:
             | I don't consider a small 8B model to be worth fine-tuning.
             | Fine-tuning is worthwhile when you have a larger model with
             | capacity to add data, perhaps one that can even grow its
             | layers with the data. In contrast, fine-tuning a small
             | saturated model will easily cause it to forget older
             | information.
             | 
             | All things considered, in relative terms, as much as I
             | think fine-tuning would be nice, it will remain
             | significantly more expensive than just making RAG or search
             | calls. I say this while being a fan of fine-tuning.
        
               | solidasparagus wrote:
               | > I don't consider a small 8B model to be worth fine-
               | tuning.
               | 
               | Going to have to disagree with you on that one. A modern
               | 8B model that has been trained on enough tokens is
               | ridiculously powerful.
        
               | OutOfHere wrote:
               | A well-trained 8B model will already be over-saturated
               | with information from the start. It will therefore easily
               | forget much old information when fine-tuning it with new
               | materials. It just doesn't have the capacity to take in
               | too much information.
               | 
               | Don't get me wrong. I think an 70B or larger model would
               | be worth fine-tuning, especially if it can be grown
               | further with more layers.
        
               | solidasparagus wrote:
               | > A well-trained 8B model will already be over-saturated
               | with information from the start
               | 
               | Any evidence of that that I can look at? This doesn't
               | match what I've seen nor have I heard this from the
               | world-class researchers I have worked with. Would be
               | interested to learn more.
        
               | OutOfHere wrote:
               | Upon further thought, if fine-tuning involves adding
               | layers, then the initial saturation should not matter.
               | Let's say if an 8B model adds 0.8*2 = 1.6B of new layers
               | for fine-tuning, then with some assumptions, a ballpark
               | is that this could be good for 16 million articles for
               | fine-tuning.
        
               | robrenaud wrote:
               | The reason to fine tune is to get a model that performs
               | well on a specific task. It could lose 90 percent of it's
               | knowledge and beat the unturned model at the narrow task
               | at hand. That's the point, no?
        
               | OutOfHere wrote:
               | It is not really possible to lose 90% of one's brain and
               | do well on certain narrow tasks. If the tasks truly were
               | so narrow, you would be better off training a small model
               | just for them from scratch.
        
         | lmeyerov wrote:
         | We work in some pretty serious domains and try to stay away
         | from fine tuning:
         | 
         | - Most of our accuracy ROI is from agentic loops over top
         | models, and dynamic RAG example injection goes far here that
         | the relative lift of adding fine-tuning isn't worth the many
         | costs
         | 
         | - A lot of fine-tuning is for OSS models that do worse than
         | agentic loops over the proprietary GPT4/Opus3
         | 
         | - For distribution, it's a lot easier to deploy for pluggable
         | top APIs without requiring fine-tuning, e.g., "connect to your
         | gpt4/opus3 + for dumber-but-bigger tasks, groq"
         | 
         | - The resources we could put into fine-tuning are better spent
         | on RAG, agentic loops, prompts/evals, etc
         | 
         | We do use tuned smaller dumber models, such as part of a coarse
         | relevancy filter in a firehose pipeline... but these are
         | outliers. Likewise, we expect to be using them more... but
         | again, for rarer cases and only after we've exhausted other
         | stuff. I'm guessing as we do more fine-tuning, it'll be more on
         | embeddings than LLMs, at least until OSS models get a lot
         | better.
        
           | solidasparagus wrote:
           | See if the article said this, I would have agreed - fine-
           | tuning is a tool and it should be used thoughtfully. Although
           | I personally believe that in this funding climate it makes
           | sense to make data collection and model training a core
           | capability of any AI product. However that will only be
           | available and wise for some founders.
        
             | lmeyerov wrote:
             | Agreed, model training and data collection are great!
             | 
             | The subtle bit is just doesn't have to be for LLMs, as
             | these are typically part of a system-of-models. E.g., we <3
             | RAG, and GNNs for improving your KG is fascinating.
             | Likewise, dspy's explorations in optimizing prompts, vs
             | LLMs, is very cool.
        
               | solidasparagus wrote:
               | > we <3 RAG, and GNNs for improving your KG is
               | fascinating
               | 
               | Oh man I am so torn between this being a fantastic idea
               | and this being "building a better slide-rule in the age
               | of the computer".
               | 
               | dspy is definitely a project I want to dig into more
        
               | lmeyerov wrote:
               | Yeah I would recommend sticking to RAG on naively chunked
               | data for weekend projects by 1 person. Likewise, a
               | consumer tool like perplexity's search engine where you
               | minimize spend per user task or go bankrupt, same thing,
               | do the cheap thing and move on, good enough
               | 
               | Once RAG projects become important and good answers
               | matter - we work with governments, manufacturers, banks,
               | cyber teams, etc - working through data quality, data
               | representation, & retrieval quality helps
               | 
               | Note that we didn't start here: We began with naive RAG,
               | then relevancy filtering, then agentic & neurosymbolic
               | querying, then dynamic example prompt injection, and now
               | are getting into cleaning up the database/kg itself
               | 
               | For folks doing investigative/analytics projects in this
               | space, happy to chat about what we are doing w Louie.AI.
               | These are more implementation details we don't normally
               | write about.
        
               | tarasglek wrote:
               | Can you give a concrete example of GNNs helping?
        
               | lmeyerov wrote:
               | Entity resolution - RAG often mixes vector & symbolic
               | queries, and ER improves reverse indexing, which is a
               | starting point for a lot of the symbolic ones
               | 
               | Identifying misinfo - Ranking & summarization based on
               | internet data should be a lot more careful, and sometimes
               | the controversy is the interesting part
               | 
               | For both, GNNs are generally SOTA
        
               | qeternity wrote:
               | Have you actually used DSPy? I still can't figure out
               | what it's useful for beyond optimizing basic few shot
               | prompts.
        
         | jph00 wrote:
         | > _The idea that fine-tuning isn 't even a consideration
         | (perhaps even something they think is absolutely incorrect if
         | the section titles of the unfinished section is anything to go
         | by) is very strange to me and suggests a pretty narrow
         | perspective IMO_
         | 
         | The article has a section called "When to finetune", along with
         | links to separate pages describing how to do so. They
         | absolutely don't say that "fine-tuning isn't even a
         | consideration". Instead, they describe the situations in which
         | fine-tuning is likely to be helpful.
        
           | solidasparagus wrote:
           | Huh. Well that's embarrassing. I guess I missed it when I
           | lost interest in the caching section and jumped straight to
           | Evaluation and Monitoring.
        
         | bbischof wrote:
         | Hello, it's Bryan, an author on this piece.
         | 
         | I'd you're interested in using one of the LLM-applications I
         | have in prod, check out https://hex.tech/product/magic-ai/ It
         | has a free limit every month to give it a try and see how you
         | like it. If you have feedback after using it, we're always very
         | interested to hear from users.
         | 
         | As far as fine-tuning in particular, our consensus is that
         | there are easier options first. I personally have fine-tuned
         | gpt models since 2022; here's a silly post I wrote about it on
         | gpt 2: https://wandb.ai/wandb/fc-bot/reports/Accelerating-ML-
         | Conten...
        
           | solidasparagus wrote:
           | I took at look at Magic earlier today and it didn't work at
           | all for me, sorry to say. After the example prompt, I tried
           | to learn about a table and it generated bad SQL (correct
           | query to pull a row, but with limit 0). I asked it to show me
           | the DDL and it generated invalid SQL. Then I tried to ask it
           | to do some population statistics on the customer table and
           | ended up confused about why there appears to be two windows
           | in the cell, with the previously generated SQL on the left
           | and the newly generated SQL on the right. The new SQL
           | wouldn't run when I hit run cell, the error showed the
           | originally generated SQL. I gave up and bounced.
           | 
           | I went back while writing this comment and realized it might
           | be showing me a diff (better use of color would have helped,
           | I have been trained by github). But I was at a loss for what
           | to do with that. I just now figured out the Keep button
           | exists and it accepted the diff and now it sort of makes
           | sense, but the SQL still doesn't return any results.
           | 
           | My honest feedback is that there is way too much stuff I
           | don't understand on the screen and it makes me confused and a
           | little stressed. Ease me into it please, I'm dumb. There
           | seems to be cells that are linked together and cells that
           | aren't(? separated by purplish background) and I don't
           | understand it. I am a jupyter user and I feel like this
           | should be intuitive to me, but it isn't. I am not a designer,
           | but I suspect the structural markings like cell boundaries
           | are too faint compared to the content of the cells and/or the
           | exterior of a cell having the same color as the interior is
           | making it hard for me. I feel lost in a sea of white.
           | 
           | But the core issue is that, excluding the prompt I copy-
           | pasted word for word which worked like a charm, I am 0 out of
           | 4 on actually leveraging AI to solve the problems I asked of
           | Magic. I like the concept of natural language BI (I worked on
           | in the early days when Alexa came out) so I probably gave it
           | more chances than I would have for a different product.
           | 
           | For me, it doesn't fit my criteria for good problems to solve
           | with AI in 2024 - the conversational interface and binary
           | right/wrong nature of querying/presenting data accurately
           | make the cost of failure too high, which is a death sentence
           | for AI products IMO (compare to proactive, non-blocking
           | products like copilot or shades-of-wrong problems like image
           | generation or conversations with imaginary characters). But
           | text-to-SQL and data presentation make sense as AI
           | capabilities in 2024 so I can see why that could be a good
           | product to pursue. If it worked, I would definitely use it.
        
       | OutOfHere wrote:
       | Almost all of this should flow from common-sense. I would use
       | what makes sense for your application, and not worry about the
       | rest. It's a toolbox, not a rulebook. The one point that comes
       | more from experience than from common-sense is to always pin your
       | model versions. As a final tip, if despite trying everything, you
       | still don't like the LLM's output, just run it again!
       | 
       | Here is a summary of all points:
       | 
       | 1. Focus on Prompting Techniques:                  1.1. Start
       | with n-shot prompts to provide examples demonstrating tasks.
       | 1.2. Use Chain-of-Thought (CoT) prompting for complex tasks,
       | making instructions specific.        1.3. Incorporate relevant
       | resources via Retrieval Augmented Generation (RAG).
       | 
       | 2. Structure Inputs and Outputs:                  2.1. Format
       | inputs using serialization methods like XML, JSON, or Markdown.
       | 2.2. Ensure outputs are structured to integrate seamlessly with
       | downstream systems.
       | 
       | 3. Simplify Prompts:                  3.1. Break down complex
       | prompts into smaller, focused ones.        3.2. Iterate and
       | evaluate each prompt individually for better performance.
       | 
       | 4. Optimize Context Tokens:                  4.1. Minimize
       | redundant or irrelevant context in prompts.        4.2. Structure
       | the context clearly to emphasize relationships between parts.
       | 
       | 5. Leverage Information Retrieval/RAG:                  5.1. Use
       | RAG to provide the LLM with knowledge to improve output.
       | 5.2. Ensure retrieved documents are relevant, dense, and
       | detailed.        5.3. Utilize hybrid search methods combining
       | keyword and embedding-based retrieval.
       | 
       | 6. Workflow Optimization:                  6.1. Decompose tasks
       | into multi-step workflows for better accuracy.        6.2.
       | Prioritize deterministic execution for reliability and
       | predictability.        6.3. Use caching to save costs and reduce
       | latency.
       | 
       | 7. Evaluation and Monitoring:                  7.1. Create
       | assertion-based unit tests using real input/output samples.
       | 7.2. Use LLM-as-Judge for pairwise comparisons to evaluate
       | outputs.        7.3. Regularly review LLM inputs and outputs for
       | new patterns or issues.
       | 
       | 8. Address Hallucinations and Guardrails:                  8.1.
       | Combine prompt engineering with factual inconsistency guardrails.
       | 8.2. Use content moderation APIs and PII detection packages to
       | filter outputs.
       | 
       | 9. Operational Practices:                  9.1. Regularly check
       | for development-prod data skew.        9.2. Ensure data logging
       | and review input/output samples daily.        9.3. Pin specific
       | model versions to maintain consistency and avoid unexpected
       | changes.
       | 
       | 10. Team and Roles:                   10.1. Educate and empower
       | all team members to use AI technology.         10.2. Include
       | designers early in the process to improve user experience and
       | reframe user needs.         10.3. Ensure the right progression of
       | roles and hire based on the specific phase of the project.
       | 
       | 11. Risk Management:                   11.1. Calibrate risk
       | tolerance based on the use case and audience.         11.2. Focus
       | on internal applications first to manage risk and gain confidence
       | before expanding to customer-facing use cases.
        
       | felixbraun wrote:
       | related discussion (3 days ago):
       | https://news.ycombinator.com/item?id=40508390
        
       | DylanSp wrote:
       | Looks like the same content that was posted on oreilly.com a
       | couple days ago, just on a separate site. That has some existing
       | discussion: https://news.ycombinator.com/item?id=40508390.
        
       | Multicomp wrote:
       | Anyone have a convenience solution for doing multi-step
       | workflows? For example, I'm filling out the basics of an NPC
       | character sheet on my game prep. I'm using a certain rule system,
       | give the enemy certain tactics, certain stats, certain types of
       | weapons, right now I have a 'god prompt' trying to walk the LLM
       | through creating the basic character sheet, but the responses get
       | squeezed down into what one or two prompt responses can be.
       | 
       | If I can do node-red or a function chain for prompts and outputs,
       | that would be sweet.
        
         | CuriouslyC wrote:
         | You can do multi shot workflows pretty easy, I like to have the
         | model produce markdown, then add code blocks (```json/yaml```)
         | to extract the interim results. You can lay out multiple
         | "phases" in your prompt and have it perform each one in turn,
         | and have each one reference prior phases. Then at the end you
         | just pull out the code blocks for each phase and you have your
         | structured result.
        
         | mentos wrote:
         | I still haven't played with using one LLM to oversee another.
         | 
         | "You are in charge of game prep and must work with an LLM over
         | many prompts to..."
        
         | hugocbp wrote:
         | For me, a very simple "breakdown tasks into a queue and store
         | in a DB" solution has help tremendously with most requests.
         | 
         | Instead of trying to do everything into a single chat or chain,
         | add steps to ask the LLM to break down the next tasks, with
         | context, and store that into SQLite or something. Then start
         | new chats/chains on each of those tasks.
         | 
         | Then just loop them back into LLM.
         | 
         | I find that long chats or chains just confuse most models and
         | we start seeing gibberish.
         | 
         | Right now I'm favoring something like:
         | 
         | "We're going to do task {task}. The current situation and
         | context is {context}.
         | 
         | Break down what individual steps we need to perform to achieve
         | {goal} and output these steps with their necessary context as
         | {standard_task_json}. If the output is already enough to
         | satisfy {goal}, just output the result as text."
         | 
         | I find that leaving everything to LLM in a sequence is not as
         | effective as using LLM to break things down and having a DB and
         | code logic to support the development of more complex outcomes.
        
           | datameta wrote:
           | Indeed! If I'm met with several misunderstandings in a row,
           | asking it to explain what I'm trying to do is a pretty
           | surefire way to move forward.
           | 
           | Also mentioning what to "forget" or not focus on anymore
           | seems to remove some noise from the responses if they are
           | large.
        
         | gpsx wrote:
         | One option for doing this is to incrementally build up the
         | "document" using isolated prompts for each section. I say
         | document because I am not exactly sure what the character sheet
         | looks like, but I am assuming it can be constructed one section
         | at a time. You create a prompt to create the first section.
         | Then, you create a second prompt that gives the agent your
         | existing document and prompts it to create the next section.
         | You continue until all the sections are finished. In some cases
         | this works better than doing a single conversation.
        
           | e1g wrote:
           | Perplexity recently released something like this
           | https://www.perplexity.ai/hub/blog/perplexity-pages
        
         | proc0 wrote:
         | Sounds like you need an agent system, some libs are mentioned
         | here: https://lilianweng.github.io/posts/2023-06-23-agent/
        
         | 127 wrote:
         | Did you force it into a parser? You can define a simple
         | language in llama.cpp for the LLM to obey.
        
         | punkspider wrote:
         | Perhaps this would be of use?
         | https://github.com/langgenius/dify/ I use it for quick
         | workflows and it's pretty intuitive.
        
       | dbs wrote:
       | Show me the use cases you have supported in production. Then I
       | might read all the 30 pages praising the dozens (soon to be
       | hundreds?) of "best practices" to build LLMs.
        
         | joe_the_user wrote:
         | I have a friend who uses ChatGPT for writing quick policy
         | statement for her clients (mostly schools). I have a friend who
         | uses it to create images and descriptions for DnD adventures.
         | LLMs have uses.
         | 
         | The problem I see is, who can an "application" be anything but
         | a little window onto the base abilities of ChatGPT and so
         | effectively offers nothing _more_ to an end-user. The final
         | result still have to be checked and regular end-users have to
         | do their own prompt.
         | 
         | Edit: Also, I should also say that anyone who's designing LLM
         | apps that, rather than being end-user tools, are effectively
         | gate keepers to getting action or "a human" from a company
         | deserves a big "f* you" 'cause that approach is evil.
        
         | harrisoned wrote:
         | It certainly has use cases, just not as many as the hype lead
         | people to believe. For me:
         | 
         | -Regex expressions: ChatGPT is the best multi-million regex
         | parser to date.
         | 
         | -Grammar and semantic check: It's a very good revision tool,
         | helped me a lot of times, specially when writing in non-native
         | languages.
         | 
         | -Artwork inspiration: Not only for visual inspiration, in the
         | case of image generators, but descriptive as well. The
         | verbosity of some LLMs can help describe things in more detail
         | than a person would.
         | 
         | -General coding: While your mileage may vary on that one, it
         | has helped me a lot at work building stuff on languages i'm not
         | very familiar with. Just snippets, nothing big.
        
           | int_19h wrote:
           | GPT-4 has amazing translation capabilities, too. Actually
           | usable for long conversations.
        
         | thallium205 wrote:
         | We have a company mail, fax, and phone room that receives
         | thousands of pages a day that now sorts, categorizes, and
         | extracts useful information from them all in a completely
         | automated way by LLMs. Several FTEs have been reassigned
         | elsewhere as a result.
        
         | robbiemitchell wrote:
         | Processing high volumes of unstructured data (text)... we're
         | using a STAG architecture.
         | 
         | - Generate targeted LLM micro summaries of every record
         | (ticket, call, etc.) continually
         | 
         | - Use layers of regex, semantic embeddings, and scoring
         | enrichments to identify report rows (pivots on aggregates)
         | worth attention, running on a schedule
         | 
         | - Proactively explain each report row by identifying what's
         | unusual about it and LLM summarizing a subset of the
         | microsummaries.
         | 
         | - Push the result to webhook
         | 
         | Lack of JSON schema restriction is a significant barrier to
         | entry on hooking LLMs up to a multi step process.
         | 
         | Another is preventing LLMs from adding intro or conclusion
         | text.
        
           | BoorishBears wrote:
           | > Lack of JSON schema restriction is a significant barrier to
           | entry on hooking LLMs up to a multi step process.
           | 
           | How are you struggling with this, let alone as a significant
           | barrier? JSON adherence with a well thought out schema hasn't
           | been a worry between improved model performance and various
           | grammar based constraint systems in a while.
           | 
           | > Another is preventing LLMs from adding intro or conclusion
           | text.
           | 
           | Also trivial to work around by pre-filling and stop tokens,
           | or just extremely basic text parsing.
           | 
           | Also would recommend writing out Stream-Triggered Augmented
           | Generation since the term is so barely used it might as well
           | be made up from the POV of someone trying to understand the
           | comment
        
             | robbiemitchell wrote:
             | Asking even a top-notch LLM to output well formed JSON
             | simply fails sometimes. And when you're running LLMs at
             | high volume in the background, you can't use the best
             | available until the last mile.
             | 
             | You work around it with post-processing and retries. But
             | it's still a bit brittle given how much stuff happens
             | downstream without supervision.
        
               | fancy_pantser wrote:
               | Constrained output with GBNF or JSON is much more
               | efficient and less error-prone. I hope nobody outside of
               | hobby projects is still using error/retry loops.
        
               | joatmon-snoo wrote:
               | Constraining output means you don't get to use ChatGPT or
               | Claude though, and now you have to run your own stuff.
               | Maybe for some folks that's OK, but really annoying for
               | others.
        
               | fancy_pantser wrote:
               | You're totally right, I'm in my own HPC bubble. The
               | organizations I work with create their own models and
               | it's easy for me to forget that's the exception more than
               | the rule. I apologize for making too many assumptions in
               | my previous comment.
        
               | joatmon-snoo wrote:
               | Not at all!
               | 
               | Out of curiosity- do those orgs not find the loss of
               | generality that comes from custom models to be an issue?
               | e.g. vs using Llama or Mistral or some other open model?
        
               | int_19h wrote:
               | I do wonder why, though. Constraining output based on
               | logits is a fairly simple and easy-to-implement idea, so
               | why is this not part of e.g. the OpenAI API yet? They
               | don't even have to expose it at the lowest level, just
               | use it to force valid JSON in the output on their end.
        
               | jncfhnb wrote:
               | ... why would you have the LLM spit out a json rather
               | than define the json yourself and have the LLM supply
               | values?
        
               | janpieterz wrote:
               | How would I do this reliably? Eg give me 10 different
               | values, all in one prompt for performance reasons?
               | 
               | Might not need JSON but whatever format it outputs, it
               | needs to be reliable.
        
               | jncfhnb wrote:
               | Don't do it all in one prompt.
        
               | janpieterz wrote:
               | Right, but now I'm basically running a huge performance
               | hit, need to parallelize my queries etc.
               | 
               | I was parsing a document recently, 10-ish questions for 1
               | document, would make things expensive.
               | 
               | Might be what's needed but not ideal.
        
               | esafak wrote:
               | If the LLM doesn't output data that conforms to a schema,
               | you can't reliably parse it, so you're back to square
               | one.
        
               | jncfhnb wrote:
               | It's significantly easier to output an integer than a
               | JSON with a key value structure where the value is an
               | integer and everything else is exactly as desired
        
               | esafak wrote:
               | That's because you've dumbed down the problem. If it was
               | just about outputting one integer, there would be nothing
               | to discuss. Now add a bunch more fields, add some nesting
               | and other constraints into it...
        
               | jncfhnb wrote:
               | The more complexity you add the less likely the LLM is to
               | give you a valid response in one shot. It's still going
               | to be easier to get the LLM to supply values to a fixed
               | scheme than to get the LLM to give the answers and the
               | scheme
        
               | neverokay wrote:
               | Is there a general model that got fine tuned on these
               | json schema/output pairs?
               | 
               | Seems like it would be universally useful.
        
               | yeahwhatever10 wrote:
               | The phrase you want to search is "constrained decoding".
        
               | BoorishBears wrote:
               | The best available actually have the _fewest_ knobs for
               | JSON schema enforcement (ie. OpenAI 's JSON mode, which
               | technically can still produce incorrect JSON)
               | 
               | If you're using anything less you should have a grammar
               | that enforces exactly what tokens are allowed to be
               | output. Fine Tuning can help too in case you're worried
               | about the effects of constraining the generation, but in
               | my experience it's not really a thing
        
           | benreesman wrote:
           | I only became aware of it recently and therefore haven't done
           | more than play with in a fairly cursory way, but
           | unstructured.io seems to have a lot of traction and certainly
           | in my little toy tests their open-source stuff seems pretty
           | clearly better than the status quo.
           | 
           | Might be worth checking out.
        
           | adamsbriscoe wrote:
           | > Lack of JSON schema restriction is a significant barrier to
           | entry on hooking LLMs up to a multi step process.
           | 
           | (Plug) I shipped a dedicated OpenAI-compatible API for this,
           | jsonmode.com a couple weeks ago and just integrated Groq
           | (they were nice enough to bump up the rate limits) so it's
           | crazy fast. It's a WIP but so far very comparable to JSON
           | output from frontier models, with some bonus features (web
           | crawling etc).
        
             | tarasglek wrote:
             | The metallica-esque lightning logo is cool
        
           | joatmon-snoo wrote:
           | We actually built an error-tolerant JSON parser to handle
           | this. Our customers were reporting exactly the same issue-
           | trying a bunch of different techniques to get more usefully
           | structured data out.
           | 
           | You can check it out over at
           | https://github.com/BoundaryML/baml. Would love to talk if
           | this is something that seems interesting!
        
           | lastdong wrote:
           | "Use layers of regex, semantic embeddings, and scoring
           | enrichments to identify report rows (pivots on aggregates)
           | worth attention, running on a schedule"
           | 
           | This is really interesting, is there any architecture
           | documentation/articles that you can recommend?
        
         | fnordpiglet wrote:
         | We use LLMs in dozens of different production applications for
         | critical business flows. They allow for a lot of dynamism in
         | our flows that aren't amenable to direct quantitative reasoning
         | or structured workflows. Double digit percents of our growth in
         | the last year are entirely due to them. The biggest challenge
         | is tool chain, limits on inference capacity, and developer
         | understanding of the abilities, limits, and techniques for
         | using LLMs effectively.
         | 
         | I often see these messages from the community doubting the
         | reality, but LLMs are a powerful tool in the tool chest. But I
         | think most companies are not staffed with skilled enough
         | engineers with a creative enough bent to really take advantage
         | of them yet or be willing to fund basic research and from first
         | principles toolchain creation. That's ok. But it's foolish to
         | assume this is all hype like crypto was. The parallels are
         | obvious but the foundations are different.
        
           | mvdtnz wrote:
           | Yet another post claiming "dozens" of production use cases
           | without listing a single one.
        
             | fnordpiglet wrote:
             | I've listed plenty in my comment history. I don't generally
             | feel compelled to trot them all out all the time - I don't
             | need to "prove" anything and if you think I'm lying that's
             | your choice. Finally, many of our uses are trade secrets
             | and a significant competitive advantage so I don't feel the
             | need to disclose them to the world if our competitors don't
             | believe in the tech. We can keep eating their lunch.
        
           | threeseed wrote:
           | No one is saying that all of AI is hype. It clearly isn't.
           | 
           | But the facts are that today LLMs are not suitable for use
           | cases that need accurate results. And there is no evidence or
           | research that suggests this is changing anytime soon. Maybe
           | for ever.
           | 
           | There are very strong parallels to crypto in that (a) people
           | are starting with the technology and trying to find problems
           | and (b) there is a cult like atmosphere where non-believers
           | are seen as being anti-progress and anti-technology.
        
             | fnordpiglet wrote:
             | Yeah I think a key is LLMs in business are not generally
             | useful alone. They require classical computing techniques
             | to really be powerful. Accurate computation is a generally
             | well established field and you don't need an LLM to do
             | optimization or math or even deductive logical reasoning.
             | That's a waste of their power which is typically abstract
             | semantic abductive "reasoning" and natural language
             | processing. Overlaying this with constraints, structure,
             | and augmenting with optimizers, solvers, etc, you get a
             | form of computing that was impossible more than 5 years
             | prior and is only practical in the last 9 months.
             | 
             | On the crypto stuff yeah I get it - especially if you're
             | not in the weeds of its use. A lot of people formed
             | opinions from GPT3.5, Gemini, copilot, and other crappy
             | experiences and haven't kept up with the state of the art.
             | The rate of change in AI is breathtaking and I think hard
             | to comprehend for most people. Also the recent mess of
             | crypto and the fact grifters grift etc also hurts. But
             | people who doubt -are- stuck in the past. That's not
             | necessarily their fault and it might not even apply to
             | their career or lives in the present and the flaws are
             | enormous as you point out. But it's such a remarkably
             | powerful new mode of compute that it in combination with
             | all the other powerful modes of compute is changing
             | everything and will continue too, especially if next
             | generation models keep improving as they seem to be likely
             | to.
        
               | jeffreygoesto wrote:
               | That text applies to basically every new technology.
               | Point is that you can't predict it's usefulness in 20
               | years from that.
               | 
               | To me it still looks like a hammer made completely from
               | rubber. You can practice to get some good hits, but it is
               | pretty hard to get something reliable. And a beginner
               | will basically just bounce it around. But it is sold as
               | rescue for beginners.
        
             | idf00 wrote:
             | I didn't see anything in the article that indicated the
             | authors believed that those who don't see use cases for
             | LLMs are anti-progress or anti-technology. Is that comment
             | related to the authors of this article, or just a general
             | grievance you have unrelated to this article?
        
           | TeMPOraL wrote:
           | > _We use LLMs in dozens of different production applications
           | for critical business flows. They allow for a lot of dynamism
           | in our flows that aren't amenable to direct quantitative
           | reasoning or structured workflows. Double digit percents of
           | our growth in the last year are entirely due to them. The
           | biggest challenge is tool chain, limits on inference
           | capacity, and developer understanding of the abilities,
           | limits, and techniques for using LLMs effectively._
           | 
           | That sounds like corporate buzzword salad. It doesn't tell
           | much as it stands, not without at least one specific example
           | to ground all those relative statements.
        
             | mloncode wrote:
             | Hi, Hamel here. I'm one of the co-authors. I'm an
             | independent consultant and not all clients allow me to talk
             | about their work.
             | 
             | However, I have two that do, which I've discussed in the
             | article. These are two production use cases that I have
             | supported (which again, are explicitly mentioned in the
             | article):
             | 
             | 1. https://www.honeycomb.io/blog/introducing-query-
             | assistant
             | 
             | 2. https://www.youtube.com/watch?v=B_DMMlDuJB0
             | 
             | Other co-authors have worked on significant bodies of work:
             | 
             | Bryan Bischoff lead the creation of Magic in Hex:
             | https://www.latent.space/p/bryan-bischof
             | 
             | Jason Liu created the most popular OSS libraries for
             | structured data called instructor
             | https://github.com/jxnl/instructor, and works with some of
             | the leading companies in the space like Limitless and
             | Raycast (https://jxnl.co/services/#current-and-past-
             | clients)
             | 
             | Eugene Yan works with LLMs extensively at Amazon and uses
             | that to inform his writing: https://eugeneyan.com/writing/
             | (However he isn't allowed to share specifics about Amazon)
             | 
             | I believe you might find these worth looking at.
        
               | mattmanser wrote:
               | You've linked to a query generator for a custom
               | programming language and a 1 hour video about LLM tools.
               | The cynic in me feels like the former could probably be
               | done by chatgpt off the shelf.
               | 
               | But those do not seem to be real world business cases.
               | 
               | Can you expand a bit more why you think they are? We
               | don't have hours to spend reading, and you say you've
               | been allowed to talk about them.
               | 
               | So can you summarise the business benefits for us, which
               | is what people are asking for, instead of linking to huge
               | articles?
        
               | 80hd wrote:
               | Sounds like something you could do with an LLM
        
               | idf00 wrote:
               | They think they are real business use cases, because real
               | businesses use them to solve their use cases. They know
               | that chatgpt can't solve this off the shelf, because they
               | tried that first and were forced to do more in order to
               | solve their problem.
               | 
               | There's a summary for ya! More details in the stuff that
               | they linked if you want to learn. Technical skills do
               | require a significant time investment to learn, and LLM
               | usage is no different.
        
               | mloncode wrote:
               | > do not seem to be real world business cases
               | 
               | The first one is a real world product that lives in
               | production that is user facing for a paid product.
               | 
               | The second video goes in depth about how a AI assistant
               | was built for a real estate CRM company, also a paid
               | product.
               | 
               | I don't understand the assertion that it's not "real
               | world" or not "business"
               | 
               | Here are additional articles about these
               | 
               | https://help.rechat.com/guides/lucy
               | 
               | https://www.prnewswire.com/news-releases/honeycomb-
               | launches-...
        
               | phillipcarter wrote:
               | > The cynic in me feels like the former could probably be
               | done by chatgpt off the shelf.
               | 
               | Hello! I'm the owner of the feature in question who
               | experimented with chatgpt last year in the course of
               | building the feature (and working with Hamel to improve
               | it via fine-tuning later).
               | 
               | Even today, it could not work with ChatGPT. To generate
               | valid queries, you need to know which subset of a user's
               | dataset schema is relevant to their query, which makes it
               | equally a retrieval problem as it does a generation
               | problem.
               | 
               | Beyond that, though, the details of "what makes a good
               | query" are quite tricky and subtle. Honeycomb as a
               | querying tool is unique in the market because it lets you
               | arbitrarily group and filter by any column/value in your
               | schema without pre-indexing and without any cost w.r.t.
               | cardinality. And so there are many cases where you can
               | quite literally answer someone's question, but there are
               | multitudes of ways you can be _even more helpful_ , often
               | by introducing a grouping that they didn't directly ask
               | for. For example, "count my errors" is just a COUNT where
               | the error column exists, but if you group by something
               | like the HTTP route, the name of the operation, etc. --
               | or the name of a child operation and its calling HTTP
               | route for requests -- you end up actually showing people
               | where and how these errors come from. In my experience,
               | the large majority of power users already do this
               | themselves (it's how you use HNY effectively), and the
               | large majority of new users who know little about the
               | tool simply have no idea it's this flexible. Query
               | Assistant helps them with that and they have a pretty
               | good activation rate when they use it.
               | 
               | Unfortunately, ChatGPT and even just good old fashioned
               | RAG is often not up to the task. That's why fine-tuning
               | is so important for this use case.
        
               | fnordpiglet wrote:
               | Thanks for the reply. Huge fan of honeycomb and the
               | feature. Spent many years in observability and built a
               | some of the large in use log platforms. Tracing is the
               | way of the future and hope to see you guys eat that
               | market. I did some executive tech strategy stuff at some
               | megacorp on observability and it's really hard to unwedge
               | metrics and logs but I've done my best when it was my
               | focus. Good luck and thanks for all you're doing over
               | there.
        
         | obiefernandez wrote:
         | The book I'm writing is almost finished and is based almost
         | entirely on production use cases: https://leanpub.com/patterns-
         | of-application-development-usin...
        
         | cqqxo4zV46cp wrote:
         | Or maybe they could choose to focus their attention on people
         | that aren't needlessly aggressive and adversarial.
        
         | mloncode wrote:
         | Hi, Hamel here. I'm one of the co-authors. I'm an independent
         | consultant and not all clients allow me to talk about their
         | work.
         | 
         | However, I have two that do, which I've discussed in the
         | article. These are two production use cases that I have
         | supported (which again, are explicitly mentioned in the
         | article):
         | 
         | 1. https://www.honeycomb.io/blog/introducing-query-assistant
         | 
         | 2. https://www.youtube.com/watch?v=B_DMMlDuJB0
         | 
         | Other co-authors have worked on significant bodies of work:
         | 
         | Bryan Bischoff lead the creation of Magic in Hex:
         | https://www.latent.space/p/bryan-bischof
         | 
         | Jason Liu created the most popular OSS libraries for structured
         | data called instructor https://github.com/jxnl/instructor, and
         | works with some of the leading companies in the space like
         | Limitless and Raycast (https://jxnl.co/services/#current-and-
         | past-clients)
         | 
         | Eugene Yan works with LLMs extensively at Amazon and uses that
         | to inform his writing: https://eugeneyan.com/writing/ (However
         | he isn't allowed to share specifics about Amazon)
         | 
         | I believe you might find these worth looking at.
        
           | anon373839 wrote:
           | I know it's a snarky comment you responded to, but I'm glad
           | you did. Those are great resources, as is your excellent
           | article. Thanks for posting!
        
         | hubraumhugo wrote:
         | I think it comes down to relatively unexciting use cases that
         | have a high business impact (process automation, RPA, data
         | analysis), not fancy chatbots or generative art.
         | 
         | For example, we focused on the boring and hard task of web data
         | extraction.
         | 
         | Traditional web scraping is labor-intensive, error-prone, and
         | requires constant updates to handle website changes. It's
         | repetitive and tedious, but couldn't be automated due to the
         | high data diversity and many edge cases. This required a
         | combination of rule-based tools, developers, and constant
         | maintenance.
         | 
         | We're now using LLMs to generate web scrapers and data
         | transformation steps on the fly that adapt to website changes,
         | automating the full process end-to-end.
        
         | bbischof wrote:
         | Hello, it's Bryan, an author on this piece.
         | 
         | I'd you're interested in using one of the LLM-applications I
         | have in prod, check out https://hex.tech/product/magic-ai/ It
         | has a free limit every month to give it a try and see how you
         | like it. If you have feedback after using it, we're always very
         | interested to hear from users.
        
       | threeseed wrote:
       | RAGs do not prevent hallucinations nor does it guarantee that the
       | quality of your output is contingent solely on the quality of
       | your input. Using LLMs for legal use cases for example has shown
       | it to be poor for anything other than initial research as it is
       | accurate at best 65%:
       | 
       | https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Halluc...
       | 
       | So would strongly disagree that _LLMs have become "good enough"
       | for real-world applications "_ based on what was promised.
        
         | mattyyeung wrote:
         | You may be interested "Deterministic Quoting"[1]. This doesn't
         | completely "solve" hallucinations, but I would argue that we do
         | get "good enough" in several applications
         | 
         | Disclosure: author on [1]
         | 
         | [1] https://mattyyeung.github.io/deterministic-quoting
        
           | threeseed wrote:
           | Have seen this approach before.
           | 
           | It's the yes we hallucinate but don't worry because we
           | provide the sources for users to check.
           | 
           | Even though everyone knows that users will never check unless
           | the hallucination is egregious.
           | 
           | It's such a disingenuous way of handling this.
        
         | phillipcarter wrote:
         | > So would strongly disagree that LLMs have become "good
         | enough" for real-world applications" based on what was
         | promised.
         | 
         | I can't speak for "what was promised" by anyone, but LLMs have
         | been good enough to live in production as a core feature in my
         | product since early last year, and have only gotten better.
        
       | sheepscreek wrote:
       | I'm sure this has some decent insights but it's from almost 1
       | year ago! A lot has changed in this space since then.
        
         | bgrainger wrote:
         | Are you sure? The article says "cite this as Yan et al. (May
         | 2024)" and published-time in the metadata is 2024-05-12.
         | 
         | Weird: I just refreshed the page and it now redirects to a
         | different domain (than the originally-submitted URL) and has a
         | date of June 8, 2023. It still cites articles and blog posts
         | from 2024, though.
        
           | jph00 wrote:
           | Looks like they made a mistake in the article metadata - they
           | definitely just released this article.
        
             | jph00 wrote:
             | OK I let them know, and they've fixed it now.
        
               | sheepscreek wrote:
               | Awesome - thanks. Makes much more sense now. Can't update
               | my original comment but hopefully people will read this.
        
       | mloncode wrote:
       | This is Hamel, one of the authors of the article. We published
       | the article with OReilly here:
       | 
       | Part 1: https://www.oreilly.com/radar/what-we-learned-from-a-
       | year-of... Part 2: https://www.oreilly.com/radar/what-we-learned-
       | from-a-year-of...
       | 
       | We were working on this webpage to collect the entire three part
       | article in one place (the third part isn't published yet). We
       | didn't expect anyone to notice the site! Either way, part 3
       | should be out in a week or so.
        
         | seventytwo wrote:
         | Was wondering about the June 8th date on there :)
        
       | blumomo wrote:
       | > PUBLISHED
       | 
       | > June 8, 2024
       | 
       | Is this an article from the future?
        
         | defrost wrote:
         | Good catch.
         | 
         | Best guess is that's the anticipated publishing date of the
         | _full_ three parts on the official O 'Reilly site.
         | 
         | See: https://news.ycombinator.com/item?id=40551413
        
       | mercurialsolo wrote:
       | As we go about moving LLM enabled products into production we
       | definitely see a bunch of what is being spoken about resonate. We
       | also see the below as areas which need to be expanded upon for
       | developers building in the space to take products to production :
       | 
       | I would love to see this article also expand to touch upon things
       | like : - data management - (tooling, frameworks, open vs closed
       | data management, labelling & annotations) - inference as a
       | pipeline - frameworks for breaking down model inference into
       | smaller tasks & combining outputs (do DAG's have a role to play
       | here?) - prompts - areas like caching, management, versioning,
       | evaluations - model observability - tokens, costs, latency,
       | drift? - evals for multimodality - how do we tackle evals here
       | which in turn can go into loops e.g. quality of audio, speech or
       | visual outputs
        
       | JKCalhoun wrote:
       | > Note that in recent times, some doubt has been cast on if this
       | technique is as powerful as believed. Additionally, there's
       | significant debate as to exactly what is going on during
       | inference when Chain-of-Thought is being used...
       | 
       | I love this new era of computing we're in where rumors, second-
       | guessing and something akin to voodoo have entered into working
       | with LLMs.
        
         | ezst wrote:
         | That's the thing, it's a novel form of computing that's
         | increasingly moving away from computer science. It deserves to
         | be treated as a discipline of its own, with lots of words of
         | caution and danger stickers slapped over it.
        
           | skydhash wrote:
           | It's text (word) manipulation based on probalistic rules
           | derived from analyzing human-produced text. And everyone
           | knows language is imperfect. That's why we have introduced
           | logic and formalism so that we can reliably transmit
           | knowledge.
           | 
           | That's why LLMs are good at translating and spellchecking.
           | We've been describing the same world and almost all texts
           | respect grammar. That's the first things that surface. But
           | you can extract the same rules in other way and create a
           | program that does it without the waste of computing power.
           | 
           | If we describe computing as solving problems, then it's not
           | computing because if your solution was not part of the
           | training data, you won't solve anything. If we describe
           | computing as symbol manipulation, then it's not doing a good
           | job because the rules changes with every model and they are
           | probabilistic. No way to get a reliable answer. It's
           | divination without the divine (no hint from an omniscient
           | entity).
        
           | amelius wrote:
           | Yeah like psychology being a different field from physics
           | even if it is running on atoms ultimately.
           | 
           | Imagine if physics literature was filled with stuff about
           | psychology and how that would drive physicists nuts. That's
           | how I feel right now ;)
        
       | pklee wrote:
       | This is pure gold !! Thank you so much eugene and gang for doing
       | this. For those of them which I have encountered, I can 100 %
       | agree with them. This is fantastic !! So many good insights.
        
       | jakubmazanec wrote:
       | I'm not saying the content of the article is wrong, but what apps
       | are people/companies writing articles like this actually
       | building? I'm seriously unable to imagine any useful app. I only
       | use GPT via API (as better Google for documentations, and its
       | output is never usable without heavy editing). This week I tried
       | to use "AI" in Notion: I needed to generate 84 check boxes for
       | each day starting with specific date. I got 10 check boxes and
       | line "here should go rest..." (or some variation of such lazy
       | output). Completely useless.
        
         | qeternity wrote:
         | I think you're going about it backwards. You don't take a tool,
         | and then try to figure out what to do with it. You take a
         | problem, and then figure out which tool you can use to solve
         | it.
        
           | jakubmazanec wrote:
           | But it seems to me that's what they're doing: "We have LLMs,
           | what to do with them?" But anyway, I'm seriously just looking
           | for an example of app that is build with stuff described in
           | the article.
           | 
           | Me personally, I only used LLM for one "serious" application:
           | I used GPT-3.5Turbo for transforming unstructured text into
           | JSON; it was basically just ad-hoc Node.js script that called
           | API (prompt was few examples of input-output pairs), and then
           | it did some checks (these checks usually failed only because
           | GPT also corrected misspellings). It would take me weeks to
           | do it manually, but with the help of GPT it was few hours
           | (writing of the script + I made a lot of misspellings so the
           | script stopped a lot). But I cannot imagine anything more
           | complex.
        
             | exhaze wrote:
             | https://github.com/hrishioa/lumentis
             | 
             | Since you seem to have not noticed my comment above, here's
             | another example of a project that implements many of these
             | techniques. Me and many others have used this to transcribe
             | hour long videos into a well organized "docs site" that
             | makes the content easy to read.
             | 
             | Example: https://matadoc.vercel.app/
             | 
             | This was completely auto-generated in a few minutes. The
             | author of the library reviewed it and said that it's nearly
             | 100% correct and people in the company where it was built
             | rely on these docs.
             | 
             | Tell me how long it would take you to write these docs. I'm
             | really confused where your dismissive mentality is coming
             | from in the face of what I think is overwhelming evidence
             | to the contrary. I'm happy to provide example after example
             | after example. I'm sorry, but you are utterly, completely
             | wrong in your conclusions.
        
               | jakubmazanec wrote:
               | But that seems to belong to the category "text
               | transformation" (e.g. translating, converting unstructed
               | notes into structured data, etc.), which I acknowledge
               | LLMs are good at; instead of category "I'll magically
               | debug your SQL wish!".
        
               | exhaze wrote:
               | I believe we were discussing the former not the latter? I
               | agree that for lots of problem solving tasks it can be
               | hit or miss - in my experience, all the models are quite
               | bad at writing decent frontend code when it comes to the
               | rendered page looking the way you want it to.
               | 
               | What you're describing is more about reasoning abilities
               | - that's not really what the article was about or the
               | problems the techniques are for. The techniques in
               | article are more for stuff like Q&A, classification,
               | summarization, etc.
        
               | root_axis wrote:
               | I've tried this type of thing quite a bit (generating
               | documentation based on code I've written), and it's
               | generally pretty bad. Even just generating a README for a
               | single source file project produces bloviated fluff that
               | I have to edit rigorously. I'd say it does about 40% of
               | the job, which is obviously a technical marvel, but in a
               | practical sense it's more novelty than utility.
        
               | exhaze wrote:
               | Please just go and try the lumentis library I mentioned -
               | that is what was used to generate this. It works. For the
               | library docs I showed, I literally just wrote a zsh
               | script to concat all the code together into one file,
               | each one wrapped with XML open/close tags, and fed that
               | in. Just because you weren't able to do it doesn't mean
               | it's a novelty.
        
         | exhaze wrote:
         | I've built many production applications using a lot of these
         | techniques and others - it's made money either by increasing
         | sales or decreasing operational costs.
         | 
         | Here's a more dramatic example: https://www.grey-wing.com/
         | 
         | This company provides deeply integrated LLM-powered software
         | for operating freight ships.
         | 
         | There are a lot of people who are doing this and achieving very
         | good results.
         | 
         | Sorry, if it's not working for you, it doesn't mean that it
         | doesn't work.
        
           | robbiep wrote:
           | That's really interesting. Surely the crewing roster stuff is
           | actually using linear algebra rather than AI though?
        
       | hakanderyal wrote:
       | If you didn't follow what has been happing in the LLM space, this
       | document gives you everything you need to know about state of the
       | art LLM usage & applications.
       | 
       | Thanks a lot for this!
        
       | gengstrand wrote:
       | Interesting blog. It seems to be a compendium of advice for all
       | kinds of folks ranging from end user to integration partner. For
       | a slightly different take on how to use LLMs to build software,
       | you might be interested in https://www.infoq.com/articles/llm-
       | productivity-experiment/ which documents an experiment where the
       | same prompt was given to various prominent LLMs asking to write
       | two unit tests for an already existing code base. The results
       | were collected, metrics were analyzed, then comparisons were
       | made. No advice on how to write better prompts but some insight
       | on how to work with and what you can expect from LLMs in order to
       | improve developer productivity.
        
       ___________________________________________________________________
       (page generated 2024-06-02 23:02 UTC)