[HN Gopher] Speech and Language Processing (3rd ed. draft)
       ___________________________________________________________________
        
       Speech and Language Processing (3rd ed. draft)
        
       Author : yeesian
       Score  : 181 points
       Date   : 2024-03-11 03:59 UTC (19 hours ago)
        
 (HTM) web link (web.stanford.edu)
 (TXT) w3m dump (web.stanford.edu)
        
       | 3abiton wrote:
        | Why are Hidden Markov Models in the related links (7 years
        | ago)? Did the PDF name change?
        
         | rhdunn wrote:
         | It's an appendix (supplementary material) describing Hidden
          | Markov Models. It is web-only, presumably because a) it is
          | not specific to NLP, and b) keeping it out reduces the print
         | specific to NLP, and b) keeping it out reduces the print
         | size/cost.
        
       | ilaksh wrote:
        | I almost believe that if I know how to make an LLM prompt and
        | how to
       | make an API call to OpenAI, Mistral, Claude 3, or together.ai,
       | then as an application programmer, I can skip this whole book. I
       | see people posting project specifications asking for NLP, named
       | entity extraction, etc. But most things in those jobs look like
       | they could be handled by an LLM, possibly even smaller than 7b,
       | and probably more robustly. My other assumption is that this
       | wasn't true at all three years ago.
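        | 
        | For example, a minimal sketch of the kind of call I mean
        | (model name and prompt are just illustrative), using the
        | openai Python client:
        | 
        |     # pip install openai; needs OPENAI_API_KEY set
        |     from openai import OpenAI
        | 
        |     client = OpenAI()
        |     resp = client.chat.completions.create(
        |         model="gpt-4",
        |         messages=[{"role": "user", "content":
        |             'Extract the named entities from "Apple hired '
        |             'Jane Doe in Paris." as a JSON list of '
        |             '{"text": ..., "type": ...} objects.'}],
        |     )
        |     print(resp.choices[0].message.content)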
       | 
        | Maybe machine learning engineers would like to explain what I
        | am missing?
        
         | wodenokoto wrote:
         | That's kinda like saying we don't need to teach CS students
         | anything about algorithms.
         | 
         | I don't think that book was ever aimed at application engineers
         | in the first place.
        
           | ilaksh wrote:
            | Well, I can see it as being useful to give aspiring ML
            | researchers a survey and a jumping-off point for digging
            | deeper.
           | 
            | But I also think that when the book was originally
            | written, it was about giving ML practitioners tools they
            | would actually use.
        
         | rhdunn wrote:
         | It depends on what you are trying to do.
         | 
          | LLMs are powerful, but can be difficult to work with when
          | doing post-processing or adding markup/annotations to the
          | text. For example, when asked to label terms in a sentence
          | they can produce varied output, e.g. sometimes listing
          | "word or clause: description of the meaning", which is hard
          | to parse in downstream tasks, and sometimes out of order.
         | 
          | I've also seen LLMs label split infinitives with the
          | correct meaning at the preposition instead of the whole
          | subclause. With NLP you would label either the start and
          | span of the expression (common) or the root of the
          | subclause, depending on the application. The NLP approach
          | would work for more complex nested and interconnected
          | expressions where the structure is not flat text.
         | 
         | If you ask it to generate XML or HTML, it will usually invent
         | its own markup, or sometimes generate invalid output. E.g. I've
         | seen it output HTML using tags like `<Question Word>` in some
         | cases.
         | 
          | When I asked it to generate CoNLL-U (a format it claimed in
          | its response to know about), it only output 3 columns, e.g.
          | "in _ PREP", which is not correct.
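          | 
          | For reference, a valid CoNLL-U token line has 10 tab-
          | separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS,
          | HEAD, DEPREL, DEPS, MISC); a hand-made sketch with spaces
          | in place of tabs:
          | 
          |     1  She    she  PRON  PRP  _  3  nsubj  _  _
          |     2  is     be   AUX   VBZ  _  3  aux    _  _
          |     3  going  go   VERB  VBG  _  0  root   _  _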
         | 
         | Asking `Can you lemmatize "She is going to the bank to get some
         | money."` I get `"She go(ing) to bank(er) for money(y)"` which
         | leaves out words, doesn't lemmatize some words correctly, and
         | has a format that is difficult to parse/interpret reliably.
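          | 
          | For comparison, the classic pipeline gives you one lemma
          | per token in a structured form; a minimal sketch with spaCy
          | (assuming the small English model is installed):
          | 
          |     # pip install spacy
          |     # python -m spacy download en_core_web_sm
          |     import spacy
          | 
          |     nlp = spacy.load("en_core_web_sm")
          |     doc = nlp("She is going to the bank to get some "
          |               "money.")
          |     # One (token, lemma) pair per word, in order.
          |     print([(t.text, t.lemma_) for t in doc])
          |     # e.g. [('She', 'she'), ('is', 'be'),
          |     #       ('going', 'go'), ...]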
         | 
          | LLMs also have a limited context window, so if you are
          | trying to process a large document, or have a large,
          | complex set of instructions (e.g. on how to label, process,
          | or format the data), they can lose those instructions and
          | deviate from what you are doing.
         | 
          | LLMs are also susceptible to being guided by the input
          | text, as the prompt and input are taken together. Thus, if
          | the text is talking about a different format or something
          | else, they can easily switch to that.
         | 
         | --
         | 
          | While NLP pipelines require more work to get right, they
          | can often be more computationally efficient, as they are
          | often far smaller than 7B parameters or use techniques that
          | don't involve ML/NNs at all.
         | 
          | You can build more custom pipelines by querying the
          | different features (part of speech, lemma, lexical
          | features, etc.) without having to reparse the data. You can
          | also keep the annotations consistent across those different
          | pipelines, e.g. when labelling, extracting, and storing the
          | data in a database for searching, so that a user can see
          | all the places where a term is referenced in a given text.
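          | 
          | A minimal sketch of that reuse with spaCy (the feature
          | choices here are just examples): parse once, then query
          | whichever annotations each downstream step needs:
          | 
          |     import spacy
          | 
          |     nlp = spacy.load("en_core_web_sm")
          |     doc = nlp("She is going to the bank.")  # parse once
          | 
          |     # The same annotations feed several pipelines:
          |     tags = [(t.text, t.pos_) for t in doc]
          |     lemmas = {t.text: t.lemma_ for t in doc}
          |     ents = [(e.text, e.label_) for e in doc.ents]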
        
         | Al-Khwarizmi wrote:
         | This is an interesting question.
         | 
         | As an NLP researcher who is in contact with companies that
         | demand NLP applications, I think we are not there yet. For
         | example, a company that wants to extract information from
         | medical records cannot use solutions like GPT-4 or Claude that
         | involve sending protected data to third parties in foreign
         | jurisdictions. Modest local models (like 7B models) don't work
         | so well for things like named entity extraction yet. And
         | furthermore, companies typically want some explanation and
         | accountability for the results (especially in sensitive
         | domains...) and sure, LLMs can explain, but you have no
         | guarantee that the explanation isn't hallucinated. When I
         | mention the possibility of hallucinations, companies typically
         | balk and say that they prefer the classic way.
         | 
         | My (potentially biased) opinion is that classic NLP still has
         | life left in it. For how long, I don't know. If small, locally-
         | runnable LLMs get much better and more reliable, addressing the
         | hallucination problems, what you mention might become largely
         | true for most engineering applications.
         | 
          | Also note that beyond engineering, things like syntactic
          | parsing are useful for scientific pursuits too, and at the
          | moment they seem to be out of reach of LLMs.
        
         | imjonse wrote:
         | It is not written for application programmers, it is a machine
         | learning book.
        
           | deskamess wrote:
           | Imagine a category - books for machines (some sort of LLM or
           | RAG augment). Kind of like the downloadable helicopter manual
           | that Trinity uses in the Matrix. Then having a higher order
           | integration onto your foundational muscle memory would be the
           | next step. Human (vCurrent + 0.1).
           | 
           | The nice thing is performance would be different based on the
           | human getting it. It kind of preserves some unique element.
        
         | miraculixx wrote:
          | Reliability is what you are missing. LLMs are not reliable.
        
           | esafak wrote:
           | And cost effectiveness.
        
         | vjerancrnjak wrote:
         | LLMs silently accumulate errors with the length of the sequence
         | predicted ("label bias" is the term that should be in that
         | book, but not in the context of LLMs).
         | 
          | The problem is exacerbated by a very big token dictionary.
          | Every new token you generate is sampled independently of
          | the tokens that have fallen outside the context window. If
          | you have any task that requires joint distribution
          | modeling, an LLM will fail and HMMs/CRFs will succeed. Of
          | course, it depends on when the problem manifests, and I do
          | not know that, but skipping the whole book is not
          | recommended.
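          | 
          | That joint modeling is exactly what the book's HMM/Viterbi
          | material covers: decoding picks the jointly most probable
          | tag sequence instead of sampling each label on its own. A
          | toy sketch (all probabilities made up):
          | 
          |     tags = ["NOUN", "VERB"]
          |     start = {"NOUN": 0.6, "VERB": 0.4}
          |     trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
          |              "VERB": {"NOUN": 0.8, "VERB": 0.2}}
          |     emit = {"NOUN": {"dogs": 0.5, "bark": 0.1},
          |             "VERB": {"dogs": 0.1, "bark": 0.6}}
          | 
          |     def viterbi(words):
          |         # v[t] = (prob of best path ending in t, path)
          |         v = {t: (start[t] * emit[t][words[0]], [t])
          |              for t in tags}
          |         for w in words[1:]:
          |             v = {t: max((v[s][0] * trans[s][t] *
          |                          emit[t][w], v[s][1] + [t])
          |                         for s in tags)
          |                  for t in tags}
          |         return max(v.values())
          | 
          |     print(viterbi(["dogs", "bark"]))
          |     # (0.126, ['NOUN', 'VERB'])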
         | 
         | For example, there were approaches for many language tasks in
         | that book that used CNNs (instead of something more principled)
         | and were extremely successful (despite the lack of joint
         | modeling). Who knows when this accumulation of errors becomes
         | measurable.
         | 
         | As long as your generation of tokens fits inside the context
         | size, you're modeling everything jointly. But I guess you need
         | to be aware that when you drop the furthest tokens to generate
         | next ones, the performance can drastically drop (multiplicative
         | accumulation of errors).
        
         | maebert wrote:
         | You're not entirely wrong.
         | 
         | 10 years ago, I started an ML & NLP consulting firm. Back then
          | nobody was doing NLP in production (spaCy hadn't come out
          | yet,
         | efficient vector embeddings were not around, the only
         | comprehensive library to do NLP was NLTK, which was academic
         | and hard to run in prod).
         | 
         | I recently revisited some of our projects from back then (like
         | this super fun one where we put 1 Million words into the
         | dictionary [1]) and realized how much faster we could have done
         | many of those tasks with LLMs.
         | 
         | Except we couldn't -- the whole "in production" part would have
         | made LLMs for the most minute tasks prohibitively expensive,
          | and that is not going to change for a while, sadly. So, if
          | you want to run something in prod that is not specifically
          | an LLM application, this book is still super valuable.
         | 
         | [1] https://www.nytimes.com/2015/10/04/technology/scouring-
         | the-w...
        
         | k8si wrote:
         | I suggest going through the exercise of seeing whether this is
         | true quantitatively. Get a business-relevant NER dataset
         | together (not CoNLL, preferably something that your boss or
         | customers would care about), run it against Mistral/etc, look
         | at the P/R/F1 scores, and ask "does this solve the problem that
         | I want to solve with NER". If the answer is 'yes', and you
         | could do all those things without reading the book or other NLP
         | educational sources, then yeah you're right, job's finished.
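          | 
          | The scoring itself is a few lines; a sketch using exact-
          | match spans as (start, end, type) tuples (one of several
          | NER scoring conventions):
          | 
          |     def prf1(gold, pred):
          |         gold, pred = set(gold), set(pred)
          |         tp = len(gold & pred)
          |         p = tp / len(pred) if pred else 0.0
          |         r = tp / len(gold) if gold else 0.0
          |         f = 2 * p * r / (p + r) if p + r else 0.0
          |         return p, r, f
          | 
          |     gold = [(0, 5, "ORG"), (10, 18, "PER")]  # dataset
          |     pred = [(0, 5, "ORG"), (10, 18, "LOC")]  # model
          |     print(prf1(gold, pred))  # (0.5, 0.5, 0.5)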
        
         | gillesjacobs wrote:
          | LLMs typically do worse on specific extractive and
          | classification tasks than a finetuned BERT-large model, so
          | no, you actually can't replace everything with LLM calls
          | and get similar performance.
         | 
          | (Cf. the BloombergGPT paper, all financial benchmark
          | tasks.)
         | 
         | And that's not even taking into account inference cost, but
         | that is a business case issue.
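          | 
          | For reference, the finetuning setup is only a few lines
          | these days; a rough sketch with the transformers library
          | (model choice, label set, and data are toy placeholders):
          | 
          |     from transformers import (
          |         AutoTokenizer, AutoModelForTokenClassification,
          |         DataCollatorForTokenClassification,
          |         Trainer, TrainingArguments)
          | 
          |     names = ["O", "B-ORG", "I-ORG"]  # toy tag set
          |     tok = AutoTokenizer.from_pretrained(
          |         "bert-large-cased")
          |     model = (AutoModelForTokenClassification
          |              .from_pretrained("bert-large-cased",
          |                               num_labels=len(names)))
          | 
          |     def encode(words, tags):
          |         # Align word tags to subwords (-100 = ignore).
          |         enc = tok(words, is_split_into_words=True,
          |                   truncation=True)
          |         enc["labels"] = [-100 if i is None else tags[i]
          |                          for i in enc.word_ids()]
          |         return enc
          | 
          |     train = [encode(["Apple", "hired", "her"],
          |                     [1, 0, 0])]  # toy data
          |     collator = DataCollatorForTokenClassification(tok)
          |     Trainer(model=model,
          |             args=TrainingArguments(
          |                 "out", num_train_epochs=1),
          |             train_dataset=train,
          |             data_collator=collator).train()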
        
           | reissbaker wrote:
           | BERT _is_ a language model! It was considered a  "large"
           | language model for its time, and it's even based on
           | transformers. It's a very small language model by today's
           | standards (340MM params), and is encoder-only instead of
           | decoder-only, but trying to draw a hard line between "BERT"
           | and "LLMs" is more about parameter count than capabilities --
           | in fact, the original GPT-3 paper benchmarked GPT-3 against
           | finetuned BERT-large and beat it on nearly every measure [1].
           | And BERT-large is not unique in being able to be finetuned;
           | finetuning Mistral 7B on your task should result in very good
           | performance (similarly, OpenAI allows finetuning of
           | gpt-3.5-turbo, and there are plenty of non-Mistral open-
           | source LLMs like Yi and Qwen that should do well too).
           | 
           | I'm not sure what BloombergGPT has to do with LLMs vs non-
           | LLMs; BloombergGPT is an LLM [2], and it defeating other LLMs
           | on financial benchmarks doesn't prove much about large
           | language models other than "LLMs can be trained to be better
           | at specific tasks."
           | 
           | 1: https://arxiv.org/abs/2005.14165
           | 
           | 2: https://www.bloomberg.com/company/press/bloomberggpt-50-bi
           | ll...
        
         | chaxor wrote:
          | If you want to process 100M documents, you want to use the
          | most performant and fastest option, and a 7B-param model
          | isn't going to be it.
         | 
          | Additionally, entity linking is an _extremely common_ task
          | at which LLMs fail pretty miserably (certainly if the
          | dictionary is custom/private). Additional work must be done
          | to somehow (!? many options here ?!) perform EL.
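          | 
          | To make the custom-dictionary point concrete: even a bare-
          | bones linker needs candidate generation plus
          | disambiguation. A toy stdlib-only sketch (the KB and the
          | scoring are stand-ins for the genuinely hard parts):
          | 
          |     import difflib
          | 
          |     # Toy private KB: surface form -> candidate ids.
          |     kb = {"J. Smith": ["PER:john_smith",
          |                        "PER:jane_smith"],
          |           "Acme Corp": ["ORG:acme"]}
          | 
          |     def link(mention, context):
          |         # Candidate generation: fuzzy-match the KB.
          |         m = difflib.get_close_matches(mention, kb,
          |                                       n=1, cutoff=0.6)
          |         if not m:
          |             return None
          |         # Disambiguation stand-in: crude keyword
          |         # overlap; real systems use embeddings etc.
          |         def score(ent):
          |             words = ent.split(":")[1].split("_")
          |             return sum(w in context for w in words)
          |         return max(kb[m[0]], key=score)
          | 
          |     print(link("J Smith", "jane was promoted"))
          |     # -> PER:jane_smith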
         | 
          | So, in the end, making a silver corpus from an LLM may be
          | an option for NER, to train a much, much smaller model. But
          | EL is _still_ not a 'plug and play' problem, and can
          | actually be pretty difficult to do "well" (using the modern
          | techniques of MHS, etc).
        
         | hintymad wrote:
          | Take LexisNexis as an example. They used to build
          | sophisticated NLP pipelines to squeeze information out of
          | legal documents: part-of-speech tagging, dependency
          | parsing, NER, topic modeling, relation extraction, event
          | extraction, summarization, etc. They also spent millions of
          | dollars on custom training just to expand the entity types.
          | The whole process is error-prone, painful to maintain, and
          | expensive. Similarly, McDonald's must have struggled a lot
          | building their automatic ordering system.
         | 
          | LLMs must be a godsend for these companies. All of a
          | sudden, low-level tasks like POS tagging can be eliminated.
          | Tasks like NER have only limited use, and companies can
          | enjoy orders of magnitude more entity types almost for
          | free. Tasks like intent slotting and topic modeling become
          | trivial compared to the pre-LLM-era pipelines.
        
         | abhgh wrote:
         | You're correct except for the use-cases where one of these come
         | into play:
         | 
         | A. Latency: for some systems, you need near real-time
         | predictions. LLMs (today) are slow for that.
         | 
          | B. Cost: when the low dev effort (for building and
          | deploying an ML model) and low sample complexity (i.e.
          | zero/few-shot) don't translate into proportionate monetary
          | gains over what you pay for LLM usage.
         | 
          | C. Precision: when you want the model to reliably tell you
          | when it doesn't know the correct answer. Hallucination is
          | part of it - but I think of this requirement as the broader
          | umbrella of good uncertainty quantification. I think there
          | are two reasons why this is worse for LLMs: (1) traditional
          | ML models also suffer from this, but there are some well-
          | known ways to mitigate it, while for LLMs there is still no
          | universal or accepted way to do this reliably; (2) the
          | quality of the language an LLM generates seems more likely
          | to deceive you when it is wrong. I don't know how to think
          | about this scientifically - maybe as LLMs proliferate
          | people will build appropriate mental defenses?
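          | 
          | On (1), a sketch of the classic mitigation - predict, but
          | abstain when the model's own confidence is below a task-
          | specific threshold (calibrating the probabilities first
          | makes the threshold meaningful):
          | 
          |     from sklearn.linear_model import LogisticRegression
          | 
          |     X = [[0, 1], [1, 0], [1, 1], [0, 0]]  # toy features
          |     y = ["neg", "pos", "pos", "neg"]      # toy labels
          |     clf = LogisticRegression().fit(X, y)
          | 
          |     p = clf.predict_proba([[1, 0]])[0]
          |     label = (clf.classes_[p.argmax()]
          |              if p.max() >= 0.8 else "abstain")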
         | 
          | There is also the practical problem of prompt
          | transferability across LLMs: what works best for one LLM
          | might not work well for another, and there is no systematic
          | way to optimally modify the original prompt. This is
          | painful in setups where you want to avoid lock-in. But I
          | didn't put it in the list because it seems to be a problem
          | for niche groups - everyone seems to be busy getting stuff
          | working with one LLM. Maybe this will become a larger issue
          | later.
        
       | stakhanov wrote:
       | It was the first edition of this book that, when I read it 20
       | years ago, got me "hooked" on computer science as a science
       | rather than as just a fun thing to do (which I had been doing for
       | several years prior).
       | 
       | When I met Dan Jurafsky, 10 years later, I definitely felt none
       | of that disappointment that often goes with meeting one of your
       | "childhood heroes" and thanked him for writing the book and the
       | impact it had on my life.
        
         | uticus wrote:
          | The 1st ed (2000) is also on my bookshelf, a good read.
          | I'll have to look through what's been updated for a
          | 24-year comparison.
          | 
          | Although much in computer science is reinventing the wheel,
          | LLMs so far strike me as a tool that doesn't have a ton of
          | analogues in history.
        
         | abhgh wrote:
          | This, and Manning and Schütze [1], were my hooks into NLP.
         | 
         | [1] https://nlp.stanford.edu/fsnlp/
        
       | stevesimmons wrote:
       | Chapter 10 looks like a good, self-contained intro to
       | Transformers and LLMs
       | 
       | https://web.stanford.edu/~jurafsky/slp3/10.pdf
        
         | garyiskidding wrote:
          | Yep, I also started with chapter 10; really nice content on
          | Transformers and self-attention.
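          | 
          | For anyone skimming, the operation the chapter builds up
          | to is tiny; a numpy sketch of scaled dot-product self-
          | attention, softmax(Q K^T / sqrt(d)) V:
          | 
          |     import numpy as np
          | 
          |     def attention(Q, K, V):
          |         s = Q @ K.T / np.sqrt(K.shape[-1])
          |         w = np.exp(s - s.max(axis=-1, keepdims=True))
          |         w = w / w.sum(axis=-1, keepdims=True)  # softmax
          |         return w @ V
          | 
          |     x = np.random.rand(4, 8)          # 4 tokens, d=8
          |     print(attention(x, x, x).shape)   # (4, 8)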
        
       | imjonse wrote:
        | It has a good treatment of recurrent neural networks, which
        | is nice since the theory behind them is becoming relevant
        | again.
       | 
        | For comparison, both Bishop's new deep learning book and
        | Simon Prince's Understanding Deep Learning get straight to
        | transformers and only mention RNN/LSTM in passing - they were
        | written just as state space models and hybrid
        | RNN/transformer models started showing good results.
        
       | empiricus wrote:
        | I like how the LLM is the new Hidden Markov Model. Basically
        | both are among the simplest structures competent enough to
        | learn/process the available data at their moment in history.
        
       | cschmidt wrote:
       | I've been downloading and reading various chapters of this book
       | for what seems like years now. I'd love a hard copy. I know it
       | says "When will the whole book be finished? Don't ask." But it
       | seems like when the missing Chapter 12 is added ("We also expect
       | to release Chapter 12 soon in an updated release.") it would be a
       | great time to print it. I can hope....
        
         | techwizrd wrote:
         | Same, I fondly recall reading it years ago and I would love a
         | hard copy as well. I've thought about printing it myself, but
         | it seems an awful waste when updates are coming.
        
       ___________________________________________________________________
       (page generated 2024-03-11 23:01 UTC)