[HN Gopher] Speech and Language Processing (3rd ed. draft)
___________________________________________________________________
Speech and Language Processing (3rd ed. draft)
Author : yeesian
Score : 181 points
Date : 2024-03-11 03:59 UTC (19 hours ago)
(HTM) web link (web.stanford.edu)
(TXT) w3m dump (web.stanford.edu)
| 3abiton wrote:
| Why are Hidden Markov Models in the related links (7 years
| ago)? Did the PDF name change?
| rhdunn wrote:
| It's an appendix (supplementary material) describing Hidden
| Markov Models. It is web only presumably because a) it is not
| specific to NLP, and b) keeping it out reduces the print
| size/cost.
| ilaksh wrote:
| I almost believe that if I know how to make an LLM prompt and how to
| make an API call to OpenAI, Mistral, Claude 3, or together.ai,
| then as an application programmer, I can skip this whole book. I
| see people posting project specifications asking for NLP, named
| entity extraction, etc. But most things in those jobs look like
| they could be handled by an LLM, possibly even smaller than 7b,
| and probably more robustly. My other assumption is that this
| wasn't true at all three years ago.
|
| Maybe machine learning engineers want to explain what I am
| missing?
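|
| For example, a minimal sketch of what I mean (the model name,
| prompt, and label set are placeholders I picked):
|
|     # Zero-shot NER via an LLM API instead of an NLP pipeline.
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the env
|     resp = client.chat.completions.create(
|         model="gpt-4",
|         messages=[{
|             "role": "user",
|             "content": 'Extract named entities (PERSON, ORG, '
|                        'LOC) from: "Tim Cook met investors at '
|                        'Apple Park." Reply as a JSON list of '
|                        '{"text": ..., "label": ...} objects.',
|         }],
|     )
|     print(resp.choices[0].message.content)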
| wodenokoto wrote:
| That's kinda like saying we don't need to teach CS students
| anything about algorithms.
|
| I don't think that book was ever aimed at application engineers
| in the first place.
| ilaksh wrote:
| Well, I can see it being useful as a survey and jumping-off
| point for aspiring ML researchers to possibly dig deeper.
|
| But I also think when they created that book originally it
| was about giving ML practitioners tools they would use.
| rhdunn wrote:
| It depends on what you are trying to do.
|
| LLMs are powerful, but can be difficult to work with when it
| comes to post-processing or adding markup/annotations to the
| text. For example, when asked to label terms in a sentence,
| they can produce varied output, e.g. sometimes listing "word or
| clause: description of the meaning", which is hard to parse in
| downstream tasks, or sometimes listing terms out of order.
|
| I've also seen LLMs label split infinitives with the correct
| meaning at the preposition instead of at the whole subclause.
| With NLP you would label either the start and span of the
| labelled text (common) or the root of the subclause, depending
| on the application. The NLP approach would also work for more
| complex nested and interconnected expressions where the
| structure is not flat text.
|
| If you ask it to generate XML or HTML, it will usually invent
| its own markup, or sometimes generate invalid output; e.g. I've
| seen it output HTML using tags like `<Question Word>`.
|
| When I asked it to generate CoNLL-U (a format it evidently knew
| about, judging from its response), it only output 3 columns,
| e.g. "in _ PREP", which is not correct.
|
| Asking `Can you lemmatize "She is going to the bank to get some
| money."` yields `"She go(ing) to bank(er) for money(y)"`, which
| leaves out words, lemmatizes some words incorrectly, and uses a
| format that is difficult to parse/interpret reliably.
|
| LLMs also have a limited context window, so if you are trying
| to process a large document, or have a large, complex set of
| instructions (e.g. on how to label, process, or format the
| data), then they can lose those instructions and deviate from
| what you asked for.
|
| LLMs are also susceptible to being steered by the input text,
| as the prompt and input are taken together. Thus, if the text
| under analysis talks about a different format or topic, the
| model can easily switch to that.
|
| --
|
| While NLP pipelines require more work to get right, they can
| often be more efficient computationally, as they are often far
| smaller than 7B-parameter models, or use techniques that don't
| involve ML/NNs at all.
|
| You can build more custom pipelines by querying the different
| features (part of speech, lemma, lexical features, etc.)
| without having to reparse the data, and you can keep the
| annotations consistent across those pipelines, e.g. when
| labelling, extracting, and storing the data in a database for
| searching, so that a user can see all the places where a term
| is referenced in a given text.
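|
| A small sketch of that kind of pipeline (spaCy is my choice of
| library here; assumes the en_core_web_sm model is installed):
|
|     # One parse yields many queryable, consistently structured
|     # annotations.
|     import spacy
|
|     nlp = spacy.load("en_core_web_sm")
|     doc = nlp("She is going to the bank to get some money.")
|
|     for tok in doc:
|         # Deterministic columns: text, lemma, POS, head.
|         print(tok.text, tok.lemma_, tok.pos_, tok.head.text,
|               sep="\t")
|
|     # The same Doc can be queried again (entities, offsets)
|     # without reparsing - stable enough to store in a database.
|     print([(ent.text, ent.label_, ent.start_char, ent.end_char)
|            for ent in doc.ents])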
| Al-Khwarizmi wrote:
| This is an interesting question.
|
| As an NLP researcher who is in contact with companies that
| demand NLP applications, I think we are not there yet. For
| example, a company that wants to extract information from
| medical records cannot use solutions like GPT-4 or Claude that
| involve sending protected data to third parties in foreign
| jurisdictions. Modest local models (like 7B models) don't work
| so well for things like named entity extraction yet. And
| furthermore, companies typically want some explanation and
| accountability for the results (especially in sensitive
| domains...) and sure, LLMs can explain, but you have no
| guarantee that the explanation isn't hallucinated. When I
| mention the possibility of hallucinations, companies typically
| balk and say that they prefer the classic way.
|
| My (potentially biased) opinion is that classic NLP still has
| life left in it. For how long, I don't know. If small, locally-
| runnable LLMs get much better and more reliable, addressing the
| hallucination problems, what you mention might become largely
| true for most engineering applications.
|
| Also note that beyond engineering, things like syntactic
| parsing are useful for scientific pursuits too, and at the
| moment those seem to be out of reach of LLMs.
| imjonse wrote:
| It is not written for application programmers, it is a machine
| learning book.
| deskamess wrote:
| Imagine a category - books for machines (some sort of LLM or
| RAG augment). Kind of like the downloadable helicopter manual
| that Trinity uses in The Matrix. Having a higher-order
| integration onto your foundational muscle memory would be the
| next step. Human (vCurrent + 0.1).
|
| The nice thing is that performance would differ based on the
| human receiving it. It kind of preserves some unique element.
| miraculixx wrote:
| Reliability is what you are missing. LLMs are not reliable.
| esafak wrote:
| And cost effectiveness.
| vjerancrnjak wrote:
| LLMs silently accumulate errors with the length of the sequence
| predicted ("label bias" is the term that should be in that
| book, but not in the context of LLMs).
|
| The problem is exacerbated by a very big token dictionary.
| Every new token you generate is sampled independently of tokens
| that have fallen outside the context window. If you have any
| task that requires joint distribution modeling, an LLM will
| fail where HMMs/CRFs will succeed. Of course, it depends on
| when the problem manifests, and I do not know that, but
| skipping the whole book is not recommended.
|
| For example, there were approaches for many language tasks in
| that book that used CNNs (instead of something more principled)
| and were extremely successful (despite the lack of joint
| modeling). Who knows when this accumulation of errors becomes
| measurable.
|
| As long as your generation of tokens fits inside the context
| size, you're modeling everything jointly. But I guess you need
| to be aware that when you drop the furthest tokens to generate
| the next ones, performance can drop drastically (multiplicative
| accumulation of errors).
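|
| A back-of-the-envelope illustration of that multiplicative
| accumulation (the numbers are mine, purely for intuition):
|
|     # If each token is correct independently with probability
|     # p, a length-n output is fully correct with probability
|     # p**n.
|     for p in (0.999, 0.99, 0.95):
|         for n in (10, 100, 1000):
|             print(f"p={p}, n={n}: P(all ok) = {p**n:.5f}")
|     # e.g. p=0.99 gives ~0.366 at n=100, ~0.00004 at n=1000.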
| maebert wrote:
| You're not entirely wrong.
|
| 10 years ago, I started an ML & NLP consulting firm. Back then
| nobody was doing NLP in production (spaCy hadn't come out yet,
| efficient vector embeddings were not around, and the only
| comprehensive library for NLP was NLTK, which was academic and
| hard to run in prod).
|
| I recently revisited some of our projects from back then (like
| this super fun one where we put 1 million words into the
| dictionary [1]) and realized how much faster we could have done
| many of those tasks with LLMs.
|
| Except we couldn't -- the whole "in production" part would have
| made using LLMs for the most minute tasks prohibitively
| expensive, and that is not going to change for a while, sadly.
| So, if you want to run something in prod that is not
| specifically an LLM application, this book is still super
| valuable.
|
| [1] https://www.nytimes.com/2015/10/04/technology/scouring-
| the-w...
| k8si wrote:
| I suggest going through the exercise of seeing whether this is
| true quantitatively. Get a business-relevant NER dataset
| together (not CoNLL, preferably something that your boss or
| customers would care about), run it against Mistral/etc, look
| at the P/R/F1 scores, and ask "does this solve the problem that
| I want to solve with NER". If the answer is 'yes', and you
| could do all those things without reading the book or other NLP
| educational sources, then yeah you're right, job's finished.
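|
| A sketch of that scoring step (toy data, exact-match spans;
| swap in your own dataset and model output):
|
|     # Exact-match NER evaluation: P/R/F1 over (span, label)
|     # pairs.
|     gold = {("Acme Corp", "ORG"), ("Jane Doe", "PER"),
|             ("Berlin", "LOC")}
|     pred = {("Acme Corp", "ORG"), ("Jane", "PER")}
|
|     tp = len(gold & pred)
|     precision = tp / len(pred) if pred else 0.0
|     recall = tp / len(gold) if gold else 0.0
|     f1 = (2 * precision * recall / (precision + recall)
|           if precision + recall else 0.0)
|     print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
|     # -> P=0.50 R=0.33 F1=0.40; then ask whether that solves
|     # the problem you actually care about.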
| gillesjacobs wrote:
| LLMs typically do worse on specific extractive and
| classification tasks than a finetuned BERT-large model, so no,
| you actually can't replace everything with LLM calls at similar
| performance.
|
| (Cf. the BloombergGPT paper, all financial benchmark tasks.)
|
| And that's not even taking into account inference cost, but
| that is a business case issue.
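|
| For reference, a minimal finetuning sketch (Hugging Face
| transformers; the label set and single training example are
| toy placeholders, not the BloombergGPT setup):
|
|     import torch
|     from transformers import (AutoTokenizer,
|                               AutoModelForTokenClassification)
|
|     labels = ["O", "B-ORG", "I-ORG"]  # toy tag set
|     tok = AutoTokenizer.from_pretrained("bert-base-cased")
|     model = AutoModelForTokenClassification.from_pretrained(
|         "bert-base-cased", num_labels=len(labels))
|
|     # One toy sentence: "Apple hired engineers" -> B-ORG O O
|     words = ["Apple", "hired", "engineers"]
|     word_tags = [1, 0, 0]
|
|     enc = tok(words, is_split_into_words=True,
|               return_tensors="pt")
|     # Align word-level tags to subwords; -100 = ignored.
|     tags = [-100 if i is None else word_tags[i]
|             for i in enc.word_ids()]
|     enc["labels"] = torch.tensor([tags])
|
|     optim = torch.optim.AdamW(model.parameters(), lr=5e-5)
|     model.train()
|     for _ in range(3):  # tiny demo loop, not a real schedule
|         loss = model(**enc).loss
|         loss.backward()
|         optim.step()
|         optim.zero_grad()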
| reissbaker wrote:
| BERT _is_ a language model! It was considered a "large"
| language model for its time, and it's even based on
| transformers. It's a very small language model by today's
| standards (340M params), and is encoder-only instead of
| decoder-only, but trying to draw a hard line between "BERT"
| and "LLMs" is more about parameter count than capabilities --
| in fact, the original GPT-3 paper benchmarked GPT-3 against
| finetuned BERT-large and beat it on nearly every measure [1].
| And BERT-large is not unique in being able to be finetuned;
| finetuning Mistral 7B on your task should result in very good
| performance (similarly, OpenAI allows finetuning of
| gpt-3.5-turbo, and there are plenty of non-Mistral open-
| source LLMs like Yi and Qwen that should do well too).
|
| I'm not sure what BloombergGPT has to do with LLMs vs non-
| LLMs; BloombergGPT is an LLM [2], and it defeating other LLMs
| on financial benchmarks doesn't prove much about large
| language models other than "LLMs can be trained to be better
| at specific tasks."
|
| 1: https://arxiv.org/abs/2005.14165
|
| 2: https://www.bloomberg.com/company/press/bloomberggpt-50-bi
| ll...
| chaxor wrote:
| If you want to process 100M documents, you want the most
| performant and fastest option - which a 7B-param model isn't
| going to be.
|
| Additionally, entity linking is an _extremely common_ task at
| which LLMs fail pretty miserably (certainly if the dictionary
| is custom/private). Additional work must be done to somehow
| (!? many options here ?!) perform EL.
|
| So, in the end, making a silver corpus from an LLM may be an
| option for NER, to train a much smaller model. But EL is
| _still_ not a 'plug and play' problem, and can actually be
| pretty difficult to do "well" (using the modern techniques of
| MHS, etc).
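|
| To make that concrete, a deliberately naive EL sketch (my toy
| illustration: fuzzy string match against a private alias
| dictionary; real systems add candidate generation, context
| disambiguation, and NIL detection):
|
|     import difflib
|
|     # Hypothetical private KB: canonical id -> known aliases.
|     kb = {"Q1": ["Apple Inc.", "Apple"],
|           "Q2": ["Apple Records"]}
|     alias_to_id = {a: eid for eid, aliases in kb.items()
|                    for a in aliases}
|
|     def link(mention, cutoff=0.8):
|         # Nearest alias above the cutoff, else NIL (None).
|         match = difflib.get_close_matches(
|             mention, list(alias_to_id), n=1, cutoff=cutoff)
|         return ((alias_to_id[match[0]], match[0])
|                 if match else None)
|
|     print(link("Aple Inc"))   # ('Q1', 'Apple Inc.')
|     print(link("Banana Co"))  # None -> NIL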
| hintymad wrote:
| Take LexisNexis as an example. They used to build
| sophisticated NLP pipelines to squeeze information out of
| legal documents: part-of-speech tagging, dependency parsing,
| NER, topic modeling, relation extraction, event extraction,
| summarization, etc. They also spent millions of dollars on
| custom training just to expand the entity types. The whole
| process is error prone, painful to maintain, and expensive.
| Similarly, McDonald's must have struggled a lot building their
| automated ordering system.
|
| LLMs must be a godsend for these companies. All of a sudden,
| low-level tasks like POS tagging can be eliminated. Tasks like
| NER have only limited use left, and the companies can enjoy
| orders of magnitude more entity types almost for free. Tasks
| like intent slotting and topic modeling become trivial compared
| to the pre-LLM-era pipelines.
| abhgh wrote:
| You're correct, except for the use-cases where one of these
| comes into play:
|
| A. Latency: for some systems, you need near real-time
| predictions. LLMs (today) are slow for that.
|
| B. Cost: when the low dev effort (for building and deploying
| an ML model) and low sample complexity (i.e. zero/few-shot)
| don't translate into proportionate monetary gains over what
| you pay for LLM usage.
|
| C. Precision: when you want the model to reliably tell you when
| it doesn't know the correct answer. Hallucination is a part of
| it - but I think of this requirement as the broader umbrella of
| good uncertainty quantification. I think this is worse for LLMs
| for two reasons: (1) traditional ML models also suffer from it,
| but there are some well-known ways to mitigate it, whereas for
| LLMs there is still no universal or accepted way to do this
| reliably; (2) the quality of the language an LLM generates
| seems more likely to deceive you when it is wrong. I don't know
| how to think scientifically about this - maybe as LLMs
| proliferate, people will build appropriate mental defenses?
|
| There is also the practical problem of prompt transferability
| across LLMs: what works best for one LLM might not work well
| for another, and there is no systematic way to optimally modify
| the original prompt. This is painful in setups where you're
| trying to avoid lock-in. But I didn't put it in the list
| because it seems to be a problem for niche groups - everyone
| seems to be busy getting stuff working with one LLM. Maybe this
| will become a larger issue later.
| stakhanov wrote:
| It was the first edition of this book that, when I read it 20
| years ago, got me "hooked" on computer science as a science
| rather than as just a fun thing to do (which I had been doing for
| several years prior).
|
| When I met Dan Jurafsky, 10 years later, I definitely felt none
| of that disappointment that often goes with meeting one of your
| "childhood heroes" and thanked him for writing the book and the
| impact it had on my life.
| uticus wrote:
| 1st ed (2000) is also on my bookshelf, a good read. Will have
| to look through what's been updated for a 24-year comparison.
|
| Although much in computer science is reinventing the wheel,
| LLMs so far strike me as a tool that doesn't have a ton of
| analogues in history.
| abhgh wrote:
| This, and Manning and Schutze [1] were my hooks into NLP.
|
| [1] https://nlp.stanford.edu/fsnlp/
| stevesimmons wrote:
| Chapter 10 looks like a good, self-contained intro to
| Transformers and LLMs:
|
| https://web.stanford.edu/~jurafsky/slp3/10.pdf
| garyiskidding wrote:
| Yep. I also started with chapter 10 - really nice content on
| Transformers and self-attention.
| imjonse wrote:
| It has a good treatment of recurrent neural networks, which is
| nice since the theory behind them starts being relevant again.
|
| For comparison, both Bishop's new deep learning book and Simon
| Prince's Understanding Deep Learning get straight to
| transformers and only mention RNNs/LSTMs in passing - they were
| written just as state space models and hybrid RNN/transformer
| models started showing good results.
| empiricus wrote:
| I like how the LLM is the new Hidden Markov Model. Basically,
| both are among the simplest structures that are competent
| enough to learn from/process the available data at their
| moment in history.
| cschmidt wrote:
| I've been downloading and reading various chapters of this book
| for what seems like years now. I'd love a hard copy. I know it
| says "When will the whole book be finished? Don't ask." But it
| seems like when the missing Chapter 12 is added ("We also expect
| to release Chapter 12 soon in an updated release.") it would be a
| great time to print it. I can hope....
| techwizrd wrote:
| Same, I fondly recall reading it years ago and I would love a
| hard copy as well. I've thought about printing it myself, but
| it seems an awful waste when updates are coming.
___________________________________________________________________
(page generated 2024-03-11 23:01 UTC)