[HN Gopher] Show HN: Natural Language Processing Demystified (Pa...
___________________________________________________________________
Show HN: Natural Language Processing Demystified (Part One)
Hi HN: I published part one of my free NLP course. The course is
intended to help anyone who knows Python and a bit of math go from
the very basics all the way to today's mainstream models and
frameworks. I strive to balance theory and practice and so every
module consists of detailed explanations and slides along with a
Colab notebook (in most modules) putting the theory into practice.
In part one, we cover text preprocessing, how to turn text into
numbers, and multiple ways to classify and search text using
"classical" approaches. And along the way, we'll pick up useful
bits on how to use tools such as spaCy and scikit-learn. No
registration required: https://www.nlpdemystified.org/
Author : mothcamp
Score : 151 points
Date : 2022-05-18 10:55 UTC (12 hours ago)
(HTM) web link (www.nlpdemystified.org)
(TXT) w3m dump (www.nlpdemystified.org)
| jll29 wrote:
| NLP researcher here. It's great to see many offerings for courses
| and tutorials, and NLP has made a lot of progress, in terms of
| both its science as well as its re-usable software artifacts
| (libraries & notebooks, standalone tools).
|
| But what saddens me is that too many people try to dive into
| NLP without trying to understand language & linguistics first.
| For example, you can run a part-of-speech (POS) tagger in three
| lines of Python, but you will still not know much about what
| parts of speech are, which languages have which ones, or what
| function they serve in linguistic theory and practical
| applications.
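|
| (Those three lines are roughly the following - a spaCy sketch,
| assuming the small English model has been downloaded:)
|
|     import spacy
|     nlp = spacy.load("en_core_web_sm")  # small English pipeline
|     print([(t.text, t.pos_) for t in nlp("Time flies like an arrow.")])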
|
| What are the advantages of using the C7 tagset over the C5 or
| PENN tagsets?
|
| Why is AT sometimes called DET?
|
| etc.
|
| I recommend people spend a bit of time to read an(y) introduction
| to linguistics textbook before diving into NLP, then the second
| investment will be worth so much more.
| mywaifuismeta wrote:
| I'm generally not a fan of these kinds of high-level tutorials
| that tell you "use X library to get Y result" - they're just not
| good for learning. But any content that tries to sell you on
| learning ML/NLP/etc in a few weeks is just that. I understand
| people want to make money by targeting a large audience, but it
| makes me sad when I see the (vast, vast) majority of
| practitioners having no real understanding of ML (or NLP) and
| just blindly applying libraries.
|
| I don't think you necessarily need a linguistics background for
| NLP, but I think you need either a strong linguistics OR ML
| background so that you know what's going on under the hood and
| can make connections. Anyone can call into Huggingface, you
| don't need a course for that.
| Der_Einzige wrote:
| Doing non-trivial things (more than .train or .generate) with
| huggingface models definitely requires tutorials or other
| resources; not sure what you're on about at all.
| scarface74 wrote:
| Everything eventually gets boiled down to libraries. The
| purpose of technology is to get things done. I could just as
| well say it makes me sad that today's developers use high-level
| languages without ever knowing assembly. And a chip designer
| could likewise be saddened that assembly language programmers
| never had to learn how processors are created.
| mirker wrote:
| It's fine when the library is a tight abstraction.
| Unfortunately, ML libraries are leaky.
|
| Example: take a classification model and change the output
| dimensions without understanding the model.
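|
| Something like this (a rough sketch using Hugging Face
| transformers; the model name and label count are just examples):
|
|     from transformers import AutoModelForSequenceClassification
|     # Nothing errors out, but the new 5-label classification head
|     # is randomly initialized: without fine-tuning, the "predictions"
|     # are noise even though the body is pretrained.
|     model = AutoModelForSequenceClassification.from_pretrained(
|         "bert-base-uncased", num_labels=5)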
| nmstoker wrote:
| Yes, the challenge people then face is that if they lack
| intuition for the subject, they can't spot obvious issues.
|
| We've all seen how ML people don't necessarily need domain
| skills to solve a problem (i.e. I don't need to speak
| Vietnamese to make a "passable" ML translator), but it's not
| long before the lack of knowledge starts to show up as
| embarrassing shortfalls - being too arm's-length about any
| topic is a recipe for disaster!
| adamsmith143 wrote:
| I'm not at all sympathetic to this viewpoint. The Deep Learning
| revolution has shown us time and time again that Deep Learning
| experts universally outperform subject-matter experts on
| modelling performance. I am almost 100% certain that the teams
| building the big Transformers which are now by far the best NLP
| models (OpenAI, Meta, Google Brain, DeepMind, etc.) are made up
| not of linguistics experts but of Deep Learning experts.
| amitport wrote:
| These groups are not mutually exclusive.
| sam_lowry_ wrote:
| In practice, they are, AFAIK.
| LunaSea wrote:
| Is this still true in an era where most NLP problems use
| language models as a solution?
| k8si wrote:
| Language models as a solution to what problems?
|
| Yes, you can easily use AutoModel.from_pretrained('bert-base-
| uncased') to convert some text into a vector of floats. What
| then?
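|
| (That easy part looks roughly like this mean-pooling sketch -
| one common but by no means canonical way to get a vector:)
|
|     import torch
|     from transformers import AutoModel, AutoTokenizer
|     tok = AutoTokenizer.from_pretrained("bert-base-uncased")
|     model = AutoModel.from_pretrained("bert-base-uncased")
|     inputs = tok("I'm good, thanks", return_tensors="pt")
|     with torch.no_grad():
|         vec = model(**inputs).last_hidden_state.mean(dim=1)
|     # 'vec' has shape (1, 768); everything hard starts after this line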
|
| What are the properties of downstream (aka actually useful)
| datasets that might make few-shot transfer difficult or easy?
| How much data do your users need to provide to get a useful
| classifier/tagger/etc. for their problem domain?
|
| Why do seemingly minor perturbations like typos or concatenating
| a few numbers result in major differences in representations,
| and how do you detect/test/mitigate this to ensure model
| behavior doesn't result in weird downstream system behavior?
|
| How do you train a dialog system to map 'I'm good, thanks' to
| 'no'? How do you train a sentiment classifier to learn from
| contextual/pragmatic cues rather than purely lexical ones
| (example: 'I hate to say it but this product solves all my
| problems.' - positive or negative sentiment?)
|
| How bad is the user experience of your Arabic-speaking
| customers compared to that of your English-speaking
| customers, and what can you do to measure this and fix it?
|
| My linguistics background really helps me think through a lot
| of these 'applied' NLP problems. Knowing how to make matmuls
| fast on GPUs and knowing exactly how multihead self-attention
| works is definitely useful too, but that's only one piece of
| building systems with NLP components.
| riku_iki wrote:
| > My linguistics background really helps me think through a
| lot of these 'applied' NLP problems.
|
| There are many benchmarks where LMs absolutely outperform
| mechanical linguistics-based solutions.
|
| Do you have success stories where a solution significantly
| outperforms in the opposite direction?
| k8si wrote:
| There's no competition between linguistics and ML/NLP,
| they have completely different goals as fields.
|
| I meant that my linguistics background helps me
| understand & solve problems: studying linguistic field
| work has helped me design crowd labeling jobs, knowing
| about morphology helps me understand why BPE tokenizers work so
| well and when they might not (quick sketch at the end of this
| comment), knowing about
| syntax/dominant word order makes me think that
| multilingual Bert should probably do something more
| intelligent with positional embeddings, methods from
| psycholinguistics are useful for understanding
| entropy/surprisal wrt LM next-word probabilities... just
| a few examples but the list could go on.
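|
| (The BPE sketch, with GPT-2's BPE tokenizer standing in as an
| example; the word is arbitrary:)
|
|     from transformers import AutoTokenizer
|     tok = AutoTokenizer.from_pretrained("gpt2")  # GPT-2 uses BPE
|     # rare/unseen words get split into subword pieces that often,
|     # but not always, line up with morpheme-like units
|     print(tok.tokenize("antidisestablishmentarianism"))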
| gattilorenz wrote:
| I think so. First of all, knowing some linguistics will teach
| you terms and concepts (e.g. parse tree, phrase, morpheme,
| phoneme, etc) that will both help you find relevant
| literature and avoid reinventing terms for stuff that is
| widely known (so others will more readily find your work).
|
| Language models are _currently_ the best solution for many
| problems, but it's hard to predict how we will move forward
| from here. Maybe the inclusion of linguistic information, or
| linguistic-inspired knowledge, or whatever, will be the key
| to having better results, or saving training time/resources.
| With no linguistics background, I imagine it's hard to get
| ideas going in that direction (and test if it's actually a
| _good_ direction)
| mothcamp wrote:
| I agree. I think having linguistics knowledge can help
| especially in applied situations. Linguistics knowledge can
| help create fallback systems when an ML system fails, or
| help build rules to amplify or dampen the confidence of a
| response from an ML system, or aid in the engineering of a
| system (all that comes before or after the ML blackbox).
|
| Sort of like an algorithmic trader knowing market
| microstructure intimately (versus only pure statistics).
| meristem wrote:
| Do you have specific book suggestions?
| PainfullyNormal wrote:
| > I recommend people spend a bit of time to read an(y)
| introduction to linguistics textbook
|
| Do you have a favorite you can recommend?
| sam_lowry_ wrote:
| Elements by Tesniere. I am not kidding, there is a shitload
| of knowledge there, largely forgotten by the time NLP merged
| with CompSci.
|
| Jurafsky and Martin, and Manning and Schutze, are great books
| for computer scientists, but they do not teach you about
| language itself.
| [deleted]
| mothcamp wrote:
| In addition to Jurafsky and Martin
| (https://web.stanford.edu/~jurafsky/slp3/), I also like Emily
| Bender's book:
| https://www.goodreads.com/book/show/18128399-linguistic-
| fund...
|
| Bender's book is NOT an end-to-end text though imo. It's more
| a central jumping off point. So you can read about a concept
| and if it sounds interesting, search more about it.
| rmellow wrote:
| In addition to Jurafsky and Martin, I recommend Foundations
| of Statistical NLP by Manning and Schutze:
| https://nlp.stanford.edu/fsnlp/promo/
| screye wrote:
| It makes sense to completely disregard language when looking at
| modern NLP solutions. In some sense, 'hand engineering'
| anything is looked down upon.
|
| Transformers and scaling laws have made it such that the only
| thing that truly matters is your ability to build a model that
| can scale computationally and parametrically. The second is
| figuring out how to make more data 'viable' for use within such
| a data-hungry model's encoding.
|
| Look at whoever has written the last 20 seminal papers in NLP:
| almost none of them have a strong background in linguistics.
| Vision went through a similar period of forced obsolescence
| during the 2012-2016 AlexNet -> VGG -> Inception -> ResNet
| transition.
|
| It is unfortunate. But, time is limited and most researchers
| can only spare enough time to learn a few new things.
| Unfortunately for linguistics, it does not rank that high.
| amitport wrote:
| NLP is a vast field nowadays; you can solve a research problem
| with a novel transformer architecture (for example) without
| knowing anything about linguistics. There is plenty of room to
| go around (same goes for vision: you don't really need a
| classical vision background as much as you used to).
|
| (also an NLP researcher. Knows nothing about linguistics)
| vb234 wrote:
| Could you recommend a good introduction to NLP book?
| xtiansimon wrote:
| "I recommend people spend a bit of time to read an(y)
| introduction to linguistics textbook..."
|
| Linguistics is a broad area of study. Can you be more specific?
| Such as grammar and syntax?
| philophyse wrote:
| In your opinion, would George Yule's _The Study of Language_ be
| a good introduction to linguistics? Or is there any other book
| that you would recommend to someone who has little knowledge of
| the field, but a lot of interest?
| photonemitter wrote:
| Jumping in on this; I've found jurafsky/martin a good place
| to start. Covers a lot of ground and is a pretty good read as
| well.
|
| https://web.stanford.edu/~jurafsky/slp3/
| ninjin wrote:
| As a somewhat established researcher in the field, I second
| Jurafsky and Martin. It is peerless and it is what I recommend
| to anyone joining my team if they think their background NLP
| knowledge is a bit on the weak side.
| mothcamp wrote:
| Hi HN:
|
| I published part one of my free NLP course. The course is
| intended to help anyone who knows Python and a bit of math go
| from the very basics all the way to today's mainstream models and
| frameworks.
|
| I strive to balance theory and practice and so every module
| consists of detailed explanations and slides along with a Colab
| notebook (in most modules) putting the theory into practice.
|
| In part one, we cover text preprocessing, how to turn text into
| numbers, and multiple ways to classify and search text using
| "classical" approaches. And along the way, we'll pick up useful
| bits on how to use tools such as spaCy and scikit-learn.
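|
| To give a flavour (a minimal scikit-learn sketch in the spirit
| of part one, not code taken from the course itself):
|
|     from sklearn.feature_extraction.text import TfidfVectorizer
|     from sklearn.linear_model import LogisticRegression
|     from sklearn.pipeline import make_pipeline
|     # toy example: turn text into tf-idf vectors, then classify
|     texts = ["great movie", "terrible plot", "loved it", "waste of time"]
|     labels = [1, 0, 1, 0]
|     clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
|     clf.fit(texts, labels)
|     print(clf.predict(["what a great film"]))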
|
| No registration required: https://www.nlpdemystified.org/
| irln wrote:
| The interface is great. Did you create the front-end/back-end
| from scratch?
| mothcamp wrote:
| Thank you. Yep. It's all statically-generated pages using
| Next.js with a single Next.js API route for the subscription.
| All hosted on Netlify.
| jasfi wrote:
| I'm working on extracting facts from sentences, see
| https://lxagi.com.
|
| What are the toughest NLP problems you know of that aren't
| being solved satisfactorily?
| riku_iki wrote:
| Actually, the problem you are working on doesn't look like it's
| been solved satisfactorily yet :-)
| airstrike wrote:
| Getting an invalid HTTPS certificate
| jasfi wrote:
| It works for me, which browser are you using? Can you see the
| certificate?
| Der_Einzige wrote:
| Queryable, word level, extractive summarization with
| grammatical correctness. AKA: what a human does when they are
| "highlighting" a document.
|
| Think extractive QA, but the answer size should be
| configurable, the answer can potentially be multiple spans, and
| the spans may not need to be contiguous.
|
| If you've got a solution, I'd love to see it - and you could
| even beat the baselines for the only dataset that exists for it:
| https://paperswithcode.com/sota/extractive-document-summariz...
| [deleted]
| Utkarsh_Mood wrote:
| Looks great, thanks for this!
___________________________________________________________________
(page generated 2022-05-18 23:01 UTC)