[HN Gopher] Advanced NLP with spaCy v3
___________________________________________________________________
Advanced NLP with spaCy v3
Author : philipvollet
Score : 146 points
Date : 2021-12-10 16:07 UTC (6 hours ago)
(HTM) web link (course.spacy.io)
(TXT) w3m dump (course.spacy.io)
| minimaxir wrote:
| A relatively underdiscussed quirk of the rise of superlarge
| language models like GPT-3 for certain NLP tasks is that, since
| those models have incorporated so much real-world grammar,
| there's no need to do advanced preprocessing: you can just YOLO
| and work with generated embeddings instead, without going into
| spaCy's (excellent) parsing/NER features.
|
| OpenAI recently released an Embeddings API for GPT-3 with good
| demos and explanations:
| https://beta.openai.com/docs/guides/embeddings
|
| Hugging Face Transformers makes this easier (and free), as
| most models can be configured to return a "last_hidden_state"
| containing per-token embeddings you can aggregate. Just use
| DistilBERT uncased/cased (which is fast enough to run on
| consumer CPUs) and you're probably good to go.
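| A minimal sketch of that approach, assuming the standard
| "distilbert-base-uncased" checkpoint from Hugging Face (the
| checkpoint name and the mean-pooling choice are mine, not from
| the comment):

```python
# Mean-pooled sentence embeddings from DistilBERT's last_hidden_state.
# Requires `pip install transformers torch`; mean pooling over the
# attention mask is one common, illustrative aggregation strategy.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)    # zero out padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pool
```

| From there, cosine similarity between the pooled vectors is
| usually enough for clustering or retrieval.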
| mtqwerty wrote:
| Readjusting expectations for pre-processing was one of the
| biggest differences I noticed going from NLP courses to working
| on NLP in production. For the amount of pre-processing learning
| material there is, I expected it to be much more important in
| practice.
|
| I feel lucky to have gotten into NLP when I did (learning in
| 2017/2018 and starting work in early 2020). Changing our
| system from GloVe to BERT was super exciting and a great way to
| learn about the drawbacks and benefits of each.
| Vetch wrote:
| While you make sensible points, in the case of GPT-3, not
| everyone will be willing to route their data through OpenAI's
| servers.
|
| > Just use DistilBERT uncased/cased (which is fast enough to
| run on consumer CPUs)
|
| This can still be impractical, at least in my case of regularly
| needing to process hundreds of pages of text. Simpler systems
| can be much faster for an acceptable loss and you can get more
| robustness by working with label distributions instead of just
| picking argmax.
|
| Fast simpler classifiers can also help decide where the more
| resource intensive models should focus attention.
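| A toy sketch of such a cascade (the function names and the 0.9
| threshold are illustrative stand-ins, not from the comment):

```python
# Hypothetical cascade: a cheap classifier handles confident inputs
# and defers the rest to an expensive model. Both classifiers are
# passed in as callables; the threshold is an illustrative choice.
def cascade(texts, cheap_predict_proba, expensive_predict, threshold=0.9):
    results = []
    for text in texts:
        probs = cheap_predict_proba(text)   # full label distribution
        label = max(probs, key=probs.get)   # argmax label
        if probs[label] >= threshold:
            results.append((label, probs))  # keep the distribution too
        else:
            results.append((expensive_predict(text), probs))
    return results
```

| Keeping the whole distribution around (rather than just the
| argmax) is what lets downstream rules reason about uncertainty.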
|
| Another reason for preprocessing is rule systems. Even if they
| are not glamorous to talk about, they still see heavy use in
| practical settings. While dependency parses are hard to make
| use of, shallow parses (chunking) and part-of-speech data can
| be usefully fed into rule systems.
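| As a toy illustration of that last point, a rule that chunks
| "ADJ* NOUN+" phrases out of tagger output (the coarse tagset and
| the rule itself are illustrative, not spaCy's API):

```python
# Toy rule over (token, POS) pairs, the kind of output a tagger or
# chunker produces: collect adjective/noun runs that contain at least
# one noun as noun-phrase candidates.
def np_chunks(tagged):
    chunks, run = [], []
    for tok, pos in list(tagged) + [("", "EOS")]:  # sentinel flushes last run
        if pos in ("ADJ", "NOUN"):
            run.append((tok, pos))
        else:
            if any(p == "NOUN" for _, p in run):
                chunks.append(" ".join(t for t, _ in run))
            run = []
    return chunks

np_chunks([("fast", "ADJ"), ("rule", "NOUN"), ("systems", "NOUN"),
           ("are", "VERB"), ("useful", "ADJ")])
# -> ["fast rule systems"]
```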
| new_stranger wrote:
| I imagine it being very useful to understand what you just said
| hooande wrote:
| lol. a rough translation is that the new super language
| models are good enough that you don't have to keep track of
| specific parts of speech in your programming. if you look at
| the arrays of floating point weights that underlie gpt-3 etc,
| you can use them to match present participle phrases with
| other present participle phrases and so forth
|
| this is of course a correct and prescient observation.
| minimaxir is kind of an NLP final boss, so I wouldn't expect
| most people to be able to follow everything he says
| minimaxir wrote:
| I don't think it's a final boss thing: IMO working with
| embeddings/word vectors, even in the basest case such as
| word2vec/GloVe, is easier to understand than some of the more
| conventional NLP techniques (e.g. bag of words / TF-IDF).
|
| The spaCy tutorials in the submission also have a section
| on word vectors.
| Vetch wrote:
| Ah, although TF-IDF is still good to know. Semantic
| search hasn't eliminated the need for classical retrieval
| techniques. It can also be used to select a subset of words
| whose word vectors are averaged into a document signature, a
| quick and dirty method for document embeddings.
|
| Bag-of-word co-occurrences in matrix format are also nice to
| know; factorizing such matrices was the original vector-space
| model for distributional semantics and provides historical
| context for GloVe and the like.
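| A quick sketch of that document-signature trick, with a toy
| corpus, hand-rolled TF-IDF, and stand-in word vectors (all of it
| illustrative):

```python
import math
import numpy as np

# Toy corpus; in practice these would be real tokenized documents.
docs = [["spacy", "parses", "text"],
        ["spacy", "loves", "vectors"],
        ["rules", "beat", "hype"]]

def idf(corpus):
    # Inverse document frequency: log(N / document-frequency).
    n, df = len(corpus), {}
    for doc in corpus:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / c) for w, c in df.items()}

def signature(doc, vectors, idf_scores):
    # TF-IDF-weighted average of word vectors as a document embedding.
    weights = np.array([doc.count(w) / len(doc) * idf_scores[w]
                        for w in doc])
    mat = np.stack([vectors[w] for w in doc])
    return (weights[:, None] * mat).sum(axis=0) / weights.sum()
```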
| master_yoda_1 wrote:
| I am not able to see what is advanced here. spaCy just wraps
| all the open-source code/models into a Python API and they just
| want to sell the hype.
| dang wrote:
| " _Please don 't post shallow dismissals, especially of other
| people's work. A good critical comment teaches us something._"
|
| https://news.ycombinator.com/newsguidelines.html
| Der_Einzige wrote:
| As usual, dang is wrong and not moderating effectively. This
| is not a shallow comment but a legitimate concern about
| spaCy, and to a lesser extent other NLP tools such as NLTK.
| Most of the tooling around them that people end up using
| really is nothing more than wrappers around other tools. See
| the default tokenizers or models utilized by these tools.
|
| And yes, even if spaCy itself is not making money, you can
| bet that the other paid-for tools they sell are.
| dang wrote:
| Actually if the GP had posted this critique instead of a
| shallow, reductionist internet dismissal ("just want to
| sell the hype"), that would have been fine. Thoughtful
| critique is welcome--it just requires higher-quality
| comments than that.
| Ldorigo wrote:
| Ah, yes. The tried-and-true method of "just selling the hype"
| with an open source library that everyone can use for free.
| coding123 wrote:
| That's a huge part of software development. Wrapping things to
| be more concise, use-case driven. I mean most software
| developers are just placing a veneer over something more
| complex. That's pretty much all we do.
| 41209 wrote:
| I really love spaCy, it's trivial to throw up a server which
| handles basic NLP. No complaints here, very happy to see it still
| being updated
| artembugara wrote:
| We've been using spaCy a lot for the past few months.
|
| Mostly for non-production use cases; even so, I can say that it
| is the most robust framework for NLP at the moment.
|
| V3 added support for transformers: that's a killer feature as
| many models from https://huggingface.co/docs/transformers/index
| work great out of the box.
|
| At the same time, I found NER models provided by spaCy to have a
| low accuracy while working with real data: we deal with news
| articles https://demo.newscatcherapi.com/
|
| Also, while I see how much attention ML models get from the
| crowd, I think that many problems can be solved with a
| rule-based approach, and spaCy is just amazing for these.
|
| Btw, we recently wrote a blog post comparing spaCy to NLTK for
| text normalization task: https://newscatcherapi.com/blog/spacy-
| vs-nltk-text-normaliza...
| brd wrote:
| I really appreciate how accessible SpaCy has made NLP work but
| their NER is definitely low accuracy.
|
| Where stem/lem felt critical to successful NLP processing a few
| years ago, we've found stem/lem work to be much less important
| for downstream tasks when transformer based models are
| involved.
|
| For topic extraction, stem/lem still seems to do a lot to
| improve accuracy, and for rules-based approaches I can still see
| how it would facilitate more efficient processing at scale. I'd
| be curious to hear your experience fine-tuning and/or training
| new models after stem/lem processing with transformers; we've
| admittedly done little testing to see how transformers actually
| perform if properly tuned to post-processed data.
| artembugara wrote:
| Did you try something like autoNLP by huggingface?
| brd wrote:
| No, we've got our own fine tuning pipeline and initial
| tests showed better performance without traditional
| stem/lem processing so we dropped it from our
| classification pipelines and haven't seen a need to
| revisit.
| pantsforbirds wrote:
| We use spaCy at work for (mostly) news articles as well. We've
| been pretty impressed with it overall for detecting larger
| trends using the NER models. I've been contemplating whether it
| might be useful to make a spaCy module that uses a Count-Min
| Sketch to track the top N of each of the NER categories,
| partitioned into daily (or weekly, etc.) windows.
|
| Think it could be an interesting use case to get sort of
| similar results to Google's search trends.
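| For what it's worth, a Count-Min Sketch is small enough to
| sketch directly (the width/depth and the hashing scheme below
| are illustrative choices):

```python
import hashlib

# Toy Count-Min Sketch for counting entity mentions in a stream.
# Each item is hashed into one cell per row; the estimate is the
# minimum across rows, so it can overcount but never undercount.
class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def add(self, item, count=1):
        for row, idx in zip(self.rows, self._hashes(item)):
            row[idx] += count

    def estimate(self, item):
        return min(row[idx] for row, idx in zip(self.rows, self._hashes(item)))
```

| One sketch per day (or week) keeps memory bounded regardless of
| how many distinct entities the NER pass emits.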
| artembugara wrote:
| I'd really love to chat about that. Any chance to connect?
| email in bio
| Eridrus wrote:
| I feel like NER is a poorly designed task in general. You're
| eventually trying to link the entities to some kind of KB, so
| you should be injecting that entity information into your
| system for detecting mentions.
| kulikalov wrote:
| Are you using the high accuracy eng model for NER? I've been
| very happy with orgs recognition, it actually did way better
| than any other open source model in my case.
| artembugara wrote:
| Try it on a sentence where all tokens are lower/upper case.
| It just doesn't really work.
| Xenoamorphous wrote:
| I don't know how it compares with other paid alternatives (like
| Google's or Amazon's) but spaCy's NER was pretty close to the
| (paid) service we were using (IBM) to the point we ditched IBM.
| Also for news articles.
|
| But yeah disambiguation/entity linking would be nice.
| artembugara wrote:
| I'd be happy to chat more if you want.
| artembugara wrote:
| Also I have an article about spaCy NER:
| https://newscatcherapi.com/blog/named-entity-recognition-wit...
|
| The conclusions I came up with:
|
| "A few notes on my spaCy NER accuracy with 'real world' data:
|
| 1. Low accuracy with sentences without proper casing
|
| 2. Low accuracy overall, even with a large model
|
| 3. You'd need to fine-tune your model if you want to use it in
| production
|
| 4. Overall, there's no open-source high-accuracy NER model that
| you can use out of the box"
| wyldfire wrote:
| I assume your product does some kind of entity disambiguation
| and/or link to an ontology? Spacy doesn't provide this out of
| the box either, AFAICT. Can you share more info about how you
| do it?
| artembugara wrote:
| We don't provide entity disambiguation out of the box. It's
| more of an on-request feature for Enterprise clients.
|
| But overall, entity disambiguation is one of the most
| useful and difficult tasks in NLP.
|
| SpaCy supports entity linking via knowledge base:
| https://spacy.io/api/entitylinker
| nefitty wrote:
| That might be the killer feature from what I've heard.
| Tarq0n wrote:
| NER good enough to anonymise free text would be the
| absolute dream for many governments.
| Vetch wrote:
| > Overall, there's no open-source high-accuracy NER model
| that you can use out of the box
|
| Part of it is that most people underestimate the complexity
| of NER; the rest, in my opinion, is that NER is not well
| defined as a classification problem.
|
| At least in my experience, having a specific battery of
| questions to query documents, first with transformer-based
| semantic search and then narrowed by Q/A models, removed the
| need for explicit NER, entity linking, or relation extraction.
| For the case of entities as features for rule systems, shallow
| models and using all label predictions instead of just
| selecting argmax have been sufficiently robust. Using big
| transformers for classification doesn't pay enough to be
| worth it there.
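| A rough sketch of that retrieve-then-read setup, with stand-in
| embeddings and a stubbed Q/A model (everything here is
| illustrative, not the commenter's actual pipeline):

```python
import numpy as np

# Stage 1: cheap semantic search ranks documents against a query
# embedding by cosine similarity. Stage 2: only the top-k survivors
# are handed to an expensive Q/A model (here just a callable stub).
def top_k_by_cosine(query_vec, doc_vecs, k=3):
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:k]

def answer(question_vec, doc_vecs, docs, qa_model, k=3):
    # Narrow with the cheap stage, then run the Q/A model on survivors.
    return [qa_model(docs[i]) for i in top_k_by_cosine(question_vec, doc_vecs, k)]
```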
___________________________________________________________________
(page generated 2021-12-10 23:00 UTC)