[HN Gopher] SpaCy 3.0
       ___________________________________________________________________
        
       SpaCy 3.0
        
       Author : syllogism
       Score  : 376 points
       Date   : 2021-02-01 13:57 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | gillesjacobs wrote:
       | Excited for this release and I will start integrating this in my
       | own information extraction pipelines immediately. Thanks,
       | Explosion team, got your stickers on my notebook!
       | 
        | The new configuration approach looks similar to AllenNLP's, and
        | that's great. Loose coupling of model submodules with flexible
        | config should be standard in NLP. I am happy that more libraries
        | are integrating these concepts.
        
       | langitbiru wrote:
        | So with SpaCy 3.0 and HuggingFace, do we still have a reason to
        | use NLTK? Or do they complement each other? Right now, I've lost
        | track of the progress in NLP.
        
         | gillesjacobs wrote:
         | NLTK is showing its age. In my information extraction
         | pipelines, the heavy lifting for modelling is done by SpaCy,
         | AllenNLP, and Huggingface (and Pytorch or TF ofc).
         | 
          | I only use NLTK since it has some base tools for low-resource
          | languages for which no one has pretrained a transformer model
          | or for specific NLP-related tasks. I still use their agreement
         | metrics module, for instance. But that's about it. Dep parsing,
         | NER, lemmatising and stemming is all better with the above
         | mentioned packages.
        
       | The_rationalist wrote:
       | I see that by default the trf model is roberta_base
       | https://spacy.io/models/en#en_core_web_trf
       | 
        | Is there an easy way to use xlnet (from transformers) for POS
        | tagging, dep parsing, etc? Btw, it would have been a smarter
        | default, as it scores more SOTA results on paperswithcode.com.
        
         | syllogism wrote:
         | Yes you can easily train with xlnet instead of roberta-base ---
         | just write the different model name in the config file (or pass
         | a different string value, if doing it from Python). You can
         | find an example config file here:
         | https://github.com/explosion/projects/blob/v3/benchmarks/ner...
         | 
         | I actually didn't see a performance improvement when using
         | XLNet over roberta-base. I always wondered about this; ages ago
         | I looked into it and I wasn't sure that the preprocessing
         | details in the transformers version were entirely correct.
         | 
         | Given very similar accuracies from XLNet and RoBERTa, I
         | preferred RoBERTa for the following reasons:
         | 
         | * I've never been able to understand the XLNet paper :(. I
         | spent some time trying when it was released, but I just didn't
         | really get it, not anything close to the level where I'd be
         | able to implement it, anyway.
         | 
         | * Standardising on BERT architecture has some advantages. If we
         | mostly use BERT, we have a better chance of using faster
         | implementations. Mostly nobody is training new XLNet models,
         | whereas many new BERT models are being trained.
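
        [Editor's note: a minimal sketch of the config change described
        above. The block path follows the linked example config;
        "xlnet-base-cased" is the Huggingface model name and is an
        assumption for illustration.]

        ```ini
        [components.transformer.model]
        @architectures = "spacy-transformers.TransformerModel.v1"
        name = "xlnet-base-cased"
        ```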
        
       | magicalhippo wrote:
       | I stumbled over SpaCy when looking for something to extract key
       | words and numbers from sentences, however it looked a bit
       | daunting and/or overkill. Think recipes or similar, turning "take
       | three tablespoons of sugar" into [3, 'tablespoons', 'sugar'] or
       | similar.
       | 
       | Should I give it another shot or are there libraries more suited
       | for this than just plain regexp galore?
        
         | teruakohatu wrote:
         | I did that years ago for some project. For recipes you can
         | probably get away with regular expressions.
         | 
          | But with Spacy you could tokenize the sentence, tag each word
          | with its part of speech, and then find patterns, e.g. Verb
          | Number Noun Preposition Noun.
          | 
          | Then match the first noun against a list of measurements
          | (tablespoon, teaspoon, tbsp...) and extract the rest of the
          | components.
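
[Editor's note: for the simple recipe case above, the "plain regexp galore" baseline can be sketched in pure Python without spaCy. The number-word table and measurement list below are assumptions for illustration, not part of any library.]

```python
import re

# Minimal lookup tables (illustrative assumptions, extend as needed).
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
MEASUREMENTS = {"tablespoon", "tablespoons", "teaspoon", "teaspoons",
                "tbsp", "cup", "cups"}

def parse_ingredient(sentence):
    """Turn 'take three tablespoons of sugar' into [3, 'tablespoons', 'sugar']."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    for i, tok in enumerate(tokens):
        qty = int(tok) if tok.isdigit() else NUMBER_WORDS.get(tok)
        # Look for a quantity immediately followed by a measurement word.
        if qty is not None and i + 1 < len(tokens) and tokens[i + 1] in MEASUREMENTS:
            rest = [t for t in tokens[i + 2:] if t != "of"]
            return [qty, tokens[i + 1]] + rest
    return None
```

The spaCy version of the same idea would tag tokens with parts of speech and match a Verb Number Noun pattern instead of hand-coded tables, which generalises better once the sentences get messier.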
        
       | cambalache wrote:
       | > spaCy is a library for advanced Natural Language Processing in
       | Python and Cython.
        
         | syllogism wrote:
         | I actually submitted this with (Python Natural Language
         | Processing) after it, but it got edited away. I always find it
         | hard to predict the preferred title style here...
        
       | bratao wrote:
       | Thank you Matthew, Ines, Sofie and Adriane for spaCy. It is a
       | fundamental piece for me, both for work in Academia and in
       | Industry.
        
       | ZeroCool2u wrote:
        | SpaCy and HuggingFace fulfill practically 99% of all our needs
        | for NLP projects at work. Really incredible bodies of work.
       | 
       | Also, my team chat is currently filled with people being
       | extremely stoked about the SpaCy + FastAPI support! Really hope
       | FastAPI replaces Flask sooner rather than later.
        
         | gavinray wrote:
         | There's some native SpaCy + FastAPI integration being created?!
         | 
         | This sounds mindblowing, off to Google I go...
        
           | thesehands wrote:
           | The author of FastAPI https://twitter.com/tiangolo is a Spacy
           | employee
        
             | tiangolo wrote:
             | Yep, that's me, I work at Explosion (spaCy's home)!
        
               | ZeroCool2u wrote:
                | Big fan my dude! While FastAPI is amazing, the docs for
                | it are a work of art. I know a few people that have just
                | used the FastAPI docs to learn what APIs are and how
                | they work, never mind how to use FastAPI itself.
        
               | tiangolo wrote:
               | Thanks for saying that! :)
        
           | tiangolo wrote:
           | With the new "spaCy project" it's easy to generate the
           | boilerplate for many spaCy related projects, including a
           | FastAPI app to serve models.
           | 
           | But apart from that, there's actually not really much to
           | integrate, as it all just works.
           | 
            | Both libraries are actually (well) designed to be decoupled,
            | so you can mix them independently with anything else.
           | 
           | Note: I'm the creator of FastAPI, and also a developer at
           | Explosion (spaCy's home).
        
         | number6 wrote:
          | As someone who has never used NLP, what is it used for? Or
          | better, what is SpaCy used for? I know that it can generate
          | texts... but how would I use it in a business?
        
           | ZeroCool2u wrote:
           | Obviously, it depends, but assuming you do have an NLP use
           | case already, there are certain things that you will almost
           | certainly have to do in your preprocessing regardless of your
           | task. For example, sentence parsing. Writing your own basic
           | sentence parser is fairly easy. Writing your own _good_
           | sentence parser is a nightmare akin to trying to parse HTML
           | with regex. SpaCy provides a very good one for you. Down the
           | line it will help you train a model on various tasks in a
           | compute efficient manner as well. This is just a small
           | example.
        
             | number6 wrote:
              | Thanks for the explanation. Guess I never had an NLP use
              | case. So SpaCy can take a text or sentences apart and
              | knows what the parts "mean", but I can't think of a
              | problem to solve with this. Maybe summarize a text or
              | something, but I guess I am not imaginative enough.
        
       | Xenoamorphous wrote:
        | I think I read somewhere that spaCy was going to have named
        | entity disambiguation at some point, with named entities having
        | links to knowledge bases like Wikidata or DBpedia. That's
        | something that paid NER services offer but that I haven't found
        | in open source libs, and it would be really interesting IMO.
        
         | svlandeg wrote:
         | There's a component for Entity Linking available in spaCy, but
         | you have to train it yourself, as the use-cases (type of
         | entities, type of knowledge base etc) can vary greatly. See
         | more here: https://spacy.io/api/entitylinker
        
       | screye wrote:
       | I have been using Spacy3 nightly for a while now. This is game
       | changing.
       | 
       | Spacy3 practically covers 90% of NLP use-cases with near SOTA
       | performance. The only reason to not use it would be if you are
       | literally pushing the boundaries of NLP or building something
       | super specialized.
       | 
       | Hugging Face and Spacy (also Pytorch, but duh) are saving
       | millions of dollars in man hours for companies around the world.
       | They've been a revelation.
        
         | williamtrask wrote:
          | Man, I wish they could be compensated even remotely in
          | proportion to that. Matthew Honnibal and team are wizards who
          | have been working really hard for a really long time.
          | 
          | Maybe there's a comp strategy I don't know about - but they've
          | created SO much value for the world.
        
           | syllogism wrote:
           | Thanks for the love :). For the record yes we've been working
           | hard, but also yes, we've been doing well from it.
           | 
           | I will say that people are using spaCy for free because that
           | is what we asked them to do. I chose to make the library free
           | and open-source when I first released it because I had the
           | idea that I would be able to make that work out for me, if I
           | could make this thing that would be useful to people and if
           | they could be convinced to adopt it. And in order to convince
           | people to adopt it, we've been telling people that spaCy will
           | stay free and that we'll continue to work on it. So
           | everything's going to plan here. Even if things weren't
           | working out well for us (and they are), the fault would be
           | entirely ours. I don't think we'd have any right to suddenly
           | say, "Oh none of you jerks are paying, how unfair".
           | 
           | (For the record, we make money from sales of our annotation
           | tool, Prodigy: https://prodi.gy . If you're reading this and
           | you like spaCy, check it out ;)
        
             | earthnail wrote:
             | Oh wow, Prodigy looks amazing. I need a tool for audio
             | annotation, and looks like you guys built just the right
             | thing for me.
             | 
             | Please invest more in SEO; I didn't find you guys two weeks
             | ago when I researched different options for audio
             | annotation :D.
        
             | joshklein wrote:
             | I want to personally thank you for your work, and let you
             | know I couldn't have done an important project of mine if
             | spaCy didn't exist, and if it were not a free resource.
             | 
             | Your project was 1 of the 2 instrumental tools in my
             | project to structure the transcripts of every word said on
             | the floor of the New York State Senate over the past ~30
             | years in order to develop a topic-based "proximity"
             | heuristic (based on CorEx, the 2nd instrumental tool) for
             | which state senators were focused on which issues, based on
              | the things they _actually said on the record_, not based
             | on their press statements or their voting records (the
             | latter of which doesn't capture all the information you'd
             | hope it would due to procedural nuances too obscure to
             | detail here).
             | 
             | Thank you. Thank you, thank you, thank you.
        
             | tedivm wrote:
              | The only thing that Prodigy is missing is a team-based
              | workflow. We've been on the beta list for a while for it,
              | and are excited for it to come out - but without having a
             | concept of users we've had to use other tools that aren't
             | as polished on the annotation side but which hit our
             | compliance needs.
        
               | hantusk wrote:
               | This. Wholly agree. Currently running a large labelling
               | task with 12 labelers.
               | 
                | Using Amazon Ground Truth, which works fine (although
                | it seems quite MVP outside the core functionality, e.g.
                | wrt reporting or user creation).
               | 
               | What tool have you had success with?
        
         | JPKab wrote:
         | Everything in the above paragraph sounds like a hyped
         | overstatement. None of it is.
         | 
         | As someone that's worked on some rather intensive NLP
         | implementations, Spacy 3.0 and HuggingFace both represent the
         | culmination of a technological leap in NLP that started a few
         | years ago with the advent of transfer learning in NLP. The
         | level of accessibility to the masses these libraries offer is
         | game-changing and democratizing.
        
           | aapppwe wrote:
           | i am curious, what kind of project are you working on?
        
           | drran wrote:
           | Can you help, please?
           | 
            | I want to use AI to translate (localize) messages for free
            | software, in my case into the Ukrainian language. My plan to
            | improve the quality of automated translation is to translate
            | from similar languages in parallel, i.e. give the same
            | message in English, Russian, and Polish, and expect the
            | message in Ukrainian as output.
            | 
            | Where should I start? Which libraries should I use? How do I
            | connect them? How do I train them?
        
             | chartpath wrote:
             | I've been using LASER from Facebook Research via
             | https://github.com/yannvgn/laserembeddings to accept multi-
             | lingual input in front of the the domain-specific models
             | for recommendations and stuff (that are trained on English
             | annotated examples).
        
       | liminal wrote:
       | I'm curious what sort of NLP use cases people are solving. How
       | are people finding business value in these models and pipelines?
       | We have looked at a number of uses and have found it hard to make
       | a case for ROI. Wondering what's been working for folks.
        
       | datameta wrote:
        | I wonder if it is sheer coincidence that SpaCy is pronounced the
        | way the Russian word "spasai" is. It means "rescue" (v.)
        
       | jcuenod wrote:
        | I'll add my hat off to @ines and the spaCy team. It's super
        | impressive. There's also a (free) orientation course I'd
        | recommend at https://course.spacy.io/
        
       | mark_l_watson wrote:
       | Thanks to the SpaCy team! I spent a lot of time over about 20
       | years working on my own NLP tools. I stopped doing that and
       | mostly now just use SpaCy (and sometimes Huggingface and Apple's
       | NLP models).
        
         | Abishek_Muthian wrote:
         | Have you compared SpaCy with Apple's NLP on M1? I presume GPU
         | is not being used by SpaCy on M1.
        
           | t-vi wrote:
           | Given that SpaCy uses PyTorch, that is being worked on.
        
           | mark_l_watson wrote:
            | I have used both on M1, but haven't compared them.
        
       | stingraycharles wrote:
       | Ok so I've evaluated spacy a few years ago, but nowadays we're
       | using huggingface's transformers / tokenizers / etc to train our
       | own language models + fine tuned models. I see there's now
       | transformer based pipeline support, how do the two relate?
       | 
       | Phrased differently, how does spacy fit in with today's world of
       | transformers? Would it still be interesting for me?
        
         | binarymax wrote:
         | I have lots of experience with both, and I use both together
         | for different use cases. SpaCy fills the need of
         | predictable/explainable pattern matching and NER - and is very
         | fast and reasonably accurate on a CPU. Huggingface fills the
         | need for task based prediction when you have a GPU.
        
           | danieldk wrote:
           | _Huggingface fills the need for task based prediction when
           | you have a GPU._
           | 
           | With model distillation, you can make models that annotate
           | hundreds of sentences per second on a single CPU with a
           | library like Huggingface Transformers.
           | 
           | For instance, one of my distilled Dutch multi-task syntax
           | models (UD POS, language-specific POS, lemmatization,
           | morphology, dependency parsing) annotates 316 sentences per
           | second with 4 threads on a Ryzen 3700X. This distilled model
           | has virtually no loss in accuracy compared to the finetuned
           | XLM-RoBERTa base model.
           | 
           | I don't use Huggingface Transformers, but ported some of
           | their implementations to Rust [1], but that should not make a
           | big difference since all the heavy lifting happens in C++ in
           | libtorch anyway.
           | 
            | tl;dr: it is not true that transformers are only useful for
            | GPU prediction. You can get high CPU prediction speeds with
            | some tricks (distillation, length-based bucketing in
            | batches, using MKL, etc.).
           | 
            | [1] https://github.com/tensordot/syntaxdot/tree/main/syntaxdot-t...
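
[Editor's note: the length-based bucketing trick mentioned above can be sketched in a few lines of plain Python. The batch size and the use of raw strings as "sentences" are assumptions for illustration.]

```python
def bucket_batches(sentences, batch_size=32):
    """Group sentences of similar length to minimise padding per batch.

    Sorting by length before batching means each batch only pads to the
    length of its own longest member, not the corpus-wide maximum, which
    cuts wasted computation during CPU (or GPU) inference.
    """
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```

In a real pipeline you would sort by token count rather than character count and restore the original order after prediction, but the padding-saving idea is the same.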
        
             | binarymax wrote:
             | Interesting. Did you start from a Distilled base model
             | (like DistilRoBerta), or did you distill your fine-tuned
             | model?
        
         | syllogism wrote:
         | The improved transformers support is definitely one of the main
         | features of the release. I'm also really pleased with how the
         | project system and config files work.
         | 
         | If you're always working with exactly one task model, I think
         | working directly in transformers isn't that different from
         | using spaCy. But if you're orchestrating multiple models,
         | spaCy's pipeline components and Doc object will probably be
         | helpful. A feature in v3 that I think will be particularly
         | useful is the ability to share a transformer model between
         | multiple components, for instance you can have an entity
         | recogniser, text classifier and tagger all using the same
         | transformer, and all backpropagating to it.
         | 
         | You also might find the projects system useful if you're
         | training a lot of models. For instance, take a look at the
          | project repo here:
          | https://github.com/explosion/projects/tree/v3/benchmarks/ner...
          | Most of the readme there is actually
         | generated from the project.yml file, which fully specifies the
         | preprocessing steps you need to build the project from the
         | source assets. The project system can also push and pull
         | intermediate or final artifacts to a remote cache, such as an
         | S3 bucket, with the addressing of the artifacts calculated
         | based on hashes of the inputs and the file itself.
         | 
         | The config file is comprehensive and extensible. The blocks
         | refer to typed functions that you can specify yourself, so you
         | can substitute any of your own layer (or other) functions in,
         | to change some part of the system's behaviour. You don't _have_
         | to specify your models from the config files like this --- you
         | can instead put it together in code. But the config system
         | means there's a way of fully specifying a pipeline and all of
         | the training settings, which means you can really standardise
         | your training machinery.
         | 
         | Overall the theme of what we're doing is helping you to line up
         | the workflows you use during development with something you can
         | actually ship. We think one of the problems for ML engineers is
         | that there's quite a gap between how people are iterating in
         | their local dev environment (notebooks, scrappy directories
         | etc) and getting the project into a state that you can get
         | other people working on, try out in automation, and then pilot
         | in some sort of soft production (e.g. directing a small amount
         | of traffic to the model).
         | 
         | The problem with iterating in the local state is that you're
         | running the model against benchmarks that are not real, and you
         | hit diminishing returns quite quickly this way. It also
         | introduces a lot of rework.
         | 
         | All that said, there will definitely be usage contexts where
         | it's not worth introducing another technology. For instance, if
         | your main goal is to develop a model, run an experiment and
         | publish a paper, you might find spaCy doesn't do much that
         | makes your life easier.
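
        [Editor's note: a minimal sketch of the project.yml structure
        described above. The command names, scripts, and file paths are
        illustrative assumptions, not taken from the linked repo.]

        ```yaml
        title: "NER benchmark"
        vars:
          config: "configs/ner.cfg"
        commands:
          - name: "preprocess"
            help: "Convert raw annotations to spaCy's binary format"
            script:
              - "python scripts/convert.py assets/train.json corpus/train.spacy"
            deps: ["assets/train.json"]
            outputs: ["corpus/train.spacy"]
          - name: "train"
            help: "Train the pipeline from the config"
            script:
              - "python -m spacy train ${vars.config} --output training/"
            deps: ["corpus/train.spacy"]
            outputs: ["training/model-best"]
        ```

        Declaring deps and outputs per command is what lets the project
        system hash artifacts and skip or re-run steps, and push results
        to a remote cache such as an S3 bucket.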
        
       | pabs3 wrote:
       | Please note that Explosion does not like redistribution of SpaCy,
       | they expect everyone to only use the builds they produce, so it
       | would not be a good idea to package it for your favourite distro.
        
         | syllogism wrote:
         | I'm sorry that this conflicted with your plans, but I feel
         | strongly that distributing Python libraries via system package
         | managers such as apt is very bad for users. The pain is felt
         | especially by users who are relatively new to Python, who will
         | end up with their system Python in a confusing state that is
         | difficult to correct.
         | 
         | We of course encourage anyone to clone the repo or install from
         | an sdist if they want to compile from source. In fact you can
          | do the following:
          | 
          |     git clone https://github.com/explosion/spaCy
          |     cd spaCy
          |     make
         | 
         | This will build you a standalone executable file, in the pex
         | format, that only depends on your system Python and does not
         | install any files into your system. You can copy this artifact
         | into your bin and use it as a command-line application.
        
           | t-vi wrote:
           | That is sad, I would use SpaCy more if it had Debian packages
           | (in particular in Debian). Python stuff packaged by Debian
           | seems to work very well for me and has been for years.
        
         | dna_polymerase wrote:
         | Pretty poor choice of license if they wanted me to care about
         | their builds, tbh.
        
           | pabs3 wrote:
           | Their concern is poor user experience when the documentation
           | on the web doesn't match what versions are being
           | redistributed by other folks.
        
       | pplonski86 wrote:
        | Is there any framework similar to SpaCy or HuggingFace but for
        | images?
        
         | syllogism wrote:
         | I haven't had a situation to use it, but I think Kornia looks
         | cool: https://github.com/kornia/kornia
        
       | polynomial wrote:
       | Super excited to see improvement in NER accuracy in SpaCy 3.0.
        
       ___________________________________________________________________
       (page generated 2021-02-01 23:00 UTC)