[HN Gopher] SpaCy 3.0
___________________________________________________________________
SpaCy 3.0
Author : syllogism
Score : 376 points
Date : 2021-02-01 13:57 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| gillesjacobs wrote:
| Excited for this release and I will start integrating this in my
| own information extraction pipelines immediately. Thanks,
| Explosion team, got your stickers on my notebook!
|
| The new configuration approach looks similar to AllenNLP's and
| that's great. Loose coupling of model submodules with flexible
| config should be standard in NLP. I am happy that more libraries
| are integrating these concepts.
| langitbiru wrote:
| So with SpaCy 3.0 and HuggingFace, do we still have a reason to
| use NLTK? Or do they complement each other? Right now, I've
| lost track of the progress in NLP.
| gillesjacobs wrote:
| NLTK is showing its age. In my information extraction
| pipelines, the heavy lifting for modelling is done by SpaCy,
| AllenNLP, and Huggingface (and Pytorch or TF ofc).
|
| I only use NLTK because it has some base tools for low-resource
| languages for which no one has pretrained a transformer model,
| or for specific NLP-related tasks. I still use their agreement
| metrics module, for instance. But that's about it. Dep parsing,
| NER, lemmatisation and stemming are all better with the
| packages mentioned above.
| The_rationalist wrote:
| I see that by default the trf model is roberta-base:
| https://spacy.io/models/en#en_core_web_trf
|
| Is there an easy way to use xlnet (from transformers) for POS
| tagging, dep parsing, etc.? Btw, it would have been a smarter
| default, as it scores more SOTA results on paperswithcode.com.
| syllogism wrote:
| Yes you can easily train with xlnet instead of roberta-base ---
| just write the different model name in the config file (or pass
| a different string value, if doing it from Python). You can
| find an example config file here:
| https://github.com/explosion/projects/blob/v3/benchmarks/ner...
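| As a concrete illustration of that one-line change, here is a
| hedged sketch in the spaCy v3 config style; the block layout and
| the "xlnet-base-cased" hub name are assumptions, not taken from
| the linked file:

```ini
# Swapping the pretrained transformer is a matter of changing the
# model name in the transformer component's block:
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "xlnet-base-cased"   # was "roberta-base"
```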
|
| I actually didn't see a performance improvement when using
| XLNet over roberta-base. I always wondered about this; ages ago
| I looked into it and I wasn't sure that the preprocessing
| details in the transformers version were entirely correct.
|
| Given very similar accuracies from XLNet and RoBERTa, I
| preferred RoBERTa for the following reasons:
|
| * I've never been able to understand the XLNet paper :(. I
| spent some time trying when it was released, but I just didn't
| really get it, not anything close to the level where I'd be
| able to implement it, anyway.
|
| * Standardising on BERT architecture has some advantages. If we
| mostly use BERT, we have a better chance of using faster
| implementations. Mostly nobody is training new XLNet models,
| whereas many new BERT models are being trained.
| magicalhippo wrote:
| I stumbled over SpaCy when looking for something to extract key
| words and numbers from sentences, but it looked a bit daunting
| and/or overkill. Think recipes or similar: turning "take three
| tablespoons of sugar" into [3, 'tablespoons', 'sugar'] or
| similar.
|
| Should I give it another shot or are there libraries more suited
| for this than just plain regexp galore?
| teruakohatu wrote:
| I did that years ago for some project. For recipes you can
| probably get away with regular expressions.
|
| But with SpaCy you could tokenize the sentence, then tag each
| word with its part of speech, and then find patterns, e.g.
| Verb Number Noun Preposition Noun.
|
| Then match the first noun against a list of measurements
| (tablespoon, teaspoon, tbsp...) and extract the rest of the
| components.
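| For the simple cases, the "plain regexp" route really is enough.
| A minimal stdlib-only sketch (the number words and units listed
| here are assumptions, not an exhaustive vocabulary):

```python
import re

# Tiny vocabularies for illustration only.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
UNITS = ["tablespoons", "tablespoon", "tbsp", "teaspoons", "teaspoon", "tsp"]

# quantity (digit or number word), then a unit, then an optional
# "of", then the ingredient.
PATTERN = re.compile(
    r"(?P<qty>\d+|" + "|".join(NUMBER_WORDS) + r")\s+"
    r"(?P<unit>" + "|".join(UNITS) + r")\s+"
    r"(?:of\s+)?(?P<item>\w+)",
    re.IGNORECASE,
)

def parse_quantity(sentence):
    """Return [quantity, unit, item] for the first match, else None."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    qty = m.group("qty").lower()
    value = int(qty) if qty.isdigit() else NUMBER_WORDS[qty]
    return [value, m.group("unit").lower(), m.group("item").lower()]

print(parse_quantity("take three tablespoons of sugar"))
# -> [3, 'tablespoons', 'sugar']
```

| Once the phrasing gets freer than this, the POS-pattern approach
| described above degrades more gracefully than a growing pile of
| regexes.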
| cambalache wrote:
| > spaCy is a library for advanced Natural Language Processing in
| Python and Cython.
| syllogism wrote:
| I actually submitted this with (Python Natural Language
| Processing) after it, but it got edited away. I always find it
| hard to predict the preferred title style here...
| bratao wrote:
| Thank you Matthew, Ines, Sofie and Adriane for spaCy. It is a
| fundamental piece for me, both for work in Academia and in
| Industry.
| ZeroCool2u wrote:
| SpaCy and HuggingFace fulfill practically 99% of our needs
| for NLP projects at work. Really incredible bodies of work.
|
| Also, my team chat is currently filled with people being
| extremely stoked about the SpaCy + FastAPI support! Really hope
| FastAPI replaces Flask sooner rather than later.
| gavinray wrote:
| There's some native SpaCy + FastAPI integration being created?!
|
| This sounds mindblowing, off to Google I go...
| thesehands wrote:
| The author of FastAPI https://twitter.com/tiangolo is a Spacy
| employee
| tiangolo wrote:
| Yep, that's me, I work at Explosion (spaCy's home)!
| ZeroCool2u wrote:
| Big fan, my dude! While FastAPI is amazing, the docs for
| it are a work of art. I know a few people who have just
| used the FastAPI docs to learn what APIs are and how
| they work, never mind how to use FastAPI itself.
| tiangolo wrote:
| Thanks for saying that! :)
| tiangolo wrote:
| With the new "spaCy project" it's easy to generate the
| boilerplate for many spaCy related projects, including a
| FastAPI app to serve models.
|
| But apart from that, there's actually not really much to
| integrate, as it all just works.
|
| Both libraries are actually (well) designed to be decoupled,
| so that you can mix them independently with anything else.
|
| Note: I'm the creator of FastAPI, and also a developer at
| Explosion (spaCy's home).
| number6 wrote:
| As someone who has never used NLP, what is it used for? Or
| better, what is SpaCy used for? I know that it can generate
| texts... but how would I use it in a business?
| ZeroCool2u wrote:
| Obviously, it depends, but assuming you do have an NLP use
| case already, there are certain things that you will almost
| certainly have to do in your preprocessing regardless of your
| task. For example, sentence parsing. Writing your own basic
| sentence parser is fairly easy. Writing your own _good_
| sentence parser is a nightmare akin to trying to parse HTML
| with regex. SpaCy provides a very good one for you. Down the
| line it will help you train a model on various tasks in a
| compute efficient manner as well. This is just a small
| example.
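| To see why naive sentence splitting breaks down, a stdlib-only
| sketch (not from the thread) of the obvious rule and where it
| fails:

```python
import re

def naive_split(text):
    """Naive splitter: break on '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Abbreviations defeat the naive rule: "Dr." ends a "sentence",
# which is why a trained segmenter is worth using.
print(naive_split("Dr. Smith arrived. He was late."))
# -> ['Dr.', 'Smith arrived.', 'He was late.']
```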
| number6 wrote:
| Thanks for the explanation. Guess I never had an NLP use
| case. So SpaCy can take a text or sentences apart and
| knows what the parts "mean", but I can't think of a problem
| to solve with this. Maybe summarize a text or something,
| but I guess I am not imaginative enough.
| Xenoamorphous wrote:
| I think I read somewhere that spaCy was going to have named
| entity disambiguation at some point, with named entities having
| links to knowledge bases like Wikidata or DBpedia. That's
| something that paid NER services offer but that I haven't found
| in open-source libs, and it would be really interesting IMO.
| svlandeg wrote:
| There's a component for Entity Linking available in spaCy, but
| you have to train it yourself, as the use-cases (type of
| entities, type of knowledge base etc) can vary greatly. See
| more here: https://spacy.io/api/entitylinker
| screye wrote:
| I have been using Spacy3 nightly for a while now. This is game
| changing.
|
| Spacy3 practically covers 90% of NLP use-cases with near SOTA
| performance. The only reason to not use it would be if you are
| literally pushing the boundaries of NLP or building something
| super specialized.
|
| Hugging Face and Spacy (also Pytorch, but duh) are saving
| millions of dollars in man hours for companies around the world.
| They've been a revelation.
| williamtrask wrote:
| Man I wish they could be compensated remotely in proportion to
| that. Matthew Honnibal and team are wizards who have been
| working really hard for a really long time.
|
| Maybe there's a comp strategy I don't know about - but they've
| created SO much value for the world.
| syllogism wrote:
| Thanks for the love :). For the record yes we've been working
| hard, but also yes, we've been doing well from it.
|
| I will say that people are using spaCy for free because that
| is what we asked them to do. I chose to make the library free
| and open-source when I first released it because I had the
| idea that I would be able to make that work out for me, if I
| could make this thing that would be useful to people and if
| they could be convinced to adopt it. And in order to convince
| people to adopt it, we've been telling people that spaCy will
| stay free and that we'll continue to work on it. So
| everything's going to plan here. Even if things weren't
| working out well for us (and they are), the fault would be
| entirely ours. I don't think we'd have any right to suddenly
| say, "Oh none of you jerks are paying, how unfair".
|
| (For the record, we make money from sales of our annotation
| tool, Prodigy: https://prodi.gy . If you're reading this and
| you like spaCy, check it out ;)
| earthnail wrote:
| Oh wow, Prodigy looks amazing. I need a tool for audio
| annotation, and looks like you guys built just the right
| thing for me.
|
| Please invest more in SEO; I didn't find you guys two weeks
| ago when I researched different options for audio
| annotation :D.
| joshklein wrote:
| I want to personally thank you for your work, and let you
| know I couldn't have done an important project of mine if
| spaCy didn't exist, and if it were not a free resource.
|
| Your project was 1 of the 2 instrumental tools in my
| project to structure the transcripts of every word said on
| the floor of the New York State Senate over the past ~30
| years in order to develop a topic-based "proximity"
| heuristic (based on CorEx, the 2nd instrumental tool) for
| which state senators were focused on which issues, based on
| the things they _actually said on the record_ , not based
| on their press statements or their voting records (the
| latter of which doesn't capture all the information you'd
| hope it would due to procedural nuances too obscure to
| detail here).
|
| Thank you. Thank you, thank you, thank you.
| tedivm wrote:
| The only thing that Prodigy is missing is a team-based
| workflow. We've been on the beta list for a while, and are
| excited for it to come out - but without having a concept of
| users we've had to use other tools that aren't as polished on
| the annotation side but which hit our compliance needs.
| hantusk wrote:
| This. Wholly agree. Currently running a large labelling
| task with 12 labelers.
|
| Using Amazon Ground Truth, which works fine (although it
| seems quite MVP outside the core functionality, e.g. wrt
| reporting or user creation).
|
| What tool have you had success with?
| JPKab wrote:
| Everything in the above paragraph sounds like a hyped
| overstatement. None of it is.
|
| As someone that's worked on some rather intensive NLP
| implementations, Spacy 3.0 and HuggingFace both represent the
| culmination of a technological leap in NLP that started a few
| years ago with the advent of transfer learning in NLP. The
| level of accessibility to the masses these libraries offer is
| game-changing and democratizing.
| aapppwe wrote:
| I am curious, what kind of project are you working on?
| drran wrote:
| Can you help, please?
|
| I want to use AI to translate (localize) messages for free
| software, in my case into Ukrainian. My plan to improve the
| quality of automated translation is to translate from similar
| languages in parallel, i.e. give the same message in English,
| Russian, and Polish, and expect the message in Ukrainian as
| output.
|
| Where should I start? Which libraries should I use? How do I
| connect them? How do I train them?
| chartpath wrote:
| I've been using LASER from Facebook Research via
| https://github.com/yannvgn/laserembeddings to accept
| multilingual input in front of the domain-specific models
| for recommendations and stuff (which are trained on English
| annotated examples).
| liminal wrote:
| I'm curious what sort of NLP use cases people are solving. How
| are people finding business value in these models and pipelines?
| We have looked at a number of uses and have found it hard to make
| a case for ROI. Wondering what's been working for folks.
| datameta wrote:
| I wonder if it is sheer coincidence that SpaCy is pronounced the
| way the russian word "spasai" is. It means "rescue" (v.)
| jcuenod wrote:
| I'll add my hats off to @ines and the spaCy team. It's super
| impressive. There's also a (free) orientation course I'd
| recommend at https://course.spacy.io/
| mark_l_watson wrote:
| Thanks to the SpaCy team! I spent a lot of time over about 20
| years working on my own NLP tools. I stopped doing that and
| mostly now just use SpaCy (and sometimes Huggingface and Apple's
| NLP models).
| Abishek_Muthian wrote:
| Have you compared SpaCy with Apple's NLP on M1? I presume GPU
| is not being used by SpaCy on M1.
| t-vi wrote:
| Given that SpaCy uses PyTorch, that is being worked on.
| mark_l_watson wrote:
| I have used both on M1, but haven't compared them.
| stingraycharles wrote:
| Ok so I've evaluated spacy a few years ago, but nowadays we're
| using huggingface's transformers / tokenizers / etc to train our
| own language models + fine tuned models. I see there's now
| transformer based pipeline support, how do the two relate?
|
| Phrased differently, how does spacy fit in with today's world of
| transformers? Would it still be interesting for me?
| binarymax wrote:
| I have lots of experience with both, and I use both together
| for different use cases. SpaCy fills the need of
| predictable/explainable pattern matching and NER - and is very
| fast and reasonably accurate on a CPU. Huggingface fills the
| need for task based prediction when you have a GPU.
| danieldk wrote:
| _Huggingface fills the need for task based prediction when
| you have a GPU._
|
| With model distillation, you can make models that annotate
| hundreds of sentences per second on a single CPU with a
| library like Huggingface Transformers.
|
| For instance, one of my distilled Dutch multi-task syntax
| models (UD POS, language-specific POS, lemmatization,
| morphology, dependency parsing) annotates 316 sentences per
| second with 4 threads on a Ryzen 3700X. This distilled model
| has virtually no loss in accuracy compared to the finetuned
| XLM-RoBERTa base model.
|
| I don't use Huggingface Transformers, but ported some of
| their implementations to Rust [1], but that should not make a
| big difference since all the heavy lifting happens in C++ in
| libtorch anyway.
|
| tl;dr: it is not true that transformers are only useful for
| GPU prediction. You can get high CPU prediction speeds with
| some tricks (distillation, length-based bucketing in batches,
| using MKL, etc.).
|
| [1] https://github.com/tensordot/syntaxdot/tree/main/syntaxdo
| t-t...
| binarymax wrote:
| Interesting. Did you start from a Distilled base model
| (like DistilRoBerta), or did you distill your fine-tuned
| model?
| syllogism wrote:
| The improved transformers support is definitely one of the main
| features of the release. I'm also really pleased with how the
| project system and config files work.
|
| If you're always working with exactly one task model, I think
| working directly in transformers isn't that different from
| using spaCy. But if you're orchestrating multiple models,
| spaCy's pipeline components and Doc object will probably be
| helpful. A feature in v3 that I think will be particularly
| useful is the ability to share a transformer model between
| multiple components, for instance you can have an entity
| recogniser, text classifier and tagger all using the same
| transformer, and all backpropagating to it.
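| In config terms, that sharing looks roughly like this (a hedged
| sketch; the listener architecture name follows spacy-transformers
| conventions, but treat the exact values as assumptions):

```ini
[components.transformer]
factory = "transformer"

# Both components listen to the shared transformer instead of
# owning their own embedding layer, so both backpropagate to it:
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.tagger.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
```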
|
| You also might find the projects system useful if you're
| training a lot of models. For instance, take a look at the
| project repo here:
| https://github.com/explosion/projects/tree/v3/benchmarks/ner...
| Most of the readme there is actually
| generated from the project.yml file, which fully specifies the
| preprocessing steps you need to build the project from the
| source assets. The project system can also push and pull
| intermediate or final artifacts to a remote cache, such as an
| S3 bucket, with the addressing of the artifacts calculated
| based on hashes of the inputs and the file itself.
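| A project.yml in that style might look roughly like this (an
| illustrative sketch; the command names, paths, and asset URL are
| assumptions):

```yaml
title: "NER benchmark"
assets:
  - dest: "assets/raw.jsonl"
    url: "https://example.com/raw.jsonl"   # hypothetical source asset
commands:
  - name: preprocess
    help: "Convert raw annotations to spaCy's binary format"
    script:
      - "python scripts/preprocess.py assets/raw.jsonl corpus/train.spacy"
  - name: train
    help: "Train the pipeline from the config file"
    script:
      - "python -m spacy train configs/config.cfg --output training/"
```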
|
| The config file is comprehensive and extensible. The blocks
| refer to typed functions that you can specify yourself, so you
| can substitute any of your own layer (or other) functions in,
| to change some part of the system's behaviour. You don't _have_
| to specify your models from the config files like this --- you
| can instead put it together in code. But the config system
| means there's a way of fully specifying a pipeline and all of
| the training settings, which means you can really standardise
| your training machinery.
|
| Overall the theme of what we're doing is helping you to line up
| the workflows you use during development with something you can
| actually ship. We think one of the problems for ML engineers is
| that there's quite a gap between how people are iterating in
| their local dev environment (notebooks, scrappy directories
| etc) and getting the project into a state that you can get
| other people working on, try out in automation, and then pilot
| in some sort of soft production (e.g. directing a small amount
| of traffic to the model).
|
| The problem with iterating in the local state is that you're
| running the model against benchmarks that are not real, and you
| hit diminishing returns quite quickly this way. It also
| introduces a lot of rework.
|
| All that said, there will definitely be usage contexts where
| it's not worth introducing another technology. For instance, if
| your main goal is to develop a model, run an experiment and
| publish a paper, you might find spaCy doesn't do much that
| makes your life easier.
| pabs3 wrote:
| Please note that Explosion does not like redistribution of SpaCy,
| they expect everyone to only use the builds they produce, so it
| would not be a good idea to package it for your favourite distro.
| syllogism wrote:
| I'm sorry that this conflicted with your plans, but I feel
| strongly that distributing Python libraries via system package
| managers such as apt is very bad for users. The pain is felt
| especially by users who are relatively new to Python, who will
| end up with their system Python in a confusing state that is
| difficult to correct.
|
| We of course encourage anyone to clone the repo or install from
| an sdist if they want to compile from source. In fact you can
| do the following:
|
|     git clone https://github.com/explosion/spaCy
|     cd spaCy
|     make
|
| This will build you a standalone executable file, in the pex
| format, that only depends on your system Python and does not
| install any files into your system. You can copy this artifact
| into your bin and use it as a command-line application.
| t-vi wrote:
| That is sad, I would use SpaCy more if it had Debian packages
| (in particular in Debian). Python stuff packaged by Debian
| seems to work very well for me and has been for years.
| dna_polymerase wrote:
| Pretty poor choice of license if they wanted me to care about
| their builds, tbh.
| pabs3 wrote:
| Their concern is poor user experience when the documentation
| on the web doesn't match what versions are being
| redistributed by other folks.
| pplonski86 wrote:
| Is there any framework similar to SpaCy or HuggingFace but for
| images?
| syllogism wrote:
| I haven't had a situation to use it, but I think Kornia looks
| cool: https://github.com/kornia/kornia
| polynomial wrote:
| Super excited to see improvement in NER accuracy in SpaCy 3.0.
___________________________________________________________________
(page generated 2021-02-01 23:00 UTC)