[HN Gopher] WhisperNER: Unified Open Named Entity and Speech Rec...
___________________________________________________________________
WhisperNER: Unified Open Named Entity and Speech Recognition
Author : timbilt
Score : 108 points
Date : 2024-11-21 21:41 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| timbilt wrote:
| GitHub repo: https://github.com/aiola-lab/whisper-ner
|
| Hugging Face Demo: https://huggingface.co/spaces/aiola/whisper-
| ner-v1
|
| Pretty good article that focuses on the privacy/security aspect
| of this -- having a single model that does ASR and NER:
|
| https://venturebeat.com/ai/aiola-unveils-open-source-ai-audi...
| wanderingmind wrote:
| Looks like only inference code is available, with no
| fine-tuning code.
| Tsarp wrote:
| Wouldn't it be better to run normal Whisper and then NER on
| top of the transcription, before streaming a response or
| writing anything to disk?
|
| What advantage does this offer?
| conradev wrote:
| Yeah, I'm also curious about that. Does combining ASR and NER
| into one model improve performance for either?
| anewhnaccount2 wrote:
| Almost definitely. You can think of there being a kind of
| triangle inequality for cascading different systems: manually
| combined systems almost always perform worse than an
| end-to-end model, given comparable data and model capacity.
| Put another way, you have tied the model's hands by forcing
| it to bottleneck through a representation you chose.
| timbilt wrote:
| I think one of the biggest advantages is the security/privacy
| benefit -- you can see in the demo that the model can mask
| entities instead of tagging. This means that instead of
| transcribing and then scrubbing sensitive info, you can
| prevent the sensitive info from ever being transcribed.
| Another potential benefit is lower latency. The paper doesn't
| specifically mention latency, but inference seems to be on
| par with normal Whisper, so you save all of the time a
| separate entity-tagging pass would normally take -- a big
| deal for real-time applications.
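|
| For contrast, post-hoc scrubbing has to materialize the raw
| text before masking it. A minimal sketch in Python (the
| offset-based entity format here is an illustrative
| assumption, not WhisperNER's actual output):
|
|     def mask_entities(transcript, entities):
|         # entities: assumed NER output with char offsets, e.g.
|         # [{"start": 10, "end": 26, "label": "credit-card"}]
|         # Replace spans back-to-front so earlier offsets
|         # stay valid.
|         for ent in sorted(entities, key=lambda e: e["start"],
|                           reverse=True):
|             transcript = (transcript[:ent["start"]]
|                           + "<" + ent["label"] + ">"
|                           + transcript[ent["end"]:])
|         return transcript
|
| Note the unmasked transcript still exists in memory here; the
| joint model avoids ever producing it.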
| PeterisP wrote:
| The general principle is that "pipelines" impose a
| restriction where the errors of the first step get baked in
| and the knowledge of the following step(s) can't effectively
| be used to fix them.
|
| So if the first step isn't near-perfect (which ASR isn't),
| and if there is some information or "world knowledge" in the
| later step(s) that would help resolve the first step's
| ambiguities (which is true of knowledge about named entities
| with respect to ASR), then you can get better accuracy from
| an end-to-end system where you don't attempt to pick just one
| best option at the system boundary. Joint training can also
| be helpful, but that IMHO is less important.
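|
| A toy illustration of keeping that boundary soft: rescore the
| ASR's n-best list with downstream entity knowledge instead of
| committing to the 1-best hypothesis (the entity set, scores,
| and hypotheses below are made up):
|
|     KNOWN_ENTITIES = {"juventus", "arsenal"}
|
|     def rescore(nbest, bonus=2.0):
|         # nbest: [(hypothesis_text, asr_log_score), ...]
|         # Reward hypotheses containing known entities, letting
|         # later-stage knowledge overrule the ASR's top choice.
|         def score(item):
|             text, asr_score = item
|             hits = sum(1 for e in KNOWN_ENTITIES
|                        if e in text.lower())
|             return asr_score + bonus * hits
|         return max(nbest, key=score)
|
|     nbest = [("you ventus beat arsenal", -1.2),
|              ("juventus beat arsenal", -1.5)]
|     print(rescore(nbest)[0])  # "juventus beat arsenal"
|
| An end-to-end model does this implicitly over the full search
| space rather than a truncated n-best list.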
| its_down_again wrote:
| From my experience, ASR-to-NER pipelines don't perform
| adequately out of the box. Even though SOTA ASR systems claim
| 85% word accuracy, the distribution of errors is worth
| looking into. Critical entities like credit card numbers or
| addresses are particularly error-prone, and even a small
| mistake renders the result useless.
|
| These ASR errors cascade into the NER step, further degrading
| recall and precision. Combining ASR and NER into a joint
| model or integrated approach can reduce these issues in
| theory; it's just more complex to implement and less commonly
| used.
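|
| A quick worked example of why entity-level accuracy is much
| harsher than word accuracy (the digits are made up):
|
|     ref = "4532 7718 0923 4411"
|     hyp = "4532 7718 0923 4471"  # one digit mis-recognized
|     # 15 of 16 digits are correct (~94%), yet exact-match
|     # recall on the credit-card entity is 0 -- the extracted
|     # number is unusable.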
| clueless wrote:
| "The model processes audio files and simultaneously applies NER
| to tag or mask specific types of sensitive information directly
| within the transcription pipeline. Unlike traditional multi-step
| systems, which leave data exposed during intermediary processing
| stages, Whisper-NER eliminates the need for separate ASR and NER
| tools, reducing vulnerability to breaches."
| vessenes wrote:
| The title is dense and the paper is short, but the demo is
| outstanding (https://huggingface.co/spaces/aiola/whisper-
| ner-v1). The sample audio is submitted with "entity labels" set
| to "football-club, football-player, referee" and WhisperNER
| returns tags Arsenal and Juventus for the football-club tag. They
| suggest "personal information" as a tag to try on audio.
|
| Impressive, very impressive. I wonder if it could listen for
| credit cards or passwords.
| alienallys wrote:
| On a similar note, I have a request for the HN community: can
| anyone recommend a low-latency NER model/service?
|
| I'm building an assistant that gives information on local
| medical providers that match your criteria. I'm struggling
| with query expansion and entity recognition. For any incoming
| query, I want to run NER to extract medical terms (which are
| limited in scope and pre-determined), and then do query
| rewriting and expansion.
| will-burner wrote:
| https://www.tonic.ai/products/textual offers NER models through
| an API or with a UI for managing projects. You can sign up for
| a free trial at https://textual.tonic.ai
| uniqueuid wrote:
| It's so great to see that we're finally moving away from the
| thirty-year-old triple categorization of people,
| organizations, and locations.
|
| This of course means that we now have to think about all the
| irreconcilable problems of taxonomy, but I'll take that any day
| over the old version :)
| will-burner wrote:
| Is there any reason why this would work better, or is even
| needed, compared to taking audio and 1. doing ASR with
| Whisper, for instance, then 2. applying an NER model to the
| transcribed text?
|
| There are open source NER models that can identify any specified
| entity type (https://universal-ner.github.io/,
| https://github.com/urchade/GLiNER). I don't see why this
| WhisperNER approach would be any better than doing ASR with
| whisper and then applying one of these NER models.
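|
| For reference, that two-step baseline is only a few lines
| with openai-whisper and GLiNER (a minimal sketch; the model
| names and file name are illustrative):
|
|     import whisper                 # openai-whisper
|     from gliner import GLiNER
|
|     asr = whisper.load_model("base")
|     ner = GLiNER.from_pretrained("urchade/gliner_base")
|
|     # Transcribe first, then tag the transcript.
|     text = asr.transcribe("match_report.mp3")["text"]
|     entities = ner.predict_entities(
|         text, ["football-club", "football-player", "referee"])
|
| Whatever the ASR got wrong in the transcript is unrecoverable
| by the time the NER model sees it, which is the failure mode
| the joint model is meant to avoid.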
| danielcampos93 wrote:
| This works better because it gives the decoder (which
| generates the text) a secondary set of conditions on its
| generation. Assume that instead of their demo you are doing
| speech-to-text for oncologists. Out-of-the-box Whisper is
| terrible there because the words are new and rare, especially
| in YouTube videos. If you just run ASR through it and then
| run NER, it will generate regular words in place of cancer
| names. Instead, if you condition generation on topical
| entities, the generation space is constrained and performance
| will improve, especially when you can tell the model what all
| the drug names are because you have a list
| (https://www.cancerresearchuk.org/about-
| cancer/treatment/drug...)
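|
| The closest off-the-shelf analogue with vanilla Whisper is
| its initial_prompt option, which seeds the decoder's context
| with domain vocabulary (a rough sketch, not WhisperNER's
| mechanism; the file name and drug list are illustrative):
|
|     import whisper
|
|     model = whisper.load_model("base")
|     drugs = "trastuzumab, pembrolizumab, bevacizumab"
|     result = model.transcribe(
|         "oncology_dictation.mp3",
|         initial_prompt="Drugs discussed: " + drugs + ".")
|     print(result["text"])
|
| WhisperNER goes further by conditioning on entity labels
| directly, but prompt seeding captures the same idea of
| constraining the generation space.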
___________________________________________________________________
(page generated 2024-11-22 23:01 UTC)