[HN Gopher] WhisperNER: Unified Open Named Entity and Speech Rec...
       ___________________________________________________________________
        
       WhisperNER: Unified Open Named Entity and Speech Recognition
        
       Author : timbilt
       Score  : 108 points
       Date   : 2024-11-21 21:41 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | timbilt wrote:
       | GitHub repo: https://github.com/aiola-lab/whisper-ner
       | 
       | Hugging Face Demo: https://huggingface.co/spaces/aiola/whisper-
       | ner-v1
       | 
       | Pretty good article that focuses on the privacy/security aspect
       | of this -- having a single model that does ASR and NER:
       | 
       | https://venturebeat.com/ai/aiola-unveils-open-source-ai-audi...
        
         | wanderingmind wrote:
          | Looks like only inference code is available, with no
          | fine-tuning code.
        
         | Tsarp wrote:
          | Wouldn't it be better to run normal Whisper, then NER on
          | top of the transcription, before streaming a response or
          | writing anything to disk?
         | 
         | What advantage does this offer?
        
           | conradev wrote:
           | Yeah, I'm also curious about that. Does combining ASR and NER
           | into one model improve performance for either?
        
             | anewhnaccount2 wrote:
             | Almost definitely. You can think of there being a type of
             | triangle inequality for cascading different systems where
             | manually combined systems almost always perform worse given
              | comparable data and model capacity. Put another way,
              | you have tied the model's hands by forcing it to
              | bottleneck through a representation you chose.
        
           | timbilt wrote:
           | I think one of the biggest advantages is the security/privacy
           | benefit -- you can see in the demo that the model can mask
           | entities instead of tagging. This means that instead of
           | transcribing and then scrubbing sensitive info, you can
           | prevent the sensitive info from ever being transcribed.
            | Another potential benefit is lower latency. The paper
            | doesn't specifically mention latency, but it seems to be
            | on par with normal Whisper, so you save all of the time
            | it would normally take to do entity tagging -- a big
            | deal for real-time applications.
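
The masking idea above can be sketched in a few lines. A minimal
illustration, assuming the model emits tags in a hypothetical
`<label>span</label>` format (the real WhisperNER output format may
differ):

```python
import re

def mask_entities(tagged: str) -> str:
    """Replace every <label>span</label> pair with a [LABEL] placeholder,
    so the sensitive surface text never reaches the final transcript."""
    return re.sub(r"<([\w-]+)>.*?</\1>",
                  lambda m: f"[{m.group(1).upper()}]",
                  tagged)

tagged = ("My card number is <credit-card>4111 1111 1111 1111"
          "</credit-card>, call <person>Alice</person>.")
print(mask_entities(tagged))
# -> My card number is [CREDIT-CARD], call [PERSON].
```

The privacy argument in the comment is that a joint model can do this
masking at decode time, so the raw span is never written out at all;
this sketch only shows the post-hoc equivalent on tagged text.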
        
           | PeterisP wrote:
           | The general principle is that "pipelines" impose a
           | restriction where the errors of the first step get baked-in
           | and can't effectively use the knowledge of the following
           | step(s) to fix them.
           | 
            | So if the first step isn't near-perfect (which ASR
            | isn't), and if there is some information or "world
            | knowledge" in the later step(s) that helps resolve the
            | ambiguity (which is true of knowledge about named
            | entities and ASR), then you can get better accuracy from
            | an end-to-end system where you don't attempt to pick
            | just one best option at the system boundary. Joint
            | training can also be helpful, but that IMHO is less
            | important.
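
The point about committing to one option at the system boundary can be
shown with a toy example (all scores are made up; neither function here
is a real ASR or NER model):

```python
# Toy example: the ASR's 1-best transcript is slightly preferred
# acoustically, but only the 2-best reading is consistent with a known
# entity. A pipeline commits at the boundary; a joint score does not.

asr = {"meet you at ten eason": 0.40,   # acoustic 1-best, garbled name
       "meet you at tennyson":  0.35}   # 2-best, contains a real place

def ner_score(transcript):
    # Stand-in for NER "world knowledge": 'tennyson' is a plausible
    # location entity, the garbled reading is not.
    return 0.9 if "tennyson" in transcript else 0.1

# Pipeline: commit to the argmax transcript first, then tag it.
pipeline_choice = max(asr, key=asr.get)

# Joint: score transcript and entity evidence together.
joint_choice = max(asr, key=lambda t: asr[t] * ner_score(t))

print(pipeline_choice)  # meet you at ten eason
print(joint_choice)     # meet you at tennyson
```

The pipeline bakes in the acoustic error (0.40 beats 0.35), while the
joint score (0.35 x 0.9 = 0.315 vs. 0.40 x 0.1 = 0.04) recovers the
correct entity.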
        
           | its_down_again wrote:
           | From my experience, ASR-to-NER pipelines don't perform
           | adequately out of the box. Even though SOTA ASR systems claim
           | 85% word accuracy, the distribution of errors is worth
            | looking into. Critical entities like credit card
            | numbers or addresses are particularly error-prone, and
            | even a small mistake renders the result useless.
           | 
           | These ASR errors cascade into the NER step, further degrading
           | recall and precision. Combining ASR and NER into a joint
            | model or integrated approach can reduce these issues
            | in theory; it's just more complex to implement and less
            | commonly used.
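
A quick back-of-the-envelope (illustrative numbers, not from the paper)
shows why multi-token entities suffer: the entity is only usable if
every token is right, so entity-level accuracy decays exponentially
with entity length:

```python
# Per-token ASR accuracy vs. whole-entity accuracy. Assumes token
# errors are independent, which is a simplification.

def entity_accuracy(per_token_acc: float, tokens: int) -> float:
    return per_token_acc ** tokens

for acc in (0.85, 0.95, 0.99):
    print(f"{acc:.0%} per token -> "
          f"{entity_accuracy(acc, 16):.1%} for a 16-token card number")
# roughly 7.4%, 44.0%, and 85.1% respectively
```

Even at 95% per-token accuracy, fewer than half of 16-digit card
numbers would come through intact, which matches the comment's point
that headline word-accuracy figures hide entity-level failures.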
        
       | clueless wrote:
       | "The model processes audio files and simultaneously applies NER
       | to tag or mask specific types of sensitive information directly
       | within the transcription pipeline. Unlike traditional multi-step
       | systems, which leave data exposed during intermediary processing
       | stages, Whisper-NER eliminates the need for separate ASR and NER
       | tools, reducing vulnerability to breaches."
        
       | vessenes wrote:
       | The title is dense and the paper is short. But the demo is
       | outstanding: (https://huggingface.co/spaces/aiola/whisper-
       | ner-v1). The sample audio is submitted with "entity labels" set
       | to "football-club, football-player, referee" and WhisperNER
       | returns tags Arsenal and Juventus for the football-club tag. They
       | suggest "personal information" as a tag to try on audio.
       | 
       | Impressive, very impressive. I wonder if it could listen for
       | credit cards or passwords.
        
       | alienallys wrote:
        | On a similar note, I have a request for the HN community:
        | can anyone recommend a low-latency NER model/service?
       | 
       | I'm building an assistant that gives information on local medical
       | providers that match your criteria. I'm struggling with query
        | expansion and entity recognition. For any incoming query,
        | I would want to run NER for medical terms (which are
        | limited in scope and pre-determined) and subsequently do
        | query rewriting and expansion.
        
         | will-burner wrote:
         | https://www.tonic.ai/products/textual offers NER models through
         | an API or with a UI for managing projects. You can sign up for
         | a free trial at https://textual.tonic.ai
        
       | uniqueuid wrote:
        | It's so great to see that we're finally moving away from
        | the thirty-year-old triple categorization of people,
        | organizations and locations.
       | 
       | This of course means that we now have to think about all the
       | irreconcilable problems of taxonomy, but I'll take that any day
       | over the old version :)
        
       | will-burner wrote:
       | Is there any reason why this would work better or is needed
       | compared to taking audio and 1. doing ASR with whisper for
       | instance 2. applying an NER model to the transcribed text?
       | 
       | There are open source NER models that can identify any specified
       | entity type (https://universal-ner.github.io/,
       | https://github.com/urchade/GLiNER). I don't see why this
       | WhisperNER approach would be any better than doing ASR with
       | whisper and then applying one of these NER models.
        
         | danielcampos93 wrote:
          | This works better because it gives the decoder (which
          | generates the text) a secondary set of conditions on
          | which to condition its generation. Assume that instead of
          | their demo you are doing speech-to-text for oncologists.
          | Out-of-the-box Whisper is terrible because the words are
          | new and rare, especially in YouTube videos. If you just
          | run ASR and then NER on top, it will generate regular
          | words over cancer drug names. Instead, if you condition
          | generation on topical entities, the generation space is
          | constrained and performance will improve -- especially
          | when
         | you can tell the model what all the drug names are because you
         | have a list (https://www.cancerresearchuk.org/about-
         | cancer/treatment/drug...)
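
One simple way to see the effect of conditioning on a known term list
is hypothesis rescoring (a sketch with made-up scores; WhisperNER
itself conditions the decoder directly on entity-type prompts rather
than rescoring, so this is only an illustration of why a lexicon helps
with rare words):

```python
# Toy rescoring: boost ASR hypotheses that contain terms from a known
# domain lexicon, e.g. a drug-name list. Scores and weights are
# arbitrary illustrative values.

LEXICON = {"pembrolizumab", "nivolumab"}  # known oncology drug names
BONUS = 2.0                               # weight for lexicon matches

def rescore(hypotheses):
    """Pick the (text, acoustic_score) hypothesis with the best
    combined acoustic + lexicon-match score."""
    def score(item):
        text, acoustic = item
        hits = sum(term in text.lower() for term in LEXICON)
        return acoustic + BONUS * hits
    return max(hypotheses, key=score)

hyps = [("prescribe pem bro lizard mab", 0.52),  # acoustically preferred
        ("prescribe pembrolizumab", 0.48)]       # matches the lexicon
print(rescore(hyps)[0])  # prescribe pembrolizumab
```

Without the lexicon bonus, the garbled reading wins on acoustic score
alone; with it, the rare in-domain term is recovered, which is the
behavior the comment describes for conditioned generation.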
        
       ___________________________________________________________________
       (page generated 2024-11-22 23:01 UTC)