[HN Gopher] Wav2vec Overview: Semi and Unsupervised Speech Recog...
___________________________________________________________________
Wav2vec Overview: Semi and Unsupervised Speech Recognition
Author : vackosar
Score : 110 points
Date : 2021-07-03 15:39 UTC (7 hours ago)
(HTM) web link (vaclavkosar.com)
(TXT) w3m dump (vaclavkosar.com)
| lunixbochs wrote:
| One addendum to the linked post's notes:
|
| > SoTa in low-resource setting Libri-light by a lot on WER clean
| test 100h labeled: others ~4 vs theirs ~2.5
|
| > SoTa on high-resource noisy data (3.3 vs 3.4) close to SoTa on
| clean data
|
| This note isn't very specific, but if I'm reading it correctly
| it's outdated: as far as I know, the SOTA on this data is held
| by Conformer 1B (a 1-billion-parameter model), at 1.4 WER clean
| and 2.6 noisy.
|
| Conformer 1B is something like wav2vec 2.0 pretraining +
| conformer + noisy student + specaugment.
|
| https://arxiv.org/pdf/2010.10504.pdf
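|
| For reference, a minimal sketch of a SpecAugment-style masking
| step (shapes and parameter names here are illustrative, not the
| paper's exact recipe):
|
|     import numpy as np
|
|     def spec_augment(spec, max_f=27, max_t=100):
|         # spec: (freq_bins, time_steps) log-mel spectrogram
|         spec = spec.copy()
|         f = np.random.randint(1, max_f)   # frequency mask width
|         f0 = np.random.randint(0, spec.shape[0] - f)
|         spec[f0:f0 + f, :] = 0.0          # mask a frequency band
|         t = np.random.randint(1, max_t)   # time mask width
|         t0 = np.random.randint(0, spec.shape[1] - t)
|         spec[:, t0:t0 + t] = 0.0          # mask a span of time
|         return spec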
|
| --
|
| Wav2vec 2.0 is very cool, but I've had some trouble reproducing
| the pretraining and fine-tuning reliably. It might need a lot of
| resources (e.g. hundreds of clustered GPUs).
|
| I think Wav2vec-U is extremely cool.
| knuthsat wrote:
| I always wonder how people arrive at these successful gigantic
| models when it takes hundreds of TPUs and days to train them.
|
| I recently bought an RTX 3090 in hopes of playing around with
| some computer vision applications, but I guess having 24 GB of
| VRAM is nothing if I want to get something SOTA working.
| qayxc wrote:
| The RTX 3090 is a beast compared to what researchers had
| available to them just a few years ago.
|
| Don't try to chase SOTA - that's a fruitless endeavour.
|
| 24 GB of VRAM is plenty for CV, and you can train some
| excellent models with it. Also keep in mind that you don't
| necessarily need to train models from scratch.
|
| You can achieve great things by downloading a well-tested,
| pretrained model and fine-tuning it for your particular task or
| application. Coming up with new models and training them from
| scratch is an exercise in futility for really big models.
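|
| A minimal PyTorch sketch of that workflow (the backbone and
| class count are just an example, not a recommendation):
|
|     import torch.nn as nn
|     from torchvision import models
|
|     # download a well-tested pretrained backbone and freeze it
|     model = models.resnet18(pretrained=True)
|     for p in model.parameters():
|         p.requires_grad = False
|
|     # swap in a new head and train only that for your task
|     model.fc = nn.Linear(model.fc.in_features, 10)  # e.g. 10 classes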
|
| I usually only train smaller models (a couple million
| parameters), and training and fine-tuning usually take
| anywhere from a few hours to a day or two. But then again, my
| hardware is two generations older than yours.
| sdenton4 wrote:
| The EfficientNet paper has some good things to say on this.
|
| If you're working at a place with giant datacenters full of
| (T/G)PUs, you can train one giant model a few times, or train
| smaller models hundreds of times. Without a hyperparameter
| search, there's a really high chance that you're just looking
| in the wrong region and will wind up with something gigantic
| but kinda meh.
|
| So, the simple strategy is to use the smaller models to find
| a great mix of hyperparameters, and then scale up to a
| gigantic model. The EfficientNet paper demonstrates some
| fairly reliable ways to scale up the model, changing width
| and depth together according to a scaling factor.
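|
| Concretely, the paper scales depth, width, and input resolution
| together from a single compound coefficient phi (the constants
| below are the ones reported for EfficientNet):
|
|     # Compound scaling from the EfficientNet paper. alpha, beta,
|     # gamma are found once by a small grid search on the base
|     # model, under the constraint alpha * beta**2 * gamma**2 ~= 2.
|     alpha, beta, gamma = 1.2, 1.1, 1.15
|
|     def scale_factors(phi):
|         # returns (depth, width, resolution) multipliers
|         return alpha ** phi, beta ** phi, gamma ** phi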
|
| But yeah, even for smaller model footprints, the ability to
| run tens of experiments in parallel goes a very long way. If
| you've got a single GPU to play with, I would instead focus on
| a well-scoped, interesting question that you can answer
| without having to demonstrate SOTA-ness, since chasing SOTA
| will be an uphill climb.
|
| Also remember that it's good to lean heavily on pre-trained
| models to save time. Anything you can do to iterate faster,
| really.
| WillDaSilva wrote:
| I wonder how much better this would be at capturing information
| that doesn't translate well into text representations of speech.
|
| Consider how with word2vec there are relationships in the
| embedding space between semantically related words. I would
| expect the word2vec examples of that (e.g. king -> queen being
| a similar translation as man -> woman) to apply here too, but
| can it also do things like place regular questions and
| rhetorical questions in different regions of the embedding
| space based on the inflection in the speech?
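|
| (In the word2vec case that analogy is literally vector
| arithmetic; a toy sketch with made-up embeddings, where the
| real vectors would come from a trained model:)
|
|     import numpy as np
|
|     # hypothetical embeddings; a trained model would supply these
|     emb = {w: np.random.randn(300)
|            for w in ["king", "queen", "man", "woman"]}
|
|     def nearest(v, vocab):
|         # cosine-similarity search over the vocabulary
|         sims = {w: v @ u / (np.linalg.norm(v) * np.linalg.norm(u))
|                 for w, u in vocab.items()}
|         return max(sims, key=sims.get)
|
|     # king - man + woman ~= queen (with trained embeddings)
|     query = emb["king"] - emb["man"] + emb["woman"]
|     print(nearest(query, emb))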
|
| It would also be interesting to see what relationships exist
| between equivalent words in different languages within the
| embedding space. I suppose something like that is probably
| already used for text translation neural networks, but maybe some
| notable differences exist when dealing with speech directly.
| theropost wrote:
| Does anyone know of some good open-source projects for OCR?
| Tesseract always seems to be the default, while Google Cloud
| and other hosted services seem miles ahead. However, for those
| who don't want to rely on the big tech companies, are there
| any comparable alternatives?
| ismaj wrote:
| There is easyocr which is good enough but lacks maturity (it
| was acknowledged at some point by Yann LeCun). The code base
| isn't ideal. I'm currently working on my own custom OCR since
| easyocr isn't perfect at detecting emails for example
| Www.ismaj@gmail ;com
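|
| For anyone trying it, easyocr's documented quickstart is
| roughly this (assuming a local image file):
|
|     import easyocr
|
|     reader = easyocr.Reader(['en'])  # downloads models on first use
|     # each result is (bounding_box, text, confidence)
|     for box, text, confidence in reader.readtext('scan.png'):
|         print(text, confidence)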
| piceas wrote:
| I recently came across CRAFT, which appears to have come out
| of the ICDAR 2017 Robust Reading challenge.
|
| It performed better than expected. I only tested a few images,
| so please don't take my word for it.
|
| That led me to PaddleOCR. There is still plenty of room for
| improvement, but I found it far more convenient for my
| purposes than messing with Tesseract.
|
| https://github.com/clovaai/CRAFT-pytorch
|
| https://github.com/PaddlePaddle/PaddleOCR
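|
| For reference, PaddleOCR's documented quickstart is roughly
| this (assuming a local image file; the result layout varies a
| bit between versions):
|
|     from paddleocr import PaddleOCR
|
|     # first run downloads the detection/recognition models
|     ocr = PaddleOCR(use_angle_cls=True, lang='en')
|     result = ocr.ocr('scan.png', cls=True)
|     for box, (text, confidence) in result:
|         print(text, confidence)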
| spijdar wrote:
| As someone who's an idiot about machine learning: is it
| possible to run this code in reverse, e.g. take the generated
| (or novel) vectors and convert them back into audio/waveforms?
| monocasa wrote:
| Generalized reverse projection through even non recurrent
| neural networks is still an open research problem.
|
| So no in this case.
| spywaregorilla wrote:
| That doesn't sound like a particularly realistic problem to
| solve.
| monocasa wrote:
| I agree, but all the more glory if someone does solve it
| then. And the field is still new enough that I don't want
| to be cited for decades like the iPod release "no wireless.
| Less space than a Nomad. Lame." slashdot comment.
| [deleted]
| jmalicki wrote:
| If you look at the architecture diagram for Wav2Vec-U, the
| "generator" is doing exactly that - generating waveforms from
| the vectors. All GANs work this way, and that's how websites
| like https://thispersondoesnotexist.com/ work. Of course, as
| the sibling comment notes, the results today might not be
| great for this task, and it's open research, but it's not as
| if it just can't be done at all.
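|
| The generic shape of a GAN generator is just a network that
| maps a latent vector to a sample. A toy PyTorch sketch (not
| Wav2vec-U's actual generator; see the reply below):
|
|     import torch
|     import torch.nn as nn
|
|     # toy generator: 100-dim latent vector -> 1 second of
|     # 16 kHz "audio" with samples squashed into [-1, 1]
|     G = nn.Sequential(
|         nn.Linear(100, 512), nn.ReLU(),
|         nn.Linear(512, 16000), nn.Tanh(),
|     )
|     z = torch.randn(1, 100)  # random latent vector
|     waveform = G(z)          # shape (1, 16000)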
| lunixbochs wrote:
| My reading of the generator diagram (figure 6) isn't that it
| is generating waveforms, but that it is generating phoneme
| probabilities.
|
| You can train a similar system to produce audio on the output
| of wav2vec, though it probably won't sound similar to the
| input audio (accent/voice) unless you expose more features of
| the input than phonemes.
___________________________________________________________________
(page generated 2021-07-03 23:00 UTC)