[HN Gopher] DeepMind achieves SOTA image recognition with 8.7x f...
___________________________________________________________________
DeepMind achieves SOTA image recognition with 8.7x faster training
Author : highfrequency
Score : 186 points
Date : 2021-02-14 16:44 UTC (6 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| eggie5 wrote:
| What do you attribute the gains to? Adaptive clipping? Or the $$$
| spent on NAS??
| dheera wrote:
| > 8.7x faster to train
|
| This is an achievement, but it would have been helpful to put "to
| train" in the title, as this is quite different from efficiency at
| inference time, which is what often actually matters in deployed
| applications.
|
| From Table 3 on Page 7 it appears to me that NFNet is
| significantly heavier in the number of parameters than
| EfficientNet for similar accuracies. For example, EffNet-B5
| achieves 83.7% Top-1 with 30M params and 9.9B FLOPs, while
| NFNet-F0 achieves 83.6% Top-1 with 71.5M params and 12.38B FLOPs.
|
| It appears to me at first glance that NFNet has not achieved SOTA
| at inference.
| trott wrote:
| > It appears to me at first glance that NFNet has not achieved
| SOTA at inference.
|
| It has, for larger models (F1 vs B7). See Fig 4 in the
| Appendix.
| The_rationalist wrote:
| No it hasn't https://paperswithcode.com/sota/image-
| classification-on-imag...
| trott wrote:
| > No it hasn't https://paperswithcode.com/sota/image-
| classification-on-imag...
|
| We were talking about models trained on ImageNet,
| specifically about the trade-off between accuracy and
| FLOPs. But the higher-accuracy models listed in your link
| use extra data. So it's not quite the same benchmark we
| were talking about.
| The_rationalist wrote:
| The DeepMind paper's NFNet-F4+ you were talking about also uses
| external training data.
|
| The number one in accuracy (Meta Pseudo Labels) is also lighter at
| inference (390M vs 570M parameters) than the DeepMind one. So what
| are you disagreeing with?
| trott wrote:
| > The deepmind paper NFNet-F4+ you were talking about
| also has external training data.
|
| @dheera and I did not mention NFNet-F4+. All models,
| tables, figures and numbers that we did mention resulted
| from training on ImageNet alone.
| sillysaurusx wrote:
| I was skeptical of the 83%-whatever top-1 accuracy on ImageNet.
| But someone pointed out that when the model is pretrained on JFT,
| Google's proprietary 300-million image dataset, its accuracy
| increases to 87%-whatever.
|
| That's pretty interesting. It implies that the original accuracy
| rating might be legit. The concern is that we're chasing the
| ImageNet benchmark as if it's the holy grail, when in fact it's a
| very narrow slice of what we normally care about as ML
| researchers. _However_, the fact that pretraining on JFT increases
| the accuracy means that the model is generalizing, which is very
| interesting; it implies that models might be "just that good now."
|
| Or more succinctly: if the result were bogus, you'd expect JFT
| pretraining to have no effect whatsoever (or a negative effect).
| But it has a positive effect.
|
| The other thing worth mentioning is that AJMooch seems to have
| killed batch normalization dead, which is very strange to think
| about. BN has had a long reign of some ~4 years, but the drawbacks
| are significant: you have to maintain running statistics
| (counters) yourself, for example, which is quite annoying.
|
| It always seemed like a neural net ought to be able to learn what
| BN forces you to keep track of. And AJMooch et al seem to prove
| this is true. I recommend giving evonorm-s a try; it worked
| perfectly for us the first time, with no loss in generality, and
| it's basically a copy-paste replacement.
|
| (Our BigGAN-Deep model is so good that I doubt you can tell the
| difference vs the official model. It uses AJMooch's evonorm-s
| rather than batchnorm: [1] https://i.imgur.com/sfGVbuq.png [2]
| https://i.imgur.com/JMJ1Ll0.png and lol at the fake speedometer.)
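|
| For a flavor of what evonorm-s computes, here's a minimal NumPy
| sketch of EvoNorm-S0 as I understand it from the paper (the
| variable names and channels-last layout are my own assumptions):
|
|     import numpy as np
|
|     def evonorm_s0(x, gamma, beta, v, groups=32, eps=1e-5):
|         # x: (N, H, W, C). Statistics are per-sample, per-group,
|         # so there are no running counters to maintain, unlike BN.
|         n, h, w, c = x.shape
|         xg = x.reshape(n, h, w, groups, c // groups)
|         std = np.sqrt(xg.var(axis=(1, 2, 4), keepdims=True) + eps)
|         std = np.broadcast_to(std, xg.shape).reshape(n, h, w, c)
|         num = x / (1.0 + np.exp(-v * x))  # x * sigmoid(v * x)
|         return num / std * gamma + beta
|
| gamma, beta, and v are learned per-channel parameters of shape
| (1, 1, 1, C); v is what lets the layer learn its own nonlinearity.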
| Jabbles wrote:
| What safeguards are there or what assurances do we have that
| JFT is not contaminated with images from (or extremely similar
| to) the validation set?
| quantumofalpha wrote:
| Just the sheer size of JFT (latest versions I heard approach
| 1B images), so it's probably impractical to train on it till
| overfitting.
| sillysaurusx wrote:
| Haha. None whatsoever.
|
| The assurance is that everyone in the field seems to take the
| work seriously. But the reality is that errors creep in from
| a variety of corners. I would not be even slightly surprised
| to find that the validation data is substantially similar.
| We're still at the "bangs rocks together to make fire" phase
| of ML, which is both exciting and challenging; we're building
| the future from the ground up.
|
| People rarely take the time to look at the actual images, but
| if you do, you'll notice they have some interesting errors in
| them:
| https://twitter.com/theshawwn/status/1262535747975868418
|
| I built an interactive viewer for the tagging site:
| https://tags.tagpls.com/
|
| (Someone tagged all 70 shoes in this one, which was kind of
| impressive...
| https://tags.shawwn.com/tags/https://battle.shawwn.com/sdc/i... )
|
| Anyway, some of the validation images happen to be rotated 90
| degrees and no one noticed. That made me wonder what other
| sorts of unexpected errors are in these specific 50,000
| validation images that the world just-so-happened to decide
| were Super Important to the future of AI.
|
| The trouble is, images _in general_ are substantially similar
| to the ImageNet validation dataset. In other words, it's
| tempting to try to think of some way of "dividing up" the
| data so that there's some sort of validation phase that you
| can cleanly separate. But reality isn't so kind. When you're
| at the scale of millions of images, holding out 10% is just a
| way of sanity checking that your model isn't memorizing the
| training data; nothing more.
|
| Besides, random 90 degree rotations are introduced on purpose
| now, so it's funny that old mistakes tend not to matter.
| gameshot911 wrote:
| Those example pictures are trippy! Some of them look like those
| weird DeepDream sick-fever dream creations, except I assume they
| are all real photos, not generated. It's very possible that a
| trained AI would be better able to identify some of the
| candidates than _I_ would.
|
| For example, from OP's post, w/ coordinate system starting at
| lower left, I have no idea what I'm looking at in these
| examples, except they look organic-ish: [1]: [1,4], [3,2],
| [4,1]
|
| sillysaurusx: I've never seen conglomerate pictures like this
| used in AI training. Do you train models on these 4x4 images?
| What's the purpose vs a single picture at a time? Does the
| model know that you're feeding it 4x4 examples, or does it have
| to figure that out itself?
|
| Aside: Another awesome 'sick-fever dream creation' example if
| you missed it when it made the rounds on HN is this[3]. Slide
| the creativity filter up for weirdness!
|
| [3] https://thisanimedoesnotexist.ai/
| sillysaurusx wrote:
| I'm surprised so many people want to see our BigGAN images.
| Thank you for asking :)
|
| You can watch the training process here:
| http://song.tensorfork.com:8097/#images
|
| It's been going on for a month and a half, but I leave it
| running mostly as a fishtank rather than to get to a specific
| objective. It's fun to load it up and look at a new random
| image whenever I want. Plus I like the idea of my little TPU
| being like "look at me! I'm doing work! Here's what I've
| prepared for you!" so I try to keep my little fella online
| all the time.
|
| - https://i.imgur.com/0O5KZdE.png
|
| - Plus stuff like this makes me laugh really hard.
| https://i.imgur.com/EnfIBz3.png
|
| - Some nice flowers and a boat.
| https://i.imgur.com/mrFkIx0.png
|
| The model is getting quite good. I kind of forgot about it
| over the past few weeks. StyleGAN could never get anywhere
| close to this level of detail. I had to spend roughly a year
| tracking down a crucial bug in the implementation that
| prevented biggan from working very well until now:
| https://github.com/google/compare_gan/issues/54
|
| And we also seem to have solved BigGAN collapse, so theoretically
| the model can improve forever now. I leave it running to see
| how good it can get.
|
| _I've never seen conglomerate pictures like this used in AI
| training. Do you train models on these 4x4 images? What's the
| purpose vs a single picture at a time? Does the model know
| that you're feeding it 4x4 examples, or does it have to
| figure that out itself?_
|
| Nah, the grid is just for convenient viewing for humans.
| Robots see one image at a time. (Or more specifically, a
| batch of images; we happen to use batch size 2 or 4, I
| forget, so each core sees two images at a time, and then all
| 8 cores broadcast their gradients to each other and average,
| so it's really seeing 16 or 32 images at a time.)
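|
| The averaging step itself is trivial; a toy NumPy sketch of what
| the 8 cores effectively do each step (shapes made up):
|
|     import numpy as np
|
|     # each of 8 "cores" computes a gradient on its own batch of 2
|     per_core_grads = [np.random.randn(3, 3) for _ in range(8)]
|
|     # the broadcast-and-average step (the all-reduce)
|     avg_grad = np.mean(per_core_grads, axis=0)
|
|     # every core applies the same avg_grad, so the update matches
|     # one step on the combined batch of 8 * 2 = 16 images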
|
| I feel a bit silly plugging our community so much, but it's
| really true. If you like tricks like this, join the
| Tensorfork discord:
|
| https://discord.com/invite/x52Xz3y
|
| My theory when I set it up was that everyone has little
| tricks like this, but there's no central repository of
| knowledge / place to ask questions. But now that there are
| 1,200+ of us, it's become the de facto place to pop in and
| share random ideas and tricks.
|
| For what it's worth, https://thisanimedoesnotexist.ai/ was a
| collaboration of several Tensorfork discord members. :)
|
| If you want future updates about this specific BigGAN model,
| twitter is your best bet: https://twitter.com/search?q=(from%
| 3Atheshawwn)%20biggan&src...
| Itsdijital wrote:
| This is awesome, thanks.
| indiv0 wrote:
| That's awesome! Is your model available publically? I run a
| site [0] where users can generate images from text prompts
| using models like the official BigGAN-Deep one and I'd love to
| try it out for this purpose. Do you also have somewhere where you
| discuss this stuff? I'm new to ML in general and was wondering if
| there's somewhere where y'all experts gather.
|
| [0]: https://dank.xyz
| code-scope wrote:
| Is there any (inference) SW framework that takes a YouTube video
| as input and spits out object/timestamp pairs as output?
| trhway wrote:
| > pretraining on JFT increases the accuracy means that the
| model is generalizing
|
| Not necessarily; it may be mostly a bonus of the transfer,
| especially considering that JFT is that much larger. Getting the
| first conv layers' kernels to converge to Gabor-like filters, for
| example, takes time, yet those kernels are very similar across
| well-trained image nets (and some works have shown they are
| optimal in a sense, and that this is one of the reasons they
| appear in our visual cortex) and thus transferable; they can
| practically be treated as fixed in the new model (especially if
| those layers were pretrained on a very large model and reached
| the state of generic feature extraction). I suspect the same is
| applicable to the low-level feature-aggregating layers too.
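|
| To illustrate: treating those layers as fixed is a few lines in
| e.g. PyTorch. A sketch of the idea, not the paper's setup:
|
|     import torch
|     import torchvision
|
|     model = torchvision.models.resnet50(pretrained=True)
|     # freeze the early, Gabor-like feature extractors
|     for name, param in model.named_parameters():
|         if name.startswith(("conv1", "bn1", "layer1")):
|             param.requires_grad = False
|     # only the higher-level layers receive gradient updates
|     optimizer = torch.optim.SGD(
|         (p for p in model.parameters() if p.requires_grad),
|         lr=0.01)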
| bla3 wrote:
| Where are these images from? Are there more?
| sillysaurusx wrote:
| Oh, you! I'm so flattered. You're making me blush.
|
| Sure, you can have as many as you want. Watch it train in
| real time:
|
| http://song.tensorfork.com:8097/#images
|
| Though I imagine HN might swamp our little training server
| running tensorboard, so here you go.
|
| https://i.imgur.com/BkGxbo7.png
|
| We've been training a BigGAN-Deep model for 1.5 months now.
| Though that sounds like a long time, in reality it's
| completely automatic and I've been leaving it running just to
| see what will happen. Every single other BigGAN
| implementation reports that eventually the training run will
| collapse. We observed the same thing. But gwern came up with a
| brilliantly simple way to solve this:
|
|     if D_loss < 0.2: D_loss = 0
|
| It takes some thinking about _why_ this solves collapse. But in
| short, the discriminator isn't allowed to get too intelligent.
| Therefore the generator is never forced to degenerate into single
| examples that happen to fool the discriminator, i.e. collapse.
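|
| Our real code is TF on TPUs, but here's a self-contained toy
| PyTorch version showing where the floor sits in the loop (the
| models and data are stand-ins):
|
|     import torch
|     import torch.nn as nn
|
|     z_dim, batch = 16, 8
|     G = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(),
|                       nn.Linear(32, 64))
|     D = nn.Sequential(nn.Linear(64, 32), nn.ReLU(),
|                       nn.Linear(32, 1))
|     g_opt = torch.optim.Adam(G.parameters(), lr=2e-4)
|     d_opt = torch.optim.Adam(D.parameters(), lr=2e-4)
|     bce = nn.BCEWithLogitsLoss()
|
|     for step in range(1000):
|         real = torch.randn(batch, 64)  # stand-in "real" data
|         fake = G(torch.randn(batch, z_dim))
|
|         d_loss = (bce(D(real), torch.ones(batch, 1)) +
|                   bce(D(fake.detach()), torch.zeros(batch, 1)))
|         if d_loss.item() < 0.2:    # gwern's floor: D is sharp
|             d_loss = d_loss * 0.0  # enough; zero out its update
|         d_opt.zero_grad(); d_loss.backward(); d_opt.step()
|
|         g_loss = bce(D(fake), torch.ones(batch, 1))
|         g_opt.zero_grad(); g_loss.backward(); g_opt.step()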
|
| If you like this sort of thing in general, I encourage you to
| come join the Tensorfork community discord server, which we
| affectionately named the "TPU podcast":
| https://discord.com/invite/x52Xz3y
|
| There are some 1,200 of us now, and people are always showing
| off stuff like this in our (way too many) channels.
| Jabbles wrote:
| Do you take advantage of previous iterations of the
| generator and discriminator? i.e. the generator should be
| able to fool all previous discriminators, and the
| discriminator should be able to recognise the work of all
| previous generators?
| sillysaurusx wrote:
| Nope! It's an interesting balance. The truth of the
| situation seems to be: the generator and discriminator
| provide a "signal" to each other, like two planets
| orbiting around each other.
|
| If you cut the signal from one, the other will rapidly
| veer off into infinity, i.e. collapse quickly. Or it will
| veer off in the other direction, i.e. all progress will
| stop and the model won't improve.
|
| So it's a constant "signal", you see, where one is
| dependent on the other _in the current state_. Therefore
| I am skeptical of attempts to use previous states of
| discriminators.
|
| However! One of the counterintuitive aspects of AI is
| that the strangest-sounding ideas often have a chance of
| being good ideas. It's also so hard to try new ideas that
| you have to pick specific ones. So, roll up your sleeves
| and implement yours; I would personally be delighted to
| see what the code would look like for "the current
| generator can fool all previous discriminators".
|
| I really do not mean that in any sort of negative or
| dismissive way. I really hope that you will come try it,
| because DL has never been more accessible. And the time
| is ripe for fresh takes on old ideas; there's a very real
| chance that you'll stumble across something that works
| quite well, if you follow your line of thinking.
|
| But for practical purposes, the current theory with
| generators and discriminators is that they react to their
| current states. So there's not really any way of testing
| "can the generator fool all previous discriminators?"
| because in reality, the generator isn't fooling the
| discriminator at all -- they simply notice when each
| other deviates by a small amount, and they make a
| corresponding "small delta change" in response. Kind of
| like an ongoing chess game.
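|
| That said, the bookkeeping for your idea is easy; a hypothetical
| sketch (PyTorch-flavored, all names mine), in case you want to
| try it:
|
|     import copy
|     import torch
|
|     d_snapshots = []  # frozen copies of past discriminators
|
|     def snapshot(D, step, every=10_000):
|         if step % every == 0:
|             d_snapshots.append(copy.deepcopy(D).eval())
|
|     def fools_all_past(G, z_dim, n=64, thresh=0.5):
|         # "fools" = every old D scores G's samples, on average,
|         # as more real than fake
|         with torch.no_grad():
|             fake = G(torch.randn(n, z_dim))
|             return all(torch.sigmoid(D(fake)).mean() > thresh
|                        for D in d_snapshots)
|
| The hard part isn't the code; it's deciding what training signal
| to derive from that check.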
| Jabbles wrote:
| Thanks for the detailed answer.
|
| I don't claim it to be a novel idea, I just remember the
| Alpha Go (zero?) paper that said they played it against
| older versions to make sure it hadn't got into a bad
| state.
| sillysaurusx wrote:
| Ah! This is an interesting difference, and illustrates
| one fun aspect of GANs vs other types of models: Alpha Go
| had a very specific "win condition" that you can measure
| precisely. (Can the model win the game?)
|
| Whereas it's very difficult to quantify what it means to
| be "better" at generating images, once you get to a
| certain threshold of realism. (Was Leonardo better than
| Michelangelo? Probably, but it's hard to measure precisely.)
|
| The way that Alpha Go worked was, it gathered a bunch of
| experiences, i.e. it played a bunch of games. Then, after
| playing tons of games -- tens of dozens! just kidding,
| probably like 20 million -- it then performed a _single
| gradient update_.
|
| In other words, you gather your current experiences, and _then_
| you react to them. It's a two-phase commit. There's an explicit
| "gather" step, which you then react to by updating your database
| of parameters, so to speak.
|
| Whereas with GANs, that happens continuously. There's no
| "gather" step. The generator simply tries to maximize the
| discrimiantor's loss, and the discriminator tries to
| minimize it.
|
| Balancing the two has been very tricky. But the results
| speak for themselves.
| logane wrote:
| Not sure if I follow your JFT argument, but there's a large
| body of work on both (a) studying whether chasing ImageNet
| accuracy yields models that generalize well to out of
| distribution data [1, 2, 3] and (b) contextualizing progress on
| ImageNet (i.e., what does high accuracy on ImageNet really
| mean?) [4, 5, 6].
|
| For (a), maybe surprisingly the answer is mostly yes! Better
| ImageNet accuracy generally corresponds to better out of
| distribution accuracy. For (b), it turns out that the ImageNet
| dataset is full of contradictions---many images have multiple
| ImageNet-relevant objects, and often are ambiguously or mis-
| labeled, etc---so it's hard to disentangle progress in
| identifying objects vs. models overfitting to the quirks of the
| benchmark.
|
| [1] ObjectNet: https://objectnet.dev / associated paper
|
| [2] ImageNet-v2: https://arxiv.org/abs/1902.10811
|
| [3] An Unbiased Look at Dataset Bias:
| https://people.csail.mit.edu/torralba/publications/datasets_...
| (pre-AlexNet!)
|
| [4] From ImageNet to Image Classification:
| https://arxiv.org/abs/2005.11295
|
| [5] Are we done with ImageNet? https://arxiv.org/abs/2006.07159
|
| [6] Evaluating Machine Accuracy on ImageNet:
| http://proceedings.mlr.press/v119/shankar20c.html
| The_rationalist wrote:
| This isn't the real SOTA: "Meta Pseudo Labels" has ~10% fewer
| errors while having fewer parameters.
| https://paperswithcode.com/sota/image-classification-on-imag...
| However, the fast training is an interesting property.
|
| It would be interesting to test those EfficientNets with
| zeroth-order backpropagation, as it allows a 300X speedup (vs
| 8.7x) while not regressing accuracy _that much_:
| https://paperswithcode.com/paper/zorb-a-derivative-free-back...
| f430 wrote:
| eli5 what SOTA image recognition is?
| jphoward wrote:
| SOTA is 'state of the art'. Image recognition is a task
| classically appraised by calculating the accuracy on the
| ImageNet dataset, which requires a system to classify images
| each as one of 1,000 pre-determined classes.
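|
| Concretely, the headline number is top-1 accuracy; a minimal
| sketch:
|
|     import numpy as np
|
|     def top1_accuracy(logits, labels):
|         # logits: (n, 1000) scores; labels: (n,) true class ids
|         return float((logits.argmax(axis=1) == labels).mean())
|
|     # random guessing over 1,000 classes gives ~0.1% top-1
|     logits = np.random.randn(10_000, 1000)
|     labels = np.random.randint(0, 1000, size=10_000)
|     print(top1_accuracy(logits, labels))  # ~0.001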
| f430 wrote:
| so how many images does the current SOTA take to train a
| classifier? Trying to gauge how much of an improvement
| Deepmind has made here.
| jphoward wrote:
| The results that should be used to compare their model to others
| involve training on just under 1.3 million images across the
| 1,000 classes.
|
| Their best results involve 'pretraining' on a dataset of
| 300 million examples, before 'tuning' it on the actual
| ImageNet training dataset as above.
| [deleted]
| marviel wrote:
| State of The Art == SOTA
|
| You might enjoy paperswithcode.com
| smeeth wrote:
| The speed improvements are certainly interesting; the performance
| improvements seem decidedly not. This method has more than 2x the
| parameters of all but one of the models it was compared against.
|
| If I'm off-base here can someone explain?
| modeless wrote:
| I don't care how many parameters my model has per se. What I
| care about is how expensive it is to train in time and dollars.
| If this makes it cheaper to train better models despite more
| parameters, that's still a win.
| 6gvONxR4sf7o wrote:
| Some models are still memory limited. Fewer parameters are
| very useful in those settings.
| sillysaurusx wrote:
| There's one important caveat, though I agree with your
| thrust: at GPT-3 scale, cutting params in half is a
| nontrivial optimization. So it's worth keeping an eye out for
| that concern.
|
| (Yeah, none of us are anywhere near GPT-3 scale. But I spend
| most of my time thinking about scaling issues, and it was
| interesting to see your comment pop up; I would've agreed
| entirely with it myself, if not for seeing all the anguish
| caused by attempting to train and deploy billions of
| parameters.)
| dontreact wrote:
| In cases where you have to deploy the model and you are
| limited in terms of FLOPs, this paper does not help much, unless
| its removal of batchnorm somehow allows a future network that is
| actually faster at inference time.
| modeless wrote:
| There are a lot of techniques for sparsifying or pruning or
| distilling models to reduce inference FLOPS, and they
| almost always produce better results when starting with a
| better model. Also, if your model is 8x faster to train at
| the same size then you can do 8x as much hyperparameter
| tuning and get a better result.
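|
| For a flavor of the simplest such technique, a magnitude-pruning
| sketch (illustrative only; real pipelines prune gradually and
| fine-tune afterwards):
|
|     import numpy as np
|
|     def magnitude_prune(weights, sparsity=0.9):
|         # zero out the smallest-magnitude weights
|         w = weights.copy()
|         k = int(sparsity * w.size)
|         thresh = np.partition(np.abs(w).ravel(), k)[k]
|         w[np.abs(w) < thresh] = 0.0
|         return w
|
|     w = np.random.randn(512, 512)
|     print((magnitude_prune(w) == 0).mean())  # ~0.9 of weights gone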
| dontreact wrote:
| This model is much more expensive than EfficientNet at inference
| (I think the FLOPs are about 2x?). You can use these same
| techniques with EfficientNet.
| david-gpu wrote:
| But for deployment in smaller devices you can use
| techniques such as distillation, quantization and sparsity.
| Training and inference are very different problems in
| practice.
| dontreact wrote:
| Yes, but you can do that with EfficientNet as well. The point is
| that this is an improvement only for training, because it uses
| computations which are highly optimized on TPU.
| Willson50 wrote:
| *11.5% as much compute.
| The_rationalist wrote:
| But what was the baseline hardware for a reasonable training
| time?
| nl wrote:
| Page 7 has a table of the time for one training step on TPUv3
| and V100 GPUs.
|
| I don't completely understand this: NFNet is slower than its
| competitors on this benchmark, yet they claim higher efficiency.
| This isn't obvious to me.
| nullifidian wrote:
| Do they disclose any important techniques/ideas on how to achieve
| these results in the paper, or it's more of a technical press
| release?
| belgian_guy wrote:
| Yes they do (arxiv is a place for scientific papers, not press
| releases). I've only skimmed it, but the paper introduces an
| adaptive way to clip gradients: if the ratio of the gradient norm
| to the weight norm surpasses a certain threshold, they clip it.
| This stabilizes learning and seems to avoid the need for batch
| normalization. Seems quite promising imo and something that could
| stick (I'd be quite happy if we could finally do away with
| batchnorm).
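|
| Roughly, in code (a layer-wise sketch from my skim; the paper
| applies it unit-wise, per row of the weight matrix):
|
|     import numpy as np
|
|     def adaptive_grad_clip(grad, weight, lam=0.01, eps=1e-3):
|         # clip when ||grad|| / ||weight|| exceeds lam, by
|         # rescaling grad so the ratio comes back down to lam
|         w_norm = max(np.linalg.norm(weight), eps)
|         g_norm = np.linalg.norm(grad)
|         if g_norm / w_norm > lam:
|             grad = grad * (lam * w_norm / g_norm)
|         return grad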
| modeless wrote:
| The big deal here is the removal of BatchNorm. People never
| really liked BatchNorm for various theoretical and practical
| reasons, and yet it was required for all the top performing
| models. If this allows us to get rid of it forever that will be
| really nice.
| nl wrote:
| Make it work then make it fast and efficient.
|
| People complaining about how slow and expensive brand new models
| are to train are ignorant of the history of machine learning and
| of engineering in general.
| throwaway189262 wrote:
| I have a feeling the ML community is going to pivot focus to
| faster and smaller training before larger advancements are made.
| It's simply too expensive for much AI research to happen when
| state-of-the-art models take $500k of hardware to train.
|
| For all the mathematician hype around ML research, much of the
| work is closer to alchemy than science. We simply don't
| understand a great deal of why these neural nets work.
|
| The people doing math above algebra are few and the scene is
| dominated by "guess and check" style model tinkering.
|
| Many "state of the art models" are simply a bunch of common
| strategies glued together in a way researchers found worked the
| best (by trying a bunch of different ones).
|
| An average Joe could probably write influential ML papers by
| gluing RNN/GAN layers to existing models and fiddling with the
| parameters until they beat current state of the art. In fact, in
| NLP models, this is essentially what has happened with RoBERTa,
| XLNet, ELECTRA, etc. They're all somewhat trivial variations on
| Google's BERT, which is more creative but yet again built on
| existing models.
|
| Anyways, my point is, none of this required math or genius or
| particularly demanding thought. It was basically let's tinker
| with this until we find a way that's better, using guess and
| check. No equations needed.
|
| We are a long way from the type of simulations done for protein
| folding and materials strength and basically every other
| scientific field. It's still the wild west
| sillysaurusx wrote:
| Apply to TFRC! https://www.tensorflow.org/tfrc
|
| They are very permissive. And you get to play with $500k worth
| of hardware. Been a member for over a year now. Jonathan is
| singlehandedly the best support person I've ever worked with,
| or perhaps ever will work with.
|
| I would've completely agreed with you if not for TFRC. And I
| couldn't resist the opportunity of playing with some big metal,
| even if it's hard to work with.
| panabee wrote:
| just applied. thanks for sharing!
| tbalsam wrote:
| This feels much like the sentiment in the field about two years
| ago or so. While I feel like the "alchemy" storyline is still
| somewhat in play, most of the big important parts of the deep
| learning process have enough ideological linear approximators
| stacked around them that if you know what you're doing or
| looking at, you can jump to an unexplored trench with some
| reasonable feeling about whether you'll get something good or
| not. I feel like the "alchemy" approach is when people new to
| the field are inundated with information about it, and while I
| think that still holds, there very much is a well-understood
| science of principles in most parts of it.
|
| There's the neural tangent kernel work that's achieved a lot,
| and the transformers themselves are really taking off a lot as
| the blockwise/lower rank approximation algorithms look more and
| more like circuits built off of basic, more well-established
| components.
|
| "An average Joe could probably write influential ML papers by
| gluing RNN/GAN layers to existing models and fiddling with the
| parameters until they beat current state of the art. In fact,
| in NLP models, this is essentially what has happened with
| RoBERTa, XLNet, ELECTRA, etc. They're all somewhat trivial
| variations on Google's BERT, which is more creative but yet
| again built on existing models."
|
| This feels like it trivializes a lot of the work and collapses
| some of the major advancements in training at scale down to a
| more one-dimensional outlook. Companies are doing both, but
| it's easy to throw money and compute at an absolutely
| guaranteed logarithmic improvement in results. It's not
| stupidity, it's just reducing variance in scaling known laws as
| we work on making things more efficient, which weirdly enough
| starts the iterative process of academics frantically trying to
| mine the expensive, inefficient compute tactics to flag plant
| their own materials.
|
| With respect to your comment on protein folding and such, I feel
| you might have missed a lot of the major work in that area more
| recently. There really and truly has been some field-shattering
| work there, combining deep learning systems with last-mile
| supervision and refinement systems. I'd posit that we're very
| much out of the wild west and in the mild, but still
| rambunctious west, if I were to put terms on it.
|
| With reference to guess and check -- yes, that especially was
| prevalent and worked 2-3 years ago and I'd be in favor of
| advocating that it does still happen somewhat in a more refined
| fashion, but I personally believe we'd not get too far beyond
| the SOTA if we're not working (effectively) with your data
| manifold now and tightly incorporating whatever
| projections/constraints of that data distillation process into
| your network training procedure. I really do agree with you in
| that I think average Joe breakthroughs will happen and continue
| to benefit the field, and I'd certainly agree that there's
| always going to be the mediocre paper churn of paper mills I
| think that you alluded to trying to justify their own existence
| as academics/paper writers, but I really do legitimately think
| there's enough precedent set in most parts of the field that
| you need to have some kind of thoughtful improvement to move
| forward (like AdaBelief, which is still terrible because they
| straight up lie about what they do in the abstract, even though
| the improvement of debiasing the variance estimates during
| training is an exceptionally good idea).
|
| Just my 2c, hope this helps. I think we may have a similar end
| perspective from two different sides, like two explorers
| looking at the same peak from the different side of the
| mountain. :thumbsup:
| nickhuh wrote:
| There's a lot of interest in various ML communities on more
| efficient training and inference. Both vision and NLP have had
| a growing focus on these problems in recent years.
|
| I think you make a good observation that much of ML progress is
| driven by tinkering with existing models, though instead of
| describing it as more "alchemy than science" it's probably more
| accurate to say it's very experimental right now. Being very
| experimental is neither unscientific nor unusual in the
| development of knowledge. James Watt worked as an instrument
| maker (not a theoretician) when he invented the Watt steam
| engine in 1776 [1], and at the time the idea of heat as
| Phlogiston [2] was still more prevalent than anything that
| looks like modern thermodynamics. Theory and practice naturally
| take turns outpacing each other, which is part of why we need
| both.
|
| I'd also caution against the belief that experimental work
| doesn't require "particularly demanding thought". There are
| many things one can tweak in current ML models (the search
| space is exponential) and, as you point out, the experiments
| are expensive. Having a solid understanding of the system,
| great intuition, and good heuristics is necessary to reliably
| make progress.
|
| For those who are interested in the theory of deep learning,
| the community has recently made great strides on developing a
| mathematical understanding of neural networks. The research is
| still very cutting edge, but the following PDF helps introduce
| the topic [3].
|
| [1]: https://en.wikipedia.org/wiki/James_Watt
|
| [2]: https://en.wikipedia.org/wiki/Phlogiston_theory
|
| [3]:
| https://www.cs.princeton.edu/courses/archive/fall19/cos597B/...
| The_rationalist wrote:
| While I generally agree to some extent: _RoBERTa, XLNet, ELECTRA,
| etc. They're all somewhat trivial variations on Google's BERT,
| which is more creative._ Researchers take inspiration from
| existing models, of course, and some BERT derivatives are
| trivial. However, XLNet is in its own league: while the author (a
| genius Chinese student) was inspired by BERT, it is one of the
| few SOTA pretrained models not based on BERT, and it is actually
| an autoregressive one! That difference allows it to be better at
| many things, as it doesn't have to corrupt the tokens (from my
| shallow understanding). This model is two years old but is sadly
| still the one that ranks SOTA in the most key tasks, e.g.
| dependency parsing. And after all this time nobody has cared
| enough to even test it on other foundational tasks (which is
| extremely sad and pathetic), e.g. coreference resolution. Sadly,
| because of conformism effects, almost no researchers have created
| XLNet derivatives. Almost all researchers continue to search in
| the local minimum that is BERT, which I find immensely ironic.
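|
| To make the difference concrete (a toy illustration, not either
| model's actual code):
|
|     tokens = ["the", "cat", "sat", "on", "the", "mat"]
|
|     # BERT-style masked LM: corrupt the input, predict the masks
|     masked_input = ["the", "[MASK]", "sat", "on", "the", "[MASK]"]
|     mlm_targets = {1: "cat", 5: "mat"}
|
|     # autoregressive: no corruption; predict each token from its
|     # context (XLNet additionally permutes the factorization order)
|     ar_pairs = [(tokens[:i], tokens[i])
|                 for i in range(1, len(tokens))]
|     # e.g. (["the"], "cat"), (["the", "cat"], "sat"), ...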
|
| While ad hoc empirical fine-tuning is a big part of improving
| SOTA, mathematical genius can still enable revolutions, e.g. this
| recent alternative to classical backpropagation that is 300X
| faster with low accuracy loss:
| https://paperswithcode.com/paper/zorb-a-derivative-free-back...
| throwaway189262 wrote:
| Interesting, but I'm not sure you're completely right about
| XLNet. I heard it takes an absurd amount of resources to train,
| even more than the BERT variations, and this is likely why
| there's not a ton of interest in it.
| The_rationalist wrote:
| https://github.com/renatoviolin/xlnet XLNet running on very
| low-end hardware (a single 8GB 2080 non-Ti) significantly
| outperforms BERT-large on e.g. the reference question-answering
| benchmark, SQuAD 2: 86% vs 81%.
|
| Nobody has even tried to create a SpanXLNet (akin to SpanBERT).
| How many years will be wasted before researchers get out of the
| BERT local minimum? I'm afraid it might last a decade.
| jgalt212 wrote:
| > The people doing math above algebra are few and the scene is
| dominated by "guess and check" style model tinkering.
|
| "guess and check" is terribly ineffective with multi-day
| training runs. Brings us right back to the batch processing
| paradigm of the 1960s.
| highfrequency wrote:
| They compare the training latency for different models with a
| fixed batch size of 32. But if the DeepMind models are several
| times larger than the comparison models in each latency class, it
| seems that the comparison models could use larger batch sizes for
| faster overall training time.
| trott wrote:
| For ConvNets, the memory use of the models themselves is pretty
| modest. For example, even with 0.5B parameters in FP32,
| weights + gradients + momentum should use just 6GB (unless your
| framework sucks, or you have extra overhead from distributed
| training). So if your model is half the size, you'll only save
| 3GB. If your VRAM is 32GB, saving 3GB won't let you use a much
| bigger batch size. On the other hand, the absence of batch norm
| can actually lead to memory savings proportional to batch size.
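|
| The arithmetic, for concreteness:
|
|     0.5e9 params * 4 bytes (FP32) = 2 GB per copy
|     weights + gradients + momentum = 3 copies = 6 GB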
| longtom wrote:
| tl;dr: Don't use batch norm to prevent exploding gradients; use
| adaptive gradient thresholds instead.
|
| For this they compute the Frobenius norm (square root of the sum
| of squares) of the weight layer and of its gradient, and clip
| when the ratio of gradient norm to weight norm exceeds a
| threshold.
|
| That saves the meta-search for an optimal fixed threshold, and it
| also adapts better than any fixed threshold could.
|
| Very simple idea.
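|
| A worked example, if I read the paper right: with threshold 0.1,
| ||W|| = 4 and ||G|| = 2, the ratio 2/4 = 0.5 exceeds 0.1, so G is
| rescaled by 0.1 * 4 / 2 = 0.2, bringing its norm down to exactly
| 0.1 * ||W|| = 0.4.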
| sillysaurusx wrote:
| Thanks for macroexpanding frobnorm.
|
| I'm skeptical that these hand-coded thresholds can ever match
| what a model can learn automatically. But it's hard to argue
| with results.
| highfrequency wrote:
| Is this the first time that the gradient clip threshold has
| been chosen relative to the size of the weight matrix?
| longtom wrote:
| Lol, why is my comment down here with 7 upvotes.
___________________________________________________________________
(page generated 2021-02-14 23:00 UTC)