[HN Gopher] DeepMind achieves SOTA image recognition with 8.7x f...
       ___________________________________________________________________
        
       DeepMind achieves SOTA image recognition with 8.7x faster training
        
       Author : highfrequency
       Score  : 186 points
       Date   : 2021-02-14 16:44 UTC (6 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | eggie5 wrote:
        | What do you attribute the gains to? Adaptive clipping? Or $$$
        | spent on NAS??
        
       | dheera wrote:
       | > 8.7x faster to train
       | 
        | This is an achievement, but it would have been helpful to put "to
        | train" in the title, as this is quite different from efficiency at
       | inference time, which is what often actually matters in deployed
       | applications.
       | 
       | From Table 3 on Page 7 it appears to me that NFNet is
       | significantly heavier in the number of parameters than
        | EfficientNet for similar accuracies. For example, EffNet-B5
        | achieves 83.7% Top-1 with 30M params and 9.9B FLOPs, while
        | NFNet-F0 achieves 83.6% Top-1 with 71.5M params and 12.38B
        | FLOPs.
       | 
       | It appears to me at first glance that NFNet has not achieved SOTA
       | at inference.
        
         | trott wrote:
         | > It appears to me at first glance that NFNet has not achieved
         | SOTA at inference.
         | 
         | It has, for larger models (F1 vs B7). See Fig 4 in the
         | Appendix.
        
           | The_rationalist wrote:
           | No it hasn't https://paperswithcode.com/sota/image-
           | classification-on-imag...
        
             | trott wrote:
             | > No it hasn't https://paperswithcode.com/sota/image-
             | classification-on-imag...
             | 
             | We were talking about models trained on ImageNet,
             | specifically about the trade-off between accuracy and
             | FLOPs. But the higher-accuracy models listed in your link
             | use extra data. So it's not quite the same benchmark we
             | were talking about.
        
               | The_rationalist wrote:
                | The NFNet-F4+ from the DeepMind paper you were talking
                | about also uses external training data.
                | 
                | The number one in accuracy (Meta Pseudo Labels) is also
                | faster at inference (390M vs 570M parameters) than the
                | DeepMind one. So what are you disagreeing with?
        
               | trott wrote:
                | > The NFNet-F4+ from the DeepMind paper you were talking
                | about also uses external training data.
               | 
               | @dheera and I did not mention NFNet-F4+. All models,
               | tables, figures and numbers that we did mention resulted
               | from training on ImageNet alone.
        
       | sillysaurusx wrote:
       | I was skeptical of the 83%-whatever top-1 accuracy on Imagenet.
        | But someone pointed out that when the model is pretrained on JFT,
        | Google's proprietary ~300-million-image dataset, its accuracy
        | increases to 87%-whatever.
       | 
       | That's pretty interesting. It implies that the original accuracy
       | rating might be legit. The concern is that we're chasing the
       | imagenet benchmark as if it's the holy grail, when in fact it's a
       | very narrow slice of what we normally care about as ML
        | researchers. _However_, the fact that pretraining on JFT
       | increases the accuracy means that the model is generalizing,
       | which is very interesting; it implies that models might be "just
       | that good now."
       | 
        | Or more succinctly: if the result were bogus, you'd expect JFT
        | pretraining to have no effect whatsoever (or a negative effect).
        | But it has a positive effect.
       | 
       | The other thing worth mentioning is that AJMooch seems to have
       | killed batch normalization dead, which is very strange to think
       | about. BN has had a long reign of some ~4 years, but the
        | drawbacks are significant: you have to maintain the running
        | statistics (moving mean and variance) yourself, for example,
        | which was quite annoying.
       | 
       | It always seemed like a neural net ought to be able to learn what
       | BN forces you to keep track of. And AJMooch et al seem to prove
       | this is true. I recommend giving evonorm-s a try; it worked
       | perfectly for us the first time, with no loss in generality, and
       | it's basically a copy-paste replacement.
       | 
       | (Our BigGAN-Deep model is so good that I doubt you can tell the
       | difference vs the official model. It uses AJMooch's evonorm-s
       | rather than batchnorm: [1] https://i.imgur.com/sfGVbuq.png [2]
       | https://i.imgur.com/JMJ1Ll0.png and lol at the fake speedometer.)
        
         | Jabbles wrote:
         | What safeguards are there or what assurances do we have that
         | JFT is not contaminated with images from (or extremely similar
         | to) the validation set?
        
           | quantumofalpha wrote:
            | Just the sheer size of JFT (I've heard the latest versions
            | approach 1B images), so it's probably impractical to train on
            | it until it overfits.
        
           | sillysaurusx wrote:
           | Haha. None whatsoever.
           | 
           | The assurance is that everyone in the field seems to take the
           | work seriously. But the reality is that errors creep in from
            | a variety of corners. I would not be even slightly surprised
            | to find that some validation images are substantially similar
            | to images in JFT.
           | We're still at the "bangs rocks together to make fire" phase
           | of ML, which is both exciting and challenging; we're building
           | the future from the ground up.
           | 
           | People rarely take the time to look at the actual images, but
           | if you do, you'll notice they have some interesting errors in
           | them:
           | https://twitter.com/theshawwn/status/1262535747975868418
           | 
           | I built an interactive viewer for the tagging site:
           | https://tags.tagpls.com/
           | 
           | (Someone tagged all 70 shoes in this one, which was kind of
           | impressive... https://tags.shawwn.com/tags/https://battle.sha
           | wwn.com/sdc/i... )
           | 
           | Anyway, some of the validation images happen to be rotated 90
           | degrees and no one noticed. That made me wonder what other
           | sorts of unexpected errors are in these specific 50,000
           | validation images that the world just-so-happened to decide
           | were Super Important to the future of AI.
           | 
           | The trouble is, images _in general_ are substantially similar
            | to the imagenet validation dataset. In other words, it's
           | tempting to try to think of some way of "dividing up" the
           | data so that there's some sort of validation phase that you
           | can cleanly separate. But reality isn't so kind. When you're
           | at the scale of millions of images, holding out 10% is just a
           | way of sanity checking that your model isn't memorizing the
           | training data; nothing more.
           | 
           | Besides, random 90 degree rotations are introduced on purpose
           | now, so it's funny that old mistakes tend not to matter.
        
         | gameshot911 wrote:
         | Those example pictures are trippy! Some of them look like those
          | weird DeepDream sick-fever dream creations. Except I assume they
          | are all real photos, not generated. It's very possible that a
          | trained AI would be better able to identify some of the
          | candidates than _I_ would.
         | 
         | For example, from OP's post, w/ coordinate system starting at
         | lower left, I have no idea what I'm looking at in these
         | examples, except they look organic-ish: [1]: [1,4], [3,2],
         | [4,1]
         | 
         | sillysaurusx: I've never seen conglomerate pictures like this
         | used in AI training. Do you train models on these 4x4 images?
         | What's the purpose vs a single picture at a time? Does the
         | model know that you're feeding it 4x4 examples, or does it have
         | to figure that out itself?
         | 
         | Aside: Another awesome 'sick-fever dream creation' example if
         | you missed it when it made the rounds on HN is this[3]. Slide
         | the creativity filter up for weirdness!
         | 
         | [3] https://thisanimedoesnotexist.ai/
        
           | sillysaurusx wrote:
           | I'm surprised so many people want to see our BigGAN images.
           | Thank you for asking :)
           | 
           | You can watch the training process here:
           | http://song.tensorfork.com:8097/#images
           | 
           | It's been going on for a month and a half, but I leave it
           | running mostly as a fishtank rather than to get to a specific
           | objective. It's fun to load it up and look at a new random
           | image whenever I want. Plus I like the idea of my little TPU
           | being like "look at me! I'm doing work! Here's what I've
           | prepared for you!" so I try to keep my little fella online
           | all the time.
           | 
           | - https://i.imgur.com/0O5KZdE.png
           | 
           | - Plus stuff like this makes me laugh really hard.
           | https://i.imgur.com/EnfIBz3.png
           | 
           | - Some nice flowers and a boat.
           | https://i.imgur.com/mrFkIx0.png
           | 
           | The model is getting quite good. I kind of forgot about it
           | over the past few weeks. StyleGAN could never get anywhere
           | close to this level of detail. I had to spend roughly a year
           | tracking down a crucial bug in the implementation that
           | prevented biggan from working very well until now:
           | https://github.com/google/compare_gan/issues/54
           | 
           | And we also seemed to solve BigGAN collapse, so theoretically
           | the model can improve forever now. I leave it running to see
           | how good it can get.
           | 
            |  _I've never seen conglomerate pictures like this used in AI
           | training. Do you train models on these 4x4 images? What's the
           | purpose vs a single picture at a time? Does the model know
           | that you're feeding it 4x4 examples, or does it have to
           | figure that out itself?_
           | 
           | Nah, the grid is just for convenient viewing for humans.
           | Robots see one image at a time. (Or more specifically, a
           | batch of images; we happen to use batch size 2 or 4, I
           | forget, so each core sees two images at a time, and then all
           | 8 cores broadcast their gradients to each other and average,
           | so it's really seeing 16 or 32 images at a time.)
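            | 
            | The "broadcast and average" part is plain data parallelism. A
            | minimal sketch of the idea in JAX -- not our actual TF/TPU
            | code, and the toy model/loss here are made up:
            | 
            |     import jax
            |     import jax.numpy as jnp
            |     from functools import partial
            | 
            |     def loss_fn(params, images):
            |         preds = images @ params        # toy linear "model"
            |         return jnp.mean((preds - 1.0) ** 2)
            | 
            |     # inputs carry a leading axis of size = number of cores
            |     @partial(jax.pmap, axis_name='cores')
            |     def train_step(params, images):
            |         grads = jax.grad(loss_fn)(params, images)
            |         # each core averages its grads with all the others
            |         grads = jax.lax.pmean(grads, axis_name='cores')
            |         return params - 1e-3 * grads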
           | 
           | I feel a bit silly plugging our community so much, but it's
           | really true. If you like tricks like this, join the
           | Tensorfork discord:
           | 
           | https://discord.com/invite/x52Xz3y
           | 
           | My theory when I set it up was that everyone has little
           | tricks like this, but there's no central repository of
           | knowledge / place to ask questions. But now that there are
           | 1,200+ of us, it's become the de facto place to pop in and
           | share random ideas and tricks.
           | 
           | For what it's worth, https://thisanimedoesnotexist.ai/ was a
           | joint collaboration of several Tensorfork discord members. :)
           | 
           | If you want future updates about this specific BigGAN model,
           | twitter is your best bet: https://twitter.com/search?q=(from%
           | 3Atheshawwn)%20biggan&src...
        
             | Itsdijital wrote:
             | This is awesome, thanks.
        
         | indiv0 wrote:
          | That's awesome! Is your model available publicly? I run a
          | site [0] where users can generate images from text prompts
          | using models like the official BigGAN-Deep one, and I'd love to
          | try it out for this purpose. Do you also have somewhere where
          | you discuss this stuff? I'm new to ML in general and was
          | wondering if there's somewhere where y'all experts gather.
         | 
         | [0]: https://dank.xyz
        
         | code-scope wrote:
          | Is there any (inference) SW framework that takes a YouTube
          | video as input and spits out object/timestamp pairs as output?
        
         | trhway wrote:
         | > pretraining on JFT increases the accuracy means that the
         | model is generalizing
         | 
          | not necessarily, it may be mostly a bonus of the transfer,
          | especially considering that JFT is that much larger. Getting,
          | for example, the first conv layers' kernels to converge to
          | Gabor-like filters takes time, yet those kernels are very
          | similar across well-trained image nets (and there were some
          | works showing that they are optimal in a sense, and that this
          | is one of the reasons they appear in our visual cortex), so
          | they are transferrable and can practically be treated as fixed
          | in the new model (especially if those layers were pretrained
          | as part of a very large model and reached the state of generic
          | feature extraction). I suspect something similar applies to
          | the low-level feature-aggregating layers too.
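          | 
          | "Treated as fixed" here just means freezing those early layers
          | when fine-tuning on the new task. Roughly, in PyTorch terms
          | (purely illustrative, not anything from the paper):
          | 
          |     import torchvision
          | 
          |     model = torchvision.models.resnet50(pretrained=True)
          |     # freeze the earliest layers, whose kernels are generic
          |     for layer in (model.conv1, model.bn1, model.layer1):
          |         for p in layer.parameters():
          |             p.requires_grad = False
          |     # only the still-trainable parameters get updated later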
        
         | bla3 wrote:
         | Where are these images from? Are there more?
        
           | sillysaurusx wrote:
           | Oh, you! I'm so flattered. You're making me blush.
           | 
           | Sure, you can have as many as you want. Watch it train in
           | real time:
           | 
           | http://song.tensorfork.com:8097/#images
           | 
           | Though I imagine HN might swamp our little training server
           | running tensorboard, so here you go.
           | 
           | https://i.imgur.com/BkGxbo7.png
           | 
           | We've been training a BigGAN-Deep model for 1.5 months now.
           | Though that sounds like a long time, in reality it's
           | completely automatic and I've been leaving it running just to
           | see what will happen. Every single other BigGAN
           | implementation reports that eventually the training run will
           | collapse. We observed the same thing. But gwern came up with
            | a brilliantly simple way to solve this:
            | 
            |     if D_loss < 0.2:
            |         D_loss = 0
           | 
           | It takes some thinking about _why_ this solves collapse. But
            | in short, the discriminator isn't allowed to get too
           | intelligent. Therefore the generator is never forced to
           | degenerate into single examples that happen to fool the
           | discriminator, i.e. collapse.
           | 
           | If you like this sort of thing in general, I encourage you to
           | come join the Tensorfork community discord server, which we
           | affectionately named the "TPU podcast":
           | https://discord.com/invite/x52Xz3y
           | 
           | There are some 1,200 of us now, and people are always showing
           | off stuff like this in our (way too many) channels.
        
             | Jabbles wrote:
             | Do you take advantage of previous iterations of the
             | generator and discriminator? i.e. the generator should be
             | able to fool all previous discriminators, and the
             | discriminator should be able to recognise the work of all
             | previous generators?
        
               | sillysaurusx wrote:
               | Nope! It's an interesting balance. The truth of the
               | situation seems to be: the generator and discriminator
               | provide a "signal" to each other, like two planets
               | orbiting around each other.
               | 
               | If you cut the signal from one, the other will rapidly
               | veer off into infinity, i.e. collapse quickly. Or it will
               | veer off in the other direction, i.e. all progress will
               | stop and the model won't improve.
               | 
               | So it's a constant "signal", you see, where one is
               | dependent on the other _in the current state_. Therefore
               | I am skeptical of attempts to use previous states of
               | discriminators.
               | 
               | However! One of the counterintuitive aspects of AI is
               | that the strangest-sounding ideas often have a chance of
               | being good ideas. It's also so hard to try new ideas that
               | you have to pick specific ones. So, roll up your sleeves
               | and implement yours; I would personally be delighted to
               | see what the code would look like for "the current
               | generator can fool all previous discriminators".
               | 
               | I really do not mean that in any sort of negative or
               | dismissive way. I really hope that you will come try it,
               | because DL has never been more accessible. And the time
               | is ripe for fresh takes on old ideas; there's a very real
               | chance that you'll stumble across something that works
               | quite well, if you follow your line of thinking.
               | 
               | But for practical purposes, the current theory with
               | generators and discriminators is that they react to their
               | current states. So there's not really any way of testing
               | "can the generator fool all previous discriminators?"
               | because in reality, the generator isn't fooling the
                | discriminator at all -- each simply notices when the
                | other deviates by a small amount, and makes a
                | corresponding "small delta change" in response. Kind of
               | like an ongoing chess game.
        
               | Jabbles wrote:
               | Thanks for the detailed answer.
               | 
               | I don't claim it to be a novel idea, I just remember the
               | Alpha Go (zero?) paper that said they played it against
               | older versions to make sure it hadn't got into a bad
               | state.
        
               | sillysaurusx wrote:
               | Ah! This is an interesting difference, and illustrates
               | one fun aspect of GANs vs other types of models: Alpha Go
               | had a very specific "win condition" that you can measure
               | precisely. (Can the model win the game?)
               | 
               | Whereas it's very difficult to quantify what it means to
               | be "better" at generating images, once you get to a
               | certain threshold of realism. (Was Leonardo better than
                | Michelangelo? Probably, but it's hard to measure
               | precisely.)
               | 
               | The way that Alpha Go worked was, it gathered a bunch of
               | experiences, i.e. it played a bunch of games. Then, after
               | playing tons of games -- tens of dozens! just kidding,
               | probably like 20 million -- it then performed a _single
               | gradient update_.
               | 
               | In other words, you gather your current experiences, and
                | _then_ you react to them. It's a two-phase commit.
               | There's an explicit "gather" step, which you then react
               | to by updating your database of parameters, so to speak.
               | 
               | Whereas with GANs, that happens continuously. There's no
               | "gather" step. The generator simply tries to maximize the
                | discriminator's loss, and the discriminator tries to
               | minimize it.
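                | 
                | If it helps to see it concretely, that continuous
                | back-and-forth is just the usual alternating update loop,
                | something like this (a toy PyTorch sketch, not our actual
                | BigGAN/compare_gan code; the networks and data are made
                | up):
                | 
                |     import torch
                |     import torch.nn as nn
                | 
                |     G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                |                       nn.Linear(256, 784))
                |     D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                |                       nn.Linear(256, 1))
                |     opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
                |     opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
                |     bce = nn.BCEWithLogitsLoss()
                |     # stand-in "real" batches
                |     data = [torch.randn(16, 784) for _ in range(100)]
                | 
                |     for real in data:
                |         fake = G(torch.randn(real.size(0), 64))
                |         ones = torch.ones(real.size(0), 1)
                |         zeros = torch.zeros(real.size(0), 1)
                | 
                |         # D step: drive its loss down (real->1, fake->0)
                |         d_loss = (bce(D(real), ones) +
                |                   bce(D(fake.detach()), zeros))
                |         opt_d.zero_grad()
                |         d_loss.backward()
                |         opt_d.step()
                | 
                |         # G step: make D call the fakes real, i.e. push
                |         # D's loss back up
                |         g_loss = bce(D(fake), ones)
                |         opt_g.zero_grad()
                |         g_loss.backward()
                |         opt_g.step()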
               | 
               | Balancing the two has been very tricky. But the results
               | speak for themselves.
        
         | logane wrote:
         | Not sure if I follow your JFT argument, but there's a large
         | body of work on both (a) studying whether chasing ImageNet
         | accuracy yields models that generalize well to out of
         | distribution data [1, 2, 3] and (b) contextualizing progress on
         | ImageNet (i.e., what does high accuracy on ImageNet really
         | mean?) [4, 5, 6].
         | 
         | For (a), maybe surprisingly the answer is mostly yes! Better
         | ImageNet accuracy generally corresponds to better out of
         | distribution accuracy. For (b), it turns out that the ImageNet
         | dataset is full of contradictions---many images have multiple
         | ImageNet-relevant objects, and often are ambiguously or mis-
         | labeled, etc---so it's hard to disentangle progress in
         | identifying objects vs. models overfitting to the quirks of the
         | benchmark.
         | 
         | [1] ObjectNet: https://objectnet.dev / associated paper
         | 
         | [2] ImageNet-v2: https://arxiv.org/abs/1902.10811
         | 
          | [3] An Unbiased Look at Dataset Bias:
         | https://people.csail.mit.edu/torralba/publications/datasets_...
         | (pre-AlexNet!)
         | 
         | [4] From ImageNet to Image Classification:
         | https://arxiv.org/abs/2005.11295
         | 
         | [5] Are we done with ImageNet? https://arxiv.org/abs/2006.07159
         | 
         | [6] Evaluating Machine Accuracy on ImageNet:
         | http://proceedings.mlr.press/v119/shankar20c.html
        
       | The_rationalist wrote:
        | This isn't the real SOTA: "Meta Pseudo Labels" has ~10% fewer
        | errors, while having fewer parameters.
        | https://paperswithcode.com/sota/image-classification-on-imag...
        | However, the fast training is an interesting property.
       | 
        | It would be interesting to test those EfficientNets with zeroth-
        | order backpropagation, as it allows a 300X speedup (vs 8.7x)
        | while not regressing accuracy _that much_:
       | https://paperswithcode.com/paper/zorb-a-derivative-free-back...
        
       | f430 wrote:
       | eli5 what SOTA image recognition is?
        
         | jphoward wrote:
         | SOTA is 'state of the art'. Image recognition is a task
         | classically appraised by calculating the accuracy on the
          | ImageNet dataset, which requires a system to classify each
          | image as one of 1,000 pre-determined classes.
        
           | f430 wrote:
           | so how many images does the current SOTA take to train a
           | classifier? Trying to gauge how much of an improvement
           | Deepmind has made here.
        
             | jphoward wrote:
              | The results that should be used to compare their work to
              | others involve training on just under 1.3 million images
             | across the 1,000 classes.
             | 
             | Their best results involve 'pretraining' on a dataset of
             | 300 million examples, before 'tuning' it on the actual
             | ImageNet training dataset as above.
        
         | [deleted]
        
         | marviel wrote:
         | State of The Art == SOTA
         | 
         | You might enjoy paperswithcode.com
        
       | smeeth wrote:
        | The speed improvements are certainly interesting; the performance
        | improvements seem decidedly not. This method has more than 2x the
       | parameters of all but one of the models it was compared against.
       | 
       | If I'm off-base here can someone explain?
        
         | modeless wrote:
         | I don't care how many parameters my model has per se. What I
         | care about is how expensive it is to train in time and dollars.
         | If this makes it cheaper to train better models despite more
         | parameters, that's still a win.
        
           | 6gvONxR4sf7o wrote:
           | Some models are still memory limited. Fewer parameters are
           | very useful in those settings.
        
           | sillysaurusx wrote:
           | There's one important caveat, though I agree with your
           | thrust: at GPT-3 scale, cutting params in half is a
           | nontrivial optimization. So it's worth keeping an eye out for
           | that concern.
           | 
           | (Yeah, none of us are anywhere near GPT-3 scale. But I spend
           | most of my time thinking about scaling issues, and it was
           | interesting to see your comment pop up; I would've agreed
           | entirely with it myself, if not for seeing all the anguish
           | caused by attempting to train and deploy billions of
           | parameters.)
        
           | dontreact wrote:
           | In cases where you have to deploy the model and you are
           | limited in terms of flops, this paper does not help much,
            | unless its removal of batchnorm somehow allows a future
           | network that is actually faster at inference time.
        
             | modeless wrote:
             | There are a lot of techniques for sparsifying or pruning or
             | distilling models to reduce inference FLOPS, and they
             | almost always produce better results when starting with a
             | better model. Also, if your model is 8x faster to train at
             | the same size then you can do 8x as much hyperparameter
             | tuning and get a better result.
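              | 
              | For example, magnitude pruning is nearly a one-liner in
              | PyTorch (purely illustrative, nothing specific to NFNets):
              | 
              |     import torch.nn as nn
              |     import torch.nn.utils.prune as prune
              | 
              |     layer = nn.Linear(512, 512)
              |     # zero out the 80% smallest-magnitude weights
              |     prune.l1_unstructured(layer, name="weight", amount=0.8)
              |     print(float((layer.weight == 0).float().mean()))  # ~0.8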
        
               | dontreact wrote:
               | This model is much more expensive than efficientnet at
               | inference (I think the flops are about 2x?). You can use
               | these same techniques with efficientnet.
        
             | david-gpu wrote:
             | But for deployment in smaller devices you can use
             | techniques such as distillation, quantization and sparsity.
             | Training and inference are very different problems in
             | practice.
        
               | dontreact wrote:
                | Yes, but you can do that with efficientnet as well. The
                | point is that this is an improvement only for training,
                | because it uses computations which are highly optimized
                | on TPU.
        
       | Willson50 wrote:
       | *11.5% as much compute.
        
         | The_rationalist wrote:
         | But what was the baseline hardware for a reasonable training
         | time?
        
           | nl wrote:
           | Page 7 has a table of one training step on TPUv3 and V100
           | GPUs.
           | 
           | I don't completely understand this: NFNet is slower than its
           | competitors on this benchmark, but they claim higher
           | efficiency. This isn't obvious to me.
        
       | nullifidian wrote:
       | Do they disclose any important techniques/ideas on how to achieve
        | these results in the paper, or is it more of a technical press
        | release?
        
         | belgian_guy wrote:
          | Yes they do (arxiv is a place for scientific papers, not press
          | releases). I've only skimmed it, but the paper introduces an
          | adaptive way to clip gradients: if the ratio of the gradient
          | norm to the weight norm surpasses a certain threshold, they
          | clip the gradient. This stabilizes learning and seems to avoid
          | the need for batch normalization. Seems quite promising imo and
          | something that could stick (I'd be quite happy if we could
          | finally do away with batchnorm).
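          | 
          | Roughly, the idea looks like this in PyTorch -- a per-tensor
          | sketch only; the paper applies it per output unit (row of the
          | weight matrix), so don't take this as their exact algorithm:
          | 
          |     import torch
          | 
          |     def adaptive_grad_clip_(params, clip_ratio=0.02, eps=1e-3):
          |         # after loss.backward(): rescale each gradient so that
          |         # ||grad|| / ||weight|| never exceeds clip_ratio
          |         for p in params:
          |             if p.grad is None:
          |                 continue
          |             w_norm = p.detach().norm().clamp_min(eps)
          |             g_norm = p.grad.detach().norm()
          |             max_norm = clip_ratio * w_norm
          |             if g_norm > max_norm:
          |                 p.grad.mul_(max_norm / (g_norm + 1e-6))
          | 
          |     # usage, per step:
          |     #   loss.backward()
          |     #   adaptive_grad_clip_(model.parameters())
          |     #   optimizer.step()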
        
       | modeless wrote:
       | The big deal here is the removal of BatchNorm. People never
       | really liked BatchNorm for various theoretical and practical
       | reasons, and yet it was required for all the top performing
       | models. If this allows us to get rid of it forever that will be
       | really nice.
        
       | nl wrote:
       | Make it work then make it fast and efficient.
       | 
       | People complaining about how slow and expensive brand new models
       | are to train are ignorant of the history of machine learning and
       | of engineering in general.
        
       | throwaway189262 wrote:
       | I have a feeling the ML community is going to pivot focus to
       | faster and smaller training before larger advancements are made.
       | It's simply too expensive for much AI research to happen when
        | state of the art models take $500k of hardware to train.
       | 
       | For all the mathematician hype around ML research, much of the
       | work is closer to alchemy than science. We simply don't
       | understand a great deal of why these neural nets work.
       | 
       | The people doing math above algebra are few and the scene is
       | dominated by "guess and check" style model tinkering.
       | 
       | Many "state of the art models" are simply a bunch of common
       | strategies glued together in a way researchers found worked the
       | best (by trying a bunch of different ones).
       | 
       | An average Joe could probably write influential ML papers by
       | gluing RNN/GAN layers to existing models and fiddling with the
       | parameters until they beat current state of the art. In fact, in
       | NLP models, this is essentially what has happened with roBERTa,
       | XLNET, ELECTRA, etc. They're all somewhat trivial variations on
       | Google's BERT, which is more creative but yet again built on
       | existing models.
       | 
       | Anyways, my point is, none of this required math or genius or
       | particularly demanding thought. It was basically let's tinker
       | with this until we find a way that's better, using guess and
       | check. No equations needed.
       | 
       | We are a long way from the type of simulations done for protein
       | folding and materials strength and basically every other
        | scientific field. It's still the wild west.
        
         | sillysaurusx wrote:
         | Apply to TFRC! https://www.tensorflow.org/tfrc
         | 
         | They are very permissive. And you get to play with $500k worth
         | of hardware. Been a member for over a year now. Jonathan is
         | singlehandedly the best support person I've ever worked with,
         | or perhaps ever will work with.
         | 
         | I would've completely agreed with you if not for TFRC. And I
         | couldn't resist the opportunity of playing with some big metal,
         | even if it's hard to work with.
        
           | panabee wrote:
           | just applied. thanks for sharing!
        
         | tbalsam wrote:
         | This feels much like the sentiment in the field about two years
         | ago or so. While I feel like the "alchemy" storyline is still
         | somewhat in play, most of the big important parts of the deep
         | learning process have enough ideological linear approximators
         | stacked around them that if you know what you're doing or
         | looking at, you can jump to an unexplored trench with some
         | reasonable feeling about whether you'll get something good or
          | not. I feel like the "alchemy" impression arises when people new
          | to the field are inundated with information about it, and while I
         | think that still holds, there very much is a well-understood
         | science of principles in most parts of it.
         | 
         | There's the neural tangent kernel work that's achieved a lot,
         | and the transformers themselves are really taking off a lot as
         | the blockwise/lower rank approximation algorithms look more and
         | more like circuits built off of basic, more well-established
         | components.
         | 
         | "An average Joe could probably write influential ML papers by
         | gluing RNN/GAN layers to existing models and fiddling with the
         | parameters until they beat current state of the art. In fact,
         | in NLP models, this is essentially what has happened with
         | roBERTa, XLNET, ELECTRA, etc. They're all somewhat trivial
         | variations on Google's BERT, which is more creative but yet
         | again built on existing models."
         | 
         | This feels like it trivializes a lot of the work and collapses
         | some of the major advancements in training at scale down to a
         | more one-dimensional outlook. Companies are doing both, but
         | it's easy to throw money and compute at an absolutely
         | guaranteed logarithmic improvement in results. It's not
         | stupidity, it's just reducing variance in scaling known laws as
         | we work on making things more efficient, which weirdly enough
         | starts the iterative process of academics frantically trying to
         | mine the expensive, inefficient compute tactics to flag plant
         | their own materials.
         | 
          | With respect to your comment on protein folding and such, I feel
          | you might have missed a lot of the major work in that area more
          | recently. There really and truly has been some field-shattering
          | work there, combining deep learning systems with last-mile
          | supervision and refinement systems. I'd posit that we're very
          | much out of the wild west and into the mild, but still
          | rambunctious, west, if I were to put terms on it.
         | 
         | With reference to guess and check -- yes, that especially was
         | prevalent and worked 2-3 years ago and I'd be in favor of
         | advocating that it does still happen somewhat in a more refined
         | fashion, but I personally believe we'd not get too far beyond
          | the SOTA if we're not working (effectively) with the data
          | manifold now and tightly incorporating whatever
          | projections/constraints of that data distillation process into
          | the network training procedure. I really do agree with you in
          | that I think average Joe breakthroughs will happen and continue
          | to benefit the field, and I'd certainly agree that there's
         | always going to be the mediocre paper churn of paper mills I
         | think that you alluded to trying to justify their own existence
         | as academics/paper writers, but I really do legitimately think
         | there's enough precedent set in most parts of the field that
         | you need to have some kind of thoughtful improvement to move
         | forward (like AdaBelief, which is still terrible because they
         | straight up lie about what they do in the abstract, even though
         | the improvement of debiasing the variance estimates during
         | training is an exceptionally good idea).
         | 
         | Just my 2c, hope this helps. I think we may have a similar end
         | perspective from two different sides, like two explorers
          | looking at the same peak from different sides of the
          | mountain. :thumbsup:
        
         | nickhuh wrote:
         | There's a lot of interest in various ML communities on more
         | efficient training and inference. Both vision and NLP have had
         | a growing focus on these problems in recent years.
         | 
         | I think you make a good observation that much of ML progress is
         | driven by tinkering with existing models, though instead of
         | describing it as more "alchemy than science" it's probably more
         | accurate to say it's very experimental right now. Being very
         | experimental is neither unscientific nor unusual in the
         | development of knowledge. James Watt worked as an instrument
         | maker (not a theoretician) when he invented the Watt steam
         | engine in 1776 [1], and at the time the idea of heat as
         | Phlogiston [2] was still more prevalent than anything that
         | looks like modern thermodynamics. Theory and practice naturally
         | take turns outpacing each other, which is part of why we need
         | both.
         | 
         | I'd also caution against the belief that experimental work
         | doesn't require "particularly demanding thought". There are
         | many things one can tweak in current ML models (the search
         | space is exponential) and, as you point out, the experiments
         | are expensive. Having a solid understanding of the system,
         | great intuition, and good heuristics is necessary to reliably
         | make progress.
         | 
         | For those who are interested in the theory of deep learning,
         | the community has recently made great strides on developing a
         | mathematical understanding of neural networks. The research is
         | still very cutting edge, but the following PDF helps introduce
         | the topic [3].
         | 
         | [1]: https://en.wikipedia.org/wiki/James_Watt
         | 
         | [2]: https://en.wikipedia.org/wiki/Phlogiston_theory
         | 
         | [3]:
         | https://www.cs.princeton.edu/courses/archive/fall19/cos597B/...
        
         | The_rationalist wrote:
          | While I generally agree to some extent: _roBERTa, XLNET,
          | ELECTRA, etc. They're all somewhat trivial variations on
          | Google's BERT, which is more creative_ -- researchers take
          | inspiration from existing models of course, and some BERT
          | derivatives are trivial. However, XLNet is in its own league:
          | while the author (a genius Chinese student) was inspired by
          | BERT, it is one of the few SOTA pre-trained models not based
          | on BERT, and it is actually an autoregressive one! That
          | difference allows it to be better at many things, since it
          | doesn't have to corrupt the tokens (from my shallow
          | understanding). The model is two years old but is sadly still
          | the one that ranks SOTA on key tasks, e.g. dependency parsing.
          | And after all this time nobody has cared enough to even test
          | it on other foundational tasks (which is extremely sad and
          | pathetic), e.g. coreference resolution. Sadly, because of
          | conformism effects, almost no researchers have created XLNet
          | derivatives. Almost all researchers continue to search in the
          | local minimum that is BERT, which I find immensely ironic.
         | 
         | While ad hoc empirical fine tuning is a big part of improving
         | sota, mathematical genius can still enable revolutions e.g this
         | recent alternative to classical backpropagation that is 300X
         | faster with low accuracy loss
         | https://paperswithcode.com/paper/zorb-a-derivative-free-back...
        
           | throwaway189262 wrote:
            | Interesting, but I'm not sure you're completely right about
            | XLNet. I heard it takes an absurd amount of resources to
            | train, even more than the BERT variations. And this is likely
            | why there's not a ton of interest in it.
        
             | The_rationalist wrote:
              | https://github.com/renatoviolin/xlnet XLNet running on very
              | low-end hardware (a single 8GB 2080, non-Ti) significantly
              | outperforms BERT-large on e.g. the reference question
              | answering benchmark, SQuAD 2.0: 86% vs 81%.
              | 
              | Nobody has even tried to create a spanXLNet (akin to
              | spanBERT). How many years will be wasted before researchers
              | get out of the BERT local minimum? I'm afraid it might
              | last a decade.
        
         | jgalt212 wrote:
         | > The people doing math above algebra are few and the scene is
         | dominated by "guess and check" style model tinkering.
         | 
         | "guess and check" is terribly ineffective with multi-day
         | training runs. Brings us right back to the batch processing
         | paradigm of the 1960s.
        
       | highfrequency wrote:
       | They compare the training latency for different models with a
       | fixed batch size of 32. But if the DeepMind models are several
       | times larger than the comparison models in each latency class, it
       | seems that the comparison models could use larger batch sizes for
       | faster overall training time.
        
         | trott wrote:
          | For ConvNets, the memory use of the models themselves is pretty
          | modest. For example, even with 0.5B parameters in FP32,
          | weights+gradients+momentum should use just 6GB (0.5B x 4 bytes
          | = 2GB per copy, times three copies), unless your framework
          | sucks or you have extra overhead from distributed training. So
          | if your model is half the size, you'll only save 3GB. If your
          | VRAM is 32GB, saving 3GB won't let you use a much bigger batch
          | size. On the other hand, the absence of batch norm can actually
          | lead to memory savings proportional to batch size.
        
       | longtom wrote:
        | tl;dr: Don't use batch norm to prevent exploding gradients; use
        | adaptive gradient clipping thresholds instead.
        | 
        | For this they compute the Frobenius norm (square root of the sum
        | of squares) of each weight matrix and of its gradient, and clip
        | the gradient whenever the ratio of the gradient norm to the
        | weight norm exceeds a threshold.
        | 
        | That saves the meta-search for an optimal fixed threshold, and
        | it also adapts per layer in a way a single fixed threshold never
        | could.
        | 
        | Very simple idea.
        
         | sillysaurusx wrote:
         | Thanks for macroexpanding frobnorm.
         | 
         | I'm skeptical that these hand-coded thresholds can ever match
         | what a model can learn automatically. But it's hard to argue
         | with results.
        
         | highfrequency wrote:
         | Is this the first time that the gradient clip threshold has
         | been chosen relative to the size of the weight matrix?
        
         | longtom wrote:
         | Lol, why is my comment down here with 7 upvotes.
        
       ___________________________________________________________________
       (page generated 2021-02-14 23:00 UTC)