[HN Gopher] 20B-parameter Alexa model sets new marks in few-shot...
       ___________________________________________________________________
        
       20B-parameter Alexa model sets new marks in few-shot learning
        
       Author : reckel
       Score  : 55 points
       Date   : 2022-08-02 18:52 UTC (4 hours ago)
        
 (HTM) web link (www.amazon.science)
 (TXT) w3m dump (www.amazon.science)
        
       | zaroth wrote:
       | 20 billion parameters and the UI for voice is still cringe level
       | terrible.
       | 
       | Or is it just me and I've turned into a get-off-my-lawn
       | curmudgeon when it comes to audio interfaces?
       | 
       | > _Find a reservation far from my work location in eight hours
       | for 8 people at Union Auto Company._
       | 
       | Said absolutely no one ever, right? I guess if this is what it's
       | trained on it's no wonder.
        
         | creeble wrote:
         | _I_ can't parse that sentence.
        
         | jonathankoren wrote:
         | This is the type of thing you learn to say when you're dealing
         | with a slot-filling algorithm that allows for
         | overspecification. By putting everything in one utterance, you
         | avoid it coming back and asking "Where do you want a
         | reservation?" "When do you want your reservation?" "How many
         | people are in your party?"
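         | 
         | A rough sketch of that slot-filling pattern (hypothetical slot
         | names, not Alexa's actual schema): an overspecified utterance
         | fills every required slot up front, so there is nothing left
         | for the assistant to ask.
         | 
         |     # Minimal slot-filling loop (illustrative only).
         |     REQUIRED_SLOTS = ["restaurant", "time", "party_size",
         |                       "location"]
         |     PROMPTS = {
         |         "restaurant": "Where do you want a reservation?",
         |         "time": "When do you want your reservation?",
         |         "party_size": "How many people are in your party?",
         |         "location": "Near which location?",
         |     }
         | 
         |     def fill_slots(parsed: dict) -> dict:
         |         """Ask follow-ups only for slots left empty."""
         |         slots = dict(parsed)
         |         for name in REQUIRED_SLOTS:
         |             if name not in slots:
         |                 slots[name] = input(PROMPTS[name] + " ")
         |         return slots
         | 
         |     # The overspecified utterance fills everything at once:
         |     fill_slots({"restaurant": "Union Auto Company",
         |                 "time": "in eight hours",
         |                 "party_size": 8,
         |                 "location": "far from work"})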
        
       | ctoth wrote:
       | > We follow Hoffmann et al. (2022) and pre-train the model for
       | roughly 1 Trillion tokens (longer than the 300B token updates of
       | GPT-3).
       | 
       | If I'm understanding the discussion of the Chinchilla paper
       | correctly[0] then this should offer a significantly better boost
       | than increasing the number of parameters would have. Also really
       | cool that they make the model easy(ish) to run and play with!
       | 
       | [0]:
       | https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla...
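       | 
       | A rough back-of-the-envelope using the Chinchilla rule of thumb
       | of ~20 training tokens per parameter (a heuristic, not a figure
       | from the paper):
       | 
       |     params = 20e9                    # AlexaTM 20B
       |     chinchilla_tokens = 20 * params  # ~400B tokens
       |     actual_tokens = 1e12             # ~1T tokens reported
       |     print(f"{chinchilla_tokens:.0e} vs {actual_tokens:.0e}")
       | 
       | So the model is trained well past the ~20x heuristic, trading
       | extra training compute for a smaller model that is cheaper to
       | serve.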
        
         | rafaelero wrote:
         | Not sure how much scaling laws apply here, since this is a
         | seq-to-seq model instead of an autoregressive causal model.
         | It's interesting to see AlexaTM performing better than GPT-3
         | on SuperGLUE and SQuADv2, but it fails on Chain of Thought
         | prompting, which is a bummer. So, is it because it's a
         | different model, or because it is positively leveraging
         | multilingual tokens? I wish they compared this architecture
         | to a classic GPT family model.
        
       | mrlonglong wrote:
       | Should we worry if it achieves sentience? Any reason why we
       | shouldn't?
        
       | transcriptase wrote:
       | Question for those familiar with the backend of things like
       | Alexa, Google Home, Siri:
       | 
       | At what point can we say things like "turn off the bedroom light
       | in 5 minutes" or "stop music and turn off all the lights"? Even
       | something like "keep the lights on" in a motion sensor system
       | seems impossible. To me these feel like low-hanging fruit, and
       | yet despite all the advances in machine learning and these
       | systems being around for the better part of a decade, anything
       | but the simplest single-task, no-modifier commands results in a
       | "sorry... I didn't understand that" or completely unpredictable
       | results. Is there something inherently difficult about these
       | types of queries?
        
         | runnerup wrote:
         | Also, why doesn't Siri work at all in Honda Civics and Honda
         | HR-Vs when connected to CarPlay and driving on the highway
         | with no radio playing?
         | 
         | Google Assistant works fine, for the most part anyway.
        
           | jeffbee wrote:
           | Weird. Works fine in an Insight, which is in all respects a
           | Civic Hybrid.
        
         | jeffbee wrote:
         | My benchmark for Siri will be when it learns to do "Siri, wake
         | me up an hour earlier tomorrow". What it currently does is set
         | a new alarm for 1 hour in the future.
        
         | alphabetting wrote:
         | I have Google Home and the first one worked. The second did
         | not, but I don't think I'd ever say that, because I just have
         | routines where I say "I'm leaving" or "I'm going to bed" and
         | everything shuts off at once.
        
         | Closi wrote:
         | These are all subtly different problems I think, but in general
         | most of these architectures currently assume there is a single
         | intent.
         | 
         | > stop music and turn off all the lights
         | 
         | This is probably the easiest of the bunch because you are
         | asking it to perform two distinct actions.
         | 
         | > turn off the bedroom light in 5 minutes
         | 
         | This is much more complex, because you are asking the
         | application to set up some sort of workflow - after it
         | understands what you want it to do, it then has to work out
         | how to execute that, which will involve the device APIs /
         | services. This is a simple example, but there are lots of
         | permutations of different actions here. For example, you might
         | want to say "turn off the sound system once this song finishes
         | playing", which assumes the assistant can understand that you
         | want a task waiting specifically for the trigger of a
         | particular song finishing, and that it has the ability to set
         | up that trigger.
         | 
         | > "keep the lights on" in a motion sensor system
         | 
         | Now this is where the orchestration gets tricky -
         | 
         | The assistant has to:
         | 
         | * Work out that the lights are being affected by a motion
         | sensor system, which is likely outside its own platform.
         | 
         | * Work out that your intent is for the assistant to override
         | that.
         | 
         | * Understand how to connect to the platform in order to
         | control it.
         | 
         | * Work out what parameter it is supposed to alter to achieve
         | this task.
         | 
         | * Override the existing user's settings, and presumably
         | reinstate them after some period of time.
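         | 
         | A minimal sketch of the second case - the assistant has to
         | build a deferred action rather than fire an immediate one
         | (hypothetical API, nothing vendor-specific):
         | 
         |     import threading
         | 
         |     def turn_off(device: str) -> None:
         |         print(f"turning off {device}")
         | 
         |     def schedule(delay_s: float, action, *args):
         |         """Run an action later instead of right now."""
         |         timer = threading.Timer(delay_s, action, args=args)
         |         timer.start()
         |         return timer
         | 
         |     # "turn off the bedroom light in 5 minutes" becomes:
         |     schedule(5 * 60, turn_off, "bedroom light")
         | 
         | "Once this song finishes playing" would instead need an event
         | trigger (e.g. a player callback) rather than a fixed delay - a
         | different integration entirely.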
        
           | visarga wrote:
           | Language models can generate code from text instructions; it
           | just needs a training set covering the target APIs. I expect
           | that in the next couple of years we'll see automated desktop
           | operation (RPA) driven by text commands, generalising access
           | over human interfaces.
           | 
           | It's really a shame the good language models are not deployed
           | as voice assistants. It would probably be expensive to offer
           | and they don't have the scale necessary. Just to load one of
           | these models you need a $100K computer.
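           | 
           | A hedged sketch of that idea - have a model emit a call
           | against a small, whitelisted device API and only execute
           | what matches the whitelist (the "model" below is a stub;
           | every name is made up):
           | 
           |     import re
           | 
           |     ALLOWED = {"lights.off", "lights.on", "music.stop"}
           | 
           |     def fake_model(command: str) -> str:
           |         # Stand-in for an LM trained on the target API.
           |         if "light" in command and "off" in command:
           |             return 'lights.off(room="all")'
           |         return ""
           | 
           |     def run(command: str) -> None:
           |         code = fake_model(command)
           |         m = re.match(r"(\w+\.\w+)\(", code)
           |         if m and m.group(1) in ALLOWED:
           |             print("would execute:", code)
           |         else:
           |             print("refusing to run:", code or "<none>")
           | 
           |     run("turn off all the lights")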
        
             | Closi wrote:
             | It also depends on what the biggest priority is - I would
             | assume there is a bigger 'quick win', from a market and
             | customer experience perspective, in becoming more reliable
             | at single-intent actions rather than pursuing highly
             | complex multi-intent statements.
             | 
             | >99% of commands will be single intent, and they probably
             | work 80% of the time at the moment, so getting those to 99%
             | will have a much bigger short-term impact than focussing on
             | solving the 1% (with the added benefit that once you have
             | solved getting single intent right all the time, solving
             | the more complex queries will be easier, as you will have
             | built a more robust base).
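             | 
             | Back-of-the-envelope with those (made-up) figures:
             | 
             |     single, multi = 0.99, 0.01
             |     gain_single = single * (0.99 - 0.80)  # ~0.19
             |     gain_multi = multi * (1.00 - 0.00)    # 0.01 max
             |     print(f"{gain_single:.1%} vs {gain_multi:.1%}")
             | 
             | Fixing single-intent reliability improves roughly 19% of
             | all commands; nailing multi-intent helps at most 1%.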
        
           | omega3 wrote:
           | > This is much more complex, because you are asking the
           | application to set up some sort of workflow - after it
           | understands what you want it to do, it then has to work out
           | how to execute that, which will involve the device APIs /
           | services.
           | 
           | I don't see, conceptually, the difference between the first
           | and the second example. You're still executing two distinct
           | actions, the first being waiting for x amount of time?
        
         | liquidwax wrote:
         | My guess is that it's a very different (and difficult) problem
         | to generalize that way. Interpreting intent and taking action
         | are different aspects. Someone needs to write code to call a
         | vendor's API to execute those actions, and that's a super
         | specialized task. The next step is probably instructing a
         | Copilot-like tool to do it.
        
         | csnweb wrote:
         | > turn off the bedroom light in 5 minutes
         | 
         | This actually works already with Siri (and as mentioned in a
         | sibling comment with google Home as well). I just tried that
         | for fun a few days ago and was surprised that it actually
         | worked.
        
           | isatty wrote:
           | It didn't work for me.
           | 
           | "Turn off the lights in 5 minutes did but "turn off the
           | floorstanding lamp in 5 minutes" did not
           | 
           | Honestly that's more frustrating when it's not uniform and
           | now I've to remember this weird behavior.
        
         | jcoder wrote:
         | It's sad that the only option is a voice assistant that must
         | learn how to interpret my words through this slow, error-prone
         | process. I would much rather have a pure speech-to-text option
         | where I must learn the exact words to say to get a reliable
         | result.
        
           | kupopuffs wrote:
           | Yeah. I wouldn't even mind learning weird syntax or grammar,
           | like
           | 
           | Thread.sleep(5 * 60 * 1000); light.off();
        
       | xxpor wrote:
       | As a complete outsider: has ML research just become a phallus
       | measuring contest to see who can stuff the most parameters into a
       | model? In other words, who can acquire the most Nvidia cards? The
       | model size seems to always be the headline in stuff I see on HN.
        
         | visarga wrote:
         | This is a small model keeping up with the big guys. 20B
         | parameters might fit on two beefy GPUs; that's a bargain
         | compared to GPT-3.
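         | 
         | Rough memory math for inference, assuming fp16 weights and
         | ignoring activations and overhead:
         | 
         |     params = 20e9
         |     bytes_per_param = 2  # fp16
         |     weights_gb = params * bytes_per_param / 1e9
         |     print(f"~{weights_gb:.0f} GB of weights")  # ~40 GB
         | 
         | ~40 GB is tight on one 40-48 GB card and comfortable across
         | two, versus ~350 GB for GPT-3's 175B parameters at fp16.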
        
           | cuuupid wrote:
           | +1, also this is a teacher model. The implications are huge
           | here, as AWS will likely spin this into an offering like they
           | did with their other AI products. Building a model downstream
           | of GPT-3 is difficult and usually yields suboptimal results;
           | however, 20B is small enough that it would be easy to fine-
           | tune on a smaller dataset for a specific task.
           | 
           | You could then distill that model and end up with something
           | that's a fraction of the size (6B parameters, for example,
           | just under 1/3, would fit on consumer GPUs like 3090s).
           | There are some interesting examples of this with smaller
           | models like BERT/BART or PEGASUS in Hugging Face
           | Transformers' seq2seq distillation examples.
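           | 
           | The core of that distillation step, sketched with toy
           | tensors rather than the actual Hugging Face scripts: the
           | student is trained to match the teacher's softened output
           | distribution.
           | 
           |     import torch
           |     import torch.nn.functional as F
           | 
           |     def distill_loss(s_logits, t_logits, temp=2.0):
           |         """KL between softened teacher and student."""
           |         soft_t = F.softmax(t_logits / temp, dim=-1)
           |         log_s = F.log_softmax(s_logits / temp, dim=-1)
           |         return (F.kl_div(log_s, soft_t,
           |                          reduction="batchmean")
           |                 * temp * temp)
           | 
           |     # Toy example: 4 token positions, vocab of 8.
           |     teacher = torch.randn(4, 8)  # frozen teacher outputs
           |     student = torch.randn(4, 8, requires_grad=True)
           |     loss = distill_loss(student, teacher)
           |     loss.backward()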
        
           | pjfin123 wrote:
           | Yeah, this is the opposite: they did impressively well with
           | fewer parameters.
           | 
           | In general, larger models and more data have been an
           | effective strategy for getting better performance, but
           | getting the right ratio is also important:
           | https://www.deepmind.com/publications/an-empirical-
           | analysis-...
        
       | pjfin123 wrote:
       | The most notable thing about this model is that they use fewer
       | parameters (20 billion) than many of the other LLMs, which makes
       | it less resource-intensive to train and easier to run.
       | 
       | They also use an encoder-decoder architecture, which is common
       | for machine translation, unlike most large language models which
       | are decoder-only.
       | 
       | https://community.libretranslate.com/t/alexatm-a-20b-multili...
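       | 
       | For a concrete sense of the difference, a minimal sketch using
       | small public stand-ins (not the AlexaTM code): an encoder-
       | decoder reads the whole input bidirectionally and generates a
       | separate output sequence, while a decoder-only model treats
       | prompt and continuation as one left-to-right stream.
       | 
       |     from transformers import (AutoModelForCausalLM,
       |                               AutoModelForSeq2SeqLM,
       |                               AutoTokenizer)
       | 
       |     # Encoder-decoder (seq2seq), like AlexaTM but tiny.
       |     tok = AutoTokenizer.from_pretrained("t5-small")
       |     t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
       |     ids = tok("translate English to German: Good morning",
       |               return_tensors="pt")
       |     out = t5.generate(**ids, max_new_tokens=20)
       |     print(tok.decode(out[0], skip_special_tokens=True))
       | 
       |     # Decoder-only (GPT style): one left-to-right sequence.
       |     tok2 = AutoTokenizer.from_pretrained("gpt2")
       |     gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
       |     ids2 = tok2("Good morning", return_tensors="pt")
       |     out2 = gpt2.generate(**ids2, max_new_tokens=20)
       |     print(tok2.decode(out2[0]))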
        
       ___________________________________________________________________
       (page generated 2022-08-02 23:00 UTC)