[HN Gopher] Google's Pathways Language Model and Chain-of-Thought
       ___________________________________________________________________
        
       Google's Pathways Language Model and Chain-of-Thought
        
       Author : vackosar
       Score  : 59 points
       Date   : 2022-04-18 15:43 UTC (7 hours ago)
        
 (HTM) web link (vaclavkosar.com)
 (TXT) w3m dump (vaclavkosar.com)
        
       | phoe18 wrote:
        | The article quotes the cost as roughly $10B in the first
        | paragraph. Likely a typo? They quote $10M in a later paragraph.
        
         | vackosar wrote:
         | Yes, of course, thanks!
        
         | azinman2 wrote:
          | Ya, I was like, there's no way Google spent $10B on this.
        
       | vackosar wrote:
        | Correction!! The model cost around $10M, not $10B! Thanks for
        | raising that. Mistake made while copying from the second slide :(
        
       | PaulHoule wrote:
        | I've talked about structural deficiencies in earlier language
        | models; this one seems to be doing something about them.
        
         | vackosar wrote:
         | Sounds interesting! Would you link to that or describe them
         | here? Thanks!
        
           | PaulHoule wrote:
           | A very simple one is "can you write a program that might
           | never terminate?"
           | 
            | If a neural network does a fixed amount of computation, then
            | it is never going to be able to do things that require a
            | program that may not terminate.
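            | 
            | A minimal sketch of such a program, using the Collatz
            | iteration as the example (whether it halts for every
            | positive n is an open problem; the name is just
            | illustrative):
            | 
            |     def collatz_steps(n: int) -> int:
            |         # Termination for all n > 0 is unproven.
            |         steps = 0
            |         while n != 1:
            |             n = 3 * n + 1 if n % 2 else n // 2
            |             steps += 1
            |         return steps
            | 
            | A network that always runs a fixed number of layers cannot
            | emulate an unbounded loop like this one.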
           | 
            | There are numerous results in theoretical computer science
            | that apply just as well to neural networks as to other
            | algorithms, even though people seem to forget that.
           | 
           | Another is "can an error discovered in late stage processing
           | be fed back to an early stage and be repaired?" That's
           | important if you are parsing a sentence like
           | Squad helps dog bite victim.
           | 
            | It was funny because I saw Geoff Hinton give a talk in 2005,
            | before he got super-famous, where he was talking about the
            | idea that led to deep networks. He criticized "blackboard"
            | systems and other architectures that produce layered
            | representations (say, the radar of an anti-aircraft system
            | that starts with raw signals, turns those into a set of
            | 'blips', coalesces the 'blips' into tracks, interprets the
            | tracks as aircraft, etc.)
           | 
            | Hinton said that you should build the whole system in an
            | integrated manner and train the whole thing end-to-end, and I
            | thought "what a neat idea" but also "there is no way this
            | would work for the systems I'm building, because it doesn't
            | have an answer for correcting itself."
        
             | cygaril wrote:
             | You're assuming here that there are discrete stages that do
             | different things. I think a better way to conceptualise
             | these deepnets is that they're doing exactly what you want
             | - each layer is "correcting" the mistakes of the previous
             | layer.
        
               | PaulHoule wrote:
               | Most "deep" networks are organized into layers and
               | information flows in a particular direction although it
               | doesn't have to be that way. Hinton wasn't saying we
               | shouldn't have layers but that we should train the layers
               | together rather than as black boxes that work in
               | isolation.
               | 
                | Also, when people talk about solving problems they talk
                | about layers; layers play a big role in the conceptual
                | models people have for how they do tasks, even if they
                | don't really do them that way.
               | 
                | For instance, in that ambiguous sentence somebody might
                | say it hinges on whether you think "bite" is a verb or a
                | noun.
               | 
                | (Every concept in linguistics is suspect, if only because
                | linguistics has proven to have little value for
                | developing systems that understand language. For instance
                | I'd say a "word" doesn't exist, because there are subword
                | objects that behave like a word (e.g. "non-") and phrases
                | that behave like a word (e.g. "dog bite" fills the same
                | slot as "bite").)
               | 
               | Another ambiguous example is this notorious picture
               | 
               | https://www.livescience.com/63645-optical-illusion-young-
               | old...
               | 
                | which most people experience as "flipping" between two
               | states. Since you only see one at a time there is some
               | kind of inhibition between the two states. Who knows how
               | people really see things, but if I'm going to talk about
               | features I'm going to say that one part is the nose of
               | one of the ladies or the chin of the other lady.
               | 
                | Deep networks as we know them have nothing like that.
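                | 
                | For reference, a toy sketch of the kind of mutual
                | inhibition being described (winner-take-all between two
                | readings; the numbers are made up):
                | 
                |     def settle(a, b, inhibition=0.9, steps=50):
                |         # Each state suppresses the other; whichever
                |         # starts stronger drives its rival to zero.
                |         for _ in range(steps):
                |             a, b = (max(0.0, a - inhibition * b),
                |                     max(0.0, b - inhibition * a))
                |         return a, b
                | 
                |     print(settle(1.0, 0.8))  # first reading wins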
        
             | space_fountain wrote:
              | I'm by no means an expert, but a lot of the choices machine
              | learning algorithms make are more about training
              | parallelization than anything else. In many ways it feels
              | like a recurrent neural network, or some architecture even
              | weirder, should be better for language, but in practice
              | it's harder to train an architecture that demands each new
              | output depend on the one before. Introducing dependencies
              | on prior output typically kills parallelization. Obviously
              | this is less of a problem for, say, a brain that has years
              | of training time, but more of a problem if you want to
              | train one up in much less time using compute that can't do
              | sequential things very quickly.
       | simulate-me wrote:
        | The amount of capital needed to train these high-quality models
        | is eye-watering (not to mention the costs needed to acquire the
        | data). Does anyone know of any well-capitalized startups
        | exploring this space?
        
         | lumost wrote:
          | OpenAI would be the best example. However, these large language
          | models also have limited business value _today_, making a
          | startup a speculative bet that the team will beat
          | Google/FB/academics at making a language model _and_ find a
          | viable business model for the resulting model.
          | 
          | I'd take one of those bets or the other; both are tough to pull
          | off. Considering that the first task of such a startup would be
          | to hand ~$100-500MM to a hardware or cloud vendor, I'd be
          | hesitant to invest.
        
           | visarga wrote:
            | It costs less than $10M to train. Why hand so much to a
            | hardware or cloud vendor? Soon enough there will be open-
            | source GPT-3s; at least two are in training as we speak
            | (BigScience and EleutherAI).
           | 
           | > these large language models also have limited business
           | value today
           | 
           | The Instruct version of GPT-3 has become very easy to steer
           | with just a task description. It can do so many tasks so well
           | it's crazy. Try some interactions with the beta API.
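            | 
            | For example, a minimal sketch against the 2022-era openai
            | Python client (model name, prompt, and key are illustrative
            | placeholders):
            | 
            |     import openai
            | 
            |     openai.api_key = "sk-..."  # your API key
            |     resp = openai.Completion.create(
            |         engine="text-davinci-002",  # Instruct-series model
            |         prompt="Summarize in one sentence: <your text>",
            |         max_tokens=64,
            |         temperature=0.0,
            |     )
            |     print(resp.choices[0].text.strip())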
           | 
           | I believe GPT-3 is already above average human level at
           | cognitive tasks that fit in a 4000 token window. In 2-3 years
           | I think all developers will have to adapt to the new status
           | quo.
        
             | nawgz wrote:
             | > I believe GPT-3 is already above average human level at
             | cognitive tasks that fit in a 4000 token window.
             | 
              | How can you possibly make a claim like this without like 80
              | links justifying it? The claim is fuzzy and absurd, my
              | least favorite combo.
        
               | visarga wrote:
               | Gut feeling based on playing with it. Here's an example:
               | 
               | > Colorless green ideas sleep furiously, and other
               | grammatical nonsense by Noam Chomsky
               | 
               | He was a man without a country, A linguist without a
               | language, A mind without a thought, A dream without a
               | dreamer. He was lost in a world of words, A world where
               | ideas slept furiously, And grammar was a never-ending
               | nightmare.
               | 
               | But he persevered, For he knew that language was the key
               | to understanding the world. And so he continued to study,
               | To learn all that he could, In the hopes that one day, He
               | would find his way home.
        
               | nawgz wrote:
               | > Gut feeling based on playing with it.
               | 
                | Ok, on my first reading your phrasing made it sound like
                | some article or other material had convinced you of this
                | opinion; now I understand.
               | 
               | This is kind of my point about 80 links though - you're
               | using a definition of "cognitive tasks" that more closely
               | resembles knowledge, and then you're letting your
               | personal feelings about profundity guide your conclusions
               | on said cognition.
               | 
               | I don't deny that the machine can output pretty words and
               | has a breadth of knowledge to put us each to shame on
               | some simple queries, but "cognition in a 4000 token
               | window" is an incredibly large place and I don't even
               | understand how you would be able to claim a machine has
               | above-human-average cognition based solely on your own
               | interactions... That's a pretty crazy leap.
               | 
               | PS: I saw the downvotes, I was downvoted for questioning
               | the validity of information that was actually just pure
               | conjecture, be better with your votes
        
               | refulgentis wrote:
               | > Gut feeling based on playing with it
               | 
                | You should check out the post we're commenting on; it has
                | graphs for this exact metric.
               | 
                | Spoiler: Google's model with 3x the parameters does pass
                | the average human in a couple of categories, but not in
                | all. I don't think GPT-3 does in any.
                | 
                | It's doubly puzzling to me because you have access and
                | are asserting it feels like an average human to you. It's
                | awesome and it does magical stuff; I use it daily both
                | for code and prose. It also majorly screws up sometimes.
                | It's only at an average human level if we play word games
                | with things like "well, the average human wouldn't know
                | the Dart implementation of the 1D Gaussian function.
                | Therefore it's better than the average human."
        
           | simulate-me wrote:
            | I agree 100%, but I think viable businesses will begin to
            | emerge, especially as these large models move from text to
            | images (and eventually to video and 3D models). If the
            | examples shown of DALL-E 2 are indicative of its quality,
            | then a large number of creative jobs could be replaced by a
            | single "creative director" using the model. But the high
            | entry cost just to attempt to train such a model will likely
            | remain a hurdle until more business value is proven.
        
             | lumost wrote:
              | aye - I suspect the other concern is that the high entry
              | costs can quickly lead to a "second mover" advantage. The
              | first team spends all the money doing the hard R&D, and the
              | second team implements a slightly better version for a
              | fraction of the money.
        
           | sjg007 wrote:
           | I'd just solve some existing problem with the most basic
           | language model you can get your hands on and then move up
           | from there. Sell it first.
        
         | vackosar wrote:
          | Correction! The cost is around $10M, not $10B.
        
         | gwern wrote:
         | The data here is effectively free. I don't think they would
         | exhaust The Pile, which you can download for free. This is also
         | true for text2image models like DALL-E 2: while OA may have
         | invested in its own datasets, everyone else can just download
         | LAION-400M (or if they are really ambitious, LAION-5B
         | https://laion.ai/laion-5b-a-new-era-of-open-large-scale-mult...
         | ).
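          | 
          | For instance, a sketch of streaming The Pile with the Hugging
          | Face datasets library (the hub id is an assumption; the Pile
          | is mirrored under several names):
          | 
          |     from itertools import islice
          |     from datasets import load_dataset
          | 
          |     # Streaming avoids downloading the full ~800GB up front.
          |     pile = load_dataset("the_pile", split="train",
          |                         streaming=True)
          |     for doc in islice(iter(pile), 3):
          |         print(doc["text"][:200])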
        
         | visarga wrote:
         | > The amount of capital needed to train these high-quality
         | models is eye watering
         | 
          | It's relative. It would cost more to open a 40-room hotel
          | (about $320k/room, i.e. roughly $12.8M), and hotels can't be
          | copied like software.
        
           | Vetch wrote:
            | It's not like that many people are opening 40-room hotels
            | either. Such amounts are atypical within programming and CS
            | communities.
            | 
            | A more relevant example is video games: imagine if the only
            | viable ones were top-end AAA games whose completed versions
            | could only be accessed via cloud gaming.
        
         | rafaelero wrote:
        | That's literally nothing compared to the benefits it could
        | provide if applied in the real world.
        
       ___________________________________________________________________
       (page generated 2022-04-18 23:01 UTC)