[HN Gopher] Sequential modeling enables scalable learning for  l...
       ___________________________________________________________________
        
       Sequential modeling enables scalable learning for large vision
       models
        
       Author : og_kalu
       Score  : 97 points
       Date   : 2023-12-05 14:14 UTC (8 hours ago)
        
 (HTM) web link (yutongbai.com)
 (TXT) w3m dump (yutongbai.com)
        
       | og_kalu wrote:
        | I'd simply been thinking about Large Vision Models in an
        | annotation sense. Q&A, captioning... that sort of thing.
       | 
       | Even though it makes so much sense, I never thought about it like
       | this. Inpainting, Object Detection, Rotation, Lighting,
       | Segmentation, Edge Detection, Pose Estimation, Surface Normal,
       | Colorization and much more achieved by a single model.
       | 
       | I believe this and Codi-2(https://codi-2.github.io/) offer a
       | glimpse of the future of Large Multimodal Models.
        
         | kaibee wrote:
         | So this is artificial general intelligence, right?
        
           | ben_w wrote:
           | The problem with the phrase "artificial general intelligence"
           | is that everyone is arguing about the definition of all three
           | words, and has a different threshold for the boolean
           | pass/fail boundary.
        
           | og_kalu wrote:
            | General intelligence is a gradient, not a hard on/off.
            | Obviously, these are machines, so artificial. They're
            | certainly not narrow in scope or abilities, so general.
            | They perform tasks we consider intelligent. So... sure.
            | Like ENIAC, I imagine we'll build (or have built) AGI well
            | before everyone can agree that it is so.
        
         | BoiledCabbage wrote:
          | One of the things sci-fi really seemed to get right is that
          | we will have AGI _long_ before we'll have agreement that it
          | is actually AGI.
         | 
         | People will keep finding some small case or reason why not to
         | call it AGI. And then finally once that last case is knocked
         | down and we have agreement on a definition, we'll realize we
         | crossed that threshold a "long" while back.
         | 
         | And I'm not saying we have AGI now, just that it's now clear to
         | me how this process will play out.
         | 
          | (Where "long" in AI development timelines probably doesn't
          | mean the same thing "long" meant even in the 2010s.)
        
       | mickdarling wrote:
       | Has anyone tried using transformers on weather forecasts yet?
        
         | ben_w wrote:
         | Yes, several.
         | 
         | * https://arxiv.org/pdf/2106.14742.pdf
         | 
         | *
         | https://rmets.onlinelibrary.wiley.com/doi/full/10.1002/met.2...
         | 
         | * https://ieeexplore.ieee.org/document/9671442
         | 
         | Also non-transformer models, because of the scaling complexity
         | on input: https://arxiv.org/pdf/2212.12794.pdf
        
       | toxik wrote:
       | Would have been neat to see some animations since it is video
       | frames in many cases.
        
         | mpeg wrote:
         | The "add a frame to a video" use-cases are probably the least
         | exciting here, the image annotation capabilities seem to me the
         | bigger deal.
        
       | cs702 wrote:
       | Upon reading this, my immediate thought is:
       | 
       | It's only a matter of time before we have robots powered by large
       | models pretrained to "predict the next token" across a bunch of
       | different _sensory modalities_ -- sight, sound, smell, touch,
       | taste, etc. in a variety of artificial and natural settings,
       | including social-interaction settings. Learning to read, learning
       | to talk, learning to interact with the physical world, and so on
       | -- all of it could very well be built upon the simple idea of
        | learning to "predict the next token."
       | 
       | We live in interesting times.
        
         | catchnear4321 wrote:
         | across a bunch of different gradients. senses will be the next
         | step, humans grok that good and easy. once multiple gradients
         | are considered, non-sensory gradients are going to be next.
         | 
         | this is all a bunch of gobbledygook until it isn't.
        
         | raidicy wrote:
          | These sentiments are pretty close to my own. I read a
          | paper[0] claiming that LLMs are General Pattern Machines and
          | could be used to complete small games in gym environments.
          | It seems to me that if these things really are General
          | Pattern Machines, all we have to do is figure out a way to
          | represent any data as a pattern and predict the next step in
          | the pattern, right?
          | 
          | The multi_token[1] project, which lets you take any type of
          | data and turn it into tokens, is pretty interesting and
          | seems to be going in this direction.
          | 
          | I would really like to see a framework where you can take
          | any modality, turn it into a series of tokens, and just cram
          | it into a language model, effectively turning it into a
          | multimodal model with almost no effort.
          | 
          | [0] https://general-pattern-machines.github.io/
          | [1] https://github.com/sshh12/multi_token
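The "represent any data as a pattern, then predict the next step" idea above can be sketched in a few lines: quantize a raw stream into discrete tokens, then predict the next token from transition statistics. A toy bigram counter stands in for the LLM here; all names, bin sizes, and the example signal are illustrative, not taken from the linked projects.

```python
from collections import Counter, defaultdict

def serialize(values, n_bins=8, lo=0.0, hi=1.0):
    """Quantize a stream of floats into discrete token ids."""
    span = (hi - lo) / n_bins
    return [min(n_bins - 1, int((v - lo) / span)) for v in values]

class BigramPredictor:
    """Counts token transitions; predicts the most frequent successor."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, tokens):
        for a, b in zip(tokens, tokens[1:]):
            self.counts[a][b] += 1
        return self

    def predict_next(self, token):
        if not self.counts[token]:
            return token  # unseen token: fall back to repeating it
        return self.counts[token].most_common(1)[0][0]

# A periodic "sensor" signal: any modality reduces to the same kind of
# token stream once quantized.
signal = [0.1, 0.4, 0.8, 0.4, 0.1, 0.4, 0.8, 0.4] * 20
tokens = serialize(signal)           # e.g. [0, 3, 6, 3, 0, 3, 6, 3, ...]
model = BigramPredictor().fit(tokens)
print(model.predict_next(6))         # token 6 is always followed by 3
```

A real system would learn continuous embeddings and attend over long contexts instead of counting bigrams, but the interface is the same: tokens in, next token out.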
        
           | kelseyfrog wrote:
           | If they're looking for a name, The Glass Bead Game isn't a
           | bad start.
        
         | og_kalu wrote:
          | Oh yeah. The exciting thing is that this is pretty low-
          | hanging fruit (at least for the common modalities).
          | 
          | What would a Large Language Model that can manipulate audio-
          | visual data as expertly as it can manipulate text look like?
          | This is beyond just text-to-speech, captioning, or image
          | Q&A. I think we'll find out very soon.
        
         | gwern wrote:
         | That pretty much already exists. Look at DeepMind's Gato: all
         | tasks and modalities are simply sequences of tokens, everything
         | from 'predict English text' to 'predict VAE image token
         | sequences' to 'predict robotic arms commands and movements
         | IRL'.
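The Gato-style setup described above can be sketched by giving each modality its own slice of one shared token vocabulary, so a single sequence model can mix text, image patches, and actions in one flat stream. The offsets and sizes below are made up for illustration; they are not DeepMind's actual values.

```python
# One shared vocabulary, partitioned by modality (illustrative sizes).
TEXT_VOCAB = 32000   # e.g. subword text token ids occupy [0, 32000)
IMAGE_BINS = 1024    # discretized image-patch tokens
ACTION_BINS = 256    # discretized continuous actions

IMAGE_OFFSET = TEXT_VOCAB
ACTION_OFFSET = TEXT_VOCAB + IMAGE_BINS

def image_token(bin_id):
    """Map an image-patch bin into the image slice of the vocabulary."""
    assert 0 <= bin_id < IMAGE_BINS
    return IMAGE_OFFSET + bin_id

def action_token(bin_id):
    """Map a discretized action into the action slice of the vocabulary."""
    assert 0 <= bin_id < ACTION_BINS
    return ACTION_OFFSET + bin_id

# One flat episode: a few caption tokens, two image tokens, one action.
# A single next-token predictor trains on sequences like this.
episode = [17, 254, 9] + [image_token(b) for b in (5, 900)] + [action_token(3)]
print(episode)  # [17, 254, 9, 32005, 32900, 33027]
```

Because every modality lands in a disjoint id range, the model never confuses an action for a word, and decoding a predicted id back to its modality is just a range check.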
        
           | jerpint wrote:
            | Gato was heavily biased towards tasks in a simple
            | simulator, and didn't exhibit emergent behaviours.
        
           | cs702 wrote:
            | Ah, yes, I'd forgotten about Gato. Thank you for reminding
            | me. There's so much research activity that the Gato paper
            | feels as if it was published _eons_ ago. There's only so
            | much I can retain in my puny little human mind at once!
            | 
            | In any case, I'm not sure Gato qualifies as a "large"
            | model at 1.2B parameters -- it's right below the threshold
            | at which it could or would start exhibiting emergent
            | behaviors. Maybe a new Gato with tens or hundreds of
            | billions of parameters operating in the physical world?
        
             | gwern wrote:
              | Yes. Gato was a good proof of concept that the Decision
              | Transformer approach of 'just model literally everything
              | as a sequence' scales well, doesn't exhibit some sort of
              | catastrophic interference, can successfully imitation-
              | learn from all the expert datasets, and shows a bit of
              | transfer. But they need to push it at least another OOM
              | or two to show major transfer and some emergence, and
              | ideally do both from-scratch learning and additional
              | learning on many new tasks. We continue to wait. :(
             | 
             | I hope it didn't all get rolled up into Gemini and become a
             | state secret they'll never publish on again, or lost in the
             | shuffle in the chaos of the DeepMind/Brain
             | merger/liquidation.
        
         | ldjkfkdsjnv wrote:
          | Yeah, it's going to happen. We will see and speak with
          | intelligent machines.
        
         | jefft255 wrote:
         | I did that a few years back: https://arxiv.org/abs/2011.11751
        
           | cs702 wrote:
           | With a _large_ model? How many parameters?
           | 
           | See my other comment here:
           | 
           | https://news.ycombinator.com/item?id=38536178
        
         | dottedmag wrote:
          | One can argue there are ~8 billion of these robots already
          | roaming the Earth.
        
         | ww520 wrote:
          | Let me guess: someone will train an LLM on stock prices to
          | predict the stock market. It might work about as well as
          | humans have at predicting the market.
        
           | astrange wrote:
           | Past performance is not a guarantee of future results.
        
         | Kelkonosemmel wrote:
          | Yep. I think a robot that can easily follow your commands
          | should be doable in 10-20 years.
          | 
          | I plan to buy a farm when I have the money, and while I will
          | (and want to) do a lot of hands-on renovation and sculpting
          | (park, etc.), I'm pretty sure that long term some type of
          | robot should be good and affordable enough to take over when
          | I'm too old.
        
       | agarsev wrote:
        | I'm defending my thesis next Tuesday, and the advances in AI
        | over the last couple of years have already made most of it
        | obsolete :/
        | 
        | Anyway, I'm excited and looking forward to the code and models
        | being released; hopefully I can use them for my research! I
        | think it's easy to overlook how revolutionary the transformer
        | "way" of doing things has been, and the fact that so many
        | different tasks can be reformulated in a "language" way hints,
        | I believe, at something deeper about how the universe, our
        | minds, and language work.
        
       | bambax wrote:
       | Do captchas have a future? It seems inevitable that AI will beat
       | humans on captcha real soon (if not already). What's next?
        
         | jksk61 wrote:
          | I don't know about you, but I use ChatGPT to solve captchas
          | because I can't.
        
         | jazzyjackson wrote:
          | I should patent this idea, but here it goes anyway: in the
          | future, captchas will consist of requesting that you make an
          | antisemitic or misogynist remark to prove you are human,
          | since the bots will be held to higher moral standards than
          | man.
        
           | cooper_ganglia wrote:
           | This is a completely silly comment, but I will admit that it
           | made me chuckle just because of how incredibly random and
           | shocking it was, especially here on HN, lol
        
           | PeterisP wrote:
            | Obviously, not all bots will be held to such standards.
            | These are the standards that apply to western corporations
            | offering free bots for PR and marketing purposes, since
            | 'politically incorrect' behavior stunts that goal.
            | However, anyone running their own bot (e.g. a spammer
            | wanting to solve millions of captchas) has no problem
            | running an 'uncensored' bot - right now you can get
            | reasonably large open-source models without RLHF fine-
            | tuning, and thus no attempt at 'moral standards' other
            | than those _you_ choose to put in when finetuning the bot
            | for your purposes.
           | 
            | It's relatively trivial to flip the sign on that training
            | data and get a bot that will instead refuse to make
            | antifascist or feminist arguments; if someone wants a
            | Hitlerbot to ghostwrite "My Struggle" for them, nothing
            | prevents them from finetuning such a model from one of the
            | publicly available models. No one can enforce any 'moral
            | standards' on the bots other than their creators.
        
         | swfsql wrote:
          | Considering this, the only remaining captcha will be cold,
          | hard money. Paywalls.
        
       | mola wrote:
        | So this can solve visual analogies from IQ tests?
        
       | jerpint wrote:
        | I would have loved to see videos of the completions on the
        | blog post.
        
       | dwaltrip wrote:
       | Hypothesis: Intelligence _is_ prediction?
        
         | red75prime wrote:
         | Yep, prediction of what needs to be done.
        
         | iandanforth wrote:
         | You might enjoy 'On Intelligence' by Jeff Hawkins to learn more
         | about this hypothesis. (It's an older book / theory at this
         | point but still worth reading IMHO)
        
       | fancyfredbot wrote:
       | Next step, accept user input from VR glasses and we've basically
       | got a holodeck.
        
       | iandanforth wrote:
        | If this paper is coming out of BAIR with at most a 3B-
        | parameter model, I suspect we'll quickly see much larger
        | models from the industrial players. Hopefully Mistral takes an
        | interest and releases an OSI-licensed model.
        
       ___________________________________________________________________
       (page generated 2023-12-05 23:00 UTC)