[HN Gopher] Meta-Transformer: A unified framework for multimodal...
       ___________________________________________________________________
        
       Meta-Transformer: A unified framework for multimodal learning
        
       Author : ulrikhansen54
       Score  : 88 points
       Date   : 2023-07-24 17:33 UTC (5 hours ago)
        
 (HTM) web link (kxgong.github.io)
 (TXT) w3m dump (kxgong.github.io)
        
       | ImHereToVote wrote:
        | This seems like a step in a dangerous direction.
        
         | sebzim4500 wrote:
         | I am also concerned about existential threats from AI, but part
         | of the problem is that I have no idea which research directions
         | help and which ones hurt.
        
         | faktory wrote:
         | Why?
        
           | FrustratedMonky wrote:
            | Because up till now, many people who discount AI threats
            | base that discount on a few assumptions like 'it's just a
            | parrot', 'it doesn't have any drives', 'it doesn't really
            | understand', 'it isn't conscious', etc., ad infinitum.
           | 
            | But more and more different technologies are being plugged
            | together into something that starts to resemble a brain: a
            | visual cortex, a speech center, motor controls, and so on.
           | 
            | At some point the distinction between carbon-based life and
            | silicon becomes meaningless. All the arguments or proofs
            | that humans are conscious would equally prove AI is
            | conscious, or that neither truly is. Proving an AI is not
            | conscious would also prove humans aren't.
           | 
           | And of course, Terminators.
        
         | valine wrote:
         | It'll be ok. The technology for "dangerous" AI doesn't actually
         | exist. The near term risks we face from AI are constrained to
         | the realms of spam and privacy. World ending super-bots are
         | science fiction.
        
           | danielbln wrote:
            | Before the superintelligence sci-fi stuff we'll probably get
            | some sort of superworm: a rogue autonomous agent network
            | that improves itself via some framework like SKILL[1], going
            | around 0-day'ing systems left and right and wreaking havoc.
           | 
           | [1] https://arxiv.org/abs/2010.11944
        
             | naasking wrote:
             | WormGPT already exists. These will only become more
             | dangerous as the tech evolves.
        
           | flangola7 wrote:
           | Blind denial. No argument or evidence presented, merely bold
           | statements made with the expectation they be taken without
           | question.
           | 
           | Flying humans was science fiction 120 years ago. A single
           | bomb able to destroy an entire city was science fiction 80
           | years ago. A machine that can complete more mathematical
           | calculations in one minute than all human manual computation
           | in history was science fiction 60 years ago. EUV
           | photolithography capable of creating molecule-sized
           | transistors was science fiction 30 years ago. A computer that
           | can create visual art and talk to you in plain English was
           | science fiction 2 years ago. A computer that can clone your
           | voice and mannerisms was science fiction 1 year ago.
           | 
           | Science fiction has a way of becoming non-fiction, often
           | within the span of a generation or less.
        
           | naasking wrote:
           | > It'll be ok. The technology for "dangerous" AI doesn't
           | actually exist.
           | 
           | Nobody's worried about the tech that exists.
           | 
           | > The near term risks we face from AI are constrained to the
           | realms of spam and privacy.
           | 
           | Define "near term".
           | 
           | > World ending super-bots are science fiction.
           | 
              | Science fiction has become science fact before. Where's
              | the knockdown argument that it won't happen in this case?
        
             | valine wrote:
              | It's not feasible to worry about the implications of every
             | imaginary technology. Nuclear chain reactions were first
             | theorized to exist a decade before the first bomb dropped.
             | Should scientists have stopped exploring quantum mechanics
             | in the 30s? Fear of the unknown shouldn't be allowed to
             | stop scientific progress.
             | 
             | We can deal with the implications of dangerous AI if and
             | when it becomes a problem.
        
             | [deleted]
        
           | FrustratedMonky wrote:
           | >> "The technology for "dangerous" AI doesn't actually exist"
           | 
            | What? Did you not see the Netflix documentary on AI for
            | military use? They literally have AIs that can beat fighter
            | pilots in dogfighting.
           | 
           | Just because it isn't walking around having coffee and
           | chatting you up, doesn't mean it isn't already very advanced
           | and deadly.
        
             | valine wrote:
              | Dogfighting AI isn't going to end the world. When people
             | talk about the "risks" associated with AI they're talking
             | about an AI that spirals out of control and destroys
             | civilization. Something something infinite paper clip
             | optimizer.
             | 
              | It's sci-fi-themed end-times cosplay.
        
               | FrustratedMonky wrote:
               | I get that.
               | 
               | But the post was just saying it seems 'dangerous'. It is
               | already 'dangerous'.
               | 
               | Yes, it will probably become even 'more dangerous'.
               | 
                | I'd disagree that many people agree on a common
                | definition of risk. Some people think autonomous drones
                | that can beat humans in a dogfight are already too far;
                | others are holding out for paper clip optimizers before
                | getting worried.
                | 
                | You included 'world ending' in your definition of risk;
                | others have a lower bar than that.
        
       | kristjank wrote:
       | Yo dawg, we heard you like transformers so we put transformers on
       | your transformers so you can train while you train. The spider
        | web graph shows metatransformers performing worse than their
       | counterparts in almost all fields. Is there a reason I should not
       | believe that an expert model will always outperform a general
       | purpose one, even if it's a metatransformer?
        
         | sebzim4500 wrote:
         | >Is there a reason I should not believe that an expert model
         | will always outperform a general purpose one, even if it's a
         | metatransformer?
         | 
         | If a general purpose model beats the specialized one, you could
         | almost certainly distill the general purpose one into a better
         | specialized one.
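          | 
          | A minimal sketch of what that distillation could look like
          | (hypothetical PyTorch-style code; the teacher/student setup
          | here is an assumption, not something from the paper):
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     def distill_step(teacher, student, batch, T=2.0):
          |         # The general-purpose teacher provides soft targets.
          |         with torch.no_grad():
          |             t_logits = teacher(batch)
          |         s_logits = student(batch)
          |         # KL divergence between temperature-softened
          |         # distributions, scaled by T^2 as in standard
          |         # knowledge distillation.
          |         return F.kl_div(
          |             F.log_softmax(s_logits / T, dim=-1),
          |             F.softmax(t_logits / T, dim=-1),
          |             reduction="batchmean",
          |         ) * (T * T)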
        
         | throwawayadvsec wrote:
         | I'm pretty sure it's a relatively small model?
         | 
         | If you had the same quantity of text data as GPT-4 + comparable
         | quantity of data for other domains, it could probably learn
         | transferable skills across those domains.
         | 
         | But it would take a huge amount of processing power that is
         | probably not attainable today
        
         | nh23423fefe wrote:
         | performance is bounded and so outperformance will approach
          | epsilon?
        
         | danielbln wrote:
         | I mean, there is a somewhat unique value proposition of a
         | multimodal framework like this meta transfirmer. Its goal isn't
         | necessarily to beat expert models in their own game, but to
         | provide a unified framework for processing diverse modalities
         | of data.
         | 
         | I think it aims to leverage the cross-modal relationships and
         | unified learning, which might not be possible with expert
         | models designed for only a single modality.
         | 
         | Even if it performs slightly worse on some tasks, the ability
         | to handle multiple modalities within a single framework is an
         | pretty sweet advantage in scenarios where data from various
         | sources need to be processed simultaneously, and patterns
         | across modalities need to be captured somehow.
         | 
         | A general-purpose model could also be a more cost-effective
         | solution in some cases, ensemble experts are difficult to scale
         | and parallelize.
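          | 
          | As a rough sketch of that idea (made-up module names, not the
          | paper's actual code): per-modality tokenizers map everything
          | into one token space, and a single shared encoder handles all
          | of it.
          | 
          |     import torch.nn as nn
          | 
          |     class UnifiedMultimodalModel(nn.Module):
          |         def __init__(self, dim=768, n_classes=1000):
          |             super().__init__()
          |             # Modality-specific tokenizers project raw feature
          |             # sequences into a shared embedding space.
          |             self.tokenizers = nn.ModuleDict({
          |                 "image": nn.LazyLinear(dim),
          |                 "audio": nn.LazyLinear(dim),
          |                 "text":  nn.LazyLinear(dim),
          |             })
          |             # One shared encoder serves every modality.
          |             layer = nn.TransformerEncoderLayer(
          |                 d_model=dim, nhead=12, batch_first=True)
          |             self.encoder = nn.TransformerEncoder(
          |                 layer, num_layers=12)
          |             self.head = nn.Linear(dim, n_classes)
          | 
          |         def forward(self, x, modality):
          |             # x: (batch, seq_len, raw_feature_dim) for the
          |             # given modality.
          |             tokens = self.tokenizers[modality](x)
          |             feats = self.encoder(tokens)
          |             return self.head(feats.mean(dim=1))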
        
         | bick_nyers wrote:
         | Yo dawg, we just need to figure out what x converges to as you
         | apply transformer() infinite times and then finally attention
         | will no longer be all you need:
         | 
         | transformer(transformer(transformer( ... x ... ))) = ?
        
         | AndrewKemendo wrote:
         | >an expert model will always outperform a general purpose one,
         | even if it's a metatransformer
         | 
          | It's an interesting question, as it raises questions of
          | conceptual "boundaries."
         | 
         | The sense-plan-do process requires a search and filter process
         | for task switching, assuming an agent can do more than one
         | thing.
         | 
          | So assume you have a robotic/autonomous agent that is a
          | collection of systems (locomotion, dexterous gripper, visual
          | perception, etc.), and that each system can be represented as
          | an "expert module", say the dexterous manipulator. So long as
          | a discriminator can appropriately switch states using the
          | sensor/system inputs, it's conceptually possible that there is
          | a canonical "expert module" that everyone uses; "general
          | purpose" would then apply to the agent as a whole, while
          | "expert model" would apply to the dexterous manipulator.
         | 
          | You can then walk that reasoning up the abstraction layers to
          | conclude that (as usual with these turtle stacks) the
          | distinctions emerge as each subsystem/module specializes more
          | granularly for the environment it operates in.
         | 
          | I think it's probably forever and always true that any system
          | designed to explore/exploit a bounded environment with
          | comprehensive observations will always outperform a system
          | that is required to adapt its sense-plan-do components to the
          | bounded environment without similar observations.
         | 
         | A system would either have to generate different observations
         | than the native agent, or change the boundaries of the
         | environment in a way that is unavailable to the native agent in
         | order to outperform it.
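          | 
          | In code terms, the switching idea might look roughly like this
          | (purely illustrative; every name here is made up):
          | 
          |     from typing import Callable, Dict
          | 
          |     def act(observation: dict,
          |             experts: Dict[str, Callable[[dict], object]],
          |             discriminate: Callable[[dict], str]) -> object:
          |         # The discriminator picks which expert module should
          |         # be active given the current sensor/system inputs ...
          |         active = discriminate(observation)
          |         # ... and that expert handles its slice of
          |         # sense-plan-do.
          |         return experts[active](observation)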
        
         | dorkusBdork wrote:
         | [dead]
        
       | Oras wrote:
        | According to the website, the model can then be fine-tuned for
        | certain tasks such as image classification.
       | 
        | 1. How does the multimodal training help here in improving the
        | accuracy of image classification when training combines text,
        | images, and audio?
        | 
        | 2. How about the speed? I would imagine a model trained on text,
        | audio and image data would be larger than text-only models?
        
       | orwin wrote:
        | Yeah, that's where I thought it would go shortly after I tried
        | GPT-4 from OpenAI. We're clearly at the transformer limits imho
        | (comparing the effectiveness of 3.5 and 4 against the number of
        | parameters in each model is why I think we've reached a soft
        | cap).
       | 
       | So since it'll be hard to go deeper, going broader by interlacing
       | different model types might be a way to pierce through.
        
         | whimsicalism wrote:
         | > We're clearly at the transformer limits imho
         | 
            | GPT-4 did not scale up substantially in depth, going from
            | 175B to 220B parameters per transformer.
        
           | CSMastermind wrote:
           | Wouldn't making the model multimodal require scaling the
           | models significantly?
           | 
           | Or is the idea to keep the network the same size and trade
           | off some of its nodes for image, video, etc. data?
           | 
           | If so has anyone shown that doing so results in better
           | overall performance?
           | 
            | My lay observation is that GPT-4 seems to be on the border
            | of usability for most applications, so if nothing is gained
            | by simply changing the input data type as opposed to
            | expanding the model, then it feels like it won't be of much
            | use yet.
            | 
            | Also, apologies if I'm not making sense; I'm almost
            | certainly not using the correct technical terms to
            | articulate what I'm thinking.
        
             | whimsicalism wrote:
             | > Wouldn't making the model multimodal require scaling the
             | models significantly?
             | 
             | Just width if that makes sense. Basically, you add another
             | encoder model but you are not actually increasing the depth
             | that much.
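              | 
              | A rough illustration of "adding width" (hypothetical
              | code; it assumes the wrapped language model accepts
              | embeddings directly):
              | 
              |     import torch
              |     import torch.nn as nn
              | 
              |     class WidthwiseMultimodal(nn.Module):
              |         def __init__(self, language_model, d_model=4096):
              |             super().__init__()
              |             self.lm = language_model   # depth unchanged
              |             # New image encoder bolted on in parallel.
              |             self.image_encoder = nn.Conv2d(
              |                 3, 256, kernel_size=16, stride=16)
              |             # Project patches into the LM embedding space.
              |             self.proj = nn.Linear(256, d_model)
              | 
              |         def forward(self, image, text_embeds):
              |             # (B, 3, H, W) -> (B, N, 256) patch tokens
              |             patches = self.image_encoder(image)
              |             patches = patches.flatten(2).transpose(1, 2)
              |             img_tokens = self.proj(patches)
              |             # Prepend image tokens to the embedded text
              |             # and run the same-depth model over both.
              |             return self.lm(
              |                 torch.cat([img_tokens, text_embeds], dim=1))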
        
       | FrustratedMonky wrote:
        | Just a few more steps like this, put it in a robot body, and
        | voila, we have the start of the first AI wars. How many
        | centuries after this does the Butlerian Jihad start, led by John
        | Connor, of course?
        
       | ccheney wrote:
       | We need to start ingesting raw scientific data through these
       | models and see what it comes up with. What could these models
       | identify by parsing through raw JWST or Hubble data? Or training
       | against every published scientific paper? Is anyone doing this
       | sort of thing already?
        
         | danielbln wrote:
         | Meta's Galactica was an attempt to train an LLM predominantly
         | on scientific papers, articles and so on. It failed pretty
          | spectacularly, but Galactica 2, if that's ever a thing, might
         | rectify that.
        
           | RC_ITR wrote:
           | GP likely means training transformers on raw data (similar to
           | protein folding transformers) to find patterns that humans
           | cannot (due to lack of context, bias, or whatever).
           | 
            | The problem with that assumption, though, is that
            | transformers are good at identifying and replicating
            | patterns given a set of rules (e.g. how proteins fold and
            | misfold depending on the environment).
           | 
            | Hubble data isn't so much "we know the rules but not their
            | interactions" as "we don't really know the full set of
            | rules," so that particular example probably wouldn't be that
            | fruitful.
           | 
           | In general, biology (where we understand the basic rules but
           | not the complex ways they are combined) is the most fertile
            | ground for transformer-driven research.
        
       ___________________________________________________________________
       (page generated 2023-07-24 23:01 UTC)