[HN Gopher] TinyGPT-V: Efficient Multimodal Large Language Model...
       ___________________________________________________________________
        
       TinyGPT-V: Efficient Multimodal Large Language Model via Small
       Backbones
        
       Author : T-A
       Score  : 49 points
       Date   : 2024-01-03 20:53 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | T-A wrote:
       | From the paper's abstract [1]:
       | 
       |  _It stands out by requiring merely a 24G GPU for training and an
       | 8G GPU or CPU for inference. Built upon Phi-2, TinyGPT-V couples
       | an effective language backbone with pre-trained vision modules
        | from BLIP-2 or CLIP. TinyGPT-V's 2.8B parameters can undergo a
       | unique quantisation process, suitable for local deployment and
       | inference tasks on 8G various devices._
       | 
       | [1] https://arxiv.org/abs/2312.16862
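        | 
        | A rough sketch of how those pieces might be wired together with
        | off-the-shelf Hugging Face checkpoints (not the authors' code;
        | the checkpoint names and the single linear projection are only
        | illustrative of the BLIP-2/MiniGPT-style design the abstract
        | describes):
        | 
        |   import torch
        |   from transformers import AutoModelForCausalLM
        |   from transformers import CLIPVisionModel
        | 
        |   # Illustrative only: a pre-trained CLIP vision encoder feeds
        |   # a small language backbone (Phi-2) through a linear layer.
        |   vision = CLIPVisionModel.from_pretrained(
        |       "openai/clip-vit-large-patch14")
        |   llm = AutoModelForCausalLM.from_pretrained(
        |       "microsoft/phi-2", torch_dtype=torch.float16)
        | 
        |   # Map image patch features into the LLM's embedding space.
        |   proj = torch.nn.Linear(vision.config.hidden_size,
        |                          llm.config.hidden_size)
        | 
        |   def image_to_llm_embeds(pixel_values):
        |       # pixel_values: image tensor from a CLIP image processor
        |       patch_feats = vision(pixel_values).last_hidden_state
        |       return proj(patch_feats)  # prepended to text embeddings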
        
       | ilaksh wrote:
       | Their results seem comparable to BLIP-2, shifted over in the
       | diagram.
        
       | samstave wrote:
        | I really want to understand this post, but I can't. Could you
        | please point me, and other noobs like me, to resources for
        | learning to read this? (Help anyone climb the knowledge ladder.)
        | ELI3
       | 
       | -
       | 
       | EDIT - GPT helped me understand better:
       | 
       | --
       | 
        | >>> _" This model is special because it can do similar tasks as
        | the big models but requires much less computational power. It's
        | like having a small but powerful engine that can do the work of a
        | big one. This makes it more accessible for more people to use
        | it."_
       | 
       | ---
       | 
        | >>> _" TinyGPT-V is built on another model called Phi-2 and uses
        | pre-trained vision modules from BLIP-2 or CLIP. It has 2.8
        | billion parameters (these are like the model's brain cells) and
        | can be further compressed to fit on devices with 8GB memory.
        | This means you could potentially run this model on your personal
        | computer or even some high-end smartphones."_
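        | 
        | (For what it's worth, the "fit on devices with 8GB memory" part
        | usually comes down to quantised loading. A minimal sketch using
        | transformers + bitsandbytes, with plain Phi-2 standing in for
        | the TinyGPT-V weights, so the checkpoint name is a placeholder:)
        | 
        |   import torch
        |   from transformers import AutoModelForCausalLM
        |   from transformers import BitsAndBytesConfig
        | 
        |   # 4-bit weights keep a ~2.8B-parameter model well under 8 GB.
        |   bnb = BitsAndBytesConfig(
        |       load_in_4bit=True,
        |       bnb_4bit_compute_dtype=torch.float16)
        |   model = AutoModelForCausalLM.from_pretrained(
        |       "microsoft/phi-2",  # placeholder checkpoint
        |       quantization_config=bnb,
        |       device_map="auto")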
       | 
       | ----
       | 
        | >>> _" In summary, TinyGPT-V is a step towards making powerful AI
        | models more accessible and efficient, which could lead to their
        | use in a wide range of real-world applications. The authors have
        | also shared their code and training weights for others to use and
        | learn from."_
       | 
       | -----
       | 
        | This gets really interesting if you fan out the implications
        | over time.
       | 
       | Here is my thinking:
       | 
        | Assume this paper results in a way of getting
        | "compression-alyzed vision" into a model (a tiny compressed view
        | into a model).
       | 
        | Then, in a few years, one can imagine "laser views" that slice
        | through fractals of models to find the result: tiny agents with
        | a heat-seeking fractal laser that can navigate giant data by
        | knowing instantaneously what to _exclude_ (meaning the path is
        | defined by the walls you already know you do not want to hit, so
        | every step is one that moves you forward).
       | 
       | --
       | 
       | Or am I stating something obvious to all you brainiacs?
       | 
       | (no shame, I like thinking out loud)
        
       | justinl33 wrote:
       | > _You need to execute the above code 17 times to complete the
       | first stage of training._
       | 
       | Am I missing something here? Did the authors forget about for
       | loops? What happens if you only do it 16 times?
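        | 
        | A loop would seem to do the same thing. A minimal sketch,
        | assuming stage 1 is launched with something like "python
        | train.py --cfg-path train_configs/stage1.yaml" (the script name
        | and config path here are placeholders, not taken from the repo):
        | 
        |   import subprocess
        | 
        |   # Placeholder launch command for one stage-1 run; substitute
        |   # the repo's actual script and config path.
        |   CMD = ["python", "train.py",
        |          "--cfg-path", "train_configs/stage1.yaml"]
        | 
        |   # The README says to run it 17 times, so repeat it in a loop
        |   # and stop early if any run fails.
        |   for i in range(17):
        |       print(f"stage-1 run {i + 1}/17")
        |       subprocess.run(CMD, check=True)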
        
       ___________________________________________________________________
       (page generated 2024-01-03 23:00 UTC)