[HN Gopher] TinyGPT-V: Efficient Multimodal Large Language Model...
___________________________________________________________________
TinyGPT-V: Efficient Multimodal Large Language Model via Small
Backbones
Author : T-A
Score : 49 points
Date : 2024-01-03 20:53 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| T-A wrote:
| From the paper's abstract [1]:
|
| _It stands out by requiring merely a 24G GPU for training and an
| 8G GPU or CPU for inference. Built upon Phi-2, TinyGPT-V couples
| an effective language backbone with pre-trained vision modules
| from BLIP-2 or CLIP. TinyGPT-V's 2.8B parameters can undergo a
| unique quantisation process, suitable for local deployment and
| inference tasks on 8G various devices._
|
| [1] https://arxiv.org/abs/2312.16862
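|
| Not the authors' code, but as a rough sketch of what that
| footprint means in practice: loading just the Phi-2 language
| backbone in 8-bit via transformers + bitsandbytes (model ID and
| settings here are my own example, not from the paper) already
| fits comfortably under 8G:
|
|     from transformers import (AutoModelForCausalLM, AutoTokenizer,
|                               BitsAndBytesConfig)
|
|     # 8-bit weights: ~2.8B params is roughly 3 GB instead of
|     # ~5.6 GB in fp16 (ignoring activations and KV cache).
|     quant = BitsAndBytesConfig(load_in_8bit=True)
|     tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
|     model = AutoModelForCausalLM.from_pretrained(
|         "microsoft/phi-2",
|         quantization_config=quant,
|         device_map="auto",  # needs accelerate installed
|     )
|
|     # Text-only here; TinyGPT-V additionally feeds in projected
|     # BLIP-2/CLIP vision features.
|     inputs = tok("Describe the scene:", return_tensors="pt")
|     inputs = inputs.to(model.device)
|     out = model.generate(**inputs, max_new_tokens=32)
|     print(tok.decode(out[0], skip_special_tokens=True))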
| ilaksh wrote:
| Their results seem comparable to BLIP-2, shifted over in the
| diagram.
| samstave wrote:
| I really want to understand this post, but I can't. Could you
| please direct me, and noobs like me, to resources for learning to
| read this? (Help anyone climb the knowledge ladder.)
| ELI3
|
| -
|
| EDIT - GPT helped me understand better:
|
| --
|
| >>> " _This model is special because it can do similar tasks as
| the big models but requires much less computational power1. It's
| like having a small but powerful engine that can do the work of a
| big one. This makes it more accessible for more people to use it
| "_
|
| ---
|
| >>> _" TinyGPT-V is built on another model called Phi-2 and uses
| pre-trained vision modules from BLIP-2 or CLIP1. It has 2.8
| billion parameters (these are like the model's brain cells) and
| can be further compressed to fit on devices with 8GB memory1.
| This means you could potentially run this model on your personal
| computer or even some high-end smartphones1"_
|
| ----
|
| >>> _" In summary, TinyGPT-V is a step towards making powerful AI
| models more accessible and efficient, which could lead to their
| use in a wide range of real-world applications1. The authors have
| also shared their code and training weights for others to use and
| learn from1"_
|
| -----
|
| This is really interesting if you fan out the implications over
| time.
|
| Here is my thinking:
|
| Assume this paper results in a way of "compression-alyzing"
| vision into a model (a tiny compressed view into a model).
|
| Then, in a few years, one can imagine "laser views" that slice
| through fractals of models to find the result: tiny agents with a
| heat-seeking-fractal-laser that can navigate giant data by
| knowing instantaneously what to _exclude_ (the path is defined by
| the walls you already know you do not want to hit, so every step
| is one that moves you forward).
|
| --
|
| Or am I stating something obvious to all you brainiacs?
|
| (no shame, I like thinking out loud)
| justinl33 wrote:
| > _You need to execute the above code 17 times to complete the
| first stage of training._
|
| Am I missing something here? Did the authors forget about for
| loops? What happens if you only do it 16 times?
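|
| Something like this seems like the obvious thing (placeholder
| script name, not the repo's actual command):
|
|     import subprocess
|
|     # Substitute the repo's real stage-1 training invocation for
|     # this placeholder command.
|     STAGE1_CMD = ["python", "train.py", "--cfg-path", "stage1.yaml"]
|
|     for i in range(17):  # the README asks for 17 repetitions
|         print(f"stage-1 pass {i + 1}/17")
|         subprocess.run(STAGE1_CMD, check=True)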
___________________________________________________________________
(page generated 2024-01-03 23:00 UTC)