[HN Gopher] LLaVA-1.6: Improved reasoning, OCR, and world knowledge
       ___________________________________________________________________
        
       LLaVA-1.6: Improved reasoning, OCR, and world knowledge
        
       Author : tosh
       Score  : 28 points
       Date   : 2024-01-31 17:13 UTC (5 hours ago)
        
 (HTM) web link (llava-vl.github.io)
 (TXT) w3m dump (llava-vl.github.io)
        
       | benopal64 wrote:
       | Wow! You folks are making huge strides for open-source multimodal
       | models. Thank you for all the time and effort on these as they
       | will open up many opportunities for researchers and developers.
       | Also, the emergent zero-shot capabilities LLaVA-1.6 shows on
       | Chinese benchmarks, with only English multimodal training data,
       | are interesting. That may be a good direction for future
       | research.
        
       | fngjdflmdflg wrote:
       | To me this is the money shot:
       | 
       | >LLaVA-1.6 is trained with 32 GPUs for ~1 day, with 1.3M data
       | samples in total. The compute / training data cost is 100-1000
       | times smaller than others.
        
       | chx wrote:
       | There's no reasoning involved with LLMs. Please. Words have
       | meaning.
        
       | GaggiX wrote:
       | Demo: https://llava.hliu.cc/
       | 
       | My main interest with VLMs is their ability to caption images, and
       | this one honestly seems very good. It is going to be super useful
       | for captioning datasets.
        
       | mildbyte wrote:
       | Damn, literally a day after I wrote up my experiments[0] with
       | LLaVA 1.5 and computing image embeddings. Interesting to see the
       | fine-tuned Mistral-7B variant performing pretty close to the
       | Vicuna-13B one. Using Mistral 7B is what BakLLaVA did back with
       | LLaVA 1.5.
       | 
       | [0] https://mildbyte.xyz/blog/llama-cpp-python-llava-gpu-embeddi...
        
       ___________________________________________________________________
       (page generated 2024-01-31 23:00 UTC)