[HN Gopher] LLaVA-1.6: Improved reasoning, OCR, and world knowledge
___________________________________________________________________
LLaVA-1.6: Improved reasoning, OCR, and world knowledge
Author : tosh
Score : 28 points
Date : 2024-01-31 17:13 UTC (5 hours ago)
(HTM) web link (llava-vl.github.io)
(TXT) w3m dump (llava-vl.github.io)
| benopal64 wrote:
| Wow! You folks are making huge strides for open-source multimodal
| models. Thank you for all the time and effort on these as they
| will open up many opportunities for researchers and developers.
 | Also, the emergent zero-shot capability LLaVA-1.6 shows on
 | Chinese benchmarks, despite having only English multimodal
 | training data, is interesting and may be a good direction for
 | future research.
| fngjdflmdflg wrote:
| To me this is the money shot:
|
| >LLaVA-1.6 is trained with 32 GPUs for ~1 day, with 1.3M data
| samples in total. The compute / training data cost is 100-1000
| times smaller than others.
| chx wrote:
| There's no reasoning involved with LLMs. Please. Words have
| meaning.
| GaggiX wrote:
| Demo: https://llava.hliu.cc/
|
 | My main interest in VLMs is their ability to caption images, and
 | this one honestly seems very good; it is going to be super
 | useful for captioning datasets.
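 |
 | A minimal sketch of the kind of captioning call I have in mind,
 | via the Hugging Face llava-hf port of the Mistral-7B variant
 | (the checkpoint name and prompt template are my assumptions,
 | not from the post):
 |
 |     # pip install transformers torch pillow
 |     import torch
 |     from PIL import Image
 |     from transformers import (LlavaNextProcessor,
 |                               LlavaNextForConditionalGeneration)
 |
 |     model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed repo
 |     processor = LlavaNextProcessor.from_pretrained(model_id)
 |     model = LlavaNextForConditionalGeneration.from_pretrained(
 |         model_id, torch_dtype=torch.float16, device_map="auto")
 |
 |     image = Image.open("your_image.jpg")  # any local image
 |     # Mistral-style chat prompt with an <image> placeholder
 |     prompt = ("[INST] <image>\nWrite one detailed caption for "
 |               "this image. [/INST]")
 |     inputs = processor(text=prompt, images=image,
 |                        return_tensors="pt").to(model.device)
 |     out = model.generate(**inputs, max_new_tokens=128)
 |     print(processor.decode(out[0], skip_special_tokens=True))
 |
 | Batch that over a directory and you have a cheap dataset
 | captioner.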
 | mildbyte wrote:
 | Damn, literally a day after I wrote up my experiments[0] with
 | LLaVA 1.5 and computing image embeddings. Interesting to see the
 | fine-tuned Mistral-7B variant performing pretty close to the
 | Vicuna-13B one - using Mistral 7B is what BakLLaVA did back with
 | LLaVA 1.5.
|
| [0] https://mildbyte.xyz/blog/llama-cpp-python-llava-gpu-
| embeddi...
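 |
 | For anyone curious, LLaVA 1.5's image side is the CLIP
 | ViT-L/14-336 tower, so a rough transformers-only sketch of the
 | embedding step looks like this (mean pooling over patch tokens
 | is my own simplification, not necessarily what [0] does):
 |
 |     # pip install transformers torch pillow
 |     import torch
 |     from PIL import Image
 |     from transformers import CLIPVisionModel, CLIPImageProcessor
 |
 |     clip_id = "openai/clip-vit-large-patch14-336"  # LLaVA's tower
 |     processor = CLIPImageProcessor.from_pretrained(clip_id)
 |     vision = CLIPVisionModel.from_pretrained(clip_id).eval()
 |
 |     image = Image.open("your_image.jpg")
 |     batch = processor(images=image, return_tensors="pt")
 |     with torch.no_grad():
 |         out = vision(batch.pixel_values)
 |     # Drop the CLS token, mean-pool the 576 patch embeddings
 |     embedding = out.last_hidden_state[:, 1:, :].mean(dim=1)
 |     print(embedding.shape)  # torch.Size([1, 1024])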
___________________________________________________________________
(page generated 2024-01-31 23:00 UTC)