[HN Gopher] CatLIP: CLIP Vision Accuracy with 2.7x Faster Pre-Tr...
___________________________________________________________________
CatLIP: CLIP Vision Accuracy with 2.7x Faster Pre-Training on Web-
Scale Data
Author : panabee
Score : 36 points
Date : 2024-04-25 17:46 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| ggnore7452 wrote:
| Question: any good on-device-sized image embedding models?
|
| I tried https://github.com/unum-cloud/uform, which I do like,
| especially since it also supports languages other than English.
| Any recommendations for other alternatives?
| philipkglass wrote:
| I have successfully used OpenCLIP models for embedding and
| similar-image search. The smallest model listed on that UForm
| page is 79 million parameters, so I presume that you can use
| other models of similar size. There are a few OpenCLIP models
| with 80 million or fewer parameters listed here:
|
| https://github.com/mlfoundations/open_clip/blob/main/docs/mo...
|
| When embeddings are quantized to int8, they still work very well
| for similarity search (no differences in the top-10 results on my
| test set). I haven't tried quantizing the models themselves.
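A minimal sketch of the workflow described above, for concreteness:
embed images with a small OpenCLIP model, int8-quantize the embeddings,
and check that top-10 similarity search is unchanged. The model tag,
file names, and per-vector quantization scheme are illustrative
assumptions, not the commenter's exact setup.

    # Sketch (assumptions noted above): small OpenCLIP model, a few local
    # images, symmetric per-vector int8 quantization, cosine similarity.
    import numpy as np
    import torch
    import open_clip
    from PIL import Image

    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")  # illustrative choice
    model.eval()

    image_paths = ["img0.jpg", "img1.jpg", "img2.jpg"]  # hypothetical files

    def embed(paths):
        """Unit-normalized float32 image embeddings for a list of files."""
        batch = torch.stack(
            [preprocess(Image.open(p).convert("RGB")) for p in paths])
        with torch.no_grad():
            feats = model.encode_image(batch)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return feats.cpu().numpy().astype(np.float32)

    def quantize_int8(x):
        """Symmetric per-vector int8 quantization; returns codes and scales."""
        scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
        return np.round(x / scale).astype(np.int8), scale

    def top10(query_vec, db):
        """Indices of the 10 most similar vectors (dot product = cosine here)."""
        return np.argsort(-(db @ query_vec))[:10]

    db = embed(image_paths)
    codes, scale = quantize_int8(db)
    db_int8 = codes.astype(np.float32) * scale        # dequantize for comparison
    print(top10(db[0], db))                           # float32 neighbours
    print(top10(db_int8[0], db_int8))                 # often identical top-10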
| cs702 wrote:
| TL;DR: The authors pretrain the model to classify images into the
| WordNet synsets[a] that appear in each caption, using a standard
| cross-entropy loss. They keep the number of classes relatively
| small by removing any synset that appears fewer than 500 times in
| the dataset's captions. It seems to work well.
|
| My immediate question: Why not classify over the entire hierarchy
| of WordNet synsets?
|
| ---
|
| [a] https://wordnet.princeton.edu/
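For concreteness, a rough sketch of the labeling scheme the TL;DR above
describes: map caption words to WordNet synsets, drop synsets that
appear fewer than 500 times, and build multi-hot targets for a
classification loss. The nltk-based lookup, the toy captions, and the
multi-label formulation are illustrative assumptions; the paper's exact
extraction and loss details may differ.

    # Sketch (assumptions noted above): caption -> WordNet noun synsets
    # -> multi-hot classification target.
    from collections import Counter
    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)

    captions = [
        "a cat sitting on a wooden chair",
        "two dogs playing in a park",
    ]  # hypothetical captions; the real dataset is web-scale

    MIN_COUNT = 500  # threshold from the comment; a toy corpus needs less

    def caption_synsets(caption):
        """Map each word with a noun sense to its first WordNet noun synset."""
        found = set()
        for word in caption.lower().split():
            senses = wn.synsets(word, pos=wn.NOUN)
            if senses:
                found.add(senses[0].name())   # e.g. 'cat.n.01'
        return found

    # Build the class vocabulary: keep only synsets seen often enough.
    counts = Counter(s for c in captions for s in caption_synsets(c))
    vocab = sorted(s for s, n in counts.items() if n >= MIN_COUNT)
    index = {s: i for i, s in enumerate(vocab)}

    def multi_hot(caption):
        """Multi-hot target over the retained synsets."""
        target = [0.0] * len(vocab)
        for s in caption_synsets(caption):
            if s in index:
                target[index[s]] = 1.0
        return target

    # Each image would then be trained to predict multi_hot(its caption),
    # e.g. with a multi-label (binary) cross-entropy objective.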
___________________________________________________________________
(page generated 2024-04-25 23:01 UTC)