[HN Gopher] MetaCLIP - Meta AI Research
___________________________________________________________________
MetaCLIP - Meta AI Research
Author : zerojames
Score : 134 points
Date : 2023-10-26 09:36 UTC (13 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| dpflan wrote:
| No discussion after 4 hours of existence, wondering if this is
| leaving people speechless or not... ;)
|
| CLIP is a very interesting development in AI these days, so
| demystifying it is a great idea. Is anyone using CLIP or
| similar models daily who will find this research useful -- and
| who is willing to discuss it? I'm curious what you're doing.
| zerojames wrote:
| I just posted a comment :D
|
| I work for a computer vision company. I use CLIP almost every
| day. Example use cases for which I have used CLIP (a minimal
| zero-shot labeling sketch follows the list):
|
| - Image classification
| - Automated labeling for classification models
| - Image clustering
| - Gathering images for model training that are sufficiently
|   dissimilar from existing samples
| - Content moderation
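|
| A minimal sketch of the zero-shot labeling case, using the
| Hugging Face transformers API (the checkpoint, image path, and
| label set here are placeholders, not from my actual pipeline):
|
|     import torch
|     from PIL import Image
|     from transformers import CLIPModel, CLIPProcessor
|
|     model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|     processor = CLIPProcessor.from_pretrained(
|         "openai/clip-vit-base-patch32")
|
|     # Candidate labels phrased as short captions.
|     labels = ["a photo of a cat", "a photo of a dog",
|               "a photo of a car"]
|     image = Image.open("image.png")
|
|     inputs = processor(text=labels, images=image,
|                        return_tensors="pt", padding=True)
|     with torch.no_grad():
|         outputs = model(**inputs)
|
|     # logits_per_image: similarity of the image to each label.
|     probs = outputs.logits_per_image.softmax(dim=-1)[0]
|     print({l: round(p.item(), 3) for l, p in zip(labels, probs)})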
|
| CLIP is also being used widely in new research. SAM-CLIP
| (https://arxiv.org/abs/2310.15308), shared with me by my
| manager today, uses CLIP and knowledge distillation to train a
| new model. I have seen references to CLIP throughout multimodal
| LLM papers, too, although my knowledge of multimodal model
| architectures is nascent.
| dpflan wrote:
| Very cool. I used CLIP and VQGAN in a grad school project ~2
| years ago, when StyleGAN, StyleCLIP, and similar projects were
| emerging for controlled image manipulation.
| chankstein38 wrote:
| I have a noob-to-CLIP question. When I've tried to use it to
| auto-caption photos and the like, the result has been 4-5
| words that only vaguely describe the image -- honestly it's
| usually something like "A woman holding a pencil", sometimes
| "A woman woman holding a pencil", or just "A woman".
|
| Do different models do better or worse at this? Is this just
| untuned outputs? Like are there parameters I should be
| tweaking? Sorry I'm not able to give too much more detail.
| I'm mostly using it within A1111's "Interrogate CLIP" option
| but I've tried using a model I found on replicate as well as
| installing locally. Same results every time.
|
| It seems vaguely useful but like it misses the mark a lot of
| the time. I'm assuming I'm doing something wrong.
| Philpax wrote:
| iirc "Interrogate CLIP" is a bit of a misnomer - what it's
| actually doing is generating a basic caption with BLIP ("a
| woman holding a pencil"), then iterating over categories
| and checking with CLIP if any items in those categories are
| depicted in that image, then concatenating any hits to the
| resulting caption.
|
| This means the resulting caption is of the form "[BLIP
| caption], [category1 item], [category2 item], ...". It's
| very rudimentary.
|
| To clarify: CLIP can tell you if a text label matches an
| image. It can't generate a caption by itself.
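|
| A rough sketch of that flow -- not the actual A1111 code; the
| checkpoints, categories, and the 0.5 threshold here are
| illustrative assumptions:
|
|     import torch
|     from PIL import Image
|     from transformers import (BlipForConditionalGeneration,
|                               BlipProcessor, CLIPModel,
|                               CLIPProcessor)
|
|     image = Image.open("photo.jpg")
|
|     # 1) Base caption from BLIP.
|     blip_proc = BlipProcessor.from_pretrained(
|         "Salesforce/blip-image-captioning-base")
|     blip = BlipForConditionalGeneration.from_pretrained(
|         "Salesforce/blip-image-captioning-base")
|     out = blip.generate(**blip_proc(image, return_tensors="pt"))
|     caption = blip_proc.decode(out[0], skip_special_tokens=True)
|
|     # 2) Check each category with CLIP; append strong hits.
|     clip_proc = CLIPProcessor.from_pretrained(
|         "openai/clip-vit-base-patch32")
|     clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|     categories = {
|         "medium": ["oil painting", "photograph", "3d render"],
|         "style": ["impressionism", "anime", "realism"],
|     }
|     hits = []
|     for items in categories.values():
|         inputs = clip_proc(text=items, images=image,
|                            return_tensors="pt", padding=True)
|         with torch.no_grad():
|             logits = clip(**inputs).logits_per_image
|         probs = logits.softmax(dim=-1)[0]
|         if probs.max() > 0.5:
|             hits.append(items[int(probs.argmax())])
|
|     # "[BLIP caption], [category1 item], [category2 item], ..."
|     print(", ".join([caption] + hits))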
|
| There are more advanced captioning methods, but I'm not
| sure if they're exposed in A1111 (I haven't used it in some
| months).
| chankstein38 wrote:
| Thank you for this! I've always been confused about BLIP
| vs CLIP. That makes a lot of sense and explains the weird
| duplication of a noun I see sometimes, "A woman woman" and
| things like that.
| Filligree wrote:
| For a task like that, I'd recommend LLaVA instead. It's
| still inaccurate, but it's a _great_ deal more accurate
| than the other options I've tried. It also works with
| llama.cpp.
|
| LLaVA is a multimodal language model you ask questions of.
| If you don't provide a question, then the default is
| "Describe this picture in detail". But if you have a
| concrete question, you're likely to get better results. You
| can also specify the output format, which often works.
|
| (Make sure to use --temp 0.1, the default is far too high.)
|
| It runs very slowly on CPU, but will eventually give you an
| answer. If you have more than about four or five pictures to
| caption, you probably want to put as many layers as possible
| on the GPU. This requires specific compilation options for
| CUDA; on an M1/M2 it works by default, but still needs to be
| turned on at runtime (-ngl 9999).
| simonw wrote:
| I suggest trying BLIP for this. I've had really good
| results from that.
|
| https://github.com/salesforce/BLIP
|
| I built a tiny Python CLI wrapper for it to make it easier
| to try: https://github.com/simonw/blip-caption
| captaincaveman wrote:
| Well, as someone not already familiar with it, I've no idea
| what it is for; the GitHub readme didn't help.
| dpflan wrote:
| The readme from the linked repo? I think you'll really need
| to go to the source: OpenAI's CLIP:
| https://openai.com/research/clip
| zerojames wrote:
| https://blog.roboflow.com/openai-clip/ is a good high-level
| guide to what CLIP can do. (Disclosure: I work at Roboflow,
| but I did not author this piece.)
| ninja3925 wrote:
| I work for a large tech company and we use CLIP internally for
| a variety of things:
|
| - Image search (aka Google Images) to find relevant photos
|   given a text prompt. This is the biggest use case (a minimal
|   sketch follows below).
|
| - Automated labeling (to answer questions such as "How many
|   entities have attribute Y?").
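|
| A minimal sketch of the text-to-image search case (not our
| internal system; the paths and checkpoint are placeholders):
|
|     import torch
|     from PIL import Image
|     from transformers import CLIPModel, CLIPProcessor
|
|     model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|     proc = CLIPProcessor.from_pretrained(
|         "openai/clip-vit-base-patch32")
|
|     paths = ["a.jpg", "b.jpg", "c.jpg"]  # your photo collection
|     images = [Image.open(p) for p in paths]
|
|     with torch.no_grad():
|         img_emb = model.get_image_features(
|             **proc(images=images, return_tensors="pt"))
|         txt_emb = model.get_text_features(
|             **proc(text=["a large dog"], return_tensors="pt",
|                    padding=True))
|
|     # Cosine similarity: normalize, then dot product.
|     img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
|     txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
|     scores = (img_emb @ txt_emb.T).squeeze(1)
|     for i in scores.argsort(descending=True):
|         print(paths[i], float(scores[i]))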
|
| We didn't fine-tune CLIP, though. If folks have done it
| successfully, I would love to hear from them!
| hwsw wrote:
| We have ported CLIP to our Bottlenose camera. The results are
| very exciting and the possibilities are, for lack of a better
| term, endless. You can now tell the camera what to look for.
| For example, if it is used for manufacturing automation and
| the task is to detect whether any product is missing a label,
| our customers can use the natural language inputs "unlabelled
| product" and "labelled product". The system can now
| differentiate between the two and send results to a PLC.
| Previously this would have required deploying a new machine
| learning loop.
|
| We are generating embeddings on the camera and sending them out
| via chunk-data on the GigE Vision 2.1 protocol.
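|
| For readers who want the shape of the host side, here is a
| simplified sketch (not the actual product code; it assumes the
| camera streams embeddings from the same CLIP checkpoint used to
| embed the text prompts):
|
|     import numpy as np
|     import torch
|     from transformers import CLIPModel, CLIPProcessor
|
|     model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|     proc = CLIPProcessor.from_pretrained(
|         "openai/clip-vit-base-patch32")
|
|     # Embed the two prompts once, up front.
|     prompts = ["labelled product", "unlabelled product"]
|     with torch.no_grad():
|         t = model.get_text_features(
|             **proc(text=prompts, return_tensors="pt",
|                    padding=True))
|     t = (t / t.norm(dim=-1, keepdim=True)).numpy()
|
|     def classify(image_embedding: np.ndarray) -> str:
|         # image_embedding: one CLIP image vector received from
|         # the camera (e.g. via GigE Vision chunk data).
|         v = image_embedding / np.linalg.norm(image_embedding)
|         return prompts[int(np.argmax(t @ v))]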
| xml wrote:
| I found CLIP to be _amazing_ for all kinds of image search,
| like search-by-text or search-by-image. I even ported it to
| NumPy to understand it better. The whole thing is less than 500
| lines of Python (including blank lines and comments):
| https://github.com/99991/NumPyCLIP
| jsemrau wrote:
| DALL-E 3 is using CLIP for synthetic caption generation
| https://jdsemrau.substack.com/p/paper-review-dall-e-3
| zerojames wrote:
| I have been playing with MetaCLIP this afternoon and made
| https://github.com/autodistill/autodistill-metaclip as a pip
| installable version. The Facebook repository has some guidance
| but you have to pull the weights yourself, save them, etc.
|
| My inference function (model.predict("image.png")) returns an
| sv.Classifications object that you can load into supervision
| for processing (e.g. get top k) [1].
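|
| Rough usage looks like this -- the ontology import, class name,
| and labels here follow the usual autodistill pattern, so check
| the repo README [1] for the exact details:
|
|     from autodistill.detection import CaptionOntology
|     from autodistill_metaclip import MetaCLIP
|
|     # Map the prompts sent to MetaCLIP to the labels you want
|     # back.
|     model = MetaCLIP(ontology=CaptionOntology(
|         {"a photo of a cat": "cat", "a photo of a dog": "dog"}))
|
|     results = model.predict("image.png")  # an sv.Classifications
|     print(results.get_top_k(k=1))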
|
| The paper [2] notes the following in terms of performance:
|
| > In Table 4, we observe that MetaCLIP outperforms OpenAI CLIP on
| ImageNet and average accuracy across 26 tasks, for 3 model
| scales. With 400 million training data points on ViT-B/32,
| MetaCLIP outperforms CLIP by +2.1% on ImageNet and by +1.6% on
| average. On ViT-B/16, MetaCLIP outperforms CLIP by +2.5% on
| ImageNet and by +1.5% on average. On ViT-L/14, MetaCLIP
| outperforms CLIP by +0.7% on ImageNet and by +1.4% on average
| across the 26 tasks.
|
| [1] https://github.com/autodistill/autodistill-metaclip
| [2] https://arxiv.org/pdf/2309.16671.pdf
| ninja3925 wrote:
| CLIP is such a nice paradigm shift. Historically, CV was quite
| limited:
|
| - You could predict a class (from a static list such as [dog,
|   cat, ...]), or ...
|
| - You could use image embeddings disconnected from text (you
|   could find image look-alikes but not tell what they actually
|   represent).
|
| By embedding text and images in the same latent space, you can
| now query your images with a text query (such as "a large dog")
| and find the relevant photos. CLIP understands semantics but is
| also not limited to a fixed list of classes (thanks to its use
| of web data in training).
|
| For those interested in using CLIP, this is a list compiled by
| OpenCLIP of high-performance models (some better than
| MetaCLIP):
| https://github.com/mlfoundations/open_clip/blob/main/docs/op...
| gurkwart wrote:
| Very exciting. CLIP, and latent space embeddings in general,
| are such an intuitive and powerful tool. I'm using it in some
| hobby projects, from semantic image search in private
| collections to trading card recognition among tens of
| thousands of cards. Love to see more open source work from
| big players on this.
| sroussey wrote:
| How would one license this for commercial use?
| martincmartin wrote:
| So this can tell you what food is in an image? Like Shazam for
| food?
|
| https://play.google.com/store/apps/details?id=com.codylab.se...
| cma wrote:
| CLIP is a big part of generative AI models as well, like
| Stable Diffusion and DALL-E.
| javier2 wrote:
| Is this available under a commercial license like the CLIP
| model from OpenAI was?
| ternaus wrote:
| Love that there is a more accurate pre-trained CLIP model, as
| CLIP is a foundation for Stable Diffusion and many other very
| important open source models.
|
| But I would say that the main issue with CLIP is not
| performance, but that its textual input is limited to 77
| tokens.
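|
| A quick way to see the limit (77 is CLIP's text context length
| in tokens, including the start/end tokens; the prompt below is
| just an illustration):
|
|     from transformers import CLIPTokenizer
|
|     tok = CLIPTokenizer.from_pretrained(
|         "openai/clip-vit-base-patch32")
|     print(tok.model_max_length)  # 77
|
|     long_prompt = "a very detailed description of the scene " * 40
|     out = tok(long_prompt, truncation=True, max_length=77)
|     print(len(out["input_ids"]))  # 77 -- the rest is dropped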
|
| This is a severe limitation. If Meta or another company
| collected a dataset that allowed a model with a 1024-token
| text context instead, it would enrich the world of open source
| models much more than a +2% accuracy gain.
|
| My hope is that the next person or company who works on this
| will invest in a longer context size for text input
| :fingers_crossed:
___________________________________________________________________
(page generated 2023-10-26 23:01 UTC)