[HN Gopher] MetaCLIP - Meta AI Research
       ___________________________________________________________________
        
       MetaCLIP - Meta AI Research
        
       Author : zerojames
       Score  : 134 points
       Date   : 2023-10-26 09:36 UTC (13 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | dpflan wrote:
       | No discussion after 4 hours of existence, wondering if this is
       | leaving people speechless or not... ;)
       | 
        | CLIP is a very interesting development in AI these days, so
        | demystifying it is a great idea. Is anyone using CLIP or similar
        | models daily who finds this research useful -- and is willing to
        | discuss it? I'm curious what you're doing.
        
         | zerojames wrote:
         | I just posted a comment :D
         | 
          | I work for a computer vision company. I use CLIP almost every
          | day. Example use cases for which I have used CLIP (a minimal
          | classification sketch follows at the end of this comment):
          | 
          | - Image classification
          | - Automated labeling for classification models
          | - Image clustering
          | - Gathering images for model training that are sufficiently
          |   dissimilar from existing samples
          | - Content moderation
         | 
         | CLIP is also being used widely in new research. SAM-CLIP,
         | shared with me by my manager today, is using CLIP
         | (https://arxiv.org/abs/2310.15308) and knowledge distillation
         | for training a new model. I have seen references to CLIP
         | throughout multimodal LLM papers, too, although my knowledge of
         | multimodal model architectures is nascent.
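          | 
          | For anyone curious what the classification/labeling use cases
          | look like in code, here is a minimal zero-shot classification
          | sketch using the original OpenAI CLIP weights via Hugging Face
          | transformers (the labels and image path are placeholders, not
          | from a real pipeline):
          | 
          |     import torch
          |     from PIL import Image
          |     from transformers import CLIPModel, CLIPProcessor
          |     
          |     name = "openai/clip-vit-base-patch32"
          |     model = CLIPModel.from_pretrained(name)
          |     processor = CLIPProcessor.from_pretrained(name)
          |     
          |     # Candidate labels; CLIP scores the image against each one.
          |     labels = ["a photo of a cat", "a photo of a dog",
          |               "a photo of a car"]
          |     image = Image.open("image.png")
          |     
          |     inputs = processor(text=labels, images=image,
          |                        return_tensors="pt", padding=True)
          |     with torch.no_grad():
          |         probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
          |     for label, p in zip(labels, probs.tolist()):
          |         print(f"{label}: {p:.3f}")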
        
           | dpflan wrote:
            | Very cool. I used CLIP and VQGAN in a grad school project ~2
            | years ago, when StyleGAN, StyleCLIP and similar projects were
            | emerging for controlled image manipulation.
        
           | chankstein38 wrote:
            | I have a noob-to-CLIP question. When I've tried to use it to
            | auto-caption photos, the result has been 4-5 words that only
            | vaguely describe the image -- usually something like "A woman
            | holding a pencil", sometimes "A woman woman holding a pencil"
            | or just "A woman".
            | 
            | Do different models do better or worse at this? Is this just
            | untuned output? Are there parameters I should be tweaking?
            | Sorry I'm not able to give much more detail. I'm mostly using
            | it within A1111's "Interrogate CLIP" option, but I've tried
            | using a model I found on Replicate as well as installing it
            | locally. Same results every time.
            | 
            | It seems vaguely useful, but it misses the mark a lot of the
            | time. I'm assuming I'm doing something wrong.
        
             | Philpax wrote:
             | iirc "Interrogate CLIP" is a bit of a misnomer - what it's
             | actually doing is generating a basic caption with BLIP ("a
             | woman holding a pencil"), then iterating over categories
             | and checking with CLIP if any items in those categories are
             | depicted in that image, then concatenating any hits to the
             | resulting caption.
             | 
             | This means the resulting caption is of the form "[BLIP
             | caption], [category1 item], [category2 item], ...". It's
             | very rudimentary.
             | 
             | To clarify: CLIP can tell you if a text label matches an
             | image. It can't generate a caption by itself.
             | 
              | There are more advanced captioning methods, but I'm not
              | sure if they're exposed in A1111 (I haven't used it in some
              | months).
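              | 
              | To illustrate the "check with CLIP" step: a rough sketch
              | that scores one hypothetical category of candidate items
              | against an image and keeps the hits, using the OpenAI CLIP
              | weights via Hugging Face transformers (the candidates,
              | image path and threshold are made up for illustration):
              | 
              |     import torch
              |     from PIL import Image
              |     from transformers import CLIPModel, CLIPProcessor
              |     
              |     name = "openai/clip-vit-base-patch32"
              |     model = CLIPModel.from_pretrained(name)
              |     processor = CLIPProcessor.from_pretrained(name)
              |     
              |     image = Image.open("image.png")  # placeholder path
              |     candidates = ["pencil", "hat", "dog", "guitar"]
              |     
              |     with torch.no_grad():
              |         img = model.get_image_features(
              |             **processor(images=image, return_tensors="pt"))
              |         txt = model.get_text_features(
              |             **processor(text=candidates,
              |                         return_tensors="pt", padding=True))
              |     img = img / img.norm(dim=-1, keepdim=True)
              |     txt = txt / txt.norm(dim=-1, keepdim=True)
              |     
              |     sims = (img @ txt.T)[0]  # cosine similarity per item
              |     # Arbitrary threshold, purely for illustration.
              |     hits = [c for c, s in zip(candidates, sims.tolist())
              |             if s > 0.25]
              |     print(hits)  # items that would be appended to the caption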
        
               | chankstein38 wrote:
               | Thank you for this! I've always been confused about BLIP
               | vs CLIP. That makes a lot of sense and explains the weird
               | duplication of a noun I see sometimes "A woman woman"
               | things like that.
        
             | Filligree wrote:
              | For a task like that, I'd recommend LLaVA instead. It's
              | still inaccurate, but it's a _great_ deal more accurate
              | than the other options I've tried. It also works with
              | llama.cpp.
             | 
             | LLaVA is a multimodal language model you ask questions of.
             | If you don't provide a question, then the default is
             | "Describe this picture in detail". But if you have a
             | concrete question, you're likely to get better results. You
             | can also specify the output format, which often works.
             | 
             | (Make sure to use --temp 0.1, the default is far too high.)
             | 
              | It runs very slowly on CPU, but will eventually give you an
              | answer. If you have more than about four or five pictures
              | to caption, you probably want to put as many of the layers
              | as possible on the GPU. This requires specific compilation
              | options for CUDA; on an M1/M2 it's available by default,
              | but still needs to be turned on (-ngl 9999).
        
             | simonw wrote:
             | I suggest trying BLIP for this. I've had really good
             | results from that.
             | 
             | https://github.com/salesforce/BLIP
             | 
             | I built a tiny Python CLI wrapper for it to make it easier
             | to try: https://github.com/simonw/blip-caption
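              | 
              | If you would rather call BLIP from Python than via a CLI,
              | here is a small sketch using the base captioning checkpoint
              | through Hugging Face transformers (the image path is a
              | placeholder):
              | 
              |     from PIL import Image
              |     from transformers import (BlipForConditionalGeneration,
              |                               BlipProcessor)
              |     
              |     name = "Salesforce/blip-image-captioning-base"
              |     processor = BlipProcessor.from_pretrained(name)
              |     model = BlipForConditionalGeneration.from_pretrained(name)
              |     
              |     image = Image.open("image.png")  # placeholder path
              |     inputs = processor(images=image, return_tensors="pt")
              |     out = model.generate(**inputs, max_new_tokens=30)
              |     print(processor.decode(out[0], skip_special_tokens=True))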
        
         | captaincaveman wrote:
          | Well, as someone not already familiar with it, I've no idea
          | what it's for; the GitHub readme didn't help.
        
           | dpflan wrote:
           | The readme from the linked repo? I think you'll really need
           | to go to the source: OpenAI's CLIP:
           | https://openai.com/research/clip
        
           | zerojames wrote:
           | https://blog.roboflow.com/openai-clip/ is a good high-level
           | guide to what CLIP can do. (Disclosure: I work at Roboflow,
           | but I did not author this piece.)
        
         | ninja3925 wrote:
         | I work for a large tech company and we use CLIP internally for
         | a variety of things:
         | 
          | - Image search (aka Google Images) to find relevant photos
          | given a text prompt. This is the biggest use case (a rough
          | sketch follows at the end of this comment).
          | 
          | - Automated labeling (to answer questions such as "How many
          | entities have attribute Y?")
          | 
          | We didn't fine-tune CLIP, though. If folks have done it
          | successfully, I would love to hear from them!
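          | 
          | A rough sketch of the image-search setup, using the OpenAI
          | CLIP weights via Hugging Face transformers (paths and query
          | are placeholders; a real system would precompute and store the
          | image embeddings, e.g. in a vector database):
          | 
          |     import torch
          |     from PIL import Image
          |     from transformers import CLIPModel, CLIPProcessor
          |     
          |     name = "openai/clip-vit-base-patch32"
          |     model = CLIPModel.from_pretrained(name)
          |     processor = CLIPProcessor.from_pretrained(name)
          |     
          |     paths = ["photo1.jpg", "photo2.jpg", "photo3.jpg"]
          |     images = [Image.open(p) for p in paths]
          |     
          |     with torch.no_grad():
          |         index = model.get_image_features(
          |             **processor(images=images, return_tensors="pt"))
          |         query = model.get_text_features(
          |             **processor(text=["a large dog"],
          |                         return_tensors="pt", padding=True))
          |     index = index / index.norm(dim=-1, keepdim=True)
          |     query = query / query.norm(dim=-1, keepdim=True)
          |     
          |     # Rank the collection by cosine similarity to the query.
          |     scores = (query @ index.T)[0]
          |     for i in scores.argsort(descending=True).tolist():
          |         print(paths[i], round(float(scores[i]), 3))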
        
         | hwsw wrote:
          | We have ported CLIP to our Bottlenose camera. The results are
          | very exciting and the possibilities are, for lack of a better
          | term, endless. You can now tell the camera what to look for.
          | For example, if it is used for manufacturing automation and the
          | task is to detect whether any product is missing a label, our
          | customers can use the natural language inputs "unlabelled
          | product" and "labelled product". The system can now
          | differentiate between the two and send results to a PLC.
          | Previously this would have required deploying a new machine
          | learning loop.
          | 
          | We are generating embeddings on the camera and sending them out
          | via chunk data on the GigE Vision 2.1 protocol.
        
         | xml wrote:
         | I found CLIP to be _amazing_ for all kinds of image search,
         | like search-by-text or search-by-image. I even ported it to
         | NumPy to understand it better. The whole thing is less than 500
         | lines of Python (including blank lines and comments):
         | https://github.com/99991/NumPyCLIP
        
         | jsemrau wrote:
         | DALL-E 3 is using CLIP for synthetic caption generation
         | https://jdsemrau.substack.com/p/paper-review-dall-e-3
        
       | zerojames wrote:
       | I have been playing with MetaCLIP this afternoon and made
       | https://github.com/autodistill/autodistill-metaclip as a pip
       | installable version. The Facebook repository has some guidance
       | but you have to pull the weights yourself, save them, etc.
       | 
        | My inference function (model.predict("image.png")) returns an
        | sv.Classifications object that you can load into supervision for
        | processing (e.g. get the top k) [1]. A rough usage sketch is at
        | the end of this comment.
       | 
       | The paper [2] notes the following in terms of performance:
       | 
       | > In Table 4, we observe that MetaCLIP outperforms OpenAI CLIP on
       | ImageNet and average accuracy across 26 tasks, for 3 model
       | scales. With 400 million training data points on ViT-B/32,
       | MetaCLIP outperforms CLIP by +2.1% on ImageNet and by +1.6% on
       | average. On ViT-B/16, MetaCLIP outperforms CLIP by +2.5% on
       | ImageNet and by +1.5% on average. On ViT-L/14, MetaCLIP
       | outperforms CLIP by +0.7% on ImageNet and by +1.4% on average
       | across the 26 tasks.
       | 
       | [1] https://github.com/autodistill/autodistill-metaclip [2]
       | https://arxiv.org/pdf/2309.16671.pdf
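        | 
        | Rough usage sketch (the constructor below follows the
        | CaptionOntology pattern used by other autodistill modules --
        | check the repo README for the exact arguments):
        | 
        |     from autodistill.detection import CaptionOntology
        |     from autodistill_metaclip import MetaCLIP
        |     
        |     # The ontology maps prompts to class names; the prompts here
        |     # are placeholders.
        |     model = MetaCLIP(CaptionOntology({"a photo of a cat": "cat",
        |                                       "a photo of a dog": "dog"}))
        |     
        |     results = model.predict("image.png")  # sv.Classifications
        |     class_ids, confidences = results.get_top_k(1)
        |     print(class_ids, confidences)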
        
       | ninja3925 wrote:
        | CLIP is such a nice paradigm shift. Historically, CV tooling was
        | quite limited:
        | 
        | - You could predict a class (from a static list such as [dog,
        | cat, ...]), or ...
        | 
        | - You could use image embeddings disconnected from text (you
        | could find image look-alikes, but not tell what they actually
        | represent).
        | 
        | By embedding text and images in the same latent space, you can
        | now query your images with a text query (such as "a large dog")
        | and find the relevant photos. CLIP understands semantics but is
        | also not limited to a fixed list of classes (thanks to the use
        | of web data in training).
        | 
        | This is a list compiled by OpenCLIP of high-performance models
        | (some better than MetaCLIP) for those interested in using CLIP:
        | https://github.com/mlfoundations/open_clip/blob/main/docs/op...
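        | 
        | Loading any model from that list with open_clip looks roughly
        | like this (the architecture/pretrained pair and image path are
        | just illustrative picks):
        | 
        |     import open_clip
        |     import torch
        |     from PIL import Image
        |     
        |     model, _, preprocess = open_clip.create_model_and_transforms(
        |         "ViT-B-32", pretrained="laion2b_s34b_b79k")
        |     tokenizer = open_clip.get_tokenizer("ViT-B-32")
        |     
        |     image = preprocess(Image.open("dog.jpg")).unsqueeze(0)
        |     text = tokenizer(["a large dog", "a small cat", "a red car"])
        |     
        |     with torch.no_grad():
        |         img = model.encode_image(image)
        |         txt = model.encode_text(text)
        |         img = img / img.norm(dim=-1, keepdim=True)
        |         txt = txt / txt.norm(dim=-1, keepdim=True)
        |         probs = (100.0 * img @ txt.T).softmax(dim=-1)
        |     print(probs)  # relevance of each text query to the image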
        
       | gurkwart wrote:
        | Very exciting. CLIP and latent space embeddings in general are
        | such an intuitive and powerful tool. I'm using it in some hobby
        | projects, from semantic image search in private collections to
        | trading card recognition among tens of thousands of cards. Love
        | to see more open source work from big players on this.
        
       | sroussey wrote:
       | How would one license this for commercial use?
        
       | martincmartin wrote:
       | So this can tell you what food is in an image? Like Shazam for
       | food?
       | 
       | https://play.google.com/store/apps/details?id=com.codylab.se...
        
         | cma wrote:
          | CLIP is a big part of generative AI models as well, like
          | Stable Diffusion and DALL-E.
        
       | javier2 wrote:
        | Is this available under a license that permits commercial use,
        | like the CLIP model from OpenAI was?
        
       | ternaus wrote:
        | Love that there is a more accurate pre-trained CLIP model, as it
        | is a foundation for Stable Diffusion and many other very
        | important open source models.
        | 
        | But I would say that the main issue with CLIP is not performance,
        | but that textual input is limited to 77 tokens.
        | 
        | This is a severe limitation. If Meta or another company collected
        | a dataset that allowed a model with 1024 tokens instead, it would
        | enrich the world of open source models much more than +2%
        | accuracy.
        | 
        | My hope is that the next person or company who works on this
        | will invest in a longer context size for text input
        | :fingers_crossed:
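        | 
        | To make the limit concrete, a quick check with the stock CLIP
        | tokenizer from Hugging Face transformers (the caption text is
        | arbitrary); anything past 77 tokens is simply cut off:
        | 
        |     from transformers import CLIPTokenizer
        |     
        |     tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
        |     print(tok.model_max_length)  # 77 -- the cap on text tokens
        |     
        |     long_caption = "a very long, extremely detailed caption " * 20
        |     ids = tok(long_caption, truncation=True,
        |               max_length=tok.model_max_length)["input_ids"]
        |     print(len(ids))  # 77: everything past the limit is dropped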
        
       ___________________________________________________________________
       (page generated 2023-10-26 23:01 UTC)