[HN Gopher] Beating OpenAI CLIP with 100x less data and compute
___________________________________________________________________
Beating OpenAI CLIP with 100x less data and compute
Author : vov_or
Score : 234 points
Date : 2023-02-28 15:04 UTC (7 hours ago)
(HTM) web link (www.unum.cloud)
(TXT) w3m dump (www.unum.cloud)
| ipsum2 wrote:
| How did you deal with data contamination?
| vov_or wrote:
| The datasets we used are pretty clean by themselves compared
| with LAION. But we also filtered out images with captions on
| them, and filtered by CLIP scores. Btw, huge thanks to the
| LAION and Open_CLIP projects! They inspire us a lot.
| bilater wrote:
| For me, the biggest thing I am looking for is a serverless
| vector data store. Competitors like Pinecone work just fine,
| but they go from 0 to 70 as soon as you upgrade to a pod.
|
| If you can figure out pricing primarily based on usage you can
| capture a whole segment of this market.
| ashvardanian wrote:
| Great point! I would be happy to get more input and brainstorm
| a good pricing model together, one that is fair both for
| developers and for users.
|
| We have an open-source project, UKV, that partly overlaps with
| vector search: https://github.com/unum-cloud/ukv
|
| Another one, UNSW, is a placeholder for now:
| https://github.com/unum-cloud/unsw
|
| Both will soon be available on cloud marketplaces, but
| serverless options are a bit harder to cook. Our Discord is the
| best place to continue the conversation:
| https://discord.gg/Bbh2bjNhvz
|
| Thank you for the advice!
| swyx wrote:
| > The original CLIP was trained on 500x A100 Nvidia GPUs. The
| latest Open_CLIP trained on 1024x GPUs.
|
| > We trained on the setup of 3x workstations, with 4x RTX 3090
| consumer-grade GPUs in each, connected over 200 GBit InfiniBand
| HDR.
|
| ok so ~85x improvement on the GPU count (i suspect even better
| once you take into account that these are consumer-grade GPUs),
| but i must still be missing something: where does it say it
| uses 100x less data?
| brookst wrote:
| Look at the "dataset" column: CLIP was trained on 400M images,
| UForm on 4M, i.e. 100x less data.
| vov_or wrote:
| There are also dataset sizes listed for ALBEF and ViCHA.
| nl wrote:
| This looks interesting for image retrieval.
|
| I don't love the way their tables[1] report performance though.
| My understanding is that the "Dataset" column in the table
| represents the size of the training dataset, _not_ the size of
| the dataset they are evaluating on. Note that this undersells
| their performance though, so it isn't like they are trying to
| hide something here!
|
| Also I'd love to see someone do a similar benchmark for the
| OpenAI GPT-3 embeddings. I'm pretty unclear on how well they
| compare to something like FLAN-T5, because they don't seem to
| be evaluated anywhere in the retrieval setting (unless I've
| missed it?)
|
| [1] See "Zero-Shot Image Retrieval, English-only" in
| https://www.unum.cloud/blog/2023-02-20-efficient-multimodali...
| alexandargyurov wrote:
| Am I the only one who is very confused what this is?
| jasonjmcghee wrote:
| This is a good introduction to OpenAI CLIP, which should help
| provide context. https://openai.com/research/clip
| pizzaknife wrote:
| thank you for this primer!
| juxtaposicion wrote:
| It is exciting that you could train a CLIP-style model from
| scratch with only 4M datapoints. But if you've got that data, why
| not fine tune a pretrained model with your 4M points? It seems
| likely to outperform the from-scratch method.
| vov_or wrote:
| There is a difference not only in the data source but in the
| pre-training tasks as well. But you are right, fine-tuned
| models on human-annotated data are way better at image
| retrieval than zero-shot (just pre-trained) ones. And that
| holds for CLIP, ALBEF, ViCHA, and UForm.
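|
| For reference, a minimal sketch of fine-tuning a pretrained
| CLIP-style model on a small captioned dataset, assuming the
| HuggingFace transformers API; this is just an illustration with
| a toy batch, not our actual training code:
|
|     import torch
|     from PIL import Image
|     from transformers import CLIPModel, CLIPProcessor
|
|     model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|     processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
|     optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
|
|     # Toy batch of (image, caption) pairs standing in for real data.
|     images = [Image.new("RGB", (224, 224), c) for c in ("red", "blue")]
|     captions = ["a red square", "a blue square"]
|
|     # One training step: the in-batch contrastive loss pulls matching
|     # image/text embeddings together and pushes mismatched ones apart.
|     inputs = processor(text=captions, images=images,
|                        return_tensors="pt", padding=True)
|     outputs = model(**inputs, return_loss=True)
|     outputs.loss.backward()
|     optimizer.step()
|     optimizer.zero_grad()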
| ttt3ts wrote:
| Any plans to document how to fine tune your models then?
| vov_or wrote:
| It will take some time, but yes, we have this in our plans.
| riku_iki wrote:
| perhaps this approach can lead to better training of
| foundational models?..
| vov_or wrote:
| More efficient - for sure!
| varispeed wrote:
| I read a lot about training models and so on, but very little
| about inference.
|
| Let's say you came up with a custom model that gives good
| results; how do you ship that model so it can be served through
| an API?
| binarymax wrote:
| I specialize in this area and build a product for self-hosted
| inference.
|
| The challenge in supporting a new model architecture is coding
| the pre-processing of the inputs (like tokenization, or image
| resizing and color feature extraction) and the post-processing
| of the outputs (for example, entity recognition needs to look
| up the entities and align them with the text).
|
| Once an architecture is coded for the pre/post processing,
| serving a new model for inference with that architecture is
| easy!
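|
| Concretely, a tiny sketch of that pre/post processing bundled
| together, using the HuggingFace transformers NER pipeline as an
| example (the input sentence is made up):
|
|     from transformers import pipeline
|
|     # The pipeline wraps pre-processing (tokenization), the model
|     # forward pass, and post-processing (grouping word pieces back
|     # into entity spans aligned with the original text).
|     ner = pipeline("ner", aggregation_strategy="simple")
|     print(ner("My name is Sarah and I live in London."))
|
| Serving then mostly amounts to exposing that call behind an
| endpoint.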
| alex_sf wrote:
| There's no one answer to that since different models are..
| different. Beyond just modalities (text input and image output?
| image input and video output?), there are different common
| underlying tools used to build them. And then, of course, what
| do you mean by API? How do you want to interact with it?
|
| As a general thing, you'd take a request that would require an
| inference step, which would then invoke the model with some
| parameters and input, and return the output. Beyond that, you'd
| need more detail.
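|
| As a minimal sketch of that flow, assuming FastAPI and a
| sentence-transformers text encoder (the endpoint, model name,
| and request shape are all illustrative choices):
|
|     from fastapi import FastAPI
|     from pydantic import BaseModel
|     from sentence_transformers import SentenceTransformer
|
|     app = FastAPI()
|     # Load the model once at startup, not per request.
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|
|     class EmbedRequest(BaseModel):
|         texts: list[str]
|
|     @app.post("/embed")
|     def embed(req: EmbedRequest):
|         # The inference step: encode the inputs, return plain floats.
|         vectors = model.encode(req.texts)
|         return {"embeddings": [v.tolist() for v in vectors]}
|
| Run it with something like `uvicorn app:app`, and every POST to
| /embed triggers one inference call.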
| [deleted]
| sashank_1509 wrote:
| They seem to be testing only the image retrieval task, but I
| don't think CLIP is actually used for image retrieval. In most
| cases, I see CLIP being used for semantic segmentation,
| detection, etc. Do these guys have similar results on those
| tasks?
| vov_or wrote:
| Hi! I am one of the contributors! We were focused on image
| retrieval only. Almost all semantic search engines for images
| are based on CLIP today. We are also building a semantic
| multimodal search engine as a DBMS component, which is why
| image retrieval is so crucial for us, as is inference
| performance. Also, for semantic segmentation and detection, you
| probably use only the image encoder part of CLIP.
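|
| To make the retrieval use case concrete, here is a sketch of
| the usual CLIP-style recipe, assuming the HuggingFace
| transformers API (paths and query are made up, and a real
| engine would swap the brute-force ranking for an ANN index):
|
|     import torch
|     from PIL import Image
|     from transformers import CLIPModel, CLIPProcessor
|
|     model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|     processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
|     # Index: embed every image once and keep normalized vectors.
|     paths = ["a.jpg", "b.jpg", "c.jpg"]
|     images = [Image.open(p) for p in paths]
|     with torch.no_grad():
|         img = model.get_image_features(
|             **processor(images=images, return_tensors="pt"))
|     img = img / img.norm(dim=-1, keepdim=True)
|
|     # Query: embed the text and rank images by cosine similarity.
|     with torch.no_grad():
|         txt = model.get_text_features(
|             **processor(text=["a red panda"], return_tensors="pt",
|                         padding=True))
|     txt = txt / txt.norm(dim=-1, keepdim=True)
|     scores = (img @ txt.T).squeeze(-1)
|     print(paths[int(scores.argmax())])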
| ilaksh wrote:
| This may be a dumb question, but would it be possible to apply
| these techniques to something like text completion and/or visual
| question answering? If you went ahead and used the optimizations
| but still scaled the model up?
| vov_or wrote:
| Yes, it is possible. The approaches our model is based on are
| capable of solving VQA and other similar tasks, showing SOTA
| results.
| ilaksh wrote:
| Do you know anyone working on a large text completion model
| based on it?
| freediver wrote:
| Do you have/plan to have a text embeddings model?
| vov_or wrote:
| Yes, we are training text embedding models right now, and we
| also plan to open-source some of them! In addition, we train
| encoders for other modalities for retrieval purposes, for
| example video data.
| margorczynski wrote:
| From what I understand, the basis for their model is the
| approach described in these two papers:
| https://arxiv.org/abs/2107.07651
| https://arxiv.org/abs/2208.13628
|
| A lot of tricks put together for a great final result, it seems.
| ashvardanian wrote:
| Thank you! Founder here :) You are right, those are the base
| papers, but we have extended the set of objectives quite
| significantly, tapping into modalities that haven't been
| publicly CLIP-ed :)
|
| It is probably worth writing a paper about, but we are just too
| busy building tons of open-source stuff. Check out the GitHub
| org here: https://github.com/unum-cloud
|
| It is not just about the transformers, but also about
| databases, networking, and improving the modern data stack for
| very large-scale retrieval-based AI. A lot of the pieces may be
| pre-production, but I believe the amazing HN community may
| still enjoy the ways we use io_uring, SIMD, and a few other
| less-than-popular technologies.
| eternalban wrote:
| Where is the udisk? The repo is just a readme on configuration.
| cosmojg wrote:
| Are the pretraining and training pipelines available anywhere
| under a FOSS license? I'd love to take a swing at training a
| mid-fusion model on data other than text and images (e.g.,
| sound, neuron spike trains, etc.)
| debdut wrote:
| man, I just looked at UKV, and it looks too good to be true.
| 30x RocksDB, wtf! Hoping it's true.
| mahnerak wrote:
| I could not find a license in the HuggingFace repo, but it
| seems like the codebase is Apache 2.0. Are the pretrained
| weights / checkpoints also covered under this (or another
| permissive) license?
|
| In other words, can we use it for _commercial purposes for
| free_?
| grammers wrote:
| Good question, was about to ask the same!
| vov_or wrote:
| Hi! Just added Apache 2.0 to the HF model cards. Thanks!
| sva_ wrote:
| Not sure if I'm blind, but what is the number of parameters?
| vov_or wrote:
| 143M for the English model and 206M for the Multilingual one.
| [deleted]
| fabbari wrote:
| The sample code has an error in it: it uses `model` before
| initializing it.
| vov_or wrote:
| Thanks! Seems like a typo. It will be fixed soon.
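|
| For context, the intended order is model first, inputs second;
| a hypothetical sketch assuming the uform package's get_model /
| preprocess_* / encode_* interface (treat the exact names as
| assumptions until the snippet in the post is fixed):
|
|     from PIL import Image
|     import uform
|
|     # Initialize the model before using it.
|     model = uform.get_model('unum-cloud/uform-vl-english')
|
|     image = model.preprocess_image(Image.open('red_panda.jpg'))
|     text = model.preprocess_text('a small red panda in a zoo')
|
|     image_embedding = model.encode_image(image)
|     text_embedding = model.encode_text(text)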
| kimihailv wrote:
| Did the author report metrics of the unimodal model or of the
| multimodal model with re-ranking?
| vov_or wrote:
| The results are reported with the multimodal model.
___________________________________________________________________
(page generated 2023-02-28 23:00 UTC)