[HN Gopher] Indexing iCloud Photos with AI Using LLaVA and Pgvector
___________________________________________________________________
Indexing iCloud Photos with AI Using LLaVA and Pgvector
Author : CSDude
Score : 175 points
Date : 2024-01-20 14:01 UTC (2 days ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| say_it_as_it_is wrote:
| I really appreciate itch-scratching posts like these. The life
| story is as important as the workflow.
| behnamoh wrote:
| I'm still trying to understand the difference between multimodal
| models like LLaVA and projects like JARVIS that connect LLMs to
| other Hugging Face models (including object detection models) or
| CLIP. Is a multimodal model doing this under the hood?
| michaelt wrote:
| Object detection models have human-comprehensible outputs. You
| can feed in a picture and it'll tell you that there's a child
| and a cat, and it'll draw bounding boxes around them. You can
| pass that info into an LLM if you want.
|
| The downside to that approach is the LLM can't tell whether the
| cat is standing in front of the child, or sitting on the child,
| or the child is holding the cat; the input just tells it
| there's a child, and a cat, and their bounding boxes overlap.
|
| In contrast, LLaVA feeds the image into a visual encoder called
| 'CLIP' which doesn't output anything human-comprehensible - it
| just gives out a bunch of numbers which have something to do
| with the contents of the image. But the numbers can be fed into
| the LLM along with text - and they can train the image encoder
| and the LLM _together_.
|
| If the training works right, and they have enough training data
| for the model to figure out the difference between a cat
| sitting on a lap and one being held, they end up with a model
| that can figure out that the child is holding the cat.
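| A minimal sketch of that pipeline, assuming the Hugging Face
| transformers library and the openai/clip-vit-large-patch14
| checkpoint; the image file name and the untrained projection
| layer are placeholders for illustration:
|
|     import torch
|     from PIL import Image
|     from transformers import CLIPImageProcessor, CLIPVisionModel
|
|     # CLIP's vision tower turns the image into a grid of patch
|     # embeddings - the "bunch of numbers" described above.
|     encoder = CLIPVisionModel.from_pretrained(
|         "openai/clip-vit-large-patch14")
|     processor = CLIPImageProcessor.from_pretrained(
|         "openai/clip-vit-large-patch14")
|
|     pixels = processor(images=Image.open("child_and_cat.jpg"),
|                        return_tensors="pt").pixel_values
|     with torch.no_grad():
|         patches = encoder(pixels).last_hidden_state  # (1, 257, 1024)
|
|     # LLaVA trains a projection from CLIP's feature space into
|     # the LLM's token-embedding space; this untrained layer only
|     # stands in for it to show the shapes involved.
|     project = torch.nn.Linear(1024, 4096)
|     image_tokens = project(patches)  # goes to the LLM with text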
| reacharavindh wrote:
| Nice work. I'm thinking it could be taken even further by
| incorporating location, date and time, and even people (facial
| recognition) data from the photos, and having an LLM write one
| "metadata text" for every photo. This way one can query
| "person X traveling with Y to Norway about 7 years ago" and
| quickly get useful results.
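| A minimal sketch of the embedding half of that idea, assuming
| the sentence-transformers MiniLM model; the field names and
| example values are made up, and an LLM could write the sentence
| instead of a template:
|
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|
|     def metadata_text(photo):
|         # photo: hypothetical dict of already-extracted fields
|         return (f"{photo['caption']} Taken in {photo['place']} "
|                 f"on {photo['date']}. People: "
|                 f"{', '.join(photo['people'])}.")
|
|     doc = metadata_text({
|         "caption": "Two people hiking near a fjord.",
|         "place": "Norway", "date": "2017-06-14",
|         "people": ["X", "Y"]})
|     vec = model.encode(doc)  # store next to the photo for search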
| ssijak wrote:
| that is exactly what I wanted to do for my Apple Photos lib but
| have not yet got the time to spend on it. Apple Photos search
| is just bad, very bad.
| GaggiX wrote:
| For indexing images it's probably more convenient to calculate
| the embeddings directly with the CLIP image encoder and retrieve
| them using the CLIP text encoder.
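| A minimal sketch of that approach, assuming the
| sentence-transformers CLIP wrapper; the model name and image
| paths are placeholders:
|
|     from PIL import Image
|     from sentence_transformers import SentenceTransformer, util
|
|     model = SentenceTransformer("clip-ViT-B-32")
|
|     # Index: one embedding per image, straight from the image
|     # encoder - no caption step in between.
|     img_emb = model.encode([Image.open("IMG_0001.jpg"),
|                             Image.open("IMG_0002.jpg")])
|
|     # Query: the text encoder maps the query into the same space.
|     txt_emb = model.encode(["a child holding a cat"])
|     scores = util.cos_sim(txt_emb, img_emb)  # rank images by score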
| speedgoose wrote:
| Going through an LLM may improve the performance. From my
| experience working with Stable Diffusion 1.*, CLIP is not very
| intelligent and a 7B quantised LLM could help a lot.
| kkielhofner wrote:
| I second this. CLIP, BLIP, etc. alone are light but pretty
| dumb for captioning in the grand scheme of things.
|
| CLIP is reasonable for reverse image search via embeddings
| but many of the models in this class don't work very well for
| captioning because they're trained on COCO, etc., and they're
| pretty generic.
| bootsmann wrote:
| But this specific use case extracts an embedding from the
| caption, which is where CLIP would skip a lot of overhead by
| going from the image to the embedding directly.
| kkielhofner wrote:
| If you were solely doing reverse image search (submit
| image, generate embeddings, vector search) yes.
|
| This is LLaVA -> text output -> sentence embedding ->
| (RAG style-ish) search on sentence embedding output based
| on query input text (back through the sentence
| embedding).
|
| You could skip the LLaVA step and use CLIP/BLIP-ish caption
| output -> sentence embedding, but pure caption/classification
| model text output is pretty terrible by comparison. Not only is
| it inaccurate, it carries little to no semantic context and is
| extremely short, so the sentence embedding models have
| poor-quality input and not much to go on even when the
| caption/classification is decently accurate.
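| A rough sketch of the query half of that flow, assuming the
| pgvector Python bindings and a hypothetical photos table whose
| embedding column holds the sentence embedding of each LLaVA
| caption:
|
|     import psycopg2
|     from pgvector.psycopg2 import register_vector
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     conn = psycopg2.connect("dbname=photos")
|     register_vector(conn)
|
|     # Embed the query with the same sentence-embedding model
|     # used for the captions, then rank by cosine distance (<=>).
|     qvec = model.encode("child holding a cat on the sofa")
|     with conn.cursor() as cur:
|         cur.execute("SELECT path FROM photos "
|                     "ORDER BY embedding <=> %s LIMIT 10", (qvec,))
|         results = cur.fetchall()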
| GaggiX wrote:
| CLIP does not generate captions, it's simply an encoder. The
| image and text encoders are aligned, so you don't need to
| generate a caption: you simply encode the image and later
| retrieve it using the vector created by the text encoder (the
| query).
| kkielhofner wrote:
| I'm using CLIP here generically to refer to
| families/models generating captions by leveraging CLIP as
| the encoder - of which there are plenty on "The Hub".
|
| Have you actually done the approach I think you're
| suggesting for anything more complex than "this is a
| yellow cat"? Not trying to be snarky, genuinely curious.
| I've done a few of these projects and this approach never
| comes close to meeting user expectations in the real
| world.
| GaggiX wrote:
| Do you have an example of a query that fails when using the
| CLIP embeddings directly but works with the method described
| in the article?
| viraptor wrote:
| Since LLaVA is multimodal, I wonder if there's a chance here to
| strip out a bit of complexity. Specifically, instead of going
| through 3 embeddings (LLaVA internal, text, MiniLM), could you
| use a non-final layer of LLaVA as your vector? It would probably
| require a bit of fine-tuning though.
|
| For pure text, that's kind of how e5-mistral works:
| https://huggingface.co/intfloat/e5-mistral-7b-instruct Or yeah,
| just use CLIP like another commenter suggests...
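| For reference, a minimal sketch of the e5-mistral-style trick
| (last-token pooling of a decoder's hidden states); the official
| recipe also adds an instruction prefix and an EOS token before
| pooling:
|
|     import torch
|     import torch.nn.functional as F
|     from transformers import AutoModel, AutoTokenizer
|
|     name = "intfloat/e5-mistral-7b-instruct"
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModel.from_pretrained(name,
|                                       torch_dtype=torch.float16)
|
|     batch = tok("query: child holding a cat", return_tensors="pt")
|     with torch.no_grad():
|         hidden = model(**batch).last_hidden_state  # (1, seq, 4096)
|
|     # The hidden state of the final token becomes the embedding;
|     # doing the same with an intermediate LLaVA layer is the idea
|     # above, and would likewise need contrastive fine-tuning.
|     emb = F.normalize(hidden[:, -1], dim=-1)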
| warangal wrote:
| I think the image encoder from CLIP (even the smallest variant,
| ViT-B/32) is good enough to capture a lot of semantic
| information to allow natural-language queries once images are
| indexed. A lot of the work actually goes into integrating with
| existing metadata like local directory and date-time to augment
| the NL query and re-rank the results.
|
| I work on such a tool[0] enabling end-to-end indexing of a
| user's personal photos, and recently added functionality to
| index Google Photos too!
|
| [0] https://github.com/eagledot/hachi
| 3abiton wrote:
| I would love to see some benchmarks on that.
| warangal wrote:
| I keep forgetting to put up a benchmark on a standard
| Flickr30k-like dataset! But a ballpark figure should be about
| 100ms per image on a quad-core CPU. I also generate an ETA
| during indexing and provide some meta-information to make it
| easy to see what data is being indexed.
| Zetobal wrote:
| ViT-H and ViT-G are fine; I wouldn't use B anymore.
| burningion wrote:
| Can you give details as to why not?
| warangal wrote:
| It is quite possible the B variant is not enough for some
| scenarios. An earlier version also included video search; the
| frames used for indexing were sometimes blurry (lacking fine
| details), and these frames would often score higher for naive
| natural-language queries. I only tested with the B variant.
|
| But I resolved that problem up to a point by adding a linear
| layer trained to discard such frames, and it was less costly
| than running a bigger variant for my use case.
| dmezzetti wrote:
| Here is an example that builds a vector index of images using the
| CLIP model.
|
| https://neuml.hashnode.dev/similarity-search-with-images
|
| This allows queries with both text and images.
| jsmith99 wrote:
| Immich (a self-hosted Google Photos alternative) has been using
| CLIP models for smart search for a while, and anecdotally it
| seems to work really well - it indexes fast and results are of
| similar quality to the giant SaaS providers.
| eurekin wrote:
| I learned that the hard way while trying to index a few TBs of
| photos. I couldn't get it to finish; it always got stuck after
| 15-ish hours.
| ninja3925 wrote:
| We use CLIP internally (large US tech company) and it works
| very well at a large scale
| diggan wrote:
| > has been using CLIP models for smart search for a while and
| anecdotally seems to work really well [..] results are of
| similar quality to the giant SaaS providers
|
| I'm not super familiar with how the results for the "giant SaaS
| providers" are, but the demo instance of Immich doesn't seem to
| do it very well.
|
| Example query for "airplane":
| https://demo.immich.app/search?q=airplane&clip=true
|
| Even the fourth result seems to rank higher than photos of
| actual airplanes, and most of the results aren't actually
| airplanes at all.
|
| Again, not sure how that compares with other providers, but on
| Google Photos (as one example I am familiar with), searching
| for "airplane" shows me photos taken of airplanes, or photos
| taken from inside an airplane. Even Lego airplanes seem to show
| up correctly, and none of the photos are incorrectly shown as
| far as I can tell.
| jsmith99 wrote:
| I've just tried that and it's true although on my instance
| searching 'airplane' gives good results. I wonder if it's due
| to an insufficient number of images in the demo? I also took
| the advice in the forums to tweak the exact model version
| used.
| clord wrote:
| Is anyone aware of a model that is trained to give photos a
| quality rating? I have decades of RAW files sitting on my server
| that I would love to pass over and tag those that are worth
| developing more. Would be nice to make a short list.
| twoWhlsGud wrote:
| So I think some sort of hybrid between object recognition (like
| what's being discussed here as part of the workflow) and
| standard image processing could be helpful there. E.g. it's not
| absolute sharpness that you're looking for, it's the subject
| being sharp (and possibly sharper than in other photos of the
| same subject from the same time period).
| joshvm wrote:
| Both Google and Apple have implemented models that aim to take
| the "best" picture from a video sequence, like a live photo.
|
| Have a look at MUSIQ and NIMA.
|
| https://github.com/google-research/google-research/tree/mast...
|
| https://blog.research.google/2022/10/musiq-assessing-image-a...
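| A quick way to try one of these on a folder, assuming the
| third-party IQA-PyTorch package (pyiqa), which ships
| implementations of MUSIQ and NIMA; metric names and score scales
| may vary by version, and RAW files would need rendered JPEG
| previews first:
|
|     from pathlib import Path
|     import pyiqa
|
|     # No-reference quality metric: higher score means a
|     # better-looking photo, no ground-truth image required.
|     musiq = pyiqa.create_metric("musiq")
|
|     scores = {p: float(musiq(str(p)))
|               for p in Path("raw_previews").glob("*.jpg")}
|     shortlist = sorted(scores, key=scores.get, reverse=True)[:50]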
| vladgur wrote:
| This is pretty awesome, but I'm curious if it can be used to
| "enhance" the existing iCloud search, which is great at
| identifying people in my photos, even kids as they age.
|
| I would not want to lose that functionality.
| diggan wrote:
| Slightly related: are there any good photo-management
| alternatives to Photoprism that leverage more recent AI/ML
| technologies and provide a GUI for end users?
| voiper1 wrote:
| Is there a state of the art for face matching? I love being
| able to put in a name and find all the photos they are in.
|
| I don't even mind some training of "are these the same or not".
|
| That's one of the conveniences that means I'm still using
| Google Photos...
| eurekin wrote:
| One long-winded way could be using Lightroom for that. It finds
| and groups faces. Also, maybe it can save that info into the
| file itself (with XMP).
___________________________________________________________________
(page generated 2024-01-22 23:02 UTC)