[HN Gopher] Indexing iCloud Photos with AI Using LLaVA and Pgvector
       ___________________________________________________________________
        
       Indexing iCloud Photos with AI Using LLaVA and Pgvector
        
       Author : CSDude
       Score  : 175 points
       Date   : 2024-01-20 14:01 UTC (2 days ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | say_it_as_it_is wrote:
        | I really appreciate itch-scratching posts like these. The life
        | story is as important as the workflow.
        
       | behnamoh wrote:
        | I'm still trying to understand the difference between multimodal
        | models like LLaVA and projects like JARVIS that connect LLMs to
        | other Hugging Face models (including object detection models) or
        | CLIP. Is a multimodal model doing this under the hood?
        
         | michaelt wrote:
         | Object detection models have human-comprehensible outputs. You
         | can feed in a picture and it'll tell you that there's a child
         | and a cat, and it'll draw bounding boxes around them. You can
         | pass that info into an LLM if you want.
         | 
         | The downside to that approach is the LLM can't tell whether the
         | cat is standing in front of the child, or sitting on the child,
         | or the child is holding the cat; the input just tells it
         | there's a child, and a cat, and their bounding boxes overlap.
         | 
          | In contrast, LLaVA feeds the image into a visual encoder
          | called 'CLIP', which doesn't output anything human-
          | comprehensible - it just gives out a bunch of numbers that
          | have something to do with the contents of the image. But the
          | numbers can be fed into the LLM along with text - and they can
          | train the image encoder and the LLM _together_.
         | 
         | If the training works right, and they have enough training data
         | for the model to figure out the difference between a cat
         | sitting on a lap and one being held, they end up with a model
         | that can figure out that the child is holding the cat.
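
          For concreteness, here is a minimal sketch (not from the article)
          of that image-tokens-into-the-LLM flow, using the llava-hf 1.5
          checkpoint on Hugging Face transformers; the model id and prompt
          template are assumptions, not the article's setup:

              # Caption a photo with LLaVA; checkpoint and prompt format
              # are assumptions based on the llava-hf 1.5 release.
              from PIL import Image
              from transformers import AutoProcessor, LlavaForConditionalGeneration

              model_id = "llava-hf/llava-1.5-7b-hf"
              processor = AutoProcessor.from_pretrained(model_id)
              model = LlavaForConditionalGeneration.from_pretrained(model_id)

              image = Image.open("photo.jpg")
              # The <image> placeholder is replaced by CLIP patch embeddings,
              # which a projection layer maps into the LLM's token space;
              # the encoder and LLM were trained together.
              prompt = "USER: <image>\nDescribe this photo in detail. ASSISTANT:"
              inputs = processor(images=image, text=prompt, return_tensors="pt")
              out = model.generate(**inputs, max_new_tokens=200)
              print(processor.decode(out[0], skip_special_tokens=True))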
        
       | reacharavindh wrote:
        | Nice work. I'm thinking it could be tinkered with further by
        | incorporating location information, date and time, and even
        | people (facial recognition) data from the photos, and having an
        | LLM write one "metadata text" for every photo. This way one can
        | query "person X traveling with Y to Norway about 7 years ago" and
        | quickly get useful results.
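
        A rough sketch of that "one metadata text per photo" idea, assuming
        Pillow for the EXIF fields and treating the caption and people list
        as inputs from LLaVA / face recognition (the helper name and file
        paths below are hypothetical):

            # Build a single searchable text from EXIF data plus a caption.
            from PIL import Image, ExifTags

            def metadata_text(path, caption, people):
                exif = Image.open(path).getexif()
                tags = {ExifTags.TAGS.get(k, k): v for k, v in exif.items()}
                taken = tags.get("DateTime", "unknown date")
                # GPS tags (if present) could be reverse-geocoded to a
                # place name and appended here as well.
                return " ".join([
                    f"Photo taken on {taken}.",
                    f"People present: {', '.join(people) or 'unknown'}.",
                    f"Description: {caption}",
                ])

            text = metadata_text("photo.jpg",
                                 "Two people hiking near a fjord.",
                                 ["X", "Y"])
            # `text` is what would be embedded and stored instead of the
            # raw caption.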
        
         | ssijak wrote:
          | That is exactly what I wanted to do for my Apple Photos library
          | but have not yet found the time to spend on it. Apple Photos
          | search is just bad, very bad.
        
       | GaggiX wrote:
        | For indexing images it is probably more convenient to calculate
        | the embeddings directly with the CLIP image encoder and retrieve
        | them using the CLIP text encoder.
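
        A minimal sketch of that direct-CLIP approach with Hugging Face
        transformers; the checkpoint below is the standard openai CLIP
        release, not something taken from the article:

            # Embed an image with the CLIP image encoder and a query with
            # the CLIP text encoder, then rank by cosine similarity.
            import torch
            from PIL import Image
            from transformers import CLIPModel, CLIPProcessor

            name = "openai/clip-vit-base-patch32"
            model = CLIPModel.from_pretrained(name)
            processor = CLIPProcessor.from_pretrained(name)

            with torch.no_grad():
                img = processor(images=Image.open("photo.jpg"),
                                return_tensors="pt")
                image_emb = model.get_image_features(**img)  # stored per photo

                txt = processor(text=["a child holding a cat"],
                                return_tensors="pt", padding=True)
                query_emb = model.get_text_features(**txt)   # built per query

            score = torch.nn.functional.cosine_similarity(image_emb, query_emb)
            print(float(score))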
        
         | speedgoose wrote:
          | Going through an LLM may improve the results. From my
          | experience working with Stable Diffusion 1.*, CLIP is not very
          | intelligent and a 7B quantised LLM could help a lot.
        
           | kkielhofner wrote:
           | I second this. CLIP, BLIP, etc alone are light but pretty
           | dumb for captioning in the grand scheme of things.
           | 
           | CLIP is reasonable for reverse image search via embeddings
           | but many of the models in this class don't work very well for
           | captioning because they're trained on COCO, etc and they're
           | pretty generic.
        
             | bootsmann wrote:
              | But this specific use case extracts an embedding from the
              | caption, which is where CLIP would skip a lot of overhead
              | by going from the image to the embedding directly.
        
               | kkielhofner wrote:
               | If you were solely doing reverse image search (submit
               | image, generate embeddings, vector search) yes.
               | 
               | This is LLaVA -> text output -> sentence embedding ->
               | (RAG style-ish) search on sentence embedding output based
               | on query input text (back through the sentence
               | embedding).
               | 
                | You could skip the LLaVA step and use CLIP/BLIP-ish
                | caption output -> sentence embedding, but pure
                | caption/classification model text output is pretty
                | terrible by comparison. Not only is it inaccurate, it
                | carries little to no semantic context and is extremely
                | short, so the sentence embedding models have poor-quality
                | input and not much to go on even when the
                | caption/classification is decently accurate.
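
                A sketch of that caption -> sentence embedding -> vector
                search step, assuming all-MiniLM-L6-v2 for the sentence
                embeddings and a pgvector table; the table and column
                names below are made up, not the article's schema:

                    # Index a LLaVA caption and search it later.
                    import psycopg
                    from pgvector.psycopg import register_vector
                    from sentence_transformers import SentenceTransformer

                    enc = SentenceTransformer("all-MiniLM-L6-v2")
                    conn = psycopg.connect("dbname=photos", autocommit=True)
                    register_vector(conn)

                    # Indexing: embed the caption (384-dim for MiniLM).
                    caption = "A child holding a grey cat on a sofa."
                    conn.execute(
                        "INSERT INTO photos (path, caption, embedding) "
                        "VALUES (%s, %s, %s)",
                        ("photo.jpg", caption, enc.encode(caption)),
                    )

                    # Query: embed the text, rank by cosine distance.
                    rows = conn.execute(
                        "SELECT path, caption FROM photos "
                        "ORDER BY embedding <=> %s LIMIT 10",
                        (enc.encode("kid with a cat"),),
                    ).fetchall()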
        
               | GaggiX wrote:
                | CLIP does not generate captions; it's simply an encoder.
                | The image and text encoders are aligned, so you don't
                | need to generate a caption: you simply encode the image
                | and later retrieve it using the vector created by the
                | text encoder (the query).
        
               | kkielhofner wrote:
               | I'm using CLIP here generically to refer to
               | families/models generating captions by leveraging CLIP as
               | the encoder - of which there are plenty on "The Hub".
               | 
                | Have you actually tried the approach I think you're
                | suggesting for anything more complex than "this is a
                | yellow cat"? Not trying to be snarky, genuinely curious.
               | I've done a few of these projects and this approach never
               | comes close to meeting user expectations in the real
               | world.
        
               | GaggiX wrote:
                | Do you have an example of a query that would fail using
                | the CLIP embeddings directly but works with the method
                | described in the article?
        
       | viraptor wrote:
        | Since LLaVA is multimodal, I wonder if there's a chance here to
        | strip out a bit of complexity. Specifically, instead of going
        | through 3 embeddings (LLaVA internal, text, MiniLM), could you
        | use a non-final layer of LLaVA as your vector? It would probably
        | require a bit of fine-tuning though.
        | 
        | For pure text, that's kind of how e5-mistral works:
        | https://huggingface.co/intfloat/e5-mistral-7b-instruct Or yeah,
        | just use CLIP like another commenter suggests...
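
        For reference, the e5-mistral trick for pure text boils down to
        last-token pooling over the decoder's final hidden states. A
        simplified sketch (the real model also expects a task instruction
        and an appended EOS token, omitted here):

            # Pull an embedding out of a decoder LLM's hidden states
            # (simplified last-token pooling, roughly the e5-mistral idea).
            import torch
            from transformers import AutoModel, AutoTokenizer

            name = "intfloat/e5-mistral-7b-instruct"
            tok = AutoTokenizer.from_pretrained(name)
            model = AutoModel.from_pretrained(name)

            inputs = tok("query: kid holding a cat", return_tensors="pt")
            with torch.no_grad():
                hidden = model(**inputs).last_hidden_state  # (1, seq, dim)
            emb = torch.nn.functional.normalize(hidden[:, -1, :], dim=-1)

        Doing the same with a non-final LLaVA layer would mean pooling over
        the combined image+text sequence, which, as noted above, would
        likely need fine-tuning to be useful.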
        
       | warangal wrote:
        | I think the image encoder from CLIP (even the smallest variant,
        | ViT-B/32) is good enough to capture a lot of semantic information
        | to allow natural-language queries once images are indexed. A lot
        | of the work actually goes into integrating existing metadata like
        | local directory and date-time to augment the NL query and re-rank
        | the results.
        | 
        | I work on such a tool[0] to enable end-to-end indexing of a
        | user's personal photos and recently added functionality to index
        | Google Photos too!
       | 
       | [0] https://github.com/eagledot/hachi
        
         | 3abiton wrote:
          | I would love to see some benchmarks on that.
        
           | warangal wrote:
            | I keep forgetting to publish a benchmark on a standard
            | Flickr30k-like dataset! But a ballpark figure should be about
            | 100 ms per image on a quad-core CPU. I also generate an ETA
            | during indexing and provide some meta-information to make it
            | easy to see what data is being indexed.
        
         | Zetobal wrote:
          | ViT-H and ViT-G are fine; I wouldn't use B anymore.
        
           | burningion wrote:
           | Can you give details as to why not?
        
           | warangal wrote:
            | It is quite possible the B variant is not enough for some
            | scenarios. An earlier version also included video search; the
            | frames used for indexing were sometimes blurry (lacking fine
            | details), and these frames generally scored higher for naive
            | natural-language queries. I only tested with the B variant.
            | 
            | But I resolved that problem up to a point by adding a linear
            | layer trained to discard such frames, and it was less costly
            | than running a bigger variant for my use case.
        
       | dmezzetti wrote:
       | Here is an example that builds a vector index of images using the
       | CLIP model.
       | 
       | https://neuml.hashnode.dev/similarity-search-with-images
       | 
       | This allows queries with both text and images.
        
       | jsmith99 wrote:
        | Immich (a self-hosted Google Photos alternative) has been using
        | CLIP models for smart search for a while and anecdotally it seems
        | to work really well - it indexes fast and the results are of
        | similar quality to the giant SaaS providers.
        
         | eurekin wrote:
          | I learned it the hard way while trying to index a few TB of
          | photos. I couldn't get it to finish; it always got stuck after
          | 15-ish hours.
        
         | ninja3925 wrote:
         | We use CLIP internally (large US tech company) and it works
         | very well at a large scale
        
         | diggan wrote:
         | > has been using CLIP models for smart search for a while and
         | anecdotally seems to work really well [..] results are of
         | similar quality to the giant SaaS providers
         | 
         | I'm not super familiar with how the results for the "giant SaaS
         | providers" are, but the demo instance of Immich doesn't seem to
         | do it very well.
         | 
         | Example query for "airplane":
         | https://demo.immich.app/search?q=airplane&clip=true
         | 
         | Even the fourth result seems to rank higher than photos of
         | actual airplanes, and most of the results aren't actually
         | airplanes at all.
         | 
          | Again, not sure how that compares with other providers, but on
          | Google Photos (as one example I am familiar with), searching
          | for "airplane" shows me photos taken of airplanes, or photos
          | taken from inside an airplane. Even Lego airplanes seem to show
          | up correctly, and none of the photos are incorrectly shown as
          | far as I can tell.
        
           | jsmith99 wrote:
            | I've just tried that and it's true, although on my instance
            | searching 'airplane' gives good results. I wonder if it's due
            | to an insufficient number of images in the demo? I also took
            | the advice in the forums and tweaked the exact model version
            | used.
        
       | clord wrote:
       | Is anyone aware of a model that is trained to give photos a
       | quality rating? I have decades of RAW files sitting on my server
       | that I would love to pass over and tag those that are worth
       | developing more. Would be nice to make a short list.
        
         | twoWhlsGud wrote:
          | So I think some sort of hybrid between object recognition (like
          | what's being discussed here as part of the workflow) and
          | standard image-processing metrics could be helpful there. E.g.
          | it's not absolute sharpness that you're looking for, it's the
          | subject being sharp (and possibly sharper than in other photos
          | of the same subject from the same time period).
        
         | joshvm wrote:
         | Both Google and Apple have implemented models that aim to take
         | the "best" picture from a video sequence, like a live photo.
         | 
         | Have a look at MUSIQ and NIMA.
         | 
         | https://github.com/google-research/google-research/tree/mast...
         | 
         | https://blog.research.google/2022/10/musiq-assessing-image-a...
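
          A short sketch of scoring photos that way, assuming the pyiqa
          (IQA-PyTorch) package, which ships pretrained MUSIQ and NIMA
          weights:

              # No-reference quality score for a photo; higher is better,
              # though the scale depends on the metric chosen.
              import torch
              import pyiqa

              device = "cuda" if torch.cuda.is_available() else "cpu"
              metric = pyiqa.create_metric("musiq", device=device)  # or "nima"
              print(float(metric("photo.jpg")))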
        
       | vladgur wrote:
        | This is pretty awesome, but I'm curious whether it can be used to
        | "enhance" the existing iCloud search, which is great at
        | identifying people in my photos, even kids as they age.
        | 
        | I would not want to lose that functionality.
        
       | diggan wrote:
        | Slightly related, are there any good photo management
        | alternatives to Photoprism that leverage more recent AI/ML
        | technologies and provide a GUI for end users?
        
       | voiper1 wrote:
        | Is there a state of the art for face matching? I love being able
        | to put in a name and find all the photos that person is in.
        | 
        | I don't even mind doing some training of "are these the same or
        | not".
        | 
        | That's one of the conveniences that keeps me on Google Photos...
        
         | eurekin wrote:
          | One long-winded way could be using Lightroom for that. It finds
          | and groups faces. Also, maybe it can save that info into the
          | file itself (with XMP).
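
          For a DIY route, a minimal sketch with the face_recognition
          library (dlib embeddings under the hood); not state of the art,
          but enough to group photos by person, and the file names here
          are hypothetical:

              # Check whether a known person appears in a photo.
              import face_recognition

              known = face_recognition.load_image_file("person_x.jpg")
              known_enc = face_recognition.face_encodings(known)[0]

              photo = face_recognition.load_image_file("photo.jpg")
              for enc in face_recognition.face_encodings(photo):
                  if face_recognition.compare_faces([known_enc], enc)[0]:
                      print("person X appears in photo.jpg")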
        
       ___________________________________________________________________
       (page generated 2024-01-22 23:02 UTC)