[HN Gopher] AI Photo Geolocation
       ___________________________________________________________________
        
       AI Photo Geolocation
        
       Author : hubraumhugo
       Score  : 118 points
       Date   : 2024-05-02 04:41 UTC (18 hours ago)
        
 (HTM) web link (geospy.ai)
 (TXT) w3m dump (geospy.ai)
        
       | voidUpdate wrote:
       | I'm not convinced by the quality of this. I took some screenshots
       | of street view, not including any icons, and it identified them
       | as completely the wrong city. One of them included the name of
       | the town on a bus stop, which it completely failed to identify,
       | placing the picture across the county, also asserting that it
       | contained features that it definitely didn't, such as thatched
       | rooves (all rooves in the image were normal slate). I would maybe
       | trust it to get me in the correct area of the country, but that's
       | about it
        
         | voidUpdate wrote:
         | After a bit more testing, it could successfully identify
         | Buckingham Palace and St Michael's Mount (however the location
         | wasn't great), however a street overlooking a beach in Cornwall
         | was marked as Wales, apparently including house numbers in
         | Welsh and English (despite Welsh using English numerals). It
         | seems to work somewhat ok if there is a clear image of an
         | obvious and distinctive monument, otherwise it isn't
         | particularly accurate
        
         | tapland wrote:
         | I got similar results, and that would be ok if it didn't sound
         | so confident with the guess.
        
       | boesboes wrote:
       | It's not very accurate, but it seems consistent. However it quite
       | often tells me 'this is in X because the language on the sign'
       | when there are no signs at all. Or just now I got 'The house in
       | the background is made of wood, which is a common building
       | material in Finland.' with a photo of a lake. There is no house,
       | there are trees though :)
        
       | lkramer wrote:
       | The page is very broken for me (Firefox in Linux), locking up,
       | flickering.
       | 
       | I did manage to get it to place a picture of a praying mantis I
       | took in Japan to be from California...
        
         | eru wrote:
         | I get the same flickering in Firefox on MacOS, but it managed
         | to recognise my picture.
        
         | lordswork wrote:
         | I got "an error has occurred"
        
         | ape4 wrote:
         | Me too - maybe it's overloaded
        
         | mhuffman wrote:
         | Same, Firefox in linux, also tried with all extensions disabled
         | and same thing. Just flickering with an "error" modal.
        
       | sanxiyn wrote:
       | This correctly identifies South Korean landmarks, like Diamond
       | Bridge in Busan. Since I don't have encyclopedic knowledge of
       | world landmarks -- I wouldn't be able to recognize Diamond
       | Bridge-like landmarks in United States -- and nearly no one does
       | either, that alone is quite useful.
        
       | tgv wrote:
       | Doxxing for dough. The ethics committee is out to lunch.
       | 
       | See this thread: https://news.ycombinator.com/item?id=40233248
        
         | ToucanLoucan wrote:
         | In their defense, the AI hype folks have been ignoring the
         | ethics community from jump. I'd leave too.
        
       | sebzim4500 wrote:
       | Pretty cool. Correctly identifies different islands in the
       | Galapagos based on the ground and the plants.
        
       | elsadek wrote:
       | Childish AI, I gave it two photos, both were totally wrong and
       | hundreds of thousands of miles from the actual locations.
       | 
       | Don't recommend.
        
       | eru wrote:
       | > I'm sorry but GeoSpy is not allowed to process this image.
       | Please try again with a different image or contact support at
       | info@graylark.io
       | 
       | That was with an image I took in London on my phone.
        
       | bambax wrote:
       | This is what happens when I try to scroll to read the results...?
       | 
       | https://i.imgur.com/ywc1Hn0.png
       | 
       | (Chrome on Windows)
        
       | coumbaya wrote:
       | My backyard: Germany because there are trees and a fence (? Also,
       | no). A picture of the farmer's market of my town: correctly
       | assumed France but confidently incorrect on the town and landmarks
       | (off by 200-300km I'd say).
        
       | mufty wrote:
       | I'm not sure how this works under the hood. My initial
       | observation is it does not work.
        
       | trebligdivad wrote:
       | Hmm interesting; one hit, one probably in vaguely the right area;
       | both from scans of ~40 year old photos. (As someone else noted,
       | site is rather broken on Firefox/Linux but does work).
        
       | amarcheschi wrote:
       | I put an image of Villa Isnard, Cascina, Pisa, and it was
       | recognized as being from France. It is a villa with French
       | architecture and olive trees. I then tried to upload an image
       | still from Pisa with a building in Venetian Gothic style and it
       | was recognized as being in Venice. It can be deceived quite easily
       | imho, it looks like it just searches for the corresponding
       | architecture (maybe?) and details surrounding it but it doesn't
       | search online. Villa Isnard is quite famous (at least, you have
       | results online) and a Google lens search would have found it
        
         | j-bos wrote:
         | Yeah, I would guess it's identifying elements in the photo,
         | qualifying the likelihood of those combined elements in a
         | particular location, then outputting the assumption. I posted a
         | picture of a Texas lake seen from a private residence, and it
         | correctly guessed Texas, but pushed it off by a couple hundred
         | miles into an Austin golf course.
        
           | sanxiyn wrote:
           | It seems to use many signals, at least according to its own
           | explanation. For example it looks at road signs and license
           | plates to identify countries.
        
       | axegon_ wrote:
       | Assuming this is more of a proof of concept/prototype, that's not
       | bad. It didn't get it[1] right, but at its core, the guess is
       | not terrible, shift it 560km[2] south-east and you'd be bang on.
       | I'll admit, I did set the bar a bit high.
       | 
       | [1] https://imgur.com/a/67t0TVt
       | 
       | [2] https://imgur.com/a/aspf8px
        
       | pt_PT_guy wrote:
       | Is the website glitchy on firefox?
        
         | DominoTree wrote:
         | Very - never seen anything explode quite like it
        
       | wongarsu wrote:
       | Seems to be about as accurate as a good GeoGuessr player on a
       | time limit. Recognizable vistas are generally right down to the
       | city, and even if there's only general architecture to go off
       | it's often right to within a couple hundred kilometers.
       | 
       | The explanations are a bit hit and miss. Some are great and
       | correctly describe the names of buildings in the picture, some
       | are only vaguely related to the picture.
       | 
       | Ethically this is very questionable. Of course with enough
       | dedication humans can do the same (e.g. Rainbolt has made a
       | YouTube career out of this), but commoditizing this for every
       | stalker around the world has some troubling implications.
        
         | surfingdino wrote:
         | Ethics is absent in the minds of the people building and
         | financing this. AI is about wholesale value extraction and
         | destruction of competition done by the ecosystem of small
         | startups repackaging AI APIs. Those APIs will be turned off
         | once ad revenue starts flowing into the bank accounts of AI API
         | providers.
        
       | kome wrote:
       | Interesting concept, and it works somehow. But they definitely
       | need better web developers. Very strange flickering, what the
       | hell is that?
        
       | erksa wrote:
       | This is close to an actual need I've managed to create for
       | myself.
       | 
       | I do photography and I store those I want to share on nextcloud.
       | In my selection and export process all metadata etc is stripped.
       | But I realized too late that it also stripped out the geo-
       | coordinates. No problem adding that in, but still have a laaarge
       | amount of photos without geolocation data.
       | 
       | I'm too lazy to re-export all the older ones, so being able to
       | run something like this on them would be perfect. I would be
       | satisfied with a general area, roughly hitting the province/state
       | it's taken in. It doesn't have to be accurate at all, it's more
       | for my own geo grouping.
       | 
       | This site though goes bananas on firefox/mac. Flickering and font
       | adjustments..
        
         | CaptainOfCoit wrote:
         | > I'm too lazy to re-export all the older ones, so being able
         | to run something like this on them would be perfect. I would be
         | satisfied with a general area, roughly hitting the
         | province/state it's taken in. It doesn't have to be accurate at
         | all, it's more for my own geo grouping.
         | 
         | I don't think this is even close to being accurate enough to be
         | used in this way, out of ~10 images I uploaded it got one
         | "correct" (right country, wrong city). Unless you want all your
         | images to be geo-tagged "Somewhere, US", probably better to
         | re-export/re-import with your original metadata.
        
           | erksa wrote:
           | That's fair. I couldn't even get this to work, so I'm not
           | looking at this implementation in particular. I was literally
           | just wondering whether this would be viable as an approach,
           | so it was fun to see something that tries to fit the bill!
        
         | solardev wrote:
         | If you still have the original photos, maybe you can write a
         | script to run both the originals and exports against a
         | perceptual hash (so as to easily identify the correct original)
         | and then just update the JPEG EXIF data of the exports?
         | 
         | https://github.com/JohannesBuchner/imagehash
         | 
         | Depending on specific formats, you should be able to read and
         | edit metadata without having to reprocess the images. If the
         | exports are named similarly to the originals, you don't even
         | need to hash them.
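         |
         | A minimal sketch of the matching step described above, assuming
         | a simplified average hash over tiny grayscale grids so that it
         | runs with the standard library only. The imagehash library
         | linked above provides real perceptual hashes for image files;
         | the function names, file names, pixel grids, and the
         | max_distance threshold here are hypothetical, and the EXIF
         | copy step is left out.

```python
def average_hash(pixels):
    """Return an integer hash with one bit per pixel, set when the pixel
    is brighter than the grid's mean (a simplified average hash)."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def match_exports(originals, exports, max_distance=2):
    """Map each export name to the closest original name, or None when
    no original is within max_distance bits."""
    orig_hashes = {name: average_hash(px) for name, px in originals.items()}
    matches = {}
    for name, px in exports.items():
        h = average_hash(px)
        best = min(orig_hashes, key=lambda o: hamming(h, orig_hashes[o]))
        close_enough = hamming(h, orig_hashes[best]) <= max_distance
        matches[name] = best if close_enough else None
    return matches


# Hypothetical 2x2 grayscale grids standing in for downscaled photos.
originals = {
    "IMG_0001.raw": [[10, 200], [220, 30]],
    "IMG_0002.raw": [[200, 10], [30, 220]],
}
exports = {
    "export_a.jpg": [[12, 190], [215, 35]],  # slightly altered IMG_0001
    "export_b.jpg": [[205, 8], [28, 210]],   # slightly altered IMG_0002
}
matches = match_exports(originals, exports)
```

         | Once each export is paired with its original, the GPS tags
         | could be copied across with whatever metadata tool is already
         | part of the export workflow.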
        
       | erkkonet wrote:
       | Funny enough it was accurate all while citing items that were not
       | in the picture (not even cropped out), like tall buildings and
       | signs in a specific language. I'm sure there will be refined
       | versions that are scarily accurate. Another OSINT tool for better
       | or worse.
        
       | karma_pharmer wrote:
       | FirebaseError: Installations: Create Installation request failed
       | with error "400 INVALID_ARGUMENT: API key expired. Please renew
       | the API key." (installations/request-failed).
        
       | jimlawruk wrote:
       | I put in a photo of a small lake near Truckee, CA, several miles
       | from Lake Tahoe, and it reported it was Lake Tahoe. It was wrong,
       | but I'm impressed it was geographically very close.
        
         | iLoveOncall wrote:
         | Now try with a photo of a lake that's on the other side of the
         | world compared to Lake Tahoe and see if it doesn't also report
         | it as Lake Tahoe.
        
       | 1970-01-01 wrote:
       | It's heavily biased and therefore easily tricked. I uploaded a
       | photo from NYC.
       | 
       | > The graffiti on the wall is a clue that the photo was taken in
       | Detroit. The vegetation in the background is also consistent with
       | the climate of Detroit.
        
         | surfingdino wrote:
         | It is looking for distinct features in the photo and does a
         | probabilistic match against a tagged dataset. The features that
         | match best on the tagged photos in its dataset are used to
         | construct output that looks like a plausible answer. Don't use
         | it to plan your trip.
        
         | usaar333 wrote:
         | Yeah, gave it a photo of a beach on Lake Ontario with two Asian
         | friends of mine in it. Guessed.. Japan
        
         | KeplerBoy wrote:
         | It's also a ridiculously hard task.
        
           | ToucanLoucan wrote:
           | Yeah maybe just don't do it then? If someone removes the EXIF
           | data from a photo there's probably a reason for that, and
           | assuming that's suspicious in some way is pretty ridiculous
           | in a society that's supposedly all about personal freedom and
           | the right to a fair trial.
           | 
           | I don't mean to be aggressive here but this seems like yet
           | another tool that will be abused to shit by already powerful
           | people to do even sketchier things.
        
             | KeplerBoy wrote:
             | I agree, this project probably shouldn't exist. But oh,
             | well here we are and stuff like this can be built with
             | reasonable effort. Scrape Google Street View and every
             | EXIF-tagged image you can get your hands on and get training.
             | 
             | I have no idea where this is heading, but we aren't turning
             | back.
        
       | boxed wrote:
       | > Sweden
       | 
       | Good
       | 
       | > Rural area
       | 
       | good
       | 
       | > [pin in the center of Stockholm, the most urban area in Sweden]
       | 
       | ouch.. not so good.
        
         | poulpy123 wrote:
         | to be fair they say they provide the coordinates for the city
         | or the town, so if it's not too far from stockholm I would
         | count it right
        
       | dmd wrote:
       | I'm blown away; it correctly identified a photo taken _inside my
       | house_ - just a picture of my kitchen - as being in eastern
       | Massachusetts, just based on the architecture.
        
         | CaptainOfCoit wrote:
         | Maybe it does a lot better with photos related to the US,
         | training set probably contained mostly US-related images, as
         | only one image out of ~10 taken in various European places was
         | correctly guessed for me. Most of the guesses were places in the
         | US while none of the images I tried were from the US.
        
           | fer wrote:
           | There's a certain training set bias. Most pictures from post-
           | Soviet states land in Moscow for me.
        
         | Zambyte wrote:
         | I wonder if it looks at EXIF data at all.
        
           | dmd wrote:
           | I stripped date and geo info.
        
         | btasker wrote:
         | I didn't have the same luck.
         | 
         | I gave it a photo from inside a house, you can see a person on
         | the bed, and the white wall behind - that's it.
         | 
         | Obviously I wasn't expecting an accurate location, but
         | 
         | > This photo was taken in Los Angeles, California. We can tell
         | this from the architecture of the buildings in the background,
         | as well as the vegetation. The palm trees are a dead giveaway
         | that this is Los Angeles.
         | 
         | There are no palm trees, the photo wasn't taken in the US and
         | palm trees exist outside of LA.
         | 
         | I also fed a photo of some quite distinctive castle ruins. It
         | mislocated that by 100s of miles.
        
       | foobarbecue wrote:
       | I gave it a picture from a bar in Austin. It nailed it, but with
       | some interesting hallucinations in the description. The photo had
       | a small Texas flag, but nobody was wearing cowboy hats, and there
       | was nothing with "Austin" on it in the photo. Description was:
       | 
       | This photo was taken inside a bar. There are several clues that
       | indicate this is Austin. First, there is a sign on the wall that
       | says "Austin." Second, there is a Texas flag on the wall. Third,
       | there are several people wearing cowboy hats, which is a common
       | sight in Austin. The coordinates of this photo are
        
         | dylan604 wrote:
         | Is it possible that the sign that says Austin _is_ on the wall
         | and is known to the system but not visible in the actual photo?
        
           | fer wrote:
           | Perhaps, but I've tried some rural landscapes without any
           | sign and it came up with English signs as a hint for pointing
           | to England/Wales.
           | 
           | Even photos with signs in Irish were pointing to England,
           | it's half funny, half offensive.
        
       | nirav72 wrote:
       | I uploaded a couple of pics of a beach in Turks and Caicos. It came
       | back with a beach in the Bahamas. Not even close. But I suppose
       | close enough geographically. Also a pic taken from a stationary
       | train in Chicago, came back as NYC.
        
       | Loranubi wrote:
       | I uploaded a drone shot of New Taipei City and it gave me Taipei.
       | Close enough. I don't know if it was cheating though because the
       | image had EXIF GPS coords embedded...
       | 
       | The site worked fine on Firefox on iOS.
        
       | exar0815 wrote:
       | It told me the correct country, but the completely wrong city,
       | and then began to describe a typical place in the style of the
       | country - nothing of which was visible on the image.
        
       | consumer451 wrote:
       | I uploaded a generic photo I took of a field, dandelions, and
       | trees. It confidently stated Switzerland, and provided specific
       | GPS coords.
       | 
       | Of course, it was entirely wrong.
       | 
       | Some level of confidence indication would make a system like this
       | much more useful.
        
         | burkaman wrote:
         | I think this is generative AI and it doesn't know how confident
         | it is. So far it hasn't gotten any of my pictures right and
         | it's made pretty bad guesses, a human could do better on most
         | of them.
        
           | consumer451 wrote:
           | I can only imagine the quality of the systems being sold to
           | governments, and people trusting them because "AI." I mean,
           | "intelligence" is right in the name!
        
             | dylan604 wrote:
             | Intelligence is also in the name of the CIA, and that's
             | pretty well understood as an oxymoron. Artificial is also
             | in the name which seems much more apropos. It's clearly not
             | real intelligence, it's purely artificial in the use of the
             | word. I guess, Computer's Best Guess Simulating
             | Intelligence To Low Intelligent Humans would be too on the
             | spot and not as sexy of an acronym.
        
               | consumer451 wrote:
               | I actually have faith in the intelligence products
               | produced by the modern CIA, much more than AI snake oil.
        
               | dylan604 wrote:
               | There's a difference in the products used to produce
               | intelligence vs the analysis/centralization of it that
               | the oxymoron comment goes towards.
        
           | ubutler wrote:
           | The LLM's logits should be translatable into probabilities
           | although I'm not too sure how meaningful those might be as
           | models can sometimes be quite confident in entirely invalid
           | predictions.
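             |
             | A minimal sketch of turning logits into probabilities
             | with a softmax, for illustration only. The candidate
             | names and logit values are made up, and nothing here
             | reflects how GeoSpy works internally. As noted above, a
             | high probability only means the model favors that
             | output, not that the answer is correct.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate answers.
candidates = ["Switzerland", "Austria", "Slovenia"]
logits = [4.0, 1.0, 0.5]
probs = softmax(logits)

for name, p in zip(candidates, probs):
    print(f"{name}: {p:.3f}")
```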
        
             | vineyardlabs wrote:
             | I haven't done deep reading on LLM architectures and I
             | don't really know if LLMs have logits in the traditional
             | sense of a CNN or something, but I think the problem with
             | this is that the LLM's logits would have absolutely no
             | bearing on its confidence of the location being correct,
             | only on its confidence that the tokens making up the
             | answer it provides follow from the tokens that were encoded
             | from the provided image, which isn't the same thing.
        
             | barrkel wrote:
             | I don't think that's the right way to think about LLM
             | logits. Fundamentally the logits represent probability of
             | similarity with the text it's been trained on, given the
             | current prefix. Mixed in with any correspondence with truth
             | is not only tone, phrasing, dialect, language syntax, but
             | also stuff like the likelihood that specific details are
             | related to general concepts. Even if we're talking about a
             | person with three legs, or a horse riding a man, it'll be
             | hard for the LLM to not assign a fairly high probability to
             | sentences that describe two legs, or a man riding the horse
             | and not the other way around.
        
       | sebys7 wrote:
       | 1/3 was completely wrong, in the sense that the coordinates,
       | country and city had nothing to do with it but THEN the sources
       | were other buildings from the actual country and city. 2/3 got
       | city and coordinates correct, but got the country wrong, which
       | idk how that happened. 3/3 got country, city and coordinates
       | correct. Pretty cool
        
       | karaterobot wrote:
       | I had ChatGPT generate some selfies taken in various places, then
       | ran them through this app. My assumption was that this app would do
       | really well, since one model would identify the stereotypical
       | features generated by the other model. It got 1/3. It nailed
       | Minneapolis, it got Damascus, Syria wrong (said Amman, Jordan),
       | and it got the Ballard neighborhood of Seattle wrong (said San
       | Francisco).
        
       | plorg wrote:
       | I uploaded pictures of a couple of street corners and it
       | confidently identified them as being in Texas and Florida, based
       | on text that was not in the pictures and, in the second case,
       | foliage in a scene that included only concrete. Although in
       | fairness to the model, a parking lot may be the dominant
       | ecosystem in Jacksonville.
       | 
       | Anyways, these pictures were from Iowa.
        
         | DougBTX wrote:
         | Same, location identified by the architecture of the buildings
         | in the background and the car's numberplate... of a car with no
         | numberplate driving through a wood.
        
       | abnry wrote:
       | I took images directly from Google Images search and it got them
       | wrong. But it was sort of directionally right. My local city hall
       | it said was the courthouse in the same county. The local bridge
       | was put into a wrong state.
       | 
       | Interestingly, it provided reference images and the images I
       | posted were basically in the reference images.
        
       | ethanholt1 wrote:
       | It was correctly able to identify several photos of my vacation
       | to NC, down to the exact location where the photo was taken on
       | the hiking trail. Pretty scary. Additionally, just to be sure, I
       | used an EXIF data wiper to make sure it wasn't pulling data from
       | there and tried each photo in a separate Incognito instance.
       | Still got it correct, all 3 times. Mind boggling.
        
         | teakie wrote:
         | how?
        
       | poulpy123 wrote:
       | It recognized 2/7 of the pictures I used. The two successes are a
       | really well known place in Rome (the Roman Forum, although it got
       | the arch wrong) and a small but very touristic city. It guessed
       | the country right but was far from the place in 4 cases:
       | landmarks were visible but they are not hugely touristic. In the
       | last case the country was wrong, but it was a picture from my
       | office with no landmark.
        
       | Sporktacular wrote:
       | It's funny that with a direct match to a precisely located photo
       | in its database, it got the country right by comparing the
       | architectural style, but still got the city wrong.
        
       | rnewme wrote:
       | Even though it missed the town by a few kilometers it also
       | recognized my wife's dress and linked to the webstore for it
        
       | hhh wrote:
       | FYI this site is keeping everything you upload in a Google
       | storage bucket, which was unauthenticated up until a little bit
       | ago. (Full disclosure, it's my tweet.)
       | 
       | https://twitter.com/spuhghetti/status/1786033761341083731
        
         | Zeratoss wrote:
         | thotDBSmash ??
         | 
         | Does this imply that the people behind the website specifically
         | saved "juicy" user-uploaded images?
        
           | hhh wrote:
           | no, it looks like it was a separate project, but stored in
           | the same bucket. In the time I had access to the bucket (it's
           | no longer public), it looks like they were scraping images
           | from a dating site/app and each directory represented a
           | profile.
        
             | ShamelessC wrote:
             | That doesn't sound super sketchy or anything.
        
       | pphysch wrote:
       | Why use an LLM for this? You'd definitely want a large model, but
       | this seems like a more straightforward classification problem
       | that doesn't require understanding of language.
        
       | andoando wrote:
       | I uploaded a photo of a screenshot of a chess game on chess.com
       | 
       | It identified it as the Golden Gate Bridge in San Francisco, saying
       | the buildings in the background are also consistent with the
       | architecture in San Fran.
        
       | onemoresoop wrote:
       | Entirely wrong result.
        
       | salade_pissoir wrote:
       | This seems like another Hotdog/Not Hotdog business model.
        
       | underlogic wrote:
       | I think it just uses EXIF data, then makes guesses for other photos
       | from the same IP. Fake it till you make it
        
       | vel0city wrote:
       | Uploaded a photo of the Bell Centre. Easy Habs logo on it. City
       | location: Toronto.
        
       | ghastmaster wrote:
       | > The photo was taken from a tall structure, possibly a fire
       | tower.
       | 
       | It was way off on the state, but I am still impressed with that
       | spot on description. It was taken from a fire tower.
        
       | alistairSH wrote:
       | Tried it a few times, it's hit or miss.
       | 
       | - Rooftop bar in Viejo San Juan, PR was identified correctly,
       | down to the intersection.
       | 
       | - Beach on the south coast of Vieques, PR was identified as
       | Jamaica, so reasonably close for a nondescript tropical beach.
       | 
       | - Office building in Reston, VA which is fairly obvious
       | (biggest/tallest building in the area) was identified as being in
       | San Jose, CA.
       | 
       | - Train station in Staunton, VA was identified as somewhere in
       | Massachusetts.
       | 
       | The attributes of the photo were mostly accurate, but were
       | matched to an incorrect location.
        
       | itslennysfault wrote:
       | I've been learning deep learning, and I built a very toy version
       | of this recently. It's really just a classifier that can (maybe)
       | tell you if a photo was taken in one of the 5 cities I trained it
       | on.
       | 
       | https://huggingface.co/spaces/itslenny/fastai-lesson-2-big-m...
        
       | stainablesteel wrote:
       | it seems to be a rule of thumb that you can pick a subreddit that
       | operated as some kind of service, ie r/whereisthis, and replace
       | that entire apparatus with an ai of some kind
        
       | glonq wrote:
       | I gave it a snippet of Montevideo city skyline, and it responded
       | with:
       | 
       | The photo was taken from a rooftop in Buenos Aires, Argentina.
       | The photo shows a clear view of the city's skyline, including the
       | iconic Obelisk of Buenos Aires. The buildings in the photo are
       | characteristic of Buenos Aires' architecture and the vegetation
       | is typical of the region.
        
       | K0balt wrote:
       | I uploaded a nondescript scenery photo with no non-natural cues
       | from the Dominican Republic, it got the general area right within
       | 50km.
        
       | timnetworks wrote:
       | Googled 'vacation photo' and picked some off of Flickr. The
       | locations matched the captions, State and Country (FLA and
       | Cancun) correctly.
       | 
       | Obviously uploading a picture of a hot dog will waste compute on
       | trying to figure out what kind of traffic the ketchup is, but it
       | works with snapshots great (not stock photos)
        
       | fragmede wrote:
       | The question is, how does this do at GeoGuessr, where users are
       | given a picture from Google Street View and asked where in the
       | world it is by clicking on a map. Users get points based on how
       | close their guess is, and the user with the most points after
       | N rounds wins.
       | 
       | The best player in the world, Rainbolt, played against an AI out
       | of Stanford, so I wonder how this one would do.
       | 
       | https://www.geoguessr.com/
       | 
       | https://www.youtube.com/watch?v=ts5lPDV--cU
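       |
       | A minimal sketch of the "how close is the guess" part, using the
       | haversine great-circle distance. The thread does not give
       | GeoGuessr's scoring formula, so this only measures distance and
       | picks the closer guess. The player names are hypothetical and
       | the city coordinates are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def closest_guess(truth, guesses):
    """Return the name of the guess nearest to the true location."""
    return min(guesses, key=lambda name: haversine_km(*truth, *guesses[name]))


# Approximate coordinates: the photo is from Montevideo.
truth = (-34.90, -56.16)
guesses = {
    "player_a": (-34.60, -58.38),  # Buenos Aires
    "player_b": (-33.45, -70.67),  # Santiago
}
winner = closest_guess(truth, guesses)
error_km = haversine_km(*truth, *guesses[winner])
```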
        
       | Alifatisk wrote:
       | Reminds me of a research paper that used AI to accurately
       | pinpoint where a picture was taken, I had hoped this was it. But it's
       | better than nothing.
        
       | noashavit wrote:
       | scary accurate
        
       | mightytravels wrote:
       | Claude Vision can do that for you if you are building something
       | similar. Had similar results with OpenAI.
       | 
       | https://docs.anthropic.com/claude/docs/vision
        
       | salamo wrote:
       | Neat demo. It seems like there may be a few things happening in
       | tandem.
       | 
       | I uploaded a picture of a forest and it came back with visually
       | similar images. So the first thing it might be doing is some kind
       | of KNN and, if the pictures have location labels associated,
       | applying some sort of weighted average to determine GPS
       | coordinates. This is pretty cool.
       | 
       | I also tried flipping the image horizontally, and it came back
       | with the same images. So their embedding isn't based off of exact
       | matches (good) and seems to be invariant to some basic
       | transformations (good). It also seems like it's directly
       | extracting visual features from the image. This can be done with
       | something like BLIP [0].
       | 
       | Then I uploaded a screenshot from Magic School Bus. It still
       | extracted information to guess the "location" of the cartoon (San
       | Francisco, which is wrong). So that's probably how it works.
       | 
       | I also found the text output is similar in some ways with
       | OpenGVLab InternViT [1]. So perhaps this or something like it is
       | being used to extract features.
       | 
       | And of course there may be an LLM on top of these extracted
       | features with some sort of prompt template. But I should add that
       | the text explanation is the _least useful part of the result_ ,
       | since it is unreliable and less informative than the "boring"
       | similarity metrics above.
       | 
       | [0] https://huggingface.co/Salesforce/blip-image-captioning-larg...
       | 
       | [1] https://internvl.opengvlab.com/
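       |
       | A minimal sketch of the KNN plus weighted-average guess described
       | above, assuming each reference photo has an embedding vector and
       | a known latitude/longitude. This is a guess at the general
       | technique, not GeoSpy's actual pipeline; the embeddings,
       | coordinates, and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_locate(query, index, k=2):
    """Estimate (lat, lon) as the similarity-weighted mean of the k most
    similar reference photos. Returns None if no neighbor has positive
    similarity. Averaging raw lat/lon is a simplification that misbehaves
    near the antimeridian and the poles."""
    scored = sorted(
        ((cosine(query, emb), lat, lon) for emb, lat, lon in index),
        reverse=True,
    )[:k]
    weights = [max(sim, 0.0) for sim, _, _ in scored]
    total = sum(weights)
    if total == 0:
        return None
    lat = sum(w * la for w, (_, la, _) in zip(weights, scored)) / total
    lon = sum(w * lo for w, (_, _, lo) in zip(weights, scored)) / total
    return lat, lon


# Hypothetical index of (embedding, latitude, longitude) entries.
index = [
    ([1.0, 0.0], 10.0, 20.0),
    ([1.0, 0.0], 12.0, 22.0),
    ([0.0, 1.0], -30.0, 100.0),
]
estimate = knn_locate([1.0, 0.0], index, k=2)
```

       | The spread of the neighbors' similarities and locations could
       | also feed the kind of confidence indication requested elsewhere
       | in this thread.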
        
       ___________________________________________________________________
       (page generated 2024-05-02 23:02 UTC)