[HN Gopher] Wikipedia search-by-vibes through millions of pages ...
       ___________________________________________________________________
        
       Wikipedia search-by-vibes through millions of pages offline
        
       Author : gardenfelder
       Score  : 117 points
       Date   : 2023-09-01 19:55 UTC (3 hours ago)
        
 (HTM) web link (www.leebutterman.com)
 (TXT) w3m dump (www.leebutterman.com)
        
       | nullc wrote:
       | Good offline search could be a major advance in personal privacy.
        
       | googlryas wrote:
       | I like the concept, but I'm not having much luck. I entered
       | "weird looking monkey", hoping to get Proboscis, golden snub
       | nosed, etc, but I just end up with the articles "Pet
       | monkey","List of individual monkeys", "Ethnoprimatology",
       | "Monkey".
       | 
       | Whereas when I type the same query into google, I get exactly
       | what I expected. Which is kind of disappointing, I was hoping to
       | find out about some weird looking monkeys that I didn't know
       | about.
        
         | kikokikokiko wrote:
         | Wikipedia editors/guidelines are generally not in favor of
         | "opinionated" adjectives, and the use of "weird looking" in
         | your query sounds a lot like something that would be frowned
         | upon on a wikipedia article. This makes it hard for your search
         | to retrieve good results on this corpus of knowledge.
        
           | znagengast wrote:
           | It's also largely dependent on the embeddings model as others
           | have mentioned. Even if wikipedia doesn't have any words
           | specifically referring to that monkey as "weird", the model
           | itself would know to correlate this monkey's embeddings with
           | the "weird" concept. The main issue with this particular
           | implementation is the model used (all-minilm-l6-v2) which is
           | designed for speed and efficiency over accuracy.
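
A minimal Python sketch of the ranking step this sub-thread is talking about: embedding search scores every page vector against the query vector, usually by cosine similarity, and returns the closest pages. The three-dimensional vectors and the names `PAGES`, `QUERY`, and `rank` are invented for illustration. A real model such as all-MiniLM-L6-v2 emits 384-dimensional vectors, and nothing here is taken from the demo's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, page_vecs):
    """Return page titles ordered from most to least similar to the query."""
    return sorted(
        page_vecs,
        key=lambda title: cosine(query_vec, page_vecs[title]),
        reverse=True,
    )

# Hypothetical toy vectors, made up for this example only.
PAGES = {
    "Proboscis monkey": [0.9, 0.8, 0.1],
    "Pet monkey": [0.9, 0.1, 0.7],
    "Ethnoprimatology": [0.6, 0.0, 0.9],
}
QUERY = [0.8, 0.9, 0.0]  # stands in for the embedding of "weird looking monkey"

ranking = rank(QUERY, PAGES)
print(ranking)
```

Whether "weird looking" lands near the right pages depends entirely on where the model places those vectors, which is the point being made about model choice.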
        
       | dougb5 wrote:
       | Really nice implementation! And it's so cool to be able to do
       | this offline. The embeddings aren't quite there yet.
       | 
       | One trick that might be helpful is to embed only the defining
       | (usually the first) sentence or paragraph of the Wikipedia
       | article, rather than the whole document -- not clear to me which
       | portion you're using now.
       | 
       | My own site, OneLook, has had a similar feature
       | (https://onelook.com/thesaurus/) since '03 that lets you find
       | words and concepts by description. It was a pure reverse-
       | dictionary search back when I started, but over the past two
       | decades I've explored word embeddings, then sentence embeddings,
       | and more recently LLMs. Nowadays it uses GPT to generate some
       | guesses for inputs that it can't answer itself.
       | 
       | LLMs are _so_ much better than earlier methods at this task that
       | it's taken some of the wind out of my sails on improving this
       | aspect of OneLook. I frequently hear from people for whom
       | reverse-definition lookups are the main reason they use ChatGPT!
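
A rough Python sketch of the "embed only the defining sentence" trick suggested above: cut the article down to its first paragraph and first sentence before handing it to whatever embedding model is in use. The function name `lead_text` and the sample text are made up for illustration, and the regex sentence splitter is deliberately naive (abbreviations such as "St." would trip it up).

```python
import re

def lead_text(article: str, max_sentences: int = 1) -> str:
    """Return the first sentence(s) of the first paragraph of an article."""
    # Paragraphs are assumed to be separated by a blank line.
    first_paragraph = article.strip().split("\n\n", 1)[0]
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", first_paragraph)
    return " ".join(sentences[:max_sentences])

# Sample text written for this example, not copied from Wikipedia.
SAMPLE = (
    "A dachshund is a short-legged dog. It was bred to hunt badgers.\n\n"
    "Later sections of an article usually cover history and other details."
)

print(lead_text(SAMPLE))
```

The string returned here, and not the whole article, would then be the input to the embedding step.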
        
       | marginalia_nu wrote:
       | It's incredibly impressive for what it does, but the results
       | don't seem very good.
       | 
       | Although I know from experience it's really difficult to assess
       | search result quality by hand, you can be very close to something
       | great and return far worse matches than this does.
        
       | wrs wrote:
       | "Vibes" is a way more relatable term than "sentence embeddings".
       | I may need to start using that. :)
        
       | 1-6 wrote:
       | This is great news for those who suffer from memory recall
       | problems. Hope to see more edge devices handle this inferencing
       | locally.
        
       | gandalfff wrote:
       | Having this integrated into Kiwix would be great!
        
       | Vt71fcAqt7 wrote:
       | Are diacritics supported? Searching "ecorche" gave no relevant
       | results. Cf. google.[0]
       | 
       | [0]https://www.google.com/search?q=%C3%A9corch%C3%A9+site%3Aen...
       | .
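
One common way to make a search tolerant of diacritics, sketched in Python: normalize both the query and the indexed titles to an accent-free, case-folded form before comparing them. This is a general technique and an assumption about what might help, not a description of how the demo handles input; `strip_diacritics` is a hypothetical helper name.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining accent marks and fold case for comparison."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()

print(strip_diacritics("\u00e9corch\u00e9"))  # the accented spelling used in the Google link
```

With this applied on both sides, the accented and unaccented spellings map to the same key.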
        
       | crazygringo wrote:
       | This is certainly very interesting.
       | 
       | Unfortunately, I tried describing a few terms across philosophy
       | and psychology and for all of them, the entry I was aiming for
       | was only around 20th in the ranking. (Far more popular but less
       | accurate items were populated above it -- e.g. no matter what I
       | typed trying to define a specific modality of psychotherapy,
       | "psychotherapy" was always the #1 result.)
       | 
       | In contrast, I've used ChatGPT to identify the names of certain
       | niche subfields when I couldn't remember what they were called,
       | and it was right every time.
       | 
       | I love the idea of an AI service specifically designed to
       | identify the names of things from descriptions. But I don't think
       | restricting it to Wikipedia (or Wikipedia page titles) is the
       | right approach, and it seems like general-purpose LLMs are doing
       | a great job.
       | 
       | Still, as a proof of concept and as something you can run locally
       | in the browser, this is extremely cool.
        
         | PartiallyTyped wrote:
         | I have found myself describing ideas and goals, and getting
         | back a field, or rather the name of it and certain keywords to
         | look for. It seems that LLMs are the best fuzzy search engines
         | and work in a rather unique though possibly complementary way
         | to traditional search engines.
        
       | bagels wrote:
       | It's really cool, but why doesn't it link to the wikipedia
       | article?
        
       | jasonthorsness wrote:
       | The quality of the embeddings is a limiting factor for this sort
       | of search - OpenAI text-ada embeddings are great but that removes
       | the local aspect, and the better huggingface models are too big.
       | With the model sizes increasing it's hard to see what the path
       | will be for local/offline.
        
         | vikp wrote:
         | There are plenty of great embedding models that are on the
         | order of a few hundred megs (even outperforming ada-002). See
         | the leaderboard here -
         | https://huggingface.co/spaces/mteb/leaderboard. Local/offline
         | is only growing.
        
           | jasonjmcghee wrote:
           | Wow gte-small feels like a pretty great balance of size and
           | quality (all-MiniLM-L6-v2 has been my go-to)
        
       | jiofj wrote:
       | Low-hanging fruit: make article names clickable!
        
         | [deleted]
        
         | brianpan wrote:
         | Even lower-hanging fruit: put a space between the word and the
         | rank so I can word select the title to copy-paste it.
        
         | 1-6 wrote:
         | Low hanging fruit could be a mountain of effort for those who'd
         | otherwise continue to focus on improving the major feature.
        
           | hk__2 wrote:
           | > Low hanging fruit could be a mountain of effort for those
           | who'd otherwise continue to focus on improving the major
           | feature.
           | 
           | "https://en.wikipedia.org/wiki/" + encodeURIComponent(title)
           | 
           | Here you have it. The major feature is in the title: you can
           | hardly call it a Wikipedia search engine if you can't access
           | the articles.
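
For anyone following along outside the browser, here is a Python analogue of the one-liner above. It uses `urllib.parse.quote` where the JavaScript uses `encodeURIComponent`; the two differ slightly in which characters they leave unescaped. Replacing spaces with underscores follows Wikipedia's usual URL convention and is an addition here, not part of the original expression. `wikipedia_url` is a hypothetical helper name.

```python
from urllib.parse import quote

def wikipedia_url(title: str) -> str:
    """Build an English Wikipedia article URL from a page title."""
    # Wikipedia URLs conventionally use underscores in place of spaces.
    slug = title.replace(" ", "_")
    # safe="" percent-encodes everything except letters, digits and _.-~
    return "https://en.wikipedia.org/wiki/" + quote(slug, safe="")

print(wikipedia_url("Sequoia sempervirens"))
```

Titles containing punctuation, such as "Bourse de commerce (Paris)", come out percent-encoded and still resolve.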
        
       | kemayo wrote:
       | The page is currently failing to work for me, because
       | `model_quantized.onnx` isn't loading -- I'm watching it and it
       | has currently managed to get 5.3MB downloaded as I type this, so
       | if every visitor is triggering that...
       | 
       | I think we may be doing awful things to Lee Butterman's bandwidth
       | bill.
        
       | gardenfelder wrote:
       | >This is a browser-based search engine for Wikipedia, where you
       | can search for "the reddish tall trees on the san francisco
       | coast" and find results like "Sequoia sempervirens" (a name of a
       | redwood tree).
        
       | atombender wrote:
       | I don't know, I wanted to like this, but I didn't get any
       | relevant matches for any of the searches I tried:
       | 
       | * "The wizard in The Lord of the Rings": No Gandalf or Saruman,
       | only books about LOTR and such.
       | 
       | * "Protagonist of Scorsese's Taxi Driver": No Travis Bickle.
       | 
       | * "A person that plants trees for a living": Somehow a gardener
       | isn't on the list.
       | 
       | * "Curly-haired painter on TV": No Bob Ross anywhere.
       | 
       | * "Unusually shaped modern art museum in Spain": Bilbao does show
       | up as number 4, but none of the others are unusually shaped.
       | 
       | * "Dog shaped like a sausage": Surely a dachshund should be in
       | the top results.
        
         | thewakalix wrote:
         | It's worth noting that every result you wanted here _does_ have
         | a Wikipedia article. (If they hadn't, then their absence
         | wouldn't be as strange.)
        
         | [deleted]
        
       | rgbrgb wrote:
       | Love this demo but as others noted it's really easy to find
       | queries where it performs poorly (e.g. typos).
       | 
       | Looks like the embedding model used (all-minilm-l6-v2) currently
       | ranks 35th on the Hugging Face leaderboard [0]. I'd love to try
       | with other models if anyone wants to +1 this demo :). This feels
       | like a nice dataset to build intuition around embeddings used for
       | RAG etc.
       | 
       | [0]: https://huggingface.co/spaces/mteb/leaderboard
        
       | lovasoa wrote:
       | The tech is very impressive but the results are not.
       | 
       | I searched "pointy building in Paris", and got:
       | 
       | Tourism in Paris, Bourse de commerce (Paris), Grands Projets of
       | Francois Mitterrand, List of tallest buildings and structures in
       | the Paris region, List of tourist attractions in Paris, Palais
       | des congres de Paris, Landmarks in Paris, Palais de la Bourse,
       | Lyon, Outline of Paris, Architecture of Paris
       | 
       | no mention of the most famous pointy building in Paris...
       | 
       | Maybe sentence embedding of the entire article is not the best
       | thing for this kind of application.
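
A toy Python illustration of why a single vector for the whole article can dilute a match, and of one common alternative: score each paragraph separately and keep the best one. The `embed` function below is a bag-of-words counter standing in for a real sentence-embedding model, so the numbers only show the dilution effect and say nothing about the demo's actual scores. The sample paragraphs and all function names are invented for this example.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in 'embedding': word counts. A real model would go here."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def whole_article_score(query, paragraphs):
    """Score the query against one vector for the entire article."""
    return cosine(embed(query), embed(" ".join(paragraphs)))

def best_paragraph_score(query, paragraphs):
    """Score the query against each paragraph and keep the maximum."""
    return max(cosine(embed(query), embed(p)) for p in paragraphs)

# Sample paragraphs written for this example.
PARAGRAPHS = [
    "a tall pointy tower in paris",
    "construction began in 1887 and took two years",
    "millions of visitors climb the stairs every year",
]
QUERY = "pointy tower in paris"

whole = whole_article_score(QUERY, PARAGRAPHS)
best = best_paragraph_score(QUERY, PARAGRAPHS)
print(round(whole, 3), round(best, 3))
```

The per-paragraph score comes out higher because the unrelated paragraphs no longer drag the vector away from the query. The cost is more vectors to store, which matters for an offline index.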
        
         | extraduder_ire wrote:
         | If you mean the Eiffel Tower, it's not a building.
         | 
         | I just checked the article, and of the 19 times the word
         | "building" appears, it's mostly a verb, followed by "Chrysler
         | Building"
         | 
         | Unless there's some other famously pointy building I'm not
         | thinking of.
        
           | yorwba wrote:
           | https://www.wikidata.org/wiki/Q243 (the Eiffel Tower, which
           | is so famous it gets an extremely small ID) is an instance of
           | https://www.wikidata.org/wiki/Q1440476 (lattice tower), a
           | subclass of https://www.wikidata.org/wiki/Q12518 (tower) a
           | subclass of https://www.wikidata.org/wiki/Q41176 (building).
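
The chain in the comment above can be walked mechanically. This Python sketch hardcodes only the four items and three links quoted there. Real Wikidata items can have several "instance of" and "subclass of" parents, and a real lookup would query Wikidata instead of a local dict, so the single-parent `PARENT` table is a simplifying assumption. The names `LABELS`, `PARENT`, `ancestors`, and `is_a` are hypothetical.

```python
# Labels and links copied from the comment above.
LABELS = {
    "Q243": "Eiffel Tower",
    "Q1440476": "lattice tower",
    "Q12518": "tower",
    "Q41176": "building",
}
# Simplification: one parent per item ("instance of" or "subclass of").
PARENT = {
    "Q243": "Q1440476",
    "Q1440476": "Q12518",
    "Q12518": "Q41176",
}

def ancestors(item):
    """Follow parent links upward, returning every class reached in order."""
    chain = []
    seen = {item}
    while item in PARENT:
        item = PARENT[item]
        if item in seen:  # guard against accidental cycles in the table
            break
        seen.add(item)
        chain.append(item)
    return chain

def is_a(item, target):
    """True if target is reachable from item through the parent links."""
    return target in ancestors(item)

print([LABELS[q] for q in ancestors("Q243")])
print(is_a("Q243", "Q41176"))
```

A search index could use a check like this to let "building" queries reach items whose articles never use the word as a noun.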
        
         | sp332 wrote:
         | At least 5 of those would have the answer to your question.
        
       ___________________________________________________________________
       (page generated 2023-09-01 23:00 UTC)