[HN Gopher] Wikipedia search-by-vibes through millions of pages ...
___________________________________________________________________
Wikipedia search-by-vibes through millions of pages offline
Author : gardenfelder
Score : 117 points
Date : 2023-09-01 19:55 UTC (3 hours ago)
(HTM) web link (www.leebutterman.com)
(TXT) w3m dump (www.leebutterman.com)
| nullc wrote:
| Good offline search could be a major advance in personal privacy.
| googlryas wrote:
| I like the concept, but I'm not having much luck. I entered
| "weird looking monkey", hoping to get Proboscis, golden
| snub-nosed, etc., but I just end up with the articles "Pet
| monkey", "List of individual monkeys", "Ethnoprimatology",
| "Monkey".
|
| Whereas when I type the same query into google, I get exactly
| what I expected. Which is kind of disappointing, I was hoping to
| find out about some weird looking monkeys that I didn't know
| about.
| kikokikokiko wrote:
| Wikipedia editors/guidelines are generally not in favor of
| "opinionated" adjectives, and the use of "weird looking" in
| your query sounds a lot like something that would be frowned
| upon on a wikipedia article. This makes it hard for your search
| to retrieve good results on this corpus of knowledge.
| znagengast wrote:
| It's also largely dependent on the embeddings model as others
| have mentioned. Even if wikipedia doesn't have any words
| specifically referring to that monkey as "weird", the model
| itself would know to correlate this monkey's embeddings with
| the "weird" concept. The main issue with this particular
| implementation is the model used (all-minilm-l6-v2) which is
| designed for speed and efficiency over accuracy.
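For readers who have not worked with embeddings, here is a minimal sketch of the ranking step being discussed. It assumes the common approach of comparing a query vector against one vector per article by cosine similarity. The two-dimensional vectors and the titles "A", "B", "C" are made up for illustration; a real model such as all-minilm-l6-v2 produces much longer vectors, and nothing below is taken from the demo's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query_vec, article_vecs):
    """Return article titles ordered from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), title) for title, vec in article_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]

# Hypothetical toy data: in practice these vectors come from the embedding model.
TOY_VECTORS = {
    "A": [0.9, 0.1],
    "B": [0.0, 1.0],
    "C": [0.5, 0.5],
}

print(rank([1.0, 0.0], TOY_VECTORS))
```

With this setup, result quality depends entirely on how well the model places the query and the "right" article near each other, which is the point made above about model choice.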
| dougb5 wrote:
| Really nice implementation! And it's so cool to be able to do
| this offline. The embeddings aren't quite there yet.
|
| One trick that might be helpful is to embed only the defining
| (usually the first) sentence or paragraph of the Wikipedia
| article, rather than the whole document -- not clear to me which
| portion you're using now.
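A rough sketch of the trick suggested above, assuming the article is available as plain text with paragraphs separated by blank lines. The function name `lead_paragraph` is hypothetical, and how the demo actually selects text to embed is not stated in this thread.

```python
def lead_paragraph(article_text: str) -> str:
    """Return the first non-empty paragraph, with internal whitespace collapsed."""
    for block in article_text.split("\n\n"):
        collapsed = " ".join(block.split())
        if collapsed:
            return collapsed
    return ""

sample = "\n\nSequoia sempervirens is a tree.\nIt is tall.\n\nHistory and other sections follow."
print(lead_paragraph(sample))
```

The returned string, rather than the whole document, would then be passed to the embedding model.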
|
| My own site, OneLook, has had a similar feature
| (https://onelook.com/thesaurus/) since '03 that lets you find
| words and concepts by description. It was a pure reverse-
| dictionary search back when I started, but over the past two
| decades I've explored word embeddings, then sentence embeddings,
| and more recently LLMs. Nowadays it uses GPT to generate some
| guesses for inputs that it can't answer itself.
|
| LLMs are _so_ much better than earlier methods at this task, it's
| taken some of the wind out of my sails on improving this
| aspect of OneLook. I frequently hear from people for whom
| reverse-definition lookups are the main reason they use ChatGPT!
| marginalia_nu wrote:
| It's incredibly impressive for what it does, but the results
| don't seem very good.
|
| Then again, I know from experience it's really difficult to
| assess search result quality by hand: you can be very close to
| something great and still return far worse matches than this does.
| wrs wrote:
| "Vibes" is a way more relatable term than "sentence embeddings".
| I may need to start using that. :)
| 1-6 wrote:
| This is great news for those who suffer from memory recall
| problems. Hope to see more edge devices handle this inferencing
| locally.
| gandalfff wrote:
| Having this integrated into Kiwix would be great!
| Vt71fcAqt7 wrote:
| Are diacritics supported? Searching "écorché" gave no relevant
| results. Cf. google.[0]
|
| [0]https://www.google.com/search?q=%C3%A9corch%C3%A9+site%3Aen...
| .
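One common way to make lexical matching insensitive to diacritics is to decompose the text with Unicode normalization and drop the combining marks, so that "écorché" and "ecorche" compare equal. This is a sketch of that general technique only; whether the demo normalizes queries or titles this way is not known from the thread, and `fold_diacritics` is a hypothetical helper name.

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Strip combining marks after NFD decomposition (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_diacritics("écorché"))
```

Applying the same folding to both the query and the indexed titles keeps the comparison symmetric.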
| crazygringo wrote:
| This is certainly very interesting.
|
| Unfortunately, I tried describing a few terms across philosophy
| and psychology and for all of them, the entry I was aiming for
| was only around the ~20th rank. (Far more popular but less
| accurate items were populated above it -- e.g. no matter what I
| typed trying to define a specific modality of psychotherapy,
| "psychotherapy" was always the #1 result.)
|
| In contrast, I've used ChatGPT to identify the names of certain
| niche subfields when I couldn't remember what they were called,
| and it was right every time.
|
| I love the idea of an AI service specifically designed to
| identify the names of things from descriptions. But I don't think
| restricting it to Wikipedia (or Wikipedia page titles) is the
| right approach, and it seems like general-purpose LLMs are doing
| a great job.
|
| Still, as a proof of concept and as something you can run locally
| in the browser, this is extremely cool.
| PartiallyTyped wrote:
| I have found myself describing ideas and goals, and getting
| back a field, or rather the name of it and certain keywords to
| look for. It seems that LLMs are the best fuzzy search engines
| and work in a rather unique though possibly complementary way
| to traditional search engines.
| bagels wrote:
| It's really cool, but why doesn't it link to the wikipedia
| article?
| jasonthorsness wrote:
| The quality of the embeddings is a limiting factor for this sort
| of search - OpenAI text-ada embeddings are great but that removes
| the local aspect, and the better huggingface models are too big.
| With the model sizes increasing it's hard to see what the path
| will be for local/offline.
| vikp wrote:
| There are plenty of great embedding models that are on the
| order of a few hundred megs (even outperforming ada-002). See
| the leaderboard here -
| https://huggingface.co/spaces/mteb/leaderboard. Local/offline
| is only growing.
| jasonjmcghee wrote:
| Wow gte-small feels like a pretty great balance of size and
| quality (all-MiniLM-L6-v2 has been my go-to)
| jiofj wrote:
| Low-hanging fruit: make article names clickable!
| brianpan wrote:
| Even lower-hanging fruit: put a space between the word and the
| rank so I can word select the title to copy-paste it.
| 1-6 wrote:
| Low hanging fruit could be a mountain of effort for those who'd
| otherwise continue to focus on improving the major feature.
| hk__2 wrote:
| > Low hanging fruit could be a mountain of effort for those
| who'd otherwise continue to focus on improving the major
| feature.
|
| "https://en.wikipedia.org/wiki/" + encodeURIComponent(title)
|
| Here you have it. The major feature is in the title: you can
| hardly call it a Wikipedia search engine if you can't access
| the articles.
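The same idea as the JavaScript one-liner above, sketched in Python. Replacing spaces with underscores follows the form Wikipedia uses in its own article URLs and is an addition here, not part of the original comment; `wikipedia_url` is a hypothetical helper name.

```python
from urllib.parse import quote

def wikipedia_url(title: str) -> str:
    """Build an English Wikipedia article URL from a page title."""
    # Spaces become underscores; everything else outside the unreserved set is percent-encoded.
    return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"), safe="")

print(wikipedia_url("Sequoia sempervirens"))
```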
| kemayo wrote:
| The page is currently failing to work for me, because
| `model_quantized.onnx` isn't loading -- I'm watching it and it
| has currently managed to get 5.3MB downloaded as I type this, so
| if every visitor is triggering that...
|
| I think we may be doing awful things to Lee Butterman's bandwidth
| bill.
| gardenfelder wrote:
| >This is a browser-based search engine for Wikipedia, where you
| can search for "the reddish tall trees on the san francisco
| coast" and find results like "Sequoia sempervirens" (a name of a
| redwood tree).
| atombender wrote:
| I don't know, I wanted to like this, but I didn't get any
| relevant matches for any of the searches I tried:
|
| * "The wizard in The Lord of the Rings": No Gandalf or Saruman,
| only books about LOTR and such.
|
| * "Protagonist of Scorsese's Taxi Driver": No Travis Bickle.
|
| * "A person that plants trees for a living": Somehow a gardener
| isn't on the list.
|
| * "Curly-haired painter on TV": No Bob Ross anywhere.
|
| * "Unusually shaped modern art museum in Spain": Bilbao does show
| up as number 4, but none of the others are unusually shaped.
|
| * "Dog shaped like a sausage": Surely a dachshund should be in
| the top results.
| thewakalix wrote:
| It's worth noting that every result you wanted here _does_ have
| a Wikipedia article. (If they hadn't, then their absence
| wouldn't be as strange.)
| rgbrgb wrote:
| Love this demo but as others noted it's really easy to find
| queries where it performs poorly (e.g. typos).
|
| Looks like the embedding model used (all-minilm-l6-v2) currently
| ranks 35th on the hugging face leaderboard [0]. I'd love to try
| with other models if anyone wants to +1 this demo :). This feels
| like a nice dataset to build intuition around embeddings used for
| RAG etc.
|
| [0]: https://huggingface.co/spaces/mteb/leaderboard
| lovasoa wrote:
| The tech is very impressive but the results are not.
|
| I searched "pointy building in Paris", and got:
|
| Tourism in Paris, Bourse de commerce (Paris), Grands Projets of
| François Mitterrand, List of tallest buildings and structures in
| the Paris region, List of tourist attractions in Paris, Palais
| des congrès de Paris, Landmarks in Paris, Palais de la Bourse,
| Lyon, Outline of Paris, Architecture of Paris
|
| no mention of the most famous pointy building in Paris...
|
| Maybe sentence embedding of the entire article is not the best
| thing for this kind of application.
| extraduder_ire wrote:
| If you mean the Eiffel Tower, it's not a building.
|
| I just checked the article, and of the 19 times the word
| "building" appears, it's mostly a verb, followed by "Chrysler
| Building"
|
| Unless there's some other famously pointy building I'm not
| thinking of.
| yorwba wrote:
| https://www.wikidata.org/wiki/Q243 (the Eiffel Tower, which
| is so famous it gets an extremely small ID) is an instance of
| https://www.wikidata.org/wiki/Q1440476 (lattice tower), a
| subclass of https://www.wikidata.org/wiki/Q12518 (tower) a
| subclass of https://www.wikidata.org/wiki/Q41176 (building).
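The chain described above can be followed mechanically. The sketch below hard-codes only the four items and three links named in the comment; it does not query Wikidata, and the real graph has many more edges per item. It also treats "instance of" and "subclass of" as a single parent link, which is a simplification. The names `PARENT`, `ancestors`, and `is_a` are hypothetical.

```python
# Links copied from the comment above; not fetched from Wikidata.
PARENT = {
    "Q243": "Q1440476",    # Eiffel Tower -> lattice tower (instance of)
    "Q1440476": "Q12518",  # lattice tower -> tower (subclass of)
    "Q12518": "Q41176",    # tower -> building (subclass of)
}

def ancestors(item: str) -> list:
    """Follow parent links upward and return every item reached, nearest first."""
    chain = []
    seen = {item}
    while item in PARENT:
        item = PARENT[item]
        if item in seen:  # guard against cycles in hand-entered data
            break
        seen.add(item)
        chain.append(item)
    return chain

def is_a(item: str, target: str) -> bool:
    """True if target is reachable from item via the parent links."""
    return target in ancestors(item)

print(is_a("Q243", "Q41176"))
```

Under these links the Eiffel Tower does count as a building, which is the argument being made in reply to the comment above.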
| sp332 wrote:
| At least 5 of those would have the answer to your question.
___________________________________________________________________
(page generated 2023-09-01 23:00 UTC)