[HN Gopher] Latent Dictionary: 3D map of Oxford3000+search words...
___________________________________________________________________
Latent Dictionary: 3D map of Oxford3000+search words via DistilBert
embeddings
Author : pps
Score : 62 points
Date : 2023-12-30 13:07 UTC (9 hours ago)
(HTM) web link (latentdictionary.com)
(TXT) w3m dump (latentdictionary.com)
| eurekin wrote:
| I added those in succession:
|
| > man woman king queen ruler force powerful care
|
| and couldn't reliably determine the position of any of them.
| larodi wrote:
| Is this with some sort of dimensionality reduction of the
| embedding space?
| kaoD wrote:
| In the bottom left "?" button it says it performs PCA down to 3
| dimensions. That's going to lose a ton of information,
| rendering the space mostly useless.
| behnamoh wrote:
| Yeah, it's a fun but useless project.
| tikimcfee wrote:
| I don't know about useless. I think there is some real
| magic waiting to be discovered in mapping language to some
| specific and enlightening visualization, and I think it
| involves something like this: using statistics and simple
| spatial relationships to create a "mapping" of a single
| individual's word space.
|
| Imagine walking around the world and seeing everyone's
| slightly unique relationship space of words. This is
| something I have envisioned for a very long time.
| kaoD wrote:
| What you're describing exists. It's called "embeddings";
| it's one of the first steps ChatGPT performs to do its
| magic, and it is indeed very useful.
|
| What renders this useless is reducing the dimensionality
| from thousands to just 3.
| smrtinsert wrote:
| It is lossy, but doesn't PCA function as a grouping even when
| forced like this?
| thom wrote:
| Seems mostly nonsensical, not sure if that's a bug or some deeper
| point I'm missing.
| kvakkefly wrote:
| Running the same search multiple times, I get different
| visualizations. I don't really understand what's going on, but
| I like the idea of visualizing embeddings.
| wrsh07 wrote:
| I think the PCA dimensionality reduction is non-deterministic,
| but I say this with really low confidence.
| soVeryTired wrote:
| PCA is purely deterministic (and might not give great
| results). My guess is this is done by t-SNE or UMAP, both of
| which depend on a seed.
| minimaxir wrote:
| The About page explicitly says it's PCA.
|
| PCA is fine enough to get it into 3D.
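|
| For what it's worth, the determinism point is easy to check
| directly. A minimal sketch, assuming scikit-learn's PCA and
| t-SNE (the site itself only says it uses PCA):
|
|     import numpy as np
|     from sklearn.decomposition import PCA
|     from sklearn.manifold import TSNE
|
|     rng = np.random.default_rng(0)
|     X = rng.normal(size=(200, 768))  # stand-in for 768-dim embeddings
|
|     # PCA: the same input gives the same projection every run
|     p1 = PCA(n_components=3, svd_solver="full").fit_transform(X)
|     p2 = PCA(n_components=3, svd_solver="full").fit_transform(X)
|     print(np.allclose(p1, p2))  # True
|
|     # t-SNE: random initialisation gives a different layout unless seeded
|     t1 = TSNE(n_components=3, init="random").fit_transform(X)
|     t2 = TSNE(n_components=3, init="random").fit_transform(X)
|     print(np.allclose(t1, t2))  # almost certainly False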
| wrsh07 wrote:
| I wish there were more context and maybe the ability to do math
| on the vectors
|
| E.g. what is the real distance between two vectors? That should
| be easy to compute
|
| Similarly: what do I get from summing two vectors and what are
| some nearby vectors?
|
| Maybe just generally: what are some nearby vectors?
|
| Without any additional context it's just a point cloud with a
| couple of randomly labeled elements
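|
| None of that math is hard to bolt on. A minimal sketch of the
| distance / vector-sum / nearest-neighbour queries, assuming the
| sentence-transformers library and all-MiniLM-L6-v2 rather than
| the site's DistilBERT setup:
|
|     from sentence_transformers import SentenceTransformer, util
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     words = ["king", "queen", "man", "woman", "ruler", "force", "care"]
|     emb = model.encode(words, normalize_embeddings=True)
|     vec = dict(zip(words, emb))
|
|     # distance between two vectors (cosine similarity here)
|     print(float(util.cos_sim(vec["king"], vec["queen"])))
|
|     # sum two vectors, then rank the tiny vocabulary by proximity
|     target = vec["king"] + vec["woman"]
|     sims = util.cos_sim(target, emb)[0].tolist()
|     for w, s in sorted(zip(words, sims), key=lambda t: -t[1]):
|         print(w, round(s, 3))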
| tikimcfee wrote:
| If I gave you a live GPU shader that let you arbitrarily
| position any of, say, a few million words with simple Cartesian
| coordinates, what would you do with it? Whole words expressed
| as individual letters - not symbols, representations, or
| abstractions. Just letters arranged in a specific order to form
| words.
| theaussiestew wrote:
| I would want to have multilingual embeddings so I can learn
| languages more efficiently. Being able to see clouds of
| different words in different languages would allow me to
| contextualize them more easily. Same for phrases and
| sentences.
| tikimcfee wrote:
| I'm very much with you. Since most languages I know of are
| written with combinations of ordered glyphs, they all get
| rendered the same (although I don't handle about 20k
| current Unicode characters of the full 400k+, and none of
| the grouping really works, so RTL languages would be a mess
| for individual words).
|
| However, this is exactly where I want to go. A dictionary
| is a cyclic graph of words mapping to words. That means
| there's at least one finite way to visit every single node
| and give it a position, with a direct relationship to the
| words that define it, and those words that define them, and
| so on.
|
| This creates an arbitrary and unique geometric structure
| per language, and if you get fancy and create modifiers for
| an individual's vocabulary, you can even create transforms
| for a "base" dictionary, _and the way someone chooses to
| use certain words differently_. You would be able to see,
| but likely not understand, the "structure" of types of text
| - poetry, storytelling, instructional writing, etc.
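|
| As a toy version of that idea: build a directed graph from each
| headword to the words in its definition, then let a force-directed
| layout assign every node a 3D position determined by the words
| that define it. A sketch assuming networkx and an invented
| three-entry mini-dictionary (not how LookAtThat works):
|
|     import networkx as nx
|
|     # hypothetical mini-dictionary: headword -> words in its definition
|     mini_dict = {
|         "dog": ["animal", "domestic", "bark"],
|         "animal": ["living", "organism"],
|         "bark": ["sound", "dog"],  # cycle: bark -> dog -> bark
|     }
|
|     G = nx.DiGraph()
|     for word, definition in mini_dict.items():
|         for defining_word in definition:
|             G.add_edge(word, defining_word)
|
|     # force-directed layout in 3D: positions fall out of definition links
|     positions = nx.spring_layout(G, dim=3, seed=42)
|     for word, xyz in positions.items():
|         print(word, xyz.round(2))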
| refulgentis wrote:
| You're actually kinda hitting the nail on the head.
| _Generally_, the word2vec woman + king = queen thing was cute
| but not very real.
|
| People rarely have to get down to the real true metal on the
| embedding models, and they're not what people think they are
| from their memory of word2vec. E.g. there's actually one vector
| emitted _per token_; the final vector is the mean. And cosine
| distance for similarity is the only metric anyone is training
| for.
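|
| To make the per-token point concrete, a minimal sketch assuming
| Hugging Face transformers and distilbert-base-uncased (the model
| family the site's title names):
|
|     import torch
|     from transformers import AutoModel, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
|     model = AutoModel.from_pretrained("distilbert-base-uncased")
|
|     inputs = tok("the quick brown fox", return_tensors="pt")
|     with torch.no_grad():
|         out = model(**inputs)
|
|     token_vectors = out.last_hidden_state        # one vector per token
|     print(token_vectors.shape)                   # e.g. [1, 6, 768]
|
|     sentence_vector = token_vectors.mean(dim=1)  # mean pooling
|     print(sentence_vector.shape)                 # [1, 768]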
|
| In summary, there's ~no reason to think a visualization trying
| to show multiple vectors will ever be meaningful. Even just
| starting from "they have way way way more dimensions than we
| can represent visually" is enough to rule it out.
|
| MiniLM v2, the foundation of most vector DBs, is 384 dims.
|
| n.b. dear reader, if you've heard of that: you should be using
| v3! V3 is for asymmetric search, aka query => result docs. V2
| is for symmetric search, aka chunk of text => similarly worded
| chunks of texts. It's very very funny how few people read the
| docs, in this case, the sentence transformers site.
| tikimcfee wrote:
| Edit: I think this is fascinating. If you use words like dog,
| electric, life, and human, all of them appear in one mass;
| however, words like greet, chicken, and "a" appear in a
| different dense section. I think it's interesting that the
| words have diverged in location, with some apparent relationship
| to the way the words are used. If this were truly random, I
| would expect those words to be mixed in with the other ones.
|
| I have something like this, except you can see every single word
| in any dictionary at once in space; it renders individual glyphs.
| It can show an entire dictionary of words - definitions and roots
| - and let you fly around in them. It's fun. I built a sample that
| "plays" a sentence and its definitions:
| GitHub.com/tikimcfee/LookAtThat. The more I see stuff like this,
| the more I want to complete it. It's heartening to see so many
| people fascinated with seeing words... I just wish I knew where
| to find these people to, like, befriend and get better. I'm
| getting the feeling I just kind of exist between worlds of lofty
| ideas and people that are incredibly smart sticking around other
| people that are incredibly smart.
| tetris11 wrote:
| Interesting that "cromulent" and "hentai" seem to map right next
| to each other, as well as the words "decorate" and "spare".
| tikimcfee wrote:
| Similarly, words like "I" and "am" appear in a slightly
| different dense section of the map by default.
| smrtinsert wrote:
| I would love a quickest-path feature between two words - for
| example, between color and colour.
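|
| Not a real geodesic, but a crude stand-in is easy to sketch with
| static vectors: linearly interpolate between the two embeddings
| and look up the nearest vocabulary word at each step. This
| assumes gensim's downloadable GloVe vectors, not the site's
| DistilBERT setup, and for a pair as close as color/colour every
| step may simply land on one of the two endpoints:
|
|     import numpy as np
|     import gensim.downloader as api
|
|     wv = api.load("glove-wiki-gigaword-100")  # small static vectors
|
|     a, b = wv["color"], wv["colour"]
|     for t in np.linspace(0.0, 1.0, 5):
|         point = (1 - t) * a + t * b  # crude linear interpolation
|         word, score = wv.similar_by_vector(point, topn=1)[0]
|         print(f"t={t:.2f}: nearest word is {word} ({score:.3f})")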
| tudorw wrote:
| I think that's going to be a geodesic in a hyper-dimensional
| manifold. There was an article here about 'wordlets' on a
| hyper-sphere, and a piece on time, LLMs, and the related
| manifold. Visualising LLM topology (multi-dimensional
| topological manifolds) is a very rich area for exploration. I'm
| waiting for someone to use PHATE to do the dimension reduction;
| it's used in neuroscience to reduce dimensionality, providing
| information not visible using PCA, t-SNE, LDA, or UMAP.
| patcon wrote:
| Yep, been thinking on that paper as well:
|
| Traveling words. There's code!
| https://arxiv.org/abs/2309.07315
|
| https://github.com/santiag0m/traveling-words
| chaxor wrote:
| Typically these types of single-word embedding visualizations
| work much better with non-contextualized models such as the more
| traditional gensim or word2vec approaches, as contextual encoder-
| based embedding models like BERT don't 'bake in' as much to the
| token (word) itself, and rather rely on its context to define it.
| Also, PCA for contextual models like BERT often ends up with
| $PC_0$ aligned with the length of the document.
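|
| For comparison, the static-embedding baseline is easy to poke at.
| A minimal sketch, assuming gensim's pre-trained GloVe download
| (any non-contextualized vectors would do):
|
|     import gensim.downloader as api
|
|     # static vectors: one fixed vector per word, no context involved
|     wv = api.load("glove-wiki-gigaword-100")
|
|     for word in ["king", "dog", "electric"]:
|         print(word, wv.most_similar(word, topn=3))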
| minimaxir wrote:
| Some notes on how embeddings/DistilBERT embeddings work since the
| other comments are confused:
|
| 1) There are two primary ways to have models generate embeddings:
| implicitly from an LLM, by mean-pooling its last hidden state
| (since it has to learn how to map text into a distinct latent
| space anyway to work correctly - e.g. DistilBERT), or explicitly,
| with a model that generates embeddings directly and is trained
| using something like triplet loss to incentivise learning
| similarity/dissimilarity. Popular text-embedding models like
| BAAI/bge-large-en-v1.5 tend to use the latter approach.
|
| 2) The famous word2vec examples of e.g. woman + king = queen only
| work because word2vec is a shallow network and the model learns
| the word embeddings directly, instead of them being emergent. The
| latent space still maps them closely, as shown with this demo, but
| there isn't any algebraic intuition. You can get _close_ with
| algebra, but no cigar.
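|
| The "close but no cigar" part is visible with static vectors too.
| A sketch assuming gensim's pre-trained GloVe vectors; the exact
| rankings depend on which vectors you load:
|
|     import gensim.downloader as api
|
|     wv = api.load("glove-wiki-gigaword-100")
|
|     # king - man + woman: "queen" ranks highly, helped by the fact
|     # that most_similar excludes the query words themselves
|     print(wv.most_similar(positive=["king", "woman"],
|                           negative=["man"], topn=3))
|
|     # the raw arithmetic alone usually lands nearest to "king" itself
|     target = wv["king"] - wv["man"] + wv["woman"]
|     print(wv.similar_by_vector(target, topn=3))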
|
| 3) DistilBERT is pretty old (2019) and based on a 2018 model
| trained on Wikipedia and books, so there will be significant text
| drift, in addition to it missing out on newer modeling techniques
| and more robust datasets. I do not recommend using it for
| production applications nowadays.
|
| 4) There is an under-discussed opportunity for dimensionality
| reduction techniques like PCA (which this demo uses to get the
| data into 3D) to improve both signal-to-noise ratio and
| distinctiveness. I am working on a blog post about a new technique
| for handling dimensionality reduction for text embeddings better,
| which may have interesting and profound usability implications.
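|
| The reduction step the demo performs is itself only a few lines.
| A minimal sketch, assuming scikit-learn and a stand-in embedding
| matrix in place of real DistilBERT outputs:
|
|     import numpy as np
|     from sklearn.decomposition import PCA
|
|     # one 768-dim vector per word (random stand-in data here)
|     rng = np.random.default_rng(0)
|     embeddings = rng.normal(size=(3000, 768))
|
|     pca = PCA(n_components=3)
|     coords_3d = pca.fit_transform(embeddings)
|
|     print(coords_3d.shape)                      # (3000, 3)
|     print(pca.explained_variance_ratio_.sum())  # how much variance survives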
| pyinstallwoes wrote:
| I've been ruminating on the postulation of a universal
| signature for every entity across sensory complexes (per sense
| organ reality, vision, touch, mind) which translates to the
| problem of entities represented in binary needing to be related
| across modalities as in "butterfly" vs a picture of a butterfly
| vs the audio of butterfly vs the thought pointing to one of
| those or other.
|
| I was wondering if there was a universal signal that can be
| used as the identity and then based on that signal one could
| measure the distance to any other signal based on the principle
| relation of not(other). That is to say the identity would be
| precisely not all else for any X. Said another way, every thing
| is because it is exactly not everything else.
|
| So, thinking in terms as close to first principles as possible,
| I wondered: is it possible to represent everything as some
| frequency - a Fourier transform analog for every "time slice" of
| a thing? This is where it gets slightly slippery.
|
| So the idea was trying to build relationship and identity and
| labeling from a simple rule set of things arising out of
| relation of not being other things.
|
| In my mind I saw nodes on a graph forming in higher dimensions
| as halfway points for any comparison. Comparisons create new
| nodes and implicitly have a distance metric to all other
| things. It made sense in my mind that there was an algorithmic
| annealing to new nodes in a "low density higher energetic
| state" allowing them to move faster in this universal emergent
| ontology/spatial space; eventually getting more dense and
| slower as it gets cold.
|
| So the system implicitly also has a snapshot of events or
| interactions based on that where every comparison has a "tick"
| that encodes a particular density relation for some set of
| nodes it's in association with.
|
| The idea that cemented it all together was to treat each node
| like an address:chord. Similar to chording keys like a-b-c in
| some ux programs, but also exactly like chords in music too.
|
| The idea being that when multiple "things" are dialed in at
| the same time, it becomes its own emergent label by proximity and
| association of those things being triggered to new information
| coming in classified as a distance to not(signal).
|
| I didn't really realize how close this idea was to what
| encoders/decoders seem to be doing although I do know I'm
| trying to think myself towards a universal solution that
| doesn't require special encoders for every media type. Hence
| the Fourier transform path.
|
| Know anything like this or am I spitting idiocy?
| pyinstallwoes wrote:
| So the alphabet a to z... on their own the symbols mean
| nothing, but when compared to every other letter, meaning
| arises. Then iterate recursively outward for every growth in
| structure: letter to letter, words to words, paragraphs to
| paragraphs. Each one has a "dependent arising" of meaning
| based precisely on the relation to the other.
|
| Which is more or less word2vec, as far as I understand, but
| then trying to extrapolate that as a universal principle to
| all things that can be represented, by using a "common
| signature: a hash based off a signal like a complex waveform",
| then doing a difference on signal composition and its
| shape/bandwidth to compare its properties to other things;
| when they reference similar objects, even in different
| modalities, they'd be associated by being triggered together.
|
| So "dog" vs image of dog would both translate to a primordial
| signal : identity representation and in the domain of
| frequency do the comparison and project a coordinate in the
| spatial sense and eventually those two nodes would more
| likely be triggered at the same time due to the likelihood of
| "dog" being next to image of dog when parsing information
| across future events.
|
| Whew. Maybe I'm just talking to myself. At least it's out
| there if it makes sense to anyone else.
| pyinstallwoes wrote:
| The key requirement in my mind here is that the "universal
| identifier" is a form of attempting something like a
| deterministic signature for all things. The hunch is based
| on the hypothesis that the primordial representation of any
| and all things is frequency.
|
| But of course each "ontologically capable system" would still
| need to process the identity function to start making sense
| of things based on signals being unlike other signals, so
| deterministic is shallow but concrete.
| minimaxir wrote:
| > So "dog" vs image of dog would both translate to a
| primordial signal : identity representation and in the
| domain of frequency do the comparison and project a
| coordinate in the spatial sense and eventually those two
| nodes would more likely be triggered at the same time due
| to the likelihood of "dog" being next to image of dog when
| parsing information across future events.
|
| That is how CLIP embeddings work and were trained to work.
|
| Hugging Face transformers now has get_image_features() and
| get_text_features() functions for CLIP models to make getting
| the embeddings for different modalities easy:
| https://huggingface.co/docs/transformers/model_doc/clip#tran...
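|
| A minimal sketch of those two calls, assuming transformers and
| openai/clip-vit-base-patch32; the image path is a placeholder:
|
|     import torch
|     from PIL import Image
|     from transformers import CLIPModel, CLIPProcessor
|
|     model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|     processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
|     image = Image.open("dog.jpg")  # placeholder image path
|     text_inputs = processor(text=["a photo of a dog"],
|                             return_tensors="pt", padding=True)
|     image_inputs = processor(images=image, return_tensors="pt")
|
|     with torch.no_grad():
|         text_emb = model.get_text_features(**text_inputs)
|         image_emb = model.get_image_features(**image_inputs)
|
|     # both live in the same space, so cosine similarity is meaningful
|     print(torch.nn.functional.cosine_similarity(text_emb, image_emb))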
| pyinstallwoes wrote:
| Yeah, but it doesn't use a universal method, does it? And
| it requires labeling.
|
| The method I'm describing requires no labeling. Labeling
| would be a local only translation (alias). Labels emerge
| based on meaning. But the labels are more of an interface
| - not the actual nodes themselves which arise off the not
| identity principle * event proximity * comparisons.
| cuttysnark wrote:
| edge of the galaxy: 'if when that then wherever where while for'
| pamelafox wrote:
| I'm looking for more resources like this that attempt to visually
| explain vectors, as I'll be giving some talks around vector
| search. Does anyone have related suggestions?
___________________________________________________________________
(page generated 2023-12-30 23:00 UTC)