[HN Gopher] Latent Dictionary: 3D map of Oxford3000+search words...
___________________________________________________________________
Latent Dictionary: 3D map of Oxford3000+search words via DistilBert
embeddings
Author : pps
Score : 62 points
Date : 2023-12-30 13:07 UTC (9 hours ago)
(HTM) web link (latentdictionary.com)
(TXT) w3m dump (latentdictionary.com)
| eurekin wrote:
| I added those in succession:
|
| > man woman king queen ruler force powerful care
|
| and couldn't reliably determine the position of any of them.
| larodi wrote:
| Is this with some sort of dimensionality reduction of the
| embedding space?
| kaoD wrote:
| In the bottom left "?" button it says it performs PCA down to 3
| dimensions. That's going to lose a ton of information,
| rendering the space mostly useless.
| behnamoh wrote:
| Yeah, it's a fun but useless project.
| tikimcfee wrote:
| I don't know about useless. I think there is some real
| magic waiting to be discovered in mapping language to some
| specific and enlightening visualization, and I think it
| involves something like this: using statistics and simple
| spatial relationships to create a "mapping" of a single
| individual's word space.
|
| Imagine walking around the world and seeing everyone's
| slightly unique relationship space of words. This is
| something I have envisioned for a very long time.
| kaoD wrote:
| What you're describing exists. It's called "embeddings";
| it's one of the first steps ChatGPT performs to do its
| magic, and it is indeed very useful.
|
| What renders this useless is reducing the dimensionality
| from thousands to just 3.
| smrtinsert wrote:
| It is lossy, but doesn't PCA function as a grouping even when
| forced like this?
| thom wrote:
| Seems mostly nonsensical, not sure if that's a bug or some deeper
| point I'm missing.
| kvakkefly wrote:
| Running the same search multiple times, I get different
| visualizations. I don't really understand what's going on, but
| I like the idea of visualizing embeddings.
| wrsh07 wrote:
| I think the PCA dimensionality reduction is non-deterministic,
| but I say this with really low confidence.
| soVeryTired wrote:
| PCA is purely deterministic (and might not give great
| results). My guess is this is done by t-SNE or UMAP, both of
| which depend on a seed.
| minimaxir wrote:
| The About page explicitly says it's PCA.
|
| PCA is fine enough to get it into 3D.
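|
| For what it's worth, the determinism point is easy to check
| directly. A minimal sketch, assuming scikit-learn's PCA and
| t-SNE (the site itself only says it uses PCA):
|
|     import numpy as np
|     from sklearn.decomposition import PCA
|     from sklearn.manifold import TSNE
|
|     rng = np.random.default_rng(0)
|     X = rng.normal(size=(200, 768))  # stand-in for 768-dim embeddings
|
|     # PCA: the same input gives the same projection every run
|     p1 = PCA(n_components=3, svd_solver="full").fit_transform(X)
|     p2 = PCA(n_components=3, svd_solver="full").fit_transform(X)
|     print(np.allclose(p1, p2))  # True
|
|     # t-SNE: random initialisation gives a different layout unless seeded
|     t1 = TSNE(n_components=3, init="random").fit_transform(X)
|     t2 = TSNE(n_components=3, init="random").fit_transform(X)
|     print(np.allclose(t1, t2))  # almost certainly False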
| wrsh07 wrote:
| I wish there were more context and maybe the ability to do math
| on the vectors
|
| E.g. what is the real distance between two vectors? That should
| be easy to compute
|
| Similarly: what do I get from summing two vectors and what are
| some nearby vectors?
|
| Maybe just generally: what are some nearby vectors?
|
| Without any additional context it's just a point cloud with a
| couple of randomly labeled elements
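|
| None of that math is hard to bolt on. A minimal sketch of the
| distance / vector-sum / nearest-neighbour queries, assuming the
| sentence-transformers library and all-MiniLM-L6-v2 rather than
| the site's DistilBERT setup:
|
|     from sentence_transformers import SentenceTransformer, util
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     words = ["king", "queen", "man", "woman", "ruler", "force", "care"]
|     emb = model.encode(words, normalize_embeddings=True)
|     vec = dict(zip(words, emb))
|
|     # distance between two vectors (cosine similarity here)
|     print(float(util.cos_sim(vec["king"], vec["queen"])))
|
|     # sum two vectors, then rank the tiny vocabulary by proximity
|     target = vec["king"] + vec["woman"]
|     sims = util.cos_sim(target, emb)[0].tolist()
|     for w, s in sorted(zip(words, sims), key=lambda t: -t[1]):
|         print(w, round(s, 3))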
| tikimcfee wrote:
| If I gave you a live GPU shader that let you arbitrarily
| position any of, say, a few million words with simple Cartesian
| coordinates, what would you do with it? Whole words expressed
| as individual letters - not symbols, representations, or
| abstractions. Just letters arranged in a specific order to form
| words.
| theaussiestew wrote:
| I would want to have multilingual embeddings so I can learn
| languages more efficiently. Being able to see clouds of
| different words in different languages would allow me to
| contextualize them more easily. Same for phrases and
| sentences.
| tikimcfee wrote:
| I'm very much with you. Since most languages I know of are
| written with combinations of ordered glyphs, they all get
| rendered the same (although I don't handle about 20k
| current Unicode characters of the full 400k+, and none of
| the grouping really works, so RTL languages would be a mess
| for individual words).
|
| However, this is exactly where I want to go. A dictionary
| is a cyclic graph of words mapping to words. That means
| there's at least one finite way to visit every single node
| and give it a position, with a direct relationship to the
| words that define it, and those words that define them, and
| so on.
|
| This creates an arbitrary and unique geometric structure
| per language, and if you get fancy and create modifiers for
| an individual's vocabulary, you can even create transforms
| for a "base" dictionary, _and the way someone chooses to
| use certain words differently_. You would be able to see,
| but likely not understand, the "structure" of types of text
| - poetry, storytelling, instructional writing, etc.
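|
| As a toy version of that idea: build a directed graph from each
| headword to the words in its definition, then let a force-directed
| layout assign every node a 3D position determined by the words
| that define it. A sketch assuming networkx and an invented
| three-entry mini-dictionary (not how LookAtThat works):
|
|     import networkx as nx
|
|     # hypothetical mini-dictionary: headword -> words in its definition
|     mini_dict = {
|         "dog": ["animal", "domestic", "bark"],
|         "animal": ["living", "organism"],
|         "bark": ["sound", "dog"],  # cycle: bark -> dog -> bark
|     }
|
|     G = nx.DiGraph()
|     for word, definition in mini_dict.items():
|         for defining_word in definition:
|             G.add_edge(word, defining_word)
|
|     # force-directed layout in 3D: positions fall out of definition links
|     positions = nx.spring_layout(G, dim=3, seed=42)
|     for word, xyz in positions.items():
|         print(word, xyz.round(2))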
| refulgentis wrote:
| You're actually kinda hitting the nail on the head.
| _Generally_, the word2vec woman + king = queen thing was cute
| but not very real.
|
| People rarely have to get down to the real true metal on the
| embedding models, and they're not what people think they are
| from their memory of word2vec. E.g. there's actually one vector
| emitted _per token_; the final vector is the mean. And cosine
| distance for similarity is the only metric anyone is training
| for.
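|
| To make the per-token point concrete, a minimal sketch assuming
| Hugging Face transformers and distilbert-base-uncased (the model
| family the site's title names):
|
|     import torch
|     from transformers import AutoModel, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
|     model = AutoModel.from_pretrained("distilbert-base-uncased")
|
|     inputs = tok("the quick brown fox", return_tensors="pt")
|     with torch.no_grad():
|         out = model(**inputs)
|
|     token_vectors = out.last_hidden_state        # one vector per token
|     print(token_vectors.shape)                   # e.g. [1, 6, 768]
|
|     sentence_vector = token_vectors.mean(dim=1)  # mean pooling
|     print(sentence_vector.shape)                 # [1, 768]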
|
| In summary, there's ~no reason to think a visualization trying
| to show multiple vectors will ever be meaningful. Even just
| starting from "they have way way way more dimensions than we
| can represent visually" is enough to rule it out.
|
| MiniLM v2, the foundation of most vector DBs, is 384 dims.
|
| n.b. dear reader, if you've heard of that: you should be using
| v3! V3 is for asymmetric search, aka query => result docs. V2
| is for symmetric search, aka chunk of text => similarly worded
| chunks of texts. It's very very funny how few people read the
| docs, in this case, the sentence transformers site.
| tikimcfee wrote:
| Edit: I think this is fascinating. If you use words like dog,
| electric, life, and human, all of them appear in one mass;
| however, words like greet, chicken, and "a" appear in a
| different dense section. I think it's interesting that the
| words have diverged in location, with some apparent relationship
| to the way the words are used. If this were truly random, I
| would expect those words to be mixed in with the other ones.
|
| I have something like this, except you can see every single word
| in any dictionary at once in space; it renders individual glyphs.
| It can show an entire dictionary of words - definitions and roots
| - and let you fly around in them. It's fun. I built a sample that
| "plays" a sentence and its definitions:
| GitHub.com/tikimcfee/LookAtThat. The more I see stuff like this,
| the more I want to complete it. It's heartening to see so many
| people fascinated with seeing words... I just wish I knew where
| to find these people to, like, befriend and get better. I'm
| getting the feeling I just kind of exist between worlds of lofty
| ideas and people that are incredibly smart sticking around other
| people that are incredibly smart.
| tetris11 wrote:
| Interesting that "cromulent" and "hentai" seem to map right next
| to each other, as well as the words "decorate" and "spare".
| tikimcfee wrote:
| Similarly, words like "I" and "am" appear in a slightly
| different dense section of the map by default.
| smrtinsert wrote:
| I would love a quickest-path feature between two words - for
| example, between color and colour.
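|
| Not a real geodesic, but a crude stand-in is easy to sketch with
| static vectors: linearly interpolate between the two embeddings
| and look up the nearest vocabulary word at each step. This
| assumes gensim's downloadable GloVe vectors, not the site's
| DistilBERT setup, and for a pair as close as color/colour every
| step may simply land on one of the two endpoints:
|
|     import numpy as np
|     import gensim.downloader as api
|
|     wv = api.load("glove-wiki-gigaword-100")  # small static vectors
|
|     a, b = wv["color"], wv["colour"]
|     for t in np.linspace(0.0, 1.0, 5):
|         point = (1 - t) * a + t * b  # crude linear interpolation
|         word, score = wv.similar_by_vector(point, topn=1)[0]
|         print(f"t={t:.2f}: nearest word is {word} ({score:.3f})")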
| tudorw wrote:
| I think that's going to be a geodesic in a hyper-dimensional
| manifold. There was an article here about 'wordlets' on a
| hyper-sphere, and a piece on time, LLMs, and the related
| manifold. Visualising LLM topology (multi-dimensional
| topological manifolds) is a very rich area for exploration. I'm
| waiting for someone to use PHATE to do the dimension reduction;
| it's used in neuroscience to reduce dimensionality, providing
| information not visible using PCA, t-SNE, LDA, or UMAP.
| patcon wrote:
| Yep, been thinking on that paper as well:
|
| Traveling words. There's code!
| https://arxiv.org/abs/2309.07315
|
| https://github.com/santiag0m/traveling-words
| chaxor wrote:
| Typically these types of single-word embedding visualizations
| work much better with non-contextualized models such as the more
| traditional gensim or word2vec approaches, as contextual encoder-
| based embedding models like BERT don't 'bake in' as much to the
| token (word) itself, and rather rely on its context to define it.
| Also, PCA for contextual models like BERT often ends up with
| $PC_0$ aligned with the length of the document.
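|
| For comparison, the static-embedding baseline is easy to poke at.
| A minimal sketch, assuming gensim's pre-trained GloVe download
| (any non-contextualized vectors would do):
|
|     import gensim.downloader as api
|
|     # static vectors: one fixed vector per word, no context involved
|     wv = api.load("glove-wiki-gigaword-100")
|
|     for word in ["king", "dog", "electric"]:
|         print(word, wv.most_similar(word, topn=3))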
| minimaxir wrote:
| Some notes on how embeddings/DistilBERT embeddings work since the
| other comments are confused:
|
| 1) There are two primary ways to have models generate embeddings:
| implicitly from an LLM, by mean-pooling its last hidden state
| (since it has to learn how to map text into a distinct latent
| space anyway to work correctly - e.g. DistilBERT), or explicitly,
| with a model that generates embeddings directly and is trained
| using something like triplet loss to incentivise learning
| similarity/dissimilarity. Popular text-embedding models like
| BAAI/bge-large-en-v1.5 tend to use the latter approach.
|
| 2) The famous word2vec examples of e.g. woman + king = queen only
| work because word2vec is a shallow network and the model learns
| the word embeddings directly, instead of them being emergent. The
| latent space still maps them closely, as shown with this demo, but
| there isn't any algebraic intuition. You can get _close_ with
| algebra, but no cigar.
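|
| The "close but no cigar" part is visible with static vectors too.
| A sketch assuming gensim's pre-trained GloVe vectors; the exact
| rankings depend on which vectors you load:
|
|     import gensim.downloader as api
|
|     wv = api.load("glove-wiki-gigaword-100")
|
|     # king - man + woman: "queen" ranks highly, helped by the fact
|     # that most_similar excludes the query words themselves
|     print(wv.most_similar(positive=["king", "woman"],
|                           negative=["man"], topn=3))
|
|     # the raw arithmetic alone usually lands nearest to "king" itself
|     target = wv["king"] - wv["man"] + wv["woman"]
|     print(wv.similar_by_vector(target, topn=3))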
|
| 3) DistilBERT is pretty old (2019) and based on a 2018 model
| trained on Wikipedia and books, so there will be significant text
| drift, in addition to it missing out on newer modeling techniques
| and more robust datasets. I do not recommend using it for
| production applications nowadays.
|
| 4) There is an under-discussed opportunity for dimensionality
| reduction techniques like PCA (which this demo uses to get the
| data into 3D) to improve both signal-to-noise ratio and
| distinctiveness. I am working on a blog post about a new technique
| for handling dimensionality reduction for text embeddings better,
| which may have interesting and profound usability implications.
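|
| The reduction step the demo performs is itself only a few lines.
| A minimal sketch, assuming scikit-learn and a stand-in embedding
| matrix in place of real DistilBERT outputs:
|
|     import numpy as np
|     from sklearn.decomposition import PCA
|
|     # one 768-dim vector per word (random stand-in data here)
|     rng = np.random.default_rng(0)
|     embeddings = rng.normal(size=(3000, 768))
|
|     pca = PCA(n_components=3)
|     coords_3d = pca.fit_transform(embeddings)
|
|     print(coords_3d.shape)                      # (3000, 3)
|     print(pca.explained_variance_ratio_.sum())  # how much variance survives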
| pyinstallwoes wrote:
| I've been ruminating on the postulation of a universal
| signature for every entity across sensory complexes (per sense
| organ reality, vision, touch, mind) which translates to the
| problem of entities represented in binary needing to be related
| across modalities as in "butterfly" vs a picture of a butterfly
| vs the audio of butterfly vs the thought pointing to one of
| those or other.
|
| I was wondering if there was a universal signal that can be
| used as the identity and then based on that signal one could
| measure the distance to any other signal based on the principle
| relation of not(other). That is to say the identity would be
| precisely not all else for any X. Said another way, every thing
| is because it is exactly not everything else.
|
| So, thinking in terms as close to first principles as possible,
| I wondered: is it possible to represent everything as some
| frequency - a Fourier transform analog for every "time slice" of
| a thing? This is where it gets slightly slippery.
|
| So the idea was trying to build relationship and identity and
| labeling from a simple rule set of things arising out of
| relation of not being other things.
|
| In my mind I saw nodes on a graph forming in higher dimensions
| as halfway points for any comparison. Comparisons create new
| nodes and implicitly have a distance metric to all other
| things. It made sense in my mind that there was an algorithmic
| annealing to new nodes in a "low density higher energetic
| state" allowing them to move faster in this universal emergent
| ontology/spatial space; eventually getting more dense and
| slower as it gets cold.
|
| So the system implicitly also has a snapshot of events or
| interactions based on that where every comparison has a "tick"
| that encodes a particular density relation for some set of
| nodes it's in association with.
|
| The idea that cemented it all together was to treat each node
| like an address:chord. Similar to chording keys like a-b-c in
| some ux programs, but also exactly like chords in music too.
|
| The idea being that when multiple "things" are dialed in at
| the same time, it becomes its own emergent label by proximity and
| association of those things being triggered to new information
| coming in classified as a distance to not(signal).
|
| I didn't really realize how close this idea was to what
| encoders/decoders seem to be doing although I do know I'm
| trying to think myself towards a universal solution that
| doesn't require special encoders for every media type. Hence
| the Fourier transform path.
|
| Know anything like this or am I spitting idiocy?
| pyinstallwoes wrote:
| So the alphabet a to z... on their own the symbols mean
| nothing, but when compared to every other letter, meaning
| arises. Then iterate recursively outward for every growth in
| structure: letter to letter, words to words, paragraphs to
| paragraphs. Each one has a "dependent arising" of meaning
| based precisely on the relation to the other.
|
| Which is more or less word2vec, as far as I understand, but
| then trying to extrapolate that as a universal principle to
| all things that can be represented, by using a "common
| signature: a hash based off a signal like a complex waveform",
| then doing a difference on signal composition and its
| shape/bandwidth to compare its properties to other things;
| when they reference similar objects, even in different
| modalities, they'd be associated by being triggered together.
|
| So "dog" vs image of dog would both translate to a primordial
| signal : identity representation and in the domain of
| frequency do the comparison and project a coordinate in the
| spatial sense and eventually those two nodes would more
| likely be triggered at the same time due to the likelihood of
| "dog" being next to image of dog when parsing information
| across future events.
|
| Whew. Maybe I'm just talking to myself. At least it's out
| there if it makes sense to anyone else.
| pyinstallwoes wrote:
| The key requirement in my mind here is that the "universal
| identifier" is a form of attempting something like a
| deterministic signature for all things. The hunch is based
| on the hypothesis that the primordial representation of any
| and all things is frequency.
|
| But of course each "ontologically capable system" would still
| need to process the identity function to start making sense
| of things based on signals being unlike other signals, so
| deterministic is shallow but concrete.
| minimaxir wrote:
| > So "dog" vs image of dog would both translate to a
| primordial signal : identity representation and in the
| domain of frequency do the comparison and project a
| coordinate in the spatial sense and eventually those two
| nodes would more likely be triggered at the same time due
| to the likelihood of "dog" being next to image of dog when
| parsing information across future events.
|
| That is how CLIP embeddings work and were trained to work.
|
| Hugging Face transformers now has get_image_features() and
| get_text_features() functions for CLIP models to make getting
| the embeddings for different modalities easy:
| https://huggingface.co/docs/transformers/model_doc/clip#tran...
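|
| A minimal sketch of those two calls, assuming transformers and
| openai/clip-vit-base-patch32; the image path is a placeholder:
|
|     import torch
|     from PIL import Image
|     from transformers import CLIPModel, CLIPProcessor
|
|     model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|     processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
|     image = Image.open("dog.jpg")  # placeholder image path
|     text_inputs = processor(text=["a photo of a dog"],
|                             return_tensors="pt", padding=True)
|     image_inputs = processor(images=image, return_tensors="pt")
|
|     with torch.no_grad():
|         text_emb = model.get_text_features(**text_inputs)
|         image_emb = model.get_image_features(**image_inputs)
|
|     # both live in the same space, so cosine similarity is meaningful
|     print(torch.nn.functional.cosine_similarity(text_emb, image_emb))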
| pyinstallwoes wrote:
| Yeah, but it doesn't use a universal method, does it? And
| it requires labeling.
|
| The method I'm describing requires no labeling. Labeling
| would be a local only translation (alias). Labels emerge
| based on meaning. But the labels are more of an interface
| - not the actual nodes themselves which arise off the not
| identity principle * event proximity * comparisons.
| cuttysnark wrote:
| edge of the galaxy: 'if when that then wherever where while for'
| pamelafox wrote:
| I'm looking for more resources like this that attempt to visually
| explain vectors, as I'll be giving some talks around vector
| search. Does anyone have related suggestions?
___________________________________________________________________
(page generated 2023-12-30 23:00 UTC)