[HN Gopher] Radient - vectorize many data types, not just text
       ___________________________________________________________________
        
       Radient - vectorize many data types, not just text
        
       Author : fzliu
       Score  : 54 points
       Date   : 2024-05-08 06:03 UTC (2 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jxmorris12 wrote:
       | I'm not sure I get this. First of all, a perhaps-unnecessary
       | question: who wants to search between molecules and audio files?
       | 
       | By the way, is this even supported? I noticed the audio example
       | seems to return a vector of floating point numbers but the
       | molecule vector is binary true/false values.
       | 
       | Anyway, what embedding model is used here? Can it be customized,
       | or swapped out? And why is it binary only sometimes? It's great
       | that Radient is high-level and just provides "vectors" for things
       | but I think a few details (and perhaps a small amount of
       | customization) would go a long way.
        
         | fzliu wrote:
         | > I think a few details (and perhaps a small amount of
         | customization) would go a long way.
         | 
          | I hear you and agree 100% - I unfortunately haven't gotten
          | around to writing better documentation or solid code samples
          | that use Radient yet.
         | 
         | Regarding molecule vectorization: that capability comes from
         | RDKit (https://rdkit.org) - I just uploaded a sample to the
         | /examples directory. You're right that molecule-to-audio and
         | audio-to-molecule search is nonsensical from a semantic
         | perspective, but I could see a third modality such as text or
         | images that ties the two together, similar to what ImageBind is
         | doing (https://arxiv.org/abs/2305.05665).
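          | 
          | On the binary question: RDKit's standard fingerprints (e.g.
          | Morgan) are bit vectors, which is why molecule vectors come out
          | as true/false values while audio embeddings are floats. A
          | minimal sketch of one such RDKit call (illustrative only, not
          | necessarily Radient's exact code path):
          | 
          | ```
          | from rdkit import Chem
          | from rdkit.Chem import AllChem
          | 
          | # Parse a SMILES string into an RDKit molecule (ethanol here).
          | mol = Chem.MolFromSmiles("CCO")
          | 
          | # Morgan (ECFP-like) fingerprint: a fixed-length binary vector.
          | fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
          | 
          | print(list(fp)[:16])  # e.g. [0, 0, 1, 0, ...]
          | ```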
        
         | lmeyerov wrote:
          | It took us a bit to want this:
         | 
          | We added a heterogeneous dataframe auto vectorizer to our oss
          | lib last year for a few reasons. Imagine writing:
          | `graphistry.nodes(cudf.read_parquet("logs/"))
          | .featurize(**optional_cfg).umap().plot()`
         | 
          | We like using UMAP, GNNs, etc. for understanding heterogeneous
          | data like logs and other event & entity data, so we needed a
          | way to easily handle date, string, JSON, etc. columns. So
          | automatic feature engineering that we could tweak later is
          | important. Feature engineering is a bottleneck on bigger
          | datasets, like working with 100K+ log lines or webpages, so we
          | later added an optional GPU mode. The rest of our library can
          | already run (opt-in) on GPUs, so that completed our flow of raw
          | data => viz/AI/etc end-to-end on GPUs.
         | 
          | To your point... Most of our users need just numbers, dates,
          | text, etc. We do occasionally hit the need for images... but it
          | was easy to do externally and just append those columns. A one-
          | size-fits-most approach to embedding images is not obvious to
          | me when I think of our projects here. So this library is
          | interesting to me if they can pick good encodings...
        
           | TachyonicBytes wrote:
           | Would you be interested in posting the library here, or the
           | vector parts? Your use-case sounds interesting to me.
        
             | lmeyerov wrote:
              | We mostly use it via pygraphistry, and the demo folders
              | have a bunch of examples close to what we do in the wild:
              | https://github.com/graphistry/pygraphistry
             | 
             | Ex:
             | 
              | ```
              | import graphistry
              | 
              | graphistry.nodes(alerts_df).umap().plot()
              | ```
              | 
              | That's smart library sugar for:
              | 
              | ```
              | g = graphistry.nodes(alerts_df)
              | 
              | # auto-featurize the heterogeneous node columns
              | g2 = g.featurize(**cfg)
              | # print('encoded', g2._node_features.shape)
              | 
              | # similarity graph + layout from those features
              | g3 = g2.umap()
              | # print('similarity graph', g3._nodes.shape, g3._edges.shape)
              | 
              | url = g3.plot(render=False)
              | print(f'<iframe src={url}/>')
              | ```
             | 
              | The automatic cpu/gpu feature engineering across
              | heterogeneous dataframe columns happens via pygraphistry's
              | automation calls to our lower-level library cu_cat:
              | https://github.com/graphistry/cu-cat
              | 
              | We've been meaning to write about cu_cat with the Nvidia
              | RAPIDS team; it's a cool GPU fork of dirty_cat. We see
              | anywhere from 2-100x speedups going from cpu -> gpu.
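              | 
              | For a concrete sense of the pattern (sketched here with
              | upstream dirty_cat's TableVectorizer rather than cu_cat
              | itself, so treat names and defaults as illustrative):
              | 
              | ```
              | import pandas as pd
              | from dirty_cat import TableVectorizer  # cu_cat forks this
              | 
              | # Heterogeneous columns: numeric, datetime, free text.
              | df = pd.DataFrame({
              |     "bytes": [512, 2048, 128],
              |     "ts": pd.to_datetime(
              |         ["2024-05-01", "2024-05-02", "2024-05-03"]),
              |     "msg": ["login failed", "login ok", "password reset"],
              | })
              | 
              | # One call picks a per-dtype encoder for every column.
              | X = TableVectorizer().fit_transform(df)
              | print(X.shape)
              | ```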
             | 
             | It already has sentence_transformers built in. Due to our
             | work with louie.ai <> various vector DBs, we're looking at
             | revisiting how to make it even easier to plug in outside
             | embeddings. Would be curious if any patterns would be
             | useful there. Prior to this thread, we weren't even
             | thinking folks would want images built-in as we find that
             | so context-dependent...
        
       | hongspike wrote:
       | What are the needs for other sinks besides a vector database?
        
       | advael wrote:
        | I think if you can make the vectorization methods more
        | customizable and transparent, this could be a research accelerant
        | too, since a lot of AI R&D on new domains or datasets has "make a
        | good embedding" as a first step. It's not a very hard step right
        | now, but I think you can probably make it faster to rapidly
        | prototype something and then iterate on it, so long as you set it
        | up to make the latter possible (i.e. inspectable, interoperable,
        | etc.)
       | 
       | I think even though some people doubt the value of being able to
       | compare disparate types via embedding, allowing it to be done
       | more seamlessly makes a kind of "silly" (or more charitably
       | "playful") research I happen to like a lot more feasible. In
       | particular, artificially-produced "synesthesia" that comes from
       | tuning weird embedding comparisons could end up being really
       | useful in some domains, because like human synesthetes, the
       | underlying structure of one domain might provide counterintuitive
       | insight or legibility into the other in some cases
       | 
       | But all of this requires that the library allows fine-tuning and
       | retraining of the underlying embeddings. It would be useful to
       | natively support coembeddings of different domains, as things
       | like CLIP drove the current wave of multimodal generative models.
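        | 
        | For what co-embedding looks like in practice, here's a minimal
        | sketch using the sentence-transformers CLIP wrapper (nothing to
        | do with Radient specifically; the image path is a placeholder):
        | 
        | ```
        | from PIL import Image
        | from sentence_transformers import SentenceTransformer, util
        | 
        | # CLIP maps images and text into one shared vector space.
        | model = SentenceTransformer("clip-ViT-B-32")
        | 
        | img_emb = model.encode(Image.open("dog.jpg"))
        | txt_emb = model.encode(["a photo of a dog", "a city skyline"])
        | 
        | # Cosine similarity across modalities is now meaningful.
        | print(util.cos_sim(img_emb, txt_emb))
        | ```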
        
       | dhruv_anand wrote:
       | nice library. for text embeddings, it would be good to integrate
       | with a framework like litellm, which has enabled doing embeddings
       | using a variety of methods and API providers:
       | https://litellm.vercel.app/docs/embedding/supported_embeddin...
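        | 
        | Rough shape of a litellm embedding call (assumes a provider key
        | is configured in the environment; the model name is just an
        | example):
        | 
        | ```
        | from litellm import embedding
        | 
        | # litellm routes the same call to whichever provider backs the
        | # requested model name.
        | response = embedding(
        |     model="text-embedding-ada-002",
        |     input=["vectorize many data types, not just text"],
        | )
        | 
        | # Response follows the OpenAI embedding shape.
        | vector = response.data[0]["embedding"]
        | print(len(vector))
        | ```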
        
       | dhruv_anand wrote:
       | for vector _sinks_ , it would be good to integrate with Vector-io
       | (https://github.com/ai-northstar-tech/vector-io) which provides
       | import functions for ~10 vector databases already.
       | 
       | disclosure: I'm the author of Vector-io
        
       | esafak wrote:
       | It's featurization or feature extraction. Vectorization does not
       | involve ML:
       | https://machinelearningcompass.com/machine_learning_math/vec...
        
         | dhruv_anand wrote:
          | Things are what people call them. Featurization/feature
          | extraction used to refer to manual feature engineering, where
          | you could tell what each numerical value means.
          | 
          | Vectorization, as colloquially used by developers in the AI
          | space today, refers to the same thing being done via deep
          | learning models, so it is less about hand-designed ML features
          | and more about generating a vector all at once, with no single
          | dimension having a specific meaning.
        
       | mind-blight wrote:
       | Is it just me, or is vector search not particularly good?
       | 
        | It seems like magic at first, but then you start running into a
        | bunch of issues:
       | 
       | 1. Too much data within a single vector (often even just a few
       | sentences) makes it so that most vectors are very close to each
       | other due to many overlapping concepts.
       | 
        | 2. Searching over a moderately sized corpus of documentation
        | (e.g. a couple thousand pages of text) starts to degrade scoring
        | (usually due to the above issue)
       | 
        | 3. Every model I've tried fails pretty regularly on named
        | entities (e.g. someone's name, a product, etc.) unless the entity
        | is pretty well known
       | 
       | 4. Getting granular enough to see useful variance requires
       | generating a ton of embeddings, which start to cause performance
       | bottlenecks really quickly
       | 
       | I've honestly had a lot more success with more traditional search
       | methods.
       | 
       | Edit for formatting
        
         | esafak wrote:
          | This happens when the quality of your embeddings is not high
          | enough. You might need to fine-tune them for your task, and
          | rerank the candidates for good measure.
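          | 
          | A common shape for that rerank step, sketched with the
          | sentence-transformers CrossEncoder (the model name is just one
          | popular choice, not something Radient ships):
          | 
          | ```
          | from sentence_transformers import CrossEncoder
          | 
          | query = "who is the product manager for Acme Widgets?"
          | candidates = [
          |     "Jane Doe has led Acme Widgets since 2021.",
          |     "Acme reported record widget sales last quarter.",
          | ]
          | 
          | # A cross-encoder scores (query, passage) pairs jointly, which
          | # helps where bi-encoder vectors blur together.
          | reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
          | scores = reranker.predict([(query, c) for c in candidates])
          | 
          | for score, text in sorted(zip(scores, candidates), reverse=True):
          |     print(round(float(score), 3), text)
          | ```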
        
       ___________________________________________________________________
       (page generated 2024-05-10 23:01 UTC)