[HN Gopher] Radient - vectorize many data types, not just text
___________________________________________________________________
Radient - vectorize many data types, not just text
Author : fzliu
Score : 54 points
Date : 2024-05-08 06:03 UTC (2 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jxmorris12 wrote:
| I'm not sure I get this. First of all, a perhaps-unnecessary
| question: who wants to search between molecules and audio files?
|
| By the way, is this even supported? I noticed the audio example
| seems to return a vector of floating-point numbers, but the
| molecule vector is made of binary true/false values.
|
| Anyway, what embedding model is used here? Can it be customized
| or swapped out? And why is it binary only sometimes? It's great
| that Radient is high-level and just provides "vectors" for
| things, but I think a few details (and perhaps a small amount of
| customization) would go a long way.
| fzliu wrote:
| > I think a few details (and perhaps a small amount of
| customization) would go a long way.
|
| I hear you and agree 100% - unfortunately I haven't gotten
| around to writing better documentation or solid code samples
| that use Radient yet.
|
| Regarding molecule vectorization: that capability comes from
| RDKit (https://rdkit.org) - I just uploaded a sample to the
| /examples directory. You're right that molecule-to-audio and
| audio-to-molecule search is nonsensical from a semantic
| perspective, but I could see a third modality such as text or
| images that ties the two together, similar to what ImageBind is
| doing (https://arxiv.org/abs/2305.05665).
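|
| (For anyone curious why the molecule vectors are binary: they
| are bit-vector fingerprints rather than dense model embeddings.
| A rough sketch of the kind of RDKit call involved - simplified,
| not necessarily the exact call Radient makes, and the SMILES
| string is just an example:)
|
| ```
| from rdkit import Chem
| from rdkit.Chem import AllChem
|
| # Parse a molecule from a SMILES string (aspirin here).
| mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
|
| # Morgan (circular) fingerprint: a fixed-length bit vector.
| fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
|
| vec = list(fp)  # 2048 zeros and ones, hence the binary values
| ```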
| lmeyerov wrote:
| It took us a bit to want this:
|
| We added a heterogeneous dataframe auto-vectorizer to our oss
| lib last year for a few reasons. Imagine writing:
| `graphistry.nodes(cudf.read_parquet("logs/")).featurize(**optional_cfg).umap().plot()`
|
| We like using UMAP, GNNs, etc. for understanding heterogeneous
| data like logs and other event & entity data, so we needed a way
| to easily handle date, string, JSON, etc. columns. Automatic
| feature engineering that we can tweak later is thus important.
| Feature engineering is a bottleneck on bigger datasets, like
| working with 100K+ log lines or webpages, so we later added an
| optional GPU mode. The rest of our library can already run
| (opt-in) on GPUs, so that completed our flow of raw data =>
| viz/AI/etc end-to-end on GPUs.
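|
| Roughly, the kind of per-column dispatch we mean (a toy sketch
| only - the actual implementation lives in featurize()/cu_cat,
| and the encoders here are just illustrative stand-ins):
|
| ```
| import numpy as np
| import pandas as pd
| from sklearn.feature_extraction.text import HashingVectorizer
|
| def encode_column(s: pd.Series) -> np.ndarray:
|     # Numeric columns pass through as-is.
|     if pd.api.types.is_numeric_dtype(s):
|         return s.to_numpy(dtype=float).reshape(-1, 1)
|     # Dates become nanoseconds since epoch.
|     if pd.api.types.is_datetime64_any_dtype(s):
|         return s.astype("int64").to_numpy().reshape(-1, 1)
|     # Strings / JSON-ish columns: hashed bag-of-tokens fallback.
|     return HashingVectorizer(n_features=64).transform(
|         s.astype(str)).toarray()
|
| df = pd.DataFrame({
|     "ts": pd.to_datetime(["2024-05-01", "2024-05-02"]),
|     "msg": ["login failed", "login ok"],
|     "bytes": [10, 20],
| })
| X = np.hstack([encode_column(df[c]) for c in df.columns])
| ```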
|
| To your point... Most of our users need just numbers, dates,
| text, etc. We do occasionally hit the need for images... but it
| was easy to do externally and just append those columns. A one-
| size-fits-most approach to embedding images is not obvious to me
| when I think of our projects here. So this library is
| interesting to me if it can pick good encodings...
| TachyonicBytes wrote:
| Would you be interested in posting the library here, or the
| vector parts? Your use-case sounds interesting to me.
| lmeyerov wrote:
| We mostly use via pygraphistry, and the demo folders have a
| bunch of examples close to what we do in the wild:
| https://github.com/graphistry/pygraphistry
|
| Ex:
|
| ```
| import graphistry
|
| graphistry.nodes(alerts_df).umap().plot()
| ```
|
| That's smart library sugar for:
|
| ```
| g = graphistry.nodes(alerts_df)
|
| g2 = g.featurize(**cfg)  # print('encoded', g2._node_features.shape)
|
| g3 = g2.umap()  # print('similarity graph', g3._nodes.shape, g3._edges.shape)
|
| url = g3.plot(render=False)
|
| print(f'<iframe src={url}/>')
| ```
|
| The automatic cpu/gpu feature engineering across heterogeneous
| dataframe columns happens via pygraphistry's calls to our
| lower-level library cu_cat:
| https://github.com/graphistry/cu-cat
|
| We've been meaning to write about cu_cat with the Nvidia
| RAPIDS team; it's a cool GPU fork of dirty_cat. We see
| anywhere from 2-100X speedups going from cpu to gpu.
|
| It already has sentence_transformers built in. Due to our
| work connecting louie.ai with various vector DBs, we're
| looking at revisiting how to make it even easier to plug in
| outside embeddings. Would be curious which patterns would
| be useful there. Prior to this thread, we weren't even
| thinking folks would want images built in, since we find
| that so context-dependent...
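|
| One possible plug-in shape (purely illustrative, not
| cu_cat's actual API - any callable that maps strings to a
| 2-D array could be swapped in):
|
| ```
| from typing import Protocol, Sequence
| import numpy as np
|
| class Encoder(Protocol):
|     # Anything with this method can be plugged in.
|     def encode(self, texts: Sequence[str]) -> np.ndarray: ...
|
| def featurize_text(texts: Sequence[str], encoder: Encoder) -> np.ndarray:
|     return np.asarray(encoder.encode(list(texts)))
|
| # sentence-transformers already satisfies this protocol:
| #   from sentence_transformers import SentenceTransformer
| #   featurize_text(docs, SentenceTransformer("all-MiniLM-L6-v2"))
| ```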
| hongspike wrote:
| What are the needs for other sinks besides a vector database?
| advael wrote:
| I think if you can make the vectorization methods more
| customizable and transparent, this could be a research accelerant
| too, since a lot of AI R&D on new domains or datasets has
| "make a good embedding" as a first step. It's not a very hard
| step right now, but I think you can probably make it faster to
| rapidly prototype something and then iterate on it, so long as
| you set it up to make the latter possible (i.e. inspectable,
| interoperable, etc.)
|
| I think even though some people doubt the value of being able to
| compare disparate types via embedding, allowing it to be done
| more seamlessly makes a kind of "silly" (or more charitably
| "playful") research I happen to like a lot more feasible. In
| particular, artificially produced "synesthesia" that comes from
| tuning weird embedding comparisons could end up being really
| useful in some domains because, as with human synesthetes, the
| underlying structure of one domain might provide counterintuitive
| insight or legibility into the other in some cases.
|
| But all of this requires that the library allows fine-tuning and
| retraining of the underlying embeddings. It would also be useful
| to natively support co-embeddings of different domains, given
| that things like CLIP drove the current wave of multimodal
| generative models.
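|
| (To make "co-embedding" concrete - a rough sketch of the joint
| space CLIP gives you, via the Hugging Face transformers
| wrappers; the model name and image path are just examples:)
|
| ```
| import torch
| from PIL import Image
| from transformers import CLIPModel, CLIPProcessor
|
| model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
| processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
| inputs = processor(text=["a photo of a cat"],
|                    images=Image.open("cat.jpg"),
|                    return_tensors="pt", padding=True)
| with torch.no_grad():
|     out = model(**inputs)
|
| # Text and image embeddings live in the same space, so cosine
| # similarity across modalities is meaningful.
| sim = torch.cosine_similarity(out.text_embeds, out.image_embeds)
| ```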
| dhruv_anand wrote:
| Nice library. For text embeddings, it would be good to integrate
| with a framework like litellm, which enables generating
| embeddings with a variety of methods and API providers:
| https://litellm.vercel.app/docs/embedding/supported_embeddin...
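|
| The call shape is roughly (the model name here is just an
| example; providers are configured via API keys per litellm's
| docs):
|
| ```
| from litellm import embedding  # pip install litellm
|
| resp = embedding(model="text-embedding-3-small",
|                  input=["vectorize many data types, not just text"])
| vec = resp["data"][0]["embedding"]  # OpenAI-style response shape
| ```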
| dhruv_anand wrote:
| For vector _sinks_, it would be good to integrate with Vector-io
| (https://github.com/ai-northstar-tech/vector-io), which already
| provides import functions for ~10 vector databases.
|
| disclosure: I'm the author of Vector-io
| esafak wrote:
| It's featurization or feature extraction. Vectorization does not
| involve ML:
| https://machinelearningcompass.com/machine_learning_math/vec...
| dhruv_anand wrote:
| Things are what people call them. Featurization/Feature
| Extraction used to refer to manual feature engineering, where
| you could tell what each numerical value is.
|
| Vectorization, as colloquially used by developers in the AI
| space today, refers to the same thing being done via deep
| learning models, so it has less to do with ML features and more
| to do with generating a vector all at once, with each dimension
| not having a specific meaning.
| mind-blight wrote:
| Is it just me, or is vector search not particularly good?
|
| It seems like magic at first, but then you start running into a
| bunch of issues:
|
| 1. Too much data within a single vector (often even just a few
| sentences) makes it so that most vectors are very close to each
| other due to many overlapping concepts.
|
| 2. Searching over a moderately sized corpus of documentation
| (e.g. a couple thousand pages of text) starts to degrade scoring
| (usually due to the above issue)
|
| 3. Every model I've tried fails pretty regularly on named
| entities (e.g. someone's name, a product, etc.) unless the
| entity is pretty well known
|
| 4. Getting granular enough to see useful variance requires
| generating a ton of embeddings, which start to cause performance
| bottlenecks really quickly
|
| I've honestly had a lot more success with more traditional search
| methods.
|
| Edit for formatting
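|
| (For what it's worth, the "traditional" side of that comparison
| can be as simple as BM25; a quick sketch with rank_bm25 - one
| library option among several - where an exact-term query like a
| name scores cleanly:)
|
| ```
| from rank_bm25 import BM25Okapi  # pip install rank-bm25
|
| docs = ["Radient vectorizes audio and molecules",
|         "Acme Widget v2 release notes",
|         "meeting notes with Jane Doe"]
| tokenized = [d.lower().split() for d in docs]
|
| bm25 = BM25Okapi(tokenized)
| scores = bm25.get_scores("jane doe".split())  # exact terms win here
| ```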
| esafak wrote:
| This happens when the quality of your embeddings is not high
| enough. You might need to fine-tune them for your task, and
| rerank the candidates for good measure.
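|
| The reranking step can be as small as running a cross-encoder
| over the top-k candidates; a rough sketch (the model name is
| just a common off-the-shelf choice, and the query/candidates
| are made up):
|
| ```
| from sentence_transformers import CrossEncoder
|
| reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
|
| query = "who is Jane Doe"
| candidates = ["meeting notes with Jane Doe",
|               "Acme Widget v2 release notes"]
|
| # Score each (query, candidate) pair jointly, then sort.
| scores = reranker.predict([(query, c) for c in candidates])
| reranked = [c for _, c in
|             sorted(zip(scores, candidates), reverse=True)]
| ```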
___________________________________________________________________
(page generated 2024-05-10 23:01 UTC)