[HN Gopher] Shapeshift: Semantically map JSON objects using key-
level vector embeddings
___________________________________________________________________
Shapeshift: Semantically map JSON objects using key-level vector
embeddings
Author : marvinkennis
Score : 107 points
Date : 2024-07-15 22:47 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| WhatsName wrote:
| Maybe I'm not the target audience, but here are simple questions
| to the author or potential users:
|
| What about anything more complex, like date of birth to age or
| the other way round? Also, since we will inevitably incur costs,
| why not let an LLM write a transformation rule for us?
| eezing wrote:
| My thinking as well.
| anamexis wrote:
| It's not using an LLM, it's just comparing embeddings (which
| are waaay cheaper)
| riku_iki wrote:
| but embeddings came from somewhere (LLM?).
| lukasb wrote:
| What is this for? The examples given could be handled
| deterministically. Is this for situations where you don't know
| JSON schemas in advance? What situations are those?
| saltwatercowboy wrote:
| The lazy part of my brain screams "use this instead of dealing
| properly with nested objects!" In a production setting I'd be
| worried about consistency from the base to the result layer if
| it's based on LLM transposition.
| eezing wrote:
| Data import via customer self-service onboarding.
| tbrownaw wrote:
| As is, it's not good for much beyond looking cool. (Maybe
| implementing Postel's Law for a JSON API, but I think that's
| considered bad taste these days.)
|
| If instead of transforming a single object it would output a
| table of src_field->dst_field, it could potentially be a useful
| first pass in some ETL development.
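|
| Something like this hypothetical output shape (field names made
| up), which a human could review once and then bake into the
| pipeline:
|
|     type FieldMapping = {
|       srcField: string;   // key in the source JSON
|       dstField: string;   // best-matching key in the target schema
|       similarity: number; // cosine score, kept for manual review
|     };
|
|     const mapping: FieldMapping[] = [
|       { srcField: "telephone", dstField: "phoneNumber", similarity: 0.91 },
|       { srcField: "addr", dstField: "address", similarity: 0.88 },
|     ];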
| henry700 wrote:
| Keep the bug generators going, we will need the jobs
| simonw wrote:
| This is the code that does the work:
| https://github.com/rectanglehq/Shapeshift/blob/d954dab2a866c...
|
| There are a few ways this could be made less expensive to run:
|
| 1. Cache those embeddings somewhere. You're only embedding simple
| strings like "name" and "address" - no need to do that work more
| than once in an entire lifetime of running the tool (see the
| sketch after this list).
|
| 2. As suggested here
| https://news.ycombinator.com/item?id=40973028 change the design
| of the tool so instead of doing the work it returns a reusable
| data structure mapping input keys to output keys, so you only
| have to run it once and can then use that generated data
| structure to apply the transformations on large amounts of data
| in the future.
|
| 3. Since so many of the keys are going to have predictable names
| ("name", "address" etc) you could even pre-calculate embeddings
| for the 1,000 most common keys across all three embedding
| providers and ship those as part of the package.
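|
| A minimal sketch of (1), assuming an in-memory Map and a
| hypothetical embed() function standing in for whichever provider
| call the tool makes:
|
|     // Keys like "name" repeat constantly, so cache their
|     // embeddings (a file or database would also work).
|     const cache = new Map<string, number[]>();
|
|     async function getEmbedding(
|       key: string,
|       embed: (text: string) => Promise<number[]>
|     ): Promise<number[]> {
|       const hit = cache.get(key);
|       if (hit) return hit;
|       const vector = await embed(key);
|       cache.set(key, vector);
|       return vector;
|     }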
|
| Also: in
| https://github.com/rectanglehq/Shapeshift/blob/d954dab2a866c...
| you're using Promise.map() to run multiple embeddings through the
| OpenAI API at once, which risks tripping their rate limit. You
| should be able to pass the text as an array in a single call
| instead, something like this:
|
|     const response = await this.openai!.embeddings.create({
|       model: this.embeddingModel,
|       input: texts,
|       encoding_format: "float",
|     });
|     return response.data.map(item => item.embedding);
|
| https://platform.openai.com/docs/api-reference/embeddings/cr...
| says input can be a string OR an array - that's reflected in the
| TypeScript library here too:
| https://github.com/openai/openai-node/blob/5873a017f0f2040ef...
| marvinkennis wrote:
| Thanks for the suggestions! Will implement these. Caching is a
| great idea.
| slantedview wrote:
| In general, you might cross reference with other object
| mapping libraries (including in other languages) to get ideas
| on how they approach this problem. Caching mappings is just
| one common strategy.
| explaininjs wrote:
| Watch out with the array mode though: according to the OpenAI
| docs it _technically_ can return the results in any order, and
| you must sort them by index to be sure you have the right
| associations. I've never seen them out of order in practice, but
| it'd be entirely in-character for them to suddenly change that
| without warning, and then your entire vector DB may or may not be
| nondeterministically ruined.
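|
| A defensive version of the batch call above (untested sketch):
|
|     const response = await this.openai!.embeddings.create({
|       model: this.embeddingModel,
|       input: texts,
|       encoding_format: "float",
|     });
|     // Each item carries the index of its input string, so sort
|     // by that instead of trusting the response order.
|     return response.data
|       .slice()
|       .sort((a, b) => a.index - b.index)
|       .map(item => item.embedding);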
| simonw wrote:
| Yikes!
| PaulHoule wrote:
| I was involved in an attempt to do this kind of thing with
| convolutional neural networks just around the time BERT came out.
| It was mostly successful, and we did great projects for companies
| in the beverages, telecom, aviation and consumer goods spaces.
|
| It worked because it also had a conventional data-processing
| pipeline that revolved around JSON documents.
|
| For (2), it seems a system like that should be able to generate a
| script in Python, a co-designed DSL or some other language to do
| the conversion.
|
| One interesting thing about the product I worked on was that it
| functioned as a profiler by looking at one cell at a time, so if
| some field has "Gruff Rhys" or "Fan Bingbing" in it, it could
| tell that was probably somebody's name - all the better if it
| could also see that the field label was something like "Full
| Name" or "Xing Ming". I'd contrast that to more conventional
| column-based profilers, which might notice that a certain field
| only has the values "true" and "false" throughout the whole
| column and would probably have some rule to determine it was a
| boolean field.
|
| One thing that system could do is recognize private data inside
| unstructured data. Where I work for instance we have
|
| https://www.spirion.com/sensitive-data-discovery
|
| which scans text and other files and warns if it sees something
| that looks like a lot of personal data, such as an Excel
| spreadsheet full of names, addresses and phone numbers -- even
| if I just made them up as test data.
| btown wrote:
| > returns a reusable data structure mapping input keys to
| output keys
|
| IMO this use case is exactly what Copilot is for. Write a
| comment including one example each of input and output, and
| tab-complete in your language of choice to have it create a
| rewriter for you.
|
| One benefit (and danger) is that it will look at the values,
| not just the keys, and also may generate arbitrary code that
| can e.g. adapt a firstName and lastName to a fullName. But
| that's why you have a human being triggering and auditing this
| for subtle bugs, and putting it through code review and source
| control, right?
| lordofmoria wrote:
| Since LLMs are bad at the null hypothesis (in this case, when a
| key does not exist in the source JSON), how does this prevent
| hallucinating transformations for missing keys?
| mpeg wrote:
| This isn't using an LLM; it simply checks for similarity between
| keys using vector embeddings.
| explaininjs wrote:
| What'd be really great is a codegen aspect. A non-negligible part
| of any data munging operation is "this input object has fields X,
| Y, Z and we need an output object with fields X, f(X), Y,
| f(Y,Z)". This is something an LLM has a decent chance of being
| really quite good at.
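|
| E.g. from one example pair it could plausibly emit something like
| this (hypothetical fields):
|
|     function transform(src: {
|       firstName: string;
|       lastName: string;
|       dob: string;
|     }) {
|       return {
|         fullName: `${src.firstName} ${src.lastName}`,
|         // crude year-only age, but fine as generated scaffolding
|         age: new Date().getFullYear() - new Date(src.dob).getFullYear(),
|       };
|     }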
| visarga wrote:
| This task in its most general form is better done with a
| question-answering prompt than with embeddings. How do you solve
| "Full Name" -> "First Name", "Last Name" with embeddings? QA is
| the right level of abstraction for schema conversion tasks. And
| it's simple: just put the source JSON + target JSON schema in the
| prompt and ask for value extraction.
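|
| Roughly (a sketch; `source` and `targetSchema` are assumed to be
| objects from context, and any model that can return JSON would
| do):
|
|     const prompt = `Source object:
|     ${JSON.stringify(source)}
|
|     Fill in this target schema using values extracted from the
|     source. Split or combine fields where needed ("Full Name" ->
|     "First Name" + "Last Name"). Answer with JSON only.
|     ${JSON.stringify(targetSchema)}`;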
| yetanotherjosh wrote:
| So this identifies keys from source and target objects that are
| fuzzy synonyms and copies the values over. What is a real-world
| use case for this? Add the fact that it's fuzzy and won't always
| work, so it would require a great deal of extra effort in
| QA/testing (harder than just mapping the keys programmatically),
| and I'm puzzled.
| anamexis wrote:
| We do something very similar with embeddings in our product.
| Users import files that they have to match to a dynamically-
| defined target schema. The embedding matching provides
| suggested matches to the user that are generally very accurate,
| so they don't have to go through and manually match up
| "telephone" to "phone number" etc. It even works across
| languages.
| momojo wrote:
| How much time does this save your users? Is this QOL? Or more of
| a "our product wouldn't work without this feature" kind of thing?
| anamexis wrote:
| Quite a bit of time. The product would still work without the
| feature, but it is a major one. It bypasses lots of wading
| through dropdowns (potentially dozens for a single session).
| magicalhippo wrote:
| I've got some similar use-cases. So, do I understand
| correctly that you take the source keyword and generate an
| embedding vector of it, then compare it using dot-product
| similarity or something to the embedded vectors of the target
| keywords?
| anamexis wrote:
| Exactly, although we use cosine similarity.
| magicalhippo wrote:
| Perfect. And yeah, that's what I meant; I'm so used to
| normalizing vectors that dot product = cosine.
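|
| For anyone following along, the whole comparison is a few lines:
|
|     // Cosine similarity; for unit-length vectors both norms are
|     // 1, so this reduces to a plain dot product.
|     function cosine(a: number[], b: number[]): number {
|       let dot = 0, na = 0, nb = 0;
|       for (let i = 0; i < a.length; i++) {
|         dot += a[i] * b[i];
|         na += a[i] * a[i];
|         nb += b[i] * b[i];
|       }
|       return dot / (Math.sqrt(na) * Math.sqrt(nb));
|     }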
| hendler wrote:
| Created a Rust version using devin.ai. (untested)
|
| https://github.com/HumanAssisted/shapeshift-rust
| benzguo wrote:
| Put together a quick version with an LLM, using Substrate:
| https://www.val.town/v/substrate/shapeshift
|
| I've turned the target object into a JSON schema, but you could
| probably generate that JSON schema pretty reliably using a
| codegen LLM.
| happy_bacon wrote:
| Here is another DSL for implementing object model mappings:
| https://github.com/patleahy/lir
| leobg wrote:
| The example could be handled with no machine learning at all.
| Just use a bag of words comparison with a subword tokenizer. And
| if you do need embeddings (to map synonyms/topics), fastText is
| faster, cheaper and runs locally. For hard cases, you can feed
| the source/target schemas to gpt-4o once to create a map - and
| then apply that one map to all instances.
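|
| Applying that one map is then deterministic and free (sketch,
| with a made-up map):
|
|     // Hypothetical map produced once by the model, then reused
|     // for every instance.
|     const keyMap: Record<string, string> = {
|       telephone: "phoneNumber",
|       addr: "address",
|     };
|
|     function remap(src: Record<string, unknown>) {
|       const out: Record<string, unknown> = {};
|       for (const [k, v] of Object.entries(src)) {
|         out[keyMap[k] ?? k] = v;
|       }
|       return out;
|     }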
| riku_iki wrote:
| > fastText is faster, cheaper and runs locally
|
| the question is whether the quality will be acceptable
___________________________________________________________________
(page generated 2024-07-16 23:01 UTC)