[HN Gopher] Shapeshift: Semantically map JSON objects using key-...
       ___________________________________________________________________
        
       Shapeshift: Semantically map JSON objects using key-level vector
       embeddings
        
       Author : marvinkennis
       Score  : 107 points
       Date   : 2024-07-15 22:47 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | WhatsName wrote:
       | Maybe I'm not the target audience, but here are simple questions
       | to the author or potential users:
       | 
       | What about anything more complex, like date of birth to age or
       | the other way round? Also, since we will inevitably incur costs,
       | why not let an LLM write a transformation rule for us?
        
         | eezing wrote:
         | My thinking as well.
        
         | anamexis wrote:
         | It's not using an LLM, it's just comparing embeddings (which
         | are waaay cheaper)
        
           | riku_iki wrote:
           | But the embeddings came from somewhere (an LLM?).
        
       | lukasb wrote:
       | What is this for? The examples given could be handled
       | deterministically. Is this for situations where you don't know
       | JSON schemas in advance? What situations are those?
        
         | saltwatercowboy wrote:
         | The lazy part of my brain screams "use this instead of dealing
         | properly with nested objects!" In a production setting I'd be
         | worried about consistency between the source and result layers
         | if it's based on LLM-style transposition.
        
         | eezing wrote:
         | Data import via customer self-service onboarding.
        
         | tbrownaw wrote:
         | As is, it's not good for much beyond looking cool. (Maybe
         | implementing Postel's Law for a JSON API, but I think that's
         | considered bad taste these days.)
         | 
         | If, instead of transforming a single object, it output a table
         | of src_field -> dst_field mappings, it could potentially be a
         | useful first pass in some ETL development - a sketch of that
         | idea follows.
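         | 
         | Something like this, maybe (type and function names are made
         | up for illustration):
         | 
         |     // A reusable mapping instead of a transformed object:
         |     type FieldMapping = Record<string, string>; // src -> dst
         | 
         |     // Review the mapping once, then apply it deterministically
         |     // to any number of records:
         |     function applyMapping(
         |       src: Record<string, unknown>,
         |       mapping: FieldMapping
         |     ): Record<string, unknown> {
         |       const dst: Record<string, unknown> = {};
         |       for (const [srcKey, dstKey] of Object.entries(mapping)) {
         |         if (srcKey in src) dst[dstKey] = src[srcKey];
         |       }
         |       return dst;
         |     }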
        
       | henry700 wrote:
       | Keep the bug generators going, we will need the jobs
        
       | simonw wrote:
       | This is the code that does the work:
       | https://github.com/rectanglehq/Shapeshift/blob/d954dab2a866c...
       | 
       | There are a few ways this could be made less expensive to run:
       | 
       | 1. Cache those embeddings somewhere. You're only embedding simple
       | strings like "name" and "address" - no need to do that work more
       | than once in the entire lifetime of the tool (a sketch follows
       | after this list).
       | 
       | 2. As suggested here
       | https://news.ycombinator.com/item?id=40973028 change the design
       | of the tool so instead of doing the work it returns a reusable
       | data structure mapping input keys to output keys, so you only
       | have to run it once and can then use that generated data
       | structure to apply the transformations on large amounts of data
       | in the future.
       | 
       | 3. Since so many of the keys are going to have predictable names
       | ("name", "address" etc) you could even pre-calculate embeddings
       | for the 1,000 most common keys across all three embedding
       | providers and ship those as part of the package.
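       | 
       | For (1), a minimal sketch of what that cache could look like
       | (not Shapeshift's actual API, just an illustration using the
       | OpenAI Node SDK):
       | 
       |     import OpenAI from "openai";
       | 
       |     const openai = new OpenAI();
       |     const cache = new Map<string, number[]>(); // text -> embedding
       | 
       |     // Only call the API for keys we haven't embedded before.
       |     async function embedWithCache(texts: string[]): Promise<number[][]> {
       |       const missing = texts.filter((t) => !cache.has(t));
       |       if (missing.length > 0) {
       |         const response = await openai.embeddings.create({
       |           model: "text-embedding-3-small",
       |           input: missing,
       |         });
       |         // Pair results back up via each item's index field.
       |         for (const item of response.data) {
       |           cache.set(missing[item.index], item.embedding);
       |         }
       |       }
       |       return texts.map((t) => cache.get(t)!);
       |     }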
       | 
       | Also: in
       | https://github.com/rectanglehq/Shapeshift/blob/d954dab2a866c...
       | you're using Promise.map() to run multiple embeddings through the
       | OpenAI API at once, which risks tripping their rate-limit. You
       | should be able to pass the text as an array in a single call
       | instead, something like this:
       | 
       |     const response = await this.openai!.embeddings.create({
       |       model: this.embeddingModel,
       |       input: texts,
       |       encoding_format: "float",
       |     });
       |     return response.data.map(item => item.embedding);
       | 
       | https://platform.openai.com/docs/api-reference/embeddings/cr...
       | says input can be a string OR an array - that's reflected in the
       | TypeScript library here too: https://github.com/openai/openai-
       | node/blob/5873a017f0f2040ef...
        
         | marvinkennis wrote:
         | Thanks for the suggestions! Will implement these. Caching is a
         | great idea.
        
           | slantedview wrote:
           | In general, you might cross reference with other object
           | mapping libraries (including in other languages) to get ideas
           | on how they approach this problem. Caching mappings is just
           | one common strategy.
        
         | explaininjs wrote:
         | Watch out with the array mode, though: according to the OpenAI
         | docs it _technically_ can return the results in any order, and
         | you must sort them by index to be sure you have the right
         | associations. I've never seen them out of order in practice,
         | but it'd be entirely in-character for them to suddenly change
         | that sporadically and without warning, and now your entire
         | vectordb may or may not be nondeterministically ruined.
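         | 
         | Assuming the SDK's response shape, the defensive fix is tiny:
         | 
         |     // Each item carries an `index` field; sort by it before
         |     // pairing embeddings back up with the input texts.
         |     const sorted = [...response.data].sort((a, b) => a.index - b.index);
         |     const embeddings = sorted.map((item) => item.embedding);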
        
           | simonw wrote:
           | Yikes!
        
         | PaulHoule wrote:
         | I was involved in an attempt to do this kind of thing with CNN
         | neural networks just around the time BERT came out that was
         | mostly successful and actually we did great projects for
         | companies in the beverages, telecom, aviation and consumer
         | goods space.
         | 
         | It worked because it also had a conventional data-processing
         | pipeline that revolved around JSON documents.
         | 
         | For (2), it seems a system like that should be able to generate
         | a script in Python, a co-designed DSL, or some other language
         | to do the conversion.
         | 
         | One interesting thing about the product I worked on was that it
         | functioned as a profiler by looking at one cell at a time: if
         | some field contains "Gruff Rhys" or "Fan Bingbing", it could
         | tell that was probably somebody's name - all the better if it
         | can also see that the field label is something like "Full Name"
         | or "Xing Ming". I'd contrast that with more conventional
         | column-based profilers, which might notice that a certain field
         | only has the values "true" and "false" throughout the whole
         | column and would probably have some rule to determine that it
         | was a boolean field.
         | 
         | One thing that system could do is recognize private data inside
         | unstructured data. Where I work, for instance, we have
         | 
         | https://www.spirion.com/sensitive-data-discovery
         | 
         | which scans text and other files and warns if it sees a lot of
         | personal data, like an Excel spreadsheet full of names,
         | addresses, and phone numbers -- even if I just made them up as
         | test data.
        
         | btown wrote:
         | > returns a reusable data structure mapping input keys to
         | output keys
         | 
         | IMO this use case is exactly what Copilot is for. Write a
         | comment including one example each of input and output, and
         | tab-complete in your language of choice to have it create a
         | rewriter for you.
         | 
         | One benefit (and danger) is that it will look at the values,
         | not just the keys, and also may generate arbitrary code that
         | can e.g. adapt a firstName and lastName to a fullName. But
         | that's why you have a human being triggering and auditing this
         | for subtle bugs, and putting it through code review and source
         | control, right?
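         | 
         | A hypothetical example of what such a tab-completed rewriter
         | might look like (field names invented for illustration):
         | 
         |     // Input:  { first_name: "Ada", last_name: "Lovelace", phone: "555-0100" }
         |     // Output: { fullName: "Ada Lovelace", phoneNumber: "555-0100" }
         |     function rewrite(src: {
         |       first_name: string;
         |       last_name: string;
         |       phone: string;
         |     }) {
         |       return {
         |         // Uses the values' meaning, not just the keys:
         |         fullName: `${src.first_name} ${src.last_name}`,
         |         phoneNumber: src.phone,
         |       };
         |     }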
        
       | lordofmoria wrote:
       | Since LLMs are bad at the null hypothesis (in this case, when a
       | key does not exist in the source JSON), how does this prevent
       | hallucinating transformations for missing keys?
        
         | mpeg wrote:
         | This isn't using an LLM; it simply checks for similarity
         | between keys using vector embeddings.
        
       | explaininjs wrote:
       | What'd be really great is a codegen aspect. A non-negligible part
       | of any data munging operation is "this input object has fields
       | X, Y, Z and we need an output object with fields X, f(X), Y,
       | f(Y,Z)". This is something an LLM has a decent chance of being
       | really quite good at.
        
       | visarga wrote:
       | This task in its most general form is better done with a
       | question-answering prompt than with embeddings. How do you solve
       | "Full Name" -> "First Name", "Last Name" with embeddings? QA is
       | the right level of abstraction for schema conversion tasks. And
       | it's simple: just put the source JSON + target JSON schema in the
       | prompt and ask for value extraction.
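       | 
       | A rough sketch of that prompt (the wording is illustrative, not
       | a tested prompt):
       | 
       |     const source = { "Full Name": "Grace Hopper" };
       |     const targetSchema = { "First Name": "string", "Last Name": "string" };
       | 
       |     // One call per document; the model splits or combines
       |     // values as the target schema requires.
       |     const prompt = `Given this source object:
       |     ${JSON.stringify(source)}
       | 
       |     Fill in this target schema, splitting or combining fields as needed:
       |     ${JSON.stringify(targetSchema)}
       | 
       |     Reply with JSON only.`;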
        
       | yetanotherjosh wrote:
       | So this identifies keys from source and target objects that are
       | fuzzy synonyms and copies the values over. What is a real-world
       | use case for this? Add the fact that it's fuzzy and won't always
       | work - so it would require a great deal of extra effort in
       | QA/testing (harder than just mapping the keys programmatically) -
       | and I'm puzzled.
        
         | anamexis wrote:
         | We do something very similar with embeddings in our product.
         | Users import files that they have to match to a dynamically-
         | defined target schema. The embedding matching provides
         | suggested matches to the user that are generally very accurate,
         | so they don't have to go through and manually match up
         | "telephone" to "phone number" etc. It even works across
         | languages.
        
           | momojo wrote:
           | How much time does this save your users? Is this QOL? Or more
           | of an "our product wouldn't work without this feature" kind
           | of thing?
        
             | anamexis wrote:
             | Quite a bit of time. The product would still work without
             | the feature, but it is a major feature. It bypasses lots of
             | wading through dropdowns (potentially dozens for a single
             | session).
        
           | magicalhippo wrote:
           | I've got some similar use-cases. So, do I understand
           | correctly that you take the source keyword and generate an
           | embedding vector of it, then compare it using dot-product
           | similarity or something to the embedded vectors of the target
           | keywords?
        
             | anamexis wrote:
             | Exactly, although we use cosine similarity.
        
               | magicalhippo wrote:
               | Perfect. And yeah, that's what I meant - I'm so used to
               | just normalizing vectors that dot product = cosine.
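               | 
               | For anyone following along, the two forms in code:
               | 
               |     const dot = (a: number[], b: number[]) =>
               |       a.reduce((s, x, i) => s + x * b[i], 0);
               | 
               |     const norm = (a: number[]) => Math.sqrt(dot(a, a));
               | 
               |     // Cosine similarity; for pre-normalized (unit
               |     // length) vectors it reduces to the dot product.
               |     const cosine = (a: number[], b: number[]) =>
               |       dot(a, b) / (norm(a) * norm(b));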
        
       | hendler wrote:
       | Created a Rust version using devin.ai. (untested)
       | 
       | https://github.com/HumanAssisted/shapeshift-rust
        
       | benzguo wrote:
       | Put together a quick version with an LLM, using Substrate:
       | https://www.val.town/v/substrate/shapeshift
       | 
       | I've turned the target object into a JSON schema, but you could
       | probably generate that JSON schema pretty reliably using a
       | codegen LLM.
        
       | happy_bacon wrote:
       | Here is another DSL for implementing object model mappings:
       | https://github.com/patleahy/lir
        
       | leobg wrote:
       | The example could be handled with no machine learning at all.
       | Just use a bag-of-words comparison with a subword tokenizer. And
       | if you do need embeddings (to map synonyms/topics), fastText is
       | faster, cheaper and runs locally. For hard cases, you can feed
       | the source/target schemas to gpt-4o once to create a map - and
       | then apply that one map to all instances.
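       | 
       | The no-ML version can be as simple as a crude subword split plus
       | Jaccard overlap (a sketch, not any particular library's API):
       | 
       |     // "phoneNumber" and "phone_number" -> {"phone", "number"}
       |     const tokens = (key: string) =>
       |       new Set(
       |         key
       |           .replace(/([a-z])([A-Z])/g, "$1 $2")
       |           .split(/[^a-zA-Z0-9]+/)
       |           .filter(Boolean)
       |           .map((t) => t.toLowerCase())
       |       );
       | 
       |     // Jaccard similarity between two keys' token sets.
       |     function keySimilarity(a: string, b: string): number {
       |       const ta = tokens(a), tb = tokens(b);
       |       const inter = [...ta].filter((t) => tb.has(t)).length;
       |       return inter / (ta.size + tb.size - inter);
       |     }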
        
         | riku_iki wrote:
         | > fastText is faster, cheaper and runs locally
         | 
         | The question is whether the quality will be acceptable.
        
       ___________________________________________________________________
       (page generated 2024-07-16 23:01 UTC)