[HN Gopher] In the land of LLMs, can we do better mock data generation?
       ___________________________________________________________________
        
       In the land of LLMs, can we do better mock data generation?
        
       Author : pncnmnp
       Score  : 110 points
       Date   : 2024-09-29 17:06 UTC (3 days ago)
        
 (HTM) web link (neurelo.substack.com)
 (TXT) w3m dump (neurelo.substack.com)
        
       | thelostdragon wrote:
       | This looks quite interesting and promising.
        
       | pitah1 wrote:
        | The world of mock data generation is now flooded with ML/AI
        | solutions that generate data directly, but this one understands
        | that it is better to generate metadata to help guide the data
        | generation. I found this to be the case because the former
        | solutions rely on production data, require retraining, run
        | slowly, consume huge resources, offer no guarantee against
        | leaking sensitive data, and cannot retain referential integrity.
       | 
       | As mentioned in the article, I think there is a lot of potential
       | in this area for improvement. I've been working on a tool called
       | Data Caterer (https://github.com/data-catering/data-caterer)
        | which is a metadata-driven data generator that can also validate
        | the generated data. Then you have full end-to-end
       | testing using a single tool. There are also other metadata
       | sources that can help drive these kinds of tools outside of using
       | LLMs (i.e. data catalogs, data quality).
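        | 
        | To make "metadata-driven" concrete, here is a minimal sketch in
        | Python (the metadata layout is hypothetical, not Data Caterer's
        | actual format):
        | 
        |     import random
        |     import string
        | 
        |     # Hypothetical per-column metadata; a real tool would pull
        |     # this from a schema, data catalog, or data-quality rules.
        |     METADATA = {
        |         "users": {
        |             "id":    {"type": "int", "unique": True},
        |             "email": {"type": "email"},
        |             "age":   {"type": "int", "min": 18, "max": 90},
        |         }
        |     }
        | 
        |     def gen_value(spec, seq):
        |         if spec["type"] == "int":
        |             if spec.get("unique"):
        |                 return seq  # monotonically increasing => unique
        |             return random.randint(spec["min"], spec["max"])
        |         if spec["type"] == "email":
        |             name = "".join(
        |                 random.choices(string.ascii_lowercase, k=8))
        |             return name + "@example.com"  # RFC 2606 domain
        |         raise ValueError("unknown type: " + spec["type"])
        | 
        |     def gen_rows(table, n):
        |         cols = METADATA[table]
        |         return [{c: gen_value(s, i) for c, s in cols.items()}
        |                 for i in range(1, n + 1)]
        | 
        |     print(gen_rows("users", 3))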
        
       | lysecret wrote:
        | This is a very good point; that's probably my number one use
        | case for things like Copilot chat: filling in some of my types
        | and generating some test cases.
        
       | alex-moon wrote:
        | Big fan of this write-up, as it presents a really easy-to-
        | understand and at the same time brutally honest example of a
        | domain in which (a) you would expect LLMs to perform very well,
        | (b) they don't, and (c) the solution is to make the use of ML
        | more targeted, a complement to human reasoning rather than a
        | replacement for it.
       | 
        | Over and over again we see businesses sinking money into "AI"
        | where they are effectively doing (a) and then calling it a day,
        | blithely expecting profit to roll in. The day cannot come too
        | soon when these businesses all lose their money and the hype
        | finally dies, and we can go back to using ML the way this write-
        | up does (i.e. the way it is meant to be used). Let's hope no
        | critical systems (e.g. healthcare or law enforcement) make the
        | same mistake these businesses are making before that time.
        
         | infecto wrote:
          | On the flip side, I thought the write-up was weak on details;
          | while "brutally honest", it did not touch on how they even
          | tried to implement an LLM in the workflow, and for all we know
          | they were using an outdated model or a bad implementation.
          | Your bias seems to follow it, though; you have jumped so
          | quickly into a camp that it's easy to enjoy an article that
          | supports your worldview.
        
           | jerf wrote:
           | To be honest, I exited the article thinking the answer is
           | "no", or at least, perilously close to "no". The same amount
           | of work put into a conventional solution probably would have
           | been better. That cross-product "solution" is a generalized
           | fix for data generation from a weak data source and as near
           | as I can tell is what is actually doing most of the lifting,
           | not the LLM.
           | 
           | That said, I'm not convinced there isn't something to the
           | idea, I just don't know that that is the correct use of LLMs.
           | I find myself wondering if from-scratch training, of a much,
           | much smaller model trained on the original data, using LLM
           | technology but not using one of the current monsters, might
            | not work better. I also wonder if this might be a case where
            | prompt engineering isn't the right approach and directly
            | sampling the resulting model would work better. Or maybe
           | start with GPT-2 and ask it for lists of things; in a weird
           | sort of way, GPT-2's "spaciness" and inaccuracy is sort of
           | advantageous for this. Asking "give me a list of names" and
           | getting "Johongle X. Boodlesmith" would be disastrous from a
           | modern model, but for this task is actually a win. (And I
           | wouldn't ask GPT-2 to try to format the data, I'd probably go
           | for just getting a list of nicely randomized-but-plausible
           | data, and solve all the issues like "tying the references
           | together" conventionally.)
        
           | krainboltgreene wrote:
           | Is this the new normal for comments? Incredibly bad faith.
        
             | infecto wrote:
              | How so? Their implementation was interesting, but I think
              | it missed the whole setup of what did and did not work on
              | the LLM side. Having just a few of those details would
              | have made it very interesting. As it stands, it's really
              | hard to decide if an LLM is or is not the way.
             | 
              | If you have such an opinion, why not share how I could
              | communicate it better?
        
               | gopher_space wrote:
               | Your line about the parent commenter's bias was weird and
               | rude. You've never met the person and are accusing them
               | of something you're in the process of doing yourself.
               | 
               | https://www.youtube.com/watch?v=_cJO7pkx2jQ
        
               | infecto wrote:
               | Darn I hate being weird. Thanks!
        
               | bcoates wrote:
               | "If it didn't work you didn't believe hard enough" also
               | known as "Real Communism has never been tried" or
               | "Conservatism never fails, it can only be failed" is a
               | sort of... information-free stock position.
               | 
               | Basically, if _thing_ is good it needs to still be good
               | when tried in the real world by flawed humans, so if
               | someone says  "I tried _thing_ and it didn 't work"
               | replying with "well maybe _thing_ is good but you suck "
               | isn't productive.
        
               | infecto wrote:
                | Sorry, I think it's totally justified to question an
                | article when they provided nothing more beyond "we tried
                | and it did not work." The whole premise was "can it be
                | done", but it was missing the basic information needed
                | to draw a conclusion.
                | 
                | Now maybe I was too weird in my response to the OP, but
                | it really went down an "LLMs are bad" narrative.
        
       | dogma1138 wrote:
       | Most LLMs I've played with are terrible at generating mock data
       | that is in any way useful because they are strongly reinforced
       | against anything that could be used for "recall".
       | 
        | At least when playing around with llama2 for this, you need to
        | abliterate it to the point of lobotomy to do anything, and then
        | the usefulness drops for other reasons.
        
       | benxh wrote:
        | I'm pretty sure that Neosync[0] does this to a pretty good
        | degree; it is open source and YC-funded too.
       | 
       | [0] https://www.neosync.dev/
        
       | danielbln wrote:
        | Did I miss it, or did the article not mention which LLM they
        | tried or what prompts they used? They also mention zero-shot
        | only, meaning no in-context learning. And they didn't think to
        | tweak the instructions after it failed the first time? I don't
        | know; it doesn't seem like they really tried all that hard and
        | basically just quickly checked the "yep, LLMs don't work here"
        | box.
        
       | eesmith wrote:
       | A European friend of mine told me about some of the problems of
       | mock data generation.
       | 
       | A hard one, at least for the legal requirements in her field, is
       | that it must not include a real person's information.
       | 
       | Like, if it says "John Smith, 123 Oak St." and someone actually
       | lives there with that name, then it's a privacy violation.
       | 
       | You end up having to use addresses that specifically do not
       | exist, and driver's license numbers which are invalid, etc.
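        | 
        | One conventional trick is to draw from ranges that are reserved
        | for fiction or documentation, so a collision with a real person
        | is impossible by construction. A sketch (US-centric and
        | simplified; verify the legal requirements in your own
        | jurisdiction):
        | 
        |     import random
        | 
        |     def fake_phone():
        |         # 555-0100 through 555-0199 is reserved for fictional
        |         # use in the North American Numbering Plan.
        |         return "555-01%02d" % random.randint(0, 99)
        | 
        |     def fake_email(name):
        |         # example.com is reserved for documentation (RFC 2606).
        |         return name.lower() + "@example.com"
        | 
        |     def fake_ssn():
        |         # Area number 000 is never issued as a real SSN.
        |         return "000-%02d-%04d" % (random.randint(1, 99),
        |                                   random.randint(1, 9999))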
        
         | mungoman2 wrote:
         | Surely that's only their interpretation of privacy laws, and
         | not something tested in courts.
         | 
         | It seems unlikely to actually break regulations if it's clear
         | that the data has been fished out of the entropy well.
        
           | aithrowawaycomm wrote:
           | But if "fished out of the entropy well" includes "a direct
           | copy of something which should not have been in the training
           | data in the first place, like a corporate HR document," then
           | that's a big problem.
           | 
           | I don't think AI providers get to hide behind an "entropy
           | well" defense when that entropy is a direct consequence of AI
           | professionals' greed and laziness around data governance.
        
           | eesmith wrote:
           | The conversation was about 15 years ago so my memory might be
           | wrong. But if you happen to have an SSN correctly matching
           | someone's name, can you say it's been fished out of the
           | entropy well? As aithrowawaycomm commented, how can you know
           | it didn't regurgitate part of the training set, which
           | happened to contain real data?
        
       | sgarland wrote:
        | IMO, nothing beats a carefully curated selection of data,
        | randomly sampled (with correlations as needed). The problem is
        | that you rapidly get into absurd levels of detail for things
        | like postal addresses, at least if you want them to be accurate.
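        | 
        | A minimal sketch of the curated approach: sample whole tuples so
        | correlated fields can never disagree (the rows below are
        | illustrative, not a vetted dataset):
        | 
        |     import random
        | 
        |     # City, state, and ZIP travel together, so a sampled
        |     # address is always internally consistent.
        |     ADDRESSES = [
        |         ("Portland", "OR", "97201"),
        |         ("Austin",   "TX", "78701"),
        |         ("Albany",   "NY", "12207"),
        |     ]
        | 
        |     def fake_address():
        |         city, state, zip_code = random.choice(ADDRESSES)
        |         return {"city": city, "state": state, "zip": zip_code}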
        
       | chromanoid wrote:
       | The article reads like it was a bullet point list inflated by AI.
       | But maybe I am just allergic to long texts nowadays.
       | 
       | I wonder if we will use AI users to generate mock data and e2e
       | test our applications in the near future. This would probably
       | generate even more realistic data.
        
       | roywiggins wrote:
       | a digression but
       | 
        | > this text has been the industry's standard dummy text ever
        | since some printer in the 1500s
       | 
       | doesn't seem to be true:
       | 
       | https://slate.com/news-and-politics/2023/01/lorem-ipsum-hist...
        
       | yawnxyz wrote:
        | ok so a long time ago I used "real-looking examples" in a bunch
        | of client prototypes (for a big widely known company's web store)
        | and the account managers couldn't tell whether these were new
        | items that had been released or not... so somehow the mock data
        | ended up in production (before it got caught and snipped)
        | 
        | ever since then I use "real-but-dumb examples" so people know at
        | a glance that it can't possibly be real
        | 
        | the reason I don't like latin placeholder text is b/c the word
        | lengths are different from english, so sentence widths end up
        | very different
        
         | globalise83 wrote:
          | Yes, this should be a lesson in all software engineering
          | courses: never use real or realistic data in examples or
          | documentation. I once made the mistake of using a realistic
          | but totally fake configuration id and had people use it in
          | their production setup. Far better to use
          | configId=justanexampleid or whatever.
        
         | sgarland wrote:
         | That sounds like a problem with the account managers, not you.
         | 
         | Accurate and realistic data is important for doing proper load
         | tests.
        
       | SkyVoyager99 wrote:
        | I think this article does a good job of capturing the
        | complexities of generating test data for real-world databases.
        | Generating mock data with LLMs for individual tables based on
        | the naming of the fields is one thing, but doing it across
        | multiple tables, while honoring complex relationships across
        | them (primary/foreign keys across 1:1, 1:N, and M:N with
        | intermediate tables), is a whole other level of challenge. And
        | it's even harder for databases such as MongoDB, where the
        | relationships across collections are often implicit and can
        | best be inferred from the names of the fields.
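        | 
        | The conventional fix is to generate tables in dependency order
        | so every FK resolves: parents first, then children, then the
        | M:N join table. A sketch with illustrative table names:
        | 
        |     import random
        | 
        |     users = [{"id": i} for i in range(1, 11)]         # parents
        |     orders = [{"id": i,                               # 1:N
        |                "user_id": random.choice(users)["id"]}
        |               for i in range(1, 31)]
        |     tags = [{"id": i} for i in range(1, 6)]
        | 
        |     # M:N via an intermediate table; sampling pairs without
        |     # replacement keeps (order_id, tag_id) combinations unique.
        |     pairs = random.sample(
        |         [(o["id"], t["id"]) for o in orders for t in tags],
        |         k=20)
        |     order_tags = [{"order_id": o, "tag_id": t}
        |                   for o, t in pairs]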
        
         | gopher_space wrote:
          | > Generating mock data with LLMs for individual tables based
          | on the naming of the fields is one thing, but doing it across
          | multiple tables, while honoring complex relationships across
          | them (primary/foreign keys across 1:1, 1:N, and M:N with
          | intermediate tables), is a whole other level of challenge.
         | 
         | So much so that I'm wondering about the context and how useful
         | the results would be if the idea was self-applied. The article
         | talks about mocking data for a number of clients, and I
         | appreciate that viewpoint, but I'm struggling to picture a
         | scenario where I wouldn't have the time or desire to hand-craft
         | my own test data.
        
           | SkyVoyager99 wrote:
            | Well, a few scenarios come to mind: 1) keeping the test
            | data up to date as the schema changes takes a fair amount
            | of work, especially if it's a schema that's actively
            | changing and being worked on by more than one developer;
            | 2) not everyone wants to craft their own test data even if
            | they can, because, well, they would rather spend their time
            | doing something else; 3) test data generation at even
            | modest scale can be quite painful to hand-craft (and keep
            | up to date); 4) capturing all the variance across the data,
            | e.g. combinations of nulls across fields, lengths of data
            | across fields, etc.
        
       | ShanAIDev wrote:
        | This is a fascinating topic! The ability to generate high-
        | fidelity mock data can significantly streamline development and
        | testing processes. It's a smart move given the diverse tech
        | stacks in use today.
       | 
       | Overall, this looks like a promising direction!
        
       | dartos wrote:
        | Maybe I'm confused, but why would an LLM be better at mapping
        | tuples to functions than a kind of switch statement would be?
        | 
        | Especially since it doesn't seem to fully understand the
        | breadth of possible kinds of faked data?
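        | 
        | For contrast, the switch-statement version is just a dict from
        | (table, column) tuples to generator functions; a sketch with
        | illustrative names:
        | 
        |     import random
        | 
        |     GENERATORS = {
        |         ("users", "age"): lambda: random.randint(18, 90),
        |         ("users", "email"): lambda: "user%d@example.com"
        |                                     % random.randint(0, 9999),
        |     }
        | 
        |     def gen(table, column):
        |         try:
        |             return GENERATORS[(table, column)]()
        |         except KeyError:
        |             # The hard part is this fallback: columns nobody
        |             # wrote a rule for, which is where an LLM might
        |             # plausibly help.
        |             return None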
        
       | WhiteOwlEd wrote:
        | Building on this, human preference optimization (such as Direct
        | Preference Optimization or Kahneman-Tversky Optimization) could
        | be used to help refine models to create better data.
       | 
       | I wrote about this more recently in the context of using LLMs to
       | improve data pipelines. That blog post is at:
       | https://www.linkedin.com/posts/ralphbrooks_bigdata-dataengin...
        
       | zebomon wrote:
        | Good read. I wonder to what degree this kind of step-making,
        | which I suppose is what is often happening under the hood of
        | OpenAI's o1 "reasoning" model, is set up manually (manually as
        | in on a case-by-case basis) as you've done here.
       | 
       | I'm reminded of an evening that I spent playing Overcooked 2 with
       | my partner recently. We made it through to the 4-star rounds,
       | which are very challenging, and we realized that for one of the
       | later 4-star rounds, one could reach the goal rather easily -- by
       | taking advantage of a glitch in the way that items are stored on
       | the map. This realization brought up an interesting conversation,
       | as to whether or not we should then beat the round twice, once
       | using the glitch and once not.
       | 
       | With LLMs right now, I think there's still a widespread hope
       | (wish?) that the emergent capabilities seen in scaled-up data and
       | training epochs will yield ALL capabilities hereon. Fortunately
       | for the users of this site, hacking together solutions seems like
       | it's going to remain necessary for many goals.
        
       | nonameiguess wrote:
       | We faced probably about the worst form of this problem you can
       | face when working for the NRO on ground processing of satellite
       | data. When new orbital sensor platforms are developed, new
       | processing software has to be developed in tandem, but the
       | software has to be developed and tested before the platforms are
       | actually launched, so real data is impossible and you have to
       | generate and process synthetic data instead.
       | 
       | Even then, it's an entirely tractable problem. If you understand
       | the physical characteristics and capabilities of the sensors and
       | the basic physics of satellite imaging in general, you simply use
       | that knowledge. You can't possibly know what you're really going
       | to see when you get into space and look, but you at least know
       | the mathematical characteristics the data will have.
       | 
       | The entire problem here is you need a lot of expertise to do
       | this. It's not even expertise I have or any other software
       | developer had or has. We needed PhDs in orbital mechanics,
       | atmospheric studies, and image science to do it. There isn't and
       | probably never will be a "one-click" button to just make it
       | happen, but this kind of thing might honestly be a great test for
        | anyone who truly believes LLMs can reason at a level equal to
       | human experts. Generate a form of data that has never existed,
       | thus cannot have been in your training set, from first principles
       | of basic physics.
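        | 
        | In miniature, that first-principles approach might look like
        | the following (all parameters invented; a real pipeline would
        | use the platform's measured PSF and noise model):
        | 
        |     import numpy as np
        | 
        |     # Fake a sensor frame: scene radiance -> optical blur ->
        |     # shot noise -> read noise.
        |     rng = np.random.default_rng(42)
        |     scene = rng.uniform(0, 1000, size=(64, 64))  # photons/px
        | 
        |     # Optical blur: a 3x3 box PSF standing in for the real
        |     # system PSF.
        |     kernel = np.ones((3, 3)) / 9.0
        |     pad = np.pad(scene, 1, mode="edge")
        |     blurred = sum(pad[i:i + 64, j:j + 64] * kernel[i, j]
        |                   for i in range(3) for j in range(3))
        | 
        |     # Poisson shot noise plus Gaussian read noise.
        |     frame = rng.poisson(blurred) + rng.normal(0, 5, (64, 64))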
        
       | larodi wrote:
        | The thing is that this test data generation does not work if
        | you don't account for the schema. The author did so; well done.
        | I've been following the same algorithm for a year, and it works
        | as long as the context is big enough to keep the generated IDs;
        | otherwise you feed in the IDs for the missing FKs.
        | 
        | But this is really not a breakthrough; anyone with fair
        | knowledge of LLMs and E/R should be able to devise it. The fact
        | that not many people have interdisciplinary knowledge is very
        | much evident from all the text2sql papers, for example, which
        | is a similar domain.
        
         | Version467 wrote:
         | > anyone with fair knowledge of LLMs and E/R should be able to
         | devise it.
         | 
          | While this may be true, I think it overlooks a really
          | important aspect: current LLMs could be very useful in many
          | workflows _if_ someone does the grunt work of properly
          | integrating them. That's not necessarily complicated, but it
          | is quite a bit of work.
          | 
          | I don't think we'll hit a capabilities wall anytime soon, but
          | if we do, we'll still have years of work to do to properly
          | make use of everything LLMs have to offer today.
        
       | edrenova wrote:
        | Nice write-up; mock data generation with LLMs is pretty tough.
        | We spent time trying to do it across multiple tables, and it
        | always had issues. Whether you look at classical ML models like
        | GANs or even LLMs, they struggle with producing a lot of data
        | while respecting FKs, constraints, and other relationships.
        | 
        | Maybe some day it gets better, but for now we've found that a
        | more traditional algorithmic approach is more consistent.
       | 
       | Transparency: founder of Neosync - open source data anonymization
       | - github.com/nucleuscloud/neosync
        
         | its_down_again wrote:
         | I've spent some time in enterprise TFO/demo engineering, and
         | this kind of generative tool would've been a game changer. When
         | it comes to synthetic data, the challenge lies at the sweet
         | spot of being both "super tough" and in high business need.
         | When you're working with customer data, it's pretty risky--just
         | anonymizing PII doesn't cut it. You've got to create data
         | that's far enough removed from the original to really stay in
         | the clear. But even if you can do it once, AI tools often need
         | thousands of data rows to make the demo worthwhile. Without
         | that volume, the visualizations fall flat, and the demo doesn't
         | have any impact.
         | 
          | I found the challenge with LLMs isn't generating a "real
          | enough" data point--that's doable. It's about "How do I load
          | this in?", then "How do I generate hundreds of these?" And even
         | beyond that, "How do I make these pseudo-random in a way that
         | tells a coherent story with the graphs?" It always feels like
         | you're right on the edge, but getting it to work reliably in
         | the way you need is harder than it looks.
        
           | edrenova wrote:
            | Yup, agreed. We built an orchestration engine into Neosync
            | for that reason. It can handle all of the reading/writing
            | from DBs for you, and it can also generate data from
            | scratch (using LLMs or not).
        
       | jumploops wrote:
       | The title and the contents don't match.
       | 
       | The author expected to use LLMs to just solve the mock data
       | problem, including traversing the schema and generating the
       | correct Rust code for DB insertions.
       | 
       | This demonstrates little about using LLMs for _mock data_ and
       | more about using LLMs for understanding existing system
       | architecture.
       | 
       | The latter is a hard problem, as humans are known to create messy
       | and complex systems (see: any engineer joining a new company).
       | 
        | For mock data generation, we've[0] actually found LLMs to be
        | fantastic; however, there are a few tricks (see the sketch
        | after the list below).
       | 
        | 1. Few-shot prompting: use a couple of example "records" by
        | inserting user/assistant messages to "prime" the context.
        | 
        | 2. Keep the records you've generated in context, as in, treat
        | every record generated as a historical chat message. This helps
        | avoid duplicates/repeats of common tropes (e.g. John Smith).
        | 
        | 3. Split your tables into multiple generation steps -- e.g.
        | start with "users" and then for each user generate an "address"
        | (with history!), and so on. Model your mock data creation after
        | your schema and its constraints; don't rely on the LLM for this
        | step.
        | 
        | 4. Separate mock data generation and DB updates into disparate
        | steps. First generate CSVs (or JSON/YAML) of your data, and
        | then use separate script(s) to insert that data. This helps
        | avoid issues at insertion, as you can easily tweak, retry, or
        | pass on malformed data.
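        | 
        | A minimal sketch of tricks 1, 2, and 4 (call_llm() is a
        | hypothetical stand-in for whatever chat-completion API you use;
        | the stub below returns canned JSON so the sketch runs offline):
        | 
        |     import json
        | 
        |     def call_llm(messages):
        |         # Stub; swap in your provider's chat-completion client.
        |         i = len(messages)
        |         return (f'{{"name": "User {i}", '
        |                 f'"email": "u{i}@example.com"}}')
        | 
        |     def generate_records(n):
        |         messages = [
        |             {"role": "system", "content":
        |                 "Return one fake user as JSON with keys: "
        |                 "name, email."},
        |             # Trick 1: few-shot priming with an example exchange.
        |             {"role": "user", "content": "Generate one record."},
        |             {"role": "assistant", "content":
        |                 '{"name": "Priya Raman", '
        |                 '"email": "priya@example.com"}'},
        |         ]
        |         records = []
        |         for _ in range(n):
        |             messages.append({"role": "user",
        |                              "content": "Generate one record."})
        |             reply = call_llm(messages)
        |             # Trick 2: keep every generated record in context to
        |             # discourage duplicates like yet another John Smith.
        |             messages.append({"role": "assistant",
        |                              "content": reply})
        |             records.append(json.loads(reply))
        |         # Trick 4: write to a file; a separate script does the
        |         # DB inserts, so malformed rows can be tweaked, retried,
        |         # or dropped.
        |         with open("users.json", "w") as f:
        |             json.dump(records, f, indent=2)
        | 
        |     generate_records(5)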
       | 
       | LLMs are fantastic tools for mock data creation, but don't expect
       | them to also solve the problem of understanding your legacy DB
       | schemas and application code all at once (yet?).
       | 
       | [0]https://www.youtube.com/watch?v=BJ1wtjdHn-E
        
       ___________________________________________________________________
       (page generated 2024-10-02 23:01 UTC)