[HN Gopher] In the land of LLMs, can we do better mock data generation?
___________________________________________________________________
In the land of LLMs, can we do better mock data generation?
Author : pncnmnp
Score : 110 points
Date : 2024-09-29 17:06 UTC (3 days ago)
(HTM) web link (neurelo.substack.com)
(TXT) w3m dump (neurelo.substack.com)
| thelostdragon wrote:
| This looks quite interesting and promising.
| pitah1 wrote:
| The world of mock data generation is now flooded with ML/AI
| solutions generating data, but this is a solution that understands
| it is better to generate metadata to help guide the data
| generation. I found this to be the case given that the former
| solutions rely on production data, require retraining, run slowly,
| consume huge resources, offer no guarantee against leaking
| sensitive data, and cannot retain referential integrity.
|
| As mentioned in the article, I think there is a lot of potential
| in this area for improvement. I've been working on a tool called
| Data Caterer (https://github.com/data-catering/data-caterer)
| which is a metadata-driven data generator that also can validate
| based on the generated data. Then you have full end-to-end
| testing using a single tool. There are also other metadata
| sources that can help drive these kinds of tools besides
| LLMs (e.g. data catalogs, data quality checks).
| lysecret wrote:
| This is a very good point; that's probably my number one use-case
| for things like copilot chat: just filling in some of my types and
| generating some test cases.
| alex-moon wrote:
| Big fan of this write up as it presents a really easy to
| understand and at the same time brutally honest example of a
| domain in which a) you would expect LLMs to perform very well, b)
| they don't and c) the solution is to make the use of ML more
| targeted, a complement to human reasoning rather than a
| replacement for it.
|
| Over and over again we see businesses sinking money into "AI"
| where they are effectively doing a) and then calling it a day,
| blithely expecting profit to roll in. The day cannot come too
| soon when these businesses all lose their money and the hype
| finally dies - and we can go back to using ML the way this write
| up does (i.e. the way it is meant to be used). Let's hope no
| critical systems (e.g. healthcare or law enforcement) make the
| same mistake these businesses are making before then.
| infecto wrote:
| On the flip side I thought the write up was weak on details and
| while "brutally honest" it did not touch on how they even tried
| to implement an LLM in the workflow and for all we know they
| were using an outdated model or a bad implementation. Your bias
| seems to follow it, though: you have jumped so quickly into a
| camp that it's easy to enjoy an article that supports your
| worldview.
| jerf wrote:
| To be honest, I exited the article thinking the answer is
| "no", or at least, perilously close to "no". The same amount
| of work put into a conventional solution probably would have
| been better. That cross-product "solution" is a generalized
| fix for data generation from a weak data source and as near
| as I can tell is what is actually doing most of the lifting,
| not the LLM.
|
| That said, I'm not convinced there isn't something to the
| idea, I just don't know that that is the correct use of LLMs.
| I find myself wondering if from-scratch training, of a much,
| much smaller model trained on the original data, using LLM
| technology but not using one of the current monsters, might
| not work better. I also wonder if this might be a case where
| prompt engineering isn't the way to go but directly sampling
| the resulting model might be a better way to go. Or maybe
| start with GPT-2 and ask it for lists of things; in a weird
| sort of way, GPT-2's "spaciness" and inaccuracy is sort of
| advantageous for this. Asking "give me a list of names" and
| getting "Johongle X. Boodlesmith" would be disastrous from a
| modern model, but for this task is actually a win. (And I
| wouldn't ask GPT-2 to try to format the data, I'd probably go
| for just getting a list of nicely randomized-but-plausible
| data, and solve all the issues like "tying the references
| together" conventionally.)
| krainboltgreene wrote:
| Is this the new normal for comments? Incredibly bad faith.
| infecto wrote:
| How so? Their implementation was interesting, but I think it
| missed the whole setup of what did and did not work on the
| LLM side. Having just a few of those details would have made
| it very interesting. As it stands, it's really hard to decide
| whether an LLM is or is not the way.
|
| If you have such an opinion why not share how I could
| communicate it better?
| gopher_space wrote:
| Your line about the parent commenter's bias was weird and
| rude. You've never met the person and are accusing them
| of something you're in the process of doing yourself.
|
| https://www.youtube.com/watch?v=_cJO7pkx2jQ
| infecto wrote:
| Darn I hate being weird. Thanks!
| bcoates wrote:
| "If it didn't work you didn't believe hard enough" also
| known as "Real Communism has never been tried" or
| "Conservatism never fails, it can only be failed" is a
| sort of... information-free stock position.
|
| Basically, if _thing_ is good it needs to still be good
| when tried in the real world by flawed humans, so if
| someone says "I tried _thing_ and it didn't work,"
| replying with "well maybe _thing_ is good but you suck"
| isn't productive.
| infecto wrote:
| Sorry, I think it's totally justified to question an
| article when it provides nothing more than "we tried
| and it did not work." The whole premise was "can it be
| done," but it was missing the basic information needed to
| draw a conclusion.
|
| Now maybe I was too weird in my response to the OP, but it
| really leaned into an "LLMs are bad" narrative.
| dogma1138 wrote:
| Most LLMs I've played with are terrible at generating mock data
| that is in any way useful because they are strongly reinforced
| against anything that could be used for "recall".
|
| At least when playing around with llama2 for this, you need to
| abliterate it to the point of lobotomy to do anything, and then
| the usefulness drops for other reasons.
| benxh wrote:
| I'm pretty sure that Neosync[0] does this to a pretty good
| degree, it is open source and YC funded too.
|
| [0] https://www.neosync.dev/
| danielbln wrote:
| Did I miss it, or did the article not mention which LLM they
| tried or what prompts they used? They also mention zero-shot
| only, meaning no in-context learning. And they didn't think to
| tweak the instructions after it failed the first time? I don't
| know; it doesn't seem like they really tried all that hard, and
| basically just quickly checked the "yep, LLMs don't work here"
| box.
| eesmith wrote:
| A European friend of mine told me about some of the problems of
| mock data generation.
|
| A hard one, at least for the legal requirements in her field, is
| that it must not include a real person's information.
|
| Like, if it says "John Smith, 123 Oak St." and someone actually
| lives there with that name, then it's a privacy violation.
|
| You end up having to use addresses that specifically do not
| exist, and driver's license numbers which are invalid, etc.
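| A minimal Python sketch of this "invalid by construction" idea; it leans on well-known reservations (SSN area numbers 900-999 are never issued by the SSA, 555-01XX phone numbers are reserved for fictional use, and example.com is reserved by RFC 2606), so none of the generated identifiers can belong to a real person:

```python
import random

def fake_ssn():
    # Area numbers 900-999 are never issued, so this cannot
    # collide with a real person's SSN.
    return f"{random.randint(900, 999)}-{random.randint(10, 99)}-{random.randint(1000, 9999)}"

def fake_phone():
    # 555-0100 through 555-0199 are reserved for fictional use
    # in the North American Numbering Plan.
    return f"({random.randint(200, 999)}) 555-01{random.randint(0, 99):02d}"

def fake_email(name):
    # example.com is reserved by RFC 2606 and never delivers mail.
    return f"{name.lower().replace(' ', '.')}@example.com"
```

| The same trick extends to other identifiers (invalid check digits for license numbers, non-existent street numbers, and so on).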
| mungoman2 wrote:
| Surely that's only their interpretation of privacy laws, and
| not something tested in courts.
|
| It seems unlikely to actually break regulations if it's clear
| that the data has been fished out of the entropy well.
| aithrowawaycomm wrote:
| But if "fished out of the entropy well" includes "a direct
| copy of something which should not have been in the training
| data in the first place, like a corporate HR document," then
| that's a big problem.
|
| I don't think AI providers get to hide behind an "entropy
| well" defense when that entropy is a direct consequence of AI
| professionals' greed and laziness around data governance.
| eesmith wrote:
| The conversation was about 15 years ago so my memory might be
| wrong. But if you happen to have an SSN correctly matching
| someone's name, can you say it's been fished out of the
| entropy well? As aithrowawaycomm commented, how can you know
| it didn't regurgitate part of the training set, which
| happened to contain real data?
| sgarland wrote:
| IMO, nothing beats a carefully curated selection of data,
| randomly selected (with correlations as needed). The problem is
| you rapidly start getting into absurd levels of detail for things
| like postal addresses, at least, if you want them to be accurate.
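| A sketch of what that curated approach might look like in Python; the pools here are purely illustrative, and the point is that correlated fields (city/state/ZIP) are stored as whole tuples so they can never drift out of sync:

```python
import random

# Curated pools; addresses are kept as whole tuples so city, state,
# and ZIP always stay mutually consistent.
FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger"]
ADDRESSES = [
    ("Portland", "OR", "97201"),
    ("Austin", "TX", "78701"),
    ("Madison", "WI", "53703"),
]

def random_person(rng=random):
    # Draw the correlated fields together, never independently.
    city, state, zip_code = rng.choice(ADDRESSES)
    return {
        "name": rng.choice(FIRST_NAMES),
        "city": city,
        "state": state,
        "zip": zip_code,
    }
```

| Scaling this to accurate nationwide postal data is exactly where the "absurd levels of detail" kick in.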
| chromanoid wrote:
| The article reads like it was a bullet point list inflated by AI.
| But maybe I am just allergic to long texts nowadays.
|
| I wonder if we will use AI users to generate mock data and e2e
| test our applications in the near future. This would probably
| generate even more realistic data.
| roywiggins wrote:
| a digression but
|
| > this text has been the industry's standard dummy text ever
| since some printer in the 1500s
|
| doesn't seem to be true:
|
| https://slate.com/news-and-politics/2023/01/lorem-ipsum-hist...
| yawnxyz wrote:
| ok so a long time ago I used "real-looking examples" in a bunch
| of client prototypes (for a big widely known company's web store)
| and the account managers couldn't tell whether these were new
| items that had been released or not... so somehow the mock data
| ended up in production (before it got caught and snipped)
|
| ever since then I use "real-but-dumb examples" so people know at
| a glance that it can't possibly be real
|
| the reason I don't like latin placeholder text is b/c the word
| lengths are different than english so sentence widths end up very
| different
| globalise83 wrote:
| Yes, this should be a lesson in all software engineering
| courses: never use real or realistic data in examples or
| documentation. Once made the mistake of using a realistic but
| totally fake configuration id and had people use it in their
| production setup. Far better to use configId=justanexampleid or
| whatever.
| sgarland wrote:
| That sounds like a problem with the account managers, not you.
|
| Accurate and realistic data is important for doing proper load
| tests.
| SkyVoyager99 wrote:
| I think this article does a good job in capturing the
| complexities of generating test data for real world databases.
| Generating mock data using LLMs for individual tables based on
| the naming of the fields is one thing, but doing it across
| multiple tables, while honoring complex relationships across them
| (primary-foreign keys across 1:1, 1:N, and M:N with intermediate
| tables) is a whole other level of challenge. And it's even
| harder for databases such as MongoDB, where the relationships
| across collections are often implicit and can best be inferred
| based on the names of the fields.
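| The conventional way around the cross-table problem, sketched below in Python with hypothetical users/orders tables, is to generate parent rows first and sample every foreign key from the already-generated primary keys, so referential integrity holds by construction:

```python
import random
import uuid

def generate_users(n):
    # Parent table: primary keys are created here.
    return [{"id": str(uuid.uuid4()), "name": f"user_{i}"} for i in range(n)]

def generate_orders(users, n):
    # Child table: every FK is sampled from existing parent PKs,
    # so no order can ever reference a missing user.
    return [
        {"id": str(uuid.uuid4()),
         "user_id": random.choice(users)["id"],
         "total": round(random.uniform(5, 500), 2)}
        for _ in range(n)
    ]

users = generate_users(10)
orders = generate_orders(users, 50)
```

| For M:N relationships the same idea applies: generate both parent tables first, then fill the intermediate table by sampling pairs of existing keys.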
| gopher_space wrote:
| > Generating mock data using LLMs for individual tables based
| on the naming of the fields is one thing, but doing it across
| multiple tables, while honoring complex relationships across
| them (primary-foreign keys across 1:1, 1:N, and M:N with
| intermediate tables) is a whole other level of challenge.
|
| So much so that I'm wondering about the context and how useful
| the results would be if the idea was self-applied. The article
| talks about mocking data for a number of clients, and I
| appreciate that viewpoint, but I'm struggling to picture a
| scenario where I wouldn't have the time or desire to hand-craft
| my own test data.
| SkyVoyager99 wrote:
| Well, a few scenarios come to mind:
| 1) Keeping the test data up-to-date as the schema changes takes
| a fair amount of work, especially if it's a schema that's
| actively changing and being worked on in a team by more than
| one developer.
| 2) Not everyone wants to craft their own test data even if they
| can, because they would rather spend their time doing something
| else.
| 3) Test data generation at even modest scale can be quite painful
| to hand-craft (and keep up-to-date).
| 4) Capturing all the variance across the data, e.g. combinations
| of nulls across fields, lengths of data across the fields, etc.
| ShanAIDev wrote:
| This is a fascinating topic! The ability to generate high-
| fidelity mock data can significantly streamline development and
| testing processes. It's a smart move given the diverse tech
| stacks in use today.
|
| Overall, this looks like a promising direction!
| dartos wrote:
| Maybe I'm confused, but why would an llm be better at mapping
| tuples to functions as opposed to a kind of switch statement?
|
| Especially since it doesn't seem to totally understand the
| breadth of possible kinds of faked data?
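| For reference, the switch-statement alternative alluded to here can be as simple as a dispatch table from field-name patterns to generator functions (the patterns and generators below are illustrative):

```python
import random

# Dispatch table: field-name substring -> generator function.
GENERATORS = {
    "email": lambda: f"user{random.randint(1, 9999)}@example.com",
    "age": lambda: random.randint(18, 90),
    "price": lambda: round(random.uniform(1, 100), 2),
}

def generate_value(column_name):
    # First matching pattern wins.
    for pattern, gen in GENERATORS.items():
        if pattern in column_name.lower():
            return gen()
    return None  # unknown column: fall back to something else
```

| The open question is whether an LLM adds anything beyond handling the `None` fallback, i.e. columns no pattern anticipated.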
| WhiteOwlEd wrote:
| Building on this, Human preference optimization (such as Direct
| Preference Optimization or Kahneman Tversky Optimization) could
| be used to help in refining models to create better data.
|
| I wrote about this more recently in the context of using LLMs to
| improve data pipelines. That blog post is at:
| https://www.linkedin.com/posts/ralphbrooks_bigdata-dataengin...
| zebomon wrote:
| Good read. I wonder to what degree this kind of step-making,
| which I suppose is what is often happening under the hood of
| OpenAI's o1 "reasoning" model, is set up manually (manually as
| in on a case-by-case basis) as you've done here.
|
| I'm reminded of an evening that I spent playing Overcooked 2 with
| my partner recently. We made it through to the 4-star rounds,
| which are very challenging, and we realized that for one of the
| later 4-star rounds, one could reach the goal rather easily -- by
| taking advantage of a glitch in the way that items are stored on
| the map. This realization brought up an interesting conversation,
| as to whether or not we should then beat the round twice, once
| using the glitch and once not.
|
| With LLMs right now, I think there's still a widespread hope
| (wish?) that the emergent capabilities seen in scaled-up data and
| training epochs will yield ALL capabilities hereon. Fortunately
| for the users of this site, hacking together solutions seems like
| it's going to remain necessary for many goals.
| nonameiguess wrote:
| We faced probably about the worst form of this problem you can
| face when working for the NRO on ground processing of satellite
| data. When new orbital sensor platforms are developed, new
| processing software has to be developed in tandem, but the
| software has to be developed and tested before the platforms are
| actually launched, so real data is impossible and you have to
| generate and process synthetic data instead.
|
| Even then, it's an entirely tractable problem. If you understand
| the physical characteristics and capabilities of the sensors and
| the basic physics of satellite imaging in general, you simply use
| that knowledge. You can't possibly know what you're really going
| to see when you get into space and look, but you at least know
| the mathematical characteristics the data will have.
|
| The entire problem here is you need a lot of expertise to do
| this. It's not even expertise I have or any other software
| developer had or has. We needed PhDs in orbital mechanics,
| atmospheric studies, and image science to do it. There isn't and
| probably never will be a "one-click" button to just make it
| happen, but this kind of thing might honestly be a great test for
| anyone that truly believes LLMs can reason at a level equal to
| human experts. Generate a form of data that has never existed,
| thus cannot have been in your training set, from first principles
| of basic physics.
| larodi wrote:
| The thing is that this test data generation does not work if you
| don't account for the schema. The author did so; well done. I've
| been following the same algorithm for a year, and it works as
| long as the context is big enough to hold the generated ids, or
| otherwise you feed in the ids for the missing FKs.
|
| But this is really not a breakthrough; anyone with fair knowledge
| of LLMs and E/R should be able to devise it. The fact that not
| many people have interdisciplinary knowledge is very much evident
| from all the text2sql papers, for example, which is a similar
| domain.
| Version467 wrote:
| > anyone with fair knowledge of LLMs and E/R should be able to
| devise it.
|
| While this may be true, I think it overlooks a really important
| aspect. Current LLMs could be very useful in many workflows
| _if_ someone does the grunt work of properly integrating it.
| That's not necessarily complicated, but it is quite a bit of
| work.
|
| I don't think we'll hit a capabilities wall anytime soon, but
| if we do, we'll still have years of work to do to properly
| make use of everything LLMs have to offer today.
| edrenova wrote:
| Nice write up, mock data generation with LLMs is pretty tough. We
| spent time trying to do it across multiple tables and it always
| had issues. Whether you look at classical ML models like GANs or
| even LLMs, they struggle with producing a lot of data while
| respecting FKs, constraints, and other relationships.
|
| Maybe someday it will get better, but for now we've found that
| using a more traditional algorithmic approach is more consistent.
|
| Transparency: founder of Neosync - open source data anonymization
| - github.com/nucleuscloud/neosync
| its_down_again wrote:
| I've spent some time in enterprise TFO/demo engineering, and
| this kind of generative tool would've been a game changer. When
| it comes to synthetic data, the challenge lies at the sweet
| spot of being both "super tough" and in high business demand.
| When you're working with customer data, it's pretty risky--just
| anonymizing PII doesn't cut it. You've got to create data
| that's far enough removed from the original to really stay in
| the clear. But even if you can do it once, AI tools often need
| thousands of data rows to make the demo worthwhile. Without
| that volume, the visualizations fall flat, and the demo doesn't
| have any impact.
|
| I found the challenge with LLMs isn't generating a "real enough"
| data point--that's doable. It's about, "How do I load this
| in?", then, "How do I generate hundreds of these?" And even
| beyond that, "How do I make these pseudo-random in a way that
| tells a coherent story with the graphs?" It always feels like
| you're right on the edge, but getting it to work reliably in
| the way you need is harder than it looks.
| edrenova wrote:
| Yup, agreed. We built an orchestration engine into Neosync for
| that reason. It can handle all of the reading/writing from DBs
| for you, and can also generate data from scratch (using LLMs or
| not).
| jumploops wrote:
| The title and the contents don't match.
|
| The author expected to use LLMs to just solve the mock data
| problem, including traversing the schema and generating the
| correct Rust code for DB insertions.
|
| This demonstrates little about using LLMs for _mock data_ and
| more about using LLMs for understanding existing system
| architecture.
|
| The latter is a hard problem, as humans are known to create messy
| and complex systems (see: any engineer joining a new company).
|
| For mock data generation, we've[0] actually found LLMs to be
| fantastic, however there are a few tricks.
|
| 1. Few-shot prompting: use a couple of example "records" by
| inserting user/assistant messages to "prime" the context.
| 2. Keep the records you've generated in context, i.e. treat every
| record generated as a historical chat message. This helps avoid
| duplicates/repeats of common tropes (e.g. John Smith).
| 3. Split your tables into multiple generation steps -- e.g. start
| with "users" and then for each user generate an "address" (with
| history!), and so on. Model your mock data creation after your
| schema and its constraints; don't rely on the LLM for this step.
| 4. Separate mock data generation and DB updates into disparate
| steps. First generate CSVs (or JSON/YAML) of your data, then use
| separate script(s) to insert that data. This helps you avoid
| issues at insertion, as you can easily tweak, retry, or pass
| on malformed data.
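| A rough Python sketch of tricks 1 and 2, assuming the common chat-completions message convention (the record fields are illustrative); it only builds the prompt, with no actual API call:

```python
import json

def build_messages(examples, history):
    """Few-shot prompt: seed with example records as prior chat turns,
    then replay previously generated records so the model avoids repeats."""
    messages = [{"role": "system",
                 "content": "You generate one JSON user record per request."}]
    # Trick 1: few-shot priming with hand-written example records.
    for ex in examples:
        messages.append({"role": "user", "content": "Generate one user record."})
        messages.append({"role": "assistant", "content": json.dumps(ex)})
    # Trick 2: keep every generated record in context as chat history.
    for rec in history:
        messages.append({"role": "user", "content": "Generate one user record."})
        messages.append({"role": "assistant", "content": json.dumps(rec)})
    messages.append({"role": "user",
                     "content": "Generate one new, distinct user record."})
    return messages

examples = [{"name": "Priya Patel", "city": "Austin"}]
history = [{"name": "Ken Adams", "city": "Boise"}]
msgs = build_messages(examples, history)
```

| Each newly generated record gets appended to `history` before the next call, which is what keeps the model from re-emitting John Smith.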
|
| LLMs are fantastic tools for mock data creation, but don't expect
| them to also solve the problem of understanding your legacy DB
| schemas and application code all at once (yet?).
|
| [0]https://www.youtube.com/watch?v=BJ1wtjdHn-E
___________________________________________________________________
(page generated 2024-10-02 23:01 UTC)