[HN Gopher] Synthetic data generation for tabular data
___________________________________________________________________
Synthetic data generation for tabular data
Author : skadamat
Score : 30 points
Date : 2024-02-27 19:05 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| skadamat wrote:
| If people are looking for a quickstart:
|
| Colab notebook for generating single-table data:
| https://colab.research.google.com/drive/1F3WWduNjcX4oKck6Xkj...
|
| Colab notebook for generating multi-table data:
| https://colab.research.google.com/drive/1L6i-JhJK9ROG-KFcyzT...
| mej10 wrote:
| Can someone help me understand the licensing of this?
|
| https://github.com/sdv-dev/SDV/blob/main/LICENSE
|
| It was MIT licensed up until 2022, when it was changed to what it
| is now. They say it will become MIT again 4 years after
| release... but is that counted from when the license was changed
| or from the first release of the software on GitHub?
| rch wrote:
| IANAL but I've taken it to mean that releases acquired under the
| original license would continue to be governed by those terms.
|
| I'm liking this new approach better than e.g. perpetual AGPL
| though, as it provides incentives for businesses to acquire
| commercial rights while avoiding any dead end agreements that
| outlive the startup entity.
| debosmit wrote:
| Do you have some thoughts on how sdv-dev type projects can be
| used to start populating, say, a database (e.g. MySQL running in
| a container)? I've looked into this space a bunch (e.g. Gretel,
| Tonic, etc.) and there doesn't seem to be a good solution that
| works end-to-end. Privacy Dynamics is quite cool, but ideally I'd
| like something super lightweight that can be pointed at a source
| db of some sort and then write to a sink (maybe applying a
| transformation layer in the middle).
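A minimal sketch of the source -> transformation -> sink shape described in the comment above. It is not SDV's API or any vendor's tool: it uses Python's standard sqlite3 module as a stand-in for MySQL, and the names copy_with_transform and mask_user, the users table, and its columns are all hypothetical. The transform here only masks values; a trained synthesizer could presumably sit in the same slot.

```python
import random
import sqlite3


def copy_with_transform(source, sink, table, transform):
    """Read every row of `table` from `source`, transform it, write it to `sink`.

    Assumption: `table` is a trusted identifier (it is interpolated into SQL).
    """
    cur = source.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)

    # Create an untyped copy of the table in the sink if it is missing.
    sink.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_list})")

    # Apply the transformation layer row by row, as dicts keyed by column.
    rows = [transform(dict(zip(columns, row))) for row in cur]
    sink.executemany(
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})",
        [tuple(r[c] for c in columns) for r in rows],
    )
    sink.commit()
    return len(rows)


_rng = random.Random(0)  # fixed seed so the sketch is repeatable


def mask_user(row):
    """Hypothetical transformation: replace the email, jitter the age."""
    masked = dict(row)
    masked["email"] = f"user{row['id']}@example.com"
    masked["age"] = max(0, row["age"] + _rng.randint(-2, 2))
    return masked


# Stand-in source database with two made-up rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (id INTEGER, email TEXT, age INTEGER)")
source.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "alice@corp.test", 34), (2, "bob@corp.test", 51)],
)

sink = sqlite3.connect(":memory:")
copied = copy_with_transform(source, sink, "users", mask_user)
sink_rows = sink.execute("SELECT id, email, age FROM users ORDER BY id").fetchall()
```

Pointing this at a real MySQL container would mean swapping the sqlite3 connections for a MySQL driver's connections and adjusting the placeholder style; the read, transform, write structure would stay the same.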
| axpy906 wrote:
| I remember a lib that was like this but would use GANs etc. to
| generate data. I tried it with little success, reverting to
| SMOTE. I wonder how this would do? My impression is that tabular
| data is difficult to generate synthetically.
| n4atki wrote:
| SDV does offer a CTGANSynthesizer, which is a GAN-based
| generative approach. Could be worth a try, though CTGAN
| specifically may require customization (tweaking some
| parameters).
|
| That being said, synthetic data definitely isn't a magic pill
| for all use cases. I have found it particularly useful for
| things like QA, performance testing, etc. -- where alternative
| tools for test data creation aren't sufficient.
|
| For the use case of imbalanced classification: it may be worth
| asking what it is about existing solutions (SMOTE) that doesn't
| work well.
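For readers weighing the SMOTE question above, a rough sketch of the core SMOTE idea in plain Python: each synthetic point is a random interpolation between a minority-class sample and one of its k nearest minority-class neighbors. The function name smote_sketch and the toy points are made up for illustration, and this is a simplified version, not the imbalanced-learn implementation. It also hints at one limitation: the interpolation only makes sense for numeric columns.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Assumes `minority` holds at least two numeric points of equal dimension.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbors of the base point,
        # excluding the base point itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority class with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
```

Because every new point lies on a line segment between two existing minority points, the output never leaves the region the minority class already covers, which is a different behavior from a generative model such as CTGAN that tries to learn the joint distribution of the columns.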
___________________________________________________________________
(page generated 2024-02-27 23:00 UTC)