[HN Gopher] Synthetic data generation for tabular data
       ___________________________________________________________________
        
       Synthetic data generation for tabular data
        
       Author : skadamat
       Score  : 30 points
       Date   : 2024-02-27 19:05 UTC (3 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | skadamat wrote:
       | If people are looking for a quickstart:
       | 
       | Colab notebook for generating single-table data:
       | https://colab.research.google.com/drive/1F3WWduNjcX4oKck6Xkj...
       | 
       | Colab notebook for generating multi-table data:
       | https://colab.research.google.com/drive/1L6i-JhJK9ROG-KFcyzT...
        
       | mej10 wrote:
       | Can someone help me understand the licensing of this?
       | 
       | https://github.com/sdv-dev/SDV/blob/main/LICENSE
       | 
       | It was MIT licensed up until 2022, when it was changed to what it
       | is now. They say it will become MIT again 4 years after
       | release... but is that counted from when the license was changed
       | or from the first release of the software on GitHub?
        
         | rch wrote:
         | IANAL, but I've taken it to mean that releases acquired under the
         | original license would continue to be governed by those terms.
         | 
         | I'm liking this new approach better than e.g. perpetual AGPL
         | though, as it provides incentives for businesses to acquire
         | commercial rights while avoiding any dead end agreements that
         | outlive the startup entity.
        
       | debosmit wrote:
       | Do you have any thoughts on how sdv-dev type projects can be
       | used to populate, say, a database (e.g. MySQL running in a
       | container)? I've looked into this space a bunch (e.g. Gretel,
       | Tonic, etc.) and there doesn't seem to be a good solution that
       | works end-to-end. Privacy Dynamics is quite cool, but ideally I'd
       | like something super lightweight that can be pointed at a source
       | db of some sort and then write to a sink (maybe applying a
       | transformation layer in the middle).
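
A minimal sketch of the source -> transform -> sink shape described in the comment above, not taken from SDV or any of the tools mentioned. It assumes SQLite (standard library) in place of MySQL, and the names `copy_table`, `make_masker`, and `demo` are hypothetical. The transform here only masks a name and jitters an age; a synthesizer's sampling step could be plugged into the same hook.

```python
import random
import sqlite3


def copy_table(source, sink, table, transform):
    """Copy every row of `table` from source to sink, applying `transform` to each row."""
    cursor = source.execute(f'SELECT * FROM "{table}"')
    columns = [col[0] for col in cursor.description]
    column_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    # Assumption: the sink table is created untyped; a real pipeline would copy the schema.
    sink.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_list})')
    written = 0
    for row in cursor:
        new_row = transform(dict(zip(columns, row)))
        sink.execute(
            f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})',
            [new_row[c] for c in columns],
        )
        written += 1
    sink.commit()
    return written


def make_masker(seed=0):
    """Build a hypothetical transform: replace names and jitter ages by up to 2 years."""
    rng = random.Random(seed)

    def mask(row):
        masked = dict(row)
        masked["name"] = f"user_{row['id']}"
        masked["age"] = max(0, row["age"] + rng.randint(-2, 2))
        return masked

    return mask


def demo():
    """Run the pipeline between two in-memory databases and return the sink contents."""
    source = sqlite3.connect(":memory:")
    sink = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
    source.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(1, "Alice", 34), (2, "Bob", 51), (3, "Carol", 27)],
    )
    source.commit()
    count = copy_table(source, sink, "users", make_masker())
    rows = sink.execute("SELECT id, name, age FROM users ORDER BY id").fetchall()
    return count, rows
```

Calling `demo()` returns the number of rows written and the masked rows from the sink. Swapping the two `sqlite3.connect` calls for another DB-API driver should keep the same overall shape, though placeholder syntax and identifier quoting differ between drivers.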
        
       | axpy906 wrote:
       | I remember a lib that was like this but would use GANs etc. to
       | generate data. I tried it with little success, reverting to SMOTE.
       | I wonder how this would do. My impression is that tabular data is
       | difficult to generate synthetically.
        
         | n4atki wrote:
         | SDV does offer a CTGANSynthesizer, which is a GAN-based
         | generative approach. Could be worth a try, though CTGAN
         | specifically may require customization (tweaking some
         | parameters).
         | 
         | That being said, synthetic data definitely isn't a magic pill
         | for all use cases. I have found it particularly useful for
         | things like QA, performance testing, etc. -- where alternative
         | tools for test data creation aren't sufficient.
         | 
         | For the use case of imbalanced classification: it may be worth
         | asking what it is about existing solutions (SMOTE) that doesn't
         | work well.
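
For readers unfamiliar with SMOTE, mentioned in the exchange above, here is a rough sketch of its core idea: pick a minority-class sample, pick one of its k nearest minority-class neighbors, and create a new point somewhere on the line segment between them. This is a simplified illustration for numeric features only, not the code any commenter used; the function name `smote` and its parameters are hypothetical. For real work, the `SMOTE` class in imbalanced-learn (`imblearn.over_sampling`) is the usual choice.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of `base` among the other minority samples (Euclidean distance).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority class with two numeric features.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
new_points = smote(minority, n_new=10, k=2)
```

Because every synthetic point lies between two existing minority samples, this approach cannot produce values outside the observed range, and it has no direct answer for categorical columns. Those limits may be part of why a generative model looks attractive for tabular data.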
        
       ___________________________________________________________________
       (page generated 2024-02-27 23:00 UTC)