[HN Gopher] Synthea: Open-source synthetic patient generation
___________________________________________________________________
Synthea: Open-source synthetic patient generation
Author : johncole
Score : 58 points
Date : 2023-05-19 14:30 UTC (8 hours ago)
(HTM) web link (synthetichealth.github.io)
(TXT) w3m dump (synthetichealth.github.io)
| synaesthesisx wrote:
| I've worked at the intersection of AI & healthcare for years and
| this has been an excellent tool I've leveraged in the past;
| synthetic data is particularly helpful in the context of
| healthcare!
| ThaDood wrote:
| I actually had this idea when I worked for a local HIE. I just
| lacked the technical competency to make it real. I think this
| would be incredibly useful for the adoption of FHIR and also
| learning more about HL7. For security-minded folks this
| information could be a good tool for tuning DLP and other tools
| without using real patient data.
| ska wrote:
| I've done this sort of thing before with home-rolled tools, it
| easily becomes a time sink. Having a centralized shared effort
| seems like it could be really valuable.
|
| One thing that is tricky is that you often need signals and/or
| image data as well.
| reshmakh wrote:
| Synthea is great! We use it a ton at Medplum, and the sample
| data that conforms to USCDI is especially useful; we recommend it
| for those who are getting started.
| https://www.medplum.com/docs/tutorials/importing-sample-data
| techwizrd wrote:
| I never expected to see MITRE on the front page of HN! We're
| actively adding more synthetic data sources to Synthea all the
| time.
| shawnz wrote:
| Doesn't MITRE maintain the CVE database?
| orbz wrote:
| Yes, MITRE is a non-profit organization that works with a
| number of US government agencies to cover a pretty large
| swath of areas: https://www.mitre.org/focus-areas
| orbz wrote:
| Love Synthea, it's an amazing project and you should be very
| proud of it. My only gripe is how clean the data is compared to
| what many other EHR providers actually generate, but that's
| more on them than you guys.
| ska wrote:
| It would perhaps be interesting to add a layer capable of
| injecting reasonable noise on top of these clean records.
| techwizrd wrote:
| I was thinking about this earlier this month, actually! I
| generate a lot of synthetic flight data, and we have to
| reproduce the noisiness of real data as well.
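|
| A minimal sketch of such a noise layer, assuming flat records with
| string fields. The function name inject_noise, the three error
| types (dropped fields, transposed characters, duplicated rows) and
| their rates are illustrative assumptions, not part of Synthea.

```python
import random

def inject_noise(records, rng, p_missing=0.1, p_typo=0.05, p_duplicate=0.02):
    """Return a noisy copy of clean records (a list of dicts of str -> str)."""
    noisy = []
    for record in records:
        rec = dict(record)
        for key, value in record.items():
            roll = rng.random()
            if roll < p_missing:
                # Simulate a field that was never filled in.
                rec[key] = None
            elif roll < p_missing + p_typo and len(value) > 1:
                # Simulate a typo by swapping two adjacent characters.
                i = rng.randrange(len(value) - 1)
                rec[key] = value[:i] + value[i + 1] + value[i] + value[i + 2:]
        noisy.append(rec)
        if rng.random() < p_duplicate:
            # Simulate the same row being entered twice.
            noisy.append(dict(rec))
    return noisy

# Hypothetical clean input; real records would come from the generator.
clean = [{"name": "Alice Example", "dob": "1980-01-02", "zip": "02139"}
         for _ in range(1000)]
noisy = inject_noise(clean, random.Random(0))
```

| Seeding the generator keeps the noise reproducible, so a noisy
| data set can be regenerated from the same clean records.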
| MilStdJunkie wrote:
| Does anyone know if there is an equivalent for generating
| "random" viable products[1] in a PDM/ERP system?
|
| I'm demoing some systems in this field for outside interests, but
| I can't use any "real" data due to ITAR and data restrictions
| like TC, NC, etc. Wait, what about the ERP? The ERP I'm
| developing against has "sample" data that's basically useless.
| Not much better than _lorem ipsum_ pasted across ten thousand
| cells. Actually, it's worse than that, because... ah hell, this
| is HN, I won't waste your time. People here know what the ERP
| ecosystem is like. I also don't want to build out from a bespoke,
| brittle ERP - that's how we got into this mess in the first
| place.
|
| [1] Like a multi-level BOM that makes sense, or a Service BOM /
| Logistics Database that's meaningful. Anything for making pseudo-
| random PLs that follow MIL-STD-100, which is still considered
| frickin' Holy Ground by these people.
| ted_dunning wrote:
| Building synthetic BOMs can be fairly straightforward if you
| can define the level of coherency you want to see. The only big
| trick in building structured data like this that I have built
| is to first build dictionaries of randomized data with very
| little coherence and then build larger structures that include
| elements of the dictionaries.
|
| As an example, you might want to have a model of users
| interacting with a web site, ordering products and shipping
| them to their homes. This can start with building a dictionary
| of user records and orderable item descriptions. The user
| records would have an address and some "interest" variables
| that define what the user is likely to order. The item
| descriptions can have a lot or a little information but would
| centrally contain a part number and some information that
| allows the part to be selected efficiently (a numerical vector
| may be enough). If you want to be crazy, you can use generative
| models to generate descriptions from random semantic starting
| points or use lower level tables to piece together these
| things.
|
| At this point, you can pretty easily build a user model and run
| it for each user to generate coherent transactional histories.
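|
| A minimal sketch of the two steps above: build low-coherence
| dictionaries first, then run a per-user model over them. All names
| here (build_users, build_items, order_history), the 3-element
| vectors and the exponential weighting are illustrative
| assumptions, not code from any existing tool.

```python
import math
import random

def build_users(rng, n):
    # Dictionary of user records: an address plus an "interest" vector.
    return [{"user_id": u,
             "address": f"{rng.randrange(1, 9999)} Example St",
             "interest": [rng.gauss(0, 1) for _ in range(3)]}
            for u in range(n)]

def build_items(rng, n):
    # Dictionary of orderable items: a part number plus a numerical vector
    # that lets the user model select parts.
    return [{"part_number": f"P-{i:05d}",
             "features": [rng.gauss(0, 1) for _ in range(3)]}
            for i in range(n)]

def order_history(rng, user, items, n_orders):
    # User model: items whose vector aligns with the user's interests
    # get exponentially more weight, so each history is coherent.
    weights = [math.exp(sum(a * b for a, b in zip(user["interest"], it["features"])))
               for it in items]
    chosen = rng.choices(items, weights=weights, k=n_orders)
    return [{"user_id": user["user_id"],
             "part_number": it["part_number"],
             "ship_to": user["address"]}
            for it in chosen]

rng = random.Random(42)
users = build_users(rng, 5)
items = build_items(rng, 20)
histories = {u["user_id"]: order_history(rng, u, items, 10) for u in users}
```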
|
| Several of these ideas are present in a project I worked on
| called log-synth [1]. For instance, the VIN generator has
| tables of factories and such for BMW and Ford so it generates
| kind of coherent VINs that can be traced back with factory
| location, engine and body type. If you look hard these are
| nonsense, but if you squint the right way they look fine.
|
| The commuter generator and the DNS query generator are examples
| of higher-level transaction generators. For the commuter,
| there is a model of a user with a home location and a work
| location. These commuters go to work some days and run errands
| other days, and there is a simple model to pick an activity.
| Digging in, each activity breaks down into journeys along
| entirely incoherent road structures, but details like a physical
| model of the engine and car velocity are maintained so you can
| get realistic diagnostics from the vehicles from somewhat
| realistic life histories. The DNS query generator is similar
| but with less physics.
|
| One nice statistical concept in all of this is the concept of a
| statistical distribution over a notionally infinite set. Some
| things in the set will be much more commonly seen than others
| and thus we are likely to see those sooner. The generator of
| these things can maintain an estimate of the frequency of all
| previously seen things and a probability of seeing something
| new (see the Chinese Restaurant process [2]). You only need to
| generate the specifics of a thing in this infinite set when you
| first see it which gives you pretty realistic texture to the
| fictional transactional world.
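|
| A minimal sketch of a Chinese Restaurant process sampler, assuming
| a concentration parameter alpha > 0. The class name and the
| new_item callback are hypothetical; this is not log-synth's
| implementation. With n draws so far, a new thing appears with
| probability alpha / (n + alpha), and a previously seen thing is
| repeated in proportion to how often it has been seen.

```python
import random

class ChineseRestaurant:
    def __init__(self, rng, alpha, new_item):
        self.rng = rng
        self.alpha = alpha          # larger alpha means more novelty
        self.new_item = new_item    # callable(index) -> freshly generated item
        self.items = []             # things seen so far
        self.counts = []            # how often each thing has been seen
        self.total = 0              # total number of draws

    def sample(self):
        r = self.rng.random() * (self.total + self.alpha)
        self.total += 1
        # Walk the counts: item k is chosen with probability count_k / (n + alpha).
        for k, count in enumerate(self.counts):
            if r < count:
                self.counts[k] += 1
                return self.items[k]
            r -= count
        # Otherwise generate the specifics of a new item, only now.
        item = self.new_item(len(self.items))
        self.items.append(item)
        self.counts.append(1)
        return item

crp = ChineseRestaurant(random.Random(3), alpha=2.0, new_item=lambda i: f"item-{i}")
draws = [crp.sample() for _ in range(1000)]
```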
|
| Relative to your problem of multi-level BOMs, you could say
| that a BOM is a list of items. Pick the desired length from a
| suitable distribution. Then pick each item from a Chinese
| Restaurant process. As you generate new items, decide if the
| item is composite and if so, generate a BOM for it recursively.
| Constraints like forcing a composite item to not recursively
| contain anything of the same type can be enforced using a
| rejection method (sometimes).
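|
| The recipe above can be sketched as follows. The part kinds, the
| 0.4 chance of a new part being composite, the BOM lengths and the
| depth cap are all made-up assumptions for illustration. The
| rejection step redraws whenever a candidate would place a part
| kind underneath a composite of the same kind.

```python
import random

KINDS = ["frame", "panel", "bracket", "harness", "valve", "fastener"]  # made-up kinds

def crp_pick(rng, counts, alpha):
    """Index of an existing part (weighted by use count) or len(counts) for 'new'."""
    r = rng.random() * (sum(counts) + alpha)
    for k, c in enumerate(counts):
        if r < c:
            return k
        r -= c
    return len(counts)

def kinds_in(parts, k):
    """All part kinds in the subtree rooted at part k, including k itself."""
    found = {parts[k]["kind"]}
    for child in parts[k]["children"]:
        found |= kinds_in(parts, child)
    return found

def pick_part(rng, parts, counts, banned, depth, alpha, max_depth):
    while True:  # rejection loop: redraw until the constraint holds
        k = crp_pick(rng, counts, alpha)
        if k < len(parts):
            if kinds_in(parts, k) & banned:
                continue  # existing part would repeat an ancestor's kind
        else:
            kind = rng.choice(KINDS)
            if kind in banned:
                continue  # new part would repeat an ancestor's kind
            parts.append({"kind": kind, "children": []})
            counts.append(0)  # count 0: cannot be picked while under construction
            if depth < max_depth and rng.random() < 0.4:
                # Composite part: generate its own BOM recursively.
                parts[k]["children"] = make_bom(rng, parts, counts, banned | {kind},
                                                depth + 1, alpha, max_depth)
        counts[k] += 1
        return k

def make_bom(rng, parts, counts, banned, depth, alpha=3.0, max_depth=3):
    length = rng.randint(2, 4)  # desired BOM length from a simple distribution
    return [pick_part(rng, parts, counts, banned, depth, alpha, max_depth)
            for _ in range(length)]

rng = random.Random(7)
parts = [{"kind": "frame", "children": []}]  # part 0 is the top-level product
counts = [0]
parts[0]["children"] = make_bom(rng, parts, counts, banned={"frame"}, depth=1)
```

| Because there are more kinds than levels, an acceptable draw
| always exists here; with tighter constraints a rejection loop
| like this can spin for a long time, which is presumably the
| "sometimes" caveat.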
|
| If this seems at all interesting, ping me by filing an issue on
| the log-synth github repository.
|
| [1] https://github.com/tdunning/log-synth
| [2] https://en.wikipedia.org/wiki/Chinese_restaurant_process
| erwinh wrote:
| Played around with this in my soon-to-be previous health-tech job
| and it's great.
|
| Actually the entire hl7-fhir ( https://www.hl7.org/fhir/ )
| standard seems to me quite solid. It would be wonderful if a new
| cohort of start-ups would leverage it to drastically improve the
| digital UX of healthcare generally.
| jjordan wrote:
| That would be great except that the 8,000 lb gorillas of the
| medical data industry, at least as of a year or two ago, did
| next to nothing to really make their EHRs FHIR-compatible.
| Even some of the very basics on their demo environments
| were fundamentally broken.
| erwinh wrote:
| Yeah so many cards stacked against potential start-ups who
| could potentially bring some quality to the industry :/
| Curious to see though that Google cloud / AWS etc are
| building fhir store APIs.
| [deleted]
___________________________________________________________________
(page generated 2023-05-19 23:01 UTC)