[HN Gopher] Processing medical images at scale on the cloud
___________________________________________________________________
Processing medical images at scale on the cloud
Author : harporoeder
Score : 66 points
Date : 2023-06-18 23:51 UTC (1 day ago)
(HTM) web link (www.tweag.io)
(TXT) w3m dump (www.tweag.io)
| crabbone wrote:
| > But as it turns out, we can't use it.
|
| > Although it has Python bindings, OpenSlide is implemented in
| C and reads files using standard OS file handlers, however our
| data sits on cloud storage that is accessible via HTTP.
|
| This is a self-inflicted problem. Very typical: people who
| don't know how storage works or what functionality is available
| often push themselves into an imaginary corner.
|
| Why of all things use HTTP for this?
|
| No, of course you don't need to download the whole file to read
| it.
|
| "standard OS file handlers" -- this is a strong indicator that
| the person writing this doesn't understand how their OS works.
| What standard are we talking about? Even if "standard" here is
| used to mean "some common way" -- then which one? How the files
| are opened? And so on. The author didn't research the subject at
| all, and came up with an awful solution (vendor lock-in) as a
| result.
| dekhn wrote:
| The standard OS file handlers they mean are the UNIX and
| Windows APIs for opening and reading/writing file content: the
| open(), read(), seek(), and write() library calls, which wrap
| the OS's system calls that go through the VFS.
|
| HTTP is now a de-facto transport standard. S3 uses it, and
| many other data storage systems do as well. It's highly tuned for
| performance and there is an entire ecosystem around it. You
| could implement a system like NFS over XDR-over-TCP-IP
| (https://en.wikipedia.org/wiki/Sun_RPC) or HTTP
| (https://en.wikipedia.org/wiki/WebDAV).
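|
| A rough sketch of what those file-handle calls look like from
| Python (the filename and offsets here are made up purely for
| illustration):
|
|     # hypothetical local file opened through the OS file APIs
|     with open("slide.tiff", "rb") as f:
|         f.seek(4096)           # move the file handle to a byte offset
|         chunk = f.read(1024)   # read 1 KiB starting at that offset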
| CyberDildonics wrote:
| What does 'at scale' mean here and why would anyone need 'the
| cloud'? Medical images aren't like cell phone videos where
| everyone is creating data all the time. There is only so much
| medical data being created because the machines that create it
| are limited in number.
| giovannibonetti wrote:
| > Though it has Python bindings, OpenSlide is implemented in C
| and reads files using standard OS file handlers, however our data
| sits on cloud storage that is accessible via HTTP. This means
| that, to open a WSI file, one needs to first download the entire
| file to disk, and only then can they load it with OpenSlide. But
| then, what if we need to read tens of thousands of WSIs, a few
| gigabytes each? This can total more than what a single disk can
| contain. Besides, even if we mounted multiple disks, the cost and
| time it would take to transfer all this data on every new machine
| would be too much. In addition to that, most of the time only a
| fraction of the entire WSI is of interest, so downloading the
| entire data is inefficient. A solution is to read the bytes that
| we need when we need them directly from Blob Storage. fsspec is a
| Python package that allows us to define "abstract" filesystems,
| with a custom implementation to list, read and write files. One
| such implementation, adlfs, works for Azure Blob Storage.
|
| AWS S3 has byte-range fetches specifically for this use case [1].
| This is quite handy for data lakes and OLAP databases.
| Apparently, 8 MB and 16 MB are good sizes for typical workloads
| [2].
|
| [1]
| https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing...
|
| [2]
| https://d1.awsstatic.com/whitepapers/AmazonS3BestPractices.p...
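|
| A minimal sketch of such a byte-range fetch with boto3 (the
| bucket, key, and 8 MB part size below are illustrative, not
| from the article):
|
|     import boto3
|
|     s3 = boto3.client("s3")
|     part_size = 8 * 1024 * 1024  # ~8 MB, per the whitepaper's suggestion
|     resp = s3.get_object(
|         Bucket="my-bucket",            # hypothetical bucket
|         Key="slides/example.svs",      # hypothetical key
|         Range=f"bytes=0-{part_size - 1}",
|     )
|     first_part = resp["Body"].read()   # only this range crosses the network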
| deathanatos wrote:
| The OP here is using Azure Blob Storage, which is essentially
| Azure's S3 competitor. Like S3, it accepts the Range header.[1]
| (I'm presuming Blob Storage was modeled after S3, frankly. The
| capabilities and APIs are very similar.)
|
| Similarly, GCP Cloud Storage appears to also support the Range
| header[2].
|
| [1]: https://learn.microsoft.com/en-
| us/rest/api/storageservices/g...
|
| [2]: https://cloud.google.com/storage/docs/xml-api/get-object-
| dow...
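|
| As a sketch, the same kind of ranged read over plain HTTP (the
| URL is a placeholder and assumes an authorized link; uses the
| requests package):
|
|     import requests
|
|     url = "https://example.blob.core.windows.net/c/slide.svs"  # placeholder
|     resp = requests.get(url, headers={"Range": "bytes=0-1048575"})
|     assert resp.status_code == 206   # 206 Partial Content = ranged response
|     first_mib = resp.content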
| dekhn wrote:
| Even with byte-ranges in S3, I know that some standards like
| OME report very poor performance compared to local filesystems
| (their use case requires very low latency to show various
| subrectangles of multidimensional data, converted to image
| form, for pathologists sitting in a browser). They have been
| exploring a move from monolithic TIFF image files to Zarr,
| which pre-shards the data and compresses it in blocks.
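|
| A small sketch of the Zarr idea (shapes and chunk sizes are
| made up): every chunk is stored and compressed as its own
| object, so reading a subrectangle only touches the chunks it
| overlaps.
|
|     import numpy as np
|     import zarr
|
|     # chunked, compressed array; each 1024x1024 tile is its own object
|     z = zarr.open("slide.zarr", mode="w", shape=(200000, 100000),
|                   chunks=(1024, 1024), dtype="uint8")
|     z[0:1024, 0:1024] = np.random.randint(0, 255, (1024, 1024), dtype="uint8")
|
|     # fetching a subrectangle only reads the overlapping chunks
|     tile = z[512:1536, 512:1536]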
| crabbone wrote:
| "Local filesystem" is not a thing... When you mount NFS on
| your laptop, is it local to your laptop or not? What if you
| have a caching client?
|
| In other words, "local" or "remote" is not a property of a
| filesystem.
|
| Various storage products exist that try to solve the problem
| of _data mobility_, that is, moving data quickly to a desired
| destination (which is usually "pretending to move" but in
| such a way that the receiving end can start working as soon
| as possible).
|
| For an open-source example, see DRBD. There are also some
| proprietary products designed to do the same.
| dekhn wrote:
| Local or remote is a property of a filesystem. This has
| conventionally been understood as whether the block device or
| the file service is attached directly to the host bus, versus
| more indirectly through a NIC or other networking technology.
| Of course, this idea breaks down pretty quickly; many servers
| used a SAN ("storage area network"), which gave local-like
| performance from physically separated storage devices over a
| fiber optic network. And as you point out, you can "remote" a
| block device, since block devices are really just an
| abstraction.
|
| I don't see what your point is; many applications support
| multiple storage backends, which is what I was referring
| to. The performance issues I was discussing compared
| applications that go through the host system's VFS layer with
| ones that go through the S3 API.
| southernplaces7 wrote:
| Because of course there's so little to worry about with storing
| vast reams of medical data from real people in cloud systems
| (that surely never get breached) to be accessed by AIs that
| surely will never create data privacy problems from all the ML
| vacuuming they rely on....
| 5440 wrote:
| I'm a regulatory consultant and I am currently submitting at
| least 5-10 510(k)s/De Novos per week to the FDA for AI/ML
| devices for a variety of companies. I can't imagine the actual
| throughput from companies overall, as I am just one person out
| of many consultants out there. 95% of the software devices I
| edit and submit host their databases on AWS. Essentially they
| transfer the DICOM images to AWS, then run their algorithms
| against the data, and then present the indications to the
| physician. These run the range of CT/MRI/ultrasound/pathology
| slides/genomic sequencing. Like I said, most of the databases
| are on AWS. A few are on Azure and a few European companies are
| on Orange.
| light_hue_1 wrote:
| This is a bad and needlessly complicated solution because the
| authors don't know anything about ML.
|
| It's both incredibly slow and it will perform poorly (in the
| sense of the final accuracy and the number of training steps)
| because you can't just software engineer your way through ML.
|
| The important insight is that you do not need all patches from
| every image! Actually, showing patches like this in sequence is
| extremely detrimental to training. The network sees far too much
| data that is too similar in too many big chunks. You want random
| patches from random images; the more mixed up the patches, the
| better.
|
| So knowing this, when you look at their latency equation there
| is a different and obvious solution: split the loading process
| into two steps that run in parallel.
|
| First step constantly downloads new images from the web and swaps
| old images out. Second step picks an image at random from the
| ones that are available and generates a random patch from it.
|
| The first step is network bound. The second step is CPU bound.
| The second step always has images available, it never waits for
| the first, just picks another random image and random patch. You
| get great resource utilization out of this.
|
| That's it. No other changes needed. Just use an off-the-shelf
| fast image loader. No need for a cluster.
|
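| A bare-bones sketch of that two-step scheme (fetch_and_decode,
| wsi_urls, the pool size, and the patch size are all
| placeholders, not anything from the article):
|
|     import random
|     import threading
|
|     pool, lock = [], threading.Lock()  # in-memory pool of decoded images
|     POOL_SIZE, PATCH = 32, 256         # placeholder sizes
|
|     def downloader(urls):
|         # step 1: network bound -- keep swapping fresh images into the pool
|         for url in urls:
|             img = fetch_and_decode(url)  # placeholder: HTTP fetch + decode to array
|             with lock:
|                 if len(pool) >= POOL_SIZE:
|                     pool.pop(random.randrange(len(pool)))  # evict an old image
|                 pool.append(img)
|
|     def random_patch():
|         # step 2: CPU bound -- sample whatever is already in the pool
|         with lock:
|             img = random.choice(pool)  # assumes the pool has been primed
|         y = random.randrange(img.shape[0] - PATCH)
|         x = random.randrange(img.shape[1] - PATCH)
|         return img[y:y + PATCH, x:x + PATCH]
|
|     threading.Thread(target=downloader, args=(wsi_urls,), daemon=True).start()
|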
| This is a huge waste of engineering time and ongoing computing
| resources for what is a simple ML problem, had anyone with any
| knowledge been around.
|
| Hey, tweag! If you want to do ML, reach out. :) You can do far
| better than this!
| lisasays wrote:
| Can't evaluate the details, but in general I love reading
| takes like this. Basically a variant of the "When all you have
| is a hammer" mindset and its inevitable consequences, as we see
| replicated over and over again in this industry. If only it
| were the top comment.
___________________________________________________________________
(page generated 2023-06-20 23:02 UTC)