[HN Gopher] Processing medical images at scale on the cloud
       ___________________________________________________________________
        
       Processing medical images at scale on the cloud
        
       Author : harporoeder
       Score  : 66 points
       Date   : 2023-06-18 23:51 UTC (1 days ago)
        
 (HTM) web link (www.tweag.io)
 (TXT) w3m dump (www.tweag.io)
        
       | crabbone wrote:
       | > But as it turns out, we can't use it.
       | 
       | > Although it has Python bindings, OpenSlide is implemented in C
       | > and reads files using standard OS file handlers, however our data
       | > sits on cloud storage that is accessible via HTTP.
       | 
       | This is a self-inflicted problem, and a very typical one: people
       | who don't know how storage works / what functionality is
       | available will often push themselves into an imaginary corner.
       | 
       | Why of all things use HTTP for this?
       | 
       | No, of course you don't need to download the whole file to read
       | it.
       | 
       | "standard OS file handlers" -- this is a strong indicator that
       | the person writing this doesn't understand how their OS works.
       | What standard are we talking about? Even if "standard" here is
       | used to mean "some common way" -- then which one? How the files
       | are opened? And so on. The author didn't research the subject at
       | all, and came up with an awful solution (vendor lock-in) as a
       | result.
        
         | dekhn wrote:
         | The standard OS file handlers they mean are the UNIX and
         | Windows APIs to open and read/write file content. The open(),
         | read(), seek(), and write() library calls, which wrap the OS's
         | system calls that use VFS.
         | 
         | HTTP is now a de-facto transport standard. S3 uses it, and many
         | other data storage systems do as well. It's highly tuned for
         | performance and there is an entire ecosystem around it. You
         | could implement a system like NFS over XDR-over-TCP-IP
         | (https://en.wikipedia.org/wiki/Sun_RPC) or HTTP
         | (https://en.wikipedia.org/wiki/WebDAV).
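The open(), seek(), and read() calls named in the comment above can be sketched with Python's thin wrappers over the same OS calls. This is only an illustration of reading a byte slice from a file the OS can see; the helper name `read_slice` and the temporary demo file are assumptions made for the example, not anything OpenSlide itself does.

```python
import os
import tempfile


def read_slice(path, offset, length):
    """Read up to `length` bytes starting at `offset` using OS-level calls."""
    fd = os.open(path, os.O_RDONLY)          # wraps open(2)
    try:
        os.lseek(fd, offset, os.SEEK_SET)    # wraps lseek(2)
        return os.read(fd, length)           # wraps read(2); may return fewer bytes
    finally:
        os.close(fd)


# Demo on a small temporary file (hypothetical data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789abcdef")
    demo_path = tmp.name

chunk = read_slice(demo_path, 4, 6)
os.remove(demo_path)
```

A library written against these calls only works on paths the kernel can resolve, which is why data behind an HTTP API needs either a mount layer or a different read path.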
        
       | CyberDildonics wrote:
       | What does 'at scale' mean here and why would anyone need 'the
       | cloud' ? Medical images aren't like cell phone videos where
       | everyone is creating data all the time. There is only so much
       | medical data being created because the machines to create them
       | are limited.
        
       | giovannibonetti wrote:
       | > Though it has Python bindings, OpenSlide is implemented in C
       | and reads files using standard OS file handlers, however our data
       | sits on cloud storage that is accessible via HTTP. This means
       | that, to open a WSI file, one needs to first download the entire
       | file to disk, and only then can they load it with OpenSlide. But
       | then, what if we need to read tens of thousands of WSIs, a few
       | gigabytes each? This can total more than what a single disk can
       | contain. Besides, even if we mounted multiple disks, the cost and
       | time it would take to transfer all this data on every new machine
       | would be too much. In addition to that, most of the time only a
       | fraction of the entire WSI is of interest, so downloading the
       | entire data is inefficient. A solution is to read the bytes that
       | we need when we need them directly from Blob Storage. fsspec is a
       | Python package that allows us to define "abstract" filesystems,
       | with a custom implementation to list, read and write files. One
       | such implementation, adlfs, works for Azure Blob Storage.
       | 
       | AWS S3 has byte-range fetches specifically for this use case [1].
       | This is quite handy for data lakes and OLAP databases.
       | Apparently, 8 MB and 16 MB are good sizes for typical workloads
       | [2].
       | 
       | [1]
       | https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing...
       | 
       | [2]
       | https://d1.awsstatic.com/whitepapers/AmazonS3BestPractices.p...
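The idea of reading only the bytes that are needed, in fixed-size ranged fetches, can be sketched as a small seekable reader. This is not fsspec's or adlfs's implementation; `RangeReader` and the `fetch(start, end)` callable are hypothetical names, and the 8 MB default block size simply echoes the figure mentioned above. In practice `fetch` would issue one ranged GET per block; here it reads from an in-memory buffer so the sketch runs anywhere.

```python
class RangeReader:
    """Read-only, seekable reader backed by a ranged-fetch callable.

    `fetch(start, end)` is a hypothetical callable that returns the bytes
    in [start, end) of the remote object, e.g. via one ranged GET.
    """

    def __init__(self, fetch, size, block_size=8 * 1024 * 1024):
        self._fetch = fetch
        self._size = size
        self._block_size = block_size
        self._pos = 0
        self._cache = {}  # block index -> bytes (unbounded; fine for a sketch)

    def tell(self):
        return self._pos

    def seek(self, offset):
        """Move to an absolute offset, clamped to the object bounds."""
        self._pos = max(0, min(offset, self._size))
        return self._pos

    def _block(self, index):
        # Fetch a whole aligned block once and reuse it for later reads.
        if index not in self._cache:
            start = index * self._block_size
            end = min(start + self._block_size, self._size)
            self._cache[index] = self._fetch(start, end)
        return self._cache[index]

    def read(self, n):
        end = min(self._pos + n, self._size)
        out = bytearray()
        while self._pos < end:
            index, within = divmod(self._pos, self._block_size)
            piece = self._block(index)[within:within + (end - self._pos)]
            if not piece:  # backend returned short data; stop instead of looping
                break
            out += piece
            self._pos += len(piece)
        return bytes(out)


# Demo with an in-memory "remote" object and a tiny block size.
data = bytes(range(256)) * 4
calls = []


def fake_fetch(start, end):
    calls.append((start, end))
    return data[start:end]


reader = RangeReader(fake_fetch, len(data), block_size=100)
reader.seek(250)
window = reader.read(100)
```

Only the two blocks that overlap the requested window are fetched; the rest of the object is never transferred.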
        
         | deathanatos wrote:
         | The OP here is using Azure Blob Storage, which is essentially
         | Azure's S3 competitor. Like S3, it accepts the Range header.[1]
         | (I'm presuming Blob Storage was modeled after S3, frankly. The
         | capabilities and APIs are very similar.)
         | 
         | Similarly, GCP Cloud Storage appears to also support the Range
         | header[2].
         | 
         | [1]: https://learn.microsoft.com/en-us/rest/api/storageservices/g...
         | 
         | [2]: https://cloud.google.com/storage/docs/xml-api/get-object-dow...
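For reference, a ranged read is an ordinary GET with a Range header. A minimal sketch using only the standard library follows; the helper name `range_request` and the example URL are hypothetical, and real blob stores also need authentication headers that are omitted here.

```python
import urllib.request


def range_request(url, start, length):
    """Build a GET request for `length` bytes starting at byte `start`."""
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    last = start + length - 1
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{last}"})


# Hypothetical URL; nothing is sent until the request is opened.
req = range_request("https://example.com/slides/sample.tiff", 0, 1024)
# urllib.request.urlopen(req) should answer 206 Partial Content
# if the server honors the Range header.
```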
        
         | dekhn wrote:
         | Even with byte-ranges in S3, I know that some standards like
         | OME report very poor performance compared to local filesystems
         | (their use case requires very low latency to show various
         | subrectangles of multidimensional data, converted to image
         | form, for pathologists sitting in a browser). They have been
         | exploring moving from monolithic tiff image files to using
         | zarr, which preshards and does compression in blocks.
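The appeal of presharded, block-compressed storage can be shown with a toy version: split the image into tiles, compress each tile separately, and decompress only the tiles a requested subrectangle overlaps. This is a sketch of the general idea only, not zarr's API or on-disk format; `shard`, `read_region`, and the tiny tile size are assumptions, and the image dimensions are assumed to be multiples of the tile size.

```python
import zlib

TILE = 4  # tile edge in pixels (tiny, for illustration)


def shard(image, tile=TILE):
    """Split a 2D list of byte values into zlib-compressed square tiles."""
    tiles = {}
    for ty in range(0, len(image), tile):
        for tx in range(0, len(image[0]), tile):
            rows = [bytes(row[tx:tx + tile]) for row in image[ty:ty + tile]]
            tiles[(ty // tile, tx // tile)] = zlib.compress(b"".join(rows))
    return tiles


def read_region(tiles, y, x, height, width, tile=TILE):
    """Return (pixels, tiles_touched) for a subrectangle of the image."""
    out = [[0] * width for _ in range(height)]
    touched = 0
    for ty in range(y // tile, (y + height - 1) // tile + 1):
        for tx in range(x // tile, (x + width - 1) // tile + 1):
            raw = zlib.decompress(tiles[(ty, tx)])  # only overlapping tiles
            touched += 1
            for yy in range(max(y, ty * tile), min(y + height, (ty + 1) * tile)):
                for xx in range(max(x, tx * tile), min(x + width, (tx + 1) * tile)):
                    out[yy - y][xx - x] = raw[(yy - ty * tile) * tile + (xx - tx * tile)]
    return out, touched


# 8x8 demo image with distinct pixel values.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
tiles = shard(image)
inside, inside_touched = read_region(tiles, 1, 1, 2, 2)      # within one tile
crossing, crossing_touched = read_region(tiles, 3, 3, 2, 2)  # straddles four tiles
```

With a monolithic compressed file, the same small read could require fetching and decoding far more data than the region itself.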
        
           | crabbone wrote:
           | "Local filesystem" is not a thing... When you mount NFS on
           | your laptop, is it local to your laptop or not? What if you
           | have a caching client?
           | 
           | In other words, "local" or "remote" is not a property of
           | filesystem.
           | 
           | Various storage products exist that try to solve the problem
           | of _data mobility_ , that is moving data quickly to a desired
           | destination (which is usually "pretending to move" but in
           | such a way that the receiving end can start working as soon
           | as possible).
           | 
           | For open-source examples see DRBD. There are also some
           | proprietary products that are designed to do the same.
        
             | dekhn wrote:
             | Local or remote is a property of a filesystem. This has
             | been conventionally understood as whether the block device
             | or the file service is directly attached to the host
             | bus, versus more indirectly through a NIC or other
             | networking technology. Of course, this idea breaks down
             | pretty quickly; many servers used SAN, "storage area
             | network" which gave local-like performance from physically
             | separated storage devices over a fiber optic network. And
             | as you point out, you can "remote" a block device since
             | block devices are really just an abstraction.
             | 
             | I don't see what your point is; many applications support
             | multiple storage backends, which is what I was referring
             | to. The performance issues I was discussing were comparing
             | applications that use the host system's VFS layer, compared
             | to the S3 API layer.
        
       | southernplaces7 wrote:
       | Because of course there's so little to worry about with storing
       | vast reams of medical data from real people in cloud systems
       | (that surely never get breached) to be accessed by AIs that
       | surely will never create data privacy problems from all the ML
       | vacuuming they rely on....
        
         | 5440 wrote:
         | I'm a regulatory consultant and I am currently submitting at
         | least 5-10 510ks/DeNovos per week to FDA for AI/ML devices for
         | a variety of companies. I can't imagine the actual throughput
         | from companies as I am just one person out of many consultants
         | out there. 95% of the software devices I edit and submit are
         | hosting their databases on AWS. Essentially they transfer the
         | DICOM images to AWS and then run their algorithms against the
         | data and then present the indications to the physician. These
         | run the range of CT/MRI/Ultrasound/pathology slides/genomic
         | sequencing. Like I said, most of the databases are on AWS. A
         | few are on Azure and a few European companies are on Orange.
        
       | light_hue_1 wrote:
       | This is a bad and needlessly complicated solution because the
       | authors don't know anything about ML.
       | 
       | It's both incredibly slow and it will perform poorly (in the
       | sense of the final accuracy and the number of training steps)
       | because you can't just software engineer your way through ML.
       | 
       | The important insight is that you do not need all patches from
       | every image! Actually, showing patches like this in sequence is
       | extremely detrimental to training. The network sees far too much
       | data that is too similar in too many big chunks. You want random
       | patches from random images, the more mixed up the patches the
       | better.
       | 
       | So knowing this, when you look at their latency equation there's
       | a different and obvious solution. Split the loading process in
       | two steps that run in parallel.
       | 
       | First step constantly downloads new images from the web and swaps
       | old images out. Second step picks an image at random from the
       | ones that are available and generates a random patch from it.
       | 
       | The first step is network bound. The second step is CPU bound.
       | The second step always has images available, it never waits for
       | the first, just picks another random image and random patch. You
       | get great resource utilization out of this.
       | 
       | That's it. No other changes needed. Just use an off the shelf
       | fast image loader. No need for a cluster.
       | 
       | This is a huge waste of engineering time and ongoing computing
       | resources for what is a simple ML problem, had anyone with any
       | knowledge been around.
       | 
       | Hey, tweag! If you want to do ML, reach out. :) You can do far
       | better than this!
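The two-step loader described in the comment above can be sketched with one background thread that keeps a bounded pool of images fresh and a consumer that samples random patches from whatever is resident. This is one possible reading of the comment, not the commenter's code; `ImagePool`, `fake_download`, the pool capacity, and the image sizes are all hypothetical, and the download is simulated with a short sleep.

```python
import random
import threading
import time


class ImagePool:
    """Bounded pool of resident images shared by the two steps."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._images = []
        self._lock = threading.Lock()
        self._nonempty = threading.Event()

    def add(self, image):
        with self._lock:
            if len(self._images) >= self._capacity:
                self._images.pop(0)  # swap the oldest image out
            self._images.append(image)
        self._nonempty.set()

    def random_patch(self, size):
        self._nonempty.wait()  # blocks only until the first image arrives
        with self._lock:
            image = random.choice(self._images)
        top = random.randrange(len(image) - size + 1)
        left = random.randrange(len(image[0]) - size + 1)
        return [row[left:left + size] for row in image[top:top + size]]


def fake_download(index, side=16):
    """Stand-in for a network fetch; returns a side x side 'image'."""
    time.sleep(0.001)
    return [[index] * side for _ in range(side)]


def downloader(pool, count):
    # Step one (network bound): keep replacing images in the pool.
    for index in range(count):
        pool.add(fake_download(index))


pool = ImagePool(capacity=4)
thread = threading.Thread(target=downloader, args=(pool, 20), daemon=True)
thread.start()
# Step two (CPU bound): sample random patches from resident images.
patches = [pool.random_patch(size=8) for _ in range(32)]
thread.join()
```

After the first image lands, the sampling step never waits on the network; it draws from the images currently in the pool while the downloader keeps rotating them.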
        
         | lisasays wrote:
         | Can't evaluate on the details, but in general I love reading
         | takes like this. Basically a variant of the "When all you have
         | is a hammer" mindset and its inevitable consequences, as we see
         | replicated over and over again in this industry. If only it
         | were the top comment.
        
       ___________________________________________________________________
       (page generated 2023-06-20 23:02 UTC)