We had a minor crash in our data center. It's called "minor" because it only affected about a dozen of our more than 700 systems. Still, we had data loss and possibly some silent corruption, and that silent corruption is the really big issue.

Since we use ZFS for almost all application data, we could simply scrub those pools. No checksum errors? Perfect, the data is fine then. But what about the operating systems? We run full-blown Linux virtual machines, so every customer gets his own /, /usr, /lib and so on. What about all that data? Did it get corrupted as well? Very sadly, we still use ext4 there. On some systems we got lucky: bad superblocks. That means more work for me (because I have to re-build those systems -- which is not *that* much work, though, since we use config management for virtually everything), but at least I can be sure that these systems indeed *are* affected. Other systems just crashed and successfully rebooted. Now what?

I'm pretty much fed up with this situation. All filesystems should have checksums in 2017. Fuck performance. Performance is worth nothing if you operate on faulty data. I'm currently in the process of writing a tool that checksums files and stores the checksums in extended attributes. This is *far* from satisfactory, but in scenarios like the one above it would at least let us manually "scrub" our data to further assess the situation.
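To illustrate the idea (this is *not* my actual tool, just a minimal sketch): hash each file, stash the digest in an extended attribute, and later re-hash and compare to "scrub" it. The attribute name `user.sha256` and the choice of SHA-256 are assumptions made for this example.

```python
#!/usr/bin/env python3
"""Sketch: store and verify per-file checksums in extended attributes (Linux)."""
import hashlib
import os
import sys

XATTR_NAME = "user.sha256"  # assumed attribute name for this sketch


def file_sha256(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def store(path):
    """Compute the checksum and save it as an extended attribute."""
    os.setxattr(path, XATTR_NAME, file_sha256(path).encode())


def verify(path):
    """Re-hash the file and compare against the stored digest ("scrub")."""
    try:
        stored = os.getxattr(path, XATTR_NAME).decode()
    except OSError:
        return None  # no checksum stored yet
    return stored == file_sha256(path)


if __name__ == "__main__":
    mode = sys.argv[1]  # "store" or "verify"
    for p in sys.argv[2:]:
        if mode == "store":
            store(p)
        else:
            ok = verify(p)
            state = "OK" if ok else ("MISSING" if ok is None else "MISMATCH")
            print(f"{p}: {state}")
```

Run as `./xsum.py store FILE...` after a known-good state, and `./xsum.py verify FILE...` after an incident. Of course this only catches corruption that happened *after* the checksum was stored, and it says nothing about metadata -- which is exactly why it's far from satisfactory compared to checksums in the filesystem itself.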