[HN Gopher] Zimtohrli: A New Psychoacoustic Perceptual Metric fo...
       ___________________________________________________________________
        
       Zimtohrli: A New Psychoacoustic Perceptual Metric for Audio
       Compression
        
       Author : judiisis
       Score  : 65 points
       Date   : 2024-05-08 12:30 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | formerly_proven wrote:
       | I'm guessing the name is meant to allude to cinnamon pig ears
       | (https://en.wikipedia.org/wiki/Palmier).
        
         | atoav wrote:
         | Probably; Zimt is cinnamon and Ohrli is Swiss German dialect
         | for ear.
        
           | jo-m wrote:
           | Öhrli actually. It maddens me to no end that they pick words
           | which contain umlauts and then leave them out.
        
           | b3orn wrote:
           | Öhrli is actually the Swiss German diminutive of Ohr (ear).
           | Swiss German uses -li a lot for diminutives, whereas Standard
           | German uses -chen or -lein; the vowel of the stem is turned
           | into an umlaut: Ohr -> Öhrli/Öhrchen/Öhrlein.
        
         | w-m wrote:
         | Likely inspired by https://github.com/google/butteraugli
        
       | yalok wrote:
       | It'd be very interesting to see the results for this metric for
       | the existing audio and voice codecs (like AAC, AAC-LD, mp3,
       | opus), and how it compares to the existing metrics for them?
       | 
       | Couldn't find it in their paper.
        
       | ant6n wrote:
       | This says it works on just-noticeable-differences. Would this
       | work well if the quality of the compressed audio is very poor?
       | Could one, for example, compare two speech codecs at 8 kHz, 4-bit
       | against the original source to find out which one sounds better?
       | 
       | Or should one just... I dunno, calculate the mean squared error
       | in some sort of continuous frequency domain, perhaps weighted by
       | some hearing curve.
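
A minimal sketch of the "weighted mean squared error" idea above, assuming the standard A-weighting curve stands in for "some hearing curve" and a naive DFT stands in for the frequency transform. The function names are hypothetical, only magnitudes are compared (phase is ignored), and nothing here is part of Zimtohrli or any existing metric.

```python
import cmath
import math


def a_weight_db(freq_hz):
    """A-weighting gain in dB (standard analog formula, about 0 dB at 1 kHz)."""
    f2 = freq_hz ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(ra) + 2.00


def dft_magnitudes(signal):
    """Naive O(n^2) DFT magnitudes for bins 0..n/2 (fine for short frames)."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        mags.append(abs(acc) / n)
    return mags


def weighted_spectral_mse(reference, degraded, sample_rate):
    """Mean squared magnitude error per bin, weighted by A-weighted power gain."""
    n = len(reference)
    ref_mag = dft_magnitudes(reference)
    deg_mag = dft_magnitudes(degraded)
    total = 0.0
    weight_sum = 0.0
    for k in range(1, len(ref_mag)):  # skip the DC bin
        freq = k * sample_rate / n
        weight = 10.0 ** (a_weight_db(freq) / 10.0)  # dB -> power weight
        total += weight * (ref_mag[k] - deg_mag[k]) ** 2
        weight_sum += weight
    return total / weight_sum


if __name__ == "__main__":
    sr, n = 8000, 64
    clean = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(n)]
    noisy = [x + 0.05 * math.sin(2 * math.pi * 125 * t / sr)
             for t, x in enumerate(clean)]
    print(weighted_spectral_mse(clean, noisy, sr))
```

With this weighting, an error of a given amplitude at 125 Hz counts for less than the same error at 1 kHz. The sketch has no notion of masking, so it scores an error the same whether or not other audio would hide it.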
        
         | jononor wrote:
         | PEAQ/PESQ and ViSQOL are worth trying for that. In principle
         | they operate as you suggest. I keep a short overview of audio
         | quality methods/tools here:
         | https://github.com/jonnor/machinehearing/blob/master/audio-q...
        
         | mrob wrote:
         | Audibility of error (and sound in general) depends on what
         | other audio is playing at the same time, with both frequency
         | domain and time domain effects:
         | 
         | https://en.wikipedia.org/wiki/Auditory_masking
         | 
         | Here's a two-part lecture with audio demonstrations by Bernhard
         | Seeber of the Audio Information Processing Group at the
         | Technical University of Munich:
         | 
         | https://www.youtube.com/watch?v=R9UZnMsm9o8
         | 
         | https://www.youtube.com/watch?v=bU0_Kaj7cPk
         | 
         | A simple weighted frequency domain error calculation is not very
         | useful for comparing lossy audio codecs, because effectively
         | exploiting auditory masking to hide the errors is a major
         | factor in codec quality.
        
       | givinguflac wrote:
       | I looked through the deeper explanation and found this
       | interesting:
       | 
       | "Performing a simple experiment where we have 5 separate
       | components
       | 
       | - 1000 Hz sine probe 57 dB SPL
       | - 750 Hz sine masker A at 71dB SPL
       | - 800 Hz sine masker B at 71 dB SPL
       | - 850 Hz sine masker C at 67 dB SPL
       | - 900 Hz sine masker D at 65 dB SPL
       | 
       | I record the following data
       | 
       | When playing probe + masker A through D individually I experience
       | the probe approximately as intensely as a 1000Hz tone at 53dB
       | SPL. When playing probe + all maskers I experience the probe
       | approximately as intensely as a 1000Hz tone at 48dB SPL."
       | 
       | I would be very interested in understanding more about their
       | testing methodology and hardware setup especially.
       | 
       | Is the perceiver a trained listener? Are they using headphones or
       | speakers or some other transducer method?
       | 
       | It's awfully difficult to say that there is equivalent perceived
       | SPL for different frequency domains, even as a trained listener.
       | Especially given the different frequency response for different
       | listening setups.
       | 
       | The average user has no chance; hence my curiosity about their
       | specific credentials considering they're building an entirely new
       | perceptual model based on that.
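
For scale, the levels in the quoted experiment can be put through ordinary decibel arithmetic. This is a small sketch with a hypothetical helper name, assuming the four maskers add as incoherent sources (powers sum), which is reasonable for sines at different frequencies. It says nothing about how the project's own model combines them.

```python
import math


def combined_level_db(levels_db):
    """Level of several incoherent sources: sum the powers, convert back to dB."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))


# Levels taken from the quoted experiment (dB SPL).
maskers_db = [71.0, 71.0, 67.0, 65.0]  # maskers A, B, C, D
probe_db = 57.0
perceived_single_db = 53.0  # probe + one masker at a time
perceived_all_db = 48.0     # probe + all four maskers

if __name__ == "__main__":
    print("combined masker level:", round(combined_level_db(maskers_db), 1), "dB SPL")
    print("reported drop, single masker:", probe_db - perceived_single_db, "dB")
    print("reported drop, all maskers:", probe_db - perceived_all_db, "dB")
```

The four maskers together come to roughly 75 dB SPL, about 4 dB above the strongest single masker, while the reported perceived probe level falls from 53 dB to 48 dB.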
        
         | DoctorOetker wrote:
         | >It's awfully difficult to say that there is equivalent
         | perceived SPL for different frequency domains, even as a
         | trained listener.
         | 
         | The snippet you quote doesn't claim to compare intensities at
         | different frequencies.
         | 
         | He is comparing only perceived 1 kHz intensities (in the
         | presence or absence of maskers at other frequencies, whose
         | intensity is not subjectively being scored).
        
           | givinguflac wrote:
           | Ah, thank you for clarifying. I misunderstood, but I still have
           | the same curiosity about their methods.
        
       | DoctorOetker wrote:
       | Are there any associated scientific articles and/or datasets that
       | back up the experimental claim/insinuation of matching JNDs or
       | perceptual differences?
       | 
       | Is this a proposal without experimental verification?
        
       | Thoreandan wrote:
       | Interesting, if hard to understand.
       | 
       | It would be nice to see ELI5 explanations for items like this
       | akin to Monty's 'A Digital Media Primer for Geeks' (
       | https://people.xiph.org/~xiphmont/demo/#:~:text=Xiph )
        
       | bbstats wrote:
       | very useful - I find that a lot of audio SR (compression) algos
       | sound really bad - likely just because the loss functions
       | and/or eval metrics are 'inhuman'.
        
       | marcodiego wrote:
       | Can it be used to make LAME even better? I mean, I'm still fond
       | of mp3, especially now that it is patent/royalty free and there
       | are literally billions of compatible devices.
        
       | Dave_Rosenthal wrote:
       | A few comments:
       | 
       | - My understanding is that a gammachirp is the established
       | filter to use for an auditory filter bank--any reason you chose
       | an elliptical filter instead?
       | 
       | - I didn't look too closely, but it seems like you are analyzing
       | the output of the filter bank as real numbers. I highly recommend
       | you convolve with a complex representation of the filter and keep
       | all of the math in the complex domain until you collapse to
       | loudness.
       | 
       | - I'd not bucket to discrete 100 Hz time slices, instead just
       | convolve the temporal masking function with the full time
       | resolution of the filter bank output.
       | 
       | - You want to think about some volume normalization step that
       | would give the final minimized Zimtohrli distance metric between
       | A and B*x, where x is a free variable for volume. Otherwise, a
       | perceptual codec that just tends to make things a bit quieter
       | might get a bad score.
       | 
       | - For Fletcher-Munson, I assume you are just using a curve at a
       | high-ish volume? If so, good :)
       | 
       | - Not sure how you are spacing filter bank center frequencies
       | relative to ERB size, but I'd recommend oversampling by a factor
       | of 2-3. (That is, a few filters per ERB).
       | 
       | Apologies if any of these are off base--I just took a quick look.
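
Two of the suggestions above can be sketched briefly: spacing filter centre frequencies at a few per ERB, and minimising a distance over a free volume factor x. This assumes the Glasberg and Moore ERB formulas and a golden-section search over gain in dB. All function names are hypothetical, and `mean_squared_error` is only a stand-in for a real perceptual distance. This is not how Zimtohrli does either step.

```python
import math


def erb_bandwidth_hz(freq_hz):
    """Equivalent rectangular bandwidth at freq_hz (Glasberg & Moore)."""
    return 24.7 * (4.37 * freq_hz / 1000.0 + 1.0)


def hz_to_erb_rate(freq_hz):
    """Frequency in Hz -> position on the ERB-rate scale."""
    return 21.4 * math.log10(1.0 + 0.00437 * freq_hz)


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 0.00437


def center_frequencies(low_hz, high_hz, filters_per_erb=3):
    """Centre frequencies spaced 1/filters_per_erb apart on the ERB-rate scale."""
    lo = hz_to_erb_rate(low_hz)
    hi = hz_to_erb_rate(high_hz)
    count = int(math.floor((hi - lo) * filters_per_erb)) + 1
    return [erb_rate_to_hz(lo + i / filters_per_erb) for i in range(count)]


def mean_squared_error(a, b):
    """Stand-in distance; a real perceptual metric would go here."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def min_distance_over_gain(reference, degraded, distance,
                           lo_db=-12.0, hi_db=12.0, iters=60):
    """Minimise distance(reference, g * degraded) over gain g.

    Golden-section search over gain in dB; assumes the distance is
    unimodal in gain within [lo_db, hi_db].
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0

    def cost(gain_db):
        g = 10.0 ** (gain_db / 20.0)
        return distance(reference, [g * s for s in degraded])

    a, b = lo_db, hi_db
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if cost(c) < cost(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    best_db = (a + b) / 2.0
    return cost(best_db), best_db


if __name__ == "__main__":
    freqs = center_frequencies(50.0, 8000.0, filters_per_erb=3)
    print(len(freqs), "filters from", freqs[0], "to", freqs[-1], "Hz")
    ref = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
    quiet = [0.5 * x for x in ref]  # same signal, about 6 dB quieter
    print(min_distance_over_gain(ref, quiet, mean_squared_error))
```

With the gain search in place, a degraded signal that is simply quieter than the reference scores close to zero rather than being penalised for the level change.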
        
       | p0nce wrote:
       | How does it compare to visqol v3?
        
       ___________________________________________________________________
       (page generated 2024-05-08 23:00 UTC)