[HN Gopher] Zimtohrli: A New Psychoacoustic Perceptual Metric fo...
___________________________________________________________________
Zimtohrli: A New Psychoacoustic Perceptual Metric for Audio
Compression
Author : judiisis
Score : 65 points
Date : 2024-05-08 12:30 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| formerly_proven wrote:
| I'm guessing the name is meant to allude to cinnamon pig ears
| (https://en.wikipedia.org/wiki/Palmier).
| atoav wrote:
| Probably. Zimt is cinnamon, and Ohrli is Swiss German dialect
| for ear.
| jo-m wrote:
| Öhrli actually. It maddens me to no end that they pick words
| which contain umlauts and then leave them out.
| b3orn wrote:
| Öhrli is actually the Swiss German diminutive of Ohr (ear).
| Swiss German uses -li a lot for diminutives, whereas Standard
| German uses -chen or -lein. The vowel of the stem is turned
| into an umlaut: Ohr -> Öhrli/Öhrchen/Öhrlein.
| w-m wrote:
| Likely inspired by https://github.com/google/butteraugli
| yalok wrote:
| It'd be very interesting to see the results for this metric for
| the existing audio and voice codecs (like AAC, AAC-LD, mp3,
| opus), and how it compares to the existing metrics for them?
|
| Couldn't find it in their paper.
| ant6n wrote:
| This says it works on just-noticeable-differences. Would this
| work well if the quality of the compressed audio is very poor?
| Could one for example compare two speech codecs at 8 kHz, 4-bit
| against the original source to find out which one sounds better?
|
| Or should one just... I dunno, calculate the mean squared error
| in some sort of continuous frequency domain, perhaps weighted by
| some hearing curve.
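|
| A minimal sketch of that naive baseline, assuming A-weighting as
| the hearing curve and a plain DFT magnitude error on one short
| frame. The function names are illustrative and not taken from
| Zimtohrli, and nothing here models masking:

```python
import cmath
import math


def a_weight_gain(freq_hz):
    """Linear amplitude gain of the A-weighting curve (about 1.0 at 1 kHz)."""
    if freq_hz <= 0.0:
        return 0.0
    f2 = freq_hz * freq_hz
    num = (12194.0 ** 2) * f2 * f2
    den = ((f2 + 20.6 ** 2)
           * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    # The +2 dB offset normalizes the curve to roughly 0 dB at 1 kHz.
    return (num / den) * 10.0 ** (2.0 / 20.0)


def dft_magnitudes(samples):
    """Naive O(n^2) DFT magnitudes for bins 0..n/2; fine for short frames."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        mags.append(abs(acc) / n)
    return mags


def weighted_spectral_mse(reference, degraded, sample_rate):
    """Mean squared A-weighted magnitude error between two equal-length frames.

    Phase differences are ignored; only bin magnitudes are compared.
    """
    if len(reference) != len(degraded):
        raise ValueError("frames must have equal length")
    n = len(reference)
    ref_mag = dft_magnitudes(reference)
    deg_mag = dft_magnitudes(degraded)
    total = 0.0
    for k, (r, d) in enumerate(zip(ref_mag, deg_mag)):
        weight = a_weight_gain(k * sample_rate / n)
        total += (weight * (r - d)) ** 2
    return total / len(ref_mag)
```

| With this weighting, an error component at 100 Hz counts far less
| than one of the same amplitude at 3 kHz.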
| jononor wrote:
| PEAQ/PESQ and visqol are worth trying for that. In principle
| they operate as you suggest. I keep a short overview of audio
| quality methods/tools here:
| https://github.com/jonnor/machinehearing/blob/master/audio-q...
| mrob wrote:
| Audibility of error (and sound in general) depends on what
| other audio is playing at the same time, with both frequency
| domain and time domain effects:
|
| https://en.wikipedia.org/wiki/Auditory_masking
|
| Here's a two-part lecture with audio demonstrations by Bernhard
| Seeber of the Audio Information Processing Group at the
| Technical University of Munich:
|
| https://www.youtube.com/watch?v=R9UZnMsm9o8
|
| https://www.youtube.com/watch?v=bU0_Kaj7cPk
|
| A simple weighted frequency domain error calculation is not very
| useful for comparing lossy audio codecs, because effectively
| exploiting auditory masking to hide the errors is a major
| factor in codec quality.
| givinguflac wrote:
| I looked through the deeper explanation and found this
| interesting:
|
| "Performing a simple experiment where we have 5 separate
| components
|
| - 1000 Hz sine probe 57 dB SPL
| - 750 Hz sine masker A at 71 dB SPL
| - 800 Hz sine masker B at 71 dB SPL
| - 850 Hz sine masker C at 67 dB SPL
| - 900 Hz sine masker D at 65 dB SPL
|
| I record the following data
|
| When playing probe + masker A through D individually I experience
| the probe approximately as intensely as a 1000Hz tone at 53dB
| SPL. When playing probe + all maskers I experience the probe
| approximately as intensely as a 1000Hz tone at 48dB SPL."
|
| I would be very interested in understanding more about their
| testing methodology and hardware setup especially.
|
| Is the perceiver a trained listener? Are they using headphones or
| speakers or some other transducer method?
|
| It's awfully difficult to say that there is equivalent perceived
| SPL for different frequency domains, even as a trained listener.
| Especially given the different frequency response for different
| listening setups.
|
| The average user has no chance; hence my curiosity of their
| specific credentials considering they're building an entirely new
| perceptual model based on that.
| DoctorOetker wrote:
| >It's awfully difficult to say that there is equivalent
| perceived SPL for different frequency domains, even as a
| trained listener.
|
| The snippet you quote doesn't claim to compare intensities at
| different frequencies.
|
| He is comparing only perceived 1 kHz intensities (in the
| presence or absence of maskers at other frequencies, whose
| intensity is not subjectively being scored).
| givinguflac wrote:
| Ah, thank you for clarifying, I misunderstood but still have
| the same curiosity about their methods.
| DoctorOetker wrote:
| Are there any associated scientific articles and/or datasets that
| back up the experimental claim/insinuation of matching JNDs or
| perceptual differences?
|
| Is this a proposal without experimental verification?
| Thoreandan wrote:
| Interesting, if hard to understand.
|
| It would be nice to see ELI5 explanations for items like this
| akin to Monty's 'A Digital Media Primer for Geeks' (
| https://people.xiph.org/~xiphmont/demo/#:~:text=Xiph )
| bbstats wrote:
| very useful - I find a lot of audio SR (compression) algos
| sound really bad - likely just because the loss functions
| and/or eval metrics are 'inhuman'.
| marcodiego wrote:
| Can it be used to make LAME even better? I mean, I'm still fond
| of mp3, especially now that it is patent/royalty free and there
| are literally billions of compatible devices.
| Dave_Rosenthal wrote:
| A few comments:
|
| - My understanding is that a gammachirp is the established
| filter to use for an auditory filter bank--any reason you chose
| an elliptical filter instead?
|
| - I didn't look too closely, but it seems like you are analyzing
| the output of the filter bank as real numbers. I highly recommend
| you convolve with a complex representation of the filter and keep
| all of the math in the complex domain until you collapse to
| loudness.
|
| - I'd not bucket to discrete 100 Hz time slices, instead just
| convolve the temporal masking function with the full time
| resolution of the filter bank output.
|
| - You want to think about some volume normalization step that
| would give the final minimized Zimtohrli distance metric between
| A and B*x, where x is a free variable for volume. Otherwise, a
| perceptual codec that just tends to make things a bit quieter
| might get a bad score.
|
| - For Fletcher-Munson, I assume you are just using a curve at a
| high-ish volume? If so, good :)
|
| - Not sure how you are spacing filter bank center frequencies
| relative to ERB size, but I'd recommend oversampling by a factor
| of 2-3. (That is, a few filters per ERB).
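|
| The spacing suggestion in the last point can be sketched as
| follows, assuming the Glasberg and Moore ERB formulas. The
| function names and the 20 Hz to 20 kHz range are illustrative
| choices, not how Zimtohrli lays out its filter bank:

```python
import math


def hz_to_erb_number(freq_hz):
    """Map frequency in Hz to the ERB-number scale (Glasberg and Moore)."""
    return 21.4 * math.log10(1.0 + 0.00437 * freq_hz)


def erb_number_to_hz(erb_number):
    """Inverse of hz_to_erb_number."""
    return (10.0 ** (erb_number / 21.4) - 1.0) / 0.00437


def erb_bandwidth_hz(freq_hz):
    """Equivalent rectangular bandwidth in Hz at the given center frequency."""
    return 24.7 * (4.37 * freq_hz / 1000.0 + 1.0)


def center_frequencies(low_hz, high_hz, filters_per_erb):
    """Center frequencies spaced 1/filters_per_erb apart on the ERB-number scale."""
    lo = hz_to_erb_number(low_hz)
    hi = hz_to_erb_number(high_hz)
    count = int(math.floor((hi - lo) * filters_per_erb)) + 1
    return [erb_number_to_hz(lo + i / filters_per_erb) for i in range(count)]
```

| Calling center_frequencies(20.0, 20000.0, 2) places two filters
| per ERB, so neighbouring filters overlap by roughly half a band.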
|
| Apologies if any of these are off base--I just took a quick look.
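|
| To make the volume normalization point concrete, here is a rough
| sketch that searches for the gain x minimizing distance(A, B*x).
| The distance callable is a placeholder for whatever metric is
| used; rms_distance below is only a stand-in for the demo, and the
| golden-section search assumes the distance is unimodal in gain
| over the +/-12 dB range, which may not hold for a real metric:

```python
import math


def best_gain_distance(distance, reference, degraded,
                       min_db=-12.0, max_db=12.0, tolerance_db=0.01):
    """Return (gain_db, distance) at the gain minimizing distance(ref, gain*deg)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0

    def cost(gain_db):
        gain = 10.0 ** (gain_db / 20.0)
        return distance(reference, [gain * s for s in degraded])

    lo, hi = min_db, max_db
    a = hi - inv_phi * (hi - lo)
    b = lo + inv_phi * (hi - lo)
    cost_a, cost_b = cost(a), cost(b)
    while hi - lo > tolerance_db:
        if cost_a < cost_b:
            # Minimum lies in [lo, b]; reuse a as the new upper probe.
            hi, b, cost_b = b, a, cost_a
            a = hi - inv_phi * (hi - lo)
            cost_a = cost(a)
        else:
            # Minimum lies in [a, hi]; reuse b as the new lower probe.
            lo, a, cost_a = a, b, cost_b
            b = lo + inv_phi * (hi - lo)
            cost_b = cost(b)
    best_db = (lo + hi) / 2.0
    return best_db, cost(best_db)


def rms_distance(x, y):
    """Stand-in distance for the demo only; not a perceptual metric."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)) / len(x))
```

| For a degraded signal that is simply the reference at half
| amplitude, the search settles near +6 dB and the normalized
| distance drops to almost zero, so a merely quieter codec is not
| penalized.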
| p0nce wrote:
| How does it compare to visqol v3?
___________________________________________________________________
(page generated 2024-05-08 23:00 UTC)