[HN Gopher] Google is embedding inaudible watermarks into its AI...
       ___________________________________________________________________
        
       Google is embedding inaudible watermarks into its AI generated
       music
        
       Author : CharlesW
       Score  : 109 points
       Date   : 2023-11-18 16:36 UTC (6 hours ago)
        
 (HTM) web link (www.theverge.com)
 (TXT) w3m dump (www.theverge.com)
        
       | chpatrick wrote:
       | Maybe they put pictures of cats in:
       | https://twistedsifter.com/2013/01/hidden-images-embedded-int...
        
         | jordanreger wrote:
         | i've heard a ton of aphex twin and disasterpeace, one of my
         | favorite songs being compass, and i've never even thought about
         | this. cool to see it's hidden in such a beautiful song
        
         | Sparkyte wrote:
         | Probably the best song ever because it contains cats.
        
       | echelon wrote:
        | What watermarking libraries for images, audio, or video are HN
        | users using?
       | 
       | We've found a number of them, but the good ones are GPL3 [1]. I
       | might email the authors about getting a license if there isn't
       | anything equivalent.
       | 
       | [1] https://github.com/swesterfeld/audiowmark is fantastic
        
         | lencastre wrote:
         | Counting the seconds until someone gets an audio recording with
         | 0-click Trojan malware. I miss the world when all you got was
         | ILOVEYOU in MS Word...
        
           | echelon wrote:
           | The payloads being encoded are extremely tiny. Typically not
           | even enough to store a full UUID.
           | 
           | This is configurable, of course, but you sacrifice
           | recoverability and/or quality of the original signal if you
           | try to store more bits.
        
         | AnthonyMouse wrote:
         | Is there some reason you don't just use it under GPLv3? Are you
         | trying to create a derivative work you don't want to publish
         | under the same license?
        
       | cebert wrote:
        | If you removed a watermark, could it be considered a violation
        | of the DMCA?
        
         | henriquez wrote:
         | That's assuming a) AI-generated tracks were copyrightable in
         | the first place, and b) watermarks are copy-protection
         | mechanisms
         | 
          | Neither a nor b is true, so it's probably fine.
        
           | echelon wrote:
           | Watermarks are these things:
           | 
           | - Non-bulletproof provenance data
           | 
           | - Non-bulletproof abuse tracking
           | 
           | - Non-bulletproof proliferation tracking
           | 
           | - Security sold to those concerned about AI
           | 
           | (I build an AI system, so these are all top of mind.)
        
       | inquirerGeneral wrote:
       | This is really cool. I like how the headline is clickbait so
       | people get a negative impression from it, while it's literally a
       | great positive thing.
        
       | TerrifiedMouse wrote:
        | I wonder how said watermarks will survive lossy audio
        | compression, since a big part of lossy audio compression is to
        | remove parts of the signal humans can't hear and won't notice
        | are gone.
        
         | vachina wrote:
          | There was a time when Blu-ray players had to implement an
          | audio watermark decoder, to detect and stop playback of
          | unauthorized Blu-ray copies.
         | 
         | > Cinavia's in-band signaling introduces intentional spread
         | spectrum phase distortion in the frequency domain of each
         | individual audio channel separately, giving a per-channel
         | digital signal that can yield up to around 0.2 bits per second.
         | 
         | > Cinavia is designed to stay within the audio signal and to
         | survive all common forms of audio transfer, including lossy
         | data compression using discrete cosine transform, MP3, DTS, or
         | Ogg Vorbis. It is designed to survive digital and analog sound
         | recording and reproduction via microphones, direct audio
         | connections and broadcasting, and does so by using audio
         | frequencies within the hearing range.
        
           | TerrifiedMouse wrote:
           | So the watermark is audible to listeners.
        
             | rcxdude wrote:
              | No, because it consists of very small phase modulations.
              | The main tradeoff is that you need a relatively long
              | segment of audio to detect it.
        
               | InCityDreams wrote:
               | I recently bought a new synth...i wonder if it has a
               | signature?
               | 
               | *#>Synth manufacturers start licking their lips.
               | 
                | And, for the amateur musos that still don't get it:
                | imagine having to pay, or have someone pay, every time
                | a DX7 or an MS-20 was used in a tune.
        
             | 613style wrote:
             | It's audible in the sense that it exists completely in an
             | audio modality.
             | 
             | But it's not audible in the sense that human ears can't
             | discern it.
        
               | meindnoch wrote:
               | The aim of lossy compression is to discard any data that
               | isn't discernible to the human auditory system.
               | 
               | If an audio signal survives a lossy compression
               | algorithm, then there are two possibilities:
               | 
               | - the compression algorithm should be improved, as it
               | preserves some data that isn't important to the human ear
               | 
               | - the signal is audible
        
               | amatecha wrote:
               | That's not exactly the aim of lossy compression. Its aim
               | is to reduce data size while introducing as little
               | discernible effect as realistically possible. That
               | usually means optimizing the algorithm such that most of
                | the loss is indiscernible to us, such as in the darkest
                | regions of an image, or the extremely high frequencies in
                | audio -- both areas we don't perceive with much
               | granularity. Something like spread-spectrum phase
               | distortion may survive compression just fine but still be
               | indiscernible to us. The two are not mutually exclusive.
        
               | AnthonyMouse wrote:
               | Suppose you could encode 0.2 bits per second of watermark
               | in a recording without humans being able to discern it.
               | Suppose the compression algorithm did the same thing to
               | encode additional information which is part of the
               | recording, allowing it to achieve higher quality at the
               | same bitrate or the same quality at a lower bitrate.
               | 
               | It's information theory. Either you can encode additional
               | information without impacting the result, in which case
               | the compression algorithm could use it to be more
               | efficient, or you can't. TANSTAAFL.
        
               | wizzwizz4 wrote:
               | And complexity theory says we can't reach the information
               | theoretical limits with generic algorithms.
        
         | wubrr wrote:
         | Couldn't they apply the watermark post-compression? But yeah,
         | eventually people will figure out how the watermark works and
         | be able to remove it.
        
           | TerrifiedMouse wrote:
            | Frankly, if it can be detected, it can be removed. Makes
            | sense, no?
        
             | FartyMcFarter wrote:
             | Think about watermarks in images. They can be removed, but
             | it isn't easy to do so without making it obvious that the
             | image is damaged / doctored.
             | 
             | I imagine this is also the case for audio watermarks, but
             | I'm not sure what the current state of the art is for
             | watermark removal.
        
               | greatpatton wrote:
               | You can easily train a model to clean watermark from
               | images.
        
               | AnthonyMouse wrote:
               | > They can be removed, but it isn't easy to do so without
               | making it obvious that the image is damaged / doctored.
               | 
               | This is primarily because those watermarks damage the
               | image. The information about the content of the image
               | beneath the watermark is missing (replaced with the
               | watermark) and would have to be extrapolated or
               | fabricated to replace it.
               | 
               | If the watermark isn't visible as part of the image then
               | there is no missing piece to have to replace.
        
               | KineticLensman wrote:
               | Modern content-aware fill can very impressively replace
               | dead areas of an image. E.g. in Photoshop 2024
        
           | breakfastduck wrote:
            | What? And work with a compressed format? No audio engineer
            | would do that.
        
         | crazygringo wrote:
         | They'll survive fine.
         | 
         | The watermark is applied to the spectrogram. Each "pixel" of
         | the watermark represents a small band of frequency in a small
         | time slice. The watermark presumably does something like
         | increase the volume by 1% in the watermarked "pixels" --
         | basically imperceptibly adjusting EQ by tiny amounts in tiny
         | places.
         | 
         | That will survive lossy compression just fine -- lossy
         | compression applies a low pass filter and then removes whole
         | spectrogram chunks where the signal is below a threshold.
         | 
         | But a watermark will still be entirely detectable in all the
         | chunks that remain.
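The EQ-nudge scheme sketched in that comment could look something like the following toy Python/NumPy code. The per-frame bin choice, the 1% gain, and the shared seed are assumptions for illustration only, not SynthID's actual method:

```python
import numpy as np

def embed_watermark(signal, frame_len=1024, gain=1.01, seed=42):
    """Toy spectrogram watermark: nudge the magnitude of one pseudo-
    randomly chosen frequency bin per frame up by ~1%. A detector
    sharing the seed knows which bins to check. Illustrative only."""
    rng = np.random.default_rng(seed)
    out = signal.astype(float).copy()
    n_frames = len(signal) // frame_len
    for i in range(n_frames):
        start = i * frame_len
        spec = np.fft.rfft(out[start:start + frame_len])
        bin_idx = int(rng.integers(10, len(spec) - 10))  # watermark "pixel"
        spec[bin_idx] *= gain                            # ~1% EQ bump
        out[start:start + frame_len] = np.fft.irfft(spec, n=frame_len)
    return out
```

Because the change is a small relative scaling of surviving spectrogram cells rather than added content below the masking threshold, a codec that keeps those cells at all keeps the nudge too.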
        
           | neuralRiot wrote:
            | How about (analog) dynamic range compression? If the WM is
            | embedded in the audio data then it won't survive
            | DA > compress > AD.
        
             | crazygringo wrote:
             | Can you clarify? A DAC doesn't perform any compression.
             | What dynamic compression are you referring to?
             | 
              | If you mean dynamic _range_ compression, that's something
              | done as part of mastering. It has nothing to do with a DAC.
        
               | neuralRiot wrote:
                | Not the DAC, but a DSP can do it. So the chain would be
                | DAC > DSP (compression) > ADC. What I'm trying to say is
                | that the "watermark" would theoretically survive data
                | compression (WAV > MP3, FLAC, AAC) but not dynamic range
                | compression.
        
         | dist-epoch wrote:
         | Compression tends to discard high-frequency signals since they
         | use the most bits.
         | 
         | If you slightly alter the low frequency signals, they will be
         | preserved. 30.1 Hz -> 30.2 Hz.
        
           | TerrifiedMouse wrote:
           | > Compression tends to discard high-frequency signals since
           | they use the most bits.
           | 
            | Quite certain that's not true. We drop everything past
            | 20 kHz simply because it's outside the average person's
            | hearing range.
            | 
            | We drop everything under 20 Hz too - we also cannot hear
            | below that.
        
       | xyproto wrote:
       | Yet another reason to use local LLMs.
        
         | system2 wrote:
         | Audio maybe but visuals require so much processing power and a
          | heavy image database. I am impatiently waiting for a
          | ChatGPT-4 rival that runs locally.
        
           | xyproto wrote:
           | It depends on what you are planning to do.
           | 
           | Stable diffusion works great locally on both the dell+linux
           | and the mac laptop I tried it on. This one was easy to use
           | while exploring models from huggingface:
           | https://diffusionbee.com/
           | 
           | DALL-E still gives better results, though.
        
       | snickerbockers wrote:
       | >Watermarking tools like SynthID are seen as an important
       | safeguard against some of the harms of generative AI.
       | 
        | Except the lack of a watermark doesn't necessarily exclude the
        | possibility of AI generation.
        
         | davidmurdoch wrote:
          | And I'm sure the opposite will happen as well: people who want
          | to pass human work off as AI-generated by adding watermarks.
        
           | peddling-brink wrote:
           | Why? Just to muddy the waters?
        
             | doomroot wrote:
             | To sell a product. "Look at what our tool can produce."
        
             | solarkraft wrote:
             | Possibly for plausible deniability. "I never said that,
             | look, here's the AI watermark".
             | 
              | It has even been argued in court that a picture isn't
              | airtight evidence, due to automatic processing, after all.
        
         | morkalork wrote:
         | It excludes them from being blamed for any fallout from an
         | unsigned file though.
        
         | LordShredda wrote:
         | It's for copywrite
        
           | LegibleCrimson wrote:
           | Copyright. Copywrite means something else.
        
       | amelius wrote:
       | Probably easy to remove these watermarks with some autoencoder
       | approach similar to:
       | 
       | https://www.kaggle.com/code/therealcyberlord/watermark-remov...
        
         | vletal wrote:
          | The example in the link you posted is equivalent to quietly
          | whispering "GENERATED BY GOOGLE" every 15 seconds over the
          | generated track.
         | 
          | Regenerating the audio is similar to forging a painting. It
          | is the strokes which will reveal the original author. But in
          | this case you want to claim the piece for yourself -
          | stripping the signature elements of the author while still
          | preserving the original quality.
         | 
          | So, the ID may be more nuanced and distributed than you might
          | think :) Your auto-encoder might well reproduce it too.
        
       | Regnore wrote:
       | Watermarks can be helpful, but I believe that provenance via
       | digital signatures is ultimately a better solution. Curious why
       | Google doesn't join the CAI (https://contentauthenticity.org/)
       | and use their approach for provenance of Google's generated audio
       | files.
        
         | Waterluvian wrote:
         | I walked through the website and I admit, I still have no clue
         | how CAI technically implements this concept. How does it work
         | for audio media?
        
           | Regnore wrote:
            | At a high level, the idea is that media is digitally signed
            | by whoever produces it to prove provenance - similar to
            | HTTPS.
           | 
            | Depending on how much time/interest you have,
            | https://c2pa.org/ has resources which explain more of the
            | technical details of the process.
        
             | AnthonyMouse wrote:
             | How is this supposed to do any good?
             | 
             | Bob produces something with AI but claims he produced it
             | himself and signs it with his private key.
             | 
             | AI produces something and signs it or doesn't, but if it's
             | signed you can just throw the signature away and either
             | publish it as unsigned or sign it again with a different
             | key.
             | 
             | Signatures allow Alice to verify that something is signed
             | by someone who has Bob's private key. If only Bob has Bob's
             | private key, that means it was signed by Bob. It doesn't
             | tell you whether it was generated by AI or not if Bob
             | doesn't want you to know, because Bob can sign whatever he
             | wants with his private key.
             | 
             | In this case "Bob" is presumably supposed to be some camera
             | with DRM, but that means it will be in the physical control
             | of attackers and anybody who can crack any camera by any
             | manufacturer can extract the private key and use it to sign
             | whatever they want, which is inevitably going to happen.
             | Keys will be available for sale to anyone who wants one and
             | doesn't have the technical acumen to extract one
             | themselves. Since that makes the whole system worthless,
             | what's the point?
        
               | Regnore wrote:
               | > Bob produces something with AI but claims he produced
               | it himself and signs it with his private key. ... because
               | Bob can sign whatever he wants with his private key.
               | 
               | Whether or not to trust Bob is an entirely different
               | problem space than being able to prove an image came from
                | Bob. In most scenarios Bob would be a "trustworthy news
                | source" who cares about their reputation. The important
               | piece here is that if someone shares something on e.g.
               | twitter and says Bob produced it, that claim can be
               | verified.
               | 
               | > crack any camera by any manufacturer can extract the
               | private key and use it to sign whatever they want, which
               | is inevitably going to happen ... Since that makes the
               | whole system worthless, what's the point?
               | 
               | Think about what happens today when a private key is
                | leaked - that key is no longer trusted. Will it be such
                | a large-scale problem that the keys are leaked the day
                | any camera is released? Maybe. Even in that
               | scenario though we end up in the same spot as today
               | except with the additional benefit of being able to
               | verify stuff coming from NPR/CNN/your preferred news
               | source that is shared on third party platforms.
        
       | zoklet-enjoyer wrote:
       | Use it for automated composition and then recreate it in a DAW
        
       | arberx wrote:
        | NFTs are a great use case for this.
        
         | salt-thrower wrote:
         | Please elaborate.
        
       | elif wrote:
        | Sure it survives compression, added noise, etc., but surely it
        | cannot survive, say, 100 bogus SynthID-formatted inaudible
        | watermarks being layered on top?
        
         | LegibleCrimson wrote:
         | I'm assuming that doing that would result in audible changes.
         | Like lossy but transparent encoding, repeated application often
         | loses transparency quickly.
        
         | dist-epoch wrote:
          | It should. That's how GPS signals work: you receive 20 of them
          | superimposed below the noise floor, and you use math and known
          | pseudo-random-generator seeds to separate them.
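A toy demonstration of that idea: a known pseudo-random sequence can be recovered by correlation even when it sits well below the noise floor. The amplitudes and sequence length here are arbitrary illustrative choices, not actual GPS parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
pn = rng.choice([-1.0, 1.0], size=n)   # known pseudo-random "chip" sequence
bit = 1                                # one data bit to transmit
signal = 0.05 * bit * pn               # transmitted amplitude 0.05
noise = rng.normal(0.0, 1.0, size=n)   # noise sigma 1.0 (~26 dB above signal)
received = signal + noise

# Correlating against the known sequence averages the noise toward zero
# while the matched signal adds coherently:
correlation = float(np.dot(received, pn)) / n
recovered_bit = 1 if correlation > 0 else -1
```

The correlation comes out near 0.05 (the hidden amplitude) while the noise contribution shrinks as 1/sqrt(n), which is why long integration times buy detection below the noise floor.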
        
       | ChuckMcM wrote:
        | Interesting take. Folks who have played around with digital
        | signal processing (DSP) and low-bandwidth / noise-tolerant
        | digital signals can probably come up with a half dozen
        | different ways to do this and have it survive both all encoding
        | methodologies and all compression/decompression schemes (at
        | least ones that strive to keep the audio "nearly identical" to
        | the human ear).
       | 
       | It would not survive if you had an analysis package that could
       | back compute the terms used by the model to generate the track
       | and you re-generated the track with your own post processing, but
       | that is a lot of work that most people would avoid.
       | 
       | It would be hilarious if someone built a model that you could
       | query for it to tell you if there was subaudible information in
       | the track and if so remove it (automating the above step to an AI
       | task).
       | 
       | There have been projects that put subaudible tones in audio to
       | trigger toys. So when you're watching a cartoon your toy would
       | respond to the cartoon. Some people may, or may not, have
       | considered doing something like that for muzak at businesses so
       | that your phone OS could report back you had entered the business
       | within a certain window of looking at an ad for that business.
        
         | fsckboy wrote:
          | when you say (paraphrasing) "the inaudible will be
         | preserved through lossy compression designed around knowledge
         | of human hearing" you're essentially saying that "undefined
         | behavior will be preserved through optimization", and that is a
         | boldly optimistic claim.
        
           | ChuckMcM wrote:
           | I think you and I may be talking about different things. If
           | you know what phase noise is and understand the term -10 dBc
            | @ 1kHz, then we might be on the same page. If you don't,
            | consider the following:
           | 
           | There are tones, things you can hear typically in the range
           | of 15 Hz to 22 kHz. When those tones vary in frequency over
           | time by a few Hertz, it is unlikely that you can perceive
           | that variation. Your ear is both logarithmic in its
           | sensitivity to volume and not particularly sensitive to
           | absolute frequency. If you shift an entire spectrum "up" a
           | couple of Hz your ear won't notice, and similarly if you
           | shift it "down" a couple of Hz you won't notice. This is
           | especially true if you do it slowly (over the course of 100
            | milliseconds or so). Tape decks and turntables do this,
            | though they try to minimize it. The specification that tells
            | you how closely they track rotational speed is "Wow and
            | Flutter"; a good number is < 0.3%, and a studio recorder
            | might be less than 0.1%.
           | 
           | With DSP you can easily pull out a spectrum shift of < 0.05%
           | if it is 'regular'. That shift can be a "frequency shift
           | keying" (FSK) signal with forward error correction in it.
            | (think of 'up' shifts as one (1) bits and 'down' shifts as
            | zero (0) bits). Running the clip through an FFT and
           | monitoring phase shift in the bins would recover this string
           | of bits. And even if you didn't get enough to error correct
           | the original message, their presence would be unmistakable.
           | 
            | No existing audio compression scheme affects the phase
            | relationships of the spectrum it is compressing. That's a
            | design feature.
           | 
           | To be fair, I wouldn't understand any of this if I hadn't
           | been delving into software defined radio and learning DSP
           | techniques for modulation and demodulation with data recovery
           | in the presence of signal distortions and interference. And I
           | completely understand that there is a language challenge when
           | the terms 'audibly imperceptible' and 'inaudible' are treated
           | as synonyms.
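The detection side of such an FSK scheme can be sketched in a few lines of NumPy: estimate the dominant tone's frequency to sub-bin precision from the phase advance between two overlapping FFT frames, then read a small up/down shift as a bit. The frame sizes, the 1 Hz shift, and the 440 Hz carrier are illustrative assumptions, not any deployed watermark's parameters:

```python
import numpy as np

def estimate_freq(signal, sr, frame_len=4096, hop=1024):
    """Estimate the dominant tone's frequency to sub-bin precision by
    comparing the phase of the same FFT bin across two overlapping
    frames (the phase-vocoder trick)."""
    f1 = np.fft.rfft(signal[:frame_len])
    f2 = np.fft.rfft(signal[hop:hop + frame_len])
    k = int(np.argmax(np.abs(f1)))                  # dominant bin
    bin_freq = k * sr / frame_len
    dphi = np.angle(f2[k] * np.conj(f1[k]))         # measured phase advance
    expected = 2 * np.pi * bin_freq * hop / sr      # advance if exactly on-bin
    dev = (dphi - expected + np.pi) % (2 * np.pi) - np.pi
    return bin_freq + dev * sr / (2 * np.pi * hop)

sr = 48000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 441.0 * t)   # a 440 Hz tone shifted "up" by 1 Hz
bit = 1 if estimate_freq(tone, sr) > 440.5 else 0   # FSK decision
```

The FFT bins here are almost 12 Hz wide, yet the phase comparison resolves the 1 Hz offset easily, which is the point of the comment above: shifts far too small to hear are trivial to pull out with DSP.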
        
         | akomtu wrote:
          | Even an audio analysis tool won't reveal much if the watermark
          | is sophisticated enough. I believe a basic watermark adds a
          | few spots to the spectrogram in a known arrangement, a bit
          | like a constellation in the sky, but a watermark can be based
          | on the Bloom filter idea: hundreds of barely visible spots
          | arranged with a hash function will look like noise.
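The hash-based placement described above might look like this hypothetical sketch, where (time frame, frequency bin) spots are derived from a keyed SHA-256 so that, without the key, the marked cells are indistinguishable from noise. The key, spot count, and grid sizes are all made-up illustrative values:

```python
import hashlib

def spot_locations(key: bytes, n_spots: int, n_frames: int, n_bins: int):
    """Derive (time frame, frequency bin) watermark spots from a keyed
    hash. Only a holder of the key can regenerate the spot pattern."""
    spots = []
    for i in range(n_spots):
        # hash the key together with a counter to get each spot
        digest = hashlib.sha256(key + i.to_bytes(4, "big")).digest()
        frame = int.from_bytes(digest[:4], "big") % n_frames
        fbin = int.from_bytes(digest[4:8], "big") % n_bins
        spots.append((frame, fbin))
    return spots

spots = spot_locations(b"secret-key", 100, 500, 512)
```

A detector with the same key recomputes the same locations and checks them; an attacker without it sees only scattered, noise-like perturbations.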
        
         | kats wrote:
         | As soon as ML generated voice becomes really good, call
            | scammers will use it against elderly people. That's going to
            | be far and away the #1 application. And as soon as
         | it's out, a whole bunch of extremely privileged ML people will
            | throw their hands up and say, "oh well, cat's out of the
            | bag."
        
           | jncfhnb wrote:
           | Voice cloning of grandkids is a very targeted attack. You can
           | achieve the same results today at a similar cost if you
           | already have specific info like that
        
           | two_in_one wrote:
            | Nothing new - generated voices are already used for scams.
            | So far they are recordings, like a "daughter" screaming for
            | help, or a "boss" ordering a money transfer. But there will
            | likely be real-time generators soon, which means the scammer
            | will be talking and adjusting depending on the reaction of
            | the victim. There will still be some delay, as the converter
            | will have some lag - unless the 'scammer' is actually an AI,
            | in which case it may sound natural, especially to unprepared
            | victims. All an attacker needs to mimic a voice today is
            | just 3 seconds of recording.
        
           | ChuckMcM wrote:
           | I don't disagree with this sentiment. But I also think it
            | will really change help desk / call center work as well. If
            | you can make the person on the phone have the same accent
            | and language as the caller, it makes the call easier to
            | understand and more 'comfortable' psychologically. Speech-to-
            | text + translation could turn the caller's question into a
            | text query in the agent's native language, and the agent's
            | typed response could go back through translation + text-to-
            | speech.
           | 
           | That makes the call center easier for both customer and agent
           | and opens up the number of people who could be agents thus
           | increasing competition and cutting costs to deliver call
           | center service.
        
       | ifeja wrote:
       | Does this mean the watermark can be removed if you strip out all
       | inaudible content? I expect not
        
         | vletal wrote:
         | No. The watermark is inaudible.
         | 
          | Btw, audio compression already does most of the "stripping
          | the inaudible" stuff. You know, why waste bits on something
          | which is less likely to be picked up by the listener? It is
          | done by assuming a model of human hearing.
        
       | dist-epoch wrote:
       | On one hand people say stuff like "all AI generated content
       | should be clearly marked".
       | 
       | But when you mark it they say "why u do that?"
       | 
       | You just can't win.
        
         | AnthonyMouse wrote:
         | The "clearly marked" thing seems to be that people are
         | concerned about AI and marking it is a way to satisfy demand
         | for Something Must Be Done. The fact that it won't work is kind
         | of irrelevant because "working" means occupying the people who
         | want to impose bad rules with something to argue about.
         | 
         | But then other people look at what they're proposing and say,
         | "hey, you know this is a farce, right?"
        
       ___________________________________________________________________
       (page generated 2023-11-18 23:01 UTC)