[HN Gopher] Google is embedding inaudible watermarks into its AI...
___________________________________________________________________
Google is embedding inaudible watermarks into its AI generated
music
Author : CharlesW
Score : 109 points
Date : 2023-11-18 16:36 UTC (6 hours ago)
(HTM) web link (www.theverge.com)
(TXT) w3m dump (www.theverge.com)
| chpatrick wrote:
| Maybe they put pictures of cats in:
| https://twistedsifter.com/2013/01/hidden-images-embedded-int...
| jordanreger wrote:
| I've heard a ton of Aphex Twin and Disasterpeace, one of my
| favorite songs being Compass, and I've never even thought about
| this. Cool to see it's hidden in such a beautiful song.
| Sparkyte wrote:
| Probably the best song ever because it contains cats.
| echelon wrote:
| What watermarking libraries for images, audio, or video is HN
| using?
|
| We've found a number of them, but the good ones are GPL3 [1]. I
| might email the authors about getting a license if there isn't
| anything equivalent.
|
| [1] https://github.com/swesterfeld/audiowmark is fantastic
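|
| For anyone curious, the basic flow (going from memory of the
| project's README, so treat the exact syntax as an assumption and
| check the repo) is a CLI that embeds and later recovers a short
| hex payload:
|
|     # embed a hex message into a WAV file
|     audiowmark add in.wav watermarked.wav 0123456789abcdef0011223344556677
|
|     # try to recover the message, even after re-encoding
|     audiowmark get watermarked.wav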
| lencastre wrote:
| Counting the seconds until someone gets an audio recording with
| 0-click Trojan malware. I miss the world when all you got was
| ILOVEYOU in MS Word...
| echelon wrote:
| The payloads being encoded are extremely tiny. Typically not
| even enough to store a full UUID.
|
| This is configurable, of course, but you sacrifice
| recoverability and/or quality of the original signal if you
| try to store more bits.
| AnthonyMouse wrote:
| Is there some reason you don't just use it under GPLv3? Are you
| trying to create a derivative work you don't want to publish
| under the same license?
| cebert wrote:
| If you removed a watermark, could it be considered a violation
| of the DMCA?
| henriquez wrote:
| That's assuming a) AI-generated tracks are copyrightable in
| the first place, and b) watermarks are copy-protection
| mechanisms.
|
| Neither a nor b is true, so it's probably fine.
| echelon wrote:
| Watermarks are these things:
|
| - Non-bulletproof provenance data
|
| - Non-bulletproof abuse tracking
|
| - Non-bulletproof proliferation tracking
|
| - Security sold to those concerned about AI
|
| (I build an AI system, so these are all top of mind.)
| inquirerGeneral wrote:
| This is really cool. I like how the headline is clickbait that
| gives people a negative impression of it, while it's actually a
| great, positive thing.
| TerrifiedMouse wrote:
| I wonder how said watermarks will survive lossy audio compression,
| since a big part of lossy audio compression is removing parts of
| the signal that humans can't hear and won't notice are gone.
| vachina wrote:
| There was a time when Blu-ray players had to implement such an
| audio watermark decoder (Cinavia) to detect and stop playback of
| unauthorized Blu-ray copies.
|
| > Cinavia's in-band signaling introduces intentional spread
| spectrum phase distortion in the frequency domain of each
| individual audio channel separately, giving a per-channel
| digital signal that can yield up to around 0.2 bits per second.
|
| > Cinavia is designed to stay within the audio signal and to
| survive all common forms of audio transfer, including lossy
| data compression using discrete cosine transform, MP3, DTS, or
| Ogg Vorbis. It is designed to survive digital and analog sound
| recording and reproduction via microphones, direct audio
| connections and broadcasting, and does so by using audio
| frequencies within the hearing range.
| TerrifiedMouse wrote:
| So the watermark is audible to listeners.
| rcxdude wrote:
| no, because it's very small phase modulations. The main
| tradeoff is you need a relatively long segment of audio to
| detect it.
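|
| To put the 0.2 bits per second figure above in perspective:
| recovering even a modest 64-bit payload at that rate takes
| roughly 64 / 0.2 = 320 seconds, i.e. more than five minutes of
| continuous audio.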
| InCityDreams wrote:
| I recently bought a new synth... I wonder if it has a
| signature?
|
| *#>Synth manufacturers start licking their lips.
|
| And, for the amateur musos that still don't get it: imagine
| having to pay, or have had to pay, every time a DX7 was used
| in a tune, or an MS-20.
| 613style wrote:
| It's audible in the sense that it exists completely in an
| audio modality.
|
| But it's inaudible in the sense that human ears can't
| discern it.
| meindnoch wrote:
| The aim of lossy compression is to discard any data that
| isn't discernible to the human auditory system.
|
| If an audio signal survives a lossy compression
| algorithm, then there are two possibilities:
|
| - the compression algorithm should be improved, as it
| preserves some data that isn't important to the human ear
|
| - the signal is audible
| amatecha wrote:
| That's not exactly the aim of lossy compression. Its aim
| is to reduce data size while introducing as little
| discernible effect as realistically possible. That
| usually means optimizing the algorithm such that most of
| the loss is indiscernible to us, such as in the darkest
| regions of an image, or the extremely high frequencies in
| audio -- both areas we don't perceive with too much
| granularity. Something like spread-spectrum phase
| distortion may survive compression just fine but still be
| indiscernible to us. The two are not mutually exclusive.
| AnthonyMouse wrote:
| Suppose you could encode 0.2 bits per second of watermark
| in a recording without humans being able to discern it.
| Suppose the compression algorithm did the same thing to
| encode additional information which is part of the
| recording, allowing it to achieve higher quality at the
| same bitrate or the same quality at a lower bitrate.
|
| It's information theory. Either you can encode additional
| information without impacting the result, in which case
| the compression algorithm could use it to be more
| efficient, or you can't. TANSTAAFL.
| wizzwizz4 wrote:
| And complexity theory says we can't reach the information
| theoretical limits with generic algorithms.
| wubrr wrote:
| Couldn't they apply the watermark post-compression? But yeah,
| eventually people will figure out how the watermark works and
| be able to remove it.
| TerrifiedMouse wrote:
| Frankly, if it can be detected, it can be removed. Makes
| sense no?
| FartyMcFarter wrote:
| Think about watermarks in images. They can be removed, but
| it isn't easy to do so without making it obvious that the
| image is damaged / doctored.
|
| I imagine this is also the case for audio watermarks, but
| I'm not sure what the current state of the art is for
| watermark removal.
| greatpatton wrote:
| You can easily train a model to clean watermarks from
| images.
| AnthonyMouse wrote:
| > They can be removed, but it isn't easy to do so without
| making it obvious that the image is damaged / doctored.
|
| This is primarily because those watermarks damage the
| image. The information about the content of the image
| beneath the watermark is missing (replaced with the
| watermark) and would have to be extrapolated or
| fabricated to replace it.
|
| If the watermark isn't visible as part of the image then
| there is no missing piece to have to replace.
| KineticLensman wrote:
| Modern content-aware fill can very impressively replace
| dead areas of an image. E.g. in Photoshop 2024
| breakfastduck wrote:
| What? And work with a compressed format? No audio engineer
| would do that.
| crazygringo wrote:
| They'll survive fine.
|
| The watermark is applied to the spectrogram. Each "pixel" of
| the watermark represents a small band of frequency in a small
| time slice. The watermark presumably does something like
| increase the volume by 1% in the watermarked "pixels" --
| basically imperceptibly adjusting EQ by tiny amounts in tiny
| places.
|
| That will survive lossy compression just fine -- lossy
| compression applies a low pass filter and then removes whole
| spectrogram chunks where the signal is below a threshold.
|
| But a watermark will still be entirely detectable in all the
| chunks that remain.
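|
| A minimal sketch of that idea (nothing to do with Google's
| actual SynthID scheme; the STFT size, 5% mask density and 1%
| boost are made-up parameters, and detection assumes the clip is
| the same length it was marked at): nudge a keyed pseudorandom
| set of spectrogram "pixels", then detect by comparing the
| average magnitude inside the keyed set against the rest.
|
|     import numpy as np
|     from scipy.signal import stft, istft
|
|     def embed(audio, sr, key=1234, strength=1.01):
|         # Spectrogram "pixels": rows are frequency bands, columns time slices.
|         _, _, Z = stft(audio, fs=sr, nperseg=1024)
|         rng = np.random.default_rng(key)
|         mask = rng.random(Z.shape) < 0.05        # keyed pseudorandom bin choice
|         Z = np.where(mask, Z * strength, Z)      # ~1% boost: a tiny EQ tweak
|         _, out = istft(Z, fs=sr, nperseg=1024)
|         return out[:len(audio)]
|
|     def detect(audio, sr, key=1234):
|         # Re-derive the keyed mask; marked audio has slightly louder bins in it.
|         _, _, Z = stft(audio, fs=sr, nperseg=1024)
|         rng = np.random.default_rng(key)
|         mask = rng.random(Z.shape) < 0.05
|         mag = np.abs(Z)
|         return mag[mask].mean() / mag[~mask].mean()
|
|     sr = 44_100
|     clip = np.random.default_rng(0).normal(size=10 * sr)  # stand-in for audio
|     marked = embed(clip, sr)
|     print(detect(marked, sr), detect(clip, sr))  # drifts above 1.0 vs. ~1.0
|
| Because codecs drop whole low-energy chunks rather than
| reshaping the relative levels of the chunks they keep, a
| statistic like this tends to survive re-encoding.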
| neuralRiot wrote:
| How about signal (analog) dynamic compression? If the WM is
| embedded in the audio data then it won't survive
| DA > compress > AD.
| crazygringo wrote:
| Can you clarify? A DAC doesn't perform any compression.
| What dynamic compression are you referring to?
|
| If you mean dynamic _range_ compression, that's something
| done as part of mastering. It has nothing to do with the DAC.
| neuralRiot wrote:
| Not the DAC, but a DSP can do it. So the chain would be
| DAC > DSP (compression) > ADC. What I'm trying to say is that
| the "watermark" would theoretically survive data compression
| (WAV > MP3, FLAC, AAC) but not dynamic range compression.
| dist-epoch wrote:
| Compression tends to discard high-frequency signals since they
| use the most bits.
|
| If you slightly alter the low frequency signals, they will be
| preserved. 30.1 Hz -> 30.2 Hz.
| TerrifiedMouse wrote:
| > Compression tends to discard high-frequency signals since
| they use the most bits.
|
| Quite certain that's not true. We drop everything past 20 kHz
| simply because it's outside the average person's hearing
| range.
|
| We drop everything under 20 Hz too - we also cannot hear below
| that.
| xyproto wrote:
| Yet another reason to use local LLMs.
| system2 wrote:
| Audio, maybe, but visuals require so much processing power and
| a heavy image database. I am impatiently waiting for a GPT-4
| rival that runs locally.
| xyproto wrote:
| It depends on what you are planning to do.
|
| Stable diffusion works great locally on both the dell+linux
| and the mac laptop I tried it on. This one was easy to use
| while exploring models from huggingface:
| https://diffusionbee.com/
|
| DALL-E still gives better results, though.
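|
| If anyone wants the non-GUI route, the usual local setup is the
| Hugging Face diffusers pipeline; a rough sketch (the checkpoint
| name is just an example, swap in whatever model you pulled from
| the hub):
|
|     import torch
|     from diffusers import StableDiffusionPipeline
|
|     # Example checkpoint; other Stable Diffusion models work the same way.
|     pipe = StableDiffusionPipeline.from_pretrained(
|         "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
|     pipe = pipe.to("cuda")  # "mps" on Apple Silicon; drop float16 for CPU
|
|     image = pipe("a cat hidden in a spectrogram").images[0]
|     image.save("out.png")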
| snickerbockers wrote:
| >Watermarking tools like SynthID are seen as an important
| safeguard against some of the harms of generative AI.
|
| Except the lack of a watermark doesn't necessarily exclude the
| possibility of AI generation.
| davidmurdoch wrote:
| And I'm sure the opposite will happen as well: people who want
| to pass off human work as AI-generated by adding watermarks.
| peddling-brink wrote:
| Why? Just to muddy the waters?
| doomroot wrote:
| To sell a product. "Look at what our tool can produce."
| solarkraft wrote:
| Possibly for plausible deniability. "I never said that,
| look, here's the AI watermark".
|
| It has even been argued in court that a picture isn't airtight
| evidence due to automatic processing, after all.
| morkalork wrote:
| It excludes them from being blamed for any fallout from an
| unsigned file though.
| LordShredda wrote:
| It's for copywrite
| LegibleCrimson wrote:
| Copyright. Copywrite means something else.
| amelius wrote:
| Probably easy to remove these watermarks with some autoencoder
| approach similar to:
|
| https://www.kaggle.com/code/therealcyberlord/watermark-remov...
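|
| A rough sketch of the same recipe transplanted to audio
| (assuming you can somehow get paired watermarked/clean clips to
| train on, which is the hard part): a small 1-D convolutional
| autoencoder trained to map marked audio back to clean audio.
|
|     import torch
|     import torch.nn as nn
|
|     class DewatermarkAE(nn.Module):
|         def __init__(self):
|             super().__init__()
|             self.enc = nn.Sequential(
|                 nn.Conv1d(1, 16, 31, stride=2, padding=15), nn.ReLU(),
|                 nn.Conv1d(16, 32, 31, stride=2, padding=15), nn.ReLU())
|             self.dec = nn.Sequential(
|                 nn.ConvTranspose1d(32, 16, 32, stride=2, padding=15), nn.ReLU(),
|                 nn.ConvTranspose1d(16, 1, 32, stride=2, padding=15))
|
|         def forward(self, x):  # x: (batch, 1, samples)
|             return self.dec(self.enc(x))
|
|     model = DewatermarkAE()
|     opt = torch.optim.Adam(model.parameters(), lr=1e-3)
|     loss_fn = nn.MSELoss()
|
|     # One training step: push the model's output on a watermarked clip
|     # towards the corresponding clean clip (placeholder tensors here).
|     watermarked = torch.randn(8, 1, 16384)
|     clean = torch.randn(8, 1, 16384)
|     opt.zero_grad()
|     loss = loss_fn(model(watermarked), clean)
|     loss.backward()
|     opt.step()
|
| Whether this actually strips a well-designed mark is another
| question, as the reply below points out.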
| vletal wrote:
| The example in the link you posted is equivalent to quietly
| whispering "GENERATED BY GOOGLE" every 15 seconds over the
| generated track.
|
| Regenerating audio is similar to painting a fake painting: it
| is the strokes that reveal the original author. But in this
| case you want to claim the piece for yourself - stripping the
| signature elements of the author while still preserving the
| original quality.
|
| So make sure the ID isn't more nuanced and distributed than
| you might think :) Otherwise your auto-encoder might reproduce
| it as well.
| Regnore wrote:
| Watermarks can be helpful, but I believe that provenance via
| digital signatures is ultimately a better solution. Curious why
| Google doesn't join the CAI (https://contentauthenticity.org/)
| and use their approach for provenance of Google's generated audio
| files.
| Waterluvian wrote:
| I walked through the website and I admit, I still have no clue
| how CAI technically implements this concept. How does it work
| for audio media?
| Regnore wrote:
| At a high level the idea is media is digitally signed by
| whomever produces it to prove provenance - similar to HTTPS.
|
| Depending on how much time/interest you have,
| https://c2pa.org/ has resources which explain more of the
| technical details of the process.
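|
| Stripped of the manifest/certificate machinery, the primitive
| C2PA builds on is just a detached signature over the media
| bytes; a minimal sketch with an Ed25519 key (not the actual
| C2PA format):
|
|     from cryptography.hazmat.primitives.asymmetric.ed25519 import (
|         Ed25519PrivateKey)
|
|     private_key = Ed25519PrivateKey.generate()  # held by the producer
|     public_key = private_key.public_key()       # published for verification
|
|     media = b"...the generated audio bytes..."  # placeholder file contents
|     signature = private_key.sign(media)         # shipped alongside the file
|
|     # Raises InvalidSignature if the bytes or the signature were altered.
|     public_key.verify(signature, media)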
| AnthonyMouse wrote:
| How is this supposed to do any good?
|
| Bob produces something with AI but claims he produced it
| himself and signs it with his private key.
|
| AI produces something and signs it or doesn't, but if it's
| signed you can just throw the signature away and either
| publish it as unsigned or sign it again with a different
| key.
|
| Signatures allow Alice to verify that something is signed
| by someone who has Bob's private key. If only Bob has Bob's
| private key, that means it was signed by Bob. It doesn't
| tell you whether it was generated by AI or not if Bob
| doesn't want you to know, because Bob can sign whatever he
| wants with his private key.
|
| In this case "Bob" is presumably supposed to be some camera
| with DRM, but that means it will be in the physical control
| of attackers and anybody who can crack any camera by any
| manufacturer can extract the private key and use it to sign
| whatever they want, which is inevitably going to happen.
| Keys will be available for sale to anyone who wants one and
| doesn't have the technical acumen to extract one
| themselves. Since that makes the whole system worthless,
| what's the point?
| Regnore wrote:
| > Bob produces something with AI but claims he produced
| it himself and signs it with his private key. ... because
| Bob can sign whatever he wants with his private key.
|
| Whether or not to trust Bob is an entirely different
| problem space than being able to prove an image came from
| Bob. In most scenarios Bob would be a "trustworthy news
| source" who cares about their reputation. The important
| piece here is that if someone shares something on e.g.
| twitter and says Bob produced it, that claim can be
| verified.
|
| > crack any camera by any manufacturer can extract the
| private key and use it to sign whatever they want, which
| is inevitably going to happen ... Since that makes the
| whole system worthless, what's the point?
|
| Think about what happens today when a private key is
| leaked - that key is no longer trusted. Will it be such a
| large-scale problem that the keys are leaked the day any
| camera is released? Maybe. Even in that scenario, though, we
| end up in the same spot as today, except with the additional
| benefit of being able to verify stuff coming from
| NPR/CNN/your preferred news source that is shared on third
| party platforms.
| zoklet-enjoyer wrote:
| Use it for automated composition and then recreate it in a DAW
| arberx wrote:
| NFTs are a great use case for this.
| salt-thrower wrote:
| Please elaborate.
| elif wrote:
| Sure, it survives compression, added noise, etc., but surely it
| cannot survive, say, 100 bogus SynthID-formatted inaudible
| watermarks being added on top?
| LegibleCrimson wrote:
| I'm assuming that doing that would result in audible changes.
| Like lossy but transparent encoding, repeated application often
| loses transparency quickly.
| dist-epoch wrote:
| It should. That's how GPS signals work: you receive 20 of them
| superimposed below the noise floor, and you use math and known
| pseudo-random-generator seeds to separate them.
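|
| A toy version of the principle (not GPS itself; the 100k chips,
| 0.05 amplitude and unit noise are arbitrary numbers): a +/-1
| chip sequence derived from a known seed is buried well below
| the noise floor, and correlating against the same sequence
| pulls the payload bit back out.
|
|     import numpy as np
|
|     rng = np.random.default_rng(42)                 # the "known seed"
|     chips = rng.choice([-1.0, 1.0], size=100_000)   # pseudo-random spreading code
|
|     bit = +1                                        # one payload bit
|     signal = 0.05 * bit * chips                     # far below the noise,
|     noise = rng.normal(0.0, 1.0, size=chips.size)   # which has unit amplitude
|     received = signal + noise
|
|     # Correlating with the same chips averages the noise away;
|     # the sign of the result recovers the bit.
|     correlation = received @ chips / chips.size
|     print(correlation)   # ~ +0.05, so the bit decodes as +1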
| ChuckMcM wrote:
| Interesting take. Folks who have played around with digital
| signal processing (DSP) and low-bandwidth, noise-tolerant
| digital signals can probably come up with a half dozen
| different ways to do this and have it both survive all encoding
| methodologies and all compression/decompression schemes (at
| least ones that strive to keep the audio "nearly identical" to
| the human ear).
|
| It would not survive if you had an analysis package that could
| back-compute the terms used by the model to generate the track
| and you re-generated the track with your own post-processing,
| but that is a lot of work that most people would avoid.
|
| It would be hilarious if someone built a model that you could
| query to tell you whether there was subaudible information in
| the track and, if so, remove it (automating the above step as
| an AI task).
|
| There have been projects that put subaudible tones in audio to
| trigger toys, so when you're watching a cartoon your toy would
| respond to the cartoon. Some people may, or may not, have
| considered doing something like that for muzak at businesses so
| that your phone OS could report back that you had entered the
| business within a certain window of looking at an ad for that
| business.
| fsckboy wrote:
| When you say (paraphrasing) "the inaudible will be preserved
| through lossy compression designed around knowledge of human
| hearing", you're essentially saying that "undefined behavior
| will be preserved through optimization", and that is a boldly
| optimistic claim.
| ChuckMcM wrote:
| I think you and I may be talking about different things. If
| you know what phase noise is and understand the term -10 dBc
| @ 1kHz, then we might be on the same page. If you don't,
| consider the following:
|
| There are tones, things you can hear typically in the range
| of 15 Hz to 22 kHz. When those tones vary in frequency over
| time by a few Hertz, it is unlikely that you can perceive
| that variation. Your ear is both logarithmic in its
| sensitivity to volume and not particularly sensitive to
| absolute frequency. If you shift an entire spectrum "up" a
| couple of Hz your ear won't notice, and similarly if you
| shift it "down" a couple of Hz you won't notice. This is
| especially true if you do it slowly (over the course of 100
| milliseconds or so). Tape decks and turntables do this, they
| try to minimize it though. The specification that tells you
| how closely they track rotational speed is "Wow and Flutter"
| and a good number is < 0.3%, a studio recorder might be less
| than 0.1%.
|
| With DSP you can easily pull out a spectrum shift of < 0.05%
| if it is 'regular'. That shift can be a "frequency shift
| keying" (FSK) signal with forward error correction in it.
| (think shifted 'up' as one (1) bits and shifted 'down' as
| zero (0) bits). Running the clip through an FFT and
| monitoring phase shift in the bins would recover this string
| of bits. And even if you didn't get enough to error correct
| the original message, their presence would be unmistakable.
|
| Existing audio compression schemes do not affect the phase
| relationships of the spectrum they are compressing. That's a
| design feature.
|
| To be fair, I wouldn't understand any of this if I hadn't
| been delving into software defined radio and learning DSP
| techniques for modulation and demodulation with data recovery
| in the presence of signal distortions and interference. And I
| completely understand that there is a language challenge when
| the terms 'audibly imperceptible' and 'inaudible' are treated
| as synonyms.
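|
| A toy illustration of recovering such a shift (purely
| illustrative numbers - the 1 kHz tone, 0.05% shift, frame size
| and hop are assumptions, not anyone's actual scheme): the phase
| advance of the dominant bin between two overlapping FFT frames
| resolves frequency far more finely than the bin spacing.
|
|     import numpy as np
|
|     sr = 44_100
|     f0 = 1_000.0                      # nominal tone frequency
|     shift = 1.0005                    # +0.05% encodes a "1" bit, 0.9995 a "0"
|     t = np.arange(sr) / sr            # one second of audio
|     x = np.sin(2 * np.pi * f0 * shift * t)
|
|     n, hop = 4096, 1024
|     frame0 = np.fft.rfft(x[:n])
|     frame1 = np.fft.rfft(x[hop:hop + n])
|     k = np.argmax(np.abs(frame0))     # dominant FFT bin
|
|     # Phase advance between frames, unwrapped around the advance
|     # expected for the nominal frequency f0.
|     expected = 2 * np.pi * f0 * hop / sr
|     dphi = np.angle(frame1[k]) - np.angle(frame0[k])
|     dphi -= 2 * np.pi * np.round((dphi - expected) / (2 * np.pi))
|     measured_f = dphi * sr / (2 * np.pi * hop)
|     print(measured_f)                 # ~1000.5 Hz, so this symbol is a "1"
|
| The raw bin spacing at that frame size is about 10.8 Hz, yet
| the phase method pulls out the 0.5 Hz offset; decoding the FSK
| stream is then just thresholding that measurement per symbol
| period.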
| akomtu wrote:
| Even an audio analysis tool won't reveal much if the watermark
| is sophisticated enough. I believe a basic watermark adds a
| few spots on the spectrogram in a known arrangement, a bit like
| a constellation in the sky, but a watermark could also be based
| on the Bloom filter idea: hundreds of barely visible spots
| arranged with a hash function will look like noise.
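|
| One way to scatter the spots so they look like noise (a sketch
| of the idea, not any real scheme; the counts are arbitrary):
| derive the (frame, bin) positions from a keyed hash, so only
| the key holder knows where to look.
|
|     import hashlib
|
|     def spot_positions(key: bytes, n_spots: int, n_frames: int, n_bins: int):
|         spots = []
|         for i in range(n_spots):
|             digest = hashlib.sha256(key + i.to_bytes(4, "big")).digest()
|             frame = int.from_bytes(digest[:4], "big") % n_frames
|             freq_bin = int.from_bytes(digest[4:8], "big") % n_bins
|             spots.append((frame, freq_bin))
|         return spots
|
|     # e.g. 200 spots scattered over a 2000-frame, 1024-bin spectrogram
|     print(spot_positions(b"secret-key", 200, 2000, 1024)[:5])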
| kats wrote:
| As soon as ML-generated voice becomes really good, call
| scammers will use it against elderly people. That's going to be
| the #1 application, far and away beyond anything else. And as
| soon as it's out, a whole bunch of extremely privileged ML
| people will throw their hands up and say, "oh well, cat's out
| of the bag."
| jncfhnb wrote:
| Voice cloning of grandkids is a very targeted attack. You can
| achieve the same results today at a similar cost if you
| already have specific info like that
| two_in_one wrote:
| Nothing new; generated voices are already used for scams. So
| far they are recordings, like a "daughter" screaming for help
| or a "boss" ordering a money transfer. But it's possible there
| will be real-time generators soon, which means the scammer will
| be talking and adjusting depending on the reaction of the
| victim. There will still be some delay, as the converter will
| have some lag - unless the "scammer" is actually AI. Then it
| may sound natural, especially to unprepared victims. All an
| attacker needs to mimic a voice today is just 3 seconds of
| recording.
| ChuckMcM wrote:
| I don't disagree with this sentiment. But I also think it
| will really change help desk / call center work as well. If
| you can make the person on the phone have the same accent and
| language as the caller, it makes the call easier to understand
| and more "comfortable" psychologically. Speech-to-text plus
| translation could turn the caller's question into a text query
| in the native language of the agent, and the agent's typed
| response could go back through translation plus text-to-speech.
|
| That makes the call center easier for both customer and agent,
| and it expands the pool of people who could be agents, thus
| increasing competition and cutting the cost of delivering call
| center service.
| ifeja wrote:
| Does this mean the watermark can be removed if you strip out all
| inaudible content? I expect not
| vletal wrote:
| No. The watermark is inaudible.
|
| Btw, audio compression is already doing most of the "stripping
| the inaudible" stuff - why waste bits on something which is
| unlikely to be picked up by the listener? It is done by
| assuming a model of human hearing.
| dist-epoch wrote:
| On one hand people say stuff like "all AI generated content
| should be clearly marked".
|
| But when you mark it they say "why u do that?"
|
| You just can't win.
| AnthonyMouse wrote:
| The "clearly marked" thing seems to be that people are
| concerned about AI and marking it is a way to satisfy demand
| for Something Must Be Done. The fact that it won't work is kind
| of irrelevant because "working" means occupying the people who
| want to impose bad rules with something to argue about.
|
| But then other people look at what they're proposing and say,
| "hey, you know this is a farce, right?"
___________________________________________________________________
(page generated 2023-11-18 23:01 UTC)