Hidden Echoes Survive Training in Audio To Audio Generative Instrument Models

2025 AAAI Workshop on Artificial Intelligence for Music

Christopher J. Tralie, Matt Amery, Benjamin Douglas, Ian Utz

Trained RAVE, Dance Diffusion, and DDSP Models

Abstract

As generative techniques pervade the audio domain, there has been increasing interest in tracing back through these complicated models to understand how they draw on their training data to synthesize new examples, both to ensure that they use properly licensed data and to elucidate their black-box behavior. In this paper, we show that if imperceptible echoes are hidden in the training data, a wide variety of audio-to-audio architectures (differentiable digital signal processing (DDSP), Realtime Audio Variational autoEncoder (RAVE), and "Dance Diffusion") will reproduce these echoes in their outputs. Hiding a single echo is particularly robust across all architectures, but we also show promising results hiding longer time-spread echo patterns for increased information capacity. We conclude by showing that echoes make their way into fine-tuned models, that they survive mixing/demixing, and that they survive pitch-shift augmentation during training. Hence, this simple, classical idea in watermarking shows significant promise for tagging generative audio models.

Supplementary Material

Below are a few style transfer examples from some of the models we trained. Drum models were trained on the Groove dataset, guitar models on GuitarSet, and vocal models on VocalSet. For models with echoes or echo patterns, the same echo pattern was embedded in every file of the training set before training. See our paper for statistics on much more extensive experiments.
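To make the embedding step concrete, here is a minimal sketch of adding a single echo to a training clip. It reads labels such as "50 Echo" as an echo at a lag of 50 samples, which is an assumption. The function name `add_echo` and the strength `alpha` are also illustrative choices, not values taken from the paper.

```python
import numpy as np

def add_echo(x, lag, alpha=0.01):
    """Return y[n] = x[n] + alpha * x[n - lag], with the same length as x.

    lag   : echo delay in samples (assumed meaning of "50 Echo", "75 Echo", ...)
    alpha : echo strength (hypothetical default; keep it small so the echo
            stays hard to hear)
    """
    x = np.asarray(x, dtype=float)
    if not 1 <= lag < len(x):
        raise ValueError("lag must be between 1 and len(x) - 1")
    y = x.copy()
    # Add the delayed, scaled copy of the signal on top of the original.
    y[lag:] += alpha * x[:-lag]
    return y
```

Under these assumptions, a training set would be prepared by applying `add_echo` with the same `lag` to every file before training.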

Notes:

Each example is a single sample from the respective model. Every style transfer involves some randomization, so results will vary even under the same conditions.
The models were not optimized for audio quality; they are meant as a proof of concept for audio-to-audio style transfer.
All Dance Diffusion results presented here use a noise factor η=0.2. Higher noise factors produce outputs that sound less like the input audio and are more random.

Input Sample

A 30-second clip from Prince - Loring Park Sessions '77

Dance Diffusion Style Transfers: Drums

Clean

50 Echo

75 Echo

100 Echo

Dance Diffusion Style Transfers: Acoustic Guitar

Clean

50 Echo

75 Echo

100 Echo

Dance Diffusion Style Transfers: Vocals

Clean

50 Echo

75 Echo

100 Echo

RAVE Style Transfers: Drums

Clean

50 Echo

75 Echo

100 Echo

RAVE Style Transfers: Acoustic Guitar

Clean

50 Echo

75 Echo

100 Echo

RAVE Style Transfers: Acoustic Guitar, Pseudorandom Echo Pattern @ 75 Lag

Hidden Pseudorandom Pattern

1010101001001010110100100101010110011001001010110101101001010010100110101101010110010010100101001010110101010101010010010010101010101010110101001100101010010101010101001001010010101010101011001010010010010110101011010101010101101001011001011010100101010110110011011001011010110101100110101101001010101011001011010101010101010011010110100110101101001001101010101010110100101001101010101010011010101101011010010100110010101010110011010100101010110101010101010101010011010110010101001010100101010101101010110010101010101011010100110101011010110010101010101101001010101011001010101010010101010101011010101010101010110100100110100101010110010011010100101010110110010011010101011010101011010100110101101010100110011010010101101010110101010101001101010101010101010011011010101010101011010011010101010011001011010101101010101010110101101101101101001011011010101101100101101010101010101010101101001101010010101101011010100100110101101010101010101010100110101010101010101011001011010101101010100101010101101010101010101001010010100101
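As a rough sketch of how a bit pattern like the one above could be hidden as a time-spread echo, the code below builds a filter kernel with a unit impulse at lag 0 followed by the pattern starting at the stated lag. The mapping of bits 0/1 to chips -1/+1, the strength `alpha`, and the names `pattern_kernel` and `add_echo_pattern` are all assumptions made for illustration; the paper should be consulted for the actual construction.

```python
import numpy as np

def pattern_kernel(bits, lag, alpha=0.01):
    """Build h = delta[n] + alpha * p[n - lag], where p is the +/-1 chip sequence.

    bits  : string of '0'/'1' characters (assumed mapping: '1' -> +1, '0' -> -1)
    lag   : sample offset at which the pattern starts (e.g. 75)
    alpha : per-chip echo strength (hypothetical default)
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("bits must be a non-empty string of 0s and 1s")
    chips = np.array([1.0 if b == "1" else -1.0 for b in bits])
    h = np.zeros(lag + len(chips))
    h[0] = 1.0               # pass the original signal through unchanged
    h[lag:] = alpha * chips  # spread the echo over len(bits) samples
    return h

def add_echo_pattern(x, bits, lag, alpha=0.01):
    """Filter x with the pattern kernel and trim to the original length."""
    h = pattern_kernel(bits, lag, alpha)
    return np.convolve(np.asarray(x, dtype=float), h)[: len(x)]
```

For example, `add_echo_pattern(clip, bits, lag=75)` would embed the pattern in a single clip, where `clip` is a 1-D array of samples and `bits` is the string shown above.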

RAVE Style Transfers: Vocals

Clean

50 Echo

75 Echo

100 Echo

RAVE Demucs Results

Original Mixture (AM Contra - Heart Peripheral from musdb test set)

Mixing Drums Model with 50 Echo, Guitar Model with 75 Echo, and Vocals Model with 100 Echo

Cepstra on Demixed Tracks, Obtained from Demucs:
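As a rough illustration of what a cepstrum reveals here, the following sketch computes the real cepstrum of a track and reports which candidate lag has the largest cepstral value. A single echo at lag d adds a periodic ripple to the log-magnitude spectrum, which shows up as a peak near index d in the cepstrum. The function names, the `eps` floor, and the simple "largest value wins" rule are assumptions for illustration; the detection statistic used in the paper may differ.

```python
import numpy as np

def real_cepstrum(x, eps=1e-12):
    """Real cepstrum: inverse FFT of the log-magnitude spectrum of x."""
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x)
    log_mag = np.log(np.abs(spectrum) + eps)  # eps avoids log(0)
    return np.fft.irfft(log_mag, n=len(x))

def strongest_lag(x, candidates=(50, 75, 100)):
    """Return the candidate lag whose cepstral value is largest."""
    c = real_cepstrum(x)
    return max(candidates, key=lambda k: c[k])
```

Applied to each demixed stem, `strongest_lag` would be expected to point to 50 for drums, 75 for guitar, and 100 for vocals if the echoes survive mixing and demixing.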
