As generative techniques pervade the audio domain, there has been increasing interest in tracing back through these complicated models to understand how they draw on their training data to synthesize new examples, both to ensure that they use properly licensed data and to elucidate their black-box behavior. In this paper, we show that if imperceptible echoes are hidden in the training data, a wide variety of audio-to-audio architectures (differentiable digital signal processing (DDSP), Realtime Audio Variational autoEncoder (RAVE), and ``Dance Diffusion'') will reproduce these echoes in their outputs. Hiding a single echo is particularly robust across all architectures, but we also show promising results hiding longer time-spread echo patterns for an increased information capacity. We conclude by showing that echoes make their way into fine-tuned models, that they survive mixing/demixing, and that they survive pitch-shift augmentation during training. Hence, this simple, classical idea in watermarking shows significant promise for tagging generative audio models.
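For readers unfamiliar with echo hiding, its simplest form adds a faint, delayed copy of the signal back onto itself, y[n] = x[n] + α·x[n−d]. Below is a minimal sketch of that operation; the delay and amplitude values are illustrative only, not the settings used in our experiments.

```python
import numpy as np

def embed_single_echo(x, sr, delay_s=0.005, alpha=0.1):
    """Add one faint, delayed copy of the signal to itself:
    y[n] = x[n] + alpha * x[n - d].
    delay_s and alpha are illustrative values, not the paper's settings."""
    d = int(round(delay_s * sr))            # delay in samples
    y = x.astype(np.float64).copy()
    y[d:] += alpha * x[:len(x) - d]         # superimpose the delayed, attenuated copy
    # rescale only if adding the echo pushed the signal past full scale
    peak = np.max(np.abs(y))
    return y / peak if peak > 1.0 else y
```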
Supplementary Material
Below are a few examples of style transfer from a few of the models we trained. Drum models are trained on the Groove dataset, guitar models on GuitarSet, and vocal models on VocalSet. For models trained with echoes/echo patterns, the same echo pattern was embedded in each file of the training set before training. See our paper for statistics from much more extensive experiments.
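To check whether an embedded echo carries through to a model's output, the classical tool is cepstral analysis: an echo at delay d produces a peak in the real cepstrum at lag d. The sketch below is a generic detector of this kind, assuming a mono NumPy signal; it is not the exact statistic reported in the paper, and the search range is illustrative.

```python
import numpy as np

def detect_echo_delay(y, sr, min_delay_s=0.001, max_delay_s=0.02):
    """Estimate an embedded echo's delay from the real cepstrum.
    An echo at delay d adds a peak to the cepstrum at lag d;
    the search range here is illustrative, not the paper's setting."""
    spectrum = np.fft.rfft(y)
    log_mag = np.log(np.abs(spectrum) + 1e-12)      # avoid log(0)
    cepstrum = np.fft.irfft(log_mag, n=len(y))      # real cepstrum
    lo, hi = int(min_delay_s * sr), int(max_delay_s * sr)
    lag = lo + int(np.argmax(cepstrum[lo:hi]))
    return lag / sr, float(cepstrum[lag])           # (delay in seconds, peak height)
```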
Notes:
Each example is a single sample from the respective model; some randomization is present during every style transfer, so results will vary even under the same conditions.
The models were not optimized for audio quality; they are meant as a proof of concept for audio-to-audio style transfer.
All Dance Diffusion results presented here use a noise factor η = 0.2. Higher noise factors sound less like the input audio and produce more random results (the sketch after these notes illustrates the general idea).
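As a rough illustration of why the noise factor behaves this way, the hypothetical sketch below partially noises the input according to η and then denoises it; `denoise` stands in for a trained diffusion sampler, and the toy linear interpolation stands in for the model's actual noise schedule. This mirrors the general idea only, not Dance Diffusion's implementation.

```python
import torch

def diffusion_style_transfer(denoise, x, eta=0.2, total_steps=100):
    """Hypothetical sketch: partially noise the input according to eta,
    then denoise.  `denoise(x_noisy, start_step)` is a placeholder for a
    trained diffusion sampler; the linear mix below is a stand-in for the
    model's real noise schedule, not Dance Diffusion's exact code."""
    start_step = int(eta * total_steps)            # how much denoising to run
    noise = torch.randn_like(x)
    x_noisy = (1.0 - eta) * x + eta * noise        # eta=0 keeps the input, eta=1 is pure noise
    return denoise(x_noisy, start_step)
```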