Presentation in Session: Creative AI Performances 1
Emergent Rhythm — Real-time AI Generative DJ Set
“Emergent Rhythm” is an audio-visual DJ performance built on real-time AI audio generation. Artist/DJ Tokui manipulates multiple models on stage to spontaneously generate rhythms and melodies, then combines and mixes the generated audio loops into musical developments. We employ AI audio synthesis models in real time and face an unprecedented challenge: everything heard during this performance is purely AI-generated sound.
As the title suggests, we focus on the musical and visual "rhythms" and recurring patterns that emerge in the interaction between multiple AI models and the artist. The accompanying visuals feature not only periodicity over time but also patterns common across multiple scales, ranging from the vast scale of the universe to the minute scale of cellular and atomic structures.
Aligning with the visual theme, we extracted loops from natural and man-made environmental sounds and used them as training data for audio generation. We also employ real-time timbre transfer that converts incoming audio into various singing voices, such as Buddhist chants, highlighting the diversity and commonality within human cultural heritage.
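As a rough illustration of what real-time timbre transfer implies at the code level, the sketch below streams incoming audio through a pretrained transfer model block by block. `TimbreTransferModel` and its checkpoint name are hypothetical placeholders standing in for whichever voice model is loaded; this is a minimal sketch of the control flow under those assumptions, not the performance's actual implementation.

```python
import sounddevice as sd
import torch

SAMPLE_RATE = 16_000
BLOCK = 1024  # samples per block; smaller blocks lower the latency

# Hypothetical wrapper around a pretrained voice model (e.g. a chant-like
# singing voice); the class and checkpoint name are illustrative only.
model = TimbreTransferModel.load("chant_voice.pt").eval()

def callback(indata, outdata, frames, time, status):
    """Convert each incoming audio block into the target singing voice."""
    if status:
        print(status)
    with torch.no_grad():
        x = torch.from_numpy(indata[:, 0]).float().unsqueeze(0)  # (1, frames)
        y = model(x)                                             # same-length output assumed
    outdata[:, 0] = y.squeeze(0).numpy()[:frames]

# Full-duplex stream: mixer/microphone input in, transformed voice out.
with sd.Stream(samplerate=SAMPLE_RATE, blocksize=BLOCK, channels=1,
               dtype="float32", callback=callback):
    sd.sleep(60_000)  # keep the stream open for one minute
```

A stateless block-by-block transform like this would introduce boundary artifacts; a production setup would use a stateful streaming model or overlap-add, but the real-time control flow stays the same.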
We adapted the GAN (Generative Adversarial Network) architecture for audio synthesis. StyleGAN models trained on spectrograms of various sound materials generate spectrograms, and a vocoder GAN (MelGAN) converts them into audio waveforms. By leveraging GAN-based architectures, we can generate novel, constantly changing, morphing sounds, analogous to GAN-generated animated faces of people who don’t exist. Generating a batch of 4-second, 2-bar loops takes about 0.5 seconds, hence faster than real time. We also implemented GANSpace, proposed by Härkönen et al., to provide perceptual controls during the performance. GANSpace applies Principal Component Analysis (PCA) to the style vectors of a trained StyleGAN model to find perceptually meaningful directions in the latent style space. Adding offsets along these directions lets the DJ steer the audio generation in the desired direction.
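The following is a minimal sketch, in PyTorch, of how such a generation step could be wired together. `SpectrogramStyleGAN`, `MelGANVocoder`, the checkpoint names, and the tensor shapes are assumptions for illustration, not the actual models used in the performance; the sketch only shows the flow described above: sample a style vector, nudge it along precomputed GANSpace principal directions, decode a mel spectrogram, and vocode it into a 2-bar audio loop.

```python
import torch

# Hypothetical wrappers around the trained models; class names, checkpoints,
# and shapes are illustrative, not the performance's actual code.
generator = SpectrogramStyleGAN.load("stylegan_env_sounds.pt")  # w -> mel spectrogram
vocoder = MelGANVocoder.load("melgan_vocoder.pt")               # mel spectrogram -> waveform

# GANSpace: PCA over many sampled style vectors yields perceptually
# meaningful directions in the latent style space.
with torch.no_grad():
    z = torch.randn(10_000, generator.z_dim)
    w = generator.mapping(z)                      # style vectors, shape (10000, w_dim)
    _, _, v = torch.pca_lowrank(w, q=16)          # columns of v: principal directions

def generate_loop(pc_offsets):
    """Generate one 2-bar loop, steered by per-component offsets
    (e.g. {0: +2.0, 3: -1.5}) chosen from the DJ's controller."""
    with torch.no_grad():
        w_new = generator.mapping(torch.randn(1, generator.z_dim))
        for i, amount in pc_offsets.items():
            w_new = w_new + amount * v[:, i]      # move along GANSpace direction i
        mel = generator.synthesis(w_new)          # e.g. (1, 128, n_frames)
        audio = vocoder(mel)                      # (1, n_samples) waveform
    return audio.squeeze(0).cpu().numpy()

# Because a loop takes ~0.5 s to generate, the next loop can be prepared
# while the current one is still playing.
loop = generate_loop({0: 2.0, 3: -1.5})
```

In the full GANSpace method the edits can also be restricted to specific StyleGAN layers; the sketch applies them globally for brevity.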
Where a DJ session selects and mixes existing songs, this live performance generates songs spontaneously and develops them in response to the audience's reactions. The human DJ is thus expected to become an AJ, or "AI Jockey," rather than a “Disc Jockey,” taming and riding the AI-generated audio stream in real time. With the unique morphing sounds created by AI and the new degrees of freedom that AI allows, the AI Jockey will offer audiences a unique, even otherworldly sonic experience.