The presence of, and hype around, generative AI across most forms of recorded media has become undeniable. Generative AI tools are increasingly prevalent, with applications ranging from conversational chatbots to text-to-image generation. More recently, we have witnessed an influx of generative audio models with the potential to disrupt how music is created in the very near future. In this talk, we will highlight some of the core technologies that enable novel audio content creation for music production, reviewing several seminal text-to-music works from the past year. We will then delve deeper into common research themes and subsequent works that aim to map these technologies more closely to musicians’ needs.


We will begin the talk by outlining a common framework underlying the generative audio models under discussion, consisting of an audio synthesizer “back-end” paired with a latent representation modeling “front-end.” Accordingly, we will give an overview of the two primary forms of back-end, neural audio codecs and variational auto-encoders (with examples), and illustrate how they pair naturally with transformer language model (LM) and latent diffusion model (LDM) front-ends, respectively. Furthermore, we will briefly touch on CLAP and T5 embeddings as conditioning signals that enable text as an input interface, and explain how they are integrated into modern text-to-audio systems.
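To make the framework concrete, here is a minimal, hypothetical sketch of the LM-family pipeline, assuming PyTorch and Hugging Face transformers are available. The TokenLM class below is a toy stand-in for a MusicGen-style front-end, not any real system; a real pipeline would decode the resulting tokens to audio with a neural codec such as EnCodec.

```python
# Minimal, hypothetical sketch of the "codec back-end + LM front-end" pattern.
# Assumptions: PyTorch and Hugging Face `transformers` are installed; TokenLM
# is a toy stand-in for a MusicGen-style front-end, not a real system.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

# Conditioning: T5 text embeddings serve as the text input interface.
tok = AutoTokenizer.from_pretrained("t5-small")
t5 = T5EncoderModel.from_pretrained("t5-small")
prompt = tok("a warm lo-fi piano loop", return_tensors="pt")
text_emb = t5(**prompt).last_hidden_state            # (1, text_len, 512)

class TokenLM(nn.Module):
    """Toy autoregressive front-end: predicts the next discrete audio
    token while cross-attending to the text embedding."""
    def __init__(self, vocab=1024, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, cond):
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.decoder(self.embed(tokens), cond, tgt_mask=mask)
        return self.head(x)

lm = TokenLM()
tokens = torch.zeros(1, 1, dtype=torch.long)         # start token
for _ in range(100):                                  # greedy decoding, for brevity
    logits = lm(tokens, text_emb)
    nxt = logits[:, -1].argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, nxt], dim=1)

# Back-end: a neural audio codec (e.g. EnCodec) would decode `tokens`
# into a waveform; the interface is discrete tokens in, audio out.
```

In the LDM family the structure is analogous, with the discrete token sequence replaced by a continuous VAE latent and the autoregressive LM replaced by an iteratively denoising diffusion model.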

Next, we will review some seminal works released within the past year or so (primarily in text-to-music generation) and roughly categorize them according to the framework built up thus far. At the time of writing this proposal, we would naturally consider MusicLM/FX (LM), MusicGen (LM), Stable Audio (LDM), etc. as exemplary candidates for review. We will contextualize these new capabilities in terms of what they can enable for music production, and identify opportunities for future improvement. Accordingly, we will draw on subsequent works that intend to meet musicians a bit closer to the creative process; at the time of writing, these may include, but are not limited to, ControlNet (LDM), SingSong (LM), StemGen (LM), VampNet (LM), as well as our own previous work, as time permits. We will cap off the talk by offering some perspectives on what AI researchers could stand to understand about music creators, and what musicians could stand to understand about scientific research. Time permitting, we may conduct a live coding demonstration in which we construct, train, and generate audio examples from a generative audio model on a toy dataset, leveraging several prevalent open source libraries.
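For a rough, hypothetical flavor of what such a demonstration might involve, the sketch below trains a tiny next-token model on stand-in “codec token” sequences with plain PyTorch. All names, shapes, and hyperparameters are illustrative assumptions, not the actual demo.

```python
# Hypothetical flavor of the live-coding portion: train a tiny next-token
# model on stand-in "codec token" sequences with plain PyTorch. All shapes
# and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, seq_len = 256, 128, 64
data = torch.randint(0, vocab, (512, seq_len))   # toy stand-in for codec tokens

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):                          # short toy training loop
    batch = data[torch.randint(0, len(data), (32,))]
    logits = model(batch[:, :-1])                # teacher-forced next-token prediction
    loss = F.cross_entropy(logits.reshape(-1, vocab), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference would sample tokens autoregressively and hand them to a codec
# decoder to obtain audio; omitted here for brevity.
```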


We hope that such a talk would be both accessible and fruitful for technologists and musicians alike. It would assume no background knowledge in generative modeling, and perhaps only the most basic notion of how machine learning works. The goal of this talk would be for the audience at large to walk away with a rough understanding of the underlying technologies and challenges associated with novel audio content creation using generative AI.
Tuesday October 8, 2024 10:45am - 11:45am EDT
1E08
