AES Show 2024 NY: Full Schedule

Exhibits+ badges provide access to the ADAM Audio Immersive Room , the Genelec Immersive Room, Tech Tours, and the presentations on the Main Stage .

All Access badges provide access to all content in the Program (Tech Tours still require registration)

View the Exhibit Floor Plan.

2:00pm EDT

Fourier Paradoxes

Tuesday October 8, 2024 2:00pm - 2:30pm EDT

1E04

Fourier theory is quite ubiquitous in modern audio signal processing. However, this framework is often at odds with our intuitions behind audio signals. Strictly speaking, Fourier theory is ideal to analyze periodic behaviors but when periodicities change across time it is easy to misinterpret its results. Of course, we have developed strategies around it like the Short Time Fourier Transform, yet again many times our interpretations of it falls beyond what the theory really says. This paper pushes the exact theoretical description showing examples where our interpretation of the data is incorrect. Furthermore, it shows specific instances where we incorrectly take decisions based on such paradoxical framework.

Moderators

Rob Maher

Professor, Montana State University

Audio digital signal processing, audio forensics, music analysis and synthesis.

Speakers

Juan Sierra

NYU

Currently, I am a PhD Candidate in Music Technology at NYU and am currently based in NYUAD as part of the Global Fellowship program. As a professional musician, my expertise lies in Audio Engineering, and I hold a master's degree in Music, Science, and Technology from the prestigious... Read More →

Authors

Juan Sierra

NYU

Tuesday October 8, 2024 2:00pm - 2:30pm EDT
1E04

Signal Processing, Paper Lecture

2:30pm EDT

Nonlinear distortion in analog modeled DSP plugins in consequence of recording levels

Tuesday October 8, 2024 2:30pm - 3:00pm EDT

1E04

The nominal audio level is where developers of professional analog equipment design their units to have an optimal performance. Audio levels above the nominal level will at some point lead to increased harmonic distortion and eventually clipping. DSP plugins emulating such nonlinear behavior must – in the same manner as analog equipment – align to a nominal level that is simulated within the digital environment. A listening test was tailored to investigate if, or to which extent, misalignments in the audio levels compared to the simulated nominal level in analog-modelled DSP plugins are audible, thus affecting the outcome, depending on which level you choose to record at. The results of this study indicate that harmonic distortion in analog-modeled DSP plugins may become audible as the recording level increases. However, for the plugins included in this study, the immediate consequence of the harmonics added is not critical and, in most cases, not noticed by the listener.

Moderators

Rob Maher

Professor, Montana State University

Audio digital signal processing, audio forensics, music analysis and synthesis.

Speakers

Tore Teigland

Professor, Kristiania University College

Authors

Tore Teigland

Professor, Kristiania University College

Tuesday October 8, 2024 2:30pm - 3:00pm EDT
1E04

Signal Processing, Paper Lecture

3:00pm EDT

A Survey of Methods for the Discretization of Phonograph Record Playback Filters

Tuesday October 8, 2024 3:00pm - 3:30pm EDT

1E04

Since the inception of electrical recording for phonograph records in 1924, records have been intentionally cut with a non-uniform frequency response to maximize the information density on a disc and to improve the signal-to-noise ratio. To reproduce a nominally flat signal within the available bandwidth, the effects of this cutting curve must be undone by applying an inverse curve on playback. Until 1953, with the introduction of what has become known as the RIAA curve, the playback curve required for any particular disc could vary by record company and over time. As a consequence, anyone seeking to hear or restore the information on a disc must have access to equipment that is capable of implementing multiple playback equalizations. This correction may be accomplished with either analog hardware or digital processing. The digital approach has the advantages of reduced cost and expanded versatility, but requires a transformation from continuous time, where the original curves are defined, to discrete time. This transformation inevitably comes with some deviations from the continuous-time response near the Nyquist frequency. There are many established methods for discretizing continuous-time filters, and these vary in performance, computational cost, and inherent latency. In this work, several methods for performing this transformation are explored in the context of phonograph playback equalization, and the performance of each approach is quantified. This work is intended as a resource for anyone developing systems for digital playback equalization or similar applications that require approximating the response of a continuous-time filter digitally.

Moderators

Rob Maher

Professor, Montana State University

Audio digital signal processing, audio forensics, music analysis and synthesis.

Speakers

Benjamin Thompson

PhD Student, University of Rochester

Authors

Benjamin Thompson

PhD Student, University of Rochester

Jenna Rutowski

Michael Heilemann

Tre DiPassio

Tuesday October 8, 2024 3:00pm - 3:30pm EDT
1E04

Signal Processing, Paper Lecture

3:30pm EDT

Leveraging TSN Protocols to Support AES67: Achieving AVB Quality with Layer 3 Benefits

Tuesday October 8, 2024 3:30pm - 3:50pm EDT

1E04

This paper investigates using Time-Sensitive Networking (TSN) protocols, particularly from Audio Video Bridging (AVB), to support AES67 audio transport. By leveraging the IEEE 1588 Level 3 Precision Time Protocol (PTP) Media Profile, packet scheduling, and bandwidth reservation, we demonstrate that AES67 can be transported with AVB-equivalent quality guarantees while benefiting from Layer 3 networking advantages. The evolution of professional audio networking has increased the demand for high-quality, interoperable, and efficiently managed networks. AVB provides robust Layer 2 delivery guarantees but is limited by Layer 2 constraints. AES67 offers Layer 3 interoperability but lacks strict quality of service (QoS) guarantees. This paper proposes combining the strengths of both approaches by using TSN protocols to support AES67, ensuring precise audio transmission with Layer 3 flexibility. TSN extends AVB standards for time synchronization, traffic shaping, and resource reservation, ensuring low latency, low jitter, and minimal packet loss. AES67, a standard for high-performance audio over IP, leverages ubiquitous IP infrastructure for scalability and flexibility but lacks the QoS needed for professional audio. Integrating TSN protocols with AES67 achieves AVB's QoS guarantees in a Layer 3 environment. IEEE 1588 Level 3 PTP Media Profile ensures precise synchronization, packet scheduling reduces latency and jitter, and bandwidth reservation prevents congestion. Experiments show that TSN protocols enable AES67 to achieve latency, jitter, and packet loss performance on par with AVB, providing reliable audio transmission suitable for professional applications in modern, scalable networks.

Moderators

Rob Maher

Professor, Montana State University

Audio digital signal processing, audio forensics, music analysis and synthesis.

Speakers

Nicolas Sturmel

Directout GmbH

Authors

Nicolas Sturmel

Directout GmbH

Claudio Becker-Foss

Tuesday October 8, 2024 3:30pm - 3:50pm EDT
1E04

Signal Processing, Paper Lecture

3:50pm EDT

Harnessing Diffuse Signal Processing (DiSP) to Mitigate Coherent Interference

Tuesday October 8, 2024 3:50pm - 4:10pm EDT

1E04

Coherent sound wave interference is a persistent challenge in live sound reinforcement, where phase differences between multiple loudspeakers lead to destructive interference, resulting in inconsistent audio coverage. This review paper presents a modern solution: Diffuse Signal Processing (DiSP), which utilizes Temporally Diffuse Impulses (TDIs) to mitigate phase cancellation. Unlike traditional methods focused on phase alignment, DiSP manipulates the temporal and spectral characteristics of sound, effectively diffusing coherent wavefronts. TDIs, designed to spread acoustic energy over time, are synthesized and convolved with audio signals to reduce the likelihood of interference. This process maintains the original sound’s perceptual integrity while enhancing spatial consistency, particularly in large-scale sound reinforcement systems. Practical implementation methods are demonstrated, including a MATLAB-based workflow for generating TDIs and optimizing them for specific frequency ranges or acoustic environments. Furthermore, dynamic DiSP is introduced as a method for addressing interference caused by early reflections in small-to-medium sized rooms. This technique adapts TDIs in real-time, ensuring ongoing decorrelation in complex environments. The potential for future developments, such as integrating DiSP with immersive audio systems or creating dedicated hardware for real-time signal processing, is also discussed.

Moderators

Rob Maher

Professor, Montana State University

Audio digital signal processing, audio forensics, music analysis and synthesis.

Speakers

Tommy Spurgeon

Physics Student & Undergraduate Researcher, University of South Carolina

Authors

Tommy Spurgeon

Physics Student & Undergraduate Researcher, University of South Carolina

Tuesday October 8, 2024 3:50pm - 4:10pm EDT
1E04

Signal Processing, Paper Lecture

11:00am EDT

Reimagining Delay Effects: Integrating Generative AI for Creative Control

Wednesday October 9, 2024 11:00am - 11:20am EDT

1E03

This paper presents a novel generative delay effect that utilizes generative AI to create unique variations of a melody with each new echo. Unlike traditional delay effects, where repetitions are identical to the original input, this effect generates variations in pitch and rhythm, enhancing creative possibilities for artists. The significance of this innovation lies in addressing artists' concerns about generative AI potentially replacing their roles. By integrating generative AI into the creative process, artists retain control and collaborate with the technology, rather than being supplanted by it. The paper outlines the processing methodology, which involves training a Long Short-Term Memory (LSTM) neural network on a dataset of publicly available music. The network generates output melodies based on input characteristics, employing a specialized notation language for music. Additionally, the implementation of this machine learning model within a delay plugin's architecture is discussed, focusing on parameters such as buffer length and tail length. The integration of the model into the broader plugin framework highlights the practical aspects of utilizing generative AI in audio effects. The paper also explores the feasibility of deploying this technology on microcontrollers for use in instruments and effects pedals. By leveraging low-power AI libraries, this advanced functionality can be achieved with minimal storage requirements, demonstrating the efficiency and versatility of the approach. Finally, a demonstration of an early version of the generative delay effect will be presented.

Moderators

Marina Bosi

Stanford University

Marina Bosi, AES Past President, is a founding Director of the Moving Picture, Audio, and Data Coding by Artificial Intelligence (MPAI) and the Chair of the Context-based Audio Enhancement (MPAI-CAE) Development Group and IEEE SA CAE WG. Dr. Bosi has served the Society as President... Read More →

Speakers

Sava Solar

Authors

Sava Solar

Wednesday October 9, 2024 11:00am - 11:20am EDT
1E03

Signal Processing, Paper Lecture

11:20am EDT

Acoustic Characteristics of Parasaurolophus Crest: Experimental Results from a simplified anatomical model

Wednesday October 9, 2024 11:20am - 11:40am EDT

1E03

This study presents a revised acoustic model of the Parasaurolophus crest, incorporating both the main airway and lateral diverticulum, based on previous anatomical models and recent findings. A physical device, as a simplified model of the crest, was constructed using a coupled piping system, and frequency sweeps were conducted to investigate its resonance behavior. Data were collected using a minimally invasive microphone, with a control group consisting of a simple open pipe for comparison. The results show that the frequency response of the experimental model aligns with that of the control pipe at many frequencies, but notable shifts and peak-splitting behavior were observed, suggesting a more active role of the lateral diverticulum in shaping the acoustic response than previously thought. These findings challenge earlier closed-pipe approaches, indicating that complex interactions between the main airway and lateral diverticulum generate additional resonant frequencies absent in the control pipe. The study provides empirical data that offer new insights into the resonance characteristics of the Parasaurolophus crest and contribute to understanding its auditory range, particularly for low-frequency sounds.

Moderators

Marina Bosi

Stanford University

Speakers

Hongjun Lin

Authors

Hongjun Lin

Wednesday October 9, 2024 11:20am - 11:40am EDT
1E03

Signal Processing, Paper Lecture

11:40am EDT

Interpreting user-generated audio from war zones

Wednesday October 9, 2024 11:40am - 12:00pm EDT

1E03

Increasingly, civilian inhabitants and combatants in conflict areas use their mobile phones to record video and audio of armed attacks. These user-generated recordings (UGRs) often provide the only source of immediate information about armed conflicts because access by professional journalists is highly restricted. Audio forensic analysis of these UGRs can help document the circumstances and aftermath of war zone incidents, but consumer off-the-shelf recording devices are not designed for battlefield circumstances and sound levels, nor do the battlefield circumstances provide clear, noise-free audio. Moreover, as with any user-generated material that generally does not have a documented chain-of-custody, there are forensic concerns about authenticity, misinformation, and propaganda that must be considered. In this paper we present several case studies of UGRs from armed conflict areas and describe several methods to assess the quality and integrity of the recorded audio. We also include several recommendations for amateurs who make UGRs so that the recorded material is more easily authenticated and corroborated. Audio and video examples are presented.

Moderators

Marina Bosi

Stanford University

Speakers

Rob Maher

Professor, Montana State University

Audio digital signal processing, audio forensics, music analysis and synthesis.

Authors

Rob Maher

Professor, Montana State University

Audio digital signal processing, audio forensics, music analysis and synthesis.

Wednesday October 9, 2024 11:40am - 12:00pm EDT
1E03

Signal Processing, Paper Lecture

12:00pm EDT

Experimental analysis of a car loudspeaker model based on imposed vibration velocity: effect of membrane discretization

Wednesday October 9, 2024 12:00pm - 12:20pm EDT

1E03

Nowadays, the research about the improvement of the interior sound quality of road vehicles is a relevant task. The cabin is an acoustically challenging environment due to the complex geometry, the different acoustic properties of the materials of cabin components and the presence of audio systems based on multiple loudspeaker units. This paper aims at presenting a simplified modelling approach designed to introduce the boundary condition imposed by a loudspeaker to the cabin system in the context of virtual acoustic analysis. The proposed model is discussed and compared with experimental measurements obtained from a test-case loudspeaker.

Moderators

Marina Bosi

Stanford University

Speakers

Emanuele Garofalo

Authors

Emanuele Garofalo

Alfonso Oliva

Francesco Ripamonti

Gioele Isacchi

Michele Ebri

Simone Talucci

Wednesday October 9, 2024 12:00pm - 12:20pm EDT
1E03

Signal Processing, Paper Lecture

12:20pm EDT

A novel derivative-based approach for the automatic detection of time-reversed audio in the MPAI/IEEE-CAE ARP international standard

Wednesday October 9, 2024 12:20pm - 12:50pm EDT

1E03

The Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) Context-based Audio Enhancement (CAE) Audio Recording Preservation (ARP) standard provides the technical specifications for a comprehensive framework for digitizing and preserving analog audio, specifically focusing on documents recorded on open-reel tapes. This paper introduces a novel, envelope derivative-based method incorporated within the ARP standard to detect reverse audio sections during the digitization process. The primary objective of this method is to automatically identify segments of audio recorded in reverse. Leveraging advanced derivative-based signal processing algorithms, the system enhances its capability to detect and reverse such sections, thereby reducing errors during analog-to-digital (A/D) conversion. This feature not only aids in identifying and correcting digitization errors but also improves the efficiency of large-scale audio document digitization projects. The system's performance has been evaluated using a diverse dataset encompassing various musical genres and digitized tapes, demonstrating its effectiveness across different types of audio content.

Moderators

Marina Bosi

Stanford University

Authors

Alessandro Russo and Sergio Canazza

Fabio Zanini

Matteo Spanio

Wednesday October 9, 2024 12:20pm - 12:50pm EDT
1E03

Signal Processing, Paper Lecture

2:00pm EDT

A study on the relative accuracy and robustness of the convolutional recurrent neural network based approach to binaural sound source localisation.

Wednesday October 9, 2024 2:00pm - 4:00pm EDT

Poster

Binaural sound source localization is the task of finding the location of a sound source using binaural audio as affected by the head-related transfer functions (HRTFs) of a binaural array. The most common approach to this is to train a convolutional neural network directly on the magnitude and phase of the binaural audio. Recurrent layers can then also be introduced to allow for consideration of the temporal context of the binaural data, as to create a convolutional recurrent neural network (CRNN).
This work compares the relative performance of this approach for speech localization on the horizontal plane using four different CRNN models based on different types of recurrent layers; Conv-GRU, Conv-BiGRU, Conv-LSTM, and Conv-BiLSTM, as well as a baseline system of a more conventional CNN with no recurrent layers. These systems were trained and tested on datasets of binaural audio created by convolution of speech samples with BRIRs of 120 rooms, for 50 azimuthal directions. Additive noise created from additional sound sources on the horizontal plane were also added to the signal.
Results show a clear preference for use of CRNN over CNN, with overall localization error and front-back confusion being reduced, with it additionally being seen that such systems are less effected by increasing reverb time and reduced signal to noise ratio. Comparing the recurrent layers also reveals that LSTM based layers see the best overall localisation performance, while layers with bidirectionality are more robust, and so overall finding a preference for Conv-BiLSTM for the task.

Speakers

Jago T. Reed-Jones

Research & Development Engineer, Audioscenic

I am a Research & Development Engineer at Audioscenic, where we are bringing spatial audio to people's homes using binaural audio over loudspeakers. In addition, I am finishing a PhD at Liverpool John Moores University looking at use of neural networks to achieve binaural sound source... Read More →

Authors

Jago T. Reed-Jones

Research & Development Engineer, Audioscenic

David L. Ellis

Karl O. Jones

Paul Fergus

Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

Signal Processing, Poster Presentation

2:00pm EDT

Authoring Inter-Compatible Flexible Audio for Mass Personalization

Wednesday October 9, 2024 2:00pm - 4:00pm EDT

Poster

The popularity of internet-based services for delivering media, i.e., audio and video on demand, creates an opportunity to offer personalized media experiences to audiences. A significant proportion of users are experiencing reduced enjoyment, or facing inaccessibility, due to combinations of different impairments, languages, devices and connectivity, and preferences. Audio currently leads video in the maturity of object-based production and distribution systems; thus, we sought to leverage existing object-based audio tools and frameworks to explore creation and delivery of personalized versions. In this practice-based investigation of personalization affordances, an immersive children’s fantasy radio drama was re-authored within five "dimensions of personalization", motivated by an analysis of under-served audiences, to enable us to develop an understanding of what and how to author personalized media. The dimensions were Duration, Language, Style, Device and Clarity. Our authoring approaches were designed to result in alternative versions that are inherently inter-compatible. The ability to combine them exploits the properties of object-based audio and creates the potential for mass personalization, even with a modest number of options within each of the dimensions. The result of this investigation is a structured set of adaptations, based around a common narrative. Future work will develop automation and smart network integration to trial the delivery of such personalized experiences at scale, and thereby validate the benefits that such forms of media personalization can bring.

Speakers

Craig Cieciura

Research Fellow, University of Surrey

Craig graduated from the Music and Sound Recording (Tonmeister) course at The University of Surrey in 2016. He then completed his PhD at the same institution in 2022. His PhD topic concerned reproduction of object-based audio in the domestic environment using combinations of installed... Read More →

Authors

Craig Cieciura

Research Fellow, University of Surrey

Elettra Bargiacchi

Philip J B Jackson

Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

Signal Processing, Poster Presentation

2:00pm EDT

Automatic Corrective EQ for Measurement Microphone Emulation

Wednesday October 9, 2024 2:00pm - 4:00pm EDT

Poster

This study investigates automatic corrective equalisation (EQ) to adjust the frequency response of microphones with non-flat responses to match that of a flat frequency response measurement microphone. Non-flat responses in microphones can cause significant colouration, necessitating correction for accurate sound capture, particularly in spectral analysis scenarios. To address this, 10 non-flat microphones were profiled in an anechoic chamber, and a 1/3 octave digital graphic equaliser (GEQ) was employed to align their spectra with that of an industry-standard reference microphone. The system's performance was evaluated using acoustic guitar recordings, measuring spectral similarity between Pre- and Post-Corrected recordings and the reference microphone with objective metrics. Results demonstrated improvements in spectral similarity across the metrics, confirming the method's effectiveness in correcting frequency response irregularities. However, limitations such as the inability to account for proximity effects suggest the need for further refinement and validation in diverse acoustic environments.

Speakers

Matthew Cheshire

Lecturer in Audio Engineering / Researcher, Birmingham City University

Authors

Matthew Cheshire

Lecturer in Audio Engineering / Researcher, Birmingham City University

Islah Ali-MacLachlan

Samuel Smith

Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

Signal Processing, Poster Presentation

2:00pm EDT

Corelink Audio: The Development of A JUCE-based Networked Music Performance Solution

Wednesday October 9, 2024 2:00pm - 4:00pm EDT

Poster

Existing Networked Music Performance (NMP) solutions offer high-fidelity audio with minimal latency but often lack the versatility needed for multi-plugin configurations and multi-modal integrations, while ensuring ease of use and future development. This paper presents a novel NMP solution called Corelink Audio which aims to address these limitations. Corelink Audio is built on the Corelink network framework, a data agnostic communication protocol, and JUCE, an audio plugin development framework. Corelink audio can be used flexibly as an Audio Unit (AU) or Virtual Studio Technology (VST) plugin inside different host environments and can be easily integrated with other audio or non-audio data streams. Users can also extend the Corelink Audio codebase to tailor it to their own specific needs using the available JUCE and Corelink documentation. This paper details the architecture and technical specifics of the software. The performance of the system, including latency measurements and audio artifacts under varying conditions, is also evaluated and discussed.

Speakers

Zack Nguyen

Software Developer, NYU IT

Authors

Zack Nguyen

Software Developer, NYU IT

Gnana Anantha Sai Padmapriya Korrapati

Yi Wu and Robert Pahle

Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

Signal Processing, Poster Presentation

2:00pm EDT

Do We Need a Source Separation for Automated Subtitling on Media Contents?

Wednesday October 9, 2024 2:00pm - 4:00pm EDT

Poster

In this paper, we investigate the efficacy of a speech and music source separation technique on an automated subtitling system under different signal-to-noise ratios (SNRs). To this end, we compare the generated subtitle errors by measuring the word error rate (WER) with and without source separation applied to speech in music. Experiments are first performed on a dataset by mixing speech from the LibriSpeech dataset with music from the MUSDB18-HQ dataset. Accordingly, it is revealed that when the SNR is below 5 dB, using separated speech yields the lowest subtitle error. By contrast, when the SNR exceeds 5 dB, using the mixture audio shows the lowest subtitle error. On the basis of these findings, we propose an automated subtitling system that dynamically chooses between using mixture audio or separated speech to generate subtitles. The system utilizes the estimated SNR as a threshold to decide whether to apply source separation, achieving the lowest average WER under various SNR conditions.

Speakers

Do Hyun Lee

Authors

Do Hyun Lee

Geon Woo Lee

Hong Kook Kim

Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

Signal Processing, Poster Presentation

2:00pm EDT

Miniaturized Full-Range MEMS Speaker for In-Ear Headphones

Wednesday October 9, 2024 2:00pm - 4:00pm EDT

Poster

This paper reports a highly miniaturized loudspeaker for in-ear applications. It is manufactured using MEMS (micro-electromechanical systems) technology, a technology that is well known e.g. from MEMS microphones and accelerometers. The speaker is based on a mechanically open design, featuring a perfectly decoupled piezoelectric bending actuator. To provide good acoustic sealing, the actuator is surrounded by an acoustic shield. Measurements performed on loudspeaker prototypes attached to an artificial ear simulator showed that despite the small size of only 2.4 x 2.4 mm², sound pressure levels of 105 dB can be achieved within the full audio range. Moreover, low total harmonic distortion (THD) of less than 0.4 % at 90 dB and 1 kHz have been observed.

Speakers

Fabian Stoppel

Head Acoustic Systems and Microactuators, Fraunhofer Institute for Silicon Technology

Fabian Stoppel is an expert in the field of micro-electro-mechanical systems (MEMS) and holds a PhD degree in electrical engineering. Since 2010 he is with the Fraunhofer Institute for Silicon Technology (ISIT), Germany, where he is heading the group for acoustic micro systems. Dr... Read More →

Authors

Fabian Stoppel

Head Acoustic Systems and Microactuators, Fraunhofer Institute for Silicon Technology

Christian Eisermann

Fabian Lofink

Isa Pieper

Johannes Fankhänel

Thorsten Giese

Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

Signal Processing, Poster Presentation

2:00pm EDT

Music Auto-tagging in the long tail: A few-shot approach

Wednesday October 9, 2024 2:00pm - 4:00pm EDT

Poster

In the realm of digital music, using tags to efficiently organize and retrieve music from extensive databases is crucial for music catalog owners. Human tagging by experts is labor-intensive but mostly accurate, whereas automatic tagging through supervised learning has approached satisfying accuracy but is restricted to a predefined set of training tags. Few-shot learning offers a viable solution to expand beyond this small set of predefined tags by enabling models to learn from only a few human-provided examples to understand tag meanings and subsequently apply these tags autonomously. We propose to integrate few-shot learning methodology into multi-label music auto-tagging by using features from pre-trained models as inputs to a lightweight linear classifier, also known as a linear probe. We investigate different popular pre-trained features, as well as different few-shot parametrizations with varying numbers of classes and samples per class. Our experiments demonstrate that a simple model with pre-trained features can achieve performance close to state-of-the-art models while using significantly less training data, such as 20 samples per tag. Additionally, our linear probe performs competitively with leading models when trained on the entire training dataset. The results show that this transfer learning-based few-shot approach could effectively address the issue of automatically assigning long-tail tags with only limited labeled data.

Speakers

T. Aleksandra Ma

Authors

T. Aleksandra Ma

Alexander Lerch

Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

Signal Processing, Poster Presentation

2:00pm EDT

On-Device Automatic Speech Remastering Solution in Real Time

Wednesday October 9, 2024 2:00pm - 4:00pm EDT

Poster

With the development of AI technology, there are many attempts to provide new experiences to users by applying AI technology to various multimedia devices. Most of these technologies are provided through server-based AI models due to the large model size. In particular, most of the audio AI technologies are applied through apps and it is server-based in offline AI models. However, there is no doubt that AI technology which can be implemented in real time is important and attractive for streaming service devices such as TVs. This paper introduces an on-device automatic speech remastering solution. The automatic speech remastering solution indicates extracting speech in real-time from the on-device and automatically adjusts the speech level considering the current background sound and volume level of the device. In addition, the automatic speech normalization technique that reduces the variance in speech level for each content is applied. The proposed solution provides users with a high understanding and immersion in the contents by automatically improving the delivery of speech and normalizing speech levels without manually controlling the volume level. There are three key points in this paper. The first is a deep learning speech extraction model that can be implemented in real-time on TV devices, the second is an optimized implementation method using the DSP and NPU, and last is audio signal processing for the speech remastering to improve speech intelligibility.

Speakers

Hyeonsik Jeong

Authors

Dongwoo Kim

Inwoo Hwang

Sunmin Kim

Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

Signal Processing, Poster Presentation

2:00pm EDT

Real-time Recognition of Speech Emotion for Human-robot Interaction

Wednesday October 9, 2024 2:00pm - 4:00pm EDT

Poster

In this paper, we propose a novel method for real-time speech emotion recognition (SER) tailored for human-robot interaction. Traditional SER techniques, which analyze entire utterances, often struggle in real-time scenarios due to their high latency. To overcome this challenge, the proposed method breaks down speech into short, overlapping segments and uses a soft voting mechanism to aggregate emotion probabilities in real time. The proposed real-time method is applied to an SER model comprising the pre-trained wav2vec 2.0 and a convolutional network for feature extraction and emotion classification, respectively. The performance of the proposed method was evaluated on the KEMDy19 dataset, a Korean emotion dataset focusing on four key emotions: anger, happiness, neutrality, and sadness. Consequently, applying the real-time method, which processed each segment with a duration of 0.5 or 3.0 seconds, resulted in relative reduction of unweighted accuracy by 10.61% or 5.08%, respectively, compared to the method that processed entire utterances. However, the real-time factor (RTF) was significantly improved.

Speakers