AES Show 2024 NY has ended
Exhibits+ badges provide access to the ADAM Audio Immersive Room, the Genelec Immersive Room, Tech Tours, and the presentations on the Main Stage.

All Access badges provide access to all content in the Program (Tech Tours still require registration)

Wednesday, October 9
 

2:00pm EDT

A study on the relative accuracy and robustness of the convolutional recurrent neural network based approach to binaural sound source localisation.
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Binaural sound source localization is the task of finding the location of a sound source using binaural audio as affected by the head-related transfer functions (HRTFs) of a binaural array. The most common approach is to train a convolutional neural network directly on the magnitude and phase of the binaural audio. Recurrent layers can then be introduced so that the temporal context of the binaural data is taken into account, creating a convolutional recurrent neural network (CRNN).
This work compares the relative performance of this approach for speech localization on the horizontal plane using four CRNN models based on different types of recurrent layers: Conv-GRU, Conv-BiGRU, Conv-LSTM, and Conv-BiLSTM, as well as a baseline system consisting of a more conventional CNN with no recurrent layers. These systems were trained and tested on datasets of binaural audio created by convolving speech samples with BRIRs of 120 rooms, for 50 azimuthal directions. Additive noise created from additional sound sources on the horizontal plane was also added to the signal.
Results show a clear preference for the CRNN over the CNN, with overall localization error and front-back confusion both reduced; such systems are also less affected by increasing reverberation time and reduced signal-to-noise ratio. Comparing the recurrent layers reveals that LSTM-based layers achieve the best overall localization performance, while bidirectional layers are more robust, giving an overall preference for the Conv-BiLSTM for this task.
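The Conv-BiLSTM architecture preferred above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumptions, not the authors' model: the layer sizes, the four-channel magnitude-plus-phase input layout, and the 50-way azimuth output are inferred from the abstract.

```python
# Minimal Conv-BiLSTM sketch for binaural azimuth classification (illustrative only).
# Input layout (batch, 4, time, freq): magnitude + phase for left/right channels (assumed).
import torch
import torch.nn as nn

class ConvBiLSTM(nn.Module):
    def __init__(self, n_freq=256, n_azimuths=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),                       # pool over frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        feat_dim = 64 * (n_freq // 16)
        self.rnn = nn.LSTM(feat_dim, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, n_azimuths)

    def forward(self, x):                       # x: (batch, 4, time, freq)
        z = self.conv(x)                        # (batch, 64, time, freq / 16)
        z = z.permute(0, 2, 1, 3).flatten(2)    # (batch, time, 64 * freq / 16)
        z, _ = self.rnn(z)                      # temporal context via the BiLSTM
        return self.fc(z.mean(dim=1))           # average over time -> azimuth logits

model = ConvBiLSTM()
logits = model(torch.randn(2, 4, 100, 256))     # dummy batch: 2 clips, 100 frames, 256 bins
```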
Speakers
Jago T. Reed-Jones
Research & Development Engineer, Audioscenic
I am a Research & Development Engineer at Audioscenic, where we are bringing spatial audio to people's homes using binaural audio over loudspeakers. In addition, I am finishing a PhD at Liverpool John Moores University looking at use of neural networks to achieve binaural sound source…
Authors
Jago T. Reed-Jones, Research & Development Engineer, Audioscenic
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

2:00pm EDT

Authoring Inter-Compatible Flexible Audio for Mass Personalization
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
The popularity of internet-based services for delivering media, i.e., audio and video on demand, creates an opportunity to offer personalized media experiences to audiences. A significant proportion of users are experiencing reduced enjoyment, or facing inaccessibility, due to combinations of different impairments, languages, devices and connectivity, and preferences. Audio currently leads video in the maturity of object-based production and distribution systems; thus, we sought to leverage existing object-based audio tools and frameworks to explore creation and delivery of personalized versions. In this practice-based investigation of personalization affordances, an immersive children’s fantasy radio drama was re-authored within five "dimensions of personalization", motivated by an analysis of under-served audiences, to enable us to develop an understanding of what and how to author personalized media. The dimensions were Duration, Language, Style, Device and Clarity. Our authoring approaches were designed to result in alternative versions that are inherently inter-compatible. The ability to combine them exploits the properties of object-based audio and creates the potential for mass personalization, even with a modest number of options within each of the dimensions. The result of this investigation is a structured set of adaptations, based around a common narrative. Future work will develop automation and smart network integration to trial the delivery of such personalized experiences at scale, and thereby validate the benefits that such forms of media personalization can bring.
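The combinatorial benefit of inter-compatible versions can be illustrated with a short sketch. The option names below are hypothetical placeholders, not the versions authored in the study; only the five dimension names come from the abstract.

```python
# Illustrative sketch of inter-compatible personalization dimensions.
# Dimension names follow the abstract; the options listed are hypothetical placeholders.
from itertools import product

DIMENSIONS = {
    "Duration": ["full", "abridged"],
    "Language": ["english", "translated"],
    "Style":    ["narrated", "dramatised"],
    "Device":   ["headphones", "soundbar", "mono_speaker"],
    "Clarity":  ["standard", "enhanced_dialogue"],
}

def all_versions():
    """Every combination of options is a valid, inter-compatible version."""
    keys = list(DIMENSIONS)
    for combo in product(*DIMENSIONS.values()):
        yield dict(zip(keys, combo))

print(sum(1 for _ in all_versions()))  # 2*2*2*3*2 = 48 versions from a modest set of options
```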
Speakers
Craig Cieciura
Research Fellow, University of Surrey
Craig graduated from the Music and Sound Recording (Tonmeister) course at The University of Surrey in 2016. He then completed his PhD at the same institution in 2022. His PhD topic concerned reproduction of object-based audio in the domestic environment using combinations of installed…
Authors
Craig Cieciura, Research Fellow, University of Surrey
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

2:00pm EDT

Automatic Corrective EQ for Measurement Microphone Emulation
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
This study investigates automatic corrective equalisation (EQ) to adjust the frequency response of microphones with non-flat responses to match that of a flat-response measurement microphone. Non-flat responses in microphones can cause significant colouration, necessitating correction for accurate sound capture, particularly in spectral analysis scenarios. To address this, 10 non-flat microphones were profiled in an anechoic chamber, and a 1/3-octave digital graphic equaliser (GEQ) was employed to align their spectra with that of an industry-standard reference microphone. The system's performance was evaluated using acoustic guitar recordings, measuring the spectral similarity of pre- and post-correction recordings to the reference microphone using objective metrics. Results demonstrated improvements in spectral similarity across the metrics, confirming the method's effectiveness in correcting frequency response irregularities. However, limitations such as the inability to account for proximity effects suggest the need for further refinement and validation in diverse acoustic environments.
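The correction step reduces to computing per-band gain offsets between the measured and reference responses. The sketch below illustrates that idea only; the band-centre formula, the dummy responses, and the function names are assumptions, not the system described in the abstract.

```python
# Sketch: derive 1/3-octave GEQ gains that align a non-flat microphone with a reference mic.
# The responses used here are placeholders; real profiles come from anechoic measurements.
import numpy as np

def third_octave_centres(f_lo=25.0, f_hi=20000.0):
    """Approximate base-2 1/3-octave band centre frequencies between f_lo and f_hi."""
    n = np.arange(-16, 14)                  # band indices around 1 kHz
    f = 1000.0 * (2.0 ** (n / 3.0))
    return f[(f >= f_lo) & (f <= f_hi)]

def corrective_gains_db(freqs, mic_resp_db, ref_resp_db):
    """Per-band gain (dB) to apply so the microphone's response matches the reference."""
    centres = third_octave_centres()
    mic = np.interp(centres, freqs, mic_resp_db)
    ref = np.interp(centres, freqs, ref_resp_db)
    return centres, ref - mic               # boost where the mic is low, cut where it is high

# Dummy example: a mic with a gentle presence peak vs. a flat reference.
freqs = np.linspace(20, 20000, 512)
mic_db = 2.0 * np.exp(-((np.log10(freqs) - np.log10(5000)) ** 2) / 0.02)
centres, gains = corrective_gains_db(freqs, mic_db, np.zeros_like(freqs))
```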
Speakers
Matthew Cheshire
Lecturer in Audio Engineering / Researcher, Birmingham City University
Authors
Matthew Cheshire, Lecturer in Audio Engineering / Researcher, Birmingham City University
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

2:00pm EDT

Corelink Audio: The Development of A JUCE-based Networked Music Performance Solution
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Existing Networked Music Performance (NMP) solutions offer high-fidelity audio with minimal latency but often lack the versatility needed for multi-plugin configurations and multi-modal integrations while ensuring ease of use and future development. This paper presents a novel NMP solution called Corelink Audio which aims to address these limitations. Corelink Audio is built on the Corelink network framework, a data-agnostic communication protocol, and JUCE, an audio plugin development framework. Corelink Audio can be used flexibly as an Audio Unit (AU) or Virtual Studio Technology (VST) plugin inside different host environments and can be easily integrated with other audio or non-audio data streams. Users can also extend the Corelink Audio codebase to tailor it to their own specific needs using the available JUCE and Corelink documentation. This paper details the architecture and technical specifics of the software. The performance of the system, including latency measurements and audio artifacts under varying conditions, is also evaluated and discussed.
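The per-block send/receive pattern that such a plugin implements can be sketched generically. The snippet below uses plain UDP sockets and NumPy purely as an illustration; it is not the Corelink or JUCE API, and the block size, channel count, and peer address are placeholders.

```python
# Generic illustration of the send/receive pattern an NMP audio plugin implements.
# This is NOT the Corelink API; block size, channels, and the peer address are placeholders.
import socket
import numpy as np

BLOCK = 256                    # samples per audio block (assumed)
CHANNELS = 2
PEER = ("192.0.2.10", 9000)    # placeholder address of the remote performer

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9001))
sock.setblocking(False)

def process_block(local_block: np.ndarray) -> np.ndarray:
    """Called once per block by the host: send our audio, mix in whatever has arrived."""
    sock.sendto(local_block.astype(np.float32).tobytes(), PEER)
    try:
        payload, _ = sock.recvfrom(BLOCK * CHANNELS * 4)
        remote = np.frombuffer(payload, dtype=np.float32).reshape(BLOCK, CHANNELS)
    except BlockingIOError:
        remote = np.zeros((BLOCK, CHANNELS), dtype=np.float32)   # nothing received yet
    return local_block + remote                                  # simple monitor mix
```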
Speakers
Zack Nguyen
Software Developer, NYU IT
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

2:00pm EDT

Do We Need a Source Separation for Automated Subtitling on Media Contents?
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
In this paper, we investigate the efficacy of a speech and music source separation technique on an automated subtitling system under different signal-to-noise ratios (SNRs). To this end, we compare the generated subtitle errors by measuring the word error rate (WER) with and without source separation applied to speech in music. Experiments are first performed on a dataset by mixing speech from the LibriSpeech dataset with music from the MUSDB18-HQ dataset. Accordingly, it is revealed that when the SNR is below 5 dB, using separated speech yields the lowest subtitle error. By contrast, when the SNR exceeds 5 dB, using the mixture audio shows the lowest subtitle error. On the basis of these findings, we propose an automated subtitling system that dynamically chooses between using mixture audio or separated speech to generate subtitles. The system utilizes the estimated SNR as a threshold to decide whether to apply source separation, achieving the lowest average WER under various SNR conditions.
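The decision logic described above reduces to an SNR-gated switch. A minimal sketch follows; estimate_snr_db and separate_speech are hypothetical stand-ins for the paper's SNR estimator and separation model, which the abstract does not specify.

```python
# Sketch of SNR-gated subtitling: use separated speech only when the mixture is noisy.
# estimate_snr_db and separate_speech are hypothetical stand-ins for the paper's components.
SNR_THRESHOLD_DB = 5.0   # below this, separation lowered WER in the reported experiments

def choose_asr_input(mixture, estimate_snr_db, separate_speech):
    snr = estimate_snr_db(mixture)
    if snr < SNR_THRESHOLD_DB:
        return separate_speech(mixture)   # noisy: separation helps the recognizer
    return mixture                        # clean enough: separation artifacts would hurt
```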
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

2:00pm EDT

Miniaturized Full-Range MEMS Speaker for In-Ear Headphones
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
This paper reports a highly miniaturized loudspeaker for in-ear applications. It is manufactured using MEMS (micro-electromechanical systems) technology, which is well known, e.g., from MEMS microphones and accelerometers. The speaker is based on a mechanically open design, featuring a perfectly decoupled piezoelectric bending actuator. To provide good acoustic sealing, the actuator is surrounded by an acoustic shield. Measurements performed on loudspeaker prototypes attached to an artificial ear simulator showed that, despite the small size of only 2.4 x 2.4 mm², sound pressure levels of 105 dB can be achieved across the full audio range. Moreover, low total harmonic distortion (THD) of less than 0.4 % at 90 dB and 1 kHz has been observed.
Speakers
Fabian Stoppel
Head Acoustic Systems and Microactuators, Fraunhofer Institute for Silicon Technology
Fabian Stoppel is an expert in the field of micro-electro-mechanical systems (MEMS) and holds a PhD degree in electrical engineering. Since 2010 he has been with the Fraunhofer Institute for Silicon Technology (ISIT), Germany, where he heads the group for acoustic micro systems. Dr…
Authors
Fabian Stoppel, Head Acoustic Systems and Microactuators, Fraunhofer Institute for Silicon Technology
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

2:00pm EDT

Music Auto-tagging in the long tail: A few-shot approach
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
In the realm of digital music, using tags to efficiently organize and retrieve music from extensive databases is crucial for music catalog owners. Human tagging by experts is labor-intensive but mostly accurate, whereas automatic tagging through supervised learning has approached satisfying accuracy but is restricted to a predefined set of training tags. Few-shot learning offers a viable solution to expand beyond this small set of predefined tags by enabling models to learn from only a few human-provided examples to understand tag meanings and subsequently apply these tags autonomously. We propose to integrate few-shot learning methodology into multi-label music auto-tagging by using features from pre-trained models as inputs to a lightweight linear classifier, also known as a linear probe. We investigate different popular pre-trained features, as well as different few-shot parametrizations with varying numbers of classes and samples per class. Our experiments demonstrate that a simple model with pre-trained features can achieve performance close to state-of-the-art models while using significantly less training data, such as 20 samples per tag. Additionally, our linear probe performs competitively with leading models when trained on the entire training dataset. The results show that this transfer learning-based few-shot approach could effectively address the issue of automatically assigning long-tail tags with only limited labeled data.
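A linear probe over pre-trained embeddings is straightforward to sketch. The snippet below uses scikit-learn with random placeholder embeddings, since the abstract does not name the pre-trained feature extractor or classifier settings; the 20-samples-per-tag figure mirrors the abstract's example.

```python
# Sketch: few-shot multi-label tagging with a linear probe over pre-trained embeddings.
# Embeddings are random placeholders; in practice they come from a frozen pre-trained audio model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
n_tags, shots, dim = 10, 20, 512             # e.g. 20 labelled clips per long-tail tag
X_train = rng.normal(size=(n_tags * shots, dim))
Y_train = np.zeros((n_tags * shots, n_tags), dtype=int)
for t in range(n_tags):                       # each block of `shots` clips carries tag t
    Y_train[t * shots:(t + 1) * shots, t] = 1

probe = OneVsRestClassifier(LogisticRegression(max_iter=1000))
probe.fit(X_train, Y_train)

X_new = rng.normal(size=(4, dim))             # embeddings of untagged catalogue items
tag_probs = probe.predict_proba(X_new)        # per-tag probabilities for auto-tagging
```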
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

2:00pm EDT

On-Device Automatic Speech Remastering Solution in Real Time
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
With the development of AI technology, there have been many attempts to provide new experiences to users by applying AI to various multimedia devices. Most of these technologies are delivered through server-based AI models because of their large model size; in particular, most audio AI technologies are offered through apps and run on servers rather than on the device. However, AI technology that can run in real time is especially important and attractive for streaming devices such as TVs. This paper introduces an on-device automatic speech remastering solution, which extracts speech in real time on the device and automatically adjusts the speech level according to the current background sound and the device's volume setting. In addition, an automatic speech normalization technique is applied to reduce the variance in speech level across content. The proposed solution gives users better comprehension of and immersion in the content by automatically improving speech delivery and normalizing speech levels without the need to adjust the volume manually. The paper makes three key contributions: a deep learning speech extraction model that runs in real time on TV devices, an optimized implementation using the DSP and NPU, and audio signal processing for speech remastering that improves speech intelligibility.
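The level-adjustment step can be illustrated with a short per-block sketch. The target speech-to-background ratio, the smoothing constant, and the assumption that separated speech and background stems are already available are all illustrative choices, not values from the paper.

```python
# Sketch of per-block speech remastering: boost separated speech relative to the background.
# Target ratio and smoothing factor are illustrative; the speech/background split is assumed given.
import numpy as np

TARGET_SBR_DB = 8.0     # desired speech-to-background level difference (assumed)
SMOOTH = 0.9            # one-pole smoothing of the applied gain to avoid pumping

def rms_db(x):
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def remaster_block(speech, background, prev_gain_db=0.0):
    """Return the remixed block and the smoothed speech gain in dB."""
    sbr = rms_db(speech) - rms_db(background)
    gain_db = SMOOTH * prev_gain_db + (1.0 - SMOOTH) * (TARGET_SBR_DB - sbr)
    gain = 10.0 ** (gain_db / 20.0)
    return gain * speech + background, gain_db
```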
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster

2:00pm EDT

Real-time Recognition of Speech Emotion for Human-robot Interaction
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
In this paper, we propose a novel method for real-time speech emotion recognition (SER) tailored for human-robot interaction. Traditional SER techniques, which analyze entire utterances, often struggle in real-time scenarios due to their high latency. To overcome this challenge, the proposed method breaks speech down into short, overlapping segments and uses a soft voting mechanism to aggregate emotion probabilities in real time. The proposed real-time method is applied to an SER model comprising the pre-trained wav2vec 2.0 and a convolutional network for feature extraction and emotion classification, respectively. The performance of the proposed method was evaluated on the KEMDy19 dataset, a Korean emotion dataset focusing on four key emotions: anger, happiness, neutrality, and sadness. Applying the real-time method with segment durations of 0.5 or 3.0 seconds resulted in a relative reduction in unweighted accuracy of 10.61% or 5.08%, respectively, compared to processing entire utterances; however, the real-time factor (RTF) was significantly improved.
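The soft-voting aggregation over overlapping segments is simple to sketch. In the snippet below, classify_segment stands in for the wav2vec 2.0 + CNN model, and the hop size is an assumption; only the 0.5 s segment length and the four emotion classes come from the abstract.

```python
# Sketch of real-time SER soft voting: average per-segment emotion probabilities as audio streams in.
# classify_segment is a stand-in for the wav2vec 2.0 + CNN model; the hop size is assumed.
import numpy as np

EMOTIONS = ["anger", "happiness", "neutrality", "sadness"]
SEG_LEN, HOP = 0.5, 0.25          # seconds (segment length from the abstract, hop assumed)

def stream_emotions(audio, sr, classify_segment):
    seg, hop = int(SEG_LEN * sr), int(HOP * sr)
    running = np.zeros(len(EMOTIONS))
    for n, start in enumerate(range(0, len(audio) - seg + 1, hop), start=1):
        probs = classify_segment(audio[start:start + seg])    # softmax over EMOTIONS
        running += (np.asarray(probs) - running) / n          # running mean = soft voting
        yield EMOTIONS[int(np.argmax(running))]               # current decision, low latency
```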
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Poster
 