AES Show 2024 NY has ended
Exhibits+ badges provide access to the ADAM Audio Immersive Room, the Genelec Immersive Room, Tech Tours, and the presentations on the Main Stage.

All Access badges provide access to all content in the Program (Tech Tours still require registration).

Poster
Tuesday, October 8
 

2:00pm EDT

Bitrate adaptation in object-based audio coding in communication immersive voice and audio systems
Tuesday October 8, 2024 2:00pm - 4:00pm EDT
Object-based audio is one of the spatial audio representations that provide an immersive audio experience. While it can be found in a wide variety of audio reproduction systems, its use in communication systems is very limited because it faces many constraints, such as system complexity, short delay, and the limited bitrate available for coding and transmission. This paper presents a new bitrate adaptation method for object-based audio coding systems that overcomes these constraints and enables their use in 5G voice and audio communication systems. The presented method distributes the available codec bit budget for encoding the waveforms of the individual audio objects based on a classification of the objects’ subjective importance in particular frames. The method has been used in the Immersive Voice and Audio Services (IVAS) codec, recently standardized by 3GPP, but it can be employed in other codecs as well. Test results show the performance advantage of the bitrate adaptation method over a conventional uniform bitrate distribution. The paper also presents IVAS selection test results for object-based audio with four audio objects, rendered to a binaural headphone representation, in which the presented method plays a substantial role.
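For readers unfamiliar with importance-driven bit allocation, a minimal sketch of the general idea (not the IVAS algorithm; class weights, floor, and numbers are assumptions) might look like this:

```python
# Minimal sketch (not the IVAS implementation): distribute a frame's bit budget
# across audio objects according to their classified importance.
from typing import List

def allocate_bits(total_bits: int, importance: List[int],
                  weights=(1.0, 2.0, 4.0), min_bits: int = 200) -> List[int]:
    """importance[i] is a class index (0 = low, 1 = medium, 2 = high).
    Returns a per-object bit allocation summing to total_bits."""
    w = [weights[c] for c in importance]
    w_sum = sum(w)
    # Proportional share, with a floor so no object's waveform coding collapses.
    alloc = [max(min_bits, int(total_bits * wi / w_sum)) for wi in w]
    # Hand any rounding surplus/deficit to the most important object.
    alloc[importance.index(max(importance))] += total_bits - sum(alloc)
    return alloc

# Example: 4 objects in one frame, hypothetical 2640-bit frame budget.
print(allocate_bits(total_bits=2640, importance=[2, 0, 1, 0]))
```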
Poster

2:00pm EDT

Enhancing Realism for Digital Piano Players: A Perceptual Evaluation of Head-Tracked Binaural Audio
Tuesday October 8, 2024 2:00pm - 4:00pm EDT
This paper outlines a process for achieving and perceptually evaluating a head-tracked binaural audio system designed to enhance realism for players of digital pianos. Using an Ambisonic microphone to sample an acoustic piano, together with off-the-shelf head-tracking equipment, the system allows players wearing headphones to experience changes in the sound field in real time, with three degrees of freedom (3DoF), as they rotate their heads. The evaluation criteria included spatial clarity, spectral clarity, envelopment, and preference. These criteria were assessed across three different listening systems: stereo speakers, stereo headphones, and head-tracked binaural audio. Results showed a strong preference for the head-tracked binaural audio system, with players noting significantly greater realism and immersion.
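As a general illustration of 3DoF head tracking with an Ambisonic capture (not necessarily the authors' processing chain), the sound field can be counter-rotated by the tracked yaw before binaural decoding; sign and channel-order conventions below are assumptions:

```python
# A minimal sketch: counter-rotate a first-order B-format (W, X, Y, Z) signal by
# the tracked head yaw before binaural decoding, so the virtual sound field
# stays fixed while the head turns.
import numpy as np

def rotate_foa_yaw(bformat: np.ndarray, yaw_rad: float) -> np.ndarray:
    """bformat: array of shape (4, n_samples) ordered W, X, Y, Z (FuMa-style).
    yaw_rad: head yaw in radians; the field is rotated by -yaw to compensate.
    Sign conventions differ between toolchains, so verify against your decoder."""
    w, x, y, z = bformat
    c, s = np.cos(-yaw_rad), np.sin(-yaw_rad)
    x_rot = c * x - s * y
    y_rot = s * x + c * y
    return np.stack([w, x_rot, y_rot, z])

# Example: rotate a 1-second noise field as if the head turned 30 degrees.
foa = np.random.randn(4, 48000)
foa_tracked = rotate_foa_yaw(foa, np.deg2rad(30.0))
```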
Poster

2:00pm EDT

Exploring Immersive Opera: Recording and Post-Production with Spatial Multi-Microphone System and Volumetric Microphone Array
Tuesday October 8, 2024 2:00pm - 4:00pm EDT
Traditional opera recording techniques using large microphone systems are typically less flexible towards experimental singer choreographies, which have the potential of being adapted to immersive and interactive representations such as Virtual Reality (VR) applications. The authors present an engineering report on implementing two microphone systems for recording an experimental opera production in a medium-sized theatre: a 7.0.4 hybrid array of Lindberg’s 2L and the Bowles spatial arrays and a volumetric array consisting of three higher-order Ambisonic microphones in Left/Center/Right (LCR) formation. Details of both microphone setups are first described, followed by post-production techniques for multichannel loudspeaker playback and 6 degrees-of-freedom (6DoF) binaural rendering for VR experiences. Finally, the authors conclude with observations from informal listening critique sessions and discuss the technical challenges and aesthetic choices involved during the recording and post-production stages in the hope of inspiring future projects on a larger scale.
Speakers
Jiawen Mao

PhD student, McGill University
Authors
Jiawen Mao

PhD student, McGill University
Michael Ikonomidis

Doctoral student, McGill University
Michael Ikonomidis (Michail Oikonomidis) is an accomplished audio engineer and PhD student in Sound Recording at McGill University, specializing in immersive audio, high-channel-count orchestral recordings and scoring sessions. With a diverse background in music production, live sound…
Richard King

Professor, McGill University
Richard King is an Educator, Researcher, and a Grammy Award winning recording engineer. Richard has garnered Grammy Awards in various fields including Best Engineered Album in both the Classical and Non-Classical categories. Richard is an Associate Professor at the Schulich School…
Poster

2:00pm EDT

Exploring the Directivity of the Lute, Lavta, and Oud Plucked String Instruments
Tuesday October 8, 2024 2:00pm - 4:00pm EDT
This study investigates the spherical directivity and radiation patterns of the Lute, Lavta, and Oud, pear-shaped traditional plucked-string instruments from the Middle East, Turkey, Greece, and the surrounding areas, providing insights into the acoustic qualities of their propagated sound in a three-dimensional space. Data was recorded in an acoustically controlled environment with a 29-microphone array, using multiple instruments of each type, performed by several professional musicians. Directivity is investigated in terms of sound projection and radiation patterns. Instruments were categorized according to string material. The analysis revealed that all instruments, regardless of their variations in geometry and material, exhibit similar radiation patterns across all frequency bands, justifying their intuitive classification within the “Lute family”. Nevertheless, variations in sound projection across all directions are evident between instrument types, which can be attributed to differences in construction details and string material. The impact of the musician's body on directivity is also observed. Practical implications of this study include the development of guidelines for the proper recording of these instruments, as well as the simulation of their directivity properties for use in spatial auralizations and acoustic simulations with direct applications in extended reality environments and remote collaborative music performances.
Poster

2:00pm EDT

Generate acoustic responses of virtual microphone arrays from a single set of measured FOA responses. - Apply to multiple sound sources.
Tuesday October 8, 2024 2:00pm - 4:00pm EDT
V2MA (VSVerb Virtual Microphone Array)
Demos and related docs are available at https://bit.ly/3BmDBbL .
Once we have measured a set of four impulse responses (IRs) with an A-format microphone in a hall, we can make a virtual recording using a virtual microphone array placed anywhere in the hall. The measurement does not require the A-format microphone and loudspeaker to be placed at specific positions in the hall. Typical positions, such as at an audience seat and on a stage, are recommended, but you can place them anywhere you like. We can then generate any type of virtual microphone response in the target room from a single, easy IR measurement.
-------------------------------
We propose a method, V2MA, that virtually generates the acoustic responses of any type of microphone array from a single set of FOA responses measured in a target room. An A-format microphone is used for the measurement, but no Ambisonics operation is included in the processing. V2MA is a method based on geometrical acoustics. We calculate sound intensities in the x, y, and z directions from a measured FOA response, and the virtual sound sources of the room are then detected from them. Although it is desirable to place the A-format microphone close to the intended position of the virtual microphone array in the room, this is not a mandatory requirement. Since our method allows SRIRs to be generated at arbitrary receiver positions in the room by updating the acoustic properties of the virtual sound sources detected at a certain position, the A-format microphone can be placed anywhere you like. On the other hand, a loudspeaker must be placed at the source position where a player is assumed to be. Since the positions of the virtual sound sources change when a real sound source moves, the responses previously had to be measured for each assumed real source position. To remove this inconvenient restriction, we developed a technique for updating the positions of the virtual sound sources when a real sound source moves from its original position. Although the technique requires some approximations, the generated SRIRs were confirmed to provide good acoustic properties in both physical and auditory aspects.
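To give a flavour of the geometrical-acoustics step described above (a rough sketch, not the actual V2MA implementation; frame length and scaling are assumptions), short-time intensity vectors can be estimated from an FOA impulse response and the strongest frames kept as virtual-source candidates:

```python
# Rough sketch: estimate short-time sound-intensity vectors from a measured FOA
# room impulse response and keep the strongest frames as virtual-source
# candidates (arrival time, direction, level).
import numpy as np

def detect_virtual_sources(w, x, y, z, fs, frame=64, n_sources=20):
    """w, x, y, z: FOA impulse-response channels (1-D arrays, same length).
    Returns a list of (time_s, azimuth_deg, elevation_deg, level_db)."""
    n_frames = len(w) // frame
    candidates = []
    for i in range(n_frames):
        sl = slice(i * frame, (i + 1) * frame)
        # Instantaneous intensity ~ pressure times particle velocity; scaling
        # constants are omitted since only direction and relative level matter here.
        ix = np.mean(w[sl] * x[sl])
        iy = np.mean(w[sl] * y[sl])
        iz = np.mean(w[sl] * z[sl])
        energy = np.mean(w[sl] ** 2)
        az = np.degrees(np.arctan2(iy, ix))
        el = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
        candidates.append((i * frame / fs, az, el, 10 * np.log10(energy + 1e-12)))
    # Keep the frames with the highest energy as virtual-source candidates.
    return sorted(candidates, key=lambda c: c[3], reverse=True)[:n_sources]
```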
Speakers
Masataka Nakahara

Acoustic Designer / Acoustician, SONA Corp. / ONFUTURE Ltd.
Masataka Nakahara is an acoustician specializing in studio acoustic design and R&D work on room acoustics, as well as an educator. After studying acoustics at the Kyushu Institute of Design, he joined SONA Corporation and began his career as an acoustic designer. In 2005, he received…
Authors
Masataka Nakahara

Acoustic Designer / Acoustician, SONA Corp. / ONFUTURE Ltd.
Masataka Nakahara is an acoustician specializing in studio acoustic design and R&D work on room acoustics, as well as an educator. After studying acoustics at the Kyushu Institute of Design, he joined SONA Corporation and began his career as an acoustic designer. In 2005, he received…

Poster

2:00pm EDT

Measurement and Applications of Directional Room Impulse Responses (DRIRs) for Immersive Sound Reproduction
Tuesday October 8, 2024 2:00pm - 4:00pm EDT
Traditional methods for characterizing Room Impulse Responses (RIRs) employing omnidirectional microphones do not fully capture the spatial properties of sound in an acoustic space. In this paper, we explore a method for characterizing room acoustics employing Directional Room Impulse Responses (DRIRs), which include the direction of arrival of the reflected sound waves in an acoustic space in addition to their time of arrival and strength. We measured DRIRs using a commercial 3D sound intensity probe (Weles Acoustics WA301) containing x, y, z acoustic velocity channels in addition to a scalar pressure channel. We then employed the measured DRIRs to predict the binaural signals that would be measured by binaural dummy-head microphones placed at the same location in the room where the DRIR was measured. The predictions can then be compared to the actual measured binaural signals. Successful implementation of DRIRs could significantly enhance applications in AR/VR and immersive sound reproduction by providing listeners with room-specific directional cues for early room reflections in addition to the diffuse reverberant impulse response tail.
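As an illustration of how directional reflection data can feed a binaural prediction (a sketch only, not the paper's procedure), each extracted reflection can be assigned an HRIR pair; `hrir_lookup` below is a hypothetical helper standing in for whatever HRTF set is used:

```python
# Illustrative sketch: given reflections extracted from a DRIR
# (arrival time, direction, gain), build a predicted binaural room impulse
# response by placing an HRIR pair at each arrival.
import numpy as np

def predict_brir(reflections, hrir_lookup, fs, length_s=1.0):
    """reflections: iterable of (time_s, azimuth_deg, elevation_deg, gain).
    hrir_lookup(az, el) -> (hrir_left, hrir_right) as 1-D numpy arrays
    (hypothetical helper for the chosen HRTF dataset)."""
    n = int(length_s * fs)
    brir = np.zeros((2, n))
    for t, az, el, gain in reflections:
        start = int(round(t * fs))
        if start >= n:
            continue                     # reflection falls outside the window
        hl, hr = hrir_lookup(az, el)
        stop = min(n, start + len(hl))
        brir[0, start:stop] += gain * hl[:stop - start]
        brir[1, start:stop] += gain * hr[:stop - start]
    return brir
```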
Poster

2:00pm EDT

Quantitative Assessment of Acoustical Attributes and Listener Preferences in Binaural Renderers with Head-tracking Function
Tuesday October 8, 2024 2:00pm - 4:00pm EDT
The rapid advancement of immersive audio technologies has popularized binaural renderers that create 3D auditory experiences using head-related transfer functions (HRTFs). Various renderers with unique algorithms have emerged, offering head-tracking functionality for real-time adjustments to spatial audio perception. Building on our previous study, we compared binauralized music from five renderers with the dynamic head-tracking function enabled, focusing on how differences in HRTFs and algorithms affect listener perceptions. Participants assessed overall preference, spatial fidelity, and timbral fidelity by comparing paired stimuli. Consistent with our earlier findings, one renderer received the highest ratings for overall preference and spatial fidelity, while others rated lower in these attributes. Physical analysis showed that interaural time differences (ITD), interaural level differences (ILD), and frequency response variations contributed to these outcomes. Notably, hierarchical cluster analysis of participants' timbral fidelity evaluations revealed two distinct groups, suggesting variability in individual sensitivities to timbral nuances. While spatial cues, enhanced by head tracking, were generally found to be more influential in determining overall preference, the results also highlight that timbral fidelity plays a significant role for certain listener groups, indicating that both spatial and timbral factors should be considered in future developments.
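For readers curious about the kind of physical analysis mentioned above, a minimal sketch (broadband cross-correlation ITD and RMS-based ILD; lag-sign convention is an assumption) could look like this:

```python
# Minimal sketch: broadband ITD via interaural cross-correlation and ILD via an
# RMS ratio, computed from a binaural rendering of a test signal.
import numpy as np

def itd_ild(left: np.ndarray, right: np.ndarray, fs: int, max_itd_ms=1.0):
    """Returns (itd_seconds, ild_db); with this convention a positive ITD
    means the left-ear signal leads."""
    max_lag = int(fs * max_itd_ms / 1000)
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.sum(left[max(0, -l):len(left) - max(0, l)] *
                    right[max(0, l):len(right) - max(0, -l)]) for l in lags]
    itd = lags[int(np.argmax(xcorr))] / fs
    ild = 20 * np.log10((np.sqrt(np.mean(left ** 2)) + 1e-12) /
                        (np.sqrt(np.mean(right ** 2)) + 1e-12))
    return itd, ild
```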
Speakers
Rai Sato

Ph.D. Student, Korea Advanced Institute of Science and Technology
Rai Sato (佐藤 来) is currently pursuing a PhD at the Graduate School of Culture Technology at the Korea Advanced Institute of Science and Technology. He holds a Bachelor of Music from Tokyo University of the Arts, where he specialized in immersive audio recording and psychoacoustics…
Authors
Rai Sato

Ph.D. Student, Korea Advanced Institute of Science and Technology
Rai Sato (佐藤 来) is currently pursuing a PhD at the Graduate School of Culture Technology at the Korea Advanced Institute of Science and Technology. He holds a Bachelor of Music from Tokyo University of the Arts, where he specialized in immersive audio recording and psychoacoustics…
Poster

2:00pm EDT

Review: Head-Related Impulse Response Measurement Methods
Tuesday October 8, 2024 2:00pm - 4:00pm EDT
This review paper discusses advancements in Head-Related Impulse Response measurement methods. HRIR (Head-Related Impulse Response) measurement methods, often referred to as HRTF (Head-Related Transfer Function) measurement methods, have undergone significant changes over the last few decades [1]. A frequently employed method is the discrete stop-and-go method [1][2]. It involves changing the location of a single loudspeaker, used as the sound source, and recording the impulse response at each location [2]. Since the measurement covers one source location at a time, the discrete stop-and-go method is time-consuming [1]. Hence, improvements such as using more sound sources (loudspeakers) are required to enhance the efficiency of the measurement process [1][3]. A typical HRTF measurement is usually conducted in an anechoic chamber to achieve a simulated free-field measurement condition without room reverberation. It measures the transfer function between the source and the ears to capture localisation cues such as inter-aural time differences (ITDs), inter-aural level differences (ILDs), and monaural spectral cues [4]. Newer techniques such as the Multiple Exponential Sweep Method (MESM) and the reciprocal method offer alternatives; these methods enhance measurement efficiency and address challenges like inter-reflections and low-frequency response [5][6]. Individualised HRTF measurement techniques can be categorised into acoustical measurement, anthropometric data, and perceptual feedback [7]. Interpolation methods and non-anechoic environment measurements have expanded the practical application and feasibility of HRTF measurements [7][8][9][10].
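To illustrate the sweep-based family of methods discussed above, here is a compact exponential-sine-sweep sketch with regularized spectral-division deconvolution; the parameters and the toy "system" (a simple delay) are illustrative only:

```python
# Compact sketch of an exponential sine sweep (ESS) measurement: generate a
# sweep, record it through the system under test, and recover the impulse
# response by regularized spectral division.
import numpy as np

def ess(f1, f2, duration, fs):
    t = np.arange(int(duration * fs)) / fs
    k = np.log(f2 / f1)
    return np.sin(2 * np.pi * f1 * duration / k * (np.exp(t * k / duration) - 1))

def deconvolve(recorded, sweep, fs, eps=1e-8):
    """Recover an impulse response from a recording of the sweep."""
    n = len(recorded) + len(sweep)
    R, S = np.fft.rfft(recorded, n), np.fft.rfft(sweep, n)
    h = np.fft.irfft(R * np.conj(S) / (np.abs(S) ** 2 + eps), n)
    return h[:len(recorded)]

# Example with a toy "system" (a 5 ms delay) standing in for an ear signal.
fs = 48000
sweep = ess(20.0, 20000.0, 5.0, fs)
recorded = np.concatenate([np.zeros(240), sweep])
hrir_estimate = deconvolve(recorded, sweep, fs)   # peak near sample 240
```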
Speakers
Jeremy Tsuaye

New York University
Authors
Jeremy Tsuaye

New York University
Poster

2:00pm EDT

The effects of interaural time difference and interaural level difference on sound source localization on the horizontal plane
Tuesday October 8, 2024 2:00pm - 4:00pm EDT
Interaural Time Difference (ITD) and Interaural Level Difference (ILD) are the main cues used by the human auditory system to localize sound sources on the horizontal plane. To explore the relationship between ITD, ILD, and the perceived azimuth, a study was conducted to measure and analyze localization on the horizontal plane for combinations of ITD and ILD. Pure tones were used as sound sources in the experiment. For each of three frequency bands, 25 combinations of ITD and ILD test values were selected. These combinations were applied to a frontal reference sound (pure-tone signals recorded with an artificial head in an anechoic chamber). The tests were conducted using the 1-up/2-down and 2AFC (two-alternative forced-choice) psychophysical testing methods. The results showed that the perceived azimuth at 350 Hz and 570 Hz was generally higher than at 1000 Hz. Additionally, the perceived azimuths at 350 Hz and 570 Hz were similar under certain combinations. The experimental data and conclusions can provide foundational data and theoretical support for efficient compression of multi-channel audio.
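As a hedged illustration of how one ITD/ILD combination from such a test grid could be imposed on a frontal pure-tone stimulus (values, sample rate, and sign conventions below are assumptions, not the study's exact settings):

```python
# Sketch: apply one ITD/ILD combination to a pure-tone stimulus.
import numpy as np

def apply_itd_ild(tone: np.ndarray, fs: int, itd_us: float, ild_db: float):
    """Return (left, right). With this convention, positive itd_us delays the
    right ear and positive ild_db attenuates the right ear, shifting the image
    toward the left."""
    delay = int(round(abs(itd_us) * 1e-6 * fs))
    left, right = tone.copy(), tone.copy()
    if itd_us >= 0:
        right = np.concatenate([np.zeros(delay), right])[:len(tone)]
    else:
        left = np.concatenate([np.zeros(delay), left])[:len(tone)]
    right *= 10 ** (-ild_db / 20)
    return left, right

fs = 48000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 570 * t)          # one of the tested frequencies
left, right = apply_itd_ild(tone, fs, itd_us=300.0, ild_db=6.0)
```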
Poster
 
Wednesday, October 9
 

2:00pm EDT

A study on the relative accuracy and robustness of the convolutional recurrent neural network based approach to binaural sound source localisation.
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Binaural sound source localization is the task of finding the location of a sound source using binaural audio as affected by the head-related transfer functions (HRTFs) of a binaural array. The most common approach is to train a convolutional neural network directly on the magnitude and phase of the binaural audio. Recurrent layers can then also be introduced to allow the temporal context of the binaural data to be considered, creating a convolutional recurrent neural network (CRNN).
This work compares the relative performance of this approach for speech localization on the horizontal plane using four different CRNN models based on different types of recurrent layers: Conv-GRU, Conv-BiGRU, Conv-LSTM, and Conv-BiLSTM, as well as a baseline system of a more conventional CNN with no recurrent layers. These systems were trained and tested on datasets of binaural audio created by convolving speech samples with BRIRs of 120 rooms, for 50 azimuthal directions. Additive noise created from additional sound sources on the horizontal plane was also added to the signal.
Results show a clear preference for the CRNN over the CNN, with overall localization error and front-back confusion being reduced; such systems are also less affected by increasing reverberation time and reduced signal-to-noise ratio. Comparing the recurrent layers reveals that LSTM-based layers give the best overall localisation performance, while layers with bidirectionality are more robust, indicating an overall preference for Conv-BiLSTM for the task.
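For readers unfamiliar with the architecture family, a toy Conv-BiLSTM in PyTorch is sketched below; the layer sizes, pooling, and the input format (stacked binaural magnitude and phase spectrograms) are assumptions, not the paper's exact model:

```python
# Illustrative Conv-BiLSTM for clip-level azimuth classification.
# Input: binaural magnitude + phase spectrograms, shape (batch, 4, time, freq).
import torch
import torch.nn as nn

class ConvBiLSTM(nn.Module):
    def __init__(self, n_freq=256, n_classes=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),                     # pool over frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        self.rnn = nn.LSTM(input_size=64 * (n_freq // 16), hidden_size=128,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, n_classes)      # 50 azimuth classes

    def forward(self, x):                             # x: (batch, 4, T, F)
        z = self.conv(x)                              # (batch, 64, T, F//16)
        z = z.permute(0, 2, 1, 3).flatten(2)          # (batch, T, 64*F//16)
        z, _ = self.rnn(z)                            # BiLSTM over time frames
        return self.fc(z.mean(dim=1))                 # one direction per clip

logits = ConvBiLSTM()(torch.randn(2, 4, 100, 256))    # -> shape (2, 50)
```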
Speakers
Jago T. Reed-Jones

Research & Development Engineer, Audioscenic
I am a Research & Development Engineer at Audioscenic, where we are bringing spatial audio to people's homes using binaural audio over loudspeakers. In addition, I am finishing a PhD at Liverpool John Moores University looking at use of neural networks to achieve binaural sound source…
Authors
Jago T. Reed-Jones

Research & Development Engineer, Audioscenic
I am a Research & Development Engineer at Audioscenic, where we are bringing spatial audio to people's homes using binaural audio over loudspeakers. In addition, I am finishing a PhD at Liverpool John Moores University looking at use of neural networks to achieve binaural sound source…
Poster

2:00pm EDT

Authoring Inter-Compatible Flexible Audio for Mass Personalization
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
The popularity of internet-based services for delivering media, i.e., audio and video on demand, creates an opportunity to offer personalized media experiences to audiences. A significant proportion of users are experiencing reduced enjoyment, or facing inaccessibility, due to combinations of different impairments, languages, devices and connectivity, and preferences. Audio currently leads video in the maturity of object-based production and distribution systems; thus, we sought to leverage existing object-based audio tools and frameworks to explore creation and delivery of personalized versions. In this practice-based investigation of personalization affordances, an immersive children’s fantasy radio drama was re-authored within five "dimensions of personalization", motivated by an analysis of under-served audiences, to enable us to develop an understanding of what and how to author personalized media. The dimensions were Duration, Language, Style, Device and Clarity. Our authoring approaches were designed to result in alternative versions that are inherently inter-compatible. The ability to combine them exploits the properties of object-based audio and creates the potential for mass personalization, even with a modest number of options within each of the dimensions. The result of this investigation is a structured set of adaptations, based around a common narrative. Future work will develop automation and smart network integration to trial the delivery of such personalized experiences at scale, and thereby validate the benefits that such forms of media personalization can bring.
Speakers
Craig Cieciura

Research Fellow, University of Surrey
Craig graduated from the Music and Sound Recording (Tonmeister) course at The University of Surrey in 2016. He then completed his PhD at the same institution in 2022. His PhD topic concerned reproduction of object-based audio in the domestic environment using combinations of installed…
Authors
Craig Cieciura

Research Fellow, University of Surrey
Craig graduated from the Music and Sound Recording (Tonmeister) course at The University of Surrey in 2016. He then completed his PhD at the same institution in 2022. His PhD topic concerned reproduction of object-based audio in the domestic environment using combinations of installed…
Poster

2:00pm EDT

Automatic Corrective EQ for Measurement Microphone Emulation
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
This study investigates automatic corrective equalisation (EQ) to adjust the frequency response of microphones with non-flat responses to match that of a flat frequency response measurement microphone. Non-flat responses in microphones can cause significant colouration, necessitating correction for accurate sound capture, particularly in spectral analysis scenarios. To address this, 10 non-flat microphones were profiled in an anechoic chamber, and a 1/3 octave digital graphic equaliser (GEQ) was employed to align their spectra with that of an industry-standard reference microphone. The system's performance was evaluated using acoustic guitar recordings, measuring spectral similarity between Pre- and Post-Corrected recordings and the reference microphone with objective metrics. Results demonstrated improvements in spectral similarity across the metrics, confirming the method's effectiveness in correcting frequency response irregularities. However, limitations such as the inability to account for proximity effects suggest the need for further refinement and validation in diverse acoustic environments.
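As a simplified illustration of the corrective-EQ idea (not the paper's exact procedure; band edges, FFT handling, and smoothing are assumptions), per-band gains can be derived by comparing 1/3-octave levels of the same source captured by the reference and the non-flat microphone:

```python
# Simplified sketch: derive 1/3-octave graphic-EQ gains that pull a non-flat
# microphone's spectrum toward a reference microphone's spectrum.
import numpy as np

def third_octave_gains(target_rec, reference_rec, fs, f_low=25.0, n_bands=31):
    """Returns (centre_freqs_hz, gains_db) to apply to the target microphone."""
    centres = f_low * 2 ** (np.arange(n_bands) / 3)
    n = 1 << int(np.ceil(np.log2(max(len(target_rec), len(reference_rec)))))
    freqs = np.fft.rfftfreq(n, 1 / fs)
    T = np.abs(np.fft.rfft(target_rec, n)) ** 2
    R = np.abs(np.fft.rfft(reference_rec, n)) ** 2
    gains = []
    for fc in centres:
        lo, hi = fc / 2 ** (1 / 6), fc * 2 ** (1 / 6)   # 1/3-octave band edges
        band = (freqs >= lo) & (freqs < hi)
        t, r = T[band].sum() + 1e-20, R[band].sum() + 1e-20
        gains.append(10 * np.log10(r / t))              # boost where target is weak
    return centres, np.array(gains)
```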
Speakers
Matthew Cheshire

Lecturer in Audio Engineering / Researcher, Birmingham City University
Authors
Matthew Cheshire

Lecturer in Audio Engineering / Researcher, Birmingham City University
Poster

2:00pm EDT

Corelink Audio: The Development of A JUCE-based Networked Music Performance Solution
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
Existing Networked Music Performance (NMP) solutions offer high-fidelity audio with minimal latency but often lack the versatility needed for multi-plugin configurations and multi-modal integrations while ensuring ease of use and future development. This paper presents a novel NMP solution called Corelink Audio which aims to address these limitations. Corelink Audio is built on the Corelink network framework, a data-agnostic communication protocol, and JUCE, an audio plugin development framework. Corelink Audio can be used flexibly as an Audio Unit (AU) or Virtual Studio Technology (VST) plugin inside different host environments and can be easily integrated with other audio or non-audio data streams. Users can also extend the Corelink Audio codebase to tailor it to their own specific needs using the available JUCE and Corelink documentation. This paper details the architecture and technical specifics of the software. The performance of the system, including latency measurements and audio artifacts under varying conditions, is also evaluated and discussed.
Speakers
Zack Nguyen

Software Developer, NYU IT
Poster

2:00pm EDT

Do We Need a Source Separation for Automated Subtitling on Media Contents?
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
In this paper, we investigate the efficacy of a speech and music source separation technique on an automated subtitling system under different signal-to-noise ratios (SNRs). To this end, we compare the generated subtitle errors by measuring the word error rate (WER) with and without source separation applied to speech in music. Experiments are first performed on a dataset by mixing speech from the LibriSpeech dataset with music from the MUSDB18-HQ dataset. Accordingly, it is revealed that when the SNR is below 5 dB, using separated speech yields the lowest subtitle error. By contrast, when the SNR exceeds 5 dB, using the mixture audio shows the lowest subtitle error. On the basis of these findings, we propose an automated subtitling system that dynamically chooses between using mixture audio or separated speech to generate subtitles. The system utilizes the estimated SNR as a threshold to decide whether to apply source separation, achieving the lowest average WER under various SNR conditions.
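A sketch of the decision logic described above is given below; `separate_speech`, `estimate_snr_db`, and `transcribe` are hypothetical stand-ins for the separation model, SNR estimator, and ASR system, and the 5 dB threshold is taken from the reported finding:

```python
# Sketch: choose between the mixture and separated speech based on estimated SNR.
def subtitle(mixture, fs, separate_speech, estimate_snr_db, transcribe,
             threshold_db=5.0):
    snr_db = estimate_snr_db(mixture, fs)
    if snr_db < threshold_db:
        # Low SNR: separation helps the recognizer more than its artifacts hurt.
        audio_for_asr = separate_speech(mixture, fs)
    else:
        # High SNR: separation artifacts outweigh the benefit; use the mixture.
        audio_for_asr = mixture
    return transcribe(audio_for_asr, fs)
```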
Poster

2:00pm EDT

Miniaturized Full-Range MEMS Speaker for In-Ear Headphones
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
This paper reports a highly miniaturized loudspeaker for in-ear applications. It is manufactured using MEMS (micro-electromechanical systems) technology, a technology well known from, e.g., MEMS microphones and accelerometers. The speaker is based on a mechanically open design, featuring a perfectly decoupled piezoelectric bending actuator. To provide good acoustic sealing, the actuator is surrounded by an acoustic shield. Measurements performed on loudspeaker prototypes attached to an artificial ear simulator showed that, despite the small size of only 2.4 x 2.4 mm², sound pressure levels of 105 dB can be achieved across the full audio range. Moreover, low total harmonic distortion (THD) of less than 0.4% at 90 dB and 1 kHz has been observed.
Speakers
Fabian Stoppel

Head Acoustic Systems and Microactuators, Fraunhofer Institute for Silicon Technology
Fabian Stoppel is an expert in the field of micro-electro-mechanical systems (MEMS) and holds a PhD degree in electrical engineering. Since 2010 he has been with the Fraunhofer Institute for Silicon Technology (ISIT), Germany, where he heads the group for acoustic micro systems. Dr…
Authors
Fabian Stoppel

Head Acoustic Systems and Microactuators, Fraunhofer Institute for Silicon Technology
Fabian Stoppel is an expert in the field of micro-electro-mechanical systems (MEMS) and holds a PhD degree in electrical engineering. Since 2010 he has been with the Fraunhofer Institute for Silicon Technology (ISIT), Germany, where he heads the group for acoustic micro systems. Dr…
Poster

2:00pm EDT

Music Auto-tagging in the long tail: A few-shot approach
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
In the realm of digital music, using tags to efficiently organize and retrieve music from extensive databases is crucial for music catalog owners. Human tagging by experts is labor-intensive but mostly accurate, whereas automatic tagging through supervised learning has approached satisfying accuracy but is restricted to a predefined set of training tags. Few-shot learning offers a viable solution to expand beyond this small set of predefined tags by enabling models to learn from only a few human-provided examples to understand tag meanings and subsequently apply these tags autonomously. We propose to integrate few-shot learning methodology into multi-label music auto-tagging by using features from pre-trained models as inputs to a lightweight linear classifier, also known as a linear probe. We investigate different popular pre-trained features, as well as different few-shot parametrizations with varying numbers of classes and samples per class. Our experiments demonstrate that a simple model with pre-trained features can achieve performance close to state-of-the-art models while using significantly less training data, such as 20 samples per tag. Additionally, our linear probe performs competitively with leading models when trained on the entire training dataset. The results show that this transfer learning-based few-shot approach could effectively address the issue of automatically assigning long-tail tags with only limited labeled data.
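In the spirit of the linear-probe approach described above, a minimal sketch is shown below; the feature dimension, number of shots, and data are synthetic stand-ins for real pre-trained embeddings and tag labels:

```python
# Minimal linear-probe sketch: frozen pre-trained embeddings feed one binary
# logistic regression per tag (multi-label tagging with few examples per tag).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_tags, dim, shots = 5, 512, 20
embeddings = rng.normal(size=(n_tags * shots, dim))          # stand-in features
labels = rng.integers(0, 2, size=(n_tags * shots, n_tags))   # stand-in targets

probes = []
for t in range(n_tags):                    # one lightweight linear probe per tag
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings, labels[:, t])
    probes.append(clf)

def tag_scores(track_embedding):
    """Probability of each tag for one track embedding of shape (dim,)."""
    return np.array([p.predict_proba(track_embedding[None, :])[0, 1]
                     for p in probes])

print(tag_scores(rng.normal(size=dim)))
```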
Poster

2:00pm EDT

On-Device Automatic Speech Remastering Solution in Real Time
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
With the development of AI technology, there are many attempts to provide new experiences to users by applying AI technology to various multimedia devices. Most of these technologies are provided through server-based AI models due to their large model size; in particular, most audio AI technologies are applied through apps backed by offline, server-based AI models. However, AI technology that can run in real time is especially important and attractive for streaming devices such as TVs. This paper introduces an on-device automatic speech remastering solution. The solution extracts speech in real time on the device and automatically adjusts the speech level considering the current background sound and the volume level of the device. In addition, an automatic speech normalization technique that reduces the variance in speech level across contents is applied. The proposed solution provides users with clear speech delivery and high immersion in the content by automatically improving speech intelligibility and normalizing speech levels without manual control of the volume level. There are three key points in this paper: the first is a deep learning speech extraction model that can run in real time on TV devices, the second is an optimized implementation method using the DSP and NPU, and the last is the audio signal processing used for speech remastering to improve speech intelligibility.
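As a simplified illustration of the level-adjustment stage (a sketch only; the target speech-to-background ratio, boost limit, and frame-based gain logic are assumptions, not the product's values), one frame of extracted speech could be remixed like this:

```python
# Sketch: raise extracted speech so it sits a target number of dB above the
# background, then remix. Assumes a separate model has produced the two stems.
import numpy as np

def remaster_frame(speech, background, target_sbr_db=9.0, max_boost_db=12.0):
    """speech, background: one audio frame each (1-D arrays). Returns the mix."""
    eps = 1e-12
    speech_db = 10 * np.log10(np.mean(speech ** 2) + eps)
    bg_db = 10 * np.log10(np.mean(background ** 2) + eps)
    boost_db = np.clip(target_sbr_db - (speech_db - bg_db), 0.0, max_boost_db)
    return speech * 10 ** (boost_db / 20) + background
```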
Poster

2:00pm EDT

Real-time Recognition of Speech Emotion for Human-robot Interaction
Wednesday October 9, 2024 2:00pm - 4:00pm EDT
In this paper, we propose a novel method for real-time speech emotion recognition (SER) tailored for human-robot interaction. Traditional SER techniques, which analyze entire utterances, often struggle in real-time scenarios due to their high latency. To overcome this challenge, the proposed method breaks speech down into short, overlapping segments and uses a soft voting mechanism to aggregate emotion probabilities in real time. The proposed real-time method is applied to an SER model comprising the pre-trained wav2vec 2.0 and a convolutional network for feature extraction and emotion classification, respectively. The performance of the proposed method was evaluated on the KEMDy19 dataset, a Korean emotion dataset focusing on four key emotions: anger, happiness, neutrality, and sadness. Applying the real-time method with segment durations of 0.5 or 3.0 seconds resulted in relative reductions in unweighted accuracy of 10.61% and 5.08%, respectively, compared to the method that processes entire utterances. However, the real-time factor (RTF) was significantly improved.
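A sketch of the soft-voting step described above follows; `classify_segment` is a hypothetical stand-in for the wav2vec 2.0 + CNN classifier, and the segment and hop lengths are illustrative:

```python
# Sketch: average per-segment emotion probabilities over short, overlapping
# windows to obtain a running real-time decision (soft voting).
import numpy as np

EMOTIONS = ["anger", "happiness", "neutrality", "sadness"]

def streaming_emotion(audio, fs, classify_segment, seg_s=0.5, hop_s=0.25):
    """Yields (time_s, emotion) as segments arrive.
    classify_segment(segment, fs) -> probability vector over EMOTIONS."""
    seg, hop = int(seg_s * fs), int(hop_s * fs)
    probs = []
    for start in range(0, len(audio) - seg + 1, hop):
        probs.append(classify_segment(audio[start:start + seg], fs))
        avg = np.mean(probs, axis=0)          # soft vote over segments so far
        yield (start + seg) / fs, EMOTIONS[int(np.argmax(avg))]
```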
Poster
 
Thursday, October 10
 

10:00am EDT

Acoustics in Live Sound
Thursday October 10, 2024 10:00am - 12:00pm EDT
Acoustics in Live Sound looks at roadhouses and touring engineers in order to determine how acoustics is applied and whether this is a conscious or subconscious choice. This is done through various sources, with a strong emphasis on interviews with professionals in the field. An examination of engineers' workflows and tools, a cost-benefit analysis of those tools, a look at different types of spaces and ways to improve them, and a discussion of delays and equalization are combined to give a comprehensive picture of what an engineer is doing and what can make their job easier and their mixes better. For the beginning engineer, this is a comprehensive guide to the tools and skills needed as they relate to acoustics. For the more seasoned professional, it is a different way to think about how problems are approached within the field and how solutions could be based more heavily on the statistics gathered from acoustics.
Poster

10:00am EDT

Design and Training of an Intelligent Switchless Guitar Distortion Pedal
Thursday October 10, 2024 10:00am - 12:00pm EDT
Guitar effects pedals are designed to alter a guitar signal through electronic means and are often controlled by a footswitch that routes the signal either through the effect or directly to the output through a 'clean' channel. Because players often switch effects on and off for different portions of a song and different playing styles, our goal in this paper is to create a trainable guitar effect pedal that classifies the incoming guitar signal into two or more playing-style classes and routes the signal to the bypass channel or the effect channel depending on the class. A training data set was collected consisting of recorded single notes and power chords. The neural network algorithm is able to distinguish between these two playing styles with 95% accuracy on the test set. An electronic system is designed with a Raspberry Pi Pico, preamplifiers, multiplexers, and a distortion effect; it runs a neural network trained using Edge Impulse software and performs the classification and signal routing in real time.
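To illustrate the switchless routing idea in the abstract (a sketch only; `classify_block` stands in for the trained classifier and `distort` for the pedal's effect stage, both hypothetical here):

```python
# Sketch: classify each incoming audio block and route it to the effect or the
# clean/bypass channel depending on the detected playing style.
import numpy as np

def process_block(block, classify_block, distort, threshold=0.5):
    """block: one buffer of guitar samples. Returns the routed output."""
    p_effect_class = classify_block(block)     # probability of the 'effect' class
    if p_effect_class >= threshold:
        return distort(block)                  # effect channel
    return block                               # clean / bypass channel

# Toy stand-ins so the sketch runs end to end.
demo = process_block(np.random.randn(256),
                     classify_block=lambda b: float(np.mean(b ** 2) > 0.5),
                     distort=lambda b: np.tanh(3.0 * b))
```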
Speakers
David Anderson

Assistant Professor, University of Minnesota Duluth
Authors
David Anderson

Assistant Professor, University of Minnesota Duluth
Poster

10:00am EDT

Detection and Perception of Naturalness in Drum Sounds Processed with Dynamic Range Control
Thursday October 10, 2024 10:00am - 12:00pm EDT
This study examines whether trained listeners could identify and judge the “naturalness” of drum samples processed with Dynamic Range Compression (DRC) when paired with unprocessed samples. A two-part experiment was conducted utilizing a paired comparison of a 10-second reference drum loop with no processing and a set of drum loops with varying degrees of DRC. In Part 1, subjects were instructed to identify which sample of the two they believed to have DRC applied. They were then asked to identify which sample they believed sounded “more natural” in Part 2. Out of 18 comparisons in Part 1, only three demonstrated reliable identification of the target variable. Results from Part 2 showed that subjects perceived the DRC-processed samples as natural and as unprocessed. However, while inconclusive, these results may suggest that listeners perceived the processed drum sound to be equally as natural as the original.
Poster

10:00am EDT

Headphones vs. Loudspeakers: Finding the More Effective Monitoring System
Thursday October 10, 2024 10:00am - 12:00pm EDT
There has been some debate about whether headphones or loudspeakers are the more effective monitoring system. However, there has been little discussion of the effectiveness of the two mediums in the context of mixing and music production tasks. The purpose of this study was to examine how the monitoring systems influenced users’ performance in production tasks using both mediums. An experiment was designed, in which the subject was asked to match the boost level of a given frequency band in a reference recording until the boost sounded identical to the reference. A group of six audio engineering graduate students completed the experiment using headphones and loudspeakers. Subjects’ adjustment levels across eight frequency bands were collected. Results suggest that both monitoring systems were equally effective in the context of adjusting equalizer settings. Future research is called for which will include more subjects with diverse levels of expertise in various production-oriented tasks to better understand the potential effect of each playback system.
Poster

10:00am EDT

Intentional Audio Engineering Flaws: The Process of Recording Practice Exercises for Mixing Students
Thursday October 10, 2024 10:00am - 12:00pm EDT
There are plenty of resources, both scholarly and otherwise, concerning how to mix audio. Students can find information on the process of mixing and descriptions of the tools used by mixing engineers, such as level, panning, equalization, compression, reverb, and delay, through an online search or more in-depth textbooks by authors like Izhaki [1], Senior [2], Owsinski [3], Case [4, 5], Stavrou [6], Moylan [7], and Gibson [8]. However, any professional mixing engineer knows that simply reading and understanding such materials is not enough to develop the skills necessary to become an excellent mixer. Much like developing proficiency and expertise on an instrument, understanding the theory and technical knowledge of mixing is not enough to become a skilled mixer. In order to develop exceptional mixing skills, practice is essential. This begs the question: how should an aspiring mixer practice the art and craft of mixing?
The discussion of this topic is absent from a large portion of the literature on mixing except for Merchant’s 2011 and 2013 Audio Engineering Society convention papers in which Colvin’s concept of deliberate practice is applied to teaching students how to mix [9, 10, 11]. In order for students to carry out deliberate practice in the mixing classroom, quality multitrack recordings are necessary that help students focus on using specific mixing tools and techniques. This paper looks at the process of recording a series of mixing exercises with inherent audio engineering challenges for students to overcome.
Poster

10:00am EDT

Investigation of the impact of pink and white noise in an auditory threshold of detection experiment
Thursday October 10, 2024 10:00am - 12:00pm EDT
Many listening experiments employ so-called "pink" and "white" noise, including those used to find the threshold of detection (TOD). However, little research exists that compares the impacts of these two noises on a task. This study investigates TODs at whole- and third-octave bands using pink and white noise. A modified up-down method was utilized to determine the TOD in white and pink noise at eight frequency levels, using a boosted bell filter convolved with the original signal. Six graduate students in audio engineering participated. Subjects were presented with an unfiltered reference signal followed by a filtered or unfiltered signal and were then instructed to respond "same" or "different". Correct answers decreased the filter boost, while incorrect answers resulted in reversals that increased the filter boost until the subject answered correctly. A trial concluded when a total of ten reversals had occurred. The filter boost levels of the last five reversals were collected to obtain an average threshold value. Results show no significant differences between white and pink noise TOD levels at any frequency interval. Additionally, the just-noticeable difference (JND) between the original and the boosted signal was observed to be ±3 dB, consistent with the existing literature. Therefore, it can be suggested that white or pink noise can be equivalently employed in listening tests such as TOD measurements without impact on perceptual performance.
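A compact sketch of an adaptive up-down track of the kind described above is given below (a generic illustration, not the study's exact rules; step size, starting level, and the `respond` callback are assumptions):

```python
# Sketch: adaptive up-down staircase. The boost steps down after correct
# responses and up after incorrect ones; reversals are tracked and the
# threshold is the mean boost over the last five of ten reversals.
def staircase(respond, start_db=12.0, step_db=1.0, n_reversals=10, n_average=5):
    """respond(boost_db) -> True if the listener answered correctly (hypothetical)."""
    boost, direction = start_db, -1          # start by stepping down
    reversal_levels = []
    while len(reversal_levels) < n_reversals:
        new_direction = -1 if respond(boost) else +1
        if new_direction != direction:       # direction change = reversal
            reversal_levels.append(boost)
            direction = new_direction
        boost = max(0.0, boost + new_direction * step_db)
    return sum(reversal_levels[-n_average:]) / n_average

# Example with a simulated listener whose true threshold is 3 dB.
threshold = staircase(lambda b: b >= 3.0)
```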
Poster

10:00am EDT

New Music Instrument Design & Creative Coding: Assessment of a Sensor-Based Audio and Visual System
Thursday October 10, 2024 10:00am - 12:00pm EDT
Technological developments within the modern age provide near-limitless possibilities when it comes to the design of new musical instruments and systems. The aim of the current study is to investigate the design, development, and testing of a unique audio-visual musical instrument. Five participants tested an initial prototype and completed semi-structured interviews providing user feedback on the product. A qualitative analysis of the interviews indicates findings related to two primary areas: enjoyability and functionality. Aspects of the prototype commonly expressed by participants as being enjoyable related to the product's novelty, simplicity, use of distance sensors, and available interactivity with hardware components. Participant feedback related to improving the prototype's functionality included adding delay and volume modules, making the product surround-sound compatible, and increasing lower-register control for certain sensors. These results informed design decisions in a secondary prototype-building stage. The current study's results suggest that designers and developers of new musical instruments should value novel rather than traditional instrument components, hardware rather than software-based interfaces, and simple rather than complex design controls. Ultimately, this paper provides guiding principles, broad themes, and important considerations in the development of enjoyable new musical instruments and systems.
Poster

10:00am EDT

Subjective Evaluation of Emotions in Music Generated by Artificial Intelligence
Thursday October 10, 2024 10:00am - 12:00pm EDT
Artificial Intelligence (AI) models offer consumers a resource that will generate music in a variety of genres and with a range of emotions from only a text prompt. However, emotion is a complex human phenomenon which becomes even more complex when conveyed through music. There is limited research assessing AI's capability to generate music with emotion. Utilizing specified target emotions, this study examined the validity of those emotions as expressed in AI-generated musical samples. Seven audio engineering graduate students listened to 144 AI-generated musical examples covering sixteen emotions in three genres and reported their impression of the most appropriate emotion for each stimulus. Using Cohen's kappa, minimal agreement was found between the subjects' responses and the AI's target emotions. Results suggest that generating music with a specific emotion is still challenging for AI. Additionally, the AI model used here appeared to operate with a predetermined group of musical samples linked to similar emotions. The discussion includes how this rapidly changing technology might be better studied in the future.
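For readers unfamiliar with the agreement measure used above, a small example of Cohen's kappa follows; the emotion labels and responses are made up for illustration and are not the study's sixteen categories:

```python
# Small example: Cohen's kappa between the AI's target emotions and one
# listener's responses (illustrative labels only).
from sklearn.metrics import cohen_kappa_score

target   = ["joy", "sadness", "anger", "calm", "joy", "fear"]
listener = ["joy", "calm",    "anger", "calm", "fear", "fear"]
# 1.0 = perfect agreement, 0 = chance-level, negative = worse than chance.
print(cohen_kappa_score(target, listener))
```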
Poster

10:00am EDT

Exploring Immersive Recording and Reproduction Methods for Pipe Organs
Thursday October 10, 2024 10:00am - 12:00pm EDT
The pipe organ is often underrepresented in classical music recordings, particularly in immersive audio formats. This study explores an innovative approach to recording and reproducing the organ for immersive formats using a modified Bowles Array. Key considerations for immersive recording and reproduction are examined, including the balance between aestheticism and realism, the role of the LFE channel, and the acoustic characteristics of the recording and reproduction environments, as well as the instrument itself. The findings aim to enhance the immersive audio experience of pipe organ music and provide valuable insights for developing standards in immersive recording and reproduction methods for pipe organ performances.
Speakers
Garrett Treanor

New York University
Authors
Garrett Treanor

New York University
Jeremy Tsuaye

New York University
Jessica Luo

Graduate Student, New York University
Yi Wu

PhD Student, New York University
Poster

10:00am EDT

Pilot Study on Creating Fundamental Methodologies for Capturing Spatial Room Information for XR Listening Environments
Thursday October 10, 2024 10:00am - 12:00pm EDT
Holophonic recording techniques were created and observed to see which techniques would best capture the three-dimensional characteristics of a room, allowing a listener six degrees of freedom (6DoF) within a virtual environment. Two recordings were conducted, one consisting of a solo harp and the second of an African gyil duet. Three microphone systems were prepared to capture the instruments: an XYZ array, a Hamasaki-Ambeo array, and an XYZ array with an additional XY side cluster. The recordings were spatialized in Unity using the Steam Audio plug-in. A preliminary subjective evaluation was conducted with expert subjects ranking each of the systems on certain attributes. Results showed some preference for the XYZ array due to its better capture of reverberation; however, this does not exclude the Hamasaki-Ambeo array from being suitable for holophonic ambience reproduction in virtual environments. These results will be used as a foundation for long-term research on building a methodology for holophonic recording.
Speakers
Parichat Songmuang

Studio Manager/PhD Student, New York University
Parichat Songmuang graduated from New York University with her Master of Music degree in Music Technology and an Advanced Certificate in Tonmeister Studies. As an undergraduate, she studied for her Bachelor of Science in Electronics Media and Film with a concentration…
Authors
Parichat Songmuang

Studio Manager/PhD Student, New York University
Parichat Songmuang graduated from New York University with her Master of Music degree in Music Technology and an Advanced Certificate in Tonmeister Studies. As an undergraduate, she studied for her Bachelor of Science in Electronics Media and Film with a concentration…
Poster
 