Sunday, October 20
 16:00 – 18:00   Registration
 18:15 – 20:00   Dinner
 20:00 – 22:00   Welcome Reception

Monday, October 21
 07:00 – 08:00   Breakfast
 08:00 – 08:50   K1: Keynote Talk by Poppy Crum
 08:50 – 10:10   L1: Audio Event Detection and Classification
 10:10 – 10:30   Coffee Break
 10:30 – 12:30   P1: Music, Audio, and Speech Processing
 12:30 – 16:00   Lunch/Afternoon Break
 16:00 – 18:00   L2: Microphone and Loudspeaker Arrays
 18:15 – 20:00   Dinner
 20:00 – 22:00   Cocktails

Tuesday, October 22
 07:00 – 08:00   Breakfast
 08:00 – 08:50   K2: Keynote Talk by Thushara Abhayapala
 08:50 – 10:10   L3: Music Signal Processing
 10:10 – 10:30   Coffee Break
 10:30 – 12:30   P2: Signal Enhancement and Source Separation
 12:30 – 16:00   Lunch/Afternoon Break
 16:00 – 18:00   L4: Learning from Weak Supervision in Audio Processing
 18:15 – 20:00   Dinner
 20:00 – 22:00   Demonstrations & Cocktails

Wednesday, October 23
 07:00 – 08:00   Breakfast
 08:00 – 08:50   K3: Keynote Talk by Jerome Bellegarda
 08:50 – 10:10   L5: Signal Enhancement and Separation
 10:10 – 10:30   Coffee Break
 10:30 – 12:30   P3: Source Localization, Scene Analysis, and Array Processing
 12:30 – 14:00   Lunch/Closing

Sunday, October 20

Sunday, October 20, 16:00 – 18:00

Registration

Room: Mountain View Room

Sunday, October 20, 18:15 – 20:00

Dinner

Room: West Dining Room

Sunday, October 20, 20:00 – 22:00

Welcome Reception

Room: West Dining Room

Monday, October 21

Monday, October 21, 07:00 – 08:00

Breakfast

Room: West Dining Room

Monday, October 21, 08:00 – 08:50

K1: Keynote Talk by Poppy Crum

Monday, October 21, 08:50 – 10:10

L1: Audio Event Detection and Classification

Lecture 1

Room: Conference House

Sound Event Detection Using Point-Labeled Data
Bongjun Kim (Northwestern University, USA); Bryan Pardo (Northwestern University, USA)
Sound Event Detection (SED) in audio scenes is a task studied by an increasing number of researchers. Recent SED systems often use deep learning models. Building these systems typically requires a large amount of carefully annotated, strongly labeled data, where the exact time span of a sound event in an audio scene is indicated (e.g., the 'dog bark' starts at 1.2 seconds and ends at 2.0 seconds in a recording of a city park). However, manually labeling sound events with their time boundaries within a recording is very time-consuming. One way to address this issue is to collect data with weak labels that only contain the names of the sound classes present in the audio file, without time boundary information for events in the file; weakly-labeled sound event detection has therefore become popular recently. However, there is still a large performance gap between models built on weakly labeled data and ones built on strongly labeled data, especially for predicting the time boundaries of sound events. In this work, we introduce a new type of sound event label, which is easier for people to provide than strong labels. We call them 'point labels'. To create a point label, a user simply listens to the recording and hits the space bar whenever they hear a sound event ('dog bark'). This is much easier than specifying exact time boundaries. We illustrate methods to train a SED model on point-labeled data. Our results show that a model trained on point-labeled audio data significantly outperforms weak models and is comparable to a model trained on strongly labeled data.
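To make the idea of point labels concrete, here is a hedged sketch of one way such single timestamps might be expanded into frame-level training targets; the hop size, window half-width, and function name are illustrative assumptions, not the authors' settings:

```python
import numpy as np

def point_labels_to_frame_targets(points_sec, n_frames, hop_sec=0.02,
                                  half_width_sec=0.5):
    """Expand point labels (one timestamp per heard event) into a binary
    frame-level target vector by activating a window around each point.
    The window half-width is a free design choice, not ground truth."""
    targets = np.zeros(n_frames, dtype=np.float32)
    for t in points_sec:
        start = max(0, int(round((t - half_width_sec) / hop_sec)))
        end = min(n_frames, int(round((t + half_width_sec) / hop_sec)) + 1)
        targets[start:end] = 1.0
    return targets
```

A single space-bar press at 1.0 s then activates roughly one second of frames centered on the press, which a frame-level SED model can be trained against.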
Batch Uniformization for Minimizing Maximum Anomaly Score of DNN-based Anomaly Detection in Sounds
Yuma Koizumi (NTT Media Intelligence Laboratories, Japan); Shoichiro Saito (NTT Media Intelligence Laboratories, Japan); Masataka Yamaguchi (NTT Communication Science Laboratories, Japan); Shin Murata (NTT Media Intelligence Laboratories, Japan); Noboru Harada (NTT Media Intelligence Laboratories, Japan)
Use of an autoencoder (AE) as a normal model is a state-of-the-art technique for unsupervised anomaly detection in sounds (ADS). The AE is trained to minimize the sample mean of the anomaly score of normal sounds in a mini-batch. One problem with this approach is that the anomaly score of rare normal sounds becomes higher than that of frequent normal sounds, because the sample mean is strongly affected by frequent normal samples, resulting in preferentially decreasing the anomaly score of frequent normal samples. To decrease anomaly scores for both frequent and rare normal sounds, we propose batch uniformization, a training method for unsupervised ADS that minimizes a weighted average of the anomaly score over the samples in a mini-batch. We use the reciprocal of the probability density of each sample as the weight; intuitively, a large weight is given to rare normal sounds. Such weighting works to give a constant anomaly score to both frequent and rare normal sounds. Since the probability density is unknown, we estimate it using kernel density estimation on each training mini-batch. Verification and objective experiments show that the proposed batch uniformization improves the performance of unsupervised ADS.
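The weighting scheme described above can be sketched in a few lines of numpy: per-sample weights are the reciprocal of a kernel density estimate computed on the mini-batch, so rare samples count as much as frequent ones. The Gaussian kernel, bandwidth, and normalization here are assumptions for illustration, not the authors' exact implementation:

```python
import numpy as np

def batch_uniformized_loss(scores, feats, bandwidth=1.0):
    """Weighted average of per-sample anomaly scores. Weights are the
    reciprocal of a Gaussian kernel density estimate over the batch, so
    rare normal samples contribute as much as frequent ones."""
    # pairwise squared distances between batch features [batch, dim]
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    density = np.exp(-d2 / (2.0 * bandwidth ** 2)).mean(axis=1)
    weights = 1.0 / (density + 1e-8)
    weights /= weights.sum()
    return float((weights * scores).sum())
```

With a batch of three clustered samples and one outlier, the outlier's score receives as much total weight as the whole cluster, which is exactly the uniformizing effect the abstract describes.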
City Classification from Multiple Real-World Sound Scenes
Helen Bear (Queen Mary University of London, UK); Toni Heittola (Tampere University, Finland); Annamaria Mesaros (Tampere University, Finland); Emmanouil Benetos (Queen Mary University of London, UK); Tuomas Virtanen (Tampere University, Finland)
The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, it inherently ignores some subtleties of the real world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a holistic descriptor like 'park', and others still will use unique identifiers such as cities or names. In this paper, we undertake the task of automatic city classification, asking whether we can recognize a city from a set of sound scenes. In this problem each city has recordings from multiple scenes. We test a series of methods for this novel task and show that a simple convolutional neural network (CNN) can achieve an accuracy of 50%. This is less than the acoustic scene classification task baseline in the DCASE 2018 ASC challenge on the same data. With a simple adaptation to the class labels, pairing city labels with grouped scenes, accuracy increases to 52%, closer to the simpler scene classification task. Finally, we also formulate the problem in a multi-task learning framework and achieve an accuracy of 56%, outperforming the aforementioned approaches.
Model-agnostic Approaches to Handling Noisy Labels When Training Sound Event Classifiers
Eduardo Fonseca (Universitat Pompeu Fabra, Spain); Frederic Font (Universitat Pompeu Fabra, Spain); Xavier Serra (Universitat Pompeu Fabra, Spain)
Label noise is emerging as a pressing issue in sound event classification. It arises as we move towards larger datasets that are difficult to annotate manually, and it is even more severe when datasets are collected automatically from online repositories, where labels are inferred through automated heuristics applied to the audio content or metadata. While learning from noisy labels has been an active area of research in computer vision, it has received little attention in sound event classification. Most existing computer vision approaches to label noise are relatively complex, requiring intricate networks or extra data resources. In this work, we evaluate simple and efficient model-agnostic approaches to handling noisy labels when training sound event classifiers, namely label smoothing regularization, mixup, and noise-robust loss functions. The main advantage of these methods is that they can be incorporated into existing deep learning pipelines without network modifications or extra resources. We report results from experiments conducted with the FSDnoisy18k dataset. We show that these simple methods can be effective in mitigating the effect of label noise, providing up to a 2% accuracy boost when added to a CNN baseline, while requiring minimal intervention and computational overhead.
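Two of the methods named above, label smoothing and mixup, are simple enough to sketch directly; hyperparameter values here are illustrative defaults, not the paper's settings:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: soften hard targets so a possibly wrong label
    is not fit exactly."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """mixup: train on convex combinations of example pairs and their
    labels; lambda is drawn from a Beta(alpha, alpha) distribution."""
    if rng is None:
        rng = np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2
```

Both operate purely on inputs and targets, which is why they drop into an existing training pipeline without any network modification.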

Monday, October 21, 10:30 – 12:30

P1: Music, Audio, and Speech Processing

Poster 1

Room: Parlor

Annotations Time Shift: A Key Parameter in Evaluating Musical Note Onset Detection Algorithms
Mina Mounir (KU Leuven, Belgium); Peter Karsmakers (Katholieke Universiteit Leuven, Belgium); Toon van Waterschoot (KU Leuven, Belgium)
Musical note onset detection is a building block for several MIR-related tasks. The ambiguity in the definition of a note onset and the lack of a standard way to annotate onsets introduce differences in dataset labeling, which in turn make evaluations of note onset detection algorithms difficult to compare. This paper gives an overview of the parameters influencing the commonly used onset detection evaluation measure, the F1-score, pointing out a consistently missing parameter: the overall time shift in annotations. The paper shows how crucial this parameter is in making reported F1-scores comparable among different algorithms and datasets, allowing a more reliable evaluation. As several MIR applications are concerned with the relative location of onsets to each other rather than their absolute location, this paper suggests including the overall time shift as a parameter when evaluating algorithm performance. Experiments show a strong variability in the reported F1-score, with up to a 50% increase in the best-case F1-score when varying the overall time shift. Optimising the time shift turns out to be crucial when training or testing algorithms with datasets that are annotated differently (e.g. manually, automatically, and with different annotators) and especially when using deep learning algorithms.
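The role of the global annotation time shift can be made concrete: sweep a constant shift applied to all detections and report the best F1. The tolerance, matching rule, and shift grid below are illustrative choices, not the paper's protocol:

```python
import numpy as np

def onset_f1(detected, reference, tol=0.025, shift=0.0):
    """F1-score for onset detection: a detection matches a reference
    onset if it falls within +/- tol seconds after applying a global
    time shift to all detections (greedy one-to-one matching)."""
    det = sorted(t + shift for t in detected)
    ref = list(reference)
    tp = 0
    for d in det:
        for i, r in enumerate(ref):
            if abs(d - r) <= tol:
                tp += 1
                del ref[i]
                break
    prec = tp / len(detected) if detected else 0.0
    rec = tp / len(reference) if reference else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def best_shift_f1(detected, reference, shifts):
    """Sweep candidate global shifts and keep the best F1."""
    return max(onset_f1(detected, reference, shift=s) for s in shifts)
```

A detector that is consistently 30 ms late relative to the annotations scores F1 = 0 at a 25 ms tolerance, yet F1 = 1 once the global shift is compensated, which is exactly the comparability problem the paper highlights.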
End-to-End Melody Note Transcription Based on a Beat-Synchronous Attention Mechanism
Ryo Nishikimi (Kyoto University, Japan); Eita Nakamura (Kyoto University, Japan); Masataka Goto (National Institute of Advanced Industrial Science and Technology (AIST), Japan); Kazuyoshi Yoshii (Kyoto University & RIKEN, Japan)
This paper describes an end-to-end audio-to-symbolic singing transcription method for mixtures of vocal and accompaniment parts. Given audio signals with non-aligned melody scores, we aim to train a recurrent neural network that takes as input a magnitude spectrogram and outputs a sequence of melody notes represented by pairs of pitches and note values (durations). A promising approach to such sequence-to-sequence learning (joint input-to-output alignment and mapping) is to use an encoder-decoder model with an attention mechanism. This approach, however, cannot be used straightforwardly for singing transcription because a note-level decoder fails to estimate note values from latent representations obtained by a frame-level encoder, which is good at extracting instantaneous features but poor at extracting temporal features. To solve this problem, we focus on tatums instead of notes as output units and propose a tatum-level decoder that sequentially outputs tatum-level score segments represented by note pitches, note onset flags, and beat and downbeat flags. We then propose a beat-synchronous attention mechanism constrained so that tatum-level scores are monotonically aligned with the input audio signals with a steady increment. The experimental results show that the proposed method can be trained successfully from non-aligned data thanks to the beat-synchronous attention mechanism.
Time-Scale Modification Using Fuzzy Epoch-Synchronous Overlap-Add (FESOLA)
Timothy Roberts (Griffith University, Australia); Kuldip Paliwal (Griffith University, Australia)
A modification to the Epoch-Synchronous Overlap-Add (ESOLA) Time-Scale Modification (TSM) algorithm is proposed in this paper. The proposed method, Fuzzy Epoch-Synchronous Overlap-Add (FESOLA), improves on the previous ESOLA method by using cross-correlation to align time-smeared epochs before overlap-adding. This reduces distortion and artefacts while the speaker's fundamental frequency is stable, as well as reducing artefacts during pitch modulation. The proposed method is tested against well-known TSM algorithms. It is preferred over ESOLA and gives similar performance to other TSM algorithms for voice signals. It is also shown that the algorithm can work effectively with solo instrument signals containing strong fundamental frequencies. A full implementation of the proposed method and the zero frequency resonator can be found at
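The cross-correlation alignment at the core of this family of methods can be sketched generically: find the sample offset at which the next frame best matches the output synthesized so far, then overlap-add at that offset. This is a generic sketch on raw waveforms (FESOLA itself correlates epoch trains), not the released implementation:

```python
import numpy as np

def epoch_alignment_offset(ref, frame):
    """Offset d (in samples) such that frame[n] best matches ref[n + d],
    estimated from the peak of the full cross-correlation. In ESOLA-style
    TSM this alignment is applied before overlap-adding the frame."""
    corr = np.correlate(frame, ref, mode="full")
    k = int(np.argmax(corr)) - (len(ref) - 1)
    return -k
```

For a sinusoid whose next frame is taken 5 samples later in the signal, the estimated offset is 5, so the overlap-add stays phase-coherent.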
High-level Control of Drum Track Generation Using Learned Patterns of Rhythmic Interaction
Stefan Lattner (Sony CSL, France); Maarten Grachten (Sony CSL, France)
Spurred by the potential of deep learning, computational music generation has gained renewed academic interest. A crucial issue in music generation is that of user control, especially in scenarios where the music generation process is conditioned on existing musical material. Here we propose a model for conditional kick drum track generation that takes existing musical material as input, in addition to a low-dimensional code that encodes the desired relation between the existing material and the new material to be generated. These relational codes are learned in an unsupervised manner from a music dataset. We show that codes can be sampled to create a variety of musically plausible kick drum tracks and that the model can be used to transfer kick drum patterns from one song to another. Lastly, we demonstrate that the learned codes are largely invariant to tempo and time-shift.
Investigating Kernel Shapes and Skip Connections for Deep Learning-Based Harmonic-Percussive Separation
Carlos Pedro Vianna Lordelo (Doremir Music Research AB, Sweden & Queen Mary University of London, UK); Emmanouil Benetos (Queen Mary University of London, UK); Simon Dixon (Queen Mary University of London & Centre for Digital Music, UK); Sven Ahlbäck (Doremir Music Research AB, Sweden)
In this paper we propose an efficient deep learning encoder-decoder network for performing Harmonic-Percussive Source Separation (HPSS). It is shown that we are able to greatly reduce the number of model trainable parameters by using a dense arrangement of skip-connections between the model layers. We also explore the utilisation of different kernel sizes for the 2D filters of the convolutional layers with the objective of allowing the network to learn the different time-frequency patterns associated with percussive and harmonic sources more efficiently. The training and evaluation of the separation has been done using the training and test sets of the MUSDB18 dataset. Results show that the proposed deep learning network achieves automatic learning of high-level features and maintains HPSS performance at a state-of-the-art level while reducing the number of parameters and training time.
Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity
Ethan Manilow (Northwestern University, USA); Gordon P Wichern (Mitsubishi Electric Research Laboratories, USA); Prem Seetharaman (Northwestern University, USA); Jonathan Le Roux (Mitsubishi Electric Research Laboratories, USA)
Music source separation performance has greatly improved in recent years with the advent of approaches based on deep learning. Such methods typically require large amounts of labelled training data, which in the case of music consist of mixtures and corresponding instrument stems. However, stems are unavailable for most commercial music, and only limited datasets have so far been released to the public. It can thus be difficult to draw conclusions when comparing various source separation methods, as the difference in performance may stem as much from better data augmentation techniques or training tricks to alleviate the limited availability of training data, as from intrinsically better model architectures and objective functions. In this paper, we present the synthesized Lakh dataset (Slakh) as a new tool for music source separation research. Slakh consists of high-quality renderings of instrumental mixtures and corresponding stems generated from the Lakh MIDI dataset (LMD) using professional-grade sample-based virtual instruments. A first version, Slakh2100, focuses on 2100 songs, resulting in 145 hours of mixtures. While not fully comparable because it is purely instrumental, this dataset contains an order of magnitude more data than MUSDB18, the de facto standard dataset in the field. We show that Slakh can be used to effectively augment existing datasets for musical instrument separation, while opening the door to a wide array of data-intensive music signal analysis tasks.
On the Behavior of Delay Network Reverberator Modes
Orchisama Das (Stanford University, USA); Elliot K Canfield-Dafilou (Stanford University, USA); Jonathan Abel (Stanford University, USA)
The mixing matrix of a Feedback Delay Network (FDN) reverberator is used to control the mixing time and echo density profile. In this work, we investigate the effect of the mixing matrix on the modes (poles) of the FDN with the goal of using this information to better design the various FDN parameters. We find the modal decomposition of delay network reverberators using a state space formulation, showing how modes of the system can be extracted by eigenvalue decomposition of the state transition matrix. These modes, and subsequently the FDN parameters, can be designed to mimic the modes in an actual room. We introduce a parameterized orthonormal mixing matrix which can be continuously varied from identity to Hadamard. We also study how continuously varying diffusion in the mixing matrix affects the damping and frequency of these modes. We observe that modes approach each other in damping and then deflect in frequency as the mixing matrix changes from identity to Hadamard. We also quantify the perceptual effect of increasing mixing by calculating the normalized echo density (NED) of the FDN impulse responses over time.
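The state-space view in the abstract above can be sketched in a few lines: the FDN state stacks the delay-line contents, and the modes (poles) are the eigenvalues of the state transition matrix. A minimal numpy sketch (the function name and the toy two-line FDN below are illustrative, not the authors' code):

```python
import numpy as np

def fdn_transition_matrix(delays, A):
    """State transition matrix of a feedback delay network (FDN): the
    state stacks the contents of all delay lines; each line shifts by
    one sample, and its input is the mixed outputs of all lines (the
    last sample of each line, fed through the mixing matrix A)."""
    starts = np.cumsum([0] + list(delays[:-1]))
    N = int(np.sum(delays))
    T = np.zeros((N, N))
    for i, (si, mi) in enumerate(zip(starts, delays)):
        for n in range(1, mi):              # shift within delay line i
            T[si + n, si + n - 1] = 1.0
        for j, (sj, mj) in enumerate(zip(starts, delays)):
            T[si, sj + mj - 1] = A[i, j]    # mixed feedback into line input
    return T
```

The modes are then `np.linalg.eigvals(T)`. For instance, with an identity mixing matrix scaled by 0.5 and delay lengths 2 and 3, the lines decouple and the pole magnitudes are 0.5^(1/2) and 0.5^(1/3), the roots of z^m = 0.5 per line.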
Graphic Equalizer Design with Symmetric Biquad Filters
Juho Liski (Aalto University School of Electrical Engineering, Finland); Jussi Ramo (Aalto University, Finland); Vesa Valimaki (Aalto University, Finland)
A novel graphic equalizer design comprised of a single second-order section per band is proposed, where the band filters have a symmetric shape about their center frequency in the entire audio range. The asymmetry of the band filters at high frequencies close to the Nyquist limit has been one source of inaccuracy in previous designs. The interaction between the different band filters is accounted for using the weighted least-squares design, which employs an interaction matrix. In contrast to prior works, the interaction matrix is designed with a different prototype gain for each band filter, helping to keep the maximum approximation error below 1 dB at the center frequencies and between them when the neighboring command gains are the same. An iteration step can further diminish the approximation error. Comparisons of the proposed design with previous methods show that it is the most accurate graphic equalizer design to date.
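The weighted least-squares step described above amounts to solving a small linear system in the dB domain. A hedged sketch, where the `interaction` matrix convention (dB response of each band filter at each control frequency for a 1 dB prototype gain) is assumed for illustration:

```python
import numpy as np

def solve_band_gains(interaction, target_db, weights=None):
    """Weighted least-squares gain solve: interaction[i, j] is the dB
    response of band filter j at control frequency i for a prototype
    gain of 1 dB; target_db are the command gains at those frequencies."""
    W = np.diag(weights) if weights is not None else np.eye(len(target_db))
    g, *_ = np.linalg.lstsq(W @ interaction, W @ target_db, rcond=None)
    return g
```

Because neighboring band filters leak into each other's center frequencies (the off-diagonal entries), the solved gains differ from the raw command gains; iterating this solve is what drives the approximation error down further.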
Active Feedback Suppression for Hearing Devices Exploiting Multiple Loudspeakers
Henning Schepker (University of Oldenburg, Germany); Simon Doclo (University of Oldenburg, Germany)
In hearing devices, acoustic feedback frequently occurs due to the coupling between the hearing device loudspeaker(s) and microphone(s). In order to remove the feedback component from the microphone(s), adaptive filters are commonly used. While many hearing devices contain only a single loudspeaker, in this paper we consider a hearing device with multiple loudspeakers in the vent of a custom earpiece. We exploit this availability by pre-processing the loudspeaker signals such that they interfere destructively at the hearing device microphone while the signal at the eardrum is preserved. More specifically, we design a spatial pre-processor that aims at maximizing the maximum stable gain while limiting the distortions of the signal transmitted to the eardrum. Experimental results using measured impulse responses from a custom hearing device with two loudspeakers show that the proposed approach yields a robust reduction of the acoustic feedback while preserving the signal at the eardrum.
Perceptual Evaluation of Binaural Auralization of Data Obtained from the Spatial Decomposition Method
Jens Ahrens (Chalmers University of Technology, Sweden)
We present a perceptual evaluation of head-tracked binaural renderings of room impulse response data obtained from the spatial decomposition method. These data comprise an omnidirectional impulse response as well as instantaneous propagation directions of the sound field. The subjects in our experiment compared auralizations of these data according to the originally proposed method against direct auralizations of dummy head measurements of the exact same scenarios. We tested various parameters such as the size of the microphone array, the number of microphones, and the HRTF grid resolution. Our study shows that most parameter sets lead to a perception that is very similar to the dummy head data, particularly with respect to spaciousness. The remaining audible differences are considered small and relate primarily to timbre. This suggests that the equalization procedure that is part of the approach has potential for improvement. Our results also show that the elevation of the propagation directions may be quantized coarsely without audible impairment.
Sparse Representation of HRTFs by Ear Alignment
Zamir Ben Hur (Ben-Gurion University of the Negev, Israel & Facebook Reality Labs, USA); David Lou Alon (Facebook Reality Labs, USA); Ravish Mehra (Facebook Reality Labs, USA); Boaz Rafaely (Ben-Gurion University of the Negev, Israel)
High quality spatial sound reproduction requires individualized Head Related Transfer Functions (HRTFs) with high spatial resolution. However, measuring such HRTFs requires special and expensive equipment, which may be unavailable for most users. Therefore, reproduction of high resolution HRTFs from sparsely measured HRTFs is of great importance. Recently, the spherical-harmonics (SH) representation has been suggested for performing spatial interpolation. However, the fact that HRTFs are naturally of high spatial order leads to truncation and aliasing errors in the SH representation. Thus, pre-processing of HRTFs with the aim of reducing their effective SH order is a potential solution of great interest. A recent study compared between several pre-processing methods, and concluded that time-alignment leads to the lowest SH order. However, time-alignment of an HRTF requires an accurate estimation of its time delay, which is not always available, especially for contralateral directions. In this paper, a pre-processing method based on ear alignment is presented. This method is performed parametrically, which makes it more robust to measurement noise. Evaluation of the method is performed numerically, showing significant reduction in the effective SH order and in the interpolation error.
Morphological Weighting Improves Individualized Prediction of HRTF Directivity Patterns
Muhammad Shahnawaz (Politecnico di Milano, Italy); Craig Jin (University of Sydney, Australia); Joan Glaunes (Université Paris Descartes, France); Augusto Sarti (Politecnico di Milano, Italy); Anthony Tew (University of York, UK)
In this work, we explore the potential for morphological weighting of different regions of the pinna (outer ear) to improve the prediction of acoustic directivity patterns associated with head-related transfer functions. Using a large deformation diffeomorphic metric mapping framework, we apply kernel principal component analysis to model the pinna morphology. Different regions of the pinna can be weighted differently prior to the kernel principal component analysis. By varying the weights applied to the various regions of the pinna, we begin to learn the relative importance of the various regions to the acoustic directivity of the ear as a function of frequency. The pinna is divided into nine parts comprising the helix, scaphoid fossa, triangular fossa, concha rim, cymba concha, cavum concha, conchal ridge, ear lobe, and back of the ear. Results indicate that weighting the conchal region (concha rim, cavum and cymba concha) improves the predicted acoustic directivity for frequency bands centered around 3 kHz, 7 kHz, 10 kHz and 13 kHz. Similarly, weighting the triangular and scaphoid fossa improves the prediction of acoustic directivity in frequency bands centered around 7 kHz, 13 kHz and 15.5 kHz.
EEG-based Decoding of Auditory Attention to a Target Instrument in Polyphonic Music
Giorgia Cantisani (LTCI, Télécom Paris, Institut Polytechnique de Paris, France); Slim Essid (Telecom ParisTech & CNRS/LTCI, France); Gaël Richard (LTCI, Télécom Paris, Institut Polytechnique de Paris, France)
Auditory attention decoding aims at determining which sound source a subject is “focusing on”. In this work, we address the problem of EEG-based decoding of auditory attention to a target instrument in realistic polyphonic music. To this end, we exploit the so-called multivariate temporal response function, which has been shown to successfully decode attention to speech in multi-speaker environments. To our knowledge, this model has never been applied to musical stimuli for decoding attention.
The task we consider here is quite complex as the stimuli used are polyphonic, including duets and trios, and are reproduced using loudspeakers instead of headphones. We consider the decoding of three different audio representations and investigate the influence on the decoding performance of multiple variants of musical stimuli, such as the number and type of instruments in the mixture, the spatial rendering, the music genre and the melody/rhythmical pattern that is played. We obtain promising results, comparable to those obtained on speech data in previous works, and confirm that it is thus possible to correlate the human brain’s activity with musically relevant features of the attended source.
Intrusive and Non-Intrusive Perceptual Speech Quality Assessment Using a Convolutional Neural Network
Hannes Gamper (Microsoft Research, USA); Chandan Reddy (Microsoft Corporation, USA); Ross Cutler (Microsoft, USA); Ivan J. Tashev (Microsoft Research, USA); Johannes Gehrke (Microsoft Corporation, USA)
Speech quality, as perceived by humans, is an important performance metric for telephony and voice services. It is typically measured through subjective listening tests, which can be tedious and expensive. Algorithms such as PESQ and POLQA serve as a computational proxy for subjective listening tests. Here we propose using a convolutional neural network to predict the perceived quality of speech with noise, reverberation, and distortions, both intrusively and non-intrusively, i.e., with and without a clean reference signal. The network model is trained and evaluated on a corpus of about ten thousand utterances labeled by human listeners to derive a Mean Opinion Score (MOS) for each utterance. It is shown to provide more accurate MOS estimates than existing speech quality metrics, including PESQ and POLQA. The proposed method reduces the root mean squared error from 0.48 to 0.35 MOS points and increases the Pearson correlation from 0.78 to 0.89 compared to the state-of-the-art POLQA algorithm.
An Improved Measure of Musical Noise Based on Spectral Kurtosis
Matteo Torcoli (Fraunhofer IIS, Germany)
Audio processing methods operating on a time-frequency representation of the signal can introduce unpleasant sounding artifacts known as musical noise. These artifacts are observed in the context of audio coding, speech enhancement, and source separation. The change in kurtosis of the power spectrum introduced during the processing was shown to correlate with the human perception of musical noise in the context of speech enhancement, leading to the proposal of measures based on it. These baseline measures are here shown to correlate with human perception only in a limited manner. As ground truth for the human perception, the results from two listening tests are considered: one involving audio coding and one involving source separation. Simple but effective perceptually motivated improvements are proposed and the resulting new measure is shown to clearly outperform the baselines in terms of correlation with the results of both listening tests. Moreover, with respect to the listening test on musical noise in audio coding, the exhibited correlation is nearly as good as the one exhibited by the Artifact-related Perceptual Score (APS), which was found to be the best objective measure for this task. The APS is however computationally very expensive. The proposed measure is easily computed, requiring only a fraction of the computational cost of the APS.
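The baseline idea referenced above, measuring the change in kurtosis of the power spectrum, can be sketched as follows. This is the baseline measure the paper starts from, not the improved measure it proposes:

```python
import numpy as np

def kurtosis(x):
    """Kurtosis (non-excess, i.e. 4th standardized moment) of x."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    s2 = ((x - m) ** 2).mean()
    return float(((x - m) ** 4).mean() / (s2 ** 2))

def kurtosis_ratio(power_before, power_after):
    """Baseline musical-noise indicator: ratio of the kurtosis of the
    power spectrum after processing to that before. Isolated surviving
    spectral peaks (musical tones) drive the ratio well above 1."""
    return kurtosis(power_after) / kurtosis(power_before)
```

A smooth spectrum reduced to a few isolated peaks, the typical musical-noise signature, yields a ratio far above 1.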
An Efficient Model for Estimating Subjective Quality of Separated Audio Source Signals
Thorsten Kastner (International Audio Laboratories Erlangen, Germany); Jürgen Herre (International Audio Laboratories Erlangen, Germany)
Audio source separation, i.e., the separation of one or more target sources from a given audio signal mixture, has been a vivid and growing research field in recent years. Applications are emerging that allow users to create a personal mix of a music recording, or to adapt the audio levels of the sports commentator and the stadium atmosphere in sports broadcasts to their own preference or hearing abilities.
The perceived quality of the produced audio signals is a key factor in rating these separation systems. In order to optimize them, an efficient, perceptually based measurement scheme to predict the perceived audio quality would be highly beneficial. Existing evaluation models, such as BSSEval or PEASS, suffer from poor prediction of perceived quality or excessive computational complexity.
In this paper a model for prediction of the perceived audio quality of separated audio source signals is presented, solely based on two timbre features, demanding less computational effort than current perceptual measurement schemes for audio source separation. High correlation of the model output with perceived quality is demonstrated.
A Classification-Aided Framework for Non-Intrusive Speech Quality Assessment
Xuan Dong (Indiana University, USA); Donald Williamson (Indiana University, USA)
Objective metrics, such as the perceptual evaluation of speech quality (PESQ), have become standard measures for evaluating speech. These metrics enable efficient and inexpensive evaluations, where ratings are often computed by comparing a degraded speech signal to its underlying clean reference signal. Reference-based metrics, however, cannot be used to evaluate real-world signals whose references are inaccessible. This project develops a nonintrusive framework for evaluating the perceptual quality of noisy and enhanced speech. We propose an utterance-level classification-aided nonintrusive (UCAN) assessment approach that combines the task of quality score classification with the regression task of quality score estimation. Our approach uses a categorical quality ranking as an auxiliary constraint to assist with quality score estimation, where we jointly train a multi-layered convolutional neural network in a multi-task manner. This approach is evaluated using the TIMIT speech corpus and several noises under a wide range of signal-to-noise ratios. The results show that the proposed system significantly improves quality score estimation as compared to several state-of-the-art approaches.
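The joint classification-plus-regression training described above can be illustrated with a toy multi-task objective; the function name, the weighting `lam`, and the additive loss combination are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def ucan_style_loss(score_pred, score_true, class_logits, class_true, lam=0.5):
    """Multi-task objective: MSE on the quality-score regression head
    plus cross-entropy on an auxiliary quality-category head."""
    class_true = np.asarray(class_true)
    mse = float(np.mean((np.asarray(score_pred) - np.asarray(score_true)) ** 2))
    # numerically stable log-softmax over the class logits
    z = class_logits - class_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = float(-log_probs[np.arange(len(class_true)), class_true].mean())
    return mse + lam * ce
```

The categorical head acts as a constraint: a network whose score regression is accurate but whose coarse quality category is wrong still incurs loss through the cross-entropy term.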
Identification of Voice Quality Variation Using I-vectors
Chuyao Feng (Georgia Institute of Technology, USA); David Anderson (Georgia Institute of Technology, USA); Eva van Leer (Georgia State University, USA)
Voice disorders affect a large portion of the population, in particular impacting heavy voice users such as teachers or call-center workers. Voice therapy, the recommended behavioral treatment for a variety of voice disorders, requires regular voice technique practice under the guidance of a voice therapist. Patients commonly have difficulty reproducing this technique without clinician feedback once they get home. Therefore, we developed a system for use in voice therapy to provide feedback to the patient about the quality of their voice practice. Based on i-vector analysis, the system effectively examines a speaker's different vocal modes as if they represented different speakers (i.e. your good-voice self vs. your bad-voice self). Six adults were recorded producing five different voice quality modes: normal, breathy, fry, twang, and hyponasal. An i-vector-based algorithm was trained on the six participants to classify these vocal modes with 97.7% accuracy. The system can be used to detect different voice quality modes in unscripted, connected speech, which has potential to automate analysis of home practice in voice therapy and to serve as a feedback tool to extend therapist judgment beyond the clinic walls.
3D Localized Sound Zone Generation with a Planar Omni-Directional Loudspeaker Array
Takuma Okamoto (National Institute of Information and Communications Technology, Japan)
This paper provides a 3D localized sound zone generation method using a planar omni-directional loudspeaker array. In the proposed method, multiple co-centered circular arrays are arranged on the horizontal plane and an additional loudspeaker is located at the array's center. The sound field produced by this center loudspeaker is then cancelled using the multiple circular arrays. A localized 3D sound zone can thus be generated inside a sphere whose maximum radius equals that of the circular arrays, because the residual sound field is contained within the sphere. The resulting sound fields are decomposed into spherical harmonic spectra and the driving function of the array is then obtained. Compared with the conventional approach that uses monopole pairs to control the even and odd spherical harmonic spectrum components, the proposed method can be realized simply with a practical planar omni-directional array because it is sufficient to control only the 0-th order component. Computer simulations confirm the effectiveness of the proposed approach.

Monday, October 21, 12:30 – 16:00

Lunch/Afternoon Break

Room: West Dining Room

Monday, October 21, 16:00 – 18:00

L2: Microphone and Loudspeaker Arrays

Lecture 2

Room: Conference House

Motion-Tolerant Beamforming with Deformable Microphone Arrays
Ryan M Corey (University of Illinois at Urbana-Champaign, USA); Andrew C. Singer (University of Illinois at Urbana Champaign, USA)
Microphone arrays are usually assumed to have rigid geometries: the microphones may move with respect to the sound field but remain fixed relative to each other. However, many useful arrays, such as those in wearable devices, have sensors that can move relative to each other. We compare two approaches to beamforming with deformable microphone arrays: first, by explicitly tracking the geometry of the array as it changes over time, and second, by designing a time-invariant beamformer based on the second-order statistics of the moving array. The time-invariant approach is shown to be appropriate when the motion of the array is small relative to the acoustic wavelengths of interest. The performance of the proposed beamforming system is demonstrated using a wearable microphone array on a moving human listener in a cocktail-party scenario.
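As a rough illustration of the time-invariant approach, the sketch below averages the spatial covariance matrix and steering vector over a set of slightly perturbed array geometries and forms an MVDR-style beamformer from the averaged second-order statistics. All geometry, frequency, and loading values here are invented for illustration; this is not the authors' implementation.

```python
import numpy as np

def mvdr_weights(R, d):
    """MVDR beamformer: w = R^{-1} d / (d^H R^{-1} d)."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / np.vdot(d, Rinv_d)

rng = np.random.default_rng(0)
n_mics, n_poses = 4, 20
f, c, theta = 1000.0, 343.0, np.deg2rad(30.0)  # Hz, m/s, look direction
base = np.linspace(0.0, 0.3, n_mics)           # nominal linear array (m)

# Average second-order statistics over small random deformations.
R = np.zeros((n_mics, n_mics), dtype=complex)
d = np.zeros(n_mics, dtype=complex)
for _ in range(n_poses):
    pos = base + 0.01 * rng.standard_normal(n_mics)  # deformed geometry
    sv = np.exp(-2j * np.pi * f * pos * np.sin(theta) / c)
    R += np.outer(sv, sv.conj())
    d += sv
R = R / n_poses + 0.01 * np.eye(n_mics)  # small diagonal loading
d = d / n_poses

w = mvdr_weights(R, d)  # distortionless toward the averaged steering vector
```

By construction the beamformer satisfies w^H d = 1, so small deformations are absorbed into the averaged statistics rather than tracked explicitly.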
An EM Method for Multichannel TOA and DOA Estimation of Acoustic Echoes
Jesper Rindom Jensen (Aalborg University, Denmark); Usama Saqib (Aalborg University, Denmark); Sharon Gannot (Bar-Ilan University, Israel)
The time-of-arrivals (TOAs) of acoustic echoes are a prerequisite in, e.g., room geometry estimation and localization of acoustic reflectors, which can be an enabling technology for autonomous robots and drones. However, solving these problems using TOAs alone introduces the difficult problem of echo labeling. Moreover, it is typically suggested to estimate the TOAs by estimating the room impulse response and finding its peaks, but this approach is vulnerable to noise (e.g., ego noise). We therefore propose an expectation-maximization (EM) method for estimating both the TOAs and direction-of-arrivals (DOAs) of acoustic echoes using a loudspeaker and a uniform circular array (UCA). Our results show that this approach is more robust to noise than the traditional peak-finding approach. Moreover, they show that the TOA and DOA information can be combined to estimate wall positions directly without considering echo labeling.
Speech Enhancement Using Polynomial Eigenvalue Decomposition
Vincent W Neo (Imperial College London, United Kingdom (Great Britain)); Christine Evers (Imperial College London, United Kingdom (Great Britain)); Patrick A Naylor (Imperial College London, United Kingdom (Great Britain))
Speech enhancement is important for applications such as telecommunications, hearing aids, automatic speech recognition and voice-controlled systems. Enhancement algorithms aim to reduce interfering noise while minimizing any speech distortion. In this work, we propose to use polynomial matrices for speech enhancement in order to exploit the spatial, spectral and temporal correlations between the speech signals received by the microphone array. Polynomial matrices provide the necessary mathematical framework to exploit constructively the spatial correlations within and between sensor pairs, as well as the spectral-temporal correlations of broadband signals such as speech. Specifically, the polynomial eigenvalue decomposition (PEVD) decorrelates simultaneously in space, time and frequency. We then propose a PEVD-based speech enhancement algorithm. Simulations and informal listening examples have shown that our method achieves noise reduction without introducing artefacts into the enhanced signal for white, babble and factory noise conditions between -10 dB and 30 dB SNR.
Sub-Sample Time Delay Estimation via Auxiliary-Function-Based Iterative Updates
Kouei Yamaoka (Tokyo Metropolitan University, Japan); Robin Scheibler (Tokyo Metropolitan University & Japanese Society for the Promotion of Science, Japan); Nobutaka Ono (Tokyo Metropolitan University, Japan); Wakabayashi Yukoh (Tokyo Metropolitan University, Japan)
We propose an efficient iterative method to estimate a sub-sample time delay between two signals. We formulate it as the optimization problem of maximizing the generalized cross correlation (GCC) of the two signals in terms of a continuous time delay parameter. The maximization is carried out with an auxiliary function method. First, we prove that, when written as a sum of cosines, the GCC can be lower bounded at any point by a quadratic function. By repeatedly maximizing this lower bound, an alternative update algorithm for the estimation of the time delay is derived. We follow through with numerical experiments highlighting that, given a reasonable initial estimate, the proposed method converges quickly to the maximum of the GCC. In addition, we show that the method is robust to noise and attains the Cramér-Rao lower bound (CRLB).
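The continuous-lag objective can be made concrete with a small sketch. Note that this is not the paper's auxiliary-function update: instead of maximizing a quadratic lower bound, the sketch below evaluates the same sum-of-cosines GCC (with PHAT weighting, one common choice) on a fine grid of continuous lags and refines the peak by parabolic interpolation.

```python
import numpy as np

def gcc_phat(x1, x2):
    """PHAT-weighted cross-spectrum of two real signals."""
    G = np.conj(np.fft.rfft(x1)) * np.fft.rfft(x2)
    return G / (np.abs(G) + 1e-12)

def gcc_score(G, tau, n):
    """GCC at a continuous lag tau (in samples): a sum of cosines over bins."""
    k = np.arange(len(G))
    return np.real(G * np.exp(2j * np.pi * k * tau / n)).sum()

def estimate_delay(x1, x2, max_lag=10.0, step=0.1):
    """Grid search over continuous lags, then parabolic peak refinement."""
    n = len(x1)
    G = gcc_phat(x1, x2)
    taus = np.arange(-max_lag, max_lag + step, step)
    scores = np.array([gcc_score(G, t, n) for t in taus])
    i = int(np.argmax(scores))
    if 0 < i < len(taus) - 1:
        a, b, c = scores[i - 1 : i + 2]
        return taus[i] + 0.5 * step * (a - c) / (a - 2 * b + c)
    return taus[i]
```

A white-noise signal circularly delayed by a fractional amount (e.g. 2.3 samples) in the frequency domain should be recovered to a small fraction of a sample.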
Active Noise Control over 3D Space with Multiple Circular Arrays
Huiyuan Sun (The Australian National University, Australia); Thushara D. Abhayapala (Australian National University, Australia); Prasanga Samarasinghe (Australian National University, Australia)
Spatial active noise control (ANC) systems focus on minimizing unwanted acoustic noise over a continuous spatial region. Conventionally, spatial ANC is attempted using MIMO systems, and recently novel methods have been developed using spherical harmonic analysis of spatial sound fields. A major limitation in implementing the latter approach is the requirement of regularly distributed microphones and loudspeakers over spherical arrays. In this paper, we relax the above constraint by constructing a system utilizing multiple circular microphone and loudspeaker arrays, and by designing a feed-forward adaptive filtering algorithm for noise reduction over a 3D region. By simulation, we demonstrate that the proposed method can achieve ANC performance comparable to conventional spherical array methods, while being more feasible to implement in practice.
Sound Field Translation Methods for Binaural Reproduction
Lachlan Birnie (Australian National University, USA); Thushara D. Abhayapala (Australian National University, Australia); Prasanga Samarasinghe (Australian National University, Australia); Vladimir Tourbabin (Facebook Reality Labs & FACEBOOK INC, USA)
Virtual-reality reproduction of real-world acoustic environments often fixes the listener position to that of the microphone. In this paper, we propose a method for listener translation in a virtual reproduction that incorporates a mix of near-field and far-field sources. Compared to conventional plane-wave techniques, the mixed-source method offers stronger near-field reproduction and translation capabilities in the case of a sparse virtualization.

Monday, October 21, 18:15 – 20:00

Dinner

Room: West Dining Room

Monday, October 21, 20:00 – 22:00

Cocktails

Room: West Dining Room

Tuesday, October 22

Tuesday, October 22, 07:00 – 08:00

Breakfast

Room: West Dining Room

Tuesday, October 22, 08:00 – 08:50

K2: Keynote Talk by Thushara Abhayapala

Tuesday, October 22, 08:50 – 10:10

L3: Music Signal Processing

Lecture 3

Room: Conference House

Feedback Structures for a Transfer Function Model of a Circular Vibrating Membrane
Maximilian Schaefer (Friedrich-Alexander-Universitaet Erlangen-Nuernberg (FAU), Germany); Rudolf Rabenstein (Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany); Sebastian J. Schlecht (Friedrich-Alexander University Erlangen-Nürnberg, Germany)
The attachment of feedback loops to physical or musical systems enables a large variety of possibilities for the modification of the system behavior. Feedback loops may enrich the echo density of feedback delay networks (FDNs), or enable the realization of complex boundary conditions in physical simulation models for sound synthesis. Inspired by control theory, a general feedback loop is attached to a model of a vibrating membrane. The membrane model is based on the modal expansion of an initial-boundary value problem formulated in a state-space description. The possibilities of the attached feedback loop are shown by three examples, namely the introduction of additional mode-wise damping; modulation and damping inspired by FDN feedback loops; and time-varying modification of the system behavior.
Dense Reverberation with Delay Feedback Matrices
Sebastian J. Schlecht (Friedrich-Alexander University Erlangen-Nürnberg, Germany); Emanuël Habets (International Audio Laboratories Erlangen, Germany)
Feedback delay networks (FDNs) belong to a general class of recursive filters which are widely used in artificial reverberation and decorrelation applications. One central challenge in the design of FDNs is the generation of sufficient echo density in the impulse response without compromising the computational efficiency. In a previous contribution, we have demonstrated that the echo density of an FDN grows polynomially over time, and that the growth depends on the number and lengths of the delays. In this work, we introduce so-called delay feedback matrices (DFMs) where each matrix entry is a scalar gain and a delay. While the computational complexity of DFMs is similar to a scalar-only feedback matrix, we show that the echo density grows significantly faster over time, however, at the cost of non-uniform modal decays.
Physical Models for Fast Estimation of Guitar String, Fret and Plucking Position
Jacob Møller Hjerrild (Audio Analysis Lab, CREATE, Aalborg University & TC Electronic, Denmark); Silvin Willemsen (Multisensory Experience Lab, CREATE, Aalborg University, Denmark); Mads Græsbøll Christensen (Aalborg University, Denmark)
In this paper, a novel method for analyzing guitar performances is proposed. It is both fast and effective at extracting the activated string, fret, and plucking position from guitar recordings. The method is derived from guitar-string physics and, unlike the state of the art, does not require audio recordings as training data. A maximum a posteriori classifier is proposed for estimating the string and fret based on a simulated model of feature vectors, while the plucking position is estimated using the estimated inharmonic partials. The method extracts features from audio with a pitch estimator that also estimates the inharmonicity of the string. The string and fret classifier is evaluated on recordings of an electric and an acoustic guitar under noisy conditions. The performance is comparable to the state of the art and is shown to degrade at SNRs below 20 dB. The plucking position estimator is evaluated in a proof-of-concept experiment with sudden changes of string, fret and plucking positions, which shows that these can be estimated accurately. The proposed method operates on individual 40 ms segments and is thus suitable for high-tempo and real-time applications.
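The string physics underlying the feature model can be summarized by the standard stiff-string inharmonicity relation f_k = k·f0·sqrt(1 + B·k²), where B is the inharmonicity coefficient. A minimal sketch (the f0 and B values below are illustrative, not taken from the paper):

```python
import numpy as np

def partial_freqs(f0, B, n_partials):
    """Stiff-string partials: f_k = k * f0 * sqrt(1 + B * k^2)."""
    k = np.arange(1, n_partials + 1)
    return k * f0 * np.sqrt(1.0 + B * k**2)

# Illustrative values: open low E string (82.4 Hz) with a small
# inharmonicity coefficient B (not taken from the paper).
f = partial_freqs(82.4, 1e-4, 5)
```

Partials are progressively sharpened relative to the harmonic series k·f0, and it is this deviation, which depends on the physical string, that carries string and fret information.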
Joint Singing Pitch Estimation and Voice Separation Based on a Neural Harmonic Structure Renderer
Tomoyasu Nakano (National Institute of Advanced Industrial Science and Technology (AIST), Japan); Kazuyoshi Yoshii (Kyoto University & RIKEN, Japan); Yiming Wu (Kyoto University, Japan); Ryo Nishikimi (Kyoto University, Japan); Kin Wah Edward Lin (National Institute of Advanced Industrial Science and Technology (AIST), Japan); Masataka Goto (National Institute of Advanced Industrial Science and Technology (AIST), Japan)
This paper describes a multi-task learning approach to joint extraction (fundamental frequency (F0) estimation) and separation of singing voices from music signals. While deep neural networks have been used successfully for each task, both tasks have not been dealt with simultaneously in the context of deep learning. Since vocal extraction and separation are considered to have a mutually beneficial relationship, we propose a unified network that consists of a deep convolutional neural network for vocal F0 saliency estimation and a U-Net with an encoder shared by two decoders specialized for separating vocal and accompaniment parts, respectively. Between these two networks we introduce a differentiable layer that converts an F0 saliency spectrogram into harmonic masks indicating the locations of harmonic partials of a singing voice. The physical meaning of harmonic structure is thus reflected in the network architecture. The harmonic masks are then effectively used as scaffolds for estimating fine-structured masks thanks to the excellent capability of the U-Net for domain-preserving conversion (e.g., image-to-image conversion). The whole network can be trained jointly by backpropagation. Experimental results showed that the proposed unified network outperformed the conventional independent networks for vocal extraction and separation.

Tuesday, October 22, 10:30 – 12:30

P2: Signal Enhancement and Source Separation

Poster 2

Room: Parlor

Analysis of Robustness of Deep Single-Channel Speech Separation Using Corpora Constructed from Multiple Domains
Matthew K Maciejewski (Johns Hopkins University, USA); Gregory Sell (Johns Hopkins University, USA); Yusuke Fujita (Hitachi, Ltd., Japan); Leibny Paola Garcia Perera (Johns Hopkins University, USA); Shinji Watanabe (Johns Hopkins University, USA); Sanjeev Khudanpur (Johns Hopkins University, USA)
Deep-learning based single-channel speech separation has been studied with great success, though evaluations have typically been limited to relatively controlled environments based on clean, near-field, and read speech. This work investigates the robustness of such representative techniques in more realistic environments with multiple and diverse conditions. To this end, we first construct datasets from the Mixer 6 and CHiME-5 corpora, featuring studio interviews and dinner parties respectively, using a procedure carefully designed to generate desirable synthetic overlap data sufficient for evaluation as well as for training deep learning models. Using these new datasets, we demonstrate the substantial shortcomings of these separation techniques under mismatched conditions. Though multi-condition training greatly mitigated the performance degradation in near-field conditions, an important finding is that both matched and multi-condition training leave significant gaps to the oracle performance in far-field conditions, which advocates a need for extending existing separation techniques to deal with far-field, highly reverberant speech mixtures.
A Style-Transfer Approach to Source Separation
Shrikant Venkataramani (University of Illinois at Urbana Champaign, USA); Efthymios Tzinis (University of Illinois at Urbana-Champaign, USA); Paris Smaragdis (University of Illinois at Urbana-Champaign, USA)
Training neural networks for source separation involves presenting a mixture recording at the input of the network and updating network parameters in order to produce an output that resembles the clean source. Consequently, supervised source separation depends on the availability of paired mixture-clean training examples. In this paper, we interpret source separation as a style transfer problem. We present a variational auto-encoder network that exploits the commonality across the domain of mixtures and the domain of clean sounds and learns a shared latent representation across the two domains. Using these cycle-consistent variational auto-encoders, we learn a mapping from the mixture domain to the domain of clean sounds and perform source separation without explicitly supervising with paired training examples.
Universal Sound Separation
Ilya Kavalerov (University of Maryland, USA); Scott Wisdom (Google, USA); Hakan Erdogan (Google, USA); Brian Patton (Google Research, USA); Kevin Wilson (Google, USA); Jonathan Le Roux (Mitsubishi Electric Research Laboratories, USA); John Hershey (Google, USA)
Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
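The improvements above are reported in scale-invariant signal-to-distortion ratio (SI-SDR). A minimal sketch of that metric, assuming the common definition with mean removal and optimal scaling of the reference (the paper's exact evaluation code is not shown here):

```python
import numpy as np

def si_sdr(est, ref):
    """Scale-invariant SDR in dB (mean-removed, optimally scaled reference)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    alpha = np.dot(est, ref) / np.dot(ref, ref)   # optimal scaling factor
    target = alpha * ref                          # projection onto reference
    noise = est - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```

Rescaling the estimate leaves the score unchanged, which is exactly the scale invariance that makes the metric suitable for separation outputs with arbitrary gain.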
Deep Tensor Factorization for Spatially-Aware Scene Decomposition
Jonah Casebeer (University of Illinois at Urbana-Champaign, USA); Michael Colomb (University of Illinois at Urbana-Champaign, USA); Paris Smaragdis (University of Illinois at Urbana-Champaign, USA)
We propose a completely unsupervised method to understand audio scenes observed with random microphone arrangements by decomposing the scene into its constituent sources and their relative presence in each microphone. To this end, we formulate a neural network architecture that can be interpreted as a nonnegative tensor factorization of a multi-channel audio recording. By clustering on the learned network parameters corresponding to channel content, we can learn sources’ individual spectral dictionaries and their activation patterns over time. Our method allows us to leverage deep learning advances like end-to-end training, while also allowing stochastic minibatch training so that we can feasibly decompose realistic audio scenes that are intractable to decompose using standard methods. This neural network architecture is easily extensible to other kinds of tensor factorizations.
Independent Vector Analysis with More Microphones than Sources
Robin Scheibler (Tokyo Metropolitan University & Japanese Society for the Promotion of Science, Japan); Nobutaka Ono (Tokyo Metropolitan University, Japan)
We extend frequency-domain blind source separation based on independent vector analysis to the case where there are more microphones than sources. The signal is modeled as non-Gaussian sources in a Gaussian background. The proposed algorithm is based on a parametrization of the demixing matrix that decreases the number of parameters to estimate. Furthermore, orthogonal constraints between the signal and background subspaces are imposed to regularize the separation. The problem can then be posed as a constrained likelihood maximization. We propose efficient alternative updates of the demixing filters based on the auxiliary function technique. The performance of the algorithm is assessed on both simulated and recorded signals. We find that the separation performance is on par with that of the conventional determined algorithm at a much lighter computational cost.
Sparse Adaptation of Distributed Blind Source Separation in Acoustic Sensor Networks
Michael Guenther (University of Erlangen-Nuremberg, Germany); Haitham Afifi (Paderborn University, Germany); Andreas Brendel (University Erlangen-Nürnberg, Germany); Holger Karl (Paderborn University, Germany); Walter Kellermann (University Erlangen-Nuremberg, Germany)
By distributing the computational load over the nodes of a Wireless Acoustic Sensor Network (WASN), the real-time capability of the TRINICON (TRIple-N-Independent component analysis for CONvolutive mixtures) framework for Blind Source Separation (BSS) can be ensured, even if the individual network nodes are not powerful enough to run TRINICON in real-time by themselves. To optimally utilize the limited computing power and data rate in WASNs, the MARVELO (Multicast-Aware Routing for Virtual network Embedding with Loops in Overlays) framework is expanded for use with TRINICON, while a feature-based selection scheme is proposed to exploit the most beneficial parts of the input signal for adapting the demixing system. The simulation results of realistic scenarios show only a minor degradation of the separation performance even in heavily resource-limited situations.
Multiple Hypothesis Tracking for Overlapping Speaker Segmentation
Aidan O. T. Hogg (Imperial College London, United Kingdom (Great Britain)); Christine Evers (Imperial College London, United Kingdom (Great Britain)); Patrick A Naylor (Imperial College London, United Kingdom (Great Britain))
Speaker segmentation is an essential part of any diarization system. Applications of diarization include tasks such as speaker indexing, improving automatic speech recognition (ASR) performance and making single-speaker-based algorithms available for use in multi-speaker environments. This paper proposes a multiple hypothesis tracking (MHT) method that exploits the harmonic structure associated with the pitch in voiced speech in order to segment the onsets and end-points of speech from multiple, overlapping speakers. The proposed method is evaluated against a segmentation system from the literature that uses a spectral representation and is based on bidirectional long short-term memory networks (BLSTM). The proposed method is shown to achieve comparable performance for segmenting overlapping speakers while using only the pitch harmonic information in the MHT framework.
Declipping Speech Using Deep Filtering
Wolfgang Mack (International Audio Laboratories Erlangen, Germany); Emanuël Habets (International Audio Laboratories Erlangen, Germany)
Recorded signals can be clipped when the sound pressure or the analog signal amplification is too large. Clipping is a non-linear distortion which limits the maximal magnitude modulation of the signal, changes the energy distribution in the frequency domain, and hence degrades the quality of the recording. Consequently, for declipping, some frequencies have to be amplified and others attenuated. We propose a declipping method using the recently proposed deep filtering technique, which is capable of extracting and reconstructing a desired signal from a degraded input. Deep filtering operates in the short-time Fourier transform (STFT) domain, estimating a complex multidimensional filter for each desired STFT bin. The filters are applied to defined areas of the clipped STFT to obtain, for each filter, a single complex STFT bin estimate of the declipped STFT. The filter estimation is performed via a deep neural network trained with simulated data using soft- or hard-clipping. The loss function minimizes the reconstruction mean-squared error between the non-clipped and the declipped STFTs. We evaluate our approach using simulated data degraded by hard- and soft-clipping and conducted a pairwise-comparison listening test with measured signals, comparing our approach to one commercial and one open-source declipping method. Our approach outperformed the baselines when declipping measured speech signals with strong and medium clipping.
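The two degradation types used to simulate training data can be sketched directly. Hard clipping is plain saturation; for soft clipping, a tanh saturation is used below as one common smooth model (the abstract does not specify the exact soft-clipping function):

```python
import numpy as np

def hard_clip(x, c):
    """Hard clipping: saturate the waveform at +/- c."""
    return np.clip(x, -c, c)

def soft_clip(x, c):
    """Soft clipping via tanh saturation (one common smooth model;
    the paper's exact soft-clipping function may differ)."""
    return c * np.tanh(x / c)
```

Both keep small amplitudes nearly untouched while bounding the peaks, which is the non-linear distortion a declipper has to invert.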
Speech Bandwidth Extension with WaveNet
Archit Gupta (DeepMind, United Kingdom (Great Britain)); Brendan Shillingford (DeepMind, United Kingdom (Great Britain)); Yannis Assael (DeepMind, United Kingdom (Great Britain)); Thomas Walters (DeepMind, United Kingdom (Great Britain))
Large-scale mobile communication systems tend to contain legacy transmission channels with narrowband bottlenecks, resulting in characteristic `telephone-quality' audio. While higher quality codecs exist, due to the scale and heterogeneity of the networks, transmitting higher sample rate audio with modern high-quality audio codecs can be difficult in practice. This paper proposes an approach where a communication node can instead extend the bandwidth of a band-limited incoming speech signal that may have been passed through a low-rate codec. To this end, we propose a WaveNet-based model conditioned on a log-mel spectrogram representation of a bandwidth-constrained 8 kHz speech signal, possibly carrying artifacts from GSM full-rate (FR) compression, to reconstruct the higher-resolution signal. In our experimental MUSHRA evaluation, we show that a model trained to upsample to 24 kHz speech signals from audio passed through the 8 kHz GSM-FR codec reconstructs audio only slightly lower in quality than that of the Adaptive Multi-Rate Wideband (AMR-WB) codec at 16 kHz, and closes around half the gap in perceptual quality between the original encoded signal and the original speech sampled at 24 kHz. We further show that when the same model is given 8 kHz audio that has not been compressed, it again reconstructs audio of slightly better quality than 16 kHz AMR-WB in the same MUSHRA evaluation.
IRM with Phase Parameterization for Speech Enhancement
Xianyun Wang (Beijing University of Technology, P.R. China); Changchun Bao (Beijing University of Technology, P.R. China); Rui Cheng (Beijing University of Technology, P.R. China)
Deep neural networks (DNNs) have become a popular means for separating target speech from noisy speech in supervised speech enhancement due to their good performance in learning higher-level information. For DNN-based methods, the training target and acoustic features have a significant impact on the performance of speech restoration. The ideal ratio mask (IRM) is commonly used as the training target, but it generally does not take phase information into account. Recent studies have revealed that incorporating phase information into the mask can effectively help improve the speech quality of the enhanced speech. In this paper, a bounded IRM with phase parameterization is presented and used as the training target of the DNN model. In addition, some acoustic features with harmonic preservation are incorporated into the input of the DNN model as additional information to improve the quality of the enhanced speech. The experiments are performed under various noise environments and signal-to-noise ratio (SNR) conditions. The results show that the proposed method outperforms the reference methods.
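For context, the plain IRM and one common way of folding phase into a bounded mask (the phase-sensitive mask) are sketched below; the paper's specific phase parameterization is not reproduced here.

```python
import numpy as np

def irm(S, N):
    """Ideal ratio mask from clean-speech and noise magnitude spectra."""
    return np.sqrt(S**2 / (S**2 + N**2 + 1e-12))

def psm(S, Y):
    """Phase-sensitive mask from complex clean (S) and noisy (Y) STFT bins:
    |S|/|Y| * cos(phase difference), bounded to [0, 1]."""
    m = np.abs(S) / (np.abs(Y) + 1e-12) * np.cos(np.angle(S) - np.angle(Y))
    return np.clip(m, 0.0, 1.0)
```

The cosine term shrinks the mask when the noisy phase is unreliable, which is the intuition behind phase-aware training targets.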
Generative Speech Enhancement Based on Cloned Networks
Michael Chinen (Google, Inc., USA); W. Bastiaan Kleijn (Victoria University of Wellington, New Zealand); Felicia S. C. Lim (Google LLC, USA); Jan Skoglund (Google, Inc., USA)
We propose to implement speech enhancement by the regeneration of clean speech from a `salient' representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the extractor network. The clones receive mel-frequency spectra of different noisy versions of the same speech signal as input. By encouraging the outputs of the clones to be similar for these different input signals, we train a feature extractor network that is robust to noise. At inference, the salient features form the input to a WaveNet network that generates a natural and clean speech signal with the same attributes as the ground-truth clean signal. As the signal becomes noisier, our system produces natural-sounding errors that stay on the speech manifold, in place of traditional artifacts found in other systems. Our experiments confirm that our generative enhancement system provides state-of-the-art enhancement performance within the generative class of enhancers according to a MUSHRA-like test. The clone-based system matches or outperforms the other systems at each input signal-to-noise ratio (SNR) range with statistical significance.
Improvement of Speech Residuals for Speech Enhancement
Samy Elshamy (Technische Universität Braunschweig, Germany); Tim Fingscheidt (Technische Universität Braunschweig, Germany)
In this work we present two novel methods to improve speech residuals for speech enhancement. A deep neural network is used to enhance residual signals in the cepstral domain, thereby exceeding a former cepstral excitation manipulation (CEM) approach in different ways: One variant provides higher speech component quality by 0.1 PESQ points in low-SNR conditions, while another one delivers substantially higher noise attenuation by 1.5 dB, without loss of speech component quality or speech intelligibility. Compared to traditional speech enhancement based on the decision-directed (DD) a priori SNR estimation, a gain of even up to 3.5 dB noise attenuation is obtained. A semi-formal comparative category rating (CCR) subjective listening test confirms the superiority of the proposed approach over DD by 0.25 CMOS points (or even by 0.48 if two outlier subjects are not considered).
Simultaneous Denoising, Dereverberation, and Source Separation Using a Unified Convolutional Beamformer
Tomohiro Nakatani (NTT Corporation, Japan); Keisuke Kinoshita (NTT Corporation, Japan); Rintaro Ikeshita (NTT Corporation, Japan); Hiroshi Sawada (NTT Corporation, Japan); Shoko Araki (NTT Communication Science Laboratories, Japan)
This article investigates applicability of a Weighted Power minimization Distortionless response convolutional beamformer (WPD) to simultaneous denoising, dereverberation, and source separation. The WPD is a recently proposed beamformer that performs denoising and dereverberation simultaneously by unifying a Weighted Prediction Error dereverberation method (WPE) and a Minimum Power Distortionless Response beamformer (MPDR) into a single convolutional beamformer. In this paper, we extend the application of the WPD not only to simultaneous denoising and dereverberation, but also to source separation. For this purpose, we introduce a source parameter estimation unit that estimates the steering vectors and the time-varying powers of all the sources from noisy reverberant sound mixtures, and integrate it with the WPD. We experimentally confirm the effectiveness of the integrated method in terms of objective speech enhancement measures and Automatic Speech Recognition (ASR) performance.
A Perceptual Weighting Filter Loss for DNN Training in Speech Enhancement
Ziyue Zhao (Institute for Communications Technology, Technische Universität Braunschweig, Germany); Samy Elshamy (Technische Universität Braunschweig, Germany); Tim Fingscheidt (Technische Universität Braunschweig, Germany)
Single-channel speech enhancement with deep neural networks (DNNs) has shown promising performance and is thus intensively being studied. In this paper, instead of applying the mean squared error (MSE) as the loss function during DNN training for speech enhancement, we design a perceptual weighting filter loss motivated by the weighting filter as it is employed in analysis-by-synthesis speech coding, e.g., in code-excited linear prediction (CELP). The experimental results show that the proposed simple loss function improves the speech enhancement performance compared to a reference DNN with MSE loss in terms of perceptual quality and noise attenuation. The proposed loss function can be advantageously applied to an existing DNN-based speech enhancement system, without modification of the DNN topology for speech enhancement.
Speech Enhancement Using End-to-End Speech Recognition Objectives
Aswin Shanmugam Subramanian (Johns Hopkins University, USA); Xiaofei Wang (Johns Hopkins University, USA); Murali Karthick Baskar (Brno University of Technology, Czech Republic); Shinji Watanabe (Johns Hopkins University, USA); Toru Taniguchi (Preferred Networks, Inc., Japan); Dung Tran (Yahoo Japan Corporation, Japan); Yuya Fujita (Yahoo Japan Corporation, Japan)
Speech enhancement systems, which denoise and dereverberate distorted signals, are usually optimized based on signal reconstruction objectives such as maximum likelihood and minimum mean square error. However, emergent end-to-end neural methods make it possible to optimize the speech enhancement system with more application-oriented objectives. For example, we can jointly optimize speech enhancement and automatic speech recognition (ASR) using only ASR error minimization criteria. The major contribution of this paper is to investigate how a system optimized based on the ASR objective improves the speech enhancement quality on various signal-level metrics in addition to the ASR word error rate (WER) metric. We use a recently developed multichannel end-to-end (ME2E) system, which integrates neural dereverberation, beamforming, and attention-based speech recognition within a single neural network. Additionally, we propose to extend the dereverberation subnetwork of ME2E by dynamically varying the filter order in linear prediction using reinforcement learning, and to extend the beamforming subnetwork by incorporating the estimation of a speech distortion factor. The experiments reveal how well different signal-level metrics correlate with the WER metric, and verify that learning-based speech enhancement can be realized by end-to-end ASR training objectives without using parallel clean and noisy data.
Separated Noise Suppression and Speech Restoration: LSTM-Based Speech Enhancement in Two Stages
Maximilian Strake (Technische Universität Braunschweig, Germany); Bruno Defraene (NXP Semiconductors, Product Line Voice and Audio Solutions, Belgium); Kristoff Fluyt (NXP Semiconductors, Product Line Voice and Audio Solutions, Belgium); Wouter Tirry (NXP Semiconductors, Product Line Voice and Audio Solutions, Belgium); Tim Fingscheidt (Technische Universität Braunschweig, Germany)
Regression based on neural networks (NNs) has led to considerable advances in speech enhancement under non-stationary noise conditions. Nonetheless, speech distortions can be introduced when employing NNs trained to provide strong noise suppression. We propose to address this problem by first suppressing noise and subsequently restoring speech with specifically chosen NN topologies for each of these distinct tasks. A mask-estimating long short-term memory (LSTM) network is employed for noise suppression, while the speech restoration is performed by a fully convolutional encoder-decoder (CED) network, where we introduce temporal modeling capabilities by using a convolutional LSTM layer in the bottleneck. We show considerable performance gains over reference methods of up to 0.26 MOS points (PESQ) and the ability to significantly improve intelligibility in terms of STOI for low-SNR conditions.
Fast Convergence Algorithm for State-Space Model Based Speech Dereverberation by Multi-Channel Non-Negative Matrix Factorization
Masahito Togami (LINE Corporation, Japan); Tatsuya Komatsu (LINE Corporation, Japan)
In this paper, a multi-channel speech dereverberation technique based on a state-space model whose convergence speed is faster than the conventional method is proposed. The proposed method can skip a time-consuming Kalman smoother step, which is utilized in the conventional parameter optimization method based on the expectation-maximization (EM) algorithm. Instead, the proposed method optimizes parameters with an auxiliary function approach similarly to multi-channel non-negative matrix factorization. The proposed cost function is derived as an approximation of the log-likelihood function of the original state-space model under the assumption that a part of the sufficient statistics of latent state vectors are fixed at the parameter optimization step. The sufficient statistics of the state-space model can be estimated in the Kalman filter part without the Kalman smoother. In the proposed method, the Kalman filter and minimization of the approximated cost function are iteratively performed. Experimental results show that the proposed method can increase the original likelihood function faster than the conventional method. Speech dereverberation experiments under noisy environments show that the proposed method can reduce reverberation effectively.
Attention Wave-U-Net for Speech Enhancement
Ritwik Giri (Amazon Web Services, USA); Umut Isik (Amazon Web Services, USA); Arvindh Krishnaswamy (Amazon AWS ML/DSP/Audio & CCRMA, EE, Stanford, USA)
We propose a novel application of an attention mechanism in neural speech enhancement, by presenting a U-Net architecture with attention mechanism, which processes the raw waveform directly, and is trained end-to-end. We find that the inclusion of the attention mechanism significantly improves the performance of the model in terms of the objective speech quality metrics, and outperforms other published raw-waveform-based models on the Voice Bank Corpus (VCTK) dataset. We observe that the final layer attention mask has an interpretation as a soft Voice Activity Detector (VAD). We also present some initial results to show the efficacy of the proposed system as a pre-processing step to speech recognition systems.
Dilated FCN: Listening Longer to Hear Better
Shuyu Gong (Ohio University, USA); Zhewei Wang (Ohio University, USA); Tao Sun (Ohio University, USA); Yuanhang Zhang (Ohio University, USA); Charles Smith (University of Kentucky, USA); Li Xu (Ohio University, USA); Jundong Liu (Ohio University, USA)
Deep neural network solutions have emerged as a new and powerful paradigm for speech enhancement (SE). The capabilities to capture long context and extract multi-scale patterns are crucial to design effective SE networks. Such capabilities, however, are often in conflict with the goal of maintaining compact networks to ensure good system generalization.
In this paper, we explore dilation operations and apply them to fully convolutional networks (FCNs) to address this issue. Dilations equip the networks with greatly expanded receptive fields without significantly increasing the number of parameters. Different strategies for fusing multi-scale dilations, as well as for where to insert the dilation modules, are explored in this work. Using the Noisy VCTK and AzBio English datasets, we demonstrate that the proposed dilation models improve over the baseline FCN and outperform state-of-the-art SE solutions.
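The parameter-versus-context trade-off described above can be made concrete with a little receptive-field arithmetic. The sketch below (illustrative layer sizes, not the configuration from the paper) shows how exponentially increasing dilation rates expand the receptive field of a 1-D convolutional stack while the parameter count stays fixed:

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in input frames) of a stack of 1-D convolutions.

    Each layer with kernel size k and dilation d widens the field by
    (k - 1) * d frames; the parameter count depends only on k, not d.
    """
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Four 3-tap layers without dilation: receptive field of 9 frames.
plain = receptive_field([3, 3, 3, 3], [1, 1, 1, 1])

# The same four layers with dilations 1, 2, 4, 8: identical parameter
# count, but the receptive field grows to 31 frames.
dilated = receptive_field([3, 3, 3, 3], [1, 2, 4, 8])
```

This is the sense in which a dilated FCN can "listen longer" without giving up compactness.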

Tuesday, October 22, 12:30 – 16:00

Lunch/Afternoon Breakgo to top

Room: West Dining Room

Tuesday, October 22, 16:00 – 18:00

L4: Learning from Weak Supervision in Audio Processinggo to top

Lecture 4

Room: Conference House

Unsupervised Adversarial Domain Adaptation Based on the Wasserstein Distance for Acoustic Scene Classification
Konstantinos Drossos (Tampere University, Finland); Paul Magron (Tampere University, Finland); Tuomas Virtanen (Tampere University, Finland)
A challenging problem in the field of deep-learning-based machine listening is the degradation of performance when using data from unseen conditions. In this paper we focus on the acoustic scene classification (ASC) task and propose an adversarial deep learning method that allows adapting an acoustic scene classification system to deal with a new acoustic channel resulting from data captured with a different recording device. We build upon the theoretical model of the HΔH-distance and a previous adversarial discriminative deep learning method for ASC unsupervised domain adaptation, and we present an adversarial training based method using the Wasserstein distance. We improve the state-of-the-art mean accuracy on data from unseen conditions from 32% to 45%, using the TUT Acoustic Scenes dataset.
Zero-Shot Audio Classification Based on Class Label Embeddings
Huang Xie (Tampere University, Finland); Tuomas Virtanen (Tampere University, Finland)
This paper proposes a zero-shot learning approach for audio classification based on textual information about class labels, without any audio samples from the target classes. We propose an audio classification system built on a bilinear model, which takes audio feature embeddings and semantic class label embeddings as input and measures the compatibility between an audio feature embedding and a class label embedding. We use VGGish to extract audio feature embeddings from audio recordings. We treat textual labels as semantic side information about audio classes and use Word2Vec to generate class label embeddings. Results on the ESC-50 dataset show that the proposed system can perform zero-shot audio classification with a small training dataset. It achieves accuracy (26% on average) better than random guessing (10%) on each audio category, reaching up to 39.7% for the category of natural audio classes.
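The compatibility measure at the heart of such a bilinear model is F(x, y) = xᵀWy between an audio embedding x and a label embedding y, with W learned from seen classes. A minimal sketch with toy 2-dimensional embeddings (the paper uses VGGish audio embeddings and Word2Vec label embeddings; the vectors and the fixed W below are purely illustrative):

```python
def bilinear_score(audio_emb, W, label_emb):
    """Compatibility F(x, y) = x^T W y; higher means a better match."""
    xW = [sum(x_i * W[i][j] for i, x_i in enumerate(audio_emb))
          for j in range(len(label_emb))]
    return sum(v * y_j for v, y_j in zip(xW, label_emb))

def classify(audio_emb, W, label_embs):
    """Zero-shot prediction: choose the label whose embedding scores
    highest; unseen classes need only a label embedding, no audio."""
    scores = {name: bilinear_score(audio_emb, W, emb)
              for name, emb in label_embs.items()}
    return max(scores, key=scores.get)

# Toy example: with W fixed to the identity, the score reduces to a dot
# product between the audio and label embeddings.
W = [[1.0, 0.0], [0.0, 1.0]]
labels = {"dog_bark": [0.9, 0.1], "rain": [0.1, 0.9]}
prediction = classify([1.0, 0.0], W, labels)
```

In the actual system, W is trained on seen classes so that matching audio/label pairs score higher than mismatched ones.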
Identify, Locate and Separate: Audio-visual Object Extraction in Large Video Collections Using Weak Supervision
Sanjeel Parekh (Telecom ParisTech, France); Alexey Ozerov (Technicolor Research & Innovation, France); Slim Essid (Telecom ParisTech & CNRS/LTCI, France); Ngoc Q. K. Duong (Technicolor, France); Patrick Perez (France); Gaël Richard (LTCI, Télécom Paris, Institut Polytechnique de Paris, France)
We tackle the problem of audio-visual scene analysis for weakly-labeled data. To this end, we build upon our previous audio-visual representation learning framework to perform object classification in noisy acoustic environments and integrate audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorization for the audio modality. Our approach is founded on the multiple instance learning paradigm. Its effectiveness is established through experiments over a challenging dataset of music instrument performance videos. We also show encouraging visual object localization results.
Weakly Informed Audio Source Separation
Kilian Schulze-Forster (LTCI, Télécom Paris, Institut Polytechnique de Paris, France); Clement S J Doire (Audionamix, France); Gaël Richard (LTCI, Télécom Paris, Institut Polytechnique de Paris, France); Roland Badeau (LTCI, Télécom Paris, Institut Polytechnique de Paris, France)
Prior information about the target source can improve source separation quality but is usually not available with the necessary level of audio alignment. This has limited its usability in the past. We propose a source separation model based on the attention mechanism that can nevertheless exploit such weak information for the separation task while aligning it on the mixture as a byproduct. In experiments with artificial side information with different levels of expressiveness we demonstrate the capabilities of the proposed model. Moreover, we highlight an issue with the common separation quality assessment procedure regarding parts where targets or predictions are silent and refine a previous contribution for a more complete evaluation.
TriCycle: Audio Representation Learning from Sensor Network Data Using Self-Supervision
Mark Cartwright (New York University, USA); Jason Cramer (New York University, USA); Justin Salamon (Adobe Research, USA); Juan Bello (New York University, USA)
Self-supervised representation learning with deep neural networks is a powerful tool for machine learning tasks with limited labeled data but extensive unlabeled data. To learn representations, self-supervised models are typically trained on a pretext task to predict structure in the data (e.g. audio-visual correspondence, short-term temporal sequence, word sequence) that is indicative of higher-level concepts relevant to a target, downstream task. Sensor networks are promising yet unexplored sources of data for self-supervised learning – they collect large amounts of unlabeled yet timestamped data over extended periods of time and typically exhibit long-term temporal structure (e.g., over hours, months, years) not observable at the short time scales previously explored in self-supervised learning (e.g., seconds). This structure can be present even in single-modal data and therefore could be exploited for self-supervision in many types of sensor networks. In this work, we present a model for learning audio representations by predicting the long-term, cyclic temporal structure in audio data collected from an urban acoustic sensor network. We then demonstrate the utility of the learned audio representation in an urban sound event detection task with limited labeled data.
Deep Ranking-Based Sound Source Localization
Renana Opochinsky (Bar-Ilan University, Israel); Bracha Laufer (Bar-Ilan University, Israel); Sharon Gannot (Bar-Ilan University, Israel); Gal Chechik (Bar Ilan University, USA)
Sound source localization is still considered an unsolved problem in challenging reverberation conditions. Recently, there is a growing interest in developing learning-based localization methods. In this approach, acoustic features are extracted from the measured signals and then given as input to a model that maps them to the corresponding source positions. Typically, a massive dataset of labelled samples from known positions is required to train such models. Here, we present a novel deep-learning localization method that exploits only a few labelled samples with known positions, as well as a larger set of unlabelled samples, for which we only know their relative physical ordering. We design an architecture that uses this partial information with a triplet ranking loss to learn a nonlinear deep embedding that maps the acoustic features to the azimuth angle of the source. We show how to combine this weak supervision with known locations of few samples into a single optimization problem that can be trained efficiently using a gradient-based approach. Evaluating the proposed approach on simulated data, we demonstrate its significant improvement over two previous learning-based approaches for various reverberation levels, while maintaining consistent performance with varying sizes of labelled data.

Tuesday, October 22, 18:15 – 20:00

Dinnergo to top

Room: West Dining Room

Tuesday, October 22, 20:00 – 22:00

Demonstrations & Cocktailsgo to top

Room: West Dining Room

Real-time Electric Guitar String, Fret, Plucking Position Estimation, and Pitch Tracking

Mads Græsbøll Christensen

Estimates of the guitar string, fret, and plucking position have applications in music learning and detailed transcription. In this demo, we show that these features can be estimated in real time with more than 90% estimation and detection accuracy. By clicking the "start guitar demo" button, the fret and string played on a guitar and the plucking position are estimated every 40 ms. The demo uses a fast and effective method for estimating guitar string, fret, and plucking position based on a physical model of string excitation and vibration. The method is more computationally efficient on two levels: the user is not required to record audio for training, and the model is obtained from one simulation instead of from features estimated from several audio recordings. Detailed, class-dependent distributions in the feature space (pitch and inharmonicity) can be obtained. As mentioned, pitch information is required for estimating the guitar string, fret, and plucking position; it is also useful in various applications such as speech synthesis, voice disorder detection, and speech enhancement. Therefore, robust Bayesian pitch tracking software is also included as part of this demo, with two modes: 1) real-time pitch tracking and 2) off-line pitch tracking. By clicking the "start live recording" button, audience members can record their voice with a headset, and the demo software estimates the pitch, voicing probability, and harmonic order every 40 ms. The audience can also add different types of noise (babble, factory, white, pink, etc.) at different SNRs to the recorded speech signals to see the robustness of the proposed method against noise. In this demo, a fully Bayesian harmonic-model-based pitch tracking approach is used. By using the harmonic model, as opposed to non-parametric methods, improved robustness against noise can be obtained.

First-order Markov processes are used to capture the temporal dynamics of pitch, harmonic order, and voicing probability. By using information from previous frames, the rates of pitch estimation errors and voicing detection errors can be reduced. Compared with the state-of-the-art harmonic-model-based pitch tracking method, we consider the temporal dynamics not only of pitch and voicing but also of the harmonic order, which enables us to detect whether any pitch is present and to estimate the pitch and harmonic order jointly and accurately. Past information on pitch is exploited to improve robustness against temporal voicing changes. Furthermore, by adopting a fully Bayesian approach to the model weights and observation noise, overfitting can be avoided. By assigning a proper transition pdf for the weights, a fast non-linear least squares estimation method can easily be incorporated into the proposed algorithm, leading to low computational and storage complexities.

RTF-steered Binaural MVDR Beamforming Incorporating Multiple External Microphones
Nico Gößling, Wiebke Middelberg, and Simon Doclo

In this demonstration, we consider the binaural minimum-variance distortionless-response (BMVDR) beamformer, a well-known noise reduction algorithm for hearing devices that can be steered using the relative transfer function (RTF) vector of the desired speech source. Exploiting the availability of an external microphone that is spatially separated from the head-mounted microphones, an efficient method has recently been proposed to estimate the RTF vector in a diffuse noise field. When multiple external microphones are available, a different RTF vector estimate can be obtained by applying this method to each external microphone. In the corresponding WASPAA 2019 paper, we propose several procedures to combine these RTF vector estimates: selecting the estimate corresponding to the highest input SNR, simply averaging the estimates, or combining the estimates so as to maximize the output SNR of the BMVDR beamformer. Based on realistic recordings of a moving desired speaker, an interfering speaker, and pseudo-diffuse babble noise in a reverberant lab, with hearing devices mounted on an artificial head and several external microphones, visitors can use an interactive MATLAB GUI to select between different real-time implementations of the RTF vector combination procedures used in the BMVDR beamformer. The binaural output signals are played back via headphones, which visitors can listen to in real time.

MIRaGe: Multichannel Database of Room Impulse Responses Measured on High-resolution Cube-shaped Grid in Multiple Acoustic Conditions

Jaroslav Cmejla, Tomas Kounovsky, Sharon Gannot, Zbynek Koldovsky, and Pinchas Tandeitnik

In this demonstration, we present a new database of multi-channel recordings performed in an acoustic laboratory with adjustable reverberation time. The recordings can be used to compute (relative) room impulse responses (RIRs). Compared to other similar databases, the recordings were performed for 4,104 positions within a 3D volume that forms a dense grid, where the volume is a 46 × 36 × 32 cm cube. The database thus provides a novel tool for detailed spatial analyses of the acoustic field within a real-world room, which until now was possible only with an acoustic room simulator. For example, the database enables us to visualize and compare beam patterns of spatial filters under three reverberation time settings. Also, small movements of a source within the cube can be simulated and their influence on the performance of a given audio processing algorithm analyzed.

State-of-the-art virtual reality demonstration and research

Vladimir Tourbabin and Jacob R. Donley

Virtual and augmented reality are fast-emerging technologies. They find application in many areas of everyday life, including gaming, the workplace, education, medicine, and more. They have the potential to transform our lives and completely redefine the way people interact with the world and each other. This demo will feature the Quest headset, a state-of-the-art head-mounted display (HMD) from Oculus VR / Facebook, allowing delegates to experience immersive virtual reality through combined visual, audio, and tactile interactions. Despite the growing interest in virtual and augmented reality, multiple technical challenges remain unsolved, many of which relate particularly to acoustics and the audio modality. We will highlight some of the major challenges and present an overview of the efforts being undertaken in the Audio Team at Facebook Reality Labs to solve them.

Room: Parlor Room

Perceptually Optimized Sound Zones

Mads Græsbøll Christensen

Creating sound zones has been an active research field since the idea was first proposed. So far, most sound zone control methods rely either on optimizing physical metrics such as acoustic contrast and signal distortion, or on a mode decomposition of the desired sound field. With these methods, approximately 15 dB of acoustic contrast has been reported in practical broadband set-ups, but this is typically not high enough to satisfy the people inside the zones. Introducing more loudspeakers might raise the contrast further; instead, we would like to make the 15 dB of acoustic contrast sufficient. Inspired by perceptual audio coding, we aim to shape the interference in a given zone so that it is less noticeable, or ideally inaudible, to the listener inside that zone. This can be done by taking the characteristics of the input signals and of human auditory perception into account. We do so by extending the variable span trade-off (VAST) framework, which trades off acoustic contrast against signal distortion, to incorporate these characteristics. The sound fields reproduced by the proposed and existing methods are generated in a MATLAB demonstration, in which one can listen to the reproduced sound at a selected position. In this way, one can experience how perceptually optimized sound zones work.
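For reference, the acoustic contrast figure quoted above is the ratio, in dB, of the mean acoustic energy in the bright (listening) zone to that in the dark (quiet) zone. A minimal sketch of this standard definition (the variable names are ours):

```python
import math

def acoustic_contrast_db(bright_energy, dark_energy):
    """Acoustic contrast: bright-zone to dark-zone mean energy ratio in dB."""
    return 10.0 * math.log10(bright_energy / dark_energy)

# A bright zone carrying roughly 32x the dark-zone energy corresponds to
# the ~15 dB of contrast reported for practical broadband set-ups.
contrast = acoustic_contrast_db(31.6, 1.0)
```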

Real-Time Direction of Arrival Estimation Using Deep Learning

Wolfgang Mack, Soumitro Chakrabarty, and Emanuel Habets

The direction of arrival (DOA) of audio sources is an important parameter in many microphone array applications, and can be used for enhancement or separation tasks. State-of-the-art techniques employ deep neural networks (DNNs) to estimate the DOA from the phase of the microphone signals. Here, we demonstrate a real-time DOA estimator. The algorithm estimates the DOA, with a 5-degree resolution, of multiple sound sources per time frame from the microphone phases. We consider a uniform linear array with four microphones and an inter-microphone distance of 8 cm. The phase of the microphone signals is processed by the DNN, which yields the DOAs. We present our results with two interfaces: a polar plot of the real-time DOA, and the development of the DOAs over time.

Wednesday, October 23

Wednesday, October 23, 07:00 – 08:00

Breakfastgo to top

Room: West Dining Room

Wednesday, October 23, 08:00 – 08:50

K3: Keynote Talk by Jerome Bellegardago to top

Room: Conference House

Wednesday, October 23, 08:50 – 10:10

L5: Signal Enhancement and Separationgo to top

Lecture 5

Room: Conference House

Independent Low-Rank Matrix Analysis with Decorrelation Learning
Rintaro Ikeshita (NTT Corporation, Japan); Nobutaka Ito (NTT, Japan); Tomohiro Nakatani (NTT Corporation, Japan); Hiroshi Sawada (NTT Corporation, Japan)
This paper addresses the determined convolutive blind source separation (BSS) problem. The state-of-the-art independent low-rank matrix analysis (ILRMA), unifying independent component analysis (ICA) and nonnegative matrix factorization, has the disadvantage of ignoring inter-frame and inter-frequency spectral correlation of source signals. We here propose a new BSS method that estimates a linear transformation for spectral decorrelation and performs ILRMA in the transformed domain. A newly introduced optimization problem is an extension of that for ICA based on maximum likelihood. For this problem, we provide a necessary and sufficient condition for the existence of optimal solutions, and develop algorithms based on block coordinate descent methods with closed-form solutions. Experimental results show the improved separation performance of the proposed method compared to ILRMA.
Generalized Weighted-Prediction-Error Dereverberation with Varying Source Priors for Reverberant Speech Recognition
Toru Taniguchi (Preferred Networks, Inc., Japan); Aswin Shanmugam Subramanian (Johns Hopkins University, USA); Xiaofei Wang (Johns Hopkins University, USA); Dung Tran (Yahoo Japan Corporation, Japan); Yuya Fujita (Yahoo Japan Corporation, Japan); Shinji Watanabe (Johns Hopkins University, USA)
Weighted-prediction-error (WPE) is a well-known dereverberation signal processing method, used especially to alleviate the degradation of automatic speech recognition (ASR) performance in distant-speaker scenarios. WPE usually assumes that the desired source signals follow a predefined source prior, such as a Gaussian with time-varying variances (TVG). Although WPE works well in practice under this assumption, the proper prior generally depends on the source and cannot be known in advance of processing. On-demand estimation of source priors, e.g., per utterance, is thus required. For this purpose, we extend WPE by introducing a complex-valued generalized Gaussian (CGG) prior and an estimator of its shape parameter inside the processing, to deal with a variety of super-Gaussian sources. Blind estimation of the shape parameter is realized by adding a shape parameter estimator as a sub-network to WPE-CGG, treated as a differentiable neural network. The sub-network can be trained by backpropagation from the outputs of the whole network using any criterion, such as signal-level mean square error or even ASR errors, if the WPE-CGG computational graph is connected to that of the ASR network. Experimental results show that the proposed method outperforms conventional baseline methods with the TVG prior, even though proper shape parameter values are not given in the evaluation.
Multichannel Speech Enhancement Based on Time-frequency Masking Using Subband Long Short-Term Memory
Xiaofei Li (INRIA Grenoble Rhône-Alpes, France); Radu P. Horaud (Inria, France)
We propose a multichannel speech enhancement method using a long short-term memory (LSTM) recurrent neural network. The proposed method is developed in the short-time Fourier transform (STFT) domain. An LSTM network common to all frequency bands is trained, which processes each frequency band individually by mapping the multichannel noisy STFT coefficient sequence to the corresponding STFT magnitude ratio mask sequence of one reference channel. This subband LSTM network exploits the differences between the temporal/spatial characteristics of speech and noise: the speech source is non-stationary and coherent, while noise is stationary and less spatially correlated. Experiments with different types of noise show that the proposed method outperforms a baseline deep-learning-based full-band method and an unsupervised method. In addition, since it does not learn the wideband spectral structure of either speech or noise, the proposed subband LSTM network generalizes very well to unseen speakers and noise types.
Parametric Resynthesis with Neural Vocoders
Soumi Maiti (The Graduate Center, CUNY, USA); Michael Mandel (Brooklyn College, CUNY, USA)
Noise suppression systems generally produce output speech with compromised quality. We propose to utilize the high-quality speech generation capability of neural vocoders for noise suppression. We use a neural network to predict clean mel-spectrogram features from noisy speech and then compare two neural vocoders, WaveNet and WaveGlow, for synthesizing clean speech from the predicted mel spectrogram. Both WaveNet and WaveGlow achieve better subjective and objective quality scores than the source separation model Chimera++, and both achieve significantly better subjective quality ratings than the oracle Wiener mask. Between the two, WaveNet achieves the better subjective quality scores, although at the cost of much slower waveform generation.

Wednesday, October 23, 10:30 – 12:30

P3: Source Localization, Scene Analysis, and Array Processinggo to top

Poster 3

Room: Parlor

Continual Learning of New Sound Classes Using Generative Replay
Zhepei Wang (University of Illinois Urbana-Champaign, USA); Yusuf Cem Subakan (University of Illinois at Urbana Champaign, USA); Efthymios Tzinis (University of Illinois at Urbana-Champaign, USA); Paris Smaragdis (University of Illinois at Urbana-Champaign, USA); Laurent Charlin (Mila-Quebec Artificial Intelligence Institute, Canada)
Continual learning is a setting in which we incrementally train a model on a sequence of datasets. In this paper we examine its application to the problem of sound classification, in which we wish to refine already trained models to learn new sound classes. In practice one does not want to maintain all past training data and retrain from scratch, but naively updating a model with new data results in degradation of already learned tasks, which is referred to as "catastrophic forgetting". We develop a generative procedure for producing training audio spectrogram data, in place of keeping older training datasets. We show that by incrementally refining a classification model using this scheme, we can use a generator that is 4% of the size of all previous training data while matching the performance obtained when keeping 20% of all previous training data. We thus conclude that we can extend a trained sound classifier to learn new classes without having to keep previously used datasets.
ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection
Yuma Koizumi (NTT Corporation & NTT Media Intelligence Laboratories, Japan); Shoichiro Saito (NTT Media Intelligence Laboratories, Japan); Noboru Harada (NTT, Japan); Keisuke Imoto (Ritsumeikan University, Japan); Hisashi Uematsu (Nippon Telegraph and Telephone Corporation, Japan)
This paper introduces a new dataset called “ToyADMOS” designed for anomaly detection in machine operating sounds (ADMOS). To the best of our knowledge, no large-scale datasets are available for ADMOS, even though recent advancements in acoustic signal processing have been driven in part by large-scale datasets. This is because it is difficult to collect anomalous sound data. To build a large-scale dataset for ADMOS, we collected anomalous operating sounds of miniature machines (toys) by deliberately damaging them. The released dataset consists of three sub-datasets for product-inspection and fault-diagnosis tasks on fixed and moving machines. Each sub-dataset includes over 180 hours of normal machine-operating sounds and over 4,000 samples of anomalous sounds at a 48 kHz sampling rate. In addition, our dataset is designed so that it can be used not only for basic unsupervised ADMOS but also for multiple advanced tasks. The dataset is freely available for download; the unzip password is “toyadmos”.
Evaluation of Post-Processing Algorithms for Polyphonic Sound Event Detection
Leo Cances (Université Paul Sabatier & IRIT, France); Patrice Guyot (Université Paul Sabatier & IRIT, France); Thomas Pellegrini (Université Paul Sabatier & IRIT, France)
Sound event detection (SED) aims at identifying audio events (audio tagging task) in recordings and then locating them temporally (localization task). This last task ends with the segmentation of the frame-level class predictions, which determines the onsets and offsets of the audio events. Yet, this step is often overlooked in scientific publications. In this paper, we focus on the post-processing algorithms used to identify the audio event boundaries. Different post-processing steps are investigated, through smoothing, thresholding, and optimization. In particular, we evaluate different approaches for temporal segmentation, namely statistics-based and parametric methods. Experiments are carried out on the DCASE 2018 challenge task 4 data. We compare post-processing algorithms on the temporal prediction curves of two models: one based on the challenge’s baseline and a Multiple Instance Learning (MIL) model. Results show the crucial impact of the post-processing methods on the final detection score. Statistics-based methods yield a 22.9% event-based F-score on the evaluation set with our MIL model. Moreover, the best results were obtained using class-dependent parametric methods, with a 32.0% F-score.
Polyphonic Sound Event and Sound Activity Detection: A Multi-task Approach
Arjun Pankajakshan (Queen Mary University of London & Centre for Digital Music, United Kingdom (Great Britain)); Helen L. Bear (Queen Mary University of London, United Kingdom (Great Britain)); Emmanouil Benetos (Queen Mary University of London, United Kingdom (Great Britain))
Polyphonic Sound Event Detection (SED) in real-world recordings is a challenging task because of the dynamic polyphony level, intensity, and duration of sound events. Current polyphonic SED systems fail to model the temporal structure of sound events explicitly and instead attempt to look at which sound events are present at each audio frame. Consequently, the event-wise detection performance is much lower than the segment-wise detection performance. In this work, we propose a joint model approach to improve the temporal localization of sound events using a multi-task learning setup. The first task predicts which sound events are present at each time frame; we call this branch the ‘Sound Event Detection (SED) model’. The second task predicts whether any sound event is present at each frame; we call this branch the ‘Sound Activity Detection (SAD) model’. We also validate the proposed joint model by comparing it with separate implementations of both tasks and an aggregation of their individual predictions. Our experiments on the URBAN-SED dataset show that the proposed joint model can alleviate False Positive (FP) and False Negative (FN) errors and improve both the segment-wise and the event-wise metrics.
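A minimal sketch of the multi-task idea: a frame-wise event loss (SED) is combined with a frame-wise activity loss (SAD), where a frame counts as active whenever any event class is active in it. The function names and the simple weighted sum are illustrative assumptions, not the paper's architecture; in particular, the paper uses two network branches, whereas this sketch derives the activity score directly from the event logits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, t, eps=1e-7):
    """Binary cross-entropy between predictions p and targets t."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

def joint_loss(event_logits, event_targets, alpha=0.5):
    """Weighted sum of a frame-wise event loss (SED) and activity loss (SAD).

    event_logits / event_targets: arrays of shape (frames, classes).
    """
    sed = bce(sigmoid(event_logits), event_targets)
    act_logit = event_logits.max(axis=1)                  # crude activity score
    act_target = (event_targets.max(axis=1) > 0).astype(float)
    sad = bce(sigmoid(act_logit), act_target)
    return alpha * sed + (1 - alpha) * sad
```

Training against such a combined objective lets the activity task regularize the event task's temporal boundaries.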
Acoustic Scene Classification Using Higher-Order Ambisonic Features
Marc C Green (University of York, United Kingdom (Great Britain)); Damian Murphy (The University of York, United Kingdom (Great Britain)); Sharath Adavanne (Tampere University, Finland); Tuomas Virtanen (Tampere University, Finland)
This paper investigates the potential of using higher-order Ambisonic features to perform acoustic scene classification. We compare the performance of systems trained using first-order and fourth-order spatial features extracted from the EigenScape database. Using both Gaussian mixture model and convolutional neural network classifiers, we show that features extracted from higher-order Ambisonics can yield increased classification accuracies relative to first-order features. Diffuseness-based features seem to describe scenes particularly well relative to direction-of-arrival based features. With specific feature subsets, however, differences in classification accuracy between first and fourth-order features become negligible.
Joint Measurement of Localization and Detection of Sound Events
Annamaria Mesaros (Tampere University, Finland); Sharath Adavanne (Tampere University, Finland); Archontis Politis (Tampere University, Finland); Toni Heittola (Tampere University, Finland); Tuomas Virtanen (Tampere University, Finland)
Sound event detection and sound localization or tracking have historically been two separate areas of research. Recent developments in sound event detection methods also approach the localization side but lack a consistent way of measuring the joint performance of the system; instead, they measure the separate abilities for detection and for localization. This paper proposes augmenting the localization metrics with a condition related to the detection and, conversely, using location information in calculating the true positives for detection. An extensive evaluation example is provided to illustrate the behavior of such joint metrics. The comparison to the detection-only and localization-only performance shows that the proposed joint metrics operate in a consistent and logical manner.
Joint Analysis of Acoustic Events and Scenes Based on Multitask Learning
Noriyuki Tonami (Ritsumeikan University, Japan); Keisuke Imoto (Ritsumeikan University, Japan); Masahiro Niitsuma (Ritsumeikan University, Japan); Ryosuke Yamanishi (Ritsumeikan University, Japan)
Acoustic event detection and scene classification are major research tasks in environmental sound analysis, and many methods based on neural networks have been proposed. Conventional methods have addressed these tasks separately; however, acoustic events and scenes are closely related to each other. For example, in the acoustic scene “office,” the acoustic events “mouse clicking” and “keyboard typing” are likely to occur. In this paper, we propose multitask learning for joint analysis of acoustic events and scenes, which shares the parts of the networks holding information on acoustic events and scenes in common. By integrating the two networks, we also expect that information on acoustic scenes will improve the performance of acoustic event detection. Experimental results obtained using TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of acoustic event detection by 10.66 percentage points in terms of the F-score, compared with a conventional method based on a convolutional recurrent neural network.
Regression Versus Classification for Neural Network Based Audio Source Localization
Laureline Perotin (LORIA & Orange Labs, France); Alexandre Défossez (Facebook AI Research, INRIA/ENS PSL Research University, France); Emmanuel Vincent (Inria Nancy – Grand Est, France); Romain Serizel (Université de Lorraine, LORIA, France); Alexandre Guérin (Orange Labs, France)
We compare the performance of regression and classification neural networks for single-source direction-of-arrival estimation. Since the output space is continuous and structured, regression seems more appropriate. However, classification on a discrete spherical grid is widely believed to perform better and is predominantly used in the literature. For regression, we propose two ways to account for the spherical geometry of the output space, based either on the angular distance between spherical coordinates or on the mean squared error between Cartesian coordinates. For classification, we propose two alternatives to the classical one-hot encoding framework: using a Gibbs distribution as the output target in order to exploit the structure of the output space, and designing a loss function that additionally ensures a clear probabilistic interpretation. We show that regression on Cartesian coordinates is generally more accurate, except when localized interference is present, in which case classification appears to be more robust.
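The two regression objectives mentioned above can be sketched as follows, assuming directions given as (azimuth, elevation) in radians; the function names are illustrative, not the paper's code. Both losses respect the spherical geometry: one measures the great-circle angle between the two directions, the other the mean squared error between their unit-norm Cartesian coordinates.

```python
import numpy as np

def sph_to_cart(az, el):
    """Unit vector for an (azimuth, elevation) direction in radians."""
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])

def angular_loss(pred, target):
    """Angular (great-circle) distance in radians between two directions."""
    u, v = sph_to_cart(*pred), sph_to_cart(*target)
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

def cartesian_mse(pred, target):
    """Mean squared error between unit-norm Cartesian coordinates."""
    u, v = sph_to_cart(*pred), sph_to_cart(*target)
    return np.mean((u - v) ** 2)
```

Both are zero for a perfect prediction; they differ in how errors are penalized as the predicted direction moves away from the target.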
Sound Source Localization Using Relative Harmonic Coefficients in Modal Domain
Yonggang Hu (Australian National University, Australia); Prasanga Samarasinghe (Australian National University, Australia); Thushara D. Abhayapala (Australian National University, Australia)
This paper proposes a data-driven source localization approach for noisy and reverberant environments, using a newly defined feature named relative harmonic coefficients (RHC) in the modal domain. Being independent of the source signal, the RHC is capable of localizing sound sources located at unknown positions. Two distinctive multi-view Gaussian processes (MVGP), with (i) multi-frequency views and (ii) multi-mode views, are developed for Gaussian process regression (GPR) to reveal the mapping function from the RHC to the corresponding source location. We evaluate the effectiveness of the algorithm for single-source localization, while the underlying concepts can be extended to acoustic scenarios where multiple sources are active. Experimental results using a spherical microphone array confirm that the proposed algorithm runs faster and achieves competitive performance in comparison to the state-of-the-art algorithm.
Acoustic Localization Using Spatial Probability in Noisy and Reverberant Environments
Sebastian Braun (Microsoft Research, USA); Ivan J. Tashev (Microsoft Research, USA)
In realistic acoustic sound source localization scenarios, we often encounter not only the presence of multiple simultaneous sound sources, but also reverberation and noise. We propose a novel multi-source localization method based on the spatial sound presence probability (SSPP). The SSPP can be computed using prior knowledge of the anechoic relative transfer functions (RTFs), which incorporate magnitude and phase information, and makes the approach general for any device and geometry. From the SSPP we can not only obtain multiple simultaneous sound source direction estimates, but also their spatial presence probability. The SSPP can be used for a probabilistic update of the estimated directions, and can further be used to determine the dominant sound source. We demonstrate the robustness of our method in challenging non-stationary scenarios for single- and multi-speaker localization in noisy and reverberant conditions. The proposed method still localizes a sound source at 8 m with an average error below 7 degrees.
Supervised Contrastive Embeddings for Binaural Source Localization
Duowei Tang (KU Leuven, Belgium); Maja Taseska (KU Leuven, Belgium); Toon van Waterschoot (KU Leuven, Belgium)
Recent data-driven approaches for binaural source localization are able to learn the non-linear functions that map measured binaural cues to source locations. This is done either by learning a parametric map directly using training data or by learning a low-dimensional representation (embedding) of the binaural cues that is consistent with the source locations. In this paper, we use the second approach and propose a parametric embedding to map the binaural cues to a low-dimensional space, where localization can be done with a nearest-neighbor regression. We implement the embedding using a neural network, optimized to map points that are close in the latent space (the space of source azimuths or elevations) to points that are close in the embedding space. This training strategy is used in the machine learning community to solve various classification problems. We show that the proposed parametric embedding generalizes well in acoustic conditions different from those encountered during training. Furthermore, it provides better results than unsupervised embeddings previously used for localization.
Improved Change Prediction for Combined Beamforming and Echo Cancellation with Application to a Generalized Sidelobe Canceler
Stefan Kühl (RWTH Aachen University, Germany); Alexander Bohlender (Ghent University – imec, Belgium); Matthias Schrammen (RWTH Aachen University, Germany); Peter Jax (RWTH Aachen University, Germany)
Adaptive beamforming and echo cancellation are often necessary in hands-free situations in order to enhance the communication quality. Unfortunately, the combination of both algorithms leads to problems. Performing echo cancellation before the beamformer (AEC-first) leads to a high complexity. In the other case (BF-first) the echo reduction is drastically decreased due to the changes of the beamformer, which have to be tracked by the echo canceler. Recently, the authors presented the directed change prediction algorithm with directed recovery, which predicts the effective impulse response after the next beamformer change and therefore allows maintaining the low complexity of the BF-first structure while guaranteeing robust echo cancellation. However, the algorithm assumes a slowly changing acoustic environment, which can be problematic in typical time-variant scenarios. In this paper an improved change prediction is presented, which uses adaptive shadow filters to reduce the convergence time of the change prediction. For this enhanced algorithm, we show how it can be applied to more advanced beamformer structures such as the generalized sidelobe canceler, and how the information provided by the improved change prediction can also be used to enhance the performance of the overall interference cancellation.
Two-Dimensional Sound Field Recording with Multiple Circular Microphone Arrays Considering Multiple Scattering
Masahiro Nakanishi (University of Tokyo, Japan); Natsuki Ueno (The University of Tokyo, Japan); Shoichi Koyama (The University of Tokyo, Japan); Hiroshi Saruwatari (The University of Tokyo, Japan)
A sound field recording method using multiple circular microphone arrays considering the effect of multiple scattering is proposed. To avoid the numerical instability of an open microphone array, a rigid array, i.e., a microphone array mounted on a circular/spherical baffle, exploiting the scattering effect of a single baffle is frequently used for estimating a sound field. Since it is difficult to estimate a sound field in a large region with a single rigid array, several studies have been carried out using relatively small multiple rigid arrays distributed inside the target region. However, mutual interactions between multiple baffles, such as interreflection, have not been taken into consideration in the estimation process. The effect of multiple scattering is considerable, especially when several baffles are closely located. In this paper, we formulate an estimation method with modeling of this multiple scattering effect using the cylindrical wavefunction expansion of a two-dimensional sound field. Numerical simulation results indicated that the proposed method significantly improves the estimation accuracy compared with the method considering only the single scattering effect.
RTF-steered Binaural MVDR Beamforming Incorporating Multiple External Microphones
Nico Goessling (University of Oldenburg, Germany); Wiebke Middelberg (University of Oldenburg, Germany); Simon Doclo (University of Oldenburg, Germany)
The binaural minimum variance distortionless response beamformer (BMVDR) is a well-known binaural noise reduction algorithm that can be steered using the relative transfer function (RTF) vectors of the desired speech source. For a situation where multiple external microphone signals are incorporated into the BMVDR processing of a binaural hearing device, we propose and compare different methods to estimate the complete RTF vectors. The considered methods either use one or combine multiple RTF vector estimates obtained using a recently proposed method that exploits the spatial coherence (SC) between noise components. The proposed RTF vector estimation methods are evaluated in an on-line implementation of the BMVDR using recorded signals of a moving speaker and diffuse noise in a reverberant environment. The results show that an output SNR-maximizing combination of the RTF vector estimates leads to an improved noise reduction performance compared to an input SNR-based selection and also outperforms the state-of-the-art covariance whitening method and an averaging method.
1st-Order Microphone Array System for Large Area Sound Field Recording and Reconstruction: Discussion and Preliminary Results
Federico Borra (Politecnico di Milano, Italy); Steven Krenn (Facebook, USA); Israel D Gebru (Facebook, USA); Dejan Markovic (Facebook, USA)
The process of capturing, analyzing and predicting sound fields is finding novel areas of application in AR/VR. One of the key processes in such applications is estimating the sound field at locations that differ from the actual measurement points, i.e., sound field reconstruction. However, this is a difficult spatial audio processing problem. Though theoretical solutions exist to reconstruct sound fields, they are practically infeasible due to hardware and computational requirements. This paper discusses the implementation of a system for large-area sound field recording and reconstruction and proposes an improved sound field reconstruction algorithm. The proposed algorithm introduces a practical improvement to overcome implementation issues. In addition, we present preliminary real-world results for an innovative but highly challenging application.
Direction of Arrival Estimation in Highly Reverberant Environments Using Soft Time-Frequency Mask
Vladimir Tourbabin (Facebook Reality Labs, USA); Jacob Donley (Facebook, USA); Boaz Rafaely (Facebook Reality Labs, USA); Ravish Mehra (Facebook Reality Labs, USA)
A recent approach to improving the robustness of sound localization in reverberant environments is based on pre-selection of time-frequency pixels that are dominated by direct sound. This approach is equivalent to applying a binary time-frequency mask prior to the localization stage. Although the binary mask approach was shown to be effective, it may not exploit the information available in the captured signal to its full extent. In an attempt to overcome this limitation, it is hereby proposed to employ a soft mask instead of the binary mask. The proposed weighting scheme is based directly on a metric of the direct-to-reverberant sound ratio in each individual time-frequency pixel. Evaluation using simulated reverberant speech recordings indicates substantial improvement in the localization performance when using the proposed soft mask weighting.
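The soft mask described above can be sketched as a logistic weighting of each time-frequency pixel by its direct-to-reverberant ratio (DRR); the function name and the logistic shape are illustrative assumptions, since the abstract only states that the weight is based on a per-pixel DRR metric. Pixels dominated by direct sound get weight near 1, strongly reverberant pixels near 0, and borderline pixels an intermediate weight instead of a hard binary decision.

```python
import numpy as np

def soft_mask(drr_db, slope=1.0, threshold_db=0.0):
    """Map a per-pixel direct-to-reverberant ratio (dB) to a [0, 1] weight.

    A logistic curve replaces the hard binary mask: well above the threshold
    the weight approaches 1, well below it approaches 0.
    """
    return 1.0 / (1.0 + np.exp(-slope * (np.asarray(drr_db) - threshold_db)))

# Weights for a strongly reverberant, a borderline, and a direct-sound pixel.
weights = soft_mask(np.array([-20.0, 0.0, 20.0]))
print(weights.round(3))                         # roughly [0.    0.5   1.   ]
```

The localization statistics of each pixel are then scaled by its weight before being pooled into a direction estimate, so uncertain pixels contribute less rather than being discarded outright.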
Analytical Method of 2.5D Exterior Sound Field Synthesis by Using Multipole Loudspeaker Array
Kenta Imaizumi (NTT Corporation, Japan); Kimitaka Tsutsumi (NTT Corporation, Japan); Atsushi Nakadaira (NTT, Japan); Yoichi Haneda (The University of Electro-Communications, Japan)
We propose an analytical method of 2.5-dimensional exterior sound field reproduction using a multipole loudspeaker array. The method reproduces the sound field modeled by expansion coefficients of spherical harmonics based on multipole superposition. We also present an analytical method for converting the expansion coefficients of spherical harmonics into weighting coefficients for multipole superposition. In contrast to pressure-matching methods, which tend to be ill-conditioned problems, the proposed method gives stable solutions based on an analytical conversion from the expansion coefficients of spherical harmonics. We derive the analytical method by converting the sound field modeled by expansion coefficients of spherical harmonics into a linear combination of ‘basic multipoles’ to obtain the weighting coefficients of each multipole. Computer simulation results indicate that the proposed method reproduces the sound field more accurately than the existing pressure-matching-based method.
A Sparse Bayesian Learning Based RIR Reconstruction Method for Acoustic TOA and DOA Estimation
Zonglong Bai (Harbin Institute of Technology & Aalborg University, Denmark); Jesper Rindom Jensen (Aalborg University, Denmark); Jinwei Sun (Harbin Institute of Technology, P.R. China); Mads Græsbøll Christensen (Aalborg University, Denmark)
Acoustic reflector estimation, which is one of the key problems of robot audition, is addressed in this paper using a sparse Bayesian learning (SBL) approach. More specifically, we propose a three-step procedure in which we 1) reconstruct the room impulse response (RIR) using SBL, 2) estimate the time-of-arrivals (TOAs) from the RIR, and 3) estimate the DOA from the TOA estimates. The challenge of RIR reconstruction is that the early reflections are weak compared to the direct sound. Therefore, the sparsity of the early part of the RIR is exploited to improve the recovery performance. However, most sparse vector recovery methods cannot reconstruct the RIR successfully, especially when the measurement matrix is highly coherent. In this paper, we therefore adopt the SBL framework, which is more robust in such scenarios compared to state-of-the-art recovery methods. In the DOA estimation step, we propose a new approximate near-field model for isotropic arrays. The performance of the proposed approach is analysed by numerical simulations, where the estimation accuracy is measured versus different signal-to-diffuse-noise ratios and grid errors. According to the simulation results, the proposed SBL method is more robust to diffuse noise and grid errors than other state-of-the-art methods, which even fail to estimate the TOAs and DOA in many cases.

Wednesday, October 23, 12:30 – 14:00

Lunch/Closing

Room: West Dining Room
