Abstracts and full papers: http://media.aau.dk/smc/ml4audio/
Audio signal processing is currently undergoing a paradigm change, where data-driven machine learning is replacing hand-crafted feature design. This has led some to ask whether audio signal processing is still useful in the "era of machine learning." There are many challenges, new and old, including the interpretation of learned models in high-dimensional spaces, problems associated with data-poor domains, adversarial examples, high computational requirements, and research driven by companies using large in-house datasets that is ultimately not reproducible.
ML4Audio aims to promote progress, systematization, understanding, and convergence of applying machine learning in the area of audio signal processing. Specifically, we are interested in work that demonstrates novel applications of machine learning techniques to audio data, as well as methodological considerations of merging machine learning with audio signal processing. We seek contributions on, but not limited to, the following topics:
- audio information retrieval using machine learning;
- audio synthesis with given contextual or musical constraints using machine learning;
- audio source separation using machine learning;
- audio transformations (e.g., sound morphing, style transfer) using machine learning;
- unsupervised learning, online learning, one-shot learning, reinforcement learning, and incremental learning for audio;
- applications/optimization of generative adversarial networks for audio;
- cognitively inspired machine learning models of sound cognition;
- mathematical foundations of machine learning for audio signal processing.
This workshop especially targets researchers, developers and musicians in academia and industry in the areas of MIR, audio processing, hearing instruments, speech processing, musical HCI, musicology, music technology, music entertainment, and composition.
ML4Audio Organisation Committee:
Hendrik Purwins, Aalborg University Copenhagen, Denmark (hpu@create.aau.dk)
Bob L. Sturm, Queen Mary University of London, UK (b.sturm@qmul.ac.uk)
Mark Plumbley, University of Surrey, UK (m.plumbley@surrey.ac.uk)
Program Committee:
Abeer Alwan (University of California, Los Angeles)
Jon Barker (University of Sheffield)
Sebastian Böck (Johannes Kepler University Linz)
Mads Græsbøll Christensen (Aalborg University)
Maximo Cobos (Universitat de Valencia)
Sander Dieleman (Google DeepMind)
Monika Dörfler (University of Vienna)
Shlomo Dubnov (UC San Diego)
Philippe Esling (IRCAM)
Cédric Févotte (IRIT)
Emilia Gómez (Universitat Pompeu Fabra)
Emanuël Habets (International Audio Labs Erlangen)
Jan Larsen (Danish Technical University)
Marco Marchini (Spotify)
Rafael Ramirez (Universitat Pompeu Fabra)
Gaël Richard (TELECOM ParisTech)
Fatemeh Saki (UT Dallas)
Sanjeev Satheesh (Baidu SVAIL)
Jan Schlüter (Austrian Research Institute for Artificial Intelligence)
Joan Serrà (Telefonica)
Malcolm Slaney (Google)
Emmanuel Vincent (INRIA Nancy)
Gerhard Widmer (Austrian Research Institute for Artificial Intelligence)
Tao Zhang (Starkey Hearing Technologies)
Fri 8:00 a.m. - 8:15 a.m.
Overture (Talk)
Hendrik Purwins
Fri 8:15 a.m. - 8:45 a.m.
Acoustic word embeddings for speech search (Invited Talk)
For a number of speech tasks, it can be useful to represent speech segments of arbitrary length by fixed-dimensional vectors, or embeddings. In particular, vectors representing word segments -- acoustic word embeddings -- can be used in query-by-example search, example-based speech recognition, or spoken term discovery. Textual word embeddings have been common in natural language processing for a number of years now; the acoustic analogue is only recently starting to be explored. This talk will present our work on acoustic word embeddings and their application to query-by-example search. I will speculate on applications across a wider variety of audio tasks.
Karen Livescu is an Associate Professor at TTI-Chicago. She completed her PhD and post-doc in electrical engineering and computer science at MIT and her Bachelor's degree in Physics at Princeton University. Karen's main research interests are at the intersection of speech and language processing and machine learning. Her recent work includes multi-view representation learning, segmental neural models, acoustic word embeddings, and automatic sign language recognition. She is a member of the IEEE Spoken Language Technical Committee, an associate editor for IEEE Transactions on Audio, Speech, and Language Processing, and a technical co-chair of ASRU 2015 and 2017.
Karen Livescu
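As a concrete illustration of the query-by-example setting described above, here is a minimal numpy sketch that ranks indexed word segments by cosine similarity to a query embedding. The embedding dimensionality, index size, and random vectors are placeholders; any acoustic word embedding model could supply the actual vectors.

```python
import numpy as np

def cosine_query(query_emb, index_embs, top_k=5):
    """Rank indexed word segments by cosine similarity to a query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    X = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    scores = X @ q                        # cosine similarity per indexed segment
    order = np.argsort(-scores)[:top_k]   # highest similarity first
    return order, scores[order]

# Toy usage: 1000 indexed word segments, each mapped to a 64-d embedding
# by some acoustic word embedding model (not shown here).
rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 64))
query = rng.normal(size=64)
ids, scores = cosine_query(query, index)
print(ids, scores)
```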
Fri 8:45 a.m. - 9:05 a.m.
Learning Word Embeddings from Speech (Talk)
In this paper, we propose a novel deep neural network architecture, Sequence-to-Sequence Audio2Vec, for unsupervised learning of fixed-length vector representations of audio segments excised from a speech corpus, where the vectors contain semantic information pertaining to the segments, and are close to other vectors in the embedding space if their corresponding segments are semantically similar. The design of the proposed model is based on the RNN Encoder-Decoder framework, and borrows the methodology of continuous skip-grams for training. The learned vector representations are evaluated on 13 widely used word similarity benchmarks, and achieve results competitive with those of GloVe. The biggest advantage of the proposed model is its capability of extracting semantic information from audio segments taken directly from raw speech, without relying on any other modalities such as text or images, which are challenging and expensive to collect and annotate.
Jim Glass · Yu-An Chung
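The encoder-decoder idea in the abstract can be sketched in PyTorch as follows: a GRU encoder compresses a variable-length feature sequence into its final hidden state, and a GRU decoder generates a feature sequence back from it. This is a simplified stand-in, not the authors' Audio2Vec implementation; the feature dimension and the self-reconstruction objective (rather than the paper's skip-gram-style prediction of neighbouring segments) are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class SegmentEncoder(nn.Module):
    """Map a variable-length acoustic feature sequence to one fixed-length vector."""
    def __init__(self, feat_dim=39, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True)

    def forward(self, feats):             # feats: (batch, time, feat_dim)
        _, h = self.rnn(feats)            # h: (1, batch, emb_dim), last hidden state
        return h.squeeze(0)               # (batch, emb_dim) fixed-length embeddings

class SegmentDecoder(nn.Module):
    """Generate a feature sequence from the embedding used as the initial state."""
    def __init__(self, feat_dim=39, emb_dim=128):
        super().__init__()
        self.feat_dim = feat_dim
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.out = nn.Linear(emb_dim, feat_dim)

    def forward(self, emb, n_steps):
        h0 = emb.unsqueeze(0)             # embedding as initial hidden state
        zeros = torch.zeros(emb.shape[0], n_steps, self.feat_dim, device=emb.device)
        y, _ = self.rnn(zeros, h0)
        return self.out(y)

# Toy forward pass on a batch of 4 segments, 50 frames of 39-d features each.
enc, dec = SegmentEncoder(), SegmentDecoder()
x = torch.randn(4, 50, 39)
z = enc(x)
recon = dec(z, n_steps=50)
loss = nn.functional.mse_loss(recon, x)
print(z.shape, loss.item())
```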
Fri 9:05 a.m. - 9:25 a.m.
Multi-Speaker Localization Using Convolutional Neural Network Trained with Noise (Talk)
The problem of multi-speaker localization is formulated as a multi-class multi-label classification problem, which is solved using a convolutional neural network (CNN) based source localization method. Utilizing the common assumption of disjoint speaker activities, we propose a novel method to train the CNN using synthesized noise signals. The proposed localization method is evaluated for two speakers and compared to a well-known steered response power method.
Soumitro Chakrabarty · Emanuël Habets
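A hedged sketch of the multi-class multi-label formulation: a small CNN with one sigmoid output per discretized direction-of-arrival class, trained with binary cross-entropy on synthesized-noise snippets. The input shape (per-microphone phase maps), the number of microphones, and the number of DOA classes are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DoaCNN(nn.Module):
    """Multi-label classifier over discretized directions of arrival (DOA)."""
    def __init__(self, n_mics=4, n_doa_classes=37):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_mics, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_doa_classes)  # one logit per DOA class

    def forward(self, x):                # x: (batch, n_mics, freq, time), e.g. phase maps
        h = self.features(x).flatten(1)
        return self.classifier(h)        # raw logits; sigmoid is applied inside the loss

model = DoaCNN()
x = torch.randn(8, 4, 257, 16)                       # synthesized-noise training snippets
target = (torch.rand(8, 37) < 0.05).float()          # multi-hot active directions
loss = nn.BCEWithLogitsLoss()(model(x), target)
loss.backward()
print(loss.item())
```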
Fri 9:25 a.m. - 9:45 a.m.
Adaptive Front-ends for End-to-end Source Separation (Talk)
(+ Jonah Casebeer) Source separation and other audio applications have traditionally relied on the use of short-time Fourier transforms as a front-end frequency domain representation step. We present an auto-encoder neural network that can act as an equivalent to short-time front-end transforms. We demonstrate the ability of the network to learn optimal, real-valued basis functions directly from the raw waveform of a signal and further show how it can be used as an adaptive front-end for end-to-end supervised source separation.
Shrikant Venkataramani · Paris Smaragdis
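One way to read "an auto-encoder equivalent to a short-time front-end transform" is a strided Conv1d analysis stage paired with a ConvTranspose1d synthesis stage, as in the minimal sketch below. The window, hop, and number of bases are assumed values, and the ReLU-coded bottleneck is a simplification of the authors' front end.

```python
import torch
import torch.nn as nn

class AdaptiveFrontEnd(nn.Module):
    """Learned analysis/synthesis transform: a strided Conv1d plays the role of the
    STFT analysis window, a ConvTranspose1d the role of overlap-add resynthesis."""
    def __init__(self, n_bases=512, win=1024, hop=256):
        super().__init__()
        self.analysis = nn.Conv1d(1, n_bases, kernel_size=win, stride=hop, bias=False)
        self.synthesis = nn.ConvTranspose1d(n_bases, 1, kernel_size=win, stride=hop, bias=False)

    def forward(self, wav):              # wav: (batch, 1, samples)
        coeffs = torch.relu(self.analysis(wav))    # non-negative "magnitude-like" code
        return self.synthesis(coeffs), coeffs

fe = AdaptiveFrontEnd()
wav = torch.randn(2, 1, 16384)
recon, coeffs = fe(wav)
# The transposed convolution inverts the analysis geometry, so lengths match here.
loss = nn.functional.mse_loss(recon[..., :wav.shape[-1]], wav)
print(coeffs.shape, recon.shape, loss.item())
```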
Fri 9:45 a.m. - 11:00 a.m.
Poster Session Speech: source separation, enhancement, recognition, synthesis (Coffee break and poster session)
Poster abstracts and full papers: http://media.aau.dk/smc/ml4audio/
SPEECH SOURCE SEPARATION
*Lijiang Guo and Minje Kim. Bitwise Source Separation on Hashed Spectra: An Efficient Posterior Estimation Scheme Using Partial Rank Order Metrics
*Minje Kim and Paris Smaragdis. Bitwise Neural Networks for Efficient Single-Channel Source Separation
*Mohit Dubey, Garrett Kenyon, Nils Carlson and Austin Thresher. Does Phase Matter For Monaural Source Separation?
SPEECH ENHANCEMENT
*Rasool Fakoor, Xiaodong He, Ivan Tashev and Shuayb Zarar. Reinforcement Learning To Adapt Speech Enhancement to Instantaneous Input Signal Quality
*Jong Hwan Ko, Josh Fromm, Matthai Phillipose, Ivan Tashev and Shuayb Zarar. Precision Scaling of Neural Networks for Efficient Audio Processing
AUTOMATIC SPEECH RECOGNITION
Marius Paraschiv, Lasse Borgholt, Tycho Tax, Marco Singh and Lars Maaløe. Exploiting Nontrivial Connectivity for Automatic Speech Recognition
*Brian Mcmahan and Delip Rao. Listening to the World Improves Speech Command Recognition
*Andros Tjandra, Sakriani Sakti and Satoshi Nakamura. End-to-End Speech Recognition with Local Monotonic Attention
Sri Harsha Dumpala, Rupayan Chakraborty and Sunil Kumar Kopparapu. A Novel Approach for Effective Learning in Low Resourced Scenarios
SPEECH SYNTHESIS
*Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark and Rif A. Saurous. Uncovering Latent Style Factors for Expressive Speech Synthesis
*Younggun Lee, Azam Rabiee and Soo-Young Lee. Emotional End-to-End Neural Speech Synthesizer
Shuayb Zarar · Rasool Fakoor · Sri Harsha Dumpala · Minje Kim · Paris Smaragdis · Mohit Dubey · Jong Hwan Ko · Sakriani Sakti · Yuxuan Wang · Lijiang Guo · Garrett T Kenyon · Andros Tjandra · Tycho Tax · Younggun Lee
Fri 11:00 a.m. - 11:30 a.m.
Learning and transforming sound for interactive musical applications (Invited Talk)
Recent developments in object recognition (especially convolutional neural networks) have led to a spectacular new application: image style transfer. But what would be the music version of style transfer? In the Flow Machines project, we created diverse tools for generating audio tracks by transforming prerecorded music material. Our artists integrated these tools in their composition process and produced some pop tracks. I present some of those tools, with audio examples, and give an operative definition of music style transfer as an optimization problem. Such a definition allows for an efficient solution, which makes possible a multitude of musical applications, from composing to live performance.
Marco Marchini works at Spotify in the Creator Technology Research Lab, Paris. His mission is bridging the gap between creative artists and intelligent technologies. Previously, he worked as a research assistant for the Pierre and Marie Curie University at the Sony Computer Science Laboratory of Paris, where he contributed to the Flow Machines project. His earlier academic research includes unsupervised music generation and ensemble performance analysis, carried out during his M.Sc. and Ph.D. at the Music Technology Group (DTIC, Pompeu Fabra University). He has a double degree in Mathematics from Bologna University.
Marco Marchini
Fri 11:30 a.m. - 11:50 a.m.
Compact Recurrent Neural Network based on Tensor Train for Polyphonic Music Modeling (Talk)
(+ Andros Tjandra, Satoshi Nakamura) This paper introduces a novel compression method for recurrent neural networks (RNNs) based on the Tensor Train (TT) format. The objectives in this work are to reduce the number of parameters in RNNs and to maintain their expressive power. The key to our approach is to represent the dense weight matrices of the simple RNN and Gated Recurrent Unit (GRU) architectures as n-dimensional tensors in TT-format. To evaluate our proposed models, we compare them with uncompressed RNNs on polyphonic sequence prediction tasks. Our proposed TT-format RNNs are able to preserve performance while reducing the number of RNN parameters significantly, by up to 80 times.
Sakriani Sakti
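To make the compression mechanism concrete, here is a toy numpy illustration of the TT-matrix parameterization: a 256 x 1024 weight matrix is represented by four small cores and reconstructed by contracting them. The mode factorization and TT-ranks are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np

# Represent a 256 x 1024 weight matrix W in TT-matrix format.
# Row dims 256 = 4*4*4*4, column dims 1024 = 4*4*8*8; TT-ranks (1, r, r, r, 1).
row_modes, col_modes, r = [4, 4, 4, 4], [4, 4, 8, 8], 4
ranks = [1, r, r, r, 1]

rng = np.random.default_rng(0)
cores = [rng.normal(size=(ranks[k], row_modes[k], col_modes[k], ranks[k + 1]))
         for k in range(4)]

def tt_to_matrix(cores, row_modes, col_modes):
    """Contract the TT cores back into the full (256, 1024) matrix."""
    t = cores[0]                                     # (1, m1, n1, r1)
    for core in cores[1:]:
        # Merge the running tensor with the next core along the shared TT-rank.
        t = np.einsum('amnb,bpqc->ampnqc', t, core)
        a, m, p, n, q, c = t.shape
        t = t.reshape(a, m * p, n * q, c)
    return t.reshape(np.prod(row_modes), np.prod(col_modes))

W = tt_to_matrix(cores, row_modes, col_modes)
tt_params = sum(c.size for c in cores)               # 960 parameters in the cores
print(W.shape, tt_params, 256 * 1024 / tt_params)    # compression ratio of the full matrix
```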
Fri 11:50 a.m. - 12:10 p.m.
Singing Voice Separation using Generative Adversarial Networks (Talk)
(+ Ju-heon Lee) In this paper, we propose a novel approach extending Wasserstein generative adversarial networks (GANs) [3] to separate the singing voice from the mixture signal. We used the mixture signal as a condition to generate singing voices and applied a U-Net style network for stable training of the model. Experiments with the DSD100 dataset show promising results and the potential of using GANs for music source separation.
Hyeong-Seok Choi · Kyogu Lee
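A minimal sketch of the conditional Wasserstein GAN objective described above, with the mixture concatenated to the critic's input as the condition. For brevity it uses per-frame MLPs and weight clipping instead of the authors' U-Net generator, so treat it as an illustration of the training loop rather than the paper's model.

```python
import torch
import torch.nn as nn

n_bins = 513                              # magnitude-spectrogram bins per frame (toy)
gen = nn.Sequential(nn.Linear(n_bins, 1024), nn.ReLU(),
                    nn.Linear(1024, n_bins), nn.ReLU())
critic = nn.Sequential(nn.Linear(2 * n_bins, 1024), nn.LeakyReLU(0.2),
                       nn.Linear(1024, 1))
opt_g = torch.optim.RMSprop(gen.parameters(), lr=5e-5)
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

def training_step(mix, voice, n_critic=5, clip=0.01):
    # Critic step: push scores of real (mix, voice) pairs up, fake pairs down.
    for _ in range(n_critic):
        fake = gen(mix).detach()
        c_loss = critic(torch.cat([mix, fake], dim=1)).mean() \
               - critic(torch.cat([mix, voice], dim=1)).mean()
        opt_c.zero_grad(); c_loss.backward(); opt_c.step()
        for p in critic.parameters():     # weight clipping keeps the critic ~1-Lipschitz
            p.data.clamp_(-clip, clip)
    # Generator step: make the critic score the conditionally generated voice highly.
    g_loss = -critic(torch.cat([mix, gen(mix)], dim=1)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return c_loss.item(), g_loss.item()

mix, voice = torch.rand(16, n_bins), torch.rand(16, n_bins)   # stand-in magnitude frames
print(training_step(mix, voice))
```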
Fri 12:10 p.m. - 12:30 p.m.
Audio Cover Song Identification using Convolutional Neural Network (Talk)
(+ Juheon Lee, Sankeun Choe) In this paper, we propose a new approach to cover song identification using a CNN (convolutional neural network). Most previous studies extract feature vectors that characterize the cover song relation from a pair of songs and use them to compute the (dis)similarity between the two songs. Based on the observation that there is a meaningful pattern between cover songs and that this can be learned, we have reformulated the cover song identification problem in a machine learning framework. To do this, we first build the CNN using as input a cross-similarity matrix generated from a pair of songs. We then construct a data set composed of cover song pairs and non-cover song pairs, which are used as positive and negative training samples, respectively. The trained CNN outputs the probability of being in the cover song relation given a cross-similarity matrix generated from any two pieces of music and identifies the cover song by ranking on the probability. Experimental results show that the proposed algorithm achieves performance better than or comparable to the state of the art.
Sungkyun Chang · Kyogu Lee
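The CNN's input can be illustrated with a short numpy sketch that builds a cross-similarity matrix from two songs' chroma sequences. The use of chroma features and cosine similarity is an assumption for illustration; the paper's exact feature and distance choices may differ.

```python
import numpy as np

def cross_similarity(chroma_a, chroma_b):
    """Cross-similarity matrix between two songs' frame-level chroma features.

    chroma_a: (n_frames_a, 12), chroma_b: (n_frames_b, 12). Entry (i, j) is the
    cosine similarity between frame i of song A and frame j of song B; cover pairs
    tend to show diagonal stripes that a CNN can learn to detect.
    """
    a = chroma_a / (np.linalg.norm(chroma_a, axis=1, keepdims=True) + 1e-8)
    b = chroma_b / (np.linalg.norm(chroma_b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

rng = np.random.default_rng(0)
S = cross_similarity(rng.random((180, 12)), rng.random((200, 12)))
print(S.shape)        # (180, 200) image-like input for the classification CNN
```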
Fri 12:30 p.m. - 1:30 p.m.
Lunch Break
Fri 1:30 p.m. - 2:00 p.m.
Polyphonic piano transcription using deep neural networks (Invited Talk)
I'll discuss the problem of transcribing polyphonic piano music with an emphasis on generalizing to unseen instruments. We optimize for two objectives: we first predict pitch onset events and then conditionally predict pitch at the frame level. I'll discuss the model architecture, which combines CNNs and LSTMs, and the challenges faced in robust piano transcription, such as obtaining enough data to train a good model. I'll also provide some demos and links to working code. This collaboration was led by Curtis Hawthorne, Erich Elsen and Jialin Song (https://arxiv.org/abs/1710.11153).
Douglas Eck works on the Google Brain team on the Magenta project, an effort to generate music, video, images and text using machine intelligence. He also worked on music search and recommendation for Google Play Music. He was previously an Associate Professor in Computer Science at the University of Montreal in the BRAMS research center. He has also worked on music performance modeling.
Douglas Eck
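A hedged PyTorch sketch of the two-objective design: an onset head, and a frame head whose input is conditioned on the onset predictions. The full model in the linked paper additionally uses a convolutional acoustic front end and different layer sizes; the widths and feature dimensions here are placeholders.

```python
import torch
import torch.nn as nn

class OnsetsAndFrames(nn.Module):
    """Two-headed transcription sketch: an onset head, and a frame head that is
    conditioned on the onset head's predictions."""
    def __init__(self, n_mels=229, n_pitches=88, hidden=256):
        super().__init__()
        self.onset_rnn = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.onset_out = nn.Linear(2 * hidden, n_pitches)
        self.frame_rnn = nn.LSTM(n_mels + n_pitches, hidden, batch_first=True, bidirectional=True)
        self.frame_out = nn.Linear(2 * hidden, n_pitches)

    def forward(self, mel):                        # mel: (batch, time, n_mels)
        onset_logits = self.onset_out(self.onset_rnn(mel)[0])
        onset_probs = torch.sigmoid(onset_logits)
        frame_in = torch.cat([mel, onset_probs.detach()], dim=-1)  # condition on onsets
        frame_logits = self.frame_out(self.frame_rnn(frame_in)[0])
        return onset_logits, frame_logits

model = OnsetsAndFrames()
mel = torch.randn(2, 100, 229)
onset_logits, frame_logits = model(mel)
print(onset_logits.shape, frame_logits.shape)      # (2, 100, 88) each
```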
Fri 2:00 p.m. - 2:30 p.m.
Deep learning for music recommendation and generation (Invited Talk)
The advent of deep learning has made it possible to extract high-level information from perceptual signals without having to specify manually and explicitly how to obtain it; instead, this can be learned from examples. This creates opportunities for automated content analysis of musical audio signals. In this talk, I will discuss how deep learning techniques can be used for audio-based music recommendation. I will also discuss my ongoing work on music generation in the raw waveform domain with WaveNet.
Sander Dieleman is a Research Scientist at DeepMind in London, UK, where he has worked on the development of AlphaGo and WaveNet. He was previously a PhD student at Ghent University, where he conducted research on feature learning and deep learning techniques for learning hierarchical representations of musical audio signals. During his PhD he also developed the Theano-based deep learning library Lasagne and won solo and team gold medals respectively in Kaggle's "Galaxy Zoo" competition and the first National Data Science Bowl. In the summer of 2014, he interned at Spotify in New York, where he worked on implementing audio-based music recommendation using deep learning on an industrial scale.
Sander Dieleman
Fri 2:30 p.m. - 3:00 p.m.
Exploring Ad Effectiveness using Acoustic Features (Invited Talk)
Online audio advertising is a form of advertising used abundantly in online music streaming services. In these platforms, providing high quality ads ensures a better user experience and results in longer user engagement. In this paper we describe a way to predict ad quality using hand-crafted, interpretable acoustic features that capture timbre, rhythm, and harmonic organization of the audio signal. We then discuss how the characteristics of the sound can be connected to concepts such as the clarity of the ad and its message.
Matt Prockup is currently a scientist at Pandora working on methods and tools for Music Information Retrieval at scale. He recently received his Ph.D. in Electrical Engineering from Drexel University. His research interests span a wide scope of topics including audio signal processing, recommender systems, machine learning, and human computer interaction. He is also an avid drummer, percussionist, and composer.
Puya (Hossein) Vahabi is a senior research scientist at Pandora working on audio/video computational advertising. Before Pandora, he was a research scientist at Yahoo Labs and a research associate of the Italian National Research Council for many years. He has a PhD in CS, and his background is in computational advertising, graph mining and information retrieval.
Matt Prockup · Puya Vahabi
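A sketch of the kind of hand-crafted, interpretable descriptors the talk refers to, assuming librosa is available: MFCC statistics for timbre, an estimated tempo for rhythm, and mean chroma for harmonic organization. The specific features and their aggregation are illustrative, not Pandora's actual feature set.

```python
import numpy as np
import librosa

def ad_features(path):
    """Hand-crafted timbre / rhythm / harmony descriptors for one audio ad."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # timbre
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)            # rhythm (global tempo)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # harmonic organization
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           np.atleast_1d(tempo), chroma.mean(axis=1)])

# feats = ad_features("some_ad.wav")   # 13 + 13 + 1 + 12 = 39 interpretable dimensions
```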
Fri 3:00 p.m. - 4:00 p.m.
Poster Session Music and environmental sounds (Coffee break and poster session)
Poster abstracts and full papers: http://media.aau.dk/smc/ml4audio/
MUSIC
*Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann and Xavier Serra. End-to-end learning for music audio tagging at scale
*Jongpil Lee, Taejun Kim, Jiyoung Park and Juhan Nam. Raw Waveform based Audio Classification Using Sample-level CNN Architectures
*Alfonso Perez-Carrillo. Estimation of violin bowing features from Audio recordings with Convolutional Networks
ENVIRONMENTAL SOUNDS
*Bhiksha Raj, Benjamin Elizalde, Rohan Badlani, Ankit Shah and Anurag Kumar. NELS - Never-Ending Learner of Sounds
*Tycho Tax, Jose Antich, Hendrik Purwins and Lars Maaløe. Utilizing Domain Knowledge in End-to-End Audio Processing
*Anurag Kumar and Bhiksha Raj. Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data
Oriol Nieto · Jordi Pons · Bhiksha Raj · Tycho Tax · Benjamin Elizalde · Juhan Nam · Anurag Kumar
Fri 4:00 p.m. - 4:30 p.m.
Sight and sound (Invited Talk)
William T. Freeman is the Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science at MIT. His current research interests include motion rerendering, computational photography, and learning for vision. He received outstanding paper awards at computer vision or machine learning conferences in 1997, 2006, 2009 and 2012, and recently won "test of time" awards for papers written in 1991 and 1995. Previous research topics include steerable filters and pyramids, the generic viewpoint assumption, color constancy, bilinear models for separating style and content, and belief propagation in networks with loops. He holds 30 patents.
Bill Freeman
Fri 4:30 p.m. - 4:50 p.m.
k-shot Learning of Acoustic Context (Talk)
(+ Ivan Bocharov, Tjalling Tjalkens) In order to personalize the behavior of hearing aid devices in different acoustic scenes, we need personalized acoustic scene classifiers. Since we cannot afford to burden an individual hearing aid user with the task of collecting a large acoustic database, we will want to train an acoustic scene classifier on one in-situ recorded waveform (of a few seconds duration) per class. In this paper we develop a method that achieves high levels of classification accuracy from a single recording of an acoustic scene.
Bert de Vries
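The abstract does not spell out the method itself; as a generic illustration of classification from one recording per class, here is a nearest-prototype sketch over fixed-dimensional scene embeddings. The embedding dimensionality and Euclidean distance are assumptions, and this is not the authors' model.

```python
import numpy as np

def fit_prototypes(support_embs):
    """support_embs: dict scene_name -> (n_shots, dim) embeddings of in-situ recordings.
    With one recording per class, the prototype is just that recording's embedding."""
    return {c: e.mean(axis=0) for c, e in support_embs.items()}

def classify(emb, prototypes):
    """Assign the acoustic scene whose prototype is closest to the query embedding."""
    names = list(prototypes)
    d = [np.linalg.norm(emb - prototypes[c]) for c in names]
    return names[int(np.argmin(d))]

rng = np.random.default_rng(0)
support = {"car": rng.normal(size=(1, 32)),          # one few-second clip per scene
           "restaurant": rng.normal(size=(1, 32)),
           "street": rng.normal(size=(1, 32))}
protos = fit_prototypes(support)
print(classify(rng.normal(size=32), protos))
```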
Fri 4:50 p.m. - 5:10 p.m.
Towards Learning Semantic Audio Representations from Unlabeled Data (Talk)
(+ Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous) Our goal is to learn semantically structured audio representations without relying on categorically labeled data. We consider several class-agnostic semantic constraints that are inherent to non-speech audio: (i) sound categories are invariant to additive noise and translations in time, (ii) mixtures of two sound events inherit the categories of the constituents, and (iii) the categories of events in close temporal proximity in a single recording are likely to be the same or related. We apply these constraints to sample training data for triplet-loss embedding models using a large unlabeled dataset of YouTube soundtracks. The resulting low-dimensional representations provide both greatly improved query-by-example retrieval performance and reduced labeled data and model complexity requirements for supervised sound classification.
Aren Jansen
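Constraint (i) above can be turned into triplet sampling directly: the positive is the anchor patch with additive noise, the negative a patch from an unrelated recording. The tiny embedding network and the margin below are placeholders; the sketch only illustrates how a class-agnostic constraint feeds a triplet loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEmbedder(nn.Module):
    """Tiny embedding network over log-mel patches (stand-in for the real model)."""
    def __init__(self, n_mels=64, n_frames=96, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(n_mels * n_frames, 512),
                                 nn.ReLU(), nn.Linear(512, dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=1)

def sample_triplet(clip, other_clip, noise):
    """Class-agnostic positives: anchor = a patch, positive = the same patch with
    additive noise (constraint i); negative = a patch from an unrelated recording."""
    return clip, clip + 0.1 * noise, other_clip

model = AudioEmbedder()
clip, other, noise = (torch.randn(8, 64, 96) for _ in range(3))
a, p, n = sample_triplet(clip, other, noise)
loss = F.triplet_margin_loss(model(a), model(p), model(n), margin=0.5)
loss.backward()
print(loss.item())
```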
Fri 5:10 p.m. - 5:30 p.m.
Cost-sensitive detection with variational autoencoders for environmental acoustic sensing (Talk)
(+ Ivan Kiskin, Davide Zilli, Marianne Sinka, Henry Chan, Kathy Willis) Environmental acoustic sensing involves the retrieval and processing of audio signals to better understand our surroundings. While large-scale acoustic data make manual analysis infeasible, they provide a suitable playground for machine learning approaches. Most existing machine learning techniques developed for environmental acoustic sensing do not provide flexible control of the trade-off between the false positive rate and the false negative rate. This paper presents a cost-sensitive classification paradigm, in which the hyper-parameters of classifiers and the structure of variational autoencoders are selected in a principled Neyman-Pearson framework. We examine the performance of the proposed approach using a dataset from the HumBug project, which aims to detect the presence of mosquitoes using sound collected by simple embedded devices.
Yunpeng Li · Stephen J Roberts
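The Neyman-Pearson idea of controlling the false-positive rate can be illustrated with a few lines of numpy: choose the decision threshold as a quantile of detector scores on held-out negatives so that the false-positive rate stays below a target alpha, then read off the detection power. The Gaussian score distributions below are synthetic stand-ins, not HumBug data.

```python
import numpy as np

def np_threshold(scores_neg, alpha=0.01):
    """Pick a decision threshold so that the false-positive rate on held-out
    negative (non-mosquito) clips is at most alpha (Neyman-Pearson style)."""
    return np.quantile(scores_neg, 1.0 - alpha)

rng = np.random.default_rng(0)
scores_neg = rng.normal(0.0, 1.0, size=5000)     # detector scores on background audio
scores_pos = rng.normal(2.0, 1.0, size=500)      # detector scores on mosquito clips
thr = np_threshold(scores_neg, alpha=0.01)
fpr = np.mean(scores_neg > thr)
tpr = np.mean(scores_pos > thr)
print(f"threshold={thr:.2f}  FPR={fpr:.3f}  TPR (power)={tpr:.3f}")
```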
Fri 5:30 p.m. - 5:45 p.m.
Break
Fri 5:45 p.m. - 6:30 p.m.
Panel: Machine learning and audio signal processing: State of the art and future perspectives (Discussion Panel)
How can end-to-end audio processing be further optimized? How can an audio processing system be built that generalizes across domains, in particular different languages, music styles, or acoustic environments? How can complex musical hierarchical structure be learned? How can we use machine learning to build a music system that is able to react in the same way an improvisation partner would? Can we build a system that could put a composer in the role of a perceptual engineer?
Panelists:
Sepp Hochreiter (Johannes Kepler University Linz, http://www.bioinf.jku.at/people/hochreiter/)
Bo Li (Google, https://research.google.com/pubs/BoLi.html)
Karen Livescu (Toyota Technological Institute at Chicago, http://ttic.uchicago.edu/~klivescu/)
Arindam Mandal (Amazon Alexa, https://scholar.google.com/citations?user=tV1hW0YAAAAJ&hl=en)
Oriol Nieto (Pandora, http://urinieto.com/about/)
Malcolm Slaney (Google, http://www.slaney.org/malcolm/pubs.html)
Hendrik Purwins (Aalborg University Copenhagen, http://personprofil.aau.dk/130346?lang=en)
Author Information
Hendrik Purwins (Aalborg University Copenhagen)
I am currently Associate Professor at the Audio Analysis Lab at Aalborg University Copenhagen, where I was previously Assistant Professor. Before that, I was a researcher in the Neurotechnology and Machine Learning Groups at Berlin Institute of Technology/Berlin Brain Computer Interface, and a lecturer at the Music Technology Group at the Universitat Pompeu Fabra in Barcelona. I have also been head of research and development at PMC Technologies, and a visiting researcher at the Perception and Sound Design Team, IRCAM; CCRMA, Stanford; and the Auditory Lab, McGill. I obtained my PhD, "Profiles of Pitch Classes", at the Neural Information Processing Group (CS/EE) at Berlin University of Technology, supported by a scholarship from the Studienstiftung des deutschen Volkes. Before that, I studied mathematics at Bonn and Muenster University, completing a diploma in pure mathematics. Having started playing the violin at the age of 7 and also studied musicology and acting on the side, I have experience as a performer in concerts and theatre. I have (co-)authored 70 scientific papers. My interests include deep learning and reinforcement learning for music and sound analysis, game strategies and robotics, statistical models for music/sound representation, expectation and generation, neural correlates of music and 3D (tele)vision, didactic tools for music and dance, and predictive maintenance in manufacturing.
Bob L. Sturm (Queen Mary University of London)
Bob L. Sturm is currently a Lecturer in Digital Media at the Centre for Digital Music (http://c4dm.eecs.qmul.ac.uk/) in the School of Electronic Engineering and Computer Science, Queen Mary University of London. He specialises in audio and music signal processing, machine listening, and evaluation. He organises the HORSE workshop at QMUL (http://c4dm.eecs.qmul.ac.uk/horse2016, http://c4dm.eecs.qmul.ac.uk/horse2017), which focuses on evaluation in applied machine learning. He is the recipient of the 2017 Multimedia Prize Paper Award for his article "A Simple Method to Determine if a Music Information Retrieval System is a 'Horse'", published in the IEEE Transactions on Multimedia (Vol. 16, No. 6, October 2014). He is one of the creators of the folk-rnn system for music transcription modelling and generation (https://github.com/IraKorshunova/folk-rnn).
Mark Plumbley (University of Surrey)