Workshop
Interpretability and Robustness in Audio, Speech, and Language
Mirco Ravanelli · Dmitriy Serdyuk · Ehsan Variani · Bhuvana Ramabhadran

Sat Dec 08 05:00 AM -- 03:30 PM (PST) @ Room 513DEF
Event URL: https://irasl.gitlab.io

The domains of natural and spoken language processing have a rich history deeply rooted in information theory, statistics, digital signal processing and machine learning. With the rapid rise of deep learning (the “deep learning revolution”), many of these systematic approaches have been replaced by variants of deep neural methods that often achieve unprecedented performance levels in many fields. With more and more of the spoken language processing pipeline being replaced by sophisticated neural layers, feature extraction, adaptation and noise robustness are learnt inherently within the network. More recently, end-to-end frameworks that learn a mapping from speech (audio) to target labels (words, phones, graphemes, sub-word units, etc.) have become increasingly popular across speech processing, in tasks ranging from speech recognition, speaker identification, language/dialect identification, multilingual speech processing, code switching and natural language processing to speech synthesis and much more.

A key aspect behind the success of deep learning lies in the discovered low- and high-level representations, which can potentially capture relevant underlying structure in the training data. In the NLP domain, for instance, researchers have mapped word and sentence embeddings to semantic and syntactic similarity and argued that the models capture latent representations of meaning. Nevertheless, some recent works on adversarial examples have shown that it is possible to easily fool a neural network (such as a speech recognizer or a speaker verification system) by just adding a small amount of specially constructed noise. Such a remarkable sensitivity to adversarial attacks highlights how superficial the discovered representations could be, raising crucial concerns about the actual robustness, security, and interpretability of modern deep neural networks. This weakness naturally leads researchers to ask crucial questions about what these models are really learning, how we can interpret what they have learned, and how the representations provided by current neural networks can be revealed or explained in such a way that modeling power can be enhanced further. These open questions have recently raised interest in the interpretability of deep models, as witnessed by the numerous works recently published on this topic in all the major machine learning conferences. Moreover, some workshops at NIPS 2016, NIPS 2017 and Interspeech 2017 have promoted research and discussion around this important issue.
With our initiative, we wish to foster further progress on the interpretability and robustness of modern deep learning techniques, with a particular focus on audio, speech and NLP technologies. The workshop will also analyze the connection between deep learning and models developed earlier for machine learning, linguistic analysis, signal processing, and speech recognition. In this way we hope to encourage a discussion amongst experts and practitioners in these areas, with the expectation of understanding these models better and building upon the existing collective expertise.

The workshop will feature invited talks, panel discussions, as well as oral and poster contributed presentations. We welcome papers that specifically address one or more of the leading questions listed below:
1. Is there a theoretical/linguistic motivation/analysis that can explain how nets encapsulate the structure of the training data they learn from?
2. Does the visualization of this information (MDS, t-SNE) offer any insights to creating a better model?
3. How can we design more powerful networks with simpler architectures?
4. How can we exploit adversarial examples to improve system robustness?
5. Do alternative methods offer any complementary modeling power to what the networks can memorize?
6. Can we explain the path of inference?
7. How do we analyze data requirements for a given model? How does multilingual data improve learning power?

Sat 5:45 a.m. - 6:00 a.m.
Workshop Opening (Introduction)
Mirco Ravanelli, Dmitriy Serdyuk, Ehsan Variani, Bhuvana Ramabhadran
Sat 6:00 a.m. - 6:30 a.m.

In machine learning, a tradeoff must often be made between accuracy and intelligibility: the most accurate models (deep nets, boosted trees and random forests) usually are not very intelligible, and the most intelligible models (logistic regression, small trees and decision lists) usually are less accurate. This tradeoff limits the accuracy of models that can be safely deployed in mission-critical applications such as healthcare, where being able to understand, validate, edit, and ultimately trust a learned model is important. In this talk, I’ll present a case study where intelligibility is critical to uncover surprising patterns in the data that would have made deploying a black-box model risky. I’ll also show how distillation with intelligible models can be used to understand what is learned inside a black-box model such as a deep net, and show a movie of what a deep net learns as it trains and then begins to overfit.

Rich Caruana
Sat 6:30 a.m. - 7:00 a.m.

The seduction of large neural nets is that one simply has to throw input data into a big network and magic comes out the other end. If the output is not magic enough, just add more layers. This simple approach works just well enough that it can lure us into a few bad assumptions, which we’ll discuss in this talk. One is that learning everything end-to-end is always best. We’ll look at an example where it isn’t. Another is that careful manual architecture design is useless because either one big stack of layers will work just fine, or if it doesn’t, we should just give up and use random architecture search and a bunch of computers. But perhaps we just need better tools and mental models to analyze the architectures we’re building; in this talk we’ll talk about one simple such tool. A final assumption is that as our models become large, they become inscrutable. This may turn out to be true for large models, but attempts at understanding persist, and in this talk, we’ll look at how the assumptions we put into our methods of interpretability color the results.

Jason Yosinski
Sat 7:00 a.m. - 7:15 a.m.

Local explanation frameworks aim to rationalize particular decisions made by a black-box prediction model. Existing techniques are often restricted to a specific type of predictor or based on input saliency, which may be undesirably sensitive to factors unrelated to the model's decision-making process. We instead propose sufficient input subsets that identify minimal subsets of features whose observed values alone suffice for the same decision to be reached, even if all other input feature values are missing. General principles that globally govern a model's decision-making can also be revealed by searching for clusters of such input patterns across many data points. Our approach is conceptually straightforward, entirely model-agnostic, simply implemented using instance-wise backward selection, and able to produce more concise rationales than existing techniques. We demonstrate the utility of our interpretation method on neural network models trained on text and image data.
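The idea of a sufficient input subset found by instance-wise backward selection can be sketched as follows. This is a simplified illustration and not the authors' implementation: the function name `sufficient_input_subset`, the scalar `predict` callable, the `threshold` on its output, and the use of a constant `mask_value` to stand in for a "missing" feature are all assumptions made for the example.

```python
import numpy as np

def sufficient_input_subset(predict, x, threshold, mask_value=0.0):
    """Return sorted indices of a small feature subset whose values alone keep
    predict(...) >= threshold, or None if even the full input falls short.

    predict: hypothetical callable mapping a 1-D feature vector to a score.
    """
    x = np.asarray(x, dtype=float)
    remaining = list(range(x.size))
    masked = x.copy()
    removal_order = []
    # Backward pass: repeatedly mask the feature whose removal hurts the score least.
    while remaining:
        scores = []
        for i in remaining:
            trial = masked.copy()
            trial[i] = mask_value
            scores.append(predict(trial))
        least_important = remaining[int(np.argmax(scores))]
        masked[least_important] = mask_value
        remaining.remove(least_important)
        removal_order.append(least_important)
    # Forward pass: restore features, last removed (most important) first,
    # and stop as soon as the decision threshold is reached again.
    subset = []
    candidate = np.full_like(x, mask_value)
    for i in reversed(removal_order):
        subset.append(i)
        candidate[i] = x[i]
        if predict(candidate) >= threshold:
            return sorted(subset)
    return None

# Toy linear "black box": only features 1 and 3 carry much weight.
weights = np.array([0.1, 2.0, 0.0, 1.5])
toy_predict = lambda v: float(weights @ v)
print(sufficient_input_subset(toy_predict, np.ones(4), threshold=3.0))
```

Because the procedure only queries `predict`, it does not depend on the model type, which is the sense in which such an approach is model-agnostic.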

Brandon Carter
Sat 7:15 a.m. - 7:30 a.m.

The increasing complexity of deep Artificial Neural Networks (ANNs) allows them to solve complex tasks in various applications. This comes with less understanding of decision processes in ANNs. Therefore, introspection techniques have been proposed to interpret how the network accomplishes its task. Those methods mostly visualize their results in the input domain and often only process single samples. For images, highlighting important features or creating inputs which activate certain neurons is intuitively interpretable. The same introspection for speech is much harder to interpret. In this paper, we propose an alternative method which analyzes neuron activations for whole data sets. Its generality allows application to complex data like speech. We introduce time-independent Neuron Activation Profiles (NAPs) as characteristic network responses to certain groups of inputs. By clustering those time-independent NAPs, we reveal that layers are specific to certain groups. We demonstrate our method for a fully-convolutional speech recognizer. There, we investigate whether phonemes are implicitly learned as an intermediate representation for predicting graphemes. We show that our method reveals which layers encode phonemes and graphemes and that similarities between phonetic categories are reflected in the clustering of time-independent NAPs. Keywords: introspection, speech recognition, phoneme representation, grapheme representation, convolutional neural networks

Andreas Krug
Sat 7:30 a.m. - 8:00 a.m.

Jamin Shin, Andrea Madotto, Pascale Fung, "Interpreting Word Embeddings with Eigenvector Analysis"
Mirco Ravanelli, Yoshua Bengio, "Interpretable Convolutional Filters with SincNet"
Shuai Tang, Paul Smolensky, Virginia R. de Sa, "Learning Distributed Representations of Symbolic Structure Using Binding and Unbinding Operations"
Lisa Fan, Dong Yu, Lu Wang, "Robust Neural Abstractive Summarization Systems and Evaluation against Adversarial Information"
Zining Zhu, Jekaterina Novikova, Frank Rudzicz, "Semi-supervised classification by reaching consensus among modalities"
Hamid Eghbal-zadeh, Matthias Dorfer, Gerhard Widmer, "Deep Within-Class Covariance Analysis for Robust Deep Audio Representation Learning"
Benjamin Baer, Skyler Seto, Martin T. Wells, "Interpreting Word Embeddings with Generalized Low Rank Models"
Abelino Jimenez, Benjamin Elizalde, Bhiksha Raj, "Sound event classification using ontology-based neural networks"
Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, Barnabas Poczos, "Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities"
Sri Harsha Dumpala, Imran Sheikh, Rupayan Chakraborty, Sunil Kumar Kopparapu, "Cycle-Consistent GAN Front-end to Improve ASR Robustness to Perturbed Speech"
Joao Felipe Santos, Tiago H. Falk, "Investigating the effect of residual and highway connections in speech enhancement models"
Jan Kremer, Lasse Borgholt, Lars Maaløe, "On the Inductive Bias of Word-Character-Level Multi-Task Learning for Speech Recognition"
Erik McDermott, "A Deep Generative Acoustic Model for Compositional Automatic Speech Recognition"
Andreas Krug, René Knaebel, Sebastian Stober, "Neuron Activation Profiles for Interpreting Convolutional Speech Recognition Models"
Anthony Bau, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, James Glass, "Identifying and Controlling Important Neurons in Neural Machine Translation"
Oiwi Parker Jones, Brendan Shillingford, "Composing RNNs and FSTs for Small Data: Recovering Missing Characters in Old Hawaiian Text"
Tzeviya Fuchs, Joseph Keshet, "Robust Spoken Term Detection Automatically Adjusted for a Given Threshold"
Shuai Tang, Virginia R. de Sa, "Improving Sentence Representations with Multi-view Frameworks"
Brandon Carter, Jonas Mueller, Siddhartha Jain, David Gifford, "Local and global model interpretability via backward selection and clustering"
Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney, "A comprehensive analysis on attention models"
Barbara Rychalska, Dominika Basaj, Przemysław Biecek, "Are you tough enough? Framework for Robustness Validation of Machine Comprehension Systems"
Jialu Li, Mark Hasegawa-Johnson, "A Comparable Phone Set for the TIMIT Dataset Discovered in Clustering of Listen, Attend and Spell"
Loren Lugosch, Samuel Myer, Vikrant Singh Tomar, "DONUT: CTC-based Query-by-Example Keyword Spotting"
Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Renato De Mori, "Speech Recognition with Quaternion Neural Networks"
Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Yu-An Chung, Yuxuan Wang, Yonghui Wu, James Glass, "Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization"
Jan Buys, Yonatan Bisk, Yejin Choi, "Bridging HMMs and RNNs through Architectural Transformations"
Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, David Sussillo, "Hallucinations in neural machine translation"
Ehsan Hosseini-Asl, Yingbo Zhou, Caiming Xiong, Richard Socher, "Robust Domain Adaptation By Augmented Cyclic Adversarial Learning"
Cheng-Zhi Anna Huang, Monica Dinculescu, Ashish Vaswani, Douglas Eck, "Visualizing Music Transformer"
Lea Schönherr, Katharina Kohls, Steffen Zeiler, Dorothea Kolossa, Thorsten Holz, "Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding"
Gautam Bhattacharya, Joao Monteiro, Jahangir Alam, Patrick Kenny, "SpeakerGAN: Recognizing Speakers in New Languages with Generative Adversarial Networks"
Jessica Thompson, Marc Schönwiesner, Yoshua Bengio, Daniel Willett, "How transferable are features in convolutional neural network acoustic models across languages?"
Ramin M. Hasani, Alexander Amini, Mathias Lechner, Felix Naser, Radu Grosu, Daniela Rus, "Response Characterization for Auditing Cell Dynamics in Long Short-term Memory Networks"

Samuel Myer, Wei-Ning Hsu, Jialu Li, Monica Dinculescu, Lea Schönherr, Ehsan Hosseini-Asl, Skyler Seto, Oiwi Parker Jones, Imran Sheikh, Thomas Manzini, Yonatan Belinkov, Nadir Durrani, Alexander Amini, Johanna Hansen, Gabi Shalev, Jay Shin, Paul Smolensky, Lisa Fan, Zining Zhu, Hamid Eghbalzadeh, Ben Baer, Abelino Jimenez, João Felipe Santos, Jan Kremer, Erik McDermott, Andreas Krug, Tzeviya S Fuchs, Shuai Tang, Brandon Carter, David Gifford, Albert Zeyer, André Merboldt, Krishna Pillutla, Katherine Lee, Titouan Parcollet, Orhan Firat, Gautam Bhattacharya, Jahangir Alam, Mirco Ravanelli
Sat 8:00 a.m. - 8:30 a.m.

It is often argued that in the processing of sensory signals such as speech, engineering should apply knowledge of the properties of human perception, since both have the same goal of getting information from the signal. We show with examples from speech technology that perceptual research can also learn from advances in technology. After all, speech evolved to be heard and properties of hearing are imprinted on speech. Subsequently, engineering optimizations of speech technology often yield human-like processing strategies. Further, fundamental difficulties that speech engineering still faces could indicate gaps in our current understanding of the human speech communication process, suggesting directions for further inquiry.

Hynek Hermansky
Sat 8:30 a.m. - 9:00 a.m.

In relation to launching the Google @home product, we were faced with the problem of far-field speech recognition. That setting gives rise to problems related to reverberant and noisy speech which degrades speech recognition performance. A common approach to address some of these detrimental effects is to use multi-channel processing. This processing is generally seen as an "enhancement" step prior to ASR and is developed and optimized as a separate component of the overall system. In our work, we integrated this component into the neural network that is tasked with the speech recognition classification task. This allows for a joint optimization of the enhancement and recognition components. And given that the structure of the input layer of the network is based on the "classical" structure of the enhancement component, it allows us to interpret what type of representation the network learned. We will show that in some cases this learned representation appears to mimic what was discovered by previous research and in some cases, the learned representation seems "esoteric".

The second part of this talk will focus on an end-to-end letter to sound model for Japanese. Japanese uses a complex orthography where the pronunciation of the Chinese characters, which are a part of the script, varies depending on the context. The fact that Japanese (like Chinese and Korean) does not explicitly mark word boundaries in the orthography further complicates this mapping. We show results of an end-to-end, encoder/decoder model structure to learn the letter-to-sound relationship. These systems are trained from speech data coming through our systems. This shows that such models are capable of learning the mapping (with accuracies exceeding 90% for a number of model topologies). Observing the learned representation and attention distributions for various architectures provides some insight as to what cues the model uses to learn the relationship. But it also shows that interpretation remains limited since the joint optimization of encoder and decoder components allows the model the freedom to learn implicit representations that are not directly amenable to interpretation.

Michiel Bacchiani
Sat 9:00 a.m. - 9:15 a.m.

Deep learning is currently playing a crucial role in progress toward higher levels of artificial intelligence. This paradigm allows neural networks to learn complex and abstract representations that are progressively obtained by combining simpler ones. Nevertheless, the internal "black-box" representations automatically discovered by current neural architectures often suffer from a lack of interpretability, which makes the study of explainable machine learning techniques of primary interest. This paper summarizes our recent efforts to develop a more interpretable neural model for directly processing speech from the raw waveform. In particular, we propose SincNet, a novel Convolutional Neural Network (CNN) that encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only the low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end that depends only on a few parameters with a clear physical meaning. Our experiments, conducted on both speaker and speech recognition, show that the proposed architecture converges faster, performs better, and is more interpretable than standard CNNs.
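To make the idea of a filter described only by two cutoff frequencies concrete, the sketch below builds a band-pass kernel as the difference of two windowed low-pass sinc filters. It is an illustration under stated assumptions rather than the SincNet code: the function name `sinc_bandpass`, the kernel length, the 16 kHz sample rate and the Hamming window are choices made for this example, and the cutoffs are fixed numbers here, whereas the abstract describes them as learned from data.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=251, sample_rate=16000):
    """Band-pass FIR kernel defined only by its low and high cutoff (in Hz)."""
    # Symmetric time axis in samples, centred on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    f1 = f_low / sample_rate   # normalised cutoffs (cycles per sample)
    f2 = f_high / sample_rate
    # np.sinc(x) = sin(pi x) / (pi x), so 2 f sinc(2 f n) is an ideal
    # low-pass filter with cutoff f; subtracting two of them leaves a band.
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A window smooths the truncation of the (infinite) ideal response.
    return kernel * np.hamming(kernel_size)

kernel = sinc_bandpass(1000.0, 2000.0)
response = np.abs(np.fft.rfft(kernel, 4096))  # bin k corresponds to k * 16000 / 4096 Hz
print(response[384], response[1024])          # gain at 1500 Hz and at 4000 Hz
```

Whatever the kernel length, the filter is fully determined by `f_low` and `f_high`, which is why such a first layer has far fewer free parameters than a standard convolution and why each filter can be read directly as a frequency band.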

Mirco Ravanelli
Sat 9:15 a.m. - 9:30 a.m.

Deep Neural Networks (DNNs) are known for excellent performance in supervised tasks such as classification. Convolutional Neural Networks (CNNs), in particular, can learn effective features and build high-level representations that can be used for classification, but also for querying and nearest neighbor search. However, CNNs have also been shown to suffer from a performance drop when the distribution of the data changes from training to test data. In this paper, we analyze the internal representations of CNNs and observe that the representations of unseen data in each class spread more (with higher variance) in the embedding space of the CNN than the representations of the training data. More importantly, this difference is more extreme if the unseen data comes from a shifted distribution. Based on this observation, we objectively evaluate the degree of the representation’s variance in each class by applying eigenvalue decomposition to the within-class covariance of the internal representations of CNNs and observe the same behavior. This can be problematic, as larger variances might lead to misclassification if the sample crosses the decision boundary of its class. We apply nearest neighbor classification to the representations and empirically show that the embeddings with high variance actually have significantly worse KNN classification performance, although this could not be foreseen from their end-to-end classification results. To tackle this problem, we propose Deep Within-Class Covariance Analysis (DWCCA), a deep neural network layer that significantly reduces the within-class covariance of a DNN’s representation, improving performance on unseen test data from a shifted distribution. We empirically evaluate DWCCA on two datasets for Acoustic Scene Classification (DCASE2016 and DCASE2017). We demonstrate that not only does DWCCA significantly improve the network’s internal representation, it also increases the end-to-end classification accuracy, especially when the test set exhibits a slight distribution shift. By adding DWCCA to a VGG neural network, we achieve an improvement of around 6 percentage points in the case of a distribution mismatch.

Hamid Eghbalzadeh
Sat 9:30 a.m. - 10:30 a.m.
Lunch Break (Break)
Sat 10:30 a.m. - 11:00 a.m.

For decades, the general architecture of the classical state-of-the-art statistical approach to automatic speech recognition (ASR) has not been significantly challenged. The classical statistical approach to ASR is based on the Bayes decision rule, a separation of acoustic and language modeling, hidden Markov modeling (HMM), and a search organization based on dynamic programming and hypothesis pruning methods. Even when deep neural networks started to considerably boost ASR performance, the general architecture of state-of-the-art ASR systems was not altered considerably. The hybrid DNN/HMM approach, together with recurrent LSTM neural network language modeling, currently marks the state of the art on many tasks covering a large range of training set sizes. However, more and more alternative approaches are now emerging, moving gradually towards so-called end-to-end approaches. Step by step, these novel end-to-end approaches replace explicit time alignment modeling and dedicated search space organization with more implicit, integrated neural-network-based representations, also dropping the separation between acoustic and language modeling, and show promising results, especially for large training sets.

In this presentation, an overview of current approaches to ASR will be given, including variations of both HMM-based and end-to-end modeling. Approaches will be discussed w.r.t. their modeling, their performance against available training data, their search space complexity and control, as well as potential modes of comparative analysis.
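For readers less familiar with the classical formulation mentioned in the abstract above, the Bayes decision rule with separate acoustic and language models is commonly written as below. The notation is an assumption of this note rather than taken from the talk: $x_1^T$ is the sequence of acoustic observations, $w_1^N$ a candidate word sequence, and $s_1^T$ a hidden state sequence of the HMM.

```latex
\hat{w}_1^{N} = \operatorname*{arg\,max}_{w_1^{N}}
  \Big\{ \underbrace{p(w_1^{N})}_{\text{language model}}
  \cdot \underbrace{p(x_1^{T} \mid w_1^{N})}_{\text{acoustic model}} \Big\},
\qquad
p(x_1^{T} \mid w_1^{N}) = \sum_{s_1^{T}} \prod_{t=1}^{T}
  p(x_t \mid s_t, w_1^{N}) \, p(s_t \mid s_{t-1}, w_1^{N})
```

End-to-end approaches, by contrast, typically model the posterior $p(w_1^N \mid x_1^T)$ directly with a single network, which is what removes the explicit separation between the two factors.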

Ralf Schlüter
Sat 11:00 a.m. - 11:15 a.m.

Inspired by the recent successes of deep generative models for Text-To-Speech (TTS) such as WaveNet (van den Oord et al., 2016) and Tacotron (Wang et al., 2017), this article proposes the use of a deep generative model tailored for Automatic Speech Recognition (ASR) as the primary acoustic model (AM) for an overall recognition system with a separate language model (LM). Two dimensions of depth are considered: (1) the use of mixture density networks, both autoregressive and non-autoregressive, to generate density functions capable of modeling acoustic input sequences with much more powerful conditioning than the first-generation generative models for ASR, Gaussian Mixture Models / Hidden Markov Models (GMM/HMMs), and (2) the use of standard LSTMs, in the spirit of the original tandem approach, to produce discriminative feature vectors for generative modeling. Combining mixture density networks and deep discriminative features leads to a novel dual-stack LSTM architecture directly related to the RNN Transducer (Graves, 2012), but with the explicit functional form of a density, and combining naturally with a separate language model, using Bayes rule. The generative models discussed here are compared experimentally in terms of log-likelihoods and frame accuracies. Keywords: Automatic Speech Recognition, Deep generative models, Acoustic modeling, End-to-end speech recognition

Erik McDermott
Sat 11:15 a.m. - 11:30 a.m.

Dense word vectors have proven their value in many downstream NLP tasks over the past few years. However, the dimensions of such embeddings are not easily interpretable. Out of the d dimensions in a word vector, we would not be able to understand what high or low values mean. Previous approaches addressing this issue have mainly focused on either training sparse/non-negative constrained word embeddings, or post-processing standard pre-trained word embeddings. In contrast, we analyze conventional word embeddings trained with Singular Value Decomposition, and reveal similar interpretability. We use a novel eigenvector analysis method inspired by Random Matrix Theory and show that semantically coherent groups form not only in the row space, but also in the column space. This allows us to view individual word vector dimensions as human-interpretable semantic features.
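As a rough illustration of what it means to read a single embedding dimension as a semantic feature, the sketch below factorises a tiny co-occurrence matrix with SVD and lists the words that load most strongly on each dimension. It is not the eigenvector analysis method of the paper, which the abstract does not specify: the function names, the square-root scaling of the singular values, the toy vocabulary and the counts are all invented for the example.

```python
import numpy as np

def svd_embeddings(cooc, dim):
    """Rank-`dim` word embeddings from a word-by-context co-occurrence matrix."""
    u, s, _ = np.linalg.svd(np.asarray(cooc, dtype=float), full_matrices=False)
    # Scaling by sqrt(s) is one common convention, assumed here.
    return u[:, :dim] * np.sqrt(s[:dim])

def top_words_per_dimension(embeddings, vocab, k=2):
    """For each dimension, the k words with the largest absolute loading."""
    order = np.argsort(-np.abs(embeddings), axis=0)[:k]
    return [[vocab[i] for i in order[:, d]] for d in range(embeddings.shape[1])]

# Toy counts with two obvious topics: animals and vehicles.
vocab = ["cat", "dog", "car", "bus"]
cooc = np.array([[4, 3, 0, 0],
                 [3, 4, 0, 0],
                 [0, 0, 5, 4],
                 [0, 0, 4, 5]])
embeddings = svd_embeddings(cooc, dim=2)
print(top_words_per_dimension(embeddings, vocab))
```

In this toy case each of the two dimensions is dominated by the words of one topic, which is the kind of column-wise coherence the abstract refers to.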

Jay Shin
Sat 11:30 a.m. - 11:45 a.m.

End-to-end automatic speech recognition (ASR) commonly transcribes audio signals into sequences of characters while its performance is evaluated by measuring the word-error rate (WER). This suggests that predicting sequences of words directly may be helpful instead. However, training with word-level supervision can be more difficult due to the sparsity of examples per label class. In this paper, we analyze an end-to-end ASR model that combines a word-and-character representation in a multi-task learning (MTL) framework. We show that it improves on the WER and study how the word-level model can benefit from character-level supervision by analyzing the learned inductive preference bias of each model component empirically. We find that by adding character-level supervision, the MTL model interpolates between recognizing more frequent words (preferred by the word-level model) and shorter words (preferred by the character-level model). Keywords: speech recognition, multi-task learning, interpretability.

Jan Kremer
Sat 11:45 a.m. - 12:30 p.m.

Jan Kremer, Erik McDermott, Brandon Carter, Albert Zeyer, Andreas Krug, Paul Pu Liang, Katherine Lee, Dominika Basaj, Abelino Jimenez, Lisa Fan, Gautam Bhattacharya, Tzeviya S Fuchs, David Gifford, Loren Lugosch, Orhan Firat, Ben Baer, Jahangir Alam, Jay Shin, Mirco Ravanelli, Paul Smolensky, Zining Zhu, Hamid Eghbalzadeh, Skyler Seto, Imran Sheikh, João Felipe Santos, Yonatan Belinkov, Nadir Durrani, Oiwi Parker Jones, Shuai Tang, André Merboldt, Titouan Parcollet, Wei-Ning Hsu, Krishna Pillutla, Ehsan Hosseini-Asl, Monica Dinculescu, Alexander Amini, Ying Zhang, Taoli Cheng, Alain Tapp
Sat 12:30 p.m. - 1:00 p.m.

Over the last few years at Google, we replaced the phrase-based machine translation system with GNMT, the Google Neural Machine Translation system. This talk will describe some of the history of this transition and explain the challenges we faced. As part of the new system we developed and used many features that hadn’t been used before in production-scale translation systems: a large-scale sequence-to-sequence model with attention, sub-word units instead of a full dictionary to address out-of-vocabulary handling and improve translation accuracy, special hardware to improve inference speed, handling of many language pairs in a single model, and other techniques that a) made it possible to launch the system at all and b) significantly improved on previous production-level accuracy. Some of the techniques we used are now standard in many translation systems. We’d like to highlight some of the remaining challenges in interpretability and robustness, and possible solutions to them.

Mike Schuster
Sat 1:00 p.m. - 1:30 p.m.

Neural encoder-decoder models have had significant empirical success in text generation, but there remain major unaddressed issues that make them difficult to apply to real problems. Encoder-decoders are largely (a) uninterpretable in their errors, and (b) difficult to control in areas such as phrasing or content. In this talk, I will argue that combining probabilistic modeling with deep learning can help address some of these issues without giving up their advantages. In particular, I will present a method for learning discrete latent templates along with generation. This approach remains deep and end-to-end, achieves comparably good results, and exposes internal model decisions. I will end by discussing some related work on successes and challenges of visualization for interpreting encoder-decoder models.

Alexander Rush
Sat 1:30 p.m. - 1:45 p.m.

Widely used recurrent units, including Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), perform well on natural language tasks, but their ability to learn structured representations is still questionable. Exploiting Tensor Product Representations (TPRs), which are distributed representations of symbolic structure in which vector-embedded symbols are bound to vector-embedded structural positions, we propose the TPRU, a recurrent unit that, at each time step, explicitly executes structural-role binding and unbinding operations to incorporate structural information into learning. Experiments are conducted on both the Logical Entailment task and the Multi-Genre Natural Language Inference (MNLI) task, and our TPR-derived recurrent unit provides strong performance with significantly fewer parameters than LSTM and GRU baselines. Furthermore, our learnt TPRU trained on MNLI demonstrates solid generalisation ability on downstream tasks.
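
The binding and unbinding operations mentioned above can be sketched with plain outer products. This is a minimal illustration of generic TPR arithmetic, not the TPRU itself: the function names, the toy vectors, and the choice of orthonormal role vectors (which makes unbinding exact) are assumptions.

```python
def outer(filler, role):
    """Outer product of a filler vector and a role vector, as a nested list."""
    return [[f * r for r in role] for f in filler]


def bind(fillers, roles):
    """Sum the filler-role outer products into one tensor product representation."""
    rows, cols = len(fillers[0]), len(roles[0])
    tpr = [[0.0] * cols for _ in range(rows)]
    for filler, role in zip(fillers, roles):
        for i, row in enumerate(outer(filler, role)):
            for j, value in enumerate(row):
                tpr[i][j] += value
    return tpr


def unbind(tpr, role):
    """Recover the filler bound to `role`; exact only when roles are orthonormal."""
    return [sum(t * r for t, r in zip(row, role)) for row in tpr]


# Hypothetical toy data: two orthonormal roles and two symbol (filler) vectors.
roles = [[1.0, 0.0], [0.0, 1.0]]
fillers = [[2.0, -1.0, 0.5], [0.0, 3.0, 1.0]]

tpr = bind(fillers, roles)
recovered = unbind(tpr, roles[1])
print(recovered)
```

With orthonormal roles, multiplying the bound representation by a role vector returns exactly the filler that was bound to it, which is what makes the structure recoverable from a single distributed representation.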

Shuai Tang
Sat 1:45 p.m. - 2:00 p.m.

Multimodal sentiment analysis is a core research area that studies speaker sentiment expressed through the language, visual, and acoustic modalities. The central challenge in multimodal learning involves inferring joint representations that can process and relate information from these modalities. However, existing work learns joint representations using multiple modalities as input and may be sensitive to noisy or missing modalities at test time. With the recent success of sequence-to-sequence models in machine translation, there is an opportunity to explore new ways of learning joint representations that may not require all input modalities at test time. In this paper, we propose a method to learn robust joint representations by translating between modalities. Our method is based on the key insight that translation from a source to a target modality provides a method of learning joint representations using only the source modality as input. We augment modality translations with a cycle consistency loss to ensure that our joint representations retain maximal information from all modalities. Once our translation model is trained with paired multimodal data, we only need data from the source modality at test time for prediction. This ensures that our model remains robust to perturbations or missing target modalities. We train our model with a coupled translation-prediction objective and it achieves new state-of-the-art results on multimodal sentiment analysis datasets: CMU-MOSI, ICT-MMMO, and YouTube. Additional experiments show that our model learns increasingly discriminative joint representations with more input modalities while maintaining robustness to perturbations of all other modalities.

Jay Shin
Sat 2:00 p.m. - 2:30 p.m.

How should one apply deep learning to tasks such as morphological reinflection, which stochastically edit one string to get another? Finite-state transducers (FSTs) are a well-understood formalism for scoring such edit sequences, which represent latent hard monotonic alignments. I will discuss options for combining this architecture with neural networks. The BiLSTM-FST scores each edit in its full input context, which preserves the ability to do exact inference over the aligned outputs using dynamic programming. The Neural FST scores each edit sequence using an LSTM, which requires approximate inference via methods such as beam search or particle smoothing. Finally, I will sketch how to use the language of regular expressions to specify not only the legal edit sequences but also how to present them to the LSTMs.
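
To make the idea of exact inference over monotonic edit sequences concrete, here is a minimal dynamic-programming sketch that finds the cheapest sequence of copy, substitute, insert, and delete edits turning one string into another. It is not the BiLSTM-FST: `edit_cost` is a hypothetical scorer that receives the source position, standing in for a model that scores each edit in its input context, and `unit_cost` is a toy stand-in for such a model.

```python
def best_alignment_cost(source, target, edit_cost):
    """Minimum total cost of editing `source` into `target` left to right.

    `edit_cost(i, src_char, tgt_char)` is a hypothetical scorer: `i` is the
    source position, src_char=None means insertion, tgt_char=None deletion.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    # best[i][j]: cheapest way to turn source[:i] into target[:j].
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:  # delete source[i-1]
                cost = edit_cost(i - 1, source[i - 1], None)
                best[i][j] = min(best[i][j], best[i - 1][j] + cost)
            if j > 0:  # insert target[j-1]
                cost = edit_cost(i, None, target[j - 1])
                best[i][j] = min(best[i][j], best[i][j - 1] + cost)
            if i > 0 and j > 0:  # copy or substitute
                cost = edit_cost(i - 1, source[i - 1], target[j - 1])
                best[i][j] = min(best[i][j], best[i - 1][j - 1] + cost)
    return best[n][m]


def unit_cost(i, src_char, tgt_char):
    """Toy scorer: copying is free, every other edit costs 1."""
    return 0.0 if src_char == tgt_char else 1.0


print(best_alignment_cost("walk", "walked", unit_cost))
```

Because each edit's cost depends only on the source position and the edit itself, the table can be filled exactly in time proportional to the product of the string lengths. Once a score depends on the whole edit history, as with the Neural FST described above, this factorization no longer holds and approximate search is needed.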

Jason Eisner
Sat 2:30 p.m. - 3:15 p.m.

Panel Discussion on "Interpretability and Robustness in Audio, Speech, and Language" (moderated by Jason Eisner).

Panelists: Samy Bengio, Rich Caruana, Mike Schuster, Ralf Schlüter, Hynek Hermansky, Renato De Mori, Michiel Bacchiani, Jason Eisner

Rich Caruana, Mike Schuster, Ralf Schlüter, Hynek Hermansky, Renato De Mori, Samy Bengio, Michiel Bacchiani, Jason Eisner

Author Information

Mirco Ravanelli (Montreal Institute for Learning Algorithms)

I received my master's degree in Telecommunications Engineering (full marks and honours) from the University of Trento, Italy in 2011. I then joined the SHINE research group (led by Prof. Maurizio Omologo) of the Bruno Kessler Foundation (FBK), contributing to some projects on distant-talking speech recognition in noisy and reverberant environments, such as DIRHA and DOMHOS. In 2013 I was a visiting researcher at the International Computer Science Institute (University of California, Berkeley), working on deep neural networks for large-vocabulary speech recognition in the context of the IARPA BABEL project (led by Prof. Nelson Morgan). I received my PhD (with cum laude distinction) in Information and Communication Technology from the University of Trento in December 2017. During my PhD I worked on “deep learning for distant speech recognition”, with a particular focus on recurrent and cooperative neural networks. In the context of my PhD I spent 6 months in the MILA lab led by Prof. Yoshua Bengio. I'm currently a post-doc researcher at the University of Montreal, working on deep learning for speech recognition in the MILA Lab.

Dmitriy Serdyuk (MILA)
Ehsan Variani (Google)

I am a Staff Research Scientist at Google. My main research interests are statistical and machine learning and information theory, with a focus on speech and language recognition.

Bhuvana Ramabhadran (Google)

Bhuvana Ramabhadran (IEEE Fellow 2017, ISCA Fellow 2017) currently leads a team of researchers at Google, focussing on multilingual speech recognition and synthesis. Previously, she was a Distinguished Research Staff Member and Manager in IBM Research AI, at the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where she led a team of researchers in the Speech Technologies Group and coordinated activities across IBM's worldwide laboratories in the areas of speech recognition, synthesis, and spoken term detection. She was the elected Chair of the IEEE SLTC (2014–2016), Area Chair for ICASSP (2011–2018) and Interspeech (2012–2016), was on the editorial board of the IEEE Transactions on Audio, Speech, and Language Processing (2011–2015), and is currently an ISCA board member. She has published over 150 papers and been granted over 40 U.S. patents. Her research interests include speech recognition and synthesis algorithms, statistical modeling, signal processing, and machine learning.