• Tutorial 1: Language Modeling: State of the Art

    Zhijian Ou (Tsinghua University), Bin Wang (Tsinghua University)
    Time: 10:00am-12:00pm, November 26, 2018
  • Abstract: Statistical language models (LMs), which estimate the joint probabilities of natural sentences, form a crucial component in many artificial intelligence applications, such as speech recognition and machine translation. In terms of probabilistic graphical modeling, language modeling methods can be categorized into two classes. One class is directed graphical models (DGMs), where the joint probability of a word sequence is factorized into the product of local conditional probabilities. The other is undirected graphical models (UGMs), where the joint probability of the whole sentence is defined to be proportional to the product of local potential functions. For DGM-based LMs, this tutorial introduces the classic n-gram LMs and neural network LMs with typical network structures, and then presents methods for reducing the computational cost and handling the out-of-vocabulary (OOV) problem. In the second part of the tutorial, we will first introduce some typical UGM-based LMs, including whole-sentence maximum entropy (WSME) LMs, trans-dimensional random field (TRF) LMs, and whole-sentence neural LMs, and then present their training algorithms, including the augmented stochastic approximation (AugSA) method and the noise-contrastive estimation (NCE) method. In addition, we will provide open-source code (https://github.com/wbengine/SPMILM) and hands-on exercises to help the audience get familiar with state-of-the-art language modeling techniques.
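
To make the DGM factorization concrete, here is a toy sketch (illustrative only, not part of the tutorial's codebase): it estimates a bigram LM from a two-sentence corpus and scores a sentence as the product of local conditional probabilities, as in the chain-rule factorization the abstract describes. The corpus and symbols are invented for the example, and no smoothing is applied.

```python
from collections import defaultdict
import math

# Toy corpus; <s> and </s> mark sentence boundaries.
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

# Count bigrams and their left contexts.
bigram = defaultdict(int)
context = defaultdict(int)
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        bigram[(w1, w2)] += 1
        context[w1] += 1

def dgm_logprob(sent):
    """DGM view: log P(sentence) = sum of log P(w_i | w_{i-1}).
    Assumes every bigram in `sent` was seen in training (no smoothing)."""
    return sum(math.log(bigram[(w1, w2)] / context[w1])
               for w1, w2 in zip(sent, sent[1:]))

p = math.exp(dgm_logprob(["<s>", "the", "cat", "sat", "</s>"]))
# P(the|<s>) * P(cat|the) * P(sat|cat) * P(</s>|sat) = 1 * 0.5 * 1 * 1 = 0.5
```

A UGM-based LM would instead assign the sentence an unnormalized score (a product of potential functions over features of the whole sentence), which is what makes training methods such as AugSA and NCE necessary.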

    Zhijian Ou received the B.S. degree with the highest honor in electronic engineering from Shanghai Jiao Tong University in 1998 and the Ph.D. degree in electronic engineering from Tsinghua University in 2003. Since 2003, he has been with the Department of Electronic Engineering of Tsinghua University and is currently an associate professor. From August 2014 to July 2015, he was a visiting scholar at Beckman Institute, University of Illinois at Urbana-Champaign. He is a senior member of IEEE.

    He has actively led government-sponsored research projects from the National Natural Science Foundation of China (NSFC), the China 863 High-tech Research and Development Program, the China Ministry of Information Industry, and the China Ministry of Education, as well as joint research projects with Intel, Panasonic, IBM, and Toshiba. He was a co-recipient of the Best Paper Award of the National Conference on Man-Machine Speech Communication in 2005. He led the team that achieved the best result for Chinese syllable recognition in the 863 evaluation in 2003. His recent research interests include speech processing (speech recognition and understanding, natural language processing) and machine intelligence (particularly with graphical models).
    Bin Wang received the B.S. degree in electronic engineering from Tsinghua University in 2012. Since 2012, he has been pursuing a Ph.D. degree at the Department of Electronic Engineering of Tsinghua University, with Zhijian Ou as his advisor. From April 2015 to September 2015, he was a visiting student working with Zhiqiang Tan in the Department of Statistics at Rutgers University. His research focuses on trans-dimensional random fields and language modeling.

  • Tutorial 2: Open-Domain Neural Dialogue Systems

    Yun-Nung Chen (National Taiwan University)
    Time: 10:00am-12:00pm, November 26, 2018
  • Abstract: Until recently, the goal of developing open-domain dialogue systems that not only emulate human conversation but also fulfill complex tasks, such as travel planning, seemed elusive. However, promising results have emerged in the last few years, as large amounts of conversation data have become available for training and breakthroughs in deep learning and reinforcement learning have been applied to dialogue. In this tutorial, we start with a brief introduction to the history of dialogue research. Then, we describe in detail the deep learning and reinforcement learning technologies that have been developed for two types of dialogue systems. First is the task-oriented dialogue system, which can help users accomplish tasks ranging from meeting scheduling to vacation planning. Second is the social bot, which can converse seamlessly and appropriately with humans. In the final part of the tutorial, we review attempts to develop open-domain neural dialogue systems by combining the strengths of task-oriented dialogue systems and social bots.

    Yun-Nung (Vivian) Chen is an assistant professor in the Department of Computer Science and Information Engineering at National Taiwan University. Her research interests focus on spoken dialogue systems, language understanding, natural language processing, deep learning, and multimodality. She received a Google Faculty Research Award in 2016, two Best Student Paper Awards from IEEE ASRU 2013 and IEEE SLT 2010, and a Best Student Paper Nomination from INTERSPEECH 2012. Chen earned her Ph.D. degree from the School of Computer Science at Carnegie Mellon University, Pittsburgh, in 2015. Prior to joining National Taiwan University, she worked in the Deep Learning Technology Center at Microsoft Research, Redmond.

  • Tutorial 3: Generative Adversarial Network and its Applications to Speech Signal and Natural Language Processing

    Hung-yi Lee (National Taiwan University), Yu Tsao (Academia Sinica)
    Time: 1:30pm-5:00pm, November 26, 2018
  • Abstract: The generative adversarial network (GAN) is a new idea for training models, in which a generator and a discriminator compete against each other to improve the generation quality. Recently, GANs have shown amazing results in image generation, and a wide variety of new ideas, techniques, and applications have been developed based on them. Although there are only a few successful cases so far, GANs have great potential to be applied to text and speech generation to overcome limitations of conventional methods. This tutorial has three parts. In the first part, we will give an introduction to GANs and provide a thorough review of the technology. In the second part, we will focus on applications of GANs to speech signal processing, including speech enhancement, voice conversion, speech synthesis, speech and speaker recognition, and lip reading. In the third part, we will describe the major challenge of sentence generation with GANs and review a series of approaches for dealing with it. We will also present algorithms that use GANs to improve the quality of the sentences generated by chatbots, to achieve unsupervised machine translation, and to perform text style transfer without paired data.
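
The adversarial game the abstract describes can be sketched in a few lines. The following toy example (not part of the tutorial materials; all parameters, learning rates, and the data distribution are made up for illustration) alternates gradient steps between a one-dimensional linear generator and a logistic discriminator: the discriminator ascends log D(real) + log(1 - D(fake)), while the generator ascends the non-saturating objective log D(fake), pulling its samples toward the real data.

```python
import math
import random

random.seed(0)
sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))

# Discriminator D(x) = sigmoid(w*x + c); generator G(z) = a*z + b.
w, c = 0.1, 0.0        # discriminator parameters
a, b = 1.0, 0.0        # generator parameters
lr = 0.05

for step in range(2000):
    x_real = 3.0 + random.gauss(0, 0.1)   # "real" data clustered near 3
    z = random.uniform(-1, 1)             # generator noise
    x_fake = a * z + b

    # Discriminator step: ascend log D(x_real) + log(1 - D(x_fake)).
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    w += lr * ((1 - d_real) * x_real - d_fake * x_fake)
    c += lr * ((1 - d_real) - d_fake)

    # Generator step: ascend log D(x_fake) (non-saturating GAN loss).
    d_fake = sigmoid(w * x_fake + c)
    grad_x = (1 - d_fake) * w             # d log D / d x_fake
    a += lr * grad_x * z
    b += lr * grad_x
```

After training, the generator's offset b has been pushed from 0 toward the real-data region near 3, which is exactly the dynamic that makes GANs attractive for generation tasks; the tutorial's text-generation part discusses why this gradient path breaks down for discrete outputs such as words.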

    Hung-yi Lee received the M.S. and Ph.D. degrees from National Taiwan University (NTU), Taipei, Taiwan, in 2010 and 2012, respectively. From September 2012 to August 2013, he was a postdoctoral fellow at the Research Center for Information Technology Innovation, Academia Sinica. From September 2013 to July 2014, he was a visiting scientist in the Spoken Language Systems Group of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He is currently an assistant professor in the Department of Electrical Engineering of National Taiwan University, with a joint appointment in the Department of Computer Science & Information Engineering of the university. His research focuses on machine learning (especially deep learning), spoken language understanding, and speech recognition.
    Yu Tsao received the B.S. and M.S. degrees in electrical engineering from National Taiwan University in 1999 and 2001, respectively, and the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology in 2008. From 2009 to 2011, he was a researcher with the National Institute of Information and Communications Technology, Japan, where he was involved in research and product development in automatic speech recognition for multilingual speech-to-speech translation. He is currently an Associate Research Fellow with the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. His research interests include speech and speaker recognition, acoustic and language modeling, audio coding, and bio-signal processing. He received the Academia Sinica Career Development Award in 2017.

  • Tutorial 4: End-To-End Models for Automatic Speech Recognition

    Presenters: Bo Li (Google, Inc.), Shuo-yiin Chang (Google, Inc.), and Yanzhang He (Google, Inc.)
    Time: 1:30pm-5:00pm, November 26, 2018
  • Abstract: Streaming automatic speech recognition (ASR) systems consist of a set of separate components, namely an acoustic model (AM), a pronunciation model (PM), a language model (LM), and an endpointer (EP). The AM takes acoustic features as input and predicts a distribution over subword units, typically context-dependent phonemes. The PM, which is traditionally a hand-engineered lexicon, maps the sequence of subword units produced by the acoustic model to words. The LM assigns probabilities to the various word hypotheses. Finally, the EP determines when the user of a system has finished speaking. In traditional ASR systems, these components are trained independently on different datasets, with a number of independence assumptions made for tractability.

    Over the last several years, there has been a growing interest in developing end-to-end systems, which attempt to learn these separate components jointly in a single system. Examples of such systems include attention-based models [1, 6], the recurrent neural network transducer [2, 3], the recurrent neural aligner [4], and connectionist temporal classification with word targets [5]. A common feature of all of these models is that they are composed into a single neural network which, when given input acoustic frames, directly outputs a probability distribution over graphemes or word hypotheses. In fact, as has been demonstrated in recent work, such end-to-end models can surpass the performance of conventional ASR systems [6].
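
To illustrate how a per-frame grapheme distribution becomes an output string, here is a minimal sketch of greedy decoding in the style of connectionist temporal classification [5] (not taken from the tutorial materials; the alphabet and frame probabilities are invented): take the argmax label per frame, collapse consecutive repeats, and drop the blank symbol.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Greedy CTC-style decoding: argmax per frame, collapse
    consecutive repeats, then remove the blank symbol."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

# Hypothetical per-frame distributions over ['-', 'c', 'a', 't'],
# where '-' (index 0) is the blank symbol.
alphabet = ["-", "c", "a", "t"]
frames = [
    [0.10, 0.70, 0.10, 0.10],  # 'c'
    [0.10, 0.60, 0.20, 0.10],  # 'c' (repeat, collapsed)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # 'a'
    [0.10, 0.10, 0.10, 0.70],  # 't'
]
decoded = ctc_greedy_decode(frames, alphabet)  # -> "cat"
```

In a real system the frame distributions would come from the neural network's softmax outputs, and beam search over the same label sequence would typically replace the per-frame argmax.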

    In this tutorial, we will provide a detailed introduction to the topic of end-to-end modeling in the context of ASR. We will begin by charting out the historical development of these systems, while emphasizing the commonalities and the differences between the various end-to-end approaches that have been considered in the literature. We will then discuss a number of recently introduced innovations that have significantly improved the performance of end-to-end models, allowing these to surpass the performance of conventional ASR systems. The tutorial will then describe some of the exciting applications of this research, along with possible fruitful directions to explore.

    Finally, the tutorial will discuss some of the shortcomings of existing end-to-end modeling approaches and discuss ongoing efforts to address these challenges.

    [1] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, Attend and Spell,” in Proc. ICASSP, 2016.
    [2] A. Graves, “Sequence transduction with recurrent neural networks,” in Proc. ICASSP, 2012.
    [3] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer,” in Proc. ASRU, 2017.
    [4] H. Sak, M. Shannon, K. Rao, and F. Beaufays, “Recurrent neural aligner: An encoder-decoder neural network model for sequence-to-sequence mapping,” in Proc. Interspeech, 2017.
    [5] H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: acoustic-to-word LSTM model for large vocabulary speech recognition,” in Proc. Interspeech, 2017.
    [6] C.C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski and M. Bacchiani, “State-of-the-art Speech Recognition With Sequence-to-Sequence Models,” in Proc. ICASSP, 2018.

    Bo Li received the Ph.D. degree in computer science from the School of Computing, National University of Singapore, in 2014, and the B.E. degree in computer engineering from the School of Computer Science, Northwestern Polytechnical University, China, in 2008. He is currently a research scientist at Google. His research interests are mainly in acoustic modeling for robust automatic speech recognition, including deep neural networks, adaptation methods, and machine learning.
    Shuo-Yiin Chang received the Ph.D. degree in Electrical Engineering and Computer Sciences from the University of California, Berkeley, USA. He obtained the M.S. degree from National Taiwan University and the B.E. degree from National Tsing Hua University. He worked at the International Computer Science Institute, a non-profit organization affiliated with the University of California, Berkeley, until 2016. He is currently a research scientist at Google. His research interests are deep neural network applications for fast and accurate speech recognition.
    Yanzhang He received the Ph.D. degree in Computer Science and Engineering from The Ohio State University, USA, in 2015. He obtained the B.E. degree in Software Engineering from Beihang University, China. He is currently a software engineer at Google, where his research mainly focuses on deep learning approaches for acoustic modeling, language modeling, keyword spotting and embedded speech recognition.