For the first time, Odyssey 2020 will feature a tutorial day on May 17, 2020, before the Odyssey 2020 workshop. With this, we aim to further strengthen Odyssey 2020 as an ISCA Tutorial and Research Workshop (ITRW). The tutorial day features short lectures focusing on recent advances in speech technology. A tentative program is presented below.
The tutorial will be held at the Hitotsubashi Hall, National Center of Sciences Building 2F, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8439. All tutorial lectures will be delivered in the Special Conference Rooms 101 and 102.
The location is easily accessible via subway and train. Access information is available here.
Do take note that the tutorial venue is DIFFERENT from the main workshop venue at Tokyo Institute of Technology (TokyoTech).
Anti-spoofing in automatic speaker recognition
Presenter: Dr Massimiliano Todisco, Eurecom, France
- More details on Anti-spoofing in ASV
- Biometric speaker authentication, also referred to as automatic speaker verification (ASV), has advanced tremendously over the last 20 years. Today, speaker authentication technology is increasingly ubiquitous, being used for person authentication and access control across a broad range of services and devices, e.g. telephone banking services and devices such as smartphones, home speakers and smartwatches that either contain or provide access to personal or sensitive data. Despite the clear advantages and proliferation of voice biometrics technology, persisting concerns regarding security vulnerabilities have dented public confidence. As a result, society has failed to benefit fully from the long-hyped promise of biometrics technology.
The tutorial is structured in two parts: the lecture and the hands-on session. The aim of the lecture is to cover all aspects of spoofing attacks and spoofing detection methods in combination with the perspective of speaker verification, while the hands-on session will show how the techniques and concepts covered in the lecture could be implemented in practice.
End-to-end speaker recognition — why, when and how to do it?
Presenter: Dr Johan Rohdin, Brno University of Technology, Czech Republic
- More details on End-to-end ASV
- End-to-end training is becoming increasingly popular for building machine learning applications. In end-to-end training, all parameters of the complete system are trained jointly for the task at hand, and the training objective is closely related to the evaluation metric of interest. In recent years, end-to-end training has shown very promising results in, for example, automatic speech recognition, language recognition and machine translation. In automatic speaker verification, end-to-end training has been beneficial mainly for text-dependent speaker verification with an abundance of training data, whereas in other scenarios several challenges remain.
In this tutorial, we will discuss the motivation for end-to-end training based on both empirical and theoretical results. We will analyze the difficulties of end-to-end training in speaker verification and how they can be dealt with. In particular, we will pay attention to the fact that, in the verification scenario, many training trials are created from a limited set of speakers and utterances, which means that the training trials are statistically dependent. We will also discuss some of the proposed approaches and available results on end-to-end training and elaborate on open research questions. Finally, in the hands-on session, we will discuss implementations and other practical issues such as mini-batch design, memory, and parallelization tricks.
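As a rough illustration of the trial dependence described above (a minimal sketch, not taken from the tutorial material), the Python snippet below builds every verification trial from a small pool of utterances. The speaker and utterance IDs and the `make_trials` helper are hypothetical and exist only for this example.

```python
from collections import Counter
from itertools import combinations


def make_trials(utterances):
    """Pair every two utterances into one verification trial.

    `utterances` is a list of (speaker_id, utterance_id) tuples.
    Returns a list of (utt_a, utt_b, is_target) tuples, where
    is_target is True when both utterances share a speaker.
    """
    trials = []
    for (spk_a, utt_a), (spk_b, utt_b) in combinations(utterances, 2):
        trials.append((utt_a, utt_b, spk_a == spk_b))
    return trials


# Toy pool (hypothetical IDs): 3 speakers with 2 utterances each.
utterances = [(f"spk{s}", f"spk{s}_utt{u}") for s in range(3) for u in range(2)]

trials = make_trials(utterances)
num_target = sum(is_target for _, _, is_target in trials)

# Count how many trials reuse each utterance.
usage = Counter()
for utt_a, utt_b, _ in trials:
    usage[utt_a] += 1
    usage[utt_b] += 1

print(f"{len(utterances)} utterances -> {len(trials)} trials")
print(f"target trials: {num_target}, non-target trials: {len(trials) - num_target}")
print(f"trials per utterance: {sorted(set(usage.values()))}")
```

With this toy pool, 6 utterances yield 15 trials, and every utterance appears in 5 of them. Trials that share an utterance or a speaker are not independent samples, which is the kind of dependence a mini-batch design has to take into account.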
Neural speech recognition
Presenter: Dr Yotaro Kubo and Mr Shigeki Karita, Google Research, Japan
- More details on Neural ASR
- After the rediscovery of neural networks and deep learning, both acoustic and language models in automatic speech recognition (ASR) have improved significantly. The potential that neural networks have shown over the last decade suggests that a neural network can serve as a model for direct speech-to-text conversion, rather than only as a separate acoustic or language model. The end-to-end approach, more generically called “neural ASR”, attempts to model the whole chain of stochastic conversions from speech signal to words in a single neural network. This lecture will focus on how neural ASR differs from, and how it resembles, the conventional hybrid ASR approaches. Furthermore, we will discuss the pros and cons of unified models and unified training in general, including approaches developed before neural networks were rediscovered. After reviewing fundamental methods for end-to-end ASR, we will explain the latest updates from recent international conferences.
Neural statistical parametric speech synthesis
Presenter: Dr Xin Wang, National Institute of Informatics, Japan
- More details on Neural speech synthesis
- Researchers have long known how to build machines that synthesize intelligible speech, but only in recent years have they found methods that make synthetic speech as natural as human speech. In this tutorial, after a general introduction to speech synthesis, we explain those recent methods, particularly the neural-network-based acoustic models (e.g., Tacotron and its variants) and waveform generators (e.g., WaveNet-based ones). We also explain some of the classical methods, such as those based on hidden Markov models, which offer lessons about the artifacts in synthesized speech. Although this tutorial is mainly from the perspective of text-to-speech synthesis, we make an excursion to voice conversion whenever the introduced model is applicable to both tasks.
Neural Machine Speech Interpreter that Listens, Speaks, and Listens while Speaking
Presenter: Dr Sakriani Sakti, NAIST, Japan
- More details on Neural Machine Speech Interpreter
- Spoken language translation, which enables people to communicate with each other while speaking different languages, has come into widespread use for short daily conversations. Yet the framework is still based on three components that need to work consecutively: automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. ASR starts recognition only after the speaker has spoken the entire sentence; MT and then TTS perform translation and synthesis sentence by sentence. Using such a system in lectures, talks, and meetings, in which the speech can be very long, may cause undesirable latency and communication mismatch.
In this tutorial, we will discuss recent research on developing neural machine speech interpreters that attempt to mimic human interpreters, mainly in their ability to (1) listen while speaking and (2) translate the incoming speech stream from a source language to a target language in real time. First, we will introduce a machine speech chain framework based on deep learning that learns not only to listen or speak but also to listen while speaking. Then, we will discuss approaches to neural incremental ASR, MT, and TTS that aim to produce high-quality speech-to-speech translations on the fly, before the speaker has finished an entire sentence.