Program (Tutorial day) | Speaker Odyssey 2020

Tutorial Program

For the first time, Odyssey 2020 will feature a tutorial day on Nov 01, 2020, before the Odyssey 2020 workshop. With this, we aim to further strengthen Odyssey 2020 as an ISCA Tutorial and Research Workshop (ITRW). The tutorial day features short lectures focusing on recent advances in speech technology.

Morning tutorial

Anti-spoofing in automatic speaker recognition
Presenter: Dr Massimiliano Todisco, Eurecom, France

Biometric speaker authentication, also referred to as automatic speaker verification (ASV), has advanced tremendously over the last 20 years. Today, speaker authentication technology is increasingly ubiquitous, being used for person authentication and access control across a broad range of services and devices, e.g. telephone banking services and devices such as smartphones, home speakers and smartwatches that either contain or provide access to personal or sensitive data. Despite the clear advantages and proliferation of voice biometrics technology, persistent concerns about security vulnerabilities have dented public confidence. As a result, society has failed to benefit fully from the long-hyped promise of biometrics technology.
The tutorial will cover all aspects of spoofing attacks and spoofing detection methods from the perspective of speaker verification, with particular attention to current research trends and advances in the development of anti-spoofing countermeasures.

End-to-end speaker recognition — why, when and how to do it?
Presenter: Dr Johan Rohdin, Brno University of Technology, Czech Republic

End-to-end training is becoming increasingly popular for building machine learning applications. In end-to-end training, all parameters of the complete system are trained jointly for the task at hand, and the training objective is closely related to the evaluation metric of interest. In recent years, end-to-end training has shown very promising results in, for example, automatic speech recognition, language recognition and machine translation. In automatic speaker verification, end-to-end training has been beneficial mainly for text-dependent speaker verification with an abundance of training data, whereas in other scenarios several challenges remain.
In this tutorial, we will discuss the motivation for end-to-end training based on both empirical and theoretical results. We will analyze the difficulties of end-to-end training in speaker verification and how they can be dealt with. In particular, we will pay attention to the fact that, in the verification scenario, many training trials are created from a limited set of speakers and utterances, which means that the training trials are statistically dependent. We will also discuss some of the proposed approaches and available results on end-to-end training and elaborate on open research questions. Finally, in the hands-on session, we will discuss implementations and other practical issues such as mini-batch design, memory, and parallelization tricks.
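As a rough illustration of why training trials are statistically dependent, the sketch below builds every verification trial (pair of utterances) from a tiny made-up set of speakers and utterances. It is not taken from the tutorial material: the function name `make_trials` and the speaker and utterance labels are hypothetical, and real systems sample trials rather than enumerate them all.

```python
from itertools import combinations


def make_trials(utterances):
    """Pair up all utterances into verification trials.

    utterances: list of (speaker_id, utterance_id) tuples.
    Returns a list of (utterance_a, utterance_b, is_target) tuples,
    where is_target is True when both utterances share a speaker.
    """
    trials = []
    for (spk_a, utt_a), (spk_b, utt_b) in combinations(utterances, 2):
        trials.append((utt_a, utt_b, spk_a == spk_b))
    return trials


# Hypothetical toy set: 3 speakers with 2 utterances each.
utterances = [(f"spk{s}", f"spk{s}_utt{u}") for s in range(3) for u in range(2)]
trials = make_trials(utterances)

n_target = sum(1 for _, _, is_target in trials if is_target)
n_nontarget = len(trials) - n_target

# Count how many trials reuse one particular utterance.
trials_with_first = sum(1 for a, b, _ in trials if "spk0_utt0" in (a, b))

print(len(utterances), "utterances give", len(trials), "trials")
print("target:", n_target, "non-target:", n_nontarget)
print("trials that reuse spk0_utt0:", trials_with_first)
```

Here six utterances yield fifteen trials, and every utterance appears in five of them, so the trials cannot be treated as independent samples. The imbalance between target and non-target trials is one reason mini-batch design needs care.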

Afternoon tutorial

Neural speech recognition
Presenters: Dr Yotaro Kubo and Mr Shigeki Karita, Google Research, Japan

After the rediscovery of neural networks and deep learning, both acoustic and language models in automatic speech recognition (ASR) have improved significantly. The potential that neural networks have shown over the last decade suggests that a neural network can serve as a model for direct speech-to-text conversion, rather than only as a separate acoustic or language model. The end-to-end approach, more generically called “neural ASR”, attempts to model the whole chain of stochastic conversions from speech signal to words in a single neural network. This lecture will focus on how neural ASR differs from, and how it resembles, the conventional hybrid ASR approaches. Furthermore, we will discuss the pros and cons of unified models and unified training in general, including approaches developed before neural networks were rediscovered. After reviewing fundamental methods for end-to-end ASR, we will explain the latest updates from recent international conferences.

Neural statistical parametric speech synthesis
Presenter: Dr Xin Wang, National Institute of Informatics, Japan

Researchers have long known how to make a machine synthesize intelligible speech, but only in recent years have they found methods that make synthetic speech sound as natural as human speech. In this tutorial, after a general introduction to speech synthesis, we explain those recent methods, particularly the neural-network-based acoustic models (e.g., Tacotron and its variants) and waveform generators (e.g., WaveNet-based ones). We also explain some of the classical methods, such as those based on hidden Markov models, which teach lessons about the artifacts in synthesized speech. Although this tutorial is mainly from the perspective of text-to-speech synthesis, we make an excursion to voice conversion whenever the introduced model is applicable to both tasks.

Towards Developing Neural Machine Speech Interpreter that Listens, Speaks, and Listens while Speaking
Presenter: Dr Sakriani Sakti, NAIST, Japan

Spoken language translation, which enables people to communicate with each other while speaking different languages, has come into widespread use for short daily conversations. Yet the framework is still based on three components that need to work consecutively: automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. ASR starts recognizing only after the speaker has spoken the entire sentence; MT and then TTS perform translation and synthesis sentence by sentence. Using such a system in lectures, talks, and meetings, where the speech can be very long, may cause undesirable latency and communication mismatch. Professional interpreters, on the other hand, can start producing their translation even before hearing the end of the sentence, by simultaneously listening, translating, and speaking in real time. This tutorial will discuss recent research on spoken language technologies towards developing neural machine speech interpreters that attempt to mimic human interpreters. First, we will introduce a machine speech chain framework based on deep learning that learns not only to listen or speak but also to listen while speaking. We will then discuss approaches to neural incremental ASR and TTS that aim to produce high-quality recognition and synthesis on the fly, before the entire input sequence is available. Finally, we will discuss the use of incremental ASR and TTS within the machine speech chain framework to construct a machine that can listen while speaking in real time.
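The latency difference between sentence-level and incremental processing can be sketched with a toy timing model. The code below is only an illustration under stated assumptions: one word arrives per time step, the functions `sentence_level` and `incremental` and the `chunk_size` parameter are hypothetical, and no actual ASR, MT, or TTS is performed.

```python
def sentence_level(words):
    """Wait for the whole sentence, then emit it in one piece.

    Yields (time_step, chunk), where time_step is the number of
    words that had to arrive before the chunk could be emitted.
    """
    buffer = list(words)
    if buffer:
        yield len(buffer), buffer


def incremental(words, chunk_size=2):
    """Emit a chunk as soon as chunk_size words have arrived."""
    buffer = []
    t = 0
    for t, word in enumerate(words, start=1):
        buffer.append(word)
        if len(buffer) == chunk_size:
            yield t, buffer
            buffer = []
    if buffer:  # flush the remaining words at the end of the sentence
        yield t, buffer


sentence = "this tutorial will discuss neural speech interpreters".split()

sentence_times = [t for t, _ in sentence_level(sentence)]
incremental_times = [t for t, _ in incremental(sentence, chunk_size=2)]

print("sentence-level output starts at step", sentence_times[0])
print("incremental output starts at step", incremental_times[0])
print("incremental emission steps:", incremental_times)
```

In this toy model the sentence-level system produces nothing until all seven words have arrived, while the incremental one starts after two. A real incremental system must also cope with having less context per chunk, which this sketch does not model.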