Voice conversion (VC) is a technique for modifying nonlinguistic information such as voice characteristics while keeping linguistic information unchanged. In the traditional VC framework, a conversion model such as a Gaussian mixture model (GMM) is trained in advance using a parallel data set consisting of utterance pairs of source and target voices.
Although this framework works reasonably well, its reliance on parallel data for training imposes many limitations on VC applications. To address this problem, I propose two flexible VC frameworks: one-to-many VC and many-to-one VC. One-to-many VC converts the source voice into an arbitrary target voice, and many-to-one VC converts an arbitrary source voice into the target voice.
An eigenvoice technique, which was originally proposed as a speaker adaptation method for speech recognition, is successfully applied to GMM-based VC for realizing these two frameworks. An eigenvoice GMM (EV-GMM) is trained in advance using multiple parallel data sets, and then the desired conversion model is effectively developed by adapting the EV-GMM to an arbitrary target voice in one-to-many VC or an arbitrary source voice in many-to-one VC. Results of experimental evaluations demonstrate the effectiveness of the proposed VC frameworks.
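The adaptation step can be illustrated with a minimal sketch. It assumes the usual eigenvoice parameterization, in which the adapted mean of mixture i is a bias vector plus a weighted sum of basis vectors, and it estimates only the weight vector by maximum likelihood from the adaptation speech. The diagonal covariances, the fixed mixture priors, the toy data, and all names (`adapted_means`, `estimate_weights`, `B`, `b`, `w`) are assumptions made for illustration and are not taken from the talk.

```python
import numpy as np

def adapted_means(B, b, w):
    """Adapted mean of every mixture: mu_i = B_i w + b_i.
    B: (M, D, K) basis vectors, b: (M, D) bias vectors, w: (K,) weights."""
    return B @ w + b

def estimate_weights(Y, B, b, var, priors, n_iter=5):
    """EM estimate of the eigenvoice weights w from adaptation frames Y (T, D).
    Assumption: diagonal covariances `var` (M, D); priors and var stay fixed."""
    M, D, K = B.shape
    w = np.zeros(K)
    for _ in range(n_iter):
        # E-step: mixture responsibilities under the current adapted means.
        mu = adapted_means(B, b, w)
        diff = Y[:, None, :] - mu[None, :, :]
        log_p = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=2)
        log_p += np.log(priors)
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form weighted least-squares update of w.
        A = np.zeros((K, K))
        r = np.zeros(K)
        for i in range(M):
            Bw = B[i] / var[i][:, None]          # inverse covariance times B_i
            A += gamma[:, i].sum() * B[i].T @ Bw
            r += Bw.T @ (gamma[:, i] @ (Y - b[i]))
        w = np.linalg.solve(A, r)
    return w

# Toy data: two mixtures, two feature dimensions, two eigenvoice weights.
rng = np.random.default_rng(0)
B = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, -0.5], [0.5, 0.5]]])
b = np.array([[0.0, 0.0], [10.0, 10.0]])
var = np.ones((2, 2))
priors = np.array([0.5, 0.5])
w_true = np.array([1.0, -0.5])

mu_true = adapted_means(B, b, w_true)
Y = np.vstack([mu_true[i] + 0.1 * rng.standard_normal((200, 2)) for i in range(2)])

w_hat = estimate_weights(Y, B, b, var, priors)
print(w_hat)
```

Only the few weights in `w` are estimated from the adaptation speech, while the basis and bias vectors come from the model trained in advance. This is what lets the model be adapted to an arbitrary voice without parallel data from that voice.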
Tomoki Toda received the B.E. degree in electrical engineering from Nagoya University, Nagoya, Japan, in 1999 and the M.E. and Ph.D. degrees in engineering from the Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), Nara, Japan, in 2001 and 2003, respectively.
From 2001 to 2003, he was an Intern Researcher at the ATR Spoken Language Translation Research Laboratories, Kyoto, Japan. He was a Research Fellow of the Japan Society for the Promotion of Science in the Graduate School of Engineering, Nagoya Institute of Technology, from 2003 to 2005. He was a Visiting Researcher at the Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, from October 2003 to September 2004. He is currently an Assistant Professor in the Graduate School of Information Science, NAIST, and a Visiting Researcher at the ATR Spoken Language Communication Research Laboratories. His research interests include speech transformation, speech synthesis, speech analysis, and speech recognition.
He received the TELECOM System Technology Award for Students and the TELECOM System Technology Award from the Telecommunications Advancement Foundation in 2003 and 2008, respectively. He has been a member of the Speech and Language Technical Committee of the IEEE Signal Processing Society since January 2007. He is a member of the ISCA, IEICE, and ASJ.
During the 5th ISCA Speech Synthesis Workshop (SSW5) in Pittsburgh, PA, he gave a tutorial on voice transformation.