HMM-based speech synthesis has recently become the focus of much work on corpus-based text-to-speech (TTS). In the training process, speech parameters such as spectrum and F0 sequences are modeled by context-dependent hidden Markov models (HMMs). In the synthesis process, speech parameters are generated from the HMMs themselves based on the maximum likelihood criterion, and then a speech waveform is synthesized with a vocoding technique.
Because the parameter generation process considers the statistics of both static and dynamic features, it successfully synthesizes smooth and natural-sounding speech. However, the statistical processing often smooths the generated parameters excessively, and these over-smoothed speech parameters usually cause muffled sounds.
To alleviate the over-smoothing effect, I propose a parameter generation algorithm that considers not only the HMM likelihood but also the likelihood of the global variance (GV) of the generated trajectory. The latter likelihood acts as a penalty on over-smoothing, i.e., on a reduction of the GV of the generated trajectory. A perceptual evaluation demonstrates that the proposed algorithm considerably improves the naturalness of synthetic speech.
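The role of the GV term can be illustrated with a minimal sketch. It is not the algorithm from the talk. It assumes that the GV is the per-dimension variance of a trajectory over time and that the GV is modeled by a single diagonal-covariance Gaussian. The function names, the toy sinusoidal trajectory, the moving-average smoother standing in for over-smoothing, and the GV model parameters are all hypothetical. The sketch only shows that a smoothed trajectory has a smaller GV and therefore a lower GV likelihood, which is the quantity the penalty acts on.

```python
import math


def global_variance(trajectory):
    """Per-dimension variance over time of a trajectory (list of frames)."""
    num_frames = len(trajectory)
    num_dims = len(trajectory[0])
    gv = []
    for d in range(num_dims):
        values = [frame[d] for frame in trajectory]
        mean = sum(values) / num_frames
        gv.append(sum((v - mean) ** 2 for v in values) / num_frames)
    return gv


def gv_log_likelihood(gv, gv_mean, gv_var):
    """Log density of a GV vector under a diagonal Gaussian (assumed model form)."""
    total = 0.0
    for v, m, s2 in zip(gv, gv_mean, gv_var):
        total += -0.5 * (math.log(2.0 * math.pi * s2) + (v - m) ** 2 / s2)
    return total


def moving_average(trajectory, width=5):
    """Crude stand-in for over-smoothing: average each frame with its neighbours."""
    half = width // 2
    smoothed = []
    for t in range(len(trajectory)):
        window = trajectory[max(0, t - half): t + half + 1]
        smoothed.append([sum(col) / len(window) for col in zip(*window)])
    return smoothed


# Toy one-dimensional "natural" trajectory (made-up values, not real speech data).
natural = [[math.sin(0.9 * t)] for t in range(50)]
smoothed = moving_average(natural)

gv_natural = global_variance(natural)
gv_smoothed = global_variance(smoothed)

# Hypothetical GV model: centred on the natural GV, with an arbitrary variance.
gv_mean, gv_var = gv_natural, [0.01]

print("GV natural :", gv_natural)
print("GV smoothed:", gv_smoothed)
print("GV log-likelihood, natural :", gv_log_likelihood(gv_natural, gv_mean, gv_var))
print("GV log-likelihood, smoothed:", gv_log_likelihood(gv_smoothed, gv_mean, gv_var))
```

In a generation algorithm of the kind described above, a term like `gv_log_likelihood` would be added to the HMM log-likelihood, so that trajectories whose variance has shrunk are penalized. How the two terms are weighted is not stated here and is left out of the sketch.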
Tomoki Toda received the B.E. degree in electrical engineering from Nagoya University, Nagoya, Japan, in 1999 and the M.E. and Ph.D. degrees in engineering from the Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), Nara, Japan, in 2001 and 2003, respectively.
From 2001 to 2003, he was an Intern Researcher at the ATR Spoken Language Translation Research Laboratories, Kyoto, Japan. He was a Research Fellow of the Japan Society for the Promotion of Science in the Graduate School of Engineering, Nagoya Institute of Technology, from 2003 to 2005. He was a Visiting Researcher at the Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, from October 2003 to September 2004. He is currently an Assistant Professor in the Graduate School of Information Science, NAIST, and a Visiting Researcher at the ATR Spoken Language Communication Research Laboratories. His research interests include speech transformation, speech synthesis, speech analysis, and speech recognition.
He received the TELECOM System Technology Award for Students and the TELECOM System Technology Award from the Telecommunications Advancement Foundation in 2003 and 2008, respectively. He has been a member of the Speech and Language Technical Committee of the IEEE Signal Processing Society since January 2007. He is a member of the ISCA, IEICE, and ASJ.
At the 5th ISCA Speech Synthesis Workshop (SSW5) in Pittsburgh, PA, he gave a tutorial on voice transformation.