An HMM-based speech synthesis system has recently become the focus of much work on corpus-based Text-to-Speech (TTS). In the training process, speech parameters such as spectrum and F0 sequences are modeled by context-dependent hidden Markov models (HMMs). In the synthesis process, speech parameters are generated from the trained HMMs based on the maximum likelihood criterion, and a speech waveform is then synthesized from them with a vocoding technique.
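As a sketch of this generation step under the standard maximum-likelihood formulation (the symbols below are introduced here only for illustration), let $\boldsymbol{c}$ denote the static feature sequence and $\boldsymbol{o} = \boldsymbol{W}\boldsymbol{c}$ the corresponding sequence of static and dynamic features, where $\boldsymbol{W}$ is the window matrix that appends the dynamic features. For a given HMM state sequence with mean vector $\boldsymbol{\mu}$ and (assumed diagonal) covariance matrix $\boldsymbol{\Sigma}$, maximizing $\log \mathcal{N}(\boldsymbol{W}\boldsymbol{c};\, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ with respect to $\boldsymbol{c}$ gives the closed-form solution
\[
\hat{\boldsymbol{c}} = \left( \boldsymbol{W}^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{W} \right)^{-1} \boldsymbol{W}^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} .
\]
Because the dynamic-feature constraints couple adjacent frames through $\boldsymbol{W}$, the resulting trajectory varies smoothly over time rather than jumping between state means.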
Although smooth and natural-sounding speech can be synthesized thanks to this parameter generation process, which takes into account the statistics of both static and dynamic features, the generated parameters are often excessively smoothed by the statistical processing, and such over-smoothed speech parameters usually cause muffled-sounding synthetic speech.
To alleviate this over-smoothing effect, I propose a parameter generation algorithm that considers not only the HMM likelihood but also the likelihood of the global variance (GV) of the generated trajectory. The latter likelihood works as a penalty against over-smoothing, i.e., against a reduction of the GV of the generated trajectory. The results of a perceptual evaluation demonstrate that the proposed algorithm yields considerable improvements in the naturalness of synthetic speech.
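As a hedged sketch of the augmented criterion (the GV distribution and the weight $\omega$ below are assumptions consistent with the description above, not details stated in this section), the GV of the generated static feature trajectory $\boldsymbol{c} = [\boldsymbol{c}_1^{\top}, \ldots, \boldsymbol{c}_T^{\top}]^{\top}$ is its per-dimension variance over the utterance,
\[
v(d) = \frac{1}{T} \sum_{t=1}^{T} \bigl( c_t(d) - \bar{c}(d) \bigr)^{2}, \qquad
\bar{c}(d) = \frac{1}{T} \sum_{t=1}^{T} c_t(d),
\]
and the trajectory is generated by maximizing a combined objective of the form
\[
\mathcal{L}(\boldsymbol{c}) = \log \mathcal{N}\!\left(\boldsymbol{W}\boldsymbol{c};\, \boldsymbol{\mu}, \boldsymbol{\Sigma}\right)
+ \omega \, \log \mathcal{N}\!\left(\boldsymbol{v}(\boldsymbol{c});\, \boldsymbol{\mu}_v, \boldsymbol{\Sigma}_v\right),
\]
where $\mathcal{N}(\boldsymbol{v}(\boldsymbol{c});\, \boldsymbol{\mu}_v, \boldsymbol{\Sigma}_v)$ is a Gaussian GV model estimated from natural speech and $\omega$ balances the two likelihoods. Because the GV term is not quadratic in $\boldsymbol{c}$, the maximization is typically carried out by an iterative, gradient-based update initialized from the conventional ML-generated trajectory.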