Speech synthesis is developing rapidly in both academic and commercial spheres thanks to advances in machine learning. Its output can be heard in voice assistants such as Apple Siri and Google Home, in public service announcements, in spoken machine translation, and in research dialogue systems.
Until now, speech synthesis has been developed primarily for reading out text and has not been adapted for conversational use in voice assistants or dialogue systems. The quality of synthesized speech is evaluated only in terms of reading quality, using listening tests performed by human judges.
Human evaluation is expensive, slow, and thus inherently limited to a small sample of synthesized recordings. Moreover, current systems have reached a level of quality at which further improvements are very difficult to distinguish by human perceptual evaluation, so a large number of evaluated examples is needed to obtain statistically significant results, which complicates the evaluation process even further.
In this project, we will focus on the evaluation of speech synthesis, specifically on synthesis for task-oriented dialogue systems. Task-oriented dialogue systems are being actively developed in academia and industry, but the quality of their speech synthesis in conversation does not reach the quality of speech synthesis for reading. The characteristics that distinguish conversational speech from read speech are still not captured in the available datasets or in the architectures of the models in use.
Our research builds on two hypotheses. First, the evaluation of speech in dialogue is specific: the context of the conversation has to be taken into account in both perceptual and automatic evaluation. Second, annotations of dialogue context in the form of dialogue acts are useful for robust automatic speech evaluation, especially for evaluating prosody and speaking style in spoken conversation.
We anticipate that our new approach to speech synthesis evaluation will yield four main contributions:
(1) A new perceptual evaluation method will assess whether the prosody of a synthesized utterance is appropriate for the given dialogue context. The method will also allow speech synthesis quality to be compared between reading tasks and conversational use.
(2) We will propose a new automated evaluation metric for conversational data that uses dialogue act prediction to evaluate prosody (an illustrative sketch follows this list). We will compare the metric with existing automated metrics, which are designed only for read speech and do not perform well on conversational data.
(3) We will explore direct optimization of neural speech synthesis models using our proposed automatic metrics. We expect that conversational speech synthesis supervised by dialogue act annotations will produce speech that sounds more natural in the given context.
(4) We will verify that low-cost crowdsourced collection of audio recordings based on textual conversational data yields sufficient quality, which would make it easy to obtain new conversational speech datasets annotated with dialogue acts. We will use the collected speech data with annotated dialogue context to evaluate our proposed metrics and to fine-tune conversational speech synthesis models.
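To make the idea behind contribution (2) more concrete, the sketch below shows one possible way a dialogue-act-based metric could be computed: a classifier predicts the dialogue act of each synthesized utterance, and the score is its agreement with the reference annotations. The function dialogue_act_agreement, the classifier predict_dialogue_act, and the toy data are illustrative assumptions only, not the metric we will actually propose.

```python
# Illustrative sketch: scoring synthesized conversational speech by how well a
# dialogue-act classifier recovers the annotated act from each utterance.
# The classifier and data layout here are hypothetical placeholders.
from typing import Callable, Sequence


def dialogue_act_agreement(
    synthesized_utterances: Sequence[object],
    reference_acts: Sequence[str],
    predict_dialogue_act: Callable[[object], str],
) -> float:
    """Fraction of utterances whose predicted dialogue act matches the
    reference annotation (higher = prosody better matches the context)."""
    assert len(synthesized_utterances) == len(reference_acts)
    hits = 0
    for audio, gold_act in zip(synthesized_utterances, reference_acts):
        # In a real setting, the classifier would operate on prosodic/acoustic features.
        if predict_dialogue_act(audio) == gold_act:
            hits += 1
    return hits / max(len(reference_acts), 1)


if __name__ == "__main__":
    # Toy example with string placeholders instead of waveforms and a
    # lookup table standing in for a trained dialogue-act classifier.
    toy_audio = ["wav_001", "wav_002", "wav_003"]
    gold_acts = ["question", "inform", "confirm"]
    toy_classifier = lambda wav: {"wav_001": "question",
                                  "wav_002": "inform",
                                  "wav_003": "inform"}[wav]
    print(dialogue_act_agreement(toy_audio, gold_acts, toy_classifier))  # 0.666...
```

A real instantiation would replace the toy classifier with a model trained on conversational speech annotated with dialogue acts, and could report macro-F1 instead of plain agreement to account for class imbalance.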
By publishing the new speech dataset, the evaluation metrics, and the developed neural models, we will make it easy for other researchers to compare their results with ours, while drawing attention to the still under-explored area of conversational speech synthesis.