Multimodal affective output
The project focused on the artificial generation of emotional behaviours through different modalities. Among these, speech plays an important role, and Loquendo is improving its Text-to-Speech (TTS) technology by investigating new extra-linguistic features.
The English TTS provides some features that contribute to increasing the naturalness and expressiveness of the avatar. The first is an appropriately designed repertoire of phrases that can be rendered expressively. These are frequently used phrases with pragmatic intentions that can be inserted into the speech flow to provide different speaking styles. In terms of prosody and voice quality, these units differ from the 'standard' synthesis output, which is characterized by an almost flat reading style.
An inventory of discourse markers and speech acts is available for the English TTS. These are classified into speech act categories such as Refuse, Approval, Announce, Contrast, Disbelief, Surprise, Regret, Thanks, Greetings, Apologies and Compliments.
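As a rough illustration of how such an inventory could be organised for lookup, the sketch below groups expressive units by speech act category. It does not reflect Loquendo's actual database format: the `ExpressiveUnit` type, the `index_by_category` helper and the sample phrases are hypothetical placeholders; only the category names come from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpressiveUnit:
    """One pre-recorded expressive phrase (hypothetical record layout)."""
    category: str  # speech act category, e.g. "Thanks"
    text: str      # orthographic form of the phrase


def index_by_category(units):
    """Group units by speech act category, preserving input order."""
    index = defaultdict(list)
    for unit in units:
        index[unit.category].append(unit)
    return dict(index)


# Placeholder entries for illustration only; not taken from the real inventory.
UNITS = [
    ExpressiveUnit("Thanks", "Thank you very much!"),
    ExpressiveUnit("Apologies", "I'm so sorry."),
    ExpressiveUnit("Thanks", "Thanks a lot."),
]

INDEX = index_by_category(UNITS)
```

A dialogue component could then pick any unit from `INDEX["Thanks"]` when it wants to express gratitude, and fall back to standard synthesis when a category has no entry.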
The English TTS also comprises a database of human sounds, i.e. samples without any linguistic or semantic content that help enhance the naturalness and, in certain cases, the emotional behaviour of the avatar. Examples of these elements are laughs, hesitations, interjections and cries.
The quality and naturalness of the Loquendo speech synthesizer are based on the size and labeling accuracy of the 'neutral' speech data stored in the TTS database. While the concatenative approach was common to similar state-of-the-art systems within the project, Loquendo experimented with solutions that aim not only at selecting and concatenating fixed portions of waveforms but also at modifying supra-linguistic characteristics of the selected data. For example, changes in intonation and speech rhythm lead to an output that can be perceived as different from the 'neutral' reading style and, depending on the manipulations applied, as having emotional characteristics.
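To make the idea of manipulating intonation and rhythm more concrete, here is a minimal sketch that operates on a pitch (F0) contour and a list of unit durations. It is an assumption-laden illustration, not Loquendo's signal processing: the function name `modify_prosody`, its parameters, and the convention that an F0 value of 0 marks an unvoiced frame are all hypothetical.

```python
from statistics import mean


def modify_prosody(f0_hz, durations_s, pitch_range=1.0, pitch_shift=1.0, rate=1.0):
    """Return a modified (F0 contour, durations) pair.

    f0_hz       -- F0 values in Hz, one per frame; 0 marks an unvoiced frame
    durations_s -- durations of the selected units, in seconds
    pitch_range -- >1 widens the excursions around the mean F0, <1 flattens them
    pitch_shift -- multiplies the whole contour (overall pitch level)
    rate        -- >1 speeds speech up (shorter durations), <1 slows it down
    """
    voiced = [f for f in f0_hz if f > 0]
    centre = mean(voiced) if voiced else 0.0
    new_f0 = [
        (centre + (f - centre) * pitch_range) * pitch_shift if f > 0 else 0.0
        for f in f0_hz
    ]
    new_durations = [d / rate for d in durations_s]
    return new_f0, new_durations
```

For instance, a livelier rendering might use a wider pitch range and a faster rate (`pitch_range=1.5, rate=1.2`), while a subdued one might flatten and slow the contour (`pitch_range=0.7, rate=0.85`); these values are illustrative, not tuned settings.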
Of course, the variability of human vocal expressiveness is far from being covered by today's speech synthesis technologies; nonetheless, some effective results were achieved by switching the speaking style of the avatar according to the recognized emotional state of the user.
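The style switching described above can be sketched as a simple lookup from the recognized user emotion to a speaking style for the avatar. The emotion labels, style names, confidence score and threshold below are all hypothetical; the project's actual label sets and decision logic are not specified here.

```python
DEFAULT_STYLE = "neutral"

# Hypothetical mapping from recognized user emotion to avatar speaking style.
STYLE_FOR_USER_EMOTION = {
    "anger": "apologetic",
    "sadness": "empathic",
    "joy": "cheerful",
}


def select_speaking_style(user_emotion, confidence, threshold=0.6):
    """Pick a speaking style, falling back to neutral when the recognizer
    is unsure or the emotion has no associated style."""
    if confidence < threshold:
        return DEFAULT_STYLE
    return STYLE_FOR_USER_EMOTION.get(user_emotion, DEFAULT_STYLE)
```

Falling back to the neutral style on low confidence is a design choice made for this sketch: an expressive style triggered by a misrecognized emotion is likely to sound more inappropriate than a flat reading.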


