Speech recognition and synthesis
Speech recognition and synthesis are core technologies in the construction of a fully functional Companion. Speech is the most natural way for human beings to communicate and receive information, and is far easier to use than keyboards or other computer input devices. Text-to-speech, or speech synthesis, is already extremely successful (although many issues remain in making the speech sound more natural and conveying appropriate emotions). Speech recognition, in contrast, still faces substantial technical challenges.
Speech synthesis underwent a dramatic change of paradigm just a few years ago. The new paradigm is based on the concatenation of speech segments extracted directly from the natural voice of a speaker, and so embeds acoustic-phonetic knowledge into the acoustic units themselves. Current synthesised voices sound extremely natural, and the major research challenge lies in increasing the appropriacy of the prosodic structure associated with the speech output and in modelling emotions in a convincing manner.
Speech recognition has been very successful in restricted domains, either where there is a limited vocabulary (e.g. train booking services on the telephone) or where the system can be trained for a specific speaker (e.g. the Dragon dictation products from Nuance). Within these limits its accuracy can be very high, but outside these conditions accuracy can fall substantially.
"Within Companions, our objective is to integrate the components of the dialogue system, user models and knowledge representation so as to improve speech recognition accuracy"
The issue of speech errors
Some implementations of the Companions agents over the Internet will be based on written dialogue and will therefore not be dependent on the performance of speech recognition systems. Others, in particular for handheld or mobile systems, are likely to be implemented mostly as spoken dialogue systems, and their overall performance is likely to be affected by speech recognition accuracy.
It has been established that speech recognition (SR) accuracy tends to degrade from controlled pronunciation (e.g. during dictation) to natural conversations (Oviatt 2000), and this could affect Companions, despite the progress of state-of-the-art automatic speech recognition (ASR) performance. However, as demonstrated since the late 1990s by France Telecom R&D laboratories with the Artimis system (Sadek 1999), the overall performance of a dialogue system can be far superior to that of its speech recognition layer.
It is now accepted that, for systems that aim at utterance understanding, Word Error Rate (WER) is not the most appropriate metric. Boros et al. (1996) introduced the notion of 'concept accuracy' to describe the semantic impact of SR errors. However, this measures the impact on the processing of an isolated utterance and as such does not constitute a proper dialogue metric.
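The difference between the two measures can be illustrated with a minimal Python sketch. It assumes that WER is the word-level edit distance divided by the reference length, and that concept accuracy is computed by analogy over attribute-value pairs; the concept representation, function names and example utterances are illustrative assumptions, not taken from Boros et al.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions and deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(ref_words, hyp_words):
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def concept_accuracy(ref_concepts, hyp_concepts):
    # Assumed definition: same error counting as WER, but over
    # semantic units (here, hypothetical attribute-value pairs).
    return 1 - edit_distance(ref_concepts, hyp_concepts) / len(ref_concepts)


reference = "i want a train to london at nine".split()
ref_concepts = [("destination", "london"), ("time", "nine")]

# Two hypotheses with one word error each, i.e. the same WER of 0.125.
harmless = "i want the train to london at nine".split()
harmful = "i want a train to london at five".split()

wer_harmless = word_error_rate(reference, harmless)
wer_harmful = word_error_rate(reference, harmful)

# Only the second error changes a concept.
ca_harmless = concept_accuracy(ref_concepts, ref_concepts)
ca_harmful = concept_accuracy(
    ref_concepts, [("destination", "london"), ("time", "five")])
```

Both hypotheses have the same WER, yet only the second one lowers concept accuracy (from 1.0 to 0.5), which is why WER alone is a poor guide for understanding-oriented systems.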
Glass et al. (2000) have introduced metrics aiming at characterising dialogue performance. Query Density (QD) measures how effectively the user can transmit information to the system by quantifying the number of new concepts introduced per user query. In the present context it could be relevant to extend this metric from its original information-seeking dialogue formulation to multiple dialogue genres. Concept Efficiency (CE) is a measure of understanding through dialogue, in the form of the average number of dialogue turns required for each concept to be understood by the system.
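As a rough illustration, both metrics can be computed from per-dialogue counts. The sketch below assumes one possible formalisation: QD as the per-dialogue ratio of understood concepts to user queries, and CE as the ratio of understood concepts to total concept mentions (repeats included), each averaged over dialogues. The `DialogueLog` structure, its field names and the numbers are hypothetical, and the exact definitions should be checked against Glass et al. (2000).

```python
from dataclasses import dataclass


@dataclass
class DialogueLog:
    user_queries: int         # number of user queries in the dialogue
    concepts_understood: int  # distinct concepts the system understood
    concept_mentions: int     # times the user expressed a concept, repeats included


def query_density(dialogues):
    """Average number of understood concepts per user query."""
    return sum(d.concepts_understood / d.user_queries
               for d in dialogues) / len(dialogues)


def concept_efficiency(dialogues):
    """Average share of concept mentions that led to understanding.
    Its reciprocal approximates the mentions needed per concept."""
    return sum(d.concepts_understood / d.concept_mentions
               for d in dialogues) / len(dialogues)


logs = [
    DialogueLog(user_queries=4, concepts_understood=6, concept_mentions=8),
    DialogueLog(user_queries=5, concepts_understood=5, concept_mentions=10),
]

qd = query_density(logs)       # (6/4 + 5/5) / 2 = 1.25
ce = concept_efficiency(logs)  # (6/8 + 5/10) / 2 = 0.625
```

Under these assumptions a CE of 1.0 would mean that every concept was understood the first time it was expressed, while lower values indicate that users had to repeat themselves.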
Ultimately, however, in Companions the issue of speech recognition accuracy has to be considered in the integrated context of the embodied conversational agent (ECA) rather than through ASR benchmarks only.
- Oviatt, S.L. (2000) Taming Speech Recognition Errors Within a Multimodal Interface. Communications of the ACM 43:45-51 (special issue on Conversational Interfaces).
- Sadek, D. (1999) Design considerations on dialogue systems: from theory to technology - the case of Artimis. Proceedings of the ESCA Workshop on Interactive Dialogue in Multimodal Systems (Kloster Irsee: Germany), pp. 173-187.
- Boros, M., Wieland, E., Gallwitz, F., Gorz, G., Hanrieder, G. and Niemann, H. (1996) Toward understanding spontaneous speech: Word accuracy vs. semantic accuracy. Proceedings of the International Conference on Spoken Language Processing (ICSLP) 1996, pp. 1005-1008.
- Glass, J., Polifroni, J., Seneff, S. and Zue, V. (2000) Data collection and performance evaluation of spoken dialogue systems: The MIT experience. Proceedings of the International Conference on Spoken Language Processing (ICSLP) 2000, Beijing, China.
ECA influence on speech patterns
ECA-based dialogues have some specific features, relative to other dialogue interfaces, which should be considered here (Oviatt and Adams 2000).
The first, originally described by Julia and Cheyer (1999), is the potential influence of the ECA's appearance and behaviour on the users' speech patterns. Another is the potential impact of recognition and understanding errors on the user-ECA relation. Fischer and Batliner (2000) have classified recognition errors in dialogue systems according to their emotional impact on the user.
Several strategies will be explored as part of Companions, which will relate affective dialogue to SR accuracy:
- Internal measures of SR confidence would be used to adapt the ECA dialogue strategy, specifically by detecting user irritation and avoiding behaviours that could upset the user.
- SR errors would trigger adapted responses in terms of ECA animation, politeness strategies, and appropriate and careful use of humour.
- A 'level of understanding' would be defined as a continuum ranging from accurate understanding of the utterance's meaning to a simple affective categorisation of that utterance.
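A purely illustrative Python sketch of how such strategies might be combined is given here. The three discrete levels, the confidence thresholds, the error counter and the response labels are all hypothetical choices made for the example; the text describes a continuum and does not specify any of them.

```python
from enum import Enum


class Understanding(Enum):
    FULL = "full semantic interpretation"
    PARTIAL = "key concepts only"
    AFFECTIVE = "affective categorisation only"


def level_of_understanding(confidence):
    """Map an SR confidence score in [0, 1] to a coarse level.
    The thresholds are arbitrary placeholders."""
    if confidence >= 0.8:
        return Understanding.FULL
    if confidence >= 0.5:
        return Understanding.PARTIAL
    return Understanding.AFFECTIVE


def choose_response(confidence, consecutive_errors):
    """Pick a (hypothetical) ECA dialogue move."""
    level = level_of_understanding(confidence)
    if level is Understanding.FULL:
        return "answer"
    if consecutive_errors >= 2:
        # Repeated failures suggest the user may be irritated:
        # stay polite, avoid humour, simplify the request.
        return "apologise_and_simplify"
    if level is Understanding.PARTIAL:
        return "confirm_key_concepts"
    # Fall back to reacting to the affective tone of the utterance only.
    return "acknowledge_mood_and_reprompt"
```

For example, `choose_response(0.6, 0)` selects a confirmation move, whereas the same confidence after two consecutive errors selects the more cautious apology.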
- Oviatt, S.L. and Adams, B. (2000) Designing and Evaluating Conversational Interfaces with Animated Characters. In: Cassell, Justine et al. (eds) Embodied Conversational Agents (MIT Press: Cambridge), pp. 319-343.
- Julia, L. and Cheyer, A. (1999) Is Talking to Virtual more Realistic? Proceedings of EuroSpeech'99, Budapest, Hungary.
- Fischer, K. and Batliner, A. (2000) What Makes Speakers Angry in Human-Computer Conversation. Proceedings of the Third Workshop on Human-Computer Conversation, Bellagio, Italy, 3-5 July 2000.
Speech synthesis: emotion and multimodality
Among the crucial research issues relevant for Companions, research on 'emotional' aspects of speech production has received growing interest during the past few years. Within this area, prosody control is a hot topic, because at present the prosody of most state-of-the-art synthesizers cannot reproduce the quality of variation required for emotional speech.
The input to the synthesizer is of vital importance for better prosody control: raw text cannot specify the appropriate paralinguistic interpretation of semantic content. Annotated input to a synthesizer would allow a finer specification of speaking style and of the intended interpretation of a message.
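As a hedged illustration of annotated input, the sketch below wraps a message in W3C SSML markup, using the standard `prosody` element with its `rate` and `pitch` attributes. SSML is chosen here only as a widely known annotation format; the text does not name one, and the mapping from speaking styles to prosody settings is a hypothetical placeholder.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from a speaking style to SSML prosody settings.
STYLE_TO_PROSODY = {
    "cheerful": {"rate": "fast", "pitch": "high"},
    "neutral": {"rate": "medium", "pitch": "medium"},
    "sad": {"rate": "slow", "pitch": "low"},
}


def annotate(text, style):
    """Return an SSML document that asks the synthesizer to render
    `text` with the prosody associated with `style`."""
    settings = STYLE_TO_PROSODY[style]
    rate = settings["rate"]
    pitch = settings["pitch"]
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">'
        f'<prosody rate="{rate}" pitch="{pitch}">'
        f"{escape(text)}"  # escape &, < and > in the raw text
        "</prosody></speak>"
    )


ssml = annotate("I'm sorry, I didn't catch that.", "sad")
```

The same raw sentence could then be rendered in different styles simply by changing the annotation, which is the kind of control that unannotated text cannot provide.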
Multimodality of speech synthesis is another area of increasing growth. The integration of voice output, gesture, and facial expression reflects the fact that speech accompanied by visual information provides a more robust and rich way of communicating, particularly in noisy environments, and when young people or the elderly are involved.
The choice of basic units for speech synthesis is an old issue, but many researchers expect it to come to the fore once again. The granularity of the unit used for selection is a crucial aspect, and an advance in this research area could bring a dramatic improvement in synthesized voice quality (perhaps the improvement needed to approach demanding, emerging application fields such as entertainment, customer care, robots, education, and home and car automation).
If the next generation of speech synthesizers is to be used in unobtrusive conversational interaction with human interlocutors, there will be a need to express moods and attitudes, and more use will be made of 'fillers' such as laughs, coughs and filled pauses (e.g. see Hamza et al. 2004, and Zovato et al. 2004 for an alternative approach).
- Hamza, W. et al. (2004) The IBM expressive speech synthesis system. Proceedings of the 5th ISCA Speech Synthesis Workshop, Pittsburgh, 14-16 June 2004.
- Zovato, E., Pacchiotti, A., Quazza, S. and Sandri, S. (2004) Towards emotional speech synthesis: a rule based approach. Proceedings of the 5th ISCA Speech Synthesis Workshop, Pittsburgh, 14-16 June 2004.
Updated: 12 November 2007 15:43 PM


