Multimodal affective input
The purpose of the Emotional Model (EM) is to provide a 'multimodal affective fusion between the chosen input modalities (text and speech)'. The Emotional Model acts both as an affective fusion module and as an emotional representation.
It operates on two distinct linguistic modalities: affective speech parameters (from the acoustic signal of the user's utterance) and sentiment analysis (affective tagging of the ASR transcript of that utterance). The resulting value is an emotional category attached to a specific part of the utterance instantiating an Information Extraction (IE) template.
The main purpose of the Emotional Model is to provide the instantaneous emotion associated with an utterance, but it can also serve as a basis for temporal integration (mood representation) as part of the affective content of the User Model (UM).
Affective fusion is dictated by the contents of each modality. Emotional speech recognition (based on the EmoVoice™ system) produces emotional categories corresponding to specific quadrants of an arousal/valence surface, whilst Sentiment Analysis (SA) tags utterance segments with a valence category (POS vs NEG).
Affective fusion is rule-based and outputs an EmoVoice category from the set {Neg.Active, Neg.Passive, Neutral, Pos.Passive, Pos.Active}. The output is the original speech analysis result, modified by the valence information from Sentiment Analysis (SA): the SA valence is considered more reliable and overrides an opposite valence from speech. The procedure nevertheless supports experimentation with these rules once error data (confusion matrices) become available.
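
The override rule can be illustrated with a minimal sketch. It is not the system's actual implementation. It assumes that each non-neutral category decomposes into a valence and an arousal component, that an override flips the valence while keeping the arousal from speech, and that a Neutral speech result is left unchanged because it has no opposite valence. The text does not specify any of these details. The names `CATEGORIES` and `fuse` are hypothetical.

```python
from typing import Optional

# Category labels as listed in the text. The (valence, arousal) decomposition
# is an assumption made for this sketch.
CATEGORIES = {
    "Neg.Active": ("NEG", "Active"),
    "Neg.Passive": ("NEG", "Passive"),
    "Neutral": (None, None),
    "Pos.Passive": ("POS", "Passive"),
    "Pos.Active": ("POS", "Active"),
}


def fuse(speech_category: str, sa_valence: Optional[str]) -> str:
    """Return the fused category for one utterance segment.

    speech_category: category produced by emotional speech recognition.
    sa_valence: "POS", "NEG", or None when no sentiment tag is available.
    """
    if speech_category not in CATEGORIES:
        raise ValueError(f"unknown speech category: {speech_category!r}")
    if sa_valence not in ("POS", "NEG", None):
        raise ValueError(f"unknown SA valence: {sa_valence!r}")

    speech_valence, arousal = CATEGORIES[speech_category]

    # No sentiment tag, a neutral speech result, or agreement between the
    # modalities: keep the speech category (assumed behaviour).
    if sa_valence is None or speech_valence is None or speech_valence == sa_valence:
        return speech_category

    # Opposite valences: SA is treated as more reliable, so its valence
    # replaces the speech valence; the arousal level is kept (assumption).
    prefix = "Pos" if sa_valence == "POS" else "Neg"
    return f"{prefix}.{arousal}"
```

Under these assumptions, `fuse("Neg.Active", "POS")` yields `Pos.Active`, whereas `fuse("Pos.Passive", "POS")` returns the speech category unchanged. Keeping the rules in one small function makes them easy to swap out once confusion matrices are available.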


