|
Recognizing the emotion embedded in a video provides another way to classify media and to supply the videos that users actually want; effective techniques for video emotion recognition are therefore in high demand. This paper proposes a novel framework for video emotion recognition that integrates textual features extracted from video subtitles with the audio and visual features embedded in the video content. First, high-level dialogic semantic features are extracted from video subtitles via Natural Language Processing (NLP) techniques. These semantic features can represent emotion information by analyzing the concepts in video dialogs rather than simply analyzing individual words. It is also more practical to extract such high-level features from a large number of videos than to collect physiological signals from participants for implicit tagging. Second, a multimodal Deep Boltzmann Machine (DBM) is adopted to learn a joint representation from the audio, visual, and textual semantic features. Since dialogs or subtitles may be absent from some videos, the model can also infer the joint representation without the textual modality. Finally, the joint representations are fed into a Support Vector Machine (SVM) for video emotion classification and regression. Experimental results on an open database demonstrate the effectiveness of the framework.
|
Keywords: Affective computing; video emotion recognition; dialogic semantics; multimodal DBM
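
As a rough illustration of the fusion-and-classification stage described above, the sketch below is a minimal approximation, not the paper's implementation: the true model is a multimodal DBM, whereas here each modality is encoded with a single BernoulliRBM layer and the hidden activations are concatenated as a crude stand-in for the joint representation before SVM classification. All feature dimensions, the number of emotion classes, and the data itself are synthetic placeholders.

```python
# Sketch of multimodal fusion + SVM classification. A single RBM per
# modality approximates the multimodal DBM (assumption, not the paper's
# architecture); features and labels are synthetic.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_videos = 200

# Placeholder per-modality features; real ones would come from audio,
# visual, and subtitle-NLP pipelines.
audio = rng.random((n_videos, 64))
visual = rng.random((n_videos, 128))
text = rng.random((n_videos, 32))
labels = rng.integers(0, 4, size=n_videos)  # e.g., 4 emotion classes

def modality_codes(X, n_hidden):
    """Scale features to [0, 1] and encode them with one RBM layer."""
    X01 = MinMaxScaler().fit_transform(X)
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                       n_iter=20, random_state=0)
    return rbm.fit_transform(X01)

# Crude "joint representation": concatenated per-modality hidden codes.
joint = np.hstack([modality_codes(audio, 32),
                   modality_codes(visual, 32),
                   modality_codes(text, 16)])

clf = SVC(kernel="rbf").fit(joint, labels)
print("train accuracy:", clf.score(joint, labels))
```

A genuine multimodal DBM would additionally learn a shared top layer over all modality pathways, which is what lets the paper's model infer the joint representation when the textual pathway is missing; the concatenation shortcut above has no such inference mechanism.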
|