
Audio Visual Speech Recognition with Multimodal Recurrent Neural Networks
Weijiang Feng, Naiyang Guan, Yuan Li, Xiang Zhang, Zhigang Luo *
School of Computer, University of Defense Technology, Changsha 410073
*Corresponding author
Funding: National Natural Science Foundation of China (No.61502515), SRFDP (No.20134307110017)
Opened online: 12 May 2017
Accepted by: none
Citation: Weijiang Feng, Naiyang Guan, Yuan Li. Audio Visual Speech Recognition with Multimodal Recurrent Neural Networks [OL]. [12 May 2017]. http://en.paper.edu.cn/en_releasepaper/content/4732586
Studies on modern human-machine interfaces have demonstrated that visual information can enhance speech recognition accuracy, especially in noisy environments. Deep learning has been widely used to tackle the audio visual speech recognition (AVSR) problem owing to its astonishing achievements in both speech recognition and image recognition. Although existing deep learning models succeed in incorporating visual information into speech recognition, none of them simultaneously considers the sequential characteristics of both the audio and visual modalities. To overcome this deficiency, we propose a multimodal recurrent neural network (multimodal RNN) model that takes the sequential characteristics of both modalities into account for AVSR. In particular, the multimodal RNN comprises three components, i.e., an audio part, a visual part, and a fusion part, where the audio part and the visual part capture the sequential characteristics of the audio and visual modalities, respectively, and the fusion part combines the outputs of both modalities. Specifically, we model the audio modality with an LSTM RNN, model the visual modality with a convolutional neural network (CNN) followed by an LSTM RNN, and combine both models with a multimodal layer in the fusion part. We validate the effectiveness of the proposed multimodal RNN model on a multi-speaker AVSR benchmark dataset termed AVletters. The experimental results show performance improvements over the highest previously reported audio visual recognition accuracies on AVletters, and confirm the robustness of our multimodal RNN model.
Keywords: computer application; deep learning; multimodal learning; recurrent neural networks; LSTM
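The three-part architecture described in the abstract (audio part, visual part, fusion part) can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation: vanilla tanh RNN cells stand in for the paper's LSTMs, the visual frames are assumed to be already encoded into feature vectors by a CNN, and all layer dimensions are hypothetical.

```python
import numpy as np

def rnn_encode(x_seq, W_in, W_rec, b):
    """Run a simple tanh RNN over a sequence and return the final hidden state.
    (The paper uses LSTM cells; a vanilla RNN is used here for brevity.)"""
    h = np.zeros(W_rec.shape[0])
    for x in x_seq:
        h = np.tanh(W_in @ x + W_rec @ h + b)
    return h

def multimodal_rnn(audio_seq, visual_seq, params):
    """Audio part + visual part + fusion part, mirroring the three-way split."""
    h_a = rnn_encode(audio_seq, *params["audio"])    # audio part
    h_v = rnn_encode(visual_seq, *params["visual"])  # visual part (CNN features assumed)
    # Fusion part: a multimodal layer combines both hidden states,
    # followed by a softmax over the letter classes.
    W_f, b_f = params["fusion"]
    z = W_f @ np.concatenate([h_a, h_v]) + b_f
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

def init(n_in, n_h):
    """Randomly initialize one RNN stream (input, recurrent, bias weights)."""
    return (rng.normal(0, 0.1, (n_h, n_in)),
            rng.normal(0, 0.1, (n_h, n_h)),
            np.zeros(n_h))

n_classes = 26  # AVletters: isolated spoken letters A-Z
params = {
    "audio":  init(13, 32),   # e.g. 13 MFCC features per audio frame (hypothetical)
    "visual": init(64, 32),   # CNN feature dimension per video frame (hypothetical)
    "fusion": (rng.normal(0, 0.1, (n_classes, 64)), np.zeros(n_classes)),
}

# One utterance: 20 audio frames, 10 video frames -> class probabilities
probs = multimodal_rnn(rng.normal(size=(20, 13)), rng.normal(size=(10, 64)), params)
```

Note that the two streams may have different frame rates (audio is typically sampled faster than video); each RNN consumes its own sequence independently, and only the final hidden states are fused.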
