Video instance-level human parsing enables applications such as background replacement, adding decorations, and scaling human body parts. In this paper, we unify the spatial features of the current frame and the temporal features of the previous k frames into a single network, and propose a Multi Frame Propagation Net (MFPNet) for this task. The main contributions are as follows. First, we propose two blocks: Position-Squeeze-and-Excitation (P-SE) and the Global Attention Module (GAM). P-SE applies the idea of Squeeze-and-Excitation (SE) to spatial locations, learning a spatial attention map that represents how strongly body parts are correlated. GAM combines SE and P-SE to extract globally structured features. Second, we propose a propagation module to capture temporal features across video frames. The module consists of a 3D convolution and a Convolutional Gated Recurrent Unit (ConvGRU): the 3D convolution extracts spatiotemporal features from consecutive frames, and the ConvGRU further aggregates temporal information. Third, MFPNet achieves state-of-the-art results on the Video Instance-level Parsing (VIP) dataset.
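To make the attention blocks concrete, here is a minimal PyTorch sketch of one way P-SE and GAM could be realized. The abstract only states that P-SE applies SE over spatial locations and that GAM combines SE and P-SE, so the layer sizes, the 1x1-convolution squeeze, and the additive fusion below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Channel Squeeze-and-Excitation: pool spatially, re-weight channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # channel-wise re-weighting

class PSE(nn.Module):
    """Position-SE (assumed form): squeeze channels into one spatial attention map."""
    def __init__(self, channels):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        attn = torch.sigmoid(self.squeeze(x))  # (B, 1, H, W) spatial map
        return x * attn  # broadcast over all channels

class GAM(nn.Module):
    """Global Attention Module: SE and P-SE combined (additive fusion assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.se = SE(channels)
        self.pse = PSE(channels)

    def forward(self, x):
        return self.se(x) + self.pse(x)
```

For example, `GAM(64)(torch.randn(2, 64, 32, 32))` returns a tensor of the same shape, with channels re-weighted by SE and spatial positions re-weighted by P-SE.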
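In the same spirit, the following is a sketch of the propagation module, assuming the 3D convolution runs over the stacked features of the previous k frames and its per-frame outputs are then stepped through a single-layer ConvGRU; the kernel sizes, channel widths, and this exact wiring are hypothetical.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell over 2D feature maps."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        p = kernel_size // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, kernel_size, padding=p)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, kernel_size, padding=p)
        self.hid_ch = hid_ch

    def forward(self, x, h=None):
        if h is None:
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * n  # gated update of the hidden state

class PropagationModule(nn.Module):
    """Assumed wiring: 3D conv over k frames, then a ConvGRU stepped through time."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.conv3d = nn.Conv3d(in_ch, hid_ch, kernel_size=3, padding=1)
        self.gru = ConvGRUCell(hid_ch, hid_ch)

    def forward(self, frames):
        # frames: (B, C, k, H, W) features of the previous k frames
        x = torch.relu(self.conv3d(frames))  # spatiotemporal features, same k
        h = None
        for t in range(x.size(2)):
            h = self.gru(x[:, :, t], h)  # aggregate temporal information
        return h  # (B, hid_ch, H, W) propagated temporal feature
```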
Keywords: Artificial Intelligence; Video Instance-level Parsing; Global Attention Module