Text-video retrieval method based on enhanced self-attention and multi-task learning
Abstract
The explosive growth of videos on the Internet makes it a great challenge to retrieve the videos we need using text queries. The general approach to text-video retrieval is to project texts and videos into a common semantic space and compute similarity scores there. The key problems for a retrieval model are how to obtain strong feature representations of text and video and how to bridge the semantic gap between the two modalities. Moreover, most existing methods do not consider the strong consistency of text-video positive sample pairs. Considering these problems, we propose a text-video retrieval method based on enhanced self-attention and multi-task learning in this paper. First, during encoding, the extracted text feature vectors and video feature vectors are fed into a Transformer based on an enhanced self-attention mechanism for encoding and fusion. Then the text representations and video representations are projected into a common semantic space. Finally, by introducing multi-task learning in the common semantic space, our approach combines a semantic similarity measurement task with a semantic consistency judgment task to optimize the common space through semantic consistency constraints. Our method achieves better retrieval performance than several existing approaches on the MSR-Video to Text (MSRVTT), Large Scale Movie Description Challenge (LSMDC), and ActivityNet datasets, which demonstrates the effectiveness of our proposed strategies.
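The following is a minimal, illustrative sketch of the multi-task objective outlined above, assuming a PyTorch implementation. All module names, dimensions, and the specific loss forms (a bidirectional max-margin ranking loss for similarity measurement and a binary classifier for consistency judgment) are hypothetical choices for illustration, not the paper's exact formulation.

```python
# Sketch: projection of text/video features into a common space, plus a
# combined similarity-measurement and consistency-judgment loss.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CommonSpaceModel(nn.Module):
    def __init__(self, text_dim=768, video_dim=1024, common_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, common_dim)    # text -> common space
        self.video_proj = nn.Linear(video_dim, common_dim)  # video -> common space
        # Consistency head: judges whether a (text, video) pair is semantically consistent.
        self.consistency_head = nn.Sequential(
            nn.Linear(2 * common_dim, common_dim), nn.ReLU(), nn.Linear(common_dim, 1)
        )

    def forward(self, text_feat, video_feat):
        # L2-normalized embeddings so the dot product equals cosine similarity.
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        return t, v

    def multi_task_loss(self, t, v, margin=0.2, lam=1.0):
        """Ranking loss (similarity measurement) + binary loss (consistency judgment).
        Assumes (t[i], v[i]) are positive pairs; other in-batch combinations are negatives."""
        sim = t @ v.t()                              # cosine similarities, shape (B, B)
        pos = sim.diag().unsqueeze(1)                # positive-pair scores, shape (B, 1)
        mask = 1 - torch.eye(sim.size(0), device=sim.device)
        # Bidirectional max-margin ranking loss over in-batch negatives.
        loss_t2v = (F.relu(margin + sim - pos) * mask).mean()
        loss_v2t = (F.relu(margin + sim.t() - pos) * mask).mean()
        # Consistency judgment: matched pairs labeled 1, shuffled (mismatched) pairs labeled 0.
        neg_v = v.roll(shifts=1, dims=0)
        pairs = torch.cat([torch.cat([t, v], dim=-1), torch.cat([t, neg_v], dim=-1)])
        labels = torch.cat([torch.ones(len(t)), torch.zeros(len(t))]).to(t.device)
        logits = self.consistency_head(pairs).squeeze(-1)
        loss_cons = F.binary_cross_entropy_with_logits(logits, labels)
        return loss_t2v + loss_v2t + lam * loss_cons


# Usage on dummy features (batch of 8 text/video feature vectors):
model = CommonSpaceModel()
t, v = model(torch.randn(8, 768), torch.randn(8, 1024))
loss = model.multi_task_loss(t, v)
```

Under this kind of formulation, the ranking term drives matched text-video pairs to score higher than mismatched ones, while the consistency term acts as an additional constraint that pushes the common space to separate consistent from inconsistent pairs.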