Zhang, Z, Wu, Q, Wang, Y & Chen, F 2019, 'High-Quality Image Captioning with Fine-Grained and Semantic-Guided Visual Attention', IEEE Transactions on Multimedia, vol. 21, no. 7, pp. 1681-1693.
© 2019 IEEE. The soft-attention mechanism is regarded as one of the representative methods for image captioning. Based on the end-to-end Convolutional Neural Network (CNN)-Long Short Term Memory (LSTM) framework, the soft-attention mechanism attempts, for the first time, to link the semantic representation in the text (i.e., the caption) with the relevant visual information in the image. Motivated by this approach, several state-of-the-art attention methods have been proposed. However, due to the constraints of the CNN architecture, the given image is only segmented into a fixed-resolution grid at a coarse level. The visual feature extracted from each grid cell indiscriminately fuses all the objects and/or object portions inside it, and there is no semantic link between grid cells. In addition, large-area "stuff" (e.g., the sky or a beach) cannot be represented by the current methods. To address these problems, this paper proposes a new model based on the Fully Convolutional Network (FCN)-LSTM framework, which can generate an attention map at a fine-grained, grid-wise resolution. Moreover, the visual feature of each grid cell is contributed only by the principal object. By adopting grid-wise labels (i.e., semantic segmentation), the visual representations of different grid cells are correlated with each other. With the ability to attend to large-area "stuff", our method can further summarize an additional semantic context from the semantic labels, providing comprehensive context information to the language LSTM decoder. In this way, a mechanism of fine-grained and semantic-guided visual attention is created, which can accurately link the relevant visual information with each semantic meaning inside the text. Demonstrated by three experiments including both qualitative and quantitative analyses, our model generates captions of high quality, specifically with high levels of accuracy, completeness, and diversity. Moreover, our model significantly outperforms all other methods that use VGG-based ...
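For readers unfamiliar with the underlying mechanism, the sketch below illustrates generic grid-wise soft attention over encoder features conditioned on the decoder's LSTM state. It is a minimal example assuming PyTorch; the class name GridSoftAttention, the tensor shapes, and all dimensions are illustrative assumptions and do not reproduce the paper's FCN-LSTM architecture or its semantic-guided components.

```python
# Minimal sketch of grid-wise soft attention (assumed shapes, not the paper's exact model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridSoftAttention(nn.Module):
    """Attend over an H x W grid of visual features given an LSTM hidden state."""
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)      # project grid features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)                 # scalar score per grid cell

    def forward(self, grid_feats: torch.Tensor, hidden: torch.Tensor):
        # grid_feats: (batch, H*W, feat_dim); hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(grid_feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                       # (batch, H*W) raw scores
        alpha = F.softmax(e, dim=-1)                         # attention map over cells
        context = (alpha.unsqueeze(-1) * grid_feats).sum(1)  # weighted visual context
        return context, alpha
```

At each decoding step the context vector is fed to the language LSTM together with the previous word, while alpha can be reshaped to H x W to visualise where the model attends.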
Zhang, Z, Wang, Y, Wu, Q & Chen, F 2019, 'Visual Relationship Attention for Image Captioning', Proceedings of the International Joint Conference on Neural Networks.
© 2019 IEEE. Visual attention mechanisms have been broadly used by image captioning models to attend to related visual information dynamically, allowing fine-grained image understanding and reasoning. However, they are only designed to discover the region-level alignment between visual features and language features; the exploration of higher-level visual relationship information between image regions, which is rarely researched in recent works, is beyond their capabilities. To fill this gap, we propose a novel visual relationship attention model based on a parallel attention mechanism under learnt spatial constraints. It can extract relationship information from visual regions and language and then achieve relationship-level alignment between them. By combining visual relationship attention and visual region attention to attend to related visual relationships and regions, respectively, our image captioning model achieves state-of-the-art performance on the MSCOCO dataset. Both quantitative and qualitative analyses demonstrate that our novel visual relationship attention model captures related visual relationships and further improves caption quality.
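The sketch below shows one plausible way to lift region-level attention to relationship level: form pairwise subject-object features from region features and attend over the pairs conditioned on the decoder state. It is a hedged illustration assuming PyTorch; the pairwise concatenation, the scoring function, and the class name PairwiseRelationshipAttention are illustrative choices, not the paper's parallel-attention formulation or its learnt spatial constraints.

```python
# Hedged sketch of relationship-level attention over region pairs (illustrative design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseRelationshipAttention(nn.Module):
    def __init__(self, region_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.pair_proj = nn.Linear(2 * region_dim, attn_dim)  # subject-object pair feature
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)    # project decoder state
        self.score = nn.Linear(attn_dim, 1)                   # scalar score per pair

    def forward(self, regions: torch.Tensor, hidden: torch.Tensor):
        # regions: (batch, N, region_dim); hidden: (batch, hidden_dim)
        b, n, d = regions.shape
        subj = regions.unsqueeze(2).expand(b, n, n, d)          # subject of each ordered pair
        obj = regions.unsqueeze(1).expand(b, n, n, d)           # object of each ordered pair
        pairs = torch.cat([subj, obj], dim=-1).view(b, n * n, 2 * d)
        pair_feats = self.pair_proj(pairs)                      # (batch, N*N, attn_dim)
        e = self.score(torch.tanh(
            pair_feats + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                          # (batch, N*N) raw scores
        alpha = F.softmax(e, dim=-1)                            # weights over region pairs
        rel_context = (alpha.unsqueeze(-1) * pair_feats).sum(1) # relationship context vector
        return rel_context, alpha
```

In a captioning decoder, such a relationship context could be concatenated with a standard region-attention context at each step, mirroring the idea of attending to relationships and regions in parallel.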
Zhang, Z, Wu, Q, Wang, Y & Chen, F 2018, 'Size-Invariant Attention Accuracy Metric for Image Captioning with High-Resolution Residual Attention', 2018 Digital Image Computing: Techniques and Applications (DICTA), Digital Image Computing: Techniques and Applications, IEEE, Canberra, Australia, pp. 1-8.
Spatial visual attention mechanisms have achieved significant performance improvements for image captioning. To quantitatively evaluate the performance of attention mechanisms, the "attention correctness" metric has been proposed to calculate the sum of attention weights generated for ground-truth regions. However, this metric cannot consistently measure attention accuracy among regions with large size variance. Moreover, its evaluations are inconsistent with captioning performances across different fine-grained attention resolutions. To address these problems, this paper proposes a size-invariant evaluation metric that normalizes the "attention correctness" metric by the size percentage of the attended region. To demonstrate the efficiency of our size-invariant metric, this paper further proposes a high-resolution residual attention model that uses RefineNet as the Fully Convolutional Network (FCN) encoder. By using the COCO-Stuff dataset, we can achieve pixel-level evaluations on both object and "stuff" regions. We use our metric to evaluate the proposed attention model across four fine-grained resolutions (i.e., 27×27, 40×40, 60×60, 80×80). The results demonstrate that, compared with the "attention correctness" metric, our size-invariant metric is more consistent with the captioning performances and is more efficient for evaluating attention accuracy.
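The two metrics described above can be written down directly: "attention correctness" sums the attention weights falling inside the ground-truth region, and the size-invariant variant divides that sum by the region's area percentage. The numpy sketch below follows that description; the function names, the renormalisation of the attention map, and the epsilon terms are illustrative assumptions rather than the paper's exact definitions.

```python
# Minimal sketch of attention correctness and its size-invariant normalisation.
import numpy as np

def attention_correctness(attn_map: np.ndarray, gt_mask: np.ndarray) -> float:
    """Sum of (normalised) attention weights falling inside the ground-truth mask."""
    attn = attn_map / (attn_map.sum() + 1e-8)   # ensure the weights sum to 1
    return float(attn[gt_mask.astype(bool)].sum())

def size_invariant_correctness(attn_map: np.ndarray, gt_mask: np.ndarray) -> float:
    """Attention correctness divided by the ground-truth region's size percentage."""
    region_pct = gt_mask.astype(bool).mean()    # fraction of the image the region covers
    return attention_correctness(attn_map, gt_mask) / (region_pct + 1e-8)
```

For intuition, a uniform attention map scores higher under plain attention correctness for larger regions (the score equals the region's area fraction), whereas the size-invariant variant assigns it a score of about 1 regardless of region size.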
Zhang, Z, Wu, Q, Wang, Y & Chen, F 2018, 'Fine-grained and semantic-guided visual attention for image captioning', Proceedings - 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Winter Conference on Applications of Computer Vision, IEEE, Lake Tahoe, NV, USA, pp. 1709-1717.
© 2018 IEEE. Soft-attention is regarded as one of the representative methods for image captioning. Based on the end-to-end CNN-LSTM framework, it tries, for the first time, to link the relevant visual information in the image with the semantic representation in the text (i.e., the caption). In recent years, several state-of-the-art methods motivated by this approach have been published, which include more elegant fine-tuning operations. However, due to the constraints of the CNN architecture, the given image is only segmented into a fixed-resolution grid at a coarse level. The overall visual feature created for each grid cell indiscriminately fuses all the objects and/or object portions inside it. There is no semantic link among grid cells, although an object may be segmented into different grid cells. In addition, large-area "stuff" (e.g., sky and beach) cannot be represented by the current methods. To tackle the problems above, this paper proposes a new model based on the FCN-LSTM framework, which can segment the input image into a fine-grained grid. Moreover, the visual feature representing each grid cell is contributed only by the principal object, or its portion, in the corresponding cell. By adopting pixel-wise labels (i.e., semantic segmentation), the visual representations of different grid cells are correlated with each other. In this way, a mechanism of fine-grained and semantic-guided visual attention is created, which can better link the relevant visual information with each semantic meaning inside the text through the LSTM. Without using elegant fine-tuning, comprehensive experiments show consistently promising performance across different evaluation metrics.
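The distinctive step in this abstract is that each grid cell's feature is contributed only by the principal object in that cell, guided by pixel-wise semantic labels. The sketch below shows one straightforward way to realise such semantic-guided pooling: within each cell, take the majority class of the segmentation map and average only the pixels of that class. It is a minimal illustration in numpy; the function name, the fixed cell size, and the majority-vote rule are assumptions, not the paper's exact design.

```python
# Hedged sketch of semantic-guided per-cell pooling from pixel-level features and labels.
import numpy as np

def semantic_guided_cell_features(feats: np.ndarray, labels: np.ndarray, cell: int) -> np.ndarray:
    """feats: (H, W, C) pixel features; labels: (H, W) integer class ids; cell: cell side in pixels."""
    h, w, c = feats.shape
    gh, gw = h // cell, w // cell
    out = np.zeros((gh, gw, c), dtype=feats.dtype)
    for i in range(gh):
        for j in range(gw):
            patch_f = feats[i*cell:(i+1)*cell, j*cell:(j+1)*cell].reshape(-1, c)
            patch_l = labels[i*cell:(i+1)*cell, j*cell:(j+1)*cell].reshape(-1)
            principal = np.bincount(patch_l).argmax()      # majority (principal) class in the cell
            mask = patch_l == principal
            out[i, j] = patch_f[mask].mean(axis=0)         # average only principal-class pixels
    return out
```

The resulting (gh, gw, C) grid of features could then be flattened and fed to a soft-attention module such as the one sketched earlier, so that each attended cell carries a single, semantically coherent visual feature.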