Yu, L, Huang, Z, Shen, F, Song, J, Shen, HT & Zhou, X 2017, 'Bilinear Optimized Product Quantization for Scalable Visual Content Analysis', IEEE Transactions on Image Processing, vol. 26, no. 10, pp. 5057-5069.
Product quantization (PQ) has been recognized as a useful technique for encoding visual feature vectors into compact codes, reducing both storage and computation costs. Recent advances in retrieval and vision tasks indicate that high-dimensional descriptors are critical to ensuring high accuracy on large-scale data sets. However, optimizing PQ codes for high-dimensional data is extremely time- and memory-consuming. To solve this problem, we present a novel PQ method based on bilinear projection, which exploits the natural data structure and reduces computational complexity. Specifically, we learn a global bilinear projection for PQ, for which we provide both non-parametric and parametric solutions. The non-parametric solution requires no assumption about the data distribution. The parametric solution avoids the local optima caused by random initialization and enjoys a theoretical error bound. We further extend this approach by learning locally bilinear projections to fit the underlying data distributions. Extensive experiments show that our proposed method, dubbed bilinear optimized product quantization, achieves competitive retrieval and classification accuracy with significantly lower time and space complexity.
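The bilinear idea above is simple to make concrete: a d-dimensional descriptor is reshaped into a d1 × d2 matrix and rotated by two small matrices (cost O(d1²d2 + d1d2²)) instead of one full d × d rotation (cost O(d²)), after which standard PQ encoding applies. Below is a minimal sketch under assumed toy sizes; the random orthogonal projections and randomly sampled codebooks are illustrative stand-ins for the learned quantities in the paper.

```python
# Minimal sketch of PQ with a bilinear projection (assumed sizes).
import numpy as np

rng = np.random.default_rng(0)
d1, d2, n = 16, 32, 1000          # 512-d descriptors viewed as 16 x 32 matrices
M, K = 8, 256                     # M sub-quantizers, K centroids each

X = rng.standard_normal((n, d1, d2))

# Bilinear projection Y = R1^T X R2: two small rotations instead of one 512 x 512.
R1, _ = np.linalg.qr(rng.standard_normal((d1, d1)))
R2, _ = np.linalg.qr(rng.standard_normal((d2, d2)))
Y = np.einsum('pi,npq,qj->nij', R1, X, R2).reshape(n, -1)

# Standard PQ: split into M sub-vectors and quantize each independently.
codes = np.empty((n, M), dtype=np.uint8)
for m, sub in enumerate(np.split(Y, M, axis=1)):
    C = sub[rng.choice(n, K, replace=False)]     # stand-in codebook (normally k-means)
    dists = (sub**2).sum(1)[:, None] + (C**2).sum(1)[None, :] - 2 * sub @ C.T
    codes[:, m] = dists.argmin(1)
print(codes.shape)                # (1000, 8): 8 bytes per 512-d vector
```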
Zhu, X, Li, X, Zhang, S, Xu, Z, Yu, L & Wang, C 2017, 'Graph PCA Hashing for Similarity Search', IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2033-2044.
This paper proposes a new hashing framework that conducts similarity search via the following steps: first, employing linear clustering methods to obtain a set of representative data points and a set of landmarks from the big dataset; second, using the landmarks to generate a probability representation for each data point, a representation we prove to preserve each data point's neighborhood; third, integrating PCA with manifold learning to learn the hash functions from the probability representations of all representative data points. As a consequence, the proposed hashing method achieves efficient similarity search (with linear time complexity), effective hashing performance, and high generalization ability, simultaneously preserving two complementary kinds of similarity structure: local structures via manifold learning and global structures via PCA. Experimental results on four public datasets clearly demonstrate the advantages of our proposed method for similarity search compared with state-of-the-art hashing methods.
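The abstract's pipeline can be sketched end to end: landmarks, a row-normalized affinity ("probability") representation, PCA, and sign binarization. The Gaussian kernel, the s-nearest-landmark truncation, and the random landmark selection below are assumptions for illustration, not the authors' exact construction.

```python
# Rough sketch: landmark-based probability representation + PCA hashing.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 64))        # toy dataset
m, s, bits = 50, 5, 16                     # landmarks, nearest landmarks, code length

landmarks = X[rng.choice(len(X), m, replace=False)]   # stand-in for clustering

# Gaussian affinity to the s nearest landmarks; rows normalized to sum to one,
# so each point is represented by a probability vector over landmarks.
D = ((X[:, None] - landmarks[None]) ** 2).sum(-1)     # squared distances, n x m
rows = np.arange(len(X))[:, None]
idx = np.argsort(D, axis=1)[:, :s]
P = np.zeros((len(X), m))
P[rows, idx] = np.exp(-D[rows, idx] / D.mean())
P /= P.sum(1, keepdims=True)

# PCA on the probability representations, then binarize the projections.
Pc = P - P.mean(0)
_, _, Vt = np.linalg.svd(Pc, full_matrices=False)
H = (Pc @ Vt[:bits].T > 0).astype(np.uint8)
print(H.shape)                             # (2000, 16) binary codes
```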
Yu, L, Yang, Y, Huang, Z, Wang, P, Song, J & Shen, HT 2016, 'Web video event recognition by semantic analysis from ubiquitous documents', IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5689-5701.
In recent years, the task of event recognition from videos has attracted increasing interest in the multimedia community. While most existing research has focused on exploring visual cues to handle relatively small-granular events, it is difficult to analyze video content directly without any prior knowledge. Synthesizing visual and semantic analysis is therefore a natural way to approach video event understanding. In this paper, we study the problem of Web video event recognition, where Web videos often describe large-granular events and carry limited textual information. Key challenges include how to accurately represent event semantics from incomplete textual information and how to effectively explore the correlation between visual and textual cues for video event understanding. We propose a novel framework for complex event recognition from Web videos. To compensate for the insufficient expressive power of visual cues, we construct an event knowledge base by deeply mining semantic information from ubiquitous Web documents; this knowledge base describes each event with comprehensive semantics, so the textual cues for a video can be significantly enriched. Furthermore, we introduce a two-view adaptive regression model, which exploits the intrinsic correlation between the visual and textual cues of videos to learn reliable classifiers. Extensive experiments on two real-world video data sets show the effectiveness of our proposed framework and prove that the event knowledge base indeed helps improve the performance of Web video event recognition.
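As a rough illustration of what a two-view regression objective can look like, the sketch below fits per-view linear scorers on visual and textual features with an agreement penalty that couples the two views; the closed-form alternating solves and the penalty form are assumptions, not the paper's formulation.

```python
# Hedged sketch: two-view ridge regression with a view-agreement penalty.
import numpy as np

rng = np.random.default_rng(3)
n, dv, dt = 200, 100, 50
Xv, Xt = rng.standard_normal((n, dv)), rng.standard_normal((n, dt))
y = rng.choice([-1.0, 1.0], n)
lam, gamma = 1.0, 0.5                      # ridge and view-agreement weights

# Alternate ridge solves; the gamma term pulls the two views' scores together:
# min ||Xv wv - y||^2 + ||Xt wt - y||^2 + gamma ||Xv wv - Xt wt||^2 + lam (||wv||^2 + ||wt||^2)
wv, wt = np.zeros(dv), np.zeros(dt)
for _ in range(20):
    wv = np.linalg.solve(Xv.T @ Xv * (1 + gamma) + lam * np.eye(dv),
                         Xv.T @ (y + gamma * (Xt @ wt)))
    wt = np.linalg.solve(Xt.T @ Xt * (1 + gamma) + lam * np.eye(dt),
                         Xt.T @ (y + gamma * (Xv @ wv)))
score = 0.5 * (Xv @ wv + Xt @ wt)          # fused prediction from both views
print(np.mean(np.sign(score) == y))
```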
Yu, L, Huang, Z, Cao, J & Shen, HT 2016, 'Scalable video event retrieval by visual state binary embedding', IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1590-1603.
With the exponential growth of media data on the Web, fast media retrieval has become a significant research topic in multimedia content analysis. Among the variety of techniques, learning binary embedding (hashing) functions is one of the most popular approaches to scalable information retrieval in large databases, and it is mainly used for near-duplicate multimedia search. To date, however, most hashing methods are designed specifically for near-duplicate retrieval at the visual level rather than the semantic level. In this paper, we propose a visual state binary embedding (VSBE) model that encodes video frames while preserving their essential semantic information in binary matrices, to facilitate fast video event retrieval in unconstrained cases. Compared with other video binary embedding models, one advantage of our proposed VSBE model is that it needs only a limited number of key frames from the training videos for hash function training, so the computational complexity of the training phase is much lower. At the same time, we apply pairwise constraints generated from the visual states to sketch the local properties of the events at the semantic level, so accuracy is also ensured. Extensive experiments on the challenging TRECVID MED dataset prove the superiority of our proposed VSBE model.
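The retrieval side shared by binary embedding methods of this kind is easy to sketch: once frames are hashed, search reduces to XOR-and-popcount Hamming distances on packed codes. The random-hyperplane hash below is a stand-in for the learned VSBE function.

```python
# Minimal sketch of Hamming-distance retrieval over packed binary codes.
import numpy as np

rng = np.random.default_rng(2)
d, bits = 128, 64
W = rng.standard_normal((d, bits))               # stand-in for the learned hash

def encode(F):
    """Map real-valued frame features (n, d) to packed binary codes (n, bits/8)."""
    return np.packbits(F @ W > 0, axis=1)

db = encode(rng.standard_normal((10000, d)))     # database of frame codes
q = encode(rng.standard_normal((1, d)))          # query code

# Hamming distance via XOR + bit count, vectorized over the database.
ham = np.unpackbits(db ^ q, axis=1).sum(1)
print(ham.argsort()[:5])                         # indices of the 5 nearest codes
```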
The task of multimedia event detection (MED) is to train a set of models that automatically detect the most event-relevant videos in large datasets. In this paper, we build a robust spatial-temporal deep neural network for large-scale video event detection. In our setting, each video follows a multiple-instance assumption, where its visual segments contain both spatial and temporal properties of events. Regarding these properties, we implement the MED system with a two-step training procedure: unsupervised recurrent video reconstruction followed by supervised fine-tuning. Extensive experiments on the challenging TRECVID MED14 dataset indicate that, by considering both spatial and temporal information, detection performance can be further boosted over state-of-the-art MED models.
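A compressed sketch of the two-step recipe described above, with an LSTM encoder-decoder for unsupervised segment-sequence reconstruction followed by supervised fine-tuning; the layer sizes, mean-pooling, and loss choices are illustrative assumptions.

```python
# Sketch: (1) unsupervised recurrent reconstruction, (2) supervised fine-tuning.
import torch
import torch.nn as nn

feat, hid, n_events = 512, 256, 20
enc = nn.LSTM(feat, hid, batch_first=True)   # encodes a video's segment sequence
dec = nn.LSTM(hid, feat, batch_first=True)   # reconstructs segments from codes
clf = nn.Linear(hid, n_events)

x = torch.randn(8, 10, feat)                 # 8 videos x 10 segments x 512-d

# Step 1: reconstruct the segment sequence from the encoder states.
h, _ = enc(x)
recon, _ = dec(h)
pretrain_loss = nn.functional.mse_loss(recon, x)

# Step 2: fine-tune with event labels, pooling encoder states over segments
# (multiple-instance flavor: one video label supervises all its segments).
labels = torch.randint(0, n_events, (8,))
logits = clf(h.mean(dim=1))
finetune_loss = nn.functional.cross_entropy(logits, labels)
print(pretrain_loss.item(), finetune_loss.item())
```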
Yu, L, Shao, J, Xu, XS & Shen, HT 2014, 'Max-margin adaptive model for complex video pattern recognition', Multimedia Tools and Applications, vol. 74, no. 2, pp. 505-521.
Pattern recognition models are used in a variety of applications ranging from video concept annotation to event detection. In this paper we propose a new framework, the max-margin adaptive (MMA) model, for complex video pattern recognition, which can utilize a large number of unlabeled videos to assist model training. The MMA model enforces, from a statistical perspective, consistency between the distributions of labeled training videos and unlabeled auxiliary ones by learning an optimal mapping function, while broadening the margin between positively and negatively labeled videos to improve the model's robustness. Experiments are conducted on two public datasets: CCV for video object/event detection and HMDB for action recognition. Our results demonstrate that the proposed MMA model is very effective on complex video pattern recognition tasks and outperforms state-of-the-art algorithms.
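In the spirit of the abstract (without claiming the paper's exact objective), the sketch below learns a linear scorer that combines a hinge loss on labeled videos with a penalty keeping the projected labeled and auxiliary distributions close, here measured by a simple mean discrepancy.

```python
# Loose sketch: max-margin classifier + distribution-consistency penalty.
import numpy as np

rng = np.random.default_rng(4)
d, n, m = 50, 300, 500
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], n)
U = rng.standard_normal((m, d)) + 0.3      # unlabeled auxiliary videos, shifted

w = np.zeros(d)
lr, lam, mu = 0.01, 0.1, 1.0
gap = X.mean(0) - U.mean(0)                # mean gap between the two populations
for _ in range(200):
    margin = y * (X @ w)
    viol = margin < 1                      # points inside the margin
    # hinge subgradient + ridge + gradient of (mu/2) * (gap . w)^2,
    # which discourages w from separating labeled and auxiliary means.
    g = -(y[viol, None] * X[viol]).sum(0) / n + lam * w + mu * (gap @ w) * gap
    w -= lr * g
print((np.sign(X @ w) == y).mean())        # training accuracy of the scorer
```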
Zhang, Q, Yu, L & Long, G 2015, 'SocialTrail: Recommending Social Trajectories from Location-Based Social Networks', Databases Theory and Applications (LNCS), Australasian Database Conference, Springer International Publishing, Melbourne, VIC, Australia, pp. 314-317.
Trajectory recommendation plays an important role in travel planning. Most existing systems are designed for spot recommendation without an understanding of the overall trip, and they tend to use homogeneous data only (e.g., geo-tagged images). Furthermore, they focus on the popularity of locations and fail to consider other important factors such as traveling time and visiting sequence. In this paper, we propose a novel system that not only integrates geo-tagged images and check-in data to discover meaningful social trajectories that enrich travel information, but also takes both temporal and spatial factors into consideration to make trajectory recommendations more accurate.